Writing A Grep Clone In 9 Lines Of Python
2020-11-02 - By Robert Elder
Introduction
The purpose of this article is to show how you can build a basic 'grep' clone in 9 lines of Python source code or fewer. This simple clone of grep won't contain nearly as many features as the real version, and its output may have a few differences that will be documented below. The goal here is simply to provide insights into how grep works rather than to re-create a fully functional replacement.
Simple 'grep -P' Clone In 8 Lines Of Python
The following 8 lines of Python code show how you could write a simple clone of the 'grep' command that is close to the behaviour of grep when run with the '-P' flag:
# Put this in a file called '1.py'
import sys
import re
for line in sys.stdin:
    regex_pattern = sys.argv[1]
    pattern = re.compile(regex_pattern)
    if pattern.search(line):
        sys.stdout.write(line)
The code above will read in data from stdin one line at a time and perform a regex search on each line. The 'sys.argv[1]' part references the first argument that is passed to the Python script, which will then be treated as a regular expression. Here is an example of how you could use this Python script to search a file for a regex pattern like '[A-Z]N0':
cat example_data.csv | python 1.py "[A-Z]N0"
Which should output the following:
Sneakers, MN009, 49.99, 1.11
Shirt, MN089, 8.99, 1.44
Sneakers, KN09, 49.99, 1.11
Shoes, BN009, 449.22, 4.31
Which is the same output that we get from using grep:
cat example_data.csv | grep -P "[A-Z]N0"
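The script uses 'pattern.search' rather than 'pattern.match' because grep reports a line when the pattern matches anywhere in it, not only at the start. Here is a minimal sketch of the difference, using one line from the example data (the variable names 'anchored' and 'anywhere' are just for illustration):

```python
import re

line = "Sneakers, MN009, 49.99, 1.11\n"
pattern = re.compile("[A-Z]N0")

# match() only tries the pattern at the very start of the string.
anchored = pattern.match(line)

# search() scans the whole string, which is how grep treats each line.
anywhere = pattern.search(line)

print(anchored)           # None
print(anywhere.group(0))  # MN0
```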
Simple 'grep -Po' Clone In 9 Lines Of Python
The 'grep' command also has a feature that lets you extract only the matched part of the text. The following 9 lines of Python code show how you could write a simple clone of the 'grep' command that is close to the behaviour of grep when run with the '-Po' flag:
# Put this in a file called '2.py'
import sys
import re
for line in sys.stdin:
    regex_pattern = "(" + sys.argv[1] + ")"
    pattern = re.compile(regex_pattern)
    m = pattern.search(line)
    if m:
        sys.stdout.write(m.group(1) + "\n")
The code above works similarly to the first example, except this time, the regex is enclosed in parentheses. The parentheses create a capture group that lets us extract whatever text matched the regex using 'm.group(1)'. Here is an example of how you could use this Python script to extract all strings that match the regex pattern '[A-Z]N0[^,]*':
cat example_data.csv | python 2.py "[A-Z]N0[^,]*"
Which should output the following:
MN009
MN089
KN09
BN009
Which is the same output that we get from using grep:
cat example_data.csv | grep -Po "[A-Z]N0[^,]*"
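One difference worth knowing about: '2.py' prints only the first match on each line, while 'grep -o' prints every non-empty match on its own line. The example data has one match per line, so the outputs agree here. The sketch below is one possible way to get closer to grep's behaviour. It is not part of the original script, and the function name 'only_matching' is made up for this example. It also uses 'm.group(0)', which is always the whole match, so the extra parentheses are not needed.

```python
import re
import sys

def only_matching(regex_pattern, lines):
    """Yield every non-empty, non-overlapping match found in each line."""
    pattern = re.compile(regex_pattern)
    for line in lines:
        for m in pattern.finditer(line):
            # group(0) is the entire match; skip empty matches like grep -o does.
            if m.group(0):
                yield m.group(0)

# Only runs when a pattern is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for text in only_matching(sys.argv[1], sys.stdin):
        sys.stdout.write(text + "\n")
```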
Caveats
In the examples above, grep was used with the '-P' flag which causes grep to use 'Perl-Compatible Regular Expressions' (PCRE). PCRE regular expressions are slightly different from 'Basic Regular Expressions' (the default regex mode used by grep), and 'Extended Regular Expressions' (when grep is used with the -E flag). Also, not all versions of grep support the -P flag, and even when it is supported, there may be slight differences between grep's implementation and Python's implementation.
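Both of the examples above do unnecessary work because they call 're.compile' once for every line of input. Python's 're' module caches recently compiled patterns, so in practice most of the repeated cost is a cache lookup rather than a full recompile, but it is still wasted effort. This could be easily fixed by moving the regex compiling outside the for loop, but I've kept it the same for consistency with the video.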
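For reference, here is a sketch of the first script with the compile step moved out of the loop. The function name 'grep_lines' is only for illustration and does not appear in the original scripts:

```python
import re
import sys

def grep_lines(regex_pattern, lines):
    """Yield each line that contains a match for regex_pattern."""
    # Compile once, before the loop, instead of once per line.
    pattern = re.compile(regex_pattern)
    for line in lines:
        if pattern.search(line):
            yield line

# Only runs when a pattern is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.writelines(grep_lines(sys.argv[1], sys.stdin))
```

It is used the same way as '1.py', and should produce the same output for the same input.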
Another difference is that grep doesn't always just read from stdin. You can also use one or more files as arguments to grep. The example scripts shown in this article don't support this feature (but it wouldn't be that hard to add).
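As a rough idea of how file arguments could be added, the sketch below uses Python's standard 'fileinput' module, which reads the named files in order and falls back to stdin when the list of files is empty. The function name 'grep_files' and the file name '3.py' are assumptions made for this example. Unlike grep, this sketch does not prefix each line with the file name when more than one file is given.

```python
import fileinput
import re
import sys

def grep_files(regex_pattern, paths):
    """Yield matching lines from the files in paths, or from stdin if paths is empty."""
    pattern = re.compile(regex_pattern)
    # fileinput.input() treats an empty list of files as "read from stdin".
    with fileinput.input(files=paths) as lines:
        for line in lines:
            if pattern.search(line):
                yield line

# Only runs when a pattern is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # First argument is the pattern; any remaining arguments are file names.
    sys.stdout.writelines(grep_files(sys.argv[1], sys.argv[2:]))
```

If this were saved as '3.py', it could be run either as 'python 3.py "[A-Z]N0" example_data.csv' or with the data piped in on stdin as before.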
Finally, grep supports many other flags and features (too many to list here) and none of these cases are covered by the examples in this article.
References
Here is the 'example_data.csv' used in the examples above:
item, modelnumber, price, tax
Sneakers, MN009, 49.99, 1.11
Sneakers, MTG09, 139.99, 4.11
Shirt, MN089, 8.99, 1.44
Pants, N09, 39.99, 1.11
Sneakers, KN09, 49.99, 1.11
Shoes, BN009, 449.22, 4.31
Sneakers, dN099, 9.99, 1.22
Bananas, GG009, 4.99, 1.11