How To Become A 10x Engineer Using The Awk Command
2021-01-30 - By Robert Elder
Introduction
In this article, I will attempt to convince you that 'awk' is a great tool to learn for solving one-off text manipulation and analysis problems. After learning awk, you'll immediately be accepted into the "10x engineer's club" and be cured of your imposter syndrome. From now on, your typical team meetings will go something like this:
Manager: "Hey, let's make a well-informed data-driven decision about which product to focus on selling. Do we have any numbers on which product(s) make the most profit in total?"
All 1x Engineers in the room (speaking together in a synchronized monotone voice): "Nah, we'd have to write an ETL job to import the .csv files and write many Java Classes. It would take at least 4 weeks."
You: "I just spent 30 seconds writing an awk one-liner to calculate the profit for each product. Here is the list sorted from highest to lowest."
Manager: "Thanks. I value your expertise, and have a great appreciation for your contributions to this company."
What Is Awk And Why Is It Useful?
Awk is a command-line tool that can be used for processing text files line by line. The 'awk' command could be compared to the 'grep' or 'sed' commands. In fact, many of the basic use cases for 'grep' or 'sed' can be solved in a nearly identical way by awk. Therefore, an important question worth asking is: If some of the use cases for awk are replicated by 'sed' or 'grep', why would you ever want to use awk? The answer is that awk is a fully-fledged programming language. The 'grep' command is specifically designed for searching, and the 'sed' command is specifically designed for replacing. Awk is able to do both at the same time, but even more importantly, 'awk' gives you access to things that a 'real' programming language has, like math or logic operations, formatted printing, arrays, etc.
But, you may wonder, if 'awk' acts like a 'real' programming language, why not stick to another established 'real' programming language like Python? The answer is that 'awk' was specifically designed to serve in this middle-ground area of 'sort of command-line' and 'sort of programming language'. It's definitely possible to insert Python into the middle of a shell pipe (with python -c), but it quickly gets pretty messy. Awk, on the other hand, is much better suited for the task, and it includes a lot of default assumptions that make it very terse (but also intimidating to beginners).
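To make the meeting scenario from the introduction a bit more concrete, here is a minimal sketch of what such a one-liner could look like. The input layout (a CSV with 'product,revenue,cost' columns and a header row) and the function name 'profit_by_product' are assumptions made up for this illustration.

```shell
# Hypothetical input on stdin: CSV lines of the form product,revenue,cost
# with one header row. Sums (revenue - cost) per product in an awk array,
# then sorts the result from highest to lowest profit.
profit_by_product() {
    awk -F',' 'NR>1 { profit[$1] += $2 - $3 }
               END  { for (p in profit) printf("%s,%.2f\n", p, profit[p]) }' |
    sort -t',' -k2,2 -nr
}

printf 'product,revenue,cost\nwidget,10.00,4.00\ngadget,25.00,5.50\nwidget,12.00,4.00\n' | profit_by_product
```

The array 'profit' is indexed by product name, which is the kind of 'real programming language' feature that grep and sed don't offer. The order of a 'for (p in profit)' loop is not defined, which is why the sorting is left to the 'sort' command.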
Using Awk To Normalize Temperature Values
Here's a specific example of a problem that 'awk' can solve, but 'grep' or 'sed' can't. The following file 'temps.csv' contains a list of temperature values:
temp unit
26.1 C
78.1 F
23.1 C
25.7 C
76.3 F
77.3 F
24.2 C
79.3 F
27.9 C
75.1 F
25.9 C
79.0 F
Some of these temperature values are stated in Celsius, while others are stated in Fahrenheit. In a case like this, you might want to normalize these values to use the same unit so that you can graph them, compute an average, etc. You can do exactly that with the following 'awk' command:
awk 'NR==1; NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}' temps.csv
which gives the following output:
temp unit
26.1 C
25.6111 C
23.1 C
25.7 C
24.6111 C
25.1667 C
24.2 C
26.2778 C
27.9 C
23.9444 C
25.9 C
26.1111 C
To understand how this awk command works, let's work on re-building it from scratch. Here is a much simpler awk command that shows how it will automatically parse and split up the two columns for us so we can work with them separately:
awk '{print "First column item: " $1 " Second column item: " $2 }' temps.csv
which gives the following output:
First column item: temp Second column item: unit
First column item: 26.1 Second column item: C
First column item: 78.1 Second column item: F
First column item: 23.1 Second column item: C
First column item: 25.7 Second column item: C
First column item: 76.3 Second column item: F
First column item: 77.3 Second column item: F
First column item: 24.2 Second column item: C
First column item: 79.3 Second column item: F
First column item: 27.9 Second column item: C
First column item: 75.1 Second column item: F
First column item: 25.9 Second column item: C
First column item: 79.0 Second column item: F
Since the columns in the original file are separated by white-space, the awk command will automatically split up the 'columns' and make them accessible inside the '$1' and '$2' variables. As you can see, our last example simply prints these out without doing any kind of logic on them. This stuff inside the {...} characters is called the 'action' statement.
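Here is a small sketch that makes the automatic splitting more visible. Besides '$1' and '$2', it uses '$0' (the whole line), 'NF' (the number of fields on the line) and '$NF' (the last field). The function name 'show_fields' and the sample input are made up for this example.

```shell
# Print the raw line, the number of fields, and the first and last field.
# With the default separator, runs of spaces count as one separator and
# leading/trailing white-space does not create empty fields.
show_fields() {
    awk '{ print "line=[" $0 "] NF=" NF " first=" $1 " last=" $NF }'
}

printf '26.1 C\n  78.1    F  \n' | show_fields
```

Both lines report 'NF=2', even though the second one is padded with extra spaces.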
By default, awk will attempt to run every single 'action' statement on every line of the file. In our case, we don't actually want our special print statement to run on every line of the file. In particular, we don't want to make changes to the first line, since it's just the list of column headers. We can add the requirement that our special print statement should only occur on 'record number greater than 1', or 'NR>1'. 'NR' stands for 'Number of Records', and it holds the number of the record (by default, a line) that is currently being processed:
awk 'NR>1{print "First column item: " $1 " Second column item: " $2 }' temps.csv
But this won't print out the header at all. To print out the first line without changing it, we can add 'NR==1;' to our awk command. The 'NR==1' part means "Only do this on record #1". Since no action follows this pattern, awk falls back to its default action, which is to print out the current line without changing it. The ';' part simply separates this rule from the next one:
awk 'NR==1; NR>1{print "First column item: " $1 " Second column item: " $2 }' temps.csv
Here is the output from this command:
temp unit
First column item: 26.1 Second column item: C
First column item: 78.1 Second column item: F
First column item: 23.1 Second column item: C
First column item: 25.7 Second column item: C
First column item: 76.3 Second column item: F
First column item: 77.3 Second column item: F
First column item: 24.2 Second column item: C
First column item: 79.3 Second column item: F
First column item: 27.9 Second column item: C
First column item: 75.1 Second column item: F
First column item: 25.9 Second column item: C
First column item: 79.0 Second column item: F
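As a quick check of how 'NR==1' behaves when no action is attached to it, the following sketch compares the short form with a fully written-out version. The two function names are made up for this example.

```shell
# Both functions read lines on stdin and print only the first one.
# The first relies on the default action, the second spells it out.
header_implicit() { awk 'NR==1'; }
header_explicit() { awk 'NR==1 { print $0 }'; }

printf 'temp unit\n26.1 C\n' | header_implicit
printf 'temp unit\n26.1 C\n' | header_explicit
```

Both calls print the header line 'temp unit' and nothing else.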
And now we're almost back at the full command that we started off with. We just need to add the temperature normalizing part. The logic/math for normalizing to Celsius units would look something like this:
if($2=="F"){
print ($1-32) / 1.8 # deg F to deg C conversion formula.
}else{
print $1
}
or, we could re-write this using the ternary operator as the following:
print ($2=="F" ? ($1-32) / 1.8 : $1)
Substituting this back into our print statement gives us the full solution that was introduced previously:
awk 'NR==1; NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}' temps.csv
The final output that was shown before was not very nice to look at since most of the converted Fahrenheit values had way too many digits after the decimal place. To fix this, we can use the more sophisticated 'printf' to do a formatted print operation and make the output look even better:
awk 'NR==1; NR>1{printf("%.1f\t%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
which outputs the following:
temp unit
26.1 C
25.6 C
23.1 C
25.7 C
24.6 C
25.2 C
24.2 C
26.3 C
27.9 C
23.9 C
25.9 C
26.1 C
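The 'printf' format strings in awk follow the same conventions as in C, so you can also control column widths and alignment. Here is a small sketch with made-up room names and a hypothetical function name 'format_table':

```shell
# %-8s  : left-align a string in a column that is 8 characters wide.
# %6.2f : right-align a number in 6 characters with 2 decimal places.
format_table() {
    awk '{ printf("%-8s|%6.2f|\n", $1, $2) }'
}

printf 'kitchen 25.6111\nattic 7.5\n' | format_table
```

This prints 'kitchen | 25.61|' and 'attic   |  7.50|', so the columns line up regardless of how long the original values were.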
Specifying Different Field Separators
In order to split up the columns in each line, awk needs to know what the 'field separator' is. By default, awk splits on runs of white-space (spaces and tabs) and ignores leading and trailing white-space on each line. You can also explicitly specify the separator using the '-F' flag. For example, here is the same awk command we just used, but using a comma for the field separator (which assumes a comma-separated version of the input file):
awk -F',' 'NR==1; NR>1{printf("%.1f,%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
and here it is again with an explicitly specified tab character:
awk -F'\t' 'NR==1; NR>1{printf("%.1f\t%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
Field separators can even be regular expressions:
echo "col1-481-col2-981-col3" | awk -F'-[0-9]{3}-' '{print $1" "$2" "$3}'
outputs the following:
col1 col2 col3
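The '-F' flag only controls how the input is split. There is a matching variable for the output side called 'OFS' (output field separator), which can be set with the standard '-v' flag. Here is a minimal sketch that converts comma-separated input into tab-separated output. The function name 'csv_to_tsv' is made up, and the sketch assumes simple CSV with no quoted commas.

```shell
# Read comma-separated fields and write them back out separated by tabs.
# Assigning a field to itself ($1 = $1) forces awk to rebuild the line
# using OFS; without it, 'print' would output the original line as-is.
csv_to_tsv() {
    awk -F',' -v OFS='\t' '{ $1 = $1; print }'
}

printf 'temp,unit\n26.1,C\n' | csv_to_tsv
```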
Awk Is Extremely Difficult To Learn
False. Awk is extremely easy to learn. The reason that it seems so hard to learn is that awk has so many implicit defaults, but nobody ever seems to explain this fact.
You can think of every awk command as a collection of 'if statements' that run against every line in the file. The syntax of every awk command looks pretty close to something like this:
awk 'if(PATTERN1){...print something...} if(PATTERN2){...print something...} ...'
with the one exception being that the 'if' keyword is never actually written out since it's assumed to be there by default (if you do write it, you'll get a syntax error). Therefore, the overall syntax for every awk command (that doesn't rely on defaults) is pretty much this:
awk '(PATTERN1){...Action 1...} (PATTERN2){...Action 2...} ...'
In the above command, the 'PATTERN1' or 'PATTERN2' is the trigger you want to cause the stuff inside the '{' '}' characters to actually execute. Here are a few examples of commonly used patterns:
# Print out the second line:
echo -e "hello\nworld" | awk '(NR==2){print $0}'
will output:
world
# Print out any line that matches a regular expression (that just looks for an 'l' character):
# The '~' character has a special meaning here in relation to regular expression matching
echo -e "hello\nthere\nworld" | awk '($0 ~ /l/){print $0}'
will output:
hello
world
# If the item in the first column is longer than 5 characters, print out the item in the second column:
echo -e "acb def\nsomething else" | awk '(length($1) > 5){print $2}'
will output:
else
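Since a pattern is just an expression, several conditions can be combined with '&&', '||' and '!'. Here is a small sketch that prints only the Celsius readings above 25 degrees while skipping the header. The function name 'warm_celsius' and the threshold are made up for this example.

```shell
# Three conditions in one pattern: skip the header (NR>1), keep only
# Celsius rows ($2=="C"), and keep only values above 25 ($1>25).
warm_celsius() {
    awk 'NR>1 && $2=="C" && $1>25 { print $1 }'
}

printf 'temp unit\n26.1 C\n78.1 F\n23.1 C\n25.7 C\n' | warm_celsius
```

This prints '26.1' and '25.7'. The '78.1 F' row is numerically larger than 25, but it is rejected by the unit check.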
This provides some context on what you can do in the 'pattern' part, but what about the 'action' part? Well, you can use your imagination since that's where awk becomes a fully fledged programming language. Here is an example awk command that will iterate over every character in the 3rd column on the 4th line and print out each character on a different line (splitting on an empty separator, as done here, works in common implementations like gawk and mawk, but POSIX does not guarantee it):
echo -e "a a a a\na a a a\na a a a\na a hello_there a" | awk '(NR==4){
n_chrs = split($3, individual_characters, "")
for (i=1; i <= n_chrs; i++){
printf("Here is character number %d : %c\n", i, individual_characters[i]);
}
}'
will output:
Here is character number 1 : h
Here is character number 2 : e
Here is character number 3 : l
Here is character number 4 : l
Here is character number 5 : o
Here is character number 6 : _
Here is character number 7 : t
Here is character number 8 : h
Here is character number 9 : e
Here is character number 10 : r
Here is character number 11 : e
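Another 'real programming language' feature is that awk lets you define your own functions. As a sketch, here is the temperature normalizing command from earlier with the conversion formula moved into a function. The names 'normalize_temps' and 'f_to_c' are made up for this example.

```shell
# Same logic as the temperature one-liner, but with the conversion
# formula wrapped in a user-defined awk function.
normalize_temps() {
    awk '
    # Convert a Fahrenheit value to Celsius.
    function f_to_c(f) { return (f - 32) / 1.8 }
    # Pattern with no action: print the header line unchanged.
    NR==1
    NR>1 { printf("%.1f\tC\n", ($2=="F" ? f_to_c($1) : $1)) }
    '
}

printf 'temp unit\n78.1 F\n26.1 C\n' | normalize_temps
```

Function definitions sit next to the pattern/action rules in the awk program. Note that there must be no space between the function name and the opening parenthesis when you call it.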
Awk's Many Implicit Defaults
As previously mentioned, awk makes many implicit default assumptions. To illustrate them, let's do a few more examples of matching regular expressions against the following file 'animals.txt':
Rabbit
Bird
Dog
Pig
Lobster
Ape
Chicken
Lion
Pony
Fish
Cow
Cat
Horse
Deer
Turkey
Spider
Duck
Shark
Bear
Snake
Eagle
Bison
Monkey
Dolphin
Let's use awk to perform a simple regular expression search that will print out any lines that end with the letter 'e'. We can do this with the following awk command (note the use of the '~' operator here which means to match a string against a regex, and not to compare for equality):
awk '($0 ~ /e$/){print $0;}' animals.txt
which outputs the following:
Ape
Horse
Snake
Eagle
But, the parentheses on the PATTERN are optional, so we can do this:
awk '$0 ~ /e$/{print $0;}' animals.txt
But, we don't actually need to specify the '$0' part (the variable that denotes the current entire line). If you write a regular expression by itself, it will be assumed that you're comparing it against the contents of the current line. Therefore, we can do this:
awk '/e$/{print $0;}' animals.txt
But, if we're printing out the entire line, we don't actually need to say 'print $0', we can just say 'print;' and it will assume that we want to print out the current line:
awk '/e$/{print;}' animals.txt
But, we don't even need to specify the action at all since it's optional! In cases where the action is missing, the assumption is to print out the entire line, so we can just do this:
awk '/e$/' animals.txt
Now you can go around sharing variations of the above extremely terse awk command to beginners without explaining to them how it actually works. They'll think you're a wizard.
At this point, we've simplified awk to the point where it does pretty much the same thing that grep does with the '-E' flag:
# Extended Regular Expression Search In Grep:
grep -E 'THE_REGEX' animals.txt
# Extended Regular Expression Search In Awk:
awk '/THE_REGEX/' animals.txt
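If you want to convince yourself of the equivalence, here is a small sketch that runs the same regular expression through both tools and compares the results. The shortened animal list and the variable names are made up for this example. Note that in awk the regex has to be wrapped in '/' characters.

```shell
# Run the same extended regular expression through grep and awk.
animals='Ape
Bird
Horse
Cow'

with_grep=$(printf '%s\n' "$animals" | grep -E 'e$')
with_awk=$(printf '%s\n' "$animals" | awk '/e$/')

# Report whether both tools selected the same lines.
if [ "$with_grep" = "$with_awk" ]; then
    echo "same output"
else
    echo "different output"
fi
```

The inverse search works in a similar way: "grep -v -E 'e$'" corresponds to "awk '!/e$/'".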
Using Awk As A Replacement For Sed
I said before that you could use awk as a replacement for sed if you wanted to. This replacement isn't quite as elegant as it is with grep, and I can't think of a good reason to do this when sed works just as well. Having said that, here is an example awk command that will replace the 'e' character at the end of a line with five 'z' characters:
awk '{ gsub(/e$/, "zzzzz"); print}' animals.txt
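For comparison, here is a sketch of the same replacement done with both tools. The function names and the shortened input are made up for this example. It uses 'sub' (replace the first match) instead of 'gsub' (replace every match), which makes no difference here because a pattern anchored with '$' can match at most once per line.

```shell
# Replace a trailing 'e' with 'zzzzz', once with awk and once with sed.
replace_with_awk() { awk '{ sub(/e$/, "zzzzz"); print }'; }
replace_with_sed() { sed 's/e$/zzzzz/'; }

printf 'Ape\nBird\nEagle\n' | replace_with_awk
printf 'Ape\nBird\nEagle\n' | replace_with_sed
```

Both commands print 'Apzzzzz', 'Bird' and 'Eaglzzzzz'.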
BEGIN & END Actions
Awk has two special patterns called 'BEGIN' and 'END'. The action attached to 'BEGIN' runs once when awk first starts up, before any input is read, and the action attached to 'END' runs once after all of the input has been processed, when awk is about to shut down. Here is a brief example of this in action:
awk '
BEGIN{print "I run once when awk starts up."}
END{print "I run once when awk is about to exit."}
' temps.csv
which outputs the following:
I run once when awk starts up.
I run once when awk is about to exit.
'BEGIN' and 'END' are extremely useful, because you can use them to do important 'programming language' type things like setting up and initializing variables in the 'BEGIN' action, or checking and aggregating information in the 'END' action. Here is an example use case of awk that calculates the average temperature (in Celsius) from our file of mixed Fahrenheit and Celsius values:
awk '
BEGIN{temp_sum=0; total_records=0; print "Begin calculating average temperature."}
$2=="F"{temp_sum += ($1-32) / 1.8; total_records += 1;}
$2=="C"{temp_sum += $1; total_records += 1;}
END{print "Average temperature: "(temp_sum/total_records)" C = "(temp_sum)" / "(total_records)}
' temps.csv
which outputs the following:
Begin calculating average temperature.
Average temperature: 25.3852 C = 304.622 / 12
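The same structure works for other kinds of aggregation. Here is a sketch that tracks the lowest and highest temperature (in Celsius) and reports them in the 'END' action. The function name 'temp_range' and the shortened input are made up for this example.

```shell
# Track the minimum and maximum temperature after converting to Celsius.
# Uninitialized awk variables start out as 0 (or the empty string),
# so the counter n needs no BEGIN block.
temp_range() {
    awk '
    NR>1 {
        c = ($2=="F") ? ($1 - 32) / 1.8 : $1 + 0
        if (n == 0 || c < min) min = c
        if (n == 0 || c > max) max = c
        n++
    }
    # Only report if at least one data row was seen.
    END { if (n > 0) printf("min=%.1f max=%.1f n=%d\n", min, max, n) }
    '
}

printf 'temp unit\n26.1 C\n78.1 F\n23.1 C\n' | temp_range
```

The 'n > 0' check in the 'END' action avoids printing meaningless values when the input has no data rows. A similar guard would protect an average calculation from dividing by zero.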
Other More Advanced Awk Examples
If you're sold on the idea that awk is worth learning, this blog has a couple of other posts that make use of awk in more advanced contexts:
- Bash One Liner - Compose Music From Entropy in /dev/urandom
- Audio codec & decompresser written in awk
Caveats
There are a few things that can catch you off guard about awk that you should be aware of:
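- Awk uses 1-based indexing: fields ($1, $2, ...), string positions in functions like substr(), and the arrays filled in by split() all start at 1 rather than 0.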
- Awk uses 'Extended Regular Expressions' (ERE) and not 'Perl-Compatible Regular Expressions' (PCRE). ERE is much older and includes fewer features than PCRE. It also matches certain regex patterns differently.
- The 'awk' command is a part of the POSIX Standard, which means that some features of awk are 'POSIX' and they are likely to be portable to different operating system environments. Some features, however, only exist in specific implementations of awk, for example, in the GNU implementation of awk.
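To illustrate the first caveat, here is a short sketch that shows the 1-based indexing in 'substr' and 'split'. The function names are made up for this example.

```shell
# substr() positions start at 1, so this returns the first character.
first_char() { awk '{ print substr($0, 1, 1) }'; }

# split() fills the array starting at index 1 and returns the item count,
# so the last item is at index n (not n-1).
count_first_last() {
    awk '{ n = split($0, parts, ","); print n, parts[1], parts[n] }'
}

echo "hello" | first_char
echo "a,b,c" | count_first_last
```

This prints 'h' and then '3 a c'.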
And that's why the 'awk' command is my favourite Linux command.