Liviu
Blocking Queue

Yet Another AWK Cheat Sheet

Liviu
Dec 15, 2021


I like bash: you can achieve a lot with a few basic commands. I also like to write scripts for my work because they are the perfect documentation.

A long time ago I learned the basics of sed and grep, and this helped me a lot while doing support. Before we had fancy tools like Kibana and CloudWatch, knowing how to grep paid off. Even now, if I need to debug something very complicated, I prefer having the logs in a console.

Another thing that I still do in the console is sanity checks between different sources. For example, I have a customer's file that I need to match against another file holding database records, to check that they are in sync. This is how I came to learn AWK and to write about how I use it.

I could add a proper introduction to AWK here, but I assume you have some experience with it, so I'll keep it short. AWK is a text-processing language, typically used for data extraction and similar tasks.

How it works

Awk is a scripting language: it requires no compilation and has variables, numeric and string functions, and logical operators.

The basic thing to remember about it is that it deals with records and fields. Records are read one at a time and then they are split into fields. You can imagine it like reading a simple CSV file where a record is a line and a field is a cell.

Basic usage

The most basic usage of awk is to get just a few fields out of each line.

$ echo "a b c" | awk '{ print $2" "$3}'
b c

A few things to remember from here. Each field has its own variable: $1, $2, $3, and so on ($0 represents the entire record). Other important variables to remember (not a complete list):

  • NR: Number of Records
  • NF: Number of Fields
  • FNR: the record number (typically the line number) within the current file, whereas NR counts records across all input files. This is useful because NR==FNR is true only while awk reads the first file, a trick used a lot in file comparison.
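To see the three counters side by side, here is a small sketch. The files a.txt and b.txt and their contents are made up for illustration, and FILENAME is another built-in variable that holds the name of the file being read.

```shell
# Hypothetical sample files, created in a scratch directory
tmp=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmp/a.txt"
printf 'x y\n' > "$tmp/b.txt"

# Print the file name, NR, FNR and NF for every record
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, NF }' a.txt b.txt)
echo "$counters"
# a.txt 1 1 3
# a.txt 2 2 2
# b.txt 3 1 2
```

NR keeps counting up to 3, while FNR starts again at 1 when awk moves on to b.txt.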

Separator

By default, awk will split records based on whitespace. If you have a different separator use -F. The delimiter can be a regular expression: -F'[,:]' so you can use multiple delimiters.

$ echo "a,b,c" | awk -F','  '{print $1" "$2}'
a b
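The regular-expression form can be sketched in the same way. The input string here is just an example that mixes a comma and a colon.

```shell
# Split on either a comma or a colon
fields=$(echo "a,b:c" | awk -F'[,:]' '{ print $1" "$2" "$3 }')
echo "$fields"
# a b c
```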

Input

Most of the time you won't feed awk from echo or from another command's output; reading from a file looks like this:

$ echo "a b c" > f1.txt
$ awk '{ print $2" "$3}' f1.txt 
b c

Awk accepts multiple input files:

$ echo "1 2 3" > f2.txt
$ awk '{ print $2" "$3}' f1.txt f2.txt 
b c
2 3

Program source code can also be read from a file; for this we use the -f option:

$ echo '{ print $2" "$3}' > aw
$ awk -f aw f1.txt 
b c

Selection Criteria

Before processing each record, awk can apply a selection criterion. This can be a pattern or a condition. It is useful when you want to process only the lines that match a given pattern or satisfy a condition.

$ awk '/manager/ {print}' f1.txt
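Since f1.txt from the earlier examples has no line containing "manager", here is a sketch with a made-up staff.txt (name, role, age) that shows a pattern, a condition, and the two combined. The file and its columns are assumptions for illustration only.

```shell
# Hypothetical input: name, role, age
tmp=$(mktemp -d)
printf 'alice manager 50\nbob developer 40\ncarol manager 30\n' > "$tmp/staff.txt"

# Pattern: records that contain "manager"
managers=$(awk '/manager/ { print $1 }' "$tmp/staff.txt")

# Condition: third field greater than 35
over35=$(awk '$3 > 35 { print $1 }' "$tmp/staff.txt")

# Pattern and condition combined
both=$(awk '/manager/ && $3 > 35 { print $1 }' "$tmp/staff.txt")

echo "$managers"   # alice, carol
echo "$over35"     # alice, bob
echo "$both"       # alice
```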

Begin/End

You can also add BEGIN and END actions, which run before the first record is read and after the last one.

$ awk '
BEGIN {print "start"}
{print $1" "$3}
END {print "end"}' f1.txt 

start
a c
end

These are very useful if you need to compute an average or print a sum.

$ seq 99 101 >numbers.txt
$ awk '
{sum=sum+$1}
END {print "Sum: "sum" Avg: "sum/NR}' numbers.txt 

Sum: 300 Avg: 100

Here, seq 99 101 >numbers.txt writes the numbers 99 to 101 to a file, one per line.
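One thing to watch for: the average divides by NR, so an empty input file would mean dividing by zero. A minimal sketch of a guard in the END block follows; the empty.txt file and the "no records" message are my own additions, not part of the example above.

```shell
tmp=$(mktemp -d)
seq 99 101 > "$tmp/numbers.txt"
: > "$tmp/empty.txt"            # hypothetical empty input

# Same sum/average program, but END checks NR before dividing
prog='
{ sum += $1 }
END {
  if (NR > 0) {
    print "Sum: " sum " Avg: " sum / NR
  } else {
    print "no records"
  }
}'

stats=$(awk "$prog" "$tmp/numbers.txt")
empty=$(awk "$prog" "$tmp/empty.txt")
echo "$stats"   # Sum: 300 Avg: 100
echo "$empty"   # no records
```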

Matching 2 files

With these examples in mind, we can try the main task: matching 2 files based on one of the fields. You can think of it as a database inner join.

There will be 2 files: names.txt and license.txt. In names.txt the first column is the id while in license.txt the second column is the id. The script will be in the matchScript file. We want to match each name to its license.

Awk will iterate over both files. As mentioned earlier, FNR is the record number in the current file while NR is the record number across all input. The condition NR==FNR holds only while awk reads records from the first file, so the details from the first file get stored in a map (an associative array), which is created on first use. The next statement ends processing of the current record and moves on to the next one.

The second block is reached only for records from the second file: it checks whether the id is in the map and, if so, prints the name together with the license.

$ cat names.txt 
1,Sleepy
3,Dopey
2,Grumpy

$ cat license.txt 
BASIC,2
BASIC,1
PRO,3

$ cat matchScript 
(NR==FNR){
   names[$1]=$2; 
   next
}
($2 in names){
   print names[$2]","$1
}

$ awk -f matchScript -F ',' names.txt license.txt 
Grumpy,BASIC
Sleepy,BASIC
Dopey,PRO
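For the "are they in sync" kind of check, the same idea can be flipped to list the records that do not match. This is a sketch only: it reuses the data above and adds one hypothetical license row (id 9) that has no name.

```shell
tmp=$(mktemp -d)
printf '1,Sleepy\n3,Dopey\n2,Grumpy\n' > "$tmp/names.txt"
# Same licenses as above plus one made-up orphan record (id 9)
printf 'BASIC,2\nBASIC,1\nPRO,3\nPRO,9\n' > "$tmp/license.txt"

unmatched=$(awk -F',' '
NR == FNR { names[$1] = $2; next }   # first file: remember id -> name
!($2 in names) { print $0 }          # second file: id was never seen
' "$tmp/names.txt" "$tmp/license.txt")

echo "$unmatched"
# PRO,9
```

An empty result would mean every license has a matching name.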

Other

If you need more info or you want to go into deep details this is the best place for it.

 