Fast, Efficient and Reliable Pattern Scanning and Processing with Awk

Whether the need is to provide useful information to customers or management, monitor a system’s activity, extract certain lines from a file, or automate certain tasks, a systems administrator often has to deal with text files. Carrying out these tasks usually requires parsing or manipulating text files, which might be as small as a few lines or as large as a few gigabytes. Although there are various ways of doing this on *nix systems, most are either too simplistic or so complex that one has to write several lines of code to accomplish a small task. Awk, with its built-in features to recognise patterns and its ability to manipulate text easily, is one of the fastest, most efficient and reliable tools for the job. It combines the best of both worlds by providing the ease of use of grep-like tools with the features and efficiency of a programming language. Its C-like syntax is very easy to learn.

Awk was created by Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan (co-author of The C Programming Language). It gets its name from the initials of its creators' surnames. Awk has evolved over the years, and the current version is usually called Nawk, for 'New Awk'. Gawk, the GNU version of Awk, supports the new features as well as some GNU-specific extensions. Gawk is included with most Linux distributions and is used widely in several start-up scripts. In most Linux distributions, awk is a symbolic link to the gawk utility.

Records and fields

An Awk program divides the input file(s) into records and fields. The input file is divided into records based on the Record Separator (RS) variable. By default, records are separated by a newline character. However, RS can be any single character or a regular expression. For example, setting RS="$" will separate a file into multiple records based on the occurrences of $ in the text.

[testuser@localhost ~]$ cat testfile
abc$def$ghi$klm

[testuser@localhost ~]$ awk -v RS="$" '{print}' testfile
abc
def
ghi
klm

Output is printed based on the value of the Output Record Separator (ORS) variable. The default value of ORS is the newline character (\n). We will get into the details of this small program shortly.
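To illustrate, here is a minimal sketch (reusing the testfile above); the separator string chosen for ORS is purely illustrative:

[testuser@localhost ~]$ awk -v RS="$" -v ORS="\n---\n" '{print}' testfile

Each record is now followed by a line of dashes instead of a plain newline. Note that the last record also carries the file’s trailing newline, since it is part of that record’s text.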

Fields are, by far, the most important feature of Awk. Each input record is further divided into fields based on a Field Separator (FS) variable. Like RS, FS can be any single character or a regular expression. By default, it is a single space, which Awk treats specially: fields are then separated by runs of whitespace (spaces, tabs or newlines), and leading and trailing whitespace is ignored. Fields are what make Awk so useful for text manipulation. Most of the time, one has to search for a particular piece of text at a particular location in the line, and that is where fields come in handy.

Awk assigns the value of each field to a built-in variable $n, based on the order of occurrence of that field in the record. For example, for a line containing the text "Hello, World.", if the FS value is ",", $1 becomes "Hello" and $2 becomes " World." (including the leading space). These variables are valid only for the current input record. There’s a special variable $0 that is equal to the whole input record. Fields may be referenced through constants like $1, $2, etc, or through variables. For example, if N=5, $N may be used instead of $5.

Tip: The FS value can also be assigned using the command-line switch -F, as shown in the sketch below.
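For instance, a small illustrative sketch (the file fruits and its contents are hypothetical):

[testuser@localhost ~]$ cat fruits
apple,red,fruit
banana,yellow,fruit

[testuser@localhost ~]$ awk -F"," '{print $1, $3}' fruits
apple fruit
banana fruit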

In short, each input file is divided into records based on the RS variable, and each record is further divided into fields based on the FS variable. Field values are assigned to special variables based on their occurrence, with the first field being $1, the second $2 and so on. The special variable $0 is equivalent to the input record. Now, we will see how an Awk program works.

Anatomy of an Awk program

An Awk program mainly consists of patterns and actions. It can also include variable assignments and function definitions, but the most important parts are patterns and the actions that are to be taken when those patterns occur in the input text. Each pattern specified is checked against each input line read, and the actions defined for that pattern are executed. Either the pattern may be missing, in which case the defined action is executed for all input lines, or the action may be missing—in which case the default action of printing the current input line is executed, i.e., {print $0}.
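As a quick sketch of the two cases (using the comma-separated testfile shown in the next example):

# Pattern only: the default action { print $0 } is applied to matching lines
[testuser@localhost ~]$ awk '/test/' testfile

# Action only: the action is executed for every input line
[testuser@localhost ~]$ awk -F"," '{ print $2 }' testfile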

The syntax of an Awk program is 'PATTERN{ ACTION}'. Action statements are enclosed within { } and the whole program is enclosed within single quotes when executed directly on the command line. An Awk program can also be saved in a file and executed using the -f switch. In such a case there’s no need to enclose the program within quotes.

[testuser@localhost ~]$ cat testfile
abc,def,ghi,klm
abc,test,ghi,dsfdss
abc,def,test,kshsf

[testuser@localhost ~]$ cat myfile
/test/{ print}  # pattern{ action }, Print those lines which contain the pattern "test"

[testuser@localhost ~]$ awk -F"," -f myfile testfile
abc,test,ghi,dsfdss
abc,def,test,kshsf

Pattern forms

Patterns can be specified in various forms like:

  • Regular expressions: A pattern can be any regular expression. Gawk supports extended regular expressions, so patterns containing character classes like [:alpha:], [:digit:], [:lower:], etc, are also supported. A detailed discussion of regular expressions is beyond the scope of this article.
  • Relational and compound expressions: Relational expressions (using operators such as <, <=, == and >) can be used as patterns, and patterns can be combined into more complex ones with the operators &&, || and !. The C ternary operator ?: is also supported: an expression of the form pattern1 ? pattern2 : pattern3 evaluates pattern2 if pattern1 is true, and pattern3 otherwise. (See the sketch after this list.)
  • Pattern1, Pattern2: The man page states that this form specifies a range of records: the actions are executed for all records starting with a record that matches pattern1 and continuing up to and including the next record that matches pattern2. (See the examples at the end.)
  • BEGIN and END patterns: There are two special patterns defined in Awk—BEGIN and END. Actions specified for BEGIN are executed at the start of the program before any input records are read. Thus, it’s a good place for any global variable initialisation or to perform any tasks that should precede the start of the input. Similarly, actions specified for the END pattern are executed after all the input records have been read and the actions specified for other patterns have been executed. Actions for BEGIN and END patterns are executed only once, and are independent of the number of input records.
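Here is a small illustrative sketch that combines these forms; the file datafile and its numeric second field are assumptions made just for this example. BEGIN prints a header before any input is read, a relational expression selects records, and END prints a summary after the last record:

awk 'BEGIN { print "Records with a second field above 100:" }
     $2 > 100 { print; count++ }
     END { print count+0, "record(s) matched" }' datafile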

Actions

As mentioned earlier, action statements are enclosed within {}. They are similar to the assignment, conditional and looping statements of the C language. A statement may end with a newline character or “;”. Comments can be specified using a #.

Variables

The man page states that variables in Awk are dynamic in nature, i.e., they come into existence when they are used. Variable values are either strings or floating-point numbers. Their type is decided based on the context they are used in. They can be assigned values as variable=value and can be used in expressions and/or statements using their names. For example, var=25; print var*3 will output 75. In this case, the variable var is treated as a floating-point number.
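For example, a one-liner version of the snippet above; it is run in a BEGIN block so no input file is needed:

[testuser@localhost ~]$ awk 'BEGIN { var=25; print var*3 }'
75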

Besides the RS, FS and ORS variables that we covered earlier, there are a few others that are very useful while writing programs in Awk.

  • NF: The NF variable holds the number of fields in the current input record. If the current record has six fields, the value of NF is 6. This is very useful for looping through the fields of a record when the number of fields is not known in advance. $NF refers to the last field in a record.
  • NR: The NR variable provides the number of input records that have so far been read.
  • FNR: This variable provides the number of input records that have so far been read from the current input file. FNR and NR will be the same if only one input file is provided. However, if there is more than one input file, FNR shows the number of records read from the current input file, whereas NR shows the number of records read since the first record of the first input file.
  • FILENAME: This provides the name of the current input file. If the input is read from standard input (i.e., no file is given on the command line), its value is "-".
  • IGNORECASE: By default, all string operations and regular-expression matches in Awk are case-sensitive. To change that, IGNORECASE needs to be set to a value greater than zero.
  • OFS: The OFS variable specifies the field separator to be used in the output. By default, it’s a space character.
  • Operators: Operators in Awk are similar to those in other programming languages: =, += and -= for assignment, the logical operators &&, || and !, and the increment (++) and decrement (--) operators. One notable addition is the ~ operator, which matches a value against a regular expression; its negated form !~ checks for a non-match. Operators have a particular order of precedence; please check the Awk man page for reference. (A short sketch follows this list.)
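To tie a few of these together, here is an illustrative sketch over /etc/passwd (the output will vary from system to system): it prints the record number, the user name and the field count for every account whose last field matches "bash", joining the output fields with OFS:

[testuser@localhost ~]$ awk -F":" -v OFS=" | " '$NF ~ /bash/ { print NR, $1, NF }' /etc/passwd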

Control statements

Awk provides a variety of control statements that are similar to the control statements used in C. They start with keywords like if, for, while, continue, etc, which differentiates them from simple expressions. Awk provides both varieties of the for loop:

for (i = 1; i <= var; i++)
     print i
as well as

for (i in array)
     print array[i]

The second form is very useful for looping through the elements of an array, whose indices are string values. We will cover arrays and some other advanced features of Awk in the next part of this article.
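Before moving on, here is an illustrative sketch of the other C-like statements (while and if); it walks through the fields of each record of the comma-separated testfile used earlier and reports the ones that match a pattern:

awk -F"," '{
    i = 1
    while (i <= NF) {
        if ($i ~ /test/)
            print "record", NR, "field", i, "matches:", $i
        i++
    }
}' testfile

With the testfile shown earlier, this reports the field containing "test" in the second and third records.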

Passing variables to Gawk

Gawk supports the -v switch that allows users to pass variables to an Awk program. Shell variables can also be passed on to the Gawk program using the -v switch. The syntax is:

awk -v varname=${shell_var} 'pattern{ action}'

Once defined, varname can be used like any dynamic variable defined in Gawk.
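For instance, a hedged sketch (the shell variable minuid and the UID threshold are arbitrary): the shell variable is passed in as the Awk variable limit and then used in a pattern:

[testuser@localhost ~]$ minuid=500
[testuser@localhost ~]$ awk -F":" -v limit=${minuid} '$3 >= limit { print $1 }' /etc/passwd

This prints the names of all users whose UID is greater than or equal to the value of the shell variable.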

Functions

There are several built-in functions available in Gawk. Users can also define their own functions and use them. We will get into the details of various functions available and how to define our own functions in the next article in this series on Awk. For now, we will take a look at a few important functions.

  • print: As the name suggests, the print function allows printing of text/values. If called without any arguments, it prints the current record; thus, print and print $0 are equivalent. Its output can be redirected to a file using > filename.
  • length: The length(str) function returns the length of the string str or the length of $0 if no str is provided.
  • gsub: The gsub function allows you to replace every occurrence of the text matching a pattern ptr, with a substitute string str in the target string tr. Its syntax is gsub(ptr,str,tr). If no target string is provided, $0 is used. gsub returns the number of substitutions that took place.
  • sub: The sub function is similar to gsub, but it replaces only the first occurrence of the pattern. (A short sketch combining these functions follows.)
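As an illustrative sketch combining these functions (the input line is supplied with echo, so no file is assumed):

echo "foo bar foo" | awk '{
    print "length of record:", length($0)
    n = gsub(/foo/, "baz")        # replace every "foo" in $0
    print n, "substitution(s) made:", $0
    sub(/baz/, "qux")             # replace only the first "baz"
    print $0
}'

This prints the record length (11), reports that two substitutions were made ("baz bar baz"), and then shows the record after the single sub() replacement ("qux bar baz").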

Some examples

We will make a copy of the file /etc/passwd as passwd2 so that we don’t accidentally break the system.
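For example:

[testuser@localhost ~]$ cp /etc/passwd passwd2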

Searching for a pattern:

[testuser@localhost ~]$ pat="/sbin/nologin"
[testuser@localhost ~]$ awk -F":" -v pat=${pat} '$NF~pat{print $1}' passwd2

This provides a list of all the users who are not allowed to log in to the system via the console, Telnet, SSH, etc. We specify the field separator as ":" using the -F switch and pass the shell variable pat as a pattern to Gawk using the -v switch. We match the pattern against the last field ($NF) of each record and print the user name ($1 is the first field) for the records that match the given pattern. While matching $NF against the variable, we use pat and not $pat: prefixing a variable with $ turns it into a field reference, so $pat would try to use the value of pat as a field number rather than as the pattern itself.

Inserting text: Let’s suppose we want to add a comment "Not allowed to log in" to the passwd2 file for all users that have /sbin/nologin as their shell, and "Allowed to log in" for the rest. We know that comments can be placed in the GECOS field of the passwd file, which is the fifth field. We can then use the following code to add the comment:

awk -F":" -v pat=${pat} -v OFS=":" '{
  if ($NF~pat) $5="Not allowed to Login";
else $5="Allowed to login";
  print }' passwd2

Building on the previous example, we set OFS=":". The default value for OFS is a space and the passwd file has ":" as the separator.

Replacing text: In this example, we will try to replace /bin/bash with /bin/sh.

pat="/bin/bash"
awk -F":" -v pat=${pat} -v OFS=":" 'BEGIN{newshell="/bin/sh"} {sub(pat,newshell,$NF); print}' passwd2

We use the sub function to replace /bin/bash with /bin/sh. We also use the special BEGIN pattern to initialise the newshell variable.

Range pattern: Print the users with UIDs from 500 through 510.

awk -F":" '$3~/^500$/,$3~/^510$/' passwd2

Since no action is specified, matching records are printed with the default action. As this is a range pattern, printing starts at the first record whose third field (the UID) is 500 and stops at the next record whose UID is 510, so it assumes the entries appear in UID order. Anchoring the regular expressions with ^ and $ prevents UIDs such as 1500 or 5100 from matching.

In this first article, we have covered the basic concepts of Awk. In subsequent articles, we will cover advanced topics like arrays, user-defined functions, built-in functions for string manipulation, and date functions. We will also look at some useful real-world examples like extracting a particular table from a MySQL dump, calculating the date a few days before or after the current date, and formatting the output.
