A Beginner’s Guide To Grep: Basics And Regular Expressions

Grep is one of the tools in the system administrator’s “Swiss Army knife” set, and is extremely useful for searching for strings and patterns in a group of files, or even sub-folders. This article introduces the basics of Grep, provides examples of advanced use and links you to further reading.

Grep (an acronym for “Global Regular Expression Print”) is installed by default on almost every distribution of Linux, BSD and UNIX, and is even available for Windows. GNU and the Free Software Foundation distribute Grep as part of their suite of open source tools. This tutorial focuses primarily on this GNU version, as it is currently the most widely used.

Grep finds a string in a given file or input, quickly and efficiently. While most everyday uses of the command are simple, there are a variety of more advanced uses that most people don’t know about — including regular expressions and more, which can become quite complicated.

Two early variants of the tool were egrep and fgrep. The former supports an extended regular expression syntax that was added to UNIX after Ken Thompson’s original regular expression implementation. The latter searches for any of a list of fixed strings, using the Aho-Corasick algorithm. These variants are embodied in most modern Grep implementations as command-line switches (and standardised as -E and -F in POSIX.2). In such combined implementations, Grep may also behave differently depending on the name by which it is invoked, allowing fgrep, egrep and grep to be links to the same program.

There are two ways to provide input to Grep, each with its own particular uses. First, Grep can be used to search a given file or files on a system (including a recursive search through sub-folders). Grep also accepts inputs (usually via a pipe) from another command or series of commands.
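The two input styles can be sketched side by side. This is a minimal illustration that assumes a scratch file named sample.txt with made-up contents; neither the name nor the contents come from the article.

```shell
# Scratch file for the demonstration (hypothetical name and contents).
workdir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$workdir/sample.txt"

# 1. Name the file to search as an argument.
from_file=$(grep 'beta' "$workdir/sample.txt")

# 2. Pipe another command's output into grep's standard input.
from_pipe=$(cat "$workdir/sample.txt" | grep 'beta')

echo "file: $from_file"
echo "pipe: $from_pipe"

rm -rf "$workdir"
```

Both searches return the same line; only the source of the text differs.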

Regular expressions

A regular expression, often shortened to “regex” or “regexp”, is a way of specifying a pattern (a particular set of characters or words) in text that can be applied to variable inputs to find all occurrences that match the pattern. Regexes enhance the ability to meaningfully process text content, especially when combined with other commands.
Usually, regular expressions are included in the Grep command in the following format:

grep [options] [regexp] [filename]

GNU Grep uses the GNU version of regular expressions, which is very similar (but not identical) to POSIX regular expressions. In fact, most varieties of regular expressions are quite similar, but have differences in escapes, meta-characters, or special operators.

GNU Grep has two regular expression feature sets: Basic and Extended. In basic regular expressions, the meta-characters ?, +, {, |, (, and ) lose their special meaning and simply match themselves (their uses as operators are described later in this article). To switch to extended regular expressions, in which these characters are special, add the option -E to the grep command.
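The difference between the two feature sets is easiest to see with the same pattern run both ways. The sketch below assumes GNU Grep and three made-up sample lines; the backslashed form in the last command is a GNU feature of the basic syntax.

```shell
# Three sample lines (hypothetical data): two spellings and one literal "?".
sample=$(printf '%s\n' 'color' 'colour' 'colou?r')

# Basic syntax: ? is an ordinary character, so only the literal "colou?r" matches.
basic=$(printf '%s\n' "$sample" | grep 'colou?r')

# Extended syntax (-E): ? makes the preceding "u" optional.
extended=$(printf '%s\n' "$sample" | grep -E 'colou?r')

# GNU grep's basic syntax also accepts the backslashed form \? as the operator.
escaped=$(printf '%s\n' "$sample" | grep 'colou\?r')

echo "basic:    $basic"
echo "extended: $extended"
echo "escaped:  $escaped"
```

With -E (or with the backslashed operator) the pattern selects "color" and "colour"; without it, only the line containing a literal question mark is selected.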

It is customary to enclose the regular expression in single quotation marks, to prevent the shell (Bash or others) from trying to interpret and expand the expression before launching the grep process. For example, if a pair of back-ticks in the regexp is not quoted, it would result in the text between the back-ticks being executed as a Bash sub-process — and if this happens to be a valid command, the text returned by it takes the regular expression’s place in the command-line parameters given to Grep! Not at all what we want.

Again, due to shell behaviour, you can also enclose the regex in double quotes — in this case, you can use environment variables in the regex, and the shell will substitute them before calling Grep. This can be very useful, depending on what you’re trying to do — or it could turn out to be a nuisance. Remember the difference in behaviour.
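As a rough illustration of the difference, the sketch below uses a hypothetical shell variable named word and two made-up lines of text, and runs the same characters through Grep in double and in single quotes.

```shell
word='test'   # hypothetical variable, used only for this illustration
sample=$(printf '%s\n' 'is test file' 'a literal $word in the text')

# Double quotes: the shell replaces $word with "test" before grep starts.
double=$(printf '%s\n' "$sample" | grep "$word")

# Single quotes: grep receives the five characters $word unchanged. The $ is
# not at the end of the pattern, so in basic syntax it is an ordinary character.
single=$(printf '%s\n' "$sample" | grep '$word')

echo "double quotes matched: $double"
echo "single quotes matched: $single"
```

The double-quoted search finds the line containing "test", while the single-quoted search finds only the line that contains the text $word itself.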

Basic usage

Now let’s go on to some practical examples of using Grep. To better understand the results, I’ve created a simple text file on which we will run our Grep searches; the file contains the following lines:

Hi 
this 
is test file 
to carry out few regular expressions 
practical with grep 
123 456 
Abcd
ABCD

Case-insensitive search (grep -i):

[manish@clone ~]$ grep -i 'abcd' testfile 
Abcd 
ABCD

As you can see, the -i flag makes the search for “abcd” ignore case, so it also returns lines in which the letters are capitalised differently from the search string.

Whole-word search (grep -w):

[manish@clone ~]$ grep -w 'test' testfile 
is test file

This type of search only returns lines where the sought-for string is a whole word and not part of a larger word.
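To see what -w excludes, it helps to search text in which the string also occurs inside a longer word. The two sample lines below are made up for this sketch.

```shell
# Hypothetical sample: "test" as a word, and "test" inside "testfile".
sample=$(printf '%s\n' 'is test file' 'open testfile now')

# Without -w, any occurrence of the string counts.
plain=$(printf '%s\n' "$sample" | grep 'test')

# With -w, the match must be a whole word.
whole=$(printf '%s\n' "$sample" | grep -w 'test')

echo "plain:"
echo "$plain"
echo "whole word:"
echo "$whole"
```

The plain search returns both lines; the -w search drops the line where "test" is only part of "testfile".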

Recursively search through sub-folders (grep -r <pattern> <path>):

[manish@clone ~]$ grep -r '456' /root/ 
/root/testfile:123 456
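A recursive search can be tried safely on a scratch directory tree. The directory and file names below are invented for this sketch and do not come from the article.

```shell
# Build a small tree: sub/deeper/testfile contains the pattern, sub/notes does not.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub/deeper"
printf '123 456\n' > "$workdir/sub/deeper/testfile"
printf 'no digits here\n' > "$workdir/sub/notes"

# -r descends into sub/ and every directory below it.
hits=$(cd "$workdir" && grep -r '456' sub)

echo "$hits"

rm -rf "$workdir"
```

Each matching line is prefixed with the path of the file it was found in, relative to the starting directory given on the command line.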

Inverted search (grep -v):

[manish@clone ~]$ grep -v 'practical' testfile 
Hi 
this 
is test file 
to carry out few regular expressions 
123 456 
Abcd 
ABCD

This prints all the lines in the file, except the line that contains the word “practical”.

An interesting relative is the -L flag (you can also use --files-without-match), which outputs the names of files that do NOT contain matches for your search pattern. The matches for your search pattern are not themselves printed, only the names are.

[manish@clone ~]$ grep -r -L "Network" /var/log/* 
/var/log/anaconda.log 
/var/log/anaconda.syslog 
/var/log/audit/audit.log 
/var/log/boot.log 
/var/log/boot.log.1 
...

The “opposite” flag to -L is -l or --files-with-matches, which prints out (only) the names of files that do contain matches for your search pattern.
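The two flags can be compared on a pair of scratch files. The file names a.log and b.log and their contents are assumptions made for this sketch.

```shell
# Two hypothetical log files: only a.log mentions "Network".
workdir=$(mktemp -d)
printf 'Network is up\n' > "$workdir/a.log"
printf 'nothing to report\n' > "$workdir/b.log"

# -l lists the files that contain a match.
with_match=$(cd "$workdir" && grep -l 'Network' a.log b.log)

# -L lists the files that contain no match.
without_match=$(cd "$workdir" && grep -L 'Network' a.log b.log)

echo "with match:    $with_match"
echo "without match: $without_match"

rm -rf "$workdir"
```

Between them, the two lists cover every file that was searched, and neither shows the matching lines themselves.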

Print additional (trailing) context lines after match (grep -A <NUM>):

[manish@clone ~]$ grep -A1 '123'  testfile
123 456 
Abcd

For each line that matches the search, Grep prints the matching line as well as the one line that follows it. Varying the number given to -A changes the number of trailing lines in the output.

Print additional (leading) context lines before match (grep -B <NUM>):

[manish@clone ~]$ grep -B2 'Abcd' testfile
practical with grep 
123 456 
Abcd

Print additional (leading and trailing) context lines before and after the match (grep -C <NUM>):

[manish@clone ~]$ grep -C2 'carry' testfile
this
is test file
to carry out few regular expressions
practical with grep
123 456

As you can see, this has printed out two lines before and after the single match found in the file; if there are multiple matches, Grep inserts a line containing -- between each group of lines (each match and its context lines).
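The separator line is easier to picture with input that contains two matches some distance apart. The six lines below are made up for this sketch, which assumes GNU Grep.

```shell
# Hypothetical input with two matches separated by filler lines.
sample=$(printf '%s\n' 'match one' 'after one' 'filler' 'filler' 'match two' 'after two')

# One line of trailing context per match; the groups are not adjacent,
# so grep prints a line containing -- between them.
context=$(printf '%s\n' "$sample" | grep -A1 'match')

echo "$context"
# match one
# after one
# --
# match two
# after two
```

If the context of two matches overlaps or touches, Grep prints them as one group without a separator.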

Print the filename for each match (grep -H <pattern> filename):

[manish@clone ~]$ grep -H 'a' testfile
testfile:to carry out few regular expressions 
testfile:practical with grep

Now, let’s run the search a bit differently:

[manish@clone ~]$ cat testfile | grep -H 'a' 
(standard input):to carry out few regular expressions 
(standard input):practical with grep

When the stream that Grep is asked to search is passed to its standard input via a pipe from a previous command in the chain, grep -H displays (standard input) as the filename.

Run in “quiet” mode (grep -q): When run with this flag, Grep does not write anything to standard output, but sets its return value (also known as exit status) to reflect whether a match was found or not. This option is mainly used in scripts that need to check if a given file contains a particular match. A return status of 0 (zero) indicates that a match was found; 1 indicates that no match was found.

[manish@clone ~]$ grep -q '2010' testfile 
[manish@clone ~]$ echo $? 
1
[manish@clone ~]$ grep -q '456' testfile 
[manish@clone ~]$ echo $? 
0
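In a script, the exit status is usually tested directly by an if statement rather than read from $?. A minimal sketch, assuming a scratch copy of part of the test file and two hypothetical result variables:

```shell
workdir=$(mktemp -d)
printf '%s\n' '123 456' 'Abcd' 'ABCD' > "$workdir/testfile"

# grep -q prints nothing; the if statement branches on its exit status.
if grep -q '456' "$workdir/testfile"; then
    found_456='yes'
else
    found_456='no'
fi

if grep -q '2010' "$workdir/testfile"; then
    found_2010='yes'
else
    found_2010='no'
fi

echo "456 present:  $found_456"
echo "2010 present: $found_2010"

rm -rf "$workdir"
```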

Using regular expressions

[manish@clone ~]$ grep 'c.r' testfile 
to carry out few regular expressions

In the search above, . is used to match any single character — which is why it matches “car” in “carry”. Grep has a powerful regular expression matching engine, which we can’t hope to cover in depth here, but we will include a few important points:

  • Most characters, including all letters and digits, are actually regular expressions that match themselves.
  • Any meta-character (with special meaning to Grep, like the . in the example above) may be quoted by preceding it with a backslash. This makes Grep treat it as an ordinary character.

[manish@clone ~]$ grep 'c\.r' testfile 
[manish@clone ~]$

As you can see, preceding . with a backslash has removed its significance as a meta-character, so the pattern now looks for the literal text “c.r”, which does not occur in the file.
While the period (.) matches any single character, a regular expression may also be followed by one of several repetition operators:

  • ? means that the preceding item is optional: it is matched at most once.
  • * means that the preceding item will be matched zero or more times.
  • + means the preceding item will be matched one or more times.
  • {n} means the preceding item is matched exactly n times, while {n,} means the item is matched n or more times. {n,m} means that the preceding item is matched at least n times, but not more than m times. {,m} means that the preceding item is matched, at the most, m times.

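However, apart from *, these repetition operators are part of the extended regular expression syntax, so to use them as shown, remember to add the -E option to your command. (In GNU Grep’s basic syntax, the same operators are available when preceded by a backslash, as in \? or \+.)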
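The operators listed above can be compared on a small set of strings that differ only in how many times one letter repeats. The four sample strings are made up for this sketch.

```shell
# Hypothetical sample: "a", then zero to three "b"s, then "c".
sample=$(printf '%s\n' 'ac' 'abc' 'abbc' 'abbbc')

optional=$(printf '%s\n' "$sample" | grep -E 'ab?c')     # zero or one b
any=$(printf '%s\n' "$sample" | grep -E 'ab*c')          # zero or more b
some=$(printf '%s\n' "$sample" | grep -E 'ab+c')         # one or more b
exactly=$(printf '%s\n' "$sample" | grep -E 'ab{2}c')    # exactly two b
at_least=$(printf '%s\n' "$sample" | grep -E 'ab{2,}c')  # two or more b

# Print each result on one line for easy comparison.
echo "ab?c    :" $optional
echo "ab*c    :" $any
echo "ab+c    :" $some
echo "ab{2}c  :" $exactly
echo "ab{2,}c :" $at_least
```

Reading the results from top to bottom shows each operator narrowing or widening the set of strings that match.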

Read this tutorial for an introduction to more of Grep regular expression features. For more information on regular expression syntax, refer to the Regular Expressions chapter in the Grep manual. Meanwhile, we will present some examples of regular expressions and try to show how they work.

Character classes in regular expressions

The “character class” tool is one of the more flexible and often-used features of regular expressions. There are two basic ways to use character classes: to specify a list of characters (for example, [aeiou] is a list of vowel characters), or a range (like [m-t], which expands to [mnopqrst]). Ranges are a convenience that saves having to type an entire sequence of characters. A character class can also include a list of special characters, but they can’t be used as a range.

A single character class instance will match only one character; to match multiple occurrences of the class, you need to add a repetition operator, like those mentioned above. For example, to find a string of eleven consecutive lower-case letters, the regex would be: [a-z]{11}. As mentioned earlier, to use this repetition operator, we need to add the option -E. Let’s run this on our test file:

[manish@clone ~]$ grep -E '[a-z]{11}' testfile
to carry out few regular expressions

Here, “expressions” is the only all-lowercase 11-character string in the file; so this is the only line printed as the output.

There are quite a few character classes that are very commonly used in regular expressions, and these are provided as named classes. For example, the [a-z] class of lower-case letters that we used above has the named class [:lower:]. Naturally, [:upper:] is the upper-case letters A to Z, and [:alpha:] is all alphabetic characters, equivalent to [:lower:] plus [:upper:]. [:digit:] is the digits 0 to 9, and [:alnum:] is alphanumeric characters, a combination of [:alpha:] and [:digit:]. A named class is itself written inside a bracket expression, so a pattern that matches one lower-case letter is [[:lower:]]. The Grep manual lists more of these named classes.

When a caret (^) is used as the first character in a character class, it is a negation of the class, effectively meaning, “none of these characters”.
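Named and negated classes can be tried on the article's test file. The sketch below recreates that file in a scratch directory and uses the double-bracket form that named classes require; the variable names are only for illustration.

```shell
# Recreate the article's test file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Hi' 'this' 'is test file' 'to carry out few regular expressions' \
    'practical with grep' '123 456' 'Abcd' 'ABCD' > "$workdir/testfile"

# Lines that contain at least one digit.
digits=$(grep '[[:digit:]]' "$workdir/testfile")

# Lines whose first character is NOT a lower-case letter: the ^ outside the
# brackets is the start-of-line anchor, the ^ inside negates the class.
not_lower_start=$(grep '^[^[:lower:]]' "$workdir/testfile")

echo "digits:"
echo "$digits"
echo "not starting with a lower-case letter:"
echo "$not_lower_start"

rm -rf "$workdir"
```

The second pattern shows the two roles of ^ side by side, which is a common source of confusion.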

Line and word anchors

The ^ anchor specifies that the pattern following it should be at the start of the line:

[manish@clone ~]$ grep '^th' testfile 
this

The $ anchor specifies that the pattern before it should be at the end of the line.

[manish@clone ~]$ grep 'i$' testfile 
Hi

The operator \< anchors the pattern to the start of a word.

[manish@clone ~]$ grep '\<fe' testfile 
to carry out few regular expressions

Similarly, \> anchors the pattern to the end of a word.

[manish@clone ~]$ grep 'le\>' testfile 
is test file

The \b (word boundary) anchor can be used in place of \< and \> to signify the beginning or end of a word:

[manish@clone ~]$ grep -e '\breg' testfile 
to carry out few regular expressions

Finally, we look at the | (alternation) operator, which is part of the extended regex features. A pattern containing this operator separately matches the parts on either side of it; if either one is found, the line containing it is a match. The parts can themselves be complex regular expressions, so this means you can check each line in a file for multiple search patterns in one pass.

[manish@clone ~]$ grep -E 'hi|bc' testfile
this
Abcd

That was pretty simple; so let’s try a more complicated one. Can you reason out why the output lines for this regex are as shown below?

[manish@clone ~]$ grep -E '^[t-z]+|[^a-z]+$' testfile 
this
to carry out few regular expressions
123 456
ABCD

Using shell expansions in the pattern input to Grep

As mentioned earlier, if you don’t single-quote the pattern passed to Grep, the shell could perform shell expansion on the pattern and actually feed a changed pattern to Grep. This can also be done intentionally, when you need it — let’s look at a few examples.

[root@clone ~]# grep "$HOME" /etc/passwd 
root:x:0:0:root:/root:/bin/bash 
operator:x:11:0:operator:/root:/sbin/nologin

Here, we intentionally use double quotes to make the Bash shell replace the environment variable $HOME with the actual value of the variable (in this case, /root). Thus, Grep searches the /etc/passwd file for the text /root, yielding the two lines that match.

[root@clone ~]# grep `whoami` /etc/passwd 
root:x:0:0:root:/root:/bin/bash 
operator:x:11:0:operator:/root:/sbin/nologin

Here, back-tick expansion is done by the shell, replacing `whoami` with the user name (root) that is returned by the whoami command.
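Shell expansion can also be combined with regular expression syntax in the same pattern. The sketch below is an illustration under stated assumptions: it uses a hypothetical two-line passwd-style file rather than the real /etc/passwd, and a fixed variable named user as a stand-in for the output of whoami.

```shell
# Hypothetical passwd-style file, so the result does not depend on the system.
workdir=$(mktemp -d)
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
    'operator:x:11:0:operator:/root:/sbin/nologin' > "$workdir/passwd"

user='root'   # stand-in for the output of `whoami`

# The expanded name alone matches any line that mentions it anywhere.
loose=$(grep "$user" "$workdir/passwd")

# Anchoring the expanded name between ^ and : restricts the match to the
# first field, so only that user's own entry is returned.
strict=$(grep "^$user:" "$workdir/passwd")

echo "loose:"
echo "$loose"
echo "strict:"
echo "$strict"

rm -rf "$workdir"
```

The loose search returns both lines, because the second one contains /root as a home directory; the anchored search returns only the entry whose user name field is root.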

Well, we hope this has set you on your way to using this very efficient tool.


This article was originally published in the May 2010 issue.

9 COMMENTS

  1. Answer to the “Can you reason” script:

    Expression “^[t-z]+” prints lines having one or more occurrences of any character from t to z at the beginning.

    Expression “[^a-z]+$” prints lines that end with one or more characters that are NOT in the range a to z.

    The overall script prints the lines which satisfy either or both of the above command expressions.
    Therefore, lines
    “this” and
    “to carry out few regular expressions”
    satisfy the first command expression, while lines
    “123 456”
    “ABCD”
    satisfy the second command expression.

  2. Please explain this command: “grep -a --null-data U-Boot u-boot.img”. What does U-Boot stand for here? Is it a file name or an alias? u-boot.img is a file; I can see it.
