The GNU Awk User's Guide

Go to the first, previous, next, last section, table of contents.

Searching for Regular Expressions in Files

The egrep utility searches files for patterns. It uses regular expressions that are almost identical to those available in awk (see section Regular Expression Constants). It is used this way:

egrep [ options ] 'pattern' files ...

The pattern is a regexp. In typical usage, the regexp is quoted to prevent the shell from expanding any of the special characters as file name wildcards. Normally, egrep prints the lines that matched. If multiple file names are provided on the command line, each output line is preceded by the name of the file and a colon.

The options are:

-c: Print out a count of the lines that matched the pattern, instead of the lines themselves.
-s: Be silent. No output is produced, and the exit value indicates whether or not the pattern was matched.
-v: Invert the sense of the test. egrep prints the lines that do not match the pattern, and exits successfully if the pattern was not matched.
-i: Ignore case distinctions in both the pattern and the input data.
-l: Only print the names of the files that matched, not the lines that matched.
-e pattern: Use pattern as the regexp to match. The purpose of the `-e' option is to allow patterns that start with a `-'.

This version uses the getopt library function (see section Processing Command Line Options), and the file transition library program (see section Noting Data File Boundaries).

The program begins with a descriptive comment, and then a BEGIN rule that processes the command line arguments with getopt. The `-i' (ignore case) option is particularly easy with gawk; we just use the IGNORECASE built in variable (see section Built-in Variables).

# egrep.awk --- simulate egrep in awk
# Arnold Robbins, [email protected], Public Domain
# May 1993

# Options:
#    -c    count of lines
#    -s    silent - use exit value
#    -v    invert test, success if no match
#    -i    ignore case
#    -l    print filenames only
#    -e    argument is pattern

BEGIN {
    while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
        if (c == "c")
            count_only++
        else if (c == "s")
            no_print++
        else if (c == "v")
            invert++
        else if (c == "i")
            IGNORECASE = 1
        else if (c == "l")
            filenames_only++
        else if (c == "e")
            pattern = Optarg
        else
            usage()
    }

Next comes the code that handles the egrep specific behavior. If no pattern was supplied with `-e', the first non-option on the command line is used. The awk command line arguments up to ARGV[Optind] are cleared, so that awk won't try to process them as files. If no files were specified, the standard input is used, and if multiple files were specified, we make sure to note this so that the file names can precede the matched lines in the output.

The last two lines are commented out, since they are not needed in gawk. They should be uncommented if you have to use another version of awk.

    if (pattern == "")
        pattern = ARGV[Optind++]

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
    if (Optind >= ARGC) {
        ARGV[1] = "-"
        ARGC = 2
    } else if (ARGC - Optind > 1)
        do_filenames++

#    if (IGNORECASE)
#        pattern = tolower(pattern)
}

The next set of lines should be uncommented if you are not using gawk. This rule translates all the characters in the input line into lower-case if the `-i' option was specified. The rule is commented out since it is not necessary with gawk.

#{
#    if (IGNORECASE)
#        $0 = tolower($0)
#}

The beginfile function is called by the rule in `ftrans.awk' when each new file is processed. In this case, it is very simple; all it does is initialize a variable fcount to zero. fcount tracks how many lines in the current file matched the pattern.

function beginfile(junk)
{
    fcount = 0
}

The endfile function is called after each file has been processed. It is used only when the user wants a count of the number of lines that matched. no_print will be true only if the exit status is desired. count_only will be true if line counts are desired. egrep will therefore only print line counts if printing and counting are enabled. The output format must be adjusted depending upon the number of files to be processed. Finally, fcount is added to total, so that we know how many lines altogether matched the pattern.

function endfile(file)
{
    if (! no_print && count_only)
        if (do_filenames)
            print file ":" fcount
        else
            print fcount

    total += fcount
}

This rule does most of the work of matching lines. The variable matches will be true if the line matched the pattern. If the user wants lines that did not match, the sense of the matches is inverted using the `!' operator. fcount is incremented with the value of matches, which will be either one or zero, depending upon a successful or unsuccessful match. If the line did not match, the next statement just moves on to the next record.

There are several optimizations for performance in the following few lines of code. If the user only wants exit status (no_print is true), and we don't have to count lines, then it is enough to know that one line in this file matched, and we can skip on to the next file with nextfile. Along similar lines, if we are only printing file names, and we don't need to count lines, we can print the file name, and then skip to the next file with nextfile.

Finally, each line is printed, with a leading filename and colon if necessary.

{
    matches = ($0 ~ pattern)
    if (invert)
        matches = ! matches

    fcount += matches    # 1 or 0

    if (! matches)
        next

    if (no_print && ! count_only)
        nextfile

    if (filenames_only && ! count_only) {
        print FILENAME
        nextfile
    }

    if (do_filenames && ! count_only)
        print FILENAME ":" $0
    else if (! count_only)
        print
}

The END rule takes care of producing the correct exit status. If there were no matches, the exit status is one, otherwise it is zero.

END    \
{
    if (total == 0)
        exit 1
    exit 0
}

The usage function prints a usage message in case of invalid options and then exits.

function usage(    e)
{
    e = "Usage: egrep [-csvil] [-e pat] [files ...]"
    print e > "/dev/stderr"
    exit 1
}

The variable e is used so that the function fits nicely on the printed page.

Just a note on programming style. You may have noticed that the END rule uses backslash continuation, with the open brace on a line by itself. This is so that it more closely resembles the way functions are written. Many of the examples in this chapter use this style. You can decide for yourself if you like writing your BEGIN and END rules this way, or not.

Go to the first, previous, next, last section, table of contents.