The GNU Awk User's Guide - Extract Program

Go to the first, previous, next, last section, table of contents.

Extracting Programs from Texinfo Source Files

Both this chapter and the previous chapter (section A Library of awk Functions), present a large number of awk programs. If you wish to experiment with these programs, it is tedious to have to type them in by hand. Here we present a program that can extract parts of a Texinfo input file into separate files.

This book is written in Texinfo, the GNU project's document formatting language. A single Texinfo source file can be used to produce both printed and on-line documentation. Texinfo is fully documented in Texinfo--The GNU Documentation Format, available from the Free Software Foundation.

For our purposes, it is enough to know three things about Texinfo input files.

The "at" symbol, `@', is special in Texinfo, much like `\' in C or awk. Literal `@' symbols are represented in Texinfo source files as `@@'.
Comments start with either `@c' or `@comment'. The file extraction program will work by using special comments that start at the beginning of a line.
Example text that should not be split across a page boundary is bracketed between lines containing `@group' and `@end group' commands.

The following program, `extract.awk', reads through a Texinfo source file, and does two things, based on the special comments. Upon seeing `@c system ...', it runs a command, by extracting the command text from the control line and passing it on to the system function (see section Built-in Functions for Input/Output). Upon seeing `@c file filename', each subsequent line is sent to the file filename, until `@c endfile' is encountered. The rules in `extract.awk' will match either `@c' or `@comment' by letting the `omment' part be optional. Lines containing `@group' and `@end group' are simply removed. `extract.awk' uses the join library function (see section Merging an Array Into a String).

The example programs in the on-line Texinfo source for Effective AWK Programming (`gawk.texi') have all been bracketed inside `file', and `endfile' lines. The gawk distribution uses a copy of `extract.awk' to extract the sample programs and install many of them in a standard directory, where gawk can find them.

`extract.awk' begins by setting IGNORECASE to one, so that mixed upper-case and lower-case letters in the directives won't matter.

The first rule handles calling system, checking that a command was given (NF is at least three), and also checking that the command exited with a zero exit status, signifying OK.

# extract.awk --- extract files and run programs
#                 from texinfo files
# Arnold Robbins, [email protected], Public Domain
# May 1993

BEGIN    { IGNORECASE = 1 }

/^@c(omment)?[ \t]+system/    \
{
    if (NF < 3) {
        e = (FILENAME ":" FNR)
        e = (e  ": badly formed `system' line")
        print e > "/dev/stderr"
        next
    }
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) {
        e = (FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    }
}

The variable e is used so that the function fits nicely on the page.

The second rule handles moving data into files. It verifies that a file name was given in the directive. If the file named is not the current file, then the current file is closed. This means that an `@c endfile' was not given for that file. (We should probably print a diagnostic in this case, although at the moment we do not.)

The `for' loop does the work. It reads lines using getline (see section Explicit Input with getline). For an unexpected end of file, it calls the unexpected_eof function. If the line is an "endfile" line, then it breaks out of the loop. If the line is an `@group' or `@end group' line, then it ignores it, and goes on to the next line.

Most of the work is in the following few lines. If the line has no `@' symbols, it can be printed directly. Otherwise, each leading `@' must be stripped off.

To remove the `@' symbols, the line is split into separate elements of the array a, using the split function (see section Built-in Functions for String Manipulation). Each element of a that is empty indicates two successive `@' symbols in the original line. For each two empty elements (`@@' in the original file), we have to add back in a single `@' symbol.

When the processing of the array is finished, join is called with the value of SUBSEP, to rejoin the pieces back into a single line. That line is then printed to the output file.

/^@c(omment)?[ \t]+file/    \
{
    if (NF != 3) {
        e = (FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next
    }
    if ($3 != curfile) {
        if (curfile != "")
            close(curfile)
        curfile = $3
    }

    for (;;) {
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@(end[ \t]+)?group/)
            continue
        if (index(line, "@") == 0) {
            print line > curfile
            continue
        }
        n = split(line, a, "@")
        # if a[1] == "", means leading @,
        # don't add one back in.
        for (i = 2; i <= n; i++) {
            if (a[i] == "") { # was an @@
                a[i] = "@"
                if (a[i+1] == "")
                    i++
            }
        }
        print join(a, 1, n, SUBSEP) > curfile
    }
}

An important thing to note is the use of the `>' redirection. Output done with `>' only opens the file once; it stays open and subsequent output is appended to the file (see section Redirecting Output of print and printf). This allows us to easily mix program text and explanatory prose for the same sample source file (as has been done here!) without any hassle. The file is only closed when a new data file name is encountered, or at the end of the input file.

Finally, the function unexpected_eof prints an appropriate error message and then exits.

The END rule handles the final cleanup, closing the open file.

function unexpected_eof()
{
    printf("%s:%d: unexpected EOF or error\n", \
        FILENAME, FNR) > "/dev/stderr"
    exit 1
}

END {
    if (curfile)
        close(curfile)
}

Go to the first, previous, next, last section, table of contents.