The GNU Awk User's Guide


How Input is Split into Records

The awk utility divides the input for your awk program into records and fields. Records are separated by a character called the record separator. By default, the record separator is the newline character. This is why records are, by default, single lines. You can use a different character for the record separator by assigning the character to the built-in variable RS.

You can change the value of RS in the awk program, like any other variable, with the assignment operator, `=' (see section Assignment Expressions). The new record-separator character should be enclosed in quotation marks, which indicate a string constant. Often the right time to do this is at the beginning of execution, before any input has been processed, so that the very first record will be read with the proper separator. To do this, use the special BEGIN pattern (see section The BEGIN and END Special Patterns). For example:

awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list

changes the value of RS to "/", before reading any input. This is a string whose first character is a slash; as a result, records are separated by slashes. Then the input file is read, and the second rule in the awk program (the action with no pattern) prints each record. Since each print statement adds a newline at the end of its output, the effect of this awk program is to copy the input with each slash changed to a newline. Here are the results of running the program on `BBS-list':

$ awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list
-| aardvark     555-5553     1200
-| 300          B
-| alpo-net     555-3412     2400
-| 1200
-| 300     A
-| barfly       555-7685     1200
-| 300          A
-| bites        555-1675     2400
-| 1200
-| 300     A
-| camelot      555-0542     300               C
-| core         555-2912     1200
-| 300          C
-| fooey        555-1234     2400
-| 1200
-| 300     B
-| foot         555-6699     1200
-| 300          B
-| macfoo       555-6480     1200
-| 300          A
-| sdace        555-3430     2400
-| 1200
-| 300     A
-| sabafoo      555-2127     1200
-| 300          C
-|

Note that the entry for the `camelot' BBS is not split. In the original data file (see section Data Files for the Examples), the line looks like this:

camelot      555-0542     300               C

It only has one baud rate; there are no slashes in the record.
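The same splitting can be tried without the `BBS-list' file. The following is a minimal sketch: the two input lines are invented for illustration (in the style of `BBS-list'), and each record is numbered with NR so that record boundaries are easy to see.

```shell
# Two sample lines, made up for this sketch; only the first contains a slash.
out=$(printf 'aardvark 1200/300 B\ncamelot 300 C\n' |
      awk 'BEGIN { RS = "/" } ; { print NR ": " $0 }')

# Command substitution drops the trailing newlines, so print with one added back.
printf '%s\n' "$out"
```

Record 2 occupies two lines of output: once RS is a slash, the newline between `B' and `camelot' is ordinary data inside the record.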

Another way to change the record separator is on the command line, using the variable-assignment feature (see section Other Command Line Arguments):

awk '{ print $0 }' RS="/" BBS-list

This sets RS to `/' before processing `BBS-list'.
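As a rough check that the two ways of setting RS behave alike, the sketch below runs both forms on the same throwaway file and compares the results. The file contents and the shell variable names (tmp, in_begin, on_cmdline) are assumptions made for this illustration.

```shell
# Build a small temporary input file (contents invented for illustration).
tmp=$(mktemp)
printf 'alpha/beta/gamma\n' > "$tmp"

# RS assigned in a BEGIN rule ...
in_begin=$(awk 'BEGIN { RS = "/" } ; { print NR ": " $0 }' "$tmp")
# ... and RS assigned on the command line, ahead of the file name.
on_cmdline=$(awk '{ print NR ": " $0 }' RS="/" "$tmp")

printf '%s\n' "$in_begin"
[ "$in_begin" = "$on_cmdline" ] && echo "both forms give the same records"
rm -f "$tmp"
```

The assignment is placed before the file name because command-line assignments are processed in order with the file arguments.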

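Using an unusual character such as `/' for the record separator produces correct behavior in the vast majority of cases. However, the following (extreme) pipeline prints a surprising `1'. The input contains no `a', so the entire input, a single newline, is read as one record, and that record has one field, consisting of the newline. (The value of the built-in variable NF is the number of fields in the current record.) Other versions of awk, and later releases of gawk unless run in POSIX mode, may instead treat the newline as whitespace and print `0'.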

$ echo | awk 'BEGIN { RS = "a" } ; { print NF }'
-| 1

Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in RS (d.c.).
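A small sketch of this rule, using made-up input that ends without any separator at all:

```shell
# The input ends with "two": no trailing slash and no trailing newline.
out=$(printf 'one/two' | awk 'BEGIN { RS = "/" } ; { print NR ": " $0 }')
printf '%s\n' "$out"
```

This prints `1: one' and `2: two'; the end of the input closes the second record even though no slash follows it.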

The empty string, "" (a string of no characters), has a special meaning as the value of RS: it means that records are separated by one or more blank lines, and nothing else. See section Multiple-Line Records, for more details.
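As a brief illustration of the empty-string case, the sketch below reads two groups of lines separated by two blank lines. The sample data is invented for this example.

```shell
# Lines "a" and "b" form one group; "c" forms another; two blank lines between.
out=$(printf 'a\nb\n\n\nc\n' |
      awk 'BEGIN { RS = "" } ; { print "record " NR " has " NF " field(s)" }')
printf '%s\n' "$out"
```

Two records are reported, even though two blank lines separate the groups. The first record has two fields because, with the default field separator, the newline inside the record counts as whitespace between `a' and `b'.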

If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed (and records already processed) are not affected.
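The following sketch shows a mid-run change, under the assumption that awk reads each record only when it is needed (as gawk does). The first record is read with the default newline separator; the rule for record 1 then switches RS to a slash for the rest of the input. The input text is invented for illustration.

```shell
# Record 1 is the whole first line; after it, records are split at slashes.
out=$(printf 'a b\nc/d/e\n' |
      awk 'NR == 1 { RS = "/" } ; { print NR ": " $0 }')
printf '%s\n' "$out"
```

The first line arrives intact as record 1, since it was already read when RS changed. The second line is then split into three records at the slashes.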

After the end of the record has been determined, gawk sets the variable RT to the text in the input that matched RS.

The value of RS is in fact not limited to a one-character string. It can be any regular expression (see section Regular Expressions). In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string. This general rule is actually at work in the usual case, where RS contains just a newline: a record ends at the beginning of the next matching string (the next newline in the input) and the following record starts just after the end of this string (at the first character of the following line). The newline, since it matches RS, is not part of either record.

When RS is a single character, RT will contain the same single character. However, when RS is a regular expression, then RT becomes more useful; it contains the actual input text that matched the regular expression.

The following example illustrates both of these features. It sets RS equal to a regular expression that matches either a newline, or a series of one or more upper-case letters with optional leading and/or trailing white space (see section Regular Expressions).

$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>             { print "Record =", $0, "and RT =", RT }'
-| Record = record 1 and RT =  AAAA 
-| Record = record 2 and RT =  BBBB 
-| Record = record 3 and RT = 
-|

The output ends with an extra blank line. This is because the value of RT for the last record is a newline, and then the print statement supplies its own terminating newline.
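A second, smaller sketch of RT, assuming gawk is installed under the name gawk. Here RS matches any run of digits, and each record is printed in brackets next to the text that terminated it. The input string is made up for this example and has no trailing newline.

```shell
# RT and a regexp-valued RS are gawk extensions, so skip quietly without gawk.
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'x1y22z' |
          gawk 'BEGIN { RS = "[0-9]+" } ; { printf "[%s|%s]", $0, RT }')
    printf '%s\n' "$out"
fi
```

The last record is ended by the end of the input rather than by a match for RS, so RT is empty for it.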

See section A Simple Stream Editor, for a more useful example of RS as a regexp and RT.

The use of RS as a regular expression and the RT variable are gawk extensions; they are not available in compatibility mode (see section Command Line Options). In compatibility mode, only the first character of the value of RS is used to determine the end of the record.

The awk utility keeps track of the number of records that have been read so far from the current input file. This value is stored in a built-in variable called FNR. It is reset to zero when a new file is started. Another built-in variable, NR, is the total number of input records read so far from all data files. It starts at zero but is never automatically reset to zero.
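The difference between FNR and NR shows up only when there is more than one input file. The sketch below creates two temporary two-line files (names and contents invented for illustration) and prints both counters for every record.

```shell
# Two throwaway input files in a temporary directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first"
printf 'c\nd\n' > "$dir/second"

# Print FNR, then NR, then the record itself.
out=$(awk '{ print FNR, NR, $0 }' "$dir/first" "$dir/second")
printf '%s\n' "$out"
rm -rf "$dir"
```

FNR goes back to 1 on the first record of the second file, while NR continues to 3 and 4.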
