The GNU Awk User's Guide


Printing Non-duplicated Lines of Text

The uniq utility reads sorted lines of data on its standard input, and (by default) removes duplicate lines. In other words, only unique lines are printed, hence the name. uniq has a number of options. The usage is:

uniq [-udc [-n]] [+n] [ input file [ output file ]]

The option meanings are:

-d: Only print repeated lines.
-u: Only print non-repeated lines.
-c: Count lines. This option overrides `-d' and `-u'. Both repeated and non-repeated lines are counted.
-n: Skip n fields before comparing lines. The definition of fields is similar to awk's default: non-whitespace characters separated by runs of spaces and/or tabs.
+n: Skip n characters before comparing lines. Any fields specified with `-n' are skipped first.
input file: Data is read from the input file named on the command line, instead of from the standard input.
output file: The generated output is sent to the named output file, instead of to the standard output.

Normally uniq behaves as if both the `-d' and `-u' options had been provided.
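As a rough illustration of the default behavior only (it is not the program presented in this section), the following sketch prints a line whenever it differs from the line before it. The shell function name uniq_default is an assumption made for this example, and none of the options are handled.

```shell
# uniq_default: hypothetical wrapper, not part of the guide's program.
# Prints a line only when it differs from the immediately preceding line.
uniq_default() {
    awk 'NR == 1 || $0 != last { print }
         { last = $0 }'
}

printf 'apple\napple\nbanana\napple\n' | uniq_default
```

With this input the sketch prints apple, banana, and then apple again: only adjacent duplicates are collapsed, which is why the input is expected to be sorted.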

Here is an awk implementation of uniq. It uses the getopt library function (see section Processing Command Line Options), and the join library function (see section Merging an Array Into a String).

The program begins with a usage function and then a brief outline of the options and their meanings in a comment.

The BEGIN rule deals with the command line arguments and options. It uses a trick to get getopt to handle options of the form `-25', treating such an option as the option letter `2' with an argument of `5'. If two or more digits were indeed supplied (Optarg looks like a number), Optarg is concatenated with the option digit, and the result is added to zero to make it into a number. If there is only one digit in the option, then Optarg is not needed, and Optind must be decremented so that getopt will process that argument the next time it is called. This code is admittedly a bit tricky.
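The digit handling can be tried in isolation. The sketch below is only an approximation: field_count is a hypothetical shell wrapper, and its two arguments stand in for the option digit returned by getopt and the value getopt left in Optarg. It does not call getopt and does not model the adjustment of Optind.

```shell
# field_count: hypothetical helper that isolates the digit handling.
#   $1 - the option "letter" (a single digit), as getopt would return it
#   $2 - the text getopt would have placed in Optarg
field_count() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/)
            print (c optarg) + 0    # -25: "2" and "5" are joined, giving 25
        else
            print c + 0             # -5: the digit alone is the field count
    }'
}

field_count 2 5          # prints 25
field_count 5 data.txt   # prints 5
```

In the second call, the non-numeric Optarg is really the next command line argument, which is why the full program has to decrement Optind.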

If no options were supplied, then the default is taken, to print both repeated and non-repeated lines. The output file, if provided, is assigned to outputfile. Earlier, outputfile was initialized to the standard output, `/dev/stdout'.

# uniq.awk --- do uniq in awk
# Arnold Robbins, [email protected], Public Domain
# May 1993

function usage(    e)
{
    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
    print e > "/dev/stderr"
    exit 1
}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only non-repeated lines
# -n    skip n fields
# +n    skip n characters, skip fields first

BEGIN    \
{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udc0:1:2:3:4:5:6:7:8:9:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) {
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (index("0123456789", c) != 0) {
            # getopt requires args to options
            # this messes us up for things like -5
            if (Optarg ~ /^[0-9]+$/)
                fcount = (c Optarg) + 0
            else {
                fcount = c + 0
                Optind--
            }
        } else
            usage()
    }

    if (ARGV[Optind] ~ /^\+[0-9]+$/) {
        charcount = substr(ARGV[Optind], 2) + 0
        Optind++
    }

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) {
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    }
}
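The test for a `+n' argument in the BEGIN rule is not described in the text, so here is a small sketch of just that step. The shell function char_count is a hypothetical name used for this example; it takes one would-be command line argument and prints the character count the program would derive from it, or zero if the argument does not have the `+n' form.

```shell
# char_count: hypothetical helper mirroring the "+n" test in the BEGIN rule.
char_count() {
    awk -v arg="$1" 'BEGIN {
        if (arg ~ /^\+[0-9]+$/)
            print substr(arg, 2) + 0   # drop the "+" and convert to a number
        else
            print 0                    # not of the form +n
    }'
}

char_count +3         # prints 3
char_count data.txt   # prints 0
```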

The following function, are_equal, compares the current line, $0, to the previous line, last. It handles skipping fields and characters.

If no field count and no character count were specified, are_equal simply returns one or zero depending upon the result of a simple string comparison of last and $0. Otherwise, things get more complicated.

If fields have to be skipped, each line is broken into an array using split (see section Built-in Functions for String Manipulation), and then the desired fields are joined back into a line using join. The joined lines are stored in clast and cline. If no fields are skipped, clast and cline are set to last and $0 respectively.

Finally, if characters are skipped, substr is used to strip off the leading charcount characters in clast and cline. The two strings are then compared, and are_equal returns the result.

function are_equal(    n, m, clast, cline, alast, aline)
{
    if (fcount == 0 && charcount == 0)
        return (last == $0)

    if (fcount > 0) {
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    } else {
        clast = last
        cline = $0
    }
    if (charcount) {
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    }

    return (clast == cline)
}
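Since are_equal relies on the join library function and on global variables, it cannot be run by itself. The following sketch approximates the same comparison in a self-contained form. The names same_after_skip and rest_of are assumptions made for this example; rest_of is a simplified stand-in for the split and join steps, not the library function itself.

```shell
# same_after_skip: hypothetical wrapper.
# Usage: same_after_skip LINE1 LINE2 FCOUNT CHARCOUNT
same_after_skip() {
    awk -v a="$1" -v b="$2" -v fcount="$3" -v charcount="$4" '
    # rest_of: returns the fields of line after the first nskip fields,
    # rejoined with single spaces (a stand-in for split plus join).
    function rest_of(line, nskip,    parts, n, i, out) {
        n = split(line, parts)
        out = ""
        for (i = nskip + 1; i <= n; i++)
            out = out (i > nskip + 1 ? " " : "") parts[i]
        return out
    }
    BEGIN {
        if (fcount > 0) {              # skip leading fields first
            a = rest_of(a, fcount)
            b = rest_of(b, fcount)
        }
        if (charcount > 0) {           # then skip leading characters
            a = substr(a, charcount + 1)
            b = substr(b, charcount + 1)
        }
        if ((a "") == (b ""))          # "" forces a string comparison
            print "equal"
        else
            print "different"
    }'
}

same_after_skip "1 apple pie" "2 apple pie" 1 0    # prints equal
same_after_skip "1 apple pie" "2 apple tart" 1 0   # prints different
same_after_skip "10 Xred" "20 Yred" 1 1            # prints equal
```

The last call shows the order of operations: one field is skipped, and then one character of what remains.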

The following two rules are the body of the program. The first one is executed only for the very first line of data. It sets last equal to $0, so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable equal will be one or zero depending upon the results of are_equal's comparison. If uniq is counting lines (`-c'), then the count variable is incremented if the lines are equal. Otherwise the previous line is printed along with its count, and count is reset, since the two lines are not equal.

If uniq is not counting, count is still incremented if the lines are equal. Otherwise, if uniq is printing repeated lines and more than one copy has been seen, or if uniq is printing non-repeated lines and only one copy has been seen, then the previous line is printed, and count is reset.

Finally, similar logic is used in the END rule to print the final line of input data.

NR == 1 {
    last = $0
    next
}
    
{
    equal = are_equal()

    if (do_count) {    # overrides -d and -u
        if (equal)
            count++
        else {
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        }
        next
    }

    if (equal)
        count++
    else {
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    }
}

END {
    if (NR == 0)
        exit    # no input was read, so there is no final line to print
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
}
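The option handling above depends on the getopt library function, so the complete program cannot be run from this listing alone. As a self-contained approximation, the sketch below condenses the main rules and the END rule into one script. The names uniq_sketch, emit, and mode are assumptions of this example; mode takes the value c, d, or u and stands in for the do_count, repeated_only, and non_repeated_only flags. Fields and characters are not skipped.

```shell
# uniq_sketch: hypothetical wrapper. Usage: uniq_sketch MODE  (c, d, or u)
uniq_sketch() {
    awk -v mode="$1" '
    # emit: print the previous line according to the selected mode.
    function emit() {
        if (mode == "c")
            printf("%4d %s\n", count, last)
        else if ((mode == "d" && count > 1) ||
                 (mode == "u" && count == 1))
            print last
    }
    NR == 1    { last = $0; count = 1; next }   # first line: just remember it
    $0 == last { count++; next }                # same as previous: count it
               { emit(); last = $0; count = 1 } # new line: flush and reset
    END        { if (NR > 0) emit() }           # flush the final line
    '
}

printf 'a\na\nb\nc\nc\nc\n' | uniq_sketch c
```

For this input, mode c prints the counts 2, 1, and 3 next to a, b, and c. Mode d prints only a and c, and mode u prints only b.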
