The uniq utility reads sorted lines of data on its standard input,
and (by default) removes duplicate lines. In other words, only unique lines
are printed, hence the name. uniq has a number of options. The usage is:
uniq [-udc [-n]] [+n] [ input file [ output file ]]
The option meanings are:
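-d
    Only print repeated lines.
-u
    Only print non-repeated lines.
-c
    Count lines. This option overrides `-d' and `-u'.
-n
    Skip n fields before comparing lines. The definition of fields
    follows awk's default: non-whitespace characters separated
    by runs of spaces and/or tabs.
+n
    Skip n characters before comparing lines. Any fields specified
    with `-n' are skipped first.
input file
    Data is read from the named input file instead of from the
    standard input.
output file
    Output is written to the named output file instead of to the
    standard output.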
Normally uniq behaves as if both the `-d' and `-u' options
had been provided.
Here is an awk implementation of uniq. It uses the getopt library function
(see section Processing Command Line Options),
and the join library function
(see section Merging an Array Into a String).
The program begins with a usage function and then a brief outline of
the options and their meanings in a comment.
The BEGIN rule deals with the command line arguments and options. It
uses a trick to get getopt to handle options of the form `-25',
treating such an option as the option letter `2' with an argument of
`5'. If indeed two or more digits were supplied (Optarg looks
like a number), Optarg is concatenated with the option digit, and the
result is added to zero to make it into a number. If there is only one
digit in the option, then Optarg is not needed, and Optind must be
decremented so that getopt will process that argument the next time it
is called. This code is admittedly a bit tricky.
If no options were supplied, then the default is taken, to print both
repeated and non-repeated lines. The output file, if provided, is assigned
to outputfile. Earlier, outputfile was initialized to the
standard output, `/dev/stdout'.

    # uniq.awk --- do uniq in awk
    # Arnold Robbins, [email protected], Public Domain
    # May 1993

    function usage(    e)
    {
        e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
        print e > "/dev/stderr"
        exit 1
    }

    # -c    count lines. overrides -d and -u
    # -d    only repeated lines
    # -u    only non-repeated lines
    # -n    skip n fields
    # +n    skip n characters, skip fields first

    BEGIN   \
    {
        count = 1
        outputfile = "/dev/stdout"
        opts = "udc0:1:2:3:4:5:6:7:8:9:"
        while ((c = getopt(ARGC, ARGV, opts)) != -1) {
            if (c == "u")
                non_repeated_only++
            else if (c == "d")
                repeated_only++
            else if (c == "c")
                do_count++
            else if (index("0123456789", c) != 0) {
                # getopt requires args to options
                # this messes us up for things like -5
                if (Optarg ~ /^[0-9]+$/)
                    fcount = (c Optarg) + 0
                else {
                    fcount = c + 0
                    Optind--
                }
            } else
                usage()
        }

        if (ARGV[Optind] ~ /^\+[0-9]+$/) {
            charcount = substr(ARGV[Optind], 2) + 0
            Optind++
        }

        for (i = 1; i < Optind; i++)
            ARGV[i] = ""

        if (repeated_only == 0 && non_repeated_only == 0)
            repeated_only = non_repeated_only = 1

        if (ARGC - Optind == 2) {
            outputfile = ARGV[ARGC - 1]
            ARGV[ARGC - 1] = ""
        }
    }

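The digit-option trick can be tried in isolation. The following is only a
sketch: field_count and the global Reprocess are hypothetical names that are
not part of uniq.awk, and the two arguments stand in for the option letter
and the Optarg value that getopt would supply.

```awk
# Sketch of the digit handling in the BEGIN rule, without getopt.
# field_count() and Reprocess are hypothetical, for illustration only.
# c      stands in for the option letter returned by getopt
# optarg stands in for the value getopt would leave in Optarg
function field_count(c, optarg)
{
    if (optarg ~ /^[0-9]+$/) {
        Reprocess = 0
        return (c optarg) + 0    # "-25": c is "2", optarg is "5"
    }
    Reprocess = 1                # uniq.awk decrements Optind here
    return c + 0                 # "-5 file": c is "5", optarg is "file"
}

BEGIN {
    n = field_count("2", "5");    print n, Reprocess
    n = field_count("5", "file"); print n, Reprocess
}
```

The first call models `-25' and yields a field count of 25. The second
models `-5' followed by a file name: the count is 5, and Reprocess is set to
show where the real program would give the argument back to getopt.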
The following function, are_equal, compares the current line, $0, to the
previous line, last. It handles skipping fields and characters.
If no field count and no character count were specified, are_equal
simply returns one or zero depending upon the result of a simple string
comparison of last and $0. Otherwise, things get more complicated.
If fields have to be skipped, each line is broken into an array using
split (see section Built-in Functions for String Manipulation),
and then the desired fields are joined back into a line using join.
The joined lines are stored in clast and cline.
If no fields are skipped, clast and cline are set to
last and $0 respectively.
Finally, if characters are skipped, substr is used to strip off the
leading charcount characters in clast and cline. The
two strings are then compared, and are_equal returns the result.

    function are_equal(    n, m, clast, cline, alast, aline)
    {
        if (fcount == 0 && charcount == 0)
            return (last == $0)

        if (fcount > 0) {
            n = split(last, alast)
            m = split($0, aline)
            clast = join(alast, fcount+1, n)
            cline = join(aline, fcount+1, m)
        } else {
            clast = last
            cline = $0
        }
        if (charcount) {
            clast = substr(clast, charcount + 1)
            cline = substr(cline, charcount + 1)
        }

        return (clast == cline)
    }

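To see what are_equal actually compares, the skipping steps can be applied
to a single line. This is a sketch under stated assumptions:
comparison_key is a hypothetical helper, and join_fields is a simplified
stand-in for the join library function, assumed here to join the selected
elements with single spaces.

```awk
# join_fields() is a simplified stand-in for the join library function;
# it is assumed to join elements start..end with single spaces.
function join_fields(a, start, end,    i, s)
{
    s = ""
    for (i = start; i <= end; i++)
        s = s (i > start ? " " : "") a[i]
    return s
}

# comparison_key() is hypothetical, not part of uniq.awk. It returns
# the part of a line that is left after skipping fcount fields and
# then charcount characters, in the same order as are_equal.
function comparison_key(line, fcount, charcount,    n, parts, key)
{
    key = line
    if (fcount > 0) {
        n = split(line, parts)
        key = join_fields(parts, fcount + 1, n)
    }
    if (charcount > 0)
        key = substr(key, charcount + 1)
    return key
}

BEGIN {
    print comparison_key("10:05 alice login", 1, 0)
    print comparison_key("10:07 alice login", 1, 0)
    print comparison_key("id7 xxabc", 1, 2)
}
```

The first two lines differ only in their first field, so with one field
skipped they produce the same key and would be treated as duplicates. Note
that, under the single-space assumption above, splitting and rejoining also
collapses runs of whitespace between the remaining fields.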
The following two rules are the body of the program. The first one is
executed only for the very first line of data. It sets last equal to
$0, so that subsequent lines of text have something to be compared to.
The second rule does the work. The variable equal will be one or zero
depending upon the results of are_equal's comparison. If uniq
is counting lines (the `-c' option), then the count variable is
incremented if the lines are equal. Otherwise the saved line is printed
along with its count, last is set to the current line, and count is
reset, since the two lines are not equal.
If uniq is not counting, count is incremented if the lines are
equal. Otherwise, if uniq is printing repeated lines and more than
one copy of the saved line has been seen, or if uniq is printing
non-repeated lines and only one copy has been seen, then the saved line
is printed, and count is reset.
Finally, similar logic is used in the END rule to print the final
line of input data.

    NR == 1 {
        last = $0
        next
    }

    {
        equal = are_equal()

        if (do_count) {    # overrides -d and -u
            if (equal)
                count++
            else {
                printf("%4d %s\n", count, last) > outputfile
                last = $0
                count = 1    # reset
            }
            next
        }

        if (equal)
            count++
        else {
            if ((repeated_only && count > 1) ||
                (non_repeated_only && count == 1))
                    print last > outputfile
            last = $0
            count = 1
        }
    }

    END {
        if (do_count)
            printf("%4d %s\n", count, last) > outputfile
        else if ((repeated_only && count > 1) ||
                (non_repeated_only && count == 1))
            print last > outputfile
    }

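The test that decides whether a saved line is printed appears twice in the
program, and it may be easier to follow on its own. In this sketch,
should_print is a hypothetical helper that is not part of uniq.awk. Its
arguments mirror the count, repeated_only, and non_repeated_only variables.

```awk
# should_print() is hypothetical; it isolates the condition used in
# the second rule and in the END rule when -c is not in effect.
function should_print(count, repeated_only, non_repeated_only)
{
    return (repeated_only && count > 1) || (non_repeated_only && count == 1)
}

BEGIN {
    # Each row shows a line seen twice, then a line seen once.
    print "-d:     ", should_print(2, 1, 0), should_print(1, 1, 0)
    print "-u:     ", should_print(2, 0, 1), should_print(1, 0, 1)
    print "default:", should_print(2, 1, 1), should_print(1, 1, 1)
}
```

With `-d' only the line seen twice qualifies, with `-u' only the line seen
once, and with neither option both flags are set, so every distinct line is
printed once.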