The uniq utility reads sorted lines of data on its standard
input, and by default removes duplicate lines. In other words, it only
prints unique lines--hence the name. uniq has a number of
options. The usage is as follows:
uniq [-udc [-n]] [+n] [ input file [ output file ]]
The option meanings are:
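-d
    Print only repeated lines.
-u
    Print only non-repeated lines.
-c
    Count lines. This option overrides `-d' and `-u'.
-n
    Skip n fields before comparing lines. The definition of fields
    is similar to awk's default: non-whitespace characters separated
    by runs of spaces and/or tabs.
+n
    Skip n characters before comparing lines. Any fields specified
    with `-n' are skipped first.
input file
    Data is read from the named input file instead of from the
    standard input.
output file
    The output is sent to the named output file instead of to the
    standard output.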
Normally uniq behaves as if both the `-d' and
`-u' options are provided.
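The effect of the options is easiest to see on a small input. The following is an illustrative sketch that assumes a standard uniq utility is available on the search path; the helper function sample and its data are made up for this example.

```shell
# Sample input: already sorted, so equal lines are adjacent.
sample() { printf 'apple\napple\nbanana\ncherry\ncherry\ncherry\n'; }

sample | uniq       # one copy of every line
sample | uniq -d    # only the lines that were repeated
sample | uniq -u    # only the lines that were not repeated
sample | uniq -c    # every distinct line, prefixed by its count
```

With this input, `-d' selects apple and cherry, `-u' selects banana, and the default selects all three, which is the union of the two.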
uniq uses the getopt library function
(see section Processing Command-Line Options)
and the join library function
(see section Merging an Array into a String).
The program begins with a usage function and then a brief outline of
the options and their meanings in a comment.
The BEGIN rule deals with the command-line arguments and options. It
uses a trick to get getopt to handle options of the form `-25',
treating such an option as the option letter `2' with an argument of
`5'. If two or more digits are indeed supplied (Optarg looks
like a number), Optarg is concatenated with the option digit and the
result is added to zero to make it into a number. If there is only one
digit in the option, getopt has taken the following command-line
argument as Optarg. That argument does not belong to the option, so
Optind is decremented to make getopt process it the next time around.
This code is admittedly a bit tricky.
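The digit handling can be tried in isolation. What follows is a minimal sketch, written as a shell function wrapping a small awk program. The function name digit_option and the variables c and optarg (standing in for the value returned by getopt and for Optarg) are assumptions made for illustration and are not part of uniq.awk.

```shell
# digit_option C OPTARG
# C is the digit returned by getopt; OPTARG is what getopt took as the argument.
digit_option() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/)
            n = (c optarg) + 0    # "2" and "5" are concatenated, giving 25
        else
            n = c + 0             # lone digit; optarg belongs to something else
        print n
    }'
}

digit_option 2 5          # as for the option -25
digit_option 5 data.txt   # as for -5 followed by a file name
```

In the second call the real program would also decrement Optind, so that data.txt is looked at again instead of being swallowed as an option argument.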
If no options are supplied, then the default is taken, to print both
repeated and non-repeated lines. The output file, if provided, is assigned
to outputfile. Early on, outputfile is initialized to the
standard output, `/dev/stdout':
# uniq.awk --- do uniq in awk
#
# Requires getopt and join library functions

function usage(    e)
{
    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
    print e > "/dev/stderr"
    exit 1
}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only non-repeated lines
# -n    skip n fields
# +n    skip n characters, skip fields first

BEGIN \
{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udc0:1:2:3:4:5:6:7:8:9:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) {
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (index("0123456789", c) != 0) {
            # getopt requires args to options
            # this messes us up for things like -5
            if (Optarg ~ /^[0-9]+$/)
                fcount = (c Optarg) + 0
            else {
                fcount = c + 0
                Optind--
            }
        } else
            usage()
    }

    if (ARGV[Optind] ~ /^\+[0-9]+$/) {
        charcount = substr(ARGV[Optind], 2) + 0
        Optind++
    }

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) {
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    }
}
The following function, are_equal, compares the current line,
$0, to the
previous line, last. It handles skipping fields and characters.
If no field count and no character count are specified, are_equal
simply returns one or zero depending upon the result of a simple string
comparison of last and $0. Otherwise, things get more
complicated.
If fields have to be skipped, each line is broken into an array using
split
(see section String Manipulation Functions);
the desired fields are then joined back into a line using join.
The joined lines are stored in clast and cline.
If no fields are skipped, clast and cline are set to
last and $0, respectively.
Finally, if characters are skipped, substr is used to strip off the
leading charcount characters in clast and cline. The
two strings are then compared and are_equal returns the result:
function are_equal(    n, m, clast, cline, alast, aline)
{
    if (fcount == 0 && charcount == 0)
        return (last == $0)

    if (fcount > 0) {
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    } else {
        clast = last
        cline = $0
    }
    if (charcount) {
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    }

    return (clast == cline)
}
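The comparison can be exercised on two literal lines without running the whole program. The sketch below is one way to do that, under stated assumptions: the shell function same_key and the awk function key are hypothetical names, and the join shown here is a simplified stand-in for the join library function that only joins with single spaces.

```shell
# same_key FCOUNT CHARCOUNT LINE1 LINE2
# Prints "equal" or "different" after skipping FCOUNT fields,
# then CHARCOUNT characters, in both lines.
same_key() {
    awk -v fcount="$1" -v charcount="$2" -v a="$3" -v b="$4" '
    # Simplified stand-in for the join library function.
    function join(arr, start, end,    i, s) {
        s = arr[start]
        for (i = start + 1; i <= end; i++)
            s = s " " arr[i]
        return s
    }
    # Reduce a line to the part that takes part in the comparison.
    function key(line,    n, parts) {
        if (fcount > 0) {
            n = split(line, parts)
            line = join(parts, fcount + 1, n)
        }
        if (charcount > 0)
            line = substr(line, charcount + 1)
        return line ""    # appending "" makes the result a string
    }
    BEGIN {
        result = (key(a) == key(b)) ? "equal" : "different"
        print result
    }'
}

same_key 1 0 "10:02 disk full" "10:07 disk full"   # first field ignored
same_key 0 2 "xxabc" "yyabd"                       # first two characters ignored
```

The first call reports the lines as equal, because only "disk full" is compared. The second reports them as different, because "abc" and "abd" remain after the skip.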
The following two rules are the body of the program. The first one is
executed only for the very first line of data. It sets last equal to
$0, so that subsequent lines of text have something to be compared to.
The second rule does the work. The variable equal is one or zero,
depending upon the result of are_equal's comparison. If uniq
is counting lines (`-c') and the lines are equal, then it increments
the count variable. Otherwise, it prints the count and the saved line,
saves the current line in last, and resets count, since the two lines
are not equal.
If uniq is not counting, and if the lines are equal, count is incremented.
Nothing is printed, since the point is to remove duplicates.
Otherwise, the saved line is printed if uniq is printing repeated lines
and the line was seen more than once, or if uniq is printing
non-repeated lines and the line was seen only once. Whether or not
anything is printed, last is set to the current line and count
is reset.
Finally, similar logic is used in the END rule to print the final
line of input data:
NR == 1 {
    last = $0
    next
}

{
    equal = are_equal()

    if (do_count) {    # overrides -d and -u
        if (equal)
            count++
        else {
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        }
        next
    }

    if (equal)
        count++
    else {
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
            print last > outputfile
        last = $0
        count = 1
    }
}

END {
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
             (non_repeated_only && count == 1))
        print last > outputfile
}
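The pattern shared by these rules is to save a line, count its repeats, and flush when a different line arrives and again at the end of the input. It can be seen more easily with the option handling removed. The following is a minimal sketch of the counting case only, not a replacement for the program. The function name count_runs is hypothetical, and two details are choices of this sketch: the lines are compared as strings by appending "", and the END rule prints nothing when there was no input.

```shell
# count_runs: read lines on standard input and print each run of equal
# adjacent lines once, prefixed by the length of the run.
count_runs() {
    awk '
    NR == 1 { last = $0; count = 1; next }
    ($0 "") == (last "") { count++; next }    # same run: just count it
    {                                         # run ended: flush it
        printf("%4d %s\n", count, last)
        last = $0
        count = 1
    }
    END { if (NR > 0) printf("%4d %s\n", count, last) }'
}

printf 'a\na\nb\nc\nc\nc\n' | count_runs
```

Without the END rule, the last run (here, the three c lines) would be counted but never printed, since no later line arrives to trigger the flush.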