These commands are available on POSIX-compliant systems, as well as on traditional Unix-based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.
Often, these systems use gawk for their awk implementation!
All such differences appear in the index under the heading "differences between gawk and awk."
GNU stands for "GNU's not Unix."
The terminology "GNU/Linux" is explained in the Glossary.
Although we generally recommend the use of single quotes around the program text, double quotes are needed here in order to put the single quote into the message.
The `#!' mechanism works on Linux systems, systems derived from the 4.4-Lite Berkeley Software Distribution, and most commercial Unix systems.
The line beginning with `#!' lists the full file name of an interpreter to run and an optional initial command-line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument in the list is the full file name of the awk program. The rest of the argument list is either options to awk, or data files, or both.
In the C shell (csh), you need to type a semicolon and then a backslash at the end of the first line; see awk Statements Versus Lines, for an explanation as to why. In a POSIX-compliant shell, such as the Bourne shell or bash, you can type the example as shown. If the command `echo $path' produces an empty output line, you are most likely using a POSIX-compliant shell. Otherwise, you are probably using the C shell or a shell derived from it.
On some very old systems, you may need to use `ls -lg' to get this output.
The `?' and `:' referred to here are the operators of the three-operand conditional expression described in Conditional Expressions. Splitting lines after `?' and `:' is a minor gawk extension; if `--posix' is specified (see section Command-Line Options), then this extension is disabled.
In other literature, you may see a character list referred to as either a character set, a character class or a bracket expression.
Use two backslashes if you're using a string constant with a regexp operator or function.
Experienced C and C++ programmers will note that this is possible, using something like `IGNORECASE = 1 && /foObAr/ { ... }' and `IGNORECASE = 0 || /foobar/ { ... }'. However, this is somewhat obscure and we don't recommend it.
At least that we know about.
In POSIX awk, newlines are not considered whitespace for separating fields.
The sed utility is a "stream editor." Its behavior is also defined by the POSIX standard.
Older versions of gawk would only interpret these names internally if the system did not actually have a `/dev/fd' directory or any of the other above listed special files. Usually this didn't make a difference, but sometimes it did; thus, it was decided to make gawk's behavior consistent on all systems and to have it always interpret the special file names itself.
The technical terminology is rather morbid. The finished child is called a "zombie," and cleaning up after it is referred to as "reaping."
The internal representation of all numbers, including integers, uses double-precision floating-point numbers. On most modern systems, these are in IEEE 754 standard format.
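As a rough illustration of what double-precision storage implies for integers, the sketch below checks whether adding 1 to 2^53 changes the value. It assumes an `awk` on the search path whose numbers are IEEE 754 doubles; the function name `double_limit_demo` is made up for this example.

```shell
# Hypothetical helper: prints 1 if 2^53 + 1 compares equal to 2^53.
# That is the expected result when numbers are IEEE 754 doubles,
# because 2^53 + 1 cannot be represented exactly and rounds to 2^53.
double_limit_demo() {
    awk 'BEGIN { print (2^53 + 1 == 2^53) }'
}

double_limit_demo
```

On such a system the call prints 1: integers above 2^53 can no longer all be told apart.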
Pathological cases can require up to 752 digits (!), but we doubt that you need to worry about this.
The POSIX standard is under revision. The revised standard's rules for typing and comparison are the same as just described for gawk.
The original version of awk used to keep reading and ignoring input until end of file was seen.
In POSIX awk, newline does not count as whitespace.
Some early implementations of Unix awk initialized FILENAME to "-", even if there were data files to be processed. This behavior was incorrect and should not be relied upon in your programs.
Thanks to Michael Brennan for pointing this out.
The C version of rand is known to produce fairly poor sequences of random numbers. However, nothing requires that an awk implementation use the C rand to implement the awk version of rand. In fact, gawk uses the BSD random function, which is considerably better than rand, to produce random numbers.
Computer-generated random numbers are not truly random. They are technically known as "pseudo-random." This means that while the numbers in a sequence appear to be random, you can in fact generate the same sequence of random numbers over and over again.
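The repeatability of a pseudo-random sequence can be sketched by seeding the generator twice with the same value. This assumes an `awk` on the search path; the seed 42 and the function name `same_seed_demo` are arbitrary choices for this example.

```shell
# Hypothetical helper: seed the generator twice with the same value
# and report whether the first number drawn is the same both times.
same_seed_demo() {
    awk 'BEGIN {
        srand(42); first = rand()
        srand(42); second = rand()
        result = (first == second) ? "same" : "different"
        print result
    }'
}

same_seed_demo
```

The call prints "same", since an identical seed restarts an identical sequence.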
Unless you use the `--non-decimal-data' option, which isn't recommended. See section Allowing Non-Decimal Input Data, for more information.
This is different from C and C++, where the first character is number zero.
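A minimal sketch of one-based character positions, assuming an `awk` on the search path; the function name `first_char_demo` and the sample string are made up for this example.

```shell
# Hypothetical helper: show that character positions start at 1.
first_char_demo() {
    awk 'BEGIN {
        s = "hello"
        print substr(s, 1, 1)   # the first character of s
        print index(s, "h")     # the position of "h" within s
    }'
}

first_char_demo
```

The call prints `h` and then `1`; a C programmer would have expected position 0 for the first character.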
This consequence was certainly unintended.
As this Web page was being finalized, we learned that the POSIX standard will not use these rules. However, it was too late to change gawk for the 3.1 release. gawk behaves as described here.
A program is interactive if the standard output is connected to a terminal device.
See section Glossary, especially the entries for "Epoch" and "UTC."
The GNU date utility can also do many of the things described here. Its use may be preferable for simple time-related operations in shell scripts.
Occasionally there are minutes in a year with a leap second, which is why the seconds can go up to 60.
As this is a recent standard, not every system's strftime necessarily supports all of the conversions listed here.
If you don't understand any of this, don't worry about it; these facilities are meant to make it easier to "internationalize" programs. Other internationalization features are described in Internationalization with gawk.
This is because ISO C leaves the behavior of the C version of strftime undefined and gawk uses the system's version of strftime if it's there. Typically, the conversion specifier either does not appear in the returned string or it appears literally.
This example shows that 0's come in on the left side. For gawk, this is always true, but in some languages, it's possible to have the left side fill with 1's. Caveat emptor.
For some operating systems, the gawk port doesn't support GNU gettext. This applies most notably to the PC operating systems. As such, these features are not available if you are using one of those operating systems. Sorry.
Americans use a comma every three decimal places and a period for the decimal point, while many Europeans do exactly the opposite: 1,234.56 vs. 1.234,56.
Eventually, the xgettext utility that comes with GNU gettext will be taught to automatically run `gawk --gen-po' for `.awk' files, freeing the translator from having to do it manually.
This example is borrowed from the GNU gettext manual.
This is good fodder for an "Obfuscated awk" contest.
Perhaps it would be better if it were called "Hippy." Ah, well.
This is very different from the same operator in the C shell, csh.
Not recommended.
Your version of gawk may use a different directory; it depends upon how gawk was built and installed. The actual directory is the value of `$(datadir)' generated when gawk was configured. You probably don't need to worry about this, though.
The effects are not identical. Output of the transformed record will be in all lowercase, while IGNORECASE preserves the original contents of the input record.
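The difference can be sketched with two small pipelines, assuming an `awk` on the search path. The first rewrites the record with tolower before matching; the second is a portable stand-in for a case-insensitive match that leaves the record alone (it does not use IGNORECASE itself, which is specific to gawk). The function names and the sample input are made up for this example.

```shell
# Hypothetical helper: lowercase the record, then match and print it.
# The printed record is the transformed (lowercase) one.
lowercase_demo() {
    printf 'FooBar\n' | awk '{ $0 = tolower($0) } /foobar/ { print }'
}

# Hypothetical helper: match against a lowercase copy only.
# The record itself is untouched, so the original text is printed.
preserve_demo() {
    printf 'FooBar\n' | awk 'tolower($0) ~ /foobar/ { print }'
}

lowercase_demo
preserve_demo
```

The first call prints `foobar`, the second prints `FooBar`.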
While all the library routines could have been rewritten to use this convention, this was not done, in order to show how my own awk programming style has evolved, and to provide some basis for this discussion.
gawk's `--dump-variables' command-line option is useful for verifying this.
http://mathworld.wolfram.com/CliffRandomNumberGenerator.html
ASCII has been extended in many countries to use the values from 128 to 255 for country-specific characters. If your system uses these extensions, you can simplify _ord_init to simply loop from 0 to 255.
It would be nice if awk had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.
This function was written before gawk acquired the ability to split strings into single characters using "" as the separator. We have left it alone, since using substr is more portable.
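The portable approach mentioned here can be sketched as a loop over substr. This is not the function the footnote refers to, only a minimal illustration; it assumes an `awk` on the search path, and the name `chars_demo` and the sample string are made up for this example.

```shell
# Hypothetical helper: split a string into single characters with
# substr, which works in any awk, and print each position and character.
chars_demo() {
    awk 'BEGIN {
        s = "awk"
        n = length(s)
        for (i = 1; i <= n; i++)
            chars[i] = substr(s, i, 1)
        for (i = 1; i <= n; i++)
            print i, chars[i]
    }'
}

chars_demo
```

The call prints `1 a`, `2 w`, and `3 k` on separate lines.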
It is often the case that password information is stored in a network database.
It also introduces a subtle bug; if a match happens, we output the translated line, not the original.
wc can't just use the value of FNR in endfile. If you examine the code in Noting Data File Boundaries, you will see that FNR has already been reset by the time endfile is called.
On some older System V systems, tr may require that the lists be written as range expressions enclosed in square brackets (`[a-z]') and quoted, to prevent the shell from attempting a file name expansion. This is not a feature.
This program was written before gawk acquired the ability to split each character in a string into separate array elements.
"Real world" is defined as "a program actually used to get something done."
On some very old versions of awk, the test `getline junk < t' can loop forever if the file exists but is empty. Caveat emptor.
http://cm.bell-labs.com/who/bwk
This version is edited slightly for presentation. The complete version can be found in `extension/filefuncs.c' in the gawk distribution.
Compiled programs are typically written in lower-level languages such as C, C++, Fortran, or Ada, and then translated, or compiled, into a form that the computer can execute directly.
http://www.validgh.com/goldberg/paper.ps