@FIXME{need an intro here}
Making tar Archives More Portable
Creating a tar
archive on a particular system that is meant to be
useful later on many other machines and with other versions of tar
is more challenging than you might think. tar
archive formats
have been evolving since the first versions of Unix. Many such formats
are around, and are not always compatible with each other. This section
discusses a few problems, and gives some advice about making tar
archives more portable.
One golden rule is simplicity. For example, limit your tar
archives to contain only regular files and directories, avoiding
other kinds of special files. Do not attempt to save sparse files or
contiguous files as such. Let's discuss a few more problems, in turn.
Use straight file and directory names, made up of printable ASCII characters, avoiding colons, slashes, backslashes, spaces, and other dangerous characters. Avoid deep directory nesting. Accounting for oldish System V machines, limit your file and directory names to 14 characters or less.
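Before rolling an archive for distribution, you might scan the directory for names that would cause trouble. This is only a rough sketch using standard find and awk; the directory name `dist-1.0' is made up for the example:
$ find dist-1.0 | awk -F/ 'length($NF) > 14'
$ find dist-1.0 -name '* *' -o -name '*:*' -o -name '*\\*'
The first command prints paths whose final component is longer than 14 characters; the second prints names containing spaces, colons, or backslashes.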
If you intend your tar
archives to be read under MSDOS,
you should not rely on case distinction for file names, and you might
use the GNU doschk
program to help you further diagnose
illegal MSDOS names, which are even more limited than System V's.
Normally, when tar
archives a symbolic link, it writes a
block to the archive naming the target of the link. In that way, the
tar
archive is a faithful record of the filesystem contents.
The --dereference (-h) option is used with --create (-c), and causes tar
to archive the files symbolic links point to, instead of the links
themselves. With this option, when tar
encounters a
symbolic link, it will archive the linked-to file, instead of simply
recording the presence of a symbolic link.
The name under which the file is stored in the file system is not
recorded in the archive. To record both the symbolic link name and
the file name in the system, archive the file under both names. If
all links were recorded automatically by tar
, an extracted file
might be linked to a file name that no longer exists in the file
system.
If a linked-to file is encountered again by tar
while creating
the same archive, an entire second copy of it will be stored. (This
might be considered a bug.)
So, for portable archives, do not archive symbolic links as such, and use --dereference (-h): many systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.
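For example (a small sketch; `docs' stands for any directory containing symbolic links):
$ tar -cf with-links.tar docs
$ tar -chf dereferenced.tar docs
The first archive records the symbolic links themselves; the second, because of --dereference (-h), records the files the links point to instead.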
Certain old versions of tar
cannot handle additional
information recorded by newer tar
programs. To create an
archive in V7 format (not ANSI), which can be read by these old
versions, specify the --old-archive (-o) option in
conjunction with the --create (-c). tar
also
accepts `--portability' for this option. When you specify it,
tar
leaves out information about directories, pipes, fifos,
contiguous files, and device files, and specifies file ownership by
group and user IDs instead of group and user names.
When updating an archive, do not use --old-archive (-o) unless the archive was created using this option.
In most cases, a new format archive can be read by an old
tar
program without serious trouble, so this option should
seldom be needed. On the other hand, most modern tar
s are
able to read old format archives, so it might be safer for you to
always use --old-archive (-o) for your distributions.
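A minimal example of such a distribution archive, using the made-up directory name `dist-1.0':
$ tar --create --old-archive --file=dist-1.0.tar dist-1.0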
GNU tar and POSIX tar
GNU tar
was based on an early draft of the POSIX 1003.1
ustar
standard. GNU extensions to tar
, such as the
support for file names longer than 100 characters, use portions of the
tar
header record which were specified in that POSIX draft as
unused. Subsequent changes in POSIX have allocated the same parts of
the header record for other purposes. As a result, GNU tar
is
incompatible with the current POSIX spec, and with tar
programs
that follow it.
We plan to reimplement these GNU extensions in a new way which is
upward compatible with the latest POSIX tar
format, but we
don't know when this will be done.
In the meantime, there is simply no telling what might happen if you
read a GNU tar
archive, which uses the GNU extensions, using
some other tar
program. So if you want to read the archive
with another tar
program, be sure to write it using the
`--old-archive' option (`-o').
@FIXME{is there a way to tell which flavor of tar was used to write a particular archive before you try to read it?}
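There is no foolproof way, but here is a rough sketch, not taken from any standard: the magic field at offset 257 of the first header block (described later in this chapter) gives a useful hint, and can be examined with ordinary dd and od:
$ dd if=archive.tar bs=1 skip=257 count=8 2>/dev/null | od -c
A POSIX archive shows `ustar' followed by a null and the version digits `00', an old GNU format archive shows `ustar' followed by two spaces and a null, and a V7 archive leaves these bytes zeroed.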
Traditionally, old tars have a limit of 100 characters on file names. GNU
tar
attempted two different approaches to overcome this limit,
using and extending a format specified by a draft of P1003.1.
The first way was not that successful, and involved `@MaNgLeD@'
file names, or such; the second approach used `././@LongLink'
and other tricks, yielding better success. In theory, GNU tar
should be able to handle file names of practically unlimited length.
So, if GNU tar
fails to dump and retrieve files having more
than 100 characters, then there is a bug in GNU tar
, indeed.
But, when staying strictly POSIX, the limit was still 100 characters.
For various other purposes, GNU tar
used areas left unassigned
in the POSIX draft. POSIX later revised P1003.1 ustar
format by
assigning previously unused header fields, in such a way that the upper
limit for file name length was raised to 256 characters. However, the
actual POSIX limit oscillates between 100 and 256, depending on the
precise location of slashes in the full file name (this is rather ugly).
Since GNU tar
uses the same fields for quite other purposes,
it became incompatible with the latest POSIX standards.
For longer or non-fitting file names, we plan to use yet another set
of GNU extensions, but this time, complying with the provisions POSIX
offers for extending the format, rather than conflicting with it.
Whenever an archive uses the old GNU tar
extension format or POSIX
extensions, whether for very long file names or other specialities,
this archive becomes non-portable to other tar
implementations.
In fact, anything can happen. The most forgiving tars will
merely unpack the file using a wrong name, and maybe create another
file named something like `@LongName', with the true file name
in it. tars not protecting themselves may crash with a segmentation
violation!
Compatibility concerns make all this more difficult, as we
will have to support all these things together, for a while.
GNU tar
should be able to produce and read true POSIX format
files, while being able to detect old GNU tar
formats, besides
old V7 format, and process them conveniently. It would take years
before this whole area stabilizes...
There are plans to raise this 100 limit to 256, and yet produce POSIX
conformant archives. Past 256, I do not know yet if GNU tar
will go non-POSIX again, or merely refuse to archive the file.
There are plans for GNU tar
to support the latest POSIX format more
fully, while remaining able to read old V7 format, GNU format (semi-POSIX plus
extensions), as well as full POSIX. One may ask if there is part of
the POSIX format that we still cannot support. This simple question
has a complex answer. Maybe, on closer look, some strong
limitations will pop up, but until now, nothing sounds too difficult
(but see below). I only have these few pages of POSIX telling about
`Extended tar Format' (P1003.1-1990 -- section 10.1.1), and there are
references to other parts of the standard I do not have, which should
normally enforce limitations on stored file names (I suspect things
like fixing what / and NUL mean). There are also
some points which the standard does not make clear; existing practice
will then drive what I should do.
POSIX mandates that, when a file name cannot fit within 100 to
256 characters (the variance comes from the fact that a / is
ideally needed as the 156th character), or a link name cannot
fit within 100 characters, a warning should be issued and the file
not be stored. Unless some --posix option is given
(or POSIXLY_CORRECT
is set), I suspect that GNU tar
should disobey this specification, and automatically switch to using
GNU extensions to overcome file name or link name length limitations.
There is a problem, however, which I have not studied intimately yet.
Given a truly POSIX archive with names having more than 100 characters,
I guess that GNU tar
up to 1.11.8 will process it as if it were an
old V7 archive, and be fooled by some fields which are coded differently.
So, the question is to decide if the next generation of GNU tar
should produce POSIX format by default, whenever possible, producing
archives older versions of GNU tar
might not be able to read
correctly. I fear that we will have to suffer such a choice one of these
days, if we want GNU tar
to go closer to POSIX. We can rush it.
Another possibility is to produce the current GNU tar
format
by default for a few years, but have GNU tar
versions from some
1.POSIX and up able to recognize all three formats, and let older
GNU tar
fade out slowly. Then, we could switch to producing POSIX
format by default, with not much harm to those still having (very old at
that time) GNU tar
versions prior to 1.POSIX.
POSIX format cannot represent very long names, volume headers,
splitting of files in multi-volumes, sparse files, and incremental
dumps; these would all be disallowed if --posix is given or
POSIXLY_CORRECT
is set. Otherwise, if tar
is given long
names, or `-[VMSgG]', then it should automatically go non-POSIX.
I think this is easily granted without much discussion.
Another point is that only mtime
is stored in POSIX
archives, while GNU tar
currently also stores atime
and ctime
. If we want GNU tar
to go closer to POSIX,
my choice would be to drop atime
and ctime
support on
average. On the other hand, I perceive that full dumps or incremental
dumps need atime
and ctime
support, so for those special
applications, POSIX has to be avoided altogether.
A few users requested that --sparse (-S) be always active by
default. I think that before replying to them, we have to decide
if we want GNU tar
to go closer to POSIX on average, while
producing files. My choice would be to go closer to POSIX in the
long run. Besides possible double reading, I do not see any point
of not trying to save files as sparse when creating archives which
are neither POSIX nor old-V7, so the actual --sparse (-S) would
become selected by default when producing such archives, whatever
the reason is. So, --sparse (-S) alone might be redefined to force
GNU-format archives, and recover its previous meaning from this fact.
GNU-format as it exists now can easily fool other POSIX tars,
as it uses fields which POSIX considers to be part of the file name
prefix. I wonder if it would not be a good idea, in the long run,
to try changing GNU-format so any added field (like ctime
,
atime
, file offset in subsequent volumes, or sparse file
descriptions) would be wholly and always pushed into an extension block,
instead of using space in the POSIX header block. I could manage
to do that portably between future GNU tars. So other POSIX
tars might at least be able to provide a kind of correct listing
for the archives produced by GNU tar
, if not able to process
them otherwise.
Using these projected extensions might induce older tars to fail.
We would use the same approach as for POSIX. I'll put out a tar
capable of reading POSIXier, yet extended archives, but will not produce
this format by default, in GNU mode. In a few years, when newer GNU
tars will have flooded out tar 1.11.X and previous, we
could switch to producing POSIXier extended archives, with no real harm
to users, as almost all existing GNU tars will be ready to read
POSIXier format. In fact, I'll do both changes at the same time, in a
few years, and just prepare tar
for both changes, without effecting
them, from 1.POSIX. (Both changes: 1--using POSIX conventions for
getting over 100 characters; 2--avoiding mangling POSIX headers for GNU
extensions, using only POSIX mandated extension techniques.)
So, a future tar
will have a --posix
flag forcing the usage of truly POSIX headers, and so, producing
archives previous GNU tar
will not be able to read.
So, once a pretest announces that feature, it would be
particularly useful for users to test how exchangeable archives will be
between GNU tar
with --posix and other POSIX tars.
In a few years, when GNU tar
will produce POSIX headers by
default, --posix will have a strong meaning and will disallow
GNU extensions. But in the meantime, for a long while, --posix
in GNU tar will not disallow GNU extensions like --label=archive-label (-V archive-label),
--multi-volume (-M), --sparse (-S), or very long file or link names.
However, --posix with GNU extensions will use POSIX
headers with reserved-for-users extensions to headers, and I will be
curious to know how well or badly POSIX tars will react to these.
GNU tar
prior to 1.POSIX, and after 1.POSIX without
--posix, generates and checks `ustar ', with two
suffixed spaces. This is sufficient for older GNU tar
not to
recognize POSIX archives, and consequently, wrongly decide those archives
are in old V7 format. It is a useful bug for me, because GNU tar
has other POSIX incompatibilities, and I need to segregate GNU tar
semi-POSIX archives from truly POSIX archives, for GNU tar
should
be somewhat compatible with itself, while migrating closer to latest
POSIX standards. So, I'll be very careful about how and when I will do
the correction.
SunOS and HP-UX tar
fail to accept archives created using GNU
tar
and containing non-ASCII file names, that is, file names
having characters with the eighth bit set, because they use signed
checksums, while GNU tar
uses unsigned checksums when creating
archives, as the POSIX standards require. On reading, GNU tar
computes
both checksums and accepts either. It is somewhat worrying that a lot of
people may go around doing backups of their files using faulty (or at
least non-standard) software, not learning about it until it's time
to restore their missing files with an incompatible file extractor,
or vice versa.
GNU tar
computes checksums both ways, and accepts either on read,
so GNU tar can read Sun tapes even with their wrong checksums.
GNU tar
produces the standard checksum, however, raising
incompatibilities with Sun. That is to say, GNU tar
has not
been modified to produce incorrect archives to be read by buggy
tars. I've been told that more recent Sun tar
now
reads standard archives, so maybe Sun did a similar patch, after all?
The story seems to be that when Sun first imported tar
sources on their system, they recompiled it without realizing that
the checksums were computed differently, because of a change in
the default signing of char
's in their compiler. So they
started computing checksums wrongly. When they later realized their
mistake, they merely decided to stay compatible with it, and with
themselves afterwards. Presumably, but I do not really know, HP-UX
chose to make their tar
archives compatible with Sun's.
The current standards do not favor Sun tar
format. In any
case, it now falls on the shoulders of SunOS and HP-UX users to get
a tar
able to read the good archives they receive.
Filter the archive through gzip.
@FIXME{ach; these two bits orig from "compare" (?). where to put?} Some format parameters must be taken into consideration when modifying an archive: @FIXME{???}. Compressed archives cannot be modified.
You can use `--gzip' and `--gunzip' on physical devices
(tape drives, etc.) and remote files as well as on normal files; data
to or from such devices or remote files is reblocked by another copy
of the tar
program to enforce the specified (or default) record
size. The default compression parameters are used; if you need to
override them, avoid the --gzip (--gunzip, --ungzip, -z) option and run gzip
explicitly. (Or set the `GZIP' environment variable.)
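For instance, to get maximum compression, either of the following sketches would do; the name `largedir' is invented for the example:
$ tar -cf - largedir | gzip --best > largedir.tar.gz
$ GZIP=--best tar -czf largedir.tar.gz largedir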
The --gzip (--gunzip, --ungzip, -z) option does not work with the --multi-volume (-M) option, or with the --update (-u), --append (-r), --concatenate (--catenate, -A), or --delete operations.
It is not quite accurate to say that GNU tar
works in concert
with gzip
in a way similar to zip
, say. Surely, it is
possible to run tar
and gzip
with a single call,
as in:
$ tar cfz archive.tar.gz subdir
to save all of `subdir' into a gzip'ed archive. Later you
can do:
$ tar xfz archive.tar.gz
to explode and unpack.
The difference is that the whole archive is compressed. With
zip
, archive members are archived individually. tar's
method yields better compression. On the other hand, one can view the
contents of a zip
archive without having to decompress it. As
for the tar
and gzip
tandem, you need to decompress the
archive to see its contents. However, this may be done without needing
disk space, by using pipes internally:
$ tar tfz archive.tar.gz
About corrupted compressed archives: gzip'ed files have no
redundancy, for maximum compression. The adaptive nature of the
compression scheme means that the compression tables are implicitly
spread all over the archive. If you lose a few blocks, the dynamic
construction of the compression tables becomes unsynchronized, and there
is little chance that you could recover later in the archive.
There are pending suggestions for having a per-volume or per-file
compression in GNU tar
. This would allow for viewing the
contents without decompression, and for resynchronizing decompression at
every volume or file, in case of corrupted archives. Doing so, we might
lose some compressibility. But this would make recovering easier.
So, there are pros and cons. We'll see!
Filter the archive through compress. Otherwise like --gzip (--gunzip, --ungzip, -z).
--compress (--uncompress, -Z) stores an archive in compressed format. This
option is useful in saving time over networks and space in pipes, and
when storage space is at a premium. --compress (--uncompress, -Z) causes
tar
to compress when writing the archive, or to uncompress when
reading the archive.
To perform compression and uncompression on the archive, tar
runs the compress
utility. tar
uses the default
compression parameters; if you need to override them, avoid the
--compress (--uncompress, -Z) option and run the compress
utility
explicitly. It is useful to be able to call the compress
utility from within tar
because the compress
utility by
itself cannot access remote tape drives.
The --compress (--uncompress, -Z) option will not work in conjunction with the
--multi-volume (-M) option or the --append (-r), --update (-u),
and --delete operations. See section The Five Advanced tar
Operations, for
more information on these operations.
If there is no compress utility available, tar
will report an error.
Please note that the compress
program may be covered by
a patent, and therefore we recommend you stop using it.
tar
will compress (when writing
an archive), or uncompress (when reading an archive). Used in
conjunction with the --create (-c), --extract (--get, -x), --list (-t) and
--compare (--diff, -d) operations.
You can have archives compressed by using the --gzip (--gunzip, --ungzip, -z) option.
This will arrange for tar
to use the gzip
program
to compress or uncompress the archive when writing or reading it.
To use the older, obsolete, compress
program, use the
--compress (--uncompress, -Z) option. The GNU Project recommends you not use
compress
, because there is a patent covering the algorithm it
uses. You could be sued for patent infringement merely by running
compress
.
I have one question, or maybe it's a suggestion if there isn't a way
to do it now. I would like to use --gzip (--gunzip, --ungzip, -z), but I'd also like the
output to be fed through a program like GNU ecc
(actually, right
now that's `exactly' what I'd like to use :-)), basically adding
ECC protection on top of compression. It seems as if this should be
quite easy to do, but I can't work out exactly how to go about it.
Of course, I can pipe the standard output of tar
through
ecc
, but then I lose (though I haven't started using it yet,
I confess) the ability to have tar
use rmt
for its I/O
(I think).
I think the most straightforward thing would be to let me specify a general set of filters outboard of compression (preferably ordered, so the order can be automatically reversed on input operations, and with the options they require specifiable), but beggars shouldn't be choosers and anything you decide on would be fine with me.
By the way, I like ecc
but if (as the comments say) it can't
deal with loss of block sync, I'm tempted to throw some time at adding
that capability. Supposing I were to actually do such a thing and
get it (apparently) working, do you accept contributed changes to
utilities like that? (Leigh Clayton `[email protected]', May 1995).
Isn't that exactly the role of the --use-compress-prog=program option? I never tried it myself, but I suspect you may want to write a prog script or program able to filter stdin to stdout the way you want. It should recognize the `-d' option, for when extraction is needed rather than creation.
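As a hedged illustration only, such a filter could be a tiny shell wrapper around gzip; the script and its name are assumptions for the example, not a documented interface, and the ecc step from the question would simply be spliced into the pipeline:
#!/bin/sh
# Compress on the way out; decompress when tar passes the -d option.
case "$1" in
-d) exec gzip -d ;;
*) exec gzip ;;
esac
You would then name the script with something like `--use-compress-prog=./my-filter' on the tar command line.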
It has been reported that if one writes compressed data (through the --gzip (--gunzip, --ungzip, -z) or --compress (--uncompress, -Z) options) to a DLT and tries to use the DLT compression mode, the data will actually get bigger and one will end up with less space on the tape.
This option causes all files to be put in the archive to be tested for
sparseness, and handled specially if they are. The --sparse (-S)
option is useful when many dbm
files, for example, are being
backed up. Using this option dramatically decreases the amount of
space needed to store such a file.
In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically with any special GNU options. For now, it is an option needing to be specified on the command line with the creation or updating of an archive.
Files in the filesystem occasionally have "holes." A hole in a file
is a section of the file's contents which was never written. The
contents of a hole read as all zeros. On many operating systems,
actual disk storage is not allocated for holes, but they are counted
in the length of the file. If you archive such a file, tar
could create an archive longer than the original. To have tar
attempt to recognize the holes in a file, use --sparse (-S). When
you use the --sparse (-S) option, then, for any file using less
disk space than would be expected from its length, tar
searches
the file for consecutive stretches of zeros. It then records in the
archive for the file where the consecutive stretches of zeros are, and
only archives the "real contents" of the file. On extraction (using
--sparse (-S) is not needed on extraction) any such files have
holes created wherever the continuous stretches of zeros were found.
Thus, if you use --sparse (-S), tar
archives won't take
more space than the original.
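Here is a small experiment to see the effect, assuming your file system actually leaves a hole for the seeked-over area; all names are made up:
$ dd if=/dev/zero of=holey bs=1 count=1 seek=9999999
$ tar -cf plain.tar holey
$ tar -cSf sparse.tar holey
$ ls -l plain.tar sparse.tar
The first archive should be a little over ten megabytes, while the second should be only a few kilobytes.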
A file is sparse if it contains blocks of zeros whose existence is
recorded, but that have no space allocated on disk. When you specify
the --sparse (-S) option in conjunction with the --create (-c)
operation, tar
tests all files for sparseness while archiving.
If tar
finds a file to be sparse, it uses a sparse representation of
the file in the archive. See section How to Create Archives, for more information
about creating archives.
--sparse (-S) is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.
Please Note: Always use --sparse (-S) when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.
Even if your system has no sparse files currently, some may be created in the future. If you use --sparse (-S) while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). @FIXME-xref{incremental when node name is set.}
tar
ignores the --sparse (-S) option when reading an archive.
However, users should be well aware that at archive creation time, GNU
tar
still has to read the whole disk file to locate the holes, and
so, even if sparse files use little space on disk and in the archive, they
may sometimes require an inordinate amount of time for reading and examining
all-zero blocks of a file. Although it works, it's painfully slow for a
large (sparse) file, even though the resulting tar archive may be small.
(One user reports that dumping a `core' file of over 400 megabytes,
but with only about 3 megabytes of actual data, took about 9 minutes on
a Sun SPARCstation ELC, with full CPU utilisation.)
This reading is required in all cases, whether or not the --sparse (-S) option is used, so by merely not using the option you are not saving time(6).
Programs like dump
do not have to read the entire file; by examining
the file system directly, they can determine in advance exactly where the
holes are and thus avoid reading through them. The only data they need to read
are the actual allocated data blocks. GNU tar
uses a more portable
and straightforward archiving approach; it would be fairly difficult for
it to do otherwise. Elizabeth Zwicky writes to `comp.unix.internals',
on 1990-12-10:
What I did say is that you cannot tell the difference between a hole and an equivalent number of nulls without reading raw blocks.
st_blocks
at best tells you how many holes there are; it doesn't tell you where. Just as programs may, conceivably, care what st_blocks
is (care to name one that does?), they may also care where the holes are (I have no examples of this one either, but it's equally imaginable). I conclude from this that good archivers are not portable. One can arguably conclude that if you want a portable program, you can in good conscience restore files with as many holes as possible, since you can't get it right.
@UNREVISED
When tar
reads files, this causes them to have the access times
updated. To have tar
attempt to set the access times back to
what they were before they were read, use the --atime-preserve
option. This doesn't work for files that you don't own, unless
you're root, and it doesn't interact with incremental dumps nicely
(see section Performing Backups and Restoring Files), but it is good enough for some purposes.
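For example (the directory name is arbitrary):
$ tar --create --atime-preserve --file=backup.tar project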
Handling of file attributes
The --touch (-m) option causes tar
to leave the modification times
of the files it extracts as the times when the files were extracted,
instead of setting them to the times recorded in the archive.
This option is meaningless with --list (-t).
tar
is executed on those systems able to give files away. This is
considered a security flaw by many people, at least because it
makes it quite difficult to correctly account users for the disk space
they occupy. Also, the suid
or sgid
attributes of
files are easily and silently lost when files are given away.
When writing an archive, tar
writes the user id and user name
separately. If it can't find a user name (because the user id is not
in `/etc/passwd'), then it does not write one. When restoring,
and doing a chmod
like when you use --same-permissions (--preserve-permissions, -p)
(@FIXME{same-owner?}), it tries to look the name (if one was written)
up in `/etc/passwd'. If it fails, then it uses the user id
stored in the archive instead.
tar
archives.
The identifying names are added at create time when provided by the
system, unless --old-archive (-o) is used. Numeric ids could be
used when moving archives between a collection of machines using
a centralized management for attribution of numeric ids to users
and groups. This is often made through using the NIS capabilities.
When making a tar
file for distribution to other sites, it
is sometimes cleaner to use a single owner for all files in the
distribution, and nicer to specify the write permission bits of the
files as stored in the archive independently of their actual value on
the file system. The way to prepare a clean distribution is usually
to have some Makefile rule creating a directory, copying all needed
files in that directory, then setting ownership and permissions as
wanted (there are a lot of possible schemes), and only then making a
tar
archive out of this directory, before cleaning everything
out. Of course, we could add a lot of options to GNU tar
for
fine tuning permissions and ownership. This is not the good way,
I think. GNU tar
is already crowded with options and moreover,
the approach just explained gives you a great deal of control already.
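As a sketch of that scheme, with every name invented for the example:
$ mkdir foo-1.0
$ cp README Makefile foo.c foo.h foo-1.0
$ chmod -R go-w foo-1.0
$ tar --create --old-archive --file=foo-1.0.tar foo-1.0
$ rm -rf foo-1.0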
The --same-permissions (--preserve-permissions, -p) option causes tar
to set the modes (access permissions) of
extracted files exactly as recorded in the archive. If this option
is not used, the current umask
setting limits the permissions
on extracted files.
This option is meaningless with --list (-t).
@UNREVISED
While an archive may contain many files, the archive itself is a
single ordinary file. Like any other file, an archive file can be
written to a storage device such as a tape or disk, sent through a
pipe or over a network, saved on the active file system, or even
stored in another archive. An archive file is not easy to read or
manipulate without using the tar
utility or Tar mode in GNU
Emacs.
Physically, an archive consists of a series of file entries terminated
by an end-of-archive entry, which consists of 512 zero bytes. A file
entry usually describes one of the files in the archive (an
archive member), and consists of a file header and the contents
of the file. File headers contain file names and statistics, checksum
information which tar
uses to detect file corruption, and
information about file types.
Archives are permitted to have more than one member with the same member name. One way this situation can occur is if more than one version of a file has been stored in the archive. For information about adding new versions of a file to an archive, see section Updating an Archive, and to learn more about having more than one archive member with the same name, see @FIXME-xref{-backup node, when it's written}.
In addition to entries describing archive members, an archive may
contain entries which tar
itself uses to store information.
See section Including a Label in the Archive, for an example of such an archive entry.
A tar
archive file contains a series of blocks. Each block
contains BLOCKSIZE
bytes. Although this format may be thought
of as being on magnetic tape, other media are often used.
Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there may be a block filled with binary zeros as an end-of-file marker. A reasonable system should write a block of zeros at the end, but must not assume that such a block exists when reading an archive.
The blocks may be blocked for physical I/O operations.
Each record of n blocks (where n is set by the
--blocking-factor=512-size (-b 512-size) option to tar
) is written with a single
`write ()' operation. On magnetic tapes, the result of
such a write is a single record. When writing an archive,
the last record of blocks should be written at the full size, with
blocks after the zero block containing all zeros. When reading
an archive, a reasonable system should properly handle an archive
whose last record is shorter than the rest, or which contains garbage
records after a zero block.
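For example, the following writes records of 20 blocks, that is 10240 bytes each; the tape device name is only an example:
$ tar --create --blocking-factor=20 --file=/dev/rmt8 bigdir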
The header block is defined in C as follows. In the GNU tar
distribution, this is part of file `src/tar.h':
/* GNU tar Archive Format description. */

/* If OLDGNU_COMPATIBILITY is not zero, tar produces archives which, by
   default, are readable by older versions of GNU tar.  This can be
   overriden by using --posix; in this case, POSIXLY_CORRECT in environment
   may be set for enforcing stricter conformance.  If OLDGNU_COMPATIBILITY
   is zero or undefined, tar will eventually produces archives which, by
   default, POSIX compatible; then either using --posix or defining
   POSIXLY_CORRECT enforces stricter conformance.  This #define will
   disappear in a few years.  FP, June 1995.  */

#define OLDGNU_COMPATIBILITY 1

/*---------------------------------------------.
| `tar' Header Block, from POSIX 1003.1-1990.  |
`---------------------------------------------*/

/* POSIX header.  */

struct posix_header
{                               /* byte offset */
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
                                /* 500 */
};

#define TMAGIC   "ustar"        /* ustar and a null */
#define TMAGLEN  6
#define TVERSION "00"           /* 00 and no null */
#define TVERSLEN 2

/* Values used in typeflag field.  */

#define REGTYPE  '0'            /* regular file */
#define AREGTYPE '\0'           /* regular file */
#define LNKTYPE  '1'            /* link */
#define SYMTYPE  '2'            /* reserved */
#define CHRTYPE  '3'            /* character special */
#define BLKTYPE  '4'            /* block special */
#define DIRTYPE  '5'            /* directory */
#define FIFOTYPE '6'            /* FIFO special */
#define CONTTYPE '7'            /* reserved */

/* Bits used in the mode field, values in octal.  */

#define TSUID    04000          /* set UID on execution */
#define TSGID    02000          /* set GID on execution */
#define TSVTX    01000          /* reserved */
                                /* file permissions */
#define TUREAD   00400          /* read by owner */
#define TUWRITE  00200          /* write by owner */
#define TUEXEC   00100          /* execute/search by owner */
#define TGREAD   00040          /* read by group */
#define TGWRITE  00020          /* write by group */
#define TGEXEC   00010          /* execute/search by group */
#define TOREAD   00004          /* read by other */
#define TOWRITE  00002          /* write by other */
#define TOEXEC   00001          /* execute/search by other */

/*-------------------------------------.
| `tar' Header Block, GNU extensions.  |
`-------------------------------------*/

/* In GNU tar, SYMTYPE is for to symbolic links, and CONTTYPE is for
   contiguous files, so maybe disobeying the `reserved' comment in POSIX
   header description.  I suspect these were meant to be used this way, and
   should not have really been `reserved' in the published standards.  */

/* *BEWARE* *BEWARE* *BEWARE* that the following information is still
   boiling, and may change.  Even if the OLDGNU format description should be
   accurate, the so-called GNU format is not yet fully decided.  It is
   surely meant to use only extensions allowed by POSIX, but the sketch
   below repeats some ugliness from the OLDGNU format, which should rather
   go away.  Sparse files should be saved in such a way that they do *not*
   require two passes at archive creation time.  Huge files get some POSIX
   fields to overflow, alternate solutions have to be sought for this.  */

/* Descriptor for a single file hole.  */

struct sparse
{                               /* byte offset */
  char offset[12];              /*  0 */
  char numbytes[12];            /* 12 */
                                /* 24 */
};

/* Sparse files are not supported in POSIX ustar format.  For sparse files
   with a POSIX header, a GNU extra header is provided which holds overall
   sparse information and a few sparse descriptors.  When an old GNU header
   replaces both the POSIX header and the GNU extra header, it holds some
   sparse descriptors too.  Whether POSIX or not, if more sparse descriptors
   are still needed, they are put into as many successive sparse headers as
   necessary.  The following constants tell how many sparse descriptors fit
   in each kind of header able to hold them.  */

#define SPARSES_IN_EXTRA_HEADER  16
#define SPARSES_IN_OLDGNU_HEADER 4
#define SPARSES_IN_SPARSE_HEADER 21

/* The GNU extra header contains some information GNU tar needs, but not
   foreseen in POSIX header format.  It is only used after a POSIX header
   (and never with old GNU headers), and immediately follows this POSIX
   header, when typeflag is a letter rather than a digit, so signaling a GNU
   extension.  */

struct extra_header
{                               /* byte offset */
  char atime[12];               /*   0 */
  char ctime[12];               /*  12 */
  char offset[12];              /*  24 */
  char realsize[12];            /*  36 */
  char longnames[4];            /*  48 */
  char unused_pad1[68];         /*  52 */
  struct sparse sp[SPARSES_IN_EXTRA_HEADER];
                                /* 120 */
  char isextended;              /* 504 */
                                /* 505 */
};

/* Extension header for sparse files, used immediately after the GNU extra
   header, and used only if all sparse information cannot fit into that
   extra header.  There might even be many such extension headers, one after
   the other, until all sparse information has been recorded.  */

struct sparse_header
{                               /* byte offset */
  struct sparse sp[SPARSES_IN_SPARSE_HEADER];
                                /*   0 */
  char isextended;              /* 504 */
                                /* 505 */
};

/* The old GNU format header conflicts with POSIX format in such a way that
   POSIX archives may fool old GNU tar's, and POSIX tar's might well be
   fooled by old GNU tar archives.  An old GNU format header uses the space
   used by the prefix field in a POSIX header, and cumulates information
   normally found in a GNU extra header.  With an old GNU tar header, we
   never see any POSIX header nor GNU extra header.  Supplementary sparse
   headers are allowed, however.  */

struct oldgnu_header
{                               /* byte offset */
  char unused_pad1[345];        /*   0 */
  char atime[12];               /* 345 */
  char ctime[12];               /* 357 */
  char offset[12];              /* 369 */
  char longnames[4];            /* 381 */
  char unused_pad2;             /* 385 */
  struct sparse sp[SPARSES_IN_OLDGNU_HEADER];
                                /* 386 */
  char isextended;              /* 482 */
  char realsize[12];            /* 483 */
                                /* 495 */
};

/* OLDGNU_MAGIC uses both magic and version fields, which are contiguous.
   Found in an archive, it indicates an old GNU header format, which will be
   hopefully become obsolescent.  With OLDGNU_MAGIC, uname and gname are
   valid, though the header is not truly POSIX conforming.  */

#define OLDGNU_MAGIC "ustar  "  /* 7 chars and a null */

/* The standards committee allows only capital A through capital Z for
   user-defined expansion.  */

/* This is a dir entry that contains the names of files that were in the
   dir at the time the dump was made.  */
#define GNUTYPE_DUMPDIR 'D'

/* Identifies the *next* file on the tape as having a long linkname.  */
#define GNUTYPE_LONGLINK 'K'

/* Identifies the *next* file on the tape as having a long name.  */
#define GNUTYPE_LONGNAME 'L'

/* This is the continuation of a file that began on another volume.  */
#define GNUTYPE_MULTIVOL 'M'

/* For storing filenames that do not fit into the main header.  */
#define GNUTYPE_NAMES 'N'

/* This is for sparse files.  */
#define GNUTYPE_SPARSE 'S'

/* This file is a tape/volume header.  Ignore it on extraction.  */
#define GNUTYPE_VOLHDR 'V'

/*--------------------------------------.
| tar Header Block, overall structure.  |
`--------------------------------------*/

/* tar files are made in basic blocks of this size.  */
#define BLOCKSIZE 512

enum archive_format
{
  DEFAULT_FORMAT,               /* format to be decided later */
  V7_FORMAT,                    /* old V7 tar format */
  OLDGNU_FORMAT,                /* GNU format as per before tar 1.12 */
  POSIX_FORMAT,                 /* restricted, pure POSIX format */
  GNU_FORMAT                    /* POSIX format with GNU extensions */
};

union block
{
  char buffer[BLOCKSIZE];
  struct posix_header header;
  struct extra_header extra_header;
  struct oldgnu_header oldgnu_header;
  struct sparse_header sparse_header;
};

/* End of Format description.  */
All characters in header blocks are represented by using 8-bit characters in the local variant of ASCII. Each field within the structure is contiguous; that is, there is no padding used within the structure. Each character on the archive medium is stored contiguously.
Bytes representing the contents of files (after the header block
of each file) are not translated in any way and are not constrained
to represent characters in any character set. The tar
format
does not distinguish text files from binary files, and no translation
of file contents is performed.
The name
, linkname
, magic
, uname
, and
gname
are null-terminated character strings. All other fields
are zero-filled octal numbers in ASCII. Each numeric field of width
w contains w minus 2 digits, a space, and a null, except
size
, and mtime
, which do not contain the trailing null.
The name
field is the file name of the file, with directory names
(if any) preceding the file name, separated by slashes.
@FIXME{how big a name before field overflows?}
The mode
field provides nine bits specifying file permissions
and three bits to specify the Set UID, Set GID, and Save Text
(sticky) modes. Values for these bits are defined above.
When special permissions are required to create a file with a given
mode, and the user restoring files from the archive does not hold such
permissions, the mode bit(s) specifying those special permissions
are ignored. Modes which are not supported by the operating system
restoring files from the archive will be ignored. Unsupported modes
should be faked up when creating or updating an archive; e.g. the
group permission could be copied from the other permission.
The uid
and gid
fields are the numeric user and group
ID of the file owners, respectively. If the operating system does
not support numeric user or group IDs, these fields should be ignored.
The size
field is the size of the file in bytes; linked files
are archived with this field specified as zero. @FIXME-xref{Modifiers}, in
particular the --incremental (-G) option.
The mtime
field is the modification time of the file at the time
it was archived. It is the ASCII representation of the octal value of
the last time the file was modified, represented as an integer number of
seconds since January 1, 1970, 00:00 Coordinated Universal Time.
The chksum
field is the ASCII representation of the octal value
of the simple sum of all bytes in the header block. Each 8-bit
byte in the header is added to an unsigned integer, initialized to
zero, the precision of which shall be no less than seventeen bits.
When calculating the checksum, the chksum
field is treated as
if it were all blanks.
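As an illustration rather than a reference, the sum can be recomputed from the shell with standard dd, od and awk, substituting blanks (ASCII 32) for the eight chksum bytes at offsets 148 through 155:
$ dd if=archive.tar bs=512 count=1 2>/dev/null \
    | od -An -v -t u1 \
    | awk '{ for (i = 1; i <= NF; i++) { n++; s += (n >= 149 && n <= 156) ? 32 : $i } }
           END { printf "%o\n", s }'
The octal value printed should match the digits stored in the chksum field.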
The typeflag
field specifies the type of file archived. If a
particular implementation does not recognize or permit the specified
type, the file will be extracted as if it were a regular file. As this
action occurs, tar
issues a warning to the standard error.
The atime
and ctime
fields are used in making incremental
backups; they store, respectively, the particular file's access time
and last inode-change time.
The offset
is used by the --multi-volume (-M) option, when
making a multi-volume archive. The offset is number of bytes into
the file that we need to restart at to continue the file on the next
tape, i.e., where we store the location that a continued file is
continued at.
The following fields were added to deal with sparse files. A file
is sparse if it takes in unallocated blocks which end up being
represented as zeros, i.e., no useful data. A test to see if a file
is sparse is to look at the number of blocks allocated for it versus the
number of characters in the file; if there are fewer blocks allocated
for the file than would normally be allocated for a file of that
size, then the file is sparse. This is the method tar
uses to
detect a sparse file, and once such a file is detected, it is treated
differently from non-sparse files.
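A quick way to eyeball this from the shell, using the `core' file of the earlier report as an example name:
$ ls -l core
$ du -k core
If du reports far fewer kilobytes allocated than the length shown by ls, the file is sparse.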
Sparse files are often dbm
files, or other database-type files
which have data at some points and emptiness in the greater part of
the file. Such files can appear to be very large when an `ls
-l' is done on them, when in truth, there may be a very small amount
of important data contained in the file. It is thus undesirable
to have tar
think that it must back up this entire file, as
great quantities of room are wasted on empty blocks, which can lead
to running out of room on a tape far earlier than is necessary.
Thus, sparse files are dealt with so that these empty blocks are
not written to the tape. Instead, what is written to the tape is a
description, of sorts, of the sparse file: where the holes are, how
big the holes are, and how much data is found at the end of the hole.
This way, the file takes up potentially far less room on the tape,
and when the file is extracted later on, it will look exactly the way
it looked beforehand. The following is a description of the fields
used to handle a sparse file:
The sp
is an array of struct sparse
. Each struct
sparse
contains two 12-character strings which represent an offset
into the file and a number of bytes to be written at that offset.
The offset is absolute, and not relative to the offset in preceding
array element.
The header can hold four of these struct sparse
at the moment;
if more are needed, they are not stored in the header.
The isextended
flag is set when an extended_header
is needed to deal with a file. Note that this means that this flag
can only be set when dealing with a sparse file, and it is only set
in the event that the description of the file will not fit in the
allotted room for sparse structures in the header. In other words,
an extended_header is needed.
The extended_header
structure is used for sparse files which
need more sparse structures than can fit in the header. The header can
fit 4 such structures; if more are needed, the flag isextended
gets set and the next block is an extended_header
.
Each extended_header
structure contains an array of 21
sparse structures, along with a similar isextended
flag
that the header had. There can be an indeterminate number of such
extended_header
s to describe a sparse file.
REGTYPE
AREGTYPE
These flags represent a regular file. For compatibility with older
versions of tar, a typeflag value of AREGTYPE
should be silently recognized as a regular file.
New archives should be created using REGTYPE. Also, for
backward compatibility, tar treats a regular file whose name
ends with a slash as a directory.
LNKTYPE
This flag represents a file linked to another file previously archived.
The linked-to name is specified in the linkname
field with a trailing null.
SYMTYPE
This represents a symbolic link. The linked-to name is specified in the
linkname
field with a trailing null.
CHRTYPE
BLKTYPE
These represent character special files and block special files
respectively. In this case the devmajor
and devminor
fields will contain the major and minor device numbers respectively.
Operating systems may map the device specifications to their own
local specification, or may ignore the entry.
DIRTYPE
This flag specifies a directory or sub-directory. The directory name in the name
field should end with a slash. On systems where
disk allocation is performed on a directory basis, the size
field
will contain the maximum number of bytes (which may be rounded to
the nearest disk block allocation unit) which the directory may
hold. A size
field of zero indicates no such limiting. Systems
which do not support limiting in this manner should ignore the
size
field.
FIFOTYPE
This specifies a FIFO special file. Note that archiving a FIFO file
records the existence of the file, not its contents.
CONTTYPE
This specifies a contiguous file, which is the same as a normal file
except that, in operating systems which support them, all the file's
space is allocated contiguously on the disk.
A ... Z
These are reserved for custom implementations.
Other values are reserved for specification in future revisions of
the P1003 standard, and should not be used by any tar
program.
The magic
field indicates that this archive was output in
the P1003 archive format. If this field contains TMAGIC
,
the uname
and gname
fields will contain the ASCII
representation of the owner and group of the file respectively.
If found, the user and group IDs are used rather than the values in
the uid
and gid
fields.
For references, see ISO/IEC 9945-1:1990 or IEEE Std 1003.1-1990, pages 169-173 (section 10.1) for Archive/Interchange File Format; and IEEE Std 1003.2-1992, pages 380-388 (section 4.48) and pages 936-940 (section E.4.48) for pax - Portable archive interchange.
@UNREVISED
The GNU format uses additional file types to describe new types of files in an archive. These are listed below.
GNUTYPE_DUMPDIR
'D'
This represents a directory and a list of files created by the
--incremental (-G) option. The size
field gives the total
size of the associated list of files. Each file name is preceded by
either a `Y' (the file should be in this archive) or an `N' (the file
is a directory, or is not stored in the archive). Each file
name is terminated by a null. There is an additional null after the
last file name.
GNUTYPE_MULTIVOL
'M'
This represents a file continued from another volume of a multi-volume
archive created with the --multi-volume (-M) option. The size
field gives the
maximum size of this piece of the file (assuming the volume does
not end before the file is written out). The offset
field
gives the offset from the beginning of the file where this part of
the file begins. Thus size
plus offset
should equal
the original size of the file.
GNUTYPE_SPARSE
'S'
This flag indicates that we are dealing with a sparse file.
GNUTYPE_VOLHDR
'V'
This file type marks the volume header that was given when the archive
was created. The name
field contains the name
given after the --label=archive-label (-V archive-label) option.
The size
field is zero. Only the first file in each volume
of an archive should have this type.
You may have trouble reading a GNU format archive on a non-GNU
system if the options --incremental (-G), --multi-volume (-M),
--sparse (-S), or --label=archive-label (-V archive-label) were used when writing the archive.
In general, if tar
does not use the GNU-added fields of the
header, other versions of tar
should be able to read the
archive. Otherwise, the tar
program will give an error, the
most likely one being a checksum error.
Comparison of tar and cpio
@UNREVISED
@FIXME{Reorganize the following material}
The cpio
archive formats, like tar
, do have maximum
pathname lengths. The binary and old ASCII formats have a max path
length of 256, and the new ASCII and CRC ASCII formats have a max
path length of 1024. GNU cpio
can read and write archives
with arbitrary pathname lengths, but other cpio
implementations
may crash unexplainedly trying to read them.
tar
handles symbolic links in the form in which it comes in BSD;
cpio
doesn't handle symbolic links in the form in which it comes
in System V prior to SVR4, and some vendors may have added symlinks
to their system without enhancing cpio
to know about them.
Others may have enhanced it in a way other than the way I did it
at Sun, and which was adopted by AT&T (and which is, I think, also
present in the cpio
that Berkeley picked up from AT&T and put
into a later BSD release--I think I gave them my changes).
(SVR4 does some funny stuff with tar
; basically, its cpio
can handle tar
format input, and write it on output, and it
probably handles symbolic links. They may not have bothered doing
anything to enhance tar
as a result.)
cpio
handles special files; traditional tar
doesn't.
tar
comes with V7, System III, System V, and BSD source;
cpio
comes only with System III, System V, and later BSD
(4.3-tahoe and later).
tar's way of handling multiple hard links to a file can handle
file systems that support 32-bit inumbers (e.g., the BSD file system);
cpio's way requires you to play some games (in its "binary"
format, i-numbers are only 16 bits, and in its "portable ASCII" format,
they're 18 bits--it would have to play games with the "file system ID"
field of the header to make sure that the file system ID/i-number pairs
of different files were always different), and I don't know which
cpios, if any, play those games. Those that don't might get
confused and think two files are the same file when they're not, and
make hard links between them.
tar's way of handling multiple hard links to a file places only
one copy of the link on the tape, but the name attached to that copy
is the only one you can use to retrieve the file; cpio's
way puts one copy for every link, but you can retrieve it using any
of the names.
What type of checksum (if any) is used, and how is it calculated?
See the attached manual pages for tar
and cpio
format.
tar
uses a checksum which is the sum of all the bytes in the
tar
header for a file; cpio
uses no checksum.
If anyone knows why
cpio
was made when tar
was present at the unix scene,
It wasn't. cpio
first showed up in PWB/UNIX 1.0; no
generally-available version of UNIX had tar
at the time. I don't
know whether any version that was generally available within AT&T
had tar
, or, if so, whether the people within AT&T who did
cpio
knew about it.
On restore, if there is a corruption on a tape tar
will stop at
that point, while cpio
will skip over it and try to restore the
rest of the files.
The main difference is just in the command syntax and header format.
tar
is a little more tape-oriented in that everything is blocked
to start on a record boundary.
Are there any differences in the ability to recover crashed archives between the two of them? (Is there any chance of recovering crashed archives at all?)
Theoretically it should be easier under tar
since the blocking
lets you find a header with some variation of `dd skip=nn'.
However, modern cpio's and variations have an option to just
search for the next file header after an error with a reasonable chance
of re-syncing. However, lots of tape driver software won't allow you to
continue past a media error which should be the only reason for getting
out of sync unless a file changed sizes while you were writing the
archive.
If anyone knows why
cpio
was made when tar
was present at the unix scene, please tell me about this too.
Probably because it is more media efficient (by not blocking everything
and using only the space needed for the headers where tar
always uses 512 bytes per file header) and it knows how to archive
special files.
You might want to look at the freely available alternatives. The major
ones are afio
, GNU tar
, and pax
, each of which
have their own extensions with some backwards compatibility.
Sparse files were tarred as sparse files (which you can easily
test, because the resulting archive gets smaller, and GNU cpio
can no longer read it).