The TAR package
In addition to the information stored in ArchiveEntry, a
TarArchiveEntry stores various attributes, including information
about the original owner and permissions.
There are several different dialects of the TAR format, maybe
even different TAR formats. The tar package contains special
cases in order to read many of the existing dialects and will by
default try to create archives in the original format (often
called "ustar"). This original format didn't support file names
longer than 100 characters or entries bigger than 8 GiB, and the
tar package will by default fail if you try to write an entry
that goes beyond those limits. "ustar" is the common denominator
of all the existing tar dialects and is understood by most of the
existing tools.
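The limits described above can be expressed as a small pre-flight check. This is a minimal sketch that uses only the JDK and is not part of Commons Compress: the class UstarLimits and its members are hypothetical, it assumes ASCII names (one byte per character), and it relies on the size field of a ustar header holding 11 octal digits, whose largest value is 8 GiB minus one byte. The library's own boundary checks may differ slightly.

```java
// Hypothetical helper for illustration; not part of Commons Compress.
final class UstarLimits {
    /** Length of the name field in a ustar header, in bytes. */
    static final int MAX_NAME_BYTES = 100;

    /** Largest size that fits into 11 octal digits: 8 GiB - 1. */
    static final long MAX_SIZE = 077777777777L;

    /** Returns true if name and size stay within the plain ustar limits (ASCII names assumed). */
    static boolean fits(String name, long size) {
        return name.length() <= MAX_NAME_BYTES && size >= 0 && size <= MAX_SIZE;
    }

    public static void main(String[] args) {
        // A short name with a small size stays within the limits.
        System.out.println(fits("docs/readme.txt", 1024L));
        // An entry of exactly 8 GiB no longer fits into the size field.
        System.out.println(fits("docs/readme.txt", 1L << 33));
    }
}
```

An entry for which such a check fails is one that needs the long file name or big number options described in the following sections.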
The tar package does not support the full POSIX tar standard,
nor the more modern GNU extensions of said standard.
Long File Names
The longFileMode option of TarArchiveOutputStream controls how
files with names longer than 100 characters are handled. The
possible choices are:
- LONGFILE_ERROR: throw an exception if such a file is added.
This is the default.
- LONGFILE_TRUNCATE: truncate such names.
- LONGFILE_GNU: use a GNU tar variant, now referred to as
"oldgnu", of storing such names. If you choose the GNU tar
option, the archive cannot be extracted using many other tar
implementations like the ones of OpenBSD, Solaris or MacOS X.
- LONGFILE_POSIX: use a PAX extended header as defined by
POSIX 1003.1. Most modern tar implementations are able to
extract such archives. Available since Commons Compress 1.4.
TarArchiveInputStream recognizes the GNU tar as well as the
POSIX extensions (starting with Commons Compress 1.2) for long
file names and reads the longer names transparently.
Big Numeric Values
The bigNumberMode option of TarArchiveOutputStream controls how
files larger than 8 GiB or with other big numeric values that
can't be encoded in traditional header fields are handled. The
possible choices are:
- BIGNUMBER_ERROR: throw an exception if such an entry is
added. This is the default.
- BIGNUMBER_STAR: use a variant first introduced by Jörg
Schilling's star and later adopted by GNU and BSD tar. This
method is not supported by all implementations.
- BIGNUMBER_POSIX: use a PAX extended header as defined by
POSIX 1003.1. Most modern tar implementations are able to
extract such archives.
Starting with Commons Compress 1.4, TarArchiveInputStream
recognizes the star as well as the POSIX extensions for big
numeric values and reads them transparently.
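To make the difference between the traditional octal fields and the star variant more concrete, the following sketch encodes a number the way the star/GNU "base-256" convention is commonly described: the first byte of the field is 0x80 and the remaining bytes hold the value in big-endian order. It uses only the JDK, the class Base256Field is hypothetical, and it handles non-negative values only, so treat it as an illustration of the idea rather than the library's implementation.

```java
// Hypothetical illustration of the binary ("base-256") numeric field; not library code.
final class Base256Field {
    /** Encodes a non-negative value into a field of the given length. */
    static byte[] encode(long value, int fieldLength) {
        if (value < 0 || fieldLength < 2) {
            throw new IllegalArgumentException("unsupported value or field length");
        }
        byte[] field = new byte[fieldLength];
        long remaining = value;
        // Fill the field from the last byte towards the front (big-endian).
        for (int i = fieldLength - 1; i >= 1; i--) {
            field[i] = (byte) (remaining & 0xff);
            remaining >>>= 8;
        }
        if (remaining != 0) {
            throw new IllegalArgumentException("value does not fit into the field");
        }
        field[0] = (byte) 0x80; // marker: field is binary, not octal text
        return field;
    }

    /** Decodes a field written by encode. */
    static long decode(byte[] field) {
        if ((field[0] & 0x80) == 0) {
            throw new IllegalArgumentException("field is not in binary form");
        }
        long value = field[0] & 0x7f;
        for (int i = 1; i < field.length; i++) {
            if ((value >>> 55) != 0) {
                throw new IllegalArgumentException("value does not fit into a long");
            }
            value = (value << 8) | (field[i] & 0xff);
        }
        return value;
    }
}
```

For example, `Base256Field.encode(1L << 33, 12)` stores a size of exactly 8 GiB, which the 11 octal digits of the traditional size field cannot represent.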
File Name Encoding
The original ustar format only supports 7-bit ASCII file
names; later implementations use the platform's default
encoding to encode file names. The POSIX standard recommends
using PAX extension headers for non-ASCII file names
instead.
Commons Compress 1.1 to 1.3 assumed file names would be
encoded using ISO-8859-1. Starting with Commons Compress 1.4
you can specify the encoding to expect when reading (or to use
when writing) as a parameter to TarArchiveInputStream (or
TarArchiveOutputStream, respectively); it now defaults to the
platform's default encoding.
Since Commons Compress 1.4 another optional parameter of
TarArchiveOutputStream, addPaxHeadersForNonAsciiNames, controls
whether PAX extension headers will be written for non-ASCII file
names. By default they will not be written, to preserve space.
TarArchiveInputStream will read them transparently if present.
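Whether a given name is affected by these encoding choices can be checked up front. The sketch below uses only the JDK, and the class TarNameEncoding is hypothetical. It tests whether a name is pure 7-bit ASCII and shows that the same name occupies a different number of bytes depending on the encoding, which is why the reading and the writing side have to agree on one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons Compress.
final class TarNameEncoding {
    /** Returns true if every character of the name is 7-bit ASCII. */
    static boolean isAscii(String name) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(name);
    }

    /** Returns the number of bytes the name occupies in the given encoding. */
    static int encodedLength(String name, Charset charset) {
        return name.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "resume.txt" with two accented e characters, written as Unicode escapes.
        String name = "r\u00e9sum\u00e9.txt";
        System.out.println(isAscii(name));
        System.out.println(encodedLength(name, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(name, StandardCharsets.UTF_8));
    }
}
```

A name for which isAscii returns false is one where a PAX header for the name, or an explicitly agreed encoding, avoids ambiguity.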
Sparse files
Prior to Commons Compress 1.20, TarArchiveInputStream would
recognize sparse file entries stored using the "oldgnu" format
(--sparse-version=0.0 in GNU tar) but was not able to extract
them correctly. Starting with Commons Compress 1.20, all GNU and
POSIX variants of sparse files are recognized and can be read.
Consuming Archives Completely
The end of a tar archive is signaled by two consecutive
records of all zeros. Unfortunately not all tar
implementations adhere to this and some only write one record
to end the archive. Commons Compress will always write two
records but stops reading an archive as soon as it finds one
record of all zeros.
Prior to version 1.5 this could leave the second EOF record
inside the stream when getNextEntry or getNextTarEntry
returned null. Starting with version 1.5, TarArchiveInputStream
will try to read a second record as well if present,
effectively consuming the archive completely.
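The end-of-archive convention can be illustrated without the library. The following sketch, using only the JDK, counts how many all-zero records start at a given offset. The class TarEndMarker is hypothetical, and the record size of 512 bytes is an assumption based on the usual tar block size. A tolerant reader in the spirit described above would treat a count of 1 as the end of the archive and skip the second record when the count is 2.

```java
// Hypothetical illustration of end-of-archive detection; not library code.
final class TarEndMarker {
    /** Assumed record size in bytes. */
    static final int RECORD_SIZE = 512;

    /** Returns true if a complete record of zeros starts at the offset. */
    static boolean isZeroRecord(byte[] data, int offset) {
        if (offset < 0 || offset + RECORD_SIZE > data.length) {
            return false;
        }
        for (int i = offset; i < offset + RECORD_SIZE; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    /** Counts consecutive all-zero records at the offset, capped at two. */
    static int endMarkerRecords(byte[] data, int offset) {
        int count = 0;
        while (count < 2 && isZeroRecord(data, offset + count * RECORD_SIZE)) {
            count++;
        }
        return count;
    }
}
```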
PAX Extended Header
The tar package has supported reading PAX extended headers
since 1.3 for local headers and 1.11 for global headers. The
following entries of PAX headers are applied when reading:
- path: set the entry's name
- linkpath: set the entry's link name
- gid: set the entry's group id
- gname: set the entry's group name
- uid: set the entry's user id
- uname: set the entry's user name
- size: set the entry's size
- mtime: set the entry's modification time
- SCHILY.devminor: set the entry's minor device number
- SCHILY.devmajor: set the entry's major device number
In addition, some fields that GNU tar and star use to signal
sparse entries are supported and are used by the is*GNUSparse
and isStarSparse methods.
Some PAX extra headers may be set when writing archives,
for example for non-ASCII names or big numeric values. This
depends on various settings of the output stream; see the
previous sections.
Since 1.15 you can directly access all PAX extension
headers that have been found when reading an entry or specify
extra headers to be written to a (local) PAX extended header
entry.
Some hints if you try to set extended headers:
- PAX header keywords should be ASCII. star/gnutar
(SCHILY.xattr.*) do not check for this. libarchive/bsdtar
(LIBARCHIVE.xattr.*) uses URL-encoding.
- PAX header values should be encoded as UTF-8 characters
(including the trailing \0). star/gnutar (SCHILY.xattr.*) do
not check for this. libarchive/bsdtar (LIBARCHIVE.xattr.*)
encodes values using Base64.
- libarchive/bsdtar will read SCHILY.xattr headers, but
will not generate them.
- gnutar will complain about LIBARCHIVE.xattr (and any
other unknown) headers and will neither encode nor decode
them.
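For orientation, a single record inside a PAX extended header has the form "length key=value" followed by a newline, where length is the decimal size of the whole record in bytes, including the length digits themselves. The sketch below builds such a record with the JDK only. The class PaxRecord is hypothetical and is not the code Commons Compress uses; it encodes the value as UTF-8, in line with the hints above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of one PAX extended header record; not library code.
final class PaxRecord {
    /** Encodes "length key=value\n", where length counts the whole record in bytes. */
    static byte[] encode(String key, String value) {
        byte[] body = (" " + key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);
        // The length field counts its own digits, so iterate until it is stable.
        int total = body.length + 1;
        int next = body.length + String.valueOf(total).length();
        while (next != total) {
            total = next;
            next = body.length + String.valueOf(total).length();
        }
        byte[] prefix = String.valueOf(total).getBytes(StandardCharsets.US_ASCII);
        byte[] record = new byte[total];
        System.arraycopy(prefix, 0, record, 0, prefix.length);
        System.arraycopy(body, 0, record, prefix.length, body.length);
        return record;
    }
}
```

The self-referential length is the only subtle part: a body of 98 bytes, for instance, needs a three-digit length because the record grows past 99 bytes once the digits are counted.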
Random Access
Starting with Commons Compress 1.21 the tar package contains a
TarFile class that provides random access to archives. Except
for the ability to access entries out of order, TarFile is not
superior to TarArchiveInputStream.