Commons Compress

The TAR package

In addition to the information stored in ArchiveEntry, a TarArchiveEntry stores various attributes, including information about the original owner and permissions.

There are several different dialects of the TAR format, maybe even different TAR formats. The tar package contains special cases in order to read many of the existing dialects and will by default try to create archives in the original format (often called "ustar"). This original format supports neither file names longer than 100 characters nor entries bigger than 8 GiB, and the tar package will by default fail if you try to write an entry that goes beyond those limits. "ustar" is the common denominator of all the existing tar dialects and is understood by most of the existing tools.

The tar package supports neither the full POSIX tar standard nor the more modern GNU extensions of that standard.

Long File Names

The longFileMode option of TarArchiveOutputStream controls how files with names longer than 100 characters are handled. The possible choices are:

  • LONGFILE_ERROR: throw an exception if such a file is added. This is the default.
  • LONGFILE_TRUNCATE: truncate such names.
  • LONGFILE_GNU: store such names using the GNU tar variant now referred to as "oldgnu". If you choose the GNU tar option, the archive cannot be extracted by many other tar implementations, such as those of OpenBSD, Solaris or Mac OS X.
  • LONGFILE_POSIX: use a PAX extended header as defined by POSIX 1003.1. Most modern tar implementations are able to extract such archives. Available since Commons Compress 1.4.
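The 100-character limit is in practice a limit on the encoded name, because the traditional header reserves a fixed 100-byte field for it. The following sketch is not part of Commons Compress: the class TarNameCheck and its methods are hypothetical, and the library's own boundary check may differ in detail. It only illustrates that the same name can fit or overflow depending on the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative helper only; not part of Commons Compress. */
final class TarNameCheck {
    /** Width of the name field in a traditional tar header, in bytes. */
    static final int NAME_FIELD_BYTES = 100;

    /** Returns true if the encoded name does not fit the traditional name field. */
    static boolean needsLongNameHandling(String name, Charset charset) {
        return name.getBytes(charset).length > NAME_FIELD_BYTES;
    }

    /** Builds a string consisting of count copies of c. */
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = repeat('a', 100);      // 100 bytes in any ASCII-compatible encoding
        String umlauts = repeat('\u00e4', 60); // 60 bytes in ISO-8859-1, 120 bytes in UTF-8
        System.out.println(needsLongNameHandling(ascii, StandardCharsets.UTF_8));        // false
        System.out.println(needsLongNameHandling(umlauts, StandardCharsets.ISO_8859_1)); // false
        System.out.println(needsLongNameHandling(umlauts, StandardCharsets.UTF_8));      // true
    }
}
```

A name that overflows the field is what the longFileMode choices listed above decide how to handle.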

TarArchiveInputStream will recognize the GNU tar as well as the POSIX extensions (starting with Commons Compress 1.2) for long file names and reads the longer names transparently.

Big Numeric Values

The bigNumberMode option of TarArchiveOutputStream controls how files larger than 8 GiB, or entries with other numeric values too big to be encoded in the traditional header fields, are handled. The possible choices are:

  • BIGNUMBER_ERROR: throw an exception if such an entry is added. This is the default.
  • BIGNUMBER_STAR: use a variant first introduced by Jörg Schilling's star and later adopted by GNU and BSD tar. This method is not supported by all implementations.
  • BIGNUMBER_POSIX: use a PAX extended header as defined by POSIX 1003.1. Most modern tar implementations are able to extract such archives.
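The 8 GiB limit follows from the traditional header layout, where the size is stored as octal digits in a 12-byte field, leaving 11 digits for the value. The sketch below is an illustration under that assumption; TarOctalField is a hypothetical class, not the library's implementation.

```java
/** Illustrative helper only; not part of Commons Compress. */
final class TarOctalField {
    /** Octal digits available in the traditional 12-byte size field. */
    static final int SIZE_DIGITS = 11;

    /** Largest encodable size: 8^11 - 1, one byte short of 8 GiB. */
    static final long MAX_SIZE = (1L << (3 * SIZE_DIGITS)) - 1;

    /** Returns true if the size can be written as SIZE_DIGITS octal digits. */
    static boolean fitsSizeField(long size) {
        return size >= 0 && size <= MAX_SIZE;
    }

    /** Formats the size as zero-padded octal digits, or fails if it is too big. */
    static String toOctalField(long size) {
        if (!fitsSizeField(size)) {
            throw new IllegalArgumentException("size does not fit: " + size);
        }
        String octal = Long.toOctalString(size);
        StringBuilder sb = new StringBuilder(SIZE_DIGITS);
        for (int i = octal.length(); i < SIZE_DIGITS; i++) {
            sb.append('0');
        }
        return sb.append(octal).toString();
    }

    public static void main(String[] args) {
        System.out.println(toOctalField(MAX_SIZE));      // 77777777777
        System.out.println(fitsSizeField(MAX_SIZE + 1)); // false
    }
}
```

Values that fail this kind of check are the ones for which the bigNumberMode setting matters.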

Starting with Commons Compress 1.4, TarArchiveInputStream will recognize the star as well as the POSIX extensions for big numeric values and reads them transparently.

File Name Encoding

The original ustar format only supports 7-bit ASCII file names; later implementations use the platform's default encoding to encode file names. The POSIX standard recommends using PAX extension headers for non-ASCII file names instead.

Commons Compress 1.1 to 1.3 assumed file names would be encoded using ISO-8859-1. Starting with Commons Compress 1.4, you can specify the encoding to expect when reading as a parameter to TarArchiveInputStream, and the encoding to use when writing as a parameter to TarArchiveOutputStream. Both now default to the platform's default encoding.

Since Commons Compress 1.4, another optional parameter of TarArchiveOutputStream, addPaxHeadersForNonAsciiNames, controls whether PAX extension headers will be written for non-ASCII file names. By default they are not written, in order to save space. TarArchiveInputStream will read them transparently if present.
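What counts as a non-ASCII file name can be checked character by character. A minimal sketch, assuming a hypothetical helper AsciiNameCheck that is not the library's own test:

```java
/** Illustrative helper only; not part of Commons Compress. */
final class AsciiNameCheck {
    /** Returns true if every character of the name is in the 7-bit ASCII range. */
    static boolean isSevenBitAscii(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSevenBitAscii("readme.txt"));           // true
        System.out.println(isSevenBitAscii("r\u00e9sum\u00e9.txt")); // false
    }
}
```

Names for which such a check fails are the ones whose bytes depend on the chosen encoding, and for which a PAX extension header can carry an unambiguous name.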

Sparse Files

TarArchiveInputStream will recognize sparse file entries stored using the "oldgnu" format (--sparse-version=0.0 in GNU tar) but is not able to extract them correctly. canReadEntryData will return false on such entries. The other variants of sparse files currently cannot be detected at all.

Consuming Archives Completely

The end of a tar archive is signalled by two consecutive records of all zeros. Unfortunately, not all tar implementations adhere to this, and some write only one record to end the archive. Commons Compress will always write two records but stops reading an archive as soon as it finds one record of all zeros.

Prior to version 1.5 this could leave the second EOF record inside the stream when getNextEntry or getNextTarEntry returned null. Starting with version 1.5, TarArchiveInputStream will try to read a second record as well if present, effectively consuming the archive completely.
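The end-of-archive handling described in this section can be sketched with plain java.io. This is an illustration and not the library's code: TarEofRecords is a hypothetical class, it assumes 512-byte records, and it assumes the stream is positioned at a record boundary where either the next header or the end marker is expected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustrative helper only; not part of Commons Compress. */
final class TarEofRecords {
    /** Assumed record size in bytes. */
    static final int RECORD_SIZE = 512;

    /** Returns true if every byte of the record is zero. */
    static boolean isZeroRecord(byte[] record) {
        for (byte b : record) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    /** Fills the buffer completely; returns false if the stream ends first. */
    static boolean readRecord(InputStream in, byte[] record) throws IOException {
        int off = 0;
        while (off < record.length) {
            int n = in.read(record, off, record.length - off);
            if (n < 0) {
                return false;
            }
            off += n;
        }
        return true;
    }

    /**
     * Consumes the end-of-archive marker and returns how many all-zero
     * records were consumed (0, 1 or 2). A second record that is not all
     * zeros is pushed back via mark/reset when the stream supports it.
     */
    static int consumeEndOfArchive(InputStream in) {
        try {
            byte[] record = new byte[RECORD_SIZE];
            if (!readRecord(in, record) || !isZeroRecord(record)) {
                return 0;
            }
            if (!in.markSupported()) {
                return 1; // cannot look ahead safely
            }
            in.mark(RECORD_SIZE);
            if (readRecord(in, record) && isZeroRecord(record)) {
                return 2;
            }
            in.reset();
            return 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream twoRecords = new ByteArrayInputStream(new byte[2 * RECORD_SIZE]);
        System.out.println(consumeEndOfArchive(twoRecords)); // 2
    }
}
```

Reading the optional second record is what leaves the underlying stream positioned after the whole archive, which matters when more data follows the archive in the same stream.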

PAX Extended Header

The tar package has supported reading PAX extended headers since 1.3 for local headers and 1.11 for global headers. The following entries of PAX headers are applied when reading:

  • path: set the entry's name
  • linkpath: set the entry's link name
  • gid: set the entry's group id
  • gname: set the entry's group name
  • uid: set the entry's user id
  • uname: set the entry's user name
  • size: set the entry's size
  • mtime: set the entry's modification time
  • SCHILY.devminor: set the entry's minor device number
  • SCHILY.devmajor: set the entry's major device number

In addition, some fields that GNU tar and star use to signal sparse entries are supported; they are used by the is*GNUSparse and isStarSparse methods.

Some PAX extra headers may be set when writing archives, for example for non-ASCII names or big numeric values. This depends on various settings of the output stream; see the previous sections.

Since 1.15, you can directly access all PAX extension headers that have been found when reading an entry, or specify extra headers to be written to a (local) PAX extended header entry.

Some hints if you try to set extended headers:

  • PAX header keywords should be ASCII. star/gnutar (SCHILY.xattr.*) do not check for this. libarchive/bsdtar (LIBARCHIVE.xattr.*) uses URL-encoding.
  • PAX header values should be encoded as UTF-8 characters (including trailing \0). star/gnutar (SCHILY.xattr.*) do not check for this. libarchive/bsdtar (LIBARCHIVE.xattr.*) encodes values using Base64.
  • libarchive/bsdtar will read SCHILY.xattr headers, but will not generate them.
  • gnutar will complain about LIBARCHIVE.xattr (and any other unknown) headers and will neither encode nor decode them.
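For orientation when setting extended headers, each record in a PAX extended header has the form "length keyword=value" followed by a newline, where length is the decimal byte count of the whole record including the length digits themselves. The sketch below shows one way to compute such a record with UTF-8 encoding; PaxRecord is a hypothetical class for illustration, not the library's encoder.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper only; not part of Commons Compress. */
final class PaxRecord {
    /** Encodes one "length keyword=value\n" record as UTF-8 bytes. */
    static byte[] encode(String keyword, String value) {
        // Everything after the length digits: blank, keyword, '=', value, newline.
        byte[] rest = (" " + keyword + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);

        // The length counts its own digits, so iterate until the guess is stable.
        int length = rest.length + 1;
        while (String.valueOf(length).length() + rest.length != length) {
            length = String.valueOf(length).length() + rest.length;
        }

        byte[] digits = String.valueOf(length).getBytes(StandardCharsets.US_ASCII);
        byte[] record = new byte[length];
        System.arraycopy(digits, 0, record, 0, digits.length);
        System.arraycopy(rest, 0, record, digits.length, rest.length);
        return record;
    }

    public static void main(String[] args) {
        // Prints the record "9 path=a" followed by its terminating newline.
        System.out.print(new String(encode("path", "a"), StandardCharsets.UTF_8));
    }
}
```

Because the length is measured in bytes, a non-ASCII value makes the record longer than its character count suggests.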