The ZIP package provides features not found
in java.util.zip:
In addition to the information stored
in ArchiveEntry, a ZipArchiveEntry
stores internal and external attributes as well as extra
fields, which may contain information like Unix permissions,
information about the platform the entry was created on, its
last modification time and an optional comment.
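As an illustration of how Unix permissions can travel inside the external attributes, the sketch below decodes them from a raw attribute value. It assumes the widespread convention (used by InfoZIP and many other tools, but not mandated by the ZIP specification) that the Unix mode occupies the upper 16 bits of the external attributes when the entry's platform is Unix. The class and method names are hypothetical and not part of Compress.

```java
final class ExternalAttributes {

    /** Host system value that the ZIP specification assigns to Unix. */
    static final int PLATFORM_UNIX = 3;

    /**
     * Returns the Unix mode (file type and permission bits) held in the
     * upper 16 bits of the external attributes, or 0 if the entry was not
     * created on a Unix platform. Assumption: the writer followed the
     * InfoZIP convention described above.
     */
    static int unixMode(int platform, long externalAttributes) {
        if (platform != PLATFORM_UNIX) {
            return 0;
        }
        return (int) ((externalAttributes >> 16) & 0xFFFF);
    }

    public static void main(String[] args) {
        // Octal 0100644: regular file with permissions rw-r--r--.
        long attributes = 0100644L << 16;
        int mode = unixMode(PLATFORM_UNIX, attributes);
        // Mask with octal 07777 to keep only the permission bits.
        System.out.println(Integer.toOctalString(mode & 07777));
    }
}
```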
ZIP archives store archive entries in sequence and contain a registry of all entries at the very end of the archive. It is acceptable for an archive to contain several entries of the same name and have the registry (called the central directory) decide which entry is actually to be used (if any).
In addition, the ZIP format stores certain information only inside the central directory and not together with the entry itself.
This means the ZIP format cannot really be parsed
correctly while reading a non-seekable stream, which is what
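ZipArchiveInputStream is forced to do. As a
result, ZipArchiveInputStream may return entries that the
central directory would omit or supersede, and it cannot
provide information that is stored only in the central
directory. ZipArchiveInputStream shares these limitations
with java.util.zip.ZipInputStream.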
ZipFile is able to read the central directory
first and provide correct and complete information on any
ZIP archive.
If possible, you should always prefer ZipFile
over ZipArchiveInputStream.
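Because java.util.zip.ZipInputStream shares the limitation, the difference can be observed with the JDK classes alone. The following is a minimal sketch, not Compress code: it writes a one-entry archive whose entry carries a comment (entry comments are stored only in the central directory), then reads the entry once from a stream and once through java.util.zip.ZipFile. The class, method and constant names are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class StreamVersusCentralDirectory {

    static final String NAME = "hello.txt";
    static final String COMMENT = "kept in the central directory";

    /** Writes a one-entry archive in memory; the entry carries a comment. */
    static byte[] buildArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry(NAME);
            entry.setComment(COMMENT);
            out.putNextEntry(entry);
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the first entry header sequentially, as a non-seekable stream must. */
    static ZipEntry readFromStream(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            return in.getNextEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks the entry up through the central directory; needs a seekable file. */
    static ZipEntry readFromCentralDirectory(byte[] archive) {
        try {
            Path file = Files.createTempFile("central-directory-demo", ".zip");
            try {
                Files.write(file, archive);
                try (ZipFile zip = new ZipFile(file.toFile())) {
                    return zip.getEntry(NAME);
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] archive = buildArchive();
        ZipEntry streamed = readFromStream(archive);
        ZipEntry indexed = readFromCentralDirectory(archive);
        System.out.println("stream:            comment=" + streamed.getComment()
                + ", size=" + streamed.getSize());
        System.out.println("central directory: comment=" + indexed.getComment()
                + ", size=" + indexed.getSize());
    }
}
```

The streamed entry has no comment because the stream reader has not yet reached the central directory when it hands out the entry. The size is unknown at that point as well, since the JDK writer stores it after the entry data.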
Inside a ZIP archive, additional data can be attached to
each entry. The java.util.zip.ZipEntry class
provides access to this via the get/setExtra
methods as arrays of bytes.
The extra data is actually supposed to be more structured
than that, and Compress' ZIP package provides access to the
structured data as ExtraField instances. Only
a subset of all defined extra field formats is supported by
the package; any other extra field will be stored
as UnrecognizedExtraField.
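To make the structure concrete: according to the ZIP specification, the raw extra data is a sequence of records, each consisting of a two-byte header ID, a two-byte data length (both little-endian) and the data itself. The sketch below splits such a byte array into records. It only illustrates the layout and is not how Compress implements its extra field support; the class names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

final class ExtraFieldRecords {

    /** One record of the extra data: a 16-bit header ID plus its payload. */
    static final class Field {
        final int headerId;
        final byte[] data;

        Field(int headerId, byte[] data) {
            this.headerId = headerId;
            this.data = data;
        }
    }

    /**
     * Splits raw extra data into records. Trailing bytes that are too short
     * to form a record header are ignored; a record whose declared length
     * exceeds the remaining data is rejected.
     */
    static List<Field> parse(byte[] extra) {
        List<Field> fields = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int headerId = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;
            if (length > buf.remaining()) {
                throw new IllegalArgumentException("truncated extra field");
            }
            byte[] data = new byte[length];
            buf.get(data);
            fields.add(new Field(headerId, data));
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] extra = {
            0x55, 0x54, 0x05, 0x00, 0x01, 0, 0, 0, 0, // ID 0x5455, five data bytes
            (byte) 0xFE, (byte) 0xCA, 0x00, 0x00      // ID 0xCAFE, no data
        };
        for (Field field : parse(extra)) {
            System.out.println("header ID 0x" + Integer.toHexString(field.headerId)
                    + ", " + field.data.length + " data bytes");
        }
    }
}
```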
Traditionally the ZIP archive format uses code page 437 as the encoding for file names, which is not sufficient for many international character sets.
Over time different archivers have chosen different ways to
work around the limitation. The java.util.zip
package, for example, simply uses UTF-8 as its encoding.
Ant has been offering the encoding attribute of the zip and unzip task as a way to explicitly specify the encoding to use (or expect) since Ant 1.4. It defaults to the platform's default encoding for zip and UTF-8 for jar and other jar-like tasks (war, ear, ...) as well as the unzip family of tasks.
More recent versions of the ZIP specification introduce
something called the "language encoding flag"
which can be used to signal that a file name has been
encoded using UTF-8. All ZIP archives written by Compress
will set this flag if the encoding has been set to UTF-8.
Our interoperability tests with existing archivers didn't
show any ill effects (in fact, most archivers ignore the
flag to date), but you can turn off the "language encoding
flag" by setting the attribute
useLanguageEncodingFlag to false on the
ZipArchiveOutputStream if you should encounter
problems.
The ZipFile
and ZipArchiveInputStream classes will
recognize the language encoding flag and, if the flag is set,
ignore the encoding set in the constructor.
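In the ZIP specification the language encoding flag is bit 11 of the general purpose bit flag, a two-byte little-endian field at offset 6 of each local file header. The sketch below shows the decision a reader makes based on that bit. It is an illustration under these assumptions, not the code used by ZipFile or ZipArchiveInputStream, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LanguageEncodingFlag {

    /** Bit 11 of the general purpose bit flag. */
    static final int LANGUAGE_ENCODING_FLAG = 1 << 11;

    /** Reads the little-endian general purpose bit flag at offset 6 of a local file header. */
    static int generalPurposeBitFlag(byte[] localFileHeader) {
        return (localFileHeader[6] & 0xFF) | ((localFileHeader[7] & 0xFF) << 8);
    }

    static boolean isSet(int generalPurposeBitFlag) {
        return (generalPurposeBitFlag & LANGUAGE_ENCODING_FLAG) != 0;
    }

    /** UTF-8 wins over the configured encoding whenever the flag is set. */
    static Charset charsetForNames(int generalPurposeBitFlag, Charset configured) {
        return isSet(generalPurposeBitFlag) ? StandardCharsets.UTF_8 : configured;
    }

    public static void main(String[] args) {
        // First eight bytes of a local file header: signature, version needed, flags.
        byte[] header = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08};
        int flags = generalPurposeBitFlag(header);
        System.out.println(charsetForNames(flags, StandardCharsets.ISO_8859_1));
    }
}
```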
The InfoZIP developers have introduced new ZIP extra fields
that can be used to add an additional UTF-8 encoded file
name to the entry's metadata. Most archivers ignore these
extra fields. ZipArchiveOutputStream supports
an option createUnicodeExtraFields which makes
it write these extra fields either for all entries
("always") or only for those whose name cannot be encoded using
the specified encoding ("not-encodeable"). It defaults to
"never" since the extra fields create bigger archives.
The fallbackToUTF8 attribute
of ZipArchiveOutputStream can be used to create
archives that use the specified encoding in the majority of
cases but UTF-8 and the language encoding flag for filenames
that cannot be encoded using the specified encoding.
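The per-entry decision behind such a fallback can be sketched with the JDK's charset API. This is a minimal illustration of the idea, not the code used by ZipArchiveOutputStream: the class and method names are hypothetical, and ISO-8859-1 merely stands in for whatever encoding has been configured (code page 437, known to Java as "IBM437", may not be available in every runtime).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FallbackToUtf8 {

    /**
     * Returns the configured charset if it can represent the file name,
     * otherwise UTF-8. A writer that falls back to UTF-8 would also set
     * the language encoding flag for that entry.
     */
    static Charset charsetFor(String fileName, Charset configured) {
        return configured.newEncoder().canEncode(fileName)
                ? configured
                : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        Charset configured = StandardCharsets.ISO_8859_1; // stand-in for the chosen encoding
        String western = "f\u00fcr.txt";                  // contains a u-umlaut
        String japanese = "\u65e5\u672c\u8a9e.txt";       // three kanji
        System.out.println(western + " -> " + charsetFor(western, configured));
        System.out.println(japanese + " -> " + charsetFor(japanese, configured));
    }
}
```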
The ZipFile
and ZipArchiveInputStream classes recognize the
Unicode extra fields by default and read the file name
information from them, unless you set the constructor parameter
scanForUnicodeExtraFields to false.
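For reference, the Unicode path extra field defined by InfoZIP (header ID 0x7075) carries a version byte (currently 1), the CRC-32 of the file name bytes as stored in the entry header, and the UTF-8 encoded name. The checksum lets a reader detect that the header name has changed since the extra field was written. The sketch below builds and reads such a payload using only the JDK. It illustrates the layout and is not Compress' implementation; the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class UnicodePathField {

    static final int HEADER_ID = 0x7075;

    static long crc(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** Payload layout: version (1 byte), CRC-32 of the header name (4 bytes), UTF-8 name. */
    static byte[] build(byte[] headerNameBytes, String unicodeName) {
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(5 + utf8.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);
        buf.putInt((int) crc(headerNameBytes));
        buf.put(utf8);
        return buf.array();
    }

    /**
     * Returns the Unicode name, or null if the version is unknown or the
     * checksum does not match the name found in the entry header.
     */
    static String read(byte[] payload, byte[] headerNameBytes) {
        if (payload.length < 5 || payload[0] != 1) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
        buf.get(); // skip the version byte
        long storedCrc = buf.getInt() & 0xFFFFFFFFL;
        if (storedCrc != crc(headerNameBytes)) {
            return null;
        }
        return new String(payload, 5, payload.length - 5, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Name as it might appear in the header when the encoding cannot represent it.
        byte[] headerName = "f?r.txt".getBytes(StandardCharsets.US_ASCII);
        byte[] payload = build(headerName, "f\u00fcr.txt");
        System.out.println(read(payload, headerName));
    }
}
```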
The optimal setting of flags depends on the archivers you expect as consumers/producers of the ZIP archives. Below are some test results, which may be superseded by later versions of each tool.
So, what to do?
If you are creating jars, then java.util.zip is your main consumer. We recommend you set the encoding to UTF-8 and keep the language encoding flag enabled. The flag won't help or hurt java.util.zip but archivers that support it will show the correct file names.
For maximum interop it is probably best to set the encoding to UTF-8, enable the language encoding flag and create Unicode extra fields when writing ZIPs. Such archives should be extracted correctly by java.util.zip, 7Zip, WinZIP, PKWARE tools and most likely InfoZIP tools. They will be unusable with Windows' "compressed folders" feature and bigger than archives without the Unicode extra fields, though.
If Windows' "compressed folders" is your primary consumer, then your best option is to explicitly set the encoding to the target platform. You may want to enable creation of Unicode extra fields so the tools that support them will extract the file names correctly.