The ZIP package

The ZIP package provides features not found in java.util.zip:

  • Support for encodings other than UTF-8 for filenames and comments.
  • Access to internal and external attributes (which some zip implementations use to store Unix permissions).
  • Structured support for extra fields.

In addition to the information stored in ArchiveEntry, a ZipArchiveEntry stores internal and external attributes as well as extra fields. These may contain information such as Unix permissions, the platform the entry was created on, its last modification time and an optional comment.
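
As an illustration of how Unix permissions can live in the external attributes: archivers that follow the Info-ZIP convention place the Unix mode in the upper 16 bits of the external attributes when the entry's platform code is 3 (Unix). The following is a minimal sketch of that decoding in plain Java. The class and method names are hypothetical and are not part of the package.

```java
class UnixModeSketch {
    /** Platform code the ZIP format uses for Unix. */
    static final int PLATFORM_UNIX = 3;

    /**
     * Returns the Unix mode held in the upper 16 bits of the external
     * attributes, or 0 if the entry was not created on a Unix platform.
     */
    static int unixMode(int platform, long externalAttributes) {
        if (platform != PLATFORM_UNIX) {
            return 0; // other platforms use the attribute bits differently
        }
        return (int) ((externalAttributes >> 16) & 0xFFFF);
    }

    public static void main(String[] args) {
        // Made-up sample value: a regular file with permissions rw-r--r--.
        long externalAttributes = 0100644L << 16;
        int mode = unixMode(PLATFORM_UNIX, externalAttributes);
        // Mask off the file type bits to keep only the permission bits.
        System.out.println(Integer.toOctalString(mode & 0777)); // prints 644
    }
}
```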

ZipArchiveInputStream vs ZipFile

ZIP archives store archive entries in sequence and contain a registry of all entries at the very end of the archive. It is acceptable for an archive to contain several entries of the same name and have the registry (called the central directory) decide which entry is actually to be used (if any).

In addition, the ZIP format stores certain information only inside the central directory and not together with the entry itself:

  • internal and external attributes
  • different or additional extra fields

This means the ZIP format cannot really be parsed correctly while reading a non-seekable stream, which is what ZipArchiveInputStream is forced to do. As a result ZipArchiveInputStream

  • may return entries that are not part of the central directory at all and shouldn't be considered part of the archive.
  • may return several entries with the same name.
  • will not return internal or external attributes.
  • may return incomplete extra field data.

ZipArchiveInputStream shares these limitations with java.util.zip.ZipInputStream.

ZipFile is able to read the central directory first and provide correct and complete information on any ZIP archive.

If possible, you should always prefer ZipFile over ZipArchiveInputStream.
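
The effect of reading without the central directory can be observed with the JDK's own classes, which share the limitation as noted above. The sketch below uses java.util.zip rather than the Compress classes so that it runs without extra dependencies. It relies on the fact that an entry comment is stored only in the central directory, so the streaming reader is expected to report no comment while the file-based reader returns it. The entry name and comment are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class CentralDirectoryDemo {
    static final String COMMENT = "stored only in the central directory";

    /** Builds a small archive in memory with one commented entry. */
    static byte[] buildArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry("hello.txt");
            entry.setComment(COMMENT);
            out.putNextEntry(entry);
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the first entry sequentially, without the central directory. */
    static String commentSeenByStream(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            return in.getNextEntry().getComment();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the same entry via the central directory of a seekable file. */
    static String commentSeenByZipFile(byte[] archive) {
        try {
            Path tmp = Files.createTempFile("central-directory-demo", ".zip");
            try {
                Files.write(tmp, archive);
                try (ZipFile zip = new ZipFile(tmp.toFile())) {
                    return zip.getEntry("hello.txt").getComment();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] archive = buildArchive();
        System.out.println("stream:   " + commentSeenByStream(archive));
        System.out.println("zip file: " + commentSeenByZipFile(archive));
    }
}
```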

Extra Fields

Inside a ZIP archive, additional data can be attached to each entry. The java.util.zip.ZipEntry class provides access to this via the get/setExtra methods as arrays of bytes.

In fact, the extra data is supposed to be more structured than that, and the ZIP package of Compress provides access to the structured data as ExtraField instances. Only a subset of all defined extra field formats is supported by the package; any other extra field will be stored as UnrecognizedExtraField.
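
To make the structure concrete: the raw extra data is a sequence of records, each consisting of a two-byte little-endian header ID, a two-byte little-endian data length and then the data itself. The following is a minimal sketch of splitting such a byte array (for example one returned by ZipEntry.getExtra) into records. It is not the parser used by the package, and the class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RawExtraFields {
    /** One extra field record: its header ID and its payload. */
    static final class Field {
        final int headerId;
        final byte[] data;

        Field(int headerId, byte[] data) {
            this.headerId = headerId;
            this.data = data;
        }
    }

    /** Reads an unsigned 16-bit little-endian value. */
    static int u16(byte[] b, int offset) {
        return (b[offset] & 0xFF) | ((b[offset + 1] & 0xFF) << 8);
    }

    /** Splits raw extra data into records; fewer than 4 trailing bytes are ignored. */
    static List<Field> parse(byte[] extra) {
        List<Field> fields = new ArrayList<>();
        int pos = 0;
        while (pos + 4 <= extra.length) {
            int headerId = u16(extra, pos);
            int length = u16(extra, pos + 2);
            int start = pos + 4;
            if (start + length > extra.length) {
                throw new IllegalArgumentException("truncated extra field at offset " + pos);
            }
            fields.add(new Field(headerId, Arrays.copyOfRange(extra, start, start + length)));
            pos = start + length;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample: header ID 0x7075 with two data bytes, then
        // header ID 0x5455 with no data.
        byte[] extra = {0x75, 0x70, 0x02, 0x00, 0x01, 0x02, 0x55, 0x54, 0x00, 0x00};
        for (Field f : parse(extra)) {
            System.out.println(Integer.toHexString(f.headerId) + " -> " + f.data.length + " byte(s)");
        }
    }
}
```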

Encoding

Traditionally the ZIP archive format uses CodePage 437 as the encoding for file names, which is not sufficient for many international character sets.
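
Whether a given name fits into CodePage 437 can be checked with the JDK's charset support. This is a small sketch, assuming the running JDK ships the IBM437 charset (it is not one of the charsets every Java platform must provide). The class name and the sample file names are made up.

```java
import java.nio.charset.Charset;

class Cp437Check {
    /** Returns true if every character of the name exists in CodePage 437. */
    static boolean encodableAsCp437(String name) {
        // "IBM437" is the JDK's canonical name for CodePage 437.
        return Charset.forName("IBM437").newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String[] names = {"readme.txt", "caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"};
        for (String name : names) {
            System.out.println(name + " encodable as CP437: " + encodableAsCp437(name));
        }
    }
}
```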

Over time different archivers have chosen different ways to work around the limitation - the java.util.zip package simply uses UTF-8 as its encoding, for example.

Ant has been offering the encoding attribute of the zip and unzip task as a way to explicitly specify the encoding to use (or expect) since Ant 1.4. It defaults to the platform's default encoding for zip and UTF-8 for jar and other jar-like tasks (war, ear, ...) as well as the unzip family of tasks.

More recent versions of the ZIP specification introduce something called the "language encoding flag" which can be used to signal that a file name has been encoded using UTF-8. All ZIP archives written by Compress will set this flag if the encoding has been set to UTF-8. Our interoperability tests with existing archivers didn't show any ill effects (in fact, most archivers ignore the flag to date), but if you encounter problems you can turn off the "language encoding flag" by setting the attribute useLanguageEncodingFlag to false on the ZipArchiveOutputStream.
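
In the ZIP specification the language encoding flag is bit 11 of the two-byte "general purpose bit flag", which sits at byte offset 6 of each local file header. The sketch below checks that bit on the raw bytes of a header. It only illustrates the on-disk layout and does not show how the package reads the flag; the class name is hypothetical.

```java
class LanguageEncodingFlag {
    /** Local file header signature "PK\3\4", read as a little-endian int. */
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;
    /** Bit 11 of the general purpose bit flag. */
    static final int UTF8_NAMES_FLAG = 1 << 11;

    /** Returns true if the local file header marks its name as UTF-8. */
    static boolean isSet(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        int signature = (header[0] & 0xFF)
                | ((header[1] & 0xFF) << 8)
                | ((header[2] & 0xFF) << 16)
                | ((header[3] & 0xFF) << 24);
        if (signature != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        // Bytes 4-5 hold "version needed to extract", bytes 6-7 the flags.
        int flags = (header[6] & 0xFF) | ((header[7] & 0xFF) << 8);
        return (flags & UTF8_NAMES_FLAG) != 0;
    }

    public static void main(String[] args) {
        // Made-up header prefix with flags 0x0800 (only bit 11 set).
        byte[] header = {0x50, 0x4b, 0x03, 0x04, 20, 0, 0x00, 0x08};
        System.out.println(isSet(header)); // prints true
    }
}
```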

The ZipFile and ZipArchiveInputStream classes will recognize the language encoding flag and ignore the encoding set in the constructor if it has been found.

The InfoZIP developers have introduced new ZIP extra fields that can be used to add an additional UTF-8 encoded file name to the entry's metadata. Most archivers ignore these extra fields. ZipArchiveOutputStream supports an option createUnicodeExtraFields which makes it write these extra fields either for all entries ("always") or only for those whose name cannot be encoded using the specified encoding ("not-encodeable"). It defaults to "never" since the extra fields create bigger archives.

The fallbackToUTF8 attribute of ZipArchiveOutputStream can be used to create archives that use the specified encoding in the majority of cases but UTF-8 and the language encoding flag for filenames that cannot be encoded using the specified encoding.
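
The createUnicodeExtraFields and fallbackToUTF8 options can be read as two per-entry decisions that both depend on whether the name can be encoded in the configured encoding. The following sketch restates the rules from the text in plain Java. It is an illustration under that reading, not the package's implementation, and the enum is a local stand-in for the three option values named above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NameEncodingSketch {
    /** Local stand-in for the "always", "never" and "not-encodeable" values. */
    enum UnicodeExtraFieldPolicy { ALWAYS, NEVER, NOT_ENCODEABLE }

    static boolean canEncode(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    /** Charset used for the name itself, with or without UTF-8 fallback. */
    static Charset charsetForName(String name, Charset configured, boolean fallbackToUtf8) {
        if (fallbackToUtf8 && !canEncode(name, configured)) {
            return StandardCharsets.UTF_8; // would also set the language encoding flag
        }
        return configured;
    }

    /** Whether a Unicode extra field would be written for this name. */
    static boolean writeUnicodeExtraField(String name, Charset configured,
                                          UnicodeExtraFieldPolicy policy) {
        switch (policy) {
            case ALWAYS:
                return true;
            case NOT_ENCODEABLE:
                return !canEncode(name, configured);
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        Charset configured = StandardCharsets.ISO_8859_1; // sample choice
        String name = "\u65e5\u672c\u8a9e.txt";           // not encodable in ISO-8859-1
        System.out.println(charsetForName(name, configured, true));
        System.out.println(writeUnicodeExtraField(name, configured,
                UnicodeExtraFieldPolicy.NOT_ENCODEABLE));
    }
}
```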

The ZipFile and ZipArchiveInputStream classes recognize the Unicode extra fields by default and read the file name information from them, unless you set the constructor parameter scanForUnicodeExtraFields to false.

Recommendations for Interoperability

The optimal setting of flags depends on the archivers you expect as consumers/producers of the ZIP archives. Below are some test results which may be superseded with later versions of each tool.

  • The java.util.zip package used by the jar executable or to read jars from your CLASSPATH reads and writes UTF-8 names; it doesn't set or recognize any flags or Unicode extra fields.
  • 7Zip writes CodePage 437 by default but uses UTF-8 and the language encoding flag when writing entries that cannot be encoded as CodePage 437 (similar to the zip task with fallbacktoUTF8 set to true). It recognizes the language encoding flag when reading and ignores the Unicode extra fields.
  • WinZIP writes CodePage 437 and uses Unicode extra fields by default. It recognizes the Unicode extra field and the language encoding flag when reading.
  • Windows' "compressed folder" feature doesn't recognize any flag or extra field and creates archives using the platform's default encoding - and expects archives to be in that encoding when reading them.
  • InfoZIP based tools can recognize and write both; this is a compile-time option and depends on the platform, so your mileage may vary.
  • PKWARE zip tools recognize both and prefer the language encoding flag. They create archives using CodePage 437 if possible and UTF-8 plus the language encoding flag for file names that cannot be encoded as CodePage 437.

So, what to do?

If you are creating jars, then java.util.zip is your main consumer. We recommend you set the encoding to UTF-8 and keep the language encoding flag enabled. The flag won't help or hurt java.util.zip but archivers that support it will show the correct file names.

For maximum interoperability it is probably best to set the encoding to UTF-8, enable the language encoding flag and create Unicode extra fields when writing ZIPs. Such archives should be extracted correctly by java.util.zip, 7Zip, WinZIP, PKWARE tools and most likely InfoZIP tools. They will, however, be unusable with Windows' "compressed folders" feature and bigger than archives without the Unicode extra fields.

If Windows' "compressed folders" is your primary consumer, then your best option is to explicitly set the encoding to the target platform. You may want to enable creation of Unicode extra fields so the tools that support them will extract the file names correctly.