User Guide for Commons "Text"

Description

The Commons Text library provides additions to the standard JDK's text handling. Our goal is to provide a consistent set of tools for processing text, from computing distances between Strings to efficiently escaping Strings in various ways.

Package org.apache.commons.text

Originally the text package was added in Commons Lang 2.2. However, its new home is here. It provides, amongst other classes, a replacement for StringBuffer named StrBuilder, a class for substituting variables within a String named StrSubstitutor and a replacement for StringTokenizer named StrTokenizer. While somewhat ungainly, the Str prefix has been used to ensure we don't clash with any current or future standard Java classes.

Beyond the text utilities ported over from Commons Lang, we have also included various string similarity and distance functions. Lastly, there are utilities for computing the differences between bodies of text so that those differences can be viewed.

Class StringEscapeUtils

As of Lang 3.5, StringEscapeUtils and StrTokenizer have moved into Text. StringEscapeUtils contains methods to escape and unescape Java, JavaScript, HTML and XML. It is worth noting that the package org.apache.commons.text.translate holds the functionality underpinning StringEscapeUtils: the mappings, and the translations between such mappings, used for String escaping. StrTokenizer is an improved alternative to java.util.StringTokenizer. Text also provides ways to generate pieces of text, such as might be used for default passwords.

Class StringSubstitutor

The simplest example is to use this class to replace Java System properties. For example:

              
final String message = StringSubstitutor.replaceSystemProperties(
    "You are running with java.version = ${java.version} and os.name = ${os.name}.");
        

For details see StringSubstitutor.
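To make the idea of ${name} substitution concrete, the following is a minimal JDK-only sketch, not the StringSubstitutor implementation. The class name SubstitutionSketch and its replace method are hypothetical, and the sketch assumes that unresolved variables are simply left in place.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the real StringSubstitutor is far more capable.
final class SubstitutionSketch {
    // Matches ${name}; group 1 captures the variable name.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with resolver.apply(name); unresolved names are left untouched. */
    static String replace(String template, Function<String, String> resolver) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the literal text between the previous match and this one.
            out.append(template, last, matcher.start());
            String value = resolver.apply(matcher.group(1));
            out.append(value != null ? value : matcher.group());
            last = matcher.end();
        }
        out.append(template, last, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Resolve variables against the Java System properties.
        System.out.println(replace(
            "You are running with java.version = ${java.version} and os.name = ${os.name}.",
            System::getProperty));
    }
}
```

Passing the resolver as a function keeps the sketch independent of where values come from; a Map's get method or System::getenv would work equally well.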

Use a StringSubstitutorReader to avoid reading a whole file into memory as a String to perform string substitution, for example, when a Servlet filters a file to a client.

To build a default full-featured substitutor, use StringSubstitutor.createInterpolator().

The available substitutions are defined in org.apache.commons.text.lookup.StringLookupFactory.

Similarity and Distance

The org.apache.commons.text.similarity package contains various mechanisms for calculating "similarity scores" as well as "edit distances" between Strings. Note that the difference between a "similarity score" and a "distance function" is that a distance function meets the following qualifications:

  • d(x,y) >= 0, non-negativity or separation axiom
  • d(x,y) == 0, if and only if, x == y
  • d(x,y) == d(y,x), symmetry, and
  • d(x,z) <= d(x,y) + d(y,z), the triangle inequality
whereas a "similarity score" need not satisfy all such properties. It is, however, fairly easy to "normalize" a similarity score to manufacture an "edit distance."
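As an illustration of turning a similarity score into a distance, here is a minimal JDK-only sketch of a Jaccard similarity over the sets of characters in two strings, with the distance taken as 1 minus the similarity. The class JaccardSketch is hypothetical and is not the library's implementation; treating two empty strings as fully similar is an assumption of the sketch. Note that this distance is a metric on the character sets rather than on the strings themselves: two different strings with the same characters, such as "ab" and "ba", are at distance 0.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the library's JaccardSimilarity or JaccardDistance.
final class JaccardSketch {
    /** Returns the set of distinct characters in s. */
    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    /** |A intersect B| / |A union B|; assumed to be 1.0 when both strings are empty. */
    static double similarity(String a, String b) {
        Set<Character> left = chars(a);
        Set<Character> right = chars(b);
        if (left.isEmpty() && right.isEmpty()) {
            return 1.0;
        }
        Set<Character> intersection = new HashSet<>(left);
        intersection.retainAll(right);
        Set<Character> union = new HashSet<>(left);
        union.addAll(right);
        return (double) intersection.size() / union.size();
    }

    /** Normalizes the similarity score into a distance in the range [0, 1]. */
    static double distance(String a, String b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        String x = "night", y = "nacht", z = "nicht";
        System.out.println("d(x,y) = " + distance(x, y));
        System.out.println("symmetric: " + (distance(x, y) == distance(y, x)));
        System.out.println("triangle:  " + (distance(x, z) <= distance(x, y) + distance(y, z)));
    }
}
```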

The "edit distances" that we currently support are:

  • Cosine Distance,
  • Hamming Distance,
  • Jaccard Distance,
  • Jaro-Winkler Distance,
  • Levenshtein Distance, and
  • Longest Common Subsequence Distance,

and the "similarity scores" that we support are:

  • Cosine Similarity,
  • Fuzzy Score Similarity,
  • Jaccard Similarity,
  • Jaro-Winkler Similarity, and
  • Longest Common Subsequence Similarity.

Text diff'ing

The org.apache.commons.text.diff package contains code for computing diffs between strings. The initial implementation of the Myers algorithm was adapted from the commons-collections sequence package.

Package org.apache.commons.text.diff

Provides algorithms for computing diffs between strings.

Package org.apache.commons.text.lookup

Provides algorithms for looking up strings used by a StringSubstitutor. Standard lookups are defined in StringLookupFactory and the associated DefaultStringLookup enum.

The example below demonstrates use of the default lookups for StringSubstitutor in order to construct a complex string.

NOTE: The list of lookups available by default changed in version 1.10.0. See the documentation for StringLookupFactory for details and instructions on how to reproduce the previous behavior.

final StringSubstitutor interpolator = StringSubstitutor.createInterpolator();
final String text = interpolator.replace(
    "Base64 Decoder:        ${base64Decoder:SGVsbG9Xb3JsZCE=}\n" +
    "Base64 Encoder:        ${base64Encoder:HelloWorld!}\n" +
    "Java Constant:         ${const:java.awt.event.KeyEvent.VK_ESCAPE}\n" +
    "Date:                  ${date:yyyy-MM-dd}\n" +
    "Environment Variable:  ${env:USERNAME}\n" +
    "File Content:          ${file:UTF-8:src/test/resources/document.properties}\n" +
    "Java:                  ${java:version}\n" +
    "Local host:            ${localhost:canonical-name}\n" +
    "Loopback address:      ${loopbackAddress:canonical-name}\n" +
    "Properties File:       ${properties:src/test/resources/document.properties::mykey}\n" +
    "Resource Bundle:       ${resourceBundle:org.apache.commons.text.example.testResourceBundleLookup:mykey}\n" +
    "System Property:       ${sys:user.dir}\n" +
    "URL Decoder:           ${urlDecoder:Hello%20World%21}\n" +
    "URL Encoder:           ${urlEncoder:Hello World!}\n" +
    "XML Decoder:           ${xmlDecoder:&lt;element&gt;}\n" +
    "XML Encoder:           ${xmlEncoder:<element>}\n" +
    "XML XPath:             ${xml:src/test/resources/document.xml:/root/path/to/node}\n"
);
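The ${prefix:key} syntax in the example above can be read as "pick a lookup by prefix, then ask it for the key". The following JDK-only sketch shows that dispatch idea for four of the prefixes. It is not how StringLookupFactory or the interpolator is implemented; the class LookupSketch and its members are hypothetical, and leaving unknown prefixes untouched is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the library's lookup machinery.
final class LookupSketch {
    // Matches ${prefix:key}; group 1 is the lookup prefix, group 2 is the key.
    private static final Pattern LOOKUP = Pattern.compile("\\$\\{([A-Za-z0-9]+):([^}]*)\\}");

    private final Map<String, Function<String, String>> lookups = new LinkedHashMap<>();

    LookupSketch() {
        lookups.put("sys", System::getProperty);
        lookups.put("env", System::getenv);
        lookups.put("base64Encoder",
            key -> Base64.getEncoder().encodeToString(key.getBytes(StandardCharsets.UTF_8)));
        lookups.put("base64Decoder",
            key -> new String(Base64.getDecoder().decode(key), StandardCharsets.UTF_8));
    }

    /** Resolves every ${prefix:key}; unknown prefixes and null results are left as-is. */
    String replace(String template) {
        Matcher matcher = LOOKUP.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(template, last, matcher.start());
            Function<String, String> lookup = lookups.get(matcher.group(1));
            String value = lookup == null ? null : lookup.apply(matcher.group(2));
            out.append(value != null ? value : matcher.group());
            last = matcher.end();
        }
        out.append(template, last, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(new LookupSketch().replace(
            "Base64 Encoder:  ${base64Encoder:HelloWorld!}\n"
            + "System Property: ${sys:user.dir}"));
    }
}
```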
      

Package org.apache.commons.text.similarity

Provides algorithms for string similarity.

The algorithms that implement the EditDistance interface follow the same simple principle: the more similar (closer) the strings are, the lower the distance. For example, the words house and hose are closer than house and trousers.
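The house, hose and trousers comparison can be checked with a textbook Levenshtein distance. The sketch below is a plain dynamic-programming version written against the JDK only; the class LevenshteinSketch is hypothetical and is not the library's LevenshteinDistance.

```java
// Hypothetical illustration only; a textbook two-row dynamic-programming Levenshtein distance.
final class LevenshteinSketch {
    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int distance(CharSequence left, CharSequence right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        // Transforming the empty prefix of left into right[0..j) takes j insertions.
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // i deletions to reach the empty prefix of right
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                    Math.min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                       // match or substitution
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("house", "hose"));     // prints 1
        System.out.println(distance("house", "trousers")); // prints 4
    }
}
```

With this measure, house is one edit away from hose but four edits away from trousers, which matches the "closer means lower" principle.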

The following algorithms are available at the moment:

  • CosineDistance
  • CosineSimilarity
  • FuzzyScore
  • HammingDistance
  • JaroWinklerDistance
  • JaroWinklerSimilarity
  • LevenshteinDistance
  • LongestCommonSubsequenceDistance

CosineDistance uses a RegexTokenizer, a regular expression tokenizer that splits text into word tokens (\w+). The behavior of LevenshteinDistance can be changed to take a maximum threshold into consideration.
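To show what a cosine distance over \w+ tokens means, here is a minimal JDK-only sketch: each text becomes a vector of token counts, and the distance is 1 minus the cosine of the angle between the two vectors. The class CosineSketch is hypothetical and does not reproduce the library's CosineDistance; in particular, returning a similarity of 0 when either text has no tokens is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the library's CosineDistance.
final class CosineSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Counts how often each \w+ token occurs in the text. */
    static Map<String, Integer> termFrequencies(CharSequence text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    /** Euclidean length of a token-count vector. */
    private static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between the two token-count vectors. */
    static double similarity(CharSequence a, CharSequence b) {
        Map<String, Integer> left = termFrequencies(a);
        Map<String, Integer> right = termFrequencies(b);
        double dot = 0;
        for (Map.Entry<String, Integer> entry : left.entrySet()) {
            Integer other = right.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double norms = norm(left) * norm(right);
        return norms == 0 ? 0.0 : dot / norms; // assumption: no tokens means no similarity
    }

    static double distance(CharSequence a, CharSequence b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        System.out.println(distance("the quick brown fox", "the quick red fox"));
        System.out.println(distance("the quick brown fox", "lazy dogs sleep"));
    }
}
```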

Package org.apache.commons.text.translate.*

An API for creating text translation routines from a set of smaller building blocks. It was initially created to make it possible for users to customize the rules in the StringEscapeUtils class.

These classes are immutable, and therefore thread-safe.
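The "smaller building blocks" idea can be sketched without the library: each block tries to translate the input at a given position and reports how many characters it consumed, and blocks are combined by trying them in order. Everything in the sketch below (TranslatorSketch, Translator, lookup, aggregate, apply) is a hypothetical name chosen for illustration; it does not reproduce the package's classes and ignores details such as surrogate pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the classes of org.apache.commons.text.translate.
final class TranslatorSketch {
    /** A building block: returns the number of chars consumed at index, or 0 if not handled. */
    interface Translator {
        int translate(CharSequence input, int index, StringBuilder out);
    }

    private static boolean matchesAt(CharSequence input, int index, String key) {
        if (index + key.length() > input.length()) {
            return false;
        }
        for (int k = 0; k < key.length(); k++) {
            if (input.charAt(index + k) != key.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    /** Replaces fixed keys with fixed values; the table is copied so the block never changes. */
    static Translator lookup(Map<String, String> table) {
        final Map<String, String> copy = new LinkedHashMap<>(table);
        return (input, index, out) -> {
            for (Map.Entry<String, String> entry : copy.entrySet()) {
                String key = entry.getKey();
                if (!key.isEmpty() && matchesAt(input, index, key)) {
                    out.append(entry.getValue());
                    return key.length();
                }
            }
            return 0;
        };
    }

    /** Tries each block in order and stops at the first one that consumes input. */
    static Translator aggregate(Translator... parts) {
        final Translator[] copy = parts.clone();
        return (input, index, out) -> {
            for (Translator part : copy) {
                int consumed = part.translate(input, index, out);
                if (consumed > 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    /** Walks the input, copying through any character that no block handles. */
    static String apply(Translator translator, CharSequence input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int consumed = translator.translate(input, i, out);
            if (consumed == 0) {
                out.append(input.charAt(i));
                i++;
            } else {
                i += consumed;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Translator basic = lookup(Map.of("&", "&amp;", "<", "&lt;", ">", "&gt;"));
        Translator quotes = lookup(Map.of("\"", "&quot;"));
        System.out.println(apply(aggregate(basic, quotes), "<a href=\"x\">Tom & Jerry</a>"));
    }
}
```

Because each block holds only a private copy of its configuration and no mutable state, a composed translator in this sketch can be shared between threads, which mirrors the immutability noted above.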