Commons Text

User guide for Commons "Text"

Description

The Commons Text library provides additions to the standard JDK's text handling. Our goal is to provide a consistent set of tools for processing text, from computing distances between Strings to efficiently escaping Strings of various types.

text.*

Originally the text package was added in Commons Lang 2.2. However, its new home is here. It provides, amongst other classes, a replacement for StringBuffer named StrBuilder, a class for substituting variables within a String named StrSubstitutor and a replacement for StringTokenizer named StrTokenizer. While somewhat ungainly, the Str prefix has been used to ensure we don't clash with any current or future standard Java classes.
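
To give a feel for what substituting variables within a String means, here is a minimal sketch in plain JDK code. It is not the StrSubstitutor implementation: the class name VariableSubstitutionSketch, the replace method and the ${name} placeholder syntax are assumptions made for illustration only.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Commons Text: a tiny ${name} substitutor.
final class VariableSubstitutionSketch {
    // Assumed placeholder syntax: ${key}
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${key} with its mapped value; unknown keys are left untouched.
    static String replace(String template, Map<String, String> values) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the literal text between the previous match and this one.
            out.append(template, last, matcher.start());
            String value = values.get(matcher.group(1));
            out.append(value != null ? value : matcher.group());
            last = matcher.end();
        }
        // Copy whatever follows the final placeholder.
        out.append(template, last, template.length());
        return out.toString();
    }
}
```

Given a map that binds name to Ada, calling replace("Hello ${name}!", map) would produce "Hello Ada!", while a placeholder with no binding stays in the output as written.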

Beyond the text utilities ported over from lang, we have also included various string similarity and distance functions. Lastly, there are utilities for computing the differences between bodies of text so that those differences can be viewed.

StringEscapeUtils

From Lang 3.5, we have moved StringEscapeUtils and StrTokenizer into Text. StringEscapeUtils contains methods to escape and unescape Java, JavaScript, HTML and XML. It is worth noting that the package org.apache.commons.text.translate holds the functionality underpinning StringEscapeUtils, with mappings and translations between such mappings for the sake of doing String escaping. StrTokenizer is an improved alternative to java.util.StringTokenizer. Text also provides ways to generate pieces of text, such as might be used for default passwords.
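
As a rough illustration of what escaping and unescaping involve, the sketch below handles a handful of Java escape sequences using only the JDK. It is not the StringEscapeUtils implementation; the class JavaEscapeSketch and its two methods are hypothetical names, and the set of characters covered is deliberately small.

```java
// Hypothetical sketch, not part of Commons Text: escapes a few Java special characters.
final class JavaEscapeSketch {
    // Turns backslash, double quote, newline, tab and carriage return into escape sequences.
    static String escapeJava(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break;
                case '\t': out.append("\\t");  break;
                case '\r': out.append("\\r");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses escapeJava; an unknown escape simply keeps the character after the backslash.
    static String unescapeJava(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                char next = input.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case 'r': out.append('\r'); break;
                    default:  out.append(next); // covers \\ and \"
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The useful property to check in any such pair is the round trip: unescaping an escaped String should return the original input.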

Similarity and Distance

The org.apache.commons.text.similarity package contains various mechanisms for calculating "similarity scores" as well as "edit distances" between Strings. Note that the difference between a "similarity score" and a "distance function" is that a distance function meets the following qualifications:

  • d(x,y) >= 0, non-negativity or separation axiom
  • d(x,y) == 0, if and only if, x == y
  • d(x,y) == d(y,x), symmetry, and
  • d(x,z) <= d(x,y) + d(y,z), the triangle inequality
whereas a "similarity score" need not satisfy all such properties. It is, however, fairly easy to "normalize" a similarity score to manufacture an "edit distance."
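
The sketch below illustrates both points with plain JDK code. The length of the longest common subsequence is a similarity score (a String compared with itself scores its own length, not 0), and one possible way to turn it into a distance is |x| + |y| - 2 * lcs(x, y). The class and method names are hypothetical, and this particular construction is an assumption chosen for illustration rather than a statement about how the library does it.

```java
// Hypothetical sketch: derive a distance from a similarity score and check the four properties.
final class SimilarityToDistanceSketch {
    // Similarity score: length of the longest common subsequence of a and b.
    static int lcsLength(String a, String b) {
        int[][] table = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                table[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? table[i - 1][j - 1] + 1
                        : Math.max(table[i - 1][j], table[i][j - 1]);
            }
        }
        return table[a.length()][b.length()];
    }

    // Distance built from the score: characters of a and b that are not in the common subsequence.
    static int distance(String a, String b) {
        return a.length() + b.length() - 2 * lcsLength(a, b);
    }

    // Checks the four properties listed above for one triple of Strings.
    static boolean satisfiesAxioms(String x, String y, String z) {
        int xy = distance(x, y);
        int yx = distance(y, x);
        int xz = distance(x, z);
        int yz = distance(y, z);
        boolean nonNegative = xy >= 0;
        boolean identity = (xy == 0) == x.equals(y);
        boolean symmetric = xy == yx;
        boolean triangle = xz <= xy + yz;
        return nonNegative && identity && symmetric && triangle;
    }
}
```

Checking a few triples this way does not prove the properties in general; it only shows what each one means in practice.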

The list of "edit distances" that we currently support follows:

  • Cosine Distance,
  • Hamming Distance,
  • Jaccard Distance,
  • Jaro Winkler Distance,
  • Levenshtein Distance,
  • Longest Common Subsequence Distance,
and the list of "similarity scores" that we support follows:
  • Cosine Similarity,
  • Fuzzy Score Similarity,
  • Jaccard Similarity, and
  • Longest Common Subsequence Similarity.
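
As a small example of one of the listed distances, the Hamming distance counts the positions at which two equal-length Strings differ. The sketch below is a plain JDK illustration under that definition; HammingSketch is a hypothetical name and the code is not taken from the library.

```java
// Hypothetical sketch of the Hamming distance; not the library's implementation.
final class HammingSketch {
    // Counts the positions at which the two Strings differ; both must have the same length.
    static int distance(String left, String right) {
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int differences = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                differences++;
            }
        }
        return differences;
    }
}
```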

Text diff'ing

The org.apache.commons.text.diff package contains code for doing diff between strings. The initial implementation of the Myers algorithm was adapted from the commons-collections sequence package.
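
To show what a diff between two strings produces, the sketch below builds a character-level edit script of keep, delete and insert steps. Note that it uses a simple longest-common-subsequence table rather than the Myers algorithm mentioned above, and that the class name DiffSketch and the textual command format are hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an LCS-based character diff (not the Myers algorithm).
final class DiffSketch {
    // Returns the steps that turn 'from' into 'to', one per character.
    static List<String> diff(String from, String to) {
        int n = from.length();
        int m = to.length();
        // lcs[i][j] = LCS length of from.substring(i) and to.substring(j)
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = from.charAt(i) == to.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        List<String> script = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (from.charAt(i) == to.charAt(j)) {
                script.add("keep " + from.charAt(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                // Dropping from[i] loses nothing from the best common subsequence.
                script.add("delete " + from.charAt(i));
                i++;
            } else {
                script.add("insert " + to.charAt(j));
                j++;
            }
        }
        while (i < n) {
            script.add("delete " + from.charAt(i++));
        }
        while (j < m) {
            script.add("insert " + to.charAt(j++));
        }
        return script;
    }
}
```

For the pair house and hose, this sketch keeps h and o, deletes u, then keeps s and e.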

text.diff.*

Provides algorithms for diff between strings.

text.similarity.*

Provides algorithms for string similarity.

The algorithms that implement the EditDistance interface follow the same simple principle: the more similar (closer) strings are, the lower the distance. For example, the words house and hose are closer than house and trousers.
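
The house, hose and trousers example can be checked with a small Levenshtein distance sketch written against the plain JDK. This is the textbook dynamic-programming formulation, not the library's code, and LevenshteinSketch is a hypothetical name.

```java
// Hypothetical sketch of the Levenshtein distance using two rolling rows.
final class LevenshteinSketch {
    // Minimum number of single-character insertions, deletions and substitutions.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        // Turning the empty prefix of a into a prefix of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows so 'previous' holds the row just computed.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }
}
```

With this sketch, house and hose are 1 edit apart while house and trousers are 4 edits apart, which matches the intuition that the first pair is closer.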

The following algorithms are available at the moment:

  • CosineDistance
  • CosineSimilarity
  • FuzzyScore
  • HammingDistance
  • JaroWinklerDistance
  • LevenshteinDistance
  • LongestCommonSubsequenceDistance

CosineDistance uses a RegexTokenizer, a regular expression tokenizer based on \w+. The behaviour of LevenshteinDistance can be changed to take a maximum threshold into consideration.
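
The sketch below shows how a \w+ tokenizer feeds a cosine computation: each text becomes a vector of word counts, and the distance is taken as one minus the cosine of the angle between the two vectors. It uses only the JDK, is case-sensitive, and is not the library's implementation; CosineSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a cosine distance over \w+ tokens.
final class CosineSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Splits the text into \w+ tokens and counts how often each one occurs.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean length of a count vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between the two count vectors; 0 if either text has no tokens.
    static double similarity(String left, String right) {
        Map<String, Integer> a = termCounts(left);
        Map<String, Integer> b = termCounts(right);
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0 ? 0 : dot / norms;
    }

    static double distance(String left, String right) {
        return 1.0 - similarity(left, right);
    }
}
```

Because punctuation is not matched by \w+, texts that differ only in punctuation or spacing end up with a distance of (approximately) zero in this sketch.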

text.translate.*

An API for creating text translation routines from a set of smaller building blocks. It was initially created to make it possible for the user to customize the rules in the StringEscapeUtils class.

These classes are immutable, and therefore thread-safe.
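
The building-block idea can be sketched with the plain JDK as follows. Each rule looks at the input at a given index and reports how many characters it consumed, and rules are combined so that the first one that matches wins. The Rule interface and the lookup, firstMatching and apply helpers are hypothetical names for this illustration and should not be read as the package's actual classes; the rules copy their inputs so that they stay immutable once built.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of composable, immutable translation rules.
final class TranslatorSketch {
    // One building block: translates the text at 'index' and returns the characters consumed (0 = no match).
    interface Rule {
        int translate(String input, int index, StringBuilder out);
    }

    // Table-driven rule; the table is copied so later changes to the caller's map have no effect.
    static Rule lookup(Map<String, String> table) {
        final Map<String, String> copy = Collections.unmodifiableMap(new LinkedHashMap<>(table));
        return (input, index, out) -> {
            for (Map.Entry<String, String> entry : copy.entrySet()) {
                String key = entry.getKey();
                if (!key.isEmpty() && input.startsWith(key, index)) {
                    out.append(entry.getValue());
                    return key.length();
                }
            }
            return 0;
        };
    }

    // Combines rules: the first rule that consumes input wins.
    static Rule firstMatching(Rule... rules) {
        final Rule[] copy = rules.clone();
        return (input, index, out) -> {
            for (Rule rule : copy) {
                int consumed = rule.translate(input, index, out);
                if (consumed > 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    // Runs a rule over the whole input, copying characters that no rule handles.
    static String apply(Rule rule, String input) {
        StringBuilder out = new StringBuilder();
        int index = 0;
        while (index < input.length()) {
            int consumed = rule.translate(input, index, out);
            if (consumed == 0) {
                out.append(input.charAt(index));
                consumed = 1;
            }
            index += consumed;
        }
        return out.toString();
    }
}
```

A user who wants custom escaping rules would, in this sketch, build one lookup rule for the standard entities, another for their own additions, and combine the two with firstMatching.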