The Commons Text Package
The Commons Text library provides additions to the standard JDK's text handling. Our goal is to provide a consistent set of tools for processing text in general, from computing distances between Strings to efficiently escaping Strings in various formats.
Originally the text package was added in Commons Lang 2.2. However, its new home is here. It provides, amongst other classes, a replacement for StringBuffer named StrBuilder, a class for substituting variables within a String named StrSubstitutor and a replacement for StringTokenizer named StrTokenizer. While somewhat ungainly, the Str prefix has been used to ensure we don't clash with any current or future standard Java classes.
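To make the idea of substituting variables within a String concrete, here is a minimal plain-JDK sketch. It is not the library's StrSubstitutor implementation; the class SubstitutionSketch and its replace method are hypothetical names used only for illustration, and the sketch assumes variables are written as ${name} and that unknown variables are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the library's StrSubstitutor:
// a plain-JDK sketch of ${name} variable substitution.
final class SubstitutionSketch {

    // Matches ${name}; group 1 captures the variable name.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    static String replace(String template, Map<String, String> values) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            // Assumption of this sketch: unknown variables are left as written.
            String replacement = (value != null) ? value : matcher.group();
            // quoteReplacement stops '$' and '\' in the value being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("animal", "quick brown fox");
        values.put("target", "lazy dog");
        System.out.println(replace("The ${animal} jumped over the ${target}.", values));
    }
}
```

The library class offers considerably more (configurable delimiters, default values, lookups); the sketch only shows the core replace-by-name step.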
Beyond the text utilities ported over from lang, we have also included various string similarity and distance functions. Lastly, there are utilities for computing the differences between bodies of text so that those differences can be viewed.
The org.apache.commons.text.similarity package contains various mechanisms for calculating "similarity scores" as well as "edit distances" between Strings. Note that the difference between a "similarity score" and a "distance function" is that a distance function meets the following qualifications:
The list of "edit distances" that we currently support follows:
Provides algorithms for diff between strings.
The initial implementation of the Myers algorithm was adapted from the commons-collections sequence package.
Provides algorithms for string similarity.
The algorithms that implement the EditDistance interface follow the same simple principle: the more similar (closer) two strings are, the lower the distance. For example, the words house and hose are closer than house and trousers.
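The house, hose and trousers example can be checked with a small plain-JDK sketch of the classic Levenshtein edit distance (insertions, deletions and substitutions each cost 1). This is an illustrative dynamic-programming version, not the library's code; the class EditDistanceSketch and its levenshtein method are hypothetical names.

```java
// Hypothetical helper, NOT the library's LevenshteinDistance:
// a two-row dynamic-programming sketch of the Levenshtein edit distance.
final class EditDistanceSketch {

    static int levenshtein(CharSequence left, CharSequence right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];

        // Distance from the empty prefix of 'left' to each prefix of 'right'.
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // deleting i characters yields the empty prefix
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("house", "hose"));     // prints 1
        System.out.println(levenshtein("house", "trousers")); // prints 4
    }
}
```

A distance of 1 against a distance of 4 matches the intuition that hose is closer to house than trousers is.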
The following algorithms are available at the moment:
CosineDistance uses a RegexTokenizer, a regular expression tokenizer that matches \w+. The behaviour of LevenshteinDistance can be changed to take a maximum threshold into consideration.
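As a rough illustration of how a \w+ tokenizer feeds a cosine-based distance, the plain-JDK sketch below splits each input into word tokens, counts them, and returns one minus the cosine similarity of the two count vectors. It is a sketch under stated assumptions rather than the library's implementation: the class CosineSketch and its methods are hypothetical names, and treating the distance as 1 minus the similarity is an assumption made here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the library's CosineDistance or RegexTokenizer.
final class CosineSketch {

    // Same token pattern as mentioned in the text: runs of word characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Counts how often each \w+ token occurs in the text.
    static Map<String, Integer> termCounts(CharSequence text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    private static double sumOfSquares(Map<String, Integer> counts) {
        double sum = 0;
        for (int count : counts.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    // Assumption of this sketch: distance = 1 - cosine similarity.
    static double cosineDistance(CharSequence left, CharSequence right) {
        Map<String, Integer> a = termCounts(left);
        Map<String, Integer> b = termCounts(right);

        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
        // Inputs without any tokens are treated as having similarity 0.
        double similarity = (norm == 0) ? 0 : dot / norm;
        return 1.0 - similarity;
    }

    public static void main(String[] args) {
        // Shared tokens give a distance below 1; no shared tokens give exactly 1.
        System.out.println(cosineDistance("the quick brown fox", "the quick red fox"));
        System.out.println(cosineDistance("apple pie", "orange juice"));
    }
}
```

Because of floating-point rounding, identical inputs give a distance very close to 0 rather than exactly 0 in this sketch.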