public class BeiderMorseEncoder extends Object implements StringEncoder
Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range of words.
This encoder is intentionally mutable to allow dynamic configuration through bean properties. As such, it
may not be thread-safe. If you require guaranteed thread-safe encoding, use PhoneticEngine
directly.
Encoding overview
Beider-Morse phonetic encoding is a multi-step process. Firstly, a table of rules is consulted to guess what
language the word comes from. For example, if it ends in "ault" then it infers that the word is French.
Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some runs of
letters can be pronounced in multiple ways, and a single run of letters may be potentially broken up into phonemes at
different places, so this stage results in a set of possible language-specific phonetic representations. Lastly, this
language-specific phonetic representation is processed by a table of rules that re-writes it phonetically taking into
account systematic pronunciation differences between languages, to move it towards a pan-Indo-European phonetic
representation. Again, sometimes there are multiple ways this could be done and sometimes things that can be
pronounced in several ways in the source language have only one way to represent them in this average phonetic
language, so the result is again a set of phonetic spellings.
Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated. In
this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final encoding.
Secondly, some names have standard prefixes, for example, "Mac/Mc" in Scottish (English) names. As
sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word is encoded once
with the prefix and once without it. The resulting encoding contains one and then the other result.
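
The hyphenated case can be sketched as follows. This is only an illustration of the behaviour described above, not the library's implementation: the class name HyphenatedNameSketch, the method encodeHyphenated and the encodeWord parameter are hypothetical, and the toy encoder merely lower-cases its input. The parentheses and hyphens follow the layout described under "Encoding format".

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.stream.Collectors;

class HyphenatedNameSketch {

    // encodeWord is a hypothetical stand-in for a single-word phonetic encoder.
    static String encodeHyphenated(String name, Function<String, String> encodeWord) {
        String[] parts = name.split("-");
        if (parts.length == 1) {
            // No hyphen: a single word is returned without parentheses.
            return encodeWord.apply(name);
        }
        // Encode each part separately, wrap it in parentheses, join with hyphens.
        return Arrays.stream(parts)
                .map(encodeWord)
                .map(code -> "(" + code + ")")
                .collect(Collectors.joining("-"));
    }

    public static void main(String[] args) {
        // Toy encoder (assumption): a real encoder would return pipe-joined spellings.
        Function<String, String> toy = w -> w.toLowerCase();
        System.out.println(encodeHyphenated("Smith-Jones", toy)); // prints (smith)-(jones)
    }
}
```

Passing the word encoder as a function keeps the combining step separate from the phonetic rules themselves.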
Encoding format
Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where there
are multiple possible phonetic representations, these are joined with a pipe (|) character. If multiple
hyphenated words were found, or if the word may contain a name prefix, each encoded word is placed in parentheses and
these blocks are then joined with hyphens. For example, "d'ortley" has a possible prefix. The form
without prefix encodes to "ortlaj|ortlej", while the form with prefix encodes to "dortlaj|dortlej". Thus, the full,
combined encoding is "(ortlaj|ortlej)-(dortlaj|dortlej)".
The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many
potential phonetic interpretations. For example, "Renault" encodes to
"rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult". The APPROX rules will tend to produce larger
encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word.
Downstream applications may wish to further process the encoding for indexing or lookup purposes, for example, by
splitting on pipe (|) and indexing under each of these alternatives.
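
A minimal sketch of such downstream processing is shown here, using only the standard library. The class PhoneticIndexSketch and its methods are hypothetical names chosen for illustration. It assumes that individual spellings contain only letters, so that hyphens, parentheses and pipes can be treated purely as separators.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class PhoneticIndexSketch {

    // Maps each phonetic spelling to the names that produced it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Splits an encoding into its individual phonetic spellings.
    static Set<String> alternatives(String encoding) {
        Set<String> result = new LinkedHashSet<>();
        // Hyphens separate parenthesised blocks; pipes separate alternatives.
        for (String block : encoding.split("-")) {
            String inner = block.replace("(", "").replace(")", "");
            for (String alt : inner.split("\\|")) {
                if (!alt.isEmpty()) {
                    result.add(alt);
                }
            }
        }
        return result;
    }

    // Indexes a name under every alternative found in its encoding.
    void add(String name, String encoding) {
        for (String key : alternatives(encoding)) {
            index.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(name);
        }
    }

    // Returns the names indexed under one phonetic spelling, or an empty set.
    Set<String> lookup(String key) {
        return index.getOrDefault(key, Set.of());
    }

    public static void main(String[] args) {
        PhoneticIndexSketch idx = new PhoneticIndexSketch();
        idx.add("Renault", "rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult");
        idx.add("d'ortley", "(ortlaj|ortlej)-(dortlaj|dortlej)");
        System.out.println(idx.lookup("rinalt"));  // prints [Renault]
        System.out.println(idx.lookup("dortlej")); // prints [d'ortley]
    }
}
```

To match a query, an application could encode it the same way and look up each of its alternatives in turn.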
Note: this version of the Beider-Morse encoding is equivalent to v3.4 of the reference implementation.
This class is not thread-safe.
Constructor and Description |
---|
BeiderMorseEncoder() |
Modifier and Type | Method and Description |
---|---|
Object | encode(Object source): Encodes an "Object" and returns the encoded content as an Object. |
String | encode(String source): Encodes a String and returns a String. |
NameType | getNameType(): Gets the name type currently in operation. |
RuleType | getRuleType(): Gets the rule type currently in operation. |
boolean | isConcat(): Discovers if multiple possible encodings are concatenated. |
void | setConcat(boolean concat): Sets how multiple possible phonetic encodings are combined. |
void | setMaxPhonemes(int maxPhonemes): Sets the maximum number of phonemes that shall be considered by the engine. |
void | setNameType(NameType nameType): Sets the type of name. |
void | setRuleType(RuleType ruleType): Sets the rule type to apply. |
public Object encode(Object source) throws EncoderException
Encodes an "Object" and returns the encoded content as an Object. The objects may be byte[] or Strings, depending on the implementation used.
Specified by: encode in interface Encoder
Parameters: source - an object to encode
Throws: EncoderException - if the encoder experiences a failure condition during the encoding process
public String encode(String source) throws EncoderException
Encodes a String and returns a String.
Specified by: encode in interface StringEncoder
Parameters: source - the String to encode
Throws: EncoderException - if there is an error condition during the encoding process
public NameType getNameType()
Gets the name type currently in operation.
public RuleType getRuleType()
Gets the rule type currently in operation.
public boolean isConcat()
Discovers if multiple possible encodings are concatenated.
public void setConcat(boolean concat)
Sets how multiple possible phonetic encodings are combined.
Parameters: concat - true if multiple encodings are to be combined with a '|', false if just the first one is to be considered
public void setNameType(NameType nameType)
Sets the type of name. Use NameType.GENERIC unless you specifically want phonetic encodings optimized for Ashkenazi or Sephardic Jewish family names.
Parameters: nameType - the NameType in use
public void setRuleType(RuleType ruleType)
Sets the rule type to apply.
Parameters: ruleType - RuleType.APPROX or RuleType.EXACT for approximate or exact phonetic matches
public void setMaxPhonemes(int maxPhonemes)
Sets the maximum number of phonemes that shall be considered by the engine.
Parameters: maxPhonemes - the maximum number of phonemes returned by the engine
Copyright © 2002–2020 The Apache Software Foundation. All rights reserved.