org.apache.commons.codec.language.bm.BeiderMorseEncoder

All Implemented Interfaces: Encoder, StringEncoder

public class BeiderMorseEncoder extends Object implements StringEncoder

Encodes strings into their Beider-Morse phonetic encoding.

Beider-Morse phonetic encodings are optimized for family names. However, they may be useful for a wide range of words.

This encoder is intentionally mutable to allow dynamic configuration through bean properties. As a result, it may not be thread-safe. If you require guaranteed thread-safe encoding, use PhoneticEngine directly.
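The advice to use PhoneticEngine directly can be sketched as follows. This is a minimal sketch, assuming Commons Codec is on the classpath; the class name SharedPhoneticEngineExample and the chosen configuration values are illustrative, not something this page prescribes. The engine receives its configuration once, at construction, so there are no setters to race on.

```java
import org.apache.commons.codec.language.bm.NameType;
import org.apache.commons.codec.language.bm.PhoneticEngine;
import org.apache.commons.codec.language.bm.RuleType;

// Hypothetical example class: one engine, configured once, shared by all callers.
final class SharedPhoneticEngineExample {

    // Configuration is passed to the constructor and never changed afterwards.
    // GENERIC / APPROX / concat=true are illustrative choices.
    private static final PhoneticEngine ENGINE =
            new PhoneticEngine(NameType.GENERIC, RuleType.APPROX, true);

    static String encode(String name) {
        return ENGINE.encode(name);
    }

    public static void main(String[] args) {
        // Prints the pipe-separated phonetic alternatives for the input.
        System.out.println(encode("Renault"));
    }
}
```

If different rule or name types are needed, one option is to build a separate engine per configuration rather than reconfiguring a shared BeiderMorseEncoder.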

Encoding overview

Beider-Morse phonetic encoding is a multi-step process. First, a table of rules is consulted to guess which language the word comes from. For example, if the word ends in "ault", the encoder infers that it is French. Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some runs of letters can be pronounced in multiple ways, and a single run of letters may be broken up into phonemes at different places, so this stage results in a set of possible language-specific phonetic representations. Finally, this set of language-specific phonetic representations is processed by a table of rules that rewrites it phonetically, taking into account systematic pronunciation differences between languages, to move it towards a pan-Indo-European phonetic representation. Again, there are sometimes multiple ways to do this, and some things that can be pronounced in several ways in the source language have only one representation in this average phonetic language, so the result is again a set of phonetic spellings.

Some names are treated as having multiple parts, for one of two reasons. First, they may be hyphenated. In this case, each hyphenated word is encoded individually, and the results are combined end-to-end for the final encoding. Second, some names have standard prefixes, for example "Mac/Mc" in Scottish (English) names. Because it is sometimes ambiguous whether the prefix is intended or is an accident of spelling, the word is encoded once with the prefix and once without it. The resulting encoding contains one result followed by the other.

Encoding format

Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where there are multiple possible phonetic representations, these are joined with a pipe (|) character. If multiple hyphenated words were found, or if the word may contain a name prefix, each encoded word is placed in parentheses and these blocks are then joined with hyphens. For example, "d'ortley" has a possible prefix. The form without the prefix encodes to "ortlaj|ortlej", while the form with the prefix encodes to "dortlaj|dortlej". Thus, the full, combined encoding is "(ortlaj|ortlej)-(dortlaj|dortlej)".

The encoded forms are often considerably longer than the input strings. This is because a single input may have many potential phonetic interpretations. For example, "Renault" encodes to "rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult". The APPROX rules tend to produce larger encodings because they consider a wider range of possible, approximate phonetic interpretations of the original word. Downstream applications may wish to process the encoding further for indexing or lookup purposes, for example by splitting on pipe (|) and indexing under each of these alternatives.
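The splitting step mentioned above might look like the following. This is a minimal sketch using only the standard library; the class BmIndexKeys and its method indexKeys are hypothetical helpers, not part of Commons Codec. It assumes the format described in this section: alternatives joined with a pipe, and multi-part results wrapped in parentheses and joined with hyphens, with no hyphens or parentheses inside an individual spelling.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: flattens an encoded string into distinct index keys.
final class BmIndexKeys {

    static Set<String> indexKeys(String encoded) {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> keys = new LinkedHashSet<>();
        // "(a|b)-(c|d)" becomes the blocks "(a|b)" and "(c|d)";
        // a plain "a|b" stays a single block.
        for (String block : encoded.split("-")) {
            String inner = block.trim();
            if (inner.startsWith("(") && inner.endsWith(")")) {
                inner = inner.substring(1, inner.length() - 1);
            }
            // Each pipe-separated alternative becomes one key.
            for (String alternative : inner.split("\\|")) {
                String key = alternative.trim();
                if (!key.isEmpty()) {
                    keys.add(key);
                }
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // prints [ortlaj, ortlej, dortlaj, dortlej]
        System.out.println(indexKeys("(ortlaj|ortlej)-(dortlaj|dortlej)"));
    }
}
```

Flattening discards which name part each key came from. An application that needs that distinction could keep one set per parenthesized block instead.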

Note: this version of the Beider-Morse encoding is equivalent to v3.4 of the reference implementation.

This class is not thread-safe.

Since:

1.6

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| BeiderMorseEncoder() | Constructs a new instance. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| Object | encode(Object source) | Encodes an "Object" and returns the encoded content as an Object. |
| String | encode(String source) | Encodes a String and returns a String. |
| NameType | getNameType() | Gets the name type currently in operation. |
| RuleType | getRuleType() | Gets the rule type currently in operation. |
| boolean | isConcat() | Discovers if multiple possible encodings are concatenated. |
| void | setConcat(boolean concat) | Sets how multiple possible phonetic encodings are combined. |
| void | setMaxPhonemes(int maxPhonemes) | Sets the maximum number of phonemes that shall be considered by the engine. |
| void | setNameType(NameType nameType) | Sets the type of name. |
| void | setRuleType(RuleType ruleType) | Sets the rule type to apply. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- BeiderMorseEncoder
  
  public BeiderMorseEncoder()
  
  Constructs a new instance.
Method Details
- encode
  
  public Object encode(Object source) throws EncoderException
  
  Description copied from interface: Encoder
  
  Encodes an "Object" and returns the encoded content as an Object. The Objects here may just be byte[] or Strings depending on the implementation used.
  
  Specified by:
  
  encode in interface Encoder
  
  Parameters:
  
  source - An object to encode.
  
  Returns:
  
  An "encoded" Object.
  
  Throws:
  
  EncoderException - An encoder exception is thrown if the encoder experiences a failure condition during the encoding process.
- encode
  
  public String encode(String source) throws EncoderException
  
  Description copied from interface: StringEncoder
  
  Encodes a String and returns a String.
  
  Specified by:
  
  encode in interface StringEncoder
  
  Parameters:
  
  source - the String to encode.
  
  Returns:
  
  the encoded String.
  
  Throws:
  
  EncoderException - thrown if there is an error condition during the encoding process.
- getNameType
  
  public NameType getNameType()
  
  Gets the name type currently in operation.
  
  Returns:
  
  the NameType currently being used.
- getRuleType
  
  public RuleType getRuleType()
  
  Gets the rule type currently in operation.
  
  Returns:
  
  the RuleType currently being used.
- isConcat
  
  public boolean isConcat()
  
  Discovers if multiple possible encodings are concatenated.
  
  Returns:
  
  true if multiple encodings are concatenated, false if just the first one is returned.
- setConcat
  
  public void setConcat(boolean concat)
  
  Sets how multiple possible phonetic encodings are combined.
  
  Parameters:
  
  concat - true if multiple encodings are to be combined with a '|', false if just the first one is to be considered.
- setMaxPhonemes
  
  public void setMaxPhonemes(int maxPhonemes)
  
  Sets the maximum number of phonemes that shall be considered by the engine.
  
  Parameters:
  
  maxPhonemes - the maximum number of phonemes returned by the engine.
  
  Since:
  
  1.7
- setNameType
  
  public void setNameType(NameType nameType)
  
  Sets the type of name. Use NameType.GENERIC unless you specifically want phonetic encodings optimized for Ashkenazi or Sephardic Jewish family names.
  
  Parameters:
  
  nameType - the NameType in use.
- setRuleType
  
  public void setRuleType(RuleType ruleType)
  
  Sets the rule type to apply. This will widen or narrow the range of phonetic encodings considered.
  
  Parameters:
  
  ruleType - RuleType.APPROX or RuleType.EXACT for approximate or exact phonetic matches.
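
The bean-style setters documented above can be combined as in the following minimal sketch. It assumes Commons Codec is on the classpath; the class name ConfiguredEncoderExample, the helper newExactEncoder, and the values passed (EXACT rules, a limit of 10 phonemes) are illustrative assumptions rather than recommended settings.

```java
import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.bm.BeiderMorseEncoder;
import org.apache.commons.codec.language.bm.NameType;
import org.apache.commons.codec.language.bm.RuleType;

// Hypothetical example class showing setter-based configuration.
final class ConfiguredEncoderExample {

    static BeiderMorseEncoder newExactEncoder() {
        BeiderMorseEncoder encoder = new BeiderMorseEncoder();
        encoder.setNameType(NameType.GENERIC); // not specialised for Ashkenazi or Sephardic names
        encoder.setRuleType(RuleType.EXACT);   // narrower range of encodings than APPROX
        encoder.setConcat(true);               // join alternatives with '|'
        encoder.setMaxPhonemes(10);            // illustrative limit
        return encoder;
    }

    public static void main(String[] args) throws EncoderException {
        BeiderMorseEncoder encoder = newExactEncoder();
        // Prints the encoded form of the input name.
        System.out.println(encoder.encode("Renault"));
    }
}
```

Because the encoder is mutable, a configured instance like this is best confined to one thread or configured fully before it is shared.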
