Approximate strings joins in a database - Part 1
Strings are ubiquitous and ambiguous. When a communication channel is established between two people, inevitable noise and misunderstanding can introduce many errors when transferring textual data.
In any enterprise system, there are many places where this type of error can occur. Client names, addresses, company names, etc. This kind of error can become impracticable the exact match against a query for a registry.
Think about the poor life of the mega-action star Arnold Schwarzenegger, how many times he needs to repeat his name to the operator?
To solve this type of problem, a number of phonetic algorithms was developed. A phonetic algorithm use rules to transform substrings into phonemes, trying to unify two strings written as spoken. Some algorithms:
But all these algorithms suffer from the same problems. The rules are written specifically for one target language, and the most common target language is English. This isn’t a matter if your native spoken language is English, or if you don’t need to do internationalized applications…
There are another solutions? Sure. Searching for alternatives, I found the approximate string search algorithms, mainly the Levenshtein distance (or Edit-Distance) algorithm. More robust and reliable, can be used without changes. With a little, but very important, advantage: we can use to sort the result candidates by your distance to the searched string. This little advantage can be the difference between a poor result, with pages and pages of useless results, or a well ordered list, with the most relevant results first.
I expect to show how to use this great algorithm to construct a full text search engine on databases, soon.
References:
- Wikipedia
- Phonetic algorithms implementations: Apache Commons Codec
- Levenshtein distance algorithm implementation: Apache Commons Lang




















May 13th, 2007 at 12:18 pm
[…] In the next step, I will show some n-gram properties and its use for approximate string match in database. See the part 1 Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages. […]