Archive for the 'algorithm' Category

Approximate strings joins in a database - Part 2

Sunday, May 13th, 2007

Continuing with my search to algorithms to do approximate string match in databases, I found some interesting relationship between the edit-distance algorithm and the n-gram concept.

A very useful concept in statistics, natural language processing and genetic sequence analysis is n-grams. An n-gram is a sub-sequence of n items from a given sequence, where the sequence can be anything you like.

A word is a sequence of symbols from a alphabet. Given a word s with characters in a alphabet A, its positional n-grams are obtained by “sliding” a window of length n over the characters of s. Since some n-grams at the beginning and the end of the string can have fewer than n characters from s, we introduce new characters “#” and “$” not in A, and extend s by prefixing it with n - 1 occurrences of “#” and suffixing it with n - 1 occurrences of “$”. Then, all the n-grams of s have exactly n characters, some of these may not be from the alphabet A. Example:

Word s: foobar
Alphabet A: [a-zA-Z]+ (regular expression)

if searching for all 2-grams, prefix and suffix the word with 2 - 1 = 1 characters not in A:
s extended: $foobar#

2-gram set of $foobar#: {$f, fo, oo, ob, ba, ar, r#}

A positional n-gram of a word s is a pair (i, s[i, i + n - 1]), where s[i, i + n - 1] is the n-gram of s starting at position i on the extended string. The set Gs is the set with all |s| + n - 1 pairs of positional n-grams of the word s. Example:

G2 of foobar: {(1, $f), (2, fo), (3, oo), (4, ob), (5, ba), (6, ar), (7, r#)}

The intuition behind the use of n-grams as a foundation for approximate string match is that when two strings s1 and s2 are within a small edit distance of each other, they share a large number of n-grams in common. Example:

The positional n-grams of length n=2 for string “world” are {(1, $w), (2, wo), (3, or), (4, rl), (5, ld), (6, d#)}. Similarly, the positional n-grams of length n = 2 for the string “word”, which is at an edit distance of one from “world”, are {(1, $w), (2, wo), (3, or), (4, rd), (5, d#)}. If we ignore the positional information, the two n-gram sets have 4 n-grams in common. Interestingly, only the first 3 positional n-grams of the first string are also positional n-grams of the second string. However, an additional two positional n-gram in the two strings differ in their position by just one position. This show us that the use of positional n-grams will involve comparing positions of “matching” n-grams within a certain “band”.

In the next step, I will show some n-gram properties and its use for approximate string match in database.

See the part 1

Approximate strings joins in a database - Part 1

Wednesday, January 10th, 2007

Strings are ubiquitous and ambiguous. When a communication channel is established between two people, inevitable noise and misunderstanding can introduce many errors when transferring textual data.

In any enterprise system, there are many places where this type of error can occur. Client names, addresses, company names, etc. This kind of error can become impracticable the exact match against a query for a registry.

Think about the poor life of the mega-action star Arnold Schwarzenegger, how many times he needs to repeat his name to the operator?

To solve this type of problem, a number of phonetic algorithms was developed. A phonetic algorithm use rules to transform substrings into phonemes, trying to unify two strings written as spoken. Some algorithms:

But all these algorithms suffer from the same problems. The rules are written specifically for one target language, and the most common target language is English. This isn’t a matter if your native spoken language is English, or if you don’t need to do internationalized applications…

There are another solutions? Sure. Searching for alternatives, I found the approximate string search algorithms, mainly the Levenshtein distance (or Edit-Distance) algorithm. More robust and reliable, can be used without changes. With a little, but very important, advantage: we can use to sort the result candidates by your distance to the searched string. This little advantage can be the difference between a poor result, with pages and pages of useless results, or a well ordered list, with the most relevant results first.

I expect to show how to use this great algorithm to construct a full text search engine on databases, soon.

References: