Approximate strings joins in a database - Part 2
Sunday, May 13th, 2007Continuing with my search to algorithms to do approximate string match in databases, I found some interesting relationship between the edit-distance algorithm and the n-gram concept.
A very useful concept in statistics, natural language processing and genetic sequence analysis is n-grams. An n-gram is a sub-sequence of n items from a given sequence, where the sequence can be anything you like.
A word is a sequence of symbols from a alphabet. Given a word s with characters in a alphabet A, its positional n-grams are obtained by “sliding” a window of length n over the characters of s. Since some n-grams at the beginning and the end of the string can have fewer than n characters from s, we introduce new characters “#” and “$” not in A, and extend s by prefixing it with n - 1 occurrences of “#” and suffixing it with n - 1 occurrences of “$”. Then, all the n-grams of s have exactly n characters, some of these may not be from the alphabet A. Example:
Word s: foobar
Alphabet A: [a-zA-Z]+ (regular expression)
if searching for all 2-grams, prefix and suffix the word with 2 - 1 = 1 characters not in A:
s extended: $foobar#
2-gram set of $foobar#: {$f, fo, oo, ob, ba, ar, r#}
A positional n-gram of a word s is a pair (i, s[i, i + n - 1]), where s[i, i + n - 1] is the n-gram of s starting at position i on the extended string. The set Gs is the set with all |s| + n - 1 pairs of positional n-grams of the word s. Example:
G2 of foobar: {(1, $f), (2, fo), (3, oo), (4, ob), (5, ba), (6, ar), (7, r#)}
The intuition behind the use of n-grams as a foundation for approximate string match is that when two strings s1 and s2 are within a small edit distance of each other, they share a large number of n-grams in common. Example:
The positional n-grams of length n=2 for string “world” are {(1, $w), (2, wo), (3, or), (4, rl), (5, ld), (6, d#)}. Similarly, the positional n-grams of length n = 2 for the string “word”, which is at an edit distance of one from “world”, are {(1, $w), (2, wo), (3, or), (4, rd), (5, d#)}. If we ignore the positional information, the two n-gram sets have 4 n-grams in common. Interestingly, only the first 3 positional n-grams of the first string are also positional n-grams of the second string. However, an additional two positional n-gram in the two strings differ in their position by just one position. This show us that the use of positional n-grams will involve comparing positions of “matching” n-grams within a certain “band”.
In the next step, I will show some n-gram properties and its use for approximate string match in database.