Archive for May, 2007

Smalltalk passion

Sunday, May 13th, 2007

I’m very rational programmer, with a academic bias. I know C, C++, Java, Perl, PHP, VBScript, Object Pascal and Prolog languages. But my language of choice is beyond the rational. Smalltalk.

Smalltalk is much more than a language, is a concept, a vision. Much of what we use today was inspired in the Smalltalk ideas and environment. Steve Jobs only saw the mouse and the window system, if he saw the Smalltalk language too, our lifes would be better now.

“But if Smalltalk is so good, why isn’t a mainstream language?” you can ask me. I guess it was a matter of investments, a big company behind, the resistance against the paradigm shift… Java is the main object oriented language today, and cleverly the language designer choose a syntax similar with C++. Smalltalk always was and is today a big lab beyond the common place.

If you don’t know Smalltalk, I invite you to take some time to know a free implementation, Squeak. I will try to present some features here, soon.

Approximate strings joins in a database - Part 2

Sunday, May 13th, 2007

Continuing with my search to algorithms to do approximate string match in databases, I found some interesting relationship between the edit-distance algorithm and the n-gram concept.

A very useful concept in statistics, natural language processing and genetic sequence analysis is n-grams. An n-gram is a sub-sequence of n items from a given sequence, where the sequence can be anything you like.

A word is a sequence of symbols from a alphabet. Given a word s with characters in a alphabet A, its positional n-grams are obtained by “sliding” a window of length n over the characters of s. Since some n-grams at the beginning and the end of the string can have fewer than n characters from s, we introduce new characters “#” and “$” not in A, and extend s by prefixing it with n - 1 occurrences of “#” and suffixing it with n - 1 occurrences of “$”. Then, all the n-grams of s have exactly n characters, some of these may not be from the alphabet A. Example:

Word s: foobar
Alphabet A: [a-zA-Z]+ (regular expression)

if searching for all 2-grams, prefix and suffix the word with 2 - 1 = 1 characters not in A:
s extended: $foobar#

2-gram set of $foobar#: {$f, fo, oo, ob, ba, ar, r#}

A positional n-gram of a word s is a pair (i, s[i, i + n - 1]), where s[i, i + n - 1] is the n-gram of s starting at position i on the extended string. The set Gs is the set with all |s| + n - 1 pairs of positional n-grams of the word s. Example:

G2 of foobar: {(1, $f), (2, fo), (3, oo), (4, ob), (5, ba), (6, ar), (7, r#)}

The intuition behind the use of n-grams as a foundation for approximate string match is that when two strings s1 and s2 are within a small edit distance of each other, they share a large number of n-grams in common. Example:

The positional n-grams of length n=2 for string “world” are {(1, $w), (2, wo), (3, or), (4, rl), (5, ld), (6, d#)}. Similarly, the positional n-grams of length n = 2 for the string “word”, which is at an edit distance of one from “world”, are {(1, $w), (2, wo), (3, or), (4, rd), (5, d#)}. If we ignore the positional information, the two n-gram sets have 4 n-grams in common. Interestingly, only the first 3 positional n-grams of the first string are also positional n-grams of the second string. However, an additional two positional n-gram in the two strings differ in their position by just one position. This show us that the use of positional n-grams will involve comparing positions of “matching” n-grams within a certain “band”.

In the next step, I will show some n-gram properties and its use for approximate string match in database.

See the part 1