I have recently been tempted to go “all in” on semantic analysis. I played around with the Sphinx search engine, even wrote a plugin for it, and checked out its logic for infixes (suffixes, prefixes), word forms, spelling, stemming, etc. I have to say, the code in there is a mess, but it is still fast. It’s hard to make heads or tails of a 28,000-line .cpp file (yes, that’s not a mistake: 28k lines). Maybe they keep it this way to stop others from playing around with the code?! What’s the point of open source, then? Anyway, for the curious minds, here are the results of my research over the past couple of days:

spelldump

Great utility that ships with Sphinx. Have a read here: http://sphinxsearch.com/docs/1.10/ref-spelldump.html


snowball

Stemming library with multiple stemmers for Romance, Germanic, Scandinavian and Russian languages: http://snowball.tartarus.org/


php-stemmer

Stemmer written in PHP: http://code.google.com/p/php-stemmer/


word lists

A dictionary of all the English words (not really, but close enough): http://wordlist.sourceforge.net


soundex

A phonetic algorithm http://en.wikipedia.org/wiki/Soundex
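
To give a feel for what a phonetic code looks like, here is a minimal Python sketch of the classic American Soundex rules as described on the Wikipedia page above. The function name and the lookup table layout are my own choices for illustration; this is not the code Sphinx or PHP uses.

```python
# Digit assigned to each consonant group in American Soundex.
# Vowels, y, h and w get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of `word` ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate equal codes
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # a vowel resets `prev`, so a repeated code after it is emitted again
        prev = code
    # pad with zeros and keep the first letter plus three digits
    return (result + "000")[:4]


print(soundex("Robert"))    # R163
print(soundex("Rupert"))    # R163
print(soundex("Ashcraft"))  # A261
```

Words that sound alike ("Robert" and "Rupert") collapse to the same code, which is what makes the code usable as a lookup key.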


metaphone

A better phonetic algorithm http://en.wikipedia.org/wiki/Metaphone


double metaphone

An even better (but way slower) phonetic algorithm http://en.wikipedia.org/wiki/Metaphone


levenshtein distance

The distance between two words as explained here: http://en.wikipedia.org/wiki/Levenshtein_distance
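
As a rough illustration, the distance can be computed with the standard dynamic-programming approach. The sketch below is plain Python that keeps only two rows of the table in memory; the function name is my own.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```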


ispell

Non-GNU, but listed among GNU's spelling and typographical error correctors: http://www.gnu.org/software/ispell/ispell.html


So how can you use these, you ask? Simple:

If a word is not in your dictionary, compute its phonetic form (using one of the algorithms above) and calculate the Levenshtein distance from it to the phonetic form of every word in the dictionary. It helps to keep the dictionary in a hash table with the phonetic forms precomputed. Find the closest match and stem it. Then check whether the stemmed word is in your accepted list. If it is, let the person who submitted the word know that the match came from autocorrect (something like “Did you mean … ?”).
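
The steps above can be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the function names, the tiny word lists, the `max_distance` cutoff (added so that junk input gets no suggestion), the tie-break on raw spelling when several words share a phonetic key, and above all `toy_stem`, which is only a stand-in for a real stemmer such as Snowball. Soundex is used as the phonetic algorithm, but Metaphone would slot in the same way.

```python
from collections import defaultdict

CODES = {ch: d for d, group in enumerate(
    ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1) for ch in group}


def soundex(word):
    """Compact American Soundex: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0].upper(), CODES.get(letters[0])
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch)
        if code and code != prev:
            out += str(code)
        prev = code
    return (out + "000")[:4]


def levenshtein(a, b):
    """Edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def toy_stem(word):
    """Placeholder stemmer (hypothetical): strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def build_index(dictionary):
    """Hash table of precomputed phonetic form -> dictionary words."""
    index = defaultdict(list)
    for word in dictionary:
        index[soundex(word)].append(word)
    return index


def suggest(word, accepted, index, stem=toy_stem, max_distance=1):
    """Return (match, autocorrected); match is None when nothing fits."""
    if word in accepted:
        return word, False
    key = soundex(word)
    # closest phonetic form among all precomputed keys
    best_key = min(index, key=lambda k: levenshtein(key, k), default=None)
    if best_key is None or levenshtein(key, best_key) > max_distance:
        return None, False
    # several words can share a key; prefer the closest spelling
    closest = min(index[best_key], key=lambda w: levenshtein(word, w))
    stemmed = stem(closest)
    if stemmed in accepted:
        return stemmed, True  # caller can show "Did you mean ...?"
    return None, False


dictionary = ["search", "searching", "engine", "engines", "index", "phonetic"]
accepted = {"search", "engine", "index", "phonetic"}
index = build_index(dictionary)

print(suggest("serching", accepted, index))  # ('search', True)
print(suggest("enjin", accepted, index))     # ('engine', True)
print(suggest("zzzz", accepted, index))      # (None, False)
```

Scanning every phonetic key is fine for a small dictionary. For a large one, an exact lookup on the precomputed key first, falling back to the scan only on a miss, would be the cheaper path.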

All of the above are good candidates for C or C++, but even PHP has implementations of these (mostly written in C and exposed as native functions), such as soundex(), metaphone() and levenshtein().

Hope the above helps you get started with whatever you want to do. There are tons of applications.

About the author

Mircea Danila Dumitrescu is a highly technical advisor to startups, CTO, Entrepreneur, Geek, Mentor, Best AI Startup Winner, who previously ran multiple complex systems with billions of records and millions of customers.