Spell Check Dictionary Improvements
Wednesday, February 11, 2009
If you're anything like us, you're spending more and more of your time working online. The spellchecker built into Chromium can be a big help in keeping your blog, email, documents, and forum postings spelled correctly and easy to read. Chromium integrates the popular open source library Hunspell with WebKit's built-in spellchecking infrastructure to check words and to provide suggestions in 27 different languages.
The Hunspell dictionary maintainers have done a great job creating high-quality dictionaries that anybody can use, but one of the problems with any dictionary is that there are inevitably omissions, especially as new words appear or proper nouns come into common use. We at Google are in a good position to use our knowledge of the internet to identify and fix some of these omissions. The Google translation team used their language models to generate a sorted list of the most popular words in each language. This was cross-checked with the Hunspell dictionaries to generate a list of the top 1000 words not present in each dictionary. This list includes many popular words, but also common misspellings. To remove these words, each list was reviewed by specialist in that language. Generally, we tried to keep proper nouns and even foreign words as long as they were in common usage.
We hope that by using the the existing GPL/LGPL/MPL tri-license for our addition, our work can be picked up by other users of Hunspell. We also hope to make more improvements in the future, both for additional languages like Turkish, and to refine the word lists we already have. If you're passionate about your language, you can help out by writing affix rules for the added words or reviewing more word lists.
The recent dev-channel release of Google Chrome (2.0.160.0) has the additional words we generated for 19 of the languages. Hopefully, you'll see fewer common words marked as misspelled. For example, the English dictionary now includes "antivirus," "anime," "screensaver," and "webcam," and commonly used names such as "BibTeX," "Mozilla," "Obama," and "Wikipedia." For our scientific users, we even have "gastroenterology," "oligonucleotide," and "Saccharomyces"! We'd like to give special thanks to the great help we got from the translation team who generated the words and the language search specialists who reviewed the lists.