What do we need to build a good spellchecker for Tamil?
1 We need a large collection of good base words.
Tamil may have a few million base words; all other words are derived from them. We should be able to quickly compare a given word against our dataset of known-correct words and say whether it is correct or not.
If the given word is in our dataset, all is fine. Move on to the next word.
1.1 A few issues here.
The dataset will be really huge and may grow to a few GB. How do we search within this dataset quickly? How do we compact it so that it can be used on mobile too? Would you install a spellchecker if it came with GBs of data? What if querying this dataset takes too much time? Would we use a spellchecker if it took some 10 minutes to finish checking for errors?
We need memory- and processor-efficient ways to query the dataset.
This morning, I discussed this issue with Malaikannan from the IndicNLP team. He introduced me to the “Bloom filter”. It is a very memory-efficient probabilistic data structure for testing whether a word is in a given dataset. It answers either “definitely not in the set” or “probably in the set”, so it may occasionally accept a wrong word, but it never rejects a word that was added. Interesting.
He said that the IndicNLP project is already using this. He will package the Bloom filter as a separate function and contribute it to Open-Tamil.
Happy to see more helping hands. 🙂 Thanks to Malaikannan and the IndicNLP team for their great work.
IndicNLP unique words from the Wikipedia dump – https://github.com/malaikannan/IndicNLPUniqueWords
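To make the Bloom filter idea concrete, here is a minimal sketch in Python. It is not the IndicNLP implementation; the class name, the SHA-256 based double hashing, and the 1% false-positive target are all assumptions chosen for illustration. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter for word membership tests (illustrative sketch)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        bits_needed = -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        self.size = max(8, math.ceil(bits_needed))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from two 64-bit slices
        # of a single SHA-256 digest (an assumption, any good hash works).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


words = ["தமிழ்", "வீடு", "மரம்"]  # stand-in for the real base-word dataset
bf = BloomFilter(expected_items=len(words))
for w in words:
    bf.add(w)

print("வீடு" in bf)  # True: a word that was added is never reported missing
```

With these formulas, a 1% false-positive rate costs about 9.6 bits per word, so a hypothetical 5 million words would need roughly 6 MB instead of GBs. The trade-off is that about 1 in 100 misspelled words could slip through as "probably correct".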
2 Apply grammar rules to check for errors in derived words
Once the base words are checked, we have to check the derived words. We have to do stemming to find the base word, then apply the available grammar rules to generate the possible derived words and compare them with the given word.
Stemming is only one step. There may be a lot of grammar-related work involved in handling words. We need a good Tamil grammar expert to guide us.
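The stem-then-generate idea can be sketched as below. This is only a toy: the three noun suffixes, the two base words, and the function names are hypothetical, and real Tamil grammar (sandhi, case markers, verb forms) needs far more rules than plain suffix stripping.

```python
# Hypothetical, tiny rule set: surface forms of a few plural noun endings.
# Listed longest first so the longest matching suffix is tried first.
SUFFIXES = ["களில்", "களை", "கள்"]

# Stand-in for the verified base-word dataset.
BASE_WORDS = {"வீடு", "நாடு"}


def find_base(word):
    """Return the base word if `word` is a known base, or a known base
    plus one known suffix; otherwise return None."""
    if word in BASE_WORDS:
        return word
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in BASE_WORDS:
            return stem
    return None


def derived_forms(base):
    """Generate every derived word this rule set allows for a base word."""
    return {base + suffix for suffix in SUFFIXES}


given = "வீடு" + "களில்"
base = find_base(given)
print(base is not None and given in derived_forms(base))  # True
print(find_base("வீடு" + "ம்ம்"))  # None: no rule produces this form
```

A word is accepted only when stemming leads to a known base word and the rules can regenerate the given word from that base.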
2.1 Issues here
Tamil has a huge set of grammar rules. Applying them to each word may take time. How do we do this in an optimal way? We have to explore this.
3 Suggesting alternatives
Once a word is marked as an error, we have to suggest correct alternatives. Open-Tamil already implements the Norvig algorithm to provide suggestions. We have to test its efficiency and explore other possibilities, if any.
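For readers who have not seen it, the core of Norvig's approach is to generate every string one edit away from the wrong word and keep the ones found in the dictionary, most frequent first. The sketch below is not Open-Tamil's code; the function names and the word counts are made up for illustration, and it edits Unicode code points rather than whole Tamil letters, which a real implementation would probably want to handle.

```python
def edits1(word, alphabet):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, word_counts):
    """Return known words within one edit of `word`, most frequent first."""
    if word in word_counts:
        return [word]
    # Assumption: use the code points seen in the dictionary as the alphabet.
    alphabet = set("".join(word_counts))
    candidates = [c for c in edits1(word, alphabet) if c in word_counts]
    return sorted(candidates, key=word_counts.get, reverse=True)


correct = "தமிழ்"
word_counts = {correct: 120, "வீடு": 40}  # hypothetical frequencies

misspelled = correct[:-1]  # drop the final code point to simulate a typo
print(suggest(misspelled, word_counts))
```

The number of candidates grows with word length times alphabet size, so efficiency on long Tamil words is exactly the kind of thing that needs testing.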
4 Unknown items
On the path to building a spellchecker, there may be unknown obstacles to cross. We will find them only when we walk the path.
If all the above items are solved, we can provide the spellchecker as a web application along with an API (with throttling). This is easy, AFAIK; Open-Tamil already has a web version. But the real users live in another world. They may be using LibreOffice, MS Office, PageMaker or InDesign. We have to look into building plugins for these applications. Finally, mobile users: how are we going to provide a spellchecker for them?
We can explore all this packaging work once the basic web version is ready.
While I was typing this post, Malaikannan pulled all the text from the Project Madurai site and built Bloom filter code to check a given word against the Project Madurai dataset. 🙂
Here is a screenshot of his quick implementation
Here are his quick GitHub gists –
6 What next?
We need a huge dataset of correct words as input for this Bloom filter. The Wikipedia dump is one public dataset, available under the CC-BY-SA license. We can use it, but it may contain many misspelled words. We will explore other possibilities for getting good words.
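One cheap first pass over a noisy source such as a Wikipedia dump could look like the sketch below. It pulls out runs of characters from the Tamil Unicode block and keeps only words seen more than once. The frequency threshold is purely my assumption: a rare word is not necessarily wrong, and a common misspelling will still pass, so this only reduces noise.

```python
import re
from collections import Counter

# Runs of characters from the Tamil Unicode block (U+0B80 to U+0BFF).
# Note that this block also contains Tamil digits and symbols.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def candidate_words(text, min_count=2):
    """Count Tamil words in `text` and keep those seen at least `min_count` times."""
    counts = Counter(TAMIL_WORD.findall(text))
    return {word: n for word, n in counts.items() if n >= min_count}


w1, w2 = "வீடு", "மரம்"
sample = f"{w1} house {w1}, {w2}."  # stand-in for text from a dump
print(candidate_words(sample))
```

The surviving words, with their counts, could then go into the Bloom filter and also serve as frequencies for ranking suggestions.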