Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset

What do we need to build a good spellchecker for Tamil?

1 We need a large collection of good base words

Tamil may have a few million base words. All others are derived words. We should be able to quickly compare a given word with our all-correct dataset and say whether it is correct or not.

If the given word is in our dataset, all is fine. Move on to the next word.

1.1 A few issues here

The dataset will be really huge and may grow to a few GBs. How do we search within it quickly? How do we compact it so that it can be used on mobile too? Would you install a spellchecker that comes with GBs of data? What if querying the dataset takes too long? Would anyone use a spellchecker that takes 10 minutes to check a document for errors?

We need memory- and processor-efficient ways to query the dataset.

1.2 Possibilities

This morning, I discussed this issue with Malaikannan from the IndicNLP team. He introduced me to the “Bloom filter”. It is a very efficient, probabilistic data structure for testing whether a word is in a given dataset: it answers either “definitely not present” or “probably present”, using far less memory than storing the words themselves. Interesting.

He said that the IndicNLP project is already using this. He will package the Bloom filter as a separate function and contribute it to open-tamil.

Happy to see more helping hands. 🙂 Thanks to Malaikannan and the IndicNLP team for their great work.

https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

https://llimllib.github.io/bloomfilter-tutorial/

IndicNLP unique words from the Wikipedia dump – https://github.com/malaikannan/IndicNLPUniqueWords
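To make the idea concrete, here is a minimal Bloom filter sketch in Python. It is not Malaikannan's or IndicNLP's implementation. The class name, the parameters (`expected_items`, `false_positive_rate`) and the choice of SHA-256 with double hashing are all my assumptions for illustration. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter for a set of words (illustrative sketch only)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m bits and k hash functions for n items at rate p.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        ))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # False means "definitely not added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


words = ["அம்மா", "அப்பா", "தமிழ்"]
bf = BloomFilter(expected_items=1000)
for w in words:
    bf.add(w)

print(all(w in bf for w in words))  # every added word is always reported as present
```

At a 1% false-positive rate the formula works out to about 9.6 bits per word, so three million words would need roughly 3.6 MB, no matter how long the words are. The trade-off is that a misspelled word can occasionally be reported as present.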


2 Apply grammar rules to check for errors in derived words

Once the base words are checked for errors, we have to check the derived words. We have to do stemming to find the base word, then apply the available grammar rules to generate the possible derived words and compare them with the given word.

Stemming may be one step. There may be a lot more grammar-related work involved in handling words. We need a good Tamil grammar expert to guide us.
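As a rough illustration of the stemming step, the sketch below strips a few noun suffixes and looks up what remains in a base-word set. The suffix list and the two base words are toy values I picked for the example, and the function names are hypothetical. Real Tamil morphology involves sandhi changes and many more rules, which this does not attempt.

```python
# Toy data for illustration only; a real system needs a far larger word list
# and proper grammar rules (sandhi, case markers, verb forms, ...).
BASE_WORDS = {"வீடு", "நாடு"}
SUFFIXES = ("களில்", "களை", "கள்")  # longer suffixes listed first


def candidate_stems(word):
    """Yield the word itself, then the word with each matching suffix removed."""
    yield word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)]


def is_known(word, base_words=BASE_WORDS):
    """Return True if the word or one of its candidate stems is a base word."""
    return any(stem in base_words for stem in candidate_stems(word))


print(is_known("வீடுகள்"))    # plural of a known base word
print(is_known("நாடுகளில்"))  # plural + locative of a known base word
print(is_known("வீடுகல்"))    # misspelled ending, no suffix matches
```

The base-word lookup here uses a plain Python set; a Bloom filter could stand in for it, since it only needs a membership test.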

2.1 Issues here

Tamil has a huge set of grammar rules. Applying them to each word may take time. How can we do this in an optimal way? We have to explore this.

3 Suggesting alternatives

Once a word is marked as an error, we have to suggest correct alternatives. Open-Tamil has already implemented Norvig's algorithm to provide suggestions. We have to test its efficiency and explore other possibilities, if any.
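For readers who have not seen it, the core of Norvig's approach is to generate every string one edit away from the input (delete, transpose, replace, insert) and keep those that are known words. The sketch below is not Open-Tamil's code. It works on raw Unicode code points from the Tamil block, which is a simplifying assumption: a proper version would probably operate on whole Tamil letters (a consonant plus its vowel sign) instead.

```python
# All code points in the Unicode Tamil block (U+0B80..U+0BFF). Some are
# unassigned, which only produces a few useless candidates.
TAMIL_CODE_POINTS = [chr(cp) for cp in range(0x0B80, 0x0C00)]


def edits1(word):
    """Return all strings one code-point edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in TAMIL_CODE_POINTS]
    inserts = [a + c + b for a, b in splits for c in TAMIL_CODE_POINTS]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, known_words):
    """Return the word if known, else known words within one edit of it."""
    if word in known_words:
        return [word]
    return sorted(edits1(word) & known_words)


known = {"அம்மா", "அப்பா"}
print(suggest("அம்ம", known))  # the final vowel sign is missing in the input
```

Norvig's original also ranks candidates by word frequency and falls back to two edits; both are left out here to keep the example short.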

4 Unknown items

On the path to building a spellchecker, there may be unknown obstacles to cross. We will find them only as we go along.

5 Packaging

If all the above items are solved, we can provide the spellchecker as a web application along with an API (with throttling). This is easy, AFAIK. Open-Tamil already has a web version. But the real users live in another world. They may be using LibreOffice/MS Office/PageMaker/InDesign. We have to look into building plugins for these applications. Finally, mobile users: how are we going to provide a spellchecker for them?

We can explore all these packaging options once the basic web version is ready.

While I was typing this post, Malaikannan pulled all the text from the Project Madurai site and built Bloom filter code to check a given word against the Project Madurai dataset. 🙂

https://www.projectmadurai.org/pmworks.html

 

Here is a screenshot of his quick implementation

Image

Here are his quick GitHub gists –

 

6 What next?

We need a huge dataset of correct words as input for this Bloom filter. The Wikipedia dump is one public dataset, available under the CC-BY-SA license. We can use it, but it may contain many misspelled words. I will explore other possibilities for getting good words.
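One possible way to pull candidate words out of a raw text dump is sketched below: match runs of characters from the Unicode Tamil block and count how often each word occurs. The function names and the `min_count` threshold are my assumptions. Dropping rare words is only a guess at how to reduce misspellings, not something that has been tested on the Wikipedia data.

```python
import re
from collections import Counter

# A "word" here is any run of characters from the Tamil block (U+0B80..U+0BFF).
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def tamil_word_counts(text):
    """Count every Tamil word found in the text, ignoring other scripts."""
    return Counter(TAMIL_WORD.findall(text))


def frequent_words(counts, min_count):
    """Keep words seen at least `min_count` times (hypothetical noise filter)."""
    return {word for word, n in counts.items() if n >= min_count}


sample = "தமிழ் is a language. தமிழ் மொழி, 2020"
counts = tamil_word_counts(sample)
print(counts)
print(frequent_words(counts, 2))
```

The resulting word set could then be added to the Bloom filter one word at a time.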
