Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?

Today, I did the following things.

1. Ran sort and unique on the existing nouns repo.
https://github.com/KaniyamFoundation/all_tamil_nouns

The current master file for nouns,
unique_sorted_noun_master.txt, now has 1,53,548 nouns.

2. Merged nouns and all words

Merged all the collected nouns and words into one master file to use for the bloom
filter.

Word Count:
wc -l unique_sorted_words_in_words_master.txt
23,92,064 unique_sorted_words_in_words_master.txt (79 MB in size)
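The merge, unique and sort steps can be sketched in Python as below. This is only an illustration of the idea, not the script actually used. The function name and the demo file names are made up, and on the command line `sort -u` does the same job.

```python
import tempfile
from pathlib import Path


def merge_unique_sorted(input_paths, output_path):
    """Merge word lists (one word per line), drop duplicates, and sort."""
    words = set()
    for path in input_paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                word = line.strip()
                if word:  # skip blank lines
                    words.add(word)
    ordered = sorted(words)  # sorted by Unicode code point
    Path(output_path).write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return len(ordered)


# Demo on throwaway files; these file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    nouns = Path(tmp, "nouns.txt")
    words = Path(tmp, "words.txt")
    nouns.write_text("மேகம்\nமழை\n", encoding="utf-8")
    words.write_text("மழை\nகடல்\n", encoding="utf-8")
    master = Path(tmp, "master.txt")
    print(merge_unique_sorted([nouns, words], master), "unique words")
```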

Get all words from here

https://github.com/KaniyamFoundation/all_tamil_words

So, we have nearly 24 lakh unique Tamil words. Yay. 🙂
Will be adding more words.

If any of you have code for scraping Blogger and WordPress websites,
share it with me. Let us run web scraping for a few days, collect more unique words
and add them to this master list.

3. Ran the bloom filter on this new file

Malaikannan released a working version of a tiny spellchecker here, with words from Project Madurai.
https://github.com/malaikannan/TamilSpellChecker

Used the file unique_sorted_words_in_words_master.txt with that repo, to
check the performance of the bloom filter.
File : TamilBloomFilterCreator.py

This file reads the given words dataset and creates a hash-table-like file
of 0s and 1s (the bloom filter's bit array), for quick queries.

bloomcreator = TamilBLoomFilterCreator("0.001", "tamil_bloom_filter_all_tamil_words.txt", \
"unique_sorted_words_in_words_master.txt")
bloomcreator.create_bloomfilter_file()

It took 15 seconds to run.

time python3 TamilBloomFilterCreator.py

real 0m15.671s
user 0m15.275s
sys 0m0.296s

The resulting file, tamil_bloom_filter_all_tamil_words.txt, is 33 MB (for the 79 MB input file).
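As a rough check on that size, the textbook bloom filter sizing formulas can be applied to our numbers (23,92,064 words, false positive rate 0.001). I have not verified that the repo uses exactly these formulas, so treat this as a back-of-the-envelope sketch; the function name is my own.

```python
import math


def bloom_parameters(n_items, false_positive_rate):
    """Return (bits, hash_count) from the standard Bloom filter formulas:
    bits = -n * ln(p) / (ln 2)^2 and hash_count = (bits / n) * ln 2."""
    bits = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n_items * math.log(2)))
    return bits, hash_count


bits, hash_count = bloom_parameters(2_392_064, 0.001)
print("bits:", bits)
print("hash functions:", hash_count)
print("packed, 8 bits per byte: %.1f MiB" % (bits / 8 / 2**20))
print("one '0'/'1' character per bit: %.1f MiB" % (bits / 2**20))
```

This gives about 34 million bits. Written as one character per bit, that is roughly 33 MiB, which is consistent with the file size seen above. Packed 8 bits per byte, the same filter would need only about 4 MiB.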

File: TamilwordChecker.py

This file uses the bloom filter file to quickly check, for any given word,
whether that word is available in the dataset or not.

tamilwordchecker = TamilwordChecker(2392064, "tamil_bloom_filter_all_tamil_words.txt")
print(tamilwordchecker.tamil_word_exists("மேகம்"))

time python3 TamilwordChecker.py
True

real 0m0.475s
user 0m0.419s
sys 0m0.056s

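Nearly 0.5 seconds to check a word in a collection of 24 lakh words. Note that this is the time
for the whole script run, which includes starting Python and setting up the checker, not just the lookup. Quite fast.
Never expected it to be this fast.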
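To show what happens inside such a checker, here is a minimal bloom filter sketch. It is not the code from the TamilSpellChecker repo. The class name, the SHA-256 based double hashing and the packed bit array are my own choices for illustration.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Illustrative Bloom filter: no false negatives, rare false positives."""

    def __init__(self, expected_items, false_positive_rate=0.001):
        n = max(1, expected_items)
        # Standard sizing formulas for the bit array and the number of hashes.
        self.size = math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)  # 8 bits packed per byte

    def _positions(self, word):
        # Double hashing: derive hash_count bit positions from one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True only if every bit for this word is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


words = ["மேகம்", "மழை", "கடல்"]
bloom = SimpleBloomFilter(expected_items=len(words))
for w in words:
    bloom.add(w)
print("மேகம்" in bloom)  # True: a word that was added is always found
```

A word that was never added will usually give False, but can occasionally give True. That is the false positive rate we choose when creating the filter.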

Usually, I use grep to find a word in any file.

time grep "^மேகம்$" unique_sorted_words_in_words_master.txt
மேகம்

real 0m0.183s
user 0m0.148s
sys 0m0.032s

Grep is super fast here. But the real use of the bloom filter comes with its auto-suggestions
for nearly similar words.
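One caveat on the two timings above: both include process startup, so they do not tell us the cost of a single lookup. A fairer measurement loads the data once and times many lookups inside the same process. The sketch below uses a plain Python set as a stand-in container and a hypothetical helper name; the same helper works with any object that supports `in`.

```python
import time


def time_lookups(container, word, repeats=100_000):
    """Return (found, average seconds per `word in container` check)."""
    found = False
    start = time.perf_counter()
    for _ in range(repeats):
        found = word in container
    elapsed = time.perf_counter() - start
    return found, elapsed / max(1, repeats)


# Stand-in data. For a real run, load the master file once into a set
# (or into the bloom filter object) before timing.
words = {"மேகம்", "மழை", "கடல்"}
found, per_lookup = time_lookups(words, "மேகம்")
print(found, "%.3f microseconds per lookup" % (per_lookup * 1e6))
```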

File : TamilSpellingAutoCorrect.py

spellchecker = TamilSpellingAutoCorrect("tamil_bloom_filter_all_tamil_words.txt", \
"unique_sorted_words_in_words_master.txt")
print(spellchecker.tamil_correct_spelling("மேக்ம்"))

Run results:

time python3 TamilSpellingAutoCorrect.py
['மேகசம்', 'மேகநம்', 'மேகம்', 'மேகனம்', 'மேகாம்', 'மேகும்', 'மேக்ச்', 'மேக்மா', 'மேக்ஸ்', 'மேட்ம்', 'மேன்ம்']

real 0m5.418s
user 0m5.206s
sys 0m0.205s

With the old dataset of only Project Madurai words, here is the result.

time python3 TamilSpellingAutoCorrect.py
['மேட்ம்', 'மேகநம்', 'மேகம்', 'மேகாம்', 'மேகும்', 'மேன்ம்']

real 0m6.945s
user 0m4.874s
sys 0m0.333s

With the larger input dataset, we get more suggestions (11 instead of 6).
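The suggestions above look like words that are one small edit away from the input. A Norvig-style sketch of that idea is below. This is my assumption about the approach, not the code in TamilSpellingAutoCorrect.py. It edits single Unicode code points rather than full Tamil letters, takes its alphabet from the word list, and uses a Python set as a stand-in for the bloom filter.

```python
def edits1(word, alphabet):
    """All strings one code-point edit (delete, swap, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, known_words):
    """Return known words that are exactly one edit away from `word`."""
    alphabet = sorted({ch for w in known_words for ch in w})
    candidates = edits1(word, alphabet)
    return sorted(c for c in candidates if c in known_words and c != word)


known = {"மேகம்", "மேகும்", "மழை"}  # tiny stand-in for the master word list
print(suggest("மேக்ம்", known))
```

With a bigger word list, more of the generated candidates turn out to be known words, which matches what we see above.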

Web version of spellchecker in open-tamil

Muthu Annamalai added an awesome web interface to the spellchecker, in the open-tamil
package itself. It is based on the Norvig algorithm for lookup and suggestions.

We can run it using the below commands.

git clone https://github.com/Ezhil-Language-Foundation/open-tamil

sudo pip3 install flask open-tamil

python3 runwebspell.py

Point the browser to http://127.0.0.1:5000/static/tinymce/index.html

This gives a TinyMCE-based editor. You can type some sentences and run the spellchecker
with Tools->Spellcheck.

[Screenshot: open-tamil spell checker, web version, 1]

It shows the misspelled words with a red underline. On right-clicking those words, it gives
suggested words to choose from.

[Screenshot: open-tamil spell checker, web version, 2]

Wonderful. We have all the roads built already. We can run vehicles with a clean dataset as fuel.

We can just build a clean words dataset and use it to launch a minimal working version
for a public demo.

I am so happy to see the progress, though we are taking baby steps. There is a long way to go.

There are many suggestions, ideas and helping hands.

i’m using https://f-droid.org/en/packages/de.reimardoeffinger.quickdic/
for offline translating.
maybe the collection could be a new dic for quickdic???

— lgs at ILUGC mailing list

For palaniappa en-ta dictionary, and tamil-lexicon digitized forms, we made them in unicode, and created .babylon source files, here one can find them:
https://github.com/indic-dict/stardict-tamil

— damodarreddy challa

https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/
https://thottingal.in/blog/2017/12/10/number-spellout-and-generation-in-malayalam-using-morphology-analyser/
https://thottingal.in/blog/2018/09/08/malayalam-spellchecker-a-morphology-analyser-based-approach/
https://morph.smc.org.in/
https://github.com/smc/mlmorph

— Santhosh Thottingal

If we have a spell checker to account for all sort of proper nouns/names, complex inflections with so many variations, that would be an ideal software not just for spell checker but for other applications as well. I am sure your list of words will contribute toward that goal. My spell checker is constituted of a few steps: 1) consult the dictionary words, 2) apply morphological tagger to get to the root form 3) use the custom database with inflected words that can not be accounted for, 4) make a database available with all complex forms for the admin to edit and update the custom database. This custom database is to be developed with a crowdsourcing option. 5) Identify all variations of use – Use of spoken Tamil forms, many variations of use of proper names. Adding to the complexity, we also have issues with newly coined words for scientific terms; compound words and variations in compound word formations and so on so forth.
Have a look at http://spellcheck.tamilnlp.com/viewcustomdatabase.php where you can see a custom database that my crowd-sourcing application collected over the period. You can see all sort of complexities in it. I identify a few below:

ஏற்றுக்கொள்ளப்படும்பட்சத்தில் (there are many cases like this. பட்சத்தில் is used as a suffix) ஐன்ஸ்ட்டீனுடையதை (many variations like this. Number of all the case forms of all such variations would be exponential and accounting for all of them would be a nightmare. We have so many acronyms.

சிபிஐ
சிபிஐபாழடைந்த
சிபிஐகாவல்
சிபிஐக்கு
சிபியின்
சிபிஜ
சிபிசிஐடி
சிபிசிஐடிக்கு
சிபிஎஸ்இ
சிபிஎஸ்இக்கு

Then, we have so many variations due to using spoken Tamil forms like மன்னிக்கிறிங்க, மன்னிக்கிறீங்க …

Professor Deivasundaram did an amazing work on accounting for so many complex words including that of proper names, complex inflections and so on. I am sure Neechalkaran did a lot of this as well. I see Elango’s point on problems in accounting for all sort of complex names with many spelling variations.

— Prof. Vasu Ranganathan

We have to do all of the above. It is a very long path to go.

We are taking baby steps and making a little progress daily. We really don't know where we will end up. We are just a few computer programmers. We don't have many mentors or Tamil linguists to give us the rules and datasets.

(Palani sir agreed to mentor. Once we complete all the data-based explorations, we will reach out to him for rules.)

We may not reach the high quality of the works by Deivasundaram sir, or of others' works.

Whatever we do, we release all the code, processes, logic and datasets as open source (releasing daily), so that anyone can build on this and collaborate.

Will be sharing the progress daily.

Requesting you all to guide, mentor and share your thoughts.

If possible, share the logic, grammar rules and datasets, if you have any.
It will help a lot to progress faster.

ToDo:

  • Parse the existing datasets and find most frequently used words, to build a clean dataset.
  • Compare the Norvig and bloom filter approaches for performance.
  • Use the better one with the Flask application for a bare minimal public demo.
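For the first item, finding the most frequently used words, a minimal sketch could look like the one below. The function name and the sample sentences are my own for illustration. The regular expression matches runs of characters from the Unicode Tamil block (U+0B80 to U+0BFF), which also contains Tamil digits and symbols, so a real cleanup would need stricter rules.

```python
import re
from collections import Counter

# Runs of characters from the Unicode Tamil block.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def word_frequencies(lines):
    """Count how often each Tamil word appears in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAMIL_WORD.findall(line))
    return counts


sample = ["மழை வரும், மேகம் வரும்.", "மழை பெய்யும்"]
for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
```

Words seen many times across different sources are more likely to be correctly spelled, so a frequency cutoff is one possible way to get to a clean dataset.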

If you are interested in contributing to this project, email me at (tshrinivasan@gmail.com)

Read the previous days' notes on building the Tamil spellchecker.

Study notes on open-tamil spellchecker – day 1
Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
