Today, I did the following things.
1. Did unique/sort on the existing names repo.
The current master file for nouns,
unique_sorted_noun_master.txt, now has 1,53,548 nouns.
2. Merge nouns and all words
Merged all the collected nouns and words into a master file to use for the bloom filter.
wc -l unique_sorted_words_in_words_master.txt
23,92,064 unique_sorted_words_in_words_master.txt (79 MB in size)
Get all words from here
So, we have nearly 24 lakh unique Tamil words. Yay. 🙂
I will be adding more words.
If any of you have code for scraping Blogger and WordPress websites,
share it with me. Let us run web scraping for a few days, collect more unique words,
and add them to this master list.
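The unique/sort and merge steps above can also be sketched in Python. This is only a minimal sketch, assuming every input file holds one word per line in UTF-8; the function name `merge_unique_sorted` is made up for illustration and is not part of any existing repo.

```python
def merge_unique_sorted(input_paths, output_path):
    """Merge word lists (one word per line) into one sorted, de-duplicated file."""
    words = set()
    for path in input_paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                word = line.strip()
                if word:  # skip blank lines
                    words.add(word)
    # sorted() orders by Unicode code point, which may differ from a
    # locale-aware `sort` on the command line.
    ordered = sorted(words)
    with open(output_path, "w", encoding="utf-8") as out:
        out.writelines(word + "\n" for word in ordered)
    return len(ordered)
```

The return value is the number of unique words, which should match what `wc -l` reports for the output file.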
3. Run bloom filter for this new file
Malaikannan released a working version of a tiny spellchecker here, with words from Project Madurai.
I used the file unique_sorted_words_in_words_master.txt with that repo to
check the performance of the bloom filter.
File : TamilBloomFilterCreator.py
This file reads the given word dataset and creates a hash-table-like file
of 0s and 1s for quick querying.
bloomcreator = TamilBLoomFilterCreator("0.001","tamil_bloom_filter_all_tamil_words.txt", \
It took 15 seconds to run.
time python3 TamilBloomFilterCreator.py
The result file, tamil_bloom_filter_all_tamil_words.txt, is 33 MB (for the 79 MB input file).
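To make the idea concrete, here is a minimal bloom filter sketch. It is not the code inside TamilBloomFilterCreator.py, whose internals are not shown here. It assumes the standard sizing formulas (bits = -n ln p / (ln 2)^2, hashes = (bits / n) ln 2) and SHA-256 based double hashing, and the names `bloom_parameters` and `SimpleBloomFilter` are made up for illustration.

```python
import hashlib
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (number_of_bits, number_of_hashes) from the standard sizing formulas."""
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


class SimpleBloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        self.size, self.num_hashes = bloom_parameters(
            expected_items, false_positive_rate
        )
        # One byte per bit: wasteful, but keeps the sketch easy to read.
        self.bits = bytearray(self.size)

    def _positions(self, word):
        # Double hashing: derive all positions from two 64-bit values.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(word))
```

For 2,392,064 words at a false positive rate of 0.001, `bloom_parameters` gives about 34.4 million bits and 10 hash functions. That would be in line with a 33 MB result file if the file stores one character per bit, though that is only a guess.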
TamilwordChecker.py uses the bloom filter file to check quickly whether
any given word is present in the dataset or not.
tamilwordchecker = TamilwordChecker(2392064,"tamil_bloom_filter_all_tamil_words.txt")
time python3 TamilwordChecker.py
It takes nearly 0.5 seconds to check a word in a collection of 24 lakh words. That is quite fast.
I never expected it to be this fast.
Usually, I use grep to find a word in any file.
time grep "^மேகம்$" unique_sorted_words_in_words_master.txt
Grep is super fast here. But the real use of the bloom filter is in auto-suggestion,
where many nearly similar candidate words have to be checked quickly.
File : TamilSpellingAutoCorrect.py
spellchecker = TamilSpellingAutoCorrect("tamil_bloom_filter_all_tamil_words.txt", \
time python3 TamilSpellingAutoCorrect.py
['மேகசம்', 'மேகநம்', 'மேகம்', 'மேகனம்', 'மேகாம்', 'மேகும்', 'மேக்ச்', 'மேக்மா', 'மேக்ஸ்', 'மேட்ம்', 'மேன்ம்']
With the old dataset of only Project Madurai words, here is the result.
time python3 TamilSpellingAutoCorrect.py
['மேட்ம்', 'மேகநம்', 'மேகம்', 'மேகாம்', 'மேகும்', 'மேன்ம்']
With a larger input dataset, we get more suggestions.
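One common way to get such suggestions is to generate every string one edit away from the input and keep only those found in the word list. The sketch below follows that idea in the spirit of Norvig's spelling corrector. It is not the code in TamilSpellingAutoCorrect.py, and it makes a loud simplification: it edits single Unicode code points, while a real Tamil checker would more likely work on whole Tamil letters. The names `edits1` and `suggest` are illustrative, and `known` can be a set or anything else that supports `in`, such as a bloom filter.

```python
def edits1(word, alphabet):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    ]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, known, alphabet):
    """Return the word itself if known, else the known words one edit away."""
    if word in known:
        return [word]
    return sorted(cand for cand in edits1(word, alphabet) if cand in known)
```

Because every candidate needs one membership test, a bigger word list can only add suggestions for the same input, which fits the two results shown above.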
Web version of spellchecker in open-tamil
Muthu Annamalai added an awesome web interface to the spellchecker, in the open-tamil
package itself. It is based on Norvig's algorithm for lookup and suggestions.
We can run it using the commands below.
sudo pip3 install flask open-tamil
Point the browser to http://127.0.0.1:5000/static/tinymce/index.html
This gives a TinyMCE-based editor. You can type some sentences and run the spellchecker.
It shows the misspelled words with a red underline. On right-clicking those words, it gives
suggested words to choose from.
Wonderful. We have all the roads built already. We can run vehicles with a clean dataset as fuel.
We can just build a clean word dataset and use it to launch a minimal working version
for a public demo.
I am so happy to see the progress. Though we are taking baby steps, there is a long way to go.
There are many suggestions, ideas and helping hands.
i’m using https://f-droid.org/en/packages/de.reimardoeffinger.quickdic/
for offline translating.
maybe the collection could be a new dic for quickdic???
— lgs at ILUGC mailing list
For palaniappa en-ta dictionary, and tamil-lexicon digitized forms, we made them in unicode, and created .babylon source files, here one can find them:
— damodarreddy challa
— Santhosh Thottingal
If we have a spell checker to account for all sort of proper nouns/names, complex inflections with so many variations, that would be an ideal software not just for spell checker but for other applications as well. I am sure your list of words will contribute toward that goal.
My spell checker is constituted of a few steps: 1) consult the dictionary words, 2) apply morphological tagger to get to the root form, 3) use the custom database with inflected words that can not be accounted for, 4) make a database available with all complex forms for the admin to edit and update the custom database. This custom database is to be developed with a crowdsourcing option. 5) Identify all variations of use – Use of spoken Tamil forms, many variations of use of proper names.
Adding to the complexity, we also have issues with newly coined words for scientific terms; compound words and variations in compound word formations and so on so forth.
Have a look at http://spellcheck.tamilnlp.com/viewcustomdatabase.php where you can see a custom database that my crowd-sourcing application collected over the period. You can see all sort of complexities in it. I identify a few below:
ஏற்றுக்கொள்ளப்படும்பட்சத்தில் (there are many cases like this. பட்சத்தில் is used as a suffix)
ஐன்ஸ்ட்டீனுடையதை (many variations like this. Number of all the case forms of all such variations would be exponential and accounting for all of them would be a nightmare.)
We have so many acronyms.
சிபிஐ சிபிஐபாழடைந்த சிபிஐகாவல் சிபிஐக்கு சிபியின் சிபிஜ சிபிசிஐடி சிபிசிஐடிக்கு சிபிஎஸ்இ சிபிஎஸ்இக்கு
Then, we have so many variations due to using spoken Tamil forms like மன்னிக்கிறிங்க, மன்னிக்கிறீங்க …
Professor Deivasundaram did an amazing work on accounting for so many complex words including that of proper names, complex inflections and so on. I am sure Neechalkaran did a lot of this as well. I see Elango’s point on problems in accounting for all sort of complex names with many spelling variations.
— Prof. Vasu Ranganathan
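The first three steps in the comment above amount to a layered lookup, which can be sketched roughly as below. This is only a toy illustration and not Prof. Vasu's implementation: plain suffix stripping stands in for a real morphological tagger, and the function name `check_word`, the one-entry suffix list, and the tiny word sets are all made up for the example.

```python
def check_word(word, dictionary, custom_db, suffixes):
    """Say which layer accepts the word, or 'unknown' if none does."""
    # Step 1: consult the dictionary words.
    if word in dictionary:
        return "dictionary"
    # Step 2: crude stand-in for a morphological tagger. Strip one known
    # suffix and look up what remains as the root form.
    for suffix in suffixes:
        if word.endswith(suffix) and word[: -len(suffix)] in dictionary:
            return "root+suffix"
    # Step 3: fall back to the custom database of forms the rules miss.
    if word in custom_db:
        return "custom"
    # Anything left would go to a reviewer or the crowd (steps 4 and 5).
    return "unknown"


dictionary = {"சிபிஐ"}
custom_db = {"சிபிசிஐடி"}
suffixes = ("க்கு",)
print(check_word("சிபிஐக்கு", dictionary, custom_db, suffixes))
```

Real Tamil inflection involves sandhi changes and stacked suffixes, so a plain `endswith` check like this would miss most of the forms listed in the comment.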
We have to do everything said above. It is a very long path.
We are taking baby steps and making a little progress daily. We really don't know where we will end up. We are just a few computer programmers. We don't have many mentors or Tamil linguists to give us the rules and datasets.
(Palani sir agreed to mentor. Once we complete all the data-based explorations, we will reach out to him for rules.)
We may not match the high-quality work of Deivasundaram sir or others.
Whatever we do, we release all the code, processes, logic and datasets as open source (releasing daily), so that anyone can build on this and collaborate.
I will be sharing the progress daily.
I request you all to guide, mentor and share your thoughts.
If possible, share the logic, grammar rules and datasets, if you have them.
It will help us progress faster.
- Parse the existing datasets and find the most frequently used words, to build a clean dataset.
- Compare the Norvig and bloom filter approaches for performance.
- Use the better one with the Flask application for a bare minimal public demo.
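The first item in the list above could start from a simple frequency count. Here is a minimal sketch, assuming the source texts are plain strings with whitespace-separated words. The function name `frequent_words` and the `min_count` threshold are made up for illustration, and real text would also need punctuation stripped first.

```python
from collections import Counter


def frequent_words(texts, min_count=2):
    """Return words seen at least min_count times, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return [word for word, count in counts.most_common() if count >= min_count]
```

Words that appear only once are more likely to be typos or scraping noise, so a threshold like this is one possible first filter for a clean dataset.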
If you are interested in contributing to this project, email me at (firstname.lastname@example.org)
Read the previous days' notes on building the Tamil spellchecker.
Study notes on open-tamil spellchecker – day 1
Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?