Building Open Source Tamil Spellchecker – Day 9 – Ported from C# to Python


The Tamil Nadu government released a spellchecker as open source here - https://github.com/Tamil-Virtual-Academy/Tamilinaiya-Spellchecker and I have joined friends in porting it to Python. It is a desktop application written in C#. As Linux has a C# environment called Mono, I was advised to port it to Mono first. But I am completely new to C# and Mono. Decided …

Building Open Source Tamil Spellchecker – Day 8 – Porting from C# to Python


Recently, Tamil Virtual Academy released 10 Tamil NLP tools as Free/Open Source Software with source code. The set includes a spellchecker. Read more about this here: https://goinggnu.wordpress.com/2020/08/16/tamilvu-released-10-tamil-software-as-free-open-source-software/ The spellchecker is written in C#. I want it ported to Python so that we can extend it easily. C# is a very new language for …

Building Open Source Tamil Spellchecker – Day 7 – Scraping websites to get more words


Table of Contents
1. Day 7 - Scraping websites to get more words
2. Mirroring websites with httrack
3. Frequency of words
4. Current status
5. Harinath builds a scraper program for WordPress sites
6. A few more data sources
7. Got a server to run the scraper programs
8. Can we build a free open source …

Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?


Today, I did the following.
1. Ran unique/sort on the existing names repo: https://github.com/KaniyamFoundation/all_tamil_nouns The current master file for nouns, unique_sorted_noun_master.txt, now has 1,53,548 nouns.
2. Merged nouns and all words: combined all the collected nouns and words into one master file to use for the bloom filter. Word count: wc -l unique_sorted_words_in_words_master.txt 23,92,064 …
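To make the bloom filter idea concrete, here is a minimal sketch in plain Python of how a word list could be loaded into a bloom filter and queried. It is not the code used in the post: the `BloomFilter` class, its parameter names, and the three sample words are assumptions made for illustration, and a real run would read the words from the master file instead.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative bloom filter backed by a bytearray (hypothetical helper)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas:
        #   bits   m = -n * ln(p) / (ln 2)^2
        #   hashes k = (m / n) * ln 2
        n, p = expected_items, false_positive_rate
        self.size = max(8, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive all k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


# Sample words stand in for the lines of the master word file.
words = ["தமிழ்", "பாடம்", "அம்மா"]
bf = BloomFilter(expected_items=len(words))
for w in words:
    bf.add(w)

print(all(w in bf for w in words))  # added words are always found
```

A bloom filter never misses a word that was added, but it can occasionally report a word as present when it is not, so a spellchecker built on it trades a small false positive rate for a compact, fast lookup.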

Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words


Today, I downloaded the datasets below and cleaned them all to find the unique words in each.
https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp
https://www.kaggle.com/sembiyan/tamil-oscar-corpus
https://www.kaggle.com/disisbig/tamil-wikipedia-articles
https://www.kaggle.com/disisbig/tamil-news-dataset
Here are a few stats/counts on original vs. unique words.
1. ta_dedup.txt - 5.1 GB - 6971837 words
   unique_sorted_words_in_ta_dedup.txt - 19 M - 553976 words
2. tamil-language-corpus-for-nlp.zip
   wiki.txt - 354M - 1444046
   unique_sorted_words_in_wiki.txt - …
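The cleaning step, reducing a raw corpus to a sorted list of unique Tamil words, could look roughly like the sketch below. This is an assumption about the approach, not the script used in the post: the function name `unique_tamil_words` is hypothetical, and treating any run of characters from the Tamil Unicode block (U+0B80 to U+0BFF) as a word is a simplification that also lets Tamil digits and symbols through.

```python
import re
import unicodedata

# Any run of characters from the Tamil Unicode block counts as one word here.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(lines):
    """Return the sorted unique Tamil words found in an iterable of text lines."""
    words = set()
    for line in lines:
        # NFC normalization keeps differently encoded forms of a word together.
        normalized = unicodedata.normalize("NFC", line)
        words.update(TAMIL_WORD.findall(normalized))
    return sorted(words)


# Two sample lines stand in for a corpus file opened with encoding="utf-8".
sample = [
    "தமிழ் is Tamil, தமிழ் பாடம்.",
    "hello பாடம் 123",
]
result = unique_tamil_words(sample)
print(len(result), result)
```

Because the function accepts any iterable of lines, a multi-gigabyte file can be passed as an open file object and processed line by line; only the set of unique words has to fit in memory.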

Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?


Yesterday, I collected Tamil nouns and published them here - https://github.com/KaniyamFoundation/all_tamil_nouns There are a great many derived words in Tamil. How do we check these derived words? Tamil has a large number of grammar rules, and we have to apply those rules to each word that is not a base word or noun. Today, …

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset


Table of Contents
1. We need a large collection of good base words
1.1. A few issues here
1.2. Possibilities
2. Apply grammar rules to check for errors in derived words
2.1. Issues here
3. Suggesting alternatives
4. Unknown items
5. Packaging
What do we need to build a good spellchecker for Tamil? 1 We need tons of …

Study notes on open-tamil spellchecker – day 1


Open-Tamil is a wonderful Python module built to process Tamil text. We can build awesome NLP tools for Tamil using this module. We can get it from here - https://github.com/Ezhil-Language-Foundation/open-tamil I have been dreaming of an open source Tamil spellchecker for around 10 years. It needs someone to explore and work on it continuously. We have …