Building Open Source Tamil Spellchecker – Day 9 – Ported from C# to Python


The Tamil Nadu government released a spellchecker as open source at https://github.com/Tamil-Virtual-Academy/Tamilinaiya-Spellchecker and I have joined friends in porting it to Python. It is a desktop application written in C#. As Linux has a C# environment called Mono, I was advised to port it to Mono first. But I am completely new to C# and Mono. Decided … Continue reading Building Open Source Tamil Spellchecker – Day 9 – Ported from C# to Python

IRC logs of “Introduction to Open-Tamil Python library”


I gave an IRC-based text chat talk on "Introduction to Open-Tamil Python library" at the Indian Linux Users Group Chennai July 2020 meet. Here is the chat log. Welcome all. Today, let us explore a Python library for Tamil: Open-Tamil. You can download it from https://github.com/Ezhil-Language-Foundation/open-tamil It gives all the basic functionalities for … Continue reading IRC logs of “Introduction to Open-Tamil Python library”

Building Open Source Tamil Spellchecker – Day 7 – Scraping websites to get more words


Table of Contents 1. Day 7 - Scraping websites to get more words 2. Mirroring websites with httrack 3. Frequency of words 4. Current status 5. Harinath builds a scraper program for WordPress sites 6. A few more data sources 7. Got a server to run the scraper programs 8. Can we build a free open source … Continue reading Building Open Source Tamil Spellchecker – Day 7 – Scraping websites to get more words

Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?


Today, I did the following. 1. Ran unique/sort on the existing names repo, https://github.com/KaniyamFoundation/all_tamil_nouns. The current master file for nouns, unique_sorted_noun_master.txt, now has 1,53,548 nouns. 2. Merged nouns and all words: merged all the collected nouns and words into a master file to use for the bloom filter. Word count: wc -l unique_sorted_words_in_words_master.txt 23,92,064 … Continue reading Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
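To make the bloom filter idea concrete, here is a minimal sketch in pure Python. It is not the implementation used in the post: the class name `BloomFilter`, the helper `load_words`, the 1% false-positive target, and the SHA-256 based double hashing are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        self.size = max(8, int(-n * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from two 64-bit slices of one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


def load_words(path, bloom):
    """Hypothetical helper: add one word per line from a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                bloom.add(word)
```

With a master file like the one named above, usage would look like `bloom = BloomFilter(2_400_000)`, then `load_words("unique_sorted_words_in_words_master.txt", bloom)`, then `"தமிழ்" in bloom`. A `True` answer means "probably in the list"; a `False` answer means "definitely not in the list".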

Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words


Today, I downloaded the datasets below and cleaned them all to find the unique words in each. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp https://www.kaggle.com/sembiyan/tamil-oscar-corpus https://www.kaggle.com/disisbig/tamil-wikipedia-articles https://www.kaggle.com/disisbig/tamil-news-dataset Here are a few stats/counts on original vs. unique words. 1. ta_dedup.txt - 5.1 GB - 6971837 words unique_sorted_words_in_ta_dedup.txt - 19 M - 553976 words 2. tamil-language-corpus-for-nlp.zip wiki.txt - 354M - 1444046 unique_sorted_words_in_wiki.txt - … Continue reading Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
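The excerpt does not say how the cleaning was done, so the following is only a sketch of one possible approach. It assumes that a "word" is a run of characters from the Unicode Tamil block (U+0B80 to U+0BFF); the function names `unique_tamil_words` and `write_unique_words` are hypothetical.

```python
import re

# Assumption: a "word" is a run of code points from the Unicode Tamil block.
# This also matches Tamil digits and symbols, which a real cleanup may want to drop.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(lines):
    """Return the sorted set of Tamil-script tokens found in an iterable of lines."""
    seen = set()
    for line in lines:
        seen.update(TAMIL_WORD.findall(line))
    return sorted(seen)


def write_unique_words(src_path, dst_path):
    """Read a corpus file line by line and write one unique word per line."""
    with open(src_path, encoding="utf-8", errors="ignore") as src:
        words = unique_tamil_words(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(words) + "\n")
    return len(words)
```

The file is streamed line by line, so only the set of unique words has to fit in memory, not the whole multi-gigabyte corpus.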

Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?


Yesterday, I collected the Tamil nouns and published them here - https://github.com/KaniyamFoundation/all_tamil_nouns Well, there are too many derived words in Tamil. How do we check these derived words? There are rules upon rules: Tamil has tons of grammar rules. We have to apply those rules to each word that is not a base word or noun. Today, … Continue reading Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset


Table of Contents 1. We need tons of good base words. 1.1. A few issues here. 1.2. Possibilities 2. Apply grammar rules to check for errors on derived words 2.1. Issues here 3. Suggesting alternatives 4. Unknown items 5. Packaging What do we need to build a good spellchecker for Tamil? 1 We need tons of … Continue reading Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset