Since the Tamil Nadu government released a spellchecker as open source here - https://github.com/Tamil-Virtual-Academy/Tamilinaiya-Spellchecker - I have joined friends in porting it to Python. It is a desktop application written in C#. As Linux has a C# environment called Mono, I was advised to port it to Mono first. But I am completely new to C# and Mono. Decided … Continue reading Building Open Source Tamil Spellchecker – Day 9 – Ported from C# to Python
Recently, Tamil Virtual Academy released 10 Tamil NLP tools as Free/Open Source Software, with source code. The release includes a spellchecker. Read more about it here: https://goinggnu.wordpress.com/2020/08/16/tamilvu-released-10-tamil-software-as-free-open-source-software/ The spellchecker is written in C#. I want to port it to Python so that we can extend it easily. C# is a very new language for … Continue reading Building Open Source Tamil Spellchecker – Day 8 – Porting from C# to Python
Table of Contents 1. Day 7 - Scraping websites to get more words 2. Mirroring websites with httrack 3. Frequency of words 4. Current status 5. Harinath builds a scraper program for WordPress sites 6. A few more data sources 7. Got a server to run the scraper programs 8. Can we build a free open source … Continue reading Building Open Source Tamil Spellchecker – Day 7 – Scrapping websites to get more words
Today, I did the following. 1. Ran unique/sort on the existing names repo: https://github.com/KaniyamFoundation/all_tamil_nouns The current master file for nouns, unique_sorted_noun_master.txt, now has 1,53,548 nouns. 2. Merge nouns and all words: merged all the collected nouns and words into a master file to use for the Bloom filter. Word count: wc -l unique_sorted_words_in_words_master.txt 23,92,064 … Continue reading Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
Today, I downloaded the datasets below and cleaned them all to find the unique words in each. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp https://www.kaggle.com/sembiyan/tamil-oscar-corpus https://www.kaggle.com/disisbig/tamil-wikipedia-articles https://www.kaggle.com/disisbig/tamil-news-dataset Here are a few stats/counts on original vs. unique words. 1. ta_dedup.txt - 5.1 GB - 6971837 words unique_sorted_words_in_ta_dedup.txt - 19 M - 553976 words 2. tamil-language-corpus-for-nlp.zip wiki.txt - 354M - 1444046 unique_sorted_words_in_wiki.txt - … Continue reading Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
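The "unique words" step in the Day 5 excerpt can be sketched roughly as follows. This is only an illustration, not the cleaning script used in the post: the function name `unique_tamil_words` is hypothetical, and it assumes a "word" is any run of characters from the Tamil Unicode block (U+0B80 to U+0BFF), which also lets Tamil digits and symbols through.

```python
import re

# Assumption: a "word" is any run of characters in the Tamil Unicode block.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(lines):
    """Return the sorted list of distinct Tamil words found in the given lines."""
    words = set()
    for line in lines:
        # findall skips Latin text, ASCII digits, and punctuation automatically.
        words.update(TAMIL_WORD.findall(line))
    # Sorted by Unicode code point, which may differ from a locale-aware `sort`.
    return sorted(words)


# Tiny in-memory sample; a real corpus would be read line by line from a file.
sample_lines = ["தமிழ் மொழி, தமிழ் words 123", "மொழி அழகு"]
unique_words = unique_tamil_words(sample_lines)
print(len(unique_words), unique_words)
```

For multi-gigabyte files such as ta_dedup.txt, passing an open file object instead of a list keeps memory use bounded by the number of distinct words rather than by the file size.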
Yesterday, I collected the Tamil nouns and published them here - https://github.com/KaniyamFoundation/all_tamil_nouns Well, there are too many derived words in Tamil. How do we check these derived words? There are rules upon rules; Tamil has tons of grammar rules. We have to apply those rules to each word that is not a base word or noun. Today, … Continue reading Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
On exploring the spellchecker, I found that we need a good dataset of base words that can be queried quickly to tell whether a word is in the dataset or not. A Bloom filter can be used for this quick check. But what we need is a huge dataset. Collecting All Tamil Nouns Nouns are the major parts in any … Continue reading Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
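To make the quick membership check mentioned in the Day 3 excerpt concrete, here is a minimal Bloom filter sketch. It is not the implementation used in the series: the `BloomFilter` class, its parameters, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration. A Bloom filter never reports a stored word as missing, but it can occasionally report an unseen word as present.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas for the bit count (m) and hash count (k).
        ln2 = math.log(2)
        self.size = max(8, math.ceil(-expected_items * math.log(false_positive_rate) / ln2 ** 2))
        self.hash_count = max(1, round(self.size / expected_items * ln2))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


# Hypothetical sample; the real input would be the collected noun list.
nouns = ["அம்மா", "மரம்", "வீடு"]
bloom = BloomFilter(expected_items=len(nouns))
for noun in nouns:
    bloom.add(noun)

print("மரம்" in bloom)  # a stored word is always reported as present
```

Because of the possible false positives, a word that passes the filter is only "probably known"; an exact lookup or grammar-rule check can follow when certainty matters.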
Table of Contents 1. We need tons of good base words collection. 1.1. Few issues here. 1.2. Possibilities 2. Apply grammar rules to check for errors on derived words 2.1. Issues here 3. Suggestion alternates 4. Unknown items 5. Packaging What do we need to build a good spellchecker for Tamil? 1 We need tons of … Continue reading Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Open-Tamil is a wonderful Python module built to process Tamil text. We can build awesome NLP tools for Tamil using this module. We can get it from here - https://github.com/Ezhil-Language-Foundation/open-tamil I have been dreaming of an open source Tamil spellchecker for around 10 years. It needs someone to explore and work on it continuously. We have … Continue reading Study notes on open-tamil spellchecker – day 1