Table of Contents 1. Day 7 - Scraping websites to get more words 2. Mirroring websites with httrack 3. Frequency of words 4. Current status 5. Harinath builds a scraper program for WordPress sites 6. A few more data sources 7. Got a server to run the scraper programs 8. Can we build a free open source … Continue reading Building Open Source Tamil Spellchecker – Day 7 – Scrapping websites to get more words
Today, I did the following. 1. Ran unique/sort on the existing names repo. https://github.com/KaniyamFoundation/all_tamil_nouns The current master file for nouns, unique_sorted_noun_master.txt, now has 1,53,548 nouns. 2. Merged nouns and all words: merged all the collected nouns and words into a master file to use for the Bloom filter. Word count: wc -l unique_sorted_words_in_words_master.txt 23,92,064 … Continue reading Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
Today, I downloaded the datasets below and cleaned them all to find the unique words in each. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp https://www.kaggle.com/sembiyan/tamil-oscar-corpus https://www.kaggle.com/disisbig/tamil-wikipedia-articles https://www.kaggle.com/disisbig/tamil-news-dataset Here are a few stats/counts on original vs. unique words. 1. ta_dedup.txt - 5.1 GB - 6971837 words unique_sorted_words_in_ta_dedup.txt - 19 M - 553976 words 2. tamil-language-corpus-for-nlp.zip wiki.txt - 354M - 1444046 unique_sorted_words_in_wiki.txt - … Continue reading Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
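A minimal sketch of how the "find unique words" step could look in Python. This is not the script used for the counts above. The function name `unique_sorted_words` and the sample lines are made up for illustration, and it assumes a "word" is any run of characters from the Tamil Unicode block (U+0B80 to U+0BFF), which also lets Tamil digits and symbols through.

```python
import re

# Assumption: a Tamil "word" is a run of code points from the Tamil
# Unicode block (U+0B80..U+0BFF). Everything else acts as a separator.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_sorted_words(lines):
    """Collect every Tamil word seen in an iterable of text lines."""
    seen = set()
    for line in lines:
        seen.update(TAMIL_WORD.findall(line))
    # Sorting is by code point, which is good enough for de-duplication.
    return sorted(seen)


# Hypothetical sample input; a real run would iterate over a corpus file.
sample = ["தமிழ் மொழி, தமிழ் நூல்", "Hello மொழி 123"]
result = unique_sorted_words(sample)
print(len(result), "unique words")
```

For a large corpus, an open file handle can be passed instead of a list, for example `unique_sorted_words(open("wiki.txt", encoding="utf-8"))`, so the file is read line by line. The set of unique words still has to fit in memory.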
Yesterday, I collected the Tamil nouns and published them here - https://github.com/KaniyamFoundation/all_tamil_nouns Well, there are too many derived words in Tamil. How do we check these derived words? There are rules upon rules; tons of grammar rules are available in Tamil. We have to apply those rules to each word that is not a base word or noun. Today, … Continue reading Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
While exploring the spellchecker, I found that we need a good dataset of base words that can be queried quickly to tell whether a word is in the dataset or not. A Bloom filter can be used to do this quick check. But what we need is a huge dataset. Collecting All Tamil Nouns Nouns are the major parts in any … Continue reading Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
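To show what the "quick check" could look like, here is a small Bloom filter sketch in plain Python. It is only an illustration, not the implementation used in this series. The class name `BloomFilter`, its parameters, and the three sample words are assumptions. A Bloom filter never misses a word that was added, but it can occasionally say "present" for a word that was never added, so a hit means "probably in the dataset".

```python
import hashlib
import math


class BloomFilter:
    """Set-like structure: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        n = max(1, expected_items)
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(8, int(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


# Hypothetical base words; a real run would load the full master word list.
words = ["அம்மா", "தமிழ்", "நூல்"]
bf = BloomFilter(expected_items=len(words))
for w in words:
    bf.add(w)

print("தமிழ்" in bf)
```

The memory cost depends only on the expected number of words and the accepted false-positive rate, not on word length, which is why this structure suits a very large word list.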
Table of Contents 1. We need a huge collection of good base words. 1.1. A few issues here. 1.2. Possibilities 2. Apply grammar rules to check for errors in derived words 2.1. Issues here 3. Suggesting alternatives 4. Unknown items 5. Packaging What do we need to build a good spellchecker for Tamil? 1 We need tons of … Continue reading Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Open-Tamil is a wonderful Python module built to process Tamil text. We can build awesome NLP tools for Tamil using this module. We can get it from here - https://github.com/Ezhil-Language-Foundation/open-tamil I have been dreaming of an open source Tamil spellchecker for around 10 years. It needs someone to explore and work on it continuously. We have … Continue reading Study notes on open-tamil spellchecker – day 1
OpenStreetMap.org is a wonderful community-driven maps portal. We can create maps in our own language too. Here is an image of a Tamil map. Demo https://api.mapbox.com/v4/srikanthlogic.714e671e/page.html?access_token=pk.eyJ1Ijoic3Jpa2FudGhsb2dpYyIsImEiOiJuQ1RYS3pjIn0.7YUMcAQAc4A7T703-yAu2g#4/13.03/80.07 I am much inspired by the Tamil maps provided by OSM. The only thing we have to do is translate all the strings to Tamil. I had a … Continue reading How to translate OpenStreetMaps to other languages?
We are happy to release the online Tamil Text to Speech conversion software at http://tts.kaniyam.com This is Free Software. Get the source code here - https://github.com/KaniyamFoundation/tts-web It is made with Ubuntu/Linux, Python, Django, Celery, MySQL etc. Introduction This online software can convert your Tamil text files to audio files in MP3 format. (TTS = … Continue reading Released – Online Tamil Text to Speech System
PySangamam is the first Python conference of Tamil Nadu. The ChennaiPy team arranged this two-day conference. Conducting a conference is like arranging a marriage. It takes all our time and energy for a few months. I don't know how to thank the ChennaiPy team, organizers and volunteers for this great event. My apologies for … Continue reading PySangamam – Tons of learnings, Happy moments