Building Open Source Tamil Spellchecker – Day 7 – Scraping websites to get more words


Table of Contents
1. Day 7 - Scraping websites to get more words
2. Mirroring websites with httrack
3. Frequency of words
4. Current status
5. Harinath builds scraper program for WordPress sites
6. Few more data sources
7. Got a server to run the scraper programs
8. Can we build a free open source …

Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?


Today, I did the following.
1. Ran unique/sort on the existing names repo, https://github.com/KaniyamFoundation/all_tamil_nouns. The current master file for nouns, unique_sorted_noun_master.txt, now has 1,53,548 nouns.
2. Merged nouns and all words: merged all the collected nouns and words into a master file to use for the bloom filter. Word count: wc -l unique_sorted_words_in_words_master.txt 23,92,064 …
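To make the bloom filter idea concrete, here is a minimal sketch in Python. It is not the code used in the post: the `BloomFilter` class, its sizing, and the three sample words are all assumptions made for illustration. It shows the key property for a spellchecker: a word that was added is always reported as present, while an unknown word may occasionally be reported as present too (a false positive).

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical helper, not from the post)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2
        self.size = max(1, int(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


# Sample words chosen for illustration; the real master file has about 24 lakh lines.
words = ["அம்மா", "அப்பா", "தமிழ்"]
bloom = BloomFilter(expected_items=len(words))
for w in words:
    bloom.add(w)

print(all(w in bloom for w in words))
```

For the real word list, the same class could be filled by reading the master file line by line, with `expected_items` set to the line count reported by `wc -l`.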

Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words


Today, I downloaded the datasets below and cleaned them all to find the unique words in all the datasets.
https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp
https://www.kaggle.com/sembiyan/tamil-oscar-corpus
https://www.kaggle.com/disisbig/tamil-wikipedia-articles
https://www.kaggle.com/disisbig/tamil-news-dataset
Here are a few stats/counts on original vs. unique words.
1. ta_dedup.txt - 5.1 GB - 6971837 words
   unique_sorted_words_in_ta_dedup.txt - 19 M - 553976 words
2. tamil-language-corpus-for-nlp.zip
   wiki.txt - 354M - 1444046
   unique_sorted_words_in_wiki.txt - …
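As a rough illustration of the "find unique words" step, here is a small Python sketch. The post does not say which tools were used for cleaning, so the regular expression, the `unique_tamil_words` function, and the sample lines are assumptions. It treats any run of characters from the Unicode Tamil block as a word, which also lets Tamil digits and symbols through.

```python
import re

# Runs of characters from the Unicode Tamil block (U+0B80 to U+0BFF).
# Assumption: anything outside this block (Latin text, punctuation) is noise.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(lines):
    """Return the sorted set of Tamil words found in an iterable of text lines."""
    seen = set()
    for line in lines:
        seen.update(TAMIL_WORD.findall(line))
    return sorted(seen)


# Made-up sample input; a real run would iterate over an open corpus file instead.
sample = ["தமிழ் மொழி, Tamil language", "மொழி தமிழ்!"]
print(unique_tamil_words(sample))
```

Passing an open file object instead of `sample` keeps memory use proportional to the number of unique words rather than the size of the corpus.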

Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?


Yesterday, I collected the Tamil nouns and published them here - https://github.com/KaniyamFoundation/all_tamil_nouns Well, there are too many derived words in Tamil. How do we check these derived words? There are rules upon rules; tons of grammar rules are available in Tamil. We have to apply those rules to each word that is not a base word or noun. Today, …

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset


Table of Contents
1. We need tons of good base words collection.
1.1. Few issues here.
1.2. Possibilities
2. Apply grammar rules to check for errors on derived words
2.1. Issues here
3. Suggestion alternates
4. Unknown items
5. Packaging
What do we need to build a good spellchecker for Tamil?
1 We need tons of …

Study notes on open-tamil spellchecker – day 1


Open-Tamil is a wonderful Python module built to process Tamil text. We can build awesome NLP tools for Tamil using this module. We can get it from here - https://github.com/Ezhil-Language-Foundation/open-tamil I have been dreaming of an open source Tamil spellchecker for around 10 years. It needs someone to explore and work on it continuously. We have …

How to translate OpenStreetMaps to other languages?


OpenStreetMap.org is a wonderful community-driven maps portal. We can create maps in our own language too. Here is an image of the Tamil map. Demo https://api.mapbox.com/v4/srikanthlogic.714e671e/page.html?access_token=pk.eyJ1Ijoic3Jpa2FudGhsb2dpYyIsImEiOiJuQ1RYS3pjIn0.7YUMcAQAc4A7T703-yAu2g#4/13.03/80.07 I am much inspired by the Tamil maps provided by OSM. The only thing we have to do is translate all the strings to Tamil. I had a …

Released – Online Tamil Text to Speech System


We are happy to release the Online Tamil Text to Speech conversion software at http://tts.kaniyam.com This is Free Software. Get the source code here - https://github.com/KaniyamFoundation/tts-web It is made with Ubuntu/Linux, Python, Django, Celery, MySQL etc. Introduction This online software can convert your Tamil text files to audio files in MP3 format. (TTS = …

PySangamam – Tons of learnings, Happy moments


PySangamam is the first Python conference of Tamil Nadu. The ChennaiPy team arranged this two-day conference. Conducting a conference is like arranging a marriage. It takes all our time and energy for a few months. I don't know how to thank the ChennaiPy team, organizers and volunteers for this great event. My apologies for …