While exploring the spellchecker, I found that we need a good set of base words as a dataset, so that we can quickly query whether a word is in the dataset or not. A Bloom filter can be used for this quick check. But what we need first is a huge dataset.
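To make the quick check concrete, here is a minimal Bloom filter sketch in Python. It is only an illustration, not the spellchecker's actual code: the class name `BloomFilter`, the sizing parameters, the use of SHA-256 with double hashing, and the three sample words are all assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter for word membership checks (hypothetical helper)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive all bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely not present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


# Sample words chosen only for illustration; a real run would load the noun list.
nouns = ["மரம்", "வீடு", "புத்தகம்"]
bf = BloomFilter(expected_items=200_000)
for noun in nouns:
    bf.add(noun)

print("மரம்" in bf)    # a word that was added
print("கணினி" in bf)  # a word that was not added
```

A Bloom filter can report a word as present when it is not (a false positive), but it never misses a word that was added. For a spellchecker this means a small fraction of misspelled words could slip through, in exchange for a very compact and fast lookup.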
Collecting All Tamil Nouns
Nouns are a major part of any language. Our languages are filled with nouns. More than 70% of the words we use in our communications are nouns. So, if we collect all the nouns and add them to our dataset, it can be a huge collection.
I searched the internet for a large list of all Tamil nouns and could not find any. I felt ashamed of this. We have Anna University, Tamil Virtual University, the Classical Tamil Research Center, NRC-FOSS, TDIL and more government organizations spending crores of public money on Tamil research and development. Still, they have not released any good dataset for Tamil research. They have done great research work, but it all sleeps on their shelves or stays locked in websites.
Because of this, the Tamil computing world keeps reinventing the wheel again and again. To put an end to this, last year we decided to collect all the nouns, build a good dataset, and release it under a public domain license. https://github.com/KaniyamFoundation/ProjectIdeas/issues/18
We started collecting nouns last year itself. Fortunately, we found a good contributor to collect the nouns: Mrs. Divya Gunasekaran, an M.Phil Tamil student from Chellammal Arts College, Chennai. She collected nearly one lakh nouns and published them in this public Google Sheet.
Tons of thanks to Dhivya for her tireless work on making this huge collection of nouns.
Today, I worked on collecting a few more nouns and releasing everything as plain text in a GitHub repo.
I collected nouns from the resources below:
- nouns collected by Dhivya – 97875
- peyar.in boy names – 20391
- peyar.in girl names – 24030
- random collection – 1115
- tamilsurangam.in – 1249
- wiktionary – 85256
total – 2,29,916
Unique words count – 1,92,122
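The total and unique counts above come from merging the lists and dropping duplicates. A rough sketch of that step in Python follows. The helper names `normalize` and `merge_word_lists`, the Unicode NFC normalization, and the tiny in-memory sample lists are assumptions for illustration, not necessarily how the released files were produced.

```python
import unicodedata


def normalize(word):
    # Strip surrounding whitespace and apply NFC normalization so that
    # words that look the same also compare as equal strings.
    return unicodedata.normalize("NFC", word.strip())


def merge_word_lists(sources):
    """Return (total_count, sorted_unique_words) for a dict of name -> iterable of words."""
    total = 0
    unique = set()
    for words in sources.values():
        for raw in words:
            word = normalize(raw)
            if not word:
                continue  # skip blank entries
            total += 1
            unique.add(word)
    return total, sorted(unique)


# Tiny made-up lists standing in for the real sources.
sources = {
    "list_a": ["மரம்", "வீடு", "  மரம் "],
    "list_b": ["வீடு", "கடல்", ""],
}
total, unique = merge_word_lists(sources)
print(total, len(unique))  # 5 3
```

Applied to the counts in this post, the gap between the total and the unique count (2,29,916 − 1,92,122 = 37,794) is the number of duplicate entries across the sources.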
Released all here – https://github.com/KaniyamFoundation/all_tamil_nouns
To do
- Collect more nouns and add them to this repo.
- Check for any errors and fix them in these files.
- Collect all verbs and other word forms in Tamil too.
If you are interested in contributing to this project, email me at (email@example.com)
Read the previous days' notes on building the Tamil spellchecker.