Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns

While exploring the spellchecker, I found that we need a good dataset of base words, so that we can quickly query whether a word is in the dataset or not. A Bloom filter can be used for this quick check. But what we need first is a huge dataset.
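To make the quick check concrete, here is a minimal Bloom filter sketch in Python. It is only an illustration, not the code used in this project: the class name `BloomFilter`, the sizes, the SHA-256 based double hashing and the three sample words are all my assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: never misses an added word, but may rarely
    report a word that was not added (a false positive)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        # size_bits and num_hashes are illustrative defaults, not tuned values.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word: str):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: position_i = h1 + i * h2.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


# Sample words chosen only for illustration.
bloom = BloomFilter()
for noun in ["மரம்", "வீடு", "புத்தகம்"]:
    bloom.add(noun)

print("மரம்" in bloom)  # True: an added word is always found
print("மரம" in bloom)   # expected False; a false positive is possible but very unlikely here
```

The filter stores only bits, not the words themselves, so it stays small even for a few lakh words. The price is that a "yes" answer means "probably in the dataset", while a "no" answer is always correct.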

Collecting All Tamil Nouns

Nouns are a major part of any language. Our languages are filled with nouns. Nouns make up more than 70% of all our communication. So, if we collect all the nouns and add them to our dataset, it will be a huge collection.

I explored the internet to find a huge list of all Tamil nouns. I could not find any, and felt ashamed of this. We have Anna University, Tamil Virtual University, Classical Tamil Research Center, NRC-FOSS, TDIL and more government organizations spending crores of public money on Tamil research and development. Still, they have not released any good dataset for Tamil research. They have done great research work, but it all sleeps on their shelves or is locked inside websites.

Reinventing the wheel happens again and again in the Tamil computing world. To put an end to this, last year we decided to collect all the nouns and build a good dataset, to be released under a public domain license. https://github.com/KaniyamFoundation/ProjectIdeas/issues/18

We started collecting nouns last year itself. Fortunately, we found a good contributor to collect the nouns: Mrs. Divya Gunasekaran, an M.Phil Tamil student from Chellammal Arts College, Chennai. She collected nearly one lakh nouns and published them in this public Google Sheet.

https://docs.google.com/spreadsheets/d/1FqiFLstsTo6DXsPKPKzp7iPKR49Ml2k81UPR6Nq6inQ/edit?usp=sharing

Many thanks to Dhivya for her tireless work in making this huge collection of nouns.

Today, I worked on collecting a few more nouns and releasing everything as plain text files in a GitHub repo.

I collected nouns from the resources below:

  • nouns collected by Dhivya – 97875
  • peyar.in boy names – 20391
  • peyar.in girl names – 24030
  • random collection – 1115
  • tamilsurangam.in – 1249
  • wiktionary – 85256

Total – 2,29,916

Unique words count – 1,92,122
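The two numbers above can be reproduced with a few lines of Python. This is a rough sketch, not the script used to build the repo: the function name `merge_unique`, the whitespace stripping and the Unicode NFC normalization are my assumptions, and the sample lists are made up for illustration. Only the per-source counts are taken from this post.

```python
import unicodedata

# Per-source counts as listed in this post.
source_counts = {
    "nouns collected by Dhivya": 97875,
    "peyar.in boy names": 20391,
    "peyar.in girl names": 24030,
    "random collection": 1115,
    "tamilsurangam.in": 1249,
    "wiktionary": 85256,
}
total = sum(source_counts.values())
print(total)  # 229916, written as 2,29,916 in Indian digit grouping


def merge_unique(word_lists):
    """Merge several lists of words and return the sorted unique words.

    Assumption: words are stripped of surrounding whitespace and
    normalized to Unicode NFC, so that the same Tamil word typed with
    different code point sequences is counted only once.
    """
    unique = set()
    for words in word_lists:
        for word in words:
            cleaned = unicodedata.normalize("NFC", word.strip())
            if cleaned:  # skip empty lines
                unique.add(cleaned)
    return sorted(unique)


# Made-up sample lists; the real lists are the files in the repo.
sample_a = ["மரம்", "வீடு ", ""]
sample_b = ["வீடு", "புத்தகம்"]
merged = merge_unique([sample_a, sample_b])
print(len(merged))  # 3
```

The gap between the total and the unique count comes from words that appear in more than one source, for example a name that is present in both peyar.in and wiktionary.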

Released all here – https://github.com/KaniyamFoundation/all_tamil_nouns

TODO

  • Collect more nouns and add them to this repo.
  • Check for any errors and fix them in these files.
  • Collect all verbs and other word forms in Tamil too.

If you are interested in contributing to this project, email me at tshrinivasan@gmail.com

Read the previous days' notes on building the Tamil spellchecker:

Study notes on open-tamil spellchecker – day 1

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset

 

 
