Yesterday, I collected Tamil nouns and published them here – https://github.com/KaniyamFoundation/all_tamil_nouns
Well, there are too many derived words in Tamil. How do we check these derived words? Tamil has tons of grammar rules, and we have to apply those rules to every word that is not a base word or noun.
Today, I got a weird idea: collect all the available unique Tamil words.
How many words are there in Tamil in total? Who knows? Maybe some Tamil scholars or linguists can give a rough count.
Every base word can be derived in 30-40 ways. (Someone please tell me the correct number. I will update it here.)
If we have one lakh base words, we can get 30-40 lakh derived words.
What if we generate all these derived words, put them in one dataset and check each word against it? A Bloom filter looks promising for quickly checking whether a word is in a given dataset, so this seems possible.
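To make the idea concrete, here is a minimal Bloom filter sketch in Python. It is only an illustration, not the code used in this project: the class name `BloomFilter`, the 1% false positive rate and the SHA-256 based double hashing are all my assumptions. A Bloom filter can wrongly say "present" for a word that was never added, but it never says "absent" for a word that was added.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and parameters here are assumptions."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, int(-expected_items * math.log(false_positive_rate)
                               / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


words = BloomFilter(expected_items=1000)
for known in ["தமிழ்", "மொழி", "சொல்"]:
    words.add(known)

print("தமிழ்" in words)   # a word that was added is always reported as present
```

With the same sizing formula, 40 lakh words at a 1% false positive rate need roughly 9.6 bits per word, which is about 4.8 MB in total. So the filter itself should fit in memory easily, even on old computers.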
Instead of generating the derived words ourselves (for that we need to know the grammar rules to apply; if you know the Tamil grammar rules for deriving more words from a base word, please share them), I decided to get all the unique words from the big datasets that are already available.
I found the following to be good sources for collecting Tamil words.
- Project Madurai
- Tamil Wikipedia
- Tamil Wiktionary
- Tamil Wikisource
I have already scraped Project Madurai.
I have code to scrape the wiki sites, written a few years ago. https://github.com/tshrinivasan/tamil-wikipedia-word-list
Psankar wrote “Korkai” to build a unique word list from various sources. https://github.com/psankar/korkai It seems super fast.
Here are a few dataset collections.
- https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp
- https://www.kaggle.com/sembiyan/tamil-oscar-corpus
- https://archive.org/download/tamilpulavar.db.sql.tar.gz/tamilpulavar.db.sql.tar.gz
- https://github.com/vigneshwaran-chandrasekaran/tamil-language-words-list
Apart from these, we can scrape various blogs and newspaper websites.
If you have already scraped some sites, please publish the data online and share it with us. It will save us plenty of time and effort.
Issues in this method
Many language experts won't support this approach of collecting all words.
- There may be many incorrect words
- Dataset may become huge
- Querying a huge dataset may be slow, and may not work on old computers
- No one has collected all the words and tested the performance
As no one has collected all the available Tamil words and tested this, I wanted to give it a try. Malaikannan said that this may work well, given the speed of modern computers. A Bloom filter can make it even faster. Then why not give it a try?
Even if this experiment fails, we will learn something from it.
Hence, I decided to download the huge datasets from Kaggle.
How to clean up the data?
- Remove all symbols
- Remove all numbers
- Remove all non-Tamil letters
- Find unique words
- Write them one word per line
By doing this, we can build a unique word collection from any dataset.
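The cleanup steps above can be sketched in a few lines of Python. This is not the script linked in this post, just a small illustration. Instead of deleting symbols, numbers and non-Tamil letters one by one, it keeps only runs of Tamil letters, which has the same effect. The code point range and the function names are my assumptions.

```python
import re

# Assumption: runs of code points U+0B82..U+0BD7 (Tamil letters and vowel signs)
# count as words. Tamil digits and symbols (U+0BE6 onward) fall outside the range,
# so they are dropped along with Latin letters, ASCII digits and punctuation.
TAMIL_WORD = re.compile(r"[\u0B82-\u0BD7]+")


def unique_tamil_words(text):
    """Return the set of distinct Tamil words found in text."""
    return set(TAMIL_WORD.findall(text))


def to_word_list(words):
    """Format words one per line, sorted by code point for stable output."""
    return "\n".join(sorted(words))


sample = "தமிழ் மொழி 2021, hello! தமிழ்?"
print(to_word_list(unique_tamil_words(sample)))
```

For the sample line, only the two distinct Tamil words remain; the number, the English word and the punctuation are gone.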
I wrote a Python script to do all of this – https://gist.github.com/tshrinivasan/9ca7203e55ad67971b854d1c9ca22e7f
But it was too slow to process a 5 GB file from Kaggle.
Hence, I tried the Linux command line tools described here – https://github.com/tshrinivasan/tamil-wikipedia-word-list They rocked with speed. Within a few minutes, I got the first 3 points done. I now use Python code only to make the words unique. set() is very useful for this.
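The set() step can look roughly like the sketch below. It assumes the command line tools have already produced a file with one cleaned word per line; the function name and the file names in the usage comment are hypothetical. The file is read line by line, so only the unique words are held in memory, not the whole file.

```python
def dedupe_word_file(in_path, out_path):
    """Read one word per line, write each distinct word once, return the count."""
    unique = set()
    with open(in_path, encoding="utf-8") as source:
        for line in source:  # stream line by line instead of loading the file
            word = line.strip()
            if word:  # skip blank lines
                unique.add(word)
    with open(out_path, "w", encoding="utf-8") as target:
        for word in sorted(unique):
            target.write(word + "\n")
    return len(unique)


# Hypothetical usage:
# count = dedupe_word_file("cleaned_words.txt", "unique_words.txt")
```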
Removing sandhi letters
Most of the words have a sandhi letter as their last letter. Example –
To remove them, we need a list of sandhi characters. I called Mr. Palani. He is a linguist in Kerala, and he has been encouraging me for many years to build a spellchecker. He agreed to mentor me and share all the Tamil rules for processing words. He said that there are only 4 sandhi characters: க், ச், த், ப்
We have to parse all the unique words again and remove the sandhi letter. We can add it back later, if required, while spell checking for the user.
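A minimal sketch of that second pass, assuming the rule is exactly "drop a final க், ச், த் or ப்". The function name and the sample word are mine, chosen only for illustration. Each sandhi letter is two code points (the consonant plus the pulli), so the code removes a two-character suffix.

```python
# The four sandhi characters listed above; each is a consonant followed by pulli.
SANDHI_SUFFIXES = ("க்", "ச்", "த்", "ப்")


def strip_sandhi(word):
    """Return word without a trailing sandhi letter, if it has one."""
    if word.endswith(SANDHI_SUFFIXES):
        # Every suffix is the same length (consonant + pulli).
        return word[:-len(SANDHI_SUFFIXES[0])]
    return word


def strip_sandhi_all(words):
    """Apply strip_sandhi to every word and return the distinct results."""
    return {strip_sandhi(word) for word in words}


print(strip_sandhi("அவனைப்"))  # the final ப் is removed
```

Running this over the unique word list also merges pairs that differ only by the sandhi letter, so the list gets a bit smaller.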
Now I ran into an issue. What about words like ஒலிம்பிக், பீப்?
Palani sir smiled and said that they are not Tamil words. 🙂
For now, we focus only on Tamil words. We can explore words from other languages later, as they will be very few in number.
Good. I will be working on collecting all unique Tamil words for a few days. Once done, I will make an MVP, either a command line application or a web application, for a quick demo of the progress so far. Stay tuned.
Thanks to all the great hearts who are supporting this project.
If you are interested in contributing to this project, email me at (firstname.lastname@example.org)
Read the previous days' notes on building the Tamil spellchecker.