Today, I downloaded the datasets below and cleaned them all to find the unique words in each.
- https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp
- https://www.kaggle.com/sembiyan/tamil-oscar-corpus
Here are a few stats on original vs. unique words.
1. ta_dedup.txt – 5.1 GB – 6971837 words
unique_sorted_words_in_ta_dedup.txt – 19 MB – 553976 words
2. wiki.txt – 354 MB – 1444046 words
unique_sorted_words_in_wiki.txt – 8.7 MB – 273953 words
3. Dinamalar_dataset_2009_2019.csv – 5.2 GB – 2225214 words
unique_sorted_words_in_Dinamalar_dataset_2009_2019.csv – 5.7 MB – 170845 words
4. Tamilmurasu_dataset_06_Jan_2011_06_Jan_2020.csv – 495 MB – 138619 words
unique_sorted_words_in_Tamilmurasu_dataset_06_Jan_2011_06_Jan_2020.csv – 458 KB – 15663 words
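For readers who want to try the same kind of cleanup, here is a minimal sketch of extracting unique words from raw text. It is not the code from the repo; it assumes a "word" is any run of characters from the Tamil Unicode block (U+0B80 to U+0BFF), and the function name and sample lines are made up for illustration.

```python
import re

# Assumption: a Tamil "word" is a run of characters from the Tamil Unicode
# block. This also lets Tamil digits and symbols through, so a real cleanup
# may need extra filtering.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(lines):
    """Return the sorted list of distinct Tamil words in an iterable of lines."""
    seen = set()
    for line in lines:
        # findall drops English text, numbers and punctuation between words.
        seen.update(TAMIL_WORD.findall(line))
    return sorted(seen)


# Made-up sample lines standing in for a large corpus file.
sample = ["தமிழ் மொழி, Tamil language 2020", "மொழி தமிழ் நாடு"]
words = unique_tamil_words(sample)
print(len(words), words)
```

For the multi-GB files above, passing an open file object as `lines` reads the input one line at a time, but the set of unique words still has to fit in memory.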
Merged all the outputs and ran a unique/sort pass to create a master file.
Here are the counts for the master file.
filename : unique_sorted_words_master.txt
word count : 22,90,236
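The merge step can be sketched like this, assuming each per-dataset list is already sorted and unique. The helper name and the tiny lists are hypothetical; the real files are the unique_sorted_words_in_* outputs listed above.

```python
import heapq
from itertools import groupby


def merge_unique(*sorted_lists):
    """Merge already-sorted word lists into one sorted list without duplicates."""
    # heapq.merge walks the inputs lazily and needs each one to be sorted
    # in the same order that Python compares strings (by code point).
    merged = heapq.merge(*sorted_lists)
    # Duplicates are adjacent after merging, so groupby collapses them.
    return [word for word, _ in groupby(merged)]


# Hypothetical small lists standing in for the per-dataset word files.
a = ["அம்மா", "தமிழ்"]
b = ["தமிழ்", "நாடு"]
master = merge_unique(a, b)
print(len(master), master)
```

One caveat: the shell sort tool orders lines by the current locale, which may differ from Python's code point order. If the inputs were not sorted with LC_ALL=C, `sorted(set(a) | set(b))` is the safer choice.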
Here is the repo with the cleanup code.
repo – https://github.com/KaniyamFoundation/all_tamil_words
Thanks to everyone who has collected data and built these corpus datasets. Without them, scraping so many websites would be a very tough job. We still have to scrape many websites, but these existing datasets provide a great way to progress quickly.
This morning, Malaikannan released a working version of a Tamil spellchecker, which can query a given dataset using a Bloom filter and provide suggestions using the Levenshtein distance algorithm.
He released it here – https://github.com/malaikannan/TamilSpellChecker
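To show the two ideas together, here is a small sketch of a Bloom filter for membership checks and Levenshtein distance for suggestions. This is not Malaikannan's code; the class, function names, sizes and sample words are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        data = word.encode("utf-8")
        for seed in range(self.num_hashes):
            # Derive several hash functions by prefixing a seed byte.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


def levenshtein(a, b):
    """Edit distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance, closest first."""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


# Hypothetical three-word dictionary and a typo with one vowel sign missing.
dictionary = ["தமிழ்", "தமிழர்", "நாடு"]
bloom = BloomFilter()
for w in dictionary:
    bloom.add(w)

typo = "தமழ்"
if typo not in bloom:
    print(suggest(typo, dictionary))
```

The Bloom filter only answers "probably present" or "definitely absent", so the suggestion step still needs the real word list. The distance here is counted per code point, so a vowel sign counts as one edit; counting per Tamil letter would need a different tokenisation.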
This may be merged into the open-tamil repo.
I will test it tomorrow with the newly collected unique word list.
I am documenting the progress of building an open source spellchecker for Tamil daily, and sharing the notes on Tamil-related tech mailing lists and on social media such as Twitter and Mastodon.
Happy to see the responses from many friends and well-wishers. Thanks for the ideas and support you all provide.
Feeling like I am working on this open source project with 10 brains and twenty hands. 🙂
Sharing some of the good comments here.
“One technique I use is to collect the words with their frequency usage in an archive.
After that sort them by frequency in descending order.
The words at the bottom are likely to be typos.”
— Muthu Nedumaran
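The frequency technique in this comment can be sketched as below. It assumes the corpus has already been split into word tokens; the token list and names are invented for the example.

```python
from collections import Counter


def words_by_frequency(tokens):
    """Return (word, count) pairs sorted by count, most frequent first."""
    return Counter(tokens).most_common()


# Made-up token stream: one correct word used often, plus a likely typo.
correct = "தமிழ்"
typo = "தமழ்"
tokens = [correct, "நாடு", correct, typo, "நாடு", correct]

ranked = words_by_frequency(tokens)
for word, count in ranked:
    print(count, word)

# Words at the bottom of the ranking are candidates for manual review.
rare = [word for word, count in ranked if count == 1]
```

A low count only marks a word as a candidate. Rare but valid words, such as names or technical terms, will land at the bottom too.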
Great start Shrini! Here is my Tamil text corpus extracted from Wikipedia and Wikisource:
and the break-up of where the 5.9 million words came from:
For getting unique words, we can use the sort CLI tool, sir.
“sort -u” will not only sort, but also emit a unique list.
And there are dictionaries like the Tamil Lexicon.
We scraped it and kept it as Babylon files here:
It may be useful in some way.
For Malayalam, we are using a morphology analyser, and we wrote a successful implementation of a spellchecker on top of it. The morphology analyser is capable of handling word generation rules as per the language grammar. However, developing such a system requires a huge amount of work and linguistic understanding. But I would say it is worth the effort, considering you can build many language processing systems on top of it.
I have written about this work in multiple posts since 2017. For a short version, here is a paper: https://www.aclweb.org/anthology/W19-6801/
— Santhosh Thottingal
Morphological analysers and adding more rules are in our pipeline.
Will build a minimal spellchecker with word lookup table.
Once it is ready, we can improve it using rules.
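A minimal lookup-table spellchecker could look like the sketch below. The function names are hypothetical, and the in-memory list stands in for a file such as unique_sorted_words_master.txt with one word per line.

```python
import re

# Assumption: words are runs of characters from the Tamil Unicode block.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def load_words(lines):
    """Build the lookup table (a set) from an iterable of one-word lines."""
    return {line.strip() for line in lines if line.strip()}


def misspelled(text, known_words):
    """Return the Tamil words in text that are missing from the lookup table."""
    return [w for w in TAMIL_WORD.findall(text) if w not in known_words]


# Stand-in for the master word file; a real run would pass an open file.
known = load_words(["தமிழ்\n", "நாடு\n", "\n"])
print(misspelled("தமிழ் நாடு தமழ்", known))
```

A set gives fast exact lookups, but it cannot recognise valid inflected forms that never appeared in the corpus. That gap is what the rules and morphological analysis would fill later.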
Tomorrow, I will work on collecting the words along with their usage frequency. As Muthu Nedumaran sir mentioned, low-frequency words are likely to be typos or errors. We can then pick the high-frequency words and make a master list of words for the spellchecker.
If you are interested in contributing to this project, email me at (firstname.lastname@example.org)
Read the previous days' notes on building the Tamil spellchecker.
Study notes on open-tamil spellchecker – day 1
Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words