Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words

 

Today, I downloaded the datasets listed below and cleaned each of them to extract its unique words.

Here are a few counts of original vs. unique words.

1. ta_dedup.txt – 5.1 GB – 6971837 words
unique_sorted_words_in_ta_dedup.txt – 19 M – 553976 words

2. tamil-language-corpus-for-nlp.zip
wiki.txt – 354M – 1444046 words
unique_sorted_words_in_wiki.txt – 8.7 M – 273953 words

3. Dinamalar_dataset_2009_2019.csv – 5.2 G – 2225214 words
unique_sorted_words_in_Dinamalar_dataset_2009_2019.csv – 5.7 MB – 170845 words

4. Tamilmurasu_dataset_06_Jan_2011_06_Jan_2020.csv – 495M – 138619 words
unique_sorted_words_in_Tamilmurasu_dataset_06_Jan_2011_06_Jan_2020.csv – 458K – 15663 words
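
As a rough illustration of the per-dataset cleanup (a minimal sketch, not the actual code from the repo), unique Tamil words can be pulled out of raw text by matching runs of characters from the Tamil Unicode block. The function name and the sample lines are my own assumptions for this example.

```python
import re

# The Tamil Unicode block is U+0B80..U+0BFF. A run of these characters is
# treated as one word. Note that this block also contains Tamil digits and
# symbols, so a real cleanup may want a stricter pattern.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(lines):
    """Return the sorted unique Tamil words found in an iterable of text lines."""
    words = set()
    for line in lines:
        # Punctuation, Latin text and ASCII digits are skipped by the pattern.
        words.update(TAMIL_WORD.findall(line))
    return sorted(words)


# Hypothetical sample input, only for demonstration.
sample = ["தமிழ் மொழி, தமிழ் நாடு.", "Hello தமிழ் 123"]
for word in unique_tamil_words(sample):
    print(word)
```

Since the function accepts any iterable of lines, an open file object can be passed in directly, so a multi-GB file is read line by line instead of being loaded fully into memory.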

I merged all the outputs and ran a unique/sort on them to create a master file.

Here are the counts for the master file.

filename : unique_sorted_words_master.txt
words count : 22,90,236
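
The merge step can be sketched in a few lines. This is only an assumed equivalent of the unique/sort described above, with a function name I made up for the example. It takes several word lists and returns one sorted list without duplicates.

```python
def merge_word_lists(*word_lists):
    """Merge several word lists into one sorted list without duplicates."""
    merged = set()
    for words in word_lists:
        # Strip newlines and surrounding spaces, and skip empty entries.
        merged.update(w.strip() for w in words if w.strip())
    return sorted(merged)


# Hypothetical in-memory lists standing in for the per-dataset files.
list_a = ["அம்மா\n", "தமிழ்\n"]
list_b = ["தமிழ்\n", "மொழி\n", "\n"]
master = merge_word_lists(list_a, list_b)
print(len(master), "unique words")
```

Python sorts strings by Unicode code point, which may differ from the locale-aware order that a command-line sort produces.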

Here is the repo with the cleanup code.
repo – https://github.com/KaniyamFoundation/all_tamil_words

Thanks to everyone who has collected data and published corpus datasets. Without them, we would have to scrape many websites, which is a very tough job. We still have to scrape many websites, but these existing datasets provide a great way to progress quickly.

This morning, Malaikannan released a working version of a Tamil spellchecker, which can query a given dataset using a bloom filter and provide suggestions using the Levenshtein distance algorithm.

He released it here – https://github.com/malaikannan/TamilSpellChecker
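
To show the two ideas together, here is a minimal sketch of a bloom filter lookup plus Levenshtein-based suggestions. This is not Malaikannan's code. The class, the function names, the filter size and the number of hashes are all my own assumptions, chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """A tiny bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive several bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), counted per Unicode code point."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    candidates = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in candidates if d <= max_distance]


# Hypothetical tiny dictionary, only for demonstration.
dictionary = ["தமிழ்", "மொழி"]
bloom = BloomFilter()
for w in dictionary:
    bloom.add(w)

query = "தமழ்"
if query not in bloom:
    print(suggest(query, dictionary))
```

The bloom filter answers "is this word probably in the list?" quickly and with little memory, while the distance scan over the full word list is the slow part. Counting edits per code point is a simplification for Tamil, since one visible letter can span more than one code point.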

This may be merged into the open-tamil repo.

I will test it tomorrow with the newly collected unique word list.

I am documenting the progress of building an open source spellchecker for Tamil daily, and sharing the posts on Tamil-related tech mailing lists and on social media such as Twitter and Mastodon.

Happy to see the responses from many friends and well-wishers. Thanks for the ideas and support you all provide.

Feeling like I am working on this open source project with ten brains and twenty hands. 🙂

Here are some of the good comments.

“One technique I use is to collect the words with their frequency usage in an archive.
After that sort them by frequency in descending order.
The words at the bottom are likely to be typos.”

— Muthu Nedumaran

 

Great start Shrini! Here is my Tamil text corpus extracted from Wikipedia and Wikisource:
https://github.com/AshokR/TamilNLP/releases/tag/v0.51-alpha
and the break-up of where the 5.9 million words came from:
https://github.com/AshokR/TamilNLP/wiki/Wikipedia-Tamil-Text-Corpus

— AsokanR
https://goinggnu.wordpress.com/2020/05/25/building-tamil-spellchecker-day-4-shall-we-collect-all-tamil-words/#comment-48791

 

For getting unique words, we can use the sort CLI tool, sir.
“sort -u” will not only sort, but also emit a unique list.

And there are dictionaries like the Tamil Lexicon.
We scraped it and kept it in a Babylon file here:
https://github.com/indic-dict/stardict-tamil/blob/master/ta-head/tamil_lexicon/tamil_lexicon.babylon
It may be useful in some way.

— damodarareddy
https://goinggnu.wordpress.com/2020/05/25/building-tamil-spellchecker-day-4-shall-we-collect-all-tamil-words/#comment-48865

 

For Malayalam, we are using a morphology analyser and we wrote a successful implementation of a spellchecker on top of it. The morphology analyser is capable of handling word generation rules as per language grammar. However, developing such a system requires a huge amount of work and linguistic understanding. But I would say it is worth the effort, considering you can build many language processing systems on top of it.

I have been writing about this work in multiple posts since 2017. For a short version, here is a paper https://www.aclweb.org/anthology/W19-6801/

— Santhosh Thottingal

Morphological analysers and more rules are in our pipeline.
First, we will build a minimal spellchecker with a word lookup table.
Once it is ready, we can improve it using rules.
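
A lookup-table spellchecker of this minimal kind could look like the sketch below. It is only an assumption of how the first version might work: every Tamil word in the input is checked against a set of known words, and the ones that are missing get flagged. The function name and the sample data are hypothetical.

```python
import re

# Runs of characters from the Tamil Unicode block (U+0B80..U+0BFF).
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def find_unknown_words(text, known_words):
    """Return Tamil words in text that are not in known_words, in order of first appearance."""
    unknown = []
    seen = set()
    for word in TAMIL_WORD.findall(text):
        if word not in known_words and word not in seen:
            seen.add(word)
            unknown.append(word)
    return unknown


# Hypothetical lookup table; in practice it would be loaded from the master file.
known = {"தமிழ்", "மொழி"}
print(find_unknown_words("தமிழ் மொழ மொழி மொழ", known))
```

A Python set gives fast membership checks, so rules or a morphological analyser can later be added as a second pass over only the flagged words.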

Tomorrow, I will work on collecting the words along with their usage frequency. As Muthu Nedumaran sir mentioned, low-frequency words are likely to be typos or errors. We can then pick the high-frequency words and make a master word list for the spellchecker.
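
A minimal sketch of that frequency idea, assuming a simple count threshold, is shown here. The `min_count` parameter and the function names are my own placeholders, and the right cut-off would have to be found by inspecting the real data.

```python
import re
from collections import Counter

# Runs of characters from the Tamil Unicode block (U+0B80..U+0BFF).
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def word_frequencies(lines):
    """Count how often each Tamil word occurs in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAMIL_WORD.findall(line))
    return counts


def frequent_words(counts, min_count=2):
    """Return words seen at least min_count times, most frequent first."""
    return [word for word, count in counts.most_common() if count >= min_count]


# Hypothetical sample input, only for demonstration.
lines = ["தமிழ் மொழி தமிழ்", "தமிழ் மொழி மொழ"]
counts = word_frequencies(lines)
for word, count in counts.most_common():
    print(count, word)
```

Words below the threshold are not necessarily wrong, since rare but valid words will also land there, so that tail is better treated as a list to review than as a list to delete.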

 

If you are interested in contributing to this project, email me at (tshrinivasan@gmail.com)

Read the previous days' notes on building the Tamil spellchecker.

Study notes on open-tamil spellchecker – day 1
Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words

 
