Building Open Source Tamil Spellchecker – Day 7 – Scrapping websites to get more words

1 Day 7 – Scrapping websites to get more words

We had 24 lakh unique words in our collection.

This is still very low compared to the number of possible words in Tamil. There may be 30-50 crore possible words in Tamil.

Can we collect them all? Not sure. But I wanted to try collecting as many as possible.

Such a huge word collection can be used for various purposes.

2 Mirroring websites with httrack

I tried to write scraping programs to get data from various websites. Writing a custom program for each site took too much time, so I dropped the idea of writing code to scrape the websites.

The site I started with has a very weird structure. I should have started with easier sites.

I found a super command-line utility that can mirror a website to a folder on our computer. It is “httrack”.

httrack -p1 sitename

The above command clones the website, saving only the HTML files (-p1). As we don't need images and need only the HTML files, this command works fine to get all the content from a website.

It took 1 day and 16 hours to get 12 GB of files, from
It took 19 hours to get 3 GB of files, from

Once all the files are received, run the command below from inside the mirrored folder to merge all the HTML content into one single file.

find . -iname '*.html' -exec cat {} \; >> ../all.html

Then, process all.html using the script here

Using this, we can get a unique, sorted Tamil word list from the all.html file.
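The script referred to above is not reproduced in this post. As a rough illustration of the idea only, here is a minimal Python sketch. It assumes that a "Tamil word" is any run of characters from the Unicode Tamil block (U+0B80 to U+0BFF); the function names are made up for this example and are not from the original script.

```python
import re

# A run of characters from the Unicode Tamil block (U+0B80 to U+0BFF).
# Note: this block also contains Tamil digits and symbols.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(text):
    """Return the sorted list of distinct Tamil-script runs found in text."""
    return sorted(set(TAMIL_WORD.findall(text)))


def unique_tamil_words_from_file(path):
    """Read a large file line by line and collect its distinct Tamil words."""
    words = set()
    # errors="ignore" skips bytes that are not valid UTF-8 in the merged dump.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            words.update(TAMIL_WORD.findall(line))
    return sorted(words)
```

Since HTML tags and attributes are written in Latin characters, the pattern skips them without any HTML parsing. Calling unique_tamil_words_from_file("all.html") would give the word list for the merged file.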

3 Frequency of words

I added a feature to get the frequency of each word in a given dataset and list the words with their counts. Thanks to “Muthu Nedumaran” sir for suggesting this idea. The higher the frequency of a word, the more likely it is that the word is correct.

This python code gets the frequency of the words.
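The code itself did not survive in this copy of the post. The sketch below is not that code; it is a minimal illustration of counting word frequencies with collections.Counter, again assuming a word is a run of characters from the Unicode Tamil block. The name word_frequencies is hypothetical.

```python
import re
from collections import Counter

# A run of characters from the Unicode Tamil block (U+0B80 to U+0BFF).
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def word_frequencies(lines):
    """Count how often each Tamil word occurs in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAMIL_WORD.findall(line))
    return counts
```

An open file can be passed directly as lines, and counts.most_common() then lists the words from the most frequent to the least frequent.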

4 Current status

I ran the scripts to get the unique Tamil words and their frequencies in different datasets.

Now we have 25,83,001 words (about 25 lakh).

I may spend a few more days collecting more words. The more words we have, the less we need to depend on grammar rules.

5 Harinath builds scraper program for WordPress sites

In the meantime, I got a nice mail from Harinath.

This is Harinath from Jaya Engineering college now working in a IT company. Hope you are well. Thanks for the taking efforts in publishing the work on Tamil related programs.

I like to contribute back to the project by any means.

Skills I can contribute effectively. Python, scala, flask, play framework, sql, elasticsearch and benchmarking.

Other means also I can support by writing documentation and deploying application at heroku or at cloud systems.

– Harinath

He developed a WordPress site scraper. Using his Python code, we can get all the content of a WordPress-based website.

Here is his code –
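Harinath's code is not reproduced in this copy of the post, and the sketch below is not his program. It only illustrates one possible approach, assuming the target site exposes the standard WordPress REST API at /wp-json/wp/v2/posts. The function names and the injectable fetch parameter are made up for this example.

```python
import json
import re
import urllib.error
import urllib.request
from html import unescape


def fetch_json(url):
    """Download one URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def strip_tags(html_text):
    """Crude tag removal; enough when only the words matter."""
    return unescape(re.sub(r"<[^>]+>", " ", html_text))


def iter_post_texts(base_url, fetch=fetch_json, per_page=100):
    """Yield the text of each post, walking the post listing page by page."""
    root = base_url.rstrip("/")
    page = 1
    while True:
        url = f"{root}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
        try:
            posts = fetch(url)
        except urllib.error.HTTPError:
            # WordPress answers with an HTTP error once the page number
            # runs past the last page. Other HTTP errors also stop the loop.
            break
        if not posts:
            break
        for post in posts:
            yield strip_tags(post["content"]["rendered"])
        page += 1
```

The fetch parameter is there so that the paging logic can be tried with a fake function, without touching a real site. Sites that disable the REST API would need a different approach, such as the httrack mirroring described earlier.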


That is the spirit and wonder of the Free/Open Source world. We get helping hands from all around the world, and it feels like having more brains and hands.

6 Few more data sources

We got more data too. Here are the inputs from “The Neechalkaran”.

Greetings, excellent effort. It is welcome to see new technology coming to Tamil. If the word banks below are useful, you may use them.

Below is the Terminology corpus collected through our Nigandiyam Project

Few of my Corpus

– Neechalkaran

7 Got a server to run the scraper programs

David Rajamani provided a server to run the scraping programs.

Running the scraper on a local laptop causes issues such as the hard disk filling up quickly and restarts during power cuts. His server helps a lot to run the scripts without any interruptions.

8 Can we build a free open source for tamil?

Malaikannan is reading papers by Santhosh Thottingal and his team, who are exploring a Malayalam spellchecker.

We are discussing a robust tool like for Tamil. He is exploring more in this area.


Here is a rough workflow.

He found this morphological analyzer for Tamil.

This seems to work as a stemmer –

But, the source code is not released.

The stemmer in open-tamil works better than this.

9 awesome-tamil

Muthu is collecting a big list of awesome open source Tamil NLP tools. Read the list here –

If you know of any such tool, share the details so it can be added to the list.

10 What next? – Find Grammar rules

Though we have collected many words, they are not 100% correct. Around 30-40% of them may be wrong. We have to clean the words and build a perfect master list to look words up in.

We have to find grammar rules that can check whether any given word is a valid Tamil word or not.

If you know any such rules, share them with us.

Today, I had a discussion with Pichaimuthu, the founder of the open source Tamil dictionary for Windows and Android. Check his works here –


He found some good rules in Nannool, a very old grammar book.

On this page, he wrote about the list of letters that a Tamil word can start with and end with.

On this page, he discusses the middle letters in any word.

I will explore these rules and build a parser to filter out erroneous words from our word collection.
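The parser is not written yet, so the sketch below only shows how such a filter could be structured once the rules are known. It assumes that a Tamil letter is a base character plus any combining signs that follow it. The rule sets are passed in by the caller; the values in the usage note are placeholders for illustration, not the actual rules from the grammar book. All names here are hypothetical.

```python
import unicodedata


def split_letters(word):
    """Split a word into letters: a base character plus its combining signs."""
    letters = []
    for ch in word:
        # Tamil vowel signs and the pulli are combining marks (Mn or Mc).
        if letters and unicodedata.category(ch) in ("Mn", "Mc"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def passes_rules(word, forbidden_first=frozenset(), forbidden_last=frozenset()):
    """Check the first and last letter of a word against caller-supplied sets."""
    letters = split_letters(word)
    if not letters:
        return False
    return letters[0] not in forbidden_first and letters[-1] not in forbidden_last


def filter_words(words, forbidden_first=frozenset(), forbidden_last=frozenset()):
    """Keep only the words that pass the start and end letter checks."""
    return [w for w in words if passes_rules(w, forbidden_first, forbidden_last)]
```

For example, split_letters("தமிழ்") gives the three letters த, மி and ழ். With a placeholder set such as forbidden_first={"க்"}, a word beginning with க் would be dropped from the list. Rules about the middle letters would need an extra check over neighbouring letters.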

11 Where are we discussing, getting feedback and helping hands?

Many friends wonder where the magic is happening. The steady feedback and the great minds ready to help are the real magic of the community.

Here are the mailing lists where I share the progress and get feedback.

  1. Indian Linux Users Group, Chennai – My homeland, school, Inspiration for all my Free software contributions. –
  2. Thamizha – The old original open source community for Tamil –!msg/freetamilcomputing
  3. Kanitamil Valarchi – A group from Tamil Virtual Academy –!forum/tva_kanitamil_valarchi
  4. Kanittamiz – A group from INFITT –!forum/kanittamiz
  5. Kaniyam Pangalippor – A group for writers –

There are a few more groups where I am not posting. But they are worth joining for Tamil NLP resources.

  1. Project Madurai –!forum/pmadurai
  2. Isaiyini – Pichaimuthu writes here on his Tamil NLP explorations –

A few people discuss Open Source Tamil NLP only on Twitter.

  1. Muthu of Ezhil Language –
  2. R. Asokan –
  3. Malaikannan –

There may be more people on Twitter. If you know people to add to the list, share the details with me.

With all the good hearts and helping hands in these groups, we can do wonders.

Tons of thanks to all the contributors.

If you are interested in contributing to this project, email me at (

Read the previous days' notes on building the Tamil spellchecker.

  1. Study notes on open-tamil spellchecker – day 1
  2. Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
  3. Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
  4. Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
  5. Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
  6. Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
  7. Building Open Source Tamil Spellchecker – Day 7 – Scrapping websites to get more words
