Building Open Source Tamil Spellchecker – Day 7 – Scraping websites to get more words

1 Day 7 – Scraping websites to get more words

We had 24 lakh unique words in our collection. – https://github.com/KaniyamFoundation/all_tamil_words

This is still a very small number compared to the possible words in Tamil. There may be 30-50 crore possible words in Tamil.

Can we collect them all? Not sure. But I wanted to try collecting as many as possible.

Such a huge word collection can be used for various purposes.

2 Mirroring websites with httrack

I tried writing scraping programs to get data from various websites. Writing a custom program for each site took too much time, so I dropped the idea of writing code to scrape the websites.

I started with the TamilVU.org site, which has a very unusual structure. I should have started with easier sites.

I found a great command-line utility that can mirror a website to a folder on our computer. It is “httrack”.

httrack -p1 sitename

The above command clones the website, saving only the HTML files (-p1). As we don't need images and need only the HTML files, this command works fine to get all the content from a website.

It took 1 day and 16 hours to get 12 GB of files, from http://www.sirukathaigal.com/
It took 19 hours to get 3 GB of files, from https://solvanam.com/

Once all the files are downloaded, run the command below from inside the mirrored folder to merge all the HTML content into a single file.

find . -iname '*.html' -exec cat {} \; >> ../all.html

Then, process the all.html using the cleanup.sh here

https://github.com/KaniyamFoundation/all_tamil_words

Using this, we get only the unique, sorted Tamil word list from the all.html file.
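For readers who prefer Python over shell, here is a minimal sketch of the same idea: walk the mirrored folder, strip the HTML markup, and keep only runs of Tamil characters. It is not the cleanup.sh script from the repository. The function names `tamil_words_in_html` and `unique_tamil_words` are made up for this illustration, and it assumes the pages are UTF-8 encoded and that a "word" is any run of characters from the Unicode Tamil block (which also includes Tamil digits and symbols).

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Assumption: a "word" is a run of characters from the Unicode Tamil block
# (U+0B80 to U+0BFF). This block also contains Tamil digits and symbols.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def tamil_words_in_html(html):
    """Return the set of Tamil words found in the visible text of one page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(TAMIL_WORD.findall(" ".join(parser.chunks)))


def unique_tamil_words(mirror_dir):
    """Return a sorted list of unique Tamil words from all *.html files."""
    words = set()
    # Note: rglob("*.html") is case-sensitive, unlike `find -iname`.
    for path in Path(mirror_dir).rglob("*.html"):
        html = path.read_text(encoding="utf-8", errors="ignore")
        words |= tamil_words_in_html(html)
    return sorted(words)
```

Calling `unique_tamil_words("path/to/mirror")` on a mirrored folder (the path here is a placeholder) returns the word list, which can then be written out one word per line. Processing file by file like this avoids building one multi-gigabyte all.html first.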

3 Frequency of words

I added a feature to get the frequency of each word in a given dataset and list them out. Thanks to “Muthu Nedumaran” sir for providing this idea. The higher the frequency of a word, the more likely it is that the word is correct.

https://github.com/KaniyamFoundation/all_tamil_words/blob/master/make_unique_words.py

This Python code gets the frequency of the words.
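The core of frequency counting can be sketched in a few lines. This is only an illustration of the idea and may differ from what make_unique_words.py actually does. The names `word_frequencies` and `format_frequencies` are made up here, and a "word" is again assumed to be a run of characters from the Unicode Tamil block.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of characters from the Unicode Tamil block.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def word_frequencies(text):
    """Count how many times each Tamil word occurs in the text."""
    return Counter(TAMIL_WORD.findall(text))


def format_frequencies(freq):
    """Return 'word<TAB>count' lines, most frequent word first."""
    return [f"{word}\t{count}" for word, count in freq.most_common()]
```

Words that appear only once or twice in a large dataset are the first candidates to review as possible spelling errors.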

4 Current status

I ran the scripts to get the unique Tamil words and their frequencies in different datasets.

Now we have 25,83,001 words. (25 Lakh)

I may spend a few more days collecting more words. The more words we have, the less often we will need to apply grammar rules.

5 Harinath builds a scraper program for WordPress sites

In the meantime, I got a nice mail from Harinath.

This is Harinath from Jaya Engineering college now working in a IT company. Hope you are well. Thanks for the taking efforts in publishing the work on Tamil related programs.

I like to contribute back to the project by any means.

Skills I can contribute effectively. Python, scala, flask, play framework, sql, elasticsearch and benchmarking.

Other means also I can support by writing documentation and deploying application at heroku or at cloud systems.

– Harinath

He developed a WordPress site scraper. Using his Python code, we can get all the content of a WordPress-based website.

Here is his code – https://github.com/hari-kris/web-scraping

🙂

That is the spirit and wonder of the Free/Open Source world. We get helping hands from all around the world, and it feels like having more brains and hands.

6 A few more data sources

We got more data too. Here are inputs from “The Neechalkaran”.

Greetings, this is an excellent effort. It is welcome that new technology is coming to Tamil. If the word banks below are useful, you may use them.

Below is the Terminology corpus collected through our Nigandiyam Project

https://drive.google.com/drive/folders/1_-Z525HAYvTe6ODvlkTFtL6IexKKW3YV

Few of my Corpus https://github.com/neechalkaran/Tamil-corpus

– Neechalkaran

7 Got a server to run the scraper programs

David Rajamani provided a server to run the scraping programs.

Running the scraper on a local laptop causes problems such as the hard disk filling up quickly and restarts after power cuts. His server helps a lot in running the scripts without interruption.

8 Can we build a free open source grammarly.com for Tamil?

Malaikannan is reading papers by Santhosh Thottingal, and his team is exploring the Malayalam spellchecker.

We are discussing a robust tool like grammarly.com for Tamil. He is exploring this area further.

https://twitter.com/malai_san/status/1267258883795386368

 

Here is a rough workflow.

 

https://pbs.twimg.com/media/EZY1KoKUYAAxODe?format=jpg&name=small

He found this morphological analyzer for Tamil.

https://sarves.github.io/thamizhi-morph/

This seems to work as a stemmer – http://parsers.projects.uom.lk/fst-ta/index.php

But, the source code is not released.

The stemmer in open-tamil works better than this. https://github.com/Ezhil-Language-Foundation/open-tamil/tree/master/tamilstemmer

9 awesome-tamil

Muthu is collecting a big list of awesome open source Tamil NLP tools. Read the list here – https://github.com/Ezhil-Language-Foundation/awesome-tamil

If you know of any such tool, share the details so it can be added to the list.

10 What next? – Find Grammar rules

Though we have collected many words, they are not 100% correct. Around 30-40% of them may be wrong. We have to clean the words and build a reliable master list to look words up in.

We have to find the grammar rules to check whether any given word is a valid Tamil word or not.

If you know of any such rules, share them with us.

Today, I had a discussion with Pichaimuthu, the founder of an open source Tamil dictionary for Windows and Android. Check his work here – https://thanithamizhakarathikalanjiyam.github.io

https://thanithamizhakarathikalanjiyam.github.io/android/

https://thanithamizhakarathikalanjiyam.github.io/ttak-web/

https://thanithamizhakarathikalanjiyam.github.io/kirantha_neekki/

 

He found some good rules in Nannool, a very old grammar book.

https://thanithamizhakarathikalanjiyam.github.io/windows_283/

On this page, he wrote about the list of letters that a Tamil word can start and end with.

https://thanithamizhakarathikalanjiyam.github.io/tamil_idaieluthukkal/

On this page, he discusses the letters that can appear in the middle of a word.

I will explore these rules and build a parser to filter out erroneous words from our word collection.
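As a starting point for such a filter, here is a rough sketch of how a start-letter and end-letter check could be wired up. It does not contain the actual rules: the allowed-letter sets are parameters that would have to be filled in from the pages linked above, and the function names `split_letters` and `passes_edge_rules` are made up for this illustration. The one assumption built in is that a Tamil "letter" is a base character plus any combining signs (vowel signs, pulli) that follow it.

```python
import unicodedata


def split_letters(word):
    """Group each base character with the combining signs that follow it.

    Assumption: any character whose Unicode category starts with "M"
    (vowel signs, pulli) belongs to the preceding base character.
    """
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def passes_edge_rules(word, allowed_first, allowed_last):
    """Check the first and last letter of a word against two allowed sets.

    allowed_first and allowed_last are supplied by the caller; this sketch
    does not know the real grammar rules.
    """
    letters = split_letters(word)
    if not letters:
        return False
    return letters[0] in allowed_first and letters[-1] in allowed_last
```

For example, `split_letters("தமிழ்")` groups the word into the three letters த, மி and ழ். With the real sets loaded, the word list could be filtered with a list comprehension that keeps only the words for which `passes_edge_rules` returns True. A check on middle letters could be added in the same way.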

11 Where are we discussing, getting feedback and helping hands?

Many friends wonder where the magic is happening. The continuous feedback and the great minds willing to help are the real magic of the community.

Here are the mailing lists where I share the progress and get feedback.

  1. Indian Linux Users Group, Chennai – My homeland, school, Inspiration for all my Free software contributions. – https://www.freelists.org/list/ilugc
  2. Thamizha – The old original open source community for Tamil – https://groups.google.com/forum/#!msg/freetamilcomputing
  3. Kanitamil Valarchi – A group from Tamil Virtual Academy – https://groups.google.com/forum/#!forum/tva_kanitamil_valarchi
  4. Kanittamiz – A group from INFITT – https://groups.google.com/forum/#!forum/kanittamiz
  5. Kaniyam Pangalippor – A group for kaniyam.com writers – http://madaladal.kaniyam.com/listinfo.cgi/pangalippor-kaniyam.com

There are a few more groups where I am not posting, but they are worth joining for Tamil NLP resources.

  1. Project Madurai – https://groups.google.com/forum/#!forum/pmadurai
  2. Isaiyini – Pichaimuthu writes here on his Tamil NLP explorations – https://groups.yahoo.com/neo/groups/isaiyini

A few people discuss open source Tamil NLP only on Twitter:

  1. Muthu of Ezhil Language – https://twitter.com/ezhillang
  2. R. Asokan – https://twitter.com/IyalMozhi
  3. Malaikannan – https://twitter.com/malai_san

There may be more people on Twitter. If you know people who should be added to the list, share the details with me.

With all the good hearts and helping hands in these groups, we can do wonders.

Tons of thanks to all the contributors.

If you are interested in contributing to this project, email me at (tshrinivasan@gmail.com)

Read the previous days' notes on building the Tamil spellchecker.

  1. Study notes on open-tamil spellchecker – day 1
  2. Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
  3. Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
  4. Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
  5. Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
  6. Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
  7. Building Open Source Tamil Spellchecker – Day 7 – Scraping websites to get more words
