Study notes on open-tamil spellchecker – day 1

Open-Tamil is a wonderful Python module built to process Tamil text. We can build awesome NLP tools for Tamil using this module.

We can get it from here – https://github.com/Ezhil-Language-Foundation/open-tamil

I have been dreaming of an open source Tamil spellchecker for around 10 years. It needs someone to explore and work on it continuously. We have waited for a long time. Finally, I decided to give time for it.

I am going to spend some time to build a spellchecker for Tamil.

As a first task, I explored the existing code in the repo.

https://github.com/Ezhil-Language-Foundation/open-tamil/tree/master/solthiruthi

The creator, Muthu, has done a lot of magic in this code. He has provided all the platforms, basic classes and functions needed to build a spell checker.

We can read the code just like reading a novel or an account balance sheet. It is so neat and elegant.

I explored the code and took notes in my Emacs org-mode files. Sharing them here for quick reference.

1 solthiruthi

1.1 dataparser.py

This file parses the data files in data folder. In each file, the first line starts with “>> category”

Usage: python dataparser.py <filename1> … <filenamen>

This command shows the categories of words and their frequencies in the given document(s).
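To make the file format concrete, here is a minimal sketch of how such files could be counted. This is not the code from dataparser.py. The function name count_categories and the assumption that every line after the ">> category" header holds one word are mine, for illustration only.

```python
from collections import Counter

def count_categories(lines_per_file):
    """Count words per category across several data files.

    Each element of lines_per_file is the list of lines of one file.
    Assumption: the first line looks like '>> category' and every
    following non-empty line holds one word.
    """
    counts = Counter()
    for lines in lines_per_file:
        if not lines or not lines[0].startswith(">>"):
            continue  # skip files without a category header
        category = lines[0][2:].strip()
        words = [line.strip() for line in lines[1:] if line.strip()]
        counts[category] += len(words)
    return counts

sample = [">> fruits", "மாம்பழம்", "வாழைப்பழம்", ""]
print(count_categories([sample]))
```

Passing lists of lines instead of file names keeps the sketch easy to try without touching the real data folder.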

1.2 datastore.py

This builds a Trie data structure from any given data file.

A trie is a tree-like data structure whose nodes store the letters of an alphabet. By structuring the nodes in a particular way, words and strings can be retrieved from the structure by traversing down a branch path of the tree.
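The idea can be sketched in a few lines of Python. This is a generic, minimal trie for illustration, not the implementation in datastore.py. The class and method names (TrieNode, Trie, add, contains) are my own.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Walk down the tree, creating nodes for missing letters."""
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Follow the branch for each letter; fail if a letter is missing."""
        node = self.root
        for letter in word:
            node = node.children.get(letter)
            if node is None:
                return False
        return node.is_word

trie = Trie()
for w in ["car", "cart", "cat"]:
    trie.add(w)
print(trie.contains("cart"), trie.contains("ca"))  # True False
```

Note that "ca" is only a prefix here, so the lookup returns False even though the branch exists.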

This file loads a sample English word list and sample Tamil text files as tries. It has all the functions needed to process a Trie data structure.

Learn more about Trie here:

https://medium.com/basecs/trying-to-understand-tries-3ec6bede0014

https://www.geeksforgeeks.org/trie-insert-and-search/

1.3 resources.py

Reads the files in the ‘data’ directory. Prepares the datacategories and datadictionary references.

1.4 dictionary.py

Loads the datadictionary files from the “data” directory into memory. Note: dictionary files are word collection files from various sources like Wiktionary, Project Madurai, TamilVU etc.

1.5 dom.py

Classes are defined here to load a text file as a Trie queue.

1.6 WordSpeller.py

Two functions are defined here to process a word and report whether it is correct, along with alternate words.

1.7 Ezhimai.py

Loads content from the TamilVU dictionary file. Checks the given words against that dictionary. Returns an object with the processed word result.

I got errors while importing, and fixed them as below:

    #from . import WordSpeller
    #from . import resources
    import WordSpeller
    import resources

1.8 heuristics.py

Provides classes/functions that mark a word as false (invalid) when it breaks one of the grammar rules below.

  • Do not allow adjacent vowels in the word.
  • Do not allow adjacent consonants in the word.
  • Do not allow more than one repetition of a letter in the word.
  • Do not allow vowels with kombu, thunaikaal etc. in the word.
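As an illustration of the first rule, here is a minimal sketch of an adjacent-vowel check. It is not the class from heuristics.py. The names TAMIL_VOWELS and has_adjacent_vowels are mine, and the check relies on the fact that standalone Tamil vowels are single Unicode characters.

```python
# The twelve standalone Tamil vowel letters (uyir ezhuthu).
TAMIL_VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")

def has_adjacent_vowels(word):
    """Return True if two standalone vowel letters sit next to each other."""
    return any(a in TAMIL_VOWELS and b in TAMIL_VOWELS
               for a, b in zip(word, word[1:]))

print(has_adjacent_vowels("அஆ"))     # True
print(has_adjacent_vowels("அம்மா"))  # False
```

The other rules need the word split into full Tamil letters (consonant plus vowel sign) first, so they are left out of this sketch.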

1.9 morphology.py

Removes predefined prefixes and suffixes from the given words. Not a true stemmer. But it removes the prefix/suffix and gives a base word for further processing.
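A rough sketch of this kind of suffix stripping is shown here. The suffix list and the function name strip_suffix are hypothetical, chosen only for illustration; they are not taken from morphology.py.

```python
# Hypothetical suffix list for illustration; not the list used in morphology.py.
SUFFIXES = ["கள்", "களை"]

def strip_suffix(word, suffixes=SUFFIXES):
    """Remove the longest matching suffix, keeping at least one character."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

print(strip_suffix("நாய்கள்"))  # நாய்
```

Since the suffixes are plain data, moving them to a separate file as suggested in the Todo below would only change where the list is loaded from.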

Todo: More prefix/suffix can be added to the list. Move the prefix/suffix to a separate file for easy adding.

1.10 solthiruthi.py

Builds a command line interface to give various options for spell checking. -files, -dialects, -Dictionary, -nalt, -debug, -stdin, -auto, -help are the options provided.

1.11 suggestions.py

Defines a word suggestion method, the Norvig suggester, using Norvig's algorithm.
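For reference, the core of Norvig's well-known spelling corrector is to generate every string one edit away from the input and keep those found in the dictionary. The sketch below shows that idea with English letters. It is not the code in suggestions.py, and the function names edits1 and suggest are my own.

```python
def edits1(word, alphabet):
    """All strings one edit away: delete, transpose, replace, insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, dictionary, alphabet):
    """Return the word itself if known, else known words one edit away.

    dictionary must be a set of known words.
    """
    if word in dictionary:
        return [word]
    return sorted(edits1(word, alphabet) & dictionary)

words = {"the", "then", "ten"}
print(suggest("teh", words, "abcdefghijklmnopqrstuvwxyz"))  # ['ten', 'the']
```

This sketch edits one Python character at a time. For Tamil, the word would first have to be split into full letters (I believe tamil.utf8.get_letters in open-tamil does this, but treat that as an assumption) so that a consonant and its vowel sign are edited together.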

1.12 vinaisorkal.py

Finds irregular verbs and doublets. It uses the classifications defined here – Ref: Dr. V.S. Rajam, http://letsgrammar.org/verbsWithClass.html

TODO: The above link is not working. Get the content from the Internet Archive's Wayback Machine and document it within the open-tamil repo.

1.13 data folder

This folder has a lot of files, each containing single words for one category like countries or fruits. A few files have English and Tamil words, like a dictionary. There are a few tgz files with huge content from Project Madurai, Wiktionary etc.

There are two types of files here.

  1. Random word collections from TamilVU, Tamil Wiktionary and Project Madurai. These are called data dictionaries.
  2. Category based word collection files. For each category, there is a file, like fruits.txt or countries.txt. These are called data categories.

1.14 Todo

1.14.1 Add requirements.txt with version number for external modules like django
1.14.2 Check and add developer documentation/API documentation, i.e. auto-generated from inline comments.

I am trying to run the web version of the spellchecker, but I am getting some Django import errors.

I will fix them and share further learnings tomorrow.

 

With all this code and a good collection of words, I am hoping that we can build a super Tamil spell checker soon.

If you are interested in joining this game, mail me (tshrinivasan@gmail.com) or Muthu (ezhillang@gmail.com).

Tons of thanks to Muthu and other open-tamil contributors.

Let us build a better world for ourselves, in the open source way.
