Requirements for creating a Free Corpus for Tamil Language

Need for a Free Corpus for Tamil:

A corpus is a collection of words, tagged or annotated with their grammatical components.

A text corpus can be used for linguistic research and related projects.

The government and many universities run projects to create
a text corpus for the Tamil language. Unfortunately, these corpora are not open for public use.

We need a system for building a Tamil corpus that is public, so that anyone can use the corpus data.

Collecting words, assigning relevant tags, and editing those tags are painful jobs to do manually and individually.

This web application will let users log in, select a word, tag it properly, and move on to the next word.

The admin can load the words and do all the administrative tasks.

The following are the requirements for this web application.

1. User Login

Users can register themselves.
They can log in via OpenID, Google, Facebook, Twitter, etc.

A "forgot password" facility should be available.

2. User Profile

Options to change the password, add or change a photo, and edit other details such as email, blog, etc.

3. News Feed of all activity, such as users joining, tagging actions, comments, etc.

4. Tag Words

The system should show a word.
The user should tick the appropriate checkboxes to select the relevant "Part of Speech" for that word.

The grammatical categories for Tamil are listed here:
http://www.tamilvu.org/coresite/html/cwannotate.html

We can use the same "Part of Speech" categories, or add more if required.

We also need an option to mark a word as a "Root Word".
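
As a rough illustration of what one tagged word could look like in code, here is a minimal sketch. The class name `TaggedWord`, its fields, and the short English tag list are all assumptions made for this example; the real categories would come from the list linked above or from whatever the admin defines.

```python
from dataclasses import dataclass, field

# Hypothetical tag names, used only for this sketch. The real list
# would be the grammatical categories defined for the language.
PARTS_OF_SPEECH = {"noun", "verb", "adjective", "adverb"}


@dataclass
class TaggedWord:
    """One word, the part-of-speech tags users gave it, and a root-word flag."""
    text: str
    tags: set = field(default_factory=set)
    is_root: bool = False

    def add_tag(self, tag):
        # Reject tags that are not in the configured category list.
        if tag not in PARTS_OF_SPEECH:
            raise ValueError(f"unknown part of speech: {tag}")
        self.tags.add(tag)

    def remove_tag(self, tag):
        # discard() does nothing if the tag was never set.
        self.tags.discard(tag)
```

A word can carry more than one tag here, which matches the checkbox idea: ticking a box calls `add_tag`, unticking it calls `remove_tag`.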

5. View tags

When a user clicks a word or enters a word in the search box, the tag information for that word should be displayed.

6. Edit tags

While viewing the tags, users should be able to edit them, adding or removing tags as they see fit.

7. Downloads

Both logged-in users and the public should be able to download the tagged data as a text file or CSV file.

Specific downloads, such as

list of verbs
list of nouns
list of root words

etc., should be available.
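
The download options above could be sketched roughly like this, using only the Python standard library. The function name `export_csv`, the column names, and the `(text, tags, is_root)` tuple layout are assumptions for illustration, not a decided format.

```python
import csv
import io


def export_csv(words, only_tag=None, only_roots=False):
    """Return CSV text for `words`, an iterable of (text, tags, is_root) tuples.

    only_tag   -- if given, keep only words carrying that tag (e.g. "verb")
    only_roots -- if True, keep only words marked as root words
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "tags", "is_root"])
    for text, tags, is_root in words:
        if only_tag is not None and only_tag not in tags:
            continue
        if only_roots and not is_root:
            continue
        # Several tags on one word are joined with "|" inside a single column.
        writer.writerow([text, "|".join(sorted(tags)), "yes" if is_root else "no"])
    return buffer.getvalue()
```

With this shape, "list of verbs" is `export_csv(words, only_tag="verb")` and "list of root words" is `export_csv(words, only_roots=True)`; the full download is the call with no filters.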

8. Statistics

The following statistics should be available:

Total words in the system
Tagged words
Count of each category
List/count of contributors
Top contributors
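
A minimal sketch of how these numbers could be computed, assuming the words are available as `(text, tags)` pairs and every tagging action is logged with a username. The function name, the input layout, and the choice of a top three are all assumptions for this example.

```python
from collections import Counter


def corpus_statistics(words, contributions, top_n=3):
    """Compute the statistics listed above.

    words         -- list of (text, tags) pairs; tags is a set of category names
    contributions -- list of usernames, one entry per tagging action
    top_n         -- how many top contributors to report (assumed default: 3)
    """
    tagged = [text for text, tags in words if tags]
    per_category = Counter(tag for _, tags in words for tag in tags)
    per_user = Counter(contributions)
    return {
        "total_words": len(words),
        "tagged_words": len(tagged),
        "category_counts": dict(per_category),
        "contributor_count": len(per_user),
        "top_contributors": per_user.most_common(top_n),
    }
```

In a real database-backed application these would more likely be aggregate queries, but the counting logic is the same.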

9. My contributions

Option to list the user's own contributions along with their statistics.

Option to share this on social media.

10. Comments

Users can add comments to any word page, with their reviews on the words/tags.

Admin Panel:

Administrator can do the following:

1. Define Language

2. Define Parts of Speech for the defined language.

3. Upload a text file.

Split the file into words
Select unique words
Remove non-linguistic characters
Compare with existing words

If not in the db, add the new words to the system
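
The upload steps above can be sketched in a few lines of Python 3. This is only an illustration under one big assumption: that "linguistic characters" means code points in the Unicode Tamil block (U+0B80 to U+0BFF). The function name `extract_new_words` and the use of a plain set for the existing words are also made up for the example.

```python
import re
import unicodedata

# Assumption: a "word" is a run of code points from the Unicode Tamil block.
# This range also covers Tamil digits and symbols, which a real system
# may want to exclude.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def extract_new_words(text, existing_words):
    """Return the unique Tamil words in `text` that are not in `existing_words`.

    `existing_words` is assumed to be a set of NFC-normalized strings.
    Order of first appearance is preserved.
    """
    # Normalize so that the same word typed in different ways compares equal.
    text = unicodedata.normalize("NFC", text)
    seen = set()
    new_words = []
    # findall() splits the text into words and drops everything else
    # (punctuation, Latin letters, ASCII digits) in one step.
    for word in TAMIL_WORD.findall(text):
        if word not in seen and word not in existing_words:
            seen.add(word)
            new_words.append(word)
    return new_words
```

The returned list is what would be added to the database; in the web application the `existing_words` lookup would be a database query rather than an in-memory set.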

4. Manage Users

Add/edit/delete/search/view the users

5. The entire application should be translatable to any language (i18n).

6. Manage comments

create/edit/delete/view/search comments

7. Manage words

create/edit/delete/view/search words

The entire web application can be built with Python 3 / Django 1.5, as they have extensive Unicode support,

or we can discuss with the community to select an appropriate language and toolset.

Please add your comments here if you think this system needs any other requirements.

This project is Free Software released under the GPL v3.

A separate mailing list will be created soon to discuss the development.
We can host the code on GitHub, Launchpad, or SourceForge.

Mail tshrinivasan AT gmail DOT com if you are interested in contributing.
