Need for a Free Corpus for Tamil:
A corpus is a collection of words, tagged or annotated with their grammatical components.
A text corpus can be used for linguistic research and for language-related projects.
The government and many universities run projects to create
a text corpus for the Tamil language. Unfortunately, these are not open for public use.
We have to build a system to create a public corpus for Tamil,
so that anyone can use the corpus data.
Collecting words, assigning relevant tags and editing the tags is a painful job to do manually and individually.
This web application will enable users to log in, select a word, tag it properly and move to the next word.
The admin can load the words and do all the administrative tasks.
The following are the requirements for this web application.
1. User Login
Users can register themselves.
They can log in via OpenID, Google, Facebook, Twitter, etc.
A "forgot password" facility should be available.
2. Users Profile
Options to change the password, add or change a photo, and edit other details such as email, blog, etc.
3. News Feed for all activities, such as users joining, tagging actions, comments, etc.
4. Tag Words
The system should show a word.
The user should tick the appropriate checkboxes to select the relevant "Part of Speech" for that word.
The grammatical categories for Tamil are listed here.
We can use the same "Part of Speech" categories, or add more if required.
We have to add an option to mark a word as a "Root Word".
5. View tags
When a user clicks a word or enters a word in the search box, the tagged information for that word should be displayed.
6. Edit tags
While viewing the tags, users should be able to edit them if they feel tags need to be added or removed.
7. Download
Both logged-in users and the public should be able to download the tagged data as a text file or CSV file.
Specific downloads such as
list of verbs
list of nouns
list of root words
etc. should be available.
8. Statistics
The following statistics should be available:
Total words in the system
Count of each category
List/count of contributors
9. My contributions
Option to list the user's own contributions along with their statistics.
Option to share this on social media.
10. Comments
Users can add comments to any word page, with their reviews of the words/tags.
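The tagging, download and statistics requirements above could be sketched in plain Python roughly as follows. This is only an illustration under assumptions, not the project's implementation: the TaggedWord class, its field names, the tag names and the helper functions export_csv and statistics are all hypothetical, and in the real application they would presumably be database models and views.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TaggedWord:
    """One word with its tags (hypothetical structure, not from the spec)."""
    text: str
    tags: set = field(default_factory=set)  # part-of-speech tags, e.g. {"noun"}
    is_root: bool = False                    # the "Root Word" option
    contributor: str = ""                    # user who tagged the word


def export_csv(words, tag=None, roots_only=False):
    """Return the tagged data as CSV text, optionally filtered by tag or root flag."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "tags", "root_word"])
    for w in words:
        if tag is not None and tag not in w.tags:
            continue
        if roots_only and not w.is_root:
            continue
        writer.writerow([w.text, "|".join(sorted(w.tags)),
                         "yes" if w.is_root else "no"])
    return out.getvalue()


def statistics(words):
    """Total words, count per tag, and number of words tagged per contributor."""
    per_tag = Counter(t for w in words for t in w.tags)
    per_user = Counter(w.contributor for w in words if w.contributor)
    return {
        "total_words": len(words),
        "per_tag": dict(per_tag),
        "contributors": dict(per_user),
    }


# Example data (made up for this sketch).
words = [
    TaggedWord("மரம்", {"noun"}, is_root=True, contributor="user1"),
    TaggedWord("படி", {"verb", "noun"}, is_root=True, contributor="user2"),
    TaggedWord("படித்தான்", {"verb"}, contributor="user1"),
]
print(export_csv(words, tag="verb"))   # "list of verbs" download
print(statistics(words))
```

Calling export_csv(words, roots_only=True) would give the "list of root words" download in the same way.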
Administrator can do the following:
1. Define Language
2. Define Parts of Speech for the defined language.
3. Upload a text file.
Split the file into words
Select unique words
Remove non-linguistic characters
Compare with existing words
If not in the db, add the new words to the system
4. Manage Users
Add/edit/delete/search/view the users
5. The entire application should be translatable to any language (i18n).
6. Manage comments
7. Manage words
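The upload steps in item 3 above (split into words, select unique words, remove non-linguistic characters, compare with existing words) could look roughly like the sketch below. It rests on assumptions that are not part of the requirements: a "word" is taken to be a run of Tamil letters and vowel signs in the Unicode range U+0B82 to U+0BD7, the text is normalized to NFC so that the same word always compares equal, and the function name extract_new_words and the in-memory set of existing words are hypothetical stand-ins for the real database lookup.

```python
import re
import unicodedata

# Assumption: a word is a run of Tamil letters and vowel signs
# (U+0B82 to U+0BD7). Tamil digits and symbols, punctuation and
# non-Tamil text are treated as non-linguistic and dropped.
TAMIL_WORD = re.compile(r"[\u0B82-\u0BD7]+")


def extract_new_words(text, existing_words):
    """Return the unique Tamil words in text that are not already stored."""
    # Normalize so that composed and decomposed vowel signs compare equal.
    text = unicodedata.normalize("NFC", text)
    seen = set()
    new_words = []
    for word in TAMIL_WORD.findall(text):
        if word in seen or word in existing_words:
            continue
        seen.add(word)
        new_words.append(word)  # keeps first-seen order
    return new_words


# Example: the uploaded file is assumed to be decoded as UTF-8 already.
existing = {unicodedata.normalize("NFC", "தமிழ்")}
sample = "தமிழ் மொழி, தமிழ் இலக்கணம்! (Tamil grammar) 123 மொழி."
print(extract_new_words(sample, existing))
```

The words returned here are the ones that would be added to the system; words already in the database and repeated words are skipped.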
The entire web application can be built with Python 3 / Django 1.5, as they support Unicode extensively.
Alternatively, we can discuss with the community and select an appropriate language/toolset.
Please add your comments here if you need any other requirements for this system.
This project is Free Software released under the GPL v3.
A separate mailing list will be created soon, to discuss the development.
We can host the code on GitHub, Launchpad or SourceForge.
Mail to tshrinivasan AT gmail DOT com if you are interested to contribute.