Need for a Free Corpus for Tamil:
A corpus is a collection of words, tagged or annotated with their grammatical components.
A text corpus can be used for linguistic research and projects.
There are many projects run by the government and by universities to create
a text corpus for the Tamil language. Unfortunately, they are not open for public use.
We have to create a system to build a public corpus for Tamil,
so that anyone can use the corpus data.
Collecting words, assigning relevant tags, and editing the tags are painful to do manually and individually.
This web application will enable users to log in, select a word, tag it properly, and move to the next word.
The admin can load the words and do all the administrative tasks.
The following are the requirements for this web application.
1. User Login
Users can register themselves.
They can log in via OpenID, Google, Facebook, Twitter, etc.
A forgot-password facility should be available.
2. User Profile
Options to change the password, add or change a photo, and edit other details like email, blog, etc.
3. News Feed for all activities, such as users joining, tagging actions, comments, etc.
4. Tag Words
The system should show a word.
The user should select the appropriate checkboxes for the relevant "Part of Speech" of that word.
The grammatical categories for Tamil are listed here:
http://www.tamilvu.org/coresite/html/cwannotate.html
We can use the same "Part of Speech" list, or add more categories if required.
We have to add an option to mark a word as a "Root Word".
5. View tags
When a user clicks a word or enters a word in the search box, the tagged information for that word should be displayed.
6. Edit tags
While viewing the tags, users should be able to edit them, adding or removing tags as they see fit.
7. Downloads
Both logged-in users and the public should be able to download the tagged data as a text file or CSV file.
Specific downloads such as
list of verbs
list of nouns
list of root words
etc should be available.
8. Statistics
The following statistics should be available:
Total words in the system
Tagged words
Count of each category
List/count of contributors
Top contributors
9. My contributions
Options to list the user's own contributions along with their statistics.
Option to share this on social media.
10. Comments
Users can add comments to any word page, with their reviews on the words/tags.
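To make the Downloads and Statistics requirements above a little more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not the project's design: the function names, the (word, tags, is_root) tuple layout, the CSV column names and the tag labels such as "noun" and "verb" are all hypothetical.

```python
import csv
import io
from collections import Counter


def export_csv(entries, pos=None):
    """Return the tagged words as CSV text.

    entries: iterable of (word, tags, is_root) tuples (hypothetical layout).
    pos: if given, only words carrying that tag are exported,
         e.g. pos="verb" for the "list of verbs" download.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "tags", "is_root"])
    for word, tags, is_root in entries:
        if pos is None or pos in tags:
            # Several tags for one word are joined with "|" in a single column.
            writer.writerow([word, "|".join(sorted(tags)), int(is_root)])
    return buffer.getvalue()


def category_counts(entries):
    """Count how many words carry each tag ("Count of each category")."""
    counts = Counter()
    for _word, tags, _is_root in entries:
        counts.update(tags)
    return counts
```

For example, export_csv(entries, pos="verb") would produce the "list of verbs" download, and category_counts(entries) would feed the statistics page.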
Admin Panel:
Administrator can do the following:
1. Define Language
2. Define Parts of Speech for the defined language.
3. Upload a text file.
Split the file into words
Select unique words
Remove non linguistic characters
Compare with existing words
If not in the db, add the new words to the system
4. Manage Users
Add/edit/delete/search/view the users
5. The entire application should be translatable to any language. i18n
6. Manage comments
create/edit/delete/view/search comments
7. Manage words
create/edit/delete/view/search words
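The steps under "Upload a text file" (split into words, select unique words, remove non-linguistic characters, compare with existing words) could be sketched in Python roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, the existing words are passed in as a plain set instead of being read from a database, and the character range U+0B82 to U+0BD7 is assumed to cover Tamil letters and dependent signs while leaving out Tamil digits and symbols.

```python
import re
import unicodedata

# Assumption: Tamil letters and dependent signs lie in U+0B82..U+0BD7.
# Tamil digits and symbols (U+0BE6 onwards), Latin text, numbers and
# punctuation fall outside this range and act as word separators.
TAMIL_WORD = re.compile(r"[\u0B82-\u0BD7]+")


def extract_new_words(text, existing_words):
    """Return unique Tamil words from text that are not already known.

    existing_words stands in for the words already stored in the database.
    The order of first appearance is preserved.
    """
    # Normalize so that the same word typed in different ways compares equal.
    normalized = unicodedata.normalize("NFC", text)
    known = {unicodedata.normalize("NFC", w) for w in existing_words}

    new_words = []
    seen = set()
    for word in TAMIL_WORD.findall(normalized):
        if word not in seen and word not in known:
            seen.add(word)
            new_words.append(word)
    return new_words
```

The returned list is what the admin upload would then add to the system as untagged words.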
The entire web application can be created using Python 3 / Django 1.5, as they support Unicode extensively,
or we can discuss with the community to select an appropriate language/toolset.
Please add your comments here if you need any other requirements for this system.
This project is Free Software released under GPL v3.
A separate mailing list will be created soon, to discuss the development.
We can host the code on GitHub, Launchpad, or SourceForge.
Mail tshrinivasan AT gmail DOT com if you are interested in contributing.
http://aakkam.yavarkkum.org/projects/sorkandu
http://www.tamilvu.org/coresite/html/cwannotate.html
not working.
Sir, I am ready to contribute.
Thanks for your interest.
Watch here for the next announcements.
Greetings, friend. Please include me as well.
I think many of these already exist.
http://ta.wiktionary.org/
I found two open source Python/Django web applications for Corpus annotation:
Djangology: https://sourceforge.net/projects/djangology/
FLAT: https://github.com/proycon/flat
Perhaps we can start by looking into these.