Project Ideas – Part 2 – Looking for contributors


Here are a few more project ideas.

1. mobile/web app to record voice for Wikisource – Show a word, record it, upload to Commons, link back to Wiktionary.

2. mobile/web app to record audio books  – FreeTamilEbooks needs audio books too

3. WordPress to Android app converter – Why can't we convert a WordPress site into an Android app using its RSS feeds?

4. epub to apk converter – Let us publish ebooks as mobile apps too.

5. blog to epub converter – fix, add images
https://github.com/sathia27/blog2ebook
Add a feature to download images and add them to ebooks.

6. Daily mobi files for Tamil newspapers
Crawl newspapers daily, build mobi files, and email them to Kindle every day.

7. Send to Kindle – feature for FTE
Add a Send to Kindle feature to the FreeTamilEbooks.com site.

8. LimeSurvey – SaaS – alternative to Google Forms
Explore LimeSurvey and set it up as an alternative to Google Forms.

9. Collect politicians info and release as app, site

How can we collect details of all politicians, such as education and assets, and publish them for the public?

http://tshrinivasan.blogspot.in/2015/12/how-to-collect-details-of-TN-politicians.html

10. Set up ELK for Tamil literature search, build a search engine on top of it

Explore using Elasticsearch and Kibana for Tamil text analysis.

11. fix android app to record audio for wiktionary –

https://github.com/Atul22/wikiAudio
done at https://meta.wikimedia.org/wiki/WikiConference_India_2016/Chandigarh_Hackathon

12. Analyse Tamil TV/radio show audio, find how many English words are used per hour
This paper may help
https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
https://mail.python.org/pipermail/chennaipy/2017-March/001429.html
Contact Ganesh for a Python implementation of this algorithm

13. gui for voice record/upload – wiktionary

https://github.com/tshrinivasan/voice-recorder-for-tawictionary

This needs a GUI version for windows users

14. gui for csv uploader

https://github.com/tshrinivasan/tools-for-wiki/tree/master/csv-uploader-wiktionary

This needs a GUI for windows users

15. gui for open-tamil font converter

https://github.com/Ezhil-Language-Foundation/open-tamil

Need a web application or GUI for all features of open-tamil

16. mobile app to teach tamil – pollachi nasan

http://tshrinivasan.blogspot.in/2015/03/blog-post_9.html

17. wiki massuser create

Sometimes we need to create hundreds of user accounts on Wikipedia for a training session or event. Currently, only 6 accounts can be created. Admins can create multiple accounts, but only one by one. Automate this process using mechanize and BeautifulSoup.

18. OCR4wikisource web version using google vision api

Rewrite https://github.com/tshrinivasan/OCR4wikisource with google vision api and give a web interface.

19. create a command line TTS from the source of a mobile TTS app.

Here is an open source TTS mobile app for Tamil.

http://www.iitm.ac.in/donlab/tts/androidapp.php

Register and download the source and apk.
The voice named “Naveen” is good.

There are many C files in the folder
SSNFlitehtsTamil/app/src/main/jni

Can you compile those files and produce a binary that works as a command line tool?

Explore this code and share your thoughts on how to convert it into a
desktop/command line application so that we can use it on our
computers.

20. Create a GUI app for bulk photo uploader for http://commons.wikimedia.org

https://github.com/tshrinivasan/mediawiki-uploader

Project Idea – Automation script needed to download British Library books


The British Library has already digitized many Indian books (including Tamil, Bengali and other languages) and uploaded them to its website.[1]  The books are split into separate pages in .tiff format, so we need a script to automate transferring them to Internet Archive/Commons as a single pdf/djvu file, so that we can use them in Wikisource.

I got this request from my Wikipedia friend Bodhisattwa Mandal.
I checked a few Tamil books.
Example:
http://eap.bl.uk/database/overview_item.a4d?catId=164997;r=18467

The license listed for this file is “Access for research purposes only”.

But it seems that these books are very old and already in the public domain.
If so, we are free to download them and publish them anywhere.
Now we need a program, in Python or any other language, to download all the books and magazines from the site http://eap.bl.uk and provide them as individual PDF files or as a zip file of images.
Once we have the PDF or image files, we can run them through Google OCR and get the text out of them. Then we can publish both the images and the text to the Wikisource sites for further proofreading and fixing, using OCR4WikiSource.
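The "zip file of images" step could be sketched as below. This is only a minimal sketch in Python using the standard library. It assumes the page images of one book are already downloaded into a folder, and the file names (`page_1.tiff` and so on) and the helper names `page_sort_key` and `bundle_pages` are made up for illustration. Downloading from the site is not covered here.

```python
import re
import zipfile
from pathlib import Path


def page_sort_key(path):
    """Sort 'page_10.tiff' after 'page_9.tiff' by comparing digit runs as numbers."""
    parts = re.split(r"(\d+)", path.name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def bundle_pages(page_dir, zip_path, extensions=(".tif", ".tiff")):
    """Write every page image found in page_dir into one zip file, in page order.

    Returns the list of file names in the order they were added.
    """
    pages = sorted(
        (p for p in Path(page_dir).iterdir() if p.suffix.lower() in extensions),
        key=page_sort_key,
    )
    # ZIP_STORED: no compression, the archive is only a container for the pages.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [p.name for p in pages]
```

Calling `bundle_pages("downloads/book1", "book1.zip")` (hypothetical paths) would give one zip per book. Making a single PDF or djvu out of the pages needs an imaging library and is left out of this sketch.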
If you are interested in contributing to this project, reply with your details in a comment or send mail to tshrinivasan@gmail.com
Thanks.

Project Ideas – Part 1 – Looking for contributors



I am listing a few project ideas and requirements here. If you are interested in contributing to any open source project, consider these to start with.

I am giving an intro to each of them in this series of blog posts.

Add a comment here if you pick any of the projects, so that others can join you.

1. Clean up Epub files.

We create epub files for FreeTamilEbooks.com using Calibre. It creates epub files with a lot of extra span and other tags. We need to remove all the unwanted tags from those epub files.

Create a command line or web application to clean up the given epub files.

If you are writing in Python, plan to create a Calibre plugin to clean the epub files.
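To give an idea of the cleanup, here is a rough sketch in Python using only the standard library. It takes the text of one XHTML file from inside an epub and drops every span tag while keeping the text inside. Treating all spans as unwanted is an assumption made for the sketch; a real tool must decide which tags carry meaning. The names `SpanStripper` and `strip_spans` are made up for illustration.

```python
from html.parser import HTMLParser


class SpanStripper(HTMLParser):
    """Re-emit the markup as it is parsed, leaving out <span> start and end tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "span":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "span":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def strip_spans(xhtml):
    """Return the XHTML text with all span tags removed and their content kept."""
    parser = SpanStripper()
    parser.feed(xhtml)
    parser.close()
    return "".join(parser.out)
```

A full tool would open the epub with `zipfile`, run something like `strip_spans` on each XHTML file, and write a new epub.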

2.  Download reports for Tamil Wikisource Ebooks

http://ta.wikisource provides ebook downloads.

The ebook download logs for Wikisource in all languages are stored in this database:

http://tools.wmflabs.org/wsexport/logs.sqlite

Create a web application or command line application to get the details of Tamil books and create a download count report for each book.

Create a report similar to http://freetamilebooks.com/htmlbooks/download-report.html
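The report itself is mostly one GROUP BY query. Here is a small sketch with Python's `sqlite3` module. The table name `download_log` and the columns `lang` and `title` are assumptions made for this example; I have not checked them against the real file, so look at the actual schema of logs.sqlite first and adjust the query.

```python
import sqlite3


def download_counts(conn, lang="ta"):
    """Return (title, count) rows for one language, most downloaded first.

    Assumes a hypothetical table download_log(lang, title, ...) with one row
    per download. The real database may use different names.
    """
    query = """
        SELECT title, COUNT(*) AS total
        FROM download_log
        WHERE lang = ?
        GROUP BY title
        ORDER BY total DESC, title ASC
    """
    return conn.execute(query, (lang,)).fetchall()
```

With the real file, `conn = sqlite3.connect("logs.sqlite")` gives the connection, and the rows can be written out as an HTML table for the report page.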

 

3. Improve FreeTamilEbooks android app

The android app for FreeTamilEbooks has some bugs.
https://github.com/jskcse4/FreeTamilEBooks/issues

Use the App and read the issues.
Fix them.

 

4. OCR4WikiSource – Create a web application

OCR4WikiSource is a command line application that connects Google OCR and Wikisource.
It sends PDF files to Google Drive, runs OCR on them, gets the text, and sends it to Wikisource.

Create a web application to upload any PDF file, send it to Google via the Google Vision API, get the text, and send it to Wikisource.

Links:
Here is the requirement.
https://github.com/tshrinivasan/OCR4wikisource/issues/89

A few links about it.
https://goinggnu.wordpress.com/2015/12/28/announcing-ocr4wikisource/

https://goinggnu.wordpress.com/2015/09/30/automating-google-ocr-with-python/

https://meta.wikimedia.org/wiki/WikiConference_India_2016/Submissions/Introduction_to_OCR4WikiSource

Discussion with Wikipedia developers on this.
https://phabricator.wikimedia.org/T120788

Google Vision API
https://cloud.google.com/vision

Explore the links

https://github.com/GoogleCloudPlatform/cloud-vision

http://terrenceryan.com/blog/index.php/working-with-cloud-vision-api-from-php/

https://github.com/thangman22/google-cloud-vision-php

http://blog.aimanbaharum.com/2016/04/21/ocr-with-google-cloud-vision-api/

 

5. FlipBoard like application for Tamil

Flipboard is a web and mobile app that delivers the latest content on user-selected topics. Create a similar application that provides Tamil content from the web on various topics. Content contributors should submit links to good articles with relevant categories and tags. Users should be able to subscribe to categories and read the latest content.

 

6. Firefox plugin for tamil wikisource proofreading

 

Tamil Wikisource has around 2000 public domain ebooks, OCRed with Google OCR. We have to proofread those books manually.
QuickWikiEditor is a Firefox plugin that enables on-page editing of wiki content.
https://addons.mozilla.org/en-US/firefox/addon/quickwikieditor/

We need to extend this plugin to send the error words and the corrected words to a remote web application. From there, we can get the list of error words, search for them across ta.wikisource.org, and replace them with the corrected words automatically using bots.

Extend the plugin and create a web application to collect the words from the plugin.
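The server side of this idea could look roughly like the sketch below, in plain Python. It assumes the plugin sends (wrong word, corrected word) pairs. The function names and the `min_votes` rule (trust a correction only after it has been reported a few times) are my own assumptions, not part of QuickWikiEditor. Words are matched only when they stand alone between whitespace, so a word with punctuation attached is not touched.

```python
import re
from collections import Counter


def build_correction_map(reports, min_votes=2):
    """Turn reported (wrong, right) pairs into a {wrong: right} mapping.

    A pair is trusted only if it was reported at least min_votes times.
    If one wrong word has several corrections, the most reported one wins.
    """
    votes = Counter(reports)
    best = {}
    for (wrong, right), count in votes.most_common():
        if count >= min_votes and wrong not in best:
            best[wrong] = right
    return best


def apply_corrections(text, corrections):
    """Replace whole whitespace-separated words, keeping the spacing as it is."""
    # Splitting on whitespace avoids \b, which can fall inside a Tamil word
    # between a consonant and its vowel sign.
    tokens = re.split(r"(\s+)", text)
    return "".join(corrections.get(token, token) for token in tokens)
```

A bot could then fetch each page, run something like `apply_corrections` on the text, and save the page only if something changed.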

 

7. Fix the Tamil TTS by IITM

IIT Madras and SSN College released a text-to-speech application for Tamil as an Android application. You can get the source at
https://www.iitm.ac.in/donlab/tts/

It is a very early version, not as good as the latest web version available at http://speech.ssn.edu.in/

 

Still, we can learn from the initial version and extend it.

Explore the Android app, extract the C code from it, and create a command line app or web app with the C code as the backend.

 

8. Web application to add details about ebooks in an XML file on GitHub.
We release Tamil ebooks at FreeTamilEbooks.com

We store all the details about the books in an XML file.

This file is the source for the Android and iOS apps for FreeTamilEbooks.

Once an ebook is released, we have to update the XML file manually, which is tough for non-tech contributors.

We need a web application that collects the ebook details in a form, adds them to the XML file, and commits to the repo automatically.
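The "add those details to the XML file" part could be sketched like this with Python's `xml.etree.ElementTree`. The layout used here (a `books` root with one `book` element per ebook, and fields such as `title` and `author`) is only an assumption for the example; the real FreeTamilEbooks file may be organised differently. The web form and the commit to the repo are not covered.

```python
import xml.etree.ElementTree as ET


def add_book(xml_text, details):
    """Append one <book> entry built from a dict of field name -> value.

    Assumes a hypothetical layout: <books><book><title>..</title>..</book></books>.
    Returns the updated XML as a string.
    """
    root = ET.fromstring(xml_text)
    book = ET.SubElement(root, "book")
    for field, value in details.items():
        ET.SubElement(book, field).text = value
    return ET.tostring(root, encoding="unicode")
```

The web application would call something like `add_book` with the form values, write the result back to the file, and then commit it.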

 

9. Add ebooks automatically in GoodReads.com

We can add the details of the ebooks on FreeTamilEbooks.com to GoodReads.com

Currently we have to fill in a long form manually.
We need a command line or web application to simplify or automate adding info about the books on FreeTamilEbooks.com

10. Build a SaaS version of Planet-style RSS aggregation software.

 

Most tech communities need Planet-style RSS aggregation software. They have to buy a VPS, install the Planet software and add the RSS feeds.

It would be good if we built a SaaS version of Planet or similar software, so that they can simply sign in, add RSS feeds and start using it.
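The core of such an aggregator is merging items from many feeds into one list, newest first. Below is a minimal sketch in Python with only the standard library. It assumes plain RSS 2.0 feeds whose items have `title`, `link` and `pubDate`, and it works on feed text that has already been fetched. The function names are made up for illustration. Atom feeds, fetching, caching and user accounts are all left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(rss_xml, source):
    """Extract the items of one RSS 2.0 feed as a list of dicts."""
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.iter("item"):
        published = item.findtext("pubDate")
        if not published:
            continue  # items without a date cannot be placed in the timeline
        entries.append({
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Dates are expected to carry a timezone offset such as +0530.
            "published": parsedate_to_datetime(published),
        })
    return entries


def merge_feeds(feeds):
    """Merge {source name: RSS text} into one list, newest entry first."""
    merged = []
    for source, rss_xml in feeds.items():
        merged.extend(parse_feed(rss_xml, source))
    merged.sort(key=lambda entry: entry["published"], reverse=True)
    return merged
```

A SaaS version would keep one such feed list per signed-in community and render the merged list as a page.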

There are more ideas, written somewhere in my notebooks. I will collect them and share them soon.

All the projects should be released as Free/Open Source software only.

If you are interested in doing any of the things mentioned above, comment here.

Email me at tshrinivasan AT gmail DOT com to know more details on any of the projects.

INFITT 2014 – International Conference for Tamil Internet



INFITT is an international organization that connects Tamil scholars, government, IT professionals and the public.

Every year it conducts the “Tamil Internet Conference”, alternating between India and another country. This year, “Tamil Internet Conference 2014” was held in Pondicherry on Sep 19, 20 and 21, 2014.


This was my first time participating in an INFITT conference.

100 papers were presented by scholars from 9 countries.

It was a great place to meet most of the Tamil scholars.

Around 50 scholars came from Malaysia for this conference.

So happy to meet my Malaysian friends after a year.

I presented a paper on “Open-Tamil”, a Python library for processing Tamil text.

Here is the paper

https://docs.google.com/document/d/16PGCQxO-yx8h1JGqOo-YY7Sb2sz3D5YyV_PbaYPlwYU/edit?usp=sharing

Here is the presentation

http://www.slideshare.net/tshrinivasan/open-tamilpresentationta

Sibi from FSFTN gave a talk on “Introduction to OCR using Tesseract”.

My friends BalaVignesh and Arthi BalaVignesh are researching OCR using Tesseract.

They are building a web application for training Tesseract on Tamil text. They gave a talk on their research.

There were many talks on various topics such as font conversion, text to speech, mobile application development, spell checking and more.

ElanTamil from Malaysia explained their work on a Tamil spell checker using hunspell and a grammar checker using LanguageTool.

Most of the talks were purely academic, and there was not much demonstration of practical implementations.

A lot of research is happening in Tamil computing and linguistics. But the sad part is that no one is ready to share their work with the public.

Many universities run funded research on various topics, but they are not ready to share their work.

OCR, text to speech, annotated corpora, speech to text, spell checkers and grammar checkers are the most needed software. People have been asking for them for more than 10 years.

Many academicians did university-funded research in these areas and created some working products with the help of their research students. After they retire, they package their products and sell them.

As they see that not many people are interested in buying their products, they expect the government to buy their software and distribute it to the public for free.

I had a discussion with the participants, asking them to release their software as Free/Open Source Software.

But most of them are not ready for this. They have huge fears about it. They fear that if they open source their work, some big company will take it, sell it and make huge gains.

They really did a huge amount of research and created a few working software packages. If I had to create similar software, I would have to invest more than 10 years of research, which is impossible.

If they opened up their research results and their working software, many people could jump into the Tamil linguistics area and improve their software.

Many open source developers are ready to contribute to Tamil. But as we don’t know where to start, we stand still at the starting point itself.

The existing software sellers, the ex-professors, are not ready to share their work.

They keep saying, “I have spent 20 years of research on this. Why should I give it away for free? Why should I open source it? I have to earn back the revenue for my work.”

They all forget that they were paid for their research by universities, i.e. by the public. It is their duty to release their work to the public.

I agree that if a company invests huge money and creates some software for Tamil, it can sell it and expect a return on its investment. It can even sell closed source software. If the software is really useful and works perfectly, people will buy it for sure.

But these ex-professors built their products with their universities' funds. The universities should own this software and release it to the public as Free/Open Source Software. But the universities are not aware of this, and these professors sell their work.

This is a great loss for Tamil computing and Tamil people.

English and other languages have great software because most of the linguistic research by their universities is released as open source.

That's why English has so much software available.

I don't know how many decades it may take for universities to release their Tamil research work as open source.

Till then, let us leave these ex-professors worrying and wondering why their software is not selling.

I don't know what will happen to their hard work and software after their lifetime.

The happy news is that a few young open source enthusiasts have started working on Tamil software.

There is the open-tamil Python library for processing Tamil text. It can convert 25 types of Tamil encoding to Unicode. It has Tamil to IPA conversion, which is a base for text-to-speech conversion.

Tesseract is being used for Tamil OCR development. LibreOffice got a spell checker and a grammar checker.

I hope we can get more contributors for these projects. If they grow well, Tamil will get great open source software.

Apart from these thoughts,

Good stuff about this conference:

  • Met many good contributors to Tamil computing.
  • Many papers gave ideas for new open source Tamil software.
  • Co-ordination of the talks was good.
  • Food was nice.
  • The dinner treat given by the CM was awesome.

Things to improve:

  • Make the conference free for the audience, so that interested people around the city can participate. The current model allows only paid members to give and attend the talks.
  • When there are three tracks, place notice boards and banners to show the track, talk and time details.
  • Add a table of contents to the conference book.
  • Release the conference book under a Creative Commons license.
  • Do something more than a yearly conference.
  • To increase membership, explain the benefits of membership on the website.

I received a cash prize of Rs 5000 from Prof. C.R. Selvakumar, Waterloo University, Canada, for my work on Tamil computing, such as www.kaniyam.com and www.FreeTamilEbooks.com.

Thank you, sir, for the recognition. This reminds me that I have to do more and continue these projects. These projects are driven by great volunteers around the globe. I dedicate all the praise and the prize to all the volunteers.


The next conference will be in Singapore.

I hope we can create more open source software for Tamil to talk about at the next conference.