Project Idea – Automation script needed to download British Library books

The British Library has already digitized many Indian books (in Tamil, Bengali and other languages) and uploaded them to its website.[1] Each book is split into separate pages in .tiff format, so we need a script that automates transferring them to Internet Archive/Commons as a single PDF/DjVu file, so that we can use them in Wikisource.

I got this request from my Wikipedia friend Bodhisattwa Mandal.
I checked a few Tamil books.
Example:
http://eap.bl.uk/database/overview_item.a4d?catId=164997;r=18467

The license listed for this file is “Access for research purposes only”.

But it seems that these books are very old and already in the public domain.
We have all the permissions to download them and publish them anywhere.
Now, we need a program in Python or any other language to download all the books and magazines from the site http://eap.bl.uk and provide them as individual PDF files or as a zip file of images.
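
As a starting point, here is a minimal sketch of the "zip file of images" part. It assumes that the list of page image URLs for one book has already been collected by some other step; how the site exposes those URLs is not covered here. The names fetch_url, bundle_pages and page_NNNN.tiff are only illustrative choices, not something the British Library site defines.

```python
import urllib.request
import zipfile
from typing import Callable, Iterable


def fetch_url(url: str) -> bytes:
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def bundle_pages(
    page_urls: Iterable[str],
    zip_path,
    fetch: Callable[[str], bytes] = fetch_url,
) -> int:
    """Download each page image and store it in one zip file, in page order.

    page_urls is assumed to be the ordered list of page image URLs for a
    single book (hypothetical input, gathered elsewhere).
    zip_path may be a file name or a writable binary file object.
    Returns the number of pages written.
    """
    count = 0
    # ZIP_STORED: no extra compression is applied to the image data.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for number, url in enumerate(page_urls, start=1):
            # Zero-padded names keep the pages in reading order when sorted.
            archive.writestr(f"page_{number:04d}.tiff", fetch(url))
            count = number
    return count
```

Calling bundle_pages(urls, "book.zip") would write one zip per book. The fetch argument can be swapped for a function that adds retries or a delay between requests, so that the script does not overload the server.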
Once we get the PDF or image files, we can run OCR on them using Google OCR and get the text out of them. Then we can publish both the images and the text to Wikisource sites for further proofreading and fixing, using OCR4WikiSource.
If you are interested in contributing to this project, reply with your details in a comment or send mail to tshrinivasan@gmail.com
Thanks.