Exploring Tesseract for Tamil – 1

I am exploring Tesseract OCR for Tamil.

I am using this link
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

to train Tesseract OCR for Tamil.

I compiled Tesseract on my machine.

I am planning to automate training from a given TTF font.

The following are the commands.

I copied a page from http://ta.wikipedia.org and saved it as tamil-content.txt

text2image --text=tamil-content.txt --outputbase=tam.LohitTamil.exp0 --font="Lohit Tamil" --fonts_dir=/usr/share/fonts/truetype/ttf-indic-fonts-core

unicharset_extractor tam.LohitTamil.exp0.box

tesseract tam.LohitTamil.exp0.tif tam.LohitTamil.exp0 box.train
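Since the plan is to automate this for any TTF font, the three commands above could be wrapped in a small script. This is only a sketch: the variable names are my own, and it assumes the font name maps to the file name simply by dropping spaces. The commands run only if text2image is installed and the text file exists.

```shell
#!/bin/sh
# Sketch: build the lang.fontname.expN base name from a font name,
# then run the three training steps shown above.
lang="tam"
font_name="Lohit Tamil"
fonts_dir="/usr/share/fonts/truetype/ttf-indic-fonts-core"
text_file="tamil-content.txt"

# "Lohit Tamil" -> "LohitTamil" (assumption: dropping spaces is enough)
font_id=$(printf '%s' "$font_name" | tr -d ' ')
base="${lang}.${font_id}.exp0"
echo "output base: $base"

# Run the steps only when the tool and the input text are available.
if command -v text2image >/dev/null 2>&1 && [ -f "$text_file" ]; then
    text2image --text="$text_file" --outputbase="$base" \
        --font="$font_name" --fonts_dir="$fonts_dir" &&
    unicharset_extractor "$base.box" &&
    tesseract "$base.tif" "$base" box.train
fi
```

Changing font_name and fonts_dir should be enough to try another font, as long as the font is found in that directory.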

Next, I have to run the following command.

shapeclustering -F font_properties -U unicharset tam.LohitTamil.exp0.tr
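The -F option expects a font_properties file. As far as I understand the training wiki, each line holds the font name (matching the middle part of the file name, here LohitTamil) followed by five 0/1 flags for italic, bold, fixed, serif and fraktur. A minimal sketch, assuming all flags are 0 for Lohit Tamil:

```shell
#!/bin/sh
# Format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# The all-zero flags are an assumption for Lohit Tamil.
printf '%s\n' "LohitTamil 0 0 0 0 0" > font_properties
cat font_properties
```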

Unfortunately, the shapeclustering command fails on my computer.

It uses up to 6 GB of RAM, runs for a few minutes, and is then killed automatically.

I ran the same command on another laptop.
There, it failed with a core dump.
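One way to keep the machine responsive while investigating is to run the command with a memory cap, so it fails early instead of eating all the RAM. This is a sketch assuming a Linux shell where ulimit -v is supported. run_capped is a helper name I made up, and the 4 GiB value is arbitrary.

```shell
#!/bin/sh
# Run a command with a cap on virtual memory (in KiB).
# The subshell keeps the limit from sticking to the calling shell.
run_capped() {
    limit_kib=$1
    shift
    ( ulimit -v "$limit_kib" && "$@" )
}

# Hypothetical 4 GiB cap; runs only if the tool and the .tr file exist.
if command -v shapeclustering >/dev/null 2>&1 && [ -f tam.LohitTamil.exp0.tr ]; then
    run_capped 4194304 shapeclustering -F font_properties -U unicharset tam.LohitTamil.exp0.tr
    echo "shapeclustering exit status: $?"
fi
```

If the process is killed without the cap, the kernel log (dmesg) may show whether the out-of-memory killer was responsible.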

I have to check the internals of shapeclustering.
