Making of Kaniyam ScanBox – DIY Scanner

Scanning is new spinning. Carl Malamud

Human knowledge has been spread across the globe through books for ages. To preserve all this knowledge, we have to preserve these books for a long time.

Preserving physical books is itself an art. It involves huge cost, space, time and effort. Ask your local librarian to know how tough it is.

Digital preservation is the best way to give a longer life to those books. Physical books get damaged, torn or eaten by bookworms and are lost very soon, while digital copies can live forever with current technologies and reach more readers quickly.

Scanning the books neatly is the first step towards preserving them digitally.

Scanning needs a scanner, a computer, software tools to post-process the scans, storage, and a publishing portal to reach everyone.

Existing Scanners

Scanners are available in many forms and at all price points, from flatbed scanners to ultra-modern photocopiers.

On behalf of Kaniyam Foundation, we are exploring various ways to scan Tamil books in order to preserve them.

Most scanners require the book to be split apart, and the ones that do not are expensive. Splitting books is not desirable for us.

The existing scanners are costly.

SV600 – 47,000 INR

CZUR ET 16 Plus Smart Book Scanner – 56,000 INR

Instead of these, we planned to make a scanbox like https://www.kickstarter.com/projects/limemouse/scanbox-turn-your-smartphone-into-a-portable-scann

Long ago, I attended a workshop by Dr. Dhivaji on preserving palm leaves. He gave the idea of building a custom scanner.

Here is the writeup in Tamil on his DIY Scanner

https://techforelders.blogspot.com/2012/12/blog-post.html

I tried to set up a custom scanbox similar to that.

With modern smartphones, we can scan papers and books with good quality. Though they are not as effective as production-grade scanners or DSLR cameras, these phone cameras are good enough. We can read the text clearly, and OCR software like Tesseract can detect the text in the images. What else do we need?

What do we need to scan using a smartphone?

We need the items below to scan books using a smartphone:

  1. A good smartphone.
  2. Software. There are plenty of paid and free apps available. The paid version of CamScanner is good, but Adobe Scan is free. Microsoft Office Lens is OK. We are still searching for an open source scanning app.
  3. A good stand to hold the phone and avoid shakes. We cannot take many photos by hand without shaking, so we need a stand that keeps the phone still while taking photos. Think of it as a tripod, but with the camera looking down.
  4. A light source. A flood light spreads a good amount of light over the book, so that all the corners are well illuminated.
  5. A glass pad with a handle, to press big books flat. A 3/4 inch transparent glass with good weight helps to press the books when they are open.

With all these items, we can build a nice DIY scanner.

A few months ago, we built one ScanBox. Building it needs a carpenter and an electrician. I got Mr. Baskar and Mr. Raja, who are both inventive. Once I explained the whole idea, they got it quickly and made the box.

The required hardware is listed below.

  1. 3/4 inch plywood – 5 feet x 5 feet – 1
  2. 15 inch x 17 inch 10 mm Glass – 1
  3. Handle for glass – 1
  4. Light – PLK 20W – IP 65
  5. Electrical Switch – 1
  6. Wire connections

With all these we built the box.

Finding a good light source, a flood light, was the tough part. Most high-power lights show wave-like bands when viewed through a smartphone camera. After many trials in a lights shop, we found that the PLK 20W IP65 works well.

Make a hole in the top of the box, either at the center or offset a bit towards the open side. Make the hole square, so that the phone camera can see every corner.

Avoid Reflections

https://user-images.githubusercontent.com/1268536/64410290-57585280-d0a8-11e9-8366-678ec7ae8472.jpg

As we place a glass on top of the book, the light may be reflected. Reflections reduce the quality of the content beneath.

We first tried fitting the lights on the side walls, but they gave reflections on the glass. We then tried adding various sheets to dim the light, and still got reflections. Finally we found that fixing the light on the back wall gives the least reflection.

How to scan a book?

  1. Keep the book open inside the box.
  2. Place the glass on top of the book. Make sure it presses the book properly.
  3. Turn on the light.
  4. Open Adobe Scan on the smartphone and take the photo.
  5. Lift the glass, turn the page, and put the glass back.
  6. Take the next photo in Adobe Scan.
  7. Do the same for all the pages.
  8. Once all the pages are done, export as PDF.
  9. Do a QA by checking all the pages.
  10. If any pages are missing, scan them again.
  11. Edit the pages in Adobe Scan itself to remove the bad pages and replace them with good ones. This can be done on a computer too, using any PDF editor like “PDF Mod”.

QA

A quality audit is very important after each book scan. There may be missing pages, bad scans or blurred scans. Check all the pages patiently and rescan the ones that need it. Remember that once a book has gone back, it may take years to get it again.
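Part of this check can be automated. The sketch below assumes the original photos were saved with a running number in the file name, such as page-001.jpg. This is a hypothetical naming scheme, so adapt the pattern to whatever your scanning app produces. It prints the page numbers that are missing from the sequence. It cannot detect blurred scans, so the visual check is still needed.

```shell
#!/bin/sh
# find_missing_pages: read page numbers (one per line) on stdin and
# print every number missing between 1 and the highest number seen.
find_missing_pages() {
    awk '
        { n = $1 + 0; seen[n] = 1; if (n > max) max = n }
        END { for (i = 1; i <= max; i++) if (!(i in seen)) print i }
    '
}

# list_page_numbers: extract the number from names like page-001.jpg
# (hypothetical naming scheme) found in the directory given as $1.
list_page_numbers() {
    ls "$1" | sed -n 's/^page-0*\([0-9][0-9]*\)\.jpg$/\1/p'
}

# Usage: list_page_numbers ./scans | find_missing_pages
```

An empty output means no gap was found up to the last scanned page. Pages missing at the very end of the book will not show up, so compare the last number with the book's page count as well.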

Post Processing

A lot of post processing is required.

  1. Split the scanned pages vertically, so that each double-page photo becomes two pages.
  2. Deskew the pages. The pages may have moved slightly while scanning, and we have to make them straight. This is called deskewing.
  3. Adjust the color, brightness and contrast.

ScanTailor is a wonderful Free/Open Source Software which can do all of the above.

Here is a video on how to use ScanTailor

To convert all the TIFF output of ScanTailor to a single PDF for sharing:

ls *.tif | parallel convert {} {.}.pdf

pdfunite *.pdf bookname.pdf
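The pdfunite glob above relies on the shell's alphabetical order, so a file like page-10.pdf would be merged before page-2.pdf unless the numbers are zero-padded, and a previously built bookname.pdf would be picked up as an input. A minimal sketch that avoids both, assuming GNU sort (for the -V option) and file names without spaces. ordered_pdfs is a helper name introduced here for illustration only:

```shell
# ordered_pdfs: print the PDFs in directory $1 in natural (numeric) order,
# skipping the merged output file named in $2 if it already exists there.
ordered_pdfs() {
    dir=$1
    out=$2
    ls "$dir" | grep '\.pdf$' | grep -v -x -F "$out" | sort -V
}

# Merge the pages (run inside the directory holding the per-page PDFs):
# pdfunite $(ordered_pdfs . bookname.pdf) bookname.pdf
```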

Results

We tried to scan a whole book with this box.

Here is the result.

https://archive.org/details/thamodharam


Camera used : Android Phone Honor 9N
Camera Software : Adobe Scan (saved the original images in the gallery, then processed the images via ScanTailor)

OCR

After the book is scanned and processed well, we can archive it and share it with the public. We can also extract all the text using Tesseract OCR.

Install ScanTailor and Tesseract on Ubuntu Linux

sudo apt-get install tesseract-ocr tesseract-ocr-tam tesseract-ocr-script-taml yagf python3-pyocr ocrodjvu ocrmypdf lios gimagereader

sudo apt-get install scantailor
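Before starting a long OCR run, it helps to confirm that the Tamil language data really got installed. The sketch below relies on tesseract --list-langs, which prints one language code per line. has_lang and check_ocr_langs are helper names made up for this example.

```shell
# has_lang: succeed if the language code in $1 appears as a whole line
# in the `tesseract --list-langs` style listing read from stdin.
has_lang() {
    grep -q -x -F "$1"
}

# check_ocr_langs: verify every language code passed as an argument.
check_ocr_langs() {
    langs=$(tesseract --list-langs 2>&1)
    for l in "$@"; do
        if ! printf '%s\n' "$langs" | has_lang "$l"; then
            echo "missing tesseract language data: $l" >&2
            return 1
        fi
    done
}

# Usage: check_ocr_langs eng tam && echo "ready for OCR"
```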

Split a PDF into multiple images using Ghostscript

gs -dNOPAUSE -dBATCH -sDEVICE=png16m -sOutputFile="Pic-%02d.png" output.pdf

Do OCR using Tesseract

ls -1Nv *.png > filelist.txt

tesseract -l eng+tam filelist.txt article txt

Alternatives to ScanTailor

Post-processing with ScanTailor takes a lot of time, more than the scanning itself. We are exploring the possibilities of automating the process. Command line tools like mutool, deskew, pdfseparate and Ghostscript can be used to automate the entire flow.

But ScanTailor gives the most reliable results, as we can see and confirm each page. Try the automation below for trials only.

Automatic Deskew

deskew can be used to straighten the images. http://www.fmwconcepts.com/imagemagick/textdeskew/index.php

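# Deskew a single image, filling the background with white
deskew -o 2d.jpg -b ffffff 2.jpg

# Raise the contrast of every JPEG and save it as TIFF
for i in *.jpg;
do convert "$i" -level 50x100% "$i.tif";
done

# Deskew every TIFF and save it as PNG
for i in *.tif;
do deskew -o "$i.png" -b ffffff "$i";
done

# Convert the deskewed PNGs back to JPEG
for i in *.png;
do convert "$i" "$i.jpg";
done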

Split a PDF and OCR

The script below splits a PDF vertically, separates its pages, converts them to images, and does the OCR.

# Split each double-page spread into two pages
mutool poster -x 2 AryaMaayai.pdf arya.pdf
# Write every page to its own PDF
pdfseparate arya.pdf pg-%05d.pdf
# Render each page as a 400 dpi JPEG
for i in pg-*.pdf;
do gs -q -dNOPAUSE -dBATCH -r400 -sDEVICE=jpeg -sOutputFile="$i.jpg" "$i";
done

# OCR all pages in natural order
ls -1Nv *.jpg > filelist.txt
tesseract -l eng+tam filelist.txt aryamaayai txt

Alternatively, the tool OCRmyPDF can be used: https://github.com/jbarlow83/OCRmyPDF
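A minimal sketch of an OCRmyPDF run is shown here. The -l option selects the Tesseract languages and --deskew straightens the pages before OCR. The wrapper name ocr_pdf, the default language choice and the file names in the usage line are only illustrations.

```shell
# ocr_pdf: add a searchable text layer to a scanned PDF with OCRmyPDF.
# $1 = input PDF, $2 = output PDF, $3 = languages (default: tam+eng)
ocr_pdf() {
    ocrmypdf -l "${3:-tam+eng}" --deskew "$1" "$2"
}

# Usage: ocr_pdf arya.pdf arya-ocr.pdf
```

This keeps the scanned page images and adds the recognised text to the same PDF, so there is no separate step to split the pages into images first.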

Searchable PDF output

tesseract tam-text.png tam-text -l tam pdf

This command produces tam-text.pdf, a PDF from which we can copy and paste the text. It works even for non-English languages like Tamil. Tesseract simply lays an invisible text layer over the page image in the PDF.

Cost

The materials and labour cost as below

  1. Plywood – 1500 INR
  2. Light – 1000 INR
  3. Carpenter work – 2000 INR
  4. Electrical work – 2000 INR

Total – 6500 INR
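For a quick comparison with the commercial scanners listed earlier, the arithmetic can be checked in the shell. The figures are the ones given in this post. The glass and its handle are not priced in the list above, so they are left out here too.

```shell
# Cost items from the list above, in INR
plywood=1500
light=1000
carpenter=2000
electrical=2000

total=$((plywood + light + carpenter + electrical))
echo "ScanBox total: $total INR"

# Price of the SV600 quoted earlier in this post, in INR
sv600=47000
# Integer percentage of the SV600 price (shell arithmetic truncates)
echo "Share of the SV600 price: $((total * 100 / sv600))%"
```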

The cost will come down in the next iterations.

Safety

The ScanBox is strong enough to withstand the weight of a man. Still, keep it safe and away from kids. We ended up like this with my son.

🙂

What next?

Find good books to scan.
Make multiple ScanBoxes.
Find volunteers to scan and post-process.
Make a separate website to publish the scanned Tamil books.

Thanks

Thanks to Carl Malamud for inspiring me to preserve human knowledge via scanning.

Thanks to Noolaham Foundation of Sri Lanka, Tamil Virtual Academy, Chennai, and the Indian Wikisource communities for their awesome scanning work.

More Photos

All the photos of making the ScanBox are here

https://photos.app.goo.gl/rCTpCaqkW8tZ68md9

More discussions

More discussions are here

https://github.com/KaniyamFoundation/ProjectIdeas/issues/73

Share your thoughts on how we can improve this, either here or in the above issue.

7 thoughts on “Making of Kaniyam ScanBox – DIY Scanner”

  1. I really appreciate the effort, yet for this kind of crucial digitisation, an actual scanner will provide better results, I presume (unless we are scanning books with various dimensions).
