Need help – HTML to PDF with Custom Fonts


We are looking for a way to convert HTML pages to A4 and B7 PDFs for the FreeTamilEbooks.com project.

We are training authors to create ebooks themselves using Pressbooks.com.

They can export EPUB, MOBI and XHTML from Pressbooks.

At present, a few volunteers convert the XHTML to PDF by printing from Firefox, after changing the margin and printer settings.

Many authors find this difficult.

We want to automate the conversion of XHTML to A4 and B7 size PDFs, so that we can add a web interface, host it on a server, and ask authors to upload an EPUB or XHTML file and get the PDF files as output.

We want to use a custom TTF font for Tamil.
Ila Sundaram-10.TTF is the font we want to use.
Get this font from http://www.kaniyam.com/ila-sundaram-unicode-tamil-fonts/
We tried to set this font via CSS using @font-face, but the generated PDFs do not use it.
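
For anyone who wants to experiment, here is a minimal sketch of the kind of print stylesheet we have in mind, generated from Python so the same template serves both page sizes. The family name "TamilBook", the font path and the 10 mm margin are placeholders chosen for illustration, and B7 is assumed to mean ISO B7 (88 x 125 mm). Whether a given converter honours @page and @font-face is exactly the open question, so this is a starting point for testing, not a fix.

```python
from string import Template

# ISO 216 page sizes in millimetres (width, height).
PAGE_SIZES_MM = {
    "A4": (210, 297),
    "B7": (88, 125),
}

CSS_TEMPLATE = Template("""\
@font-face {
  font-family: "$family";
  src: url("$font_url") format("truetype");
}
@page {
  size: ${width}mm ${height}mm;
  margin: ${margin}mm;
}
body {
  font-family: "$family", serif;
}
""")


def build_print_css(page, font_url, family="TamilBook", margin_mm=10):
    """Return a print stylesheet for the given page size and font file.

    `family` and `margin_mm` are illustrative defaults, not project settings.
    """
    width, height = PAGE_SIZES_MM[page]
    return CSS_TEMPLATE.substitute(
        family=family,
        font_url=font_url,
        width=width,
        height=height,
        margin=margin_mm,
    )


if __name__ == "__main__":
    # Hypothetical relative path to the downloaded font file.
    print(build_print_css("B7", "fonts/Ila Sundaram-10.TTF"))
```

The generated CSS can be saved next to the XHTML file and linked from it before trying each converter.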

We explored wkhtmltopdf.
It does not render the B7 size properly, and we could not get it to use the custom font.

We are looking for volunteers to explore PhantomJS or wkhtmltopdf for generating PDF files from HTML with a custom font.

Reply here or contact me if you would like to volunteer.

Thanks.

A few issues and solutions when installing AtoM


AtoM stands for Access to Memory. It is a web-based, open source application for standards-based archival description and access in a multilingual, multi-repository environment. See the AtoM homepage for more information.

I am installing this along with Archivematica, an open source digital preservation system.

I followed the instructions here to install AtoM.

https://www.accesstomemory.org/en/docs/2.2/admin-manual/installation/linux/#installation-linux

I had already installed Archivematica from http://archivematica.org and it was running on port 80.

As AtoM uses nginx and port 80 was already taken, I changed its port to 8080.

File : /etc/nginx/sites-enabled/atom

original :   listen 80;
change :   listen 8080;

Then executed
sudo service nginx restart
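
If the same change has to be made on several machines, the edit can be scripted. The following is only a sketch: it handles the simple `listen 80;` form shown above and deliberately ignores variants such as `listen 80 default_server;` or IPv6 listen lines. The function name is my own.

```python
import re


def change_listen_port(config_text, old_port, new_port):
    """Return config_text with simple `listen <old_port>;` lines changed to new_port."""
    pattern = re.compile(
        r"^(\s*listen\s+)" + re.escape(str(old_port)) + r"(\s*;)",
        re.MULTILINE,
    )
    return pattern.sub(
        lambda m: m.group(1) + str(new_port) + m.group(2),
        config_text,
    )


if __name__ == "__main__":
    # A trimmed, hypothetical server block for demonstration.
    sample = "server {\n  listen 80;\n  root /usr/share/nginx/atom;\n}\n"
    print(change_listen_port(sample, 80, 8080))
```

To apply it for real, read /etc/nginx/sites-enabled/atom, pass the text through the function, write it back, and restart nginx as above.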

Then I opened http://<ip-address>:8080

But it threw a 500 Internal Server Error. I checked /var/log/nginx/error.log.

It said: *8 FastCGI sent in stderr: "PHP message: Unable to open PDO connection [wrapped: SQLSTATE[28000] [1045] Access denied for user 'root'@'localhost' (using password: NO)]" while reading response header from upstream, client: 192.168.100.99, server: _, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.atom.sock:", host: "192.168.100.101"

Solution: delete the file /usr/share/nginx/atom/config/config.php

Now the web interface to configure AtoM is displayed.

When I entered the username and password for the database, it gave the following error.

The following errors must be resolved before you can continue the installation process:

Unable to open PDO connection [wrapped: SQLSTATE[28000] [1045] Access denied for user 'root'@'localhost' (using password: NO)]

Solution:
sudo chown -R www-data:www-data /usr/share/nginx/atom
sudo service php5-fpm restart

Now the settings are saved and the AtoM installation is complete.

Thanks to the AtoM mailing list for the answers.
https://groups.google.com/forum/m/#!msg/ica-atom-users/L3jB7FQMaN8/z9zoV0GhefEJ

Run many versions of Ubuntu with LXC


I am working on a connector between Google Drive OCR and WikiSource projects.

https://github.com/tshrinivasan/OCR4wikisource

On my Ubuntu 15.04 laptop, where I develop, everything works fine. But users reported many issues with the tools mutool and pdfunite.

For a long time I could not find the reason. Finally I found that those users were running Ubuntu 12.04.

In Ubuntu 12.04, mutool is not available and pdfunite is an older version that behaves differently from the one in Ubuntu 15.04.

I wanted to try Ubuntu 12.04 and searched for a free VPS, but found none that I could use to try things quickly.

LXC containers helped here.

We can install any Ubuntu version as a mini VPS inside our own Ubuntu system.

sudo lxc-create -t download -n ubuntu1204 -- --dist ubuntu --release precise --arch amd64

sudo lxc-start -n ubuntu1204

sudo lxc-attach -n ubuntu1204

With the --release option, we can choose any older version of Ubuntu. LXC downloads that release and installs a minimal system.
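
To test against several releases, the create command can be generated per release. This is a small sketch that only builds and prints the commands. The container naming scheme and the helper name are my own, and whether the download template still offers images for a given old release is something to check on your system.

```python
import shlex


def lxc_create_command(name, release, dist="ubuntu", arch="amd64"):
    """Build the lxc-create command for a container from the download template.

    Options after the bare "--" are passed to the download template itself.
    """
    return [
        "sudo", "lxc-create", "-t", "download", "-n", name,
        "--", "--dist", dist, "--release", release, "--arch", arch,
    ]


# Ubuntu version -> release codename.
RELEASES = {"12.04": "precise", "14.04": "trusty"}

if __name__ == "__main__":
    for version, codename in RELEASES.items():
        name = "ubuntu" + version.replace(".", "")
        # Print only; copy the line into a shell to actually create the container.
        print(shlex.join(lxc_create_command(name, codename)))
```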

Using this, I reproduced the pdfunite issue and changed the program to work with Ubuntu 12.04.

Users are happy now :-)

Thanks to Ravi, jayantanth, Sibi and Omshivaprakash for continuous testing and for giving ideas for enhancements. I realised the importance of testing and tasted the true spirit of collaborative contribution.

See here to learn more about LXC containers.

https://www.digitalocean.com/community/tutorials/getting-started-with-lxc-on-an-ubuntu-13-04-vps

Announcing OCR4wikisource


There are many PDF and DjVu files in Wikisource in various languages. In many Wikisource projects, those files are split into individual page images using the Proofread Page extension.

Contributors look at those images and type the text manually.

This project helps the Wikisource team to OCR the entire PDF or DjVu file using Google Drive OCR. It then updates the relevant pages in Wikisource with the text.

Grab the Python code from here and run it on your GNU/Linux machine.

https://github.com/tshrinivasan/OCR4wikisource

It is based on
https://github.com/tshrinivasan/google-ocr-python

Reply here with your suggestions and improvements.

Solution for "too long for Unix domain socket" with Ansible and Amazon EC2


fatal: [ec2-x.x.x.x.us-west-2.compute.amazonaws.com] => SSH Error: unix_listener: "/home/shrinivasan/.ansible/cp/ansible-ssh-x.x.x.x.us-west-2.compute.amazonaws.com-22-ubuntu.0wqQt0HttbVPpz9B" too long for Unix domain socket
while connecting to x.x.x.x:22
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

I got the above error from Ansible when I used long hostnames (Amazon EC2 names) instead of IP addresses in the Ansible hosts file.

Ansible could not log in to the machines via SSH.

To solve this, enable the following line in the [ssh_connection] section of /etc/ansible/ansible.cfg. It gives the SSH control socket a shorter path.
control_path = %(directory)s/%%h-%%r

After this, Ansible can log in to the remote servers and run the scripts.
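
To see why the shorter control path helps, here is a rough check of the socket path length. It assumes the Linux limit of 108 bytes for a Unix socket path (including the terminating NUL). The hostname is a made-up example, and the random suffix is copied from the error message above.

```python
# Size of sockaddr_un.sun_path on Linux, including the terminating NUL byte.
SUN_PATH_MAX = 108


def fits_unix_socket(path, limit=SUN_PATH_MAX):
    """Return True if path is short enough to be used as a Unix socket path."""
    return len(path.encode()) < limit


host = "ec2-203-0-113-10.us-west-2.compute.amazonaws.com"  # hypothetical EC2 name
cp_dir = "/home/shrinivasan/.ansible/cp"
suffix = ".0wqQt0HttbVPpz9B"  # random suffix, as seen in the error message

# Path layout seen in the error message: ansible-ssh-<host>-<port>-<user>
default_path = f"{cp_dir}/ansible-ssh-{host}-22-ubuntu{suffix}"
# Path layout with control_path = %(directory)s/%%h-%%r
short_path = f"{cp_dir}/{host}-ubuntu{suffix}"

if __name__ == "__main__":
    for label, path in (("default", default_path), ("short", short_path)):
        print(label, len(path), "fits" if fits_unix_socket(path) else "too long")
```

With this example hostname the default layout is over the limit and the shorter layout is under it. A longer home directory or hostname can still push even the short layout over.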

Solution for the "passwd: Module is unknown" issue in Ubuntu


I have an AWS instance running Ubuntu.

When I added a new user or tried to change a password, I got the following error.

# passwd
passwd: Module is unknown
passwd: password unchanged

Let us check the auth.log for any issues.

root@ip-172-31-9-242:~# tail -n 2 /var/log/auth.log
Jun  3 14:06:01 ip-172-31-9-242 CRON[31435]: pam_unix(cron:session): session closed for user root
Jun  3 14:08:01 ip-172-31-9-242 CRON[31490]: PAM unable to dlopen(pam_cracklib.so): /lib/security/pam_cracklib.so: cannot open shared object file: No such file or directory

This means the pam_cracklib module is missing. Let us search for it.

root@ip-172-31-9-242:~# apt-cache search pam | grep crack
libpam-cracklib - PAM module to enable cracklib support

Good. Let us try installing it.

root@ip-172-31-9-242:~# apt-get install libpam-cracklib

Now let us try to change the password again.

Great. It works now.

Lesson: look at the log files before googling.
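
In that spirit, here is a small sketch that scans auth.log lines for PAM modules that failed to load. The regular expression is based only on the single message format shown above, so other PAM errors will not be caught. The function name is my own.

```python
import re

# Matches messages such as: PAM unable to dlopen(pam_cracklib.so): ...
DLOPEN_RE = re.compile(r"PAM unable to dlopen\((?P<module>[^)]+)\)")


def missing_pam_modules(log_lines):
    """Return the sorted names of PAM modules that could not be loaded."""
    missing = set()
    for line in log_lines:
        match = DLOPEN_RE.search(line)
        if match:
            missing.add(match.group("module"))
    return sorted(missing)


# The two log lines from the auth.log output above.
SAMPLE = [
    "Jun  3 14:06:01 ip-172-31-9-242 CRON[31435]: pam_unix(cron:session): "
    "session closed for user root",
    "Jun  3 14:08:01 ip-172-31-9-242 CRON[31490]: PAM unable to "
    "dlopen(pam_cracklib.so): /lib/security/pam_cracklib.so: cannot open "
    "shared object file: No such file or directory",
]

if __name__ == "__main__":
    print(missing_pam_modules(SAMPLE))
```

On a real machine, pass the lines of /var/log/auth.log (read as root) instead of SAMPLE.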

unrtf – RTF to HTML conversion utility


For the http://FreeTamilEbooks.com project, we have to convert many Word documents into HTML before converting them to EPUB using http://PressBooks.com

I used LibreOffice to open each Word document and save it as HTML.

Up to LibreOffice 4.1, the images were extracted and stored as separate files alongside the HTML file.

From LibreOffice 4.2 onwards, images are base64-encoded and embedded in the HTML file itself, with no option to store them separately.

This was very annoying, and many people have reported it as a bug here.
https://bugs.documentfoundation.org/show_bug.cgi?id=48887

But it does not seem to have been fixed.

So I installed LibreOffice 4.1 in /opt just to use the old behaviour of storing images separately.

I have just found another utility, unrtf, that does the same job.

http://www.gnu.org/software/unrtf/

To install it on Ubuntu/Debian:

sudo apt-get install unrtf

If you get a Word document with images, save it as a Rich Text Format (RTF) file using LibreOffice Writer.

example:

test.doc -> test.rtf

Then,

unrtf test.rtf > test.html

This gives a clean HTML file, with the images stored separately.
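
Since we have many documents to convert, the unrtf step can be wrapped in a few lines of Python. This is a sketch, assuming unrtf is installed and on the PATH. The helper names are my own, and it writes each .html file next to its .rtf file.

```python
import subprocess
import sys
from pathlib import Path


def unrtf_command(rtf_path):
    """Return the unrtf command and the output HTML path for one RTF file."""
    rtf = Path(rtf_path)
    return ["unrtf", "--html", str(rtf)], rtf.with_suffix(".html")


def convert(rtf_path):
    """Run unrtf on rtf_path and save its standard output as an HTML file."""
    command, html_path = unrtf_command(rtf_path)
    with open(html_path, "wb") as out:
        # unrtf prints the HTML to standard output, as in the command above.
        subprocess.run(command, stdout=out, check=True)
    return html_path


if __name__ == "__main__":
    # Usage: python convert_rtf.py test.rtf another.rtf
    for name in sys.argv[1:]:
        print(convert(name))
```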

Thanks to the GNU team for this nice utility.

I can get rid of old LibreOffice 4.1 now.