Recently, I received a Word document to convert into an ebook for the site http://FreeTamilEbooks.com
I use http://pressbooks.com to convert HTML content into ebooks.
I saved the Word document as an HTML page.
It had a lot of justified text.
When I copied the text from the HTML file, it caused a lot of formatting issues in the WYSIWYG editor.
To solve this, we have to strip the unwanted HTML tags such as p, font, and span.
I wrote a small Python script, which produced a clean HTML file.
Here is the code.
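import lxml.html.clean as clean
from bs4 import BeautifulSoup  # BeautifulSoup 4; the older "BeautifulSoup" package is Python 2 only

# Read the HTML file that was exported from Word.
orig_content = open('t.html', 'r').read()
# Let BeautifulSoup repair the messy markup before cleaning it.
soup = BeautifulSoup(orig_content, 'html.parser')
result = str(soup)
# Drop meta, style and page-structure tags; unwrap the listed tags but keep their text.
strip = clean.Cleaner(meta=True, style=True, page_structure=True, remove_tags=['FONT', 'font', 'span', 'h1', 'p'])
content = strip.clean_html(result)
# Write the cleaned HTML to a new file.
new_content = open('tw.html', 'w')
new_content.write(content)
new_content.close()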
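
To see what "stripping a tag but keeping its text" means without touching any files, here is a small sketch that uses only Python's standard library. It is not the script above and not a replacement for lxml's Cleaner: it swaps in html.parser, and the names TagStripper and strip_tags, as well as the sample markup, are made up for illustration. It also drops comments and doctype declarations, so treat it as a way to understand the idea rather than a full cleaner.

```python
from html import escape
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, dropping the listed tags but keeping the text inside them.

    Hypothetical helper for illustration; comments and doctypes are not re-emitted.
    """

    def __init__(self, unwanted):
        super().__init__(convert_charrefs=True)
        self.unwanted = {t.lower() for t in unwanted}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.unwanted:
            # get_starttag_text() returns the tag exactly as written in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written
        if tag not in self.unwanted:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.unwanted:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # The parser unescapes text, so escape it again on the way out
        self.parts.append(escape(data, quote=False))


def strip_tags(html, unwanted):
    parser = TagStripper(unwanted)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up sample that resembles Word's HTML export
sample = '<p align="justify"><font face="Latha"><span>வணக்கம்</span> <b>world</b></font></p>'
cleaned = strip_tags(sample, ["p", "font", "span"])
print(cleaned)  # prints: வணக்கம் <b>world</b>
```

The p, font and span wrappers disappear while the text and the b tag survive, which is the effect the remove_tags option is used for in the script.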