Strip unwanted HTML tags from an HTML file using Python

Recently, I received a Word document to convert into an ebook for the site http://FreeTamilEbooks.com

I use http://pressbooks.com to convert HTML content into ebooks.

I saved the Word document as an HTML page.
It has a lot of justify-formatted text.

When I copied the text from the HTML file, it caused a lot of formatting issues in the WYSIWYG editor.

To solve this, we have to strip the unwanted HTML tags such as p, font and span.

I wrote a small Python script, which gave a clean HTML file.

Here is the code.

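# On lxml 5.2 and later, lxml.html.clean is provided by the separate lxml_html_clean package.
import lxml.html.clean as clean
from bs4 import BeautifulSoup

# Read the HTML file exported from Word.
with open('t.html', 'r') as f:
    orig_content = f.read()

# Parse the markup with BeautifulSoup and serialize it back to a string.
soup = BeautifulSoup(orig_content, 'html.parser')
result = str(soup)

# Drop meta, style and page-structure tags; remove_tags unwraps the listed tags but keeps their text.
strip = clean.Cleaner(meta=True, style=True, page_structure=True, remove_tags=['font', 'span', 'h1', 'p'])
content = strip.clean_html(result)

# Write the cleaned HTML to a new file.
with open('tw.html', 'w') as new_content:
    new_content.write(content)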
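
To see what "stripping a tag but keeping its text" means on a small input, here is a minimal sketch that uses only the standard library's html.parser instead of lxml. It is not the script above, just an illustration of the same idea. The names TagStripper, strip_tags and UNWANTED are hypothetical, and the sample markup is made up to resemble Word output.

```python
from html.parser import HTMLParser

# Tags to unwrap: the tag itself is dropped, its text is kept.
# (Assumed list, mirroring the tags named in the post.)
UNWANTED = {"font", "span", "h1", "p"}


class TagStripper(HTMLParser):
    """Re-emit HTML while skipping start/end tags listed in UNWANTED."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in UNWANTED:
            # get_starttag_text() returns the tag as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, unchanged.
        if tag not in UNWANTED:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in UNWANTED:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html):
    """Return html with the UNWANTED tags removed and their content kept."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<div><p align="justify"><font face="Arial"><span>Hello</span> <b>world</b></font></p></div>'
cleaned = strip_tags(sample)
print(cleaned)  # prints: <div>Hello <b>world</b></div>
```

This sketch silently drops comments and doctype declarations and does nothing about style or meta tags, so for real Word exports the lxml Cleaner used above is the better fit.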
