At Kaniyam Foundation, we have a dream of collecting and publishing terabytes of Tamil text data for Tamil LLMs and other research work. We are documenting the websites that provide openly licensed Tamil content (Public Domain, Creative Commons) here: https://github.com/KaniyamFoundation/ProjectIdeas/issues/198
From that list, we can take the websites, scrape them, and use and share the data.
Today, I started to explore the Tamil Wikipedia data.
All the Wikipedia content is stored as XML and SQL files.
Download the Wikipedia dump for any language from http://dumps.wikimedia.org/backup-index.html.
For the Tamil Wikipedia content, from https://dumps.wikimedia.org/tawiki/ I downloaded this file:
tawiki-20240501-pages-articles-multistream.xml.bz2
It is 223.3 MB.
That page has multiple files, but look for “pages-articles” to get the main article content of Wikipedia.
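The dump URLs follow a predictable pattern (wiki name, dump date, file name), so fetching a dump for any language can be scripted. A minimal sketch, assuming the URL layout seen on dumps.wikimedia.org (the function name is my own):

```python
def dump_url(wiki: str, date: str) -> str:
    """Build the download URL for a pages-articles-multistream dump.

    Assumes the layout https://dumps.wikimedia.org/<wiki>/<date>/<file>
    as seen on the dump index pages.
    """
    name = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{name}"

# For example, the Tamil Wikipedia dump used in this post:
print(dump_url("tawiki", "20240501"))
```

The same pattern should work for any wiki, e.g. dump_url("mlwiki", "20240501") for Malayalam.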
Then, extracted it:
bunzip2 tawiki-20240501-pages-articles-multistream.xml.bz2
It gave a 1.7 GB file, tawiki-20240501-pages-articles-multistream.xml.
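The same decompression can be done from Python with the standard-library bz2 module, streaming in chunks so the 1.7 GB output never has to sit in memory. A small sketch (the function name and chunk size are my own choices):

```python
import bz2

def decompress(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """Stream-decompress a .bz2 file to dst, 1 MB at a time.

    Equivalent to bunzip2, except the original .bz2 file is kept.
    """
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        while block := fin.read(chunk):
            fout.write(block)
```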
It is an XML file; we have to extract the plain text content from it.
For that, I explored and found a good tool: https://github.com/apertium/WikiExtractor
I downloaded and used it:
python3 WikiExtractor.py --infn tawiki-20240501-pages-articles-multistream.xml
It ran for 2 minutes and gave a 627 MB file, wiki.txt, with all the article content as one single big plain-text file.
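Under the hood, tools like WikiExtractor stream through the dump XML and pull out each page's wikitext (they also strip templates and wiki markup, which this sketch does not). A minimal, namespace-agnostic sketch of that core loop, assuming the MediaWiki export format of page/title/revision/text elements:

```python
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump.

    Streams with iterparse so a multi-GB dump never has to fit in
    memory; tag names are matched without the XML namespace, since
    the export namespace version varies between dumps.
    """
    title, text = None, None
    for _, elem in ET.iterparse(path):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # free the finished page's subtree
```

This yields raw wikitext; the markup cleanup is the part where a dedicated tool like WikiExtractor earns its keep.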
Compressed it with 7z, as that gives better compression:
mv wiki.txt tawiki-20240501-pages-article-wiki.txt
7z a tawiki-20240501-pages-article-text.7z tawiki-20240501-pages-article-wiki.txt
It is 70 MB.
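If you prefer to stay in Python, the standard-library lzma module gives the same LZMA compression family that 7z uses by default, producing an .xz file rather than a .7z archive. A sketch (function name mine):

```python
import lzma

def compress(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """LZMA-compress src to dst (.xz format) at maximum preset,
    streaming 1 MB at a time so large files stay out of memory."""
    with open(src, "rb") as fin, lzma.open(dst, "wb", preset=9) as fout:
        while block := fin.read(chunk):
            fout.write(block)
```

For a text corpus like this, plain text compresses very well either way (here, 627 MB down to about 70 MB, roughly a 9:1 ratio).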
Like this, I will continue to get plain-text Tamil data from various sources. We have to find where we can publish a few hundred GBs to TBs of data for free. Till then, I will share these files from my self-hosted desktop PC at home.
Published the file here: https://kaniyam.cloudns.nz/tamil_datasets/
Let me know if you are interested in joining this project.