At Kaniyam Foundation, we have a dream of collecting and publishing terabytes of Tamil text data for Tamil LLMs and other research work. We are documenting the websites that provide openly licensed Tamil content (Public Domain, Creative Commons) here: https://github.com/KaniyamFoundation/ProjectIdeas/issues/198
From that list, we can take the websites, scrape them, and use and share the data.
Today, I started to explore the Tamil Wikipedia data.
All the Wikipedia content is stored as XML and SQL files.
Download the Wikipedia dump for any language from http://dumps.wikimedia.org/backup-index.html.
For the Tamil Wikipedia content, from https://dumps.wikimedia.org/tawiki/ I downloaded this file:
tawiki-20240501-pages-articles-multistream.xml.bz2
It is 223.3 MB.
That page has multiple files, but look for “pages-articles” to get the main article content of Wikipedia.
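The dump URLs follow a predictable pattern (wiki name, dump date, file name), so fetching a dump for any language can be scripted. A minimal sketch, assuming the URL layout seen on dumps.wikimedia.org (the function name is my own):

```python
def dump_url(wiki: str, date: str) -> str:
    """Build the download URL for a pages-articles-multistream dump.

    Assumes the layout https://dumps.wikimedia.org/<wiki>/<date>/<file>
    as seen on the dump index pages.
    """
    name = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{name}"

# For example, the Tamil Wikipedia dump used in this post:
print(dump_url("tawiki", "20240501"))
```

The same pattern should work for any wiki, e.g. dump_url("mlwiki", "20240501") for Malayalam.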
Then, extracted it:
bunzip2 tawiki-20240501-pages-articles-multistream.xml.bz2
It gave a 1.7 GB file, tawiki-20240501-pages-articles-multistream.xml.
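The same decompression can be done from Python with the standard-library bz2 module, streaming in chunks so the 1.7 GB output never has to sit in memory. A small sketch (the function name and chunk size are my own choices):

```python
import bz2

def decompress(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """Stream-decompress a .bz2 file to dst, 1 MB at a time.

    Equivalent to bunzip2, except the original .bz2 file is kept.
    """
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        while block := fin.read(chunk):
            fout.write(block)
```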
It is an XML file; we have to extract the plain text content from it.
For that, I explored and found a good tool: https://github.com/apertium/WikiExtractor
I downloaded and used it:
python3 WikiExtractor.py --infn tawiki-20240501-pages-articles-multistream.xml
It ran for 2 minutes and gave a 627 MB file, wiki.txt, with all the article content as one single big plain-text file.
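Under the hood, tools like WikiExtractor stream through the dump XML and pull out each page's wikitext (they also strip templates and wiki markup, which this sketch does not). A minimal, namespace-agnostic sketch of that core loop, assuming the MediaWiki export format of page/title/revision/text elements:

```python
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump.

    Streams with iterparse so a multi-GB dump never has to fit in
    memory; tag names are matched without the XML namespace, since
    the export namespace version varies between dumps.
    """
    title, text = None, None
    for _, elem in ET.iterparse(path):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # free the finished page's subtree
```

This yields raw wikitext; the markup cleanup is the part where a dedicated tool like WikiExtractor earns its keep.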
Compressed it with 7z, as that gives better compression:
mv wiki.txt tawiki-20240501-pages-article-wiki.txt
7z a tawiki-20240501-pages-article-text.7z tawiki-20240501-pages-article-wiki.txt
It is 70 MB.
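If you prefer to stay in Python, the standard-library lzma module gives the same LZMA compression family that 7z uses by default, producing an .xz file rather than a .7z archive. A sketch (function name mine):

```python
import lzma

def compress(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """LZMA-compress src to dst (.xz format) at maximum preset,
    streaming 1 MB at a time so large files stay out of memory."""
    with open(src, "rb") as fin, lzma.open(dst, "wb", preset=9) as fout:
        while block := fin.read(chunk):
            fout.write(block)
```

For a text corpus like this, plain text compresses very well either way (here, 627 MB down to about 70 MB, roughly a 9:1 ratio).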
Like this, I will continue to get plain-text Tamil data from various sources. We have to find where we can publish a few hundred GBs to TBs of data for free. Till then, I will share these files from my self-hosted desktop PC at home.
Published the file here: https://kaniyam.cloudns.nz/tamil_datasets/
Let me know if you are interested in joining this project.