Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?


Yesterday, I collected the Tamil nouns and published them here – https://github.com/KaniyamFoundation/all_tamil_nouns

Well, there are too many derived words in Tamil. How do we check these derived words? There are rules upon rules; tons of grammar rules are available in Tamil. We have to apply those rules to each word that is not a base word or noun.

Today, I got a weird idea: collect all the available unique Tamil words.

How many words are there in Tamil in total? Who knows? Maybe some Tamil scholars or linguists can give a rough count.

Every base word can be derived in 30-40 ways. (Someone please tell me the correct number; I will update it here.)

If we have one lakh base words, we can get 30-40 lakh derived words.

What if we generate all these derived words, put them all in a dataset and check each word against it? As a Bloom filter seems promising for quickly checking whether a word is in a given dataset, this looks possible.

Instead of generating the derived words ourselves (we need to know the grammar rules to apply; if you know Tamil grammar rules, please share the rules for deriving more words from a base word), I decided to get all the unique words from the big datasets already available.

I found the below to be good sources for a Tamil word collection.

  1. Project Madurai
  2. Tamil Wikipedia
  3. Tamil Wiktionary
  4. Tamil Wikisource
  5. FreeTamilEbooks.com

I have already scraped Project Madurai.

I already have code to scrape the wiki sites; I wrote it a few years ago. https://github.com/tshrinivasan/tamil-wikipedia-word-list

Psankar wrote “Korkai” to build a unique word list from various sources – https://github.com/psankar/korkai – and it seems super fast.

Here are a few dataset collections.

Apart from these, we can scrape various blogs and newspaper websites.

If you have already scraped some sites, please share the data online with us. It will save us plenty of time and effort.

Issues in this method

Many language experts won't support collecting all the words like this.

  1. There may be many incorrect words
  2. The dataset may become huge
  3. Querying a huge dataset may be slow, and may not work on old computers
  4. No one has collected all the words and tested the performance

As no one has collected and tested all the available Tamil words, I wanted to give it a try. Malaikannan said that this may work well, given the speed of modern computers, and a Bloom filter can make it even faster. So, why not give it a try?

Even if this experiment fails, we will learn something from it.

Hence, I decided to download the huge datasets from Kaggle.

How to clean up the data?

  1. Remove all symbols
  2. Remove all numbers
  3. Remove all non-Tamil letters
  4. Find the unique words
  5. Write them one word per line

By doing this, we can build a unique word collection from any dataset.
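
As a rough illustration, all five steps fit in a few lines of Python. This is only a sketch; the filenames are placeholders, and it assumes the Tamil Unicode block (U+0B80–U+0BFF) is enough to detect Tamil letters.

import re

# Anything outside the Tamil Unicode block becomes a separator.
# This removes symbols, numbers and non-Tamil letters in one pass.
non_tamil = re.compile(r"[^\u0B80-\u0BFF]+")

unique_words = set()
with open("dataset.txt", encoding="utf-8") as f:  # placeholder input file
    for line in f:
        unique_words.update(non_tamil.sub(" ", line).split())

# One word per line, sorted for readability.
with open("unique_words.txt", "w", encoding="utf-8") as out:  # placeholder output file
    out.write("\n".join(sorted(unique_words)))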

I wrote a Python script to do all of these – https://gist.github.com/tshrinivasan/9ca7203e55ad67971b854d1c9ca22e7f

But it was too slow to process a 5 GB file from Kaggle.

Hence, I tried the Linux command line tools described here – https://github.com/tshrinivasan/tamil-wikipedia-word-list – and they rocked with speed. Within a few minutes, the first 3 points were done. I used a bit of Python code just to make the words unique; set() is very useful for this.

Removing sandhi characters

Most of the words have a sandhi character as the last letter.

To remove them, we need the list of sandhi characters. I called Mr. Palani, a linguist in Kerala who has been encouraging me to build a spellchecker for many years. He agreed to mentor me and give me all the Tamil rules needed to process the words. He said that there are only 4 sandhi characters: க், ச், த், ப்

We have to parse all the unique words again and remove the sandhi letter. We can add them back later, if required, while spell checking for the user.
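
Here is a minimal sketch of that pass (the filenames are placeholders; each sandhi character is a consonant plus the pulli, i.e. two Unicode code points):

# The four sandhi characters, as per Palani sir: க், ச், த், ப்
SANDHI_ENDINGS = ("க்", "ச்", "த்", "ப்")

def strip_sandhi(word):
    """Remove one trailing sandhi character, if present."""
    for ending in SANDHI_ENDINGS:
        if word.endswith(ending):
            return word[:-len(ending)]
    return word

with open("unique_words.txt", encoding="utf-8") as f:  # placeholder input file
    words = {strip_sandhi(w.strip()) for w in f if w.strip()}

with open("unique_words_no_sandhi.txt", "w", encoding="utf-8") as out:  # placeholder output file
    out.write("\n".join(sorted(words)))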

Now, there is an issue. What about words like ஒலிம்பிக், பீப்?

Palani sir smiled and told me that they are not Tamil words. 🙂

We will focus only on Tamil words for now and explore the words from other languages later, as they will be very few in number.

Good. I will be working on collecting all the unique Tamil words for a few days. Once done, I will make an MVP, a command line application or a web application, for a quick demo of the progress so far. Stay tuned.

Thanks to all the great hearts who are supporting this project.

If you are interested in contributing to this project, email me at (tshrinivasan@gmail.com)

Read the previous days' notes on building the Tamil spellchecker.

Study notes on open-tamil spellchecker – day 1

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
 

 

Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns


On exploring the spellchecker, I found that we need a good set of base words as a dataset, to quickly query whether a given word is there in the dataset or not. A Bloom filter can be used to do the quick check. But what we need is a huge dataset.

Collecting All Tamil Nouns

Nouns are a major part of any language; our languages are filled with nouns. More than 70% of our communication uses nouns. So, if we collect all the nouns and add them to our dataset, it becomes a huge collection.

I explored the internet to find a huge list of all Tamil nouns, but couldn't find any. I felt ashamed about this. We have Anna University, Tamil Virtual University, the Classical Tamil Research Centre, NRC-FOSS, TDIL and more government organizations spending crores of public money on Tamil research and development. Still, they have not released any good dataset for Tamil research. They have done great research work, but it all sleeps on their shelves, locked inside websites.

The Tamil computing world keeps reinventing the wheel again and again. To put an end to this, last year I decided to collect all the nouns and make a good dataset, to be released under a public domain license. https://github.com/KaniyamFoundation/ProjectIdeas/issues/18

We started collecting nouns last year itself. Fortunately, we found a good contributor: Mrs. Divya Gunasekaran, an M.Phil Tamil student from Chellammal Arts College, Chennai. She collected nearly one lakh nouns and published them in this public Google Sheet.

https://docs.google.com/spreadsheets/d/1FqiFLstsTo6DXsPKPKzp7iPKR49Ml2k81UPR6Nq6inQ/edit?usp=sharing

Tons of thanks to Dhivya for her tireless work on making this huge collection of nouns.

Today, I worked on collecting a few more nouns and released them all as text files in a GitHub repo.

Collected nouns from the below resources:

  • nouns collected by Dhivya – 97875
  • peyar.in boy names – 20391
  • peyar.in girl names – 24030
  • random collection – 1115
  • tamilsurangam.in – 1249
  • wiktionary – 85256

total – 2,29,916

Unique words count – 1,92,122

Released all here – https://github.com/KaniyamFoundation/all_tamil_nouns

TODO

  • Collect more nouns and add them to this repo.
  • Check for any errors in these files and fix them.
  • Collect all the verbs and other word forms in Tamil too.

If you are interested in contributing to this project, email me at (tshrinivasan@gmail.com)

Read the previous days' notes on building the Tamil spellchecker:

Study notes on open-tamil spellchecker – day 1

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset

 

 

Tesseract OCR GUI for Windows


Tesseract is a good open source OCR engine. The recent version 4 supports many Indian languages. We are using it extensively for many OCR projects.

Tesseract only accepts an image as input, and it is a command line application. We Linux users happily use the command line terminal to automate anything. But most of the book content we need to OCR will be in PDF format.

To OCR a PDF using tesseract, we have to do the below things (a rough sketch follows the list).

 

  1. Split the PDF into individual PDF files. (pdfseparate or pyPDF can be used)
  2. Convert the individual PDF files into individual images. (convert from ImageMagick, or Ghostscript, can be used)
  3. OCR each image file using tesseract
  4. Combine all the text files and give one single text file for the given PDF file
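
For illustration, here is a rough sketch of these four steps in Python. It assumes pdfseparate, ImageMagick's convert and tesseract (with the Tamil language data) are installed; the filenames and the language code are placeholders.

import glob
import re
import subprocess

def page_number(path):
    """Sort page files numerically, so page-10 comes after page-2."""
    return int(re.search(r"\d+", path).group())

# 1. Split the PDF into individual page PDFs.
subprocess.run(["pdfseparate", "book.pdf", "page-%d.pdf"], check=True)

# 2. Convert each page PDF to an image.
for page in sorted(glob.glob("page-*.pdf"), key=page_number):
    subprocess.run(["convert", "-density", "300", page, page.replace(".pdf", ".png")], check=True)

# 3. OCR each image; tesseract writes page-N.txt next to each image.
for image in sorted(glob.glob("page-*.png"), key=page_number):
    subprocess.run(["tesseract", image, image[:-4], "-l", "tam"], check=True)

# 4. Combine all the text files into one single text file.
with open("book.txt", "w", encoding="utf-8") as out:
    for text in sorted(glob.glob("page-*.txt"), key=page_number):
        with open(text, encoding="utf-8") as t:
            out.write(t.read())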

 

My quick implementation of the above in Python is here:

tess_ocr_pdf.py

Linux users can enjoy this script and convert any PDF to text with it.

But there are also Windows users in this world. They always need walking sticks; I mean, a GUI for any application.

Installing Python, Ghostscript, Tesseract and its language data, and then running the above Python script, can be tough for many Windows users. I hate Windows mainly for this reason: it keeps its users from learning and doing really good things easily.

It would be nice if someone built a Windows GUI version of the above Python script for Windows users. As I have not used Windows for the past 15 years, I don't know how to make it.

I wrote up this project idea last year – https://github.com/KaniyamFoundation/ProjectIdeas/issues/80

A few months ago, Parathan, a college student from Sri Lanka and a volunteer for the Noolaham Foundation, showed interest in building this GUI version.

https://github.com/KaniyamFoundation/ProjectIdeas/issues/80#issuecomment-609077422

He did it quickly using PySimpleGUI and then ported it to PyQt. He did all the packaging work, and today he demonstrated it in a live YouTube session for the OpenPublishingFest.org event.

You can see how it works here. The video is in Tamil.

 

Download it here – https://github.com/Parathantl/tesseract_gui/releases

Source is available here – https://github.com/Parathantl/tesseract_gui/

Report any issues here – https://github.com/Parathantl/tesseract_gui/issues

I have added a few feature requests. He said that he is too busy to fix them all, so I may give it a try and add the fixes. If you are interested in contributing, write to me ( tshrinivasan@gmail.com ) or Parathan ( parathanlive123@gmail.com ), or raise an issue on the repo.

Thanks to Parathan for releasing the code under the GPL license. We need more open minded contributors to build better Tamil computing resources.

Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset


What do we need to build a good spellchecker for Tamil?

1 We need a huge collection of good base words

Tamil may have a few million base words. All the others are derived words. We should quickly compare a given word with our all-correct dataset and say whether the given word is correct or not.

If the given word is in our dataset, all is fine; move on to the next word.

1.1 A few issues here

The dataset will be really huge; it may scale to a few GBs. How do we search within this dataset quickly? How do we compact the dataset so that we can use it on mobile too? Will you install a spellchecker if it comes with GBs of data? What if querying this dataset takes too much time? Will we use a spellchecker if it takes some 10 minutes to finish checking for errors?

We need memory- and processor-efficient ways to query the dataset.

1.2 Possibilities

This morning, I discussed this issue with Malaikannan from the IndicNLP team. He introduced the “Bloom Filter”: a very efficient way to check whether a word is in a given dataset or not. Interesting.

He told me that the IndicNLP project is already using this. He would make this Bloom filter a separate function and contribute it to open-tamil.

Happy to see more helping hands. 🙂 Thanks to Malaikannan and the IndicNLP team for their great work.

https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

https://llimllib.github.io/bloomfilter-tutorial/

IndicNLP unique words from the Wikipedia dump – https://github.com/malaikannan/IndicNLPUniqueWords
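
To make the idea concrete, here is a minimal Bloom filter sketch in Python. The bit-array size, the hash count and the input filename are illustrative assumptions, not the IndicNLP implementation.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=80_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)  # ~10 MB for 80M bits

    def _positions(self, word):
        # Derive num_hashes bit positions from salted digests of the word.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # False: definitely not in the set. True: probably in the set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

bf = BloomFilter()
with open("all_tamil_words.txt", encoding="utf-8") as f:  # placeholder word list
    for line in f:
        bf.add(line.strip())

print("தமிழ்" in bf)  # True if probably present, False if definitely absent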


 

2 Apply grammar rules to check for errors in derived words

Once the base words are checked for errors, we have to check the derived words. We have to do stemming to find the base word, apply the available grammar rules to generate the possible derived words, and compare them with the given word.

Stemming may be one action. There may be too much grammar-related work to play around with the words. We need a good Tamil grammar expert to guide us.

2.1 Issues here

Tamil has a huge set of grammar rules. Applying them to each word may take time. How do we do this in an optimal way? Have to explore this.

3 Suggesting alternates

Once a word is marked as an error, we have to suggest correct alternates. Open-Tamil has already implemented the Norvig algorithm to provide suggestions. Have to test its efficiency and explore other possibilities, if any.
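
For a flavour of the Norvig approach, here is a generic sketch (not Open-Tamil's code). The tiny alphabet below is only an illustrative subset, and real Tamil needs extra care with combining marks.

# Illustrative alphabet: just the Tamil uyir (independent vowel) letters.
ALPHABET = list("அஆஇஈஉஊஎஏஐஒஓஔ")

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, known_words, n=5):
    """Candidates one edit away that exist in our correct-word dataset."""
    return sorted(edits1(word) & known_words)[:n]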

4 Unknown items

On the path to building a spellchecker, there may be unknown obstacles to cross. We will find them only when we walk the path.

5 Packaging

If all the above items are solved, we can provide the spellchecker as a web application along with an API (with throttling). This is the easy part, AFAIK; Open-Tamil has a web version already. But the real users live in another world. They may be using LibreOffice/MS Office/PageMaker/InDesign. Have to look into building plugins for these applications. And finally, mobile users: how are we going to give a spellchecker to mobile users?

We can explore all this packaging stuff once the basic web version is ready.

While I am typing this post, Malaikannan pulled all the text from the Project Madurai site and built Bloom filter code to check a given word against the Project Madurai dataset. 🙂

https://www.projectmadurai.org/pmworks.html

 

Here is a screenshot of his quick implementation


Here are his quick GitHub gists –

 

6 What next?

We need a huge dataset of correct words to feed this Bloom filter. The Wikipedia dump is one public dataset, available under the CC-BY-SA license. We can use it, but it may contain many erroneous words. Will explore other possibilities to get good words.

Study notes on open-tamil spellchecker – day 1


Open-Tamil is a wonderful Python module built to process Tamil text. We can build awesome NLP tools for Tamil using this module.

We can get it from here – https://github.com/Ezhil-Language-Foundation/open-tamil

I have been dreaming of an open source Tamil spellchecker for around 10 years. It needs someone to explore it and work on it continuously. We have waited for a long time. Finally, I decided to give it my time.

I am going to spend some time building a spellchecker for Tamil.

As a first task, I explored the existing code in the repo.

https://github.com/Ezhil-Language-Foundation/open-tamil/tree/master/solthiruthi

The creator, Muthu, has done a lot of magic in this code. He has provided all the groundwork: the basic classes and functions to build a spellchecker.

We can read the code just like reading a novel or an account balance sheet. It is that neat and elegant.

I just explored the code and took notes in my Emacs org-mode files. Sharing them here for quick reference.

1 solthiruthi

1.1 dataparser.py

This file parses the data files in the data folder. In each file, the first line starts with “>> category”.

Usage: python dataparser.py <filename1> … <filenameN>. This command shows the categories of words and their frequencies in the document(s).

1.2 datastore.py

This builds a Trie data structure from any given data file.

A trie is a tree-like data structure whose nodes store the letters of an alphabet. By structuring the nodes in a particular way, words and strings can be retrieved from the structure by traversing down a branch path of the tree.

This file loads a sample English word list and sample Tamil text files as tries. It has all the functions needed to process a Trie data structure.

Learn more about Tries here: https://medium.com/basecs/trying-to-understand-tries-3ec6bede0014 and https://www.geeksforgeeks.org/trie-insert-and-search/
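
For a quick feel of the idea, a bare-bones trie looks like this (an illustrative sketch, not Open-Tamil's implementation):

class TrieNode:
    def __init__(self):
        self.children = {}  # one child node per letter
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:  # iterates Unicode code points, so Tamil works too
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

t = Trie()
t.insert("தமிழ்")
print(t.contains("தமிழ்"), t.contains("தமி"))  # True False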

1.3 resources.py

Reads the files in the ‘data’ directory and prepares the data categories and data dictionary references.

1.4 dictionary.py

Loads the data dictionary files from the “data” directory into memory. Note: the dictionary files are word collection files from various sources like Wiktionary, Project Madurai, TamilVU etc.

1.5 dom.py

Classes are defined here to load a text file as a Trie queue.

1.6 WordSpeller.py

Two functions are defined here to process a word and return whether it is a correct word or not, along with alternate words.

1.7 Ezhimai.py

Loads content from the TamilVU dictionary file, checks the given words against that dictionary, and returns an object with the processed word result.

Got errors in importing. Fixed like below:

#from . import WordSpeller
#from . import resources
import WordSpeller
import resources

1.8 heuristics.py

Provides classes/functions to mark words as false when the below grammar rules are violated:

  • Do not allow adjacent vowels in the word.
  • Do not allow adjacent consonants in the word.
  • Do not allow more than one repetition of a letter in the word.
  • Do not allow vowels with kombu, thunaikaal etc. in the word.
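
For example, the first rule could be sketched like this (my illustration, not the open-tamil code; Tamil independent vowels occupy U+0B85 to U+0B94):

def is_uyir(ch):
    """Tamil independent vowels அ..ஔ (U+0B85..U+0B94)."""
    return "\u0B85" <= ch <= "\u0B94"

def has_adjacent_vowels(word):
    """Flag words that have two independent vowels next to each other."""
    return any(is_uyir(a) and is_uyir(b) for a, b in zip(word, word[1:]))

print(has_adjacent_vowels("அஆடு"))  # True, so the word would be marked false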

1.9 morphology.py

Removes predefined prefixes and suffixes from the given words. Not a true stemmer, but it removes the prefix/suffix and gives a base word for further processing.

Todo: More prefixes/suffixes can be added to the list. Move the prefix/suffix lists to a separate file to make additions easy.

1.10 solthiruthi.py

Builds a command line interface that gives various options for spell checking. -files, -dialects, -Dictionary, -nalt, -debug, -stdin, -auto and -help are the options provided.

1.11 suggestions.py

Defines a word suggestion method, the Norvig suggestor, using the Norvig algorithm.

1.12 vinaisorkal.py

Finds irregular verbs and doublets. It uses the classifications defined here – Ref: Dr. V.S. Rajam, http://letsgrammar.org/verbsWithClass.html

TODO: The above link is not working. Get it from the Internet Archive's Wayback Machine and document it within the open-tamil repo.

1.13 data folder

This folder has a lot of files that contain single words for each category, like countries and fruits. A few files have English and Tamil words, like a dictionary. There are a few tgz files with huge content from Project Madurai, Wiktionary etc.

There are two types of files here.

  1. Random word collections from TamilVU, Tamil Wiktionary and Project Madurai. These are called data dictionaries.
  2. Category-based word collection files. For each category there is a file, like fruits.txt or countries.txt. These are called data categories.

1.14 Todo

1.14.1 Add a requirements.txt with version numbers for external modules like Django.
1.14.2 Check and add a developer/API document, i.e. auto-generated from the inline comments.

I am trying to run the web version of the spellchecker, but I am getting some Django import errors.

Will fix them and update with further learnings tomorrow.

 

With all this code and a good collection of words, I hope that we can build a super Tamil spellchecker soon.

If you are interested in joining this game, mail me ( tshrinivasan@gmail.com ) or Muthu ( ezhillang@gmail.com )

Tons of thanks to Muthu and the other open-tamil contributors.

Let us build a better world for ourselves, in the open source way.

Indic Wiki stats as grafana dashboard – wikimedia remote hackathon 2020


Last weekend, I attended the “Wikimedia Remote Hackathon 2020”. Due to the pandemic, all events are being moved to remote.

Like other events, the Wikimedia hackathon also moved to remote. It was on May 9-10, 2020. Though it was remote, it was well planned, and the organizing team put in the same effort as for an in-person hackathon.

All the announcements, planning, communications, various sessions, a new telegram channel for newcomers, plenty of tools, walks with dogs, quick answers over IRC/telegram for any questions, the showcase/demo, music etc. were really well planned. We can learn a lot from the organizing team on how to conduct a remote hackathon.

Here you can get all the details about the event – https://mediawiki.org/wiki/Wikimedia_Hackathon_2020/Remote_Hackathon

I could not decide what to do until the event start time. On the previous night, I just thought of getting the page count stats of the Indic Wikipedia sites and showcasing them with good graphs, charts etc.

For the past few months, I have been working on writing custom metrics exporters for Prometheus. We can write a custom exporter to expose the Wikipedia page stats as Prometheus metrics, set up a Prometheus server to scrape the data from the exporter, and use Grafana to show the graphs.

Good idea, right? Now, how do we get the page counts for any Wikipedia site?

Posted a query on the wiki-tech mailing list.
wikitech-l AT lists.wikimedia.org

See the discussion here
https://lists.wikimedia.org/pipermail/wikitech-l/2020-May/093363.html

This is a wonderful place to get answers to any technical queries related to Wikipedia.

All Wikipedia sites provide a nice REST API to interact with them.

Basic information on article counts can be fetched from each wiki
using the Action API’s action=query&meta=siteinfo endpoint. See
<https://www.mediawiki.org/wiki/API:Siteinfo> for more information
about this API.

See <https://ta.wikipedia.org/wiki/%E0%AE%9A%E0%AE%BF%E0%AE%B1%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AF%81:ApiSandbox#action=query&format=json&meta=siteinfo&siprop=statistics>
for an example usage on tawiki.

This URL gives the stats as JSON:
https://ta.wikipedia.org/w/api.php?action=query&format=json&meta=siteinfo&siprop=statistics

With this query, we get the below answer:

{
  "batchcomplete": "",
  "query": {
    "statistics": {
      "pages": 406044,
      "articles": 129087,
      "edits": 2961233,
      "images": 7758,
      "users": 174664,
      "activeusers": 431,
      "admins": 40,
      "jobs": 0,
      "queued-massmessages": 0
    }
  }
}

We can parse this and get the required details.
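
For example, a few lines of Python with the requests library can pull the numbers (the language code is a parameter; "ta" is Tamil):

import requests

def wiki_statistics(lang_code):
    """Fetch the site statistics JSON for one Wikipedia language edition."""
    url = f"https://{lang_code}.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json",
              "meta": "siteinfo", "siprop": "statistics"}
    return requests.get(url, params=params).json()["query"]["statistics"]

stats = wiki_statistics("ta")
print(stats["articles"], "articles,", stats["pages"], "pages")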

That’s it. After seeing this answer in the early morning, I could not sleep. I just got up and wrote a custom exporter for these metrics for all the Indic Wikipedia sites.

Here is the code – http://github.com/tshrinivasan/indicwiki_stats_exporter

Here is the phabricator task – https://phabricator.wikimedia.org/T252212

I used a DigitalOcean droplet server to run the exporter, Prometheus and Grafana.
I built a dashboard and published it to the public Grafana dashboards too. Yay! My first contribution to the public Grafana dashboards.

https://grafana.com/grafana/dashboards/12265

Here is the grafana dashboard – http://139.59.47.5:3000/d/kx1Pb36Zz/indic-wiki-stats?orgId=1


I shared this with the wikitech list and the telegram groups for the remote hackathon.

It felt good to get an idea and implement it quickly. I found that the Wikistats team provides various analytics possibilities here – https://stats.wikimedia.org/#/ta.wikipedia.org/content/pages-to-date/normal|line|2-year|~total|monthly

And here is another site to look for such numbers – https://wikistats.wmflabs.org/display.php?t=wp

Still, this custom exporter and the Grafana dashboard show comparison graphs, which are not available anywhere else.

Will add more stats for the Indic Wikisource, Wikibooks, Wikinews and Wiktionary sites soon.

Apart from this, I could not join any of the events and live demos that happened during the hackathon. I thought all the live sessions would be recorded. Alas, they were live only; there are no recordings, due to the inability of meet.google.com to share the streams with YouTube.

I could not attend the showcase event, but I saw it. Happy to see the great efforts of the other participants.

Thus, the two-day remote Wikipedia hackathon 2020 came to an end. Happy that there is a little contribution from me to this event.

Tons of thanks to the event organizers, the Indic Wiki team, the Noolaham Foundation (for the server I used), the wikitech mailing list and all the Wikipedia contributors for making the world a little better.

Introduction to Prometheus Monitoring System – IRC Chat log


Last Saturday, we had an IRC training session for our Indian Linux Users Group, Chennai (ILUGC) on the IRC channel #ilugc at irc.freenode.net.

I gave an introduction to the Prometheus monitoring system. Here is the chat log.

To get the full logs, check here – https://gist.github.com/mohan43u/9abac833a2faa6f790d94864809a4ac6

Thanks to the ILUGC community and all the participants.

 

====

Hello all,

Today we are going to discuss an open source infrastructure monitoring tool called Prometheus. We can use it to monitor anything: from the metrics of our computers to the weather, petrol price or gold price. Anything that can be expressed in numbers can be monitored using Prometheus.

It was created by SoundCloud.

Prometheus has the following components:
1. Prometheus server
2. various exporters
3. Alertmanager
Along with these, 4. Grafana is a good combo for building dashboards.

mohan43u: shrini: doesn't prometheus have a dashboard?
shrini: no

Grafana is used to build dashboards.
Grafana is a generic dashboard system which can get inputs from many systems, including Prometheus.

Let me explain with an example of how all these work together. Say I have 200 servers with various disks/volumes attached. On many servers, I get disk-full issues, and many scripts fail because the hard disk is full. I cannot run ‘df -h’ on all the servers frequently to monitor them. In this situation, it would be very nice to have a dashboard to monitor the disk space on all servers, with alerts by email or in any chat system for the servers that go above 80% disk usage, so that I check and fix only those servers.

Here Prometheus helps well. I just configure one central Prometheus server and a client on all 200 servers. That client is called the “node exporter”; it exports the properties/metrics of the client, like CPU/RAM/disk usage.

In Ubuntu, just install it by running the below command:

“sudo apt-get install prometheus prometheus-node-exporter”.

/etc/prometheus/prometheus.yml is the config file for the server. There we write the scrape configs. The server polls all the configured clients and pulls the data; clients expose their metrics on a defined port. On the server, we configure all the clients and their ports, so it knows where to read the data from.

For my situation, I will install node_exporter on all 200 clients and add those details in the prometheus.yml file:

static_configs:
  - targets: ['localhost:9100']

In targets, we can add any number of target IPs with port 9100, to get metrics from the node exporter on all the clients we have.

scrape_interval: 5s
scrape_timeout: 5s

to define the time delay for each poll/pull

There is nothing to do on the client side. Just install the node exporter and enable traffic flow on port 9100 from the server to the client. That's all. Now, start the prometheus service. It will start getting data from the nodes.

mbuf: Any reason why the “server” should fetch the data from the clients, as opposed to the clients pushing the data to the server?
shrini: the poll/pull design is easy. We won't miss any data even if the number of clients is high. With a design where clients push data, DoS may happen, even DDoS.

mbuf: So, if I install a new client, then I will need to add an entry for the same and restart the prometheus service?
shrini: yes, we have to add all the clients manually on the server side.
There are also configurations for automatic service discovery: it can poll all the instances in a subnet, or even all the EC2 instances with a specific tag. We can configure it like this too. In production, we use only service discovery; we don't add clients manually.

stof1: how do you deploy the configuration to the monitored servers? ansible? or does prometheus have something of its own?
shrini: we can use any config management tool, like Ansible. Prometheus doesn't provide anything; we have to manage the configs ourselves, manually or using a tool.

mohan43u: What is service discovery? How does Prometheus find clients using it?
shrini: service discovery is a system to get all the required instance information with a simple configuration.

protocol255: How much network bandwidth will it consume?
shrini: all the data transfers are plain text, so it is very low.

Let me show a live demo of a node that exposes its metrics using node_exporter: http://139.59.47.5:9100/metrics Open this in a browser and just look at the metrics. All the metrics are just numbers at the end. Prometheus accepts only numeric metrics.

bala: what if I want to export only a few metrics from my client (like RAM usage and disk space)? Where do I specify that, on the client node or the server node?
shrini: for any custom metrics, we have to write custom exporters.

who are all seeing the http://139.59.47.5:9100/metrics in browser?
say “yes” here
* mohan43u yes
protocol255 Yes
rhnvrm Yes
stof1 yes
saranya yes
kaartic Yes
bala yes

Thanks. Too many metrics, right? We may need them sometime. I have configured the client and the server on the same machine. Now let us open the Prometheus server UI, which runs on port 9090. Open this: http://139.59.47.5:9090/graph and give any metric name in the query box and press Execute. Got it?

It can show a simple graph too, on the same page. There is an alertmanager; if we configure any alerts, they will be shown here: http://139.59.47.5:9090/alerts
Currently it is empty.

Let us explore Grafana now. We have lots of metrics in our Prometheus. We can install Grafana and configure Prometheus as a data source. We can use other data sources too, like Elasticsearch. Open this now: http://139.59.47.5:3000/d/hb7fSE0Zz/1-node-exporter-for-prometheus-dashboard-en-v20191102

This is a sample dashboard on Grafana showing all the metrics we receive from the node exporter. We don't have much data yet; the node_exporter was configured just now.
see this image – https://grafana.com/api/dashboards/1860/images/7994/image

Grafana has a wonderful system for sharing dashboards. We can build dashboards and share them with the community. Anyone can import one using its dashboard id number.
https://grafana.com/grafana/dashboards/1860 Here 1860 is the id of the above dashboard. I just imported it into Grafana using the Dashboard->Import section.

Another interesting part of Prometheus is custom exporters. We can write custom exporters in any language (mostly people write them in Python and Go) and make wonderful dashboards. All we have to do is import the prometheus_client library, define the metric and its parameters, and expose it on a specific port using the inbuilt HTTP server or nginx etc. Then configure Prometheus to poll that port and build graphs in Grafana. So simple.
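
A minimal custom exporter sketch in Python looks like this (the metric name, the values and the port are just placeholders):

import random
import time

from prometheus_client import Gauge, start_http_server

# Define a metric; Prometheus accepts only numeric values.
gold_rate = Gauge("gold_rate_inr_per_gram", "Current gold rate in INR per gram")

# Expose /metrics on port 8000 using the inbuilt HTTP server.
start_http_server(8000)

while True:
    # A real exporter would fetch this number from an API.
    gold_rate.set(random.uniform(4000, 5000))
    time.sleep(30)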

Here is another demo. Today and tomorrow, there is a remote Wikimedia hackathon running. As a part of it, I wanted to build a custom exporter to get the page counts of various Indic Wikipedia sites:

http://139.59.47.5:11810/

Check this page. It may take a while to load, as for every request it fetches the numbers from each Indic Wikipedia API server.
wiki_articles{lang_code="ta",language="Tamil"} 128990.0

This is one sample metric. We can give a metric a name and attributes as key/value pairs, so that we can use them in our queries. There is PromQL, a simple query language, to query Prometheus for various metrics. Now see this dashboard for the Indic Wikipedia stats:

http://139.59.47.5:3000/d/kx1Pb36Zz/indic-wiki-stats?orgId=1

Writing a Jenkins exporter in Python


This page shows a sample exporter for Jenkins. I wrote similar code to get the metrics for the Wikipedia pages. All Wikipedia sites provide good REST API support, so we can get any content easily.

https://prometheus.io/docs/instrumenting/exporters/

Check here for the available exporters. On GitHub we can find even more. With Python we can write an exporter for anything, even to get the gold rate, the temperature etc.
https://www.robustperception.io/blog – this site has interesting articles on Prometheus.
https://prometheus.io/docs/introduction/first_steps/
https://opensource.com/article/19/4/weather-python-prometheus
https://winderresearch.com/introduction-to-monitoring-microservices-with-prometheus/

There is a pushgateway to push data from the client side, but it is not advised to rely on it.

https://winderresearch.com/img/blog/2017/prometheus/prometheus-architecture.svg
This image explains the architecture of Prometheus.

I am done with my session. Any questions?

stof1: thanks a lot

protocol255: I understand Prometheus provides a lot of customization options. How does netdata compare for client monitoring?
shrini: I don't know what netdata is.
protocol255: Ok, it is also a kind of prometheus, but it is limited to client machine stats.

“Netdata is an all-in-one monitoring solution, expertly crafted with a blazing-fast C core, flanked by hundreds of collectors. Featuring a comprehensive dashboard with thousands of metrics, extreme performance and configurability, it is the ultimate single-node monitoring tool.” (from https://www.netdata.cloud/) So netdata seems to be all-in-one, while Prometheus is loosely integrated. The power of Prometheus comes from its custom exporters and its distributed, scalable architecture.

https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2020/Remote_Hackathon
This is the wiki hackathon I am participating in.

Thanks all. Bye. See you all at the next IRC meet.

Open Publishing Fest – May 18-29 2020


The Coko Foundation, USA is organizing a global event to promote “Open Publishing”.

https://pbs.twimg.com/media/EXxc8znUEAARp5L.png

 

This is an online event.

Anyone can propose and conduct an event, in their own language, on their desired online platform.

From the website,

<quote>
Open Publishing Fest celebrates communities developing open creative, scholarly, technological, and civic publishing projects. Together, we find new ground to share our ideas.

This is at once a collaborative and distributed event. Sessions are hosted by individuals and organizations around the world as panel discussions, fireside chats, demonstrations, and performances. We connect those points to bring them in conversation with one another and map out what’s next.

We seek to build networks of resilience and care for people working on new ways to develop and share knowledge.

Join us by proposing a session. Proposals will be considered on a rolling basis up to and throughout the fest.

</quote>

Propose an event here – https://openpublishingfest.org/form.html

You can propose any number of events.

The last date to propose is May 29, 2020. Yes, you can propose and conduct events right up to the last date.

70+ events have already been proposed. Check them all here – https://openpublishingfest.org/calendar.html

 

The events can be anything. Below are a few ideas I can share:

1. Demos of any free/open source software related to publishing. There are plenty:

  • Calibre
  • Pressbooks.com
  • LibreOffice
  • Scribus
  • Editoria
  • PubSweet
  • WikiSource.org
  • WordPress blog

and more

2. Hackathons to explore and contribute to these software projects.

3. Panel discussions on these publishing tools

4. Book reviews/discussions on open licensed books

5. Silent reading events. Mark a specific date/time with friends and read a few open licensed books. That's all.

6. Proofread the books in wikisource.org in your language

7. If you can write, write a book or a few chapters for your next open licensed book

8. Pratham Books runs https://storyweaver.org.in/ where you can create story books for kids. You can draw images and upload them there, or translate existing stories into your language. All the books there are under the Creative Commons Attribution (CC-BY) license.

9. Check the Open Publishing Fest event calendar here – https://openpublishingfest.org/calendar.html and promote all the events you can on whatever social media you use.

You can bring more ideas and share them here.

 

 

 

 

 


ILUGC Monthly Meet – May 09, 2020 – 3-6 pm – Let's meet on IRC (#ilugc on freenode.net)


I am giving a text-based talk on the Prometheus monitoring system today on the #ilugc channel at irc.freenode.net.

Find full event details below.

Indian Linux Users Group, Chennai [ ILUGC ] has been spreading awareness of
Free/Open Source Software (F/OSS) in Chennai since January 1998.

We will be organizing this month's meet through ILUGC's
official IRC channel (#ilugc on freenode.net). If a visual presentation
is required, we will use jit.si.

We usually meet on the second Saturday of every month, and for the
month of May we shall meet through IRC on Saturday, May 09, 2020 at 1500
IST.

IRC Server: irc.freenode.net
Channel: #ilugc

How to join?
You can use any desktop or mobile client to connect to the
irc.freenode.net server and give #ilugc as the channel name.

Else, use the below web client
https://kiwiirc.com/client/irc.freenode.net/#ilugc

Give a nickname and connect.

Simple Meeting Guidelines:

We encourage participants to follow these simple steps to conduct the
meet effectively:

1. Please watch what others are doing and do not interrupt.
2. If you have a question, type "?" and wait for your turn to ask; the
speaker who is conducting the talk will call your name to ask the question.
3. If you need to speak, type "!" and wait for your turn to speak.
4. If you're done speaking, type "eof".
5. If you agree with someone, type "nickname: +1", where nickname is who
you are agreeing with.

See: https://fedoraproject.org/wiki/How_to_use_IRC#Meeting_Protocol

Talk Details:

Talk 1:

Topic: Introduction to Prometheus – Monitoring System

Description: Prometheus is an infrastructure monitoring system which can
collect various metrics from applications and Linux servers. With a Grafana
dashboard and various exporters, we can monitor all our systems. Let
us explore these in this discussion.

Duration: 60 minutes

Speaker: T. Shrinivasan

About Speaker: Founder of Kaniyam.com, FreeTamilEbooks.com & Kaniyam
Foundation

Talk 2:

Topic: Contribution ways to Mozilla

Description: In this session you will be introduced to Mozilla and the
projects around Mozilla, the ways to contribute to Mozilla, and more
about interesting projects like Mozilla Mixed Reality and Mozilla Hubs.

Duration: 30 minutes

Speaker: Bhuvana Meenakshi Koteeswaran

About Speaker: I have been a Mozilla volunteer since 2014 and part of
Mozilla Reps since 2017. I am most interested in Augmented and Virtual
Reality. I mostly contribute to A-Frame and Firefox Reality related
projects. I also develop AR-based native apps using various SDKs
available in the market.

Talk 3:

Topic: Introduction to tmux

Description: Tmux is a terminal multiplexer, considered an alternative
to the popular 'screen' terminal multiplexer. In this talk, we will see
how to multiplex terminals and explore the various functionalities of tmux.

Duration: 30 minutes

Speaker: Mohan R

About Speaker: Just another FOSS enthusiast.

After Talks:

QA & general discussions

All are welcome.

Thanks,
Mohan R

IRC meeting on #Tamil #computing – 02 May 2020 – 8-9 PM IST


We are planning an IRC meet on Tamil computing activities.

date : May 2 2020
Time : 8 – 9 PM IST

Venue : IRC
Channel : #tamilirc
server : irc.freenode.net

See here for more details.
https://tamilirc.wordpress.com/

All are invited.

Update:

Here is the chat log for the event.

https://tamilirc.wordpress.com/2020/05/03/irc-dicussion-on-tamil-computing-on-may-02-2020/

Happy to see a French person writing in Tamil. Thanks JulienM.

Thanks for all participants.