Why do I love Apache Kafka?



From Wikipedia,

https://en.wikipedia.org/wiki/Apache_Kafka

<quote>

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log,” making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.

</quote>

I am using Kafka as a message queue in one of my projects, which receives a huge amount of real-time data. I get around 6,000 to 100,000 events per minute. I first tried to read those events with a custom Python script, but it could not keep up with that volume and missed a lot of data.

 

I was looking for a stable tool to read this data, found Kafka, and explored it. To my surprise, it worked well. I stress tested it with the tool “siege“, producing millions of test events. A single Kafka server received and stored all the data.


Kafka stores all the data in its own internal log format, with optional compression, and keeps it. By default, it retains data for a week. Anyone can write to it and anyone can read from it, in a very stable way.
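Here is a rough sketch of that write/read cycle from Python (a sketch only, assuming the kafka-python package, a broker on localhost:9092, and a hypothetical topic named "events"):

# A minimal producer/consumer sketch with the kafka-python package.
# The broker address and the topic name "events" are assumptions for illustration.
from kafka import KafkaProducer, KafkaConsumer

# Write a few messages to the "events" topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(10):
    producer.send('events', ('event number %d' % i).encode('utf-8'))
producer.flush()

# Read them back; any number of consumers can do this independently.
consumer = KafkaConsumer('events',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)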

Logstash is a perfect fit for reading from Kafka. It can then write to S3, another Kafka cluster, or Elasticsearch.
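For example, a minimal Logstash pipeline that reads from Kafka and writes to Elasticsearch could look like this sketch (the broker address, topic, and index name are placeholders, and the exact plugin options vary between Logstash versions):

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["events"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "events-%{+YYYY.MM.dd}"
  }
}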

Installation is very simple: just download, extract, and start running it.

https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-14-04

With the Confluent Platform, it can read and write JSON documents easily.
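For instance, the Confluent REST Proxy (part of the platform) accepts JSON records over plain HTTP. Here is a sketch using the Python requests library, assuming the proxy runs on its default port 8082 and a hypothetical topic "events":

# Sketch: producing a JSON record through the Confluent Kafka REST Proxy.
# Assumes the proxy on localhost:8082 (its default port) and a topic "events".
import requests

resp = requests.post(
    'http://localhost:8082/topics/events',
    headers={'Content-Type': 'application/vnd.kafka.json.v2+json'},
    json={'records': [{'value': {'user': 'guest', 'action': 'page_view'}}]},
)
print(resp.status_code, resp.json())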

I strongly suggest using Kafka for any message queue requirement.


Real Time Bigdata Analysis – Few Tools


https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAZbAAAAJGE5Y2ZiNmU2LWRhNTgtNDhlYi05YTY0LTAwYWVmY2EyZGY5Yw.png

 

Big Data Analysis is becoming one of the hot words in the IT industry. Everyone wants to analyse data. They all want to use tools like Hadoop, Spark, etc., which process huge amounts of data, in the terabyte range. This is called “Historical Data Analysis”.

In contrast to this, there is “Real Time Data Analysis”: processing the stream of constantly incoming data immediately, as it arrives.

The typical data pipeline for Real Time Big Data Analysis is shown below.

App/Site -> API Server -> Message Queue (Kafka) -> Processor (Logstash) -> Storage (Elasticsearch, Redis, MongoDB) -> Visualization (Kibana)
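As a sketch of the first two legs of this pipeline, here is a tiny API endpoint that pushes each incoming event into Kafka (assuming Flask and kafka-python; the endpoint, port, and topic name are made up for illustration):

# Sketch of the App/Site -> API Server -> Kafka legs of the pipeline,
# using Flask and kafka-python. Endpoint and topic names are illustrative.
import json
from flask import Flask, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

@app.route('/track', methods=['POST'])
def track():
    # Each event the app/site sends is queued in Kafka; Logstash
    # (or any other consumer) picks it up from there.
    producer.send('events', request.get_json())
    return 'ok'

if __name__ == '__main__':
    app.run(port=8000)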

A few years ago, we had to rely on Google Analytics and pay huge amounts of money to get real-time data about our site visitors, credit card swipes, etc. Nowadays, we can build the entire pipeline with Free/Open Source Software.

https://i1.wp.com/blog.infochimps.com/wp-content/uploads/2012/05/realtime-analytics.png

With the following links, we can set up the data pipeline easily.

https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-14-04

http://docs.confluent.io/3.2.0/kafka-rest/docs/index.html

https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-14-04

Setting these things up is easy. But once the real-time flow starts in production, remember: you are always on fire. You will feel like you are flying an aeroplane with so many buttons on the dashboard. You have to keep it running while solving the real-time issues as they appear.

Explore these tools and learn their basics. Learning the basics will give sweet results for sure.

There are tons of new tools coming into this arena. We cannot master them all. But exploring and learning one tool well helps us keep moving on to new tools easily.

I am exploring the following tools along with the ELK stack.

  1. Presto
  2. Spark
  3. Secor
  4. Druid
  5. Hadoop
  6. Hive

I do most of my programming in Python, but it becomes very slow when dealing with gigabytes of data. Go seems faster for working with text files, so I have started exploring Go too.

What are the new tools and technologies you are learning?

 

Image source: https://www.linkedin.com/pulse/real-time-stream-processing-big-data-platform-birendra-kumar-sahu

http://gcastd.com/

 

 

What I learnt from teaching the ELK stack in a workshop


Today, I trained a mixed group of students on doing real-time big data analysis at 4ccon, Chennai.

 

https://pbs.twimg.com/media/C23UnKJUcAECtEb.jpg:large

As big data is one of the trending words in the IT field, we got around 40 participants.

The participants were from Electrical and CSE departments, along with a few working professionals.

Though we asked everyone to bring a laptop with the ELK stack preinstalled, many spot-registered participants did not bring a laptop or install anything.

That was fine. I had one full day; we could do the installations in an hour.

There were many unexpected issues.

1. Windows laptops

I never thought that people would come with Windows laptops. I did not even know that the ELK stack could run on Windows until I saw them.

I left Windows some 10 years ago and don't know how to do even basic things on it. Fortunately, Mr. Sivarama Selvan from NIC got the packages for Windows and demonstrated the following:

1. Installing Java
2. Setting JAVA_HOME and the PATH
3. Invoking Logstash with a sample configuration file

Without him, I would have felt hopeless. Thanks a lot, sir.

2. Poor Internet

Though the college provided WiFi in all the rooms, connectivity was very poor and the connection speed was too low. The Ubuntu and Red Hat users lost their patience trying to install these packages from the repositories.

After some time, I asked them to log in to my laptop to explore the commands, and to connect to my Elasticsearch and Kibana from Chrome plugins (Elasticsearch Toolbox, Postman). As the WiFi was poor, they had to wait a long time to check even small things.

3. Windows users' habits

Our mixed-skilled participants found it very tough to work with the Command Prompt in Windows; many saw it for the first time. Even traversing directories was hard, so I had to teach very basic commands like cd, dir, etc. I never thought I would be teaching MS-DOS commands in an ELK workshop.

We provided zip files for Logstash, Elasticsearch, and Kibana, with sample configuration files for Logstash and Elasticsearch in another zip file.

The icons for a zip file and a folder look similar in Windows, and double-clicking a zip file opens it just like a folder. People double-clicked the zip files and edited the config files inside. When they tried to access those files from the Command Prompt, they could not reach them. It took me a long time to find the issue and train them on how to extract zip files. 😦

4. Editing Files

Some opened the sample Logstash config file in Notepad, which showed everything on a single line, since Notepad does not understand Unix line endings. Changing values was tough.

Some opened it in MS Word and saved it as a .docx file.

Some had difficulty finding the file path to put in the sample config file.

5. curl for Windows

As curl is the main tool for interacting with Elasticsearch, I don't know how people can practise on Windows without it. I found curl for Windows, but downloading it over the poor Internet and teaching how to install and use it in the Command Prompt was too hard, so I skipped this part. I asked people to use Chrome plugins like Sense and Elasticsearch Toolbox. With these plugins, people could index only a few documents; they could not do bulk imports.
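In hindsight, plain Python could have replaced curl on those machines. Here is a sketch of the equivalent index and search calls with the requests library (the index, type, and document are made up, and the URL style matches Elasticsearch 2.x/5.x, which still had document types):

# Sketch: talking to Elasticsearch without curl, using Python requests.
# Index/type/document names are illustrative; assumes Elasticsearch on
# localhost:9200 with 2.x/5.x-style URLs (document types still present).
import requests

es = 'http://localhost:9200'

# Index one document (equivalent to: curl -XPUT .../students/student/1 -d '...').
requests.put(es + '/students/student/1',
             json={'name': 'Kavya', 'department': 'CSE'})

# Search for it (equivalent to: curl .../students/_search?q=department:CSE).
r = requests.get(es + '/students/_search', params={'q': 'department:CSE'})
print(r.json()['hits']['total'])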

6. ELK versions

Some had installed mixed versions of the ELK stack, which did not work as I demonstrated on my laptop. After a deep troubleshooting session, we found the version mismatch and installed the same latest versions as on my laptop.

Finally, we got to some hands-on learning.

Even with more than 50% of the time spent fixing these issues, I managed to explain the ELK stack. I demonstrated how to read a CSV file using Logstash, displayed the data on screen, and sent it to Elasticsearch. Then I explained Elasticsearch, demonstrating indexing data, importing bulk data, searching, and deleting. Then we explored Kibana, and I asked them to create visualizations and dashboards, which they did with huge interest.
Then I demonstrated how we can get data from the Twitter stream and analyse it in Kibana.
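The CSV demo used a Logstash config along these lines (a sketch, not the exact workshop file; the path and column names are placeholders, and the real files are in the elk-training repository linked below):

input {
  file {
    path => "/path/to/sample.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    columns => ["name", "department", "marks"]
  }
}
output {
  # Show the parsed events on screen and send them to Elasticsearch.
  stdout { codec => rubydebug }
  elasticsearch { hosts => ["localhost:9200"] index => "students" }
}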

The participants were happy to get some hands-on time with the ELK stack.

I used the following links:

Config files and sample data: https://github.com/tshrinivasan/elk-training

https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-16-04

http://ikeptwalking.com/elasticsearch-sample-data/

http://www.generatedata.com/

Sample config files:
https://github.com/elastic/examples/tree/master/ElasticStack_twitter

Slides

Here are my learnings:

1. Never expect an internet connection.

Find a way to set up a quick local intranet. Always carry a WiFi router, so that VNC, SSH, web servers, and file transfers are easy and fast.

Get some portable packages for GNU/Linux too.

Always be prepared to run the workshop without internet.

2. Learn a few Windows basics and carry software for Windows too.

It is not good to ignore the Windows users. When they come forward to learn something, we have to be prepared to teach them too.

Keep copies of the ELK zip files, curl, PuTTY, VNC, Java setup files, the Notepad++ editor, Firefox/Chrome browsers, etc.

3. Prepare documentation and share it with participants

Prepare an install/setup/examples document and share it with everyone. With it, people can explore further once they go home. If possible, create video tutorials and share them online and offline.

4. Software versions

Make sure the software versions on your laptop and on the participants' machines are the same. The ELK stack changes a lot with every release.

5. Know the audience

Mostly, we get a mixed-skill audience. I assumed they had basic computer skills like extracting files, understanding file paths, and using the command line. When they lack these, we have to start by training them on the basics.

This was my first public training on ELK. I learnt tons of stuff during my preparation hours and at the workshop. Thanks to the participants; with their patience and interest in learning, the day was a success. Thanks also to the 4ccon volunteers for the wonderful event.