Real-Time Big Data Analysis – A Few Tools

https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAZbAAAAJGE5Y2ZiNmU2LWRhNTgtNDhlYi05YTY0LTAwYWVmY2EyZGY5Yw.png

 

Big data analysis has become one of the buzzwords of the IT industry. Everyone wants to analyze data, and everyone wants to use tools like Hadoop and Spark. These tools are used to process huge amounts of stored data, often terabytes in size. This is called “Historical Data Analysis”.

In contrast, there is “Real Time Data Analysis”, which processes a constantly incoming stream of data immediately, as it arrives.
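To make the difference concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any of the tools in this post work internally: the `SlidingWindowCounter` class, the `batch_count` function and the sample timestamps are all made up for this example. The counter keeps an up-to-date answer as each event arrives, while the batch function scans the stored data afterwards.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()  # arrival times, oldest first

    def add(self, timestamp: float) -> int:
        """Record one event and return the current count for the window."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def batch_count(timestamps, start, end):
    """Historical style: scan the full stored data set after the fact."""
    return sum(1 for t in timestamps if start < t <= end)


# Hypothetical page-view arrival times, in seconds.
events = [0.0, 1.0, 2.5, 9.0, 10.5, 11.0, 30.0]

counter = SlidingWindowCounter(window_seconds=10.0)
live_counts = [counter.add(t) for t in events]
print(live_counts)                     # one answer per incoming event
print(batch_count(events, 1.0, 11.0))  # one answer, computed afterwards
```

Both styles agree on the numbers. The difference is when the answer is available: the streaming version has it at the moment each event arrives.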

A typical data pipeline for real-time big data analysis looks like this:

App/Site -> API Server -> Message Queue (Kafka) -> Processor (Logstash) -> Storage (Elasticsearch, Redis, MongoDB) -> Visualization (Kibana)
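The stages of this pipeline can be sketched with the Python standard library alone. This is a toy model under loud assumptions: `queue.Queue` stands in for Kafka, a plain list stands in for the storage layer, and a small aggregate function stands in for a Kibana chart. None of the function names below come from those tools; they are hypothetical and exist only to show how data moves from stage to stage.

```python
import json
import queue
from collections import Counter

# Stand-in for the message queue (Kafka in the pipeline above).
message_queue = queue.Queue()

# Stand-in for storage (Elasticsearch, Redis or MongoDB in the pipeline above).
document_store = []


def api_server_receive(event: dict) -> None:
    """API server stage: accept an event from the app/site and enqueue it."""
    message_queue.put(json.dumps(event))


def process(raw_message: str) -> dict:
    """Processor stage: parse the message and derive a field for later queries."""
    doc = json.loads(raw_message)
    doc["is_error"] = doc["status"] >= 500
    return doc


def drain_queue() -> None:
    """Move every queued message through the processor into storage."""
    while not message_queue.empty():
        document_store.append(process(message_queue.get()))


def pages_by_hits() -> list:
    """Visualization stage: the kind of aggregate a dashboard would chart."""
    return Counter(doc["path"] for doc in document_store).most_common()


# Hypothetical events sent by a site.
for event in [
    {"path": "/home", "status": 200},
    {"path": "/cart", "status": 500},
    {"path": "/home", "status": 200},
]:
    api_server_receive(event)

drain_queue()
print(pages_by_hits())
```

The queue in the middle is the important design choice: the API server only has to enqueue and return, so a slow processor or a storage hiccup does not block the site. In a real setup each stage runs as its own service.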

A few years ago, we had to rely on Google Analytics and pay a huge amount of money to get real-time data about our site visitors, credit card swipes, etc. Nowadays, we can build the entire pipeline with free/open source software alone.

https://i1.wp.com/blog.infochimps.com/wp-content/uploads/2012/05/realtime-analytics.png

With the following links, we can set up the data pipeline easily.

https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-14-04

http://docs.confluent.io/3.2.0/kafka-rest/docs/index.html

https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-14-04

Setting these things up is easy. But once the real-time flow is running in production, remember, you are always on fire. You will feel like you are flying an aeroplane, with so many buttons on the dashboard. You have to keep running while solving real-time issues as they appear.

Explore these tools and learn their basics. Learning the basics will surely pay off.

Tons of new tools are arriving in this arena. We cannot master them all. But exploring and learning one tool well makes it easier to move on to new tools.

I am exploring the following tools along with the ELK stack.

  1. Presto
  2. Spark
  3. Secor
  4. Druid
  5. Hadoop
  6. Hive

I do most of my programming in Python. It becomes very slow when dealing with GBs of data. The Go language seems faster for working with text files, so I have started exploring Go too.
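For reference, this is roughly the kind of text processing I mean, as a minimal Python sketch. The log format (status code as the last field) and the `count_status_codes` helper are assumptions made up for this example. Reading one line at a time keeps memory use flat however large the file is, but it does not make Python itself any faster.

```python
import os
import tempfile
from collections import Counter


def count_status_codes(path: str) -> Counter:
    """Count the last whitespace-separated field of each line, one line at a time."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # the file object yields lines lazily
            fields = line.split()
            if fields:  # skip blank lines
                counts[fields[-1]] += 1
    return counts


# Build a small hypothetical log file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("GET /home 200\nGET /cart 500\n\nGET /home 200\n")
    sample_path = tmp.name

status_counts = count_status_codes(sample_path)
os.remove(sample_path)
print(status_counts)
```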

What new tools and technologies are you learning?

 

Image source: https://www.linkedin.com/pulse/real-time-stream-processing-big-data-platform-birendra-kumar-sahu

http://gcastd.com/

 

 
