Brief Overview of Big Data and Hadoop


Big Data:

The term Big Data applies to information that can't be processed or analyzed using traditional tools and processes. In practice it means processing terabytes of unstructured information to generate the insights that businesses require.
Three basic characteristics of Big Data:
1.    Volume: The data to be analyzed runs to terabytes. For example, Twitter alone generates around 7 terabytes (TB) of data per day and Facebook around 10 TB per day. Analyzing this volume requires a lot of hardware.
2.    Variety: The data is not only organized in traditional table form but also arrives as raw, semi-structured, and unstructured data such as weblogs, social media, blogs, sensor readings, images, and videos.
3.    Velocity: The volume and variety of data must be analyzed or processed quickly. Velocity matters because the data is often still in motion, for example when analyzing real-time market data or a customer's browsing pattern while he is still logged in.

Big Data case studies:

1.    Analyzing IT logs: Log analytics is a common use case for Big Data. Software systems generate a huge amount of logs each day, and the challenge IT departments face is storing and analyzing these logs efficiently. Logs are also semi-structured by nature, so they are not well suited to traditional database processing. Logs can be mined for rare problems that do not occur very often. In retail systems, logs can be used to analyze how customers move through the various categories of an e-commerce web site.
2.    Finding reviews on social media: The data generated by social media such as Facebook, Twitter, blogs, and travel portals can be analyzed to find public reviews and feedback on various topics like product launches or hotels. For example, many hotels use Big Data applications to find the reviews people post about their services, such as room views, cleanliness, and food. Some media channels use Big Data applications to understand viewer reactions to a newly launched serial or episode.

Hadoop:

Hadoop is an Apache framework used by Big Data applications for distributed processing of large data sets across clusters of computers. It consists mainly of HDFS (the Hadoop Distributed File System) and the MapReduce framework.

Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.[1] 

Hadoop Distributed File System:

HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop cluster typically has a single NameNode plus a cluster of DataNodes. [2] The NameNode maintains all the filesystem metadata, such as the tree of files and directories, and holds pointers to the DataNodes that store the blocks of each file. DataNodes store and retrieve blocks and update the NameNode with the list of blocks they maintain.
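
To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client API. It assumes a cluster whose address (fs.defaultFS) is supplied by a core-site.xml on the classpath; the path /tmp/hello.txt is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Pick up the cluster address (fs.defaultFS) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the NameNode records the metadata,
        // while the blocks themselves are stored on DataNodes.
        Path file = new Path("/tmp/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back: the client asks the NameNode for the block list,
        // then streams the bytes directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

Note that the client contacts the NameNode only for metadata; the file bytes themselves flow directly between the client and the DataNodes, which is what gives HDFS its high aggregate bandwidth.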

MapReduce:

The MapReduce framework works by dividing processing into two phases: a Map phase followed by a Reduce phase. Each phase takes key-value pairs as input and emits key-value pairs as output. As an example, suppose we have weather data from the past 50 years and want to find how many times in those 50 years the temperature exceeded 30 °C. With MapReduce, the Map job runs first and extracts a count for each year. The output of the Map is then given to the Reduce job, which sums the results and outputs the required count.
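
A minimal sketch of such a job, using the Hadoop Java MapReduce API, might look as follows. The record layout (one "year,temperature" line per reading) and the class names are assumptions made for illustration, not part of any real data set.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HotDays {

    // Map phase: parse one input line and emit (year, 1)
    // whenever the temperature exceeds 30 °C.
    // Assumes a hypothetical "year,temperature" record layout.
    public static class HotDayMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String year = fields[0];
            double temperature = Double.parseDouble(fields[1]);
            if (temperature > 30.0) {
                context.write(new Text(year), ONE);
            }
        }
    }

    // Reduce phase: sum the 1s emitted for each year.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(year, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hot days per year");
        job.setJarByClass(HotDays.class);
        job.setMapperClass(HotDayMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job would be submitted with the hadoop jar command, passing the input and output directories as arguments; Hadoop then schedules the map and reduce tasks across the cluster and re-runs them on other nodes if one fails.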

Other Hadoop projects:

Hive: Hive manages data stored in HDFS and provides an SQL-like query language interface for querying the data (a small query sketch follows this list).

ZooKeeper: It provides coordination services such as configuration management, distributed locks, and group services.

Pig: It's a high-level platform for creating MapReduce programs that run on Hadoop.

HBase: HBase is a distributed "NoSQL" database that runs on top of HDFS.
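
As a sketch of the Hive SQL interface mentioned above, a client can submit queries over JDBC to HiveServer2. The host, port, credentials, and the weather table below are assumptions for illustration, not part of any particular installation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port, and database are assumptions for this sketch.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // The "weather" table is hypothetical; Hive turns the SQL
             // into jobs that read the underlying files in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT year, COUNT(*) AS hot_days "
                     + "FROM weather WHERE temperature > 30 GROUP BY year")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

Hive compiles such a query into MapReduce work over the files in HDFS, which is what makes SQL-style analysis possible on data that was never loaded into a traditional database.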

References:
[1] Apache Hadoop, https://hadoop.apache.org/
[2] Wikipedia, "Apache Hadoop", https://en.wikipedia.org/wiki/Apache_Hadoop
