A Brief Overview of Big Data and Hadoop
Big Data:
The term Big Data applies to information that cannot be processed or analyzed using traditional tools and processes. In practice it means processing terabytes of largely unstructured information to generate the insights a business needs.
Three basic characteristics of Big Data:
1. Volume: The data to be analyzed runs into terabytes. For example, Twitter alone generates around 7 terabytes (TB) of data per day, and Facebook around 10 TB per day. Analyzing data at this volume requires a lot of hardware.
2. Variety: The data is not only organized in traditional tables; it also includes raw, semi-structured, and unstructured data such as web logs, social media posts, blogs, sensor readings, images, and videos.
3. Velocity: Data of this volume and variety must be analyzed or processed quickly. Velocity matters because the data is still in motion; for example, analyzing real-time market data, or analyzing a customer's browsing pattern while he is still logged in.
Big Data case studies:
1. Analyzing IT logs: Log analytics is a common use case for Big Data. Software systems generate a huge volume of logs each day, and the challenge IT departments face is storing and analyzing these logs efficiently. Because logs are semi-structured, they are not well suited to traditional database processing. Logs can be used to track down rare problems that occur only occasionally. In retail, system logs can be used to analyze how customers browse through the various categories of an e-commerce web site.
2. Finding reviews on social media: The data generated by social media platforms such as Facebook, Twitter, blogs, and travel portals can be analyzed to find public reviews and feedback on topics like product launches or hotels. For example, many hotels use Big Data applications to find the reviews people post about their services, such as room views, cleanliness, and food. Some media channels use Big Data applications to understand viewers' reactions to a newly launched serial or its episodes.
Hadoop:
Hadoop is an Apache framework for the distributed processing of large data sets across clusters of computers. Hadoop consists mainly of HDFS (the Hadoop Distributed File System) and the MapReduce framework, and it is the foundation Big Data applications use to process large data sets.
Hadoop implements a
computational paradigm named map/reduce, where the application is divided into
many small fragments of work, each of which may be executed or re-executed on
any node in the cluster. In addition, it provides a distributed file system
that stores data on the compute nodes, providing very high aggregate bandwidth
across the cluster. Both map/reduce and the distributed file system are
designed so that node failures are automatically handled by the framework.[1]
Hadoop Distributed File System:
HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode, plus a cluster of datanodes that form the HDFS cluster. [2]
The NameNode maintains all the filesystem metadata, such as the tree of files and directories, and holds pointers to the DataNodes that store the blocks of each file. A DataNode stores and retrieves blocks and updates the NameNode with the list of blocks it maintains.
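To make this concrete, here is a minimal sketch that writes and reads a file on HDFS through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the file path are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; this address is an assumption.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Write a file; HDFS splits it into blocks stored on DataNodes.
            Path file = new Path("/tmp/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same API; the client asks the
            // NameNode for block locations and streams from the DataNodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
        }
    }
}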
MapReduce:
The MapReduce framework works by dividing processing into two phases: a Map phase followed by a Reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output. As an example, suppose we have weather data for the past 50 years and want to find how many times the temperature exceeded 30 °C in each of those years. With MapReduce, the Map job runs first and extracts a count from each year's readings; the output of the Map is then passed to the Reduce job, which sums the results and outputs the required counts.
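Here is a minimal sketch of that weather example using the Hadoop MapReduce Java API. The input format is an assumption: one reading per line, as a year followed by a temperature (for example, "1987 32.5").

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HotDays {

    // Map phase: for each reading above 30 °C, emit (year, 1).
    public static class HotMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length == 2) {
                try {
                    if (Double.parseDouble(parts[1]) > 30.0) {
                        context.write(new Text(parts[0]), ONE);
                    }
                } catch (NumberFormatException ignored) {
                    // Skip malformed lines.
                }
            }
        }
    }

    // Reduce phase: sum the ones emitted for each year.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> counts,
                              Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(year, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hot days over 30C");
        job.setJarByClass(HotDays.class);
        job.setMapperClass(HotMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework groups the map output by key before the Reduce phase, so the reducer receives all the ones for a given year together; the combiner applies the same summation on each node first to reduce the data shuffled across the cluster.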
Other Hadoop projects:
Hive: Hive manages data stored in HDFS and provides an SQL-like query language interface for querying the data (see the sketch after this list).
ZooKeeper: It provides coordination services such as configuration management, distributed locks, and group services.
Pig: A high-level framework for creating the MapReduce programs used by Hadoop.
HBase: HBase is a type of "NoSQL" distributed database.
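As an illustration of the Hive item above, the sketch below submits an SQL-like HiveQL query through Hive's JDBC driver. The HiveServer2 address and the weather table with its year and temperature columns are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2; host, port, and database are assumptions.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL but is compiled into jobs that run
            // over the data in HDFS. The weather table is hypothetical.
            ResultSet rs = stmt.executeQuery(
                "SELECT year, COUNT(*) AS hot_days " +
                "FROM weather WHERE temperature > 30 GROUP BY year");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}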
References: