Data is arguably the most important asset, if not the lifeline, of every organization. As companies become more data savvy and move toward data-driven decision making, the need to collect more data has grown significantly, driving obvious growth in operational and corporate data. Beyond corporate data, the explosion of social media (i.e. Web 2.0) and advances in mobile technology have produced an exponential growth rate of data, easily reaching several petabytes of storage. According to IDC, by the year 2020 data growth is expected to reach 35 zettabytes (1 zettabyte = 1 billion terabytes, or 1 million petabytes). Is this a reality? Here are some well-known facts about data growth in the industry. According to The Economist, retail giant Wal-Mart handles an average of more than 1 million customer transactions per hour and maintains databases holding an estimated 2.5 petabytes. On average, 200 million Twitter messages are created per day (June 2011); around 1 billion Facebook users create more than 30 billion pieces of shared content each month (Oct 2012); and roughly 100 trillion emails are sent each year. Another actively growing area is the geo-spatial and RFID sensor data collected from satellites and mobile devices; and the list goes on...
So now that we have established that a dramatic data explosion is an inevitable reality, where do we go from here? How do we sustain such growth? How can we use this data for analysis? This is where 'Big Data' comes into the picture.
What is Big Data?
Big Data is the new buzzword trending in the data space these days. Every company or organization has probably made some headway into the Big Data world, either by researching the technology or by hiring scores of people to do Big Data analytics. So what is Big Data, and what does it really mean to an organization? Gartner originally characterized Big Data by the 3 V's (Volume, Velocity and Variety), and Forrester added a 4th V (Variability). Below is a brief description of what they mean.
- Volume: Handle extremely large volumes of data (in terms of petabytes or more)
- Velocity: Fast enough to enable real-time analytics
- Variety: Support multiple data formats from highly structured (relational corporate data) to unstructured (images, videos, blogs)
- Variability: Support variable interpretation of the data
The underlying infrastructure should be scalable and programmatically flexible in order to handle extremely large volumes of data while at the same time enable blazing fast data retrieval to support analytics. Big Data’s answer to this problem is Hadoop and NoSQL.
Hadoop is a generic computing framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. The framework is independent of the underlying hardware and scales with the number of servers made available. The Hadoop ecosystem is continuously evolving; a few of its important components are listed below.
Hadoop Distributed File System (HDFS): a fault-tolerant storage system that distributes and replicates data across the nodes of a cluster. It can be scaled up by adding more nodes to a cluster or by adding more clusters. When a large file is stored in HDFS, it is broken into fixed-size blocks that are saved and replicated on different machines in the cluster.
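A minimal sketch of that block-and-replicate idea, in Python. The block size, replication factor, and node names here are illustrative only (real HDFS defaults are 128 MB blocks and 3 replicas, and its placement policy is rack-aware rather than round-robin):

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Break a file's bytes into fixed-size blocks (the last may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.
    Simplified stand-in for HDFS's rack-aware placement policy."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000                       # a 1000-byte "file"
blocks = split_into_blocks(data, block_size=256)
print(len(blocks))                       # 4 blocks: 256 + 256 + 256 + 232 bytes
print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

Losing any single node still leaves two copies of every block, which is how HDFS tolerates commodity-hardware failures.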
MapReduce: a programming framework originally developed by Google for its search engine. Jobs can be written in a variety of languages, Java being the most popular. MapReduce provides a way to break a complex analysis over a large data set into smaller pieces of work spread across a series of servers. This is achieved in two steps:
- Map: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again, in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
- Reduce: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
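The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the map → shuffle/group → reduce flow (a word count, the canonical example), not a real Hadoop job, which would typically be written in Java against the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle: group values by key; Reduce: combine each group (here, sum)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["Big Data", "big data big"]))
print(counts)  # {'big': 3, 'data': 2}
```

In a real cluster the map calls run on many worker nodes in parallel, and the framework handles the grouping ("shuffle") between the two phases.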
HDFS and MapReduce are the core foundation of Hadoop. As you may have realized, the key aspect of Big Data is the ability to spread data or tasks across multiple servers. This may sound expensive at first, but the beauty of Hadoop is that these servers do not need to be high-end machines; they can be low-end commodity servers. This has significantly reduced the cost of the hardware needed to store and process such large volumes of data and is one of the primary enablers of the Big Data wave.
Now that storage and data retrieval are a non-issue with Big Data, the next question is: what do we do with this data? According to one study, 85% of a company's data is unstructured, and there is value to be gained in analyzing it. Let's look at a few technology options for using semi-structured or unstructured data; some of these apply to structured data as well.
HBase: A column-oriented data store built on HDFS. An HBase table has a primary key and column families, each of which can hold a variable number of columns. HBase does not support SQL and is classified as a NoSQL database. Applications on HBase are written in Java or other programming languages.
Hive: A data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto the underlying data and to query it using a SQL-like language called HiveQL (HQL). HiveQL compiles down to MapReduce jobs behind the scenes, and Hive also supports custom MapReduce programs for sophisticated analysis.
Pig: A high-level, ETL-like data flow language for processing large volumes of data for analysis. Pig provides operations such as joins and aggregations, similar to SQL and ETL tools, and can be used on HBase or directly on HDFS. Like Hive, Pig compiles to MapReduce behind the scenes and therefore lends itself well to a parallel computation environment.
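To make the "compiles to MapReduce behind the scenes" point concrete, here is a hedged Python sketch of how a declarative group-and-aggregate (the kind of statement you would write in HiveQL, or express with Pig's GROUP and SUM operators) decomposes into map, shuffle, and reduce steps. The table and column names are made up for the example:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input rows, e.g. the result of scanning an "emp" table.
rows = [
    {"dept": "sales", "salary": 100},
    {"dept": "eng",   "salary": 150},
    {"dept": "sales", "salary": 120},
]

# Map: project each row to a (group-key, value) pair.
pairs = [(row["dept"], row["salary"]) for row in rows]

# Shuffle: sort so identical keys are adjacent (the framework does this
# between phases in a real cluster), then group by key.
pairs.sort(key=itemgetter(0))

# Reduce: apply the aggregate (SUM here) to each group.
totals = {dept: sum(salary for _, salary in group)
          for dept, group in groupby(pairs, key=itemgetter(0))}
print(totals)  # {'eng': 150, 'sales': 220}
```

The appeal of Hive and Pig is precisely that the analyst writes only the declarative statement; the map, shuffle, and reduce plumbing above is generated and parallelized for them.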
Data Import Utilities:
- Sqoop: Imports structured data from relational databases into HDFS.
- Flume: Imports semi-structured data like text files or machine logs into HDFS.
- Nutch: Imports data from the web into HDFS.
Data Mining Utilities:
- Mahout: Scalable machine learning algorithms on the Hadoop platform.
- Pegasus: A Java-based graph mining system that runs in a parallel, distributed manner on Hadoop.
Besides Hadoop, NoSQL databases are frequently mentioned in the same breath as Big Data. NoSQL, sometimes expanded as "Not Only SQL", refers to non-RDBMS databases that do not use SQL as their primary interface. NoSQL databases handle unstructured data with ease. A large number of NoSQL databases are already available, with more in development (e.g. Cassandra, MongoDB, Google's Bigtable, Voldemort). Many of these databases employ MapReduce-style processing and are designed to handle the volume needed to support many millions of online users. They are used in applications at Google and Amazon and at social networking sites such as Twitter and Facebook.
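The scaling trick behind many of these stores is horizontal partitioning (sharding): each key is hashed to one of many servers, so capacity grows by adding machines. A toy sketch in Python, with the caveat that the hash-modulo placement shown here is deliberately simplified; production systems such as Cassandra use consistent hashing so that adding or removing a node moves only a fraction of the keys:

```python
import hashlib

class ShardedStore:
    """Toy key-value store partitioned across several 'servers'."""

    def __init__(self, num_shards: int):
        # Each dict stands in for a separate server's local storage.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> dict:
        # Hash the key and pick a shard; every client computes the
        # same placement, so no central lookup table is needed.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key: str, value):
        self._shard_for(key)[key] = value

    def get(self, key: str):
        return self._shard_for(key).get(key)

store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})
print(store.get("user:42"))            # {'id': 42}
print([len(s) for s in store.shards])  # keys spread across the 4 shards
```

Reads and writes for different keys land on different servers, which is how these systems serve millions of concurrent users without a single machine becoming the bottleneck.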
In conclusion, Big Data, built largely on open source technology, is an evolving field. As companies look more and more into Big Data analytics applications such as text mining, sentiment analysis, and customer segmentation, more advanced tools will be developed to make Big Data more accessible and simpler for all.