In a recent blog post we discussed the characteristics of Big Data and how you can manage it using an open source framework like Hadoop. Hadoop enables applications to work with thousands of nodes and petabytes of data. It certainly looks like the Holy Grail for organizing unstructured data, so it's no wonder everyone is jumping on the bandwagon. Microsoft joined in when it partnered with Hortonworks, a Yahoo spinoff that offers a Hadoop distribution and support services.
Microsoft has utilized this relationship to create key capabilities for managing, accessing and analyzing Big Data.
Big Data Management
To provide a management layer for Big Data, Microsoft has introduced HDInsight. HDInsight is an enterprise-ready Hadoop service based on the Hortonworks Data Platform (HDP), enabling users to store and process data of all types for analysis, including structured, unstructured and real-time data. HDInsight comes in two versions:
HDInsight Server is the on-premises version, which offers the following benefits:
- Access privileges managed through Active Directory
- Management of HDInsight clusters through System Center 2012
HDInsight Windows Azure is the cloud-based version, which provides the following benefits:
- Virtualized deployment in the cloud
- On-demand compute, storage and networking resources
Whether cloud-based or on-premises, both environments allow you to manage and manipulate data with Hadoop tools such as Pig or Hive. As of this post, both HDInsight offerings (cloud and on-premises) are still in community technology preview. However, if you're interested in kicking the tires, you can go to www.hadooponazure.com and set up a cloud-based sample environment. Setup only takes minutes once you're granted access.
Big Data Access
Integration between Hadoop and Microsoft's business intelligence suite is a key feature of Microsoft's Big Data offering. To get started, however, you must first create a data warehouse abstraction layer using Apache Hive. Once you have created your Hive layer, you can use the Hive ODBC driver to connect your BI tools to Hadoop. The driver provides connectivity to Apache Hive, which in turn lets you query and manage large datasets residing in Hadoop's distributed storage by exposing them to the end user as a data warehouse. Microsoft is currently using the Hive ODBC driver to provide access through the following tools:
- Excel 2010
Hive ODBC enables users to bring Big Data and traditional data warehouse data together in tools they're already familiar with, such as Excel.
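Because the Hive ODBC driver is a standard ODBC data source, the same connection that Excel uses should in principle be reachable from any ODBC-capable client. The Python sketch below illustrates the idea under several assumptions: a DSN named "HiveDSN" has already been configured for the Hive ODBC driver, the third-party pyodbc package is installed, and a Hive table called web_logs with a country column exists. None of those names come from Microsoft's documentation; they are placeholders for illustration only.

```python
def build_connection_string(dsn, user=None, password=None):
    """Build an ODBC connection string for a pre-configured Hive DSN."""
    parts = ["DSN=" + dsn]
    if user:
        parts.append("UID=" + user)
    if password:
        parts.append("PWD=" + password)
    return ";".join(parts)


def build_top_n_query(table, group_col, limit):
    """Build a HiveQL query that counts rows per group and keeps the top N."""
    # Identifiers cannot be bound as query parameters, so accept only
    # simple alphanumeric/underscore names to avoid injection.
    for name in (table, group_col):
        if not name.replace("_", "").isalnum():
            raise ValueError("unsafe identifier: %r" % name)
    return (
        "SELECT {col}, COUNT(*) AS hits FROM {tbl} "
        "GROUP BY {col} ORDER BY hits DESC LIMIT {n}"
    ).format(col=group_col, tbl=table, n=int(limit))


def fetch_rows(connection_string, query):
    """Run a query through the ODBC driver and return all result rows."""
    # pyodbc is a third-party package; it is imported here so the helper
    # functions above can be used without it being installed.
    import pyodbc

    # Hive has no transactions in the usual sense, so autocommit is enabled.
    conn = pyodbc.connect(connection_string, autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical usage (requires a reachable cluster and a configured DSN):
# rows = fetch_rows(build_connection_string("HiveDSN"),
#                   build_top_n_query("web_logs", "country", 10))
```

Hive compiles a query like this into jobs that run across the Hadoop cluster, so the client only receives the small aggregated result. That is the same pattern that makes it practical to pull Big Data summaries into a desktop tool.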