SQL Server has a number of new and exciting BI features that were added in the recent 2016 release. Many of these open up revolutionary possibilities such as allowing true real-time in-memory analytics or processing advanced analytics and machine learning directly inside the database engine. One of the features that I’m most excited about is the inclusion of PolyBase, which will bridge the gap between data warehouses stored in traditional databases and Big Data stored in Hadoop.
PolyBase has been around for several years in Microsoft’s APS Data Warehouse appliance, and more recently as part of Azure SQL Data Warehouse. It is now part of SQL Server with the 2016 release. It allows for the creation of an external table in the database where the data is physically stored in Hadoop or Azure Blob Storage. To an end user, the table appears to be part of the database like any other table. It can be used in T-SQL queries and stored procedures.
The difference with PolyBase is where the data is stored and how it’s processed. The data can be stored in Hadoop or Azure Blob Storage which provide much more cost effective storage solutions. When a query is executed, the database will determine the most efficient processing method for the query; whether that be processing locally in SQL, or by having the massive parallel capability of Hadoop tackle the query.
PolyBase allows for the blending together of the capabilities of traditional databases and Hadoop like never before. Here are just a few powerful use cases that are now available to us.
In the past, we’ve been presented with problems of what to do with massive amounts of data that is stored in our warehouse tables. We often store a lot of older data that is accessed infrequently but takes up a lot of valuable space. Many organizations aggregate or archive and delete this data as a cost savings measure. This infrequently accessed data is referred to as “cold data”, while data that is in high demand is considered “hot”.
A recent method of handling this is to remove the cold data from database storage and archive that data in Hadoop. The problem with this method is that our data is now spread between two different platforms. If you want to query across the entire dataset, you either need to move data or combine multiple queries. With PolyBase, all the data can be accessible from the SQL 2016 database. Combining these two datasets into one query is now a trivial task.
Use Hadoop for Big ETL
Fairly recent advances in SQL Server, such as column storage, have opened up the door for fantastic performance against very large datasets in SQL Server. This performance is possible after transforming the data and loading it into a proper model setup for query performance. This is great for the queries, but the process of getting the data into the model can be tedious and process intensive. Wouldn’t it be great to process that data with the parallel processing power of Hadoop but allow for the data to be queried in a well-defined columnar table in SQL Server?
PolyBase makes this easy in that the table can be loaded using standard T-SQL sourcing from Hadoop tables. Hadoop can take on the heavy workload of transforming the data and loading it into a high performant SQL table for analytics.
Data Movement from Hadoop to SQL
There are various methods to move data from Hadoop to a SQL Server database. Some of those methods, such as Sqoop, can be overly complicated once you start to consider file formats and scheduling. Others, like using an ETL tool and ODBC to query a Hive table, can be poor performing. Moving data using PolyBase is very simple and high performing. Performance can be increased further by setting up a PolyBase Scale-Out Group which allows a cluster of SQL Servers to move data from Hadoop to SQL Server in parallel.
The various components and flexibility of Hadoop make it a compelling platform for consuming live streams of information from everything from sensors to receiving messages from an enterprise messaging solution. Those live steams can be accessed live from SQL Server by simply creating a PolyBase connection to that information.
BI Tool Integration
Many organizations store massive amounts of raw data in Hadoop or Azure Blob Storage. The data is ready for analysts to come in and perform their data analysis and discovery with BI tools. While most new BI tools interact well with Hadoop, not all tools have good integration. In addition, analysts may be stuck on older versions of the tool or have to jump through administrative hurdles getting the latest drivers to Hadoop components installed and configured. Finally, managing permissions in various tools across the organizations can be complicated and costly. Allowing the user to access these important data sources with the ease of connecting to SQL Server will open up new possibilities and streamline processes.