Defining the Data Lake
In its simplest form, a data lake is a storage repository, commonly built on the Hadoop Distributed File System (HDFS). Unlike a data warehouse, a data lake can hold structured, semi-structured and unstructured data in its native format. Furthermore, a traditional data warehouse cleanses and conforms data for ad hoc analysis or reporting, while the raw data in a data lake sits untouched until people or applications explore it for insights.
Benefits of a Data Lake
The data lake provides many benefits to organizations. One benefit is that it can act as a data archive for your company's cold or less frequently used data: you have data you want to keep, and you have a place to store it at a very reasonable cost. Another benefit is that the data lake does not require a schema when data is loaded, which allows it to store both structured and unstructured data. The added benefit here is ingestion speed, because you're not spending time manipulating the data to fit a schema. You're merely bringing it in as is and loading it into the data lake. These and other benefits provide a good basis for information management and data analysis.
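The "load it as is, apply structure later" idea can be illustrated with a minimal Python sketch. This is not tied to any particular data lake product: the in-memory `lake` dictionary, the file paths, and the helper names `ingest` and `read_orders` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical stand-in for a data lake: paths mapped to raw, untouched payloads.
lake = {}

def ingest(store, path, payload):
    """Store the payload exactly as received; no parsing, no schema check."""
    store[path] = payload

def read_orders(store, path):
    """Apply a schema only at read time: parse the CSV and cast the columns."""
    reader = csv.DictReader(io.StringIO(store[path]))
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        for row in reader
    ]

# Ingestion is fast because structured and unstructured data are stored the same way.
ingest(lake, "sales/orders.csv", "order_id,amount\n1,19.99\n2,5.00\n")
ingest(lake, "logs/app.log", "2015-10-01 WARN disk almost full\n")

# The structure is imposed only when someone actually queries the data.
orders = read_orders(lake, "sales/orders.csv")
total = sum(order["amount"] for order in orders)
```

The trade-off is that the cost of interpreting the data moves from load time to read time, so every reader has to know (or discover) how to parse what it finds.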
Microsoft's Data Lake
In the fall of 2015, Microsoft officially launched a public preview of Azure Data Lake. This data lake offering is a set of services on Azure called Data Lake Store and Data Lake Analytics.
Data Lake Store
This is where the data is actually stored and managed. The Data Lake Store is based on an Apache Hadoop file system, which allows data managed under this service to be analyzed by Hadoop tools like MapReduce and Hive. This service allows you to build a collection of corporate data by ingesting and storing data of any type, regardless of size. The Data Lake Store also integrates with other Azure services such as Data Lake Analytics, HDInsight and Azure Data Factory.
Data Lake Analytics
The Data Lake Store is merely a place for managing and organizing data; it doesn't provide a mechanism for analyzing that data. This is where Data Lake Analytics comes into play. The analytics service provides a query language called U-SQL, which allows for discovery of any data in the Data Lake Store. Simply put, U-SQL is a combination of Structured Query Language (SQL) and C#. SQL provides a familiar mechanism for querying and manipulating data, while C# provides extensibility through custom code. U-SQL can be used to analyze structured information, but its main advantage is its ability to process unstructured data.
Collectively, these services provide storage and analytics capabilities across large volumes of data, regardless of the data's structure. With each service you pay only for what you use: Data Lake Store charges you based on the amount of data you store, and Data Lake Analytics charges you for the compute time used to run your analytic jobs.
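To make the pay-for-what-you-use model concrete, here is a rough Python sketch of the arithmetic. The function name and both rate parameters are assumptions for illustration only; they are placeholders you supply yourself, not Microsoft's actual prices, and real billing may include other charges.

```python
def estimate_monthly_cost(stored_gb, compute_hours,
                          rate_per_gb_month, rate_per_compute_hour):
    """Estimate a monthly bill from storage volume and analytics compute time.

    Both rates are caller-supplied placeholders, not published prices.
    """
    storage = stored_gb * rate_per_gb_month
    analytics = compute_hours * rate_per_compute_hour
    return {"storage": storage, "analytics": analytics, "total": storage + analytics}

# Hypothetical month: 1,000 GB stored and 10 hours of analytic jobs,
# priced with made-up rates purely to show the calculation.
bill = estimate_monthly_cost(
    stored_gb=1000,
    compute_hours=10,
    rate_per_gb_month=0.04,
    rate_per_compute_hour=2.0,
)
```

The point of the sketch is that the two line items are independent: cold data that is never queried adds only to the storage term, and a heavy analytic job adds only to the compute term.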
Proceed With Caution
Precautions need to be taken when storing vast amounts of company data in a data lake, especially with regard to data governance and data quality. Some level of data governance should be in place when you create a data lake, starting with policies around security. Ask yourself: who will have access to this information, and how will things like personally identifiable information be protected? Another area to consider is data quality. The data lake contains raw data from the source systems. If there's poor quality data in the source, there's poor quality data in the data lake. These and other topics should be addressed before heading down this path.
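As a small illustration of both concerns, the Python sketch below masks personally identifiable fields before records land in the lake and counts missing values in the raw data. The field names, the `PII_FIELDS` set, and the helpers `mask_pii` and `quality_report` are hypothetical. The unsalted hash is a simplification; it is not a substitute for a real security policy.

```python
import hashlib

# Hypothetical list of fields your governance policy treats as PII.
PII_FIELDS = {"email", "ssn"}

def mask_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of the record with PII values replaced by a short hash."""
    masked = {}
    for key, value in record.items():
        if key in pii_fields and value is not None:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[key] = digest[:12]  # truncated digest stands in for the raw value
        else:
            masked[key] = value
    return masked

def quality_report(records, required_fields):
    """Count how many records are missing each required field."""
    missing = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            if record.get(field) in (None, ""):
                missing[field] += 1
    return {"rows": len(records), "missing": missing}

raw = [
    {"name": "Ada", "email": "ada@example.com", "country": "UK"},
    {"name": "Bob", "email": "bob@example.com", "country": ""},
]
safe = [mask_pii(r) for r in raw]
report = quality_report(raw, ["name", "country"])
```

Running a report like this against each source before ingestion gives you an early view of how much poor quality data you are about to carry into the lake.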