Using data lakes to unify information and deliver insights
The role of data in business has changed. Digitization has touched virtually every industry, creating reams of new information about customers, supply chains, and every other facet of business. Data is a powerful asset for organizations able to harness it to create new strategies and capabilities. Unlocking the potential of big data requires a new approach to data management.
Organizations must be nimble enough to handle the large volumes and varied formats of emails, web analytics, transactions, phone calls, and many other sources. At the same time, they must be able to support the demands of users and applications capitalizing on the data. Data lakes have become an established pattern for solving these modern data problems. Let’s zoom out a bit and look at the world of data warehouses and data lakes, then look at how they relate to analytics and business.
What is a data warehouse?
A data warehouse is a central repository for reporting and analysis, built on a structured, relational database. It is the gold-standard view of your data: a clean, organized, single source of truth. During the extract, transform, load (ETL) process, data is extracted from source systems, transformed into a clean structure, and loaded into the warehouse. This is a tried-and-true pattern for data storage, but its downside is its rigid and centralized nature. Some issues arise with modern data usage:
• New and unstructured data is difficult to assimilate rapidly into the data warehouse
• Analysts, data scientists, and other business users need specialized spaces to develop new uses of the data
• Real-time and customer-facing applications require high-performance subsets of the data (think chatbots, mobile apps, and recommendation engines)
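To make the ETL process described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production pipeline: the sample records, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records from a source system (illustrative data only).
RAW_ORDERS = [
    {"id": "1", "customer": " Alice ", "amount": "19.99"},
    {"id": "2", "customer": "BOB", "amount": "5.00"},
    {"id": "3", "customer": "", "amount": "not-a-number"},  # malformed row
]


def extract():
    """Extract: pull raw records from the source (here, an in-memory list)."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: cast types, normalize names, and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        name = row["customer"].strip().title()
        if not name:
            continue  # skip rows with no customer name
        clean.append((int(row["id"]), name, amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a fixed, relational schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three steps in order against the given database connection."""
    load(conn, transform(extract()))
```

Note that the schema is fixed before any data is loaded, so the malformed record is rejected outright. That strictness keeps the warehouse clean, and it is also the rigidity that the issues listed above point to.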
What is a data lake?
A data lake is a flexible storage space for collecting and preparing information from a wide range of sources for further use. It is a more general data store that can serve multiple purposes. Let’s dig into some of the typical features of a data lake:
• Cloud-based to allow easy spin-up/tear-down of space and compute resources
• Clustered to solve heavy analytic or real-time problems with distributed computation power
• Non-relational or schema-on-read to support new and diverse data types, as well as rapid experimentation by analysts and developers
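The schema-on-read idea in the last feature can be sketched in a few lines of Python. This is a simplified illustration: the sample events, the field names, and the `read_with_schema` helper are hypothetical, and a real data lake would typically rely on a distributed query engine rather than a plain loop.

```python
import json

# Raw events stored exactly as they arrived; the fields differ per record.
RAW_EVENTS = [
    '{"user": "u1", "type": "click", "ts": 1700000000}',
    '{"user": "u2", "type": "purchase", "amount": "42.50", "ts": 1700000060}',
    '{"user": "u3", "channel": "email"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select the requested fields and cast them.

    Fields missing from a record become None instead of rejecting the record.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two consumers read the same raw data through different schemas.
purchases_schema = {"user": str, "amount": float}
activity_schema = {"user": str, "type": str, "ts": int}
```

Because nothing is enforced when the data is written, a new source or a new field can land in the lake immediately, and each analyst or application decides later which structure to impose on it.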
How is a data lake different from a data warehouse?
Data lakes and data warehouses are two different but complementary services. A data warehouse contains structured, sorted, refined data, while a data lake is more flexible, dynamic, and user-focused.
Compared to the one-size-fits-all data warehouse, data lakes can be employed in different patterns for different use cases:
1. Staging data upstream of the data warehouse
The data lake serves as a staging layer to contain raw data and sources that may not be integrated into the relational data warehouse yet.
2. Serving re-structured data downstream from the data warehouse
In this case, the data lake is a form of data mart, typically a subset of the warehouse, used in support of a specific team or real-time application.
3. Providing a sandbox for analysis, development, and experimental data integration
The data lake is used as a testing ground, providing flexibility and compute resources for creating new data products. In this instance, it can be both a landing zone for experimental new data as well as a downstream development zone for building applications or machine learning models.
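As an illustration of the first pattern, the sketch below lands a raw file in a staging area without changing it. The `raw/<source>/<date>/` folder layout and the `land_raw_file` helper are assumptions made for this example, not a standard; a cloud object store would usually take the place of the local file system.

```python
from datetime import date
from pathlib import Path


def land_raw_file(lake_root, source, filename, payload, day):
    """Write the payload untouched under raw/<source>/<YYYY-MM-DD>/<filename>.

    No parsing or validation happens here; the data is kept exactly as received
    so it can be integrated into the warehouse, or explored, at a later time.
    """
    target_dir = Path(lake_root) / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)
    return target
```

Organizing raw data by source and arrival date is one common way to keep a staging layer navigable as more sources are added.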
So how do data lakes relate to analytics?
Data lakes empower teams to mine the gold out of new mountains of digital information. With data lakes, analysis is easier due to a few factors:
• Flexibility to store raw data in many formats
• Scalability to process large volumes of information on demand
• Customization to make data structure fit end user needs
Do I need a data lake?
Data lakes are a great tool for using data in ways that are not supported by a traditional data warehouse. They have become a common pattern for organizations that are bringing data to every aspect of their business. If your organization can benefit from data-driven decisions or innovations, then a data lake is a tool that you should consider.
What are your business challenges? Let's talk through the solutions.