How to migrate to the cloud to improve Big Data processing
Data is the fuel propelling business optimization today. Large quantities of high-quality data yield keen market insights that lead to better business decisions and operations. However, one big obstacle stands between businesses and becoming well-oiled data machines: the traditional method of extracting, transforming, and loading (ETL) data and storing it in data warehouses cannot scale to satisfy current needs. In a data-driven world, success hinges on the ability to gather and analyze data in quick and clever ways, leading to better customer insights and a competitive edge in the market.
Introduce big data from multiple sources to this mix, and ETL slows to a crawl while storage costs skyrocket. To manage big data efficiently and streamline operations, businesses need to embrace new technologies. Migrating data processing and storage to a cloud environment is a proven solution to the problem of managing big data.
In this article, we’ll discuss the advantages of moving traditional ETL workloads to the cloud. Some organizations may also want to explore new tools for real-time data processing.
Advantages of processing data in the cloud
Traditionally, ETL has been performed on on-premise servers. That model cannot easily scale to support the demands of big data without a continual increase in equipment and cost. Migrating data operations and storage to the cloud alleviates the challenges of trying to squeeze big data into traditional models. Benefits include:
1. Smaller capital investment.
With cloud environments, you pay only for the space you use, and because that space is effectively unlimited and priced per unit, costs stay proportional to actual usage. This payment model eliminates the need for large budgets to support the growing server requirements of big data ETL processing and storage. Instead, ETL can be performed in one dedicated cloud environment, and copies of published data can exist in multiple locations with minimal impact on budget.
2. Better performance.
While data warehouses may struggle to process the high volume of big data, cloud platforms can provision the computing power to match the workload. Moving data operations out of data warehouses and into the cloud results in faster query responses and snappier interaction with BI tools.
3. Duplication.
A drawback of a single, on-premise data warehouse is that it creates a single point of failure. Replicating data on premise requires more server space, which means more capital investment. Migrating to the cloud solves this challenge and lets different users and teams access data easily. The cloud allows for ETL processing in one environment, storage in another, and limitless data copies for purposes such as back-up. Companies that prefer to keep an on-premise data warehouse can retain it as a legacy back-up or even as another way for users to retrieve data (e.g., a presentation layer).
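To make this concrete, below is a minimal sketch of what automatic back-up copies might look like if the published data lives in AWS S3; the bucket names and IAM role are placeholder assumptions, and both buckets would need versioning enabled.

```python
# Hedged sketch: automatic cross-region back-up copies of published data,
# assuming AWS S3. Bucket names and the IAM role ARN are placeholders, and
# both buckets must already exist with versioning enabled.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="my-data-lake-published",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [{
            "ID": "backup-copy",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},  # an empty filter replicates every object
            "Destination": {"Bucket": "arn:aws:s3:::my-data-lake-backup"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
        }],
    },
)
```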
4. Flexibility.
Data warehouses are monolithic giants, while cloud environments are flexible and nimble. In the cloud, multiple copies of data leave room for testing and experimentation. When undergoing a digital transformation such as a cloud migration, it is important to run the new and old data solutions in parallel at first to ease the transition and avoid downtime. With the cloud, this hybrid model is entirely possible: on-premise data warehouses can run in conjunction with new data housed in cloud data lakes.
5. Scalability.
Once a company begins capturing big data, processing and storage needs may increase rapidly. With the cloud, increasing or decreasing capacity to match data needs is possible without much effort or cost.
ETL becomes ELT(P)
ETL (“extract, transform, load”) is the standard way of collecting raw data and preparing it to become usable data. This process makes sense for data stored and queried within a data warehouse. Once data processing and storage are moved to the cloud, though, it is important to modify this process.
The optimal workflow in the cloud is to extract, load, transform, and publish, or ELT(P). In other words, data is not modified until after it has been loaded into the system, which leaves open the possibility of reverting to the raw data if necessary. Transforming big data can be complex, involving cleansing, joining, and merging until a usable data table emerges, and this entire working process happens within the cloud environment. For example, five different data sources may each reveal something unique about a customer. All five of those data points can be collected, merged, and then refined to remove any errors or duplicates until a single “customers” data table is ready to be published.
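As a concrete illustration, here is a minimal PySpark sketch of that transform step, assuming the five raw extracts have already been loaded into the lake unchanged; the paths, source names, and column names are hypothetical.

```python
# Minimal sketch of the "T" in ELT(P): the raw data is already loaded, so all
# transformation happens inside the cloud environment. Paths, source names,
# and columns are illustrative assumptions.
from functools import reduce
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("build-customers").getOrCreate()

# Five hypothetical raw extracts, each revealing something about a customer.
sources = ["crm", "web", "billing", "support", "marketing"]
frames = [
    spark.read.parquet(f"s3://my-data-lake/raw/{name}/customers/")
         .select("customer_id", "email", "updated_at")
    for name in sources
]

# Merge the sources, then refine: drop bad records, normalize, de-duplicate.
merged = reduce(DataFrame.unionByName, frames)
customers = (
    merged
    .filter(F.col("customer_id").isNotNull())        # remove records with errors
    .withColumn("email", F.lower(F.trim("email")))   # normalize values
    .dropDuplicates(["customer_id"])                 # one row per customer
)

# Stage the usable "customers" table in the prep zone; publishing comes later.
customers.write.mode("overwrite").parquet("s3://my-data-lake/prep/customers/")
```

Because the raw extracts are never overwritten, the job can be re-run from scratch whenever the transformation logic changes.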
So, sound good? It’s time to migrate to the cloud! Here’s how to get started:
1. Select a cloud product.
Azure and AWS are the largest and best-known cloud products, but choose what works best for your business. If the cloud is already in use in another part of your organization, you may want to stick with the same product. If you are not yet using the cloud, or feel unsure or overwhelmed by the choice, seek out a knowledgeable consultant to guide your selection.
2. Create a data lake and organize structure.
The data lake can be set up within the cloud environment. Generally, it is important to start simply and organize the data lake by setting up three main folders: (1) raw data, (2) prep, and (3) published data. Once the folder structure is established, it is time to start pointing source data to the landing zone. Raw data then begins loading into the “raw data” folder.
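As a minimal sketch of that initial layout, assuming the lake is backed by an AWS S3 bucket (the bucket name is hypothetical): S3 folders are just key prefixes, so zero-byte marker objects are enough to make the three zones visible.

```python
# Minimal sketch of the initial data lake layout on AWS S3. The bucket name
# is a placeholder and is assumed to exist already; S3 "folders" are just key
# prefixes, so empty marker objects make the zones visible in the console.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket

for zone in ("raw/", "prep/", "published/"):
    s3.put_object(Bucket=bucket, Key=zone)

# Source systems can then land extracts under the raw zone, for example:
#   s3://my-data-lake/raw/<source-name>/<dataset>/<load-date>/
```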
3. Begin migrating ELT(P) jobs to the new stack.
Existing jobs can be rewritten with cloud-native big data tools, such as Spark SQL or Azure U-SQL.
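As a hedged example of what that migration can look like, a legacy warehouse transformation might translate to Spark SQL roughly as follows; the view name, paths, and query are illustrative, not a prescribed pattern.

```python
# Hedged sketch: a legacy warehouse transformation re-expressed in Spark SQL.
# Paths, the view name, and the SQL itself are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migrated-elt-job").getOrCreate()

# Register the raw zone as a queryable view instead of a warehouse staging table.
spark.read.parquet("s3://my-data-lake/raw/orders/") \
     .createOrReplaceTempView("raw_orders")

# The old ETL SQL often carries over with only minor changes.
clean_orders = spark.sql("""
    SELECT order_id,
           customer_id,
           CAST(order_total AS DECIMAL(12, 2)) AS order_total,
           to_date(order_ts)                   AS order_date
    FROM raw_orders
    WHERE order_status <> 'CANCELLED'
""")

clean_orders.write.mode("overwrite").parquet("s3://my-data-lake/prep/orders/")
```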
4. Publish.
Final data lands in the “published” zone of the data lake. In the cloud it is possible (and preferred) to replicate published data to multiple destinations: a working copy can be stored in the data lake while copies are routed to other data marts. For businesses operating in a hybrid, on-premise/cloud model, the last step is to publish a copy of the final data into the legacy data warehouse, allowing it to serve as back-up storage, a presentation layer, or both. Once a business is ready to turn off its on-premise solution and go “full-cloud,” the legacy data warehouse can be migrated to a cloud-based database platform, such as SQL Azure or Oracle on AWS RDS.
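A minimal sketch of that publish step, assuming a Spark job and a hybrid model in which the legacy warehouse is reachable over JDBC; all connection details are placeholders.

```python
# Minimal sketch of the publish step: the working copy lands in the lake's
# "published" zone, and a copy goes to the legacy warehouse over JDBC for
# hybrid operation. All connection details are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("publish-customers").getOrCreate()
customers = spark.read.parquet("s3://my-data-lake/prep/customers/")

# Working copy in the data lake's published zone.
customers.write.mode("overwrite").parquet("s3://my-data-lake/published/customers/")

# Optional copy to the on-premise warehouse (back-up storage, presentation
# layer, or both).
(customers.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://legacy-dw.example.com;databaseName=edw")
    .option("dbtable", "dbo.customers")
    .option("user", "etl_user")
    .option("password", "********")
    .mode("overwrite")
    .save())
```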
If big data is becoming cumbersome for your organization, it’s time to consider migrating your data operations to a cloud solution. Cloud environments provide the space, flexibility, and computing power to handle the volume and velocity of big data while also being budget-friendly.
Ilya Tsapin is the Architecture Practice Area Lead at Logic20/20. He has experience in project management, architecture, Agile methodologies, and more. He has written about diverse topics including machine learning, IoT, cloud workload reliability, and Agile management.
Follow Ilya on LinkedIn