Implemented a data lake to store data from multiple data sources

Challenge

A Seattle-based global retailer needed to integrate data from multiple campaign systems and automate the process for analysis and engagement. They needed to better understand who was engaging with marketing content and offers, and whether customers were progressing through the campaigns as expected. They also wanted to benchmark campaigns across platforms to see which were the most effective.

 

Logic20/20 solution:

Logic20/20 implemented a data lake to store data from multiple data sources. Leveraging Agile methodologies for data management, the team transformed the data in the cloud with an eye toward reduced cost, operational efficiency, and the client's data analytics needs.

 

Solution breakdown: the challenges and interesting bits

Logic20/20’s solution can be broken down into several core components – here’s how we tackled the challenge.

 

Challenge 1: Collect data from multiple data sources and vendors

Classified the vendors and created a small number of repeatable processes to collect and transform the data. We identified the platform commonality shared across the different data sets (in this case, SFTP and PGP encryption) and developed a process to download the data from SFTP, decrypt it with PGP, and securely upload all the data from across platforms into the cloud data lake (sketched in the code below).

Created a generic (and thus reusable) way to load data from the legacy Oracle system into the data lake.

Uploaded data from Google Docs using the Google API.

Where possible, the team wrote reusable code that could be applied across multiple scenarios.
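
The case study doesn't publish the implementation, but the download-decrypt-upload pattern can be sketched in Python (the orchestration language listed under Technologies). This is a minimal sketch, assuming paramiko for SFTP and python-gnupg for PGP; the vendor configuration keys, file naming, and the upload_to_data_lake helper are hypothetical placeholders rather than the actual code.

```python
# A minimal sketch of the download -> decrypt -> upload process, assuming
# paramiko for SFTP and python-gnupg for PGP. The vendor configuration keys,
# file naming, and upload_to_data_lake() are hypothetical placeholders.
import os

import gnupg      # python-gnupg wrapper around GnuPG
import paramiko   # SFTP client


def fetch_vendor_files(host, user, password, remote_dir, local_dir):
    """Download every file a vendor has dropped on its SFTP server."""
    transport = paramiko.Transport((host, 22))
    transport.connect(username=user, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    downloaded = []
    try:
        for name in sftp.listdir(remote_dir):
            local_path = os.path.join(local_dir, name)
            sftp.get(f"{remote_dir}/{name}", local_path)
            downloaded.append(local_path)
    finally:
        sftp.close()
        transport.close()
    return downloaded


def decrypt_file(encrypted_path, passphrase, gnupg_home):
    """Decrypt a PGP-encrypted file and return the path of the plaintext copy."""
    gpg = gnupg.GPG(gnupghome=gnupg_home)
    output_path = encrypted_path.removesuffix(".pgp")
    with open(encrypted_path, "rb") as handle:
        result = gpg.decrypt_file(handle, passphrase=passphrase, output=output_path)
    if not result.ok:
        raise RuntimeError(f"Decryption failed for {encrypted_path}: {result.status}")
    return output_path


def upload_to_data_lake(local_path, lake_path):
    """Placeholder for the data lake upload (e.g. via the data lake's SDK)."""
    raise NotImplementedError("Swap in the upload client for your data lake")


def ingest_vendor(cfg):
    """One repeatable process, reused for every SFTP + PGP vendor."""
    for encrypted in fetch_vendor_files(**cfg["sftp"]):
        plaintext = decrypt_file(encrypted, cfg["passphrase"], cfg["gnupg_home"])
        upload_to_data_lake(plaintext, f"/raw/{cfg['name']}/{os.path.basename(plaintext)}")
```

Keeping the vendor-specific details in a configuration dictionary is what makes the process repeatable: each new SFTP + PGP vendor only needs a new config entry, not new code.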

 

Challenge 2: Transform the data in the cloud once it is in the data lake

Minimized the cost and operational support required for data transformation.

Leveraged U-SQL to perform the transformations, as U-SQL scripts can handle very large jobs, incorporate custom code from C# libraries, and scale out through parallelism. By using U-SQL, the team also avoided cluster maintenance.

Supported legacy systems during the migration by doing all the heavy lifting in the data lake. The team created a simple replication job that presents the newly transformed data in legacy format, allowing the client teams to continue using their Tableau dashboards.

Sped up development and solved the U-SQL challenge of running jobs through a web interface by writing code that automatically submitted script files as U-SQL jobs and returned a list of errors (a sketch of this workflow follows below).
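
Below is a minimal sketch of that "submit scripts, get errors back" workflow, assuming a directory of .usql files and a thin wrapper around the analytics service's job API. The two placeholder functions stand in for the Data Lake Analytics SDK or REST calls, which are not shown in the case study, and the job state names are illustrative.

```python
# A minimal sketch of the "submit scripts, get errors back" workflow. The two
# placeholder functions stand in for the Data Lake Analytics SDK / REST calls,
# which are not shown in the case study; the job state names are illustrative.
import glob
import time


def submit_usql_job(script_text, job_name):
    """Placeholder: submit the script to the analytics service and return a job ID."""
    raise NotImplementedError("Wire this to your Data Lake Analytics client")


def fetch_job_status(job_id):
    """Placeholder: return (state, error_messages) for a submitted job."""
    raise NotImplementedError("Wire this to your Data Lake Analytics client")


def run_scripts(script_dir, poll_seconds=15):
    """Submit every .usql file in a directory and collect errors per script."""
    errors_by_script = {}
    for path in sorted(glob.glob(f"{script_dir}/*.usql")):
        with open(path) as handle:
            job_id = submit_usql_job(handle.read(), job_name=path)
        # Poll until the job leaves its running states, then record any errors.
        while True:
            state, errors = fetch_job_status(job_id)
            if state not in ("Queued", "Compiling", "Running"):
                break
            time.sleep(poll_seconds)
        errors_by_script[path] = errors
    return errors_by_script


if __name__ == "__main__":
    for script, errors in run_scripts("./usql").items():
        print(script, "OK" if not errors else errors)
```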

 

Challenge 3: Detect dependencies across files

Leveraged a DAG (directed acyclic graph) to map dependencies between items so that jobs run against files in the correct order.

Created dependency management code that detects dependencies and runs upwards of 20 different jobs in parallel where the graph allows (see the sketch below).
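
A minimal sketch of this dependency-management pattern, using Python's standard-library graphlib for the topological ordering and a thread pool to run independent jobs in parallel. The job names, dependency map, and run_job body are illustrative; the actual jobs were the U-SQL transformations described above.

```python
# A minimal sketch of DAG-based dependency management using the standard
# library's graphlib for ordering and a thread pool for parallel execution.
# The job names, dependency map, and run_job() body are illustrative.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Map each job to the jobs it depends on (its predecessors).
DEPENDENCIES = {
    "load_campaign_a": [],
    "load_campaign_b": [],
    "merge_campaigns": ["load_campaign_a", "load_campaign_b"],
    "build_engagement_report": ["merge_campaigns"],
}


def run_job(name):
    """Placeholder for the real transformation job (e.g. a U-SQL submission)."""
    print(f"running {name}")


def run_dag(dependencies, max_workers=20):
    """Execute jobs in dependency order, running independent jobs in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        while sorter.is_active():
            # Schedule every job whose predecessors have all finished.
            for job in sorter.get_ready():
                futures[pool.submit(run_job, job)] = job
            # Wait for at least one running job to finish, then mark it done
            # so that its dependents become ready on the next pass.
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for future in done:
                sorter.done(futures.pop(future))
                future.result()  # surface any exception raised by the job


if __name__ == "__main__":
    run_dag(DEPENDENCIES)
```

Marking a job done only after its future completes keeps dependents from starting early, while independent branches of the graph run in parallel up to the worker limit.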

 

Technologies:

Cloud-based data lake and U-SQL

Python for instrumentation and orchestration of jobs

PGP for encryption

The tool set was kept purposefully small so that it was easy to support.