Understanding the data science life cycle—and how to improve it
Data science encompasses a broad set of techniques for solving problems with data. Building and managing data science projects requires a different mindset than typical software development. Let’s take a look at what sets data science apart, how its life cycle differs, and how to manage it.
There are three ways data science can deliver value:
• Data-driven strategy – Research, analysis, and attribution modeling that yield insights into your business. This often comes in the form of ad-hoc analysis to uncover underlying trends or predict future developments that can guide business strategy. The power of data science here is in extracting new insights from stores of historical data. An example data-driven strategy is segmentation of customers based on buying patterns that can lead to the development of new promotions and marketing.
• Data-driven decisions – Repeatable prescriptive analytics that inform and drive specific decisions. Dashboards, reports, and decision aids bring data into the user’s workflow, increasing accuracy and improving efficiency. Recommendation engines are a classic decision aid, surfacing data at the right moment to inform users.
• Data-driven solutions – Fully automated tools, actions, or products produced by machine learning models. In this case, decisions are made directly by algorithms designed to act at a scale or level of precision unreachable by humans.
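The customer-segmentation example above can be sketched with a tiny k-means clustering routine. This is a minimal, pure-Python illustration; the customer data, the `kmeans` helper, and the deterministic initialization are all invented for the sketch (a real project would use a library such as scikit-learn):

```python
def kmeans(points, k, iters=20):
    """Cluster 2-D points (orders_per_year, avg_spend) into k groups."""
    centroids = points[:k]  # deterministic init keeps the sketch reproducible
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Invented customers: (orders per year, average spend)
customers = [(2, 20), (3, 25), (2, 22), (40, 150), (38, 160), (42, 155)]
centroids, clusters = kmeans(customers, k=2)
# One segment holds occasional low spenders, the other frequent high
# spenders -- each might warrant a different promotion.
```

The resulting segments, not the algorithm itself, are the deliverable here: they give marketing a concrete basis for targeted promotions.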
What is the life cycle of a data science project?
Data science is both an art and a science. It requires imagination and experimentation to mine insights from data. The path from hypothesis to production-grade machine learning is not always straightforward. It often requires trying many different approaches, evaluating results, and carefully tuning a solution. A common pitfall is investing large amounts of time in research avenues that yield little benefit. The experimental process can be managed effectively with a life cycle that focuses on end value and rapid iteration:
1. Problem framing – Clearly define the outcomes you want up-front and a metric for measuring them.
2. Acquire and clean data – The development cycle starts with data and this is where you will have the most impact. Clean data creates clean insights.
3. Create features – Extract features and structure from your data that are most informative for your model.
4. Select a model – Choose an algorithm that performs well, mirrors the structure of your problem, and provides the required level of intelligibility.
5. Tune your model – Adjust the model’s architecture and hyperparameters to optimize its performance on your data.
6. Monitor and collect feedback – Deploy the model, first with a test set or population, then scale up. Record results and use them to go back to steps 2 through 5 and improve your solution.
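The steps above boil down to a measure-and-iterate loop: fix a metric up front, start from a trivial baseline, and keep only changes that improve the metric. A toy pure-Python sketch (the data, the threshold "models", and the helper names are all invented for illustration):

```python
def accuracy(predict, holdout):
    """Step 1: a single, clearly defined metric."""
    return sum(predict(x) == y for x, y in holdout) / len(holdout)

# Invented held-out data: the true label is 1 when the feature exceeds 5.
holdout = [(1, 0), (4, 0), (6, 1), (9, 1)]

# Step 4: the simplest possible model as a baseline (always predict 0).
best = lambda x: 0
best_score = accuracy(best, holdout)

# Steps 5-6: evaluate candidate thresholds; keep a change only if the
# metric objectively improves.
for t in range(10):
    candidate = (lambda t: lambda x: int(x > t))(t)
    score = accuracy(candidate, holdout)
    if score > best_score:
        best, best_score = candidate, score
```

In a real project the candidates would be data-cleaning steps, features, or hyperparameters rather than a threshold, but the discipline is the same: every change is judged against the metric defined in step 1.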
There are some key advantages to this approach. By defining a clear measure of value up-front, you can evaluate every change objectively and stop iterating when your returns diminish. Steps 2–6 form a development cycle that should be kept as short as possible. Usually I start with a simple solution, push it out to a test set, and begin iterating. This lets you set a baseline quickly and estimate future improvements or change priorities as needed. It also gets data scientists thinking about real-life deployment from the start.
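"Stop when returns diminish" can itself be made objective: track the metric across iterations and halt once the marginal gain drops below a floor. A hypothetical sketch (the scores and the 0.01 floor are invented):

```python
# Invented metric values from successive passes through steps 2-6.
scores = [0.60, 0.72, 0.78, 0.80, 0.805, 0.806]

min_gain = 0.01  # assumed smallest improvement worth another iteration
stop_at = next(i for i in range(1, len(scores))
               if scores[i] - scores[i - 1] < min_gain)
# Gains are 0.12, 0.06, 0.02, then 0.005 -- the first gain below the
# floor comes at iteration 4, so stop there.
```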
How to optimize the life cycle so analytics run smoothly
Moving quickly through the iterative data science life cycle requires support from domain experts to properly frame and evaluate problems, as well as from data engineers to quickly move code from experiment to deployment. Agile cross-functional teams or strong self-service data pipelines enable data scientists to rapidly transform business decision making with the power of data.
What are your business challenges? Let's talk through the solutions.