Data cleaning for machine learning success
From the 1960s’ Rosie the Robot to 2002’s Roomba robot vacuum, society has long appreciated the value of machines as cleaning assistants. As artificial intelligence has developed over the past forty years, its capabilities have blossomed, and we have realized the potential value of using machines to complete tasks in the virtual space. Machine learning allows us to gain insights and recognize patterns in new ways. With the advent of Big Data from social media, smartphones, the Internet of Things (IoT), and Industry 4.0, there’s not a second to spare.
Humans are producing unprecedented amounts of data every second of every day. For businesses, this data is a resource unlike any other – continually expanding, multi-faceted, and (in some cases) almost microscopically detailed. Unfortunately, this also means that there’s data of all kinds coming in from every direction, from accounting spreadsheets to geolocation data to real-time purchasing statistics for your latest product.
Types of data
Data can be broken out into two categories: structured and unstructured. Structured data includes all things already inside relational databases. The data is formatted in a way that both humans and machines can easily understand and manipulate. Relationships between pieces of data are inherent to how it’s been organized.
Unstructured data is a bit more complicated. This includes data that is not already related: social media posts, emails, chat messages, Word documents, weather patterns, buying history, and much more. If we were to plot the two types of data on a graph, structured data might be drawn as lines, while unstructured data might be plotted as single dots.
Technology exists to make sense of all types of data, and with machine learning, it’s faster than ever. There are algorithms for almost every scenario, whether using supervised, semi-supervised, unsupervised, or reinforcement learning. However, even the most advanced algorithms are useless if your data isn’t clean. Cleaning is often cited as taking up as much as 80% of the data analysis effort, but you must make sense of the data before you can do anything with it.
Why is data cleaning important?
Data cleaning is important because it helps:
1. Improve data accuracy. Accuracy covers a variety of elements including formatting, duplication, consistency, and validity.
2. Increase data readability. Sometimes data is generally accurate but not as readable as it could be. Cleaning up the formatting before processing allows you to adjust capitalization issues, eliminate spelling errors, and more. This also saves time for any humans later working with the data.
3. Identify problem areas. As you process data of all types, patterns may emerge about how data is created, allowing you to adjust sources you have control over and reduce the likelihood of incorrect or corrupt data in the future.
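The formatting cleanup described in point 2 can be sketched in a few lines of Python. This is a minimal illustration, not a complete solution: the function name normalize_text and the sample city values are hypothetical, and it assumes that trimming whitespace and applying title case is the right convention for the field in question.

```python
import re


def normalize_text(value: str) -> str:
    """Trim the value, collapse runs of whitespace, and apply title case."""
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.title()


# Hypothetical raw entries: the same city typed in different ways.
raw_cities = ["  seattle", "SEATTLE ", "Seattle", "new   york"]

clean_cities = [normalize_text(city) for city in raw_cities]

# After normalization, the variants collapse into one consistent spelling.
distinct = sorted(set(clean_cities))
print(distinct)
```

Title case is only one possible convention. It mishandles some values (names containing apostrophes, for instance), so a real pipeline would likely choose its normalization rules field by field.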
What does data cleaning involve?
The process of data cleaning depends on the data type. For structured text data, it’s important to:
• Check for missing or extra data. This involves imputing missing values and removing duplicates.
• Validate the data. If you’re looking at measurements that should fall within a known range, does each value fit?
• Check for errors and typos. Humans can easily recognize words even when they’re misspelled, but for machine learning algorithms, errors and typos present a bigger challenge. Consistency is key.
• Confirm that the data fits the data type. This involves ensuring that values in typed columns are not misinterpreted. For example, a text string of “1, 2, 3” should not turn into the number “123.”
• Identify outliers. Anomaly detection helps you to define what “normal” means for your dataset, as well as note factors that may lead to unusual data.
• Transform the data. To ensure your data is portable and transferable, you should reformat your data frame into something that other tools (open source tools, for example) can consume. For instance, you could reformat dates to make them machine readable or remove specific characters that don’t have Unicode equivalents.
• Automate what you can. Some stages of data cleaning can be automated, but make sure you set up proper monitoring and alerts so you are notified when data cleanup needs human intervention.
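Several of the steps above can be sketched together using only the Python standard library. The records, field names, valid range, and date format below are all invented for illustration. The sketch uses a simple range check in place of statistical anomaly detection and median imputation in place of more careful methods, so treat it as a starting point under those assumptions.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; every field arrives as a string.
raw_rows = [
    {"id": "1", "temp_c": "21.5", "measured": "03/01/2021"},
    {"id": "2", "temp_c": "", "measured": "03/02/2021"},      # missing value
    {"id": "2", "temp_c": "", "measured": "03/02/2021"},      # exact duplicate
    {"id": "3", "temp_c": "22.0", "measured": "03/03/2021"},
    {"id": "4", "temp_c": "540", "measured": "03/04/2021"},   # outside valid range
    {"id": "5", "temp_c": "20.5", "measured": "03/05/2021"},
]

VALID_RANGE = (-60.0, 60.0)  # assumed plausible bounds for this measurement


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def convert_types(row):
    """Parse strings into typed values; empty strings become None."""
    return {
        "id": int(row["id"]),
        "temp_c": float(row["temp_c"]) if row["temp_c"] else None,
        # Transform US-style dates into machine-readable ISO 8601 strings.
        "measured": datetime.strptime(row["measured"], "%m/%d/%Y").date().isoformat(),
    }


def clean(rows):
    """Return (cleaned rows, rows flagged for human review)."""
    typed = [convert_types(row) for row in deduplicate(rows)]
    low, high = VALID_RANGE
    # Validate: set aside values outside the expected range instead of guessing.
    flagged = [
        r for r in typed
        if r["temp_c"] is not None and not low <= r["temp_c"] <= high
    ]
    valid = [r for r in typed if r not in flagged]
    # Impute: fill missing values with the median of the valid observations.
    fill = median(r["temp_c"] for r in valid if r["temp_c"] is not None)
    for r in valid:
        if r["temp_c"] is None:
            r["temp_c"] = fill
    return valid, flagged


cleaned, needs_review = clean(raw_rows)
print(len(cleaned), "rows cleaned;", len(needs_review), "flagged for review")
```

Returning the out-of-range rows separately, rather than silently dropping them, reflects the last bullet: the automated steps run unattended, while anything suspicious is routed to a person.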
For other types of data, like unstructured weather data, the cleaning process becomes more complicated. Using a data lake can simplify it, since diverse data types can be stored in one scalable, centralized location.
However, no matter the data you have, it’s important to ensure it’s relevant to the question(s) you want answered. For unstructured data, the core goals of the data cleaning process remain the same:
2. Ensure data consistency to increase comprehension
3. Adhere to data governance best practices
Machine learning can do amazing things, and perhaps someday it will be able to clean data as well as analyze it. Cleaning data is an important step in the analytics journey—and we’re here to help.