Data science has become an integral part of the fabric of every field from healthcare to marketing. Modern data scientists wear a number of hats, performing tasks such as creating algorithms that sort and organize key items, performing regressions and qualitative analysis on specific datasets, and determining the impact of different customer-facing decisions. They are stewards of arguably the most valuable asset of today’s organizations: information.
The title of data scientist hasn’t been around long, having been coined only in 2008, yet its holders must still overcome many issues when tackling typical projects. For example, a data scientist tasked with applying mathematical models to a dataset or creating a more efficient data pipeline might run into selection bias and anchor bias.
The stakes for recognizing and addressing these problems are high. Current and expected demand for data scientists is robust, with one tech firm expecting it to increase 28 percent from 2017 to 2020. Unfortunately, there is no equivalent of a Hippocratic oath (“do no harm”) in data professions, meaning it is usually up to practitioners to learn how to navigate ethical challenges and biases on their own. Let’s take a look:
Big data ethics
Questions about ethical data science often emerge in discussions of big data ethics. This is the domain centered on how information is collected, shared, and monetized at great scale, for example through third-party connections to a major platform or via a website or mobile application.
The many moving parts that connect different data systems, as well as the lack of visibility for end users into how their information is being handled, can lead to ethics issues, such as data leakage. Big data ethics rooted in transparency and confidentiality can help avoid these situations.
Data scientists must also devote attention to how the algorithms and infrastructures they oversee can institutionalize common biases including sexism and racism. One way this can happen is through sampling bias.
Say a data science team is working on facial recognition technology, but in assembling the necessary training set, the team didn’t produce a sample representative of the expected population. As a result, a scanner operating in real-world conditions might struggle with particular facial features or skin tones, reproducing bias at enormous scale.
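One simple safeguard is to audit the training set's composition before any model is trained. The sketch below is an illustrative example (not a Logic20/20 tool); the group names and the 5-percent tolerance are assumptions chosen for demonstration. It compares each group's share of the sample against the share expected in the real-world population and flags groups that are underrepresented:

```python
from collections import Counter

def representation_gaps(sample_labels, expected_shares, tolerance=0.05):
    """Return groups whose share of the sample falls short of the
    expected population share by more than `tolerance` (an absolute
    difference in proportion)."""
    total = len(sample_labels)
    counts = Counter(sample_labels)
    gaps = {}
    for group, expected in expected_shares.items():
        observed = counts.get(group, 0) / total
        if expected - observed > tolerance:
            gaps[group] = {"expected": expected, "observed": round(observed, 3)}
    return gaps

# A training set heavily skewed toward one group (hypothetical data):
labels = ["group_a"] * 80 + ["group_b"] * 15 + ["group_c"] * 5
expected = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}

print(representation_gaps(labels, expected))
# Flags group_b and group_c as underrepresented
```

A check like this won't catch every form of sampling bias, but it makes the gap between "the data we have" and "the population we serve" visible before it is baked into a model.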
Insufficient attention during testing and design of data science projects can also lead to unexpected results that alienate users. You might be familiar with the concept of the uncanny valley, which holds that robotic representations that look almost – but not exactly – like humans produce much more negative reactions than ones that are less or more realistic. With AI, something similar can occur with “creepy AI” that is neither highly intelligent nor quaintly limited.
Anchor bias is another concern, especially considering the vast scope of today's data science initiatives. When a team works with large amounts of information, it can be tempting for them to anchor their models to the first round of datasets they work with, which might create issues as the total repository of data keeps expanding and evolving. Applications might be built on assumptions that are already outmoded.
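A lightweight way to guard against this anchoring is to routinely compare incoming data against the original batch a model was built on. The sketch below is a minimal, hypothetical example (the threshold of two standard deviations and the sample values are assumptions, not a prescribed method): it flags drift when a new batch's mean wanders too far from the anchor batch's mean.

```python
import statistics

def mean_shift(anchor, current, threshold=2.0):
    """Flag drift when the current batch mean deviates from the
    anchor-batch mean by more than `threshold` anchor standard
    deviations. Returns (drifted?, z-score)."""
    mu = statistics.mean(anchor)
    sigma = statistics.stdev(anchor)
    z = abs(statistics.mean(current) - mu) / sigma
    return z > threshold, round(z, 2)

# The first dataset the model was anchored to (hypothetical values):
anchor_batch = [10, 11, 9, 10, 12, 10, 11, 9]
# A later batch drawn from the evolving data repository:
new_batch = [15, 16, 14, 15, 17]

drifted, score = mean_shift(anchor_batch, new_batch)
print(drifted, score)  # the shifted batch triggers the drift flag
```

In practice teams often reach for richer drift tests (distribution-level comparisons rather than a single mean), but even a simple check like this turns "our assumptions may be outmoded" into a measurable signal.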
The seeds of more ethical data science
More ethical data science begins with awareness of the above issues, bolstered by full documentation of data sources and processes to identify possible areas of risk. Similarly, data models should be regularly re-evaluated to ensure their transparency and functionality – they shouldn’t become black boxes, offering no indication of their possible problems until something really goes wrong.
The Logic20/20 team is committed to ethical data science that minimizes risk to your organization and maximizes your ability to get value from big data, analytics, and related projects. If you're a visual learner, see our infographic on this same subject here: https://www.logic2020.com/insight/infographic-the-biggest-data-science-ethics-and-biases-to-watch-out-for
What are your business challenges? Let's talk through the solutions.