Machine learning is still a human-intensive process. The majority of machine learning projects require labeling and augmenting raw data, and much of that heavy lifting is done by people. The metadata of machine learning training data, if you will, is born of people. Therefore, any playbook that speaks to data labeling, or its periphery, must be centered on people.
Data Labeling: A Human Endeavor
A survey from Alegion reported that upwards of 78% of teams have their AI/ML projects stall at some stage before deployment. In other words, most AI initiatives hit serious roadblocks, principally owing to a modern ordeal: training a model with relevant, labeled data. Labeling is a key challenge, with 71% of teams unable to label data in-house and ultimately opting to outsource the process.
Quality training data alone is insufficient for training highly effective models. To train a model, chiefly with supervised learning algorithms, the data must be supplied together with labels. Data labeling is the mapping of data points to their correct answers. It’s a crucial step in shaping the accuracy of a model. And it’s very much a human process.
Consider these three strategies in approaching data labeling:
- Be purpose-driven. Define the goals of a model, prior to training and labeling.
- Err on the side of being user-centric, not user-prohibitive.
- View ethical considerations as an integral part of the data labeling process, not as an afterthought.
Purpose as a Foundation
Labeling should be a purpose-driven process. There are several methods to label data (in-house labeling, crowdsourcing, freelancer outsourcing, and outsourcing to companies), and the quality and cost of the labels generated by each method are highly divergent. As a consequence, it’s imperative that purpose defines and drives this process.
Determine the application of the end model and its users, then work backward to answer one question: what level of model accuracy do my applications and users require? This answer is unique to each project. The required precision could be on the order of basis points or of several percentage points. Healthcare applications, for example, lean towards the former. Users drive the parameters required of a model and, as a corollary, drive the data labeling process.
User-Centrism as the Key to Quality
Err on the side of being user-centric, not user-prohibitive. To capitalize on the full capability of a model, it’s imperative to train with a growing dataset, not a fixed one, to thwart overfitting, adjust to evolving ambient variables, and generally create a more effective model. In a world of chiefly supervised learning algorithms, the key challenge is labeling growing datasets. In other words, an atypical approach is necessary.
Users are your most effective data labelers — both in cost and in the fidelity of labels. In creating more intelligent systems, it’s critical to understand, and leverage, human intuition.
High-quality data labeling can occur at the source, by understanding user intention and juxtaposing it with empirical, real-time data. Take a navigation application as an example. If a user uses a voice-to-text capability to navigate to a location whose name is difficult to pronounce, it’s likely the text generated will not be completely accurate.
Voice-to-text intelligence, in this example, can be augmented in a number of ways. One option is to prompt the user, either at the conclusion of the workflow or at arbitrary points, with a modal asking whether the generated text was accurate. This option is flawed, for two reasons. First, it is intrusive, degrading the experience of today’s users in the hope of improving the experience of tomorrow’s. Second, if a user perceives the modal to be intrusive, they will rush through it, choosing any answer indiscriminately, anything to get back to solving their real needs. In other words, efforts to automate data labeling with this method harm your product’s user experience and ultimately yield low-quality data labels.
Effective data labeling automation anticipates human intuition. It’s subtle and non-intrusive. With the voice-to-text capability in the navigation application, if a user receives an incorrect text generation, they are likely to immediately perform a remediation process: select the back button, reattempt another voice-to-text query, or forgo the capability and type in the query instead. Regardless of the process, understanding human incentives as well as anticipating human intuition and reactions to inaccuracy within a workflow can help with automated, high-quality data labeling.
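The remediation behaviors described above can be turned into labels without ever prompting the user. The following is a minimal sketch, not a definitive implementation: the event names (`back_pressed`, `voice_retry`, `typed_query`, `navigation_started`), the `VoiceQuerySession` structure, and the rule that the first decisive event wins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical event names; a real application would define its own.
REMEDIATION_EVENTS = {"back_pressed", "voice_retry", "typed_query"}
ACCEPT_EVENT = "navigation_started"


@dataclass
class VoiceQuerySession:
    transcript: str                    # text produced by voice-to-text
    events: List[str]                  # user actions after the transcript was shown, in order
    typed_text: Optional[str] = None   # what the user typed instead, if anything


@dataclass
class ImplicitLabel:
    transcript: str
    is_correct: bool
    corrected_text: Optional[str]      # only known when the user typed a replacement


def infer_label(session: VoiceQuerySession) -> Optional[ImplicitLabel]:
    """Infer a label from the first decisive user action (an assumed heuristic)."""
    for event in session.events:
        if event in REMEDIATION_EVENTS:
            # The user corrected course, so treat the transcript as wrong.
            corrected = session.typed_text if event == "typed_query" else None
            return ImplicitLabel(session.transcript, False, corrected)
        if event == ACCEPT_EVENT:
            # The user proceeded with the result, so treat the transcript as right.
            return ImplicitLabel(session.transcript, True, None)
    # No decisive signal: leave the sample unlabeled rather than guess.
    return None


accepted = infer_label(VoiceQuerySession("main street cafe", ["navigation_started"]))
rejected = infer_label(
    VoiceQuerySession("worcester shear", ["typed_query"], typed_text="Worcestershire")
)
```

Sessions with no decisive action are left unlabeled on purpose, since a guessed label would reintroduce the low-fidelity problem that intrusive prompts create.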
Ethics: An Integral Element
According to a January 2019 report by analyst firm Cognilytica, the market for third-party data labeling solutions was $150M in 2018 and is projected to grow to more than $1B by 2023. Data labeling, particularly outsourced labeling, is big business.
When opting to partner with third-party data labeling companies, it’s critical to choose a specialized labeling provider whose employees are treated and compensated equitably.
These specialized third-party companies move beyond rote labeling of basic elements such as a fruit or an animal and find a niche in specialized, nuanced domains: discerning features in the blood vessels of a human retina, for example. This makes sense for reasons beyond ethics. By reserving specialized third-party solutions for gathering intelligence on domains outside of yours and doing the bulk of data labeling in-house, you can integrate discipline into the model training process.
You can focus on user needs and the nuances of your workflows, and have the room to inject creativity into the process, such as building training mechanisms that anticipate human behavior. Viewing third-party labeling services as a way to augment your data labeling strategy, rather than a way to replace it, helps you focus on true pain points and increases both the fidelity of your training data and the overall efficacy of your models.