Digital Twins
Is Your Machine Learning Model Bingeing on Junk Data? Part III
July 12, 2022
Written by Duality Robotics

Part III: Intentionality          

In Parts I and II, we discussed what it means to generate Information Rich synthetic data based on highly Indistinguishable samples generated from a Digital Twin. We have shown that we can determine how Indistinguishable our synthetic data is from the original sample (Part I) and that we can quantify novelty compared to the real-world collected data (Part II). But while Indistinguishable and Information Rich data is a good baseline, the true value of our synthetic data is determined by its relevance, and ensuring this relevance requires a careful shepherding framework, which we refer to as Intentionality.

At its core, Intentionality stems from the ever-present awareness that any Machine Learning (ML) model trained on our data must function in the nuance and chaos of the real world and not just in the customizable conditions of a simulation. And while possibilities of how to evolve Information Richness are practically boundless, they can also be operationally overwhelming. Unfortunately, a one-size-fits-all approach often results in all being fit rather poorly. Intentionality, simply put, is the tailoring of our synthetic data to the specific real-world problem itself.

How Do We Define Intentionality?

Intentionality is how we define our Domain of Operation after thoroughly understanding the Domain of Interest. With a careful consideration of the realities in which an ML model will be expected to reliably perform, Intentionality is how we choose what variations in our data to keep, to emphasize, or to prune. A Digital Twin can be used to generate very different Intentional data sets depending on what features may or may not be relevant to the specific ML model being trained. Intentionality strives for the holistic understanding of what specific Information Richness to introduce and can be viewed as the control mechanism for shaping novel synthetic data away from the distribution of the real-world sample and towards one more advantageous to robust training.

We mentioned in the previous post that Information Richness that expands the domain is also inherently more speculative. Intentionality functions as a fulcrum to balance this speculation against Indistinguishability: keeping our variations as realistic as possible, while allowing useful aberrations (such as a higher representation of edge cases than in the real-world sample) to persist.

Intentionality is the primary guiding principle of how we think about the relevance of synthetic data generation, and it shapes every step of our process.

How Does Intentionality Work In Practice?

We begin with a deep dive into a use case. To learn all about a domain, we rely on experts and studies in the field, the customer’s expertise, and our own research. As we build our understanding, we begin to define our domain, and key criteria start to emerge. This leads to a series of questions: Which conditions are pertinent, and are they variable or static? Which contexts and environments are to be expected, and which are irrelevant? What edge cases are significant, and which can be ignored? This process is frequently iterative and persists until we have sufficiently delineated the features vital to our data set.

For an example of this variety of conditions, consider an ML model used to sort inventory in a retail supply chain. While this sounds like a fairly well-defined task, significantly different data could be needed if the supply chain serves a large department store versus a highly specialized boutique. The department store incorporates a greater variety of objects, with a higher number of permutations, in more complex and variable contexts, all while being more likely to experience frequent inventory fluctuations and turnovers. The patterns of item stocking, the types of changes, and the varieties of human error may also differ between these two environments, necessitating a further emphasis on different types of edge cases. An Intentional data set should integrate these nuances to the best of our ability, avoiding the confounding issues often caused by less relevant information.

Fig. 1 Even though both businesses carry chocolate, the supply line context of the grocery store is more complex, and requires a more diverse data set for successful training.

Types of ML tasks similarly drive fundamental choices about the scope of any data set. For example, a Classification scenario will have different synthetic data needs than an Object Detection scenario. Any form of Classification model seeks to identify the presence of an already familiar object in a novel image. This means that the training and testing data should be photorealistic and should generally avoid variations that stray from these criteria (indicating a narrower domain of interest). Alternatively, a Detection scenario requires recognizing unknown and novel objects in unpredictable conditions and thus features a broader domain of interest (Fig. 2).

Fig. 2 A Classification scenario presents a narrower domain (left) than that of an Object Detection scenario (right)

This broader domain of interest in part arises from the high unpredictability of the task, but it also incorporates the higher utility of non-photorealistic images for a Detection model (data simulating input from various kinds of sensors, random variations, etc.). These are just two simplified examples of possible ML models, and they will likely change with time. What is truly essential for good Intentional data is the process of integrating the understanding of any given task into the optimal data set created for that scenario.

Concurrently with understanding our domain of interest, we build a Digital Twin using a large set of real-world images. Once data generated from this Digital Twin passes the Indistinguishability test, we are ready to Intentionally detach from the real-world sample using all the considerations we enumerated above. In generating this Information Richness, we may find some parallels to Domain Randomization (DR), where we are also looking for randomized variety. However, unlike standard DR, we are looking to create a bounded variety in very specific variables. This is where we draw on our study of the use case to identify points of high variation that may not be captured in the real-world sample. This may range from common adjustments, such as moving away from specific lights or camera lenses, to less predictable features, such as variable backgrounds or objects.
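To make the contrast with standard DR concrete, here is a minimal sketch of bounded randomization. All parameter names and ranges are illustrative assumptions, not our actual pipeline: the point is simply that each variable is drawn from a range justified by the use-case study rather than randomized freely.

```python
import random

# Hypothetical, simplified sketch of bounded Domain Randomization:
# each scene parameter is constrained to a range justified by the
# use-case study, instead of being randomized without limit.
BOUNDED_RANGES = {
    "light_intensity": (0.6, 1.4),  # relative to the observed baseline
    "camera_focal_mm": (24, 50),    # lenses plausible for this deployment
    "background_id": (0, 9),        # a curated set, not arbitrary scenes
}

def sample_scene_config(rng: random.Random) -> dict:
    """Draw one scene configuration within the Intentional bounds."""
    return {
        "light_intensity": rng.uniform(*BOUNDED_RANGES["light_intensity"]),
        "camera_focal_mm": rng.uniform(*BOUNDED_RANGES["camera_focal_mm"]),
        "background_id": rng.randint(*BOUNDED_RANGES["background_id"]),
    }

rng = random.Random(42)
configs = [sample_scene_config(rng) for _ in range(1000)]
# Every sampled value stays inside its bound, unlike unconstrained DR.
assert all(0.6 <= c["light_intensity"] <= 1.4 for c in configs)
```

In practice, the interesting work is in choosing the bounds themselves, which is exactly where the use-case study feeds in.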

What emerges is a large synthetic data set, Intentionally tailored to the scenario and to the model’s functioning in that scenario. This means that training on this data will maximize the model’s exposure to error-causing phenomena and minimize instances that have no bearing on its performance.

A (Suit) Case Study

Let’s revisit our suitcase illustrations from the previous two blog posts. For simplicity, we can imagine two scenarios:

(A) where we need to detect defects on newly manufactured suitcases and (B) where we need to sort customers’ own suitcases for shipping or transportation.

Fig. 3 The narrow scope of the defect detection task is reflected in the smaller, more specific Domain of Operation necessary for this task. A good understanding of possible defects lessens the reliance on randomized data.

In the manufacturing defect detection scenario, we already know what our perfect suitcases should look like, and we can learn common patterns of manufacturing errors from the customer. We also know that we have a relatively predictable production line environment and comparatively low variability in the non-defective products. With good quality real-world images, we can be confident of generating highly Indistinguishable data. As we evolve this data to be more Information Rich, a domain with such a defined scope allows us to set up “guarantees”, or statistical rules for how often various events are expected to occur in our data. In this case, the guarantees would include the defects as we expect to find them in the wild, as well as an overrepresentation of those defects that can be hard to spot.

As we focus on oversampling and increasing the variation of images that show defects in the suitcases, as well as images that could be falsely identified as defective (due to lighting artifacts, lens aberrations, etc.), what emerges is a relatively small domain of interest, and the distribution of these images should not deviate too greatly from the real-world sample.
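One way to picture these “guarantees” is as target rates for each event class in the generated set. The class names and rates below are hypothetical, chosen only to illustrate how a rare but hard-to-spot defect could be deliberately overrepresented relative to its real-world frequency:

```python
import random
from collections import Counter

# Hypothetical "guarantees": target rates for how often each event
# class should appear in the generated data set. Hard-to-spot defects
# are deliberately overrepresented versus their real-world frequency.
GUARANTEES = {
    "no_defect": 0.50,
    "scratch": 0.20,           # common, easy to spot
    "hairline_crack": 0.20,    # rare in the wild, oversampled here
    "lighting_artifact": 0.10, # false-positive lookalikes
}

def sample_labels(n: int, rng: random.Random) -> list:
    """Draw n sample labels according to the guaranteed rates."""
    labels, weights = zip(*GUARANTEES.items())
    return rng.choices(labels, weights=weights, k=n)

labels = sample_labels(10_000, random.Random(7))
observed = {k: v / len(labels) for k, v in Counter(labels).items()}
# Observed rates should track the guarantees to within sampling noise.
assert abs(observed["hairline_crack"] - 0.20) < 0.03
```

A real pipeline would enforce these rates on rendered images rather than bare labels, but the accounting is the same: the guarantees define a distribution, and the generated set is checked against it.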

Fig. 4 In the shipping-sorting scenario, an ML model needs to identify suitcases (and reject non-suitcases, as shown by the flagged box) as well as their various properties. This presents a much larger, harder-to-define domain, requiring a significantly more varied and randomized data set.

In the shipping-sorting scenario, we will run into a much greater variety of situations that are not likely to be captured in a real-world sample. In fact, the variety of suitcase sizes, forms, colors, and states of wear is so great that we do not even know what the domain really looks like. Unlike the previous scenario, there are no guarantees that we could assign here since the scope of possibility tells us that we do not even know what we do not know about this domain. To help us bridge this Domain Gap, we need a significantly more Information Rich data set, one that will drift increasingly further from the real-world sample. We need to include a bounded but large variety of randomized images, along with non-photorealistic results to help hone any identification schemas. We will also need to include shipping labels with an emphasis on their variations. If this sorter is expected to run in different facilities, we will need greater variety in our backgrounds and lighting conditions. As we enumerate all these options, it quickly becomes evident how this data set will evolve in a significantly different direction than the defect detection data, encompassing a much greater domain.

Even though both of these hypothetical scenarios deal with suitcases on a conveyor belt, Intentionality guides us to create two very different synthetic data sets. One is narrow and targeted, concerned with only relevant aberrations on a suitcase, while the other is broad and varied, aimed at fundamentally describing what a suitcase is. There is a vast gap between these scenarios and their synthetic data needs, and Intentionality helps us design the most efficient option for each one.
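The contrast between the two suitcase scenarios can be sketched as two generation configurations. Every field name and value here is an illustrative assumption; the takeaway is simply that the broader domain of the sorting task shows up directly as a larger configuration space:

```python
# Hypothetical generation configs for the two suitcase scenarios.
# All field names and values are illustrative assumptions.
DEFECT_DETECTION = {
    "backgrounds": ["production_line"],  # one controlled setting
    "photorealistic_only": True,
    "suitcase_models": 3,                # known product lines
    "randomized_axes": ["lighting", "lens"],
}

SHIPPING_SORTING = {
    "backgrounds": ["facility_a", "facility_b", "randomized"],
    "photorealistic_only": False,        # sensor-style renders too
    "suitcase_models": 500,              # sizes, forms, wear states
    "randomized_axes": ["lighting", "lens", "pose", "labels", "occlusion"],
}

# The broader sorting domain is visible as more variation axes.
assert len(SHIPPING_SORTING["randomized_axes"]) > \
       len(DEFECT_DETECTION["randomized_axes"])
```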

The Three I’s Together

As we move into the future, the ubiquity of ML models is becoming an undeniable reality. And while we will increasingly use and experience these models in the real world, the comprehensive training they require can only happen in a synthetic one. The quality of these synthetic worlds, and of the data they yield, will always determine how reliably we can train our present and future models.

Indistinguishable. Information Rich. Intentional. These three benchmark descriptors are the essential criteria by which we quantify the realism and relevance of our Digital Twin generated data.

High Indistinguishability roots our data in the reality of a concrete and observed scenario, while Information Richness allows us to mindfully broaden the horizons of our data beyond what was strictly observed. The linchpin of it all is Intentionality, directing the evolution of our data, shaping it to reflect the realities and requirements of any specific scenario – concretely defining our Domain of Operation.

“The Three I’s” together comprise a deeply interwoven methodology that helps us balance all the aspects of evolving, complex data sets: keeping them as realistic as possible, while optimizing them for each novel situation and, ultimately, decreasing the Domain Gap. This approach also helps make novel synthetic data future-proof, since data that follows “The Three I’s” framework is not tied to any specific ML model but is tailored to the use case and domain it represents. As long as this data is of a high quality, any future ML model can benefit from it. And since we are sourcing our data from an evergreen and ever-growing library of Digital Twins, the possibilities and scope of what we can capture will only increase.

The data diets of our ML models matter progressively more, and we need a reliable framework for assessing the quality of synthetic data that in turn will yield impactful and predictable return on data investment. We believe that “The Three I’s” framework is a valuable recipe for combating junk data, ensuring relevant training for future ML models, and bringing us significantly closer to unlocking the true potential of synthetic data.