Is Your Machine Learning Model Bingeing on Junk Data?

Wouldn’t it be nice to just push a button and have all the data you want at your fingertips? In theory, that is the promise of synthetic data: to solve one of the biggest challenges of Machine Learning (ML), the collecting and labeling of relevant data, by generating data in a simulated environment and using it to train your ML models. But it is not easy to capture all the complexities of the real world, so synthetic data and simulation frequently fall short of their promise, and models trained on them perform poorly once released into the wild of real-world data. The accumulation of all the differences between real-world data and its synthetic counterpart is known as the Domain Gap. It can be large and multi-faceted, which makes it very difficult to identify and address the discrepancies that contribute to poor performance. Another way to look at this problem: training a model on poor-quality data, i.e., data with a large Domain Gap and a high percentage of irrelevant information, can cause the model to pick up “bad habits” that don’t transfer to real data.

If we are to embrace the promise of synthetic data, addressing the Domain Gap is a crucial step. While there is a range of currently favored approaches for closing the Domain Gap, at Duality we leverage high-quality Digital Twins. We believe this is a thoughtful, systematic and future-proof approach to generating high-quality synthetic data, which in turn yields an impactful and predictable return on data investment.

What are Digital Twins?

“A digital twin is a virtual representation of real-world entities and processes, synchronized at a specified frequency and fidelity.” Source: Digital Twin Consortium

Simply put, Digital Twins are highly realistic digital versions of real-world entities. The primary purpose of a Digital Twin is to accurately present the appearance, properties and behaviors of a physical object in a virtual setting. To achieve this goal, Digital Twin acquisition requires meticulous 3D modeling, high quality real-world gathered data that sufficiently describes essential aspects of the entity, and state-of-the-art physics engines to integrate it all together. To that effect, a Digital Twin can be generated from a boundless pool of real-world sources with ever increasing complexity: a single flower stem and a field of wildflowers can both be represented as different types of Digital Twins. We classify Digital Twins into three basic types: environments, systems, and items.

Environments are the encompassing surroundings in our domain of interest – they can be as broad as a forest or the streets of a city, or as narrow as a particular spot on a conveyor belt.

Systems are any entities that perform or exhibit behaviors in the environment.

Items are any non-functional objects or products that populate the environment and that systems can interact with.

Digital Twins for More Precise and Efficient Training

We have observed that highly realistic and relevant synthetic data can match and augment real world data leading to robust and deployable ML models. This implication, that the quality of our synthetic data helps predict successful ML model training, leads us to pose a question:

How do we quantify the realism and relevance of our Digital Twins, even before using that data to train a model?

To this end, we came up with three criteria to guide the creation of synthetic data. They are collectively referred to as “The Three I’s”: Indistinguishability, Information Richness, and Intentionality.

1. INDISTINGUISHABILITY

Indistinguishability: In this example, the suitcases aren’t identical (synthetic ones are mixed in with the real ones) and we cannot tell which is which.

The first step towards good synthetic data is minimizing the Domain Gap. Therefore, our synthetic data should strive to be indistinguishable from a real-world sample. It is not supposed to be identical, but it should be impossible to determine whether any given sample came from our simulation or from the real world. In other words, with equal amounts of real and synthetic data, an impartial algorithm sorting data as either ‘real’ or ‘fake’ should do no better than chance, i.e., be wrong about 50% of the time: the real-world samples should completely blend in with the synthetic ones. The higher our Indistinguishability rating, the more precisely our data will capture a specific scenario. We will expand on how we evaluate the Indistinguishability of Digital Twins later in this post.

2. INFORMATION RICHNESS

Information Richness: Generating new versions that we did not see in the wild. Here, the suitcases are covered in endless varieties of stickers and tags, in familiar and novel variations.

While synthetic data should be indistinguishable according to the metric outlined above, it also needs to be novel: to be useful, it needs to generate new information about a specific domain. The data should provide, for example, new perspectives, new angles, and new features that fill in the gaps of the real data. We don’t want to dilute the data set with redundant information, so each data point should be valuable and representative of the real-world scenario. If Indistinguishability allows for high precision, Information Richness allows us to accurately broaden the horizons of what our data can capture.

3. INTENTIONALITY

Intentionality: Identifying our domain of interest. For this hypothetical model, suitcases that feature the Duality logo sticker.

We are seeking a fundamental understanding of the data we are simulating and of which aspects of it are useful in our domain. In generating new data, we want to identify key items so that we can create variety in the most relevant variables. Through Intentional data, we define our domain of operation. In other words, even though we can create infinite amounts of variation in our synthetic data, not all variations are useful for improving the performance of an ML model. Simply introducing Information Richness without consideration for the use case, or relevance to the model, often yields results that are either negligible or potentially confounding. Thus, to create a robust ML model, we must make a clear decision on its intended domain of operation: which conditions are relevant? Are they variable or static? What edge cases are significant, and which ones can be ignored? Intentionality strives for a holistic understanding of what specific Information Richness to introduce. For any given use case, it can be viewed as the control lever for how far and in what directions we venture from our home base of real-world data.

“The Three I’s” are intertwined and interdependent. To help us visualize these abstract relationships we may imagine that any domain can be represented as a unique three-dimensional shape. Indistinguishability is the structural core of this shape, where our real and synthetic samples blend together. Information Richness is how much of the domain we fill, or all of the ways the shape can evolve away from the core. Intentionality is then the guide of this evolution, pruning the irrelevant and highlighting the valuable aspects, ultimately defining what the shape looks like. As we come to better understand our domain of interest, we are better able to dictate exactly what Information Richness is introduced, and the shape takes on a clear and defined form.

As an analogy, we can imagine building a house from a set of general blueprints. Much of the original structure would closely line up with what the blueprints dictate (Indistinguishability). It is how we know that our house is safe and functional. Information Richness, then, is all the ways we can evolve the construction: the materials we can choose, the rooms we can partition or add on, the appliances we can install, and so on. These are all things not expressly dictated by the blueprints, but guided by their foundational layout. Intentionality is how we determine which of those choices are right for our specific house: what features are important for our location? Or for the climate? Or for the number and type of occupants? All these choices evolve the house away from the original blueprints, but in ways intentionally consistent with their boundaries and our needs. To summarize this idea in the world of synthetic data: we build on top of our initial structure by making informed choices that intentionally increase Information Richness to guide the direction of our ML model.

Considering all of the above, we think it is fair to say that providing increasingly Indistinguishable, Information Rich and Intentional synthetic data is possible and necessary.

Evaluating Indistinguishability

As we mentioned earlier, high-quality synthetic data can match and augment real-world data in ML model training. Because high Indistinguishability correlates with better training results, we should quantitatively evaluate Indistinguishability before we subject any model to a particular data diet.

The first step in evaluating a Digital Twin scenario is to evaluate each Digital Twin individually. For example, if we are training an object detection model, we first evaluate the Indistinguishability of each item or system on its own, and then repeat the process in the intended environment. We will walk through the evaluation of an individual item Digital Twin, but the same process applies to Digital Twin systems and environments, as well as to Digital Twin systems and items in their environment.

Fig. 4 Example of Real Images and Digital Twins.

As shown in Fig. 4, we use images from the Digital Twin we made and the real-world object to evaluate their Indistinguishability. Our friends at Voxel51 have developed an open-source software tool called FiftyOne that supports visualization and analysis of data sets in machine learning. We leverage FiftyOne to facilitate the calculation and visualization of the Indistinguishability Score (we present an example below, but encourage anyone to try FiftyOne and our repo on their own data). In order to represent the visual aspects of the image as quantitative features that can be analyzed, the images are sent through a pre-trained convolutional neural network. From here, FiftyOne’s implementation of dimensionality reduction is used to visualize the data.

Fig. 5b Indistinguishable synthetic data

In Fig. 5, each data point represents a unique real (blue) or synthetic (red) image. Here, we are presented with equal amounts of real and synthetic data. If the data are indeed Indistinguishable, then the likelihood that the nearest neighbor of a random synthetic image is real should be 50%. In other words, the synthetic images are perfectly mixed in with their real-world counterparts, and the distribution of synthetic data matches the distribution of real-world data. Note that the real-world data is the same in Fig. 5a and 5b, so ideally its representation would not change between the two. However, dimensionality reduction methods such as t-SNE and UMAP fit the embedding iteratively on whatever data they are given, so the reduction has to be computed on the synthetic and real data together. This is why different synthetic data produces a different low-dimensional representation of the same real data. The technique is not tied to any specific dimensionality reduction method and can be used with others.
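The nearest-neighbor check described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from our repo: the function name `real_neighbor_rate`, the use of Euclidean distance, and the toy 2D points are all assumptions made for the example. In practice the inputs would be the CNN-derived features (or their low-dimensional representation).

```python
import math


def real_neighbor_rate(features, is_synthetic):
    """Fraction of synthetic samples whose nearest neighbor is a real sample.

    features:     list of equal-length numeric tuples, one per image
    is_synthetic: list of bools, True where the image is synthetic
    (Hypothetical helper; Euclidean distance is an assumption.)
    """
    synthetic_idx = [i for i, flag in enumerate(is_synthetic) if flag]
    if not synthetic_idx or len(features) < 2:
        raise ValueError("need at least one synthetic sample and two samples total")

    hits = 0
    for i in synthetic_idx:
        # Nearest neighbor among all other samples (the sample itself is excluded).
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )
        if not is_synthetic[nearest]:
            hits += 1
    return hits / len(synthetic_idx)


# Toy data: each synthetic point sits right next to a real one.
mixed = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
mixed_labels = [False, True, False, True]

# Toy data: the synthetic points form their own cluster far from the real ones.
separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
separated_labels = [False, False, True, True]

print(real_neighbor_rate(mixed, mixed_labels))
print(real_neighbor_rate(separated, separated_labels))
```

The brute-force search is quadratic in the number of samples, which is fine for a sketch; a spatial index would be the natural choice for larger data sets.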

In the cases when the data are not completely Indistinguishable, we can follow the same logic by generating a ‘data overlap value’. This value quantifies how much of the synthetic data actually overlaps with the real data and represents it as a succinct Indistinguishability Score. In this scenario, with equal amounts of real and synthetic data, the probability of a synthetic sample having a real-world sample as its nearest neighbor will be less than 50%. Once we have this probability, we can calculate how much the distribution deviates from fully Indistinguishable to get the Indistinguishability Score.

This same approach can also be extended to cases when there is more synthetic than real data, which just happens to be all the time! If we have 70% synthetic data and 30% real data, and the data are Indistinguishable, then the likelihood that the nearest neighbor of a random synthetic image is real should be 30%. Again, this probability is then converted to an Indistinguishability Score. The score now accounts both for large imbalances in the amounts of real and synthetic data and for data that are less than Indistinguishable, giving us a good estimate of the realism of our synthetic data.

In Fig. 5a and 5b, there are an equal number of synthetic and real images, which means that synthetic data should be nearest to a real data point 50% of the time. In Fig. 5a, this is observed 0% of the time, while it is observed 40% of the time in Fig. 5b. With these observed probabilities, we estimate that the data overlap is 0% and 80% for Fig. 5a and 5b, respectively. Check out our repo to test these methods on sample data or try it out on your own data.
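One way to read the numbers above is as a ratio: the observed probability that a synthetic sample's nearest neighbor is real, divided by the probability expected under perfect mixing (the share of real data). The sketch below assumes that reading. It reproduces the 0% and 80% overlap values quoted for Fig. 5a and 5b, but the function name, the exact formula, and the cap at 1.0 are assumptions for illustration and may differ from the implementation in our repo.

```python
def indistinguishability_score(observed_real_rate, real_fraction):
    """Data overlap estimate as observed / expected real-neighbor rate.

    observed_real_rate: measured fraction of synthetic samples whose
                        nearest neighbor is real (0.0 to 1.0)
    real_fraction:      share of real samples in the combined data set,
                        i.e. the rate expected if the data were Indistinguishable
    (Assumed formula; the cap at 1.0 is also an assumption.)
    """
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be strictly between 0 and 1")
    if not 0.0 <= observed_real_rate <= 1.0:
        raise ValueError("observed_real_rate must be between 0 and 1")
    return min(1.0, observed_real_rate / real_fraction)


# Balanced case from Fig. 5a and 5b: 50% of the data is real.
print(indistinguishability_score(0.0, 0.5))  # 0.0 (Fig. 5a)
print(indistinguishability_score(0.4, 0.5))  # 0.8 (Fig. 5b)
```

Under the same assumption, the 70% synthetic / 30% real case would be scored by passing `real_fraction=0.3`, so an observed rate of 30% maps to a score of 1.0.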

Fig. 6 Indistinguishability score (data overlap) vs mean Average Precision (mAP) for object detection. Note how a higher Indistinguishability Score predicts better model performance.

Fig. 6 demonstrates the relationship between the Indistinguishability Score of a Digital Twin and the performance of models trained on that Digital Twin. In our testing, we have observed a clear positive correlation between Indistinguishability Score and the performance of object detection models. What’s more, we can also see that a Digital Twin does not need to be perfectly Indistinguishable in order to yield significant benefits to an ML model. In fact, a score greater than 0.8 does not necessarily produce a further improvement. The reason is that Indistinguishability is not the only important factor: many factors impact the performance of models trained on synthetic data, and top among these is a direct tension between Indistinguishability and Information Richness.

While we are not diving into the specifics of Digital Twin acquisition in this post, it is important to remember that Indistinguishability is always rooted in measuring and analyzing a real-world sample, and is therefore an approximation of the relationship with the actual real world. Furthermore, even if we have access to complete real-world data, that does not mean it is the distribution we want. For example, we may want to oversample edge cases that are uncommon in the real world but very important for good training, so matching real-world data may not be the end goal. This is why all Three I’s are essential for optimal synthetic data. In the next post we will address these points and dive deeper into Information Richness and its tension with Indistinguishability. We’ll break down how we conceptualize Information Richness, why we believe it is useful, and how it compares to today’s dominant methods, to ensure the best possible data diet for any ML model.
