Is Your Machine Learning Model Bingeing on Junk Data? Part II

Part II: Information Richness

In Part I, we explored how we quantify whether our synthetic data is realistic enough using an Indistinguishability Score. Since we found a firm positive correlation between this score and how well Machine Learning (ML) models trained on our data perform in the real world, we need to know that our Digital Twins are as Indistinguishable as possible from their real-world counterparts before any training is done. But as we noted at the end of our last post (and emphasized in the “Three I’s”), Indistinguishability is only one part of the methodology that leads to good training outcomes. In this post, we explore the second key component: Information Richness.

The Indistinguishability Score alone is not sufficient to evaluate the quality of our data because we are not only trying to replicate real-world data, but also to generate new information beyond what is gained from a few real-world samples. If our data is Indistinguishable but provides no new information for the model, then the data is simply redundant, and what would be the point of generating it in the first place? A model needs to encounter a vast multitude of novel and diverse iterations of the data to prepare it for the unpredictable conditions it will encounter in the real world. Thus, variety in large-scale data sets is the platonic ideal of good synthetic data. At Duality, we believe that our approach to this synthetic creation of variety (i.e., Information Richness) adds true value to our more Indistinguishable Digital Twins.

What is Information Richness?

Simply put, Information Richness is a measure of novelty or uniqueness within a synthetic data set. In general, data does not need to be useful or realistic to be Information Rich; it just needs to be highly varied. For our purposes, we conceptualize it as the expansion of the data distribution from a real-world sample to fill in and expand a simulated domain. If appropriately applied, it allows us to broaden what our data can capture and to better simulate our domain of interest.

Recalling the Snickers candy bar example from the previous post, we generated images that were as Indistinguishable as possible from the real-world gathered images. This could have also been done by duplicating the real dataset, but this would not increase the Information Richness. To make a set more Information Rich, we must generate additional images that vary from the original real-world samples.

To name a few options, we can create images from different angles, change the lighting conditions, or even deform the Digital Twin itself. When we generate images that are visually close to our real-world sample we are filling in the gaps of our domain, as these are images that we could have likely observed by collecting more real-world data (Ex. 1). This is not unlike how a cartoon animator fills in all the in-between poses of a character between two keyframes. Alternatively, when we generate images farther afield from our real-world sample, ones that present conditions not commonly observed (Ex. 2), we are expanding our domain and creating all sorts of edge cases (realistic as well as extremely unrealistic). Both of these scenarios increase Information Richness, but the latter also decreases the Indistinguishability Score of the images, which we expand on in the next section.

Two schemes for generating Information Richness

Ex. 1: Filling in the gaps of the real-world sample.
Ex. 2: Highlighting and oversampling edge cases.

Not All Information Richness Is Created Equal

The current dominant approach for creating large, information-rich synthetic data sets for ML training is called Domain Randomization (DR). With DR, a set of parameters is randomly altered to create a much larger set of novel synthetic data. Randomness is the operative concept here: DR theoretically produces such a wide swath of cases that it blankets the blind spots we might have in our Domain Gap. In practice, this means that the immense data sets produced contain everything from data that is similar to the domain of interest to data so far removed from reality that the ML model would never actually encounter it. In between those two extremes, the model is theoretically exposed to a sufficient amount of relevant data that generally captures most real-world scenarios.
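To make the idea concrete, here is a minimal sketch of Domain Randomization as described above: every parameter is drawn independently and uniformly from a fixed range. The parameter names and ranges (`camera_azimuth_deg`, `light_intensity`, and so on) are hypothetical and chosen only for illustration; they are not taken from any particular simulator or from our pipeline.

```python
import random

# Hypothetical scene parameters and ranges (illustrative assumptions only).
PARAM_RANGES = {
    "camera_azimuth_deg": (0.0, 360.0),
    "camera_elevation_deg": (-10.0, 90.0),
    "light_intensity": (0.1, 10.0),
    "deformation": (0.0, 1.0),
}


def sample_scene(rng):
    """Draw one scene configuration uniformly at random from every range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def domain_randomize(n_scenes, seed=0):
    """Return n_scenes independently randomized scene configurations."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n_scenes)]


scenes = domain_randomize(1000)
```

Nothing in this sampler knows which combinations are plausible, which is exactly why plain DR yields both useful variations and configurations the model would never meet in practice.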

However, since these data are randomly generated, without regard for Indistinguishability or Intentionality, their Indistinguishability Scores tend to be quite low. Images of Snickers candy bars placed on the moon with astronauts (a hypothetical example of potential DR-generated data) certainly increase the Information Richness of a data set, but they also decrease the Indistinguishability Score as they don’t reflect the reality that our model will encounter in the real world. This crossing of the boundary between edge cases and impossibility contributes to inefficient training and is a simplified example of the natural tension that exists between Indistinguishability and Information Richness (Fig. 1).

Since Indistinguishability is always rooted in the measuring and analyzing of a real-world sample, the Indistinguishability Score of novel synthetic data is always tied to that sample and will decrease as soon as new data starts to drift away from the core distribution of the aforementioned sample. In other words, Information Richness that expands the domain is also more speculative. But this expansion of the domain is essential for capturing all the varieties of cases relevant to successful ML model training.

Fig. 1 The natural tradeoff between Information Richness and Indistinguishability

Does this mean that increasing Information Richness away from the core distribution creates less realistic synthetic data?

Not necessarily.

This tradeoff between Indistinguishability and Information Richness is neither universal nor zero-sum. Real-world data, if it could be gathered en masse, would always be highly Indistinguishable and highly Information Rich. Thus, we can postulate that many synthetically created samples should have real-world counterparts; they just weren’t the ones that we observed. Moreover, edge cases are by definition not common in the real world, but are crucial for successful training. An ideal training dataset over-represents these pivotal edge cases and in the process sacrifices some Indistinguishability, which shows that, for training efficacy, the emphasis on Indistinguishability has limits. So while we will always have to balance Indistinguishability and Information Richness in useful synthetic data, they are by no means direct opposites of one another.

Evaluating Information Richness

When we evaluate Indistinguishability, we use data clusters in which individual points represent ‘real’ and ‘synthetic’ images. The alignment of ‘synthetic’ to ‘real’ clusters is a key indicator of Indistinguishability: the higher the overlap between the clusters, the more Indistinguishable the synthetic data is from the real-world data. With Information Richness, we look at the area these clusters cover. The more expansive the ‘synthetic’ data cluster is, the more Information Rich the data set is. More simply put: Information Richness can be measured by the area of the cluster.

Let’s walk through an example! Just as with calculating the Indistinguishability Score, we start by applying a Convolutional Neural Network followed by dimensionality reduction. Here, we are again leveraging FiftyOne, an open-source software tool developed by our friends at Voxel51. This provides us with a 2-D representation of each image, as shown in Fig. 2.

Fig. 2 Not only does the dataset on the right have higher Indistinguishability but it also has higher Information Richness.

But as we stated above, for Information Richness evaluation, we are not interested in points but in areas. To calculate the area of the cluster, we must first assign an area to each point. To do this, we apply random noise to approximate variability in the images (Ex. 3). This is done once per image and the difference between the randomized image and the original image is then used to create a scatter plot as shown in Fig. 3. This scatter plot is made up of all the images and each point represents an image pair consisting of the original and randomized image. We then use this scatter plot to calculate an area radius by finding the median distance from zero of the points; in Fig. 3, the median distance from zero is 0.51.
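The radius calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our production code: `median_radius` is a hypothetical helper name, and the toy data adds Gaussian noise directly to 2-D points as a stand-in for embedding a noise-perturbed image, purely to keep the example self-contained.

```python
import math
import random
import statistics


def median_radius(original_pts, perturbed_pts):
    """Median Euclidean distance between each original 2-D point and its noisy twin.

    Subtracting the original from the perturbed point and measuring the
    distance from zero is the same as measuring the distance between the pair.
    """
    distances = [math.dist(o, p) for o, p in zip(original_pts, perturbed_pts)]
    return statistics.median(distances)


# Toy stand-in for 2-D embeddings of original and noise-perturbed images.
rng = random.Random(0)
original = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(200)]
perturbed = [(x + rng.gauss(0, 0.4), y + rng.gauss(0, 0.4)) for x, y in original]

radius = median_radius(original, perturbed)
```

Using the median rather than the mean keeps a handful of images that react strongly to the noise from inflating the radius assigned to every point.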

Ex. 3 Images of Snickers bar Digital Twins with random noise applied to approximate variability.

Fig. 3 Scatter plot of randomized real and synthetic images minus nonrandomized images. Center represents the location of nonrandomized images.

Fig. 4 Information Richness of synthetic and real data.

Once we have this radius, we can expand each of our original points into circles that approximate the area coverage of each image. Fig. 4 visually demonstrates how these points from Fig. 2 are now converted into areas. The aggregation of all the areas of individual circles creates a total area for the point cluster and we can then calculate the overlap between real and synthetic data as well as how much unique area synthetic data is providing. In Fig. 4, we can see that the synthetic data is providing greater Information Richness than the real data sample while still exhibiting a similar distribution, meaning that we have gained new information while still keeping a high Indistinguishability Score.
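One simple way to approximate these areas is to rasterize the circles onto a fine grid and count covered cells. The sketch below is an assumption about how such a calculation could be done, not a description of our implementation; the function names `covered_cells` and `coverage_report` and the `cell` size are hypothetical, and the result is only as precise as the grid resolution.

```python
import math


def covered_cells(points, radius, cell=0.05):
    """Return the set of grid cells whose centers lie within `radius` of any point."""
    cells = set()
    reach = int(math.ceil(radius / cell)) + 1  # how many cells to scan per side
    for x, y in points:
        ci, cj = math.floor(x / cell), math.floor(y / cell)
        for i in range(ci - reach, ci + reach + 1):
            for j in range(cj - reach, cj + reach + 1):
                cx, cy = (i + 0.5) * cell, (j + 0.5) * cell
                if (cx - x) ** 2 + (cy - y) ** 2 <= radius ** 2:
                    cells.add((i, j))
    return cells


def coverage_report(real_pts, synth_pts, radius, cell=0.05):
    """Approximate total, overlapping, and unique areas of two point clusters."""
    real = covered_cells(real_pts, radius, cell)
    synth = covered_cells(synth_pts, radius, cell)
    unit = cell * cell  # area of one grid cell
    return {
        "real_area": len(real) * unit,
        "synthetic_area": len(synth) * unit,
        "overlap_area": len(real & synth) * unit,
        "unique_synthetic_area": len(synth - real) * unit,
    }


report = coverage_report([(0.0, 0.0)], [(0.0, 0.0), (10.0, 10.0)], radius=0.5)
```

Because overlapping circles share grid cells, the union is counted once, so tightly packed near-duplicate points add almost nothing to the total area.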

Conversely, Fig. 5 illustrates the effect of adding completely random data to generate novel synthetic data. Here, the Information Richness is high but completely untethered from Indistinguishability, providing novel but less relevant data and highlighting why the “Three I’s” matter together.

Effect of Real-World Sample on Information Richness

Theoretically, Information Richness can increase until the performance of an ML model is 100% perfect. And while this is always the goal, there is a tradeoff between the relevant Information Richness that can be obtained from any specific real-world sample and the cost of collecting such data. As we add relevant data points to a large set, the uniqueness of each additional point eventually begins to decrease. But as we noted in the previous post on Indistinguishability, a Digital Twin does not need to be 100% Indistinguishable to be highly useful. Similarly, Information Richness does not need to increase indefinitely.

So, how do we know when a synthetic sample is sufficiently Information Rich? We can judge this by keeping a running sum of the uniqueness of all individual points: the sum will keep increasing appreciably as long as the new points are adding novel information. Once additional points are no longer adding new information, the sum hits a plateau and we know that adding more points is no longer increasing the Information Richness of our set (Fig. 6). This is likewise reflected in our area-based evaluation method: points that do not provide novel information only cover an area that has already been covered, and the total area of our cluster does not change.
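The plateau can be illustrated with the area-based view: track the total covered area as points are added one by one, and flag the point where recent additions stop contributing. This is a hedged sketch, not our actual stopping rule; the grid approximation, the helper names, and the `window` and `min_gain` thresholds are all assumptions chosen for illustration.

```python
import math
import random


def cells_for_point(x, y, radius, cell):
    """Grid cells whose centers fall inside the circle around (x, y)."""
    reach = int(math.ceil(radius / cell)) + 1
    ci, cj = math.floor(x / cell), math.floor(y / cell)
    return {
        (i, j)
        for i in range(ci - reach, ci + reach + 1)
        for j in range(cj - reach, cj + reach + 1)
        if ((i + 0.5) * cell - x) ** 2 + ((j + 0.5) * cell - y) ** 2 <= radius ** 2
    }


def cumulative_area(points, radius, cell=0.05):
    """Approximate total covered area after each point is added, in order."""
    covered, curve = set(), []
    for x, y in points:
        covered |= cells_for_point(x, y, radius, cell)
        curve.append(len(covered) * cell * cell)
    return curve


def plateau_index(curve, window=50, min_gain=0.01):
    """First index where the last `window` points grew the area by less than
    `min_gain` (as a fraction of the current area), or None if never reached."""
    for k in range(window, len(curve)):
        if curve[k] - curve[k - window] < min_gain * curve[k]:
            return k
    return None


# Toy data: points drawn from a bounded region, so coverage must saturate.
rng = random.Random(0)
points = [(rng.uniform(0, 3), rng.uniform(0, 3)) for _ in range(1000)]
curve = cumulative_area(points, radius=0.5)
stop_at = plateau_index(curve)
```

The curve never decreases, since a new point can only add cells, and how quickly it flattens depends on how narrowly the generator samples its domain.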

Fig. 6 Information Richness vs. dataset size. The larger the dataset, the less likely it is that additional data will be novel.

As with other aspects of synthetic data, this limitation stems from the inherent tethering of synthetic data to the real-world sample it is based on and the methodology by which it was collected. Different methodologies can yield samples that are less or more advantageous for generating greater Information Richness, and this consideration plays an important part in the last of the “Three I’s”: Intentionality.

Intentionality in Domain Randomization

Ultimately, all Information Richness, especially when produced by Domain Randomization, has value. But we believe that through a more selective and intentional use of Domain Randomization, we’ll be able to create truly useful Information Richness that streamlines training and increases performance of ML models. It is this Intentionality and how we use it to shape our data sets as well as how we balance Information Richness and Indistinguishability that we will cover in our next post.
