Insights
Can SORA’s Visually Amazing Videos Train Other AI Models?
March 7, 2024
· Written by
Apurva Shah

The video examples generated by SORA, recently showcased by OpenAI, are varied, impressive in their visual fidelity, and temporally coherent [1]. An astounding step forward in the quality of generative AI video output! Particularly interesting is the physical grounding evident in some of the clips, such as the jeep traveling along a hillside road kicking up dirt, and clips of dogs with realistic body articulation and fur simulation. The applications in entertainment and content creation are obvious and direct. However, OpenAI's technical report frames SORA as a step toward "video generation models as world simulators" [2][3]. This leads us to ask: Can SORA's output be used to train other AI perception models?

Video Example 1. Generated by SORA, the video of the SUV navigating on a dirt road exhibits photorealism and plausible physics

To answer this question, we have to clearly define the characteristics of synthetic data needed for AI model training, and contrast them with the kind of data that models like SORA can provide. Let's start with what we know about SORA.

SORA is a diffusion transformer model. In very simple terms, it is a hybrid of the diffusion models that generate images from text prompts, such as Stable Diffusion and DALL-E, and the massive attention-enabled transformers at the heart of modern large language models (LLMs), such as ChatGPT and Llama [4][5].
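To make the diffusion half of that hybrid concrete, here is a minimal NumPy sketch of the forward noising process from denoising diffusion models [4]. The tensor shape, noise schedule, and the toy "video" standing in for SORA's latent representation are all illustrative assumptions; OpenAI has not published SORA's internals.

```python
import numpy as np

# Toy stand-in for a video latent: (frames, height, width, channels).
# SORA's real latent space and patching scheme are not public.
x0 = np.random.rand(8, 16, 16, 3).astype(np.float32)

# Linear noise schedule (betas), as in DDPM [4].
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bars = np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of x0."""
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # the denoiser is trained to predict eps from (xt, t)

rng = np.random.default_rng(0)
xt, eps = forward_diffuse(x0, t=500, rng=rng)
print(xt.shape, float(np.std(xt)))
```

The generative model learns to run this process in reverse, step by step, turning noise into coherent frames conditioned on the text prompt.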

Vital to our question is OpenAI's belief that the massive amount of training data and the scale of SORA's model lead to emergent behaviors that include:

  • 3D consistency
  • Long-range coherence and object permanence
  • Interacting with the world
  • Simulating digital worlds

These characteristics are typically associated with physically based simulators — so it is nothing short of astounding that SORA’s generative model appears to have “learned” complex physical laws and their situational application in a data-driven manner!

It’s important to note the caveats that OpenAI acknowledges on the SORA product page:

“The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.

The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”

The research report further underscores:

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model — such as incoherencies that develop in long duration samples or spontaneous appearances of objects — in our landing page.

Video Example 2. "Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering." [Source: OpenAI]

Video Example 3. "Sora sometimes creates physically implausible motion." [Source: OpenAI]

While the question of using SORA videos for training other AI models is not domain specific, for the purposes of our speculation, we will focus on synthetic sensor data used in embodied AI and robotics applications. This is the focus of Duality’s work with our customers across defense and commercial teams. Our current approach, which often includes generating synthetic data for AI model training, leverages physically based digital twin simulation with Falcon [6].

Video Example 4. Generating synthetic data via physically based, deterministic digital twin simulation in Falcon. Digital twins of operational systems, equipped with virtual sensors (e.g. RGB camera, Lidar, GPS, and more) navigate customizable, photorealistic environments to generate any needed data.

From a simulation perspective, there are three key areas that need consideration:

  • Sensor simulation through advanced rendering. This includes RGB cameras, which is where SORA's video output has direct relevance. However, embodied AI and robotics applications also require additional sensor types, such as LiDAR, infrared, radar, GPS, and IMU. Sensor fusion, vital for a system's ability to operate in the real world, makes it necessary for the output from multiple sensors to be coherent and time synchronized (see the sketch after this list).
  • Physics of the system under consideration. This is primarily rigid body dynamics (RBD), but the environmental context may also call for soft body, fluid, and particle dynamics.
  • Highly specific and precise interactions. System validation and embodied AI require a closed perception-action loop, making it necessary for the embodied system to be able to impact agents and the environment through direct presence and in very specific ways [7][8].
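To make the synchronization requirement concrete, here is a minimal sketch of what a time-synchronized, multi-sensor sample might look like. The class and field names are hypothetical, not Falcon's actual API:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SensorFrame:
    sensor_id: str      # e.g. "rgb_front", "lidar_top", "imu"
    timestamp_s: float  # capture time on a shared simulation clock
    data: object        # image array, point cloud, IMU reading, ...

@dataclass
class SyncedSample:
    frames: Dict[str, SensorFrame] = field(default_factory=dict)

    def is_synchronized(self, tolerance_s: float = 1e-3) -> bool:
        """All sensors must report within `tolerance_s` of one another
        for downstream sensor fusion to treat them as one instant."""
        times = [f.timestamp_s for f in self.frames.values()]
        return bool(times) and (max(times) - min(times)) <= tolerance_s

sample = SyncedSample({
    "rgb_front": SensorFrame("rgb_front", 12.0001, data=None),
    "lidar_top": SensorFrame("lidar_top", 12.0004, data=None),
    "gps":       SensorFrame("gps",       12.0002, data=None),
})
print(sample.is_synchronized())  # True: spread is 0.3 ms
```

A generative video model produces only the RGB stream; nothing in its output guarantees that a hypothetical LiDAR or IMU stream for the same scene would agree with it, frame for frame.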

While SORA's cause-effect weakness is certainly relevant to dynamics, it is important to reiterate that photoreal rendering is also a physical simulation of electromagnetic waves and photons, with a similar dependence on cause and effect. For example, moving an object in front of a light source should cast an appropriate shadow. In a more complex instance, dimming a particular light within a room should cause its contribution to surfaces to be correspondingly reduced with a recalculation of its bidirectional reflectance distribution function (BRDF) response.
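As a toy illustration of how rendering enforces this kind of cause and effect, consider the simplest possible BRDF, a Lambertian (perfectly diffuse) surface. Real renderers use far richer reflectance models, but even this sketch shows that dimming a light must scale its contribution proportionally:

```python
import numpy as np

def lambertian_radiance(albedo, normal, light_dir, light_intensity):
    """Diffuse (Lambertian) reflection, the simplest BRDF: rho/pi.
    Outgoing radiance scales linearly with the light's intensity,
    so dimming the light must dim its contribution proportionally."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    cos_theta = max(0.0, float(n @ l))          # no light from behind
    return (albedo / np.pi) * light_intensity * cos_theta

albedo = 0.8
normal = np.array([0.0, 0.0, 1.0])
light  = np.array([0.0, 0.5, 1.0])

full   = lambertian_radiance(albedo, normal, light, light_intensity=10.0)
dimmed = lambertian_radiance(albedo, normal, light, light_intensity=5.0)
print(full, dimmed)  # dimmed is exactly half: a hard physical constraint
```

A physically based renderer satisfies this constraint by construction; a generative video model has to have learned it from data, and nothing forces it to hold exactly.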

Model Versus the Real World

Language can get pretty squishy when talking about modeling and simulation, especially since cognitive science, mathematics, physics, computer graphics and AI all have their own subtle nuances and assumptions. In order to clearly frame our fundamental questions, let’s zoom out first and establish what precisely is a model for our purposes and what is the ground truth we can use to objectively compare disparate models?

Fig 1. Determining the predictive value of a model. Given the same input conditions, the difference between the ground truth collected from the real world and the reality predicted by the model allows us to quantify the predictive value of that model. The smaller the difference between the two, the higher the predictive value of the model.

Based on observations of a real-world phenomenon, we can create a model that can predict (based on input specifications) what will happen in reality. These specifications can take the form of a set of initial conditions for a physically based simulator, or be presented as a natural language prompt, as done with generative AI.

Hypothetically, in the limit case, one could make a model that is an exact replica of the real world, down to the sub-atomic level, but such a model would provide no benefit over reality. The real world generally has a lot of extraneous information and dimensions that must be abstracted away so that the model can predict efficiently. This necessitates defining a domain over which the model is considered valid.

Finally, the predictive value of a model can be ascertained by looking at the delta between the actual data of interest produced from the real world, i.e. ground truth, and the predicted or synthetic data produced by our model given input conditions that are matched to reality over the domain of interest.
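One simple way to instantiate Fig 1 numerically is to score that delta as a root-mean-square error between matched ground truth and model output. RMSE is just our illustrative choice here; many other metrics would serve:

```python
import numpy as np

def predictive_value(ground_truth, predicted):
    """Quantify Fig 1: RMSE between real-world measurements and model
    output under matched input conditions. Lower RMSE means a smaller
    delta, and therefore higher predictive value."""
    gt = np.asarray(ground_truth, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((gt - pred) ** 2)))

# Toy example: a falling object's height sampled at 0.1 s intervals.
t = np.arange(0.0, 1.0, 0.1)
ground_truth = 10.0 - 0.5 * 9.81 * t**2          # measured trajectory
simulated    = 10.0 - 0.5 * 9.80 * t**2          # model with g = 9.80
print(predictive_value(ground_truth, simulated))  # small delta, high value
```

Note that this presumes we can extract comparable, measurable quantities from both sides, which, as we discuss below, is precisely where generative AI output gets difficult.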

With that grounding, we can frame our fundamental questions:

  • Is generative AI capable of high quality physical modeling?
  • What is the efficiency of using generative AI as a modeling approach, especially when compared to traditional methods such as ray traced rendering and RBD?

Predictive Quality

“What gets measured, gets improved.”
- Peter Drucker

Keeping the above quote in mind, the quality of model prediction can be measured in three ways over the domain of interest:

  1. Consistency
  2. Precision
  3. Explanative value

Physical phenomena are objective, deterministic, and generally continuous. They are subject to precise inputs and corresponding outputs rather than a probabilistic distribution. This makes measurement straightforward. But how do we quantitatively measure output from generative AI models, such as SORA, against physical ground truth to establish predictive quality?

For AI models, accuracy of model output in relation to ground truth test data (held out from training) has generally been used as the primary measure of predictive value. However, evaluating generative models, especially diffusion models, is very tricky since the input is specified as text, with its inherent ambiguity and symbolic abstraction [9]. Further, the output is generated as an image or video without any introspectable, explicit model structure that can be instrumented to measure physical properties such as mass, volume, velocity, acceleration, pressure, or temperature. Generative AI models are also subjective and do not have a fixed coordinate frame. For example, precise measurement of the distance between two objects in a diffusion image is not possible when the virtual camera position (the model's locus of observation) is itself uncertain.
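A small sketch makes the coordinate-frame problem concrete. Under the standard pinhole camera model, the pixel separation of two points depends on focal length and camera pose, so without known calibration (which generated video does not expose), the same physical metre can read as almost any pixel distance. The numbers below are purely illustrative:

```python
import numpy as np

def project(point_3d, focal_px, cam_pos):
    """Pinhole projection of a world point onto the image plane.
    Pixel coordinates depend on both focal length and camera pose.
    (Assumes the camera looks down +z with no rotation, for brevity.)"""
    p = np.asarray(point_3d, float) - np.asarray(cam_pos, float)
    return focal_px * p[:2] / p[2]

a = np.array([0.0, 0.0, 5.0])
b = np.array([1.0, 0.0, 5.0])        # exactly 1 m apart in the world

# The same metre reads very differently under two plausible cameras:
for focal, pos in [(800.0, (0.0, 0.0, 0.0)), (500.0, (0.0, 0.0, 3.0))]:
    px = np.linalg.norm(project(a, focal, pos) - project(b, focal, pos))
    print(f"focal={focal}, cam={pos}: separation = {px:.0f} px")
```

With a physically based simulator, the camera intrinsics and extrinsics are known inputs, so the inverse mapping from pixels to metres is well defined; with a diffusion model, they are latent and unrecoverable.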

Recent research attempts to solve this by recreating the output of a generative AI model in explicit and measurable form [10]. It is an intriguing direction and could ultimately intersect with physically based simulation approaches, but a true ground truth baseline may remain elusive.

Putting aside the inherent difficulty in quantitative measurement of generative AI model output, let us turn our attention to the fundamental capabilities of generative AI as a world simulator.

Given that AI models are based on neural networks, at first blush it would seem that AI broadly, and generative AI in particular, maps closely to what cognitive science refers to as mental models [11]. And certainly there are some obvious similarities, such as the ability to generalize and extrapolate; bias stemming from learning data or experience; and the ability to generate out-of-domain answers even if they have low predictive value.

Fig 2. Contrasting characteristics of Physical and Mental Models.

However, it is important to keep in mind that even the neural network representation is ultimately a mathematical model and the parallels to cognitive science shouldn’t be taken too literally.

When AI is viewed as mathematical modeling, a number of avenues open up for consistent and precise physical modeling. Some of the areas of active research include run-of-the-mill domain-specific tuning; reward functions that capture physical constraints; hybrid models that combine AI with physically based models; and augmenting model learning data and state with additional dimensions [12][13]. We must therefore conclude that AI models, as a subset of mathematical models, are absolutely capable of high quality physical modeling, at least in terms of consistency and precision.¹
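As a toy sketch of the hybrid direction, in the spirit of physics-based deep learning [12], one can add a physics residual to an ordinary data-fitting loss. Here a quadratic model of a falling object is fit to a few noisy observations while being penalized for violating h''(t) = -g; the model form, weights, and learning rate are illustrative assumptions:

```python
import numpy as np

# Fit h(t) = a + b*t + c*t^2 to noisy observations, while a physics term
# penalizes violations of free fall: h''(t) = -g, i.e. 2c = -9.81.
g = 9.81
t_obs = np.array([0.0, 0.2, 0.4, 0.6])
h_obs = 10.0 - 0.5 * g * t_obs**2 + np.random.default_rng(0).normal(0, 0.05, 4)

params = np.zeros(3)                      # [a, b, c]
lr, lam = 0.05, 1.0                       # lam weights the physics term

for _ in range(5000):
    a, b, c = params
    pred = a + b * t_obs + c * t_obs**2
    err = pred - h_obs
    # Gradient of data loss (MSE) plus physics loss lam*(2c + g)^2
    grad = np.array([
        2 * err.mean(),
        2 * (err * t_obs).mean(),
        2 * (err * t_obs**2).mean() + lam * 4 * (2 * c + g),
    ])
    params -= lr * grad

print(params)  # c should land near -g/2 = -4.905
```

The physics term pins the learned coefficient to the known law even when the data is sparse and noisy, which is exactly the constraint a pure data-driven generative model lacks.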

With the above in mind, we can put forth some speculation on the predictive value of generative AI models in terms of consistency, precision, and explanative value:

  • Consistency in statistical terms is measured as variance or standard deviation of a dataset. Recent research in text-to-video models has focused extensively on visual consistency, for example the persistence of characters and objects, especially when they are obscured for some frames before being revealed again [14]. However, for a world simulator, this focus must broaden to encompass the consistent application of the laws of physics. For example, we expect Newton's famous apple to drop under gravity. If the apple starts levitating instead, it is fundamentally inconsistent with the physical world. So, more broadly, we can define consistency as the ability of the model to follow physical laws and be semantically grounded. Inconsistent models will appear to be hallucinating, and as such, training data with these inconsistencies is likely to cause serious learning difficulties for any subsequent models that train on that data (a toy consistency check appears in the sketch after this list).
  • Precision was briefly discussed in the context of generative AI models being implicit, subjective, and without an introspectable model structure. When it comes to ground truth, it is important to differentiate between plausible, which is often the bar for current generative AI output, and precise. For example, pushing on a door causes it to open, a very plausible outcome. However, the number of degrees that the door actually opens is very much determined by the force applied to the door, its mass, and the friction of the hinges. Is a plausible rather than precise dataset sufficient for subsequent AI model training? Perhaps. But without knowing exactly what latent patterns an AI model may learn from, this could be a significant risk. In our work we have found that the more indistinguishable and precise the synthetic training data, the less likely it is to lead to overfitting problems and the more likely it is to remain model agnostic.²
  • Explanative value of AI models trails that of traditional approaches. The various equations that capture the laws of physics are boiled down to an essential set of parameters, and there is a clean and often intuitive relationship between them. This makes the physical model both definitive and comprehensive (over the domain). The same is not true of neural networks, with their data-driven modeling approach expressed as non-intuitive connections and weights. The difficulty of understanding AI models is further compounded by model creators, including OpenAI, not sharing their model structure or the basis of acquiring and processing their training data. While this is understandable from the standpoint of protecting their intellectual property and competitive advantage, it nonetheless deepens model opacity.

    Another practical aspect of explanative value is the labels and annotations that must accompany synthetic data to make it appropriate for AI training. For data collected from the real world, this is done by human annotators, often leading to cost, delays, and inaccuracies. Yet it is the reality of how most current AI models are bootstrapped. Unsupervised learning can fill some of the data gap; however, there is no good substitute for true ground truth annotations and labels [15]. Synthetic data from digital twin simulation is very attractive in this regard, since digital twins carry the necessary labels and semantic information and can produce accurate ground truth annotations for free [16].
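As promised above, here is a toy consistency check against one physical law. It assumes some upstream tracker has already extracted an object's height over time from the video, which, per the precision discussion, is itself the hard part:

```python
import numpy as np

def consistent_with_gravity(heights, dt, g=9.81, tol=0.5):
    """Check a tracked object's vertical trajectory against free fall.
    Estimates acceleration by finite differences and compares it to -g.
    Assumes `heights` were extracted by some upstream tracker."""
    h = np.asarray(heights, dtype=float)
    accel = np.diff(h, n=2) / dt**2          # second-derivative estimate
    return bool(np.all(np.abs(accel + g) < tol))

dt = 0.05
t = np.arange(0, 1.0, dt)
falling    = 10.0 - 0.5 * 9.81 * t**2        # obeys physics
levitating = np.full_like(t, 10.0)           # Newton's apple, hovering

print(consistent_with_gravity(falling, dt))     # True
print(consistent_with_gravity(levitating, dt))  # False
```

A real evaluation would need many such law-specific probes, each with its own extraction pipeline, which is why component-wise testing of generative output is so laborious.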

Predictive Efficiency

The efficiency of prediction can be decomposed into three components:

  1. Cost of model setup
  2. Cost of context setup
  3. Cost of model prediction

The laws of physics are well established and represent human intellectual investment that goes back many millennia. These are classical models whose data requirements range from none to a very minimal set for parameter tuning. AI model setup, however, requires extensive data gathering and retraining when either the training data or the model architecture changes in meaningful ways. Sourcing high-quality, comprehensive, annotated, and error-free training data is expensive in both time and resources. In fact, this is the primary motivation to look for synthetic data in the first place!

All AI models learn implicitly from data, with the general mantra being: "the more data the better." This is the reason why foundation models seem to exhibit unbounded improvement when trained on larger and larger corpuses of data. However, not all data is equal: data that is varied and plugs blind spots within the domain of operation is far more valuable than data points that are clustered together or out of domain. There are only two ways to get this data: either from the real world, or from other modeling approaches, which may cascade inconsistencies or imprecisions (more on this in the next section).
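One simple, hypothetical way to operationalize "varied data that plugs blind spots" is greedy farthest-point selection over some feature embedding of candidate samples. Real data-curation pipelines are far more sophisticated, but the intuition carries:

```python
import numpy as np

def greedy_coverage_selection(pool, k):
    """Greedy farthest-point selection: repeatedly pick the candidate
    farthest from everything chosen so far. Varied points that plug
    gaps get picked early; redundant, clustered points get picked last."""
    pool = np.asarray(pool, dtype=float)
    chosen = [0]                                # seed with the first point
    dists = np.linalg.norm(pool - pool[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))             # biggest remaining gap
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(pool - pool[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
clustered = rng.normal(0.0, 0.1, size=(50, 2))  # 50 near-duplicates
outliers = np.array([[3.0, 3.0], [-3.0, 2.0]])  # two rare scenarios
pool = np.vstack([clustered, outliers])

print(greedy_coverage_selection(pool, k=4))  # the two outliers rank early
```

Under this lens, a synthetic data source earns its keep by being steerable toward the gaps, not by producing more of what the corpus already contains.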

When it comes to robotics applications, we also have to consider sensors that operate outside the band visible to the human eye, itself a tiny sliver of the electromagnetic spectrum: ultraviolet; short-, mid-, and long-wave infrared; radar; and even acoustic modalities like sonar. The real world training data corpus for these sensor types is significantly smaller, especially in the public domain. Can generative AI turn to other forms of physically based simulation, such as Falcon, to close these data gaps?

The cost of context setup for traditional approaches, such as digital twin simulation, often boils down to creating a specific context that is particular to the application and data requirements. Since physically based simulation requires precise inputs, there is no direct mapping from natural language and, in general, little generalization. This is an area in which generative AI models truly shine: they are highly generalized and, furthermore, can take natural language prompts and example images or videos to establish context in an efficient and intuitive way for model users.

In terms of model prediction, there are some precision/quantization tradeoffs that can be made in any model, but all things being equal, it ultimately comes down to compute. Both physically based approaches and neural network processing are highly parallelized and can run on GPUs. Overall this is likely to be fairly comparable between different modeling approaches.³

There is also complex crosstalk between the quality and efficiency characteristics. For example, while an AI model can be tuned or otherwise constrained, this generally comes at the expense of generalization and introduces context setup cost.

Next Steps

Will generative AI subsume all other forms of modeling? While theoretically possible, it is more likely that generative AI becomes an all-encompassing interface for specifying the simulation context, marshaling the necessary models to create and process the requested data. But here we may be getting out over our skis — the first crucial step is to baseline where generative AI models are today when it comes to physical modeling, and compare their quality and performance against real world ground truth data as well as other physically based simulation approaches.

While it is possible to test generative AI output in a component-wise manner, such as looking at just the shadows or reflections, at Duality we use a more comprehensive approach, the 3i framework [17]. We evaluate synthetic data across three criteria: Indistinguishability, Information Richness and Intentionality. In our customer engagements, we have found this evaluation framework to be extremely helpful at predicting positive end-to-end outcomes that are both model and context agnostic. As soon as our team can access SORA, we are eager to begin this work!

To revisit an earlier word of caution: when AI-generated data is used to train other AI models, we also need to consider the carbon copy effect. Programmers are familiar with the cascading impact of imprecision that can snowball into wild numeric instability. Similarly, as each generation of AI model is trained on a subset of another AI model's output, how does the quality deteriorate? In the Middle Ages, before the laws of thermodynamics were firmly established, innovators were obsessed with finding a perpetual motion machine, until the uncompromising laws of physics effectively debunked the effort [18]. Will history look back on the current time as a quest for the perpetual data machine?
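A toy experiment illustrates the compounding: fit a model to a finite sample of its predecessor's output, make that fit the new data source, and repeat. Even for a single Gaussian, the fitted distribution drifts from the original, and its spread tends to collapse over generations. The numbers below come from this illustrative setup, not from any real model:

```python
import numpy as np

# Toy "perpetual data machine": each generation fits a Gaussian to a small
# sample drawn from the previous generation's model, then becomes the new
# data source. Finite-sample error compounds; the fitted spread drifts and
# typically decays toward zero over many generations.
rng = np.random.default_rng(42)
mu, sigma, n = 0.0, 1.0, 20   # generation 0: the "real world"

for gen in range(1, 101):
    sample = rng.normal(mu, sigma, size=n)   # train on synthetic output
    mu, sigma = sample.mean(), sample.std()  # refit and replace the source
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

The mechanism is generic: every generation can only inherit what the previous one sampled, so the tails of the distribution are the first casualties.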

Lastly, it is heartening to see the increased focus on (and daunting challenges of) the safety of generative AI: deep fakes, copyright violations, and ethical use. However, through the lens of embodied AI and robotics, safety is a much more literal concept: the lack of consistent and precise AI output can result in bodily harm or loss of critical services. It is fine for the braking distance of a car to look generally plausible in a video; however, a few decimeters can be the difference between a safe stop and a collision. There is no undo button in the real world!
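A back-of-the-envelope calculation with the standard constant-friction braking model, d = v²/(2μg), shows how little imprecision it takes:

```python
# Braking distance under constant friction: d = v^2 / (2 * mu * g).
# A "plausible-looking" error of a few percent in mu shifts the stopping
# point by whole decimeters, the safety margin discussed above.
g = 9.81
v = 50 / 3.6                      # 50 km/h in m/s

for mu in (0.70, 0.68, 0.65):
    d = v**2 / (2 * mu * g)
    print(f"mu={mu:.2f}: braking distance = {d:.2f} m")
```

At 50 km/h, shaving the friction coefficient from 0.70 to 0.65 lengthens the stop by roughly a metre, well past the margin between a safe stop and a collision.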

At Duality, we see a dynamic give-and-take between generative AI and digital twin simulation across model training and context creation, combining to produce the highest quality synthetic data for our customers, data that leads to safe and predictable AI models and robots. Ultimately, that is the only true measure of success.

Footnotes

  1. While completely tangential to our primary topic of discussion, this raises the question: why did our brain's mental modeling capabilities evolve the way they did? Could our cognitive mental models learn to be more physically grounded and still remain efficient enough to survive in the jungle? Did they plateau in their consistency and precision because we were already the apex species and there was no evolutionary benefit to going further?
  2. There is another important dimension to precision that has to do with the specific characteristics of a system or a sensor. For imagers, it is important to consider a specific camera’s intrinsic and extrinsic parameters for the resulting synthetic data to be meaningful. The precise parameter tuning might even have to consider a specific physical twin, such as a quadcopter with one of its rotors only performing at 95% of its specified RPM.
  3. While generative AI models are very intuitive at context building, human-computer interaction (HCI) patterns are still evolving. For example, for embodied AI applications with humans in the loop, interactivity becomes a prime consideration and the generative model can no longer "live inside its head." This brings with it a host of design and pipeline requirements that game developers have been tackling for decades.

References

[1] OpenAI, “Sora: Creating video from text.” https://openai.com/sora, 2024.

[2] OpenAI, “Video generation models as world simulators.” https://openai.com/research/video-generation-models-as-world-simulators, 2024.

[3] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun, “Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models.” https://arxiv.org/abs/2402.17177, 2024.

[4] Jonathan Ho, Ajay Jain, Pieter Abbeel, “Denoising Diffusion Probabilistic Models.” https://arxiv.org/abs/2006.11239, 2020.

[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need.” https://arxiv.org/abs/1706.03762, 2017.

[6] Duality AI, “Falcon: Digital Twin Simulation Platform.” https://www.duality.ai/product, 2024.

[7] Apurva Shah, “How Will Internet AI Crossover to the Physical World?” https://www.duality.ai/blog/embodied-ai, 2023.

[8] Felipe Mejia, “ViperGPT Takes a Walk in the Park: Evaluating Vision Systems for Embodied AI in FalconCloud.” https://www.duality.ai/blog/vipergpt-for-embodied-ai, 2024.

[9] Hugging Face, “Evaluating Diffusion Models.” https://huggingface.co/docs/diffusers/main/en/conceptual/evaluation, 2024.

[10] Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng, “Sora Generates Videos with Stunning Geometrical Consistency.” https://arxiv.org/abs/2402.17403, 2024.

[11] Ileana María Greca, Marco Antonio Moreira, “Mental, physical, and mathematical models in the teaching and learning of physics.” Science Education, Volume 86, Issue 1, Wiley, 2002.

[12] Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, Kiwon Um, “Physics-based Deep Learning.” https://www.physicsbaseddeeplearning.org, 2021.

[13] Rachel Gordon, “From physics to generative AI: An AI model for advanced pattern generation.” MIT News, September 27, 2023.

[14] Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, “Consistency Models.” https://arxiv.org/abs/2303.01469, 2023.

[15] L. Kryeziu and V. Shehu, "A Survey of Using Unsupervised Learning Techniques in Building Masked Language Models for Low Resource Languages," 11th Mediterranean Conference on Embedded Computing, 2022.

[16] Francesco Leacche, Roberto De Ioris, Amey Godse, Apurva Shah, “The Digital Twin Encapsulation Standard: An Open Standard Proposal for Simulation-Ready Digital Twins.” Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), 2023.

[17] Duality AI, “Is Your Machine Learning Model Bingeing on Junk Data?” https://www.duality.ai/blog/ml-synthetic-data-model, 2022.

[18] Paul Scheerbart, “The Perpetual Motion Machine: A Story of Invention.” Wakefield Press, 2011.