Prompts to Execution — The Task Coding Framework for Large Language Models in Robotics
May 14, 2024
· Written by
Felipe Mejia
Francesco Leacche
Mish Sukharev

This blog presents a high-level overview of what Task Coding is and how it can be leveraged to bring the benefits of Large Language Models to robots, while keeping performance reliable and predictable. For a more in-depth and technical dive, read the entire white paper: The Task Coding Framework for Leveraging Large Language Models in Robotics.

In a previous blog we talked about how digital twin simulation can bridge generative AI models into the real world. In that blog we outlined how Large Language Models (LLMs) and multimodal models are already sufficiently capable to go after autonomy problems that have plagued smart systems. While autonomous robotics systems have made significant advancements in recent years, they are generally hitting a developmental ceiling due to their somewhat brittle nature in the following areas:

  • Adaptability: Robots are generally designed for specific environments and conditions — also known as their Operational Design Domain (ODD). Unexpected changes (such as lighting conditions or obstacles) can challenge their ability to perform tasks correctly. Increasing adaptability means improving the robot's ability to tackle unfamiliar environments and situations by using general reasoning to learn and extrapolate from past experiences, and functionally expanding the ODD.
  • Generalization: As robots are designed for specific tasks or a range of related tasks, they often struggle to generalize their knowledge or skills to entirely different tasks without significant reprogramming. Improving generalization would enable robots to operate in a completely novel ODD.
  • Long Horizon Tasks: Highly complex tasks that require a deep understanding of the environment and nuanced decision-making can still challenge modern robotics systems.

LLMs with well-honed prompt design offer promising solutions to the above challenges. These models, characterized by their large parameter counts and training datasets, possess an intricate grasp of human language, enabling them to produce coherent and contextually relevant sentences and even emulate specific writing styles. However, the prowess of LLMs isn't confined to natural language alone: they can also write and understand different coding languages, as widely adopted LLM applications for co-piloting and code generation already demonstrate. When effectively prompted, this enables LLMs to provide executive function to autonomous robotic systems. In such scenarios, LLMs can bring their vast knowledge and reasoning about the world to make informed decisions and execute tasks based on prior knowledge, with high levels of adaptability to complex, evolving situations.
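To make this pattern concrete, here is a minimal sketch in Python of an LLM emitting task code that is executed against a whitelisted robot API. All of the names here (`RobotAPI`, `execute_task_code`, the command set) are our own illustrative assumptions, not Falcon's actual interface:

```python
# Sketch of the task-coding pattern: LLM-generated code is executed
# against a small, whitelisted command set, keeping behavior predictable.
# RobotAPI and execute_task_code are hypothetical names for illustration.

class RobotAPI:
    """The only commands the generated code is allowed to call."""
    def __init__(self):
        self.log = []

    def move_to(self, target):
        self.log.append(("move_to", target))

    def inspect(self, target):
        self.log.append(("inspect", target))

def execute_task_code(code: str, api: RobotAPI):
    # Run the generated code in a namespace that exposes only the
    # robot API, so it cannot reach arbitrary builtins.
    namespace = {"robot": api, "__builtins__": {}}
    exec(code, namespace)

# In a real system this string would come back from the LLM; here it
# simply stands in for generated task code.
generated = "robot.move_to('pole_3')\nrobot.inspect('pole_3')"
api = RobotAPI()
execute_task_code(generated, api)
print(api.log)  # [('move_to', 'pole_3'), ('inspect', 'pole_3')]
```

Because the generated code can only call the whitelisted commands, the robot's action space stays bounded even when the LLM's reasoning is open-ended.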

However, LLMs do not possess an inherent understanding of the physical world. They don’t understand its limits, are frequently unpredictable and given to hallucinations, and they often replicate and amplify biases present in training data. This can lead to expensive and even dangerous errors, especially when an LLM is embodied in a physical robotic system. Is there a secure and controlled environment in which we could explore the potential of Embodied AI systems without endangering ourselves, expensive hardware, and the world around us? Falcon, our digital twin platform, allows us to do precisely that.

As a digital twin simulator designed for high-fidelity environments, autonomy software integration, and complex scenarios, Falcon is ideally suited for bringing LLMs into simulated, realistic physical worlds. Here we can create LLM-integrated scenarios, test the outcomes using prompted tasks, and identify ways to fine-tune the models for their intended domain of operation. In Falcon's realistic virtual worlds we can safely observe the behavior of the embodied AI, identify any edge cases that produce failure or hallucinations, and adjust accordingly. Furthermore, the embodied AI can learn from these environments, creating its own dataset of physical interactions that can be integrated with the model.

In this blog we look at two such scenarios we created to run in Falcon, which all users can try for themselves. Duality's chosen approach is to take a narrower application focus that goes after well-defined tactical challenges that today's roboticists are familiar with. Here we explore one such area of focus: task coding. An LLM's general reasoning abilities allow it to take higher-level mission goals, framed as prompts, and turn them into well-defined coded tasks that can be run directly on a robotic system. This type of LLM-generated task plan can be an effective way to make systems less brittle and generalize their domains of operation while still keeping their actions predictable.

As we introduce our digital twin simulation approach, we will delve into the concrete and exciting possibilities that Falcon offers for developing Embodied AI systems. We’ll dive deeper into the concept of Task Coding, examine how it is implemented, explore challenges presented, and showcase the above-mentioned free Falcon scenarios that anyone can use to further experiment with real Embodied AI in simulation.

A Deeper Look Into LLMs and Embodied AI

LLMs are trained on massive, internet-sourced datasets ranging from trivia to academic subjects like physics, medicine, and law. Their large parameter counts enable them to retain and use this information by recognizing and capturing patterns in data. This then allows them to predict the next word, or series of words, most likely to follow any given text, based on the frequency and context in their training data. This drives an LLM's ability to simulate understanding and logical reasoning based on example reasoning from text seen in the training data.

But what value do these abilities bring to Embodied AI?

Language Inputs

LLMs excel at understanding and generating human-like text. This capability can enable embodied AI to communicate effectively with people, interpret commands, ask coherent questions, and provide understandable responses, which are crucial aspects of human-robot interaction, and human-machine interface (HMI) design. By embracing language inputs, we can streamline embodied AI interactions, even removing the need for deep technical expertise.

Common Knowledge (“Knowing”)

LLMs have demonstrated a clear understanding of spatial relationships gained from the data they have been exposed to. For instance, if the training dataset contains sentences like "Cows graze in the field," the model associates cows with fields more strongly than with unrelated concepts like the moon. This understanding, when integrated with Embodied AI, allows robots to exhibit "common sense" reasoning about real-world relationships (e.g. "cheese might be found in the refrigerator"), which in turn can help them navigate various situations and adapt their behavior and responses to the context.

Problem-Solving and Decision Making (“General Reasoning”)

LLMs extract patterns, logic and problem-solving strategies across diverse scenarios from the vast stores of information they've been trained on. This in turn enables them to assess complex cause-and-effect relationships. When Embodied AI faces novel challenges, these models can reference these extracted "experiences" to suggest viable solutions. LLMs can also predict potential outcomes by comparing scenarios with their training data. This ability can enhance a robot's decision-making, helping avoid actions that could result in failure or harm.

However, it’s important to note that mathematical operations and traditional logic (analytics operators) are challenging for LLMs, and often can yield incorrect results. In these areas, relying on traditional computing techniques is currently the more reliable approach. For example: we’ve observed LLMs struggling with functions like unit conversions. This is one of the reasons why we chose a hybrid approach with LLMs operating at the highest levels, while more conventional software carries out the lower level functions.

Learning and Knowledge Integrations (“Adaptability”)

Through interaction and feedback, LLMs can help robots learn in a way that mimics human learning, integrating new information into their existing knowledge base. This process is crucial for the development of more nuanced behaviors and understanding — and is especially true when a human-in-the-loop has to intervene or “take over” the system. Simply put, Embodied AI can report the outcomes of its actions back to the language model, which then uses this data to refine future suggestions, thus simulating a learning process and eliminating the need to code responses to specific novel scenarios.
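The feedback loop described above can be sketched in a few lines, assuming a chat-style message history like those used by common LLM APIs. The function name and message wording are illustrative:

```python
# Minimal sketch of the outcome-reporting loop: results of executed
# tasks are appended to the conversation history so the next prompt
# carries them as context. Message format mirrors common chat-LLM
# APIs; report_outcome is a hypothetical name.

def report_outcome(history, task, outcome):
    """Feed an execution result back to the model as conversation context."""
    history.append({
        "role": "user",
        "content": f"Task '{task}' finished with outcome: {outcome}. "
                   "Take this into account when planning the next task.",
    })
    return history

history = [{"role": "system", "content": "You write robot task code."}]
report_outcome(history, "inspect pole 3", "camera occluded, retry needed")
print(len(history))  # 2
```

No model weights change here; the "learning" is in-context, which is exactly why it needs no per-scenario coding.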

Task Coding

Task coding via LLM does not replace any aspect of conventional robotics protocols, but it does change how we conceptualize high-level planning and low-level control. Instead of a human developer coding every part of the high-level tasks, we entrust this function to an LLM. This also means that instead of a human developer adapting the mid-level policies to the high-level ones, the mid-level policies are designed to be more generalized and flexible to accommodate the LLM's code. An example of this can be seen when designing a drone to inspect different infrastructure. Traditionally, a developer would have to hard-code an inspection path for every single type of infrastructure: a path written to inspect a wind turbine could not be reused for solar panels.

To achieve this generalization, we instead focus on path types, such as star search or ladder search, and allow the LLM to choose the best option for any given infrastructure or general object. Task coding takes advantage of the generative capabilities of LLMs to perform general reasoning, take high-level policy goals framed as prompts, and turn them into well-defined coded tasks capable of being run directly on a robotic system. As noted earlier, leveraging an LLM for task coding is a pragmatic path for reaping the benefits LLMs offer for Embodied AI while still maintaining the predictability of the system. It's important to note that, at present, running inference on LLMs is very compute- and memory-intensive and can be challenging to do at the edge or at high frequency, though these realities are constantly evolving.
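As a rough illustration of such generalized path types, the sketch below implements hypothetical `ladder_search` and `star_search` waypoint generators that an LLM could choose between. The signatures and waypoint math are our own assumptions, not Falcon's API:

```python
# Generalized mid-level path policies: instead of one hard-coded path
# per asset type, parameterized generators the LLM can select from.
# Names and geometry are illustrative.
import math

def ladder_search(width, height, rows, altitude):
    """Back-and-forth sweep, e.g. over a solar panel array."""
    waypoints = []
    for i in range(rows):
        y = height * i / max(rows - 1, 1)
        # Alternate sweep direction on each row.
        xs = (0.0, width) if i % 2 == 0 else (width, 0.0)
        for x in xs:
            waypoints.append((x, y, altitude))
    return waypoints

def star_search(center, radius, points, altitude):
    """Radial passes around a vertical structure, e.g. a wind turbine."""
    cx, cy = center
    waypoints = []
    for i in range(points):
        angle = 2 * math.pi * i / points
        waypoints.append((cx + radius * math.cos(angle),
                          cy + radius * math.sin(angle),
                          altitude))
        waypoints.append((cx, cy, altitude))  # pass back through center
    return waypoints
```

Given a prompt like "inspect the solar array", the LLM's job reduces to picking the generator and its parameters, while the waypoint following itself stays conventional.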

To go deeper into the details of Task Coding, including examples of API and prompt design for various models, as well as code generated by the integrated LLMs, please read our white paper: The Task Coding Framework for Leveraging Large Language Models in Robotics.

Task Coding Scenarios on FalconCloud

To showcase how task coding works and enable anyone to experiment with it in simulation, we created several scenarios available for free in FalconCloud. Below we introduce these scenarios along with a few examples of prompted missions and their outcomes. But the best way to experience Task Coding in practice is to try it for yourself.

Embodied AI AMR in a Maze

In this scenario users interact with an AMR in a simple maze environment. The AMR is integrated with GPT as its LLM, and the maze contains digital twins of random objects that the AMR can navigate to. Users prompt this embodied AI AMR with any commands of their choosing that involve the objects in the maze. GPT will interpret these commands, and write code that will then pilot the AMR towards the objects that it reasons to be the correct targets based on its understanding of the prompt.

Try this scenario on FalconCloud.

In the above example the AMR is asked to go to every product from farthest to nearest. It is able to sort and go to every product in the correct order.

But with task coding, we can get pretty creative with prompts. Let's try: 'Go to the products needed to make a sandwich'

Above we see an example of the LLM used for its reasoning abilities. The successful completion of this task requires the LLM to understand the idea of a sandwich and reason out what it can be composed of. Given the field of objects it is aware of, the LLM can then choose which objects meet the criteria, and therefore are ones that the AMR should go to. The LLM then uses the commands at its disposal to interact with the API, and the AMR executes the code generated by the LLM.
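The code the LLM emits for this prompt might look roughly like the following sketch. The object inventory and the `amr.go_to` command name are hypothetical stand-ins for the scenario's actual API:

```python
# Illustrative sketch of LLM-generated task code for the prompt
# "Go to the products needed to make a sandwich". The inventory and
# command names are hypothetical.

# Objects the AMR is aware of in the maze (example inventory).
scene_objects = ["bread", "cheese", "tomato", "soccer_ball", "wrench"]

# The LLM's reasoning step: which known objects could compose a sandwich?
sandwich_ingredients = [obj for obj in scene_objects
                        if obj in ("bread", "cheese", "tomato", "lettuce")]

# The generated plan is then just calls against the whitelisted commands.
for obj in sandwich_ingredients:
    print(f"amr.go_to({obj!r})")
```

The interesting work happens in the filter: the notion of "sandwich ingredients" comes from the model's world knowledge, not from anything coded into the scenario.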

Experimenting with Embodied AI in simulation is a natural way to identify limitations of a model or errors in prompt design. In the next scenario we'll look at a more real-world application and see how using task coding in simulation can be a vital tool for predicting errors in AI behavior and honing prompt design to correct them.

AI Drone Infrastructure Inspection

In the second scenario, users interact with a drone in a rural environment. This scenario builds on the lessons learned from the AMR scenario by setting a more complex environment with room for exploration. Furthermore, it leverages embodied AI for a task that drones are frequently used for: routine inspection of infrastructure. The scenario involves a drone exploring an open environment filled with solar panels, utility poles, trees, and other objects commonly found in a rural setting. One of the main additions to the embodied AI in this environment is the capability of visual reasoning, which gives the robot API more adaptability, letting it call upon visual reasoning to learn about and adapt to the environment. We will dive much more deeply into visual reasoning in our next blog, but for now it is enough to note the use of two distinct LLMs, one for task coding and one for visual reasoning, in a single embodied AI system.

Try this scenario on FalconCloud.

Let's look at how a simple mission is prompted and executed in this scenario.

When prompted to "Inspect utility poles 1, 3, and 9", once the code for the drone is generated by the LLM, Falcon displays the separate steps the drone will take to accomplish the mission (seen under "Preview Tasks" on the middle right of the screen), as well as the actual code being executed by the drone (seen in the lower right panel under "Code").
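For a sense of what such generated code and its task preview might look like, here is a hedged sketch with a stub drone that records each command, much as the "Preview Tasks" panel lists the steps before execution. The API shown is illustrative, not the scenario's real interface:

```python
# Sketch of the kind of code the LLM might generate for
# "Inspect utility poles 1, 3, and 9". DroneStub records commands,
# loosely mirroring a task preview. All names are hypothetical.

class DroneStub:
    """Records each command instead of flying, for previewing tasks."""
    def __init__(self):
        self.tasks = []

    def take_off(self):
        self.tasks.append("take_off")

    def fly_to(self, target):
        self.tasks.append(f"fly_to:{target}")

    def inspect(self, target):
        self.tasks.append(f"inspect:{target}")

    def return_to_base(self):
        self.tasks.append("return_to_base")

def inspect_poles(drone, pole_ids):
    """Task code as the LLM might emit it for the inspection prompt."""
    drone.take_off()
    for pole_id in pole_ids:
        drone.fly_to(f"utility_pole_{pole_id}")
        drone.inspect(f"utility_pole_{pole_id}")
    drone.return_to_base()

drone = DroneStub()
inspect_poles(drone, [1, 3, 9])
print(drone.tasks[:3])  # ['take_off', 'fly_to:utility_pole_1', 'inspect:utility_pole_1']
```

Previewing the recorded command list before execution is exactly what makes this style of LLM integration auditable.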

But what happens when a prompt is less clear? Identifying and correcting errors is significantly easier (and less costly) in simulation. The integrated development environment (IDE) available in this scenario helps us identify how to better hone the prompt.

Example: "Fly around a 5m square"

In this case the drone does not behave as expected. As we can see in the Preview Tasks panel, instead of carrying out 90-degree turns, the drone rotates 90 degrees but continues flight in a straight line. We can attempt to correct this behavior with a better-honed prompt:

Example: "Fly around a 5m square without rotating."

With a two-word fix we were able to align the intention of the prompt with the desired outcome for the drone's flight pattern.
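The corrected behavior amounts to translating between four corner waypoints without rotating the airframe. A minimal sketch, with an assumed coordinate convention (x, y in meters, fixed altitude):

```python
# Sketch of the intended "fly around a 5m square" plan: four corner
# waypoints traversed by translation only, with no yaw commands.
# Coordinate convention is an assumption for illustration.

def square_waypoints(side, altitude):
    """Corners of a square path at fixed altitude, starting at origin."""
    return [(0.0, 0.0, altitude),
            (side, 0.0, altitude),
            (side, side, altitude),
            (0.0, side, altitude)]

print(square_waypoints(5.0, 10.0))
# [(0.0, 0.0, 10.0), (5.0, 0.0, 10.0), (5.0, 5.0, 10.0), (0.0, 5.0, 10.0)]
```

The original, ambiguous prompt effectively interleaved yaw rotations with forward flight; expressing the path purely as waypoints removes that ambiguity.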

Perhaps the most interesting part of this scenario is seeing how task coding interacts with visual reasoning abilities. As discussed in a previous blog, we can use Falcon to learn how an LLM sees and interprets the world around it. In this scenario we can combine that ability with actionable missions.

Example: "Fly to the cement block."

In this situation we see an excellent example of the two systems working together and performing exactly as expected. After the initial takeoff, the drone captures an image in which it detects a cement block. It uses this information to plot a route to the cement block, and then returns to its landing pad.
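One way the two systems could be wired together is sketched below: the generated task code asks a vision component for a detection, then navigates to the returned position. `detect`, the drone methods, and the hard-coded detection result are all hypothetical stand-ins:

```python
# Sketch of task coding calling into visual reasoning for the
# "Fly to the cement block" mission. Every name here is illustrative.

def detect(image, label):
    """Stand-in for a vision model: returns a position for the first
    detection matching `label`, or None if nothing matched."""
    detections = {"cement_block": (12.0, -4.0, 0.0)}
    return detections.get(label)

class DroneStub:
    """Records commands instead of flying."""
    def __init__(self):
        self.tasks = []

    def take_off(self):
        self.tasks.append("take_off")

    def capture_image(self):
        return b""  # placeholder for a camera frame

    def fly_to(self, position):
        self.tasks.append(("fly_to", position))

    def return_to_base(self):
        self.tasks.append("return_to_base")

def fly_to_object(drone, label):
    """Task code combining detection and navigation."""
    drone.take_off()
    position = detect(drone.capture_image(), label)
    if position is not None:
        drone.fly_to(position)
    drone.return_to_base()

drone = DroneStub()
fly_to_object(drone, "cement_block")
print(drone.tasks)  # ['take_off', ('fly_to', (12.0, -4.0, 0.0)), 'return_to_base']
```

Note the `None` branch: if the vision model finds nothing, the task code degrades gracefully to a return-to-base rather than flying toward a hallucinated target.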

You can read a deeper dive on Task Coding in our white paper: The Task Coding Framework for Leveraging Large Language Models in Robotics.

And try the presented scenarios for free on FalconCloud:

Embodied AI AMR in a Maze

AI Drone Infrastructure Inspection