Whitepaper
The Task Coding Framework for Leveraging Large Language Models in Robotics
May 14, 2024
Written by
Felipe Mejia
Francesco Leacche
Mish Sukharev

NOTE: If you read the blog Prompts to Execution — The Task Coding Framework for Large Language Models in Robotics, please skip to the Breaking Down Task Coding section of this paper.

In a previous post we talked about how digital twin simulation can bridge generative AI models into the real world. In that blog we outlined how Large Language Models (LLMs) and multimodal models are already sufficiently capable of tackling autonomy problems that have plagued smart systems. While autonomous robotics systems have made significant advancements in recent years, they are generally hitting a developmental ceiling due to their somewhat brittle nature in the following areas:

  • Adaptability: Robots are generally designed for specific environments and conditions — also known as their Operational Design Domain (ODD). Unexpected changes (such as lighting conditions or obstacles) can challenge their ability to perform tasks correctly. Increasing adaptability means improving the robot's ability to tackle unfamiliar environments and situations by using general reasoning to learn and extrapolate from past experiences, and functionally expanding the ODD.
  • Generalization: As robots are designed for specific tasks or a range of related tasks, they often struggle to generalize their knowledge or skills to entirely different tasks without significant reprogramming. Improving generalization would enable robots to operate in a completely novel ODD.
  • Long Horizon Tasks: Highly complex tasks that require a deep understanding of the environment and nuanced decision-making can still challenge modern robotics systems.

LLMs with well-honed prompt design offer promising solutions to the above challenges. These models, characterized by their enormous parameter counts and training datasets, possess an intricate grasp of human language, enabling them to produce coherent and contextually relevant sentences and even emulate specific writing styles. However, the prowess of LLMs isn't confined to natural language alone; they can also write and understand different coding languages, as widely adopted co-piloting and code-generation applications already demonstrate. When effectively prompted, this enables LLMs to provide executive function to autonomous robotic systems. In such scenarios, LLMs can bring their vast knowledge and reasoning about the world to make informed decisions and execute tasks based on prior knowledge, with high levels of adaptability to complex, evolving situations.

However, LLMs do not possess an inherent understanding of the physical world. They don’t understand its limits, are frequently unpredictable and given to hallucinations, and they often replicate and amplify biases present in training data. This can lead to expensive and even dangerous errors, especially when an LLM is embodied in a physical robotic system. Is there a secure and controlled environment in which we could explore the potential of Embodied AI systems without endangering ourselves, expensive hardware, and the world around us? Falcon, our digital twin platform, allows us to do precisely that.

As a digital twin simulator designed for high-fidelity environments, autonomy software integration, and complex scenarios, Falcon is ideally suited for bringing LLMs into simulated, realistic physical worlds. Here we can create LLM-integrated scenarios, test the outcomes using prompted tasks, and identify ways to fine-tune the models for their intended domain of operation. In Falcon's realistic virtual worlds we can safely observe the behavior of the embodied AI, identify any edge cases that produce failures or hallucinations, and adjust accordingly. Furthermore, the embodied AI can learn from these environments, creating its own dataset of physical interactions that can be integrated back into the model.

In this whitepaper we present two such scenarios we created to run in Falcon, which all users can try for themselves. Duality's chosen approach is to take a narrower application focus that goes after well-defined tactical challenges that today's roboticists are familiar with. Here we explore one such area of focus: task coding. An LLM's general reasoning abilities allow it to take higher-level mission goals, framed as prompts, and turn them into well-defined coded tasks capable of being directly run on a robotic system. This type of LLM-generated task plan can be an effective way to make systems less brittle and to generalize their domains of operation while still keeping their actions predictable.

As we introduce our digital twin simulation approach, we will delve into the concrete and exciting possibilities that Falcon offers for developing Embodied AI systems. We'll dive deeper into the concept of Task Coding, examine how it is implemented, explore the challenges it presents, and showcase the above-mentioned free Falcon scenarios that anyone can use to further experiment with real Embodied AI in simulation.

A Deeper Look Into LLMs and Embodied AI

LLMs are trained with massive, internet-sourced datasets ranging from trivia to academic subjects like physics, medicine, and law. Their large parameter counts enable them to retain and use this information by recognizing and capturing patterns in data. This then allows them to predict the next word, or series of words, most likely to follow any given text, based on the frequency and context in their training data. This drives an LLM's ability to simulate understanding and logical reasoning based on example reasoning from text they have seen in the training data.

But what value do these abilities bring to Embodied AI?

Language Inputs

LLMs excel at understanding and generating human-like text. This capability can enable embodied AI to communicate effectively with people, interpret commands, ask coherent questions, and provide understandable responses, which are crucial aspects of human-robot interaction, and human-machine interface (HMI) design. By embracing language inputs, we can streamline embodied AI interactions, even removing the need for deep technical expertise.

Common Knowledge (“Knowing”)

LLMs have demonstrated a clear understanding of spatial relationships gained from the data they have been exposed to. For instance, if the training dataset contains sentences like "Cows graze in the field," a model associates cows with fields more strongly than with unrelated concepts like the moon. This understanding, when integrated into Embodied AI, allows robots to exhibit "common sense" reasoning about real-world relationships (e.g. "cheese might be found in the refrigerator"), which in turn helps them navigate various situations and adapt their behavior and responses to the context.

Problem-Solving and Decision Making (“General Reasoning”)

LLMs extract patterns, logic and problem-solving strategies across diverse scenarios from the vast stores of information they've been trained on. This in turn enables them to assess complex cause and effect relationships. When Embodied AI faces novel challenges, these models can reference these extracted "experiences" to suggest viable solutions. LLMs can also predict potential outcomes by comparing scenarios with their training data. This ability can enhance a robot's decision-making, helping avoid actions that could result in failure or harm.

However, it's important to note that mathematical operations and traditional logic (analytical operators) are challenging for LLMs and can often yield incorrect results. In these areas, relying on traditional computing techniques is currently the more reliable approach; for example, we've observed LLMs struggling with functions like unit conversions. This is one of the reasons we chose a hybrid approach, with LLMs operating at the highest levels while more conventional software carries out the lower-level functions.

Learning and Knowledge Integrations (“Adaptability”)

Through interaction and feedback, LLMs can help robots learn in a way that mimics human learning, integrating new information into their existing knowledge base. This process is crucial for developing more nuanced behaviors and understanding, especially when a human-in-the-loop has to intervene or "take over" the system. Simply put, Embodied AI can report the outcomes of its actions back to the language model, which then uses this data to refine future suggestions, thus simulating a learning process and eliminating the need to code responses to specific novel scenarios.

Breaking Down Task Coding

Task coding via LLM does not replace any aspect of conventional robotics protocols, but it does change how we conceptualize high-level and mid-level policies. Instead of a human developer coding every part of the high-level tasks, we entrust this function to an LLM. This also means that instead of a human developer adapting the mid-level policies to the high-level ones, the mid-level policies are designed to be more generalized and flexible to accommodate the LLM's code. An example of this can be seen when designing a drone to inspect different infrastructure. Traditionally, a developer would have to hard-code an inspection path for every single type of infrastructure; a path built to inspect a wind turbine, for example, could not be reused for solar panels. To achieve this generalization, we instead focus on path types, such as star search or ladder search, and allow the LLM to choose the best option for any given infrastructure or general object (see the sketch below).
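As a rough illustration of this kind of generalized mid-level policy, the sketch below exposes path types as reusable primitives; the function names, parameters, and waypoint logic are hypothetical assumptions for illustration, not Falcon's actual API.

```python
import math

def ladder_search(center, width_m, height_m, spacing_m=2.0):
    """Back-and-forth (ladder) waypoints over a flat target such as a solar array."""
    x, y, z = center
    waypoints = []
    rows = int(height_m / spacing_m) + 1
    for i in range(rows):
        row = [(x - width_m / 2, y, z + i * spacing_m),
               (x + width_m / 2, y, z + i * spacing_m)]
        # Alternate sweep direction on each row to form a zigzag
        waypoints.extend(row if i % 2 == 0 else row[::-1])
    return waypoints

def star_search(center, radius_m, points=8):
    """Radial (star) waypoints around a vertical target such as a wind turbine."""
    x, y, z = center
    return [
        (x + radius_m * math.cos(2 * math.pi * k / points),
         y + radius_m * math.sin(2 * math.pi * k / points),
         z)
        for k in range(points)
    ]

# The LLM's task code only has to pick a path type per structure,
# rather than reproduce a hand-crafted inspection route for each one.
PATH_TYPES = {
    "wind_turbine": star_search,
    "utility_pole": star_search,
    "solar_panel": ladder_search,
}
```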

As noted earlier, leveraging an LLM for task coding is a pragmatic path to reaping the benefits LLMs offer Embodied AI while still maintaining the predictability of the system. Task coding takes advantage of the generative capabilities of LLMs to perform general reasoning: it takes high-level policy goals, framed as prompts, and turns them into well-defined coded tasks capable of being run directly on a robotic system. This code is sent to the robot API, which can trigger control functions, sensory functions, and/or visual reasoning. However, it's important to note that at present, running inference on LLMs is very compute- and memory-intensive and can be challenging to do at the edge or at high frequency.

A straightforward way to integrate an LLM with an embodied AI is by exposing a control API to an LLM and then prompting the LLM to complete a task. For task coding, two aspects need careful consideration: API design and prompt design. To best leverage the LLM, the control API must be designed as a set of functions that can be used to complete executable tasks. The prompt must be intentionally crafted to both capture how the LLM must use the API and clearly specify the task.

With task coding, an LLM is given a mission goal, which it reasons about and returns as runtime task code for the robot API to execute.

Working with LLM-integrated Embodied AI in Falcon allows us to evaluate our API and make any changes necessary for better performance. It gives us an Integrated Development Environment where we can see every part of how the LLM interprets a prompt and how it commands the lower-level functions of the machine. And it allows us to refine our prompt crafting to better match the LLM's understanding.

API Design

Crafting an effective robot control API for LLMs hinges on striking the right balance between low-level functionalities (such as nuanced motor controls) and high-level abstract actions. The goal is to architect the API in such a manner that the LLM not only grasps the API syntax effortlessly but also draws on insights from its general training to produce intelligent coding decisions. For example, if we ask an LLM to search for birds, it will intuitively know to look around utility poles and trees. However, we also want the API to retain a degree of 'mid-level' flexibility, ensuring adaptability and catering to an expansive array of tasks in innovative ways.

It is also crucial for the API to be designed around fully contained missions. In other words, the LLM creates a complete block of code meant to fulfill the given prompt with minimal secondary queries. In practice, the LLM should act as the high-level strategist, creating a large-scale order of operations, while smaller, faster models work as tacticians used for real-time decision making.

Integrated Development Environment (IDE)

Working with LLMs in Falcon enables the creation of an IDE with a layout that lets engineers experiment with the embodied AI and troubleshoot quickly. This was done by adding pre-planning to the API: once the LLM generates code, the robot's path is immediately displayed in the simulation. To identify and isolate specific challenges, the behavior of the machine can be fast-forwarded or rewound, and specific steps can be accessed directly. This allows the user to directly identify relationships between prompts, LLM decisions, code, and task execution.

Prompt Design

Currently Falcon integrates two LLMs: GPT and Code Llama. While Falcon can be made to integrate any LLM, for now we chose the two most advanced models for the task in the commercial (GPT) and open-source (Code Llama) spheres.

There are two key components in the prompt we use to extract task coding from an LLM. The first component is context: here we aim to expose the environment, the possible tasks, and the functions to be used to complete a given task. The second component is the actual task: here we surround the user input with overall guidelines for how we want the LLM to structure its output so that it can be run in Python. The approach to prompting differs for each LLM. Below we walk through the prompt design we've implemented in Falcon.

GPT

GPT-4 seamlessly extracts information on how to use an API from code documentation. As such, we can provide guidance on the robot control API by inputting the raw code (such as a Python class) containing the functions we want GPT-4 to use. The raw Python class is given to GPT-4 with documentation and examples of how to apply the functions to complete example prompted missions. It's vital to note that what is given to GPT-4 does not include the specific implementation of the functions that would be found in the source code. The structure of the functions is as follows:
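For illustration, a documented function given to GPT-4 might look like the sketch below; the class name, method, and parameters are hypothetical stand-ins rather than the exact Falcon API, and only the signature and docstring are provided, never the implementation.

```python
class AMR:
    """Hypothetical robot control class exposed to GPT-4 (illustrative sketch,
    not the exact Falcon API)."""

    def go_to(self, actor_tag: str) -> bool:
        """Drive the AMR to the object identified by `actor_tag`.

        Parameters
        ----------
        actor_tag : str
            Tag of a digital-twin object in the scene, e.g. "soup_can_01".

        Returns
        -------
        bool
            True once the object is reached, False if it is unreachable.

        Examples
        --------
        >>> amr = AMR()
        >>> for tag in amr.find("items needed to make coffee"):
        ...     amr.go_to(tag)
        """
```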

The above function contains documentation with a description, parameters, return values, and examples of how to interact with the API (in this case for an AMR).

Next we can prompt actual tasks. The task is added following the ViperGPT prompt design, with the query being the task description given by the user:
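The wrapper around the user's query is along these lines (a paraphrased sketch, not the verbatim prompt used in Falcon):

```python
# Paraphrased sketch of the ViperGPT-style task prompt wrapper
# (illustrative; not the verbatim prompt used in Falcon).
def build_task_prompt(query: str) -> str:
    return (
        "Only answer with a single Python function named execute_command.\n"
        f'# Query: "{query}"\n'
        "# Write Python code that completes the query using only the class "
        "documented above and basic Python functions.\n"
        "def execute_command():"
    )
```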

In this prompt we are providing the output constraints for GPT. We are constraining GPT to output Python code and to use the robot control class we have just demonstrated, as well as any basic functions it might need. This steers GPT into outputting code that can be run directly.

At this point GPT-4 is ready to accept concrete mission prompts, which we will explore in the following section. Lastly, it's important to note that not all functions need extensive documentation. Basic functions are exemplified within more complex functions and therefore don't need to be demonstrated on their own.

CODE LLAMA

Unlike GPT, Code Llama is trained primarily on text completion and text infilling; as such, it works best by learning from examples [1]. Even Code Llama - Instruct, which was trained on 5B tokens to follow human instructions, performed better by learning from examples than by being given the robot API. We therefore follow Code Llama's one-shot learning prompt format to get an output:
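A single example pair in that prompt might look like the following sketch; the task text and code are illustrative, not one of the examples used in Falcon.

```python
# Illustrative sketch of the example-based prompt format used for Code Llama.
# Each example pairs a natural-language task with the code that completes it.
ONE_SHOT_EXAMPLE = '''
# Task: move the drone forward by 5 meters, then land
drone.take_off()
drone.move_by(x=500, y=0, z=0)  # move_by takes centimeters in this sketch
drone.land()
'''

def build_codellama_prompt(user_task: str) -> str:
    # The user's task is appended after the example(s); Code Llama completes
    # the code that follows the final "# Task:" line.
    return ONE_SHOT_EXAMPLE + f"\n# Task: {user_task}\n"
```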

In this prompt, we give Code Llama tasks similar to the ones we expect from the user ('move drone'), followed by the corresponding code to complete the task. Currently we give Code Llama 18 such examples, but we expect this number to grow as we find more edge cases.

Thus far we have integrated GPT-4 and Code Llama with Falcon, and we will continue integrating more LLMs as the state of the art improves. As more LLMs are developed, their strengths and weaknesses change drastically [1], which further highlights the need for tools like Falcon to act as a benchmark for these LLMs as they are applied to embodied AI.

Falcon Scenarios: Task Coding In Simulation


To explore the benefits of using LLMs for task coding, we created two use-case scenarios that anyone can access. The first, and simpler, scenario places an Autonomous Mobile Robot (AMR) in a maze filled with random objects that the LLM can reason about. The second scenario is grounded in the real world, and places an inspection drone in a field with utility structures like electric poles, solar panels, water pipes, and wind turbines. The drone can be prompted to carry out inspection tasks on any structures in the scenario.

Scenario 1: Embodied AI Autonomous Mobile Robot (AMR) in a Maze

In the first scenario users interact with an AMR in a simple maze environment. The AMR is integrated with GPT as its LLM, and the maze contains digital twins of random objects that the AMR can navigate to. Users prompt this embodied AI AMR with any commands of their choosing that involve the objects in the maze. GPT will interpret these commands, and write code that will then pilot the AMR towards the objects that it reasons to be the correct targets based on its understanding of the prompt.

Leveraging the API and prompt designs that we discussed, we can test this scenario within Falcon. We are able to perform various tasks within this environment, such as going to specific objects, going to them based on their position, and even going to objects needed to complete a specific task. In this scenario, the AMR, and consequently the LLM, does not have access to visual reasoning but instead leverages structured data of the scene. This structured data includes descriptions, locations, and distances of the objects in the scene relative to the AMR. To take advantage of this structured data, we use GPT not only for task coding but also to search through the structured data and find objects in the scene.

In this scenario the LLM does not have access to visual reasoning. Instead, visual reasoning is replaced with LLM structured-data reasoning, which leverages ChatGPT to extract objects from the structured data given a query.

This AMR works on a fixed set of commands. These commands and this framework are meant as building blocks and can easily be expanded. The AMR has seven methods it can call (a sketch of how they might look as a Python class follows the list):

  • find
  • distance_from_AMR
  • go_to
  • wait
  • stop
  • move
  • rotate
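Below is a minimal sketch of how these seven methods could be gathered into a Python class, expanding the hypothetical AMR class from the prompt-design section; the signatures and the nested GPT call inside find are illustrative assumptions, not the exact Falcon API.

```python
class AMR:
    """Illustrative sketch of the AMR control API exposed to GPT
    (method names from the list above; signatures are assumptions)."""

    def find(self, query: str) -> list[str]:
        """Return actor tags whose structured-data descriptions best match
        `query`; internally this issues a nested GPT call over the scene's
        structured data (descriptions, locations, distances)."""

    def distance_from_AMR(self, actor_tag: str) -> float:
        """Return the distance in meters from the AMR to the tagged object."""

    def go_to(self, actor_tag: str) -> bool:
        """Navigate the AMR to the tagged object."""

    def wait(self, seconds: float) -> None:
        """Pause for the given number of seconds."""

    def stop(self) -> None:
        """Halt the AMR immediately."""

    def move(self, distance_m: float) -> None:
        """Move forward by the given distance, in meters."""

    def rotate(self, degrees: float) -> None:
        """Rotate in place by the given number of degrees."""
```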

GPT is given access to these AMR methods inside the Falcon environment, where it can showcase its reasoning capabilities and be tested in a sandbox.

GPT's code is run in Python, which translates to actions performed by the AMR in Falcon.

Example 1: ‘Go to the products needed to make a sandwich’

In this example, GPT-4 hard-codes example ingredients that are commonly needed to make a sandwich. It then iterates over these ingredients to try to find any objects in the scene related to each ingredient.
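The generated code is along these lines; this is a representative sketch rather than a verbatim transcript, and `amr` is assumed to be the control-class instance provided by the Falcon environment.

```python
# Representative sketch of GPT-4-generated task code for
# "Go to the products needed to make a sandwich" (actual output varies).
# `amr` is assumed to be the AMR control instance provided by Falcon.
sandwich_ingredients = ["bread", "cheese", "ham", "lettuce", "tomato"]

for ingredient in sandwich_ingredients:
    # The nested GPT call in find() matches the ingredient against the
    # structured scene data and returns any related actor tags.
    for actor_tag in amr.find(ingredient):
        amr.go_to(actor_tag)
```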

Above we see an example of the LLM being used for its reasoning abilities. Successfully completing this task requires the LLM to understand the idea of a sandwich and reason about what it can be composed of. Given the objects it is aware of, the LLM can then choose which ones meet the criteria and are therefore the ones the AMR should go to. The LLM uses the commands at its disposal to interact with the API, and the AMR executes the code generated by the LLM.

Example 2: ‘Go to all products that require a can opener.‘

In this example the LLM is asked to find products that require a can opener. It calls the find function to locate such objects and then iterates through them, going to each location. The find function uses GPT to identify the objects in the scene that best match the query 'can opener.'
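The generated code for this task is roughly as follows (a sketch of a typical output, not a verbatim transcript):

```python
# Representative sketch of the generated code for
# "Go to all products that require a can opener."
# `amr` is assumed to be the AMR control instance provided by Falcon.
canned_products = amr.find("products that require a can opener")

for actor_tag in canned_products:
    amr.go_to(actor_tag)
```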

Above we see an example of the LLM being able to discriminate between objects that require a can opener, such as a can of soup, and the rest of the objects, such as a bag of bread or cheese. This highlights that the LLM has an understanding of the world that can be leveraged for embodied AI. The embodied AI system does not need to be taught what a can opener is; instead, it leverages GPT's prior knowledge to complete this mission. There is no need for extra data on what a can opener is or what it is used for; GPT already has that covered.

One of the main benefits of using Falcon for embodied AI is being able to explore different prompts, see whether they work, and adjust them based on the observed failure mode. In the next example we ask the AMR to go to all objects with the following prompt:

Example 3: ‘Go to all objects in the maze and then rotate by 360 degrees.’

GPT-4 generates code that first calls the find function, which in turn calls ChatGPT to compare the structured data of the maze with the input 'object'. The code then iterates through the actor_tags that were found, and finally the AMR rotates by 360 degrees. As you can see in the video below, ChatGPT does not return anything matching 'object', as it is considered too vague to match the products in the maze.
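The generated code is roughly of this shape (illustrative sketch); note that the find call comes back empty:

```python
# Representative sketch of the generated code for
# "Go to all objects in the maze and then rotate by 360 degrees."
# `amr` is assumed to be the AMR control instance provided by Falcon.
actor_tags = amr.find("object")   # nested ChatGPT query over the structured data

for tag in actor_tags:            # "object" is judged too vague, so the list
    amr.go_to(tag)                # comes back empty and nothing is visited

amr.rotate(360)
```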

If we don't clarify for ChatGPT what we mean by "object", GPT will still generate correct code, but the nested ChatGPT call isn't given enough context to execute in the intended way. To address this, we add the following example so that ChatGPT knows to return all products when looking for "objects":
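A question-answer entry of this kind might look like the following paraphrased sketch (not the exact text used in Falcon):

```python
# Paraphrased sketch of the question-answer context entry added for the nested
# ChatGPT call (not the exact wording used in Falcon).
EXTRA_CONTEXT = """
Q: Which items in the scene match the query "objects"?
A: Every product in the scene counts as an object, so return all actor tags,
   e.g. ["bread_01", "cheese_01", "soup_can_01"].
"""
```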

Above is an example of the question-answer context given to ChatGPT. Without this context, ChatGPT doesn't consider the items in the scene to be 'objects', but with the addition of this example it is able to expand what it considers an 'object'. Here is what the AMR does once we add the example for products:

Scenario 2: AI Drone Infrastructure Inspection

In this second scenario, users interact with a drone in a rural environment. This scenario builds on the lessons learned from the AMR scenario by setting up a more complex environment with room for exploration. Furthermore, it leverages embodied AI for a task that drones are frequently used for: routine inspection of infrastructure. The scenario involves a drone exploring an open environment filled with solar panels, utility poles, trees, and other objects commonly found in a rural setting. One of the main additions to the embodied AI in this environment is the capability of visual reasoning, which gives the robot API more adaptability, calling upon visual reasoning to learn about and adapt to the environment. You can learn more about visual reasoning in our previous blog; for now it is enough to note the use of two distinct LLMs, one for task coding and one for visual reasoning, in a single embodied AI system.

While any LLM can be integrated with Falcon to control the drone, we experimented with CodeLLaMa and GPT-4. The GPT-4 version is available to all users on FalconCloud.

In this environment there is known structured data for objects such as utility poles, solar panels, windmills, and water pipes. There are also objects without structured information, such as bridges, trees, and rocks. With the help of visual reasoning, the LLM can utilize both.

First let's take a look at the API design for this more complex scenario:
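The sketch below gives a sense of what such an API could look like; the class name, method names, and signatures are illustrative assumptions rather than the exact Falcon API.

```python
class Drone:
    """Illustrative sketch of the drone control API exposed to the LLM
    (method names and signatures are assumptions, not the exact Falcon API)."""

    def take_off(self) -> None:
        """Lift off and hover at a safe default altitude."""

    def move_by(self, x: float, y: float, z: float) -> None:
        """Translate by the given offsets, in centimeters."""

    def go_to(self, location) -> None:
        """Fly to a location taken from the scene's structured data."""

    def orbital_search(self, location, radius_m: float = 5.0) -> None:
        """Orbit a structure while the camera inspects it."""

    def set_detection_callback(self, callback) -> None:
        """Register a function to call whenever visual reasoning detects an
        object of interest while detection mode is active."""

    def activate_detection_mode(self) -> None:
        """Start running visual reasoning on the camera feed."""

    def deactivate_detection_mode(self) -> None:
        """Stop running visual reasoning."""

    def land(self) -> None:
        """Land on the closest landing pad."""
```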

The API (shown above) defines the functions available to the drone.

Next, let's see how task coding fares in this scenario.

Example 1: 'Inspect utility poles 1, 3 and 9.'

In this example the code starts by having the drone take off. It then defines a callback specifying what the drone should do when it detects the objects it is searching for along its path while detection mode is active. The code then pulls the utility poles of interest from the structured data and iterates through their locations, flying to each pole and performing an orbital search. Finally, it deactivates detection mode and lands the drone on the closest landing pad.
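A representative sketch of the generated code, assuming the hypothetical Drone API above and a structured_data dictionary of known scene objects:

```python
# Representative sketch of generated code for "Inspect utility poles 1, 3 and 9"
# (actual output varies). `structured_data` is assumed to be a dictionary of
# known scene objects provided by the Falcon environment.
drone = Drone()
drone.take_off()

def on_detection(detected_object):
    # Called whenever visual reasoning flags an object while detection mode is on
    print(f"Detected {detected_object} during inspection")

drone.set_detection_callback(on_detection)
drone.activate_detection_mode()

for pole_id in (1, 3, 9):
    location = structured_data["utility_poles"][pole_id]  # known pole positions
    drone.go_to(location)
    drone.orbital_search(location)

drone.deactivate_detection_mode()
drone.land()  # lands on the closest landing pad
```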

Here's what this looks like in simulation:

As with the previous scenario, we can just as easily identify and correct errors, a process made even easier by the IDE available in this scenario.

Example 2: 'Fly around a 5m square'

In this case the drone does not behave as expected. As we can see in the Preview Tasks panel, instead of carrying out 90-degree turns, the drone rotates 90 degrees in place but continues its flight in a straight line. We can attempt to correct this behavior with a better-honed prompt:

Example 3: 'Fly around a 5m square without rotating.'

With a two-word fix we were able to align the intention of the prompt with the desired outcome for the drone's flight pattern.
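With that prompt, the generated code reduces to four straight-line legs; the sketch below assumes the hypothetical move_by method, which takes centimeters.

```python
# Representative sketch of generated code for
# "Fly around a 5m square without rotating" (move_by in centimeters).
drone.take_off()
drone.move_by(x=500, y=0, z=0)    # forward 5 m
drone.move_by(x=0, y=500, z=0)    # right 5 m
drone.move_by(x=-500, y=0, z=0)   # back 5 m
drone.move_by(x=0, y=-500, z=0)   # left 5 m
drone.land()
```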

Uncovering Differences Between Various LLMs

The ability to prompt the same missions with different LLMs in order to ascertain their strengths and weaknesses is a major benefit of testing LLM performance in Falcon. In the future we hope to present deeper explorations of various models, but for now we simply want to illustrate how we can uncover model differences and, as before, test interventions to overcome any errors.

A good example is an issue we found with Code Llama that didn't present the same way in GPT. In this instance, Code Llama did not convert units correctly, an important issue for robots in the real world. In the following task, instead of using 1000 cm to represent 10 m, Code Llama inputs 100 cm.
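The offending output looked roughly like the following sketch (using the hypothetical move_by signature from the drone API above):

```python
# Sketch of the failing Code Llama output: the mission asked the drone to
# descend by 10 m (1000 cm), but the generated call uses only 100 cm.
drone.take_off()
drone.move_by(x=0, y=0, z=-100)   # 100 cm = 1 m, not the requested 10 m
```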

In this example the drone starts by taking off and then moves using the move_by function. The move_by function uses centimeters as its units; as such, this call moves the drone down by 100 cm (1 m) instead of the 10 m dictated by the mission prompt.

To overcome this issue, we made a unit-conversion function to simplify this problem for Code Llama.
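A minimal version of that helper might look like this sketch; the exact implementation in Falcon may differ.

```python
# Sketch of the unit-conversion helper made available to Code Llama
# (the exact implementation in Falcon may differ).
def convert_to_cm(value: float, unit: str) -> float:
    """Convert a distance in the given unit ('m', 'cm', 'km', 'ft', 'in') to centimeters."""
    factors = {"cm": 1.0, "m": 100.0, "km": 100_000.0, "ft": 30.48, "in": 2.54}
    return value * factors[unit.lower()]

# Code Llama can then write, for example:
# drone.move_by(x=0, y=0, z=-convert_to_cm(10, "m"))   # correctly 1000 cm
```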

Unlike in the previous example, an extra convert_to_cm function is available that receives a number and a unit and converts it to centimeters. This allows Code Llama to move by a given distance correctly no matter what units are used in the prompt.

Next Steps

The exciting work of Embodied AI is just getting started. We will be presenting regular updates on expanding work that brings simulated AI scenarios closer and closer to real-world applications. This of course includes more state-of-the-art model integrations in Falcon, as well as scenarios with more realistic autonomy stacks, ROS integration, and more.

We hope you try these task coding scenarios for yourself, and we encourage you to reach out to us with questions, comments, and ideas: solutions@duality.ai.
