ViperGPT Takes a Walk in the Park: Evaluating Vision Systems for Embodied AI in FalconCloud

One of the main current challenges for autonomous robotics is enabling a robotic system to accurately understand its immediate environment. As humans, one of the fundamental ways we understand our environment is through vision. Developers have attempted to approximate this with visual reasoning systems, but doing so remains a major challenge in robotics development.

So how can we evaluate any visual system's ability to understand its environment? An effective way is offered by Visual Question-Answering (VQA). With VQA we use a natural language question along with an image and evaluate the accuracy of answers yielded by the visual system.

Significant advancements have been made in VQA, with datasets such as GQA providing a breeding ground for advanced VQA systems. One such system is ViperGPT, which introduced an innovative approach to VQA by using a large language model (LLM) in conjunction with multiple vision models. The main strategy of ViperGPT is to separate visual processing from general reasoning: visual processing is handled by several vision models, while general reasoning is handled by an LLM. This makes ViperGPT a modular framework in which AI models can be replaced easily as the state of the art evolves. Additionally, the framework provides transparent reasoning and decision making.
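The division of labor described above can be illustrated with a minimal sketch. Everything in it is a stand-in: `VisionModules`, `fake_llm_codegen`, `answer`, and the canned detections are hypothetical names invented for illustration, not ViperGPT's actual API. The point is only the shape of the flow: a language model writes a short program, and that program calls vision modules to do the perception.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


class VisionModules:
    """Hypothetical stand-in for the vision models; serves canned detections."""

    def __init__(self, detections: List[Detection]):
        self._detections = detections

    def find(self, name: str) -> List[Detection]:
        return [d for d in self._detections if d.label == name]


def fake_llm_codegen(question: str) -> str:
    """Hypothetical stand-in for the LLM: always returns the same tiny program."""
    return (
        "def execute_command(vision):\n"
        "    benches = vision.find('bench')\n"
        "    return 'yes' if benches else 'no'\n"
    )


def answer(question: str, vision: VisionModules) -> str:
    program = fake_llm_codegen(question)  # reasoning step: produce a program
    namespace = {}
    exec(program, namespace)  # the program is plain text, so it can be inspected
    return namespace["execute_command"](vision)  # perception happens in here


park = VisionModules([Detection("bench", (120, 300, 480, 420))])
print(answer("Is there a bench in view?", park))
```

Because the generated program is ordinary source text, it can be shown alongside the answer, which is one plausible reading of what "transparent reasoning" means in practice.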

Given the demonstrated success of ViperGPT, we naturally wanted to see how it would perform as the visual system for embodied AI. The scenario we created is currently live on FalconCloud and is available for anyone to experiment with in the cloud and free of charge: simply follow this link to the Visual Reasoning Scenario. (Note: If it is your first time using FalconCloud, you will need to create a free account.)

ViperGPT in a Photoreal Environment

We set out to test ViperGPT in a complex outdoor environment where we control a robotic mannequin that can freely move around in a 4 sq km photorealistic location. We chose a downtown environment with a blend of stores, roads, and public spaces. It features essential elements like trees, benches, and buildings — offering a realistic yet manageable setting for robots to navigate and interpret. This setting, closely mirroring a real-world scenario, is ideal for testing a robot’s capability in visual processing and environmental interpretation. The presence of everyday objects such as trash bins, bicycle racks, and informational signs adds to the complexity, challenging robots to distinguish and understand a variety of visual cues.

In this simulated scenario, we can explore ViperGPT’s abilities in highly flexible ways. For example, we can ask it to create bounding boxes around any objects we are interested in. Alternatively, we can ask it to generate a more precise, pixel-level mask of those same objects. By moving around and changing the viewing angle, we can rerun the same queries to see whether its answers change.
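To make the difference between the two output types concrete, here is a small sketch that derives a bounding box from a pixel mask. The mask is represented as a nested list of 0s and 1s, and the function names are hypothetical; this is an illustration of the relationship between the two representations, not the scenario's actual code.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # 1 = object pixel, 0 = background
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest box enclosing all object pixels, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


def mask_area(mask: Mask) -> int:
    """Count the object pixels in the mask."""
    return sum(sum(row) for row in mask)


def box_area(box: Box) -> int:
    """Count the pixels covered by an inclusive box."""
    left, top, right, bottom = box
    return (right - left + 1) * (bottom - top + 1)


# An L-shaped object: the box covers one pixel the mask does not.
l_shape = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
box = mask_to_box(l_shape)
print(box, mask_area(l_shape), box_area(box))
```

The box always covers at least as many pixels as the mask, so a mask is the better choice when the exact outline of an object matters.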

*Fig 1. The UI of our visual reasoning scenario makes it easy to query ViperGPT about objects in its field of view.*

To make it easier to explore this visual reasoning system, we connected ViperGPT to an intuitive UI. Prompts are submitted in the top right text box and ViperGPT responds with code used to solve the prompts as well as with multimodal responses that include natural language, instance segmentation, and bounding boxes.

And while this scenario takes place in the downtown environment, with Falcon we can easily place ViperGPT into any environment we wish.

Example Prompts

As demonstrated in the source paper, ViperGPT is good at identifying an object’s specific location relative to other objects. Let’s look at some examples. Note: viewing these videos in full screen mode will make it easier to see the natural language responses.

Example Prompt 1: “Is there a Coke bottle on top of the table?”

Example Prompt 2: “Is there water in this water fountain?”

Example Prompt 3: “Is there water under this bridge?”

As we can see, in all three examples the visual reasoning system returned clear and correct yes/no responses.
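A relational question such as "on top of" can be reduced to a geometric test on bounding boxes. The sketch below is one crude heuristic under stated assumptions: boxes are (left, top, right, bottom) in image pixels with y increasing downward, and `is_on_top` is a hypothetical helper. It is not the code ViperGPT generates for these prompts.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); y grows downward


def vertical_center(box: Box) -> float:
    return (box[1] + box[3]) / 2


def is_on_top(obj: Box, support: Box) -> bool:
    """Heuristic: the boxes overlap horizontally and obj sits higher in the image."""
    overlaps_horizontally = obj[0] < support[2] and obj[2] > support[0]
    sits_higher = vertical_center(obj) < vertical_center(support)
    return overlaps_horizontally and sits_higher


table = (0, 45, 100, 100)
bottle_on_table = (40, 20, 50, 50)
bottle_beside_table = (120, 80, 130, 110)
bottle_under_table = (40, 90, 50, 120)

print(is_on_top(bottle_on_table, table))
print(is_on_top(bottle_beside_table, table))
print(is_on_top(bottle_under_table, table))
```

A test like this works from a single 2D view only, so an object floating above a table or standing behind it could pass it too. That is one reason to retest the same prompt from different viewing angles.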

ViperGPT also exhibits a firm understanding of the function and significance of the various objects it observes.

Example Prompt 4: “Can you tell if it rained recently?”

Next, let’s consider some more complex questions. While ViperGPT is good at reasoning about relationships between objects in an image, it commonly makes unwarranted assumptions when comparing objects or inferring their relationships.

For example, when prompted with the question “How many humans do you think can sit on this bench?”, ViperGPT defaults to an assumed average human size converted to pixels, an estimate that is often incorrect.

However, this can be minimized by implicitly or explicitly clarifying that the objects in question are in the image. Simply altering the prompt to include the phrase “of the humans” (“How many of the humans do you think can sit on this bench?”) guides it to use the humanoid mannequin as its human-size reference.

These kinds of tests help us better understand how the visual system reasons and allow us to refine our prompts.

Example Prompt 5: “How many of the humanoid figures do you think can sit on this bench?”
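The arithmetic behind using the mannequin as a size reference is simple enough to sketch. The function below is a hypothetical illustration, not ViperGPT's generated code, and the box coordinates are invented. It assumes the bench and the reference figure are at a similar distance from the camera, since otherwise their pixel widths are not comparable.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def seats_on_bench(bench: Box, reference_person: Box) -> int:
    """Estimate capacity as bench width divided by the reference person's width."""
    bench_width = bench[2] - bench[0]
    person_width = reference_person[2] - reference_person[0]
    if person_width <= 0:
        raise ValueError("reference person box must have positive width")
    return bench_width // person_width


bench_box = (100, 300, 700, 400)      # 600 px wide (invented values)
mannequin_box = (50, 100, 200, 500)   # 150 px wide (invented values)
print(seats_on_bench(bench_box, mannequin_box))  # prints 4
```

Measuring the reference from the image itself avoids guessing a pixel size for an "average human", which is the assumption that causes the incorrect estimates described above.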

Open-ended questions are more challenging for ViperGPT. Because open-ended questions are harder to split into subtasks, ViperGPT tends to abandon a logical multi-step path and default to a single call to the general-purpose ‘simple_query’ function.

Furthermore, ViperGPT often makes assumptions about open-ended questions. For example, when asked, “What is on the table?” it assumes that the only objects that can be on the table are ‘pen’ and ‘paper’ and misses any other objects. Alternatively, it tries to find any and all objects by calling the find function with terms like “objects” or even nonsensical terms like “*”. This outcome could be improved in the future by giving the LLM more direct access to the image or by training the vision models to understand vaguer terms.

Example Prompt 6: “What is on top of the table?”
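The guessed-candidate failure mode can be shown with a toy example. The contents of the table and the helper `check_guesses` are invented for illustration. Only the guess list of ‘pen’ and ‘paper’ comes from the behavior described above.

```python
from typing import Dict, List


def check_guesses(on_table: List[str], guesses: List[str]) -> Dict[str, List[str]]:
    """Compare a fixed list of guessed object names against what is really there."""
    found = [g for g in guesses if g in on_table]
    missed = [o for o in on_table if o not in guesses]
    return {"found": found, "missed": missed}


# Invented ground truth for the table in view.
actual_objects = ["cup", "laptop", "pen"]

# The program only looks for what the LLM assumed might be there.
result = check_guesses(actual_objects, ["pen", "paper"])
print(result)
```

Any object outside the guess list is never searched for, so it cannot appear in the answer no matter how well the vision models perform.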

Duality’s ViperGPT Updates

One major advantage of the ViperGPT framework is that the models handling individual image processing tasks can be swapped out for newer ones. We leveraged this feature for Duality’s current version of ViperGPT. For instance, we updated the llm_query and code generation functions to use GPT-4 instead of GPT-3 and Codex, respectively. We also added a mask property by leveraging the Segment Anything Model (SAM).

Table 1. Comparing the original vision models used in ViperGPT versus Duality’s implementation of ViperGPT.

The ViperGPT framework allows for multimodal integration. Because the framework is modular, any particular model can easily be swapped for a newer model or replaced with a fine-tuned model designed for a specific use case.
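One common way to get this kind of swappability is to have the rest of the pipeline depend on an interface rather than on a specific model. The sketch below assumes that design. `Detector`, `BaselineDetector`, `UpgradedDetector`, and `VisualPipeline` are hypothetical names, and the detectors return canned results rather than running real models.

```python
from typing import List, Protocol, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


class Detector(Protocol):
    """Anything with this method can serve as the object detector."""

    def detect(self, image: object, name: str) -> List[Box]: ...


class BaselineDetector:
    """Stand-in for an older model that misses the object."""

    def detect(self, image: object, name: str) -> List[Box]:
        return []


class UpgradedDetector:
    """Stand-in for a newer model that finds the object."""

    def detect(self, image: object, name: str) -> List[Box]:
        return [(10, 20, 110, 220)]


class VisualPipeline:
    def __init__(self, detector: Detector):
        self.detector = detector

    def exists(self, image: object, name: str) -> bool:
        return bool(self.detector.detect(image, name))


frame = object()  # placeholder for an image
print(VisualPipeline(BaselineDetector()).exists(frame, "bench"))
print(VisualPipeline(UpgradedDetector()).exists(frame, "bench"))
```

Swapping the model changes only the constructor argument. The pipeline code and any programs written against it stay the same.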

The GitHub repo for Duality’s implementation of ViperGPT can be found HERE.
