Reinforcement Learning in Falcon
March 3, 2021
· Written by
Brad Kriel
Allie O'Brien

Engineers and society at large are enamored with robots, and for a good reason: Robots can perform dull, dirty, and dangerous jobs ill-suited for humans. They can work quickly and precisely without experiencing injuries or fatigue. Today, robots work side by side with humans in all kinds of settings.
The Duality team recently built an end-to-end operations demo in which two autonomous vehicles, a quadcopter and a compact track loader (CTL), perform a search-and-retrieve mission for airdropped supplies at a military forward operating base. The quadcopter spots the dropped pallets, each marked with a unique ArUco code, and sends their location information to the CTL, which is then responsible for navigating to the specified locations and returning the pallets to the base.

A view of the quadcopter and CTL self-operating and physically simulated within Falcon

The quadcopter and CTL are not scripted or animated, but rather self-operating and physically simulated. They are controlled, live, by two different autonomy stacks interacting with Duality’s Falcon Autonomy Simulator via our API and robotics middleware. Falcon leverages the Unreal Engine as a 3D operating system and Universal Scene Description to define tunable environments, machines, and diverse scenarios. In our pallet retrieval scenario, the quadcopter is flown using a reinforcement learning model that itself was trained within Falcon.

This article discusses the steps we took to train the quadcopter’s autonomous landing model; the importance of accurate physical simulation in autonomous vehicle training; and the key considerations behind the reinforcement learning toolkit we developed inside Falcon. Reinforcement learning (RL) is an exciting branch of machine learning and offers tantalizing opportunities across a wide set of problem domains. However, many challenges must be addressed before RL’s potential can be fully realized. In this article, we focus on one specific challenge: how to build a reliable simulation environment for reinforcement learning experiments.

Challenges for autonomous vehicle training

The self-driving car industry is a good example of why autonomous policies are hard and expensive to train. While it is possible to train a self-driving car to drive perfectly in normal conditions, training it to deal with diverse conditions such as weather changes or edge cases such as street construction without proper signage is exponentially harder.

Building a robust model requires training data from as many edge cases as possible and in all kinds of scenarios. Edge case scenarios, such as recovering from near-collision events, are not only extremely expensive and difficult to capture but also potentially dangerous.

Training and validating in a simulator can potentially solve both of these problems, but most simulators fail to reproduce realistic and diverse scenarios. They focus only on model dynamics or sensor simulation, leaving a significant gap for vision-based learning models that can be trained in a simulated environment and reliably deployed in the real world.

In “A Study on the Challenges of Using Robotics Simulators for Testing” (Afzal et al., 2020), the researchers surveyed robotics developers and identified the problems they typically face when adopting simulators in their pipelines. The Duality team has already explained how we solve some of these challenges in other blog posts, such as how Falcon leverages USD to set up complex, realistic environments and scenarios that are tunable and integrated with robotics and ML workflows. In this blog post, we describe how we tackled challenges specific to using the simulator for reinforcement learning -- the integration of standard APIs and tools, reproducibility, and reliability.


In the world of Reinforcement Learning, OpenAI Gym is the de-facto standard toolkit for developing and comparing RL algorithms. It contains a set of ready-to-use environments, which are used as a common testing ground for benchmarking new agent implementations.

To make the development of AI-based applications as easy as possible for robotics and RL researchers and practitioners, we introduced Falcon-Gym, an implementation of the OpenAI Gym interface for the Falcon simulator. Beyond ready-to-use environments, Duality’s customers can create custom, photo-real environments that can even accurately match real-world locations, using high-quality interior and exterior assets.

Integration of standard APIs and Tools

Integrating a new simulator into a pipeline should not require changing the existing technology stack. This is especially true for RL, where most libraries force developers to adapt their environments to framework-specific requirements.

Because it is compliant with the Gym interface, an environment created in Falcon-Gym can be seamlessly integrated with the standard libraries and tools that roboticists and RL researchers and practitioners already use in their everyday work, such as stable-baselines3 and any other library compatible with the classic reset/step interface.
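The reset/step contract itself is small. As an illustration only (a toy stand-in, not Falcon-Gym code), any environment exposing these two methods can be driven by Gym-compatible tooling:

```python
# Toy stand-in for a Gym-compliant environment (illustrative only,
# not Falcon-Gym code). Any environment honoring this reset/step
# contract can be driven by Gym-compatible libraries.

class CountdownEnv:
    """Trivial env: the agent must drive a counter down to zero."""

    def reset(self):
        self.state = 10
        return self.state  # initial observation

    def step(self, action):
        # action: -1 or +1 applied to the counter
        self.state += action
        done = self.state == 0
        reward = 1.0 if done else -0.01  # small per-step penalty
        return self.state, reward, done, {}  # obs, reward, done, info


# Generic rollout loop: works with ANY env honoring the contract.
def rollout(env, policy, max_steps=100):
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

total = rollout(CountdownEnv(), policy=lambda obs: -1)
```

Because the rollout loop only touches `reset` and `step`, swapping the toy environment for a simulator-backed one requires no changes to the training code, which is precisely what makes Gym compliance valuable.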

Reproducibility and Simulator Reliability

Simulations should be deterministic: given the same starting state, applying the same forces should always result in the same outcome. This helps track down errors and bugs, and makes experiments easy to reproduce. To be deterministic, the simulation must advance by a fixed delta time, and physics interactions must be resolved in a repeatable order.

Falcon-Gym wraps the simulation environment in a stepping API we developed on top of Unreal Engine, which gives us complete authority over engine stepping, or ticking. By default, Unreal optimizes engine ticking for gameplay responsiveness. With Falcon-Gym, however, it is possible to set a fixed time step, resolve all physics interactions in a fixed order, and execute the simulation faster than real time. This lets us speed up simulations and control them in a fast, scriptable way, and it even permits headless and vectorized parallel execution. Determinism is still limited to multiple runs on the same machine, because floating-point arithmetic can differ across platforms.
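The principle can be illustrated outside any engine. A minimal sketch (not Falcon-Gym internals): with a fixed delta time and a fixed update order, repeated runs on the same machine are bit-for-bit identical, which is exactly the property a variable, responsiveness-driven tick rate destroys:

```python
# Minimal illustration of deterministic fixed-step simulation
# (not Falcon-Gym internals). A fixed dt and a fixed update order
# make repeated runs bit-for-bit identical on the same machine.

def simulate(steps, dt=1.0 / 60.0):
    pos, vel = 0.0, 0.0
    g = -9.81  # constant downward acceleration
    for _ in range(steps):
        vel += g * dt   # updates happen in the same order every step...
        pos += vel * dt  # ...so the trajectory is exactly reproducible
    return pos

run_a = simulate(600)  # 10 simulated seconds
run_b = simulate(600)
assert run_a == run_b  # exact equality, not approximate
```

With a variable `dt` taken from wall-clock frame times, the two runs would accumulate different floating-point round-off and diverge, which is why a fixed time step is a prerequisite for reproducible RL experiments.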

Implementing a Custom Environment

The process of implementing a custom environment can be broken down into four tasks:

  1. Define the action space. This is implicitly derived from the inputs we want to support. In this specific case, the quadcopter’s controls support four continuous actions that respectively manage throttle, pitch, yaw, and roll.
  2. Define the observation space. This is the set of information we extract at every time step to represent the state of the world. Falcon provides a wide range of sensors, including LiDAR, depth, and RGB cameras. For the landing task, we used an observation space of 22 features taken from the GPS and IMU sensors, as well as collision information.
  3. Design a reward. For simplicity, we used a dense reward that increases as the drone approaches the landing pad, plus a large final reward, awarded on success, to encourage the agent to land rather than keep collecting small rewards indefinitely.
  4. Lay out the terminal states. Naturally, the positive terminal state is reached when the quadcopter successfully lands. We also defined two large negative rewards for when the drone crashes or moves too far from the landing pad, and a small negative one for when it gracefully lands on a surface other than the landing pad. Finally, we set a time limit of 1000 steps per episode to prevent the agent from getting stuck.
 Overview of the defined observation space for our example

As is often the case in machine learning, this is not a straightforward process, but one that requires multiple iterations to converge to the right features for the observation space and properly tuned rewards.
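To make the four tasks concrete, here is a heavily simplified, hypothetical one-dimensional sketch of such an environment. The real environment uses Falcon's physics and sensors, a four-action control space, and the 22-feature observation described above; every name and number below is illustrative only:

```python
# Hypothetical 1-D sketch of the four tasks above (action space,
# observation space, reward, terminal states). NOT the real
# environment: Falcon's version uses full physics, four continuous
# controls, and 22 sensor-derived features.

class ToyLandingEnv:
    MAX_STEPS = 1000  # episode time limit, as in the real setup

    def reset(self):
        self.altitude = 50.0
        self.steps = 0
        return self.altitude  # 2. observation: here, just altitude

    def step(self, throttle):
        # 1. Action space: a single continuous control in [-1, 1].
        throttle = max(-1.0, min(1.0, throttle))
        self.altitude += throttle - 0.5  # gravity pulls the drone down
        self.steps += 1

        # 3. Dense reward that grows as the drone nears the pad (alt 0).
        reward = -abs(self.altitude) * 0.01

        # 4. Terminal states.
        done = False
        if self.altitude <= 0.0:
            reward += 100.0   # large success bonus on touchdown
            done = True
        elif self.altitude > 100.0:
            reward -= 100.0   # strayed too far: large penalty
            done = True
        elif self.steps >= self.MAX_STEPS:
            done = True       # time limit prevents getting stuck

        return self.altitude, reward, done, {}
```

Even in this toy form, the pieces interact the way the list describes: the dense term shapes behavior step by step, while the terminal bonuses and the step limit bound each episode.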

An Incremental Approach

Implementing and debugging reinforcement learning systems is notoriously difficult, as one can easily gather from the words of Andrej Karpathy, the Director of AI at Tesla, in this HN post. Tuning a model's hyperparameters is hard enough on its own; add the uncertainties introduced by a custom reward or a poorly chosen observation space, and it becomes essential to keep the K.I.S.S. principle in mind.

With this in mind, we decided to start from a simple task and incrementally reach for more difficult, compound goals.

Not only did this make it easier to implement the final environment, but it also made it possible to promptly identify if problems were introduced by the environment or the model being trained.

Training the model

Here is where Falcon-Gym shines. Because it is compliant with the Gym API, the code below is all that is needed to start training using stable-baselines3:

from stable_baselines3 import TD3
model = TD3('MlpPolicy', 'Falcon:DroneLanding-v1').learn(5000000)

NOTE: Since this is a complex environment, don't expect it to work without some hyperparameter tuning!

We tested some state-of-the-art off-policy algorithms: DDPG, TD3, and SAC. As expected, TD3 outperformed DDPG, but both were able to solve the environment. SAC, on the other hand, did not achieve good results and will require heavier hyperparameter tuning. Below is the training result of TD3.

The training result of the TD3 algorithm

The plot shows the average score over the last 100 episodes. The agent learned to avoid the large negative rewards fairly quickly, started making sense of the environment after 2 million steps, and eventually solved the environment in about 4 million steps. In the video below, you can see the evolution of the model from the initial to the final stage.

At an average of 400 FPS, training took about 4 hours, and this could be greatly reduced with a more efficient reward function.
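The smoothing used in the plot above is itself simple to reproduce. One common way to compute the running average over the last 100 episode scores is a fixed-size window:

```python
from collections import deque

# Running average over the last N episode scores, the usual way to
# smooth noisy RL training curves like the one shown above.

def smooth(scores, window=100):
    buf, averages = deque(maxlen=window), []
    for s in scores:
        buf.append(s)  # the oldest score drops out automatically
        averages.append(sum(buf) / len(buf))
    return averages

curve = smooth([0, 10, 20, 30], window=2)  # -> [0.0, 5.0, 15.0, 25.0]
```

The `deque(maxlen=...)` handles window eviction for free, so the helper works incrementally during training as well as on a finished score log.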

Conclusion and Future Work

In this article, we discussed the value of simulators for autonomous vehicle training in ensuring safety and reducing costs. We also explained the important factors we kept in mind when leveraging Falcon for reinforcement learning. Based on these, we presented Falcon-Gym, an implementation of the OpenAI Gym interface for Falcon: a reliable, fast tool that is deterministic and integrates with the most widely used RL libraries. We then demonstrated, step by step, how we used it to implement a custom environment to train a model for autonomous landing.

Going forward, we would like to validate our RL-trained models with field testing of the quadcopter and other autonomous systems.

If you are a researcher in reinforcement learning and would like to have access to Falcon, an advanced photorealistic simulator integrated with accurate physics, and with a wide selection of ready-to-use vehicles and assets, please contact us at ml[at] We look forward to hearing from you!