About the author: Dr. Sergio Sanz is an engineer and researcher with a PhD in Electrical Engineering. He specializes in Multimedia and Communications and has multi-faceted experience across machine learning, video coding and streaming, project management, and business development. His research experience includes work at the Carlos III University of Madrid (2005–2012) and at the Technical University of Berlin and the Fraunhofer Heinrich Hertz Institute (2012–2016). He co-founded Spin Digital Video Technologies GmbH (2015–2024) and has authored and co-authored numerous scientific publications and B2B white papers.
This article outlines the algorithmic solution developed by Sergio Sanz for the Synthetic to Real Object Detection Challenge - 2nd Edition on Kaggle, organized by Duality AI. The original article is available in its entirety on Dr. Sanz's LinkedIn blog.
GitHub: https://github.com/sergio-sanz-rodriguez/Synthetic-To-Real-Object-Detection-Edition-2
In this challenge, participants were tasked with training models on synthetic images of a soup can (generated with Duality AI’s digital twin simulation software, Falcon) that generalize effectively to unseen real-world images.
As in the previous challenge, the proposed model is based on the Region-based Convolutional Neural Network (R-CNN) family, specifically PyTorch's Faster R-CNN implementation.
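For reference, a detector of this kind can be built with torchvision's standard Faster R-CNN API. The sketch below is illustrative rather than the author's exact code; it assumes a two-class setup (background plus the soup can) and a COCO-pretrained ResNet-50 FPN backbone.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor


def build_faster_rcnn(num_classes: int = 2):
    """Build a Faster R-CNN detector with a custom classification head.

    num_classes includes the background class, so 2 = background + soup can.
    """
    # Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

    # Replace the box predictor so it outputs `num_classes` classes
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model


model = build_faster_rcnn(num_classes=2)  # background + soup can
```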
A central component of the proposed method is an augmentation-based regularization strategy to enhance generalization. Strong data augmentation techniques, including horizontal and vertical flips, perspective distortion, zoom-out, occlusions, color jittering, and resolution scaling, are applied throughout training. Another key aspect of the proposed solution is the use of ensemble learning to further improve generalization.
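As a rough illustration of such a pipeline, the sketch below assembles the listed transforms with torchvision's transforms.v2 API. The specific probabilities, ranges, and target sizes are assumptions for illustration; the article does not give the author's exact settings.

```python
import torch
from torchvision.transforms import v2

# Illustrative training-time augmentation pipeline (parameters are assumed):
# flips, perspective distortion, zoom-out, color jittering, resolution
# scaling, and random occlusions via erasing.
train_transforms = v2.Compose([
    v2.ToImage(),
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomVerticalFlip(p=0.5),
    v2.RandomPerspective(distortion_scale=0.3, p=0.3),
    v2.RandomZoomOut(fill=0, side_range=(1.0, 2.0), p=0.3),            # zooming out
    v2.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    v2.ScaleJitter(target_size=(1024, 1024), scale_range=(0.5, 1.5)),  # resolution scaling
    v2.ToDtype(torch.float32, scale=True),
    v2.RandomErasing(p=0.3),                                           # random occlusions
])
```

When used for detection training, these v2 transforms are applied to (image, target) pairs so that bounding boxes are transformed consistently with the image.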
Additional highlights of this approach include:
Compared to the previous challenge, the second edition has turned out to be significantly more difficult. This increased difficulty is primarily due to the simplicity of the training and validation datasets, which contrasts with the complexity of the real-world test dataset. In the first edition, a single Faster R-CNN network was used for object detection. In contrast, the current pipeline is considerably more sophisticated, as illustrated in Figure 1.
The pipeline consists of two models—Model A and Model B—as well as a Meta-decision stage.
Model A: Better Generalization
Model A consists of three deep learning models—A1, A2, and A3—that each make predictions on the same input image. Ensemble learning is applied to their outputs to (1) identify the most frequently detected object (majority vote) and (2) select the bounding box (BBox) with the highest confidence score. Model A is specifically designed to improve generalization.
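A hypothetical sketch of this ensemble step is shown below: each of the three models predicts on the same image, the most frequently detected class is chosen by majority vote, and the highest-scoring bounding box for that class is kept. Function and variable names are illustrative and not taken from the author's repository.

```python
from collections import Counter

import torch


@torch.no_grad()
def ensemble_predict(models, image):
    """Combine detections from several torchvision detection models.

    models: list of detection models (e.g., A1, A2, A3)
    image:  a float tensor of shape [C, H, W]
    """
    # Collect (label, score, box) triples from every model in the ensemble
    detections = []
    for model in models:
        model.eval()
        output = model([image])[0]  # torchvision detection output dict
        for label, score, box in zip(output["labels"], output["scores"], output["boxes"]):
            detections.append((int(label), float(score), box))

    if not detections:
        return None  # no model detected anything

    # (1) Majority vote: pick the most frequently detected class label
    majority_label = Counter(label for label, _, _ in detections).most_common(1)[0][0]

    # (2) Among detections of that class, keep the box with the highest score
    best = max((d for d in detections if d[0] == majority_label), key=lambda d: d[1])
    return {"label": best[0], "score": best[1], "box": best[2]}
```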
During training, each model path uses a different augmentation strategy...
[Read the rest of the article on Dr. Sanz's LinkedIn blog]
Ready to take on the next Synthetic-to-Real challenge? The third edition of our Kaggle competition—Multi-Instance Object Detection Challenge—is now live. Join today!