Because they’ve proven to be fruitful test beds for safely trying out dangerous driving scenarios, hyper-realistic virtual worlds have been heralded as the best driving schools for autonomous vehicles (AVs). Tesla, Waymo, and other self-driving companies all rely heavily on data to power expensive, proprietary photorealistic simulators, since nuanced I-almost-crashed data isn’t easy or desirable to gather and recreate on real roads.
To that end, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) developed “VISTA 2.0,” a data-driven simulation engine in which vehicles can learn to drive in the real world and recover from near-crash scenarios. What’s more, all of the code is being made publicly available.
VISTA 2.0 expands on the team’s previous model, VISTA, and is fundamentally different from existing AV simulators in that it is data-driven — that is, it was built and photorealistically rendered from real-world data — allowing for direct transfer to reality. While the initial iteration only supported single car lane-following with one camera sensor, achieving high-fidelity data-driven simulation necessitated rethinking the fundamentals of how different sensors and behavioural interactions can be synthesised.
VISTA 2.0 is a data-driven system that can simulate complex sensor types as well as massively interactive scenarios and intersections at scale. Using far less data than previous models required, the team trained autonomous vehicles that were significantly more robust than those trained on large amounts of real-world data.
The team was able to scale the complexity of interactive driving tasks such as overtaking, following, and negotiating in highly photorealistic environments, including multiagent scenarios. Because most of our data (thankfully) is just regular, day-to-day driving, training AI models for autonomous vehicles requires hard-to-secure fodder: edge cases and strange, dangerous scenarios. Naturally, we can’t crash into other cars just to teach a neural network not to crash into other cars.
There has recently been a shift away from more traditional, human-designed simulation environments and toward those built from real-world data. While data-driven environments offer incredible photorealism, human-designed ones make it easy to model virtual cameras and lidars. With this paradigm shift, a key question has emerged: Can the richness and complexity of all of the sensors that autonomous vehicles need, such as lidar and sparse event-based cameras, be accurately synthesised?
Lidar sensor data is much harder to interpret in a data-driven world: you’re effectively trying to generate brand-new 3D point clouds, with millions of points, from only sparse views of the world. To do so, the team took the data the car had collected, projected it into a 3D space based on the lidar readings, and then let a new virtual vehicle drive around locally from where the original vehicle was. Finally, they used neural networks to project all of that sensory data back into the frame of view of the new virtual vehicle.
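As a rough illustration of that viewpoint shift (a simplified sketch, not the team’s actual implementation, with all function names hypothetical), the core geometric step is a rigid transform of the logged point cloud into the frame of the displaced virtual vehicle; in VISTA 2.0, neural networks then fill in the occlusions and density changes that such a naive reprojection leaves behind:

```python
import numpy as np

def make_pose(x, y, yaw):
    """4x4 rigid transform for a ground-plane pose (x, y, heading)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, 0.0]
    return T

def reproject_lidar(points_world, virtual_pose, max_range=80.0):
    """Express a logged point cloud (N x 3, world frame) in the frame of a
    displaced virtual vehicle, keeping only points within sensor range."""
    world_to_virtual = np.linalg.inv(virtual_pose)
    homogeneous = np.hstack([points_world, np.ones((len(points_world), 1))])
    local = (world_to_virtual @ homogeneous.T).T[:, :3]
    in_range = np.linalg.norm(local[:, :2], axis=1) < max_range
    return local[in_range]

# The logged car sat at the world origin; the virtual car is nudged
# 2 m to the side and rotated 5 degrees before "re-sensing" the scene.
cloud_world = np.random.uniform(-50.0, 50.0, size=(100_000, 3))  # stand-in for real lidar returns
virtual_view = reproject_lidar(cloud_world, make_pose(0.0, 2.0, np.deg2rad(5.0)))
```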
Together with event-based cameras, which operate at speeds of thousands of events per second, the simulator was capable of not only simulating this multimodal information, but also doing so in real time, allowing neural nets to be trained offline and then tested online, in augmented-reality setups, for safe evaluation. “The question of if multisensor simulation at this scale of complexity and photorealism was possible in the realm of data-driven simulation was very much an open question,” says Alexander Amini, an MIT CSAIL PhD student and co-lead author of the work.
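For the event-camera side, the standard sensor model (again a simplified sketch under assumed names, not the released VISTA 2.0 code) fires an event at any pixel whose log-intensity change between successive renders crosses a contrast threshold, with a sign indicating whether the pixel brightened or darkened:

```python
import numpy as np

def frames_to_events(frame_prev, frame_next, threshold=0.2, eps=1e-6):
    """Approximate event generation between two grayscale frames: a pixel emits
    an event wherever its log-intensity change exceeds the contrast threshold
    (+1 for brightening, -1 for darkening)."""
    log_prev = np.log(frame_prev.astype(np.float64) + eps)
    log_next = np.log(frame_next.astype(np.float64) + eps)
    diff = log_next - log_prev
    polarity = np.sign(diff) * (np.abs(diff) >= threshold)
    ys, xs = np.nonzero(polarity)
    return [(int(x), int(y), int(polarity[y, x])) for y, x in zip(ys, xs)]

# Example with two small synthetic frames
prev = np.full((4, 4), 100, dtype=np.uint8)
nxt = prev.copy()
nxt[1, 2] = 180   # a pixel brightens -> positive event
nxt[3, 0] = 40    # a pixel darkens   -> negative event
print(frames_to_events(prev, nxt))
```

Running the snippet prints one positive and one negative event, the kind of sparse, asynchronous stream the simulator must produce thousands of times per second to keep up with a real event camera.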
When the team took their full-scale car out into the “wild,” a.k.a. Devens, Massachusetts, they saw that the results transferred immediately, with both failures and successes. They were also able to demonstrate the swaggering, magical word of self-driving car models: “robust.” AVs trained entirely in VISTA 2.0 proved so robust in the real world that they could handle the elusive tail of difficult failures. One guardrail humans rely on that can’t yet be simulated is human emotion: the friendly wave, nod, or blinker switch of acknowledgement, which the team hopes to incorporate in future work.
The paper was co-authored by Amini and Tsun-Hsuan Wang, together with Zhijian Liu, an MIT CSAIL PhD student; Igor Gilitschenski, an assistant professor of computer science at the University of Toronto; Wilko Schwarting, an AI research scientist and MIT CSAIL PhD ’20; Song Han, an associate professor in MIT’s Department of Electrical Engineering and Computer Science; Sertac Karaman, an associate professor of aeronautics and astronautics at MIT; and Daniela Rus, an MIT professor and director of CSAIL. The researchers presented the work at the IEEE International Conference on Robotics and Automation (ICRA) in Philadelphia.
The National Science Foundation and Toyota Research Institute provided funding for this research. The team wishes to thank NVIDIA for the donation of the Drive AGX Pegasus.