Harnessing the Power of Multi-Modal LLMs for Autonomy

LLMs have the potential to re-shape the autonomy software stack.

By Ghost

November 8, 2023

5-minute read

Large Language Models (LLMs) are advancing their capabilities and expanding into new applications on a near-daily basis, disrupting established computing architectures across industries.

At Ghost we believe that LLMs will have a profound impact on the autonomy software stack, and the addition of multi-modal capabilities to LLMs (accepting image and video inputs alongside text) only accelerates their applicability to the autonomy use case.

Multi-modal LLMs (MLLMs) have the potential to reason about driving scenes holistically, combining perception and planning to give autonomous vehicles deeper scene understanding and guidance on the correct driving maneuver by considering the scene in its totality. In the video above we share some examples of out-of-the-box commercial MLLMs analyzing driving scenes to provide scene understanding and maneuver guidance.

Examples of MLLM-Based Reasoning

While traditional in-car AI systems can identify objects they have been trained on or read signs with OCR, MLLMs can go beyond detections and reason about appropriate actions or outcomes from the totality of information in the scene.

Navigating Construction Zones

For example, although this scene contains a confusing array of cones and signs on the left and right and a construction worker holding a sign right in the middle, the MLLM correctly reasons that the appropriate maneuver is to move into the right lane.

Navigating Complex or Conditional Road Rules

MLLM-based reasoning can also be enhanced by situational context provided via the prompt. In this example, while traditional image-recognition AI might be able to recognize an HOV sign, the MLLM can combine its understanding of the sign with situational information about the number of people in the car (2) and the time of day (8 AM) to reason that the vehicle may use the HOV lane on the left.

This reasoning is particularly important in complex scenes, where traffic lights, pedestrians, and obstacles can all interact: which elements are the most important to pay attention to?
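As a minimal sketch of how situational context like occupancy and time of day might be injected alongside the camera image, assuming a chat-completions-style multimodal API (the payload shape and function name here are illustrative, not Ghost's actual interface):

```python
import base64
from datetime import time

def build_hov_prompt(image_bytes: bytes, occupants: int, local_time: time) -> list:
    """Assemble a chat-completions-style multimodal message that pairs the
    camera frame with situational context the model cannot see in pixels.
    (Hypothetical payload shape modeled on common vision-chat APIs.)"""
    context = (
        f"Context: the vehicle currently carries {occupants} occupant(s) "
        f"and the local time is {local_time.strftime('%I:%M %p')}. "
        "Given the lane signage in the image, may this vehicle use the HOV lane?"
    )
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": context},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ]

# Example: two occupants at 8:00 AM, paired with a (stand-in) camera frame.
messages = build_hov_prompt(b"\xff\xd8fake-jpeg", occupants=2, local_time=time(8, 0))
```

The key point is that the prompt carries facts the image alone cannot: occupancy and clock time are what let the model resolve a conditional HOV rule.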

Navigating Crowded Urban Environments

In this example the MLLM understands the intersection and crosswalk, sees that the signal is red, alerts the car to the presence of a pedestrian on the right, and correctly instructs the car to wait for the green light before proceeding while remaining aware of the pedestrian. Furthermore, the MLLM anticipates the double-parked delivery vehicle down the road for the next driving segment.

In addition to these scenarios, the MLLM has shown the ability to anticipate the likely behaviors of pedestrians and cyclists, infer meaning from signs and the arrangement of cones and construction barriers, and decipher the complex associations between lanes, signs, and signals at intersections.

Optimizing Multi-modal Large Language Models for Autonomy

While MLLMs can generate some surprisingly good results for driving “out of the box,” the growing ability to fine-tune and customize both commercial and open-source MLLMs has the potential to accelerate this use case dramatically.

Training GPT models on more and more multi-modal driving data will improve them for the task, and fine-tuning will increase the quality of results, reduce the chance of hallucination, and yield well-structured, specific outputs that can be mapped to driving maneuvers.
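As a sketch of what consuming such well-structured output could look like, here is one way a fixed maneuver vocabulary might gate model responses before they reach a planner; the JSON schema, enum values, and function name are illustrative assumptions, not Ghost's actual interface:

```python
import json
from enum import Enum
from typing import Optional

class Maneuver(Enum):
    """Hypothetical closed vocabulary of maneuvers a fine-tuned model may emit."""
    KEEP_LANE = "keep_lane"
    MOVE_LEFT = "move_left"
    MOVE_RIGHT = "move_right"
    STOP_AND_WAIT = "stop_and_wait"

def parse_maneuver(raw: str) -> Optional[Maneuver]:
    """Parse a model response into a driving maneuver, rejecting anything
    outside the fixed vocabulary so hallucinated free text never reaches
    downstream driving logic."""
    try:
        payload = json.loads(raw)
        return Maneuver(payload["maneuver"])
    except (json.JSONDecodeError, KeyError, ValueError):
        return None  # unparseable or out-of-vocabulary: reject

structured = parse_maneuver('{"maneuver": "move_right", "reason": "construction"}')
hallucinated = parse_maneuver("turn into the lake")  # rejected -> None
```

Constraining outputs to a schema like this is one way fine-tuning can connect open-ended language reasoning to a small, verifiable action space.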

It’s also important to test and validate this capability in the real world as we iterate, which in this case means on the road. Ghost’s fleet of development vehicles is already sending data to the cloud for MLLM analysis, and we’re actively developing autonomy capabilities that bring MLLM insights back to the car.

The Large Model Architecture for Autonomy

We believe MLLMs present the opportunity to re-think the autonomy stack holistically.

Self-driving technologies today have a fragility problem. They tend to be built “bottom-up”: many cobbled-together point AI networks and hand-written driving logic performing the various tasks of perception, sensor fusion, drive planning, and drive execution, all atop a complicated stack of sensors, maps, and compute. This approach has led to an intractable “long tail” problem, where every corner case discovered on the road prompts yet another point AI or software patch in an attempt to iterate toward safety. When a scene becomes so complex that the in-car AI can no longer drive safely, the car must “fall back”: to remote humans in a tele-operations center in the case of robo-taxis, or by alerting the driver to take over in the case of driver-assistance systems.

MLLMs present the opportunity to solve the problem “top-down,” starting from a world model. What if we could reason about driving with a model broadly trained on the world’s knowledge and optimize it for executing the driving task? What if such a model could reason about a scene holistically, going from perception to suggested driving outcomes in a single step? The result would be an autonomy stack that is both simpler to build and much more capable: a stack that can reason about complex and fluid urban driving scenarios beyond classic curated training.

Implementing MLLMs for driving requires a new architecture, as today’s MLLMs are far too large to run on embedded in-car processors. A hybrid architecture is required, in which large-scale MLLMs running in the cloud collaborate with specially trained models running in the car, splitting the autonomy task and long-term versus short-term planning between car and cloud.
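One way to picture this car/cloud split is a planner in which an embedded model produces a plan every frame, while slower, long-horizon cloud guidance is folded in only while it is fresh. This is a minimal sketch under that assumption; the class names, fields, and freshness threshold are hypothetical, not Ghost's design:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CloudGuidance:
    """Long-horizon advice from a cloud MLLM (fields are illustrative)."""
    maneuver_hint: str
    issued_at: float  # unix seconds when the cloud produced this advice

class HybridPlanner:
    """Sketch of a car/cloud split: the in-car model plans every frame on its
    own, and high-latency cloud guidance is honored only while recent."""
    GUIDANCE_TTL_S = 5.0  # assumed staleness bound for cloud advice

    def __init__(self) -> None:
        self._guidance: Optional[CloudGuidance] = None

    def on_cloud_guidance(self, g: CloudGuidance) -> None:
        # Arrives asynchronously over a high-latency channel.
        self._guidance = g

    def plan(self, now: float) -> str:
        # The embedded model must always produce a safe default by itself.
        local_plan = "keep_lane"  # placeholder for in-car model output
        g = self._guidance
        if g is not None and now - g.issued_at < self.GUIDANCE_TTL_S:
            return g.maneuver_hint  # fresh long-horizon advice wins
        return local_plan  # stale or absent guidance: stay local
```

The staleness check is the essential safety property of any such split: the car can never depend on the cloud being reachable or timely.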

Building, delivering, and validating this large-model autonomy architecture for safety will take time, but that doesn’t mean MLLMs can’t impact the autonomy stack much sooner. MLLMs can start by improving the data-center processes by which autonomy training data is curated, labeled, and simulated, and by which in-car networks are trained and validated. MLLMs can also be linked to existing autonomy architectures and start adding insights to them, growing in their capability to take over more and more of the autonomy task. For a deeper exploration of how large-scale models will impact the autonomy architecture, read this accompanying technical blog.

In Conclusion

Ghost’s autonomy platform is evolving quickly to include capabilities powered by large multi-modal vision-language models. If you are an automaker who would like to explore adding MLLM-based reasoning and intelligence to your ADAS or AV system, we’d love to collaborate.

Disclaimer: Ghost Autonomy’s MLLM-based capabilities are currently in development.  These video and image examples show MLLM-based analysis of driving scenes captured from Ghost vehicles driving in both autonomous and conventional mode. MLLM-based reasoning is not yet being returned to the car to impact actual driving maneuvers.