Part 1 – Big models are big news
LLMs are turning the world upside down. LLMs, and more broadly foundational models, are accelerating the capabilities of automation tools and assistants at an unprecedented rate. The latest expansion to multi-modal capabilities (MLLMs), as with OpenAI’s GPT-4 Vision, will rapidly expand the ways these models interact with and affect people all over the world.
As a gross generalization, this is the era of foundational models. These very large models are eating small models: 10Bn+ parameter universal models trained on massive datasets encompassing the entire world’s information are laying waste to 50Mn parameter task-specific models. Why? Large models solve many tasks from a single model. Large models also outperform smaller models by extrapolating one step beyond simple correlations and solving new tasks on data they were never trained on (so-called zero-shot), and they can incorporate context and history about the data to provide better answers.
The main difference is the ability to reason – large models have encoded a generalized “understanding of the world” that has proven to be quite useful in universally solving problems, a feature lacking in smaller models. For example, a large model can understand the concept of a driving lane, while a single-task model is only trained to find white painted markers on the road. As a result, the large model will perform much better in the ambiguous circumstances found in the real world, like when the paint is faded or splattered, or construction barrels have altered the path of travel.
The benefit of the foundational model paradigm is better performing applications with a lot less custom development effort. The performance part is obvious – large models just work better for lots of tasks. But the reduced effort is also extraordinary – it is hard and expensive to make good models! A lot of time and money is spent collecting, cleaning and improving ground-truth data, all to make just one model perform one task. While large models are also extremely hard and expensive to build, once they are built they can execute all kinds of different tasks. You might only need one (or a few) to perform thousands (or even millions) of tasks.
Large foundational models transforming model engineering is hardly new. Large companies have developed open-source foundational models that have quickly become the industry standard – recent examples include segmentation, machine translation and speech understanding models from Meta, and image understanding and captioning models from Microsoft. The advent of deep reinforcement learning in the mid-to-late 2010s, led by DeepMind and OpenAI, converted the task of building control systems from ad hoc programming to a single methodology of learning for all control tasks. The typical progression of model engineering has followed this pattern:
- One Task, One Domain: A simple model for a specific use case (e.g. object detectors for roads, depth segmentation models for indoor scenes, image captioning models, chatbots for web applications)
- One Task, Every Domain: Expanding the application of that simple model to lots of use cases (e.g. object detectors for everywhere (YOLO, DINO etc.), depth segmentation for everything (MobileNet), chat plugins for multiple products)
- Every Task, Every Domain: Large models that can do everything – a paradigm shift made possible by new LLMs (e.g. Florence, GPT-4V, ChatGPT)
- Every Task, One Domain: Optimizing large models for one domain, enabling real-time applications and higher reliability (e.g. GPT-3.5-Turbo for interactive searching, Harvey.ai for researching and drafting legal docs, DriveGPT for autonomous planning)
Part 2 – Today we are still driving on small models
Autonomous driving is still dominated by the small model paradigm developed over the past two decades. And while much progress has been made, we are all still driving ourselves to work.
Here is what is and isn’t working across both consumer autonomy and robotaxi applications:
What is working…
- Specialized sensors & hardware – Marrying software and purpose-built hardware has yielded superhuman performance on specific tasks – e.g. lidar for obstacle detection, multiple cameras for different tasks such as obstacles vs traffic lights, radars for seeing far away etc. Most redundancy is also achieved through multiple sensors instead of robust models for just cameras.
- Specialized, single-purpose models – Narrow models deliver the building blocks of a scene, like recognizing lanes or objects. They work well enough as standalone assistance features and can be combined to handle most basic driving tasks. However, every model requires a large amount of either human-labeled data or expensive offline evaluation to develop training data.
- Mapping – Anything that can be mapped well can be driven well. Robotaxi systems that rely on accurate HD Maps and precision localization work – see Waymo and Cruise handling downtown San Francisco with few issues.
What isn’t working…
- Zero-shot generalization – Existing models struggle with new or unusual scenes. If not sufficiently trained, they have no ability to reason from first principles on what to do next. The solution has been to build another special purpose model. This might be described as the inability to understand context. Dynamic scenes which are tough to map are a key weakness of most autonomous products right now.
- Interpreting driver and actor intent – Existing models largely fail to capture the subtleties of the driver’s intent inside the vehicle and the road actors’ intent outside the vehicle.
- Mapping the entire world, accurately – While well-mapped areas are mostly drivable, accurate HD mapping has proven difficult to scale. And without accurate maps, map-based driving does not work very well. The human brain builds a map as we drive through scenes, and autonomous vehicles will have to develop that capability as well.
- Scaling vehicles – The best performing robotaxis rely on specialized sensors, expensive compute and amalgamating lots of special purpose models, a complex and expensive recipe that has yet to scale to everyday drivers.
Part 3 – Foundational models will transform autonomous driving
Foundational models offer a new tool for solving some of the challenging problems that have held autonomous driving back from mass adoption.
The theory of foundational models in autonomy is simple – better performance by adding reasoning capabilities to navigate complex scenes, and reduced cost by dramatically simplifying model development.
In practice, foundational models are transforming nearly every area of autonomous driving development and on-road capabilities:
- Reasoning and planning (navigation) – LLMs have shown promise in going beyond pure correlations to demonstrating a real “understanding of the world.” This new level of understanding extends to the driving task, enabling planners to navigate complex scenes with safe and natural maneuvers without requiring explicit training. This offers a new path to solving “the long tail” (handling scenarios never seen before), the fundamental challenge of autonomous driving over the past two decades. LLMs also show the ability to use a lot of contextual information in decision making, enabling higher-order decisions previously reserved for expert code that is expensive to develop and still does not cover every scenario. In these contexts, large models are applied directly to the driving task, in the driving system runtime.
- Understanding and labeling data – Model engineering at its core is a data problem – better data makes better models. “Better data” is not just about scale, but completeness. The training set must represent every concept one might encounter in the real world – e.g. every lane marker type, every road configuration, every obstacle, every type of construction etc. It is expensive not only to collect all this data, but also to sort through it, finding and labeling relevant examples to develop a complete training set. It takes humans hundreds of thousands of hours to develop these training sets, and they are still incomplete. Large models have proven exceptionally useful for solving this problem, capable of sorting and labeling massive data sets very cheaply because of their unique ability to generalize zero-shot to complex problems through the linguistic interface. In this application, large models might not be used for inference in the final product, but are used to help train the models ultimately delivered in the final product.
- Interpretability – Early autonomous driving was dominated by large sprawling code-bases, which quickly became impossible to debug in complex scenarios. Learned approaches consolidated the code-bases by replacing them with combinations of neural networks, which improved performance but are also difficult to debug. LLMs have offered a new path to interacting with the attention layers in the neural networks, enabling both prompting and explainability inside the driving system. Again, large models here are a tool to help develop and interpret other models deployed in the runtime.
- Simulation – Generative vision models conditioned on actions (e.g. GANs, diffusion models) have been shown to be an effective tool for simulation, with the ability to create photorealistic driving scenes on the fly. However, it is not yet fully clear whether large vision models can generate interesting edge case scenarios. Pixel-perfect simulation renders are extremely useful for building planners and testing path prediction models, but might not be computationally efficient at the scale required for testing and building self-driving cars.
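The data-labeling use described in the list above can be sketched roughly as follows. This is a minimal illustration, not a description of any production pipeline: a large model proposes a label and a confidence for each frame, confident proposals are accepted automatically, and the rest are routed to human review. The labeler here (`stub_zero_shot_labeler`) is a hypothetical keyword-matching stand-in for a real MLLM call, and the label set, confidence values and threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed label set; a real project would define its own taxonomy.
CANDIDATE_LABELS = ("construction zone", "faded lane markings", "clear road")


@dataclass
class Prediction:
    label: str
    confidence: float


def stub_zero_shot_labeler(description: str) -> Prediction:
    """Hypothetical stand-in for a large model prompted with CANDIDATE_LABELS.

    A real system would send the image and a text prompt to an MLLM;
    a keyword match keeps this sketch self-contained and deterministic.
    """
    text = description.lower()
    if "barrel" in text or "cone" in text:
        return Prediction("construction zone", 0.93)
    if "faded" in text:
        return Prediction("faded lane markings", 0.88)
    return Prediction("clear road", 0.55)


def triage(
    frames: Dict[str, str],
    labeler: Callable[[str], Prediction],
    min_confidence: float = 0.8,
) -> Tuple[Dict[str, str], List[str]]:
    """Split frames into auto-accepted labels and a human review queue."""
    auto_labeled: Dict[str, str] = {}
    needs_review: List[str] = []
    for frame_id, description in frames.items():
        prediction = labeler(description)
        if prediction.confidence >= min_confidence:
            auto_labeled[frame_id] = prediction.label
        else:
            needs_review.append(frame_id)
    return auto_labeled, needs_review


if __name__ == "__main__":
    frames = {
        "f1": "orange barrels closing the right lane",
        "f2": "faded paint on a two-lane road",
        "f3": "empty highway at noon",
    }
    auto_labeled, needs_review = triage(frames, stub_zero_shot_labeler)
    print("auto-labeled:", auto_labeled)
    print("needs human review:", needs_review)
```

Raising `min_confidence` sends more frames to human review and lowers the risk of a wrong automatic label; lowering it does the opposite. Where to set it is a cost and quality trade-off that depends on the task.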
Part 4 – Foundational model limitations and potential solutions
While these are promising developments, LLMs today still have limitations for autonomous applications. But new solutions to these problems are emerging.
- Latency/real-time constraints – Safety-critical driving decisions must be made in <1 second. Existing LLMs running in data centers can take up to 10 seconds to deliver a result. There are at least two solutions to this problem already showing promise:
  - New hybrid-cloud architectures are emerging, supplementing in-car compute with powerful data center processing. With optimized connectivity and data center performance, round-trip latency is already <2 seconds.
  - Purpose-built LLMs can also be compressed into small enough form factors to run in the car on more limited compute. As in-car compute improves, increasingly large and complex models can fit within the in-car runtime requirements.
- Hallucinations – LLMs reason based on correlations, even if those correlations are invalid in a particular scenario – e.g. both a red light and the presence of a pedestrian in an intersection would normally suggest the car should stop, but if that person in the intersection is a crossing guard waving traffic through, the car should proceed. Positive correlations do not always deliver the correct answer.
Reinforcement learning from human feedback (RLHF) offers a potential solution to this problem by aligning the model with human feedback to understand these sorts of complex driving scenes. Smaller models would require expert code to solve these types of edge cases, but language-based learning gives us the ability to quickly align the model with much less effort.
- “The New Long Tail” – Language models have “everything” encoded into them, but still may not have every driving-specific concept covered – e.g. the ability to predict the path of a vehicle in a complex urban environment.
One potential solution here is exposing the model to long sequences of proprietary driving data, which can embed these more detailed concepts in the model.
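To make the latency constraint above concrete, the sketch below does two things. First, it computes how far a vehicle travels while waiting for a model response. Second, it shows one way a hybrid setup could enforce a deadline: ask the remote model, and fall back to an on-board planner if the answer does not arrive in time. The function names (`remote_llm_plan`, `onboard_plan`), the 65 mph speed and the simulated delays are illustrative assumptions, not a description of any real system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

MPH_TO_MPS = 0.44704  # 1 mph is exactly 0.44704 m/s


def distance_during_latency(speed_mph: float, latency_s: float) -> float:
    """Metres travelled at constant speed while waiting latency_s seconds."""
    return speed_mph * MPH_TO_MPS * latency_s


def remote_llm_plan(scene: str, simulated_delay_s: float) -> str:
    """Hypothetical stand-in for a data-center model call."""
    time.sleep(simulated_delay_s)  # pretend network + inference time
    return f"remote plan for {scene}"


def onboard_plan(scene: str) -> str:
    """Hypothetical stand-in for a small model running in the car."""
    return f"onboard plan for {scene}"


def plan_with_deadline(
    scene: str,
    deadline_s: float,
    simulated_delay_s: float,
    executor: ThreadPoolExecutor,
) -> str:
    """Use the remote plan if it arrives before the deadline, else fall back."""
    future = executor.submit(remote_llm_plan, scene, simulated_delay_s)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # The remote answer is late; the car must still act on time.
        return onboard_plan(scene)


if __name__ == "__main__":
    for latency in (10.0, 2.0, 1.0):
        metres = distance_during_latency(65, latency)
        print(f"{latency:4.0f} s at 65 mph -> {metres:.1f} m travelled")

    with ThreadPoolExecutor(max_workers=2) as pool:
        print(plan_with_deadline("merge", 1.0, 0.01, pool))  # remote in time
        print(plan_with_deadline("merge", 0.05, 0.4, pool))  # remote too slow
```

At 65 mph a car covers about 29 m every second, so each second of model latency is roughly 29 m driven on stale information. The fallback pattern keeps the deadline fixed and treats the remote model as an optional improvement rather than a dependency.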
Part 5 – Industry implications
Large foundational models are changing the path to autonomy at scale. Autonomous driving to date has been a long march, with only a handful of vehicles today tackling the most complex urban environments. Large models will expand not only the capabilities, but also the availability of these products, accelerating their impact on the life of everyday drivers. Here are a few areas where we should expect the biggest impact:
- Speed – The time to build, and time to market, will go down dramatically. What has taken more than a decade to accomplish in existing robotaxis can now be done in just a few GPU years of training and evaluation.
- The end of robotics – The critical engineering capabilities are shifting away from robotics to data engineering & model training. Expertise in controls and reinforcement learning is being superseded by expertise in imitation learning and behavior cloning.
- Lower barriers to entry – Foundational models are dramatically reducing the amount of human-labeled data required to develop advanced, highly capable models. A fleet capturing millions of miles of real-world driving is no longer necessary, which enables new entrants to get to market without massive capital expenditures.
- Evolving compute paradigms – Introducing hybrid-compute architectures that rely more on cloud & cloud connectivity changes the requirements for in-car processing. Advanced autonomous driving may soon be possible with significantly cheaper and more scalable in-car compute. $30K cars will finally have the capabilities of today’s $500K robotaxis.
- Attention-free driving – “Attention-free” / L4 autonomy requires a system that can reason about any given scene, even those never seen before, and deliver a safe driving maneuver every single time. The only models capable of this type of zero-shot generalization today are LLMs, which can finally deliver on the life-changing promise of autonomous driving.
Part 6 – Ghost Autonomy is bringing MLLMs to driving
Ghost Autonomy is pioneering the use of multi-modal large language models in autonomous driving. Ghost is developing natural language enabled driving solutions with MLLMs that allow autonomous vehicles to reason about safe maneuvers in any driving environment. Ghost’s platform allows leading automakers to bring artificial intelligence and advanced autonomous driving software into the next generation of vehicles, now with expanded capabilities and use cases with MLLMs.
In addition to developing proprietary large models, Ghost is working with leading model providers including OpenAI and Microsoft to jointly develop and customize models specifically tailored to autonomy applications. Ghost is actively testing these capabilities via its development fleet today, and is partnering with automakers to jointly validate and integrate new large models into the autonomy stack.