Welcome to the inaugural post in Ghost’s blog series on computer vision for L4 autonomy. In this first post we give some background on existing approaches to computer vision for L2 ADAS systems, and the challenges in evolving those technologies to L4. In future posts we’ll deep-dive into various aspects of Ghost’s vision perception technologies, covering both the fundamental algorithms and how our system links various neural networks together into Ghost’s end-to-end perception pipeline.
L4 Self-Driving: A Higher Bar
The past decade has seen a dramatic rise in adoption of computer vision ADAS systems for increasing auto safety and driving comfort – now the majority of passenger vehicles ship with camera or radar-driven L2 solutions for smart adaptive cruise control (ACC), auto emergency braking (AEB), and a host of other safety features. These systems generally rely on a single forward-facing camera and a dedicated processing chip that runs computer vision algorithms and neural networks. Many auto makers hope to evolve their systems from L2/L2+ to fully-capable L4 self-driving systems over time, working to expand the capabilities of their vision-centric solutions. But can L2 systems evolve naturally to L4? Let’s explore the existing approach to automotive computer vision, new requirements for L4 driving, and the challenges in evolving computer vision for L4.
The fundamental difference between L2 and L4 systems is that L2 systems are designed to help the driver, while L4 systems are designed to replace the driver, or at least remove the need for them to pay attention while activated. This difference results in a radically different set of design choices for L2 vs. L4, in terms of reliability, cost, and capabilities.
The most important difference is the relaxed reliability requirement that an assumed backup driver enables in L2 systems – the system doesn’t have to be perfect because the driver is always there. These relaxed requirements result in the following design choices, which are suitable for L2 but fail to meet the needs of L4:
- Failure resilience: one camera, one chip. If either fails – hand back to driver. If the view is blocked and the scene can’t be perceived – hand back to the driver.
- Object recognition certainty: neural networks identify objects with high probability, but not with certainty. Missed object? It’s OK – the driver should always be monitoring.
- Slow decision-making: due to a lack of object recognition certainty, many systems observe scenes and obstacles over time to increase certainty, taking up to 1 second to verify recognitions and make decisions. But at freeway speeds, 1 second is an eternity, during which the car will travel 30+ meters – and the driver will more likely have taken over to avert the issue.
- Limited operating domain/conditions: solutions are designed for normal conditions. If inclement weather, low visibility, or construction zones are encountered, the system simply disengages and hands back to the driver.
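To put the decision-latency point in numbers, here is a minimal sketch of how far a vehicle travels while a system is still verifying a detection. The function name and the example speeds are our own illustrative choices, not values from any particular ADAS product.

```python
def distance_during_latency_m(speed_kmh: float, latency_s: float) -> float:
    """Meters traveled at a constant speed while the system is still deciding."""
    if speed_kmh < 0 or latency_s < 0:
        raise ValueError("speed and latency must be non-negative")
    speed_mps = speed_kmh / 3.6  # km/h -> m/s
    return speed_mps * latency_s


if __name__ == "__main__":
    # Illustrative freeway speeds (assumed), each with a 1 second decision delay.
    for speed_kmh in (100, 110, 120):
        traveled = distance_during_latency_m(speed_kmh, 1.0)
        print(f"{speed_kmh} km/h for 1.0 s -> {traveled:.1f} m")
```

At roughly 108 km/h the car covers 30 meters per second, which is where the "30+ meters" figure above comes from.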
Despite these limitations, L2 systems play an important role in delivering safety today, but their fundamental design point around assuming an attentive human backup driver makes it very difficult to evolve these systems to attention-free L4 driving.
Traditional Approaches to Automotive Computer Vision
Vendors use a wide range of computer vision algorithms to provide the visual perception and understanding required for driving. Multiple algorithms are used together to form a perception pipeline, with the goal of detecting surfaces (road, curbs, barriers), actors (vehicles, humans), and semantic information (signs, lane markers) in a scene. Once detected, other algorithms are used to estimate distance, velocity, and motion direction of important actors and obstacles in the scene:
For example, a popular L2 ADAS solution is to use a single mono camera coupled with object detection to understand the scene. Once key actors are identified, pictorial depth estimation and multi-frame tracking are used to estimate distance, velocity, and motion.
To identify the key actors, either simple outline detection or, more recently, image-based AI is used to detect objects and determine their classification, such as car, van, semi-truck, motorcycle, bike, etc. Once the type of object is identified, the computer reasons about its probable size; for example, a typical passenger car might be ~1.9m wide. Once this probable size is estimated, the computer can count the number of pixels the object spans on the image sensor, and do basic geometry using the “seen” size and the “inferred” size to calculate an estimated distance from you (the ego vehicle) to the obstacle. While this distance estimation process (as seen in steps 1 to 3 in the diagram above) can be completed using a single camera image, it is also necessary to track how the obstacle is moving over time relative to the ego vehicle in order to inform driving decisions.
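The "basic geometry" here is essentially the pinhole camera relation: distance ≈ focal length (in pixels) × assumed real width ÷ observed width in pixels. The sketch below is a simplified illustration of that idea, not any vendor's implementation. The class-to-width table, the function name, and the 2000-pixel focal length are all assumptions made for the example.

```python
# Assumed real-world widths per class, in meters. The values for cars
# echo the figures used in this post; the table itself is hypothetical.
CLASS_WIDTH_M = {
    "small_car": 1.7,
    "car": 1.9,
    "large_car": 2.1,
}


def estimate_distance_m(class_name: str, width_px: float, focal_length_px: float) -> float:
    """Pinhole-model distance: Z = f * W / w.

    class_name      -- label produced by the object detector
    width_px        -- width of the detected object in the image, in pixels
    focal_length_px -- camera focal length expressed in pixels (assumed known
                       from calibration)
    """
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    inferred_width_m = CLASS_WIDTH_M[class_name]  # the "inferred" size
    return focal_length_px * inferred_width_m / width_px


if __name__ == "__main__":
    # A detection labeled "car" that spans 76 pixels, with an assumed
    # focal length of 2000 pixels.
    print(f"{estimate_distance_m('car', 76.0, 2000.0):.1f} m")
```

Note that the estimate depends entirely on the inferred width, so it is only as good as the classification that produced it.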
To determine motion, the computer then compares images over a period of time, the first step being to again use image recognition to identify the same object in both images. Once the object is found in both images, distance measurements can be compared to see how far the vehicle has traveled in that elapsed time, allowing the calculation of velocity. But in addition to velocity, it’s important to also understand motion direction, or motion vector as it’s often called. For this one can observe both how the object moves within the frame in terms of location, as well as how the object expands, a set of techniques referred to as Optical Flow. These algorithms allow the estimation of velocity and motion direction in all three dimensions.
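As a rough illustration of the two-frame comparison, the sketch below derives relative velocity from two distance estimates, and a time-to-contact estimate from how quickly the object's image expands. It is a deliberately simplified stand-in that uses only bounding-box width and center position, not a dense optical flow algorithm. The `Observation` type, the function names, and the constant-velocity assumption are ours.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One tracked detection of the same object (hypothetical structure)."""
    t_s: float          # timestamp, seconds
    distance_m: float   # estimated distance to the object
    width_px: float     # object width in the image
    center_x_px: float  # horizontal offset from the image center


def relative_velocity_mps(a: Observation, b: Observation, focal_length_px: float):
    """Return (longitudinal, lateral) relative velocity between two frames.

    Longitudinal is negative when the gap is closing. Lateral position is
    recovered from the pinhole relation X = x * Z / f.
    """
    dt = b.t_s - a.t_s
    if dt <= 0:
        raise ValueError("observations must be in increasing time order")
    longitudinal = (b.distance_m - a.distance_m) / dt
    lateral_a = a.center_x_px * a.distance_m / focal_length_px
    lateral_b = b.center_x_px * b.distance_m / focal_length_px
    return longitudinal, (lateral_b - lateral_a) / dt


def time_to_contact_s(a: Observation, b: Observation) -> float:
    """Time to contact from image expansion alone, assuming constant closing speed.

    With Z proportional to 1 / width_px, TTC at frame b is dt * w_a / (w_b - w_a).
    """
    dt = b.t_s - a.t_s
    if dt <= 0:
        raise ValueError("observations must be in increasing time order")
    growth_px = b.width_px - a.width_px
    if growth_px <= 0:
        return float("inf")  # object is not expanding, so the gap is not closing
    return dt * a.width_px / growth_px


if __name__ == "__main__":
    first = Observation(t_s=0.0, distance_m=50.0, width_px=76.0, center_x_px=0.0)
    second = Observation(t_s=0.5, distance_m=47.5, width_px=80.0, center_x_px=0.0)
    print(relative_velocity_mps(first, second, focal_length_px=2000.0))
    print(time_to_contact_s(first, second))
```

The expansion-based estimate needs no knowledge of the object's real size, which is one reason expansion cues are useful alongside size-based distance estimates.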
The Challenges of Scaling L2 Vision Systems
The above techniques vary a bit by vendor, but loosely define “the recipe” for traditional automotive computer vision. While they have proven to add value for L2 ADAS features, they have several real challenges when being adapted for L4 driving. Let’s explore the major areas of challenge below:
Image Recognition – Uncommon Objects. Perhaps the most obvious challenge with today’s image recognition-based systems is the most fundamental one – the image recognition itself. We’ll cover this in detail in an upcoming post, but it’s simply impossible to train a neural network to recognize everything that might be on the road – in every size, color, rotation, weather and lighting condition, and state of partial reveal. This presents an immediate challenge: many of these systems can’t avoid what they don’t recognize, so an unrecognized obstacle can cause a collision. Other sensors can be added to reduce this risk, but that introduces other issues (sensor fusion) and adds complexity. Put simply – today’s systems are too reliant upon image recognition as the first, base step.
Image Recognition – Misclassifications. Beyond the binary recognized/not-recognized issue above, a secondary challenge with image recognition is misclassification of objects, and thus erroneous estimates of distance. Let’s say an average car is 1.9m wide, a small car is 1.7m wide, and a large car is 2.1m wide. If a car is indeed recognized, but its size is misclassified, distance estimates derived from that size may be off by 10-20+%, potentially a large margin of error at highway speeds. Another common challenge is trailers: there is no common size, shape, or tail light width for trailers, making it very difficult to classify them and assume a width.
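Because a size-based distance estimate scales linearly with the assumed width, the relative distance error equals the relative width error. The short sketch below works through the car widths quoted above. The function name is hypothetical, and the calculation assumes a simple pinhole model with a perfectly measured pixel width.

```python
def distance_error_pct(assumed_width_m: float, true_width_m: float) -> float:
    """Percent error in estimated distance when the assumed width is wrong.

    Estimated distance is f * assumed / w and true distance is f * true / w,
    so the focal length and pixel width cancel out of the ratio.
    """
    if assumed_width_m <= 0 or true_width_m <= 0:
        raise ValueError("widths must be positive")
    return (assumed_width_m / true_width_m - 1.0) * 100.0


if __name__ == "__main__":
    # The system assumes an "average" 1.9 m car in both cases.
    for label, true_width in (("small car", 1.7), ("large car", 2.1)):
        err = distance_error_pct(1.9, true_width)
        print(f"{label} ({true_width} m) assumed to be 1.9 m: {err:+.1f}% distance error")
```

A small car mistaken for an average one appears about 12% farther away than it really is, which is the more dangerous direction for the error to go.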
Camera & Chip Failure. Part of the “L4 contract” is that the system be capable of driving and, in the worst case, bringing the car to a safe stop without any assumed user intervention. Most L2 systems today simply don’t fit that bill: they lack camera redundancy, chip/compute redundancy, and power redundancy, relying instead on human redundancy. Adding redundancy is possible, but it needs to be an integral part of the design, with not only redundant and/or fallback hardware, but also high-availability software capable of reasoning about and managing failures while continuing safe driving – a much more sophisticated system.
Occluded and Occulted Objects. Building on the challenges of image recognition, there are many situations where the camera’s view of an object might be occluded (debris partially covering the camera) or occulted (a large truck partially blocking the view of a car). Both these situations make the object look different, and dramatically reduce the chances of successful image recognition. Take, for example, a car in front of a semi in the lane to the right, that is switching into the ego vehicle’s lane. Can that car be recognized and disambiguated from the semi the moment it edges out into view, or does the system require a full view of the car before realizing that it is different from the semi? If a full view is required, as in many object recognition systems, the car may already be halfway into the ego lane.
Resolution, Distance, and Low Light. Image recognition systems are trained at the given resolution of the cameras they operate on, and it’s not uncommon today to have cameras in the 1-4MP range of resolution. The farther away an object is, the smaller the number of pixels it occupies on the image sensor, and at some point it no longer looks like a car, just a gray blob of 4 pixels. All image recognition systems have a minimum pixel size at which they can recognize objects, and this limits their effective perception distance. This is particularly challenging for detecting small obstacles (car parts or debris on the road) at a distance, and only gets worse in lower-light situations or inclement weather. Good object detection requires good light, clear sight, and many pixels, limiting distance. To solve this issue some manufacturers use 2 or 3 cameras at different fields of view (FoVs), wide, mid, and long-distance, but this creates further challenges in resilience, as none of these cameras are redundant, and a loss of any one camera now makes the system inoperable.
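To see how resolution and field of view bound perception distance, the sketch below estimates how many pixels wide an object appears at a given range, and the farthest range at which it still meets a minimum pixel width. The 1920-pixel image width, 50-degree field of view, and 10-pixel minimum are assumed example values, not the specification of any real camera or detector.

```python
import math


def focal_length_px(image_width_px: float, hfov_deg: float) -> float:
    """Focal length in pixels for an ideal pinhole camera with the given horizontal FoV."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def pixels_on_target(object_width_m: float, distance_m: float, f_px: float) -> float:
    """Approximate width of the object in the image, in pixels: w = f * W / Z."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    return f_px * object_width_m / distance_m


def max_range_m(object_width_m: float, min_width_px: float, f_px: float) -> float:
    """Farthest distance at which the object still spans min_width_px pixels."""
    if min_width_px <= 0:
        raise ValueError("min_width_px must be positive")
    return f_px * object_width_m / min_width_px


if __name__ == "__main__":
    f_px = focal_length_px(image_width_px=1920, hfov_deg=50.0)  # assumed camera
    for name, width_m in (("car", 1.9), ("road debris", 0.3)):
        px_at_100m = pixels_on_target(width_m, 100.0, f_px)
        limit = max_range_m(width_m, min_width_px=10.0, f_px=f_px)  # assumed minimum
        print(f"{name}: {px_at_100m:.1f} px wide at 100 m, recognizable out to ~{limit:.0f} m")
```

Narrowing the field of view raises the focal length in pixels and extends range, which is why some manufacturers add a dedicated long-distance camera, at the resilience cost described above.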
Vision-based L2 ADAS systems have had a profound impact on the auto industry, and have been an important first proof-point of the potential for computer vision in automotive safety. But the requirements of true L4 attention-free autonomy are much more challenging, and don’t just require an evolution/improvement of L2 vision systems, but truly a different design.
In future blog posts in this series we’ll discuss Ghost’s approach to universal object detection, innovations in physics-based AI, how image-based neural networks and video-based neural networks differ, and a whole host of other topics. Thanks for reading, and stay tuned for future posts!