To achieve attention-free operation, L4 systems must be built around reliable, redundant perception. Autonomous vehicles not only need redundant sensors and multiple sensing modalities (camera + radar in Ghost’s case); they also need a sensor fusion strategy to combine and appropriately interpret information from multiple sensors in real time.
Camera/radar fusion in traditional L2 systems has always been a challenging problem. Automotive-grade radar is often noisy, and it can be difficult to disambiguate radar returns of important actors from those of static objects, road barriers, signs, and reflections in the scene. To aid in this disambiguation, vision is often used (i.e., trust radar detections where vision can confirm a vehicle, person, animal, or object is present and identify it). However, monocular vision systems don’t measure distance directly and are subject to the limitations of image training/recognition, so accurately matching radar returns to imprecise visual signals is difficult. Tuning the fusion system to be more tolerant in such matching can produce erroneous detections (false positives), while requiring strict dual confirmation can miss important detections (false negatives).
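To make that tradeoff concrete, here is a toy one-dimensional sketch of gated dual confirmation. The distances, gate values, and function are made-up illustrations, not any production fusion logic:

```python
def confirmed_detections(radar_ranges_m, vision_ranges_m, gate_m):
    """Keep a radar detection only if some vision detection lies
    within gate_m of it (dual confirmation). Ranges are simplified
    to 1-D along-road distances in meters."""
    return [r for r in radar_ranges_m
            if any(abs(r - v) <= gate_m for v in vision_ranges_m)]

radar = [20.0, 45.0, 80.0]   # radar ranges; the 80 m return is clutter
vision = [22.0, 44.0]        # imprecise mono-vision distance estimates

confirmed_detections(radar, vision, gate_m=3.0)   # both real objects confirmed
confirmed_detections(radar, vision, gate_m=40.0)  # loose gate: clutter "confirmed" (false positive)
confirmed_detections(radar, vision, gate_m=0.5)   # strict gate: real objects missed (false negatives)
```

The single `gate_m` knob captures the dilemma: widening it admits clutter, tightening it drops real objects that vision localized imprecisely.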
Ghost’s sensor redundancy and fusion strategy revolves around three principles:
- Independence: Elevate both vision and radar sensors to the point where they can be independently trustworthy (i.e., action can be taken based upon object identification or distance/velocity measurements from stereo vision, mono vision, or radar when operating in their known good states). Ghost’s sensor choices help greatly in this regard: stereo vision can measure distance directly, and modern imaging radar has high sensitivity and resolution.
- Redundancy: Ensure redundant sensors/modalities exist for each measurement – i.e., everything necessary to drive can be obtained from at least two independent sources.
- Quality: Know how operating conditions/ranges negatively affect sensor readings, and don’t trust (or assign very low confidence to) sensor readings in those conditions (e.g., radar is not reliable for very near-to-car measurements, vision is not reliable at very long distances, vision can’t detect objects that are occluded, etc.).
These rules inform Ghost’s drive program and how Ghost interprets, or “fuses,” inputs from vision and radar (along with the situational confidence rating of each input) to create sufficient scene understanding for safe driving.
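The redundancy and quality principles can be sketched together in a few lines. Everything here is illustrative: the `Reading` structure, the trusted-range numbers, and the agreement rule are hypothetical stand-ins, not Ghost’s actual confidence model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reading:
    source: str        # "radar", "stereo", or "mono"
    distance_m: float  # measured distance to the object

# Hypothetical operating ranges in which each modality is trusted
# (illustrative numbers only).
TRUSTED_RANGE_M = {
    "radar": (5.0, 250.0),   # radar unreliable very close to the car
    "stereo": (0.5, 120.0),  # stereo depth degrades at long range
    "mono": (0.5, 80.0),
}

def is_trustworthy(r: Reading) -> bool:
    """Quality principle: only trust a reading inside its modality's
    known good operating range."""
    lo, hi = TRUSTED_RANGE_M[r.source]
    return lo <= r.distance_m <= hi

def redundant_estimate(readings: list[Reading]) -> Optional[float]:
    """Redundancy principle: return a distance only when at least two
    independent, in-range sources agree on the measurement."""
    good = [r for r in readings if is_trustworthy(r)]
    if len({r.source for r in good}) < 2:
        return None  # not enough independent confirmation
    return sum(r.distance_m for r in good) / len(good)
```

For example, a radar return at 50 m backed by a stereo measurement at 51 m yields an estimate, while a lone radar return at 2 m (inside radar’s near-field blind spot) yields nothing.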
Another important consideration in fusion is early vs. late fusion. Late fusion keeps sensor modalities independent far into the stack, allowing the drive program to evaluate and fuse their outputs after each sensor modality has been independently processed. Early fusion combines raw sensor data as early as possible so that neural networks can operate across the richest possible data set. Suffice it to say, for now, that Ghost believes in and leverages both approaches. We’ll save the mechanics of this fusion for future blog posts, but the video below shows the results of this fusion in action.
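Schematically, the difference between the two approaches comes down to where in the pipeline the data is combined. In this toy sketch, simple thresholds stand in for neural networks; nothing here reflects Ghost’s actual pipeline:

```python
import numpy as np

def vision_detect(pixels: np.ndarray) -> np.ndarray:
    return pixels > 0.8           # "object" mask from vision alone

def radar_detect(returns: np.ndarray) -> np.ndarray:
    return returns > 0.5          # "object" mask from radar alone

def late_fusion(pixels: np.ndarray, returns: np.ndarray) -> np.ndarray:
    # Each modality is fully processed on its own; only the final
    # outputs are combined (here, a simple OR of the two masks).
    return vision_detect(pixels) | radar_detect(returns)

def early_fusion(pixels: np.ndarray, returns: np.ndarray) -> np.ndarray:
    # Raw data is combined first, so a single downstream model can
    # exploit cross-modality structure (here, a joint score over
    # the stacked inputs).
    joint_score = np.stack([pixels, returns], axis=-1).mean(axis=-1)
    return joint_score > 0.6
```

The same inputs can produce different answers under the two schemes, which is exactly why a system might want both: late fusion preserves independent, auditable per-sensor decisions, while early fusion lets one model see evidence that neither sensor would act on alone.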
Let’s explain each window in the video:
- Top Left: Camera view. This is the source view from the front left camera, showing the dense traffic environment in this example (US Highway 101S in the Bay Area).
- Top Right: KineticFlow Stereo Disparity. The KineticFlow visual neural network combines the left and right camera outputs to produce a per-pixel disparity map (which can be used to calculate depth or distance directly). In this case near is light green, far is black, and the road surface is a continuous gradient from green to blue to black. When you see an object “rise up” from the road in roughly constant color, that’s an object that has been detected via depth to be non-road. While only depth information is shown here, multiple frames of depth information can be used to calculate velocity and motion direction for each pixel as well (not pictured but available).
- Bottom Left: Projected Stereo Disparity. This view is simply a top-down projection of the stereo disparity data discussed above. The green wavy lines represent the road surface, and the blue clusters are objects that rise above the road. Since this is vision, the presence of an object results in a gap in the visual field behind it, as it obscures the road beyond it.
- Bottom Middle: Radar. Ghost’s imaging radar output is visualized here, where each small dot is a radar return. The video “trails” each return for 0.5 seconds so that clusters of returns are a bit easier to see. The radar points are colored based upon ego-compensated velocity (how each object is moving after the motion of the ego vehicle has been removed). Non-moving points are red; these are mostly road barriers, as you will see on the right side of the video, but can also be stopped vehicles or bridges. Purple dots are moving at about the same velocity/direction as the ego vehicle. Finally, big red dots are detected objects that have been deemed relevant to Ghost (i.e., within the ego lane or +/- 1 lane), with red outlines encasing the clustered radar points to give a rough approximation of object size.
- Bottom Right: Fusion. This view shows it all coming together, with yellow boxes indicating relevant objects detected by radar, vision, or radar + vision. You’ll notice most are redundantly detected, but if you look at distant objects in the ego lane you’ll see radar-only detections (as vision is obscured by the lead vehicle), and if you look right in front of the ego vehicle when traffic is stopped, you’ll see vision-only detections. In this particular view, the yellow box represents the back plane of the detected vehicle.
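Two quantities in these views follow from standard relationships: stereo disparity converts to depth as Z = f·B/d (focal length times baseline over disparity), and ego-compensated velocity adds the ego vehicle’s own speed back to the radar’s relative measurement. A minimal one-dimensional sketch, with made-up camera parameters:

```python
def disparity_to_depth(disparity_px: float,
                       focal_px: float = 1000.0,  # example focal length (pixels)
                       baseline_m: float = 0.3    # example stereo baseline (meters)
                       ) -> float:
    """Depth from per-pixel stereo disparity: Z = f * B / d.
    Larger disparity (nearer objects) means smaller depth."""
    return focal_px * baseline_m / disparity_px

def ego_compensated_velocity(relative_mps: float, ego_mps: float) -> float:
    """Radar measures velocity relative to the moving ego vehicle;
    adding the ego speed back gives world-frame motion. A static
    barrier (relative = -ego speed) compensates to ~0, and a car
    pacing the ego vehicle (relative = 0) compensates to ego speed.
    (1-D simplification of the radial-velocity case.)"""
    return relative_mps + ego_mps
```

With these example parameters, a 10-pixel disparity corresponds to 30 m of depth, and a barrier closing at 25 m/s while the ego vehicle drives at 25 m/s compensates to zero, matching the red (non-moving) coloring described above.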
Hopefully this quick example gives you some insight into how Ghost leverages both visual and radar sensing to detect objects and calculate distance and velocity. Stay tuned for future posts that dive deeper into how fusion works, and how more challenging scenarios are managed.