Talking Autonomy

Mapless Driving

How Ghost sees and understands lanes - without the aid of HD maps

By Ghost

August 15, 2023


Matt Kixmoeller

Welcome to "Talking Autonomy." In this episode we're going to be talking about mapless driving, and in particular how Ghost navigates the world without the aid of HD maps. I'm joined today by Arun, a key member of our model engineering team. Arun, why don't you tell us a little bit about what you do here at Ghost and what you did previously in your career?

Arun Narayanaswamy

I'm a model engineer at Ghost. I primarily work on neural network models for perception, focused on lane perception right now. Prior to this, for 10 years I was at Google working on computer vision and machine learning in Street View imagery and in Google Research.

Matt Kixmoeller

All right, it sounds like you're deep into both computer vision and maps. So let me set up the problem a little bit. One of the unique choices we've made at Ghost in terms of how we drive is not to use HD maps. A lot of autonomy companies will use centimeter-accurate HD maps to build a very good understanding of the world they're driving in, and then localize themselves to that map so they can figure out where to drive, where to turn, and make all the other decisions that need to be made. We have not taken that approach at all, because there are real challenges with map quality and keeping maps up to date. So why don't you give us a sense for why HD maps are so difficult?

Arun Narayanaswamy

First of all, like you said, HD maps quickly go out of date. When you're driving at highway speeds, precision is really important, and if you're working off of an out-of-date map, that creates real safety risk. What you really want to do is drive off what you see in the actual visual field and drive off perception. HD maps also cause difficulty in terms of localization: your GPS could be off, your SLAM localization could be off, and all of these potential errors and noise add up to more issues downstream.

Matt Kixmoeller

And just to put it very simply, at a high level: as humans, we don't drive that way. We're not holding up a map, trying to localize our car to that map while we're trying to make decisions on the road. We're perfectly capable of looking out the front window, seeing the paint on the road, understanding other visual cues, and staying in a lane. So why can't a computer drive that way?

Arun Narayanaswamy

That's right.

Matt Kixmoeller

This may sound like kind of a simple question, but why are lanes so important? Obviously it's important to keep yourself in the lane, but lanes give us so much more than that, right? Why is this such a key signal for driving?

Arun Narayanaswamy

Lanes are very important in highway scenarios because of the higher speeds. Most of the time as humans we are driving subconsciously, just keeping the lane, and we keep within it to a very tight tolerance. So when you're driving at 70 miles per hour, those distances matter a lot, and you want to see very far ahead in the lanes. Lanes are also important for understanding where obstacles are placed, in your lane or in neighboring lanes at a distance, both in front and behind. You want to get a 360-degree understanding of the scene around you: whether people are moving into your lane and whether you have to react to them for safety reasons. So lanes are essential to a precise understanding of driving, especially on the highway because of the higher speeds.

Matt Kixmoeller

Let's get a little bit more into how we do it. At Ghost we have a dedicated computer vision neural network for understanding obstacles. We call that KineticFlow, and it uses stereo and mono to understand what's an obstacle, how far away it might be, what velocity it's moving at, and what direction it's heading. But we've also built a completely separate neural network to understand the lane and the scene information. Why don't you walk us through how that works?

Arun Narayanaswamy

Our scene understanding neural network is monocular: it looks at a constant stream of images coming from a single camera, though it can also work off the other camera in a stereo pair as a backup, for resiliency. It looks at a time series of these images and decides where the lane markers are and what the drivable regions are.
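
To make this concrete, here is a minimal sketch of what a temporal, single-camera lane model of this general shape could look like. Everything here, the architecture, the layer choices, and the name LaneSceneNet, is an illustrative assumption, not Ghost's actual implementation.

```python
# Illustrative sketch only; Ghost's real network architecture is not public.
import torch
import torch.nn as nn

class LaneSceneNet(nn.Module):
    """Monocular scene model: consumes a short time series of frames from a
    single camera and predicts per-pixel lane-marker and drivable-region masks."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        # Per-frame encoder: downsamples each image to a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Temporal fusion across the frame sequence (a 3D conv keeps this short).
        self.temporal = nn.Conv3d(hidden, hidden, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))
        # Two decoder heads: lane-marker mask and drivable-region mask.
        self.lane_head = nn.ConvTranspose2d(hidden, 1, kernel_size=4, stride=4)
        self.drivable_head = nn.ConvTranspose2d(hidden, 1, kernel_size=4, stride=4)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, *feats.shape[1:]).permute(0, 2, 1, 3, 4)
        fused = self.temporal(feats)[:, :, -1]  # features for the latest frame
        return self.lane_head(fused), self.drivable_head(fused)

model = LaneSceneNet()
clip = torch.randn(1, 8, 3, 256, 512)  # an 8-frame clip from one camera
lane_logits, drivable_logits = model(clip)  # each (1, 1, 256, 512)
```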

Matt Kixmoeller

So it's fundamentally looking at the paint on the ground, but its main purpose is then to understand the scene and tell us where exactly the lane delimiters are between the ego-lane and the left lane and the right lane.

Arun Narayanaswamy

That's right, but it looks at more than just the paint markers, for obvious reasons. The paint markers are often not visible: they're occluded, they're washed out, and during construction and in other cases you can have several sets drawn on top of each other. There are a lot of noisy scenarios where just going by a single paint marker is not a good solution. What the neural network does is look at the overall context to deduce where the drivable lanes are, what the regions separating the different lanes are, and what types of lane-marking delimiters sit between the lanes.

Matt Kixmoeller

This network, I understand, also gives us a lot more than just the lane information. What are all the different outputs that we get from it?

Arun Narayanaswamy

It gives us the types of the lane markers, their colors, and their semantic meaning. It also detects the pitch and the roll of the road ahead of us, which helps us drive with better dynamics. And it gives us the drivable parts of the road, the shoulder region where you should not be driving, and the direction of the vanishing point at each pixel.
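
A hypothetical container for those per-frame outputs might look like the sketch below. The field names and encodings are assumptions made for illustration, not Ghost's actual interface.

```python
# Hypothetical bundle of the scene network's outputs; field names and
# encodings are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneOutputs:
    lane_marker_mask: np.ndarray  # (H, W) bool: pixel lies on a lane marker
    marker_type: np.ndarray       # (H, W) int: e.g. 0=solid, 1=dashed, 2=double
    marker_color: np.ndarray      # (H, W) int: e.g. 0=white, 1=yellow
    drivable_mask: np.ndarray     # (H, W) bool: drivable road surface
    shoulder_mask: np.ndarray     # (H, W) bool: shoulder, not for driving
    vanishing_dir: np.ndarray     # (H, W, 2): unit vector toward the vanishing point
    road_pitch_deg: float         # pitch of the road ahead, estimated from imagery
    road_roll_deg: float          # roll of the road ahead, estimated from imagery
```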

Matt Kixmoeller

The pitch and roll is interesting, because other solutions have to either use a really complicated IMU or get that information from a map, but we can actually detect that right from the visual field.

Arun Narayanaswamy

That's right.

Matt Kixmoeller

Tell us a little bit about how we train this network. Where does the information come from, and how do we teach the network to understand these scenes?

Arun Narayanaswamy

We collect a lot of data by driving on the road, obviously, and we take all this data and process it as individual frames. It goes through an auto-labeling system that runs offline in the data center, where we spend a lot of compute time solving the problem of what the correct answer is for each particular image. That involves aligning our visual field with existing maps, which aligns us to the exact delimiters at the exact pixels, and so on. The offline maps also give us other information, such as the types of delimiters that exist in that particular scenario. So we learn off of these, and then we train the neural network so we don't have to use maps in the car. Once it's trained, we also evaluate the neural network by comparing it to maps. For both training and evaluation, we look at a lot of dimensions. If we just took all the videos that we collected and trained on them, they would obviously be dominated by the easiest and most common scenarios, and they wouldn't have enough of the difficult scenarios. So we stratify the training data by different scenarios, for example, different lighting conditions, different types of lane markers, different weather conditions, and so on. We make sure that we have enough samples for training, and for evaluation we have a separate holdout set, stratified on all these dimensions, so that we can make sure we are not regressing on any one of those particular dimensions.
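
As a rough sketch of what this kind of stratified sampling can look like in practice, the snippet below groups labeled clips by condition and balances them into training and holdout sets. The strata, counts, and function names are assumptions for illustration, not Ghost's actual pipeline.

```python
# Illustrative stratified sampling over labeled clips; not Ghost's pipeline.
import random
from collections import defaultdict

def stratify(clips, key_fn):
    """Group clips into strata, e.g. by (lighting, weather)."""
    strata = defaultdict(list)
    for clip in clips:
        strata[key_fn(clip)].append(clip)
    return strata

def balanced_split(clips, key_fn, per_stratum=1000, holdout_frac=0.1, seed=0):
    """Cap each stratum at `per_stratum` clips so rare conditions are not
    drowned out by common ones, and reserve a stratified holdout set."""
    rng = random.Random(seed)
    train, holdout = [], []
    for members in stratify(clips, key_fn).values():
        rng.shuffle(members)
        members = members[:per_stratum]
        cut = max(1, int(len(members) * holdout_frac))
        holdout.extend(members[:cut])
        train.extend(members[cut:])
    return train, holdout

# Tiny example with hypothetical clip metadata: the two "night" clips are
# kept in proportion instead of vanishing behind the "day" majority.
clips = [{"id": i, "lighting": cond, "weather": "clear"}
         for i, cond in enumerate(["day"] * 8 + ["night"] * 2)]
train, holdout = balanced_split(clips, key_fn=lambda c: (c["lighting"], c["weather"]))
```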

Matt Kixmoeller

This is the hard work of a model engineer! You're basically curating this training set from the many, many hours of video we've collected on the road over the years, trying to find exactly the representative input characteristics that let you fully train the model without overfitting it in one dimension or another.

Arun Narayanaswamy

That's right. These regression sets help us evaluate whether a new model is actually better along every one of the dimensions, or just better in one dimension. They are important for making sure we are not going backward on any one particular difficult aspect of scene perception.
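
A per-dimension regression check of this kind could be as simple as the sketch below, which accepts a candidate model only if it does not fall below the baseline on any stratum. The function and the example scores are hypothetical.

```python
# Illustrative per-stratum regression gate: accept a new model only if it does
# not get worse on any evaluation dimension. Strata and scores are hypothetical.
def passes_regression_gate(baseline, candidate, tolerance=0.0):
    """Each argument maps a stratum, e.g. ("night", "rain"), to a score where
    higher is better. Fail if the candidate regresses on any stratum."""
    regressions = {s: (baseline[s], candidate[s])
                   for s in baseline
                   if candidate[s] < baseline[s] - tolerance}
    return len(regressions) == 0, regressions

ok, regressions = passes_regression_gate(
    {("day", "clear"): 0.97, ("night", "rain"): 0.88},
    {("day", "clear"): 0.98, ("night", "rain"): 0.85},
)
# ok is False: the candidate improved overall but regressed on ("night", "rain").
```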

Matt Kixmoeller

One of the harder situations I often hear about is dealing with missing or occluded lane markers. How does the network handle the fact that you might not always have all the markers present, or they might be partially scraped off the ground?

Arun Narayanaswamy

Most of this comes from context. The model understands it from the overall context of where the other painted lane markers are. If you can imagine, in most of the training data you often have cars that are occluding lane markers, cars cutting in. The neural network knows that lane markers don't just vanish because a car occludes them. Based on the overall context it decides where the lane markers are, and because it has a time series of camera images, it makes this decision based on a few seconds of data.
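
As a toy illustration of why a time series helps, the sketch below carries lane evidence forward in time so that a brief occlusion does not erase the estimate. A real network fuses temporal features internally; this explicit filter is only an analogy.

```python
# Toy temporal filter: lane evidence decays slowly through occluded frames
# instead of disappearing. An analogy only, not how the network works inside.
import numpy as np

def smooth_lane_confidence(per_frame_conf, decay=0.8):
    """per_frame_conf: (T, W) lane-marker confidence per frame for one image row.
    Returns a temporally smoothed estimate that survives brief occlusions."""
    state = np.zeros(per_frame_conf.shape[1])
    out = []
    for frame in per_frame_conf:
        # Blend old evidence with the new frame; occluded frames (near-zero
        # confidence) shrink the estimate gradually rather than zeroing it.
        state = decay * state + (1 - decay) * frame
        out.append(np.maximum(state, frame))
    return np.stack(out)

conf = np.array([[1.0], [0.0], [0.0]])  # marker visible, then occluded twice
print(smooth_lane_confidence(conf)[:, 0])  # -> [1.0, 0.16, 0.128]
```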

Matt Kixmoeller

Lighting can also be very challenging. Depending on the time of day or the light, lane markers might not even be visible, or might look very different, right?

Arun Narayanaswamy

One of the most challenging situations is glare at sunset, or entering or exiting a tunnel in very bright sunlight. The lighting conditions change quickly as you pass into or out of a tunnel, and the model needs to adjust itself for such scenarios.

Matt Kixmoeller

It sounds like there's really no shortcut. It's just about understanding those situations and diligently making sure we have training data to cover them all.

Arun Narayanaswamy

Correct.

Matt Kixmoeller

One of the key things we always think about is how we design Ghost in a way that it generalizes. It strikes me that this lane marker problem must be a hard one to generalize. If I think about here in the US, even across states you have somewhat different-looking lane markers sometimes. Internationally, they can be very different. Does the model generalize easily, or do we have to retrain for every location we drive in?

Arun Narayanaswamy

We don't have to retrain for every location, but we do need data that covers the whole gamut of different kinds of lane markers. The US has some homogeneity in terms of lane markers on highways, but there are subtle differences between them. We collect data from different regions in the US, in different lighting conditions and different weather conditions. For example, we won't get enough snow in the Bay Area, but we can get enough snow conditions in other places. We put all this data into our training set and then evaluate on all these conditions, so we cover the difficult conditions for the neural network that are otherwise hard to find in any one place.

Matt Kixmoeller

It sounds like your team has to do the hard work to make sure that we cover all the conditions, but as long as those conditions are covered, we can drive in new locations without explicit retraining. All right, well, there you have it: a little bit of insight into how Ghost achieves mapless driving. Looking out the window, understanding the scene with our sensors, and driving just like a human does. Thanks very much, Arun.
