Talking Autonomy

Ghost Data Pipeline

How we wrangle the data that powers machine learning, simulation, and analytics at Ghost.

By Ghost

May 24, 2023

It sounds trite to say that data underlies every aspect of building an autonomous vehicle, but data is inextricably connected to everything from ensuring safety to enabling engineering productivity at Ghost. In this video, machine learning systems engineer Scott Kirklin walks us through the major "data flows" at Ghost, including:

  • Test drive tracking, analysis, and observability - is Ghost getting better at driving?
  • Neural network training - how is driving data collected, labeled, curated, and used for training neural networks?
  • Simulation - how are complex scenarios simulated and new code tested before it ever hits the road?

Scott will give us an in-depth look at not only how Ghost uses data, but some of the tools and infrastructure for wrangling and transforming data that ensure engineering productivity at high scale.

If you enjoyed this video, join Scott LIVE for an AMA session Thurs, Jun 1st, 10AM PST / 1PM EST, where you can go deeper and ask Scott questions about Ghost's data pipeline, tools, processes, and anything else you like! REGISTER NOW: https://ghost.zoom.us/webinar/register/8116849534829/WN_asyJvIkFTgGWq5cvLRdong

Video Transcript (edited for brevity):

Matt Kixmoeller:

Welcome to Talking Autonomy. In this episode, we're going to dive into the data behind Ghost and specifically what are some of the major data flows of how we take data from cars, all the way through machine learning, back out to cars again. I'm joined today by Scott Kirklin. Scott, why don't you introduce yourself? Give us a sense for what you do here and maybe what you did in your career previously.

Scott Kirklin:

Sure. Before Ghost, I was in finance. I was at Jump Trading for about seven years, and there I was primarily responsible for building out our research infrastructure as well as some bits of production infrastructure. More or less doing the same thing here - the name of the game is data and you have to get that in, sanitize it, work with it, use it.

Matt Kixmoeller:

Either way, autonomous driving or finance, it's a ton of data that powers the engine above it. All right, as we said at the outset, we're just going to try to give people a sense for the major data flows through the company and how we move data around and leverage it for our various processes. So why don't you give us a high-level view of that?

Scott Kirklin:

So the 30,000-foot view is that all of our data essentially comes from the cars. The cars are recording and that flows into the data center. Once it arrives in the data center, we do lots of validation and metrics - that's how we validate the drives. At the same time, the data is also used as an input for training. So machine learning requires data and this is the input to those models. Then from there, it flows back out to the car once those models have been trained.

Matt Kixmoeller:

So let's start with that infrastructure around understanding driving. We're out driving every day, all day, at all times of the day. It's important that we learn from all those drives, right? Maybe in the very beginning it was okay to have some people in the car making human observations, but we quickly scaled beyond that and realized we needed infrastructure to understand drive quality and help us improve. So walk us through that pipeline. How does it work?

Scott Kirklin:

The car is recording, and every two minutes it cuts a new data segment and pushes that two-minute chunk to our data center. It lands in the data center basically just as a file on disk. Then we ingest it into a custom in-house storage framework for working with time series data.
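
To make that ingest step concrete, here is a minimal sketch of how a newly landed segment file might get registered for downstream processing. Everything here is hypothetical - the directory layout, file naming, and `ingest_segment` stand in for Ghost's in-house time series framework, which isn't public.

```python
from pathlib import Path
import time

INCOMING = Path("/data/incoming")    # hypothetical landing directory for two-minute segments
LEDGER = Path("/data/ingested.log")  # hypothetical record of segments already ingested

def ingest_segment(segment: Path) -> None:
    """Register one two-minute segment with the (hypothetical) time series store."""
    car_id, start_ts = segment.stem.split("_", 1)  # assumes names like "car42_20230524T101500.bin"
    # The real pipeline would write into Ghost's custom storage framework here;
    # this sketch just appends to a ledger so the example stays self-contained.
    with LEDGER.open("a") as ledger:
        ledger.write(f"{car_id},{start_ts},{segment}\n")

def watch_forever(poll_seconds: float = 5.0) -> None:
    """Poll the landing directory and ingest any segment file we haven't seen yet."""
    seen: set[Path] = set()
    while True:
        for segment in sorted(INCOMING.glob("*.bin")):
            if segment not in seen:
                ingest_segment(segment)
                seen.add(segment)
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch_forever()
```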

Matt Kixmoeller:

This data that comes up every two minutes, it's a combination of video data plus telemetry?

Scott Kirklin:

We're directly recording a number of raw sensors - anything from brake pressure to steering angle to whether the windshield wipers are on - as well as a lot of internal state from the driving program. Once it arrives, though, we have to fill in gaps, because we produce so much intermediate state in the driving program that we can't afford to save and upload all of it. So as soon as the data arrives in the data center, we re-run exactly the same code as in the car and fill in all of those gaps, so that we have a more complete view of exactly what was going on in the car from the perception data. These are the inputs to the next step, which is calculating a bunch of post-drive metrics - validation of drive performance, looking for any outlier events, anything we want to keep an eye on. This is the input to how we monitor our driving performance: we look at aggregate statistics, we look at outlier events, and we use that to say, is this release better than the last release? Are things improving? Have we regressed?
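
To illustrate the idea behind this gap-filling ("rehydration") step, here is a minimal sketch: replay the recorded sensor frames through the same per-frame code the car runs, so intermediate state that was never uploaded gets regenerated deterministically. The frame layout and the `step` function signature are assumptions, not Ghost's actual interfaces.

```python
from typing import Callable, Iterable

def rehydrate(frames: Iterable[dict], step: Callable[[dict, dict], dict]) -> list[dict]:
    """Replay recorded sensor frames through the same per-frame driving code
    to regenerate intermediate state that was too large to upload from the car.

    `frames` yields the raw sensor data saved on the car; `step` is assumed to be
    the exact function the car runs each cycle, taking (frame, previous_state)
    and returning the new internal state.
    """
    state: dict = {}
    rehydrated = []
    for frame in frames:
        state = step(frame, state)              # same code as on the car, so results match
        rehydrated.append({**frame, **state})   # raw inputs plus regenerated internal state
    return rehydrated
```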

Matt Kixmoeller:

On a per-drive basis, we get these KPIs, and that allows us to understand that individual drive, but also all of our drives as a whole. Why don't you give us a little more color on what some of these KPIs are? What are we looking at as some of the key measures of drive quality?

Scott Kirklin:

We're looking for two kinds of things primarily. One is more like normal driving data: following distance behind another car, speed, braking distance. And then others are more like outlier events - a hard brake tap, weaving in the lane, hugging the right side of the lane. These kinds of metrics are mostly driven by human observation, where some human noticed that something a little off was happening. From there, we go and create a more rigorous, quantifiable metric that we can then apply to all of our drives to get a more statistically meaningful sense of how often it's happening.
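
As an example of turning one of those human observations into a quantifiable metric, here is a minimal sketch of a hard-brake detector run over per-frame telemetry. The field names and the roughly 0.5 g threshold are hypothetical choices for illustration, not Ghost's actual definitions.

```python
HARD_BRAKE_DECEL_MPS2 = 4.9  # hypothetical threshold, roughly 0.5 g

def hard_brake_events(frames: list[dict]) -> list[dict]:
    """Flag moments where longitudinal deceleration exceeds the threshold.

    Each frame is assumed to carry 'timestamp' (seconds) and 'speed_mps' fields.
    """
    events = []
    for prev, cur in zip(frames, frames[1:]):
        dt = cur["timestamp"] - prev["timestamp"]
        if dt <= 0:
            continue
        decel = (prev["speed_mps"] - cur["speed_mps"]) / dt
        if decel > HARD_BRAKE_DECEL_MPS2:
            events.append({"timestamp": cur["timestamp"], "decel_mps2": decel})
    return events
```

Once a metric like this exists, it can be applied uniformly to every drive, which is what makes the aggregate statistics meaningful.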

Matt Kixmoeller:

Obviously if there's a kind of difficult situation on the road, there might have been a good reason to slam on the brakes or to veer or something like that. But if that's happening a lot or more than normal, it's something to look into.

So it sounds like these KPIs essentially allow our engineers to look out for and find things that might be awry in some element of the product. How does that triage process work? Say we go on a drive and we come back, how do our engineers understand if there was a problem and then ultimately take that through to a form of resolution?

Scott Kirklin:

The basic life cycle of triage at Ghost is that data comes in almost in real time as the cars are driving. It then goes through the pipeline we discussed earlier: the data comes in, we fill in the gaps, then we calculate a bunch of additional metrics. These land on a dashboard that people are watching, asking: Is this happening more? Is this happening less? How does this release compare to the last one? The overall lifecycle is basically one where a human observes something inappropriate or unusual or unexpected, then we create a metric to track it. Then we scale that metric out to every drive. We will backfill and see if the issue that was just observed is new or has always been happening, and how often.

Once you have those statistics, then you can start to correlate: is this event happening more or less around bridges or overpasses or cracks in the road? You can find what things seem to be causing it, and that's incredibly useful in solving it. Once you know what's causing it, that gives you the tools to try and fix it. Then once you have rolled out some changes to improve it, the loop closes and you're back to looking for something new and unexpected. Once it's on your radar, you can watch it forever and make sure that you're not getting worse at it and hopefully keep getting better.
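
A minimal sketch of that backfill step: once a metric exists, apply it retroactively to every stored drive and aggregate per software release, so you can tell whether a newly observed behavior is new or has always been there. The `list_drives`, `load_drive_frames`, and `metric` callables are hypothetical stand-ins for Ghost's internal APIs.

```python
from collections import defaultdict
from typing import Callable

def backfill_metric(
    list_drives: Callable[[], list[dict]],
    load_drive_frames: Callable[[str], list[dict]],
    metric: Callable[[list[dict]], float],
) -> dict[str, float]:
    """Apply `metric` (frames -> a number, e.g. an event count) to every historical
    drive and return the average value per software release."""
    per_release: dict[str, list[float]] = defaultdict(list)
    for drive in list_drives():                # e.g. [{"id": "d-1", "release": "v1.2"}, ...]
        frames = load_drive_frames(drive["id"])
        per_release[drive["release"]].append(metric(frames))
    return {rel: sum(vals) / len(vals) for rel, vals in per_release.items()}
```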

Matt Kixmoeller:

As engineers build new features, we're now just building these metrics in from the start. So part of the process of designing a feature is understanding how to measure its quality. From day one, we're essentially building those metrics in.

You also have a number of tools for visualization of this data? Sometimes when there's a problem, you just want to be able to dive in and see the scene and see what radar saw, see what vision saw, see what the drive program did. How does that infrastructure work?

Scott Kirklin:

We have a number of tools for visualization. The range runs from something like Foxglove, which is a tool heavily optimized for robotics. In the robotics world it's very common to have a lot of different data streams all happening on the same synchronized timeline, and you want to visualize them together in some coherent way. Foxglove is great for that.

It's primarily useful for looking exactly at what was happening in a car at one time. It's mostly a point-in-time, time series view of the world. In contrast to that, we've got tools like Kibana, which is a dashboard sitting on top of Elastic, and that's useful for computing aggregate statistics. You want to see, for example, what's happened in just the last hour? You can see it there in almost real time. And then you also have the option, as a model engineer, to have very granular control over the analysis that's being done. We have a lot of direct APIs to the data itself so that you can pull down the data and directly explore it.
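
For the "direct API" style of analysis, here is a minimal sketch of what pulling a time window into a DataFrame for ad-hoc exploration might look like. The `client.fetch` call and the column names are hypothetical; Ghost's internal APIs aren't public.

```python
import pandas as pd

def fetch_window(client, drive_id: str, start: str, end: str, columns: list[str]) -> pd.DataFrame:
    """Pull selected columns for a time range from a (hypothetical) time series API
    and return them as a pandas DataFrame for ad-hoc analysis."""
    records = client.fetch(drive_id=drive_id, start=start, end=end, columns=columns)
    return pd.DataFrame.from_records(records)

# Hypothetical usage:
# df = fetch_window(client, "drive-1234", "2023-05-24T10:00Z", "2023-05-24T10:02Z",
#                   ["speed_mps", "brake_pressure", "steering_angle"])
# print(df.describe())
```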

Matt Kixmoeller:

Go a little deeper on that. Given all these cars, all these drives, all this data coming in, what does the data infrastructure look like underneath? What are some of the tools we use to manage this huge amount of data?

Scott Kirklin:

To get something like real-time performance, we use Kafka as an event-driven framework. The first message comes in from the car and that starts the chain of events. Every processing step from there is consuming from these topics and pushing to notify the next consumer. In this way, we can flow very directly from ingest to the first incorporation into our structured time series storage. We do a bunch of indexing. We go and do the rehydrate process that I described earlier. We compute all of the post-rehydrate statistics that are useful as KPIs for tracking metrics. We also export this data in a variety of forms that are optimized for the different view layers. All of these pieces are flowing continuously from the moment the data arrives.
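
Here is a minimal sketch of one stage in an event-driven chain like this: consume a notification from an upstream topic, do the processing step, and publish a notification for the next consumer. It uses the open-source kafka-python client, and the topic names and message shapes are hypothetical, not Ghost's actual setup.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

IN_TOPIC = "segments.ingested"      # hypothetical topic this stage consumes from
OUT_TOPIC = "segments.rehydrated"   # hypothetical topic notifying the next stage

consumer = KafkaConsumer(
    IN_TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    segment = message.value  # e.g. {"drive_id": "d-1", "path": "/data/incoming/car42_x.bin"}
    # Placeholder for the real processing step (rehydration, metrics, indexing, ...).
    result = {"drive_id": segment["drive_id"], "status": "processed"}
    producer.send(OUT_TOPIC, value=result)  # wake up the next consumer in the chain
```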

Matt Kixmoeller:

But we're ultimately trying to keep an underlying source of truth and then create all these different views into it.

Scott Kirklin:

Exactly right. The way I think about data is that you can only ever have one source of truth. You know the old adage: a man with one clock knows what time it is, a man with two can never be sure. You have to have exactly one source of truth, and it should be immutable and never changed. That's the raw file that comes in from the car, and everything after that is derived. Many of those layers are specialized for a purpose. When we ingest from that raw file, we're converting the file on disk into a database entry that is optimized for retrieval - for selection by time range or asking for particular columns of data. This is very general and it's useful for our compute framework, but it's also not a relational database, so it's not as good for some kinds of query expressions you might want to do. That's why you end up exporting to Kibana. Similarly, Foxglove wants its own particular way of seeing the data, which is just pure events in a stream. That's another place where we're creating another layer on top of our primary storage that is optimized for a particular form of consumption.
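
As one concrete example of a derived view, here is a minimal sketch of exporting per-drive metrics into Elasticsearch so Kibana can dashboard them, using the standard elasticsearch Python client. The index name and document fields are hypothetical; the point is that this is a disposable view layered on top of the immutable raw files, not a second source of truth.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # the cluster Kibana sits on top of

def export_drive_metrics(drive_id: str, metrics: dict) -> None:
    """Write one drive's derived KPIs into a (hypothetical) 'drive-metrics' index.
    The raw segment files remain the source of truth; this is just a query-friendly view."""
    es.index(index="drive-metrics", id=drive_id, document={"drive_id": drive_id, **metrics})

# Hypothetical usage:
# export_drive_metrics("drive-1234", {"hard_brakes": 2, "avg_following_distance_m": 31.5})
```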

Matt Kixmoeller:

Let's dive into a whole different area. The same data that we're collecting and we've been talking about so far is also used to power our machine learning and model training environment. Walk us through the data flows within that part of the world and how model engineers ultimately use it to train models.

Scott Kirklin:

In contrast to everything we've talked about up to now, which is all event-driven and streaming in real time, the model training process is a little more human-involved. You're going and saying, these are the scenarios where our cars are performing well, and these are the scenarios where they're not. We're going to try to find new sources of data that will fill in those gaps so that we can improve our models. So there's a data exploration process where you're using the ways we have indexed the data to find what you're looking for and then incorporate it into your training set. Our training infrastructure sits primarily on top of the time series storage layer that I've described. Basically what happens is we'll define a training set and then extract from that primary storage into yet another optimized view of the data. For machine learning in general, the way you want data formatted is as rows: you want a pre-formatted bit of data that can be used as an input for regression. So we rotate it from the column-based events that are the natural format of the data into a row-based view that is optimized for training.
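
A minimal sketch of that column-to-row "rotation": take time-aligned columns out of the store and turn them into one row per training example. The channel names are hypothetical; the real feature set depends on the model being trained.

```python
import numpy as np

def columns_to_rows(
    columns: dict[str, np.ndarray], feature_names: list[str], label_name: str
) -> tuple[np.ndarray, np.ndarray]:
    """Rotate column-oriented time series into a row-per-sample training matrix.

    `columns` maps a channel name to a 1-D array sampled on the same timestamps.
    Returns (X, y) where each row of X is one pre-formatted training example.
    """
    X = np.stack([columns[name] for name in feature_names], axis=1)
    y = columns[label_name]
    return X, y

# Hypothetical usage:
# X, y = columns_to_rows(cols, ["speed_mps", "yaw_rate", "lane_offset_m"], "steering_angle")
```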

Matt Kixmoeller:

Just to use an example: in one of the earlier videos, we talked a little bit about our lane marker network that ultimately understands paint on the road and gives us lane understanding. That's a good example of where we have so many hours of driving out there, but we need to find a representative set of training data that has all these different types of lane markers, of all shapes and sizes, in all different orientations and colors, at all different times of day. Those are all curated based on that data being labeled, so the model engineer can find the right pieces and pull them together into this set. What happens from that point forward?

Scott Kirklin:

From that point, it's primarily in the land of training, which means training in PyTorch or TensorFlow, then visualizing in something like TensorBoard. Once training is done, there's a validation loop that takes the models that have been produced, scores them on the training data, and sees if they've improved. Eventually we pick the one we like and that goes back into the car.
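
For readers who haven't seen this loop before, here is a minimal sketch of the train, log, and score cycle in PyTorch with TensorBoard logging. The toy model and random data are placeholders, not Ghost's lane network or training setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter

# Placeholder rows standing in for a curated training set (3 features -> 1 target).
X, y = torch.randn(1024, 3), torch.randn(1024, 1)
train_loader = DataLoader(TensorDataset(X[:896], y[:896]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X[896:], y[896:]), batch_size=64)

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
writer = SummaryWriter()  # TensorBoard logs land in ./runs by default

for epoch in range(10):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    # Scoring pass: evaluate the checkpoint to decide whether it improved.
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)
    writer.add_scalar("loss/val", val_loss, epoch)
writer.close()
```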

Matt Kixmoeller:

Within those steps you can also run these new models through various forms of simulated testing. That gets us into the last topic, simulation. We're using simulation pretty broadly throughout our cycle as well. What are some of the elements that we take advantage of simulation for?

Scott Kirklin:

Simulation is a great way to explore the parameter space of your program. If you think of a program as essentially a function that takes many inputs, the space that function can cover is enormous - much more than you can reasonably expect to feed from plain drive recordings. We thus generate a wide range of scenarios that cover the parameter space and evaluate what the program will do in all of those cases. This is a great way to validate that you've covered corner cases, and as you make changes to code, you can very quickly evaluate all of the simulated data much more efficiently than if you were trying to run the driving program on raw video.
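
A minimal sketch of what such a parameter-space sweep can look like: generate scenarios across a grid of conditions and record what a planner does in each. The scenario parameters and the `planner` callable are illustrative assumptions, not Ghost's simulator.

```python
import itertools
from typing import Callable

def sweep(
    planner: Callable[[dict], str],
    speeds_mps=(10.0, 20.0, 30.0),
    gaps_m=(5.0, 15.0, 30.0),
    lead_decels_mps2=(2.0, 4.0, 6.0),
) -> list[dict]:
    """Evaluate a (hypothetical) planner across a grid of lead-vehicle scenarios.

    Every combination of ego speed, following gap, and lead-car deceleration is
    simulated, which covers far more of the parameter space than recorded drives alone.
    """
    results = []
    for speed, gap, decel in itertools.product(speeds_mps, gaps_m, lead_decels_mps2):
        scenario = {"ego_speed_mps": speed, "gap_m": gap, "lead_decel_mps2": decel}
        results.append({**scenario, "action": planner(scenario)})
    return results
```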

Matt Kixmoeller:

The driving program's a great example of this where, you know, ultimately we might want to put it in situations that are pretty hard to recreate on the road. For example, if somebody drives hard in front of the car or there are really nefarious actors. We can do that via simulation, see what happens, and it's the same infrastructure to learn from that.

Scott Kirklin:

Right. If we want to run the driving program on a recording, what you're going to be doing is taking video and sensor data from the car and running it through the whole perception part of the program, and only then do you get to the part where the planner is trying to make its decisions. If what you care about as a developer on the planner is evaluating whether you have improved the planner, you really would like to cut out the perception piece and directly simulate the planner's inputs. That's one of the places where simulation really shines: you can skip the pieces of the program that you don't need and only simulate the pieces that you care about testing and improving.
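
To make the "skip perception" idea concrete, here is a minimal sketch that hands a planner simulated perception outputs directly, for a car cutting in ahead, without ever touching video. The object format and the `planner` callable are hypothetical.

```python
from typing import Callable

def simulate_cut_in(planner: Callable[[list], str], steps: int = 20, dt: float = 0.1) -> list[str]:
    """Feed the planner a simulated stream of 'perceived' objects describing a car
    cutting in ahead, bypassing the perception stack so only planner behavior is tested."""
    decisions = []
    for i in range(steps):
        t = i * dt
        perceived = [{
            "kind": "vehicle",
            "range_m": 30.0 - t * 8.0,          # closing at 8 m/s
            "lateral_offset_m": 3.5 - t * 1.5,  # drifting from the next lane into ours
        }]
        decisions.append(planner(perceived))
    return decisions
```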

Matt Kixmoeller:

Scott, thanks so much. This was a very brief overview, but it gave us a pretty good sense for how data flows through Ghost - coming off the cars and into our system, where we use it to score drives and constantly improve, but also feeding the machine learning and simulation processes to help our engineers build faster. Thanks so much.
