In this episode of Talking Autonomy, Ghost Model Engineer Prannay Khosla discusses KineticFlow, Ghost's core vision-based AI algorithm for autonomous vehicle perception.
Traditional approaches to computer vision for autonomy have typically leveraged a series of mono cameras at multiple fields of view, using image recognition and object classification to aid in inference of distance and motion. But this approach suffers from challenges associated with mis-recognized objects, inaccurate distance estimation, poor low-light performance, and reliability tied to a single camera. KineticFlow represents a new approach that fuses both mono and stereo algorithms into a single neural network, providing the distance (depth), velocity, and direction of motion of an object, on a per-pixel basis. In this video, Prannay covers the key functions and benefits of KineticFlow and explains why it was designed to serve as the foundational vision algorithm for the Ghost Autonomy Engine.
Video Transcript (edited for brevity):
[Matt Kixmoeller] Welcome to Talking Autonomy. Today we're going to talk about KineticFlow, Ghost's Visual Neural Network. I'm joined today by Prannay, a model engineer here at Ghost. Prannay, why don't you introduce yourself and tell us a little bit about what you do here, and then we'll dive in.
[Prannay Khosla] Sounds good, thanks, Matt. I'm Prannay. I work in model engineering, mostly on architecting the Drive program and thinking about how it all comes together. Over my time here I've focused on detecting obstacles from vision and radar and then using them in planning.
[Matt Kixmoeller] All right, so let's dive into KineticFlow. As we said earlier, KineticFlow is the visual neural network at Ghost. We have stereo camera pairs in all four directions around the car. And so that allows us to basically do visual detection in a 360 degree view. So why don't you run through KineticFlow a little bit. What it is and what it gives us.
[Prannay Khosla] KineticFlow is this idea that you should be able to understand whatever is going on in the scene by understanding the physics of the scene. And the physics of the scene really comes down to what is the depth of everything that you're seeing and how is that moving, right? That's really inspired by biology. I think we've talked about that before, and it comes down to understanding how the eye works and how the human brain responds more to changes in the scene than to just where something is. The way humans drive is that they really pay attention to what's changing, what's expanding, what's contracting, rather than trying to figure out everything that is around them. A lot of it is just having an expectation of what usually happens. As long as the vision keeps confirming that, they have a really coarse-grained signal of it. And that is, in the end, what this visual neural network does. And so we have 360 vision: we have four sets of cameras, and each set has two commodity cameras sitting in there, which we use to do both monocular and stereo vision. We do not do anything in the hardware. There's no custom ASIC, there is no FPGA sitting in there, just high-speed links that allow us to process the images entirely on a GPU in software. And the beauty of that is you get to run a neural network on it, you get to reproduce it in the data center, and you do not have to do hardware upgrades; you can do software upgrades, which are much faster.
[Matt Kixmoeller] So when you introduce the original inspiration being physics based and biology based, I think it gets to a unique dimension of KineticFlow, which is that there's no image recognition required. That kind of gets to the notion that if I throw something at you, you're probably not going to try and say, 'what is that? Is it a baseball?' before you decide I'm either gonna catch it or get out of the way. Then later you might figure out, okay, it's a baseball and what do I do with it? But your first instinct is let me just avoid this thing, right?
[Prannay Khosla] Yep.
[Matt Kixmoeller] So talk a bit about how KineticFlow works without explicitly understanding something's a car or truck or a bus.
[Prannay Khosla] Yeah, I mean the idea is that you want to build a system that generalizes, and that is where KineticFlow comes in. So when you see the depth of something, or when you see something move, you know that if there is a set of pixels in the image which are moving together, then they probably are a rigid body, which is an obstacle. That obstacle could be a wall, it could be a car, it could be a truck, a sideways truck, a construction vehicle, an accident, anything. In the end, all you really care about is that these pixels move together, where they are relative to me, to the left or to the right, and what their depth is. The reason that gets around image recognition, classification, object bounding boxes and all of that is because in the end you're using the physics of how these pixels move together and stay together. Hence, they're a rigid body. That's the only piece of information you need in order to do universal collision avoidance, which is in the end the bare minimum that a self-driving system has to do.
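The grouping idea Prannay describes, that pixels moving together form a rigid body, can be sketched in a few lines. This is a hedged illustration, not Ghost's implementation: it assumes a dense optical-flow field is already available, and the function name and simple motion-binning scheme are mine.

```python
import numpy as np

def group_comoving_pixels(flow, bin_size=1.0):
    """Toy rigid-body grouping: quantize each pixel's motion vector
    (dx, dy) so that pixels with nearly identical motion share a bin,
    then give every bin its own label. `flow` has shape (H, W, 2)."""
    bins = np.round(flow / bin_size).astype(int)
    h, w, _ = flow.shape
    labels = np.zeros((h, w), dtype=int)
    seen = {}          # motion bin -> label
    next_label = 1
    for y in range(h):
        for x in range(w):
            key = (bins[y, x, 0], bins[y, x, 1])
            if key not in seen:
                seen[key] = next_label
                next_label += 1
            labels[y, x] = seen[key]
    return labels

# Toy scene: static background, plus a 2x2 patch moving right at 3 px/frame.
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3, 0] = 3.0
labels = group_comoving_pixels(flow)
```

A real system would also require spatial connectivity before calling a group one body; this sketch only shows the core intuition that shared motion, not object class, defines the obstacle.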
[Matt Kixmoeller] That was one key focus: we wanted the base algorithm to be universal so that it couldn't be tricked. There wouldn't be a long tail of 'I didn't train it on this bizarre object that could be on the road.' If we can basically find masses of atoms on the road universally and we know we can't drive through them, then higher levels of the stack can determine if it's a person or something else that we might have to treat differently. But the base layer delivers safety in a universal way.
[Prannay Khosla] That's right. And in the end, if you think of any model, it's just activations and weights, right? The way models work, and the expectation we have that they will generalize, comes from one of two things. Either the model is grounded in a principle which we believe actually generalizes, or you show it enough data. A lot of people in the general machine learning and artificial intelligence community have really focused on throwing more and more data at the problem. Every few years we have a big computing leap, which gives us a step function, but mostly you see only diminishing returns from trying to scale the data. On the other hand, if you turn the problem around, find the most basic principle behind a system, and learn that, you will usually come up with a smaller model which requires a smaller amount of data and often does not require continual learning. You only need to understand the physics of the sensors that you're getting information from, and you will build a system that generalizes.
[Matt Kixmoeller] And you can reason about its completeness in that way.
[Prannay Khosla] Exactly, having guarantees of completeness is obviously really important for somebody to deploy a self-driving system out in the wild.
[Matt Kixmoeller] So you touched a little bit there on the topic of the training sets and how we build KineticFlow. So, why don't we talk about that a little bit more? As we talked about earlier, many of the former approaches to stereo vision in the auto world would use simple stereo algorithms, often with an ASIC right next to the two cameras doing the stereo math right in the car. But we don't do that. We train this neural network to approximate stereo vision. So walk through how that training process works.
[Prannay Khosla] Right, I mean if you want to build a model, first you need to construct the ground truth for it. Both the beauty and the drawback of something like stereo or monocular vision in KineticFlow is that you don't need any external information; all the information is directly available in the videos you have recorded from both your sensors. The drawback is the compute. Other automotive companies have tried in the past to build hardware that does pixel-to-pixel matching: find the epipolar lines along which you can match information, say that in my two perspectives this set of pixels is the same thing, measure how disparate they are in the two images, and use that disparity to find the depth of the obstacle. That problem sits on an efficient frontier of compute: to solve it in real time on the car, at some point you are going to need an FPGA or a lot of compute. If you don't want an FPGA, that is where neural networks come in. Instead of doing the matching in hardware in real time, you do a heavy ground-truth job that runs offline. It takes up to three minutes in the data center, where you're able to do a much more expansive search for the optimal set of pixels which match in both your images, or across time in the monocular case. From that you generate the ground truth, and then, even though the problem you're solving is on this efficient frontier of compute and takes more time offline, you can compress the entire thing down into a small set of weights which runs in real time. That's what you train the neural network on.
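The offline matching step can be sketched as a brute-force search along the epipolar (same-row) line, followed by the standard pinhole-stereo relation depth = focal length × baseline / disparity. This is a toy illustration, not Ghost's pipeline; the camera parameters, function names, and patch search are assumptions for the example.

```python
import numpy as np

FOCAL_PX = 800.0    # hypothetical focal length in pixels
BASELINE_M = 0.30   # hypothetical distance between the two cameras, meters

def match_disparity(left, right, y, x, patch=3, max_disp=32):
    """Brute-force search along the same image row for the horizontal
    shift (disparity) that best matches a left-image patch in the right
    image. This exhaustive search is the expensive step done offline."""
    h = patch // 2
    ref = left[y - h:y + h + 1, x - h:x + h + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        if x - d - h < 0:
            break
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1]
        cost = float(np.sum((ref - cand) ** 2))  # sum of squared differences
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

def disparity_to_depth(d):
    """Pinhole-stereo relation: depth = f * B / disparity."""
    return FOCAL_PX * BASELINE_M / max(d, 1e-6)

# Synthetic rectified pair: the right view is the left view shifted 5 px.
rng = np.random.default_rng(0)
left = rng.random((20, 40))
right = np.roll(left, -5, axis=1)
d = match_disparity(left, right, y=10, x=20)
depth = disparity_to_depth(d)
```

The point of the offline/online split in the transcript is that this search (and far more expansive versions of it) is run slowly in the data center to label frames, and the network then learns to reproduce the result in milliseconds.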
[Matt Kixmoeller] That's amazing. So for every single frame of video, we're taking minutes of compute time in the data center to do the full labeling and compute all the stereo, and that can be translated into what, milliseconds on the GPU in the car?
[Prannay Khosla] Right, single digit milliseconds or small double digit milliseconds usually.
[Matt Kixmoeller] Got it. So I think you've explained pretty well how we can find objects in the system and how we can get per-pixel depth for the entire scene. But we can also understand motion with KineticFlow. How does KineticFlow generate motion if it's only analyzing one image at a time?
[Prannay Khosla] So we analyze an image every quantum: the two images from the two cameras that are looking outward.
[Matt Kixmoeller] And every quantum is how long?
[Prannay Khosla] Every quantum for us is 30 milliseconds, so we run the system at just above 30 hertz. But the idea is that if you look at the video from just one sensor and you try to run basically the same correspondence idea across time, what you cannot do is figure out the depth of something. Think of it this way: if you had a camera and you halved or doubled the focal length of your lens, you would still see the same thing, right? So with mono vision alone you cannot know how far away something is from you. But since pixels which move together form rigid bodies, and since you're able to associate pixels across time, you can see them expand or contract. It's actually interesting that you simultaneously solve for what is a rigid body by figuring out what is expanding and contracting together. From that you also find its expansion rate or contraction rate, and if you know the depth of something, its expansion or contraction rate, and where it is in the scene, that gives you all of the information you need to figure out its kinematics. And that is usually the goal of obstacle detection in the context of automotive driving.
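The kinematics step Prannay describes can be sketched with a small calculation. For a rigid body of fixed physical size, its image size scales inversely with depth, so the relative expansion rate of its pixels equals the relative approach rate; multiplying by the stereo depth recovers the closing speed. The function names and this pinhole approximation are my illustration, not Ghost's code.

```python
QUANTUM_S = 0.030  # one processing quantum, 30 ms per the transcript

def closing_speed(depth_m, size_prev_px, size_curr_px, dt_s=QUANTUM_S):
    """Image size s ~ 1/depth for a fixed-size rigid body, so
    (ds/dt)/s = -(dZ/dt)/Z. Multiplying the relative expansion rate
    by the current depth gives the closing speed in m/s."""
    expansion_rate = (size_curr_px - size_prev_px) / (size_prev_px * dt_s)
    return depth_m * expansion_rate

def time_to_contact(size_prev_px, size_curr_px, dt_s=QUANTUM_S):
    """Time to contact needs no depth at all: tau = s / (ds/dt)."""
    expansion_rate = (size_curr_px - size_prev_px) / (size_prev_px * dt_s)
    return 1.0 / expansion_rate

# An obstacle at 30 m whose image grows from 100 to 101 px in one quantum.
speed = closing_speed(30.0, 100.0, 101.0)
tau = time_to_contact(100.0, 101.0)
```

Note that time to contact falls out of the expansion rate alone, which is one way to read the biological inspiration: an eye can judge looming without knowing metric depth, while the stereo depth upgrades that to full metric kinematics.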
[Matt Kixmoeller] That's pretty incredible. So one neural network can not only detect objects and use stereo to get precise depth estimates and a really dense point cloud, but also use multiple frames to give us motion: both the speed relative to us and the direction of motion.
[Prannay Khosla] Yes, that's right.
[Matt Kixmoeller] OK, so let's zoom back out to the high level again. We have camera pairs in every direction. So we have a 360 view and KineticFlow gives us this very dense point field across that whole space. So in essence, this creates an alternative for us to a 360 lidar, correct?
[Prannay Khosla] Yes, that's right. And unlike a lidar, it's much cheaper. It just requires cameras, which keep getting upgraded. It is processed entirely in software, so the development cycle is much faster, it's not as big, and it can be integrated into any car at any time, which makes the product much more ready to consume for everyday vehicles. That is, in the end, the goal of Ghost Autonomy.
[Matt Kixmoeller] Awesome. Well, there you have it: a much deeper look into KineticFlow, the visual neural network here at Ghost that gives us object detection, a dense per-pixel point field across the whole scene, distances to objects, and velocity measurements. Prannay, thank you very much. Enjoyed it as always.