> The Waymo World Model can convert those kinds of videos, or any taken with a regular camera, into a multimodal simulation—showing how the Waymo Driver would see that exact scene.
Subtle brag that Waymo could drive in camera-only mode if they chose to. They've stated as much previously, but that doesn't seem widely known.
I think I'm misunderstanding - they're converting video into their representation which was bootstrapped with LIDAR, video and other sensors. I feel you're alluding to Tesla, but Tesla could never have this outcome since they never had a LIDAR phase.
(edit - I'm referring to deployed Tesla vehicles, I don't know what their research fleet comprises, but other commenters explain that this fleet does collect LIDAR)
I think what we are seeing is that they both converged on the correct approach, one of them decided to talk about it, and it triggered disclosure all around since nobody wants to be seen as lagging.
Exactly: they convert video into a world model representation suitable for 3D exploration and simulation without using LIDAR (except perhaps for scale calibration).
Tesla does collect LIDAR data (people have seen them doing it, it's just not on all of the cars) and they do generate depth maps from sensor data, but from the examples I've seen it is much lower resolution than these Waymo examples.
Human depth perception uses stereo out to only about 2 or 3 meters, after which the distance between your eyes is not a useful baseline. Beyond 3m we use context clues and depth from motion when available.
We do a lot more internal image processing. For example, relative motion as seen by either eye helps improve accuracy by a whole lot, in the "medium" distance range.
And I'll add that in practice it is not even that much unless you're doing some serious training, like a professional athlete. For most tasks, accurate depth perception from stereo fades at around arm's length.
OK, but the point being made is based on human depth perception, while a car's only real limit on camera separation is the width of the vehicle, so there's missing information if you're trying to figure out whether a car can use cameras to do what human eyes and brains do.
Humans are very good at processing the images that come into our brains. Each eye has a “blind spot” but we don’t notice. Our eyes adjust for color (fluorescent lights are weird) and for the amount of light coming in. When we look through a screen door or rain we just ignore it, and if you look out the side of a moving vehicle you can ignore the foreground.
If you increase the distance between stereo cameras you probably can increase depth perception.
But a lidar or radar sensor is just sensing distance.
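To put rough numbers on the baseline point (the ~6.5 cm eye spacing, ~1.5 m car-width baseline, and focal length below are assumed round numbers, not measurements): disparity falls off as 1/distance, so a wider baseline keeps measurable disparity out to much longer ranges.

```python
# Disparity in pixels for a rectified stereo pair: d = f * B / Z.
FOCAL_PX = 1500.0  # assumed focal length in pixels

for name, baseline_m in [("human eyes (~6.5 cm)", 0.065), ("car width (~1.5 m)", 1.5)]:
    print(name)
    for z_m in (3, 10, 30, 100):                    # distance in metres
        disparity_px = FOCAL_PX * baseline_m / z_m
        print(f"  {z_m:>3} m -> disparity ~ {disparity_px:6.1f} px")
```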
Radar has a cool property that it can sense the relative velocity of objects along the beam axis too, from Doppler frequency shifting. It’s one sense that cars have that humans don’t.
To this point, one of the coolest features Teslas _used_ to have was the ability for it to determine and integrate the speed of the car in front of you AND the speed of the car in front of THAT car, even if the second car was entirely visually occluded. They did this by bouncing the radar beam under the car in front and determining that there were multiple targets. It could even act on this: I had my car AEB when the second ahead car slammed on THEIR brakes before the car ahead even reacted. Absolutely wild. Completely gone in vision-only.
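For anyone curious about the Doppler relation behind that: radial velocity maps to a frequency shift of f_d = 2·v·f_tx/c. A rough sketch, assuming a standard 77 GHz automotive radar:

```python
# Doppler shift seen by a radar for a target closing at a given radial speed.
C = 3.0e8      # speed of light, m/s
F_TX = 77e9    # assumed automotive radar carrier frequency, Hz

def doppler_shift_hz(radial_velocity_mps: float) -> float:
    """f_d = 2 * v_r * f_tx / c (positive = approaching)."""
    return 2.0 * radial_velocity_mps * F_TX / C

for v_kmh in (10, 50, 130):
    v_mps = v_kmh / 3.6
    print(f"{v_kmh:>3} km/h closing speed -> Doppler shift ~ {doppler_shift_hz(v_mps) / 1e3:.1f} kHz")
```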
The company I used to work for was developing a self driving car with stereo depth on a wide baseline.
It's not all sunshine and roses to be honest - it was one of the weakest links in the perception system. The video had to run at way higher resolutions than it would otherwise and it was incredibly sensitive to calibration accuracy.
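To illustrate how unforgiving the calibration is (the baseline, focal length, and residual yaw error below are assumed for illustration, not figures from that system): since Z = f·B/d, a tiny extrinsic error shows up as a disparity offset, and the resulting depth error grows with the square of the range.

```python
import math

BASELINE_M = 1.0       # assumed wide baseline between the two cameras
YAW_ERROR_DEG = 0.02   # assumed residual calibration error between them

# A yaw error of dtheta shifts disparity by roughly f*dtheta pixels,
# which shifts depth by |dZ| ~ Z^2 * dtheta / B.
dtheta = math.radians(YAW_ERROR_DEG)
for z_m in (20, 50, 100):
    depth_error_m = z_m ** 2 * dtheta / BASELINE_M
    print(f"at {z_m:>3} m: ~{depth_error_m:.1f} m depth error from a {YAW_ERROR_DEG} deg miscalibration")
```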
(Always worth noting, human depth perception is not just based on stereoscopic vision, but also with focal distance, which is why so many people get simulator sickness from stereoscopic 3d VR)
> Always worth noting, human depth perception is not just based on stereoscopic vision, but also with focal distance
Also subtle head and eye movements, which is something a lot of people like to ignore when discussing camera-based autonomy. Your eyes are always moving around which changes the perspective and gives a much better view of depth as we observe parallax effects. If you need a better view in a given direction you can turn or move your head. Fixed cameras mounted to a car's windshield can't do either of those things, so you need many more of them at higher resolutions to even come close to the amount of data the human eye can gather.
Easiest example I always give of this is pulling out of the alley behind my house: there is a large bush that occludes my view left to oncoming traffic, badly. I do what every human does:
1. Crane my neck forward, see if I can see around it.
2. Inch forward a bit more, keep craning my neck.
3. Recognize, no, I'm still occluded.
4. Count on the heuristic analysis of the light filtering through the bush and determine if the change in light is likely movement associated with an oncoming car.
My Tesla's perpendicular camera is... mounted behind my head on the B-pillar... fixed... and sure as hell can't read the tea leaves, so to speak, to determine if that slight shadow change increases the likelihood that a car is about to hit us.
I honestly don't trust it to pull out of the alley. I don't know how I can. I'd basically have to be nose-into-right-lane for it to be far enough ahead to see conclusively.
Waymo can beam the LIDAR above and around the bush, owing to its height and the distance it can receive from, and its camera coverage to the perpendicular is far better. Vision only misses so many weird edge cases, and I hate that Elon just keeps saying "well, humans have only TWO cameras and THEY drive fine every day! h'yuck!"
> owing to its height and the distance it can receive from,
And, importantly, the fender-mount LIDARs. It doesn't just have the one on the roof, it has one on each corner too.
I first took a Waymo as a curiosity on a recent SF trip, just a few blocks from my hotel east on Lombard to Hyde and over to the Buena Vista to try it out, and I was immediately impressed when we pulled up the hill to Larkin and it saw a pedestrian that was out of view behind a building from my perspective. Those real-time displays went a long way to allowing me to quickly trust that the vehicle's systems were aware of what's going on around it and the relevant traffic signals. Plenty of sensors plus a detailed map of a specific environment work well.
Compare that to my Ioniq5 which combines one camera with a radar and a few ultrasonic sensors and thinks a semi truck is a series of cars constantly merging into each other. I trust it to hold a lane on the highway and not much else, which is basically what they sell it as being able to do. I haven't seen anything that would make me trust a Tesla any further than my own car, and yet they sell it as if it is on the verge of being able to drive you anywhere you want on its own.
In fact there are even more depth perception cues. Maybe the most obvious is size (retinal versus assumed real-world size). Further examples include motion parallax, linear perspective, occlusion, shadows, and light gradients.
Actually the reason people get simulator sickness in VR is not focal depth but vection: the dissonance between what their eyes are telling them and what their inner ear and tactile senses are telling them.
It's possible they get headaches from the focal length issues but that's different.
I keep wondering about the focal depth problem. It feels potentially solvable, but I have no idea how. I keep wondering if it could be as simple as a Magic Eye Autostereogram sort of thing, but I don't think that's it.
There have been a few attempts at solving this, but I assume that for some optical reason actual lenses need to be adjusted and it can't just be a change in the image? Meta had "Varifocal HMDs" being shown off for a bit, which I think literally moved the screen back and forth. There were a couple of "Multifocal" attempts with multiple stacked displays, but that seemed crazy. Computer Generated Holography sounded very promising, but I don't know if a good one has ever been built. A startup called Creal claimed to be able to use "digital light fields", which basically project stuff right onto the retina, which sounds kinda hogwashy to me but maybe it works?
My understanding is that contextual clues are a big part of it too. We see the pitcher wind up and throw a baseball at us more than we stereoscopically track its progress from the mound to the plate.
More subtly, a lot of depth information comes from how big we expect things to be, since everyday life is full of things we intuitively know the sizes of: frames of reference in the form of people, vehicles, furniture, etc. This is why the forced perspective of theme park castles is so effective: our brains want to see those upper windows as full sized, so we see the thing as 2-3x bigger than it actually is. And in the other direction, a lot of buildings in Las Vegas are further away than they look, because hotels like the Bellagio have large black boxes on them that group a 2x2 block of the actual room windows.
> Humans do this, just in the sense of depth perception with both eyes.
Humans do this with vibes and instincts, not just depth perception. When I can't see the lines on the road because there's too much snow, I can still interpret where they would be based on my familiarity with the roads and my implicit knowledge of how roads work, for example. We do similar things in heavy rain or fog, although sometimes those situations truly necessitate pulling over, or slowing down and turning on your four-ways; lidar might genuinely give an advantage there.
Yes and no - vibes and instincts aren't just thought, they're real senses. Humans have a lot of senses, dozens of them, including balance, pain, the sense of the passage of time, and body orientation. Not all of these senses are represented in autonomous vehicles, and it's not really clear how the brain mashes together all these senses to make decisions.
Hesai has driven the cost into the $200 to $400 range now. That said, I don't know what the units needed for driving cost. Either way, we've gone from thousands or tens of thousands of dollars into the hundreds now.
Looking at prices, I think you are wrong and automotive Lidar is still in the 4 to 5 figure range. HESAI might ship Lidar units that cheap, but automotive grade still seems quite expensive: https://www.cratustech.com/shop/lidar/
Those are single-unit prices. The AT128, for instance, which is listed at $6,250 there and is widely used by several Chinese car companies, was around $900 per unit in high volume, and over time they lowered that to around $400.
The next generation of that, the ATX, is the one they have said would be half that cost. According to regulator filings in China BYD will be using this on entry level $10k cars.
Hesai got the price down for their new generation through several optimizations. They are using their own designs for lasers, receivers, and driver chips, which reduced component counts and material costs. They have stepped up production to 1.5 million units a year, giving them mass-production efficiencies.
That model only has a 120 degree field of view, so you'd need 3-4 of them per car (plus others for blind spots; they sell units for that too). That puts the total system cost in the low thousands, not the $200 to $400 stated by GP. I'm not saying it hasn't gotten cheaper or won't keep getting cheaper, it just doesn't seem that cheap yet.
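Back-of-the-envelope on that, using only the unit prices cited upthread (blind-spot units excluded, since no price was given for them):

```python
units_needed = (3, 4)        # 120-degree units for main coverage
unit_price_usd = (400, 900)  # high-volume AT128-class prices mentioned above

low = units_needed[0] * unit_price_usd[0]
high = units_needed[1] * unit_price_usd[1]
print(f"main-sensor subtotal: ${low:,} to ${high:,} per car, before blind-spot units")
```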
That was 2 generations of hardware ago (4th gen Chrysler Pacificas). They are about to introduce 6th gen hardware. It's a safe bet that it's much cheaper now, given how mass produced LiDARs cost ~$200.
Otto and Uber and the CEO of https://pronto.ai do though (tongue-in-cheek)
> Then, in December 2016, Waymo received evidence suggesting that Otto and Uber were actually using Waymo’s trade secrets and patented LiDAR designs. On December 13, Waymo received an email from one of its LiDAR-component vendors. The email, which a Waymo employee was copied on, was titled OTTO FILES and its recipients included an email alias indicating that the thread was a discussion among members of the vendor’s “Uber” team. Attached to the email was a machine drawing of what purported to be an Otto circuit board (the “Replicated Board”) that bore a striking resemblance to – and shared several unique characteristics with – Waymo’s highly confidential current-generation LiDAR circuit board, the design of which had been downloaded by Mr. Levandowski before his resignation.
The presiding judge, Alsup, said, "this is the biggest trade secret crime I have ever seen. This was not small. This was massive in scale."
(Pronto connection: Levandowski got pardoned by Trump and is CEO of Pronto autonomous vehicles.)
I think there are two steps here: converting video to sensor-data input, and using that sensor data to drive. Only the second step will be handled by cars on the road; the first is purely for training.
It’s way easier to “jam” a camera with bright light than a lidar, which uses both narrow-band optical filters and pulsed signals with matched filtering to detect that temporal sequence. If I were an adversary, going after cameras would be way, way easier.
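A toy sketch of why the pulsed/coded part matters (the code length, delay, and signal levels are made up): the receiver correlates against the known transmit pattern, so a steady bright jammer mostly washes out once the DC is removed, while the true return still produces a clear correlation peak.

```python
import numpy as np

rng = np.random.default_rng(0)

code = rng.choice([0.0, 1.0], size=64)    # pseudo-random transmit pulse train
true_delay = 37                           # round-trip delay in samples
received = np.zeros(256)
received[true_delay:true_delay + code.size] += 0.2 * code  # weak return
received += 5.0                                            # very bright constant jammer
received += 0.05 * rng.standard_normal(256)                # sensor noise

received -= received.mean()                        # narrow-band / DC-rejection analogue
correlation = np.correlate(received, code - code.mean(), mode="valid")
print("estimated delay:", int(np.argmax(correlation)), "samples (true:", true_delay, ")")
```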
Autonomous cars need to be significantly better than humans to be fully accepted especially when an accident does happen. Hence limiting yourself to only cameras is futile.
Surely as soon as they're safer than humans they should be deployed as fast as possible to save some of the 3000 people who are killed by human drivers every day
Of course they should be, but that's not what will happen. Humans are not rational, so self-driving cars must be significantly safer than human drivers to avoid as much political pushback as possible.
I've always wondered... if Lidar + Cameras is always making the right decision, you should theoretically be able to take the output of the Lidar + Cameras model and use it as training data for a Camera only model.
That's exactly what Tesla is doing with their validation vehicles, the ones with Lidar towers on top. They establish the "ground truth" from Lidar and use that to train and/or test the vision model. Presumably more "test", since they've most often been seen in Robotaxi service expansion areas shortly before fleet deployment.
I don't have a specific source, no. I think it was mentioned in one of their presentations a few years back that they use various techniques to establish "ground truth" for vision training; among those were time-series consistency (depth change over time should be continuous, etc.) and IIRC also "external" sources of depth data, like LiDAR. And their validation cars equipped with LiDAR towers are definitely being seen everywhere they are rolling out their Robotaxi services.
"Exactly" is impossible: there are multiple Lidar samples that would map to the same camera sample. But what training would do is build a model that could infer the most likely Lidar representation from a camera representation. There would still be cases where the most likely Lidar for a camera input isn't a useful/good representation of reality, e.g. a scene with very high dynamic range.
No, I don't think that will be successful. Consider a day where the temperature and humidity is just right to make tail pipe exhaust form dense fog clouds. That will be opaque or nearly so to a camera, transparent to a radar, and I would assume something in between to a lidar. Multi-modal sensor fusion is always going to be more reliable at classifying some kinds of challenging scene segments. It doesn't take long to imagine many other scenarios where fusing the returns of multiple sensors is going to greatly increase classification accuracy.
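A toy way to picture that fusion (the probabilities are invented): each sensor reports how likely it thinks a cell ahead is occupied, and you combine them in log-odds space. The camera "sees a wall" of exhaust fog, the radar sees nothing solid, and the fused estimate discounts the phantom obstacle.

```python
import math

def logit(p):   return math.log(p / (1 - p))
def sigmoid(x): return 1 / (1 + math.exp(-x))

# Dense exhaust cloud ahead: camera thinks it's an obstacle, radar sees through it.
readings = {"camera": 0.90, "radar": 0.05, "lidar": 0.40}

# Naive log-odds fusion (assumes independent sensors and a 0.5 prior).
fused = sigmoid(sum(logit(p) for p in readings.values()))
print(f"fused occupancy probability: {fused:.2f}")
```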
The goal is not to drive in all conditions; it is to drive in all drivable conditions. Human eyeballs also cannot see through dense fog clouds. Operating in these environments is extra credit with marginal utility in real life.
But humans react to this extremely differently than a self driving car.
Humans take responsibility, and the self-driving car disengages and says: WELP.
Oh sorry, were you "enjoying your travel time to do something useful" as we very explicitly marketed? Well, now your wife is dead and it's your fault (legally). Kisses, Elon.
No, but if you run a shadow or offline camera-only model in parallel with a camera + LIDAR model, you can (1) measure how much worse the camera-only model is so you can decide when (if ever) it's safe enough to stop installing LIDAR, and (2) look at the specific inputs for which the models diverge and focus on improving the camera-only model in those situations.
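Roughly, a shadow-mode comparison like that could look like the sketch below. The model functions, the (steering, accel) output format, and the divergence threshold are all placeholders for illustration.

```python
import numpy as np

def divergence(primary_out: np.ndarray, shadow_out: np.ndarray) -> float:
    """L2 distance between e.g. (steering, accel) commands from the two stacks."""
    return float(np.linalg.norm(primary_out - shadow_out))

def shadow_compare(frames, primary_model, shadow_model, threshold=0.1):
    flagged, scores = [], []
    for frame in frames:
        score = divergence(primary_model(frame), shadow_model(frame))
        scores.append(score)
        if score > threshold:
            flagged.append(frame)   # keep for labelling / targeted retraining
    return float(np.mean(scores)), flagged

# Usage with stand-in models and random "frames":
rng = np.random.default_rng(0)
primary = lambda f: np.array([f.mean(), f.std()])
shadow  = lambda f: np.array([f.mean() + rng.normal(0, 0.1), f.std()])
mean_gap, hard_cases = shadow_compare([rng.random((8, 8)) for _ in range(20)], primary, shadow)
print(f"mean divergence {mean_gap:.3f}, {len(hard_cases)} frames flagged for review")
```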