Fusion2Drive: Waymo Perception Fusion to Ego Action
2025 · Autonomous driving · multi-sensor fusion
Overview
Fusion2Drive is a reproducible multi-sensor fusion model for autonomous driving, trained on the Waymo Open Dataset. It fuses camera and LiDAR inputs in a shared bird's-eye-view (BEV) space to jointly predict ego waypoints for closed-loop control and detect vehicles, pedestrians, and cyclists in 3D. A PointPillars LiDAR encoder and a Lift-Splat-Shoot camera encoder feed a BEV fusion backbone, which branches into a CenterPoint-style detection head and a transformer planning head.
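To make the idea of a shared BEV space concrete, the sketch below shows only the grid-indexing step that both encoders rely on: mapping an ego-frame position to a ground-plane cell, and grouping LiDAR points into per-cell pillars. It is a minimal illustration, not the project's code. The grid extents, cell size, and the function names `bev_index` and `pillarize` are assumptions made for this example.

```python
import math
from collections import defaultdict

# Hypothetical grid parameters; the project's configs may use different values.
X_RANGE = (-50.0, 50.0)  # metres along the ego x axis
Y_RANGE = (-50.0, 50.0)  # metres along the ego y axis
CELL = 0.5               # metres per BEV cell


def bev_index(x, y):
    """Map an ego-frame (x, y) position to a (row, col) BEV cell.

    Returns None when the position falls outside the grid.
    """
    if not (X_RANGE[0] <= x < X_RANGE[1] and Y_RANGE[0] <= y < Y_RANGE[1]):
        return None
    row = int(math.floor((x - X_RANGE[0]) / CELL))
    col = int(math.floor((y - Y_RANGE[0]) / CELL))
    return row, col


def pillarize(points):
    """Group (x, y, z) points by BEV cell, dropping points outside the grid."""
    pillars = defaultdict(list)
    for x, y, z in points:
        idx = bev_index(x, y)
        if idx is not None:
            pillars[idx].append((x, y, z))
    return dict(pillars)


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 1.0), (0.2, 0.3, 0.5), (60.0, 0.0, 0.0)]
    for cell, pts in pillarize(cloud).items():
        print(cell, len(pts))
```

Because camera features lifted to 3D can be indexed with the same function, features from both sensors end up in the same cells, which is what lets a single backbone consume them together.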
Key features
- Multi-sensor BEV fusion: PointPillars LiDAR encoder plus Lift-Splat-Shoot camera encoder.
- Dual-task learning: joint 3D detection and waypoint prediction from a shared representation.
- Apple Silicon ready: MPS backend support plus ONNX and Core ML export, with benchmarked inference latency.
- CARLA integration: scaffolding for closed-loop simulation.
- Reproducible by construction: fixed seeds, comprehensive configs, checkpointing.
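As an illustration of the fixed-seed part of reproducibility, a seeding helper might look like the sketch below. The function name `seed_everything` and the default seed are hypothetical, and the project may seed additional components (data loader workers, for example) that are not shown here. NumPy and PyTorch are treated as optional so the snippet runs with only the standard library installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every random number generator that is available in this process."""
    # Only affects hash randomization in subprocesses started after this call.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default generators
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


if __name__ == "__main__":
    seed_everything(7)
    first = random.random()
    seed_everything(7)
    second = random.random()
    print(first == second)
```

Seeding alone does not guarantee bit-identical runs on every backend, so it is usually paired with saved configs and checkpoints, as the list above suggests.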
Expected metrics (full training)
| Metric | LiDAR-only | Camera-only | Fusion |
|---|---|---|---|
| Vehicle mAP (L1) | 0.65 | 0.42 | 0.71 |
| Pedestrian mAP (L1) | 0.58 | 0.35 | 0.64 |
| Cyclist mAP (L1) | 0.52 | 0.30 | 0.58 |
| Waypoint ADE (m) | 0.85 | 1.20 | 0.72 |
| Waypoint FDE (m) | 1.45 | 2.10 | 1.25 |
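Target metrics from the full-training configuration, illustrating the expected gain of fusion over single-sensor baselines on both detection and planning. L1 refers to the Waymo Open Dataset LEVEL_1 difficulty; ADE and FDE are the average and final displacement errors of the predicted waypoints. Higher mAP is better; lower ADE and FDE are better.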
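For readers unfamiliar with the waypoint metrics, the sketch below computes ADE and FDE for a single trajectory under their usual definitions: the mean Euclidean error over all waypoints and the error at the last waypoint. The function name `displacement_errors` is hypothetical, and the project's evaluation may aggregate over samples or horizons differently.

```python
import math


def displacement_errors(pred, target):
    """Return (ADE, FDE) in metres for two equal-length lists of (x, y) waypoints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    dists = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, target)
    ]
    ade = sum(dists) / len(dists)  # mean error over all waypoints
    fde = dists[-1]                # error at the final waypoint
    return ade, fde


if __name__ == "__main__":
    pred = [(0.0, 0.0), (3.0, 4.0)]
    target = [(0.0, 0.0), (0.0, 0.0)]
    print(displacement_errors(pred, target))  # (2.5, 5.0)
```

FDE is never smaller than the smallest per-waypoint error and often exceeds ADE because errors tend to grow along the horizon, which matches the pattern in the table.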