Fusion2Drive: Waymo Perception Fusion to Ego Action

2025 · Autonomous driving · Multi-sensor fusion

Overview

Fusion2Drive is a reproducible multi-sensor fusion model for autonomous driving, trained on the Waymo Open Dataset. It fuses camera and LiDAR inputs in a shared bird's-eye-view (BEV) space and jointly predicts two outputs: ego waypoints for closed-loop control, and 3D bounding boxes for vehicles, pedestrians, and cyclists. A PointPillars LiDAR encoder and a Lift-Splat-Shoot camera encoder feed a BEV fusion backbone, which branches into a CenterPoint-style detection head and a transformer planning head.
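To make the shared BEV space concrete, here is a minimal sketch of the first step of a PointPillars-style encoder: grouping LiDAR points into vertical pillars on an x-y grid. The grid extent, cell size, and the names pillarize and fuse_cell are assumptions made for illustration, not values or functions from the project. Channel concatenation is likewise only one plausible fusion rule; the source does not state which one the fusion backbone uses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the ego frame, metres
Cell = Tuple[int, int]              # (row, col) index into the BEV grid

# Hypothetical grid settings, chosen only for illustration.
X_RANGE = (-40.0, 40.0)
Y_RANGE = (-40.0, 40.0)
CELL_SIZE = 0.5  # metres per BEV cell


def pillarize(points: List[Point]) -> Dict[Cell, List[Point]]:
    """Group points into vertical pillars keyed by their BEV cell."""
    pillars: Dict[Cell, List[Point]] = defaultdict(list)
    for x, y, z in points:
        # Drop points that fall outside the BEV extent.
        if not (X_RANGE[0] <= x < X_RANGE[1] and Y_RANGE[0] <= y < Y_RANGE[1]):
            continue
        col = int((x - X_RANGE[0]) // CELL_SIZE)
        row = int((y - Y_RANGE[0]) // CELL_SIZE)
        pillars[(row, col)].append((x, y, z))
    return dict(pillars)


def fuse_cell(lidar_feat: List[float], camera_feat: List[float]) -> List[float]:
    """Fuse one BEV cell by channel concatenation (an assumed fusion rule)."""
    return lidar_feat + camera_feat


if __name__ == "__main__":
    cloud = [(0.1, 0.2, 1.0), (0.3, 0.4, 0.5), (10.0, -5.0, 0.0), (100.0, 0.0, 0.0)]
    for cell, pts in sorted(pillarize(cloud).items()):
        print(cell, len(pts))
```

Because both encoders write into the same grid, a camera feature splatted into a given cell lines up with the LiDAR pillar feature for that cell, which is what lets a single backbone consume both.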

Key features

Multi-sensor BEV fusion: PointPillars LiDAR encoder plus Lift-Splat-Shoot camera encoder.
Dual-task learning: joint 3D detection and waypoint prediction from a shared representation.
Apple Silicon ready: inference on the MPS backend, ONNX and Core ML export, and benchmarked inference latency.
CARLA integration: scaffolding for closed-loop simulation.
Reproducible by construction: fixed seeds, comprehensive configs, checkpointing.
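As an illustration of the fixed-seed point, the sketch below shows one common way to seed every random number generator available in the environment. The function name seed_everything and the default seed are hypothetical; the project's actual seeding code is not shown in this summary. NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the random number generators available in this environment."""
    # Affects hash randomization in child processes only; the current
    # interpreter fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch  # optional dependency
        torch.manual_seed(seed)
    except ImportError:
        pass


if __name__ == "__main__":
    seed_everything(7)
    first = random.random()
    seed_everything(7)
    second = random.random()
    print(first == second)
```

Seeding alone does not guarantee bit-identical training runs on every backend, so it is usually paired with saved configs and checkpoints, as the list above suggests.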

Expected metrics (full training)

Metric	LiDAR-only	Camera-only	Fusion
Vehicle mAP (L1)	0.65	0.42	0.71
Pedestrian mAP (L1)	0.58	0.35	0.64
Cyclist mAP (L1)	0.52	0.30	0.58
Waypoint ADE (m)	0.85	1.20	0.72
Waypoint FDE (m)	1.45	2.10	1.25

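Target metrics for the full-training configuration. mAP is higher-is-better; ADE and FDE are lower-is-better. Against the stronger single-sensor baseline (LiDAR-only), fusion targets +0.06 mAP on each of the three classes, 0.13 m lower ADE, and 0.20 m lower FDE.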
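For reference, the two waypoint metrics follow their standard definitions: ADE is the mean Euclidean distance between predicted and ground-truth waypoints over the whole horizon, and FDE is that distance at the final waypoint only. The sketch below computes both for a single 2D trajectory; the function name displacement_errors and the sample trajectories are made up for illustration and are not taken from the project.

```python
import math
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres


def displacement_errors(
    pred: List[Waypoint], gt: List[Waypoint]
) -> Tuple[float, float]:
    """Return (ADE, FDE) in metres for one predicted trajectory."""
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Per-waypoint Euclidean distance between prediction and ground truth.
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    ade = sum(dists) / len(dists)  # average over all waypoints
    fde = dists[-1]                # error at the final waypoint
    return ade, fde


if __name__ == "__main__":
    gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    pred = [(1.0, 0.0), (2.0, 0.5), (3.0, 1.0)]
    ade, fde = displacement_errors(pred, gt)
    print(f"ADE={ade:.2f} m, FDE={fde:.2f} m")  # ADE=0.50 m, FDE=1.00 m
```

Dataset-level numbers such as those in the table are typically obtained by averaging these per-trajectory values over all evaluation samples.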