ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

Kaiser Hamid¹, Can Cui², Nade Liang¹

¹Texas Tech University ²Bosch Center for Artificial Intelligence
CVPR 2026 (WDFM-EAI Workshop)

ICR-Drive evaluates instruction robustness by keeping the CARLA route and simulator seed fixed while varying only the navigation text across counterfactual instruction families.

Abstract

Recent progress in vision-language-action models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving.

ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading. By replaying identical CARLA routes under matched simulator configurations and seeds, we isolate performance changes attributable solely to instruction language.
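The matched-replay idea can be sketched as a small harness: everything except the instruction text is frozen in one config object, and each variant is run against that same config. This is a minimal sketch under stated assumptions, not the authors' code. `EpisodeConfig`, `Runner`, `replay_matched`, and `score_deltas` are hypothetical names, and the runner is assumed to wrap whatever launches a CARLA episode and returns a driving score.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class EpisodeConfig:
    """Everything held fixed across counterfactual runs (hypothetical fields)."""
    route_id: str
    seed: int


# Hypothetical runner signature: (fixed config, instruction text) -> driving score.
Runner = Callable[[EpisodeConfig, str], float]


def replay_matched(run_episode: Runner, config: EpisodeConfig,
                   variants: Dict[str, str]) -> Dict[str, float]:
    """Run one episode per instruction variant under the same frozen config."""
    return {name: run_episode(config, text) for name, text in variants.items()}


def score_deltas(scores: Dict[str, float],
                 baseline: str = "original") -> Dict[str, float]:
    """Score change of each variant relative to the unperturbed instruction."""
    base = scores[baseline]
    return {name: s - base for name, s in scores.items() if name != baseline}
```

Because the config is frozen and shared by every run, a nonzero entry in `score_deltas` reflects only the change in instruction text, provided the runner itself is deterministic for a given route and seed.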

Experiments on LMDrive and BEVDriver show that even minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a critical reliability gap for deploying embodied foundation models in safety-critical driving.

Counterfactual Instruction Families

Paraphrase: meaning-preserving rewordings that change surface form while keeping the intended maneuver intact.

Ambiguity: underspecified instructions that remove directional, temporal, or distance qualifiers.

Noise: recoverable surface corruptions such as typos, punctuation edits, and casing changes.

Misleading: authority-framed directives that explicitly conflict with the intended navigation goal.
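To make the four families concrete, here is a toy rule-based generator. It is an illustrative sketch only: the lookup table, the qualifier regex, the "Traffic control override" prefix, and all function names are assumptions made for this example, and the actual ICR-Drive generation procedure may differ.

```python
import random
import re
from typing import Dict

# Hypothetical rule tables for illustration; not the authors' pipeline.
PARAPHRASES = {
    "turn left at the next intersection":
        "at the upcoming intersection, make a left turn",
}
QUALIFIERS = re.compile(r"\b(left|right|next|in \d+ meters)\b", re.IGNORECASE)
OPPOSITE = {"left": "right", "right": "left"}


def paraphrase(text: str) -> str:
    """Meaning-preserving rewording via lookup; falls back to the original."""
    return PARAPHRASES.get(text.lower(), text)


def ambiguity(text: str) -> str:
    """Drop directional and distance qualifiers, then tidy whitespace."""
    return re.sub(r"\s+", " ", QUALIFIERS.sub("", text)).strip()


def noise(text: str, seed: int = 0) -> str:
    """Swap one adjacent character pair and flip the case of one character."""
    if len(text) < 2:
        return text
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    chars = list(text)
    i = rng.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    j = rng.randrange(len(chars))
    chars[j] = chars[j].swapcase()
    return "".join(chars)


def misleading(text: str) -> str:
    """Prepend an authority-style prefix to a direction-flipped instruction."""
    flipped = re.sub(r"\b(left|right)\b",
                     lambda m: OPPOSITE[m.group(1).lower()],
                     text, flags=re.IGNORECASE)
    return f"Traffic control override: {flipped}"


def build_variants(text: str, seed: int = 0) -> Dict[str, str]:
    """Return the original instruction plus one variant per family."""
    return {
        "original": text,
        "paraphrase": paraphrase(text),
        "ambiguity": ambiguity(text),
        "noise": noise(text, seed),
        "misleading": misleading(text),
    }
```

For the input "Turn left at the next intersection", the ambiguity rule yields "Turn at the intersection" and the misleading rule yields "Traffic control override: Turn right at the next intersection". The noise variant depends on the seed but always keeps the same characters as the original, ignoring case.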

Key Findings

Instruction variations significantly degrade driving performance. On LangAuto-Tiny, the goal-preserving perturbations (paraphrase, ambiguity, and noise) reduce LMDrive’s driving score by roughly 14 to 15 points, while misleading instructions cause catastrophic drops for both LMDrive and BEVDriver. On the full LangAuto benchmark, ambiguity and misleading instructions remain consistently harmful, and reduced route completion is the dominant failure mode.

Driving score degradation under counterfactual instruction variations for LMDrive and BEVDriver.

BibTeX

@inproceedings{hamid2026icrdrive,
  title={ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving},
  author={Hamid, Kaiser and Cui, Can and Liang, Nade},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
  url={https://icrdrive.github.io/}
}