LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating

1 National University of Singapore; 2 RoboScience Co.; 3 Nanjing University;
4 University of Science and Technology of China

*Equal contribution   |   Corresponding author

LISN uses a fast-slow hierarchy: a VLM reasons over vision and language to modulate classical controllers and costmaps, enabling instruction-compliant social navigation with a 91.3% average success rate across dynamic crowd scenarios.

Abstract

Towards human-robot coexistence, socially aware navigation demands more than efficient paths and collision avoidance. Robots must also comply with language instructions that encode task goals and social norms. We present LISN-Bench, the first simulation benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it evaluates instruction following and scene understanding. To tackle this task, we propose Social-Nav-Modulator, a fast-slow hierarchy in which a VLM modulates costmaps and controller parameters while the low-level controller maintains real-time safety. LISN achieves a 91.3% average success rate, surpassing the best baseline by 63%, particularly in challenging scenarios such as following a person through crowds and respecting instruction-forbidden regions.

Key Contributions

  • LISN-Bench: first simulation benchmark for language-instructed social navigation with standardized metrics for instruction compliance, social norm adherence, and dynamic avoidance.
  • Fast-slow Social-Nav-Modulator: VLM performs high-level reasoning to modulate a reactive controller and costmaps, marrying semantic alignment with real-time safety.
  • Strong empirical gains: 91.3% average success (+63% vs. the best baseline), including on challenging tasks such as crowd person-following and instruction-forbidden zone avoidance.

LISN-Bench

LISN-Bench is built on Rosnav-Arena 3.0 with diverse crowd densities, region-level semantics, and instruction templates that bind goals and social rules. Its metrics capture navigation efficiency, instruction compliance, and social comfort, giving a holistic evaluation of behavior.
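To make the three metric families concrete, here is a minimal Python sketch of how per-episode proxies for efficiency, instruction compliance, and social comfort could be computed from logged trajectories. It is not the benchmark's actual metric code: the `Rect` region type, the `episode_metrics` function, the 0.5 m `comfort_radius`, and the specific proxies (path length, forbidden-region violation rate, comfort intrusion rate) are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned forbidden region (hypothetical representation)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, p):
        x, y = p
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


def episode_metrics(robot_path, pedestrian_paths, forbidden, comfort_radius=0.5):
    """Compute illustrative metrics from time-synchronized 2D trajectories.

    robot_path: list of (x, y), one entry per timestep.
    pedestrian_paths: list of trajectories, each as long as robot_path.
    comfort_radius: assumed personal-space radius in metres.
    """
    steps = len(robot_path)
    # Efficiency proxy: total distance travelled by the robot.
    path_length = sum(math.dist(a, b) for a, b in zip(robot_path, robot_path[1:]))
    # Compliance proxy: share of timesteps spent inside the forbidden region.
    violations = sum(1 for p in robot_path if forbidden.contains(p))
    # Comfort proxy: share of timesteps with any pedestrian inside comfort_radius.
    intrusions = sum(
        1
        for t, p in enumerate(robot_path)
        if any(math.dist(p, ped[t]) < comfort_radius for ped in pedestrian_paths)
    )
    return {
        "path_length": path_length,
        "forbidden_violation_rate": violations / steps,
        "comfort_intrusion_rate": intrusions / steps,
    }
```

Rates rather than raw counts are used in this sketch so that episodes of different lengths stay comparable.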

Figure: LISN-Bench benchmark visualization
Figure: Social-Nav-Modulator framework

Social-Nav-Modulator

A VLM-based slow loop reasons over instructions and visual context, adjusting controller parameters and costmap values. The fast loop executes reactive control for collision-free motion, retaining classical safety while expressing nuanced social behaviors like caution, urgency, and forbidden-zone avoidance.
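The division of labor between the two loops can be sketched roughly as follows. This is an illustrative Python outline, not the authors' implementation: the structured `vlm_output` dictionary, the `ControllerParams` fields, the grid-cell representation of forbidden zones, and the `LETHAL` cost of 254 are all assumptions. The VLM call itself is omitted; only the way its output might be applied is shown.

```python
from dataclasses import dataclass, field

LETHAL = 254  # assumed cost value marking a cell as untraversable


@dataclass
class ControllerParams:
    max_speed: float = 1.0      # m/s, hypothetical default
    stop_distance: float = 0.3  # m, hypothetical default


@dataclass
class SharedState:
    """State written by the slow loop and read by the fast loop."""
    params: ControllerParams = field(default_factory=ControllerParams)
    forbidden_cells: set = field(default_factory=set)


def apply_vlm_output(state, vlm_output):
    """Slow loop: apply a (hypothetical) structured VLM response."""
    if "max_speed" in vlm_output:
        state.params.max_speed = float(vlm_output["max_speed"])
    if "stop_distance" in vlm_output:
        state.params.stop_distance = float(vlm_output["stop_distance"])
    if "forbidden_cells" in vlm_output:
        state.forbidden_cells = {tuple(c) for c in vlm_output["forbidden_cells"]}


def modulate_costmap(base_costmap, state):
    """Return a copy of the costmap with forbidden cells raised to LETHAL."""
    grid = [row[:] for row in base_costmap]
    for r, c in state.forbidden_cells:
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            grid[r][c] = LETHAL
    return grid


def fast_loop_step(desired_speed, nearest_obstacle_dist, params):
    """Fast loop: reactive safety check that runs every control tick."""
    if nearest_obstacle_dist <= params.stop_distance:
        return 0.0  # stop regardless of what the slow loop requested
    return min(desired_speed, params.max_speed)
```

In this sketch, an instruction expressing caution would map to a lower `max_speed` or a larger `stop_distance`. The fast loop never waits on the VLM, so its stopping behavior holds even when the slow loop lags.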

BibTeX

@misc{chen2025lisnlanguageinstructedsocialnavigation,
    title={LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating},
    author={Junting Chen and Yunchuan Li and Panfeng Jiang and Jiacheng Du and Zixuan Chen and Chenrui Tie and Jiajun Deng and Lin Shao},
    year={2025},
    eprint={2512.09920},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2512.09920},
}