Benchmarking Gaslighting Attacks Against Speech Large Language Models

Wu, Jinyang; Zhu, Bin; Zou, Xiandong; Zhang, Qiquan; Fang, Xu; Zhou, Pan

Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu^1,†, Bin Zhu^1,*, Xiandong Zou¹, Qiquan Zhang², Xu Fang³, Pan Zhou^1,*

¹Singapore Management University, Singapore
²Tongyi Speech Lab, Alibaba, China
³Dalian University of Technology, China
^†Contact: imcc.sg@gmail.com
^*Corresponding authors

ICASSP 2026

Paper Benchmark Dataset BibTeX

We introduce a systematic benchmark for probing whether Speech LLMs can maintain correct audio-grounded beliefs when users apply emotional pressure, sarcasm, implicit doubt, cognitive distraction, or professional authority.

Motivation

Motivation. Speech LLMs are increasingly used in spoken QA, emotion understanding, and multimodal reasoning, but their robustness to manipulative follow-up prompts remains underexplored.

Our Contributions

Contributions. The project introduces the first systematic benchmark of gaslighting attacks on Speech LLMs, a behavior-aware evaluation protocol, and multimodal stress testing with adversarial prompts and acoustic noise.

Framework Overview

Framework overview. Stage 1 establishes baseline predictions from clean speech plus task prompts, while Stage 2 applies gaslighting prompts and optional noise to measure revised outputs and behavior shifts.

Benchmark Setup

Benchmark setup. We evaluate 5 models on 5 datasets covering 10,740 test samples, and curate a 1,500-sample behavior-aware benchmark subset. All tasks are formatted as multiple-choice speech understanding problems.

Key Results

Main finding. Across 5 models and 10,740 test samples, gaslighting-style prompts cause an average accuracy drop of 24.3%, with Cognitive and Professional negation often the most disruptive.

Noise Amplifies Gaslighting Vulnerability

Relative performance change under five negation strategies with increasing noise

Controlled acoustic ablation. Larger drops indicate higher vulnerability when gaslighting prompts are combined with increasing noise.

Acoustic uncertainty compounds semantic pressure

Noise does not affect every manipulation type equally. Professional and Implicit prompts become especially damaging under degraded audio, while Sarcasm remains comparatively more stable in the VocalSound setting.

This suggests that Speech LLM vulnerability is shaped by the interaction between prompt framing and acoustic uncertainty, rather than by text-only or audio-only difficulty alone.

Paper

BibTeX

@inproceedings{wu2026benchmarking,
  title={Benchmarking gaslighting attacks against speech large language models},
  author={Wu, Jinyang and Zhu, Bin and Zou, Xiandong and Zhang, Qiquan and Fang, Xu and Zhou, Pan},
  booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={19867--19871},
  year={2026},
  organization={IEEE}
}