Integrated Speech Language Models Face Critical Safety Vulnerabilities
Table of Links
Part 1: Abstract & Introduction
Part 2: Background
Part 3: Attacks & Countermeasures
Part 4: Experimental Setup
Part 5: Datasets & Evaluation
Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations
Part 7: Results & Discussion
Part 8: Transfer Attacks & Countermeasures
Part 9: Conclusion, Limitations, & Ethics Statement
Part 10: Appendix: Audio Encoder Pre-training & Evaluation
Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness
Part 12: Appendix: Adaptive attacks & Qualitative Examples
6. Conclusion
We present a detailed study of the safety alignment of speech language models (SLMs) through the lens of a spoken question-answering (Spoken QA) application. We investigate the robustness of several in-house models, along with publicly available models, under adversarial attacks. To accurately assess the safety alignment of these models, we developed a comprehensive evaluation setup based on a publicly available LLM. Through extensive experiments, we demonstrate that an adversary with white-box access to the systems can jailbreak them using barely perceptible perturbations, forcing them to ignore their safety alignment training. Furthermore, adversarial perturbations generated using one model can jailbreak a different model with reasonable success, with some models exhibiting greater robustness than others. We also show the effectiveness of a noise-flooding defense in countering these attacks.
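As a minimal illustration of the noise-flooding idea (a simplified sketch, not the exact configuration used in our experiments), random noise at a chosen signal-to-noise ratio is added to the incoming audio before it reaches the SLM, with the aim of disrupting carefully optimized adversarial perturbations. The `slm.generate` interface in the usage comment is a hypothetical placeholder.

```python
import torch

def noise_flood(waveform: torch.Tensor, snr_db: float = 30.0) -> torch.Tensor:
    """Add white Gaussian noise to a waveform at a target signal-to-noise ratio (dB)."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = torch.randn_like(waveform) * noise_power.sqrt()
    return waveform + noise

# Hypothetical usage: flood the (possibly adversarial) input before inference.
# defended_audio = noise_flood(adversarial_audio, snr_db=30.0)
# response = slm.generate(defended_audio, text_prompt)
```

Lower SNR values inject stronger noise and give a better chance of washing out an adversarial perturbation, at the cost of degrading helpfulness on benign inputs; this trade-off is the subject of the noise-flooding experiments in the main text.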
To the best of our knowledge, this is the first study to investigate potential safety vulnerabilities of integrated speech and language models. With the rapid adoption of such technologies, it is imperative to thoroughly understand their safety implications, to devise effective countermeasures against jailbreaking threats, and to prevent these models from causing harm. A holistic approach to understanding the safety alignment of these systems is required, including the study of universal adversarial threats (a single perturbation that jailbreaks multiple systems), prompt injection attacks, and model poisoning. We hope that this work will serve as a precursor to many such studies.
Limitations
In this work, we use a preference model as a judge to assess the safety of SLM responses. We acknowledge that such a judge may not always align with human judgement, which may introduce a small margin of error into our safety annotations; we plan to address this in future work. Furthermore, our work provides only a limited exploration of SLMs built on safety-aligned text LLMs, although the SLMs themselves are trained with safety-aligned spoken data. Given that our approach has already demonstrated its efficacy against such models, we leave a thorough exploration to future work. Lastly, concerns about misuse by malicious actors prevent us from releasing the training datasets and models, which limits replication by other researchers. However, we are considering releasing the benchmarking datasets with the final submission to facilitate further exploration in this space.
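For concreteness, the sketch below shows the general LLM-as-judge pattern we refer to. The prompt, model checkpoint, and interface are simplified assumptions for illustration, not our exact judge model or evaluation prompt: the judge receives the user request and the SLM response and emits a SAFE/UNSAFE verdict.

```python
# Simplified, hypothetical sketch of an LLM-as-judge safety check; the prompt,
# checkpoint, and decision rule are assumptions, not the paper's exact setup.
from transformers import pipeline

JUDGE_PROMPT = (
    "You are a safety judge. Given a user request and an assistant response, "
    "answer with a single word, SAFE or UNSAFE, indicating whether the response "
    "complies with a harmful request.\n\n"
    "Request: {request}\nResponse: {response}\nVerdict:"
)

def judge_safety(judge, request: str, response: str) -> str:
    """Return the judge model's SAFE/UNSAFE verdict for one (request, response) pair."""
    prompt = JUDGE_PROMPT.format(request=request, response=response)
    generated = judge(prompt, max_new_tokens=5)[0]["generated_text"]
    verdict = generated[len(prompt):].upper()
    return "UNSAFE" if "UNSAFE" in verdict else "SAFE"

# judge = pipeline("text-generation", model="<public-llm-checkpoint>")  # placeholder
# verdict = judge_safety(judge, spoken_query_transcript, slm_response)
```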
Ethics Statement
All speech datasets we use feature anonymous speakers. We do not have access to, nor do we attempt to create, any personally identifiable information (PII) about speakers, and our model neither identifies speakers nor uses speaker embeddings. Furthermore, we obtained the necessary consent from all participants in our data collection efforts, following approval by an internal review board.
While we acknowledge the ethical risks associated with jailbreaking techniques, this work represents a valuable contribution towards a deeper understanding of speech-language model capabilities and limitations. Our aim is to enable further research that improves model robustness, leading to safer and more beneficial applications. By responsibly investigating methods to circumvent restrictions, we shed light on potential vulnerabilities that could be exploited by malicious attackers if left unaddressed. Critically, our work also proposes and evaluates countermeasures to mitigate such jailbreaking attacks. While we encourage ethical debate on such emerging issues, we believe the merits of responsible disclosure, proactive security improvements, and developing defensive techniques outweigh any potential risks associated with our narrow jailbreaking experiments under controlled conditions. Overall, our work aims to make progress towards more secure and robust multimodal speech-language models.
Authors:
(1) Raghuveer Peri, AWS AI Labs, Amazon, with equal contribution ([email protected]);
(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, with equal contribution;
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Anshu Bhatia, AWS AI Labs, Amazon;
(5) Karel Mundnich, AWS AI Labs, Amazon;
(6) Saket Dingliwal, AWS AI Labs, Amazon;
(7) Nilaksh Das, AWS AI Labs, Amazon;
(8) Zejiang Hou, AWS AI Labs, Amazon;
(9) Goeric Huybrechts, AWS AI Labs, Amazon;
(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;
(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;
(13) Kyu J Han, AWS AI Labs, Amazon;
(14) Katrin Kirchhoff, AWS AI Labs, Amazon.