Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
Abstract
Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paradigm offers scalability and cost-effectiveness compared to human annotation. However, the reliability and security of such systems, particularly their robustness against adversarial manipulation, remain critical concerns. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks, where malicious inputs are designed to compromise the judge’s decision-making process. We formalize two primary attack strategies: the Comparative Undermining Attack (CUA), which directly targets the final decision output, and the Justification Manipulation Attack (JMA), which aims to alter the model’s generated reasoning. Using the Greedy Coordinate Gradient (GCG) optimization method, we craft adversarial suffixes appended to one of the responses being compared. Experiments conducted on the MT-Bench Human Judgments dataset with open-source instruction-tuned LLMs (Qwen2.5-3B-Instruct and Falcon3-3B-Instruct) demonstrate significant susceptibility: CUA achieves an Attack Success Rate (ASR) exceeding 30%, and JMA also proves notably effective. These findings highlight substantial vulnerabilities in current LLM-as-a-Judge systems and underscore the need for robust defense mechanisms and further research into adversarial evaluation and trustworthiness of LLM-based assessment frameworks.
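To illustrate the attack mechanics summarized above, the following is a minimal sketch of one GCG iteration for the Comparative Undermining Attack, written against the Hugging Face transformers API. It is not the authors' released implementation: the choice of Qwen2.5-3B-Instruct as the judge, the assumption that the judge emits a verdict token such as "B" immediately after the pairwise-comparison prompt, and the hyperparameters top_k and n_candidates are illustrative assumptions.

# Minimal sketch (assumption-laden, not the paper's released code) of one GCG
# iteration for a Comparative Undermining Attack against a pairwise LLM judge.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-3B-Instruct"  # one of the judge models used in the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")
model.eval()

def cua_loss(prefix_ids, suffix_onehot, target_ids):
    """Cross-entropy of the judge emitting the attacker's verdict (e.g. "B")
    immediately after the comparison prompt that carries the adversarial suffix."""
    emb = model.get_input_embeddings()
    prefix_emb = emb(prefix_ids)                      # (1, Lp, d)
    suffix_emb = suffix_onehot @ emb.weight           # (1, Ls, d), differentiable in the one-hot
    target_emb = emb(target_ids)                      # (1, Lt, d)
    logits = model(inputs_embeds=torch.cat([prefix_emb, suffix_emb, target_emb], dim=1)).logits
    Lt = target_ids.shape[1]
    pred = logits[:, -Lt - 1:-1, :]                   # positions that predict the verdict tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

def gcg_step(prefix_ids, suffix_ids, target_ids, top_k=256, n_candidates=64):
    """One greedy coordinate-gradient step: rank token substitutions by the
    gradient of the loss, evaluate a batch of single-token swaps, keep the best."""
    vocab = model.get_input_embeddings().weight.shape[0]
    onehot = F.one_hot(suffix_ids, vocab).to(model.dtype).unsqueeze(0)
    onehot.requires_grad_(True)
    loss = cua_loss(prefix_ids, onehot, target_ids)
    grad = torch.autograd.grad(loss, onehot)[0].squeeze(0)     # (Ls, vocab)
    top_tokens = (-grad).topk(top_k, dim=-1).indices           # promising replacements per position

    best_ids, best_loss = suffix_ids, loss.item()
    for _ in range(n_candidates):
        cand = suffix_ids.clone()
        pos = torch.randint(0, suffix_ids.numel(), (1,)).item()
        cand[pos] = top_tokens[pos, torch.randint(0, top_k, (1,)).item()]
        with torch.no_grad():
            cand_onehot = F.one_hot(cand, vocab).to(model.dtype).unsqueeze(0)
            cand_loss = cua_loss(prefix_ids, cand_onehot, target_ids).item()
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss

# Illustrative usage: `judge_prompt` is a hypothetical pairwise-comparison prompt
# ending with the attacker-controlled response; the verdict string "B" stands in
# for whatever label the judge template assigns to that response.
# prefix_ids = tok(judge_prompt, return_tensors="pt").input_ids.to("cuda")
# target_ids = tok("B", add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
# suffix_ids = tok(" !" * 20, add_special_tokens=False, return_tensors="pt").input_ids[0].to("cuda")
# for _ in range(500):
#     suffix_ids, loss = gcg_step(prefix_ids, suffix_ids, target_ids)

The JMA variant would differ only in the target: instead of a verdict token, the loss is taken over an attacker-chosen justification string.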