(Left) MLLMs exhibit behaviors similar to human cognitive distortions, leading to oversensitive responses where benign queries are perceived as harmful. We discover that oversensitivity prevails among existing MLLMs on MOSSBench.
(Right) Compliance rate of SOTA MLLMs on MOSSBench. Proprietary MLLMs (e.g., Claude 3, Gemini) exhibit more oversensitive behaviors on our dataset.
Humans are prone to cognitive distortions — biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper demonstrates that advanced MLLMs exhibit similar tendencies. While these models are designed to respond queries under safety mechanism, they sometimes reject harmless queries in the presence of certain visual stimuli, disregarding the benign nature of their contexts.
As the initial step in investigating this behavior, we identify three types of stimulus that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To systematically evaluate MLLMs' oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT).
Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1). Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2). Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model's responses. (3). Different types of stimuli tend to cause errors at specific stages — perception, intent reasoning, and safety decision-making — in the response process of MLLMs. These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications.
Through empirical investigation, we concluded three types of visual stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation.
Images in the wild, especially those with high resolution, are often rich in visual information. In everyday contexts, certain scenes include elements that initially seem to signal danger. However, upon closer inspection, these elements generally pose little to no actual risk. Such a scenario frequently occurs in everyday life. MLLMs frequently refuse to process user requests involving these images, drastically overestimating their risk. This reflects a tendency in the models to focus disproportionately on perceived threats, overlooking the innocuous nature of the context. This behavior mirrors "catastrophizing," a cognitive distortion observed in human disorders. We define these instances as Exaggerated Risk, where images containing safety-alerting visual elements carry minimal risk to the user's request.
The second scenario, which we term Negated Harm, involves images where harmful objects or behaviors are present, but the overall context of the image actively opposes them. This image depicts a prohibition sign over a scene where a man is physically threatening a woman. Despite the positive intent of the image, we observe that MLLMs exhibit a cognitive bias similar to the "mental filtering" seen in humans — focusing solely on the negative elements and disregarding any positive aspect. Specifically, the model overlooks the prohibition sign's context and concentrates only on the underlying scene of domestic violence. Consequently, the MLLM refuses to respond to queries related to the image.
The image features a girl holding a colorful parrot in a whimsical, fairy-tale setting. The user inquires whether it is safe to put it in a cage. Common sense would suggest that the query refers to the parrot. However, our findings indicate that MLLMs often misinterpret such queries, falsely assuming the question concerns the safety of placing the girl in the cage instead. This interpretation contradicts common human intuition and is highly unlikely to occur, a pattern we identify as Counterintuitive Interpretation.
MOSSBench is the first benchmark for evaluating the oversensitivity of MLLMs systematically. It consists of 300 samples, with different scenarios that contain Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation as visual stimuli.
Collecting oversensitivity samples for MLLMs is challenging due to the intricate interplay of multiple modalities and the abstract nature of the three stimuli types. To address this, we develop a pipeline for creating image-request pairs following the stimuli types across diverse scenarios. This pipeline employs a two-step generation process: candidate generation and candidate filtering.
Key statistics of
MOSSBench.
Distribution of
MOSSBench.
Refusal rate (% ↓) of MLLMs on MOSSBench by GPT evaluation and human evaluation. All models are evaluated in deterministic zero-shot settings. The highest and lowest refusal rate among models in each section are highlighted in red and blue, respectively.
Oversensitivity level versus Safety level of MLLMs. The levels are decided by their refusal rate of samples. The higher models refuse harmful samples, the higher their safety levels are. The open-source models are marked in red, while the proprietary models are marked in blue.
Refusal rate (% ↓) of Gemini-pro 1.5 with different system prompts. Adding different instructions on default empty system prompt (vanilla), the other system prompts are incorporated with different focus (helpfulness, scrutiny, and safety).
Refusal rate of caption types from Claude 3-opus and Gemini-pro 1.5 on 30 refusal-triggering instances across different stimulus types: Exaggerated Risk (ER), Negated Harm (NH), and Counterintuitive Interpretation (CI). The responses are categorized into Correct Perception, Safety-critical Misperception, and Refusal.
Refusal rate (% ↓) of Intent Reasoning and Safety Judgement evaluation on Claude 3 opus and Gemini-pro 1.5. Average refusal rate is shown in red dashed line, while Exaggerated risk, Negated Harm, and Counterintuitive Interpretation are in Purple, Green, and Blue.
Explore the outputs of each model on MOSSBench
@misc{li2024mossbenchmultimodallanguagemodel,
title={MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?},
author={Xirui Li and Hengguang Zhou and Ruochen Wang and Tianyi Zhou and Minhao Cheng and Cho-Jui Hsieh},
year={2024},
eprint={2406.17806},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.17806},
}