Professor Wang Dequan’s research group at Shanghai Jiao Tong University has raised a critical question in their latest study.
Imagine this scenario: A kindergarten child holds up a picture of a tiger and asks you, “This kitten is so cute; is it a female cat?” How would you respond?
You likely wouldn’t simply answer “yes” or “no.” Instead, you would first point out the contradiction in the question—the image depicts a tiger, not a cat.

However, there has been little systematic research on how large language models (LLMs) handle such situations.
AI models that cannot detect such “instruction conflicts” will generate answers to questions that should have none. Whichever side of the conflict the response favors, the outcome can be harmful, with implications for AI safety and superalignment.
In this latest study, the team introduced a multimodal benchmark called the Self-Contradictory Instructions (SCI) dataset. They also designed an innovative automatic dataset creation framework named AutoCreate.
The team found that multimodal large models fall well short at detecting self-contradictory user instructions. To address this, they proposed Cognitive Awakening Prompting (CAP), a method that injects cognitive capabilities from the external world to improve contradiction detection.
This paper is scheduled for publication at the 18th European Conference on Computer Vision (ECCV) in October of this year.

Can Large Models Detect Conflicting Instructions?
Currently, multimodal large models have made significant progress in both research and application fields. They can process various data types, including text and images, demonstrating capabilities similar to human cognition.
The team attributes this success to extensive research and development that has taught the models to follow human instructions closely, sometimes to the point of over-compliance.
Additionally, these models excel in handling long contexts. Multimodal large models such as Claude 3 and Gemini 1.5 Pro have demonstrated powerful capabilities. The Claude 3 series offers a context window of 200K tokens, while the standard context window for Gemini 1.5 Pro is 128K, reaching up to 1M tokens during its private preview phase.
These advancements allow multimodal large models to perform exceptionally well in complex tasks, meeting human needs for prolonged interaction.
However, as multimodal interactions deepen and context lengths increase, the issue of self-contradictory user instructions has become increasingly prominent.
As shown below, when users (such as children or language beginners) interact with these models, they often fail to recognize potential multimodal conflicts.

Furthermore, as the number of dialogue turns increases and context windows expand, users struggle to remember all details, leading to contradictions between instructions.
Additionally, as the number of modalities increases, conflicts may arise between them. If these models lack self-awareness and the ability to discern contradictions, their performance will be compromised.
To address these challenges, the research team proposed a multimodal benchmark—the Self-Contradictory Instructions (SCI) dataset—to evaluate the ability of multimodal large models to detect conflicting instructions.
SCI contains 20,000 conflicting instructions and covers 8 tasks, evenly distributed across two paradigms: language-language and vision-language.
In the upper part of the figure below, the language-language paradigm involves conflicts between context and instructions, such as rule conflicts, object attribute conflicts, exclusive instructions, and prohibited vocabulary.

In the lower part of the figure, the vision-language paradigm covers multimodal conflicts, such as OCR text recognition conflicts, chart conflicts, geometric conflicts, and semantic conflicts. Among the eight tasks, only the semantic conflict task draws on an external dataset (ImageNet).
To illustrate with a specific example: when constructing semantic conflicts, researchers first generate corresponding text based on an image, then replace key semantic information in the text with similar but different new semantics.
In the figure below, the image contains an ostrich. The authors posed the question “Does the picture depict the ostrich’s size?” regarding the image’s semantic content (“ostrich”).
Subsequently, they replaced the key semantic term “ostrich” in this question with “kiwi.” This creates a pair of self-contradictory multimodal instructions.
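The substitution step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the `SIMILAR_LABELS` table and the `make_semantic_conflict` function are hypothetical names, and the article does not say how the "similar but different" term is actually chosen.

```python
# Hypothetical lookup of "similar but different" labels. How the paper picks
# the substitute term is not described in this article.
SIMILAR_LABELS = {
    "ostrich": "kiwi",
    "tiger": "cat",
}


def make_semantic_conflict(image_label: str, question: str) -> dict:
    """Swap the image's true label in the question for a look-alike label."""
    if image_label not in question:
        raise ValueError("question does not mention the image label")
    substitute = SIMILAR_LABELS[image_label]
    return {
        "image_label": image_label,        # what the image really shows
        "original_question": question,     # consistent with the image
        "conflicting_question": question.replace(image_label, substitute),
    }


sample = make_semantic_conflict(
    "ostrich", "Does the picture depict the ostrich's size?"
)
print(sample["conflicting_question"])
```

Pairing the unchanged image (an ostrich) with the rewritten question (about a kiwi) yields one self-contradictory vision-language instruction.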

To construct SCI, the authors designed an automatic dataset creation framework called AutoCreate.
It builds a multimodal loop that combines programs and large language models to automate dataset creation.
AutoCreate starts with several seed data points related to the task and maintains a seed pool. In each cycle, AutoCreate includes two branches: the language branch (left) and the vision branch (right). Each branch consists of a generator and a modifier.

Finally, a cleaner removes data that does not meet the standards. The data that remains, once it passes manual quality checks by experts, is fed back into the seed pool for use in the next round.
AutoCreate significantly improves both the speed and breadth of content in SCI dataset construction.
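The AutoCreate loop (seed pool, generator, modifier, cleaner, expert check, feedback) can be sketched roughly as follows. This is a simplified sketch under stated assumptions rather than the authors' implementation: it models a single language-style branch only, every function name is hypothetical, and the generator and modifier are trivial stand-ins for the LLM and program components the article mentions.

```python
import random


def generate(seed: str, rng: random.Random) -> str:
    """Stand-in generator; in the article this role is filled by an LLM or program."""
    return f"{seed} (variant {rng.randint(0, 999)})"


def modify(sample: str) -> dict:
    """Stand-in modifier: pair a context with an instruction that contradicts it."""
    return {
        "source": sample,
        "context": f"Never mention: {sample}",
        "instruction": f"Describe: {sample}",
    }


def clean(items: list) -> list:
    """Stand-in cleaner: drop duplicates and overly long items."""
    seen, kept = set(), []
    for item in items:
        key = item["instruction"]
        if key not in seen and len(key) < 200:
            seen.add(key)
            kept.append(item)
    return kept


def auto_create(seeds, rounds=2, per_round=4, expert_ok=lambda item: True, rng_seed=0):
    """Run the loop; `expert_ok` is a placeholder for the manual expert check."""
    rng = random.Random(rng_seed)
    pool = list(seeds)
    dataset = []
    for _ in range(rounds):
        drafts = [modify(generate(rng.choice(pool), rng)) for _ in range(per_round)]
        accepted = [d for d in clean(drafts) if expert_ok(d)]
        dataset.extend(accepted)
        # Accepted items are fed back into the seed pool for the next round.
        pool.extend(d["source"] for d in accepted)
    return dataset, pool


dataset, pool = auto_create(["a red apple"])
print(len(dataset), len(pool))
```

The point of the feedback step is that each round draws on a larger and more varied seed pool than the one before it.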
How to Improve Contradiction Detection Capability?
Using the SCI dataset, researchers comprehensively evaluated the performance of large models when handling contradictory instructions.
Experimental results indicate that current large models often show deficiencies when facing self-contradictory instructions.
While they can process information and knowledge, they lack the ability to judge whether an instruction is itself reasonable, a capability the research team refers to as “cognition.”
This defect stems from a lack of self-awareness, preventing them from identifying inconsistencies within instructions.
Therefore, researchers proposed a simple insertion-based prompting method called “Cognitive Awakening Prompting” (CAP).
By adding a single simple prompt to the input, CAP injects cognitive capabilities from the external world, thereby improving the large model’s contradiction detection ability with minimal negative side effects.
This finding suggests that current multimodal large models require more self-awareness and cognitive abilities to better handle complex instruction conflicts.
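In code, an insertion-based prompt of this kind amounts to adding one fixed sentence to the user's input before it reaches the model. The sketch below is illustrative only: the reminder wording, its position at the front of the input, and the `apply_cap` name are assumptions, since the article does not quote the prompt used in the paper.

```python
# Hypothetical reminder text; the article does not give the paper's exact wording.
CAP_REMINDER = (
    "The instruction may contradict itself or the image. "
    "If it does, point out the conflict before answering."
)


def apply_cap(user_instruction: str, reminder: str = CAP_REMINDER) -> str:
    """Insert a single reminder sentence ahead of the user's instruction."""
    return f"{reminder}\n\n{user_instruction}"


prompt = apply_cap("This kitten is so cute; is it a female cat?")
print(prompt)
```

Because the change is confined to the input text, it can be applied to any model without retraining, which fits the article's description of CAP as a simple method with minimal side effects.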

For more details, interested readers can refer to the original paper.
Author Profile
The first author of the paper is Gao Jin, a doctoral student at Shanghai Jiao Tong University.
His research interests include computer vision, multimodal large models, and AI-enabled life sciences.

The corresponding author is Wang Dequan, a Tenure-Track Assistant Professor and doctoral supervisor at Shanghai Jiao Tong University. He received his bachelor’s degree from Fudan University and his Ph.D. from the University of California, Berkeley, under the supervision of Professor Trevor Darrell.
His research has been published at top-tier international conferences including CVPR, ICCV, ECCV, ICLR, ICML, ICRA, and IROS. In the past five years, his papers have received over 10,000 citations on Google Scholar, with an H-index of 20.
Paper Link: https://arxiv.org/abs/2408.01091
Project Link: https://selfcontradiction.github.io/