en.Wedoany.com Reported - NTT of Japan recently announced "Rationale-Enhanced Decoding," a multimodal explainable AI reasoning framework. It addresses the problem of trustworthy output when large visual language models reason jointly over images and text, and improves the consistency between a model's final answer and its stated reasoning basis. The work will be presented at CVPR 2026, held from June 3 to 7 in Denver, USA. Target applications include enterprise decision-making, AI agent collaboration, document understanding, visual question answering, and high-reliability human-computer interaction.
Large visual language models are evolving from "answering based on images" toward more complex multimodal reasoning over images, text, tables, page screenshots, video clips, and business documents. They are gradually entering trial and deployment in industrial inspection, medical imaging, contract review, remote operations, intelligent customer service, and enterprise knowledge management. A key issue with such models, however, is that the intermediate reasoning they generate does not necessarily influence the final answer. NTT's research points out that traditional multimodal chain-of-thought methods first generate an explanation or reasoning basis, then feed it back into the model together with the original image to produce the final answer. On the surface the model provides "reasons," but the actual output may still rely primarily on image features: even when the reasoning basis is replaced with irrelevant content, the model may produce the same answer. The so-called explanation may therefore be merely appended text that does not show the model actually based its judgment on it. For enterprise AI systems that require auditing, accountability, and review, this undermines the credibility of multimodal AI in critical business operations and limits the entry of visual language models into high-reliability scenarios such as medical diagnosis, financial risk control, manufacturing quality inspection, and complex office workflows.
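The replacement observation above suggests a simple check, sketched here in Python under stated assumptions. The function answer_fn is a hypothetical stand-in for any model call that takes an image, a question, and a reasoning basis and returns an answer; it is not an NTT or library API, and the two toy models exist only to make the sketch runnable. The score is the fraction of examples whose answer stays the same after the reasoning basis is swapped for irrelevant text; a value near 1.0 would suggest the reasoning basis has little influence on the answer.

```python
def rationale_insensitivity(answer_fn, examples, irrelevant_rationale):
    """Fraction of examples whose answer is unchanged when the
    reasoning basis is replaced with irrelevant text.

    answer_fn is a hypothetical callable (image, question, rationale) -> answer.
    """
    unchanged = 0
    for image, question, rationale in examples:
        original = answer_fn(image, question, rationale)
        swapped = answer_fn(image, question, irrelevant_rationale)
        if original == swapped:
            unchanged += 1
    return unchanged / len(examples)


# Toy stand-ins for a model (assumptions for illustration only).
def ignores_rationale(image, question, rationale):
    # Answers from the image alone, as the article says may happen.
    return "answer-for-" + image


def uses_rationale(image, question, rationale):
    # Answers with the last word of the reasoning basis.
    return rationale.split()[-1]


examples = [
    ("img1", "What animal is shown?", "The ears are pointed, so it is a cat"),
    ("img2", "What does the sign say?", "It is red and octagonal, so it says stop"),
]

score_ignoring = rationale_insensitivity(ignores_rationale, examples, "unrelated text")
score_using = rationale_insensitivity(uses_rationale, examples, "unrelated text")
```

In this toy setup the model that ignores the reasoning basis scores 1.0 and the one that depends on it scores 0.0. A real evaluation would need actual model calls and a more careful answer comparison than exact string equality.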
The solution proposed by NTT requires neither retraining the model nor additional datasets. Instead, it reorganizes the output generation process during the inference phase.
The framework forms two conditional distributions, one conditioned on the visual input and one on the reasoning basis, then combines them to predict the next token, so that answer generation is constrained by image information and reasoning information at the same time. In other words, the final answer must be consistent with both the visual content and the reasoning basis, rather than treating the explanatory text as an optional appendage. NTT describes the method as a plug-and-play decoding technique that can be integrated into existing large visual language models, avoiding the computational, data, and deployment costs of additional training. The research results show that the method improves answer accuracy and reasoning fidelity across various visual language models, and that its effectiveness increases further when higher-quality reasoning bases are supplied. For enterprise AI deployment, the value of this approach lies in advancing from "the model can answer" to "the model's answers can be explained, verified, and reviewed," providing a more stable reasoning foundation for multi-agent collaboration, complex document processing, visual scene analysis, and assisted decision-making.
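As a rough illustration of combining two conditional distributions at decoding time, the Python sketch below takes one vector of next-token logits conditioned on the image and one conditioned on the reasoning basis, and merges them. The article does not state the combination rule, so the weighted product used here (adding log-probabilities, with a hypothetical weight lam) is an assumption, as are the function names and the toy three-word vocabulary. In a real system the two logit vectors would come from two forward passes of the same model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def combine_next_token(image_logits, rationale_logits, lam=1.0):
    """Merge image-conditioned and rationale-conditioned next-token
    distributions. Assumed rule: p(y) proportional to
    p_image(y) * p_rationale(y) ** lam. lam is a hypothetical weight.
    """
    log_p_img = log_softmax(image_logits)
    log_p_rat = log_softmax(rationale_logits)
    scores = [a + lam * b for a, b in zip(log_p_img, log_p_rat)]
    # Renormalize so the result is a valid probability distribution.
    return [math.exp(s) for s in log_softmax(scores)]


def argmax_token(probs, vocab):
    """Greedy choice: the vocabulary entry with the highest probability."""
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best]


# Toy values (assumptions): the image alone slightly prefers "cat",
# while the reasoning basis strongly supports "dog".
vocab = ["cat", "dog", "car"]
image_logits = [2.0, 1.9, 0.1]
rationale_logits = [0.2, 2.5, 0.1]

probs = combine_next_token(image_logits, rationale_logits)
next_token = argmax_token(probs, vocab)
image_only = argmax_token(combine_next_token(image_logits, rationale_logits, lam=0.0), vocab)
```

With these toy numbers the image-only choice is "cat", while the combined distribution selects "dog", showing how the reasoning basis can change the decoded token instead of being ignored. Because only the decoding step changes, a scheme of this kind needs no retraining, which matches the plug-and-play description above.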
The industrial significance of multimodal explainable AI is rising. As AI agents move from single-turn question answering to continuous task execution, systems repeatedly pass judgment results between image recognition, document understanding, retrieval, planning, and tool invocation. Once the reasoning provided by an upstream visual language model becomes disconnected from its answer, the downstream agent chain may build on erroneous foundations. NTT's work focuses on the fundamental question of "whether the reasoning basis truly participates in answer generation," helping to improve the credibility of information exchanged between collaborating AI systems. If the framework is subsequently shown to be stable across more models, tasks, and real business data, it is expected to enter the reasoning layers of enterprise-level AI platforms, intelligent office systems, industry-specific large models, and high-reliability visual analysis tools, becoming a key technical component as multimodal AI moves from demonstration to production deployment.
This article is compiled by Wedoany.