en.Wedoany.com Reported - Japan's NTT recently announced a new explainable AI reasoning technique for multimodal foundation models, called "Rationale-Enhanced Decoding." The technique aims to make the outputs of large vision-language models more reliable when they process images and text. The research will be presented at CVPR 2026, held from June 3 to 7 in Denver, USA.
The technique addresses a key issue in current multimodal AI applications: the final answer a model generates may not actually use the reasoning basis (rationale) it produced in the previous step. Existing large vision-language models can first generate an intermediate reasoning process and then give a final answer conditioned on the image, the text, and that reasoning. In its experiments, however, NTT found that the model sometimes ignores the reasoning content and relies directly on the image to produce the result. Even when researchers replaced the reasoning basis with content unrelated to the question, the model might still give the same answer as before. In such cases the so-called "chain of thought" does not automatically amount to a true explanation, which makes it difficult to support high-reliability applications such as medical imaging, corporate decision-making, and critical business audits.
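The replacement experiment described above can be illustrated with a small sketch. This is not NTT's evaluation code: `rationale_ignored_rate`, the `answer_fn` interface, and the two toy models are hypothetical stand-ins, with strings used in place of real images. The sketch only shows the idea of swapping in unrelated reasoning and counting how often the answer stays the same.

```python
from typing import Callable, Sequence

# Hypothetical interface: (image, question, rationale) -> answer string.
AnswerFn = Callable[[str, str, str], str]

def rationale_ignored_rate(answer_fn: AnswerFn, image: str, question: str,
                           rationale: str, unrelated: Sequence[str]) -> float:
    """Fraction of unrelated rationales that leave the answer unchanged."""
    baseline = answer_fn(image, question, rationale)
    unchanged = sum(answer_fn(image, question, r) == baseline for r in unrelated)
    return unchanged / len(unrelated)

# Toy stand-ins (assumptions, not real vision-language models).
def image_only_model(image: str, question: str, rationale: str) -> str:
    # Ignores the rationale entirely and answers from the "image" alone.
    return "red" if "apple" in image else "unknown"

def rationale_aware_model(image: str, question: str, rationale: str) -> str:
    # Answers only from what the rationale states.
    return "red" if "red" in rationale else "unknown"

unrelated = ["The train leaves at noon.", "Rice is a staple food."]
args = ("apple.jpg", "What color is the fruit?", "The fruit in the picture is red.")

print(rationale_ignored_rate(image_only_model, *args, unrelated))
print(rationale_ignored_rate(rationale_aware_model, *args, unrelated))
```

A rate near 1.0 suggests the answer does not depend on the reasoning content, which is the failure mode the article describes; a rate near 0.0 suggests the reasoning is actually being used.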
The Rationale-Enhanced Decoding proposed by NTT requires neither retraining the model nor additional datasets. During the inference phase, it separates the probability distribution conditioned on the visual input from the probability distribution conditioned on the reasoning basis, and then generates the final answer by decoding from a combination of the two, so that the output is constrained by both the image information and the reasoning basis.
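A minimal sketch of this kind of combined decoding, for a single next-token step, might look as follows. The article does not state the exact combination rule, so the weighted sum of log-probabilities used here (equivalent to multiplying the two distributions and renormalizing) is an assumption, as are the function names, the `weight` parameter, and the toy logits.

```python
import math

def log_softmax(logits):
    """Convert raw scores into normalized log-probabilities."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def combine_next_token(image_logits, rationale_logits, weight=1.0):
    """Assumed rule: log p(y | image) + weight * log p(y | rationale), renormalized."""
    lp_img = log_softmax(image_logits)
    lp_rat = log_softmax(rationale_logits)
    scores = [a + weight * b for a, b in zip(lp_img, lp_rat)]
    return log_softmax(scores)

def greedy_pick(log_probs):
    """Index of the most probable token."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)

# Toy three-token vocabulary and made-up logits from two forward passes:
# one conditioned on the image, one conditioned on the reasoning basis.
vocab = ["cat", "dog", "bird"]
image_logits = [2.0, 1.8, 0.1]      # image alone slightly prefers "cat"
rationale_logits = [0.2, 2.5, 0.1]  # reasoning basis strongly supports "dog"

combined = combine_next_token(image_logits, rationale_logits)
print(vocab[greedy_pick(log_softmax(image_logits))])
print(vocab[greedy_pick(combined)])
```

In this toy case the image-only distribution picks "cat", while the combined distribution picks "dog" because the reasoning basis now contributes to the score. With `weight=0` the combination falls back to the image-conditioned distribution, which is one way to see that no retraining is involved: only the decoding step changes.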
This "no retraining required" feature makes the technique easier to integrate into existing large vision-language models and enterprise AI systems. As AI agents take on tasks such as document understanding, video analysis, industrial inspection, customer service collaboration, risk control auditing, and business decision support, enterprises need models not only to provide answers but also to show whether those answers rest on a traceable and verifiable chain of evidence. If a traditional multimodal model provides only a superficial reasoning process, with no consistency constraint between the final answer and the reasoning basis, it becomes harder to allocate responsibility and control risk in critical AI scenarios. NTT's research moves explainability from "post-hoc justification" to "mandatory use of reasoning during the inference process." This is equally important for collaboration among AI agents: when multiple AI systems work together, a downstream agent needs to understand why the previous agent reached its judgment and continue the task based on the same evidence.
Follow-up research will focus on engineering integration and application validation. If Rationale-Enhanced Decoding can maintain stable performance across more multimodal models, more image understanding tasks, and enterprise-level agent systems, explainable AI will no longer be just an add-on for compliance or auditing but will become one of the foundational capabilities multimodal AI needs to enter production workflows. For the information and communication industry, such technologies also indicate that enterprise AI competition is extending from model scale and answer quality to reasoning consistency, explanation credibility, and the reliability of cross-system collaboration.
This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com









