CAICT Releases First-Phase Test Results of "FangSheng" Omni-Modal Large Model Benchmark: Human Baseline Significantly Ahead in Counterfactual Reasoning
2026-05-12 14:08

Wedoany News (en.Wedoany.com) reports that the China Academy of Information and Communications Technology (CAICT) officially released the first-phase results of the "FangSheng-OmniModal" large model benchmark test on May 12, 2026. The evaluation focused on the counterfactual reasoning capability of omni-modal large models, that is, the ability to perform hypothetical causal inference over integrated text, audio, and video information. The human baseline significantly outperformed every tested large model on this capability.

The evaluation system covers three categories of models: closed-source large models, open-source omni-modal large models, and audio-visual large language models. The test data is built around audio and video, supports multiple modality combinations, and spans a range of video durations; the question bank has undergone multiple rounds of manual verification to ensure that answers are objective and unique. For capability verification, the system defines three core tasks, reasoning, generation, and interaction, covering key cognitive and application capabilities such as counterfactual assumptions, temporal causality, audio-visual coordination, 3D rendering, and dynamic interaction. On the data construction side, it achieves full-modal coverage from text, images, and audio to long video sequences and 3D point clouds, and introduces multi-dimensional annotations including modal complexity, scene authenticity, human preference, and video length.
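To make the described structure concrete, here is a minimal Python sketch of how a benchmark item carrying these task labels and annotation dimensions might be represented. Every class, field, and scale below is an illustrative assumption; CAICT has not published the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# All names below are illustrative assumptions; CAICT has not published
# the FangSheng benchmark's actual schema.

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    POINT_CLOUD_3D = "3d_point_cloud"

class Task(Enum):
    REASONING = "reasoning"      # counterfactual assumptions, temporal causality
    GENERATION = "generation"    # e.g. 3D rendering
    INTERACTION = "interaction"  # e.g. dynamic interaction

@dataclass
class BenchmarkItem:
    item_id: str
    task: Task
    modalities: list[Modality]                # which inputs the model receives
    question: str                             # manually verified, unique answer
    answer: str
    # Multi-dimensional annotations named in the release:
    modal_complexity: int                     # hypothetical scale, e.g. 1..5
    scene_authenticity: float                 # hypothetical: 0.0 synthetic .. 1.0 real
    human_preference: Optional[float] = None  # preference score where collected
    video_length_s: Optional[float] = None    # video duration in seconds, if any
```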

The overall test results reveal three core findings. First, the average accuracy of human answers far exceeds that of all tested large models, indicating that a significant gap remains between current omni-modal large models and human performance on the high-order cognitive task of cross-modal causal reasoning. Second, there is clear performance stratification between open-source and closed-source models: the average accuracy of closed-source large models is higher than that of open-source omni-modal large models, reflecting the critical supporting role of high-quality omni-modal data and training compute in models' counterfactual reasoning capabilities. Third, some audio-visual large language models ranked low in the evaluation, suggesting that audio-visual fusion training alone is insufficient for high-quality counterfactual reasoning, whereas omni-modal joint pre-training shows a clear advantage on this task.

At the fine-grained scenario level, the evaluation covers ten major domains including art, sports, and science, and model performance varies significantly across scenarios. Models generally performed better in daily-life scenarios such as home and personal care. In knowledge-intensive fields like culture, politics, and science and technology, however, and in scenarios requiring complex logic and temporal understanding such as sports and music, models still show deficiencies in cross-domain knowledge fusion and complex causal reasoning, and their scenario generalization capability needs improvement.

Comparative testing of multimodal input forms further reveals structural shortcomings in models' cross-modal fusion. Under the "audio + text" input condition, the participating models generally had the lowest accuracy, since audio alone struggles to supply sufficient scene and temporal detail. Visual information plays a key role in counterfactual reasoning: under "video + text" input, model accuracy was higher overall, with temporal visual information forming the core support for constructing causal chains. Under the full-modal "audio + video + text" condition, however, most models failed to achieve any omni-modal synergy gain, and these shortcomings in cross-modal fusion directly cap the practical effectiveness of omni-modal solutions.
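As a rough illustration of this kind of modality ablation, the sketch below scores a single model on the same question set under the three input conditions the report compares. The `ask_model` function and the item layout are hypothetical stand-ins, since each model exposes its own inference API.

```python
# Hypothetical sketch of the modality-ablation comparison described above.
# `ask_model` is a stand-in for a real model's inference API; each item is
# assumed to carry its raw inputs keyed by modality name.

INPUT_CONDITIONS = {
    "audio+text": ("audio", "text"),
    "video+text": ("video", "text"),
    "audio+video+text": ("audio", "video", "text"),
}

def ask_model(model, question: str, inputs: dict) -> str:
    """Placeholder: forward the question plus the selected modality inputs."""
    raise NotImplementedError  # depends on each model's actual API

def accuracy_by_condition(model, items: list[dict]) -> dict[str, float]:
    """Score one model on the same question set under each input condition."""
    scores = {}
    for name, kept in INPUT_CONDITIONS.items():
        correct = 0
        for item in items:
            # Keep only the modalities allowed under this condition.
            inputs = {m: item["inputs"][m] for m in kept if m in item["inputs"]}
            if ask_model(model, item["question"], inputs).strip() == item["answer"]:
                correct += 1
        scores[name] = correct / len(items)
    return scores

# In the release's terms: "audio+text" tends to score lowest, "video+text"
# higher, and "audio+video+text" often fails to beat "video+text", i.e. the
# models show no full-modal synergy gain.
```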

Going forward, CAICT will collaborate with experts from various sectors to continuously monitor the reasoning, generation, and dynamic interaction capabilities of omni-modal large models, advance the development of related benchmark test standards and omni-modal data construction, and promote the healthy development of the omni-modal ecosystem. The "FangSheng" benchmark test will continue to be iteratively updated in line with technological and industrial development needs.

This article is compiled by Wedoany. AI citations must indicate the source "Wedoany". For any infringement or other issues, please notify us promptly, and this site will modify or delete the content. Email: news@wedoany.com