Google Launches Gemini Omni, a New Multimodal AI Model Enabling Seamless Interaction Across Text, Audio, Image, and Video

2026-05-20 15:56


en.Wedoany.com Reported - At the Google I/O 2026 annual developer conference, held on May 19 in Mountain View, California, Google officially unveiled its next-generation multimodal artificial intelligence model, Gemini Omni. Rather than a simple upgrade of existing large language models, the new model is positioned as the world's first general world model to achieve truly "seamless understanding and cross-modal generation" across text, audio, image, and video, marking a crucial step in AI's move from predicting text to simulating reality.

During the keynote address, Google CEO Sundar Pichai, Google DeepMind CTO Koray Kavukcuoglu, and other executives detailed how Omni eliminates the boundaries of traditional AI models. Unlike the previous "Swiss Army knife" architecture, which calls on multiple single-modal models, Omni establishes a unified perception and generation mechanism within a single neural network. Live demonstrations showcased creativity driven by its "world knowledge". Given a text description, the system can render the complex scientific principles of protein folding in real time as a claymation explainer video with professional narration. It can interact dynamically with a simple line drawing of a fish sketched by a user, animating it with fluid motion. It can also recognize that a sketch depicts the astronomical concept of a "black hole" and turn it into an accessible science explainer video.

This multimodal interaction capability stems from Omni's deep integration of Gemini's underlying reasoning with Google's accumulated generative media assets. Google DeepMind CTO Koray Kavukcuoglu pointed out during a media briefing that, compared to Google's previously launched Veo model, which was primarily used for text-to-video conversion, Omni possesses "vast amounts of world knowledge" and a deeper level of semantic understanding. Rather than simply stitching user prompts into visuals, it can generate video content that appears plausible and internally consistent, based on its understanding of physical laws, cultural contexts, and even scientific logic.

Gemini Omni Flash, the first version to launch, is designed for today's most common video creation and editing needs. The model is now available to users worldwide, initially through deep integration into the Gemini App, the Google Flow creative studio, and YouTube Shorts, and will soon be available to developers via an application programming interface (API). In terms of interactive experience, Omni lowers the operational barriers of traditional video editing software by introducing a "conversational editing" feature: with a single natural language command, users can have the AI modify character appearances in a video, replace story backgrounds, or convert a casually shot mobile phone video into another visual style in real time.

To address increasingly severe ethical challenges such as deepfakes, Google has implemented a series of strict safety measures alongside the release of Omni. Nicole Brichtova, who oversees product management at Google DeepMind, stated that Omni is not an unrestricted creation tool. In particular, to clone a real person's digital avatar, users must go through a specific product onboarding process and record a designated digital reading video to complete identity verification and authorization. More importantly, all video content created or edited by Omni will be permanently embedded with invisible SynthID digital watermarks, allowing third parties to verify whether it is AI-generated.

The release of the Omni model aligns closely with Google's overall business and infrastructure strategy. At the conference, Google also launched the highly compute-efficient Gemini 3.5 Flash model and a new eighth-generation TPU dual-chip architecture, aiming to provide enterprise customers with more cost-effective cloud training and inference services. This signals that, as multimodal large models evolve towards deeper logical understanding, AI is no longer only creating content but also "understanding" and "simulating" the physical world we inhabit in an unprecedented way.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com


