OpenAI Launches GPT-Realtime Series of Three Audio Models, Integrating GPT-5-Level Reasoning into Voice Interaction for the First Time
2026-05-13 14:22

en.Wedoany.com Reported - OpenAI has officially launched the GPT-Realtime series, comprising three real-time audio models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, all available to developers via the Realtime API. The three models integrate reasoning, translation, and transcription into a single API, taking voice interaction from single-turn Q&A to production-grade agents capable of tool calling and task execution.

GPT-Realtime-2 is the core model of the series and the first OpenAI model to bring GPT-5-level reasoning into voice interaction. Designed for real-time voice agents, it can perform complex reasoning, call external tools, handle mid-conversation interruptions and corrections, and maintain contextual coherence across long sessions. The context window has grown from the previous generation's 32K tokens to 128K, enough to support complex multi-turn task dialogues lasting over half an hour. The model offers five adjustable reasoning-intensity levels, letting developers trade response speed against reasoning depth based on task complexity. Parallel tool calling lets it access multiple backend systems, such as calendars, maps, and CRMs, simultaneously, executing operations while reporting progress to the user; a "Preambles" mechanism lets it naturally insert transitional phrases like "let me check on that," making the interaction feel closer to a real human conversation.
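As a rough illustration of how such an agent session might be configured, the sketch below builds a session-update event in the style of OpenAI's existing Realtime API. The model name follows the article; the `reasoning_effort` and tool fields are illustrative assumptions, not confirmed parameters.

```python
import json

# Hypothetical session configuration for a GPT-Realtime-2 voice agent.
# The "reasoning_effort" field and the check_calendar tool are illustrative
# assumptions based on the article's description, not confirmed API schema.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",       # model name as reported
        "reasoning_effort": "medium",    # one of five adjustable levels
        "tools": [
            {
                "type": "function",
                "name": "check_calendar",  # hypothetical backend tool
                "description": "Look up free slots on the user's calendar.",
                "parameters": {
                    "type": "object",
                    "properties": {"date": {"type": "string"}},
                    "required": ["date"],
                },
            },
        ],
        "tool_choice": "auto",  # let the model decide when to call tools
    },
}

payload = json.dumps(session_update)
```

In a real deployment this JSON would be sent over the Realtime API's WebSocket connection; here it only demonstrates the shape such a configuration could take.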

GPT-Realtime-Translate is a streaming simultaneous-interpretation engine. It supports over 70 input languages, with output limited to 13. Translation keeps pace with the speaker, beginning output before a sentence is complete, which keeps latency extremely low. GPT-Realtime-Whisper provides low-latency streaming transcription, producing text as the person speaks; it is suited to real-time captions, meeting minutes, and workflow updates, and it eliminates the waiting time inherent in traditional speech-to-text services.
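A minimal local sketch of how a client might consume such a streaming transcript: partial text arrives as delta events and is folded into a running caption. The event names (`transcript.delta`, `transcript.done`) are illustrative assumptions, not confirmed API events.

```python
# Fold streaming transcription deltas into a running caption string,
# as a GPT-Realtime-Whisper client conceptually would. The event stream
# below is simulated; no network connection is involved.
def accumulate_transcript(events):
    """Accumulate partial text chunks until the utterance is done."""
    caption = ""
    for event in events:
        if event["type"] == "transcript.delta":
            caption += event["text"]  # append each chunk as it arrives
        elif event["type"] == "transcript.done":
            break                     # final event closes the utterance
    return caption

# Simulated stream: text arrives piecewise, before the speaker finishes.
events = [
    {"type": "transcript.delta", "text": "Real-time "},
    {"type": "transcript.delta", "text": "captions, "},
    {"type": "transcript.delta", "text": "no waiting."},
    {"type": "transcript.done"},
]
caption = accumulate_transcript(events)
```

The same incremental pattern applies to the translation model: output tokens can be rendered the moment they arrive rather than after the speaker finishes a sentence.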

The three models are billed differently. GPT-Realtime-2 is metered by token: audio input costs $32 per million tokens, audio output $64, and cached input only $0.40. GPT-Realtime-Translate costs $0.034 per minute and GPT-Realtime-Whisper $0.017 per minute, both charged by usage duration. This structure pushes the per-minute cost of simultaneous interpretation low enough to make large-scale enterprise deployment economically viable.
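The pricing above translates into straightforward back-of-envelope math. The sketch below uses only the figures stated in the article; the example session sizes (a 30-minute meeting, a given token count) are illustrative assumptions.

```python
# Published rates from the article.
AUDIO_IN_PER_M = 32.00     # $ per 1M audio input tokens (GPT-Realtime-2)
AUDIO_OUT_PER_M = 64.00    # $ per 1M audio output tokens
CACHED_IN_PER_M = 0.40     # $ per 1M cached input tokens
TRANSLATE_PER_MIN = 0.034  # $ per minute (GPT-Realtime-Translate)
WHISPER_PER_MIN = 0.017    # $ per minute (GPT-Realtime-Whisper)

def realtime2_cost(in_tokens, out_tokens, cached_tokens=0):
    """Token-metered cost in dollars for a GPT-Realtime-2 session."""
    return (in_tokens * AUDIO_IN_PER_M
            + out_tokens * AUDIO_OUT_PER_M
            + cached_tokens * CACHED_IN_PER_M) / 1_000_000

# A hypothetical 30-minute meeting, transcribed and translated by duration:
whisper_cost = 30 * WHISPER_PER_MIN      # $0.51 for transcription
translate_cost = 30 * TRANSLATE_PER_MIN  # $1.02 for interpretation

# A hypothetical agent session with 100K input and 50K output tokens:
agent_cost = realtime2_cost(100_000, 50_000)  # $6.40
```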

US real estate information platform Zillow, online travel service provider Priceline, and Deutsche Telekom have already begun integration testing. Zillow used GPT-Realtime-2 to build a voice assistant that understands housing conditions and schedules viewings; in internal adversarial testing, the phone-task success rate rose from 69% to 95%, with more consistent compliance on anti-discrimination requirements. Priceline integrated the voice agent into long-chain services such as flight inquiries, hotel bookings, and itinerary changes, aiming to move voice interaction from "Q&A" to "transaction processing." Deutsche Telekom completed validation in scenarios involving complex plan consultations, troubleshooting, and bill explanations, confirming usability in telecom agent environments.

Benchmark scores climbed as well. GPT-Realtime-2 scored 15.2 percentage points higher than the previous generation on Big Bench Audio, an audio-intelligence test, and 13.8 points higher on Audio MultiChallenge, a multi-turn dialogue instruction-following test.

Looking at the iteration pace, OpenAI's trajectory in the voice domain is clear. In 2024, it first opened the low-latency capabilities of ChatGPT's advanced voice mode to developers; in August 2025, it released the first production-grade GPT-Realtime model; in February 2026, it launched GPT-Realtime-1.5; and now GPT-Realtime-2 officially moves this product line from an experimental feature into the foundational lineup of its enterprise-grade APIs.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com