en.Wedoany.com Reported - On April 23, OpenAI officially released its new flagship large language model, GPT-5.5, codenamed "Spud," along with a higher-spec GPT-5.5 Pro version. It is the first fully retrained foundation model since GPT-4.5 and is available now to Plus, Pro, Business, and Enterprise users of ChatGPT and Codex; API deployment will follow after additional safety evaluations. At the launch event, OpenAI co-founder and President Greg Brockman called it "the smartest and most intuitive model to date," describing its core breakthrough as a fundamental paradigm shift from "answering questions" to "autonomously completing tasks."
GPT-5.5 is positioned as a landmark product for the era of artificial intelligence agents. Brockman stated that what truly sets this model apart is that "it can accomplish more with less instruction, analyze an ambiguous problem, and accurately determine the next step to take, truly laying the groundwork for how computer work will be done in the future." OpenAI explicitly positions GPT-5.5 as a "new form of intelligence for real work and agent-driven tasks," with capability gains concentrated in four areas: agentic coding, computer use, knowledge work, and early-stage scientific research.
The most significant improvement in GPT-5.5 is coding capability. On Terminal-Bench 2.0, which measures complex command-line workflows, GPT-5.5 scores 82.7%, against 75.1% for GPT-5.4, 69.4% for Anthropic's Claude Opus 4.7, and 68.5% for Google's Gemini 3.1 Pro, a lead of more than 13 percentage points over its closest external competitor. On OpenAI's internal Expert-SWE evaluation, which focuses on long-cycle programming tasks with a human-estimated median completion time of 20 hours, GPT-5.5 scores 73.1%, up 4.6 percentage points from GPT-5.4's 68.5%. On the OSWorld-Verified computer-use benchmark, GPT-5.5 reaches 78.7%, edging out Claude Opus 4.7's 78.0%. On GDPval, which assesses knowledge work across 44 professions, GPT-5.5 scores 84.9%, ahead of Claude Opus 4.7's 80.3% and Gemini 3.1 Pro's 67.3%. On the Tau2-bench Telecom benchmark for customer-service workflows, it reaches 98.0%. Notably, GPT-5.5 posts all of these results while using fewer output tokens, making it both stronger and more efficient.
In terms of efficiency, GPT-5.5 maintains the same per-token latency as GPT-5.4 in real-world production environments while requiring significantly fewer tokens to complete the same Codex tasks. This efficiency leap is attributed to a deep collaboration with Nvidia: GPT-5.5 was co-optimized from the design phase with Nvidia's GB200 and GB300 NVL72 systems, with some heuristic algorithms written by the AI itself, yielding a token generation speed increase of over 20%. On Artificial Analysis' Coding Agent Index, GPT-5.5 tops the chart with a score of 60, leading Claude Opus 4.7 and Gemini 3.1 Pro Preview by 3 points each, and achieves this highest intelligence level at half the cost of its direct competitors.
GPT-5.5 API pricing is $5 per million input tokens and $30 per million output tokens, approximately double GPT-5.4's rates. GPT-5.5 Pro is priced at $30 per million input tokens and $180 per million output tokens, and the Pro version supports agentic usage only. OpenAI claims that output token usage has decreased by about 40%, so the net cost per actual task is only about 20% higher than with GPT-5.4. At equivalent intelligence, GPT-5.5 matches Claude Opus 4.7's composite score at a quarter of the operating cost. In ChatGPT and Codex, the context window supports 400K to 1M tokens. Codex introduces a fast mode, charging 2.5x the standard price for roughly 1.5x generation speed. Codex's weekly active users have reached 4 million, up 33% from 3 million two weeks ago; 85% of OpenAI's employees use Codex weekly.
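The net-cost claim can be sanity-checked with a quick calculation. The token counts below are illustrative assumptions, not figures from OpenAI, and GPT-5.4's rates are taken as half of GPT-5.5's per the "approximately double" comparison; the sketch only shows how doubled prices plus a 40% drop in output tokens can net out to roughly the quoted ~20% increase on an output-heavy task:

```python
# Illustrative cost-per-task comparison. Prices are $ per million tokens;
# the task's token counts are hypothetical (an output-heavy agentic task).

def task_cost(input_tokens, output_tokens, in_price, out_price):
    """Dollar cost of one task given per-million-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assumed GPT-5.4 rates: half of GPT-5.5's published $5 / $30.
old = task_cost(10_000, 50_000, in_price=2.50, out_price=15.0)
# GPT-5.5: same input, ~40% fewer output tokens, double the prices.
new = task_cost(10_000, 50_000 * 0.6, in_price=5.0, out_price=30.0)

print(f"GPT-5.4 task cost: ${old:.3f}")      # $0.775
print(f"GPT-5.5 task cost: ${new:.3f}")      # $0.950
print(f"net increase: {new / old - 1:.0%}")  # ~23%, near the quoted ~20%
```

Because output tokens dominate the bill on long agentic tasks, the doubled output price (2x) multiplied by the reduced usage (0.6x) drives the result toward a ~1.2x net cost, which is where the "about 20% higher" figure comes from.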
Cybersecurity capability is another focal point of industry interest. GPT-5.5 performs strongly on XBOW's real-world penetration-testing benchmark: GPT-5 missed 40% of known vulnerabilities, Claude Opus 4.6 reduced the miss rate to 18%, and GPT-5.5 compresses it further to 10%. Under pure black-box conditions, GPT-5.5 outperforms GPT-5 operating white-box with access to source code. Under its Preparedness Framework, OpenAI assesses GPT-5.5 at a "High" cybersecurity risk level and has deployed its most stringent cybersecurity classifiers and layered protections to date to intercept potential malicious use. That caution is a key reason the API launch trails the ChatGPT release.
The release of GPT-5.5 comes amidst intense competition among frontier AI labs; the model launches just six weeks after GPT-5.4. Although Anthropic's Claude Mythos Preview leads GPT-5.5 on most absolute benchmarks, Mythos adopts a strictly restricted release strategy, offering access to only about 40 institutions at $25 per million input tokens and $125 per million output tokens, roughly five times GPT-5.5's rates. GPT-5.5 instead opts for full access for paying users, differentiating itself from Anthropic with a strategy of "democratizing frontier capabilities."
In the field of scientific research, GPT-5.5 demonstrates significant gains. OpenAI's Chief Research Officer Mark Chen stated that the model has achieved a "meaningful breakthrough" in scientific and technical research workflows. On the GeneBench benchmark, which measures multi-stage genetic data analysis, GPT-5.5 scores 25.0%, and GPT-5.5 Pro reaches 33.2%, a substantial improvement from GPT-5.4's 19.0%. On FrontierMath Tier 4, currently the most challenging mathematics benchmark, GPT-5.5 achieves 35.4%, and GPT-5.5 Pro scores 39.6%, surpassing Claude Opus 4.7's 22.9%.
The enterprise market is emerging as the main battlefield for GPT-5.5. OpenAI disclosed that it has 9 million paying business users, over 900 million weekly active users, and more than 50 million paid subscribers. BNY Mellon's Chief Information Officer Leigh-Ann Russell stated that GPT-5.5 delivers a step-change improvement in response quality and resistance to hallucinations, with the bank currently testing GPT-5.5 across over 220 AI application scenarios. OpenAI's internal financial team used GPT-5.5 to review 24,771 K-1 tax forms, totaling 71,637 pages, completing the task two weeks earlier than last year. Brockman confirmed that GPT-5.5 will serve as the core engine for the "super app" OpenAI is building. The transition from a "conversational tool" to an "agentic engine" is redefining the boundaries of human-computer collaboration.
This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com