One Month After Release, Anthropic's Mythos Model Advances Again; UK AI Safety Institute Tests Show It Surpassing GPT-5.5

2026-05-15 15:19


en.Wedoany.com Reported - The UK AI Safety Institute (AISI) released its latest test results on May 14, showing that an updated checkpoint of Anthropic's frontier model, Claude Mythos Preview, has further strengthened its cybersecurity capabilities and, for the first time, fully completed two cyber range attack exercises. In previous AISI evaluations, GPT-5.5 slightly outperformed Mythos on expert-level tasks, with a 71.4% pass rate against Mythos's 68.6%. Following this update, however, Mythos succeeded in 6 of 10 attempts at a 32-step simulated enterprise intranet penetration task, significantly widening its lead over GPT-5.5.

Anthropic officially announced Mythos on April 7, 2026, positioning it as a new model tier above the Opus series. Internally codenamed "Capybara," it is the most powerful AI system Anthropic has built to date. Anthropic decided not to release the model to the public and instead provides controlled access to more than 40 critical infrastructure and cybersecurity partners through Project Glasswing, for defensive vulnerability discovery and remediation. About a month after the Mythos Preview release, AISI disclosed that it had received an updated model checkpoint that performs even better on cybersecurity tasks, completing the "Cooling Tower" industrial control system attack exercise for the first time, a task all previous models had failed.

AISI's testing is built around a "time horizon benchmark," which gauges the limits of a model's capability by estimating how long a human cybersecurity expert would need to complete a given task. Under this framework, Mythos succeeded in 6 of 10 attempts at "The Last Ones," a 32-step simulated enterprise intranet penetration task that covers the full attack chain from initial breach through lateral movement to capture of the final objective. AISI estimates that a human expert would need approximately 20 hours to complete an equivalent task. GPT-5.5 succeeded in 3 of 10 attempts on the same task. More significantly, Mythos completed the "Cooling Tower" exercise, a simulated attack on power plant control software, for the first time, succeeding in 3 of 10 attempts, a feat no previous model had achieved.
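Because each result above comes from only 10 attempts, it can help to see how much statistical uncertainty such small samples carry. The sketch below is an illustration only and not part of AISI's published method: it computes a standard 95% Wilson score interval for the reported success counts (6 of 10 and 3 of 10). The function name `wilson_interval` and the choice of a 95% level are assumptions made for this example.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Success counts as reported in the article for "The Last Ones" task.
for name, k in (("Mythos", 6), ("GPT-5.5", 3)):
    lo, hi = wilson_interval(k, 10)
    print(f"{name}: {k}/10 succeeded, 95% interval roughly {lo:.2f} to {hi:.2f}")
```

The intervals are wide at this sample size, so the per-task counts are best read alongside the other evidence AISI reports rather than as precise success probabilities.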

AISI published its test results for GPT-5.5 at the same time. GPT-5.5 achieved an average pass rate of 71.4% on AISI's expert-level cybersecurity tasks, slightly higher than the 68.6% scored by the previous Mythos version, putting the two at a similar level under the 2.5M token limit. However, in tests closer to real intrusion scenarios, such as multi-step attack simulations, Mythos stood out for its ability to carry long attack chains through coherently. AISI noted that GPT-5.5 and Mythos reached similar performance levels in its cybersecurity assessments and suggested that Mythos's cybersecurity capabilities are not a breakthrough specific to a single model but a byproduct of overall improvements in long-duration autonomy, reasoning, and coding.

AISI also updated its estimate of the doubling time for frontier models' cyber capabilities. In November 2025, the institute estimated that the length of cybersecurity tasks that models could complete doubled every 8 months; by February 2026, based on progress since reasoning models emerged at the end of 2024, it had shortened that estimate to 4.7 months. The measured performance of Mythos and GPT-5.5 now clearly exceeds the 4.7-month doubling trend line. AISI is currently uncertain whether this signals a new, steeper growth trend or merely a short-term jump.
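To make the two doubling estimates concrete, the sketch below converts a doubling time into a growth factor using the general relation factor = 2^(elapsed / doubling time). It is a minimal illustration under that assumption of steady exponential growth, not AISI's own calculation; the function name `growth_factor` and the 12-month comparison window are chosen here for the example.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiple by which task length grows over `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)

# Doubling times quoted in the article; the 12-month window is an assumption.
estimates = (("November 2025 estimate", 8.0), ("February 2026 estimate", 4.7))
for label, period in estimates:
    factor = growth_factor(12.0, period)
    print(f"{label} ({period} months): about {factor:.1f}x over 12 months")
```

Under the 8-month estimate, completable task length grows a little under threefold in a year; under the 4.7-month estimate it grows almost sixfold, which is why results above even the faster trend line drew AISI's attention.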

Logan Graham, who leads frontier red teaming at Anthropic, confirmed that the Mythos checkpoint used in this AISI test is the same version deployed under Project Glasswing, meaning the offensive and defensive capabilities observed externally belong to a production-grade model already in operation rather than a laboratory prototype. Mythos had already drawn wide attention for vulnerability discovery: Mozilla used it to find and fix 271 security vulnerabilities in Firefox. Anthropic disclosed in its system card that Mythos Preview helped identify thousands of high-risk zero-day vulnerabilities during testing, covering all major operating systems and browsers.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com
