Anthropic Adjusts Fable 5 Safety Measures to Make Downgrade Process Visible

2026-06-15 15:45

en.Wedoany.com Reported - Anthropic launched the Mythos model in April (as part of Project Glasswing, designed to discover and fix vulnerabilities in internet infrastructure), and subsequently released its restricted version, Fable 5. Anthropic explicitly stated that Fable will not support certain high-risk research directions in fields such as cybersecurity, biology, and chemistry. When requests in these areas arise, the model will automatically downgrade from Fable to Opus-level intelligence and inform the user that the downgrade is occurring.

The controversy centers on a different group of users: for researchers working on the design of very powerful chips or on cutting-edge large language models, the downgrade was not visible. Anthropic described this behavior in a 319-page system card, but the user interface gave no indication of it, and users simply received Opus-level output. Fortune described this behavior as "secret sabotage," and Wired reported that the practice could undermine AI research. Mythos and Glasswing are far more powerful than Anthropic's Claude Security tool, which is designed to run on Opus and can still scan codebases and help discover some issues.

Sally Vincent, Senior Threat Research Engineer at security analysis firm Exabeam, stated via email that claims about jailbreak resistance should be treated with caution, as these results "represent an assessment at a point in time," adding that "attackers will continuously adapt." Rob T. Lee, Chief AI Officer and Research Director at SANS Institute, said in an email to ZDNET that Fable 5 is "a novel and clever solution, but Fable 5 will be attacked. The same layer that prevents malicious use also hinders legitimate defensive research." He was downgraded to Opus 4.8 while attempting to build digital forensics skills, and believes that "whether it's a clever way to block malicious actors or not, it prevents those who will build the next generation of tools from gaining new defensive capabilities." He also noted that even under Glasswing, access was restricted and monitored, but in organizations with tens of thousands of employees, any one person could be incentivized to hand over access to criminal groups.

In response to the controversy, Anthropic said it would change Fable 5's safety measures to make them visible. Starting this week, flagged requests will visibly fall back to Opus 4.8, and flagged requests on the API will return a rejection reason. The company stated that current safety measures "cover a small number of narrow tasks, such as frontier-scale LLM data pipelines and kernel development for certain non-standard chips," and that these measures "prevent foreign adversaries from using our most powerful models in ways that pose serious security risks." Anthropic also said, "We made the wrong trade-off, and we apologize for not getting the balance right. Building these safety measures is a complex technical challenge: as we improve these classifiers to address new threats, users may encounter more false positives. We are working to reduce them as quickly as possible." On the choice between a visible and an invisible downgrade, the company said: "Hidden safety measures are harder to detect and bypass. This means safety measures can be more targeted." Those hidden safety measures were nonetheless discovered within hours.

Current usage shows that the classifier triggers on approximately 0.05% of tasks, affecting less than 0.05% of organizations. Anthropic stated that visible safety measures require casting a wider net to remain robust, which leads to more requests being incorrectly flagged, but "they do not affect the vast majority of coding and machine learning work." Ashley Casovan, Managing Director of the AI Governance Center at IAPP, praised Anthropic for holding Mythos back long enough to "set necessary guardrails in its software," while noting that "we have not yet seen the potential impact of these models when released at such a scale." Chris Boehm, Field CTO at network segmentation vendor Zero Networks, characterized the achievement as one of restraint rather than raw capability, with Anthropic "taming it enough to be safely released widely." The reward, he said, is scale: ordinary defenders can finally operate at the speed of attackers, "provided the safety measures hold up."
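
To put the reported trigger rate in perspective, the following is a minimal Python sketch of the arithmetic, not anything published by Anthropic. It assumes every task is flagged independently at the 0.05% rate quoted above, which is a simplification, and the function names are hypothetical.

```python
# Reported classifier trigger rate: roughly 0.05% of tasks.
TRIGGER_RATE = 0.0005


def expected_flagged(tasks: int, rate: float = TRIGGER_RATE) -> float:
    """Expected number of flagged tasks out of `tasks` submissions."""
    return tasks * rate


def prob_at_least_one_flag(tasks: int, rate: float = TRIGGER_RATE) -> float:
    """Chance that at least one task is flagged.

    Assumption: tasks are flagged independently of each other. Real
    workloads concentrated in the affected areas would be flagged far
    more often than this simple model suggests.
    """
    return 1.0 - (1.0 - rate) ** tasks


if __name__ == "__main__":
    for n in (100, 2_000, 10_000):
        print(n, expected_flagged(n), round(prob_at_least_one_flag(n), 3))
```

Under these assumptions, a team submitting 2,000 tasks would expect about one of them to be flagged. The averages say little about individual researchers, since flags are likely to cluster on the narrow task types Anthropic named.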

Regarding data retention policies, Anthropic will retain prompts and responses for Mythos-level models for 30 days, and retain prompts that violate policies for longer. This policy has already drawn attention from companies like Microsoft, which has restricted employee use and formed a legal team to evaluate the policy. Etay Maor, Vice President of Threat Intelligence at security vendor Cato Networks, believes that Fable 5's protections are strong enough against opportunistic hackers, but "well-funded and motivated attackers" will turn to other methods. He also noted that "when classifiers become too strict, false positives start to appear. The same controls designed to block malicious activity can also prevent legitimate users from using the model for proper purposes." He added, "From an enterprise perspective, the 30-day retention requirement is noteworthy. Organizations in regulated industries need to know exactly what data is retained and whether this complies with their compliance and legal requirements before using these models in sensitive environments."

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com
