Microsoft Open-Sources Enterprise AI Agent Evaluation Framework ASSERT
2026-06-12 11:57

en.Wedoany.com Reported - Microsoft recently open-sourced an AI evaluation framework that converts natural language requirements into executable tests, with the aim of strengthening enterprise AI governance. Named ASSERT (Adaptive Specification-Driven Scoring for Evaluation and Regression Testing), the framework automatically generates evaluation scenarios, datasets, metrics, and scorecards from written specifications, product requirements, and governance documents. In a blog post announcing the release, Microsoft stated that many organizations struggle to systematically verify agent behavior before deploying agents into production.

Agents can fail in subtle ways, such as deviating from established policies, producing unsafe outputs in edge cases, or behaving differently in production than in testing. Generic benchmarks miss these failures because they are not built around an organization's specific policies, agents, or use cases. ASSERT removes the need for developers to build evaluation suites by hand: it converts written intent into reusable tests that can be integrated into AI development workflows.

With ASSERT, Microsoft enters an increasingly competitive AI evaluation market. This market already includes platforms such as LangChain's LangSmith, Braintrust, Patronus AI, Galileo, Arize AI's Phoenix, and Promptfoo, which help enterprises benchmark, monitor, and validate large language model applications. The release comes as enterprises accelerate the deployment of AI agents while formal evaluation practices remain the exception rather than the rule. Anushree Verma, Senior Director Analyst at Gartner, noted that currently 99% of organizations do not evaluate any AI agents before production. She added that the industry's next competitive advantage will depend more on how effectively organizations simulate and stress-test AI agents before deployment than on advances in reasoning models. Gartner estimates that by 2029, over 75% of domain-specific agents in regulated industries that are not designed with agent simulation will fail to deliver value.

Forrester believes enterprises are shifting toward behavioral evaluation, but most have not yet made it a formal production requirement. Biswajeet Mahapatra, Principal Analyst at Forrester, stated that behavioral evaluation is still applied inconsistently instead of serving as a formal production gate. According to Forrester data, over 45% of organizations have already deployed AI agents and another 25% are in pilot phases, but many still struggle to scale because of immature governance and limited operational rigor.

Microsoft stated that ASSERT uses large language models as judges, and that in internal validation, model-generated evaluations achieved an 80% to 90% agreement rate with human reviewers. Mahapatra noted that this level of agreement is enough to automate most AI testing but is still insufficient as a standalone control for governance or compliance. In his view, enterprises should adopt layered oversight, letting AI evaluate AI at scale while humans retain supervisory responsibility for high-risk, regulated, or ambiguous scenarios. Buyers should also watch for bias, consistency issues, and the risk of over-relying on a single model that serves as both generator and evaluator.
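
To make the "agreement rate" figure concrete, here is a minimal sketch of how agreement between an LLM judge and human reviewers could be measured. It is not ASSERT's code or API: the function names, the "pass"/"fail" labels, and the sample data are all hypothetical, and the article does not say how Microsoft computed its figure. Cohen's kappa is included as a common complementary statistic because raw agreement can look high purely by chance when one label dominates.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on ten test cases (illustrative data only).
HUMAN = ["pass"] * 7 + ["fail"] * 3
JUDGE = ["pass"] * 6 + ["fail"] + ["fail"] * 2 + ["pass"]

rate = agreement_rate(JUDGE, HUMAN)
kappa = cohens_kappa(JUDGE, HUMAN)
print(f"raw agreement: {rate:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

On this sample the judge and the human agree on 8 of 10 cases (80%), yet kappa is only about 0.52, which illustrates why a headline agreement rate alone may not justify using an LLM judge as a standalone control.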

Microsoft released ASSERT under the MIT open-source license, allowing organizations to inspect, modify, and integrate the framework into existing AI development workflows. Mahapatra stated that open source reduces vendor lock-in risk and enables broad interoperability across model ecosystems, but does not fully eliminate trust or conflict-of-interest concerns, because the original vendor still influences how evaluation criteria, scoring logic, and definitions of acceptable behavior are encoded. He advised enterprises not to rely on a single evaluation framework, but to validate AI systems against multiple evaluation approaches and to retain ownership of their internal evaluation strategies.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com