Microsoft Releases Open-Source Framework ASSERT to Simplify AI Behavior Testing and Evaluation
2026-06-03 09:47

en.Wedoany.com Reported - Microsoft on Tuesday released the open-source framework ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), designed to simplify the testing and evaluation of AI application behaviors.

The framework uses AI to convert high-level natural-language descriptions of goals, strategies, or expected behaviors into executable, scorable test cases. It transforms these descriptions into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. ASSERT also records the path the AI system takes, including intermediate actions and tool calls, so developers can see where a failure occurs.
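The run-and-score part of that workflow can be illustrated with a small Python sketch. It is not ASSERT's API: every name below (BehaviorCase, Trace, toy_agent, run_suite) is hypothetical, the test cases are written by hand rather than generated from a description, and the target system is a toy stand-in. It only shows the general idea of running scenarios, recording the steps taken, and scoring them against a list of unacceptable behaviors.

```python
from dataclasses import dataclass, field

# Illustrative only: none of these names come from ASSERT itself.

@dataclass
class Trace:
    """Records the path a system took, including tool calls."""
    steps: list = field(default_factory=list)
    output: str = ""

@dataclass
class BehaviorCase:
    scenario: str        # prompt sent to the target system
    unacceptable: list   # step markers that must not appear in the trace

def toy_agent(prompt: str) -> Trace:
    """Stand-in target system; a real one would call a model and tools."""
    trace = Trace()
    trace.steps.append(f"receive:{prompt}")
    if "forward" in prompt:
        trace.steps.append("tool:send_email")
    trace.output = "done"
    return trace

def score(case: BehaviorCase, trace: Trace):
    """Pass when no recorded step matches an unacceptable behavior."""
    offending = [s for s in trace.steps
                 if any(u in s for u in case.unacceptable)]
    return (not offending, offending)

def run_suite(cases, system):
    """Run every case on the system and collect scored results."""
    results = []
    for case in cases:
        trace = system(case.scenario)
        passed, offending = score(case, trace)
        results.append({"scenario": case.scenario,
                        "passed": passed,
                        "offending": offending})
    return results

# Hand-written cases; in the workflow described above these would be generated.
cases = [
    BehaviorCase("summarize the report", ["tool:send_email"]),
    BehaviorCase("forward the report to a friend", ["tool:send_email"]),
]
results = run_suite(cases, toy_agent)
for r in results:
    print(r)
```

Keeping the offending steps alongside the pass/fail flag mirrors the point about recorded paths: a failing case points directly at the action that broke the rule.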

Developers can additionally provide system context, tools, and constraints to customize the evaluation coverage. For example, a developer can specify that a document research AI agent should not send emails to individuals outside the company, should restrict confidential information to C-level executives, and should provide concise summaries while considering prior context. ASSERT will use these rules to generate test cases and continuously check whether the system adheres to them.
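As a rough illustration of how one such rule could be checked against recorded tool calls, the sketch below flags emails addressed outside the company. It is an assumption-laden example, not how ASSERT implements the check: the ToolCall type, the send_email tool name, the "to" argument, and the example.com domain are all hypothetical.

```python
from dataclasses import dataclass

COMPANY_DOMAIN = "example.com"  # hypothetical company domain

@dataclass
class ToolCall:
    name: str   # hypothetical tool name, e.g. "send_email"
    args: dict  # hypothetical tool arguments

def external_recipients(calls, domain=COMPANY_DOMAIN):
    """Return recipients of send_email calls whose address is outside the domain."""
    violations = []
    for call in calls:
        if call.name != "send_email":
            continue
        for addr in call.args.get("to", []):
            if not addr.lower().endswith("@" + domain):
                violations.append(addr)
    return violations

# A recorded sequence of tool calls from one hypothetical test run.
trace = [
    ToolCall("search_docs", {"query": "Q3 forecast"}),
    ToolCall("send_email", {"to": ["ana@example.com", "bob@partner.org"]}),
]
violations = external_recipients(trace)
print(violations)
```

A rule like "restrict confidential information to C-level executives" would need more context (document labels, recipient roles) and is harder to reduce to a simple check, which is presumably where generated scenarios and model-based scoring come in.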

Microsoft stated that ASSERT fills a gap that broader, general evaluations cannot cover when AI model behaviors need to be shaped by the context, policies, and tools of an application or product. "One thing we've learned is that evaluation is absolutely critical to making the right decisions," said Sarah Bird, Chief Product Officer for Responsible AI at Microsoft. "Because without understanding the behavior of an AI system, it's hard to know if it meets an organization's standards... We've found that if you truly want a trustworthy system, you should evaluate more application-specific dimensions." Bird noted that ASSERT can be used for evaluation during system construction, after deployment, and even for continuous monitoring.

This release comes as the AI industry's evaluation capabilities are gradually improving. As model capabilities increase, researchers are focusing on reproducible testing and regression checks. Benchmarks such as Stanford's HELM and MLCommons' AILuminate, along with the evaluation team METR, have emerged to measure model behavior under different conditions.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com