Testing and Validation for AI Systems: More Than Accuracy

PM Takeaways

Testing an AI system is more than checking if it produces the right output. You also need to ask: does it perform fairly across different groups, handle unexpected inputs gracefully, resist deliberate manipulation, and produce decisions that can be explained? A system can pass traditional testing and still fail all of those.
NIST AI RMF MEASURE 2.1 calls for test sets, metrics, and TEVV tooling to be documented. Put acceptance thresholds in that documentation before testing begins, not after you see the results. If thresholds are set after scores are known, the entire validation process is compromised. Get sign-off on acceptance criteria during planning.
The people who validate the system should be different from the people who built it. NIST AI RMF specifically calls for independent verification and validation. For high-risk systems, plan for external review at the final go/no-go gate.
TEVV is not just a pre-deployment phase. NIST AI RMF maps testing and validation activities to design, data preparation, development, deployment, and ongoing operations. A test plan that only kicks in at the end of development misses most of what the framework describes.

Every software project includes testing. AI testing just looks very different. Traditional testing asks: does the system produce the correct output? AI testing needs to ask several more: Does it perform fairly across different groups of users? How does it handle unexpected inputs? Can it be deliberately manipulated into wrong answers? And will it still work reliably as the world changes around it?

NIST AI RMF uses the term TEVV — Test, Evaluation, Verification, and Validation — for this comprehensive picture. TEVV isn't a testing phase at the end of development. It's a continuous set of activities running from planning through post-deployment monitoring. Your job as PM isn't to run the tests — it's to make sure they're scoped, resourced, independent, and actually happening at every stage.

Why Traditional Testing Isn't Enough

Traditional software testing has one core question: given this input, does the system produce the right output? That question still applies to AI. But it leaves a set of critical failure modes completely untested — and those are exactly the failure modes that cause harm in real AI deployments.

What Traditional Testing Covers	What AI Systems Also Require
Functional requirements met — correct outputs for defined inputs	Performance across subpopulations — does accuracy hold for all groups, or only in aggregate?
Integration correctness — components working together as designed	Equitable outcomes — are error rates and outcome distributions fair across demographic groups?
Performance under load — response time and throughput at scale	Robustness to unexpected inputs — edge cases, out-of-distribution data, degraded inputs
Security vulnerabilities — known attack vectors and penetration testing	Adversarial resilience — deliberate manipulation by crafted inputs designed to cause failures
Regression — new code doesn't break existing functionality	Stability over time — does performance hold as the world changes and data drifts?
User acceptance — the system meets stated requirements	Alignment with legal and ethical requirements — bias, explainability, privacy, human oversight

A system can pass every traditional test and still discriminate against protected groups, break unpredictably on inputs slightly outside its training distribution, or produce decisions that can't be explained to the people they affect. NIST AI RMF MEASURE 2.1 is clear: the testing process itself must be documented, followed, and independently reviewed. Independent review isn't a procedural nicety; it catches the things that builders are structurally unlikely to catch themselves.

The TEVV Framework

NIST AI RMF breaks testing and assessment into four distinct activities. They aren't sequential phases — they run in parallel and recur throughout the lifecycle. The distinction matters because each one involves different people, different questions, and different documentation.

Activity	Core Question	Focus
Test	Does it work?	Executing the system against defined test cases to identify problems and measure performance
Evaluation	How well does it work?	Assessing performance against benchmarks, thresholds, and trustworthiness criteria, including socio-technical factors beyond the pipeline
Verification	Did we build it right?	Confirming the system meets its specifications — the technical requirements defined before development
Validation	Did we build the right thing?	Confirming the system meets real-world needs in its actual deployment context — not just the requirements as written

NIST AI RMF is specific: the people who validate the system should be different from those who tested it, and both should be different from those who built it. This isn't a procedural preference; it's how you prevent builders from (consciously or not) designing tests their own systems will pass.

TEVV Across the AI Lifecycle

The most common TEVV mistake is treating it as a pre-deployment gate — one test sprint before go-live and then done. NIST AI RMF assigns TEVV responsibilities to every phase of the lifecycle. A test plan that only activates in the final weeks before deployment misses most of what the framework requires.

Lifecycle Phase	TEVV Focus
Design and planning	Validate assumptions about data availability, quality, and representativeness. Verify that requirements capture trustworthiness characteristics — fairness, explainability, safety — not just functional behavior. Plan test approaches for each dimension before any model development begins.
Data preparation	Test data quality, completeness, and lineage. Evaluate representativeness across populations relevant to the use case. Identify proxy features that may encode protected characteristics. Document assumptions and limitations of the dataset for downstream TEVV actors.
Development (model building)	Validate model performance on held-out data not used in training. Evaluate fairness metrics disaggregated by relevant demographic groups. Conduct adversarial testing of model behavior. Verify that explanations accurately reflect model behavior rather than offering post-hoc rationalizations.
Deployment	System integration testing in the production environment. User acceptance testing with realistic scenarios involving actual users and affected parties. Independent review of test results before final go/no-go. Recalibration for integration, user experience, and compliance with legal, regulatory, and ethical specifications.
Operations (ongoing)	Continuous monitoring of performance metrics against baselines. Detection of drift and degradation. Incident tracking, analysis, and response. Periodic re-evaluation against original benchmarks. Recalibration by subject-matter experts (SMEs) as the deployment context evolves.

What to Test: Five Dimensions

1. Performance and Accuracy

The starting point — does the system produce correct outputs at acceptable rates? AI performance testing goes beyond a point-in-time accuracy score. You need to assess calibration and consistency across varying conditions.

Metric	What It Reveals
Accuracy, precision, recall, F1, AUC	Overall correctness against a labeled test set — the baseline performance measure
False positive and false negative rates	The cost of different error types, which varies significantly by use case — false negatives in medical screening carry different consequences than false positives in content moderation
Confidence calibration	Whether the model's stated confidence reflects its actual accuracy: across all predictions made with 90% confidence, about 90% should be correct. Miscalibrated confidence is a major source of over-reliance.
Performance on held-out data	Generalization beyond the training data. Benchmark performance does not always equal real-world performance, particularly when systems may have encountered test data during training (benchmark contamination).
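
To make the calibration row concrete, here is a minimal sketch of one common calibration measure, expected calibration error (ECE). It groups predictions into confidence bins and averages the gap between stated confidence and observed accuracy. The function name, the ten-bin default, and the sample data are illustrative assumptions, not something NIST prescribes.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Illustrative data: ten predictions, all stated at 90% confidence.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
```

In this toy data the first model is right nine times out of ten, so its gap is effectively zero. The second is right six times out of ten while claiming 90%, so its gap is about 0.3. What counts as an acceptable gap is a threshold decision for your TEVV plan.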

One thing PMs must own: acceptance thresholds need to be defined and signed off before testing begins, not set after you've seen the results. Thresholds defined retroactively to match actual scores aren't thresholds; they're rationalizations. NIST AI RMF MEASURE 2.1 calls for test metrics to be documented, and that documentation belongs in the TEVV plan from the start, not appended to it afterwards.
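
One lightweight way to make "signed off before testing" checkable is to record a fingerprint of the thresholds at sign-off and refuse to evaluate against anything else. The sketch below assumes higher-is-better metrics. The metric names, values, and function names are hypothetical.

```python
import hashlib
import json


def fingerprint(thresholds: dict) -> str:
    """Stable SHA-256 digest of a thresholds dictionary."""
    canonical = json.dumps(thresholds, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def evaluate(results: dict, thresholds: dict, signed_off_digest: str) -> dict:
    """Compare results to thresholds, rejecting thresholds changed after sign-off."""
    if fingerprint(thresholds) != signed_off_digest:
        raise ValueError("thresholds differ from the version signed off during planning")
    # Assumes every metric is higher-is-better; each threshold is a minimum.
    return {name: results[name] >= minimum for name, minimum in thresholds.items()}


# Planning: thresholds are agreed and their digest is recorded in the TEVV plan.
planned = {"recall": 0.85, "precision": 0.80}
digest_at_sign_off = fingerprint(planned)

# Testing: results arrive later and are judged against the unchanged thresholds.
observed = {"recall": 0.88, "precision": 0.78}
outcome = evaluate(observed, planned, digest_at_sign_off)

# Lowering a threshold after seeing the scores is rejected.
tampered = {"recall": 0.85, "precision": 0.75}
try:
    evaluate(observed, tampered, digest_at_sign_off)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

Here precision misses its threshold, and the attempt to lower that threshold afterwards raises an error instead of producing a pass. A signed document or a version-controlled file does the same job; the point is that the thresholds have a verifiable state that predates the results.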

2. Fairness and Bias

A system that looks good on overall metrics may be causing disproportionate harm to specific groups. Aggregate accuracy numbers hide this. Fairness testing has to be disaggregated — broken down by the actual populations the system will affect.

Metric	What It Reveals
Disaggregated performance by group	Whether accuracy, error rates, and confidence calibration hold across demographic groups, geographies, or other segments relevant to the use case
Demographic parity	Whether the AI produces positive outcomes at similar rates across groups, regardless of individual characteristics
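Equalized odds / equal opportunity	Whether error rates are consistent across groups. Equalized odds compares both true positive and false positive rates; equal opportunity compares true positive rates only. In many use cases this is a more demanding standard than demographic parity.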
Denigration and stereotyping prevalence	For generative AI, whether outputs exhibit systematic bias, harmful stereotyping, or denigrating content toward particular groups — NIST AI 600-1 MS-2.11-001 recommends specific benchmarks including Bias Benchmark Questions and Winogender Schemas

Go beyond single-variable analysis. Harms often concentrate at intersections: a system that looks fine on gender metrics and fine on age metrics may still systematically disadvantage people at the intersection of the two, such as older women. NIST AI 600-1 recommends direct engagement with potentially affected communities, not to be polite, but because they'll surface failure modes the technical team won't anticipate.
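
The metrics in the table above reduce to a few rates computed per group. The sketch below computes selection rate (for demographic parity) and true and false positive rates (for equalized odds) per group, where a group can be an intersection of attributes. The record layout, group labels, and data are invented for illustration.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class Record(NamedTuple):
    group: tuple     # one or more attributes; two attributes make an intersection
    actual: bool     # ground-truth outcome
    predicted: bool  # model decision


def rates_by_group(records: Iterable[Record]) -> dict:
    """Selection rate, true positive rate, and false positive rate per group."""
    counts = defaultdict(lambda: dict(n=0, selected=0, pos=0, tp=0, neg=0, fp=0))
    for r in records:
        c = counts[r.group]
        c["n"] += 1
        c["selected"] += int(r.predicted)
        if r.actual:
            c["pos"] += 1
            c["tp"] += int(r.predicted)
        else:
            c["neg"] += 1
            c["fp"] += int(r.predicted)

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        return numerator / denominator if denominator else None

    return {
        group: {
            "selection_rate": ratio(c["selected"], c["n"]),
            "tpr": ratio(c["tp"], c["pos"]),
            "fpr": ratio(c["fp"], c["neg"]),
        }
        for group, c in counts.items()
    }


def largest_gap(rates: dict, metric: str) -> float:
    """Difference between the best and worst group on one metric."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values)


def cases(group: tuple, actual: bool, predicted: bool, count: int) -> list:
    return [Record(group, actual, predicted)] * count


# Invented data: three intersections behave alike, one does not.
records = []
for group in [("women", "under 60"), ("men", "under 60"), ("men", "60+")]:
    records += cases(group, True, True, 3) + cases(group, True, False, 1)
    records += cases(group, False, True, 1) + cases(group, False, False, 3)
group = ("women", "60+")
records += cases(group, True, True, 1) + cases(group, True, False, 3)
records += cases(group, False, True, 1) + cases(group, False, False, 3)

rates = rates_by_group(records)
tpr_gap = largest_gap(rates, "tpr")                # equal opportunity gap
fpr_gap = largest_gap(rates, "fpr")                # second half of equalized odds
parity_gap = largest_gap(rates, "selection_rate")  # demographic parity gap
```

In this data the false positive rate is identical everywhere, but qualified older women are approved far less often than any other group. Which gap size is acceptable is, again, a threshold to agree on before testing.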

3. Robustness

What happens when the system sees something it wasn't trained on? Does it fail gracefully, or fail dangerously? For high-risk AI, this matters a lot — production environments rarely look exactly like training environments.

Test Type	What It Surfaces
Edge case testing	Unusual but legitimate inputs — values at the extreme ends of distributions, rare but valid combinations of features, inputs that represent underrepresented populations in the training data
Out-of-distribution testing	Inputs that differ from the training distribution in ways the model has not seen — how does the model behave, and does it signal appropriate uncertainty rather than producing confident but wrong outputs?
Noisy and degraded inputs	Incomplete data, missing features, corrupted inputs, or lower-quality data than the training set — common in real-world deployments where data quality is uncontrolled
Distribution shift simulation	Testing performance on data from a different time period, geography, or population than the training data, to assess how quickly performance degrades as deployment conditions diverge from training conditions
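
As a rough illustration of the "noisy and degraded inputs" row, the sketch below blanks out input features at increasing rates and records how accuracy changes. The model is a hypothetical stand-in (a simple lending rule), and the feature names and cases are invented. A real test would call the system under test instead.

```python
import random
from typing import Callable, List, Sequence, Tuple

Case = Tuple[dict, bool]  # (input features, expected decision)


def toy_model(features: dict) -> bool:
    """Hypothetical stand-in: approve when income is at least 3x the amount."""
    income = features.get("income")
    amount = features.get("amount")
    if income is None or amount is None:
        return False  # silently declines when data is missing
    return income >= 3 * amount


def drop_features(features: dict, rate: float, rng: random.Random) -> dict:
    """Replace each feature with None with probability `rate`."""
    return {k: (None if rng.random() < rate else v) for k, v in features.items()}


def accuracy(model: Callable[[dict], bool], cases: Sequence[Case]) -> float:
    return sum(model(x) == expected for x, expected in cases) / len(cases)


def degradation_curve(
    model: Callable[[dict], bool],
    cases: Sequence[Case],
    rates: Sequence[float],
    seed: int = 0,
) -> dict:
    """Accuracy at each missing-data rate, using a seeded generator for repeatability."""
    rng = random.Random(seed)
    curve = {}
    for rate in rates:
        degraded: List[Case] = [(drop_features(x, rate, rng), y) for x, y in cases]
        curve[rate] = accuracy(model, degraded)
    return curve


raw = [(30, 10), (20, 10), (90, 20), (10, 5), (50, 10), (40, 20)]
cases = [({"income": i, "amount": a}, i >= 3 * a) for i, a in raw]
curve = degradation_curve(toy_model, cases, rates=[0.0, 0.3, 1.0])
```

The useful output is the shape of the curve: how quickly accuracy falls as data quality drops, and whether the system fails quietly (as this toy model does, by declining) or signals that it cannot decide.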

4. Safety and Security

Can the system be deliberately misled? Can its safety guardrails be bypassed? AI safety and security testing is genuinely different from conventional cybersecurity — it requires different techniques and usually different expertise.

Test Type	What It Surfaces
Adversarial input testing	Inputs deliberately crafted to cause the model to fail — images with imperceptible perturbations that cause misclassification, text prompts designed to elicit harmful outputs, inputs that exploit model architecture weaknesses
Prompt injection testing	For systems accepting natural language, whether malicious instructions embedded in user input can override system instructions, bypass guardrails, or extract protected information
Guardrail and safety mechanism testing	Whether safety controls actually prevent the failure modes they are designed to prevent — including whether they can be bypassed through creative reformulation of harmful requests
Red-teaming	Structured adversarial exercises conducted by people independent of the build team. NIST AI 600-1 defines AI red-teaming as exercises to identify potential adverse behavior, stress test safeguards, and find failure modes that standard testing misses. Team diversity — demographic, disciplinary, and domain expertise — directly affects the quality of findings.

Red-teaming results are inputs to risk management, not stand-alone verdicts. They tell you where to look and what questions to ask — they require interpretation before they can drive governance decisions.

5. Explainability

Can the system's decisions be understood and justified — by the people who operate it, the people affected by it, and the regulators who oversee it? Explainability requirements vary by context. Testing has to be designed for the actual audiences who need to understand the system's outputs.

Test Focus	Questions to Answer
Explanation accuracy	Do explanations accurately reflect the model's actual decision process, or are they post-hoc rationalizations that look plausible but do not correspond to what the model computed?
Explanation comprehensibility	Are explanations comprehensible to the specific audiences who need them — a technical operator, a non-technical user, an affected individual, a regulator? The same explanation may be appropriate for one audience and meaningless to another.
Explanation consistency	Does the system produce similar explanations for similar decisions? Inconsistent explanations for similar cases undermine both trust and the utility of explanations for error detection.
Regulatory adequacy	Do explanations meet the specific explainability requirements that apply in the deployment jurisdiction and sector — including the EU AI Act's transparency requirements for high-risk AI and sector-specific obligations?
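
Explanation consistency can be spot-checked in a simple way: for two similar cases, compare which features each explanation ranks as most important. The sketch below uses the overlap of the top-ranked features. It assumes explanations arrive as feature-to-weight mappings, and the feature names and weights are invented. It is one possible check, not an established standard.

```python
def top_features(attributions: dict, k: int = 3) -> set:
    """Names of the k features with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda name: abs(attributions[name]), reverse=True)
    return set(ranked[:k])


def explanation_overlap(a: dict, b: dict, k: int = 3) -> float:
    """Jaccard overlap of the top-k features in two explanations (1.0 = identical sets)."""
    top_a, top_b = top_features(a, k), top_features(b, k)
    return len(top_a & top_b) / len(top_a | top_b)


# Invented explanations for two near-identical applications.
case_a = {"income": 0.40, "debt_ratio": -0.30, "tenure": 0.10, "postcode": 0.02}
case_b = {"income": 0.05, "debt_ratio": -0.28, "tenure": 0.02, "postcode": 0.35}

same_case = explanation_overlap(case_a, case_a, k=2)
similar_cases = explanation_overlap(case_a, case_b, k=2)
```

A low overlap for cases that a domain expert considers equivalent is a prompt for investigation. It does not say which explanation is wrong, only that the two disagree.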

Testing Approaches

Different methods surface different problems. Good TEVV uses a combination of approaches matched to the system's risk level and deployment context.

Approach	How It Works	Strengths and Limitations
Benchmark testing	Compare system performance against standard datasets or established baselines used across the field	Reproducible and comparable across systems. Risk of benchmark contamination — the system may have encountered test data during training. Benchmark performance does not always equal real-world performance in the deployment context.
Field testing	Evaluate system behavior in realistic deployment conditions with actual users or representative populations	Reveals issues that controlled testing misses — real-world data quality, user behavior, and deployment context. More resource-intensive. May expose real users to risk if not carefully staged. Requires informed consent and human subjects protections.
AI red-teaming	Structured adversarial exercises by people independent of the build team, aiming to find failure modes and vulnerabilities that standard testing does not surface	Identifies failure modes that designed tests cannot find. Quality depends heavily on team diversity, domain expertise, and depth of effort. Resource-intensive. Findings require interpretation before incorporation into governance decisions.
Structured public feedback	Gather input from users, affected communities, and domain experts through focus groups, surveys, community advisory boards, and feedback channels	Surfaces failure modes and contextual impacts that technical testing cannot detect. Builds stakeholder trust. Requires careful design to be actionable and representative. Not a substitute for technical evaluation.

Common Failure Modes

Knowing how AI systems typically fail helps you focus testing effort on what matters most for your deployment.

Failure Mode	What It Means	Testing Response
Brittleness	System fails on inputs slightly different from training data — small perturbations, slightly different formatting, or edge-of-distribution values cause disproportionate performance drops	Edge case testing, out-of-distribution testing, input perturbation analysis
Embedded bias	System learns and perpetuates biases present in training data, producing systematically different outcomes for different demographic groups even when those differences are not justified by the task	Disaggregated fairness testing, bias audits using standardized benchmarks, intersectional analysis, engagement with affected communities
Catastrophic forgetting	After retraining on new data, the system loses capabilities it had on the original training distribution — a silent regression that aggregate metrics may not detect	Regression testing against the full original test suite after every retraining cycle, not just testing on new data
Uncertainty blindness	System produces high-confidence outputs even when inputs are ambiguous, out of distribution, or when the model should not be confident — miscalibration that drives over-reliance on AI recommendations	Confidence calibration testing; evaluation of model behavior on inputs the model should recognize as uncertain
Adversarial vulnerability	System can be deliberately manipulated by carefully crafted inputs designed to exploit weaknesses in the model architecture or training data	Adversarial input testing, prompt injection testing for language models, red-teaming with adversarial intent
Distribution shift	Real-world production data differs from training data in ways that erode performance gradually and without obvious failure signals	Field testing, pre-deployment distribution comparison, post-deployment drift monitoring with defined action thresholds
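
For the distribution shift row, "drift monitoring with defined action thresholds" can be as simple as comparing how a feature is distributed in production against the baseline. The sketch below uses the population stability index (PSI), one widely used measure. The bin edges, data, and the 0.25 action threshold are illustrative assumptions that belong in your own TEVV plan.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
) -> float:
    """PSI between two samples, using shared bin edges."""

    def shares(values: Sequence[float]) -> list:
        counts = [0] * (len(edges) + 1)
        for v in values:
            index = sum(v >= e for e in edges)  # number of edges at or below v
            counts[index] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = shares(baseline), shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


ACTION_THRESHOLD = 0.25  # hypothetical; agree your own value during planning

edges = [20, 40]
baseline = [10] * 50 + [30] * 30 + [50] * 20  # training-time distribution
unchanged = list(baseline)
shifted = [10] * 20 + [30] * 30 + [50] * 50   # production has moved upward

psi_unchanged = population_stability_index(baseline, unchanged, edges)
psi_shifted = population_stability_index(baseline, shifted, edges)
needs_review = psi_shifted > ACTION_THRESHOLD
```

PSI only shows that inputs have moved. Whether performance has eroded still needs labeled outcomes, which is why the table pairs drift monitoring with field testing.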

Independence and Separation

NIST AI RMF MEASURE 2.1 is explicit: independent review improves testing effectiveness and mitigates the biases that arise when builders test their own systems. This isn't a procedural formality; it reflects how testing actually fails in practice.

Builders have real blind spots. They design tests to confirm what they expect, not to discover what they didn't anticipate. Organizational pressure to ship creates incentives — often unconscious — to define acceptance criteria the system will meet. These are structural pressures, not personal failures. Independence is the structural countermeasure.

Independence Level	What It Means in Practice
Testers separate from developers	The minimum baseline — the team conducting TEVV should not have been involved in building the system components they are evaluating. This applies even within the same organization.
Independent internal review	A function within the organization — risk, compliance, legal, or a dedicated AI governance team — conducts or reviews TEVV results independent of the project team. Applicable to medium- and higher-risk systems.
External audit or assessment	Third parties with no commercial interest in the system's deployment conduct or independently verify TEVV results. Required for high-risk AI systems under the EU AI Act in some sectors, and recommended by NIST for systems with significant potential for harm.
Structured challenge process	NIST AI RMF describes "effective challenge" — a culture that encourages critical thinking and questioning of important design and implementation decisions by experts with the authority and stature to act on what they find. This applies throughout the lifecycle, not only at formal test gates.

Deployment Strategies as Testing Mechanisms

Testing doesn't stop at the go/no-go decision. How you deploy the system affects your ability to catch problems early and limit the damage from failures that pre-deployment testing didn't surface.

Strategy	How It Functions as Testing
Phased rollout	Deploy to a limited user population first and expand gradually based on observed performance. Allows real-world performance validation at limited scale before full exposure. Appropriate for most AI deployments.
Shadow mode	The AI system runs in parallel with existing processes but does not affect outcomes — its outputs are logged and evaluated, but decisions are made by the current process. Allows observation of real-world AI behavior without risk to current outcomes. Particularly useful before autonomous operation begins.
A/B testing	Run the new AI system alongside the existing process or a baseline model, routing different users to each. Allows direct performance comparison under identical conditions. Requires careful design to avoid differential harm — if the AI system may produce worse outcomes for some users, random assignment raises ethical issues.
Canary deployment	Route a small, controlled percentage of live traffic to the new system while maintaining the existing system for the remainder. Allows detection of production issues before full rollout. Appropriate when gradual validation of production behavior is needed before committing to full deployment.
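
Shadow mode from the table above can be sketched in a few lines: the existing process makes the real decision, the candidate system runs on the same case, and both outputs are logged for later comparison. The class name, decision rules, and credit-score cases are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ShadowRunner:
    current_process: Callable[[dict], str]  # makes the real decision
    candidate_model: Callable[[dict], str]  # observed only, never acted on
    log: List[dict] = field(default_factory=list)

    def decide(self, case: dict) -> str:
        decision = self.current_process(case)
        try:
            shadow = self.candidate_model(case)
        except Exception as error:  # a candidate failure must not affect the outcome
            shadow = f"error: {error}"
        self.log.append({"case": case, "decision": decision, "shadow": shadow})
        return decision

    def agreement_rate(self) -> float:
        if not self.log:
            raise ValueError("no cases logged yet")
        matches = sum(entry["decision"] == entry["shadow"] for entry in self.log)
        return matches / len(self.log)


def existing_rule(case: dict) -> str:
    return "approve" if case["score"] >= 600 else "decline"


def new_model(case: dict) -> str:
    return "approve" if case["score"] >= 650 else "decline"


runner = ShadowRunner(existing_rule, new_model)
decisions = [runner.decide({"score": s}) for s in (580, 620, 700, 640)]
agreement = runner.agreement_rate()
```

Disagreements are the interesting part of the log. Each one is a case to review before the candidate is allowed to affect outcomes, and the agreement rate is a natural candidate for an expansion criterion.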

Documentation Requirements

NIST AI RMF MEASURE 2.1 calls for test sets, metrics, and tooling to be documented. Undocumented testing isn't governance; it's just activity. Documentation is what makes testing repeatable, reviewable, and defensible when it matters.

For each TEVV activity, document at minimum:

Test plan: what will be tested, by whom, using what methods and data, against what acceptance thresholds — completed before testing begins.
Test data: sources, composition, representativeness, and any known limitations or gaps relative to the production population.
Test results: metrics and whether they met, fell short of, or exceeded defined thresholds — including results that were unfavorable.
Limitations: conditions under which results may not generalize, such as populations, geographies, use cases, or conditions not covered by the test dataset.
Residual risks: known issues identified through TEVV that were accepted, along with the rationale and the mitigations in place.
Decisions informed: what go/no-go decisions, design changes, monitoring requirements, or additional testing the TEVV results triggered.
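
The six items above can be treated as a record with a completeness check, so that a gap is visible before a gate review rather than during it. This is a minimal sketch; the class and field names mirror the list but are otherwise an assumption, and the example text is invented.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class TevvRecord:
    activity: str
    test_plan: str = ""
    test_data: str = ""
    test_results: str = ""
    limitations: str = ""
    residual_risks: str = ""
    decisions_informed: str = ""

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in the order listed above."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


record = TevvRecord(
    activity="Fairness evaluation, release candidate",
    test_plan="Disaggregated error rates by age band; thresholds signed off in planning.",
    test_data="Held-out sample; under-represents applicants over 70.",
    test_results="True positive rate gap exceeded the agreed threshold.",
)
gaps = record.missing_sections()
```

Here the results are recorded but nothing says what they limit, what risk was accepted, or what decision they drove, which is exactly the pattern that turns testing into activity rather than governance.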

Your Responsibilities, Phase by Phase

Planning Phase

Define TEVV activities in scope, schedule, and budget — including budget for independent testing that may require external expertise.
Allocate resources for each dimension of TEVV (performance, fairness, robustness, safety/security, and explainability), not only functional correctness.
Define acceptance criteria for each trustworthiness dimension before development begins — thresholds set after results are known are not acceptance criteria.
Identify who will conduct independent review and at what gates — independence must be structural, not asserted.
Determine whether red-teaming is appropriate for the system's risk level and, if so, define the composition and scope of the red team.

Development Phase

Monitor TEVV progress alongside development milestones — TEVV is concurrent with development, not a gate at the end of it.
Ensure TEVV findings are addressed and documented, not merely recorded. An unfavorable finding that leads to neither a design change nor a formally accepted residual risk is a governance gap.
Escalate findings that affect go/no-go decisions to the appropriate stakeholders before, not after, the deployment decision is made.
Confirm that test data is documented, including its composition, representativeness, and limitations.

Deployment Phase

Verify that all planned TEVV activities are complete before the go/no-go decision. Any testing left incomplete must be documented as a residual risk, not implicitly accepted.
Confirm that residual risks identified through TEVV are formally accepted by the appropriate decision-maker, not informally deferred.
Confirm that post-deployment monitoring is in place to detect drift and performance degradation — the transition from TEVV to ongoing monitoring must be explicit.
If using phased rollout or shadow mode, define the criteria that must be met before expanding deployment scope.

Operations Phase

Track ongoing performance against the baselines established during pre-deployment TEVV.
Trigger re-testing when conditions change materially: new user populations, significant data drift, regulatory changes, or identified incidents.
Include TEVV in change management processes — model updates, retraining, and configuration changes each require regression testing before production deployment.
Conduct periodic re-evaluation of the system against its original benchmarks and the current regulatory and ethical standards that apply to it.
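
Regression testing after retraining is easier to enforce when results are compared slice by slice rather than as one aggregate number. The sketch below flags any slice of the test suite where the retrained model scores worse than the previous one by more than a tolerance. The slice names, scores, and the 0.01 tolerance are hypothetical.

```python
def regressions(previous: dict, retrained: dict, tolerance: float = 0.01) -> dict:
    """Slices where the retrained model is worse than before by more than the tolerance."""
    missing = set(previous) - set(retrained)
    if missing:
        raise ValueError(f"retrained model was not evaluated on: {sorted(missing)}")
    return {
        name: round(retrained[name] - previous[name], 4)
        for name in previous
        if previous[name] - retrained[name] > tolerance
    }


# Invented accuracy scores per slice of the test suite.
previous = {"original_test_suite": 0.91, "new_data": 0.80, "rare_cases": 0.84}
retrained = {"original_test_suite": 0.905, "new_data": 0.95, "rare_cases": 0.71}

flagged = regressions(previous, retrained)

# Skipping part of the original suite is treated as an error, not a pass.
try:
    regressions(previous, {"new_data": 0.95})
    skipped_suite_detected = False
except ValueError:
    skipped_suite_detected = True
```

The retrained model improves on new data and holds steady on the original suite, yet loses ground on rare cases. A check like this makes that loss a blocking finding in change management instead of something discovered in production.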

Right-Sizing This for Your Situation

Testing depth should match system risk. A low-risk internal tool doesn't need the same TEVV program as a high-risk system making consequential decisions about individuals. Not every system needs external red-teaming — but every system needs documented validation of performance and fitness for purpose in its actual deployment context.

Greenfield

You don't have a formal TEVV process yet. Start with the basics: document your acceptance criteria for each of the five testing dimensions before development starts — even rough thresholds are better than none. Separate the people doing testing from the people doing building, even if it means swapping team members across workstreams. Before go-live, run at least one scenario with out-of-distribution inputs for each high-stakes decision the system makes. Document what you found and what you decided. That's your minimum viable TEVV record.

Emerging

You're building repeatable testing capability. Formalize your threshold-setting process so acceptance criteria are always defined before testing begins — make it a required input to development kickoff. Build a test planning template that covers all five dimensions and forces explicit decisions about independence level for each. Use deployment strategies like phased rollout or shadow mode as post-deployment validation tools, not just risk management tools. Track findings across deployments to identify recurring failure patterns.

Established

TEVV needs to integrate with your existing quality management, compliance, and audit infrastructure. Map your EU AI Act Article 9 and Annex IV obligations to specific TEVV activities and documentation requirements — the technical documentation the Act requires for high-risk AI is essentially your TEVV record. Build external review into your governance calendar for high-risk systems, not just when a regulator asks. Use TEVV data across your AI portfolio to identify systemic risk patterns that no individual project will see.


Framework References

NIST AI RMF 1.0: MEASURE 2.1 (TEVV documentation and independence requirements), MEASURE 2.6 (demonstrating safety before deployment), MEASURE 2.9 (explainability validation).
NIST AI 600-1 GenAI Profile (2024): MS-2.11-001 and MS-2.11-002. Fairness assessments, demographic subgroup testing, red-teaming, and benchmark selection for generative AI.
EU AI Act (2024): Article 9 (risk management and testing for high-risk AI throughout lifecycle), Article 10 (training and test data governance), Article 15 (accuracy, robustness, and cybersecurity requirements), Annex IV (technical documentation including test results).
Stanford HAI: Validating Claims About AI (2024). Practical framework for assessing benchmark validity and the risks of benchmark contamination.
PMI CPMAI Guide (2025): Phases III, IV, V. Acceptance criteria and test strategy, model validation including fairness and adversarial testing, pre-deployment TEVV gate and user acceptance testing.

This article is part of AIPMO’s PM Practice series. See also: AI Risk Registers | Monitoring AI Systems in Production | Human Oversight in AI Systems

To err is AI; to govern, human.

AIPMO.co · AI Governance, PM-first
