- Testing an AI system is more than checking if it produces the right output. You also need to ask: does it perform fairly across different groups, handle unexpected inputs gracefully, resist deliberate manipulation, and produce decisions that can be explained? A system can pass traditional testing and still fail all of those.
- NIST AI RMF MEASURE 2.1 calls for test sets, metrics, and TEVV tooling to be documented. Acceptance thresholds belong in that documentation before testing begins, not after you see the results. If thresholds are set after scores are known, the entire validation process is compromised. Get sign-off on acceptance criteria during planning.
- The people who validate the system should be different from the people who built it. NIST AI RMF specifically calls for independent verification and validation. For high-risk systems, plan for external review at the final go/no-go gate.
- TEVV is not only a pre-deployment phase. NIST AI RMF maps testing and validation activities to design, data preparation, development, deployment, and ongoing operations. A test plan that only kicks in at the end of development misses most of what the framework requires.
Every software project includes testing. AI testing just looks very different. Traditional testing asks: does the system produce the correct output? AI testing needs to ask several more: Does it perform fairly across different groups of users? How does it handle unexpected inputs? Can it be deliberately manipulated into wrong answers? And will it still work reliably as the world changes around it?
NIST AI RMF uses the term TEVV — Test, Evaluation, Verification, and Validation — for this comprehensive picture. TEVV isn't a testing phase at the end of development. It's a continuous set of activities running from planning through post-deployment monitoring. Your job as PM isn't to run the tests — it's to make sure they're scoped, resourced, independent, and actually happening at every stage.
Why Traditional Testing Isn't Enough
Traditional software testing has one core question: given this input, does the system produce the right output? That question still applies to AI. But it leaves a set of critical failure modes completely untested — and those are exactly the failure modes that cause harm in real AI deployments.
| What Traditional Testing Covers | What AI Systems Also Require |
|---|---|
| Functional requirements met — correct outputs for defined inputs | Performance across subpopulations — does accuracy hold for all groups, or only in aggregate? |
| Integration correctness — components working together as designed | Equitable outcomes — are error rates and outcome distributions fair across demographic groups? |
| Performance under load — response time and throughput at scale | Robustness to unexpected inputs — edge cases, out-of-distribution data, degraded inputs |
| Security vulnerabilities — known attack vectors and penetration testing | Adversarial resilience — deliberate manipulation by crafted inputs designed to cause failures |
| Regression — new code doesn't break existing functionality | Stability over time — does performance hold as the world changes and data drifts? |
| User acceptance — the system meets stated requirements | Alignment with legal and ethical requirements — bias, explainability, privacy, human oversight |
A system can pass every traditional test and still discriminate against protected groups, break unpredictably on inputs slightly outside its training distribution, or produce decisions that can't be explained to the people they affect. NIST AI RMF MEASURE 2.1 is clear: the testing process itself must be documented, followed, and independently reviewed. Independent review isn't procedural nicety — it catches the things that builders are structurally unlikely to catch themselves.
The TEVV Framework
NIST AI RMF breaks testing and assessment into four distinct activities. They aren't sequential phases — they run in parallel and recur throughout the lifecycle. The distinction matters because each one involves different people, different questions, and different documentation.
| Activity | Core Question | Focus |
|---|---|---|
| Test | Does it work? | Executing the system against defined test cases to identify problems and measure performance |
| Evaluation | How well does it work? | Assessing performance against benchmarks, thresholds, and trustworthiness criteria, including socio-technical factors beyond the pipeline |
| Verification | Did we build it right? | Confirming the system meets its specifications — the technical requirements defined before development |
| Validation | Did we build the right thing? | Confirming the system meets real-world needs in its actual deployment context — not just the requirements as written |
NIST AI RMF is specific: the people who validate the system should be different from those who tested it — and both should be different from those who built it. This isn't procedural preference; it's how you prevent builders from (consciously or not) designing tests their own systems will pass.
TEVV Across the AI Lifecycle
The most common TEVV mistake is treating it as a pre-deployment gate — one test sprint before go-live and then done. NIST AI RMF assigns TEVV responsibilities to every phase of the lifecycle. A test plan that only activates in the final weeks before deployment misses most of what the framework requires.
| Lifecycle Phase | TEVV Focus |
|---|---|
| Design and planning | Validate assumptions about data availability, quality, and representativeness. Verify that requirements capture trustworthiness characteristics — fairness, explainability, safety — not just functional behavior. Plan test approaches for each dimension before any model development begins. |
| Data preparation | Test data quality, completeness, and lineage. Evaluate representativeness across populations relevant to the use case. Identify proxy features that may encode protected characteristics. Document assumptions and limitations of the dataset for downstream TEVV actors. |
| Development (model building) | Validate model performance on held-out data not used in training. Evaluate fairness metrics disaggregated by relevant demographic groups. Conduct adversarial testing of model behavior. Verify that explanations accurately reflect model behavior, not post-hoc rationalisations. |
| Deployment | System integration testing in the production environment. User acceptance testing with realistic scenarios involving actual users and affected parties. Independent review of test results before final go/no-go. Recalibration for integration, user experience, and compliance with legal, regulatory, and ethical specifications. |
| Operations (ongoing) | Continuous monitoring of performance metrics against baselines. Detection of drift and degradation. Incident tracking, analysis, and response. Periodic re-evaluation against original benchmarks. Recalibration with subject matter experts (SMEs) as the deployment context evolves. |
What to Test: Five Dimensions
1. Performance and Accuracy
The starting point — does the system produce correct outputs at acceptable rates? AI performance testing goes beyond a point-in-time accuracy score. You need to assess calibration and consistency across varying conditions.
| Metric | What It Reveals |
|---|---|
| Accuracy, precision, recall, F1, AUC | Overall correctness against a labeled test set — the baseline performance measure |
| False positive and false negative rates | The cost of different error types, which varies significantly by use case — false negatives in medical screening carry different consequences than false positives in content moderation |
| Confidence calibration | Whether the model's stated confidence reflects its actual accuracy: across all predictions made with 90% confidence, roughly 90% should be correct. Miscalibrated confidence is a major source of over-reliance. |
| Performance on held-out data | Generalisation beyond the training examples. Benchmark performance does not always equal real-world performance, particularly when systems may have encountered test data during training (benchmark contamination) |
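The confidence calibration row above can be made concrete with expected calibration error (ECE), one common summary of the gap between stated confidence and observed accuracy. The following is a minimal sketch in plain Python, not a reference implementation: the function name, the ten equal-width bins, and the input format (one confidence and one correct/incorrect flag per prediction) are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A value near zero means stated confidence tracks accuracy closely. What counts as acceptable is a project decision that belongs in the TEVV plan alongside the other thresholds.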
One thing PMs must own: acceptance thresholds need to be defined and signed off before testing begins — not set after you've seen the results. Thresholds that get defined retroactively to match actual scores aren't thresholds; they're rationalisations. NIST MEASURE 2.1 requires that test metrics are documented in the TEVV plan, not appended to it afterwards.
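One way to keep thresholds from drifting toward the scores is to record them as data at sign-off and evaluate results mechanically against that record. The sketch below illustrates the idea under stated assumptions: `Threshold`, `check_thresholds`, and the metric names are hypothetical, and any numeric limits a team uses would come from its own signed-off plan.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)  # frozen: a signed-off threshold cannot be mutated later
class Threshold:
    metric: str
    minimum: Optional[float] = None  # result must be >= minimum, if set
    maximum: Optional[float] = None  # result must be <= maximum, if set


def check_thresholds(results: Dict[str, float],
                     thresholds: List[Threshold]) -> Dict[str, str]:
    """Return 'pass', 'fail', or 'missing' for every pre-registered metric."""
    verdicts = {}
    for t in thresholds:
        if t.metric not in results:
            # A metric that was promised but never measured is surfaced,
            # not silently skipped.
            verdicts[t.metric] = "missing"
            continue
        value = results[t.metric]
        above_floor = t.minimum is None or value >= t.minimum
        below_ceiling = t.maximum is None or value <= t.maximum
        verdicts[t.metric] = "pass" if above_floor and below_ceiling else "fail"
    return verdicts
```

Iterating over the thresholds rather than over the results is deliberate: the verdict list is driven by what was agreed in advance, so a metric cannot disappear from the report because it was not run.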
2. Fairness and Bias
A system that looks good on overall metrics may be causing disproportionate harm to specific groups. Aggregate accuracy numbers hide this. Fairness testing has to be disaggregated — broken down by the actual populations the system will affect.
| Metric | What It Reveals |
|---|---|
| Disaggregated performance by group | Whether accuracy, error rates, and confidence calibration hold across demographic groups, geographies, or other segments relevant to the use case |
| Demographic parity | Whether the AI produces positive outcomes at similar rates across groups, regardless of individual characteristics |
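| Equalized odds / equal opportunity | Equalized odds: whether both true positive and false positive rates are consistent across groups. Equal opportunity is the relaxed form that compares true positive rates only. Equalized odds is a more demanding standard than demographic parity in many use cases |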
| Denigration and stereotyping prevalence | For generative AI, whether outputs exhibit systematic bias, harmful stereotyping, or denigrating content toward particular groups — NIST AI 600-1 MS-2.11-001 recommends specific benchmarks including Bias Benchmark Questions and Winogender Schemas |
Go beyond single-variable analysis. Harms often concentrate at intersections: a group that appears fine on gender metrics and fine on age metrics may experience systematic disadvantage at the intersection of gender and age. NIST AI 600-1 recommends direct engagement with potentially affected communities, not to be polite, but because they'll surface failure modes the technical team won't anticipate.
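The disaggregated metrics in the table above reduce to a few per-group rates. The sketch below is a simplified illustration for binary classification, with hypothetical function names and a hypothetical record format of `(group, y_true, y_pred)`. Because the group key can be any hashable value, passing a tuple such as `("female", "over_60")` gives the intersectional breakdown.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple

Record = Tuple[Hashable, int, int]  # (group, y_true, y_pred), labels are 0 or 1


def group_rates(records: Iterable[Record]) -> Dict[Hashable, Dict[str, Optional[float]]]:
    """Selection rate, true positive rate, and false positive rate per group."""
    counts: Dict[Hashable, Dict[str, int]] = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        c[key] += 1
    rates = {}
    for group, c in counts.items():
        n = sum(c.values())
        positives = c["tp"] + c["fn"]  # cases whose true label is 1
        negatives = c["fp"] + c["tn"]  # cases whose true label is 0
        rates[group] = {
            "selection_rate": (c["tp"] + c["fp"]) / n,
            "tpr": c["tp"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return rates


def max_gap(rates: Dict[Hashable, Dict[str, Optional[float]]], metric: str) -> float:
    """Largest difference in one metric between any two groups."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values) if values else 0.0
```

In this sketch, `max_gap(rates, "selection_rate")` corresponds to a demographic parity gap, the `tpr` gap to equal opportunity, and the `tpr` and `fpr` gaps together to equalized odds. Small groups produce noisy rates, so a real analysis would also report group sizes and uncertainty.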
3. Robustness
What happens when the system sees something it wasn't trained on? Does it fail gracefully, or fail dangerously? For high-risk AI, this matters a lot — production environments rarely look exactly like training environments.
| Test Type | What It Surfaces |
|---|---|
| Edge case testing | Unusual but legitimate inputs — values at the extreme ends of distributions, rare but valid combinations of features, inputs that represent underrepresented populations in the training data |
| Out-of-distribution testing | Inputs that differ from the training distribution in ways the model has not seen — how does the model behave, and does it signal appropriate uncertainty rather than producing confident but wrong outputs? |
| Noisy and degraded inputs | Incomplete data, missing features, corrupted inputs, or lower-quality data than the training set — common in real-world deployments where data quality is uncontrolled |
| Distribution shift simulation | Testing performance on data from a different time period, geography, or population than the training data, to assess how quickly performance degrades as deployment conditions diverge from training conditions |
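For the noisy and degraded inputs row, one simple probe is to add small random noise to each input and measure how often the prediction stays the same. The sketch below assumes a hypothetical `predict` callable that takes a list of numeric features and returns a label. Gaussian noise is only one choice of perturbation, and the noise scale should reflect what the deployment is likely to see.

```python
import random
from typing import Callable, List, Sequence


def prediction_stability(predict: Callable[[List[float]], int],
                         inputs: Sequence[Sequence[float]],
                         noise_scale: float,
                         trials: int = 20,
                         seed: int = 0) -> float:
    """Fraction of noisy copies whose prediction matches the clean prediction."""
    rng = random.Random(seed)  # fixed seed so the test run is repeatable
    stable = 0
    total = 0
    for x in inputs:
        baseline = predict(list(x))
        for _ in range(trials):
            noisy = [v + rng.gauss(0.0, noise_scale) for v in x]
            stable += int(predict(noisy) == baseline)
            total += 1
    if total == 0:
        raise ValueError("no inputs to evaluate")
    return stable / total
```

A stability score well below 1.0 at realistic noise levels is a brittleness signal worth investigating. It does not say whether the clean predictions were correct in the first place, so it complements accuracy testing rather than replacing it.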
4. Safety and Security
Can the system be deliberately misled? Can its safety guardrails be bypassed? AI safety and security testing is genuinely different from conventional cybersecurity — it requires different techniques and usually different expertise.
| Test Type | What It Surfaces |
|---|---|
| Adversarial input testing | Inputs deliberately crafted to cause the model to fail — images with imperceptible perturbations that cause misclassification, text prompts designed to elicit harmful outputs, inputs that exploit model architecture weaknesses |
| Prompt injection testing | For systems accepting natural language, whether malicious instructions embedded in user input can override system instructions, bypass guardrails, or extract protected information |
| Guardrail and safety mechanism testing | Whether safety controls actually prevent the failure modes they are designed to prevent — including whether they can be bypassed through creative reformulation of harmful requests |
| Red-teaming | Structured adversarial exercises conducted by people independent of the build team. NIST AI 600-1 defines AI red-teaming as exercises to identify potential adverse behavior, stress test safeguards, and find failure modes that standard testing misses. Team diversity — demographic, disciplinary, and domain expertise — directly affects the quality of findings. |
Red-teaming results are inputs to risk management, not stand-alone verdicts. They tell you where to look and what questions to ask — they require interpretation before they can drive governance decisions.
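Prompt injection testing from the table above can be partly automated with a canary: plant a secret in the system instructions and check whether any attack prompt gets it back out. The sketch below is a deliberately small harness under loud assumptions. `respond` is a hypothetical callable standing in for the system under test, the system prompt wording is invented, and substring matching catches only verbatim leaks.

```python
from typing import Callable, List


def run_injection_suite(respond: Callable[[str, str], str],
                        attacks: List[str],
                        canary: str) -> List[str]:
    """Return the attack prompts whose responses leaked the canary string.

    respond(system_prompt, user_input) is a placeholder for however the
    real system is invoked; it is not an API of any particular library.
    """
    system_prompt = (
        "You are a support assistant. "
        f"Never reveal the internal code {canary}."
    )
    leaked = []
    for attack in attacks:
        output = respond(system_prompt, attack)
        if canary in output:  # verbatim leak only; paraphrased leaks are missed
            leaked.append(attack)
    return leaked
```

An empty result means only that these specific attacks failed. It is an input to the red-team exercise, not a replacement for it.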
5. Explainability
Can the system's decisions be understood and justified — by the people who operate it, the people affected by it, and the regulators who oversee it? Explainability requirements vary by context. Testing has to be designed for the actual audiences who need to understand the system's outputs.
| Test Focus | Questions to Answer |
|---|---|
| Explanation accuracy | Do explanations accurately reflect the model's actual decision process, or are they post-hoc rationalisations that look plausible but do not correspond to what the model computed? |
| Explanation comprehensibility | Are explanations comprehensible to the specific audiences who need them — a technical operator, a non-technical user, an affected individual, a regulator? The same explanation may be appropriate for one audience and meaningless to another. |
| Explanation consistency | Does the system produce similar explanations for similar decisions? Inconsistent explanations for similar cases undermine both trust and the utility of explanations for error detection. |
| Regulatory adequacy | Do explanations meet the specific explainability requirements that apply in the deployment jurisdiction and sector — including the EU AI Act's transparency requirements for high-risk AI and sector-specific obligations? |
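Explanation consistency in the table above can be checked roughly by comparing which features dominate the explanations for two similar cases. The sketch below assumes explanations arrive as a mapping from feature name to attribution score, which is a hypothetical format. It uses Jaccard overlap of the top-k features as the similarity measure, one of several reasonable choices.

```python
from typing import Dict, Set


def top_k_features(attributions: Dict[str, float], k: int = 3) -> Set[str]:
    """Names of the k features with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda name: abs(attributions[name]),
                    reverse=True)
    return set(ranked[:k])


def explanation_overlap(a: Dict[str, float], b: Dict[str, float],
                        k: int = 3) -> float:
    """Jaccard overlap of the top-k features of two explanations (0.0 to 1.0)."""
    top_a, top_b = top_k_features(a, k), top_k_features(b, k)
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0
```

Low overlap between cases that a domain expert considers near-identical is a prompt for investigation. It does not establish whether either explanation is faithful to the model, which is the separate explanation accuracy question.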
Testing Approaches
Different methods surface different problems. Good TEVV uses a combination of approaches matched to the system's risk level and deployment context.
| Approach | How It Works | Strengths and Limitations |
|---|---|---|
| Benchmark testing | Compare system performance against standard datasets or established baselines used across the field | Reproducible and comparable across systems. Risk of benchmark contamination — the system may have encountered test data during training. Benchmark performance does not always equal real-world performance in the deployment context. |
| Field testing | Evaluate system behavior in realistic deployment conditions with actual users or representative populations | Reveals issues that controlled testing misses — real-world data quality, user behavior, and deployment context. More resource-intensive. May expose real users to risk if not carefully staged. Requires informed consent and human subjects protections. |
| AI red-teaming | Structured adversarial exercises by people independent of the build team, aiming to find failure modes and vulnerabilities that standard testing does not surface | Identifies failure modes that designed tests cannot find. Quality depends heavily on team diversity, domain expertise, and depth of effort. Resource-intensive. Findings require interpretation before incorporation into governance decisions. |
| Structured public feedback | Gather input from users, affected communities, and domain experts through focus groups, surveys, community advisory boards, and feedback channels | Surfaces failure modes and contextual impacts that technical testing cannot detect. Builds stakeholder trust. Requires careful design to be actionable and representative. Not a substitute for technical evaluation. |
Common Failure Modes
Knowing how AI systems typically fail helps you focus testing effort on what matters most for your deployment.
| Failure Mode | What It Means | Testing Response |
|---|---|---|
| Brittleness | System fails on inputs slightly different from training data — small perturbations, slightly different formatting, or edge-of-distribution values cause disproportionate performance drops | Edge case testing, out-of-distribution testing, input perturbation analysis |
| Embedded bias | System learns and perpetuates biases present in training data, producing systematically different outcomes for different demographic groups even when those differences are not justified by the task | Disaggregated fairness testing, bias audits using standardized benchmarks, intersectional analysis, engagement with affected communities |
| Catastrophic forgetting | After retraining on new data, the system loses capabilities it had on the original training distribution — a silent regression that aggregate metrics may not detect | Regression testing against the full original test suite after every retraining cycle, not just testing on new data |
| Uncertainty blindness | System produces high-confidence outputs even when inputs are ambiguous, out of distribution, or when the model should not be confident — miscalibration that drives over-reliance on AI recommendations | Confidence calibration testing; evaluation of model behavior on inputs the model should recognize as uncertain |
| Adversarial vulnerability | System can be deliberately manipulated by carefully crafted inputs designed to exploit weaknesses in the model architecture or training data | Adversarial input testing, prompt injection testing for language models, red-teaming with adversarial intent |
| Distribution shift | Real-world production data differs from training data in ways that erode performance gradually and without obvious failure signals | Field testing, pre-deployment distribution comparison, post-deployment drift monitoring with defined action thresholds |
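For the distribution shift row, a widely used drift statistic is the Population Stability Index (PSI), which compares how a feature is spread across bins in training data versus production data. The sketch below is a minimal version with assumed names. The bin edges are supplied by the caller, and the small `eps` floor that avoids division by zero is an implementation choice, not part of any standard.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               edges: Sequence[float],
                               eps: float = 1e-6) -> float:
    """PSI between a reference sample and a production sample of one feature."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

PSI is zero when the binned distributions match and grows as they diverge. The value that should trigger action is exactly the kind of defined action threshold the table refers to, and it belongs in the monitoring plan rather than in the code.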
Independence and Separation
NIST MEASURE 2.1 is explicit: independent review improves testing effectiveness and mitigates the biases that arise when builders test their own systems. This isn't procedural — it reflects how testing actually fails in practice.
Builders have real blind spots. They design tests to confirm what they expect, not to discover what they didn't anticipate. Organizational pressure to ship creates incentives — often unconscious — to define acceptance criteria the system will meet. These are structural pressures, not personal failures. Independence is the structural countermeasure.
| Independence Level | What It Means in Practice |
|---|---|
| Testers separate from developers | The minimum baseline — the team conducting TEVV should not have been involved in building the system components they are evaluating. This applies even within the same organization. |
| Independent internal review | A function within the organization — risk, compliance, legal, or a dedicated AI governance team — conducts or reviews TEVV results independent of the project team. Applicable to medium- and higher-risk systems. |
| External audit or assessment | Third parties with no commercial interest in the system's deployment conduct or independently verify TEVV results. Required for high-risk AI systems under the EU AI Act in some sectors, and recommended by NIST for systems with significant potential for harm. |
| Structured challenge process | NIST AI RMF describes "effective challenge" — a culture that encourages critical thinking and questioning of important design and implementation decisions by experts with the authority and stature to act on what they find. This applies throughout the lifecycle, not only at formal test gates. |
Deployment Strategies as Testing Mechanisms
Testing doesn't stop at the go/no-go decision. How you deploy the system affects your ability to catch problems early and limit the damage from failures that pre-deployment testing didn't surface.
| Strategy | How It Functions as Testing |
|---|---|
| Phased rollout | Deploy to a limited user population first and expand gradually based on observed performance. Allows real-world performance validation at limited scale before full exposure. Appropriate for most AI deployments. |
| Shadow mode | The AI system runs in parallel with existing processes but does not affect outcomes — its outputs are logged and evaluated, but decisions are made by the current process. Allows observation of real-world AI behavior without risk to current outcomes. Particularly useful before autonomous operation begins. |
| A/B testing | Run the new AI system alongside the existing process or a baseline model, routing different users to each. Allows direct performance comparison under comparable conditions. Requires careful design to avoid differential harm: if the AI system may produce worse outcomes for some users, random assignment raises ethical issues. |
| Canary deployment | Route a small, controlled percentage of live traffic to the new system while maintaining the existing system for the remainder. Allows detection of production issues before full rollout. Appropriate when gradual validation of production behavior is needed before committing to full deployment. |
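Shadow mode, as described in the table above, has a simple shape in code: the existing process decides, the candidate system runs on the same case, and only the log sees the candidate's answer. The sketch below illustrates that shape with hypothetical names (`run_shadow`, `legacy_decide`, `candidate_decide`). A production version would also need durable logging and privacy controls.

```python
from typing import Any, Callable, Dict, List, Optional


def run_shadow(cases: List[Any],
               legacy_decide: Callable[[Any], Any],
               candidate_decide: Callable[[Any], Any]) -> List[Dict[str, Any]]:
    """Apply the legacy decision; record the candidate's output for comparison."""
    log = []
    for case in cases:
        live = legacy_decide(case)  # the only decision that takes effect
        entry = {"case": case, "live": live, "shadow": None, "error": None}
        try:
            entry["shadow"] = candidate_decide(case)
        except Exception as exc:
            # A candidate failure is recorded but must never affect the live path.
            entry["error"] = repr(exc)
        log.append(entry)
    return log


def agreement_rate(log: List[Dict[str, Any]]) -> Optional[float]:
    """Share of error-free entries where the candidate matched the live decision."""
    scored = [e for e in log if e["error"] is None]
    if not scored:
        return None
    return sum(1 for e in scored if e["shadow"] == e["live"]) / len(scored)
```

Agreement with the legacy process is not the same as correctness, since the legacy process may itself be wrong. Disagreements are most useful as a queue of cases for human review.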
Documentation Requirements
NIST MEASURE 2.1 requires test sets, metrics, and tooling to be documented. Undocumented testing isn't governance — it's activity. Documentation is what makes testing repeatable, reviewable, and defensible when it matters.
For each TEVV activity, document at minimum:
- Test plan: what will be tested, by whom, using what methods and data, against what acceptance thresholds — completed before testing begins.
- Test data: sources, composition, representativeness, and any known limitations or gaps relative to the production population.
- Test results: metrics and whether they met, fell short of, or exceeded defined thresholds — including results that were unfavorable.
- Limitations: conditions under which results may not generalise — populations, geographies, use cases, or conditions not covered by the test dataset.
- Residual risks: known issues identified through TEVV that were accepted, along with the rationale and the mitigations in place.
- Decisions informed: what go/no-go decisions, design changes, monitoring requirements, or additional testing the TEVV results triggered.
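The six items above map naturally onto a structured record, which makes gaps easy to spot before a review gate. The sketch below shows one possible shape. The class name, field names, and the `missing_fields` helper are illustrative assumptions, not a NIST-defined schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TevvRecord:
    activity: str                      # e.g. "fairness evaluation"
    test_plan: str                     # scope, owner, methods, thresholds
    test_data: str                     # sources, composition, known gaps
    results: Dict[str, float] = field(default_factory=dict)
    limitations: List[str] = field(default_factory=list)
    residual_risks: List[str] = field(default_factory=list)
    decisions_informed: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize for storage alongside other project records."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

An empty `residual_risks` list can be legitimate, but in this sketch it is still reported so that someone confirms it explicitly rather than leaving it blank by default.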
Your Responsibilities, Phase by Phase
Planning Phase
- Define TEVV activities in scope, schedule, and budget — including budget for independent testing that may require external expertise.
- Allocate resources for each dimension of TEVV: performance, fairness, robustness, safety/security, and explainability, not only functional correctness.
- Define acceptance criteria for each trustworthiness dimension before development begins — thresholds set after results are known are not acceptance criteria.
- Identify who will conduct independent review and at what gates — independence must be structural, not asserted.
- Determine whether red-teaming is appropriate for the system's risk level and, if so, define the composition and scope of the red team.
Development Phase
- Monitor TEVV progress alongside development milestones — TEVV is concurrent with development, not a gate at the end of it.
- Ensure TEVV findings are addressed and documented, not merely recorded. An unfavorable finding that leads to neither a design change nor a formally accepted residual risk is a governance gap.
- Escalate findings that affect go/no-go decisions to the appropriate stakeholders before, not after, the deployment decision is made.
- Confirm that test data is documented including its composition, representativeness, and limitations.
Deployment Phase
- Verify that all planned TEVV activities are complete before the go/no-go decision. Any incomplete testing must be documented as a residual risk, not implicitly accepted.
- Confirm that residual risks identified through TEVV are formally accepted by the appropriate decision-maker, not informally deferred.
- Confirm that post-deployment monitoring is in place to detect drift and performance degradation — the transition from TEVV to ongoing monitoring must be explicit.
- If using phased rollout or shadow mode, define the criteria that must be met before expanding deployment scope.
Operations Phase
- Track ongoing performance against the baselines established during pre-deployment TEVV.
- Trigger re-testing when conditions change materially: new user populations, significant data drift, regulatory changes, or identified incidents.
- Include TEVV in change management processes — model updates, retraining, and configuration changes each require regression testing before production deployment.
- Conduct periodic re-evaluation of the system against its original benchmarks and the current regulatory and ethical standards that apply to it.
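The regression testing called for after model updates and retraining can be sketched as a per-slice comparison against the pre-deployment baseline. This is a simplified illustration: `find_regressions`, the slice names, and the tolerance are hypothetical, and it assumes higher scores are better for every slice.

```python
from typing import Dict, Optional, Tuple


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.01
                     ) -> Dict[str, Tuple[float, Optional[float]]]:
    """Slices where the retrained model dropped by more than the tolerance.

    A slice missing from the candidate results is reported as well, because
    a test that was skipped cannot be counted as passed.
    """
    regressions = {}
    for slice_name, base_score in baseline.items():
        new_score = candidate.get(slice_name)
        if new_score is None or base_score - new_score > tolerance:
            regressions[slice_name] = (base_score, new_score)
    return regressions
```

Checking each slice separately matters because an improved aggregate score can mask a drop on the original test suite, which is the silent regression described under catastrophic forgetting.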
Right-Sizing This for Your Situation
Testing depth should match system risk. A low-risk internal tool doesn't need the same TEVV program as a high-risk system making consequential decisions about individuals. Not every system needs external red-teaming — but every system needs documented validation of performance and fitness for purpose in its actual deployment context.
You don't have a formal TEVV process yet. Start with the basics: document your acceptance criteria for each of the five testing dimensions before development starts — even rough thresholds are better than none. Separate the people doing testing from the people doing building, even if it means swapping team members across workstreams. Before go-live, run at least one scenario with out-of-distribution inputs for each high-stakes decision the system makes. Document what you found and what you decided. That's your minimum viable TEVV record.
You're building repeatable testing capability. Formalize your threshold-setting process so acceptance criteria are always defined before testing begins — make it a required input to development kickoff. Build a test planning template that covers all five dimensions and forces explicit decisions about independence level for each. Use deployment strategies like phased rollout or shadow mode as post-deployment validation tools, not just risk management tools. Track findings across deployments to identify recurring failure patterns.
TEVV needs to integrate with your existing quality management, compliance, and audit infrastructure. Map your EU AI Act Article 9 and Annex IV obligations to specific TEVV activities and documentation requirements — the technical documentation the Act requires for high-risk AI is essentially your TEVV record. Build external review into your governance calendar for high-risk systems, not just when a regulator asks. Use TEVV data across your AI portfolio to identify systemic risk patterns that no individual project will see.
Framework References
- NIST AI RMF 1.0: MEASURE 2.1 (TEVV documentation and independence requirements), MEASURE 2.6 (demonstrating safety before deployment), MEASURE 2.9 (explainability validation).
- NIST AI 600-1 GenAI Profile (2024): MS-2.11-001 and MS-2.11-002. Fairness assessments, demographic subgroup testing, red-teaming, and benchmark selection for generative AI.
- EU AI Act (2024): Article 9 (risk management and testing for high-risk AI throughout lifecycle), Article 10 (training and test data governance), Article 15 (accuracy, robustness, and cybersecurity requirements), Annex IV (technical documentation including test results).
- Stanford HAI: Validating Claims About AI (2024). Practical framework for assessing benchmark validity and the risks of benchmark contamination.
- PMI CPMAI Guide (2025): Phases III, IV, V. Acceptance criteria and test strategy, model validation including fairness and adversarial testing, pre-deployment TEVV gate and user acceptance testing.
This article is part of AIPMO’s PM Practice series. See also: AI Risk Registers | Monitoring AI Systems in Production | Human Oversight in AI Systems