Skip to content

NEW - Testing and Validation for AI Systems: More Than Accuracy

Every AI project needs testing — but the test plan looks nothing like traditional software QA. The NIST AI RMF TEVV framework requires testing across five dimensions: accuracy, fairness, robustness, safety, and explainability. A system can pass every conventional test and still discriminate.

By AIPMO
Published: · 17 min read

 

PM Takeaways

       NIST AI RMF MEASURE 2.1 requires that test sets, metrics, and tools used during TEVV are documented before testing begins — not after. Defining acceptance thresholds post-hoc, after results are known, undermines the entire validation process and creates governance exposure. The PM must ensure thresholds are documented and signed off as part of planning, not rationalised at the end of a test cycle.

       Traditional software testing answers one question: does the system produce the correct output? AI TEVV must answer five: does it perform accurately, does it perform equitably across groups, does it degrade gracefully, can it be manipulated, and can its decisions be explained? A system that passes every traditional test can still discriminate against protected groups, fail unpredictably on edge cases, or produce outputs that cannot be explained to the people affected by them.

       NIST AI RMF requires that AI actors carrying out verification and validation tasks are distinct from those who performed test and evaluation actions — and ideally distinct from those who built the system. Independence is not a procedural nicety; it mitigates the blind spots and organisational pressure that lead builders to unconsciously design tests their systems will pass. For high-risk systems, external auditors should be engaged for at least the final validation gate.

       AI red-teaming, as defined in NIST AI 600-1, is an evolving practice for identifying adverse behaviour or outcomes through structured adversarial exercises. The quality of red-teaming outputs depends heavily on the diversity and domain expertise of the red team — demographically and interdisciplinarily diverse teams find different failure modes in different deployment contexts. Red team composition is a design decision, not a staffing convenience.

       TEVV is not a phase — it is a continuous process across the entire AI lifecycle. NIST AI RMF maps distinct TEVV tasks to design and planning (validating assumptions and data), development (model validation and assessment), deployment (system integration, user experience, compliance), and operations (ongoing monitoring, incident tracking, SME recalibration). A test plan that treats TEVV as a pre-deployment gate misses the majority of what the framework requires.

Every software project includes testing. AI projects need testing too — but the test plan looks fundamentally different. Traditional software testing asks a single question: does the system produce the specified output given a specified input? AI testing must ask several others: does it perform equitably across different populations? Does it fail gracefully when inputs are unexpected? Can it be deliberately manipulated? Will its outputs still be reliable as the world changes around it?

The NIST AI Risk Management Framework uses the term TEVV — Test, Evaluation, Verification, and Validation — to describe the comprehensive assessment that AI systems require. TEVV is not a testing phase at the end of development; it is a set of activities that run throughout the AI lifecycle, from planning assumptions through post-deployment monitoring. As PM, your job is not to run the tests. It is to ensure TEVV activities are scoped, resourced, independent, and completed — at every stage, not just before go-live. 

Why Traditional Testing Isn’t Enough

Traditional software testing centres on functional correctness. Given input X, does the system produce output Y? This matters for AI systems too. But it leaves a set of critical failure modes entirely untested — and those failure modes are the ones most likely to cause harm in AI deployments.

What Traditional Testing Covers

What AI Systems Also Require

Functional requirements met — correct outputs for defined inputs

Performance across subpopulations — does accuracy hold for all groups, or only in aggregate?

Integration correctness — components working together as designed

Equitable outcomes — are error rates and outcome distributions fair across demographic groups?

Performance under load — response time and throughput at scale

Robustness to unexpected inputs — edge cases, out-of-distribution data, degraded inputs

Security vulnerabilities — known attack vectors and penetration testing

Adversarial resilience — deliberate manipulation by crafted inputs designed to cause failures

Regression — new code doesn’t break existing functionality

Stability over time — does performance hold as the world changes and data drifts?

User acceptance — the system meets stated requirements

Alignment with legal and ethical requirements — bias, explainability, privacy, human oversight

A system can pass every traditional test and still discriminate against protected groups, fail unpredictably on inputs slightly outside its training distribution, or produce decisions that cannot be explained to the people they affect. NIST AI RMF MEASURE 2.1 frames the objective: objective, repeatable, or scalable TEVV processes including metrics, methods, and methodologies must be in place, followed, and documented. Processes for independent review improve the effectiveness of testing and mitigate internal biases and potential conflicts of interest. 

The TEVV Framework

NIST AI RMF distinguishes four complementary activities. They are not sequential phases — they operate in parallel and recur throughout the lifecycle. Understanding the distinction matters because each activity has different scope, different actors, and different documentation requirements.

Activity

Core Question

Focus

Test

Does it work?

Executing the system against defined test cases to identify problems and measure performance

Evaluation

How well does it work?

Assessing performance against benchmarks, thresholds, and trustworthiness criteria, including socio-technical factors beyond the pipeline

Verification

Did we build it right?

Confirming the system meets its specifications — the technical requirements defined before development

Validation

Did we build the right thing?

Confirming the system meets real-world needs in its actual deployment context — not just the requirements as written

NIST AI RMF 1.0 specifies that AI actors carrying out verification and validation tasks should ideally be distinct from those who perform test and evaluation actions — and both should be distinct from those who built the system. Independence is a structural requirement, not a preference. It mitigates the blind spots and organisational pressure to ship that lead builders to unconsciously design tests their own systems will pass. 

TEVV Across the AI Lifecycle

One of the most common TEVV failures is treating it as a pre-deployment gate. NIST AI RMF 1.0 maps distinct TEVV responsibilities to each phase of the AI lifecycle. A test plan that is only active in the final weeks before go-live misses the majority of what the framework requires.

Lifecycle Phase

TEVV Focus

Design and planning

Validate assumptions about data availability, quality, and representativeness. Verify that requirements capture trustworthiness characteristics — fairness, explainability, safety — not just functional behaviour. Plan test approaches for each dimension before any model development begins.

Data preparation

Test data quality, completeness, and lineage. Evaluate representativeness across populations relevant to the use case. Identify proxy features that may encode protected characteristics. Document assumptions and limitations of the dataset for downstream TEVV actors.

Development (model building)

Validate model performance on held-out data not used in training. Evaluate fairness metrics disaggregated by relevant demographic groups. Conduct adversarial testing of model behaviour. Verify that explanations accurately reflect model behaviour, not post-hoc rationalisations.

Deployment

System integration testing in the production environment. User acceptance testing with realistic scenarios involving actual users and affected parties. Independent review of test results before final go/no-go. Recalibration for integration, user experience, and compliance with legal, regulatory, and ethical specifications.

Operations (ongoing)

Continuous monitoring of performance metrics against baselines. Detection of drift and degradation. Incident tracking, analysis, and response. Periodic re-evaluation against original benchmarks. SME recalibration as the deployment context evolves.

 

What to Test: Five Dimensions of AI TEVV

1. Performance and Accuracy

The foundational dimension — does the system produce correct outputs at acceptable rates? Performance testing for AI systems must go beyond point-in-time accuracy to assess calibration and consistency under varying conditions.

Metric

What It Reveals

Accuracy, precision, recall, F1, AUC

Overall correctness against a labelled test set — the baseline performance measure

False positive and false negative rates

The cost of different error types, which varies significantly by use case — false negatives in medical screening carry different consequences than false positives in content moderation

Confidence calibration

Whether the model’s stated confidence reflects its actual accuracy — a model that says 90% confident should be correct 90% of the time. Miscalibrated confidence is a major source of over-reliance.

Performance on held-out data

Generalisation beyond the training distribution — Stanford HAI’s validity framework warns that benchmark performance does not always equal real-world performance, particularly when systems may have encountered test data during training (benchmark contamination)

PM critical action: define acceptable performance thresholds — including thresholds for each error type — before testing begins, documented and signed off. Thresholds defined after seeing results are not thresholds; they are rationalisations. NIST MEASURE 2.1 requires that metrics are documented as part of the TEVV plan, not appended afterwards.

2. Fairness and Bias

A system that performs well in aggregate may perform poorly — and cause disproportionate harm — for specific demographic groups. Fairness testing must be disaggregated and must be designed with the actual affected populations in mind.

Metric

What It Reveals

Disaggregated performance by group

Whether accuracy, error rates, and confidence calibration hold across demographic groups, geographies, or other segments relevant to the use case

Demographic parity

Whether the AI produces positive outcomes at similar rates across groups, regardless of individual characteristics

Equalized odds / equal opportunity

Whether true positive and false positive rates are consistent across groups — a more demanding standard than demographic parity in many use cases

Denigration and stereotyping prevalence

For generative AI, whether outputs exhibit systematic bias, harmful stereotyping, or denigrating content toward particular groups — NIST AI 600-1 MS-2.11-001 recommends specific benchmarks including Bias Benchmark Questions and Winogender Schemas

Fairness testing must include intersectional analysis — not just gender, but gender × age × geography, because harms often concentrate at intersections that aggregate metrics obscure. NIST AI 600-1 MS-2.11-003 requires direct engagement with potentially impacted communities to identify the classes of individuals and groups the system may affect. Affected communities, not only technical teams, should be involved in defining what fairness means in the specific deployment context.

3. Robustness

Does the system handle unexpected inputs gracefully, or does it fail in unpredictable ways when it encounters data outside its training distribution? Robustness testing is particularly important for high-risk AI systems where failures may cause direct harm.

Test Type

What It Surfaces

Edge case testing

Unusual but legitimate inputs — values at the extreme ends of distributions, rare but valid combinations of features, inputs that represent underrepresented populations in the training data

Out-of-distribution testing

Inputs that differ from the training distribution in ways the model has not seen — how does the model behave, and does it signal appropriate uncertainty rather than producing confident but wrong outputs?

Noisy and degraded inputs

Incomplete data, missing features, corrupted inputs, or lower-quality data than the training set — common in real-world deployments where data quality is uncontrolled

Distribution shift simulation

Testing performance on data from a different time period, geography, or population than the training data, to assess how quickly performance degrades as deployment conditions diverge from training conditions

4. Safety and Security

Can the system be deliberately manipulated? Can its safety mechanisms be bypassed? Safety and security testing for AI systems requires different techniques from conventional cybersecurity — and different expertise.

Test Type

What It Surfaces

Adversarial input testing

Inputs deliberately crafted to cause the model to fail — images with imperceptible perturbations that cause misclassification, text prompts designed to elicit harmful outputs, inputs that exploit model architecture weaknesses

Prompt injection testing

For systems accepting natural language, whether malicious instructions embedded in user input can override system instructions, bypass guardrails, or extract protected information

Guardrail and safety mechanism testing

Whether safety controls actually prevent the failure modes they are designed to prevent — including whether they can be bypassed through creative reformulation of harmful requests

Red-teaming

Structured adversarial exercises conducted by people independent of the build team. NIST AI 600-1 defines AI red-teaming as exercises to identify potential adverse behaviour, stress test safeguards, and find failure modes that standard testing misses. Team diversity — demographic, disciplinary, and domain expertise — directly affects the quality of findings.

NIST MEASURE 2.3 recommends using red-team exercises to actively test systems under adversarial or stress conditions, measure system response, assess failure modes, and evaluate mismatches between claimed and actual system performance. Red-teaming results require additional analysis before incorporation into governance decisions — they are inputs to risk management, not stand-alone verdicts.

5. Explainability

Can the decisions or outputs of the AI system be understood and justified — by operators, users, affected parties, and regulators? Explainability requirements vary significantly by context, and testing must be designed for the specific audiences who need to understand the system’s reasoning.

Test Focus

Questions to Answer

Explanation accuracy

Do explanations accurately reflect the model’s actual decision process, or are they post-hoc rationalisations that look plausible but do not correspond to what the model computed?

Explanation comprehensibility

Are explanations comprehensible to the specific audiences who need them — a technical operator, a non-technical user, an affected individual, a regulator? The same explanation may be appropriate for one audience and meaningless to another.

Explanation consistency

Does the system produce similar explanations for similar decisions? Inconsistent explanations for similar cases undermine both trust and the utility of explanations for error detection.

Regulatory adequacy

Do explanations meet the specific explainability requirements that apply in the deployment jurisdiction and sector — including the EU AI Act’s transparency requirements for high-risk AI and sector-specific obligations?

 

Testing Approaches

Different testing methods surface different types of problems. Effective TEVV uses a combination of approaches calibrated to the risk level of the system and the deployment context. NIST AI 600-1 distinguishes four main approaches for evaluating AI systems.

Approach

How It Works

Strengths and Limitations

Benchmark testing

Compare system performance against standard datasets or established baselines used across the field

Reproducible and comparable across systems. Risk of benchmark contamination — the system may have encountered test data during training. Benchmark performance does not always equal real-world performance in the deployment context (Stanford HAI).

Field testing

Evaluate system behaviour in realistic deployment conditions with actual users or representative populations

Reveals issues that controlled testing misses — real-world data quality, user behaviour, and deployment context. More resource-intensive. May expose real users to risk if not carefully staged. Requires informed consent and human subjects protections.

AI red-teaming

Structured adversarial exercises by people independent of the build team, aiming to find failure modes and vulnerabilities that standard testing does not surface

Identifies failure modes that designed tests cannot find. Quality depends heavily on team diversity, domain expertise, and depth of effort. Resource-intensive. Findings require interpretation before incorporation into governance decisions.

Structured public feedback

Gather input from users, affected communities, and domain experts through focus groups, surveys, community advisory boards, and feedback channels

Surfaces failure modes and contextual impacts that technical testing cannot detect. Builds stakeholder trust. Requires careful design to be actionable and representative. Not a substitute for technical evaluation.

 

Common AI Failure Modes and Testing Responses

Understanding how AI systems characteristically fail helps focus TEVV efforts and calibrate testing depth to the risks that matter most for the deployment context.

Failure Mode

What It Means

Testing Response

Brittleness

System fails on inputs slightly different from training data — small perturbations, slightly different formatting, or edge-of-distribution values cause disproportionate performance drops

Edge case testing, out-of-distribution testing, input perturbation analysis

Embedded bias

System learns and perpetuates biases present in training data, producing systematically different outcomes for different demographic groups even when those differences are not justified by the task

Disaggregated fairness testing, bias audits using standardised benchmarks, intersectional analysis, engagement with affected communities

Catastrophic forgetting

After retraining on new data, the system loses capabilities it had on the original training distribution — a silent regression that aggregate metrics may not detect

Regression testing against the full original test suite after every retraining cycle, not just testing on new data

Uncertainty blindness

System produces high-confidence outputs even when inputs are ambiguous, out of distribution, or when the model should not be confident — miscalibration that drives over-reliance on AI recommendations

Confidence calibration testing; evaluation of model behaviour on inputs the model should recognise as uncertain

Adversarial vulnerability

System can be deliberately manipulated by carefully crafted inputs designed to exploit weaknesses in the model architecture or training data

Adversarial input testing, prompt injection testing for language models, red-teaming with adversarial intent

Distribution shift

Real-world production data differs from training data in ways that erode performance gradually and without obvious failure signals

Field testing, pre-deployment distribution comparison, post-deployment drift monitoring with defined action thresholds

 

Independence and Separation

NIST AI RMF 1.0 is explicit that TEVV is enhanced when processes enable corroboration by independent evaluators. MEASURE 2.1 recommends processes for independent review to improve effectiveness and mitigate internal biases and potential conflicts of interest. This is not a procedural preference — it reflects a structural reality about how testing fails when the people who built a system are also the people who test it.

Builders have genuine blind spots about their own systems. They design tests to validate what they expect the system to do, not to discover what it might do in contexts they did not anticipate. Organisational pressure to ship creates incentives — often unconscious — to define test success in ways the system will meet. And conflicts of interest between development velocity and testing rigour are structural, not personal failures.

Independence Level

What It Means in Practice

Testers separate from developers

The minimum baseline — the team conducting TEVV should not have been involved in building the system components they are evaluating. This applies even within the same organisation.

Independent internal review

A function within the organisation — risk, compliance, legal, or a dedicated AI governance team — conducts or reviews TEVV results independent of the project team. Applicable to medium- and higher-risk systems.

External audit or assessment

Third parties with no commercial interest in the system’s deployment conduct or independently verify TEVV results. Required for high-risk AI systems under the EU AI Act in some sectors, and recommended by NIST for systems with significant potential for harm.

Structured challenge process

NIST AI RMF describes “effective challenge” — a culture that encourages critical thinking and questioning of important design and implementation decisions by experts with the authority and stature to act on what they find. This applies throughout the lifecycle, not only at formal test gates.

 

Deployment Strategies as Testing Mechanisms

Testing does not end at the go/no-go decision. How an AI system is deployed affects the organisation’s ability to detect problems early and contain the impact of failures that pre-deployment testing did not anticipate.

Strategy

How It Functions as Testing

Phased rollout

Deploy to a limited user population first and expand gradually based on observed performance. Allows real-world performance validation at limited scale before full exposure. Appropriate for most AI deployments.

Shadow mode

The AI system runs in parallel with existing processes but does not affect outcomes — its outputs are logged and evaluated, but decisions are made by the current process. Allows observation of real-world AI behaviour without risk to current outcomes. Particularly useful before autonomous operation begins.

A/B testing

Run the new AI system alongside the existing process or a baseline model, routing different users to each. Allows direct performance comparison under identical conditions. Requires careful design to avoid differential harm — if the AI system may produce worse outcomes for some users, random assignment raises ethical issues.

Canary deployment

Route a small, controlled percentage of live traffic to the new system while maintaining the existing system for the remainder. Allows detection of production issues before full rollout. Appropriate when gradual validation of production behaviour is needed before committing to full deployment.

 

Documentation Requirements

NIST MEASURE 2.1 requires that test sets, metrics, and details about tools used during TEVV are documented. Testing without documentation is incomplete governance. Documentation enables repeatability, supports independent review, provides the basis for go/no-go decisions, and creates the record that regulators, auditors, and affected parties may examine.

At minimum, document the following for each TEVV activity:

•       Test plan: what will be tested, by whom, using what methods and data, against what acceptance thresholds — completed before testing begins

•       Test data: sources, composition, representativeness, and any known limitations or gaps relative to the production population

•       Test results: metrics and whether they met, fell short of, or exceeded defined thresholds — including results that were unfavourable

•       Limitations: conditions under which results may not generalise — populations, geographies, use cases, or conditions not covered by the test dataset

•       Residual risks: known issues identified through TEVV that were accepted, along with the rationale and the mitigations in place

•       Decisions informed: what go/no-go decisions, design changes, monitoring requirements, or additional testing the TEVV results triggered 

PM Responsibilities by Phase

Planning Phase

•       Define TEVV activities in scope, schedule, and budget — including budget for independent testing that may require external expertise

•       Allocate resources for each dimension of TEVV: performance, fairness, robustness, safety/security, and explainability — not only functional correctness

•       Define acceptance criteria for each trustworthiness dimension before development begins — thresholds set after results are known are not acceptance criteria

•       Identify who will conduct independent review and at what gates — independence must be structural, not asserted

•       Determine whether red-teaming is appropriate for the system’s risk level and, if so, define the composition and scope of the red team

Development Phase

•       Monitor TEVV progress alongside development milestones — TEVV is concurrent with development, not a gate at the end of it

•       Ensure TEVV findings are addressed and documented, not merely recorded — an unfavourable finding that produces no design change or accepted residual risk is a governance gap

•       Escalate findings that affect go/no-go decisions to the appropriate stakeholders before, not after, the deployment decision is made

•       Confirm that test data is documented including its composition, representativeness, and limitations

Deployment Phase

•       Verify that all planned TEVV activities are complete before the go/no-go decision — incomplete testing is a documented residual risk, not an implicit acceptance

•       Confirm that residual risks identified through TEVV are formally accepted by the appropriate decision-maker, not informally deferred

•       Confirm that post-deployment monitoring is in place to detect drift and performance degradation — the transition from TEVV to ongoing monitoring must be explicit

•       If using phased rollout or shadow mode, define the criteria that must be met before expanding deployment scope

Operations Phase

•       Track ongoing performance against the baselines established during pre-deployment TEVV

•       Trigger re-testing when conditions change materially: new user populations, significant data drift, regulatory changes, or identified incidents

•       Include TEVV in change management processes — model updates, retraining, and configuration changes each require regression testing before production deployment

•       Conduct periodic re-evaluation of the system against its original benchmarks and the current regulatory and ethical standards that apply to it 

Right-Sizing for Your Situation

Testing depth should match system risk. A low-risk internal productivity tool requires a different TEVV programme than a high-risk system making consequential decisions about individuals. Not every system needs external red-teaming, but every system needs documented validation of performance and fitness for purpose in its actual deployment context.

Greenfield — AI Testing Playbook

For PMs without formal AI testing processes. Essential tests for each trustworthiness dimension, with simplified approaches for resource-constrained teams — designed to establish a documented TEVV baseline without enterprise infrastructure.

Emerging — AI Testing Playbook

For PMs building repeatable processes. Comprehensive test planning templates, metric definitions, threshold-setting guidance, red-teaming design, and documentation standards for teams building a structured TEVV capability.

Established — AI Testing Playbook

For PMs in organisations with formal governance. How to integrate AI TEVV into existing quality management systems, compliance frameworks, and regulatory requirements — including EU AI Act conformity assessment and sector-specific audit obligations.

Become a member →

 

Framework References

•       NIST AI Risk Management Framework (AI RMF 1.0, NIST AI 100-1) — MEASURE function overview (objective, repeatable, scalable TEVV processes required; metrics and methodologies must adhere to scientific, legal, and ethical norms; independent review improves effectiveness and mitigates internal biases); TEVV lifecycle mapping (distinct tasks for design/planning, development, deployment, and operations phases); AI actor roles (verification and validation actors should be distinct from test and evaluation actors, and ideally distinct from builders)

•       NIST AI RMF Playbook — MEASURE 2.1 (documentation of test sets, metrics, and tools used during TEVV; leveraging model cards and datasheets; regular assessment and updating of measurement tools); MEASURE 2.3 (red-teaming to test systems under adversarial or stress conditions; evaluation of mismatches between claimed and actual performance; countermeasures to increase robustness); MEASURE 2.4 (monitoring of AI system functionality and behaviour in production; hypothesis testing for distribution differences; anomaly detection using control limits)

•       NIST AI 600-1: Generative AI Profile (2024) — MS-2.11-001 (application of use-case-appropriate benchmarks including Bias Benchmark Questions, Real Hateful or Harmful Prompts, Winogender Schemas to quantify bias; documentation of benchmark assumptions and limitations); MS-2.11-002 (fairness assessments measuring performance across demographic groups and subgroups; field testing with subgroup populations; red-teaming with counterfactual prompts; demographic parity, equalized odds, equal opportunity metrics); MS-2.11-003 (direct engagement with impacted communities to identify affected groups); AI red-teaming definition and typology (general public, domain expert, and adversarial red team compositions)

•       Stanford HAI — Validating Claims About AI: A Policymaker’s Guide (2024) — Claim-centred validity framework (content validity, criterion validity, construct validity, external validity, consequential validity); benchmark contamination risk (systems may have encountered test data during training); benchmark performance does not equal real-world performance; three-step validation process: decide object of claim, state claim, review evidence

•       EU AI Act (Official Journal, 12 July 2024) — Article 9 (risk management system requirements for high-risk AI, including testing to ensure appropriate performance throughout lifecycle and against pre-defined metrics); Article 10 (data governance requirements for training, validation, and testing datasets, including representativeness and examination for biases); Article 15 (accuracy, robustness, and cybersecurity requirements including adversarial resilience for high-risk AI systems); Article 17 (quality management system requirements including TEVV procedures); Annex IV (technical documentation requirements including description of testing methods, test data, and test results)

•       PMI Guide to Leading and Managing AI Projects (CPMAI 2025) — Phase III (test strategy development; acceptance criteria definition before development; independence requirements for validation); Phase IV (model validation on held-out data; fairness evaluation; adversarial testing; explanation verification); Phase V (pre-deployment TEVV gate; governance and MLOps readiness assessment; user acceptance testing with realistic scenarios; go/no-go criteria documentation)

 

This article is part of AIPMO’s PM Practice series. See also: AI Risk Registers | Monitoring AI Systems in Production | Human Oversight in AI Systems