|
PM Takeaways |
|
•
NIST
AI RMF MEASURE 2.1 requires that test sets, metrics, and tools used during
TEVV are documented before testing begins — not after. Defining acceptance
thresholds post-hoc, after results are known, undermines the entire
validation process and creates governance exposure. The PM must ensure
thresholds are documented and signed off as part of planning, not
rationalised at the end of a test cycle. |
|
•
Traditional
software testing answers one question: does the system produce the correct
output? AI TEVV must answer five: does it perform accurately, does it perform
equitably across groups, does it degrade gracefully, can it be manipulated,
and can its decisions be explained? A system that passes every traditional
test can still discriminate against protected groups, fail unpredictably on
edge cases, or produce outputs that cannot be explained to the people
affected by them. |
|
•
NIST
AI RMF requires that AI actors carrying out verification and validation tasks
are distinct from those who performed test and evaluation actions — and
ideally distinct from those who built the system. Independence is not a
procedural nicety; it mitigates the blind spots and organisational pressure
that lead builders to unconsciously design tests their systems will pass. For
high-risk systems, external auditors should be engaged for at least the final
validation gate. |
|
•
AI
red-teaming, as defined in NIST AI 600-1, is an evolving practice for
identifying adverse behaviour or outcomes through structured adversarial
exercises. The quality of red-teaming outputs depends heavily on the
diversity and domain expertise of the red team — demographically and
interdisciplinarily diverse teams find different failure modes in different
deployment contexts. Red team composition is a design decision, not a
staffing convenience. |
|
•
TEVV
is not a phase — it is a continuous process across the entire AI lifecycle.
NIST AI RMF maps distinct TEVV tasks to design and planning (validating
assumptions and data), development (model validation and assessment),
deployment (system integration, user experience, compliance), and operations
(ongoing monitoring, incident tracking, SME recalibration). A test plan that
treats TEVV as a pre-deployment gate misses the majority of what the
framework requires. |
Every software project includes testing. AI projects need testing too — but the test plan looks fundamentally different. Traditional software testing asks a single question: does the system produce the specified output given a specified input? AI testing must ask several others: does it perform equitably across different populations? Does it fail gracefully when inputs are unexpected? Can it be deliberately manipulated? Will its outputs still be reliable as the world changes around it?
The NIST AI Risk Management Framework uses the term TEVV — Test, Evaluation, Verification, and Validation — to describe the comprehensive assessment that AI systems require. TEVV is not a testing phase at the end of development; it is a set of activities that run throughout the AI lifecycle, from planning assumptions through post-deployment monitoring. As PM, your job is not to run the tests. It is to ensure TEVV activities are scoped, resourced, independent, and completed — at every stage, not just before go-live.
Why Traditional Testing Isn’t Enough
Traditional software testing centres on functional correctness. Given input X, does the system produce output Y? This matters for AI systems too. But it leaves a set of critical failure modes entirely untested — and those failure modes are the ones most likely to cause harm in AI deployments.
|
What Traditional Testing
Covers |
What AI Systems Also Require |
|
Functional requirements met — correct outputs for defined
inputs |
Performance across subpopulations — does accuracy hold for
all groups, or only in aggregate? |
|
Integration correctness — components working together as
designed |
Equitable outcomes — are error rates and outcome
distributions fair across demographic groups? |
|
Performance under load — response time and throughput at
scale |
Robustness to unexpected inputs — edge cases,
out-of-distribution data, degraded inputs |
|
Security vulnerabilities — known attack vectors and
penetration testing |
Adversarial resilience — deliberate manipulation by
crafted inputs designed to cause failures |
|
Regression — new code doesn’t break existing functionality |
Stability over time — does performance hold as the world
changes and data drifts? |
|
User acceptance — the system meets stated requirements |
Alignment with legal and ethical requirements — bias,
explainability, privacy, human oversight |
A system can pass every traditional test and still discriminate against protected groups, fail unpredictably on inputs slightly outside its training distribution, or produce decisions that cannot be explained to the people they affect. NIST AI RMF MEASURE 2.1 frames the objective: objective, repeatable, or scalable TEVV processes including metrics, methods, and methodologies must be in place, followed, and documented. Processes for independent review improve the effectiveness of testing and mitigate internal biases and potential conflicts of interest.
The TEVV Framework
NIST AI RMF distinguishes four complementary activities. They are not sequential phases — they operate in parallel and recur throughout the lifecycle. Understanding the distinction matters because each activity has different scope, different actors, and different documentation requirements.
|
Activity |
Core Question |
Focus |
|
Test |
Does it work? |
Executing the system against defined test cases to
identify problems and measure performance |
|
Evaluation |
How well does it work? |
Assessing performance against benchmarks, thresholds, and
trustworthiness criteria, including socio-technical factors beyond the
pipeline |
|
Verification |
Did we build it right? |
Confirming the system meets its specifications — the
technical requirements defined before development |
|
Validation |
Did we build the right thing? |
Confirming the system meets real-world needs in its actual
deployment context — not just the requirements as written |
NIST AI RMF 1.0 specifies that AI actors carrying out verification and validation tasks should ideally be distinct from those who perform test and evaluation actions — and both should be distinct from those who built the system. Independence is a structural requirement, not a preference. It mitigates the blind spots and organisational pressure to ship that lead builders to unconsciously design tests their own systems will pass.
TEVV Across the AI Lifecycle
One of the most common TEVV failures is treating it as a pre-deployment gate. NIST AI RMF 1.0 maps distinct TEVV responsibilities to each phase of the AI lifecycle. A test plan that is only active in the final weeks before go-live misses the majority of what the framework requires.
|
Lifecycle Phase |
TEVV Focus |
|
Design and planning |
Validate assumptions about data availability, quality, and
representativeness. Verify that requirements capture trustworthiness
characteristics — fairness, explainability, safety — not just functional
behaviour. Plan test approaches for each dimension before any model
development begins. |
|
Data preparation |
Test data quality, completeness, and lineage. Evaluate
representativeness across populations relevant to the use case. Identify
proxy features that may encode protected characteristics. Document
assumptions and limitations of the dataset for downstream TEVV actors. |
|
Development (model building) |
Validate model performance on held-out data not used in
training. Evaluate fairness metrics disaggregated by relevant demographic
groups. Conduct adversarial testing of model behaviour. Verify that
explanations accurately reflect model behaviour, not post-hoc
rationalisations. |
|
Deployment |
System integration testing in the production environment.
User acceptance testing with realistic scenarios involving actual users and
affected parties. Independent review of test results before final go/no-go.
Recalibration for integration, user experience, and compliance with legal,
regulatory, and ethical specifications. |
|
Operations (ongoing) |
Continuous monitoring of performance metrics against
baselines. Detection of drift and degradation. Incident tracking, analysis,
and response. Periodic re-evaluation against original benchmarks. SME
recalibration as the deployment context evolves. |
What to Test: Five Dimensions of AI TEVV
1. Performance and Accuracy
The foundational dimension — does the system produce correct outputs at acceptable rates? Performance testing for AI systems must go beyond point-in-time accuracy to assess calibration and consistency under varying conditions.
|
Metric |
What It Reveals |
|
Accuracy, precision, recall, F1, AUC |
Overall correctness against a labelled test set — the
baseline performance measure |
|
False positive and false negative rates |
The cost of different error types, which varies
significantly by use case — false negatives in medical screening carry
different consequences than false positives in content moderation |
|
Confidence calibration |
Whether the model’s stated confidence reflects its actual
accuracy — a model that says 90% confident should be correct 90% of the time.
Miscalibrated confidence is a major source of over-reliance. |
|
Performance on held-out data |
Generalisation beyond the training distribution — Stanford
HAI’s validity framework warns that benchmark performance does not always
equal real-world performance, particularly when systems may have encountered
test data during training (benchmark contamination) |
PM critical action: define acceptable performance thresholds — including thresholds for each error type — before testing begins, documented and signed off. Thresholds defined after seeing results are not thresholds; they are rationalisations. NIST MEASURE 2.1 requires that metrics are documented as part of the TEVV plan, not appended afterwards.
2. Fairness and Bias
A system that performs well in aggregate may perform poorly — and cause disproportionate harm — for specific demographic groups. Fairness testing must be disaggregated and must be designed with the actual affected populations in mind.
|
Metric |
What It Reveals |
|
Disaggregated performance by group |
Whether accuracy, error rates, and confidence calibration
hold across demographic groups, geographies, or other segments relevant to
the use case |
|
Demographic parity |
Whether the AI produces positive outcomes at similar rates
across groups, regardless of individual characteristics |
|
Equalized odds / equal opportunity |
Whether true positive and false positive rates are
consistent across groups — a more demanding standard than demographic parity
in many use cases |
|
Denigration and stereotyping prevalence |
For generative AI, whether outputs exhibit systematic
bias, harmful stereotyping, or denigrating content toward particular groups —
NIST AI 600-1 MS-2.11-001 recommends specific benchmarks including Bias
Benchmark Questions and Winogender Schemas |
Fairness testing must include intersectional analysis — not just gender, but gender × age × geography, because harms often concentrate at intersections that aggregate metrics obscure. NIST AI 600-1 MS-2.11-003 requires direct engagement with potentially impacted communities to identify the classes of individuals and groups the system may affect. Affected communities, not only technical teams, should be involved in defining what fairness means in the specific deployment context.
3. Robustness
Does the system handle unexpected inputs gracefully, or does it fail in unpredictable ways when it encounters data outside its training distribution? Robustness testing is particularly important for high-risk AI systems where failures may cause direct harm.
|
Test Type |
What It Surfaces |
|
Edge case testing |
Unusual but legitimate inputs — values at the extreme ends
of distributions, rare but valid combinations of features, inputs that
represent underrepresented populations in the training data |
|
Out-of-distribution testing |
Inputs that differ from the training distribution in ways
the model has not seen — how does the model behave, and does it signal
appropriate uncertainty rather than producing confident but wrong outputs? |
|
Noisy and degraded inputs |
Incomplete data, missing features, corrupted inputs, or
lower-quality data than the training set — common in real-world deployments
where data quality is uncontrolled |
|
Distribution shift simulation |
Testing performance on data from a different time period,
geography, or population than the training data, to assess how quickly
performance degrades as deployment conditions diverge from training
conditions |
4. Safety and Security
Can the system be deliberately manipulated? Can its safety mechanisms be bypassed? Safety and security testing for AI systems requires different techniques from conventional cybersecurity — and different expertise.
|
Test Type |
What It Surfaces |
|
Adversarial input testing |
Inputs deliberately crafted to cause the model to fail —
images with imperceptible perturbations that cause misclassification, text
prompts designed to elicit harmful outputs, inputs that exploit model
architecture weaknesses |
|
Prompt injection testing |
For systems accepting natural language, whether malicious
instructions embedded in user input can override system instructions, bypass
guardrails, or extract protected information |
|
Guardrail and safety mechanism testing |
Whether safety controls actually prevent the failure modes
they are designed to prevent — including whether they can be bypassed through
creative reformulation of harmful requests |
|
Red-teaming |
Structured adversarial exercises conducted by people
independent of the build team. NIST AI 600-1 defines AI red-teaming as
exercises to identify potential adverse behaviour, stress test safeguards,
and find failure modes that standard testing misses. Team diversity —
demographic, disciplinary, and domain expertise — directly affects the
quality of findings. |
NIST MEASURE 2.3 recommends using red-team exercises to actively test systems under adversarial or stress conditions, measure system response, assess failure modes, and evaluate mismatches between claimed and actual system performance. Red-teaming results require additional analysis before incorporation into governance decisions — they are inputs to risk management, not stand-alone verdicts.
5. Explainability
Can the decisions or outputs of the AI system be understood and justified — by operators, users, affected parties, and regulators? Explainability requirements vary significantly by context, and testing must be designed for the specific audiences who need to understand the system’s reasoning.
|
Test Focus |
Questions to Answer |
|
Explanation accuracy |
Do explanations accurately reflect the model’s actual
decision process, or are they post-hoc rationalisations that look plausible
but do not correspond to what the model computed? |
|
Explanation comprehensibility |
Are explanations comprehensible to the specific audiences
who need them — a technical operator, a non-technical user, an affected
individual, a regulator? The same explanation may be appropriate for one
audience and meaningless to another. |
|
Explanation consistency |
Does the system produce similar explanations for similar
decisions? Inconsistent explanations for similar cases undermine both trust
and the utility of explanations for error detection. |
|
Regulatory adequacy |
Do explanations meet the specific explainability
requirements that apply in the deployment jurisdiction and sector — including
the EU AI Act’s transparency requirements for high-risk AI and
sector-specific obligations? |
Testing Approaches
Different testing methods surface different types of problems. Effective TEVV uses a combination of approaches calibrated to the risk level of the system and the deployment context. NIST AI 600-1 distinguishes four main approaches for evaluating AI systems.
|
Approach |
How It Works |
Strengths and Limitations |
|
Benchmark testing |
Compare system performance against standard datasets or
established baselines used across the field |
Reproducible and comparable across systems. Risk of
benchmark contamination — the system may have encountered test data during
training. Benchmark performance does not always equal real-world performance
in the deployment context (Stanford HAI). |
|
Field testing |
Evaluate system behaviour in realistic deployment
conditions with actual users or representative populations |
Reveals issues that controlled testing misses — real-world
data quality, user behaviour, and deployment context. More
resource-intensive. May expose real users to risk if not carefully staged.
Requires informed consent and human subjects protections. |
|
AI red-teaming |
Structured adversarial exercises by people independent of
the build team, aiming to find failure modes and vulnerabilities that
standard testing does not surface |
Identifies failure modes that designed tests cannot find.
Quality depends heavily on team diversity, domain expertise, and depth of
effort. Resource-intensive. Findings require interpretation before
incorporation into governance decisions. |
|
Structured public feedback |
Gather input from users, affected communities, and domain
experts through focus groups, surveys, community advisory boards, and
feedback channels |
Surfaces failure modes and contextual impacts that
technical testing cannot detect. Builds stakeholder trust. Requires careful
design to be actionable and representative. Not a substitute for technical
evaluation. |
Common AI Failure Modes and Testing Responses
Understanding how AI systems characteristically fail helps focus TEVV efforts and calibrate testing depth to the risks that matter most for the deployment context.
|
Failure Mode |
What It Means |
Testing Response |
|
Brittleness |
System fails on inputs slightly different from training
data — small perturbations, slightly different formatting, or
edge-of-distribution values cause disproportionate performance drops |
Edge case testing, out-of-distribution testing, input
perturbation analysis |
|
Embedded bias |
System learns and perpetuates biases present in training
data, producing systematically different outcomes for different demographic
groups even when those differences are not justified by the task |
Disaggregated fairness testing, bias audits using
standardised benchmarks, intersectional analysis, engagement with affected
communities |
|
Catastrophic forgetting |
After retraining on new data, the system loses
capabilities it had on the original training distribution — a silent
regression that aggregate metrics may not detect |
Regression testing against the full original test suite
after every retraining cycle, not just testing on new data |
|
Uncertainty blindness |
System produces high-confidence outputs even when inputs
are ambiguous, out of distribution, or when the model should not be confident
— miscalibration that drives over-reliance on AI recommendations |
Confidence calibration testing; evaluation of model
behaviour on inputs the model should recognise as uncertain |
|
Adversarial vulnerability |
System can be deliberately manipulated by carefully
crafted inputs designed to exploit weaknesses in the model architecture or
training data |
Adversarial input testing, prompt injection testing for
language models, red-teaming with adversarial intent |
|
Distribution shift |
Real-world production data differs from training data in
ways that erode performance gradually and without obvious failure signals |
Field testing, pre-deployment distribution comparison,
post-deployment drift monitoring with defined action thresholds |
Independence and Separation
NIST AI RMF 1.0 is explicit that TEVV is enhanced when processes enable corroboration by independent evaluators. MEASURE 2.1 recommends processes for independent review to improve effectiveness and mitigate internal biases and potential conflicts of interest. This is not a procedural preference — it reflects a structural reality about how testing fails when the people who built a system are also the people who test it.
Builders have genuine blind spots about their own systems. They design tests to validate what they expect the system to do, not to discover what it might do in contexts they did not anticipate. Organisational pressure to ship creates incentives — often unconscious — to define test success in ways the system will meet. And conflicts of interest between development velocity and testing rigour are structural, not personal failures.
|
Independence Level |
What It Means in Practice |
|
Testers separate from developers |
The minimum baseline — the team conducting TEVV should not
have been involved in building the system components they are evaluating.
This applies even within the same organisation. |
|
Independent internal review |
A function within the organisation — risk, compliance,
legal, or a dedicated AI governance team — conducts or reviews TEVV results
independent of the project team. Applicable to medium- and higher-risk
systems. |
|
External audit or assessment |
Third parties with no commercial interest in the system’s
deployment conduct or independently verify TEVV results. Required for
high-risk AI systems under the EU AI Act in some sectors, and recommended by
NIST for systems with significant potential for harm. |
|
Structured challenge process |
NIST AI RMF describes “effective challenge” — a culture
that encourages critical thinking and questioning of important design and
implementation decisions by experts with the authority and stature to act on
what they find. This applies throughout the lifecycle, not only at formal
test gates. |
Deployment Strategies as Testing Mechanisms
Testing does not end at the go/no-go decision. How an AI system is deployed affects the organisation’s ability to detect problems early and contain the impact of failures that pre-deployment testing did not anticipate.
|
Strategy |
How It Functions as Testing |
|
Phased rollout |
Deploy to a limited user population first and expand
gradually based on observed performance. Allows real-world performance
validation at limited scale before full exposure. Appropriate for most AI
deployments. |
|
Shadow mode |
The AI system runs in parallel with existing processes but
does not affect outcomes — its outputs are logged and evaluated, but
decisions are made by the current process. Allows observation of real-world
AI behaviour without risk to current outcomes. Particularly useful before
autonomous operation begins. |
|
A/B testing |
Run the new AI system alongside the existing process or a
baseline model, routing different users to each. Allows direct performance
comparison under identical conditions. Requires careful design to avoid
differential harm — if the AI system may produce worse outcomes for some
users, random assignment raises ethical issues. |
|
Canary deployment |
Route a small, controlled percentage of live traffic to
the new system while maintaining the existing system for the remainder.
Allows detection of production issues before full rollout. Appropriate when
gradual validation of production behaviour is needed before committing to
full deployment. |
Documentation Requirements
NIST MEASURE 2.1 requires that test sets, metrics, and details about tools used during TEVV are documented. Testing without documentation is incomplete governance. Documentation enables repeatability, supports independent review, provides the basis for go/no-go decisions, and creates the record that regulators, auditors, and affected parties may examine.
At minimum, document the following for each TEVV activity:
• Test plan: what will be tested, by whom, using what methods and data, against what acceptance thresholds — completed before testing begins
• Test data: sources, composition, representativeness, and any known limitations or gaps relative to the production population
• Test results: metrics and whether they met, fell short of, or exceeded defined thresholds — including results that were unfavourable
• Limitations: conditions under which results may not generalise — populations, geographies, use cases, or conditions not covered by the test dataset
• Residual risks: known issues identified through TEVV that were accepted, along with the rationale and the mitigations in place
• Decisions informed: what go/no-go decisions, design changes, monitoring requirements, or additional testing the TEVV results triggered
PM Responsibilities by Phase
Planning Phase
• Define TEVV activities in scope, schedule, and budget — including budget for independent testing that may require external expertise
• Allocate resources for each dimension of TEVV: performance, fairness, robustness, safety/security, and explainability — not only functional correctness
• Define acceptance criteria for each trustworthiness dimension before development begins — thresholds set after results are known are not acceptance criteria
• Identify who will conduct independent review and at what gates — independence must be structural, not asserted
• Determine whether red-teaming is appropriate for the system’s risk level and, if so, define the composition and scope of the red team
Development Phase
• Monitor TEVV progress alongside development milestones — TEVV is concurrent with development, not a gate at the end of it
• Ensure TEVV findings are addressed and documented, not merely recorded — an unfavourable finding that produces no design change or accepted residual risk is a governance gap
• Escalate findings that affect go/no-go decisions to the appropriate stakeholders before, not after, the deployment decision is made
• Confirm that test data is documented including its composition, representativeness, and limitations
Deployment Phase
• Verify that all planned TEVV activities are complete before the go/no-go decision — incomplete testing is a documented residual risk, not an implicit acceptance
• Confirm that residual risks identified through TEVV are formally accepted by the appropriate decision-maker, not informally deferred
• Confirm that post-deployment monitoring is in place to detect drift and performance degradation — the transition from TEVV to ongoing monitoring must be explicit
• If using phased rollout or shadow mode, define the criteria that must be met before expanding deployment scope
Operations Phase
• Track ongoing performance against the baselines established during pre-deployment TEVV
• Trigger re-testing when conditions change materially: new user populations, significant data drift, regulatory changes, or identified incidents
• Include TEVV in change management processes — model updates, retraining, and configuration changes each require regression testing before production deployment
• Conduct periodic re-evaluation of the system against its original benchmarks and the current regulatory and ethical standards that apply to it
Right-Sizing for Your Situation
Testing depth should match system risk. A low-risk internal productivity tool requires a different TEVV programme than a high-risk system making consequential decisions about individuals. Not every system needs external red-teaming, but every system needs documented validation of performance and fitness for purpose in its actual deployment context.
|
Greenfield
— AI Testing Playbook For PMs
without formal AI testing processes. Essential tests for each trustworthiness
dimension, with simplified approaches for resource-constrained teams —
designed to establish a documented TEVV baseline without enterprise
infrastructure. |
|
Emerging
— AI Testing Playbook For PMs
building repeatable processes. Comprehensive test planning templates, metric
definitions, threshold-setting guidance, red-teaming design, and
documentation standards for teams building a structured TEVV capability. |
|
Established
— AI Testing Playbook For PMs
in organisations with formal governance. How to integrate AI TEVV into
existing quality management systems, compliance frameworks, and regulatory
requirements — including EU AI Act conformity assessment and sector-specific
audit obligations. |
Framework References
• NIST AI Risk Management Framework (AI RMF 1.0, NIST AI 100-1) — MEASURE function overview (objective, repeatable, scalable TEVV processes required; metrics and methodologies must adhere to scientific, legal, and ethical norms; independent review improves effectiveness and mitigates internal biases); TEVV lifecycle mapping (distinct tasks for design/planning, development, deployment, and operations phases); AI actor roles (verification and validation actors should be distinct from test and evaluation actors, and ideally distinct from builders)
• NIST AI RMF Playbook — MEASURE 2.1 (documentation of test sets, metrics, and tools used during TEVV; leveraging model cards and datasheets; regular assessment and updating of measurement tools); MEASURE 2.3 (red-teaming to test systems under adversarial or stress conditions; evaluation of mismatches between claimed and actual performance; countermeasures to increase robustness); MEASURE 2.4 (monitoring of AI system functionality and behaviour in production; hypothesis testing for distribution differences; anomaly detection using control limits)
• NIST AI 600-1: Generative AI Profile (2024) — MS-2.11-001 (application of use-case-appropriate benchmarks including Bias Benchmark Questions, Real Hateful or Harmful Prompts, Winogender Schemas to quantify bias; documentation of benchmark assumptions and limitations); MS-2.11-002 (fairness assessments measuring performance across demographic groups and subgroups; field testing with subgroup populations; red-teaming with counterfactual prompts; demographic parity, equalized odds, equal opportunity metrics); MS-2.11-003 (direct engagement with impacted communities to identify affected groups); AI red-teaming definition and typology (general public, domain expert, and adversarial red team compositions)
• Stanford HAI — Validating Claims About AI: A Policymaker’s Guide (2024) — Claim-centred validity framework (content validity, criterion validity, construct validity, external validity, consequential validity); benchmark contamination risk (systems may have encountered test data during training); benchmark performance does not equal real-world performance; three-step validation process: decide object of claim, state claim, review evidence
• EU AI Act (Official Journal, 12 July 2024) — Article 9 (risk management system requirements for high-risk AI, including testing to ensure appropriate performance throughout lifecycle and against pre-defined metrics); Article 10 (data governance requirements for training, validation, and testing datasets, including representativeness and examination for biases); Article 15 (accuracy, robustness, and cybersecurity requirements including adversarial resilience for high-risk AI systems); Article 17 (quality management system requirements including TEVV procedures); Annex IV (technical documentation requirements including description of testing methods, test data, and test results)
• PMI Guide to Leading and Managing AI Projects (CPMAI 2025) — Phase III (test strategy development; acceptance criteria definition before development; independence requirements for validation); Phase IV (model validation on held-out data; fairness evaluation; adversarial testing; explanation verification); Phase V (pre-deployment TEVV gate; governance and MLOps readiness assessment; user acceptance testing with realistic scenarios; go/no-go criteria documentation)
This article is part of AIPMO’s PM Practice series. See also: AI Risk Registers | Monitoring AI Systems in Production | Human Oversight in AI Systems