PM Takeaways

• AI systems are not static software — they degrade in production through data drift, concept drift, and feedback loops even when no code changes are made. NIST AI RMF MANAGE 2.2 is explicit that performance and trustworthiness may evolve and shift once a system is deployed, and that regular monitoring is required to detect and respond to drift before accumulated harm becomes visible.

• Monitoring must cover six distinct dimensions — performance metrics, data quality, fairness metrics, operational health, human oversight behaviour, and user feedback. Quantitative performance metrics alone are insufficient: NIST MEASURE 2.4 requires monitoring of AI system functionality and behaviour as identified during risk mapping, and fairness metrics must be disaggregated by group, not reported only in aggregate.

• Drift is the most common and least visible failure mode in deployed AI. The three types — data drift (covariate shift), concept drift (changed input-outcome relationships), and label drift (shifted outcome distributions) — require different detection approaches and different intervention responses. Define statistical thresholds that trigger warning, action, and critical responses before deployment, not after drift is discovered.

• Under the EU AI Act Article 72, providers of high-risk AI systems are required to establish post-market monitoring systems that actively and systematically collect, document, and analyse data on system performance throughout the lifetime of the system — and serious incidents must be reported to authorities. This obligation begins at deployment, which means the monitoring plan must be part of the technical documentation completed before go-live.

• Project closure is the wrong concept for AI projects. The PMI–CPMAI Phase VI framework treats ongoing monitoring, version control, retraining pipelines, and contingency planning as core operational requirements, not post-project afterthoughts. PMs must define who owns monitoring, how model updates are governed and funded, and what triggers re-engagement of the project team — before the project formally closes.
Traditional software projects treat deployment as the finish line. Hand off to operations, close the project, move on. AI projects don’t work that way. The system that passed all your pre-deployment tests will change in production — sometimes slowly, sometimes dramatically — even if no one touches the code.
Post-deployment monitoring is not a best practice that resource-constrained teams can defer. For high-risk AI systems, it is a regulatory requirement under the EU AI Act. For all AI systems, it is the only mechanism by which you can know whether the system is still doing what it was designed to do. NIST AI RMF MANAGE 2.2 frames it plainly: regular monitoring of AI systems’ performance and trustworthiness enhances organisations’ ability to detect and respond to drift, and thus sustain an AI system’s value once deployed.
Why AI Systems Need Continuous Monitoring
Traditional software behaves deterministically: given the same inputs, it produces the same outputs. If it worked yesterday, it works today, absent a code change or infrastructure failure. AI systems operate differently, and PMs who treat them as conventional software create invisible risk.
| Failure Mode | Why It Matters |
| --- | --- |
| The world changes | Production data reflects current conditions — customer behaviour, market dynamics, regulatory context. A model trained on last year's patterns may be confidently wrong about this year's reality, with no obvious failure signal. |
| Data drifts | The statistical properties of incoming data change over time. Features that were predictive become less so. New patterns appear that the model has never encountered. The system continues to produce outputs, but their basis is increasingly mismatched to what the model learned. |
| Performance degrades gradually | Models lose accuracy incrementally, without obvious failures. By the time someone notices, significant harm may have accumulated across many decisions. NIST AI RMF MANAGE 2.2 identifies this gradual degradation — drift — as the core risk that monitoring must address. |
| New risks emerge | Users find unexpected applications or misapplications. Edge cases appear that pre-deployment testing did not anticipate. Adversarial actors probe for exploitable patterns. These risks are only discoverable in production, not in a test environment. |
| Feedback loops amplify problems | AI systems that influence behaviour can change the data they are subsequently trained on, creating self-reinforcing patterns. A content recommendation system that amplifies engagement can shift user behaviour in ways that then appear in retraining data, compounding the effect. |
What to Monitor
NIST MEASURE 2.4 requires that the functionality and behaviour of the AI system and its components — as identified during risk mapping — are monitored in production. Effective monitoring covers six distinct dimensions. Tracking only performance metrics is insufficient.
Performance Metrics
Track the metrics defined during development and watch for divergence from the pre-deployment baseline. Without a documented baseline, you have no reference point for detecting degradation.
| Metric Type | What to Track |
| --- | --- |
| Accuracy metrics | Precision, recall, F1, AUC — measured against the pre-deployment baseline and monitored for directional trend, not just point-in-time value |
| Error rates | False positives and false negatives, reported overall and by segment — aggregate error rates can mask deterioration in specific subpopulations |
| Confidence distribution | Whether confidence scores remain calibrated to actual accuracy — a model can become overconfident as conditions shift |
| Prediction distribution | Whether the distribution of outputs has shifted unexpectedly — a sudden change in the proportion of positive predictions is a signal worth investigating even before accuracy data is available |
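As a concrete sketch of baseline comparison — function names, metrics, and the tolerance value are illustrative, not prescribed by any framework cited here — a team could compute precision and recall over a rolling production window and flag divergence from the documented pre-deployment baseline:

```python
# Illustrative sketch: compare a rolling production window against the
# documented pre-deployment baseline. The 0.05 tolerance is a placeholder
# a team would set during planning, not a recommended value.

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def degradation_flags(window_metrics, baseline, tolerance=0.05):
    """Flag each metric that has dropped more than `tolerance` below baseline."""
    return {name: window_metrics[name] < baseline[name] - tolerance
            for name in baseline}
```

Evaluating the flags over successive windows gives the directional trend the table calls for, rather than a single point-in-time comparison.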
Data Quality
Monitoring outputs without monitoring inputs means you will detect problems late — after degraded inputs have already produced degraded outputs at scale.
| Metric Type | What to Track |
| --- | --- |
| Input drift | Statistical properties of incoming features compared to the training data distribution — the reference point against which the model was validated |
| Feature drift | Individual features changing distribution, which may affect specific model pathways even when aggregate input statistics look stable |
| Missing data rate | Increase in null or missing values in critical features, which the model may handle in unexpected ways if missingness was rare in training data |
| Data volume | Unexpected changes in throughput or record counts, which can indicate upstream pipeline problems or changes in the user population |
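One widely used input-drift measure is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against the training reference. A minimal sketch follows; the bin count and the conventional 0.1/0.25 interpretation thresholds are rules of thumb, not normative values:

```python
# Illustrative PSI sketch. Bin edges are derived from the training
# (reference) sample; by convention, PSI below 0.1 is read as stable,
# 0.1-0.25 as moderate shift, and above 0.25 as significant shift.
import math

def psi(reference, current, bins=10):
    """Population Stability Index between reference and current samples."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0] = float("-inf")   # catch production values below the training range
    edges[-1] = float("inf")   # ... and above it

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        n = len(sample)
        # small epsilon avoids log(0) for empty bins
        return [max(c / n, 1e-6) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Running this per feature on a schedule, against a frozen training snapshot, covers both the input-drift and feature-drift rows of the table.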
Fairness Metrics
Overall performance can improve while performance for a specific demographic group deteriorates. Fairness monitoring must be disaggregated — aggregate metrics obscure differential impact by design.
| Metric Type | What to Track |
| --- | --- |
| Disaggregated performance | Accuracy, error rate, and related metrics broken out by demographic group, geography, or other segments relevant to the use case |
| Outcome distribution | Whether positive and negative outcomes are distributed across groups as expected relative to the pre-deployment baseline and any regulatory requirements |
| Error distribution | Whether certain groups are experiencing disproportionate rates of false positives or false negatives — error type matters as much as error rate in many use cases |
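To make the disaggregation concrete, the sketch below (field names and group labels are hypothetical) computes false positive and false negative rates per group rather than in aggregate — the per-group split is exactly what an aggregate error rate hides:

```python
# Illustrative sketch: per-group error rates. Record field names
# ('group', 'y_true', 'y_pred') are assumptions for this example.
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of dicts with 'group', 'y_true', 'y_pred' keys.
    Returns per-group false positive rate (fpr) and false negative rate (fnr)."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for r in records:
        c = counts[r["group"]]
        if r["y_true"] == 1:
            c["pos"] += 1
            if r["y_pred"] == 0:
                c["fn"] += 1          # missed positive
        else:
            c["neg"] += 1
            if r["y_pred"] == 1:
                c["fp"] += 1          # false alarm
    return {
        g: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else 0.0,
            "fnr": c["fn"] / c["pos"] if c["pos"] else 0.0,
        }
        for g, c in counts.items()
    }
```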
Operational Metrics
System health and AI performance are not independent. Infrastructure degradation can manifest as model degradation, and the two must be distinguished to identify the correct intervention.
| Metric Type | What to Track |
| --- | --- |
| Latency | Response time changes that may indicate infrastructure problems, model complexity increases from updates, or serving pipeline issues |
| Throughput | Unexpected volume changes that may indicate upstream system changes or shifts in the user population hitting the model |
| Resource utilisation | CPU, memory, and storage trends — PMI–CPMAI Phase VI notes that AI workloads are more variable and hardware-intensive than conventional software and require dedicated resource monitoring |
| Error logs | System errors, exceptions, and timeouts that may precede or accompany model performance degradation |
Human Oversight Metrics
How humans interact with an AI system is itself a monitoring signal. Changing override rates or escalation volumes often surface model degradation before quantitative metrics do — because human reviewers see the outputs directly.
| Metric Type | What to Track |
| --- | --- |
| Override rate | How often human reviewers reject or modify system recommendations — a rising override rate is a strong early signal of model degradation |
| Override patterns | Whether overrides are concentrated in particular segments, use cases, or output types, which points toward where the model is failing |
| Escalation volume | Whether more cases are being escalated for human review, which may indicate reviewers losing confidence in model outputs |
| Reviewer response time | Whether human reviewers are keeping pace with escalations — a bottleneck in human oversight can mean AI-driven decisions are going unreviewed longer than intended |
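One way to turn the override rate into an alertable signal is a one-sided two-proportion z-test against a baseline period, so that normal sampling noise does not trigger alerts. This is an illustrative sketch; the 1.645 critical value corresponds to a 5% one-sided significance level, and a team would choose its own level:

```python
# Illustrative sketch: is the current override rate significantly higher
# than the baseline? One-sided two-proportion z-test with a pooled
# standard error; the default critical value assumes alpha = 0.05.
import math

def override_rate_elevated(base_overrides, base_total,
                           cur_overrides, cur_total, z_critical=1.645):
    """True if the current override rate is statistically above baseline."""
    p_base = base_overrides / base_total
    p_cur = cur_overrides / cur_total
    pooled = (base_overrides + cur_overrides) / (base_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    z = (p_cur - p_base) / se
    return z > z_critical
```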
User and Stakeholder Feedback
Quantitative metrics do not capture everything. User and stakeholder feedback surfaces failure modes that automated monitoring cannot detect — edge cases, contextual problems, and harms experienced by people subject to AI decisions. NIST AI RMF MANAGE 4.2 includes feedback from affected parties as a required input to continual improvement.
| Metric Type | What to Track |
| --- | --- |
| User complaints | Volume, frequency trend, and nature of reported issues — categorise by failure type to identify patterns |
| Support tickets | AI-related support requests, which often surface usability failures and mismatched expectations before they appear in performance metrics |
| User sentiment | Survey results and feedback form data, tracked as a trend rather than a point-in-time snapshot |
| Affected party reports | Concerns and complaints from people subject to AI-driven decisions, particularly in high-risk contexts where the AI Act requires accessible redress mechanisms |
Detecting Drift
Drift is the most common and least visible failure mode in deployed AI systems. Performance degrades gradually, without obvious failures, until accumulated harm becomes apparent. NIST AI RMF MANAGE 2.2 identifies monitoring and maintenance procedures for drift and decontextualisation as core operational requirements.
Three Types of Drift
Each type of drift has different causes, different detection approaches, and different appropriate interventions. Conflating them leads to wrong diagnoses and ineffective responses.
| Drift Type | What It Means and Why It Matters |
| --- | --- |
| Data drift (covariate shift) | The statistical distribution of input features changes, but the underlying relationship between features and outcomes remains the same. The model may still work correctly — but it is now operating on data that looks different from what it was trained on, and performance on that new distribution may be untested. |
| Concept drift | The relationship between inputs and the target outcome changes. What the model learned is no longer true. A fraud detection model trained before a new fraud pattern emerged will not recognise the new pattern, regardless of how stable the input distribution is. Concept drift is the most dangerous type because the model continues to produce confident outputs that are based on outdated relationships. |
| Label drift | The distribution of outcomes changes. If you are predicting loan default and macroeconomic conditions shift default rates significantly, the model's calibration becomes invalid even if it is detecting the right signals. Label drift often accompanies concept drift and can go undetected when ground truth labels are delayed. |
Detection Approaches
| Approach | How It Works |
| --- | --- |
| Statistical tests | Compare distributions of current data against the training data reference using tests such as Kolmogorov-Smirnov (continuous features) or chi-squared (categorical features). NIST MEASURE 2.4 recommends hypothesis testing or domain expertise to measure monitored distribution differences. |
| Drift detection algorithms | Specialised algorithms designed to detect distributional changes in data streams, such as ADWIN (Adaptive Windowing) or DDM (Drift Detection Method), that provide statistically grounded change-point detection. |
| Performance monitoring against ground truth | Track accuracy against labelled ground truth when labels are available with reasonable latency. This is the most direct drift signal but requires a feedback pipeline to collect outcomes. |
| Proxy metrics | When ground truth labels are delayed (loan outcomes, health outcomes), monitor leading indicators that correlate with eventual performance — confidence score distributions, prediction distribution shifts, or upstream behavioural signals. |
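As an illustration of the first approach, the two-sample Kolmogorov-Smirnov statistic can be computed directly. Production code would more likely call `scipy.stats.ks_2samp`; this hand-rolled version keeps the example self-contained, and the 0.1 decision threshold is a placeholder, since teams typically calibrate thresholds per feature:

```python
# Illustrative two-sample KS drift check for a continuous feature.
import bisect

def ks_statistic(reference, current):
    """Maximum vertical distance between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)

    def ecdf(sample, v):
        # fraction of the sample with values <= v
        return bisect.bisect_right(sample, v) / len(sample)

    points = sorted(set(ref) | set(cur))
    return max(abs(ecdf(ref, v) - ecdf(cur, v)) for v in points)

def drifted(reference, current, threshold=0.1):
    """Placeholder decision rule; real thresholds are set per feature."""
    return ks_statistic(reference, current) > threshold
```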
Setting Response Thresholds
Not all drift requires the same response. Define thresholds before deployment, not when drift is discovered under operational pressure.
• Warning threshold: Increase monitoring frequency, initiate investigation, notify stakeholders. No immediate intervention to the system.
• Action threshold: Intervention required — model refresh, retraining with updated data, configuration adjustment, or rollback to a previous version.
• Critical threshold: Immediate response — pause the system, activate manual fallback, halt consequential decisions until the root cause is identified and addressed.
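The three tiers can be wired directly into alerting logic. The sketch below uses placeholder cut-offs on a generic drift score; real thresholds are metric-specific and, as argued above, set during planning rather than under operational pressure:

```python
# Placeholder cut-offs; real thresholds are defined per metric during planning.
WARNING, ACTION, CRITICAL = 0.10, 0.25, 0.50

def response_tier(drift_score):
    """Map a monitored drift score onto the pre-defined response tiers."""
    if drift_score >= CRITICAL:
        return "critical"  # pause system, activate manual fallback
    if drift_score >= ACTION:
        return "action"    # retrain, adjust configuration, or roll back
    if drift_score >= WARNING:
        return "warning"   # investigate and increase monitoring frequency
    return "normal"
```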
Incident Response
When monitoring detects a problem, a pre-defined response process is essential. AI incidents that surface under operational pressure without a documented process result in inconsistent containment, poor documentation, and missed regulatory reporting obligations.
What Constitutes an AI Incident
The OECD’s definition of AI incidents encompasses events where AI systems contribute to harm — including injury to health (including psychological harm), disruption of critical infrastructure, violations of human rights or law, or harm to property, communities, or the environment. Incidents can be acute (a single harmful decision) or systemic (accumulated harm across many decisions, each individually below the threshold of obvious concern). NIST AI RMF MANAGE 2.2 requires treatment and response plans for incidents, negative impacts, and outcomes to be established and regularly reviewed.
Incident Response Process
| Element | Description | PM Responsibility |
| --- | --- | --- |
| Detection | How incidents are identified — automated monitoring alerts, human reviewer escalations, user reports, or external reports from affected parties | Define detection channels at project close; ensure all are active before deployment |
| Triage | Assess severity and urgency: is harm ongoing? Is it systemic or isolated? Is regulatory reporting triggered? | Define severity criteria in advance; ambiguous cases default to the higher severity tier |
| Containment | Stop ongoing harm: pause the system, revert to a fallback process, increase human oversight for affected decision types, or restrict scope of AI-driven decisions | Ensure fallback processes are tested before deployment, not designed during an incident |
| Investigation | Determine root cause: data drift, concept drift, training data issue, infrastructure failure, edge case, or adversarial input | Assign investigation ownership at project closure; data science team must remain reachable post-deployment |
| Remediation | Fix the identified problem: retrain with corrected data, adjust thresholds, patch infrastructure, update guardrails, or decommission if no fix is feasible | Document the remediation decision and the rationale, including any tradeoffs accepted |
| Communication | Notify affected parties, internal stakeholders, and — where required — regulators. Under the EU AI Act, serious incidents involving high-risk AI must be reported to authorities without undue delay | Know regulatory notification timelines before an incident occurs; 15 days for serious incidents under the AI Act |
| Post-mortem | Analyse root cause, assess whether monitoring thresholds were appropriate, update response procedures, and incorporate lessons into model governance | Post-mortem findings should feed back into the project's risk register and monitoring design |
Documentation Requirements
Incident documentation is both an internal governance requirement and, for high-risk AI, a regulatory one. Track all reported errors, near-misses, and negative impacts; the response actions taken and by whom; root cause analysis findings; preventive measures implemented; and communications to affected parties. NIST AI RMF MANAGE 4.2 requires that the basis for decisions made relative to tradeoffs between trustworthy characteristics, system risks, and system opportunities is documented.
Change Management for AI Systems
AI systems change over time — through retraining, fine-tuning, data updates, architecture changes, or configuration adjustments. Each change introduces risk. A change that improves aggregate performance may degrade fairness metrics. A data update that incorporates recent patterns may introduce new biases. PMI–CPMAI Phase VI treats model versioning, retraining pipelines, and controlled deployment as core operational governance requirements.
Version Control
Maintain explicit versioning across all components of the AI system, and connect each deployed version to the artefacts it was built from. Without this, root cause analysis after a production problem becomes guesswork.
• Model weights and architecture — including the specific training run that produced them
• Training data and preprocessing pipelines — versioned by dataset snapshot, not just file path
• Configuration and hyperparameters — the settings used for the deployed version, not just the latest configuration file
• Guardrails and safety mechanisms — which thresholds and constraints are active in the deployed system
• Dependencies — versions of ML frameworks, libraries, and infrastructure components in the production environment
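The bullets above can be captured in a single version manifest stored alongside each deployed model. This is a hypothetical sketch — the field names and registry convention are illustrative, not taken from any cited framework:

```python
# Hypothetical version manifest tying a deployed model to the artefacts
# it was built from; typically stored with the model registry entry.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ModelVersionManifest:
    model_version: str
    training_run_id: str    # the specific training run that produced the weights
    dataset_snapshot: str   # versioned snapshot identifier, not a mutable path
    config_hash: str        # hash of the deployed hyperparameters/configuration
    guardrail_version: str  # thresholds and safety constraints active in production
    dependencies: dict = field(default_factory=dict)  # framework/library versions

    def to_json(self):
        """Serialise deterministically for storage and later audit comparison."""
        return json.dumps(asdict(self), sort_keys=True)
```

With such a manifest attached to every deployment, root cause analysis can start from the exact artefacts behind the version that misbehaved.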
Change Triggers
Define in advance what triggers a model update cycle. Unplanned updates driven by ad hoc observations are less rigorous and harder to govern than updates triggered by documented criteria.
| Trigger Type | Example Criteria |
| --- | --- |
| Performance degradation | Accuracy falls below the action threshold defined in the monitoring plan; fairness metrics exceed acceptable differential across groups |
| Drift detection | Statistical tests confirm drift beyond the action threshold in one or more critical input features |
| New data availability | A scheduled retraining cycle runs on a predetermined cadence, using accumulated production data that passed quality checks |
| Regulatory or policy change | A regulatory update affects what the model is permitted to use as an input or how outputs must be explained |
| Identified bias or fairness issue | An audit, user report, or monitoring alert surfaces a fairness problem not detected during pre-deployment testing |
Champion-Challenger Testing
Before replacing a production model, evaluate the replacement under production conditions. Run the new model (“challenger”) in parallel with the current production model (“champion”), comparing outputs on live traffic without routing consequential decisions through the challenger until it has met acceptance criteria. This approach surfaces problems that staging environments do not replicate, and it creates a documented performance comparison that supports the deployment decision.
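A minimal sketch of the shadow pattern follows (function names are illustrative): both models score the same live traffic, only the champion's output is served, and the agreement rate is logged as one input to the acceptance decision:

```python
# Illustrative champion-challenger (shadow) comparison. The challenger's
# outputs are logged for comparison but never returned to callers.

def shadow_compare(requests, champion, challenger):
    """Run both models on the same traffic; serve only champion outputs.
    Returns the served decisions and an output-agreement rate."""
    served, agree = [], 0
    for x in requests:
        champ_out = champion(x)
        chall_out = challenger(x)   # logged only, never acted on
        served.append(champ_out)
        agree += int(champ_out == chall_out)
    agreement = agree / len(requests) if requests else 1.0
    return served, agreement
```

Agreement rate alone is not an acceptance criterion — disagreements still need review against ground truth or expert judgement — but logging it per segment shows where the challenger behaves differently.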
Decommissioning
Sometimes the right governance decision is to turn the system off. NIST AI RMF MANAGE 2.4 addresses superseding, disengaging, or deactivating AI systems that demonstrate performance or outcomes inconsistent with intended use — and the playbook notes that responsibilities for decommissioning must be assigned and understood before the situation requires them to be exercised.
When to Decommission
• Performance cannot be restored to acceptable levels through retraining, data updates, or architectural changes
• Residual risks exceed the benefits of the system, and risk mitigation options have been exhausted
• Regulatory requirements that apply to the system cannot be met with the current architecture or data
• The business need the system was designed to address has changed materially
• A substantially better alternative exists and the transition can be managed without harm to current users
Decommissioning Considerations
| Consideration | Key Questions |
| --- | --- |
| Data retention | What training data, model artefacts, and decision logs must be retained? For how long, and under what legal obligations? |
| Data security | How will you prevent unauthorised access to decommissioned systems, training data, and logs that may contain personal data? |
| Downstream dependencies | What other systems, processes, or workflows depend on this AI's outputs? All dependencies must be identified and migrated or deactivated. |
| User and affected party communication | How and when will you inform users, affected parties, and — where required — regulators of the decommissioning? |
| Transition plan | What replaces the AI system? If manual processes resume, have they been tested? If an alternative system is deployed, has it completed its own governance process? |
Regulatory Requirements
EU AI Act — Article 72
The EU AI Act establishes mandatory post-market monitoring obligations for providers of high-risk AI systems. Article 72 requires providers to establish and document a post-market monitoring system that actively and systematically collects, documents, and analyses relevant data on the performance of high-risk AI systems throughout their lifetime. The monitoring system must be proportionate to the nature of the AI technologies and the risks involved, and it must enable the provider to evaluate continuous compliance with the requirements of the Act. Serious incidents must be reported to the relevant national market surveillance authority without undue delay — within 15 days for incidents that present a risk to health, safety, or fundamental rights.
The post-market monitoring plan must be included in the technical documentation completed before the system is placed on the market. This means monitoring is not a post-deployment concern: it must be designed and documented as part of the project, before go-live.
NIST AI RMF
The NIST AI RMF addresses ongoing monitoring across the MANAGE function. MANAGE 2.2 requires mechanisms to sustain the value of deployed AI systems through regular monitoring of performance and trustworthiness. MANAGE 2.4 requires that mechanisms are in place to supersede, disengage, or deactivate AI systems performing inconsistently with intended use. MANAGE 3.1 requires that risks be regularly monitored with documented controls applied. MANAGE 4.2 requires that continual improvement processes be integrated into AI system updates, including feedback from all relevant AI actors. Together these constitute a continuous governance cycle, not a periodic audit.
Sector-Specific Requirements
Healthcare, financial services, and other regulated sectors layer additional monitoring and reporting requirements on top of horizontal AI regulation. Financial services regulators have long-standing model risk management expectations (SR 11-7, OCC Comptroller’s Handbook) that treat ongoing monitoring as a core model governance obligation. Healthcare AI faces increasingly specific requirements for post-market clinical follow-up and adverse event reporting. PMs in regulated sectors must map both horizontal AI regulation and sector-specific obligations before finalising the monitoring plan.
PM Responsibilities by Phase
During Planning
• Define monitoring requirements in scope, including the specific metrics, thresholds, and cadences that will govern the live system
• Identify the six monitoring dimensions — performance, data quality, fairness, operational health, human oversight behaviour, and user feedback — and confirm coverage for each
• Budget for monitoring infrastructure, tooling, and the ongoing staff time required to review reports, investigate alerts, and manage incidents
• Plan incident response capabilities before deployment, including roles, escalation paths, regulatory notification timelines, and fallback procedures
• Include the post-market monitoring plan in the technical documentation for high-risk AI systems, as required by EU AI Act Article 72
At Deployment
• Verify that monitoring infrastructure is fully operational before traffic is routed to the live system — monitoring deployed after go-live creates a gap that is difficult to retroactively close
• Confirm that baselines are established and documented for all metrics — without a baseline, drift detection has no reference point
• Train the operations team on monitoring tools, alert interpretation, and escalation procedures
• Conduct a live test of the incident response procedure before the system handles consequential decisions
Post-Deployment (Ongoing)
• Review monitoring reports on the cadence appropriate to the system’s risk level — weekly for high-risk systems, monthly for lower-risk systems, with continuous automated alerting for all
• Ensure incidents are tracked, documented, and addressed with root cause analysis — a pattern of near-misses with no documented response is a governance failure
• Trigger model refresh or retraining when thresholds are exceeded according to the pre-defined criteria, not on an ad hoc basis
• Update monitoring requirements as risks, regulatory requirements, and the user population evolve
Project Closure Considerations
Traditional project closure happens at deployment. AI project governance requires a different model. Before formally closing an AI project, define and document:
• Who owns ongoing monitoring, and what authority they have to trigger interventions including system pause or decommissioning
• How model updates are proposed, evaluated, approved, funded, and deployed — including who must sign off on changes to high-risk systems
• What conditions trigger re-engagement of the original project team — significant retraining, architectural changes, or regulatory changes that require a full re-assessment
• How long post-deployment support from the build team lasts and what the handover criteria are for the operations team to manage independently
Right-Sizing for Your Situation
Monitoring depth should match system risk. A low-risk internal productivity tool needs different monitoring infrastructure than a high-risk system making consequential decisions about individuals. The EU AI Act’s proportionality principle reflects this: monitoring must be proportionate to the nature of the technologies and the risks involved.
• Greenfield — AI Monitoring Playbook: For PMs without formal monitoring infrastructure. Essential metrics, simple drift detection approaches, and basic incident tracking without enterprise tooling — designed for teams starting from a lightweight operational baseline.

• Emerging — AI Monitoring Playbook: For PMs building repeatable processes. Comprehensive monitoring framework design, threshold-setting guidance, drift detection selection, and incident response templates for teams building a structured operational capability.

• Established — AI Monitoring Playbook: For PMs in organisations with formal governance. How to integrate AI monitoring with existing operational monitoring, incident management, and compliance frameworks — including EU AI Act Article 72 post-market monitoring plan requirements.
Framework References
• EU AI Act (Official Journal, 12 July 2024) — Article 72 (post-market monitoring obligations for providers of high-risk AI systems: systematic data collection and analysis throughout system lifetime; post-market monitoring plan as required component of technical documentation; proportionality principle); Article 73 (serious incident reporting obligations; 15-day reporting timeline for incidents presenting risk to health, safety, or fundamental rights); Annex IV (technical documentation requirements including post-market monitoring plan)
• NIST AI Risk Management Framework (AI RMF 1.0, NIST AI 100-1) — MANAGE 2.2 (mechanisms to sustain value of deployed AI systems through monitoring of performance, trustworthiness, and drift; risk response options: avoid, accept, mitigate, transfer); MANAGE 2.4 (superseding, disengaging, or deactivating AI systems performing inconsistently with intended use; responsibilities must be assigned and understood); MANAGE 3.1 (regular monitoring of third-party and internal AI risks with documented controls); MANAGE 4.2 (continual improvement processes integrated into AI system updates; feedback from all relevant AI actors as required input)
• NIST AI RMF Playbook — MANAGE 2.2 suggested actions (establishing risk controls considering trustworthiness characteristics; procedures for monitoring drift and decontextualisation; decommissioning systems that exceed risk tolerances); MANAGE 2.2 transparency documentation (post-deployment testing methodology, metrics, and performance outcomes; accessibility of information to external stakeholders); MANAGE 2.4 suggested actions (contingency verification for mission-critical systems; deactivation mechanisms); MANAGE 4.2 transparency documentation (user and stakeholder engagement in model development and regular performance review; ability of affected parties to test and provide feedback); MEASURE 2.4 (monitoring of AI system functionality and behaviour in production; hypothesis testing and domain expertise for distribution differences; anomaly detection using control limits and confidence intervals)
• PMI Guide to Leading and Managing AI Projects (CPMAI 2025) — Phase V (governance and MLOps readiness assessment: model drift detection systems, statistical baselines, real-time dashboards, alert protocols, and designated ownership for drift management; fixing gaps post-deployment is far more costly than addressing them proactively); Phase VI (continuous monitoring and versioning as core operational requirements; automated retraining pipelines; version control for model artefacts, data history, and configurations; risk mitigation and contingency planning for data drift, unexpected behaviours, and regulatory changes)
• AIGP Body of Knowledge v1.0.0 — Domain IV (post-deployment risk management; incident documentation and reporting obligations; model risk management integration with AI governance frameworks); Domain V (regulatory compliance monitoring; sector-specific requirements for healthcare and financial services AI; integration of horizontal AI regulation with sector regulator expectations)
• Federal Reserve SR 11-7 / OCC Comptroller’s Handbook: Model Risk Management — (ongoing monitoring as core model governance obligation in financial services; performance monitoring against established benchmarks; outcomes analysis and back-testing; model review and validation after significant changes — foundational model risk management principles now increasingly aligned with AI-specific regulatory frameworks)
This article is part of AIPMO’s PM Practice series. See also: AI Testing and Validation | AI Risk Registers | Human Oversight in AI Systems