- AI systems degrade in production even when no code has changed. Performance shifts as data drifts, user behavior evolves, or the world changes in ways your training data didn’t capture. NIST AI RMF MANAGE 2.2 calls for ongoing monitoring to detect and respond to drift. This is not optional maintenance; it’s how you keep the system doing what you deployed it to do.
- Monitoring must cover six things: performance metrics, data quality, fairness metrics, system health, human oversight behavior, and user feedback. Tracking only accuracy misses the most common sources of real-world failure. Fairness metrics especially must be broken out by group — aggregate numbers can look fine while a specific population is being harmed.
- For high-risk AI systems under EU AI Act Article 72, post-market monitoring is a legal requirement. The monitoring plan must be documented before the system goes live — it can’t be written after deployment as an afterthought.
- AI projects don’t close the way software projects do. Monitoring, version control, retraining, and contingency planning are ongoing operational requirements. Define who owns them, how model updates are governed, and what would bring the project team back in — and put that in writing before you close the project.
Traditional software projects treat deployment as the finish line. For AI, it’s more like the starting line. The system that passed all your pre-deployment tests will change in production — sometimes slowly, sometimes quickly — even when no one has touched the code.
Post-deployment monitoring isn’t something you defer when resources are tight. For high-risk AI systems, it’s a legal requirement. For all AI systems, it’s the only way to know whether the system is still doing what it was built to do.
Why AI Systems Need Continuous Monitoring
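Traditional software behaves predictably: the same input produces the same output until someone changes the code. AI systems break that assumption. The model’s logic was learned from historical data, so its outputs can become wrong as production data and real-world conditions move away from that history, even though the code is untouched. PMs who treat AI like conventional software create risk they can’t see.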
| Failure Mode | Why It Matters |
|---|---|
| The world changes | Production data reflects current conditions — customer behavior, market dynamics, regulatory context. A model trained on last year’s patterns may be confidently wrong about this year’s reality, with no obvious failure signal. |
| Data drifts | The statistical properties of incoming data change over time. Features that were predictive become less so. The system continues to produce outputs, but their basis is increasingly mismatched to what the model learned. |
| Performance degrades gradually | Models lose accuracy incrementally, without obvious failures. By the time someone notices, significant harm may have accumulated across many decisions. NIST AI RMF MANAGE 2.2 identifies this gradual degradation — drift — as the core risk that monitoring must address. |
| New risks emerge | Users find unexpected applications or misapplications. Edge cases appear that pre-deployment testing did not anticipate. Adversarial actors probe for exploitable patterns. These risks are only discoverable in production. |
| Feedback loops amplify problems | AI systems that influence behavior can change the data they are subsequently trained on, creating self-reinforcing patterns that compound over time. |
What to Monitor
Good monitoring covers six distinct areas. Most teams focus on performance metrics and stop there — but tracking accuracy alone leaves the most common failure modes invisible.
Performance Metrics
Track the metrics defined during development and watch for divergence from the pre-deployment baseline. Without a documented baseline, you have no reference point for detecting degradation.
| Metric Type | What to Track |
|---|---|
| Accuracy metrics | Precision, recall, F1, AUC — measured against the pre-deployment baseline and monitored for directional trend, not just point-in-time value. |
| Error rates | False positives and false negatives, reported overall and by segment — aggregate error rates can mask deterioration in specific subpopulations. |
| Confidence distribution | Whether confidence scores remain calibrated to actual accuracy — a model can become overconfident as conditions shift. |
| Prediction distribution | Whether the distribution of outputs has shifted unexpectedly — a sudden change in the proportion of positive predictions is a signal worth investigating even before accuracy data is available. |
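The baseline comparison described above can be sketched in a few lines. This is a minimal illustration, not a production monitor: the `Metrics` class, the function names, the baseline figures, and the 0.05 tolerance are all assumptions made for the example, and it covers only binary labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float
    f1: float


def compute_metrics(y_true: list[int], y_pred: list[int]) -> Metrics:
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Metrics(precision, recall, f1)


def divergence_from_baseline(baseline: Metrics, current: Metrics, tolerance: float) -> dict[str, float]:
    """Return each metric that dropped more than `tolerance` below its baseline value."""
    drops = {}
    for name in ("precision", "recall", "f1"):
        drop = getattr(baseline, name) - getattr(current, name)
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


# Hypothetical pre-deployment baseline and a small batch of labeled production outcomes.
baseline = Metrics(precision=0.90, recall=0.85, f1=0.874)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

current = compute_metrics(y_true, y_pred)
flagged = divergence_from_baseline(baseline, current, tolerance=0.05)
print(current)
print(flagged)
```

In this toy batch all three metrics fall well below the baseline, so all three are flagged. Running the same comparison on successive batches gives the directional trend the table asks for, rather than a single point-in-time value.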
Data Quality
Monitoring outputs without monitoring inputs means you will detect problems late — after degraded inputs have already produced degraded outputs at scale.
| Metric Type | What to Track |
|---|---|
| Input drift | Statistical properties of incoming features compared to the training data distribution — the reference point against which the model was validated. |
| Feature drift | Individual features changing distribution, which may affect specific model pathways even when aggregate input statistics look stable. |
| Missing data rate | Increase in null or missing values in critical features, which the model may handle in unexpected ways if missingness was rare in training data. |
| Data volume | Unexpected changes in throughput or record counts, which can indicate upstream pipeline problems or changes in the user population. |
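As a rough sketch of the missing-data and volume checks in the table, the following compares one batch of incoming records against reference values taken from training. The function names, the record layout (a list of dicts), and the default limits of 0.05 and 0.30 are assumptions chosen for illustration, not recommended values.

```python
def missing_rate(records: list[dict], feature: str) -> float:
    """Fraction of records where `feature` is absent or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(feature) is None)
    return missing / len(records)


def data_quality_report(
    records: list[dict],
    critical_features: list[str],
    training_missing_rates: dict[str, float],
    expected_volume: int,  # assumed to be greater than zero
    max_rate_increase: float = 0.05,
    max_volume_deviation: float = 0.30,
) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    for feature in critical_features:
        rate = missing_rate(records, feature)
        reference = training_missing_rates.get(feature, 0.0)
        if rate - reference > max_rate_increase:
            findings.append(f"{feature}: missing rate {rate:.0%} vs {reference:.0%} in training")
    deviation = abs(len(records) - expected_volume) / expected_volume
    if deviation > max_volume_deviation:
        findings.append(f"volume: {len(records)} records vs {expected_volume} expected")
    return findings


# Hypothetical batch: "income" is missing in two of four records.
batch = [
    {"income": 52000, "age": 41},
    {"income": None, "age": 29},
    {"age": 35},
    {"income": 61000, "age": 50},
]
findings = data_quality_report(
    batch,
    critical_features=["income", "age"],
    training_missing_rates={"income": 0.01, "age": 0.0},
    expected_volume=4,
)
print(findings)
```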
Fairness Metrics
Overall performance can improve while performance for a specific demographic group deteriorates. Fairness monitoring must be disaggregated — aggregate metrics obscure differential impact by design.
| Metric Type | What to Track |
|---|---|
| Disaggregated performance | Accuracy, error rate, and related metrics broken out by demographic group, geography, or other segments relevant to the use case. |
| Outcome distribution | Whether positive and negative outcomes are distributed across groups as expected relative to the pre-deployment baseline and any regulatory requirements. |
| Error distribution | Whether certain groups are experiencing disproportionate rates of false positives or false negatives — error type matters as much as error rate in many use cases. |
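To make the disaggregation point concrete, here is a small sketch that breaks false positive and false negative rates out by group. The group labels, the row format, and the helper names are hypothetical, and which segments matter depends on the use case.

```python
from collections import defaultdict


def error_rates_by_group(rows):
    """rows: iterable of (group, y_true, y_pred) with binary labels.

    Returns {group: {"fpr": ..., "fnr": ..., "n": ...}}; a rate is None when
    the group has no negatives (for fpr) or no positives (for fnr).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in rows:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "fp" if y_pred == 1 else "tn"
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["tp"] + c["fn"]
        report[group] = {
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
            "n": negatives + positives,
        }
    return report


def max_gap(report, metric):
    """Largest difference in `metric` between any two groups."""
    values = [r[metric] for r in report.values() if r[metric] is not None]
    return max(values) - min(values) if values else 0.0


# Toy data: overall accuracy is 75%, but every error falls on group B.
rows = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 0),
    ("B", 1, 0), ("B", 1, 1), ("B", 0, 1), ("B", 0, 0),
]
report = error_rates_by_group(rows)
print(report)
print(max_gap(report, "fnr"))
```

The aggregate number here looks tolerable, while group B carries a 50% false negative rate and group A carries none. That gap is the quantity a fairness threshold would be set against.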
Operational Metrics
System health and AI performance are not independent. Infrastructure degradation can manifest as model degradation, and the two must be distinguished to identify the correct intervention.
| Metric Type | What to Track |
|---|---|
| Latency | Response time changes that may indicate infrastructure problems, model complexity increases from updates, or serving pipeline issues. |
| Throughput | Unexpected volume changes that may indicate upstream system changes or shifts in the user population hitting the model. |
| Resource use | CPU, memory, and storage trends — AI workloads are more variable and hardware-intensive than conventional software and require dedicated resource monitoring. |
| Error logs | System errors, exceptions, and timeouts that may precede or accompany model performance degradation. |
Human Oversight Metrics
How humans interact with an AI system is itself a monitoring signal. Changing override rates or escalation volumes often surface model degradation before quantitative metrics do — because human reviewers see the outputs directly.
| Metric Type | What to Track |
|---|---|
| Override rate | How often human reviewers reject or modify system recommendations — a rising override rate is a strong early signal of model degradation. |
| Override patterns | Whether overrides are concentrated in particular segments, use cases, or output types, which points toward where the model is failing. |
| Escalation volume | Whether more cases are being escalated for human review, which may indicate reviewers losing confidence in model outputs. |
| Reviewer response time | Whether human reviewers are keeping pace with escalations — a bottleneck in human oversight can mean AI-driven decisions are going unreviewed longer than intended. |
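Override rate and override patterns can be tracked from a simple review log. The sketch below assumes each log entry is a `(segment, overridden)` pair; the segment names, the 10% baseline, and the "more than twice the baseline" rule are illustrative assumptions only.

```python
from collections import Counter


def override_rates(reviews):
    """reviews: iterable of (segment, overridden) pairs.

    Returns (overall_rate, {segment: rate}).
    """
    totals, overrides = Counter(), Counter()
    for segment, overridden in reviews:
        totals[segment] += 1
        if overridden:
            overrides[segment] += 1
    per_segment = {s: overrides[s] / totals[s] for s in totals}
    overall = sum(overrides.values()) / sum(totals.values()) if totals else 0.0
    return overall, per_segment


# Hypothetical week of reviews: 1 of 10 overridden in one segment, 4 of 10 in another.
reviews = [("retail", i == 0) for i in range(10)]
reviews += [("small_business", i < 4) for i in range(10)]

BASELINE_RATE = 0.10  # assumed override rate observed during pre-deployment validation

overall, per_segment = override_rates(reviews)
flagged = sorted(s for s, rate in per_segment.items() if rate > 2 * BASELINE_RATE)
print(overall, per_segment, flagged)
```

The overall rate alone would show a rise; the per-segment view shows where reviewers are disagreeing with the model, which is where an investigation would presumably start.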
User and Stakeholder Feedback
Numbers don’t tell the whole story. Feedback from users and affected parties surfaces failure modes that automated monitoring can’t detect — edge cases, context-specific problems, and harms experienced by real people. NIST AI RMF MANAGE 4.2 treats feedback from affected parties as a required input to continual improvement, not an optional add-on.
| Metric Type | What to Track |
|---|---|
| User complaints | Volume, frequency trend, and nature of reported issues — categorize by failure type to identify patterns. |
| Support tickets | AI-related support requests, which often surface usability failures and mismatched expectations before they appear in performance metrics. |
| User sentiment | Survey results and feedback form data, tracked as a trend rather than a point-in-time snapshot. |
| Affected party reports | Concerns and complaints from people subject to AI-driven decisions, particularly in high-risk contexts where accessible redress mechanisms are required. |
Detecting Drift
Drift is the most common failure mode in deployed AI — and the least visible. Performance erodes incrementally, without any obvious break or error message, until the accumulated impact becomes impossible to ignore.
Three Types of Drift
| Drift Type | What It Means and Why It Matters |
|---|---|
| Data drift (covariate shift) | The statistical distribution of input features changes, but the underlying relationship between features and outcomes remains the same. The model may still work correctly — but it is now operating on data that looks different from what it was trained on, and performance on that new distribution may be untested. |
| Concept drift | The relationship between inputs and the target outcome changes. What the model learned is no longer true. A fraud detection model trained before a new fraud pattern emerged will not recognize the new pattern. Concept drift is the most dangerous type because the model continues to produce confident outputs based on outdated relationships. |
| Label drift | The distribution of outcomes changes. If you are predicting loan default and macroeconomic conditions shift default rates significantly, the model’s calibration becomes invalid even if it is detecting the right signals. Label drift often accompanies concept drift and can go undetected when ground truth labels are delayed. |
Detection Approaches
| Approach | How It Works |
|---|---|
| Statistical tests | Compare distributions of current data against the training data reference using tests such as Kolmogorov-Smirnov (continuous features) or chi-squared (categorical features). NIST MEASURE 2.4 recommends hypothesis testing or domain expertise to measure monitored distribution differences. |
| Drift detection algorithms | Specialized algorithms designed to detect distributional changes in data streams — such as ADWIN or DDM — that provide statistically grounded change-point detection. |
| Performance monitoring against ground truth | Track accuracy against labeled ground truth when labels are available with reasonable latency. This is the most direct drift signal but requires a feedback pipeline to collect outcomes. |
| Proxy metrics | When ground truth labels are delayed (loan outcomes, health outcomes), monitor leading indicators that correlate with eventual performance — confidence score distributions, prediction distribution shifts, or upstream behavioral signals. |
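For the statistical-test approach, the two-sample Kolmogorov-Smirnov check on a continuous feature can be sketched in pure Python. This hand-rolled version is for illustration: it uses the standard large-sample critical value (the constant 1.358 corresponds to a 5% significance level) rather than an exact p-value, and the function names are assumptions. In practice a team would more likely call a library routine such as `scipy.stats.ks_2samp`.

```python
import bisect
import math


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    # Both CDFs are step functions, so the largest gap occurs at an observed value.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


def ks_critical_value(n: int, m: int, c_alpha: float = 1.358) -> float:
    """Large-sample critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


def drifted(reference: list[float], current: list[float]) -> bool:
    """True when the KS statistic exceeds the critical value."""
    return ks_statistic(reference, current) > ks_critical_value(len(reference), len(current))


# Toy feature values: the production sample is the training sample shifted by 30 units.
training_sample = list(range(100))
production_sample = [x + 30 for x in training_sample]

print(ks_statistic(training_sample, production_sample))
print(ks_critical_value(100, 100))
print(drifted(training_sample, production_sample))
```

With large production volumes even tiny, harmless shifts become statistically significant, so the statistic itself (not just the pass/fail result) is usually worth tracking against the warning and action thresholds.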
Setting Response Thresholds
Define thresholds before deployment, not when drift is discovered under operational pressure:
- Warning threshold: Increase monitoring frequency, initiate investigation, notify stakeholders. No immediate intervention to the system.
- Action threshold: Intervention required — model refresh, retraining with updated data, configuration adjustment, or rollback to a previous version.
- Critical threshold: Immediate response — pause the system, activate manual fallback, halt consequential decisions until the root cause is identified and addressed.
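The three tiers above translate naturally into a small, reviewable piece of configuration. The following is one possible sketch: the `Thresholds` class, the `classify` function, and the numeric limits (drops of 0.02, 0.05, and 0.10 from baseline) are assumptions for illustration, and real limits would come from the documented monitoring plan.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    WARNING = 1
    ACTION = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Thresholds:
    """Degradation limits for one metric, expressed as a drop from baseline."""
    warning: float
    action: float
    critical: float

    def __post_init__(self):
        if not (0 < self.warning < self.action < self.critical):
            raise ValueError("thresholds must satisfy 0 < warning < action < critical")


# Response text mirrors the three tiers described in the list above.
RESPONSES = {
    Level.OK: "continue routine monitoring",
    Level.WARNING: "increase monitoring frequency, investigate, notify stakeholders",
    Level.ACTION: "intervene: refresh, retrain, adjust configuration, or roll back",
    Level.CRITICAL: "pause the system and activate manual fallback",
}


def classify(drop_from_baseline: float, t: Thresholds) -> Level:
    """Map an observed drop from baseline to the highest tier it reaches."""
    if drop_from_baseline >= t.critical:
        return Level.CRITICAL
    if drop_from_baseline >= t.action:
        return Level.ACTION
    if drop_from_baseline >= t.warning:
        return Level.WARNING
    return Level.OK


f1_thresholds = Thresholds(warning=0.02, action=0.05, critical=0.10)  # hypothetical values
for drop in (0.0, 0.03, 0.07, 0.12):
    level = classify(drop, f1_thresholds)
    print(f"drop={drop:.2f} -> {level.name}: {RESPONSES[level]}")
```

Keeping the limits in a frozen, validated object means a threshold change is an explicit, versionable edit rather than a judgment made during an incident.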
Incident Response
When monitoring surfaces a problem, you need a response process that’s already written down. AI incidents that surface under pressure — without a plan — result in inconsistent containment, poor documentation, and missed regulatory reporting obligations.
The OECD defines AI incidents as events where AI systems contribute to harm, including physical harm, psychological harm, infrastructure disruption, rights violations, or harm to property or communities. Harm can be acute (a single visible failure) or systemic (accumulated harm from many decisions, each small enough to go unnoticed individually). The systemic kind is the more dangerous one, because it can run for a long time before anyone notices the pattern.
| Element | Description | PM Responsibility |
|---|---|---|
| Detection | How incidents are identified: automated monitoring alerts, human reviewer escalations, user reports, or external reports from affected parties. | Define detection channels in the monitoring plan, ensure all are active before deployment, and confirm their ownership at project close. |
| Triage | Assess severity and urgency: is harm ongoing? Is it systemic or isolated? Is regulatory reporting triggered? | Define severity criteria in advance; ambiguous cases default to the higher severity tier. |
| Containment | Stop ongoing harm: pause the system, revert to a fallback process, increase human oversight for affected decision types, or restrict scope of AI-driven decisions. | Ensure fallback processes are tested before deployment, not designed during an incident. |
| Investigation | Determine root cause: data drift, concept drift, training data issue, infrastructure failure, edge case, or adversarial input. | Assign investigation ownership at project closure; data science team must remain reachable post-deployment. |
| Remediation | Fix the identified problem: retrain with corrected data, adjust thresholds, patch infrastructure, update guardrails, or decommission if no fix is feasible. | Document the remediation decision and the rationale, including any tradeoffs accepted. |
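| Communication | Notify affected parties, internal stakeholders, and, where required, regulators. Under EU AI Act Article 73, providers must report serious incidents involving high-risk AI to the market surveillance authority immediately after establishing a causal link, and no later than 15 days after becoming aware of the incident. | Know regulatory notification timelines before an incident occurs: the general EU AI Act limit is 15 days from awareness, with shorter limits for the most severe cases (10 days for a death, 2 days for a widespread infringement or a serious disruption of critical infrastructure). |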
| Post-mortem | Analyze root cause, assess whether monitoring thresholds were appropriate, update response procedures, and incorporate lessons into model governance. | Post-mortem findings should feed back into the project’s risk register and monitoring design. |
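The triage rule in the table, where ambiguous cases default to the higher severity tier, can be written down explicitly. The sketch below is one hypothetical encoding: the three-tier scheme and the specific rules for "high" and "medium" are assumptions for illustration, not severity criteria from any framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentFacts:
    """Answers to the triage questions; None means 'not yet known'."""
    harm_ongoing: Optional[bool]
    systemic: Optional[bool]
    reporting_triggered: Optional[bool]


def triage(facts: IncidentFacts) -> str:
    """Return 'high', 'medium', or 'low'.

    Unknown answers are treated as True, so ambiguous cases land in the
    higher tier until the facts are established.
    """
    def worst_case(answer: Optional[bool]) -> bool:
        return True if answer is None else answer

    ongoing = worst_case(facts.harm_ongoing)
    systemic = worst_case(facts.systemic)
    reporting = worst_case(facts.reporting_triggered)

    if reporting or (ongoing and systemic):
        return "high"
    if ongoing or systemic:
        return "medium"
    return "low"


print(triage(IncidentFacts(False, False, False)))  # every answer known and negative
print(triage(IncidentFacts(True, False, False)))   # isolated but ongoing
print(triage(IncidentFacts(None, None, False)))    # scope still unknown
```

The third call is the case the table warns about: nobody yet knows whether the harm is ongoing or systemic, so the incident is handled as high severity until investigation says otherwise.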
Managing Changes to the System
AI systems change — through retraining, fine-tuning, data updates, configuration adjustments, or architecture changes. Each change is a risk. A change that improves overall accuracy may quietly degrade fairness metrics for a specific group. Treat changes to production AI systems with the same rigor you’d apply to any other high-stakes system change.
Version Control
Keep explicit version records for every component of the AI system and link each deployed version to the artifacts it was built from:
- Model weights and architecture — including the specific training run that produced them
- Training data and preprocessing pipelines — versioned by dataset snapshot, not just file path
- Configuration and hyperparameters — the settings used for the deployed version
- Guardrails and safety mechanisms — which thresholds and constraints are active in the deployed system
- Dependencies — versions of ML frameworks, libraries, and infrastructure components in production
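One lightweight way to link a deployed version to its artifacts is a manifest with a content fingerprint. The sketch below is an assumption-laden illustration: the `DeploymentManifest` fields mirror the list above, but the field names, identifiers, and values are hypothetical, and a real setup would typically live in a model registry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DeploymentManifest:
    model_version: str
    training_run_id: str     # the training run that produced the weights
    dataset_snapshot: str    # snapshot identifier, not just a file path
    hyperparameters: dict
    guardrails: dict         # thresholds and constraints active in production
    dependencies: dict       # framework and library versions

    def fingerprint(self) -> str:
        """Stable SHA-256 over the manifest contents (keys sorted for determinism)."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# All identifiers and values below are made up for the example.
v1 = DeploymentManifest(
    model_version="1.4.0",
    training_run_id="run-0192",
    dataset_snapshot="loans-2024-06-30",
    hyperparameters={"max_depth": 6, "learning_rate": 0.1},
    guardrails={"min_confidence": 0.80},
    dependencies={"python": "3.11"},
)
v2 = replace(v1, guardrails={"min_confidence": 0.75})  # only a guardrail changed

print(v1.fingerprint())
print(v1.fingerprint() == v2.fingerprint())
```

Because the fingerprint changes when any component changes, a guardrail adjustment with no retraining still shows up as a distinct deployed version.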
Change Triggers
Define in advance what triggers a model update cycle. Unplanned updates driven by ad hoc observations are less rigorous and harder to govern than updates triggered by documented criteria.
| Trigger Type | Example Criteria |
|---|---|
| Performance degradation | Accuracy falls below the action threshold defined in the monitoring plan; fairness metrics exceed acceptable differential across groups. |
| Drift detection | Statistical tests confirm drift beyond the action threshold in one or more critical input features. |
| New data availability | A scheduled retraining cycle runs on a predetermined cadence, using accumulated production data that passed quality checks. |
| Regulatory or policy change | A regulatory update affects what the model is permitted to use as an input or how outputs must be explained. |
| Identified bias or fairness issue | An audit, user report, or monitoring alert surfaces a fairness problem not detected during pre-deployment testing. |
Champion-Challenger Testing
Before replacing a production model, test the replacement under real conditions. Run the new model (challenger) alongside the current model (champion), comparing outputs on live traffic — but don’t route consequential decisions through the challenger until it has met acceptance criteria. This approach also produces a documented performance comparison that justifies the deployment decision.
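A shadow-mode comparison of this kind might look like the sketch below. Both models see the same live requests, only the champion’s decision is served, and the challenger’s output is logged for comparison. The function names, the toy threshold models, and the 0.02 minimum gain are assumptions for illustration; in practice the outcomes often arrive much later than the requests.

```python
from typing import Callable


def shadow_compare(
    champion: Callable[[dict], int],
    challenger: Callable[[dict], int],
    requests: list[dict],  # assumed non-empty
    outcomes: list[int],   # ground truth, collected after the fact
) -> dict:
    """Serve champion decisions while recording challenger decisions for comparison."""
    served, champ_correct, chall_correct, disagreements = [], 0, 0, 0
    for request, outcome in zip(requests, outcomes):
        champ_out = champion(request)
        chall_out = challenger(request)  # logged only, never served
        served.append(champ_out)
        champ_correct += champ_out == outcome
        chall_correct += chall_out == outcome
        disagreements += champ_out != chall_out
    n = len(requests)
    return {
        "served": served,
        "champion_accuracy": champ_correct / n,
        "challenger_accuracy": chall_correct / n,
        "disagreement_rate": disagreements / n,
    }


def meets_acceptance(result: dict, min_gain: float) -> bool:
    """Acceptance criterion: challenger must beat champion by at least `min_gain`."""
    return result["challenger_accuracy"] - result["champion_accuracy"] >= min_gain


# Toy models: each approves (1) when a score clears its threshold.
def champion(request: dict) -> int:
    return int(request["score"] > 0.5)


def challenger(request: dict) -> int:
    return int(request["score"] > 0.4)


requests = [{"score": 0.3}, {"score": 0.45}, {"score": 0.55}, {"score": 0.7}]
outcomes = [0, 1, 1, 1]

result = shadow_compare(champion, challenger, requests, outcomes)
print(result)
print(meets_acceptance(result, min_gain=0.02))
```

Accuracy is the only criterion shown here. A fuller acceptance check would presumably also compare disaggregated fairness metrics, since a challenger can improve the aggregate while degrading results for a specific group.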
Decommissioning
Sometimes the right call is to turn the system off. NIST AI RMF MANAGE 2.4 addresses exactly this: decommissioning AI systems that are performing inconsistently with their intended use. The key requirement is that decommissioning responsibilities must be assigned and understood before you need to use them. Consider decommissioning when any of the following holds:
- Performance cannot be restored to acceptable levels through retraining, data updates, or architectural changes
- Residual risks exceed the benefits of the system, and risk mitigation options have been exhausted
- Regulatory requirements that apply to the system cannot be met with the current architecture or data
- The business need the system was designed to address has changed materially
| Consideration | Key Questions |
|---|---|
| Data retention | What training data, model artifacts, and decision logs must be retained? For how long, and under what legal obligations? |
| Data security | How will you prevent unauthorized access to decommissioned systems, training data, and logs that may contain personal data? |
| Downstream dependencies | What other systems, processes, or workflows depend on this AI’s outputs? All dependencies must be identified and migrated or deactivated. |
| User and affected party communication | How and when will you inform users, affected parties, and — where required — regulators of the decommissioning? |
| Transition plan | What replaces the AI system? If manual processes resume, have they been tested? If an alternative system is deployed, has it completed its own governance process? |
Regulatory Requirements
EU AI Act — Article 72
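For high-risk AI systems, post-market monitoring is a legal obligation under EU AI Act Article 72. Providers must establish a monitoring system that actively and systematically collects, documents, and analyzes performance data throughout the system’s lifetime. The monitoring plan is part of the technical documentation, which must be completed before the system goes live, so it can’t be written after deployment. Serious incident reporting is governed by the neighboring Article 73. Serious incidents are defined in Article 3(49) as incidents or malfunctions that lead to death or serious harm to health, serious and irreversible disruption of critical infrastructure, infringement of obligations protecting fundamental rights, or serious harm to property or the environment. Providers must report them to the market surveillance authority of the Member State where the incident occurred immediately after establishing a causal link (or the reasonable likelihood of one), and no later than 15 days after becoming aware of the incident. Shorter deadlines apply to the most severe cases: 10 days where a person has died, and 2 days for a widespread infringement or a serious and irreversible disruption of critical infrastructure.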
NIST AI RMF
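NIST AI RMF’s MANAGE function treats monitoring as continuous, not periodic. The framework is voluntary, so its subcategories describe expected outcomes rather than legal requirements. MANAGE 2.2 calls for mechanisms that sustain the value of deployed systems, which depends on ongoing monitoring of performance and trustworthiness. MANAGE 2.4 calls for mechanisms, with assigned responsibilities, to deactivate or supersede systems that aren’t performing as intended. MANAGE 3.1 calls for regular monitoring of risks from third-party resources, with documented controls. MANAGE 4.2 integrates continual improvement, informed by engagement with interested parties, into system updates. Together, these outcomes describe a governance cycle that runs throughout the system’s production life.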
Sector-Specific Requirements
Regulated sectors add further obligations on top of horizontal AI law. Financial services has long-standing model risk management expectations (SR 11-7, OCC Comptroller’s Handbook) that already treat ongoing monitoring as a core requirement. Healthcare AI faces post-market clinical follow-up and adverse event reporting requirements. Map both the horizontal AI obligations and the sector-specific requirements before finalizing the monitoring plan.
Right-Sizing for Your Situation
Match monitoring depth to system risk. A low-risk internal tool needs far less monitoring infrastructure than a high-risk system making consequential decisions about individuals. The EU AI Act’s proportionality principle says the same thing: monitoring must be proportionate to the technologies and the risks involved.
Start with the three metrics that matter most for your specific system type and instrument those before deployment, not after. For most systems, that means: a performance baseline you can measure against, a data quality check on incoming features, and a human override rate tracker. Those three give you the minimum signal to detect the most common failures. Establish a simple weekly review cadence — automated alerts plus a human review of the output — before the system handles consequential decisions, and document who does the review and what they’re looking for.
Once the basics are in place, the highest-value standardization is the threshold framework: define warning, action, and critical thresholds for your key metrics before deployment, in writing, so that monitoring alerts trigger a documented response rather than an ad hoc investigation. Connect monitoring alerts to your incident response process explicitly. A monitoring dashboard that isn’t connected to an escalation path is not monitoring, it’s reporting. Build champion-challenger testing into your model update process so changes to production models follow a consistent evaluation protocol.
For high-risk or regulated systems, the monitoring system is evidence of EU AI Act Article 72 compliance and of alignment with NIST AI RMF MANAGE 2.2. That means the monitoring plan must be in the technical documentation before go-live, the serious incident reporting workflow must be tested before it’s needed (the reporting clock, 15 days at most, starts when you become aware of the incident), and post-mortem findings must feed back into the risk register in a documented, traceable way. If your current monitoring doesn’t disaggregate fairness metrics by group, that’s the gap most likely to surface in a regulatory review, so fix it now.
Framework References
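EU AI Act (Regulation (EU) 2024/1689) — Article 72 (post-market monitoring obligations for high-risk AI providers; monitoring plan required in pre-deployment technical documentation) and Article 73 (serious incident reporting to national market surveillance authorities, no later than 15 days after the provider becomes aware of the incident, with shorter deadlines for the most severe incidents).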
NIST AI Risk Management Framework 1.0 (NIST AI 100-1, 2023) — MANAGE 2.2 (mechanisms to sustain the value of deployed AI systems, which in practice means monitoring for drift and degradation throughout the system’s operational life), MANAGE 2.4 (mechanisms and assigned responsibilities to supersede, disengage, or deactivate systems performing inconsistently with intended use), MANAGE 3.1 (regular monitoring of risks and benefits from third-party resources, with documented risk controls), MANAGE 4.2 (continual improvement integrated into system updates, including regular engagement with interested parties), MEASURE 2.4 (monitoring AI system functionality and behavior in production).
OECD AI Principles (OECD/LEGAL/0449, revised 2024) — Definition of AI incidents and near-misses, including systemic harm from accumulated low-severity failures. Foundation for incident classification and reporting obligations.
PMI CPMAI Guide (2025) — Phase VI. Monitoring infrastructure, version control, retraining pipelines, and operational governance as mandatory ongoing requirements; champion-challenger testing as a standard model update protocol.
This article is part of AIPMO’s PM Practice series. See also: AI Risk Registers | AI Testing and Validation | Human Oversight in AI Systems | The PM’s Guide to NIST AI RMF