Monitoring AI Systems in Production: The Work After Go-Live

PM Takeaways

AI systems degrade in production even when no code has changed. Performance shifts as data drifts, user behavior evolves, or the world changes in ways your training data didn’t capture. NIST AI RMF MANAGE 2.2 requires ongoing monitoring to detect and respond to drift — this is not optional maintenance, it’s how you keep the system doing what you deployed it to do.
Monitoring must cover six things: performance metrics, data quality, fairness metrics, system health, human oversight behavior, and user feedback. Tracking only accuracy misses the most common sources of real-world failure. Fairness metrics especially must be broken out by group — aggregate numbers can look fine while a specific population is being harmed.
For high-risk AI systems under EU AI Act Article 72, post-market monitoring is a legal requirement. The monitoring plan must be documented before the system goes live — it can’t be written after deployment as an afterthought.
AI projects don’t close the way software projects do. Monitoring, version control, retraining, and contingency planning are ongoing operational requirements. Define who owns them, how model updates are governed, and what would bring the project team back in — and put that in writing before you close the project.

Traditional software projects treat deployment as the finish line. For AI, it’s more like the starting line. The system that passed all your pre-deployment tests will change in production — sometimes slowly, sometimes quickly — even when no one has touched the code.

Post-deployment monitoring isn’t something you defer when resources are tight. For high-risk AI systems, it’s a legal requirement. For all AI systems, it’s the only way to know whether the system is still doing what it was built to do.

Why AI Systems Need Continuous Monitoring

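Traditional software behaves predictably: the same input produces the same output, and that output stays correct until someone changes the code. AI systems break the second half of that rule. A model can return exactly the same output for the same input and still become wrong, because its correctness depends on production data continuing to resemble the data it was trained on. PMs who treat AI like conventional software create risk they can’t see.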

Failure Mode	Why It Matters
The world changes	Production data reflects current conditions — customer behavior, market dynamics, regulatory context. A model trained on last year’s patterns may be confidently wrong about this year’s reality, with no obvious failure signal.
Data drifts	The statistical properties of incoming data change over time. Features that were predictive become less so. The system continues to produce outputs, but their basis is increasingly mismatched to what the model learned.
Performance degrades gradually	Models lose accuracy incrementally, without obvious failures. By the time someone notices, significant harm may have accumulated across many decisions. NIST AI RMF MANAGE 2.2 identifies this gradual degradation — drift — as the core risk that monitoring must address.
New risks emerge	Users find unexpected applications or misapplications. Edge cases appear that pre-deployment testing did not anticipate. Adversarial actors probe for exploitable patterns. These risks are only discoverable in production.
Feedback loops amplify problems	AI systems that influence behavior can change the data they are subsequently trained on, creating self-reinforcing patterns that compound over time.

What to Monitor

Good monitoring covers six distinct areas. Most teams focus on performance metrics and stop there — but tracking accuracy alone leaves the most common failure modes invisible.

Performance Metrics

Track the metrics defined during development and watch for divergence from the pre-deployment baseline. Without a documented baseline, you have no reference point for detecting degradation.

Metric Type	What to Track
Accuracy metrics	Precision, recall, F1, AUC — measured against the pre-deployment baseline and monitored for directional trend, not just point-in-time value.
Error rates	False positives and false negatives, reported overall and by segment — aggregate error rates can mask deterioration in specific subpopulations.
Confidence distribution	Whether confidence scores remain calibrated to actual accuracy — a model can become overconfident as conditions shift.
Prediction distribution	Whether the distribution of outputs has shifted unexpectedly — a sudden change in the proportion of positive predictions is a signal worth investigating even before accuracy data is available.
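
A minimal sketch of the baseline comparison described in this table, assuming binary labels and predictions held in plain lists. The names `Metrics`, `binary_metrics`, and `degraded_metrics`, the sample baseline values, and the 0.05 tolerance are illustrative assumptions, not values taken from any framework.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute precision, recall and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Metrics(precision, recall, f1)


def degraded_metrics(baseline, current, tolerance=0.05):
    """Return names of metrics that fell more than `tolerance` below baseline."""
    return [
        name for name in ("precision", "recall", "f1")
        if getattr(baseline, name) - getattr(current, name) > tolerance
    ]


# Hypothetical documented baseline and one window of labeled production data.
baseline = Metrics(precision=0.90, recall=0.85, f1=0.874)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
current = binary_metrics(y_true, y_pred)
print(current, degraded_metrics(baseline, current))
```

Running the same comparison on every review window, and storing the results, gives the directional trend the table asks for rather than a single point-in-time value.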

Data Quality

Monitoring outputs without monitoring inputs means you will detect problems late — after degraded inputs have already produced degraded outputs at scale.

Metric Type	What to Track
Input drift	Statistical properties of incoming features compared to the training data distribution — the reference point against which the model was validated.
Feature drift	Individual features changing distribution, which may affect specific model pathways even when aggregate input statistics look stable.
Missing data rate	Increase in null or missing values in critical features, which the model may handle in unexpected ways if missingness was rare in training data.
Data volume	Unexpected changes in throughput or record counts, which can indicate upstream pipeline problems or changes in the user population.
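
Two of the input checks above can be sketched in a few lines, assuming numeric feature values in plain lists with `None` marking a missing value. The Population Stability Index (PSI) is one common way to compare an incoming feature against its training reference; it is used here as an illustrative choice, and the 0.2 cut-off in the usage lines is a widely used rule of thumb, not a requirement from this article.

```python
import math


def missing_rate(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Bin edges come from the reference sample's min and max. A small floor
    on each proportion avoids log(0) for empty bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical data: the production feature has shifted upward by 30 units.
reference = [float(x) for x in range(100)]
shifted = [x + 30.0 for x in reference]

print(missing_rate([1.0, None, 3.0, None]))
print(psi(reference, reference), psi(reference, shifted) > 0.2)
```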

Fairness Metrics

Overall performance can improve while performance for a specific demographic group deteriorates. Fairness monitoring must be disaggregated — aggregate metrics obscure differential impact by design.

Metric Type	What to Track
Disaggregated performance	Accuracy, error rate, and related metrics broken out by demographic group, geography, or other segments relevant to the use case.
Outcome distribution	Whether positive and negative outcomes are distributed across groups as expected relative to the pre-deployment baseline and any regulatory requirements.
Error distribution	Whether certain groups are experiencing disproportionate rates of false positives or false negatives — error type matters as much as error rate in many use cases.
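
To make the disaggregation concrete, here is a small sketch that breaks false positive and false negative rates out by group. It assumes each record is a `(group, y_true, y_pred)` tuple with 0/1 labels; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels.

    Returns {group: {"fpr": ..., "fnr": ..., "n": ...}}.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true == 1:
            c["pos"] += 1
            c["fn"] += y_pred == 0
        else:
            c["neg"] += 1
            c["fp"] += y_pred == 1
    return {
        g: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "n": c["pos"] + c["neg"],
        }
        for g, c in counts.items()
    }


def max_gap(rates, key):
    """Largest difference in `key` between any two groups that have data."""
    vals = [r[key] for r in rates.values() if r[key] is not None]
    return max(vals) - min(vals) if vals else 0.0


# Hypothetical sample: aggregate accuracy is 75%, but every error falls on group B.
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 0),
    ("B", 1, 0), ("B", 1, 1), ("B", 0, 1), ("B", 0, 0),
]
rates = error_rates_by_group(records)
print(rates, max_gap(rates, "fnr"))
```

Reporting the two error types separately matters here: a group can have an acceptable overall error rate while carrying most of the false negatives.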

Operational Metrics

System health and AI performance are not independent. Infrastructure degradation can manifest as model degradation, and the two must be distinguished to identify the correct intervention.

Metric Type	What to Track
Latency	Response time changes that may indicate infrastructure problems, model complexity increases from updates, or serving pipeline issues.
Throughput	Unexpected volume changes that may indicate upstream system changes or shifts in the user population hitting the model.
Resource use	CPU, GPU, memory, and storage trends. AI workloads are often more variable and hardware-intensive than conventional software, so they usually need dedicated resource monitoring.
Error logs	System errors, exceptions, and timeouts that may precede or accompany model performance degradation.

Human Oversight Metrics

How humans interact with an AI system is itself a monitoring signal. Changing override rates or escalation volumes often surface model degradation before quantitative metrics do — because human reviewers see the outputs directly.

Metric Type	What to Track
Override rate	How often human reviewers reject or modify system recommendations — a rising override rate is a strong early signal of model degradation.
Override patterns	Whether overrides are concentrated in particular segments, use cases, or output types, which points toward where the model is failing.
Escalation volume	Whether more cases are being escalated for human review, which may indicate reviewers losing confidence in model outputs.
Reviewer response time	Whether human reviewers are keeping pace with escalations — a bottleneck in human oversight can mean AI-driven decisions are going unreviewed longer than intended.
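
Override rate and override patterns can be tracked with very little tooling. The sketch below assumes each reviewed decision is logged as a `(segment, was_overridden)` pair; the segment names, the previous-period rates, and the 0.10 minimum increase are hypothetical.

```python
from collections import Counter


def override_rates(decisions):
    """decisions: iterable of (segment, was_overridden) pairs."""
    totals, overrides = Counter(), Counter()
    for segment, was_overridden in decisions:
        totals[segment] += 1
        overrides[segment] += bool(was_overridden)
    return {s: overrides[s] / totals[s] for s in totals}


def rising_segments(previous, current, min_increase=0.10):
    """Segments whose override rate rose by at least `min_increase`."""
    return sorted(
        s for s, rate in current.items()
        if rate - previous.get(s, 0.0) >= min_increase
    )


# Hypothetical review log: 10 decisions per segment in the current period.
decisions = (
    [("retail", i == 0) for i in range(10)]
    + [("small_business", i < 4) for i in range(10)]
)
previous = {"retail": 0.10, "small_business": 0.12}
current_rates = override_rates(decisions)
print(current_rates, rising_segments(previous, current_rates))
```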

User and Stakeholder Feedback

Numbers don’t tell the whole story. Feedback from users and affected parties surfaces failure modes that automated monitoring can’t detect — edge cases, context-specific problems, and harms experienced by real people. NIST AI RMF MANAGE 4.2 treats feedback from affected parties as a required input to continual improvement, not an optional add-on.

Metric Type	What to Track
User complaints	Volume, frequency trend, and nature of reported issues — categorize by failure type to identify patterns.
Support tickets	AI-related support requests, which often surface usability failures and mismatched expectations before they appear in performance metrics.
User sentiment	Survey results and feedback form data, tracked as a trend rather than a point-in-time snapshot.
Affected party reports	Concerns and complaints from people subject to AI-driven decisions, particularly in high-risk contexts where accessible redress mechanisms are required.

Detecting Drift

Drift is the most common failure mode in deployed AI — and the least visible. Performance erodes incrementally, without any obvious break or error message, until the accumulated impact becomes impossible to ignore.

Three Types of Drift

Drift Type	What It Means and Why It Matters
Data drift (covariate shift)	The statistical distribution of input features changes, but the underlying relationship between features and outcomes remains the same. The model may still work correctly — but it is now operating on data that looks different from what it was trained on, and performance on that new distribution may be untested.
Concept drift	The relationship between inputs and the target outcome changes. What the model learned is no longer true. A fraud detection model trained before a new fraud pattern emerged will not recognize the new pattern. Concept drift is the most dangerous type because the model continues to produce confident outputs based on outdated relationships.
Label drift	The distribution of outcomes changes. If you are predicting loan default and macroeconomic conditions shift default rates significantly, the model’s calibration becomes invalid even if it is detecting the right signals. Label drift often accompanies concept drift and can go undetected when ground truth labels are delayed.

Detection Approaches

Approach	How It Works
Statistical tests	Compare distributions of current data against the training data reference using tests such as Kolmogorov-Smirnov (continuous features) or chi-squared (categorical features). The NIST AI RMF Playbook guidance for MEASURE 2.4 suggests hypothesis testing or human domain expertise to measure monitored distribution differences.
Drift detection algorithms	Specialized algorithms designed to detect distributional changes in data streams — such as ADWIN or DDM — that provide statistically grounded change-point detection.
Performance monitoring against ground truth	Track accuracy against labeled ground truth when labels are available with reasonable latency. This is the most direct drift signal but requires a feedback pipeline to collect outcomes.
Proxy metrics	When ground truth labels are delayed (loan outcomes, health outcomes), monitor leading indicators that correlate with eventual performance — confidence score distributions, prediction distribution shifts, or upstream behavioral signals.

Setting Response Thresholds

Define thresholds before deployment, not when drift is discovered under operational pressure:

Warning threshold: Increase monitoring frequency, initiate investigation, notify stakeholders. No immediate intervention to the system.
Action threshold: Intervention required — model refresh, retraining with updated data, configuration adjustment, or rollback to a previous version.
Critical threshold: Immediate response — pause the system, activate manual fallback, halt consequential decisions until the root cause is identified and addressed.
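
The three tiers above translate naturally into a small, reviewable piece of configuration. This sketch assumes a metric where higher is worse (for example, a drift score); the class names and the numeric threshold values are hypothetical and would come from your own monitoring plan.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    WARNING = 1
    ACTION = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Thresholds:
    """Lower bounds of each tier for a 'higher is worse' metric."""
    warning: float
    action: float
    critical: float

    def __post_init__(self):
        if not self.warning < self.action < self.critical:
            raise ValueError("thresholds must be strictly increasing")

    def classify(self, value):
        if value >= self.critical:
            return Level.CRITICAL
        if value >= self.action:
            return Level.ACTION
        if value >= self.warning:
            return Level.WARNING
        return Level.OK


# Responses paraphrase the three tiers defined in this section.
RESPONSES = {
    Level.OK: "continue routine monitoring",
    Level.WARNING: "increase monitoring frequency, investigate, notify stakeholders",
    Level.ACTION: "intervene: refresh, retrain, reconfigure, or roll back",
    Level.CRITICAL: "pause the system and activate manual fallback",
}

drift_thresholds = Thresholds(warning=0.10, action=0.25, critical=0.50)
level = drift_thresholds.classify(0.30)
print(level.name, "->", RESPONSES[level])
```

Keeping the thresholds in a versioned object like this means a change to any tier is itself a reviewable change, not an edit made under pressure.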

Incident Response

When monitoring surfaces a problem, you need a response process that’s already written down. AI incidents that surface under pressure — without a plan — result in inconsistent containment, poor documentation, and missed regulatory reporting obligations.

The OECD defines AI incidents as events where AI systems contribute to harm, including physical harm, psychological harm, infrastructure disruption, rights violations, or harm to property or communities. Incidents can be acute (a single visible failure) or systemic (accumulated harm from many decisions, each small enough to go unnoticed individually). The systemic kind is the more dangerous one, because it can run for a long time before anyone notices the pattern.

Element	Description	PM Responsibility
Detection	How incidents are identified: automated monitoring alerts, human reviewer escalations, user reports, or external reports from affected parties.	Define detection channels during planning; ensure all are active before deployment.
Triage	Assess severity and urgency: is harm ongoing? Is it systemic or isolated? Is regulatory reporting triggered?	Define severity criteria in advance; ambiguous cases default to the higher severity tier.
Containment	Stop ongoing harm: pause the system, revert to a fallback process, increase human oversight for affected decision types, or restrict scope of AI-driven decisions.	Ensure fallback processes are tested before deployment, not designed during an incident.
Investigation	Determine root cause: data drift, concept drift, training data issue, infrastructure failure, edge case, or adversarial input.	Assign investigation ownership at project closure; data science team must remain reachable post-deployment.
Remediation	Fix the identified problem: retrain with corrected data, adjust thresholds, patch infrastructure, update guardrails, or decommission if no fix is feasible.	Document the remediation decision and the rationale, including any tradeoffs accepted.
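Communication	Notify affected parties, internal stakeholders, and, where required, regulators. Under EU AI Act Article 73, providers of high-risk AI must report serious incidents to the market surveillance authority once a causal link is established, and no later than 15 days after becoming aware of the incident.	Know regulatory notification timelines before an incident occurs: under the EU AI Act the general limit is 15 days from awareness, with shorter limits for the most severe cases.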
Post-mortem	Analyze root cause, assess whether monitoring thresholds were appropriate, update response procedures, and incorporate lessons into model governance.	Post-mortem findings should feed back into the project’s risk register and monitoring design.
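
The triage rule in the table (ambiguous cases default to the higher severity tier) can be written down so it is applied consistently. This is a sketch under stated assumptions: the three-tier scale and the mapping from answers to tiers are hypothetical, and `None` stands for "not yet known".

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def triage(harm_ongoing: Optional[bool],
           systemic: Optional[bool],
           reporting_triggered: Optional[bool]) -> Severity:
    """Map the three triage questions to a severity tier.

    None means "not yet known" and is treated as True, so ambiguous
    cases land in the higher tier.
    """
    ongoing = harm_ongoing is not False
    wide = systemic is not False
    reportable = reporting_triggered is not False
    if reportable or (ongoing and wide):
        return Severity.HIGH
    if ongoing or wide:
        return Severity.MEDIUM
    return Severity.LOW


print(triage(harm_ongoing=True, systemic=False, reporting_triggered=False).name)
print(triage(harm_ongoing=None, systemic=None, reporting_triggered=False).name)
```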

Managing Changes to the System

AI systems change — through retraining, fine-tuning, data updates, configuration adjustments, or architecture changes. Each change is a risk. A change that improves overall accuracy may quietly degrade fairness metrics for a specific group. Treat changes to production AI systems with the same rigor you’d apply to any other high-stakes system change.

Version Control

Keep explicit version records for every component of the AI system and link each deployed version to the artifacts it was built from:

Model weights and architecture — including the specific training run that produced them
Training data and preprocessing pipelines — versioned by dataset snapshot, not just file path
Configuration and hyperparameters — the settings used for the deployed version
Guardrails and safety mechanisms — which thresholds and constraints are active in the deployed system
Dependencies — versions of ML frameworks, libraries, and infrastructure components in production
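
One lightweight way to link a deployed version to its artifacts is a release manifest with a content fingerprint. The sketch below is illustrative only: the `ModelRelease` fields mirror the list above, and every identifier and value in the usage lines is made up for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class ModelRelease:
    model_version: str
    training_run_id: str
    dataset_snapshot: str
    hyperparameters: dict
    guardrails: dict
    dependencies: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Stable SHA-256 over the manifest, so any change is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Hypothetical release record.
release = ModelRelease(
    model_version="2.3.0",
    training_run_id="run-0142",
    dataset_snapshot="snapshot-2025-01-15",
    hyperparameters={"learning_rate": 0.01, "max_depth": 6},
    guardrails={"min_confidence": 0.7},
    dependencies={"python": "3.11"},
)
# Changing only the dataset snapshot yields a different fingerprint.
retrained = replace(release, dataset_snapshot="snapshot-2025-04-15")
print(release.fingerprint()[:12], retrained.fingerprint()[:12])
```

Recording the fingerprint alongside each decision log entry makes it possible to say later exactly which model, data, and guardrail settings produced a given decision.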

Change Triggers

Define in advance what triggers a model update cycle. Unplanned updates driven by ad hoc observations are less rigorous and harder to govern than updates triggered by documented criteria.

Trigger Type	Example Criteria
Performance degradation	Accuracy falls below the action threshold defined in the monitoring plan; fairness metrics exceed acceptable differential across groups.
Drift detection	Statistical tests confirm drift beyond the action threshold in one or more critical input features.
New data availability	A scheduled retraining cycle runs on a predetermined cadence, using accumulated production data that passed quality checks.
Regulatory or policy change	A regulatory update affects what the model is permitted to use as an input or how outputs must be explained.
Identified bias or fairness issue	An audit, user report, or monitoring alert surfaces a fairness problem not detected during pre-deployment testing.
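
Documented criteria are easier to govern when they can be evaluated mechanically. This sketch checks four of the five trigger types against a monitoring snapshot; regulatory or policy change is left out because it normally needs human judgment. All field names, limits, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    accuracy: float
    fairness_gap: float          # largest metric difference between groups
    drifted_features: list       # critical features past the action threshold
    last_trained: date
    open_fairness_findings: int = 0


@dataclass
class Policy:
    min_accuracy: float
    max_fairness_gap: float
    retrain_every_days: int


def update_triggers(s: Snapshot, p: Policy, today: date) -> list:
    """Return the documented triggers that currently fire, in table order."""
    fired = []
    if s.accuracy < p.min_accuracy or s.fairness_gap > p.max_fairness_gap:
        fired.append("performance_degradation")
    if s.drifted_features:
        fired.append("drift_detection")
    if (today - s.last_trained).days >= p.retrain_every_days:
        fired.append("scheduled_retraining")
    if s.open_fairness_findings > 0:
        fired.append("fairness_issue")
    return fired


policy = Policy(min_accuracy=0.88, max_fairness_gap=0.05, retrain_every_days=90)
snapshot = Snapshot(accuracy=0.86, fairness_gap=0.03,
                    drifted_features=["income"], last_trained=date(2025, 1, 1))
print(update_triggers(snapshot, policy, today=date(2025, 3, 1)))
```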

Champion-Challenger Testing

Before replacing a production model, test the replacement under real conditions. Run the new model (challenger) alongside the current model (champion), comparing outputs on live traffic — but don’t route consequential decisions through the challenger until it has met acceptance criteria. This approach also produces a documented performance comparison that justifies the deployment decision.
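
In code, the essential property of this setup is that the challenger sees live traffic but its output is only logged, never served. The sketch below assumes both models are plain callables; `shadow_compare`, `agreement_rate`, and the two toy threshold models are hypothetical stand-ins.

```python
def shadow_compare(requests, champion, challenger):
    """Serve the champion's decision; record the challenger's for comparison only."""
    served, log = [], []
    for request in requests:
        champion_out = champion(request)
        challenger_out = challenger(request)  # logged, never returned to the caller
        served.append(champion_out)
        log.append({
            "request": request,
            "champion": champion_out,
            "challenger": challenger_out,
            "agree": champion_out == challenger_out,
        })
    return served, log


def agreement_rate(log):
    """Share of requests where both models produced the same output."""
    return sum(e["agree"] for e in log) / len(log) if log else 0.0


# Toy stand-ins: two score cut-offs playing the role of two model versions.
def champion(score):
    return score >= 0.5


def challenger(score):
    return score >= 0.6


served, log = shadow_compare([0.2, 0.55, 0.7, 0.9], champion, challenger)
print(served, agreement_rate(log))
```

The disagreements in the log are usually the most informative cases to review against the acceptance criteria, ideally once ground truth for them is available.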

Decommissioning

Sometimes the right call is to turn the system off. NIST AI RMF MANAGE 2.4 addresses exactly this: decommissioning AI systems that are performing inconsistently with their intended use. The key expectation is that decommissioning responsibilities are assigned and understood before you need to use them.

Consider decommissioning when any of the following applies:

Performance cannot be restored to acceptable levels through retraining, data updates, or architectural changes
Residual risks exceed the benefits of the system, and risk mitigation options have been exhausted
Regulatory requirements that apply to the system cannot be met with the current architecture or data
The business need the system was designed to address has changed materially

Consideration	Key Questions
Data retention	What training data, model artifacts, and decision logs must be retained? For how long, and under what legal obligations?
Data security	How will you prevent unauthorized access to decommissioned systems, training data, and logs that may contain personal data?
Downstream dependencies	What other systems, processes, or workflows depend on this AI’s outputs? All dependencies must be identified and migrated or deactivated.
User and affected party communication	How and when will you inform users, affected parties, and — where required — regulators of the decommissioning?
Transition plan	What replaces the AI system? If manual processes resume, have they been tested? If an alternative system is deployed, has it completed its own governance process?

Regulatory Requirements

EU AI Act — Article 72

For high-risk AI systems, post-market monitoring is a legal obligation under EU AI Act Article 72. Providers must establish a monitoring system that actively and systematically collects, documents, and analyzes performance data throughout the system’s lifetime. Serious incident reporting is covered separately by Article 73: incidents such as death or serious harm to health, serious disruption of critical infrastructure, or infringement of fundamental rights obligations must be reported to the relevant national market surveillance authority no later than 15 days after the provider becomes aware of them, with shorter deadlines for the most severe cases. The monitoring plan must be included in technical documentation completed before the system goes live. It can’t be written after deployment.

NIST AI RMF

NIST AI RMF’s MANAGE function treats monitoring as continuous, not periodic. The framework is voluntary, so these are expectations rather than legal requirements. MANAGE 2.2 calls for mechanisms to sustain the value of deployed systems, which in practice means ongoing monitoring of performance and trustworthiness. MANAGE 2.4 calls for mechanisms to supersede, disengage, or deactivate systems that aren’t performing as intended. MANAGE 3.1 calls for regular monitoring of risks from third-party resources, with documented controls. MANAGE 4.2 integrates feedback from affected parties into continual improvement. Together, these subcategories describe a governance cycle that runs throughout the system’s production life.

Sector-Specific Requirements

Regulated sectors add further obligations on top of horizontal AI law. Financial services has long-standing model risk management expectations (SR 11-7, OCC Comptroller’s Handbook) that already treat ongoing monitoring as a core requirement. Healthcare AI faces post-market clinical follow-up and adverse event reporting requirements. Map both the horizontal AI obligations and the sector-specific requirements before finalizing the monitoring plan.

Right-Sizing for Your Situation

Match monitoring depth to system risk. A low-risk internal tool needs far less monitoring infrastructure than a high-risk system making consequential decisions about individuals. The EU AI Act’s proportionality principle says the same thing: monitoring must be proportionate to the technologies and the risks involved.

Greenfield — Starting Out

Start with the three metrics that matter most for your specific system type and instrument those before deployment, not after. For most systems, that means: a performance baseline you can measure against, a data quality check on incoming features, and a human override rate tracker. Those three give you the minimum signal to detect the most common failures. Establish a simple weekly review cadence — automated alerts plus a human review of the output — before the system handles consequential decisions, and document who does the review and what they’re looking for.

Emerging — Building Repeatability

The highest-value standardization at this stage is the threshold framework: define warning, action, and critical thresholds for your key metrics before deployment, in writing, so that monitoring alerts trigger a documented response rather than an ad hoc investigation. Connect monitoring alerts to your incident response process explicitly — a monitoring dashboard that isn’t connected to an escalation path is not monitoring, it’s reporting. Build champion-challenger testing into your model update process so changes to production models follow a consistent evaluation protocol.

Established — Mature Programs

At this level the monitoring system is evidence for EU AI Act Article 72 compliance and alignment with NIST AI RMF MANAGE 2.2. That means the monitoring plan must be in the technical documentation before go-live, the serious incident reporting workflow must be tested before it’s needed (the 15-day clock starts when you become aware of the incident, not when the investigation ends), and post-mortem findings must feed back into the risk register in a documented, traceable way. If your current monitoring doesn’t disaggregate fairness metrics by group, that’s the gap most likely to surface in a regulatory review, so fix it now.

The AI Governance Advisor can help you design a monitoring plan for your specific system type, risk level, and regulatory context — and generate an AI Monitoring Plan document pre-populated for your deployment.

Free Template — AI Monitoring Plan

AIPMO’s AI Monitoring Plan template is a structured, fillable PDF covering all six monitoring dimensions described in this article — performance metrics, data quality, fairness, operational health, human oversight behavior, and user feedback — with drift threshold definitions, incident response process, and change trigger criteria. It is mapped to EU AI Act Article 72 post-market monitoring requirements and NIST AI RMF MANAGE 2.2, 3.1, and 4.2. Download free and adapt to your system, or use the AI Governance Advisor to generate a version pre-populated for your system type, risk level, and deployment context.


Framework References

EU AI Act (Regulation (EU) 2024/1689): Article 72 (post-market monitoring obligations for high-risk AI providers; monitoring plan required in pre-deployment technical documentation) and Article 73 (serious incident reporting to national market surveillance authorities, no later than 15 days after the provider becomes aware).

NIST AI Risk Management Framework 1.0 (NIST AI 100-1, 2023) — MANAGE 2.2 (ongoing monitoring for drift and degradation throughout the system’s operational life), MANAGE 2.4 (mechanisms to deactivate or supersede systems performing inconsistently with intended use), MANAGE 3.1 (ongoing monitoring of risks with documented controls), MANAGE 4.2 (continual improvement using feedback from affected parties as a required input), MEASURE 2.4 (monitoring AI system functionality and behavior in production).

OECD AI Principles (OECD/LEGAL/0449, revised 2024) — Definition of AI incidents and near-misses, including systemic harm from accumulated low-severity failures. Foundation for incident classification and reporting obligations.

PMI CPMAI Guide (2025) — Phase VI. Monitoring infrastructure, version control, retraining pipelines, and operational governance as mandatory ongoing requirements; champion-challenger testing as a standard model update protocol.

This article is part of AIPMO’s PM Practice series. See also: AI Risk Registers | AI Testing and Validation | Human Oversight in AI Systems | The PM’s Guide to NIST AI RMF

To err is AI; to govern, human.

AIPMO.co · AI Governance, PM-first