
Monitoring AI Systems in Production: The Work After Go-Live

AI systems degrade in production even when no one touches the code. Post-deployment monitoring is not optional: for high-risk AI it's a legal requirement, and for all AI it's the only way to know whether the system is still doing what it was designed to do.

By AIPMO

PM Takeaways

•       AI systems are not static software — they degrade in production through data drift, concept drift, and feedback loops even when no code changes are made. NIST AI RMF MANAGE 2.2 is explicit that performance and trustworthiness may evolve and shift once a system is deployed, and that regular monitoring is required to detect and respond to drift before accumulated harm becomes visible.

•       Monitoring must cover six distinct dimensions — performance metrics, data quality, fairness metrics, operational health, human oversight behaviour, and user feedback. Quantitative performance metrics alone are insufficient: NIST MEASURE 2.4 requires monitoring of AI system functionality and behaviour as identified during risk mapping, and fairness metrics must be disaggregated by group, not reported only in aggregate.

•       Drift is the most common and least visible failure mode in deployed AI. The three types — data drift (covariate shift), concept drift (changed input-outcome relationships), and label drift (shifted outcome distributions) — require different detection approaches and different intervention responses. Define statistical thresholds that trigger warning, action, and critical responses before deployment, not after drift is discovered.

•       Under the EU AI Act Article 72, providers of high-risk AI systems are required to establish post-market monitoring systems that actively and systematically collect, document, and analyse data on system performance throughout the lifetime of the system — and serious incidents must be reported to authorities. This obligation begins at deployment, which means the monitoring plan must be part of the technical documentation completed before go-live.

•       Project closure is the wrong concept for AI projects. The PMI–CPMAI Phase VI framework treats ongoing monitoring, version control, retraining pipelines, and contingency planning as core operational requirements, not post-project afterthoughts. PMs must define who owns monitoring, how model updates are governed and funded, and what triggers re-engagement of the project team — before the project formally closes.

Traditional software projects treat deployment as the finish line. Hand off to operations, close the project, move on. AI projects don’t work that way. The system that passed all your pre-deployment tests will change in production — sometimes slowly, sometimes dramatically — even if no one touches the code.

Post-deployment monitoring is not a best practice that resource-constrained teams can defer. For high-risk AI systems, it is a regulatory requirement under the EU AI Act. For all AI systems, it is the only mechanism by which you can know whether the system is still doing what it was designed to do. NIST AI RMF MANAGE 2.2 frames it plainly: regular monitoring of AI systems’ performance and trustworthiness enhances organisations’ ability to detect and respond to drift, and thus sustain an AI system’s value once deployed. 

Why AI Systems Need Continuous Monitoring

Traditional software behaves deterministically: given the same inputs, it produces the same outputs. If it worked yesterday, it works today, absent a code change or infrastructure failure. AI systems operate differently, and PMs who treat them as conventional software create invisible risk.

•       The world changes: Production data reflects current conditions — customer behaviour, market dynamics, regulatory context. A model trained on last year’s patterns may be confidently wrong about this year’s reality, with no obvious failure signal.

•       Data drifts: The statistical properties of incoming data change over time. Features that were predictive become less so. New patterns appear that the model has never encountered. The system continues to produce outputs, but their basis is increasingly mismatched to what the model learned.

•       Performance degrades gradually: Models lose accuracy incrementally, without obvious failures. By the time someone notices, significant harm may have accumulated across many decisions. NIST AI RMF MANAGE 2.2 identifies this gradual degradation — drift — as the core risk that monitoring must address.

•       New risks emerge: Users find unexpected applications or misapplications. Edge cases appear that pre-deployment testing did not anticipate. Adversarial actors probe for exploitable patterns. These risks are only discoverable in production, not in a test environment.

•       Feedback loops amplify problems: AI systems that influence behaviour can change the data they are subsequently trained on, creating self-reinforcing patterns. A content recommendation system that amplifies engagement can shift user behaviour in ways that then appear in retraining data, compounding the effect.

What to Monitor

NIST MEASURE 2.4 requires that the functionality and behaviour of the AI system and its components — as identified during risk mapping — are monitored in production. Effective monitoring covers six distinct dimensions. Tracking only performance metrics is insufficient.

Performance Metrics

Track the metrics defined during development and watch for divergence from the pre-deployment baseline. Without a documented baseline, you have no reference point for detecting degradation.

•       Accuracy metrics: Precision, recall, F1, AUC — measured against the pre-deployment baseline and monitored for directional trend, not just point-in-time value

•       Error rates: False positives and false negatives, reported overall and by segment — aggregate error rates can mask deterioration in specific subpopulations

•       Confidence distribution: Whether confidence scores remain calibrated to actual accuracy — a model can become overconfident as conditions shift

•       Prediction distribution: Whether the distribution of outputs has shifted unexpectedly — a sudden change in the proportion of positive predictions is a signal worth investigating even before accuracy data is available
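Prediction-distribution shift can be tracked before any ground truth arrives. Below is a minimal sketch of one common measure, the Population Stability Index (PSI); the function, bucket count, and rule-of-thumb bands are illustrative conventions, not a standard.

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two score samples.

    Buckets both samples using quantile edges taken from the baseline,
    then sums (cur% - base%) * ln(cur% / base%) over buckets.
    Common rules of thumb: < 0.1 stable, 0.1 to 0.25 investigate,
    > 0.25 significant shift (conventions, not regulatory thresholds).
    """
    baseline = sorted(baseline)
    n = len(baseline)
    # Quantile bucket edges derived from the baseline distribution
    edges = [baseline[min(n - 1, int(n * i / bins))] for i in range(1, bins)]

    def bucket_fracs(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(1 for e in edges if x > e)  # bucket index for x
            counts[idx] += 1
        total = len(sample)
        # Small floor avoids log(0) for buckets the sample never hits
        return [max(c / total, 1e-6) for c in counts]

    b, c = bucket_fracs(baseline), bucket_fracs(current)
    return sum((cf - bf) * math.log(cf / bf) for bf, cf in zip(b, c))
```

Because PSI compares output distributions only, it is useful precisely when labels are delayed; a near-zero value for identical distributions and a large value for a shifted one give an early, label-free drift signal.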

Data Quality

Monitoring outputs without monitoring inputs means you will detect problems late — after degraded inputs have already produced degraded outputs at scale.

•       Input drift: Statistical properties of incoming features compared to the training data distribution — the reference point against which the model was validated

•       Feature drift: Individual features changing distribution, which may affect specific model pathways even when aggregate input statistics look stable

•       Missing data rate: Increase in null or missing values in critical features, which the model may handle in unexpected ways if missingness was rare in training data

•       Data volume: Unexpected changes in throughput or record counts, which can indicate upstream pipeline problems or changes in the user population
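A missing-data check of this kind needs only a few lines. A sketch, assuming records arrive as dicts and baseline null rates were captured at training time (all names are illustrative):

```python
def missing_rate_alerts(records, critical_features, baseline_rates, tolerance=0.05):
    """Flag critical features whose null rate exceeds the training-time
    baseline by more than `tolerance` (absolute).

    `records` is a list of dicts; a value of None, or an absent key,
    counts as missing. `baseline_rates` maps feature name to the null
    rate observed in the training data.
    """
    n = len(records)
    alerts = {}
    for feat in critical_features:
        missing = sum(1 for r in records if r.get(feat) is None)
        rate = missing / n
        if rate > baseline_rates.get(feat, 0.0) + tolerance:
            alerts[feat] = rate
    return alerts
```

Run against each monitoring window, the returned dict is empty in the healthy case, which makes it trivial to wire into an alerting pipeline.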

Fairness Metrics

Overall performance can improve while performance for a specific demographic group deteriorates. Fairness monitoring must be disaggregated — aggregate metrics obscure differential impact by design.

•       Disaggregated performance: Accuracy, error rate, and related metrics broken out by demographic group, geography, or other segments relevant to the use case

•       Outcome distribution: Whether positive and negative outcomes are distributed across groups as expected relative to the pre-deployment baseline and any regulatory requirements

•       Error distribution: Whether certain groups are experiencing disproportionate rates of false positives or false negatives — error type matters as much as error rate in many use cases
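Disaggregation is mechanically simple; the discipline is in computing it routinely. A sketch for a binary classifier, assuming a group label is available for each decision (names illustrative):

```python
def disaggregated_error_rates(y_true, y_pred, groups):
    """Per-group false positive and false negative rates for a binary
    classifier. Aggregate metrics can hide a regression confined to one
    segment; returning one (fpr, fnr) pair per group makes it visible.
    """
    stats = {}
    for yt, yp, g in zip(y_true, y_pred, groups):
        s = stats.setdefault(g, {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
        if yt == 1:
            s["pos"] += 1
            if yp == 0:
                s["fn"] += 1          # missed positive
        else:
            s["neg"] += 1
            if yp == 1:
                s["fp"] += 1          # false alarm
    return {
        g: {
            "fpr": s["fp"] / s["neg"] if s["neg"] else 0.0,
            "fnr": s["fn"] / s["pos"] if s["pos"] else 0.0,
        }
        for g, s in stats.items()
    }
```

Comparing these per-group rates against the pre-deployment baseline on every monitoring cycle is what turns "fairness must be disaggregated" from a principle into a check.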

Operational Metrics

System health and AI performance are not independent. Infrastructure degradation can manifest as model degradation, and the two must be distinguished to identify the correct intervention.

•       Latency: Response time changes that may indicate infrastructure problems, model complexity increases from updates, or serving pipeline issues

•       Throughput: Unexpected volume changes that may indicate upstream system changes or shifts in the user population hitting the model

•       Resource utilisation: CPU, memory, and storage trends — PMI–CPMAI Phase VI notes that AI workloads are more variable and hardware-intensive than conventional software and require dedicated resource monitoring

•       Error logs: System errors, exceptions, and timeouts that may precede or accompany model performance degradation

Human Oversight Metrics

How humans interact with an AI system is itself a monitoring signal. Changing override rates or escalation volumes often surface model degradation before quantitative metrics do — because human reviewers see the outputs directly.

•       Override rate: How often human reviewers reject or modify system recommendations — a rising override rate is a strong early signal of model degradation

•       Override patterns: Whether overrides are concentrated in particular segments, use cases, or output types, which points toward where the model is failing

•       Escalation volume: Whether more cases are being escalated for human review, which may indicate reviewers losing confidence in model outputs

•       Reviewer response time: Whether human reviewers are keeping pace with escalations — a bottleneck in human oversight can mean AI-driven decisions are going unreviewed longer than intended
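A rolling window is often enough to surface a rising override rate. A sketch with illustrative window size and alert threshold:

```python
from collections import deque

class OverrideRateMonitor:
    """Rolling override-rate tracker over the last `window` reviewed
    decisions. A sustained rate above `alert_rate` is treated as an
    early degradation signal. Both parameters are placeholders to be
    tuned per system; they are not standard values.
    """

    def __init__(self, window=500, alert_rate=0.15):
        self.window = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, overridden: bool) -> None:
        """Log one reviewed decision (True if the reviewer overrode it)."""
        self.window.append(1 if overridden else 0)

    def rate(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def alert(self) -> bool:
        # Require a minimally filled window so a handful of early
        # overrides does not trigger a spurious alert
        return len(self.window) >= 50 and self.rate() > self.alert_rate
```

Feeding every reviewed decision through `record` and checking `alert` on a schedule gives a cheap leading indicator that fires before accuracy metrics, which often depend on delayed labels.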

User and Stakeholder Feedback

Quantitative metrics do not capture everything. User and stakeholder feedback surfaces failure modes that automated monitoring cannot detect — edge cases, contextual problems, and harms experienced by people subject to AI decisions. NIST AI RMF MANAGE 4.2 includes feedback from affected parties as a required input to continual improvement.

•       User complaints: Volume, frequency trend, and nature of reported issues — categorise by failure type to identify patterns

•       Support tickets: AI-related support requests, which often surface usability failures and mismatched expectations before they appear in performance metrics

•       User sentiment: Survey results and feedback form data, tracked as a trend rather than a point-in-time snapshot

•       Affected party reports: Concerns and complaints from people subject to AI-driven decisions, particularly in high-risk contexts where the AI Act requires accessible redress mechanisms

Detecting Drift

Drift is the most common and least visible failure mode in deployed AI systems. Performance degrades gradually, without obvious failures, until accumulated harm becomes apparent. NIST AI RMF MANAGE 2.2 identifies monitoring and maintenance procedures for drift and decontextualisation as core operational requirements.

Three Types of Drift

Each type of drift has different causes, different detection approaches, and different appropriate interventions. Conflating them leads to wrong diagnoses and ineffective responses.

•       Data drift (covariate shift): The statistical distribution of input features changes, but the underlying relationship between features and outcomes remains the same. The model may still work correctly — but it is now operating on data that looks different from what it was trained on, and performance on that new distribution may be untested.

•       Concept drift: The relationship between inputs and the target outcome changes. What the model learned is no longer true. A fraud detection model trained before a new fraud pattern emerged will not recognise the new pattern, regardless of how stable the input distribution is. Concept drift is the most dangerous type because the model continues to produce confident outputs that are based on outdated relationships.

•       Label drift: The distribution of outcomes changes. If you are predicting loan default and macroeconomic conditions shift default rates significantly, the model’s calibration becomes invalid even if it is detecting the right signals. Label drift often accompanies concept drift and can go undetected when ground truth labels are delayed.
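The operational point is that input-distribution monitoring alone cannot see concept drift. A contrived synthetic illustration: the input distribution is identical before and after, yet a frozen decision rule's accuracy collapses because the input-outcome relationship flips.

```python
import random

def accuracy_under_concept_drift(n=2000, seed=7):
    """Synthetic illustration of concept drift. Inputs are drawn from
    the same uniform distribution before and after the drift, so an
    input-drift detector would see nothing. But the ground-truth rule
    inverts, so the frozen model goes from perfect to useless.
    """
    rng = random.Random(seed)
    model = lambda x: 1 if x > 0.5 else 0    # "trained" decision rule

    before = [rng.random() for _ in range(n)]
    after = [rng.random() for _ in range(n)]  # same distribution as before

    # Before drift, ground truth matches the rule; after drift it inverts
    acc_before = sum(model(x) == (1 if x > 0.5 else 0) for x in before) / n
    acc_after = sum(model(x) == (0 if x > 0.5 else 1) for x in after) / n
    return acc_before, acc_after
```

The example is deliberately extreme, but it shows why ground-truth or proxy performance monitoring must accompany distributional tests: here accuracy falls from 1.0 to 0.0 while every input statistic is unchanged.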

Detection Approaches

•       Statistical tests: Compare distributions of current data against the training data reference using tests such as Kolmogorov-Smirnov (continuous features) or chi-squared (categorical features). NIST MEASURE 2.4 recommends hypothesis testing or domain expertise to measure monitored distribution differences.

•       Drift detection algorithms: Specialised algorithms designed to detect distributional changes in data streams, such as ADWIN (Adaptive Windowing) or DDM (Drift Detection Method), that provide statistically grounded change-point detection.

•       Performance monitoring against ground truth: Track accuracy against labelled ground truth when labels are available with reasonable latency. This is the most direct drift signal but requires a feedback pipeline to collect outcomes.

•       Proxy metrics: When ground truth labels are delayed (loan outcomes, health outcomes), monitor leading indicators that correlate with eventual performance — confidence score distributions, prediction distribution shifts, or upstream behavioural signals.
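The two-sample Kolmogorov-Smirnov statistic mentioned above is straightforward to compute directly. A dependency-free sketch (in practice a library routine such as scipy.stats.ks_2samp also returns a p-value for thresholding):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical CDFs. Values near 0 mean the
    current feature distribution matches the training reference; large
    values indicate drift worth investigating.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples in merge order, advancing past all ties
    # before comparing the two empirical CDFs
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d
```

Comparing a production window of a feature against its training snapshot with this statistic, and alerting when it crosses a pre-set threshold, is the simplest concrete implementation of the "statistical tests" row above.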

Setting Response Thresholds

Not all drift requires the same response. Define thresholds before deployment, not when drift is discovered under operational pressure.

•       Warning threshold: Increase monitoring frequency, initiate investigation, notify stakeholders. No immediate intervention to the system.

•       Action threshold: Intervention required — model refresh, retraining with updated data, configuration adjustment, or rollback to a previous version.

•       Critical threshold: Immediate response — pause the system, activate manual fallback, halt consequential decisions until the root cause is identified and addressed. 
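The three tiers translate naturally into a small dispatch function. A sketch; the numeric thresholds are placeholders to be set per metric and per system before deployment:

```python
def drift_response(drift_score, warning=0.1, action=0.25, critical=0.4):
    """Map a drift metric (e.g. a KS statistic or PSI value) to the
    pre-defined response tier. The default thresholds are illustrative
    placeholders, not recommended values; they must be agreed per
    metric and per risk level before go-live.
    """
    if drift_score >= critical:
        return "critical"   # pause the system, activate manual fallback
    if drift_score >= action:
        return "action"     # model refresh, retraining, or rollback
    if drift_score >= warning:
        return "warning"    # investigate, increase monitoring frequency
    return "ok"
```

Encoding the tiers in code, rather than in a document, means the monitoring pipeline applies the same pre-agreed criteria every time, which is the point of defining them before deployment.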

Incident Response

When monitoring detects a problem, a pre-defined response process is essential. AI incidents that surface under operational pressure without a documented process result in inconsistent containment, poor documentation, and missed regulatory reporting obligations.

What Constitutes an AI Incident

The OECD’s definition of AI incidents encompasses events where AI systems contribute to harm — including injury to health (including psychological harm), disruption of critical infrastructure, violations of human rights or law, or harm to property, communities, or the environment. Incidents can be acute (a single harmful decision) or systemic (accumulated harm across many decisions, each individually below the threshold of obvious concern). NIST AI RMF MANAGE 2.2 requires treatment and response plans for incidents, negative impacts, and outcomes to be established and regularly reviewed.

Incident Response Process

•       Detection: How incidents are identified — automated monitoring alerts, human reviewer escalations, user reports, or external reports from affected parties. PM responsibility: define detection channels during planning and ensure all are active before deployment.

•       Triage: Assess severity and urgency: is harm ongoing? Is it systemic or isolated? Is regulatory reporting triggered? PM responsibility: define severity criteria in advance; ambiguous cases default to the higher severity tier.

•       Containment: Stop ongoing harm: pause the system, revert to a fallback process, increase human oversight for affected decision types, or restrict the scope of AI-driven decisions. PM responsibility: ensure fallback processes are tested before deployment, not designed during an incident.

•       Investigation: Determine root cause: data drift, concept drift, training data issue, infrastructure failure, edge case, or adversarial input. PM responsibility: assign investigation ownership at project closure; the data science team must remain reachable post-deployment.

•       Remediation: Fix the identified problem: retrain with corrected data, adjust thresholds, patch infrastructure, update guardrails, or decommission if no fix is feasible. PM responsibility: document the remediation decision and the rationale, including any tradeoffs accepted.

•       Communication: Notify affected parties, internal stakeholders, and — where required — regulators. Under the EU AI Act, serious incidents involving high-risk AI must be reported to authorities without undue delay. PM responsibility: know regulatory notification timelines before an incident occurs; 15 days for serious incidents under the AI Act.

•       Post-mortem: Analyse root cause, assess whether monitoring thresholds were appropriate, update response procedures, and incorporate lessons into model governance. PM responsibility: feed post-mortem findings back into the project’s risk register and monitoring design.

Documentation Requirements

Incident documentation is both an internal governance requirement and, for high-risk AI, a regulatory one. Track all reported errors, near-misses, and negative impacts; the response actions taken and by whom; root cause analysis findings; preventive measures implemented; and communications to affected parties. NIST AI RMF MANAGE 4.2 requires that the basis for decisions made relative to tradeoffs between trustworthy characteristics, system risks, and system opportunities is documented. 

Change Management for AI Systems

AI systems change over time — through retraining, fine-tuning, data updates, architecture changes, or configuration adjustments. Each change introduces risk. A change that improves aggregate performance may degrade fairness metrics. A data update that incorporates recent patterns may introduce new biases. PMI–CPMAI Phase VI treats model versioning, retraining pipelines, and controlled deployment as core operational governance requirements.

Version Control

Maintain explicit versioning across all components of the AI system, and connect each deployed version to the artefacts it was built from. Without this, root cause analysis after a production problem becomes guesswork.

•       Model weights and architecture — including the specific training run that produced them

•       Training data and preprocessing pipelines — versioned by dataset snapshot, not just file path

•       Configuration and hyperparameters — the settings used for the deployed version, not just the latest configuration file

•       Guardrails and safety mechanisms — which thresholds and constraints are active in the deployed system

•       Dependencies — versions of ML frameworks, libraries, and infrastructure components in the production environment
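One lightweight way to make this concrete is a deployment manifest recorded with every release and hashed for audit logs. A sketch with illustrative field names:

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json

@dataclass(frozen=True)
class ModelVersionManifest:
    """Minimal sketch of a deployment manifest tying a deployed model
    back to the artefacts it was built from. All field names are
    illustrative, not a standard schema.
    """
    model_version: str
    training_run_id: str
    dataset_snapshot: str          # versioned snapshot ID, not a file path
    hyperparameters: dict = field(default_factory=dict)
    guardrail_config: str = ""     # which thresholds/constraints are active
    dependencies: dict = field(default_factory=dict)  # framework versions

    def fingerprint(self) -> str:
        """Stable short hash of the manifest contents, suitable for
        stamping into prediction logs and audit records."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Logging the fingerprint with every prediction means that, when a production problem is investigated months later, the exact model, data snapshot, and guardrail configuration behind each decision can be recovered rather than guessed.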

Change Triggers

Define in advance what triggers a model update cycle. Unplanned updates driven by ad hoc observations are less rigorous and harder to govern than updates triggered by documented criteria.

•       Performance degradation: Accuracy falls below the action threshold defined in the monitoring plan; fairness metrics exceed the acceptable differential across groups

•       Drift detection: Statistical tests confirm drift beyond the action threshold in one or more critical input features

•       New data availability: A scheduled retraining cycle runs on a predetermined cadence, using accumulated production data that passed quality checks

•       Regulatory or policy change: A regulatory update affects what the model is permitted to use as an input or how outputs must be explained

•       Identified bias or fairness issue: An audit, user report, or monitoring alert surfaces a fairness problem not detected during pre-deployment testing

Champion-Challenger Testing

Before replacing a production model, evaluate the replacement under production conditions. Run the new model (“challenger”) in parallel with the current production model (“champion”), comparing outputs on live traffic without routing consequential decisions through the challenger until it has met acceptance criteria. This approach surfaces problems that staging environments do not replicate, and it creates a documented performance comparison that supports the deployment decision. 
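In code, shadow-mode evaluation reduces to scoring the same requests through both models while serving only the champion's output. A simplified sketch; a real system would log asynchronously and handle delayed labels:

```python
def shadow_compare(champion, challenger, requests, ground_truth):
    """Run a challenger in shadow mode against live traffic.

    Both models score every request, but only the champion's output is
    acted on; the challenger's outputs are logged for comparison. The
    returned summary supports a documented promotion decision.
    """
    agree = 0
    champ_correct = chall_correct = 0
    for req, label in zip(requests, ground_truth):
        champ_out = champion(req)      # served to users
        chall_out = challenger(req)    # logged only, never served
        agree += champ_out == chall_out
        champ_correct += champ_out == label
        chall_correct += chall_out == label
    n = len(requests)
    return {
        "agreement": agree / n,
        "champion_accuracy": champ_correct / n,
        "challenger_accuracy": chall_correct / n,
    }
```

The agreement rate is worth reporting alongside accuracy: a challenger that is more accurate overall but disagrees with the champion on a large fraction of cases changes user-visible behaviour, which may itself need stakeholder sign-off before promotion.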

Decommissioning

Sometimes the right governance decision is to turn the system off. NIST AI RMF MANAGE 2.4 addresses superseding, disengaging, or deactivating AI systems that demonstrate performance or outcomes inconsistent with intended use — and the playbook notes that responsibilities for decommissioning must be assigned and understood before the situation requires them to be exercised.

When to Decommission

•       Performance cannot be restored to acceptable levels through retraining, data updates, or architectural changes

•       Residual risks exceed the benefits of the system, and risk mitigation options have been exhausted

•       Regulatory requirements that apply to the system cannot be met with the current architecture or data

•       The business need the system was designed to address has changed materially

•       A substantially better alternative exists and the transition can be managed without harm to current users

Decommissioning Considerations

•       Data retention: What training data, model artefacts, and decision logs must be retained? For how long, and under what legal obligations?

•       Data security: How will you prevent unauthorised access to decommissioned systems, training data, and logs that may contain personal data?

•       Downstream dependencies: What other systems, processes, or workflows depend on this AI’s outputs? All dependencies must be identified and migrated or deactivated.

•       User and affected party communication: How and when will you inform users, affected parties, and — where required — regulators of the decommissioning?

•       Transition plan: What replaces the AI system? If manual processes resume, have they been tested? If an alternative system is deployed, has it completed its own governance process?

 

Regulatory Requirements

EU AI Act — Article 72

The EU AI Act establishes mandatory post-market monitoring obligations for providers of high-risk AI systems. Article 72 requires providers to establish and document a post-market monitoring system that actively and systematically collects, documents, and analyses relevant data on the performance of high-risk AI systems throughout their lifetime. The monitoring system must be proportionate to the nature of the AI technologies and the risks involved, and it must enable the provider to evaluate continuous compliance with the requirements of the Act. Serious incidents must be reported to the relevant national market surveillance authority without undue delay — within 15 days for incidents that present a risk to health, safety, or fundamental rights.

The post-market monitoring plan must be included in the technical documentation completed before the system is placed on the market. This means monitoring is not a post-deployment concern: it must be designed and documented as part of the project, before go-live.

NIST AI RMF

The NIST AI RMF addresses ongoing monitoring across the MANAGE function. MANAGE 2.2 requires mechanisms to sustain the value of deployed AI systems through regular monitoring of performance and trustworthiness. MANAGE 2.4 requires that mechanisms are in place to supersede, disengage, or deactivate AI systems performing inconsistently with intended use. MANAGE 3.1 requires that risks be regularly monitored with documented controls applied. MANAGE 4.2 requires that continual improvement processes be integrated into AI system updates, including feedback from all relevant AI actors. Together these constitute a continuous governance cycle, not a periodic audit.

Sector-Specific Requirements

Healthcare, financial services, and other regulated sectors layer additional monitoring and reporting requirements on top of horizontal AI regulation. Financial services regulators have long-standing model risk management expectations (SR 11-7, OCC Comptroller’s Handbook) that treat ongoing monitoring as a core model governance obligation. Healthcare AI faces increasingly specific requirements for post-market clinical follow-up and adverse event reporting. PMs in regulated sectors must map both horizontal AI regulation and sector-specific obligations before finalising the monitoring plan. 

PM Responsibilities by Phase

During Planning

•       Define monitoring requirements in scope, including the specific metrics, thresholds, and cadences that will govern the live system

•       Identify the six monitoring dimensions — performance, data quality, fairness, operational health, human oversight behaviour, and user feedback — and confirm coverage for each

•       Budget for monitoring infrastructure, tooling, and the ongoing staff time required to review reports, investigate alerts, and manage incidents

•       Plan incident response capabilities before deployment, including roles, escalation paths, regulatory notification timelines, and fallback procedures

•       Include the post-market monitoring plan in the technical documentation for high-risk AI systems, as required by EU AI Act Article 72

At Deployment

•       Verify that monitoring infrastructure is fully operational before traffic is routed to the live system — monitoring deployed after go-live creates a gap that is difficult to retroactively close

•       Confirm that baselines are established and documented for all metrics — without a baseline, drift detection has no reference point

•       Train the operations team on monitoring tools, alert interpretation, and escalation procedures

•       Conduct a live test of the incident response procedure before the system handles consequential decisions

Post-Deployment (Ongoing)

•       Review monitoring reports on the cadence appropriate to the system’s risk level — weekly for high-risk systems, monthly for lower-risk systems, with continuous automated alerting for all

•       Ensure incidents are tracked, documented, and addressed with root cause analysis — a pattern of near-misses with no documented response is a governance failure

•       Trigger model refresh or retraining when thresholds are exceeded according to the pre-defined criteria, not on an ad hoc basis

•       Update monitoring requirements as risks, regulatory requirements, and the user population evolve

Project Closure Considerations

Traditional project closure happens at deployment. AI project governance requires a different model. Before formally closing an AI project, define and document:

•       Who owns ongoing monitoring, and what authority they have to trigger interventions including system pause or decommissioning

•       How model updates are proposed, evaluated, approved, funded, and deployed — including who must sign off on changes to high-risk systems

•       What conditions trigger re-engagement of the original project team — significant retraining, architectural changes, or regulatory changes that require a full re-assessment

•       How long post-deployment support from the build team lasts and what the handover criteria are for the operations team to manage independently 

Right-Sizing for Your Situation

Monitoring depth should match system risk. A low-risk internal productivity tool needs different monitoring infrastructure than a high-risk system making consequential decisions about individuals. The EU AI Act’s proportionality principle reflects this: monitoring must be proportionate to the nature of the technologies and the risks involved.

Greenfield — AI Monitoring Playbook

For PMs without formal monitoring infrastructure. Essential metrics, simple drift detection approaches, and basic incident tracking without enterprise tooling — designed for teams starting from a lightweight operational baseline.

Emerging — AI Monitoring Playbook

For PMs building repeatable processes. Comprehensive monitoring framework design, threshold-setting guidance, drift detection selection, and incident response templates for teams building a structured operational capability.

Established — AI Monitoring Playbook

For PMs in organisations with formal governance. How to integrate AI monitoring with existing operational monitoring, incident management, and compliance frameworks — including EU AI Act Article 72 post-market monitoring plan requirements.


 

Framework References

•       EU AI Act (Official Journal, 12 July 2024) — Article 72 (post-market monitoring obligations for providers of high-risk AI systems: systematic data collection and analysis throughout system lifetime; post-market monitoring plan as required component of technical documentation; proportionality principle); Article 73 (serious incident reporting obligations; 15-day reporting timeline for incidents presenting risk to health, safety, or fundamental rights); Annex IV (technical documentation requirements including post-market monitoring plan)

•       NIST AI Risk Management Framework (AI RMF 1.0, NIST AI 100-1) — MANAGE 2.2 (mechanisms to sustain value of deployed AI systems through monitoring of performance, trustworthiness, and drift; risk response options: avoid, accept, mitigate, transfer); MANAGE 2.4 (superseding, disengaging, or deactivating AI systems performing inconsistently with intended use; responsibilities must be assigned and understood); MANAGE 3.1 (regular monitoring of third-party and internal AI risks with documented controls); MANAGE 4.2 (continual improvement processes integrated into AI system updates; feedback from all relevant AI actors as required input)

•       NIST AI RMF Playbook — MANAGE 2.2 suggested actions (establishing risk controls considering trustworthiness characteristics; procedures for monitoring drift and decontextualisation; decommissioning systems that exceed risk tolerances) and transparency documentation (post-deployment testing methodology, metrics, and performance outcomes; accessibility of information to external stakeholders); MANAGE 2.4 suggested actions (contingency verification for mission-critical systems; deactivation mechanisms); MANAGE 4.2 transparency documentation (user and stakeholder engagement in model development and regular performance review; ability of affected parties to test and provide feedback); MEASURE 2.4 (monitoring of AI system functionality and behaviour in production; hypothesis testing and domain expertise for distribution differences; anomaly detection using control limits and confidence intervals)

•       PMI Guide to Leading and Managing AI Projects (CPMAI 2025) — Phase V (governance and MLOps readiness assessment: model drift detection systems, statistical baselines, real-time dashboards, alert protocols, and designated ownership for drift management; fixing gaps post-deployment is far more costly than addressing them proactively); Phase VI (continuous monitoring and versioning as core operational requirements; automated retraining pipelines; version control for model artefacts, data history, and configurations; risk mitigation and contingency planning for data drift, unexpected behaviours, and regulatory changes)

•       AIGP Body of Knowledge v1.0.0 — Domain IV (post-deployment risk management; incident documentation and reporting obligations; model risk management integration with AI governance frameworks); Domain V (regulatory compliance monitoring; sector-specific requirements for healthcare and financial services AI; integration of horizontal AI regulation with sector regulator expectations)

•       Federal Reserve SR 11-7 / OCC Comptroller’s Handbook: Model Risk Management — (ongoing monitoring as core model governance obligation in financial services; performance monitoring against established benchmarks; outcomes analysis and back-testing; model review and validation after significant changes — foundational model risk management principles now increasingly aligned with AI-specific regulatory frameworks)

 

This article is part of AIPMO’s PM Practice series. See also: AI Testing and Validation | AI Risk Registers | Human Oversight in AI Systems