- The Obermeyer et al. (2019) care management algorithm didn’t include race as a variable — it used healthcare spending as a proxy. Because Black patients historically spent less due to access barriers, the algorithm systematically underestimated their care needs. Removing protected characteristics from model inputs does not remove proxy discrimination. Testing outputs against protected characteristics does.
- A 2024 UK government review found that minority ethnic people, women, and people from deprived communities are at risk of poorer healthcare from biased medical tools. AI scribes perform worse for Black patients and non-English speakers. More than half of published clinical AI models use training data from the US or China. These are documented harms with documented sources.
- EU AI Act Article 10 makes bias assessment a legal obligation for medical device AI manufacturers, not a voluntary commitment, and applies to medical device AI from August 2027. Health Canada’s February 2025 guidance expects training data to be justified as demographically representative, including for skin pigmentation. Both belong in compliance planning now.
- Subgroup performance testing is part of the clinical validation process — not a separate ‘bias audit’ commissioned after the AI is built. Set acceptable disparity thresholds before testing begins, on the actual deployment population. If a subgroup falls below threshold, that is a design-phase finding, not a launch-week problem.
- A model that performs equitably at launch may develop disparities over time as the production population diverges from the training set. Monthly demographic performance monitoring with defined alert thresholds is the minimum post-deployment standard.
Algorithmic bias in healthcare AI is not a fringe risk. It is the most broadly documented and scientifically substantiated governance failure in clinical AI, appearing consistently across diagnostic imaging, clinical decision support, care management algorithms, drug dosing tools, and AI documentation systems. It affects patients in every country where clinical AI is deployed, with consistent patterns: underrepresented populations in training data, proxy variables encoding structural health inequities, and aggregate performance metrics that mask subgroup disparities.
The EU AI Act’s Article 10 data governance obligations, Health Canada’s explicit representativeness requirement in its February 2025 guidance, and the global convergence of regulatory frameworks on demographic subgroup testing all signal the same direction: demonstrating that clinical AI performs equitably across the patient population is becoming a legal requirement, not a voluntary commitment.
Why Clinical AI Produces Biased Outputs
The Training Data Problem
Clinical AI learns patterns from historical clinical data — data generated in healthcare systems with documented disparities: unequal access to care, differential diagnosis rates, racially biased treatment decisions, and underrepresentation of minority populations in research datasets. More than half of all published clinical AI models leverage data from the US or China. An AI trained on US hospital data may perform well for the US patient mix it learned from. Deployed in a different country, with a different disease prevalence profile, different demographic composition, and different clinical presentation norms, it is solving a different problem than it was trained for.
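One way to make this mismatch concrete is to compare the demographic mix of the training data with the mix of the intended deployment population. The following is a minimal sketch, assuming group labels are available for both. The function names, the group labels `A`, `B`, `C`, and the counts are invented for illustration, and total variation distance is only one reasonable summary, not a method the cited sources prescribe.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the records as a dict summing to 1.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def composition_gap(train_labels, deploy_labels):
    """Total variation distance between two demographic mixes.

    0.0 means identical mixes; 1.0 means no overlap at all.
    """
    train = group_shares(train_labels)
    deploy = group_shares(deploy_labels)
    groups = set(train) | set(deploy)
    return 0.5 * sum(abs(train.get(g, 0.0) - deploy.get(g, 0.0)) for g in groups)


# Invented example data: the labels are placeholders, not real cohorts.
train_labels = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
deploy_labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20

gap = composition_gap(train_labels, deploy_labels)
print(f"composition gap: {gap:.2f}")
```

A large gap does not prove the model will underperform, but it is a reasonable trigger for local validation before deployment.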
The Proxy Variable Problem
Removing protected characteristics from model inputs does not remove bias. The Obermeyer et al. (2019) commercial care management algorithm that systematically underestimated Black patients’ health needs did not include race as a variable. It used healthcare spending as a proxy for health need. Because structural barriers historically limited Black patients’ access to healthcare, they spent less — and the algorithm concluded they needed less care. Proxy discrimination in healthcare AI appears in multiple forms: zip code encoding residential segregation; prior healthcare utilization encoding differential access; genetic variants calibrated for European populations used to dose drugs for African-descended patients. EU AI Act Article 10 is precise: training data governance must address biases “likely to affect the health and safety of persons.”
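Because race was not a model input in the Obermeyer case, the disparity only becomes visible when outputs are compared against the protected characteristic. Below is a minimal sketch of that kind of check, assuming each record carries a model score, a group label, and an independent measure of health need (here, a count of chronic conditions). The function name, field layout, and all numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def need_by_group_at_equal_score(records, band_width=10):
    """Mean true health need per (score band, group).

    records: iterable of (score, group, need) tuples.
    If the score were an unbiased measure of need, groups in the same
    score band would show similar mean need.
    """
    cells = defaultdict(list)
    for score, group, need in records:
        band = int(score // band_width) * band_width
        cells[(band, group)].append(need)
    return {key: mean(values) for key, values in cells.items()}


# Invented records: (model score, group label, chronic condition count).
records = [
    (72, "A", 3), (75, "A", 4), (71, "B", 5), (78, "B", 6),
    (45, "A", 1), (48, "A", 2), (42, "B", 3), (44, "B", 3),
]

need_by_band = need_by_group_at_equal_score(records)
for (band, group), avg_need in sorted(need_by_band.items()):
    print(f"score {band}-{band + 9}, group {group}: mean need {avg_need}")
```

In this toy data, group B patients carry more chronic conditions than group A patients with the same score, which is the signature of a proxy target that understates one group's need.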
The Aggregate Metric Problem
A diagnostic AI with 92% overall sensitivity may have 78% sensitivity in the demographic group whose cancer is most likely to be missed. Aggregate metrics alone are insufficient for regulatory submission and, increasingly, for clinical deployment authorization. The regulatory response globally is consistent: subgroup performance must be tested and reported separately for relevant demographic groups.
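The arithmetic behind this masking effect is simple. The sketch below uses invented counts, chosen only so that the totals reproduce the 92% and 78% figures above; the group names are placeholders.

```python
def sensitivity(true_positives, false_negatives):
    """Share of actual positive cases that the model detected."""
    return true_positives / (true_positives + false_negatives)


# Invented confusion counts for patients who truly have the condition.
counts = {
    "group_a": {"tp": 842, "fn": 58},   # 900 positive cases
    "group_b": {"tp": 78, "fn": 22},    # 100 positive cases
}

overall = sensitivity(
    sum(c["tp"] for c in counts.values()),
    sum(c["fn"] for c in counts.values()),
)
by_group = {g: sensitivity(c["tp"], c["fn"]) for g, c in counts.items()}

print(f"overall: {overall:.1%}")
for group, value in by_group.items():
    print(f"{group}: {value:.1%}")
```

Because group_b contributes only a tenth of the positive cases, its 22 missed cases barely move the headline number.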
Documented Cases: A Global Pattern
Care Management Algorithms (US)
The Obermeyer et al. (2019) study in Science documented that a commercial care management algorithm used by hundreds of US hospital systems systematically underestimated Black patients’ health needs. The algorithm recommended additional care management support based on predicted health costs. Because Black patients historically spent less on healthcare — due to access barriers, not health status — the algorithm concluded they were healthier than equally ill white patients. When researchers replaced healthcare cost with actual health status as the target variable, racial bias “nearly disappeared.” This remains the defining documented case of clinical AI bias, still cited in regulatory guidance and court filings in 2025.
Skin Cancer Detection (UK and Global)
A systematic review by Wen et al. (Lancet Digital Health, 2022) examined 21 open-access datasets used to train AI for skin cancer detection. Of 106,950 total images, only 2,436 had skin type recorded. Among those, just 10 images were from people with brown skin and only one from an individual with dark brown or black skin. Multiple published studies have demonstrated significantly lower diagnostic accuracy for darker-skinned patients in commercial skin cancer AI tools. This compounds an existing disparity: Black patients already have the worst melanoma survival outcomes, largely because of late-stage diagnosis. A 2024 UK government review cited this pattern explicitly as a risk affecting minority ethnic patients.
AI Scribes and Speech Recognition Bias
Ambient AI scribes rely on automatic speech recognition (ASR). ASR systems perform worse for Black patients, non-English speakers, and accented speakers. A 2023 study documented significantly worse ASR performance for African American Language patterns. A 2024 study examined how ASR disparities in nursing home settings produced documentation quality gaps for Black patients. In clinical contexts, documentation quality directly affects care continuity, billing accuracy, and legal record integrity. An AI scribe that systematically produces lower-quality notes for certain patient demographics is a health equity risk embedded in a productivity tool.
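ASR disparities of this kind are usually quantified with word error rate (WER), compared across speaker groups. The following is a minimal sketch, assuming paired human reference transcripts and ASR output are available with a group label. The function names, group labels, and sentences are invented for illustration, and a real evaluation would also normalize punctuation and casing.

```python
from collections import defaultdict


def word_edits(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pooled WER per group; samples are (group, reference, hypothesis)."""
    edits = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_edits(reference, hypothesis)
        edits[group] += e
        words[group] += n
    return {g: edits[g] / words[g] for g in words if words[g] > 0}


# Invented transcripts for two placeholder speaker groups.
samples = [
    ("x", "patient reports chest pain since monday",
          "patient reports chest pain since monday"),
    ("y", "patient reports chest pain since monday",
          "patient report chest pains monday"),
]
wer = wer_by_group(samples)
print(wer)
```

A persistent WER gap between groups is the measurable form of the documentation quality gap described above.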
UK Equity in Medical Devices Review (2024)
A UK government-commissioned review found that minority ethnic people, women, and people from deprived communities are at risk of poorer healthcare from biases in medical tools and devices, including AI. The review identified pulse oximeters, dermatological tools, and AI-based diagnostic tools as categories with documented bias. It recommended that the NHS require developers to test and report on device performance across skin tones and demographic groups before deployment. The review’s findings directly informed NHS clinical governance guidance and the NHS STANDING Together consensus recommendations published in Lancet Digital Health in 2025.
Regulatory Requirements: The Global Convergence
| Jurisdiction / Framework | Bias-Specific Obligations |
|---|---|
| EU AI Act Article 10 (effective for MDAI August 2027) | Data governance must include examination of “possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination.” Training, validation, and testing data must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose. |
| Health Canada MLMD Guidance (February 2025) | Training data “justified as adequately representative of the Canadian population and clinical practice” including skin pigmentation, biological sex, and other identity-based factors; subgroup validation expected. |
| FDA (Draft Guidance, January 2025) | Subgroup performance analysis as component of clinical evidence; demographic subgroup testing for protected characteristics; post-market surveillance tracking performance by subgroup. |
| UK NHS STANDING Together (Lancet Digital Health, 2025) | Consensus recommendations for algorithmic transparency including reporting of training data composition, subgroup performance across demographic groups, and representation of underserved populations. |
| Australia TGA + ACSQHC (2025) | Bias assessment as component of clinical validation; ACSQHC August 2025 guides emphasize assessing AI for potential biases before and during deployment. |
The Bias Testing Workstream
Step 1: Identify the Demographic Groups That Matter
- For imaging AI: skin tone, sex, age.
- For clinical decision support: race and ethnicity, sex, age, socioeconomic indicators, primary language.
- For AI scribes: primary language and language variety (ASR performance), accent, clinical specialty.
- For care management algorithms: race and ethnicity, socioeconomic factors, insurance status.
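The list above can be captured as a small configuration and checked against the validation dataset before any testing starts. This is a sketch only: the dictionary keys, column names, and the `missing_attributes` helper are hypothetical naming choices, not part of any cited framework.

```python
# Attributes to stratify by, keyed by AI type (taken from the list above).
SUBGROUP_ATTRIBUTES = {
    "imaging": ["skin_tone", "sex", "age"],
    "clinical_decision_support": [
        "race_ethnicity", "sex", "age",
        "socioeconomic_indicators", "primary_language",
    ],
    "ai_scribe": [
        "primary_language", "language_variety", "accent", "clinical_specialty",
    ],
    "care_management": [
        "race_ethnicity", "socioeconomic_factors", "insurance_status",
    ],
}


def missing_attributes(ai_type, available_columns):
    """Return required subgroup attributes absent from the dataset columns."""
    available = set(available_columns)
    return [name for name in SUBGROUP_ATTRIBUTES[ai_type] if name not in available]


# Hypothetical validation dataset for an imaging tool.
validation_columns = ["sex", "age", "label", "prediction"]
gaps = missing_attributes("imaging", validation_columns)
print(gaps)
```

A non-empty result means the subgroup analysis cannot be run as planned, which is worth knowing before a vendor data request is closed.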
Step 2: Set Thresholds Before Testing
Define what constitutes acceptable and unacceptable performance disparity before testing begins. Setting thresholds after testing — when you know the results — compromises scientific validity and creates litigation exposure. For diagnostic AI, a performance gap exceeding 10 percentage points in sensitivity between demographic groups for a life-threatening condition is a serious concern. Document the threshold-setting rationale and have clinical governance review and approve it before testing.
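One way to make "before testing begins" auditable is to record the threshold as an immutable object and store a fingerprint of it alongside the governance approval. The sketch below assumes a single maximum-gap rule; the class, field names, and approver label are hypothetical, and the 10-point value simply echoes the figure above rather than a regulatory limit.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DisparityThreshold:
    metric: str
    max_gap_points: float  # largest acceptable best-to-worst gap, in percentage points
    rationale: str
    approved_by: str


def fingerprint(threshold):
    """SHA-256 of the threshold record, to file with the approval before testing."""
    payload = json.dumps(asdict(threshold), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def exceeds_threshold(threshold, subgroup_values):
    """True if the gap between best and worst subgroup is above the limit."""
    values = subgroup_values.values()
    gap_points = (max(values) - min(values)) * 100
    return gap_points > threshold.max_gap_points


threshold = DisparityThreshold(
    metric="sensitivity",
    max_gap_points=10.0,
    rationale="Life-threatening condition; missed cases carry high harm.",
    approved_by="clinical-governance-committee",  # hypothetical approver label
)
registered = fingerprint(threshold)

# Later, once test results exist (invented values):
breach = exceeds_threshold(threshold, {"group_a": 0.93, "group_b": 0.78})
print(registered[:12], breach)
```

Recomputing the fingerprint at reporting time and comparing it with the filed value shows that the threshold was not adjusted after the results were known.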
Step 3: Conduct Testing With Statistical Rigour
- Ensure each demographic subgroup has sufficient sample size for statistical inference.
- Test on data that reflects the deployment population, not just the training population.
- Apply the same performance metrics across all subgroups.
- Report confidence intervals, not just point estimates.
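To illustrate why interval reporting and sample size matter, the sketch below computes a Wilson score interval for subgroup sensitivity. The Wilson interval is one common choice for proportions and is an assumption here, since the text does not name a method; the counts are invented.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


# Invented counts: detected cases out of all true positive cases per subgroup.
small_low, small_high = wilson_interval(78, 100)
large_low, large_high = wilson_interval(842, 900)

print(f"n=100: {small_low:.3f} to {small_high:.3f}")
print(f"n=900: {large_low:.3f} to {large_high:.3f}")
```

The smaller subgroup produces a much wider interval, so a point estimate alone can overstate how much is actually known about performance in that group.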
Step 4: Investigate and Remediate
If subgroup testing reveals a performance disparity, investigate its cause before deciding whether to deploy. Potential causes: underrepresentation of the subgroup in training data (data problem); proxy variable encoding structural inequity (feature engineering problem); model architecture optimizing for majority population patterns (design problem). Remediation: augment training data, remove or replace proxy variables, apply fairness constraints, or restrict the intended use of the AI.
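As an illustration of the "remove or replace proxy variables" option, the sketch below ranks the same invented patients by a cost-based target and then by a need-based target, and compares who is selected for extra support. The patient records, field names, and the choice of chronic condition count as the need measure are assumptions made for the example.

```python
def share_of_group_selected(patients, target, group, top_k):
    """Share of the top_k patients (ranked by target, descending) in the given group."""
    ranked = sorted(patients, key=lambda p: p[target], reverse=True)
    selected = ranked[:top_k]
    return sum(1 for p in selected if p["group"] == group) / top_k


# Invented patients: group B spends less at the same or higher level of need.
patients = [
    {"group": "A", "cost": 9000, "chronic_conditions": 4},
    {"group": "A", "cost": 8000, "chronic_conditions": 3},
    {"group": "A", "cost": 7000, "chronic_conditions": 3},
    {"group": "A", "cost": 3000, "chronic_conditions": 1},
    {"group": "B", "cost": 6000, "chronic_conditions": 6},
    {"group": "B", "cost": 5000, "chronic_conditions": 5},
    {"group": "B", "cost": 4000, "chronic_conditions": 2},
    {"group": "B", "cost": 2000, "chronic_conditions": 1},
]

by_cost = share_of_group_selected(patients, "cost", "B", top_k=3)
by_need = share_of_group_selected(patients, "chronic_conditions", "B", top_k=3)
print(f"group B share of top 3 by cost: {by_cost:.2f}")
print(f"group B share of top 3 by need: {by_need:.2f}")
```

Changing the target rather than the model is the remediation this toy example mirrors; whether a better target exists in practice depends on what the deployment data actually records.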
Step 5: Monitor in Production
- Monthly: performance metrics by demographic group against deployment-phase benchmarks.
- Alert threshold: any demographic group showing performance below the pre-specified acceptable threshold triggers investigation.
- Quarterly: complete bias report with trend analysis and corrective action documentation.
- Annual: comprehensive demographic performance review as part of post-market surveillance.
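The monthly check and alert rule above can be sketched as a comparison of each group's metric against the pre-specified floor. The function name, data layout, group names, and the 0.83 floor are all invented for illustration.

```python
def find_alerts(monthly_metrics, minimum_acceptable):
    """Return (month, group, value) for every group-month below the floor."""
    alerts = []
    for month in sorted(monthly_metrics):
        for group, value in sorted(monthly_metrics[month].items()):
            if value < minimum_acceptable:
                alerts.append((month, group, value))
    return alerts


# Invented monthly sensitivity by demographic group.
monthly_sensitivity = {
    "2025-01": {"group_a": 0.93, "group_b": 0.88},
    "2025-02": {"group_a": 0.93, "group_b": 0.85},
    "2025-03": {"group_a": 0.92, "group_b": 0.81},
}

alerts = find_alerts(monthly_sensitivity, minimum_acceptable=0.83)
for month, group, value in alerts:
    print(f"{month}: {group} at {value:.2f} is below threshold, investigate")
```

In this example the overall picture looks stable while group_b drifts downward for three months before crossing the floor, which is the pattern the quarterly trend analysis is meant to catch earlier.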
Right-Sizing for Your Situation
Proxy discrimination basics in healthcare AI; minimum subgroup testing requirements; threshold-setting methodology; vendor data request templates.
Comprehensive demographic subgroup analysis methodology; proxy variable identification in healthcare contexts; ASR bias testing for ambient scribes; EU Article 10 data governance compliance; production monitoring program design.
Enterprise bias governance program; UK Equity in Medical Devices standard alignment; multi-jurisdiction regulatory compliance for bias testing across FDA, Health Canada, TGA, MHRA, and EU AI Act; litigation readiness for bias-related clinical AI claims.
Framework References
Obermeyer et al., ‘Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations’ (Science, 2019) — The defining documented case of healthcare AI proxy discrimination.
UK Equity in Medical Devices: Independent Review (2024) — Government-commissioned review finding minority ethnic people, women, and people from deprived communities face risk from biased medical tools.
NHS STANDING Together Consensus Recommendations (Lancet Digital Health, 2025) — Algorithmic transparency standards including training data composition disclosure and subgroup representation reporting.
EU AI Act (Reg. (EU) 2024/1689) Article 10 — Data governance obligation to address biases “likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination.”
Health Canada Pre-Market Guidance for Machine Learning-Enabled Medical Devices (February 2025) — Training data representativeness requirement including skin pigmentation and biological sex.
Wen D et al., ‘Characteristics of Publicly Available Skin Cancer Image Datasets: A Systematic Review’ (Lancet Digital Health, 2022). Documented severe underrepresentation of darker skin tones in skin cancer AI datasets.
Martin JL, Wright KE, ‘Bias in Automatic Speech Recognition: The Case of African American Language’ (Applied Linguistics, 2023) — ASR performance disparities creating documentation quality gaps relevant to AI scribe equity.
This article is part of AIPMO’s Healthcare series. See also: AI Governance in Healthcare | Clinical Validation of Healthcare AI | Ambient AI and Consent in Healthcare