Calibrating Trust in Diagnostic Copilots
A leading diagnostic AI vendor wanted to ship a lesion-detection copilot for chest CT, but pilot deployments showed worrying patterns: radiologists either over-trusted high-confidence predictions or ignored the system altogether. We were brought in to characterise and fix the trust calibration problem.
Confidence scores expressed as percentages were systematically misinterpreted. Radiologists treated 85% confidence as 'definitely there' and 60% as 'almost certainly not.' This binarisation defeated the calibration the model team had carefully built.
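To make the cost of that binarisation concrete, here is a minimal sketch rather than anything taken from the study. It assumes a perfectly calibrated model, so a displayed 85% means the finding is real 85% of the time, and it scores a reader's belief with the expected Brier score (squared error, lower is better). The function name `expected_brier` and the mapping of 85% to 1.0 and 60% to 0.0 are illustrative assumptions.

```python
def expected_brier(true_prob: float, reported: float) -> float:
    """Expected squared error when a finding is real with probability
    `true_prob` and the reader acts as if the probability were `reported`."""
    return true_prob * (1.0 - reported) ** 2 + (1.0 - true_prob) * reported ** 2


# Assumption: the model is perfectly calibrated, so the displayed confidence
# equals the true probability that the lesion is present.
# Each pair is (confidence shown, value the reader effectively acts on).
for shown, read_as in [(0.85, 1.0), (0.60, 0.0)]:
    as_shown = expected_brier(shown, shown)     # reader takes the score at face value
    binarised = expected_brier(shown, read_as)  # reader rounds to certainty
    print(f"shown {shown:.0%}: as shown {as_shown:.4f}, binarised {binarised:.4f}")
```

Under these assumptions, reading 85% as certainty raises the expected error from 0.1275 to 0.15, and reading 60% as "almost certainly not" raises it from 0.24 to 0.60. The second case is the costlier one because the reader ends up on the wrong side of the evidence.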
We ran a within-subject study with 48 board-certified radiologists across two countries. Each read 32 CT scans under three UI conditions: baseline (percentage), redesigned (calibrated visual language + counterfactual examples), and control (no AI). Eye-tracking captured attention to AI cues; verbal protocols captured reasoning.
- 01 Baseline UI produced 'lazy reliance' on 28% of cases: radiologists agreed with the AI without examining the scan.
- 02 Redesigned UI cut lazy reliance to 11% (−61%) without reducing total agreement rate.
- 03 Appropriate reliance (agreement when AI correct, disagreement when AI wrong) rose by 41%.
- 04 Counterfactual examples were the single highest-impact design element.
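The reliance measures in the findings above could be computed from per-read records along these lines. This is a hedged sketch, not the study's analysis code. The `Read` record, its field names, and the use of all reads as the denominator are assumptions, and `examined_scan` stands in for whatever eye-tracking criterion was actually used to decide that a scan was examined.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """One radiologist reading one scan with AI support (hypothetical schema)."""
    ai_correct: bool     # AI suggestion matched ground truth
    reader_agreed: bool  # reader's final call matched the AI suggestion
    examined_scan: bool  # assumed flag, e.g. derived from eye-tracking


def lazy_reliance_rate(reads: list[Read]) -> float:
    """Share of reads where the reader agreed with the AI without examining the scan."""
    lazy = sum(r.reader_agreed and not r.examined_scan for r in reads)
    return lazy / len(reads)


def appropriate_reliance_rate(reads: list[Read]) -> float:
    """Share of reads where the reader agreed with a correct AI or disagreed with a wrong one."""
    appropriate = sum(r.reader_agreed == r.ai_correct for r in reads)
    return appropriate / len(reads)


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Lazy reliance falling from 28% to 11% of cases, as reported above.
print(f"{relative_change(0.28, 0.11):+.0%}")  # prints -61%
```

The last line reproduces the reported reduction: a drop from 28% to 11% is 17 percentage points, which is about 61% of the baseline rate.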