Calibrating Trust in Diagnostic Copilots

Year

2024

Duration

9 months

Cohort Size

48 radiologists

Headline Result

+41% appropriate reliance

Overview

A leading diagnostic AI vendor wanted to ship a lesion-detection copilot for chest CT, but pilot deployments showed a worrying pattern: radiologists either over-trusted high-confidence predictions or ignored the system wholesale. We were brought in to characterise and fix the trust calibration problem.

Challenge

Confidence scores expressed as percentages were systematically misinterpreted. Radiologists treated 85% confidence as 'definitely there' and 60% as 'almost certainly not.' This binarisation defeated the calibration the model team had carefully built.
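To make the cost of this binarisation concrete, here is a minimal Python sketch. It assumes a perfectly calibrated model, a batch of 100 findings at each confidence level, and a hypothetical cutoff of 0.75 standing in for the reader's mental threshold. None of these figures come from the study, and the function name is illustrative only.

```python
def expected_errors_if_binarised(
    confidence: float, n_findings: int, cutoff: float = 0.75
) -> float:
    """Expected wrong calls when every finding at `confidence` gets the same yes/no answer.

    Assumes perfect calibration: a fraction `confidence` of the findings are real lesions.
    `cutoff` is a hypothetical stand-in for the reader's mental threshold.
    """
    if confidence >= cutoff:
        # All treated as "definitely there": the non-lesions become false positives.
        return (1.0 - confidence) * n_findings
    # All treated as "almost certainly not": the real lesions become misses.
    return confidence * n_findings


for conf in (0.85, 0.60):
    errors = expected_errors_if_binarised(conf, n_findings=100)
    print(f"{conf:.0%} confidence: about {errors:.0f} of 100 calls wrong")
```

Under these assumptions, dismissing every 60%-confidence finding gets more calls wrong than right, which is why a well-calibrated score loses its value once it is read as a yes/no signal.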

Approach

We ran a within-subject study with 48 board-certified radiologists across two countries. Each read 32 CT scans under three UI conditions: baseline (percentage), redesigned (calibrated visual language + counterfactual examples), and control (no AI). Eye-tracking captured attention to AI cues; verbal protocols captured reasoning.

Key Findings

01 Baseline UI produced 'lazy reliance' on 28% of cases, where radiologists agreed with the AI without examining the scan.
02 Redesigned UI cut lazy reliance to 11% (a 61% relative reduction) without reducing the overall agreement rate.
03 Appropriate reliance (agreement when AI correct, disagreement when AI wrong) rose by 41%.
04 Counterfactual examples were the single highest-impact design element.
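The reliance measures above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's analysis code: the `Read` record, its field names, and the `examined_scan` flag (for example, derived from eye-tracking) are hypothetical, and the source does not say how each measure was operationalised.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """One radiologist reading one scan with AI support (hypothetical record)."""
    ai_correct: bool     # the AI's call matched ground truth
    agreed: bool         # the radiologist's final call matched the AI's
    examined_scan: bool  # assumed flag, e.g. gaze dwelt on the relevant region


def lazy_reliance_rate(reads: list[Read]) -> float:
    """Share of reads where the radiologist agreed without examining the scan."""
    return sum(r.agreed and not r.examined_scan for r in reads) / len(reads)


def appropriate_reliance_rate(reads: list[Read]) -> float:
    """Share of reads with agreement when the AI was right or disagreement when it was wrong."""
    return sum(r.agreed == r.ai_correct for r in reads) / len(reads)


def relative_change(before: float, after: float) -> float:
    """Change from `before` to `after`, expressed as a fraction of `before`."""
    return (after - before) / before


# The reported drop in lazy reliance from 28% to 11% of cases:
print(f"{relative_change(0.28, 0.11):+.0%}")
```

Applied to the reported rates, `relative_change(0.28, 0.11)` reproduces the 61% relative reduction, which corresponds to a drop of 17 percentage points.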
