The study said that part of the issue is that existing rules for deploying AI in clinical settings, such as the standards for FDA clearance in the US or a CE mark in Europe, focus primarily on accuracy. There are no explicit requirements that an AI must improve the outcome for patients, largely because such trials have not yet been run.
Emma Beede, a UX researcher at Google Health, said: "We have to understand how AI tools are going to work for people in context -- especially in health care -- before they're widely deployed."
Google's first opportunity to test the tool in a real setting came from Thailand. The country's ministry of health has set an annual goal to screen 60 percent of people with diabetes for diabetic retinopathy, which can cause blindness if not caught early. But with around 4.5 million patients to only 200 retinal specialists -- roughly double the ratio in the US -- clinics are struggling to meet the target.
Google has CE mark clearance, which covers Thailand, but it is still waiting for FDA approval. So to see if AI could help, Beede and her colleagues outfitted 11 clinics across the country with a deep-learning system trained to spot signs of eye disease in patients with diabetes.
In the system Thailand had been using, nurses take photos of patients' eyes during check-ups and send them off to be looked at by a specialist elsewhere -- a process that can take up to 10 weeks. The AI developed by Google Health can identify signs of diabetic retinopathy from an eye scan with more than 90 percent accuracy -- which the team calls "human specialist level" -- and, in principle, give a result in less than 10 minutes.
While this sounds good, an accuracy figure alone says nothing about how the AI will perform in the chaos of a real-world environment, and this is what the Google Health team wanted to find out.
Over several months they observed nurses conducting eye scans and interviewed them about their experiences using the new system. The feedback wasn't entirely positive.
The AI did speed things up when it worked. But it sometimes failed to give a result at all. The deep-learning model had been trained on high-quality scans; to ensure accuracy, it was designed to reject images that fell below a certain threshold of quality. With nurses scanning dozens of patients an hour and often taking the photos in poor lighting conditions, more than a fifth of the images were rejected.
Nurses felt frustrated, especially when they believed the rejected scans showed no signs of disease and the follow-up appointments were unnecessary. They sometimes wasted time trying to retake or edit an image that the AI had rejected.
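To make the rejection behavior concrete, here is a minimal sketch of a quality gate placed in front of a classifier. It is not Google's implementation: the `Scan` type, the numeric quality score, the `min_quality` threshold of 0.5, and both function names are hypothetical, chosen only to illustrate how a system that refuses low-quality images can end up returning no result for a sizable share of scans.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Scan:
    patient_id: str
    quality: float  # hypothetical image-quality score between 0 and 1


def grade_scan(
    scan: Scan,
    classify: Callable[[Scan], str],
    min_quality: float = 0.5,  # assumed threshold, for illustration only
) -> Optional[str]:
    """Return the classifier's grade, or None if the scan is rejected."""
    if scan.quality < min_quality:
        return None  # rejected: the patient gets no AI result
    return classify(scan)


def rejection_rate(scans: List[Scan], min_quality: float = 0.5) -> float:
    """Fraction of scans that fall below the quality threshold."""
    if not scans:
        return 0.0
    rejected = sum(1 for s in scans if s.quality < min_quality)
    return rejected / len(scans)
```

In this sketch a rejected scan produces `None` rather than a grade, which mirrors the situation the nurses described: the model's accuracy on accepted images is unaffected, but every rejection is a patient who leaves without an answer.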