(2024-10-25) Researchers Say An AI-powered Transcription Tool Used In Hospitals Invents Things No One Ever Said

OpenAI has touted its artificial intelligence-powered transcription tool Whisper as having near “human level robustness and accuracy.”

But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers.

More concerning, they said, is a rush by medical centers to utilize Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk domains.”

A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in 8 out of every 10 audio transcriptions he inspected, before he started trying to improve the model.

A machine learning engineer said he initially discovered hallucinations in about half of the over 100 hours of Whisper transcriptions he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.

Such mistakes could have “really grave consequences,” particularly in hospital settings.

“This seems solvable if the company is willing to prioritize it,” said William Saunders, a San Francisco-based research engineer who quit OpenAI in February over concerns with the company’s direction. “It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.”

The tool is integrated into some versions of OpenAI’s flagship chatbot ChatGPT, and is a built-in offering in Oracle and Microsoft’s cloud computing platforms, which service thousands of companies worldwide. It is also used to transcribe and translate text into multiple languages.

Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short snippets they obtained from TalkBank, a research repository hosted at Carnegie Mellon University. They determined that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.

In an example they uncovered, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
But the transcription software added: “He took a big piece of a cross, a teeny, small piece ... I’m sure he didn’t have a terror knife so he killed a number of people.”

In another transcription, Whisper invented a non-existent medication called “hyperactivated antibiotics.”

OpenAI recommended in its online disclosures against using Whisper in “decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes.”

That warning hasn’t stopped hospitals or medical centers from using speech-to-text models, including Whisper, to transcribe what’s said during doctor’s visits to free up medical providers to spend less time on note-taking or report writing.

Over 30,000 clinicians and 40 health systems, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, have started using a Whisper-based tool built by Nabla, which has offices in France and the U.S.

That tool was fine-tuned on medical language to transcribe and summarize patients’ interactions, said Nabla’s chief technology officer Martin Raison.
