ChatGPT bombs test on diagnosing kids’ medical cases with 83% error rate

suoko@feddit.it · 10 months ago

NevermindNoMind@lemmy.world · edit-2 10 months ago

This is such an annoyingly useless study.

1) The cases they gave ChatGPT were specifically designed to be unusual and challenging; they are basically brain teasers for pediatrics. So all you’ve shown is that ChatGPT can’t diagnose rare cases, but we learn nothing about how it does on common cases. It’s also not clear that these questions had actual verifiable answers, as the article only mentions that the magazine they were taken from sometimes explains the answers.
2) Since these are magazine brain teasers, and not an actual scored test, we have no idea how ChatGPT’s score compares to human pediatricians. Maybe an 83% error rate is better than the average pediatrician score.
3) Why even do this test with a general-purpose foundation model in the first place, when there are tons of domain-specific medical models already available, many open source?
4) The paper is paywalled, but there doesn’t seem to be any indication that the researchers used any prompting strategies. Just last month Microsoft released a paper showing GPT-4, using CoT and multi-shot prompting, could get a 90% score on the medical license exam, surpassing the 86.5% score of the domain-specific Med-PaLM 2 model.
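For anyone unfamiliar with the terms, "CoT" (chain of thought) and multi-shot prompting roughly mean showing the model a few worked examples with their reasoning before asking the new question. A minimal sketch of assembling such a prompt follows. The function name, the field names, and the placeholder strings are all hypothetical, and this is a generic illustration, not the strategy from the Microsoft paper.

```python
def build_few_shot_cot_prompt(worked_examples, new_case):
    """Assemble a few-shot chain-of-thought prompt as a single string."""
    blocks = []
    for example in worked_examples:
        # Each worked example shows the case, the reasoning, and the answer.
        blocks.append(
            f"Case: {example['case']}\n"
            f"Reasoning: {example['reasoning']}\n"
            f"Diagnosis: {example['diagnosis']}"
        )
    # The final block leaves the reasoning open so the model writes its own.
    blocks.append(f"Case: {new_case}\nReasoning: Let's think step by step.")
    return "\n\n".join(blocks)


# Placeholder text only (hypothetical); no real clinical content.
worked_examples = [
    {"case": "<case text 1>", "reasoning": "<worked reasoning 1>", "diagnosis": "<diagnosis 1>"},
    {"case": "<case text 2>", "reasoning": "<worked reasoning 2>", "diagnosis": "<diagnosis 2>"},
]
prompt = build_few_shot_cot_prompt(worked_examples, "<new case text>")
print(prompt)
```

A zero-shot prompt, by contrast, would contain only the new case with no worked examples in front of it.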

This paper just smacks of defensive doctors trying to dunk on ChatGPT. Give a multi-purpose model super hard questions, no prompting advantage, and no way to compare its score against humans, and then just go “hur dur, chatbot is dumb.” I get it, doctors are terrified because specialized LLMs are very certain to take a big chunk of their work in the next five years, so anything they can do to muddy the water now and put some doubt in people’s minds is a little job protection.

If they wanted to do something actually useful, they could give those same questions to a dozen human pediatricians, to GPT-4 with zero-shot prompting, to GPT-4 with Microsoft’s prompting strategy, and to Med-PaLM 2 or some other high-performing domain-specific models, and then compare the results. Oh, and why not throw in a model that can reference an external medical database for fun! I’d be very interested in those results.
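The comparison described above boils down to scoring every arm against the same answer key. Here is a rough sketch of that bookkeeping. The arm names, the answer labels, and the exact-match grading are all assumptions made for illustration (the study in the article used physician graders, not string matching), and the data below is made-up placeholder data, not results.

```python
def error_rates(answer_key, responses):
    """Return the fraction of wrong answers for each arm of the comparison."""
    rates = {}
    for arm, answers in responses.items():
        if len(answers) != len(answer_key):
            raise ValueError(
                f"{arm}: expected {len(answer_key)} answers, got {len(answers)}"
            )
        # Assumption: an answer counts as correct only on a case-insensitive exact match.
        wrong = sum(
            given.strip().lower() != expected.strip().lower()
            for given, expected in zip(answers, answer_key)
        )
        rates[arm] = wrong / len(answer_key)
    return rates


# Placeholder data (hypothetical): four cases, three arms.
answer_key = ["dx-a", "dx-b", "dx-c", "dx-d"]
responses = {
    "human pediatricians": ["dx-a", "dx-b", "dx-x", "dx-d"],
    "gpt-4 zero-shot": ["dx-a", "dx-x", "dx-x", "dx-x"],
    "gpt-4 with prompting strategy": ["dx-a", "dx-b", "dx-x", "dx-x"],
}

rates = error_rates(answer_key, responses)
# List the arms from lowest to highest error rate.
for arm, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{arm}: {rate:.0%} error rate")
```

The point of the shared answer key is that an 83% error rate only means something next to the error rates of the other arms on the same cases.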

Edit to add: If you want to read an actually interesting study, try this one: https://arxiv.org/pdf/2305.09617.pdf from May 2023. “Med-PaLM 2 scored up to 86.5% on the MedQA dataset…We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility.” The average human score is about 60% for comparison. This is the domain specific LLM I mentioned above, which last month Microsoft got GPT-4 to beat just through better prompting strategies.

Ugh, this article and study are annoying.

deejay4am@lemmy.world · 10 months ago

LLMs are not even the right type of AI to try to do medical diagnosis with. Stop treating LLMs like they can fucking think and reason. They do not.

NevermindNoMind@lemmy.world · 10 months ago

There are literally probably a dozen LLMs trained exclusively on, or fine-tuned on, medical papers and other medical materials, specifically designed to do medical diagnosis. They already perform on par with or better than the average doctor in some tests. It’s already a thing. And they will get better. Will they replace doctors outright? Probably not, at least not for a while. But they certainly will be very helpful tools to help doctors make diagnoses and catch blind spots. I’d bet in 5-10 years it will be considered malpractice (i.e., below the standard of care) not to consult with a specialized LLM when making certain diagnoses.

On the other hand, you make a very compelling argument of “nuh uh” so I guess I should take that into account.

suoko@feddit.it · 10 months ago

I’m curious about llama2-medical bot results too

Catoblepas@lemmy.blahaj.zone · edit-2 10 months ago

If you want the advanced predictive text to give you medical treatment, have fun. I’m sure as shit not trusting anything other than a human being with my health.

“I get it, doctors are terrified because specialized LLMs are very certain to take a big chunk of their work in the next five years, so anything they can do to muddy the water now and put some doubt in people’s minds is a little job protection.”

Ah yes, the common refrain from doctors that they have too little work and the field is overcrowded.

I’m gonna be honest dude it sounds like you’re starting from “ChatGPT good” and working backwards, not that you have any specialized knowledge of how medicine works as a profession and how ChatGPT could affect it.

But I’m sure this time the capitalists will save us from the medical industrial complex and not just wring even more blood out of a stone.

PS: you did not link a published, peer-reviewed study, you linked a preprint. Ethical sites will clearly display such information, like the ResearchGate page for the preprint does.

NevermindNoMind@lemmy.world · edit-2 10 months ago

It’s fine to be skeptical of AI medical diagnostics. But your response is as much of a knee-jerk “AI bad” as you accused me of being biased toward “AI good”. At no point did you ever bother to discuss or argue against any of the points I raised about the quality and usefulness of the cited study. Your response consisted entirely of 1) you sure as shit won’t trust AI, 2) doctors aren’t afraid of AI cause they are so busy, 3) I am biased, 4) capitalism bad (ironic since I was mostly talking about an open-source model), 5) the study I cited is bad because it’s a preprint (unlike all the wonderful studies you cited).

Since you don’t want to deal with the substance, and just want to talk about “AI bad, doctor good”, and since you only respect published studies: In the US our wonderful human doctors cause serious medical harm through misdiagnosis in about 800,000 cases a year (https://qualitysafety.bmj.com/content/early/2023/08/07/bmjqs-2021-014130). Our wonderful human doctors routinely ignore female complaints of pain, making them less likely to receive a diagnosis for abdominal pain (https://pubmed.ncbi.nlm.nih.gov/18439195/), less likely to receive treatment for knee pain (https://pubmed.ncbi.nlm.nih.gov/18332383/), more likely to be sent home by our human doctors after being misdiagnosed while suffering a heart attack (https://pubmed.ncbi.nlm.nih.gov/10770981/), and more likely to have a stroke diagnosis missed (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5361750/). So maybe let’s not pretend like humans are infallible.

Healthcare diagnosis is something that one day could be greatly improved with the assistance of AI, which can be kept up to date with the latest studies, which can read and analyze a patient’s entire medical history and catch things a doctor might miss, and which can conduct statistical analysis far better than a doctor relying on their vague recollections from 30 years ago in medical school. An AI never has a bad day when it doesn’t feel like dealing with patients, is never tired or hungover, and will never dismiss a patient’s concerns because of some bias about the patient being a woman, or the wrong skin color, or because they sound dumb, or whatever else (yes, AI can be biased, they learn it from us, but I’d argue it’s easier to train bias out of AI than it is to train it out of the GP in Alabama screaming about DEI while writing a donation check to Trump). Will AI be perfect? No. Will it be better than doctors? Probably not for a while, but maybe. But it can absolutely assist and lead to better diagnoses.

And since you want to cry about capitalism while defending one of the weirdest capitalistic structures (the healthcare industry), maybe think about what it would mean for millions of people to be able to run an open-source diagnostic tool on their phones to help determine if they need treatment, without having to be charged 300 dollars by a doctor for walking into the office just to be ignored and dismissed so the doctor can quickly move to the next patient that has health insurance so they can get paid. Hmm, maybe democratizing access to medical diagnostics and care might be anti-capitalist? Wild thought. No, that can’t be right, we need a system with health insurance gatekeepers and doctors taking on patients based on whether they have the insurance or cash to get them that new beamer.

LemmyIsFantastic@lemmy.world · 10 months ago

ChatGPT is in no way at all designed to diagnose. Waste of electrons and breath this article is.

AutoTL;DR@lemmings.world · 10 months ago

This is the best summary I could come up with:

While the chatty AI bot has previously underwhelmed with its attempts to diagnose challenging medical cases—with an accuracy rate of 39 percent in an analysis last year—a study out this week in JAMA Pediatrics suggests the fourth version of the large language model is especially bad with kids.

The medical field has generally been an early adopter of AI-powered technologies, resulting in some notable failures, such as creating algorithmic racial bias, as well as successes, such as automating administrative tasks and helping to interpret chest scans and retinal images.

But AI’s potential for problem-solving has raised considerable interest in developing it into a helpful tool for complex diagnostics—no eccentric, prickly, pill-popping medical genius required.

For ChatGPT’s test, the researchers pasted the relevant text of the medical cases into the prompt, and then two qualified physician-researchers scored the AI-generated answers as correct, incorrect, or “did not fully capture the diagnosis.”

Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature—not stuff on the Internet, which can include inaccurate information and misinformation.

“This presents an opportunity for researchers to investigate if specific medical data training and tuning can improve the diagnostic accuracy of LLM-based chatbots,” the authors conclude.

The original article contains 721 words, the summary contains 211 words. Saved 71%. I’m a bot and I’m open source!

NigelFrobisher@aussie.zone · 10 months ago

Why would a chatbot be any good at this - it’s just matching text to keywords based on probability? You may as well let the patient do internet self-diagnosis (it’s always cancer!).

Grownbravy [they/them]@hexbear.net · 10 months ago

Well, it’s not a doctor anyway (proceeds to replace actual trained and qualified professionals)