[long] Some tests of how much AI "understands" what it says (spoiler: very little)

diz@awful.systems · 1 year ago

Barbarian · 1 year ago

There are plenty of people right here on Lemmy that confidently describe LLMs as “thinking” because it’s a neural net, so it must be just like a brain. Based on that, a debunking is useful.

MudMan@fedia.io · 1 year ago

That’s fair, but the takeaway of asking it to think and showing wonky output is not that it’s not thinking, it’s that it’s thinking poorly, or wrong.

That’s the gap that frustrates me. The important bit is to debunk the thinking part altogether.

I remember in one of the earlier versions of ChatGPT a couple of my normie friends were having this conversation where they were going back and forth on whether the chatbot could do this or that operation or get this or that answer right. I took the chatbot and asked it the same thing a few times, same as the OP did, and got the right AND wrong answers. I pointed out that’s because it’s not thinking, it’s giving you a likely text follow-up, so the dice are gonna roll differently in each attempt. The sense of realization in their eyes was incredible. Such an “Oh… Oooooooh” moment. Seeing the conversation get reframed like that is a big part of why I’m frustrated by the online discourse not doing that more.

Barbarian · edit-2 1 year ago

showing wonky output is not that it’s not thinking

I pointed out that’s because it’s not thinking, it’s giving you a likely text follow-up

Sorry? Which is it?

You start the comment disagreeing with this type of analysis/debunking, and then go on to agree with it, even using a very similar result of a percentage chance of a particular response as in the original post.

I am very confused by your comment.

MudMan@fedia.io · 1 year ago

I am saying the first quote is the incorrect takeaway from wonky output. People see wonky output and think it means the machine is thinking, it’s just doing it wrong. But it’s not thinking in the first place.

So the second is my statement, the first is me explaining the statements I disagree with.

I am not saying that demonstrating that the model spits out two different answers in response to the same prompt doesn’t mean anything. I’m saying what it means is that it’s rolling the dice on a piece of text that is likely to be context-appropriate, not thinking about what the question is and getting confused because it’s a bad artificial intelligence. So the method isn’t a problem, it’s the framing of the results that is a problem.
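To make the dice-rolling point concrete, here is a minimal sketch in Python. It is a toy, not how any real chatbot is implemented: the two canned replies and their probabilities in `TOY_DISTRIBUTION` are made-up assumptions, and `sample_reply` and `ask_many_times` are hypothetical helper names. It only shows that asking the "same question" repeatedly and drawing from a fixed distribution gives a mix of right and wrong answers, with nothing in the loop that checks which is which.

```python
import random
from collections import Counter

# Hypothetical probabilities for two possible replies to one fixed prompt.
# These numbers are invented for illustration only.
TOY_DISTRIBUTION = {"right answer": 0.6, "wrong answer": 0.4}


def sample_reply(rng: random.Random) -> str:
    """Draw one reply at random, weighted by its probability."""
    replies = list(TOY_DISTRIBUTION)
    weights = list(TOY_DISTRIBUTION.values())
    return rng.choices(replies, weights=weights, k=1)[0]


def ask_many_times(n: int, seed: int = 0) -> Counter:
    """Ask the same 'prompt' n times and count how often each reply appears."""
    rng = random.Random(seed)
    return Counter(sample_reply(rng) for _ in range(n))


if __name__ == "__main__":
    # Same prompt, same distribution, different rolls of the dice.
    print(ask_many_times(1000))
```

Both replies show up across repeated runs, roughly in proportion to their weights. The wrong ones are not the result of a confused thought process, they are just the less likely draws.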

Does that make it any clearer? I think that’s a fair question. It is a fairly nuanced distinction AND a pretty big reframing of the situation, so I get why it’s hard to both convey and understand from a forum post.

Barbarian · edit-2 1 year ago

If your stance is that LLMs do not have the capacity to think, then isn’t that just agreeing with the post? I don’t understand the contention there. The post very clearly states the issues you’re talking about.

When an LLM outputs something like a non-existent but highly plausible citation, it is working precisely as an LLM should - modeling the statistical distribution of text and sampling from it.

Calling it a “hallucination” is an attempt to divert the discussion from the possibility that a language model is simply not the right tool for the job when accurate information is desired.

@[email protected] may jokingly refer to “Absolute Imbecile Level Reasoning Benchmark”, but that’s an attempt to counter the narrative that these things can think, let alone reason. If you have big “AI” companies trying to sell LLMs as capable of making decisions, pushback like this is very reasonable.

MudMan@fedia.io · edit-2 1 year ago

The first quote is closest to my stance, but it’s not what the entire post is about, and it’s inaccurate.

The model isn’t sampling from the text, it’s actually creating new text. That is a misconception, often a deliberate one to highlight the (very real) challenges this stuff poses against traditional IP and copyright.
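A toy way to see the difference between sampling from the text and sampling from a distribution modeled on the text is the sketch below. It uses a word-level bigram table, which is a much simpler technique than an LLM and is swapped in here purely for illustration. The two-sentence `CORPUS` and the helper names `build_bigram_table` and `generate` are assumptions made up for this example.

```python
import random
from collections import defaultdict

# Tiny made-up "training text" for illustration.
CORPUS = ["the cat sat on the mat", "the dog sat on the rug"]
END = "<end>"


def build_bigram_table(sentences):
    """Record, for every word, the words that followed it in the corpus."""
    table = defaultdict(list)
    for sentence in sentences:
        words = sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            table[current].append(following)
    return table


def generate(table, rng, start="the", max_words=12):
    """Build a sentence by repeatedly sampling a follower of the last word."""
    words = [start]
    while len(words) < max_words:
        following = rng.choice(table[words[-1]])
        if following == END:
            break
        words.append(following)
    return " ".join(words)


if __name__ == "__main__":
    table = build_bigram_table(CORPUS)
    rng = random.Random(0)
    for _ in range(5):
        sentence = generate(table, rng)
        label = "copied" if sentence in CORPUS else "new"
        print(f"{label}: {sentence}")
```

Sequences such as "the cat sat on the rug" can come out even though they appear nowhere in the corpus. The output is new text, yet it is still produced by sampling from statistics of the training text, so both descriptions can be true at once.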

And the problem is the rest of it:

To be fair to the poor AI, it used a numbered list, rather than numbering the 7 steps on its own and then claiming it was 5. Still, it is rather funny to see that it can’t even count.

Note that LLMs are not so dumb as to be naturally unable to answer something like “Barber shaves himself. Does he shave himself?”.

Those and many other parts of this are concerned with whether the model is “dumb”, or about which cognitive capabilities it demonstrates. The answer is none. It’s pointless to shame it for not being good at counting or at keeping track of formal logic because it’s doing neither. That’s like shaming a shoe for being a bad hammer.

My argument here is that framing it that way buys into the techbro premise that cognitive ability IS demonstrated but is rudimentary. If it is there but it’s rough, then more tech can refine it and make it better. If it is something fundamentally different from cognitive ability, then it’s less likely that incremental improvement will make the cognitive ability emerge.

You’ll note I’m hedging a bit, because like I said earlier I am neither so knowledgeable nor so cocky as to claim cognitive ability couldn’t emerge, or that there isn’t some form of understanding implicit in the current process. I would have said it’s so far removed from cognition that language wouldn’t have worked and yet here we are, so I’m keeping my mind open.

But beyond the surprise that these things are pretty good at language and semi-decent at making images, I have no real impression that there is a mechanic here that should give rise to AGI or cognition of any kind just by doing more training or adding more data. I’d be very wary of implying that these are bad at being smart, as opposed to doing something entirely different from being smart.

I also think that’s unfair to the tech, because it’s kinda nuts that it works as well as it does at generating coherent, natural language. It’s just that it’s being misrepresented as doing something else for a number of reasons.

A couple simple probes:

GPT4 is uncannily good at recognizing the river crossing puzzle

An Idiot With a Petascale Cheat Sheet

Is this a “hallucination”?

But after an update, GPT-whatever is so much better at such prompts.

The need for an Absolute Imbecile Level Reasoning Benchmark

Randomness in bullshitting