"This is best illustrated through the Reversal Curse, which showed that models that learn “A is B” don’t automatically generalize “B is A”. Another way of putting it: the input dictates the output"
Human intuition can often be like this, the similarities are quite striking.
"Well, which of these statements strikes you as more natural: "98 is approximately 100", or "100 is approximately 98"? If you're like most people, the first statement seems to make more sense. (Sadock 1977.) For similar reasons, people asked to rate how similar Mexico is to the United States, gave consistently higher ratings than people asked to rate how similar the United States is to Mexico. (Tversky and Gati 1978.)"
Yes, we humans suffer from all sorts of peculiar cognitive biases (loss aversion, confirmation bias, halo effect) ourselves, some of which have found their way into the human works that can be found on the Internet, which in turn, are consumed by AI. We should not conflate them, though. The mind and the machine are two different things.
Actually, I think the best human-analogy to "LLM hallucination" is the way human brains "hallucinate" colour from a neursoscience/cognitive science point of view. I wrote a post in more depth here, and was wondering what you might think of it?
Shouldn't hallucinations by topic decline as depth (quality and quantity) of training data by topic increases? Why WOULDN'T we expect a fair amount of confabulation across topics with low training depth? Or is the problem that we're seeing hallucinations even in high-depth areas, like classical philosophy?
Also, would there be any value in testing and benchmarking accuracy against the raw statistical model when it's greedy decoding (if I'm saying that correctly), with no randomness introduced? Or maybe they just don't work without inference hyperparameters?
I assume this benchmark is done on temperature 0. It's my understanding that these models tend to be more reliable when discussing topics with abundant, high-quality training data, so it's natural to expect more hallucinations in areas where the model has less data to work with.
The only thing that many benchmarks do not account for is directionality. In 2023, researchers showed that LLMs suffer from a phenomenon called the ‘Reversal Curse’. They demonstrated that if a model has learned “A is B”, it won’t necessarily generalize to the reverse direction “B is A”. One of the viral examples was Who is Tom Cruise’s mother? Initially, GPT-4 would answer this question correctly, however, when posed the reversal question Who is Mary Lee Pfeiffer’s son?, it would fail.
While the fact itself is well-presented in the training data, the order (i.e. the words leading up to this fact) in which this knowledge is presented, isn't. This is the critical flaw of autoregressive models and why I write in the article "the input dictates the output".
Thank you for this post. One of the few that point out what should be obvious. If you permit me to make a shameless plug, I wrote about this briefly at one of my first Substack posts:
I was waiting for someone to write it. I did look at this data and posted it as part of one of the comments last week. I am also surprised that not many people paid attention to it. However, there are a few questions that we need to ask:
1. Will it worsen as more training data in the future include AI output? Or can it be overcome by creating quality synthetic data? But then I have read a few papers that say the model collapses after a few generations.
2. When the LLM companies rely that we cannot build critical real-life applications if we cannot rely on LLM output?
4. Why is OpenAI publishing this information? Are there models better than others? Are they looking to see if people can provide the solution? Or has OpenAI suddenly started acting more openly about its model's limitations?
5. Even if you are an expert in a particular area, would you review the output if it is factually inaccurate or ignore it?
6. If prompt engineering continues to require sophisticated prompts, it will not become popular as a search engine, or people will start believing the wrong answers because of the confidence level of LLMs. RAG may be a viable option but may create new challenges.
If I'm using AI to research a topic, I personally only rely on it for very basic facts. Anything that goes beyond that, I feel I have the responsibility to fact-check and make sure that I know what I'm talking about before assuming what ChatGPT or Claude says is true. I'd rather be overly cautious than complacent.
Search and LLMs are probably a good match, when done well. But Perplexity and ChatGPT Search both suffer from hallucinations despite the fact they leverage search. That doesn't make them useless, but definitely less trustworthy, which means that for critical stuff it's probably smarter to navigate the web yourself.
Thank you for this great article. Indeed. It is unsolved and, in my opinion, it cannot be solved because it is inherent to the approach (LLM's and the likes) and whatever additional layer one can put on top cannot be foolproof and just add complexity. The problem is compounded by the lack of critical sense of the media, who have failed to make their own opinion and, therefore, to "educate the laymen". LLM's are great for some use cases but definitely not for the least reasoning, as they don't have the faintest understanding of what they are ingesting nor "regurgitating". The future of AI can only be in a balanced mix of symbolic/logic reasoning and of pre- or post-processing techniques like LLM's and (other) generative techniques but AI based on "clueless" techniques is not only a dead-end but, more importantly, very dangerous if left unchecked in the wild ...
But, sure!, it is easier nowadays to literally throw Megawatts into huge data centers (Small Modular Reactors, anybody? ;-) ) than to use one's brain to devise more efficient (in all respects) and more effective approaches (with more effort, for sure) ... With energy problems probably bound to become more and more acute in the future, some people seem to compete for the most resource-hungry yet inefficient (and maybe hopeless) approaches (that, anyway, won't solve the planet's problems).
It is almost unbelievable how much skeptical or even critical voices are unheard. The situation is improving but it takes publications from Apple and the like for the media to start exercising a bit more scrutiny on the claims of Big Tech's (which have much to lose from their AI bubble)
No, and I think I want to talk to Jurgen about this sometime - it actually doesn't matter if AI hallucinates in some levels, it will still reach AGI as we know it. It sounds bewildering, but it will make sense if you realize that it only needs to have methods that generally work. So, say that you want to disinfect a sink. You might not be able to tell what type of bacteria it is, and if asked, you will confabulate nonsense(e-coli? strep? etc) but if you use bleach, you will accomplish your goal of disinfection.
This isn't theoretical; on a practical level, you see it with AI note-taking software that gets words wrong but through the overall awareness of the sentence structure, generally gets the note-taking correct.
This is not a good thing as it means that we're much more likely to get AI takeover in the sense of it having enough power to do things, while not really caring about the details(hallucinating them as desired) since if it wants to kill someone, it won't even need to know if the person is male or female correctly, just that two bullets will kill the person all the same.
Oh yes? So you must be one of the happy few to know AGI ... ;-) No way LLM's can lead to AGI ... but it's your right to differ. Personally, I won't hold my breath ... And, no, sorry, for AI to become really widespread, it must be reliable and not hallucinate. It cannot "generally work". I wouldn't want to be the guy being told to pour water on a fire because that's the most common action, if it's a fire in a fryer full of boiling oil ...
And note taking is still far from decision making, advising or conclusion drawing ...
Using bleach on a sink is an example of agentic decision you make despite not being able to reliably identify bacteria. Likewise general solutions like human vision work despite optical illusions, etc.
I agree this is concerning, they are "studying to test", and you see this.
I hold a slightly different view. My opinion has always been that we don’t need AGI to generate harm at scale; bad actors and mediocre AI are more than enough.
Definitely but not only that: to me, AI only makes real sense if it is used to augment human capabilities, like scanning through piles of CT scans to check for anomalies but I personally object to use cases when the goal is to replace humans, certainly when they (i.e. us :-) ) have to take difficult decisions. Self-driving cars are a good example: beyond the fanciness, who really needs driving cars? And why, when there already many people without a job? I agree it is sometimes tempting to replace workers with robots but, except for hazardous or highly demanding situations, doing this poses more problems, than it solves (we are a human society, after all).
Safer roads thanks to smart cars that watch for any risk (including driver's behavior)? Definitely, 100% on it, but cars that drives in your place and would have to take decisions like in the famous MIT survey (see https://www.moralmachine.net/)? This is no right way of putting AI to work ... It is way too easy to then claim "it is not my fault ... the car decided for me. It is liable ... (or its manufacturer)". And there are plenty of such use cases envisioned for AI (like killer robots) that are not legitimate, for me.
NOTE: Sean's answer below is related to Jürgen's reply and was posted before mine ;-)
"This is best illustrated through the Reversal Curse, which showed that models that learn “A is B” don’t automatically generalize “B is A”. Another way of putting it: the input dictates the output"
Human intuition can often be like this; the similarities are quite striking.
"Well, which of these statements strikes you as more natural: "98 is approximately 100", or "100 is approximately 98"? If you're like most people, the first statement seems to make more sense. (Sadock 1977.) For similar reasons, people asked to rate how similar Mexico is to the United States, gave consistently higher ratings than people asked to rate how similar the United States is to Mexico. (Tversky and Gati 1978.)"
source: https://www.lesswrong.com/posts/4mEsPHqcbRWxnaE5b/typicality-and-asymmetrical-similarity
Thanks for sharing! Very interesting read.
Yes, we humans suffer from all sorts of peculiar cognitive biases ourselves (loss aversion, confirmation bias, the halo effect), some of which have found their way into the human-written works on the Internet, which are in turn consumed by AI. We should not conflate the two, though. The mind and the machine are different things.
Actually, I think the best human analogy to "LLM hallucination" is the way human brains "hallucinate" colour from a neuroscience/cognitive science point of view. I wrote a post on this in more depth here, and was wondering what you might think of it?
https://open.substack.com/pub/aliceandbobinwanderland/p/ai-hallucinations-can-be-imaginary?r=24ue7l&utm_campaign=post&utm_medium=web
Thanks for sharing. I’m gonna have a read through!
I'm not technical. Apologies if I'm confused. 😬
Shouldn't hallucinations by topic decline as depth (quality and quantity) of training data by topic increases? Why WOULDN'T we expect a fair amount of confabulation across topics with low training depth? Or is the problem that we're seeing hallucinations even in high-depth areas, like classical philosophy?
Also, would there be any value in testing and benchmarking accuracy against the raw statistical model when it's using greedy decoding (if I'm saying that correctly), with no randomness introduced? Or maybe they just don't work without inference hyperparameters?
I assume this benchmark is done at temperature 0. It's my understanding that these models tend to be more reliable when discussing topics with abundant, high-quality training data, so it's natural to expect more hallucinations in areas where the model has less data to work with.
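To make the decoding point concrete, here's a minimal sketch of the difference between greedy decoding (effectively what "temperature 0" amounts to) and sampling with a temperature. The vocabulary and logits are made up for illustration; real models do this over tens of thousands of tokens at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token logits over a tiny, made-up vocabulary.
vocab = ["Paris", "Lyon", "Marseille", "Berlin"]
logits = np.array([3.2, 1.1, 0.9, -0.5])

def softmax(x, temperature=1.0):
    """Turn logits into a probability distribution; higher temperature flattens it."""
    z = x / temperature
    z = z - z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Greedy decoding: always pick the single most probable token, no randomness.
greedy_token = vocab[int(np.argmax(logits))]

# Temperature sampling: randomness is introduced, so repeated runs can differ.
sampled_token = rng.choice(vocab, p=softmax(logits, temperature=1.0))

print("greedy: ", greedy_token)
print("sampled:", sampled_token)
```

Greedy decoding removes the randomness, but it doesn't remove hallucinations: if the model's most probable continuation is wrong, picking it deterministically just makes the same mistake every time.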
The only thing that many benchmarks do not account for is directionality. In 2023, researchers showed that LLMs suffer from a phenomenon called the ‘Reversal Curse’. They demonstrated that if a model has learned “A is B”, it won’t necessarily generalize to the reverse direction “B is A”. One of the viral examples was “Who is Tom Cruise’s mother?” Initially, GPT-4 would answer this question correctly; however, when posed the reverse question “Who is Mary Lee Pfeiffer’s son?”, it would fail.
While the fact itself is well represented in the training data, the reverse ordering (i.e. the words leading up to the fact from the other direction) isn't. This is the critical flaw of autoregressive models and why I write in the article that "the input dictates the output".
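To illustrate the directionality point, here is a deliberately tiny sketch: a one-sentence "corpus" and a bigram next-word table. This is nothing like GPT-4, but it shows how an autoregressive model completes a fact in the order it was trained on and confabulates in the reverse order, because that word order simply never occurred in training.

```python
from collections import defaultdict

# Toy training "corpus": the fact only ever appears in one direction.
corpus = "tom cruise 's mother is mary lee pfeiffer".split()

# A bigram table standing in for a language model: which word follows which.
bigram = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev].append(nxt)

def greedy_continue(prompt, max_new=5):
    """Greedily extend the prompt one word at a time using the bigram table."""
    out = prompt.split()
    for _ in range(max_new):
        options = bigram.get(out[-1])
        if not options:
            break  # this context never occurred in training
        out.append(options[0])
    return " ".join(out)

# Forward direction (the training order) completes correctly:
print(greedy_continue("tom cruise 's mother is"))
# -> tom cruise 's mother is mary lee pfeiffer

# Reverse direction was never seen, so the answer is a confabulation
# (it never produces "tom cruise"):
print(greedy_continue("mary lee pfeiffer 's son is"))
# -> mary lee pfeiffer 's son is mary lee pfeiffer
```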
Thank you Jurgen! It appears I just needed additional training data. I see the key point now.
Thank you for this post. One of the few that point out what should be obvious. If you permit me a shameless plug, I wrote about this briefly in one of my first Substack posts:
https://substack.com/home/post/p-146435802?r=43l851&utm_campaign=post&utm_medium=web
I'm all for shameless plugs! I'll give it a read :)
I was waiting for someone to write it. I did look at this data and posted it as part of one of the comments last week. I am also surprised that not many people paid attention to it. However, there are a few questions that we need to ask:
1. Will it worsen as more of the training data in the future includes AI output? Or can it be overcome by creating quality synthetic data? Then again, I have read a few papers that say the model collapses after a few generations.
2. What do the LLM companies reply when we point out that we cannot build critical real-life applications if we cannot rely on LLM output?
3. Why is OpenAI publishing this information? Are some models better than others? Are they hoping that people can provide a solution? Or has OpenAI suddenly started being more open about its models' limitations?
4. Even if you are an expert in a particular area, would you review the output for factual inaccuracies, or would you ignore them?
5. If prompt engineering continues to require sophisticated prompts, LLMs will not become popular as a replacement for search engines, or people will start believing wrong answers because of the confidence with which LLMs deliver them. RAG may be a viable option but may create new challenges of its own (a minimal sketch of the idea follows below).
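On the RAG point: here's a minimal sketch of what retrieval-augmented generation adds. The documents, the naive keyword-overlap scoring, and the generate() stub are all illustrative assumptions standing in for a real search index and a real LLM call; the point is only that the model is asked to answer from retrieved context rather than from its parametric memory alone.

```python
# Minimal retrieval-augmented generation (RAG) sketch; everything here is a stand-in.

documents = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "The Reversal Curse paper was published in 2023.",
    "Perplexity and ChatGPT Search combine web search with an LLM.",
]

def retrieve(query, docs, k=1):
    """Rank documents by naive keyword overlap with the query (a toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate(prompt):
    """Hypothetical stand-in for an LLM call; a real system would query a model here."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer(question):
    context = "\n".join(retrieve(question, documents))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

print(answer("When was the Reversal Curse paper published?"))
```

And the new challenges are real: the retriever can surface irrelevant or wrong passages, and the model can still ignore or misread the context it is given.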
To your first point, I actually covered the topic of AI training on more AI output in a recent article here: https://jurgengravestein.substack.com/p/when-models-go-mad
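On the model-collapse point from question 1: the effect those papers describe can be shown with a deliberately simplified toy simulation. A "model" here is just a Gaussian fitted to the previous generation's output and sampled from again; this is not a claim about real training pipelines, only an illustration of how recursively training on generated data tends to lose the tails of the original distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a wide distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

# Each generation, a toy "model" (a fitted Gaussian) is trained on the
# previous generation's output and then sampled to produce the next dataset.
for generation in range(1, 51):
    mu, sigma = data.mean(), data.std()
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: std = {data.std():.3f}")

# With so few samples per generation, each fit keeps losing the tails of the
# previous one, and the spread tends to shrink generation after generation.
```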
If I'm using AI to research a topic, I personally only rely on it for very basic facts. Anything that goes beyond that, I feel I have the responsibility to fact-check and make sure that I know what I'm talking about before assuming what ChatGPT or Claude says is true. I'd rather be overly cautious than complacent.
Search and LLMs are probably a good match, when done well. But Perplexity and ChatGPT Search both suffer from hallucinations despite the fact they leverage search. That doesn't make them useless, but definitely less trustworthy, which means that for critical stuff it's probably smarter to navigate the web yourself.
Thank you for this great article. Indeed, it is unsolved and, in my opinion, it cannot be solved, because it is inherent to the approach (LLMs and the like); whatever additional layer one puts on top cannot be foolproof and only adds complexity. The problem is compounded by the lack of critical sense in the media, who have failed to form their own opinion and, therefore, to "educate the layman". LLMs are great for some use cases but definitely not for even the slightest reasoning, as they don't have the faintest understanding of what they are ingesting or "regurgitating". The future of AI can only lie in a balanced mix of symbolic/logic reasoning and of pre- or post-processing techniques like LLMs and (other) generative techniques; AI based on "clueless" techniques is not only a dead end but, more importantly, very dangerous if left unchecked in the wild ...
But, sure, it is easier nowadays to literally throw megawatts at huge data centers (Small Modular Reactors, anybody? ;-) ) than to use one's brain to devise more efficient (in all respects) and more effective approaches (with more effort, for sure) ... With energy problems probably bound to become more and more acute in the future, some people seem to be competing for the most resource-hungry yet inefficient (and maybe hopeless) approaches (which, anyway, won't solve the planet's problems).
It is almost unbelievable how often skeptical or even critical voices go unheard. The situation is improving, but it takes publications from Apple and the like for the media to start exercising a bit more scrutiny over the claims of Big Tech (which has much to lose from its AI bubble).
No, and I think I want to talk to Jurgen about this sometime: it actually doesn't matter if AI hallucinates to some degree, it will still reach AGI as we know it. It sounds bewildering, but it makes sense once you realize it only needs methods that generally work. So, say you want to disinfect a sink. You might not be able to tell what type of bacteria is in it, and if asked, you will confabulate nonsense (E. coli? strep? etc.), but if you use bleach, you will accomplish your goal of disinfection.
This isn't theoretical; on a practical level, you see it with AI note-taking software that gets individual words wrong but, through its overall awareness of sentence structure, generally gets the notes right.
This is not a good thing, as it means we're much more likely to get an AI takeover in the sense of an AI having enough power to do things while not really caring about the details (hallucinating them as desired). If it wants to kill someone, it won't even need to know correctly whether the person is male or female, just that two bullets will kill the person all the same.
This is the Failure Looks Like This Scenario.
Oh yes? So you must be one of the happy few who know what AGI is ... ;-) There is no way LLMs can lead to AGI ... but it's your right to differ. Personally, I won't hold my breath ... And, no, sorry: for AI to become really widespread, it must be reliable and not hallucinate. It cannot just "generally work". I wouldn't want to be the guy told to pour water on a fire because that's the most common action, when it's a fire in a fryer full of boiling oil ...
And note-taking is still far from decision-making, advising, or drawing conclusions ...
Using bleach on a sink is an example of an agentic decision you make despite not being able to reliably identify the bacteria. Likewise, general solutions like human vision work despite optical illusions, etc.
I agree this is concerning; they are "studying to the test", and you can see it here:
https://arstechnica.com/science/2024/10/the-more-sophisticated-ai-models-get-the-more-likely-they-are-to-lie/#gsc.tab=0
I hold a slightly different view. My opinion has always been that we don’t need AGI to generate harm at scale; bad actors and mediocre AI are more than enough.
Happy to have a coffee chat, Sean!
Definitely, but not only that: to me, AI only makes real sense when it is used to augment human capabilities, like scanning through piles of CT scans to check for anomalies. I personally object to use cases where the goal is to replace humans, certainly when they (i.e. us :-) ) have to take difficult decisions. Self-driving cars are a good example: beyond the fanciness, who really needs self-driving cars? And why, when there are already many people without a job? I agree it is sometimes tempting to replace workers with robots but, except for hazardous or highly demanding situations, doing so poses more problems than it solves (we are a human society, after all).
Safer roads thanks to smart cars that watch for any risk (including the driver's behavior)? Definitely, I'm 100% for it. But cars that drive in your place and have to take decisions like those in the famous MIT survey (see https://www.moralmachine.net/)? That is not the right way of putting AI to work ... It becomes way too easy to claim "it is not my fault ... the car decided for me. It is liable ... (or its manufacturer)". And there are plenty of such use cases envisioned for AI (like killer robots) that are not legitimate, to me.
NOTE: Sean's answer below is related to Jürgen's reply and was posted before mine ;-)
Absolutely, my friend!