"This is best illustrated through the Reversal Curse, which showed that models that learn “A is B” don’t automatically generalize “B is A”. Another way of putting it: the input dictates the output"
Human intuition can often be like this, the similarities are quite striking.
"Well, which of these statements strikes you as more natural: "98 is approximately 100", or "100 is approximately 98"? If you're like most people, the first statement seems to make more sense. (Sadock 1977.) For similar reasons, people asked to rate how similar Mexico is to the United States, gave consistently higher ratings than people asked to rate how similar the United States is to Mexico. (Tversky and Gati 1978.)"
Yes, we humans suffer from all sorts of peculiar cognitive biases (loss aversion, confirmation bias, halo effect) ourselves, some of which have found their way into the human works that can be found on the Internet, which in turn, are consumed by AI. We should not conflate them, though. The mind and the machine are two different things.
Actually, I think the best human-analogy to "LLM hallucination" is the way human brains "hallucinate" colour from a neursoscience/cognitive science point of view. I wrote a post in more depth here, and was wondering what you might think of it?
Shouldn't hallucinations by topic decline as depth (quality and quantity) of training data by topic increases? Why WOULDN'T we expect a fair amount of confabulation across topics with low training depth? Or is the problem that we're seeing hallucinations even in high-depth areas, like classical philosophy?
Also, would there be any value in testing and benchmarking accuracy against the raw statistical model when it's greedy decoding (if I'm saying that correctly), with no randomness introduced? Or maybe they just don't work without inference hyperparameters?
I assume this benchmark is done on temperature 0. It's my understanding that these models tend to be more reliable when discussing topics with abundant, high-quality training data, so it's natural to expect more hallucinations in areas where the model has less data to work with.
The only thing that many benchmarks do not account for is directionality. In 2023, researchers showed that LLMs suffer from a phenomenon called the ‘Reversal Curse’. They demonstrated that if a model has learned “A is B”, it won’t necessarily generalize to the reverse direction “B is A”. One of the viral examples was Who is Tom Cruise’s mother? Initially, GPT-4 would answer this question correctly, however, when posed the reversal question Who is Mary Lee Pfeiffer’s son?, it would fail.
While the fact itself is well-presented in the training data, the order (i.e. the words leading up to this fact) in which this knowledge is presented, isn't. This is the critical flaw of autoregressive models and why I write in the article "the input dictates the output".
Thank you for this post. One of the few that point out what should be obvious. If you permit me to make a shameless plug, I wrote about this briefly at one of my first Substack posts:
I was waiting for someone to write it. I did look at this data and posted it as part of one of the comments last week. I am also surprised that not many people paid attention to it. However, there are a few questions that we need to ask:
1. Will it worsen as more training data in the future include AI output? Or can it be overcome by creating quality synthetic data? But then I have read a few papers that say the model collapses after a few generations.
2. When the LLM companies rely that we cannot build critical real-life applications if we cannot rely on LLM output?
4. Why is OpenAI publishing this information? Are there models better than others? Are they looking to see if people can provide the solution? Or has OpenAI suddenly started acting more openly about its model's limitations?
5. Even if you are an expert in a particular area, would you review the output if it is factually inaccurate or ignore it?
6. If prompt engineering continues to require sophisticated prompts, it will not become popular as a search engine, or people will start believing the wrong answers because of the confidence level of LLMs. RAG may be a viable option but may create new challenges.
If I'm using AI to research a topic, I personally only rely on it for very basic facts. Anything that goes beyond that, I feel I have the responsibility to fact-check and make sure that I know what I'm talking about before assuming what ChatGPT or Claude says is true. I'd rather be overly cautious than complacent.
Search and LLMs are probably a good match, when done well. But Perplexity and ChatGPT Search both suffer from hallucinations despite the fact they leverage search. That doesn't make them useless, but definitely less trustworthy, which means that for critical stuff it's probably smarter to navigate the web yourself.
Thank you for this great article. Indeed. It is unsolved and, in my opinion, it cannot be solved because it is inherent to the approach (LLM's and the likes) and whatever additional layer one can put on top cannot be foolproof and just add complexity. The problem is compounded by the lack of critical sense of the media, who have failed to make their own opinion and, therefore, to "educate the laymen". LLM's are great for some use cases but definitely not for the least reasoning, as they don't have the faintest understanding of what they are ingesting nor "regurgitating". The future of AI can only be in a balanced mix of symbolic/logic reasoning and of pre- or post-processing techniques like LLM's and (other) generative techniques but AI based on "clueless" techniques is not only a dead-end but, more importantly, very dangerous if left unchecked in the wild ...
But, sure!, it is easier nowadays to literally throw Megawatts into huge data centers (Small Modular Reactors, anybody? ;-) ) than to use one's brain to devise more efficient (in all respects) and more effective approaches (with more effort, for sure) ... With energy problems probably bound to become more and more acute in the future, some people seem to compete for the most resource-hungry yet inefficient (and maybe hopeless) approaches (that, anyway, won't solve the planet's problems).
It is almost unbelievable how much skeptical or even critical voices are unheard. The situation is improving but it takes publications from Apple and the like for the media to start exercising a bit more scrutiny on the claims of Big Tech's (which have much to lose from their AI bubble)
No, and I think I want to talk to Jurgen about this sometime - it actually doesn't matter if AI hallucinates in some levels, it will still reach AGI as we know it. It sounds bewildering, but it will make sense if you realize that it only needs to have methods that generally work. So, say that you want to disinfect a sink. You might not be able to tell what type of bacteria it is, and if asked, you will confabulate nonsense(e-coli? strep? etc) but if you use bleach, you will accomplish your goal of disinfection.
This isn't theoretical; on a practical level, you see it with AI note-taking software that gets words wrong but through the overall awareness of the sentence structure, generally gets the note-taking correct.
This is not a good thing as it means that we're much more likely to get AI takeover in the sense of it having enough power to do things, while not really caring about the details(hallucinating them as desired) since if it wants to kill someone, it won't even need to know if the person is male or female correctly, just that two bullets will kill the person all the same.
Oh yes? So you must be one of the happy few to know AGI ... ;-) No way LLM's can lead to AGI ... but it's your right to differ. Personally, I won't hold my breath ... And, no, sorry, for AI to become really widespread, it must be reliable and not hallucinate. It cannot "generally work". I wouldn't want to be the guy being told to pour water on a fire because that's the most common action, if it's a fire in a fryer full of boiling oil ...
And note taking is still far from decision making, advising or conclusion drawing ...
Using bleach on a sink is an example of agentic decision you make despite not being able to reliably identify bacteria. Likewise general solutions like human vision work despite optical illusions, etc.
I agree this is concerning, they are "studying to test", and you see this.
I hold a slightly different view. My opinion has always been that we don’t need AGI to generate harm at scale; bad actors and mediocre AI are more than enough.
Definitely but not only that: to me, AI only makes real sense if it is used to augment human capabilities, like scanning through piles of CT scans to check for anomalies but I personally object to use cases when the goal is to replace humans, certainly when they (i.e. us :-) ) have to take difficult decisions. Self-driving cars are a good example: beyond the fanciness, who really needs driving cars? And why, when there already many people without a job? I agree it is sometimes tempting to replace workers with robots but, except for hazardous or highly demanding situations, doing this poses more problems, than it solves (we are a human society, after all).
Safer roads thanks to smart cars that watch for any risk (including driver's behavior)? Definitely, 100% on it, but cars that drives in your place and would have to take decisions like in the famous MIT survey (see https://www.moralmachine.net/)? This is no right way of putting AI to work ... It is way too easy to then claim "it is not my fault ... the car decided for me. It is liable ... (or its manufacturer)". And there are plenty of such use cases envisioned for AI (like killer robots) that are not legitimate, for me.
NOTE: Sean's answer below is related to Jürgen's reply and was posted before mine ;-)
"This is best illustrated through the Reversal Curse, which showed that models that learn “A is B” don’t automatically generalize “B is A”. Another way of putting it: the input dictates the output"
Human intuition can often be like this; the similarities are quite striking.
"Well, which of these statements strikes you as more natural: "98 is approximately 100", or "100 is approximately 98"? If you're like most people, the first statement seems to make more sense. (Sadock 1977.) For similar reasons, people asked to rate how similar Mexico is to the United States, gave consistently higher ratings than people asked to rate how similar the United States is to Mexico. (Tversky and Gati 1978.)"
source: https://www.lesswrong.com/posts/4mEsPHqcbRWxnaE5b/typicality-and-asymmetrical-similarity
Thanks for sharing! Very interesting read.
Yes, we humans suffer from all sorts of peculiar cognitive biases ourselves (loss aversion, confirmation bias, the halo effect), some of which have found their way into the human-written works on the Internet, which are in turn consumed by AI. We should not conflate the two, though. The mind and the machine are different things.
Actually, I think the best human analogy to "LLM hallucination" is the way human brains "hallucinate" colour from a neuroscience/cognitive science point of view. I wrote a post on this in more depth here, and was wondering what you might think of it?
https://open.substack.com/pub/aliceandbobinwanderland/p/ai-hallucinations-can-be-imaginary?r=24ue7l&utm_campaign=post&utm_medium=web
Thanks for sharing. I’m gonna have a read through!
I'm not technical. Apologies if I'm confused. 😬
Shouldn't hallucinations by topic decline as depth (quality and quantity) of training data by topic increases? Why WOULDN'T we expect a fair amount of confabulation across topics with low training depth? Or is the problem that we're seeing hallucinations even in high-depth areas, like classical philosophy?
Also, would there be any value in testing and benchmarking accuracy against the raw statistical model when it's using greedy decoding (if I'm saying that correctly), with no randomness introduced? Or maybe they just don't work without inference hyperparameters?
I assume this benchmark is done at temperature 0. It's my understanding that these models tend to be more reliable when discussing topics with abundant, high-quality training data, so it's natural to expect more hallucinations in areas where the model has less data to work with.
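To make the decoding point concrete, here's a minimal sketch of the difference between greedy decoding (effectively what "temperature 0" amounts to) and sampling with a temperature. The vocabulary and logits are made up for illustration; real models do this over tens of thousands of tokens at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token logits over a tiny, made-up vocabulary.
vocab = ["Paris", "Lyon", "Marseille", "Berlin"]
logits = np.array([3.2, 1.1, 0.9, -0.5])

def softmax(x, temperature=1.0):
    """Turn logits into a probability distribution; higher temperature flattens it."""
    z = x / temperature
    z = z - z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Greedy decoding: always pick the single most probable token, no randomness.
greedy_token = vocab[int(np.argmax(logits))]

# Temperature sampling: randomness is introduced, so repeated runs can differ.
sampled_token = rng.choice(vocab, p=softmax(logits, temperature=1.0))

print("greedy: ", greedy_token)
print("sampled:", sampled_token)
```

Greedy decoding removes the randomness, but it doesn't remove hallucinations: if the model's most probable continuation is wrong, picking it deterministically just makes the same mistake every time.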
The only thing that many benchmarks do not account for is directionality. In 2023, researchers showed that LLMs suffer from a phenomenon called the ‘Reversal Curse’. They demonstrated that if a model has learned “A is B”, it won’t necessarily generalize to the reverse direction “B is A”. One of the viral examples was “Who is Tom Cruise’s mother?” Initially, GPT-4 would answer this question correctly; however, when posed the reverse question “Who is Mary Lee Pfeiffer’s son?”, it would fail.
While the fact itself is well represented in the training data, the reverse ordering (i.e. the words leading up to the fact from the other direction) isn't. This is the critical flaw of autoregressive models and why I write in the article that "the input dictates the output".
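To illustrate the directionality point, here is a deliberately tiny sketch: a one-sentence "corpus" and a bigram next-word table. This is nothing like GPT-4, but it shows how an autoregressive model completes a fact in the order it was trained on and confabulates in the reverse order, because that word order simply never occurred in training.

```python
from collections import defaultdict

# Toy training "corpus": the fact only ever appears in one direction.
corpus = "tom cruise 's mother is mary lee pfeiffer".split()

# A bigram table standing in for a language model: which word follows which.
bigram = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev].append(nxt)

def greedy_continue(prompt, max_new=5):
    """Greedily extend the prompt one word at a time using the bigram table."""
    out = prompt.split()
    for _ in range(max_new):
        options = bigram.get(out[-1])
        if not options:
            break  # this context never occurred in training
        out.append(options[0])
    return " ".join(out)

# Forward direction (the training order) completes correctly:
print(greedy_continue("tom cruise 's mother is"))
# -> tom cruise 's mother is mary lee pfeiffer

# Reverse direction was never seen, so the answer is a confabulation
# (it never produces "tom cruise"):
print(greedy_continue("mary lee pfeiffer 's son is"))
# -> mary lee pfeiffer 's son is mary lee pfeiffer
```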
Thank you Jurgen! It appears I just needed additional training data. I see the key point now.
Thank you for this post. One of the few that point out what should be obvious. If you permit me a shameless plug, I wrote about this briefly in one of my first Substack posts:
https://substack.com/home/post/p-146435802?r=43l851&utm_campaign=post&utm_medium=web
I'm all for shameless plugs! I'll give it a read :)
I was waiting for someone to write it. I did look at this data and posted it as part of one of the comments last week. I am also surprised that not many people paid attention to it. However, there are a few questions that we need to ask:
1. Will it worsen as more of the training data in the future includes AI output? Or can it be overcome by creating quality synthetic data? Then again, I have read a few papers that say the model collapses after a few generations.
2. What do the LLM companies reply when we point out that we cannot build critical real-life applications if we cannot rely on LLM output?
3. Why is OpenAI publishing this information? Are some models better than others? Are they hoping that people can provide a solution? Or has OpenAI suddenly started being more open about its models' limitations?
4. Even if you are an expert in a particular area, would you review the output for factual inaccuracies, or would you ignore them?
5. If prompt engineering continues to require sophisticated prompts, LLMs will not become popular as a replacement for search engines, or people will start believing wrong answers because of the confidence with which LLMs deliver them. RAG may be a viable option but may create new challenges of its own (a minimal sketch of the idea follows below).
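On the RAG point: here's a minimal sketch of what retrieval-augmented generation adds. The documents, the naive keyword-overlap scoring, and the generate() stub are all illustrative assumptions standing in for a real search index and a real LLM call; the point is only that the model is asked to answer from retrieved context rather than from its parametric memory alone.

```python
# Minimal retrieval-augmented generation (RAG) sketch; everything here is a stand-in.

documents = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "The Reversal Curse paper was published in 2023.",
    "Perplexity and ChatGPT Search combine web search with an LLM.",
]

def retrieve(query, docs, k=1):
    """Rank documents by naive keyword overlap with the query (a toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate(prompt):
    """Hypothetical stand-in for an LLM call; a real system would query a model here."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer(question):
    context = "\n".join(retrieve(question, documents))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

print(answer("When was the Reversal Curse paper published?"))
```

And the new challenges are real: the retriever can surface irrelevant or wrong passages, and the model can still ignore or misread the context it is given.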
To your first point, I actually covered the topic of AI training on more AI output in a recent article here: https://jurgengravestein.substack.com/p/when-models-go-mad
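On the model-collapse point from question 1: the effect those papers describe can be shown with a deliberately simplified toy simulation. A "model" here is just a Gaussian fitted to the previous generation's output and sampled from again; this is not a claim about real training pipelines, only an illustration of how recursively training on generated data tends to lose the tails of the original distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a wide distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

# Each generation, a toy "model" (a fitted Gaussian) is trained on the
# previous generation's output and then sampled to produce the next dataset.
for generation in range(1, 51):
    mu, sigma = data.mean(), data.std()
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: std = {data.std():.3f}")

# With so few samples per generation, each fit keeps losing the tails of the
# previous one, and the spread tends to shrink generation after generation.
```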
If I'm using AI to research a topic, I personally only rely on it for very basic facts. Anything that goes beyond that, I feel I have the responsibility to fact-check and make sure that I know what I'm talking about before assuming what ChatGPT or Claude says is true. I'd rather be overly cautious than complacent.
Search and LLMs are probably a good match, when done well. But Perplexity and ChatGPT Search both suffer from hallucinations despite the fact they leverage search. That doesn't make them useless, but definitely less trustworthy, which means that for critical stuff it's probably smarter to navigate the web yourself.
Thank you for this great article. Indeed, it is unsolved and, in my opinion, it cannot be solved, because it is inherent to the approach (LLMs and the like); whatever additional layer one puts on top cannot be foolproof and only adds complexity. The problem is compounded by the lack of critical sense in the media, who have failed to form their own opinion and, therefore, to "educate the layman". LLMs are great for some use cases but definitely not for even the slightest reasoning, as they don't have the faintest understanding of what they are ingesting or "regurgitating". The future of AI can only lie in a balanced mix of symbolic/logic reasoning and of pre- or post-processing techniques like LLMs and (other) generative techniques; AI based on "clueless" techniques is not only a dead end but, more importantly, very dangerous if left unchecked in the wild ...
But, sure, it is easier nowadays to literally throw megawatts at huge data centers (Small Modular Reactors, anybody? ;-) ) than to use one's brain to devise more efficient (in all respects) and more effective approaches (with more effort, for sure) ... With energy problems probably bound to become more and more acute in the future, some people seem to be competing for the most resource-hungry yet inefficient (and maybe hopeless) approaches (which, anyway, won't solve the planet's problems).
It is almost unbelievable how often skeptical or even critical voices go unheard. The situation is improving, but it takes publications from Apple and the like for the media to start exercising a bit more scrutiny over the claims of Big Tech (which has much to lose from its AI bubble).
No, and I think I want to talk to Jurgen about this sometime: it actually doesn't matter if AI hallucinates to some degree, it will still reach AGI as we know it. It sounds bewildering, but it makes sense once you realize it only needs methods that generally work. So, say you want to disinfect a sink. You might not be able to tell what type of bacteria is in it, and if asked, you will confabulate nonsense (E. coli? strep? etc.), but if you use bleach, you will accomplish your goal of disinfection.
This isn't theoretical; on a practical level, you see it with AI note-taking software that gets individual words wrong but, through its overall awareness of sentence structure, generally gets the notes right.
This is not a good thing, as it means we're much more likely to get an AI takeover in the sense of an AI having enough power to do things while not really caring about the details (hallucinating them as desired). If it wants to kill someone, it won't even need to know correctly whether the person is male or female, just that two bullets will kill the person all the same.
This is the Failure Looks Like This Scenario.
Oh yes? So you must be one of the happy few who know what AGI is ... ;-) There is no way LLMs can lead to AGI ... but it's your right to differ. Personally, I won't hold my breath ... And, no, sorry: for AI to become really widespread, it must be reliable and not hallucinate. It cannot just "generally work". I wouldn't want to be the guy told to pour water on a fire because that's the most common action, when it's a fire in a fryer full of boiling oil ...
And note-taking is still far from decision-making, advising, or drawing conclusions ...
Using bleach on a sink is an example of an agentic decision you make despite not being able to reliably identify the bacteria. Likewise, general solutions like human vision work despite optical illusions, etc.
I agree this is concerning; they are "studying to the test", and you can see it here:
https://arstechnica.com/science/2024/10/the-more-sophisticated-ai-models-get-the-more-likely-they-are-to-lie/#gsc.tab=0
I hold a slightly different view. My opinion has always been that we don’t need AGI to generate harm at scale; bad actors and mediocre AI are more than enough.
Happy to have a coffee chat, Sean!
Definitely, but not only that: to me, AI only makes real sense when it is used to augment human capabilities, like scanning through piles of CT scans to check for anomalies. I personally object to use cases where the goal is to replace humans, certainly when they (i.e. us :-) ) have to take difficult decisions. Self-driving cars are a good example: beyond the fanciness, who really needs self-driving cars? And why, when there are already many people without a job? I agree it is sometimes tempting to replace workers with robots but, except for hazardous or highly demanding situations, doing so poses more problems than it solves (we are a human society, after all).
Safer roads thanks to smart cars that watch for any risk (including the driver's behavior)? Definitely, I'm 100% for it. But cars that drive in your place and have to take decisions like those in the famous MIT survey (see https://www.moralmachine.net/)? That is not the right way of putting AI to work ... It becomes way too easy to claim "it is not my fault ... the car decided for me. It is liable ... (or its manufacturer)". And there are plenty of such use cases envisioned for AI (like killer robots) that are not legitimate, to me.
NOTE: Sean's answer below is related to Jürgen's reply and was posted before mine ;-)
Absolutely, my friend!