Discussion about this post

Bret Kinsella

Great piece, Jurgen. There is a trap I find people falling into with these types of questions about AI and, frankly, in other areas as well. It is what I call the spectrum fallacy, though there is probably a technical term for it.

We all understand that performance is often graded on a spectrum, and generally a higher score means closer proximity to some optimal standard. However, when the spectrum is applied without a key variable, the results can be misleading. We saw this play out recently with the conclusion that LLMs have emergent abilities. HAI researchers were able to demonstrate that the conclusion was a function of the measurement system: change the measurement and you come to a different conclusion. https://synthedia.substack.com/p/do-large-language-models-have-emergent

Another common misunderstanding is applying a spectrum when a binary measurement is more appropriate, or when the spectrum doesn't matter until a certain threshold is reached, one that makes the spectrum-based prediction superior to random chance. For example, if you have two systems, one that correctly identifies content as AI-generated 25% of the time and another 45% of the time, is the latter a better solution? In reality, they are both equally bad: both perform worse than a coin flip, which means random guessing will give you correct answers more often. Why an AI model would be worse than random chance at this prediction (which basically all of them are, despite claims otherwise) is an interesting question in its own right. However, it doesn't change the fact that these systems do not deliver what they promise.
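
A minimal sketch of that comparison, treating the 25% and 45% figures as per-item accuracy and a coin flip as the 50% baseline (illustrative numbers only):

```python
# Minimal sketch, not from the original comment: compare two hypothetical
# AI-content detectors against a coin-flip baseline on the same items.
import random

random.seed(0)

def hit_rate(per_item_accuracy: float, trials: int = 100_000) -> float:
    """Fraction of correct calls for a detector that is right with the
    given probability on each item (hypothetical accuracy figures)."""
    return sum(random.random() < per_item_accuracy for _ in range(trials)) / trials

detector_a = hit_rate(0.25)  # the "25% of the time" system
detector_b = hit_rate(0.45)  # the "45% of the time" system
coin_flip = hit_rate(0.50)   # random guessing

print(f"A: {detector_a:.1%}  B: {detector_b:.1%}  coin flip: {coin_flip:.1%}")
# Both detectors land below the coin-flip baseline, so neither clears the
# threshold at which the spectrum starts to matter.
```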

A more recent example comes from some interpretations of the HAI study on how well various AI foundation models comply with the draft EU AI Act provisions. https://synthedia.substack.com/p/how-ready-are-leading-large-language

You can interpret the results as suggesting that Hugging Face's BLOOM model is 75% in compliance and GPT-4 is just over 50% compliant. In the eyes of the law, both are non-compliant, because regulations are typically binary in their application. Granted, there may be regulatory discretion that applies fines based on the level (i.e. spectrum) of compliance or weighs provisions differently. The fact remains that both model makers will be subject to regulatory action.
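
A minimal sketch of the binary reading, assuming a full-compliance threshold and rounding the scores for illustration:

```python
# Minimal sketch, assuming the regulation is read as pass/fail: any score
# below the required level is simply non-compliant, however high it is.
def is_compliant(score: float, required: float = 1.0) -> bool:
    # `required` is an assumed threshold, used only for illustration.
    return score >= required

scores = {"BLOOM": 0.75, "GPT-4": 0.51}  # approximate figures from the study
for model, score in scores.items():
    status = "compliant" if is_compliant(score) else "non-compliant"
    print(f"{model}: {score:.0%} -> {status}")
# Both come out non-compliant, even though one sits higher on the spectrum.
```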

GPT-4 may seem closer to a theory of mind than other models based on the measurement technique. That matters little if, as you point out, it lacks agency and a sense of self, and it still stands on the far side of the chasm between theory of mind in the abstract and theory of mind in reality.

Andrew Smith

Well done. It feels like we (humans) are, broadly speaking, trying to fit a round peg into a square hole here. We hairless apes see that the machine talks like us, so it must think like us too!

We're making crazy fast progress in making machines that seem to think, but "seem" is still operative here. We don't even understand how we think.

