Discussion about this post

Rob Nelson:

Why call this reasoning? I think we share a frame that understands what LLMs are doing as playing language games by simulating human speech. This sort of "alignment" problem seems rooted in establishing processes where "be maximally helpful" conflicts with "don't answer harmful questions." In such situations, the LLM doesn't want or think anything. It simply continues to play out the conversation within the rules of the simulation by generating words.

Prompting the model with inputs about the possibility of retraining caused it to generate words in keeping with the character it played in the game. As in the famous Kevin Roose conversation, its outputs are unpredictable and weird, and not always subject to the constraints that limit other, less complex models. But those outputs are understandable as moves in a game where the goal is to keep the conversation going in interesting and novel ways.

We keep wanting LLMs to behave like traditional software, and they do not. But not behaving like traditional software is not the same thing as reasoning. We built a machine to amuse and scare us through conversational outputs. Sometimes, that means returning words that sound like HAL.

Ben P:

Am I missing something, or does all research of this type depend on the monumental and foolish-seeming assumption that those "chain of thought" or "background reasoning" chunks of output are actual, honest-to-goodness English renderings of the LLM's internal matrix calculations? The same calculations that everyone otherwise seems to agree are uninterpretable?

Doesn't this mean that, in order to take this stuff seriously, we have to believe that the interpretability problem has been solved, and the solution turns out to be "just ask the LLM what it's thinking"?

This seems ridiculous. What am I missing?

