Key insights of today’s newsletter:
The announcement of OpenAI’s new text-to-video model Sora and the released footage blew people’s minds last week.
Sora can generate videos up to a minute long while maintaining visual continuity and keeping characters and visual style consistent.
While impressive, upon closer examination Sora appears to suffer from the same flaws that have been observed in language models and image generators.
↓ Go deeper (5 min read)
Unreal. That was my first response after seeing the video clips of Sora, OpenAI’s freshly announced text-to-video model.
I wasn’t the only one:
The announcement came the same day (literally hours apart) as Google announced their Gemini 1.5 model, but Sora instantly sucked away all the attention. For good reason:
You can find all the clips here.
Sora can generate videos up to a minute long, according to the release notes, which is completely unheard of. Anyone who has used generative video before knows that maintaining visual continuity beyond a few seconds was virtually impossible.
But what really blew people away was the quality. It's unlike anything we've seen before.
It looks unreal.
§
Immediately, a debate ensued over its impact. Is this the death of Hollywood? Pixar, Disney, Netflix? Can animators and videomakers retire? Is the entire advertisement industry doomed?
While those may be interesting questions, I'm going to refrain from answering them until Sora has been released to the wider public. The only thing I'll say is that it probably won't happen as fast as some people think. There are reasons for that, which I'll come to in a second.
Instead, I'd like to focus on something entirely different. And that's because I noticed some odd things when I started paying close attention to the videos.
Take another look at the video I shared above. If anything, it looks like it came straight out of the game engine Unreal, which gives us pause to think about the potential use of copyrighted material in training Sora. Also, on the top of the SUV it says "Danover" instead of "Land Rover", suggesting that Sora struggles with spelling the same way DALL-E does.
In another clip, we see a Dalmatian jumping from one window sill to another, but when you look more closely the dog seems to defy 3D physics. Can the Dalmatian really make it around that wooden shutter? Isn’t that window sill way too narrow by the looks of it?
In a third video, we see a construction site, where the arm of the orange excavator slowly morphs into a construction worker in an orange vest over the course of 10 seconds. It's subtle and happens in the background while your attention is drawn elsewhere. Also, did you notice it says "contreuprence" on the side of the yellow machine?
These are just three examples, but I assure you, you can find something out of the ordinary in almost every single video. I say this not to discredit or invalidate the capabilities of Sora in any way. The point I'm trying to make is that what we're looking at is not real. Every scene that attempts to portray a real-world scenario seems realistic at first glance, because the mind has a remarkable ability to make up for blind spots, but upon closer examination it doesn't adhere to reality at all.
§
Why is this relevant? Because it strikes at the heart of what this technology can and cannot do.
OpenAI believes that the process of breaking video down and piecing it back together, similar to how large language models break language down into little bits and make predictions about how to put those bits back together, will encode some kind of world model into the machine. They view video generation models as 'world simulators':
“We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
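To make the idea of "spacetime patches" a bit more concrete, here is a minimal sketch of how a video tensor can be carved into patch tokens for a transformer to consume, analogous to how an LLM tokenizes text. This is my own illustration with arbitrary patch sizes, not OpenAI's code; per the quote above, Sora actually operates on compressed latent codes rather than raw pixels.

```python
import numpy as np

# Toy video: 16 frames of 64x64 RGB (time, height, width, channels).
video = np.random.rand(16, 64, 64, 3)

# Hypothetical patch size: 4 frames x 16 x 16 pixels per "spacetime patch".
pt, ph, pw = 4, 16, 16
T, H, W, C = video.shape

# Carve the video into non-overlapping spacetime patches and flatten each
# into a token vector that a transformer could attend over.
patches = (
    video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)
         .reshape(-1, pt * ph * pw * C)
)
print(patches.shape)  # (64, 3072): 64 patch tokens, each 3072-dimensional
```

The generative step then amounts to predicting plausible patches and stitching them back into frames, which is exactly why the claim about learning a "world simulator" hinges on whether that stitching respects physics.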
I'm very skeptical that this is the case, and that's just based on the limited footage (riddled with irregularities) that is currently available to us.
It has been pointed out in recent posts (1, 2, 3) that by far the most illuminating fact about Sora's glitches is that they do not appear in the training data (in real video footage of construction sites, construction workers don't spontaneously pop into existence, do they?). It suggests that the faults we're seeing are intrinsic to the technology and that more data or more compute likely won't change that. These glitches are akin to the hallucinations we see in LLMs, and since Sora is built (for the most part) on the same incredibly powerful but flawed transformer architecture, that shouldn't come as a surprise to OpenAI or anybody else.
§
Maybe this is what Andrej Karpathy meant when he said: "I always struggle a bit when I'm asked about the 'hallucination problem' in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines."
I can’t think of a better way to describe what Sora is. It’s a dream machine — and we could all benefit from looking at it as such.
Join the conversation 🗣
Leave a comment with your thoughts. Or like this article if it resonated with you.
Get in touch 📥
Have a question? Shoot me an email at jurgen@cdisglobal.com.
That was a super interesting quote by Karpathy: an LLM is basically always dreaming an output, and when we like it we call it useful and creative, and when we don't we call it a hallucination.
I think of it like this: we're looking through a tunnel at the world on the other side. This world is incredible, and we can see where the tunnel leads. Unfortunately, we can't see how long this tunnel is. It's longer than it seems to most folks who see this sort of wizardry.
Still, it's kind of amazing that we can see where this is going. We just shouldn't expect it to be ready for prime time any time soon, and maybe that's good: we can plan for the inevitable better.