Nice find, Jurgen. I wonder if the hype is so strong that many people would be surprised by the fact that visual models are blind.
I would oppose the use of human sight-related terminology like "myopic". We'll have a barrage of articles about how lenses are easy and the next breakthrough is just around the corner... My understanding is that these vision models are broadly CLIP-based and have very little capability for positional accuracy.
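To illustrate what I mean (a minimal sketch using the open-source openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers - not whatever the proprietary models actually run on): two captions that differ only in spatial arrangement land almost on top of each other in CLIP's text latent space.

```python
# Minimal sketch: CLIP text embeddings carry little positional/spatial signal.
# Assumes the open-source openai/clip-vit-base-patch32 checkpoint; proprietary
# vision stacks may well differ.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "a dog to the left of a cat",
    "a cat to the left of a dog",  # same words, opposite spatial relation
]
inputs = processor(text=captions, return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalise for cosine similarity
# Expect a very high similarity if positional information is weakly encoded.
print(f"cosine similarity: {(emb[0] @ emb[1]).item():.3f}")
```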
I'd say they are conceptually fuzzy rather than vision-fuzzy... The latent space (where they approximate meaning) is an abstract mess, the object recognition is fuzzy and imperfect (though it is getting better through exhaustive training), and then you add the interpretative bullshit of the language model interfacing with the visual aspect. Because of that last part alone, we don't know what the visual model sees if we try to establish it through prompting the LLM - it's like me asking you to explain what your dog sees. In fact the LLM interface happens in two places, task setting and outcome interpreting, so it's more like me asking you to ask your dog to perform a task and then explain to me what it was thinking at the time - I end up knowing very little about your dog...
I'm all in favour of grounding research though. I'd love for it to be done on raw visual analysis APIs rather than the LLM combos, but I don't think many companies expose the visual API itself for public use.
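Something like this toy probe is what I mean by going at the raw encoder with no LLM in the loop (again just a sketch against the open CLIP checkpoint; the drawing helper and captions are mine, purely illustrative):

```python
# Toy probe of a raw vision encoder (no LLM in the loop): zero-shot "counting"
# of synthetic circles. Sketch only - the helper and captions are illustrative.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def draw_circles(n: int, size: int = 224) -> Image.Image:
    """Draw n non-overlapping circles on a white canvas."""
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    for i in range(n):
        x = 10 + i * 35
        d.ellipse([x, 95, x + 28, 123], outline="black", width=3)
    return img

image = draw_circles(4)
captions = [f"an image with {k} circles" for k in range(1, 7)]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{caption}: {p.item():.3f}")
# If the distribution isn't peaked at 4, the encoder itself - not the LLM
# wrapper - is what's count-blind.
```

It would separate the two failure modes: what the encoder can't see versus what the LLM garbles in task setting and interpretation.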
I love that "conceptually fuzzy". I'm going to steal that. Thank you for adding this more technical perspective to the conversation.
You are welcome to it. I am speculating on my understanding of the fundamentals of the tech. Could be wrong, happy to be corrected.
Jurgen, fascinating post. I remain intrigued by how it is at once so similar to and so unlike us humans.
Right!? It’s a gift that keeps on giving.
Fascinating article, Jurgen. I'm surprised that advanced AI struggles with tasks like counting line intersections or nested shapes. This makes me wonder about real-world implications. For instance, how might these limitations affect an AI's ability to interpret complex road markings or traffic signs in self-driving cars? Are researchers developing specific tests for autonomous vehicles similar to these simple visual tasks?
To be honest, I don't know much about the systems that autonomous vehicles rely on. But I agree with you that it raises questions about how this translates to the real world. Suffice to say, I won't ask ChatGPT what metro line to take to get home 👀