Discussion about this post

Ilia Kurgansky:

Nice find, Jurgen. I wonder if the hype is so strong that many people would be surprised by the fact that visual models are blind.

I would oppose the use of human sight-related terminology like "myopic". We'll get a barrage of articles about how lenses are easy and the next breakthrough is just around the corner... My understanding is that these vision models are broadly CLIP-based and have very little capability for positional accuracy.

I'd say they are conceptually fuzzy rather than vision-fuzzy... The latent space (where they approximate meaning) is an abstract mess, the object recognition is fuzzy and imperfect (though it is getting better through exhaustive training), and then you add the interpretative bullshit of the language model interfacing with the visual component. Because of that last part alone, we don't know what the visual model actually sees if we try to establish it by prompting the LLM; it's like me asking you to explain what your dog sees. In fact, the LLM interface sits in two places, task setting and outcome interpreting, so it's more like me asking you to ask your dog to perform a task and then have you explain to me what it was thinking at the time. I end up knowing very little about your dog...

I'm all in favour of grounding research, though. I'd love for it to be done on raw visual-analysis APIs rather than the LLM combination, but I don't think many companies have exposed the raw API for public use.
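(For what it's worth, probing the encoder directly would mean working with its embeddings rather than an LLM's paraphrase of them; with a CLIP-style model that boils down to cosine similarity between an image vector and candidate text vectors. A minimal sketch of the idea — the vectors below are made-up stand-ins, not real CLIP outputs:)

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical unit-length embeddings standing in for CLIP outputs:
# one image vector and two candidate caption vectors.
image_emb = np.array([0.6, 0.8, 0.0])
text_cat  = np.array([0.6, 0.8, 0.0])   # "a photo of a cat"
text_dog  = np.array([0.0, 0.6, 0.8])   # "a photo of a dog"

scores = {
    "cat": cosine_similarity(image_emb, text_cat),
    "dog": cosine_similarity(image_emb, text_dog),
}
best = max(scores, key=scores.get)
print(best, scores)  # zero-shot matching with no LLM in the loop
```

No language model ever touches the comparison, so whatever the scores reveal about the encoder isn't filtered through a second model's interpretation.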

Krista Bradford:

Jurgen, fascinating post. I remain intrigued by how it is both similar to and unlike us humans.

