A ‘visual Turing test’ of a computer’s understanding of images

Researchers from Brown and Johns Hopkins universities have come up with a new way to evaluate how well computers can “understand” the relationships and implied activities among objects in photographs, videos, and other images, rather than merely recognize the objects themselves. They describe it as a ‘visual Turing test.’
 
Traditional computer-vision benchmarks tend to measure an algorithm’s performance in detecting objects within an image (the image contains a tree, a car, or a person), or how well a system identifies an image’s global attributes (the scene is outdoors, or it is nighttime).
 
“We think it’s time to think about how to do something deeper, something more at the level of human understanding of an image,” said Stuart Geman, the James Manning Professor of Applied Mathematics at Brown. For example, recognizing that an image shows two people walking together and having a conversation reflects a much deeper understanding than simply recognizing that people are present.
 
The system Geman and his colleagues developed, described this week in the Proceedings of the National Academy of Sciences, is designed to test for such a contextual understanding of photos. It works by generating a string of yes-or-no questions about an image, which are posed sequentially to the system being tested. Each question probes more deeply than the last and is chosen based on the answers to the questions that came before.
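
To make the idea of an adaptive, history-dependent line of questioning concrete, here is a minimal sketch in Python. It is an illustration only, not the authors’ actual system: the question templates, the prerequisite structure, and the ask() callback standing in for the vision system under test are all assumptions for the example.

```python
# Illustrative sketch of an adaptive yes/no questioner in the spirit of the
# visual Turing test described above. The questions and the prerequisite
# structure are hypothetical, not the published benchmark.

from typing import Callable, Dict, List, Tuple

# Each entry: (question text, index of prerequisite question, answer required).
# A question becomes eligible only after its prerequisite was answered as
# required, so the interrogation deepens based on earlier responses.
QUESTION_TREE: List[Tuple[str, int, bool]] = [
    ("Is there a person in the designated region?",  -1, True),  # root question
    ("Is there a second person nearby?",              0, True),
    ("Are the two people walking together?",          1, True),
    ("Are they having a conversation?",               2, True),
    ("Is there a vehicle in the designated region?", -1, True),  # independent thread
]

def run_interrogation(ask: Callable[[str], bool]) -> List[Tuple[str, bool]]:
    """Pose questions sequentially; deeper questions are asked only once the
    answers they build on have been obtained."""
    transcript: List[Tuple[str, bool]] = []
    answers: Dict[int, bool] = {}
    for idx, (text, prereq, needed) in enumerate(QUESTION_TREE):
        if prereq >= 0 and answers.get(prereq) != needed:
            continue  # skip questions whose context was never established
        answer = ask(text)
        answers[idx] = answer
        transcript.append((text, answer))
    return transcript

if __name__ == "__main__":
    # Stand-in "vision system" that answers yes to everything, for demonstration.
    for question, answer in run_interrogation(lambda q: True):
        print(f"{question} -> {'yes' if answer else 'no'}")
```

In this toy version, the follow-up about a conversation is only reached if the system first affirms that two people are present and walking together, mirroring the article’s description of questions that build on prior responses.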