Next Big Test for AI: Making Sense of the World

A few years ago, a breakthrough in machine learning suddenly enabled computers to recognize objects shown in photographs with unprecedented, almost spooky, accuracy. The question now is whether machines can make another leap, by learning to make sense of what’s actually going on in such images.
A new image database, called Visual Genome, could push computers toward this goal, and help gauge the progress of computers attempting to better understand the real world. Teaching computers to parse visual scenes is fundamentally important for artificial intelligence.
It might not only spawn more useful vision algorithms, but also help train computers how to communicate more effectively, because language is so intimately tied to representation of the physical world.
Visual Genome was developed by Fei-Fei Li, a professor who specializes in computer vision and who directs the Stanford Artificial Intelligence Lab, together with several colleagues. “We are focusing very much on some of the hardest questions in computer vision, which is really bridging perception to cognition,” Li says.
“Not just taking pixel data in and trying to makes sense of its color, shading, those sorts of things, but really turn that into a fuller understanding of the 3-D as well as the semantic visual world.”
Li and colleagues previously created ImageNet, a database containing more than a million images tagged according to their contents. Each year, the ImageNet Large Scale Visual Recognition Challenge tests the ability of computers to automatically recognize the contents of images.
In 2012, a team led by Geoffrey Hinton at the University of Toronto built a large and powerful neural network that could categorize images far more accurately than anything created previously.
The technique used to enable this advance, known as deep learning, involves feeding thousands or millions of examples into a many-layered neural network, gradually training each layer of virtual neurons to respond to increasingly abstract characteristics, from the texture of a dog’s fur, say, to its overall shape.
The Toronto team’s achievement marked both a boom of interest in deep learning and a sort of a renaissance in artificial intelligence more generally. And deep learning has since been applied in many other areas, making computers better at other important tasks, such as processing audio and text.
The images in Visual Genome are tagged more richly in ImageNet, including the names and details of various objects shown in an image; the relationships between those objects; and information about any actions that are occurring. This was achieved using a crowdsourcing approach developed by one of Li’s colleagues at Stanford, Michael Bernstein. The plan is to launch an ImageNet-style challenge using the data set in 2017.
Algorithms trained using examples in Visual Genome could do more than just recognize objects, and ought to have some ability to parse more complex visual scenes.
“You’re sitting in an office, but what’s the layout, who’s the person, what is he doing, what are the objects around, what event is happening?” Li says. “We are also bridging [this understanding] to language, because the way to communicate is not by assigning numbers to pixels, you need to connect perception and cognition to language.” 
Li believes that deep learning will likely play a key role in enabling computers to parse more complex scenes, but that other techniques will help advance the state of the art.
The resulting AI algorithms could perhaps help organize images online or in personal collections, but they might have more significant uses, enabling robots or self-driving cars to understand a scene properly. They could conceivably also be used to teach computers more common sense, by appreciating which concepts are physically likely or more implausible.
Richard Socher, a machine-learning expert and the founder of an AI startup called MetaMind, says this could be the most important aspect of the project. “A large part of language is about describing the visual world,” he says. “This data set provides a new scalable way to combine the two modalities and test new models.”