MIT deep-learning system autonomously learns to identify objects

MIT researchers have discovered that a deep-learning system designed to recognize and classify scenes has also learned how to recognize individual objects. Last December, the researchers announced the compilation of the world’s largest database of images labeled according to scene type, with 7 million entries.
 
Using a machine-learning technique known as “deep learning,” they trained the most successful scene classifier yet, which was between 25 and 33 percent more accurate than its best predecessor. The new discovery suggests that scene-recognition and object-recognition systems could work in concert, or could be mutually reinforcing.
 
“Deep learning works very well, but it’s very hard to understand why it works, what is the internal representation that the network is building,” says Antonio Torralba, an associate professor of computer science and engineering at MIT and a senior author on the new paper.
 
“It could be that the representations for scenes are parts of scenes that don’t make any sense, like corners or pieces of objects. But it could be that it’s objects: To know that something is a bedroom, you need to see the bed; to know that something is a conference room, you need to see a table and chairs. That’s what we found, that the network is really finding these objects.”
 
After the MIT researchers’ network had processed millions of input images, continually readjusting its internal settings, it was about 50 percent accurate at labeling scenes. Human beings, by comparison, are only about 80 percent accurate, since they can disagree about high-level scene labels. But the researchers didn’t know how their network was doing what it was doing.
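At a high level, that training process amounts to repeatedly showing the network labeled images and nudging its internal weights after each mistake. The sketch below illustrates the idea in PyTorch; the architecture, the number of scene categories, and the dataset path are illustrative assumptions, not the researchers’ actual setup.

```python
# A generic sketch of training a deep network to classify scenes in PyTorch.
# The AlexNet architecture, the 205 scene categories, and the "scenes/train"
# path are illustrative assumptions, not the MIT team's configuration.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_SCENE_CLASSES = 205  # hypothetical size of the scene label set

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Expects one directory per scene category: scenes/train/bedroom/, etc.
train_set = datasets.ImageFolder("scenes/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = models.alexnet(num_classes=NUM_SCENE_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:       # one epoch; real training runs many
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()                 # compute how to readjust each weight
    optimizer.step()                # apply the readjustment
```

Each pass through the loop slightly readjusts the network’s internal settings, which is what the millions of training images gradually shape into a scene classifier.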
 
The MIT researchers identified the 60 images that produced the strongest response in each unit of their network; then, to avoid biasing the results, they sent the collections of images to paid workers on Amazon’s Mechanical Turk crowdsourcing site, whom they asked to identify commonalities among the images.
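In code, the first half of that procedure might look like the following sketch, which records each unit’s responses and ranks the images per unit. The layer choice, the stand-in model, and the random placeholder images are assumptions for illustration; the researchers applied the idea to their own trained scene network.

```python
# Sketch of the inspection step: for every unit in one layer, find the images
# that drive it hardest. The layer choice and the random stand-in images are
# illustrative assumptions, not the researchers' setup.
import torch
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1").eval()  # stand-in network
layer = model.features[10]       # hypothetical pick: AlexNet's conv5 layer

captured = {}
def record(module, inputs, output):
    # Each unit's peak response over all spatial positions, per image.
    captured["responses"] = output.amax(dim=(2, 3))

layer.register_forward_hook(record)

images = torch.rand(200, 3, 224, 224)    # stand-in for the real image set
with torch.no_grad():
    model(images)

responses = captured["responses"]          # shape: (num_images, num_units)
top60 = responses.topk(60, dim=0).indices  # per unit, its 60 strongest images
```

Each column of top60 is one unit’s image collection, the kind of set that was handed to Mechanical Turk workers for annotation.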