I'm going to make artificial intelligence that predicts what happens in videos

The race to build ever more capable artificial intelligence and machine learning is heating up. This week, Facebook unveiled several artificial intelligence projects. Yann LeCun, the company’s director of AI, reveals what his technology can do, and how he sees the future, in this recent interview with New Scientist:
 
What are the big challenges ahead for you? 
 
The big challenge is unsupervised learning: the ability of machines to acquire common sense by just observing the world. And we don’t have the algorithms for this yet.
 
Why should AI researchers be concerned about common sense and unsupervised learning?
 
Because that’s the type of learning that humans and animals do mostly. Almost all of our learning is unsupervised. We learn about how the world works by observing it and living in it, without other people telling us the name of everything. So how do we get machines to learn in an unsupervised way, like animals and humans do?
 
This week, Facebook demonstrated a system that can answer simple questions about what’s happening in a picture. Is that trained by annotations made by humans?
 
It’s a combination of human annotation and artificially generated questions and answers. The images already have either lists of the objects they contain or descriptions of themselves. From those lists or descriptions, we can generate questions and answers about the objects that are in the picture, and then train a system to produce the answer when you ask the question. That’s pretty much how it’s trained.
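
To make that concrete, here is a minimal sketch in Python of turning an image’s object list into question-answer training pairs. The annotation format and question templates are assumptions for illustration; this is not Facebook’s actual pipeline.

```python
# Minimal sketch: generate (question, answer) training pairs from the list of
# objects an image is annotated with. Templates and formats are assumptions.
from typing import List, Tuple

def generate_qa_pairs(objects: List[str]) -> List[Tuple[str, str]]:
    """Create simple presence and counting questions from an object list."""
    pairs = []
    for obj in sorted(set(objects)):
        pairs.append((f"Is there a {obj} in the picture?", "yes"))
        pairs.append((f"How many {obj}(s) are in the picture?", str(objects.count(obj))))
    return pairs

# Example: an image annotated with the objects it contains.
annotations = ["dog", "ball", "dog"]
for question, answer in generate_qa_pairs(annotations):
    print(question, "->", answer)
```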
 
Are there certain types of questions your AI system struggles with?
 
Yes. If you ask things that are conceptual, it’s not going to be able to do a good job. It is trained on certain types of questions, like the presence or absence of objects or the relationships between objects, but there are a lot of things it cannot do. So it’s not a perfect system.
 
Is this system something that could be used for Facebook or Instagram to automatically caption pictures?
 
Captioning uses a slightly different method, but it’s similar. Of course, this is very useful for the visually impaired who use Facebook. Or, say you’re driving around and someone sends you a picture and you don’t want to look at your phone, so you could ask “What’s in the picture?”
 
Right now the system just tells you the type of image it is, whether it’s outdoors or indoors, whether there’s a sunset or whatever. It then gives you a list of the things it has found in the image, but not in full sentences. It’s just a list of words.
 
It doesn’t know the relationships between these things?
 
Right, and so the next generation that we have working in the lab is more like prose.
 
What other potential uses do you envisage for such artificial neural networks?
 
In biology and genomics, there is a lot of interesting work. For example, Brendan Frey at the University of Toronto has shown that you can train a deep-learning system to emulate the biochemical machinery that reads the DNA and produces proteins. With that you can figure out the relationship between multiple particular changes in the genome and particular diseases, which are not really traceable to a single mutation but can be an assembly of things. There is going to be a lot of progress in medicine because of this kind of stuff.
 
Are there problems that you think deep learning or the image-sensing convolutional neural nets you use can’t solve?
 
There are things that we cannot do today, but who knows? For example, if you had asked me like 10 years ago, “Should we use convolutional nets or deep learning for face recognition?”, I would have said there’s no way it’s going to work. And it actually works really well.
 
Why did you think that neural nets weren’t capable of this?
 
At that time, neural nets were really good at recognising general categories. So here’s a car; it doesn’t matter what car it is or what position it is in. Or there’s a chair; there are lots of different possible chairs, and those networks are good at extracting the “chair-ness” or the “car-ness”, independently of the particular instance and the pose.
 
But for things like recognising species of birds or breeds of dogs or plants or faces, you need fine-grained recognition, where you might have thousands or millions of categories, and the differences between the categories are very minute. I would have thought deep learning was not the best approach for this, and that something else would work better. I was wrong. I underestimated the power of my own technique. There are a lot of things that I might now think are difficult but, once we scale up, are going to work.
 
Facebook recently unveiled an experiment in which engineers gave a computer a passage from Lord of the Rings and then asked it to answer questions about the story. Is this an example of Facebook’s new intelligence test for machines?
 
It’s a follow-up to that work, using the same techniques that underlie it. The group that’s working on this has come up with a series of questions that a machine should be able to answer: here is a story, answer questions about this story. Some of them just test a simple fact. If I say “Ari picks up his phone” and then ask “Where is Ari’s phone?”, the system should say that it’s in Ari’s hands.
 
But what about a whole story where people move around? I can ask, “Are those two people in the same place?”, and you have to know what the physical world looks like if you want to be able to answer that. If you want to answer questions like “How many people are in the room now?”, for example, you have to keep track, across all the sentences, of how many people have come into the room. Answering those questions requires reasoning.
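
To make that kind of bookkeeping concrete, here is a toy, hand-coded Python sketch that tracks where people and objects are across a short story and then answers a “where is it?” question. Facebook’s systems learn this behaviour from data rather than follow hand-written rules; the sentences and patterns below are assumptions for illustration only.

```python
# Toy story-tracking sketch: follow who went where and who picked up what,
# then answer a location question by reasoning over the accumulated facts.
import re

story = [
    "Ari went to the kitchen.",
    "Lee went to the garden.",
    "Ari picked up the phone.",
    "Ari went to the office.",
]

locations = {}  # person -> last known location
holding = {}    # object -> person currently carrying it

for sentence in story:
    move = re.match(r"(\w+) went to the (\w+)\.", sentence)
    grab = re.match(r"(\w+) picked up the (\w+)\.", sentence)
    if move:
        locations[move.group(1)] = move.group(2)
    elif grab:
        holding[grab.group(2)] = grab.group(1)

# "Where is the phone?" -> find who carries it, then where that person is now.
carrier = holding["phone"]
print(f"The phone is in the {locations[carrier]}.")  # -> the office
```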
 
Do we need to teach machines common sense before we can get them to predict the future?
 
No, we can do this at the same time. If we can train a system for prediction, it can essentially infer the structure of the world it’s looking at by doing this prediction. A particular embodiment of this that’s cool is this thing called Eyescream. It’s a neural net that you feed random numbers and it produces natural-looking images at the other end. You can tell it to draw an airplane or a church tower, and for things that it’s been trained on, it can generate images that look sort of convincing.
 
So being able to generate images is a piece of the puzzle, because if you want to predict what happens next in a video, you must first have a model that can generate images.
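
Here is a minimal sketch of that “random numbers in, image out” interface, written with PyTorch. The architecture below is an untrained toy chosen for illustration, not the actual Eyescream model.

```python
# Toy generator: feed it random numbers, get an image-shaped tensor out.
# The layer sizes are arbitrary assumptions; a real model would be trained.
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Linear(100, 256 * 4 * 4),  # project the noise vector to a small feature map
    nn.ReLU(),
    nn.Unflatten(1, (256, 4, 4)),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 4x4 -> 8x8
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 8x8 -> 16x16
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # 16x16 -> 32x32 RGB
    nn.Tanh(),
)

noise = torch.randn(1, 100)   # the "random numbers" you feed in
image = generator(noise)      # an image-shaped tensor comes out the other end
print(image.shape)            # torch.Size([1, 3, 32, 32])
```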
 
What kind of things could a model predict?
 
If you show a video to a system and ask, “What’s the next frame in the video going to look like?”, it’s not that complicated. There are several things that can happen, but moving objects are probably going to keep moving in the same direction. But if you ask what the video will look like a second from now, there are a lot of things that can happen that you just can’t predict, so there the system will have a hard time making a good prediction. That’s the problem we’re facing, and we don’t yet know how to handle it properly.
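
A toy baseline makes the one-frame-ahead case concrete: assume moving things keep moving the same way, and extrapolate the last two frames linearly. This is only an illustrative sketch, not the model Facebook is working on.

```python
# Naive next-frame prediction: next ≈ current + (current - previous).
# Frames here are random arrays just to show the shapes involved.
import numpy as np

def predict_next_frame(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Linearly extrapolate pixel values from the last two frames."""
    prediction = 2.0 * curr_frame.astype(float) - prev_frame.astype(float)
    return np.clip(prediction, 0, 255).astype(np.uint8)

prev_frame = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # 64x64 grayscale frame
curr_frame = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
next_frame = predict_next_frame(prev_frame, curr_frame)
print(next_frame.shape)  # (64, 64)
```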
 
And what if you’re watching a Hitchcock movie and I ask, “15 minutes from now, what is it going to look like in the movie?” You have to figure out who the murderer is. Solving this problem completely will require knowing everything about the world and human nature. That’s what’s interesting about it.
 
Five years from now, how will deep learning have changed our lives?
 
One of the things we’re exploring is the idea of the personal butler, the digital butler. There isn’t really a name for this, but at Facebook it’s called Project M. The digital butler is the long-term, sci-fi version of M, something like the AI in the movie Her.