MIT 6.S191 (2018): Beyond Deep Learning: Learning+Reasoning

Transcription for the video titled "MIT 6.S191 (2018): Beyond Deep Learning: Learning+Reasoning".



Intro (00:00)

Good morning, everyone. As I said, I'm the director of IBM Research Cambridge. It's literally just a few blocks down the road. I've worked probably 20 years in IBM Research, primarily out of our New York lab, but I moved here just three months ago to really start up this new AI lab, refocus and significantly grow the Cambridge lab that we've already started. I intentionally chose a somewhat provocative title to my talk today. The reason I wanted to, the beyond deep learning, it's not necessarily to say that, you know, all of these deep learning techniques are going to be, you know, obsolete. That's definitely not what I'm trying to say. But I am trying to say that, you know, although there's a lot of exciting things that we can do with deep learning today, there's also a frontier, you know, a space that we can't do very well. And so I hope to today talk to you about, kind of what is an area of a boundary that we were not able to break through at this time that I think is critical for machine intelligence, for artificial general intelligence. So I'm hoping that today I can set that up for you, hopefully motivate additional people to come in and study this because I really believe that it's a critical area where additional breakthroughs are needed. I'd like to first introduce IBM Research. I don't know how many of you actually know about IBM Research. Some of you may have heard of us because of the Jeopardy challenge. So in 2011, we created a computer that was able to beat the Jeopardy champions at that game, very handily as a matter of fact. Some people don't realize that our research division is quite significant. So we have 3,000 people worldwide, 12 labs that are doing this research. Our researchers pursue the same types of accolades that, you know, the top university professors go after. So Nobel laureates, Turing Awards, National Academy of Sciences, National Academy of Technologies. 
So we pursue a very rigorous set of research in the areas that we focus on. And although I put the Jeopardy challenge up there, you're only as good as your most recent results, right? So I also wanted to make sure that I talked a little bit about things that we've done more recently. In 2017, as an example, we created the first 50 qubit quantum computer. I expect to actually see some other companies announcing simulators on the order of 50 qubits, and the difference here is that we're talking about an actual 50 qubit quantum computer.

Discussion On Quantum Advantage And AI Lab

IBM Research Division (02:59)

So we also have the simulators, but we have the real quantum computers. I think what's also unique about what we're doing is that we are also making our quantum computing capabilities available through the cloud so that people can kind of log in and they can experiment and learn about what quantum computing is. So it's a very exciting program for us.

Quantum Advantage (2017) (03:33)

We also in 2017 were able to show near linear scale-out in terms of Caffe deep learning models on our servers. We were able to show algorithms that were able to exploit the quantum advantage. So the idea is that in order to actually get the speedups on quantum computers, you need to be able to map problems into a form where you can get that acceleration. And when you get that acceleration, then you're talking about exponential accelerations over the traditional computers that people are using today. So this particular result was an algorithm that's able to map models of small molecules onto the actual quantum computing system so that we could demonstrate the ability to find the lowest energy state and get that exponential speedup from that. The other thing is that in 2017, we were named number one leading corporation for scientific research at a corporate institute. So that's pretty exciting for us. I also wanted to just tell you a little bit about the MIT-IBM Watson AI Lab. It's obviously a very exciting announcement for us. So in September of 2017, we announced that we were building this $240 million joint effort with MIT to pursue fundamental advances in AI. The core areas are listed here. When I say fundamental advances in AI, it's really again recognizing what we can and can't do with AI today and then trying to create new algorithms to go beyond that. So, just very quickly, here are examples of problems that we're interested in in terms of the new AI lab. One area is that learning causal structure from data is a very challenging problem. We're looking at how we can use data that was captured due to CRISPR mediations, or CRISPR gene inactivations, which basically gives a very large set of interventional data where they're observing the whole genome, and we're going to try to learn causal structure from those interventions. So that's one example.
Another example is physics of AI. Basically what we're talking about there is the ability to have AI help quantum computers and quantum computers accelerate AI algorithms. So we're looking at problems, for example, in machine learning algorithms that would help us to manage the state of the quantum computer. Also, for example, knowing what we know about quantum computers, knowing that we're going to be in the small numbers of hundreds of qubits for some time, and that the memory bandwidth between traditional computers and the quantum computers can be relatively small, which of the machine learning algorithms will we be able to map onto those systems in order to get that exponential speedup? So those are just some of the examples of things that we're studying. Also, we feel that right now there are two industries that are really ripe for AI disruption. One is healthcare and life sciences. The reason they're ripe for disruption is because that community has invested a lot to create what we refer to as structured knowledge. So gene ontology, SNOMED clinical terms, all of the structured knowledge that we can combine with observational data to create new algorithms. The other is cybersecurity, and the reason why that one is ripe for disruption is because if everybody is advancing these AI algorithms and people start to try to use those AI algorithms to attack our systems, it's very important that we also use AI algorithms to try to figure out how they're going to do that and how to defend against that. So here's some of the examples. The last one, shared prosperity, is about how do we get nondiscrimination, non-bias, morals into the algorithms, not just training for scale-out and accuracy and these sorts of things.

AI Lab (07:48)

All right. We already had our first announcement in terms of the new MIT-IBM Watson AI Lab. So at NIPS, we announced that we are releasing a one million video data set. For those of you who are familiar, and you probably learned a lot in this class about this, people used ImageNet to make new breakthroughs in terms of deep learning. Just that volume of labeled data meant that people could go in and run experiments and train networks that they could never do before. So we've created this million video data set. Three second videos were chosen for a specific reason. The lead for this project, Aude Oliva, has great expertise not only in computer science but also in cognitive science, and three seconds is roughly the order of time that it takes humans to recognize certain actions as well. So we're kind of sticking with that time. The machine learning algorithms that you were learning about today aren't able to do this well. What you primarily learned about so far was how to segment images, how to find objects within images, how to classify those objects, but not actions. And the reason why we also think that actions are important is because they are composable, right? So we want to be able to learn not just elemental actions from these videos, but then to start to think about how you recognize and detect compositions of actions and procedures that people are performing, because then we can use that to start to teach the computers to also perform those procedures or to help humans to be able to perform those procedures better.

Challenges (09:29)

Okay, now what I want to do is maybe take a break from the setup and get into the more technical part of the talk today. As I was saying earlier, what we've seen recently in deep learning is truly awe-inspiring, in terms of the number of breakthroughs over the last ten years and especially over the last five years. It is very exciting: breakthroughs in terms of being able to, for certain tasks, beat human error rates in visual recognition, speech recognition, and so on. But my position is that there are still huge breakthroughs that are required to try to get to machine intelligence. So here are some examples of things that the systems of today aren't able to do. One is that many of the scenarios, in order to get performance that is actually usable, require labeled data. They require training data where you've actually labeled the objects in the images and so on. While the systems are getting better in terms of doing unsupervised learning because of the vast amount of data that's available on the web, the problem is that we at IBM care about AI for businesses, right? And if you think of AI for businesses, there's just not that much deep domain data that we're able to find, right? So if you think about the medical field, the very deep, expressive relations that need to be understood require a lot of labeled data. If you think of an airline manufacturer and all of the manuals they may have, and they'd like to be able to answer questions from those and be able to reason and help humans understand how to conduct procedures within those, there's not enough data out there on those fields in terms of relationships and entities and so on for us to train.
So it requires a lot of labeling humans actually going through and finding the important entities and relationships and so on. So an important area that we need to be able to break through is first of all, why do these machines, why do these networks require so much data, labeled data? And can we address that? Can we make the algorithms better? The second thing is that you've probably realized that you train up the algorithms and then they're able to perform some tasks. The tasks are getting more and more sophisticated, self-driving cars and so on, but you're still not training this network so that it can perform many different tasks. And even more importantly, what happens is that you learn a model and even though you may, as part of the training, you may have reinforcement learning, that's not lifelong learning. That's not just turning the cars out on the road and enabling them to continue to learn and aggregate information and bring that into a representation so that they can adapt to non-stationary environments, environments that change over time. Another thing is that when we train these networks, you kind of get it down to a certain error rate, but how do you keep improving the accuracy even though the error rate is not that bad? Algorithms don't do well at this today. And then the last area is that it's really important that we're creating algorithms that are interacting with the humans, right, that can explain how they may have come to a particular decision or classification or whatever the case may be. So we need to try to think about how can we build machines that can learn, that can listen, interact with the humans, be able to explain their decisions to humans. And so a big part of this is what I refer to and what many in the community refer to as, you know, learning plus reasoning, meaning that we want to be able to reason on representations that are learned, preferably on representations that are learned in an unsupervised manner. 
Alright, the first step in doing that is to be able to make language itself computational, right? So we are able to think about, you know, words as sort of symbols and what those words mean and the properties of them and then reason about them. If you think about all the algorithms that you've been learning, they expect the information to be coming to them as numerical information, preferably as real valued information. How do you go from text to real valued information that can then be fed into these algorithms over time and then computed on? So one of the first areas here is word embeddings. The reason I italicized word is because what you'll see is that we kind of started out with word embeddings, but now it's phrase embeddings, document embeddings, so there's much more to this. But let's start first with what do we mean with word embeddings, okay? The point is to try to represent a word as a real valued vector that basically represents what that word means in terms of other words, right? So the dimensions of that vector, the features, are basically other words and the point is that you can assign different weights on how well that word relates to these other words.
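To make the idea of "a word as a real valued vector" concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-picked, hypothetical values chosen only for illustration (they are not learned and are not from the talk); real embeddings typically have hundreds of dimensions. Cosine similarity is one common way to compare such vectors, assumed here rather than taken from the talk.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "italy":  [0.9, 0.1, 0.0],
    "france": [0.8, 0.2, 0.1],
    "rome":   [0.7, 0.6, 0.0],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, k=2):
    """Return the k words whose vectors are closest in direction to `word`."""
    query = EMBEDDINGS[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in EMBEDDINGS.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for neighbor, similarity in most_similar("italy"):
        print(f"{neighbor}: {similarity:.3f}")
```

Once words are vectors, "find things related to Italy" becomes a plain numerical computation that a machine can run over the whole dictionary.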

Word Embeddings (15:10)

Now, the difficult part there is how do you learn that representation so that the vector really represents that word and is comparable to other words in a way that the machine can compute on. So the early work in this was, all right, how do we do this such that those embeddings, those vectors, give you an understanding of the similarity between this word and other words within your dictionary. The first model here was what they referred to as a skip-gram model. Basically, what you try to do is you say, given a particular word, how well can I predict the words around it, the words that are one hop away from it, two hops away from it, three hops away from it? What they're trying to do is to train a set of vectors where you minimize the prediction loss of being able to predict the words around it based off of that word. This was a very big difference for the community. Previously it had been mostly counts. People would just count the number of occurrences and then use that almost directly as the weight. What this was doing was taking a deep neural network, well, it was actually kind of a shallow neural network, to maximize the log probability of being able to predict the words around a word from that word. And what they got from that is what you can see here, this ability to place words, symbols, into a vector space such that words that are similar are closer together, and also such that relationships between words move in similar directions, right? So what this figure is trying to show you is that you can look at the countries that are here on the left side and you can see that they're similar. The countries are proximal to each other, the cities are proximal to each other, and the vectors going from country to city are essentially going in the same direction.
I'm not sure if you can actually see that, but the point is that this kind of a vector space gives us the ability to compute on those real valued vectors and then learn more about this. So a very first simple thing is to be able to find other similar things, right? You have a symbol, Italy, for example. Can you find other things that are similar to or related to Italy? You can find other countries, you can find cities within Italy. So that's kind of the first step. The first work there was this kind of distributed representation that we're talking about here. The second paper was about the significant jump in accuracy that we got from using that prediction-based representation as opposed to the count-based representation.
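The skip-gram training signal described in this section can be sketched in a few lines of Python. The function names, the window size, and the toy vectors are assumptions made for illustration. The full softmax used in `log_prob` is a simplification: practical implementations usually approximate it, but it shows the log probability that training tries to maximize.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs: each word predicts its neighbors
    up to `window` positions ("hops") away on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def log_prob(center_vec, context_word, output_vecs):
    """log p(context | center) under a full softmax over the vocabulary.
    `output_vecs` maps every vocabulary word to its (hypothetical) output vector."""
    scores = {
        word: sum(a * b for a, b in zip(center_vec, vec))
        for word, vec in output_vecs.items()
    }
    log_norm = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[context_word] - log_norm


if __name__ == "__main__":
    sentence = ["rome", "is", "in", "italy"]
    for center, context in skipgram_pairs(sentence, window=1):
        print(center, "->", context)
```

Training would adjust the vectors so that the sum of `log_prob` over all generated pairs goes up, which is what pulls words that appear in similar contexts toward each other.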

FastText (18:14)

I signaled IBM authors, people who are at or were at IBM, in orange. And the last one is that Facebook has actually recently released fastText, where you can basically go in and very easily create your own embeddings. So the first thing was, you know, how do they create it? And they went after a specific thing.

Knowledge Representation (18:49)

How do you optimize that similarity? But what you'll see from the rest of the talk is that there are many other ways that you might want to try to figure out how to place things together, other constraints that you might want to place on the vector space and how things are represented in that vector space so that you can accomplish tasks from it. Prior to these types of representation, the ideas for how people would actually go after representing knowledge and language were knowledge bases, structured knowledge. So our original idea was, all right, listen, I have to be able to have entities that are well-defined, I have to have well-defined relationships between those entities, I have to have rules that will give me information about categories of relationships or categories of entities, and it's great. Humans are able to do that. They're able to lay out an entire space, describe molecules and relationships between molecules, or genes and relationships between genes and the targets that they might be able to affect. Humans can do that well. Machines don't do it that well. That was the problem. So the second figure here was, even though you're able to find a lot of things out there, Wikipedia, Wikidata, and Freebase are some of the examples where you can find structured information on the web, and although this is a 2013 statement, what you can see is that in terms of the types of relationships and the completeness of that, this is saying, okay, well, for the people that are in Wikipedia or in Freebase, actually, we're missing about 80% of their education, where their education is from. We're missing over 90% of their employment history, right?
So even though it seems like it's a lot of information, it's really sparse and very difficult for humans to be able to use the algorithms that we have today to automatically populate knowledge bases that look like the form that we understand and feel that we can apply our logic to. So one of the first results in terms of going from that symbolic knowledge into sub symbolic knowledge, so the vectors that I was talking about earlier, was could we redo our knowledge bases based off of these sub symbolic, these vectors. If we were able to do that then we actually would be able to learn much more data. It's possible we could learn these representations and fill out some of the information that we're missing from our knowledge bases.
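One way to picture "filling out" a knowledge base from vectors is link prediction over (head, relation, tail) triples. The sketch below uses a translation-style score (head + relation should land near tail), which is one common choice and an assumption here; the talk does not say which scoring function the work used. All entity names and vectors are hypothetical placeholders, not real biomedical facts.

```python
import math

# Hypothetical 2-dimensional embeddings, hand-set so the example is easy to follow.
ENTITY = {
    "drug_a":    [0.0, 0.0],
    "disease_x": [1.0, 0.0],
    "disease_y": [1.0, 0.2],
    "gene_q":    [0.0, 3.0],
}
RELATION = {"treats": [1.0, 0.0]}


def score(head, relation, tail):
    """Higher is better: negative distance between (head + relation) and tail."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail for the missing triple."""
    candidates = [entity for entity in ENTITY if entity != head]
    return sorted(candidates, key=lambda t: score(head, relation, t), reverse=True)


if __name__ == "__main__":
    # Candidates near the top are suggestions of where to look, not established facts.
    print(rank_tails("drug_a", "treats"))
```

In a real system the vectors would be learned from the triples already in the knowledge base, and the top-ranked candidates would be handed to people to validate.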

Knowledge Base Completion with Embeddings (21:29)

So this first part was saying, okay look, I can take some of the information I can find in Freebase and other sources. What I'll do is I can use text information to try to build out these embeddings. I can find entities that are in similar spaces and realize there may be a relationship between these, and I can start to populate more of the relationships that I'm missing from my knowledge base. We're able to use this principle of the embeddings and the knowledge bases to then start to grow the knowledge bases that we have, right? So the first study here is showing how we took information about genes, diseases and drugs from ontologies that were available, represented that, and learned vectors across that structured space so that we could predict relationships that weren't in the knowledge base. That's important because if you think about how people get that information today, they actually do wet lab experiments to try to understand if there is a relationship, if something upregulates something else. It's very expensive. If we can use this knowledge to make those predictions, then we can give other scientists places to look. Looks like there might be an interaction here. Maybe you could try that, right? The second set of results is more recent. Basically, there was a challenge issued by the semantic web community: how can we improve this automated knowledge base construction? So the team used a combination of these word embeddings to be able to search for and validate information gained from a set of structured and unstructured knowledge. This actually won first place in the 2017 Semantic Web Challenge. Okay, so now we're kind of getting an idea about how we would take language, make it computational, and put it into a knowledge base so that we can aggregate it over time. But how are we going to get the neural networks to use that? That's the next question.
Okay, an example task of why you would want those neural networks to be able to use that is question answering. You want to build up a knowledge base, everything you can possibly find, and then you want to be able to ask it questions and see if it can answer those questions.

Short answer questions (24:09)

We say that this requires memories because the point is that if you think about some of the other tasks that you may have seen, you provide, once you've trained the network, you provide an input and then you get an output. You don't necessarily use long-term memories of relationships and entities and all of these sorts of things. So this is a challenge that was issued to the community essentially in terms of being able to read some sentences and then being able to, given a question, give an answer to that. And there are different stages of the complexity of what is required in order to answer the question. Sometimes it's really just kind of finding the sentence. Sometimes it's being able to put multiple sentences together. Sometimes it's being able to chain across time. And so there are stages of difficulties in order to do that. But what I want to focus on is some of the early work in terms of creating a neural network that can then access those knowledge bases and then be able to produce an answer from that. So the expectation is that you build up those knowledge bases from as much information you can find previously, you train them such that they know how to answer a question, the types of questions that you'd like them to answer. And then from that, when you hand it a question, it's able to produce an answer. The reason this is different from what people are doing today, it's not just about saying, today what happens is, you know, when you program a computer, then you tell it, okay, I want to be able to access this place in memory, you know, I do a query on a database, and I say, okay, I'd like for you to give me all the rows where the first name is X and the last name is Y, it can come back, and that's all programmed.
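For contrast, the fully programmed style of memory access mentioned above looks like the following sketch, using Python's built-in `sqlite3` module. The table name, column names, and sample rows are hypothetical. Every step of the lookup is spelled out by a person ahead of time.

```python
import sqlite3


def find_people(rows, first_name, last_name):
    """Load rows into an in-memory table and return exact matches on both names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT first_name, last_name, city FROM people "
        "WHERE first_name = ? AND last_name = ?",
        (first_name, last_name),
    )
    result = cursor.fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("Ada", "Lovelace", "London"), ("Alan", "Turing", "London")]
    print(find_people(sample, "Ada", "Lovelace"))
```

Here the programmer decides which table to read and which fields to match; nothing about the access pattern is learned from examples.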

Neural Networks with Memory (25:51)

These networks are instead learning how to access memory by looking at other patterns of access to memory. Not programmed, trained. So the point here is that the neural net is the controller of how that memory is accessed in order to produce an answer. It is a supervised result, so they do train jointly with, okay, what are the inputs, what's the question that would be asked of that, what's the output that is desired from that. And then by providing many, many, many examples of that, when you provide a new set of information, it's able to answer a question from that by basically taking a vector representation of the question Q, being able to map that onto the memory, so the embeddings that were produced from all the sentences that were entered, and then moving back and forth across that until it gets to a level of confidence in an answer and transferring that into an output. The first version of this, the version that's on the left side, wasn't able to handle many things in terms of understanding temporal sequences. For example, it wasn't able to do multi-hop. And the second version, which is much more recent, is kind of full end-to-end training of that control of the network in order to try to start to answer these questions. While it's incredibly exciting, there are still many of the questions that I was showing earlier that it really can't answer. So this is definitely not a solved problem, but hopefully what you can see is how we're starting to go up against problems that we don't know completely how to solve, but we're starting to solve them, and instead of just creating neural networks, just an algorithm, we are creating machines, right? We're creating machines that have controllers and they have memory and they're able to perform tasks that go well beyond just what you could do with the pure neural network algorithm. They leverage those neural network algorithms throughout.
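The "map the question onto the memory and move back and forth across it" step can be sketched as soft attention over memory slots, repeated for several hops. This is a simplified illustration in the style of end-to-end memory networks, which is an assumption about the kind of model being described. It uses hand-set vectors and one embedding per sentence, and it does no training; a real model learns the embeddings and adds an output layer that turns the final state into an answer.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Turn raw match scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(query, memory):
    """One read: attend over all memory slots, then fold what was read into the query."""
    weights = softmax([dot(query, slot) for slot in memory])
    read = [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(len(query))
    ]
    new_query = [q + r for q, r in zip(query, read)]
    return new_query, weights


def answer_state(query, memory, hops=2):
    """Repeat the read several times so later hops can build on earlier ones."""
    attention_per_hop = []
    for _ in range(hops):
        query, weights = memory_hop(query, memory)
        attention_per_hop.append(weights)
    return query, attention_per_hop


if __name__ == "__main__":
    # Hypothetical embeddings for two stored sentences and one question.
    memory = [[4.0, 0.0], [0.0, 4.0]]
    question = [1.0, 0.0]
    state, attention = answer_state(question, memory)
    print(attention)
```

The attention weights are where the controller's learned access pattern would show up: which stored sentences it looks at, and in what order across hops.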
The networks involved are recurrent neural nets, LSTMs and so on, but the point is we're starting to try to put together machines from this. If you'd like to learn more about this topic, there's quite a bit of work in this. So one is, in addition to being able to answer those questions, could we better isolate what the question is, right? This is something that humans have a problem with. Somebody comes and they ask you a question, and you kind of say, wait, what's really the question you're asking me here? So we've done work in terms of having the computers better understand what question is really being asked. We also need systems that will help us to train these models, so part of this work is to create a simulator that can take texts that are ambiguous and generate questions of certain forms that we can then use to both train as well as test some of these systems. Another really interesting thing is common sense knowledge, which they refer to as what's in the white space between what you read, right?

Common Sense Knowledge (29:14)

You read a text and a lot of times there's a lot of common sense knowledge, you know, knowing that this desk is made of wood and wood is hard and all of these sorts of things, that helps you to understand a question. You can't find that. Most of that common sense information is not in Wikipedia. It's not easy for us to learn it from text because people don't state it. This third work here is, okay, can we take some of that common sense knowledge, can we learn in vector space ways to represent information that's common sense, that white space, and attach it to other information that we're able to read from the web and so on. Some of the recent work as well is, can we use neural nets to basically learn what a program is doing, represent that, and then be able to execute that program. Right now, people have to try to probe a program. It takes a very sophisticated human skill to be able to probe a program and understand it. And in fact, humans don't do that very well. But if we could train machines to do that, then obviously that's a very powerful thing to do.

Outcome Of Research On Watson Human Machine Collaboration

The Moral of the Research, summary of Watson Human Machine Collaboration. (30:37)

Also, a paper that was just published a couple of months ago, in December at NIPS, I thought was extremely interesting. Basically, they're learning how to constrain vector representations such that they can induce new rules from those and basically create proofs, right? The reason a proof is important is that it's the beginnings of being able to explain an answer. If the question from some of the other examples was, you know, the apple is in the kitchen, why? If you have a proof, if you have the steps that you went through in terms of the knowledge base and are able to explain that out, then suddenly people can interact with the system and use the system not only to answer questions, but to improve and lift human knowledge as well. Learn from the computers. Right now, the computers learn from us. And just finally, if you'd like to do more, the research division is working on next generation algorithms. Those come out through our Watson products. We have a Watson developer cloud. It makes it very easy. You can do things like hand it an image and it'll hand back information about what's in that image. You hand it text, it'll tell you information about the sentiment of that, label information within it, and so on. So we have what we believe are very easy to use algorithms that take many of these things that I was talking about earlier and make them very easy for anyone to use and incorporate into their programs. I think that's it for me. Any questions?
