MIT 6.S191 (2018): Deep Learning Limitations and New Frontiers
Transcription for the video titled "MIT 6.S191 (2018): Deep Learning Limitations and New Frontiers".
I want to bring this part of the class to an end. So this is our last lecture before our series of guest lectures. And in this talk, I hope to address some of the state of deep learning today and kind of bring up some of the limitations of the algorithms that you've been seeing in this class so far. So we got a really good taste of some of the limitations, specifically in reinforcement learning algorithms, that Lex gave in the last lecture. And I'm going to build on top of that during this lecture. And just to end on, I'm going to introduce you to some new frontiers in deep learning that are really, really inspiring and at the cutting edge of research today. Before we do that, I'd like to just make some administrative announcements. So t-shirts have arrived, and we'll be distributing them today. And we'd like to distribute first to the registered for-credit students. After that, we will be happy to distribute to registered listeners. And then after that, if there's any remaining, we'll give out to listeners if they're interested. So for those of you who are taking this class for credit, I need to reiterate what kind of options you have to actually fulfill your credit requirement. So the first option is a group project proposal presentation. So for this option, you'll be given the opportunity to pitch a novel deep learning idea to a panel of judges on Friday. You'll have exactly one minute to make your pitch as clearly and as concisely as possible. So this is really difficult to do in one minute, and this is kind of one of the challenges that we're putting on you, in addition to actually coming up with the deep learning idea itself. If you want to go down this route for your final project, then you'll need to submit your teams, which have to be of size three or four, by the end of today. So at 9 PM today, we'd like those in.
So if you want to work in groups of one or two, you're welcome to do that, but you won't be able to actually submit your final project as part of a presentation on Friday. You can submit it to us and we'll give you the grade for the class like that.
Lecture Content Overview
Groups & Check in Tab (02:15)
So groups are due 9 p.m. today, and you have to submit your slides by 9 p.m. tomorrow. Presentations are at class on Friday in this room. If you don't want to do a project presentation, you have a second option, which is to write a one-page paper review of a deep learning idea. So any idea or any paper that you find interesting is welcome here. So we really accept anything, and we're really free in this option as well. I want to highlight some of the exciting new talks that we have coming up after today. So tomorrow, we'll have two sets of guest lectures. First, we'll hear from Urs Mueller, who is the chief architect of NVIDIA's self driving car team. So Urs and his team are actually known for some really exciting work that Ava was showing yesterday during her lecture. And they're known for this development of an end-to-end platform for autonomous driving that takes directly image data and produces a steering control command for the car at the output. Then we'll hear about, we'll hear from two Google brain researchers on recent advancements on image classification at Google. And also, we'll hear about some super recent advancements and additions to the TensorFlow pipeline that were actually just released a couple of days ago. So this is really, really new stuff. Tomorrow afternoon, we'll get together for one of the most exciting parts of this class. So what will happen is we'll have each of the sponsors actually come up to the front of the class here. We have four sponsors that will present on each of these four boards. And you'll be given the opportunity to basically connect with each of them through the ways of a recruitment booth. And basically, they're going to be looking at students that might be interested in deep learning internships or employment opportunities. So this is really an incredible opportunity for you guys to connect with these companies in a very, very, very direct manner. So we highly recommend that you take advantage of that. 
There will be info sessions with pizza provided on Thursday with one of these industry companies.
Guest Lecture details (04:20)
And we'll be sending out more details with that today as well. So on Friday, we'll continue with the guest lectures and hear from Lisa Amini, who is the head of IBM Research in Cambridge. She's actually also the director of the MIT IBM Lab. And this is a lab that was just founded a couple, or actually about a month ago or two months ago. We'll be hearing about how IBM is creating AI systems that are capable of not only deep learning, but going a step past deep learning, or trying to be capable of learning and reasoning on a higher order sense.
Social Networks (05:04)
And then finally, we'll hear from a principal researcher at Tencent AI Lab about combining computer vision and social networks. It's a very interesting topic that we haven't really touched upon in this class, this topic of social networks and using massive big data collected from humans themselves. And then, as I mentioned before, in the afternoon, we'll go through and hear about the final project presentations.
We'll celebrate with some pizza and the awards that will be given out to the top projects during that session as well. So now let's start with the technical content for this class.
Universal Approximation Theorem (05:41)
I'd like to start by just kind of overviewing the type of architectures that we've talked about so far. For the most part, these architectures can be thought of as almost pattern recognition architectures. So they take data as input, and the whole point of their pipeline, their internals, is performing feature extraction. And what they're really doing is taking all of the sensory data and trying to figure out what are the important pieces, what are the patterns to be learned within the data, such that they can produce a decision at the output. We've seen this take many forms. So the decision could be a prediction, could be a detection, or even an action, like in a reinforcement learning setting. We've even learned how these models can be viewed in a generative sense to go in the opposite direction and actually generate new synthetic data. But in general, we've been dealing with algorithms that are really optimized to do well at only a single task. But they really fail to think like humans do, especially when we consider a higher order level of intelligence like I defined on the first day. To understand this in a lot more detail, we have to go back to this very famous theorem that dates back almost 30 years from today. This theorem, which is known as the universal approximation theorem, was one of the most impactful theorems in neural networks when it first came out because it proved such a profound claim. What it states is that a neural network with a single hidden layer is sufficient to approximate any function to any arbitrary level of accuracy. Now in this class, we deal with networks that are deep. They're not single layered, so they actually contain even more complexity than the network that I'm referring to here. But this theorem proves that we actually only need one layer to approximate any function in the world.
And if you believe that any problem can actually be reduced to a set of inputs and outputs in the form of a function, then this theorem shows you that a neural network with just a single layer is able to solve any problem in the world. Now this is an incredibly powerful result, but if you look closely there are a few very important caveats. I'm not actually telling you how large that hidden layer has to be to accomplish this task. Depending on the size of your problem, the number of units in that hidden layer may be exponentially large, and it can grow exponentially with the difficulty of your problem. This makes training that network very difficult. And I never actually told you anything about how to obtain that network. I just told you that it exists, that there is a possible network in the realm of all neural networks that could solve that problem. But as we know, in practice, actually training neural networks is extremely difficult because of their non-convex structure. So this theorem is really a perfect example of the possible effects of overhyping in AI. So over the history of AI, we've had two AI winters. And this theorem was part of the resurgence after the first AI winter, but it also caused a huge false hype in the power of these neural networks, which ultimately led to yet another AI winter. And I feel like as a class, it's very important to bring this up because right now we're very much in the state of a huge amount of overhyping in deep learning algorithms. So these algorithms are, especially in the media, being portrayed as able to accomplish human-level intelligence and human-level reasoning. And simply, this is not true. So I think such overhype is extremely dangerous. And, well, we know it resulted in both of the two past AI winters.
And I think as a class, it's very important for us to focus on some of the limitations of these algorithms so that we don't overhype them, but we provide realistic guarantees or realistic expectations, rather, on what these algorithms can accomplish. And finally, going past these limitations, the last part of this talk will actually focus on some of the exciting research, like I mentioned before, that tries to take a couple of these limitations and really focus on possible solutions and possible ways that we can move past them.
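To make the single-hidden-layer claim and its caveat concrete, here is a minimal sketch. It is not from the lecture: it assumes a one-dimensional continuous target on a bounded interval (the setting where the theorem is usually stated), uses ReLU hidden units, and builds the weights by hand rather than by training. The function names are made up for illustration. It shows that the approximation error shrinks only as the hidden layer gets wider, which is the "how large does the layer have to be" caveat.

```python
import math

def relu(z):
    return max(0.0, z)

def build_one_hidden_layer(f, lo, hi, width):
    """Construct (not train) a one-hidden-layer ReLU network that linearly
    interpolates f at width + 1 evenly spaced knots on [lo, hi]."""
    step = (hi - lo) / width
    knots = [lo + i * step for i in range(width + 1)]
    values = [f(k) for k in knots]
    slopes = [(values[i + 1] - values[i]) / step for i in range(width)]
    # The output weight of hidden unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    biases = knots[:width]  # hidden unit i computes relu(x - knots[i])
    return values[0], biases, out_weights

def predict(net, x):
    offset, biases, out_weights = net
    return offset + sum(w * relu(x - b) for b, w in zip(biases, out_weights))

def max_error(f, net, lo, hi, samples=1000):
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - predict(net, x)) for x in xs)

errors = {}
for width in (4, 16, 64):
    net = build_one_hidden_layer(math.sin, 0.0, 2 * math.pi, width)
    errors[width] = max_error(math.sin, net, 0.0, 2 * math.pi)
    print(width, "hidden units -> max error", round(errors[width], 4))
```

Note that the weights here are written down directly from the target function. The theorem says such a network exists; it says nothing about whether gradient descent would find it, which is the second caveat raised above.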
OK, so let's start. And I think one of the best examples of a potential danger of neural networks comes from this paper from Google DeepMind named Understanding Deep Learning Requires Rethinking Generalization. And generalization was this topic that we discussed in the first lecture. So this is the notion of a gap or a difference between your training accuracy and your test accuracy. If you're able to achieve equal training and test accuracy, that means you have essentially no generalization gap. You're able to generalize perfectly to your test data set. But if there's a huge disparity between the two, and your model is performing much better on your training data set than your test data set, this means that you're not able to actually generalize to brand new images. You're only just memorizing the training examples. And what this paper did was they performed the following experiment. So they took images from ImageNet. So you can see four examples of these images here. And what they did was they rolled a k-sided die, where k is the number of all possible labels in that data set. And this allowed them to randomly assign brand new labels to each of these images. So what used to be a dog, they call now a banana. And what used to be that banana is now called a dog. And what used to be called that second dog is now a tree. So note that the two dogs have actually been transformed into two separate things. So things that used to be in the same class are now in completely disjoint classes. And things that were in disjoint classes may be now in the same class. So basically, we're completely randomizing our labels entirely. And what they did was they tried to see if a neural network could still learn random labels. And here's what they found. So as you'd expect, when they tested this neural network with random labels, as they increased the randomness on the x-axis, so going from left to right, this is the original labels before randomizing anything.
And then as they started randomizing, their test accuracy gradually decreased. And this is as expected, because we're trying to learn something that has absolutely no pattern in it. But then what's really interesting is that then they looked at the training accuracy. And what they found was that the neural network was able to, with 100% accuracy, get the training set correct every single time. No matter how many random labels they introduced, the training set would always be shattered. Or in other words, every single example in the training set could be perfectly classified. So this means that modern deep neural networks actually have the capacity to brute force memorize massive data sets, even on the size of ImageNet. With completely random labels, they're able to memorize every single example in that data set. And this is a very powerful result, because it drives home this point that neural networks are really, really excellent function approximators. So this also connects back to the universal approximation theorem that I talked about before. But they're really good approximators for just a single function, like I said, which means that we can always create this maximum likelihood estimate of our data using a neural network, such that if we were given a new data point, like this purple one on the bottom, it's easy for us to compute its estimated output just by intersecting it with that maximum likelihood estimate. But that's only if I'm looking at a place where we have sufficient training data already. What if I extend these x-axes and look at what the neural network predicts beyond that? These are actually the locations that we care about most. These are the edge cases in driving. These are the cases where we don't have a lot of data that was collected. And these are usually the cases where safety critical applications are most important.
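The shape of the random-label experiment can be reproduced at toy scale. The sketch below is an illustration under loud assumptions, not the paper's setup: instead of a deep network on ImageNet it uses a 1-nearest-neighbor lookup as a stand-in for a model with enough capacity to memorize its training set, and the data is random points in the unit square labeled by quadrant. All names are hypothetical. The label corruption step is the "k-sided die" described above.

```python
import random

def corrupt_labels(labels, num_classes, p, rng):
    """With probability p, replace each label by a roll of a num_classes-sided die."""
    return [rng.randrange(num_classes) if rng.random() < p else y for y in labels]

def quadrant(point):
    """Hypothetical 'true' labeling rule: which quadrant of the unit square."""
    x, y = point
    return (1 if x >= 0.5 else 0) + (2 if y >= 0.5 else 0)

def nearest_label(train_x, train_y, query):
    """Pure memorization: return the label of the closest stored training example."""
    best = min(range(len(train_x)),
               key=lambda i: (train_x[i][0] - query[0]) ** 2 + (train_x[i][1] - query[1]) ** 2)
    return train_y[best]

def accuracy(train_x, train_y, xs, ys):
    hits = sum(nearest_label(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)
train_x = [(rng.random(), rng.random()) for _ in range(200)]
test_x = [(rng.random(), rng.random()) for _ in range(300)]
test_y = [quadrant(q) for q in test_x]  # test labels are never corrupted

results = {}
for p in (0.0, 0.5, 1.0):
    train_y = corrupt_labels([quadrant(q) for q in train_x], 4, p, rng)
    results[p] = (accuracy(train_x, train_y, train_x, train_y),
                  accuracy(train_x, train_y, test_x, test_y))
    print(f"corruption={p}: train acc={results[p][0]:.2f}, test acc={results[p][1]:.2f}")
```

The training accuracy stays perfect at every corruption level because the model simply stores the examples, while the test accuracy falls toward chance. The difference between the two numbers in each row is the generalization gap.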
Adversarial attacks (14:21)
So we need to be able to make sure, when we sample the neural network at these locations, that we're able to get feedback from the neural network that it actually doesn't know what it's talking about. So this notion leads nicely into the idea of what is known as adversarial attacks, where I can give a neural network two images, like this one on the left, and on the right an adversarial image, that to a human look exactly the same. But the network classifies the adversarial one incorrectly 100% of the time. So the image on the left shows an example of a temple, which when I feed it to a neural network, it gives me back a label of temple. But when I apply some adversarial noise, it classifies this image incorrectly as an ostrich. So for this, I'd like to focus on this piece specifically. So to understand the limitations of neural networks, the first thing we have to do is actually understand how we can break them. And this perturbed noise is actually very intelligently designed. So this is not just random noise, but we're actually modifying pixels in specific locations to maximally change or mess up our output prediction. So we want to modify the pixels in such a way that we're decreasing our accuracy as much as possible. And if you remember back to how we actually train our neural networks, this might sound very similar. So if you remember, training a neural network is simply optimizing over our weights theta. So to do this, we simply compute the gradient of our loss function with respect to theta. And we simply perturb our weights in the direction that will minimize our loss. Now, also remember that when we do this, we're perturbing theta, but we're fixing our x and our y. This is our training data and our training labels. Now, for adversarial examples, we're just shuffling the variables a little bit. So now we want to optimize over the image itself, not the weights.
So we fix the weights and the target label itself, and we optimize over the image x. We want to make small changes to that image x such that we increase our loss as much as possible. So we want to go in the opposite direction of training now. And these are just some of the limitations of neural networks. For the remainder of this class, I want to focus on some of the really, really exciting new frontiers of deep learning that focus on just two of these.
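The "swap the variables" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the attack shown on the slide: the "network" is a tiny logistic regression with made-up fixed weights, and the update takes a single step using only the sign of the input gradient (one common variant, usually called the fast gradient sign method). The weights and label stay fixed; only the input x changes, in the direction that increases the loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression 'network'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b, x, y):
    """Binary cross-entropy on a single example."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_wrt_input(w, b, x, y):
    """dLoss/dx with the weights held fixed; for this model it is (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def adversarial_example(w, b, x, y, epsilon):
    """Shift every input coordinate by epsilon in the direction that raises the loss."""
    g = grad_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, g)]

# Hypothetical fixed weights and an input that is correctly classified as class 1.
w, b = [2.0, -3.0, 1.0], 0.5
x, y = [0.6, 0.1, 0.4], 1
x_adv = adversarial_example(w, b, x, y, epsilon=0.4)

print("clean:       p(class 1) =", round(predict(w, b, x), 3), "loss =", round(loss(w, b, x, y), 3))
print("adversarial: p(class 1) =", round(predict(w, b, x_adv), 3), "loss =", round(loss(w, b, x_adv, y), 3))
```

Compare this with training: there the gradient is taken with respect to the weights and the step goes downhill on the loss. Here the gradient is taken with respect to the input and the step goes uphill. On real images the per-pixel step is typically far smaller than in this three-feature toy, which is why the change can be invisible to a human.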
Bayesian learning (16:52)
Specifically, I want to focus on the notion of understanding uncertainty in deep neural networks and understanding when our model doesn't know what it was trained to know, maybe because it didn't receive enough training data to support that hypothesis. And furthermore, I wanted to focus on this notion of learning how to learn models, because optimization of neural networks is extremely difficult. It's extremely limited in its current nature, because they're optimized just to do a single task. So what we really want to do is create neural networks that are capable of performing not one task, but a set of sequences of tasks that are maybe dependent in some fashion. So let's start with this notion of uncertainty in deep neural networks. And to do that, I'd like to introduce this field called Bayesian deep learning.
Why we care (17:43)
Now, to understand Bayesian deep learning, let's first understand why we even care about uncertainty. So this should be pretty obvious. Let's suppose we're given a network that was trained to distinguish between cats and dogs. At input, we're given a lot of training images of cats and dogs, and at the output we're simply producing a probability of being a cat or a dog. Now this model is trained on only cats or dogs, so if I show it another cat, it should be very confident in its output. But let's suppose I give it a horse, and I force that network, because it's the same network, to produce a probability of a cat or a probability of a dog. Now, we know that these probabilities have to add up to 1, because that's actually the definition that we constrain our network to follow. So that means, by definition, the network has to produce one of these categories. So the notion of probability and the notion of uncertainty are actually very different. But a lot of deep learning practitioners often mix these two ideas. So uncertainty is not probability. Neural networks are trained to produce probabilities at their output, but they're not trained to produce uncertainty values. So if we put this horse into the same network, we'll get a set of probability values that add up to one. But what we really want to see is a very low certainty in that prediction. And one possible way to accomplish this in deep learning is through the eyes of Bayesian deep learning. And to understand this, let's briefly start by formulating our problem again. So first, let's go through the variables. So we want to approximate this variable y, or output y, given some raw data x.
And really what we mean by training is we want to find this functional mapping f parameterized by our weights theta, such that we minimize the loss between our predicted examples and our true outputs y. So Bayesian neural networks take a different approach to solve this problem. They aim to learn a posterior over our weights, given the data. So they attempt to say, what is the probability that I see this model with these weights, given the data in my training set? Now, it's called Bayesian deep learning because we can simply rewrite this posterior using Bayes' rule. However, in practice, it's rarely possible to actually compute this Bayes' rule update. And it just turns out to be intractable. So instead, we have to find out ways to actually approximate it through sampling. So one way that I'll talk about today is a very simple notion that we've actually already seen in the first lecture. And it goes back to this idea of using dropout. So if you remember what dropout was, dropout is this notion of randomly killing off a certain percentage of neurons in each of the hidden layers. Now I'm going to tell you not how to use it as a regularizer, but how to use dropout as a way to produce reliable uncertainty measures for your neural network. So to do this, we have to think of capital T stochastic passes through our network, where each stochastic pass performs one iteration of dropout. Each time you iterate dropout, you're basically just applying a Bernoulli mask of ones and zeros over each of your weights. So going from the left to the right, you can see our weights, which is like this matrix here. Different colors represent the intensity of that weight. And we element-wise multiply those weights by our Bernoulli mask, which is just either a one or a 0 in every location. The output is a new set of weights with certain of those dropped out, with certain aspects of those dropped out. Now, all we have to do is compute this T times, capital T times, we get theta t weights. 
And we use those theta t different models to actually produce an empirical average of our output class, given the data. So that's this guy. What we're actually really interested in, why I brought this topic up, was the notion of uncertainty, though. And that's the variance of our predictions right there.
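The T stochastic passes described above can be sketched as follows. This is a minimal illustration with assumptions stated up front: the weights are made-up numbers standing in for a trained network, the Bernoulli mask is multiplied into the weights as in the description (many libraries instead mask activations and rescale them), and the helper names are hypothetical. The mean over the T outputs is the prediction, and the variance over the same outputs is the uncertainty estimate.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_pass(x, w_hidden, w_out, keep_prob, rng):
    """One forward pass with a fresh Bernoulli mask multiplied element-wise into the weights."""
    hidden = []
    for row in w_hidden:
        masked = [wi * (1 if rng.random() < keep_prob else 0) for wi in row]
        hidden.append(max(0.0, sum(m * xi for m, xi in zip(masked, x))))  # ReLU unit
    masked_out = [wi * (1 if rng.random() < keep_prob else 0) for wi in w_out]
    return sigmoid(sum(m * h for m, h in zip(masked_out, hidden)))

def mc_dropout(x, w_hidden, w_out, keep_prob, T, seed=0):
    """Run T stochastic passes; return the empirical mean and variance of the outputs."""
    rng = random.Random(seed)
    outputs = [stochastic_pass(x, w_hidden, w_out, keep_prob, rng) for _ in range(T)]
    mean = sum(outputs) / T
    variance = sum((o - mean) ** 2 for o in outputs) / T
    return mean, variance

# Hypothetical "trained" weights: 3 inputs -> 4 hidden ReLU units -> 1 output probability.
w_hidden = [[0.8, -0.5, 0.3], [-0.2, 0.9, 0.4], [0.5, 0.5, -0.7], [0.1, -0.3, 0.6]]
w_out = [1.2, -0.8, 0.5, 0.9]
x = [1.0, 2.0, -1.0]

mean, variance = mc_dropout(x, w_hidden, w_out, keep_prob=0.5, T=200)
print("predictive mean:", round(mean, 3), "variance (uncertainty):", round(variance, 4))
```

A single deterministic pass would return one probability and nothing else. The variance across masked passes is the extra quantity: it is what lets the horse example show up as a low-certainty prediction even though the output probabilities still sum to one.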
Bayesian model uncertainty drop out (22:24)
So this is a very powerful idea. All it means is that we can obtain reliable model uncertainty estimates simply by training our network with dropout and then keeping dropout on at runtime. So instead of classifying with just a single pass through this network at test time, we run capital T stochastic passes through this network, and then use them to compute a variance over these outputs. And that variance gives us an estimation of our uncertainty. Now to give you an example of how this looks in practice, let's look at this network that was trained to take as input images of the real world and output predicted depth maps. Oh, it looks like my text was a little off, but that's okay. So at the output, we have a predicted depth map, where at each pixel, the network is predicting the depth in the real world of that pixel. Now when we run Bayesian model uncertainty using the exact same dropout method that I just described, we can see that the model is most uncertain in some very interesting locations. So first of all, pay attention to that location right there. And if you look, where is that location exactly? It's just the windowsill of this car. And in computer vision, windows and specular objects are very difficult to model, because we can't actually tell their surface reliably. So we're actually seeing the light from the sky. We're not actually seeing the surface of the window in that location. So it can be very difficult for us to model the depth in that place. Additionally, we see that the model is very uncertain on the edges of the cars, because these are places where the depth is changing very rapidly. So the prediction may be least accurate in these locations.
Robust measurements (24:22)
So having reliable uncertainty estimates can be an extremely powerful way to actually interpret deep learning models, and it also gives human practitioners, especially in the realm of safe AI, a way to interpret the results and to trust them with a certain grain of salt. So for the next and final part of this talk I'd like to address this notion of learning to learn. So this is a really cool sounding topic. It aims to learn not just a single model that's optimized to perform a single task, like we've learned basically in all of our lectures previous to this one, but to learn which model to use for that task. So first let's understand why we might want to do something like that. I hope this is pretty obvious to you by now, but humans are not built in a way where we're executing just a single task at a time. We're executing many, many, many different tasks. And all of these tasks are constantly interacting with each other in ways that learning one task can actually aid, speed up, or deter the learning of another task at any given time. Modern deep neural network architectures are not like this. They're optimized for a single task. And this goes back to the very beginning of this talk, where we talked about the universal approximator. And as these models become more and more complex, what ends up happening is that you have to have more and more expert knowledge to actually build and deploy these models in practice. And that's exactly why all of you are here. You're here to basically get that experience such that you yourselves can build these deep learning models. So what we want is actually an automated machine learning framework where we can actually learn to learn. And this basically means we want to build a model that learns which model to use given a problem definition. There are many ways that AutoML can be accomplished, and I'd like to use just one example as an illustration of this idea.
And this is just one example of those ways. So I'd like to focus on this illustration here, and I'd like to walk through it. It's just a way that we can learn to learn. So this system focuses on two parts. The first part is the controller RNN in red on the left, and this controller RNN is basically just sampling different architectures of neural networks. So if you remember in your first lab, you created an RNN that could sample different music notes. This is no different, except now we're not sampling music notes, we're sampling an entire neural network itself. So we're sampling the parameters that define that neural network. So let's call that the architecture, or the child network. So that's the network that will actually be used to solve our task in the end.
Detailed Learning Process
Hyper Parameters (27:28)
So that network is passed on to the second part. And in that piece, we actually use that network that was generated by the RNN to train a model. Depending on how well that model did, we can provide feedback to the RNN such that it can produce an even better model on the next time step. So let's go into this piece by piece. So let's look at just the RNN part in more detail. So this is the RNN, or the architecture generator. So like I said, this is very similar to the way that you were generating songs in your first lab, except now we're not generating songs. The time steps on the x-axis are going over layers, and we're just generating parameters, or hyperparameters, rather, for each of those layers. So this is a generator for a convolutional neural network, because we're producing parameters like the filter height, the filter width, the stride height, et cetera. So what we can do is, at each time step, produce a probability distribution over each of these parameters, and we can essentially just sample an architecture, or sample a child network. Once we have that child network, which I'm showing right here in blue, we can train it using the data set for the task that we ultimately want to solve. So we put our training data in, and we get our predicted labels out. This is the realm that we've been dealing with so far in this class. So this is basically what we've seen so far. So this is just a single network, and we have our training data that we're using to train it. We see how well this does.
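The sample, train, and feed back loop can be sketched as below. This is a heavily simplified illustration, not the system on the slide: the controller is a table of independent logits per layer and hyperparameter rather than an RNN, the search space values are invented, and `child_accuracy` is a placeholder reward (a real system would train the sampled child network and measure its accuracy). The update is a basic REINFORCE-style policy-gradient step with a moving-average baseline.

```python
import math
import random

SEARCH_SPACE = {  # hypothetical per-layer hyperparameter choices
    "filter_height": [1, 3, 5, 7],
    "filter_width": [1, 3, 5, 7],
    "stride": [1, 2, 3],
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_architecture(logits, num_layers, rng):
    """Sample one choice index per (layer, hyperparameter) from the controller's distributions."""
    arch = []
    for layer in range(num_layers):
        arch.append({name: rng.choices(range(len(opts)), weights=softmax(logits[layer][name]))[0]
                     for name, opts in SEARCH_SPACE.items()})
    return arch

def reinforce_update(logits, arch, advantage, lr):
    """Policy-gradient step: raise the probability of the sampled choices when advantage > 0."""
    for layer, choices in enumerate(arch):
        for name, chosen in choices.items():
            probs = softmax(logits[layer][name])
            for i in range(len(probs)):
                indicator = 1.0 if i == chosen else 0.0
                logits[layer][name][i] += lr * advantage * (indicator - probs[i])

def child_accuracy(arch):
    """Placeholder reward in [0, 1]. Here 3x3 filters with stride 1 are arbitrarily
    scored highest; a real system would train and evaluate the child network."""
    target = {"filter_height": 1, "filter_width": 1, "stride": 0}  # indices into SEARCH_SPACE
    hits = sum(choices[name] == target[name] for choices in arch for name in target)
    return hits / (len(arch) * len(target))

rng = random.Random(0)
num_layers = 2
logits = [{name: [0.0] * len(opts) for name, opts in SEARCH_SPACE.items()}
          for _ in range(num_layers)]
baseline = 0.0
for step in range(300):
    arch = sample_architecture(logits, num_layers, rng)
    reward = child_accuracy(arch)
    reinforce_update(logits, arch, reward - baseline, lr=0.5)
    baseline = 0.9 * baseline + 0.1 * reward  # moving average reduces gradient variance

print("stride distribution for layer 0:", [round(p, 2) for p in softmax(logits[0]["stride"])])
```

The expensive part in practice is hidden inside the reward: every sampled architecture requires training a full child network before the controller receives a single number of feedback.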
Brass Tacks (29:07)
Depending on the accuracy of this model, that accuracy is used to provide feedback to the RNN and update how it generates these models. So let's look at this one more time to summarize. This is an extremely powerful idea. It's really, really, really exciting because it shows that an RNN can actually be combined with a reinforcement learning paradigm, where the RNN itself is almost like the agent in reinforcement learning. It's learning to make changes to the child network architecture depending on how that child network performs on a training set. This means that we're able to create an AI system capable of generating brand new neural networks specialized to solve specific tasks, rather than us creating a single neural network by hand just to solve the task that we want to solve. Thus, this significantly reduces the difficulty in optimizing these neural network architectures for different tasks. And this also reduces the need for expert engineers to design these architectures. So this really gets at the heart of artificial intelligence. So when I began this course, we spoke about what it actually means to be intelligent. And loosely, I defined this as the ability to take information, process that information, and use it to inform future decisions. So the human learning pipeline is not restricted to solving just one task at a time, like I mentioned before. How we learn one task can greatly impact, speed up, or even slow down our learning of other tasks. And the artificial models that we create today simply do not capture this phenomenon. To reach artificial general intelligence, we need to actually build AI that can not only learn a single task, but also improve its own learning and reasoning such that it can generalize to sets of related and dependent tasks.
I'll leave this with you as a thought-provoking point and encourage you to all talk to each other on some ways that we can reach this higher order level of intelligence that's not just pattern recognition, but rather a higher order form of reasoning and actually thinking about the problems that we're trying to solve. Thank you.