MIT 6.S191 (2019): Deep Learning Limitations and New Frontiers

Transcription for the video titled "MIT 6.S191 (2019): Deep Learning Limitations and New Frontiers".



Intro (00:00)

OK. So welcome, everyone, to the final foundational lecture of this year's offering of 6.S191, where we'll be taking kind of a step back from the architectures and the algorithms we've been exploring over the past two days to take a broader perspective on some of the limitations of deep learning and a couple of exciting new subfields that are emerging within deep learning. So before we dive into that, some very important announcements. First and foremost, we have amazing t-shirts for you that have arrived, and we'll be distributing them today after lecture. So we'll have them arranged sort of in the front by size. And we'd first like to distribute to for-credit registered students, then to registered listeners, and then to all other guests and listeners. We should have plenty for everyone, so there should be more than enough. We also have some shirts from Google, as well as some other swag, so it's going to be really exciting. And please do stick around for that. So sort of where have we been and where are we going? So following this lecture, we have our final lab, which is going to be on reinforcement learning. And then tomorrow, we have two extremely exciting guest lectures from Google and IBM, as well as time for you to work on your final projects during the lab portion. And on Friday, we'll have one final guest lecture from NVIDIA, the project pitch competition, as well as the judging and awards ceremony. So we've received a lot of inquiries about the final projects and specific logistics, so I'd like to take a few minutes to recap that. So for those of you who are taking the course for credit, right, you have two options to fulfill your final requirement. The first is a project proposal, and here we're asking you to really pitch a novel deep learning architecture, idea, or application, and we've gotten a lot of questions about the size of the groups.
So groups of one are welcome, but you will not be eligible to receive a prize if you are a group of one. Listeners are welcome to present ideas and join groups. In order to be eligible for a prize, you must have a group of two to four people, and your group must include at least one for-credit registered student. We're going to give you three minutes to do your pitches, and there are pretty detailed instructions for the proposal at this link. So last year we only gave people one minute for their pitch, which was really, really short. So by giving you three minutes, we're really hoping that you have spent some time to think in depth about your idea, how it could work, why it's compelling. And we're going to have a panel of judges, including judges from industry, as well as Alexander, myself, and other guest judges. And our prizes are these three NVIDIA GPUs as well as a set of four Google Homes. Sort of how the prize awarding is going to go is that the top three teams will be awarded a GPU per team, and the Google Homes will be distributed within one team. So if you have a team that has four people, right, and you're awarded the Google Home prize, everyone will get a Google Home. If you have two, each of you will get one, and then the remaining two will be awarded to the next best team. Okay, so in terms of actually completing this, we ask that you prepare slides for your pitch on Google Slides. So on Friday, if you're participating, you'll come down to the front, present your pitch to the rest of the class and to our judges. We ask that you please submit your groups by tonight at 10 p.m. This link leads to a Google Sheet where you can sign up with your team members' names and a tentative title for your project. Tomorrow's lab portion will be completely devoted to in-class work on the project.
Then we ask that you submit your slides by midnight Thursday night so that we have everything ready and in order for Friday, and the link for doing that is there. And finally our presentations will be on Friday. So as was discussed in the first lecture, the second, arguably more boring option, is to write a one-page review of a recent deep learning paper. It can be either on deep learning fundamentals and theory or an interesting application of deep learning to a different domain that you may be interested in. And this would be due Friday at the beginning of class by email to the Intro to Deep Learning staff address. OK, so tomorrow we're going to have two guest speakers. The first, and we're really, really lucky and privileged to have her, is Fernanda Viégas. She is an MIT alum, and she's the co-director of Google's People and Artificial Intelligence Research Lab, or PAIR. And she's a world-class specialist on visualization techniques for machine learning and deep learning. So it should be a really fun, interactive, cool talk. And we really hope to see everyone there. The second talk will be given by Dmitry, or Dima, Krotov, from the MIT-IBM Watson AI Lab. He's a physicist by training, a really exciting and fun guy.

Deep Learning And Reinforcement Learning

Brain-bailout loops (06:21)

And his research focuses on biologically plausible algorithms for training neural networks. So he'll give some insight into whether or not backpropagation could actually be biologically plausible. And if not, what are some exciting new ideas about how learning could actually work in the neuroscientific sense, I guess. Then the lab portion is going to be devoted to work on the final projects. We'll be here. The TAs will be here. You can brainstorm with us, ask us questions, work with your team. You get the point. Finally, Friday, we'll have our final guest lecture given by Jan Kautz from NVIDIA, who's a leader in computer vision. And then we'll have the project proposal competition, the awards, as well as a pizza celebration at the end.

Reinforcement learning (07:16)

Okay, so that is sort of the administrivia for today and ideally the rest of the class. So now let's start with the technical content. So on day one, Alexander showed this slide, which sort of summarized how deep learning has revolutionized so many different research areas, from autonomous vehicles to medicine and health care, reinforcement learning, generative modeling, robotics, and the list goes on and on. And hopefully now, through this series of five lectures, you have a more concrete understanding of how and why deep learning is so well suited for these kinds of really complex tasks and how it's enabled these advances across a multitude of disciplines.

Deep networks data inception (07:59)

And also, so far we've primarily dealt with algorithms that take as input some set of data in the form of signals, sequences, images, or other sensory data to directly produce a decision as an output, whether that's a prediction or an action, as in the case of reinforcement learning. And we've also seen ways in which we can go from decision to data in the context of generative modeling, sampling brand-new data from the decision space in this probabilistic setting. More generally, in all these cases, we've really been dealing with algorithms that are optimized to do well on a single task, right, but fail to think like humans or operate like humans, sort of at a higher-order level of intelligence. And to understand this in more detail, we have to go back to a famous theorem in the theory of neural networks, which was presented in 1989 and generated quite the stir.

Universal Approximation Theorem (09:14)

And this is called the universal approximation theorem. And basically what it states is that a neural network with a single hidden layer is sufficient to approximate any arbitrary function, any continuous function. And in this class, we've mostly been talking about deep models that use multiple layers. But this theorem completely ignores this and says, you just need one neural layer. And if you believe that any problem can be reduced to a set of inputs and an output, this means that there exists a neural network to solve any problem in the world, so long as you can define it using some continuous function. And this may seem like an incredibly powerful result, but if you look closely, there are two really big caveats to this. First, this theorem makes no guarantee on the number of hidden units or the size of the hidden layer that's going to be required to solve your arbitrary problem. And additionally, it leaves open this question of, how do we actually go about finding the weights to support whatever architecture could be used to solve this problem? It just claims, and actually proves, that such an architecture exists. But as we know from gradient descent and this idea of finding weights in a non-convex landscape, there's no guarantee that this process of learning these weights would be in any way straightforward. And finally, this theorem doesn't provide any guarantees that whatever model is learned would generalize well to other tasks. And this theorem is a perfect example of the possible effects of over-hype in AI. And as a community, I think we're all interested in the state of deep learning and how we can use it. That's probably a big motivation for why you're sitting in this lecture today. But I think we really need to be extremely careful in terms of how we market and advertise these algorithms.
So while the universal approximation theorem generated a lot of excitement when it first came out, it also provided some false hope to the AI community that neural networks, as they existed at that time, could solve any problem in the world. And this overhype is extremely dangerous. And historically, there have actually been two, quote unquote, AI winters, where research in AI and neural networks specifically in the second AI winter came to sort of a grinding halt. And so this is why for the first portion of this lecture, I'd like to focus on some of the limitations of these algorithms that we've learned about so far, but also to take it a step further to touch on some really exciting new research that's looking to address these problems and limitations.
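To make the theorem and its first caveat a bit more concrete, here is a minimal sketch of a one-hidden-layer network built by hand rather than trained. The choices here are assumptions made for illustration and are not from the lecture: the ReLU activation, the one-dimensional target function sin(2πx), the interval [0, 1], and the helper names `build_one_hidden_layer` and `max_error`. The construction places one hidden unit at each knot so that the network linearly interpolates the target; it only shows that suitable weights exist, and says nothing about whether gradient descent would find them or how the network behaves on other tasks.

```python
import math


def relu(z):
    return z if z > 0.0 else 0.0


def build_one_hidden_layer(f, num_hidden, lo=0.0, hi=1.0):
    """Construct (not train) a single-hidden-layer ReLU network that matches
    f at num_hidden + 1 evenly spaced knots in [lo, hi]."""
    step = (hi - lo) / num_hidden
    knots = [lo + i * step for i in range(num_hidden)]
    # Slope of the target on each segment between neighbouring knots.
    slopes = [(f(k + step) - f(k)) / step for k in knots]
    # Output weight of hidden unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [
        slopes[i] - slopes[i - 1] for i in range(1, num_hidden)
    ]
    bias = f(lo)

    def network(x):
        # Hidden unit i computes relu(x - knot_i); the output is a weighted sum.
        return bias + sum(w * relu(x - k) for w, k in zip(out_weights, knots))

    return network


def max_error(f, network, lo=0.0, hi=1.0, samples=1000):
    """Largest absolute gap between f and the network on a dense grid."""
    grid = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - network(x)) for x in grid)


def target(x):
    # Illustrative continuous function; any continuous f could be used.
    return math.sin(2 * math.pi * x)


# Approximation error for increasingly wide hidden layers.
errors = {n: max_error(target, build_one_hidden_layer(target, n)) for n in (4, 16, 64)}
```

In this sketch the error shrinks only as the hidden layer gets wider, which is the point of the first caveat: the theorem promises that some width is enough but does not say how large that width has to be.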

points (randomly sampled labels) (12:25)

So first, let's talk about limitations of deep learning. And one of my favorite examples of a potential danger of deep neural networks comes from this paper from Google Brain that was entitled Understanding Deep Learning Requires Rethinking Generalization. And this paper did something really simple but very powerful. They took images from the ImageNet dataset, and examples of those images and their labels are shown here. And what they did is that for every image in their dataset, they rolled a die, a K-sided die, where K is the number of possible classes that they were trying to consider in a classification problem. And they used the result of this die roll to assign a brand new, randomly sampled label to that image. And this means that these new labels associated with each image were completely random with respect to what was actually present in the image. And you'll notice that these two examples of dogs ended up, in this demonstration that I'm showing, being mapped to different classes altogether. So we're literally trying to randomize our labels entirely. Then what they did was that they tried to fit a deep neural network model to the sampled ImageNet data, ranging from the untouched original data with the original labels to data whose labels they had reassigned using this completely random sampling approach. And then they tested the accuracy of their model on a test dataset. And as you may expect, the accuracy of their models progressively decreased as the randomness in the training data set increased. But what was really interesting was what happened when they looked at performance on the training data set. And this is what they found, that no matter how much they randomized the labels, the model was able to get 100% accuracy on the training set. Because in training, you're doing input, label. You know both.
And this is a really powerful example, because it shows once again, in a similar way as the universal approximation theorem, that deep neural nets can perfectly fit to any function, even if that function is based on entirely random labels.
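The label randomization step described above can be sketched in a few lines. This is only an illustration of the die-roll procedure, not the paper's code: the function name `randomize_labels`, the `corruption` parameter (the fraction of labels that get re-rolled), the fixed seed, and the toy list of 1,000 labels over 10 classes are all assumptions.

```python
import random


def randomize_labels(labels, num_classes, corruption, seed=0):
    """Return a copy of labels in which each entry is, with probability
    `corruption`, replaced by the outcome of a fair K-sided die roll."""
    rng = random.Random(seed)
    new_labels = []
    for label in labels:
        if rng.random() < corruption:
            # The die roll ignores the image content and the original label.
            new_labels.append(rng.randrange(num_classes))
        else:
            new_labels.append(label)
    return new_labels


# Hypothetical toy dataset: 1,000 examples spread evenly over 10 classes.
original = [i % 10 for i in range(1000)]
fully_random = randomize_labels(original, num_classes=10, corruption=1.0)

# Fraction of examples whose random label happens to equal the true one.
agreement = sum(a == b for a, b in zip(original, fully_random)) / len(original)
```

With every label re-rolled, the new labels agree with the originals only by chance, so a model that still reaches perfect training accuracy on them can only be memorizing the pairs it was shown.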

randomization axis (14:59)

And to drive this point home, we can understand neural networks simply as function approximators. And all the universal approximation theorem states is that neural networks are really, really good at doing this. So suppose you have this set of training data. We can use a neural network to learn a maximum likelihood estimate of this training data. And if we were to give the model a new data point, shown here by this purple arrow, we can use it to predict what the maximum likelihood estimate for that data point is going to be. But if we extend the axis a bit left and right outside of the space of the training data that the network has seen, what happens?

Adversarial (16:01)

There are no guarantees on what the data will look like outside these bounds. And this is a huge limitation that exists in modern deep neural networks and in deep learning generally. And so if you look here, outside of these bounds that the network has been trained on, we can't really know what our function looks like if the network has never seen data from those regions before. So it's not going to do very well. And this notion leads really nicely into this idea of what's known as adversarial attacks on neural networks. And the idea here is to take some example. For example, this image of what you can see is a temple, which a standard CNN trained on ImageNet, let's say, can classify as a temple with 97% probability. And then we can apply some perturbation to that image to generate what we call an adversarial example, which to us looks essentially identical to the original image, right? But if we were now to feed this adversarial example through that same CNN, it no longer recognizes it as a temple. Instead it predicts, OK, this is an image of an ostrich. It makes no sense. So what's going on? What is it about these perturbations, and how are we generating them, that we're able to fool the network in this way? So remember that normally during training, when we train our network using gradient descent, we have some objective loss function, J, that we're trying to optimize given a set of weights, theta, input data, x, and some output label, y. And what we're asking is, how does a small shift in the weights change our loss? Specifically, how can we change our weights, theta, in some way to minimize this loss? And when we train our networks to optimize this set of weights, we're using a fixed input x and a fixed label y. And we're, again, reiterating, trying to update our weights to minimize that loss.
With adversarial attacks, we're asking a different question. How can we modify our input, for example, an image, our input x, in order to now increase the error in our network's prediction? So we're trying to optimize over the input x, right, to perturb it in some way, given a fixed set of weights, theta, and a fixed output, y. And instead of minimizing the loss, we're now trying to increase the loss to try to fool our network into making incorrect predictions. And an extension of this idea was recently presented by a group of students here at MIT. And they devised an algorithm for synthesizing a set of examples that would be adversarial over a diverse set of transformations like rotations or color changes. And so the first thing that they demonstrated was that they were able to generate 2D images that were robust to noise, distortions, and other transformations. But what was really cool was that they actually showed that they could extend this idea to 3D objects. And they actually used 3D printing to create actual physical adversarial objects. And this was the first demonstration of adversarial examples that exist in the physical world. So what they did in this result shown here is that they 3D printed a set of turtles that were designed to be adversarial to a given network and took images of those turtles and fed them in through the network. And in the majority of cases, the network classifies these 3D turtles as rifles. And these objects are designed to be adversarial. They're designed to fool the network. So this is pretty scary. And it opens a whole Pandora's box of how we can trick networks, and it has some pretty severe implications for things like security. And so these are just a couple of limitations of neural networks that I've highlighted here. As we've sort of touched on throughout this course, they're very data hungry. It's computationally intensive to train them. They can be fooled by adversarial examples.
They can be subject to algorithmic bias. They're relatively poor at representing uncertainty.
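The adversarial attack described in this section, holding the weights theta and the label y fixed and changing the input x to increase the loss J, can be sketched on a very small model. Everything specific here is an assumption made for illustration: the model is a three-feature logistic regression rather than a CNN, the weights and input are made-up numbers, and the update is a single signed-gradient step (often called the fast gradient sign method), which the lecture does not name. The step size `epsilon` is deliberately large because the toy input has only three dimensions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, bias, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(theta, x)) + bias)


def loss(theta, bias, x, y):
    """Cross-entropy loss J(theta, x, y) for a single example."""
    p = predict(theta, bias, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def input_gradient(theta, bias, x, y):
    """Gradient of the loss with respect to the input x (weights held fixed).
    For sigmoid plus cross-entropy this is (p - y) * theta."""
    p = predict(theta, bias, x)
    return [(p - y) * w for w in theta]


def sign(v):
    return (v > 0) - (v < 0)


def perturb(theta, bias, x, y, epsilon):
    """Move each input feature by epsilon in the direction that raises the loss."""
    grad = input_gradient(theta, bias, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical fixed model and a correctly classified input with label 1.
theta = [2.0, -3.0, 1.0]
bias = 0.5
x = [1.0, -0.5, 0.2]
y = 1

x_adv = perturb(theta, bias, x, y, epsilon=0.8)
```

Comparing `predict(theta, bias, x)` with `predict(theta, bias, x_adv)` shows the confidence in the true class dropping while no feature moves by more than `epsilon`. For images, the same kind of step is spread across many thousands of pixels, which is one common explanation for why a perturbation too small to see can still change the prediction.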

Interpretability (21:19)

Then a big point is this question of interpretability, right? Are neural networks just black boxes that you can't peer into? And in the ML and AI community, people tend to fall into two camps. One camp says interpretability of neural networks matters a lot, and it's something that we should devote a lot of energy and thought to. And others very strongly argue that, oh, no, we should not really concern ourselves with interpretability. What's more important is building these architectures that perform really, really well on a task of interest. And in going from limitations to new frontiers and emerging areas in deep learning research, I'd like to focus on these two sets of points highlighted here. The first is the notion of understanding uncertainty, and the second is ways in which we can move past building models that are optimized for a single task to actually learning how to build a model capable of solving not one, but many different problems. So the first new frontier is this field called Bayesian deep learning. And so if we consider, again, the very simple problem of image classification, what we've learned so far has been about modeling probabilities over a fixed number of classes. So if we are to train a model to predict, you know, dogs versus cats, we output some probability that an input image is either a dog or a cat.

Uncertainty (22:58)

But I'd like to draw a distinction between a probability and this notion of uncertainty or confidence. So if we were to feed in an image of a horse into this network, for example, we would still output a probability of being dog or cat, because probabilities need to sum to one. But even if the model is saying that this image is more likely one class than the other, it may be more uncertain in terms of its confidence in that prediction. And there's this whole field of Bayesian deep learning that looks at modeling and understanding uncertainty in deep neural networks. And this gets into a lot of statistics, but the key idea is that Bayesian neural networks, rather than learning a set of weights, are trying to learn a distribution over the possible weights, given some input data x and some output labels y. And to actually parameterize this problem, they use Bayes' rule, which is a fundamental law from probability theory. But in practice, this so-called posterior distribution, the likelihood of a set of weights given input and output, is computationally intractable. And so we can't learn this distribution directly. What we can do instead is find ways to approximate this posterior distribution through different types of sampling operations. And one example of such a sampling approach is to use the principle of dropout, which was introduced in the first lecture, to actually obtain an estimate of the network's uncertainty. So if we look at what this may look like for a network that's composed of convolutional layers consisting of two-dimensional feature maps, we can use dropout to estimate uncertainty by performing stochastic passes through the network. And each time we make a pass through the network, we sample each of these sets of weights, these filter maps, according to some dropout mask.
These are either 0s or 1s, meaning we'll keep these weights highlighted in blue, and we'll discard these weights highlighted in white to generate this stochastic sample of our original filters. And from these passes, what we can actually obtain is an estimate of the expected value of the output labels given the input, the mean, as well as this variance term, which provides an uncertainty estimate. And this is useful in understanding the uncertainty of the model in making a prediction. And one application of this type of approach is shown here in the context of depth estimation. So given some input image, we train a network to predict the depth of the pixels present in that image.
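The stochastic-pass procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the setup from the slides: the network is a tiny fully connected model with hand-picked weights rather than a convolutional one, the 0/1 mask is applied per hidden unit (which drops all weights feeding out of that unit), kept activations are rescaled by `1 / keep_prob`, and the names `forward_with_dropout` and `mc_dropout_predict` are hypothetical.

```python
import random
import statistics


def forward_with_dropout(x, hidden_weights, output_weights, keep_prob, rng):
    """One stochastic pass: sample a 0/1 dropout mask over the hidden units."""
    hidden = []
    for row in hidden_weights:
        activation = max(0.0, sum(w * xi for w, xi in zip(row, x)))  # ReLU
        mask = 1.0 if rng.random() < keep_prob else 0.0
        # Rescale kept units so the expected activation is unchanged.
        hidden.append(mask * activation / keep_prob)
    return sum(w * h for w, h in zip(output_weights, hidden))


def mc_dropout_predict(x, hidden_weights, output_weights,
                       keep_prob=0.5, num_passes=200, seed=0):
    """Run several stochastic passes and summarise them: the mean is the
    prediction, the variance is used as the uncertainty estimate."""
    rng = random.Random(seed)
    outputs = [
        forward_with_dropout(x, hidden_weights, output_weights, keep_prob, rng)
        for _ in range(num_passes)
    ]
    return statistics.mean(outputs), statistics.pvariance(outputs)


# Hypothetical fixed weights for a 2-input, 4-hidden-unit, 1-output network.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.7, 0.7]]
output_weights = [1.0, -1.0, 0.5, 0.25]
x = [1.0, 2.0]

mean, variance = mc_dropout_predict(x, hidden_weights, output_weights)
```

Setting `keep_prob=1.0` switches the masks off, so every pass returns the same value and the variance collapses to zero. The spread across passes is what carries the uncertainty information in this kind of estimate.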

Child Networks (26:43)

And alongside that depth prediction, we also obtain an estimate of the uncertainty in making that prediction. And when we visualize that, what you can see is that the model is more uncertain along this edge here, which makes sense. If you look back at this original input, that edge is at the point where those two cars are overlapping. And so you can imagine that the model may have more difficulty in estimating the depth of the pixels that line that edge. Furthermore, if you remember from yesterday's lecture, I showed this video, which is work from the same group at Cambridge, where they trained a convolutional neural network-based architecture on three tasks simultaneously: semantic segmentation, depth estimation, and instance segmentation. And what we really focused on yesterday was how this segmentation result was much crisper and cleaner than this group's previous result from one year prior. But what we didn't talk about was how they're actually achieving this improvement. And what they're doing is they're using uncertainty. By training their network on these three different tasks simultaneously, what they're able to achieve is to use the uncertainty estimates from two tasks to improve the accuracy of the third task. And this is used to regularize the network and improve its generalization in one domain such as segmentation. And this is just another example of the results. As you can see, right, each of these, semantic segmentation, instance segmentation, and depth estimation, seems pretty crisp and clean when compared to the input image. So the second exciting area of new research that I'd like to highlight is this idea of learning to learn. And to understand why this may be useful and why you may want to build out algorithms that can learn to learn, right, we'd first like to reiterate that most neural networks today are optimized for a single task.
And as models get more and more complex, they increasingly require expert knowledge in terms of engineering them and building them and deploying them. And hopefully you've gotten a taste of that knowledge through this course. So this can be kind of a bottleneck, right? Because there are so many different settings where deep learning may be useful, but only so many deep learning researchers and engineers, right? So why can't we build a learning algorithm that actually learns which model is most well-suited for an arbitrary set of data and an arbitrary task?

Learning how to create (29:50)

And Google asked this question a few years ago, and it turns out that we can do this. And this is the idea behind this concept of AutoML, which stands for sort of automatic machine learning, automatically learning how to create new machine learning models for a particular problem. And the original AutoML, which was proposed by Google, uses a reinforcement learning framework. And how it works is the following.

ML, Learning ML (30:26)

They have sort of this agent environment structure that Alexander introduced, where they have a first network, the controller, which in this case is an RNN, that proposes a child model architecture, right, in terms of the parameters of that model, which can then be trained and evaluated for its performance on a particular task. And feedback on how well that child model does on your task of interest is then used to inform the controller on how to improve its proposals for the next round in terms of, okay, what is the updated child network that I'm going to propose? And this process is repeated thousands of times, iteratively generating new architectures, testing them, and giving that feedback back to the controller to learn from. And eventually, the controller is going to learn to assign high probability to areas of the architecture space that achieve better accuracy on that desired task, and low probability to those architectures that don't perform well. So how does this agent work? As I mentioned, it's an RNN controller that at the macro scale considers different layers in the proposed generated network, and at the internal level of each candidate layer, it predicts what are known as hyperparameters that define the architecture of that layer. So for example, if we're trying to generate a child CNN, we may want to predict the number of different filters of a layer, the dimensionality of those filters, and the stride with which we're going to slide our filter patch during the convolution operation, all parameters associated with convolutional layers. So then if we consider the other network in this picture, the child network, what is it doing? To reiterate, this is a network that's generated by another neural network. That's why it's called the child, right?

RNNs And Intelligence

RNNs for Parent nets (32:55)

And what we can do is we can take this child network that's sampled from the RNN, train it on a desired task with the desired data set, and evaluate its accuracy. And after we do this, we can then go back to our RNN controller, update it, right, based on how the child network performed after training. And now the RNN parent can learn to create an even better child model, right? So this is a really powerful idea. And what does it mean for us in practice? Well, Google has now put this service on the cloud, Google being Google, right? So that you can go in, provide the AutoML system a data set and a set of metrics that you want it to optimize over. And they will use parent RNN controllers to generate a candidate child network that's designed to train optimally on your data set for your task, right. And this end result is this new child network that it gives back to you, spawned from this RNN controller, which you can then go and deploy on your data set, right. This is a pretty big deal, right, and it sort of gets at sort of this deeper question, right. They've demonstrated that we can create these AI systems that can generate new AI specifically designed to solve desired tasks. And this significantly reduces the difficulties that machine learning engineers face in terms of optimizing a network architecture for a different task.
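The propose, evaluate, and update loop described above can be sketched in miniature. This is a heavily simplified illustration and not Google's AutoML system: the controller here is just a table of logits per hyperparameter rather than an RNN, the update is a basic REINFORCE-style policy gradient with a moving-average baseline, and `toy_child_accuracy` is a hypothetical stand-in for actually training the child network and measuring its accuracy. The search space values are also made up.

```python
import math
import random

# Hypothetical search space for one convolutional layer of the child network.
SEARCH_SPACE = {
    "num_filters": [16, 32, 64],
    "filter_size": [1, 3, 5],
    "stride": [1, 2],
}


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Controller proposes a child: one sampled option index per hyperparameter."""
    choice = {}
    for name, options in SEARCH_SPACE.items():
        probs = softmax(logits[name])
        choice[name] = rng.choices(list(range(len(options))), weights=probs)[0]
    return choice


def toy_child_accuracy(choice):
    """Stand-in reward (an assumption): pretends that 64 filters of size 3
    with stride 1 is the best child. A real system would train the child."""
    arch = {name: SEARCH_SPACE[name][i] for name, i in choice.items()}
    score = 0.5
    score += 0.2 if arch["num_filters"] == 64 else 0.0
    score += 0.2 if arch["filter_size"] == 3 else 0.0
    score += 0.1 if arch["stride"] == 1 else 0.0
    return score


def train_controller(num_rounds=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    logits = {name: [0.0] * len(opts) for name, opts in SEARCH_SPACE.items()}
    baseline = 0.0
    for _ in range(num_rounds):
        choice = sample_architecture(logits, rng)
        reward = toy_child_accuracy(choice)        # feedback from the child
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline
        for name, chosen in choice.items():
            probs = softmax(logits[name])
            for i, p in enumerate(probs):
                # Gradient of log-probability of the chosen option w.r.t. logit i.
                grad = (1.0 if i == chosen else 0.0) - p
                logits[name][i] += learning_rate * advantage * grad
    return logits


def most_likely_architecture(logits):
    return {
        name: SEARCH_SPACE[name][max(range(len(v)), key=v.__getitem__)]
        for name, v in logits.items()
    }


best = most_likely_architecture(train_controller())
```

The toy reward is designed so that the best child uses 64 filters of size 3 with stride 1, which makes it easy to check whether the controller's probabilities drift toward the better-scoring region of the architecture space. In a real search each reward would require training a full child network, which is why the process is so computationally expensive.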

What does it mean to be intelligent? (34:39)

And this gets at the heart of the question that Alexander posed at the beginning of this course, this notion of generalized artificial intelligence. And we spoke a bit about what it means to be intelligent, loosely speaking, in the sense of taking in information and using it to inform a future decision. And as humans, our learning pipeline is not restricted to solving only specific defined tasks. How we learn one task can impact what we do on something completely unrelated, completely separate. And in order to reach that same level with AI, we really need to build systems that can not only learn single tasks, but can improve their own learning and their reasoning so as to be able to generalize well to sets of related and dependent tasks. So I'll leave you with this thought, and I encourage you to continue to discuss and think about these ideas amongst yourselves and internally through introspection. And also we're happy to chat, and I think I can speak for the TAs in saying that they're happy to chat as well.


Outro (35:52)

So that concludes, you know, the series of lectures from Alexander and me. And we'll have our three guest lectures over the next couple of days. And then we'll have the final lab on reinforcement learning. Thank you.
