MIT 6.S191 (2020): Deep Learning New Frontiers

Transcription for the video titled "MIT 6.S191 (2020): Deep Learning New Frontiers".



Opening Remarks

Introduction (00:00)

Okay, so as Alexander mentioned, in this final lecture, I'll be discussing some of the limitations of deep learning. And, you know, as with any technology, not just deep learning or other areas of computer science, it's important to be mindful not only of what that technology can enable, but also of some of the caveats and limitations when considering such approaches. And then we'll move on to discuss some of the new research directions that are specifically being taken to address some of those limitations. Before we dive into the technical content, we have some important logistical and course-related announcements, which I think will be very relevant to most of you.


Main Content: Discussion On Deep Learning, Uncertainty And Automl

Course logistics (00:58)

First and foremost, we have class t-shirts, and they've arrived. We'll be distributing them today at the end of the lecture portion of the class, and at that time we'll take a little bit of time to discuss the logistics of how you can come and receive a t-shirt for your participation in the course. To check in quickly on where we are in the course: this is going to be the last lecture given by Alexander and myself, and tomorrow and Friday we will have a series of guest lectures from leading researchers in industry. Today we'll also have our final lab, on reinforcement learning. Thank you to everyone who has been submitting entries for the lab competitions; the deadline for doing so is tomorrow at 5 p.m., and that's for labs 1, 2, and 3, so if you're interested, please email us with your entries. And Friday will be our final guest lectures and project presentations.

For those of you who are taking the course for credit, as was mentioned on day one, you have two options to fulfill your credit requirement. We've received some questions about the logistics, so I'd like to go through them briefly here. The first option is to present in the project proposal competition. You can present as an individual or in a group of one to four people, and in order to be eligible for a prize, you must have at least one registered MIT student in your group. We recognize that one week is an extremely short period of time to implement a new deep learning approach, so we won't necessarily be judging you on your results, although results will certainly help you in the competition, but rather on the novelty, the potential impact, and the quality of your presentation. These presentations are going to be really short, three minutes each, and we're going to hold you to that three-minute window as strictly as we can. There's a link on this slide, which you can find in the PDF version, that will take you to a document where the instructions for the final project are laid out, including the details for group submission and slide submission, along with additional links for the final project proposal. The second option to fulfill the credit requirement is a short one-page review of a recent deep learning paper, and this is going to be due on the last day of class, by Friday at 1 p.m., via email to us.


Upcoming guest lectures (03:59)

Okay, so tomorrow we're going to have two guest speakers. We're going to have David Cox from IBM, who is the director of the MIT-IBM Watson AI Lab, come and speak. And we're also going to have Animesh Garg, who's a professor at the University of Toronto and a research scientist at NVIDIA, and he's going to speak about robotics and robot learning. The lab portion of tomorrow's class will be dedicated to open office hours, where you can work with your project partners on the final project, continue work on the labs, or come and ask us and the TAs any further questions. On Friday we're going to have two additional guest speakers. Chuan Li, who is the chief scientific officer at Lambda Labs, a company that builds new hardware for deep learning, is going to speak about some of the research that they're doing. And then we're going to have an exciting talk from the Google Brain team on how we can use machine learning to understand the scent and smell properties of small molecules. And importantly, Friday will be our project proposals and our awards ceremony, so if you have submitted entries for the lab competitions, that is when you would be awarded prizes. We really encourage you to attend Thursday's and Friday's lectures and classes in order to be eligible to receive the prizes.


Deep learning and expressivity of NNs (05:35)

OK, so now to get into the technical content for this last lecture from Alexander and me. Hopefully over the course of the past lectures you've seen a bit about how deep learning has enabled such tremendous applications in a variety of fields, from autonomous vehicles to medicine and healthcare, to the advances in reinforcement learning that we just heard about, generative approaches, robotics, and a whole host of other applications and areas of impact like natural language processing, finance, and security. And along with this, hopefully you've also established a more concrete understanding of how these neural networks actually work. Largely we've been dealing with algorithms that take as input data in some form, you know, as signals, as images, or other sensory data, to directly produce a decision or a prediction at the output. And we've also seen ways in which these algorithms can be used in the opposite direction, to generatively sample from them to create brand new instances and data examples. But really what we've been talking about is algorithms that are very well optimized to perform at a single task, but that fail to generalize and go beyond that to achieve a higher-order level of power.

And I think one really good way to understand this limitation is to go back to a fundamental theorem about the capabilities of neural networks. This theorem was presented in 1989, and it generated quite the stir: it's called the universal approximation theorem. What it states is that a feedforward neural network with a single hidden layer could be sufficient to approximate any function. We've seen deep learning models that use multiple hidden layers, and this theorem completely ignores that, saying you just need one hidden layer: if you believe that any problem can be reduced to a functional mapping between inputs and outputs, you can build a neural network that approximates it. And while you may think that this is an incredibly powerful statement, if you look closely there are a few key limitations and considerations that we need to keep in mind. First, this theorem makes no guarantees about the number of hidden units or the size of the hidden layer that would be required to make this approximation. It also leaves open the question of how you actually go about finding those weights and optimizing the network for that task; it just proves that such a network theoretically exists. And as we know from gradient descent, this optimization can actually be really tricky and difficult in practice. Finally, there are no guarantees about how well such a network would generalize to other related tasks.

This theorem is sort of a perfect example of the possible effects of overhype of deep learning and artificial intelligence more broadly. And as a community, and now you all are part of that community, that's interested in advancing the state of deep learning, I believe that we need to be really careful about how we market and advertise these algorithms. While the universal approximation theorem generated a lot of excitement when it first came out, it also in some sense provided a degree of false hope to the AI community that neural nets could be used to solve any problem. And this hype can be very dangerous.
And when you look back at the history of AI and sort of the peaks and the falls of the literature, there have been these two AI winters where research in AI and neural networks specifically came to a halt and experienced a decline. And that's kind of motivating why for the rest of this lecture we want to discuss further some of the limitations of these approaches and how we could potentially move towards addressing them.
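Before moving on, here is a minimal sketch of what the universal approximation theorem does promise, assuming TensorFlow 2.x: a network with a single hidden layer fitting a simple one-dimensional function. Note that it only illustrates the claim, not the caveats: nothing here tells us how many hidden units are needed, guarantees that gradient descent finds good weights, or says anything about inputs outside the training range.

```python
# A single-hidden-layer network approximating a 1D function (sin).
# Minimal sketch, assuming TensorFlow 2.x; the hidden-layer size (64)
# is an arbitrary choice -- the theorem gives no guidance on it.
import numpy as np
import tensorflow as tf

x = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape(-1, 1).astype("float32")
y = np.sin(x)  # the function we want to approximate

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(1,)),  # one hidden layer
    tf.keras.layers.Dense(1),                                        # linear output
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=500, verbose=0)

print("final MSE on the training range:", model.evaluate(x, y, verbose=0))
```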


Generalization of deep models (10:02)

Okay, so what are some of those limitations? One of my favorite examples of a potential danger of deep neural networks comes from a paper from a couple of years ago titled Understanding Deep Learning Requires Rethinking Generalization. This was a paper from Google, and really what they did was quite simple. They took images from the huge image data set called ImageNet, where each image is annotated with a label. And for every image in the data set, they effectively rolled a K-sided die to randomly assign a new label to that image. So if you randomly choose the labels for these images, you could generate something like this, and what this means is that the new label associated with each image is completely random with respect to what is actually present in that image. So you might see two examples of a dog with two completely different labels, right; we're literally randomizing the labels entirely. After they did that, they tried to fit a deep neural network to this sampled data, ranging from the original untouched data to data where the labels were completely randomly assigned. And as you may expect, the accuracy of the resulting model on the test set progressively dropped toward chance, essentially zero, as they moved from the true labels to the random labels. But what was really interesting was what happened when they looked at the accuracy on the training set. They found that no matter how much they randomized the labels, the model was able to get 100% accuracy, or close to 100% accuracy, on the training set, meaning that it was essentially memorizing the data and their arbitrary labels. And this is a really powerful example because it shows, once again, in a similar way as the universal approximation theorem, that deep neural networks are very, very good at perfectly fitting, or very nearly perfectly fitting, any function, even if that function is a random mapping from data to labels.

To drive this point home even further, I think the best way to understand neural networks is as function approximators, and all the universal approximation theorem states is that neural networks are very good at doing exactly this. So, for example, if we have data visualized on a 2D grid, we can use a neural network to learn a function, a curve, that fits this data. And if we present it with a candidate point on the x-axis, it may be able to produce a very likely estimate of what the corresponding y-value would be. But what happens to the left and to the right, beyond the spread of the training data? How does the network perform there? Well, there are absolutely no guarantees on how the network behaves in these regions, regions the network has never seen data from before. And this is absolutely one of the most significant limitations that exists in modern deep learning. It raises the questions of what happens when we look at these places where the model has insufficient or no training data, and how we, as implementers and users of deep neural networks, can have a sense of when the model doesn't know, when it's not confident, when it's uncertain in making a prediction.
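Here is a minimal sketch of that randomized-label experiment, assuming TensorFlow 2.x. MNIST stands in for ImageNet and a small fully connected network stands in for their deeper models, just to keep the sketch small; the qualitative result, near-perfect training accuracy on meaningless labels, is the point.

```python
# Shuffling the labels destroys any relationship between images and
# targets, yet a sufficiently large network trained long enough can
# still drive the *training* accuracy toward 100%, while test accuracy
# falls to chance. Minimal sketch, assuming TensorFlow 2.x.
import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

y_random = np.random.permutation(y_train)  # labels now random w.r.t. images

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Memorizing random labels takes many epochs -- exactly the
# "perfect fitting" behavior the paper highlights.
model.fit(x_train, y_random, epochs=100, batch_size=128, verbose=0)
print("training accuracy:", model.evaluate(x_train, y_random, verbose=0)[1])
```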


Adversarial attacks (14:14)

And I think this notion leads very nicely into this other idea of adversarial attacks on neural networks. The idea here is to take some data instance, for example this image of a temple, which a standard CNN trained on image data can classify with very high accuracy, and then apply some perturbation to that image, such that when we take the result after that perturbation and feed it back into our neural network, it generates a completely nonsensical prediction, like ostrich, about what is actually in that image. And so this is maybe a little bit shocking. Why is it doing this? And how is this perturbation being created to fool the network in such a way? Remember that when we're training our networks, we use gradient descent. What that means is we have an objective function, J, that we're trying to optimize, and what specifically we're optimizing over is the set of weights, W, meaning we fix our data and our labels and iteratively adjust our weights to optimize this objective function. The way an adversarial example is created is by taking the opposite approach, where we now ask: how can we modify the input image, our data X, in order to increase the error in the network's prediction, to fool the network? So we're trying to perturb and adjust X in some way, by fixing the weights, fixing the labels, and iteratively changing X, to generate a robust adversarial attack.

And an extension of this was recently done by a group of students here at MIT, where they devised an algorithm for synthesizing adversarial examples that were robust to different transformations, like changing the shape, scaling, color changes, et cetera. What was really cool is they moved beyond the 2D setting to the 3D setting, where they actually 3D printed physical objects that were designed to fool a neural network. This was the first demonstration of adversarial examples that exist in the physical, 3D world. Here they 3D printed a bunch of these adversarial turtles, and when they fed images of these turtles to a neural network trained to classify them, the network incorrectly classified these adversarial examples as rifles rather than turtles.
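To make the "perturb the input, not the weights" idea concrete, here is a minimal sketch of a single-step, gradient-sign perturbation (the fast gradient sign method), assuming TensorFlow 2.x and an already-trained Keras classifier `model`; the MIT work described above uses a more elaborate, iterative procedure to make the attacks robust to transformations.

```python
# Craft an adversarial perturbation by ascending the loss gradient
# with respect to the *input*, keeping the trained weights fixed.
# Minimal sketch, assuming TensorFlow 2.x; `epsilon` is illustrative.
import tensorflow as tf

def adversarial_example(model, image, true_label, epsilon=0.01):
    """`image` is a batch in whatever shape `model` expects."""
    image = tf.convert_to_tensor(image)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(image)                 # differentiate w.r.t. the input, not W
        prediction = model(image)
        loss = loss_fn(true_label, prediction)
    gradient = tape.gradient(loss, image)
    # Step the input in the direction that *increases* the error;
    # in practice you would also clip back to the valid pixel range.
    return image + epsilon * tf.sign(gradient)
```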


Limitations summary (17:00)

And so this just gives you a taste of some of the limitations that exist for neural networks and deep learning. Other examples are listed here, including the fact that they can be subject to algorithmic bias, that they can be susceptible to these adversarial attacks, that they're extremely data hungry, and so on and so forth. Moving forward to the next half of this lecture, we're going to touch on three of these sets of limitations and how we can push research to address some of them. Specifically, we'll focus on how we can encode structure and prior domain knowledge into designing our network architecture; we'll talk about how we can represent uncertainty and understand when our model is uncertain or not confident in its predictions; and finally, how we can move past deep learning models that are built to solve a single problem and potentially move towards building models that are capable of addressing many different tasks.


Structure in deep learning (18:18)

So first, we'll talk about how we can encode structure and domain knowledge into designing deep neural networks. We've already seen an example of this in the case of convolutional neural networks, which are very well equipped to deal with spatial data and spatial information. If you consider a fully connected network as sort of the baseline, there's no sense of structure there: the nodes are connected to all other nodes, and you have these dense layers that are fully connected. But as we saw, CNNs can be very well suited to processing visual information and visual data because they have the structure of the convolution operation.

Recently, researchers have moved on to develop neural networks that are very well suited to handle another class of data: graphs. Graphs are an irregular data structure that encode very rich structural information, and that structural information is often important to the problem being considered. Some examples of data that can be well represented by a graph are social networks, internet traffic, problems that can be represented by state machines, patterns of human mobility or transport, small molecules and chemical structures, as well as biological networks. Since there is a whole class of problems that can be represented this way, an idea that arises is how we can extend neural networks to learn from and process the data present in these graph structures.

And this follows very nicely as an extension of convolutional neural networks. With convolutional neural networks, as we saw, we have a rectangular filter that slides across an image and applies a patch-wise convolution operation to that image. As we go across the entirety of the image, the idea is that we can apply this set of weights to extract particular local features present in the image, and different sets of weights extract different features. In graph convolutional networks the idea is very similar, where now, rather than processing a 2D matrix that represents an image, we're processing a graph. What graph convolutional networks use is a kernel of weights, a set of weights, and rather than sliding this set of weights across a 2D matrix, the weights are applied to each of the different nodes present in the graph. The network looks at a node and the neighbors of that node, goes across the entirety of the graph in this manner, and aggregates information about each node and its neighbors, encoding it into a high-level representation. So this is a very brief, high-level introduction to what a graph convolutional network is, and on Friday we'll hear from an expert in this domain, Alex Wiltschko from Google Brain, who will talk about how we can use graph convolutional networks to learn representations of small molecules.

Okay, and another class of data that we may encounter is not 2D data but rather 3D sets of points, and this is what is often referred to as a point cloud. It's basically an unordered cloud of points where there is some spatial dependence, and it represents the depth and our perception of the 3D world. Just like with images, you can perform classification or segmentation on this 3D point data. And it turns out that graph convolutional networks can also be extended to handle and analyze this point cloud data.
And the way this is done is by dynamically computing a graph from the point cloud, essentially creating a mesh that preserves the local depth and spatial structure present in the point cloud.
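To sketch the aggregation step just described, here is a minimal, illustrative graph convolution layer, assuming TensorFlow 2.x. Real graph networks use sparse operations and fancier normalizations, but the core idea, a shared weight matrix applied to each node together with its neighbors, is the same.

```python
# One graph convolution layer: each node aggregates features from
# itself and its neighbors via a normalized adjacency matrix, then a
# shared weight matrix W is applied at every node -- the graph
# analogue of sliding a convolutional kernel. Minimal sketch.
import numpy as np
import tensorflow as tf

def graph_conv(node_features, adjacency, weights):
    """node_features: [num_nodes, in_dim]; adjacency: [num_nodes, num_nodes]."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # include self-connections
    degree = a_hat.sum(axis=1, keepdims=True)
    a_norm = (a_hat / degree).astype("float32")     # simple row normalization
    aggregated = tf.matmul(a_norm, node_features)   # average node + neighbors
    return tf.nn.relu(tf.matmul(aggregated, weights))  # shared weights per node

# Toy example: 4 nodes in a chain, 3 input features, 8 output features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype="float32")
feats = tf.random.normal([4, 3])
w = tf.random.normal([3, 8])
print(graph_conv(feats, adj, w).shape)  # (4, 8)
```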


Uncertainty & bayesian deep learning (22:53)

Okay, so that gives you a taste of how different types of data and different network structures can be used to encode prior knowledge into our network. Another area that has garnered a lot of interest in recent years is this question of uncertainty: how do we know how confident a model is in its predictions? Let's consider a really simple classification example. What we've learned so far is that we can use a network to output a classification probability. So here we're training a network to classify images of cats versus images of dogs, and it's going to output the probability that a particular image is a cat or a dog. But what happens if we feed the network an image of a horse? It's still going to output a probability that the image is a cat or that it's a dog, and because probabilities have to sum to one, they will sum to one, right? And so this is a clear distinction between the prediction of the network and how confident the model is in that prediction: a probability is not a metric of confidence. In this case, you could imagine it would be desirable to have our network also give us a sense of how confident it is in that prediction. So maybe when it sees an image of a horse it says, okay, this is a dog with probability 0.8, but I'm not at all confident in this prediction that I just made.

One possible way to accomplish this is through Bayesian deep learning, and this is a really new and emerging field. To understand this, let's reiterate our learning problem. We're given some data X, and we're trying to learn an output Y, and we do that by learning a functional mapping, F, that's parameterized by a set of weights, W. In Bayesian neural networks, rather than directly learning the weights, the neural network approximates a posterior probability distribution over the weights given the data X and the labels Y. Bayesian neural networks are considered Bayesian because we can rewrite this posterior, P(W | X, Y), using Bayes' rule. But it turns out that actually computing this posterior distribution is computationally infeasible and intractable, so there have been different approaches for approximating it using sampling operations.

One example of how you can use sampling to approximate this posterior is by using dropout, a concept that we introduced in the first lecture, and in doing so you can actually obtain an estimate of the model's uncertainty. To think a little bit about how this may work, consider a convolutional network where we have sets of weights. What is done is that we perform different passes through the network, and each time a pass is made, the set of weights that is used is stochastically sampled. So here, these are our convolutional kernels, our sets of weights, and we apply a dropout mask to each filter, where some of the weights are dropped out to zero. As a result of taking an element-wise multiplication between the kernel and that mask, we generate resulting filters where some of the weights have been stochastically dropped out. And if we do this many times, say T times, we're going to obtain a different prediction from the model every time.
And by looking at the expected value of those predictions and the variance in those predictions, we can get a sense of how uncertain the model is. One application of this is in the context of depth estimation. The goal here is to take images and train a network to predict the depth of every pixel in the image, and then to also ask it to provide an uncertainty estimate associated with each prediction. What you can see in the image on the right is that there's a particular band, a hotspot of uncertainty, and it corresponds to the portion of the image where the two cars are overlapping, which kind of makes sense, right? You may not have as clear a sense of the depth in that region in particular.
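Here is a minimal sketch of this dropout-sampling procedure, often called Monte Carlo dropout, assuming TensorFlow 2.x and a Keras `model` that contains Dropout layers; the function name is illustrative. Passing training=True keeps dropout active at inference time, so each forward pass samples a different effective set of weights.

```python
# Run T stochastic forward passes with dropout left on, then summarize:
# the mean is the prediction, the variance is an uncertainty proxy.
# Minimal sketch, assuming TensorFlow 2.x.
import tensorflow as tf

def mc_dropout_predict(model, x, num_samples=20):
    """Stochastic forward passes through a model containing Dropout layers."""
    predictions = tf.stack(
        [model(x, training=True) for _ in range(num_samples)],  # dropout stays on
        axis=0)
    mean = tf.reduce_mean(predictions, axis=0)                # expected prediction
    variance = tf.math.reduce_variance(predictions, axis=0)   # spread across passes
    return mean, variance
```

High variance across the stochastic passes signals that the sampled models disagree, so the prediction is uncertain; low variance signals they agree.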


Deep evidential regression (28:09)

And so to conceptualize this a bit further, this is a general example of how you can ensemble different instances of models together to obtain estimates of uncertainty. Let's say we're working in the context of self-driving cars, and our task is, given an input image, to predict a steering wheel angle that will be used to control the car; that's mu, the mean. In order to estimate the uncertainty, we can take an ensemble of many different instances of a model like this, and in the case of dropout sampling, each model will have a different set of weights dropped out. From each model we're going to get a different estimate of the predicted steering wheel angle, right? We can aggregate many of these different estimates together, and they're going to lie along some distribution. To actually estimate the uncertainty, you can consider the variance, the spread, of these estimates: intuitively, if the different estimates are spread out really far to the left and to the right, the model is going to be more uncertain in its prediction, but if they're clustered very closely together, the model is more certain, more confident, in its prediction.

These estimates are actually being drawn from an underlying distribution, and what ensembling is trying to do is to sample from this underlying distribution. But it turns out that we can approximate and model this distribution directly using a neural network, and this means that we're learning what is called an evidential distribution. Effectively, the evidential distribution captures how much evidence the model has in support of a prediction. The way we can train these evidential networks is by trying to maximize the fit of the inferred distribution to the data while also minimizing the evidence the model has in the cases where it makes errors. If you train a network using this approach, you can generate calibrated, accurate estimates of uncertainty for every prediction that the network makes.

So, for example, suppose we were to train a regression model where in the white regions the model has training data, and in the gray regions it has none. As you can see, a deterministic regression model fits the white region very well, but in the gray regions it doesn't do so well, because it hasn't seen data from those regions before. Now, we don't really care too much about how well the model does on those regions; what would be more important is if the model could tell us, oh, I'm uncertain about my prediction in this region because I haven't seen the data before. By using an evidential distribution, our network actually generates estimates of uncertainty that grow as the model has less and less data, less and less evidence. These uncertainties are also robust to adversarial perturbations, similar to those that we saw previously: in fact, if an input, say an image, is increasingly adversarially perturbed, the estimates of uncertainty will also increase as the degree of perturbation increases. This example shows depth estimation, where the more the input is perturbed, the more the associated uncertainty of the network's prediction increases.
And I won't spend too much time on this, but uncertainty estimation can also be integrated into tasks beyond depth estimation or regression, such as semantic and instance segmentation. This was work done a couple of years ago where they actually used estimates of uncertainty to improve the quality of the segmentations and depth estimates that they made. What they showed was that, compared to a baseline model without any estimates of uncertainty, they could actually use these metrics to improve the performance of their model at segmentation and depth estimation.
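As a rough illustration of the evidential idea, here is a minimal sketch of an evidential output head for regression, assuming TensorFlow 2.x and following the Normal-Inverse-Gamma parameterization used in the deep evidential regression literature. The class name and activation choices here are illustrative, and the training loss that balances fitting the data against penalizing evidence on errors is omitted.

```python
# An evidential regression head: the network predicts four values per
# target -- mu, nu, alpha, beta -- parameterizing a Normal-Inverse-Gamma
# distribution, and uncertainty falls out in closed form, with no
# sampling or ensembling at inference time. Minimal sketch.
import tensorflow as tf

class EvidentialHead(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(4)  # (mu, nu, alpha, beta) per target

    def call(self, features):
        mu, log_nu, log_alpha, log_beta = tf.split(self.dense(features), 4, axis=-1)
        nu = tf.nn.softplus(log_nu)              # nu > 0
        alpha = tf.nn.softplus(log_alpha) + 1.0  # alpha > 1
        beta = tf.nn.softplus(log_beta)          # beta > 0
        # Closed-form uncertainties from the evidential parameters:
        aleatoric = beta / (alpha - 1.0)           # noise inherent in the data
        epistemic = beta / (nu * (alpha - 1.0))    # uncertainty from lack of evidence
        return mu, aleatoric, epistemic
```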


AutoML (33:08)

Okay, so the final area that I'd like to cover is how we can go beyond having us as users and implementers of neural networks, to where we can potentially automate this pipeline. As you've hopefully seen through the course of these lectures and in the labs, neural networks need to be finely tuned and optimized for the task of interest, and as models get more and more complex, they require some degree of expert knowledge, some of which you've hopefully learned through this course, to select the particular architecture of the network being used and to select and tune the hyperparameters so the network performs as well as it possibly can.

What Google did was build a learning algorithm that can be used to automatically learn a machine learning model to solve a given problem, and this is called AutoML, or automated machine learning. The way it works is that it uses a reinforcement learning framework, in which there is a controller neural network, which is sort of the agent. The controller proposes a child model architecture in terms of the hyperparameters that the architecture would have. That resulting child network is then trained and evaluated for a particular task, say image classification, and its performance is used as feedback, or reward, for the controller agent. The controller takes this feedback into account and, over thousands and thousands of iterations, iteratively improves the resulting child networks: new architectures are produced and tested, feedback is provided, and the cycle continues.

So how does this controller agent work? It turns out it's an RNN controller that, at the macro scale, considers the different values of the hyperparameters for each layer in a generated network. In the case of a CNN, that may be the number of convolutional filters, the size of those filters, et cetera. After the controller proposes this child network, the child network is trained, and its accuracy is evaluated through the normal training and testing pipeline. This is then used as feedback that goes back to the controller, and the controller can use it to improve the child network in future iterations. And what Google has done is actually built a pipeline for this and put the service on the cloud, so that you as a user can provide it with a data set and a set of metrics that you want to optimize over, and this AutoML framework will spit out candidate child networks that can be deployed for your task of interest.
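To make the outer search loop concrete, here is a drastically simplified sketch, assuming TensorFlow 2.x. The real system uses an RNN controller trained with reinforcement learning to propose architectures; here, random sampling stands in for the controller so that the propose-train-evaluate-feedback cycle itself stays visible.

```python
# A toy architecture-search loop: sample hyperparameters, build and
# briefly train a "child" network, use its accuracy as the reward.
# Minimal sketch with random sampling in place of the RNN controller.
import random
import tensorflow as tf

def build_child(num_layers, units):
    """Build a candidate 'child' network from proposed hyperparameters."""
    layers = [tf.keras.layers.Flatten(input_shape=(28, 28))]
    layers += [tf.keras.layers.Dense(units, activation="relu")
               for _ in range(num_layers)]
    layers += [tf.keras.layers.Dense(10, activation="softmax")]
    model = tf.keras.Sequential(layers)
    model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr / 255.0, x_te / 255.0

best_reward, best_config = 0.0, None
for trial in range(10):
    config = {"num_layers": random.choice([1, 2, 3]),
              "units": random.choice([32, 64, 128])}   # "controller" proposes
    child = build_child(**config)
    child.fit(x_tr, y_tr, epochs=1, verbose=0)          # train the child briefly
    reward = child.evaluate(x_te, y_te, verbose=0)[1]   # accuracy = feedback signal
    if reward > best_reward:
        best_reward, best_config = reward, config       # feedback guides the search
print(best_config, best_reward)
```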


Ending Remarks

Conclusion (36:43)

And so I'd like to use this example to think a little bit about what this means for deep learning and AI more generally. This is an example of where Google was able to use a neural network and AI to generate new models that are specialized for particular tasks. And this significantly reduces the burden on us as engineers in terms of having to perform hyperparameter optimization and choose our architectures wisely. I think this gets at the heart of the distinction between the capabilities that AI has now and our own human intelligence. We as humans are able to learn tasks and use the analytical process that goes into that to generalize to other examples in our lives and other problems that we may encounter, whereas neural networks and AI right now are still very much constrained and optimized to perform well at particular, individual problems. And so I'll leave you with that, and sort of encourage you to think a little bit about what steps may be taken to bridge that gap, and whether those steps should be taken to bridge that gap. So that concludes this talk.

