MIT 6.S191 (2018): Introduction to Deep Learning
Transcription for the video titled "MIT 6.S191 (2018): Introduction to Deep Learning".
Note: This transcription is split and grouped by topics and subtopics. All paragraphs are timed to the original video.
Good morning, everyone. Thank you all for joining us. This is MIT 6.S191, and we'd like to welcome you to this course on Introduction to Deep Learning. In this course, you'll learn how to build remarkable algorithms, intelligent algorithms, capable of solving very complex problems that just a decade ago were not even feasible to solve. Let's just start with this notion of intelligence. At a very high level, intelligence is the ability to process information so that it can be used to inform future predictions and decisions. When this intelligence is not engineered but biological, such as in humans, it's called human intelligence; when it's engineered, we refer to it as artificial intelligence. This is a course on deep learning, which is a subset of machine learning, which is in turn a subset of artificial intelligence. Unlike more traditional methods that rely on hand-designed features, machine learning, and deep learning especially, tries to learn representations directly from data. We'll talk about this in more detail later today. But let me first start by talking about some of the amazing successes that deep learning has had in the past. In 2012, a competition called ImageNet tasked AI researchers with building a system capable of recognizing objects in images. There were millions of examples in this data set. And the winner in 2012, for the first time ever, was a deep learning-based system. When it came out, it absolutely shattered all other competitors and crushed the challenge. Today, deep learning-based systems have actually surpassed human-level accuracy on the ImageNet challenge and can recognize images even better than humans can. Now in this class, you'll learn how to build complex vision systems, building a computer that knows how to see.
And just tomorrow, you'll learn how to build an algorithm that takes as input x-ray images and as output detects whether that person has a pneumothorax, just from that single input image. You'll even make the network explain to you why it decided to diagnose the way it did, by looking inside the network and understanding exactly why it made that decision. Deep neural networks can also be used to model sequences, where your data points are not just single images but temporally dependent. Here you can think of things like predicting the stock price, translating sentences from English to Spanish, or even generating new music. So actually today, you'll create yourselves an algorithm that first listens to hours of music, learns the underlying representation of the notes being played in those songs, and then learns to build brand new songs that have never been heard before. There are really so many other incredible success stories of deep learning that I could talk about for many hours, and we'll try to cover as many of these as possible as part of this course. But I just wanted to give you an overview of some of the amazing ones that we'll be covering as part of the labs that you'll be implementing. And that's really the goal of what we want you to accomplish in this class. Firstly, we want to provide you with the foundation to do deep learning, to understand what these algorithms are doing underneath the hood, how they work and why they work. We will provide you some of the practical skills to implement these algorithms and deploy them on your own machines. And we'll talk about some of the state-of-the-art and cutting-edge research that's happening in deep learning, both in industry and academia.
Finally, the main purpose of this course is that we want to build a community here at MIT that is devoted to advancing the state of artificial intelligence, advancing the state of deep learning. As part of this course, we'll cover some of the limitations of these algorithms. There are many. We need to be mindful of these limitations so that we as a community can move forward and create more intelligent systems. But before we do that, let's just start with some administrative details. This is a one-week course. Today is the first lecture. We meet every day this week, 10:30 AM to 1:30 PM. This three-hour time slot is broken down into two halves of roughly one and a half hours each. The first half consists of lectures, which is what you're in right now. The second part is the labs, where you'll actually get practice implementing what you learn in lecture. We have an amazing set of lectures lined up for you. Today we're going to talk about an introduction to neural networks, which is really the backbone of deep learning. We'll also talk about modeling sequence data; this is what I was mentioning about temporally dependent data. Tomorrow we'll talk about computer vision and deep generative models. We have one of the inventors of generative adversarial networks coming to give that lecture for us, so that's going to be a great lecture. The day after that, we'll touch on deep reinforcement learning and some of the open challenges in AI, and how we can move forward past this course. We'll spend the final two days of this course hearing from some of the leading industry representatives doing deep learning in their respective companies. These are bound to be extremely interesting and exciting, so I highly recommend attending these as well.
For those of you who are taking this course for credit, you have two options to fulfill your graded assignment. The first option is a project proposal. It's a one-minute project pitch that will take place on Friday.
Neural Networks And Related Concepts
Project Proposal (06:29)
For this, you have to work in groups of three or four. What you'll be tasked to do is come up with an interesting deep learning idea and try to show some sort of results, if possible. We understand that one week is extremely short to create any type of results, or even come up with an interesting idea for that matter. But we're going to be giving out some amazing prizes, including some NVIDIA GPUs and Google Homes. On Friday, like I said, you'll give a one-minute pitch. There's somewhat of an art to pitching your idea in just one minute, even though it's extremely short. So we will be holding you to a strict deadline of that one minute. The second option is a little more boring: you can write a one-page review of any deep learning paper that you find interesting. If you can't do the project proposal, you can do that. This class has a lot of online resources. You can find support on Piazza. Please post if you have any questions about the lectures, the labs, installing any of the software, et cetera. Also, try to keep up to date with the course website, where we'll be posting all of the lectures, labs, and video recordings. We have an amazing team that you can reach out to at any time in case you have any problems with anything. Feel free to reach out to any of us. And we want to give a huge thanks to all of our sponsors, without whose support this class simply would not be happening the way it is this year. So now let's start with the fun stuff. Let's start by asking ourselves a question: why do we even care about deep learning? Why now? Why do we even sit in this class today? Traditional machine learning algorithms typically define sets of pre-programmed features in the data, and they work to extract these features as part of their pipeline.
Now, the key differentiating point of deep learning is that it recognizes that in many practical situations, these features can be extremely brittle.
Neural Networks (08:49)
So what deep learning tries to do is learn these features directly from data, as opposed to having them hand-engineered by a human. That is, if we want to learn to detect faces, can we learn automatically from data that to detect faces we first need to detect edges in the image, compose these edges together to detect eyes and ears, and then compose these eyes and ears together to form higher-level facial structure? In this way, deep learning represents a form of hierarchical model capable of representing different levels of abstraction in the data. Now, the fundamental building blocks of deep learning, which are neural networks, have actually existed for decades. So why are we studying this now?
Well, there are three key points here. The first is that data has become much more pervasive. We're living in a big data environment. These algorithms are hungry for more and more data, and accessing that data has become easier than ever before. Second, these algorithms are massively parallelizable and can benefit tremendously from modern GPU architectures that simply did not exist a decade ago. And finally, due to open source toolboxes like TensorFlow, building and deploying these algorithms has become so streamlined, so simple, that we can teach it in a one-week course like this, and it's become extremely deployable for the massive public. So let's now start by looking at the fundamental building block of deep learning, and that's the perceptron. This is really just a single neuron in a neural network. The idea of a perceptron, or a single neuron, is extremely simple. Let's start by talking about the forward propagation of information through this unit. We define a set of inputs, x1 through xm, on the left. All we do is multiply each of these inputs by their corresponding weight, theta 1 through theta m, which are those arrows. We take this weighted combination of all of our inputs, sum them up, and pass the sum through a nonlinear activation function. That produces our output y. It's that simple. So we have m inputs and one output number, and you can see it summarized on the right-hand side as a single mathematical equation. But actually, I left out one important detail that makes the previous slide not exactly correct. I left out this notion of a bias. The bias is that green term you see on the left. It represents a way that we can allow our model to learn, by allowing our activation function to shift to the left or right.
So it allows us, even when we have no input features, to still produce a nonzero output. Now, this equation on the right we can actually rewrite using linear algebra and dot products to make it a lot cleaner. So let's do that. Let's say capital X is a vector containing all of our inputs, x1 through xm. Capital Theta is just a vector containing all of our thetas, theta 1 through theta m. We can rewrite the equation that we had before as just applying a dot product between X and Theta, adding our bias, theta 0, and applying our nonlinearity, g. Now you might be wondering, since I've mentioned this a couple of times now, what is this nonlinear function g? Well, I said it's the activation function, but let's see an example of what g could actually be in practice. One very popular activation function is the sigmoid function. You can see a plot of it here on the bottom right. This is a function that takes as input any real number on the x-axis and transforms it to an output between 0 and 1.
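The labs use TensorFlow, but as a dependency-free sketch in plain Python, the sigmoid is just one line, squashing any real number into the open interval (0, 1):

```python
import math

def sigmoid(z):
    """Map any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and zero maps exactly to 0.5.
print(sigmoid(-6))   # ~0.0025
print(sigmoid(0))    # 0.5
print(sigmoid(6))    # ~0.9975
```

Note the symmetry: sigmoid(-z) = 1 - sigmoid(z), which is why 0 sits exactly at the 0.5 midpoint.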
Because all outputs of this function are between 0 and 1, it's a very popular choice in deep learning for representing probabilities. In fact, there are many types of nonlinear activation functions in deep neural networks, and here are some of the common ones. Throughout this presentation, you'll also see TensorFlow code snippets, like the ones you see on the bottom here, since we'll be using TensorFlow for our labs. This is one way I can link the material in our lectures with what you'll be implementing in labs. So the sigmoid activation function, which I talked about on the previous slide, now on the left, is commonly used to produce probability outputs. Each of these activation functions has its own advantages and disadvantages. On the right is another very common activation function, the rectified linear unit, or ReLU. This function is very popular because it's extremely simple to compute. It's piecewise linear: it's zero for any input less than zero.
It's x for any input greater than zero. And the gradients are just zero or one, with a single nonlinearity at the origin. Now you might be wondering why we even need activation functions. Why can't we just take our dot product, add our bias, and call that our output? Activation functions introduce nonlinearities into the network. That's the whole point of why activation functions themselves are nonlinear. We want to model nonlinear data, because the world is extremely nonlinear. Suppose I gave you this plot of green and red points and asked you to draw a single line, not a curve, just a line, that separates the green and red points perfectly. You'd find this really difficult, and probably the best you could do is something like this. Now, if the activation function in your deep neural network were linear, then since you're just composing linear functions with linear functions, your output will always be linear. So even the most complicated deep neural network, no matter how big or how deep, if its activation functions are linear, can only produce an output that looks like this. But once we introduce nonlinearities, the capacity of our network increases enormously. We're now able to model much more complex functions. We're able to draw decision boundaries that were not possible with only linear activation functions. Let's understand this with a very simple example.
Imagine I gave you a trained perceptron, not a network yet, just a single node, like the one we saw before. The weights are on the top right: theta 0 is 1, and the theta vector is 3 and negative 2. The network has two inputs, x1 and x2. If we want to get the output, all we have to do is apply the same story as before: we apply the dot product of x and theta, we add the bias, and we apply our nonlinearity. But let's take a look at what's actually inside, before we apply that nonlinearity. This looks a lot like just a 2D line, because we have two inputs. And it is. We can actually plot this line, where it equals zero, in feature space. This is the space where I'm plotting x1, one of our features, on the x-axis, and x2, the other feature, on the y-axis.
Decision Boundaries (16:33)
If we plot that line, it's just the decision boundary separating our entire space into two subspaces. Now if I give you a new point, negative 1, 2, and plot it in this feature space, then depending on which side of the line it falls on, I can automatically determine whether that inner term is less than 0 or greater than 0, since our line represents the decision boundary where it equals 0. We can follow the math on the bottom and see that computing the inside of this activation function, we get 1 minus 3 minus 4, which gives us minus 6 before we apply the activation function. Once we apply the activation function, we get about 0.002. The input to the activation function was negative because we fell on the negative side of this subspace. If we remember, the sigmoid function actually divides our space into two parts, greater than 0.5 and less than 0.5, since we're modeling probabilities and everything is between 0 and 1. So our decision boundary, where the input to our activation function equals 0, corresponds to the output of our activation function being greater than or less than 0.5. Now that we have an idea of what a perceptron is, let's start by understanding how we can compose these perceptrons together to actually build neural networks, and see how this all comes together. So let's revisit our previous diagram of the perceptron. If there are a few things that you learn from this class, let this be one of them, and we'll keep repeating it over and over.
In deep learning, you take a dot product, you add a bias, and you apply a nonlinearity. You keep repeating that many, many times, for each node, each neuron, in your neural network. And that's a neural network. So let's simplify this diagram a little. I'll remove the bias, since we're always going to have it and we'll just take it for granted from now on. I'll also remove all of the weight labels for simplicity. Note that z is just the input to our activation function, so that's the dot product plus our bias. If we want the output of the network, y, we simply take z and apply our nonlinearity, like before. If we want to define a multi-output perceptron, it's very simple: we just add another perceptron. Now we have two outputs, y1 and y2, and each one has a weight vector theta corresponding to the weight of each of the inputs. Now let's suppose we want to go a step deeper and create a single-layered neural network. Single-layered neural networks are actually not deep networks yet; they're still shallow networks, only one layer deep. In a single-layered neural network, all we do is add one hidden layer between our inputs and outputs. We call this a hidden layer because its states are not directly observable. They're not directly enforced by the AI designer; we typically only enforce the inputs and outputs. The states in the middle are hidden. And since we now have one transformation to go from our input space to our hidden layer space, and another from our hidden layer space to our output layer space, we actually need two weight matrices, theta 1 and theta 2, corresponding to the weight matrices of each layer. Now if we look at just a single unit in that hidden layer, it's the exact same story as before. It's one perceptron. We take the dot product of the x's that came before it with the corresponding weights, from theta 1 in this case.
We add a bias to get z2. And if we were to look at a different hidden unit, let's say z3 instead, we would just take different weights; our dot product would change, our bias would change, and that means z would change, which means its activation would also be different. From now on, I'm going to use this symbol to denote what is called a fully connected layer, which is what we've been talking about so far: every node in one layer is connected to every node in the next layer by these weight matrices. This is really just for simplicity, so I don't have to keep redrawing those lines. Now, if we want to create a deep neural network, all we do is keep stacking these layers and the fully connected weights between them. It's that simple. But the underlying building block is that single perceptron. It's that single dot product, bias, and nonlinearity. That's it.
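Putting the pieces together, here is a minimal plain-Python sketch of the perceptron's forward pass (dot product, bias, nonlinearity), using the worked example from earlier in the lecture: bias theta 0 = 1, weights (3, -2), input point (-1, 2):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, theta, theta0):
    """Dot product, add bias, apply nonlinearity -- the core operation."""
    z = theta0 + sum(xi * ti for xi, ti in zip(x, theta))
    return sigmoid(z), z

y, z = perceptron(x=[-1.0, 2.0], theta=[3.0, -2.0], theta0=1.0)
print(z)  # 1 - 3 - 4 = -6: the point falls on the negative side of the boundary
print(y)  # ~0.0025, well below the 0.5 decision threshold
```

A deep network just repeats this operation for every unit in every layer.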
Applying Neural Networks (21:34)
So this is really incredible, because something so simple at the foundation is able to create such incredible algorithms. Now let's see an example of how we can actually apply neural networks to a very important question that I know you're all extremely worried about, that you care a lot about. Here's the question: you want to build an AI system that answers the following question, will I pass this class? Yes or no, one or zero, is the output. To do this, let's start by defining a simple two-feature model. One feature is the number of lectures that you attend; the second feature is the number of hours that you spend on your final project. Let's plot this data in our feature space. Green points are people who passed; red points are people who failed. And we want to know, given a new person, this guy, who spent five hours on their final project and went to four lectures, did that person pass or fail the class? We want to build a neural network that will determine this. So let's do it. We have two inputs: one is 4, the other is 5. We have one hidden layer with three units, and we want to see the final output probability of passing this class, which we compute as 0.1, or 10%. That's really bad news, because this person actually did pass the class; they passed it with probability 1. Now, can anyone tell me why the neural network got this so wrong? Any ideas? It's not trained. Exactly. This network has never been trained. It's never seen any data. It's basically like a baby; it's never learned anything. So we can't expect it to solve a problem it knows nothing about.
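To see why the untrained network's answer is meaningless, here is a plain-Python sketch of a forward pass through a 2-input, 3-hidden-unit, 1-output network. Only the architecture follows the lecture; the weight values below are arbitrary, made-up numbers, exactly like the random initialization of an untrained network:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, biases):
    """One fully connected layer: dot product, bias, nonlinearity per unit."""
    return [sigmoid(b + sum(xi * wi for xi, wi in zip(x, w)))
            for w, b in zip(weights, biases)]

# Untrained (arbitrary) weights -- like the "baby" network in the lecture.
W1 = [[0.1, -0.4], [0.5, 0.2], [-0.3, 0.8]]   # 3 hidden units, 2 inputs each
b1 = [0.0, 0.0, 0.0]
W2 = [[-0.7, 0.1, -0.2]]                      # 1 output unit, 3 hidden inputs
b2 = [0.0]

x = [4.0, 5.0]            # 4 lectures attended, 5 hours on the final project
hidden = dense(x, W1, b1)
output = dense(hidden, W2, b2)
print(output[0])          # some arbitrary probability; meaningless until trained
```

Whatever number comes out is just a function of the random weights, which is why training is the next step.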
Loss Function (23:23)
So to tackle this problem of training a neural network, we first have to define a couple of things. First, we'll talk about the loss. The loss of a network basically tells our algorithm, or our model, how wrong our predictions are compared to the ground truth. You can think of this as a distance between our predicted output and our actual output. If we predict something that is very close to the true output, our loss is very low. If we predict something that is, in a high-level sense, very far away, like in distance, then our loss is very high. And we want to minimize the loss as much as possible. Now let's assume we're not given just one data point, one student, but a whole class of students. As previous data, I used this entire class from last year. If we want to quantify what's called the empirical loss, we now care about how the model did on average over the entire data set, not just for a single student. And how we do that is very simple: we just take the average of the loss over each data point. If we have n students, that's the average over n data points. The empirical loss has other names as well. Sometimes people call it the objective function, the cost function, et cetera. All of these terms mean the same thing. Now, if we look at the problem of binary classification, predicting whether you pass or fail this class, yes or no, one or zero, we can use something called the softmax cross-entropy loss. For those of you who aren't familiar with cross-entropy, this builds on the extremely powerful notion of entropy, introduced by Claude Shannon, who was a master's student here at MIT, about 70 years ago. It's huge in the fields of signal processing, thermodynamics, information theory, and really all over computer science.
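For binary labels, the cross-entropy loss averaged over a data set can be sketched in plain Python like this (the small clipping constant is an implementation convenience to avoid log of zero, not something from the lecture):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss...
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
# ...while confident, wrong predictions are heavily penalized.
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```

The asymmetry is the point: the further a confident prediction is from the true label, the faster the loss grows.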
Now, instead of predicting a single one-or-zero output, yes or no, let's suppose we want to predict a continuous-valued function. Not will I pass this class, but what grade will I get, as a percentage from 0 to 100, say. We're no longer limited to 0 to 1, but can output any real number on the number line. Instead of using cross-entropy, we might want to use a different loss. For this, we can think of something like a mean squared error loss, where as your predicted output and your true output diverge from each other, the loss increases as a quadratic function.
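For the continuous case, a minimal mean squared error sketch in plain Python (the example grades are made up for illustration):

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference; grows quadratically as predictions diverge."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Being off by 20 points costs four times as much as being off by 10.
print(mean_squared_error([90.0], [80.0]))   # 100.0
print(mean_squared_error([90.0], [70.0]))   # 400.0
```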
Mean Squared Error Loss (25:50)
OK, great. So now let's put this loss information to the test and actually learn how we can train a neural network by quantifying its loss. If we go back to what the loss is, at a very high level, the loss tells us how the network is performing; it tells us the accuracy of the network on a set of examples. What we want to do is minimize the loss over our entire training set. Really, we want to find the set of parameters theta such that that loss, J of theta, our empirical loss, is minimized. Remember, J of theta takes as input theta, and theta is just our weights. These are the things that actually define our network. The loss is just a function of these weights. If we want to think about the process of training, we can imagine this landscape.
Landscape of Loss Functions (27:03)
So if we only have two weights, we can plot a nice diagram like this. Theta 0 and theta 1 are our two weights; they're on the planar axes on the bottom. J of theta 0 and theta 1 is plotted on the z-axis. What we want to do is find the minimum of this loss landscape. If we can find the minimum, then this tells us where our loss is smallest, and it tells us what values of theta 0 and theta 1 we can use to attain that minimum loss. So how do we do this? Well, we start with a random guess. We pick a point, theta 0, theta 1, and we start there. We compute the gradient at this point on the loss landscape. That's dJ d theta: how the loss is changing with respect to each of the weights. Now, this gradient tells us the direction of steepest ascent, not descent. It's telling us the direction going towards the top of the mountain. So let's take a small step in the opposite direction. We negate our gradient and adjust our weights so that we step away from the gradient, moving continuously towards the lowest point in this landscape, until we finally converge at a local minimum, and then we just stop. Let's summarize this with some pseudocode. We randomly initialize our weights, and we loop until convergence over the following: we compute the gradient at our current point, and we simply apply the update rule, where the update takes a step along the negative gradient. Now let's look at this term here. This is the gradient. Like I said, it tells us how the loss changes with respect to each weight in the network. But I never actually told you how to compute it, and this is actually a big issue in neural networks; I just took it for granted. So now let's talk about the process of actually computing this gradient, because without that gradient, you're kind of helpless. You have no idea which way down is.
You don't know where to go in your landscape. So let's consider a very simple neural network, probably the simplest neural network in the world.
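Before we do, the gradient descent loop just summarized (randomly initialize, compute the gradient, step against it, repeat until convergence) can be sketched in plain Python on a toy one-dimensional loss. The quadratic loss, starting point, and learning rate below are illustrative choices, not anything from the lecture:

```python
def gradient_descent(grad, theta_init, lr=0.1, steps=100):
    """Start somewhere, then repeatedly step opposite the gradient."""
    theta = theta_init
    for _ in range(steps):
        theta = theta - lr * grad(theta)   # step against the gradient
    return theta

# Loss J(theta) = (theta - 3)^2 has gradient dJ/dtheta = 2 * (theta - 3),
# so gradient descent should converge to the minimum at theta = 3.
theta_star = gradient_descent(grad=lambda t: 2.0 * (t - 3.0), theta_init=-5.0)
print(theta_star)   # ~3.0
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor, which is why the loop converges so quickly on a convex loss.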
Loss Functions And Optimization Methods
Backpropagation (29:18)
It contains one hidden unit, one hidden layer, and one output unit. We want to compute the gradient of our loss, J of theta, with respect to theta 2, just theta 2 for now. This tells us how a small change in theta 2 will impact our final loss at the output. So let's write this out as a derivative. We can start by applying the chain rule, because J of theta is dependent on y, right? So first, we backpropagate through y, our output, all the way back to theta 2. We can do this because y, our output, is only dependent on the input and theta 2. That's it. So just from that perceptron equation we wrote on the previous slide, we can compute a closed-form derivative of that function. Now let's suppose I change theta 2 to theta 1, and I want to compute the same thing, but now for the previous layer and the previous weight. All we need to do is apply the chain rule one more time, and backpropagate those gradients that we previously computed one layer further back. It's the same story again, and we can do it for the same reason: z1, our hidden state, is only dependent on our previous input, x, and that single weight, theta 1. Now, the process of backpropagation is basically repeating this process over and over again, for every weight in your network, until you have computed that gradient dJ d theta. You can then use it as part of your optimization process to find your local minimum. In theory that sounds pretty simple, I hope; I mean, we just talked about some basic chain rules. But let's actually touch on some insights about training these networks and computing backpropagation in practice. The picture I showed you before is not really accurate for modern deep neural network architectures, which are extremely non-convex. This is a visualization of the landscape, like the one I plotted before, but of a real deep neural network, of ResNet-50, to be precise.
This was actually taken from a paper published about a month ago, where the authors attempt to visualize the loss landscape to show how difficult gradient descent can actually be. There's a possibility that you can get stuck in any one of these local minima; there's no guarantee that you'll actually find the true global minimum. So let's recall that update equation we defined during gradient descent, and take a look at this term here. This is the learning rate. I didn't talk much about it, but it basically determines how large a step we take in the direction of our gradient. In practice, setting this learning rate, even though it's just a number, can be very difficult. If we set the learning rate too low, the model may get stuck in a local minimum and never find its way out, because at the bottom of the local minimum your gradient is zero, so it just stops moving. If I set the learning rate too large, it could overshoot and actually diverge; our model could blow up. Ideally, we want to use learning rates that are large enough to avoid local minima, but still converge to the global minimum.
Setting the Learning Rate (32:59)
So they can overshoot just enough to avoid some local minima, but then converge to the global minimum. Now, how can we actually set the learning rate? Well, one idea is to just try a lot of different values and see what works best. But I don't really like this solution; let's see if we can be a little smarter than that. How about we build an adaptive algorithm that changes its learning rate as training happens? This is a learning rate that actually adapts to the landscape that it's in. The learning rate is no longer a fixed number; it can go up and down depending on the location the update is currently at, the gradient at that location, maybe how fast we're learning, and many other possible factors. In fact, this process of optimization in non-convex settings like deep neural networks has been extensively explored. There are many, many algorithms for computing adaptive learning rates, and here are some examples that we encourage you to try out during your labs to see what works best. For real-world problems especially, what works in practice can differ from what you might expect from lecture, so we encourage you to just experiment, get some intuition about each of these methods, and really understand them at a higher level.
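As one concrete example of such an adaptive scheme, here is a plain-Python sketch of an Adam-style update on the same kind of toy one-dimensional quadratic loss. The hyperparameter values are the commonly cited defaults, and the loss, starting point, and step count are invented for illustration; in the labs you would simply pick one of TensorFlow's built-in optimizers instead:

```python
import math

def adam(grad, theta_init, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8,
         steps=2000):
    """Adam-style update: the effective step size adapts over time using
    running averages of the gradient and of its square."""
    theta, m, v = theta_init, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # momentum-like average
        v = beta2 * v + (1 - beta2) * g * g      # scale of recent gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize J(theta) = (theta - 3)^2 via its gradient 2 * (theta - 3).
theta_star = adam(grad=lambda t: 2.0 * (t - 3.0), theta_init=-5.0)
print(theta_star)   # close to the minimum at 3.0
```

The key idea is that the step size is divided by the recent gradient magnitude, so steps stay roughly constant far from the minimum and shrink automatically near it.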
More Optimization Methods (34:23)
So I want to continue this talk with more of the practice of deep neural networks, and in particular the incredibly powerful notion of mini-batching. Let's go back to this gradient descent algorithm, the same one that we saw before, and look at this term again. We found out how to compute this term using backpropagation, but what I didn't tell you is that the computation here is extremely expensive. We potentially have a lot of data points in our data set, and this term is a summation over all of those data points. So if our data set is millions of examples large, which is not that large in the realm of today's deep neural networks, this can be extremely expensive for just one iteration. We can't compute this on every iteration. Instead, let's create a variant of this algorithm called stochastic gradient descent, where we compute the gradient using just a single training example. This is nice because it's really easy to compute the gradient for a single training example; it's not nearly as intense as computing it over the entire training set. But as the name might suggest, this is a stochastic estimate. It's much noisier. It can make us jump around the landscape in ways that we didn't anticipate. It doesn't actually represent the true gradient of our data set, because it comes from only a single point.
So what's the middle ground? How about we define a mini-batch of B data points, compute the average gradient across those B data points, and actually use that as an estimate of our true gradient. Now, this is much faster than computing the estimate over the entire data set, because B is usually something like 10 to 100. And it's much more accurate than SGD, because we're not taking a single example; we're averaging over a batch. Now, the more accurate our gradient estimate is, the easier it will be for us to converge to the solution faster. It means we'll converge more smoothly, because we'll actually follow the true landscape that exists. It also means that we can increase our learning rate to trust each update more. This also allows for massively parallelizable computation. If we split up batches across different workers, different GPUs, or different threads, we can achieve even higher speedups, because each worker can handle its own batch, and then they can come back together and aggregate their results to complete that single training iteration. Now, finally, the last topic I want to talk about is that of overfitting and regularization. Really, this is the problem of generalization, which is one of the most fundamental problems in all of artificial intelligence, not just deep learning. And for those of you who aren't familiar, let me just go over at a high level what overfitting is and what it means to generalize.
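Before moving on, the mini-batch variant described above can be sketched on the same toy regression problem (hypothetical data again). Note that the averaged gradient is one vectorized matrix product, which is what makes each batch so easy to parallelize.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0.0, 0.1, 1000)

w = np.zeros(3)
lr, B = 0.05, 32                              # less noise per update allows a larger lr
for epoch in range(5):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), B):
        batch = idx[start:start + B]          # B examples per update
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)  # average gradient: one matrix product
        w -= lr * grad
print(w)
```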
Ideally, in machine learning, we want a model that accurately describes our test data, not our training data, but our test data. Said differently, we want to build models that can learn representations from our training data and still generalize well on unseen test data. Assume you want to build a line to describe these points. Underfitting describes the process on the left, where the complexity of our model is simply not high enough to capture the nuances of our data. If we go to overfitting on the right, we're actually having too complex of a model and actually just memorizing our training data, which means that if we introduce a new test data point, it's not going to generalize well.
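One way to see this concretely is to fit polynomials of increasing degree to a small noisy data set, with degree standing in for model complexity. All the numbers here are made up for illustration: degree 1 underfits, degree 9 passes through every training point but generalizes poorly.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, 10)
x_test = np.linspace(0.05, 0.95, 10)          # unseen points from the same curve
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.1, 10)

def errors(degree):
    """Train and test mean-squared error of a polynomial fit of this degree."""
    coef = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coef, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

results = {deg: errors(deg) for deg in (1, 3, 9)}
for deg, (tr, te) in results.items():
    print(deg, tr, te)
# Degree 1: both errors high (underfitting).
# Degree 9: train error near zero, memorizing the noise (overfitting).
```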
Ideally, what we want is something in the middle, which is not too complex to memorize all of the training data, but still contains the capacity to learn some of these nuances in the test set. So to address this problem, let's talk about this technique called regularization. Now regularization is just this way that you can discourage your models from becoming too complex.
Overfitting And Early Stopping
And as we've seen before, this is extremely critical, because we don't want our models to just memorize data and only do well on our training set. One of the most popular techniques for regularization in neural networks is dropout. This is an extremely simple idea. Let's revisit this picture of a deep neural network. In dropout, all we do during training, on every iteration, is randomly drop some proportion of the hidden neurons with some probability p. So let's suppose p equals 0.5. That means we drop 50% of those neurons, like that. Those activations become 0, and effectively they're no longer part of our network. This forces the network to not rely on any single node, but instead find alternative paths through the network, without putting too much weight on any single node. So it discourages memorization, essentially. On every iteration, we randomly drop another 50% of the nodes. So on this iteration, I may drop these; on the next iteration, I may drop those.
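A minimal numpy sketch of a dropout layer. This uses the common "inverted dropout" formulation, which additionally rescales the surviving activations by 1/(1-p) so that their expected value is unchanged; that rescaling is a standard implementation detail beyond what the lecture describes.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability p
    and rescale the survivors by 1/(1-p) so expected activations are
    unchanged. At test time, activations pass through untouched."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= p   # fresh random mask every call
    return activations * keep / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(8)                                  # pretend hidden-layer activations
print(dropout(h, p=0.5, rng=rng))               # roughly half zeros, the rest 2.0
```

Because the mask is redrawn on every call, each training iteration drops a different random subset of nodes, which is what forces the network to find alternative paths.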
And since it's different on every iteration, you're encouraging the network to find these different paths to its answer. The second technique for regularization that we'll talk about is this notion of early stopping. Now, overfitting, by definition, is when our model starts to perform worse and worse on our test data set. So let's use that to our advantage to create this early stopping algorithm. If we set aside some of our training data and use it only as test data, meaning we don't train with that data, we can use it to monitor the progress of our model on unseen data.
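The monitoring loop described here can be sketched as follows. The `patience` parameter (tolerate a few non-improving checks before quitting) is a common refinement I've added for robustness against a noisy held-out loss; the lecture itself simply says to stop at the turning point.

```python
import numpy as np

def train_with_early_stopping(train_step, val_loss, patience=5, max_iters=1000):
    """Run training until the held-out loss stops improving."""
    best_loss, best_iter, wait = np.inf, 0, 0
    for t in range(max_iters):
        train_step()                        # one training update
        loss = val_loss()                   # loss on the held-out data
        if loss < best_loss:
            best_loss, best_iter, wait = loss, t, 0
        else:
            wait += 1
            if wait >= patience:            # stopped improving: quit
                break
    return best_iter, best_loss

# Simulated held-out loss: falls until iteration 20, then rises (overfitting)
curve = iter([abs(t - 20) / 20.0 + 1.0 for t in range(100)])
best_iter, best_loss = train_with_early_stopping(lambda: None, lambda: next(curve))
print(best_iter, best_loss)  # 20 1.0
```

In practice you would also snapshot the model weights whenever `best_loss` improves, so that training can be rolled back to the best iteration rather than the last one.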
Early Stopping (40:29)
So we can plot this curve where on the x-axis we have the training iterations and on the y-axis we have the loss. Now, the training and testing losses start off going down together. This is great, because it means that we're learning, we're training, right? There comes a point, though, where the loss on the testing data set starts to plateau. If we look a little further, the training loss will always continue to go down, as long as our model has the capacity to learn and memorize some of that data. That doesn't mean that it's actually generalizing well, because we can see that the testing loss has actually started to increase. This pattern continues for the rest of training, but I want to focus on this point here. This is the point where you need to stop training, because after this point you are overfitting, and your model is no longer performing well on unseen data. If you stop before that point, you're underfitting, and you're not utilizing the full capacity of your network. So I'll conclude this lecture by summarizing the three key points that we've covered so far.
First, we learned about the fundamental building block of neural networks, the perceptron. We learned about stacking these units, these perceptrons, together to compose very complex hierarchical models, and we learned how to mathematically optimize these models using a process called backpropagation with gradient descent. Finally, we addressed some of the practical challenges of training these models in real life that you'll find useful for the labs today, such as using adaptive learning rates, batching, and regularization to combat overfitting. Thank you, and I'd be happy to answer any questions now. Otherwise, we'll have Harini talk to us about some of the deep sequence models for modeling temporal data.