MIT 6.S191 (2019): Introduction to Deep Learning
Transcription for the video titled "MIT 6.S191 (2019): Introduction to Deep Learning".
Note: This transcription is split and grouped by topics and subtopics. Each heading is timed to the original video (e.g., 01:53).
Good afternoon, everyone. Thank you all for joining us. My name is Alexander Amini. I'm one of the course organizers for 6.S191. This is MIT's official course on Introduction to Deep Learning. And this is actually the third year that we're offering this course. And we've got a really good one in store for you this year with a lot of awesome updates, so I really hope that you enjoy it. So what is this course all about? This is a one-week intensive boot camp on everything deep learning. You'll get up close and personal with some of the foundations of the algorithms driving this remarkable field. And you'll actually learn how to build some intelligent algorithms capable of solving incredibly complex problems.
Dive Into Neural Network Models And Loss Optimization
A Glimpse into Deep Learning's History (00:40)
So over the past couple years, deep learning has revolutionized many aspects of research and industry, including things like autonomous vehicles, medicine and healthcare, reinforcement learning, generative modeling, robotics, and a whole host of other applications like natural language processing, finance, and security. But before we talk about that, I think we should start by taking a step back and talking about something at the core of this class, which is intelligence. What is intelligence? Well, I like to define intelligence as the ability to process information to inform future decisions. The field of artificial intelligence is actually building algorithms, artificial algorithms, to do exactly that: process information to inform future predictions. Now, machine learning is simply a subset of artificial intelligence, or AI, that focuses on teaching an algorithm how to take information and do this without explicitly being told the sequence of rules, instead learning the patterns from the data itself. Deep learning is simply a subset of machine learning which takes this idea one step further and actually tries to extract these patterns automatically from raw data, without the need for a human to come in and annotate the rules that the system needs to learn. And that's what this class is all about: teaching algorithms how to learn a task from raw data. We want to provide you with a solid foundation so that you can learn how these algorithms work under the hood, and with the practical skills so that you can actually implement these algorithms from scratch using deep learning frameworks like TensorFlow, which is currently the most popular deep learning framework, and which you can use to code neural networks and other deep learning models.
We have an amazing set of lectures lined up for you this week, including today, which will kick off with an introduction to neural networks and sequence-based modeling, which you'll hear about in the second part of the class. Tomorrow, we'll cover some stuff about computer vision and deep generative modeling. And the day after that, we'll talk about reinforcement learning and end on some of the challenges and limitations of current deep learning approaches, and touch on how we can move forward as a field past these challenges. We'll also spend the final two days hearing from some guest lecturers from top AI researchers. These are bound to be extremely interesting; we have speakers from NVIDIA, IBM, and Google coming to give talks, so I highly recommend attending these as well. And finally, the class will conclude with final project presentations from students like you in the audience, where you'll present your final projects for this class. And then we'll end on an award ceremony to celebrate. So as you might have seen or heard already, this class is offered for credit. You can take this class for a grade. And if you're taking this class for a grade, you have two options to fulfill your grade requirement. The first option is that you can do a project proposal, where you will present your project on the final day of class. That's what I was saying before: on Friday, you can present your project. And this is just a three-minute presentation. We'll be very strict on the time here. And we realize that one week is a super short time to actually come up with a deep learning project. So we're not going to be judging you on the results that you create during this week. Instead, what we're looking for is the novelty of the ideas and how well you can present them given such a short amount of time. And we think there's kind of an art to being able to present something in just three minutes.
So we kind of want to hold you to that tight time schedule and enforce it very tightly, just so that you're forced to really think about: what is the core idea that you want to present to us on Friday? Your presentations will be judged by a panel of judges, and we'll be awarding GPUs and some Google Home AI assistants. This year, we're offering three NVIDIA GPUs, each one worth over $1,000. As some of you know, these GPUs are the backbone of doing cutting-edge deep learning research. And they're really essential if you want to be doing this kind of research. So we're really happy that we can offer you this type of hardware. The second option, if you don't want to do the project presentation but you still want to receive credit for this class, is a little more boring in my opinion: you can write a one-page review of a deep learning paper. And this will be due on the last day of class. This is for people that don't want to do the project presentation but still want to get credit for the class. Please post to Piazza if you have questions about the labs that we'll be doing today or on any of the future days. If you have questions about the course in general, there's course information on the website, intro2deeplearning.com, along with announcements, digital recordings, as well as slides for these classes. Today's slides are already released, so you can find everything online. And of course, if you have any questions, you can email us at intro2deeplearning-staff at mit.edu. This course has an incredible team that you can reach out to in case you have any questions or issues about anything.
So please don't hesitate to reach out. And finally, we want to give a huge thanks to all of the sponsors that made this course possible. So now let's start with the fun stuff. And actually, let's start by asking ourselves a question: why do we even care about this class? Why did you all come here today? Why do we care about deep learning? Well, traditional machine learning algorithms typically define sets of rules or features that you want to extract from the data. Usually these are hand-engineered features, and they tend to be extremely brittle in practice. Now, the key insight of deep learning is: let's not hand-engineer these features. Instead, let's learn them directly from raw data. That is, to detect a face, can we first detect the edges in the picture, compose these edges together to start detecting things like eyes, mouth, and nose, and then compose those features together to detect higher-level structures in the face? This is all performed in a hierarchical manner. So the question of deep learning is how we can go from raw image pixels, or raw data in general, to increasingly complex representations as the data flows through the model. And actually, the fundamental building blocks of deep learning have existed for decades, and their underlying algorithms have been studied for many years even before that. So why are we studying this now? Well, for one, data has become so prevalent in today's society. We're living in the age of big data, where we have more access to data than ever before. And these models are hungry for data, so we need to feed them with all the data. A lot of the data sets that we have available now, like computer vision data sets and natural language processing data sets, this raw amount of data was just not available when these algorithms were created.
Second, these algorithms are massively parallelizable at their core. At their most fundamental building blocks, as you'll learn today, they're massively parallelizable. And this means that they can benefit tremendously from very specialized hardware such as GPUs. And again, technology like these GPUs simply did not exist in the decades when the foundations of deep learning were developed. And finally, due to open source toolboxes like TensorFlow, which you'll learn to use in this class, building and deploying these models has become more streamlined than ever before. It is becoming increasingly easy to abstract away all of the details, build a neural network, train that neural network, and then deploy it in practice to solve a very complex problem. In just tens of lines of code, you can create a facial classifier that's capable of recognizing very complex faces from the environment. So let's start with the most fundamental building block of deep learning, the building block that makes up a neural network, and that is a neuron. So what is a neuron? In deep learning, we call it a perceptron. And how does it work? The idea of a perceptron, or a single neuron, is very simple. Let's start by describing the feedforward propagation of information through that model.
Neural Net Model, Mathematical Equation (09:57)
We define a set of inputs, x1 through xm, which you can see on the left-hand side. And each of these inputs is actually multiplied by a corresponding weight, w1 through wm. So you can imagine if you have x1, you multiply it by w1. You have x2, you multiply it by w2, and so on. You take all of those multiplications and you add them up, so these come together in a summation. And then you pass this weighted sum through a nonlinear activation function to produce a final output, which we'll call y. So that's really simple. I actually left out one detail in that previous slide, so I'll add it here now. We also have this other term, this green term, which is a bias term, which allows you to shift your activation function left and right. And now on the right side, you can see this diagram illustrated as a mathematical formula. As a single equation, we can actually rewrite this using linear algebra: vectors, dot products, and matrices. So let's do that. So now, instead of a single number, capital X is a vector of all of the inputs, x1 through xm. Capital W is a vector of all of the weights, w1 through wm. And we can simply take their weighted sum by taking the dot product between these two vectors. Then we add our bias, like I said before; the bias is now a single number, w0. Finally, we apply the nonlinearity, which transforms that scalar input into another scalar output, y. So you might now be wondering, what is this thing that I've been referring to as an activation function? I've mentioned it a couple times, and I called it by a couple different names. First it was a nonlinear function, then an activation function. What is it? So one common example of a nonlinear activation function is called the sigmoid function, and you can see one here defined on the bottom right. This is a function that takes as input any real number and outputs a new number between 0 and 1.
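Those three steps (dot product, add a bias, apply a nonlinearity) can be sketched in a few lines of plain Python; the inputs, weights, and bias below are made-up values just for illustration:

```python
import math

def perceptron(x, w, bias):
    """Forward pass of a single perceptron: dot product, add bias, nonlinearity."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + bias  # weighted sum plus bias
    return 1.0 / (1.0 + math.exp(-z))                # sigmoid activation

# Example with made-up inputs and weights.
y = perceptron(x=[1.0, 2.0], w=[0.5, -0.5], bias=0.1)
```

With these made-up values the weighted sum is -0.4, and the sigmoid squashes it to about 0.4.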
So you can see it's essentially collapsing your input into this range of 0 and 1. This is just one example of an activation function, but there are many, many, many activation functions used in neural networks. Here are some common ones. And throughout this presentation, you'll see these TensorFlow code blocks on the bottom, like this one, for example. I'll just be using these as a way to bridge the gap between the theory that you'll learn in this class and some of the TensorFlow that you'll be practicing in the labs later today and through the week. So the sigmoid function, like I mentioned before, which you can see on the left-hand side, is useful for modeling probabilities, because it collapses your input to between 0 and 1. Since probabilities are modeled between 0 and 1, this is actually the perfect activation function for the end of your neural network if you want to predict probability distributions at the end. Another popular option is the ReLU function, which you can see on the far right-hand side. This function is an extremely simple one to compute. It's piecewise linear, and it's very popular because it's so easy to compute, but it has this nonlinearity at z equals 0. So for z less than 0, this function equals 0, and for z greater than 0, it just equals the input. And because of this nonlinearity, it's still able to capture all of the great properties of activation functions while being extremely simple to compute. And now I want to talk a little bit about why we use activation functions at all. A great part of this class is to actually ask questions and not take anything for granted. So if I tell you we need an activation function, the first thing that should come to your mind is: well, why do we need that activation function?
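As a concrete reference, here is what the two activation functions just described look like in plain Python (a sketch, not TensorFlow's own implementations):

```python
import math

def sigmoid(z):
    """Squashes any real number into (0, 1); useful for modeling probabilities."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Piecewise linear: 0 for z < 0, the identity for z >= 0."""
    return max(0.0, z)
```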
So activation functions, the purpose of activation functions is to introduce non-linearities into the network. This is extremely important in deep learning or in machine learning in general because in real life, data is almost always very nonlinear. Imagine I told you to separate here the green from the red points. You might think that's easy, but then what if I told you you had to only use a single line to do it? Well, now it's impossible. That actually makes the problem not only really hard, like I said, it makes it impossible. In fact, if you use linear activation functions in a neural network, no matter how deep or wide your neural network is, no matter how many neurons it has, this is the best that it will be able to do, produce a linear decision boundary between the red and the green points.
Non-linear Activation Functions (14:45)
And that's because it's using linear activation functions. When we introduce a nonlinear activation function, that allows us to approximate arbitrarily complex functions and draw arbitrarily complex decision boundaries in this feature space. And that's exactly what makes neural networks so powerful in practice. So let's understand this with a simple example. Imagine I give you a trained network with weights W on the top here. So w0 is 1, and the W vector is (3, negative 2). This is a trained neural network, and I want to feed a new input into this network. Well, how do we compute the output? Remember from before: we take the dot product, we add our bias, and we compute a nonlinearity. There are three steps. So let's take a look at what's going on here. What's inside of this nonlinear function, the input to the nonlinear function? Well, this is just a 2D line. In fact, we can actually plot this 2D line in what we call the feature space. So on the x-axis, you can see x1, which is the first input. And on the y-axis, you can see x2, which is the second input. This neural network has two inputs.
Perceptron From Scratch Model (16:02)
We can plot the line when it is equal to 0, and you can actually see it in the feature space here. If I give you a new point, a new input to this neural network, you can also plot this new point in the same feature space. So here's the point negative 1, 2. You can plot it like this. And actually, you can compute the output by plugging it into this equation that we created before, this line. If we plug it in, we get 1 minus 3 minus 4, which equals minus 6. That's the input to our activation function. And then when we feed it through our activation function, here I'm using sigmoid again, for example, our final output is 0.002. What does that number mean? Let's go back to this illustration of the feature space again. What this line is doing is essentially dividing the space into two hyperplanes. Remember that the sigmoid function outputs values between 0 and 1, and at z equals 0, when the input to the sigmoid is 0, the output of the sigmoid is 0.5. So essentially, you're splitting your space into two planes: one where z is greater than 0 and one where z is less than 0, which is the same as one where y is greater than 0.5 and one where y is less than 0.5. The two are synonymous. When we're dealing with low-dimensional input data, like here where we're dealing with only two dimensions, we can make these beautiful plots. And these are very valuable in actually visualizing the learning algorithm, visualizing how our output relates to our input. We're going to find very soon that we can't really do this for all problems, because while here we're dealing with only two inputs, in practical applications of deep neural networks we're going to be dealing with hundreds, thousands, or even millions of inputs to the network at any given time. And drawing one of these plots in thousand-dimensional space is going to become pretty tough.
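You can check the worked example above numerically. With w0 = 1 and W = (3, negative 2), plugging in the point (negative 1, 2) gives z = 1 - 3 - 4 = -6, and sigmoid of -6 is roughly 0.002:

```python
import math

w0 = 1.0          # bias from the example
w = [3.0, -2.0]   # trained weights from the example
x = [-1.0, 2.0]   # the new input point

z = w0 + sum(xi * wi for xi, wi in zip(x, w))  # 1 - 3 - 4 = -6
y = 1.0 / (1.0 + math.exp(-z))                 # sigmoid(-6), roughly 0.002
```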
So, now that we have an idea of the perceptron, a single neuron, let's start building neural networks from the ground up using one neuron and see how this all comes together. Let's revisit our diagram of the perceptron. If there are a few things that you remember from this class, I want you to remember this. There are three steps to computing the output of a perceptron: dot product, add a bias, take a nonlinearity. Three steps. Let's simplify the diagram a little bit. I just got rid of the bias and removed the weight labels to keep things simple. And just note here that I'm writing z as the input to the activation function. So this is the weighted combination, essentially, of your inputs. y is then the activation function applied to input z. So the final output, like I said, y, is on the right-hand side here, and it's the activation function applied to this weighted sum. If we want to define a multi-output neural network, now all we have to do is add another perceptron to this picture. Now we have two outputs. Each one is a normal perceptron like we defined before, nothing extra. And each one is taking all the inputs from the left-hand side, computing this weighted sum, adding a bias, and passing it through an activation function. Let's keep going. Now let's take a look at a single-layered neural network. This is one where we have a single hidden layer between our inputs and our outputs. We call it a hidden layer because unlike the input and the output, which are strictly observable, our hidden layer is learned. We don't explicitly enforce any behavior on the hidden layer, and that's why we call it hidden in that sense. Since we now have a transformation from the inputs to the hidden layer, and from the hidden layer to the outputs, we're going to need two weight matrices. So we're going to call them W1, to go from input to hidden layer, and W2, to go from hidden layer to output. But again, the story here is the same.
Dot product, add a bias for each of the neurons, and then compute an activation function. Let's zoom in now to a single hidden unit in this hidden layer. If we look at this single unit, take z2 for example, it is just the same perceptron that we saw before. I'm going to keep repeating myself: we took a dot product with the inputs and we added a bias. And actually, since it's z, we have not applied our activation function yet, so it's just a dot product plus a bias so far. If we took a look at a different neuron, let's say z3 or z4, the idea is going to be the same, but we're probably going to end up with different values for z3 and z4, just because the weights leading from the inputs to each of those neurons are going to be different. So this picture looks a little bit messy, so let's clean things up a little bit more and just replace all of these lines between the layers with these symbols. These symbols denote fully connected layers, where each input to the layer is connected to each output of the layer. Another common name for these is dense layers. And you can actually write this in TensorFlow using just four lines of code. So this neural network, which is a single-layered multi-output neural network, can be built by instantiating your inputs, feeding those inputs into a hidden layer, like I'm doing here, which is just defined as a single dense layer, and then taking those hidden outputs and feeding them into another dense layer to produce your outputs. The final model is defined end to end with that single line at the end, model of inputs and outputs. And that just essentially connects the graph end to end. So now let's keep building on this idea. Now we want to build a deep neural network. What is a deep neural network? Well, it's just one where we keep stacking these hidden layers back to back to back to create increasingly deeper and deeper models.
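The dense layer just described is the perceptron computation repeated once per output unit. Here is a minimal plain-Python sketch of that idea (the layer sizes, random weights, and sigmoid activation are arbitrary choices for illustration, not the TensorFlow code from the slide):

```python
import math, random

def dense(inputs, weights, biases):
    """Fully connected layer: each output unit is a weighted sum of all inputs,
    plus a bias, passed through a sigmoid nonlinearity."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, w_row)) + b
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

random.seed(0)
# Two inputs -> three hidden units -> two outputs (sizes are arbitrary).
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.0] * 3
w2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
b2 = [0.0] * 2

hidden = dense([0.5, -0.3], w1, b1)   # input layer -> hidden layer
output = dense(hidden, w2, b2)        # hidden layer -> output layer
```

Stacking more `dense` calls back to back is exactly what makes the network "deep."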
One where the output is computed by going deeper into the network and computing these weighted sums over and over and over again, with these activation functions repeatedly applied. So this is awesome. Now we have an idea of how to actually build a neural network from scratch, going all the way from a single perceptron, and we know how to compose them to create very complex deep neural networks as well. Let's take a look at how we can apply this to a very real problem that I know a lot of you probably care about. So I was thinking of a potential problem that some of you might care about. Took me a while, but I think this might be one. So at MIT, we care a lot about passing our classes.
Lecture Attendance (22:45)
So I think a very good example is, let's train a neural network to determine if you're going to pass your class. So to do this, let's start with a simple two-input feature model. One feature is the number of lectures that you attend. The other feature is the number of hours that you spend on the final project. Again, since we have two inputs, we can plot this data on a feature map like we did before. Green points here represent previous students from previous years that passed the class. Red points represent students that failed the class. Now, if you want to find out if you're going to pass or fail the class, you can also plot yourself on this map. You came to four lectures, spent five hours on your final project, and you want to know if you're going to pass or fail.
Am I Going To Pass This Class? (23:26)
And you want to actually build a neural network that's going to learn this: look at the previous people that took this course, and determine if you will pass or fail as well. So let's do it. We have two inputs. One is four, one is five. These are fed into a single-layered neural network with three hidden units. And we see that the final output probability that you will pass this class is 0.1, or 10 percent. Not very good. That's actually really bad news. Can anyone guess why? This person was actually in a good part of the feature space; it looked like they were going to pass the class. So why did this neural network give such a bad prediction here? Yeah, exactly. The network was not trained. Essentially, this network is like a baby that was just born. It has no idea of what lectures are. It doesn't know what final labs are. It doesn't know anything about this world. These are just numbers to it. It's been randomly initialized. It has no idea about the problem. So we have to actually train it. We have to teach it how to get the right answer. So the first thing that we have to do is tell the network when it makes a mistake, so that we can correct it in the future. Now, how do we do this in neural networks? The loss of a network is actually what defines when the network makes the wrong prediction. It takes as input the predicted output and the ground truth actual output. If your predicted output and your ground truth output are very close to each other, then that essentially means that your loss is going to be very low. You didn't make a mistake. But if your ground truth output is very far away from your predicted output, that means you should have a very high loss. You have a lot of error, and your network should correct that.
So let's assume that we have data not just from one student now, but from many, many different students passing and failing the class. We now care about how this model does not just on that one student, but across the entire population of students. And we call this the empirical loss. That's just the mean of all of the losses for the individual students: we compute the loss for each of these students and take their mean. When training a network, what we really want to do is not minimize the loss for any particular student; we want to minimize the loss across the entire training set. So if we go back to our problem of predicting if you'll pass or fail the class, this is a problem of binary classification. Your output is 0 or 1. We already learned that when outputs are 0 or 1, you're probably going to want to use a softmax output, and the loss that pairs with it is the cross-entropy loss. For those of you who aren't familiar with cross entropy, this was an idea introduced actually at MIT, in a master's thesis here, over 50 years ago. It's used all over information theory, it's widely used in other areas like thermodynamics, and we use it here in machine learning as well. What it is doing here is essentially computing the loss between this 0, 1 prediction and the true output, whether the student passed or failed the class. Now let's suppose instead of computing a 0, 1 output, we want to compute the actual grade that you will get in the class. So now it's not 0, 1; it's actually a grade, and it could be any number. Now we want to use a different loss, because the output of our neural network is different. And defining losses is actually kind of one of the arts in deep learning.
So you have to define the questions that you're asking so that you can define the loss that you need to optimize over. Here in this example, since we're not optimizing over a 0, 1 loss but over any real number, we're going to use a mean squared error loss. That's just computing the squared error: you take the difference between what you expect the output to be and what your actual output was, you square that difference, and you compute the mean over your entire population. OK, great. So now let's put some of this information together. We've learned how to build neural networks. We've learned how to quantify their loss. Now we can learn how to actually use that loss to iteratively update and train the neural network over time, given some data. And essentially what this boils down to is that we want to find the weights of the neural network, W, that minimize this empirical loss. So remember, again, the empirical loss is the loss over the entire training set. It's the mean loss of all of the individuals in the training set. We want to minimize that loss, and that essentially means we want to find the weights, the parameterization of the network, that result in the minimum loss. Remember again that W here is just a collection, a set of all of the weights in the network. So before, I defined W as W0, W1, and so on: the weights for the first layer, second layer, third layer, et cetera. You stack all of these weights together, you combine them, and you want to solve this optimization problem over all of these weights. So again, remember our loss function. What does our loss function look like?
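Both losses mentioned here, cross entropy for the 0/1 pass-or-fail output and mean squared error for a real-valued grade, can be written straight from their definitions. The predictions and labels below are invented toy numbers:

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean cross-entropy loss over the dataset for 0/1 labels."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences, for real-valued outputs like a grade."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy examples with made-up predictions and labels.
bce = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
mse = mean_squared_error([90.0, 75.0], [85.0, 70.0])
```

Note how both are means over the whole dataset: that is exactly the empirical loss described above.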
Loss Optimization (28:46)
It's just a simple function that takes as inputs our weights. And if we have two weights, we can actually visualize it: we can see on the x-axis one weight, one scalar that we can change, another weight on the y-axis, and on the z-axis our actual loss. We want to find the lowest point in this landscape, which corresponds to the minimum loss, so that we can find the corresponding weights that achieve that minimum loss. So how do we do it? We use this technique called loss optimization through gradient descent. We start by picking an initial point on this landscape, an initial (w0, w1). So here's this point, this black cross. We start at this point, and we compute the gradient at this local point. In this landscape, the gradient tells us the direction of maximal ascent. Now that we know the direction of maximal ascent, we can reverse that gradient and take a small step in the opposite direction. That moves us closer towards the lowest point, because we're taking a greedy approach, moving in the opposite direction of the gradient. We can iteratively repeat this process over and over again, recomputing the gradient each time and moving closer and closer towards the lowest minimum. We can summarize this algorithm, known as gradient descent, by the pseudocode on the left-hand side.
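That summary (initialize randomly, compute the gradient, step the other way) maps directly onto code. This sketch minimizes a toy one-dimensional loss J(w) = (w - 3)^2, whose gradient 2(w - 3) we can write by hand:

```python
import random

def grad(w):
    """Analytic gradient dJ/dw of the toy loss J(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

random.seed(1)
w = random.uniform(-10, 10)   # initialize the weight randomly
eta = 0.1                     # learning rate

for _ in range(200):          # loop until convergence (fixed steps here)
    w = w - eta * grad(w)     # step opposite the gradient direction
```

After these updates, w sits very close to the minimum at w = 3.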
Computing the gradient (30:17)
We start by initializing our weights randomly, computing the gradient dJ/dW, then updating our weights in the opposite direction of that gradient. We use this small amount, eta, which you can see here, and this is essentially what we call the learning rate. It determines how large of a step we take and how much we trust the gradient update that we computed. We'll talk more about this later. But for now, let's take a look at this term here. This gradient, dJ/dW, tells us how the loss changes with respect to each of the weights. But I never actually told you how to compute this term. This is actually a crucial part of deep learning and neural networks in general. Computing this term is essentially all that matters when you try to optimize your network; it's also the most computationally intensive part of training. And it's known as backpropagation. We'll start with a very simple network with one input, one hidden unit, and one output. Computing the gradient of our loss with respect to w2 corresponds to telling us how much a small change in w2 affects our loss. So if we write this out as a derivative, we can start by computing this by simply expanding the derivative using the chain rule, backwards from the loss through the output. And that looks like this: dJ/dw2 becomes dJ/dy times dy/dw2.
And that's just a simple application of the chain rule. Now let's suppose that instead of computing dJ/dw2, we want to compute dJ/dw1. So I've changed the w2 to a w1 on the left-hand side, and now we want to compute this. Well, we can simply apply the chain rule again. We can take that middle term, expand it out again using the same chain rule, and backpropagate those gradients even further back in the network. And essentially, we keep repeating this for every weight in the network, using the gradients for later layers to backpropagate those errors back to the original input. We do this for all of the weights, and that gives us our gradient for each weight. Yeah? So how can you ensure that this gives you the absolute min instead of, like, a local min? Like, if you go down and it stops going down, that doesn't mean there's not somewhere else that's deeper after you've gone up again. Yeah, you're completely right. So the question is: how do you ensure that this gives you a global minimum instead of a local minimum? You don't. We have no guarantees that this is not just a local minimum. Stochastic gradient descent is a greedy optimization algorithm, so you're only taking this greedy approach and optimizing towards a local minimum. There are extensions of stochastic gradient descent that don't take a greedy approach; they take an adaptive approach and look around a little bit. These are typically more expensive to compute. Stochastic gradient descent is extremely cheap to compute in practice, and that's one of the reasons it's used. The second reason is that in practice, local minima tend to be sufficient. So that's the backpropagation algorithm. In theory, it sounds very simple; it's just an application of the chain rule. But now let's touch on some insights about training these neural networks in practice that make it incredibly complex.
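For a tiny network like the one above, you can verify the chain-rule gradients numerically. This sketch assumes a sigmoid activation at both the hidden unit and the output and a squared-error loss (the input, target, and weight values are made-up numbers), and checks the backpropagated gradients against finite differences:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    a = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * a)   # network output
    return a, y

def loss(x, t, w1, w2):
    _, y = forward(x, w1, w2)
    return (y - t) ** 2

# Made-up input, target, and weights for the check.
x, t, w1, w2 = 0.5, 1.0, 0.3, -0.8

# Backpropagation: apply the chain rule backward from the loss.
a, y = forward(x, w1, w2)
dJ_dy = 2.0 * (y - t)                  # derivative of (y - t)**2
dy_dz2 = y * (1.0 - y)                 # sigmoid derivative at the output
dJ_dw2 = dJ_dy * dy_dz2 * a            # dJ/dw2 = dJ/dy * dy/dz2 * dz2/dw2
dJ_dw1 = dJ_dy * dy_dz2 * w2 * a * (1.0 - a) * x  # one more chain-rule step

# Finite-difference sanity check of both gradients.
eps = 1e-6
num_dw2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
num_dw1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
```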
Learning rate (33:59)
In practice, training neural networks is incredibly difficult. And this gets back to that previous point, that previous question that was raised. This is a visualization of the loss landscape of a neural network in practice. It's from a paper from about a year ago, where the authors visualize what a deep neural network's loss landscape really looks like. You can see many, many local minima here. Minimizing this loss and finding the true optimal minimum is extremely difficult. Now recall the update equation that we defined for gradient descent previously: we take our weights and move them in the direction of the negative gradient. I didn't talk too much about this parameter, eta. This is what we call the learning rate; I briefly touched on it. It essentially determines how large a step we take at each iteration. In practice, setting the learning rate can be extremely difficult, and it is very important for making sure that you avoid local minima. If we set the learning rate too small, the model may get stuck in a local minimum like this one; it can also converge very slowly, even when it is heading toward the global minimum. If we set the learning rate too large, the updates overshoot, the gradient essentially explodes, and we diverge from the minimum, which is also bad. Setting the learning rate to the right value, such that we overshoot the shallow local minima, settle into a reasonable minimum, and then converge within it, can be extremely tedious in practice. How can we do this in a clever way? One option is to try a lot of different learning rates and see what works best. And in practice, this is actually a very common technique: a lot of people just try many learning rates and see what works best.
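The too-small versus too-large trade-off can be seen even on a trivially simple loss. This is an illustrative sketch (assumed toy loss J(w) = w^2, not the lecture's landscape), comparing three learning rates:

```python
# Illustrative sketch: the effect of the learning rate eta when minimizing
# the simple quadratic J(w) = w^2, whose gradient is 2w.

def descend(eta, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w    # gradient descent step
    return w

small = descend(eta=0.01)   # converges, but very slowly
good = descend(eta=0.1)     # converges quickly toward w = 0
large = descend(eta=1.1)    # overshoots every step and diverges

print(abs(small) > abs(good))   # the small eta is still far from the minimum
print(abs(large) > 1.0)         # the large eta has diverged past the start
```

On this quadratic, each step multiplies w by (1 - 2 * eta), so any eta above 1 flips the sign and grows the magnitude every iteration, which is the divergence described above.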
Let's see if we can do something a bit smarter than that. How about we design an adaptive algorithm that adjusts its learning rate according to the loss landscape? It can take into account the gradient at other locations in the loss, how fast we're learning, how large the gradient is at the current location, and many other factors. And since the learning rate is no longer fixed across all iterations of gradient descent, we have a bit more flexibility in learning. In fact, this has been widely studied.
Adaptive Optimizers (36:34)
There are many, many different optimization schemes available in TensorFlow, and here are examples of some of them. During your labs, I encourage you to try out different ones of these optimizers and see how they differ: which works best, and which doesn't work so well, for your particular problem. They're all adaptive in nature. Now I want to continue talking about tips for training these networks in practice and focus on the very powerful idea of batching your data during gradient descent. To do this, let's revisit the idea of gradient descent very quickly. The gradient is actually very expensive to compute. The backpropagation algorithm, if you want to run it over all of the data samples in your training set, which may be massive in modern data sets, essentially amounts to a summation over all of those data points. In most real-life problems this is extremely expensive and not feasible to compute on every iteration. So instead, people have come up with the idea of stochastic gradient descent.
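As one hedged example of what "adaptive" means here, below is a minimal standalone sketch of the Adam update rule, one of the optimizers TensorFlow provides (as `tf.keras.optimizers.Adam`). The toy loss and the hyperparameter values are assumptions; the point is just that the effective step size adapts to running estimates of the gradient's mean and magnitude:

```python
import math

# Minimal sketch of the Adam update rule, applied to the toy loss
# J(w) = w^2 (gradient 2w). The constants b1, b2, eps are the usual defaults.

def adam_minimize(grad, w, steps, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m, v = 0.0, 0.0                      # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # momentum-style running mean
        v = b2 * v + (1 - b2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)        # bias-corrected estimates
        v_hat = v / (1 - b2 ** t)
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step
    return w

w_final = adam_minimize(lambda w: 2.0 * w, w=5.0, steps=2000)
print(round(w_final, 2))   # close to the minimum at w = 0
```

Dividing by the running gradient magnitude means the step size is roughly eta regardless of how steep or flat the landscape currently is, which is what makes the learning rate effectively adaptive.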
Stochastic Gradient Descent (37:40)
And that involves picking a single point in your data set, computing the gradient with respect to that point, and then using that to update your weights. This is great because computing the gradient of a single point is much easier than computing it over many points. But at the same time, since we're only looking at one point, the estimate can be extremely noisy. Sure, we take a different point each time, but when we take a step in the direction suggested by that single point, we may be stepping in a direction that's not representative of the entire data set. So is there a middle ground, such that we don't have such a noisy stochastic gradient but can still be computationally efficient?
Key Phases And Caution Points In Neural Network Learning
So instead of computing a noisy gradient from a single point, let's get a better estimate by batching our data into mini-batches of B data points, capital B data points. This gives us an estimate of the true gradient by averaging the gradients from each of these points. This is great because it's much easier to compute than full gradient descent; it involves far fewer points, with B typically on the order of 100 or less. And it's much more accurate than stochastic gradient descent, because you're considering a larger sample as well. This increase in the accuracy of the gradient estimate actually allows us to converge much more quickly, because we can increase our learning rate and trust each gradient step more, which ultimately means that we can train faster. It also allows for massively parallel computation: we can split mini-batches across the GPU, compute their gradients simultaneously, and then aggregate them to speed things up even further. Now the last topic I want to address before ending is the idea of overfitting. This is one of the most fundamental problems in machine learning as a whole, not just deep learning, and at its core it involves understanding the complexity of your model. You want to build a model that performs well and generalizes well, not just to your training set but to your test set as well. Assume that you want to build a model that describes these points. You can go with the left-hand side, which is just fitting a line through the points. This is underfitting: the complexity of your model is not large enough to learn the full complexity of the data.
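The mini-batch estimate described above can be sketched as follows (illustrative code with an assumed linear-regression setup, not the course's code). Note that B = 1 recovers stochastic gradient descent and B = len(data) recovers full-batch gradient descent:

```python
import random

# Estimating the gradient for linear regression y = w * x from a mini-batch
# of B points, then taking gradient steps with that estimate.

def minibatch_gradient(data, w, B):
    batch = random.sample(data, B)
    # Average the per-example gradients of the squared error 0.5*(w*x - y)^2.
    return sum((w * x - y) * x for x, y in batch) / B

random.seed(0)
true_w = 2.0
data = [(x / 10.0, true_w * (x / 10.0)) for x in range(1, 101)]

w, eta, B = 0.0, 0.01, 32
for _ in range(500):
    w -= eta * minibatch_gradient(data, w, B)   # one mini-batch update

print(round(w, 2))   # recovers the true weight, 2.0
```

Because the mini-batch average has much lower variance than a single-point gradient, each step points much closer to the true descent direction, which is exactly why a larger learning rate becomes safe to use.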
Or you can go with the right-hand side, which is overfitting: building a model so complex that it essentially memorizes the data. This is not useful either, because new data won't perfectly match the training data, which means you're going to have high generalization error. Ideally, we want to end up with a model in the middle: not so complex that it memorizes all of our training data, but still able to generalize and perform well even on brand new test inputs. So to address this problem, let's talk about regularization for deep neural networks. Regularization is a technique that you can introduce to your networks to discourage complex models from being learned. As before, we've seen that it's crucial for our models to generalize to data beyond the training set, to the testing set as well. The most popular regularization technique in deep learning is a very simple idea called dropout.
Let's revisit this picture of a deep neural network again. In dropout, during training we randomly set some of the activations of the hidden neurons to zero with some probability. That's why we call it dropping out: we're essentially killing off those neurons. So let's do that. We kill off a random sample of neurons, and now we've created a different pathway through the network. Say we drop 50% of the neurons, meaning those activations are set to zero. The network then can't rely too heavily on any single path through the network; instead, it effectively learns a whole ensemble of different paths, because it doesn't know which path is going to be dropped out at any given time. We repeat this process on every training iteration, dropping out a new set of 50% of the neurons each time. The result is, like I said, a model that behaves like an ensemble of multiple models, one per pathway through the network, and is able to generalize better to unseen test data. The second regularization technique we'll talk about is early stopping. The idea here is also extremely simple: let's train our neural network as before, with no dropout, but just stop training before we have a chance to overfit. So we start training. Overfitting begins when our model starts to perform worse on the test set even as it keeps improving on the training set. We can plot how our loss evolves on both the training and test sets. We see that both are decreasing, so we keep training. Then both losses start to plateau. We keep going. The training loss will always keep decreasing, because if your network has enough capacity to essentially memorize your data, it can always drive the training error toward zero.
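As a hedged illustration of the mechanism (assumed details, not the lecture's code), here is "inverted" dropout applied to one layer's activations. The surviving activations are scaled by 1/(1-p) so that their expected value is unchanged, which lets the network run without any rescaling at test time; TensorFlow's `tf.keras.layers.Dropout` follows the same convention:

```python
import random

# Sketch of (inverted) dropout on a layer's activations. Each activation is
# zeroed with probability p during training; survivors are scaled by
# 1 / (1 - p) so the expected value of each activation is unchanged.

def dropout(activations, p, training=True):
    if not training:
        return list(activations)        # no dropout at test time
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
h = [0.5, -1.2, 3.0, 0.8, -0.1, 2.2]
print(dropout(h, p=0.5))   # roughly half the entries are zeroed
```

Each training iteration draws a fresh mask, so a different random subnetwork is active every step, which is what produces the implicit ensemble described above.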
That's not always the case, but often with deep neural networks, since they're so expressive and have so many weights, they really can memorize the data if you let them train for too long.
Brief Break (43:43)
If we keep training, as you can see, the training loss continues to decrease while the validation loss starts to increase, and if we keep going, the trend continues. The idea of early stopping is that we want to identify this point here and stop training when we reach it. So we keep records of the model during training, and once we start to detect overfitting, we stop and take the last model saved from before the overfitting began. So on the left-hand side of that point, you can see the underfitting.
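The record-keeping described above can be sketched as follows (illustrative code with synthetic loss values, not from the lecture); TensorFlow packages the same idea as `tf.keras.callbacks.EarlyStopping`:

```python
# Sketch of early stopping with "patience": remember the best model seen so
# far, and stop once the validation loss hasn't improved for `patience`
# epochs. The loss values here are synthetic stand-ins for real training.

def train_with_early_stopping(val_losses, patience=3):
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0  # checkpoint here
        else:
            waited += 1
            if waited >= patience:
                break                # validation loss stopped improving
    return best_epoch, best_loss

# Validation loss falls, plateaus, then rises as the model starts to overfit.
val = [1.0, 0.6, 0.4, 0.35, 0.34, 0.36, 0.40, 0.48, 0.60]
epoch, loss = train_with_early_stopping(val)
print(epoch, loss)   # stops and keeps the epoch-4 checkpoint (loss 0.34)
```

The patience parameter guards against stopping on a brief plateau: training only halts after the validation loss has failed to improve for several consecutive epochs.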
You don't want to stop too early; you want to let the model reach its minimum validation loss. But you also don't want to keep training to the point where the validation loss starts to increase on the other side. So I'll conclude this first lecture by summarizing three key points that we've covered so far. First, we learned about the fundamental building block of deep learning, the single neuron, or perceptron. Second, we learned how to stack these neurons into complex deep neural networks, how to backpropagate errors through them, and how to optimize complex loss functions. And finally, we discussed some of the practical details and tricks for training neural networks that are really crucial today if you want to work in this field, such as batching, regularization, and others. So now I'll take any questions. Or if there are no questions, then I'm going to hand the mic over to Ava, who will talk about sequence modeling. Thank you.