
## Opening Statement

### Introduction (00:00)

Okay. Good afternoon everyone, and thank you all for joining today. I'm super excited to welcome you all to MIT 6S191, Introduction to Deep Learning. My name is Alexander Amini, and I'm going to be your instructor this year along with Ava Soleimani. Now, 6S191 is a really fun and fast-paced class. For those of you who are not really familiar, I'll start by giving you a bit of background on what deep learning is and what this class is all about, because we're going to cover a ton of material in today's class, and this class is only one week in total. In just that one week, you're going to learn about the foundations of this really remarkable field of deep learning and get hands-on experience and practical knowledge through software labs using TensorFlow. Now, I like to tell people that 6S191 is like a one-week boot camp in deep learning. That's because of the amount of information that you're going to learn over the course of this one week. I'll start by asking a very simple question: what is deep learning? Instead of giving you some boring technical answer and description of what deep learning is, the power of deep learning, and why this class is so amazing, I'll start by actually showing you a video of someone else doing that instead. So let's take a look at this first. "Hi, everybody, and welcome to MIT 6S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing so many fields, from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build some of these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence. And in this class, you'll learn how. It has been an honor to speak with you today, and I hope you enjoy the course." So in case you couldn't tell, that video was actually not real at all.
That was not real video or real audio. In fact, the audio you heard was purposely degraded even further by us to make it look and sound less real and avoid any potential misuse. Now this is really a testament to the power of deep learning to create such high-quality and highly realistic videos, and to the quality of the models for generating those videos. Even with this purposely degraded audio, we always show that intro, and we always get a ton of really exciting feedback from our students about how excited they are to learn about the techniques and the algorithms that drive forward that type of progress. And the progress in deep learning is really remarkable, especially in the past few years. The ability of deep learning to generate this very realistic data extends far beyond generating realistic videos of people like you saw in this example. Now we can use deep learning to generate full simulated environments of the real world. So here's a bunch of examples of fully simulated virtual worlds generated using real data and powered by deep learning and computer vision. This simulator is fully data-driven, we call it. And within these virtual worlds, you can place virtual simulated cars for training autonomous vehicles, for example. This simulator was designed here at MIT, and when we created it, we showed the first occurrence of using a technique called end-to-end training using reinforcement learning: training an autonomous vehicle entirely in simulation and having that vehicle controller deployed directly into the real world, on real roads, on a full-scale autonomous car.
Now we're actually releasing this simulator open source this week, so all of you as students in 6S191 will have first access not only to use this type of simulator as part of your software labs and generate these types of environments, but also to train your own autonomous controllers to drive in these types of environments, controllers that can be directly transferred to the real world. In fact, in software lab three, you'll get the ability to do exactly this. And this is a super exciting addition to 6S191 this year, because all of you as students will be able to enter this competition where you can submit your best deep learning models to drive in these simulated environments. The winners will be invited and given the opportunity to deploy their models on board a full-scale self-driving car in the real world. So we're really excited about this, and I'll talk more about that in the software lab section. So now hopefully all of you are super excited about what this class will teach you, so let's start by taking a step back and defining some of the terms that you've probably been hearing a lot about. I'll start with the word intelligence. Intelligence is the ability to process information: to take as input a bunch of information and make some informed future decision or prediction. So the field of artificial intelligence is simply the ability for computers to do that, to take as input a bunch of information and use that information to inform some future situations or decision-making. Now, machine learning is a subset of AI, or artificial intelligence, specifically focused on teaching a computer or teaching an algorithm how to learn from experiences, how to learn from data, without being explicitly programmed how to process that input information.
Now, deep learning is simply a subset of machine learning as a whole, specifically focused on the use of neural networks, which you're gonna learn about in this class, to automatically extract useful features and patterns in the raw data and use those patterns or features to inform the learning task. So to inform those decisions, you're going to try to first learn the features and learn the inputs that determine how to complete that task.

## Overview Of Neural Networks And Its Concepts

### Course information (06:35)

And that's really what this class is all about. It's how we can teach algorithms, teach computers, how to learn a task directly from raw data. So just by being given a data set of a bunch of examples, how can we teach a computer to also complete that task like we see in the data set? Now this course is split between technical lectures and software labs, and we'll have several new updates in this year's edition of the class, especially in some of the later lectures. In this first lecture, we'll cover the foundations of deep learning and neural networks, starting with the building block of neural networks, which is just the single neuron. And finally, we'll conclude with some really exciting guest lectures and student projects from all of you. As part of the final prize competition, you'll be eligible to win a bunch of exciting prizes and awards. For those of you who are taking this class for credit, you'll have two options to fulfill your credit requirement. The first option is a project proposal, where you'll get to work either individually or in groups of up to four people and develop some cool new deep learning idea. Doing so will make you eligible for some of these awesome sponsored prizes. Now we realize that one week is a super short and condensed amount of time to make any tangible code progress on a deep learning project. So what we're actually going to be judging you on here is not your results, but rather the novelty of your ideas and whether we believe that you could actually execute these ideas in practice, given the state of the art today.
Now, on the last day of the class, we'll give you all a three-minute presentation where your group can present your idea and potentially win an award. There's actually an art, I think, to presenting an idea in such a short amount of time, and we're also going to be judging you on that, to see how quickly and effectively you can convey those ideas. Now, the second option to fulfill your credit requirement is just to write a one-page review of any deep learning paper, and this will be due on the last Thursday of the class. In addition to the final project prizes, we'll also be awarding prizes for the top lab submissions for each of the three labs, and like I mentioned before, this year we're also holding a special prize for Lab 3, where students will be able to deploy their results onto a full-scale self-driving car in the real world. For support in this class, please post all of your questions to Piazza. Check out the course website and the course Canvas for announcements; digital recordings of the lectures and labs will be available on Canvas shortly after each of the classes. This course has an incredible team that you can reach out to if you ever have any questions, either through Canvas or through the email listed at the bottom of the slide. Feel free to reach out. And we really want to give a huge shout-out and thanks to all of our sponsors; without their support, this class would not be possible. This is our fifth year teaching the class, and we're super excited to be back again, teaching such a remarkable field and exciting content.

### Why deep learning? (09:51)

So now let's start with some of the exciting stuff, now that we've covered all of the logistics of the class. Let's start by asking ourselves a question. Why do we care about this, and why did all of you sign up to take this class? Why do you care about deep learning? Well, traditional machine learning algorithms typically operate by defining a set of rules or features in the environment, in the data. Usually these are hand engineered: a human will look at the data and try to extract some hand-engineered features from it. Now in deep learning, we're actually trying to do something a little bit different. The key idea of deep learning is that these features are going to be learned directly from the data itself in a hierarchical manner. So this means that given a data set and, let's say, a task to detect faces, can we train a deep learning model to take as input a face and start to detect it by first detecting edges, for example, very low-level features, building up those edges to build eyes and noses and mouths, and then building up some of those smaller components of faces into larger facial structure features? As you go deeper and deeper into a neural network architecture, you'll actually see its ability to capture these types of hierarchical features. And that's the goal of deep learning compared to traditional machine learning: the ability to learn and extract these features in order to perform machine learning on them. Now, the fundamental building blocks of deep learning and their underlying algorithms have actually existed for decades. So why are we studying this now? Well, for one, data has become much more prevalent. Data is really the driving power of a lot of these algorithms, and today we're living in the world of big data, where we have more data than ever before.
Now, second, these models and these algorithms, neural networks, are extremely and massively parallelizable. They have benefited tremendously from modern advances in GPU architectures that we have experienced over the past decade. And these types of GPU architectures simply did not exist when these algorithms were created. For example, the idea for the foundational neuron was created in almost 1960. So when you think back to 1960, we simply did not have the compute that we have today. And finally, due to amazing open source toolboxes like TensorFlow, building and deploying these algorithms and models has become extremely streamlined.

### The perceptron (12:30)

So let's start with the fundamental building block of a neural network, and that is just a single neuron. Now the idea of a single neuron, or let's call this a perceptron, is actually extremely intuitive. Let's start by defining how a single neuron takes information as input and outputs a prediction. So we're just looking at its forward pass, its forward prediction call, from inputs on the left to outputs on the right. We define a set of inputs; let's call them x1 to xm. Each of these numbers on the left, in the blue circles, is multiplied by its corresponding weight, and then they are all added together. We take the single number that comes out of this addition and pass it through a nonlinear function. We call this the activation function, and we'll see why in a few slides. The output of that function is going to give us our prediction y. Well, this is actually not entirely correct. I forgot one piece of detail here: we also have a bias term, which here I'm calling w0. Sometimes you also see it as the letter b. The bias term allows us to shift the input to our activation function to the left or to the right. Now, on the right side here, you can see the diagram on the left written out in mathematical form as a single equation, and we can rewrite this equation using linear algebra, in terms of vectors and dot products. So let's do that. Here we're going to collapse x1 to xm into a single vector called capital X, and capital W will denote the vector of the corresponding weights, w1 to wm. The output here is obtained by taking their dot product, adding a bias, and applying this non-linearity. And that's our output y. So now you might be wondering: the only missing piece here is, what is this activation function?
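
Written out, the forward pass just described is the equation below, first as a sum and then in vector form. The hat on $\hat{y}$ (for "prediction") and the letter $g$ for the activation function are notational assumptions; the lecture refers to them as the prediction y and the non-linearity.

```latex
\hat{y} \;=\; g\!\left( w_0 + \sum_{i=1}^{m} x_i \, w_i \right)
        \;=\; g\!\left( w_0 + X^{T} W \right),
\qquad
X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix},
\quad
W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}
```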

### Activation functions (14:31)

Well, I said it's a nonlinear function, but what does that actually mean? Here's an example of one common function that people use as an activation function, on the bottom right. This is called the sigmoid function, and it's defined mathematically above its plot here. In fact, there are many different types of non-linear activation functions used in neural networks. Here are some common ones. Throughout this entire presentation, you'll also see these TensorFlow code blocks on the bottom part of the screen, just to briefly illustrate how you can take the technical concepts that you're learning as part of this lecture and extend them into practical software. These TensorFlow code blocks are going to be extremely helpful for some of your software labs, to bridge the connection between the foundations set up in the lectures and the practical side with the labs. Now the sigmoid activation function, which you can see on the left-hand side, is popular, like I said, largely because it's one of the few functions in deep learning that outputs values between 0 and 1. This makes it extremely suitable for modeling things like probabilities, because probabilities also exist in the range between 0 and 1. So if we want to output a probability, we can simply pass our value through a sigmoid function, and that will give us something that resembles a probability that we can use to train with. Now, in modern deep learning neural networks, it's also very common to use what's called the ReLU function, and you can see an example of this on the right. This is extremely popular. It's a piecewise linear function with a single nonlinearity at x equals 0. Now, I hope all of you are kind of asking this question to yourselves. Why do you even need activation functions? What's the point? What's the importance of an activation function? Why can't we just directly pass the linear combination of our inputs with our weights through to the output?
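
As a small illustration of the two activation functions named above, here is a plain-Python sketch. It is not the TensorFlow code shown on the slides; the function names `sigmoid` and `relu` are chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)

def relu(z: float) -> float:
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4), relu(z))
```

The sigmoid passes through 0.5 at z = 0, which is why it reads naturally as a probability, while ReLU is linear everywhere except for the kink at 0.
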
Well, the point of an activation function is to introduce a non-linearity into our system. Now, imagine I told you to separate the green points from the red points, and that's the thing that you want to train. And you only have access to one line. It's not non-linear. So you only have access to a line. How can you do this? Well, it's an extremely hard problem then, right? And in fact, if you can only use a linear activation function in your network, no matter how many neurons you have or how deep is the network, you will only be able to produce a result that is one line, because when you add a line to a line, you still get a line output. Nonlinearities allow us to approximate arbitrarily complex functions, and that's what makes neural networks extremely powerful.
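
The claim that stacking linear layers still gives a line can be checked directly. The sketch below uses one input and arbitrary made-up slopes and intercepts; it composes two linear "layers" and compares the result with a single line.

```python
def linear(a: float, b: float):
    """Return the line f(x) = a * x + b as a function."""
    return lambda x: a * x + b

# Two "layers" with no nonlinearity between them; the numbers are arbitrary.
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)

def stacked(x: float) -> float:
    return layer2(layer1(x))

# The same mapping as ONE line: slope -3 * 2 = -6, intercept -3 * 1 + 0.5 = -2.5.
collapsed = linear(-6.0, -2.5)

for x in (-1.0, 0.0, 2.0):
    print(x, stacked(x), collapsed(x))
```

However many such layers are stacked, the slopes multiply and the intercepts fold together, so the network never becomes more expressive than a single line. Placing a nonlinearity between the layers is what breaks this collapse.
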

### Perceptron example (17:03)

Let's understand this with a simple example. So imagine I give you a trained network now. Here I'm giving you the weights, and the weights w are on the top right. So w0 is going to be set to 1. That's our bias. And the w vector, the weights of our input dimensions, is going to be a vector with the values 3 and negative 2. This network only has two inputs, x1 and x2, and if we want to get the output of it, we simply do the same steps as before. I want to keep drilling in this message: to get the output, all we have to do is take our inputs, multiply them by our corresponding weights w, add the bias, and apply a non-linearity. It's that simple. But let's take a look at what's actually inside that nonlinearity. When I do that multiplication and addition, what comes out? It's simply a weighted combination of the inputs in the form of a 2D line. So we take our inputs, x transposed, multiply them as a dot product with our weights, and add a bias. If we look at what's inside the parentheses here, what is getting passed to g, this is simply a two-dimensional line, because we have two inputs, x1 and x2. So we can actually plot this line in feature space, or input space, we'll call it, where along the x-axis is x1 and along the y-axis is x2. And we can plot the decision boundary, we call it, of the input to this activation function. This is actually the line that defines our perceptron neuron. Now if I give you a new data point, let's say x equals (negative 1, 2), we can plot this data point in this two-dimensional space, and we can also see where it falls with respect to that line. Now, if I want to compute its weighted combination, I simply follow the perceptron equation to get 1 minus 3 minus 4, which equals minus 6. And when I put that into a sigmoid activation function, we get a final output of approximately 0.002. Now, why is that the case?
So assume we have this input, negative 1 and 2, and this is just going through the math again. We pass that through our equations and we get this output from g. Let's dive a little bit more into this feature graph. Remember, if the sigmoid function is defined in the standard way, it's outputting values between 0 and 1, and the middle is at 0.5. So anything on the left-hand side of this line in feature space is going to correspond to the input to the activation, z, being less than 0 and the output, y, being less than 0.5. On the other side is the opposite: that's corresponding to our activation input, z, being greater than 0 and our output, y, being greater than 0.5. So this is just following all of the sigmoid math, but illustrating it in pictorial form, in schematics. Now, in practice, neural networks don't have just two weights, w1 and w2. They're composed of millions and millions of weights. So you can't really draw these types of plots for the types of neural networks that you'll be creating. But this is to give you an example of a single neuron with a very small number of weights, where we can actually visualize these types of things to gain some more intuition about what's going on under the hood.
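
The arithmetic of this example can be reproduced in a few lines of plain Python. The numbers (bias 1, weights 3 and negative 2, input negative 1 and 2) come from the lecture; the variable names are chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

w0 = 1.0           # bias
w = [3.0, -2.0]    # weights for x1 and x2
x = [-1.0, 2.0]    # the new data point

# Weighted combination: 1 + (3 * -1) + (-2 * 2) = 1 - 3 - 4 = -6
z = w0 + sum(xi * wi for xi, wi in zip(x, w))
y = sigmoid(z)

print(z)
print(round(y, 4))
```

Because z is negative, the point lies on the side of the decision boundary where the sigmoid output is below 0.5, and the output is close to the roughly 0.002 quoted above.
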

### From perceptrons to neural networks (20:25)

So now that we have an idea about the perceptron, let's start building neural networks from this foundational building block and see how all of this story starts to come together. Let's revisit our previous diagram of the perceptron. If there are a few things I want you to take away from this class and this lecture today, I want it to be this thing here. I want you to remember how a perceptron works, and I want you to remember three steps: dot product your inputs with your weights, add a bias, and apply a non-linearity. That defines your entire perceptron forward propagation, all the way down, in just three operations. Now, let's simplify the diagram a little bit, now that we've got the foundations down. I'll remove all of the weight labels, so now it's assumed that every line, every arrow, has a corresponding weight associated with it. I'll remove the bias term for simplicity as well; you can see that right here. And note that z, the result of our dot product plus our bias, is the value before we apply the nonlinearity. So g of z is our output, the prediction of the perceptron. Our final output is simply our activation function, g, taking as input that state, z. If we want to define a multi-output neural network, so now we don't have one output y, let's say we have two outputs, y1 and y2, we simply add another perceptron to this diagram. Now we have two outputs. Each one is a normal perceptron, just like we saw before. Each one is taking inputs from x1 to xm, from the x's, and multiplying them by the weights. They have two different sets of weights because they're two different neurons, two different perceptrons. They're going to add their own biases, and then they're going to apply the activation function. So you'll get two different outputs, because the weights are different for each of these neurons.
If we want to define, let's say, this entire system from scratch now using TensorFlow, we can do this very, very simply just by following the operations that I outlined in the previous slide. So let's start with a single dense layer. A dense layer just corresponds to a layer of these neurons, so not just one neuron or two neurons but an arbitrary number, let's say n neurons. In our dense layer, we're going to have two sets of variables. One is the weight vector and one is the bias. We can define both of these types of variables as part of our layer. The next step is to define the forward pass. Remember, we talked about the operations that define this forward pass of a perceptron and of a dense layer. It's composed of the steps that we talked about. First, we compute the matrix multiplication of our inputs with our weight vector, so inputs multiplied by W, then add a bias, plus b, and feed it through our activation function. Here I'm choosing a sigmoid activation function, and then we return the output. And that defines a dense layer of a neural network. Now we have this dense layer. We can implement it from scratch like we've seen in the previous slide, but we're pretty lucky because TensorFlow has already implemented this dense layer for us, so we don't have to write that additional code. Instead, let's just call it. Here we can see an example of calling a dense layer with the number of output units set equal to 2. Now, let's dive a little bit deeper and see how we can make a full single-layered neural network, with not just a single layer but also an output layer as well. This is called a single-hidden-layer neural network. We call this a hidden layer because these red states in the middle are not directly observable or enforceable, unlike the inputs, which we feed into the model, and the outputs, which we know we want to predict.
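
The from-scratch dense layer described above can be sketched without TensorFlow as follows. This is a plain-Python illustration, not the code from the slide: the class name `DenseLayer`, the random normal weight initialization, and the zero biases are all assumptions made here.

```python
import math
import random

class DenseLayer:
    """A layer of n_out sigmoid neurons, each connected to all n_in inputs."""

    def __init__(self, n_in: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)
        # One weight per (input, neuron) pair and one bias per neuron.
        # Random normal weights and zero biases are illustrative choices.
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]
        self.b = [0.0] * n_out

    def __call__(self, inputs):
        outputs = []
        for j in range(len(self.b)):
            # Dot product of the inputs with neuron j's weights, plus its bias.
            z = sum(x * row[j] for x, row in zip(inputs, self.W)) + self.b[j]
            outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
        return outputs

layer = DenseLayer(n_in=3, n_out=2)
print(layer([0.5, -1.0, 2.0]))
```

Each output is one perceptron: the same inputs, a different column of weights, its own bias, and the same activation.
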
So since we now have this transformation from the inputs to the hidden layer and from the hidden layer to the output layer, we now need two sets of weight matrices, W1 for the input layer and W2 for the output layer. Now, if we look at a single unit in this hidden layer, let's take this second unit for example, z2, it's just the same perceptron that we've been seeing over and over in this lecture already. We saw before that it's obtaining its value by taking a dot product of its weights with those x's, its inputs, and adding a bias, and that gives us z2. If we took a different hidden node, like z3 for example, it would have a different output value, just because the weights leading to z3 are probably going to be different from the weights leading to z2. And we basically start them out different, so we have diversity in the neurons. Now this picture looks a little bit messy, so let me clean it up a little bit more, and from now on I'll just use this symbol in the middle to denote what we're calling a dense layer. Dense is called dense because every input is connected to every output in a fully connected way, so sometimes you'll also hear this called a fully connected layer. To define this fully connected network, or dense network, in TensorFlow, you can simply stack your dense layers one after another in what's called a sequential model. A sequential model is something that feeds your inputs sequentially from inputs to outputs. So here we have two layers: the hidden layer, defined first with n hidden units, and our output layer with two output units. And if we want to create a deep neural network, it's the same thing: we just keep stacking these hidden layers on top of each other in a sequential model, and we can create more and more hierarchical networks. This network, for example, is one where the final output, in purple, is computed by going deeper and deeper into the layers of this network.
And if we want to create a deep neural network in software, all we need to do is stack those software blocks over and over and create more hierarchical models. Okay, so this is awesome.
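
To illustrate the stacking idea without TensorFlow, here is a plain-Python sketch in which each layer is a function and a "sequential" model simply applies them in order. The helper names `dense` and `sequential`, the layer sizes, and the random initialization are assumptions for illustration, not the library's API.

```python
import math
import random

def dense(n_in: int, n_out: int, rng: random.Random):
    """Build one dense sigmoid layer and return it as a function."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out

    def forward(x):
        zs = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
        return [1.0 / (1.0 + math.exp(-z)) for z in zs]

    return forward

def sequential(layers):
    """Feed the input through each layer in order, from inputs to outputs."""
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward

rng = random.Random(0)
# Two inputs, two hidden layers of three units each, two outputs.
model = sequential([dense(2, 3, rng), dense(3, 3, rng), dense(3, 2, rng)])
print(model([1.0, -2.0]))
```

Making the network deeper only means adding more entries to the list; the forward pass itself does not change.
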

### Applying neural networks (26:37)

Now we have an idea, and we've seen an example, of how we can take a very simple and intuitive mechanism, a single neuron, a single perceptron, and build that up into layers and complete, complex neural networks. Let's take a look at how we can apply them to a very real and practical problem that maybe some of you have thought about before coming to today's class. Here's the problem that I would want to train an AI to solve if I were a student in this class: will I pass this class? That's the problem that we're going to ask our machine, our deep learning algorithm, to answer for us. To do that, let's start by defining some input features to the AI model. One feature to learn from is the number of lectures that you attend as part of this course. And the second feature is the number of hours that you're going to spend developing your final project. Because this is our fifth year teaching this amazing class, we can collect a bunch of data from past years on how previous students performed here. So each dot corresponds to a student who took this class. We can plot each student in this two-dimensional feature space, where on the x-axis is the number of lectures they attended, and on the y-axis is the number of hours that they spent on the final project. The green points are the students who passed, and the red points are those who failed. And then there's you. You lie right here, at the point (4, 5). So you've attended four lectures and you've spent five hours on your final project. You now want to build a neural network to determine, given everyone else's standing in the class, will I pass or fail this class? Now, let's do it. So we have these two inputs. One is 4, one is 5. These are your inputs.
And we're going to feed these into a single-layered neural network with three hidden units. We'll see that when we feed them through, we get a predicted probability of you passing this class of 0.1, or 10%. So that's pretty bad because, well, you're not going to fail the class. You're actually going to succeed. So the actual value here is going to be 1. You do pass the class. So why did the network get this answer wrong? Well, to start with, the network was never trained. All we did was just initialize the network. It has no idea what 6S191 is, what it takes for a student to pass or fail a class, or what these inputs 4 and 5 mean. So it has no idea. It's never been trained. It's basically like a baby that's never seen anything before, and you're feeding some random data to it. We have no reason to expect it to get this answer right.

### Loss functions (29:18)

That's because we never told it how to train itself, how to update itself, so that it can learn how to predict such an outcome, the task of passing or failing a class. Now to do this, we have to actually define for the network what it means to get a wrong prediction, or what it means to incur some error. The closer our prediction is to our actual value, the lower this error, or our loss function, will be. And the farther apart they are, the more error will occur. Now, let's assume we have data not just from one student, but from many students. Now we care about how the model did on average across all of the students in our data set. This is called the empirical loss function. It's simply the mean of all of the individual losses from our data set. And when training a network to solve this problem, we want to minimize the empirical loss. So we want to minimize the loss that the network incurs on the data set that it has access to, between our predictions and the true outputs. If we look at the problem of binary classification, for example, passing or failing a class, we can use a loss function called the softmax cross entropy loss. We'll go into more detail, and you'll get some experience implementing this loss function as part of your software labs, but I'll just give it as a quick aside right now as part of this slide. Now let's suppose that instead of predicting pass or fail, a binary classification output, I want to predict a numeric output, for example the grade that I'm going to get in this class. That's going to be any real number. Now we might want to use a different loss function, because we're not doing a classification problem anymore. We might want to use something like a mean squared error loss function, or maybe something else that takes as input continuous real-valued numbers.
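
As a sketch of the losses mentioned above, the plain-Python code below computes an empirical (mean) loss with two per-example loss functions. For two classes and a single predicted probability, cross entropy reduces to the binary form used here. The labels, probabilities, and grades are made up for illustration, and the function names are assumptions, not the lab code.

```python
import math

def binary_cross_entropy(y_true: float, y_pred: float, eps: float = 1e-12) -> float:
    """Loss for one example with label 0 or 1 and a predicted probability."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def squared_error(y_true: float, y_pred: float) -> float:
    """Loss for one example with a real-valued target."""
    return (y_true - y_pred) ** 2

def empirical_loss(loss_fn, ys_true, ys_pred) -> float:
    """Mean of the per-example losses over the data set."""
    return sum(loss_fn(t, p) for t, p in zip(ys_true, ys_pred)) / len(ys_true)

# Made-up pass/fail labels and predicted probabilities of passing.
labels = [1, 0, 1]
probs = [0.1, 0.2, 0.9]
print(empirical_loss(binary_cross_entropy, labels, probs))

# Made-up true grades and predicted grades (mean squared error).
grades = [90.0, 75.0]
predicted = [80.0, 80.0]
print(empirical_loss(squared_error, grades, predicted))
```

The first example (true label 1, predicted 0.1) contributes most of the classification loss, matching the idea that predictions far from the actual value incur more error.
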

### Training and gradient descent (31:19)

Okay, so now that we have this loss function, we're able to tell our network when it makes a mistake. Now we've got to put that together with the actual model that we defined in the last part, to see how we can train our model to update and optimize itself given that error function. So how can it minimize the error given a data set? Remember that the objective here is to identify a set of weights, let's call them W star, that will give us the minimum loss on average throughout this entire data set. That's the gold standard of what we want to accomplish in training a neural network. So the whole goal of this class really is: how can we identify W star? How can we train all of the weights in our network such that the loss that we get as an output is as small as it can possibly be? That means that we want to find the W's that minimize J of W. That's our empirical loss, our average empirical loss. Remember that W is just a group of all of the w's from every layer in the model. We just concatenate them all together, and we want to find the weights that give us the lowest loss. And remember that our loss function is just a function that takes as input all of our weights. So given some set of weights, our loss function will output a single value. That's the error. If we only have two weights, for example, we might have a loss function that looks like this. We can actually plot the loss function because it's relatively low dimensional. We can visualize it. So on the horizontal axes, the x and y axes, we have the two weights, w0 and w1, and on the vertical axis we have the loss. Higher loss is worse, and we want to find the weights w0 and w1 that will bring us to the lowest part of this loss landscape. So how do we do that?
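
The objective described above can be written compactly as follows. The symbols $n$ (number of examples), $f(x^{(i)}; W)$ (the network's prediction for example $i$), and $\mathcal{L}$ (the per-example loss) are notational assumptions; the lecture names only $W^{*}$ and $J(W)$.

```latex
W^{*} \;=\; \operatorname*{arg\,min}_{W} \; J(W),
\qquad
J(W) \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \mathcal{L}\!\left( f\!\left(x^{(i)}; W\right),\, y^{(i)} \right)
```
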
This is a process called optimization and we're going to start by picking an initial w0 and w1. Start anywhere you want on this graph. And we're going to compute the gradient. Remember, our loss function is simply a mathematical function. So we can compute the derivatives and compute the gradients of this function. And the gradient tells us the direction that we need to go to maximize j of w, to maximize our loss. So let's take a small step now in the opposite direction, right, because we want to find the lowest loss for a given set of weights. So we're going to step in the opposite direction of our gradient, and we're going to keep repeating this process. We're going to compute gradients again at the new point and keep stepping and stepping and stepping until we converge to a local minima. Eventually, the gradients will converge, and we'll stop at the bottom. It may not be the global bottom, but we'll find some bottom of our loss landscape. So we can summarize this whole algorithm known as gradient descent, using the gradients to descend into our loss function in pseudocode. So here's the algorithm written out as pseudocode. We're going to start by initializing weights randomly, and we're going to repeat the two steps until we converge. So first we compute our gradients, and then we're going to step in the opposite direction, a small step in the opposite direction of our gradients to update our weights. Now the amount that we step here, eta, this is the n character next to our gradients, determines the magnitude of the step that we take in the direction of our gradients. And we're going to talk about that later. That's a very important part of this problem. But before I do that, I just want to show you also kind of the analog side of this algorithm written out in TensorFlow again, which may be helpful for your software labs. So this whole algorithm can be replicated using automatic differentiation using platforms like TensorFlow. 
So with TensorFlow, you can actually randomly initialize your weights. And you can actually compute the gradients and do these differentiations automatically. So it will actually take care of the definitions of all of these gradients using automatic differentiation. And it will return the gradients that you can directly use to step with and optimize and train your weights. But now let's take a look at this term here, the gradient. So I mentioned to you that TensorFlow and your software packages will compute this for you, but how does it actually do that? I think it's important for you to understand how the gradient is computed for every single weight in your neural network.
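The gradient descent loop described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the lecture's TensorFlow code: the two-weight bowl-shaped loss, its hand-written gradient, the step count, and the learning rate are all chosen for the example. In TensorFlow the gradient would come from automatic differentiation (for example `tf.GradientTape`) rather than from a hand-written function.

```python
import random

def loss(w0, w1):
    # Hypothetical two-weight loss landscape with its minimum at (3, -1).
    return (w0 - 3.0) ** 2 + (w1 + 1.0) ** 2

def gradient(w0, w1):
    # Partial derivatives of the loss above, written out by hand.
    return 2.0 * (w0 - 3.0), 2.0 * (w1 + 1.0)

def gradient_descent(learning_rate=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    # 1. Initialize the weights randomly.
    w0, w1 = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
    for _ in range(steps):
        # 2. Compute the gradient at the current point.
        g0, g1 = gradient(w0, w1)
        # 3. Take a small step in the opposite direction of the gradient.
        w0 -= learning_rate * g0
        w1 -= learning_rate * g1
    return w0, w1

w0, w1 = gradient_descent()
print(w0, w1, loss(w0, w1))  # the weights end up very close to (3, -1)
```

A fixed number of steps stands in for "repeat until convergence" here; a real training loop would usually monitor the loss instead.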

### Backpropagation (35:46)

So this is actually a process called backpropagation in deep learning and neural networks. And we'll start with a very simple network. And this is probably the simplest network in existence because it only contains one hidden neuron. Right. So it's the smallest possible neural network. Now, the goal here is that we're going to try and do backpropagation manually ourselves by hand. So we're going to try and compute the gradient of our loss, J, with respect to a weight, for example W2. This tells us how much a small change in W2 will affect our loss function. So if I change and perturb W2 a little bit, how does my error change as a result? So if we write this out as a derivative, we start by applying the chain rule backwards from the loss function through the output. Okay, so we start with the loss function here, and we specifically decompose dJ/dW2 into two terms. We're going to decompose that into dJ/dy multiplied by dy/dW2. So we're just applying the chain rule to decompose the left-hand side into two gradients that we do have access to. Now this is possible because y is only dependent on the previous layer. Now let's suppose we want to compute the gradients of the weight before W2, which in this case is W1. Well, now we've replaced W2 with W1 on the left-hand side, and then we need to apply the chain rule one more time recursively. So we take this equation again, and we need to apply the chain rule to the right-hand side on the red highlighted portion and split that part into two parts again. So now we propagate our gradient, our old gradient, through the hidden unit all the way back to the weight that we're interested in, which in this case is W1. So remember again, this is called backpropagation, and we repeat this process for every single weight in our neural network. 
And if we repeat this process of propagating gradients all the way back to the input, then we can determine how every single weight in our neural network needs to change in order to decrease our loss on the next iteration. So then we can apply those small little changes so that our loss is a little bit better on the next trial. And that's the backpropagation algorithm. In theory, it's a very simple algorithm. Just compute the gradients and step in the opposite direction of your gradient. But now let's touch on some insights from training these networks in practice, which is very different from the simple example that I gave before. So optimizing neural networks in practice can be extremely difficult. It does not look like the loss function landscape that I gave you before. In practice, it might look something like this, where your loss landscape is super non-convex and very complex. So here's an example of a paper that came out a year ago where authors tried to actually visualize what deep neural network loss landscapes actually look like. And recall this update equation that we defined during gradient descent. I didn't talk much about this parameter.
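The chain-rule decomposition above can be checked numerically with a small sketch. Everything specific here is an assumption for illustration: the hidden unit uses a tanh activation, the output is linear, the loss is a squared error, and the input, target, and weight values are arbitrary. The code computes the two gradients by hand with the chain rule and compares them with finite-difference estimates.

```python
import math

def forward(x, w1, w2):
    z1 = math.tanh(w1 * x)  # hidden unit (tanh is an assumed activation)
    y = w2 * z1             # output of the network
    return z1, y

def loss(y, target):
    return (y - target) ** 2  # assumed squared-error loss J

def backprop(x, target, w1, w2):
    z1, y = forward(x, w1, w2)
    dJ_dy = 2.0 * (y - target)         # how the loss changes with the output
    dy_dw2 = z1                        # how the output changes with w2
    dy_dz1 = w2                        # how the output changes with the hidden unit
    dz1_dw1 = (1.0 - z1 ** 2) * x      # derivative of tanh(w1 * x) with respect to w1
    dJ_dw2 = dJ_dy * dy_dw2            # chain rule, one step back
    dJ_dw1 = dJ_dy * dy_dz1 * dz1_dw1  # chain rule, two steps back
    return dJ_dw1, dJ_dw2

def numerical_gradient(x, target, w1, w2, eps=1e-6):
    # Perturb each weight a little and see how the loss changes.
    def J(a, b):
        return loss(forward(x, a, b)[1], target)
    d1 = (J(w1 + eps, w2) - J(w1 - eps, w2)) / (2 * eps)
    d2 = (J(w1, w2 + eps) - J(w1, w2 - eps)) / (2 * eps)
    return d1, d2

analytic = backprop(x=0.5, target=1.0, w1=0.8, w2=-1.2)
numeric = numerical_gradient(x=0.5, target=1.0, w1=0.8, w2=-1.2)
print(analytic, numeric)  # the two pairs should agree closely
```

The finite-difference version is exactly the "perturb W a little bit" question from the lecture; backpropagation gets the same answer without re-running the network once per weight.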

### Setting the learning rate (38:55)

I alluded to it. It's called the learning rate. And in practice it determines a lot about how large a step we take and how much trust we place in our gradients. So if we set our learning rate to be very small, then we have a model that may get stuck in local minima, right? Because we're only taking small steps along our gradient. So we're going to converge very slowly. We may even get stuck if it's too small. If the learning rate is too large, we might follow the gradient again, but we might overshoot and actually diverge, and our training may kind of explode, and it's not a stable training process. So in reality, we want to use learning rates that are neither too small nor too large, to avoid these local minima and still converge. So we want to kind of use medium-sized learning rates, and what medium means is totally arbitrary, you're going to see that later on, to skip over these local minima and still find global, or hopefully more global, optimums in our loss landscape. So how do we actually find our learning rate? Well, you set this as part of the definition of your learning algorithm, so you have to actually input your learning rate. And one way to do it is you could try a bunch of different learning rates and see which one works the best. That's actually a very common technique in practice, even though it sounds very unsatisfying. Another idea is maybe we could do something a little bit smarter and use what are called adaptive learning rates. So these are learning rates that can kind of observe their landscape and adapt themselves to tackle some of these challenges and maybe escape some local minima or speed up when they're on a local minimum. So this means that the learning rate, because it's adaptive, may increase or decrease depending on how large our gradient is and how fast we're learning, or many other options. 
So in fact, these have been widely explored in the deep learning literature and heavily published on, and they're also available as part of software packages like TensorFlow. So during your labs, we encourage you to try out some of these different types of optimizers and algorithms and see how they can actually adapt their own learning rates to stabilize training much better. Now let's put all of this together now that we've learned how to create the model, how to define the loss function, and how to actually perform backpropagation using an optimization algorithm. And it looks like this. So we define our model on the top. We define our optimizer. Here you can try out a bunch of the different TensorFlow optimizers. We feed the output of our model, grab its gradient, and apply its gradient to the optimizer so we can update our weights.
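The effect of the learning rate can be seen on a toy problem. The sketch below is an assumption-laden illustration: it uses the convex loss J(w) = w², so it cannot show local minima, only the slow-convergence and divergence cases, and the three learning rates are picked by hand.

```python
def gradient(w):
    # Derivative of the toy loss J(w) = w**2 (an assumed, convex example).
    return 2.0 * w

def run_gradient_descent(learning_rate, steps=50, w_start=5.0):
    w = w_start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

too_small = run_gradient_descent(learning_rate=0.001)  # barely moves from 5.0
medium = run_gradient_descent(learning_rate=0.1)       # ends up close to the minimum at 0
too_large = run_gradient_descent(learning_rate=1.1)    # overshoots further on every step
print(too_small, medium, too_large)
```

Trying several values like this is the "try a bunch of different learning rates" approach; adaptive optimizers automate part of that search by changing the step size during training.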

### Batched gradient descent (41:37)

So in the next iteration, we have a better prediction. Now I want to continue to talk about tips for training these networks in practice very briefly towards the end of this lecture, because this is a very powerful idea: batching your data into mini-batches to stabilize your training even further. And to do this, let's first revisit our gradient descent algorithm. The gradient is actually very, very computationally expensive to compute because it's computed as a summation over your entire data set. Now imagine your data set is huge, right? It's not going to be feasible in many real life problems to compute on every training iteration. Let's define a new gradient function that instead of computing it on the entire data set, just computes it on a single random example from our data set. So this is going to be a very noisy estimate of our gradient, right? So just from one example, we can compute an estimate. It's not going to be the true gradient but an estimate. And this is much easier to compute because it's very small. So just one data point is used to compute it. But it's also very noisy and stochastic since it was computed from just this one example. So what's the middle ground? Instead of computing it from the whole data set and instead of computing it from just one example, let's pick a small random subset of B examples. We'll call this a batch of examples. And we'll feed this batch through our model and compute the gradient with respect to this batch. This gives us a much better estimate in practice than using a single example. It's still an estimate because it's not the full data set, but still it's much more computationally attractive for computers to do this on a small batch. Usually we're talking about batches of maybe 32 or up to 100. Sometimes people use larger batches with larger neural networks and larger GPUs, but even using something smaller like 32 can have a drastic improvement on your performance. 
Now the increase in gradient estimation accuracy actually allows us to converge much quicker in practice. It allows us to more smoothly and accurately estimate our gradients, and ultimately that leads to faster training and more parallelizable computation, because over each of the elements in our batch, we can kind of parallelize the gradients and then take the average of all of the gradients.
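Mini-batch gradient descent can be sketched on a tiny regression problem. This is a hypothetical setup, not the lab code: the model is a single weight `w` in `y = w * x`, the data are synthetic with a true weight of 2 plus a little noise, and the batch size of 32, the learning rate, and the step count are picked for the example.

```python
import random

def batch_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x, averaged over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def minibatch_sgd(data, batch_size=32, learning_rate=0.05, steps=500, seed=0):
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # random initial weight
    for _ in range(steps):
        batch = rng.sample(data, batch_size)           # pick B random examples
        w -= learning_rate * batch_gradient(w, batch)  # step against the batch gradient
    return w

# Synthetic data set (an assumption): y = 2 * x plus small Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x + data_rng.gauss(0.0, 0.1)) for x in xs]

w = minibatch_sgd(data)
print(w)  # should land near the true weight of 2
```

Each step touches only 32 of the 1000 examples, which is where the computational saving over the full-data-set gradient comes from. Setting `batch_size=1` gives the noisy single-example variant, and `batch_size=len(data)` gives the full gradient.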

## Discussion On Regularization In Neural Networks

### Regularization: dropout and early stopping (43:45)

Now this last topic I want to address is that of overfitting. This is also a problem that is very, very general to all of machine learning, not just deep learning, but especially in deep learning, which is why I want to talk about it in today's lecture. It's a fundamental problem and challenge of machine learning. Ideally in machine learning we're given a data set like these red dots, and we want to learn a model like the blue line that can approximate our data, right? Said differently, we want to build models that learn representations of our data that can generalize to new data. So assume we want to build this line to fit our red dots. We can do this by using a single straight line, on the left-hand side, but this is not going to capture all of the intricacies of our red points and of our data very well. Or we can go to the other far extreme and overfit. We can really capture all the details, but this one on the far right is not going to generalize to a new data point that it sees from a test set, for example. Ideally, we want to wind up with something in the middle that is still small enough to maintain some of those generalization capabilities and large enough to capture the overall trends. So to address this problem, we can employ a technique called regularization. Regularization is simply a method that you can introduce into your training to discourage complex models, so to encourage these more simple types of models to be learned. And as we've seen before, this is actually critical and crucial for our models to be able to generalize past our training data. So we can fit our models to our training data, and actually we can minimize our loss to almost zero in most cases. But that's not what we really care about. We always want to train on a training set, but then have that model be deployed and generalize to a test set which we don't have access to. So the most popular regularization technique for deep learning is a very simple idea called dropout. 
Let's revisit this picture of a neural network that we started with in the beginning of this class. In dropout, during training, what we're going to do is randomly drop and set some of the activations in the hidden layer of this neural network to zero with some probability. Let's say we drop out 50 percent of the neurons. We randomly pick 50 percent of neurons. That means that their activations now are all set to zero, and we force the network to not rely on those neurons too much. So this forces the model to kind of identify different types of pathways through the network. On this iteration, we pick some random 50% to drop out, and on the next iteration, we may pick a different random 50%. And this is going to encourage these different pathways and encourage the network to identify different forms of processing its information to accomplish its decision making capabilities. Another regularization technique is a technique called early stopping. Now the idea here is that we all know the definition of overfitting is when our model starts to have very bad performance on our test set. We don't have a test set, but we can kind of create an example test set using our training set. So we can split up our training set into two parts: one that we'll use for training, and one that we'll not show to the training algorithm but that we can use to identify when we start to overfit a little bit. So on the x-axis we can actually see training iterations, and as we start to train we can see that both the training loss and the testing loss go down, and they keep going down until they start to diverge. And this pattern of divergence actually continues for the rest of training, and what we want to do here is actually identify the place where the testing loss is minimized. And that's going to be the model that we're going to use. 
And that's going to be the best kind of model in terms of generalization that we can use for deployment. So when we actually have a brand new test data set, that's going to be the model that we're going to use. So we're going to employ this technique called early stopping to identify it. And as we can see, anything that kind of falls on the left side of this line are models that are underfitting. And anything on the right side of this line are going to be models that are considered to be overfit, right, because this divergence has occurred.
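Both regularization ideas can be sketched in a few lines of plain Python. The details below are assumptions for illustration: the dropout helper rescales the surviving activations by 1/(1 - drop probability), a common convention that the lecture does not mention, and the early-stopping helper uses a hypothetical `patience` parameter and a made-up list of held-out losses.

```python
import random

def dropout(activations, drop_prob, rng):
    """Training-time dropout: zero each activation with probability drop_prob."""
    keep_prob = 1.0 - drop_prob
    # Assumed convention: scale the kept activations so their expected value is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def early_stopping_index(held_out_losses, patience=2):
    """Index of the best checkpoint, stopping after `patience` iterations without improvement."""
    best_index, best_loss = 0, float("inf")
    for i, current in enumerate(held_out_losses):
        if current < best_loss:
            best_index, best_loss = i, current
        elif i - best_index >= patience:
            break  # the held-out loss has been rising for a while: stop training
    return best_index

# Ten hypothetical hidden activations; a different random subset is dropped on each call.
rng = random.Random(0)
hidden = [1.0] * 10
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(dropped)

# Hypothetical held-out losses per training iteration: they fall, then start to rise.
held_out_losses = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6, 0.8]
best = early_stopping_index(held_out_losses)
print(best)  # index of the iteration with the lowest held-out loss
```

Iterations before the returned index correspond to underfitting models on the plot, and iterations after it to models that have started to overfit.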

## Conclusion

### Summary (47:58)

Now, I'll conclude this lecture by first summarizing the three main points that we've covered so far. So first we learned about the fundamental building blocks of neural networks, the perceptron, a single neuron. We learned about stacking and composing these types of neurons together to form layers and full networks. And then finally, we learned about how to actually complete the whole puzzle and train these neural networks end-to-end using some loss function and using gradient descent and backpropagation. So in the next lecture, we'll hear from Ava on a very exciting topic, taking a step forward and actually doing deep sequence modeling. So not just one input, but now a sequence of inputs over time, using RNNs and also a really new and exciting type of model called the transformer and attention mechanisms. So let's resume the class in about five minutes once we have a chance for Ava to just get set up and bring up her presentation. So thank you very much.