MIT 6.S191: Convolutional Neural Networks
Transcription for the video titled "MIT 6.S191: Convolutional Neural Networks".
Hi everyone and welcome back to Intro to Deep Learning. We had a really awesome kickoff day yesterday, so we're looking to keep that same momentum all throughout the week and starting with today. Today, we're really excited to be talking about, actually one of my favorite topics in this course which is how we can build computers that can achieve the sense of sight and vision. Now I believe that sight and specifically like I said vision is one of the most important human senses that we all have. In fact sighted people rely on vision quite a lot in our day-to-day lives from everything from walking around, navigating the world, interacting and sensing other emotions in our colleagues and peers. And today we're going to learn about how we can use deep learning and machine learning to build powerful vision systems that can both see and predict what is where by only looking at raw visual inputs. And I like to think of that phrase as a very concise and sweet definition of what it really means to achieve vision. But at its core, vision is actually so much more than just understanding what is where. It also goes much deeper. Take this scene, for example. We can build computer vision systems that can identify, of course, all of the objects in this environment, starting first with the yellow taxi or the van parked on the side of the road. But we also need to understand each of these objects at a much deeper level, not just where they are, but actually predicting the future, predicting what may happen in the scene next. For example, that the yellow taxi is more likely to be moving and dynamic into the future because it's in the middle of the lane compared to the white van, which is parked on the side of the road. Even though you're just looking at a single image, your brain can infer all of these very subtle cues, and it goes all the way to the pedestrians on the road, and even these even more subtle cues in the traffic lights and the rest of the scene as well. 
Now, accounting for all of these details in the scene is an extraordinary challenge, but we as humans do this so seamlessly. I put that frame up on the slide, and within a split second all of you could reason about many of those subtle details without me even pointing them out. The question of today's class is how we can build machine learning and deep learning algorithms that can achieve that same type of subtle understanding of our world. Deep learning in particular is really leading this revolution in computer vision and in achieving sight for computers.
Visual Representation In Computing
Amazing applications of vision (02:37)
For example, allowing robots to pick up on these key visual cues in their environment is critical for really navigating the world together with us as humans. These algorithms that you're going to learn about today have become so mainstream, in fact, that they're fitting on all of your smartphones, in your pockets, processing every single image that you take, enhancing those images, detecting faces, and so on and so forth. And we're seeing some exciting advances ranging all the way from biology and medicine, which we'll talk about a bit later today, to autonomous driving and accessibility as well. And like I said, deep learning has taken this field as a whole by storm in the past decade or so because of its ability, critically, like we were talking about yesterday, to learn directly from raw data, from those raw image inputs that it sees in its environment, and to learn explicitly how to perform, like we talked about yesterday, what is called feature extraction on those images. One example of that is facial detection and recognition, which all of you are going to get practice with in today's and tomorrow's labs as part of the grand final competition of this class. Another really go-to example of computer vision is in autonomous driving and self-driving vehicles, where we can take an image as input, or maybe potentially a video as input, multiple images, and process all of that data so that we can train a car to learn how to steer the wheel or command a throttle or actuate a braking command. This entire control system, the steering, the throttle, the braking of a car, can be executed end-to-end by taking as input the images and the sensing modalities of the vehicle and learning how to predict those actuation commands. Now, this end-to-end approach, having a single neural network do all of this, is actually radically different from what the vast majority of autonomous vehicle companies do.
If you look at Waymo, for example, that's a radically different approach. But we'll talk about those approaches in today's class. And in fact, this is one of our vehicles that we've been building at MIT, in my lab in CSAIL, just a few floors above this room. And we'll, again, share some of the details on this incredible work. But of course, it doesn't stop here with autonomous driving. These algorithms, directly the same algorithms that you'll learn about in today's class, can be extended all the way to impact healthcare and medical decision making, and finally, even accessibility applications, where we're seeing computer vision algorithms helping the visually impaired. So, for example, in this project, researchers have built deep-learning-enabled devices that could detect trails so that visually impaired runners could be provided audible feedback, so that they too could navigate when they go out for runs.
What computers see (05:35)
And like I said, we often take many of these tasks that we're going to talk about in today's lecture for granted because we do them so seamlessly in our day-to-day lives. But the question of today's class is going to be, at its core, how we can build a computer to do these same types of incredible things that all of us take for granted day-to-day. And specifically we'll start with this question of how a computer really sees, and, even more detailed than that, how a computer processes an image. If we think of sight as coming to computers through images, then how can a computer even start to process those images? Well, to a computer, images are just numbers, right? Suppose, for example, we have a picture here of Abraham Lincoln. This picture is made up of what are called pixels. Every pixel is just a dot in this image, and since this is a grayscale image, each of these pixels is just a single number. So we can represent our image as this two-dimensional matrix of numbers, and because, like I said, this is a grayscale image, every pixel corresponds to just one number at that matrix location. Now assume, for example, we didn't have a grayscale image, we had a color image. That would be an RGB image, right? So now every pixel is going to be composed not just of one number but of three numbers. You can think of that as kind of a 3D matrix instead of a 2D matrix, where you almost have three two-dimensional matrices stacked on top of each other. So now, with this basis of numerical representations of images, we can start to think about what types of computer vision algorithms we can build that take these representations as input, and what they can perform, right? So the first thing that I want to talk to you about is what kind of tasks we even want to train these systems to complete with images. And broadly speaking, there are two broad categories of tasks.
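Before turning to those tasks, the "images are just numbers" idea can be made concrete with a small sketch. The 3x3 pixel values below are made up for illustration, and the height-by-width-by-channel layout of the color image is one assumed convention among several; the helper names `to_rgb` and `shape` are hypothetical.

```python
# Hypothetical 3x3 grayscale image: one brightness number per pixel
# (0 = black, 255 = white). The values are made up for illustration.
gray = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]


def to_rgb(gray_image):
    """Build a color image by repeating each gray value in the R, G, and B channels."""
    return [[[v, v, v] for v in row] for row in gray_image]


def shape(array):
    """Return the nested-list dimensions: (height, width) or (height, width, channels)."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


rgb = to_rgb(gray)
print(shape(gray))  # (3, 3): a 2D matrix, one number per pixel
print(shape(rgb))   # (3, 3, 3): three numbers per pixel
```

The grayscale image is a two-dimensional array, while the color image carries an extra axis of length three, which is the "three stacked matrices" picture from the lecture seen from a per-pixel point of view.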
We touched on this a little bit in yesterday's lecture, but just to be a bit more concrete in today's lecture, those two tasks are either classification or regression. Now, in regression your prediction is going to take a continuous value, right? That could be any real number on the number line. But in classification your prediction could take one of, let's say, K different classes, right? These are discrete, different classes. So let's consider first the task of image classification. In this task, we want to predict an individual label for every single image. And this label that we predict is going to be one of K different possible labels that could be considered. So for example, let's say we have a bunch of images of US presidents, and we want to build a classification pipeline to tell us which president is in this particular image that you see on the screen. Now the goal of our model in this case is going to be basically to output a probability score, a probability of this image containing one of these different presidents, right? And the maximum score is going to be ultimately the one that we infer to be the correct president in the image. So in order to correctly perform this task and correctly classify these images, our pipeline, our computer vision model, needs the ability to tell us what is unique about this particular image of Abraham Lincoln, for example, versus a different picture of George Washington versus a different picture of Obama. Now, another way to think about this whole problem of image classification or image processing at a high level is in terms of features. Think of these as almost patterns in your data or characteristics of a particular class. And classification, then, is simply done by detecting all of these different patterns in your data and identifying when certain patterns occur over other patterns.
So for example if the features of a particular class are present in an image then you might infer that that image is of that class. So for example, if you want to detect cars, you might look for patterns in your data like wheels, license plates, or headlights. And if those things are present in your image, then you can say with fairly high confidence that your image is of a car versus one of these other categories. So if we're building a computer vision pipeline, we have two main steps really to consider. The first step is that we need to know what features or what patterns we're looking for in our data. And the second step is we need to then detect those patterns. Once we detect them, we can then infer which class we're in. Now one way to solve this is to leverage knowledge about our particular field. So if we know something about our field, for example, about human faces, we can use that knowledge to define our features. What makes up a face? We know faces are made up of eyes, noses, and ears, for example. We can define what each of those components look like in defining our features. But there's a big problem with this approach. And remember that images are just these three-dimensional arrays of numbers, right? They can have a lot of variation even within the same type of object. These variations can include really anything ranging from occlusions to variations in lighting, rotations, translations, intraclass variation. And the problem here is that our classification pipeline needs the ability to handle and be invariant to all of these different types of variations, while still being sensitive to all of the inter-class variations, the variations that occur between different classes. 
Now, even though our pipeline could use features that we as humans define, manually define based on some of our prior knowledge, the problem really breaks down in that these features become very non-robust when considering all of these vast amounts of different variations that images take in the real world. So in practice, like I said, your algorithms need to be able to withstand all of those different types of variations. And then the natural question is that how can we build a computer vision algorithm to do that and still maintain that level of robustness? And what we want is a way to extract features that can both detect those features, those patterns in the data, and do so in a hierarchical fashion, right? So going all the way from the ground up, from the pixel level, to something with semantic meaning, like for example the eyes or the noses in a human face. Now we learned in the last class that we can use neural networks exactly for this type of problem, right? Neural networks are capable of learning features directly from data and learn, most importantly, a hierarchical set of features, building on top of previous features that it's learned to build more and more complex set of features.
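The classification step described earlier, where the model outputs one probability per class and the maximum is taken as the prediction, can be sketched as follows. The lecture does not say how the probabilities are produced; softmax is assumed here as one common choice, and the raw scores are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Lincoln", "Washington", "Obama"]  # the K possible classes
raw_scores = [2.0, 0.5, -1.0]  # made-up model outputs, for illustration only

probs = softmax(raw_scores)
predicted = labels[probs.index(max(probs))]  # class with the maximum probability
print(predicted)  # Lincoln
```

A regression model, by contrast, would return the continuous number directly instead of choosing among discrete labels.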
Learning visual features (12:38)
Now we're going to see exactly how neural networks can do this in the image domain as part of this lecture. But specifically, neural networks will allow us to learn these visual features from visual data if we construct them cleverly. And the key point here is that actually the models and the architectures that we learned about in yesterday's lecture and so far in this course, we'll see how they're actually not suitable or extensible to today's problem domain of images and how we can build and construct neural networks a bit more cleverly to overcome those issues. So maybe let's start by revisiting what we talked about in lecture one, which was where we learned about fully connected networks. These were networks that have multiple hidden layers and each neuron in a given hidden layer is connected to every neuron in its prior layer, right, so it receives all of the previous layers inputs as a function of these fully connected layers. Now let's say that we want to directly, without any modifications, use a fully connected network like we learned about in lecture one with an image processing pipeline. So directly taking an image and feeding it to a fully connected network. Could we do something like that? Actually in this case we could. The way we would have to do it is remember that because our image is a two-dimensional array, the first thing that we would have to do is collapse that to a one-dimensional sequence of numbers, right, because a fully connected network is not taking in a two-dimensional array, it's taking in a one-dimensional sequence. So the first thing that we have to do is flatten that two-dimensional array to a vector of pixel values and feed that to our network. In this case, every neuron in our first layer is connected to all neurons in that input layer. So in that original image, flattened down, we feed all of those pixels to the first layer. 
And here, you should already appreciate the very important notion that every single piece of spatial information that really defined our image, that makes an image an image, is totally lost already before we've even started this problem because we've flattened that two-dimensional image into a one-dimensional array. We've completely destroyed all notion of spatial information. And in addition, we really have an enormous number of parameters because this system is fully connected. Take, for example, a very, very small image, which is even 100 by 100 pixels. That's an incredibly small image in today's standards. But that's going to take 10,000 neurons just in the first layer, which will be connected to, let's say, 10,000 neurons in the second layer. The number of parameters that you'll have just in that one layer alone is going to be 10,000 squared parameters. It's going to be highly inefficient, you can imagine, if you wanted to scale this network to even a reasonably sized image that we have to deal with today. So not feasible in practice. But instead, we need to ask ourselves how we can build and maintain some of that spatial structure that's very unique about images here into our input and here into our model, most importantly. So to do this, let's represent our 2D image as its original form, as a two-dimensional array of numbers. One way that we can use spatial structure here inherent to our input is to connect what are called basically these patches of our input to neurons in the hidden layer. So for example, let's say that each neuron in the hidden layer that you can see here only is going to see or respond to a certain set or a certain patch of neurons in the previous layer. Right. So you could also think of this as almost a receptive field, what the single neuron in your next layer can attend to in the previous layer. It's not the entire image, but rather a small receptive field from your previous image. 
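Going back to the parameter count for the flattened 100 by 100 image, the arithmetic can be checked with a short sketch. The helper names `flatten` and `dense_weights` are hypothetical, and biases are left out of the count to match the lecture's "10,000 squared" figure.

```python
def flatten(image):
    """Collapse a 2D image into a 1D list of pixel values, discarding the row/column layout."""
    return [pixel for row in image for pixel in row]


def dense_weights(n_inputs, n_neurons):
    """Weights in a fully connected layer: one per (input, neuron) pair. Biases are ignored."""
    return n_inputs * n_neurons


tiny = [[1, 2], [3, 4]]
print(flatten(tiny))  # [1, 2, 3, 4]: neighbors in a column are no longer adjacent

height, width = 100, 100
n_pixels = height * width                       # 10,000 inputs after flattening
fc_weights = dense_weights(n_pixels, n_pixels)  # 10,000 x 10,000
print(fc_weights)  # 100000000
```

So a single fully connected layer on this very small image already needs one hundred million weights, which is the inefficiency the lecture points to.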
Now notice here how the region of the input layer, which you can see on the left-hand side here, influences that single neuron on the right-hand side. And that's just one neuron in the next layer. But of course, you can imagine basically defining these connections across the whole input. Each time, you have a single patch on your input that corresponds to a single neuron output in the other layer. And we can apply the same principle of connecting these patches across the entire image to single neurons in the subsequent layer. We do this by essentially sliding that patch pixel by pixel across the input image, and we'll be responding with, you know, another image on our output layer. In this way, we essentially preserve all of that very key and rich spatial information inherent to our input. But remember that the ultimate task here is not only to just preserve that spatial information. We want to ultimately learn features, learn those patterns so that we can detect and classify these images. And we can do this by weighting, right?
Feature extraction and convolution (17:51)
Weighting the connections between the patches of our input in order to detect what those certain features are. Let me give a practical example here. In practice, this operation that I'm describing, this patching and sliding operation, is actually a mathematical operation formally known as convolution. We'll first think about this at a high level, supposing that we have what's called a 4x4 pixel patch. You can see this 4x4 pixel patch represented as a red box on the left-hand side. Since we have a 4x4 patch, this is going to consist of 16 different weights in this layer. Let's not call this a patch anymore; let's use the terminology filter. We'll apply this same 4x4 filter to the input and use the result of that operation to define the state of the neuron in the next layer, right? And now we're going to shift our filter by, let's say, two pixels to the right, and that's going to define the next neuron in the adjacent location in the next layer, right? And we keep doing this, and you can see that on the right-hand side, you're sliding over not only the input image, but you're also sliding over the output neurons in the secondary layer. And this is how we can start to think about convolution at a very, very high level. But you're probably wondering, right, not just how the convolution operation works, but I think the main thing here to really narrow down on is how convolution allows us to learn these features, these patterns in the data that we were talking about, because ultimately that's our final goal. That's our real goal for this class: to extract those patterns. So let's make this very concrete by walking through an example. Suppose we want to build a convolutional algorithm to detect or classify the letter X in an image.
And here, for simplicity, let's just say we have only black and white images. So every pixel in this image will be represented by one of two values; there's no grayscale in this image. And actually here, we're representing black as negative 1 and white as positive 1. So to classify, we cannot simply compare the left-hand side to the right-hand side, right? These are both X's, but you can see that because the one on the right-hand side is slightly rotated to some degree, it's not going to directly align with the X on the left-hand side, even though it is an X. We want to detect X's in both of these images, so we need to think a bit more cleverly about how we can detect those features that define an X. So let's see how we can use convolutions to do that. In this case, for example, we instead want our model to compare images of this X piece by piece, or patch by patch, right? And the important patches that we look for are exactly these features that will define our X. So if our model can find these rough feature patches in roughly the same positions in our input, then we can infer that these two images are of the same type, or the same letter, right? This can do a lot better than simply measuring the similarity between these two images, because we're operating at the patch level. So think of each patch almost like a miniature image, right? A small two-dimensional array of values. And we can use filters to pick up on when these small patches or small images occur. So in the case of X's, these filters may represent semantic things, for example, the diagonal lines or the crossings that capture all of the important characteristics of the X. So we'll probably capture these features in the arms and the center of our letter, right, in any image of an X, regardless of how that image is, you know, translated or rotated and so on.
And note that these smaller matrices, these filters of weights, are also just numerical values: each pixel in these mini patches is simply a numerical value. They're also images, in effect, right? And all that's really left in this problem and in this idea that we're discussing is to define the operation that can take these miniature patches and detect when those patches occur in your image and when they don't.
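Returning to the 4x4 filter that is shifted two pixels at a time, the positions it visits and the size of the resulting output can be worked out with a short sketch. The input width of 10 pixels is an assumed example (the lecture does not give one), no padding is assumed, and the helper names are hypothetical.

```python
def output_size(input_size, filter_size, stride):
    """Number of positions a filter can occupy along one axis, without padding."""
    return (input_size - filter_size) // stride + 1


def patch_origins(input_size, filter_size, stride):
    """Top-left coordinate of every patch the filter visits along one axis."""
    count = output_size(input_size, filter_size, stride)
    return [i * stride for i in range(count)]


# 4x4 filter, shifted 2 pixels at a time, over an assumed 10-pixel-wide input.
print(output_size(10, 4, 2))    # 4
print(patch_origins(10, 4, 2))  # [0, 2, 4, 6]
```

Each origin corresponds to one neuron in the next layer, so a 10x10 input would produce a 4x4 grid of outputs under these assumptions.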
The convolution operation (22:23)
And that brings us right back to this notion of convolution, right? Convolution is exactly the operation that will solve that problem. Convolution preserves all of that spatial information in our input by learning image features in those smaller square regions of our input data. So just to give another concrete example, to perform this operation, we need to do an element-wise multiplication between the filter matrix, those miniature patches, and the patch of our input image. So think of two patches. You have the weight matrix patch, the thing that you want to detect, which you can see on the top left here. And you also have the secondary patch, which is the thing that you are looking to compare it against in your input image. The question is how similar these two patches are. Because you're doing an element-wise multiplication between two small 3x3 matrices, you're going to be left with another 3x3 matrix. In this case, all of the elements of this resulting matrix, you can see here, are ones, because in every location the filter and the image patch are perfectly matching. So when we do that element-wise multiplication, we get ones everywhere. The last step is that we need to sum up the results of that element-wise multiplication, and the result is nine in this case. Everything was a one, and it's a 3x3 matrix, so the result is nine. Now, let's consider one more example. Now we have this image in green, and we want to detect this filter in yellow. Suppose we want to compute the convolution of this 5 by 5 image with this 3 by 3 filter.
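Before moving on to the 5 by 5 case, the perfectly matching 3x3 comparison can be checked in a few lines. The diagonal and anti-diagonal patches below are assumed examples in the lecture's -1 (black) / +1 (white) encoding, not the exact patches from the slide, and `match_score` is a hypothetical helper name.

```python
def match_score(filter_patch, image_patch):
    """Element-wise multiply two equally sized patches, then sum the products."""
    return sum(
        f * p
        for f_row, p_row in zip(filter_patch, image_patch)
        for f, p in zip(f_row, p_row)
    )


# Assumed 3x3 diagonal stroke, with -1 for black and +1 for white.
diagonal = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]

# Assumed stroke running the other way, for comparison.
anti_diagonal = [
    [-1, -1, 1],
    [-1, 1, -1],
    [1, -1, -1],
]

print(match_score(diagonal, diagonal))       # 9: every product is 1, and there are 9 of them
print(match_score(diagonal, anti_diagonal))  # 1: most products cancel out
```

With this encoding, matching pixels always multiply to +1 and mismatching pixels to -1, which is why a perfect match reaches the maximum possible score of nine.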
To do this, we need to cover basically the entirety of our image by sliding this filter over it piece by piece and computing the similarity, the convolution, of this filter across the entire image. And we do that, again, through the same mechanism. At every location, we compute an element-wise multiplication of the filter with the patch at that location on the image, add up all of the resulting entries, and pass that to our next layer. So let's walk through it. First let's start off in the upper left-hand corner. We place our filter over the upper left-hand corner of our image. We element-wise multiply, we add up all the results, and we get four. And that four is going to be placed into the next layer, right? This next layer, again, is another image, but it's determined as the result of our convolution operation. We slide the filter over to the next location, the next location provides the next value in our output image, and we keep repeating this process over and over again until we've covered the entire image with our filter. As a result, we've also completely filled out our output feature map. The output feature map is basically what you can think of as how closely aligned our filter is to every location in our input image. So now that we've gone through the mechanism that defines this operation of convolution, let's see how different filters could be used to detect different types of patterns in our data. So for example, let's take this picture of a woman's face and the output of applying three different types of filters to this picture, right? They're all 3x3 filters, and you can see the exact filters in the bottom right-hand corner of the corresponding face. And by applying these three different filters, you can see how we can achieve drastically different results.
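Going back to the 5 by 5 walkthrough, the sliding procedure can be sketched in plain Python. The transcript does not record the pixel values on the slide, so the image and filter below are assumed values, chosen so that the upper-left position gives four as in the lecture. Stride 1 and no padding are also assumed. As is usual in deep learning, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding).

    At each location, multiply element-wise and sum the products.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumed 5x5 binary image and 3x3 filter (illustrative values only).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

The 5x5 input and 3x3 filter leave three valid positions along each axis, so the output feature map is 3x3, and larger entries mark locations where the image patch lines up more closely with the filter.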
And simply by changing the weights that are present in these 3x3 matrices, you can see the variability of different types of features that we can detect. So, for example, we can design filters that can sharpen an image, make the edges sharper in the image. We can design filters that will extract edges. We can do stronger edge detection by, again, modifying the weights in all of those filters. So I hope now that all of you can appreciate the power of these filtering operations and how we can define them mathematically in the form of these smaller patch-based operations and matrices that we can then slide over an image. And these concepts are so powerful because they preserve the spatial information of our original input while still performing this feature extraction. Now, instead of defining those filters by hand, like we did on the previous slide, what if we tried to learn them? Remember again that those filters are kind of proxies for important patterns in our data. So our neural network could try to learn the elements of those small patch filters as weights in the neural network. And learning those would essentially equate to picking up and learning the patterns that define one class versus another class. Now that we've gotten this operation and this understanding under our belt, we can take this one step further, right? We can take this singular convolution operation and start to think about how we can build entire convolutional layers out of this operation, so that we can start to even imagine convolutional networks and neural networks.
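To see how changing the nine weights changes what a filter detects, here is a small sketch with two hand-designed 3x3 filters. The transcript does not list the weights shown on the slide, so the sharpen and edge kernels below are commonly used textbook examples, given here as assumptions, and the tiny test images are made up.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding); multiply and sum."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Commonly used example kernels (assumed, not taken from the lecture slide).
sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
edges = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]

flat = [[7, 7, 7, 7] for _ in range(4)]  # uniform region, no edges
step = [[0, 0, 9, 9] for _ in range(4)]  # dark-to-bright vertical edge

print(convolve2d(flat, sharpen))  # [[7, 7], [7, 7]]: a flat region is left unchanged
print(convolve2d(flat, edges))    # [[0, 0], [0, 0]]: no edge, no response
print(convolve2d(step, edges))    # [[-27, 27], [-27, 27]]: strong response at the edge
```

A learned filter works the same way mechanically; the only difference is that its nine weights are found by training rather than written down by hand.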
Convolution neural networks (27:30)
And what you ultimately create by building convolutional layers and convolutional networks is what's called a CNN, a convolutional neural network. That's going to be the core architecture of today's class. So let's consider a very simple CNN that was designed for image classification. The task here, again, is to learn the features directly from the raw data and use these learned features for classification towards some task of object detection that we want to perform. Now, there are three main operations in a CNN, and we'll go through them step by step here, but then go deeper into each of them in the remainder of this class. So the first step is convolutions, which we've already seen a lot of in today's class. Convolutions are used to generate these feature maps. They take as input both the previous image and some filter that they want to detect, and they output a feature map of how this filter is related to the original image. The second step is, like yesterday, applying a nonlinearity to the result of these feature maps. That injects some nonlinear activations into our neural networks and allows them to deal with nonlinear data. The third step is pooling, which is essentially a downsampling operation that allows our networks to deal with larger and larger scale images by progressively downscaling their size, so that our filters can progressively grow in receptive field. And finally, we feed all of these resulting features to some neural network to infer the class scores. Now, by the time that we get to this fully connected layer, remember that we've already extracted our features, and essentially you can think of this as no longer being a two-dimensional image.
We can now use the methods that we learned about in lecture one to directly take those learned features that the neural network has detected and infer, based on whether they were detected or not, what class we're in. So now let's go through each of these operations one by one in a bit more detail and see how we could build up this very basic architecture of a CNN. First, let's go back and consider one more time the convolution operation that's the central core of the CNN. As before, each neuron in this hidden layer is going to be computed as a weighted sum of its inputs, applying a bias and activating with a non-linearity. That should sound very similar to lecture one in yesterday's class, except that now, when we do that first step, instead of just doing a dot product with our weights, we're going to apply a convolution with our weights, which is simply that element-wise multiplication and addition, right, and that sliding operation. Now, what's really special here, and what I really want to stress, is the local connectivity. Every single neuron in this hidden layer only sees a certain patch of inputs in its previous layer. So if I point at just this one neuron in the output layer, this neuron only sees the inputs at this red square. It doesn't see any of the other inputs in the rest of the image. And that's really important to be able to scale these models to very large-scale images. Now you can imagine that as you go deeper and deeper into your network, each neuron in the next layer is eventually going to attend to a larger patch, right? And that will include data from not only this red square but effectively a much larger red square that you could imagine there. Now let's define the actual computation that's going on. For a neuron in a hidden layer, its inputs are those neurons that fell within its patch in the previous layer.
We can apply this matrix of weights, here denoted as a 4 by 4 filter that you can see on the left-hand side. And in this case, we do an element-wise multiplication, we add the outputs, we apply a bias, and we apply that nonlinearity. Right? Those are the core steps that we take in really all of these neural networks that you're learning about in today's and this week's classes, to be honest. Now remember that this element-wise multiplication and addition operation, that sliding operation, is called convolution, and that's the basis of these layers. So that defines how neurons in convolutional layers are connected and how they're mathematically formulated. But it's also really important to understand that a single convolutional layer could actually try to apply multiple filters, right? Maybe you want to detect multiple features in one image, not just one feature. If you were detecting faces, you don't only want to detect eyes; you want to detect eyes, noses, mouths, ears. All of those things are critical patterns that define a face and can help you classify a face. So what we need to think of is actually convolution operations that can output a volume of different images, right? Every slice of this volume effectively denotes a different filter that can be identified in our original input. And each of those filters is going to basically correspond to a specific pattern or feature in our image as well. Think of the connections of these neurons in terms of their receptive field once again: the locations in the previous layer that a given node is connected to. These parameters really define what I like to think of as the spatial arrangement of information that propagates throughout the network, and throughout the convolutional layers in particular.
Now, just to summarize, we've seen how connections in these types of neural networks are defined and how the output of a convolutional layer is a volume. We are well on our way to really understanding convolutional neural networks and defining them, right? What we just covered is really the main component of CNNs, right? That's the convolutional operation that defines these convolutional layers. The remaining steps are very critical as well, but I want to maybe pause for a second and make sure that everyone's on the same page with the convolutional operation and the definition of convolutional layers.
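To make the convolutional layer concrete, here is a minimal NumPy sketch of the steps described above: slide a filter over the image, do the element-wise multiplication and addition on each patch, add a bias, apply a non-linearity, and stack one feature map per filter into an output volume. The function names, the 6 by 6 toy image, and the two hand-picked filters are assumptions made for illustration and are not taken from the lecture slides. As is standard in deep learning, the filter is not flipped before sliding.

```python
import numpy as np

def conv2d_single(image, kernel, bias=0.0):
    """Slide `kernel` over a 2D `image` (no padding, stride 1) and return one feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]           # the patch this output neuron sees
            out[i, j] = np.sum(patch * kernel) + bias   # element-wise multiply, add, then bias
    return out

def conv_layer(image, kernels, biases):
    """Apply several filters, pass each map through ReLU, and stack them into a volume."""
    maps = [np.maximum(conv2d_single(image, k, b), 0.0) for k, b in zip(kernels, biases)]
    return np.stack(maps, axis=-1)                      # shape: (out_h, out_w, number of filters)

# Hypothetical toy input: a 6x6 image and two 4x4 filters.
image = np.arange(36, dtype=float).reshape(6, 6)
kernels = [np.ones((4, 4)), -np.ones((4, 4))]
biases = [0.0, 0.0]

volume = conv_layer(image, kernels, biases)
print(volume.shape)  # (3, 3, 2)
```

Each slice `volume[:, :, k]` is the feature map of one filter. In a trained network the filter weights are learned rather than written by hand.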
Non-linearity and pooling (34:29)
Awesome. Okay. So the next step here is to take those resulting feature maps that our convolutional layers extract and apply a non-linearity to the output volume of the convolutional layer. As we discussed in the first lecture, applying these non-linearities is really critical because it allows us to deal with nonlinear data. And because image data in particular is extremely nonlinear, that's a critical component of what makes convolutional neural networks actually operational in practice. In particular, for convolutional neural networks, the activation function that is really, really common is the ReLU activation function. We talked a little bit about this in lecture one and two yesterday. The ReLU activation function, you can see it on the right-hand side. Think of this function as a pixel-by-pixel operation that replaces all negative values with zero. It keeps all positive values the same. It's the identity function when a value is positive, but when it's negative, it basically squashes everything back up to zero. Think of this almost as a thresholding function, right? It thresholds everything at zero. Anything less than zero comes back up to zero. So negative values here indicate basically a negative detection in convolution that you may want to just say was no detection, right? And you can think of that as kind of an intuitive mechanism for understanding why the ReLU activation function is so popular in convolutional neural networks. The other reason, and it's not just a belief, is that ReLU activation functions are extremely easy to compute and computationally efficient. Their gradients are very cleanly defined: they're piecewise constant, apart from the kink at zero. So that makes them very popular for these domains. Now the next key operation in a CNN is that of pooling. Now, pooling is an operation that, at its core, serves one purpose.
And that is to reduce the dimensionality of the image progressively as you go deeper and deeper through your convolutional layers. Now, the way you can really start to reason about this is that when you decrease the dimensionality of your features, you're effectively increasing the dimensionality of your filters, right? Because every filter that you slide over a smaller image is capturing a larger receptive field than occurred previously in that network. So a very common technique for pooling is what's called maximum pooling, or max pooling for short. Max pooling is exactly what it sounds like. It operates with these small patches, again, that slide over an image, but instead of doing this convolution operation, what these patches will do is simply take the maximum of that patch location. So think of this as kind of activating the maximum value that comes from that location and propagating only the maximums. I encourage all of you to brainstorm other ways that we could perform even better pooling operations than max pooling. There are many common ways; some examples are mean pooling or average pooling, right? Maybe you don't want to just take the maximum; you could collapse the average of all of these pixels into a single value in the result. But these are the key operations of convolutional neural networks at their core. And now we're ready to really start to put them together and construct a CNN all the way from the ground up.
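The two operations just described can be sketched in a few lines of NumPy: ReLU replaces negative values with zero, and pooling collapses each small patch into a single value, either its maximum or its average. The function names, the 2 by 2 patch size, and the toy feature map below are assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    """Keep positive values unchanged and replace negative values with zero."""
    return np.maximum(x, 0.0)

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size patches of a 2D feature map."""
    h, w = x.shape
    h, w = h - h % size, w - w % size              # drop rows/columns that don't fill a patch
    patches = x[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return patches.max(axis=(1, 3))            # propagate only the maximum of each patch
    return patches.mean(axis=(1, 3))               # average pooling as an alternative

# Hypothetical 4x4 feature map with some negative "detections".
fmap = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -4.,  2.],
                 [ 0.,  1., -3., -1.],
                 [ 2., -6.,  4.,  8.]])

activated = relu(fmap)
pooled_max = pool2d(activated, 2, "max")    # [[5, 3], [2, 8]]
pooled_avg = pool2d(activated, 2, "mean")   # [[1.5, 1.25], [0.75, 3.0]]
print(pooled_max)
print(pooled_avg)
```

Both pooled maps are half the height and width of the input, which is the dimensionality reduction that lets later filters cover a larger receptive field.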
And with CNNs, we can layer these operations one after the other, right, starting first with convolutions, nonlinearities, and then pooling, and repeating these over and over again to learn these hierarchies of features. That's exactly how we obtained pictures like this, which we started yesterday's lecture with: learning these hierarchical decompositions of features by progressively stacking these filters on top of each other. Each filter could then use all of the previous filters that it had learned. So a CNN built for image classification can really be broken down into two parts. First is the feature learning pipeline, in which we learn the features that we want to detect. And then the second part is actually detecting those features and doing the classification. Now, the goal of the convolutional and pooling layers in the first part of that model is to output the high-level features that are extracted from our input. But the next step is to actually use those features and detect their presence in order to classify the image. So we can feed these outputted features into the fully connected layers that we learned about in lecture one, because these are now just a one-dimensional array of features, and we can use those to detect what class we're in. And we can do this by using a function called a softmax function. You can think of a softmax function as simply a normalizing function whose output represents a categorical probability distribution. So another way to think of this is that if you have an array of numbers, and those numbers could take any real value, you want to collapse that into some probability distribution. A probability distribution has several properties, namely that all of its values have to sum to one.
It always has to be between zero and one as well. So maintaining those two properties is what a softmax operation does. You can see its equation right here. It effectively just makes everything positive, and then it normalizes the result across each other, and that maintains those two properties that I just mentioned.
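As a small sketch of that description, softmax exponentiates each value (making everything positive) and divides by the total (so the outputs sum to one). The subtraction of the maximum below is a common numerical-stability step that does not change the result; it is an addition here and is not mentioned in the lecture. The example scores are made up.

```python
import numpy as np

def softmax(logits):
    """Map an array of arbitrary real numbers to a categorical probability distribution."""
    shifted = logits - np.max(logits)   # stability trick; leaves the output unchanged
    exps = np.exp(shifted)              # every value becomes positive
    return exps / np.sum(exps)          # normalize so the values sum to one

# Hypothetical raw scores for three classes.
probs = softmax(np.array([2.0, 1.0, -1.0]))
print(probs, probs.sum())
```

The largest input score always receives the largest probability, so taking the most probable class is the same as taking the highest raw score.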
End-to-end code example (40:07)
Great, so let's put all of this together and actually see how we could program our first convolutional neural network end-to-end, entirely from scratch. So let's start by first defining our feature extraction head, which starts with a convolutional layer, here with 32 filters or 32 features. You can imagine that the result of this first layer is to learn not one filter, not one pattern in our image, but 32 patterns. Okay, so those 32 results are then going to be passed to a pooling layer and then passed on to the next set of convolutional operations. The next set of convolutional operations will now contain 64 features; we'll keep progressively growing and expanding the set of patterns that we're identifying in this image. Next, we can finally flatten those resulting features that we've identified and feed all of this through our dense layers, our fully connected layers that we learned about in lecture one. These will allow us to predict those final, let's say, ten classes. If we have ten different final possible classes in our image, this layer will account for that and allow us to output, using softmax, the probability distribution across those 10 classes.
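The code shown on the lecture slide is not part of this transcript, so what follows is only a forward-pass sketch in plain NumPy of the architecture just described: 32 filters, pooling, 64 filters, pooling, flatten, a dense layer, and a softmax over 10 classes. The 28 by 28 single-channel input, the 3 by 3 filter size, the random untrained weights, the use of a single dense layer, and all names are assumptions; training is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w, b):
    """x: (H, W, C_in), w: (k, k, C_in, C_out), b: (C_out,). No padding, stride 1."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w, w.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]                     # local patch, all input channels
            out[i, j, :] = np.tensordot(patch, w, axes=3) + b  # one weighted sum per filter
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    h, w, c = x.shape
    h, w = h - h % size, w - w % size                          # drop incomplete border patches
    return x[:h, :w].reshape(h // size, size, w // size, size, c).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init(shape):
    return rng.normal(0.0, 0.1, size=shape)                    # random, untrained weights

params = {
    "w1": init((3, 3, 1, 32)),  "b1": np.zeros(32),            # first conv layer: 32 filters
    "w2": init((3, 3, 32, 64)), "b2": np.zeros(64),            # second conv layer: 64 filters
    "w3": init((5 * 5 * 64, 10)), "b3": np.zeros(10),          # dense layer: 10 classes
}

def features(x, p):
    """Feature learning part: (conv -> ReLU -> pool) twice, then flatten."""
    x = max_pool(relu(conv2d(x, p["w1"], p["b1"])))            # 28 -> 26 -> 13
    x = max_pool(relu(conv2d(x, p["w2"], p["b2"])))            # 13 -> 11 -> 5
    return x.reshape(-1)                                       # 5 * 5 * 64 = 1600 features

def classify(feat, p):
    """Classification part: dense layer followed by softmax."""
    return softmax(feat @ p["w3"] + p["b3"])

image = rng.random((28, 28, 1))                                # hypothetical grayscale input
probs = classify(features(image, params), params)
print(probs.shape, probs.sum())
```

Splitting the model into `features` and `classify` mirrors the two-part view of a CNN from the lecture. In practice a deep learning framework would replace the explicit loops and handle training.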
So far we've talked about how we can use CNNs to perform image classification tasks. But in reality, one thing I really want to stress in today's class, especially towards the end, is that this same architecture and the same building blocks that we've talked about so far are extensible, and they extend to so many different applications and model types that we can imagine. So for example, when we considered the CNN for classification, we saw that it really had two parts, right? The first part being feature extraction, learning what features to look for, and the second part being the classification, the detection of those features. Now, what makes a convolutional neural network really, really powerful is exactly the observation that the feature learning part, this first part of the neural network, is extremely flexible. You can take that first part of the neural network, chop off what comes after it, and put a bunch of different heads into the part that comes after it. The goal of the first part is to extract those features. What you do with the features is entirely up to you, but you can still leverage the flexibility and the power of the first part to learn all of those core features. So, for example, that later portion, after you've extracted the features, could cover all of the different image classification domains, or we could also introduce new architectures that take those features and maybe perform tasks like segmentation or image captioning, like we saw in yesterday's lecture. So in the case of classification, for example, just to tie up the classification story, there's a significant impact in domains like healthcare and medical decision making, where deep learning models are being applied to the analysis of medical scans across a whole host of different medical imagery.
Object detection (43:18)
Now, classification tells us basically a discrete prediction of what our image contains, but we can actually go much deeper into this problem as well. So for example, imagine that we're not only trying to identify that this image is an image of a taxi, which you can see here, but, more importantly, maybe we want our neural network to tell us not only that this is a taxi, but also to identify and draw a specific bounding box over the location of the taxi. So this is kind of a two-phase problem. Number one is that we need to draw a box, and number two is that we need to classify what was in that box, right? So it's both a regression problem (where is the box? That's a continuous problem) and a classification problem (what is in that box?). Now, that's a much, much harder problem than what we've covered so far in the lecture today, because potentially there are many objects in our scene, not just one object, right? So we have to account for the fact that our scene could contain arbitrarily many objects. Now our network needs to be flexible to that degree. It needs to be able to infer a dynamic number of objects in the scene. If the scene is only of a taxi, then it should only output that one bounding box. But on the other hand, if the image has many objects, potentially even of different classes, we need a model that can draw a bounding box for each of these different examples, as well as associate a predicted classification label with each one independently. Now, this is actually quite complicated in practice because those boxes can be anywhere in the image, right? There are no constraints on where the boxes can be. And they can also be of different sizes. They can also have different aspect ratios, right? Some can be tall, some can be wide. Let's consider a very naive way of doing this first. Let's take our image and start by placing a random box somewhere on that image.
For example, we just pick a random location, a random size, and we'll place a box right there. Then we can take that box and feed only that random box through our convolutional neural network, which is trained to do classification, just classification. And this neural network can detect, well, number one, is there a class of object in that box or not? And if so, what class is it? And then what we could do is just keep repeating this process over and over again for all of these random boxes in our image. You know, many, many instances of random boxes. We keep sampling a new box, feed it through our convolutional neural network, and ask this question: what was in the box? If there was something in there, then what is it? And we keep moving on until we have exhausted all of the boxes in the image. But the problem here is that there are just way too many potential inputs that we would have to deal with. This would be totally impractical to run in a real-time system, for example, with today's compute. It results in way too many scales, especially for the types of resolutions of images that we deal with today. So instead of picking random boxes, let's try to use a very simple heuristic, right, to identify maybe some places with lots of variability in the image, where there is a high likelihood that an object might be present. These might contain meaningful objects in our image, and we can feed just those high-attention locations to our convolutional neural network. Then we can speed up that first part of the pipeline a lot, because now we're not just picking random boxes.
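To get a feel for why exhaustively classifying boxes is impractical, here is a short counting sketch. It counts every axis-aligned box whose edges fall on pixel boundaries; the 640 by 480 image size and the function name are assumptions for illustration.

```python
def count_boxes(width, height):
    """Number of distinct axis-aligned, pixel-aligned boxes in a width x height image."""
    # A box is fixed by its first and last pixel column and its first and last pixel row.
    horizontal_spans = width * (width + 1) // 2
    vertical_spans = height * (height + 1) // 2
    return horizontal_spans * vertical_spans

# Hypothetical image resolution.
n = count_boxes(640, 480)
print(f"{n:.2e} candidate boxes")
```

Even at this modest resolution the count is on the order of tens of billions, and each box would need its own pass through the classifier, which is why the lecture moves on to proposing a small number of promising regions instead.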
Maybe we use some simple heuristic to identify where interesting parts of the image might be. But still, this is actually very slow in practice. We have to feed in each region independently to the model. And plus, it's very brittle, because ultimately, the part of the model that is looking at where potential objects might be is detached from the part that's doing the detection of those objects. Ideally, we want one model that is able to both figure out where to attend to and do that classification afterwards. So there have been many variants proposed in this field of object detection, but for the purpose of today's class I want to introduce you to one of the most popular ones. Now, this is a model called R-CNN, or Faster R-CNN, which actually attempts to learn not only how to classify these boxes, but also how to propose where those boxes might be in the first place, so that you can learn where to feed into the downstream neural network. Now, this means that we can feed the image into what are called region proposal networks. The goal of these networks is to propose certain regions in the image that you should attend to, and then feed just those regions into the downstream CNNs. So the goal here is to directly try to learn or extract all of those key regions and process them through the later part of the model. Each of these regions is processed with its own independent feature extractor, and then a classifier can be used to aggregate them all and perform feature detection as well as object detection.
Now, the beautiful thing about this is that it requires only a single pass through the network, so it's extraordinarily fast. It can easily run in real time, and it's very commonly used in many industry applications as well. It can even run on your smartphone. So in classification, we saw how we can predict a single object per image; in object detection, we saw how we can potentially infer multiple objects, with bounding boxes, in your image. There's also one more type of task which I want to point out, which is called segmentation. Segmentation is the task of classification, but now done at every single pixel. This takes the idea of object detection, which uses bounding boxes, to the extreme. Now, instead of drawing boxes, we're not even going to consider boxes. We're going to learn how to classify every single pixel in this image in isolation, right? So it's a huge number of classifications that we're going to do. And first let me show this example. On the left-hand side, what this looks like is you're feeding in an original RGB image. The goal of the right-hand side is to learn, for every pixel in the left-hand side, what was the class of that pixel, right? So this is kind of in contrast to just determining boxes over our image. Now we're looking at every pixel in isolation. And you can see, for example, that the pixels of the cow are clearly differentiated from the pixels of the sky or the pixels of the grass, right? And that's a key component of semantic segmentation networks. The output here is created by, again, using these convolutional operations, followed by pooling operations, which learn an encoder, which you can think of on the left-hand side. These are learning the features from our RGB image, learning how to put them into a space so that it can reconstruct into a new space of semantic labels.
So you can imagine kind of a downscaling and then a progressive upscaling into the semantic space. But when you do that upscaling, it's important, of course, that you can't be pooling down that information. You need to kind of invert all of those operations. So instead of doing convolutions with pooling, you can now do convolutions with basically reverse pooling, or expansions, right? You can grow your feature sets at every layer. And here's an example on the bottom of just a code piece that actually defines these layers. You can plug these layers in, combine them with convolutional layers, and build these fully convolutional networks that can accomplish this type of task. Now, of course, this can be applied in many other applications in health care as well, especially for segmenting out, let's say, cancerous regions or even identifying parts of the blood which are infected with malaria, for example.
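The code piece from the slide is not included in this transcript. As a stand-in, here is a minimal NumPy sketch of the two ideas in this passage: an expansion step that grows a coarse feature map back toward image resolution, and a per-pixel classification that picks one class for every pixel. Nearest-neighbor upsampling is used here as the simplest possible "reverse pooling"; it is an assumption and may differ from the learned upsampling layers the lecture refers to. The names and the toy scores are also made up.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Grow a (H, W, C) map by repeating each spatial location `factor` times per axis."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def pixelwise_labels(scores):
    """scores: (H, W, num_classes) -> (H, W) map holding the top-scoring class per pixel."""
    return np.argmax(scores, axis=-1)

# Hypothetical coarse 2x2 map of class scores for two classes (e.g. 0 = grass, 1 = cow).
coarse = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.3, 0.7], [0.6, 0.4]]])

full = upsample_nearest(coarse, 2)     # shape (4, 4, 2)
labels = pixelwise_labels(full)        # one class label per pixel
print(labels)
```

In a real segmentation network the upsampling is interleaved with convolutions so the expanded maps can be refined, rather than simply copied as they are here.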
End-to-end self driving cars (51:36)
And one final example here, of self-driving cars. Let's say that we want to build a neural network for autonomous navigation, specifically a model that can take as input an image, as well as, let's say, some very coarse maps of where it thinks it is. Think of this as basically a screenshot of Google Maps, essentially, given to the neural network, right? It's the GPS location on the map. And it wants to directly infer not a classification or a semantic classification of the scene, but now directly infer the actuation: how to drive and steer this car into the future, right? Now, this is a full probability distribution over the entire space of control commands, right? It's a very large continuous probability space, and the question is, how can we build a neural network to learn this function? And the key point that I'm stressing with all of these different types of architectures is that all of them use the exact same encoder. We haven't changed anything when going from classification to detection to semantic segmentation and now to here. All of them are using the same underlying building blocks of convolutions, nonlinearities, and pooling. The only difference is what happens after we perform those feature extractions: how do we take those features and learn our ultimate task? So for example, in the case of probabilistic control commands, we would want to take those learned features and understand how to predict the parameters of a full continuous probability distribution, like you can see on the right-hand side, as well as the deterministic control for our desired destination. And again, like we talked about at the very beginning of this class, this model, which goes directly from images all the way to steering wheel angles, essentially, of the car, is a single model.
It's learned entirely end to end. We never told the car, for example, what a lane marker is, or the rules of the road. It was able to observe a lot of human driving data and extract these patterns, these features of what makes a good human driver different from a bad human driver, and learn how to imitate those same types of actions. So without any human intervention or human rules that we impose on these systems, they can simply watch all of this data and learn how to drive entirely from scratch. A human, for example, can actually enter the car and input a desired destination, and this end-to-end CNN will actuate the control commands to bring them to their destination. Now, I'll conclude today's lecture by saying that we've touched on a few applications of CNNs today, but the applications of CNNs are enormous, right, far beyond the examples that I provided today.
They all tie back to this core concept of feature extraction and detection. And after you do that feature extraction, you can really crop off the rest of your network and attach many different heads for many different tasks and applications that you might care about. We've touched on a few today, but there are really so, so many in different domains. And with that, I'll conclude. Very shortly, we'll be talking about generative modeling, which is a really central part of today's and this week's lecture series. And after that, later on, we'll have the software lab, which I'm excited for all of you to start participating in. And yeah, we can take a short five-minute break and continue the lectures from there. Thank you.