MIT 6.S191 (2021): Convolutional Neural Networks
Transcription for the video titled "MIT 6.S191 (2021): Convolutional Neural Networks".
Hi everyone and welcome back to MIT 6S191. Today we're going to be talking about one of my favorite topics in this course and that's how we can give machines a sense of vision. Vision is one of the most important human senses. I believe sighted people rely on vision quite a lot, from everything from navigating in the world to recognizing and manipulating objects to interpreting facial expressions and understanding very complex human emotions. I think it's safe to say that vision is a huge part of everyday human life. And today we're going to learn about how we can use deep learning to build very powerful computer vision systems and actually predict what is where by only looking, and specifically looking at only raw visual inputs. I like to think that this is a very super simple definition of what vision, at its core, really means. But actually vision is so much more than simply understanding what an image is of. It means not just what the image is of, but also understanding where the objects in the scene are and really predicting and anticipating forward in the future what's going to happen next. Take this scene for example. We can build computer vision algorithms that can identify objects in the scene such as this yellow taxi or maybe even this white truck on the side of the road. But what we need to understand on a different level is what is actually going to be required to achieve true vision. Where are all of these objects going? For that we should actually focus probably more on the yellow taxi than on the white truck because there are some subtle cues in this image that you can probably pick up on that lead us to believe that probably this white truck is parked on the side of the road. It's stationary and probably won't be moving in the future, at least for the time that we're observing the scene. 
The yellow taxi on the other hand is, even though it's also not moving, it is much more likely to be stationary as a result of the pedestrians that are crossing in front of it. And that's something that is very subtle, but can actually be reasoned about very effectively by our brains. And humans take this for granted, but this is an extraordinarily challenging problem in the real world. And since in the real world, building true vision algorithms can require reasoning about all of these different components, not just in the foreground, but also there are some very important cues that we can pick up in the background like this road light, as well as some obstacles in the far distance as well. And building these vision algorithms really does require an understanding of all of these very subtle details.
Visual Representation In Computing
Amazing applications of vision (02:47)
Now, deep learning is bringing forward an incredible revolution, or evolution as well, of computer vision algorithms and applications, including allowing robots to use visual cues to perform things like navigation. And these algorithms that you're going to learn about today in this class have become so mainstream and so compressed that they are all fitting and running in each of our pockets, in our telephones, for processing photos and videos and detecting faces for greater convenience. We're also seeing some extraordinarily exciting applications of vision in biology and medicine for picking up on extremely subtle cues and detecting things like cancer, as well as in the field of autonomous driving. And finally, in a few slides, I'll share a very inspiring story of how the algorithms that you're going to learn about today are also being used for accessibility to aid the visually impaired. Now, deep learning has taken computer vision especially by storm because of its ability to learn directly from the raw image inputs and learn to do feature extraction only through observation of a ton of data. Now, one example of that that is really prevalent in the computer vision field is facial detection and facial recognition. On the left-hand side, you can actually see an icon of a human eye, which pictorially I'm using to represent images that we perceive. And we can also pass these through a neural network for predicting these facial features. Now deep learning has transformed this field because it allows the creator of the machine learning or the deep learning algorithm to easily swap out the end task, given enough data to learn this neural network in the middle between the vision and the task, and try and solve it.
So here we're performing an end task of facial detection, but just equivalently that end task could be in the context of autonomous driving. Here where we take an image as an input, which you can see actually in the bottom right hand corner, and we try to directly learn the steering control for the output, and actually learn directly from this one observation of the scene where the car should control. So what is the steering wheel that the car should execute? And this is done completely end to end. The entire control system here of this vehicle is a single neural network learned entirely from data. Now this is very, very different than the majority of other self-driving car companies, like you'll see with Waymo and Tesla, et cetera. And we'll talk more about this later, but I actually wanted to share this one clip with you because this is one of the autonomous vehicles that we've been building in our lab here in CSAIL that I'm part of. And we'll see more about that later in the lecture as well. We're seeing, like I mentioned, a lot of applications in medicine and healthcare where we can take these raw images and scans of patients and learn to detect things like breast cancer, skin cancer, and now most recently taking scans of patients' lungs to detect COVID-19. Finally, I want to share this inspiring story of how computer vision is being used to help the visually impaired. So in this project, actually, researchers built a deep learning enabled device that can detect a trail for running and provide audible feedback to the visually impaired users such that they can run. And now to demonstrate this, let me just share this very brief video. The machine learning algorithm that we have detects the line and can tell whether the line is to the runner's left, right, or center. We can then send signals to the runner. That guides them left and right based on their positioning. The first time we went out, we didn't even know if sound would be enough to guide me. 
So it's sort of that beta testing process that you go through. From human eyes, it's very obvious. It's very obvious to recognize the line. Teaching a machine learning model to do that is not that easy. You step left and right as you're running, so there's like a shake to the line, left and right. As soon as you start going outdoors, now the light is a lot more variable. Tree shadows, falling leaves, and also the line on the ground can be very narrow, and there may be only a few pixels for the computer vision model to recognize. There was no tether. There was no stick. There was no furry dog. It was just being with yourself. That's the first time I've run alone in... in decades. So these are often tasks that we as humans take for granted. But for a computer, it's really remarkable to see how deep learning is being applied to some of these problems focused on really doing good and just helping people. Here, in this case, the visually impaired: a man who has never run without his guide dog before is now able to run independently through the trails with the aid of this computer vision system.
What computers see (07:56)
And like I said, we often take these tasks for granted because it's so easy for each sighted individual to do them routinely, but we can actually train computers to do them as well. And in order to do that, though, we need to ask ourselves some very foundational questions, specifically stemming from how we can build a computer that can quote-unquote see. And specifically, how does a computer process an image? Let's use an image as our base example of sight to a computer. So to a computer, images are just numbers. They're two-dimensional lists of numbers. Suppose we have a picture here, this is of Abraham Lincoln. It's just made up of what are called pixels. Now a pixel is simply a number, like I said, here represented in a range of either 0 to 1 or 0 to 255. And since this is a grayscale image, each of these pixels is just one number. If you have a color image, you would represent each pixel by three numbers: a red, a green, and a blue channel, RGB. Now what does the computer see? So we can represent this image as a two-dimensional matrix of these numbers, one number for each pixel in the image, and this is it, this is how a computer sees an image. Like I said, if we have an RGB image, not a grayscale image, we can represent this by a three-dimensional array. Now we have three two-dimensional arrays stacked on top of each other. One of those two-dimensional arrays corresponds to the red channel, one to the green, one to the blue, representing this RGB image. And now we have a way to represent images to computers, and we can start to think about what types of computer vision algorithms we can perform with this. So there are two very common types of learning tasks, like we saw in the first and the second classes: one is regression, and the other is classification.
In regression tasks, our output takes the form of a continuous value. And in classification, it takes a single class label. So let's consider first the problem of classification. We want to predict a label for each image. So for example, let's say we have a database of all US presidents, and we want to build a classification pipeline to tell us which president this image is of. So we feed this image that we can see on the left-hand side to our model. And we want it to output the probability that this image is of any of these particular presidents that this database consists of. In order to classify these images correctly, though, our pipeline needs to be able to tell what is actually unique about a picture of Abraham Lincoln versus a picture of any other president like George Washington or Jefferson or Obama. Another way to think about these differences between images in the image classification pipeline is at a high level, in terms of the features that are really characteristic of that particular class. So for example, what are the features that define Abraham Lincoln? Now classification is simply done by detecting the features in that given image. So if the features for a particular class are present in the image, then we can predict with pretty high confidence that that class is occurring with a high probability. So if we're building an image classification pipeline, our model needs to know, one, what the features are, and two, it needs to be able to detect those features in a brand new image. So for example, if we want to detect human faces, some features that we might want to be able to identify would be noses, eyes, and mouths. Whereas if we want to detect cars, we might be looking at certain things in the image like wheels, license plates, and headlights. And the same for houses, with doors and windows and steps.
These are all examples of features for the larger object categories. Now one way to do this and solve this problem is actually to leverage knowledge about a particular field. Say, let's say human faces. So if we want to detect human faces, we could manually define in images what we believe those features are and actually use the results of our detection algorithm for classification. But there's actually a huge problem to this type of approach. And that is that images are just 3D arrays of numbers, of brightness values, and that each image can have a ton of variation. And this includes things like occlusions in the scene. There could also be variations in illumination, the lighting conditions, as well as you could even think of intraclass variation, variation within the same class of images. Our classification pipeline, whatever we're building, really needs to be invariant to all of these types of variations, but it still needs to be sensitive to picking out the different interclass variations. So being able to distinguish a feature that is unique to this class in comparison to features or variations of that feature that are present within the class. Now, even though our pipeline could use features that we as humans define. That is, if a human was to come into this problem knowing something about the problem a priori, they could define or manually extract and break down what features they want to detect for this specific task. Even if we could do that, due to the incredible variability of the scene of image data in general, the detection of these features is still an extremely challenging problem in practice because your detection algorithm needs to be invariant to all of these different variations. So instead of actually manually defining these, how can we do better? And what we actually want to do is be able to extract features and detect their presence in images automatically in a hierarchical fashion. 
And this should remind you back to the first lecture when we talked about hierarchy being a core component of deep learning.
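As a minimal sketch of the image representation described in this section, the snippet below stores a grayscale image as a two-dimensional list and a color image as a three-dimensional one. The pixel values, the `shape` and `normalize` helpers, and the height x width x channel layout are all assumptions made for illustration (the three color planes could equally be stacked as three separate 2D arrays), and plain Python lists are used so it runs without extra libraries.

```python
# Hypothetical tiny images; real pixel values would come from an image file.

# Grayscale: a 2D list, one number per pixel (here in the 0-255 range).
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# Color: three numbers (red, green, blue) per pixel, giving a 3D array
# of shape height x width x 3.
rgb = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[255, 255, 0], [0, 0, 0], [255, 255, 255]],
]


def shape(array):
    """Return the dimensions of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


def normalize(image):
    """Rescale a grayscale image from the 0-255 range to the 0-1 range."""
    return [[pixel / 255 for pixel in row] for row in image]


print(shape(gray))  # (2, 3)
print(shape(rgb))   # (2, 3, 3)
gray01 = normalize(gray)
```

Both ranges mentioned in the lecture (0 to 255 and 0 to 1) describe the same image; `normalize` only rescales the numbers.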
Learning visual features (14:02)
And we can use neural network based approaches to learn these visual features directly from data and to learn a hierarchy of features to construct a representation of the image internal to our network. So again, like we saw in the first lecture, we can detect these low-level features and compose them together to build these mid-level features, and then in later layers, these higher-level features to really perform the task of interest. So neural networks will allow us to learn these hierarchies of visual features from data if we construct them cleverly. So this will require us to use some different architectures than what we have seen so far in the class, namely architectures from the first lecture with feedforward dense layers, and in the second lecture, recurrent layers for handling sequential data. This lecture will focus on yet another way that we can extract features, specifically focusing on the visual domain. So let's recap what we learned in lecture one. So in lecture one, we learned about these fully connected neural networks, also called dense neural networks, where you can have multiple hidden layers stacked on top of each other and each neuron in each hidden layer is connected to every neuron in the previous layer. Now let's say we want to use a fully connected network to perform image classification, and we're going to try and motivate the use of something better than this by first starting with what we already know, and we'll see the limitations of this. So in this case, remember our input is this two-dimensional image. It's a two-dimensional array, but it can be collapsed into a one-dimensional vector of pixel values if we just stack all of its rows on top of each other. And what we're going to do is feed in that vector of pixel values to our hidden layer, connected to all neurons in the next layer.
Now here you should already appreciate something and that is that all spatial information that we had in this image is automatically gone. It's lost because now since we have flattened this two-dimensional image into one dimension, we have now basically removed any spatial information that we previously had by the next layer. And our network now has to relearn all of that very important spatial information, for example that one pixel is closer to its neighboring pixel. That's something very important in our input but it's lost immediately in a fully connected layer. So the question is how can we build some structure into our model so that we can actually inform the learning process and provide some prior information to the model and help it learn this very complicated and large input image. So to do this, let's keep our representation of our image, our 2D image, as an array, a two-dimensional array of pixel values. Let's not collapse it down into one dimension. Now one way that we can use the spatial structure would be to actually connect patches of our input, not the whole input, but just patches of the input to neurons in the hidden layer. So before everything was connected from the input layer to the hidden layer, but now we're just going to connect only things that are within a single patch to the next neuron in the next layer. Now that is really to say that each neuron only sees, so if we look at this output neuron, this neuron is only going to see the values coming from the patch that precedes it. This will not only reduce the number of weights in our model, but it's also going to allow us to leverage the fact that in an image, spatially close pixels are likely to be somewhat related and correlated to each other and that's a fact that we should really take into account. 
So notice how that only a small region of the input layer influences this output neuron and that's because of this spatially connected idea that we want to preserve as part of this architecture. So to find connections across the whole input now we can apply the same principle of connecting patches in our input layer to single neurons in the subsequent layer. And we can basically do this by sliding that patch across the input image, and for each time we slide it, we're going to have a new output neuron in the subsequent layer. Now this way, we can actually take into account some of the spatial structure that I'm talking about inherent to our input. But remember that our ultimate task is not only to preserve spatial structure, but to actually learn the visual features. And we do this by weighting the connections between the patches and the neurons so we can detect particular features so that each patch is going to try to perform that detection of the feature. So now we ask ourselves, how can we weight this patch such that we can detect those features?
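The contrast between flattening an image and connecting patches to neurons can be sketched in a few lines. This is only an illustration under assumed sizes: the 4x4 image, the 2x2 patch size, and the helper names `flatten` and `extract_patches` are hypothetical and do not come from the lecture.

```python
def flatten(image):
    """Collapse a 2D image into a 1D list, row after row."""
    return [pixel for row in image for pixel in row]


def extract_patches(image, size, stride=1):
    """Slide a size x size window over the image and collect each patch."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = [row[left:left + size] for row in image[top:top + size]]
            patches.append(patch)
    return patches


# Hypothetical 4x4 image whose pixel values are simply 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

flat = flatten(image)                     # 16 values, 2D layout discarded
patches = extract_patches(image, size=2)  # 3 x 3 = 9 patches of 2x2 pixels

# Pixels 0 and 4 are vertical neighbors in the image, but four positions
# apart once flattened; inside the first patch they are still adjacent.
print(flat.index(4) - flat.index(0))  # 4
print(patches[0])                     # [[0, 1], [4, 5]]

# Weights needed for 9 output neurons:
dense_weights = len(flat) * len(patches)  # every pixel to every neuron
patch_weights = len(patches) * 2 * 2      # each neuron sees one 2x2 patch
print(dense_weights, patch_weights)       # 144 36
```

Each patch yields one output neuron, so sliding the window across the image produces the grid of neurons in the next layer while using far fewer connections than the dense layer.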
Feature extraction and convolution (18:50)
Well, in practice, there's an operation called a convolution, and we'll first think about this at a high level. Suppose we have a 4x4 patch or a filter, which will consist of 16 weights. We're going to apply this same filter to 4x4 patches in the input, and use the result of that operation to define the state of the neuron in the next layer. So the output of that single neuron in the next layer is going to be defined by applying a filter of equal size, with learned weights, to this patch. We're then going to shift that patch over, let's say in this case by two pixels, to grab the next patch and thereby compute the next output neuron. Now this is how we can think about convolutions at a very high level. But you're probably wondering here, well, how does the convolution operator actually allow us to extract features? And I want to make this really concrete by walking through a very simple example. So suppose we want to classify the letter X in a set of black and white images of letters, where black is equal to negative 1 and white is equal to positive 1. Now to classify, it's clearly not possible to simply compare the two images, the two matrices, on top of each other and say, are they equal? Because we also want to be classifying this X no matter if it has some slight deformations, if it's shifted or if it's enlarged, rotated, or deformed. We want to build a classifier that's a little bit robust to all of these changes. So how can we do that? We want to detect the features that define an X. So instead, we want our model to basically compare images of an X piece by piece. And the really important pieces that it should look for are exactly what we've been calling the features.
If our model can find those important features, those rough features that define the X in the same positions, roughly the same positions, then it can get a lot better at understanding the similarity between different examples of X, even in the presence of these types of deformities. So let's suppose each feature is like a mini image, it's a patch, right? It's also a small array, a small two-dimensional array of values and we'll use these filters to pick up on the features common to the X's. In the case of this X for example, the filters we might want to pay attention to might represent things like the diagonal lines on the edge as well as the crossing points you can see in the second patch here. So we'll probably want to capture these features in the arms and the center of the X in order to detect all of these different variations. So note that these smaller matrices of filters, like we can see on the top row here, these represent the filters of weights that we're going to use as part of our convolution operation in order to detect the corresponding features in the input image. So all that's left for us to define is actually how this convolution operation actually looks like and how it's able to pick up on these features given each of these, in this case, three filters. So how can it detect, given a filter, where this filter is occurring or where this feature is occurring rather in this image?
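To make the sliding of a 4x4 filter with a shift of two pixels concrete, here is a small sketch of how many output neurons that produces along one axis. The input size of 8 is a hypothetical value chosen for illustration (the lecture does not state one), no padding is assumed, and both helper names are made up for this example.

```python
def output_size(input_size, filter_size, stride):
    """Number of filter positions along one axis, with no padding."""
    return (input_size - filter_size) // stride + 1


def patch_origins(input_size, filter_size, stride):
    """Top-left offsets at which the filter is applied along one axis."""
    return list(range(0, input_size - filter_size + 1, stride))


INPUT_SIZE = 8   # hypothetical; the lecture does not state the image size
FILTER_SIZE = 4  # a 4x4 filter, i.e. 16 weights
STRIDE = 2       # shift the patch by two pixels each time

n = output_size(INPUT_SIZE, FILTER_SIZE, STRIDE)
origins = patch_origins(INPUT_SIZE, FILTER_SIZE, STRIDE)
num_weights = FILTER_SIZE * FILTER_SIZE

print(n)            # 3 output neurons per row and per column
print(origins)      # [0, 2, 4]
print(num_weights)  # 16 weights, reused at every position
```

Because the same 16 weights are applied at every position, the number of weights does not grow with the image size; only the number of output neurons does.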
The convolution operation (22:20)
And that is exactly what the operation of convolution is all about. The idea of convolution is to preserve the spatial relationship between pixels by learning image features in small little patches of image data. Now, to do this, we need to perform an element-wise multiplication between the filter matrix and a patch of the input image of the same dimension. So if we have a patch of three by three, we're going to compare that to our filter, which is also of size three by three, with learned weights. So in this case, our filter, which you can see on the top left, has entries that are all either positive one or negative one, and when we element-wise multiply this filter by the corresponding green input image patch, we can actually see the result in this matrix. So multiplying all of the positive ones by positive ones will give a positive one, and multiplying a negative one by a negative one will also give a positive one. So the result of all of our element-wise multiplications is going to be a three-by-three matrix of all ones. Now the next step as part of a convolution operation is to add all of those element-wise multiplications together. So the result here after we add those outputs is going to be 9. So what this means now, actually before we get to that, let me start with another very brief example. Suppose we want to compute the convolution now not of a very large image, but of just a 5x5 image. Our filter here is 3x3, so we can slide this 3x3 filter over the entirety of our input image, performing this element-wise multiplication and then adding the outputs. Let's see what this looks like. So let's start by sliding this filter over the top left-hand side of our input. We can element-wise multiply the entries of this filter with this patch, and then add them together.
And for this part, this 3x3 filter is placed on the top left corner of this image, element-wise multiply, add, and we get this resulting output of this neuron to be 4. And we can slide this filter over one spot by one spot to the next patch and repeat. The results in the second entry now would be corresponding to the activation of this filter applied to this part of the image, in this case, three. And we can continue this over the entirety of our image until the end, when we have completely filled up this activation or feature map. And this feature map really tells us where in the input image was activated by this filter. So for example, wherever we see this pattern conveyed in the original input image, that's where this feature map is going to have the highest value. And that's where we need to actually activate maximally. Now that we've gone through the mechanism of the convolution operation, let's see how different filters can be used to produce feature maps. So picture this picture of a woman's face. This woman's name is Lena. And the output of applying these three convolutional filters so you can see the three filters that we're considering on the bottom right hand corner of each image by simply changing the weights of these filters each filter here has a different weight we can learn to detect very different features in that image so we can learn to sharpen the image by applying this very specific type of sharpening filter. We can learn to detect edges or we can learn to detect very strong edges in this image simply by modifying these filters. So these filters are not learned filters, these are constructed filters and there has been a ton of research historically about developing, hand engineering these filters. But what convolutional neural networks want to do is actually to learn the way it's defining these filters. So the network will learn what kind of features it needs to detect in the image. 
Does it need to do edge detection or strong edge detection or does it need to detect certain types of edges, curves, certain types of geometric objects, etc. What are the features that it needs to extract from this image? And by learning the convolutional filters, it's able to do that. So I hope now you can actually appreciate how convolution allows us to capitalize on very important spatial structure and to use sets of weights to extract very local features in the image and to very easily detect different features by simply using different sets of weights and different filters. Now these concepts of preserving spatial structure and local feature extraction using the convolutional operation are actually core to the convolutional neural networks that are used for computer vision tasks.
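The multiply-and-add procedure from this section can be sketched directly. The 5x5 image and 3x3 filter values below are assumptions (the transcript does not list them); they were chosen so that the first two outputs come out as 4 and 3, as in the walkthrough. Note that, as in the lecture, the filter is applied without flipping it, which is the form commonly used in deep learning.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position multiply
    element-wise and sum the products (no padding, no kernel flip)."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    output = []
    for top in range(0, height - k + 1, stride):
        row_out = []
        for left in range(0, width - k + 1, stride):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[top + i][left + j]
            row_out.append(total)
        output.append(row_out)
    return output


# Assumed 5x5 input and 3x3 filter, for illustration only.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]

# A +1/-1 diagonal filter applied to a patch that matches it exactly:
# every product is +1, so the nine products sum to 9.
diagonal = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
match = convolve2d(diagonal, diagonal)
print(match)  # [[9]]
```

The largest entries of `feature_map` mark the positions where the image patch most resembles the filter, which is what the lecture means by the filter activating maximally.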
Convolution neural networks (27:27)
And that's exactly what I want to dive into next. Now that we've gotten the operation, the mathematical foundation of convolutions under our belts, we can start to think about how we can utilize this operation, this operation of convolutions, to actually build neural networks for computer vision tasks and tie this whole thing in to this paradigm of learning that we've been exposed to in the first couple lectures. Now these networks aptly are named convolutional neural networks very appropriately and first we'll take a look at a CNN or convolutional neural network designed specifically for the task of image classification. So how can you use CNNs for classification? Let's consider a simple CNN designed for the goal here to learn features directly from the image data. And we can use these learned features to map these onto a classification task for these images. Now there are three main components and operations that are core to a CNN. The first part is what we've already gotten some exposure to in the first part of this lecture, and that is the convolution operation. And that allows us, like we saw earlier, to generate these feature maps and detect features in our image. The second part is applying a non-linearity. And we saw the importance of non-linearities in the first and the second lecture in order to help us deal with these features that we extract being highly non-linear. Thirdly, we need to apply some sort of pooling operation. This is another word for a downsampling operation and this allows us to scale down the size of each feature map. Now the computation of a class of scores which is what we're doing when we define an image classification task is actually performed using these features that we obtain through convolution, non-linearity, and pooling, and then passing those learned features into a fully connected network or a dense layer like we learned about in the first part of the class in the first lecture. 
And we can train this model end-to-end from image input to class prediction output using fully connected layers and convolutional layers end-to-end where we learn as part of the convolutional layers the sets of weights of the filters for each convolutional layer and as well as the weights that define these fully connected layers that actually perform our classification task in the end. And we'll go through each one of these operations in a bit more detail to really break down the basics and the architecture of these convolutional neural networks. So first we'll consider the convolution operation of a CNN. And as before, each neuron in the hidden layer will compute a weighted sum of each of its inputs. Like we saw in the dense layers, we'll also need to add on a bias to allow us to shift the activation function and apply and activate it with some non-linearity so that we can handle non-linear data relationships. Now what's really special here is that the local connectivity is preserved. Each neuron in the hidden layer, you can see in the middle, only sees a very specific patch of its inputs. It does not see the entire input neurons like it would have if it was a fully connected layer. But no, in this case, each neuron output observes only a very local connected patch as input. We take a weighted sum of those patches, we compute that weighted sum, we apply a bias, and we apply and activate it with a nonlinear activation function. And that's the feature map that we're left with at the end of a convolutional layer. We can now define this actual operation more concretely using a mathematical equation. Here we're left with a 4x4 filter matrix, and for each neuron in the hidden layer, its inputs are those neurons in the patch from the previous layer. We apply this set of weights, w, i, j. 
In this case, like I said, it's a 4x4 filter, and we do this element-wise multiplication of every element in W by the corresponding elements in the input x, we add the bias, and we activate it with this non-linearity. Remember, our element-wise multiplication and addition is exactly that convolutional operation that we talked about earlier. So if you look up the definition of what convolution means, it is actually that exactly. It's element-wise multiplication and then a summation of all of the results. And this actually defines also how convolutional layers are connected to these ideas. But with this single convolutional layer, how can we have multiple filters? So all we saw in the previous slide is how we can take this input image and learn a single feature map. But in reality, there are many types of features in our image. How can we use convolutional layers to learn a stack of many different types of features that could be useful for performing our type of task? How can we use this to do multiple feature extraction? Now the output layer is still a convolution, but now it has a volume dimension, where the height and the width are spatial dimensions dependent upon the dimensions of the input layer, the dimensions of the filter, and the stride, that is, how much we're skipping each time that we apply the filter. We also need to think about the connections of the neurons in these layers in terms of what's called their receptive field: the locations of their input in the model, in the path of the model that they're connected to. Now these parameters actually define the spatial arrangement of how the neurons are connected in the convolutional layers and how those connections are really defined. So the output of a convolutional layer in this case will have this volume dimension. So instead of having one filter map that we slide along our image, now we're going to have a volume of filters.
Each filter is going to be slid across the image and compute this convolution operation piece by piece for each filter. The result of each convolution operation defines the feature map that that filter will activate maximally.
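The operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name conv2d_forward, the 8x8 random input and the choice of three filters are assumptions made for the example, and no padding is applied. It uses the un-flipped multiply-and-sum form described in the lecture, which is also the form deep learning libraries typically implement. The activation is left out here and applied as a separate step.

```python
import numpy as np

def conv2d_forward(image, filters, biases, stride=1):
    """Slide each k x k filter over a 2D image (no padding).

    image:   (H, W) array of pixel values
    filters: (F, k, k) array, one weight matrix per filter
    biases:  (F,) array, one bias per filter
    returns: (H_out, W_out, F) output volume, one feature map per filter
    """
    n_filters, k, _ = filters.shape
    # Output height/width depend on input size, filter size and stride.
    h_out = (image.shape[0] - k) // stride + 1
    w_out = (image.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, n_filters))
    for f in range(n_filters):
        for p in range(h_out):
            for q in range(w_out):
                top, left = p * stride, q * stride
                patch = image[top:top + k, left:left + k]
                # Element-wise multiply, sum the results, add the bias.
                out[p, q, f] = np.sum(patch * filters[f]) + biases[f]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))        # assumed toy input
filters = rng.normal(size=(3, 4, 4))   # three 4x4 filters
biases = np.zeros(3)
volume = conv2d_forward(image, filters, biases)
print(volume.shape)  # (5, 5, 3)
```

Each slice volume[:, :, f] is the feature map produced by filter f, so the depth of the output volume equals the number of filters.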
Non-linearity and pooling (34:05)
So now we're well on our way to actually defining what a CNN is, and the next step would actually be to apply that non-linearity. After each convolution operation we need to actually apply this non-linear activation function to the output volume of that layer. And this is very, very similar, like I said, to what we saw in the first and also in the second lecture. And we do this because image data is highly non-linear. A common example in the image domain is to use an activation function of ReLU, which is the rectified linear unit. This is a pixel-wise operation that replaces all negative values with zero and keeps all positive values with whatever their value was. We can think of this really as a thresholding operation, so anything less than zero gets thresholded to zero. Negative values indicate negative detection of a convolution, but this non-linearity actually kind of clamps that in some sense. And that is a nonlinear operation, so it does satisfy our ability to learn nonlinear dynamics as part of our neural network model. So the next operation in convolutional neural networks is that of pooling. Pooling is an operation that is commonly used to reduce the dimensionality of our inputs and of our feature maps while still preserving spatial invariance. Now a common type of pooling that is used in practice is called max pooling, as shown in this example. Max pooling is actually super simple and intuitive. It's simply taking the maximum over these 2x2 patches and sliding that patch over our input. Very similar to convolutions, but now instead of applying an element-wise multiplication and summation, we're just simply going to take the maximum of that patch.
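As a small sketch of these two operations, the NumPy snippet below applies ReLU to a hand-written 4x4 feature map and then pools it with a 2x2 window. The helper names relu and pool2d, the example values, and the stride of 2 are assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    # Pixel-wise threshold: negatives become zero, positives are kept.
    return np.maximum(x, 0.0)

def pool2d(x, size=2, stride=2, reduce=np.max):
    """Slide a size x size window over x and reduce each patch to one value."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out))
    for p in range(h_out):
        for q in range(w_out):
            top, left = p * stride, q * stride
            out[p, q] = reduce(x[top:top + size, left:left + size])
    return out

feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [-1.,  5., -3.,  2.],
                        [ 0.,  1., -4., -1.],
                        [ 2., -6.,  1.,  7.]])
activated = relu(feature_map)
pooled = pool2d(activated)
print(pooled)  # [[5. 3.] [2. 7.]]
```

Passing reduce=np.mean instead of the default np.max turns the same sliding window into average pooling.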
So in this case, as we slide this two by two patch, striding it by a factor of two across the image, we can actually take the maximum of those two by two pixels in our input, and that gets propagated and activated to the next neuron. Now, I encourage all of you to really think about some other ways that we can perform this type of pooling while still making sure that we downsample and preserve spatial invariance. Taking the maximum over that patch is one idea. A very common alternative is also taking the average. That's called mean pooling. Taking the average, you can think of, actually represents a very smooth way to perform the pooling operation, because you're not just taking a maximum, which can be subject to maybe outliers, but you're averaging, so you will get a smoother result in your output layer. But they both have their advantages and disadvantages. So these are three operations, three key operations of a convolutional neural network, and I think now we're actually ready to really put all of these together and start to construct our first convolutional neural network end to end. And with CNNs, just to remind you once again, we can layer these operations. The whole point of this is that we want to learn this hierarchy of features present in the image data, starting from the low-level features, composing those together to mid-level features, and then again to high-level features that can be used to accomplish our task. Now a CNN built for image classification can be broken down into two parts. First, the feature learning part, where we actually try to learn the features in our input image that can be used to perform our specific task. That feature learning part is actually done through those pieces that we've been seeing so far in this lecture: the convolution, the nonlinearity, and the pooling to preserve the spatial invariance.
Now, the output of the first part, the convolutional layers and pooling, is those high-level features of the input. The second part is actually using those features to perform our classification, or whatever our task is. In this case the task is to output the class probabilities that are present in the input image. So we feed those outputted features into a fully connected or dense neural network to perform the classification. We can do this now, and we don't mind about losing spatial invariance, because we've already downsampled our image so much that it's not really even an image anymore. It's actually closer to a vector of numbers, and we can directly apply our dense neural network to that vector of numbers. It's also much lower dimensional now. And we can output class probabilities using a function called the softmax, whose output actually represents a categorical probability distribution. It sums to one, so it does make it a proper categorical distribution. And each element in this is strictly between zero and one. So it's all positive and it does sum to one. So it makes it very well suited for the second part if your task is image classification.
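A minimal softmax sketch in plain Python follows. The three input scores are made-up values for illustration, and subtracting the maximum before exponentiating is a common numerical-stability step that the lecture does not mention.

```python
import math

def softmax(logits):
    """Map raw class scores to a categorical probability distribution."""
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])         # assumed scores for three classes
print(probs, sum(probs))
```

Every output lies strictly between zero and one and the outputs sum to one, so the result can be read directly as class probabilities, with the largest score receiving the largest probability.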
End-to-end code example (38:59)
So now let's put this all together. What does an end-to-end convolutional neural network look like? We start by defining our feature extraction head, which starts with a convolutional layer with 32 feature maps, a filter size of 3x3 pixels, and we downsample this using a max pooling operation with a pooling size of 2 and a stride of 2. This is exactly the same as what we saw when we were first introducing the convolution operation. Next, we feed these 32 feature maps into the next set of the convolutional and pooling layers. Now we're increasing this from 32 feature maps to 64 feature maps and still downscaling our image as a result. So we're downscaling the image, but we're increasing the amount of features that we're detecting, and that allows us to actually expand ourselves in this dimensional space while downsampling the spatial information, the irrelevant spatial information. Now finally, now that we've done this feature extraction through only two convolutional layers in this case, we can flatten all of this information down into a single vector and feed it into our dense layers and predict these final 10 outputs. And note here that we're using the activation function of softmax to make sure that these outputs are a categorical distribution. Okay, awesome.
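To make the downscaling concrete, here is a small pure-Python sketch that traces the tensor shapes through the architecture just described. The 28x28 single-channel input, the absence of padding, and a convolution stride of 1 are assumptions (the lecture does not state them), and any intermediate dense layers are omitted.

```python
def out_size(size, window, stride=1):
    # Spatial size after sliding a window with the given stride, no padding.
    return (size - window) // stride + 1

def trace_shapes(height, width, channels):
    """Follow (height, width, channels) through conv -> pool -> conv -> pool."""
    shapes = [("input", (height, width, channels))]
    for index, n_filters in enumerate((32, 64), start=1):
        # 3x3 convolution: spatial size shrinks, depth becomes n_filters.
        height, width = out_size(height, 3), out_size(width, 3)
        channels = n_filters
        shapes.append((f"conv{index}", (height, width, channels)))
        # 2x2 max pooling with stride 2: spatial size roughly halves.
        height, width = out_size(height, 2, 2), out_size(width, 2, 2)
        shapes.append((f"pool{index}", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    shapes.append(("dense_softmax", (10,)))
    return shapes

for name, shape in trace_shapes(28, 28, 1):
    print(f"{name:14s} {shape}")
```

Under these assumptions the spatial size goes 28, 26, 13, 11, 5 while the depth goes 1, 32, 64, which is the trade described above: less spatial resolution, more features.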
So far we've talked about how we can use CNNs for image classification tasks. This architecture is actually so powerful because it extends to a number of different tasks, not just image classification. And the reason for that is that you can really take this feature extraction head, this feature learning part, and you can put onto the second part so many different end networks, whatever end network you'd like to use. You can really think of this first part as a feature learning part and the second part as your task learning part. Now what that task is is entirely up to you and what you desire. And that's really what makes these networks incredibly powerful. So for example we may want to look at different image classification domains. We can introduce new architectures for specifically things like image and object detection, semantic segmentation, and even things like image captioning. You can use this as an input to some of the sequential networks that we saw in lecture two, even. So let's look at and dive a bit deeper into each of these different types of tasks that we could use our convolutional neural networks for. In the case of classification, for example, there is a significant impact in medicine and health care when deep learning models are actually being applied to the analysis of entire inputs of medical image scans. Now this is an example of a paper that was published in Nature for actually demonstrating that a CNN can outperform expert radiologists at detecting breast cancer directly from mammogram images.
Object detection (42:02)
Instead of giving a binary prediction of what an output is though, cancer or not cancer, or what type of objects, for example in this image we may say that this image is an image of a taxi, we may want to ask our neural network to do something a bit more fine resolutioned and tell us, for this image, can you predict what the objects are and actually draw a bounding box, localize this object within our image? This is a much harder problem since there may be many objects in our scene, and they may be overlapping with each other, partially occluded, etc. So not only do we want to localize the object, we want to also perform classification on that object. So it's actually harder than simply the classification task because we still have to do classification, but we also have to detect where all of these objects are in addition to classifying each of those objects. Now, our network needs to also be flexible and actually be able to infer not just potentially one object, but a dynamic number of objects in the scene. Now, if we have a scene that only has one taxi, it should output a bounding box over just that single taxi, and the bounding box should tell us the xy position of one of the corners and maybe the height and the width of that bounding box as well. That defines our bounding box. On the other hand, if our scene contains many different types of objects, potentially even of different types of classes, we want our network to be able to output many different outputs as well and be flexible to that type of difference in our input, even with one single network. So our network should not be constrained to only outputting a single output or a certain number of outputs. It needs to have a flexible range of how it can dynamically infer the objects in the scene.
So what is one maybe naive solution to tackle this very complicated problem, and how can CNNs be used to do that? So what we can do is start with this image, and let's consider the simplest way possible to do this problem. We can start by placing a random box over this image, somewhere in the image. It has some random location, it also has a random size. And we can take that box and feed it through our normal image classification network, like we saw earlier in the lecture. This is just taking a single image, or it's now a sub-image, but it's still a single image, and it feeds that through our network. Now that network is tasked to predict what is the class of this image. It's not doing object detection, and it predicts that it has some class. If there is no class of this box, then it simply can ignore it, and we repeat this process. Then we pick another box in the scene, and we pass that through the network to predict its class. And we can keep doing this with different boxes in the scene and keep doing it. And over time we can basically have many different class predictions of all of these boxes as they're passed through our classification network. In some sense if each of these boxes give us a prediction class, we can pick the boxes that do have a class in them and use those as a box where an object is found. If no object is found, we can simply discard it and move on to the next box. So what's the problem with this? Well one is that there are way too many inputs. This basically results in boxes and considering a number of boxes that have way too many scales, way too many positions, too many sizes. We can't possibly iterate over our image in all of these dimensions and have this as a solution to our object detection problem. So we need to do better than that. 
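A quick back-of-the-envelope count shows why exhaustively trying boxes is hopeless. The sketch below counts every axis-aligned box whose edges fall on pixel boundaries; the function name and the 224x224 image size are assumptions picked purely for illustration.

```python
def count_boxes(width, height):
    """Number of distinct axis-aligned boxes covering whole pixels.

    A box is fixed by a first and last column (first <= last) and a first
    and last row, so there are W(W+1)/2 column ranges times H(H+1)/2 row ranges.
    """
    column_ranges = width * (width + 1) // 2
    row_ranges = height * (height + 1) // 2
    return column_ranges * row_ranges

print(count_boxes(2, 2))      # 9
print(count_boxes(224, 224))  # 635040000
```

Even for a small image this is hundreds of millions of candidate crops, each of which would need its own pass through the classification network.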
So instead of picking random boxes or iterating over all of the boxes in our image, let's use a simple heuristic method to identify some places in the image that might contain meaningful objects and use these to feed through our model. But still, even with this extraction of region proposals, the rest of the story is the exact same. We extract the region proposal and we feed it through the rest of our network. We warp it to be the correct size and then we feed it through our classification network. If there's nothing in that box, we discard it. If there is, then we keep it and say that that box actually contained this object. But still, this has two very important problems that we have to consider. One is that it's still super, super slow. We have to feed in each region independently to the model. So if we extract, in this case, 2,000 regions we have here, we have to run this network 2,000 times to get the answer just for the single image. It also tends to be very brittle, because in practice how are we doing this region proposal? Well, it's entirely heuristic based. It's not being learned with a neural network. And, even more importantly perhaps, it's detached from the feature extraction part. So our feature extraction is learning one piece, but our region proposal piece of the network, or of this architecture, is completely detached. So the model cannot learn to predict regions that may be specific to a given task. That makes it very brittle for some applications. Now many variants have been proposed to actually tackle some of these issues and advance this forward to accomplish object detection. But I'd like to touch on one extremely quickly, just to point you in this direction for those of you who are interested, and that's the Faster R-CNN method to actually learn these region proposals.
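The region-proposal pipeline described above (extract, warp, classify, discard or keep) can be sketched as follows. Everything here is a toy stand-in: the proposals are hard-coded rather than produced by a heuristic, warp is a simple nearest-neighbour resize, and toy_classifier is a hypothetical brightness threshold standing in for a trained CNN classifier.

```python
import numpy as np

def warp(region, out_size):
    """Nearest-neighbour resize of a 2D region to (out_size, out_size)."""
    rows = np.arange(out_size) * region.shape[0] // out_size
    cols = np.arange(out_size) * region.shape[1] // out_size
    return region[np.ix_(rows, cols)]

def detect(image, proposals, classify, out_size=8):
    """Classify every proposed (x, y, width, height) region independently."""
    detections = []
    for x, y, w, h in proposals:
        crop = image[y:y + h, x:x + w]
        label = classify(warp(crop, out_size))  # one forward pass per region
        if label is not None:                   # nothing found: discard the box
            detections.append(((x, y, w, h), label))
    return detections

calls = []

def toy_classifier(patch):
    # Hypothetical stand-in for a CNN: bright patches count as an object.
    calls.append(patch.shape)
    return "object" if patch.mean() > 0.5 else None

image = np.zeros((32, 32))
image[4:12, 4:12] = 1.0                         # one bright square "object"
proposals = [(4, 4, 8, 8), (20, 20, 6, 10)]     # assumed hand-picked regions
detections = detect(image, proposals, toy_classifier)
print(detections, len(calls))
```

The classifier is called once per proposal, which is exactly the cost the lecture points out: 2,000 proposals would mean 2,000 forward passes for a single image.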
The idea here is, instead of feeding in this image to a heuristic-based region proposal method, we can have a part of our network that is trained to identify the proposal regions of our image. And that allows us to directly understand or identify these regions in our original image where there are candidate patches that we should explore for our classification and for our object detection. Now each of these regions is then processed with its own feature extractor as part of our neural network, in their individual CNN heads. Then after these features for each of these proposals are extracted, we can do a normal classification over each of these individual regions. Very similar as before, but now the huge advantage of this is that it only requires a single forward pass through the model. We only feed in this image once, we have a region proposal network that extracts the regions, and all of these regions are fed on to perform classification on the rest of the image. So it's super, super fast compared to the previous method. So in classification we predict one class for an entire image. In object detection, we predict bounding boxes over all of the objects in order to localize them and identify them. We can go even further than this. And in this idea, we're still using CNNs to predict this output as well. But instead of predicting bounding boxes, which are rather coarse, we can task our network to also here predict an entire image as well.
Now one example of this would be for semantic segmentation, where the input is an RGB image, just a normal RGB image, and the output would be pixel-wise probabilities: for every single pixel, what is the probability that it belongs to a given class? So here you can see an example of this image of two cows on some grass being fed into the neural network, and the neural network actually predicts a brand new image. But now this image is not an RGB image, it's a semantic segmentation image. It has a probability for every single pixel. It's doing a classification problem, and it's learning to classify every single pixel depending on what class it thinks it is. And here we can actually see how the cow pixels are being classified separately from the grass pixels and sky pixels. And this output is actually created using an upsampling operation, not a downsampling operation, but upsampling, to allow the convolutional decoder to actually increase its spatial dimension. Now these layers are the analog, you could say, of the normal convolutional layers that we learned about earlier in the lecture. They're also already implemented in TensorFlow, so it's very easy to just drop these into your model and allow your model to learn how to actually predict full images in addition to or instead of single class probabilities. This semantic segmentation idea is extremely powerful because it can also be applied to many different applications in healthcare as well, especially for segmenting out, for example, cancerous regions on medical scans, or even identifying parts of the blood that are infected with diseases like, in this case, malaria.
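The per-pixel classification idea can be sketched with NumPy as below. Note the simplification: the lecture refers to learned upsampling layers in a convolutional decoder, whereas this sketch uses fixed nearest-neighbour upsampling as a stand-in. The 2x2 grid of class scores and the class indices (0 cow, 1 grass, 2 sky) are invented for illustration.

```python
import numpy as np

def upsample_nearest(x, factor):
    # Repeat every row and column; a fixed stand-in for learned upsampling.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def pixelwise_softmax(logits):
    """Softmax over the last axis, so every pixel gets its own distribution."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

# Assumed coarse class scores with shape (height, width, classes).
coarse_logits = np.zeros((2, 2, 3))
coarse_logits[0, :, 2] = 4.0   # top row looks like sky
coarse_logits[1, 0, 0] = 4.0   # bottom-left looks like cow
coarse_logits[1, 1, 1] = 4.0   # bottom-right looks like grass

probs = pixelwise_softmax(upsample_nearest(coarse_logits, 2))
label_map = probs.argmax(axis=-1)   # most likely class at every pixel
print(probs.shape)                  # (4, 4, 3)
print(label_map)
```

The output has the same height and width as the (upsampled) image, with one probability per class at every pixel; taking the argmax gives the coloured segmentation map.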
End-to-end self driving cars (50:52)
Let's see one final example here of how we can use convolutional feature extraction to perform yet another task. This task is different from the first three that we saw with classification, object detection, and semantic segmentation. Now we're going to consider the task of continuous robotic control here for self-driving cars and navigating directly from raw vision data. Specifically this model is going to take as input, as you can see on the top left hand side, the raw perception from the vehicle. This is coming for example from a camera on the car and it's also going to see a noisy representation of street view maps, something that you might see for example from Google Maps on your smartphone. And it will be tasked not to predict a classification problem or object detection but rather learn a full probability distribution over the space of all possible control commands that this vehicle could take in this given situation. Now how does it do that actually? This entire model is actually using everything that we learned about in this lecture today. It can be trained end-to-end by passing each of these cameras through their dedicated convolutional feature extractors and then basically extracting all of those features and then concatenating them, flattening them down and then concatenating them into a single feature extraction vector. So once we have this entire representation of all of the features extracted from all of our cameras and our maps, we can actually use this representation to predict the full control parameters on top of a deterministic control given to the desired destination of the vehicle. This probabilistic control is very powerful because here we're actually learning to just optimize a probability distribution over where the vehicle should steer at any given time. 
You can actually see this probability distribution visualized on this map, and it's optimized simply by minimizing the negative log likelihood of this distribution, which is a mixture of normal distributions. And this is nearly identical to how you operate in classification as well. In that domain, you try to minimize the cross entropy loss, which is also a negative log likelihood function. So keep in mind here that this is composed of the convolutional layers to actually perform this feature extraction. These are exactly the same as what we learned about in this lecture today. As well as these flattening, pooling, and concatenation layers to really produce this single representation and feature vector of our inputs. And finally it predicts these outputs, in this case a continuous representation of control that this vehicle should take. So this is really powerful because a human can actually enter the car, input a desired destination, and the end-to-end CNN will output the control commands to actuate the vehicle towards that destination. Note here that the vehicle is able to successfully recognize when it approaches the intersections and take the correct control commands to actually navigate that vehicle through these brand new environments that it has never seen before and never driven before in its training data set.
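As a rough sketch of that loss, the snippet below evaluates the negative log likelihood of a single observed steering value under a mixture of normal distributions. The parameter names and the two-component example are assumptions; in the model described above the mixture weights, means and standard deviations would be predicted by the network, and a practical implementation would work in log space for numerical stability.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_nll(x, weights, means, stds):
    """Negative log likelihood of x under a mixture of normal distributions."""
    likelihood = sum(w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds))
    return -math.log(likelihood)

# Assumed two-mode prediction, e.g. "steer left" or "steer right" at a fork.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [0.2, 0.2]
print(mixture_nll(1.0, weights, means, stds))   # low loss: near a mode
print(mixture_nll(0.0, weights, means, stds))   # high loss: between the modes
```

Minimizing this quantity over the training data pushes probability mass toward the steering commands that were actually taken, in the same way cross entropy does for class labels.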
And the impact of CNNs has been very wide reaching beyond these examples as well that I've explained here today. It has touched so many different fields in computer vision especially. And I'd like to really conclude this lecture today by taking a look at what we've covered. We really covered a ton of material today. We covered the foundations of computer vision, how images are represented as an array of brightness values, and how we can use convolutions and how they work. We saw that we can build up these convolutions into the basic architecture, defining convolutional neural networks, and discussed how CNNs can be used for classification. Finally, we talked about a lot of the extensions and applications of how you can use these basic convolutional neural network architectures as a feature extraction module and then use this to perform your task at hand. And a bit about how we can actually visualize the behavior of our neural network and actually understand a bit about what it's doing under the hood through ways of some of these semantic segmentation maps and really getting a more fine-grained perspective of the very high-resolution classification of these input images that it's seeing. And with that, I would like to conclude this lecture and point everyone to the next lab that will be upcoming today. This will be a lab specifically focused on computer vision. You'll get very familiar with a lot of the algorithms that we've been talking about today starting with building your first convolutional neural networks and then building this up to build some facial detection systems and learn how we can use unsupervised generative models like we're going to see in the next lecture to actually make sure that these computer vision facial classification algorithms are fair and unbiased. So stay tuned for the next lecture as well on unsupervised generative modeling to get more details on how to do the second part. Thank you.