MIT 6.S191 (2022): Convolutional Neural Networks

Transcription for the video titled "MIT 6.S191 (2022): Convolutional Neural Networks".




Introduction

Intro (00:00)

Okay, hi everyone, and welcome back to 6.S191. Today we're going to be talking about one of my favorite subjects in this course, and that's how we can give machines a sense of sight and vision. Vision is one of the most important senses for sighted people, who rely on it every single day for everything from navigating the physical world to manipulating objects and even interpreting very subtle and expressive facial expressions.


Focusing On Computer Vision

Definition of vision (00:43)

I think it's safe to say that for all of us, sight and vision are a huge part of our everyday lives, and today we're going to be learning about how we can give machines and computers that same sense of sight and the ability to process images. One way I like to define vision is knowing what is where only by looking. When we think about it, though, vision is actually a lot more than just detecting where something is and what it is. Take this scene, for example. We can build a computer vision system to identify different objects in it, for example this yellow taxi, as well as this white van parked on the side of the road. And beyond that, we can probably infer some simple properties about the scene as well, beyond just the fact that there's a yellow taxi and a white van: the white van is parked on the side of the road, while the yellow taxi is moving and probably waiting for these pedestrians, which are also dynamic objects. We can also see other objects in the scene that present very interesting dynamical scenarios, such as the red light and other cars merging in and out of traffic. Accounting for all of the details in this scene is really what vision is, beyond just detecting what is where. We take all of this for granted as humans because we do it so easily, but this is an extraordinarily challenging problem for machines to tackle, especially in a learning fashion. Vision algorithms really require us to account for all of these very subtle details. And deep learning has revolutionized computer vision and brought forth a huge rise in vision algorithms and their applications.


Implications of visual algorithms revolution (02:45)

So for example, everything from robots operating in the physical world to mobile computing: all of you on your phones are using very advanced machine vision and computer vision algorithms that even a decade ago would have required supercomputers in the largest compute clusters. Now we're seeing them in all of our pockets, used every day in our lives. We're seeing computer vision used in biology and medicine to diagnose cancers, as well as in autonomous driving, where machines operate alongside humans in our everyday world. And finally, we're seeing how computer vision can help people who lack a sense of sight by increasing their accessibility as well.


Visual sensor (03:41)

So deep learning has really taken computer vision systems by storm because of its ability to learn directly from raw pixels and directly from data: not only to learn from the data, but to learn how to process that data and extract meaningful image features just by observing a large corpus of data. One example is facial detection and recognition. Another common example is in the context of self-driving cars, where we can take an image of what the car is seeing in its environment and try to infer the control signal that it should execute at that moment in time.


Radically advanced robotics (04:12)

Now, this entire control system in this example is actually being processed by a single neural network, which is radically different from the approach that the majority of self-driving car companies, like Waymo, for example, are taking, which is a very different, pipelined approach. Here we're seeing computer vision algorithms operate the entire robot control stack using a single neural network. This was actually work that we published as part of my lab here at CSAIL, and you'll get some practice developing some of these algorithms in your software labs as well. We're seeing it in medicine and biology, taking radiograph scans and helping doctors make clinical decisions. And finally, computer vision is being widely used in accessibility applications, for example, to help the visually impaired. Projects in this research endeavor helped build a deep-learning-enabled device that could detect trails for running and provide audible feedback to visually impaired users so that they could still go for runs in the outdoor world.


Computer vision (05:15)

And like I said, these are all tasks that we really take for granted as humans, because each of us as sighted individuals does them routinely. But now in this class we're going to talk about how we can train a computer to do this as well. In order to do that, we first need to ask ourselves: how can we build a computer that can, quote-unquote, see? And specifically, how can we build a computer that can process an image into something it can understand? Well, to a computer, an image is simply a bunch of numbers. Suppose we have this picture of Abraham Lincoln. It's simply a collection of pixels, and since this is a grayscale image, each of those pixels is just one single number that denotes the intensity of the pixel. We can represent this bunch of numbers as a two-dimensional matrix, with one entry for every pixel location. And this is how a computer sees: it takes as input this giant two-dimensional matrix of numbers. If we had an RGB image, a color image, it would be exactly the same story, except that for each pixel we don't have just one number denoting intensity, but three numbers denoting the red, green, and blue channels. Now that we have this way of representing images to computers, we can think about what computer vision tasks we can perform with this representation. Two common types of machine learning tasks in computer vision are recognition or classification, and regression, a kind of quantitative analysis of your image. For regression, our output takes a continuous value, and for classification, our output is one of k different classes, so we're trying to output the probability of our image belonging to each of k classes. So let's consider first the task of image classification. We want to predict a single label for each image. For example, say we have a bunch of images of US presidents, and we want to build a classification pipeline that tells us which president is in the image on the left-hand side, and outputs, on the right-hand side, the probability that this image came from each of those particular presidents. Now, in order to correctly classify these images, our pipeline needs to be able to tell what is unique to each of those different presidents.
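To make this representation concrete, here is a minimal NumPy sketch; the array sizes and random values are just placeholders for a real image.

```python
import numpy as np

# A tiny 6x6 grayscale "image": one intensity value per pixel (0-255).
gray = np.random.randint(0, 256, size=(6, 6), dtype=np.uint8)
print(gray.shape)   # (6, 6) -> height x width, a 2D matrix of numbers

# The same-sized color image: three numbers per pixel for red, green, blue.
rgb = np.random.randint(0, 256, size=(6, 6, 3), dtype=np.uint8)
print(rgb.shape)    # (6, 6, 3) -> height x width x channels
```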


Neural Network Pipeline (07:59)

So what is unique to a picture of Lincoln versus a picture of Washington versus a picture of Obama? Another way to think about this image classification problem, and how a computer might go about solving it, is at a high level in terms of features that distinguish each of those different types of images; features are characteristics of those classes. Classification is then performed by detecting those features in a given image. If the features of a given class are present, then we can predict with pretty high confidence that our image is coming from that particular class. So if we're building a computer vision pipeline, our model needs to know what the features are, and then it needs to be able to detect those features in an image.


Facial Detection (08:54)

So for example, if we want to do facial detection, we might start by trying to detect noses, eyes, and mouths, and if we can detect those types of features, then we can be pretty confident that we're looking at a face. Similarly, if we want to detect a car, we might look for wheels, a license plate, or headlights; those are good indicators that we're looking at a car. Now, how might we solve this problem of feature detection in the first place? Well, we need to leverage certain knowledge about our particular domain. For example, for human faces, we need to use our understanding that a human face is usually comprised of eyes, a nose, and a mouth. A classification pipeline or algorithm would then try to do exactly that: detect those small features first and then make some determination about the overall image. Now, of course, the big problem here is that we as humans would need to define for the algorithm what those features are. So if we're looking at faces, a human would actually have to say that a face is comprised of eyes, noses, and mouths, and that that's what the computer should look for. But there's a big problem with this approach, because humans are usually not very good at defining features that are robust to a lot of different types of variation: scale variations, deformations, viewpoint variations. There are so many different variations that an image or a three-dimensional object may undergo in the physical world that it becomes very difficult for us as humans to define the good features that our computer algorithm would need to identify. So even though our pipeline could use the features that we, the humans, define, this manual extraction will actually break down in the detection part of the task.


Human-Definition Extraction (10:47)

So due to this incredible variability of image data, the detection of these features is actually really difficult in practice, because your detection algorithm needs to withstand all of those different variations. So how can we do better? What we want, ideally, is a way to extract features and detect their presence in images automatically. Just by observing a bunch of images, can we detect what a human face is comprised of, simply by observing a lot of human faces, and maybe even in a hierarchical fashion? To do that, we can use a neural network, like we saw in yesterday's class.


Learn-Extract Method (11:19)

So a neural-network-based approach here is going to be used to learn and extract meaningful features from a bunch of data, of human faces in this example, and then learn a hierarchy of features that can be used to detect the presence of a face in a new image. So, for example, after observing a lot of human faces in a big data set, an algorithm may learn that human faces are usually comprised of a bunch of lines and edges, which come together to form mid-level features like eyes and noses, and those come together to form larger pieces of your facial structure and facial appearance. This is how neural networks are going to allow us to learn directly from visual data and extract those features, if we construct them cleverly. Now, this is where this whole part of the class gets interesting, because we're going to start to talk about how to actually create neural networks that are capable of doing that first step of extracting and learning those features. In yesterday's lecture, we talked about a couple of types of architectures. First, we learned about fully connected layers, these dense layers where every neuron is connected to every neuron in the previous layer. Let's say we wanted to use this type of architecture to do our image classification task. In this case, our input is a two-dimensional image, like we saw earlier. And since our fully connected layer takes just a list of numbers, the first thing we have to do is convert our two-dimensional image into a list of numbers. So let's simply flatten our image into a long list of numbers and feed that into our fully connected network. Now here, immediately, I hope all of you can appreciate that the first thing we've done by flattening our image is completely destroy all of the spatial structure in our image. Pixels that were close to each other in the two-dimensional image may now be very far apart from each other in our one-dimensional flattened version. Additionally, we're also going to have a ton of parameters, because this model is fully connected: every single pixel in our first layer has to be connected to every single neuron in our next layer. So you can imagine that even for a very small image of only 100 by 100 pixels, you're going to have a huge number of weights in this neural network just within one layer, and that's a huge problem.
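A back-of-the-envelope sketch of that parameter blow-up; the 1,000 hidden units here are an arbitrary assumption for illustration, not a number from the lecture.

```python
# Counting weights in a fully connected first layer for a small grayscale image.
height, width = 100, 100                 # "very small" 100 x 100 image
hidden_units = 1000                      # assumed size of the first dense layer

flattened_inputs = height * width        # 10,000 numbers after flattening
dense_weights = flattened_inputs * hidden_units
print(dense_weights)                     # 10,000,000 weights in a single layer

# Compare with one shared convolutional patch of weights (see the next section):
patch_weights = 4 * 4                    # a 4x4 patch has just 16 weights
print(patch_weights)
```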


Inbuilt Convolutions (13:38)

So the question I want to pose, which is going to motivate the computer vision architecture we'll talk about in today's class, is this: how can we preserve the spatial structure that's present in images in order to detect features, and how can that inform the way we construct an architecture that performs this kind of feature extraction?


How convolution works (13:55)

So to do this, let's keep our representation of an image as a 2D matrix, so we're not going to ruin all of that nice spatial structure. One way we can leverage that structure is to connect our input to patches of weights. So instead of feeding it to a fully connected layer of weights, we're going to feed it just to a patch of weights. That is another way of saying that each neuron in our hidden layer is only going to see a small patch of pixels at any given time. This not only drastically reduces the number of parameters that our next hidden layer has to learn, because now it's only attending to a single patch at a time, it's actually quite nice, because pixels that are close to each other also share a lot of information with each other. There's a lot of correlation between things that are close to each other in images, especially if we look at a very local part of an image; close to that patch, there are often a lot of relationships. So notice here how only one small region, in this red box on the input layer, influences this single neuron that we're seeing on the bottom right of the slide. Now, to define connections across the entire input image, we can simply apply this patch-based operation across the entire image. We're going to take that patch and slowly slide it across the image, and each time we slide it, it's going to produce the next single neuron output on the bottom right. By sliding it many times over the image, we can create another two-dimensional extraction of features on the bottom right. And keep in mind, again, that when we're sliding this patch across the input, each patch that we slide is the exact same patch. We create one patch and slide that all the way across our input; we're not creating new patches for every new place in the image. That's because we want to reuse the feature that we learned and extract that feature all across the image. And we do this feature extraction by weighting the connections between the patch of the input and the neuron that gets fed out, so as to detect certain features. In practice, the operation that we can use to do this sliding and extraction of features is called a convolution. That's just the mathematical operation that is actually being performed with this small patch and our large image. Now, I'm going to walk through a very brief example. Suppose we have a 4 by 4 patch, which we can see in red in the top-left illustration. That means this patch is going to have 16 weights, one weight per pixel in the patch. We're going to apply this same filter of 4 by 4 pixels across our entire input, and we're going to use the result of that operation to define the state of the neurons in the next layer. So, for example, this red patch is going to be applied at this location, and it's going to inform the value of this single neuron in the next layer. This is how we can start to think of a convolution at a very high level. But now you're probably wondering how exactly, mathematically, the convolution works, and how it allows us to actually extract these features. So how are the weights determined?


Classifying an X with features (17:36)

How are the features learned? Let's make this concrete by walking through a few simple examples. Suppose we want to classify this image of an X. We're given a bunch of black and white images, and we want to find out whether there's an X in the image or not. Here a black pixel is defined as negative one, and a white pixel is defined as positive one. So this is a very simple black-and-white image, and to classify it, it's clearly not possible to simply compare the two matrices to see if they're equal, because if we did that, we would see that these two matrices are not exactly equal: there are some slight deformations and transformations between one and the other. We want to classify an X even if it's rotated, shrunk, or deformed in some way; we want to be resilient and robust to those types of modifications and still have a robust classification system. So instead, we want our model to compare images of an X piece by piece, or patch by patch. We want to identify certain features that make up an X and try to detect those instead. If our model can find these rough feature matches in roughly the same places, then that is probably a good indicator that this image is indeed of an X. Now, you should think of each feature as kind of like a mini image: a small two-dimensional array of values. Here are some of the different filters or features that we may learn. Each of the filters on the top right is designed to pick up a different type of feature in the image. In the case of Xs, our filters may represent diagonal lines, like on the top left, or a crossing behavior, which you can see in the middle, or a diagonal line oriented in the opposite direction, on the far right. Note that these smaller matrices are filters of weights, just like images: the filters on the top row are smaller mini-images, but they're still two-dimensional, so they're defined by a set of weights in 2D. All that's left now is to define a mathematical operation that will connect these two pieces: that will connect the small patches on the top to our big image on the bottom and output a new image on the right. Convolution is that operation. Convolution, just like addition or multiplication, is an operation that takes two inputs. But unlike addition, which takes as input two numbers, convolution takes as input two matrices, or in the more general case two functions, and it outputs a third function. So the goal of convolution here is to take as input two images and output a third image. Because of that, convolution preserves the spatial relationship between pixels by learning image features in small squares of the input image data. And to do this, we can perform an element-wise multiplication of our filter, or feature, with our image. We can place the filter on the top left on top of our image on the bottom and element-wise multiply every pixel in our filter with every pixel in the corresponding overlapping region of our image.


Understanding Convolutional Neural Networks

What our inner convolution looks like (20:53)

So for example here, we take the bottom-right pixel of our filter, which is 1, and multiply it by the corresponding pixel in our image, which is also 1, and the result, 1 times 1, is 1. We can do this for every single pixel in our filter. We repeat this all over the filter, and we see that all of the resulting element-wise multiplications are 1. We add up all of those results and get 9 in this case, and that's going to be the output of the convolution at this location in our next image.
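Here is that element-wise multiply-and-add step as a tiny NumPy sketch, assuming a 3 by 3 filter of plus and minus ones that happens to exactly match the image patch underneath it.

```python
import numpy as np

# A 3x3 filter of +1 / -1 weights (a diagonal-line style feature).
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

# Assume the image region under the filter matches it exactly.
patch = filt.copy()

# Element-wise multiply every pixel, then add up all the results.
response = np.sum(filt * patch)
print(response)   # 9 -- every product is 1, and there are 9 of them
```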


Discussion of Zeroing Pad (21:38)

Next time we slide it, we'll have a different set of numbers to multiply with, and we'll have a different output as well. Let's consider this with one more example. Suppose we want to compute the convolution of a 5 by 5 image and a 3 by 3 filter. The image is on the left, the filter is on the right, and to do this we need to cover the input image entirely with our filter by sliding it across the image, and that's going to give us a new image. Each time we put the filter on top of our image, we perform the operation I told you about before: element-wise multiply every pixel in our filter and image, and then add up the result of all of those multiplications. So let's see what this looks like. First we place our filter on the top left of our image, and when we element-wise multiply everything and add up the result, we get the value 4, and we place that value 4 in the top left of our new feature map, let's call it; this is just the output of this operation. The next time we slide the filter across the image, we'll have a new set of input pixels to element-wise multiply with, we add up the result, and we get 3. We can keep repeating this as we slide across the image. And that's it: once we get to the end of the image, we have a feature map on the right-hand side that denotes, at every single location, the strength of detecting that filter, or that feature, at that location in the input. Where there's a lot of overlap, the element-wise multiplication will have a large result, and where there's not a lot of overlap, it will have a much smaller result. So we can see where this feature was detected in our image by looking at the feature map. We can now also observe how different filters can be used to produce different types of outputs, or different types of feature maps. In effect, our filter is going to capture or encode a feature within it. So let's take this picture of a woman's face. By applying three different convolutional features, or filters, we can obtain three different forms of the same image. For example, if we take this filter, this 3 by 3 matrix of numbers, just 9 numbers, 9 weights, and slide it across our entire image, we actually get the same image back but in sharpened form; it's a much sharper version of the same image. Similarly, we can change the weights in our filter, or feature detector, and now we can detect different types of features. Here, for example, this filter is performing edge detection, so we can see that everything is blacked out except for the edges that were present in the original image. Or we can modify those weights once again to perform an even stronger, magnified version of edge detection. And again, here you can really appreciate that the edges are the things that remain in our output. This is just to demonstrate that by changing the weights in our filter, our model is able to learn how to identify different types of features that may be present in our image. So I hope now you can appreciate how convolution allows us to capitalize on spatial structure, use a small set of weights to extract local features, and very easily detect different features by learning different filters, as the sketch below illustrates.
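A minimal NumPy sketch of that sliding operation, a "valid" 2D convolution (strictly a cross-correlation, which is what deep learning libraries compute), along with textbook sharpen and edge-detection kernels; the kernel values are standard examples, not the exact weights from the slides.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each location, element-wise
    multiply and sum to produce one entry of the output feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Classic 3x3 kernels: sharpening and edge detection.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
edges   = np.array([[-1, -1, -1],
                    [-1,  8, -1],
                    [-1, -1, -1]])

image = np.random.rand(5, 5)             # stand-in for a real grayscale image
print(conv2d(image, sharpen).shape)      # (3, 3) feature map, as in the example
print(conv2d(image, edges).shape)        # (3, 3) feature map
```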
So we can learn a bunch of those filters now, and each filter is going to capture a different feature, which together define the different types of objects and features that the image possesses. Now, these concepts of preserving spatial structure, as well as local feature extraction using convolution, are at the core of the neural networks that we're going to learn about today. These are the primary neural networks that are still used for computer vision tasks and have really shattered the previous state-of-the-art algorithms. So now that we've looked under the operational hood of convolutional neural networks, which is the convolution operation, we can actually start to think about how we can utilize this operation to build up full-scale convolutional neural networks, as I called them. These networks are appropriately named, because under the hood they're just utilizing this same operation of convolution, combined with the weight multiplication and addition formulation that we discussed in the first lecture with fully connected layers. So let's take a look at how convolutional neural networks are actually structured, and then we'll dive a little bit deeper into the mathematics of each layer. Let's again stick with this example of image classification.


Convolutional Layers (26:34)

Now the goal here is to learn the features directly from the image data and use these learned features to identify certain properties that are present in the image, and use those properties to guide or inform some classification model. There are three main operations that you need to be familiar with when you're building a convolutional neural network. First is obviously the convolution operation: you have convolutional layers that simply apply convolution between your original input and a set of filters that our model is going to learn. These filters are weights of the model that will be optimized using backpropagation. Second, just like we saw in the first lecture, we're going to have to apply a nonlinearity, to introduce nonlinearity to our model. Oftentimes in convolutional neural networks, we'll see that this is going to be ReLU, because it's a very fast and efficient activation function. And third, we're going to have some pooling layer, which is going to allow us to downsample and downscale our features. Every time we downscale our features, our filters are going to attend to a larger region of our input space. So imagine, as we progressively go deeper and deeper into our network, each step downscaling: now our features are capable of attending to a larger portion of the original image, and the attention field of the later layers grows much larger as we downsample. So we'll go through each of these operations now, just to break down the basic architecture of a CNN.


Learning convolutional filters (28:11)

And first we'll consider the convolution operation and how it's implemented specifically in neural networks. We saw how the mathematical operation is computed; now let's look at it specifically in neural networks. As before, each neuron in our hidden layer will compute a weighted sum of its inputs. Remember how yesterday we talked about the three steps of a perceptron: one was apply weights, two was add a bias, and three was apply a nonlinearity. We're going to keep the same three steps in convolutional neural networks. First we compute a weighted sum of the inputs using a convolution operation, then we add a bias, and then we activate it with a nonlinearity. And here's an illustration of that same exact idea written out in mathematical form: each of our weights is multiplied element-wise with our input x, we add a bias, we add up all of the results, and then we pass it through a nonlinear activation function. This defines how neurons in our input layer are connected to our output layer, and how that output layer is actually computed. But within a single convolutional layer, we can have multiple filters. So just like before, in the first lecture, where a fully connected layer can have multiple neurons or perceptrons, now we have a convolutional layer that can learn multiple filters. And the output of each convolutional layer is not just one image, but a volume of images: one image corresponding to each filter that is learned. We can also think of the connections of neurons in convolutional layers in terms of their receptive field, like I mentioned before: each node in the output is connected only to the patch of inputs at the corresponding location. These parameters essentially define the spatial arrangement of the output of a convolutional layer. To summarize, we've seen how connections in convolutional layers are defined, and that their output is actually a volume, because we're learning a stack of filters, not just one filter, and for each filter we output an image. Okay, so we are well on our way now to understanding how a CNN, or convolutional network, works in practice. There are a few more steps. The next step is to apply that nonlinearity, like I said before, and the motivation is exactly like we saw in the first two lectures. As introduced previously, we do this because images and real data are highly nonlinear, so to capture those types of features, we need to apply nonlinearities to our model as well. In CNNs, it's very common practice to apply what is called the ReLU activation function. You can think of the ReLU activation as a form of thresholding: when the input on the left-hand side is less than zero, nothing gets passed through, and when it's greater than zero, the original pixel value gets passed through.
So it's kind of like the identity function when the input is positive, and it's zero when the input is negative: a form of thresholding centered at zero. You can also think about what negative numbers mean: negative values correspond to a kind of inverse detection of a feature during your convolution operation. Positive numbers correspond to a positive detection of that feature, zero means there's no detection of the feature, and if it's negative, you're seeing an inverse detection of that feature. And finally, once we've applied this activation to our output feature map, the next operation in the CNN is pooling. Pooling is an operation that is primarily used to reduce dimensionality and make the model scalable in practice, while still preserving spatial invariance and spatial structure. A common technique is what's called max pooling, and the idea is very intuitive, as the name suggests: we select patches in our input on the left-hand side, and we pool down the input by taking the maximum of each of the patches. So, for example, in this red patch the maximum is 6, and that value gets propagated forward to the output. Note that because this max pooling operation was done with a patch size of 2, the output is half as large as the input; it's downscaled by a factor of 2. And I encourage you to think about some different ways that we could perform this downsampling or downscaling operation without using max pooling specifically. Is there some other operation, for example, that would also allow you to downscale and preserve spatial structure without taking the maximum of each of the patches?
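Both operations are short enough to sketch directly in NumPy; this is just an illustrative implementation of ReLU thresholding and 2 by 2 max pooling on a toy feature map.

```python
import numpy as np

def relu(x):
    # Thresholding at zero: positives pass through, negatives become zero.
    return np.maximum(0, x)

def max_pool_2x2(fm):
    # Take the maximum of each non-overlapping 2x2 patch, halving each dimension.
    h, w = fm.shape
    fm = fm[:h - h % 2, :w - w % 2]   # trim odd edges if needed
    return fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2).max(axis=(1, 3))

feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [ 4.,  5., -1.,  2.],
                        [-3.,  6.,  0.,  1.],
                        [ 2.,  0.,  7., -4.]])

activated = relu(feature_map)
print(activated)                # negatives clamped to zero
print(max_pool_2x2(activated))  # 2x2 output: the max of each 2x2 patch
```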


Convolutional neural networks (CNNs) (33:11)

And there are some interesting changes that can happen when you use different forms of pooling in your network. So these are the key operations of convolutional neural networks, and now we're ready to put them together into a full pipeline so that we can learn hierarchical features from one convolutional layer to the next. That's what we're going to do with convolutional neural networks, and we can do it by essentially stacking these three steps in sequence in the form of a neural network. On the left-hand side we have our image. We start by applying a series of convolutional filters that are learned; that extracts a feature volume, one feature map per filter. We apply our activation function, pool down, and repeat that process, and we can keep stacking those layers over and over again. This will eventually output a set of feature volumes right here that we can take, and at this point the spatial structure is very small, so we can extract all of those features and feed them through a fully connected layer to perform our decision-making task.


Putting it together in code (34:14)

So the objective of the first part is to extract features from our image, and the objective of the second part is to use those features and actually perform our detection. Now let's talk about putting this all together into very tangible code, to create our first end-to-end convolutional neural network. We start by defining our feature extraction head, which begins with a convolutional layer with 32 feature maps. That just means our neural network is going to learn 32 different types of features at the output of this layer. We downsample the spatial information, in this case using a max pooling layer; we're going to downscale here by a factor of 2, because our pooling size and our stride are both 2. Next we feed this into our next set of convolutional and pooling layers. Now, instead of 32 features, we're going to learn a larger number of features, because remember that we've downscaled our image, so now we can afford to increase the resolution of our feature dimension as we downscale. We're exploiting this inverse relationship and trade-off: as we downscale and enlarge our attention or receptive field, we can also expand the feature dimension. Then we can finally flatten this spatial information into a set of features that we ultimately feed into a series of dense layers, and the series of dense layers will perform our eventual decision. This is going to do the ultimate classification that we care about.
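Here is a minimal sketch of that architecture in TensorFlow/Keras, matching the description above; the input size (28 by 28 grayscale) and the number of output classes (10) are assumptions for illustration, not values from the lecture.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Part 1: feature extraction.
    tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu',
                           input_shape=(28, 28, 1)),               # learn 32 feature maps
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),              # downscale by 2
    tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),   # more filters deeper
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),

    # Part 2: classification on the extracted features.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),                # 10 assumed classes
])

model.summary()
```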


Deep Learning And Object Detection

Image classification (35:53)

So far, we've talked only about using CNNs for image classification tasks. In reality, this architecture is extremely general and can extend to a wide number of different applications just by changing the second half of the architecture. The first half of the architecture that I showed you before focuses on feature detection or feature extraction: picking up on the features in our data set. The second part of our network can then be swapped out in many different ways to create a whole bunch of different models. For example, we can do classification in the way that we talked about earlier, or if we change our loss function, we can do regression. We can change it to do object detection as well, or segmentation, or probabilistic control, where we output probability distributions, all just by changing the second part of the network. We keep the first part of the network the same, though, because we still need that first step of asking what am I looking at, what are the features present in this image; the second part then asks, okay, how am I going to use those features to do my task? In the case of classification, there's a significant impact right now in medicine and healthcare, where deep learning models are being applied to the analysis of a whole host of different forms of medical imagery.
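A small functional-API sketch of that idea, with a shared convolutional backbone and two interchangeable heads; the layer sizes, input shape, and losses are illustrative assumptions.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 3))

# First half: shared feature extractor ("what am I looking at?").
x = tf.keras.layers.Conv2D(32, 3, activation='relu')(inputs)
x = tf.keras.layers.MaxPool2D(2)(x)
x = tf.keras.layers.Conv2D(64, 3, activation='relu')(x)
x = tf.keras.layers.MaxPool2D(2)(x)
features = tf.keras.layers.Flatten()(x)

# Second half: swappable heads that reuse the same features.
class_out = tf.keras.layers.Dense(10, activation='softmax')(features)  # classification
reg_out   = tf.keras.layers.Dense(1)(features)                          # regression

classifier = tf.keras.Model(inputs, class_out)
regressor  = tf.keras.Model(inputs, reg_out)

# Only the head and the loss change between tasks.
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
regressor.compile(optimizer='adam', loss='mse')
```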


Breast cancer detection (37:01)

Now, this paper was published on how a CNN could actually outperform expert human radiologists in detecting breast cancer directly from mammogram images. Classification gives us a single, discrete prediction of what the network sees. So, for example, if I feed in the image on the left, classification is essentially the task of the computer telling me, okay, I'm looking at the class of a taxi, and that's a single class label. Maybe it even outputs the probability that this is a taxi versus the probability that it's something else.


Detecting multiple objects (37:42)

But we can also go deeper into this problem and determine not just that this image has a taxi, but, a much harder problem, have the neural network tell us a bounding box for every object in the image. So it's going to look at this image, detect what's in it, and then also provide a measure of where each object is by encapsulating it in a bounding box, and tell us what that object is. Now, this is a super hard problem, because in a lot of images there may be many different objects in the scene. Not only does the network have to detect all of those different objects, it has to localize them, place boxes at their locations, and do the task of classification: it has to tell us that within this box there's a taxi. So our network needs to be extremely flexible and infer a dynamic number of objects. In this image, I may have one primary object, but in a different image I might have two objects, so our model needs to be capable of outputting a variable number of detections. Here, for example, there's one object that we're outputting: a taxi at this location. But what if we had an image like this? Now we have many different objects present in our image, and our model needs to have the ability to output a variable number of classifications. This is extremely complicated, number one because the boxes can be anywhere in the image, and they can also be of different sizes and scales. So how can we accomplish this with convolutional neural networks? Let's consider, first, a very naive way of doing it. Let's take our image and start by placing a random box somewhere in it. For example, here's a white box that I've placed randomly at this location; I've also randomized the size of the box. Then let's take just this small box and feed it through a CNN, and ask what the class of this small box is. Then we can repeat this for a bunch of different boxes: we keep sampling, randomly picking a box in our image, feeding it through our neural network, and for each box trying to detect if there's an object of some class there. Now, this might actually work, but the problem is that there are way too many inputs to be able to do this; there are way too many positions and scales. Can you imagine, for a reasonably sized image, the number of different permutations of boxes that you'd have to account for? It would just be intractable. So instead of picking random boxes, let's use a simple heuristic to identify where these boxes of interesting information might be. In this example, it's not going to be a learned heuristic that identifies where the boxes are, but we're going to use some algorithm to pick some boxes in our image.
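For intuition only, here is what that naive box-then-classify pipeline might look like; the stand-in classifier, the random box sampler, and the fake camera frame are all hypothetical placeholders, and in reality you would need vastly more than a handful of boxes.

```python
import numpy as np
import tensorflow as tf

# Stand-in, untrained classifier that answers "what class is in this crop?"
classifier = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

image = np.random.rand(1, 480, 640, 3).astype('float32')   # fake camera frame
rng = np.random.default_rng(0)

def random_box(height, width):
    # Pick a random top-left corner and a random size (at least 32x32 pixels).
    y0 = rng.integers(0, height - 32)
    x0 = rng.integers(0, width - 32)
    h = rng.integers(32, height - y0 + 1)
    w = rng.integers(32, width - x0 + 1)
    return y0, x0, h, w

for _ in range(5):                                  # in reality: far too many boxes
    y0, x0, h, w = random_box(480, 640)
    crop = image[:, y0:y0 + h, x0:x0 + w, :]
    crop = tf.image.resize(crop, (224, 224))        # warp the box to a uniform size
    probs = classifier(crop)                        # class probabilities for this box
```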


Object detection (40:33)

Usually this is an algorithm that looks for some signal in the image: it ignores regions where not a lot is going on and only focuses on regions of the image where there is some interesting signal. It then feeds each box to a neural network. First it has to shrink the box to a uniform size that is amenable to a single network, warp it down to that size, feed it through a classification model, and try to see if there's an object in that patch. If there is, then we say, OK, there's a class there, and there's the box for it as well. But still, this is extremely slow. We have to feed each region that our heuristic gives us through the CNN, one by one, and check for each one whether there is a class there or not. Plus, this is extremely brittle, since the network part of the model is completely detached from the feature extraction, or region-proposal, part of the model. Extracting the regions is one heuristic, and extracting the features that correspond to each region is completely separate, when ideally they should be very related: we'd want to propose boxes where we can see certain features, and that would make for a much better process. So there are two big problems: one, it's extremely slow, and two, it's brittle because of this disconnect. Many variants have been proposed to tackle these issues, but I'd like to touch extremely briefly on one, and just point you in its direction, called the Faster R-CNN model, which actually attempts to learn the regions instead of using the simple heuristic that I told you about before. Now we take as input the entire image, and the first thing we do is feed it to what's called a region proposal network. The goal of this network is to identify proposal regions where there might be interesting boxes for us to detect and classify. That region proposal network is completely learned, and it's part of our neural network architecture. We then use it to directly grab all of the regions and process them independently, but each of the regions is processed with its own feature extractor head, and then a classification model is used to ultimately perform the object detection. This is actually much, much faster than before, because now that part of the model is learned in tandem with the downstream classifier. So in object detection, we want to predict, from a single image, a list of bounding boxes. Another type of task is where we don't want to predict a variable number of classes and boxes, but instead want to predict, for every single pixel, what the class of that pixel is. This is a super high-dimensional output space now: it's not just one classification, it's one classification per pixel. Imagine, for a large input image, the huge number of different predictions you have to make.
Here, for example, you can see the cow pixels on the left being classified separately from the grass pixels on the bottom, and separately from the sky pixels on the top right. This output is created by, first, the same kind of feature extraction model we saw before, and then the second part of the model is an upscaling operation, which is kind of the inverse of the encoding part. We encode all of that information into a set of features, and then the second part of our model uses those features to learn a representation of whatever we want to output on the other side, which in this case is pixel-wise classes. So instead of using two-dimensional convolutions on the left, we now use what are called transpose convolutions on the right. Effectively, these are very similar to convolutions, except they're able to do this upscaling operation instead of downscaling. Of course, this can be applied to so many other applications as well, especially in healthcare, where we may want to segment out, for example, cancerous regions of medical scans, or even identify the parts of the blood that are affected by malaria.
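A minimal encoder-decoder sketch in Keras showing transpose convolutions upscaling back to a per-pixel class map; the input size and the three classes (for example cow, grass, sky) are illustrative assumptions.

```python
import tensorflow as tf

num_classes = 3
inputs = tf.keras.Input(shape=(128, 128, 3))

# Encoder: strided convolutions downscale and extract features.
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)

# Decoder: transpose convolutions upscale back to the input resolution.
x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=2, padding='same')(x)

# A class distribution at every pixel.
outputs = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)   # (None, 128, 128, 3): one probability per class per pixel
```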


End-to-end autonomous driving (45:02)

Let's see one final example of another type of model we can build with a neural network, and let's consider the example of self-driving cars. Say we want to learn one neural network for autonomous control of a self-driving car. Specifically, we want a model that goes directly from our raw perception, what the car sees, to some inference about how the car should control itself: what steering wheel angle it should take at this specific instant, given what it sees. And we're trying to infer here not just one steering command, but a full probability distribution over all of the different possible steering commands that could be executed at this moment in time.


Continuous control outputs (45:37)

And the probability is going to be very high where you can see the red lines are darker; that's where the model is saying there is a high probability that this is a good steering command to take. So this is again very different from classification or segmentation networks: now we're outputting a continuous distribution over our outputs. So how can we do that? This entire model is trained end to end, just like all the other models, by passing each of the cameras through its own dedicated convolutional feature extractor. Each of these cameras is going to extract some features of the environment; then we concatenate and combine all of those features together so we have one giant set of features that encapsulates our entire environment, and then we predict these control parameters. The loss function is really the interesting part here. The top part of the model is exactly like we saw in lecture one: just a fully connected, or dense, layer that takes as input the features and outputs the parameters of this continuous distribution. The bottom part is the really interesting piece that enables learning these continuous probability distributions, and we can do that even though the human driver never took, let's say, all three of these actions; they could have taken just one of these actions, and we can learn to maximize the probability of that action in the future. After seeing a bunch of different intersections, the model might learn that, okay, there is a key feature in these intersections that permits turning in each of these different directions, and it can maximize the probability of taking each of those directions. And that's an interesting way, again, of predicting a variable number of outputs in a continuous manner. So actually, in this example, a human can enter the car and put in a desired destination, and not only will it navigate to that location, it will do so entirely autonomously and end to end. The impact of CNNs is very wide-reaching beyond the examples that I've explained here, and they've also touched many other fields in computer vision that I'm not going to be able to talk about today for the purposes of time. I'd like to conclude today's lecture by taking a look at what we've covered, just to summarize.
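One way to sketch such a head is as a mixture of Gaussians over the steering angle; this parameterization, the feature size, and the number of modes are assumptions for illustration, not the exact published model.

```python
import tensorflow as tf

num_modes = 3                               # assumed number of mixture components
features = tf.keras.Input(shape=(512,))     # assumed size of the fused camera features

# Dense layers output the parameters of a continuous distribution over steering angles.
weights = tf.keras.layers.Dense(num_modes, activation='softmax')(features)   # mode weights
means   = tf.keras.layers.Dense(num_modes)(features)                         # steering angles
sigmas  = tf.keras.layers.Dense(num_modes, activation='softplus')(features)  # positive spreads

head = tf.keras.Model(features, [weights, means, sigmas])

def mixture_nll(y_true, weights, means, sigmas):
    # Negative log-likelihood of the observed steering angle under the mixture:
    # maximizing the likelihood of whichever action the driver actually took.
    norm = tf.exp(-0.5 * ((y_true - means) / sigmas) ** 2) / (
        sigmas * tf.sqrt(2.0 * 3.141592653589793))
    return -tf.math.log(tf.reduce_sum(weights * norm, axis=-1) + 1e-9)
```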


Capstone: Feature Extraction For Real World

Capstone Part 1: Feature Extraction (47:55)

So first, we covered the origins of computer vision and of the computer vision problem: how can we represent images to computers, and how can we define what a convolution operation does to an image? Given a set of features, which is just a small weight matrix, how can we extract those features from our image using convolution? Then we discussed the basic architecture, using convolution to build up convolutional layers and convolutional neural networks. And finally, we talked a little bit about the extensions and applications of this very general architecture and model to a whole host of different types of tasks and problems that you might face when you're building an AI system, ranging from segmentation to captioning and control.


Call to Action (48:42)

And with that, I'm very excited to move on to the next lecture, which is going to be focused on generative modeling. Just to remind you, we are going to have the software lab, and the software lab is going to tie very closely to what you just learned in this lecture, lecture three, on convolutional neural networks, in combination with the next lecture that you're going to hear from Ava, which is going to be on generative modeling.


Ending Remarks

Outro (48:59)

Now with that, I will pause the lecture, and let's reconvene in about five minutes after we set up for the next lecture. Thank you.

