MIT 6.S191 (2020): Convolutional Neural Networks

Transcription for the video titled "MIT 6.S191 (2020): Convolutional Neural Networks".


Opening Remarks

Introduction (00:00)

Hi everyone, and welcome back to MIT 6.S191. Today we're going to be talking about one of my favorite topics in this course, and that's how we can give machines a sense of vision. Now, vision, I think, is one of the most important senses that humans possess. Sighted people rely on vision every single day for things like navigation, manipulation, how you can pick up objects, how you can recognize objects, and how you recognize complex human emotions and behaviors. And I think it's very safe to say that vision is a huge part of human life. Today we're going to be learning about how deep learning can build powerful computer vision systems capable of solving extraordinarily complex tasks that maybe just 15 years ago would not even have been possible to solve. Now, one example of how deep learning is transforming computer vision is facial recognition. So on the top left you can see an icon of the human eye, which visually represents vision coming into a deep neural network in the form of images or pixels or video. And at the output on the bottom, you can see a depiction of detecting a human face, but this could also be recognizing different human faces or even emotions on the face, recognizing key facial features, etc. Now, deep learning has transformed this field specifically because it means that the creator of this AI does not need to tailor that algorithm specifically towards facial detection. Instead, they can provide lots and lots of data to this algorithm and swap out this end piece: instead of facial detection, they can swap it out for many other detection types or recognition types, and the neural network can try to learn to solve that task. So for example, we can replace that facial detection task with the detection of disease regions in the retina of the eye.
And similar techniques could also be applied throughout health care and medicine, towards the detection and classification of many different types of diseases in biology and so on. Now, another common example is in the context of self-driving cars, where we take an image as input and try to learn an autonomous control system for that car. This is all entirely end-to-end. So we have vision and pixels coming in as input, and the actuation of the car coming out as output. Now this is radically different from the vast majority of autonomous car companies and how they operate. So if you look at companies like Waymo and Tesla, this end-to-end approach is radically different. We'll talk more about this later on, but this is actually just one of the autonomous vehicles that we built here as part of my lab at CSAIL. So that's why I'm bringing it up.

Discussion On Visual Features And Convolutional Neural Networks

What computers see (03:04)

So now that we've gotten a sense of, at a very high level, some of the computer vision tasks that we as humans solve every single day, and that we can also train machines to solve, the next natural question I think to ask is, how can computers see? And specifically, how does a computer process an image or a video? How do they process pixels coming from those images? Well, to a computer, images are just numbers. And suppose we have this picture here of Abraham Lincoln. It's made up of pixels. Now, each of these pixels, since it's a grayscale image, can be represented by a single number. And now we can represent our image as a two-dimensional matrix of numbers, one for each pixel in that image, and that's how a computer sees this image. It sees it as just a two-dimensional matrix of numbers. Now, if we have an RGB image, a color image, instead of a grayscale image, we can simply represent that as three of these two-dimensional images concatenated or stacked on top of each other: one for the red channel, one for the green channel, one for the blue channel. And that's RGB.
Now we have a way to represent images to computers, and we can think about what types of computer vision tasks this will allow us to solve, and what we can perform given this foundation. Well, two common types of machine learning that we actually saw in lectures one and two yesterday are those of classification and those of regression. In regression, we have our output take a continuous value. In classification, we have our output take a class label. So let's first start with classification, specifically the problem of image classification. We want to predict a single label for each image. Now, for example, we have a bunch of U.S. presidents here, and we want to build a classification pipeline to determine which president is in this image that we're looking at, outputting the probability that this image is each of those U.S. presidents. In order to correctly classify this image, our pipeline needs to be able to tell what is unique about a picture of Lincoln versus what is unique about a picture of Washington versus a picture of Obama. It needs to understand those unique differences in each of those images or each of those features.
Now, another way to think about this image classification pipeline at a high level is in terms of features that are characteristic of a particular class. Classification is done by detecting these types of features in that class. If you detect enough of these features specific to that class, then you can probably say with pretty high confidence that you're looking at that class. Now one way to solve this problem is to leverage knowledge about your field, some domain knowledge, and say, let's suppose we're dealing with human faces. We can use our knowledge about human faces to say that if we want to detect human faces, we can first detect noses, eyes, ears, mouths, and then once we have a detection pipeline for those, we can start to detect those features and then determine if we're looking at a human face or not. Now, there's a big problem with that approach, and that's that preliminary detection pipeline. How do we detect those noses, ears, mouths? This hierarchy is kind of our bottleneck in that sense. And remember that these images are just three-dimensional arrays of numbers, or really arrays of brightness values, and that images can hold tons of variation. So there's variation such as occlusions that we have to deal with, variations in illumination, and even intra-class variation.
And when we're building our classification pipeline, we need to be invariant to all of these variations, and be sensitive to inter-class variation. So sensitive to the variations between classes, but invariant to the variations within a single class. Now, even though our pipeline could use the features that we, as humans, defined, the manual extraction of those features is where this really breaks down. Due to the incredible variability in image data specifically, the detection of these features is super difficult in practice, and manually extracting these features can be extremely brittle. So how can we do better than this? That's really the question that we want to tackle today. One way is to extract these visual features and detect their presence in the image simultaneously and in a hierarchical fashion. For that we can use neural networks, like we saw in class number one and two. Our approach here is going to be to learn the visual features directly from data and to learn a hierarchy of these features, so that we can reconstruct a representation of what makes up our final class label.
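The image representation described at the start of this section can be sketched in a few lines of NumPy. This is only an illustration: the pixel values and the 4x4 size are made up, and the names `gray` and `rgb` are not from the lecture.

```python
import numpy as np

# Hypothetical 4x4 grayscale image: one brightness value (0-255) per pixel.
gray = np.array([
    [0,  64, 128, 255],
    [32, 96, 160, 224],
    [16, 80, 144, 208],
    [8,  72, 136, 200],
], dtype=np.uint8)

# A color image stacks three such 2D arrays, one per channel.
# The channel contents here are arbitrary, chosen only to differ from each other.
red = gray
green = gray // 2
blue = 255 - gray
rgb = np.stack([red, green, blue], axis=-1)

print(gray.shape)  # (4, 4)     -> height x width
print(rgb.shape)   # (4, 4, 3)  -> height x width x channels
```

Whether channels come last, as here, or first is a convention that differs between libraries.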

Learning visual features (08:06)

So I think now that we have that foundation of how images work, we can actually move on to asking ourselves how we can learn those visual features, specifically with a certain type of operation in neural networks. And neural networks will allow us to directly learn those features from visual data if we construct them cleverly and correctly. In lecture one, we learned about fully connected or dense neural networks, where you can have multiple hidden layers, and each hidden layer is densely connected to its previous layer. And densely connected here, let me just remind you, means that every input is connected to every output in that layer. So let's say that we want to use these densely connected networks in image classification. What that would mean is that we're going to take our two-dimensional image, right, it's a two-dimensional spatial structure, we're going to collapse it down into a one-dimensional vector, and then we can feed that through our dense network. So now every pixel in that one-dimensional vector will feed into the next layer, and you should already appreciate that all of our two-dimensional structure in that image is completely gone, because we've collapsed that two-dimensional image into one dimension. We've lost all of that very useful spatial structure in our image and all of that domain knowledge that we could have used a priori. And additionally, we're going to have a ton of parameters in this network. Because it's densely connected, we're connecting every single pixel in our input to every single neuron in our hidden layer. So this is not really feasible in practice. And instead, we need to ask how we can build some spatial structure into neural networks, so we can be a little more clever in our learning process and tackle this specific type of input in a more reasonable and well-behaved way.
Also, we're dealing with some prior knowledge that we have, specifically that spatial structure is super important in image data. And to do this, let's first represent our two-dimensional image as an array of pixel values, just like they were to start with. One way that we can keep and maintain that spatial structure is by connecting patches of the input to a single neuron in the hidden layer. So instead of connecting every input pixel from our input image to a single neuron in the hidden layer, like with dense neural networks, we're going to connect just a single patch, a very small patch. And notice here that only a region of that input image is influencing this single neuron at the hidden layer. To define connections across the entire input, we can apply the same principle of connecting patches in the input layer to single neurons in the subsequent layer. We do this by simply sliding that patch window across the input image, and in this case, we're sliding it by two units each time. In this way, we take into account and maintain all of that spatial structure, that spatial information inherent to our image input. But we also remember that the final task that we really want to do here, that I told you we wanted to do, was to learn visual features. And we can do this very simply by weighting those connections in the patches. So instead of just connecting each of the patches uniformly to our hidden layer, we're going to weight each of those pixels and apply a similar technique to what we saw in lab one. We can basically just have a weighted summation of all of those pixels in that patch, and that feeds into the next hidden unit in our hidden layer to detect a particular feature.
Now, in practice, this operation is simply called convolution, which gives way to the name convolutional neural network, which we'll get to later on. We'll think about this at a high level first, and suppose that we have a 4x4 filter, which means that we have 16 different weights, 4x4. We are going to apply the same filter of 4x4 patches across the entire input image, and we'll use the result of that operation to define the state of the neurons in the next hidden layer. We basically shift this patch across the image. We shift it, for example, in units of two pixels each time to grab the next patch. We repeat the convolution operation. And that's how we can start to think about extracting features in our input.
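To make the parameter argument above concrete, here is a small arithmetic sketch. The 100x100 image size and the 1000 hidden units are assumptions chosen for illustration; only the 4x4 filter and the stride of two come from the lecture. The helper names are hypothetical.

```python
def dense_weights(height, width, hidden_units):
    """Weights needed when every pixel connects to every hidden neuron."""
    return height * width * hidden_units


def patch_positions(size, patch, stride):
    """Positions a patch can take along one dimension, with no padding."""
    return (size - patch) // stride + 1


# Assumed sizes, for illustration only.
H, W = 100, 100

dense = dense_weights(H, W, hidden_units=1000)
shared = 4 * 4                                  # one 4x4 filter, reused at every position
per_axis = patch_positions(H, patch=4, stride=2)

print(dense)                       # 10000000 weights for the dense layer
print(shared)                      # 16 weights for the shared filter
print(per_axis, per_axis ** 2)     # 49 2401 patch positions (hidden neurons)
```

The filter still produces thousands of hidden-layer values, but all of them reuse the same 16 weights, which is where the savings come from.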

Feature extraction and convolution (12:36)

But you're probably wondering, how does this convolution operation actually relate to feature extraction? So far we've just defined the sliding operation where we can slide a patch over the input, but we haven't really talked about how that allows us to extract features from that image itself. So let's make this concrete by walking through an example first. Suppose we want to classify X's from a set of black and white images. So here black is represented by -1 and white is represented by 1. Now to classify X's, clearly we're not going to be able to just compare these two matrices, because there's too much variation between these images. We want to be invariant to certain types of deformation of the images: scale, shift, rotation. We want to be able to handle all of that. So we can't just compare these two as they are right now. Instead, we want our model to compare these images of X's piece by piece or patch by patch. And the important patches or the important pieces that it's looking for are the features. Now, if our model can find rough feature matches across these two images, then we can say with pretty high confidence that they're probably coming from the same image. If they share a lot of the same visual features, then they're probably representing the same object. Now each feature is like a mini image. Each of these patches is like a mini image. It's also a two-dimensional array of numbers. And we'll use these filters, let me call them now, to pick up on the features common to X's. In the case of X's, filters representing diagonal lines and crosses are probably the most important things to look for. And you can see those on the top row here. So we can probably capture these features in terms of the arms and the main body of that X. So the arms, the legs, and the body will capture all of those features that we show here. And note that the smaller matrices are the filters of weights.
So these are the actual values of the weights that correspond to that patch as we slide it across the image. Now all that's left to do here is really just define that convolution operation and tell you, when you slide that patch over the image, what is the actual operation that takes that patch on top of that image and then produces that next output at the hidden neuron layer. So convolution preserves that spatial structure between pixels by learning the image features in these small squares or these small patches of the input data. To do this, the entire computation is as follows. We first place that patch on top of a region of our input image of the same size. So here we're placing this patch on the top left, on this part of the image in green on the X there. And we perform an element-wise multiplication: every pixel of the image that the patch overlaps is multiplied by the corresponding pixel in the filter. The result, which you can see on the right, is just a matrix of all ones, because there's perfect overlap between our filter in this case and our image at the patch location. The only thing left to do here is sum up all of those numbers, and when you sum them up you get 9, and that's the output at the next layer. Now let's go through one more example, a little bit more slowly, and you might be able to appreciate what this convolution operation is intuitively telling us. That's mathematically how it's done, but now let's see intuitively what this is showing us. Suppose we want to compute the convolution now of this 5x5 image in green with this 3x3 filter, or this 3x3 patch. To do this, we need to cover that entire image by sliding the filter over that image, performing that element-wise multiplication, and adding the output for each patch. And this is what that looks like. So first we'll start off by placing that yellow filter on the top left corner.
We're going to element-wise multiply and add all of the output, and we're going to get four. And we're going to place that four in the first entry of our output matrix. This is called the feature map. Now we can continue this and slide that 3x3 filter over the image, element-wise multiply, add up all the numbers, and place the next result in the next column, which is 3. And we can just keep repeating this operation over and over. And that's it. The feature map on the right reflects where in the image there is activation by this particular filter. So let's take a look at this filter really quickly. You can see that this filter is an X or a cross. It has ones on both diagonals. And in the image you can see that it's being maximally activated along this main diagonal, where the fours are. So this is showing that there is maximum overlap with this filter on this image along this central diagonal. Now let's take a quick example of how different types of filters, that is, changing the weights in your filter, can produce different feature maps or different outputs. Simply by changing the weights in your filter, you can change what your filter is looking for, or what it's going to be activated by. So take, for example, this image of this woman, Lena, on the left. That's the original image. If you slide different filters over this image, you can get different output feature maps. So for example, you can sharpen this image by using the filter shown in the second column. You can detect edges in this image by using the third column's filter. And you can even detect stronger edges by using the fourth column's filter. These are ways that changing those weights in your filter can really impact the features that you detect. So now I hope you can appreciate how convolution allows us to capitalize on spatial structure and use sets of weights to extract these local features within images.
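The multiply-and-sum walkthrough above can be sketched as follows. The transcript does not give the pixel values on the slides, so the 5x5 binary image below is an assumption, chosen to be consistent with the numbers mentioned (a first output of 4, then 3, and fours along the main diagonal). The function name `convolve2d_valid` is hypothetical. Like most deep learning code, it slides the filter without flipping it.

```python
import numpy as np


def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image`; at each position multiply element-wise and sum."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    return np.array([
        [np.sum(image[r * stride:r * stride + kh, c * stride:c * stride + kw] * kernel)
         for c in range(out_w)]
        for r in range(out_h)
    ])


# A diagonal patch of an X (-1 = black, 1 = white) matched against an identical filter:
# every product is 1, so the sum is 9.
diagonal = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
match = int(np.sum(diagonal * diagonal))
print(match)  # 9

# Assumed 5x5 image and a 3x3 cross-shaped filter with ones on both diagonals.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
cross = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

feature_map = convolve2d_valid(image, cross)
print(feature_map.tolist())  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```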
And very easily we can detect different features by simply changing our weights and using different filters. Okay? Now these concepts of preserving spatial information and spatial structure, while also doing local feature extraction using the convolution operation, are at the core of the neural networks that we use for computer vision tasks.
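As a rough illustration of how changing the weights changes what a filter responds to, the sketch below applies two hand-written 3x3 kernels to a tiny synthetic image with a vertical edge. The kernel values are commonly used sharpening and edge-detection kernels, not values taken from the lecture slides, and the image is made up.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), multiplying and summing."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    return np.array([
        [np.sum(image[r:r + kh, c:c + kw] * kernel) for c in range(out_w)]
        for r in range(out_h)
    ])


# Synthetic 5x6 image: dark left half (0), bright right half (1).
image = np.zeros((5, 6), dtype=int)
image[:, 3:] = 1

# Assumed, commonly used kernels.
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])

edges = convolve2d_valid(image, edge)
sharp = convolve2d_valid(image, sharpen)

print(edges[0].tolist())  # [0, -3, 3, 0]  -> responds only near the edge
print(sharp[0].tolist())  # [0, -1, 2, 1]  -> flat areas kept, contrast at the edge exaggerated
```

The edge kernel's weights sum to zero, so flat regions produce no response, while the sharpening kernel's weights sum to one, so flat regions pass through unchanged.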

Convolution neural networks (19:12)

So now that we've gotten convolutions under our belt, we can start to think about how we can utilize this to build full convolutional neural networks for solving computer vision tasks. And these networks are very appropriately named convolutional neural networks, because the backbone of them is the convolution operation. We'll take a look first at a CNN, or convolutional neural network, architecture designed for image classification tasks. And we'll see how the convolution operation can actually feed into those spatial sampling operations so that we can build this full thing end-to-end. So first let's consider a very simple CNN for image classification. Now here the goal is to learn features directly from data and to use these learned feature maps for classification of these images. There are three main parts to a CNN that I want to talk about now. The first part is the convolutions, which we've talked about before. These are for extracting the features in your image or, more generally, in your previous layer. The second step is applying your non-linearity, and again, like we saw in lectures one and two, non-linearities allow us to deal with nonlinear data and introduce complexity into our learning pipeline so that we can solve these more complex tasks. And finally the third step, which is what I was talking about before, is this pooling operation, which allows you to downsample the spatial resolution of your image and deal with multiple scales of that image, or multiple scales of your features within that image. And the last point I want to make here is that the computation of class scores, if we're dealing with image classification, can be outputted using maybe a dense layer at the end, after your convolutional layers. So you can have a dense layer that outputs the probabilities of each class.
And that can be your final output in this case. And now we'll go through each of these operations and break these ideas down a little bit further, just so we can see the basic architecture of a CNN and how you can implement one as well. OK, so going through this step by step, those three steps. The first step is that convolution operation. And as before, this is the same story that we've been going through. Each neuron here in the hidden layer will compute a weighted sum of its inputs from that patch, and will apply a bias, like in lecture one and two, and activate with a local nonlinearity. What's special here is that local connectivity that I just want to keep stressing again. Each neuron in that hidden layer is only seeing a patch from that original input image, and that's what's really important here. We can define the actual computation for a neuron in that hidden layer. Its inputs are those neurons in the patch in the previous layer. We apply a matrix of weights, again that's that filter, a 4x4 filter in this case. We do an element-wise multiplication, add the result, apply a bias, activate with that non-linearity, and that's it. That's our single neuron at the hidden layer, and we just keep repeating this by sliding that patch over the input. Remember that our element-wise multiplication and addition here is simply that convolution operation that we talked about earlier. I'm not saying anything new except the addition of that bias term before our nonlinearity. So this defines how neurons in convolutional layers are connected. But with a single convolutional layer, we can have multiple different filters or multiple different features that we might want to extract or detect. The output layer of a convolution, therefore, is not a single image as well, but rather a volume of images representing all of the different filters that it detects. 
So here at D, the depth is the number of filters or the number of features that you want to detect in that image, and that's set by the human. So when you define your network, you define at every layer how many features do you want to detect at that layer. Now we can also think about the connections in a neuron, in a convolutional neural network, in terms of their receptive field. And the locations of their input of that specific node that they're connected to. So these parameters define the spatial arrangement of that output of the convolutional layer. And to summarize, we can see basically how the connections of these convolutional layers are defined, first of all, and also how the output of a convolutional layer is a volume defined by that depth or the number of filters that we want to learn. And with this information, this defines our single convolutional layer. And now we're well on our way to defining the full convolutional neural network. The remaining steps are kind of just icing on the cake at this point.
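The single convolutional layer described above (weighted sum over a patch, plus a bias, then a nonlinearity, repeated for several filters) might be sketched like this. It is an illustrative NumPy version with random, untrained weights. The 8x8 input, the choice of 3 filters, and the use of ReLU as the nonlinearity are assumptions, and `conv_layer` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv_layer(x, filters, bias):
    """Apply D filters to a 2D input.

    x:       (H, W) input image
    filters: (D, k, k) one k x k weight matrix per filter
    bias:    (D,) one bias per filter
    returns: (H - k + 1, W - k + 1, D) volume of feature maps
    """
    k = filters.shape[1]
    patches = sliding_window_view(x, (k, k))            # (H-k+1, W-k+1, k, k)
    # Weighted sum of each patch with each filter, plus that filter's bias.
    z = np.einsum("hwij,dij->hwd", patches, filters) + bias
    return np.maximum(z, 0.0)                           # element-wise nonlinearity (ReLU)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))            # assumed 8x8 input
filters = rng.standard_normal((3, 4, 4))   # D = 3 filters of size 4x4
bias = np.zeros(3)

volume = conv_layer(x, filters, bias)
print(volume.shape)  # (5, 5, 3): height x width x depth (number of filters)
```

The depth of the output volume equals the number of filters, which is the quantity the lecture says is chosen by the person defining the network.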

Non-linearity and pooling (24:03)

And it starts with applying that non-linearity. So on that volume, we apply an element-wise non-linearity. In this case, I'm showing a rectified linear unit activation function. This is very similar in idea to lectures one and two, where we also applied non-linearities to deal with highly non-linear data. Now, here, the ReLU activation function, the rectified linear unit, we haven't talked about yet, but this is just an activation function that takes as input any real number and essentially sets everything less than 0 to 0, and anything greater than 0 it keeps the same. Another way you can think about this is that it makes sure that the minimum of whatever you feed in is 0. So if it's greater than zero, it doesn't touch it; if it's less than zero, it makes sure that it caps it at zero. Now, the next key idea of convolutional neural networks is pooling, and that's how we can deal with different spatial resolutions and become invariant to spatial size in our image. Now the pooling operation is used to reduce the dimensionality of our input layers. And this can be done on any layer after the convolutional layer.
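The ReLU activation described above is a one-line function. A minimal NumPy sketch, with made-up input values:

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative values become 0, others are unchanged."""
    return np.maximum(x, 0)


values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(values)
print(activated.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```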
So you can apply a convolutional layer on your input image, apply a non-linearity, and then downsample using a pooling layer to get a different spatial resolution before applying your next convolutional layer, and repeat this process for many layers in a deep neural network. Now a common technique here for pooling is called max pooling, and the idea is as follows. You slide another window or another patch over your input, and for each of the patches you simply take the maximum value in that patch. So let's say we're dealing with two by two patches. In this case, for the red patch you can see on the top right, we just simply take the maximum value in that red patch, which is six, and the output is on the right hand side here. So that six is the maximum from this two by two patch, and we repeat this over the entire image. This allows us to shrink the spatial dimensions of our image while still maintaining all of that spatial structure. So actually, this is a great point, because I encourage all of you to think about what are some other ways that you could perform a pooling operation. How else could you downsample these images? Max pooling is one way, so you could always take the maximum of these two by two patches, but there are a lot of other really clever ways as well, so it's interesting to think about other ways that we could potentially perform this downsampling operation. So those are the key ideas of these convolutional neural networks, and now, with all of this knowledge, we're kind of ready to put this together and build these end-to-end networks. So we have the three main steps that I talked to you about before: convolution, local nonlinearities, and pooling operations.
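The pooling step above can be sketched as follows. The 4x4 input values are assumed for illustration (chosen so that one 2x2 patch has a maximum of six, as in the lecture), and `pool2d` is a hypothetical helper. Average pooling is shown as just one possible alternative to max pooling; the lecture leaves the alternatives open.

```python
import numpy as np


def pool2d(x, size=2, reduce=np.max):
    """Split `x` into non-overlapping size x size patches and reduce each to one value."""
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must divide evenly into patches"
    blocks = x.reshape(h // size, size, w // size, size)
    return reduce(blocks, axis=(1, 3))


# Assumed 4x4 feature map.
x = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

max_pooled = pool2d(x, size=2, reduce=np.max)
avg_pooled = pool2d(x, size=2, reduce=np.mean)

print(max_pooled.tolist())  # [[6, 8], [3, 4]]
print(avg_pooled.tolist())  # [[3.25, 5.25], [2.0, 2.0]]
```

Either way, a 4x4 input becomes a 2x2 output, halving each spatial dimension.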
And with CNNs, we can layer these operations to learn a hierarchy of features that we want to detect, whether they're present in the image data or not. So a CNN built for image classification can be broken down roughly into two parts. The first part, which I'm showing here on the left, is the part of feature learning. So that's where we want to extract those features and learn the features from our image data. This is simply applying that same idea that I showed you before. We're going to stack convolutions and nonlinearities with pooling operations and repeat this throughout the depth of our network. The next step for our convolutional neural network is to take those extracted or learned features and to classify our image. So the ultimate goal here is not just to extract features. We want to extract features, but then use them to make some classification or make some decision based on our image. So we can feed these outputted features into a fully connected or dense layer, and that dense layer can output a probability distribution over the image's membership in different categories or classes. And we do this using a function called softmax, which you actually already got some experience with in lab one, whose output represents this categorical distribution.
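The softmax function mentioned above turns a vector of class scores into a probability distribution. A minimal NumPy sketch, with made-up scores for three classes (subtracting the maximum is a standard trick for numerical stability and does not change the result):

```python
import numpy as np


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # avoids overflow in exp for large scores
    exp = np.exp(shifted)
    return exp / exp.sum()


scores = np.array([2.0, 1.0, 0.1])      # hypothetical scores for three classes
probs = softmax(scores)

print(np.round(probs, 3).tolist())      # approximately [0.659, 0.242, 0.099]
print(int(np.argmax(probs)))            # 0: the class with the highest score
```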

Code example (28:30)

So now let's put this all together into coding our first end-to-end convolutional neural network from scratch. We'll start by defining our feature extraction head, which starts with a convolutional layer. Here I'm showing with 32 filters. So 32 is coming from this number right here. That's the number of filters that we want to extract inside of this first convolutional layer. We downsample the spatial information using a max pooling operation like I discussed earlier. And next we feed this into the next set of convolutional layers in our network. So now, instead of 32 features, we're going to be extracting even more features. So now we're extracting 64 features. Then, finally, we can flatten all of the spatial information and the spatial features that we've learned into a vector and learn our final probability distribution over class membership. And that allows us to actually classify this image into one of these different classes that we've defined.
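The code shown on the slide is not part of this transcript. As a rough stand-in, here is a forward-pass sketch of the same layer sequence (32 filters, pooling, 64 filters, pooling, flatten, dense, softmax) written in plain NumPy rather than a deep learning framework. The 28x28 grayscale input, 3x3 filters, 10 output classes, and random untrained weights are all assumptions, and every function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)


def conv_relu(x, w, b):
    """x: (H, W, C_in), w: (k, k, C_in, C_out). Valid convolution, bias, then ReLU."""
    k = w.shape[0]
    patches = sliding_window_view(x, (k, k), axis=(0, 1))   # (H', W', C_in, k, k)
    z = np.einsum("hwcij,ijcd->hwd", patches, w) + b
    return np.maximum(z, 0.0)


def max_pool(x, size=2):
    """x: (H, W, C). Non-overlapping max pooling; leftover rows/columns are dropped."""
    h, w, c = x.shape
    h2, w2 = h // size, w // size
    x = x[:h2 * size, :w2 * size]
    return x.reshape(h2, size, w2, size, c).max(axis=(1, 3))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def init(shape):
    """Small random weights; a real network would learn these from data."""
    return rng.standard_normal(shape) * 0.1


image = rng.random((28, 28, 1))                    # assumed input size
w1, b1 = init((3, 3, 1, 32)), np.zeros(32)         # first conv layer: 32 filters
w2, b2 = init((3, 3, 32, 64)), np.zeros(64)        # second conv layer: 64 filters

h = max_pool(conv_relu(image, w1, b1))             # (26, 26, 32) -> (13, 13, 32)
h = max_pool(conv_relu(h, w2, b2))                 # (11, 11, 64) -> (5, 5, 64)
features = h.reshape(-1)                           # flatten to a vector of length 1600

w_out, b_out = init((features.size, 10)), np.zeros(10)
probs = softmax(features @ w_out + b_out)          # distribution over 10 assumed classes

print(h.shape, features.size, probs.shape)         # (5, 5, 64) 1600 (10,)
```

This only traces data through the network. Training the weights, which is what a framework would handle, is not shown.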

Applications (29:32)

So far, we've talked only about using CNNs for image classification tasks. In reality, this architecture extends to many, many different types of tasks and many, many different types of applications as well. When we're considering CNNs for classification, we saw that it has two main parts, the first being the feature learning part, shown here, and then a classification part in the second part of the pipeline. What makes a convolutional neural network so powerful is that you can take this feature extraction part of the pipeline, and at the output, you can attach whatever kind of output that you want to it. So you can just treat this convolutional feature extractor simply as that, a feature extractor, and then plug in whatever other type of neural network you want at its output. So you can do detection by changing the output head. You can do semantic segmentation, which is where you want to detect semantic classes for every pixel in your image. You can also do end-to-end robotic control, like we saw with autonomous driving before. So what's an example of this? We've seen a significant impact in computer vision in medicine and healthcare over the last couple of years. Just a couple of weeks ago, actually, there was this paper that came out where deep learning models have been applied to breast cancer detection in mammogram images. What was shown here was that CNNs were able to significantly outperform expert radiologists in detecting breast cancer directly from these mammogram images. That's done by feeding these images through a convolutional feature extractor, outputting those learned features to dense layers, and then performing classification based on those dense layers.
Instead of predicting a single number, breast cancer or no breast cancer, you could also imagine that for every pixel in that image, you want to predict what the class of that pixel is. So here we're showing a picture of two cows on the left. Those are fed into a convolutional feature extractor. And then they're upscaled through the inverse convolutional decoder to predict, for every pixel in that image, what the class of that pixel is. So you can see that the network is able to correctly classify that it sees two cows in brown, whereas the grass is in green and the sky is in blue. And this is basically detection, but not for a single number over the image, yes or no, there's a cow in this image, but for every pixel: what is the class of this pixel? This is a much harder problem. And this output is actually created using these upsampling operations. So this is no longer a dense neural network here, but we have kind of inverse, or what are called transposed, convolutions, which scale back up our image data and allow us to predict these images as outputs and not just single numbers or single probability distributions. And of course this idea can be, you can imagine, very easily applied to many other applications in healthcare as well, especially for segmenting various types of cancers. Here we're showing brain tumors on the top, as well as parts of the blood that are infected with malaria on the bottom.
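To give a feel for how a transposed convolution scales a feature map back up, here is a minimal NumPy sketch. Each input value is multiplied by the kernel and added into a larger output, with the kernel placed `stride` pixels apart. The 2x2 input, the all-ones 2x2 kernel, and the stride of 2 are assumptions for illustration; in a segmentation network the kernel weights would be learned. The function name is hypothetical.

```python
import numpy as np


def transposed_conv2d(x, kernel, stride=2):
    """Upsample a 2D feature map by stamping a scaled copy of `kernel` per input value."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for r in range(h):
        for c in range(w):
            # Overlapping stamps (when k > stride) are summed together.
            out[r * stride:r * stride + k, c * stride:c * stride + k] += x[r, c] * kernel
    return out


features = np.array([[1.0, 2.0],
                     [3.0, 4.0]])
kernel = np.ones((2, 2))     # assumed kernel; learned in practice

upsampled = transposed_conv2d(features, kernel, stride=2)
print(upsampled.shape)       # (4, 4): twice the spatial resolution
print(upsampled.tolist())
# [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0], [3.0, 3.0, 4.0, 4.0], [3.0, 3.0, 4.0, 4.0]]
```

With this particular kernel the result is a plain nearest-neighbor upsampling. Other kernel weights would blend neighboring values differently.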

End-to-end self driving cars (32:53)

So let's see one final example before ending this lecture. Here we're going back to the example of self-driving cars, and the idea, again, is pretty similar. So let's say we want to learn a neural network to control a self-driving car and learn autonomous navigation. Specifically, our model starts from images of the road, maybe from a camera attached to the top of the car. You can think of the actual pixels coming from this camera being fed to the neural network. And in addition to the pixels coming from the camera, we also have these images from a bird's-eye street view of where the car roughly is in the world. And we can feed in both of those images. These are just two two-dimensional arrays. So this is one two-dimensional array of pixels, and this is another two-dimensional array of pixels. They represent different things: this one represents your perception of the world around you, and this one represents roughly where you are in the world globally. What we want to do with this is then to directly predict, or infer, a full distribution of possible control actions that the car could take at this instant. So if it doesn't have any goal destination in mind, it could say that it could take any of these three directions and steer in those directions. And that's what we want to predict with this network. One way to do this is to train your neural network to take as input these camera images coming from the car and pass each of them through these convolutional encoders, or feature extractors. Now that you've learned features for each of those images, you can concatenate all of them together. So now you have a global set of features across all of your sensor data, and you can learn your control outputs from those on the right-hand side.
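The encode-then-concatenate structure described above can be sketched in a few lines of NumPy. This is a simplified stand-in, not the system from the lecture: each "encoder" here is a single random linear map with a ReLU rather than a trained convolutional stack, the image sizes are arbitrary, and the distribution over control actions is approximated as a softmax over five hypothetical steering bins.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, weights):
    """Stand-in for a convolutional encoder: flatten, linear map, ReLU."""
    return np.maximum(weights @ image.ravel(), 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two inputs that represent different things (sizes are assumptions).
camera = rng.random((16, 16))    # perception of the road ahead
route_map = rng.random((8, 8))   # rough bird's-eye position in the world

# Untrained weights: one encoder per input, plus one control head.
w_camera = rng.normal(scale=0.1, size=(32, camera.size))
w_map = rng.normal(scale=0.1, size=(8, route_map.size))
w_control = rng.normal(scale=0.1, size=(5, 32 + 8))

# Encode each input separately, then concatenate into one feature vector.
fused = np.concatenate([encode(camera, w_camera), encode(route_map, w_map)])

# Control head: probabilities over 5 hypothetical discrete steering bins.
steering_probs = softmax(w_control @ fused)
```

Keeping a separate encoder per input lets each one specialize in its own kind of image, while the concatenated vector gives the control head access to all of the sensor data at once.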
Now again, this is done entirely end-to-end, right? So we never told the car what a lane marker was, what a road was, how to turn right or left, or what an intersection is. We never gave it any of that information, but it's able to learn all of this and extract those features from scratch, just by watching a lot of human driving data, and learn how to drive on its own. So here's an example of how a human can actually enter the car and input a desired destination, which you can see on the top right. The red line indicates where we want the car to go on the map. So think of this as like Google Maps. You plug into Google Maps where you want to go, and the end-to-end CNN, the convolutional neural network, will output the control commands, given what it sees on the road, to actually actuate that vehicle towards that destination. Note here that the vehicle is able to successfully navigate through those intersections, even though it has never driven in this area before, it has never seen these roads before, and we never even told it what an intersection was.


Summary (35:55)

It learned all of this from data using convolutional neural networks. Now, the impact of CNNs has been very wide-reaching, beyond these examples that I've given to you today. It has touched so many different fields of computer vision, ranging across robotics, medicine, and many other fields. I'd like to conclude by taking a look at what we've covered in today's lecture. We first considered the origins of computer vision, how images are represented as brightness values to a computer, and how these convolution operations work in practice. Then we discussed the basic architecture and how we could build up from convolution operations to convolutional layers, and from those to convolutional neural networks. And finally, we talked about the extensions and applications of convolutional neural networks, how we can visualize a little bit of their behavior, and how they can actually act on the real world, either by making predictions on medical scans or by actuating robots to interact with humans in the real world. And that's it for the CNN lecture on computer vision. Next up, we'll hear from Ava on deep generative modeling. Thank you.
