MIT 6.S191 (2020): Deep Generative Modeling
Transcription for the video titled "MIT 6.S191 (2020): Deep Generative Modeling".
So far in this class we've talked about how we can use neural networks to learn patterns within data. And so in this next lecture we're going to take this a step further and talk about how we can build systems that not only look for patterns in data, but actually can learn to generate brand new data based on this learned information. And this is a really new and emerging field within deep learning, and it's enjoying a lot of success and attention right now and in the past couple years especially. So this broadly can be considered this field of deep generative modeling. So I'd like to start by taking a quick poll. So study these three faces for a moment. Raise your hand if you think face A is real. OK. Raise your hand if you think face B is real. OK, so that roughly follows with the first vote. And finally face C. It doesn't really matter because all these faces are fake. And this is really recent work. This was just posted on arXiv in December of 2019, the results from this latest model. And today, by the end of this lecture, you're going to have a sense of how a deep neural network can be used to generate data that is so realistic that it fooled many of you, or rather, all of you. OK, so so far in this course, we've been considering this problem of supervised learning, which means we're given a set of data and a set of labels that go along with that data. And our goal is to learn a functional mapping that goes from data to labels. And this is a course on deep learning, so we've been largely talking about functional mappings that are described by deep neural networks. But at the core, these mappings could be anything.
Now we're going to turn our attention to a new class of problems, and that's what is called unsupervised learning, which is the topic of today's lecture. In unsupervised learning, we're given only data and no labels. And our goal is to try to understand the hidden or underlying structure that exists in this data. And this can help us get insights into what is the foundational level of explanatory factors behind this data, and as we'll see later, to even generate brand new examples that resemble the true data distribution. And so this is this topic of generative modeling, which is an example of unsupervised learning. And as I kind of alluded to, our goal here is to take input examples from some training distribution and to learn and infer a model that represents that distribution. And we can do this for two main goals, the first being this concept of density estimation, where we see a bunch of samples, and they lie along some probability distribution. And we want to learn a model that approximates the underlying distribution that's describing or generating where this data was drawn from. The other task is this idea of sample generation. In this context, we're given input samples and we want our model to be able to generate brand new samples that represent those inputs, and that's the idea with the fake faces that we showed in the first slide. And really, the core question behind generative modeling is how can we learn some probability distribution? How can we model some probability distribution that's similar to the true distribution that describes how the data was generated? So why care about generative modeling, besides the fact that it can be used to generate these realistic human-like faces, right?
Advancements In Unsupervised Learning
Why do we care? (04:37)
Well, first of all, generative approaches can uncover the underlying factors and features present in a data set. So for example, if we consider the problem of facial detection, we may be given a data set with many, many different faces and we may not know the exact distribution of faces in terms of different features like skin color or gender or clothing or occlusion or the orientation of the face. And so our training data may actually end up being very biased to particular instances of those features that are overrepresented in our data set without us even knowing it. And today, and in the lab as well, we'll see how we can use generative models to actually learn these underlying features and uncover the over and underrepresented parts of the data and use this information to actually create fairer, more representative data sets to train an unbiased classification model. Another great example is this question of outlier detection. So if we go back to the example problem of self-driving cars, most of the data that we may want to use to train a control network like the one Alexander was describing may be very common driving scenes, so on a freeway or on a straight road where you're driving. But it's really critical that our autonomous vehicle would be able to handle all cases that it could potentially encounter in the environment, including edge cases and rare events like crashes or pedestrians, not just the straight freeway driving that is the majority of the time on the road. And so generative models can be used to actually detect the outliers that exist within training distributions and use this to train a more robust model.
Latent variable models (06:36)
And so we'll really dive deeply into two classes of models that we call latent variable models. But first, before we get into those details, we have a big question. What is a latent variable? And I think a great example to describe the difference between latent and observed variables is this little parable and story from Plato's Republic from thousands of years ago, which is known as the myth of the cave. And in this myth, a group of prisoners are constrained to face a wall as punishment. And the only things that they can see and observe are the shadows of objects that pass in front of a fire that's actually behind them. So they're just observing the shadows in front of their faces. And so from the prisoners' perspective, these shadows are the observed variables. They can measure them. They can give them names, because to them, you know, that's their reality. They don't know that behind them, there are these true objects that are actually casting the shadows because of this fire. And so those objects that are behind the prisoners are like the latent variables. They're not directly observable by the prisoners, but they're the true explanatory factors that are casting the shadows that the prisoners see. And so our goal in generative modeling is to find ways of actually learning these hidden and underlying latent variables, even when we are only given the observed data.
OK, so let's start by discussing a very simple generative model, which tries to do this by encoding its input. And these models are known as autoencoders. So to take a look at how an autoencoder works, what is done is we first begin by feeding raw input data into the model. It's passed through a series of neural network layers. And what is outputted at the end of that encoding is what we call a low-dimensional latent space, which is a feature representation that we're trying to predict. And we call this network an encoder because it's mapping this data, x, into a vector of latent variables, z. Now, let's ask ourselves a question, right? Why do we care about having this low dimensional latent space, z? Any ideas? Yeah. It will be easier to process with further algorithms? Yeah, it's easier to process, and I think the key that you're getting at is that it's a compressed representation of the data. And in the case of images, right, these are pixel-based data, they can be very, very, very highly dimensional. And what we want to do is to take that high dimensional information and encode it into a compressed, smaller latent vector. So how do we actually train a network to learn this latent variable vector z? Do we have training data for z? Do we observe the true values of z? And can we do supervised learning? The answer is no, right? We don't have training data for those latent variables. But we can get around this by building a decoder structure that is used to reconstruct a resemblance of the original image from this learned latent space. And again, this decoder is a series of neural network layers, which can be convolutional layers or fully connected layers, that are now taking that hidden latent vector and mapping it back to the dimensionality of the input space. And we call this reconstructed output x hat, since it's going to be some imperfect reconstruction or estimation of what the input x looks like. 
And the way we can train a network like this is by simply considering the input x and our reconstructed output x hat and looking at how they are different. And we want to try to minimize the distance between the reconstruction and the input to try to get as realistic of a reconstruction as possible. And so in the case of this image example, we can simply take the mean squared error, right, x minus x hat, squared, from the input to the outputted reconstructions. And so the really important thing here is that this loss function doesn't have any labels, right? The only components of the loss are input x and the reconstructions x hat. It's not supervised in any sense. And so we can simplify this diagram a little bit further by abstracting away those individual layers in the encoder and the decoder. And this idea of unsupervised learning is really a powerful idea because it allows us to, in a way, quantify these latent variables that we're interested in but we can't directly observe. And so a key consideration when building a model like an autoencoder is how we select the dimensionality of our latent space. And the latent space that we're trying to learn presents this sort of bottleneck layer, because it's a form of compression. And so the lower the dimensionality of the latent space that we choose, the poorer the quality of the reconstruction that's generated in the end. And so in this example, this is a very famous data set of handwritten digits called MNIST, and on the right you can see the ground truth for example digits from this data set. And as you can hopefully appreciate in these images, by going just from a 2D latent space to a 5D latent space, we can drastically improve the quality of the reconstructions that are produced as output by an autoencoder structure. Okay, so to summarize, autoencoders are using this bottlenecking hidden layer that forces the network to learn a compressed latent representation of the data.
And by using this reconstruction loss, we can train the network in a completely unsupervised manner, and the name autoencoder comes from the fact that we're automatically encoding information within the data into this smaller latent space.
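To make the reconstruction loss and the bottleneck effect concrete, here is a minimal sketch in plain Python. The `encode` and `decode` functions are toy stand-ins (an assumption for illustration, not learned networks): the "encoder" simply keeps the first k values and the "decoder" pads with zeros. The input vector is made up. The point is only that the loss uses x and x hat and no labels, and that a smaller latent dimension tends to give a worse reconstruction.

```python
def encode(x, k):
    """Toy 'encoder': keep only the first k values as the latent vector z."""
    return x[:k]


def decode(z, n):
    """Toy 'decoder': pad z with zeros back to the input dimensionality n."""
    return z + [0.0] * (n - len(z))


def mse(x, x_hat):
    """Mean squared error between input x and reconstruction x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


x = [0.9, 0.1, 0.4, 0.7, 0.3, 0.8]  # a made-up 6-dimensional input

for k in (2, 5):
    x_hat = decode(encode(x, k), len(x))
    print("latent size", k, "reconstruction loss", mse(x, x_hat))
```

In a real autoencoder both mappings would be neural networks whose weights are adjusted to minimize this same loss.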
Variational autoencoders (13:30)
So we will now see how we can build on this foundational idea a bit more with this concept of variational autoencoders, or VAEs. And so with a traditional autoencoder, what is done, as you can see here, is going from input to a reconstructed output. And so if I feed in an input to the network, I will get the same output so long as the weights are the same. This is a deterministic encoding that allows us to reproduce the input as best as we can. But if we want to learn a more smooth representation of the latent space and use this to actually generate new images and sample new images that are similar to the input data set, we can use a structure called a variational autoencoder to more robustly do this. And this is a slight twist on the traditional autoencoder. And what's done is that instead of a deterministic bottleneck layer, z, we've replaced this deterministic layer with a stochastic sampling operation. And that means that instead of learning the latent variables directly, for each variable we learn a mean mu and a standard deviation sigma that actually parameterize a probability distribution for each of those latent variables. So now we've gone from a vector of latent variables z to learning a vector of means mu and a vector of standard deviations sigma, which describe the probability distributions associated with each of these latent variables. And what we can do is sample from these described distributions to obtain a probabilistic representation of our latent space. And so as you can tell, and as I've emphasized, the VAE structure is just an autoencoder with this probabilistic twist. So now what this means is instead of deterministic functions that describe the encoder and decoder, both of these components of the network are probabilistic in nature.
And what the encoder actually does is it computes this probability distribution, p of phi, of the latent space, z, given an input, x, while the decoder is doing sort of the reverse inference: it's computing a new probability distribution, q of theta, of x given the latent variables, z. And because we've introduced this probabilistic aspect to this network, our loss function has also slightly changed. The reconstruction loss in this case is basically exactly like what we saw with the traditional autoencoder. The reconstruction loss is capturing the pixel-wise difference between our input and the reconstructed output. And so this is a metric of how well our network is doing at generating outputs that are similar to the input. Then we have this other term, which is the regularization term, which gets back to that earlier question. And so because the VAE is learning these probability distributions, we want to place some constraints on how this probability distribution is computed and what that probability distribution resembles as a part of regularizing and training our network. And so the way that's done is by placing a prior on the latent distribution. And that's p of z. And so that's some initial hypothesis or guess about what the distribution of z looks like. And so this essentially helps enforce the learned z's to follow the shape of that prior distribution. And the reason that we do this is to help the network not overfit, right? Because without this regularization, it may overfit on certain parts of the latent space, but if we enforce that each latent variable adopts something similar to this prior, it helps smooth out the landscape of the latent space and the learned distributions. And so this regularization term is a function that captures the divergence between the inferred latent distribution and this fixed prior that we've placed.
So as I mentioned, a common choice for this prior distribution is a normal Gaussian, which means that we center it with a mean of 0 and a standard deviation of 1. And as the great question pointed out, what this enables us to do is derive some very nice properties about the optimal bounds of how well our network can do. And by choosing this normal Gaussian prior, what is done is the encoder is encouraged to distribute the latent variables evenly around the center of this latent space, distributing the encoding smoothly, and actually the network will learn to penalize itself when it tries to cheat and cluster points outside sort of the smooth Gaussian distribution, as would be the case if it was overfitting or trying to memorize particular instances of the data. And what also can be derived in the instance of when we choose a normal Gaussian as our prior is this specific distance function, which is formulated here. And this is called the KL divergence, the Kullback-Leibler divergence. And this is specifically in the case when the prior distribution is a 0, 1 Gaussian. The divergence that measures the separation between our inferred latent distribution and this prior takes this particular form. OK, yeah. So to re-emphasize, this term is the regularization term that's used in the formulation of the total loss.
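The slide formula is not captured in the transcript, so as a hedged sketch: for a diagonal Gaussian with means mu and standard deviations sigma, the KL divergence to a 0, 1 Gaussian has the standard closed form 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2). The code below computes it and combines it with a mean squared error reconstruction term. The function names and the `kl_weight` parameter are assumptions for illustration, not names from the lecture.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def vae_loss(x, x_hat, mu, sigma, kl_weight=1.0):
    """Reconstruction term plus a weighted regularization (KL) term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return reconstruction + kl_weight * kl_to_standard_normal(mu, sigma)


# The penalty is zero when the inferred distribution equals the prior
# and grows as mu moves away from 0 or sigma moves away from 1.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
print(kl_to_standard_normal([2.0, 0.0], [0.5, 1.0]))
```

The weighting between the two terms is a design choice; a weight of 1.0 is assumed here only as a default.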
Reparameterization trick (20:18)
So now that we've seen a bit about the reconstruction loss and a little more detail into how the regularization term works, we can discuss how we can actually train the network end-to-end using backpropagation. And what immediately jumps out as an issue is that z here is the result of a stochastic sampling operation. And we cannot backpropagate gradients through a sampling layer because of its stochastic nature, because backpropagation requires deterministic nodes to be able to iteratively pass gradients through and apply the chain rule. But what was a really breakthrough idea that enabled VAEs to really take off was to use this little trick called the reparameterization trick to reparameterize the sampling layer such that the network can now be trained end-to-end. And I'll give you the key idea about how this operation works. Instead of drawing z directly from a normal distribution that's parametrized by mu and sigma, which doesn't allow us to compute gradients, what we can do is consider the sampled latent vector z as the sum of a fixed vector mu and a fixed vector sigma, where this sigma vector is scaled by a random constant that is drawn from a prior distribution, for example from a normal Gaussian. And the key idea here is that we still have a stochastic node, but since we've done this reparameterization with this factor epsilon, which is drawn from a normal distribution, this stochastic sampling does not occur directly in the bottleneck layer z. We've reparameterized where that sampling is occurring. And a visualization that I think really helps drive this point home is as follows. So originally, if we were to not perform this reparameterization, our flow looks a little bit like this, where we have deterministic nodes shown in blue, the weights of the network, as well as our input vector. And we have this stochastic node, z, that we're trying to sample from.
And as we saw, we can't do backpropagation because we're going to get stuck at this stochastic sampling node when the network is parametrized like this. Instead, when we reparameterize, we get this diagram, where we've now diverted the sampling operation off to the side to this stochastic node epsilon, which is drawn from a normal distribution, and now the latent variables z are deterministic with respect to epsilon, the sampling operation. And so this means that we can backpropagate through z without running into errors that arise from having stochasticity. And so this is a really powerful trick because this simple reparameterization is what actually allows for VAEs to be trained end-to-end.
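The reparameterization can be written in a few lines. This is a minimal sketch in plain Python, with made-up values for mu and sigma and a hypothetical helper name `sample_z`: the randomness lives entirely in epsilon, and once epsilon is drawn, z is an ordinary deterministic function of mu and sigma, so its derivatives are well defined.

```python
import random


def sample_z(mu, sigma, rng):
    """Return z = mu + sigma * epsilon with epsilon ~ N(0, 1), plus epsilon."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


rng = random.Random(0)               # seeded so the run is repeatable
mu, sigma = [0.5, -1.0], [0.1, 2.0]  # made-up encoder outputs
z, eps = sample_z(mu, sigma, rng)

# With epsilon held fixed, gradients flow through mu and sigma:
# dz/dmu = 1 and dz/dsigma = epsilon for each latent dimension.
grad_mu = [1.0 for _ in mu]
grad_sigma = list(eps)
print(z, grad_mu, grad_sigma)
```

In a deep-learning framework the same expression is written with tensors, and automatic differentiation supplies these gradients.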
Latent perturbation (23:55)
Okay, and so what do these latent variables actually look like and what do they mean? Because we impose these distributional priors on the latent variables, we can sample from them and actually fix all but one latent variable and slowly tune the value of that latent variable to get an interpretation of what the network is learning. And what is done is after that value of that latent variable is tuned, you can run the decoder to generate a reconstructed output. And what you'll see now in the example of these faces is that the output that results from tuning a single latent variable has a very clear and distinct semantic meaning. So for example, this is differences in the pose or the orientation of a face. And so to really re-emphasize here, the network is actually learning these different latent variables in a totally unsupervised way, and by perturbing the value of a single latent variable, we can interpret what they actually mean and what they actually represent. And so ideally, right, because we're learning this compressed latent space, what we would want is for each of our latent variables to be independent and uncorrelated with each other, to really learn the richest and most compact representation possible. So here's the same example as before, again with faces. And now we're walking along two axes, which we can interpret as pose or orientation on the x-axis. And maybe you can tell the smile on the y-axis. And to re-emphasize, these are reconstructed images, but they're reconstructed by keeping all other variables fixed except these two, and then tuning the value of those latent features. And this is this idea of disentanglement, by trying to encourage the network to learn variables that are as independent and uncorrelated with each other as possible.
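The perturbation procedure described above can be sketched as follows. The `toy_decoder` is a hypothetical stand-in for a trained decoder network, and the latent vector and sweep values are made up; only the traversal logic (hold every latent variable fixed except one, then decode each variant) reflects the text.

```python
def latent_traversal(z, index, values):
    """Copies of z in which only z[index] is replaced by each value in turn."""
    sweep = []
    for v in values:
        z_new = list(z)  # copy so the original latent vector is untouched
        z_new[index] = v
        sweep.append(z_new)
    return sweep


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder network."""
    return [2.0 * z[0], z[1] + z[2]]


z = [0.0, 0.3, -0.2]                  # made-up latent vector
values = [-2.0, -1.0, 0.0, 1.0, 2.0]  # sweep for latent variable 0
outputs = [toy_decoder(zi) for zi in latent_traversal(z, 0, values)]

for v, out in zip(values, outputs):
    print(v, out)
```

With a real decoder, each output would be an image, and laying the images side by side shows what the swept variable controls.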
Debiasing with VAEs (26:12)
And so to motivate the use of VAEs and generative models a bit further, let's go back to this example that I showed from the beginning of lecture, the question of facial detection. And as I kind of mentioned, right, if we're given a data set with many different faces, we may not know the exact distribution of these faces with respect to different features like skin color. And why this could be problematic is because imbalances in the training data can result in unwanted algorithmic bias. So for example, the faces on the left are quite homogenous, right? And a standard classifier that's trained on a data set that contains a lot of these types of examples will be better suited at recognizing and classifying those faces that have features similar to those shown on the left. And so this can generate unwanted bias in our classifier. And we can actually use a generative model to learn the latent variables present in the data set and automatically discover which parts of the feature space are underrepresented or overrepresented. And since this is the topic of today's lab, I want to spend a bit of time now going through how this approach actually works. And so what is done is a VAE network is used to learn the underlying features of a training data set, in this case images of faces, in an unbiased and unsupervised manner without any annotation. And so here we're showing an example of one such learned latent variable, the orientation of the face. And again, right, we never told the network that orientation was important. It learned this by looking at a lot of examples of faces and deciding, right, that this was an important factor. And so from these latent distributions that are learned, we can estimate the probability distribution of each of the learned latent variables. And certain instances of those variables may be overrepresented in our data set, like homogenous skin color or pose.
And certain instances may have lower probability and fall sort of on the tails of these distributions. And if our data set has many images of a certain skin color that are overrepresented, the likelihood of selecting an image with those features during training will be unfairly high. That can result in unwanted bias. And similarly, these faces with rarer features, like shadows or darker skin or glasses, may be underrepresented in the data, and so the likelihood of selecting them during sampling will be low. And the way this algorithm works is to use these inferred distributions to adaptively resample data during training. And this is used to generate a more balanced and more fair training data set that can then be fed into the network to ultimately result in an unbiased classifier. And this is exactly what you'll explore in today's lab. OK, so to reiterate and summarize some of the key points of VAEs, these VAEs encode a compressed representation of the world by learning these underlying latent features. And from this representation, we can sample to generate reconstructions of the input data in an unsupervised fashion. We can use the reparameterization trick to train our networks end to end, and use this perturbation approach to interpret and understand the meaning behind each of these hidden latent variables. OK, so VAEs are looking at this question of probability density estimation as their core problem. What if we just are concerned with generating new samples that are as realistic as possible as the output?
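One way the adaptive resampling idea might look in code is sketched below. The lecture does not spell out the procedure, so everything specific here is an assumption: a single latent variable, a simple histogram as the density estimate, and a smoothing constant `alpha`. Each example gets a sampling probability inversely related to how crowded its region of latent space is.

```python
def resampling_probabilities(latents, num_bins=5, alpha=0.01):
    """Sampling probability per example, inversely related to how densely
    its latent value is populated in the data set."""
    lo, hi = min(latents), max(latents)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all values match
    # Bin index for each example (the maximum value goes in the last bin).
    bins = [min(int((v - lo) / width), num_bins - 1) for v in latents]
    counts = [0] * num_bins
    for b in bins:
        counts[b] += 1
    density = [c / len(latents) for c in counts]
    # Rare regions get large weights; alpha keeps the weights bounded.
    weights = [1.0 / (density[b] + alpha) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]


# Six examples clustered near 0 and one rare example at 3.0 (made-up values).
latents = [0.0, 0.1, 0.05, 0.12, 0.08, 0.02, 3.0]
probs = resampling_probabilities(latents)
print(probs)
```

Drawing training batches according to these probabilities would show the rare example more often than uniform sampling does, which is the balancing effect described above.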
Generative adversarial networks (30:40)
And for that, we'll transition to a new type of generative model called a generative adversarial network, or GAN. And so the idea here is we don't want to explicitly model the density or the distribution underlying some data, but instead just generate new instances that are similar to the data that we've seen. Which means we want to try to sample from a really, really complex distribution, which we may not be able to learn directly in an efficient manner. And so the approach that GANs take is really simple. They have a generator network, which just starts from random noise, and this generator network is trained to learn a transformation going from that noise to the training distribution. And our goal is we want this generated fake sample to be as close to the real data as possible. And so the breakthrough to really achieving this was this GAN structure, where we have two neural networks, one we call a generator and one we call a discriminator, that are competing against each other; they're adversaries. Specifically, the goal of the generator is to go from noise to produce imitations of data that are as close to real as possible. Then the discriminator network takes both the fake data as well as true data and learns how to classify the fake from the real, to distinguish between the fake and the real. And by having these two networks competing against each other, we can force the discriminator to become as good as possible at distinguishing fake and real. And the better it becomes at doing that, the better and better the generator will become at generating new samples that are as close to real as possible to try to fool the discriminator.
Intuitions behind GANs (32:40)
So to get more intuition behind this, let's break this down into a simple example. So the generator is starting off from noise, and it's drawing from that noise to produce some fake data, which we're just representing here as points on a 1D line. The discriminator then sees these points, and it also sees some real data. And over the course of the training of the discriminator, it's learning to output some probabilities that a particular data point is fake or real. And in the beginning, if it's not trained, its predictions may not be all that great, but then you can train the discriminator and it starts increasing the probabilities of what is real, decreasing the probabilities of what is fake, until you get this perfect separation where the discriminator is able to distinguish real from fake. And now the generator comes back and it sees where the real data lie. And once it sees this, it starts moving the fake data closer and closer to the real data. And it goes back to the discriminator that receives these new points, estimates the probability that each point is real, learns to decrease the probability of the fake points maybe a little bit, and continues to adjust. And now the cycle repeats again. One last time, the generator sees the real examples, and it starts moving these fake points closer and closer to the real data, such that the fake data is almost following the distribution of the real data. And eventually, it's going to be very hard for the discriminator to be able to distinguish between what's real, what's fake, while the generator is going to continue to try to create fake data points to fool the discriminator. And with this example, what I'm hoping to convey is really sort of the core intuition behind the approach, not necessarily the detailed specifics of how these networks are actually trained. OK. And so you can think of this as an adversarial competition between these two networks, the generator and the discriminator. 
And what we can do, after training, is use the trained generator network to sample and create new data that's not been seen before.
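The lecture deliberately leaves out the training details, so the following is only a sketch of one common way to express the competition as loss functions: binary cross-entropy for the discriminator and the so-called non-saturating loss for the generator. This formulation is an assumption, not something stated in the talk, and the probability values are made up.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating form: the generator wants D(fake) close to 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs: probability that each sample is real.
early = discriminator_loss(d_real=[0.6, 0.5], d_fake=[0.5, 0.4])
later = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
print(early, later)

# The generator's loss is low only when its fakes fool the discriminator.
print(generator_loss([0.1, 0.1]), generator_loss([0.9, 0.9]))
```

Training alternates between lowering the first loss by updating the discriminator and lowering the second by updating the generator.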
GANs: Recent advances (35:12)
And so to look at examples of what we can achieve with this approach, the examples that I showed at the beginning of the lecture were generated by using this idea of progressively growing GANs to iteratively build more detailed image generations. And the way this works is that the generator and the discriminator start by having very low spatial resolution, and as training progresses, more and more layers are incrementally added to each of the two networks to increase the spatial resolution of the outputted generation images. And this is good because it allows for stable and robust training and generates outputs that are quite realistic. And so here are some more examples of fake celebrity faces that were generated using this approach. Another idea involves unpaired image to image translation or style transfer, which uses a network called CycleGAN. And here we're taking a bunch of images in one domain, for example the horse domain, and without having the corresponding image in another domain, we want to take the input image, generate an image in a new style that follows the distribution of that new style. So this is essentially transferring the style of one domain from a second. And this works back and forth, right? And so the way this CycleGAN is trained is by using a cyclic loss term, where if we go from domain X to domain Y, we can then take the result and go back from domain Y back to domain X, and we have two generators and two discriminators that are working at this at the same time. So maybe you'll notice in this example of going from horse to zebra that the network has not only learned how to transform the skin of the horse from brown to the stripes of a zebra, but it's also changed a bit about the background in the scene. It's learned that zebras are more likely to be found in maybe the savanna grasslands, so the grass is browner and maybe more savanna-like in the zebra example compared to the horse. 
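The cyclic loss term can be illustrated with a small sketch. The two "generators" below are toy functions standing in for neural networks, and the use of a mean absolute (L1) distance is an assumption for this example. The idea shown is the round trip: map x from domain X to domain Y, map the result back, and penalize the difference from the original x.

```python
def cycle_consistency_loss(x_batch, g_xy, g_yx):
    """Mean absolute error between each x and its round trip g_yx(g_xy(x))."""
    total, count = 0.0, 0
    for x in x_batch:
        x_back = g_yx(g_xy(x))
        total += sum(abs(a - b) for a, b in zip(x, x_back))
        count += len(x)
    return total / count


# Toy stand-ins for the two generators (real ones are neural networks).
def to_y(x):
    return [2.0 * v + 1.0 for v in x]


def to_x(y):
    return [(v - 1.0) / 2.0 for v in y]


def bad_to_x(y):
    return [v / 2.0 for v in y]


batch = [[0.5, 1.0], [-2.0, 4.0]]
print(cycle_consistency_loss(batch, to_y, to_x))      # consistent round trip
print(cycle_consistency_loss(batch, to_y, bad_to_x))  # inconsistent round trip
```

A full CycleGAN objective would add the adversarial losses from the two discriminators and the symmetric cycle term starting from domain Y.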
And what we actually did is to use this approach of CycleGAN to synthesize speech in someone else's voice. And so what we can do is we can use a bunch of audio recordings in one voice and audio recordings in another voice and build a CycleGAN to learn to transform representations of that one voice to make them appear like they are representations from another voice. So what can be done is to take an audio waveform, convert it into an image representation, which is called a spectrogram, and you see that image on the bottom, and then train a CycleGAN to perform the transfer from one domain to the next. And this is actually exactly how we did the speech transformation for yesterday's demonstration of Obama's introduction to the course. And so we're showing you what happened under the hood here. And what we did is we took original audio of Alexander saying that script that was played yesterday and took the audio waveforms, converted them into the spectrogram images, and then trained a CycleGAN using this information and audio recordings of Obama's voice to transfer the style of Obama's voice onto our script. So. Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. And so, yeah, on the left, right, that was Alexander's original audio spectrogram. And the spectrogram on the right was what was generated by the CycleGAN in the style of Obama's voice.
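For readers unfamiliar with spectrograms, here is a minimal sketch of the waveform-to-image step using a naive short-time Fourier transform in plain Python. The frame size, hop length, and test signal are arbitrary choices for illustration, and this is not the pipeline used for the demonstration; practical code would use an FFT library and a window function.

```python
import cmath
import math


def spectrogram(signal, frame_size=8, hop=4):
    """Magnitudes of a naive DFT over overlapping frames (time x frequency)."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        mags = []
        for k in range(frame_size // 2 + 1):  # non-negative frequency bins
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames


# A pure tone that completes 2 cycles every 8 samples.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = spectrogram(signal)

# Each row is one time frame; the energy concentrates in frequency bin 2.
for row in spec:
    print([round(m, 2) for m in row])
```

Treating the resulting time-by-frequency grid as an image is what lets an image-to-image model operate on audio.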
OK, so to summarize, today we've covered a lot of ground on autoencoders and variational autoencoders and generative adversarial networks. And hopefully this discussion of these approaches gives you a sense of how we can use deep learning to not only learn patterns in data, but to use this information in a rich way to achieve generative modeling. And I really appreciate the great questions and discussions. And all of us are happy to continue that dialogue during the lab session. And so our lab today is going to focus on computer vision. And as Alexander mentioned, there is another corresponding competition for Lab 2. And we encourage you all to stick around if you wish to ask us questions. And thank you again. Thank you. Thank you.