MIT 6.S191 (2021): Deep Generative Modeling
Transcription for the video titled "MIT 6.S191 (2021): Deep Generative Modeling".
Hi everyone and welcome to lecture 4 of MIT 6.S191. In today's lecture we're going to be talking about how we can use deep learning and neural networks to build systems that not only look for patterns in data but can actually go a step beyond this to generate brand new synthetic examples based on those learned patterns. And this I think is an incredibly powerful idea, and it's a particular subfield of deep learning that has enjoyed a lot of success and gotten a lot of interest in the past couple of years, but I think there's still tremendous potential for this field of deep generative modeling in the years to come, particularly as we see these types of models and the types of problems that they tackle becoming more and more relevant in a variety of application areas. All right, so to get started, I'd like to consider a quick question for each of you. Here we have three photos of faces. And I want you all to take a moment, look at these faces, study them, and think about which of these faces you think is real. Is it the face on the left? Is it the face in the center? Is it the face on the right? Which of these is real? Well, in truth, none of these faces is real. They are all fake. These are all images that were synthetically generated by a deep neural network. None of these people actually exist in the real world. And hopefully, I think, you all have appreciated the realism of each of these synthetic images, and this to me highlights the incredible power of deep generative modeling. And not only does it highlight the power of these types of algorithms and these types of models, but it raises a lot of questions about how we can consider the fair use and the ethical use of such algorithms as they are being deployed in the real world.
So by setting this up and motivating it in this way, I'd now like to take a step back and consider fundamentally what is the type of learning that can occur when we are training neural networks to perform tasks such as these. So far in this course, we've been considering what we call supervised learning problems: instances in which we are given a set of data and a set of labels associated with that data. And our goal is to learn a functional mapping that moves from data to labels. And those labels can be class labels or continuous values. And in this course, we've been concerned primarily with developing these functional mappings that can be described by deep neural networks. But at their core, these mappings could be anything, any sort of statistical function. The topic of today's lecture is going to focus on what we call unsupervised learning, which is a new class of learning problems. And in contrast to supervised settings, where we're given data and labels, in unsupervised learning we're given only data, no labels. And our goal is to train a machine learning or deep learning model to understand or build up a representation of the hidden and underlying structure in that data. And what this can do is allow sort of an insight into the foundational structure of the data, and then in turn we can use this understanding to actually generate synthetic examples. And unsupervised learning beyond this domain of deep generative modeling also extends to other types of problems and example applications which you may be familiar with, such as clustering algorithms or dimensionality reduction algorithms. Generative modeling is one example of unsupervised learning. And our goal in this case is to take as input examples from a training set and learn a model that represents the distribution of the data that is input to that model. And this can be achieved in two principal ways.
The first is through what is called density estimation, where let's say we are given a set of data samples and they fall according to some density. The task for building a deep generative model applied to these samples is to learn the underlying probability density function that describes how and where these data fall along this distribution. And we can not only just estimate the density of such a probability density function, but actually use this information to generate new synthetic samples. Where again, we are considering some input examples that fall and are drawn from some training data distribution. And after building up a model using that data, our goal is now to generate synthetic examples that can be described as falling within the data distribution modeled by our model. So the key idea in both these instances is this question of how can we learn a probability distribution using our model, which we call p model of x, that is so similar to the true data distribution, which we call p data of x. This will not only enable us to effectively estimate these probability density functions, but also generate new synthetic samples that are realistic and match the distribution of the data we're considering.
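As a rough illustration of the two goals just described, density estimation and sample generation, here is a minimal sketch in plain Python. It assumes the simplest possible p model of x, a one-dimensional Gaussian, and the training data and helper names (`fit_gaussian`, `density`, `sample`) are hypothetical and not from the lecture. A deep generative model replaces this hand-picked Gaussian with a learned neural network, but the two steps are the same.

```python
import math
import random
import statistics

def fit_gaussian(samples):
    """Density estimation: fit the mean and standard deviation of p_model."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma

def density(x, mu, sigma):
    """Evaluate the fitted Gaussian density p_model(x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def sample(mu, sigma, n, rng):
    """Sample generation: draw n new synthetic points from p_model."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
# Hypothetical training set standing in for draws from an unknown p_data.
data = [rng.gauss(3.0, 0.5) for _ in range(5000)]

mu, sigma = fit_gaussian(data)          # learn p_model from the data
synthetic = sample(mu, sigma, 5, rng)   # generate new examples from p_model
print(mu, sigma, synthetic)
```

If the fit is good, the synthetic points should be statistically hard to tell apart from the training points, which is the sense in which p model of x is "similar to" p data of x.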
Generative Models And Their Applications
Why care about generative models? (06:03)
So this, I think, summarizes concretely what are the key principles behind generative modeling. But to understand how generative modeling may be informative and also impactful, let's take this idea a step further and consider what could be potential impactful applications and real world use cases of generative modeling. What generative models enable us as the users to do is to automatically uncover the underlying structure and features in a data set. The reason this can be really important and really powerful is often we do not know how those features are distributed within a particular data set of interest. So let's say we're trying to build up a facial detection classifier and we're given a data set of faces for which we may not know the exact distribution of these faces with respect to key features like skin tone or pose or clothing items. Without going through our data set and manually inspecting each of these instances, our training data may actually be very biased with respect to some of these features without us even knowing it. And as you'll see in this lecture and in today's lab, what we can actually do is train generative models that can automatically learn the landscape of the features in a data set like these, like that of faces, and by doing so actually uncover the regions of the training distribution that are underrepresented and over-represented with respect to particular features such as skin tone. And the reason why this is so powerful is we can actually now use this information to actually adjust how the data is sampled during training to ultimately build up a more fair and more representative data set that then will lead to a more fair and unbiased model. And you'll get practice doing exactly this and implementing this idea in today's lab exercise. Another great example and use case where generative models are exceptionally powerful is this broad class of problems that can be considered outlier or anomaly detection. 
One example is in the case of self-driving cars, where it's going to be really critical to ensure that an autonomous vehicle governed and operated by a deep neural network is able to handle all of the cases that it may encounter on the road, not just, you know, the straight freeway driving that is going to be the majority of the training data and the majority of the time the car experiences on the road. So generative models can actually be used to detect outliers within training distributions and use this to, again, improve the training process so that the resulting model can be better equipped to handle these edge cases and rare events.
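The idea from this section of adjusting how data is sampled, so that underrepresented regions of the training distribution are seen more often, can be sketched as follows. This is one possible scheme under stated assumptions, not necessarily what the lab implements: it assumes each training example has already been assigned to a discrete bin of some learned feature, and the function name `resampling_weights` and the `smoothing` parameter are hypothetical.

```python
from collections import Counter

def resampling_weights(feature_bins, smoothing=1.0):
    """Per-example sampling probabilities, inversely proportional to how
    common each example's feature bin is in the data set."""
    counts = Counter(feature_bins)
    # Rare bins get large raw weights; smoothing avoids extreme values.
    raw = [1.0 / (counts[b] + smoothing) for b in feature_bins]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical data set: 8 examples fall in bin "a", only 2 in bin "b".
bins = ["a"] * 8 + ["b"] * 2
probs = resampling_weights(bins)
print(probs[0], probs[-1])
```

Examples from the rare bin receive a higher sampling probability than examples from the common bin, so a training loop that draws batches according to `probs` sees a more balanced mix of the feature.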
Latent variable models (08:56)
Alright, so hopefully that motivates why and how generative models can be exceptionally powerful and useful for a variety of real-world applications. To dive into the bulk of the technical content for today's lecture, we're going to discuss two classes of what we call latent variable models. Specifically, we'll look at autoencoders and generative adversarial networks, or GANs. But before we get into that, I'd like to first begin by discussing why these are called latent variable models and what we actually mean when we use this word latent. And to do so, I think really the best example that I've personally come across for understanding what a latent variable is, is this story that is from Plato's work, The Republic. And this story is called the myth of the cave or the parable of the cave. And the story is as follows. In this myth, there are a group of prisoners and these prisoners are constrained as part of their prison punishment to face a wall. And the only things that they can see on this wall are the shadows of particular objects that are being passed in front of a fire that's behind them, so behind their heads and out of their line of sight. And the prisoners, the only thing they're really observing are these shadows on the wall. And so to them that's what they can see, that's what they can measure, and that's what they can give names to. That's really their reality. These are their observed variables. But they can't actually directly observe or measure the physical objects themselves that are actually casting these shadows. So those objects are effectively what we can analyze like latent variables. They're the variables that are not directly observable, but they're the true explanatory factors that are creating the observable variables which in this case the prisoners are seeing, like the shadows cast on the wall. 
And so our question in generative modeling broadly is to find ways of actually learning these underlying and hidden latent variables in the data, even when we're only given the observations that are made. And this is an extremely, extremely complex problem that is very well suited to learning by neural networks because of their power to handle multi-dimensional data sets and to learn combinations of nonlinear functions that can approximate really complex data distributions.
Alright, so we'll first begin by discussing a simple and foundational generative model which tries to build up this latent variable representation by actually self-encoding the input. And these models are known as autoencoders. An autoencoder is an approach for learning a lower dimensional latent space from raw data. To understand how it works, what we do is we feed in as input raw data. For example, this image of a 2 that's going to be passed through many successive deep neural network layers. And at the output of that succession of neural network layers, what we're going to generate is a low dimensional latent space, a feature representation. And that's really the goal that we're trying to predict. And so we can call this portion of the network an encoder, since it's mapping the data, x, into an encoded vector of latent variables, z. So let's consider this latent space, z. If you've noticed, I've represented z as having a smaller size, a smaller dimensionality than the input x. Why would it be important to ensure the low dimensionality of this latent space, z? Having a low dimensional latent space means that we are able to compress the data, which in the case of image data can be on the order of many, many, many dimensions. We can compress the data into a small latent vector, where we can learn a very compact and rich feature representation. So how can we actually train this model? Are we going to be able to supervise for the particular latent variables that we're interested in? Well, remember that this is an unsupervised problem where we have training data, but no labels for the latent space z. So in order to actually train such a model, what we can do is build up a decoder network that is used to actually reconstruct the original image starting from this lower dimensional latent space.
And again this decoder portion of our autoencoder network is going to be a series of neural network layers, like convolutional layers, that's going to then take this hidden latent vector and map it back up to the input space. And we call our reconstructed output x hat because it's our prediction and it's an imperfect reconstruction of our input x. And the way that we can actually train this network is by looking at the original input x and our reconstructed output x hat and simply comparing the two and minimizing the distance between these two images. So for example we could consider the mean squared error, which in the case of images means effectively subtracting one image from another and squaring the difference, right, which is effectively the pixel-wise difference between the input and reconstruction, measuring how faithful our reconstruction is to the original input. And again, notice that by using this reconstruction loss, this difference between the reconstructed output and our original input, we do not require any labels for our data beyond the data itself, right? So we can simplify this diagram just a little bit by abstracting away these individual layers in the encoder and decoder components. And again, note that this loss function does not require any labels. It is just using the raw data to supervise itself on the output. And this is a truly powerful idea and a transformative idea, because it enables the model to learn a quantity, the latent variables z, that we're fundamentally interested in but we cannot simply observe or cannot readily model. And when we constrain this latent space to a lower dimensionality, that affects the degree to which and the faithfulness with which we can actually reconstruct the input. And the way you can think of this is as imposing a sort of information bottleneck during the model's training and learning process. And effectively what this bottleneck does is it's a form of compression, right?
We're taking the input data, compressing it down to a much smaller latent space, and then building back up a reconstruction. And in practice, what this results in is that the lower the dimensionality of your latent space, the poorer the quality of the reconstruction you're going to get out. All right. So in summary, these autoencoder structures use this sort of bottlenecking hidden layer to learn a compressed latent representation of the data, and we can self-supervise the training of this network by using what we call a reconstruction loss that forces the autoencoder network to encode as much information about the data as possible into a lower dimensional latent space while still being able to build up faithful reconstructions. So the way I like to think of this is automatically encoding information from the data into a lower dimensional latent space.
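To make the encoder, bottleneck, decoder, and reconstruction loss concrete, here is a deliberately tiny sketch in plain Python. It is an assumption-laden toy, not the network from the lecture: the "images" are hypothetical 2-D points, the encoder and decoder are single linear maps rather than deep convolutional layers, the latent z is one number, and training is plain gradient descent on the mean squared error between x and x hat. All names (`encode`, `decode`, `train`) are made up for illustration.

```python
def encode(x, e):
    """Encoder: compress a 2-D input x into a 1-D latent z."""
    return e[0] * x[0] + e[1] * x[1]

def decode(z, d):
    """Decoder: map the latent z back up to a 2-D reconstruction x_hat."""
    return [d[0] * z, d[1] * z]

def mse(xs, x_hats):
    """Reconstruction loss: mean squared error over all samples and dims."""
    total = sum((a - b) ** 2 for x, xh in zip(xs, x_hats) for a, b in zip(x, xh))
    return total / (len(xs) * len(xs[0]))

def reconstruct(xs, e, d):
    return [decode(encode(x, e), d) for x in xs]

def train(xs, steps=500, lr=0.05):
    """Gradient descent on the reconstruction loss; no labels are used."""
    e, d = [0.1, 0.1], [0.1, 0.1]
    n = len(xs)
    for _ in range(steps):
        grad_e, grad_d = [0.0, 0.0], [0.0, 0.0]
        for x in xs:
            z = encode(x, e)
            x_hat = decode(z, d)
            r = [x_hat[0] - x[0], x_hat[1] - x[1]]   # residuals x_hat - x
            dz = (r[0] * d[0] + r[1] * d[1]) / n     # d(loss)/dz for this sample
            for i in range(2):
                grad_d[i] += r[i] * z / n            # d(loss)/d(decoder weight)
                grad_e[i] += dz * x[i]               # d(loss)/d(encoder weight)
        e = [e[i] - lr * grad_e[i] for i in range(2)]
        d = [d[i] - lr * grad_d[i] for i in range(2)]
    return e, d

# Hypothetical data lying on a line, so one latent variable can describe it.
xs = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
initial_loss = mse(xs, reconstruct(xs, [0.1, 0.1], [0.1, 0.1]))
e, d = train(xs)
final_loss = mse(xs, reconstruct(xs, e, d))
print(initial_loss, final_loss)
```

Because this toy data really is one-dimensional, a single latent variable is enough for a near-perfect reconstruction. With data that has more underlying factors than the latent space has dimensions, the same bottleneck would force a lossy reconstruction, which is the trade-off described above.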
Variational autoencoders (17:00)
Let's now expand upon this idea a bit more and introduce this concept and architecture of variational autoencoders, or VAEs. So as we just saw, traditional autoencoders go from input to reconstructed output. And if we pay closer attention to this latent layer denoted here in orange, what you can hopefully realize is that this is just a normal layer in a neural network, just like any other layer. It's deterministic. If you're going to feed in a particular input to this network, you're going to get the same output so long as the weights are the same. So effectively, a traditional autoencoder learns this deterministic encoding, which allows for reconstruction and reproduction of the input. In contrast, variational autoencoders impose a stochastic or variational twist on this architecture, and the idea behind doing so is to generate smoother representations of the input data and not only improve the quality of reconstructions, but also actually generate new images that are similar to the input dataset but not direct reconstructions of the input data. And the way this is achieved is that variational autoencoders replace that deterministic layer z with a stochastic sampling operation. What this means is that instead of learning the latent variables z directly, for each variable the variational autoencoder learns a mean and a variance associated with that latent variable. And what those means and variances do is that they parametrize a probability distribution for that latent variable. So what we've done in going from an autoencoder to a variational autoencoder is going from a vector of latent variables z to learning a vector of means mu and a vector of variances sigma squared that parametrize these variables and define probability distributions for each of our latent variables.
And the way we can actually generate new data instances is by sampling from the distribution defined by these mus and sigmas to generate a latent sample and get probabilistic representations of the latent space. And what I'd like you to appreciate about this network architecture is that it's very similar to the autoencoder I previously introduced, just that we have this probabilistic twist where we're now performing the sampling operation to compute samples from each of the latent variables. Alright, so now because we've introduced this sampling operation, the stochasticity into our model, what this means for the actual computation and learning process of the network, the encoder and decoder, is that they're now probabilistic in their nature. And the way you can think of this is that our encoder is going to be trying to learn a probability distribution of the latent space z given the input data x, while the decoder is going to take that learned latent representation and compute a new probability distribution of the input x given that latent distribution z. And these networks, the encoder, the decoder, are going to be defined by separate sets of weights, phi and theta, and the way that we can train this variational autoencoder is by defining a loss function that's going to be a function of the data, x, as well as these sets of weights, phi and theta. And what's key to how VAEs can be optimized is that this loss function is now comprised of two terms instead of just one. We have the reconstruction loss, just as before, which again is going to capture this difference between the input and the reconstructed output. And also a new term to our loss, which we call the regularization loss, also called the VAE loss. And to take a look in more detail at what each of these loss terms represents, let's first emphasize again that our overall loss function is going to be defined and taken with respect to the sets of weights of the encoder and decoder and the input x. 
The reconstruction loss is very similar to before, right? And you can think of it as being driven by a log likelihood function, for example for image data the mean squared error between the input and the output. And we can self-supervise the reconstruction loss just as before to force the latent space to learn and represent faithful representations of the input data, ultimately resulting in faithful reconstructions. The new term here, the regularization term, is a bit more interesting and completely new at this stage, so we're going to dive in and discuss it in a bit more detail. So our probability distribution that's going to be computed by our encoder, q phi of z given x, is a distribution on the latent space z given the data x. And what regularization enforces is that as a part of this learning process we're going to place a prior on the latent space z, which is effectively some initial hypothesis about what we expect the distributions of z to actually look like. And by imposing this regularization term, what we can achieve is that the model will try to enforce the z's that it learns to follow this prior distribution, and we're going to denote this prior as p of z. This term here, D, is the regularization term, and what it's going to do is try to enforce a minimization of the divergence or the difference between what the encoder is trying to infer, the probability distribution of z given x, and that prior that we're going to place on the latent variables, p of z. And the idea here is that by imposing this regularization factor, we can try to keep the network from overfitting on certain parts of the latent space by enforcing the fact that we want to encourage the latent variables to adopt a distribution that's similar to our prior.
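Written out with the symbols used so far (encoder weights phi, decoder weights theta, input x, inferred distribution q phi of z given x, prior p of z, and divergence D), the two-term VAE loss described above has the following shape. The layout is an editorial sketch of what the slide presumably shows, and the reconstruction term is left generic because the lecture only says it is driven by a log likelihood such as the mean squared error.

```latex
\mathcal{L}(\phi, \theta, x)
  = \underbrace{(\text{reconstruction loss})}_{\text{compares } x \text{ with } \hat{x}}
  + \underbrace{D\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization term}}
```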
So we're going to go through now both the mathematical basis for this regularization term as well as a really intuitive walkthrough of what regularization achieves, to help give you a concrete and intuitive understanding of why regularization is important and why placing a prior is important. So to re-emphasize once again, this regularization term is going to consider the divergence between our inferred latent distribution and the fixed prior we're going to place. So before we get into this, let's consider what could be a good choice of prior for each of these latent variables.
Priors on the latent distribution (24:30)
How do we select p? I'll first tell you what's commonly done. The common choice that's used very extensively in the community is to enforce the latent variables to roughly follow normal Gaussian distributions, which means that they're going to be a normal distribution centered around mean 0 and have a standard deviation and variance of 1. By placing these normal Gaussian priors on each of the latent variables and therefore on our latent distribution overall, what this encourages is that the learned encodings, learned by the encoder portion of our VAE, are going to be sort of distributed evenly around the center of each of the latent variables. And if you can imagine and picture when you have sort of a roughly even distribution around the center of a particular region of the latent space, what this means is that outside of this region, far away, there's going to be a greater penalty. And this penalty arises in instances where the network is trying to cheat by clustering particular points far from these centers in the latent space, as if it were trying to memorize particular outliers or edge cases in the data. After we place a normal Gaussian prior on our latent variables, we can now begin to concretely define the regularization term component of our loss function. This term of the loss is very similar in principle to a cross entropy loss that we saw before, where the key is that we're going to be defining the distance function that describes the difference or the divergence between the inferred latent distribution q phi of z given x and the prior that we're going to be placing, p of z.
And this term is called the Kullback-Leibler or KL divergence, and when we choose a normal Gaussian prior, this results in the KL divergence taking this particular form of this equation here, where we're using the means and sigmas as input and computing this distance measure that captures the divergence of that learned latent variable distribution from the normal Gaussian. All right, so now I really want to spend a bit of time to build up some intuition about how this regularization works and why we actually want to regularize our VAE, and then also why we select a normal prior. All right, so to do this let's consider the following question. What properties do we want to achieve from regularization? Why are we actually regularizing our network in the first place? The first key property that we want for a generative model like a VAE is what I like to think of as continuity, which means that if there are points that are represented closely in the latent space, they should also result in similar reconstructions, similar outputs, similar content after they are decoded. You would expect intuitively that regions in the latent space have some notion of distance or similarity to each other, and this indeed is a really key property that we want to achieve with our generative model. The second property is completeness, and it's very related to continuity. What this means is that when we sample from the latent space to decode the latent space into an output, that should result in a meaningful reconstruction, a meaningful sampled content that is, you know, resembling the original data distribution. You can imagine that if we're sampling from the latent space and just getting garbage out that has no relationship to our input, this could be a huge, huge problem for our model.
Alright, so with these two properties in mind, continuity and completeness, let's consider the consequences of what can occur if we do not regularize our model. Well, without regularization, what could end up happening with respect to these two properties is that there could be instances of points that are close in latent space but not similarly decoded. So I'm using this really intuitive illustration where these dots represent abstracted away sort of regions in the latent space and the shapes that they relate to you can think of as what is going to be decoded after those instances in the latent space are passed through the decoder. So in this example we have these two dots, the greenish dot and the reddish dot, that are physically close in latent space but result in completely different shapes when they're decoded. We also have an instance of this purple point which when it's decoded it doesn't result in a meaningful content it's just a scribble. So by not regularizing and I'm abstracting a lot away here and that's on purpose, we could have these instances where we don't have continuity and we don't have completeness. Therefore our goal with regularization is to be able to realize a model where points that are close in the latent space are not only similarly decoded but also meaningfully decoded. So for example here we have the red dot and the orange dot which result in both triangle-like shapes but with slight variations on the triangle itself. So this is the intuition about what regularization can enable us to achieve and what are desired properties for these generative models. Okay, how can we actually achieve this regularization? And how does the normal prior fit in? As I mentioned, VAEs, they don't just learn the latent variable z directly. They're trying to encode the inputs as distributions that are defined by mean and variance. So my first question to you is, is it going to be sufficient to just learn mean and variance, learn these distributions? 
Can that guarantee continuity and completeness? No. And let's understand why. Alright, without any sort of regularization, what could the model try to resort to? Remember that the VAE loss function is defined by both a reconstruction term and a regularization term. If there is no regularization, you can bet that the model is going to just try to optimize that reconstruction term. So it's effectively going to learn to minimize the reconstruction loss, even though we're encoding the latent variables via mean and variance. And two consequences of that are that you can have instances where these learned variances for the latent variable end up being very, very, very small, effectively resulting in pointed distributions. And you can also have means that are totally divergent from each other, which result in discontinuities in the latent space. And this can occur while still trying to optimize that reconstruction loss, a direct consequence of not regularizing. So in order to overcome these problems, we need to regularize the variance and the mean of these distributions that are being returned by the encoder. And the normal prior, placing that normal Gaussian distribution as our prior, helps us achieve this. And the reason this occurs is that effectively the normal prior is going to encourage these learned latent variable distributions to overlap in latent space. Recall, right, mean 0, variance of 1. That means all the latent variables are going to be enforced to try to have the same mean, a centered mean, and all the variances are going to be regularized for each and every one of the latent variable distributions. And so this will ensure a smoothness and a regularity and an overlap in the latent space, which will be very effective in helping us achieve these properties of continuity and completeness.
Centering the means, regularizing the variances. So the effect of regularization via this normal prior, by centering each of these latent variables and regularizing their variances, is that it helps enforce this continuous and complete gradient of information represented in the latent space, where again points and distances in the latent space have some relationship to the reconstructions and the content of the reconstructions that result. Note though that there's going to be a trade-off between regularizing and reconstructing. The more we regularize, the greater the risk of degrading the quality of the reconstruction and the generation process itself. So in optimizing VAEs there's going to be this trade-off that needs to be tuned to fit the problem of interest. All right. So hopefully by walking through this example and considering these points, you've built up more intuition about why regularization is important and how specifically the normal prior can help us regularize. Great.
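The transcript refers to the KL term "taking this particular form" without reproducing the slide, so as a sketch: for a diagonal Gaussian with means mu and standard deviations sigma, measured against a standard normal prior, the standard closed form of the KL divergence is 0.5 times the sum over latent variables of (sigma squared + mu squared - 1 - log sigma squared). The function name below is hypothetical, and the three calls illustrate the two failure modes discussed in this section.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over independent latent variables."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )

# An encoding that already matches the prior (mean 0, variance 1): no penalty.
matches_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# A mean that has drifted far from the center is penalized.
far_mean = kl_to_standard_normal([3.0], [1.0])

# A very small variance (a "pointed" distribution) is penalized too.
tiny_variance = kl_to_standard_normal([0.0], [0.01])

print(matches_prior, far_mean, tiny_variance)
```

The penalty is zero only when every latent variable has mean 0 and variance 1, which is why minimizing this term centers the means and keeps the variances from collapsing.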
Reparameterization trick (34:38)
So now we've defined our loss function. We know that we can reconstruct the inputs. We've understood how we can regularize learning and achieve continuity and completeness via this normal prior. These are all the components that define a forward pass through the network going from input to encoding to decoded reconstruction. But we're still missing a critical step in putting the whole picture together and that's of backpropagation. And the key here is that because of this fact that we've introduced this stochastic sampling layer, we now have a problem where we can't backpropagate gradients through a sampling layer that has this element of stochasticity. Backpropagation requires deterministic nodes, deterministic layers, for which we can iteratively apply the chain rule to optimize gradients, optimize the loss via gradient descent. Alright, VAEs introduced sort of a breakthrough idea that solved this issue of not being able to backpropagate through a sampling layer. And the key idea was to actually subtly re-parameterize the sampling operation such that the network could be trained completely end-to-end. So as we already learned, right, we're trying to build up this latent distribution defined by these variables z, placing a normal prior defined by a mean and a variance. And we can't simply backpropagate gradients through this sampling layer because we can't compute gradients through this stochastic sample. The key idea instead is to try to consider the sample latent vector z as a sum defined by a fixed mu, a fixed sigma vector, and scale that sigma vector by random constants that are going to be drawn from a prior distribution, such as a normal Gaussian. And by re-parameterizing the sampling operation as so, we still have this element of stochasticity. But that stochasticity is introduced via this random constant epsilon, which is not occurring within the bottleneck latent layer itself. We've re-parameterized and distributed it elsewhere. 
To visualize how this looks, let's consider the following. In the original form of the VAE, we had deterministic nodes, which are the weights of the network, as well as an input vector, and we were trying to backpropagate through the stochastic sampling node z. But we can't do that. So now, via reparameterization, what we've achieved is the following form, where our latent variables z are defined with respect to mu and sigma squared, as well as this noise factor epsilon, such that when we want to do backpropagation through the network to update the weights, we can directly backpropagate through z, defined by mu and sigma squared, because this epsilon value is just taken as a constant. The stochasticity is reparameterized elsewhere. And this is a very, very powerful trick, the reparameterization trick, because it enables us to train variational autoencoders end-to-end by backpropagating with respect to z and with respect to the actual weights of the encoder network.
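As a rough illustration of the reparameterization described above, here is a minimal sketch in plain Python. It assumes a diagonal Gaussian and a standard normal epsilon; the names `sample_latent` and `latent_gradients` are hypothetical and chosen only for this example, and the gradients are written out by hand rather than computed by a deep learning framework.

```python
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)."""
    # All randomness lives in epsilon, which does not depend on mu or sigma.
    epsilon = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, epsilon)]
    return z, epsilon

def latent_gradients(epsilon):
    """Partial derivatives of z with epsilon treated as a constant."""
    dz_dmu = [1.0 for _ in epsilon]  # d(mu + sigma * eps) / d(mu) = 1
    dz_dsigma = list(epsilon)        # d(mu + sigma * eps) / d(sigma) = eps
    return dz_dmu, dz_dsigma

rng = random.Random(0)  # fixed seed so the sketch is repeatable
mu = [0.5, -1.0]        # assumed encoder outputs, for illustration only
sigma = [1.0, 0.1]
z, epsilon = sample_latent(mu, sigma, rng)
dz_dmu, dz_dsigma = latent_gradients(epsilon)
```

Because z is a deterministic function of mu and sigma once epsilon is drawn, the chain rule can continue from z back into the encoder.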
Latent perturbation and disentanglement (38:14)
All right. One side effect and one consequence of imposing these distributional priors on the latent variables is that we can actually sample from these latent variables and individually tune them while keeping all of the other variables fixed. And what you can do is tune the value of a particular latent variable and run the decoder each time that variable is changed, each time that variable is perturbed, to generate a new reconstructed output. So an example of that result is in the following, where this perturbation of the latent variables results in a representation that has some semantic meaning about what the network is maybe learning. So in this example these images show variation in head pose, and the different dimensions of z, the latent space, the different latent variables, are in this way encoding different latent features that can be interpreted by keeping all other variables fixed and perturbing the value of one individual latent variable. Ideally, in order to optimize VAEs and try to maximize the information that they encode, we want these latent variables to be uncorrelated with each other, effectively disentangled, and what that could enable us to achieve is to learn the richest and most compact latent representation possible. So in this case we have head pose on the x-axis and smile on the y-axis, and we want these to be as uncorrelated with each other as possible. One way that's been shown to achieve this disentanglement is a rather straightforward approach called beta-VAEs. So if we consider the loss of a standard VAE, again we have this reconstruction term defined by a log likelihood and a regularization term defined by the KL divergence. Beta-VAEs introduce a new hyperparameter, beta, which controls the strength of this regularization term, and it's been shown mathematically that by increasing beta, the effect is to place constraints on the latent encoding, such as to encourage disentanglement.
And there have been extensive proofs and discussions as to how exactly this is achieved. But to consider the results, let's again consider the problem of face reconstruction, where using a standard VAE, if we consider the latent variable of head pose or rotation, in this case where beta equals 1, what you can hopefully appreciate is that as the face pose is changing, the smile of some of these faces is also changing. In contrast, by enforcing a beta much larger than 1, what is able to be achieved is that the smile remains relatively constant while we can perturb the single latent variable of the head rotation and achieve perturbations with respect to head rotation alone.
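To make the role of beta concrete, the loss described above can be sketched as a reconstruction term plus beta times the KL term. This is a minimal sketch under stated assumptions: the latent distribution is a diagonal Gaussian parameterized by `mu` and `log_var` (log of sigma squared), the prior is a standard normal so the KL divergence has a closed form, and the reconstruction loss is passed in as a number already computed elsewhere. The function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian, summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def beta_vae_loss(reconstruction_loss, mu, log_var, beta=1.0):
    """Reconstruction term plus beta times the KL regularization term."""
    # beta = 1.0 corresponds to the standard VAE loss; beta > 1.0 strengthens
    # the regularization toward the prior.
    return reconstruction_loss + beta * kl_to_standard_normal(mu, log_var)

standard = beta_vae_loss(2.0, mu=[1.0], log_var=[0.0], beta=1.0)
stronger = beta_vae_loss(2.0, mu=[1.0], log_var=[0.0], beta=4.0)
```

With the same encoding, a larger beta gives the KL term more influence on the total, which is the knob the lecture refers to.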
Debiasing with VAEs (41:25)
Alright, so as I motivated and introduced in the beginning and the introduction of this lecture, one powerful application of generative models and latent variable models is in model de-biasing. And in today's lab, you're actually going to get real hands-on experience in building a variational autoencoder that can be used to achieve automatic debiasing of facial classification systems, facial detection systems. And the power and the idea of this approach is to build up a representation, a learned latent distribution of face data, and use this to identify regions of that latent space that are going to be overrepresented or underrepresented. And that's going to all be taken with respect to particular learned features, such as skin tone, pose, objects, clothing. And then from these learned distributions, we can actually adjust the training process such that we can place greater weight and greater sampling on those images and on those faces that fall in the regions of the latent space that are underrepresented automatically. And what's really, really cool about deploying a VAE or a latent variable model for an application like model de-biasing is that there's no need for us to annotate and prescribe the features that are important to actually de-bias against. The model learns them automatically. And this is going to be the topic of today's lab. And it also opens the door to a much broader space that's going to be explored further in a later spotlight lecture that's going to focus on algorithmic bias and machine learning fairness. Alright, so to summarize the key points on VAEs. They compress representation of data into an encoded representation. Reconstruction of the data input allows for unsupervised learning without labels. We can use the reparameterization trick to train VAEs end-to-end. We can take hidden latent variables, perturb them to interpret their content and their meaning, and finally we can sample from the latent space to generate new examples.
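One way to picture the resampling idea described above is with a histogram over a single learned latent variable: examples that fall in sparsely populated bins get a higher probability of being sampled during training. The sketch below is only an illustration under assumptions, not the lab's implementation: it uses one latent dimension, a fixed number of bins, and a smoothing constant `alpha`, and the function name `resampling_probabilities` is hypothetical.

```python
def resampling_probabilities(latent_values, num_bins=4, alpha=0.01):
    """Sampling probability per example, larger where the latent histogram is sparse."""
    lo, hi = min(latent_values), max(latent_values)
    width = (hi - lo) / num_bins
    if width == 0.0:
        width = 1.0  # all values identical: everything falls into bin 0
    # Assign every example to a histogram bin over the latent variable.
    bins = [min(int((v - lo) / width), num_bins - 1) for v in latent_values]
    counts = [0] * num_bins
    for b in bins:
        counts[b] += 1
    density = [c / len(latent_values) for c in counts]
    # Weight each example by the inverse of its bin density (alpha avoids extremes).
    weights = [1.0 / (density[b] + alpha) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

# Five examples cluster together; the last one sits in an underrepresented region.
latent_values = [0.10, 0.12, 0.15, 0.18, 0.20, 3.00]
probs = resampling_probabilities(latent_values)
```

In this toy input the isolated example receives the largest sampling probability, without anyone having labeled which feature made it rare.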
Generative adversarial networks (43:42)
But what if we wanted to focus on generating samples and synthetic samples that were as faithful to a data distribution generally as possible? To understand how we can achieve this, we're going to transition to discuss a new type of generative model called a generative adversarial network, or GAN for short. The idea here is that we don't want to explicitly model the density or the distribution underlying some data, but instead just learn a representation that can be successful in generating new instances that are similar to the data. Which means that we want to optimize to sample from a very very complex distribution which cannot be learned and modeled directly. Instead we're going to have to build up some approximation of this distribution. And the really cool and breakthrough idea of GANs is to start from something extremely extremely simple, just random noise, and try to build a neural network, a generative neural network, that can learn a functional transformation that goes from noise to the data distribution. And by learning this functional generative mapping, we can then sample in order to generate fake instances, synthetic instances, that are going to be as close to the real data distribution as possible. The breakthrough to achieving this was this structure called GANs, where the key component is to have two neural networks, a generator network and a discriminator network, that are effectively competing against each other, they're adversaries. Specifically, we have a generator network, which I'm going to denote here on out by G, that's going to be trained to go from random noise to produce an imitation of the data. And then the discriminator is going to take that synthetic fake data as well as real data and be trained to actually distinguish between fake and real. And in training these two networks are going to be competing against each other. 
And so in doing so, overall the effect is that the discriminator is going to get better and better at learning how to classify real and fake and the better it becomes at doing that it's going to force the generator to try to produce better and better synthetic data to try to fool the discriminator back and forth back and forth.
Intuitions behind GANs (46:14)
So let's now break this down and go from a very simple toy example to get more intuition about how these GANs work. The generator is going to start, again, from some completely random noise and produce fake data. And I'm going to show that here by representing these data as points on a one-dimensional line. The discriminator is then going to see these points as well as real data. And then it's going to be trained to output a probability that the data it sees are real or if they are fake. And in the beginning it's not going to be trained very well, right? So its predictions are not going to be very good. But then you're going to train it and you're going to train it and it's going to start increasing the probabilities of real versus not real appropriately such that you get this perfect separation where the discriminator is able to perfectly distinguish what is real and what is fake. Now it's back to the generator, and the generator is going to come back. It's going to take instances of where the real data lie as inputs to train. And then it's going to try to improve its imitation of the data, trying to move the fake data, the synthetic data that is generated, closer and closer to the real data. And once again, the discriminator is now going to receive these new points. And it's going to estimate a probability that each of these points is real. And again, learn to decrease the probability of the fake points being real further and further. And now we're going to repeat again. And one last time, the generator is going to start moving these fake points closer and closer to the real data, such that the fake data are almost following the distribution of the real data. At this point it's going to be really really hard for the discriminator to effectively distinguish between what is real and what is fake, while the generator is going to continue to try to create fake data instances to fool the discriminator. 
And this is really the key intuition behind how these two components of GANs are essentially competing with each other.
Training GANs (48:27)
All right, so to summarize how we train GANs, the generator is going to try to synthesize fake instances to fool a discriminator, which is going to be trained to identify the synthesized instances and discriminate these as fake. To actually train, we are going to define a loss function that defines competing and adversarial objectives for each of the discriminator and the generator. And a global optimum, the best we could possibly do, would mean that the generator could perfectly reproduce the true data distribution such that the discriminator absolutely cannot tell what's synthetic versus what's real. So let's go through how the loss function for a GAN breaks down. The loss term for a GAN is based on that familiar cross entropy loss, and it's going to now be defined between the true and generated distributions. So we're first going to consider the loss from the perspective of the discriminator. We want to try to maximize the probability that the fake data is identified as fake. And so to break this down, here g of z defines the generator's output. And so d of g of z is the discriminator's estimate of the probability that a fake instance is actually fake. d of x is the discriminator's estimate of the probability that a real instance is fake. So 1 minus d of x is its probability estimate that a real instance is real. So together, from the point of view of the discriminator, we want to maximize this probability: maximize the probability that fake is fake, maximize the estimate of the probability that real is real. Now let's turn our attention to the generator. Remember that the generator is taking random noise and generating an instance. It cannot directly affect the term d of x, which shows up in the loss, because d of x is solely based on the discriminator's operation on the real data.
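The two terms discussed above can be written out numerically. The sketch below follows the convention used in this passage, where the discriminator outputs the probability that its input is fake, so the quantity the discriminator maximizes is the average of log d of g of z plus the average of log of 1 minus d of x. The function name `gan_objective` and the example probabilities are assumptions made for illustration.

```python
import math

def gan_objective(d_on_fake, d_on_real):
    """Return (fake_term, real_term, total) for a discriminator that outputs P(fake)."""
    # Average log-probability that generated instances are called fake.
    fake_term = sum(math.log(p) for p in d_on_fake) / len(d_on_fake)
    # Average log-probability that real instances are called real.
    real_term = sum(math.log(1.0 - p) for p in d_on_real) / len(d_on_real)
    return fake_term, real_term, fake_term + real_term

d_on_real = [0.1, 0.2]  # discriminator outputs on real data

# A discriminator that confidently spots the fakes ...
sharp_fake, sharp_real, sharp_total = gan_objective([0.9, 0.8], d_on_real)
# ... versus the same discriminator facing fakes it cannot tell apart.
fooled_fake, fooled_real, fooled_total = gan_objective([0.5, 0.5], d_on_real)
```

Changing only the generated samples changes the first term and leaves the real-data term untouched, which is why the generator can only act on the d of g of z part of the objective.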
So for the generator, the generator is going to have the objective adversarial to the discriminator's, which means it's going to try to minimize this term, effectively minimizing the probability that the discriminator can distinguish its generated data as fake, d of g of z. And the goal for the generator is to minimize this term of the objective. So the objective of the generator is to try to synthesize fake instances that fool the discriminator. And eventually, over the course of training the discriminator, the discriminator is going to be as good as it possibly can be at discriminating real versus fake. Therefore, the ultimate goal of the generator is to synthesize fake instances that fool the best discriminator. And this is all put together in this min-max objective function, which has these two components optimized adversarially. And then after training, we can actually use the generator network, which is now fully trained, to produce new data instances that have never been seen before. So we're going to focus on that now. And what is really cool is that when the trained generator of a GAN synthesizes new instances, it's effectively learning a transformation from a distribution of noise to a target data distribution. And that transformation, that mapping, is going to be what's learned over the course of training. So if we consider one point from a latent noise distribution, it's going to result in a particular output in the target data space. And if we consider another point of random noise and feed it through the generator, it's going to result in a new instance, and that new instance is going to fall somewhere else on the data manifold. And indeed what we can actually do is interpolate and traverse in the space of Gaussian noise to result in interpolation in the target space.
And you can see an example of this result here where a transformation in series reflects a traversal across the target data manifold. And that's produced in the synthetic examples that are outputted by the generator.
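The traversal described above can be sketched as linear interpolation between two noise vectors, with each intermediate point passed through the generator. In this minimal sketch, `toy_generator` is a hypothetical stand-in for a trained generator network (a real one would be a neural network), and straight-line interpolation is just one simple choice of path through the noise space.

```python
def interpolate(z_start, z_end, num_steps):
    """Evenly spaced points on the straight line from z_start to z_end (inclusive)."""
    # num_steps must be at least 2 so that both endpoints are included.
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

def toy_generator(z):
    """Hypothetical stand-in for a trained generator: maps noise to 'data' space."""
    return [2.0 * z[0] + 1.0, z[0] * z[1]]

z_a = [0.0, 1.0]   # one point of latent noise
z_b = [1.0, -1.0]  # another point of latent noise
noise_path = interpolate(z_a, z_b, num_steps=5)
generated_path = [toy_generator(z) for z in noise_path]
```

Each entry of `generated_path` is the output for one step along the noise-space path, so the sequence traces a path in the output space between the two generated endpoints.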
GANs: Recent advances (52:57)
Alright, so in the final few minutes of this lecture I'm going to highlight some of the recent advances in GANs and hopefully motivate even further why this approach is so powerful. So one idea that's been extremely, extremely powerful is this idea of progressive GANs, progressive growing, which means that we can iteratively build more detail into the generated instances that are produced. And this is done by progressively adding layers of increasing spatial resolution in the case of image data. And by incrementally building up both the generator and discriminator networks in this way as training progresses, it results in very well resolved synthetic images that are output ultimately by the generator. So some results of this idea of a progressive GAN are displayed here. Another idea that has also led to tremendous improvement in the quality of synthetic examples generated by GANs is an architecture improvement called StyleGAN, which combines this idea of progressive growing that I introduced earlier with principles of style transfer, which means trying to compose an image in the style of another image. So for example, what we can now achieve is to map input images, source A, using application of coarse-grained styles from secondary sources onto those targets to generate new instances that mimic the style of source B. And that result is shown here. And hopefully you can appreciate that these coarse-grained features, these coarse-grained styles like age, facial structure, things like that, can be reflected in these synthetic examples. This same StyleGAN system has led to tremendously realistic synthetic images in the areas of face synthesis as well as for animals and other objects. Another extension to the GAN architecture that has enabled particularly powerful applications for select problems and tasks is this idea of conditioning, which imposes a bit of additional structure on the types of outputs that can be synthesized by a GAN.
So the idea here is to condition on a particular label by supplying what is called a conditioning factor, denoted here as C. And what this allows us to achieve is instances like that of paired translation, in the case of image synthesis, where now instead of a single input as training data for our generator, we have pairs of inputs. So for example here, we consider both a driving scene and a corresponding segmentation map to that driving scene. And the discriminator can in turn be trained to classify fake and real pairs of data. And again, the generator is going to be trained to try to fool the discriminator. Example applications of this idea are seen as follows, where we can now go from an input of a semantic segmentation map to generate a synthetic street scene, mapping according to that segmentation. Or we can go from an aerial view, from a satellite image, to a street map view, or from particular labels of an architectural building to a synthetic architectural facade, or day to night, black and white to color, edges to photos, different instances of paired translation that are achieved by conditioning on particular labels. So another example which I think is really cool and interesting is translating from Google Street View to a satellite view and vice versa. And we can also achieve this dynamically. So for example in coloring, given an edge input, the network can be trained to actually synthetically color in the artwork that is resulting from this particular edge sketch.
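As a small sketch of what supplying a conditioning factor c can look like in practice, one common approach (assumed here, not stated in the lecture) is to concatenate c with the generator's noise input, and to have the discriminator judge the condition and the sample together as a pair. The example uses a one-hot label as the condition for brevity; in paired image translation the condition would be an image such as a segmentation map. All function names are hypothetical.

```python
import random

def one_hot(label, num_classes):
    """Encode an integer label as a one-hot conditioning vector c."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, c):
    """Concatenate noise z with the conditioning vector c."""
    return list(z) + list(c)

def discriminator_input(c, sample):
    """The discriminator sees (condition, sample) pairs rather than samples alone."""
    return list(c) + list(sample)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(3)]  # random noise
c = one_hot(2, num_classes=4)                # condition on class 2
g_in = generator_input(z, c)

real_sample = [0.3, 0.7]                     # placeholder for a real data instance
d_in = discriminator_input(c, real_sample)
```

Pairing the condition with the sample means a realistic output that does not match its condition can still be classified as fake.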
CycleGAN And Unpaired Translation
CycleGAN of unpaired translation (57:15)
Another idea, instead of paired translation, is that of unpaired image-to-image translation. And this can be achieved by a network architecture called CycleGAN, where the model is taking as input images from one domain and is able to learn a mapping that translates to another domain without having a paired corresponding image in that other domain. So the idea here is to transfer the style and the distribution from one domain to another. And this is achieved by introducing the cyclic relationship and the cyclic loss function, where we can go back and forth between a domain X and a domain Y. And in this system, there are actually two generators and two discriminators that are going to be trained on their respective generation and discrimination tasks. In this example, the CycleGAN has been trained to try to translate from the domain of horses to the domain of zebras. And hopefully you can appreciate that in this example, there's a transformation of the skin of the horse from brown to a zebra-like skin in stripes. And beyond this there's also a transformation of the surrounding area from green grass to something that's more brown in the case of the zebra. I think to get an intuition about how this CycleGAN transformation is working, let's go back to the idea that conventional GANs are moving from a distribution of Gaussian noise to some target data manifold. With CycleGANs, the goal is to go from a particular data manifold, X, to another data manifold, Y. And in both cases, I think the underlying concept that makes GANs so powerful is that they function as very, very effective distribution transformers. Finally, I'd like to consider one additional application of CycleGANs that you may be familiar with, and that's to transform speech and to actually use this CycleGAN technique to synthesize speech in someone else's voice.
And the way this is done is by taking a bunch of audio recordings in one voice and audio recordings in another voice and converting those audio waveforms into an image representation, which is called a spectrogram. We can then train a CycleGAN to operate on these spectrogram images to transform representations from voice A to make them appear as if they are from another voice, voice B. And this is exactly how we did the speech transformation for the synthesis of Obama's voice in the demonstration that Alexander gave in the first lecture. So, to inspect this further, let's compare side by side the original audio from Alexander as well as the synthesized version in Obama's voice that was generated using a CycleGAN. Hi, everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. So notice that the spectrogram that results for Obama's voice is actually generated by an operation on Alexander's voice, effectively learning a domain transformation from the domain of Alexander's voice onto the domain of Obama's voice. And the end result is that we create and synthesize something that's more Obama-like.
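The cyclic loss mentioned in this section can be sketched as a penalty on failing to return to the starting point after translating to the other domain and back. The sketch below assumes an L1 distance for that penalty and omits the adversarial losses of the two discriminators entirely; the toy generators `g_xy` and `g_yx` are hypothetical stand-ins for the two learned mappings between domain X and domain Y.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Penalty for not returning to the start after X -> Y -> X and Y -> X -> Y."""
    x_cycled = g_yx(g_xy(x))  # translate to domain Y, then back to X
    y_cycled = g_xy(g_yx(y))  # translate to domain X, then back to Y
    return l1_distance(x, x_cycled) + l1_distance(y, y_cycled)

# Toy mappings: doubling goes X -> Y, halving goes Y -> X (exact inverses).
g_xy = lambda v: [2.0 * e for e in v]
g_yx = lambda v: [e / 2.0 for e in v]
# A mismatched reverse mapping that does not undo g_xy.
g_yx_bad = lambda v: [e / 2.0 + 1.0 for e in v]

x = [1.0, 2.0]
y = [4.0, 6.0]
consistent = cycle_consistency_loss(x, y, g_xy, g_yx)
inconsistent = cycle_consistency_loss(x, y, g_xy, g_yx_bad)
```

When the two mappings undo each other the penalty is zero, and it grows as the round trip drifts away from the original input, which is what lets training proceed without paired examples.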
Alright, so to summarize, hopefully over the course of this lecture you've built up an understanding of generative modeling and of classes of generative models that are particularly powerful in enabling probabilistic density estimation as well as sample generation. And with that, I'd like to close the lecture and introduce you to the remainder of today's course, which is going to focus on our second lab on computer vision, specifically exploring this question of debiasing in facial detection systems and using variational autoencoders to actually achieve an approach for automatic debiasing of classification systems. So I encourage you to come to the class Gather Town to have your questions on the labs answered and to discuss further with any of us. Thank you.