MIT 6.S191 (2022): Deep Generative Modeling

Transcription for the video titled "MIT 6.S191 (2022): Deep Generative Modeling".




Intro (00:00)

Okay, so welcome back. Hopefully you had a little bit of a break as we got set up. So in this next lecture on deep generative modeling, we're going to be talking about a very powerful concept: building systems that not only look for patterns in existing data but can go a step beyond this to actually generate brand new data instances based on those learned patterns. This is an idea that's different from what we've been exploring so far in the first three lectures, and this area of generative modeling is a particular field within deep learning that's enjoying a lot of success, a lot of attention and interest right now, and I'm eager to see how it continues to develop in the coming years. Okay, let's get into it. So to start, take a look at these three images, these three faces, and I want you all to think about which face of these three you think is real. Unfortunately, I can't see the chat right now as I'm lecturing, but please think about it mentally or submit your answers. I think they may be coming in. OK, mentally think about it. The punch line, which I'll give to you right now for the sake of time, is that all three of these faces are in fact fake. They were completely generated, as you may or may not have guessed, by a generative model trained on datasets of human faces. So this shows the power and maybe inspires some caution about the impact that generative modeling could have in our world. Alright, so to get into the technical bit: so far in this course, we've been primarily looking at what we call problems of supervised learning, meaning that we're given data and we're given some labels, for example, a class label or a numerical value, and we want to learn some function that maps the data to those labels.

Understanding Generative Models

Unsupervised learning (01:48)

And this is a course on deep learning, so we've been largely concerned about building neural network models that can learn this functional mapping. But at its core, that function that is performing this mapping could really be anything. Today we're going to step beyond this from the class of supervised learning problems to now consider problems in the domain of unsupervised learning. And it's a brand new class of problems, and here in this setting, we're simply given data, data X, right? And we're not necessarily given labels associated with each of those individual data instances. And our goal now is to build a machine learning model that can take that raw data and learn what is the underlying structure, the hidden and underlying structure that defines the distribution of this data. So you may have seen some examples of unsupervised learning in the setting of traditional machine learning, for example, clustering algorithms or principal component analysis, for example. These are all unsupervised methods. But today, we're going to get into using deep generative models as an example of unsupervised learning, where our goal is to take some data examples, data samples from our training set, and those samples are going to be drawn from some general data distribution. Our task is to learn a model that is capturing some representation of that distribution. And we can do this in two main ways. The first is what is called density estimation, where we're given our samples, our data samples. They're going to fall according to some probability distribution. And our task is to learn an approximation of what the function of that probability distribution could be. 
The second class of problems is in sample generation, where now, given some input samples, right, from, again, some data distribution, we're going to try to learn a model of that data distribution, and then use that process to actually now generate new instances, new samples, that hopefully fall in line with what the true data distribution is. And in both of these cases, our task overall is actually fundamentally the same. We're trying to learn a probability distribution using our model, and we're trying to match that probability distribution similar to the true distribution of the data. And what makes this task difficult and interesting and complex is that often we're working with data types like images where the distribution is very high dimensional. It's not a simple normal distribution that we can predict with a known function. And that's why using neural networks for this task is so powerful, because we can learn these extraordinarily complex functional mappings and estimates of these high dimensional data distributions.
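To make the two tasks concrete, here is a minimal sketch in plain Python. It assumes a toy one-dimensional data set and a simple Gaussian model rather than a neural network, so it only illustrates the difference between density estimation and sample generation; the function names `fit_density` and `generate` are hypothetical and chosen for illustration.

```python
import random
from statistics import NormalDist

def fit_density(samples):
    """Density estimation: fit a 1-D Gaussian to the observed samples."""
    return NormalDist.from_samples(samples)

def generate(model, n, seed=None):
    """Sample generation: draw n new points from the fitted model."""
    return model.samples(n, seed=seed)

# Toy "training set": samples from a distribution we pretend not to know.
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(10_000)]

model = fit_density(data)                # learn an approximation of p(x)
density_at_5 = model.pdf(5.0)            # evaluate the learned density
new_points = generate(model, 5, seed=1)  # generate brand new instances
```

With images, the distribution is far too high dimensional for a closed-form fit like this, which is where the neural network models in the rest of the lecture come in.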

Why Generative models (05:02)

So why care about generative models? What could some applications be? Well, first of all, because they're modeling this probability distribution, they're actually capable of learning and uncovering what could be the underlying features in a data set in a completely unsupervised manner. And where this could be useful is in applications where maybe we want to understand more about what data distributions look like in a setting where our model is being applied for some downstream task. So for example, in facial detection, we could be given a data set of many, many different faces. And starting off, we may not know the exact distribution of these faces with respect to features like skin tone or hair or illumination or occlusions, so on and so forth.

The Representation Learning Dilemma (05:51)

And our training data that we use to build a model may actually be very homogenous, very uniform with respect to these features. And we could want to be able to determine and uncover whether or not this actually is the case before we deploy a facial detection model in the real world. And you'll see in today's lab and in this lecture how we can use generative models to not only uncover what the distribution of these underlying features may be, but actually use this information to build more fair and representative data sets that can be used to train machine learning models that are unbiased and equitable with respect to these different underlying features. Another great example is in the case of outlier detection. For example, in the context of autonomous driving, you want to detect rare events that may not be very well represented in the data, but are very important for your model to be able to handle and effectively deal with when deployed. And so generative models can be used to, again, estimate these probability distributions and identify those instances, for example, in the context of driving, when a pedestrian walks in or there's a really strange event like a deer walking onto the road or something like that, and be able to effectively handle and deal with these outliers in the data. Today we're going to focus on two principal classes of generative models, the first being autoencoders, specifically autoencoders and variational autoencoders, and the second being an architecture called generative adversarial networks, or GANs. And both of these are what we like to call latent variable models. And I just threw out this term of latent variable, but I didn't actually tell you what a latent variable actually means. And the example that I love to use to illustrate the concept of a latent variable comes from this story from the work of Plato, Plato's Republic. And this story is known as the myth of the cave.
And in this legend, there is a group of prisoners who are being held imprisoned, and they're constrained as part of their punishment to face a wall, and just stare at this wall, observe it. The only things they can actually see are shadows of objects that are behind their heads. So these are their observations. They're not seeing the actual entities, the physical objects that are casting these shadows in front of them. So from their perspective, these shadows are the observed variables. But in truth, there are physical objects directly behind them that are casting these shadows onto the wall. And so those objects here are like latent variables. They're the underlying variables that are governing some behavior, but that we cannot directly observe. We only see what's in front of us. They're the true explanatory factors that are resulting in some behavior. And our goal in generative modeling is to find ways to actually learn what these true explanatory factors, these underlying latent variables, can be using only observations, only given the observed data. So we're going to start by discussing a very simple generative model that tries to do this. And the idea behind this model, called autoencoders, is to build some encoding of the input and try to reconstruct an input directly. And to take a look at the way that the autoencoder works, it functions very similarly to some of the architectures that we've seen in the prior three lectures. We take as input raw data, pass it through some series of deep neural network layers, and now our output is directly a low dimensional latent space, a feature space which we call z. And this is the actual representation, the actual variables that we're trying to predict in training this type of network. So I encourage you to think about, in considering this type of architecture, why we would care about trying to enforce a low dimensional set of variables z.
Why is this important? The fact is that we are able to effectively build a compression of the data by moving from the high dimensional input space to this lower dimensional latent space. And we're able to get a very compact and hopefully meaningful representation of the input data. OK. So how can we actually do this? If our goal is to predict this vector z, we don't have any sort of labels for what these variables z could actually be. They're underlying, they're hidden, we can't directly observe them. How can we train a network like this? Because we don't have training data, what we can do is use our input data maximally to our advantage by complementing this encoding with a decoder network that now takes that latent representation, that lower dimensional set of variables, and goes up from it, builds up from it, to try to learn a reconstruction of the original input image. And here, the reconstructed output is what we call X hat, because it's an imperfect reconstruction of the original data. And we can train this network end to end by looking at our reconstructed output, looking at our input, and simply trying to minimize the distance between them. Taking the output, taking the input, subtracting them, and squaring it. And this is called a mean squared error between the input and the reconstructed output. And so in the case of images, this is just the pixel by pixel difference between that reconstruction and our original input. And note here, our loss function doesn't have any labels. All we're doing is taking our input and taking the reconstructed output spit out to us at the end of training by our network itself. OK. So we can simplify this plot a little bit by just abstracting away those individual neural layers and saying, OK, we have an encoder, we have a decoder, and we're trying to learn this reconstruction.
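The encode, decode, and compare loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `encode` and `decode` below are hypothetical hand-written stand-ins for the learned encoder and decoder networks, and the four-value input stands in for an image.

```python
def encode(x):
    """Hypothetical encoder: compress 4 input values into 2 latent values."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def decode(z):
    """Hypothetical decoder: expand 2 latent values back into 4 outputs."""
    return [z[0], z[0], z[1], z[1]]

def mse(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

x = [0.0, 0.2, 1.0, 0.8]   # the input, e.g. pixel intensities
z = encode(x)              # low dimensional latent representation
x_hat = decode(z)          # imperfect reconstruction of the input
loss = mse(x, x_hat)       # no labels are involved anywhere
```

In a real autoencoder the encoder and decoder are neural networks, and their weights are updated by backpropagation to drive this loss down.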

Encoder-Decoder (12:12)

And this type of diagram where those layers are abstracted away is something that I'll use throughout the rest of this presentation, and you'll probably also come across as you move forward with looking at these types of models further beyond this course. To take a step back, this idea of using this reconstruction is a very, very powerful idea in taking a step towards this idea of unsupervised learning. We're effectively trying to capture these variables, which could be very interesting, without requiring any sort of labels to our data. And the fact is that because we're lowering the dimensionality of our data into this compressed latent space, the degree to which we perform this compression has a really big effect on how good our reconstructions actually turn out to be. And as you may expect, the smaller that bottleneck is, the fewer latent variables we try to learn, the poorer quality of reconstruction we're going to get out, because effectively, this is a form of compression. And so this idea of the autoencoder is a powerful method, a powerful first step for this idea of representation learning, where we're trying to learn a compressed representation of our input data without any sort of label from the start.

Variational Autoencoder (13:45)

And in this way, we're sort of building this automatic encoding of the data, as well as self-encoding the input data, which is why this term of autoencoder comes into play. From this, from this barebone autoencoder network, we can now build a little bit more and introduce the concept of a variational autoencoder, or VAE, which is more commonly used in actual generative modeling today. To understand the difference between the traditional autoencoder that I just introduced and what we'll see with the variational autoencoder, let's take a closer look at the nature of this latent representation z. So here, with the traditional autoencoder, given some input x, if we pass it through after training, we're always going to get the same output out, no matter how many times we pass in the same input, one input, one output. That's because this encoding and decoding that we're learning is deterministic once the network is fully trained. However, in the case of a variational autoencoder, and more generally, we want to try to learn a better and smoother representation of the input data and actually generate new images that we weren't able to generate before with our autoencoder structure because it was purely deterministic. And so VAEs introduce an element of stochasticity, of randomness to try to be able to now generate new images and also learn more smooth and more complete representations of the latent space. And specifically, what we do with a VAE is we break down the latent space z into a mean and a standard deviation. And the goal of the encoder portion of the network is to output a mean vector and a standard deviation vector which correspond to distributions of these latent variables z. And so here, as you can hopefully begin to appreciate, we're now introducing some element of probability, some element of randomness that will allow us to now generate new data and also build up a more meaningful and more informative latent space itself. 
The key that you'll see and the key here is that by introducing this notion of a probability distribution for each of those latent variables, each latent variable being defined by a mean, standard deviation, we will be able to sample from that latent distribution to now generate new data examples.

VAE loss and regularization (16:36)

OK. So now, because we have introduced this element of probability, both our encoder and decoder architectures or networks are going to be fundamentally probabilistic in their nature. And what that means is that over the course of training, the encoder is trying to infer a probability distribution of the latent space with respect to its input data, while the decoder is trying to infer a new probability distribution over the input space given that same latent distribution. And so when we train these networks, we're going to learn two separate sets of weights, one for the encoder, which I'll denote by phi, and one for the decoder, which is going to be denoted by the variable theta. And our loss function is now going to be a function of those weights, phi and theta. And what you'll see is that now our loss is no longer just constituted by the reconstruction term. We've now introduced this new term, which we'll call the regularization term. And the idea behind the regularization term is that it's going to impose some notion of structure in this probabilistic space. And we'll break it down step by step in a few slides. OK. So just remember that after we define this loss over the course of training, as always, we're trying to optimize the loss with respect to the weights of our network. And the weights are going to iteratively be updated over the course of training the model. To break down this loss term, the reconstruction loss is very similar to what it was before with the autoencoder structure. So in the case of images, you can think about the pixel-wise difference between your input and the reconstructed output. What is more interesting and different here is the nature of this regularization term. So we're going to discuss this in more detail. What you can see is that we have this term d, right?
And it's introducing something about a probability distribution, q, and something about a probability distribution, p. The first thing that I want you to know is that this term d is going to reflect a divergence, a difference between these two probability distributions, q of phi and p of z. First, let's look at the term q of phi of z given x. This is the computation that our encoder is trying to learn. It's a distribution of the latent space given the data x computed by the encoder. And what we do in regularizing this network is place a prior p of z on that latent distribution. And all a prior means is it's some initial hypothesis about what the distribution of these latent variables z could look like. And what that means is it's going to help the network enforce some structure based on this prior, such that the learned latent variables z roughly follow whatever we defined this prior distribution to be. And so when we introduce this regularization term d, we're trying to prevent the network from going too wild or to overfitting on certain restricted parts of the latent space by imposing this enforcement that tries to effectively minimize the distance between our inferred latent distribution and some notion of this prior. And so in practice, we'll see how this helps us smooth out the actual quality of the distributions we learn in the latent space. What turns out to be a common choice for this prior, because I haven't told you anything about how we actually select this prior. In the case of variational autoencoders, a common choice for the prior is a normal Gaussian distribution, meaning that it is centered with a mean of 0 and has a standard deviation and variance of 1. And what this means in practice is that it encourages our encoder to try to place latent variables roughly evenly around the center of this latent space, and distribute its encodings quite smoothly. 
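For reference, the loss described here can be written compactly. The notation follows the symbols used in the lecture (phi, theta, q, p, d); the name L_rec for the reconstruction term is an assumption introduced only for illustration, and the prior is taken to be the normal Gaussian mentioned above.

```latex
\mathcal{L}(\phi, \theta, x)
  = \underbrace{\mathcal{L}_{\mathrm{rec}}\big(x, \hat{x}\big)}_{\text{reconstruction loss}}
  + \underbrace{D\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization term}},
\qquad p(z) = \mathcal{N}(0, I)
```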
And from this, now that we have defined the prior on the latent distribution, we can actually make this divergence, this regularization term, explicit. And with VAEs, what is commonly used is this function called the Kullback-Leibler divergence, or KL divergence. And all it is is a statistical way to measure the divergence, the distance between two distributions. So I want you to think about this term, the KL divergence, as a metric of distance between two probability distributions. A lot of people, myself included, when introduced to VAEs have a question about, OK, you've introduced this idea, said to us, we are defining our prior to be a normal Gaussian.

Why a normal prior? (21:50)

Why? Why? It seems kind of arbitrary. Yes, it's a very convenient function. It's very commonly used. But what effect does this actually have on how well our network regularizes? So let's get some more intuition about this. First, I'd like you to think about what properties we actually want this regularization function to achieve. The first is that we desire this notion of continuity, meaning that if two points are close in a latent space, probably they should relate to similar content that's semantically or functionally related to each other after we decode from that latent space. Secondly, we want our latent space to be complete, meaning that if we do some sampling, we should get something that's reasonable and sensible and meaningful after we do the reconstruction. So what could be consequences of not meeting these two criteria in practice? Well, if we do not have any regularization at all, what this could lead to is that if two points are close in the latent space, they may not end up being similarly decoded, meaning that we don't have that notion of continuity. And likewise, if we have a point in latent space that cannot be meaningfully decoded, meaning it just in this example doesn't really lead to a sensible shape, then we don't have completeness. Our latent space is not very useful for us. What regularization helps us achieve is these two criteria of continuity and completeness. We want to realize points that are close in this lower dimensional space that can be meaningfully decoded and that can reflect some notion of continuity and of actual relatedness after decoding. OK, so with this intuition, now I'll show you how the normal prior can actually help us achieve this type of regularization. Again, going to our very simple example of colors and shapes, simply encoding the latent variables according to a non-regularized probability distribution does not guarantee that we'll achieve both continuity and completeness. 
Specifically, if we have variances, these values sigma, that are too small, what this could result in is distributions that are too narrow, too pointed. So we don't have enough coverage of the latent space. And furthermore, if we say, OK, each latent variable should have a completely different mean, we don't impose any prior on them being centered at mean 0, what this means is that we can have vast discontinuities in our latent space. And so it's not meaningful to traverse the latent space and try to find points that are similar and related. Imposing the normal prior alleviates both of these issues. By encouraging the standard deviations to be 1 and trying to regularize the means to be 0, we can ensure that our different latent variables have some degree of overlap, that our distributions are not too narrow, that they have enough breadth, and therefore encourage our latent space to be regularized and be more complete and smoother. And again, reiterating, this is achieved by centering our means around 0 and regularizing variances to be 1. Note, though, that the greater degree of regularization you impose in the network can adversely affect the quality of your reconstruction. And so there's always going to be a balance in practice between having a good reconstruction and having good regularization that helps you enforce this notion of a smooth and complete latent space by imposing this normal-based regularization. OK. So with that, now we've taken a look at both the reconstruction component and the regularization component of our loss function.
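When the encoder outputs a mean and a standard deviation per latent variable and the prior is a normal Gaussian with mean 0 and variance 1, the KL divergence has a well-known closed form. The sketch below computes it in plain Python; the function name `kl_to_standard_normal` is hypothetical, and the inputs are assumed to be plain lists rather than network outputs.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    Uses the closed form 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2))
    for each independent latent variable.
    """
    return sum(
        0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))
        for m, s in zip(mu, sigma)
    )

matches_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])  # no penalty
shifted_mean = kl_to_standard_normal([1.0], [1.0])             # mean far from 0
too_narrow = kl_to_standard_normal([0.0], [0.1])               # sigma too small
```

The penalty is zero only when the means are 0 and the standard deviations are 1, and it grows both when a mean drifts away from the center and when a distribution becomes too narrow, which are the two failure cases discussed above.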

Reparameterization trick (26:11)

And we've talked about how both the encoder and the decoder are inferring and computing a probability distribution over their respective learning tasks. But one key step that we're missing is how we actually, in practice, can train this network end to end. And what you may notice is that by introducing this mean and variance term, by imposing this probabilistic structure on our latent space, we introduce stochasticity. This is effectively a sampling operation operating over a probability distribution defined by these mu and sigma terms. And what that means is that during backpropagation, we can't effectively backpropagate gradients through this layer because it's stochastic. And so in order to train using backpropagation, we need to do something clever. The breakthrough idea that solved this problem was to actually re-parameterize the sampling layer a little bit so that you divert the stochasticity away from these mu and sigma terms and then ultimately be able to train the network end to end. So as we saw, this notion of probability distribution over mu and sigma squared does not lead to direct backpropagation because of this stochastic nature. What we do instead is now re-parameterize the value of z ever so slightly. And the way we do that is by taking mu, taking sigma independently, trying to learn fixed values of mu, fixed values of sigma, and effectively diverting all the randomness, all the stochasticity, to this value epsilon, where now epsilon is what is actually being drawn from a normal distribution. And what this means is that we can learn a fixed vector of means, a fixed vector of variances, and scale those variances by this random constant, such that we can still enforce learning over a probability distribution by diverting the stochasticity away from those means and sigmas that we actually want to learn during training.
Another way to visualize this is that looking at sort of a broken down flow chart of where these gradients could actually flow through. In the original form, we were trying to go from inputs x through z to a mapping. And the problem we saw was that our probabilistic node z prevented us from doing backpropagation. What the reparameterization does is that it diverts the probabilistic operation completely elsewhere away from the means and sigmas of our latent variables such that we can have a continuous flow of gradients through the latent variable z and actually train these networks end to end.
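The reparameterization described above amounts to computing z as mu plus sigma times epsilon, with epsilon drawn from a standard normal distribution. Here is a minimal sketch in plain Python, assuming mu and sigma are lists of floats standing in for the encoder outputs; the function name `sample_latent` is hypothetical.

```python
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample: z = mu + sigma * epsilon, epsilon ~ N(0, 1).

    All randomness lives in epsilon, so z is a deterministic function of
    mu and sigma once epsilon is drawn. The partial derivatives are
    dz/dmu = 1 and dz/dsigma = epsilon, which is what lets gradients flow
    back to the encoder.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

rng = random.Random(0)
mu = [1.0, -2.0]      # stand-in for the encoder's mean vector
sigma = [0.5, 2.0]    # stand-in for the encoder's standard deviations
z, eps = sample_latent(mu, sigma, rng)
```

In a deep learning framework the same expression is written with tensors, and automatic differentiation treats epsilon as a constant input.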

Variational Autoencoders (VAE) (29:23)

And what is super, super cool about VAEs is that because we have this notion of probability and of these distributions over the latent variables, we can sample from our latent space and actually perturb and tune the values of individual latent variables, keeping everything else fixed, and generate data samples that are perturbed with a single feature or a single latent variable. And you can see that really clearly in this example, where one latent variable is being changed in the reconstructed outputs, and all other variables are fixed. And you can see that this is effectively functioning to tilt the pose of this person's face as a result of that latent perturbation. And these different latent variables that we're trying to learn over the course of training can effectively encode and pick up on different latent features that may be important in our data set. And ideally, our goal is we want to try to maximize the information that we're picking up on through these latent variables, such that one latent variable is picking up on some feature, and another is picking up on a disentangled or separate and uncorrelated feature. And this is this idea of disentanglement. So in this example, we have the head pose changing on the x axis and something about the smile or the shape of the person's lips changing on the y axis. The way we can actually achieve and enforce this disentanglement in practice is actually fairly straightforward. And so if you take a look at the standard loss function for a VAE, again, we have this reconstruction term, a regularization term. And with an architecture called beta-VAEs, all they do is introduce this hyperparameter, this different parameter beta, that effectively controls the strength of how strictly we are regularizing.
And it turns out that if you enforce beta to be greater than 1, you can try to impose a more efficient latent encoding that encourages disentanglement, such that with a standard VAE looking at a value of beta equals 1, you can see that we can enforce the head rotation to be changing, but also the smile to change in conjunction with this. Whereas now, if we look at a beta-VAE with a much higher value of beta, hopefully it's subtle, but you can appreciate that the smile, the shape of the lips is staying relatively the same while only the head pose, the rotation of the head, is changing as a function of latent variable perturbation. OK. So I introduced at the beginning a potential use case of generative models in terms of trying to create more fair and de-biased machine learning models for deployment. And what you will explore in today's lab is practicing this very idea. And it turns out that by using a latent variable model like a VAE, because we're training these networks in a completely unsupervised fashion, we can pick up automatically on the important and underlying latent variables in a data set, such that we can build estimates of the distributions of our data with respect to important features like skin tone, pose, illumination, head rotation, so on and so forth. And what this actually allows us to do is to take this information and go one step forward by using these distributions of these latent features to actually adjust and refine our data set actively during training in order to create a more representative and unbiased data set that will result in a more unbiased model. And so this is the idea that you're going to explore really in depth in today's lab. OK, so to summarize our key points on variational autoencoders, they use a compressed representation of the world to return something that's interpretable in terms of the latent features they're picking up on. They allow for completely unsupervised learning via this reconstruction.
We employ the reparameterization trick to actually train these architectures end to end via backpropagation. We can interpret latent variables using a perturbation function, and also sample from our latent space to actively generate new data samples that have never been seen before.
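The beta-VAE idea mentioned in this section changes only the weight on the regularization term. As a sketch, with L_rec as an assumed name for the reconstruction term and the KL divergence as the regularizer:

```latex
\mathcal{L}_{\beta}(\phi, \theta, x)
  = \mathcal{L}_{\mathrm{rec}}\big(x, \hat{x}\big)
  + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

Setting beta to 1 recovers the standard VAE loss, while beta greater than 1 regularizes more strictly, which is the setting described as encouraging disentanglement.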

GANs, Why Generative Adversarial Networks (34:11)

OK. That being said, the key problem of variational autoencoders is a concern of density estimation, trying to estimate the probability distributions of these latent variables, z. What if we want to ignore that or pay less attention to it and focus on the generation of high quality new samples as our output? For that, we're going to turn and transition to a new type of generative model called GANs, where the goal here is really we don't want to explicitly model the probability density or distribution of our data. We want to care about this implicitly, but use this information mostly to sample really, really realistic and really new instances of data that match our input distribution. The problem here is that our input data is incredibly complex, and it's very, very difficult to go from something so complex and try to generate new, realistic samples directly. And so the key insight and really elegant idea behind GANs is what if instead we start from something super simple, random noise, and use the power of neural networks to learn a transformation from this very simple distribution, random noise, to our target data distribution where we can now sample from. This is really the key breakthrough idea of generative adversarial networks. And the way that GANs do this is by actually creating an overall generative model by having two individual neural networks that are effectively adversaries. They're competing with each other. And specifically, we're going to explore how this architecture involves these two components, a generator network, which is functioning and drawing from a very simple input distribution, purely random noise.

Intuition: GAN (36:09)

And it's trying to use that noise and transform it into an imitation of the real data. And conversely, we have this adversary network, a discriminator, which is going to take samples generated by the generator, entirely fake samples, and is going to predict whether those samples are real or whether they're fake. And we're going to set up a competition between these two networks such that we can try to force the discriminator to classify real and fake data, and to force the generator to produce better and better fake data to try to fool the discriminator. And to show you how this works, we're going to go through one of my absolute favorite illustrations of this class and build up the intuition behind GANs. So we're going to start really simply, right? We're going to have data that's just one-dimensional points on a line, and we begin by feeding the generator completely random noise from this one-dimensional space, producing some fake data, right? The discriminator is then going to see these points together with some real examples. And its task is going to be to try to output a probability that the data it sees are real or if they're fake. And initially when it starts out, the discriminator is not trained at all. So its predictions may not be very good. But then over the course of training, the idea is that we can build up the probability of what is real versus decreasing the probability of what is fake. Now that we've trained our discriminator until we've achieved this point where we get perfect separation between what is real and what is fake, we can go back to the generator. And the generator is now going to see some examples of real data, and as a result of this, it's going to start moving the fake examples closer to the real data, increasingly moving them closer, such that now the discriminator comes back, receives these new points, and it's going to estimate these probabilities that each point is real.
And then iteratively learn to decrease the probability of the fake points. And now we can continue to adjust the probabilities until eventually we repeat again, go back to the generator, and one last time, the generator is going to start moving these fake points closer to the real data and increasingly iteratively closer and closer, such that these fake examples are almost identical following the distribution of the real data.
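
The back-and-forth just described can be sketched in a few lines of plain Python. This is only a toy under stated assumptions, not the setup used in the lecture: the real data are assumed to be one-dimensional points centered at 4.0, the generator is assumed to be a simple shift G(z) = z + b, and the discriminator is assumed to be a logistic function D(x) = sigmoid(w * x + c). The function name train_toy_gan and all parameter values are made up for illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_toy_gan(rounds=30, batch=64, lr=0.1, real_mean=4.0, seed=0):
    """Alternate discriminator and generator updates on 1-D points."""
    rng = random.Random(seed)
    w, c = 0.0, 0.0  # discriminator D(x) = sigmoid(w * x + c), untrained at start
    b = 0.0          # generator G(z) = z + b, so fake points start centered at 0
    for _ in range(rounds):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Discriminator step: gradient ascent on
        # mean log D(real) + mean log(1 - D(fake)).
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = (sum((1 - d) * x for d, x in zip(d_real, real))
                  - sum(d * x for d, x in zip(d_fake, fake))) / batch
        grad_c = (sum(1 - d for d in d_real) - sum(d_fake)) / batch
        w += lr * grad_w
        c += lr * grad_c

        # Generator step: gradient descent on mean log(1 - D(G(z))),
        # which shifts the fake points toward where D says "real".
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_b = -sum(d * w for d in d_fake) / batch
        b -= lr * grad_b
    return b, w, c


shift, w, c = train_toy_gan()
print(f"fake points now centered near {shift:.2f}; real data centered at 4.0")
```

With only a few rounds the fake points move part of the way toward the real ones. Longer runs of a toy like this can oscillate rather than settle, which is one reason GAN training is considered delicate in practice.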

GAN objective (38:45)

Such that now, at this point, at the end of training, it's going to be very hard for the discriminator to effectively distinguish what is real from what is fake, while the generator is going to continue to try to improve the quality of the samples that it's generating in order to fool the discriminator. So this is really the intuition behind how these two components of a GAN are effectively competing with each other to try to maximize the quality of these fake instances that the generator is spitting out. OK. So now, translating that intuition back to our architecture, we have our generator network synthesizing fake data instances to try to fool the discriminator. The discriminator is going to try to identify the synthesized instances, the fake examples, from the real data. And the way we train GANs is by formulating an objective, a loss function that's known as an adversarial objective. And overall, our goal is for our generator to exactly reproduce the true data distribution. That would be the optimum solution. But of course, in practice, it's very difficult to actually achieve this global optimum. But we'll take a closer look at how this loss function works. The loss function, while at first glance it may look a little daunting and scary, actually boils down to concepts that we've already introduced. We're first considering here the objective for the discriminator network, D. And the goal here is that we're trying to maximize the probability that the discriminator identifies fake data as fake and real data as real. And this term, comprising a loss over the fake data and the real data, is effectively a cross-entropy loss between the true distribution and the distribution generated by the generator network. And our goal as the discriminator is to maximize this objective overall.
Conversely, for our generator, we still have the same overall component, this cross-entropy type term, within our loss. But now we're trying to minimize this objective from the perspective of the generator. And because the generator has no direct access to the real data term, D(x), its focus is on minimizing the loss term that depends on D(G(z)), which is effectively minimizing the probability that its generated data is identified as fake.
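
To make the two perspectives concrete, here is a small sketch of the standard adversarial objective. Write D(x) for the discriminator's output on a real example and D(G(z)) for its output on a generated example. The quantity is then V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator tries to make large and the generator tries to make small. The function name gan_value and the example probabilities are assumptions chosen for illustration, not values from the lecture.

```python
import math


def gan_value(d_real, d_fake):
    """Estimate V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] from samples.

    d_real: discriminator outputs D(x) on real examples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated examples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A sharp discriminator: real scored near 1, fake scored near 0.
sharp = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# A fooled discriminator: it cannot tell the two apart and outputs 0.5.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
# The discriminator wants the larger value; the generator wants the smaller.
print(sharp > fooled)
```

Only the second term depends on the generator, which is why the generator's update looks at D(G(z)) alone.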

Expanding Generative Models And Real World Applications

Traversing a Data Manifold (41:53)

So this is our goal for the generator. And overall, we can put this together to try to comprise the overall loss function, the overall min max objective, which has both the term for the generator as well as the term for the discriminator. Now, after we've trained our network, our goal is to really use the generator network once it's fully trained, focusing in on that, and sample from it to create new data instances that have never been seen before. And when we look at the trained generator, the way it's synthesizing these new data instances is effectively going from a distribution of completely random Gaussian noise and learning a function that maps a transformation from that Gaussian noise towards a target data distribution. And this mapping, this approximation, is what the generator is learning over the course of training itself. And so if we consider one point in this distribution, one point in the noise distribution is going to lead to one point in the target distribution. And similarly, now if we consider an independent point, that independent point is going to produce a new instance in the target distribution, falling somewhere else on this data manifold. And what is super, super cool and interesting is that we can actually interpolate and traverse in the noise space to then interpolate and traverse in the target data distribution space. And you can see the result of this interpolation, this traversal, in practice, where in these examples we've transformed this image of a black goose or a black swan on the left to a robin on the right, simply by traversing this input data manifold to result in a traversal in the target data manifold.
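
The traversal described here can be sketched as a straight-line walk between two noise vectors, with each intermediate vector passed through the generator. In the sketch below, toy_generator is a hypothetical stand-in for a trained generator, and the two-dimensional noise vectors are made-up values; only the interpolation step itself is the point.

```python
import math


def interpolate_latents(z_start, z_end, steps):
    """Return `steps` (at least 2) latent vectors evenly spaced from z_start to z_end."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


def toy_generator(z):
    """Hypothetical stand-in for a trained generator G(z); not a real model."""
    return [math.tanh(z[0] + z[1]), math.tanh(z[0] - z[1])]


z_a = [0.0, 2.0]   # one noise sample
z_b = [4.0, -2.0]  # an independent noise sample
samples = [toy_generator(z) for z in interpolate_latents(z_a, z_b, steps=5)]
```

With a trained image generator in place of toy_generator, the list samples would hold the sequence of images morphing from the first output to the second.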

Domain Transformation (43:38)

And this idea of domain transformation and traversal in these complex data manifolds leads us to discuss and consider why GANs are such a powerful architecture and what some examples of their generated data actually can look like. One idea that has been very effective in the practice of building GANs that can synthesize very realistic examples is this idea of progressive growing. The idea here is to effectively add layers to each of the generator and discriminator as a function of training, such that you can iteratively build up more and more detailed image generations as a result of the progression of training.

Practical Advances (44:43)

So you start with a very simple model. And the outputs as a result of this are going to have very low spatial resolution. But if you iteratively add more and more network layers, you can improve the quality and the spatial resolution of the generated images. And this helps also speed up training and result in more stable training as well. And so here are some examples of GAN architecture using this progressive growing idea. And you can see the photorealism of these generated outputs.
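
As a rough sketch of the progressive growing idea, the snippet below only tracks which output resolution the networks would be working at as training proceeds. The doubling of resolution at each stage, the starting size of 4, and the fixed number of steps per stage are assumptions for illustration; the lecture does not specify a schedule.

```python
def resolution_schedule(start=4, final=1024):
    """Resolutions visited as layers are added, doubling each stage (assumed)."""
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    return resolutions


def resolution_at(step, steps_per_stage, schedule):
    """Resolution the networks operate at after `step` training steps."""
    stage = min(step // steps_per_stage, len(schedule) - 1)
    return schedule[stage]


schedule = resolution_schedule(start=4, final=64)
for step in (0, 1000, 2500, 10_000):
    print(step, resolution_at(step, steps_per_stage=1000, schedule=schedule))
```

In a real implementation, each move to the next entry in the schedule would correspond to adding new layers to both the generator and the discriminator.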

Style Transfer (45:22)

Another very interesting advancement was in this idea of style transfer. This has been enabled by some fundamental architecture improvements in the network itself, which we're not going to go into in too much detail. But the idea here is that we're going to actually be able to build up a progressive growing GAN that can also transfer styles, so features and effects from one series of images onto a series of target images. And so you can see that example here, where on one axis we have the target images, and the other axis, the horizontal axis, captures the style of image that we want to transfer onto our target. And the result of such an architecture is really remarkable, where you can see now the input target has effectively been transformed in the style of those source images that we want to draw features from. And as you may have guessed, the images that I showed you at the beginning of the lecture were generated by one of these types of GAN architectures. And these results are very striking in terms of how realistic these images look. You can also clearly extend it to other domains and other examples. And I will note that while I have focused largely on image data in this lecture, this general idea of generative modeling applies to other data modalities as well.

Non-Image Modalities (46:55)

And in fact, many of the more recent and exciting applications of generative models are in moving these types of architectures to new data modalities and new domains to formulate design problems for a variety of application areas. OK, so one final series of architecture improvements that I'll briefly touch on is this idea of trying to impose some more structure on the outputs themselves and to control a little better what these outputs actually look like. And the idea here is to actually impose some sort of conditioning factor, a label that is additionally supplied to the GAN network over the course of training, to be able to steer generation in a more controlled manner. One example application of this is in the instance of paired translation. So here, the network is considering pairs of inputs, for example, a scene, as well as a corresponding segmentation of that scene. And the goal here is to try to train the discriminator accordingly to classify real or fake pairs of scenes and their corresponding segmentations. And so this idea of paired translation can be extended to do things like moving from semantic labels of a scene to generating an image of that scene that matches those labels, going from an aerial view of a street to a map-type output, going from a label to a facade of a building, day to night, black and white to color, edges of an image to a filled-out image. And really, the applications are very wide. And the results are quite impressive in being able to go back and forth and do this sort of paired translation operation, for example, in data from Google Street View shown here.
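
One way to picture the paired setup is to look at what the discriminator is shown: not single images, but pairs made of a conditioning input and an output. The sketch below builds such pairs with labels. The names make_discriminator_pairs and toy_generator, and the tiny lists standing in for segmentation maps and photos, are all hypothetical and only illustrate the bookkeeping.

```python
def make_discriminator_pairs(conditions, real_targets, generator):
    """Build labeled (condition, output) pairs for a conditional discriminator.

    Label 1 marks a real pair from the dataset; label 0 marks a pair whose
    output came from the generator given the same condition.
    """
    batch = []
    for condition, target in zip(conditions, real_targets):
        batch.append(((condition, target), 1))
        batch.append(((condition, generator(condition)), 0))
    return batch


def toy_generator(segmentation):
    """Hypothetical stand-in: 'paints' a scene from a segmentation map."""
    return [10 * class_id for class_id in segmentation]


segmentations = [[0, 1, 1], [2, 2, 0]]  # toy label maps (the condition)
photos = [[3, 12, 11], [19, 22, 1]]     # toy paired real images
pairs = make_discriminator_pairs(segmentations, photos, toy_generator)
```

Because the condition appears in every pair, the discriminator can penalize outputs that look plausible on their own but do not match the input they were supposed to translate.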

Colour (48:56)

And I think this is a fun example, which I'll briefly highlight. This here is looking at coloring from edges of a sketch. And in fact, the data that were used to train this GAN network were images of Pokemon. And these are results that the network was generating from simply looking at images of Pokemon. You can see that that training can actually extend to other types of artwork instances beyond the Pokemon example shown here. Okay, that just replaced it. Okay. The final thing that I'm going to introduce and touch on when it comes to GAN architectures is this cool idea of completely unpaired image-to-image translation.

Targeted Robust Neural Networks, Utilizing Technology

Target Driven Daisy Networks (49:44)

And our goal here is to learn a transformation across domains with completely unpaired data. And the architecture that was introduced a few years ago to do this is called CycleGAN. And the idea here is now we have two generators and two discriminators, where they're effectively operating in their own data distributions. And we're also learning a functional mapping to translate between these two corresponding data distributions and data manifolds. And I'm happy to explain the details of this architecture more extensively. But for the sake of time, I'll just highlight what the outputs of this architecture can look like, where in this example, the task is to translate from images of horses to images of zebras, where you can effectively appreciate these various types of transformations that are occurring as this unpaired translation across domains is occurring. OK. The reason why I highlighted this example is I think that the CycleGAN highlights this idea, this concept, of GANs being very powerful distribution transformers. The original example we introduced was going from Gaussian noise to some target data manifold. And in CycleGAN, our objective is to do something a little more complex, going from one data manifold to another target data manifold, for example, horse domain to zebra domain. More broadly, I think it really highlights this idea that this neural network mapping that we're learning as a function of this generative model is effectively a very powerful distribution transformation. And it turns out that CycleGANs can also extend to other modalities, right, as I alluded to, not just images.
We can effectively look at audio and sequence waveforms to transform speech by taking an audio waveform, converting it into a spectrogram, and then doing that same image-based domain translation, domain transformation, learned by the CycleGAN to translate and transform speech in one domain to speech in another domain. And you may be thinking ahead, but it turns out that this was the exact approach that we used to synthesize the audio of Obama's voice that Alexander showed at the start of the first lecture.

EDU Obama (52:22)

We used a CycleGAN architecture to take Alexander's audio in his voice and convert that audio into a spectrogram waveform, and then use the CycleGAN to translate and transform the spectrogram waveform from his audio domain to that of Obama. So to remind you, I'll just play this output. "Welcome to MIT 6.S191, the official introductory course on deep learning here at MIT." Again, with this sort of architecture, the applications can be very broad and extend to other instances and use cases beyond turning images of horses into images of zebras.
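
The lecture skips the details of CycleGAN, so the following is only a sketch of one ingredient from the published method, the cycle-consistency term, which is not spelled out above. The idea is that translating from one domain to the other and back again should return something close to the starting point. The scalar functions to_y, to_x_exact, and to_x_rough are hypothetical stand-ins for the two generator networks, and the numbers stand in for images or spectrograms.

```python
def cycle_consistency_loss(xs, ys, g_xy, g_yx):
    """Mean absolute error after a round trip X -> Y -> X and Y -> X -> Y."""
    forward = sum(abs(g_yx(g_xy(x)) - x) for x in xs) / len(xs)
    backward = sum(abs(g_xy(g_yx(y)) - y) for y in ys) / len(ys)
    return forward + backward


# Hypothetical scalar "generators" standing in for the two networks.
def to_y(x):
    return 2.0 * x + 1.0


def to_x_exact(y):
    return (y - 1.0) / 2.0


def to_x_rough(y):
    return y / 2.0


xs, ys = [1.0, 2.0], [3.0, 5.0]
print(cycle_consistency_loss(xs, ys, to_y, to_x_exact))  # round trips recover inputs
print(cycle_consistency_loss(xs, ys, to_y, to_x_rough))  # round trips drift
```

In the full method this term is added to the adversarial losses of the two discriminators, which is what allows training without paired examples.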

Hard Robust Games (53:12)

Hi, everybody. OK. All right, so that concludes the core of this lecture and the core technical lectures for today. In this lecture in particular, we touched on these two key generative models, variational autoencoders and autoencoders, where we're looking to build up estimates of lower dimensional probabilistic latent spaces. And secondly, generative adversarial networks, where our goal is really to try to optimize our network to generate new data instances that closely mimic some target distribution. With that, we'll conclude today's lectures. And just a reminder about the lab portion of the course, which is going to follow immediately after this. We have the open office hour sessions in 10-250, where Alexander and I will be there in person as well as virtually in Gather Town.

Final Notes

Closing Remarks (54:00)

Two important reminders. I sent an announcement out this morning about picking up t-shirts and other related swag. We will have that in room 10-250. We're going to move there next, so please be patient as we arrive there. We'll make announcements about availability for the remaining days of the course. The short answer is yes, we'll be available to pick up shirts at later days as well. And yeah, that's basically it for today's lectures. And I hope to see many of you in office hours today, and if not, hopefully for the remainder of the week. Thank you so much.
