MIT 6.S191: Deep Generative Modeling

Transcription for the video titled "MIT 6.S191: Deep Generative Modeling".

1970-01-03T18:45:34.000Z



Opening Remarks

Introduction (00:00)

I'm really really excited about this lecture because as Alexander introduced yesterday, right now we're in this tremendous age of generative AI and today we're going to learn the foundations of deep generative modeling where we're going to talk about building systems that can not only look for patterns in data but can actually go a step beyond this to generate brand new data instances based on those learned patterns. This is an incredibly complex and powerful idea and as I mentioned it's a particular subset of deep learning that has actually really exploded in the past couple of years and this year in particular. So to start and to demonstrate how powerful these algorithms are, let me show you these three different faces. I want you to take a minute, think. Think about which face you think is real. Raise your hand if you think it's face A. Okay, see a couple of people. Face B. Many more people. Face C. About second place. Well, the truth is that all of you are wrong. All three of these faces are fake. These people do not exist. These images were synthesized by deep generative models trained on data of human faces and asked to produce new instances. Now I think that this demonstration kind of demonstrates the power of these ideas and the power of this notion of generative modeling. So let's get a little more concrete about how we can formalize this. So far in this course, we've been looking at what we call problems of supervised learning, meaning that we're given data, and associated with that data is a set of labels. Our goal is to learn a function that maps that data to the labels. Now we're in a course on deep learning so we've been concerned with functional mappings that are defined by deep neural networks, but really that function could be anything. Neural networks are powerful, but we could use other techniques as well. 
In contrast, there's another class of problems in machine learning that we refer to as unsupervised learning, where we're given only data, no labels, and our goal is to try to build some method that can understand the hidden underlying structure of that data. What this allows us to do is it gives us new insights into the foundational representation of the data and, as we'll see later, actually enables us to generate new data instances. Now, this class of problems, this definition of unsupervised learning, captures the types of models that we're going to talk about today in the focus on generative modeling, which is an example of unsupervised learning and is united by this goal of the problem where we're given only samples from a training set, and we want to learn a model that represents the distribution of the data that the model is seeing. Generative modeling takes two general forms. First, density estimation, and second, sample generation. In density estimation, the task is, given some data examples, to train a model that learns an underlying probability distribution that describes where the data came from. With sample generation, the idea is similar, but the focus is more on actually generating new instances. Our goal with sample generation is to again learn this model of this underlying probability distribution, but then use that model to sample from it and generate new instances that are similar to the data that we've seen, ideally falling approximately along that same real data distribution. Now, in both these cases of density estimation and sample generation, the underlying question is the same. Our learning task is to try to build a model that learns a probability distribution that is as close as possible to the true data distribution.
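As a rough illustration of the two forms just described, the following sketch fits a one-dimensional Gaussian to toy data (density estimation) and then draws new points from the fitted model (sample generation). The Gaussian model family, the toy data, and every name in the code are assumptions made for illustration; the models in this lecture are deep neural networks, not a single Gaussian.

```python
import math
import random
import statistics

random.seed(0)

# Toy training set: samples from a distribution the model does not know.
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]

# Density estimation: fit the parameters of a Gaussian to the samples.
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data)

def log_density(x):
    """Log-probability density of x under the fitted Gaussian."""
    z = (x - mu_hat) / sigma_hat
    return -0.5 * z * z - math.log(sigma_hat * math.sqrt(2.0 * math.pi))

# Sample generation: draw new instances from the learned distribution.
new_samples = [random.gauss(mu_hat, sigma_hat) for _ in range(5)]
```

Both tasks rest on the same fitted model: `log_density` scores how plausible a point is, while `new_samples` are fresh draws that should resemble the training data.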
Okay, so with this definition and this concept of generative modeling, what are some ways that we can actually deploy generative modeling forward in the real world for high impact applications? Well, part of the reason that generative models are so powerful is that they have this ability to uncover the underlying features in a data set and encode it in an efficient way. So for example, if we're considering the problem of facial detection and we're given a data set with many, many different faces, starting out without inspecting this data, we may not know what the distribution of faces in this data set is with respect to features we may be caring about. For example, the pose of the head, clothing, glasses, skin tone, hair, etc. And it can be the case that our training data may be very, very biased towards particular features without us even realizing this.


Explored Topics On Latent Variable And Generative Models

Why care about generative models? (05:48)

Using generative models we can actually identify the distributions of these underlying features in a completely automatic way, without any labeling, in order to understand what features may be overrepresented in the data and what features may be underrepresented in the data. And this is the focus of today and tomorrow's software labs, which are going to be part of the software lab competition, developing generative models that can do this task and using them to uncover and diagnose biases that can exist within facial detection models. Another really powerful example is in the case of outlier detection, identifying rare events. So let's consider the example of self-driving autonomous cars. With an autonomous car, let's say it's driving out in the real world, we really really want to make sure that that car is able to handle all the possible scenarios and all the possible cases it may encounter, including edge cases like a deer coming in front of the car or some unexpected rare events, not just, you know, the typical straight freeway driving that it may see the majority of the time. With generative models, we can use this idea of density estimation to be able to identify rare and anomalous events within the training data and as they're occurring, as the model sees them for the first time. So hopefully this paints a picture of what generative modeling, the underlying concept, is and a couple of different ways in which we can actually deploy these ideas for powerful and impactful real-world applications.
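The outlier detection idea above can be sketched with a very small density model. This is only an illustration under stated assumptions: the readings in `train` are made-up numbers, the density is a single fitted Gaussian rather than a deep generative model, and the thresholding rule is one simple choice among many.

```python
import math
import statistics

# Hypothetical "typical" observations, e.g. one sensor reading during normal driving.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

mu = statistics.fmean(train)
sigma = statistics.pstdev(train)

def log_density(x):
    """Log-probability density of x under the Gaussian fitted to `train`."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

# Assumed rule: flag anything less likely than the least likely training point.
threshold = min(log_density(x) for x in train)

def is_outlier(x):
    """True if x falls in a lower-density region than any training example."""
    return log_density(x) < threshold
```

A reading such as `10.05` sits in a high-density region and is not flagged, while a reading such as `14.0` is far out in the tail and is.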


Latent variable models (07:33)

In today's lecture, we're going to focus on a broad class of generative models that we call latent variable models, and specifically distill down into two subtypes of latent variable models. First things first, I've introduced this term latent variable, but I haven't told you or described to you what that actually is. I think a great example, and one of my favorite examples throughout this entire course that gets at this idea of the latent variable, is this little story from Plato's Republic which is known as the myth of the cave. In this myth there is a group of prisoners, and as part of their punishment they're constrained to face a wall. Now the only things the prisoners can observe are shadows of objects that are passing in front of a fire that's behind them, and they're observing the casting of the shadows on the wall of this cave. To the prisoners, those shadows are the only things they see, their observations. They can measure them, they can give them names, because to them that's their reality. But they're unable to directly see the underlying objects, the true factors themselves, that are casting those shadows. Those objects here are like latent variables in machine learning. They're not directly observable, but they're the true underlying features or explanatory factors that create the observed differences and variables that we can see and observe. And this gets at the goal of generative modeling, which is to find ways that we can actually learn these hidden features, these underlying latent variables, even when we're only given observations of the observed data.


Autoencoders (09:30)

So let's start by discussing a very simple generative model that tries to do this through the idea of encoding the data input. The models we're going to talk about are called autoencoders. And to take a look at how an autoencoder works, we'll go through step by step, starting with the first step of taking some raw input data and passing it through a series of neural network layers. Now, the output of this first step is what we refer to as a low dimensional latent space. It's an encoded representation of those underlying features, and that's our goal in trying to train this model and predict those features. The reason a model like this is called an encoder or an autoencoder is that it's mapping the data, x, into this vector of latent variables, z. Now, let's ask ourselves a question. Let's pause for a moment. Why may we care about having this latent variable vector z be in a low dimensional space? Anyone have any ideas? All right, maybe there are some ideas. Yes? The suggestion was that it's more efficient. Yes, that gets at the heart of the question. The idea of having that low dimensional latent space is that it's a very efficient, compact encoding of the rich high dimensional data that we may start with. As you pointed out, right, what this means is that we're able to compress data into this small feature representation, a vector, that captures this compactness and richness without requiring so much memory or so much storage. So how do we actually train the network to learn this latent variable vector? Since we don't have training data for it, since we can't explicitly observe these latent variables z, we need to do something more clever. What the autoencoder does is it builds a way to decode this latent variable vector back up to the original data space, trying to reconstruct the original image from that compressed, efficient latent encoding.
And once again we can use a series of neural network layers, such as convolutional layers or fully connected layers, but now to map back from that lower dimensional space upwards to the input space. This generates a reconstructed output which we can denote as x hat, since it's an imperfect reconstruction of our original input data. To train this network, all we have to do is compare the outputted reconstruction and the original input data and say, how do we make these as similar as possible? We can minimize the distance between that input and our reconstructed output. So for example, for an image we can compare the pixel-wise difference between the input data and the reconstructed output, just subtracting the images from one another and squaring that difference to capture the pixel-wise divergence between the input and the reconstruction. What I hope you'll notice and appreciate is that this definition of the loss doesn't require any labels. The only components of that loss are the original input data x and the reconstructed output x hat. So I've simplified now this diagram by abstracting away those individual neural network layers in both the encoder and decoder components of this. And again, this idea of not requiring any labels gets back to the idea of unsupervised learning, since what we've done is we've been able to learn an encoded quantity, our latent variables, that we cannot observe, without any explicit labels. All we started from was the raw data itself. It turns out that, as the question and answer got at, the dimensionality of the latent space has a huge impact on the quality of the generated reconstructions and how compressed that information bottleneck is. Autoencoding is a form of compression, and so the lower the dimensionality of the latent space, the less good our reconstructions are going to be. But the higher the dimensionality, the less efficient that encoding is going to be.
So to summarize this first part, this idea of an autoencoder is using this bottlenecked, compressed, hidden latent layer to try to bring the network down to learn a compact, efficient representation of the data. We don't require any labels. This is completely unsupervised. And so in this way, we're able to automatically encode information within the data itself to learn this latent space. Auto-encoding information, auto-encoding data. Now, this is a pretty simple model.
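To make the encode, decode, and compare loop concrete, here is a minimal sketch of an autoencoder with a one-dimensional bottleneck, trained by gradient descent on the squared reconstruction error. It assumes a purely linear encoder and decoder, hand-derived gradients, toy data, and an arbitrary learning rate; none of these details come from the lecture, which uses deep networks.

```python
# Toy data: 2-D points that really vary along one direction, x = t * (1, 2).
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w = [0.5, 0.5]  # encoder weights: z = w . x      (2-D input -> 1-D latent)
v = [0.5, 0.5]  # decoder weights: x_hat = v * z  (1-D latent -> 2-D output)

def reconstruction_loss(w, v):
    """Mean squared difference between each input x and its reconstruction x_hat."""
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        x_hat = (v[0] * z, v[1] * z)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

initial_loss = reconstruction_loss(w, v)

learning_rate = 0.02  # assumed value, chosen small enough to be stable here
for _ in range(2000):
    grad_w = [0.0, 0.0]
    grad_v = [0.0, 0.0]
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]              # encode
        err = [v[i] * z - x[i] for i in range(2)]  # x_hat - x
        grad_z = 2.0 * (err[0] * v[0] + err[1] * v[1])
        for i in range(2):
            grad_v[i] += 2.0 * err[i] * z / len(data)
            grad_w[i] += grad_z * x[i] / len(data)
    for i in range(2):
        w[i] -= learning_rate * grad_w[i]
        v[i] -= learning_rate * grad_v[i]

final_loss = reconstruction_loss(w, v)
```

No labels appear anywhere: the only quantities in the loss are the inputs and their reconstructions, and a single latent number is enough here because the toy data has only one underlying factor of variation.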


Variational autoencoders (15:03)

And it turns out that in practice this idea of self-encoding or autoencoding has a bit of a twist on it to allow us to actually generate new examples that are not only reconstructions of the input data itself. And this leads us to the concept of variational autoencoders, or VAEs. With the traditional autoencoder that we just saw, if we pay closer attention to the latent layer, which is shown in that orange salmon color, that latent layer is just a normal layer in the neural network. It's completely deterministic. What that means is once we've trained the network, once the weights are set, anytime we pass a given input in and go back through the latent layer, decode back out, we're going to get the same exact reconstruction. The weights aren't changing, it's deterministic. In contrast, variational autoencoders, VAEs, introduce an element of randomness, a probabilistic twist on this idea of autoencoding. What this will allow us to do is to actually generate new images, or new data instances, that are similar to the input data but not forced to be strict reconstructions. In practice, with the variational autoencoder, we've replaced that single deterministic layer with a random sampling operation. Now instead of learning just the latent variables directly themselves, for each latent variable we define a mean and a standard deviation that captures a probability distribution over that latent variable. What we've done is we've gone from a single vector of latent variables z to a vector of means mu and a vector of standard deviations sigma that parameterize the probability distributions around those latent variables. What this will allow us to do is now sample using this element of randomness, this element of probability, to then obtain a probabilistic representation of the latent space itself.
As you hopefully can tell, right, this is very, very, very similar to the autoencoder itself, but we've just added this probabilistic twist where we can sample in that intermediate space to get these samples of latent variables. Okay, now to get a little more into the depth of how this is actually learned, how this is actually trained. In defining the VAE, we've eliminated this deterministic nature to now have these encoders and decoders that are probabilistic. The encoder is computing a probability distribution of the latent variable z given input data x, while the decoder is doing the inverse, trying to learn a probability distribution back in the input data space given the latent variables z. And we define separate sets of weights, phi and theta, to define the network weights for the encoder and decoder components of the VAE. All right, so when we get now to how we actually optimize and learn the network weights in the VAE, the first step is to define a loss function, right? That's the core element to training a neural network. Our loss is going to be a function of the data and a function of the neural network weights, just like before. But we have these two components, these two terms that define our VAE loss. First we see the reconstruction loss, just like before, where the goal is to capture the difference between our input data and the reconstructed output. And now for the VAE we've introduced a second term to the loss, what we call the regularization term. Often you'll maybe even see this referred to as a VAE loss. And we'll go into describing what this regularization term means and what it's doing. To do that, remember and keep in mind that in all neural network operations our goal is to try to optimize the network weights with respect to the data, with respect to minimizing this objective loss. And so here we're concerned with the network weights phi and theta that define the weights of the encoder and the decoder.
We consider these two terms. First, the reconstruction loss. Again, the reconstruction loss is very very similar, same as before. You can think of it as the error or the likelihood that effectively captures the difference between your input and your outputs. And again we can train this in an unsupervised way, not requiring any labels, to force the latent space and the network to learn how to effectively reconstruct the input data. The second term, the regularization term, is now where things get a bit more interesting. So let's go into this in a little bit more detail. Because we have this probability distribution and we're trying to compute this encoding and then decode back up, as part of regularizing we want to take that inference over the latent distribution and constrain it to behave nicely, if you will. The way we do that is we place what we call a prior on the latent distribution. And what this is is some initial hypothesis or guess about what that latent variable space may look like. This helps us and helps the network to enforce a latent space that roughly tries to follow this prior distribution. And this prior is denoted as p of z, right? That term D, that's effectively the regularization term. It's capturing a distance between our encoding of the latent variables and our prior hypothesis about what the structure of that latent space should look like. So over the course of training we're trying to enforce that each of those latent variables adopts a probability distribution that's similar to that prior.
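Written out, the two-term loss described above has roughly the following shape. The symbol q_phi(z | x) for the encoder's distribution and the generic "reconstruction loss" placeholder are notational assumptions; the lecture itself names only the weights phi and theta, the prior p(z), and the distance term D.

```latex
\mathcal{L}(\phi, \theta, x)
  = \underbrace{(\text{reconstruction loss})}_{\text{input } x \text{ vs. output } \hat{x}}
  + \underbrace{D\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization term}}
```

The first term depends on both the encoder weights phi and the decoder weights theta, while the regularization term depends only on the encoder, since it compares the encoder's distribution over z with the prior.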


Priors on the latent distribution (21:45)

A common choice when training VAEs and developing these models is to enforce the latent variables to be roughly standard normal Gaussian distributions, meaning that they are centered around mean zero and they have a standard deviation of one. What this allows us to do is to encourage the encoder to put the latent variables roughly around a centered space, distributing the encoding smoothly so that we don't get too much divergence away from that smooth space, which can occur if the network tries to cheat and simply memorize the data. By placing the Gaussian standard normal prior on the latent space, we can define a concrete mathematical term that captures the divergence between our encoded latent variables and this prior. And this is called the KL divergence. When our prior is a standard normal, the KL divergence takes the form of the equation that I'm showing up on the screen, but what I want you to really come away with is that the concept of trying to smooth things out and to capture this divergence, this difference between the prior and the latent encoding, is all this KL term is trying to capture. So it's a bit of math and I acknowledge that, but what I want to next go into is really what is the intuition behind this regularization operation. Why do we do this and why does the normal prior in particular work effectively for VAEs? So let's consider what properties we want our latent space to adopt and for this regularization to achieve. The first is this goal of continuity. And what we mean by continuity is that if there are points in the latent space that are close together, ideally, after decoding, we should recover two reconstructions that are similar in content, that make sense given that they're close together. The second key property is this idea of completeness. We don't want there to be gaps in the latent space. We want to be able to decode and sample from the latent space in a way that is smooth and a way that is connected.
To get more concrete, let's ask what could be the consequences of not regularizing our latent space at all? Well, if we don't regularize, we can end up with instances where there are points that are close in the latent space but don't end up with similar decodings or similar reconstructions. Similarly, we could have points that don't lead to meaningful reconstructions at all. They're somehow encoded, but we can't decode effectively. Regularization allows us to realize points that end up close in the latent space and also are similarly reconstructed and meaningfully reconstructed. Okay, so continuing with this example, the example that I showed there and didn't get into details on was showing these shapes, these shapes of different colors that we're trying to encode in some lower dimensional space. With regularization, we are able to achieve this by trying to minimize that regularization term. It's not sufficient to just employ the reconstruction loss alone to achieve this continuity and this completeness, because without regularization, just encoding and reconstructing does not guarantee the properties of continuity and completeness. We overcome these issues of having potentially pointed distributions, having discontinuities, having disparate means that could end up in the latent space without the effect of regularization. We overcome this by now regularizing the mean and the variance of the encoded latent distributions according to that normal prior. What this allows is for the learned distributions of those latent variables to effectively overlap in the latent space, because everything is regularized to have, according to this prior, mean zero and standard deviation one. That centers the means and regularizes the variances for each of those independent latent variable distributions. Together, the effect of this regularization in net is that we can achieve continuity and completeness in the latent space.
Points and distances that are close should correspond to similar reconstructions that we get out. So hopefully this gets at some of the intuition behind the idea of the VAE, behind the idea of the regularization, and trying to enforce the structured normal prior on the latent space. With this in hand, with the two components of our loss function, reconstructing the inputs and regularizing learning to try to achieve continuity and completeness, we can now think about how we define a forward pass through the network, going from an input example and being able to decode and sample from the latent variables to look at new examples. Our last critical step is how the actual backpropagation training algorithm is defined and how we achieve this. The key, as I introduced with VAEs, is this notion of randomness, of sampling, that we have introduced by defining these probability distributions over each of the latent variables. The problem this gives us is that we cannot backpropagate directly through anything that has an element of sampling, anything that has an element of randomness. Backpropagation requires completely deterministic nodes, deterministic layers, to be able to successfully apply gradient descent and the backpropagation algorithm.
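The equation on screen is not captured in this transcript. As a stand-in, the sketch below uses the standard closed form for the KL divergence between a Gaussian with independent dimensions and a standard normal prior, which is -1/2 times the sum over latent dimensions of (1 + log sigma^2 - mu^2 - sigma^2). The function name and the list-based interface are assumptions for illustration.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over independent latent dimensions.

    mu and sigma are equal-length lists: one mean and one (positive)
    standard deviation per latent variable.
    """
    return -0.5 * sum(
        1.0 + math.log(s * s) - m * m - s * s
        for m, s in zip(mu, sigma)
    )
```

The term is zero exactly when every latent variable already has mean zero and standard deviation one, and it grows as the means drift away from zero or the standard deviations move away from one, which is the smoothing pressure described above.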


Reparameterization trick (28:16)

The breakthrough idea that enabled VAEs to be trained completely end-to-end was this idea of reparameterization within that sampling layer. And I'll give you the key idea about how this operation works. It's actually really quite clever. So as I said, when we have a notion of randomness, of probability, we can't backpropagate directly through that layer. Instead, with reparameterization, what we do is we redefine how a latent variable vector is sampled as a sum of a fixed deterministic mean mu and a fixed vector of standard deviations sigma scaled by a random term. And now the trick is that we divert all the randomness, all the sampling, to a random constant, epsilon, that's drawn from a normal distribution. So the mean itself is fixed, the standard deviation is fixed, and all the randomness and the sampling occurs according to that epsilon constant. We can then scale the standard deviation by that random constant and add it to the mean to re-achieve the sampling operation within the latent variables themselves. What this actually looks like, and an illustration that breaks down this concept of reparameterization and diversion, is as follows. So looking here, right, what I've shown is these completely deterministic steps in blue, and the sampling random steps in orange. Originally, if our latent variables are what effectively are capturing the randomness, the sampling themselves, we have this problem in that we can't backpropagate, we can't train directly through anything that has stochasticity, that has randomness. What reparameterization allows us to do is it shifts this diagram, where now we've completely diverted that sampling operation off to the side to this constant epsilon, which is drawn from a normal prior. And now, when we look back at our latent variable, it is deterministic with respect to that sampling operation.
What this means is that we can backpropagate to update our network weights completely end-to-end without having to worry about direct randomness, direct stochasticity within those latent variables z. This trick is really really powerful because it enabled the ability to train these VAEs completely end-to-end through the backpropagation algorithm.
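The reparameterized sampling step can be sketched in a few lines. This is a minimal illustration with assumed names and plain Python lists; in a real VAE, mu and sigma would be outputs of the encoder network and the gradients would be computed by an automatic differentiation framework.

```python
import random
import statistics

def sample_latent(mu, sigma, rng):
    """Reparameterized sample: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)."""
    epsilon = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, epsilon)]
    return z, epsilon

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 0.1]
z, epsilon = sample_latent(mu, sigma, rng)

# For a fixed epsilon, z is a deterministic function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = epsilon, so gradients can flow to both.
grad_z_wrt_mu = [1.0 for _ in mu]
grad_z_wrt_sigma = list(epsilon)

# Sanity check: repeated draws of the first latent variable still follow N(mu, sigma^2).
draws = [sample_latent(mu, sigma, rng)[0][0] for _ in range(20_000)]
```

All of the randomness lives in `epsilon`, which has no trainable parameters, so nothing that needs a gradient sits behind a sampling operation.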


Latent perturbation and disentanglement (31:05)

All right, so at this point we've gone through the core architecture of VAEs, we've introduced these two terms of the loss, we've seen how we can train it end-to-end. Now let's consider what these latent variables are actually capturing and what they represent. When we impose this distributional prior, what it allows us to do is to sample effectively from the latent space and actually slowly perturb the value of single latent variables, keeping the other ones fixed. And what you can observe and what you can see here is that by doing that perturbation, that tuning of the value of the latent variables, we can run the decoder of the VAE every time, reconstruct the output every time we do that tuning, and what you'll see hopefully with this example with the face is that an individual latent variable is capturing something semantically informative, something meaningful. And we see that by this perturbation, by this tuning. In this example, the face, as you hopefully can appreciate, is shifting. The pose is shifting. And all this is driven by the perturbation of a single latent variable, tuning the value of that latent variable and seeing how that affects the decoded reconstruction. The network is actually able to learn these different encoded features, these different latent variables, such that by perturbing the values of them individually, we can interpret and make sense of what those latent variables mean and what they represent. To make this more concrete, right, we can consider even multiple latent variables simultaneously and compare one against the other. And ideally we want those latent features to be as independent as possible in order to get at the richest and most compact encoding. So here again in this example of faces we're walking along two axes, head pose on the x-axis and what appears to be kind of a notion of a smile on the y-axis.
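The perturbation procedure described above amounts to a small loop around the decoder. The sketch below is an illustration only: `traverse_latent` is a hypothetical helper, and `toy_decoder` is a made-up stand-in for a trained VAE decoder, chosen so that each output depends on exactly one latent variable.

```python
def traverse_latent(decoder, z, index, values):
    """Decode copies of z in which only z[index] is set to each value in turn."""
    outputs = []
    for value in values:
        z_perturbed = list(z)         # keep every other latent variable fixed
        z_perturbed[index] = value    # tune a single latent variable
        outputs.append(decoder(z_perturbed))
    return outputs

def toy_decoder(z):
    """Hypothetical decoder: output 0 depends only on z[0], output 1 only on z[1]."""
    return [2.0 * z[0], z[1] + 1.0]

frames = traverse_latent(toy_decoder, z=[0.0, 0.5], index=0, values=[-1.0, 0.0, 1.0])
```

With this perfectly disentangled toy decoder, sweeping the first latent variable changes only the first output and leaves the second untouched; with a trained VAE on faces, the analogous sweep is what produces the gradual change in head pose.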
And you can see that with these reconstructions, we can actually perturb these features to be able to perturb the end effect in the reconstructed space. And so ultimately, with a VAE, our goal is to try to enforce as much information to be captured in that encoding as possible. We want these latent features to be independent and ideally disentangled. It turns out that there is a very clever and simple way to try to encourage this independence and this disentanglement. While this may look a little complicated with the math and a bit scary, I will break this down with the idea of how a very simple concept enforces this independent latent encoding and this disentanglement. All this term is showing is those two components of the loss, the reconstruction term, the regularization term. That's what I want you to focus on. The idea of latent space disentanglement really arose with this concept of beta VAEs. What beta VAEs do is they introduce this parameter, beta. And what it is, it's a weighting constant. The weighting constant controls how powerful that regularization term is in the overall loss of the VAE. And it turns out that by increasing the value of beta, you can try to encourage greater disentanglement, more efficient encoding to enforce these latent variables to be uncorrelated with each other. Now, if you're interested in mathematically why beta VAEs enforce this disentanglement, there are many papers in the literature and proofs and discussions as to why this occurs, and we can point you in those directions. 
But to get a sense of what this actually affects downstream, when we look at face reconstruction as a task of interest, with the standard VAE, no beta term or rather a beta of one, you can hopefully appreciate that the rotation of the head, the pose, actually ends up being correlated with the smile, the mouth expression and the mouth position, in that as the head pose is changing, the apparent smile or the position of the mouth is also changing. But with beta VAEs, empirically we can observe that by imposing these beta values much much much greater than one, we can try to enforce greater disentanglement, where now we can consider only a single latent variable, head pose, and the smile, the position of the mouth in these images, is more constant compared to the standard VAE. All right, so this is really all the core math, the core operations, the core architecture of VAEs that we're going to cover in today's lecture and in this class in general.
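The role of beta as a weighting constant can be sketched as follows. The function name and the choice to pass the two loss terms in as precomputed numbers are assumptions for illustration.

```python
def beta_vae_loss(reconstruction_loss, regularization_term, beta=1.0):
    """Weighted VAE objective: beta = 1 gives the standard VAE loss,
    while beta > 1 puts more weight on the regularization term."""
    return reconstruction_loss + beta * regularization_term

standard = beta_vae_loss(2.0, 0.5)               # beta of one
disentangling = beta_vae_loss(2.0, 0.5, beta=4.0)
```

Raising beta does not change either term; it only changes how strongly the optimizer is pushed to match the prior relative to reconstructing the input, which is the trade-off behind the greater disentanglement observed empirically.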


Debiasing with VAEs (36:37)

To close this section and as a final note, I want to remind you back to the motivating example that I introduced at the beginning of this lecture, facial detection. Where now hopefully you've understood this concept of latent variable learning and encoding and how this may be useful for a task like facial detection where we may want to learn those distributions of the underlying features in the data. And indeed you're going to get hands-on practice in the software labs to build variational autoencoders that can automatically uncover features underlying facial detection datasets and use this to actually understand underlying and hidden biases that may exist with those data and with those models. And it doesn't just stop there. Tomorrow we'll have a very very exciting guest lecture on robust and trustworthy deep learning which will take this concept a step further to realize how we can use this idea of generative models and latent variable learning to not only uncover and diagnose biases, but actually solve and mitigate some of those harmful effects of those biases in neural networks for facial detection and other applications. All right, so to summarize quickly the key points of VAEs, we've gone through how they are able to compress data into this compact encoded representation. From this representation we can generate reconstructions of the input in a completely unsupervised fashion. We can train them end-to-end using the re-parameterization trick. We can understand the semantic interpretation of individual latent variables by perturbing their values. And finally we can sample from the latent space to generate new examples by passing back up through the decoder. So VAEs are looking at this idea of latent variable encoding and density estimation as their core problem. What if now we only focus on the quality of the generated samples, and that's the task that we care more about?


Generative adversarial networks (38:55)

For that, we're going to transition to a new type of generative model called a generative adversarial network, or GAN. With GANs, our goal is really that we care more about how well we generate new instances that are similar to the existing data, meaning that we want to try to sample from a potentially very complex distribution that the model is trying to approximate. It can be extremely, extremely difficult to learn that distribution directly, because it's complex, it's high dimensional, and we want to be able to get around that complexity. What GANs do is they say, OK, what if we start from something super, super simple, as simple as it can get, completely random noise? Could we build a neural network architecture that can learn to generate synthetic examples from complete random noise? And this is the underlying concept of GANs, where the goal is to train this generator network that learns a transformation from noise to the training data distribution with the goal of making the generated examples as close to the real deal as possible. With GANs the breakthrough idea here was to interface these two neural networks together, one being a generator and one being a discriminator. And these two components, the generator and discriminator, are at war, in competition with each other. Specifically, the goal of the generator network is to look at random noise and try to produce an imitation of the data that's as close to real as possible. The discriminator then takes the output of the generator as well as some real data examples and tries to learn a classification decision distinguishing real from fake. And effectively in the GAN these two components are going back and forth, competing with each other, trying to force the discriminator to better learn this distinction between real and fake, while the generator is trying to fool and outperform the ability of the discriminator to make that classification.


Intuitions behind GANs (41:25)

So that's the overarching concept, but what I'm really excited about is the next example, which is one of my absolute favorite illustrations and walkthroughs in this class, and it gets at the intuition behind GANs, how they work, and the underlying concept. Okay, we're going to look at a 1D example, points on a line, right? That's the data that we're working with. And again, the generator starts from random noise, produces some fake data, and they're going to fall somewhere on this one-dimensional line. Now, the next step is the discriminator then sees these points, and it also sees some real data. The goal of the discriminator is to be trained to output a probability that an instance it sees is real or fake. And initially, in the beginning, before training, it's not trained, right? So its predictions may not be very good. But over the course of training, you're going to train it, and it hopefully will start increasing the probability for those examples that are real, and decreasing the probability for those examples that are fake. The overall goal is to predict what is real, until eventually the discriminator reaches this point where it has a perfect separation, perfect classification of real versus fake. OK, so at this point, the discriminator thinks, OK, I've done my job. Now we go back to the generator, and it sees the examples of where the real data lie. And it can be forced to start moving its generated fake data closer and closer, increasingly closer, to the real data. We can then go back to the discriminator, which receives these newly synthesized examples from the generator, and repeats that same process of estimating the probability that any given point is real, and learning to increase the probability of the true real examples, decrease the probability of the fake points, adjusting, adjusting over the course of its training. 
And finally we can go back and repeat to the generator again, one last time, the generator starts moving those fake points closer, closer, and closer to the real data, such that the fake data is almost following the distribution of the real data. At this point, it becomes very, very hard for the discriminator to distinguish between what is real and what is fake, while the generator will continue to try to create fake data points to fool the discriminator. This is really the key concept, the underlying intuition behind how the components of the GAN are essentially competing with each other, going back and forth between the generator and the discriminator.
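The back-and-forth in this 1D walkthrough can be imitated with a very small toy program. The sketch below is an assumption-laden illustration, not the lecture's implementation: the discriminator is a logistic regression D(x) = sigmoid(w * x + c) read as the probability that x is real, the generator is just a learned shift G(z) = z + shift, the real data are placed around a hypothetical location of 4.0, and the generator step maximizes log D(G(z)), a commonly used variant of the generator objective.

```python
import math
import random

REAL_MEAN = 4.0  # hypothetical location of the real data on the line

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def mean(values):
    return sum(values) / len(values)

def train_discriminator(real, fake, w, c, lr=0.1, steps=200):
    """Gradient ascent on the log-likelihood of the real/fake labels:
    raise D on real points, lower D on fake points."""
    n = len(real) + len(fake)
    for _ in range(steps):
        grad_w = grad_c = 0.0
        for x in real:
            p = sigmoid(w * x + c)
            grad_w += (1.0 - p) * x   # d/dw of log D(x)
            grad_c += 1.0 - p
        for x in fake:
            p = sigmoid(w * x + c)
            grad_w -= p * x           # d/dw of log(1 - D(x))
            grad_c -= p
        w += lr * grad_w / n
        c += lr * grad_c / n
    return w, c

def train_generator(noise, shift, w, c, lr=0.1, steps=50):
    """Gradient ascent on mean log D(G(z)) for the shift-only generator,
    with the discriminator held fixed."""
    for _ in range(steps):
        grad = mean([(1.0 - sigmoid(w * (z + shift) + c)) * w for z in noise])
        shift += lr * grad
    return shift

def run(rounds=3, n=200, seed=0):
    """Alternate discriminator and generator phases; return the mean of the
    generated points before training and after each round."""
    rng = random.Random(seed)
    real = [rng.gauss(REAL_MEAN, 0.5) for _ in range(n)]
    noise = [rng.gauss(0.0, 0.5) for _ in range(n)]  # fixed noise sample, for simplicity
    w, c, shift = 0.0, 0.0, 0.0
    fake_means = [mean(noise) + shift]
    for _ in range(rounds):
        fake = [z + shift for z in noise]
        w, c = train_discriminator(real, fake, w, c)
        shift = train_generator(noise, shift, w, c)
        fake_means.append(mean(noise) + shift)
    return fake_means

if __name__ == "__main__":
    for i, m in enumerate(run()):
        print(f"after round {i}: mean of generated points = {m:.2f}")
```

Because this discriminator is linear and the generator can only slide its points along the line, the toy captures the "move the fake points toward the real ones" intuition but not the full distribution matching that a real GAN aims for. Later rounds can overshoot and oscillate, which is typical of adversarial training.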


Training GANs (44:25)

And in fact, this intuitive concept is how the GAN is trained in practice, where the generator first tries to synthesize new examples, synthetic examples, to fool the discriminator. And the goal of the discriminator is to take both the fake examples and the real data to try to identify the synthesized instances. In training, what this means is that the objectives, the losses for the generator and discriminator, have to be at odds with each other. They're adversarial. And that is what gives rise to the "adversarial" in generative adversarial network. These adversarial objectives are then put together to define what it means to arrive at a stable global optimum, where the generator is capable of producing the true data distribution that would completely fool the discriminator. Concretely, this can be defined mathematically in terms of a loss objective, and again, though I'm showing math, we can distill this down and go through what each of these terms reflects in terms of that core intuitive idea and conceptual idea that hopefully that 1D example conveyed. So we'll first consider the perspective of the discriminator D. Its goal is to maximize the probability that its decisions are correct: that real data are classified as real and fake data are classified as fake. So here, in the first term, G of z is the generator's output, and D of G of z is the discriminator's estimate of that generated output as being fake. D of x, x is the real data, and so D of x is the estimate of the probability that a real instance is fake. One minus D of x is the estimate that that real instance is real. So here, in both these cases, the discriminator is producing a decision about fake data, real data, and together it wants to try to maximize the probability that it's getting answers correct, right? Now with the generator we have those same exact terms, but keep in mind the generator is never able to affect anything that the discriminator's decision is actually doing besides generating new data examples. 
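The two terms just described can be written out as a small numeric sketch. It follows the convention used in this part of the lecture, where the discriminator's output is read as the probability that its input is fake, and it assumes the usual log form of the objective, E[log D(G(z))] + E[log(1 - D(x))]. The function name and the example probabilities are made up for illustration.

```python
import math

def gan_objective(d_on_fake, d_on_real):
    """Mean log D(G(z)) plus mean log(1 - D(x)).

    Convention (as in this section): D outputs the probability that its input
    is FAKE, so D(G(z)) should be high and D(x) should be low for a good
    discriminator. The discriminator tries to maximize this value.
    """
    fake_term = sum(math.log(p) for p in d_on_fake) / len(d_on_fake)
    real_term = sum(math.log(1.0 - p) for p in d_on_real) / len(d_on_real)
    return fake_term + real_term

# A discriminator that is mostly right scores higher than one that is guessing.
confident = gan_objective(d_on_fake=[0.9, 0.8], d_on_real=[0.1, 0.2])
guessing = gan_objective(d_on_fake=[0.5, 0.5], d_on_real=[0.5, 0.5])

# The generator only influences the first term: better fakes push D(G(z)) down,
# which lowers the value the discriminator is trying to maximize.
fooled = gan_objective(d_on_fake=[0.2, 0.1], d_on_real=[0.1, 0.2])
print(confident, guessing, fooled)
```

Many references use the opposite convention, with D giving the probability that an input is real. The two are equivalent after swapping D for 1 - D.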
So for the generator, its objective is simply to minimize the probability that the generated data is identified as fake. Together we want to then put this together to define what it means for the generator to synthesize fake images that hopefully fool the discriminator. All in all, right, besides the math, besides the particularities of this definition, what I want you to take away from this section on GANs is that we have this dual competing objective where the generator is trying to synthesize these synthetic examples that ideally fool the best discriminator possible. And in doing so, the goal is to build up a network via this adversarial training, this adversarial competition, to use the generator to create new data that best mimics the true data distribution and consists of completely synthetic new instances. What this amounts to in practice is that after the training process you can look exclusively at the generator component and use it to then create new data instances. All this is done by starting from random noise and trying to learn a model that goes from random noise to the real data distribution. And effectively what GANs are doing is learning a function that transforms that distribution of random noise to some target. What this mapping does is it allows us to take a particular observation of noise in that noise space and map it to some output, a particular output in our target data space. And in turn, if we consider some other random sample of noise, if we feed it through the generator of the GAN, it's going to produce a completely new instance falling somewhere else on that true data distribution manifold. And indeed what we can actually do is interpolate and traverse between trajectories in the noise space that then map to traversals and interpolations in the target data space. 
And this is really really cool because now you can think about an initial point and a target point and all the steps that are going to take you to synthesize and go between those images in that target data distribution. So hopefully this gives a sense of this concept of generative modeling for the purpose of creating new data instances.
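As a rough sketch of that traversal, the snippet below walks in a straight line between two noise vectors and passes every intermediate point through a generator. Straight-line interpolation is an assumed choice, and `toy_generator` is a hypothetical stand-in for a trained generator, included only so that the example runs.

```python
def interpolate(z_start, z_end, steps):
    """Evenly spaced points on the straight line from z_start to z_end,
    endpoints included. Requires steps >= 2."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

def toy_generator(z):
    """Hypothetical stand-in for a trained generator: any fixed function
    from noise space to data space would do for this illustration."""
    return [2.0 * z[0] + 1.0, z[0] * z[1]]

path = interpolate([0.0, 1.0], [1.0, -1.0], steps=5)
outputs = [toy_generator(z) for z in path]
for z, x in zip(path, outputs):
    print(z, "->", x)
```

With a real image generator, `outputs` would be a sequence of images that morph gradually from the one generated at the initial noise point to the one generated at the target noise point.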


GANs: Recent advances (50:07)

And that notion of interpolation and data transformation leads very nicely to some of the recent advances and applications of GANs, where one particularly commonly employed idea is to try to iteratively grow the GAN to get more and more detailed image generations, progressively adding layers over the course of training to then refine the examples generated by the generator. And this is the approach that was used to generate those synthetic, those images of those synthetic faces that I showed at the beginning of this lecture. This idea of using a GAN that is refined iteratively to produce higher resolution images.
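One way to picture this iterative growth is as a schedule of output resolutions, with new layers added each time the resolution increases. The sketch below only produces such a schedule. The doubling rule and the 4 and 1024 pixel endpoints are illustrative assumptions, and the actual layer additions and training loop are not shown.

```python
def resolution_schedule(start=4, final=1024):
    """Resolutions visited if the image side length doubles at each growth stage.
    The default endpoints are illustrative assumptions."""
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    return resolutions

for stage, res in enumerate(resolution_schedule()):
    print(f"stage {stage}: train generator and discriminator at {res}x{res}")
```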


Coverage Of GAN Culture

Conditioning GANs on a specific label (50:55)

Another way we can extend this concept is to extend the GAN architecture to consider particular tasks and impose further structure on the network itself. One particular idea is to say, okay, what if we have a particular label or some factor that we want to condition the generation on? We call this C, and it's supplied to both the generator and the discriminator. What this will allow us to achieve is paired translation between different types of data. So for example, we can have images of a street view, and we can have images of the segmentation of that street view. And we can build a GAN that can directly translate between the street view and the segmentation. Let's make this more concrete by considering some particular examples. So what I just described was going from a segmentation label to a street scene. We can also translate between a satellite view, aerial satellite image, to what is the roadmap equivalent of that aerial satellite image, or particular annotation or labels of the image of a building to the actual visual realization and visual facade of that building. We can translate between different lighting conditions, day to night, black and white to color, outlines to a colored photo. All these cases, and I think in particular the most interesting and impactful to me is this translation between street view and aerial view. And this is used to consider, for example, if you have data from Google Maps, how you can go between a street view of the map to the aerial image of that. Finally, again extending the same concept of translation between one domain to another, another idea is that of completely unpaired translation, and this uses a particular GAN architecture called CycleGAN.
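The statement that the condition C is supplied to both the generator and the discriminator can be sketched very simply. In the snippet below the condition is a class label encoded as a one-hot vector and appended to each network's input. Concatenation is one common way to do this and is an assumption here. In the image translation examples above the condition would itself be an image rather than a label. All function names are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode an integer label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, label, num_classes):
    """Noise vector with the condition c appended: the generator sees (z, c)."""
    return list(z) + one_hot(label, num_classes)

def discriminator_input(x, label, num_classes):
    """Data sample with the same condition appended: the discriminator sees (x, c),
    so it can judge whether x is realistic for that particular condition."""
    return list(x) + one_hot(label, num_classes)

z = [0.3, -1.2, 0.7]
x = [0.9, 0.1]
print(generator_input(z, label=2, num_classes=4))
print(discriminator_input(x, label=2, num_classes=4))
```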


CycleGAN of unpaired translation (53:02)

So in this video that I'm showing here, the model takes as input a bunch of images in one domain, and it doesn't necessarily have to have a corresponding image in another target domain, but it is trained to try to generate examples in that target domain that roughly correspond to the source domain, transferring the style of the source onto the target and vice versa. So this example is showing the translation of images in the horse domain to the zebra domain. The concept here is this cyclic dependency. You have two GANs that are connected together via this cyclic loss, transforming between one domain and another. And really, like all the examples that we've seen so far in this lecture, the intuition is this idea of distribution transformation. Normally with a GAN you're going from noise to some target. With the CycleGAN you're trying to go from some source distribution, some data manifold X, to a target distribution, another data manifold Y. And this is really, really not only cool, but also powerful in thinking about how we can translate across these different distributions flexibly. And in fact, this allows us to do transformations not only to images, but to speech and audio as well. So in the case of speech and audio, it turns out that you can take sound waves, represent them compactly in a spectrogram image, and use a CycleGAN to then translate and transform speech from one person's voice in one domain to another person's voice in another domain, right? These are two independent data distributions that we define. Maybe you're getting a sense of what I'm hinting at, maybe not, but in fact this was exactly how we developed the model to synthesize the audio behind Obama's voice that we saw in yesterday's introductory lecture. What we did was we trained a CycleGAN to take data in Alexander's voice and transform it into data in the manifold of Obama's voice. 
So we can visualize what that spectrogram waveform looks like for Alexander's voice versus Obama's voice that was completely synthesized using this CycleGAN approach. "Hi everybody and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT." "Hi everybody." I replayed it. OK. But basically what we did was Alexander spoke that exact phrase that was played yesterday, and we had the trained CycleGAN model. And we can deploy it then on that exact audio to transform it from the domain of Alexander's voice to Obama's voice, generating the synthetic audio that was played for that video clip. Alright, okay, before I accidentally play it again, I'll jump now to the summary slide.
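Returning to the cyclic loss that connects the two GANs in a CycleGAN, the idea can be sketched as a round-trip penalty: translating from X to Y and back again should return something close to the starting point, and likewise from Y to X and back. The sketch below assumes a mean absolute error penalty and uses simple scalar functions as hypothetical stand-ins for the two generators. The adversarial terms of the full training objective are left out.

```python
def cycle_consistency_loss(xs, ys, g, f):
    """Mean absolute round-trip error.

    g maps domain X to domain Y and f maps Y back to X. The loss is small when
    f(g(x)) is close to x and g(f(y)) is close to y.
    """
    forward = sum(abs(f(g(x)) - x) for x in xs) / len(xs)
    backward = sum(abs(g(f(y)) - y) for y in ys) / len(ys)
    return forward + backward

# Hypothetical scalar "generators" standing in for the two networks.
def g(x):
    return 2.0 * x + 1.0

def f_consistent(y):
    return (y - 1.0) / 2.0   # exact inverse of g

def f_inconsistent(y):
    return y / 2.0           # not an inverse of g

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0]
print(cycle_consistency_loss(xs, ys, g, f_consistent))    # round trips return to the start
print(cycle_consistency_loss(xs, ys, g, f_inconsistent))  # round trips drift
```

In the voice example, X and Y would be spectrograms of the two speakers, and this penalty is what ties the unpaired domains together in the absence of matched examples.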


Summary of VAEs and GANs (56:39)

So today in this lecture we've learned about deep generative models, specifically talking mostly about latent variable models, autoencoders, variational autoencoders, where our goal is to learn this low dimensional latent encoding of the data, as well as generative adversarial networks, where we have these competing generator and discriminator components that are trying to synthesize synthetic examples. We've talked about these core foundational generative methods, but it turns out, as I alluded to in the beginning of the lecture, that in this past year in particular we've seen truly, truly tremendous advances in generative modeling, many of which have not been from those two foundational methods that we described, but rather a new approach called diffusion modeling.


Preview Of Other Models

Diffusion Model sneak peek (57:17)

Diffusion models are the driving tools behind the tremendous advances in generative AI that we've seen in this past year in particular. These GANs, they're learning these transformations, these encodings, but they're largely restricted to generating examples that fall close to the data space that they've seen before. Diffusion models have this ability to now hallucinate and envision and imagine completely new objects and instances which we as humans may not have seen or even thought about, right? Parts of the design space that are not covered by the training data. So an example here is this AI-generated art, art if you will, right, which was created by a diffusion model. And I think not only does this get at some of the limits and capabilities of these powerful models, but also questions about what it means to create new instances. What are the limits and bounds of these models, and how can we think about their advances with respect to human capabilities and human intelligence? And so I'm really excited that on Thursday in Lecture 7 on New Frontiers in Deep Learning, we're going to take a really deep dive into diffusion models, talk about their fundamentals, talk about not only applications to images, but other fields as well in which we're seeing these models really start to make transformative advances, because they are indeed at the very cutting edge and very much the new frontier of generative AI today. All right, so with that tease, I hope I've set the stage for Lecture 7 on Thursday, and I'll conclude and remind you all that we have now about an hour of open office hour time for you to work on your software labs. Come to us, ask any questions you may have, as well as the TAs who will be here as well. Thank you so much.

