MIT 6.S191 (2018): Deep Generative Modeling

Transcription for the video titled "MIT 6.S191 (2018): Deep Generative Modeling".




Intro (00:00)

All right. Thanks very much for the invitation to speak to you. Yeah, so I'm going to be talking about deep generative models. So when we talk about deep generative models, what we're really talking about here, from my point of view, is essentially training neural nets from training examples in order to represent the distribution from which these came. So we can think about this as either explicitly doing density estimation, where we have some samples here and we try to model those samples with some density estimate, or we can think of it more like this, which is what I'll actually be doing a lot more of, where essentially we're worried about sample generation. So we have some training examples like this, which are just natural images from the world. And we're asking a model to learn to output images like this. Now, these are actually not true samples. These are actually just other images from the same training set. I believe this is from ImageNet. A few years ago, or even a few months ago, it would have seemed obvious that you couldn't generate samples like this. But in fact, nowadays, it's actually not so obvious that we couldn't generate these. So it's been a very exciting time in this area. And the amount of work we've done and the amount of progress we've made in the last few years has been pretty remarkable. And so I think part of what I want to do here is tell you a little bit about that progress, give you some sense of where we were in, say, 2014 when this started really to accelerate and where we are now. So yeah, so why generative models? Why do we care about generative modeling? Well, there's a bunch of reasons. Some of us are just really interested in making pretty pictures. And I confess that, for the most part, that's what I'll be showing you today. As an evaluation metric, we'll just be looking at pictures, just natural images and how well we're doing at natural images.
But there's actually real tasks that we care about when we talk about generative modeling. One of them is just, let's say you want to do some conditional generation, like, for example, machine translation. So we're conditioning on some source sentence, and we want to output some target sentence.

Concepts And Examples Of Latent Variables

Applications and model families (02:12)

Well, the structure within that target sentence, the target language, let's say, the rules, the grammatical rules, you can model that structure using a generative model. So this is an instance of where we would do conditional generative modeling. Another example, and this is something that we're actually looking a little bit towards, is can we use generative models as outlier detectors? And actually, recently, these types of models have been integrated into RL algorithms to help them do exploration more effectively. This was work done at DeepMind, I believe. So here we're looking at a case where you can think of a kind of toyish version of the autonomous vehicle task, and you want to be able to distinguish cars and wheelchairs, and then you're going to have something like this. And you don't want your classifier to just blindly say, oh, I think it's either a car or a wheelchair. You want your classifier to understand that this is an outlier. And you can use generative modeling to do that by noticing that there aren't very many things like this example in the training set. So you can proceed with caution. And this is kind of a big deal. Our neural net classifiers are very, very capable of excellent classification performance. But any classifier is just trained to output one of the classes that it's been given. And so in cases where we actually are faced with something really new that's not been seen before, or perhaps in illumination conditions that it's never been trained to cope with, we want models that are conservative in those cases. So we hope to be able to use generative models for that. Another case where we're looking at generative models being useful is in going from simulation to real examples in robotics.
So in robotics, training these robots with neural nets is actually quite laborious. If you're really trying to do this on the real robot, it would take many, many trials, and it's not really practical. In simulation, this works much, much better. But the problem is that if you train a policy in simulation and transfer it to the real robot, well, that hasn't worked very well because the environment is just too different. But what if we could use a generative model to make our simulation so realistic that that transfer is viable? So this is another area that a number of groups are looking at, this kind of pushing generative modeling in this direction. So there's lots of really practical ways to use generative models beyond just looking at pretty pictures. Right, so I break down the kinds of generative models there are in the world into two rough categories here. And maybe we can take issue with this. Oh, by the way, if you guys have questions, go ahead and ask me while you have them. I think I like interaction if possible. Or you can just save them to the end. Either way is fine. And sorry for my voice. I've got a cold. So yeah, we have autoregressive models, and we have latent variable models. So autoregressive models are models where you basically define an ordering over your input. So for things like speech recognition, or rather speech synthesis in the generative modeling case, this is natural. There's a natural ordering to that data. It's just the temporal sequence. For things like images, it's a little less obvious how you would define an ordering over pixels. But there are nonetheless models such as PixelRNN and PixelCNN that do just this. In fact, PixelCNN is a really interesting model from this point of view. They basically define a convolutional neural net with a mask.
So if you remember our previous lecture, we saw these convolutional neural nets. But what they do is they stick a mask on it so that you've got sort of a causal direction. So you're only looking at previous pixels in the ordering defined by the autoregressive model. So you maintain this ordering as you go through the conv net. And what that allows you to do is come up with a generative model that's supported by this conv net. It's just a full generative model. And it's a pretty interesting model in its own right. But because I have rather limited time, I'm actually not going to go into that model in particular. Another thing I just want to point out here is that WaveNet is probably the state-of-the-art model for speech synthesis right now. And it forms the basis of very interesting speech synthesis systems.
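The masking idea described above can be illustrated with a small sketch. This is not the PixelCNN implementation, only a minimal example assuming a raster-scan (row by row, left to right) pixel ordering and a single channel. The function name `causal_mask` and the `include_center` flag are hypothetical names chosen for this illustration.

```python
import numpy as np

def causal_mask(kernel_size: int, include_center: bool) -> np.ndarray:
    """Binary mask for a square kernel under raster-scan ordering.

    A weight is kept (1) only if it touches a pixel that comes earlier
    in the ordering than the pixel being predicted.
    """
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    c = kernel_size // 2
    mask[:c, :] = 1.0   # every row above the center row
    mask[c, :c] = 1.0   # pixels to the left of center in the center row
    if include_center:
        mask[c, c] = 1.0  # assumption: later layers may see the current position
    return mask

# The kernel weights would be multiplied elementwise by this mask
# before each convolution, so no "future" pixel is ever used.
mask_first_layer = causal_mask(3, include_center=False)
print(mask_first_layer)
```

In a first layer the center weight is zeroed so a pixel never sees its own value; whether later layers include the center is a design choice this sketch exposes through the flag.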

Example: Speech Synthesis (06:32)

It's another area where generative models have made remarkable contributions in the last few years, even the last few months. So now we're seeing models whose output we would be fairly hard-pressed to distinguish from natural speech, for the most part. So what I'm going to concentrate on is latent variable models. So latent variable models are models that essentially posit that you have some latent variables that you hope will represent some latent factors of variation in the data.

Latent variable modeling. (06:59)

So these are things that, as you wiggle them, they're going to move the data in what you hope will be natural ways. So you can imagine a latent variable for images corresponding to illumination conditions. Or if they're faces, a common thing we find is a latent variable corresponding to a smile. So if we move this latent variable, in the image that we generate, a smile appears and disappears. These are the kind of latent variables that we want. And we want to discover these. So this is really a challenging task in general. We want to take natural data, just unlabeled data, and discover these latent factors that give rise to the variation we see. There's two kinds of models in this family that I'm going to be talking about, variational autoencoders and generative adversarial nets, or GANs here. I work personally with both of these kinds of models. They serve different purposes for me. And yeah, let's dive in.

Latent variables and the variational autoencoder. (08:02)

So first, I'll talk about the variational autoencoders. This actually was a model that was developed simultaneously by two different groups, one at DeepMind. That's the bottom one here. And then Kingma and Welling at the University of Amsterdam. So again, the idea behind the latent variable models in general is kind of represented in this picture. So here's the space of our latent variables. And we can see this is kind of represented as being fairly simple. And we have our two coordinates, z1 and z2. And they're independent in this case, and they're sort of fairly regular. And they sort of form a chart for what is our complicated distribution here in x space. So you can think of this as the data manifold. So you can think of this as image space, for example. So image space embedded in pixel space, natural images embedded in pixel space, form this kind of manifold. And what we want is coordinates that allow you to, as you move smoothly in this space, move along what can be a very complicated manifold. So that's the kind of thing we're looking for when we do latent variable modeling. So here's just an example of what I mean by exactly that. This is an early example using these variational autoencoders. So here's the Frey face data set, just a whole bunch of images of Brendan Frey's face. What we're showing here is the model output of this variational autoencoder for different values of these latent variables z1 and z2. Now, we've kind of added these labels, pose and expression, on them post hoc. Because you see, as we move this z2 here, you can see the expression kind of smoothly changes from what looks like a frown to eventually a smile and through what looks over here like a, well, sticking his tongue out, I guess. And in this direction, there's a slight head turn. It's pretty subtle, but it's there. So like I said, these were sort of added post hoc.
The model just discovered that these were two sources of variation in the data that were relatively independent, and the model just pulled those out. For something like MNIST, these are samples drawn from a model trained on MNIST, it's a little less natural, right? Because in this case, you could argue the data is really best modeled as something not with continuous latent factors, but more like in clusters. So you get this somewhat interesting, somewhat bizarre relationship where the tilt happens here, but then the 1 morphs into a 7, which morphs into a 9. And you get these different regions in this continuous space that represent different examples here. So a little bit more detail into how we do these kinds of latent variable models, at least in the context of the variational autoencoder or VAE model.

Neural Latent Variables. (10:53)

So what we're trying to learn here is p of x, some distribution over the data. We're trying to maximize the likelihood of the data. That's it. But the way we're going to parameterize our model is with p of x given z. Zed is our latent variables. Oh, sorry, zee, I guess, for you guys. p of x given z, and then p of z, some prior distribution. So this p of z here is typically something simple. We actually generally want it to be independent. There's some modeling compromises to be made there. But the reason why you'd want it independent is because that helps get you the kind of orthogonal representation here. So this dimension and this dimension, we want sort of not very much interaction in order to make them more interpretable. Yeah, and so the other thing we want to think about is how we're going to go from something simple, you can think about this as like a Gaussian distribution or uniform distribution, to the data. So now we want a model G here that transforms z from this space into this complicated space. And the way we're going to do that is with a neural net. Actually, in all of the examples that I'm going to show you today, the way we're going to do that is with a convolutional neural net. And it's a bit of an interesting thing to think about, going from some fully connected thing, z, into some two-dimensional output with a topology here in the natural image space, x. It's kind of like the opposite path of what you would take to do a conv net classification. There's a few different ways you could think about doing that, one of which is called a transpose convolution. This turns out not to be such a good idea. This is a case where you essentially fill in a bunch of zeros.
It seems like the most acceptable way to do that right now is to just, once you get some small level of topology here, you just do interpolation. So you just upsample the image. You can do bilinear interpolation. And then do a conv that preserves that size. And then upsample again, conv, upsample, conv. That tends to give you the best results. So when you see this kind of thing for our purposes, think convolutional neural net. So it's important to point out that if we had over here z's that went with our x, we'd be done, because this is just a supervised learning problem at this point, and we'd be fine. The trick is these are latent, meaning they're hidden. We don't know these z's. We haven't discovered them yet. So how do we learn with this model? Well, the way we're going to do it is we're going to use a trick that's actually been around for quite some time. So this isn't particularly new. We're going to use a variational lower bound on the data likelihood. So it turns out that we can actually express the data likelihood here.
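The upsample, conv, upsample, conv pattern mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses nearest-neighbor upsampling as a simpler stand-in for the bilinear interpolation mentioned in the talk, a single channel, random untrained kernels, and a tanh nonlinearity. The helper names `upsample2x` and `conv_same` are made up for this example.

```python
import numpy as np

def upsample2x(img: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of an (H, W) array."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def conv_same(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Size-preserving 2D filter with zero padding (odd square kernel)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)  # zero padding keeps the output the same size
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))  # a small map with some spatial topology
kernels = [rng.normal(size=(3, 3)) * 0.1 for _ in range(2)]

# Upsample, then a size-preserving conv, repeated: 4x4 -> 8x8 -> 16x16.
for kern in kernels:
    x = np.tanh(conv_same(upsample2x(x), kern))

print(x.shape)  # (16, 16)
```

Each stage doubles the spatial resolution and then lets the convolution clean up the interpolated map without changing its size.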

Lower bound (13:50)

Again, this is the thing we're trying to maximize. We can express a lower bound for it given by something like this. So we posit that we have some q distribution over z that estimates the posterior of z for a given x. And we're trying to then, I guess, maximize this log joint probability over x and z minus the log q of z, in expectation under q. So this is this variational lower bound. One of the ways we can express this, from q's point of view, is that if you were to find a q that actually recovered the exact posterior distribution over z given x, this would actually be a tight lower bound. So then if we were to optimize this lower bound, we would for sure be optimizing likelihood. In practice, that's what we're going to do anyway. We're going to have this lower bound. We're going to try to optimize it. We're trying to raise that up in hopes of raising up our likelihood. But the problem is this posterior, the actual posterior of, say, this G model here, this neural net, this is just some forward model neural net. So computing the posterior of z given x is intractable. It's going to be some complicated thing, and we have no sensible way of doing this. So we're going to approximate it with this q in this way. And what's interesting about this formulation, and this is new to the variational autoencoder, is they've sort of just reformulated this a little bit differently. And what they've got is this different expression here, which actually can be thought of in two terms. One is the reconstruction term here.
If you look at what this is, you get, from some q, a z, and you're just trying to reconstruct x from that z. So you start with x, you get a z, and then you're trying to do a reconstruction of x from that z. This is where the name variational autoencoder comes from, because this really looks like an encoder on the side of q here and a decoder here. And you're just trying to minimize that reconstruction error. But in addition to this, they add this regularization term. And this is interesting. So what they're doing here is they're basically saying, well, we want to regularize this posterior. And this is actually new. Autoencoders don't have this. So we're trying to regularize this posterior to be a little bit closer to the prior here. And it's a common mistake when people learn about this to sort of think that, oh, well, the goal is for these things to actually match. That would be terrible. That means that you lose all information about x. You definitely don't want these things to match. But it does act as a regularizer, sort of as a counterpoint to this reconstruction term. And so now, we've talked a little bit about this. But what is this q? Well, for the variational autoencoder, the q is going to be another neural net. And in this case, we can think of this as just a straight conv net for the case of natural images. So again, we've got our lower bound, our objective that we're trying to maximize. And we're going to parameterize q as this neural net, the conv net, that goes from x to z. And we've got now our generative model here, our decoder that goes from z to x. I'm going to add a few more details to make this thing actually work in practice. Up till now, this is not too new. There's been instances of this kind of formalism of an encoder network and a decoder network. But what they do next is actually kind of interesting.
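For reference, the bound and its two-term form discussed above can be written out as follows. The notation is an assumption of this note (the slides are not part of the transcript): q(z | x) is the approximate posterior given by the encoder, p(x | z) the decoder, and p(z) the prior.

```latex
\begin{aligned}
\log p(x)
  &\ge \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big] \\
  &= \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction term}}
   \;-\; \underbrace{\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{regularization term}}
\end{aligned}
```

The gap between the two sides of the inequality is KL(q(z | x) || p(z | x)), which is why the bound becomes tight when q recovers the exact posterior.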
They notice that if they parameterize this a certain way, if they say q is, actually you can use any continuous distribution here, but they pick a normal distribution. So q of z given x is some normal distribution where the parameters mu and sigma that define this normal distribution are given by this encoder network. Then they can actually encode it like this. It's called the reparameterization trick, where z, our random variable here, is equal to mu, some function of the input, plus sigma, our scaling factor, times some noise. And what this allows us to do now, when they formulate it this way, is that in training this model, they can actually backprop through the decoder and into the encoder to train both models simultaneously.
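A minimal sketch of the reparameterization trick described above, assuming a diagonal Gaussian q. In a real model `mu` and `sigma` would be outputs of the encoder network; here they are fixed arrays for illustration, and the function name `reparameterize` is hypothetical.

```python
import numpy as np

def reparameterize(mu: np.ndarray, sigma: np.ndarray, rng: np.random.Generator):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise."""
    eps = rng.standard_normal(mu.shape)  # noise does not depend on any parameter
    z = mu + sigma * eps                 # z is differentiable in mu and sigma
    return z, eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])     # stand-in for encoder output
sigma = np.array([0.1, 2.0])   # stand-in for encoder output (positive scale)

z, eps = reparameterize(mu, sigma, rng)

# Because z = mu + sigma * eps, the derivatives are simple:
# dz/dmu = 1 and dz/dsigma = eps, so gradients can flow back into the encoder.
many = mu + sigma * rng.standard_normal((100_000, 2))
print(many.mean(axis=0), many.std(axis=0))  # close to mu and sigma
```

The randomness is moved into `eps`, which is what lets the reconstruction error be backpropagated through the sampling step.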

Forward propagation (17:49)

Looks a little bit like this. So they can do forward propagation, start with an x, forward propagate to z, add noise here. That was that epsilon. And then forward propagate here to this x hat, which is our reconstruction. Compute the error between x and x hat, and back propagate that error all the way through. And that allows them to actually train this model very effectively in ways that we've never been able to train before this trick came up. And when you do that, this is the kind of thing that came out. So this came out in 2014. These were actually really, I promise, these were really impressive results in 2014. This is the first time we were seeing this sort of thing. This is from Labeled Faces in the Wild.

Trick (18:38)

These days we use CelebA. And this is ImageNet. So not a whole lot there. Actually, this is a small version of ImageNet. But you can do things with this model, actually. So for example, one of the things that we've done with this model is, I mentioned briefly this PixelCNN, we actually include this PixelCNN in the decoder side. So if I just go back, one of the reasons why we get these kinds of images is this model makes a lot of independence assumptions. And part of it is because we want those independence assumptions to make our zeds more interpretable. But they have consequences. And one of the consequences is you end up with kind of blurry images. That's part of why you end up with blurry images, because we're making these approximations in the variational lower bound. And so adding the PixelCNN allows us to encode more complexity in here. And by the way, this is now a hierarchical version of the VAE using PixelCNN. That allows us to encode sort of complicated distributions in z1 here, given the upper level z's. And with this kind of thing, this is the kind of images that we can synthesize using this variational, we'll call this the PixelVAE model. So these are bedroom scenes. So you can sort of see, it's reasonably good, clear bedroom scenes.

Inference is tough because modeling is hard (20:00)

And then ImageNet, where you can see that it gets roughly the textures right. It's not really getting objects yet. Actually, objects are really tough to get when you model things in an unconditional way. What I mean by that is the model doesn't know that it's supposed to generate a dog, let's say, if it was going to generate something. So it's just generating from p of x in general. That's actually pretty challenging when we talk about ImageNet. All right, so that's one way we can improve the VAE model. Another way we can improve the VAE model is to work on the encoder side. And that was done by a few people, but culminating, I think, in the inverse autoregressive flow model. So this is actually a very effective way to deal with the same kind of independence problems that we're addressing on the decoder side, but they're addressing it on the encoder side. So you can kind of see just briefly what this is doing. So this is your prior distribution. Ideally, you would like sort of the marginal posterior, which is sort of like combining all these things together, to be as close to this as possible. Because any sort of disagreement between those two is really a modeling error. It's an approximation error. So a standard VAE model is going to learn to do something like this, which is kind of as close as it can get to this while still maintaining independence in the distributions. Using this IAF method, it's a bit of a complicated method that involves many, many iterations of transformations that you can actually compute and that are actually invertible. And you need this to be able to do the computation. But with that, you can get this kind of thing, which is pretty much exactly what you'd want in this setting. So we've played around with this model. And in fact, we find it works really well in practice, actually.
But again, it's on the encoder side. What we were doing with the PixelVAE is working on the decoder side.

Generative adversarial networks (21:48)

So this is actually fairly complicated. Both these models are actually fairly complicated to use and fairly involved. So one question is, is there another way to train this model that isn't quite so complicated? And so at the time, a student of Yoshua Bengio and me, Ian Goodfellow, was toying around with this idea.

The GAN game (22:05)

And he came up with generative adversarial nets. And the way generative adversarial nets work is it posits the learning of a generative model, G, in the form of a game. So it's a game between this generative model, G here, and a discriminator, D. So the discriminator's job is to try to tell the difference between true data and data generated from the generator. So it's trying to tell the difference between fake data that's generated by the generator and true data from the training distribution. And it's just trained to do it. So this guy's trained to try to output 1 if it's true data and output 0 if it's fake data. And the generator is being trained to fool the discriminator by using its own gradients against it, essentially. So we back propagate the discriminator error all the way through x, we usually have to use continuous x for this, and into the generator. Now we're going to change the parameters of the generator here in order to try to maximally fool the discriminator. So a more, I guess, abstract way to represent this looks like this.

Generator and discriminator (23:28)

So we have the data on this side. We have the discriminator here with its own parameters. And this, again, for our purposes, is almost always a convolutional neural net. And then we have the generator, which is, again, one of these kind of flipped convolutional models, because it takes noise as input. It needs noise because it needs variability. And then it converts that noise into something in image space, with these parameters that are trained to fool the discriminator.

Understanding Generative Adversarial Networks

The objective function (23:50)

All right. So we can be a little bit more formal about this. This is actually the objective function we're training on. So let's just break this down for a second. So from the discriminator's point of view, what is this? It's just the cross entropy loss. It's literally just what you would apply if you were doing a classification with this discriminator. That's all this is. From the generator's point of view, the generator comes in just right here, right? It's the thing you draw these samples from. And it's trying to minimize, well, the discriminator is trying to maximize this quantity. This is essentially likelihood. And the generator is moving in the opposite direction. So we can analyze this game to see, there's a question, right? Actually, the way this happened was at first he just tried it, and it worked. It was kind of an overnight kind of thing, and we got some very promising results. And then we set about trying to think about, well, how do we actually explain what it's doing? Why does this work? And so we did a little bit of theory, which is useful to discuss. And I can tell you there's been a lot more theory on this topic that's been done that I will not be telling you about, but it's actually been a very interesting development in the last few years. But this is the theory that appeared in the original paper. So the way we approached this was, let's imagine we have an optimal discriminator. And it turns out you can pretty easily show this is the optimal discriminator up here. Now, this is not a practical thing, because we don't know p_r, which is the probability under the real distribution.
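For reference, the objective and the optimal discriminator discussed above can be written as follows, in the form given in the original GAN paper. The symbols are assumptions of this note since the slides are not in the transcript: p_r is the real data distribution, p_g the distribution of generated samples G(z), and p(z) the noise prior.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_r}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]

D^{*}(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}
```

The first expression is the binary cross entropy of a classifier that labels real data 1 and generated data 0 (up to sign), and the generator only enters through the samples G(z) in the second expectation.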

Mini-lecture (25:27)

This is not available to us. It's only defined through the training set, so only by training examples. So we actually can't instantiate this. But in theory, if we had this optimal discriminator, then the generator would be trained to minimize the Jensen-Shannon divergence between the true distribution that gave rise to the training data and our generated distribution. So this is good, right? This is telling us that we're actually doing something sensible in this kind of non-parametric ideal setting that we're not really using, but it's actually interesting nonetheless. So one thing I can say, though, is that in practice, we actually don't use exactly the objective function that I was just describing. What we use instead is a modified objective function. And the reason is that if we were to minimize with respect to G, what we had before was this term that G is minimizing. What happens is that as the discriminator gets better and better, the gradient on G actually saturates. It goes to 0. So that's not very useful if we want to train this model. And this is actually one of the practical issues that you see when you actually train these models, is that you're constantly fighting this game. You're sort of on this edge of either the discriminator or the generator doing too well. You're basically almost always fighting the discriminator, because as soon as the discriminator starts to win this competition between the generator and the discriminator, you end up with unstable training. And in this case, you end up with basically the generator stops training and the discriminator runs away with it. Well, that's actually in the original case. So what we do instead is we optimize this, which is a slight modification, but it's still monotonic. And it actually corresponds to the same fixed point. But what we're doing is, with respect to G, again coming in through the samples here, we're maximizing this quantity rather than minimizing this one. OK.
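The saturation issue described above can be checked numerically. This sketch assumes the discriminator output is a sigmoid of a logit `a`, so D(G(z)) = sigmoid(a), and compares the gradient with respect to that logit for the original generator term, log(1 - D(G(z))), and for the modified one, written here as minimizing -log D(G(z)), which is the same as maximizing log D(G(z)). The function names are made up for the illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_original(a):
    """d/da of log(1 - sigmoid(a)): the term the generator minimizes originally."""
    return -sigmoid(a)

def grad_modified(a):
    """d/da of -log(sigmoid(a)): the modified generator loss."""
    return -(1.0 - sigmoid(a))

# Negative logits mean the discriminator confidently calls the sample fake.
for a in [-8.0, -2.0, 0.0]:
    print(f"logit {a:5.1f}  D(G(z)) = {sigmoid(a):.4f}  "
          f"original grad = {grad_original(a):.4f}  "
          f"modified grad = {grad_modified(a):.4f}")
```

When the discriminator is winning (very negative logit), the original gradient shrinks toward zero while the modified one stays close to -1 in magnitude, which is the practical reason for the switch.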
So that's just a practical kind of heuristic thing, but it actually makes a big difference in practice. So when we first published this paper, this is the kind of results we would see. And what you're looking at now is movies formed by moving smoothly in z-space. So you're kind of looking at transformations on the image manifold coming from smooth motions in z-space. So we were pretty impressed with these results. Again, they felt good at the time. But there's been a few papers that have come out recently. Well, not so recently, actually, at this point. This came out in 2014. In 2016, there was a big jump in the quality. And this was sort of one of those stages. This is the least squares GAN. This is just one example of many I could have pointed out. But this is the kind of results we're seeing. So one of the secrets here is that it's 128 by 128. So bigger images actually give you a much better perception of quality. But these are not real bedrooms. These are actually generated from the model. So trained on roughly, I think, 100,000 or at least 100,000 bedroom scenes, asked to generate from these random z's, this is what it gives you. So one thing you could think of, and one thing that certainly occurred to me when I first saw these kinds of results, is that, well, it's just overfit on some small set of examples, and it's just learning these delta functions. So that's not that interesting in some sense. It's kind of memorized some small set of data, and it's enough that it looks good and it's impressive. But it doesn't seem like that's actually the case. And one of the pieces of evidence that was pointed to, and this is in the DCGAN paper, was this.
So that same trick that I showed you with the movies in MNIST where we were sort of moving smoothly in z-space, they applied basically that same idea here. So this is basically one long trajectory through z-space. And what you can see, starting up here and ending up all the way down here, is a smooth transition. And at every point, it seems like a reasonable bedroom scene. So it really does seem like that picture that I showed you where we had the z space that was smooth, and then we had this x space that had this manifold on it. It really does feel like that's what's happening here, right? We're moving smoothly in z space, and we're moving along the image manifold in x space. So for example, I guess I don't know if this is a picture or TV, but it slowly morphs into a window, I guess, and then kind of becomes clearly a window and then turns just into this edge, sort of an edge of the room. So one of the things, actually, if you want to nitpick about these, is the models don't seem to understand 3D geometry very well. They often get the perspective just a little wrong. Something that might be interesting for future work. So yeah, one question you might ask is, why? Why do these things work well? And keep in mind that when we talked about the VAE model, we actually had to do quite a bit of work to get comparable results.
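The "one long trajectory through z-space" idea above can be sketched as a straight-line walk between two latent vectors. This is only an illustration: linear interpolation and a 100-dimensional Gaussian z are assumptions of this example, and the trained generator that would turn each z into an image is not included. The function name `interpolate_latents` is hypothetical.

```python
import numpy as np

def interpolate_latents(z_start: np.ndarray, z_end: np.ndarray, num_steps: int) -> np.ndarray:
    """Return num_steps latent vectors on the straight line from z_start to z_end."""
    ts = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - t) * z_start + t * z_end for t in ts])

rng = np.random.default_rng(0)
z_a = rng.standard_normal(100)  # two random points in z-space
z_b = rng.standard_normal(100)

path = interpolate_latents(z_a, z_b, num_steps=8)
print(path.shape)  # (8, 100)

# With a trained generator (not defined here), each row would be decoded
# into one frame of the movie: frames = [generator(z) for z in path]
```

If neighboring frames along such a path all look like plausible scenes, that is evidence against the model having simply memorized a handful of training images.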

An intuition for GANs (30:44)

We had to embed these pixel CNNs in the decoder, or we had to do quite a bit of work to get the encoder to work right. In these models, we literally just took a convnet, stuck in some noise at the beginning, pushed it through, and we got these fantastic samples. It really is kind of that simple. So what's going on? Why is it working as well as it is? And so I have an intuition for you, a kind of a cartoon view. So imagine that this is the image manifold. So this is kind of a cartoon view of an image manifold, but this is in two pixel dimensions here. And we're imagining here that these are just parts of the image manifold, and they sort of share some features close by. But what this is basically representing is the fact that most of this space isn't on the image manifold. The image manifold is some complicated non-linearity. And if you were to randomly sample in pixel space, you would not land on this image manifold, which makes sense. Randomly sample in pixel space, you're not getting a natural image out of it. This is sort of a cartoon view, my perspective on the difference between what you see with maximum likelihood models, of which the VAE is one, and something like a GAN. So the maximum likelihood model, the way it's trained is it has to give a certain amount of likelihood density for each real sample. If it doesn't, it's punished very severely for that. So it's willing to spread out its probability mass over regions of the input space, or of the pixel space, which actually don't correspond to a natural image. And so when we sample from this model, that's most of where our samples come from. These are these blurry images that we're looking at. A GAN models things differently, right? Because it's only playing this game between the encoder and, well, sorry, between the discriminator and the generator.
All it has to do is sort of stick on some subset of the examples, or maybe some subset of the manifolds that are present, and have enough diversity that the discriminator can't notice that it's modeling a subset. So there's pressure on the model to maintain a certain amount of diversity. But at the same time, it doesn't actually face any pressure to model all aspects of the training distribution.
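To make this intuition a bit more concrete, here is a toy one-dimensional Python sketch. It is not from the talk: the two-cluster data, the single-Gaussian model, and the "one mode only" alternative are all assumptions chosen for illustration, and the alternative only mimics the behavior attributed to GANs above rather than being a GAN.

```python
import math
import random

def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def avg_loglik(data, mean, std):
    """Average log-likelihood of the data under a single Gaussian."""
    return sum(gaussian_logpdf(x, mean, std) for x in data) / len(data)

rng = random.Random(0)
# Toy "data manifold": two tight clusters with empty space in between.
data = ([rng.gauss(-3.0, 0.3) for _ in range(500)]
        + [rng.gauss(3.0, 0.3) for _ in range(500)])

# Maximum likelihood fit of one Gaussian: sample mean and standard deviation.
ml_mean = sum(data) / len(data)
ml_std = math.sqrt(sum((x - ml_mean) ** 2 for x in data) / len(data))

# Alternative that sits on one cluster only and ignores the other.
mode_mean, mode_std = 3.0, 0.3

ml_ll = avg_loglik(data, ml_mean, ml_std)
mode_ll = avg_loglik(data, mode_mean, mode_std)

# Fraction of real data near the centre of the maximum likelihood fit.
frac_near_ml_mean = sum(abs(x - ml_mean) < 1.0 for x in data) / len(data)
```

The maximum likelihood fit wins on likelihood (`ml_ll` is much higher than `mode_ll`) precisely because it covers every sample, but its highest-density region sits between the clusters where `frac_near_ml_mean` shows there is essentially no real data. The one-cluster model is punished heavily under likelihood for the samples it ignores, yet everything it would generate looks like real data.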

GANS (32:55)

It can just ignore certain aspects of the training examples or certain aspects of the training distribution without significant punishment from the training algorithm. So anyway, that's, I think, a good idea to have in your mind about the difference between how these methods work. Yeah, so I'd like to sort of conclude with a few steps that have happened since the GANs.

ALI and BiGAN (33:25)

One of the things, this is something that we've done. You might ask, well, GANs are great. But in a way, it's kind of unsatisfying, because we start with this z, and then we can generate images. So yes, we generate really nice looking images. But we had this hope when we started talking about these latent variable models that we could actually maybe infer the z from an image. So we can actually extract some semantics out of the image using these latent variables that you discover. And in the GAN, we don't have them. The question is, can we actually, within this generative adversarial framework, reincorporate an inference mechanism? So that's exactly what we're doing here with this work. And this is actually a model we call ALI, but essentially identical work came out at the same time, known as BiGAN. And the basic idea here is just to incorporate an encoder into the model. So as the GAN was defined earlier, we had the decoder, our generative model. But over here on the left, we only gave it x, training examples. And here, we only compared against x generated from the generator. But in this case, what we're doing is we take x, and then we actually use an encoder here to generate a z given x. And on the decoder, we have, again, our traditional GAN style. We take a z sample from some simple distribution, and we generate x. So again, this is the data distribution over here, and we encode it to z. And then on the decoder side we take our sample of z, and we decode to x. And our discriminator now, crucially, is not just given x, but it's given x and z. And it's asked, can you tell the difference in this joint distribution between the encoder side and the decoder side? That's the game. And what we find is, well, first of all, it actually generates very good samples, which is interesting.
It actually seems to generate sort of better samples than we see with comparable GANs, so there might be some regularization effect. I'm not entirely sure why that would be, but actually it gets fairly compelling samples. This is just with CelebA, a large data set of celebrity images. But this is the more interesting plot. So this actually corresponds to a hierarchical version of this model. So this is why we have multiple z's. So this is z1 and z2. This is a two-layer version of this model. So if we just reconstruct from the higher level z, which is containing fairly little information, because it's a single vector, and then it has to synthesize into this large image, what we're looking at here are reconstructions. So we take this image, we encode it all the way to z2, and then we decode it. And what we end up with is this, which is reasonably close, but not that great. And so it's the same thing. They sort of hold some piece of the information, which in some sense, it's remarkable that it does as well as it does, because it's actually not trained to do reconstruction, unlike something like the VAE, which is actually explicitly trained to do reconstruction. This is just trained to match these two distributions. And this is kind of a probe to see how well it's doing. Because we take x, map it to z, and we take that z over, and then we resynthesize the x, and we're seeing now in x space, how close did it come? And it does OK. But over here, when we give it z1 and z2, so in this case, we're really just giving it all of the latent variable information, we actually get much, much closer, which is interesting because this is telling us that this pure joint modeling, in this case, it would be a joint modeling between x, z1, and z2, that this is enough to do fairly good reconstruction without ever actually explicitly building that in.
So it's giving us an interesting probe into how close are we coming to learning this joint distribution. And it seems like we're getting actually surprisingly close. So it's a testament to how effective I think this generative adversarial training algorithm actually is. So I want to just end with a few other things that have nothing to do with our work, but I think are very, very interesting and well worth you guys learning about. So first one is CycleGAN. CycleGAN is this really cool idea starting with, let's imagine you have two datasets that somehow correspond, but you don't know the correspondence.
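The data flow just described, with the discriminator seeing joint (x, z) pairs and reconstruction used only as a probe, can be sketched in a few lines of Python. The `encoder`, `decoder`, and `discriminator_input` functions are hypothetical toy stand-ins for the networks and are not the ALI or BiGAN implementations.

```python
def encoder(x):
    """Hypothetical encoder E: maps a data point x to a latent z."""
    return [0.5 * (x[0] + x[1])]

def decoder(z):
    """Hypothetical decoder (generator) G: maps a latent z to a data point x."""
    return [z[0], z[0]]

def encoder_pair(x):
    """Joint sample from the encoder side: real x with its inferred z."""
    return (x, encoder(x))

def decoder_pair(z):
    """Joint sample from the decoder side: prior z with its generated x."""
    return (decoder(z), z)

def discriminator_input(x, z):
    """The discriminator is given x and z together, not x alone."""
    return list(x) + list(z)

def reconstruct(x):
    """Probe only: encode then decode. Training never optimizes this directly."""
    return decoder(encoder(x))

real_x = [2.0, 2.0]   # stands in for a training example
prior_z = [0.7]       # stands in for a sample from the simple prior

d_in_encoder_side = discriminator_input(*encoder_pair(real_x))
d_in_decoder_side = discriminator_input(*decoder_pair(prior_z))
x_reconstructed = reconstruct(real_x)
```

In training, the discriminator would try to tell `d_in_encoder_side` from `d_in_decoder_side`, while the encoder and decoder are updated to make the two joint distributions indistinguishable. How close `x_reconstructed` comes to `real_x` is then a measure of how well that matching succeeded.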

Constellation Problem (37:35)

There's an alignment that exists between these two datasets, but you might not have paired data. This actually happens a lot. Say for example, this is not in image space, but a great example of this happens in machine translation. You almost always have lots of unilingual data, so just text data in a given language, but it is very expensive to get aligned data, that is, data paired as a source and target distribution. The question is, what can you do if you just have unilingual data? How successful can you be at learning a mapping between the two? And they essentially use GANs to do this. So this is the setup. They have some domain x here and a domain y here. And what they do is they start with an x. They transform it through some convolutional neural net, usually a ResNet-based model, into some y. And on this y, they're going to evaluate it with a GAN-style discriminator here. So can x through g make a convincing y? That's being measured here. So you can think of x, which is some other image, let's say, so this is image to image, as taking the place of our z, of our random bits. It's getting its randomness from x. So we've got x here. We transform it through g to y. And we evaluate on our discriminator here, with GAN-style training. And then we transform it back through f, so we re-encode this into x. So we've taken x, transformed it into y, and transformed it back. They actually apply what's called a cycle consistency loss, which is actually a reconstruction loss. This is an L1 reconstruction error. And they backprop through f and g. And then it's a symmetric relationship, so they do the exact same thing on the other side. They start with y. They see if they can transform it into x. They compare that generated x with true x's via a discriminator, and then again transform that to y and do the cycle consistency loss.
So without any paired data, this is the kind of thing that they can get. So of particular note is horses and zebras. So this is a case where it's impossible to get this kind of paired data. Say you wanted a transformation that transformed horses to zebras and vice versa. You will never find pictures of horses and zebras in the exact same pose. That's just not a kind of data that you're ever going to be able to collect. And yet they do a fairly convincing job of doing this.
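The cycle consistency term described above can be written down directly. The Python sketch below covers only that L1 reconstruction term and leaves out the two adversarial losses; the mappings `g` and `f` are hypothetical toy functions standing in for the ResNet-based networks.

```python
def l1_error(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def g(x):
    """Hypothetical mapping from domain X to domain Y."""
    return [2.0 * v + 1.0 for v in x]

def f(y):
    """Hypothetical mapping from domain Y back to domain X."""
    return [(v - 1.0) / 2.0 for v in y]

def cycle_consistency_loss(x, y, g, f):
    """L1 error of x -> g -> f -> x plus L1 error of y -> f -> g -> y."""
    forward_cycle = l1_error(f(g(x)), x)
    backward_cycle = l1_error(g(f(y)), y)
    return forward_cycle + backward_cycle

x_sample = [0.5, -1.0, 2.0]   # an unpaired sample from domain X
y_sample = [1.0, 3.0]         # an unpaired sample from domain Y

loss_consistent = cycle_consistency_loss(x_sample, y_sample, g, f)

def f_bad(y):
    """A mapping that does not undo g, so the cycle fails to close."""
    return list(y)

loss_inconsistent = cycle_consistency_loss(x_sample, y_sample, g, f_bad)
```

Because `f` exactly inverts `g` here, `loss_consistent` is zero, while `f_bad` gives a positive loss. Note that `x_sample` and `y_sample` are never paired with each other: each one is only compared against its own round trip.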

Gan Reconstruction

SPITE GAN Reconstruction (40:08)

And you can see that they even turn like, this one actually doesn't do it very well. But oftentimes what you see is they turn like green grass a little bit more savanna-like. That kind of dulls it out because zebras are found generally in savanna-like conditions. They can do winter to summer kind of transitions. I've seen examples day to night. These are pretty interesting, various other things. Now, I think there's a lot of interesting things you can do with this. If you think about that simulation example with robotics that I gave, the motivating example at the beginning, this is a prime application area for this kind of technology. But I will say that one of the things that they've done here is they assume a deterministic transformation between these two domains. So I think there's a lot of interest in looking at how do you actually break that kind of restriction and imagine something more like a many-to-many mapping between these two domains. So the last thing I want to show you is kind of the most recent stuff, which is just kind of mind blowing in terms of just the quality of generation that they show. So these are images from NVIDIA, actually. So I don't know if I'm sort of undercutting. I don't know if he was going to show these or not. But yeah, so these are trained on the original CelebA data set, the same one we had before, but now much, much larger images, so 1,024 by 1,024. And they're able to get these kinds of generated images. So I would argue that many of these, maybe all of the ones shown here, essentially pass a kind of a Turing test, an image Turing test. You cannot tell that these are not real people. Now, I should say not all images actually look this good. Some of them are actually really spooky. But you can go online and look at the video and pick some out there. How they do this is with a really interesting technique. And this is actually so new.
I have students that are starting to look at this, but we haven't really probed this very far. So I actually don't know how effective this is in general. But it actually seems very interesting. So they just start with a tiny 4 by 4 image. And once you train up those parameters for both the discriminator here and the generator, so again, these are convolutions, but we're starting with a relatively small input. We increase the size of the input. We add a layer. And some of these parameters here are actually formed by the parameters that gave you this image. So you sort of just stick this up here, and now you add some parameters. And you now train the model to learn something bigger. And you keep going, and you keep going, and then you get something like this. As far as I'm concerned, this sort of amounts to kind of a curriculum of training. It does two things for you. One is it helps build a bit of global structure to the image, because you're starting with such low dimensional inputs.
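The growing schedule described above can be sketched as a simple curriculum over resolutions. This Python sketch is an assumption-laden illustration rather than NVIDIA's implementation: it assumes the resolution doubles at each stage, and that real images are average-pooled down to the current stage's size, neither of which is stated in the talk.

```python
def resolution_schedule(start=4, final=1024):
    """Image sizes for each training stage, doubling until the final size."""
    sizes = []
    size = start
    while size <= final:
        sizes.append(size)
        size *= 2
    return sizes

def downsample(image, factor):
    """Average-pool a square image (list of rows) by an integer factor."""
    n = len(image) // factor
    pooled = []
    for i in range(n):
        row = []
        for j in range(n):
            block = [image[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled

stages = resolution_schedule()   # one entry per stage of the curriculum

# A tiny 4x4 "real image" reduced to 2x2, as an early stage would see it.
toy_image = [[1, 1, 3, 3],
             [1, 1, 3, 3],
             [5, 5, 7, 7],
             [5, 5, 7, 7]]
toy_low_res = downsample(toy_image, 2)
```

At each entry of `stages`, the generator and discriminator would gain a layer and be trained at that size, reusing the parameters learned at the previous, smaller size.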

Day-Night Generation

Day-Night Generation (42:55)

It helps reinforce the kind of global structure. But it also does something else which is pretty important. It allows the model to sort of not have to spend a lot of time training a very, very large model like this. I would imagine they spend relatively little time training here, although this is NVIDIA, so they might spend a lot of time training this. But it allows you to spend a lot of the time sort of in much, much smaller models, so it's much more computationally efficient to train this model. All right, so that's it for me. Thanks a lot. Oh, wait. Oh, yeah, sorry. One more thing I forgot. This is just what they get on conditional image generation now with MNIST. So this is, again, so you give it horse, and then it's able to generate this kind of thing. So bicycles, it's able to generate these, which is pretty amazing quality. If you zoom in here, it's kind of fun, because it kind of gets the idea of these spokes, but not exactly. Like some of them just sort of end midway. But still, pretty remarkable. All right, so thanks. If there are questions, I'll take them.
