MIT 6.S191: Towards AI for 3D Content Creation

Transcription for the video titled "MIT 6.S191: Towards AI for 3D Content Creation".


Note: This transcription is split and grouped by topics and subtopics. All paragraphs are timed to the original video.


Opening Remarks

Introduction (00:00)

Great. Yeah, thanks for the nice introduction. I'm going to talk to you about 3D content creation, and particularly deep learning techniques to facilitate 3D content creation. Most of the work I'm going to talk about is the work I've been doing with my group at NVIDIA and our collaborators, but there's going to be a little bit of my work at U of T as well. All right. So, you know, I think this is a deep learning class, right? So you've heard all about how AI has made so much progress in the last decade or so. But computer graphics was actually revolutionized as well, with, you know, many new rendering techniques, or faster rendering techniques, but also by working together with AI. So this is the latest video that Jensen introduced a couple of months ago. All this rendering that you're seeing is done in real time. It's basically rendered in front of your eyes. And, you know, compared to the traditional games you're used to, maybe real-time gaming, here there are no baked lights. Everything is computed online. Physics, real-time ray tracing, lighting, everything is done online. What you're seeing here is rendered in something called Omniverse. It's this visualization and collaboration software that NVIDIA has just recently released. You guys should check it out. It's really awesome. All right. Oops.


Deep Dive Into 3D Content Creation And Synthesis

What is 3D content? (02:10)

This slide always gets stuck. Yeah, so when I joined NVIDIA, this was two and a half years ago, the org that I'm in was actually creating the software called Omniverse, the one that I just showed. And I got so excited about it, and I wanted to somehow contribute in this space, somehow introduce AI into this content creation and graphics pipeline. And 3D content is really everywhere, and graphics is really in a lot of domains, right? So in architecture, designers would create office spaces, apartments, whatever; everything would be done in some modeling software with computer graphics, right? So that you can judge whether you like some space before you go out and build it. All modern games are heavy 3D. In film, there's a lot of computer graphics, in fact, because directors just want too much out of characters or humans, so you need to have them all done with computer graphics and animated in realistic ways. Now that we are all home, you know, VR is super popular, right? Everyone wants a tiger in the room, or to have a 3D character version, a 3D avatar, of yourself, and so on. There's also robotics and healthcare. There's actually also a lot of computer graphics in these areas. And these are the areas that I'm particularly excited about. And why is that? It's actually for simulation. So before you can deploy any kind of robotic system in the real world, you need to test it in a simulated environment, all right? You need to test it against all sorts of challenging scenarios: in healthcare for robotic surgery, for self-driving cars, you know, warehouse robots and stuff like that. I'm going to show you this simulator called DriveSim that NVIDIA has been developing. And this video is a couple of years old; now it's a lot better than this. But basically, simulation is kind of like a game. It's really a game engine for robots, where now you expose a lot more of the game engine. You want to give the creator, the roboticist, some control over the environment, right? You want to decide how many cars you're going to put in there, what the weather is going to be, night or day, and so on. So this gives you some control over the scenarios you're going to test against. But the nice thing about, you know, having this computer graphics pipeline is that everything is kind of labeled in 3D. You already have created a 3D model of a car, so you know it's a car, and you know the parts of the car, and you know something is a lane, and so on. And instead of just rendering the picture, you can also render, you know, ground truth for AI to both train on and be tested against. Right, so you can get ground truth lanes, ground truth weather, ground truth segmentation, all that stuff that's super hard to collect in the real world. Okay. My kind of goal would be, you know, if we wanna think about all these applications, and in particular robotics: can we simulate the world in some way? Can we just load up a model like this, which maybe looks good from afar, but we wanna create really good content at street level, you know, both assets as well as behaviors, and just make these virtual cities alive such that we can now test our robots inside them. Alright, so it turns out that actually requires significant human effort.
Here, we see a person creating a scene aligned with a given real-world image. The artist places scene elements, edits their poses and textures, as well as scene or global properties such as weather, lighting, and camera position. This process ended up taking four hours for this particular scene. So here the artist already had the assets, you know, bought them online or whatever, and the only goal was to kind of recreate the scene above. And it already took four hours. Right, so this is really, really slow. And I don't know whether you guys are familiar with, you know, games like Grand Theft Auto. That was an effort by a thousand engineers, a thousand people working for three years, basically recreating LA, Los Angeles, going around the city and taking tons of photographs, you know, 250,000 photographs, many hours of footage, anything that would give them an idea of what they needed to replicate from the real world. All right. So this is where AI can help. We know computer vision, we know deep learning. Can we actually just take some footage and recreate the cities, both in terms of reconstruction, the assets, as well as behavior, so that we can simulate all this content, all this live content?


AI for 3D content creation (07:00)

All right. So this is kind of my idea of what we need to create. And I really hope that, you know, some of you guys are going to be equally excited about these topics and are going to work on this. So I believe that we need AI in this particular area. So we need to be able to synthesize worlds, which means, you know, scene layouts (where am I placing these different objects, maybe the map of the world) and also assets. So we need some way of creating assets, like, you know, cars, people and so on, in some scalable way, so we don't need artists to create this content very slowly. As well as, you know, the dynamic parts of the world. So scenarios, which means I need to be able to have really good behavior for everyone, right? How is everyone going to drive? As well as animation, which means that the human or any articulated object you animate needs to look realistic, okay? A lot of this stuff is already done for any game; the artists and engineers need to do that. What I'm saying is, can we have AI do this much better, much faster? All right, so, you know, what I'm gonna talk about today is kind of like our humble beginning. So this is the main topic of my, you know, Toronto NVIDIA lab. And I'm gonna tell you a little bit about all these different topics that we have been slowly addressing, but there's just so much more to do.


Synthesizing worlds (08:20)

Okay, so the first thing we wanna tackle is: can we synthesize worlds by just maybe looking at real footage that we can collect, let's say from a self-driving platform? So can we take those videos and train some sort of generative model that is going to generate scenes that look like the real city that we wanna drive in? So if I'm in Toronto, I might need brick walls. If I'm in LA, I just need many more streets. Like, I need to somehow personalize this content based on the part of the world that I'm gonna be in. Okay. If you guys have any questions, just write them up. I like it when the lecture is interactive. All right, so how can we compose scenes? And our thinking was really kind of looking into how games are built, right? In games, you know, people need to create very diverse levels. So they need to create very large worlds in a very scalable way. And one way to do that is using some procedural models, right, or a probabilistic grammar, which basically gives you, you know, rules about how the scene is created, such that it looks like a valid scene. So in this particular case, I would sample a road, right, with some number of lanes, and then on each lane, you know, sample some number of cars, and maybe there's a sidewalk next to a lane with maybe people walking there, and there's trees, or something like that, right? So these probabilistic models can be fairly complicated. You can quickly imagine how this can become complicated. But at the same time, it's not so hard to actually write this. Anyone would be able to write a bunch of rules about how to create this content. Okay? So it's not too tough, but the tough part is, you know, setting all these distributions here such that the rendered scenes are really gonna look like your target content, right? Meaning that if I'm in Toronto, maybe I wanna have more cars; if I'm in a small village somewhere, I wanna have fewer cars. So for all that, I need to go in and, you know, kind of personalize these models, set the distributions correctly. So this is just one example of, you know, sampling from a probabilistic model. Here, the probabilities for the orientations of the cars were kind of randomly set. But even then the scene already looks kind of okay, right? Because it already incorporates all the rules that we know about the world; the distributions are what the model will need to train. All right, so you can think of this as some sort of a graph, right, where each node defines the type of asset we wanna place. And then we also have attributes, meaning we need to have location, height, pose, anything that is necessary to actually place this car in the scene and render it. Okay. And these things are typically set by an artist, right? They look at the real data and then they decide, you know, how many pickup trucks I'm gonna have in the city, and so on. All right. So basically they set this distribution by hand. What we're saying is, can we actually learn this distribution by just looking at data?
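Just to make the idea concrete, here is a minimal sketch of such a hand-written probabilistic scene grammar in Python; the distributions, attribute names, and asset types are made-up placeholders for illustration, not the ones from any actual system.

```python
# A minimal sketch of a hand-written probabilistic scene grammar:
# road -> lanes -> cars, each node carrying the attributes needed to place it.
import random

ASSET_TYPES = ["sedan", "pickup", "truck"]  # hypothetical asset catalogue

def sample_scene():
    """Sample a scene graph: road -> lanes -> cars, each node with attributes."""
    scene = {"type": "road", "num_lanes": random.randint(1, 4), "lanes": []}
    for lane_id in range(scene["num_lanes"]):
        lane = {"type": "lane", "id": lane_id, "cars": []}
        # The artist-chosen prior: how many cars per lane.
        num_cars = random.choices([0, 1, 2, 3], weights=[0.2, 0.4, 0.3, 0.1])[0]
        for _ in range(num_cars):
            lane["cars"].append({
                "type": random.choice(ASSET_TYPES),
                "position_m": random.uniform(0.0, 100.0),   # along the lane
                "lateral_offset_m": random.gauss(0.0, 0.3),  # within the lane
                "heading_deg": random.gauss(0.0, 5.0),       # orientation noise
            })
        scene["lanes"].append(lane)
    return scene

if __name__ == "__main__":
    print(sample_scene())
```

Writing the rules is the easy part; the hand-picked weights and Gaussians above are exactly the distributions the talk proposes to learn from data instead.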


Scene composition (11:45)

Okay. And we had this paper called Meta-Sim a couple of years ago, where the idea was: let's assume that the structure of the scenes that I'm sampling, so in this particular case, you know, how many lanes I have, how many cars I have, comes from some distribution that the artist has already designed. So the graphs are going to be correct, but the attributes should be modified. So I sample an original scene graph that I can render; like in the example you saw before, the cars were kind of randomly rotated and so on. The idea is: can a neural network now modify the attributes of these nodes, modify the orientations, the colors, maybe the type of object, such that when I render those scene graphs, I get images that look, in distribution, like the real images that I have recorded. So we don't want to go after an exact replica of each scene. We want to be able to train a generative model that's going to synthesize images that are going to look like images we have recorded. That's the target. Okay. So basically we have some sort of a graph neural network that's operating on scene graphs, and it's trying to predict new attributes for each node. I don't know whether you guys talked about graph neural nets in class. And the loss that's coming out is computed through this renderer here. And we're using something called maximum mean discrepancy. So I'm not gonna go into details, but basically the idea is you need to compare two different distributions. You could compare them by comparing the means of the two distributions, or maybe higher-order moments, and MMD was designed to compare higher-order moments. Now this loss can be backpropped through this non-differentiable renderer back to the graph neural net, and we just use numerical gradients to do this step. And the cool part about this is we haven't really needed any sort of annotation on the images. We're comparing images directly, because we're assuming that the synthesized images already look pretty good, right? So we actually don't need labeled data. We just need to drive around and record these things. Okay, you can do something even cooler. You can actually try to personalize this data to the task you're trying to solve later, which means that you can train this network to generate data such that, if you train some other neural net on top of this data, let's say an object detector, it's gonna really do well on, you know, whatever task you have in the end, on data collected in the real world. Okay. Which might not mean that the objects need to look really good in the scene. It just means that you need to generate scenes that are going to be useful for some network that you want to train on that data. Okay. And you again backprop this, and you can do this with reinforcement learning. Okay. So this was now training the distribution for the attributes, but that was kind of the easy part, and we were sidestepping the issue of: well, what about the structure of these graphs? Meaning, if I had always generated, you know, five or eight or ten cars in the scene, but now I'm in a village, I will just not train anything very useful, right? So the idea would be: can we learn the structure, the number of lanes, the number of cars and so on as well? Okay. And it turns out that actually you can do this as well. And here we had a probabilistic context-free grammar, which basically means you have a root node, you have some symbols, which can be non-terminal or terminal symbols, and rules that basically expand non-terminal symbols into new symbols.
So an example would be here, right? So you have a road which, you know, generates lanes, and a lane can expand into a lane or more lanes, right?
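As a side note on the Meta-Sim objective described above, here is a minimal sketch of a maximum mean discrepancy (MMD) loss with an RBF kernel on image feature vectors; the feature dimension, bandwidth, and batch sizes are illustrative assumptions rather than the paper's exact setup.

```python
# A minimal sketch of an MMD loss: compare two distributions through their
# kernel mean embeddings rather than just their means.
import torch

def rbf_kernel(x, y, sigma=1.0):
    # x: (n, d), y: (m, d) -> (n, m) kernel matrix
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd_loss(synth_feats, real_feats, sigma=1.0):
    """Squared MMD between synthesized and real feature distributions."""
    k_ss = rbf_kernel(synth_feats, synth_feats, sigma).mean()
    k_rr = rbf_kernel(real_feats, real_feats, sigma).mean()
    k_sr = rbf_kernel(synth_feats, real_feats, sigma).mean()
    return k_ss + k_rr - 2 * k_sr

# Usage: features of rendered images vs. features of recorded real images.
synth = torch.randn(64, 512, requires_grad=True)  # stand-in for rendered-image features
real = torch.randn(64, 512)                       # stand-in for real-image features
loss = mmd_loss(synth, real)
loss.backward()  # gradients flow back toward the attribute-predicting network
```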


Learning structure (15:50)

And so on; so these are the rules, okay? And basically what we wanna do is we wanna train a network that's gonna learn to sample from this probabilistic context-free grammar, okay? So we're gonna have some sort of a latent vector. Here, we know where we are in the tree, the graph we have already generated before. So imagine we have sampled some lane or whatever. So now we know the corresponding symbols that we can actually sample from here. We can use that to mask the probabilities for everything else out, right? And our network is basically gonna learn how to produce the correct probabilities for the next symbol we should be sampling. Okay, so basically at each step, I'm going to sample a new rule until I hit all the terminal symbols. Okay. That basically gives me something like this: I sample the rules, in this case, which can be converted to a graph. And then, using the previous method, we can, you know, augment this graph with attributes, and then we can render the scene. Okay. So basically now we are also learning how to generate the actual scene graph, the actual structure of the scene graph, and the attributes. And this is super hard to train. So there are a lot of bells and whistles to make this work, but essentially, because these are all non-differentiable steps, you need something like reinforcement learning. And there are a lot of tricks to actually make this work. But I was super surprised how well this can actually turn out. So on the right side, you see samples from the real dataset; KITTI is a real driving dataset. On the left side are samples from the probabilistic grammar. Here we have set these prior probabilities manually, and we purposely made them really bad, which means that when you sample from this probabilistic grammar, you get really few cars, almost no buildings. And you can see these are, like, almost unpopulated scenes. After training, the generative model learned how to sample these kinds of scenes, because they were much closer to the real target data. So these were the final trained scenes. And now, how can you actually evaluate that we have done something reasonable here? You can look at, for example, the distribution of cars in the real dataset. This is KITTI over here. So here, we have a histogram of how many cars you have in each scene. You have this orange guy here, which is the prior, meaning this badly initialized probabilistic grammar, where you are sampling, most of the time, very few cars. And then the learned model, which is the green, the lime green here. So you can see that the generated scenes really, really closely follow this distribution of the real data, without a single annotation at hand, right? Now you guys could argue, well, it's super easy to write, you know, these distributions by hand and we're done with it. I think this just shows that this can work. And the next step would just be to make this really large scale, make these really huge probabilistic models where it's hard to tune all these parameters by hand. And the cool part is that everything can now be trained automatically from real data, so any end user can just take this and it's gonna train on their end. They don't need to go and set all this stuff by hand. Okay, now the next question is, how can I evaluate that my model is actually doing something reasonable?
And one way to do that is by actually sampling from this model, synthesizing these images along with the ground truth, and then training some, you know, end model, like a detector, on top of this data and testing it on real data, and just seeing whether the performance has at some point improved compared to, you know, let's say, that badly initialized probabilistic grammar. And it turns out that that's the case. Okay. Now, this was the example shown on driving, but, oh, sorry. So this video here, I'm just showing basically what's happening during training. Let me just go quickly. So the first snapshot is the first sample from the model. And then what you're seeing is how this model is actually training, so how it is modifying the scene during training. Let me show you one more time. So you can see the first frame was really kind of badly placed cars, and then it's slowly trying to figure out where to place them to be correct. And of course it's a generative model, right? So you can sample tons of scenes, and everything comes labeled. Cool, right.
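Here is a minimal sketch of the masked rule-sampling idea described above: a network produces logits over grammar rules, rules whose left-hand side doesn't match the current non-terminal are masked out, and sampling proceeds until only terminals remain. The toy grammar, the linear "policy" network, and the context features are illustrative assumptions.

```python
# A minimal sketch of learned sampling from a probabilistic context-free grammar.
import torch
import torch.nn as nn

RULES = [  # (left-hand side, right-hand side)
    ("ROAD", ["LANE"]),
    ("ROAD", ["LANE", "ROAD"]),   # add another lane
    ("LANE", ["car", "LANE"]),    # add a car, keep expanding this lane
    ("LANE", []),                 # stop expanding this lane
]
NONTERMINALS = {"ROAD", "LANE"}

policy = nn.Linear(16, len(RULES))  # stand-in for the learned rule sampler

def sample(symbol="ROAD", state=None, depth=0, max_depth=20):
    if symbol not in NONTERMINALS:
        return [symbol]                 # terminal symbol, keep it
    if depth >= max_depth:
        return []                       # safety cut-off for the sketch
    state = state if state is not None else torch.zeros(16)  # stand-in context
    logits = policy(state)
    # Mask out rules whose left-hand side is not the current non-terminal.
    mask = torch.tensor([0.0 if lhs == symbol else float("-inf") for lhs, _ in RULES])
    rule_idx = torch.distributions.Categorical(logits=logits + mask).sample().item()
    out = []
    for child in RULES[rule_idx][1]:
        out += sample(child, state, depth + 1, max_depth)
    return out

print(sample())  # e.g. ['car', 'car'] -- the terminals of one sampled scene
```

In the actual system these sampled structures are then decorated with attributes and rendered, and the sampler is trained with reinforcement learning because the rendering steps are non-differentiable.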


Synthesizing medical data (20:55)

This model here was shown on driving, but you can also apply it elsewhere, in other domains. And here, you know, medicine or healthcare is now very important, particularly these days when everyone is stuck at home. So, can you use something like this to also synthesize medical data? And what do I mean by that? So, doctors need to take CT or MRI volumes and go and label every single slice of that with, you know, let's say a segmentation mask, such that they can train, like, you know, cancer segmentation or heart segmentation or lung segmentation, COVID detection, whatever, right? So first of all, data is very hard to come by, right? Because for some diseases you just don't have a lot of this data. The second part is that it's actually super time consuming, and you need experts to label that data. So in the medical domain it's really important if we can actually somehow learn how to synthesize this data, labeled data, so that we can kind of augment the real datasets with it. Okay? And the model here is going to be very simple again: you know, we have some generative model that goes from a latent code to some parameters of a mesh, in this case, this is our asset, along with a material map. And then we synthesize this with a physically based CT simulator, whose output, you know, looks a little bit blurry. And then we train an enhancement model with something like a GAN, and then you get simulated data out. Obviously, again, there are a lot of bells and whistles, but you can get really nice-looking synthesized volumes. So here the users can actually play with the shape of the heart, and then they can click "synthesize data" and you get some labeled volumes out, where the label is basically the stuff on the left, and this is the simulated sensor in this case.
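To make the wiring concrete, here is a minimal sketch of that kind of pipeline: a generator maps a latent code to mesh parameters, a (here completely fake) CT simulator produces a blurry volume, and an enhancement network refines it. Every module, shape, and name below is a placeholder assumption, not the actual system.

```python
# A minimal sketch of the labeled-medical-data synthesis pipeline.
import torch
import torch.nn as nn

class ShapeGenerator(nn.Module):
    """Latent code -> mesh parameters (here: vertex offsets of a template)."""
    def __init__(self, latent_dim=64, num_vertices=1024):
        super().__init__()
        self.num_vertices = num_vertices
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_vertices * 3))
    def forward(self, z):
        return self.net(z).view(-1, self.num_vertices, 3)

def ct_simulator(vertices):
    # Stand-in for the physically based CT simulator: returns a blurry volume.
    return torch.rand(vertices.shape[0], 1, 64, 64, 64)

enhancer = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv3d(8, 1, 3, padding=1))  # trained adversarially in practice

z = torch.randn(2, 64)
vertices = ShapeGenerator()(z)       # labelled geometry (this is the "ground truth")
volume = ct_simulator(vertices)      # simulated, blurry sensor data
synthetic_ct = enhancer(volume)      # enhanced volume, paired with the labels above
```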


Recovering rules of the world (23:00)

Okay. All right, so now we talked about using procedural models to generate worlds. And of course the question is: well, do we need to write all those rules? Can we just learn how to recover all those rules? And here was our first take on this. Here we wanted to learn how to generate city road layouts, okay? Which means, you know, we wanna be able to generate something like this, where, you know, the lines over here represent roads, okay? This is the base of any city. And we wanna again have some control over these worlds, so you can have something like interactive generation: I want this part to look like Cambridge, this part to look like New York, this part to look like Toronto, whatever. And we want to be able to generate or synthesize everything else, you know, according to these styles. Okay. You can interpret a road layout as a graph. Okay, so what does that mean? I have some control points, and two control points being connected means I have a road line segment between them. So really the problem that we're trying to solve here is: can we have a neural net generate graphs? Graphs with attributes, where each attribute might be an x, y location of a control point. Okay, and again, it's a giant graph, but this is an entire city we wanna generate. So we had actually a very simple model where you're kind of iteratively generating this graph. Imagine that we have already, you know, generated some part of the graph. What we're going to do is take a node from what we call an unfinished set. We encode every path that we have already synthesized that leads to this node, which basically means we wanna kind of encode what this node already looks like, what roads it's connecting to. And we wanna generate the remaining nodes, basically how these roads continue, in this case. Okay, and this was super simple. You just have, like, RNNs encoding each of these paths and one RNN that's decoding the neighbors. Okay, and you stop when basically you hit some predefined size of the city. Okay, let me show you some results. So here you can condition on the style of the city. So you can generate Barcelona or Berkeley. You can have this control, or you can condition on part of the city being a certain style. And you can use the same model, the generative model, to also parse real maps or real aerial images and create variations of those maps for something like simulation, because for simulation we need to be robust to the actual layouts. So now you can turn that graph into an actual small city, where you can maybe procedurally generate the rest of the content like we were discussing before: where the houses are, where the traffic signs are, and so on. Cool. Right, so now we can generate the map of the city. We can place some objects somewhere in the city. So we're kind of close to our goal of synthesizing worlds, but we're still missing objects. Objects are still a pain that the artists need to create, right? So all this content needs to be manually designed, and that just takes a lot of time to do. All right, and maybe it's already available; you guys are going to argue that, you know, for cars you can just go online and pay for this stuff.
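Here is a minimal sketch of that iterative expansion step: each path of control points arriving at the current node is encoded with an RNN, the encodings are pooled, and a decoder RNN emits the positions of new neighbouring control points. The dimensions, the pooling, and the fixed number of neighbours are assumptions for illustration.

```python
# A minimal sketch of iterative road-layout graph generation with RNNs.
import torch
import torch.nn as nn

path_encoder = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
neighbor_decoder = nn.GRUCell(input_size=2, hidden_size=64)
to_xy = nn.Linear(64, 2)

def expand_node(paths, max_neighbors=3):
    """paths: list of (length, 2) tensors of control-point coordinates ending at this node."""
    # Encode every incoming path and average the final hidden states.
    h = torch.stack([path_encoder(p.unsqueeze(0))[1].squeeze() for p in paths]).mean(0)
    # Decode up to max_neighbors new control points continuing the roads.
    prev, new_points = torch.zeros(2), []
    for _ in range(max_neighbors):
        h = neighbor_decoder(prev.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        prev = to_xy(h)
        new_points.append(prev)
    return torch.stack(new_points)

paths = [torch.randn(5, 2), torch.randn(3, 2)]  # two roads arriving at this node
print(expand_node(paths))  # (3, 2) proposed continuations of the road network
```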


Object creation (26:45)

First of all, it's expensive. And second of all, it's not really so widely available for certain classes. Like, if I want a raccoon, because I'm in Toronto and there are just tons of them, there are just a couple of models and they don't really look like real raccoons, right? So the question is, can we actually solve these tasks by taking just pictures and synthesizing this content from pictures, right? So ideally we would have something like an image, and we want to produce out, you know, a 3D model, a 3D textured model, that I can then insert in my real scenes. And ideally we wanna do this from just images that are widely available on the web, right? I think the new iPhones all have LiDAR, so maybe this world is gonna change, because everyone is gonna be taking 3D pictures, right? With some 3D sensor. But right now, the majority of pictures of objects that are available on Flickr, let's say, are all single images. People just snapshotting a scene or snapshotting a particular object. So the question is, you know, how can we learn from all that data and go from an image on the left to a 3D model? And in our case, we're gonna want to produce as an output from the image a mesh, which basically has vertex locations x, y, z and some color or material properties on each vertex, and those 3D vertices along with faces, which say which vertices are connected; that's basically defining this 3D object. Okay. And now we're gonna turn to graphics to help us with our goal of doing this, you know, kind of without supervision, learning from the web.
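As a concrete picture of that output representation, here is a minimal sketch of such a mesh: per-vertex positions and colours plus integer faces indexing into the vertex list. The container class and shapes are illustrative assumptions.

```python
# A minimal sketch of the predicted mesh representation.
import torch

class Mesh:
    def __init__(self, vertices, faces, vertex_colors):
        self.vertices = vertices            # (V, 3) float: x, y, z per vertex
        self.faces = faces                  # (F, 3) long: vertex indices per triangle
        self.vertex_colors = vertex_colors  # (V, 3) float: RGB (or material) per vertex

# A single triangle as a toy example.
mesh = Mesh(
    vertices=torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]]),
    faces=torch.tensor([[0, 1, 2]]),
    vertex_colors=torch.tensor([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]),
)
```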


Graphics via differentiable rendering (28:50)

Okay. And in graphics, we know that images are formed by geometry interacting with light, right? That's just the principle of rendering. So we know that if you have a mesh, if you have some light source or sources, and you have a texture, and also materials and so on, which I'm not writing out here, and some graphics renderer, you know, there are many to choose from, you get out a rendered image, okay? Now, if we make this part differentiable, if you make the graphics renderer differentiable, then maybe there is hope of going the other way, right? You can think of computer vision as being inverse graphics. Graphics goes from 3D to images; computer vision wants to go from images into 3D. And if this module is differentiable, maybe there's hope of doing that. So there's been quite a lot of work lately on basically this kind of pipeline, with different modifications. But basically, this summarizes the ongoing work: you have an image, you have some sort of a neural net that you want to train, and you're making these kinds of bottleneck predictions here, which are just mesh, light, texture, maybe material. Okay. Now, you can't have the loss over here, because you don't have the ground truth mesh for this car; otherwise you would need to annotate it. What we're going to do instead is send these predictions over to this renderer, which is gonna render an image. And we're going to have the loss defined on the rendered image and the input image. We're basically gonna try to make these images match. Okay? And of course there are a lot of other losses that people use here, like multi-view losses, where you're assuming that in training you have multiple views of the same object, you have masks, and so on. So there are a lot of bells and whistles about how to really make this pipeline work. But in principle, it's a very clean idea, right? We wanna predict these properties, I have this graphics renderer, and I am just comparing input and output. And because this renderer is differentiable, I can propagate this loss back to all my desired, you know, network weights, so I can predict these properties. Okay. Now we in particular had a very simple, like, OpenGL-type renderer, which we made differentiable. There are also versions where you can make ray tracing differentiable and so on. But basically the idea that we employed was super simple, right? A mesh is basically projected onto an image and you get out triangles. And each pixel is basically just a barycentric interpolation of the vertices of this projected triangle. And now, if you have any properties defined on those vertices, like color or texture and so on, then you can compute this value here through your renderer, which assumes some lighting and so on, in a differentiable manner using these barycentric coordinates. This is a differentiable function. And you can just go back through whatever lighting or whatever shader model you're using. Okay. So very simple, and there are, you know, much, much richer differentiable renderers available these days.
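Here is a minimal sketch of that differentiable shading step: a pixel's colour is the barycentric interpolation of the attributes at the projected triangle's three vertices, which is differentiable with respect to those attributes. The barycentric weights are given here rather than computed by rasterization, and there is no lighting model; both are simplifying assumptions.

```python
# A minimal sketch of barycentric interpolation as a differentiable shading step.
import torch

def shade_pixel(vertex_colors, bary_weights):
    """vertex_colors: (3, C) attributes at the triangle's three vertices.
    bary_weights: (3,) barycentric coordinates of the pixel, summing to 1."""
    return bary_weights @ vertex_colors  # (C,) interpolated pixel value

vertex_colors = torch.tensor([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [0.0, 0.0, 1.0]], requires_grad=True)
pixel = shade_pixel(vertex_colors, torch.tensor([0.2, 0.3, 0.5]))
pixel.sum().backward()          # gradients flow back to the per-vertex attributes
print(pixel, vertex_colors.grad)
```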


Data generation (32:30)

But here we tried to be a little bit clever as well with respect to data, because most of the related work was taking synthetic data to train their models. Why? Because most of the work needed multi-view data during training, which means I have to have multiple pictures, from multiple different views, of the same object. And that is hard to get from just web data, right? It's hard to get. So people were just basically taking synthetic cars from synthetic datasets, rendering them in different views, and then training the model, which really just maybe makes the problem not so interesting, because now we are actually relying on synthetic data to solve this. And the question is, how can we get data? We tried to be a little bit clever here, and we turned to generative models of images. I don't know whether you guys covered image GANs in class, but if you take something like StyleGAN, which is, you know, a generative adversarial network designed to really produce high-quality images by sampling from some prior, you get really amazing pictures out. Like, all these images have been synthesized. None of this is real. This is all synthetic. Okay. These GANs, basically what they do is: you have some latent code, and then there's, you know, some nice progressive architecture that slowly transforms that latent code into an actual image. Okay. What happens is that if you start analyzing this latent code, if you take certain dimensions of that code and you freeze them, okay, and you just manipulate the rest of the code, it turns out that you can find really interesting controllers inside this latent code. Basically, the GAN has learned about the 3D world, and it's just hidden in that latent code. Okay. What do I mean by that? So you can find some latent dimensions that basically control the viewpoint, and the rest of the code is kind of controlling the content, meaning the type of car, and the viewpoint means the viewpoint of that car. Okay, so if I look at it here, we basically varied the viewpoint code and kept the content code, the rest of the code, frozen. And this is all basically synthesized. And the cool part is that it actually looks like, you know, multiple views of the same object. It's not perfect, like this guy, the third object in the top row, doesn't look exactly matched, but most of them look like the same car in different views. And the other side also holds. So if I keep the viewpoint code fixed in each of these columns, but vary the content code, meaning different rows here, I can actually get different cars in each viewpoint. Okay. So this is, again, basically synthesized. And that's precisely the data we need. So we didn't do anything super special in our technique. The only thing we were smart about was how we got the data. And now you can use this data to train our, you know, differentiable rendering pipeline. And you get, you know, predictions like this. You have an input image and a bunch of 3D predictions, and now we can also do cars. So the input image is on the left, and then the 3D prediction rendered in that same viewpoint is here in this column. And that's that prediction rendered in multiple different viewpoints, just to showcase the 3D nature of the predictions. And now we basically have this tool that can take any image and produce a 3D asset. So we can have tons and tons of cars by just basically taking pictures.
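Here is a minimal sketch of that latent-code trick: keep the "content" part of a code fixed and sweep the "viewpoint" part to build a pseudo multi-view training set. The generator stub, the dimension split, and the image size are hypothetical placeholders; in practice you would use a pretrained StyleGAN-like generator and the dimensions found by the analysis described above.

```python
# A minimal sketch of building pseudo multi-view data from a GAN latent space.
import torch

LATENT_DIM = 512
VIEW_DIMS = slice(0, 8)  # assume the first 8 dimensions control the viewpoint

def fake_generator(z):   # stand-in for a pretrained image GAN
    return torch.rand(z.shape[0], 3, 256, 256)

content = torch.randn(1, LATENT_DIM)        # one car: fixed content code
viewpoints = torch.randn(12, LATENT_DIM)    # twelve random codes to borrow views from

batch = content.repeat(12, 1)               # same content twelve times
batch[:, VIEW_DIMS] = viewpoints[:, VIEW_DIMS]  # swap in different viewpoint dims
multi_view_images = fake_generator(batch)   # (12, 3, 256, 256) "multi-view" set
```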
Here is a little demo in that Omniverse tool, where the user can now take a picture of a car and get out a 3D model. Notice that we also estimate materials: you can see the windshields are a little bit transparent, and the car body looks like it's shiny, so it's metal, because we were also predicting 3D parts. And, you know, it's not perfect, but they're pretty good. And now, just a month ago, we got a new version that can also animate this prediction. So you can take an image, predict this guy, and we can just put in, you know, tires instead of the predicted tires. You can estimate physics and you can drive these cars around. So they actually become useful assets. This is only on cars now, but of course the system is general, so we're in the process of applying it to all sorts of different content. Cool. I think, I don't know how much more time I have, so maybe I'm just gonna skip to the end. I always have too many slides. So I have all this stuff on behaviors and whatever, but I wanted to show you just the last project that we did, 'cause I think you guys gave me only 40 minutes.


Neural simulation (38:30)

So, you know, we also have done some work on animation using reinforcement learning, and on behavior, that, you know, maybe I skipped here; but we are basically building modular deep learning blocks for all the different aspects. And the question is, can we even sidestep all that? Can we just learn how to simulate everything with one neural net? We're gonna call it neural simulation. So can we have one AI model that can just look at our interaction with the world and then be able to simulate that? So, in computer games, we know that they accept some user action, left, right, keyboard control or whatever. And then the game engine is basically synthesizing the next frame, which is gonna tell us how the world has changed according to your action. So what we're trying to attempt here is to replace the game engine with a neural net, which means that we still wanna have the interactive part of the game, where the user is going to be inputting actions, gonna be playing, but the frames are going to be synthesized by a neural net. Which basically means that this neural net needs to learn how the world works. If I hit another car, it needs to produce a frame that's going to look like that. Now, in the beginning, our first project was: can we just learn how to emulate a game engine? Can we take Pac-Man and try to mimic it, try to see if the neural net can learn how to mimic Pac-Man? But of course, the interesting part is going to start where we don't have access to the game engine, like the real world. You can think of the world as being the Matrix, where we don't have access to the Matrix, but we still wanna learn how to simulate and emulate the Matrix. And that's really exciting future work. But basically, for now we're just kind of trying to mimic what the game engine does, where you're inputting some, you know, action and maybe the previous frame, and then you have something called a dynamics engine, which is basically just an LSTM that's trying to learn what the dynamics of the world look like, how frames change. We have a rendering engine that takes that latent code and is going to actually produce a nice-looking image. And we also have some memory, which allows us to push any information that we want, to be able to produce consistent gameplay, into some additional block here. Okay, and here was our first result on Pac-Man. And we released this on the 40th birthday of Pac-Man. What you see over here is all synthesized. And to me, even if it's such a simple game, it's actually not so easy, because, you know, the neural net needs to learn that, for Pac-Man, if it eats the food, the food needs to disappear. The ghosts can become blue. And then if you eat a blue ghost, you survive; otherwise you die. So there are already a lot of different rules that you need to recover, along with just, like, synthesizing images, right? And of course our next step is: can we scale this up? Can we go to 3D games, and can we eventually go to the real world? Okay. So again, here the control is going to be the steering control, so like speed and the steering wheel. This is done by the user, by a human. And what you see on the right side is, you know, the frames painted by this model. So here we are driving this car around, and you can see what the model is painting. It's a pretty consistent world, in fact. And there's no 3D, there's nothing.
We are basically just synthesizing frames. And here's a little bit more complicated version, where we try to synthesize other cars as well. And this is on the CARLA simulator; that was the game engine we're trying to emulate. It's not perfect; like, you can see that the cars actually change color. But it was quite amazing that it's able to do that at all. And right now we have a version actually training on real driving videos, like a thousand hours of real driving, and it's actually doing an amazing job already. And, you know, so I think this could be a really good alternative to the rest of the pipeline. All right.
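Here is a minimal sketch of that loop: a user action and the previous latent state go through a dynamics module (an LSTM cell here), and a rendering module paints the next frame. All modules, sizes, and the missing memory block and training setup are placeholder assumptions, not the actual architecture described above.

```python
# A minimal sketch of a neural "game engine" loop: action in, painted frame out.
import torch
import torch.nn as nn

class NeuralGameEngine(nn.Module):
    def __init__(self, action_dim=4, hidden_dim=256):
        super().__init__()
        self.dynamics = nn.LSTMCell(action_dim, hidden_dim)   # world dynamics
        self.renderer = nn.Sequential(                        # latent -> image
            nn.Linear(hidden_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def step(self, action, state):
        h, c = self.dynamics(action, state)  # advance the simulated world state
        return self.renderer(h), (h, c)      # painted next frame, new state

engine = NeuralGameEngine()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
for _ in range(5):                            # the user "plays" five steps
    action = torch.randn(1, 4)                # stand-in for keyboard/steering input
    frame, state = engine.step(action, state) # (1, 3, 32, 32) synthesized frame
```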


3D deep learning library (43:25)

You know, one thing to realize when you're doing something that's so broad and such a big problem is that you're never going to solve it alone. So one mission that I have is also to provide tools to the community, such that, you know, you guys can take them and build your own ideas and build your own 3D content generation methods. Okay, so we just recently released... 3D deep learning is an exciting new frontier, but it hasn't been easy adapting neural networks to this domain. Kaolin is a suite of tools for 3D deep learning, including a PyTorch library and an Omniverse application. Kaolin's GPU-optimized operations and interactive capabilities bring much-needed tools to help accelerate research in this field. For example, you can visualize your model's predictions as it's training. In addition to textured meshes, you can view predicted point clouds and voxel grids with only two lines of code. You can also sample and inspect your favorite dataset, easily convert between meshes, point clouds, and voxel grids, render 3D datasets with ground truth labels to train your models, and build powerful new applications that bridge the gap between images and 3D using a flexible and modular differentiable renderer. And there's more to come, including the ability to visualize remote training checkpoints in a web browser. Don't miss these exciting advancements in 3D deep learning research and how Kaolin will soon expand to even more applications. Yeah. So a lot of the stuff I talked about, all the basic tooling, is available. So, you know, please take it and do something amazing with it. I'm really excited about that.


Conclusion Remarks

Summary and conclusion (45:25)

Just to conclude, you know, my goal is to really democratize 3D content creation. You know, I want my mom to be able to create really good 3D models, and she has no idea even how to use Microsoft Word or whatever, so it needs to be super simple. Have AI tools that are going to be able to also assist maybe more advanced users like artists and game developers, but just reduce the load of the boring stuff, just enable their creativity to come into play much faster than it can right now. And all of that is also connected to learning to simulate for robotics. Simulation is just a fancy game engine that needs to be real as opposed to being from fantasy, but it can be really, really useful for robotics applications. Right, and what we have here is really just, like, two and a half years of our lab, but there's so much more to do. And I'm really hoping that you guys are gonna do this. I just wanted to finish with one slide, because you guys are students. My advice for research: just learn, learn, learn. This deep learning course is one step; don't stop here, continue. One very important aspect is just be passionate about your work and never lose that passion, because that's where you're really going to be productive and you're really gonna do good stuff. If you're not excited about the research you're doing, you know, choose something else. Don't rush for papers. Focus on getting really good papers as opposed to the number of papers; that's not a good metric, right? Hunting citations is maybe also not the best metric, right? Some not-so-good papers have a lot of citations; some good papers don't have a lot of citations. You're going to be known for the good work that you do. Find collaborations, find collaborators. That's particularly important in my style of research: I want to solve real problems, which means that how to solve them is not clear. And sometimes we need to go to physics, sometimes we need to go to graphics, sometimes we need to go to NLP, whatever. And I have no idea about some of those domains, and you just wanna learn from experts. So it's really good to find collaborators. And the last point, which I have always used as guidance: it's very easy to get frustrated, because 99% of the time things won't work, but just remember to have fun. This research is really fun. And that's all from me. I don't know whether you guys have some questions.

