MIT 6.S191 (2020): Neural Rendering
Transcription for the video titled "MIT 6.S191 (2020): Neural Rendering".
Beginning, Overview Of Video
Thanks a lot for having me here. Today I'm going to talk about neural rendering. Because rendering is such a massive topic, I'll start with some clarifications. As far as this lecture goes, rendering can be both a forward process and an inverse process. Forward rendering computes an image from some 3D scene parameters, such as the shape of the object, the color of the object, the surface material, the light source, et cetera. Forward rendering has been one of the major focuses of computer graphics for many years. The opposite of this problem is inverse rendering: given some images, we try to work out the 3D scene that was used to produce them. Inverse rendering is closely related to computer vision, with applications such as 3D reconstruction and motion capture. Forward rendering and inverse rendering are interestingly related, because the high-level representation of a vision system should look like the representation used in computer graphics. In this lecture I'm going to talk about how machine learning can be used to improve the solutions to both of these problems.
But before we dive into neural networks, let's first take a very quick tour of the conventional methods. This is a toy example of ray tracing, which is a widely used forward rendering technique. Imagine you are inside a cave. The red bar is a light source, and the grid is the image plane. Ray tracing works by shooting rays from an imaginary eye through every pixel in the image grid, and it tries to compute the color of the object that you can see along each ray. In this case, the ray directly hits the light source, so we use the color of the light source to color the pixel. However, more often than not, the ray will hit some object's surface before it bounces to the light source. In this case, we need to calculate the color of the surface.
The color of the surface can be computed as an integral of the incident radiance. However, this is very difficult to do analytically. So what people normally do is use Monte Carlo sampling, which generates random rays within the integral domain and then computes the average of these rays as an approximation of the integral. We can also change the sampling function to approximate the surface material; this is how we can make the surface look more glossy or rough. Most of the time, a ray needs multiple bounces before it hits the light source, and this soon develops into a recursive problem, which is very expensive to solve. Many advanced ray tracing techniques have been invented to deal with this problem, which I'm not going to talk about here. But the general consensus is that ray tracing with Monte Carlo sampling is very expensive because of the high variance of the estimator and the slow convergence rate. For a complex scene like this, you need hundreds of millions or maybe billions of rays to render. So the question we ask is whether machine learning can be used to speed up this process, and the answer, as we will see later in the lecture, is yes.
But before we dive into the answer, let's quickly switch to the inverse problem for a moment. This is a classical shape-from-stereo problem, where you have two images of the same object and you try to work out the 3D shape of the object. You do this by first finding the same features across the two images. Then you can compute the camera motion between the two photos, which is usually parameterized by a rotation and a translation. With this camera motion, you can work out the 3D location of these features by triangulation. More cameras can be brought in to improve the result, and modern methods can scale up to work with thousands and even hundreds of thousands of photos. This is truly amazing.
However, the output of such a computer vision system is often noisy and sparse. In contrast, computer graphics applications need very clean data with razor-sharp details. So oftentimes humans have to step in and clean up the result, and sometimes we even need to handcraft it from scratch. Every time you hear the word "handcraft" nowadays, it's a strong signal for machine learning to step in and automate the process. So in the rest of the lecture, I'm going to talk about how neural networks can be used as a sub-module and as an end-to-end pipeline for forward rendering, and also how neural networks can be used as a differentiable renderer that opens the door to many interesting inverse applications.
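The Monte Carlo estimate of the surface integral mentioned in this overview can be sketched in a few lines. This is a minimal illustration, not the renderer shown in the talk: it assumes a toy light field with constant incident radiance of 1 from every direction (the hypothetical `incident_radiance` function), for which the integral of radiance times cos(theta) over the hemisphere is exactly pi, and it uses plain uniform hemisphere sampling.

```python
import math
import random


def sample_uniform_hemisphere(rng):
    """Return a unit direction uniformly distributed over the z >= 0 hemisphere."""
    z = rng.random()                      # cos(theta) is uniform in [0, 1)
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)


def incident_radiance(direction):
    """Toy light field (an assumption): radiance 1 from every direction."""
    return 1.0


def estimate_irradiance(num_samples, seed=0):
    """Monte Carlo estimate of the integral of L(w) * cos(theta) over the hemisphere."""
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)           # density of uniform hemisphere sampling
    total = 0.0
    for _ in range(num_samples):
        direction = sample_uniform_hemisphere(rng)
        cos_theta = direction[2]
        total += incident_radiance(direction) * cos_theta / pdf
    return total / num_samples


if __name__ == "__main__":
    # The error shrinks slowly as the sample count grows, which is why
    # rendering with few samples per pixel looks noisy.
    for n in (16, 256, 4096, 65536):
        estimate = estimate_irradiance(n)
        print(n, estimate, abs(estimate - math.pi))
```

The error of such an estimator falls roughly with the square root of the number of samples, so halving the noise costs about four times as many rays.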
In-Depth Discussions On Video Content
Forward rendering (05:40)
Let's start with the forward rendering process. As we mentioned before, Monte Carlo sampling is very expensive, and this is an example. On the top left, we have the noisy rendering with one sample per pixel. The number of samples then doubles from left to right and from top to bottom. As you can see, the result improves, but at the same time the computational cost also increases. And I'm going to make a very fascinating analogy here. Most of you should be familiar with AlphaGo by now, which uses a policy network and a value network to speed up the Monte Carlo tree search. For those who skipped some of the previous lectures, a value network takes a board position as input and predicts a scalar value as the winning probability. In essence, it reduces the depth of the tree search. Similarly, the policy network takes the board position as input and outputs a probability distribution over the best next move. In essence, it reduces the breadth of the search. So the analogy I'm trying to make here is that we can also use a policy network and a value network to speed up Monte Carlo sampling for rendering. For example, you can use a value network to denoise a rendering with a low number of samples per pixel. Basically, it tries to predict the correct pixel value from some noisy input. As far as the policy network goes, we can use a network to generate a useful policy that smartly samples the rays so the whole rendering converges faster. Let's first take a look at the value-based approach. This is the recent work we did on denoising Monte Carlo renderings. On the left, we have a noisy input image sampled at four samples per pixel. In the middle is the denoising result. On the right is the ground-truth reference image rendered with 32K samples per pixel.
The reference takes about 90 minutes to render on a 12-core CPU. In contrast, the denoising result only takes about a second to run on a commodity GPU. So there's a very good trade-off between speed and quality. The whole network is trained end to end as an autoencoder with two losses. The first loss is the L1 loss on the VGG features of the output image. The second loss is the GAN loss, which tries to retain the details in the output image. This is a side-by-side comparison between the results trained with and without the GAN loss. Denoising natural images has been studied for a long time, but denoising Monte Carlo renderings has some unique aspects. The first is that we can separate the diffuse and the specular components, run them through different paths of the network, and then merge the results together. This tends to improve the result a lot. Second, there are some very inexpensive byproducts of the rendering pipeline that we can use to further improve the result, such as the albedo map, the normal map, and the depth. These byproducts can be used as auxiliary features that provide a context that the denoising should be conditioned on. However, how to feed these auxiliary features into the pipeline is still pretty much an open research question. The way we did it in this paper is something called element-wise biasing and scaling. Element-wise biasing takes the auxiliary features, runs them through a bunch of convolution layers, and adds the result to the input feature x element-wise. One can prove this is equivalent to feature concatenation. Element-wise scaling performs an element-wise multiplication between the auxiliary features and the input feature x. The argument for having both scaling and biasing is that they capture different aspects of the relationship between the two inputs. You can think of element-wise biasing as a sort of OR operator, which checks whether a feature is in either one of the two inputs.
In contrast, element-wise scaling is an AND operator, which checks whether the feature is in both of the two inputs. So by combining them together, the auxiliary features can be utilized in a better way. This is a denoised result taking a noisy input sampled at four samples per pixel, and here we compare our result with alternative methods. In general, our method has less noise and more details. Now let's move on to the policy-based approach. I'm not going to cover the entire literature here, but I just want to point you to a very recent work from Disney Research, which is called Neural Importance Sampling. The idea is that, for each location in the scene, we want to find a very good policy that can help us sample rays smartly and reduce the convergence time. In practice, the best possible policy is actually the incident radiance map at that point, because it literally tells you where the light comes from. So the question is, can we generate this incident radiance map from some local surface properties through a neural network? And the answer is yes. Just like how we can nowadays generate images from random input noise, we can also train a generative network that generates this incident radiance map from some local surface properties, such as the location, the direction of the incoming ray, and the surface normal. However, the catch is that such a mapping from local surface properties to an incident radiance map varies from scene to scene, so the learning has to be carried out online during the rendering process. That means the network starts by generating some random policies and gradually learns the scene structure, so it's able to produce better policies. And this is the result. On the left is conventional ray tracing. On the right is ray tracing with neural importance sampling, which converges much faster, as you can see here.
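To see why a good sampling policy makes the estimate converge faster, here is a small one-dimensional sketch of importance sampling. It is only an illustration under stated assumptions: the toy integrand `f` stands in for the incoming light, and the "ideal policy" is a hand-written density proportional to `f`, not a distribution learned by a network as in Neural Importance Sampling.

```python
import random


def f(x):
    """Toy integrand (an assumption); its exact integral over [0, 1] is 1."""
    return 3.0 * x * x


def estimate_uniform(num_samples, rng):
    """Uniform policy: sample x with density p(x) = 1 on [0, 1)."""
    return sum(f(rng.random()) for _ in range(num_samples)) / num_samples


def estimate_importance(num_samples, rng):
    """Ideal policy: sample x with density p(x) = 3x^2, proportional to f."""
    total = 0.0
    for _ in range(num_samples):
        u = 1.0 - rng.random()            # in (0, 1], so x is never exactly 0
        x = u ** (1.0 / 3.0)              # inverse CDF, since the CDF is x^3
        total += f(x) / (3.0 * x * x)     # integrand divided by sampling density
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print("uniform   :", estimate_uniform(64, rng))
    print("importance:", estimate_importance(64, rng))
```

When the sampling density is exactly proportional to the integrand, every sample contributes the same value and the variance vanishes. A learned policy can only approximate that density, so in practice the variance is reduced rather than eliminated.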
End-to-end rendering (12:18)
So far, we have been talking about how neural networks can be used as a sub-module for forward rendering. Next, I'm going to talk about how we can use neural networks as an end-to-end pipeline. Remember we talked about ray tracing, which starts by casting rays from the pixels into a 3D scene. This is a so-called image-centric approach. This approach is actually very difficult for neural networks to learn because, first of all, it's recursive. Second, you need to do discrete sampling, which is very difficult to do analytically. In contrast, there's another way of doing rendering, which is called rasterization, and which is object-centric. What it does is, for every 3D point, shoot a ray towards the image. It only needs to shoot one primary ray, so there's no recursion, and you do not need to do any sampling. So this turns out to be easier for a neural network to learn. Rasterization contains two main steps. The first step is, for every 3D primitive, to project the primitive onto the image plane and then impose the primitives on top of each other based on their distance to the image. In this way, the frontmost surface is always visible in the final rendering. The next step is to compute the shading. Basically, it calculates the pixel color by interpolating the colors of the 3D primitive, such as the vertex colors. In general, rasterization is faster than ray tracing. And as we mentioned before, it's easier for a neural network to learn because it has no recursion and no discrete sampling process. It all sounds great.
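The shading step of rasterization, interpolating vertex colors across a projected primitive, can be sketched with barycentric coordinates. This is a minimal sketch assuming a single 2D triangle that has already been projected onto the image plane; the function names `edge`, `barycentric`, and `shade_pixel` are illustrative and not taken from any particular library.

```python
def edge(a, b, p):
    """Twice the signed area of the triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c)."""
    area = edge(a, b, c)                  # must be non-zero (non-degenerate triangle)
    return (edge(b, c, p) / area,
            edge(c, a, p) / area,
            edge(a, b, p) / area)


def shade_pixel(p, vertices, colors):
    """Interpolate vertex colors at p, or return None if p is outside the triangle."""
    weights = barycentric(p, *vertices)
    if min(weights) < 0.0:
        return None                       # the pixel is not covered by this primitive
    return tuple(sum(w * color[k] for w, color in zip(weights, colors))
                 for k in range(3))


if __name__ == "__main__":
    triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
    rgb = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    print(shade_pixel((1.0, 1.0), triangle, rgb))   # a blend weighted towards red
    print(shade_pixel((5.0, 5.0), triangle, rgb))   # None, outside the triangle
```

A full rasterizer would run this test for every pixel a primitive covers and combine it with a depth test, so that only the frontmost primitive contributes to each pixel.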
3D data representations (14:20)
Apart from that, there's another catch, which is the input data format. Here are some mainstream 3D formats: depth maps, voxels, 3D point clouds, and meshes. Some of them are not very friendly to neural networks, and my policy network tells me I should avoid them in this lecture. So anyway, let's start with the depth map. This is probably the easiest one, because literally all you need to do is change the number of input channels of the first layer. Then you can run your favorite neural network on it. It's also very memory efficient, because today's accelerators are designed to process images. Another reason the depth map is convenient to use is that you do not need to calculate the visibility, because all the information in the depth map already comes from the frontmost surface. All you need to do is compute the shading. There are many works on rendering depth maps into images, or the other way around, and I'm not going to talk about them in this lecture. So let's move on to voxels. Voxels are also fairly friendly to neural networks, because all the data are arranged in a grid structure. However, voxels are very memory intensive, actually one order of magnitude higher than image data. So conventional neural networks can only process voxels at very low resolution. But what makes voxels very interesting to us is that they require computing both visibility and shading. So this is a very good opportunity for us to learn an end-to-end pipeline for neural rendering.
RenderNet (Voxels) (16:12)
So we tried this end-to-end neural voxel rendering method called RenderNet. It starts by transforming the input voxel into camera coordinates. I'd like to quickly emphasize here that such a 3D rigid body transformation is something we actually do not want to learn, because it's very easy to do with a coordinate transformation but very hard for convolutions and similar operations to perform. We will come back to this later. Having transformed the input voxel into the camera frame, the next step is to learn a neural voxel representation of the 3D shape. What we can do here is pass the input voxel through a sequence of 3D convolutions. The output neural voxel contains deep features that are going to be used for computing the shading and the visibility. The next step is computing the visibility. One might be tempted to say, okay, we can use the standard depth buffer algorithm here. But it turns out this is not so easy, because when you do the 3D convolutions, you diffuse the values across the entire voxel grid, so it's not clear which cells come from the frontmost surface. At the same time, since every voxel contains deep features, you now have to integrate across all these channels to compute the visibility. To deal with this problem, we use something called a projection unit. This projection unit first takes the 4D input tensor, which is the neural voxel, and reshapes it into a 3D tensor by squeezing the last two dimensions, which are the depth dimension and the feature channel dimension. Then a multilayer perceptron learns to compute the visibility along the squeezed last dimension. So on a higher level, you can think of this multilayer perceptron as an inception network that learns to integrate visibility along both the depth and the features. The last step is to use a sequence of 2D up-convolutions to render the projected neural voxel into a picture.
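The reshape at the heart of the projection unit can be sketched as follows. This is a simplified illustration rather than the RenderNet implementation: the neural voxel is a nested list indexed as `voxel[h][w][d][c]`, the multilayer perceptron is reduced to a single linear layer with a ReLU (an assumption), and the `weights` and `bias` arguments are hypothetical stand-ins for learned parameters.

```python
def projection_unit(voxel, weights, bias):
    """Collapse depth and feature channels, then apply a per-pixel linear layer.

    voxel:   nested list of shape [H][W][D][C]
    weights: nested list of shape [K][D * C] (hypothetical learned parameters)
    bias:    list of length K
    returns: nested list of shape [H][W][K]
    """
    image = []
    for row in voxel:
        image_row = []
        for cell in row:
            # Squeeze the last two dimensions: D x C values become one vector.
            flat = [value for depth_slice in cell for value in depth_slice]
            # The same weights are shared by every pixel, like a 1x1 convolution.
            features = [
                max(0.0, sum(w * x for w, x in zip(w_k, flat)) + b_k)
                for w_k, b_k in zip(weights, bias)
            ]
            image_row.append(features)
        image.append(image_row)
    return image


if __name__ == "__main__":
    H, W, D, C, K = 2, 2, 3, 2, 4
    voxel = [[[[1.0] * C for _ in range(D)] for _ in range(W)] for _ in range(H)]
    weights = [[1.0] * (D * C) for _ in range(K)]
    bias = [0.0] * K
    projected = projection_unit(voxel, weights, bias)
    print(len(projected), len(projected[0]), len(projected[0][0]))
```

Because each output feature sees every depth slice and every channel of its pixel column at once, the layer is free to learn which depths are visible instead of relying on a hard depth test. In a tensor library the same step would typically be a reshape followed by a 1x1 convolution.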
We train this network end to end with a mean squared pixel loss. Here are some results. The first row is the input voxels. The second row is the output of RenderNet. As you can see, RenderNet is able to learn how to compute visibility and shading. You can also train RenderNet to generate different rendering effects, such as contour maps, toon shading, and ambient occlusion. In terms of generalization performance, we can use a RenderNet trained on chair models to render unseen objects, such as a bunny and scenes with multiple objects. It can also handle corrupted and low-resolution data. The first row renders an input shape that is randomly corrupted. The second row renders an input shape whose resolution is 50% lower than the training resolution. RenderNet can also be used to render textured models. In this case, we learn an additional texture network that encodes the input texture into a neural texture voxel. This neural texture voxel is concatenated with the input shape voxel channel-wise, and the concatenated voxel is fed into the network to render. These are some results of rendered textured models. The first row is the input voxel. The second row is the ground-truth reference image. The third row is our result from RenderNet. As you can see, the ground-truth image obviously has sharper details, but in general RenderNet is able to capture the major facial features and compute the visibility and shading correctly. As a final experiment, we tried to mix and match the shape and the texture. In this example, the images in the first row are rendered with the same shape voxel and different texture voxels. The second row is rendered with the same texture but different shapes. Okay, now let's move on to 3D point clouds.
Neural point based graphics (Pointclouds) (21:00)
3D point clouds are actually not so friendly to neural networks. First of all, the data is not arranged in a grid structure. In addition, depending on how you sample points from the surface, the number of points and the order of the points can vary. I just want to quickly point out this recent work called Neural Point-Based Graphics from, I think, the Samsung AI Lab. But before we talk about that, let's first talk about how conventional rasterization is done for 3D point clouds. Basically, for every point in a 3D scene, you project the point onto the image as a square, and the size of the square is inversely proportional to the distance from the point to the image. Obviously, you have to impose the projected squares on top of each other based on depth, too. Next, you color the squares using the RGB colors of the 3D points. However, if you do this, there are a lot of holes in the result. At the same time, we can see a lot of color blocks. So what Neural Point-Based Graphics did is replace the RGB color with a learned neural descriptor. This neural descriptor is an eight-dimensional vector that is associated with each input point. You can think of it as a deep feature that compensates for the sparsity of the point cloud. This is a visualization of the neural descriptor using the first three PCA components. We start by randomly initializing the neural descriptor for each point in the scene and then optimize it for that particular scene. Obviously, you cannot use the descriptors of one scene to describe another scene, so this optimization has to be done in both the training and the testing stage for each scene. The authors then use an autoencoder to turn the projected neural descriptors into a photorealistic image. This render network is jointly trained with the optimization of the neural descriptors during training, but can be reused in the testing stage.
These are some results, which I think are really amazing. The first row is the rendering with conventional RGB descriptors. The second row is the rendering with the neural descriptors. As you can see, there are no holes and the result is in general much sharper. And the very cool thing about this method is that the neural descriptors are trained to be view invariant, meaning that once they are optimized for a scene, you can render the scene from different angles. That's really cool.
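The conventional point rasterization described in this section, squares whose size shrinks with distance and which are composited by depth, can be sketched like this. It is a toy version under simple assumptions: a pinhole camera looking down the positive z axis, and illustrative parameters `focal` and `splat_scale` that are not taken from the paper. In Neural Point-Based Graphics the value written into each pixel would be the learned descriptor rather than a color label.

```python
import math


def splat_points(points, width, height, focal, splat_scale):
    """Rasterize (x, y, z, value) points as depth-tested squares.

    Returns a height x width grid holding the value of the nearest point
    covering each pixel, or None where no point lands (a hole).
    """
    depth = [[math.inf] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0
    for x, y, z, value in points:
        if z <= 0.0:
            continue                              # behind the camera
        u = int(focal * x / z + cx)               # pinhole projection
        v = int(focal * y / z + cy)
        side = max(1, int(round(splat_scale / z)))  # closer points get bigger squares
        half = side // 2
        for py in range(v - half, v - half + side):
            for px in range(u - half, u - half + side):
                inside = 0 <= px < width and 0 <= py < height
                if inside and z < depth[py][px]:  # depth test: keep the nearest point
                    depth[py][px] = z
                    image[py][px] = value
    return image


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 2.0, "far"), (0.0, 0.0, 1.0, "near")]
    for row in splat_points(cloud, 8, 8, focal=4.0, splat_scale=4.0):
        print(["." if v is None else v[0] for v in row])
```

The `None` entries are the holes mentioned in the talk: pixels that no projected square happens to cover.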
Mesh model rendering (24:06)
Okay, last we have mesh models, which are difficult for neural networks because of their graph representation. I just want to quickly point out two papers. The first one is called Deferred Neural Rendering. It uses a very similar idea to the one we just talked about in Neural Point-Based Graphics, but this paper applies the idea to rendering mesh models. The other paper is Neural 3D Mesh Renderer. It is cool in that you can even do 3D style transfer. However, the neural network part of this method is used mainly to change the vertex colors and positions; it's not so much about the rendering part. But I just put this reference here for people who are interested.
Inverse rendering (25:00)
Okay, so far we have been talking about forward rendering. Let's move on to inverse rendering. Inverse rendering is the problem where, given some input image, we want to work out the 3D scene that was used to generate this image. The particular method I'm going to talk about today is called differentiable rendering. It works as follows. First, we start from a target image. Then we generate some kind of approximation of the 3D scene. This approximation does not have to be very good. As long as we can render it, we can compare the result with the target image, and we can define some metric to quantitatively measure the difference between the rendered image and the target image. If the rendering process is differentiable, we can backpropagate the loss to update the input model. And if we do this iteratively, the hope is that the input model will eventually converge to something meaningful. The key point here is that the forward rendering process has to be differentiable in order to perform this backpropagation. And this is where we immediately see the value of neural networks, because modern neural networks are designed to be differentiable, designed to perform this backpropagation, so we get the gradient for free. Another reason neural networks are helpful here is that, as you can imagine, this iterative optimization process is going to be very expensive. So what we can do with a neural network is learn a feed-forward process that approximates this iterative optimization. For example, we can learn an autoencoder that encodes the input image into some sort of latent representation, which enables some really interesting downstream tasks such as novel view synthesis. However, in order to let the encoder learn a useful representation, we need to use the correct inductive bias.
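The render, compare, and update loop of differentiable rendering can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: the "scene" is a single number (the center of a blob), the "renderer" draws a 1D Gaussian blob, and the gradient is derived by hand, whereas a real system would rely on automatic differentiation through a much richer renderer.

```python
import math


def render(center, num_pixels=32, sigma=4.0):
    """Toy differentiable renderer: a 1D image of a Gaussian blob at `center`."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(num_pixels)]


def loss_and_grad(center, target, sigma=4.0):
    """Mean squared pixel error and its derivative with respect to `center`."""
    n = len(target)
    image = render(center, n, sigma)
    loss = sum((p - t) ** 2 for p, t in zip(image, target)) / n
    # d(pixel_i)/d(center) = pixel_i * (i - center) / sigma^2
    grad = sum(2.0 * (p - t) * p * (i - center) / sigma ** 2
               for i, (p, t) in enumerate(zip(image, target))) / n
    return loss, grad


def fit_center(target, init_center, learning_rate=20.0, steps=300):
    """Gradient descent on the scene parameter so the rendering matches the target."""
    center = init_center
    for _ in range(steps):
        _, grad = loss_and_grad(center, target)
        center -= learning_rate * grad
    return center


if __name__ == "__main__":
    target_image = render(20.0)                   # the observed "photo"
    recovered = fit_center(target_image, init_center=10.0)
    print("recovered center:", recovered)
```

The initial guess only needs to be close enough that the rendered blob overlaps the target. If they do not overlap at all, the gradient is nearly zero, which is one reason a learned feed-forward initialization can help.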
The one inductive bias I'm very excited about is that learning can be a lot easier if you can separate the pose from the appearance. And I truly believe this is something humans do. This is my four-year-old son playing with a shape puzzle. The task is to build a complex shape using some basic shape primitives, such as triangles and squares. In order to do this task, he has to apply 3D rigid body transformations to these primitive shapes to match what is required on the board. It's amazing that humans can do this rather effortlessly, while this is something neural networks really struggle with. For example, a 2D convolution, or convolution in general, is a local operation, so there's no way it can carry out this kind of global transformation. Fully connected layers might be able to do this, but at the cost of network capacity, because they have to memorize all the different configurations of the same object. So we asked the question: how about we just use a simple coordinate transformation to encode the pose of the object and separate the pose from the appearance? Would that make the learning easier?
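The kind of explicit coordinate transformation referred to here is cheap to write down. The sketch below applies a rigid body transformation (a rotation followed by a translation) to a list of 3D points; it is a minimal illustration that assumes rotation about the y axis only, and the helper names `rotation_y` and `rigid_transform` are hypothetical. Applying the same idea to a voxel grid additionally requires resampling the grid, which is omitted here.

```python
import math


def rotation_y(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def rigid_transform(points, rotation, translation):
    """Apply p' = R p + t to every 3D point."""
    transformed = []
    for p in points:
        rotated = [sum(rotation[r][k] * p[k] for k in range(3)) for r in range(3)]
        transformed.append(tuple(rotated[r] + translation[r] for r in range(3)))
    return transformed


if __name__ == "__main__":
    shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    posed = rigid_transform(shape, rotation_y(math.pi / 2.0), (0.0, 0.0, 5.0))
    for before, after in zip(shape, posed):
        print(before, "->", tuple(round(v, 6) for v in after))
```

The pose is carried entirely by a handful of numbers (here one angle and a translation vector), so nothing about it has to be memorized by network weights.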
So we tried this idea called HoloGAN, which learns a 3D representation from natural images without 3D supervision. By "without 3D supervision," I mean there's no 3D data and no ground-truth label for the pose of the object in the image during the training process. Everything is learned purely from 2D unlabeled data. So the cool thing about this idea is that the learning is driven by inductive bias as opposed to supervision. Let's first take a look at how conventional generative networks work. Conventional generative networks generate images using 2D convolutions, with very few assumptions about the 3D world. For example, conditional GANs concatenate pose vectors or apply feature-wise transformations to control the pose of the object in the generated images. Unless ground-truth labels are used in the training process, the pose is going to be learned as a latent variable, which is hard to interpret. At the same time, using 2D convolutions to generate a 3D motion will produce artifacts in the result. In contrast, HoloGAN generates much better results by separating the pose from the appearance. These are some random faces generated by HoloGAN, and I'd like to emphasize that no 3D data is used in the training process. The key point here is that HoloGAN uses a neural voxel as its latent representation. To learn such neural voxels, we use a 3D generator network, and the learned 3D voxel is rendered by RenderNet, which we just talked about. The 3D generator is basically an extension of StyleGAN into 3D. It has two inputs. The first one is a learned constant 4D tensor, which is learned as a sort of template for a particular class of object. This tensor is run through a sequence of 3D convolutions to become the neural voxel representation.
The second input is a random vector used as a controller. It is first transformed into the affine parameters of the adaptive instance normalization layers used throughout the pipeline. The learned 3D voxel representation is then, as I said before, rendered by RenderNet. In order to train this network in an unsupervised way, we use a discriminator network to classify the rendered images against real-world images. It is crucially important that during the training process we apply random rigid body transformations to this voxel representation. This is actually how the inductive bias is injected during the learning process, because in this way the network is forced to learn a very strong representation that is unbreakable under arbitrary poses. In fact, if we do not apply random transformations during the learning process, the network is not able to learn. These are some results. As you can see, HoloGAN is pretty robust to view transitions and also to complex backgrounds. One limitation of HoloGAN is that it can only learn poses that exist in the training dataset. For example, in this car dataset, there's very little variation in the elevation direction, so the network cannot extrapolate. However, when there is enough training data, the network can surely learn. For example, we used ShapeNet to generate more poses for chairs, and the network is able to learn to do a 180-degree rotation in elevation. We also tried the network on some really challenging datasets, for example this bedroom dataset. This dataset is very challenging because there's very strong appearance variation across the dataset. You can hardly find two bedrooms that look like each other from different views. In this sense, there's a very weak pose signal in the dataset. However, the network is still able to generate something reasonable, and I think that's really interesting.
Another surprise is that the network is able to further decompose the appearance into shape and texture. As a test, we feed in two different control vectors, one to the 3D part of the network and the other to the 2D part of the network. It turns out that the controller fed into the 3D part of the network controls the shape, and the controller fed into the 2D part of the network controls the texture. These are some results. Every row in this image uses the same texture controller but a different shape controller. Every column uses the same shape controller but a different texture controller. I think this is truly amazing, because it always reminds me of the vertex shader and the fragment shader in a conventional graphics pipeline, where the vertex shader changes the geometry and the fragment shader does the coloring.
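The control vectors discussed in this section act on the network through adaptive instance normalization. The core operation can be sketched for a single feature channel as below. This is a minimal sketch: the `scale` and `shift` arguments stand in for the affine parameters that the network would predict from a control vector, and that mapping is not shown.

```python
import math


def adaptive_instance_norm(channel, scale, shift, eps=1e-5):
    """Normalize one feature channel, then apply a controller-supplied affine map.

    channel: flat list of activations for one channel of one sample
    scale, shift: affine parameters (assumed to be predicted from a control vector)
    """
    n = len(channel)
    mean = sum(channel) / n
    variance = sum((v - mean) ** 2 for v in channel) / n
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [scale * (v - mean) * inv_std + shift for v in channel]


if __name__ == "__main__":
    activations = [1.0, 2.0, 3.0, 4.0]
    styled = adaptive_instance_norm(activations, scale=2.0, shift=0.5)
    print(styled)
```

After this step the channel's mean and spread are set by the controller rather than by the incoming activations. That gives a rough intuition for why feeding different controllers to the 3D and 2D parts of the network can steer shape and texture separately.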
Okay, I think it's probably a good time to start drawing conclusions. At the beginning of this talk, we asked the question: can neural networks be helpful for forward rendering and inverse rendering? I think the answer is yes. We have seen neural networks being used as a sub-module to speed up ray tracing, with examples of both a value-based approach and a policy-based approach. We have also seen neural networks being used as an end-to-end system that helps 3D processing. As far as the inverse problem goes, we have seen that a neural network can be used as a very powerful differentiable renderer, which opens the door to many interesting downstream applications such as view synthesis. The key thing here is that the neural network is able to use the correct inductive bias to learn a strong representation. Before I finish the talk, I just want to say this is still pretty much an open question and a very new research frontier. There are a lot of opportunities. For example, there's still a huge gap between the quality of end-to-end rendering and conventional physically based rendering. As far as I know, there's really no good solution for a neural mesh renderer. In terms of the inverse problem, we have seen encouraging results on learning strong representations. However, it will be interesting to see what more effective inductive biases and network architectures can be used to push the learning forward. And before I end the talk, I'd like to thank my colleagues and collaborators who did an amazing job on these papers, especially Thu, who did most of the work for RenderNet and HoloGAN, and also Bing, who did most of the work for the neural Monte Carlo denoising. Okay, with that, I'll finish my talk. Thank you.