MIT 6.S191 (2020): Reinforcement Learning
Transcription for the video titled "MIT 6.S191 (2020): Reinforcement Learning".
Now, I think this field is really incredible because at its core it moves away from the paradigm that we've seen so far in this class, in the first four lectures actually. So far in this class we've been using deep learning on fixed datasets, and we've really been caring about our performance on that fixed dataset. But now we're moving away from that, and we're thinking about scenarios where our deep learning model is its own agent and can act in an environment. When it takes actions in that environment, it's exploring the environment and learning how to solve some task. We really get to explore these types of dynamic scenarios, where you have an autonomous agent potentially working in the real world with humans, or in a simulated environment, and you get to see how we can build agents that learn to solve these tasks without any human supervision in some cases, or any guidance at all. So they learn to solve these tasks entirely from scratch, without any dataset, just by interacting with their environment. Now this has huge, obvious implications in fields like robotics, where you have self-driving cars and also manipulation, so having hands that can grasp different objects in the environment. But it also impacts the world of gameplay, and specifically strategy and planning. And you can imagine that if you combine these two worlds, robotics and gameplay, you can also create some pretty cool applications where you have a robot playing against a human in real life. Okay, so this is a little bit dramatized, and the robot here is not actually using deep reinforcement learning; I'd like to say that first. This is actually entirely choreographed for a TV ad, but I do hope it gives you a sense of this marriage of having autonomous agents interact in the real world, and of the potential implications of having efficient learning of the autonomous controllers that define the actions of those autonomous agents.
So actually, let's first take a step back and look at how deep reinforcement learning fits into this whole paradigm of what we've seen in this class so far. So, so far what we've explored in the first three lectures actually has been what's called supervised learning and that's where we have a data set of our data x and our labels y and what we've tried to do in the first three lectures really is learn a neural network or learn a model that takes as input the data x and learns to predict the labels Y.
Dive Into Deep Reinforcement Learning Principles And Challenges
Classes of learning problems (02:47)
So an example of this is if I show you this picture of an apple, we want to train our model to predict that this is an apple. It's a classification problem. Next we discussed in the fourth lecture the topic of unsupervised learning, and in this realm we only have access to data, there are no labels at all and the goal of this problem is that we just want to find structure in the data. So in this case we might see an example of two types of apples and we don't know that these are apples per se because there's no labels here but we need to understand that there's some structure, underlying structure within these apples and we can identify that yes these two things are the same even if we don't know that they're specifically apples. Now finally in reinforcement learning we're going to be given data in the form of what are called state action pairs. So states are the observations or the inputs to the system and the actions are the actions well that the agent wants to take in that environment. Now the goal of the agent in this world is just to maximize its own rewards or to take actions that result in rewards and in as many rewards as possible. So now in the apple example we can see again we don't know that this thing is an apple but our agent might have learned that over time if it eats an apple it counts as food and might survive longer in this world so it learns to eat this thing if it sees it. So again today in this class our focus is going to be just on this third realm of reinforcement learning and seeing how we can build deep neural networks that can solve these problems as well. 
And before I go any further I want to start by building up some key vocabulary for all of you just because in reinforcement learning a lot of the vocabulary is a little bit different than in supervised or unsupervised learning so I think it's really important that we go back to the foundations and really define some important vocabulary that's going to be really crucial before we get to building up to the more complicated stuff later in this lecture. So it's really important that if any of this doesn't make sense in these next couple slides, you stop me and make sure you ask questions.
So first we're going to start with the agent. The agent is the central part of the reinforcement learning algorithm. It is the neural network in this case. The agent is the thing that takes the actions. In real life you are the agents, each of you. If you're trying to learn a controller for a drone to make a delivery, the drone is the agent. The next one is the environment. The environment is simply the world in which the agent operates or acts. So in real life, again, the world is your environment. Now the agent can send commands to the environment in the form of what are called actions. In many cases we simplify this a little bit and say that the agent can pick from a finite set of actions that it can execute in that world. So, for example, we might say that the agent can move forward, backwards, left, or right within that world, and at every moment in time the agent can send one of those actions to the environment, and in return the environment will send back observations to that agent. So, for example, the agent might say, okay, I want to move forward one step, and then the environment is going to send back an observation in the form of a state. A state is the concrete, immediate situation that the agent finds itself in. So again, for example, the state might be the actual vision or the scene that the agent sees around it. It could be in the form of an image or a video, maybe sound, whatever you can imagine. It's just the data that the agent sees in return. And again this loop continues. The agent sees that observation, or that state, and it takes a new action in return. And we continue this loop. Now the goal of reinforcement learning is that the agent wants to try to maximize its own reward in this environment. So at every step the agent is also getting back a reward from that environment.
Now the reward is simply a feedback measure of success or failure every time the agent acts. And you don't have to get a reward every time you act. Your reward might be delayed; you might only get one reward at the very end of your episode. So you might live a long time and then at the end of your life get a reward or not. It doesn't have to be that at every moment in time you're getting a reward. You can think about these rewards as effectively evaluating all of the agent's actions, so from them you can get a sense of how well the agent is doing in that environment, and that's what we want to try and maximize. Now we can look at the total reward as just the summation of all of the individual rewards over time. So if you start at some time t, we can call capital R of t the sum of all of the rewards from that point on into the future. Simply expanding the summation out, you can see it's little r of t, which is the reward at this time step right now, after taking this action at this time, plus all of the rewards into the future, potentially on an infinite time horizon. Now often it's very useful to consider not just the sum of all rewards, but what's called the discounted sum of rewards. That's obtained by simply multiplying each of the rewards, at any point in time, by this discounting factor, lambda. And the reason you do this is simply so that you can discount the future rewards, so they don't count quite as much as a current reward. So let me give a concrete example. If I could offer you $5 today or $5 in 10 years, it's still a reward of $5, but you'd take the one today. And that's because mentally you're discounting that five dollars over time. It's not worth as much to you because it's coming so far into the future, so you'd prefer rewards that come as quickly as possible.
And again, just showing this discounted total reward expanded out from a summation, you can see that at each time point it's multiplying the reward at that time by the discounting factor, which is typically between 0 and 1. Okay, so now that we've defined all these terms, there's one very important function in reinforcement learning, called the Q function, that we now need to define. So let's go back a step and remember how this total reward, the total discounted reward, or what's also called the return, is defined. That's again just taking the current reward at time t, multiplying it by a discounting factor, and then adding on all future rewards, also multiplied by their discounting factor.
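To make the discounted return concrete, here is a minimal sketch in plain Python. It assumes the common convention in which the reward i steps ahead is scaled by the discount factor raised to the power i, so the current reward is scaled by 1. The function name `discounted_return` and the value 0.9 are illustrative choices, not from the lecture, and the factor is named `gamma` because `lambda` is a reserved word in Python.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards from the current step onward, scaling the reward
    i steps ahead by gamma**i (assumed convention)."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total


# The "$5 today or $5 later" example: same reward, different delay.
now = discounted_return([5.0])                 # reward arrives immediately
later = discounted_return([0.0] * 10 + [5.0])  # reward arrives 10 steps later

print(now, later)  # the delayed reward contributes less to the return
```

With a discount factor below 1, the delayed reward is worth strictly less than the immediate one, which is the preference described in the example.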
The Q function (09:23)
Now the Q function takes as input the current state of the agent and also the action that the agent executes at that time, and it returns the expected total discounted return that the agent could expect at that point in time. So let's think about what this means. This is telling us that if the agent is in state S and it takes an action A, the total amount of discounted reward that it could obtain by taking that action in that state is the result of that Q function. And that's all the Q function is telling you. So a higher Q value is going to tell us that we're taking an action that's more desirable in that state. A lower Q value is going to tell us that we've made an undesirable action in that state. So we always want to try and take actions that maximize our Q value. Okay, so now the question is, suppose we take this magical Q function and we have an agent that has access to this oracle of a Q function. So assume that I give you the Q function for now, the agent has access to it, and I place that agent in an environment. The question is, how can that agent use that Q function to take actions in the environment? So let me actually ask you this as a question. If I give you this Q function, and you're the agent, and all you see is the state, how would you use that Q function to take your next action? Exactly, yeah. So what would you do? You would feed in all of the possible actions that you could execute at that time. You evaluate your Q function. Your Q function's gonna tell you that for some actions you have a very high Q value, and for other actions you have a very low Q value. You pick the action that gives you the highest Q value, and that's the one that you execute at that time. So let's actually go through this. Ultimately what we want is to take actions in the environment.
The function that will take as input a state or an observation and predict or evaluate that to an action is called the policy denoted here as pi of s and the strategy that we always want to take is just to maximize our Q value so pi of s is simply going to be the argmax over our actions of that Q function so we're going to evaluate our Q function over all possible actions and then just pick the action that maximizes this Q function that's our policy. Now in this lecture we are going to focus on two classes of reinforcement learning algorithms and the two categories that we're going to primarily focus on first are cases where we want our deep neural network to learn the Q function so now we're actually not given the Q function as a ground truth or as an Oracle but we want to learn that Q function directly and the second class of algorithms is where sorry so we take we learn that Q function and we use that Q function to define our policy the second class of functions second class of algorithms is going to directly try and learn that policy without the intermediate Q function to start with. Okay so first we'll focus on value learning, which again just to reiterate is where we want the deep neural network to learn that Q function and then we'll use that learned Q function to determine our policy through the same way that I did before. Okay so let's start digging a little bit deeper into that Q function so you can get a little more intuition on how it works and what it really means and to do that I'd like to introduce this breakout game which you can see on the left. The idea of the breakout game is that you have you are the paddle you're on the bottom you can move left or right or stay in the middle, or don't move at all rather, and you also have this ball that's coming towards you and your objective in the game is to move left and right so that you hit that ball, it bounces off your paddle and it tries to hit as many of the colored blocks on top as possible.
Deeper into the Q function (13:18)
Every time you hit a colored block on top you break off that block, hence the name of the game is called Breakout. The objective of the game is to break out all of the blocks, or break out as many of the blocks as possible before that ball passes your paddle. And, yeah, so the ball bounces off your paddle and you try and break off as many colored blocks as possible. The Q function basically tells us the expected total return that we can expect at any state given a certain action that we take at that state. And the point I'd like to make here is that estimating or guessing what the Q value is is not always that intuitive in practice. So for example if I show you these two actions that are two states and action pairs that this agent could take and I ask you which one of these probably has a higher Q value or said differently which one of these will give you a higher total return in the future so the sum of all of those rewards in the future from this action and state forward. How many of you would say state action pair A? Okay. How many of you would say state action pair B? Okay so you guys think that this is a more desirable action to take in that state, state B or a scenario B. Okay so first let's go through these and see the two policies working in practice. So we'll start with A. Let me first describe what I think A is gonna be acting like. So A is a pretty conservative policy. It's not gonna move when it sees that ball coming straight toward it, which means that it's probably going to be aiming that ball somewhere towards the middle of the board or uniformly across the top of the board, and it's going to be breaking off color or colored blocks across the entire top of the board right so this is what that looks like it's making progress it's killing off the blocks it's doing a pretty good job it's not losing I'd say it's doing a pretty good job but it doesn't really dominate the game okay so now let's go to B. 
So what B is doing is actually moving out of the way of the ball just so that it can come back towards the ball and hit the ball on its corner. So that ball ricochets off at an extreme angle and tries to hit the colored blocks at a super extreme angle. Now what that means, well actually let me ask why might this be a desirable policy? These could catch more of the ones that are out of the way. Yeah, so if you catch some ones on the really extreme edges, what might happen is that you might actually be able to sneak your ball up into a corner and start killing off all of the balls on the top, or all of the blocks on the top rather. So let's see this policy in action. So you can see it's really hitting at some extreme angles and eventually it breaches a corner on the left and starts to kill off all the blocks on the top. Now it gets a huge amount of reward from this. So this is just an example of how it's not always intuitive. To me when I first saw this I thought A was going to be the safer action to take, it would be the one that gives me the more return, but it turns out that there are some unintuitive actions that reinforcement learning agents can learn to really, I don't know if I would call it cheating the environment, but really doing things that we as humans would not find intuitive. Okay so the way we can do this practically with deep learning is we can have a deep neural network which takes as input a state which in this case is just the pixels coming from that game at that instant and also some representation of the action that we want to take in this case maybe go right move the paddle the right, it takes both of those two things as input and it returns the Q value, just a single number of what the neural network believes the Q value of that state action pair is.
Deep Q Networks (17:17)
Now that's fine, you can do it like this. There's one minor problem with doing it like this, and that's if you want to create your policy, you want to try out all of the different possible actions that you could execute at that time, which means that you're gonna have to run this network n times at every time instant, where n is the number of actions that you could take. So every time, you'd have to execute this network many times just to see which way to go. The alternative is that you could have one network that takes as input that state, but now it has learned to output all of the different Q values for all of the different actions. So now we have to execute this just once. We forward propagate once, and we can see that it gives us back the Q value for every single action. We look at all of those Q values, we pick the one that's maximum, and take the action that corresponds. Now that we've set up this network, how do we train it to actually output the true Q value at a particular instant, or the Q function over many different states? What we want to do is to maximize the target return, right, and that will train the agent. So this would mean that the target return is going to be maximized over some infinite time horizon, and this can serve as the ground truth to train that agent. So we can basically roll out the agent, see how it did in the future, and based on the rewards we see it got, we can use that as the ground truth. Okay, now I'm going to define this in two parts. First is the target Q value, which is the real value that we got by just rolling out the episode of the agent inside this simulator, or environment, let's say. That's the target Q value.
So the target Q value is composed of the reward that we got at this time by taking this action, plus the maximum, the best action that we could take at every future time. So we take the best action now and we take the best action at every future time as well. Assuming we do that, we can just look at our data, see what the rewards were, add them all up and discount appropriately, and that's our true Q value. Okay, now the predicted Q value is obviously just the output from the network. We have a target, we have a prediction, and we can train this whole network end to end by subtracting the two and taking the squared difference, and that's our loss function. It's a mean squared error between the target Q value and the predicted Q value from the network. Okay, great. So let's just summarize this really quickly and see how this all fits together. In our Atari game, we have a state. We get it as pixels coming in, and you can see that on the left-hand side. That gets fed into a neural network. Our neural network, in this case of Atari, is going to output three numbers: the Q value for each of the possible actions. It can go left, it can go right, or it can stay and not do anything. Each of those Q values will have a numerical value that the neural network will predict. Now again, how do we pick what action to take given this Q function? We can just take the argmax of those Q values and see. Okay, if I go left, I'm going to have an expected return of 20. That means I'm probably going to break off 20 colored blocks in the future. If I stay in the center, maybe I can only break off a total of three blocks in the future. If I go right, I'm gonna miss that ball and the game is going to be over, so I'm gonna have a return of zero. So I'm gonna take the action that's gonna maximize my total return, which in this case is left. Does it make sense? Okay, great. That action is then fed back into the Atari game in this case.
The game repeats, the next frame goes, and this whole process loops again. Now DeepMind actually showed how these networks, which are called deep Q-networks, could be applied to solve a whole variety of Atari games, providing the state as input through pixels, so just the raw input state as pixels, and showing how they could learn the Q function. All of the possible actions are shown on the right-hand side, and it's learning that Q function just by interacting with its environment. And in fact they showed that across many different Atari games they were able to achieve superhuman performance on over 50% of them, just using this very simple technique that I presented to you today.
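The target and loss described in this section can be sketched in a few lines of plain Python. This is a sketch under stated assumptions rather than the lecture's or DeepMind's implementation: it assumes the common one-step form of the target, in which everything after the current reward is summarized by the best Q value at the next state, and the names `q_target` and `q_loss`, the `done` flag, and all numbers are hypothetical.

```python
def q_target(reward, next_q_values, gamma=0.99, done=False):
    """Target Q value: reward received now plus the discounted best
    Q value available at the next state (assumed one-step form)."""
    if done:
        # Episode ended, so there are no future rewards to add.
        return reward
    return reward + gamma * max(next_q_values)


def q_loss(target, predicted):
    """Squared difference between target and predicted Q value."""
    return (target - predicted) ** 2


# Hypothetical values: the network predicted 18.0 for "left", the agent
# then received a reward of 1.0 and the network's Q values at the next
# state were [19.0, 2.5, 0.0].
predicted_left = 18.0
target = q_target(reward=1.0, next_q_values=[19.0, 2.5, 0.0], gamma=0.9)
loss = q_loss(target, predicted_left)

print(target, loss)  # target is about 18.1, loss about 0.01
```

In training, this squared error would be averaged over many recorded transitions and minimized with gradient descent on the network weights.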
Atari results and limitations (21:44)
And it's actually amazing that this technique works so well because to be honest it is so simple and it is extremely clean. How clean the idea is, it's very elegant in some sense how simple it is and still it's able to achieve superhuman performance which means that it beat the human on over 50% of these Atari games. So now that we saw the magic of Q-learning I'd like to touch on some of the downsides that we haven't seen so far. So far the main downside of Q-learning is that it doesn't do too well with complex action scenarios where you have a lot of actions, a large action space, or if you have a continuous action space which would correspond to infinite number of actions, right. So you can't effectively model or parameterize this problem to deal with continuous action spaces. There are ways that you can kind of tweak it, but at its core, what I presented today is not amenable to continuous action spaces. It's really well suited for small action spaces where you have a small number of possible actions and discrete possibilities, right? So a finite number of possible actions at every given time. It's also, its policy is also deterministic because you're always picking the action that maximizes your, your Q function. And this can be challenging specifically when you're dealing with stochastic environments like we talked about before. So Q, sorry, Q value learning is really well suited for deterministic action spaces, sorry, deterministic environments, discrete action spaces, and we'll see how we can move past Q learning to something like a policy gradient method, which allows us to deal with continuous action spaces and potentially stochastic environments. So next up we'll learn about policy learning to get around some of these problems of how we can deal with also continuous action spaces and stochastic environments or probabilistic environments. And again just to reiterate now we've gone through this many times I want to keep drilling it in. 
You're taking as input in Q networks, you're taking as input the state, you're predicting Q values for each of your possible actions and then your final answer, your policy, is determined by just taking the argmax of that Q function and taking an action that maximizes that Q function.
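As a small illustration of that last step, here is the argmax policy applied to the three Q values used in the Breakout example earlier (20 for left, 3 for staying, 0 for right). The dictionary layout and the name `greedy_policy` are assumptions made for this sketch.

```python
def greedy_policy(q_values):
    """Return the action with the largest Q value (the argmax)."""
    return max(q_values, key=q_values.get)


# Q values from the Breakout example: one number per possible action.
q_values = {"left": 20.0, "stay": 3.0, "right": 0.0}
action = greedy_policy(q_values)

print(action)  # left
```

Because the same state always yields the same Q values, this policy always returns the same action for that state.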
Policy learning algorithms (24:13)
Okay. The difference with policy gradient methods is that we're still gonna take as input the state at that time, but we're not going to output the Q function. We're directly going to output the policy of the network, or rather, let me say that differently, we're going to output a probability distribution over the space of all actions given that state. So this is the probability that taking that action is going to result in the highest Q value. This is not saying what Q value I am going to get; it's just saying this is the probability that this action will give me the highest Q value. So it's a much more direct formulation. We're not going through this intermediate Q function; we're just directly saying, let's optimize that policy automatically. Does that make sense? Okay, so once we have that probability distribution, we can again see how our policy executes very naturally. So that probability distribution may say that taking a left will result in the maximum Q value with probability 0.9, and staying in the center will result in the maximum reward or return with probability 0.1. Going to the right is a bad action. You should not do that, because you're definitely not going to get any return. Now that probability distribution defines your policy. Like, that is your policy. You can then take an action simply by sampling from that distribution. So if you draw a sample from that probability distribution, that exactly tells you the action you should take. So if I sample from this probability distribution here, I might see that the action I select is A1, going left, but since it's probabilistic, I could sample again and it could tell me A2, because A2 also has a probability of 0.1. On average though, I might see that 90% of my samples will be A1 and 10% of my samples will be A2.
But at any point in time, if I want to take an action, all I do is just sample from that probability distribution and act accordingly. And note again that since this is a probability distribution, it follows all of the typical probability distribution properties. So all of its elements all like its its total mass must add up to one because it's a probability distribution now already off the bat does anyone see any advantages of this formulation why we might care about directly modeling the policy instead of modeling the Q function and then using that to deduce a policy. If you formulate the problem like this, your output is a probability distribution like you said, but what that means is now we're not really constrained to dealing only with categorical action spaces. We can parameterize this probability distribution however we'd like. In fact we could make it continuous pretty easily. So let's take an example of what that might look like. This is the discrete action space. So we have three possible actions, left, right, or stay in the center. And the discrete action space is going to have all of its mass on these three points. The summation is going to be one of those masses, but still they're concentrated on three points. A continuous action space in this realm, instead of asking what direction should I move, a continuous action space is going to say maybe how fast should I move in whatever direction. So on the right it's going to be faster and faster to the right, it's a speed now, and on the left of the axis is going to be faster and faster to the left.
Discrete vs continuous actions (27:36)
So you could say I want to move to the left with speed 0.5 meters per second or 1.5 meters per second or whatever real number you want. It's a continuous action space here. Now when we plot the probability density function of this policy, we might see that the probability of taking an action given a state has mass over the entire number line, not just on these three points, because now we can take any of the possible actions along this number line, not just a few specific categories. So how might we do that with policy gradient networks? That's really the interesting question here. What we can do is, if we assume that our output follows a Gaussian distribution, we can parameterize that Gaussian with a mean and a variance. So at every point in time, our network is now going to predict the mean and the variance of that distribution. It's actually outputting a mean number and a variance number. Now let's suppose that the mean and variance are minus 1 and 0.5. So it's saying that the center of that distribution is minus 1 meters per second, or moving 1 meter per second to the left. All of the mass then is centered at minus 1 with a variance of 0.5. Okay, now again, if we want to take an action with this probability distribution or this policy, we can simply sample from this distribution. If we sample from this distribution, in this case we might see that we sample a velocity of minus 0.8, which corresponds to a speed of 0.8 to the left. Okay, and again, same idea as before: now that it's continuous, if we take an integral over this probability distribution, it has to add up to 1. Okay, makes sense? Great. Okay, so that's a lot of material, so let's cover how policy gradients work in a concrete example now.
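Sampling an action from each kind of policy can be sketched with Python's standard library. The discrete probabilities (0.9 left, 0.1 stay, 0 right) and the Gaussian parameters (mean minus 1, variance 0.5) are the ones used in the lecture's examples; the helper names, the fixed seed, and the use of `random.Random` are assumptions for illustration. In a real agent, these numbers would be outputs of the policy network.

```python
import math
import random


def sample_discrete(probs, rng):
    """Draw one action from a categorical distribution over actions."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def sample_gaussian(mean, variance, rng):
    """Draw one continuous action; gauss() expects a standard deviation,
    so the variance is converted with a square root."""
    return rng.gauss(mean, math.sqrt(variance))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Discrete policy: the probabilities over the three actions sum to 1.
discrete_policy = {"left": 0.9, "stay": 0.1, "right": 0.0}
samples = [sample_discrete(discrete_policy, rng) for _ in range(1000)]
print(samples.count("left"), samples.count("stay"), samples.count("right"))

# Continuous policy: a velocity in meters per second, negative means left.
velocity = sample_gaussian(mean=-1.0, variance=0.5, rng=rng)
print(velocity)
```

Most discrete samples come out as "left" and none as "right", while the continuous sample can be any real number, most likely somewhere near minus 1.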
So let's walk through it and let's first start by going back to the original reinforcement learning loop. We have the agent, the environment, agent sends actions to the environment, environment sends observations back to the agent. Let's think about how we could use this paradigm combined with policy gradients to train like a very, I guess, intuitive example.
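The loop itself is short enough to write out. The sketch below uses a made-up toy environment, `LineWorld`, and a random agent, both hypothetical and chosen only so that the code runs on its own; the `reset` and `step` method names are an assumed convention, not something defined in the lecture.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, with a
    reward of 1 for reaching position 4, which ends the episode."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        move = 1 if action == "right" else -1
        self.position = min(4, max(0, self.position + move))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def random_agent(state, rng):
    """Placeholder agent: ignores the state and picks a random action."""
    return rng.choice(["left", "right"])


def run_episode(env, agent, rng, max_steps=100):
    """Agent sends an action, environment sends back state and reward."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = agent(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


trajectory = run_episode(LineWorld(), random_agent, random.Random(0))
print(len(trajectory), trajectory[-1])
```

The recorded list of (state, action, reward) tuples is the kind of trajectory that the training procedure in the next section works from.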
Training policy gradients (30:11)
Let's train a self-driving car. OK. So the agent in this case is the vehicle. Its state is whatever sensory information it receives. Maybe it's a camera attached to the vehicle. The action it could take, let's say simple, it's just the steering wheel angle that it should execute at that time. This is a continuous variable. It can take any of the angles within some bounded set. And finally the reward, let's say, is the distance traveled before we crash. Okay, great. So the training algorithm for policy gradients is a little bit different than the training algorithm for Q function or Q deep neural networks. So let's go through it step by step in this example. So to train our self-driving car what we're going to do is first initialize the agent. The agent is the self-driving car. We're going to start the agent in the center of the road and we're going to run a policy until termination. Okay so that's the policy that we've ran. In the beginning it didn't do too good. It crashed pretty early on but we can train it. Okay, so what we're going to do is record all of the states and all of the actions and all of the rewards at every single point in time during that entire trajectory. Given all of these state action reward pairs, we're going to first look at right before the crash and say that all of these actions, because they happened right before a crash or right before this undesirable event, we're going to penalize all of those actions. So we're going to decrease the probability of selecting those actions again in the future, and we're going to look at actions that were taken farther away from that undesirable event with higher rewards, and we're going to increase the probability of those actions because those actions resulted in more desirable events. The car stayed alive longer when it took those actions. When it crashed it didn't stay alive so we're gonna decrease the probability of selecting those actions again in the future. 
Okay, so now that we've tried this once, through one training iteration, we can try it again. We reinitialize the agent, we run a policy until termination, and we do the same thing again: decrease the probability of actions closer to the crash, increase the probability of actions farther from the crash, and just keep repeating this over and over until you see that the agent starts to perform better and better, drive farther and farther, and accumulate more and more reward, until eventually it starts to follow the lanes without crashing. Now this is really awesome, because we never taught it anything about what lane markers are; it's just seeing images of the road. We never taught it anything about how to avoid crashes; it just learned this from sparse rewards. The remaining question here is how we can actually do these two steps, I think. How can we do the step of decreasing the probability of actions that were undesirable, and how can we increase the probability of actions that were desirable? I think everything else, conceptually at least, is pretty clear, I hope. The question is, how do we improve our policy over time? So to do that, let's first look at the loss function for training policy gradients, and then we'll dissect it to understand a little bit why this works. The loss consists of two terms. The first term is the log likelihood of selecting the action given the state that you were in. This really tells us how likely the action that you selected was. The second term is the total discounted return that you received by taking that action. That's really what you want to maximize. So let's say the agent, or the car, got a lot of return, a lot of reward, for an action that had very high log likelihood. So it was very likely to be selected, and it got a lot of reward from that action. That's going to be a large number multiplied by a large number. When we multiply them together, you get another large number.
You add in this negative in front of this loss function, so now it's going to be an extremely negative number. Remember that neural networks try to minimize their loss, so that's great. So we're in a very desirable place, we're in a pretty good minimum here, so we're not going to touch that probability at all. Let's take another example. Now we have an example where the reward is very low for an action, so R is very small, and let's assume that the probability of selecting this action that we took was very high. So we took an action that we were very confident in taking, but we got a very low reward for it, or very low return for it. What are we going to do? So that's a small number now multiplied by this probability distribution. Our loss is going to be very small; when we multiply it by the negative in front, the total loss is going to be large, right? So on the next training iteration we're going to try and minimize that loss, and that's going to be either by trying out different actions that may result in higher return or higher reward, or by removing some of the probability of taking that same action again in the future. So we don't want to take that same action again because we just saw that it didn't have a good return for us. And when we plug this loss into the gradient descent algorithm that we saw in lecture one to train our neural network, we can actually see that the policy gradient itself is highlighted here in blue. So that's why it's called policy gradients, right? Because you're taking the gradient of this policy function scaled by the return. And that's where this method really gets its name from. So now I want to talk a little bit about how we can extend this to perform reinforcement learning in real life.
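The loss described above, the negative log likelihood of the chosen action multiplied by its return, can be sketched for a single time step as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the steering angle is discretized into three hypothetical bins (the lecture treats it as continuous), the policy is a plain softmax over logits rather than a neural network, and the learning rate of 0.5 is arbitrary.

```python
import math

def softmax(logits):
    """Convert raw scores into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_loss(logits, action, ret):
    """loss = -log pi(action | state) * return, for one time step."""
    probs = softmax(logits)
    return -math.log(probs[action]) * ret

def loss_gradient(logits, action, ret):
    """Gradient of the loss w.r.t. the logits: ret * (probs - one_hot(action))."""
    probs = softmax(logits)
    return [ret * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]

def gradient_step(logits, action, ret, lr=0.5):
    """One step of gradient descent on the loss."""
    grad = loss_gradient(logits, action, ret)
    return [z - lr * g for z, g in zip(logits, grad)]

# Three hypothetical steering bins: left, straight, right.
logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]

# The agent chose "straight" (index 1). Compare a good and a bad outcome.
after_good = softmax(gradient_step(logits, action=1, ret=+1.0))[1]
after_bad = softmax(gradient_step(logits, action=1, ret=-1.0))[1]

print(f"P(straight) before update:      {before:.3f}")
print(f"P(straight) after high return:  {after_good:.3f}")
print(f"P(straight) after low return:   {after_bad:.3f}")
```

With a positive return the descent step raises the probability of the action that was taken; with a negative return (for example, a return that falls below the episode average after normalization) the same step lowers it.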
So far I've only really shared examples with you about doing reinforcement learning in either games or in the simple toy example with the car. What do you think is the shortcoming of this training algorithm? So there's a real reason here why we haven't seen a ton of success of reinforcement learning in real life like we have seen with the other fields that we've covered so far in this class.
RL in real life (36:04)
And that's because one of these steps has a severe limitation when you try and deploy it in the real world. Does anyone have an idea? Okay, so reinforcement learning in real life, the big limitation here obviously is you can't actually run a lot of these policies in real-life, safety-critical domains, especially let's think about self-driving cars. I said run until termination. I don't think you want to do that on every single training iteration, not just like at the end goal. This is like every single step of your gradient descent algorithm, millions of steps, I don't think that's a desirable outcome. We can get around this though, so we can think about training in simulation before deploying in the real world. The problem is that a lot of modern simulators are not really well suited for the very photorealistic simulation that would support this kind of transfer from the simulator to the real world when you deploy them. One really cool result that we created in my lab, also with some of the TAs, so you can ask them if you have any questions, has been developing a brand new type of photorealistic simulation engine for self-driving cars that is entirely data-driven. So the simulator we created is called VISTA, and it allows us to use real data of the world to simulate virtual agents. So these are virtual reinforcement learning agents that can travel within these synthesized environments, and the results are incredibly photorealistic, and they allow us to train agents in reinforcement learning environments entirely in simulation so that they can be deployed without any transfer in the real world. In fact, that's exactly what we did.
VISTA simulator (37:40)
We placed agents inside of our simulator, we trained them using policy gradient algorithms, which is exactly what we talked about today, then we took these trained policies and put them on board our full-scale autonomous vehicle. Without changing anything, they learned to drive in the real world as well, just like they learned to drive in the simulator. On the left-hand side you can actually see us sitting in the vehicle, but it's completely autonomous. It's executing a policy that was trained using reinforcement learning entirely within the simulation engine. And this actually represented the first time ever a full-scale autonomous vehicle was trained using only reinforcement learning and able to successfully be deployed in the real world. So this was a really awesome result that we had. And now we've covered some fundamentals of policy learning, and also value learning with Q functions. What are some exciting applications that have sparked this field? I want to talk about these now as well. For that we turn to the game of Go. So in the game of Go, agents, humans or autonomous agents, can be playing against each other, and an autonomous agent specifically was trained to compete against humans, specifically many human champions, and achieved what at the time was a very, very exciting result.
AlphaGo and AlphaZero (38:55)
So first I want to give some background, very quick background, on the game of Go because probably a lot of you are not too familiar with Go. Go is played on a 19 by 19 grid. It's played between two players who hold white and black pieces. The objective of the game is to basically occupy as much space on the board as possible. You want to claim territory on the board. But the game of Go and the strategy behind Go is incredibly complex. That's because there are more positions, more possible positions in Go than the number of atoms in the universe. So the objective of our AI is to learn this incredibly complex state space and learn how to not only beat other autonomous agents, but learn how to beat the existing gold standard human professional Go players. Now Google DeepMind rose to the challenge a couple years ago and developed a reinforcement learning pipeline which defeated champion players. And the idea at its core was actually very simple, and I'd like to go through it just in a couple steps in the last couple minutes here. First they trained a neural network to watch humans play Go. So these are expert humans, and they got a whole data set of how humans play Go, and they trained their neural network to imitate the moves of those humans and imitate those behaviors. They then used these pre-trained networks, trained from the expert Go players, to play against their own reinforcement learning agents. This reinforcement learning policy network is what allowed the policy to go beyond just imitating the humans, and actually go beyond human-level capabilities to go superhuman. The other trick that they had here, which really made all of this possible, was the usage of an auxiliary neural network. So there was not only the policy network, which took as input the state to predict the action, but also an auxiliary network which took as input the state and predicted how good of a state this was.
So you can think of this as kind of, again, similar in idea to the Q network, but it's telling you for any particular state of the board, how good of a state is this? And given this network, what the AI could do is basically hallucinate different possible trajectories of actions that it could take in the future and see where it ends up, use that network to say how good of a state all of these board positions have become, and use that to determine where that agent needs to act in the future. And finally, a recently published extension of these approaches, just a year ago, called AlphaZero, used basically self-play all the way through. So the previous example I showed used a build-up of expert data to imitate, and that was what started the foundation of the algorithm. Now in AlphaZero they start from zero, they start from scratch and use entirely self-play from the beginning. In these examples they showed, let's see, it was chess, Go, many other games as well, they showed that you could use self-play all the way through without any need of pre-training with human experts, and instead optimize these networks entirely from scratch. Okay, so finally I'd just like to summarize this lecture. Today we saw how deep reinforcement learning could be used to train agents in environments, in this learning loop where the agent interacts with the environment. We got a lot of foundations into reinforcement learning problems. We learned about Q learning, where agents try to optimize the total expected return of their actions in the future. And finally, we got to experience how policy gradient algorithms are trained and how we can directly optimize the policy without going through the Q function at all.
Training Context Wrap-Up
And in our lab today, we will get some experience of how we can train some of these reinforcement learning agents in the game of Pong, and also some simpler examples as well for you to debug with and play around with. Next we will hear from Ava, who's going to be giving a very wide-stretching lecture. It's a very exciting lecture. It's actually one of my favorite lectures. It's the limitations of deep learning, so of all of these approaches that we've been talking about, and also some really exciting new advances of this field and where it's moving in the future. So I hope you enjoy that as well. Thank you.