MIT 6.S191 (2020): Generalizable Autonomy for Robot Manipulation

Transcription for the video titled "MIT 6.S191 (2020): Generalizable Autonomy for Robot Manipulation".




Introduction

Introduction (00:00)

I wanted to start with this cute little video. What I really study is algorithmic methods to make robot manipulation generalizable. Why I really like this video is that this is how I got inspired to work in robotics. This is science fiction, and as a researcher you are always chasing science fiction and then trying to make some of it reality. And really, to think about this: if I were to have a system like this in my home, I would want it to do a variety of things, maybe clean, cook, do laundry, perhaps help me sweep or other stuff. And not only that, I would probably want it to work outside of my lab. I would want it to work in a variety of settings, maybe my home, your home, or maybe in settings which are much more complex than you can show it, perhaps in the dark. And the idea really is, how do we enable such complexity and learning in the sort of general-purpose diversity of skills which interaction in the real world requires? This is where I argue that a lot of my research agenda lies. We are trying to build these systems, physical agents particularly, that can really extend our ability. And when I use extend and augment, it's both in the cognitive and physical sense.

But there's nothing new about that. Dr. Cox already talked about the emergence of AI that happened in 1956. And soon after, we were dreaming of robot assistants in the kitchen. This is the first industrial robot; I don't know how many of you are aware of Unimate. It is as big as the kitchen, but this is actually 50 years to date. In the time since, I would argue people have done a variety of tremendous stuff. This is not my work, this is Boston Dynamics, but this is one particularly good example: robots can walk on ice, lift heavy boxes, and whatnot. Cut to this year at CES: this is a video from Sony doing pretty much the same thing, a concept video of a robot cooking circa 2020, 52 years after the original video.

What is very striking is that despite the last 50 years, when you put real robots in the real world, it doesn't really work. It turns out doing long-term planning and real-time perception on real robots is really hard. So what gives? From 1950, 1960 to today, I argue that we need algorithms that can generalize to the unstructured settings that a robot is going to encounter, both in terms of perception, in terms of dynamics, and perhaps task definition. And I've only given you examples of the kitchen, but there's nothing particularly special about the kitchen. This sort of lack of generalization happens in all sorts of robotics applications, from manufacturing to healthcare to personal and service robotics.


Discussion On Autonomous Learning

Achieving generalizable autonomy (03:45)

So this is where I argue that to tackle this problem, what we really need is to inject some sort of structured inductive bias and priors to achieve this generalization. In simpler terms, you can really think about it as: we need algorithms that learn from specifications of tasks, where specifications can be, let's say, language, video, or kinesthetic demonstrations. And then we need mechanisms where the system can self-practice to generalize to new but similar things.


Leveraging imitation learning (04:19)

But often, imitation gets a bad rap: imitation is just copying. But it actually is not. Let's think about this very simple example of scooping some leaves in your yard. If you have a two-year-old looking at you, trying to imitate, they may just move the mop around. They're basically trying to get the motion right, but nothing really happens. So they get what you would call movement skills. As they grow up, they can probably do a bit better. They can get some sort of planning, they can do some sort of generalization, and some of the task actually works. And they even grow to a point where they understand that the concept of imitation is really not the motion, but actually the semantics. You need to do the task; it is not always the how of it, the what matters. So they may actually use a completely different set of tools to do the exact same task. And this is precisely what we want in algorithmic, let's say, equivalents.

So today what I'm going to talk about is, at all three of these levels of imitation, at the level of control, planning, and perception, how can we get this kind of generalization through structured priors and inductive biases? So let's start with control. I started with these simple skills in the house. Let's take one of these skills and think about what we can do with it. What we are really after is algorithms which would be general, so I don't have to code up new algorithms for, let's say, sweeping versus cleaning, or create a completely new setup for, I don't know, let's say cutting. This is precisely where learning-based algorithms come into play. But one of the things that is very important is, let's say we take the example of cleaning: cleaning is something that is very, very common in the everyday household. You can argue wiping is a motion that is required across the board.


Learning visuo-motor policies (06:08)

Not very many people clean radios, though, but still. The concept is generalization. Perhaps cleaning a harder stain would require you to push harder. You have some sort of reward function where you wipe until it is clean; it's just that the classifier of what is clean is not really given to you explicitly. Or maybe you know the concept or context that if you're wiping glass, do not push too hard, you might just crack it. How do we build algorithms that can get this sort of generalization?

One way can be this recent wave of what you would call machine learning and reinforcement learning: you get the magical input of images and some sort of torque output in actions. And this has actually done very well. In robotics, we have seen some very interesting results for longstanding problems: being able to open closed doors, handling fluids or at least deformable media. This is actually surprising and very impressive. But one of the things that stands out is that these methods are very, very sample inefficient. It may take days, if not weeks, to do very simple things with specification. And even then these methods would be, let's say, iffy, very unstable: you change one thing and the whole thing comes shattering down, and you have to start all over again.

The alternative might be something more classical, let's say from controls: you are given a robot model. The robot model may be a specification of the dynamics; it may include the environment if the task is too complicated. Given something like this, what would you do? You would come up with some sort of task structure T: I need to do this particular step, maybe go to the table, do wiping, and when it is wiped, then I come out. This has actually worked for a long time when you have particular tasks. But there are problems. One, generalization is very hard, because I need to create this tree for every task. Two, perception is very hard, because I have to build perception for a particular task. Wiping this table may be very different from wiping the whiteboard, because I need to build a classifier to detect when it is wiped.

So one of the algorithms we started working on asks: can we take a best-of-both-worlds approach in this case? The idea is that reinforcement learning, or learning in general, can allow you to be general purpose, but is very sample inefficient. On the contrary, model-based methods allow you to rely on priors, things that you know about the robot, but require you to code up a lot of the stuff about the task. So what we thought is, maybe the way to look at this problem is to break it up in a modular sense, where you take the action space of the learning to be not in the robot's joint space but in the task space itself. What we really want is a modular method where the output of a learned policy that takes in images is not at the level of what joint angles you want to change, but really the pose and the velocity of the end effector, and also the gains or the impedances: how stiff does the end effector need to be at this point? Why is it important? It is important because this enables you to manage different stiffnesses when you are in different stages of the task. This basically obviates the need for you to create a task structure tree.
So the system can learn when it needs to be free, when it needs to be stiff, and in what dimensions it needs to be stiff. And this is very important. Now the policy is essentially outputting these stiffness parameters, which can then be fed into a robot model that we already know. The robot model can be nonlinear and rather complicated, so why waste learning effort on it? This is really the best of both worlds, where you are using the model however much you can, but you are still keeping the environment, which is general, unmodeled.

What benefit does this give you? What you see here is image input to the agent and the resulting environment behavior. We model this as a reinforcement learning problem with a fairly simple objective: clean up all of the tiles and do not apply any forces that would kill the robot. That is basically it. And we tested this against a bunch of different action spaces; the action space is the prior that you're using here. The only thing you should take away is that at the bottom is the favorite image-to-torque setup, and at the top is variable impedance provided directly by the policy. This is basically the difference between failure and success, both in terms of sample efficiency and in terms of smoothness of control, because you're using known mechanisms to do control at high frequency, where you can actually safeguard the system without worrying about what the reinforcement learning algorithm can do to it.

Interestingly, because you now have a decoupled system, you can train purely in simulation: the dynamics model, f_sim, can be the simulator's model, and you can replace it with the real robot model on the fly. You do not need fine-tuning, because again the policy is outputting commands in end-effector space, so you can swap the model, and even though there might be finite differences in parameters, at least in these cases we found that the policy generalizes very well without any loss. We did not have to do simulation-based randomization, and we did not have to fine-tune the policy when going to the real world. So this is kind of zero-shot transfer, which is pretty interesting. Basically, identifying the right prior enables you to do this generalization very efficiently and also gets you sample efficiency in learning.
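To make the idea concrete, here is a minimal sketch of what such a task-space action interface could look like, assuming a simple Cartesian impedance law and hypothetical array shapes; the actual controller and action parameterization in the work are more involved, and orientation is treated as a plain 6-vector here for brevity.

```python
import numpy as np

def impedance_action_to_torque(action, ee_pose, ee_vel, jacobian, gravity_torque):
    """Map a learned task-space action to joint torques.

    action: hypothetical policy output = [desired end-effector pose offset (6),
            per-axis stiffness gains (6)]. The real action space may differ.
    ee_pose, ee_vel: current end-effector pose and velocity, shape (6,) each.
    jacobian: 6 x n_joints end-effector Jacobian from the known robot model.
    gravity_torque: gravity-compensation torques from the known robot model.
    """
    target_pose = ee_pose + action[:6]            # policy commands a pose offset
    kp = np.clip(action[6:12], 1.0, 300.0)        # per-axis stiffness chosen by the policy
    kd = 2.0 * np.sqrt(kp)                        # critically damped by convention

    # Cartesian impedance law: a virtual spring-damper pulls the end effector
    # toward the commanded pose; the per-axis stiffness controls how hard the
    # robot pushes when it makes contact.
    wrench = kp * (target_pose - ee_pose) - kd * ee_vel
    return jacobian.T @ wrench + gravity_torque   # known model maps wrench to torques
```

The learned policy only produces `action` at a low rate; a mapping like this then runs at the controller's high frequency using the known robot model, which is what makes the system safe and sample efficient.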


Learning skills (13:09)

So moving on, let's talk about reinforcement learning again. We always want reinforcement learning to do these sorts of interesting, let's say image-to-control, tasks. This is yet again an example from Google, but often when you want to do very complicated things it can be frustrating, because the amount of data required on realistic systems can be very large. So when tasks get slightly harder, reinforcement learning starts to, what you would call, stutter: you want to do longer-term things, and that can be very hard. And interestingly, something that came out last year, this is not my work, but friends from a company actually showed that even though reinforcement learning is doing fancy stuff, you can code that up in about 20 minutes and you'll still do better. What was interesting is that in these cases, at least from a practical perspective, you could code up a simple solution much faster than a learned solution, which basically made us ask what is going on here. The issue is that exploration in reinforcement learning, or in these sorts of learned systems, is very slow. Often in these cases you already know part of the solution; you know a lot about the problem as the designer of the system. The system isn't working ab initio anyway. So the question we were asking is: how can human intuition guide exploration? How can we do simple stuff which can make learning faster?

So the intuition here was, let's say you are given a task that requires some sort of reasoning: move the block to a particular point. If you can reach the block, you'll move the block directly; if you cannot reach the block, then you'll use the tool. We thought that what we can do easily is, instead of writing a full policy, write subparts of the policy. Basically, specify what you know, but don't specify the full solution. Provide whatever you can. This results in a bunch of teachers which are essentially black-box controllers. Most of them are suboptimal; in fact, they may be incomplete. So in this case, I can provide a teacher which only goes to a particular point. It doesn't know how to solve the task; there is no notion of how to complete the task. You start with these teachers, and the idea is that you still want a full policy that is both faster to learn than the teachers and, at test time, doesn't necessarily use the teachers, because a teacher may have access to privileged information that the policy may not have.

But the idea, even though it's simple to specify, is actually non-trivial. Think about this: if you have multiple teachers, some of them can be partial, so you might need to sequence them together. Maybe they are so partial that there is no single sequence that will complete the task, because we put no requirement on these teachers. Sometimes the teachers may actually be contradictory. We did not say that they are all helpful; they can be adversarial. Independently, each is useful because it provides information, but when you try to put them together, say one goes back and one moves forward, you can keep using them without making progress in the task.
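As a toy illustration of what these partial, black-box teachers could look like, here is a sketch with hypothetical state keys and naive proportional controllers; none of them completes the task on its own, which is exactly the setting described above.

```python
import numpy as np

# Each "teacher" is a black-box controller that solves only part of the task
# and makes no promise of completing it. The state keys are hypothetical.

def reach_block_teacher(state):
    """Move the gripper straight toward the block; knows nothing about the goal."""
    direction = state["block_pos"] - state["gripper_pos"]
    return 0.1 * direction / (np.linalg.norm(direction) + 1e-8)

def grab_tool_teacher(state):
    """Head for the tool instead; only useful when the block is out of reach."""
    direction = state["tool_pos"] - state["gripper_pos"]
    return 0.1 * direction / (np.linalg.norm(direction) + 1e-8)

def push_toward_goal_teacher(state):
    """Push whatever is in front of the gripper toward the goal position."""
    direction = state["goal_pos"] - state["gripper_pos"]
    return 0.1 * direction / (np.linalg.norm(direction) + 1e-8)

teachers = [reach_block_teacher, grab_tool_teacher, push_toward_goal_teacher]
```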


Off-policy RL + AC-Teach (16:38)

So how do we do this? Let's review some basics in reinforcement learning; I believe you went through a lecture on reinforcement learning. This is an off-policy reinforcement learning algorithm called DDPG. What DDPG does is, you start with some state that the environment provides and you run a policy, the current iterate of your system. When you act with the policy, the environment gives you the next state and the reward, and you put that tuple into a database called the experience replay buffer. This is a standard trick in modern deep reinforcement learning algorithms. Now you sample mini-batches, the same way you would in any sort of deep learning algorithm, from this database that is constantly updating, to compute two gradients. One gradient is for what you call the critic, which is a value function, basically telling you the value of a state; the value of a state can be thought of as how far the goal is from the current state. The other is the policy gradient, in this particular case something called the deterministic policy gradient; that is where the name comes from, deep deterministic policy gradient. So you have these two gradients, one to update the critic and one to update the policy. And you can apply them in an asynchronous manner, off-policy in the sense that the data is not being generated by the same policy, so the update process and the rollout process are separate. That is why it's called off-policy.

Now let's assume you have some RL agent; whether it's DDPG or not doesn't really matter. You have an environment, you get a state, and you can run the policy. But now the problem is that besides the policy, you actually have a bunch of other teachers which are giving you advice. They can all tell you what to do. So you have to not only decide how the agent should behave, you also need to figure out whether to trust a teacher or not. How do I do this? One way is to think about how bandit algorithms work. At any point in time, I can think about the value of any of these teachers in a particular state. I can think of an outer loop around the policy learning: if the problem I was solving was selecting which agent to use, my own or one of these teachers, then I just need to know which will result in the best outcome in the current state. This formalism can be stated as: you learn a critic, or a value function, which chooses which of the proposed actions you should pick. Simultaneously, the policy that actually runs, called the behavioral policy, executes that choice, whether it's your own learned agent or one of the teachers. But now the trick is, regardless of who operates, whether it's a teacher or your agent, the data goes back to the replay buffer, and my agent can still learn from that data. So basically, I am running some teachers online, using them as supervisors, and using the data to train my agent. Whenever a supervisor is useful, the agent will learn from it; if the supervisor is not useful, then standard reinforcement learning happens.

So what does this result in? We go back to the same task. We provide four teachers; they look something like grab, position, push, and pull.
But we do not provide any mechanism to complete the task. So the first question we asked is: if we give one full teacher, the method should basically be able to copy the teacher. And that's what we find; as a baseline, we are able to do something sensible, at least copy one teacher when the teacher is near optimal. A more interesting thing happens when you get multiple teachers. The problem gets a bit more complicated because you have to decide which one to use, and if you use a suboptimal one, you waste time. This is where we see that our method results in sample efficiency. Even more interestingly, if you provide teachers which are incomplete, where I only provide teachers for part of the task and the other part needs to be filled in, all of the other methods fail; but because you're essentially combining reinforcement learning with imitation, you still maintain the sample efficiency.

So just taking a breather here, what we really learned from this line of work is that understanding domain-specific action representations, which, even though they are domain specific, are fairly general (manipulation is fairly general in that sense), and using weakly supervised, suboptimal teachers provides enough structure to promote both sample efficiency in learning and generalization to variations of the task. So let's go back to the original setup. We started with low-level skills.
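A rough sketch of the selection-and-learning loop described above follows, with `actor`, `critic`, `env`, and `update_fn` as stand-ins for the learned agent, its Q-function, the environment, and a DDPG-style update. The actual AC-Teach algorithm uses a Bayesian critic and a more careful commitment strategy, so treat this only as the shape of the idea.

```python
import random
from collections import deque

import numpy as np

replay_buffer = deque(maxlen=100_000)

def select_behavior_action(state, actor, critic, teachers, epsilon=0.1):
    """Pick who acts this step: the learned agent or one of the teachers.

    Bandit-style intuition: score each candidate action with the critic
    Q(s, a) and execute the most promising proposal.
    """
    candidates = [actor(state)] + [teacher(state) for teacher in teachers]
    if random.random() < epsilon:                     # keep some exploration
        return random.choice(candidates)
    scores = [critic(state, a) for a in candidates]   # value of each proposal
    return candidates[int(np.argmax(scores))]

def training_step(env, state, actor, critic, teachers, update_fn, batch_size=64):
    """One environment step plus one off-policy update.

    Whoever generated the action, the transition lands in the replay buffer,
    so the agent learns from teacher data and from its own data alike.
    """
    action = select_behavior_action(state, actor, critic, teachers)
    next_state, reward, done = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    if len(replay_buffer) >= batch_size:
        update_fn(random.sample(replay_buffer, batch_size))  # e.g. DDPG actor/critic update
    return next_state, done
```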


Compositional planning (22:02)

Let's graduate to slightly more complicated skills. We started with simple skills like grasping and pushing; what happens when you need to do sequential skills, things that you need to reason about for a bit longer? We started by studying this problem, which is fairly interesting. Let's say you have a task to do, maybe it's sweeping or hammering, and you are given an object, but the identity of the object is not given to you. You are basically given a random object, whether it's a pen or a bottle. How would you go about this? One way is to look at the task, look at the object, predict some sort of optimal grasp for this object, and then try to predict the optimal task policy. But I can argue that optimally grasping the hammer near the center of mass is suboptimal for both of these tasks. What is more interesting is that you actually have to grab the hammer in a manner such that you'll still succeed at the task, not optimally for grasping, because the gold standard you're really after is task success, not grasping success. Nobody grabs stuff for the purpose of grabbing it.

So how do we go about this problem? You have some sort of input and some sort of task, and the problem is we need to evaluate the many ways there are to grasp an object while still optimizing for the policy. There is a very large discrete space where you are grasping objects in different ways; each of those ways will result in a policy, and some of those policies will succeed while others will not. But the intuition, or at least the realization that makes the problem computationally tractable, is the fact that whenever the task succeeds, the grasp must have succeeded, but the other way around is not true. You can grab the object and still fail at the task, but you can never succeed at the task without grabbing the object. This enables us to factorize the value function into two parts: a task-conditioned grasp model and an independent grasp model. And this factorization enables us to create a model with three loss terms: one, are you able to grab the object; two, when you grab the object, does the task succeed; and three, a standard policy gradient loss. This model can then be jointly trained in simulation, where you have a lot of simulated objects trying to do the task and the reward function is sparse; you're only told whether you succeed in the task or not, with no other reward signal.

At test time, you get some object, a real object that is not in the training set, and you get an RGB-D image of it. You generate a lot of grasp samples, and the interesting part is that you're ranking the grasps based on the task: the ranking is generated by your belief of task success. Given this ranking, you can pick a grasp and actually go out and do the task, and the errors from that are back-propagated into the ranking. The way this problem is set up, you can generalize to arbitrary new objects, because nothing about object category is given to you. In this particular case, you are seeing simulation for the hammering task and the pushing task, and we evaluated this against a couple of baselines: a very simple grasping baseline, a two-stage pipeline where you optimally grasp the object and then optimally try to do the task, and then our method, where you jointly optimize the system.
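Here is a minimal sketch of the test-time ranking just described, with `grasp_model` and `task_model` standing in for the learned networks; the interfaces are hypothetical and the actual architecture differs in detail.

```python
import numpy as np

def rank_grasps_for_task(depth_image, grasp_candidates, grasp_model, task_model):
    """Rank sampled grasps by predicted *task* success, not grasp success alone.

    Uses the factorization described above:
        score(grasp) = P(grasp succeeds | grasp) * P(task succeeds | grasp succeeds)
    which is valid because the task cannot succeed unless the grasp does.
    """
    scores = []
    for grasp in grasp_candidates:
        p_grasp = grasp_model(depth_image, grasp)             # grasp robustness term
        p_task_given_grasp = task_model(depth_image, grasp)   # task-conditioned term
        scores.append(p_grasp * p_task_given_grasp)
    order = np.argsort(scores)[::-1]                          # best grasp for the task first
    return [grasp_candidates[i] for i in order], np.array(scores)[order]
```

In the actual system this ranking is trained jointly with the task policy from sparse task rewards, which is what the comparison against the two-stage baselines evaluates.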
What we find is that in this case, end-to-end optimization gets you more than double the performance. And there's nothing special about simulation: because we are using depth images, we can go directly to the real world without any fine-tuning. In these cases it is doing the same task but in the real world, and pretty much the same performance trend holds; we are at more than double the performance of, let's say, a two-stage pipeline where you optimally grasp the object.

Moving forward in the same setup, we wanted to ask: can we do more interesting sequential tasks which require reasoning, as Dr. Cox was mentioning earlier? Can we do something that requires both discrete and continuous planning simultaneously? Think of the kind of setting where the task is to roll the pin toward you, but if you keep rolling, the pin will roll off the table, hence you need a support, and you may have objects blocking the task, and there can be variants of this setup. What it requires you to do is both discrete and continuous reasoning: the discrete reasoning is which object to push in the scene, and the continuous reasoning is how much to push it by, or the actual mechanism of control.


Model-based RL (27:20)

So basically the kind of question we are asking is: can a robot efficiently learn to perform these sorts of multi-step tasks under various physical and semantic constraints? These are the kinds of things people usually use to test animal intelligence behaviors. We attempted to study this question in a simple manipulation setting where the robot is asked to move a particular object to a particular position. The interesting thing is that there are constraints. In this particular setup, the constraint can be that the object can only move along a given path, in this case along the gray tiles, and there can be other objects along the way. So in the presence of an obstacle, multiple decisions need to be made. You cannot just push the can to the yellow square; you actually need to push the obstructing object out of the way first, and then you can do greedy decision making. So you have to think about this at different timescales.

Now, doing something like this, you would argue, can be done with a model-based approach. You can learn a model of the dynamics of the system, roll the model out, and use it to come up with some sort of optimal action sequence. And one would argue that in recent times we have seen a number of papers where such a model can be learned in pure image space, so you are basically doing some sort of pushing via visual servoing in image space. The question we were asking is, since this is such a general solution, it seems natural that these sorts of models will do everything. And we were really surprised that even though the solution is fairly general, and there's nothing new about these papers from the perspective of the solution, it's basically learn a model and then do optimal control, these particular classes of models do not scale to more complicated setups. You cannot ask these complicated questions of doing hybrid reasoning with these simple geometric models. The reason is that to learn a complicated model that can do long-term planning or long-term prediction, the amount of data you would need scales super-linearly. To do something like this would require many, many robots and many, many months of data, and even then we do not know if it will work.

On the contrary, the insight we had is that there is a hierarchical nature to this action space: there are long-term symbolic effects, rather than the actual space of tasks, and then there is the local motion. If you can learn both of these things simultaneously, then perhaps you can generalize to an action sequence that can achieve this reasoning task. So what we propose is a latent variable model where you are learning both long-term effects, what you would call the effect code, and local motions. What this does, essentially, is that the long-term planner doesn't really tell you how to get to the airport; it only tells you what the milestones would be on the way to the airport. Once you have that, the local planner can tell you how to get to each of these milestones as you go along. So think of it like this: you can sample from a meta-dynamics model which generates multiple trajectories, multiple ways to get to the airport.
You select one of those depending on your cost function. Given the sequence of subtasks, you can then generate a distribution of actions for going forward, let's say, from milestone to milestone, and you check the validity of each action sequence against a learned low-level dynamics model. You are basically asking whether the action sequence generated by the model is going to be valid based on the data seen so far, and then you can weight these action sequences based on cost functions. So essentially you are training a model of dynamics at multiple levels, and you're training all of this purely in simulation without any task labels. You're not actually trying to go to the airport; you're basically just pushing around. The other thing is that you do not get labels for, let's say, the milestones, which is equivalent to saying you don't get labels for the latent variables. So the motion codes and effect codes are essentially latent, and you set this up as a variational inference problem. The modules you see at the bottom are used to infer the latent codes without explicit labels.

The setup overall looks something like this. You have a robot. The robot's input image is parsed into object-centric representations. This representation is passed into the planner, the planner outputs a sequence of states that can now be fed into the system, and you loop through it. I gave you the example of a simple task of moving the object, but we did other tasks as well, where you're trying to move the object to a particular goal in a field of obstacles, or trying to clear a space where multiple objects need to be pushed out.

What we found, comparing this model against a bunch of baselines that use simpler models, is that having a more complicated model works better, especially when you have a dense reward function. But when you have a sparse reward function, which basically says you only get reward when the task completes, no intermediate reward, then the performance gap is bigger. And again, the way this is set up, you can go to a real-world system without any fine-tuning and get pretty much the same performance; again, the trick is that the input is depth images.

Okay, so just to give you qualitative examples of how this system works: in this case the yellow box needs to go to the yellow bin, there are multiple obstacles along the way, and the system is doing discrete and continuous planning simultaneously, without us having to design separate discrete and continuous models or task-specific models. The interesting thing is that there's only a single model learned for all of these tasks; it is not separate per task. In yet another example, the system takes a longer path rather than pushing through the greedy path to get this bag of chips. In this particular case, the system figures out that it needs to create a path by pushing some other object out of the way. So in both of these projects, what we learned is that the power of self-supervision in robotics is very strong. You can actually build compositional priors with latent variable models using pure self-supervision.
Both in the task where we were doing hammering and in this case, we had models trained on purely self-supervised data in simulated setups, and we were able to get real-world performance out of them.
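The following is a two-level, sampling-based planning loop in the spirit of what was described above. All model arguments are hypothetical callables standing in for the learned latent-variable networks; this is a sketch of the planning idea, not the exact algorithm from the paper.

```python
import numpy as np

def hierarchical_plan(state, goal_cost, meta_dynamics, action_generator,
                      low_level_dynamics, n_subgoal_samples=128,
                      n_action_samples=64, horizon=4):
    """Plan with latent effect codes (subgoals) and motion codes (actions)."""
    # High level: sample effect codes, roll out the meta-dynamics model to
    # propose subgoal sequences (the "milestones"), keep the cheapest one.
    best_subgoals, best_cost = None, np.inf
    for _ in range(n_subgoal_samples):
        effect_code = np.random.randn(8)                      # latent effect sample
        subgoals = meta_dynamics(state, effect_code, horizon)  # predicted milestone states
        cost = goal_cost(subgoals)
        if cost < best_cost:
            best_subgoals, best_cost = subgoals, cost

    # Low level: for each milestone, sample motion codes, decode to actions,
    # and keep the action whose predicted outcome best matches the milestone
    # under the learned low-level dynamics model (the validity check).
    plan, current = [], state
    for subgoal in best_subgoals:
        best_action, best_err = None, np.inf
        for _ in range(n_action_samples):
            motion_code = np.random.randn(4)                   # latent motion sample
            action = action_generator(current, subgoal, motion_code)
            predicted = low_level_dynamics(current, action)
            err = np.linalg.norm(predicted - subgoal)
            if err < best_err:
                best_action, best_err = action, err
        plan.append(best_action)
        current = low_level_dynamics(current, best_action)
    return plan
```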


Leveraging task structure (34:37)

So moving on, the next thing I wanted to study was what happens when tasks grow a bit more complex. We have looked at simple, let's say two-stage, tasks; what happens when you have graph-structured tasks, when you actually have to reason about longer tasks, let's say a Towers of Hanoi problem? Clearly, RL would be much harder to do here. Even imitation in these cases starts to fail, because the specification for imitation of a very long multi-stage task, whether it's building Legos or block worlds, is actually very hard. What we really want is meta-imitation learning. Meta-imitation learning can be thought of as follows: you have an environment, the environment is bounded, but it can end up in many final states, each of which can be thought of as a task. So you get many examples of reconfigurations of that environment; these can be thought of as the tasks you see in the training distribution. At test time, you are given a specification of one final task, which is most likely new, and you still need to be able to do it. How do we do these kinds of tasks with current solutions? The way we do this right now is we write programs. These programs enable you to reason about long-term tasks even at a very granular scale; this is how you would code up a robot to put two blocks on top of each other. Now, if you were to do this slightly differently, you would need to write a new program. This gave us the idea that perhaps instead of doing reinforcement learning, we can pose this problem as program induction, or neural program induction.
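For illustration, a hand-written program of the kind just mentioned might look like the following sketch, using a hypothetical robot API (`move_to`, `close_gripper`, `open_gripper`, and `pose_of` are placeholders, not a real library). The point is that every new goal configuration needs a new script like this, which is exactly what program induction is meant to replace.

```python
def stack_b_on_a(robot, pose_of):
    """Hand-coded plan: pick up block B and place it on top of block A."""
    robot.move_to(pose_of("block_B", offset_z=0.10))   # hover above block B
    robot.move_to(pose_of("block_B"))                   # descend to grasp height
    robot.close_gripper()                               # pick
    robot.move_to(pose_of("block_B", offset_z=0.10))   # lift
    robot.move_to(pose_of("block_A", offset_z=0.15))   # carry over block A
    robot.move_to(pose_of("block_A", offset_z=0.05))   # lower onto A
    robot.open_gripper()                                # place
```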


Neural task programming (NTP) (36:35)

It's essentially reducing a reinforcement learning or decision-making problem to a supervised learning problem in a very large space. You get an input video and a meta-learning model which takes the current state and outputs the next program you should call, and not only the next program, but of course also the arguments that you need to pass to that API. It's essentially equivalent to saying: if you give the robot an API, can the system use that API itself? What you need is a dataset of demos, video demonstrations, and, let's say, a planner that tells you which sub-programs were called in that execution. The loss looks very much like a supervised learning loss, where you have a prediction and you compare it with the ground-truth plan.

What does this look like? You can really think of this as a high-level neuro-symbolic planner. You start with block stacking. Block stacking unpacks to pick-and-place, and pick-and-place can unpack to pick. Once you unpack to pick, the robot will actually execute the API-level command. As the API-level command executes, the executor goes back to pick, moves forward with the pick, and completes it. Once the pick is complete, pick-and-place moves forward to the place part, which executes its own sub-steps in sequence. And once place is complete, the executor goes back through pick-and-place to block stacking, and you can continue doing this. This is just an example of one pick-and-place, but it can continue to multiple blocks; we tested with over 40 of these examples.

So what does this enable? You can now input the specification of the task through, let's say, a VR execution, which is what you see in the inset, and then the robot can look at that video and try to do the task. What is important to understand is what is happening in this sequence of executions. The system is not just parsing the video, because that would be easy; the system is actually creating a policy out of this sequence. One way to test this is to have an adversarial human in the loop trying to break the model. If you have done the task halfway through and the world is stochastic, the world goes back to an earlier state, the system should not continue doing the same thing it saw; it should actually be a reactive, state-dependent policy. In terms of numbers, what we find is that a flat policy or a deep RL-style policy does not work on test tasks, but this sort of task programming, or program induction, works very well, and it works even with vision. You have pure visual input, no specification of where the objects are, so you get generalization with visual input without a domain-specific visual design. But again, none of this works perfectly; it would not be robotics if it worked. So what fails?
Often what happens is, because we are using an API, if the API doesn't declare when a failure happens, let's say you're trying to grab something but the grab action did not succeed, the high-level planner does not know and it continues. So we went back to the model and asked what is actually causing it to fail. We found that even though we used programs as the output, which let us inject structure, the model itself was still a black box; it was basically an LSTM. We thought perhaps we can open the black box and put in a compositional prior. What does this compositional prior look like? Think of graph neural networks. The graph can be this idea of executing the task in a planner, a PDDL-style planner where nodes are states and edges are actions, and you plan through these. This can still result in a two-stage model where you are learning the graph of the task itself, rather than a black-box LSTM predicting it. But there is a problem: in these kinds of setups, the number of states can be combinatorial, millions maybe, while the number of actions is finite. So the key to learning this graph in a neural sense was to realize that the graph should not be over states, but should actually be a conjugate graph. The conjugate graph flips the model by saying nodes are now actions and edges are states. You can really think of it as: nodes tell you what to do, and edges are pre- and post-conditions. How does this model work now? You have an observation model which tells you, in any particular state, what action was executed, and each action tells you what state you end up in, which tells you what the next action should be. Because this graph is learned, you're basically getting the policy for free. The training is very similar to the program induction, except you do not need the full execution trace; the lowest level of actions is sufficient. What that gives us is stronger generalization from both videos and states with much less supervision: fewer data points or weaker supervision, but better generalization. So the big-picture key insight, again, is that compositional priors, such as neural programs or neural task graphs, give us the modular structure that is needed to achieve one-shot generalization in these long-term sequential plans.
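As a rough sketch of the execution loop behind this idea, the following recursion unpacks programs until it reaches API-level primitives, re-observing the world at every step so the behavior stays state-dependent. Here `core_model`, `robot`, and the primitive names are hypothetical stand-ins for the learned model (an LSTM in the first version, a task-graph model in the follow-up) and the robot API.

```python
# Hypothetical set of API-level primitives the robot can execute directly.
API_PRIMITIVES = {"move_to", "grip", "release"}

def run_program(program, args, task_spec, robot, core_model, max_depth=10):
    """Recursively unpack programs until an API-level primitive is reached."""
    if max_depth == 0:
        return                                    # guard against runaway recursion
    if program in API_PRIMITIVES:
        robot.execute(program, args)              # bottom of the hierarchy: real action
        return
    while True:
        obs = robot.observe()                     # state-dependent: re-read the world
        sub_program, sub_args, end_of_program = core_model(task_spec, obs, program, args)
        if end_of_program:                        # this program's job is done, pop up
            return
        run_program(sub_program, sub_args, task_spec, robot, core_model, max_depth - 1)
```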


Data for robotics (43:04)

So in the one or two minutes that I have left, I want to leave you with this question. Often we look at robotics as the sort of ultimate challenge for AI, and we compare the performance of robotics with that of our colleagues in vision and language, where a lot of progress has been made. But one of the things you notice is that as the tasks get harder, the datasets in robotics get very small, very quickly. To do more interesting, more complicated tasks, we still need datasets large enough to leverage the power of these algorithms. If you look at robotics in recent times, the datasets have been essentially minuscule, about 30 minutes of data that can be collected by a single person. This is a chart of large datasets in robotics; it's not very old, actually, this is from CoRL 2018. Just to compare with NLP and vision datasets, we are about three orders of magnitude off. So we were asking: why is this? The problem is that both vision and language have Mechanical Turk; they can get a lot of labeled data. But in robotics, labeling doesn't work, you actually need to show the robot what to do. So we spent a lot of time creating a system which is very, very similar to Mechanical Turk.


RoboTurk (44:24)

We call it RoboTurk. You can use essentially commodity devices, like a phone, to get large-scale datasets which are actual demonstrations, full 3D demonstrations. This now enables us to get data at scale on both real and simulated systems: you can be wherever you want and collect data from crowdsourced workers at very large scale. We did some pilots and were able to collect hundreds of hours of data. Just to give you a sense of how this compares: that's against 13 hours of data collected in about seven months; we were able to collect about 140 hours of data in six days. The next question would be, why is this data useful? We did reinforcement learning, and what we find is that if you do pure RL with no demonstration data, even after three days of running on multiple machines, you get no progress. As you keep injecting data into the system, the performance keeps improving. So there is actual value in collecting data. The take-home lesson was basically that more data with structured and semantic supervision can fuel robot learning for increasingly complex tasks, and scalable crowdsourcing methods such as RoboTurk really enable us to access this treasure trove.


Summary

Summary (45:54)

So going back, what I really want to leave you with: we talked about a variety of methods at different levels of abstraction, from controls to planning to perception, and then we talked about how to collect data. But if there's one thing I want to leave you with today, it is that if you want to do learning in complex tasks and complex domains such as robotics, it is very important to understand the value of injecting structured priors and inductive biases into your models. Generic models from deep learning that have worked for vision may or may not work for you. That is one. Two, the use of modular components and modularization of your problem, where you combine domain-dependent expertise with data-driven methods, can enable you to build practical systems for much more diverse and complex applications. With that, I would like to thank you all for being such a patient audience, and I'm happy to take questions.

