An AI Primer with Wojciech Zaremba | Transcription

Transcription for the video titled "An AI Primer with Wojciech Zaremba".




Introduction

Intro (00:00)

Hey, today we have Wojciech Zaremba and we're gonna talk about AI. So Wojciech, could you give us a quick background? I'm a founder at OpenAI. I'm working on robotics. I think that robotics is a great application for deep learning and AI. Prior to that, I spent a year at Google Brain and a year at Facebook AI Research, and at the same time I finished my PhD at NYU. Can you explain how you pulled that off? That seems pretty rare. So the great thing about both of these organizations is that they are focused on research, so throughout my PhD I was actually publishing papers over there. I highly recommend both organizations, as well as, of course, OpenAI. Yeah, okay. So most people probably don't know what OpenAI is. Could you just give that quick explanation? So OpenAI focuses on building AI for the good of humanity.


Understanding AI And Machine Learning

What is AI? (01:04)

We are a group of researchers and engineers who essentially try to figure out what the missing pieces of general artificial intelligence are, and how to build it in a way that would be maximally beneficial to humanity as a whole.


What is Machine Learning? (01:14)

OpenAI is greatly supported by Elon Musk and Sam Altman. In total, we gathered an investment of $1 billion in the group. Which is quite a lot. And so what are, I mean, I know some of them, but what are the OpenAI projects? So there are several large projects going on.


What is Deep Learning? (01:51)

Simultaneously, we are also doing basic research. So let me first enumerate the large projects. The first is robotics. In terms of robotics, we are working on manipulation. We think that manipulation is the part of robotics which is the most unresolved. Sorry, just to clarify, what does that mean exactly? It means that, so in robotics, there are essentially three major families of tasks.


What is manipulation (02:27)

One is locomotion, which means how to move, let's say how to walk, how to move from point A to point B. Second is navigation. It's moving in a complicated environment, such as, for instance, a flat or a building, and you have to figure out which rooms you have visited before and where to go. And the last one is manipulation. It means you want to grasp an object, let's say open an object, place objects in various locations. And the third one is the one which is currently the most difficult. So it turns out that when it comes to arbitrary objects, current robots are unable to just grasp an arbitrary object. For any single object, it's possible to hand-code a solution. So, let's say in a factory, if you always have the same object, like, I don't know, you are producing glasses, then there exists a hand-coded solution for it. There is a way, in code, to write a program saying: let's place the hand in the middle of the glass and then let's close it. But there is no way so far to write a program such that it would be able to grasp an arbitrary object. Okay, gotcha. And then just very quickly, the other OpenAI projects? So another one has to do with playing a complicated computer game. And the third one has to do with playing a large number of computer games. And you might ask why that's interesting. In some sense, we'd like to see the following. A human has an incredible skill of being able to learn extremely quickly, and it has to do with prior experience. So let's say, even if you have never played volleyball, if you try it out for the first time, within 10 or 15 minutes you would be able to grasp how to actually play. And it has to do with all the prior experience that you have from different games. If you would put an infant on the volleyball court and ask him or her to play, it would fail miserably. But because you have experience coming from a large number of other games, or let's say other life situations, you are able to actually transfer all the knowledge. So at OpenAI, we were able to pull together a large number of computer games. And with computer games, it's quite easy to quantify how good you are at the game. So, first of all, for many computer games it's possible to write a program that solves it pretty well, or plays it well. And there are also results in terms of reinforcement learning, or in terms of so-called deep reinforcement learning, showing that it's possible to learn how to play a computer game. The initial results came from DeepMind.
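To make that learning setup concrete, here is a minimal sketch of the agent-environment-reward loop using OpenAI Gym's classic (pre-0.26) interface. The random agent and the CartPole toy task are illustrative assumptions, not the systems discussed here:

```python
import gym

# A hedged sketch of the reinforcement learning loop: an agent acts in an
# environment, and the environment hands back observations and rewards.
env = gym.make("CartPole-v0")
obs = env.reset()                               # restart the world to a starting state
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()          # a random agent -- no learning yet
    obs, reward, done, info = env.step(action)  # the environment returns a reward
    total_reward += reward
    if done:
        obs = env.reset()                       # try again from the same situation
print(total_reward)
```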


Science and Engineering (05:53)

But simultaneously, it takes an extremely long time, in terms of real-time execution, to learn to play computer games. So for instance with Atari games, in terms of real-time execution, it takes something around three years of play to learn simple games. I mean, it can be hugely parallelized; therefore, it takes a few days to train on current computers. But it's way shorter for humans, because in 10 minutes we can kind of teach you how to play and win. Okay. And is that through you giving it feedback? So the way it works in the case of computer games, the feedback comes from the computer, from the score. It looks at the score in the game and tries to optimize it. And I would say that's kind of reasonable, but simultaneously it's not that satisfying to me. The reason why it's not that satisfying: the assumption underlying reinforcement learning is that there is some environment, and in the environment you are an agent, and you are acting in the environment by executing actions and getting rewards from it. And the rewards might be thought of as, let's say, pleasure or so. And the main issue is that it's actually not that easy to figure out what the rewards are in the real world. Further on, another underlying assumption is being able to reset the environment, to get repeatedly to the same situation, so the system can try thousands or millions of times to actually finish a game. So there are some discrepancies. People also believe that it might be possible somehow to hard-code rewards into the system. But I would say that's actually one of the big issues that is kind of unresolved. Like, when I look at how my nephew plays a computer game, he actually doesn't look at the score because he cannot read. And still, yeah, he can play pretty well. So I mean, you can say maybe the reward is somewhat different; maybe the reward comes from hearing a nice sound in the game or so. But I would say it's very unclear how to build a system around that, and what the system should optimize. In some sense, if we have a metric that we want to optimize, it's possible to build a system that could optimize for it. But it turns out that in many cases it's not that easy. And I would say that's actually one of the motivations why I wanted to work on robotics, because in the case of robotics, it's way closer to the systems that we care about. What I mean by that: for instance, let's say you would like your robot to prepare scrambled eggs for you. So the question is, how should I build a reward? In computer games, actually, the nice thing is they are giving rewards extremely frequently. So let's say anytime you kill an enemy, or let's say you die, you get a signal, which is quite great. But in the case of scrambling eggs, the way people write rewards for systems, it would mean the distance from the hand to the pan; then, let's say, somehow you have to quantify whether you were able to crack open an egg, or let's say whether you fried it sufficiently, and how to quantify that turns out to be extremely difficult. And also, there is no way even to reset the system to the same place. So, okay, these are like fundamental issues. And the reason why I'm personally interested in robotics is that I think working on these challenges will actually tell us how to solve them. So let's start by defining a couple of things. So what is artificial intelligence? What is machine learning? And then what is deep learning?
Okay, these are pretty good questions. Okay. So artificial intelligence is actually an extremely broad domain, and machine learning is a subpart of this domain. In essence, artificial intelligence consists of writing any software that tries to solve some problems through some intelligence; it might be a hand-coded solution, rules, a rule-based system. Yeah, so pretty much it's actually very hard to say what is not artificial intelligence. So the initial version of Google search, for instance, was avoiding any machine learning. There was a well-defined algorithm called PageRank, and essentially PageRank counts how many incoming links there are from other websites. And that's artificial intelligence; it's essentially a system that does intelligent things for you. And then over time, Google search started to use machine learning, because it helps to improve results. At the same time, they wanted to avoid it for a while, as it's more difficult to interpret the results, and it's more difficult to actually understand what the system does. So what is machine learning? Machine learning is essentially a way of building, let's say, actually: you have data, and you would like to generate, based on the data, a program with some behavior. The most common example, which is a sub-branch of machine learning, is so-called supervised learning. You have pairs of examples, x comma y, which means that I would like to map x to y. For instance, whether a given email is spam or not spam, or, given an image, what the category of the image is, or, for instance, to whom I should recommend a given product. And based on these data, you would like to generate a program, some sort of black box or some function, that for new examples would be able to give you similar answers. That's an example of supervised learning. But machine learning in general means that you would like to generate a program from data. Okay. And this usually uses statistical machine learning methods. So somehow you count, let's say, how many times given events occurred, or so. Okay, gotcha. And then the third being deep learning. So deep learning, that's one paradigm within machine learning. Okay. And the idea behind it is ridiculously simple. Okay. So people realized that, as I said, machine learning means that you get data as an input and a program as the output. And deep learning says that the computation of the program, what I'm actually doing with this data, should involve many steps. Not one step, but many. Okay. And pretty much that's it, in terms of the meaning of deep learning. So you might ask why it's so popular now and how it's so different from what was there before. The thing is, if you assume that you do one step of computation, let's say that you take your data and you have a single if statement or a small number of statements. For instance, say your data is a recording from a stock market, and you are saying you're going to sell or buy depending on whether a value is bigger or smaller than something, or, let's say, depending on who the new president is or so, you are making some decisions. The thing is that in the case of models that are based on a single step, people are able to prove plenty of stuff mathematically.
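To make the supervised learning setup above concrete: given example pairs (x, y), we want a program that answers similarly for new x's. A minimal sketch with made-up "spam" features and a nearest-neighbor rule; the features and data are invented purely for illustration:

```python
import numpy as np

# Made-up training pairs (x, y): each x is a tiny feature vector for an email
# (e.g., count of suspicious words, count of known contacts); y is 1 for spam.
X_train = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 6.0], [1.0, 4.0]])
y_train = np.array([1, 1, 0, 0])

def predict(x_new):
    # The "generated program": answer with the label of the closest known example.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

print(predict(np.array([4.5, 0.5])))  # -> 1: looks like the spam examples
```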


Single Step Problems (15:12)

And in terms of models that require multiple steps of computation, the mathematical proofs are extremely weak. And for a long time, models that do a single step of computation were outperforming models that do many steps of computation. But recently, that changed. For many people it was obvious for a long time that true intelligence cannot be done in a single step, that it would require many steps. But so far, many systems actually worked in a way that was very, very shallow. They were very shallow, but simultaneously extremely gigantic. What I mean by that: for the task of interest, let's say recommendation, you could generate a large number of features, let's say thousands of them. These are features saying, for instance, let's say you want to do movie recommendation, you can say: is the movie longer or shorter than two hours?


Abstractions (16:29)

Is it longer or shorter than one hour? Those are two features. You can say: is it a drama? Is it a thriller? Is it something else? And you can generate a million of these, or let's say 100,000; that's actually quite a reasonable value. And then your shallow classifier can determine, based on the combination of these features, whether to recommend it to you or not. In the case of deep learning, you would say: let's combine them over multiple steps. And that's essentially the entire difference. And the most successful embodiment of deep learning is in terms of neural networks. Okay. So let's define that too. So neural networks: it's also an extremely simple concept, and something that people came up with a long time ago. It means as follows: you have an input, this might be, say, a vector, or it might have some additional structure, like, let's say, an image.


Neural Networks (18:06)

So it's kind of a matrix, two-dimensional. And a neural network is a sequence of layers; layers are represented by matrices. And what you do is you multiply your input by a matrix and apply some nonlinear operation, then multiply it again by a matrix and apply a nonlinear operation. You might ask, why would I even need to apply this nonlinear operation? It turns out that if you would multiply by two matrices, it can be reduced to a multiplication by a single matrix: a composition of two linear operators can be written as a single linear operator, so you could multiply these matrices together and condense them into a single matrix. There is an extremely large number of variants of what I just described, but what I just described is the so-called feedforward neural network. It essentially takes an input, multiplies it by a matrix, applies a nonlinearity, multiplies it by a matrix. As for examples of nonlinearities, there is the classical one, called the sigmoid. The sigmoid is a function that has the shape of the letter S: it's close to zero for negative values, it reaches one half at zero, and then it goes up to one when the values are larger. It kind of modulates the input, and that's the most classical version of an activation function. It turns out that one which is even simpler empirically works way better, which is called ReLU, the rectified linear unit. And this one is ridiculously simple: ReLU is just the maximum of zero and x. So when you have negative values, you output zero; when you have a positive value, you just copy the value, and that's it. So you might ask, first of all, what are the successes of deep learning? Why do we actually believe that it works? What changed, and why is it so much different than it was before? And there are a few differences. This is a good question. No, it's exactly where I was going to go, but I was going to ask beforehand: why are neural networks a thing now, as opposed to in the past? The main difference is that all of a sudden we can train them to solve various problems. Let's say one family of problems are problems in supervised learning. Better than any other method, they can map examples to labels, and then on held-out test data they outperform anything else. And in many cases, they get superhuman results. And is that just a function of the computational power that we have access to? When it comes to models, and a neural network is an example of a model, there is always the question of how to figure out the parameters of the model. So there is some training procedure. And the most common procedure for neural networks is so-called stochastic gradient descent. It's also a ridiculously simple procedure, and it turns out that empirically it works very well. People came up with a vast number of learning algorithms; stochastic gradient descent is one learning algorithm among others. Let's say there is something called Hebbian learning that's motivated by the way neurons in the human brain learn. But stochastic gradient descent so far is empirically working the best.
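A minimal sketch of the feedforward network just described: multiply by a matrix, apply a nonlinearity, repeat. The sizes and random weights here are arbitrary; this shows the shape of the computation, not a trained model:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(0, x) elementwise -- zero for negatives,
    # copy the value for positives.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Classical S-shaped nonlinearity: near 0 for very negative x,
    # one half at 0, near 1 for large x.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=4)               # an input vector
W1 = rng.normal(size=(8, 4)) * 0.5   # first layer's matrix
W2 = rng.normal(size=(1, 8)) * 0.5   # second layer's matrix

h = relu(W1 @ x)     # multiply by a matrix, apply a nonlinearity...
y = sigmoid(W2 @ h)  # ...and again. Without the nonlinearities, W2 @ (W1 @ x)
print(y)             # would collapse into a single matrix multiplication.
```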


ImageNet (22:37)

Okay. So then let's go to the question you asked yourself, which is: why now? What's happening to make people care about it right now? So compared to 20 years ago, there are several small differences in terms of how people train neural networks, and there is a large increase in computational power. So I can speak about the major advances. The number one advance, I would say, and it's actually an old one, but it seems to be extremely critical, is something called the convolutional neural network. Okay. What does that mean? Yeah. So it's actually a very simple concept. Let's say your input is an image, and let's say your image is of size 200 by 200, and it also has, let's say, three colors. So the number of values in total is actually 120,000. If you would squash it into a vector, the vector would be of this size. Okay. And you can think that to apply a neural network is essentially to multiply it by a matrix. And let's say, if you would like the output of the multiplication to be of similar size, let's say 120,000, then all of a sudden the matrix to multiply by would be of a gigantic size. And learning consists of estimating the parameters of a neural network. It turns out that empirically this essentially wouldn't work: if you would use the algorithm of backpropagation, you would get quite poor results. And people realized that in the case of images, you might want to multiply by a slightly special matrix, which also allows you to do way faster computation. So you can think of a neural network as applying some computation to the input. You might want to constrain this computation in some sense. You might think, as you have several layers: maybe initially you would like to do very local computation, and it should be pretty much similar in every location. So you would like to apply the same computation in the center as in the corners. Maybe later on you need some diversification, but you want to pre-process the image the same way everywhere. So the idea is that you take an image, or actually any two-dimensional structure. The other example is voice: it turns out that by applying the Fourier transform, you turn voice into an image, it's like a two-dimensional-- - Like a waveform? - Yeah, so you take a waveform and you apply the Fourier transform. - Okay. - And essentially on the x-axis you have time, as the speech goes on, and on the y-axis you have different frequencies, and that's an image. Speech recognition systems also treat sound as if it were an image. - I didn't realize that, that's really cool, okay. - So that's why I'm saying that-- also, as a side track, the cool thing about neural networks is that it used to be the case that separate people specialized in processing text, images, and sound, and these days this is the same group of people. - That's really cool. - We are using the same methods. So coming back to what a convolutional neural network is: as I mentioned, you would like to apply the same computation all over the place in the image, and essentially a convolutional neural network says: when we take an image, let's connect each neuron only to local values of the image, and let's copy the same weights over and over again. This way you will multiply values in the center and in the corners by the same values in the matrix. - Okay. - So the input to the convolution is an image, and the output is also kind of an image.
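A minimal sketch of that weight sharing: one small matrix of weights (a kernel) is applied at every location of the image, so the center and the corners are processed by the same parameters. This assumes a single channel with no stride or padding; real layers stack many such kernels over the depth dimension:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the same small weight matrix across the whole image:
    # identical computation in the center and in the corners.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)         # a tiny grayscale "image"
kernel = np.random.rand(3, 3) * 0.1  # 3x3 weights, reused everywhere
print(conv2d(image, kernel).shape)   # (6, 6): the output is image-like too
```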
You can think that there is also some specific vocabulary. These are kind of three-dimensional images: you have height and width, and you also have depth. So in the case of an image, the depth is three (the colors), and when you apply a convolution, you can change the number of depth dimensions. Usually people go to, let's say, I don't know, 100 dimensions or so. - Okay, gotcha. - And then you have several of these layers, and then there are so-called fully connected layers, which are just conventional matrices. So I would say that's one of the advances, and it actually happened 20 years ago already. Another one, and it might sound kind of funny, is that for a long time people didn't believe that it's possible to train deep neural networks, and they were thinking quite a lot about what the proper learning algorithms are. And it turns out that, let's say, when you train a neural network, you start off by initializing the weights to some random values, and it turns out that it's very important to be careful about the magnitudes to which you initialize the weights. If you set them to the right values, and I can even give you some intuition of what that means, it turns out that the simplest algorithm, which is called stochastic gradient descent, actually works pretty well. In some sense, as I said, the layers of a neural network multiply the input by matrices.




Great! (29:37)

And the property that you would like to retain is that you don't want the magnitude of the values to blow up, and you also don't want it to shrink down. And if you choose a random initialization, it's easy to choose an initialization such that the magnitude will keep on increasing.
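A numeric sketch of that point, and of the fix discussed in the next section: with random layers, a slightly-too-large initialization scale makes the magnitude explode over ten layers, a slightly-too-small one makes it vanish, and one particular scale keeps it roughly constant. The stable value sqrt(2/n) for ReLU layers is the He-style choice, assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # width of each layer

for scale in (0.05, np.sqrt(2.0 / n), 0.3):  # too small / roughly right / too big
    h = rng.normal(size=n)
    for _ in range(10):                      # ten layers of (matrix multiply + ReLU)
        W = rng.normal(size=(n, n)) * scale
        h = np.maximum(0.0, W @ h)
    print(f"init scale {scale:.3f} -> output magnitude {np.linalg.norm(h):.2e}")
```

With the middle scale, the output stays about the size of the input, which is the property that lets plain stochastic gradient descent work.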


adjusting magnitudes (30:03)

And then if you have 10 layers, then let's say in each of them you multiply by 2, 2, 2, 2, 2. And then the output all of a sudden is of a completely different magnitude, and learning is not happening anymore. And it's a matter of choosing the variance, or like the magnitude, of the initial weights. If you set it so that, let's say, the output is of the same magnitude as the input, everything works. So basically just adjusting those magnitudes was what proved that you could do this with a neural network? Yes. Oh, wow. OK. It's kind of ridiculous that, let's say, people hadn't realized it for a long time, but that's what it is. And when and where did that happen? It happened actually at the University of Toronto. Oh, OK. So then at Geoffrey Hinton's lab. The crazy thing is people had several schemes in terms of how to train deep neural networks. One was called generative pre-training. So let's say there was some scheme of what to do in order to get a neural network to such a state that all of a sudden you can use this trivial algorithm called stochastic gradient descent. So there was an entire involved procedure. And at some point, Geoffrey asked his student to compare it to a simpler solution, which would be adjusting magnitudes, and to show how big a difference there is. That's crazy, man. Oh my God. OK. So a question that's a little bit broader is just: what has happened in the past, say, five years to excite people so much about AI? So I would say the most stunning were the so-called ImageNet results. First of all, I should tell you where computer vision was five years ago. Then I will tell you what ImageNet is. And then I will tell you about the results. So computer vision is a field where essentially you try to make sense of images; a computer tries to interpret what is on images. And it's extremely simple for us to say, oh, here on an image there is a cow, a horse or so. But for a computer, an image is just a collection of numbers. It's a large matrix of numbers, and it's very difficult to interpret what the content is. And it was the case that people came up with various schemes for how to do it. You could imagine, I don't know, maybe let's quantify how much brown color there is, such that you can say it's a horse. Like, simple stuff. People, of course, came up with more clever solutions, but the systems were quite bad. I mean, you could feed a picture of a sky to the system, and it was telling you that there is a car. It's like a-- So not so good. Yeah. Yeah. So then Fei-Fei Li, who is a professor at Stanford, together with her students, collected a large data set of images. And the data set is called ImageNet. It consists of one million images and 1,000 classes. That was, at the time, actually the largest data set of images. A class, just to clarify, being, like, car might be a class? Yes. So the data set, I would say, is not without artifacts. For instance, it doesn't contain people; that was one of the constraints over there. And it contains a large number of breeds of dogs. So that's the quirky thing about it. But at the same time, I mean, that's the essential data set that made deep learning happen. Types of dogs? No, no. Whoa. The fact that it's so large. So what happened: there were plenty of teams actually participating in the ImageNet competition. And as I'm saying, there are 1,000 classes over there.
So if you make a random guess, then the probability that your guess is correct is essentially 0.1%. The metric there was slightly different: you actually make five guesses, and if one of them is correct, then you are good, because there might be some other objects in the image and so on. And I remember when I, for the first time, saw that someone created a system that had 50% error, I was impressed. Oh, man. It's like 1,000 classes and it can say with 50% error what is there. I was quite impressed. But then during the competition, pretty much all the teams got around a 25% error rate. There was a difference of one percent between them. For instance, a team from the University of Amsterdam, a Japanese team, plenty of people around the world. And a team from the University of Toronto led by Geoffrey Hinton.
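A sketch of the metric being described: a prediction counts as correct if the true class is among the model's five highest-scoring classes. The scores here are made up; the point is just the bookkeeping:

```python
import numpy as np

def top5_correct(scores, true_class):
    # ImageNet-style "top-5" scoring: correct if the true class is
    # among the five highest-scoring guesses.
    return true_class in np.argsort(scores)[-5:]

scores = np.random.rand(1000)       # fake scores over 1,000 classes
print(top5_correct(scores, true_class=42))
# A single uniformly random guess is right with probability 1/1000 = 0.1%.
```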



ImageNet (36:10)

And on that team were Alex Krizhevsky and Ilya Sutskever. They actually got to something like 15%. So let's say all other teams were at around 25%, and the differences between them were 1%. And these two guys got to 15%. And the crazy thing is that within the following three years on this data set, the error dropped dramatically. I remember the next year the error got to, let's say, 11%, 8%. I remember by that time I was wondering: what's the limit? How good can you be? And I was thinking 5%, that's the best, and that's even for humans trying to see how far they can get if they spend an arbitrary amount of time, let's say, looking at other images and comparing, to be able to figure out what is there. I mean, it's not that simple for humans. For instance, there are plenty of breeds of dogs, and who knows the breeds, but let's say if you can use some external images to compare and so on, that helps. But in a sense, within several years, people got down, I believe, to 3% error. And that's essentially superhuman performance. And as I'm saying, it used to be the case that systems in computer vision would take a picture of a sky and tell you it's a car, and all of a sudden you're getting to superhuman performance. And it turns out that these results are actually not just limited to computer vision. People were able to get amazing results in other systems, let's say speech recognition or so. So because that's the underlying question, right? Because to someone not in the field like me, it's not necessarily intuitive that computer vision, computer image recognition, would seed artificial intelligence. So what came after that? So in that sense, the crazy thing is that the same architectures worked for various tasks. And all of a sudden, fields which seemed to be unrelated started to benefit from each other. So as I mentioned, it turns out that problems in speech recognition can be treated in a very similar way. You can essentially take speech, apply the Fourier transform, and then speech starts to look like an image. And you apply a similar object recognition network to recognize what the sounds are over there, like phonemes. Phonemes are like kinds of sounds, sort of like letters. And then you can turn it into text. So that's where it went, to speech after images. And then, yeah, the next big thing was translation. Translation was extremely surprising to people. That's the result by Ilya Sutskever. So translation is an example of another field that actually lived on its own. And one of the crazy things about translation is that the input is of variable length and the output is of variable length. And it was unclear even how to consume it with a neural network, how to handle variable-length input and variable-length output. And Ilya came up with an idea. There is something called a recurrent neural network. So, let's say, recurrent neural networks and convolutional neural networks share an idea, which is: you might want to use the same parameters if you are doing similar stuff.
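A minimal sketch of the recurrent idea elaborated in the next section: one small function, with one set of weights, is applied at every position of the text, the way a convolution reuses its weights at every position of an image. The sizes here are arbitrary, and the actual Sequence to Sequence work used LSTM cells rather than this plain recurrence:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed = 16, 8
W = rng.normal(size=(hidden, hidden)) * 0.1  # hidden -> hidden, reused at every step
U = rng.normal(size=(hidden, embed)) * 0.1   # word -> hidden, reused at every step

def step(h, x):
    # One recurrent step: combine the previous hidden state with the next word.
    return np.tanh(W @ h + U @ x)

h = np.zeros(hidden)
for word_vector in rng.normal(size=(5, embed)):  # a five-word "sentence"
    h = step(h, word_vector)                     # same function, same weights, every time
print(h.shape)  # a fixed-size summary, regardless of sentence length
```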


Applying And Learning From AI

Use cases for Neural Machine Translation (40:20)

And in the case of a convolutional network, it means: let's share the same parameters in space. So let's apply the same transformation to the middle of the image as to the corners, and so on. And in the case of a recurrent neural network, this would be reading text from left to right. I can consume the first word and create some hidden state representation. And then, at the next time step, when I'm consuming the next word, I can take it together with this hidden representation and generate the next hidden representation. And you are applying the same function over and over again; this function consumes a hidden representation and the next word, a hidden representation and a word, a hidden representation and a word. So it's relatively simple. The cool thing is, if you are doing it this way, regardless of the length of your input, you have the same size of network. And the way his model works, and that's described in a paper called Sequence to Sequence, is that you essentially consume, word by word, the sentence that you want to translate. And then, when you are about to generate the translation, you start emitting it word by word, and when you emit a dot at the end, that's the end. That's so cool. And it was quite surprising to people. At that time they got to decent performance; they were not able to beat phrase-based systems. But by now neural systems have been outperforming them for a long time already. And, yeah, there is one other issue that people have with neural network systems: in the case of translation, the problem with deploying it on a large scale is that it's quite computationally expensive. In the deep learning literature, there are various ideas for how to make things way, way cheaper computationally after you train the model. It's possible to throw away a large number of weights, or to essentially turn 32-bit floats into smaller-size numerics, and so on and so forth; a sketch of that idea appears below. And pretty much that's the reason why these things are not yet largely deployed in production systems out there. But neural network based solutions are actually outperforming anything else that is out there. There are a couple more things I would like to just define for a general listener. So there are a couple of words being thrown around a lot: narrow AI, general AI, and then superintelligence. Can you just break those apart? Sure. So pretty much all AI that we have out there is narrow AI. No one has built general AI so far. No one has built superintelligence. Narrow AI means artificial intelligence, a piece of software, that solves a single predefined problem. General AI is a piece of software that can solve a huge, vast number of problems, all the problems. So you can say that a human is generally intelligent, because you can give a human an arbitrary problem and the human can solve it. But, for instance, a bottle opener can solve only bottle opening. So pretty much when we look at any tools out there, at any software, our software is good at solving a single problem. For instance, our chess-playing programs cannot drive a car, and for any problem we have to create a separate piece of software. General artificial intelligence is software that could solve arbitrary problems. And how do we know that it's even doable? Because there is an example of a creature that has such a property. And then superintelligence is just, I assume, the next step. Yeah, essentially superintelligence means that it's more intelligent than a human.
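A sketch of one of those cheapening ideas, turning 32-bit floats into smaller numerics: a generic post-training int8 quantization of a weight vector, not the scheme of any particular production system:

```python
import numpy as np

def quantize_int8(w):
    # Map float32 weights onto 255 signed integer levels;
    # keep the scale factor so the values can be approximately restored.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)  # 4x less memory than float32
    return q, scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_restored = q.astype(np.float32) * scale
print(np.abs(w - w_restored).max())  # small rounding error, big memory saving
```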


General Intelligence, Super Intelligence, AI Hype (45:23)

Cool. So given all that, given that we're basically at a state of narrow AI across the board at this point, what's the current status of this stuff? Where do you see it going in the next five or so years? So as I mentioned, in machine learning there are various paradigms. One of them is supervised learning, there is something called unsupervised learning, and there is also something called reinforcement learning. And so far, the supervised learning paradigm is the only one that works so remarkably well that it's ready to be applied in business applications. All the others are not really there. And if you ask me where we are: we can solve these problems; the other problems require further work. It's very difficult to plan, even with ideas, how long it will take to make them work. The thing which is very different with contemporary artificial intelligence is that we are using precisely the same techniques across the board. Simultaneously, the majority of business problems can be framed as supervised learning, and therefore they can be solved with current techniques, as long as we have a sufficient number of examples of the input and of what we want to predict. And as I mentioned, the pairs can be extremely rich; the output might be a sentence, and current systems work pretty well with that. Nonetheless, it requires an expert to train it. And so then, given the pretty substantial hype we see, what do you think of it all? The field is simultaneously under-hyped and over-hyped. From the perspective of business applications, as long as you have pairs of examples, pairs that indicate a mapping (what's the input, what's the output), we can pretty often get to superhuman performance. But in all other settings, we are still not there, and it's unclear how long it will take. To give some examples: for recommendation systems, companies like Amazon have examples from millions of users, and they know what those users bought and whether they were happy or not. And that's an example of a task that is pretty good for a neural network: learning what to recommend to new users. Simultaneously, Google knows what a good search result is for you, because on the search result page you are clicking on the links that you are interested in, and therefore those should be displayed first. And in other fields, it's actually quite often more difficult. In the case of, let's say, an apple-picking robot, it's difficult to provide supervised data telling it how to move an arm towards the apple. Therefore, that's way more complicated. At the same time, the problem of detecting where the apple is, is better defined: it can be outsourced to humans to annotate plenty of images and give the localization of the apple, and quite often the rest of the problem can be scripted by an engineer. But the problem of how to place fingers on an apple, or how to grip it, is not scientifically well solved. And so we have a couple of questions then at this point. If people were interested in learning more about AI, and maybe working with OpenAI or doing something, how would you recommend they get involved and educate themselves? So a good place to start is Coursera. Coursera is pretty good.


Frameworks and Tutorials (50:23)

There are also a lot of TensorFlow tutorials. TensorFlow is an example of a framework to train neural networks. Also, Andrej Karpathy's class at Stanford is extremely accessible. You can find it, I believe, on YouTube. Yeah. And then in terms of actual exercises? So in the case of Andrej's class, there might be homework. And in the case of the TensorFlow tutorials, it's quite often easy to come up with some task of your own after, let's say, reading them. The simple task over there is: let's classify pictures of digits, let's assign them classes. You can maybe try to download some images from some other source, like Flickr, and try to classify them towards their tags, along the lines of the sketch below.
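Here is a minimal digit classifier in the spirit of the TensorFlow tutorials just mentioned; the layer sizes and epoch count are arbitrary choices for illustration:

```python
import tensorflow as tf

# MNIST: 28x28 grayscale pictures of digits, classes 0-9.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # image -> vector
    tf.keras.layers.Dense(128, activation="relu"),    # one hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one score per digit class
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)
print(model.evaluate(x_test, y_test))  # [loss, accuracy] on held-out test data
```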


The Economic Future of Human Jobs in Robotics (51:58)

OK, so given that you guys are working with robots at this point, one of the other things that's thrown in part and parcel with AI is automation, specifically of a lot of these low-level blue-collar jobs. What do you think about the future, maybe the next 10 years, of those jobs? So I believe that we'll have to offer people basic income. I super strongly believe that that's the only way. I don't think that it will be possible for a 40-year-old taxi driver to reinvent himself every 10 years. I think it might be extremely hard. The other crazy thing is that people define themselves through their jobs. And that might be another big social problem.


Job Issues (52:52)

And simultaneously, they might not even like their jobs. Like, if you ask someone, would you like your kid to be a seller in a supermarket, they would answer no. And maybe it's possible to live in a world where there is an abundance of resources and people can just enjoy their life. I think we're going to have to figure out a way. I mean, maybe people will always find purpose, but I think making it easier to find that purpose will become much more important in the future if automation actually happens to the degree people talk about. And what about influences on you that maybe have inspired you to work with robotics and in AI? Are there any books or films or any media that you really enjoyed? There's a pretty good book called Homo Deus. It actually describes the history of humans and then has various predictions about the future and where we are heading. That one is pretty good. I mean, nowadays there are plenty of movies about AI and how it can go wrong. What's the best one? I think Her is pretty good.


Influences In Pop Culture

Movies, Media, Books, and Inspiration (54:27)

Okay. Yeah, Ex Machina is also pretty good. Cool. All right. Do you have any other last things you want to address? Oh, I think no. Thank you. Okay, cool. Thanks, man.

