MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention
Transcription for the video titled "MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention".
Hello everyone, and I hope you enjoyed Alexander's first lecture. I'm Ava, and in this second lecture we're going to focus on the question of sequence modeling: how we can build neural networks that can handle and learn from sequential data. In Alexander's first lecture, he introduced the essentials of neural networks, starting with perceptrons, building up to feed-forward models, and how you can actually train these models and start to think about deploying them. Now we're going to turn our attention to specific types of problems that involve sequential processing of data, and we'll realize why these types of problems require a different way of implementing and building neural networks from what we've seen so far. I think some of the components in this lecture can traditionally be a bit confusing or daunting at first, but what I really want to do is build this understanding up from the foundations, walking through step by step, developing intuition, all the way to understanding the math and the operations behind how these networks operate. Okay, so let's get started. To begin, I first want to motivate what exactly we mean when we talk about sequential data or sequential modeling. So we're going to begin with a really simple, intuitive example. Let's say we have this picture of a ball, and your task is to predict where this ball is going to travel next. Now, if you don't have any prior information about the trajectory of the ball, its motion, its history, any guess or prediction about its next position is going to be exactly that, a random guess. If, however, in addition to the current location of the ball, I gave you some information about where it was moving in the past, now the problem becomes much easier.
And I think hopefully we can all agree that most likely, our most likely next prediction is that this ball is going to move forward to the right in the next frame. So this is a really reduced down, bare bones, intuitive example. But the truth is that beyond this, sequential data is really all around us. As I'm speaking, the words coming out of my mouth form a sequence of sound waves that define audio, which we can split up to think about in this sequential manner. Similarly, text, language, can be split up into a sequence of characters or a sequence of words. And there are many, many, many more examples in which sequential processing, sequential data is present, right? From medical signals like EKGs, to financial markets and projecting stock prices, to biological sequences encoded in DNA, to patterns in the climate, to patterns of motion, and many more.
Comprehensive Guide To Recurrent Neural Networks And Attention Mechanisms
Sequence modeling (03:07)
And so already, hopefully, you're getting a sense of what these types of questions and problems may look like and where they are relevant in the real world. When we consider applications of sequential modeling in the real world, we can think about a number of different kinds of problem definitions that we can have in our arsenal and work with. In the first lecture, Alexander introduced the notions of classification and regression, where we learned about feedforward models that can operate one-to-one in this fixed and static setting, right? Given a single input, predict a single output. The binary classification example was: will you succeed or pass this class? And here, there's no notion of sequence, there's no notion of time. Now if we introduce this idea of a sequential component, we can handle inputs that may be defined temporally and potentially also produce a sequential or temporal output. So as one example, we can consider text, language, and maybe we want to generate one prediction given a sequence of text, classifying whether a message has a positive sentiment or a negative sentiment. Conversely, we could have a single input, let's say an image, and our goal may be now to generate text or a sequential description of this image, right? Given this image of a baseball player throwing a ball, can we build a neural network that generates that as a language caption? Finally, we can also consider applications and problems where we have sequence in, sequence out. For example, we may want to translate between two languages, and indeed this type of thinking and this type of architecture is what powers the task of machine translation in your phones, in Google Translate, and many other examples. So hopefully, right, this has given you a picture of what sequential data looks like and what these types of problem definitions may look like.
And from this, we're going to start and build up our understanding of what neural networks we can build and train for these types of problems.
Neurons with recurrence (05:09)
So first we're going to begin with the notion of recurrence and build up from that to define recurrent neural networks. And in the last portion of the lecture we'll talk about the mechanisms underlying the transformer architectures that are very, very powerful in terms of handling sequential data. But as I said at the beginning, the theme of this lecture is building up that understanding step by step, starting with the fundamentals and the intuition. So to do that we're going to go back, revisit the perceptron, and move forward from there. Right, so as Alexander introduced when we studied the perceptron in lecture one, the perceptron is defined by this single neural operation, where we have some set of inputs, let's say x1 through xm, and each of these numbers is multiplied by a corresponding weight, summed, and passed through a nonlinear activation function that then generates a predicted output y hat. Here we can have multiple inputs coming in to generate our output, but still these inputs are not thought of as points in a sequence or time steps in a sequence. Even if we scale this perceptron and start to stack multiple perceptrons together to define these feed-forward neural networks, we still don't have this notion of temporal processing or sequential information, even though we are able to take multiple inputs, apply these weight operations, and apply this nonlinearity to then define multiple predicted outputs. So taking a look at this diagram, right, on the left in blue you have inputs, on the right in purple you have these outputs, and the green defines the single neural network layer that's transforming these inputs to the outputs. Next step, I'm going to just simplify this diagram. I'm going to collapse down those stacked perceptrons together and depict this with this green block. Still it's the same operation going on, right? We have an input vector being transformed to predict this output vector.
Now what I've introduced here, which you may notice, is this new variable, t, right, which I'm using to denote a single time step. We are considering an input at a single time step and using our neural network to generate a single output corresponding to that input. How could we start to extend and build off this to now think about multiple time steps and how we could potentially process a sequence of information? Well, what if we took this diagram (all I've done is just rotated it 90 degrees, so we still have this input vector being fed in and producing an output vector), and what if we make a copy of this network, right, and just do this operation multiple times to try to handle inputs that are fed in corresponding to different times? We have an individual time step starting with t0, and we can do the same thing, the same operation, for the next time step, again treating that as an isolated instance, and keep doing this repeatedly. And what you'll notice, hopefully, is that all these models are simply copies of each other, just with different inputs at each of these different time steps. And we can make this concrete, right, in terms of what this functional transformation is doing. The predicted output at a particular time step, y hat of t, is a function of the input at that time step, x of t, and that function is what is learned and defined by our neural network weights. Okay, so I've told you that our goal here is trying to understand sequential data and do sequential modeling, but what can be the issue with what this diagram is showing and what I've shown you here? Well, yeah, go ahead. Exactly, that's exactly right. So the student's answer was that x1 could be related to x0, so you have this temporal dependence, but these isolated replicas don't capture that at all. And that answers the question perfectly, right?
Here, a predicted output at a later time step could depend precisely on inputs at previous time steps, if this is truly a sequential problem with this temporal dependence. So how can we start to reason about this? How can we define a relation that links the network's computations at a particular time step to prior history and memory from previous time steps? Well, what if we did exactly that, right? What if we simply linked the computation and the information understood by the network to these other replicas via what we call a recurrence relation? What this means is that something about what the network is computing at a particular time is passed on to those later time steps. And we define that according to this variable, h, which we call the internal state, or you can think of it as a memory term, that's maintained by the neurons and the network, and it's this state that's being passed from time step to time step as we read in and process this sequential information. What this means is that the network's output, its predictions, its computations, are not only a function of the input data x, but also of this other variable h, which captures this notion of state, this notion of memory that's being computed by the network and passed on over time. Specifically, right, to walk through this, our predicted output, y hat of t, depends not only on the input at that time step, but also on this past memory, this past state. And it is this linkage of temporal dependence and recurrence that defines this idea of a recurrent neural unit. What I've shown is this connection being unrolled over time, but we could also depict this relationship as a loop.
This internal state variable h is being iteratively updated over time, and that's fed back into the neuron's computation in this recurrence relation. This is how we define these recurrent cells that comprise recurrent neural networks, or RNNs, and the key here is that we have this idea of a recurrence relation that captures the cyclic temporal dependency.
Recurrent neural networks (12:05)
And indeed, it's this idea that is really the intuitive foundation behind recurrent neural networks, or RNNs. And so let's continue to build up our understanding from here and move forward into how we can actually define the RNN operations mathematically and in code. So all we're going to do is formalize this relationship a little bit more. The key idea here is that the RNN is maintaining the state and it's updating the state at each of these time steps as the sequence is processed. We define this by applying this recurrence relation, and what the recurrence relation captures is how we're actually updating that internal state h of t. Specifically, that state update is exactly like any other neural network operation that we've introduced so far, where again we're learning a function defined by a set of weights W. We're using that function to update the cell state h of t, and the additional component, the newness here, is that that function depends both on the input and on the state from the prior time step, h of t minus 1. And what you'll note is that this function f sub W is defined by a set of weights, and it's the same set of weights, the same set of parameters, that is used from time step to time step as the recurrent neural network processes this temporal information, this sequential data. OK, so the key idea here, which hopefully is coming through, is that this RNN state update operation takes this state and updates it at each time step as a sequence is processed.
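Written out, the recurrence relation described in this section can be stated compactly. The symbols follow the spoken description (h for the state, x for the input, f sub W for the function defined by the weights); the exact typography is an assumption, since the slides are not part of this transcript.

```latex
h_t = f_W\left(x_t,\; h_{t-1}\right)
```

Here h_t is the state at time step t, x_t is the input at that step, h_{t-1} is the state from the previous step, and the same weights W are reused at every step.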
RNN intuition (13:47)
We can also translate this to how we can think about implementing RNNs in Python code, or rather pseudocode, hopefully getting a better understanding and intuition behind how these networks work. So what we do is we just start by defining an RNN. For now, this is abstracted away. And we start, we initialize its hidden state, and we have some sentence, right? Let's say this is our input of interest, where we're interested in predicting maybe the next word that's occurring in this sentence. What we can do is loop through these individual words in the sentence that define our temporal input, and at each step as we're looping through, each word in that sentence is fed into the RNN model along with the previous hidden state, and this is what generates a prediction for the next word and updates the RNN state in turn. Finally, our prediction for the final word in the sentence, the word that we're missing, is simply the RNN's output after all the prior words have been fed in through the model. So this is really breaking down how the RNN works, how it's processing the sequential information.
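The loop described above might be sketched as follows. This is only an illustration: `ToyRNN` is a hypothetical stand-in with no learned weights, the four-word sentence is an invented example, and the numeric "prediction" replaces the word a trained model would output.

```python
import math


class ToyRNN:
    """Hypothetical stand-in for a trained RNN; it has no learned weights."""

    def __call__(self, word, hidden_state):
        # Crude numeric encoding of the word (its length); a real model
        # would use an embedding instead.
        x = float(len(word))
        # The new state depends on BOTH the current input and the previous state.
        new_state = [math.tanh(0.5 * h + 0.1 * x) for h in hidden_state]
        # The "prediction" here is just a score derived from the new state.
        prediction = sum(new_state)
        return prediction, new_state


my_rnn = ToyRNN()
hidden_state = [0.0, 0.0, 0.0, 0.0]  # initialize the hidden state
sentence = ["I", "love", "recurrent", "neural"]

# Feed each word in, together with the previous hidden state.
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# The output after the last word is the prediction for the missing next word.
next_word_prediction = prediction
```

The point of the sketch is the shape of the loop: one model, called once per word, with the hidden state threaded from each call to the next.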
Unfolding RNNs (15:03)
And what you've noticed is that the RNN computation includes both this update to the hidden state as well as generating some predicted output at the end. That is our ultimate goal that we're interested in. And so to walk through this, how we're actually generating the output prediction itself, what the RNN computes is given some input vector, it then performs this update to the hidden state and this update to the hidden state is just a standard neural network operation just like we saw in the first lecture where it consists of taking a weight matrix, multiplying that by the previous hidden state, taking another weight matrix, multiplying that by the input at a time step, and applying a non-linearity. And in this case, right, because we have these two input streams, the input data X of T and the previous state H, we have these two separate weight matrices that the network is learning over the course of its training. That comes together, we apply the non-linearity and then we can generate an output at a given time step by just modifying the hidden state using a separate weight matrix to update this value and then generate a predicted output. And that's what there is to it, right? That's how the RNN in its single operation updates both the hidden state and also generates a predicted output. Okay, so now this gives you the internal working of how the RNN computation occurs at a particular time step. Let's next think about how this looks like over time and define the computational graph of the RNN as being unrolled or expanded across time. So, so far the dominant way I've been showing the RNNs is according to this loop-like diagram on the left, right? Feeding back in on itself. Another way we can visualize and think about RNNs is as kind of unrolling this recurrence over time, over the individual time steps in our sequence. 
What this means is that we can take the network at our first time step and continue to iteratively unroll it across the time steps going forward, all the way until we process all the time steps in our input. Now we can formalize this diagram a little bit more by defining the weight matrices that connect the inputs to the hidden state update, the weight matrices that are used to update the internal state across time, and finally the weight matrices that define the update to generate a predicted output. Now recall that in all these cases, right, for all three of these weight matrices, at all these time steps, we are simply reusing the same weight matrices. So it's one set of parameters, one set of weight matrices, that just processes this information sequentially. Now you may be thinking, okay, so how do we actually start to think about how to train the RNN, how to define the loss, given that we have this temporal processing and this temporal dependence? Well, a prediction at an individual time step will simply amount to a computed loss at that particular time step. So now we can compare those predictions time step by time step to the true label and generate a loss value for each of those time steps. And finally, we can get our total loss by taking all these individual loss terms together and summing them, defining the total loss for a particular input to the RNN.
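The state update, the output, and the summed loss described in this section can be written out as below. This is a sketch under assumptions: the lecture only says "a non-linearity", so tanh is used here as a common choice, and the names W_hh, W_xh, and W_hy for the three weight matrices are illustrative labels.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) \\
\hat{y}_t &= W_{hy}\, h_t \\
L &= \sum_{t} L_t\left(\hat{y}_t,\; y_t\right)
\end{aligned}
```

The same three matrices appear at every time step t, which is the parameter reuse described above, and the total loss L is the sum of the per-step losses L_t.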
RNNs from scratch (18:57)
Now we can walk through an example of how we implement this RNN in TensorFlow, starting from scratch. The RNN can be defined as a layer operation, using the layer class that Alexander introduced in the first lecture. And so we can define it according to an initialization of the weight matrices and an initialization of a hidden state, which commonly amounts to initializing the hidden state to zero. Next, we can define how we actually pass forward through the RNN to process a given input x, and what you'll notice is that in this forward operation, the computations are exactly like we just walked through. We first update the hidden state according to that equation we introduced earlier, and then generate a predicted output that is a transformed version of that hidden state. Finally, at each time step, we return both the output and the updated hidden state, as this is what needs to be stored to continue this RNN operation over time. What is very convenient is that, although you could define your RNN layer completely from scratch, TensorFlow abstracts this operation away for you, so you can simply define a simple RNN according to this call that you're seeing here, which makes all the computations very efficient and very easy. And you'll actually get practice implementing and working with RNNs in today's software lab. Okay, so that gives us the understanding of RNNs, and going back to what I described as the problem setups or the problem definitions at the beginning of this lecture, I just want to remind you of the types of sequence modeling problems to which we can apply RNNs, right? We can think about taking a sequence of inputs and producing one predicted output at the end of the sequence. We can think about taking a static single input and trying to generate text according to that single input.
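To make the from-scratch structure concrete without depending on TensorFlow, here is a minimal pure-Python sketch of the same steps: initialize weights and a zero hidden state, update the state, then produce an output. The class name `MyRNNCell`, the `matvec` helper, the tanh non-linearity, and the small random weight initialization are all assumptions for illustration, not the code shown in the lecture.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class MyRNNCell:
    """Illustrative RNN cell; names and initialization are assumptions."""

    def __init__(self, rnn_units, input_dim, output_dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # Three weight matrices, reused at every time step.
        self.W_xh = rand_matrix(rnn_units, input_dim)
        self.W_hh = rand_matrix(rnn_units, rnn_units)
        self.W_hy = rand_matrix(output_dim, rnn_units)
        # Hidden state starts at zero.
        self.h = [0.0] * rnn_units

    def call(self, x):
        # Update the hidden state from the previous state and the current input.
        from_state = matvec(self.W_hh, self.h)
        from_input = matvec(self.W_xh, x)
        self.h = [math.tanh(a + b) for a, b in zip(from_state, from_input)]
        # The output is a transformed version of the hidden state.
        output = matvec(self.W_hy, self.h)
        # Return both the output and the updated hidden state.
        return output, self.h


cell = MyRNNCell(rnn_units=3, input_dim=2, output_dim=2)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [cell.call(x)[0] for x in sequence]
```

Because the state is stored on the cell, calling it on the same inputs in a different order leaves it in a different final state, which is the order sensitivity a feed-forward layer lacks.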
And finally we can think about taking a sequence of inputs, producing a prediction at every time step in that sequence, and then doing this sequence-to-sequence type of prediction and translation. Okay, so this will be the foundation for the software lab today, which will focus on this problem of many-to-many processing and many-to-many sequential modeling: taking a sequence, going to a sequence. What is common and universal across all these types of problems and tasks that we may want to consider with RNNs is what I like to think of as the design criteria we need to meet to build a robust and reliable network for processing these sequential modeling problems.
Design criteria for sequential modeling (21:50)
What I mean by that is what are the characteristics, what are the design requirements that the RNN needs to fulfill in order to be able to handle sequential data effectively? The first is that sequences can be of different lengths. They may be short, they may be long. We want our RNN model or our neural network model in general to be able to handle sequences of variable lengths. Secondly, and really importantly, is as we were discussing earlier, that the whole point of thinking about things through the lens of sequence is to try to track and learn dependencies in the data that are related over time. So our model really needs to be able to handle those different dependencies, which may occur at times that are very, very distant from each other. Next, sequence is all about order. There's some notion of how current inputs depend on prior inputs, and the specific order of observations we see makes a big effect on what prediction we may want to generate at the end. And finally, in order to be able to process this information effectively, our network needs to be able to do what we call parameter sharing, meaning that given one set of weights, that set of weights should be able to apply to different time steps in the sequence and still result in a meaningful prediction. And so today we're going to focus on how recurrent neural networks meet these design criteria and how these design criteria motivate the need for even more powerful architectures that can outperform RNNs in sequence modeling.
Word prediction example (23:45)
So to understand these criteria very concretely, we're going to consider a sequence modeling problem where, given some series of words, our task is just to predict the next word in that sentence. So let's say we have this sentence, this morning I took my cat for a walk. And our task is to predict the last word in the sentence given the prior words. This morning I took my cat for a blank. Our goal is to take our RNN, define it, and put it to test on this task. What is our first step to doing this? Well, the very, very first step before we even think about defining the RNN is how we can actually represent this information to the network in a way that it can process and understand. If we have a model that is processing this data, processing this text-based data, and wanting to generate text as the output, our problem can arise in that the neural network itself is not equipped to handle language explicitly, right? Remember that neural networks are simply functional operators, they're just mathematical operations, and so we can't expect it, right, it doesn't have an understanding from the start of what a word is or what language means. Which means that we need a way to represent language numerically so that it can be passed in to the network to process. So what we do is that we need to define a way to translate this text, this language information into a numerical encoding, a vector, an array of numbers that can then be fed in to our neural network and generating a vector of numbers as its output. So now, right, this raises the question of how do we actually define this transformation? How can we transform language into this numerical encoding? The key solution and the key way that a lot of these networks work is this notion and concept of embedding. What that means is it's some transformation that takes indices or something that can be represented as an index into a numerical vector of a given size. 
So if we think about how this idea of embedding works for language data, let's consider a vocabulary of words that we can possibly have in our language. And our goal is to be able to map these individual words in our vocabulary to a numerical vector of fixed size. One way we could do this is by defining all the possible words that could occur in this vocabulary and then indexing them, assigning an index label to each of these distinct words. "A" corresponds to index one, "cat" corresponds to index two, and so on and so forth. And this indexing maps these individual words to numbers, unique indices. What these indices can then define is what we call an embedding vector, which is a fixed-length encoding where we've simply indicated a value of one at the index for that word when we observe that word. And this is called a one-hot embedding, where we have this fixed-length vector of the size of our vocabulary, and each word in that vocabulary corresponds to a one at its index. This is a very sparse way to do this, and it's based purely on the index. There's no notion of semantic information, of meaning, that's captured in this vector-based encoding. Alternatively, what is very commonly done is to actually use a neural network to learn an encoding, to learn an embedding. And the goal here is that we can learn a neural network that captures some inherent meaning or inherent semantics in our input data and maps related words or related inputs closer together in this embedding space, meaning that they'll have numerical vectors that are more similar to each other. This concept is really, really foundational to how these sequence modeling networks work and how neural networks work in general. Okay, so with that in hand we can go back to our design criteria, thinking about the capabilities that we desire. First, we need to be able to handle variable-length sequences.
If we again want to predict the next word in the sequence, we can have short sequences, we can have long sequences, we can have even longer sentences, and our key task is that we want to be able to track dependencies across all these different lengths. And what we mean by dependencies is that there could be information very, very early on in a sequence that may not be relevant or come up until much later in the sequence. And we need to be able to track these dependencies and maintain this information in our network. Dependencies relate to order, and sequences are defined by their order, and we know that the same words in a completely different order have completely different meanings, right? So our model needs to be able to handle these differences in order and the differences in length that could result in different predicted outputs. Okay, so hopefully that example, going through the example in text, motivates how we can think about transforming input data into a numerical encoding that can be passed into the RNN, and also what the key criteria are that we want to meet in handling these types of problems. So far we've painted the picture of RNNs: how they work, the intuition, their mathematical operations, and the key criteria that they need to meet.
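The indexing and one-hot encoding described in this section can be sketched in a few lines. The vocabulary here is built from the example sentence only, and indices start at 0 rather than 1 as in the spoken example; both are choices made for this illustration. The embedding table at the end is randomly initialized and untrained, so it only shows the lookup mechanics, not learned semantics.

```python
import random

sentence = "this morning i took my cat for a walk".split()

# Build a vocabulary and assign each distinct word a unique index.
vocabulary = sorted(set(sentence))
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Fixed-length vector of vocabulary size with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


encoded_sentence = [one_hot(word) for word in sentence]

# A learned embedding replaces the sparse vector with a dense row of a table.
# This table is random (untrained); training would adjust its rows so that
# related words end up with similar vectors.
rng = random.Random(0)
embedding_dim = 3
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in vocabulary]


def embed(word):
    """Look up the dense vector for a word by its index."""
    return embedding_table[word_to_index[word]]
```

Note that the one-hot vectors grow with the vocabulary size, while the dense embedding keeps a fixed, typically much smaller, dimension.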
Backpropagation through time (29:57)
The final piece to this is how we actually train and learn the weights in the RNN. And that's done through the backpropagation algorithm, with a bit of a twist to handle sequential information. If we go back and think about how we train feedforward neural network models, the steps break down as follows: we start with an input, and we first take this input and make a forward pass through the network, going from input to output. The key to backpropagation that Alexander introduced was this idea of taking the prediction and backpropagating gradients back through the network, and using this operation to compute the gradient of the loss with respect to each of the parameters in the network, in order to gradually adjust the parameters, the weights of the network, to minimize the overall loss. Now with RNNs, as we walked through earlier, we had this temporal unrolling, which means that we have these individual losses across the individual steps in our sequence that sum together to comprise the overall loss. What this means is that when we do backpropagation, we have to now, instead of backpropagating errors through a single network, backpropagate the loss through each of these individual time steps. And after we backpropagate the loss through each of the individual time steps, we then do that across all time steps, all the way from our current time, time t, back to the beginning of the sequence. And this is why this algorithm is called backpropagation through time, right? Because as you can see, the data and the predictions and the resulting errors are fed back in time, all the way from where we are currently to the very beginning of the input data sequence. So backpropagation through time is actually a very tricky algorithm to implement in practice.
And the reason for this is if we take a close look, looking at how gradients flow across the RNN, what this algorithm involves is many, many repeated computations and multiplications of these weight matrices repeatedly against each other. In order to compute the gradient with respect to the very first time step, we have to make many of these multiplicative repeats of the weight matrix.
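The repeated multiplication can be made explicit with the chain rule. The expression below is a sketch under assumptions: it takes the state update to be h_t = tanh(W_hh h_{t-1} + W_xh x_t), with tanh and the matrix names chosen for illustration rather than taken from the lecture slides.

```latex
\frac{\partial L_t}{\partial h_0}
  = \frac{\partial L_t}{\partial h_t}\,
    \frac{\partial h_t}{\partial h_{t-1}}\,
    \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots
    \frac{\partial h_1}{\partial h_0},
\qquad
\frac{\partial h_k}{\partial h_{k-1}}
  = \operatorname{diag}\!\left(1 - h_k \odot h_k\right) W_{hh}
```

Each factor contains one copy of W_hh and one derivative of the activation, so reaching back t steps multiplies t such factors together.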
Gradient issues (32:25)
Why might this be problematic? Well, if this weight matrix W is very, very big, what this can result in is what we call the exploding gradient problem, where the gradients that we're trying to use to optimize our network do exactly that. They blow up, they explode, and they get really big, which makes it infeasible to train the network stably. What we do to mitigate this is a pretty simple solution called gradient clipping, which effectively scales back these very big gradients to try to constrain them in a more restricted way. Conversely, we can have the instance where the weight matrices are very, very small, and if these weight matrices are very, very small, we end up with a very, very small value at the end as a result of these repeated weight matrix computations and these repeated multiplications. And this is a very real problem in RNNs in particular, where we can fall into this funnel called a vanishing gradient, where now your gradient has just dropped down close to zero, and again, you can't train the network stably. Now, there are particular tools that we can implement to try to mitigate this vanishing gradient problem, and we'll touch on each of these three solutions briefly: how we can define the activation function in our network, how we can initialize the weights, and how we can change the network architecture itself to try to better handle this vanishing gradient problem. Before we do that, I want to take just one step back to give you a little more intuition about why vanishing gradients can be a real issue for recurrent neural networks. The point I've kept trying to reiterate is this notion of dependency in the sequential data and what it means to track those dependencies.
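Both failure modes, and the clipping fix, can be illustrated with a small sketch. It uses a single scalar weight as a stand-in for the weight matrix, and it clips by rescaling to a maximum L2 norm, which is one common form of gradient clipping; the lecture does not say which variant it has in mind. The function names are hypothetical.

```python
import math


def backprop_factor(weight, steps):
    """Scalar stand-in for multiplying by the same recurrent weight `steps` times."""
    factor = 1.0
    for _ in range(steps):
        factor *= weight
    return factor


def clip_by_norm(gradient, max_norm):
    """Rescale `gradient` so its L2 norm is at most `max_norm`; direction is kept."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


exploding = backprop_factor(1.5, 50)  # weight above 1: grows very large
vanishing = backprop_factor(0.5, 50)  # weight below 1: shrinks toward zero

# A gradient with norm 50 is scaled back to norm 5, pointing the same way.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
```

Clipping addresses only the exploding case; a gradient that has already shrunk toward zero cannot be recovered by rescaling, which is why the vanishing case needs the other remedies discussed next.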
Well, if the dependencies are very constrained in a small space, not separated out that much by time, this repeated gradient computation and the repeated weight matrix multiplication is not so much of a problem. If we have a very short sequence where the words are very closely related to each other, and it's pretty obvious what our next output is going to be, the RNN can use the immediately past information to make a prediction. And so it is not that demanding to learn effective weights if the related information is close together temporally. Conversely now, if we have a sentence where we have a more long-term dependency, what this means is that we need information from way further back in the sequence to make our prediction at the end, and that gap between what's relevant and where we currently are becomes exceedingly large, and therefore the vanishing gradient problem is increasingly exacerbated. The RNN becomes unable to connect the dots and establish this long-term dependency, all because of this vanishing gradient issue. So the modifications that we can make to our network to try to alleviate this problem are threefold. The first is that we can simply change the activation functions in each of our neural network layers so that they safeguard against shrinking the gradients in instances where the data is greater than zero. This is in particular true for the ReLU activation function, and the reason is that in all instances where x is greater than zero, the derivative of the ReLU function is one. And so that is not less than one, and therefore it helps in mitigating the vanishing gradient problem.
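The effect of the activation derivative can be checked numerically. The sketch below multiplies the derivative of each activation over 20 steps at the same positive pre-activation value; the comparison against tanh and sigmoid, the value 2.0, and the step count are choices made for this illustration rather than figures from the lecture. Note that the ReLU derivative is 0 for inputs at or below zero, so the benefit applies only where x is greater than zero, as stated above.

```python
import math


def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2, always at most 1."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s * (1 - s), always at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def product_of_derivatives(derivative, pre_activations):
    """Multiply the activation derivative across a sequence of time steps."""
    product = 1.0
    for z in pre_activations:
        product *= derivative(z)
    return product


pre_activations = [2.0] * 20  # the same positive value at 20 time steps

relu_product = product_of_derivatives(relu_derivative, pre_activations)
tanh_product = product_of_derivatives(tanh_derivative, pre_activations)
sigmoid_product = product_of_derivatives(sigmoid_derivative, pre_activations)
```

With ReLU the product stays at exactly one for positive inputs, while the saturating activations contribute a factor below one at every step, so their products collapse toward zero.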
Another trick is how we initialize the parameters in the network themselves to prevent them from shrinking to zero too rapidly. There are mathematical ways that we can do this, namely by initializing our weights to identity matrices, and this effectively helps in practice to prevent the weight updates from shrinking too rapidly to zero. However, the most robust solution to the vanishing gradient problem is to introduce a slightly more complicated version of the recurrent neural unit that can more effectively track and handle long-term dependencies in the data.
Long short term memory (LSTM) (37:03)
And this is this idea of gating. The idea is to selectively control the flow of information into the neural unit, to be able to filter out what's not important while maintaining what is important. The key and most popular type of recurrent unit that achieves this gated computation is called the LSTM, or long short-term memory network. Today, we're not going to go into detail on LSTMs, their mathematical details, their operations, and so on. But I just want to convey the key idea, an intuitive idea, about why these LSTMs are effective at tracking long-term dependencies. The core is that the LSTM is able to control the flow of information through these gates to more effectively filter out the unimportant things and store the important things. You can implement LSTMs in TensorFlow just as you would an RNN, but the core concept that I want you to take away when thinking about the LSTM is this idea of controlled information flow through gates. Very briefly, the way the LSTM operates is by maintaining a cell state, just like a standard RNN, and that cell state is independent from what is directly outputted. The cell state is updated according to these gates that control the flow of information: forgetting and eliminating what is irrelevant, storing the information that is relevant, updating the cell state in turn, and then filtering this updated cell state to produce the predicted output, just like the standard RNN. And again, we can train the LSTM using the backpropagation through time algorithm, but the mathematics of how the LSTM is defined allows for a completely uninterrupted flow of the gradients, which largely eliminates the vanishing gradient problem that I introduced earlier. 
Again, if you're interested in learning more about the mathematics and the details of LSTMs, please come and discuss with us after the lectures. But again, I'm just emphasizing the core concept and the intuition behind how the LSTM operates.
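The forget, store, update, and output steps described above can be sketched as one time step of a scalar LSTM cell. The lecture deliberately skips the mathematics, so the equations below are the commonly used LSTM formulation, taken as an assumption here. The function `lstm_step`, the `params` layout, and all weight values are hypothetical and chosen by hand for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (standard formulation, assumed).

    `params` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))  # how much old cell state to keep
    store = sigmoid(pre_activation("input"))    # how much new information to store
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content
    output = sigmoid(pre_activation("output"))  # how much of the cell to expose

    c_new = forget * c_prev + store * candidate  # update the cell state
    h_new = output * math.tanh(c_new)            # filtered output / hidden state
    return h_new, c_new


# Hypothetical weights: a strongly positive forget-gate bias keeps the cell
# state, and a strongly negative input-gate bias blocks new input.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.3, -0.8, 0.5, 0.1]:
    h, c = lstm_step(x, h, c, params)
```

With these gate settings the cell state stays very close to its initial value of 1.0 across all four steps, whatever the inputs are. The additive update `forget * c_prev + store * candidate` is the path along which information, and during training the gradient, can pass from step to step largely unchanged.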
RNN applications (39:50)
Okay, so so far we've covered a lot of ground. We've gone through the fundamental workings of RNNs, the architecture, the training, the type of problems that they've been applied to. And I'd like to close this part by considering some concrete examples of how you're going to use RNNs in your software lab. And that is going to be in the task of music generation, where you're going to work to build an RNN that can predict the next musical note in a sequence and use it to generate brand new musical sequences that have never been realized before. So to give you an example of just the quality and type of output that you can try to aim towards, a few years ago there was a work that trained an RNN on a corpus of classical music data. And famously, there is this composer, Schubert, who wrote a famous unfinished symphony that consisted of two movements. But he was unable to finish his symphony before he died. So he died, and he left the third movement unfinished. So a few years ago a group trained an RNN-based model to actually try to generate the third movement to Schubert's famous unfinished symphony given the prior two movements. So I'm going to play the result right now. I paused it, I interrupted it quite abruptly there, but if there are any classical music aficionados out there, hopefully you get an appreciation for the quality of the music that was generated. And this was already from a few years ago, and as we'll see in the next lectures, continuing with this theme of generative AI, the power of these algorithms has advanced tremendously since we first played this example, in a whole range of domains, which I'm excited to talk about, but not for now. Okay, so you'll tackle this problem head on in today's lab, RNN Music Generation. 
Also, we can think about the simple example of an input sequence mapped to a single output with sentiment classification, where we can take, for example, text like tweets and assign positive or negative labels to these text examples based on the content that is learned by the network. Okay. So this concludes the portion on RNNs. And I think it's quite remarkable that using all the foundational concepts and operations that we've talked about so far, we've been able to build up networks that handle this complex problem of sequential modeling. But like any technology, right, an RNN is not without limitations. So what are some of those limitations, and what are some potential issues that can arise with using RNNs or even LSTMs? The first is this idea of encoding and dependency in terms of the temporal separation of the data that we're trying to process. What RNNs require is that the sequential information is fed in and processed time step by time step. What that imposes is what we call an encoding bottleneck, right? We're trying to encode a lot of content, for example a very large body of text, many different words, into a single output that may be just at the very last time step. How do we ensure that all that information leading up to that time step was properly maintained and encoded and learned by the network? In practice, this is very, very challenging, and a lot of information can be lost. Another limitation is that by doing this time step by time step processing, RNNs can be quite slow. There is not really an easy way to parallelize that computation. And finally, together these components, the encoding bottleneck and the requirement to process the data step by step, impose the biggest problem, which is that when we talk about long memory, the capacity of the RNN and the LSTM is really not that long. 
We can't really handle sequences of tens of thousands or hundreds of thousands of time steps, or even beyond, effectively enough to learn the complete amount of information and patterns that are present within such a rich data source. And so because of this, very recently there's been a lot of attention on how we can move beyond this notion of step-by-step recurrent processing to build even more powerful architectures for processing sequential data.
Introduction To Attention And Its Role In Machine Learning
Attention fundamentals (44:50)
To understand how we can start to do this, let's take a big step back, right? Think about the high-level goal of sequence modeling that I introduced at the very beginning. Given some input, a sequence of data, we want to build a feature encoding, use our neural network to learn that, and then transform that feature encoding into a predicted output. What we saw is that RNNs use this notion of recurrence to maintain order information, processing information time step by time step. But as I just mentioned, we have these three key bottlenecks to RNNs. What we really want to achieve is to go beyond these bottlenecks and achieve even higher capabilities in terms of the power of these models. Rather than having an encoding bottleneck, ideally we want to process information continuously, as a continuous stream of information. Rather than being slow, we want to be able to parallelize computations to speed up processing. And finally, of course, our main goal is to really try to establish long memory that can build a nuanced and rich understanding of sequential data. The limitation of RNNs that's linked to all these problems, and to our inability to achieve these capabilities, is that they require this time step by time step processing. So what if we could move beyond that? What if we could eliminate this need for recurrence entirely and not have to process the data time step by time step? Well, a first and naive approach would be to just squash all the data, all the time steps together, to create a vector that's effectively concatenated. Right, the time steps are eliminated. There's just one stream, where we have now one vector input with the data from all time points that's then fed into the model. It calculates some feature vector and then generates some output, which hopefully makes sense. 
And because we've squashed all these time steps together, we could simply think about maybe building a feed-forward network that could do this computation. Well, with that we'd eliminate the need for recurrence, but we still have the issues that it's not scalable because the dense feedforward network would have to be immensely large, defined by many, many different connections. And critically, we've completely lost our in-order information by just squashing everything together blindly. There's no temporal dependence and we're then stuck in our ability to try to establish long-term memory. So what if instead we could still think about bringing these time steps together, but be a bit more clever about how we try to extract information from this input data? The key idea is this idea of being able to identify and attend to what is important in a potentially sequential stream of information.
Intuition of attention (48:10)
And this is the notion of attention, or self-attention, which is an extremely, extremely powerful concept in modern deep learning and AI. I cannot overstate, I cannot emphasize enough, how powerful this concept is. Attention is the foundational mechanism of the transformer architecture, which many of you may have heard about. The notion of a transformer can often be very daunting, because sometimes they're presented with these really complex diagrams or deployed in complex applications, and you may think, okay, how do I even start to make sense of this? At its core, though, attention, the key operation, is a very intuitive idea, and in the last portion of this lecture we're going to break it down step by step to see why it's so powerful and how we can use it as part of a larger neural network like a transformer. Specifically, we're going to be focusing on this idea of self-attention: attending to the most important parts of an input example. So let's consider an image. I think it's most intuitive to consider an image first. This is a picture of Iron Man, and if our goal is to try to extract information from this image about what's important, what we could do, maybe, is use our eyes to naively scan over this image pixel by pixel, right, just going across the image. Maybe internally our brains are doing some type of computation like this, but you and I can simply look at this image and attend to the important parts. We can see that it's Iron Man coming at you, right, in the image, and then we can focus in a little further and say, okay, what are the details about Iron Man that may be important? What is key is that your brain is identifying which parts to attend to, and then extracting those features that deserve the highest attention. The first part of this problem is really the most interesting and challenging one, and it's very similar to the concept of search. 
Effectively, that's what search is doing. Taking some larger body of information and trying to extract and identify the important parts.
Attention and search relationship (50:30)
So let's go there next. How does search work? You're thinking, you're in this class, how can I learn more about neural networks? Well, in this day and age, one thing you may do, besides coming here and joining us, is go to the Internet. With all the videos out there, you try to find something that matches by doing a search operation. So you have a giant database like YouTube, you want to find a video, you enter in your query, deep learning, and what comes out are some possible outputs, right? For every video in the database, there is going to be some key information related to that video. Let's say the title. Now, to do the search, the task is to find the overlaps between your query and each of these titles, right, the keys in the database. What we want to compute is a metric of similarity and relevance between the query and these keys. How similar are they to our desired query? And we can do this step by step. Let's say this first option, a video about the elegant giant sea turtles: not that similar to our query about deep learning. Our second option, introduction to deep learning, the first introductory lecture of this class: yes, highly relevant. The third option, a video about the late and great Kobe Bryant: not that relevant. The key operation here is this similarity computation, bringing the query and the key together. The final step, now that we've identified which key is relevant, is extracting the relevant information, what we want to pay attention to, and that's the video itself. We call this the value, and because the search is implemented well, we've successfully identified the relevant video on deep learning that you are going to want to pay attention to. And it's this idea, this intuition of giving a query, trying to find similarity, and trying to extract the related values, that forms the basis of self-attention and how it works in neural networks like transformers.
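The query, key, and value roles in this search example can be sketched in a few lines of Python. This is only a toy under stated assumptions: the similarity score is a simple count of shared words, and the function names, the exact titles, and the file names are hypothetical placeholders rather than anything from the lecture.

```python
def similarity(query, key):
    """Count the words shared by the query and a key (a toy relevance score)."""
    return len(set(query.lower().split()) & set(key.lower().split()))


def search(query, database):
    """Return the value whose key is most similar to the query."""
    best_key = max(database, key=lambda k: similarity(query, k))
    return database[best_key]


# Keys are video titles; values stand in for the videos themselves.
database = {
    "The elegant giant sea turtles": "sea_turtles.mp4",
    "Introduction to deep learning": "intro_deep_learning.mp4",
    "Kobe Bryant career highlights": "kobe_bryant.mp4",
}

result = search("deep learning", database)
```

This search picks exactly one best match. Self-attention softens that choice: instead of returning a single value, it blends all values, weighted by how similar each key is to the query.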
Learning attention with neural networks (52:40)
So to go concretely into this, right, let's go back now to our text, our language example. With the sentence, our goal is to identify and attend to features in this input that are relevant to the semantic meaning of the sentence. Now, first step: we have sequence, we have order, but we've eliminated recurrence, right? We're feeding in all the time steps all at once. We still need a way to encode and capture this information about order and this positional dependence. How this is done is this idea of positional encoding, which captures some inherent order information present in the sequence. I'm just going to touch on this very briefly, but the idea is related to this idea of embeddings which I introduced earlier. What is done is that a neural network layer is used to encode positional information that captures the relative relationships in terms of order within this text. That's the high-level concept, right? We're still able to process these time steps all at once. There is no notion of time step; rather, the data is singular, but still we learn this encoding that captures the positional order information. Now, our next step is to take this encoding and figure out what to attend to, exactly like that search operation that I introduced with the YouTube example: extracting a query, extracting a key, extracting a value, and relating them to each other. So we use neural network layers to do exactly this. Given this positional encoding, what attention does is apply a neural network layer that transforms it, first generating the query. We do this again using a separate neural network layer, with a different set of weights, a different set of parameters, that transforms that positional embedding in a different way, generating a second output, the key. And finally, this operation is repeated with a third layer, a third set of weights, generating the value. 
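The three projections described above can be sketched in plain Python. Everything numeric here is a hypothetical toy value. Adding a per-position vector to each token embedding is one common way to inject order information and is an assumption, since the lecture only says that a layer encodes position. Each "neural network layer" is reduced to a single weight matrix with no bias.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Three tokens, each with a 2-dimensional embedding (toy numbers).
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One vector per position; learned in practice, fixed by hand here.
position_embeddings = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

x = add(token_embeddings, position_embeddings)  # position-aware input

# Three separate weight matrices, one per projection (hypothetical values).
w_query = [[1.0, 0.0], [0.0, 1.0]]
w_key = [[0.0, 1.0], [1.0, 0.0]]
w_value = [[2.0, 0.0], [0.0, 2.0]]

query = matmul(x, w_query)  # what each position is looking for
key = matmul(x, w_key)      # what each position offers for matching
value = matmul(x, w_value)  # the content that gets extracted
```

All three outputs come from the same position-aware input. Only the weights differ, which is what lets the network learn separate roles for the query, the key, and the value.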
Now, with these three in hand, the query, the key, and the value, we can compare them to each other to try to figure out where in that input the network should attend to what is important. And that's the key idea behind this similarity metric, or what you can think of as an attention score. What we're doing is computing a similarity score between a query and the key. And remember that these query and key values are just arrays of numbers, which you can think of as vectors in space. The query values are some vector. The key values are some other vector. And mathematically, the way that we can compare these two vectors to understand how similar they are is by taking the dot product and scaling it. This captures how similar these vectors are, whether or not they're pointing in the same direction, right? This is the similarity metric, and if you are familiar with a little bit of linear algebra, it is closely related to cosine similarity, which is the same dot product normalized by the lengths of the two vectors. The operation functions exactly the same way for matrices. If we apply this dot product operation to our query and key matrices, we get this similarity metric out. Now, this is very, very key in defining our next step, computing the attention weighting in terms of what the network should actually attend to within this input. This operation gives us a score which defines how the components of the input data are related to each other. So given a sentence, right, when we compute this similarity score metric, we can then begin to think of weights that define the relationship of the components of the sequential data to each other. So for example, in this example with a text sentence, he tossed the tennis ball to serve, the goal with the score is that words in the sequence that are related to each other should have high attention weights. Ball related to toss, related to tennis. And this metric itself is our attention weighting. 
What we have done is passed that similarity score through a softmax function, and all it does is constrain those values to be between zero and one, so that together they sum to one. And so you can think of these as relative attention weights. Finally, now that we have this metric that captures this notion of similarity and these internal self-relationships, we can use it to extract features that are deserving of high attention. And that's the exact final step in the self-attention mechanism. We take that attention weighting matrix, multiply it by the value, and get a transformed version of the initial data as our output, which in turn reflects the features that correspond to high attention.
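Putting the steps above together (dot product, scaling, softmax, multiplying by the value) gives the following minimal sketch of a single self-attention computation. Scaling by the square root of the key dimension is the usual convention and is assumed here, since the lecture only says the dot product is scaled. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(query, key, value):
    """Scaled dot-product attention over lists of row vectors.

    Returns (output, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    scale = math.sqrt(len(key[0]))
    weights = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in key]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[d] for w, v in zip(row, value)) for d in range(len(value[0]))]
        for row in weights
    ]
    return output, weights


# Toy 2-dimensional vectors for three positions (hypothetical numbers).
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = self_attention(query, key, value)
```

Each row of `weights` sums to one, so each output row is a weighted average of the value rows. Positions whose keys point in the same direction as the query contribute the most.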
Scaling attention and applications (58:16)
All right, let's take a breath. Let's recap what we have just covered so far. The goal with this idea of self-attention, the backbone of transformers, is to eliminate recurrence and attend to the most important features in the input data. In an architecture, how this is actually deployed is this: first, we take our input data and compute these positional encodings. The neural network layers are then applied threefold to transform the positional encoding into each of the key, query, and value matrices. We can then compute the self-attention weight score according to the dot product operation that we went through prior, and then self-attend to this information to extract features that deserve high attention. What is so powerful about this approach, taking this attention weight and putting it together with the value to extract high attention features, is that this operation, this scheme that I'm showing on the right, defines a single self-attention head, and multiple of these self-attention heads can be linked together to form larger network architectures. You can think about these different heads trying to extract different information, different relevant parts of the input, to put together a very, very rich encoding and representation of the data that we're working with. Intuitively, back to our Iron Man example, what this idea of multiple self-attention heads can amount to is that different salient features and salient information in the data are extracted. First, maybe you consider Iron Man, attention head one, and you may have additional attention heads that are picking out other relevant parts of the data which maybe we did not realize before, for example the building, or the spaceship in the background that's chasing Iron Man. And so this is a key building block of many, many powerful architectures that are out there today. I, again, cannot emphasize enough how powerful this mechanism is. 
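The idea of linking several self-attention heads together can be sketched as follows. This is a simplified illustration under stated assumptions: each head has its own hand-picked query, key, and value weights, the head outputs are simply concatenated, and the final output projection that transformer implementations usually apply is left out. All names and numbers are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_query, w_key, w_value):
    """One self-attention head: project, score, normalize, and mix values."""
    query, key, value = matmul(x, w_query), matmul(x, w_key), matmul(x, w_value)
    scale = math.sqrt(len(key[0]))
    key_transposed = [list(col) for col in zip(*key)]
    scores = matmul(query, key_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, value)


def multi_head_attention(x, heads):
    """Run every head on the same input and concatenate their outputs."""
    outputs = [attention_head(x, *head) for head in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, two features

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
# Each head is a (w_query, w_key, w_value) triple with its own weights.
heads = [(identity, identity, identity), (identity, swap, identity)]

combined = multi_head_attention(x, heads)
```

Both heads see the same input, but because their key weights differ they attend to different positions, so the two halves of each row in `combined` carry different information.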
And indeed, this backbone idea of self-attention that you just built up an understanding of is the key operation of some of the most powerful neural networks and deep learning models out there today, ranging from the very powerful language models like GPT-3, which are capable of synthesizing natural language in a very human-like fashion, digesting large bodies of text information to understand relationships in text, to models that are being deployed for extremely impactful applications in biology and medicine, such as AlphaFold2, which uses this notion of self-attention to look at data of protein sequences and predict the three-dimensional structure of a protein given sequence information alone. And all the way even now to computer vision, which will be the topic of our next lecture tomorrow, where the same idea of attention that was initially developed in sequential data applications has now transformed the field of computer vision, again using this key concept of attending to the important features in an input to build these very rich representations of complex high-dimensional data. Okay, so that concludes the lectures for today. I know we have covered a lot of territory in a pretty short amount of time, but that is what this boot camp program is all about.
Conclusion And Summary
So hopefully today you've gotten a sense of the foundations of neural networks in the lecture with Alexander. We talked about RNNs, how they're well suited for sequential data, how we can train them using backpropagation, how we can deploy them for different applications, and finally how we can move beyond recurrence to build this idea of self-attention for building increasingly powerful models for deep learning in sequence modeling. Alright, hopefully you enjoyed. We have about 45 minutes left for the lab portion and open office hours, in which we welcome you to ask questions of us and the TAs and to start work on the labs. The information for the labs is up there. Thank you so much for your attention.