MIT 6.S191 (2019): Recurrent Neural Networks
Transcription for the video titled "MIT 6.S191 (2019): Recurrent Neural Networks".
Note: This transcription is split and grouped by topics and subtopics. You can navigate through the Table of Contents on the left. It's interactive. All paragraphs are timed to the original video. Click on the time (e.g., 01:53) to jump to the specific portion of the video.
Hi, everyone. My name is Ava, and welcome to our second lecture on deep sequence modeling. So in the first lecture, Alexander talked about the essentials of neural networks and feedforward models. And now we're going to turn our attention to applying neural networks to problems which involve sequential processing of data and why these sorts of tasks require a different type of network architecture from what we've seen so far.
Understanding Recurrent Neural Networks (Rnns) And Lstm
An Example (00:32)
So before we dive in, I'd like to start off with a really simple example. Suppose we have this picture of a ball, and we want to predict where it will travel to next. Without any prior information about the ball's history, any guess on its next position will be exactly that, just a guess. However, if in addition to the current location of the ball, I also gave you a history of its previous locations, now the problem becomes much easier. And I think we can all agree that we have a pretty clear sense of where the ball is going to next. So this is a really, really simple sequence modeling problem. Given this image, this thought experiment of a ball's travel through space, can we predict where it's going to go next? But in reality, the truth is that sequential data is all around us. For example, audio can be split up into a sequence of sound waves, while text can be split up into a sequence of either characters or words. And beyond these two ubiquitous examples, there are many more cases in which sequential processing may be useful, from analysis of medical signals like EKGs, to predicting stock trends, to processing genomic data. And now that we've gotten a sense of what sequential data looks like, I want to turn our attention to another simple problem to motivate the types of networks that we're going to use for this task. And in this case, suppose we have a language model where we're trying to train a neural network to predict the next word in a phrase or a sentence. And suppose we have this sentence. This morning, I took my cat for a walk. Yes, you heard and read that right. This morning I took my cat for a walk. And let's say we're given these words. This morning I took my cat for a, and we want to predict the next word in the sequence. And since this is a class on deep learning, we're going to try to build a deep neural network, like a feedforward network from our first lecture, to do this. And one problem that we're immediately going to run into is that our feedforward network can only take a fixed length vector as its input. And we have to specify the size of this input right at the start. to specify the size of this input right at the start. And you can imagine that this is going to be a problem for our task in general, because sometimes we'll have a sentence, we'll have seen five words, sometimes seven words, sometimes ten words, and we want to be able to predict what comes next. So fundamentally, we need a way to handle variable length input. And one way we can do this is to use this idea of a fixed window to force our input vector to be a certain length, in this case, 2. And this means that no matter where we're trying to make our prediction, we just take the previous two words and try to predict the next word. And we can represent these two words as a fixed length vector where we take a larger vector, allocate some space for the first word, some space for the second word, and encode the identity of each word in that vector. But this is problematic, because of the fact that we're using this fixed window, we're giving ourselves a limited history, which means that we can't effectively model long-term dependencies in our input data, which is important in sentences like this one, where we clearly need information from much earlier in the sentence to accurately predict the next word.
Representing Long-Term Dependencies (04:07)
If we were only looking at the past two words, or the past three words, or the past five words even, we wouldn't be able to make this prediction, being the word French. So we need a way to integrate information from across the sentence, but also still represent the input as a fixed length vector. And another way we could do this is by actually using the entire sequence, but representing it as a set of counts. And this representation is what's called a bag of words, where we have some vector, and each slot in this vector represents a word. And the value that's in that slot represents the number of times that that word appears in the sentence. And so we have a fixed length vector over some vocabulary of words, regardless of the length of the input sentence, but the counts are just going to be different. And we can feed this into our feedforward neural network to generate a prediction about the next word. And you may have already realized that there's a big problem with this approach. In using counts, we've just completely abolished all sequence information. And so for example, these two sentences with completely opposite semantic meanings would have the exact same bag of words representation. Same words, same counts. So obviously, this isn't going to work.
Representing Means (05:42)
Another idea could be to simply extend our first idea of a fixed window, thinking that by looking at more words, we can get most of the context we need. And so we can represent our sentence in this way, just a longer fixed window, feed it into our feedforward model, and make a prediction. And if we were to feed this vector into a feedforward neural network, each of these inputs, each 0 or 1 in the vector, would have a separate weight connecting it to the network. And so if we repeatedly were to see the words this morning at the beginning of the sentence, the neural network would be able to learn that this morning represents a time or a setting. But if in another sentence, this morning were to now appear at the end of that sentence, the network is going to have difficulty recognizing that this morning actually means this morning, because the parameters that see the end of the vector have never seen that phrase before. And the parameters that see the end of the vector have never seen that phrase before. And the parameters from the beginning haven't been shared across the sequence. And so at a higher level, what this means is that what we learn about the sequence at one point is not going to transfer anywhere to anywhere else in the sequence if we use this representation. And so hopefully by walking through this, I've motivated that why a traditional feed-forward neural network is not really well suited to handle sequential data.
RNNs and the Recurrent Patition of Neural Networks (07:07)
And this simple example further motivates a concrete set of design criteria that we need to keep in mind when thinking about building a neural network for sequence modeling problems. Specifically, our network needs to be able to handle variable length sequences, be able to track long-term dependencies in the data, maintain information about the sequence order, and share the parameters it learns across the entirety of the sequence. And today we're going to talk about using recurrent neural networks, or RNNs, as a general framework for sequential processing and sequence modeling problems. So let's go through the general principle behind RNNs and explore how they're a fundamentally different architecture from what we saw in the first lecture. So this is an abstraction of our standard feed-forward neural network. And in this architecture, data propagates in one direction, from input to output. And we already motivated why a network like this can't really handle sequential data.
Hidden state vs feedback (08:22)
RNNs, in contrast, are really well suited for handling cases where we have a sequence of inputs rather than a single input. And they're great for problems like this one, in which a sequence of data is propagated through the model to give a single output. For example, you can imagine training a model that takes as input a sequence of words and outputs a sentiment that's associated with that phrase or that sentence. Alternatively, instead of returning a single output, we could also train a model where we take in a sequence of inputs, propagate them through our recurrent neural network model, and then return an output at each time step in the sequence. And an example of this would be in text or music generation. And you'll get a chance to explore this type of model later on in the lab. And beyond these two examples, there are a whole host of other recurrent neural network architectures for sequential processing, and they've been applied to a range of problems. So what fundamentally is a recurrent neural network? As I mentioned before, to reiterate, our standard vanilla feedforward neural network, we are going from input to output in one direction, and this fundamentally can't maintain information about sequential data. RNNs, on the other hand, are networks where they have these loops in them which allow for information to persist. So in this diagram, our RNN takes as input this vector x of t, outputs a value like a prediction y hat of t, but also makes this computation to update an internal state which we call h of t, and then passes this information about its state from this step of the network to the next. And we call these networks with loops in them recurrent because information is being passed internally from one time step to the next. So what's going on under the hood? How is information being passed? RNNs use a simple recurrence relation in order to process sequential data. Specifically, they maintain this internal state, H of T. And at each time step, we apply a function parametrized by a set of weights, w, to update this state, based on both the previous state, h of t minus 1, and the current input, x of t. And the important thing to know here is that the same function and the same set of parameters are used at every time step. And this addresses that important design criteria from earlier of why it's useful to share parameters in the context of sequence modeling. To be more specific, the RNN computation includes both a state update as well as the output. So given our input vector, we apply some function to update the hidden state. And as we saw in the first lecture, this function is a standard neural net operation that consists of multiplication by a weight matrix and applying a nonlinearity. But in this case, since we both have the input vector, x of t, as well as the previous state, h of t minus 1, as inputs to our function, we have two weight matrices. And we can then apply our nonlinearity to the sum of these two terms. Finally, we generate an output at a given time step, which is a transformed version of our internal state that falls from a multiplication by a separate weight matrix. So so far, we've seen RNNs as depicted as having these loops that feed back in on themselves. Another way of thinking about the RNN can be in terms of unrolling this loop across time. And if we do this, we can think of the RNN as multiple copies of the same network, where each copy is passing a message onto its descendant. And continuing this scheme throughout time, you can easily see that RNNs have this chain-like structure, which really highlights how and why they're so well-suited for processing sequential data. So in this representation, we can make our weight matrices explicit, beginning with the weights that transform the inputs to the hidden state, transform the previous hidden state to the next hidden state, and finally transform the hidden state to the output. And it's important once again to note that we are using the same weight matrices at every time step. And from these outputs we can compute a loss at each time step and this completes our what is called our forward pass through the network. And finally to define the total loss we simply sum the losses from all the individual time steps and since our total loss consists of these individual contributions over time this means that training the network will also have to involve some time component. OK, so in terms of actually training RNNs, how can we do this? And the algorithm that's used is an extension of the back propagation idea that Alexander introduced in the first lecture. And it's called backpropagation through time. So to remind you, let's think back to how we train feedforward models. Given our inputs, we first make a forward pass through the network going from input to output, and then backpropagate gradients back through the network, taking the derivative of the loss with respect to each parameter in the network, and then tweaking our parameters in order to minimize the loss. For RNNs, our forward pass through the network consists of going forward across time, updating the cell state based on the input and the previous state, generating an output at each time step, computing a loss at each time step, and then finally summing these individual losses to get the total loss. And what this means is that instead of back-propagating errors through a single feed-forward network at a single time step. In RNNs, errors are back propagated at each individual time step, and then across time steps, all the way from where we are currently to the very beginning of the sequence.
Gradient Propagation Through Time (15:09)
And this is the reason why it's called back propagation through time, because as you can see, all the errors are flowing back in time to the beginning of our data sequence. And so if we take a closer look at how gradients flow across this chain of repeating modules, you can see that in between these time steps, in doing this back propagation, we have this factor, w h h, which is a matrix. And this means that in each step, we have to perform a matrix multiplication that involves this weight matrix w. And furthermore, each cell state update results from a nonlinear activation. And what this means is that in computing the gradient in an RNN, the derivative of the loss with respect to our initial state, H0, we have to make many matrix multiplications that involve the weight matrix, as well as repeated use of the derivative of the activation function. Why might this be problematic? Well, if we consider these multiplication operations, if many of these values are greater than 1, what we can encounter is what's called this exploding gradient problem, where our gradients become extremely large and we can't do any optimization. And to combat this, one thing that's done in practice is called gradient clipping, which basically means you scale back your gradients when they become too large. And this is a really good practical option, especially when you have a network that's not too complicated with, and doesn't have many parameters. On the flip side, we can also have the opposite problem, where if our matrix values are too small, we can encounter what's called the vanishing gradient problem. And it's really the motivating factor behind the most widely used RNN architectures. And today we're going to address three ways in which we can alleviate the vanishing gradient problem by changing the activation function that's used, being clever about how we initialize the weights in our network, and finally how we can fundamentally change the RNN architecture to combat this.
Vanishing gradients. (17:23)
And so before we go into that, let's take a step back and try to establish some more intuition as to why vanishing gradients are such a big problem. So imagine you have a number, right, and you keep multiplying that number by something in between zero and one. That number is going to keep shrinking and shrinking, and eventually it's going to vanish. When this happens to gradients, this means it's going to be harder and harder to propagate errors further back into the past because the gradients are going to become smaller and smaller. And this means that during training, we'll end up biasing our network to capture short-term dependencies, which may not always be a problem. Sometimes we only need to consider very recent information to perform our task of interest.
So to make this concrete, let's go back to our example from the beginning of the lecture. A language model, we're trying to predict the next word in a phrase. So in this case, if we're trying to predict the last word in the phrase, the clouds are in the blank, it's pretty obvious what the next word is going to be. And there's not much of a gap between the relevant information, like the word cloud, and the place where the prediction is needed. And so a standard RNN can use the past information to make the prediction. But there can be other cases where more context is necessary, like in this example. More recent information suggests that the next word is most likely the name of a language. But to identify which language, we need information from further back, the context of France. And in many cases, the gap between what's relevant and the point where that information is needed can become really, really large. And as that gap grows, standard RNNs become increasingly unable to connect the information. And that's all because of the vanishing gradient problem. So how can we alleviate this? The first trick is pretty simple. We can change the activation function the network uses. And specifically both the tanh and sigmoid activation functions have derivatives less than one pretty much everywhere, right? In contrast, if we use a ReLU activation function, the derivative is one for whenever x is greater than zero. And so this helps prevent the value of the derivative from shrinking our gradients.
Initializing parameters (20:08)
But it's only true for when x is greater than 0. Another trick is to be smart in terms of how we initialize parameters in our network. By initialing our weights to the identity matrix, we can help prevent them from shrinking to zero too rapidly during back propagation. The final and most robust solution is to use a more complex type of recurrent unit that can more effectively track long-term dependencies by controlling what information is passed through and what's used to update the cell state. Specifically, we'll use what we call gated cells, and there are many types of these gated cells that exist. And today we'll focus on one type of gated cell called a long short-term memory network, or LSTMs for short, which are really good at learning long-term dependencies and overcoming this vanishing gradient problem. And LSTMs are basically the gold standard when it comes to building RNNs in practice and they're very very widely used by the deep learning community. So to understand what makes LSTMs special, right, let's think back to the general structure of an RNN. All recurrent neural networks have this form of a series of repeating modules, right, the RNN being unrolled across time. And in a standard RNN, the repeating module contains one computation node. In this case, it's a tanh layer. LSTMs also have this chain-like structure, but the repeating module is slightly more complex. And don't get too frightened, hopefully, by what these flow diagrams mean. We'll walk through it step by step. But the key idea here is that the repeating unit in an LSTM contains these different interacting layers that control the flow of information. The first key idea behind LSTMs is that they maintain an internal cell state, which will denote C of t in addition to the standard RNN state H of t. And this cell state runs throughout the chain of repeating modules. And as you can see, there are only a couple of simple linear interactions. This is a pointwise multiplication, and this is addition, that update the value of C of t. And this means that it's really easy for information to flow along relatively unchanged. The second key idea that LSTMs use is that they use these structures called gates to add or remove information to the cell state. And gates consist of a sigmoid neural net layer followed by a pointwise multiplication. So let's take a moment to think about what these gates are doing. The sigmoid function is special because it's forcing the input to the gate to be between 0 and 1. And intuitively, you can think of this as capturing how much of the input should be passed through the gate. If it's 0, don't pass any of that information through the gate. If it's zero, don't pass any of that information through the gate. If it's one, pass all the information through the gate. And so this regulates the flow of information through the LSTM.
LSTM Operations (23:30)
So now you're probably wondering, okay, these lines look really complicated. What's, how do these LSTMs actually work? Thinking of the LSTM operations at a high level, it boils down to three key steps. The first step in the LSTM is to decide what information is going to be thrown away from the prior cell state. Forget irrelevant history, right? The next step is to take both the prior information as well as the current input, process this information in some way, and then selectively update the cell state. And our final step is to return an output. And for this, LSTMs are going to use an output gate to return a transformed version of the cell state. So now that we have a sense of these three key LSTM operations, forget, update, output, let's walk through each step by step to get a concrete understanding of how these computations work. And even though we're going to walk through this, I really want you to keep in mind the high level concepts of each operation, because that's what's important in terms of establishing the intuition behind how LSTMs work. And we'll again go back to our language model example that we've been using throughout this lecture, where we're trying to predict the next word in a sequence. So our first task is to identify what past information is relevant and what's irrelevant. And we achieve this using a sigmoid layer called the forget gate, F of T, right? And F of T is parametrized by a set of weights and biases just like any neural network layer. And this layer looks at the previous information, h of t minus 1, as well as the input, x of t, and then outputs a number between 0 and 1, between completely forgetting that information and completely keeping that information, and then passes along this decision. So in our language model example, the cell state might have included some information about the gender pronoun of a subject in a sentence, for example. So you can imagine updating the LSTM to forget the gender pronoun of a sentence's past subject once it encounters a new subject in that sentence. Our second step is to decide what new information is going to be stored in our updated cell state and to actually execute that update. So there are two steps to this. The first is a sigmoid layer, which you can think of as gating the input, which identifies what values we should update.
LSTM Gating (26:29)
Secondly, we have a tanh layer that generates a new vector of candidate values that could be added to the state. And in our language model, we may decide to add the gender of a new subject in order to replace the gender of the old subject. Now we can actually update our old cell state, c of t minus 1, into the new cell state, c of t. Our previous two steps decided what we should do. Now it's about actually executing that. So to perform this update, we first multiply our old cell state, c of t minus 1, by our forget state, our forget gate, f of t. This forgets what we decided to forget, right? We can then add our set of new candidate values scaled by the input gate to selectively update each state value. And so in our language model example, this means that we're dropping information about the old subject's gender and then adding the new information. Finally, we just need to decide what we're going to output and actually output it. And what we are going to output, H of T, is going to be a filtered version of our internal state that we've been maintaining and updating all along. So again we're going to use a sigmoid layer to gate where we're going to output, and we put our recently updated cell state, C of T, through a tanh layer and then multiply this by the output of the sigmoid gate. Essentially, this amounts to transforming the updated cell state using that tanh and then gating it. And in our language model, for example, you may want to output information that relates to a verb, for example, if we've just seen the subject, a new subject in the sentence. So this gives us a sense of the internal workings of the LSTM, but if there's one thing that you take away, it's sort of those three high level operations of the LSTM. Forget old information, update the cell state, and output a filtered version. But to really appreciate how the LSTM helps overcome the vanishing gradient problem, let's consider the relationship between C of t and C of t minus 1.
Difficulty in LSTM (28:50)
When we back propagate from C of t, our current cell state, to C of t minus 1, what you'll notice is that we only have to perform element-wise multiplications and an addition in doing this back propagation. There's no matrix multiplication that's involved. And that's entirely because we're maintaining this separate cell state, c of t, apart from H of t. And that C of t is only involved in really simple computations. And so when you link up these repeating LSTM units in a chain, what you'll see is that you get this completely uninterrupted gradient flow, unlike in a standard RNN where you have to do repeated matrix multiplications. And this is really great for training purposes and for overcoming the vanishing gradient problem. So to recap the key ideas behind LSTMs, we maintain a separate cell state from what's outputted. We use gates to control the flow of information, first forgetting what's irrelevant, selectively updating the cell state based on both the past history and the current input, and then outputting some filtered version of what we just computed. And this maintenance of the separate cell state allows for simple backpropagation with uninterrupted gradient flow. So now that we've gone through the fundamental workings of RNNs, backpropagation through time, the vanishing and exploding gradient problems, and the LSTM architecture, I'd like to close by considering three really concrete examples of how to use RNNs. Let's first imagine we're trying to learn a RNN to predict the next musical note and to use this model to generate brand new musical sequences. So you can imagine inputting a sequence of notes and producing an output at each time step, where our output at each time step is what we think is the next note in the sequence, right? And if you train a model like this, you can actually use it to generate brand new music that's never been heard before. And so for example, you get the idea.
Relevance And Limitations Of Recurring Neural Networks (Rnns)
Dialogue Systems (31:38)
This sounds like classical music, right? But in reality, this was music that was generated by a recurrent neural network that trained on piano pieces from Chopin. And after the training process was asked, OK, now generate some new music based on what it is you've learned. And you can see, right, this sounds like extremely realistic. You may not have been able to tell that this was music generated by a machine, unless maybe you're an expert piano aficionado. And you'll actually get some practice with building a model to do exactly this in today's lab, where you'll be training an RNN to generate brand new Irish folk music that has never been heard before.
Generated from war training (32:13)
As another cool example, where we're going from an input sequence to just a single output, we can train an RNN to take as input words in a sentence and actually output the sentiment or the feeling of that particular sentence, either positive or negative. So for example, if we were to train a model like this on a set of tweets, we could train our RNN to predict that this wonderful first tweet about our class, Success191, has a really positive sentiment, which hopefully you agree with, but that this other tweet about the weather is actually negative.
Attention in Auto три Gren Agua (33:04)
The final example I'll briefly talk about is one of the most powerful and widely used applications of RNNs in industry. And it's the backbone of Google's translate algorithm. And that's machine translation, where you input a sentence in one language and train an RNN to output a sentence in a new language. And this is done by having an encoder that encodes, encodes the original sentence into a state vector and a decoder which decodes that state vector into a new language.
Gupta sequences receive feedback (33:45)
But there's a big problem in this approach as depicted here, and that's the fact that this entire original sentence needs to be encoded in a single vector that's passed from the encoder to the decoder. And this is a huge bottleneck when you're considering large bodies of text that you're trying to translate. And actually, you know, researchers devised a clever way to get around this problem, which is this idea of attention. And the basic idea here is that instead of the decoder only having access to the final encoded state, it now has access to each of these states after each of the steps in the original sentence. And the actual weighting of these vectors from encoder to decoder is learned by the network during training. And this technique is called attention because when the network learns this weighting, it's placing its attention on different parts of the input sequence. And in this sense, you can think of it as actually capturing a sort of memory of the original sentence. So hopefully, you've gotten a sense of how RNNs work and why they're so powerful for sequential processing.
We've discussed why they're so well suited for sequential modeling tasks, seen how to define their operation using this recurrence relation, how to train them using back propagation through time, and also looked at how gated cells can let us model long-term dependencies. And finally, we discussed three concrete applications of RNNs. And so this concludes the lecture portion of our first day of 6S191. And we're really excited now to transition to the lab portion, which, as I mentioned, is going to mostly focus on training in RNN to generate brand new music. And so the lab is going to be broken down into two parts. But before I go into the specifics of getting started with the labs, I'd like to take a few minutes pause. For those of you who plan to stick around for the labs, we're happy to have you.
Thanks very helpful (36:07)
We're also happy to address any questions that you may have about either Lecture 1 or Lecture 2 here at the front, and we'll just take a five-minute break or so to orient ourselves and get set up for the lab. Thank you.