MIT 6.S191 (2018): Sequence Modeling with Neural Networks
Transcription for the video titled "MIT 6.S191 (2018): Sequence Modeling with Neural Networks".
Hi everybody, my name is Harini and I'm going to be talking about how to use neural networks to model sequences. In the previous lecture you saw how you could use a neural network to model a data set of many examples. The difference with sequences is that each example consists of multiple data points. There can be a variable number of these data points per example, and the data points can depend on each other in complicated ways. So a sequence could be something like a sentence, like, this morning I took the dog for a walk. This is one example, but it consists of multiple words, and the words depend on each other. Another example would be something like a medical record. One medical record would be one example, but it consists of many measurements. Another example would be something like a speech waveform, where this one waveform is an example, but again, it consists of many, many measurements. You've probably encountered sequence modeling tasks in your everyday life, especially if you've used things like Google Translate, Alexa, or Siri. Tasks like machine translation and question answering are all sequence modeling tasks, and the state of the art in these tasks is mostly deep learning based. Another interesting example I saw recently was this self-parking car by Audi. When you think about it, parking is also a sequence modeling task, because parking is just a sequence of movements, and the next movement depends on all the previous movements. Okay, so a sequence modeling problem. Now I'm just gonna walk through a sequence modeling problem to kind of motivate why we need a different framework specifically for modeling sequences and what we should be looking for in that framework. So the problem is predicting the next word. Given these words, we want to predict what comes next.
The first problem we run into is that machine learning models that are not explicitly designed to deal with sequences take as input a fixed length vector. Think back to the feedforward neural network from the first lecture that Alexander introduced. We have to specify the size of the input right at the outset. We can't sometimes feed in a vector of length 10, other times feed in a vector of length 20. It has to be fixed length. So this is kind of an issue with sequences because sometimes we might have seen 10 words and we want to predict the next word. Sometimes we might have seen four words and we want to predict the next word. So we have to get that variable length input into a fixed length vector. One simple way to do this would be to just cut off the vector. So say, okay, we're going to just take a fixed window, force this vector to be fixed length by only considering the previous two words. No matter where we're making the prediction, we'll just take the previous two words and then try to predict the next word. Now we can represent these two words as a fixed length vector by creating a larger vector and then allocating space in it for the first word and for the second word. We have a fixed length vector now no matter what two words we're using and we can feed this into a machine learning model like a feed forward neural network or logistic regression or any other model and try to make a prediction. One thing you might be noticing here is that by using this fixed window, we're giving ourselves a very limited history. We're trying to predict the word walk, having only seen the words for and a. This is almost impossible. Put differently, it's really hard to model long-term dependencies. To see this clearly, consider the sentence, in France, I had a great time and I learned some of the blank language, where we're trying to predict the word in the blank. 
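To make the fixed-window idea concrete, here's a rough sketch in numpy. The toy vocabulary and sentence are made up for illustration, not taken from the lecture slides:

```python
import numpy as np

# Hypothetical toy vocabulary; in practice this would be built from a corpus.
vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size):
    """One slot per vocabulary word; a 1 marks the given word."""
    v = np.zeros(size)
    v[word_to_idx[word]] = 1.0
    return v

def fixed_window_vector(words, window=2):
    """Concatenate one-hot vectors for only the last `window` words,
    giving a fixed-length input no matter how long the sentence is."""
    last = words[-window:]
    return np.concatenate([one_hot(w, len(vocab)) for w in last])

x = fixed_window_vector(["this", "morning", "i", "took", "the", "dog", "for", "a"])
print(x.shape)  # (18,) -- 2 words x 9-word vocabulary, always the same size
```

Note that the vector only ever sees the words "for" and "a", which is exactly the limited-history problem described above.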
I knew it was French, but that's because I looked very far back at the word France that appeared in the beginning of the sentence. If we were only looking at the past two words, or the past three words, or even the past five words, it would be really hard to guess the word in that blank. So we don't want to limit ourselves so much. We want to ideally use all of the information that we have in the sequence. But we also need a fixed length vector. So one way we could do this is by using the entire sequence but representing it as a set of counts. In language, this representation is also known as a bag of words. All this is is a vector in which each slot represents a word, and the number in that slot represents the number of times that that word occurs in the sentence. So here, the second slot represents the word this, and there's a 1 because this appears once in the sentence. Now we have a fixed length vector. No matter how many words we have, the vector will always be the same size. The counts will just be different. We can feed this into a machine learning model and try to make a prediction. The problem you may be noticing here is that we're losing all of the sequential information. These counts don't preserve any order that we had in the sequence. To see why this is really bad, consider these two sentences. The food was good, not bad at all, versus the food was bad, not good at all. These are completely opposite sentences but their bag of words representation would be exactly the same because they contain the same set of words. So by representing our sentence as counts we're losing all of the sequential information which is really important because we're trying to model sequences. Okay so what do we know now? We want to preserve order in the sequence, but we also don't want to cut it off to a very short length. You might be saying, well, why don't we just use a really big fixed window? 
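The order-loss problem with bag of words can be seen in a few lines; the small vocabulary here is again hypothetical:

```python
from collections import Counter

vocab = ["the", "food", "was", "good", "bad", "not", "at", "all"]

def bag_of_words(sentence, vocab):
    """Fixed-length count vector: one slot per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

a = bag_of_words("the food was good not bad at all", vocab)
b = bag_of_words("the food was bad not good at all", vocab)
print(a == b)  # True -- opposite meanings, identical representation
```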
Before, we were having issues because we were just using a fixed window of size 2. What if we extended that to be a fixed window of size 7? And we think that by looking at seven words, we can get most of the context that we need. Well, yeah, okay, we can do that. Now we have another fixed length vector, just like before. It's bigger, but it's still fixed length. We have allocated space for each of the seven words. We can feed this into a model and try to make a prediction. The problem here is this: consider the scenario where we're feeding this input vector into a feed-forward neural network. Each of those inputs, each of those ones and zeros, has a separate weight connecting it to the network. If we see the words this morning at the beginning of the sentence very, very commonly, the network will learn that this morning represents a time or a setting. If this morning then appears at the end of the sentence, we'll have a lot of trouble recognizing that, because the weights at the end of the vector never saw that phrase before, and the weights from the beginning of the vector are not being shared with the end. In other words, things we learn about the sequence won't transfer if they appear at different points in the sequence, because we're not sharing any parameters. Alright, so you kind of see all the problems that arise with sequences now and why we need a different framework.
Concepts And Techniques In Recurrent Neural Networks
Share Parameters (06:50)
Specifically, we want to be able to deal with variable length sequences, we want to maintain sequence order so we can keep all of that sequential information, we want to keep track of longer-term dependencies rather than cutting it off too short.
Recurrent Neural Networks (07:07)
And we want to be able to share parameters across the sequence so we don't have to relearn things across the sequence. Because this is a class about deep learning, I'm going to talk about how to address these problems with neural networks. But know that time series modeling and sequential modeling is a very active field in machine learning, and it has been, and there are lots of other machine learning methods that have been developed to deal with these problems. But for now, I'll talk about recurrent neural networks. Okay. So a recurrent neural network is architected in the same way as a normal neural network. We have some inputs, we have some hidden layers, and we have some outputs. The only difference is that each hidden unit is doing a slightly different function. So let's take a look at this one hidden unit to see exactly what it's doing. A recurrent hidden unit computes a function of an input and its own previous output. Its own previous output is also known as the cell state, and in the diagram it's denoted by s. The subscript is the time step.
Initial State (08:11)
So at the very first time step, t equals 0, the recurrent unit computes a function of the input at t equals 0 and of its initial state. Similarly, at the next time step, it computes a function of the new input and its previous cell state. If you look at the function at the bottom, the function to compute S2, you'll see it's really similar to the function that a hidden unit in a feed forward network computes. The only difference is that we're adding in an additional term to incorporate its own previous state. A common way of viewing recurrent neural networks is by unfolding them across time. So this is the same hidden unit at different points in time. Here you can see that at every point in time, it takes its input, its own previous state, and the new input at that time step. One thing to notice here is that throughout the sequence, we're using the same weight matrices, w and u. This solves our problem of parameter sharing. We don't have new parameters for every point in the sequence. Once we learn something, it can apply at any point in the sequence. This also helps us deal with variable length sequences. Because we're not pre-specifying the length of the sequence, we don't have separate parameters for every point in the sequence. So in some cases, we could unroll this RNN to four time steps. In other cases, we could unroll it to 10 time steps. The final thing to notice is that S sub N, the cell state at time n, can contain information from all of the past time steps.
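A minimal sketch of that recurrent update, assuming a tanh activation and randomly chosen (untrained) weights, could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 3

# The same W and U are reused at every time step -- parameter sharing.
W = rng.normal(size=(hidden, hidden))   # recurrent weights (state -> state)
U = rng.normal(size=(hidden, inputs))   # input weights (input -> state)

def rnn_forward(xs, s0=None):
    """Unroll a simple recurrent unit: s_t = tanh(W s_{t-1} + U x_t)."""
    s = np.zeros(hidden) if s0 is None else s0
    states = []
    for x in xs:
        s = np.tanh(W @ s + U @ x)
        states.append(s)
    return states

# Works for any sequence length -- no parameter depends on the length.
short = rnn_forward([rng.normal(size=inputs) for _ in range(4)])
long = rnn_forward([rng.normal(size=inputs) for _ in range(10)])
print(len(short), len(long))  # 4 10
```

Because the loop just keeps applying the same weights, the final state is a function of every input it has seen, which is where the long-term-dependency behavior comes from.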
Long-Term Dependencies (09:50)
Notice that each cell state is a function of the previous cell state, which is a function of the previous cell state, and so on. So this kind of solves our issue of long-term dependencies, because at a time step very far in the future, that cell state encompasses information about all of the previous cell states.
Gradient Descent (10:05)
All right, so now that you understand what a recurrent neural network is, and just to clarify, I've shown you one hidden unit in the previous slide, but in a full network, you would have many, many of those hidden units and even many layers of many hidden units. So now we can talk about how you would train a recurrent neural network. It's really similar to how you train a normal neural network. It's back propagation. There's just an additional time dimension. As a reminder, in back propagation, we want to find the parameters that minimize some loss function. The way that we do this is by first taking the derivative of the loss with respect to each of the parameters and then shifting the parameters in the opposite direction in order to try and minimize the loss. This process is called gradient descent. So one difference with RNNs is that we have many time steps. So we can produce an output at every time step. Because we have an output at every time step, we can have a loss at every time step. Rather than just one single loss at the end. And the way that we deal with this is pretty simple. The total loss is just the sum of the losses at every time step.
Total Loss Gradient with Respect to W (11:28)
Similarly, the total gradient is just the sum of the gradients at every time step. So we can try this out by walking through this gradient computation for a single parameter, w. w is the weight matrix that we're multiplying by our inputs. We know that the total gradient, the derivative of the loss with respect to w, will be the sum of the gradients at every time step. So for now, we can focus on a single time step, knowing that at the end, we would do this for each of the time steps and then sum them up to get the total gradient. So let's take time step two. We can solve this gradient using the chain rule. So the derivative of the loss with respect to w is the derivative of the loss with respect to the output, times the derivative of the output with respect to the cell state at time two, times the derivative of the cell state with respect to w. So this seems fine, but let's take a closer look at this last term. You'll notice that s2 also depends on s1, and s1 also depends on w. So we can't just leave that last term as a constant. We actually have to expand it out farther. Okay, so how do we expand this out farther? What we really want to know is how exactly does the cell state at time step 2 depend on w? Well, it depends directly on w, because it feeds right in. We also saw that s2 depends on s1, which depends on w. And you can also see that s2 depends on s0, which also depends on w. In other words, writing this as a summation, you can see that the last two terms are basically summing the contributions of w in previous time steps to the error at time step t. This is key to how we model longer term dependencies. This gradient is how we shift our parameters, and our parameters define our network. By shifting our parameters such that they include contributions to the error from past time steps, they're shifted to model longer term dependencies.
And here I'm just writing it as a general sum, not just for time step two. Okay, so this is basically the process of back propagation through time. You would do this for every parameter in your network, and then use those gradients in the process of gradient descent. In practice, RNNs are a bit difficult to train, so I want to go through why that is and some ways that we can address these issues.
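One way to see the "sum of contributions across time steps" idea is to check it numerically. Below is a hedged sketch with a hypothetical scalar RNN, s_t = tanh(w·s_{t-1} + u·x_t), and a squared-error loss; it accumulates ds_t/dw forward in time and compares the result against a finite-difference estimate. Real implementations compute the same total gradient with reverse-mode backpropagation through time.

```python
import numpy as np

def forward(w, u, xs):
    """Scalar RNN: s_t = tanh(w*s_{t-1} + u*x_t), with s_{-1} = 0."""
    s, states = 0.0, []
    for x in xs:
        s = np.tanh(w * s + u * x)
        states.append(s)
    return states

def loss(w, u, xs, ys):
    """Total loss = sum of per-time-step squared errors."""
    return sum((s - y) ** 2 for s, y in zip(forward(w, u, xs), ys))

def grad_w(w, u, xs, ys):
    """dL/dw: each s_t depends on w directly and through s_{t-1},
    so ds_t/dw = (1 - s_t^2) * (s_{t-1} + w * ds_{t-1}/dw)."""
    s, ds_dw, total = 0.0, 0.0, 0.0
    for x, y in zip(xs, ys):
        s_new = np.tanh(w * s + u * x)
        ds_dw = (1 - s_new ** 2) * (s + w * ds_dw)  # recursive chain rule
        total += 2 * (s_new - y) * ds_dw            # sum over time steps
        s = s_new
    return total

w, u = 0.5, 0.3
xs, ys = [1.0, -1.0, 0.5], [0.2, -0.1, 0.4]
eps = 1e-6
numeric = (loss(w + eps, u, xs, ys) - loss(w - eps, u, xs, ys)) / (2 * eps)
print(abs(grad_w(w, u, xs, ys) - numeric) < 1e-6)  # True
```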
Chain Rule Across Time Steps (14:17)
So let's go back to the summation. As a reminder, this is the derivative of the loss with respect to w, and this is what we would use to shift our parameters w. The last two terms are considering the error of w at all of the previous time steps. Let's take a look at this one term: the derivative of the cell state at time step two with respect to each of the previous cell states. You might notice that this itself is also a chain rule, because s2 depends on s1, and s1 depends on s0. We can expand this out farther. This is just for the derivative of s2 with respect to s0. But what if we were looking at a time step very far in the future, like time step n? That term would expand into a product of n terms. And okay, you might be thinking, so what? Well, notice that as the gap between time steps gets bigger and bigger, this product in the gradient gets longer and longer. And if we look at each of these terms, what are each of these terms?
W and F'(x) (15:32)
They all kind of take the same form: the derivative of a cell state with respect to the previous cell state. That term can be written like this. The actual formula isn't that important; just notice that it's a product of two terms, W's and F primes. W's are our weight matrices. These are sampled mostly from a standard normal distribution, so most of the terms will be less than one. F prime is the derivative of our activation function. If we use an activation function such as the hyperbolic tangent or a sigmoid, F prime will always be less than one. In other words, we're multiplying a lot of small numbers together in this product. Okay, so what does this mean? Basically, recall that this product is how we're adding the gradient from past time steps to the gradient at a future time step. What's happening then is that errors due to further and further back time steps have increasingly smaller gradients, because that product for further back time steps will be longer, and since the numbers are all decimals, it will become increasingly smaller. What this ends up meaning at a high level is that our parameters will become biased to capture shorter term dependencies. The errors that arise from further and further back time steps will be harder and harder to propagate into the gradient at future time steps. Recall this example that I showed at the beginning. The whole point of using recurrent neural networks is that we wanted to model long-term dependencies. But if our parameters are biased to capture short-term dependencies, even if they see the whole sequence, the parameters will become biased to predict things based mostly on the past couple words. OK, so now I'm going to go through a couple of methods that are used to address this issue in practice and work pretty well.
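The shrinking product can be simulated directly. This sketch multiplies together Jacobian-like terms of the form W·diag(f′), where the small weight scale and the f′ values in (0, 1] are illustrative stand-ins for standard-normal weights and tanh derivatives:

```python
import numpy as np

def gradient_norm_after(n_steps, hidden=8, seed=0):
    """Norm of the product of n_steps terms ds_t/ds_{t-1} ~ W @ diag(f').
    Weight scale and f' range are illustrative, not from the lecture."""
    rng = np.random.default_rng(seed)
    grad = np.eye(hidden)
    for _ in range(n_steps):
        W = rng.normal(size=(hidden, hidden)) * 0.1       # small weights
        fprime = np.diag(rng.uniform(0.0, 1.0, size=hidden))  # tanh' in (0, 1]
        grad = grad @ (W @ fprime)
    return np.linalg.norm(grad)

# The farther back the time step, the longer the product, the smaller the gradient.
print(gradient_norm_after(20) < gradient_norm_after(2))  # True
```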
Choice of Activation Function (17:57)
The first one is the choice of activation function. So you saw that one of the terms that was making that product really small was the f prime term. f prime is the derivative of whatever activation function we choose to use. Here I've plotted the derivatives of some common activation functions. You can see that the derivative of hyperbolic tangent and sigmoid is always less than 1. In fact, for sigmoid, it's always less than 0.25.
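Here's a quick numerical check of those derivative bounds:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 10001)  # dense grid including x = 0

sig = 1 / (1 + np.exp(-x))
d_sigmoid = sig * (1 - sig)        # sigmoid': peaks at 0.25
d_tanh = 1 - np.tanh(x) ** 2       # tanh': peaks at 1.0
d_relu = (x > 0).astype(float)     # ReLU': exactly 1 for every positive input

print(round(d_sigmoid.max(), 4), round(d_tanh.max(), 4))  # 0.25 1.0
```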
ReLU and Weight Initialization (18:21)
If instead we choose to use an activation function like ReLU, its derivative is always 1 for inputs above 0. So that will at least prevent the f prime terms from shrinking the gradient. Another solution would be how we initialize our weights. If we initialize the weights from a normal distribution, they'll be mostly less than 1, and they'll immediately shrink the gradients. If instead we initialize the weights to something like the identity matrix, it'll at least prevent that W term from shrinking that product, at least at the beginning. The next solution is very different. It involves actually adding a lot more complexity to the network, using a more complex type of cell called a gated cell. Here, rather than each node just being that simple RNN unit that I showed at the beginning, we'll replace it with a much more complicated cell. A very common gated cell is something called an LSTM, or a long short-term memory cell. Like its name implies, LSTM cells are able to keep memory within the cell state unchanged for many time steps. This allows them to effectively model longer term dependencies. So I'm going to go through a very high level overview of how LSTMs work. But if you're interested, feel free to email me or ask me afterwards, and I can direct you to some more resources to read about LSTMs in a lot more detail. All right, so LSTMs basically have a three-step process. The first step is to forget irrelevant parts of the cell state. For example, if we're modeling a sentence and we see a new subject, we might want to forget things about the old subject, because we know that future words will be conjugated according to the new subject. The next step is an update step. Here's where we actually update the cell state to reflect the information from the new input.
In this example, like I said, if we've just seen a new subject, this is where we actually update the cell state with the gender of the new subject, or whether it is plural or singular.
Finally, we want to output certain parts of the cell state. So if we've just seen a subject, we have an idea that the next word might be a verb. So we'll output information relevant to predicting a verb, like the tense. Each of these three steps is implemented using a set of logic gates, and the logic gates are implemented using sigmoid functions. To give you some intuition on why LSTMs help with the vanishing gradient problem: first, the forget gate, the first step, can equivalently be called the remember gate, because there you're choosing what to forget and what to keep in the cell state. The forget gate can choose to keep information in the cell state for many, many time steps. There's no activation function or anything else shrinking that information. The second thing is that the cell state is separate from what's outputted. This is not true of normal recurrent units. Like I showed you before, in a simple recurrent unit, the cell state is the same thing as what that cell outputs. With an LSTM, it has a separate cell state, and it only needs to output information relevant to the prediction at that time step. Because of this, it can keep information in the cell state which might not be relevant at this time step, but might be relevant at a much later time step, so it can keep that information without being penalized for that. Finally, I didn't indicate this explicitly in the diagram, but the update step happens through an additive function, not through a multiplicative function. So when we take the derivative, there's not a huge expansion. So now I just want to move on to going over some possible tasks. The first task is classification. Here we want to classify tweets as positive, negative, or neutral. This task is also known as sentiment analysis. The way that we would design a recurrent neural network to do this is actually not by having an output at every time step. We only want one output for the entire sequence.
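The three LSTM steps (forget, update, output) can be sketched roughly as follows; biases are omitted for brevity, and the weights are random placeholders rather than trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM step over input x, output h, and cell state c."""
    Wf, Wi, Wc, Wo = params        # each maps [h; x] -> hidden (placeholders)
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z)            # forget gate: what to keep in the cell state
    i = sigmoid(Wi @ z)            # input gate: what to write
    c_tilde = np.tanh(Wc @ z)      # candidate update
    c = f * c + i * c_tilde        # additive update -- no long product of terms
    o = sigmoid(Wo @ z)            # output gate: what to expose
    h = o * np.tanh(c)             # output is separate from the cell state
    return h, c

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
params = [rng.normal(size=(hidden, hidden + inputs)) for _ in range(4)]
h, c = np.zeros(hidden), np.zeros(hidden)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=inputs), h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how the two intuitions from the lecture show up in the code: the cell state c is updated additively and is distinct from the output h.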
And so we'll take in the entire sequence, the entire tweet, one word at a time, and at the very end we'll produce an output, which will actually be a probability distribution over possible classes, where our classes in this case would be positive, negative, or neutral. Note that the only information that is producing the output at the end is the final cell state. And so that final cell state kind of has to summarize all of the information from the entire sequence. So we can imagine, if we have very complicated tweets or, well, I don't know if that's possible, but very complicated paragraphs or sentences, we might want to create a bigger network with more hidden states to allow that last state to be more expressive. The next task would be something like music generation. And I'll see if this will play. You can kind of hear it. OK, so that was music generated by an RNN, which is pretty cool and something you're actually also going to do in the lab today.
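The many-to-one classification setup described above, where the network reads the whole sequence and predicts once from the final state, might be sketched like this (untrained weights and made-up dimensions, for illustration only):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, inputs, n_classes = 8, 5, 3  # classes: positive / negative / neutral

W = rng.normal(size=(hidden, hidden)) * 0.1
U = rng.normal(size=(hidden, inputs)) * 0.1
V = rng.normal(size=(n_classes, hidden)) * 0.1  # reads out the *final* state only

def classify(xs):
    """Run the whole sequence through, then predict once at the very end."""
    s = np.zeros(hidden)
    for x in xs:                 # one word vector per time step
        s = np.tanh(W @ s + U @ x)
    return softmax(V @ s)        # probability distribution over classes

tweet = [rng.normal(size=inputs) for _ in range(6)]
probs = classify(tweet)
print(probs.shape, round(probs.sum(), 6))  # (3,) 1.0
```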
Music Generation (23:58)
But an RNN can produce music because music is just a sequence. And the way that you would construct a recurrent neural network to do this would be, at every time point, taking in a note and producing the most likely next note given the notes that you've seen so far. So here you would produce an output at every time step. The final task is machine translation. Machine translation is interesting because it's actually two recurrent neural networks side by side.
The first is an encoder. The encoder takes as input a sentence in a source language like English. Then there's a decoder, which produces the same sentence in a target language like French. Notice in this architecture that the only information passed from the encoder to the decoder is the final cell state. And the idea is that that final cell state should be kind of a summary of the entire encoder sentence. And given that summary, the decoder should be able to figure out what the encoder sentence was about and then produce the same sentence in a different language. You can imagine, though, that, okay, maybe this is possible for a really simple sentence like, the dog eats. Maybe we can encode that in the final cell state. But if we had a much more complicated sentence, or a much longer sentence, it would be very difficult to try and summarize the whole thing in that one cell state. So what's typically done in practice for machine translation is something called attention. With attention, rather than just taking in the final cell state to the decoder at each time step, we take in a weighted sum of all of the previous cell states. So in this case, we're trying to produce the first word. We'll take in a weighted sum of all of the encoder states. Most of the weight will probably be on the first state, because that's what would be most relevant to producing the first word. Then when we produce the second word, most of the weight will probably be on the second cell state, but we might have some on the first and the third to try and get an idea of the tense or the gender of this noun. And the same thing for all of the cell states. The way that you implement this is just by including those weight parameters in the weighted sum as additional parameters that you train using backpropagation, just like everything else.
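A bare-bones sketch of that weighted sum, using dot-product scores as a stand-in for the learned scoring function (all values here are random placeholders):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, src_len = 4, 5

encoder_states = rng.normal(size=(src_len, hidden))  # one state per source word
decoder_state = rng.normal(size=hidden)              # current decoder state

# Dot-product scores; in practice the scoring function is itself learned.
scores = encoder_states @ decoder_state
weights = softmax(scores)            # how much to attend to each source position
context = weights @ encoder_states   # weighted sum of ALL encoder states

print(round(weights.sum(), 6))  # 1.0 -- a distribution over source positions
```

At each decoding step the decoder would recompute these weights, so different output words can focus on different parts of the source sentence.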
Okay so I hope that you have an idea now why we need a different framework to model sequences and how recurrent neural networks can solve some of the issues that we saw at the beginning, as well as an idea of how to train them and solve some of the vanishing gradient problems. I've talked a lot about language, but you can imagine using these exact same recurrent neural networks for modeling time series or waveforms, or doing other interesting sequence prediction tasks, like predicting stock market trends, or summarizing books or articles. And maybe you'll consider some sequence modeling tasks for your final project.