Recurrent Neural Network (RNN)
To make predictions on sequences of data | Sequence modeling
What makes data sequential?
Suppose we have a set of observations recorded over time. If past observations tell us something about future observations, then we have sequential data.
Is this sequential data?
- If we flip a coin at regular intervals and record the outcome:
No! This is not sequential data, because the outcome of a coin flip is independent of previous flips.
- Daily temperature recordings:
This constitutes a sequence, because if we look at daily temperature recordings for a week, we can see a high correlation between consecutive days.
- Oil price:
This constitutes a sequence, because if we look at price variation records for a week, we can see a high correlation. The past price does tell us something about the future price.
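One rough way to check whether past observations carry information about future ones is to measure the correlation between each observation and the next (the lag-1 autocorrelation). The sketch below uses synthetic data only: the coin flips and the "temperature" series are generated for illustration, and the temperature model (yesterday's value pulled toward 20 degrees plus random noise) is an assumption, not real weather data.

```python
import random

def lag1_autocorrelation(series):
    """Pearson correlation between series[t] and series[t + 1]."""
    past, future = series[:-1], series[1:]
    mean_p = sum(past) / len(past)
    mean_f = sum(future) / len(future)
    cov = sum((p - mean_p) * (f - mean_f) for p, f in zip(past, future))
    var_p = sum((p - mean_p) ** 2 for p in past)
    var_f = sum((f - mean_f) ** 2 for f in future)
    return cov / (var_p * var_f) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Independent coin flips: 1 for heads, 0 for tails.
coin_flips = [rng.randint(0, 1) for _ in range(2000)]

# Synthetic "daily temperature": today's value depends on yesterday's value.
temperatures = [20.0]
for _ in range(1999):
    previous = temperatures[-1]
    temperatures.append(20.0 + 0.9 * (previous - 20.0) + rng.gauss(0.0, 1.0))

coin_corr = lag1_autocorrelation(coin_flips)
temp_corr = lag1_autocorrelation(temperatures)
print(f"coin flips  lag-1 autocorrelation: {coin_corr:+.3f}")
print(f"temperature lag-1 autocorrelation: {temp_corr:+.3f}")
```

The coin flips should come out close to zero and the temperature series close to one, which matches the intuition above: only the second series lets the past say something about the future.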
RNNs are widely used in the following domains and applications:
- Prediction problems.
- Language Modelling and Generating Text.
- Machine Translation.
- Speech Recognition.
- Generating Image Descriptions.
- Video Tagging.
- Text Summarization.
- Call Center Analysis.
- Image recognition tasks such as face detection and OCR.
- Other applications, such as music composition.
RNNs are generally useful for sequence prediction problems. These problems come in many forms and are best described by the types of inputs and outputs they involve.
Sequence prediction problems include:
- One-to-Many: A single observation as input is mapped to a sequence with multiple steps as output.
An example is image captioning, where the input is an image and the output is the generated caption.
- Many-to-One: A sequence of multiple steps as input is mapped to a class or quantity prediction.
An example is smart reply, where the input is text or voice and the output is an item from a pre-populated dictionary.
- Many-to-Many: A sequence of multiple steps as input is mapped to a sequence with multiple steps as output. The Many-to-Many problem is often referred to as sequence-to-sequence, or seq2seq.
An example is language translation.
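The three input/output patterns can be sketched with a stand-in for a recurrent cell. Everything here is hypothetical: `toy_cell` is not a trained network, just a running mix of the inputs, and it is only meant to show which values go in and which come out. The many-to-many variant shown emits one output per input step; translation systems usually read the whole input before producing output, which this sketch does not model.

```python
def toy_cell(event, hidden):
    """Stand-in for a trained RNN cell: mixes the new event into the running state."""
    new_hidden = 0.5 * hidden + event
    return new_hidden, new_hidden  # (next hidden state, output for this step)

def many_to_many(events):
    """Sequence in, one output per step out."""
    hidden, outputs = 0.0, []
    for event in events:
        hidden, output = toy_cell(event, hidden)
        outputs.append(output)
    return outputs

def many_to_one(events):
    """Sequence in, a single output (from the last step) out."""
    return many_to_many(events)[-1]

def one_to_many(event, steps):
    """Single observation in, sequence out: each output is fed back as the next input."""
    hidden, outputs = 0.0, []
    for _ in range(steps):
        hidden, output = toy_cell(event, hidden)
        outputs.append(output)
        event = output
    return outputs

print(many_to_many([1.0, 2.0, 3.0]))  # [1.0, 2.5, 4.25]
print(many_to_one([1.0, 2.0, 3.0]))   # 4.25
print(one_to_many(1.0, steps=3))      # [1.0, 1.5, 2.25]
```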
Convolutional Neural Networks (CNNs) are excellent models in the image domain. Can’t we use CNNs for modeling sequential data?
Convolution filters are powerful: they can learn location-independent patterns. We can use CNNs for sequential data modeling because locality is important for both images and sequences.
To put it another way: just as locality plays a role in the image domain, where pixels in the neighborhood of a given pixel are much more likely to be related, so does locality play a role in sequence modeling.
In the examples of sequential data at the beginning, we saw that locality is important for sequence data.
Daily temperature recordings constitute a sequence because, if we look at the recordings for a week, we can see a high correlation.
The temperature on, say, Wednesday is highly correlated with the temperature on Monday and Tuesday (lag features) and on Thursday and Friday (lead features).
CNNs can be applied to the sequence domain, but their performance didn’t improve much over DNNs!
Why didn’t CNNs perform much better than DNNs?
DNNs learn the contributions of specific regions of the feature space, so they aren’t well suited to the image domain, where an object is often the same under spatial transformations such as shifting, zooming, flipping, or rotating.
In the image domain we needed a model that is robust against spatial translation. Enter CNNs: powerful models capable of learning location-independent filters.
In the sequence domain we have more challenges:
In the context of natural language processing, for example, consider a batch of sentences as input to our sequence model. A sentence is a sequence of words, and not all sentences have the same number of words. Some sentences are short, like “we love science”, but some are wordy, like “Blue and yellow flowers as Queen meets Trudeau”.
Previously we learned about word embeddings. A word embedding is a mapping from words in the vocabulary to dense vectors. We can convert each word in a sentence to a dense vector, but since not all sentences have the same number of words, we still have to tackle the varying sequence length problem. The examples inside a batch need to be the same size and shape, so we can employ techniques like truncation and padding to get fixed-length inputs, but ultimately this might hurt performance.
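The cutting and padding step can be sketched as follows, using the two example sentences from above. The helper name `pad_or_truncate`, the `<pad>` token, and the target length of 5 are all assumptions chosen for illustration.

```python
def pad_or_truncate(tokens, length, pad_token="<pad>"):
    """Return exactly `length` tokens: cut long sentences, pad short ones."""
    clipped = tokens[:length]
    return clipped + [pad_token] * (length - len(clipped))

sentences = [
    "we love science".split(),
    "Blue and yellow flowers as Queen meets Trudeau".split(),
]

batch = [pad_or_truncate(sentence, length=5) for sentence in sentences]
for row in batch:
    print(row)
```

The short sentence gains two padding tokens that carry no meaning, and the long one loses "Queen meets Trudeau" entirely, which is the kind of information loss that can hurt performance.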
What we need is a sequence model that is robust against this varying sequence length problem, and the Recurrent Neural Network (RNN) was the solution.
How do RNNs address variable-length sequences?
RNNs handle variable-length sequences by recasting the problem: instead of representing an entire variable-length sequence at once, they represent a single event given what has come before.
RNNs work differently from other models. Instead of accepting a fixed-length input representing an entire sequence, as DNNs do, an RNN accepts a fixed-length representation of an event along with a fixed-length representation of what it has seen previously.
Two key ideas for RNNs:
- An RNN learns a compact hidden state that represents the past.
- The input to an RNN is a concatenation of the original, stateless input and the hidden state.
The idea of a persistent hidden state that is learned from ordered inputs is what distinguishes an RNN from other models like DNNs.
A DNN carries no state from one input to the next during prediction, whereas an RNN updates its hidden state at every step. This is what allows RNNs to remember what they have seen previously.
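To make the two key ideas concrete, here is a minimal sketch in plain Python of a cell that scans sequences of different lengths and always ends with a hidden state of the same size. The sizes, the random (untrained) weights, and the names `rnn_cell` and `encode` are assumptions made for this illustration.

```python
import math
import random

INPUT_SIZE, HIDDEN_SIZE = 3, 4  # sizes chosen for illustration

rng = random.Random(0)
# One shared parameter set: (INPUT_SIZE + HIDDEN_SIZE) rows, HIDDEN_SIZE columns.
weights = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_SIZE)]
           for _ in range(INPUT_SIZE + HIDDEN_SIZE)]
bias = [0.0] * HIDDEN_SIZE

def rnn_cell(event, hidden):
    """One step: concatenate the event with the previous hidden state, then apply tanh."""
    x = event + hidden  # list concatenation: always INPUT_SIZE + HIDDEN_SIZE values
    return [
        math.tanh(sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j])
        for j in range(HIDDEN_SIZE)
    ]

def encode(sequence):
    """Scan a sequence of any length and return the final hidden state."""
    hidden = [0.0] * HIDDEN_SIZE  # initial state: nothing seen yet
    for event in sequence:
        hidden = rnn_cell(event, hidden)
    return hidden

short_sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_sequence = short_sequence + [[0.0, 0.0, 1.0]] * 5

h_short = encode(short_sequence)
h_long = encode(long_sequence)
print(len(short_sequence), "events ->", len(h_short), "state values")
print(len(long_sequence), "events ->", len(h_long), "state values")
```

No padding or truncation is needed here: the cell only ever sees one event plus one hidden state, so the sequence length never appears in the shape of its input.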
RNNs can create powerful representations of the past
RNNs accept sequences one event at a time and develop a representation of what they have seen previously as they scan the input. This aligns with how humans process a sequence. We have talked about this in a previous tutorial, but let’s quickly go through an example:
“John loves dogs and his dogs wag their ???”
Suppose you are talking to your friend on the phone, but the connection is cut off in the middle of the conversation. The sequence above is what you heard before the call ended abruptly.
What would be the possible next word? Are you incapacitated by the abrupt halt?
The fact that you are able to guess the next word suggests that you build up a representation as you listen and use it to predict the next word. Intuitively, an RNN does the same: it accepts sequences one event at a time and develops a representation of what it has seen previously as it scans the input.
RNN architecture
I hope I didn’t scare you with these diagrams. I’ll explain them in simple terms.
Let’s look at a simple view of an RNN.
Suppose we have a sequence S = [X1, X2, X3, ...], where X1, X2, etc. are events occurring at different time steps, and we feed this sequence into an RNN. The RNN scans the input across the sequence and learns how to extract information from a given event in order to make use of it at a later point in the sequence.
- The input to an RNN is a concatenation of the hidden state from the previous iteration [h0] and the event [X1] under consideration.
- The RNN outputs a hidden state [h1] and its prediction for the next event in the sequence [y1].
- The hidden state [h1] goes into the next iteration, where we concatenate it with the event [X2] and pass it into the RNN cell as input.
- The RNN again outputs a hidden state [h2] and its prediction for the next event in the sequence [y2].
This repeats until we reach the end of the sequence.
- This architecture passes the hidden state into the next iteration via a recurrent connection, a repeating structure.
- RNNs have a recurrent connection from their hidden layer back to the input.
- The same architecture is used repeatedly, once for each event in the sequence.
- Even though the architecture is used multiple times, the parameters (weights and biases) do not change from step to step. The same set of parameters is used for processing all events in the sequence.
Don’t get confused by the unrolled view!
We are using the same RNN cell to process all events in the sequence. The unrolled view is just an aid for understanding the concept. The model parameters are the same.
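A tiny sketch can show that the rolled and unrolled views are the same computation. It assumes a scalar cell with made-up parameter values; using separate weights for the event and the hidden state is equivalent to multiplying the concatenated pair [event, hidden] by one weight vector.

```python
import math

# One shared parameter set (arbitrary values, for illustration only).
W_INPUT, W_HIDDEN, BIAS = 0.8, 0.5, 0.1

def rnn_cell(event, hidden):
    """Scalar RNN cell: new hidden state from the current event and the previous state."""
    return math.tanh(W_INPUT * event + W_HIDDEN * hidden + BIAS)

events = [1.0, -0.5, 0.25]

# Rolled view: one cell called in a loop.
hidden = 0.0
for event in events:
    hidden = rnn_cell(event, hidden)

# Unrolled view: the same cell written out once per time step.
h1 = rnn_cell(events[0], 0.0)
h2 = rnn_cell(events[1], h1)
h3 = rnn_cell(events[2], h2)

print(hidden == h3)  # True: both views run the same cell with the same parameters
```

Unrolling adds no parameters: there are still only three numbers to learn, however long the sequence is.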
The forward propagation in RNN
- The RNN accepts one event at a time and concatenates it with the hidden state vector from the previous iteration to form the input.
- This fixed-length input is represented by X.
Operations performed on the fixed-length input X:
- Hidden state: H(t) = tanh(X . Weights_Set1 + bias_1)
This hidden state vector is passed into the next iteration. It is the knowledge carried over from the past iterations.
- To predict the next event in the sequence, the hidden state vector is channeled into a dense layer, where we compute a weighted sum using a different set of weights and apply an activation function:
Dense layer = ReLU(H(t) . Weights_Set2 + bias_2)
- Finally, we take the outputs from the dense layer and pass them through a softmax layer to get a prediction for the next event.
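The three operations above can be sketched for a single time step in plain Python. The sizes and the random, untrained weights are assumptions. The sketch also assumes the dense layer has one unit per possible next event, so the softmax can be applied to its outputs directly.

```python
import math
import random

INPUT_SIZE, HIDDEN_SIZE, NUM_EVENTS = 3, 4, 3  # sizes chosen for illustration
rng = random.Random(0)

def random_matrix(rows, cols):
    """Small random weights; a trained model would have learned these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def affine(vector, matrix, bias):
    """Row vector times (rows x cols) matrix, plus bias."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix)) + bias[j]
        for j in range(len(bias))
    ]

def softmax(scores):
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights_set1 = random_matrix(INPUT_SIZE + HIDDEN_SIZE, HIDDEN_SIZE)
bias_1 = [0.0] * HIDDEN_SIZE
weights_set2 = random_matrix(HIDDEN_SIZE, NUM_EVENTS)
bias_2 = [0.0] * NUM_EVENTS

def forward_step(event, previous_hidden):
    x = event + previous_hidden  # list concatenation gives the fixed-length input X
    hidden = [math.tanh(z) for z in affine(x, weights_set1, bias_1)]
    dense = [max(0.0, z) for z in affine(hidden, weights_set2, bias_2)]  # ReLU
    probabilities = softmax(dense)
    return hidden, probabilities

hidden, probabilities = forward_step([1.0, 0.0, 0.0], [0.0] * HIDDEN_SIZE)
print("hidden state:", [round(h, 3) for h in hidden])
print("next-event probabilities:", [round(p, 3) for p in probabilities])
```

The returned `hidden` would be fed into the next call to `forward_step` together with the next event, while `probabilities` is the prediction for this step.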
An RNN remembers some information/context through the hidden layer activations that get passed from one time step to the next.
This allows a unidirectional RNN to take information from the past to process later inputs.
A bidirectional RNN can take context from both the past and the future.
An intuitive example to solidify the concepts:
Suppose we have a sequence of text as input and we are trying to predict the next word in the sequence given the current word.
The word w(n-1) is the input. We concatenate it with the hidden state h(n-1) from the previous iteration to form X(n-1).
X(n-1) gets multiplied by a set of weights, and a bias is added.
We use the hyperbolic tangent as the activation function.
The output of the hidden layer, h(n), is passed into the next iteration through a recurrent connection. The same hidden vector is also channeled through a dense layer followed by a softmax layer to get the prediction for the next word in the sequence, w(n).
The softmax layer maps its input vector to probabilities. So our model predicts the nth word given the (n-1)th word and a hidden state from the past iteration.
- w(n-1) → The previous word vector
- h(n-1) → Hidden state vector from the past
The information the RNN has learned from the previous iterations
- h(n) → Hidden vector output by the RNN after taking the previous word w(n-1) and h(n-1) as input
- w(n) → The prediction for the next word in the sequence
Back propagation in RNN
The idea of unfolding the network plays a big part in the way the backward pass is implemented for recurrent neural networks. Importantly, the back propagation of error for a given time step depends on the activation of the network at the prior time step. Error is propagated back to the first input time step of the sequence so that the error gradient can be calculated and the weights of the network can be updated.
Back propagation through time
In ordinary back propagation we update the weights using the loss at the final layer. For RNNs, however, the loss at a given time step also depends on the activation of the network at the prior time step.
Conceptually, back propagation through time works by unrolling the RNN. For each time step there is one input X(i), one copy of the network parameters, and one output Y(i). Errors are then calculated and accumulated for each time step.
The weight update for each parameter of the model combines the partial derivatives from all time steps (here, by averaging them). This approach is called back propagation through time.
Because each parameter is updated with a combination of losses from all the iterations, the model as a whole is pressured to preserve information that’s useful in the short term and the long term and to throw away what is not.
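Back propagation through time can be sketched on a scalar RNN small enough to differentiate by hand. Several simplifying assumptions are made here: the hidden state itself is used as the prediction, the loss is a squared error summed over time steps, and all parameter, input, and target values are invented. The gradients are summed across time steps; averaging them instead would only rescale the result by a constant factor.

```python
import math

def forward(params, inputs):
    """Run the scalar RNN and return every hidden state, with the initial state 0.0 first."""
    w_x, w_h, b = params
    hiddens = [0.0]
    for x in inputs:
        hiddens.append(math.tanh(w_x * x + w_h * hiddens[-1] + b))
    return hiddens

def loss(params, inputs, targets):
    """Squared error between each hidden state and its target, summed over time."""
    hiddens = forward(params, inputs)
    return sum(0.5 * (h - t) ** 2 for h, t in zip(hiddens[1:], targets))

def bptt(params, inputs, targets):
    """Gradients of the total loss w.r.t. (w_x, w_h, b), accumulated over time steps."""
    w_x, w_h, b = params
    hiddens = forward(params, inputs)
    grad_w_x = grad_w_h = grad_b = 0.0
    grad_from_future = 0.0  # error arriving from later time steps
    for t in reversed(range(len(inputs))):
        h, h_prev = hiddens[t + 1], hiddens[t]
        grad_h = (h - targets[t]) + grad_from_future  # this step's error plus future error
        grad_z = grad_h * (1.0 - h * h)               # back through tanh
        grad_w_x += grad_z * inputs[t]                # shared parameter: add every step's share
        grad_w_h += grad_z * h_prev
        grad_b += grad_z
        grad_from_future = grad_z * w_h               # hand the error to the previous step
    return grad_w_x, grad_w_h, grad_b

params = (0.7, 0.4, 0.1)
inputs = [1.0, -0.5, 0.3, 0.8]
targets = [0.5, 0.0, 0.2, 0.6]
analytic = bptt(params, inputs, targets)

# Sanity check against central finite differences.
eps = 1e-6
numeric = []
for i in range(len(params)):
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    numeric.append((loss(up, inputs, targets) - loss(down, inputs, targets)) / (2 * eps))

print("analytic:", analytic)
print("numeric: ", numeric)
```

The finite-difference check is only there to confirm that the hand-written backward pass agrees with the loss it claims to differentiate.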
Limitations:
Back propagation through time can only be used effectively up to a limited number of time steps. If we use it to update the weights over a longer sequence, the gradient becomes too small and model learning suffers. This is called the “vanishing gradient” problem.
- The problem is that the contribution of earlier inputs to the gradient decays as the error is propagated back through more time steps.
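The decay can be seen numerically in a scalar tanh RNN, where the chain rule contributes one factor per time step to the gradient of the last hidden state with respect to the first. This is a minimal sketch: the parameter values and the constant input of 1.0 are assumptions chosen only to make the effect visible.

```python
import math

W_X, W_H, B = 0.7, 0.4, 0.1  # arbitrary illustrative parameters

def gradient_wrt_first_state(num_steps):
    """Derivative of the last hidden state w.r.t. the initial one, for a constant input of 1.0."""
    hidden, gradient = 0.0, 1.0
    for _ in range(num_steps):
        hidden = math.tanh(W_X * 1.0 + W_H * hidden + B)
        gradient *= W_H * (1.0 - hidden * hidden)  # chain rule: one factor per time step
    return gradient

for steps in (1, 5, 10, 20, 50):
    print(f"{steps:>3} steps: {gradient_wrt_first_state(steps):.3e}")
```

Each factor here is smaller than one in magnitude, so the product shrinks roughly geometrically with the number of steps, and the earliest inputs end up with almost no influence on the weight update.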
Unfolding the recurrent network graph also introduces additional concerns: each time step adds another copy of the network to the unrolled graph, which in turn takes up memory.
Is an RNN a perfect option for modeling sequences?
- An RNN is a sequence model. It can be used whenever the data is sequential, that is, when earlier observations provide information about later observations.
- RNNs struggle with long-term dependencies: modeling long sequences with a plain RNN is difficult because of the vanishing gradient problem.
- Our usual techniques for tackling gradient problems, like ReLU activations, gradient clipping, and weight regularization, are not enough.
- For RNNs the major advances were architectural, and this led us to an architecture called Long Short-Term Memory (LSTM).
That’s it, folks. I hope I was able to convey something about RNNs in this tutorial. Our next topic is LSTM, so stay tuned!
In this tutorial we discussed the architecture of RNNs and how RNNs address variable-length sequences, and we concluded by discussing the limitations of RNNs.
Long-term dependencies: the limitations of RNNs lead us to the next architecture for modeling sequential data, Long Short-Term Memory (LSTM).