Long Short Term Memory networks

Ajay krishnan
8 min read · Apr 2, 2022

Remembering information for long periods of time.

  • An RNN is a sequence model: it can be used whenever the data is sequential, that is, when earlier observations provide information about later observations.
  • RNNs struggle with long-term dependencies: modeling longer sequences with a plain RNN is very difficult because of the vanishing gradient problem.
  • The usual techniques for tackling the vanishing gradient problem, like ReLU activations, gradient clipping, and weight regularization, are not enough.
  • For RNNs, the major advances were architectural, and this leads us to an architecture called Long Short-Term Memory.

Long Short Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies.

LSTMs are explicitly designed to avoid the long-term dependency problem.

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

A common LSTM unit is composed of a memory cell, an input gate, an output gate and a forget gate. The memory cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

LSTM: the big picture

Let's study this architecture using an intuitive example: text processing. Suppose we have a corpus of text. Each word in the document is represented by a word vector, and W(n) represents the nth word in the document. Our goal is to have a model that takes in a word and predicts the next word in the sequence.

A simple view of LSTM

Before digging deep, let's talk briefly about activation functions, specifically the sigmoid activation function and the hyperbolic tangent activation function; a short code sketch of both follows the notes below. To understand more about different activation functions, please use this link: [ Activation functions ]

Sigmoid Activation function

  • The output of the sigmoid is always between 0 and 1.
  • When the input is large and positive, the output approaches one.
  • When the input is large and negative, the output approaches zero.

Hyperbolic tangent function (tanh)

Hyperbolic tangent Activation function
  • The hyperbolic tangent gives an output between -1 and +1.
  • When the input is large and positive, the output approaches positive one.
  • When the input is large and negative, the output approaches negative one.
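
To make these ranges concrete, here is a tiny NumPy sketch of the two functions (purely illustrative, not tied to any library internals):

import numpy as np

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# np.tanh squashes any real input into the range (-1, 1)
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[0.00005, 0.5, 0.99995]
print(np.tanh(np.array([-10.0, 0.0, 10.0])))   # ~[-1.0, 0.0, 1.0]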

In our example, we are predicting the next word in a sequence given the current word. Let's study this in detail.

An iteration of LSTM
  • The inputs are word vectors; here W(n-1) represents the word vector corresponding to the (n-1)th word in the document.
  • The LSTM concatenates the word vector W(n-1) with the hidden state vector from the previous iteration, H(n-1), to form a fixed-length input X(n-1).
  • An LSTM has not only a hidden state vector but also a memory cell.
  • The memory state from the previous iteration is represented by C(n-1).
  • The LSTM predicts the next event in the sequence by utilizing the input vector, the hidden state vector, and the memory cell.

An LSTM unit is composed of three control gates and a memory cell. The control gates regulate the flow of information in the cell, and the memory cell helps the architecture remember the past.

Control Gates in LSTM

The fixed-length input, which is a concatenation of the word vector and the hidden state, is fed separately into each of these control gates (a small code sketch follows the list):

  • Input control → Performs a weighted sum on the input and applies the sigmoid activation. The input control is characterized by a set of weights Weights(i) and a bias vector B(i). It regulates the new memory estimate from the current iteration.
  • Forget control → Performs a weighted sum on the input and applies the sigmoid activation. The forget control is characterized by a set of weights Weights(f) and a bias vector B(f). It helps regulate the old memory estimate from previous iterations.
  • Output control → Performs a weighted sum on the input and applies the sigmoid activation. The output control is characterized by a set of weights Weights(o) and a bias vector B(o). It regulates the output produced by the LSTM.
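
As a minimal sketch, assuming illustrative weight matrices W_i, W_f, W_o and bias vectors b_i, b_f, b_o (names chosen here for clarity, not taken from any particular library), all three control gates share the same pattern:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 6     # illustrative sizes
x = rng.normal(size=input_size)    # X(n-1): [word vector | previous hidden state]

W_i, W_f, W_o = (rng.normal(size=(hidden_size, input_size)) for _ in range(3))
b_i, b_f, b_o = (np.zeros(hidden_size) for _ in range(3))

I_n = sigmoid(W_i @ x + b_i)   # input control: each component in (0, 1)
F_n = sigmoid(W_f @ x + b_f)   # forget control: each component in (0, 1)
O_n = sigmoid(W_o @ x + b_o)   # output control: each component in (0, 1)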

Memory cell in LSTM

The fixed-length input, which is a concatenation of the word vector and the hidden state, is fed into an update gate in the LSTM to produce a new estimate.

Update gate → Performs a weighted sum on the input and applies the hyperbolic tangent activation. The update gate is characterized by a set of weights Weights(u) and a bias vector B(u). It produces the new memory estimate, ~C(n).
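
A sketch of the update gate under the same illustrative naming (W_u standing in for Weights(u), b_u for B(u)), with tanh in place of the sigmoid:

import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 6
W_u = rng.normal(size=(hidden_size, input_size))   # Weights(u), illustrative
b_u = np.zeros(hidden_size)                        # B(u), illustrative
x = rng.normal(size=input_size)                    # X(n-1)

C_tilde = np.tanh(W_u @ x + b_u)   # new memory estimate ~C(n), each component in (-1, 1)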

Understanding the flow in LSTM

The fixed-length input, which is a concatenation of the word vector and the hidden state, is fed separately into the control gates and the update gate.

X(n-1) = [ W(n-1) | H(n-1) ]
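
As a tiny sketch with made-up vector values, the concatenation is just:

import numpy as np

W_prev = np.array([0.2, -0.1, 0.7])        # word vector W(n-1), made-up values
H_prev = np.array([0.05, 0.9])             # previous hidden state H(n-1)
X_prev = np.concatenate([W_prev, H_prev])  # fixed-length input X(n-1)
print(X_prev)   # [ 0.2  -0.1   0.7   0.05  0.9 ]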

Remember: the control gates use the sigmoid as their activation function, and the update gate uses the hyperbolic tangent.

The output vectors from the control gates have values between zero and one, whereas the output of the update gate has values between -1 and +1.

Control gates and memory cell

The outputs of the input gate, I(n), and the update gate, ~C(n), are combined to obtain a new estimate. We take the Hadamard product of these vectors.

Hadamard product means that each component of I(n) is multiplied with the corresponding component of ~C(n) to get a regulated new estimate.
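
A quick numeric sketch of that Hadamard product, with made-up gate values:

import numpy as np

I_n = np.array([0.9, 0.1, 0.5])        # input control output, values in (0, 1)
C_tilde = np.array([0.8, -0.6, 0.3])   # update gate output ~C(n), values in (-1, 1)
regulated_new = I_n * C_tilde          # element-wise (Hadamard) product
print(regulated_new)   # [ 0.72 -0.06  0.15]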

Input control

The output of the forget gate, F(n), is used to control the memory from the previous iteration, C(n-1). Each component of F(n) is multiplied with the corresponding component of C(n-1), which regulates the memory from the past iteration.

Forget control

The regulated memory from the past iteration, Hadamard product(F(n), C(n-1)), is added to the regulated update, Hadamard product(I(n), ~C(n)), to produce the new memory estimate for the next iteration, C(n).

The regulated past memory is added to the regulated new memory to produce the memory estimate for the next iteration.
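
Putting the two regulated terms together, here is a small numeric sketch of the memory update (the values are made up):

import numpy as np

F_n = np.array([0.95, 0.2, 0.5])       # forget control output
C_prev = np.array([1.0, -0.4, 0.6])    # memory from the previous iteration, C(n-1)
I_n = np.array([0.9, 0.1, 0.5])        # input control output
C_tilde = np.array([0.8, -0.6, 0.3])   # new memory estimate ~C(n)

# regulated past memory + regulated new estimate = C(n)
C_n = F_n * C_prev + I_n * C_tilde
print(C_n)   # [ 1.67 -0.14  0.45]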

Memory cell output of LSTM

The LSTM has two states: a hidden state and a memory state. The tanh activation is applied to the memory estimate C(n), and each component of the resulting vector is multiplied with the corresponding component of the output control, O(n), to produce the hidden state vector H(n).

H(n) = Hadamard product ( O(n) , tanh(C(n)) )
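
Putting every piece together, here is a minimal, self-contained NumPy sketch of one LSTM iteration. All parameter names (W_i, W_f, W_o, W_u, the biases, and lstm_step itself) are illustrative, not taken from any library:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_vec, h_prev, c_prev, params):
    """One LSTM iteration: returns the new hidden state H(n) and memory C(n)."""
    x = np.concatenate([w_vec, h_prev])                    # X(n-1) = [ W(n-1) | H(n-1) ]
    i = sigmoid(params['W_i'] @ x + params['b_i'])         # input control
    f = sigmoid(params['W_f'] @ x + params['b_f'])         # forget control
    o = sigmoid(params['W_o'] @ x + params['b_o'])         # output control
    c_tilde = np.tanh(params['W_u'] @ x + params['b_u'])   # new memory estimate ~C(n)
    c_new = f * c_prev + i * c_tilde                       # regulated old + regulated new
    h_new = o * np.tanh(c_new)                             # hidden state H(n)
    return h_new, c_new

# illustrative sizes and randomly initialised parameters
word_dim, hidden = 3, 4
rng = np.random.default_rng(0)
params = {}
for name in ['i', 'f', 'o', 'u']:
    params['W_' + name] = rng.normal(size=(hidden, word_dim + hidden))
    params['b_' + name] = np.zeros(hidden)

# run the step over a toy "sentence" of five random word vectors
h, c = np.zeros(hidden), np.zeros(hidden)
for w_vec in rng.normal(size=(5, word_dim)):
    h, c = lstm_step(w_vec, h, c, params)
print(h.shape, c.shape)   # (4,) (4,)

In a trained network these parameters are learned with backpropagation through time; the sketch only shows the forward data flow.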

Hidden state output of LSTM
LSTM simplified diagram
LSTM architecture: a simple view

Now let's see the LSTM in action.

Let's train an LSTM network to learn the alphabet. Our goal is to have a model that can predict the next character in a subsequence.

Notebook Link : GitHub

predict the next letter 
#input sequence# --> #output#

['T', 'U'] --> V
['L', 'M', 'N', 'O'] --> P
['J', 'K', 'L', 'M'] --> N
['N'] --> O
['I', 'J', 'K'] --> L

Let's first write some code to create variable-length data.

# Generate variable-length input
import numpy as np
import tensorflow as tf

# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# create mapping of characters to integers (0-25) and the reverse
char_to_int = {key: val for val, key in enumerate(alphabet)}
int_to_char = {key: val for key, val in enumerate(alphabet)}

num_samples = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_samples):
    # pick a random subsequence of at most max_len characters
    start = np.random.randint(0, len(alphabet) - 2)
    stop = np.random.randint(start, min(len(alphabet) - 1, start + max_len))
    seq_in = alphabet[start:stop + 1]
    seq_out = alphabet[stop + 1]
    dataX.append([char_to_int[c] for c in seq_in])
    dataY.append(char_to_int[seq_out])

Preprocessing step

Padding is required here because the examples inside a batch need to have the same size and shape, but the sequences in this dataset do not all have the same length.

X = tf.keras.preprocessing.sequence.pad_sequences(dataX, max_len, dtype='float32')

Normalize the data

X = X / float(len(alphabet))

One-hot-encode the target variable

y = tf.keras.utils.to_categorical(dataY)

Model building: let's build a simple Keras Sequential model consisting of two layers, an LSTM layer followed by a Dense layer.

# the LSTM expects a 3D input of shape (samples, timesteps, features)
X = np.reshape(X, (num_samples, max_len, 1))
shape = (X.shape[1], X.shape[2])
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(units=32, input_shape=shape),
    tf.keras.layers.Dense(y.shape[1])
])
# Compile
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
# Training
model.fit(X, y, epochs=500, batch_size=1, verbose=0)

Prediction

for i in range(20):
    index = np.random.randint(0, num_samples)
    test = dataX[index]
    # zero padding
    X_ = tf.keras.preprocessing.sequence.pad_sequences([test], max_len, dtype='float32')
    # normalize and reshape the same way as the training data
    X_ = X_ / float(len(alphabet))
    X_ = np.reshape(X_, (1, max_len, 1))
    # prediction
    pred = model.predict(X_, verbose=0)
    idx = np.argmax(pred)
    print([int_to_char[j] for j in test],
          '-->',
          int_to_char[idx])
************************************************
OUTPUT
************************************************
['C', 'D', 'E', 'F', 'G'] --> H
['L', 'M', 'N', 'O', 'P'] --> Q
['I'] --> J
['S', 'T', 'U'] --> V
['E', 'F', 'G', 'H', 'I'] --> J
['I', 'J'] --> K
['D', 'E', 'F', 'G'] --> H
['V', 'W', 'X'] --> Y
['C', 'D'] --> E
['U', 'V'] --> W
['P'] --> Q
['X'] --> Y
['P', 'Q'] --> R
['G', 'H', 'I', 'J'] --> K
['B', 'C'] --> D
['D', 'E', 'F', 'G', 'H'] --> I
['Q', 'R', 'S'] --> T
['L', 'M', 'N'] --> O
['K'] --> L
['J', 'K', 'L', 'M'] --> N

RESOURCES

Notebooks 😎 NLP
