Transformer Network

Ajay krishnan
17 min read · Apr 7, 2022

Attention based network

Make sure you know about the topics mentioned below before we start learning about attention based neural networks. All of them are covered in previous tutorials, so make use of those. You can follow the links for more information on each topic.

RESOURCES

Notebooks 😎 NLP

Today is all about attention. We will study self attention, multi head attention, the sequence encoder and much more…


While modeling natural language, we treat words as continuous vectors in an N-dimensional space rather than one hot encoding them. One hot encoding constrains the words to be orthogonal to each other. It is more efficient to let them live anywhere in the N-dimensional space.
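Here is a small sketch of the contrast, with made-up vectors: one-hot vectors are always orthogonal, while dense word vectors are free to be close or far apart.

```python
# A minimal sketch (NumPy) contrasting one-hot vectors with dense word
# vectors. The vocabulary and the vector values are invented for illustration.
import numpy as np

vocab = ["river", "bank", "money", "water"]
vocab_size = len(vocab)

# One-hot encoding: every word is orthogonal to every other word, so
# "river" and "water" look just as unrelated as "river" and "money".
one_hot = np.eye(vocab_size)
print(one_hot[vocab.index("river")] @ one_hot[vocab.index("water")])  # 0.0

# Dense word vectors: free to live anywhere in N-dimensional space,
# so related words can end up close together. In practice these are learned.
embedding_dim = 3
dense = np.random.randn(vocab_size, embedding_dim)
print(dense[vocab.index("river")] @ dense[vocab.index("water")])      # some real number
```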

Learning word vectors

We can learn word vectors in two ways:

  • Supervised learning → learn word vectors by training on a supervised task, like sentiment analysis, question answering, text classification etc. Here we need labeled data, and getting data labeled is both time intensive and cost intensive.
  • Unsupervised learning → we would like to learn word vectors from the text corpus itself instead of requiring a human to label data first and then training on a supervised task. We have two unsupervised model architectures for learning these word vectors: the Continuous Bag of Words [ CBOW ] and the Skip Gram Model (a rough sketch of generating skip-gram training pairs appears below).

Continuous Bag of Words [ CBOW ]

CBOW

Skip Gram Model

SKIP GRAM
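As a rough illustration of the skip-gram setup, here is a sketch of generating (center word, context word) training pairs from a sentence. The simple whitespace tokenizer and the window size of 2 are simplifying assumptions for illustration.

```python
# A rough sketch of generating (center, context) training pairs for a
# skip-gram style model.
def skipgram_pairs(sentence, window=2):
    tokens = sentence.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("on the river bank they sit"))
# [('on', 'the'), ('on', 'river'), ('the', 'on'), ('the', 'river'), ('the', 'bank'), ...]
```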

For more information please visit the previous tutorials using the link : Word Embedding and Language Modeling

Word Vectors

Every word in our vocabulary is going to be mapped to a vector, and then we will do our analysis of natural language with these word vectors.

Word vectors

Words in our vocabulary are mapped to points in a high dimensional space. The closer two words are in that space, the more related or synonymous we expect them to be; the further apart they are, the more dissimilar we expect them to be. Through the word embedding we are able to encode the meaning of the words into the components of the vector.

Vectors in N-dimensional space

The idea behind word vectors

Every language has many rules in place for constructing a sentence, and the words in a sentence follow these rules. This implies that for each word in a given sentence we should be able to predict the presence of the surrounding words.

Words have meaning, and that meaning implies that a given word should indicate the presence of other words in its surroundings. The word vectors we learn are meant to preserve this relationship.

Each component of the word vector represents a different aspect of the word. Simply put, we can think of this as thematic meaning associated with the word. The different themes associated with the word are encoded into the components of its vector representation.

The underlying components of the word vector have some thematic meaning. We don’t have to know explicitly what that meaning is; it is uncovered through machine learning.

Various aspect related to the word are encoded in to the components of the word vector

The components of word vectors can be positive or negative. The value of a component is positive if the word is aligned with the corresponding topic, and negative if it is not. This vector representation captures the underlying thematic meaning of the word, and that is what the word2vec concept is meant to reflect.


If we learn these word vectors by taking the context into account, then we can look at pairs of words which we know are basically the same except for perhaps one area of distinction, for example words like “king” and “queen”. By comparing these word vectors, looking at what is similar between them and what is different, we can uncover some of the underlying information associated with them.
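As a toy illustration with invented numbers (real word vectors are learned, not hand-crafted), the difference between “king” and “queen” mirrors the difference between “man” and “woman”:

```python
# Toy, hand-made vectors just for illustration.
# Imagine the two components are "royalty" and "maleness" themes.
import numpy as np

king  = np.array([0.9,  0.8])
queen = np.array([0.9, -0.7])
man   = np.array([0.1,  0.8])
woman = np.array([0.1, -0.7])

# What is different between king and queen mirrors what is different
# between man and woman (here: the "maleness" component).
print(king - queen)                             # [0.  1.5]
print(man - woman)                              # [0.  1.5]

# The classic analogy: king - man + woman lands on queen (with these toy numbers).
print(np.allclose(king - man + woman, queen))   # True
```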

Word vector : an intuitive diagram

This concept of mapping each word in a document to a single vector is restrictive, in the sense that words have multiple meanings and the meaning that applies depends on the context in which the word is used.

For example : Consider the following sentences

  • “ The bank appropriated the property of the defaulters ”
  • “There can be problems in appropriating funds for legal expenses”
  • “Teaching appropriate behavior in the workplace”

By associating a word with a fixed embedding we cannot handle cases where words have multiple meanings. Suppose we are encoding the word “bank”. A context independent encoding will map the word “bank” in the sentence “On the river bank” and in “Open an account in the bank” to the same embedding vector. Ideally we want the word embedding to be contextualized, reflecting the surrounding words, so that the word “bank” in “On the river bank” is mapped differently from the word “bank” in “Open an account in the bank”.

Word embedding must encode contextual information

RNN, LSTM & GRU

These modeling architectures provide a way of contextualizing the word embedding, so that the same word can be mapped differently depending on the context in which it is used.

Recurrent connection : unrolled view
  • An RNN scans the input and takes one event at a time.
  • An RNN maintains an internal state, in which it encodes the information it has learned from the previous events.

Suppose the task of the RNN is to predict the missing word in the sentence: “On the river bank they [MASK]”. The RNN reads the input sequentially, one event at a time, and updates the state vector it maintains as it progresses through the sentence.

We can think of the internal state vector as an encoding of the sentence seen so far. At every time step the RNN outputs a contextualized embedding Y(i) based on the internal state vector [context information] and the input X(i).

Our input sequence is: “On the river bank they [MASK]”. For encoding the word “bank” this mechanism is helpful. By the time the RNN reaches the word “bank” it has already seen the word “river” and encoded it in its internal state, so it is in a better position to distinguish the word “bank” in this context from its other meaning of a financial institution.
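Here is a minimal Keras sketch of this idea: an embedding layer followed by a recurrent layer that returns one contextualized vector per time step. The vocabulary size, the dimensions and the choice of a GRU are assumptions for illustration.

```python
# A minimal sketch of contextualized embeddings with a recurrent network.
import tensorflow as tf

vocab_size, embed_dim, state_dim = 10000, 64, 128

inputs = tf.keras.Input(shape=(None,), dtype="int32")          # token ids
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)   # context-free vectors
# return_sequences=True gives one output per time step: Y(1), ..., Y(n),
# each informed by the words the RNN has already read.
y = tf.keras.layers.GRU(state_dim, return_sequences=True)(x)
model = tf.keras.Model(inputs, y)

# The token ids below are placeholders; the vector for "bank" (position 3)
# would now be informed by the words read before it, such as "river".
print(model(tf.constant([[4, 7, 42, 13, 5, 0]])).shape)        # (1, 6, 128)
```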

For more information please visit the previous tutorials using the link : RNN and LSTM

What we would like to do now is to develop a framework by which we can learn these word vectors in a way that takes into account the meaning implied by the surrounding words [ contextual information ].

Quantifying the similarity between words

We can use the dot product (inner product) to quantify the similarity between words. Suppose we have the word vectors of word-1 and word-2 in a document. We can compute the similarity between them by multiplying corresponding components and adding them up. If the dot product is positive the word vectors are similar; if it is negative they are dissimilar.

Inner product quantifies the similarity
  • The components of word vectors correspond to different themes.
  • If the word aligns with a theme then the corresponding component will be a positive number.
  • If the word doesn’t align with a theme then the corresponding component will be a negative number.
  • The dot product between two word vectors gives us the similarity between the two words.
  • If the words are similar then their components will have the same signs, so the dot product gives us a positive number.
  • If the words are dissimilar then their components will have different signs, so the dot product gives us a negative number (a small numeric sketch follows below).
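A small numeric sketch of this idea, with invented component values (think of each component as a theme):

```python
# The dot product as a similarity measure. The component values are invented.
import numpy as np

river = np.array([ 0.9,  0.7, -0.2])   # aligned with a "nature / water" theme
water = np.array([ 0.8,  0.6, -0.1])   # same signs as "river"
loan  = np.array([-0.7, -0.5,  0.9])   # opposite signs

print(river @ water)   # positive (about 1.16): similar words
print(river @ loan)    # negative (about -1.16): dissimilar words
```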

Exponentiation of the inner product

When we use the dot product to compute the similarity between word vectors, we get a positive number for similar words and a negative number for dissimilar words. However, it is more convenient to work only with positive numbers.

Exponential function

For any value of x, positive or negative, Exp(x) is always positive. The exponential function is also a monotonically increasing function of its input.

We want to work with positive numbers while preserving the idea that a positive inner product means similar words and a negative inner product means dissimilar words. The exponential function preserves this idea while always giving us a positive number as output: if the input is large and positive the output is a large positive number, and if the input is negative the output is a small positive number close to zero.

Introduction to attention mechanism

Suppose we have the word vectors of all words in a sentence and we want to quantify how similar the kth word is to all other words in the sentence.

We can use inner product to get the similarity score

  • c(k)·c(i) → quantifies the similarity score
  • Exp(c(k)·c(i)) → exponentiation of the similarity score
  • R(k→i) = Exp(c(k)·c(i)) / Σ_j Exp(c(k)·c(j)) → normalization to obtain the relative similarity score

Relative similarity score

The relative similarity R(k→i) tells us how similar the kth word is to the ith word, relative to all the other words in the sequence.

This relative similarity score is always positive and ranges between zero and one. The stronger the relationship between word-k and another word, the larger the relative similarity score, closer to one. If the relationship between word-k and another word is weak, the score will be smaller, tending toward zero.
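A short sketch of computing R(k→i) for one word against all the others; the word vectors here are random placeholders:

```python
# Relative similarity R(k -> i): exponentiate the dot products of word k with
# every word i, then normalize so the scores sum to one (a softmax).
import numpy as np

np.random.seed(0)
C = np.random.randn(6, 4)                     # 6 word vectors, 4 components each
k = 3                                         # the k-th word, e.g. "bank"

scores = C @ C[k]                             # c(k) . c(i) for every i
R_k = np.exp(scores) / np.exp(scores).sum()   # relative similarity scores

print(R_k)            # positive numbers between 0 and 1
print(R_k.sum())      # 1.0
```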

There is an underlying grammatical relationship between words in a sentence. The relative similarity score is a mathematical way of quantifying the relationship between words, and those relationships are characteristic of language itself. These relationships dictate how words unfold in a sentence.

Refining the word vector

The word vector for word-k in the sentence is created independently of the context in which it is used. We have computed the relative similarity scores for word-k. R(k→i) tells us how similar word-k is to the other words in the sentence. By using this relative similarity score we can refine the word vector for word-k, C(k).

We take the word vectors for all the other words in the sentence and perform a weighted sum, using the relative similarity scores as weights, to obtain a refined word vector for word-k that incorporates contextual information.

~C(k) = R(k→1)C(1) + R(k→2)C(2) + … + R(k→n)C(n)

Refining the word vector

The word vectors we started with were not informed by the context in which they were used. By computing the relative similarity scores, we now know how much attention we should pay to the surrounding words. The relative similarity score R(k→i) tells us how much attention to pay to word-i when constructing the word vector for word-k.
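A short sketch of this weighted sum, again with random placeholder vectors:

```python
# Refining the vector for word k as a weighted sum of all the word vectors in
# the sentence, with the relative similarity scores as the weights.
import numpy as np

np.random.seed(0)
C = np.random.randn(6, 4)                      # 6 word vectors, 4 components
k = 3                                          # the word being refined

R_k = np.exp(C @ C[k])
R_k /= R_k.sum()                               # relative similarity scores

C_k_refined = R_k @ C                          # ~C(k) = sum_i R(k->i) C(i)
print(C_k_refined.shape)                       # (4,): still a word vector,
                                               # now informed by its context
```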

Let’s understand this through an example. Suppose the sentence we are working with is: “On the river bank they [MASK]”.
We want to encode the word “bank” by incorporating contextual information.

  • Compute how similar the word “bank” is to all the other words in its surroundings; this score tells us how much attention we should pay when constructing the word vector.
  • Perform a weighted sum to obtain a word vector that is informed by the context in which the word is used in the sentence.
Concept of Attention

Concept of attention

  • R(k→i) tells us how similar word-k is to the other words in the sentence.
  • R(k→i) is a number between zero and one.
  • The relative similarity score tells us how much attention we should pay to the surrounding words when constructing the word vector.
  • If the relative similarity score is high, pay high attention.
  • If the relative similarity score is low, pay low attention.

Self attention terminologies

Self attention has three sets of model parameters; these are two dimensional matrices.

  • Query parameter
  • Key parameter
  • Value parameter
Self attention a simple view

The Keys

  • Keys → The words surrounding the kth word that we pay attention to are called the key tokens. The key tokens don’t need to span the entire sentence; they could be just the next three words. Most attention mechanisms, however, use the entire sentence for context.
  • The key parameters apply a transformation to the key tokens. We can think of this as multiplying a vector by a matrix to get another vector.

The Query

  • Query → The word vector that is currently being embedded is called the query token. This is the kth word that we examine with respect to all the other words in the sentence.
  • The query parameters apply a transformation to the query token. We multiply the non contextual embedding of the word under consideration by the matrix to obtain a different embedding.

The basic concept of self attention is a weighted sum. We want to assign a weight to every word vector in the sentence. This weight reflects the usefulness of the surrounding words, the contextual information, in embedding the query token. We can do that by measuring the similarity between the keys and the query: taking the dot product between the keys and the query quantifies the similarity. Finally we normalize the similarity scores to obtain the relative similarity scores. These tell us how much attention we should pay to the surrounding words while constructing the word vector.

The attention scores reflect how relevant each key is to the query

The Values

  • We multiply the input vectors by the value parameters (a 2D matrix) to get another set of vectors called the values.
Value parameter

The output of self attention

  • We have computed the similarity scores, the weights
  • We have the Keys, the Query and the Values
  • Finally, the output of self attention is the weighted sum of these values (a complete sketch follows below)
Self attention output
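Putting the pieces together, here is a minimal NumPy sketch of single-head self attention with query, key and value parameters. The dimensions are placeholder values, and the scaling of the scores by the square root of the dimension follows the usual transformer convention rather than anything stated above.

```python
# A minimal sketch of single-head self attention. All matrices are random
# placeholders; in a real model they are learned.
import numpy as np

np.random.seed(0)
n, d_model, d_attn = 6, 8, 4            # 6 tokens, model dim 8, attention dim 4
X = np.random.randn(n, d_model)         # non-contextual word vectors

W_q = np.random.randn(d_model, d_attn)  # query parameters
W_k = np.random.randn(d_model, d_attn)  # key parameters
W_v = np.random.randn(d_model, d_attn)  # value parameters

Q, K, V = X @ W_q, X @ W_k, X @ W_v     # queries, keys, values

scores = Q @ K.T / np.sqrt(d_attn)      # similarity between queries and keys
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # relative similarity (softmax)

output = weights @ V                    # weighted sum of the values
print(output.shape)                     # (6, 4): one contextual vector per token
```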

Self attention

Self attention
  • Words have different meanings depending on the context in which they are used, so mapping a word to a single vector is restrictive in the sense that it implies only one kind of meaning.

Example: “On the river bank they sit”, “The bank controls the interest rate”

  • Self attention helps encode words into vectors by taking contextual information into account.
  • Self attention does not take word order into account.

Word order is important

Self attention is independent of word order. The word vectors for the sentences “On the river bank they sit” and “They sit on the river bank” are the same. In this case that is fine, because the context is the same. But consider the following two sentences:

  • The united states of [MASK]
  • The states united of [MASK]

When the word order changes, the meaning changes. The attention mechanism gives us word vectors that are independent of word order: changing the word order doesn’t change the word encoding produced by the attention mechanism.

Word order matters if we want to capture contextual meaning. The word vectors created using the attention mechanism don’t pay attention to word order, but in natural language processing word order is very important: if the order of the words in a sentence changes, the meaning of the sentence changes.

Embedding positional information

We need some modifications to incorporate the word order into the embedding vectors

  • Skip connection
  • Positional embedding
Word embedding, Positional embedding and Attention mechanism

Skip connection

We want to preserve the meaning of the word, the word embedding, while accounting for the contextual information. As we go from the input at the bottom to the output at the top, we lose the original word embedding. The skip connection helps us preserve the original embedding.

Skip connection

Positional Embedding

The word vectors are d-dimensional vectors, and we want to incorporate positional information into them. The positional embeddings are d-dimensional vectors that reflect the position of the word in the sequence. Every component of this vector is associated with a sine wave of a different frequency, and the frequency varies as we move down the vector component by component.

Positional information
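As a concrete example, here is a sketch of sinusoidal positional embeddings in the style of the original transformer paper; the exact formula used in the slides may differ, and the dimensions are placeholder values.

```python
# Sinusoidal positional embeddings: each pair of components gets a sine/cosine
# of a different frequency, so every position maps to a distinct d-dim vector.
import numpy as np

def positional_embedding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                   # component index
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even components: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd components: cosine
    return pe

PE = positional_embedding(seq_len=6, d_model=8)
print(PE.shape)   # (6, 8): one d-dimensional position vector per word
# These are added to the word embeddings, so the same word at different
# positions gets a different input vector.
```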

Sequence Encoder

  • We take a sequence of words as input [ w(1), w(2), … , w(n)]
  • We map each word to an individual word embedding vector. These embeddings reflect the meaning of the word; the components of the vectors represent different aspects of the word.
  • The word embedding encodes the meaning of the word, but it accounts neither for the contextual information from the surrounding words nor for the positional information of the word.
  • By introducing positional embeddings, we can encode positional information into the word vectors. Depending on its position in the sequence, each word gets a word-order dependent d-dimensional vector.
Encoding positional information
  • The positional embedding gives us the positional information and the word embedding gives us the meaning of the word. We can add them to produce a vector containing both pieces of information.
  • To encode the contextual information we use the self attention mechanism.
Word meaning + Positional information + Contextual information
  • The skip connection preserves the original word embedding by skipping the attention network and adding the embedding to the attention network’s output. For similar reasons, a skip connection is also used with the feed forward net at the top of the diagram.
  • Finally we have the feed forward neural network (FFNN). The outputs of the attention network, vectors that encode the meaning of each word in the input sequence, its positional information and the contextual information from the surrounding words, are passed through a FFNN.
  • This FFNN provides regularization on the network, restricting the output activations to a desired range.
  • We use the tanh (hyperbolic tangent) activation function in the dense layer, which constrains the outputs to between -1 and +1. A rough sketch of one such encoder block follows below.
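Below is a rough Keras sketch of one encoder block as described above: self attention with a skip connection, followed by a feed forward network with its own skip connection. The layer sizes, the single attention head and the use of LayerNormalization for the "add and normalize" step are assumptions for illustration, not the exact architecture from the slides.

```python
# A rough sketch of one sequence encoder block.
import tensorflow as tf

def encoder_block(d_model=64, d_ff=128):
    inputs = tf.keras.Input(shape=(None, d_model))   # word + positional embeddings

    # Self attention, with a skip connection that preserves the original embedding.
    attn = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=d_model)(inputs, inputs)
    x = tf.keras.layers.LayerNormalization()(inputs + attn)

    # Feed forward network (tanh dense layer, as in the text) with its own skip connection.
    ff = tf.keras.layers.Dense(d_ff, activation="tanh")(x)
    ff = tf.keras.layers.Dense(d_model)(ff)
    out = tf.keras.layers.LayerNormalization()(x + ff)

    return tf.keras.Model(inputs, out)

block = encoder_block()
print(block(tf.random.normal((1, 6, 64))).shape)     # (1, 6, 64): one vector per word
```

Repeating such a block K times gives the deep sequence encoder described in the next section.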

The Transformer Encoder

Transformers are nothing more than a stack of self attention layers.

Deep Sequence Encoder

The sequence encoder is a stack of different layers. At the bottom we have the word embedding layer and the positional embedding; we add and normalize the embedding vectors. The next set of layers is an attention network and a FFNN, with skip connections in between. This set of layers can be repeated K times to form a deep sequence encoder, which improves the performance of the network.

Deep sequence encoder

Sequence to Sequence model

Applications of the deep sequence encoder include sentiment classification, predicting the next word in a sequence, language translation and more. Let’s look at an example of a sequence to sequence model, where the goal of the model is to translate from one language to another.

Suppose we provide English words as input to the model; the model will produce a German translation.

Sequence to Sequence model
  • Composed of an Encoder and a Decoder for sequence to sequence modeling
  • The attention mechanism is the key component of both the Encoder and the Decoder
  • The input to the Encoder is a sequence of English words
  • For each word in the sequence, the Encoder outputs a word vector that is informed by the meaning of the word, the position of the word and the context of the word.
  • This output of the Encoder will serve as the Keys and Values while decoding.
  • While the Encoder is based on self attention, the Decoder is based on self attention and cross attention.
  • Cross attention applies attention over the final embeddings of the Encoder. The Encoder outputs are used as the Keys and Values to the attention model, while the Query comes from the embeddings of the translated words (a sketch follows below).
Cross Attention
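Here is a minimal NumPy sketch of cross attention, under the same placeholder setup as the self attention sketch above: the keys and values come from the encoder output, while the queries come from the decoder’s embeddings of the words translated so far.

```python
# A minimal sketch of cross attention. All matrices are random placeholders.
import numpy as np

np.random.seed(0)
d_model, d_attn = 8, 4
enc_out = np.random.randn(6, d_model)    # encoder output: 6 English tokens
dec_emb = np.random.randn(3, d_model)    # decoder side: 3 translated tokens so far

W_q = np.random.randn(d_model, d_attn)
W_k = np.random.randn(d_model, d_attn)
W_v = np.random.randn(d_model, d_attn)

Q = dec_emb @ W_q                        # queries from the decoder
K, V = enc_out @ W_k, enc_out @ W_v      # keys and values from the encoder

weights = np.exp(Q @ K.T / np.sqrt(d_attn))
weights /= weights.sum(axis=-1, keepdims=True)
print((weights @ V).shape)               # (3, 4): one context vector per output token
```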

This attention mechanism can be computed in parallel, so it is faster to train than models that process the input sequentially.

Multi Head Attention

Self attention has three sets of model parameters; these are two dimensional matrices.

  • Query parameter
  • Key parameter
  • Value parameter

Multiple sets of Key, Value and Query parameters are called attention heads. Having multiple attention heads is analogous to having multiple kernels in a CNN.

In a CNN we have multiple kernels to extract specific information from an image, like edge detection, sharpening, blurring etc.

Convolution kernel

For more information on Convolution neural networks , use these links : Convolution 101 : for beginners and Translation in-variance

Word embeddings can encode multiple aspects related to a word. These aspects can be anything: part of speech, gender, quantifiers etc.

Components of word vector represent different aspects related to the word

The word vector associated with each word has in its components the thematic meaning related to the word. The parameters of the attention network are two dimensional matrices of values, where each row represents one of these themes, a piece of meta information related to the word. When we multiply the input word vector by these matrices, we highlight the aspects that are important.

Let’s understand this through an example:

Suppose our input word sequence is “paris is a lovely place”.

Components of word vector : encoding different aspects related to the word

Suppose these are some of the aspects that are characteristic of the words we want to encode. If the word aligns with an aspect then we will have a positive value for that component; if it does not align then we will have a negative value. These aspects, which are characteristic of the word, are encoded into the different components of the word vectors.

We want to embed the word “paris”, so the vector corresponding to it becomes the query token. The query parameters apply a transformation to the query token: we multiply the non contextual embedding of the word vector by the matrix to obtain a different embedding. This transformation highlights certain aspects that are important.

Highlighting different aspect of the word : multiple attention head
  • We map the characteristic themes associated with the word into meta-topics, where each meta-topic is characterized by one of the rows of the parameter matrix.
  • The weights in each row of the parameter matrix highlight a particular aspect associated with the word vector.
  • In multi head attention → we have different parameter matrices for extracting specific information from the words.
  • Focusing on different linguistic aspects one at a time is beneficial. For instance, one head may focus on parts of speech while another focuses on verb tenses, and so on.
Multi head attention mechanism
  • When we apply these transformations using multiple heads, we end up with multiple outputs.
  • The multiple outputs are passed into a FFNN that combines them into a single output.
Multi Head self attention
  • Every token attends to every other token, so the complexity is O(n²).
  • Unlike RNNs, we can process the O(n²) connections in parallel (a rough sketch of multi head attention follows below).
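Here is a rough NumPy sketch of multi head attention: several independent sets of query, key and value parameters (the heads) each produce an output, and the outputs are combined into a single vector per token. The head count, the dimensions and the final combining matrix are illustration-only assumptions.

```python
# A rough sketch of multi head self attention with random placeholder weights.
import numpy as np

np.random.seed(0)
n, d_model, d_head, num_heads = 6, 8, 4, 2
X = np.random.randn(n, d_model)             # non-contextual word vectors

def one_head(X):
    # Each head has its own query, key and value parameters.
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    w = np.exp(Q @ K.T / np.sqrt(d_head))
    w /= w.sum(axis=-1, keepdims=True)      # every token attends to every token: O(n^2) scores
    return w @ V

heads = np.concatenate([one_head(X) for _ in range(num_heads)], axis=-1)
W_o = np.random.randn(num_heads * d_head, d_model)
output = heads @ W_o                        # combine the heads into one output per token
print(output.shape)                         # (6, 8)
```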

That’s it, folks! This tutorial gave you a big picture view of natural language modeling, self attention, multi head attention and transformer networks. Stay tuned for more!

The goal of this tutorial was to look at the big picture behind state of the art transformer models. I hope I was able to convey some ideas today. Many of the slides used in this tutorial are taken from the TensorFlow ML Tech Talks. The URL has been added to each slide and also to the resources section below. I highly recommend watching these ML Tech Talks; they are great resources created by the TensorFlow team.

Resources
