Word Embedding

Ajay krishnan
8 min read · Feb 26, 2022

Concept of similarity manifested through proximity

Representing text

When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.

How should we encode text for an ML model?

  • One Hot Encoding
  • Integer Encoding
  • Word vectors

Let's start with a naive view: assume that words are independent, discrete tokens. Gather all the unique tokens and build a dictionary of tokens by sorting them alphabetically.

We want to convert text into numbers, so the first thing we could do is assign an index to each token.

Tensorflow ML talks
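A minimal Python sketch of this index assignment, using a made-up two-sentence corpus:

```python
# Integer encoding: collect the unique tokens, sort them alphabetically,
# and assign each token its position in the sorted list.
corpus = ["the cat sat on the mat", "the dog ate my homework"]

tokens = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(tokens)}
print(word_to_index)
# {'ate': 0, 'cat': 1, 'dog': 2, 'homework': 3, 'mat': 4, 'my': 5, 'on': 6, 'sat': 7, 'the': 8}

# Each sentence becomes a list of integer indices.
encoded = [[word_to_index[w] for w in sentence.split()] for sentence in corpus]
print(encoded)
```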

There are many downsides to this approach:

  • The integer-encoding is arbitrary (it does not capture any relationship between words).
  • An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
  • Large integer values are not well suited to machine learning, and to gradient descent in particular

One Hot Encoding

One-hot encode each word in your vocabulary. Now each word is represented by a vector whose length equals the size of your vocabulary.
That means if your vocab size is 1000, then each word, when one-hot encoded, is a vector of length 1000.
This is not very efficient: to represent each word you create a zero vector as long as the vocabulary, then place a one in the index that corresponds to the word.

Tensorflow documentation

Downsides of one-hot encoding:

  • When stored as dense vectors, these encodings are very high-dimensional, because there is a dimension for every word in the vocabulary
  • One-hot encoding fails to capture any knowledge about the tokens; each one-hot vector only records the presence or absence of a token
    For example, “King” and “Queen” have a lot in common, but their one-hot vectors are orthogonal, with a 90-degree angle between them

Let's see this in action. Suppose our vocabulary has only three words:
Vocab = [ ‘King’, ‘Queen’, ‘Tiger’ ]
King = [ 1, 0, 0 ]
Queen = [ 0, 1, 0 ]
Tiger = [ 0, 0, 1 ]

One hot encoding
  • Size of vocab is 3, dimensionality of the vectors is 3
  • Words are unit vectors aligned with the axes (see the sketch below)
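A minimal NumPy sketch of the same three-word vocabulary, confirming that every pair of distinct one-hot vectors has cosine similarity 0, i.e. a 90-degree angle between them:

```python
import numpy as np

# Each word is a one-hot unit vector: a row of the identity matrix.
vocab = ["King", "Queen", "Tiger"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(one_hot["King"])                             # [1. 0. 0.]
print(cosine(one_hot["King"], one_hot["Queen"]))   # 0.0 -> orthogonal, 90 degrees apart
print(cosine(one_hot["King"], one_hot["King"]))    # 1.0
```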

Why not allow the words to occupy the entire space ?

Words can be continuous vectors in N-dimensional space, so rather than constraining them to be orthogonal to each other, let them lie anywhere in that N-dimensional space.

Word vectors

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. An embedding is a dense vector of floating point values (the length of the vector is a hyperparameter).

Word Vectors

Every word in our vocabulary is going to be mapped to a vector, and then we will do our analysis of natural language in the context of these word vectors.

Every word in our vocabulary is going to be mapped to a point in an ‘m’-dimensional space. The closer two words are in that space, the more related or synonymous we expect them to be; the further apart two words are, the more dissimilar we expect them to be.

The value of ‘m’, the dimensionality of the word vectors, depends on the particular problem we are solving, and we can choose different values of m.

Learning the mapping of every word to a vector:
Words are represented as vectors. When words are similar, they should be near each other in this vector space; when they are unrelated, they should be far apart.

A well-known approach to learning this mapping of words to vectors is Word2Vec.
Each of the vectors associated with a given word is often called an embedding.

Mapping words to vectors

Why are we doing this mapping?
Because words are composed of letters or shapes; they are not numbers. To do analysis on a computer, or with an algorithm, we need a way of mapping words to numbers we can work with.

Imagine that we have a vocabulary composed of V words.
vocab = [ w(1), w(2), …, w(V) ] This is our vocabulary.

We want to map each word in our vocab to an m-dimensional vector.
embedding = [ c(1), c(2), …, c(V) ] These are our embedding vectors.

Once we have this embedding we can do natural language processing.
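As a rough sketch of what this lookup table looks like in Keras (the values of V and m below are arbitrary choices for illustration, not anything prescribed here):

```python
import tensorflow as tf

# V words in the vocabulary, each mapped to an m-dimensional float vector.
V, m = 1000, 8
embedding_layer = tf.keras.layers.Embedding(input_dim=V, output_dim=m)

# Integer-encoded tokens go in, dense m-dimensional vectors come out.
token_ids = tf.constant([[1, 2, 3]])   # a sentence of three token ids
vectors = embedding_layer(token_ids)
print(vectors.shape)                   # (1, 3, 8)
```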

How do we learn these vectors ?

How do we learn these word vectors such that similar words have similar embeddings and dissimilar words have different embeddings?
We learn these embeddings by training on a supervised task.

Natural Language processing

Convolutional Neural Networks are excellent feature extractors. Can we use them for natural language processing?
Why not? Let's start with CNNs for processing text.

Convolutional Neural Network in the context of NLP

Let's look at the Convolutional Neural Network in the context of natural language processing, utilizing these Word2Vec-style word vectors.

Filters represent a concept

CNN filters

What we do is take a filter of size [m x d] and shift it through the data.
‘m’ corresponds to the dimensionality of the word vector.
‘d’ corresponds to the number of words covered by the filter.

Convolution operation

For example, let's suppose we have ‘k’ convolution filters of size [m x d].
The height of each filter, ‘m’, corresponds to the dimensionality of the word vector.
The width of each filter, in this case three, corresponds to a concept related to three consecutive words.

We take each filter and convolve it, shifting it through multiple positions along the length of our text. Whenever the filter (the concept reflected in three consecutive words) aligns with the text at the corresponding shift location, we expect a high correlation between the filter and the text.
For every shift location we get a number, and that number reflects the match between the filter and the text.
Through this process we take the original text and map it into a set of N vectors, each of which is k-dimensional.

Output of convolution

Each row corresponds to the degree of match between the corresponding filter and the text at the associated shift location.
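A small Keras sketch of this convolution step (all sizes are made up): k filters, each spanning d consecutive m-dimensional word vectors, produce one k-dimensional row per shift location. Note that Keras lays the output out as (positions, filters):

```python
import tensorflow as tf

# A text of N word vectors, each of dimension m, convolved with k filters of width d.
N, m, d, k = 20, 8, 3, 16

text = tf.random.normal((1, N, m))   # one sentence as N word vectors (random stand-ins)
conv = tf.keras.layers.Conv1D(filters=k, kernel_size=d, activation="relu")

features = conv(text)
print(features.shape)                # (1, N - d + 1, k) -> (1, 18, 16)
```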

Pooling step

For each row of this matrix (the output of the convolution), take the maximum value. This quantifies, for each filter, the largest correlation between that filter and the original text across all shift locations.

Max pooling operation

We get a ‘k’-dimensional vector that tells us, for each of the k filters, the maximum correlation between that filter and the entire text.

We can also do average pooling instead, which likewise gives a ‘k’-dimensional vector.

We can't work with raw symbols and words in our text corpus, so we need a way to encode these words as dense representations.
Once we have dense representations of the words in our vocabulary, we can take the ‘k’-dimensional pooled vector and send it through a logistic regression, or a multi-layer perceptron, to make a decision.
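Putting the pieces together, here is a hedged end-to-end sketch of such a text CNN for a binary classification task; the vocabulary size, embedding dimension, filter settings, and the task itself are assumptions made for illustration:

```python
import tensorflow as tf

# Embedding -> convolution -> max pooling -> sigmoid decision, as described above.
V, m, d, k = 10_000, 64, 3, 128   # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=V, output_dim=m),                 # words -> dense m-dim vectors
    tf.keras.layers.Conv1D(filters=k, kernel_size=d, activation="relu"),  # k filters over d consecutive words
    tf.keras.layers.GlobalMaxPooling1D(),                                 # max over shift locations -> k values
    tf.keras.layers.Dense(1, activation="sigmoid"),                       # logistic-regression-style decision
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A batch with one integer-encoded sentence of five tokens, just to show the shapes.
dummy_batch = tf.constant([[12, 7, 99, 3, 41]])
print(model(dummy_batch).shape)   # (1, 1): one probability per sentence
```

Swapping GlobalMaxPooling1D for GlobalAveragePooling1D gives the average-pooling variant mentioned earlier.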

The weights of the embedding layer are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets) and up to 1024-dimensional when working with large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

Once we have learned the embeddings we can reuse them on a similar problem.

For example, the embeddings learned on London traffic data can be used for predicting Frankfurt traffic.
We can simply load the embeddings from the trained London model and tell the Frankfurt model not to train this layer, or we can use these embeddings merely as a starting point for the Frankfurt model, in which case we set trainable = True.
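A hedged sketch of that reuse in Keras; the "pretrained" weights below are random stand-ins for whatever the first model actually learned (in practice you might pull them out of its embedding layer with get_weights()):

```python
import numpy as np
import tensorflow as tf

# Placeholder for weights that would come from the previously trained (e.g. London) model.
V, m = 1000, 8
pretrained_weights = np.random.normal(size=(V, m)).astype("float32")

# Reuse the learned embedding in the new (e.g. Frankfurt) model.
reused_embedding = tf.keras.layers.Embedding(
    input_dim=V,
    output_dim=m,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_weights),
    trainable=False,   # set trainable=True to use the weights only as a starting point
)
```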

The need of labeled data

If we are going to learn word vectors using a CNN or DNN, we require lots of labeled text.
This is expensive, and the labeling task is time consuming.

Learning in a supervised way using labeled data is time consuming, so what we want is to learn the embeddings from the text corpus itself:
directly learn the embeddings from the corpus without requiring any labeled data.
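A rough sketch of what "learning from the corpus itself" can look like: the skip-gram idea behind Word2Vec turns raw, unlabeled text into (center word, context word) training pairs, with no human labeling involved.

```python
# Generate self-supervised (center, context) pairs from a tiny made-up corpus:
# every word is paired with the words within a small window around it.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'),
#  ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the')]
```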

A note on performance when CNNs are used with sequential data like a text corpus or a time series.

One thing we can notice here is that even though CNNs are powerful models, they are not that powerful when dealing with sequence data like text corpora.
That means the performance of the CNN will be similar to that of a DNN. Why is that?

DNNs are capable of learning the contributions of specific regions in feature space. For example, take the MNIST digits dataset: each 2D image of size [28, 28] is flattened into a 1D vector of length 784 and passed through a dense layer.
At the dense layer we perform a weighted sum of the flattened pixels and apply an activation function. Here we are comparing pixels to pixels, so to the model an image and its rotated version are different.
Scaled, rotated, or otherwise transformed, to the human eye it is the same image: objects in an image stay the same under spatial transformations.
We needed something that could help us learn location-independent features, and our solution was CNNs. The convolution operation gives a high correlation wherever the filter aligns with a pattern in the image. CNN filters learn these location-independent features, which made CNNs powerful feature extractors in the image domain.
Here, however, we have a sequence. A sentence is a sequence of words, and each sequence has a different number of words in it, so we need a model that is robust to variable-length sequences.

RESOURCES

Notebooks 😎 NLP
