Language Modeling

Ajay krishnan
8 min read · Feb 28, 2022

Learn word vectors directly from a corpus of text: Unsupervised Learning

Today we are going to take a look at how we can model natural language. Can we learn good word vectors using a text corpus alone? That is the question we answer in this tutorial.

We looked at what word embeddings are and how to learn them from labeled data, as a supervised task, in a previous tutorial [ link ].

Today is all about how to learn these word vectors in an unsupervised way. The main reason we have to learn them without supervision is that learning word vectors as a supervised task is expensive: it requires a human to label the data for us.

Word vectors: let's take every word in our vocabulary, map it to a vector, and do some analysis in the space of these word vectors.

Our goal is to learn word embeddings. Every word in our vocabulary is going to be mapped to a point in an m-dimensional space. The closer two words are in that m-dimensional space, the more related or synonymous we expect them to be; the further apart they are, the more dissimilar we expect them to be.

Word embedding
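To make "closeness" concrete, here is a minimal sketch, using NumPy and made-up 3-dimensional vectors purely for illustration, of comparing word vectors with cosine similarity:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors: closer to 1.0 means more similar
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-d embeddings, invented for this example
dog = np.array([0.9, 0.1, 0.3])
puppy = np.array([0.8, 0.2, 0.35])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(dog, puppy))  # high: related words sit close together
print(cosine_similarity(dog, car))    # lower: unrelated words sit further apart

In a real embedding space the vectors would have hundreds of dimensions and would be learned from data, but the comparison works the same way.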

We want to learn these vectors in a way that does not require labeled data. We assume we have access to a large corpus of unlabeled, raw documents. The idea is to learn the word vectors, without supervision, in such a way that each word in a given document is predictive of the presence of its surrounding words.

Consider an example: guess the next word.

Guess the next word

Humans build representations over time. We use these representations to predict what comes next.

Suppose your friend was talking to you on the phone but stopped abruptly in the middle of the conversation. This was the last sequence:

— “John loves dogs and his dogs wag their ???????????” —

What would the next word be? Did the abrupt halt leave you stuck?

The fact that you were able to guess the next word suggests that you built up a representation and used it to predict the next word. Can we build machine learning models that are capable of taking a given word and making predictions about the surrounding words?

This will be done using word vectors. Good word vectors, therefore, are vectors that let us predict the words likely to surround a given word.

Suppose we have access to a large text corpus. We can look at each word in that corpus and at the words around it. We would like to learn word vectors in such a way that, using those vectors, we can effectively predict which words might appear in the vicinity of any given word.

Natural Language Modeling: The Big Picture

Natural language model: The Big Picture

For the moment, let's assume we already have the word embedding, that is, a mapping from words in the vocabulary to vectors in the embedding space. We give the model the word vector corresponding to the n-th word as input, and the model outputs a probability score for each word in the vocabulary. Each output probability quantifies how likely that vocabulary word is to live in the neighborhood of the input word.

At the hidden layer we compute a weighted sum and apply a non-linear activation function. A common choice of non-linearity is tanh.

h = tanh(W_hidden * input + b_hidden)

Hidden layer

The output of the hidden layer is passed to the output layer, where we have one neuron for each word in the vocabulary.

y = W_output * hidden_units + b_output

Output layer

The outputs y are then passed into a softmax layer, which gives us the probabilities.

Softmax layer

Here we are trying to build a model that, given an input word, predicts the presence of its surrounding words.

This is how we model text:

Modeling text
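Putting the pieces above together, here is a minimal NumPy sketch of the forward pass just described. The dimensions, weight names, and random initialization are all made up for illustration; this is an untrained toy model, not a reference implementation.

import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab_size, embed_dim, hidden_dim = 10_000, 300, 128

rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.01, size=(hidden_dim, embed_dim))   # hidden-layer weights
b_h = np.zeros(hidden_dim)
W_o = rng.normal(scale=0.01, size=(vocab_size, hidden_dim))  # output-layer weights
b_o = np.zeros(vocab_size)

x = rng.normal(size=embed_dim)        # word vector of the n-th word (input)

h = np.tanh(W_h @ x + b_h)            # hidden layer: weighted sum + tanh
logits = W_o @ h + b_o                # one score per word in the vocabulary
probs = softmax(logits)               # probability of each word being a neighbor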

Note: the softmax function is just a generalization of the logistic (sigmoid) function.

Suppose our vocabulary has just two words. If the probability of w1 given the input is p(w1), then the probability of the second word is 1 - p(w1).

Generalization of sigmoid
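A quick numerical check of this claim (a sketch, not part of any library): with a two-word vocabulary, the softmax probability of the first word equals the sigmoid of the difference of the two logits.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z1, z2 = 2.0, -0.5                 # two arbitrary logits
p = softmax(np.array([z1, z2]))
print(p[0], sigmoid(z1 - z2))      # both ~0.924: softmax reduces to the sigmoid
print(p[1], 1 - p[0])              # the second word's probability is 1 - p(w1)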

Big picture: the architecture for modeling text

  • How can we use this model architecture to learn word vectors?
  • How do we learn the parameters of the model, the weights and biases?
  • Do we need labeled data? Is this supervised learning?
  • Can we learn the model parameters in an unsupervised way?

The thing to notice here is that we can learn this model from the text itself. We don't need any human to label the meaning of the text.

CONTINUOUS BAG OF WORDS MODEL

CBOW

In the CBOW architecture we use the context around a word as input to the model. We take the word vectors for all the words in the neighborhood of the n-th word, “dog”. We then average those word vectors to get a fixed-size input; this averaging, which discards word order, is what puts the “bag” in bag of words.

We feed this averaged word vector as input to the model. In the hidden layer we compute a weighted sum and apply the hyperbolic tangent activation function.

The hidden units are then forwarded to the output layer, where we have one unit / neuron for each word in the vocabulary. At the output layer we compute a weighted sum, and finally we pass the output logits through the softmax layer to get probability scores.

We take the argmax of the output predictions to get the index of the predicted word, and we can look that word up in our vocabulary. This is the word the model predicts as the n-th word given the context.
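Here is a minimal sketch of the CBOW forward pass just described. The toy vocabulary, dimensions, and random weights are hypothetical; the point is only to show the averaging (“bag”) step followed by the hidden layer, output layer, and softmax.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["john", "loves", "his", "dog", "and", "it", "wags"]
vocab_size, embed_dim, hidden_dim = len(vocab), 50, 64

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(vocab_size, embed_dim))     # embedding matrix
W_h = rng.normal(scale=0.01, size=(hidden_dim, embed_dim))
b_h = np.zeros(hidden_dim)
W_o = rng.normal(scale=0.01, size=(vocab_size, hidden_dim))
b_o = np.zeros(vocab_size)

# Context words around the held-out center word "dog"
context_ids = [vocab.index(w) for w in ["john", "loves", "his", "and"]]

x = E[context_ids].mean(axis=0)            # average the context vectors (the "bag")
h = np.tanh(W_h @ x + b_h)                 # hidden layer: weighted sum + tanh
probs = softmax(W_o @ h + b_o)             # probability for every vocabulary word

predicted = vocab[int(np.argmax(probs))]   # the model's guess for the center word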

Drawbacks of this approach:

  • The continuous bag of words model doesn't take word order into account
  • Word order is important

For example, consider the next-word prediction problem I mentioned at the start of this tutorial. We talked about how humans are able to predict the next word in a sequence.

  • Humans build representations over time and use these representations to predict what comes next.
  • Here is another example: The united states of ______

What could be the next possible word?

  • Consider this sequence: The states united of ______

Now what do you say? Isn’t word order important?

SKIP GRAM MODEL

Skip Gram model

In the Skip Gram architecture, instead of using the context words as input, we use the n-th word as input and ask the model to predict the surrounding words.

We take the n-th word's embedding and pass it as input to the model. At the hidden layer we compute a weighted sum and apply the hyperbolic tangent activation function.

The hidden units are then forwarded to the output layer, where we have one unit / neuron for each word in the vocabulary. At the output layer we compute a weighted sum, and finally we pass the output logits through the softmax layer to get probability scores. We can take the k highest-probability words from the model output as the predicted neighborhood of the input word.
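And a matching sketch for the Skip Gram direction, again with a made-up vocabulary and untrained random weights: the input is a single word vector, and we read off the top-k most probable words as its predicted neighborhood.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["john", "loves", "his", "dog", "and", "it", "wags"]
vocab_size, embed_dim, hidden_dim = len(vocab), 50, 64

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(vocab_size, embed_dim))     # embedding matrix
W_h = rng.normal(scale=0.01, size=(hidden_dim, embed_dim))
b_h = np.zeros(hidden_dim)
W_o = rng.normal(scale=0.01, size=(vocab_size, hidden_dim))
b_o = np.zeros(vocab_size)

x = E[vocab.index("dog")]              # embedding of the n-th word (the input)
h = np.tanh(W_h @ x + b_h)             # hidden layer: weighted sum + tanh
probs = softmax(W_o @ h + b_o)         # probability for every vocabulary word

k = 3
top_k = np.argsort(probs)[::-1][:k]    # indices of the k most probable words
print([vocab[i] for i in top_k])       # predicted neighborhood of "dog"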

Unsupervised learning

  • Supervised learning with labeled data is expensive, so we have built a model architecture that can learn from the text corpus itself.
  • We can just take the text and, via the Skip-Gram or CBOW model, directly learn these word vectors. We don’t need any labeled data.
  • We’re just going to take a corpus of documents and learn this predictive model.
  • How do we learn the model parameters?
  • What loss are we optimizing?

A model is represented by its parameters: its weights and biases. We learn these parameters by optimizing a cost function. We can represent our natural language model as a probabilistic model.

Mathematical representation
  • We need to optimize the model parameters so that, given an input, the model gives us the right output.
  • Our model outputs probabilities: at the output layer we have a unit/neuron for each word in the vocabulary.
  • We take the log() of the probability assigned to the observed output word and add these log-probabilities up.
  • We do this for every input-output pair.
  • We seek model parameters that maximize this log-likelihood (equivalently, minimize its negative); a small numerical sketch follows after this list.
Loss function for our model
  • If we can optimize this cost function, it means we have a model that is good at predicting the desired output given the input.
  • Through this process of optimization we have learned good embedding vectors for the words in our vocabulary.
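As a concrete toy example of this objective (a sketch with made-up numbers, assuming we already have the model's softmax outputs): for each input-output pair we take the log of the probability assigned to the observed word, sum these logs, and negate the sum; training means making this negative log-likelihood as small as possible.

import numpy as np

def neg_log_likelihood(prob_rows, target_ids):
    # prob_rows[i] is the model's softmax output for the i-th input word;
    # target_ids[i] is the index of the word actually observed next to it.
    log_probs = np.log(prob_rows[np.arange(len(target_ids)), target_ids])
    return -np.sum(log_probs)          # maximizing log-likelihood == minimizing this

# Toy example: 2 training pairs over a 4-word vocabulary
prob_rows = np.array([[0.70, 0.10, 0.10, 0.10],
                      [0.05, 0.05, 0.80, 0.10]])
target_ids = np.array([0, 2])          # the observed neighboring words
print(neg_log_likelihood(prob_rows, target_ids))   # ~0.580 for these made-up numbers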

That’s it, folks; this tutorial gives you a big-picture view of natural language modeling and of learning word vectors.

This is just the tip of the iceberg, and we have much more to cover: RNNs, LSTMs, attention, Transformers, to name a few. So stay tuned, and bye for now.

RESOURCES

Notebooks 😎 NLP
