Wednesday 26 June 2024

Overview of CBOW

CBOW is a neural network model used to learn word embeddings. The goal is to predict a target word from the surrounding context words within a window.

Detailed Steps

  1. Context Window:
    • Definition: The context window is the span of words around a target word that you consider for prediction.
    • Example: Consider the sentence "The quick brown fox". If "quick" is the target word and the window size is 2 (one word on each side), the context words are "The" and "brown". If the window size were 4 (two words on each side), the context words would be "The", "brown", and "fox", since only one word is available to the left of "quick".
  2. Input Representation:
    • One-Hot Encoding: Convert each context word into a one-hot vector. A one-hot vector is a binary vector of the size of the vocabulary with all elements set to 0 except for the element corresponding to the word, which is set to 1.
    • Example: If the vocabulary is ["The", "quick", "brown", "fox"], the one-hot encoding for "The" would be [1, 0, 0, 0], and for "brown", it would be [0, 0, 1, 0].
  3. Projection Layer:
    • Average One-Hot Vectors: Combine the one-hot vectors of the context words by averaging them.
    • Example: If "The" and "brown" are the context words, average their one-hot vectors:
      • [1, 0, 0, 0] (for "The")
      • [0, 0, 1, 0] (for "brown")
      • Average: [0.5, 0, 0.5, 0]
    • Projection to Hidden Layer: Multiply this averaged vector by a weight matrix (W1), which maps the input space to a hidden layer of neurons. The result is the hidden layer representation.
      • If W1 is a matrix of size V × N (where V is the vocabulary size and N is the number of neurons in the hidden layer), the hidden layer representation is computed as h = W1ᵀ · x, where x is the averaged one-hot vector and ᵀ denotes the transpose.
  4. Output Layer:
    • From Hidden to Output: Multiply the hidden layer vector by another weight matrix (W2), which maps the hidden layer back to the vocabulary size.
      • If W2 is a matrix of size N × V, the output layer scores u are computed as u = W2ᵀ · h.
  5. Softmax:
    • Convert Scores to Probabilities: Apply the softmax function to the output scores to convert them into a probability distribution over the vocabulary. The softmax function ensures that all the probabilities sum to 1.
      • y_pred(i) = e^(u_i) / Σ_j e^(u_j), where u_i is the score for word i in the vocabulary. (A NumPy sketch of this forward pass follows, just after this list.)
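
Below is a minimal NumPy sketch of steps 1–5, assuming a toy four-word vocabulary, a hidden layer of 3 neurons, and randomly initialised weights; the names (one_hot, cbow_forward) and all sizes are illustrative, not taken from any particular library.

```python
import numpy as np

# Toy vocabulary and illustrative sizes; none of these values come from a real corpus.
vocab = ["The", "quick", "brown", "fox"]
V, N = len(vocab), 3                       # V = vocabulary size, N = hidden-layer size

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))    # input -> hidden weights (V x N)
W2 = rng.normal(scale=0.1, size=(N, V))    # hidden -> output weights (N x V)

def one_hot(word):
    """Binary vector of length V with a 1 at the word's position in the vocabulary."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

def softmax(u):
    e = np.exp(u - u.max())                # subtract the max for numerical stability
    return e / e.sum()

def cbow_forward(context_words):
    x = np.mean([one_hot(w) for w in context_words], axis=0)  # step 3: average the one-hots
    h = W1.T @ x                           # step 3: project to the hidden layer (N,)
    u = W2.T @ h                           # step 4: scores over the vocabulary (V,)
    return softmax(u)                      # step 5: probabilities that sum to 1

probs = cbow_forward(["The", "brown"])     # context from the running example
print({w: round(float(p), 3) for w, p in zip(vocab, probs)})
```

With small random weights the printed probabilities are close to uniform; training (steps 6 and 7 below) is what pushes probability toward the correct target word.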

Given Scores:

Assume we have scores for four words in our vocabulary:

  • u(The) = 2.0
  • u(quick) = 1.0
  • u(brown) = 0.1
  • u(fox) = 0.5

Steps:

  1. Exponentials of Scores:
    • e^2.0 ≈ 7.389
    • e^1.0 ≈ 2.718
    • e^0.1 ≈ 1.105
    • e^0.5 ≈ 1.649
  2. Sum of Exponentials:
    • Sum = 7.389 + 2.718 + 1.105 + 1.649 ≈ 12.861
  3. Softmax Probabilities (checked in the Python snippet below):
    • Probability(The) = 7.389 / 12.861 ≈ 0.574
    • Probability(quick) = 2.718 / 12.861 ≈ 0.211
    • Probability(brown) = 1.105 / 12.861 ≈ 0.086
    • Probability(fox) = 1.649 / 12.861 ≈ 0.128
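
The same arithmetic can be checked in a few lines of Python (the scores are the ones assumed above):

```python
import numpy as np

words = ["The", "quick", "brown", "fox"]
u = np.array([2.0, 1.0, 0.1, 0.5])      # the scores assumed above

exp_u = np.exp(u)                        # ~ [7.389, 2.718, 1.105, 1.649]
probs = exp_u / exp_u.sum()              # divide each exponential by the sum (~12.861)

for w, p in zip(words, probs):
    print(f"P({w}) = {p:.4f}")           # ~ 0.5745, 0.2114, 0.0859, 0.1282

print(f"sum = {probs.sum():.3f}")        # the probabilities sum to 1
```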

 

  6. Loss Function:
    • Measure Prediction Accuracy: Use a loss function to measure how far the predicted probabilities (from softmax) are from the actual distribution (where the target word has a probability of 1 and all other words have a probability of 0).
    • Negative Log Likelihood: The loss is typically computed using the negative log likelihood of the true word given the predicted probabilities:
      • loss = −log(y_pred[target_word]).
  7. Backpropagation:
    • Adjust Weights: Use backpropagation to adjust the weights (W1 and W2) in the network. This involves computing the gradient of the loss with respect to each weight and updating the weights in the direction that reduces the loss.
    • Gradient Descent: The weights are updated using gradient descent, where each weight is adjusted by subtracting the product of the learning rate and the gradient of the loss with respect to that weight (a minimal training-step sketch follows this list).
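
Here is a sketch of one training step for the same toy setup as the forward-pass snippet above, using the full softmax and plain gradient descent; practical word2vec implementations usually add tricks such as negative sampling or hierarchical softmax, which are not shown here. All sizes, weights, and the learning rate are illustrative choices.

```python
import numpy as np

# Toy setup mirroring the forward-pass sketch above (not values from a trained model).
vocab = ["The", "quick", "brown", "fox"]
V, N, lr = len(vocab), 3, 0.1
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))    # input -> hidden (V x N)
W2 = rng.normal(scale=0.1, size=(N, V))    # hidden -> output (N x V)

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# One training example: predict "quick" from its neighbours.
context, target = ["The", "brown"], "quick"
x = np.mean([one_hot(w) for w in context], axis=0)   # averaged one-hot input (V,)
h = W1.T @ x                                         # hidden layer (N,)
y = softmax(W2.T @ h)                                # predicted distribution (V,)

t = vocab.index(target)
loss = -np.log(y[t])                                 # negative log likelihood of the true word
print(f"loss before update: {loss:.4f}")

# Backpropagation: gradients of the loss with respect to the scores and both weight matrices.
e = y.copy()
e[t] -= 1.0                                          # dL/du = y_pred - y_true
dh = W2 @ e                                          # dL/dh, using W2 before it is updated
W2 -= lr * np.outer(h, e)                            # gradient-descent step for W2 (N x V)
W1 -= lr * np.outer(x, dh)                           # gradient-descent step for W1 (V x N)
```

Repeating this update for every (context, target) pair drawn from a large corpus is what gradually turns the rows of W1 into useful word vectors.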

Summary

In CBOW, the model learns to predict a target word based on the average representation of the surrounding context words. Through multiple iterations over a large corpus, the weights in the network (which ultimately form the word vectors) are adjusted to minimize the prediction error. These learned word vectors capture semantic relationships between words, enabling various natural language processing tasks.

Summary of CBOW

In the Continuous Bag of Words (CBOW) model, the goal is to predict a target word using the surrounding context words. Here's a more detailed breakdown:

  1. Context and Target Words:

    • Context Words: These are the words surrounding the target word within a defined window size.
    • Target Word: This is the word the model tries to predict based on the context words.

  2. Input Representation:

    • Each context word is converted into a one-hot vector, which is a binary vector of the size of the vocabulary with a single 1 indicating the word's position in the vocabulary and 0s elsewhere.

  3. Projection Layer:

    • The one-hot vectors of the context words are averaged.
    • This averaged vector is then multiplied by a weight matrix (W1) to produce a hidden layer representation, which captures the combined context information.

  4. Output Layer:

    • The hidden layer representation is multiplied by another weight matrix (W2) to produce scores for all words in the vocabulary.

  5. Softmax Function:

    • The scores are passed through the softmax function, which converts them into probabilities. These probabilities indicate the likelihood of each word in the vocabulary being the target word.

  6. Loss Function:

    • The model uses a loss function (typically negative log likelihood) to measure the difference between the predicted probabilities and the actual target word.
    • The loss is minimized through backpropagation, adjusting the weight matrices (W1 and W2) to improve predictions over time.

  7. Training Process:

    • The model is trained on a large corpus of text. Through multiple iterations (epochs), the weights are continuously updated to reduce the prediction error.

  8. Word Vectors:

    • The rows of the weight matrix W1 (after training) become the word vectors.
    • These vectors encode semantic relationships between words, such that words with similar meanings have similar vectors (see the similarity sketch after this list).

  9. Applications:

    • The learned word vectors can be used in various natural language processing (NLP) tasks, such as text classification, sentiment analysis, and machine translation, because they capture meaningful patterns and relationships in the data.
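
As a small illustration of point 8, the sketch below reads a word's vector out of W1 and compares two vectors with cosine similarity; the matrix here is filled with random numbers purely to stand in for a trained W1, so the printed similarity is not meaningful.

```python
import numpy as np

vocab = ["The", "quick", "brown", "fox"]
V, N = len(vocab), 3

# Stand-in for a trained W1; in practice these values come out of training.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))

def word_vector(word):
    """Row of W1 corresponding to the word: its learned embedding."""
    return W1[vocab.index(word)]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(word_vector("quick"), word_vector("fox")))
```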

Key Takeaway

The CBOW model learns to predict a word based on its context, and through this process, it generates word vectors that encapsulate semantic relationships. These word vectors are valuable for numerous NLP applications, as they represent words in a way that reflects their meanings and relationships to other words in the vocabulary.