CBOW is a neural network model used to learn word embeddings. The goal is to predict a target word from the surrounding context words within a window.
Detailed Steps
- Context Window:
- Definition: The context window is the
span of words around a target word that you consider for prediction.
- Example: Consider the sentence "The quick brown fox". If "quick" is the target word and the window size is 2 (one context word on each side), the context words are "The" and "brown". With a window size of 4 (two context words on each side), the context words would be "The", "brown", and "fox".
- Input Representation:
- One-Hot Encoding: Convert each context word
into a one-hot vector. A one-hot vector is a binary vector of the size of
the vocabulary with all elements set to 0 except for the element
corresponding to the word, which is set to 1.
- Example: If the vocabulary is
["The", "quick", "brown", "fox"],
the one-hot encoding for "The" would be [1, 0, 0, 0], and for
"brown", it would be [0, 0, 1, 0].
- Projection Layer:
- Average One-Hot Vectors: Combine the one-hot vectors
of the context words by averaging them.
- Example: If "The" and
"brown" are the context words, average their one-hot vectors:
- [1, 0, 0, 0] (for "The")
- [0, 0, 1, 0] (for "brown")
- Average: [0.5, 0, 0.5, 0]
- Projection to Hidden Layer: Multiply this averaged
vector by a weight matrix (W1), which maps the input space to a hidden
layer of neurons. The result is the hidden layer representation.
- If W1 is a matrix of size V × N (where V is the vocabulary size and N is the number of neurons in the hidden layer) and the averaged one-hot vector x̄ is treated as a 1 × V row vector, the hidden layer representation is computed as h = x̄ × W1, a vector of size 1 × N.
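A sketch of the projection step with NumPy, following the row-vector convention above (x̄ is 1 × V, W1 is V × N); the random initialization and the hidden size N = 3 are illustrative assumptions.

```python
import numpy as np

V, N = 4, 3                               # vocabulary size; hidden size chosen for illustration
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))   # V x N input weight matrix

# One-hot vectors for the context words "The" and "brown".
context = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0]], dtype=float)

x_avg = context.mean(axis=0)              # averaged one-hot vector: [0.5, 0, 0.5, 0]
h = x_avg @ W1                            # hidden representation, shape (N,)
print(x_avg)
print(h)
```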
- Output Layer:
- From Hidden to Output: Multiply the hidden layer
vector by another weight matrix (W2), which maps the hidden layer back to
the vocabulary size.
- If W2 is a matrix of size N × V, the output layer scores u are computed as u = h × W2, a vector of size 1 × V (one score per vocabulary word).
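A small self-contained sketch of this output step; here h is a random stand-in for the hidden vector computed in the projection step, and W2 is randomly initialized for illustration.

```python
import numpy as np

V, N = 4, 3
rng = np.random.default_rng(0)
h = rng.normal(scale=0.1, size=N)          # stand-in for the hidden vector from the projection step
W2 = rng.normal(scale=0.1, size=(N, V))    # N x V output weight matrix

u = h @ W2                                 # raw scores, one per vocabulary word
print(u.shape)                             # (4,)
```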
- Softmax:
- Convert Scores to
Probabilities:
Apply the softmax function to the output scores to convert them into a
probability distribution over the vocabulary. The softmax function
ensures that all the probabilities sum to 1.
- ypred(i) = e^(ui) / Σj e^(uj), where ui is the score for the i-th word in the vocabulary.
- Worked Example:
  - Given Scores: assume we have scores for the four words in our vocabulary:
    - u(The) = 2.0
    - u(quick) = 1.0
    - u(brown) = 0.1
    - u(fox) = 0.5
  - Steps:
    1. Exponentials of Scores:
       - e^2.0 ≈ 7.389
       - e^1.0 ≈ 2.718
       - e^0.1 ≈ 1.105
       - e^0.5 ≈ 1.649
    2. Sum of Exponentials:
       - 7.389 + 2.718 + 1.105 + 1.649 ≈ 12.861
    3. Softmax Probabilities:
       - Probability(The) = 7.389 / 12.861 ≈ 0.574
       - Probability(quick) = 2.718 / 12.861 ≈ 0.211
       - Probability(brown) = 1.105 / 12.861 ≈ 0.086
       - Probability(fox) = 1.649 / 12.861 ≈ 0.128
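The worked example can be checked with a few lines of NumPy; the scores are the ones given above.

```python
import numpy as np

# Scores from the worked example above.
u = np.array([2.0, 1.0, 0.1, 0.5])     # The, quick, brown, fox

exp_u = np.exp(u)                      # ~[7.389, 2.718, 1.105, 1.649]
probs = exp_u / exp_u.sum()            # softmax: divide by the sum (~12.861)

for word, p in zip(["The", "quick", "brown", "fox"], probs):
    print(f"{word}: {p:.4f}")
# Matches the example above: ~0.574, 0.211, 0.086, 0.128
```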
- Loss Function:
- Measure Prediction Accuracy: Use a loss function to
measure how far the predicted probabilities (from softmax) are from the
actual distribution (where the target word has a probability of 1 and all
other words have a probability of 0).
- Negative Log Likelihood: The loss is typically
computed using the negative log likelihood of the true word given the
predicted probabilities:
- loss = −log(ypred[target_word])
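A small sketch of this loss for the running example, assuming the target word is "quick" and reusing the softmax scores from the worked example above; the variable names are illustrative.

```python
import numpy as np

vocab = ["The", "quick", "brown", "fox"]
u = np.array([2.0, 1.0, 0.1, 0.5])       # scores from the worked example
probs = np.exp(u) / np.exp(u).sum()      # softmax probabilities

target_index = vocab.index("quick")      # assume "quick" is the true target word

# Negative log likelihood: large when the true word gets low probability.
loss = -np.log(probs[target_index])
print(round(float(loss), 3))             # ~1.554
```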
- Backpropagation:
- Adjust Weights: Use backpropagation to
adjust the weights (W1 and W2) in the network. This involves computing
the gradient of the loss with respect to each weight and updating the
weights in the direction that reduces the loss.
- Gradient Descent: The weights are updated
using gradient descent, where each weight is adjusted by subtracting the
product of the learning rate and the gradient of the loss with respect to
that weight.
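A sketch of one gradient-descent update for a single (context, target) pair, under the row-vector convention used above. The gradients follow from the standard softmax/cross-entropy derivative (dL/du = probs − one_hot(target)); the hidden size and learning rate are illustrative assumptions.

```python
import numpy as np

vocab = ["The", "quick", "brown", "fox"]
V, N, lr = len(vocab), 3, 0.05                # hidden size and learning rate chosen for illustration
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))

# One training example: context ("The", "brown"), target "quick".
x_avg = np.array([0.5, 0.0, 0.5, 0.0])        # averaged one-hot context vector
target = vocab.index("quick")

# Forward pass.
h = x_avg @ W1                                # hidden representation (N,)
u = h @ W2                                    # scores (V,)
probs = np.exp(u) / np.exp(u).sum()           # softmax probabilities

# Backward pass: gradients of the negative log likelihood.
du = probs.copy()
du[target] -= 1.0                             # dL/du = probs - one_hot(target)
dW2 = np.outer(h, du)                         # dL/dW2, shape (N, V)
dh = du @ W2.T                                # dL/dh, shape (N,)
dW1 = np.outer(x_avg, dh)                     # dL/dW1, shape (V, N)

# Gradient-descent update: step opposite the gradient.
W1 -= lr * dW1
W2 -= lr * dW2
```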
Summary
In CBOW, the
model learns to predict a target word based on the average representation of
the surrounding context words. Through multiple iterations over a large corpus,
the weights in the network (which ultimately form the word vectors) are
adjusted to minimize the prediction error. These learned word vectors capture
semantic relationships between words, enabling various natural language
processing tasks.
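To tie the steps together, here is a minimal end-to-end CBOW training sketch on a toy corpus. The corpus, hidden size, learning rate, window, and epoch count are all illustrative assumptions; a practical implementation (such as word2vec) adds optimizations like negative sampling or hierarchical softmax.

```python
import numpy as np

corpus = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}
V, N, lr, window, epochs = len(vocab), 10, 0.05, 2, 200   # illustrative hyperparameters

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))

for _ in range(epochs):
    for sent in corpus:
        for t, target_word in enumerate(sent):
            # Context: up to `window` words on each side of the target.
            context = sent[max(0, t - window):t] + sent[t + 1:t + 1 + window]
            if not context:
                continue
            # Averaged one-hot input, projection, output scores, softmax.
            x = np.zeros(V)
            for w in context:
                x[w2i[w]] += 1.0 / len(context)
            h = x @ W1
            u = h @ W2
            probs = np.exp(u - u.max())           # subtract max for numerical stability
            probs /= probs.sum()
            # Backpropagation and gradient-descent update.
            du = probs.copy()
            du[w2i[target_word]] -= 1.0
            dh = du @ W2.T
            W2 -= lr * np.outer(h, du)
            W1 -= lr * np.outer(x, dh)

# After training, row i of W1 is the learned vector for vocab[i].
```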
Summary of CBOW
In the Continuous Bag of Words (CBOW) model, the goal is to predict a target
word using the surrounding context words. Here's a more detailed breakdown:
1. Context and Target Words:
- Context Words: These are the words
surrounding the target word within a defined window size.
- Target Word: This is the word the model tries
to predict based on the context words.
2. Input Representation:
- Each
context word is converted into a one-hot vector, which is a binary vector
of the size of the vocabulary with a single 1 indicating the word's
position in the vocabulary and 0s elsewhere.
3. Projection Layer:
- The
one-hot vectors of the context words are averaged.
- This
averaged vector is then multiplied by a weight matrix (W1) to produce a
hidden layer representation, which captures the combined context
information.
4. Output Layer:
- The
hidden layer representation is multiplied by another weight matrix (W2)
to produce scores for all words in the vocabulary.
5. Softmax Function:
- The
scores are passed through the softmax function, which converts them into
probabilities. These probabilities indicate the likelihood of each word
in the vocabulary being the target word.
6. Loss Function:
- The
model uses a loss function (typically negative log likelihood) to measure
the difference between the predicted probabilities and the actual target word.
- The
loss is minimized through backpropagation, adjusting the weight matrices
(W1 and W2) to improve predictions over time.
7. Training Process:
- The
model is trained on a large corpus of text. Through multiple iterations
(epochs), the weights are continuously updated to reduce the prediction
error.
8. Word Vectors:
- The
rows of the weight matrix W1 (after training) become the word vectors.
- These vectors encode semantic relationships between words, such that words with similar meanings have similar vectors (a short lookup sketch follows this list).
9. Applications:
- The
learned word vectors can be used in various natural language processing
(NLP) tasks, such as text classification, sentiment analysis, and machine
translation, because they capture meaningful patterns and relationships
in the data.
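As noted in step 8, the rows of the trained W1 serve as the word vectors. A minimal lookup-and-comparison sketch follows; W1 here is a random stand-in for a trained matrix, and the cosine_similarity helper is illustrative.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox"]
w2i = {w: i for i, w in enumerate(vocab)}

# Stand-in for a trained V x N weight matrix; in practice this comes from training.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(len(vocab), 10))

def word_vector(word):
    return W1[w2i[word]]                  # row of W1 = the word's embedding

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(word_vector("quick"), word_vector("fox")))
```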
Key Takeaway
The CBOW model learns to predict a word based on its context, and through
this process, it generates word vectors that encapsulate semantic
relationships. These word vectors are valuable for numerous NLP applications,
as they represent words in a way that reflects their meanings and relationships
to other words in the vocabulary.