Negative Sampling
Negative sampling was proposed by the word2vec authors in their second paper. From the Word2Vec Embedding section, we can see that even for a vocabulary of only 10,000 words (with 300-dimensional embeddings), each weight matrix contains 10,000 × 300 = 3M weights, and a full softmax update touches all of them for every training sample, which is very inefficient.
So the authors proposed that instead of updating the output weights for all of the negative words, they sample K negative words and update the output weights for only K + 1 words (1 positive + K negatives).
The authors found that for small datasets a good value of K is 5-20, while for large datasets values as small as 2-5 work well.
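Below is a minimal NumPy sketch of one skip-gram training step with negative sampling. This is my own illustration, not the authors' C implementation: the names `train_step`, `W_in`, and `W_out` and the toy word counts are made up, while the unigram^(3/4) noise distribution is the one reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, K = 10_000, 300, 5               # sizes from the text; K = 5 negatives
W_in = rng.normal(0.0, 0.01, (VOCAB, DIM))   # input-side ("hidden layer") weights
W_out = rng.normal(0.0, 0.01, (VOCAB, DIM))  # output-layer weights
LR = 0.025                                   # learning rate (illustrative value)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(center_id, positive_id, noise_probs):
    """One SGD step: update 1 input row and K + 1 output rows."""
    # Draw K negative word ids from the noise distribution.
    # (A real implementation also rejects collisions with the positive word;
    # duplicate negatives, if any, are applied once here for simplicity.)
    negatives = rng.choice(VOCAB, size=K, p=noise_probs)

    h = W_in[center_id]                        # hidden activation: one row of W_in
    targets = np.concatenate(([positive_id], negatives))
    labels = np.zeros(K + 1)
    labels[0] = 1.0                            # 1 positive, K negatives

    scores = sigmoid(W_out[targets] @ h)       # (K+1,) predicted probabilities
    errors = scores - labels                   # logistic-loss gradient per target

    grad_h = errors @ W_out[targets]           # accumulate before updating W_out
    W_out[targets] -= LR * np.outer(errors, h)  # only K + 1 output rows change
    W_in[center_id] -= LR * grad_h              # only 1 input row changes

# Toy usage: noise distribution proportional to count^(3/4), as in the paper
counts = rng.integers(1, 100, VOCAB).astype(float)
noise_probs = counts ** 0.75
noise_probs /= noise_probs.sum()
train_step(center_id=42, positive_id=7, noise_probs=noise_probs)
```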
- In that way, only (K + 1) × 300 output-layer weights are updated (for K = 5 that is 6 × 300 = 1,800), which is far fewer than the 3M weights of the full output layer
- And for the hidden layer, it will always update only 1 × 300 weights (negative sampling or not). Why?
- Because the input is a one-hot vector: every position except the input word's is 0, so the gradient for every other row of the input weight matrix is 0 (see the sketch after this list)
- The same idea of approximating an expensive full-softmax update with a few sampled negatives can be used in many other places
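To make the one-hot argument concrete, here is a toy NumPy check (the sizes are arbitrary). Because the hidden activation is `x @ W_in`, the gradient of the loss with respect to `W_in` is the outer product of the one-hot input `x` and the upstream gradient on the hidden layer, so every row except the input word's is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 10, 4                   # toy sizes, just for the check

x = np.zeros(VOCAB)
x[3] = 1.0                           # one-hot input for word id 3
delta = rng.normal(size=DIM)         # some upstream gradient on the hidden layer

grad_W_in = np.outer(x, delta)       # dL/dW_in when hidden = x @ W_in

# Every row except row 3 is exactly zero, so only that row gets updated
nonzero_rows = np.flatnonzero(np.abs(grad_W_in).sum(axis=1))
print(nonzero_rows)                  # -> [3]
```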