- Proposed in the paper "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
- Developed by Facebook AI Research (FAIR) Lab
- In fastText, every word is divided into multiple subwords (character n-grams)
- Unlike Word2Vec embeddings, the unit token is not the whole word but its subwords
- For example, for the word "great" and an n-gram range of 3-5, fastText takes all character n-grams as follows:
- n = 3: gre, rea, eat
- n = 4: grea, reat
- n = 5: great
- To get the word embedding of the word "great", fastText aggregates (sums) the embeddings of all six of the above n-gram tokens, as sketched below
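A minimal sketch of this idea in plain Python (no fastText library; the helper name `char_ngrams`, the toy vector table, and the vector size are illustrative assumptions, not the actual implementation):

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=5):
    """Return all character n-grams of `word` for n in [min_n, max_n].

    Note: the real fastText model first wraps the word in boundary
    markers, i.e. "<great>", before extracting n-grams; this sketch
    follows the simplified example above and omits them.
    """
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            ngrams.append(word[i:i + n])
    return ngrams


print(char_ngrams("great"))
# ['gre', 'rea', 'eat', 'grea', 'reat', 'great']

# Toy lookup table: one random vector per n-gram, standing in for the
# learned n-gram embedding matrix.
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=8) for g in char_ngrams("great")}

# The word embedding is the aggregate (here: the sum) of its n-gram vectors.
great_vec = np.sum([ngram_vectors[g] for g in char_ngrams("great")], axis=0)
print(great_vec.shape)  # (8,)
```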
Pros:
- Can handle any language thanks to the subword tokenizer (especially languages where words are not separated by spaces)
- Handles rare and out-of-vocabulary words better than GloVe or Word2Vec embeddings, since a word vector can be composed from its subword vectors (see the example after this list)
- More morphologically aware, as it captures the nuances of word morphology through subwords
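A hedged illustration of the rare/out-of-vocabulary-word advantage using gensim's `FastText` implementation (the tiny corpus and hyperparameters here are purely for demonstration; real training would use a large corpus):

```python
from gensim.models import FastText

# Toy corpus; in practice you would train on millions of sentences.
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "greatly", "entertaining"],
]

model = FastText(
    sentences,
    vector_size=32,   # embedding dimension
    window=3,
    min_count=1,
    min_n=3,          # smallest character n-gram
    max_n=5,          # largest character n-gram
    epochs=10,
)

# "greatest" never appears in the corpus, but fastText can still build a
# vector for it from the subword (n-gram) vectors it shares with
# "great" and "greatly".
print("greatest" in model.wv.key_to_index)   # False: out of vocabulary
print(model.wv["greatest"].shape)            # (32,): a vector is still produced
```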
Cons:
- Computationally more expensive on large datasets, since each word is represented by multiple subword vectors
References
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.