- Proposed in the paper "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
- Developed by Facebook AI Research (FAIR) Lab
- In fastText, every word is divided into multiple subwords (character n-grams)
- Unlike Word2Vec embeddings, the unit token is not the whole word but its subwords
- For example, for the word "great" and an n-gram range of 3-5, fastText takes all character n-grams as follows:
- n = 3: gre, rea, eat
- n = 4: grea, reat
- n = 5: great
- To get the word embedding of the word "great", fastText aggregates (sums) the embeddings of all six of the above n-gram tokens, as sketched below
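A minimal sketch of this idea in plain Python (no fastText library; the helper name `char_ngrams`, the toy vector table, and the vector size are illustrative assumptions, not the actual implementation):

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=5):
    """Return all character n-grams of `word` for n in [min_n, max_n].

    Note: the real fastText model first wraps the word in boundary
    markers, i.e. "<great>", before extracting n-grams; this sketch
    follows the simplified example above and omits them.
    """
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            ngrams.append(word[i:i + n])
    return ngrams


print(char_ngrams("great"))
# ['gre', 'rea', 'eat', 'grea', 'reat', 'great']

# Toy lookup table: one random vector per n-gram, standing in for the
# learned n-gram embedding matrix.
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=8) for g in char_ngrams("great")}

# The word embedding is the aggregate (here: the sum) of its n-gram vectors.
great_vec = np.sum([ngram_vectors[g] for g in char_ngrams("great")], axis=0)
print(great_vec.shape)  # (8,)
```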
Pros:
- Can handle any language thanks to the subword tokenizer (especially languages where words are not separated by spaces)
- Handles rare and out-of-vocabulary words better than GloVe or Word2Vec embeddings, since a word vector can be composed from its subword vectors (see the example after this list)
- More morphologically aware, as it captures the nuances of word morphology through subwords
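A hedged illustration of the rare/out-of-vocabulary-word advantage using gensim's `FastText` implementation (the tiny corpus and hyperparameters here are purely for demonstration; real training would use a large corpus):

```python
from gensim.models import FastText

# Toy corpus; in practice you would train on millions of sentences.
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "greatly", "entertaining"],
]

model = FastText(
    sentences,
    vector_size=32,   # embedding dimension
    window=3,
    min_count=1,
    min_n=3,          # smallest character n-gram
    max_n=5,          # largest character n-gram
    epochs=10,
)

# "greatest" never appears in the corpus, but fastText can still build a
# vector for it from the subword (n-gram) vectors it shares with
# "great" and "greatly".
print("greatest" in model.wv.key_to_index)   # False: out of vocabulary
print(model.wv["greatest"].shape)            # (32,): a vector is still produced
```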
Cons:
- Computationally more expensive on large datasets, since each word is represented by multiple subword vectors
References
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.