SentencePiece Tokenization

SentencePiece is not a tokenization algorithm by itself; it is used in combination with a subword algorithm, typically Byte Pair Encoding (BPE) or the Unigram language model.

The key difference is that the input doesn't need to be pre-tokenized into words: SentencePiece works directly on raw text. To do this, it treats the text as a plain sequence of Unicode characters and encodes the space itself as the meta symbol ▁ (U+2581), which makes detokenization a lossless string operation.
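
The preprocessing above can be sketched in a few lines of plain Python. This is a simplified illustration of the assumed behaviour, not the library's actual implementation: spaces become ▁, and decoding just reverses the substitution.

```python
# Minimal sketch of SentencePiece-style raw-text preprocessing (illustrative,
# not the real library code): spaces are replaced with the meta symbol "▁"
# so the tokenizer sees one uninterrupted stream of Unicode characters.

META = "\u2581"  # "▁", the whitespace marker SentencePiece uses

def to_raw_stream(text: str) -> str:
    """Mark word boundaries with ▁ instead of pre-tokenizing into words."""
    return META + text.replace(" ", META)

def detokenize(pieces: list[str]) -> str:
    """Lossless reverse mapping: join pieces, turn ▁ back into spaces."""
    return "".join(pieces).replace(META, " ").strip()

print(to_raw_stream("Hello World"))              # ▁Hello▁World
print(detokenize(["▁Hello", "▁Wor", "ld"]))      # Hello World
```

Because the space is just another symbol in the stream, subword pieces can freely cross what would otherwise be word boundaries, and decoding never needs language-specific rules.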

Pros:

  1. It can directly handle any language, since it doesn't depend on a word tokenizer or whitespace splitting (important for languages like Japanese or Chinese, which don't separate words with spaces)
  2. It is more flexible and can learn subword units directly from raw text, without language-specific preprocessing

Used In:

  1. T5
  2. mT5
  3. ALBERT
  4. XLNet

References

  1. https://aman.ai/primers/ai/tokenizer
  2. https://blog.floydhub.com/tokenization-nlp/

Related Notes