SentencePiece Tokenization

SentencePiece is not a tokenization algorithm by itself; it is used in combination with a subword algorithm, typically Byte Pair Encoding (BPE) or the Unigram language model.

The key difference is that the input doesn't need to be pre-tokenized into words: SentencePiece works directly on raw text. To do this, it treats the text as a plain sequence of Unicode characters and encodes the space itself as the meta symbol ▁ (U+2581), which makes detokenization a lossless string operation.
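
The preprocessing above can be sketched in a few lines of plain Python. This is a simplified illustration of the assumed behaviour, not the library's actual implementation: spaces become ▁, and decoding just reverses the substitution.

```python
# Minimal sketch of SentencePiece-style raw-text preprocessing (illustrative,
# not the real library code): spaces are replaced with the meta symbol "▁"
# so the tokenizer sees one uninterrupted stream of Unicode characters.

META = "\u2581"  # "▁", the whitespace marker SentencePiece uses

def to_raw_stream(text: str) -> str:
    """Mark word boundaries with ▁ instead of pre-tokenizing into words."""
    return META + text.replace(" ", META)

def detokenize(pieces: list[str]) -> str:
    """Lossless reverse mapping: join pieces, turn ▁ back into spaces."""
    return "".join(pieces).replace(META, " ").strip()

print(to_raw_stream("Hello World"))              # ▁Hello▁World
print(detokenize(["▁Hello", "▁Wor", "ld"]))      # Hello World
```

Because the space is just another symbol in the stream, subword pieces can freely cross what would otherwise be word boundaries, and decoding never needs language-specific rules.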

Pros:

  1. It can directly handle any language, since it doesn't depend on a word tokenizer or whitespace splitting (important for languages like Japanese or Chinese, which don't separate words with spaces)
  2. It is more flexible and can learn subword units directly from raw text, without language-specific preprocessing

Used In:

  1. T5
  2. mT5
  3. ALBERT
  4. XLNet

References

  1. https://aman.ai/primers/ai/tokenizer
  2. https://blog.floydhub.com/tokenization-nlp/

Related Notes