Sub-word Tokenizer

A sub-word tokenizer sits between a word tokenizer and a character tokenizer: frequent words are kept whole, so word-level semantic meaning is preserved, while rare or unseen words are split into smaller sub-word units, which addresses the out-of-vocabulary (OOV) issue.
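
To make the OOV point concrete, here is a minimal sketch of greedy longest-match segmentation against a small vocabulary. The function name `segment`, the `[UNK]` marker, and the toy vocabulary are all assumptions made up for illustration; real tokenizers learn their vocabulary from data and differ in how they match and mark pieces.

```python
def segment(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`.

    Illustrative only: `vocab` is a hand-written toy set, not a learned one.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece, not even a single character, matched at this position.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary: a few frequent pieces plus some single characters.
toy_vocab = {"token", "iz", "ation", "s", "t", "o", "k", "e", "n"}

print(segment("tokenization", toy_vocab))  # word not in vocab, still covered by pieces
print(segment("tokens", toy_vocab))
print(segment("xyz", toy_vocab))           # nothing matches, falls back to the unknown marker
```

Neither "tokenization" nor "tokens" is in the toy vocabulary as a whole word, yet both can be represented from known pieces, and the shared piece "token" keeps part of the word-level meaning.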

Different Sub-word Tokenizers:

  1. Byte Pair Encoding (BPE)
  2. Byte-level BPE
  3. Unigram Tokenization
  4. WordPiece Tokenization
  5. SentencePiece Tokenization
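
As an illustration of the first entry in the list, the sketch below learns BPE merges from a tiny word-frequency table: start from characters, count adjacent symbol pairs, and repeatedly merge the most frequent pair. The function name `learn_bpe` and the toy corpus are assumptions for this note; this is a simplified sketch, not the implementation used by any particular library (it omits end-of-word markers and byte-level handling, for example).

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} table.

    Returns the ordered list of merged pairs and the final segmented corpus.
    """
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: word -> number of occurrences.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_freqs, num_merges=2)
print(merges)
print(segmented)
```

The learned merges, applied in order, define the tokenizer: a new word is split into characters and the same merges are replayed on it. The other entries in the list differ mainly in how the vocabulary is chosen or in what the base symbols are.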

References


Related Notes