Sub-word Tokenizer
A sub-word tokenizer sits between a word tokenizer and a character tokenizer. Frequent words stay whole, so their word-level semantic meaning is kept, while rare or unseen words are split into smaller sub-word units, which addresses the out-of-vocabulary (OOV) problem.
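As a rough illustration of how sub-words handle an OOV word, here is a minimal sketch of greedy longest-match splitting. The function name `subword_tokenize`, the toy vocabulary, the `##` continuation marker (borrowed from the WordPiece convention), and the `[UNK]` fallback are all assumptions made for this example, not part of any specific library.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a word into sub-words from vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Assumed convention: "##" marks a piece that continues a word.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece of the vocabulary fits: fall back to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; "tokenizers" itself is not in it (OOV).
toy_vocab = {"token", "play", "##ize", "##r", "##s", "##ing", "##ed"}

print(subword_tokenize("tokenizers", toy_vocab))  # ['token', '##ize', '##r', '##s']
print(subword_tokenize("playing", toy_vocab))     # ['play', '##ing']
```

The word "tokenizers" never appears in the vocabulary, yet it is still represented by known pieces instead of a single unknown token.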
Common sub-word tokenizers:
- Byte Pair Encoding (BPE)
- Byte-level BPE
- Unigram Tokenization
- WordPiece Tokenization
- SentencePiece Tokenization
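
To make the first entry in the list concrete, the following is a minimal sketch of how BPE could learn its merges: count adjacent symbol pairs, merge the most frequent pair, and repeat. The function name `learn_bpe_merges` and the toy word frequencies are assumptions for illustration only; real implementations add details such as end-of-word markers and byte-level fallback.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} mapping."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: word -> frequency.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe_merges(toy_freqs, num_merges=2)
print(merges)
print(corpus)
```

After two merges on this toy corpus, the shared suffix "est" of "newest" and "widest" has become a single symbol, which is how frequent sub-words end up in the vocabulary.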