Byte Level BPE

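It is the same as Byte Pair Encoding (BPE), except that the text is treated as a sequence of bytes (typically its UTF-8 encoding) instead of a sequence of characters. This way the initial vocabulary is very small (256 entries, one per possible byte value), yet it covers every character one can think of, because any character can be written as a sequence of bytes.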
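To make the byte view concrete, here is a minimal sketch, assuming UTF-8 as the byte encoding. The helper names (`to_byte_ids`, `most_frequent_pair`, `merge`) are made up for this note and do not come from any tokenizer library. It shows the 256-entry starting vocabulary and a single BPE merge step applied to byte IDs.

```python
from collections import Counter

BASE_VOCAB_SIZE = 256  # one ID per possible byte value (0..255)


def to_byte_ids(text: str) -> list[int]:
    """Turn text into a list of byte IDs (assumption: UTF-8 encoding)."""
    return list(text.encode("utf-8"))


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the adjacent pair of IDs that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = to_byte_ids("aaab")                    # [97, 97, 97, 98]
pair = most_frequent_pair(ids)               # (97, 97)
merged = merge(ids, pair, BASE_VOCAB_SIZE)   # first new token gets ID 256
print(ids, pair, merged)
```

The merge loop is the same as in ordinary BPE. The only difference is that the symbols being merged are byte values rather than characters.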

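This is mainly used to handle rare, out-of-vocabulary (OOV) characters at test time. For example, if this emoji (🥸) was not present in the training data but is present at test time, a character-level BPE would treat it as OOV, whereas byte-level BPE can still encode it as a sequence of its bytes.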
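The emoji case can be checked directly. The sketch below is only an illustration: the training and test strings are invented, and UTF-8 is assumed as the byte encoding. It compares a character vocabulary built from the training text with the byte view of the test text.

```python
train_text = "hi there"   # hypothetical training data (no emoji)
test_text = "hi 🥸"        # hypothetical test input

# Character-level view: the vocabulary is whatever characters were seen in training.
char_vocab = set(train_text)
unknown_chars = [ch for ch in test_text if ch not in char_vocab]
print(unknown_chars)      # the emoji is unknown at the character level

# Byte-level view: every UTF-8 byte is in the range 0..255,
# so it is already in the 256-entry base vocabulary.
byte_ids = list(test_text.encode("utf-8"))
print(byte_ids)           # the emoji becomes four byte IDs
print(all(0 <= b < 256 for b in byte_ids))

# The byte IDs decode back to the original text without loss.
round_trip = bytes(byte_ids).decode("utf-8")
print(round_trip == test_text)
```

The emoji costs four tokens before any merges are learned, which is the price of never needing an unknown token.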

Pros:

  1. It is impossible to have an unknown character with byte-level BPE, since every input is a sequence of bytes and all 256 byte values are in the vocabulary.

References


Related Notes