WordPiece Tokenization

WordPiece tokenization works much like Byte Pair Encoding (BPE), with a subtle difference: words are still split into subwords by iteratively merging pairs, but each merge is chosen by a score rather than by raw pair frequency.

Tokenizer Learner Steps:

  1. Collect all unique characters as the initial vocabulary, prefixing every character other than the first character of a word with ## to indicate that it continues a word
  2. Build the corpus as all words split into these characters (with ## on the continuation characters)
  3. Find the pair (a, b) with the highest score in the current corpus, where score = freq(ab) / (freq(a) × freq(b))
  4. Merge the pair into a new token ab and add it to the vocabulary
  5. Replace every adjacent occurrence of a, b in the corpus with ab to form the new corpus
  6. Go back to step 3 until the predetermined vocabulary size or iteration limit is reached (a code sketch of this loop follows the list)
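
A minimal sketch of the learner loop, assuming the corpus is supplied as a word → frequency dictionary; the function name train_wordpiece and its details are assumptions for illustration, not a particular library's API.

```python
from collections import Counter

def train_wordpiece(word_freqs, num_merges):
    # Steps 1-2: split every word into characters, prefixing non-initial ones with ##.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    vocab = sorted({tok for toks in splits.values() for tok in toks})

    for _ in range(num_merges):
        # Step 3: count tokens and adjacent pairs, weighted by word frequency.
        tok_freq, pair_freq = Counter(), Counter()
        for word, freq in word_freqs.items():
            toks = splits[word]
            for tok in toks:
                tok_freq[tok] += freq
            for a, b in zip(toks, toks[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break
        # Pick the pair with the highest score = freq(ab) / (freq(a) * freq(b)).
        a, b = max(pair_freq, key=lambda p: pair_freq[p] / (tok_freq[p[0]] * tok_freq[p[1]]))

        # Step 4: merge the pair into one token and add it to the vocabulary.
        merged = a + (b[2:] if b.startswith("##") else b)
        vocab.append(merged)

        # Step 5: replace every adjacent occurrence of (a, b) in the corpus.
        for word, toks in splits.items():
            new_toks, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == (a, b):
                    new_toks.append(merged)
                    i += 2
                else:
                    new_toks.append(toks[i])
                    i += 1
            splits[word] = new_toks
    return vocab

# Toy corpus from the example below: "low lower lowest" / "low me".
# The single merge adds ##st to the vocabulary.
print(train_wordpiece({"low": 2, "lower": 1, "lowest": 1, "me": 1}, num_merges=1))
```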

Tokenizer Segmenter Steps:

  1. Split the sentence into words on spaces
  2. For each word:
    1. Find the longest token in the vocabulary that matches the start of the remaining word and emit it
    2. Continue with the rest of the word (matched as ##-prefixed continuation tokens) until the word is fully consumed (a code sketch of this greedy loop follows the list)
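
A minimal sketch of this greedy longest-match segmenter, assuming the vocabulary is a plain Python set; the function name wordpiece_tokenize and the [UNK] fallback for words that cannot be matched are assumptions for illustration.

```python
def wordpiece_tokenize(text, vocab, unk="[UNK]"):
    tokens = []
    for word in text.split():
        start, word_tokens = 0, []
        while start < len(word):
            end, match = len(word), None
            # Try the longest remaining substring first, shrinking until it is in the vocabulary.
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:        # no vocabulary entry matches this piece:
                word_tokens = [unk]  # treat the whole word as unknown
                break
            word_tokens.append(match)
            start = end              # continue with the rest of the word
        tokens.extend(word_tokens)
    return tokens
```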

Example

Training Corpus:
low lower lowest
low me

Let's run the learner for one iteration.

Step 1:
Unique characters: [l, ##o, ##w, ##e, ##r, ##s, ##t, m]
Step 2:
New Corpus (word frequency, followed by the split word):
2 l ##o ##w
1 l ##o ##w ##e ##r
1 l ##o ##w ##e ##s ##t
1 m ##e

Step 3.0:
Highest-scoring pair: (##s, ##t), score = 1 / (1 × 1) = 1 (the snippet below checks every pair; the much more frequent (l, ##o) only scores 4 / (4 × 4) = 0.25)
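
To verify this, a small throwaway snippet with the token and pair counts hard-coded from the Step 2 corpus above:

```python
tok_freq  = {"l": 4, "##o": 4, "##w": 4, "##e": 3, "##r": 1, "##s": 1, "##t": 1, "m": 1}
pair_freq = {("l", "##o"): 4, ("##o", "##w"): 4, ("##w", "##e"): 2, ("##e", "##r"): 1,
             ("##e", "##s"): 1, ("##s", "##t"): 1, ("m", "##e"): 1}
for (a, b), f in pair_freq.items():
    print(a, b, f / (tok_freq[a] * tok_freq[b]))
# (##s, ##t) scores 1.0, the maximum; the frequent (l, ##o) and (##o, ##w) only score 0.25.
```
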
Step 4.0:
New Vocabulary: [l, ##o, ##w, ##e, ##r, ##s, ##t, m, ##st]
Step 5.0:
New Corpus:
2 l ##o ##w
1 l ##o ##w ##e ##r
1 l ##o ##w ##e ##st
1 m ##e

Final vocabulary: [l, ##o, ##w, ##e, ##r, ##s, ##t, m, ##st]

Test sentence: lower lose

  1. lower = l ##o ##w ##e ##r
  2. lose = l ##o ##s ##e

Note that neither test word uses the merged token; ##st only surfaces for a word like lowest, which segments as l ##o ##w ##e ##st.
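
Running the wordpiece_tokenize sketch from above with this final vocabulary reproduces these splits:

```python
vocab = {"l", "##o", "##w", "##e", "##r", "##s", "##t", "m", "##st"}
print(wordpiece_tokenize("lower lose", vocab))
# -> ['l', '##o', '##w', '##e', '##r', 'l', '##o', '##s', '##e']
```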

Used in

  1. BERT
  2. ERNIE
  3. DistilBERT

References

  1. https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt
  2. https://aman.ai/primers/ai/tokenizer/#wordpiece-1
