Byte Pair Encoding (BPE)

Byte Pair Encoding is a sub-word tokenizer: each word is divided into multiple sub-words, chosen according to how frequently adjacent symbol pairs occur in a training corpus. A common variation is Byte-Level BPE, which runs the same algorithm over raw UTF-8 bytes instead of characters, so the base vocabulary is just the 256 byte values and no input is ever out of vocabulary.
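
A quick illustration of the byte-level idea, using only Python's built-in string encoding (nothing here is BPE-specific):

    text = "café"
    # UTF-8 encodes "é" as two bytes, so byte-level BPE sees 5 base symbols
    # even though the string has only 4 characters.
    print(list(text.encode("utf-8")))  # [99, 97, 102, 195, 169]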

Tokenizer Learner Steps (see the sketch after this list):

  1. Collect all the unique characters in the corpus as the initial vocabulary
  2. Build a corpus of all words split into characters, marking word boundaries with _
  3. Find the most frequent pair (a, b) of adjacent tokens in the current corpus
  4. Merge the pair into a new token ab and add it to the vocabulary
  5. Replace every adjacent occurrence of a, b in the corpus with ab to form the new corpus
  6. Repeat from step 3 until the predetermined vocabulary size or the iteration limit is reached
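
A minimal Python sketch of the learner, following the steps above. It assumes the input is already split into whitespace-separated words; the names learn_bpe and num_merges are illustrative, not from any particular library.

    from collections import Counter

    def learn_bpe(words, num_merges):
        # Step 2: each word becomes a tuple of symbols ending in the
        # word-boundary marker "_", mapped to its frequency.
        corpus = Counter(tuple(word) + ("_",) for word in words)
        # Step 1: the initial vocabulary is every unique symbol.
        vocab = {sym for word in corpus for sym in word}
        merges = []
        for _ in range(num_merges):
            # Step 3: count adjacent token pairs, weighted by word frequency.
            pairs = Counter()
            for word, freq in corpus.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]
            # Step 4: merge the most frequent pair into a single token.
            merges.append((a, b))
            vocab.add(a + b)
            # Step 5: rewrite the corpus with every adjacent (a, b) merged.
            new_corpus = Counter()
            for word, freq in corpus.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                        out.append(a + b)
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_corpus[tuple(out)] += freq
            corpus = new_corpus
        return vocab, merges

Ties in pair frequency are broken here by first occurrence; this detail matters in the worked example below, and real implementations handle it differently.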

Tokenizer Segmenter Steps (see the sketch after this list):

  1. Split the sentence into words by whitespace
  2. For each word
    1. Split the word into characters, then apply the learned merges in the order they were learned, merging each pair wherever it appears

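A matching sketch of the segmenter, under the same assumptions; note that it only needs the ordered merge list, not the vocabulary:

    def segment(word, merges):
        # Start from characters plus the end-of-word marker, then replay
        # the learned merges in the order they were learned.
        symbols = list(word) + ["_"]
        for a, b in merges:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols
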
Example

Training Corpus:
low lower lowest
low me

Let's run the learner for 3 iterations.

Step 1:
Unique characters: [l, o, w, e, r, s, t, m, _]
Step 2:
New Corpus:
2 l o w _
1 l o w e r _
1 l o w e s t _
1 m e _

Step 3.0:
Most frequent pair: (l, o), with frequency 4 (tied with (o, w); the tie is broken arbitrarily)
Step 4.0:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo]
Step 5.0:
New Corpus:
2 lo w _
1 lo w e r _
1 lo w e s t _
1 m e _

Step 3.1:
Most frequent pair: (lo, w), with frequency 4
Step 4.1:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo, low]
Step 5.1:
New Corpus:
2 low _
1 low e r _
1 low e s t _
1 m e _

Step 3.2:
Most frequent pair: (low, e), with frequency 2 (tied with (low, _); again the tie is broken arbitrarily)
Step 4.2:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo, low, lowe]
Step 5.2:
New Corpus:
2 low _
1 lowe r _
1 lowe s t _
1 m e _

Final vocabulary: [l, o, w, e, r, s, t, m, _, lo, low, lowe]
Merges in order: (l, o) → lo, (lo, w) → low, (low, e) → lowe

Test sentence: lower lose

  1. lower = l o w e r _ = lo w e r _ = low e r _ = lowe r _ (final tokens: lowe, r, _)
  2. lose = l o s e _ = lo s e _ (no further learned merges apply, since los and lose were never merged during training; final tokens: lo, s, e, _)
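
The same segmentations can be reproduced with the sketches above. The merges are written out explicitly here, since the third merge learned by learn_bpe depends on how the (low, e) / (low, _) tie is broken:

    merges = [("l", "o"), ("lo", "w"), ("low", "e")]
    print(segment("lower", merges))  # ['lowe', 'r', '_']
    print(segment("lose", merges))   # ['lo', 's', 'e', '_']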

Used In

  1. LLaMA
  2. GPT-2, GPT-3 (the Byte-Level variant)
  3. XLNet
  4. BART
