Byte Pair Encoding (BPE)
Byte Pair Encoding is a sub-word tokenizer: each word is divided into multiple sub-word units, based on frequency. Another variation of this is known as Byte-Level BPE.
Tokenizer Learner Steps:
- Get all the unique characters as the initial vocabulary
- Form the corpus of all words split into characters, marking whitespace (end of word) as _
- Get the pair (tL, tR) of most frequent adjacent tokens from the current corpus
- Merge the pair into tLtR and add it to the vocabulary
- Replace all adjacent tL, tR in the corpus with tLtR to form the new corpus
- Go to step 3 unless the predetermined vocabulary size is reached or the predefined iteration limit is reached
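The learner loop above can be sketched in Python (the document names no language, so this is an assumption; `learn_bpe` and its parameter names are hypothetical):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Steps 1-2: initial vocabulary of unique characters; corpus of
    # words split into characters, with an end-of-word marker "_"
    corpus = Counter(tuple(w) + ("_",) for w in words)
    vocab = sorted({ch for toks in corpus for ch in toks})
    merges = []
    for _ in range(num_merges):
        # Step 3: count adjacent token pairs, weighted by word frequency
        pairs = Counter()
        for toks, freq in corpus.items():
            for pair in zip(toks, toks[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 4: merge the most frequent pair (ties broken by first occurrence)
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Step 5: replace all adjacent occurrences of the pair in the corpus
        new_corpus = Counter()
        for toks, freq in corpus.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1])
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = learn_bpe(["low", "low", "lower", "lowest", "me"], 3)
```

On the example corpus below, the first two merges are (l, o) and (lo, w). At the third step (low, _) and (low, e) tie; this sketch's first-occurrence tie-break picks (low, _), whereas the worked example breaks the tie the other way.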
Tokenizer Segmenter Steps:
- Split the sentence into words by spaces
- Split each word into characters (with the end-of-word marker _)
- Apply the learned merges to each word, in the order they were learned
Example
Training Corpus:
low lower lowest
low me
Let's do it for 3 iterations.
Step 1:
Unique characters: [l, o, w, e, r, s, t, m, _]
Step 2:
New Corpus:
2 l o w _
1 l o w e r _
1 l o w e s t _
1 m e _
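The "most frequent pair" of Step 3 can be checked with a quick count over this corpus (Python is an assumption here, and the snippet is just a sanity check on the trace):

```python
from collections import Counter

# Corpus after Step 2: (frequency, tokens) entries
corpus = [
    (2, ["l", "o", "w", "_"]),
    (1, ["l", "o", "w", "e", "r", "_"]),
    (1, ["l", "o", "w", "e", "s", "t", "_"]),
    (1, ["m", "e", "_"]),
]

pair_counts = Counter()
for freq, toks in corpus:
    for pair in zip(toks, toks[1:]):
        pair_counts[pair] += freq

print(pair_counts.most_common(3))
```

Note that (l, o) and (o, w) both occur 4 times; the trace below takes (l, o).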
Step 3.0:
Most frequent pair: (l, o) (tied with (o, w); the tie is broken arbitrarily)
Step 4.0:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo]
Step 5.0:
New Corpus:
2 lo w _
1 lo w e r _
1 lo w e s t _
1 m e _
Step 3.1:
Most frequent pair: (lo, w)
Step 4.1:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo, low]
Step 5.1:
New Corpus:
2 low _
1 low e r _
1 low e s t _
1 m e _
Step 3.2:
Most frequent pair: (low, e) (tied with (low, _); the tie is broken in favour of (low, e))
Step 4.2:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo, low, lowe]
Step 5.2:
New Corpus:
2 low _
1 lowe r _
1 lowe s t _
1 m e _
Final vocabulary: [l, o, w, e, r, s, t, m, _, lo, low, lowe]
Merges in Order: [lo, low, lowe]
Test sentence: lower lose
- lower = l o w e r = lo w e r = low e r = lowe r
- lose = l o s e = lo s e
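The segmentation trace above can be reproduced by replaying the learned merges in order (Python is an assumption; `segment` is a hypothetical helper; the end-of-word marker _ is kept here even though the trace omits it):

```python
def segment(word, merges):
    toks = list(word) + ["_"]  # split into characters plus end-of-word marker
    for left, right in merges:  # replay merges in the order they were learned
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) == (left, right):
                out.append(left + right)
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return toks

merges = [("l", "o"), ("lo", "w"), ("low", "e")]
print(segment("lower", merges))  # ['lowe', 'r', '_']
print(segment("lose", merges))   # ['lo', 's', 'e', '_']
```

Note that "lose" never reaches the low or lowe merges: after merging (l, o), the lo token is followed by s, so no further learned pair applies.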