Byte Pair Encoding (BPE)

Byte Pair Encoding is a sub-word tokenizer: each word is divided into multiple sub-words, chosen according to how frequently adjacent symbol pairs occur in a training corpus. A common variation is Byte-Level BPE, which runs the same algorithm over raw UTF-8 bytes instead of characters, so the base vocabulary is just the 256 byte values and no input is ever out of vocabulary.
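
A quick illustration of the byte-level idea, using only Python's built-in string encoding (nothing here is BPE-specific):

    text = "café"
    # UTF-8 encodes "é" as two bytes, so byte-level BPE sees 5 base symbols
    # even though the string has only 4 characters.
    print(list(text.encode("utf-8")))  # [99, 97, 102, 195, 169]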

Tokenizer Learner Steps (see the sketch after this list):

  1. Collect all the unique characters in the corpus as the initial vocabulary
  2. Build a corpus of all words split into characters, marking word boundaries with _
  3. Find the most frequent pair (a, b) of adjacent tokens in the current corpus
  4. Merge the pair into a new token ab and add it to the vocabulary
  5. Replace every adjacent occurrence of a, b in the corpus with ab to form the new corpus
  6. Repeat from step 3 until the predetermined vocabulary size or the iteration limit is reached
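
A minimal Python sketch of the learner, following the steps above. It assumes the input is already split into whitespace-separated words; the names learn_bpe and num_merges are illustrative, not from any particular library.

    from collections import Counter

    def learn_bpe(words, num_merges):
        # Step 2: each word becomes a tuple of symbols ending in the
        # word-boundary marker "_", mapped to its frequency.
        corpus = Counter(tuple(word) + ("_",) for word in words)
        # Step 1: the initial vocabulary is every unique symbol.
        vocab = {sym for word in corpus for sym in word}
        merges = []
        for _ in range(num_merges):
            # Step 3: count adjacent token pairs, weighted by word frequency.
            pairs = Counter()
            for word, freq in corpus.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]
            # Step 4: merge the most frequent pair into a single token.
            merges.append((a, b))
            vocab.add(a + b)
            # Step 5: rewrite the corpus with every adjacent (a, b) merged.
            new_corpus = Counter()
            for word, freq in corpus.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                        out.append(a + b)
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_corpus[tuple(out)] += freq
            corpus = new_corpus
        return vocab, merges

Ties in pair frequency are broken here by first occurrence; this detail matters in the worked example below, and real implementations handle it differently.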

Tokenizer Segmenter Steps (see the sketch after this list):

  1. Split the sentence into words by whitespace
  2. For each word
    1. Split the word into characters, then apply the learned merges in the order they were learned, merging each pair wherever it appears

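A matching sketch of the segmenter, under the same assumptions; note that it only needs the ordered merge list, not the vocabulary:

    def segment(word, merges):
        # Start from characters plus the end-of-word marker, then replay
        # the learned merges in the order they were learned.
        symbols = list(word) + ["_"]
        for a, b in merges:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols
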
Example

Training Corpus:
low lower lowest
low me

Let's run the learner for 3 iterations.

Step 1:
Unique characters: [l, o, w, e, r, s, t, m, _]
Step 2:
New Corpus:
2 l o w _
1 l o w e r _
1 l o w e s t _
1 m e _

Step 3.0:
Most frequent pair: (l, o), with frequency 4 (tied with (o, w); the tie is broken arbitrarily)
Step 4.0:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo]
Step 5.0:
New Corpus:
2 lo w _
1 lo w e r _
1 lo w e s t _
1 m e _

Step 3.1:
Most frequent pair: (lo, w), with frequency 4
Step 4.1:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo, low]
Step 5.1:
New Corpus:
2 low _
1 low e r _
1 low e s t _
1 m e _

Step 3.2:
Most frequent pair: (low, e), with frequency 2 (tied with (low, _); again the tie is broken arbitrarily)
Step 4.2:
New Vocabulary: [l, o, w, e, r, s, t, m, _, lo, low, lowe]
Step 5.2:
New Corpus:
2 low _
1 lowe r _
1 lowe s t _
1 m e _

Final vocabulary: [l, o, w, e, r, s, t, m, _, lo, low, lowe]
Merges in order: (l, o) → lo, (lo, w) → low, (low, e) → lowe

Test sentence: lower lose

  1. lower = l o w e r _ = lo w e r _ = low e r _ = lowe r _ (final tokens: lowe, r, _)
  2. lose = l o s e _ = lo s e _ (no further learned merges apply, since los and lose were never merged during training; final tokens: lo, s, e, _)
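
The same segmentations can be reproduced with the sketches above. The merges are written out explicitly here, since the third merge learned by learn_bpe depends on how the (low, e) / (low, _) tie is broken:

    merges = [("l", "o"), ("lo", "w"), ("low", "e")]
    print(segment("lower", merges))  # ['lowe', 'r', '_']
    print(segment("lose", merges))   # ['lo', 's', 'e', '_']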

Used In

  1. LLaMA
  2. GPT-2, GPT-3 (the Byte-Level variant)
  3. XLNet
  4. BART
