WordPiece Tokenization

WordPiece tokenization works like Byte Pair Encoding (BPE), with a subtle difference: words are still divided into subwords, but merges are chosen not by raw pair frequency, rather by a score that normalizes the pair frequency by the frequencies of its parts.

Tokenizer Learner Steps:

  1. Get all the unique characters as the initial vocabulary, prefixing ## to every character that is not at the start of a word to indicate that characters precede it
  2. Form a corpus of all words split into characters, with the ## prefixes applied
  3. Find the pair (a, b) with the highest score in the current corpus, where score = freq(ab) / (freq(a) × freq(b))
  4. Merge the pair into ab (dropping the ## of b) and add it to the vocabulary
  5. Replace all adjacent occurrences of a, b with ab in the corpus to form the new corpus
  6. Go back to step 3 until the predetermined vocabulary size or the iteration limit is reached (a minimal sketch of this loop follows the list)
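
A minimal Python sketch of the learner, assuming a `words` dict that maps each word to its corpus frequency (illustrative only, not a reference implementation):

```python
from collections import defaultdict

def learn_wordpiece(words, vocab_size):
    """Toy WordPiece learner; `words` maps each word to its corpus frequency."""
    # Steps 1-2: split every word into characters, prefixing non-initial ones with ##
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in words}
    vocab = sorted({tok for pieces in splits.values() for tok in pieces})

    while len(vocab) < vocab_size:
        # Step 3: score every adjacent pair: freq(ab) / (freq(a) * freq(b))
        pair_freq, tok_freq = defaultdict(int), defaultdict(int)
        for w, pieces in splits.items():
            for tok in pieces:
                tok_freq[tok] += words[w]
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += words[w]
        if not pair_freq:          # nothing left to merge
            break
        a, b = max(pair_freq, key=lambda p: pair_freq[p] / (tok_freq[p[0]] * tok_freq[p[1]]))

        # Step 4: merge the pair into one token, dropping the ## of the second piece
        merged = a + (b[2:] if b.startswith("##") else b)
        vocab.append(merged)

        # Step 5: apply the merge wherever a and b are adjacent
        for w, pieces in splits.items():
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[w] = out
    return vocab
```

The only difference from BPE is the `max` criterion: dividing by the part frequencies favours merging pairs whose parts rarely occur apart. Called as `learn_wordpiece({"low": 2, "lower": 1, "lowest": 1, "me": 1}, vocab_size=9)`, this yields the same 9-token vocabulary as the worked example below.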

Tokenizer Segmenter Steps:

  1. Split the sentence into words by spaces
  2. For each word
    1. Find the longest token in the vocabulary that matches the start of the (remaining) word
    2. Continue with the remaining part of the word (now ##-prefixed) until the word is fully consumed; if no token matches, the whole word maps to the unknown token (a sketch follows this list)
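
A sketch of the greedy longest-match-first segmenter for a single word (the vocabulary here is just a Python set; the `[UNK]` fallback follows the Hugging Face description):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # shrink the candidate substring until it matches a vocabulary token
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:           # no vocabulary token matches this position
            return [unk]
        start = end
    return tokens
```

A full sentence is then handled by splitting on whitespace and running this on each word.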

Example

Training Corpus:
low lower lowest
low me

Let's do it for 1 iteration.

Step 1:
Unique characters: [l, ##o, ##w, ##e, ##r, ##s, ##t, m]
Step 2:
New Corpus:
2 l ##o ##w
1 l ##o ##w ##e ##r
1 l ##o ##w ##e ##s ##t
1 m ##e
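
Before picking the merge, the pair scores over this corpus can be checked with a short script (same toy corpus and formula as above):

```python
from collections import defaultdict

corpus = {"low": 2, "lower": 1, "lowest": 1, "me": 1}
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in corpus}

pair_freq, tok_freq = defaultdict(int), defaultdict(int)
for w, pieces in splits.items():
    for tok in pieces:
        tok_freq[tok] += corpus[w]
    for a, b in zip(pieces, pieces[1:]):
        pair_freq[(a, b)] += corpus[w]

for (a, b), f in pair_freq.items():
    print(a, b, f / (tok_freq[a] * tok_freq[b]))
# (##s, ##t) scores 1/(1*1) = 1.0, the highest; (l, ##o) only scores 4/(4*4) = 0.25
```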

Step 3.0:
Highest-scoring pair: (##s, ##t), score = 1 / (1 × 1) = 1.0 (for comparison, (l, ##o) only scores 4 / (4 × 4) = 0.25, since l and ##o are frequent on their own)
Step 4.0:
New Vocabulary: [l, ##o, ##w, ##e, ##r, ##s, ##t, m, ##st]
Step 5.0:
New Corpus:
2 l ##o ##w
1 l ##o ##w ##e ##r
1 l ##o ##w ##e ##st
1 m ##e

Final vocabulary: [l, ##o, ##w, ##e, ##r, ##s, ##t, m, ##st]

Test sentence: lower lose

  1. lower = l ##o ##w ##e ##r (no multi-character token in the vocabulary matches any prefix, so it stays at the character level)
  2. lose = l ##o ##s ##e (##st never applies here since s is not followed by t; a quick check follows below)
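
Reusing the `wordpiece_tokenize` sketch from the segmenter section with this toy final vocabulary:

```python
vocab = {"l", "##o", "##w", "##e", "##r", "##s", "##t", "m", "##st"}
for word in "lower lose".split():
    print(word, wordpiece_tokenize(word, vocab))
# lower ['l', '##o', '##w', '##e', '##r']
# lose ['l', '##o', '##s', '##e']
```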

Used in

  1. BERT
  2. ELECTRA
  3. ERNIE
  4. DistilBERT
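
For example, BERT's pretrained WordPiece tokenizer can be loaded through the Hugging Face `transformers` library (the exact subword splits depend on the pretrained vocabulary, not on the toy example above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("lower lose"))    # common words usually stay whole
print(tokenizer.tokenize("tokenization"))  # rarer words split into ##-prefixed pieces
```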

References

  1. https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt
  2. https://aman.ai/primers/ai/tokenizer/#wordpiece-1

Related Notes