Unigram Tokenization
Unigram tokenization is a Sub-word Tokenizer. It takes a different approach from Byte Pair Encoding (BPE) and its predecessors: it starts with a big vocabulary and removes tokens based on which removals cause the smallest increase in the loss.
This initial vocabulary can be built in several ways: for example, from the substrings of the training corpus, or from the vocabulary produced by running BPE with a very large vocabulary size.
Tokenizer Learner Steps:
- Start with a big vocabulary
- Find the tokens whose removal causes the smallest increase in the loss
- Remove those tokens
- Repeat until the pre-determined vocabulary size is reached (see the sketch below)
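A minimal sketch of this pruning loop in Python, using the toy corpus from the next section. The helper names (`segmentations`, `corpus_loss`, `train_unigram`) are illustrative rather than from any library, and the brute-force enumeration stands in for the machinery real trainers use; the loss it minimizes is explained in the next section:

```python
import math

def segmentations(word, vocab):
    """Enumerate every way to split `word` into tokens from `vocab`."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in vocab:
            for rest in segmentations(word[i:], vocab):
                results.append([prefix] + rest)
    return results

def best_prob(word, probs):
    """Probability of the most likely segmentation of `word`."""
    return max(math.prod(probs[t] for t in seg)
               for seg in segmentations(word, probs))

def corpus_loss(corpus, probs):
    """Negative log-likelihood of the corpus under its best segmentations."""
    return sum(-math.log(best_prob(word, probs)) for word in corpus)

def train_unigram(corpus, initial_freqs, target_size):
    vocab = dict(initial_freqs)
    while len(vocab) > target_size:
        total = sum(vocab.values())
        probs = {t: f / total for t, f in vocab.items()}
        base = corpus_loss(corpus, probs)
        # Score each multi-character token by the loss increase its removal
        # causes (single characters are kept so every word stays segmentable).
        increases = {}
        for token in vocab:
            if len(token) == 1:
                continue
            trial = {t: p for t, p in probs.items() if t != token}
            increases[token] = corpus_loss(corpus, trial) - base
        if not increases:
            break
        # Drop the token whose removal increases the loss the least.
        worst = min(increases, key=increases.get)
        del vocab[worst]
    return vocab

corpus = ["pug", "pu"]
initial = {"p": 2, "u": 2, "g": 1, "pu": 2, "pug": 1}
print(train_unigram(corpus, initial, target_size=4))
```

In practice, trainers such as SentencePiece remove a fraction of the worst-scoring tokens per round, re-estimate the token probabilities with EM between rounds, and find the best segmentation with the Viterbi algorithm rather than enumerating every split.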
How to calculate the loss in the Learner
Let's say,
corpus = [pug, pu]
vocabulary = [p, u, g, pu, pug]
frequency = [2, 2, 1, 2, 1]
So for the word "pug", there are 3 possible segmentations. The probability of a token is its frequency divided by the sum of all frequencies (2 + 2 + 1 + 2 + 1 = 8), and the probability of a segmentation is the product of its tokens' probabilities:
- [p, u, g] = (2/8) × (2/8) × (1/8) = 0.0078125
- [pu, g] = (2/8) × (1/8) = 0.03125
- [pug] = (1/8) = 0.125
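A quick check of these numbers in Python (the candidate splits are written out by hand here; enumerating them programmatically is shown in the sketches above and below):

```python
freq = {"p": 2, "u": 2, "g": 1, "pu": 2, "pug": 1}
total = sum(freq.values())                      # 2 + 2 + 1 + 2 + 1 = 8
prob = {t: f / total for t, f in freq.items()}  # P(token) = freq(token) / total

# The three candidate segmentations of "pug"
for seg in (["p", "u", "g"], ["pu", "g"], ["pug"]):
    p = 1.0
    for token in seg:
        p *= prob[token]
    print(seg, p)
# ['p', 'u', 'g'] 0.0078125
# ['pu', 'g'] 0.03125
# ['pug'] 0.125
```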
So for each word, we take the segmentation with the highest probability.
For a vocabulary, the loss over the corpus is then the sum of freq(word) × -log(probability of the word's best segmentation); the learner removes the tokens whose removal increases this loss the least.
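A short sketch of that loss for the toy corpus above, assuming each word appears once (the best-segmentation probabilities are the ones worked out in the example):

```python
import math

# Probability of each word's best segmentation under the current vocabulary
# ("pu" itself wins over [p, u] = 0.25 * 0.25 = 0.0625)
best_prob = {"pug": 0.125, "pu": 0.25}
word_count = {"pug": 1, "pu": 1}

loss = sum(word_count[w] * -math.log(best_prob[w]) for w in word_count)
print(loss)  # -log(0.125) - log(0.25) ≈ 3.47
```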
Tokenizer Segmenter Steps:
- Split the sentence with a whitespace pre-tokenizer
- For each word, find the segmentation with the highest probability
How to calculate the highest probability
It is the same as in the previous example, where we found that [pug] is the segmentation with the highest probability.
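A minimal segmenter sketch along these lines; the brute-force search over all splits is only for illustration (real implementations such as SentencePiece use the Viterbi algorithm), and the probabilities are the ones from the toy example:

```python
import math

prob = {"p": 0.25, "u": 0.25, "g": 0.125, "pu": 0.25, "pug": 0.125}

def segmentations(word, vocab):
    """Every way to split `word` into tokens from `vocab`."""
    if word == "":
        return [[]]
    return [[word[:i]] + rest
            for i in range(1, len(word) + 1) if word[:i] in vocab
            for rest in segmentations(word[i:], vocab)]

def tokenize(sentence):
    tokens = []
    for word in sentence.split():            # 1. split on whitespace
        segs = segmentations(word, prob)     # 2. enumerate candidate splits
        best = max(segs, key=lambda s: math.prod(prob[t] for t in s))
        tokens.extend(best)                  # 3. keep the most probable one
    return tokens

print(tokenize("pug pu"))  # ['pug', 'pu']
```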
Used In
- Unigram tokenization is mainly used in SentencePiece Tokenization
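For reference, a sketch of training a unigram model with the SentencePiece Python package, assuming a local corpus.txt with one sentence per line (check the SentencePiece documentation for the full set of options):

```python
import sentencepiece as spm

# Train a unigram model on a plain-text corpus
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # assumed local training file
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",        # unigram is SentencePiece's default model type
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("hello world", out_type=str))
```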
Need review:
- Make a gif for unigram representation