Unigram Tokenization

Unigram tokenization is a Sub-word Tokenizer. It takes a different approach than Byte Pair Encoding (BPE) and its predecessors: instead of building a vocabulary up from characters, it starts with a big vocabulary and removes tokens based on which removal causes the smallest increase in the loss.

This initial vocabulary can be built in several ways: for example, from all strict substrings of the words in the training corpus, or from a BPE run with a very large vocabulary size.
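
As a rough illustration, here is a minimal Python sketch (the two-word corpus is the toy example used later in this note) of building such an initial vocabulary from all substrings of the corpus words:

```python
from collections import Counter

# Toy corpus (each word is assumed to appear once).
corpus = ["pug", "pu"]

# Count every substring of every word to get the initial
# vocabulary together with its frequencies.
freqs = Counter()
for word in corpus:
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            freqs[word[start:end]] += 1

print(dict(freqs))
# {'p': 2, 'pu': 2, 'pug': 1, 'u': 2, 'ug': 1, 'g': 1}
# (the worked example below keeps only [p, u, g, pu, pug])
```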

Tokenizer Learner Steps:

  1. Start with a big vocabulary
  2. Find the p% of tokens whose removal increases the loss the least
  3. Remove those tokens from the vocabulary
  4. Repeat until the pre-determined vocabulary size is reached (the full loop is sketched in code after the loss formula below)

How to calculate the loss in the Learner

Let's say,
corpus = [pug, pu]
vocabulary = [p, u, g, pu, pug]
frequency = [2, 2, 1, 2, 1]

Here the frequencies are the substring counts over the corpus. Their sum is 8, so the probability of a token is its frequency divided by 8, e.g. P(pu) = 2/8.

so for the word "pug", there are 3 possible segmentations

  1. [p, u, g] = P(p) · P(u) · P(g) = (2/8) · (2/8) · (1/8) = 0.0078125
  2. [pu, g] = P(pu) · P(g) = (2/8) · (1/8) = 0.03125
  3. [pug] = P(pug) = 1/8 = 0.125

So for each word, we take the segmentation with the highest probability, which here is [pug].
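
The same numbers can be reproduced with a small brute-force sketch (the function names are just for illustration) that enumerates every segmentation of a word and scores it as the product of its token probabilities:

```python
# Toy vocabulary with substring frequencies; the total frequency is 8.
freqs = {"p": 2, "u": 2, "g": 1, "pu": 2, "pug": 1}
total = sum(freqs.values())
probs = {tok: c / total for tok, c in freqs.items()}

def segmentations(word):
    """Yield every way of splitting `word` into vocabulary tokens."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in probs:
            for rest in segmentations(word[i:]):
                yield [word[:i]] + rest

def score(tokens):
    """Probability of a segmentation: product of its token probabilities."""
    p = 1.0
    for tok in tokens:
        p *= probs[tok]
    return p

for seg in segmentations("pug"):
    print(seg, score(seg))
# ['p', 'u', 'g'] 0.0078125
# ['pu', 'g'] 0.03125
# ['pug'] 0.125
```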

For a corpus with words (w1, w2, w3) appearing with frequencies (f1, f2, f3), where P(wᵢ) is the probability of the best segmentation of wᵢ, the loss is

loss = − Σᵢ fᵢ · log(P(wᵢ))
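
Putting the learner steps and this loss together, here is a minimal, un-optimized sketch of the pruning loop. For simplicity it removes one token per round instead of p%, never drops single characters, and segments by brute force; real implementations (e.g. SentencePiece) use an EM step and a Viterbi search instead.

```python
import math

corpus = {"pug": 1, "pu": 1}                         # word -> frequency in the corpus
freqs = {"p": 2, "u": 2, "g": 1, "pu": 2, "pug": 1}  # token -> count

def best_prob(word, probs):
    """Probability of the most likely segmentation of `word` (brute force)."""
    if not word:
        return 1.0
    best = 0.0
    for i in range(1, len(word) + 1):
        if word[:i] in probs:
            best = max(best, probs[word[:i]] * best_prob(word[i:], probs))
    return best

def corpus_loss(freqs):
    """loss = -sum over corpus words of f(w) * log(P(best segmentation of w))."""
    total = sum(freqs.values())
    probs = {tok: c / total for tok, c in freqs.items()}
    return sum(f * -math.log(best_prob(word, probs)) for word, f in corpus.items())

# Learner: repeatedly drop the token whose removal increases the loss the least,
# never dropping single characters (they are needed to cover every word).
target_size = 4
while len(freqs) > target_size:
    base = corpus_loss(freqs)
    increases = {}
    for tok in freqs:
        if len(tok) == 1:
            continue
        reduced = {t: c for t, c in freqs.items() if t != tok}
        increases[tok] = corpus_loss(reduced) - base
    if not increases:   # only single characters left
        break
    del freqs[min(increases, key=increases.get)]

print(freqs)
# {'p': 2, 'u': 2, 'g': 1, 'pug': 1}  (removing 'pu' increases the loss the least here)
```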

Tokenizer Segmenter Steps:

  1. Split the sentence on whitespace (pre-tokenization)
  2. For each word,
    1. Find the segmentation with the highest probability

How to find the segmentation with the highest probability

It is the same as in the previous example, where we found that [pug] is the segmentation with the highest probability for "pug". In practice the segmenter does not enumerate every segmentation; it finds the best one with a Viterbi-style dynamic programming search.
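
A minimal sketch of such a segmenter, reusing the toy vocabulary from above (the function names are just for illustration):

```python
import math

freqs = {"p": 2, "u": 2, "g": 1, "pu": 2, "pug": 1}
total = sum(freqs.values())
# Work with negative log probabilities, so the best segmentation
# is the one with the smallest total score.
neg_log_probs = {tok: -math.log(c / total) for tok, c in freqs.items()}

def segment(word):
    """Viterbi-style search for the highest-probability segmentation.
    (Assumes every single character of `word` is in the vocabulary.)"""
    # best[i] = (score of the best segmentation of word[:i], start of its last token)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            tok = word[start:end]
            if tok in neg_log_probs:
                score = best[start][0] + neg_log_probs[tok]
                if score < best[end][0]:
                    best[end] = (score, start)
    # Walk back through the table to recover the tokens.
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

def tokenize(sentence):
    """Segmenter: split on whitespace, then segment each word."""
    return [tok for word in sentence.split() for tok in segment(word)]

print(tokenize("pug pu"))   # ['pug', 'pu']
```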

Used In

  1. Unigram tokenization is mainly used in SentencePiece Tokenization (it is SentencePiece's default model type)
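
For reference, a minimal sketch of training and using a unigram model with the sentencepiece Python package (the file paths and vocabulary size below are placeholders):

```python
import sentencepiece as spm

# Train a unigram model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # placeholder path to the training text
    model_prefix="unigram",   # writes unigram.model and unigram.vocab
    vocab_size=8000,
    model_type="unigram",     # SentencePiece's default model type
)

# Load the trained model and segment some text.
sp = spm.SentencePieceProcessor(model_file="unigram.model")
print(sp.encode("unigram tokenization", out_type=str))
```
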
Need review:
  1. Make a gif for unigram representation

References

  1. https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt

Related Notes