BLEU Score

BLEU = Bilingual Evaluation Understudy
BLEU measures how closely a generated sentence matches one or more reference (ground-truth) sentences
Because BLEU is measured from the generation's side (the fraction of generated n-grams that also appear in the reference), it is often compared with Precision
BLEU score is computed as the product of the Brevity Penalty and the Geometric Mean of the n-gram Precisions
Heavily used in Machine Translation

BLEU Score

$\begin{aligned} \text{BLEU-}N &= \text{Brevity-Penalty} \times \exp\left(\log \text{Geometric-Mean-of-Precision}_{1 \ldots N}\right) \\ \text{Brevity-Penalty} &= \min\left(1, \exp\left(1 - \frac{\text{reference-length}}{\text{generation-length}}\right)\right) \\ \log \text{Geometric-Mean-of-Precision}_{1 \ldots N} &= \sum_{n=1}^{N} w_{n} \log pre_{n} \\ pre_{n} &= \frac{\text{\# of n-grams matched in both target and generation}}{\text{total \# of n-grams in the generation}} \end{aligned}$

Problems with BLEU Score

Doesn't consider semantic meaning
Doesn't consider synonyms
Struggles with non-English languages
Hard to compare across different tokenizers
1. Different tokenizers split a sentence into different pieces; hence the n-grams from tokenizer-A are not the same as the n-grams from tokenizer-B, and the resulting scores are not directly comparable
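
Two of these problems can be made concrete with a small sketch. The example below scores the same sentence pair with two hypothetical tokenizers (whitespace splitting versus a simple regex that separates punctuation) and uses only clipped unigram precision, not the full BLEU score. The helper name `unigram_precision` and the example sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def unigram_precision(reference, generation):
    """Clipped unigram precision: matched tokens / total tokens in the generation."""
    ref_counts = Counter(reference)
    gen_counts = Counter(generation)
    matched = sum(min(c, ref_counts[t]) for t, c in gen_counts.items())
    return matched / sum(gen_counts.values())


reference = "The film wasn't good."
generation = "The movie wasn't good."

# Tokenizer A (hypothetical): split on whitespace only.
score_a = unigram_precision(reference.split(), generation.split())

# Tokenizer B (hypothetical): words and punctuation marks become separate tokens.
pattern = r"\w+|[^\w\s]"
score_b = unigram_precision(re.findall(pattern, reference),
                            re.findall(pattern, generation))

print(score_a, score_b)
```

Tokenizer A yields 4 tokens per sentence and a precision of 3/4, while tokenizer B yields 7 tokens and a precision of 6/7, even though the text being scored is identical. In both cases "movie" gets no credit for matching "film", which is the synonym problem listed above.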
