This section describes the reference-based machine translation metrics that we use in our evaluations.
Translation Edit Rate (TER) is an automatic metric based on edit distance. It counts the edits (insertions, deletions, shifts, and substitutions) required to transform the MT output into the reference, normalized by the length of the reference. In other words, it shows how much a human would have to edit a machine translation output to make it identical to a given reference translation. The corpus TER score is the arithmetic mean of segment scores. TER is 0 for an exact match and has no fixed upper bound: it exceeds 1 when the output needs more edits than the reference has words. The greater the score, the farther a translation is from the reference.
- Snover, Matthew, Bonnie J. Dorr, R. Schwartz and L. Micciulla. “A Study of Translation Edit Rate with Targeted Human Annotation.” (2006).
- TER implementation
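As a rough illustration of the edit-rate idea, the sketch below computes a simplified, shift-free variant of TER (effectively a word error rate) and averages segment scores into a corpus score. It assumes whitespace tokenization and a single reference; the function names are hypothetical, and a real TER implementation additionally searches for block shifts.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edits divided by reference length. Assumption: shifts are NOT modeled."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(hyp, ref) / len(ref)


def corpus_ter_mean(hypotheses, references):
    """Corpus score as the arithmetic mean of segment scores."""
    scores = [simplified_ter(h, r) for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(simplified_ter("the cat sat on mat", "the cat sat on the mat"))
    print(simplified_ter("a b c d", "a"))  # more edits than reference words
```

Because the edit count is divided by the reference length, a long output scored against a short reference yields a value above 1, which is why the score has no fixed upper bound.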
BLEU is an automatic metric based on n-grams. It measures the precision of the n-grams in the machine translation output against the reference, multiplied by a brevity penalty that punishes overly short translations. We use a particular implementation of BLEU called sacreBLEU, which we use to obtain corpus scores rather than segment scores. BLEU ranges from 0 to 100. The greater the score, the closer a translation is to the reference.
- Papineni, Kishore, S. Roukos, T. Ward and Wei-Jing Zhu. “Bleu: a Method for Automatic Evaluation of Machine Translation.” ACL (2002).
- Post, Matt. “A Call for Clarity in Reporting BLEU Scores.” WMT (2018).
- sacreBLEU implementation
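To make the combination of n-gram precision and brevity penalty concrete, here is a minimal sketch of a corpus-level BLEU computation. It assumes whitespace tokenization, a single reference per segment, n-grams up to order 4, and no smoothing; it is not sacreBLEU, and `corpus_bleu_sketch` is a hypothetical name, so its numbers will not match sacreBLEU's in general.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams of order n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU on a 0-100 scale (illustrative only)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only outputs shorter than the reference are penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    print(corpus_bleu_sketch(["the cat sat on"], ["the cat sat on the mat"]))
```

In the usage line, every n-gram of the output appears in the reference, so the score is lowered only by the brevity penalty for the missing words.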
BERTScore is an automatic metric based on contextual word embeddings. It computes token embeddings with a pretrained model (BERT, RoBERTa, etc.) and then the pairwise cosine similarity between the token representations of a machine translation and those of a reference translation. Essentially, BERTScore aims to measure semantic similarity. The corpus BERTScore is the arithmetic mean of segment scores. In practice, BERTScore ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
- Zhang, Tianyi, V. Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi. “BERTScore: Evaluating Text Generation with BERT.” ArXiv abs/1904.09675 (2020).
- BERTScore implementation
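The pairwise-similarity step can be sketched with greedy matching: each token is paired with its most similar token on the other side, and the averaged similarities give precision, recall, and F1. The example below assumes hand-made 2-dimensional toy vectors in place of real contextual embeddings, and all names are hypothetical; the actual metric obtains the vectors from a pretrained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(cand_vecs, ref_vecs):
    """F1 over greedy best-match cosine similarities (illustrative only)."""
    # Precision: every candidate token looks for its best reference match.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: every reference token looks for its best candidate match.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy vectors standing in for contextual embeddings (assumption).
    candidate = [(1.0, 0.0), (0.0, 1.0)]
    reference = [(1.0, 0.0)]
    print(greedy_match_f1(candidate, reference))
```

Here the extra candidate token has no similar counterpart in the reference, so precision drops while recall stays at 1.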
hLEPOR is an automatic metric based on n-grams. It is an enhanced version of the LEPOR metric and combines three factors: an enhanced length penalty, an n-gram position difference penalty, and a harmonic mean of precision and recall. In essence, hLEPOR computes the similarity of n-grams in a machine translation and a reference translation of a text segment. The corpus hLEPOR score is the arithmetic mean of segment scores. hLEPOR ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
- Han, Lifeng, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. “Language-independent Model for Machine Translation Evaluation with Reinforced Factors.” (2013).
- hLEPOR implementation
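The way hLEPOR combines its factors can be sketched as a weighted harmonic mean of the length penalty, the position difference penalty, and the harmonic mean of precision and recall. The sketch below assumes that form of the combination, takes the position penalty, precision, and recall as precomputed inputs (the word alignment step is omitted), and uses hypothetical function names and placeholder weights of 1.0 rather than tuned values.

```python
import math


def length_penalty(hyp_len, ref_len):
    """Penalize outputs that are shorter OR longer than the reference."""
    if hyp_len == 0 or ref_len == 0:
        return 0.0
    if hyp_len == ref_len:
        return 1.0
    if hyp_len < ref_len:
        return math.exp(1 - ref_len / hyp_len)
    return math.exp(1 - hyp_len / ref_len)


def weighted_harmonic_mean(values, weights):
    """sum(w) / sum(w / v); defined as 0 when any value is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return sum(weights) / sum(w / v for w, v in zip(weights, values))


def hlepor_sketch(lp, pos_penalty, precision, recall,
                  alpha=1.0, beta=1.0, w_lp=1.0, w_pos=1.0, w_hpr=1.0):
    """Combine the three factors; all weights are placeholder assumptions."""
    hpr = weighted_harmonic_mean([recall, precision], [alpha, beta])
    return weighted_harmonic_mean([lp, pos_penalty, hpr], [w_lp, w_pos, w_hpr])


if __name__ == "__main__":
    lp = length_penalty(hyp_len=4, ref_len=6)
    print(hlepor_sketch(lp, pos_penalty=0.9, precision=1.0, recall=4 / 6))
```

Because every factor lies between 0 and 1 and the harmonic mean is dominated by its smallest input, a single weak factor (for example, a large length mismatch) pulls the whole score down.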