MT quality metrics
Here we describe the reference-based machine translation quality metrics that we use in MT Studio.
COMET
COMET is a neural framework for training multilingual machine translation evaluation models that achieve state-of-the-art correlation with human judgements. COMET predicts machine translation quality using information from both the source input and the reference translation. The greater the score, the closer a translation is to the reference translation. As of February 2022, MT Studio uses COMET version 1.0.1.
COMET-QE is a reference-free variant of COMET that predicts MT quality from the source and the MT output alone, without a reference translation.
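For illustration, here is a minimal sketch of scoring translations with the unbabel-comet Python package (the COMET implementation linked below). The model name and the exact return format of predict() vary between package versions, so treat them as assumptions:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Assumption: "wmt20-comet-da" as the reference-based model; a QE checkpoint
# such as "wmt20-comet-qe-da" can be scored the same way without "ref" fields.
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Der Vertrag wurde gestern unterzeichnet.",  # source input
        "mt": "The contract was signed yesterday.",         # machine translation
        "ref": "The agreement was signed yesterday.",       # reference translation
    }
]

# In COMET 1.x, predict() returns segment-level scores and a corpus-level score.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
print(seg_scores, sys_score)
```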
References
- COMET, the new standard for machine translation evaluation
- COMET: A Neural Framework for MT Evaluation
- COMET implementation
- COMET-QE
Supported languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
TER
Translation Edit Rate (TER) is an automatic metric based on edit distance. It measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform the MT output into the reference, i.e. how much a human would have to edit a machine translation to make it identical to a given reference translation. The corpus TER score is the total number of edits divided by the total number of reference words, multiplied by 100. TER ranges from 0 to infinity. The greater the score, the farther a translation is from the reference.
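As an example, a corpus-level TER score can be computed with the sacreBLEU Python package (a minimal sketch; the sentences are invented):

```python
# pip install sacrebleu
from sacrebleu.metrics import TER

hypotheses = [
    "The contract was signed yesterday.",
    "He go to school every day.",
]
references = [
    "The agreement was signed yesterday.",
    "He goes to school every day.",
]

ter = TER()
# corpus_score() takes the hypotheses and a list of reference streams.
result = ter.corpus_score(hypotheses, [references])
print(result)        # e.g. "TER = 25.00 ..." (lower is better)
print(result.score)  # total edits / total reference words * 100
```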
References
- Snover, Matthew, Bonnie J. Dorr, R. Schwartz and L. Micciulla. “A Study of Translation Edit Rate with Targeted Human Annotation.” (2006).
- TER implementation
Supported languages: As an edit-distance metric, TER supports all languages.
BLEU
BLEU is an automatic metric based on n-grams. It measures the precision of n-grams of the machine translation output compared to the reference, combined with a brevity penalty that penalizes overly short translations. We use a particular implementation of BLEU, called sacreBLEU. It outputs corpus scores, not segment scores. BLEU ranges from 0 to 100. The greater the score, the closer a translation is to the reference.
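A minimal sketch of computing a corpus-level BLEU score with sacreBLEU (the sentences are invented):

```python
# pip install sacrebleu
import sacrebleu

hypotheses = [
    "The cat sat on the mat.",
    "There is a book on the desk.",
]
references = [
    "The cat is sitting on the mat.",
    "There is a book on the desk.",
]

# corpus_bleu() scores the whole corpus at once; sacreBLEU reports
# a single corpus-level score rather than segment scores.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu)        # e.g. "BLEU = 57.03 ..." (higher is better)
print(bleu.score)  # numeric score on the 0-100 scale
```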
References
- Papineni, Kishore, S. Roukos, T. Ward and Wei-Jing Zhu. “Bleu: a Method for Automatic Evaluation of Machine Translation.” ACL (2002).
- Post, Matt. “A Call for Clarity in Reporting BLEU Scores.” WMT (2018).
- sacreBLEU implementation
Supported languages: As an n-gram metric, BLEU supports all languages.
BERTScore
BERTScore is an automatic metric based on contextual word embeddings. It computes contextual embeddings (BERT, RoBERTa, etc.) for the machine translation and the reference translation and matches their tokens by pairwise cosine similarity. Essentially, BERTScore aims to measure semantic similarity. The corpus BERTScore is the arithmetic mean of segment scores. BERTScore ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
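A minimal sketch using the bert_score Python package (the language code, the default model choice, and the sentences are assumptions):

```python
# pip install bert-score
from bert_score import score

candidates = [
    "The contract was signed yesterday.",
    "He goes to school every day.",
]
references = [
    "The agreement was signed yesterday.",
    "He attends school daily.",
]

# Returns precision, recall, and F1 tensors with one value per segment;
# lang="en" selects a default English model (e.g. RoBERTa-large).
P, R, F1 = score(candidates, references, lang="en")

print(F1.tolist())       # segment-level BERTScore (F1)
print(F1.mean().item())  # corpus-level score: arithmetic mean of segments
```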
References
- Zhang, Tianyi, V. Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi. “BERTScore: Evaluating Text Generation with BERT.” ArXiv abs/1904.09675 (2020)
- BERTScore implementation
Supported languages: Albanian, Arabic, Aragonese, Armenian, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Chinese (Simplified), Chinese (Traditional), Chuvash, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Kirghiz, Korean, Latin, Latvian, Lithuanian, Lombard, Low Saxon, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Marathi, Minangkabau, Nepali, Newar, Norwegian (Bokmal), Norwegian (Nynorsk), Occitan, Persian (Farsi), Piedmontese, Polish, Portuguese, Punjabi, Romanian, Russian, Scots, Serbian, Serbo-Croatian, Sicilian, Slovak, Slovenian, South Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray-Waray, Welsh, West Frisian, Western Punjabi, Yoruba
hLEPOR
hLEPOR is an automatic metric based on n-grams. It takes into account an enhanced length penalty, an n-gram position difference penalty, and the harmonic mean of precision and recall. hLEPOR is an enhanced version of the LEPOR metric. Basically, hLEPOR computes the similarity of n-grams in a machine translation and a reference translation of a text segment. The corpus hLEPOR score is the arithmetic mean of segment scores. hLEPOR ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
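To make the components concrete, below is a simplified, unigram-only sketch of an hLEPOR-style segment score. It is not the official implementation (linked below): the word-matching strategy, the component weights, and the restriction to unigrams are illustrative assumptions.

```python
import math

def length_penalty(hyp_len, ref_len):
    # Enhanced length penalty: 1 when lengths match, < 1 otherwise.
    if hyp_len == ref_len:
        return 1.0
    shorter, longer = sorted((hyp_len, ref_len))
    return math.exp(1 - longer / shorter)

def position_penalty(hyp, ref):
    # Unigram position-difference penalty with nearest-position matching.
    diffs = []
    for i, word in enumerate(hyp):
        positions = [j for j, r in enumerate(ref) if r == word]
        if positions:
            j = min(positions, key=lambda p: abs(p / len(ref) - i / len(hyp)))
            diffs.append(abs(i / len(hyp) - j / len(ref)))
    npd = sum(diffs) / len(hyp)
    return math.exp(-npd)

def harmonic_precision_recall(hyp, ref, alpha=1.0, beta=1.0):
    # Weighted harmonic mean of unigram precision and recall.
    matches = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
    if matches == 0:
        return 0.0
    precision, recall = matches / len(hyp), matches / len(ref)
    return (alpha + beta) / (alpha / recall + beta / precision)

def hlepor_sketch(hypothesis, reference, w_lp=2.0, w_pos=1.0, w_hpr=7.0):
    # Segment score: weighted harmonic mean of the three components
    # (the weights here are illustrative, not the tuned values).
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not hyp or not ref:
        return 0.0
    lp = length_penalty(len(hyp), len(ref))
    pos = position_penalty(hyp, ref)
    hpr = harmonic_precision_recall(hyp, ref)
    if min(lp, pos, hpr) == 0:
        return 0.0
    return (w_lp + w_pos + w_hpr) / (w_lp / lp + w_pos / pos + w_hpr / hpr)

print(hlepor_sketch("The contract was signed yesterday",
                    "The agreement was signed yesterday"))
```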
References
- Han, Lifeng, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. “Language-independent Model for Machine Translation Evaluation with Reinforced Factors.” (2013).
- hLEPOR implementation
Supported languages: As an n-gram metric, hLEPOR supports all languages.