MT quality metrics

Here we describe the machine translation metrics (reference-based metrics) that we use in MT Studio.

 

COMET

COMET is a neural framework for training multilingual machine translation evaluation models that achieve state-of-the-art correlation with human judgements. COMET predicts machine translation quality using information from both the source input and the reference translation. The greater the score, the closer a translation is to the reference translation. As of February 2022, MT Studio uses COMET version 1.0.1.

COMET-QE is a reference-free variant of COMET that predicts MT quality from the source input and the MT output alone, without a reference translation.
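To make the input difference between the two models concrete, here is a minimal scoring sketch using the unbabel-comet Python package. The model name and `predict` call follow the 1.0.x API, so treat the exact identifiers as assumptions to verify against the library's documentation.

```python
# Sketch of preparing inputs for COMET and COMET-QE. COMET scores
# (source, MT, reference) triples; reference-free COMET-QE drops "ref".
data = [
    {
        "src": "Der Hund bellt.",     # source segment
        "mt": "The dog is barking.",  # machine translation output
        "ref": "The dog barks.",      # human reference translation
    }
]
qe_data = [{k: v for k, v in seg.items() if k != "ref"} for seg in data]

def comet_scores(segments, model_name="wmt20-comet-da"):
    # Requires the unbabel-comet package; downloads the checkpoint on
    # first use. Model name and predict() signature follow the 1.0.x API
    # and should be checked against the library's documentation.
    from comet import download_model, load_from_checkpoint
    model = load_from_checkpoint(download_model(model_name))
    seg_scores, sys_score = model.predict(segments, batch_size=8, gpus=0)
    return seg_scores, sys_score
```

For COMET-QE, the same call would be made with `qe_data` and a QE checkpoint instead.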


Supported languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

 

TER

Translation Edit Rate (TER) is an automatic metric based on edit distance. It measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform the MT output into the reference, i.e., how much a human would have to edit a machine translation output to make it identical to a given reference translation. The corpus TER score is the total number of edits divided by the total number of reference words, multiplied by 100. TER ranges from 0 to infinity. The greater the score, the farther a translation is from the reference.
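To make the corpus-level formula concrete, here is a simplified TER in plain Python. Real TER implementations (e.g., the tercom tool) also count block shifts as single edits; this sketch handles insertions, deletions, and substitutions only.

```python
# Simplified corpus-level TER using plain word-level edit distance
# (insertions, deletions, substitutions). Real TER also counts block
# shifts as single edits, which this sketch omits.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def corpus_ter(hypotheses, references):
    # Total edits over the corpus divided by total reference words, times 100.
    total_edits = sum(edit_distance(h.split(), r.split())
                      for h, r in zip(hypotheses, references))
    total_words = sum(len(r.split()) for r in references)
    return 100 * total_edits / total_words

print(round(corpus_ter(["the cat sat on the mat"],
                       ["the cat sat on a mat"]), 2))  # → 16.67
```

One substitution over six reference words gives 100 × 1/6 ≈ 16.67.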


Supported languages: As an edit-distance-based metric, TER supports all languages.

 

BLEU

BLEU is an automatic metric based on n-grams. It measures the precision of n-grams of the machine translation output compared to the reference, weighted by a brevity penalty that punishes overly short translations. We use a particular implementation of BLEU, called sacreBLEU. It outputs corpus scores, not segment scores. BLEU ranges from 0 to 100. The greater the score, the closer a translation is to the reference.
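A bare-bones sketch of the computation (uniform 1–4-gram weights, no smoothing, naive whitespace tokenization); for reportable scores use sacreBLEU itself, which standardizes tokenization and these details.

```python
import math
from collections import Counter

# Bare-bones corpus-level BLEU: clipped 1-4-gram precisions, geometric
# mean, brevity penalty. No smoothing and naive whitespace tokenization.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len, ref_len = hyp_len + len(h), ref_len + len(r)
        for n in range(1, max_n + 1):
            h_grams, r_grams = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_grams & r_grams).values())  # clipping
            totals[n - 1] += sum(h_grams.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no matches (no smoothing)
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_prec)

print(round(corpus_bleu(["the cat sat on the mat"],
                        ["the cat sat on the mat"]), 2))  # → 100.0
```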


Supported languages: As an n-gram metric, BLEU supports all languages.

 

BERTScore

BERTScore is an automatic metric based on contextual word embeddings. It computes embeddings (BERT, RoBERTa, etc.) and pairwise cosine similarity between token representations of a machine translation and a reference translation. Essentially, BERTScore aims to measure semantic similarity. The corpus BERTScore is the arithmetic mean of segment scores. BERTScore ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
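The greedy-matching idea behind the score can be illustrated with toy static vectors; the real metric obtains contextual embeddings from a pretrained model via the bert-score package.

```python
import math

# Toy illustration of BERTScore's greedy matching using made-up static
# 2-d word vectors; the real metric uses contextual embeddings from a
# pretrained model (BERT, RoBERTa, ...) via the bert-score package.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def bertscore_f1(hyp_vecs, ref_vecs):
    # Recall: each reference token greedily matched to the most similar
    # hypothesis token; precision is the mirror image.
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical unit vectors for two three-token segments.
hyp = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
ref = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0)]
print(round(bertscore_f1(hyp, ref), 3))  # → 0.987
```

Two of the three tokens match exactly, the middle pair is merely similar, so the F1 lands just below 1.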


Supported languages: Albanian, Arabic, Aragonese, Armenian, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Chinese (Simplified), Chinese (Traditional), Chuvash, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Kirghiz, Korean, Latin, Latvian, Lithuanian, Lombard, Low Saxon, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Marathi, Minangkabau, Nepali, Newar, Norwegian (Bokmal), Norwegian (Nynorsk), Occitan, Persian (Farsi), Piedmontese, Polish, Portuguese, Punjabi, Romanian, Russian, Scots, Serbian, Serbo-Croatian, Sicilian, Slovak, Slovenian, South Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray-Waray, Welsh, West Frisian, Western Punjabi, Yoruba.

 

hLEPOR

hLEPOR is an automatic metric based on n-grams. It combines an enhanced length penalty, an n-gram position difference penalty, and the harmonic mean of precision and recall. hLEPOR is an enhanced version of the LEPOR metric. Essentially, hLEPOR computes the similarity of n-grams in a machine translation and a reference translation of a text segment. The corpus hLEPOR score is the arithmetic mean of segment scores. hLEPOR ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
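A heavily simplified, unigram-only sketch of how the three components combine, with all component weights set to 1; the published metric tunes these weights per language pair and matches n-grams (including repeated words) more carefully.

```python
import math

# Greatly simplified, unigram-only hLEPOR sketch with all component
# weights set to 1. The published metric tunes the weights per language
# pair and handles n-gram matching and repeated words more carefully.

def hlepor(hyp, ref):
    h, r = hyp.split(), ref.split()
    # Enhanced length penalty: 1.0 for equal lengths, decaying otherwise.
    if len(h) == len(r):
        lp = 1.0
    else:
        lp = math.exp(1 - max(len(h), len(r)) / min(len(h), len(r)))
    # Position difference penalty over matched unigrams.
    diffs = []
    for i, w in enumerate(h):
        if w in r:
            j = r.index(w)  # naive: first occurrence only
            diffs.append(abs((i + 1) / len(h) - (j + 1) / len(r)))
    if not diffs:
        return 0.0  # no word overlap at all
    pos_penalty = math.exp(-sum(diffs) / len(h))
    # Harmonic mean of unigram precision and recall.
    precision, recall = len(diffs) / len(h), len(diffs) / len(r)
    hpr = 2 * precision * recall / (precision + recall)
    # Harmonic mean of the three components (all weights 1).
    return 3 / (1 / lp + 1 / pos_penalty + 1 / hpr)

print(round(hlepor("the cat sat on a mat",
                   "the cat sat on a mat"), 3))  # → 1.0
```

With identical segments all three components equal 1, so the harmonic mean is exactly 1.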


Supported languages: As an n-gram metric, hLEPOR supports all languages.