Evaluate Models

In this article:

  1. Create a Project

  2. Make Sure Your Data is Correct

  3. Make Translations in MT Studio

  4. What is a Test Set

  5. Quality Metrics

  6. Scoring Results

  7. Scoring Charts

  8. MT Customization Analysis

Create a Project

To create a project:

  1. Visit Intento MT Studio

  2. Select Create project → Evaluate models:

    1-Create project.png
  3. Enter a name for your new project and specify the source and target languages

  4. Upload your data:

    To learn more, see the section Make Sure Your Data is Correct.

    2-Upload your data.png
  5. The system processes the file and shows the metadata:

    The number of uploaded segments and the translations already provided for each model. A project can include up to 30 machine translations; the optimal number of segments for evaluation is 2,000.

    3-optimal number of segments.png
  6. If you have the option to translate with stock engines in MT Studio, select the providers you will use to prepare the translations:

  • via Intento - all usage goes through Intento

  • via connected account - all usage goes through your direct contract with the provider (Intento won’t charge for it)

  7. Start the project

Make Sure Your Data is Correct

In most cases, you should provide a translation memory (source and golden reference) and translations made with different models. Your spreadsheet must be in one of the following file formats:

  • XLS

  • XLSX

  • CSV

  • TSV

Mandatory conditions for a dataset:

  • The first row must contain column names

  • If the Studio translations feature is turned on, the spreadsheet must contain at least the source column

  • If the Studio translations feature is not available, the spreadsheet must contain at least two columns: source and at least one translation

SOURCE
This column contains the source/original texts. Cell values in the column must be text. The column name must be "source".

REFERENCE
This column contains the reference translations. Cell values in the column must be text. The column name must be “reference”.

This column is optional. If you don’t have reference translations, you can compute metrics that work without reference translations.

TRANSLATION COLUMNS
All other columns contain machine translations. Cell values in these columns must be text. The columns can have any name. If a column name contains "custom", Intento MT Studio will show this translation in light blue on the charts.

If your spreadsheet has any other columns with content, delete them.

If your spreadsheet contains empty rows or columns, you don't need to delete them.
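
For example, a minimal valid spreadsheet could look like the CSV below (the provider column names are just illustrative; only "source" and "reference" are fixed names):

source,reference,deepl,my-custom-model
Open the file.,Öffnen Sie die Datei.,Datei öffnen.,Öffnen Sie die Datei.
Save your changes.,Speichern Sie Ihre Änderungen.,Änderungen speichern.,Speichern Sie Ihre Änderungen.

Here the "my-custom-model" column would be shown in light blue on the charts because its name contains "custom".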

VALUES IN COLUMNS
Intento MT Studio has specific requirements for cell values. If a cell doesn't meet the requirements, MT Studio ignores the entire row containing that cell. If no rows meet the requirements, the spreadsheet is not imported.

A row will be ignored if any of the following is true (see the validation sketch after this list):

  • Text is longer than 2,000 characters

  • Text is longer than 300 words

  • A cell value is not text

  • A cell is empty

  • A text in the "source" or "reference" column is a duplicate of another text in this column
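
If you want to pre-check a file against these rules before uploading it, here is a minimal sketch in Python with pandas (the file name is just an example; MT Studio applies its own checks on upload):

import pandas as pd

MAX_CHARS = 2000
MAX_WORDS = 300

# The first row of the file is used as column names.
df = pd.read_csv("test_set.csv")  # or pd.read_excel("test_set.xlsx") for XLS/XLSX

def valid_cell(value):
    """A cell must be non-empty text within the length limits."""
    if not isinstance(value, str) or not value.strip():
        return False
    return len(value) <= MAX_CHARS and len(value.split()) <= MAX_WORDS

# Flag rows where any cell is empty, non-text, or too long.
ok = df.apply(lambda row: all(valid_cell(v) for v in row), axis=1)

# Flag rows whose "source" or "reference" text also appears in another row.
for col in ("source", "reference"):
    if col in df.columns:
        ok &= ~df[col].duplicated(keep=False)

clean = df[ok]
print(f"{len(df) - len(clean)} of {len(df)} rows would be ignored")
clean.to_csv("test_set_clean.csv", index=False)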

Make Translations in MT Studio

If the project includes translations made in MT Studio, there is a dedicated Translations tab, where the system shows the current status of each translation:

4-Make translations.png

If a translation fails, you can restart it.

When the translations are ready, the system writes them all to the test set (you can find it on the Files tab).

What is a Test Set

A test set is the file you upload to MT Studio (source, reference, and MT model translations), which the system processes. This data is used to calculate the evaluation metrics.

If you make translations with stock models directly in MT Studio, the system creates the test set automatically. It likewise contains the source, reference, and MT model translations.

Quality Metrics

When translations are ready for evaluation, you can compute five quality metrics in MT Studio:

  1. COMET (and its reference-free variant, COMET-QE)

  2. BERTScore

  3. hLEPOR

  4. TER

  5. sacreBLEU

Select the ones you need, then select Start Scoring.

Here we describe the machine translation metrics (reference-based metrics) we use in MT Studio.

1- COMET
COMET is a neural framework for training multilingual machine translation evaluation models that obtains new state-of-the-art levels of correlation with human judgments. COMET predicts machine translation quality using information from both the source input and the reference translation. The greater the score, the closer a translation is to the reference translation. As of February 2022, MT Studio uses COMET version 1.0.1.

COMET-QE is a model similar to the main COMET model that predicts MT quality without reference translations.
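
For reference, the following is a minimal sketch of computing COMET with the open-source unbabel-comet package; it illustrates the metric rather than MT Studio's internal pipeline, and the available model names and the predict call differ between package versions:

from comet import download_model, load_from_checkpoint

# A COMET 1.x reference-based model; QE models take input without "ref".
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Öffnen Sie die Datei.",  # source segment
        "mt": "Open file.",              # machine translation
        "ref": "Open the file.",         # reference translation
    },
]

# In COMET 1.x, predict returns segment-level scores and a system-level score.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
print(seg_scores, sys_score)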

COMET supported languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

2- TER
Translation Edit Rate (TER) is an automatic metric based on edit distance. It measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform the MT output into the reference. It shows how much a human would have to edit a machine translation output to make it identical to a given reference translation. The corpus TER score is the total number of edits divided by the total number of reference words and multiplied by 100. TER ranges from 0 to infinity. The greater the score, the farther a translation is from the reference.
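
As an illustration (not MT Studio's exact implementation), a corpus TER score can be computed with the open-source sacrebleu package; the tiny example below is made up:

from sacrebleu.metrics import TER

hypotheses = ["the cat sat on mat"]        # MT output, one segment
references = [["the cat sat on the mat"]]  # one reference stream

ter = TER()
result = ter.corpus_score(hypotheses, references)
# 1 edit (an insertion) over 6 reference words -> TER = 1 / 6 * 100 ≈ 16.7
print(result.score)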

TER supported languages:

As an edit-distance-based metric, TER supports all languages.

3- BLEU
BLEU is an automatic metric based on n-grams. It measures the precision of n-grams of the machine translation output compared to the reference, weighted by a brevity penalty to penalize overly short translations. We use a particular implementation of BLEU, called sacreBLEU. It outputs corpus scores, not segment scores. BLEU ranges from 0 to 100. The greater the score, the closer a translation is to the reference.
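
For example, a corpus-level sacreBLEU score can be computed like this (a sketch with made-up segments, not MT Studio's internal code):

from sacrebleu.metrics import BLEU

hypotheses = ["The cat sat on the mat.", "Open the file."]
references = [["The cat is sitting on the mat.", "Open the file."]]

bleu = BLEU()
result = bleu.corpus_score(hypotheses, references)
print(result.score)  # a single corpus-level score on a 0-100 scale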

BLEU supported languages:

As an n-gram metric, BLEU supports all languages.

4- BERTScore
BERTScore is an automatic metric based on contextual word embeddings. It computes embeddings (BERT, RoBERTa, etc.) and pairwise cosine similarity between representations of a machine translation and a reference translation. Essentially, BERTScore aims to measure semantic similarity. The corpus BERTScore score is the arithmetic mean of segment scores. BERTScore ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
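
As an illustration, BERTScore can be computed with the open-source bert-score package (a sketch; the underlying embedding model MT Studio uses may differ):

from bert_score import score

candidates = ["Open file."]      # machine translations
references = ["Open the file."]  # reference translations

# Returns precision, recall, and F1 tensors with one value per segment.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())  # corpus score = arithmetic mean of segment scores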

BERTScore supported languages:

Albanian, Arabic, Aragonese, Armenian, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Chinese (Simplified), Chinese (Traditional), Chuvash, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Kirghiz, Korean, Latin, Latvian, Lithuanian, Lombard, Low Saxon, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Marathi, Minangkabau, Nepali, Newar, Norwegian (Bokmal), Norwegian (Nynorsk), Occitan, Persian (Farsi), Piedmontese, Polish, Portuguese, Punjabi, Romanian, Russian, Scots, Serbian, Serbo-Croatian, Sicilian, Slovak, Slovenian, South Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray-Waray, Welsh, West Frisian, Western Punjabi, Yoruba

5- hLEPOR
hLEPOR is an automatic metric based on n-grams. It considers enhanced length penalty, n-gram position difference penalty, and recall. hLEPOR is an advanced version of the LEPOR metric. In essence, hLEPOR computes the similarity of n-grams in a machine translation and a reference translation of a text segment. The corpus hLEPOR score is the arithmetic mean of segment scores. hLEPOR ranges from 0 to 1. The greater the score, the closer a translation is to the reference.

hLEPOR supported languages:

As an n-gram metric, hLEPOR supports all languages.

Scoring Results

MT Studio will compute quality scores for all models' translations. This takes about a minute for hLEPOR, BLEU, and TER, and several minutes for COMET and BERTScore. The more models you have, the longer scoring takes.

Before computing COMET and BERTScore, MT Studio lowercases the machine translations and the reference translations. Thus, capitalization errors will not lead to a lower score.

When scoring is complete, you'll see the scores for all models.

Models are sorted from highest-scoring to lowest-scoring by the first metric you've chosen. Note: higher translation quality means higher hLEPOR, BLEU, COMET, and BERTScore, but lower TER.
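
If you rank models yourself from a downloaded report, remember to sort ascending for TER and descending for the other metrics, for example (made-up scores):

# Hypothetical corpus scores taken from a downloaded report.
scores = {
    "provider-a": {"COMET": 0.71, "TER": 38.2},
    "provider-b-custom": {"COMET": 0.76, "TER": 33.5},
}

def rank(scores, metric):
    # For TER, lower is better; for all other metrics, higher is better.
    reverse = metric != "TER"
    return sorted(scores, key=lambda model: scores[model][metric], reverse=reverse)

print(rank(scores, "COMET"))  # ['provider-b-custom', 'provider-a']
print(rank(scores, "TER"))    # ['provider-b-custom', 'provider-a']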

For hLEPOR, COMET, and BERTScore, MT Studio computes 83% confidence intervals; two models whose 83% intervals do not overlap differ at roughly the 5% significance level. You can see the intervals right under the corpus scores:

5-corpus scores.png

To sort models by another metric, select the metric name.

Scoring Charts

Select Show charts to see visualizations of the corpus scores for each model:

6-Scoring charts.png

HOW TO READ CHARTS

Scoring charts let you distinguish translations by stock MT models from those by custom models. If a model name contains the word "custom", it is shown in light blue on the charts. Other models are dark blue.

The height of the bar shows the corpus score for the whole test set. The black ticks on the bars are 83% confidence intervals.

BLEU and TER are corpus-level metrics, so they do not have confidence intervals on their charts.

DOWNLOAD RESULTS

Select Download results to download a report with scores. The report includes:

  • A spreadsheet with corpus scores

  • Bar charts with models ranked according to the metrics you chose

  • A spreadsheet with segment scores (not available for some plans)

MT Customization Analysis

With Intento MT Studio, you can compare two translations in detail, for example, translations by a stock MT model and its customized version.

To start an in-depth analysis, select two models, then select Analyze and choose a metric for the analysis. Our MT customization analysis tool will open.

All analyses of pairs of models will be saved in the Analysis history tab in your project.

To learn more, see https://help.inten.to/hc/en-us/articles/360016908739-MT-customization-analysis