Contents
- Create a project
- How to make sure your data is correct
- Make translations in MT Studio
- What is a test set
- Quality metrics
- Scoring results
- MT Customization analysis
Create a project
- Visit Intento MT Studio
- Select Create project → Evaluate models
- Enter a name for your new project and the source and target languages.
- Upload your data (see "How to make sure your data is correct" below).
- The system processes the file and shows its metadata: the number of uploaded segments and the translations already provided for specific models. Up to 30 machine translations are supported.
The optimal number of segments for evaluation is 2,000.
- If you have the option to translate with stock engines in the Studio, select the providers you will use to prepare translations:
- via Intento: all usage goes through Intento
- via connected account: all usage goes through your own direct contract with the provider (Intento won't charge for it)
- Start the project
How to make sure your data is correct
For evaluation, in most cases you should provide a translation memory (source / golden reference) and translations produced by different models.
Your spreadsheet must be in .xls, .xlsx, .csv, or .tsv format.
Mandatory conditions for the dataset:
- The first row must contain column names.
- If the Studio translations feature is turned on: the spreadsheet must contain at least the source column.
- If the Studio translations feature is not available: the spreadsheet must contain at least two columns: the source and at least one translation.
SOURCE
This column contains the source/original texts. Cell values in the column must be text. The column name must be "source".
REFERENCE
This column contains the reference translations. Cell values in the column must be text. The column name must be “reference”. NOTE: this column isn’t obligatory. If you don’t have reference translations, you can compute metrics that work without reference translations.
TRANSLATION COLUMNS
All other columns contain machine translations. Cell values in the column must be text. These columns can have any name. If a column name contains "custom", Intento MT Studio will show that translation in light blue on the charts.
Note:
If your spreadsheet has any other columns with content, delete them.
If your spreadsheet contains empty rows and columns, there is no need to delete them.
Values in columns:
Intento MT Studio has specific requirements for cell values. If a cell doesn't meet the requirements, MT Studio ignores the entire row containing that cell. If no cells meet the requirements, the whole spreadsheet will not be imported.
A row will be ignored if any of the following is true (see the pre-check sketch after this list):
- Text is longer than 2,000 characters.
- Text is longer than 300 words.
- A cell value is not text.
- A cell is empty.
- The text in the "source" or "reference" column duplicates another text in the same column.
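If you want to catch these issues before uploading, you can run a rough local pre-check. The sketch below (using pandas) mirrors the documented rules above; it is an illustration, not MT Studio's own validation logic, the function name is hypothetical, and keeping the first occurrence of a duplicated source/reference text is an assumption.

```python
# Rough local pre-check of a test set, mirroring the rules above.
# Illustrative only: not MT Studio's own validation code; the duplicate
# handling (keep the first occurrence) is an assumption.
import pandas as pd

MAX_CHARS = 2000   # "Text is longer than 2,000 characters"
MAX_WORDS = 300    # "Text is longer than 300 words"

def precheck(path: str) -> pd.DataFrame:
    # .xls/.xlsx need an Excel engine such as openpyxl; .csv/.tsv are plain text.
    if path.endswith((".xls", ".xlsx")):
        df = pd.read_excel(path)
    else:
        df = pd.read_csv(path, sep="\t" if path.endswith(".tsv") else ",")

    if "source" not in df.columns:
        raise ValueError('The spreadsheet must contain a "source" column.')

    def cell_ok(value) -> bool:
        # Cells must be non-empty text within the length limits.
        return (
            isinstance(value, str)
            and value.strip() != ""
            and len(value) <= MAX_CHARS
            and len(value.split()) <= MAX_WORDS
        )

    valid = df[df.apply(lambda row: all(cell_ok(v) for v in row), axis=1)]

    # Drop rows whose source or reference text duplicates an earlier row.
    for col in ("source", "reference"):
        if col in valid.columns:
            valid = valid[~valid[col].duplicated(keep="first")]

    print(f"{len(valid)} rows would be imported, {len(df) - len(valid)} ignored.")
    return valid
```

For example, precheck("testset.xlsx") prints how many rows would survive the import and returns the cleaned data.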
Make translations in MT Studio
If the project includes translations made in MT Studio, the project has a dedicated Translations tab where the system shows the current status of translations.
You can restart a translation if the translation process failed.
When translations are ready, the system writes all of them to the test set (you can find it on the Files tab).
What is a test set
A test set is a file you upload to MT Studio (source / reference / MT model translations), which the system processes. This data is used to calculate evaluation metrics.
If you make translations with stock models right in MT Studio, the system creates a test set automatically. It likewise contains the source, reference, and MT model translations.
Quality metrics
When translations are ready to evaluate, you can compute five quality metrics in MT Studio: COMET (and its reference-free variant COMET-QE), BERTScore, hLEPOR, TER, and sacreBLEU. Select the ones you need, then select Start scoring.
Below we describe the machine translation metrics that we use in MT Studio (reference-based metrics, plus the reference-free COMET-QE).
COMET
COMET is a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgments. COMET predicts machine translation quality using information from both the source input and the reference translation. The greater the score, the closer a translation is to the reference translation. As of February 2022, MT Studio uses COMET version 1.0.1.
COMET-QE is a model similar to the main COMET model that predicts MT quality without reference translations.
References
- Rei, Ricardo, Craig Stewart, Ana C. Farinha and Alon Lavie. “COMET: A Neural Framework for MT Evaluation.” EMNLP (2020).
Supported languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
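The underlying metric is also available as the open-source unbabel-comet package, so you can reproduce scores outside MT Studio. Below is a minimal sketch against the COMET 1.0.x API; the checkpoint names are common public ones, and whether they match MT Studio's exact configuration is an assumption.

```python
# Sketch: scoring with the open-source COMET package (unbabel-comet 1.0.x).
# Checkpoint choice and settings are illustrative, not MT Studio's documented setup.
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-da")  # reference-based COMET
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to stop the fire.",
    },
]

# In COMET 1.0.x, predict() returns (segment_scores, system_score);
# newer releases return a Prediction object instead.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
print(sys_score)

# For COMET-QE (no reference needed), a QE checkpoint such as
# "wmt20-comet-qe-da" can be loaded the same way, with only "src" and "mt"
# in each data item.
```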
TER
Translation Edit Rate (TER) is an automatic metric based on edit distance. It measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform the MT output into the reference. It shows how much a human would have to edit a machine translation output to make it identical to a given reference translation. The corpus TER score is the total number of edits divided by the total number of words, multiplied by 100. TER ranges from 0 to infinity. The greater the score, the farther a translation is from the reference.
References
- Snover, Matthew, Bonnie J. Dorr, R. Schwartz and L. Micciulla. “A Study of Translation Edit Rate with Targeted Human Annotation.” (2006).
- TER implementation
Supported languages: As an edit-distance metric, TER supports all languages
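To make the corpus formula above concrete, here is a tiny worked example with made-up numbers:

```python
# Worked example of the corpus TER formula described above (made-up numbers).
total_edits = 120   # edits needed to turn the MT output into the references
total_words = 800   # total number of words in the test set

corpus_ter = total_edits / total_words * 100  # = 15.0; lower is better
print(corpus_ter)
```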
BLEU
BLEU is an automatic metric based on n-grams. It measures the precision of n-grams of the machine translation output compared to the reference, weighted by a brevity penalty to punish overly short translations. We use a particular implementation of BLEU, called sacreBLEU. It outputs corpus scores, not segment scores. BLEU ranges from 0 to 100. The greater the score, the closer a translation is to the reference.
References
- Papineni, Kishore, S. Roukos, T. Ward and Wei-Jing Zhu. “Bleu: a Method for Automatic Evaluation of Machine Translation.” ACL (2002).
- Post, Matt. “A Call for Clarity in Reporting BLEU Scores.” WMT (2018).
- sacreBLEU implementation
Supported languages: As an n-gram metric, BLEU supports all languages
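Because MT Studio uses sacreBLEU, you can reproduce comparable corpus BLEU numbers locally with the sacrebleu Python package. The same package also ships a TER metric; reusing it for TER here is an assumption made for illustration, since the article does not name MT Studio's exact TER implementation.

```python
# Sketch: corpus-level BLEU and TER with the sacrebleu package.
# sacreBLEU is the BLEU implementation named above; using sacrebleu's TER
# as well is an assumption for illustration.
from sacrebleu.metrics import BLEU, TER

translations = ["hello there, world", "the cat sat on a mat"]
references = ["hello, world", "the cat sat on the mat"]

# corpus_score takes a list of hypotheses and a list of reference streams.
print(BLEU().corpus_score(translations, [references]).score)  # higher is better
print(TER().corpus_score(translations, [references]).score)   # lower is better
```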
BERTScore
BERTScore is an automatic metric based on contextual word embeddings. It computes embeddings (BERT, RoBERTa, etc.) and pairwise cosine similarity between representations of a machine translation and a reference translation. Essentially, BERTScore aims to measure semantic similarity. The corpus BERTScore score is the arithmetic mean of segment scores. BERTScore ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
References
- Zhang, Tianyi, V. Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi. “BERTScore: Evaluating Text Generation with BERT.” ArXiv abs/1904.09675 (2020)
- BERTScore implementation
Supported languages: Albanian, Arabic, Aragonese, Armenian, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Chinese (Simplified), Chinese (Traditional), Chuvash, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Kirghiz, Korean, Latin, Latvian, Lithuanian, Lombard, Low Saxon, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Marathi, Minangkabau, Nepali, Newar, Norwegian (Bokmal), Norwegian (Nynorsk), Occitan, Persian (Farsi), Piedmontese, Polish, Portuguese, Punjabi, Romanian, Russian, Scots, Serbian, Serbo-Croatian, Sicilian, Slovak, Slovenian, South Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray-Waray, Welsh, West Frisian, Western Punjabi, Yoruba
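Segment and corpus BERTScore can likewise be reproduced with the open-source bert_score package; the embedding model used below is the package default for the language, which may differ from MT Studio's configuration.

```python
# Sketch: BERTScore with the open-source bert_score package.
# The embedding model is the package default for the language; MT Studio's
# exact configuration is not documented here.
from bert_score import score

translations = ["hello there, world", "the cat sat on the mat"]
references = ["hello, world", "the cat sat on the mat"]

# P, R, F1 are per-segment tensors; F1 is the usual headline score.
P, R, F1 = score(translations, references, lang="en")

# The corpus score is the arithmetic mean of segment scores, as described above.
print(F1.mean().item())
```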
hLEPOR
hLEPOR is an automatic metric based on n-grams. It takes into account an enhanced length penalty, an n-gram position difference penalty, and recall. hLEPOR is an enhanced version of the LEPOR metric. Basically, hLEPOR computes the similarity of n-grams in a machine translation and a reference translation of a text segment. The corpus hLEPOR score is the arithmetic mean of segment scores. hLEPOR ranges from 0 to 1. The greater the score, the closer a translation is to the reference.
References
- Han, Lifeng, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. “Language-independent Model for Machine Translation Evaluation with Reinforced Factors.” (2013).
- hLEPOR implementation
Supported languages: As an n-gram metric, hLEPOR supports all languages
Scoring results
MT Studio will compute quality scores for all models' translations. This takes about a minute for hLEPOR, BLEU, and TER, and several minutes for COMET and BERTScore. The more models you have, the longer scoring takes.
Note on COMET and BERTScore: before computing these metrics, MT Studio lowercases both the machine translations and the reference translations. Thus, errors in capitalization will not lead to a lower score.
When scoring is complete, you'll see the scores for all models.
Models are sorted from highest-scoring to lowest-scoring by the first metric you've chosen. Note: higher translation quality means higher hLEPOR, BLEU, COMET, and BERTScore, but lower TER.
For hLEPOR, COMET, and BERTScore, MT Studio computes 83% confidence intervals. You can see them right under the corpus scores.
To sort models by another metric, select the metric name.
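MT Studio does not document how the 83% confidence intervals mentioned above are computed. One common approach for metrics with per-segment scores (hLEPOR, COMET, BERTScore) is bootstrap resampling over the segment scores; the sketch below shows that approach as an assumption, not as MT Studio's actual method.

```python
# Sketch: an 83% bootstrap confidence interval over per-segment scores.
# This is one common technique, shown as an assumption; MT Studio's internal
# method is not documented here.
import random
import statistics

def bootstrap_ci(segment_scores, level=0.83, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(segment_scores, k=len(segment_scores))
        means.append(statistics.mean(sample))
    means.sort()
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

segment_scores = [0.71, 0.80, 0.65, 0.90, 0.78, 0.83, 0.74, 0.69]  # hypothetical
print(bootstrap_ci(segment_scores))
```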
Scoring charts
Select Show charts to see visualizations of the corpus scores for each model.
How to read the charts
Scoring charts make it easy to tell apart translations by stock MT models and custom models. If the model name has the word "custom" in it, it's light blue on the charts. Other models are dark blue.
The height of the bar shows the corpus score for the whole test set. The black ticks on the bars are 83% confidence intervals.
BLEU and TER are corpus metrics, so there are no confidence intervals on the BLEU and TER charts.
Download results
Select Download results to download a report with scores. The report includes:
- a spreadsheet with corpus scores
- bar charts with models ranked according to the metrics you chose
- a spreadsheet with segment scores (not available for some plans)
MT Customization analysis
With Intento MT Studio, you can compare two translations in detail, for example, translations by a stock MT model and its customized version.
To start an in-depth analysis, select two models, then select Analyze and choose a metric for analysis. Our MT customization analysis tool will open. See more about MT customization analysis here: https://help.inten.to/hc/en-us/articles/360016908739-MT-customization-analysis
All analyses of pairs of models will be saved in the Analysis history tab in your project.