The Data Cleaning options let you remove inconsistent and junk data from your training data. Such data can degrade model training results, so in some cases this step is necessary before customizing MT models.
To clean your training set:
- Open your MT Studio project and click the Data Cleaning button.
- Select the language pair of your TM file and upload it. Make sure the file has the proper format and structure, that is, it matches the requirements for training sets in MT Studio.
- Select and configure the cleaning parameters.
- Start cleaning. The system creates a clean training set in .csv format that you can then use for model customization.
Cleaning parameters
You can select and configure the following parameters for data cleaning:
- Remove empty segments. All segments without any text will be deleted.
- Remove duplicates. All segments will be deleted where:
  - the source text matches the target text, or
  - one source segment has multiple translations.
- Clean HTML markup from segments. All inline HTML markup tags will be deleted (for example, <h5> Heading 5 </h5> → Heading 5).
- Remove segments with emojis. Segments that contain emojis will be deleted.
- Remove segments with punctuation at the beginning. Segments with leading punctuation will be deleted. Note: this may be necessary because such segments are often fragments, for example a piece of a sentence or the tail of the previous sentence.
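To make the effect of these parameters concrete, here is a minimal Python sketch that applies comparable rules to a list of (source, target) pairs. It is an illustration only, not MT Studio's implementation: the names `strip_html` and `clean_pairs`, the emoji character ranges, the choice to treat a pair as empty when either side is empty, and the choice to check both sides for emojis and leading punctuation are all assumptions.

```python
import re
import string

# Hypothetical helpers for illustration; MT Studio's real logic may differ.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# Rough emoji ranges (assumption): common pictographs, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_html(text):
    """Delete inline HTML tags, then collapse leftover whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()


def clean_pairs(pairs):
    """Apply the cleaning rules to a list of (source, target) tuples."""
    # Clean HTML markup first so the later checks see plain text.
    pairs = [(strip_html(s), strip_html(t)) for s, t in pairs]

    # Collect the distinct translations seen for each source segment.
    translations = {}
    for s, t in pairs:
        translations.setdefault(s, set()).add(t)

    cleaned = []
    for s, t in pairs:
        if not s or not t:
            continue  # empty segment (assumption: either side empty)
        if s == t:
            continue  # source text matches the target text
        if len(translations[s]) > 1:
            continue  # one source segment has multiple translations
        if EMOJI_RE.search(s) or EMOJI_RE.search(t):
            continue  # segment contains an emoji
        if s[0] in string.punctuation or t[0] in string.punctuation:
            continue  # leading punctuation (ASCII punctuation only)
        cleaned.append((s, t))
    return cleaned
```

In this sketch every translation of an ambiguous source segment is dropped, which follows the wording of the Remove duplicates rule; a different tool could reasonably keep one of them instead.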
After the cleaning is finished, you can download the result.
File format
You can clean training sets in .tmx, .csv, and .tsv formats. Please make sure that your file is bilingual.
- .csv and .tsv files must contain exactly two columns, source and target, and no other columns.
- .tmx files must contain only two languages: the first will be the source language, and the second will be the target language.
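Before uploading a .csv or .tsv file, it can help to check the two-column requirement locally. The following is a minimal sketch using Python's standard `csv` module; the function name `check_two_columns` is hypothetical and the check is not part of MT Studio.

```python
import csv
import io


def check_two_columns(text, delimiter=","):
    """Return the 1-based numbers of rows that do not have exactly two columns."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [n for n, row in enumerate(reader, start=1) if len(row) != 2]


# Usage: an empty list means every row has a source and a target column.
print(check_two_columns("source,target\nHello,Hallo\n"))
print(check_two_columns("a\tb\tc\nx\ty\n", delimiter="\t"))
```

For a .tsv file, pass `delimiter="\t"`. Rows reported by the function, including blank lines, would need to be fixed or removed before upload.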