Data Cleaning

The Data Cleaning options let you remove inconsistent and junk data from your training data, since such data may degrade the model training results. In some cases, this step is necessary for customizing MT models:

1-emove inconsistent and junk.png

To clean your training set:

  1. Open your MT Studio project and click Data Cleaning.

  2. Select the language pair of your TM file and upload the file. Make sure it has the proper format and structure, that is, it matches the requirements for training sets in MT Studio.

  3. Select and configure the cleaning parameters.

  4. Start cleaning.

The system creates a clean training set in CSV format that you can then use for model customization.

Cleaning Parameters

You can select and configure the following parameters for data cleaning:

  • Remove empty segments.
    All segments without any text will be deleted.

  • Remove duplicates.
    All segments where the source text matches the target text will be deleted. A source segment that has multiple translations will also be deleted.

  • Clean HTML markup from segments.
    All inline HTML markup tags will be deleted from the segment text (for example, <h5> Heading 5 </h5> → Heading 5).

  • Remove segments with emojis.
    Segments that contain emojis will be deleted.

  • Remove segments with punctuation at the beginning.
    Segments with leading punctuation will be deleted.

    This can be useful because such segments are often sentence fragments, for example the continuation of a previous sentence.
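To make the effect of these parameters easier to picture, here is a minimal Python sketch that applies comparable rules to a list of (source, target) pairs. It is an illustration only, not MT Studio's actual implementation. All names (`clean_pairs`, `strip_html`, `EMOJI_RE`, and so on) are hypothetical, the emoji ranges are a rough approximation, and the handling of duplicates (dropping every segment whose source has more than one translation, and keeping one copy of exactly repeated pairs) is an assumption.

```python
import re
import unicodedata
from collections import defaultdict

# Rough approximation of common emoji code point ranges (assumption, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
# Matches opening and closing inline tags such as <h5> or </h5>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html(text):
    """Delete inline HTML tags and collapse the whitespace left behind."""
    return " ".join(TAG_RE.sub(" ", text).split())


def starts_with_punctuation(text):
    """True if the first non-space character is Unicode punctuation."""
    stripped = text.lstrip()
    return bool(stripped) and unicodedata.category(stripped[0]).startswith("P")


def clean_pairs(pairs):
    """Apply all five rules to a list of (source, target) tuples."""
    kept = []
    for source, target in pairs:
        source, target = strip_html(source), strip_html(target)
        if not source or not target:        # empty segments
            continue
        if source == target:                # source matches target
            continue
        if EMOJI_RE.search(source) or EMOJI_RE.search(target):
            continue
        if starts_with_punctuation(source) or starts_with_punctuation(target):
            continue
        kept.append((source, target))

    # Assumption: a source with several different translations is dropped
    # entirely, and exact repeats of a pair are reduced to one copy.
    translations = defaultdict(set)
    for source, target in kept:
        translations[source].add(target)
    result, seen = [], set()
    for pair in kept:
        if len(translations[pair[0]]) > 1 or pair in seen:
            continue
        seen.add(pair)
        result.append(pair)
    return result
```

Note that a check this simple would also drop legitimate segments that begin with punctuation, such as a Spanish question starting with "¿" or a quoted sentence, so it is worth reviewing your data before enabling that kind of rule.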

After the cleaning is finished, you can download the result:

2-download the result.png

File Format

You can clean training sets in .TMX, .CSV, and .TSV formats. Make sure that your file is bilingual. Also note that:

  • .CSV and .TSV files must contain exactly two columns, source and target, and no other columns.

  • .TMX files must contain only two languages: the first is treated as the source language, and the second as the target language.
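If you want to check a file against these requirements before uploading it, a small local script can help. The following Python sketch is one possible approach, not part of MT Studio. The function names are hypothetical, and it assumes that TMX languages are declared on `<tuv>` elements through the `xml:lang` attribute (or the older `lang` attribute).

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def delimited_has_two_columns(text, delimiter=","):
    """True if every non-empty row has exactly two fields (source, target)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]
    return bool(rows) and all(len(row) == 2 for row in rows)


def tmx_languages(text):
    """Return the language codes found on <tuv> elements, in order of first use."""
    langs = []
    for tuv in ET.fromstring(text).iter("tuv"):
        lang = tuv.get(XML_LANG) or tuv.get("lang")
        if lang and lang not in langs:
            langs.append(lang)
    return langs
```

For a .TSV file, pass `delimiter="\t"`. For a .TMX file, the list returned by `tmx_languages` should contain exactly two entries, the first being the source language and the second the target language.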