How to Optimize Text for Machine Translation
This article focuses on stock (publicly available) machine translation systems. Custom models can be trained on your own data samples to handle the specifics of your text style, whereas generic models have to deal with all kinds of text, so the source text itself needs to be optimized.
To optimize text for machine translation:
Use a formal writing style
Use a simplified sentence structure
Unify terms
Check orthography, punctuation, and misspelling
Unify formats
Use lowercase letters as much as possible
Mind e-mails, file paths, and URLs
Use glossaries for specialized terms
Use a unified approach for toponym translation
To get better MT results when sending translation requests, specify the source text language and the source text format, and stick to standard HTML tags.
See the following sections for details on each of the above-mentioned optimization methods.
Formal Writing Style
It is beneficial to remove or substitute the following:
Slang words [e.g., wooot, buddy, or dude]
Loanwords and Neologisms [e.g., Grand Prix, e-bike]
Idioms and specialized Local Terms [e.g., break the ice = “get the conversation going”]
Ambiguous words and words with a different meaning in source language dialects, for example:
a) Words ending in -ed or -ing
b) The word "table" could mean a piece of furniture or a list, depending on the context
c) The word "glass" could mean a material or a piece of tableware, etc.
Phrases based on Local Humor, Customs, Sayings, and Biases
Ad-hoc Abbreviations [e.g., French uses many abbreviations in everyday communication: BJR = bonjour, BZ = bisous or bises, etc.]
Instead, use phrases based on common knowledge [e.g., Earth is a planet]
Simplified Sentence Structure
To simplify sentence structure:
Make sentences consistent and self-sufficient
Don’t indulge in complex sentences with subordinate clauses
If you can, avoid passive voice
If needed, split a complex sentence
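One rough way to find sentences worth splitting is to flag the long ones. The following is a minimal sketch, assuming that word count is an acceptable stand-in for complexity; the function name `long_sentences` and the 25-word threshold are illustrative choices, not established limits.

```python
import re

def long_sentences(text: str, max_words: int = 25) -> list:
    """Return the sentences that contain more than max_words words.

    Sentences are split naively after '.', '!' or '?', so abbreviations
    such as "e.g." will cause false splits.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]
```

Each flagged sentence is only a candidate for review; a long sentence with a simple structure may translate perfectly well.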
Unify Terms
Unifying terms means using one term per concept. Instead of using both “client” and “customer” to describe the same thing, stick with only one term throughout.
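Term unification can be partly automated before the text is sent for translation. The following is a minimal sketch, assuming a hand-maintained mapping from variant terms to the preferred one; `PREFERRED_TERMS` and `unify_terms` are hypothetical names used only for illustration.

```python
import re

# Hypothetical mapping: variant term -> preferred term (all lowercase).
PREFERRED_TERMS = {"customer": "client", "customers": "clients"}

# Match any variant as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, PREFERRED_TERMS)) + r")\b",
    re.IGNORECASE,
)

def unify_terms(text: str) -> str:
    """Replace each variant term with its preferred form, keeping a leading capital."""
    def _swap(match):
        word = match.group(0)
        preferred = PREFERRED_TERMS[word.lower()]
        return preferred.capitalize() if word[0].isupper() else preferred
    return _PATTERN.sub(_swap, text)
```

A blind replacement like this can still produce awkward results, for example when a variant is part of a product name, so the output is worth a quick review.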
Orthography, Punctuation, and Misspelling
Mistyped words can be mistranslated — “void gaps” instead of “avoid gaps” completely changes the meaning of the sentence.
Once our office had to take a break when we saw a machine-translated result of the mistyped word “assked”.
Unify Formats
It is beneficial to unify the following formats:
Prices and Currencies [e.g., $1.000]
Units of measurement [e.g., kg]
Numerals [try to use digits instead of spelled-out numbers, e.g., use “1” instead of “one”]
Dates and Times [e.g., 2020-08-12, 14:45]
All other specified data and terms that could be unified
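As an illustration of format unification, dates can be rewritten into a single pattern before translation. The following is a minimal sketch, assuming the source text writes dates day-first as DD/MM/YYYY or DD.MM.YYYY; the function name `unify_dates` is hypothetical.

```python
import re
from datetime import date

# Assumption: dates are written day-first, with "/" or "." used consistently
# as the separator within one date (the \2 backreference enforces that).
_DATE = re.compile(r"\b(\d{1,2})([./])(\d{1,2})\2(\d{4})\b")

def unify_dates(text: str) -> str:
    """Rewrite day-first dates as YYYY-MM-DD; leave impossible dates untouched."""
    def _to_iso(match):
        day, month, year = int(match.group(1)), int(match.group(3)), int(match.group(4))
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            return match.group(0)  # not a real calendar date, keep as written
    return _DATE.sub(_to_iso, text)
```

If the source text mixes day-first and month-first dates, this approach cannot tell them apart, and the dates need to be checked by hand.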
Lowercase Letters
It is beneficial to use lowercase letters as much as possible, and:
Avoid unnecessary capitalization [e.g., use “counterparty” instead of “Counterparty”]
Avoid text typed in all capitals [e.g., the word “HERO” may be left untranslated]
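All-caps words can be lowercased automatically as long as genuine acronyms are protected. The following is a minimal sketch, assuming an allowlist of acronyms that must stay uppercase; `KEEP_UPPERCASE` and `soften_caps` are hypothetical names, and the pattern only covers unaccented Latin letters.

```python
import re

# Hypothetical allowlist of acronyms that must stay uppercase.
KEEP_UPPERCASE = {"HTML", "URL", "MT"}

# Words of two or more letters written entirely in capitals (A-Z only).
_ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")

def soften_caps(text: str) -> str:
    """Lowercase all-caps words unless they appear in the allowlist."""
    def _lower(match):
        word = match.group(0)
        return word if word in KEEP_UPPERCASE else word.lower()
    return _ALL_CAPS.sub(_lower, text)
```

A word lowercased at the start of a sentence loses its capital letter, so sentence-initial words may need to be recapitalized afterwards.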
E-mails, File paths, URLs
Avoid translating e-mails, file paths, and URLs, as they can produce unexpected results in the translation. For example, the e-mail address “daisy@garden.to” can be machine-translated as “flower @ yard”, which is probably not the intended outcome.
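One way to keep such strings out of the translation is to swap them for placeholders before sending the text and to restore them afterwards. The following is a minimal sketch, assuming the MT engine passes placeholders such as `__0__` through unchanged, which should be verified for the engine in use; `mask` and `unmask` are hypothetical names, the patterns are deliberately simple, and file paths are not covered.

```python
import re

# Deliberately simple patterns: http(s) URLs and e-mail addresses only.
_PROTECTED = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def mask(text: str):
    """Replace e-mails and URLs with numbered placeholders.

    Returns the masked text and the list of original strings.
    """
    originals = []
    def _stash(match):
        originals.append(match.group(0))
        return f"__{len(originals) - 1}__"
    return _PROTECTED.sub(_stash, text), originals

def unmask(text: str, originals) -> str:
    """Put the stashed strings back in place of their placeholders."""
    return re.sub(r"__(\d+)__", lambda m: originals[int(m.group(1))], text)
```

Typical usage is to call `mask` on the source text, translate the masked text, and then call `unmask` on the translation with the same list of originals.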
Specialized Terms Glossaries
Use glossaries for specialized terms:
Add sites (physical locations) and addresses [e.g., “Language Street” could be translated literally, as the target-language words for “language” and “street”]
Add product and service names [e.g., a machine-translated product name could differ from the one in your company's product naming guide]
Add Names and Acronyms to a glossary [e.g., acronym “WORLD” could be translated as the word “world”]
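A glossary is also useful for checking the result after translation. The following is a minimal sketch, assuming the glossary is a plain mapping from a source term to the rendering required in the target text; `GLOSSARY` and `glossary_violations` are hypothetical names, and the check is a simple case-sensitive substring search.

```python
# Hypothetical glossary: source term -> required rendering in the target text.
GLOSSARY = {
    "Language Street": "Language Street",  # address: keep untranslated
    "WORLD": "WORLD",                      # acronym: keep untranslated
}

def glossary_violations(source: str, translation: str, glossary: dict) -> list:
    """Return the source terms whose required rendering is missing from the translation."""
    return [
        term
        for term, required in glossary.items()
        if term in source and required not in translation
    ]
```

An empty list means every glossary term found in the source appears in its required form in the translation.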
Toponym Translation
Use a unified approach for toponym translation:
Choose whether to translate toponyms like La Grand-Place or keep their original names, and apply that choice consistently
Follow grammatical rules when keeping foreign words in their original language in translated texts. For example, if you need to use some original French words in a translated English text, stick with English language grammatical rules.
Get Better MT Results
To get better MT results when sending translation requests, pay attention to the following:
Source text language: If the source text language is not specified, language detection kicks in. Language detection takes time and can produce unexpected results in some cases. For example, Kungens Kurva (Swedish for “King’s Curve”) is the name of a place in Stockholm. If you translate it with no source language specified, it might be autodetected as Croatian or even Polish, and the translation will be very far from the original meaning.
Source text format: When specifying TEXT as a format, you’ll get plain text back. When specifying HTML, be ready to handle HTML entities in the translation result. For example, if you translate “Jag är mammas son” from Swedish to English with HTML format, you might get back “I&#39;m my mother&#39;s son”.
Tagged Text: When translating tagged text, consider sticking to standard HTML tags, as some MT engines treat non-standard tags as sentence breaks. Try translating “She <o>rose <o>and <o>left” to French, for example. You might get “Elle <o> Rose <o> et <o> la gauche” back while expecting something like “Elle s’est levée et est partie”.
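The first two points can be sketched in code. The following is a minimal example, assuming a generic request made of key-value fields; the field names (`text`, `source_language`, `target_language`, `format`) and both function names are hypothetical and do not correspond to any particular MT API. Only `html.unescape` is a real standard-library call.

```python
import html

def build_request(text: str, source_lang: str, target_lang: str,
                  text_format: str = "text") -> dict:
    """Assemble a translation request with hypothetical field names."""
    if text_format not in ("text", "html"):
        raise ValueError("text_format must be 'text' or 'html'")
    return {
        "text": text,
        "source_language": source_lang,  # explicit, so no auto-detection is needed
        "target_language": target_lang,
        "format": text_format,
    }

def clean_result(translated: str, text_format: str) -> str:
    """Decode HTML entities when HTML-formatted output is needed as plain text."""
    return html.unescape(translated) if text_format == "html" else translated
```

Decoding is only appropriate when the result will be used as plain text; if the translation is inserted back into an HTML page, the entities should be left as they are.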
If you stick to all the above-mentioned optimizations, you’re likely to be happy with the result.
If, however, you feel like you need to take some significant bits out of your text to make it machine-translatable, here is a trick: take them out, have it translated, and then bring them back, sprinkling your already decent-looking translated text with some spice and flavor.