Approaches to Machine Translation

This story is an overview of the field of Machine Translation. It introduces several highly cited works and well-known applications, and I’d like to encourage you to share your opinion in the comments. The aim of this story is to provide a good starting point for someone new to the field. It covers the three main approaches to machine translation as well as several challenges of the field. Hopefully, the literature mentioned in the story presents both the history of the problem and the state-of-the-art solutions.

Machine translation (MT) is the task of translating a text from a source language into its counterpart in a target language. There are many challenging aspects of MT: 1) the large variety of languages, alphabets and grammars; 2) translating a sequence (for example, a sentence) into another sequence is harder for a computer than working with numbers only; 3) there is often no single correct answer (e.g. when translating from a language whose pronouns are not gender-specific, “he” and “she” can correspond to the same word).
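
To get a concrete feel for the task, the snippet below runs an off-the-shelf pretrained NMT model on a single English sentence. The use of the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-de model is only an illustrative choice, not something the approaches below depend on.

    # Translate one sentence with a pretrained English->German model.
    # Requires: pip install transformers sentencepiece
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # example model choice
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Encode the source sentence, let the model generate, then decode the output.
    batch = tokenizer(["Machine translation is a hard problem."], return_tensors="pt")
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))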

Machine translation is a relatively old task. From the 1970s onwards, there have been projects aiming to achieve automatic translation. Over the years, three major approaches emerged:

  • Rule-based Machine Translation (RBMT): translation is driven by hand-crafted linguistic rules and bilingual dictionaries.
  • Statistical Machine Translation (SMT): translation models are learned automatically from large parallel corpora.
  • Neural Machine Translation (NMT): a neural network, typically an encoder-decoder model, is trained end-to-end on parallel data.

Rule-based Machine Translation

A rule-based system requires expert knowledge of both the source and the target language in order to develop the syntactic, semantic and morphological rules that carry out the translation.

The Wikipedia article on RBMT includes a basic example of rule-based translation from English to German. The translation needs an English-German dictionary, a rule set for English grammar and a rule set for German grammar.

An RBMT system contains a pipeline of Natural Language Processing (NLP) tasks, including tokenisation, part-of-speech tagging and so on. Most of these steps have to be performed for both the source and the target language.
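
To make the idea of such a pipeline concrete, here is a deliberately tiny sketch in Python. The dictionary, the single reordering rule and all names are invented for illustration; a real RBMT system relies on large dictionaries and hundreds of hand-written rules covering morphology and syntax.

    # Toy illustration of the RBMT idea: tokenise, apply a hand-written
    # word-order rule, then look each word up in a bilingual dictionary.
    # The dictionary and the rule are made up purely for illustration.
    EN_DE = {
        "i": "ich", "have": "habe", "seen": "gesehen",
        "the": "den", "dog": "Hund", ".": ".",
    }

    def tokenise(text: str) -> list[str]:
        """Very naive whitespace tokenisation."""
        return text.lower().replace(".", " .").split()

    def apply_rules(tokens: list[str]) -> list[str]:
        """One hand-written rule: in a perfect-tense clause such as
        'I have seen the dog', German moves the participle to the end
        ('Ich habe den Hund gesehen')."""
        if "have" in tokens and "seen" in tokens:
            tokens.remove("seen")
            insert_at = len(tokens) - 1 if tokens[-1] == "." else len(tokens)
            tokens.insert(insert_at, "seen")
        return tokens

    def translate(text: str) -> str:
        tokens = apply_rules(tokenise(text))
        return " ".join(EN_DE.get(tok, tok) for tok in tokens)

    print(translate("I have seen the dog."))  # -> ich habe den Hund gesehen .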

RBMT examples

SYSTRAN is one of the oldest machine translation companies. It translates from and to around 20 languages. SYSTRAN was used for the Apollo-Soyuz project (1973) and by the European Commission (1975) [1]. It was used by Google’s language tools until 2007. See more in its Wikipedia article or on the company’s website. With the emergence of SMT, SYSTRAN started using statistical models, and recent publications show that they are experimenting with the neural approach as well [2]. The OpenNMT toolkit is also the work of the company’s researchers [3].

Apertium is open-source RBMT software released under the terms of the GNU General Public License. It is available for 35 languages and is still under development. It was originally designed for languages closely related to Spanish [4]. The image below is an illustration of Apertium’s pipeline.

Advantages

  • No bilingual texts required
  • Domain-independent
  • Total control (a new rule can be written for every situation)
  • Reusability (existing rules for a language can be transferred when it is paired with new languages)

Disadvantages

  • Requires good dictionaries
  • Rules have to be set manually (which requires expertise)
  • The more rules there are, the harder the system is to deal with

Cross-Lingual Transfer Learning

Zoph et al. (2016) applied transfer learning to machine translation and showed that prior knowledge from translating a separate language pair can improve the translation of a low-resource language.

Figure 3 illustrates their idea of cross-lingual transfer learning. The researchers first trained an NMT model on a large French-English parallel corpus to create what they call the parent model. In a second stage, they continued to train this model, but fed it a considerably smaller parallel corpus of a low-resource language. The resulting child model inherits the knowledge of the parent model by reusing its parameters.

Compared to the classic approach of training only on the low-resource language, they report an average improvement of 5.6 BLEU points over the four languages they experiment with. They further show that the child model not only reuses knowledge about the structure of the high-resource target language but also knowledge about the process of translation itself.

The choice of the high-resource language to use as the parent source language is a key parameter in this approach. The decision is usually made heuristically, judging by the closeness to the target language in terms of distance in the language family tree or shared linguistic properties. A more principled exploration of which parent language works best for a given target language is given in Lin et al. (2019).
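
Below is a minimal sketch of this parent-child procedure in PyTorch. The tiny GRU encoder-decoder, the vocabulary sizes and all names are invented for illustration; the only point is the mechanism of reusing the parent’s parameters while re-initialising the source-side embeddings for the new low-resource source language, in the spirit of Zoph et al.’s setup.

    import torch
    import torch.nn as nn

    class TinySeq2Seq(nn.Module):
        """A deliberately tiny encoder-decoder, just to show parameter reuse."""
        def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, src, tgt):
            _, state = self.encoder(self.src_emb(src))
            dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
            return self.out(dec_out)

    # 1) Parent model, assumed to have been trained on the large French-English corpus.
    parent = TinySeq2Seq(src_vocab=30000, tgt_vocab=30000)
    # ... training on French-English would happen here ...
    torch.save(parent.state_dict(), "parent_fr_en.pt")

    # 2) Child model for a low-resource source language with its own vocabulary.
    child = TinySeq2Seq(src_vocab=8000, tgt_vocab=30000)

    # Reuse every parent parameter except the source embeddings, which belong
    # to the French vocabulary and are re-initialised for the new language.
    parent_state = torch.load("parent_fr_en.pt")
    transferred = {k: v for k, v in parent_state.items() if not k.startswith("src_emb")}
    child.load_state_dict(transferred, strict=False)

    # 3) The child is then fine-tuned on the small low-resource parallel corpus as usual.
    optimizer = torch.optim.Adam(child.parameters(), lr=1e-4)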

Multilingual Training

The path cleared by cross-lingual transfer learning led naturally to the use of multiple parent languages. The straightforward approach, first described by Dong et al. (2015), mixes all the available parallel data of the languages of interest and feeds it into training, as illustrated in Figure 4. The result in the example is a single model that translates from four languages (French, Spanish, Portuguese and Italian) into English.

Multilingual NMT offers three main advantages. Firstly, it reduces the number of individual training processes to one, yet the resulting model can translate many languages at once. Secondly, transfer learning allows all languages to benefit from each other through the transfer of knowledge. Finally, the model serves as a more solid starting point for a possible low-resource language.

For instance, if we were interested in training MT for Galician, a low-resource Romance language, the model illustrated in Figure 4 would be a perfect fit, as it already knows how to translate well from four other high-resource Romance languages.
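
As a sketch of the data side of this straightforward mixing approach, the snippet below concatenates several hypothetical parallel corpora and shuffles them before a single training run. The toy corpora and the language-tag convention are assumptions made purely for illustration.

    import random

    # Hypothetical parallel corpora: a list of (source, English) sentence pairs
    # per Romance source language. In practice these would be read from files.
    corpora = {
        "fr": [("le chat dort", "the cat sleeps")],
        "es": [("el gato duerme", "the cat sleeps")],
        "pt": [("o gato dorme", "the cat sleeps")],
        "it": [("il gatto dorme", "the cat sleeps")],
    }

    # Mix everything into one training set. Prepending a language tag to the
    # source sentence is one common convention for multilingual models
    # (an assumption here, not a requirement of the approach described above).
    mixed = []
    for lang, pairs in corpora.items():
        for src, tgt in pairs:
            mixed.append((f"<{lang}> {src}", tgt))

    random.shuffle(mixed)

    # 'mixed' is then fed to a single NMT training run, exactly as the
    # parallel corpus of a single language pair would be.
    for src, tgt in mixed[:2]:
        print(src, "->", tgt)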

A solid report on the use of multilingual models is given by Neubig and Hu (2018). They use a “massively multilingual” corpus of 58 languages to leverage MT for four low-resource languages: Azeri, Belarusian, Galician and Slovak. With a parallel corpus of only 4,500 Galician sentences, they achieve a BLEU score of up to 29.1, in contrast to 22.3 and 16.2 obtained with classic single-language-pair training with statistical machine translation (SMT) and NMT respectively.

Transfer learning also enables what is called zero-shot translation, when no training data is available for the language of interest. For Galician, the authors report a BLEU score of 15.5 on their test set without the model having seen any Galician sentences before.
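
For reference, BLEU scores like the ones quoted above are typically computed on a held-out test set with a tool such as sacreBLEU. A minimal sketch, with made-up sentences, could look like this:

    import sacrebleu  # pip install sacrebleu

    # Hypothetical system outputs and one reference translation per sentence.
    hypotheses = ["the cat sleeps on the mat", "he reads a book"]
    references = [["the cat is sleeping on the mat", "he is reading a book"]]

    # corpus_bleu expects a list of hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.1f}")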
