Low-resource Neural Machine Translation

Image by Mohamed Hassan from Pixabay

Developments in the Neural Machine Translation (NMT) community have been showing the way for the rest of the text generation community for the last 5-6 years, especially since the advent of seq2seq models for translation [Sutskever et al., 2014; Bahdanau et al., 2014]. While the more recent advances made for languages like French (fr) and German (de) have been astonishing, one frequently overlooked fact is the amount of data and resources required to achieve those results.
For low-resource languages like Basque (eu) and Gujarati (gu) in particular, it would be almost impossible to replicate those lofty scores, since far less data is freely available. However, encouraging progress has been made using transfer learning [Zoph et al., 2016], whereby a parent model trained on a much larger dataset (say, de-en) is used to provide a good initialization for translation in a low-resource setting (gu-en).
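
To make the transfer idea concrete, here is a minimal sketch, assuming PyTorch, of warm-starting a child model from a parent checkpoint. This is my illustration rather than either paper's exact recipe, and the names involved (`parent_de_en.pt`, `child_model`) are hypothetical:

```python
import torch

def init_child_from_parent(child_model, parent_state_dict):
    """Warm-start a child model (e.g. gu-en) from a parent (e.g. de-en).

    Copies every parent parameter whose name and shape match the child;
    anything that differs (e.g. embeddings for a new source vocabulary)
    keeps its fresh random initialization.
    """
    child_state = child_model.state_dict()
    transferred = {
        name: tensor
        for name, tensor in parent_state_dict.items()
        if name in child_state and child_state[name].shape == tensor.shape
    }
    child_state.update(transferred)
    child_model.load_state_dict(child_state)
    return sorted(transferred)  # names of the warm-started parameters

# Hypothetical usage:
# parent_sd = torch.load("parent_de_en.pt", map_location="cpu")
# warm_started = init_child_from_parent(child_model, parent_sd)
```
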
As part of this project, I implemented an ACL 2019 paper [Kim et al., 2019] on transfer learning in NMT, which uses cross-lingual embedding mappings [Conneau et al., 2018] so that no shared vocabulary is needed between the parent and child language pairs. Due to limited computational resources, I was only able to replicate the general trend of the results reported in the paper for Basque-English, and to extend the framework to Gujarati-English.
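
The cross-lingual mapping step can be sketched with the supervised variant from this line of work: given a seed dictionary of word pairs, the best orthogonal map between two embedding spaces has a closed-form Procrustes solution via SVD (Conneau et al. also learn such a map adversarially, without any dictionary). A toy NumPy illustration on synthetic stand-in embeddings:

```python
import numpy as np

def procrustes_map(X, Y):
    """Solve min_W ||W X - Y||_F over orthogonal W (Procrustes).

    X, Y: (d, n) arrays holding n paired word embeddings
    (source and target sides of the seed dictionary).
    The optimum is W = U V^T from the SVD of Y X^T.
    """
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Stand-in data: 300-d embeddings for 5000 dictionary pairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5000))
W_true, _ = np.linalg.qr(rng.standard_normal((300, 300)))
Y = W_true @ X  # pretend the target space is a rotation of the source

W = procrustes_map(X, Y)
print(np.allclose(W @ X, Y, atol=1e-6))  # True: the rotation is recovered
```

In the transfer setting, a map like W serves to project the new source language's embeddings into the parent model's embedding space before fine-tuning.
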

Keshaw Singh
Member of Technical Staff II, Adobe Inc.
