Research on the Development of Data Augmentation Techniques in the Field of Machine Translation

Zhipeng Zhang, Aleksey Poguda


Neural machine translation usually requires a large bilingual parallel corpus for training, and models easily overfit when the training set is small. Extensive experiments have shown that almost all strong neural network models are trained on large-scale datasets. High-quality bilingual parallel corpora are difficult to obtain, and manual annotation is usually expensive and time-consuming. Data augmentation is an effective technique for expanding training data and has achieved significant results in some areas. For example, in computer vision, training data is often augmented with methods such as cropping, flipping, warping, or color transformation. Although data augmentation has become a basic technique for training neural network models in computer vision, it has not yet been applied as well in natural language processing. This article systematically reviews the recent development of data augmentation techniques in natural language processing, especially in the subfield of machine translation, and examines the mainstream data augmentation methods used in machine translation.




ISSN: 2307-8162