Abstract
The performance of Transformer-based neural machine translation models depends heavily on the quality of the training data: when that data contains a high proportion of noise, model performance deteriorates. This paper addresses the resulting loss of model capability by proposing an optimization method based on semantic confidence-weighted alignment. The method combines alignment metrics with model-confidence adjustments to recalibrate per-sample loss weights, strengthening the model's ability to identify and discount noisy data. Experimental results show that the approach significantly improves translation quality on noisy datasets, particularly for low-resource language pairs such as Malay-Chinese, with a notable increase in BLEU scores over traditional methods.
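The core idea — recalibrating each sentence pair's loss weight from a semantic alignment score and a model confidence score — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the combination function `alignment_weight`, the mixing coefficient `alpha`, and the weight floor are all assumptions made here for clarity.

```python
import numpy as np

def alignment_weight(sim: float, conf: float, alpha: float = 0.5, floor: float = 0.1) -> float:
    """Combine a semantic alignment score and a model confidence score
    (both assumed to lie in [0, 1]) into a loss weight clipped to
    [floor, 1]. The linear mix and the floor are illustrative choices,
    not the paper's published weighting scheme."""
    w = alpha * sim + (1.0 - alpha) * conf
    return max(floor, min(1.0, w))

def weighted_nll(token_log_probs, sim: float, conf: float) -> float:
    """Scale a sentence pair's negative log-likelihood by its weight,
    so likely-noisy pairs contribute less to the training loss."""
    nll = -float(np.sum(token_log_probs))
    return alignment_weight(sim, conf) * nll

# A clean pair (high similarity, high confidence) keeps most of its loss;
# a noisy pair (low similarity, low confidence) is strongly down-weighted.
log_probs = np.log([0.9, 0.8, 0.85])
clean_loss = weighted_nll(log_probs, sim=0.92, conf=0.88)
noisy_loss = weighted_nll(log_probs, sim=0.15, conf=0.20)
```

In a real training loop the per-sentence weights would multiply the unreduced cross-entropy before averaging over the batch, leaving gradients from trusted pairs dominant.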
Data availability
The datasets used in this article are available at: https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ (Asian Language Treebank (ALT) Project) and https://opus.nlpl.eu/CCMatrix-v1.php (CCMatrix).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grants No. 62376111, U23A20388, U21B2027, 62366027); the Yunnan High-Tech Industry Development Project (Grant No. 201606); the Key Research and Development Program of Yunnan Province (Grants No. 202103AA080015, 202303AP140008); the Basic Research Plan of Yunnan Province (Grant No. 202001AS070014); and the Yunnan Province Program for Science and Technology Talent and Platform (Grant No. 202105AC160018).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhuang, X., Gao, S., Yu, Z. et al. Low resource neural machine translation model optimization based on semantic confidence weighted alignment. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02148-w