
Low resource neural machine translation model optimization based on semantic confidence weighted alignment

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

The performance of Transformer-based neural machine translation models depends heavily on the quality of the training data: when that data contains a high proportion of noise, translation quality deteriorates. This paper addresses the degradation caused by noisy datasets with an optimization method based on semantic confidence-weighted alignment. The method combines semantic alignment scores with model-confidence adjustments to recalibrate per-sample loss weights, improving the model's ability to identify and down-weight noisy sentence pairs. Experimental results show that the approach significantly improves translation quality on noisy data, particularly for low-resource language pairs such as Malay-Chinese, yielding a notable BLEU improvement over traditional methods.
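To make the abstract's weighting idea concrete, the following is a minimal PyTorch sketch of a sentence-level loss in which per-pair cross-entropy is reweighted by a semantic alignment score and a model-confidence term. The function name, the multiplicative combination, and the renormalization are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumed formulation, not the paper's exact method):
# reweight per-sentence cross-entropy by semantic alignment and
# model-confidence scores so noisy pairs contribute less gradient.
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, targets, align_scores, confidences, pad_id=0):
    """
    logits:       (batch, seq_len, vocab) decoder outputs
    targets:      (batch, seq_len) reference token ids
    align_scores: (batch,) semantic alignment score per sentence pair in [0, 1]
    confidences:  (batch,) model confidence per sentence pair in [0, 1]
    """
    # Per-token cross-entropy; padding positions contribute zero loss.
    ce = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )  # (batch, seq_len)
    mask = (targets != pad_id).float()
    sent_ce = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    # Assumed combination: low-alignment, low-confidence pairs are down-weighted.
    weights = align_scores * confidences
    weights = weights / weights.mean().clamp(min=1e-8)  # keep loss scale stable

    return (weights * sent_ce).mean()
```

Renormalizing the weights by their batch mean keeps the overall loss magnitude comparable to unweighted training, so the learning-rate schedule need not change; this design choice is likewise an assumption rather than something stated in the abstract.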


Data availability

The datasets used in this article are available at https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ (Asian Language Treebank (ALT) Project) and https://opus.nlpl.eu/CCMatrix-v1.php (CCMatrix).
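As an illustration of how semantic alignment scores might be obtained for such corpora, the sketch below scores sentence pairs with a multilingual sentence-embedding model in the spirit of multilingual SBERT; the specific model name and the use of plain cosine similarity are assumptions, not necessarily the paper's procedure.

```python
# Hypothetical sketch: scoring parallel sentence pairs (e.g., from CCMatrix)
# with a multilingual sentence-embedding model. The model name and cosine
# scoring are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def alignment_scores(src_sentences, tgt_sentences):
    """Cosine similarity between embeddings of aligned source/target pairs."""
    src_emb = model.encode(src_sentences, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sentences, convert_to_tensor=True)
    # Diagonal of the pairwise matrix: similarity of each aligned pair.
    return util.cos_sim(src_emb, tgt_emb).diagonal()

scores = alignment_scores(
    ["Saya suka membaca buku."],  # Malay: "I like reading books."
    ["我喜欢读书。"],               # Chinese: "I like reading books."
)
print(scores)  # values near 1.0 suggest a semantically well-aligned pair
```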


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grants No. 62376111, U23A20388, U21B2027, 62366027); the Yunnan High-Tech Industry Development Project (Grant No. 201606); the Key Research and Development Program of Yunnan Province (Grants No. 202103AA080015, 202303AP140008); the Basic Research Plan of Yunnan Province (Grant No. 202001AS070014); and the Yunnan Province Program for Science and Technology Talent and Platform (Grant No. 202105AC160018).

Author information

Corresponding author

Correspondence to ShengXiang Gao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhuang, X., Gao, S., Yu, Z. et al. Low resource neural machine translation model optimization based on semantic confidence weighted alignment. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02148-w
