Abstract
The performance of Transformer-based neural machine translation models depends heavily on the quality of the training data: when that data contains a high proportion of noise, model performance deteriorates. This paper addresses the resulting loss of model capability by proposing an optimization method based on semantic confidence-weighted alignment. The method combines alignment metrics with model-confidence adjustments to recalibrate per-sample loss weights, strengthening the model's ability to identify and discount noisy data. Experimental results show that the approach significantly improves translation quality on noisy datasets, particularly for low-resource language pairs such as Malay-Chinese, with a notable increase in BLEU scores over traditional methods.
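The core idea — recalibrating each sentence pair's loss weight from a semantic alignment score and a model confidence score — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the combination function `alignment_weight`, the mixing coefficient `alpha`, and the weight floor are all assumptions made here for clarity.

```python
import numpy as np

def alignment_weight(sim: float, conf: float, alpha: float = 0.5, floor: float = 0.1) -> float:
    """Combine a semantic alignment score and a model confidence score
    (both assumed to lie in [0, 1]) into a loss weight clipped to
    [floor, 1]. The linear mix and the floor are illustrative choices,
    not the paper's published weighting scheme."""
    w = alpha * sim + (1.0 - alpha) * conf
    return max(floor, min(1.0, w))

def weighted_nll(token_log_probs, sim: float, conf: float) -> float:
    """Scale a sentence pair's negative log-likelihood by its weight,
    so likely-noisy pairs contribute less to the training loss."""
    nll = -float(np.sum(token_log_probs))
    return alignment_weight(sim, conf) * nll

# A clean pair (high similarity, high confidence) keeps most of its loss;
# a noisy pair (low similarity, low confidence) is strongly down-weighted.
log_probs = np.log([0.9, 0.8, 0.85])
clean_loss = weighted_nll(log_probs, sim=0.92, conf=0.88)
noisy_loss = weighted_nll(log_probs, sim=0.15, conf=0.20)
```

In a real training loop the per-sentence weights would multiply the unreduced cross-entropy before averaging over the batch, leaving gradients from trusted pairs dominant.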
Data availability
The datasets used in this article are available at: https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ (Asian Language Treebank (ALT) Project) and https://opus.nlpl.eu/CCMatrix-v1.php (CCMatrix).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grants No. 62376111, U23A20388, U21B2027, 62366027); the Yunnan High-Tech Industry Development Project (Grant No. 201606); the Key Research and Development Program of Yunnan Province (Grants No. 202103AA080015, 202303AP140008); the Basic Research Plan of Yunnan Province (Grant No. 202001AS070014); and the Yunnan Province Program for Science and Technology Talent and Platform (Grant No. 202105AC160018).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhuang, X., Gao, S., Yu, Z. et al. Low resource neural machine translation model optimization based on semantic confidence weighted alignment. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02148-w