Ulsan Univ. Repository Thesis General Graduate School Computer Engineering & Information Technology 2. Theses (Ph.D)

Enhancing the performance of Vietnamese-Korean Neural Machine Translation using Contextual Embedding

Abstract: Since the introduction of deep learning, a series of advances has been reported in the field of automatic machine translation (MT). However, Vietnamese-Korean MT systems face many challenges because of a lack of data. In this research, we built open, extensive Vietnamese-Korean parallel corpora, consisting of over 412 thousand sentence pairs, for training MT models.
In addition, because a word can have multiple meanings depending on its context, it is difficult for an MT system to interpret the meaning of the corpus text. This dissertation discusses a method of applying a linguistic annotation, Part-of-Speech (POS) tagging, to Vietnamese sentences to improve the performance of Vietnamese-Korean MT systems. The experimental results indicate that POS-tagging the Vietnamese sentences can improve the quality of Vietnamese-Korean Neural MT (NMT) in terms of Bilingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores. After applying POS tagging to the Vietnamese corpus, our Vietnamese-Korean MT system improved by 1.07 BLEU points and 2.96 TER points.
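As a rough illustration of what POS annotation of a source sentence might look like before it is passed to an NMT system, the following minimal sketch attaches one tag to each token. The function name `annotate_with_pos`, the `token/TAG` format, and the example tags are assumptions made for illustration only; the abstract does not specify the annotation format or the tag set used in the dissertation.

```python
def annotate_with_pos(tokens, tags, sep="/"):
    """Join each token with its POS tag, e.g. "học" + "V" -> "học/V".

    The "token/TAG" format is an assumption for illustration; the
    abstract does not state how tags are attached to the source text.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{tok}{sep}{tag}" for tok, tag in zip(tokens, tags))


# Hypothetical word-segmented Vietnamese sentence ("I study Korean")
# with illustrative tags: P = pronoun, V = verb, N = noun.
tokens = ["Tôi", "học", "tiếng_Hàn"]
tags = ["P", "V", "N"]
annotated = annotate_with_pos(tokens, tags)
```

In a pipeline of this kind, the tags would come from an automatic tagger, and the annotated sentences would replace the plain Vietnamese side of the parallel corpus during training and inference.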
In recent years, BERT, a state-of-the-art contextual embedding model introduced by Google, has been incorporated into MT models in various ways to boost the accuracy of MT systems. A BERT model for Vietnamese has also been built, and it has significantly improved results on natural language processing (NLP) tasks such as POS tagging, named-entity recognition (NER), dependency parsing, and natural language inference. This dissertation discusses a method for applying POS tagging, itself developed on the basis of the BERT model, to Vietnamese sentences to improve the performance of Vietnamese-Korean MT systems. Moreover, in our experiments we injected the Vietnamese BERT into the NMT model, with the BERT model connected concurrently to both the encoder layers and the decoder layers. The MT results show that using the contextual embedding model significantly enhances the performance of Vietnamese-Korean MT, by 2.78 BLEU points at the sentence level and 3.01 BLEU points at the document level.
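The abstract does not describe how the BERT outputs are connected to the encoder and decoder layers. One plausible reading, sketched below under explicit assumptions, is that each layer attends both to its own hidden states and to the BERT states and averages the two results before a residual connection. The names `attend` and `fused_layer` are hypothetical, and learned projections, multiple heads, layer normalization, and the feed-forward sublayer are all omitted, so the hidden and BERT vectors are assumed to share the same width.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


def fused_layer(hidden, bert_states):
    """Assumed fusion: average self-attention and BERT attention, add residual."""
    self_out = attend(hidden, hidden, hidden)
    bert_out = attend(hidden, bert_states, bert_states)
    return [
        [h + 0.5 * (s + b) for h, s, b in zip(h_row, s_row, b_row)]
        for h_row, s_row, b_row in zip(hidden, self_out, bert_out)
    ]


# Two NMT token states and three BERT subword states, width 2 (toy values).
hidden = [[1.0, 0.0], [0.0, 1.0]]
bert_states = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = fused_layer(hidden, bert_states)
```

Because the BERT states enter only as keys and values, their sequence length may differ from that of the NMT hidden states, which matters when BERT and the NMT model use different tokenizations. In a decoder layer, the same pattern would sit alongside the usual attention over the encoder output.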