
The Multi-Hot Representation-Based Language Model to Maintain Morpheme Units

Alternative Title
A Multi-Hot Representation-Based Language Model for Maintaining Morpheme Units
Abstract
The emergence of large-scale deep learning models has brought rapid improvements in Natural Language Processing (NLP) performance. Language models have typically represented natural language in token units, chosen to reduce the proportion of unknown tokens. However, tokenization in language models raises language-specific issues. One key issue is that separating words by morphemes may distort the original meaning; it can also be challenging to apply the information surrounding a word, such as its semantic network. We propose a multi-hot representation language model that maintains Korean morpheme units. When no matching token exists for a morpheme, the method represents that single morpheme as a group of syllable-based tokens. The model has demonstrated performance similar to existing models in various natural language processing applications. By maintaining morpheme units, the proposed model retains the minimum unit of meaning and can easily accommodate the extension of semantic information.
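The fallback described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary, the function name multi_hot, and the use of an [UNK] slot for syllables missing from the vocabulary are all assumptions made for illustration. A morpheme found in the vocabulary gets a one-hot vector; otherwise every syllable of the morpheme switches on its own position in a single vector.

```python
# Hypothetical toy vocabulary mixing morpheme-level and syllable-level tokens.
# The entries and indices are illustrative assumptions, not from the paper.
VOCAB = {"[UNK]": 0, "학교": 1, "먹": 2, "사": 3, "과": 4, "나": 5, "무": 6}


def multi_hot(morpheme, vocab):
    """Return a 0/1 vector of length len(vocab) for a single morpheme."""
    vec = [0] * len(vocab)
    if morpheme in vocab:
        # A matching morpheme token exists: ordinary one-hot encoding.
        vec[vocab[morpheme]] = 1
        return vec
    # No matching token: activate one position per syllable. Each precomposed
    # Hangul character is treated as one syllable. Syllables missing from the
    # vocabulary fall back to [UNK] (an assumption of this sketch).
    for syllable in morpheme:
        vec[vocab.get(syllable, vocab["[UNK]"])] = 1
    return vec


known = multi_hot("학교", VOCAB)        # one position set
unknown = multi_hot("사과나무", VOCAB)  # four syllable positions set
partial = multi_hot("사랑", VOCAB)      # one syllable plus [UNK]
```

In this sketch the morpheme still occupies a single input position, so the morpheme unit is kept. A plain 0/1 vector records neither syllable order nor repetition; how the published model treats those details is not stated in the abstract.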
Author(s)
Ju-Sang Lee; Joon-Choul Shin; Choel-Young Ock
Issued Date
2022
Type
Article
Keyword
language model; tokenization; multi-hot representation; maintain morpheme units; morpheme and syllable-base tokens
DOI
10.3390/app122010612
URI
https://oak.ulsan.ac.kr/handle/2021.oak/15258
Publisher
APPLIED SCIENCES-BASEL
Language
English
ISSN
2076-3417
Citation Volume
12
Citation Number
20
Citation Start Page
1
Citation End Page
9
Appears in Collections:
Engineering > IT Convergence
Access and License
  • Access type: Public
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.