
Self-Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

Abstract
Background: When machine learning is applied in the real world, missing values are the first problem encountered.
Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization, and multiple
imputation by chained equations (MICE), as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and
decision tree.
Objective: The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed
to effectively impute data using a progressive method called self-training in the medical field where training data are scarce.
Methods: In this paper, we propose a self-training method that gradually increases the available data. Models trained with
complete data predict the missing values in incomplete data. Among the incomplete data, the data in which the missing value is
validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling.
This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the
accuracy of pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model.
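The self-training loop described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fit_knn`, `mse`, and `self_train`, the batch size, the round limit, and the held-out validation split are all hypothetical, a small k-nearest neighbor regressor stands in for random forest to keep the sketch dependency-free, and the acceptance rule (keep a batch of pseudolabels only if validation error does not increase) is a simple stand-in for the quantile-error criterion named in the title, which the abstract does not specify.

```python
import random
from statistics import mean


def fit_knn(X, y, k=3):
    """Return a predictor that averages the targets of the k nearest training rows."""
    def predict(x):
        order = sorted(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        return mean(y[i] for i in order[:k])
    return predict


def mse(predict, X, y):
    """Mean squared error of a predictor on (X, y)."""
    return mean((predict(x) - t) ** 2 for x, t in zip(X, y))


def self_train(X_complete, y_complete, X_incomplete, X_val, y_val,
               fit=fit_knn, batch=5, max_rounds=10):
    """Grow the training set with pseudolabels that do not hurt validation error."""
    X_train, y_train = list(X_complete), list(y_complete)
    pool = list(X_incomplete)
    model = fit(X_train, y_train)
    best = mse(model, X_val, y_val)
    for _ in range(max_rounds):           # assumed stopping condition: round limit
        if not pool:                      # ...or no incomplete rows left
            break
        candidates, pool = pool[:batch], pool[batch:]
        pseudo = [model(x) for x in candidates]   # pseudolabels from current model
        trial = fit(X_train + candidates, y_train + pseudo)
        err = mse(trial, X_val, y_val)
        if err <= best:                   # accept only if performance is not worse
            X_train, y_train = X_train + candidates, y_train + pseudo
            model, best = trial, err
        # rejected batches are simply discarded in this sketch
    return model, X_train, y_train, best


# Synthetic demo data (hypothetical): y = 2x + 1 with one feature.
random.seed(0)
X_all = [[random.uniform(0, 10)] for _ in range(60)]
y_all = [2 * x[0] + 1 for x in X_all]
X_c, y_c = X_all[:20], y_all[:20]          # complete rows
X_val, y_val = X_all[20:30], y_all[20:30]  # held-out complete rows for evaluation
X_inc = X_all[30:]                         # rows whose target is treated as missing

baseline_err = mse(fit_knn(X_c, y_c), X_val, y_val)
model, X_train, y_train, final_err = self_train(X_c, y_c, X_inc, X_val, y_val)
```

Because a batch is accepted only when validation error does not rise, `final_err` can never exceed `baseline_err` in this sketch; whether that carries over to unseen data depends on how representative the validation rows are.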
Results: Self-training with random forest (RF) yielded a mean squared error up to 12% lower than that of RF alone, and a Pearson
correlation coefficient 0.1% higher. This difference was confirmed statistically. In Friedman tests against MICE
and RF, self-training showed P values between .003 and .02. Wilcoxon signed-rank tests against mean imputation
yielded the lowest possible P value, 3.05e-5, in all situations.
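The "lowest possible P value" has a simple arithmetic reading. In an exact Wilcoxon signed-rank test with n untied, nonzero paired differences, each of the 2^n sign patterns is equally likely under the null hypothesis, so the most extreme two-sided result has P = 2/2^n. The sketch below assumes n = 16 pairs, a value inferred only because it reproduces 3.05e-5; the abstract does not state the sample size, and a one-sided test with n = 15 would give the same figure. The function name is hypothetical.

```python
def min_two_sided_p(n_pairs):
    """Smallest exact two-sided P value of a Wilcoxon signed-rank test
    with n_pairs untied, nonzero differences (all differences share one sign)."""
    return 2 / 2 ** n_pairs


# n_pairs = 16 is an assumption chosen to match the reported value.
print(f"{min_two_sided_p(16):.2e}")  # prints 3.05e-05
```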
Conclusions: Self-training showed significant results when the predicted values were compared with the actual values, but it
still needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance
depending on the pseudolabel evaluation method, which will be the main subject of our future research.
Author(s)
강희준; 권한슬; 김영학; 김윤하; 서혜람; 안임진; 전태준; 조하나; 최희정
Issued Date
2021
Type
Article
Keyword
Algorithms; Datasets; Health administration; Laboratories; Machine learning; Medical records; Statistical methods
DOI
10.2196/30824
URI
https://oak.ulsan.ac.kr/handle/2021.oak/8268
https://ulsan-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=TN_cdi_proquest_miscellaneous_2581824876&context=PC&vid=ULSAN&lang=ko_KR&search_scope=default_scope&adaptor=primo_central_multiple_fe&tab=default_tab&query=any,contains,Self-Training%20With%20Quantile%20Errors%20for%20Multivariate%20Missing%20Data%20Imputation%20for%20Regression%20Problems%20in%20Electronic%20Medical%20Records:%20Algorithm%20Development%20Study&offset=0&pcAvailability=true
Publisher
JMIR Public Health and Surveillance
Location
Canada
Language
English
ISSN
2369-2960
Citation Volume
7
Citation Number
10
Citation Start Page
30824
Citation End Page
30824
Appears in Collections:
Medicine > Medicine
Access and License
  • Access type: Public
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.