Ulsan Univ. Repository Thesis General Graduate School Medical Engineering 1. Theses(Master)

A Retrospective Study on Cardiovascular Disease Risk Prediction based on Machine Learning Using Electronic Medical Records

Abstract: Background: The volume of clinical data stored in medical institutions is increasing rapidly with the development of healthcare-related technologies. Applying artificial intelligence (AI) approaches to medical data such as electronic medical records (EMRs) requires establishing and curating a specialized database. In particular, cardiovascular diseases (CVDs) are difficult to diagnose early and have risk factors that are easy to overlook. Early prediction and personalized treatment through the use of AI may help clinicians and patients manage CVDs more effectively. Moreover, since effective resource management in hospitals can improve the quality of medical services, predicting a patient’s length of hospitalization may support judicious decisions about bed management.

Objectives: First, we aim to build a database for CVDs (CardioNet) that is suited to AI technology and contributes to the overall care of patients with CVDs. Second, we aim to develop a machine learning (ML)-based model that predicts the discharge probability and explains individual risk factors in order to improve patient management. Third, we aim to develop a deep learning (DL)-based model that predicts the estimated glomerular filtration rate (eGFR) of inpatients with heart failure (HF), and to visualize the prediction results to enhance the quality of medical services.

Methods: First, we built CardioNet from the anonymized records of 748,474 patients who had visited the Asan Medical Center (AMC) or Ulsan University Hospital (UUH) because of CVDs between January 1, 2000, and December 31, 2016. We pre-processed the EMRs to remove errors and duplicates, and applied natural language processing to structure the free-text readings. Second, we created a suitable dataset by reindexing the date index, integrating the present features with past features from the previous 3 years, and imputing missing values. We then trained the ML-based predictive models, predicted the probability of discharge within 3 days, and explained the model outcomes by identifying, quantifying, and visualizing the model's features. Third, we extracted data on hospitalized patients with HF, pre-processed them, and created a time-series dataset to train a DL-based model that predicts eGFR. Additionally, we proposed visualizations of the DL-based model's outcomes for use in clinical practice.
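The second pre-processing step (combining present features with features from the previous 3 years, then imputing missing values) can be illustrated with a minimal sketch. The abstract does not state how past values were aggregated or how gaps were filled, so the window mean, the column-mean imputation, and the `creatinine_*` feature names below are all assumptions made for illustration only.

```python
from datetime import date, timedelta
from statistics import mean

# "Previous 3 years", approximated as 3 * 365 days (assumption).
LOOKBACK_DAYS = 3 * 365


def past_mean(history, index_date, lookback_days=LOOKBACK_DAYS):
    """Mean of the values recorded in the look-back window before index_date.

    history is a list of (date, value) pairs. Returns None when no value
    falls inside the window, which marks the feature as missing.
    """
    start = index_date - timedelta(days=lookback_days)
    values = [v for d, v in history if start <= d < index_date and v is not None]
    return mean(values) if values else None


def impute_with_column_mean(rows, column):
    """Replace None in one column with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed) if observed else None
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical lab history for one patient.
history = [
    (date(2012, 5, 1), 1.4),   # older than 3 years, ignored
    (date(2015, 3, 10), 1.0),
    (date(2016, 1, 20), 1.2),
]
index_date = date(2016, 6, 1)

rows = [
    {"creatinine_now": 1.1, "creatinine_past": past_mean(history, index_date)},
    {"creatinine_now": 0.9, "creatinine_past": past_mean([], index_date)},  # no history
]
rows = impute_with_column_mean(rows, "creatinine_past")
```

In this sketch the second row has no history, so its past feature is filled with the mean of the observed past values. A real pipeline would compute the fill value on the training split only, to avoid leaking information into the evaluation data.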

Results: CardioNet is a comprehensive database that can serve as a training set for AI models and assist in all aspects of clinical management of CVDs. It comprises information extracted from EHRs and results of readings of CVD-related digital tests. It consists of 27 tables, a code-master table, and a descriptive table.
To predict hospital discharge, we experimented with 5 ML-based models using 5-fold cross-validation. Extreme gradient boosting (XGB), which was selected as the final model, achieved an average area under the receiver operating characteristic curve (AUROC) of 0.865, higher than that of the other models. Furthermore, we performed feature reduction, represented the feature importance, and assessed the prediction outcomes. One of the outcomes, the individual explainer, provides the medical team and patients with a discharge score during hospitalization and a daily feature influence score. Finally, we visualized a simulated bed-management scenario to show how the outcomes could be used.
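For readers unfamiliar with the metric, the following minimal sketch shows how an average AUROC over cross-validation folds can be computed. It is not the thesis code: it assumes the scores are already out-of-fold predictions, uses contiguous folds, and the toy labels and scores are invented for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a random positive is scored above a random
    negative, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_cv_auroc(labels, scores, k=5):
    """Average the per-fold AUROC over k held-out folds."""
    folds = kfold_indices(len(labels), k)
    per_fold = [auroc([labels[i] for i in f], [scores[i] for i in f]) for f in folds]
    return sum(per_fold) / k


# Toy data: 1 = discharged within the horizon, 0 = not discharged.
labels = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1, 0.4, 0.4, 0.7, 0.3]
average_auroc = mean_cv_auroc(labels, scores, k=5)
```

In practice each fold's scores would come from a model trained on the other four folds, and the folds would usually be stratified so that every fold contains both classes.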
To predict eGFR and thereby help protect patients with HF from risk, we pre-processed the data to create a sequential learning dataset and developed a DL-based model built on recurrent neural networks. The predictive model takes 24 hours of data as input, predicts eGFR levels after 12, 24, 36, and 48 hours, and predicts one of five risk labels. Our DL-based model achieved a mean squared error of 169.626, a mean absolute error of 5.82, and a classification accuracy of 85.1%. Subsequently, we visualized the model outcomes, including overall eGFR graphs and separate graphs for each time step.
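The sequential setup described above (a 24-hour input window with targets 12, 24, 36, and 48 hours later) and the two reported error metrics can be sketched as follows. This is an illustrative sketch, not the thesis implementation: it assumes one eGFR value per hour, and the five risk labels are not defined in the abstract, so the cut-offs in `RISK_THRESHOLDS` are hypothetical.

```python
INPUT_HOURS = 24
HORIZONS = (12, 24, 36, 48)  # hours after the end of the input window

# Assumed cut-offs for five risk labels; NOT taken from the thesis.
RISK_THRESHOLDS = (90, 60, 30, 15)


def make_windows(hourly_egfr):
    """Pair each 24-hour input window with the eGFR at every horizon."""
    samples = []
    last_start = len(hourly_egfr) - INPUT_HOURS - max(HORIZONS)
    for start in range(last_start + 1):
        end = start + INPUT_HOURS  # index one past the last input hour
        inputs = hourly_egfr[start:end]
        targets = [hourly_egfr[end - 1 + h] for h in HORIZONS]
        samples.append((inputs, targets))
    return samples


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def risk_label(egfr):
    """Map an eGFR value to a label from 0 (highest eGFR) to 4 (lowest)."""
    return sum(egfr < t for t in RISK_THRESHOLDS)


# Toy hourly series, used only to show the window shapes.
series = [float(v) for v in range(100)]
samples = make_windows(series)
first_inputs, first_targets = samples[0]
```

A series shorter than 72 hours (24 input hours plus the 48-hour horizon) yields no samples in this sketch. Real EMR measurements are irregularly spaced, so resampling to a regular grid would be needed before windowing.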

Conclusions: First, we established a comprehensive database specialized in CVDs. We are actively supporting multi-center research, which may require further data processing depending on the subject of the study. CardioNet will serve as a fundamental database for future CVD-related research projects. Second, we proposed an individual explainer based on an ML-based predictive model, which provides the discharge probability and the relative contributions of individual features. Our model can assist medical teams and patients in identifying individual and common risk factors in CVDs, and can support hospital administrators in improving the management of hospital beds and other resources. Third, we devised an effective pre-processing method for generating sequential data from EMRs, and developed a DL-based predictive model that provides the eGFR value and the associated risk for inpatients with HF. Additionally, we presented overall graphs and separate graphs for each time step, which could support the medical team and patients in managing the risk of HF and CVDs in advance.

Keywords: electronic medical records, cardiovascular diseases, artificial intelligence, database, hospital discharge prediction, risk prediction.
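The volume of clinical data stored in medical institutions has recently been growing rapidly with advances in healthcare-related technologies. Electronic medical records (EMRs) are one type of clinical data and contain a wide range of patient care records. These patient records previously could not be used because of privacy protection, but de-identification processes such as pseudonymization and anonymization have made retrospective research possible. Retrospective studies using EMRs can support a variety of risk prediction studies and can serve as real-world evidence.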
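Cardiovascular disease (CVD) is among the acute and chronic diseases that are accompanied by multiple comorbidities. It requires continuous and active management, and many artificial intelligence-based studies are being conducted to this end. To support such patient management, we planned the following three studies. First, we aimed to build a CVD-specialized database that can be used continuously in future CVD research. Second, we developed a machine learning-based model to carry out CVD-related prediction studies using the database we built. This model predicts the discharge of inpatients, with the aim of supporting efficient use of hospital resources. Third, using the same database, we developed a deep learning-based model that predicts the estimated glomerular filtration rate (eGFR), with the aim of detecting heart failure risk in inpatients.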
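First, clinical data related to CVDs, which require active management, include basic outpatient and inpatient data as well as a variety of specialized tests such as echocardiography and exercise stress tests. To support medical informatics research on CVDs by integrating these diverse structured and unstructured data, we built a CVD-specialized database. We extracted anonymized data, removed outliers and erroneous data according to clinically acceptable criteria, and used natural language processing to structure unstructured data such as free-text reading reports. The resulting database can increase the usefulness of EMR analysis and can support secondary, derivative studies.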
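Second, we extracted data on CVD-related inpatients from the CVD-specialized database and conducted a machine learning-based prediction study. This study predicted the likelihood of discharge for patients hospitalized with CVDs, visualized the risk factors and discharge likelihood for each patient's individual admission through a personalized explainer, and used simulated data to show that discharge prediction can help improve hospital processes.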
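Third, from the CVD-specialized database we extracted data on patients hospitalized with heart failure, and conducted a study to develop a deep learning-based model that predicts eGFR and thereby indicates disease risk. In this study, we converted EMRs into time-series training data that a deep learning-based model can learn from, and developed a predictive model that predicts eGFR and warns of risks such as a decline in eGFR in patients with heart failure and chronic kidney disease. In addition, we presented visualizations that make the eGFR at each reference time point intuitive to grasp.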
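In conclusion, we built a CVD-specialized database to carry out retrospective studies using EMRs efficiently, and conducted a variety of prediction studies through the development of machine learning- and deep learning-based models. If this work is extended to develop more sophisticated models, we believe it can contribute to improving the quality of medical care and to realizing personalized digital healthcare.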
The first study was published in BMC Medical Informatics and Decision Making on January 28, 2021, under the title “CardioNet: a manually curated database for artificial intelligence-based research on cardiovascular diseases”, and the second study was published in JMIR Medical Informatics on November 17, 2021, under the title “Machine Learning–Based Hospital Discharge Prediction for Patients With Cardiovascular Diseases: Development and Usability Study”.