Real-World Evidence (RWE) from EMR data and Development of Medical Artificial Intelligence Models
- Alternative Title
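- EMR 데이터를 활용한 RWE 임상연구수행 및 의료 인공지능 모델의 개발과 활용 (Conducting RWE Clinical Research Using EMR Data and Developing and Applying Medical Artificial Intelligence Models)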
- Abstract
- Background: With recent advances in healthcare technology, electronic medical record (EMR) data have accumulated rapidly across medical institutions. Real-world evidence (RWE) research that leverages anonymized EMR data uses actual patient data to identify underlying factors, relationships, and predictive risk factors. Cardiovascular disease (CVD) is one of the leading causes of death worldwide, and an elevated lipoprotein(a) (Lp(a)) level is a major contributor to the risk of CVD-related events. Moreover, although the one-hot encoding (OHE) method is commonly used to process EMR data, EMR data are recorded predominantly as unstructured text, and extracting information from this text has become increasingly important. Recent advances in natural language processing (NLP) technology and word embedding methods have proven valuable, addressing limitations of previous research by taking patients' treatment methods into account. This research applies data analytics techniques to EMR data and is expected to open new opportunities for healthcare innovation, with the potential to improve disease prevention and patient care.
Objectives: This study has two primary objectives. The first is to use EMRs to conduct RWE research grounded in actual patient data. Specifically, it aims to elucidate the relationship between Lp(a) levels and cardiovascular outcomes in a cohort of high-risk CVD patients and to predict patient risk factors.
Consequently, we conduct a comprehensive study that estimates clinical characteristics and cardiovascular outcomes in relation to Lp(a) levels, using EMR data from individuals in Korea with a history of high-risk atherosclerotic cardiovascular disease (ASCVD). The second objective is to improve the performance of medical artificial intelligence (AI) models. To this end, the study develops and validates a code embedding methodology that captures patients' diagnostic patterns from EMR data.
Methods:
Chapter 1. Epidemiology of lipoprotein(a) and the risk of MACE in ASCVD patients in South Korea: This study included adult patients with ASCVD who visited Asan Medical Center in Seoul, South Korea, and underwent Lp(a) testing between January 1, 2001, and December 31, 2020. The data collected from ABLE were anonymized. Structured data included patients' basic information (age, height, weight, BMI), blood pressure measurements, admission and discharge dates, and visit type. Unstructured data included surgical details, test results and their interpretation, smoking status, and medication-related information. For patient-related variables (age, gender, BMI, smoking status), data from the same date as, or the closest date before, the first Lp(a) measurement were used. For laboratory test results (blood pressure, cholesterol level, etc.), values measured within one year of the index date were selected. The primary endpoint consisted of myocardial infarction, ischemic stroke, and all-cause mortality, while secondary endpoints included additional events such as hospitalization for unstable angina.
Chapter 2.
Cognizant Embeddings of ICD Codes via BERT: Leveraging Patient Diagnostic Patterns from a Large-scale Cardiovascular EMR Repository: In this study, we fine-tuned a pre-trained BERT masked language model (MLM) on diagnostic codes and proposed a methodology that improves the model's learning performance by reducing the dimensionality of the codes. Code embedding is a preprocessing technique that converts diagnostic codes into numerical vectors, much as word embedding does for words, so that they can be used to train an AI model. The data comprised patient records from Asan Medical Center in Seoul between January 1, 2000, and December 31, 2019, yielding a total of 1,052,890 patients, including 572,811 from the CardioNet DB and 480,040 newly extracted patients. The extracted data included visit and discharge records, medications, diagnosis codes, and diagnosis dates. As a preprocessing step, each patient's diagnosis codes were converted into a single code sequence so that each visit is captured as a unit. Partial sequences were then pre-generated from the data to fit the BERT MLM framework. We systematically varied model dimensions and embedding pooling strategies to evaluate the model's efficiency. To evaluate the effectiveness of the code embedding method, we assessed the impact of different code subsequence alignment methods on the BERT MLM model's performance. In addition, we used an XGB model to predict subsequent heart disease, which allowed a direct comparison of the OHE method with the code embedding method. Finally, we used the t-SNE algorithm to visualize whether the model using the code embedding method captured the relationships between diagnosis codes.
Results:
Chapter 1.
Epidemiology of lipoprotein(a) and the risk of MACE in ASCVD patients in South Korea: The study analyzed data from a final study population of 27,686 individuals who underwent Lp(a) testing between 2000 and 2020. Participants were divided into quintiles (Q1 to Q5) based on their Lp(a) levels, and significant differences were observed between the highest and lowest quintiles. The highest Lp(a) group (Q5) tended to be older, to have a history of ASCVD (excluding stable angina), and to show higher total cholesterol and LDL-C levels. The 10-year cumulative incidence of MACE was approximately 29.5% in the entire cohort with a history of ASCVD. Kaplan-Meier curves demonstrated that the absolute risk of MACE recurrence increased with higher Lp(a) levels over a 10-year follow-up period.
Chapter 2. Cognizant Embeddings of ICD Codes via BERT: Leveraging Patient Diagnostic Patterns from a Large-scale Cardiovascular EMR Repository: The proposed frequency-based sorting method outperformed alternative sorting approaches, reducing the loss by more than 0.1. Furthermore, a code embedding model trained in a 128-dimensional space showed strong predictive performance for I50 (congestive heart failure), achieving an AUROC of approximately 0.997. In contrast, the XGB model using the OHE approach yielded a lower AUROC of 0.840, indicating suboptimal performance for this specific task. Despite a substantial reduction in dimensionality, the 'Code Embedded XGB Models' achieved an AUC of 0.96, approximately 0.1 higher than the 'OHE XGB Models'.
In particular, in real-world clinical prediction of MACE within one year for patients who underwent PCI or CABG, our embedding method reduced the number of dimensions by about 96.5% compared with OHE and improved disease prediction performance by approximately 6%. Additionally, t-SNE visualizations confirmed that related diagnostic codes were located close together in the two-dimensional projection, with diseases belonging to the same clinical group tending to cluster.
Conclusions: This study highlights the importance of adapting to the growing volume of medical data and to emerging healthcare technologies, to the benefit of disease prediction and treatment for real patients. First, the study constructed a patient cohort based on Lp(a) levels using EMR data from Korean ASCVD patients. It used various statistical methods to validate the correlation between Lp(a) levels and the occurrence of recurrent MACE, providing insight into critical clinical characteristics of individuals diagnosed with ASCVD. Second, it proposed a methodology to improve medical AI model performance by using EMR data to develop a code embedding model. These findings suggest that our code sequence alignment method better captures important patient information in EMRs through NLP-based context embedding and strengthens models that identify associations with clinical diseases from inputs such as diagnostic codes or medications. Finally, the study underscores opportunities for disease prevention and treatment through RWE research using EMRs. The approach can also be applied to various clinical studies by integrating other unstructured EMR data. These results indicate that medical code embedding, which can integrate multiple data sources, is applicable to various medical prediction models, and they underscore the potential for risk prediction using real-world datasets.
Future research is expected to broaden and enhance medical AI models built on EMRs, paving the way for more advanced pretrained language models that leverage extensive EMR text data.
Keywords: electronic medical records, lipoprotein(a), ASCVD, cardiovascular diseases, MACE, ICD-10, diagnosis, natural language processing, code embeddings, BERT, one-hot encoding, representation learning.
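The contrast between one-hot encoding and code embedding described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the five-code vocabulary, the 4-dimensional random embedding table, mean pooling, and the helper names `one_hot_features` and `embedded_features` are all hypothetical choices made for the example. In the study the vectors come from a fine-tuned BERT MLM model, and several pooling strategies were explored.

```python
import numpy as np

# Hypothetical toy vocabulary of ICD-10 codes (a real vocabulary is far larger).
VOCAB = ["I10", "I20", "I21", "I50", "E11"]
CODE_TO_INDEX = {code: i for i, code in enumerate(VOCAB)}

# Stand-in embedding table: random vectors, NOT BERT-derived embeddings.
EMBED_DIM = 4
rng = np.random.default_rng(seed=0)
EMBEDDINGS = rng.normal(size=(len(VOCAB), EMBED_DIM))


def one_hot_features(codes):
    """Multi-hot vector with one column per code in the vocabulary."""
    features = np.zeros(len(VOCAB))
    for code in codes:
        features[CODE_TO_INDEX[code]] = 1.0
    return features


def embedded_features(codes):
    """Mean-pool the embedding vectors of a patient's codes (assumed pooling)."""
    indices = [CODE_TO_INDEX[code] for code in codes]
    return EMBEDDINGS[indices].mean(axis=0)


patient_codes = ["I10", "E11", "I50"]
ohe = one_hot_features(patient_codes)
emb = embedded_features(patient_codes)

# The OHE width grows with the vocabulary; the embedded width stays fixed.
print(ohe.shape, emb.shape)
reduction = 1 - emb.shape[0] / ohe.shape[0]
print(f"dimension reduction: {reduction:.1%}")
```

In this toy setting the reduction is small because the vocabulary has only five codes. With a vocabulary of thousands of diagnosis codes and a fixed embedding width, the feature vector passed to a downstream classifier becomes much narrower than its one-hot counterpart, which is the kind of effect the abstract reports.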
- Author(s)
- 김민경
- Issued Date
- 2024
- Awarded Date
- 2024-02
- Type
- Dissertation
- URI
- https://oak.ulsan.ac.kr/handle/2021.oak/12981
http://ulsan.dcollection.net/common/orgView/200000734602
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.