Ulsan Univ. Repository Thesis General Graduate School Medical Engineering 1. Theses(Master)

A Generative Adversarial Network with Local Differential Privacy for Patient Data Generation

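Abstract: Electronic medical records (EMRs) are a type of medical data that capture a patient's health status, treatment results, prescription information, and more. Because they hold extensive information about patients, they can be used in many ways and have the potential to improve the quality of care on several fronts. In particular, as machine learning, which has advanced greatly in recent years, is introduced into medicine, EMRs are being used more and more. However, EMRs contain a great deal of sensitive personal information, which makes them difficult to collect, use, and share. This hinders research on EMRs and lowers their usability. A generative model can be one solution in such cases. A generative model imitates real data to produce similar synthetic data. Using the synthetic data produced by such a model makes it possible to avoid the constraints attached to personal information.
There are many kinds of generative models, but those based on deep learning have recently drawn the most attention. Deep learning generative models have advanced considerably in the image domain and can now generate high-resolution images whose authenticity is hard to judge with the human eye. They have also been applied to medical data, where they were able to generate clinically meaningful data. Although deep learning generative models perform well, they do not completely solve the personal information problem. Several studies have examined attacks on deep learning models and found that training data can be inferred from a model's output values. This means that a privacy risk remains even when a deep learning generative model is used, and that the model itself needs protection when it is used for the purpose of protecting personal information.
This study aims to develop a deep learning generative model that is safe from membership inference attacks. To this end, we used a generative adversarial network (GAN), one kind of deep learning generative model. We used WGAN-GP, a type of GAN, as the base model and combined it with differential privacy to protect privacy. Differential privacy protects privacy through mathematically designed noise and uses ε, a parameter related to the strength of the noise, to control the trade-off between utility and the level of privacy protection. Among the forms of differential privacy, this study adopted local differential privacy and developed a method that trains the model only on perturbed data. Because training uses only perturbed data, the original data can be strongly protected from attacks on the model.
The performance of the model trained in this way was evaluated separately in terms of utility and privacy. Both evaluation metrics changed meaningfully with ε, and we showed that an optimal model can be obtained by appropriately adjusting the trade-off between the two metrics. These experimental results mean that applying an appropriate amount of noise can protect the training data from attacks on the model. The results of this study are expected to resolve, to some extent, the constraints that arise from the personal information problem of EMRs.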
Electronic medical records (EMRs) are a type of medical data containing a patient's health condition, treatment results, and prescription information. Because they hold a great deal of information about patients, they can be used in various ways and have the potential to improve the quality of medical care. In particular, machine learning, which has recently made great progress, has been introduced into the medical field, increasing the use of EMRs. However, EMRs contain a large amount of sensitive personal information about patients, which makes them difficult to collect, use, and share. These characteristics make it difficult to study and use EMRs. A generative model can be one solution to these difficulties.
A generative model is a model that generates synthetic data similar to real data. Using the synthetic data produced by such a model makes it possible to avoid restrictions related to personal information. Although there are many types of generative models, those based on deep learning have recently attracted the most attention. Deep learning generative models have made great strides in the image domain and can generate high-resolution images whose authenticity is difficult to judge with the human eye. They have also been applied to medical data and were able to generate clinically meaningful data. Although deep learning generative models perform well, they do not completely solve the personal information problem. Several studies have dealt with attacks on deep learning models and found that training data can be inferred from the output values of a model. This indicates that a privacy risk remains even when a deep learning generative model is used, and that the model itself must be protected if it is used for the purpose of protecting personal information.
The objective of this study is to develop a deep learning generative model that is safe from membership inference attacks. To achieve this objective, we used WGAN-GP, a type of generative adversarial network (GAN), as the base model and adopted differential privacy to protect privacy. Differential privacy protects privacy through mathematically designed noise and uses ε, a parameter related to noise intensity, to adjust the trade-off between utility and the level of privacy protection. In this study, we adopted local differential privacy, one form of differential privacy, and developed a method that trains the model using only perturbed data. Because training is performed only on perturbed data, the original data can be strongly protected from attacks on the model.
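As an illustration of what training only on perturbed data can mean, the following is a minimal sketch of local differential privacy (LDP) on binary features. A mechanism M satisfies ε-LDP if Pr[M(x) = y] ≤ e^ε · Pr[M(x') = y] for every pair of inputs x, x' and every output y. The abstract does not state which perturbation mechanism the thesis uses, so binary randomized response is assumed here purely for illustration, and the function names (keep_probability, perturb_record, estimate_frequency) are hypothetical.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)


def perturb_record(record, epsilon, rng):
    """Keep each binary feature with probability p, flip it otherwise.

    Assumption: epsilon is the budget per feature. A record with d features
    perturbed this way consumes d * epsilon under basic composition.
    """
    p = keep_probability(epsilon)
    return [bit if rng.random() < p else 1 - bit for bit in record]


def estimate_frequency(perturbed_bits, epsilon):
    """Debiased estimate of the true fraction of 1s (requires epsilon > 0)."""
    p = keep_probability(epsilon)
    observed = sum(perturbed_bits) / len(perturbed_bits)
    # E[observed] = f * p + (1 - f) * (1 - p), solved for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for one binary EMR feature with a true rate near 0.3.
    true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(5000)]
    for eps in (0.5, 2.0, 8.0):
        noisy = [perturb_record([b], eps, rng)[0] for b in true_bits]
        print(eps, round(estimate_frequency(noisy, eps), 3))
```

Smaller ε pushes the keep probability toward 0.5, so each reported bit reveals less about the patient and the recovered statistics become noisier. This is the utility and privacy trade-off that ε controls. In a setup like the one described, the generative model would see only the output of perturb_record and never the original records.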
The performance of the model trained in this way was evaluated in terms of utility and privacy. Both evaluation metrics changed meaningfully with ε, and it was shown that an optimal model can be obtained by appropriately adjusting the trade-off between the two metrics. The results of this experiment indicate that the training data can be protected from attacks on the model if appropriate noise is applied. Based on these findings, the limitations caused by personal information issues in EMRs are expected to be resolved to some extent.
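The privacy side of such an evaluation can be sketched with a simple membership inference attack on synthetic records. The abstract does not describe the attack or the metrics actually used, so the version below is an assumption: a distance-based attack that guesses "member" whenever some synthetic record lies within a Hamming-distance threshold of the candidate record, scored by precision and recall. All names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary records differ."""
    return sum(x != y for x, y in zip(a, b))


def infer_membership(candidates, synthetic, threshold):
    """Guess 'member' when the nearest synthetic record is within threshold."""
    return [
        min(hamming(c, s) for s in synthetic) <= threshold
        for c in candidates
    ]


def precision_recall(predicted, actual):
    """Precision and recall of the attacker's membership guesses."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data: the first candidate was in the training set, the others were not.
    synthetic = [[1, 0, 1, 1], [0, 0, 0, 1]]
    candidates = [[1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
    is_member = [True, False, False]
    for threshold in (0, 1):
        guesses = infer_membership(candidates, synthetic, threshold)
        print(threshold, precision_recall(guesses, is_member))
```

Under this kind of attack, a model offers better protection when the attacker's precision and recall stay close to what random guessing would achieve. Repeating the measurement for models trained at several values of ε, alongside a utility metric, is one way to trace the trade-off curve.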