Cancer Risk Prediction and Interpretation from Electronic Health Records

Author: Zhao Li, M.Eng. (2023)

Primary advisor: W. Jim Zheng, PhD

Committee members: Hua Xu, PhD; Xiaoqian Jiang, PhD

PhD thesis, The University of Texas School of Biomedical Informatics at Houston.


Cancer, a leading cause of death, is a complex disease with many different forms. Cancer risk prediction is important for identifying high-risk individuals at an early stage, enabling targeted screening and interventions that reduce mortality and improve clinical outcomes. Electronic health records (EHR) provide a vast amount of data for predictive model development, allowing the detection of subtle patterns and associations that may not be apparent in smaller datasets. Various approaches, including statistical and artificial intelligence models, have been developed to predict an individual's cancer risk from these data resources.

Methods such as joint statistical modeling for dynamic prediction of disease risk can provide critical information for prognosis and clinical decision making, but they are limited in their ability to handle the large amounts of rich clinical data in electronic health record systems. Machine learning models and, more recently, deep learning models have been used for disease prediction on large-scale structured EHR data and have achieved great improvements in performance thanks to complex architecture design and efficient feature representation learning. However, the accuracy of deep learning models on many disease prediction problems is affected by many factors, including time-varying covariates, rare incidence, and covariate imbalance.

This dissertation focuses on disease risk prediction for early cancer detection and diagnosis, using both advanced statistical methods and deep learning models on structured EHR data. I first investigated the extent to which time-varying covariates, rare incidence, and covariate imbalance influence deep learning performance, and then devised strategies to tackle these challenges. Applied to hepatocellular carcinoma risk prediction among patients with non-alcoholic fatty liver disease (Chapter 2), these novel strategies can significantly improve prediction performance. Furthermore, they can be generalized to other disease risk predictions using structured electronic health records, especially predicting the risk of one disease conditional on another.

In addition, I employed explainable machine learning approaches to predict the onset risk of brain metastasis in lung cancer patients and to identify clinical features important for brain metastasis development (Chapter 3). To the best of my knowledge, this is the first study to predict brain metastasis using structured EHR data. The models achieved decent prediction performance and identified factors highly relevant to brain metastasis development, even though many important pieces of information, such as cancer staging and imaging, are missing from the EHR.

To enable advanced statistical models for dynamic disease risk prediction to handle the large amounts of rich clinical data in EHR, I also worked closely with biostatisticians to develop novel data science methodologies that use GPUs and deep learning libraries to accelerate a novel two-stage estimation procedure for joint modeling. This study demonstrates the enabling power of modern GPU technology and deep learning platforms for statistical methods that analyze large amounts of individual-level clinical data for cancer risk prediction (Chapter 4).

Overall, this dissertation presents comprehensive and systematic research on cancer risk prediction from EHR. These studies contribute significant insights into brain metastasis development and hepatocellular carcinoma progression. Furthermore, the methods proposed in this dissertation offer valuable strategies for addressing the challenges of applying deep learning models and advanced statistical models to structured EHR data.