Automating Construction of Fair Models for HIV Risk Prediction Using Neighborhood Data

Author: Sarah May, BS, MS, MPH (2023)

Primary advisor: Assaf Gottlieb, PhD

Committee members: Thomas Giordano, MD, MPH; Elmer Bernstam, MD, MSE

PhD thesis, McWilliams School of Biomedical Informatics at UTHealth Houston.


Despite significant improvements in HIV diagnosis, treatment, and prevention over the last four decades, the HIV epidemic remains a continuing problem in the United States. The Centers for Disease Control and Prevention (CDC) estimates that approximately 1.2 million people in the US were living with HIV in 2019. This includes an estimated 13% who are undiagnosed and unaware of their infection. Testing and diagnosing these individuals is of paramount importance for reaching the goal of ending the HIV epidemic in the US by 2030, as they account for 40% of new HIV infections. An additional tool for reaching these public health goals is pre-exposure prophylaxis (PrEP). However, despite the availability and effectiveness of PrEP, uptake remains slow, with only 23% of eligible people having a current prescription.

Machine learning-based prediction models have been shown to be useful for identifying patients at high risk for HIV infection, alerting clinicians to those who require HIV testing and, depending on the test result, treatment or prevention. Although these models have performed well in general, they are developed using hand-selected features, which require clinical input and extensive data preprocessing. This process can be time-intensive and can limit generalizability if the selected features are not common across clinical datasets. Additionally, these models perform poorly when predicting HIV risk in females.

There is evidence that including neighborhood-level factors and social determinants of health (SDOH) in predictive models can help in 1) improving model performance in some sub-populations and 2) reducing disparities between sub-populations. However, this is impossible when using de-identified data that cannot be linked to other data sources. The goal of my research was twofold: to develop an automated pipeline for building predictive models of HIV risk that addresses populations, such as females, in which models have historically performed poorly, and to develop a method for incorporating neighborhood factors into de-identified data based on the distributions of these factors in fully identified datasets.

In Aim 1, I developed a phenotyping algorithm that identifies people with HIV in clinical datasets, allowing for the construction of a reliable cohort on which to develop my models. In Aim 2, I developed and evaluated an automated pipeline to train predictive models for HIV risk. The resulting models performed as well as previous models that used hand-selected features, and they performed well in females, which previous models did not. Finally, in Aim 3, I developed a novel method to introduce neighborhood-level factors into de-identified datasets from information that can be linked to fully identified data. I demonstrated that the neighborhood-level factors replace demographic features and broad clinical diagnoses among the top features while preserving model performance. The neighborhood factors driving the predictions can give clinicians, public health workers, and policy makers modifiable targets for reducing HIV risk in certain individuals and populations.