Author: Evan Yu, MSE (2026)
Primary advisor: Degui Zhi, PhD
Committee members: Cui Tao, PhD; Ziqian Xie, PhD; and Kayo Fujimoto, PhD
PhD thesis, McWilliams School of Biomedical Informatics at UTHealth Houston.
ABSTRACT
Biomedical research increasingly relies on large and heterogeneous data sources to support scientific inference. However, relevant evidence is often distributed across environments that differ substantially in structure, reliability, and scale. Biomedical evidence may be represented in relational networks that encode interactions among entities and in large-scale text corpora that contain unstructured knowledge and discourse. Developing computational approaches capable of extracting reliable and interpretable signals from these diverse data environments remains a central challenge. This dissertation investigates how graph and language models can support robust biomedical inference across three evidence environments: relational epidemiologic networks, large-scale social media discourse, and biomedical scientific literature.
The first study evaluates graph neural networks (GNNs) for predicting HIV infection risk in social and sexual networks of younger sexual minority men. On data from cohorts in Chicago and Houston, graph attention networks achieve strong predictive performance and remain robust under cross-city transfer, while explainability methods identify specific risk factors consistent with epidemiologic knowledge. The second study examines classification of HIV-related discourse on Twitter to support public health surveillance. Graph-based and pretrained language models are evaluated across independently annotated datasets, highlighting the challenges of cross-dataset generalization and demonstrating the advantages of pretrained language models for text analysis. The third study investigates knowledge-augmented large language models (LLMs) for evaluating gene-disease associations, integrating biomedical literature retrieval with structured knowledge graphs to provide evidence-grounded assessments of biological relationships.
Together, these studies demonstrate how graph-based learning and pretrained language models can support biomedical inference across heterogeneous data environments. By emphasizing interpretability, cross-dataset generalization, and the integration of structured and unstructured evidence, this work contributes methodological approaches for developing more reliable, evidence-grounded AI systems in biomedical research.