Author: Yifang Dang, MS (2025)
Primary advisor: Licong Cui, PhD
Committee members: Xiaoqian Jiang, PhD; Cui Tao, PhD; and Hua Xu, PhD
PhD thesis: McWilliams School of Biomedical Informatics at UTHealth Houston.
ABSTRACT
Social Determinants of Health (SDoH) significantly influence health outcomes, yet their representation in computational models remains fragmented. This dissertation addresses this gap by constructing an SDoH ontology (SDoHO), leveraging it to improve large language model (LLM) performance, and using LLMs to extract new knowledge and enrich the ontology. The overarching goal is to establish a feedback loop in which ontology development enhances LLM-based extraction, and LLM-derived insights refine and expand the ontology. The methodology is structured across three aims: (1) ontology construction, (2) ontology-assisted LLM enhancement, and (3) LLM-driven ontology enrichment, with a focus on Alzheimer’s Disease and Related Dementias (ADRD).
Aim 1 involves the development of SDoHO, a structured knowledge representation of key SDoH factors. Represented in OWL2, SDoHO is a comprehensive ontology comprising 708 classes, 106 object properties, and 20 data properties, with 1,561 logical axioms and 976 declaration axioms. The ontology features a well-defined class hierarchy spanning six levels, with a primary top-level structure composed of nine categories. This structured representation provides a foundation for standardizing and organizing SDoH knowledge, facilitating improved automated knowledge extraction and integration with downstream applications.
Aim 2 examines the effectiveness of using ontology hierarchies to enhance LLM-based SDoH information extraction. This approach focuses on extracting 17 first-level SDoH labels and 45 second-level labels from MIMIC-III text data. Experimental results on 153 annotated files reveal that GPT-4 generally outperforms or is more stable than LLaMA 3, with F1 scores ranging from 0.48 to 0.64 across different methods. Notably, the hybrid method for GPT-4 attained an F1 score of 0.6364, while LLaMA 3 achieved the highest F1 score of 0.64. A false positive error management strategy led to maximum precision improvements of ~9% and F1 score gains of ~6%, highlighting its effectiveness in addressing a critical limitation of LLMs. These findings underscore the benefits and challenges of integrating structured knowledge with generative AI models for domain-specific concept extraction.
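To illustrate why managing false positives raises both precision and F1 while leaving recall unchanged, the following minimal sketch computes the three metrics from raw counts. The function name and all counts are hypothetical and chosen only for illustration; they are not the dissertation's data or its evaluation code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (illustrative only, not results from the dissertation).
before = precision_recall_f1(tp=60, fp=40, fn=40)

# Suppose a filtering step removes 10 false positives without losing true positives.
after = precision_recall_f1(tp=60, fp=30, fn=40)

print("before:", before)
print("after: ", after)
```

Because F1 is the harmonic mean of precision and recall, removing false positives alone improves F1 by less than it improves precision, which matches the direction of the precision and F1 gains reported above.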
Aim 3 extends SDoHO’s application to ADRD by using LLMs to extract SDoH-ADRD concepts and infer relationships from annotated literature. SDoH and ADRD concept extraction achieved F1 scores of 0.744 and 0.927, respectively, while SDoH concept pairing with relationship identification reached an F1 score of 0.914. Instance mapping achieved an F1 score of 0.69, identifying five unmatched instances that suggest potential new concepts for existing SDoH ontologies. Additionally, a focused analysis on suicidal ideation and SDoH-related factors within ADRD populations revealed overlooked social stressors, demonstrating the ontology’s potential for broader health informatics applications.
Overall, this dissertation demonstrates a possible bidirectional integration of ontology and LLMs, where ontologies structure and refine LLM outputs, and LLM-extracted knowledge feeds back to enhance the ontology. By bridging structured knowledge representation with generative AI, this work advances ontology-driven natural language processing, improving automated concept extraction, relationship inference, and domain-specific knowledge enrichment. Future directions include expanding evaluation datasets, refining ontology-informed prompting techniques, and assessing real-world applications in clinical decision support and public health policy.