
Harnessing Knowledge and Data for Clinical Information Extraction in the Era of Large Language Models

Author: Yan Hu, M.Res. (2025)

Primary advisor: Xiaoqian Jiang, PhD

Committee members: Hua Xu, PhD and Kirk Roberts, PhD

PhD thesis: McWilliams School of Biomedical Informatics at UTHealth Houston.

ABSTRACT

The rapid digitization of healthcare has led to the widespread adoption of Electronic Health Records (EHRs), which contain rich, unstructured clinical notes. These notes hold valuable information for patient care and clinical research, but their complexity and unstructured nature pose significant challenges to effective use. Clinical Information Extraction (IE) aims to bridge this gap by transforming unstructured text into structured data amenable to automated analysis. Traditional Natural Language Processing (NLP) techniques, such as Named Entity Recognition (NER), have been widely used for clinical IE, but recent advances in Large Language Models (LLMs) such as GPT-3.5, GPT-4, and LLaMA have opened new possibilities for improving these tasks.

This dissertation explores the integration of medical knowledge and data with LLMs to enhance clinical IE. It addresses three specific aims: (1) optimizing closed-source LLMs through task-specific prompt engineering, (2) enhancing open-source LLMs via instruction tuning, and (3) leveraging closed-source LLMs to generate synthetic NER data for augmenting the training of open-source LLMs.

In Aim 1, we propose a prompt engineering framework for closed-source LLMs, such as GPT-3.5 and GPT-4, to improve their performance on clinical NER tasks. By incorporating annotation guidelines, error analysis-based instructions, and few-shot learning, we demonstrate that GPT-4 achieves performance competitive with fine-tuned models such as BioClinicalBERT, particularly under relaxed-match criteria.
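To make the Aim 1 setup concrete, the sketch below shows one way such a prompt could be assembled with the OpenAI Python client. The guideline text, error-analysis notes, few-shot pair, and entity types are illustrative placeholders, not the dissertation's actual prompts.

```python
# Minimal sketch of the Aim 1 prompt-engineering setup, assuming the
# OpenAI Python client (>= 1.x). All prompt text below is a placeholder
# standing in for the real annotation guidelines and error analysis.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINES = (
    "Annotate PROBLEM, TREATMENT, and TEST entities following the "
    "corpus annotation guideline. Tag full noun phrases; exclude "
    "pronouns and generic mentions."
)

ERROR_NOTES = (
    "Common errors to avoid (from prior error analysis): do not tag "
    "negated findings as TREATMENT; include laterality modifiers "
    "inside the entity span."
)

FEW_SHOT = [
    # (input sentence, expected tagged output) pairs
    ("The patient was started on metformin for type 2 diabetes.",
     "[TREATMENT metformin] for [PROBLEM type 2 diabetes]"),
]

def extract_entities(sentence: str) -> str:
    """Ask the model to tag clinical entities in one sentence."""
    messages = [{"role": "system",
                 "content": f"{GUIDELINES}\n\n{ERROR_NOTES}"}]
    for text, tagged in FEW_SHOT:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": tagged})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0
    )
    return response.choices[0].message.content

print(extract_entities("She denies chest pain but reports dyspnea on exertion."))
```

Setting the temperature to 0 keeps the tagging deterministic, which matters when comparing prompt variants against the same gold annotations.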

Aim 2 focuses on open-source LLMs, specifically instruction-tuned LLaMA-2 and LLaMA-3 models. We develop a comprehensive cross-institutional annotated corpus and show that instruction-tuned LLaMA models outperform traditional BERT-based models, especially in low-resource and cross-institution settings. However, the LLaMA models require significantly more computational resources, highlighting the trade-off between performance and resource efficiency.
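As a concrete illustration of the instruction-tuning data preparation in Aim 2, the sketch below converts BIO-tagged NER annotations into instruction-response pairs of the kind used for supervised fine-tuning of LLaMA models. The instruction wording, entity types, and JSON schema are assumptions for illustration, not the corpus's exact format.

```python
# Sketch: turn BIO-annotated NER data into instruction-response pairs
# for supervised fine-tuning. The schema shown here is an assumed,
# simplified stand-in for the dissertation's actual format.
import json

INSTRUCTION = (
    "Extract all PROBLEM, TREATMENT, and TEST entities from the "
    "clinical text below and list them as 'type: span' lines."
)

def bio_to_entities(tokens, tags):
    """Collapse BIO tags into (entity_type, span_text) pairs."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((etype, " ".join(current)))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((etype, " ".join(current)))
            current, etype = [], None
    if current:
        entities.append((etype, " ".join(current)))
    return entities

def to_instruction_example(tokens, tags):
    """Build one instruction-tuning record from a tagged sentence."""
    entities = bio_to_entities(tokens, tags)
    return {
        "instruction": INSTRUCTION,
        "input": " ".join(tokens),
        "output": "\n".join(f"{t}: {s}" for t, s in entities),
    }

tokens = ["Chest", "x-ray", "showed", "bilateral", "infiltrates", "."]
tags = ["B-TEST", "I-TEST", "O", "B-PROBLEM", "I-PROBLEM", "O"]
print(json.dumps(to_instruction_example(tokens, tags), indent=2))
```

Casting NER as text generation in this way is what lets the same instruction-tuned model transfer across institutions whose corpora use different annotation conventions.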

In Aim 3, we investigate the use of closed-source LLMs, such as GPT-4o-mini, to generate high-quality synthetic medical data for augmenting the training of open-source LLMs. By implementing self-verification and semantic mapping techniques, we improve the quality of synthetic data, leading to consistent performance gains across diverse datasets. We also explore the optimal mix ratios of human-annotated and synthetic data, finding that a balanced 1:1 ratio generally yields the best results.
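The sketch below illustrates the shape of such a pipeline: filtering synthetic examples with a simple self-check and mixing them with human annotations at a configurable ratio. The verbatim-span check is a deliberately simplified stand-in for the dissertation's self-verification and semantic-mapping steps, and the example records are invented.

```python
# Sketch of the Aim 3 data pipeline: verify synthetic NER examples and
# mix them with human annotations at a 1:1 ratio. The verification rule
# here (each labeled span must occur verbatim in the text) is a
# simplified proxy; a real pipeline would also re-query the LLM.
import random

def self_verify(example):
    """Keep a synthetic example only if every labeled span occurs
    verbatim in its text."""
    return all(span in example["text"] for span, _ in example["entities"])

def mix_datasets(human, synthetic, ratio=1.0, seed=13):
    """Combine human and verified synthetic examples, adding `ratio`
    synthetic examples per human example (1.0 gives the 1:1 mix)."""
    verified = [ex for ex in synthetic if self_verify(ex)]
    n_synth = min(len(verified), int(len(human) * ratio))
    rng = random.Random(seed)
    mixed = human + rng.sample(verified, n_synth)
    rng.shuffle(mixed)
    return mixed

human = [{"text": "Started lisinopril for hypertension.",
          "entities": [("lisinopril", "TREATMENT"),
                       ("hypertension", "PROBLEM")]}]
synthetic = [{"text": "MRI revealed a small lacunar infarct.",
              "entities": [("MRI", "TEST"),
                           ("lacunar infarct", "PROBLEM")]}]
print(len(mix_datasets(human, synthetic)))  # 2: one human + one synthetic
```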

The findings of this dissertation demonstrate the transformative potential of LLMs in clinical IE, reducing the reliance on extensive annotated datasets while improving generalizability across institutions. By integrating domain-specific knowledge and advanced NLP techniques, this research provides a roadmap for developing scalable and efficient clinical IE solutions, paving the way for broader adoption in healthcare research and practice.