Author: Yumeng Yang, MS (2026)
Primary advisor: Kirk Roberts, PhD
Committee members: Elmer Bernstam, MD, MSE; Ethan Ludmir, MD; and Amy Moreno, MD, MS
PhD thesis, McWilliams School of Biomedical Informatics at UTHealth Houston.
ABSTRACT
Clinical trial enrollment remains a major challenge: low accrual continues to delay study completion, increase research costs, and slow the development of new therapies. One barrier is the scale of the trial-matching problem. Patients and clinicians must search through a large and growing number of studies, and each trial has complex eligibility criteria that may overlap only partially with those of other trials. Although public registries such as ClinicalTrials.gov make trial information available, they do not adequately help patients identify which studies are most relevant to their clinical profile. This dissertation addresses that information problem by developing and evaluating a patient-centric conversational system for interactive clinical trial prescreening.
This work makes three primary contributions. First, it develops and evaluates ClinicalTrialBERT, a domain-specific transformer language model further pretrained on 442,370 eligibility criteria sections from ClinicalTrials.gov. Across seven common exclusion categories, the model achieved strong classification performance, demonstrating that trial-specific pretraining improves the representation of clinical trial language. Second, it constructs three human-annotated benchmark datasets for the major components of a conversational trial-matching system: semantic clustering of eligibility criteria, generation of patient-facing questions, and assessment of patient answers against source criteria. These datasets provide a standardized framework for evaluating chatbot components and for identifying important limitations in current large language model performance, particularly in handling ambiguity and unknown cases. Third, it designs and evaluates a patient-centric clinical trial prescreening chatbot that integrates semantic clustering, patient-friendly question generation, criterion-level eligibility assessment, and dynamic trial elimination into an end-to-end workflow.
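The dynamic trial elimination step named above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the data model (a trial as a mapping from criterion-cluster id to the answer required to stay eligible), the function name `prescreen`, the trial ids, and the rule that an unknown answer never eliminates a trial are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical data model: each trial maps a criterion-cluster id to the
# answer a patient must give to remain eligible (True = yes, False = no).
Trial = Dict[str, bool]


def prescreen(trials: Dict[str, Trial],
              answers: Dict[str, Optional[bool]]) -> List[str]:
    """Return the sorted ids of trials that survive the patient's answers.

    Assumption: an answer of None means "unknown" and never eliminates a trial.
    """
    remaining = dict(trials)
    for cluster, answer in answers.items():
        if answer is None:
            continue  # unknown answers leave the candidate set unchanged
        # Keep trials that do not use this cluster or whose requirement is met.
        remaining = {
            trial_id: criteria
            for trial_id, criteria in remaining.items()
            if cluster not in criteria or criteria[cluster] == answer
        }
    return sorted(remaining)


# Illustrative (made-up) trials and answers.
trials = {
    "TRIAL_A": {"prior_chemo": False, "ecog_0_1": True},
    "TRIAL_B": {"prior_chemo": True},
    "TRIAL_C": {"ecog_0_1": True, "brain_mets": False},
}
answers = {"prior_chemo": False, "ecog_0_1": True, "brain_mets": None}

print(prescreen(trials, answers))  # prints ['TRIAL_A', 'TRIAL_C']
```

In this sketch one answer about a shared criterion cluster can remove several trials at once, which is the sense in which clustering reduces the number of questions a patient must answer.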
The chatbot was evaluated across six cancer types using synthetic patient profiles and human interactive testing. Results showed that the system substantially reduced the trial search space while maintaining strong matching performance, achieving an overall precision of 0.82, recall of 0.74, and F1 score of 0.78. The system outperformed both standard ClinicalTrials.gov keyword search and zero-shot prompting with a general medical large language model. Usability findings further indicated that users found the system understandable and received it well.
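As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The helper name `f1_score` below is an illustrative choice; the precision and recall values are the ones reported above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported overall precision and recall from the evaluation.
print(round(f1_score(0.82, 0.74), 2))  # prints 0.78
```

The computed value agrees with the reported F1 score of 0.78.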
Overall, this dissertation demonstrates that conversational AI can serve as a practical biomedical informatics approach to improving clinical trial accessibility. By combining domain-specific language modeling, benchmark dataset development, and patient-centric conversational screening, this work extends clinical trial informatics beyond backend matching tools toward systems that directly support patient understanding and self-screening.