Author: Xubing Hao (2025)
Primary advisor: Licong Cui, PhD
Committee members: Kirk Roberts, PhD and Cui Tao, PhD
PhD thesis: McWilliams School of Biomedical Informatics at UTHealth Houston.
ABSTRACT
Biomedical ontologies and terminologies not only form part of the metadata standards for describing data under the FAIR Data Principles (Findable, Accessible, Interoperable, Reusable), but also play a vital role in downstream applications such as cohort identification from electronic health records (EHRs). However, two critical barriers may lead to ambiguity, complexity, and inaccuracies in such ontology-based downstream applications. The first barrier is ontology quality: despite efforts by ontology curators to ensure accuracy and comprehensiveness, errors and inconsistencies are unavoidable. The second barrier is semantic heterogeneity, since human experts may use different natural language terms to define the same ontological entities. It is therefore critical to develop effective methods both for continually improving the quality of biomedical terminologies and for establishing meaningful connections between heterogeneous biomedical concepts. This dissertation introduces a substring replacement approach for identifying missing IS-A relations in biomedical terminologies, an order-preserving intersection method that leverages non-lattice subgraphs to detect missing concepts, and an approach based on a Graph Convolutional Network (GCN) and a Pre-trained Language Model (PLM) to identify synonymous concept pairs across different biomedical terminologies. Additionally, this work leverages large-scale EHR data to assess the impact of terminology quality on cohort identification applications.
The research presented in this dissertation addresses critical challenges in biomedical ontology quality and interoperability. In doing so, it enhances downstream ontology-driven biomedical applications, facilitates the clear exchange of health information, and ultimately supports more accurate and reliable clinical and research outcomes.