
Methods for Auditing and Enhancing Completeness of Ontologies

Contact Information
Licong Cui, Principal Investigator
School of Biomedical Informatics
The University of Texas Health Science Center at Houston
7000 Fannin St
Houston, TX 77030

Project Award Information
This website is based upon work supported by the National Science Foundation under Grant No. 1931134. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Project Summary
An ontology is a formal representation of knowledge as a set of concepts (terms) and the relationships between those concepts within a domain of specialty. Ontologies have been widely used for orchestrating the coding, management, exchange, and sharing of the increasingly large amounts of digital data produced by the biomedical research enterprise. For example, SNOMED CT, one of the largest and most complex biomedical ontologies, supports the development of high-quality electronic health records and facilitates patient subgroup identification, clinical decision support, and healthcare delivery quality measurement. Given the important roles that biomedical ontologies play, quality issues (such as incomplete coverage of subclasses), if not addressed, can affect the quality of diagnoses and decisions. However, incompleteness issues such as missing hierarchical relations and missing concepts cannot feasibly be addressed by manual work alone because of the size and complexity of biomedical ontologies. The goal of this project is to develop automated and scalable approaches for identifying potential incompleteness issues and suggesting solutions to fix them. This project will also incorporate the computational aspects of the proposed work into curriculum development and educational offerings related to data science, and promote women's participation in health data science.

To audit and enhance the completeness of ontologies, this project explores the following research tasks: (1) Development of a robust reasoning framework for detecting and repairing missing subclass or hierarchical relations. This will result in suggestions that directly enhance the subclass completeness of ontologies; (2) Development of novel methods for identifying missing concepts and creating appropriate name labels for the identified missing concepts. This will enhance the concept completeness of ontologies; (3) Generation of supporting evidence for suggested solutions by leveraging rich extrinsic knowledge. To further verify the robustness of the proposed approaches, domain experts will be involved in validating the discovered incompleteness issues. The proposed approaches are applied to three large ontologies in biomedicine: SNOMED CT, Gene Ontology, and NCI Thesaurus. Suggested ontology changes will be communicated to the respective ontology owners for incorporation in subsequent versions. The project website will include further information about this project and provide access to publications, software, datasets, and curriculum material.
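To make task (1) concrete, the following is a minimal sketch of one simple lexical heuristic for flagging candidate missing subclass (IS-A) relations. It is an illustration only, not the project's published method: it assumes that a concept whose name contains every word of another concept's name is a plausible subclass of it, and it reports such pairs when no IS-A path already connects them. The concept identifiers (`C1`, `C2`, `C3`), the `names` and `parents` dictionaries, and the function names are all hypothetical.

```python
from itertools import permutations


def ancestors(concept, parents):
    """Return all transitive IS-A ancestors of `concept`."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def candidate_missing_isa(names, parents):
    """Flag (child, parent) pairs suggested by word containment.

    Assumption (illustrative only): if the words of `parent` form a proper
    subset of the words of `child`, then `child` IS-A `parent` is plausible.
    Pairs already linked by an IS-A path are not reported.
    """
    words = {c: frozenset(n.lower().split()) for c, n in names.items()}
    candidates = []
    for child, parent in permutations(names, 2):
        if words[parent] < words[child] and parent not in ancestors(child, parents):
            candidates.append((child, parent))
    return sorted(candidates)


# Toy hierarchy with hypothetical identifiers; not real ontology content.
names = {
    "C1": "Neoplasm",
    "C2": "Lung Neoplasm",
    "C3": "Malignant Lung Neoplasm",
}
parents = {"C2": ["C1"], "C3": ["C1"]}  # C3 is not linked to C2

result = candidate_missing_isa(names, parents)
print(result)
```

In this toy hierarchy the only pair flagged is `C3` under `C2`. Any such candidate would still need supporting evidence and expert review, as described in task (3).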

  • A transformation-based lexical approach to detect missing subclass relations (code)
  • Deep learning approaches to predicting concept names (code)
  • A hybrid approach to detect missing subclass relations (code)
  • A sequence-based Formal Concept Analysis approach to detect missing concepts (code)
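
As a rough illustration of how Formal Concept Analysis (FCA) can surface missing concepts, the sketch below treats each existing concept as an FCA object and the words of its name as its attributes. In FCA, the intents of the formal concepts are exactly the intersections of the objects' attribute sets, so any non-empty intersection that does not match an existing concept's word set is a candidate missing concept. This is a simplified, set-based sketch under those assumptions: it ignores word order (which a sequence-based variant would preserve), does not generate name labels, and is not the project's implementation. The function names and the toy concept names are hypothetical.

```python
def closed_intents(word_sets):
    """Close a collection of word sets under pairwise intersection."""
    intents = set(word_sets)
    frontier = set(word_sets)
    while frontier:
        new = set()
        for a in frontier:
            for b in word_sets:
                common = a & b
                # Keep only non-empty intersections not seen before.
                if common and common not in intents:
                    new.add(common)
        intents |= new
        frontier = new
    return intents


def candidate_missing_concepts(concept_names):
    """Return word sets that are FCA intents but match no existing concept."""
    existing = {frozenset(n.lower().split()) for n in concept_names}
    missing = closed_intents(existing) - existing
    return sorted(tuple(sorted(m)) for m in missing)


# Toy concept names; not real ontology content.
concept_names = [
    "Neoplasm",
    "Benign Skin Neoplasm",
    "Malignant Skin Neoplasm",
]

missing = candidate_missing_concepts(concept_names)
print(missing)
```

Here the two skin-related concepts share the words "skin" and "neoplasm", and no existing concept has exactly that word set, so it is reported as a candidate (with its words in alphabetical order, since the sketch discards word order).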

Detection Results
  • Result for the transformation-based lexical approach (result)
  • Result for concept name prediction (result)
  • Result for the hybrid approach (result)
  • Result for the sequence-based Formal Concept Analysis approach (result)

Curriculum Materials
  • Fall 2018 - CS 626 Large Scale Data Science - Lecture on "Biomedical Ontology Quality Assurance" (slides and homework)

Publications
  1. Zheng F, Cui L. Exploring Deep Learning-based Approaches for Predicting Concept Names in SNOMED CT. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 808-813.
  2. Abeysinghe R, Zheng F, Hinderer EW, Moseley HNB, Cui L. A Lexical Approach to Identifying Subtype Inconsistencies in Biomedical Terminologies. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), The 1st International Workshop on Quality Assurance of Biological and Biomedical Ontologies and Terminologies, pp. 1982-1989.
  3. Sun Q, Zhang GQ, Zhu W, Cui L. Validating Auto-Suggested Changes for SNOMED CT in Non-lattice Subgraphs Using Relational Machine Learning. The 17th World Congress of Medical and Health Informatics (MedInfo 2019) 2019;264:378-82.
  4. Abeysinghe R, Brooks MA, Cui L. Leveraging Non-lattice Subgraphs to Audit Hierarchical Relations in NCI Thesaurus. AMIA Annual Symp Proc 2019, pp. 982-991. [Student Paper Award Finalist]
  5. Zheng F, Abeysinghe R, Cui L. A Hybrid Method to Detect Missing Hierarchical Relations in NCI Thesaurus. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) - The 2nd International Workshop on Quality Assurance and Enrichment of Biological and Biomedical Ontologies and Terminologies, pp. 1948-1953.
  6. Zhu W, Tao S, Cui L, Zhang GQ. Web-based Interactive Visualization of Non-Lattice Subgraphs in SNOMED CT. AMIA Jt Summits Transl Sci Proc 2020, pp. 740-749.
  7. Zhang GQ, Tao S, Zeng N, Cui L. Ontologies as Nested Facet Systems for Human-Data Interaction. Semantic Web, 2020;11(1):79-86.
  8. Cui L, Abeysinghe R, Zheng F, Tao S, Zeng N, Hands I, Durbin EB, Whiteman L, Remennik L, Sioutos N, Zhang GQ. Enhancing the Quality of Hierarchic Relations in the National Cancer Institute Thesaurus to Enable Faceted Query of Cancer Registry Data. JCO Clinical Cancer Informatics, 2020(4):392-398.
  9. Abeysinghe R, Hinderer EW, Moseley HNB, Cui L. SSIF: Subsumption-based Sub-term Inference Framework to Audit Gene Ontology. Bioinformatics, 2020;36(10):3207-3214.
  10. Zheng F, Abeysinghe R, Sioutos N, Whiteman L, Remennik L, Cui L. Detecting missing IS-A relations in the NCI Thesaurus using an enhanced hybrid approach. BMC Medical Informatics and Decision Making, 2020;20(Suppl 10):273.
  11. Zheng F, Shi J, Cui L. A lexical-based approach for exhaustive detection of missing hierarchical relations in SNOMED CT. AMIA 2020 Annual Symposium. 2020, pp. 1392-1401.
  12. Zheng F, Cui L. A Lexical-based Formal Concept Analysis Method to Identify Missing Concepts in the NCI Thesaurus. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2020, pp. 1757-1760.
  13. Zheng F, Shi J, Yang Y, Zheng WJ, Cui L. A Transformation-based Method for Auditing the IS-A Hierarchy of Biomedical Terminologies in the Unified Medical Language System (UMLS). Journal of the American Medical Informatics Association, 2020;27(10):1568-1575.
Last updated: August 2021