Author: Zhiguo Yu, MS (2017)
Primary advisor: Todd Johnson, PhD
Committee members: Elmer Bernstam, MD, MSE; Trevor Cohen, MBChB, MD, PhD; Cui Tao, PhD; Byron Wallace, PhD
PhD thesis, The University of Texas School of Health Information Sciences at Houston.
ABSTRACT
As the volume of data grows exponentially, extracting and understanding information, themes, and relationships from large collections of documents is becoming increasingly important to researchers in many areas. PubMed, which comprises more than 25 million citations, uses Medical Subject Headings (MeSH) to index articles in order to facilitate their management, searching, and indexing. However, researchers still find it challenging to locate, and then obtain a meaningful overview of, a set of documents in a specific area of interest. This is due in part to several limitations of MeSH terms, including the need to monitor and expand the vocabulary; the lack of concept coverage for newly developing areas; human inconsistency in assigning codes; and the time required to manually index an exponentially growing corpus. Another reason is that neither PubMed itself nor its related Web tools help users see high-level themes and hidden semantic structures in the biomedical literature.
Topic models are a class of statistical machine learning algorithms that, when given a set of natural language documents, extract the semantic themes (topics) from the set, describe the topics of each document, and characterize the semantic similarity of topics and documents. Researchers have shown that these latent themes can help humans better understand and search documents. Unlike MeSH terms, which are created based on important concepts throughout the literature, topics extracted from a subset of documents are specific to those documents. Thus, they can capture document-specific themes that may not exist among MeSH terms. Such topics may provide a subject-area-specific set of themes for browsing search results, as well as a broader overview of those results.
The first part of this dissertation presents the TopicalMeSH representation, which exploits the ‘correspondence’ between topics generated using latent Dirichlet allocation (LDA) and MeSH terms to create new document representations that combine MeSH terms and latent topic vectors. In an evaluation with 15 systematic drug review corpora, TopicalMeSH performed better than MeSH in both document retrieval and classification tasks. The second part of this work introduces the “Hybrid Topic”, an alternative LDA approach that uses a ‘bag-of-MeSH&words’ representation, instead of just ‘bag-of-words’, to test whether the addition of labels (e.g., MeSH descriptors) can improve the quality and facilitate the interpretation of LDA-generated topics. An evaluation of the quality and interpretability of topics in both a general corpus and a specialized corpus demonstrated that the coherence of ‘hybrid topics’ is higher than that of regular bag-of-words topics in both cases. The last part of this dissertation presents a visualization tool, based on the ‘hybrid topics’ model, that allows users to interactively use topic models and MeSH terms to efficiently and effectively retrieve relevant information from large sets of PubMed search results. A preliminary user study was conducted with 6 participants. All of them agreed that the tool can quickly help them understand PubMed search results and identify target articles.
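To illustrate the ‘bag-of-MeSH&words’ idea, the following minimal Python sketch shows one way a document's words and its MeSH descriptors might be merged into a single bag of tokens before being passed to LDA. The function names (mesh_token, hybrid_bag), the "MESH:" prefix, the whitespace tokenizer, and the example citation are all assumptions made for illustration; the abstract does not specify how the dissertation tokenizes text or encodes MeSH descriptors.

```python
from collections import Counter
from typing import List


def mesh_token(descriptor: str) -> str:
    """Encode a MeSH descriptor as one token (hypothetical scheme).

    Spaces become underscores and commas are dropped so that a
    multiword heading is counted as a single vocabulary item.
    """
    cleaned = descriptor.strip().replace(",", "").replace(" ", "_")
    return "MESH:" + cleaned


def hybrid_bag(text: str, mesh_descriptors: List[str]) -> Counter:
    """Build a bag-of-MeSH&words: word counts plus MeSH label counts."""
    # Naive tokenizer (assumption): lowercase, split on whitespace,
    # and keep purely alphabetic tokens.
    words = [w for w in text.lower().split() if w.isalpha()]
    labels = [mesh_token(d) for d in mesh_descriptors]
    return Counter(words + labels)


# Hypothetical citation text with two MeSH descriptors attached.
bag = hybrid_bag(
    "Statins lower cholesterol in adults",
    ["Hydroxymethylglutaryl-CoA Reductase Inhibitors", "Cholesterol"],
)
```

In this sketch the prefix keeps the word "cholesterol" and the descriptor "Cholesterol" as separate vocabulary entries, so a topic model trained on such bags could place labels and ordinary words side by side within a topic. Whether the dissertation keeps them distinct in this way is not stated in the abstract.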