Author: Safa Fathiamini, MD, MS (2018)
Primary advisor: Trevor Cohen, MBChB, PhD
Committee members: Elmer Bernstam, MD, MSE; Funda Meric-Bernstam, MD; Cui Tao, PhD
PhD thesis, The University of Texas School of Health Information Sciences at Houston.
In emerging domains, such as precision oncology, knowledge extracted from explicit assertions may be insufficient to identify relationships of interest. One solution to this problem involves drawing inference on the basis of similarity. Computational methods have been developed to estimate the semantic similarity and relatedness between terms and relationships that are distributed across corpora of literature such as Medline abstracts and other forms of human readable text. Most research on distributional similarity has focused on the notion of attributional similarity, which estimates the similarity between entities based on the contexts in which they occur across a large corpus. A relatively under-researched area concerns relational similarity, in which the similarity between pairs of entities is estimated from the contexts in which these entity pairs occur together. While it seems intuitive that models capturing the structure of the relationships between entities might mediate the identification of biologically important relationships, there is to date no comparison of the relative utility of attributional and relational models for this purpose. In this research, I compare the performance of a range of relational and attributional similarity methods, on the task of identifying drugs that may be therapeutically useful in the context of particular aberrant genes, as identified by a team of human experts. My hypothesis is that relational similarity will be of greater utility than attributional similarity as a means to identify biological relationships that may provide answers to clinical questions, (such as “which drugs INHIBIT gene x”?) in the context of rapidly evolving domains.
My results show that models based on relational similarity outperformed models based on attributional similarity on this task. As the methods explained in this research can be applied to identify any sort of relationship for which cue pairs exist, my results suggest that relational similarity may be a suitable approach to apply to other biomedical problems. Furthermore, I found models based on neural word embeddings (NWE) to be particularly useful for this task, given their higher performance than Random Indexing-based models, and significantly less computational effort needed to create them. NWE methods (such as those produced by the popular word2vec tool) are a relatively recent development in the domain of distributional semantics, and are considered by many as the state-of-the-art when it comes to semantic language modeling. However, their application in identifying biologically important relationships from Medline in general, and specifically, in the domain of precision oncology has not been well studied.
The results of this research can guide the design and implementation of biomedical question answering and other relationship extraction applications for precision medicine, precision oncology and other similar domains, where there is rapid emergence of novel knowledge. The methods developed and evaluated in this project can help NLP applications provide more accurate results by leveraging corpus based methods that are by design scalable and robust.