Spatial Information Extraction from Radiology Reports
Author: Surabhi Datta, BE, MS (2022)
Primary advisor: Kirk Roberts, PhD
Committee members: Elmer Bernstam, MD, MSE; Luca Giancardo, PhD; Roy F. Riascos-Castaneda, MD; Hua Xu, PhD
PhD thesis, The University of Texas School of Biomedical Informatics at Houston.
ABSTRACT
Radiology reports contain a radiologist's interpretation of images, and these reports frequently describe spatial relations between radiological entities. Important radiographic findings are most often documented in reference to an anatomical structure, along with other clinically relevant contextual information, through spatial expressions. The spatially grounded radiological entities mainly include clinical findings and medical devices. Although much work has focused on radiology information extraction, spatial language understanding in the radiology domain remains underexplored. The language used to represent spatial relations is complex and varied. We therefore aim to encode granular spatial information in the reports and to extract this information automatically using natural language processing (NLP) methods. A structured representation of this clinically significant spatial information can support a variety of downstream clinical applications, including fine-grained phenotyping for clinical trials and epidemiological studies, automated tracking of clinical findings and devices, and automatic image label generation.

The three broad aims of this dissertation are to: 1) build a robust spatial representation schema that encodes detailed spatial information about findings and devices, 2) develop state-of-the-art deep learning-based NLP methods to automatically extract this spatial information, and 3) develop clinical informatics applications using the spatial information extracted from reports.

First, we define two spatial representation schemas, Rad-SpRL and Rad-SpatialNet, based on spatial role labeling and frame semantics, respectively, and we construct manually annotated radiology report datasets following these schemas. We then propose transformer-based language models to automatically identify spatial information in the reports, framing the extraction problem as both sequence labeling and question answering. To enable downstream applications, we also propose normalization methods that map the radiological entities in the reports to standard concepts in RadLex, a publicly available radiology lexicon. In addition, we propose a weak supervision method to automatically create a large radiology training dataset for spatial information extraction without any manual annotations. Further, we extend the Rad-SpatialNet schema to encode spatial language in a different domain, ophthalmology notes. Finally, we use the information extracted from radiology reports to develop an ischemic stroke phenotyping system and an automated tracking system that follows the same radiological findings and medical devices across reports.
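To make the sequence-labeling framing concrete, the sketch below sets up a transformer token classifier over a hypothetical BIO label set loosely inspired by spatial role labeling (trajector, spatial expression, landmark). This is a minimal illustration, not the dissertation's actual model: the label set, checkpoint, and example sentence are all assumptions, and the classification head shown is untrained and would need fine-tuning on the annotated report datasets.

    # Minimal sketch, assuming a BIO tagging scheme over spatial roles.
    # Labels are illustrative; the Rad-SpRL/Rad-SpatialNet schemas are richer.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    LABELS = ["O",
              "B-TRAJECTOR", "I-TRAJECTOR",        # entity being located (e.g., a finding)
              "B-SPATIAL-EXPR", "I-SPATIAL-EXPR",  # spatial expression (e.g., "in")
              "B-LANDMARK", "I-LANDMARK"]          # anatomical reference (e.g., "pleural space")

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))  # untrained head; fine-tune first

    sentence = "Small effusion in the left pleural space."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    for token, label_id in zip(tokens, pred_ids):
        print(f"{token:15s}{LABELS[label_id]}")

After fine-tuning, contiguous B-/I- spans would be merged back into report-level entity mentions before any downstream normalization step.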
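The question-answering framing can be illustrated in a similarly hedged way. The snippet below uses an off-the-shelf extractive QA pipeline with placeholder question templates; the dissertation's own QA models and templates are trained on its annotated radiology data, so everything named here is an assumption for demonstration only.

    # Illustrative only: spatial extraction framed as extractive QA.
    # The SQuAD-tuned checkpoint and question wording are placeholders.
    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    report = ("Small effusion in the left pleural space. "
              "The endotracheal tube tip is 4 cm above the carina.")
    for question in ["Where is the effusion located?",
                     "Where is the endotracheal tube tip positioned?"]:
        answer = qa(question=question, context=report)
        print(question, "->", answer["answer"])

In practice, one question would be generated per candidate entity and frame element, and the returned character offsets (the pipeline's start/end fields) would anchor the answer span back in the report text.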