
Advancing Cancer Pharmacoepidemiology Research Through EHRs and Informatics


The data-intensive nature of EHRs creates challenges for adapting them to clinical research, including cancer pharmacoepidemiology studies. Much of the detailed information, such as drug exposure, cancer characteristics (e.g., diagnoses and treatment outcomes), and confounding factors, is embedded in narrative documents and thus not easily accessible for analysis. In addition, differing EHR platforms and clinical data standards make cross-site algorithm implementation and data aggregation more challenging. Finally, uneven clinical documentation and variable EHR data quality introduce additional problems for data analysis, such as selection bias and missing data.

In this project (Advancing Cancer Pharmacoepidemiology Research Through EHRs and Informatics, Grant no: U24CA194215), we propose to integrate and extend established tools to build an informatics infrastructure for EHR data extraction, harmonization, management, and analysis, to advance cancer pharmacoepidemiology research. These informatics methods and software for the secondary use of EHR data were developed by this project team. They include natural language processing (NLP) systems such as MedEx, KMCI, and cTAKES; EHR data normalization tools such as the SHARPn data normalization pipeline; and clinical data management software such as REDCap.


Datasets being generated in the project will be listed and made available for download once pertinent papers have been published.


To facilitate the adoption of NLP in cancer research, we have developed a number of modules within the existing CLAMP toolkit for extracting cancer-related information, such as tumor size, margin, and biomarkers, from pathology reports. The system provides high-performance modules with a user-friendly interface through which users can build customized NLP solutions for their specific needs. This system is available for free download to academic researchers. In addition, we are now developing a new web-based system that provides user-friendly interfaces for building customized NLP pipelines for cancer information extraction, with the goal of helping cancer researchers adopt NLP technologies in their research.
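To make the extraction task concrete, the following is a minimal sketch of pulling size mentions out of pathology text with a regular expression. It is not CLAMP code and does not use the CLAMP API; the pattern, the function name `extract_tumor_sizes_cm`, and the sample sentence are all assumptions made for illustration.

```python
import re

# Matches sizes such as "2.5 cm" or "12 mm"; a deliberately simple,
# hypothetical pattern, not the one used by any released module.
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)


def extract_tumor_sizes_cm(report: str) -> list[float]:
    """Return every size mention in the report, converted to centimeters."""
    sizes = []
    for value, unit in SIZE_PATTERN.findall(report):
        size = float(value)
        if unit.lower() == "mm":
            size /= 10.0  # convert millimeters to centimeters
        sizes.append(size)
    return sizes


# Invented sample text, for illustration only.
sample = "Invasive adenocarcinoma, 2.5 cm in greatest dimension. Closest margin 4 mm."
print(extract_tumor_sizes_cm(sample))
```

Note that this pattern also picks up the margin distance, not only the tumor size. Telling the two apart requires context around each mention, which is one reason a customizable NLP pipeline is more suitable for this task than a single pattern.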

We have developed and released a large ontology of chemotherapy drugs and regimens, based primarily on the website (MPI Warner is Deputy Editor). In mid-2017, we began deriving a JSON-formatted OWL ontology from the website's content. Specifically, we sought to create a self-contained information model establishing the relationships between antineoplastic and supportive drugs, regimens, and the contexts in which they are used. As of May 26, 2018, the derived ontology contains 172,490 axioms, 1,400 classes, and 25,299 entities. For example, the regimen entity “R-CHOP” has 4 aliases, 16 classes (e.g., induction therapy for untreated diffuse large B-cell lymphoma), 20 component entities (including supportive medications, CNS prophylaxis, and explicit links to preceding or subsequent treatments declared by some R-CHOP protocol variants), and 42 reference entities.
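The shape of a regimen entity described above can be sketched as a small data structure. This is an illustrative model only, not the released ontology's schema: the class `RegimenEntity` and its field names are assumptions, and the example is populated with placeholder values rather than the actual 4 aliases, 16 classes, 20 components, and 42 references.

```python
from dataclasses import dataclass, field


@dataclass
class RegimenEntity:
    """Hypothetical record for one regimen; field names are illustrative."""
    name: str
    aliases: list[str] = field(default_factory=list)     # alternative names
    classes: list[str] = field(default_factory=list)     # contexts of use
    components: list[str] = field(default_factory=list)  # drugs, linked treatments
    references: list[str] = field(default_factory=list)  # supporting citations

    def summary(self) -> dict[str, int]:
        """Count each kind of relationship attached to this regimen."""
        return {
            "aliases": len(self.aliases),
            "classes": len(self.classes),
            "components": len(self.components),
            "references": len(self.references),
        }


# Placeholder subset: one class from the text and the five drugs in R-CHOP.
r_chop = RegimenEntity(
    name="R-CHOP",
    classes=["induction therapy for untreated diffuse large B-cell lymphoma"],
    components=["rituximab", "cyclophosphamide", "doxorubicin",
                "vincristine", "prednisone"],
)
print(r_chop.summary())
```

Keeping aliases, classes, components, and references as separate lists mirrors the four relationship types named in the R-CHOP example. In an OWL ontology these relationships would be expressed as axioms rather than as fields on a record.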


2018, North American Association of Central Cancer Registries (NAACCR) Annual Meeting, Warner J, A comprehensive ontology of hematology/oncology treatment regimens.

2018, National Cancer Policy Forum of the National Academies of Medicine workshop on Improving Cancer Diagnosis and Care, Warner J, Genomic standards and knowledge bases for decision support.

2018, ITCR Annual Meeting, Warner J, Advancing cancer pharmacoepidemiology research through EHRs and informatics.

2018, ITCR Annual Meeting, Wang L, Information Extraction for Populating Lung Cancer Clinical Research Data.

2018, CI4CC, Spring Symposium & Workshop, Precision Oncology Knowledge Networks, Xu H, Building customized NLP pipelines for cancer research using CLAMP.

2017, CDC/NCI/FDA/VA Clinical Natural Language Processing Workshop, Xu H, Supporting cancer registries through automated extraction of pathology and chemotherapy regimen information.

2017, AMIA Joint Summits on Translational Science, Xu H, Panel.

2017, CI4CC, Warner J, Natural Language Processing and Visual Analytics to Improve the Efficiency of Registry Operations.

2017, ITCR PI meeting, Xu H.

2017, AMIA NLP Working Group invited seminar, Xu H, NLP tools for pathology reports processing: CLAMP and MetaMap.


AMIA Annual Symposium 2018. San Francisco, CA: Information Extraction for Populating Lung Cancer Clinical Research Data.

AMIA Annual Symposium 2018. San Francisco, CA: Combine Factual Medical Knowledge and Distributed Word Representation to Improve Clinical Named Entity Recognition.

ICIBM, 2018, Los Angeles, CA: Integrating Shortest Dependency Path and Sentence Sequence into a Deep Learning Framework for Relation Extraction in Clinical Text. 

ICIBM, 2018, Los Angeles, CA: Parsing clinical text: how good are the state-of-the-art deep learning based parsers.

2018, ASCO Annual Meeting, Chicago, IL: Genetic differences between primary and metastatic tumors from cross-institutional data.

2017, AMIA 2017 Joint Summits, San Francisco, CA: Identifying Metastases related Information from Pathology Reports of Lung Cancer Patients

2017, AMIA Annual Symposium, Washington, DC: Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation (submitted)


Lee H, Zhang Y, Jiang M, Xu J, Tao C, et al. Identifying direct temporal relations between time and events from clinical notes. BMC Medical Informatics and Decision Making. (accepted)

Gregg JR, Lang M, Wang LL, Resnick MJ, Jain SK, Warner JL*, Barocas DA*. Automating the determination of prostate cancer risk strata from electronic medical records. JCO Clinical Cancer Informatics. 2017 Jun 8;1:1-8. PMCID: PMC5847303.

Malty AM, Jain SK, Yang PC, Harvey K, Warner JL. Computerized approach to creating a systematic ontology of hematology/oncology regimens. JCO Clinical Cancer Informatics. 2018 May 11. [PMCID pending – NIHMS ID 974428]

Soysal E, Wang J, Jiang M, Wu Y, Pakhomov S, Liu H, and Xu H. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc, 2018, 25(3), 331–336.

Huang J, Duan R, Hubbard RA, Wu Y, Moore JH, Xu H, and Chen Y. PIE: A prior knowledge guided integrated likelihood estimation method for bias reduction in association studies using electronic health records data. J Am Med Inform Assoc, 2018, 25(3), 345–350.

Lee HJ, Wu Y, Zhang Y, Xu J, Xu H, Roberts K. A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform. 2017 Nov;75S:S19-S27. doi: 10.1016/j.jbi.2017.06.006. Epub 2017 Jun 7. PubMed PMID: 28602904; PubMed Central PMCID: PMC5705430.

Soysal E, Warner JL, Wang J, Jiang M, Harvey K, Jain SK, Dong X, Song HY, Siddhanamatha H, Wang L, Dai Q, Chen Q, Du X, Tao C, Yang P, Denny JC, Liu H, Xu H. CLAMP-Cancer - software for developing customized cancer information extraction systems for pathology reports. Cancer Research. 2017. (submitted)