Author: Jianbo Lei, M.D., M.S., M.A. (2014)
Primary Advisor: Hua Xu, PhD
Committee Members: Jiajie Zhang, PhD, Trevor Cohen, PhD, Kai Zheng, PhD, Qun Meng, PhD
PhD thesis, The University of Texas School of Biomedical Informatics at Houston.
Abstract:
Objective: Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP). In the medical domain, there have been a number of studies on NER in English clinical notes; however, very limited NER research has been done on clinical notes written in Chinese. The goal of this study is to develop corpora, methods, and systems for NER in Chinese clinical text.
Materials and methods: To study entities in Chinese clinical text, we started with building annotated clinical corpora in Chinese. We developed an NER annotation guideline in Chinese by extending the one used in the 2010 i2b2 NLP challenge. We randomly selected 400 admission notes and 400 discharge summaries from Peking Union Medical College Hospital (PUMCH) in China. For each note, four types of entities including clinical problems, procedures, labs, and medications were annotated according to the developed guideline. In addition, an annotation tool was developed to assist two MD students to annotate Chinese clinical documents. A comparison of entity distribution between Chinese and English clinical notes (646 English and 400 Chinese discharge summaries) was performed using the annotated corpora, to identify the important features for NER. In the NER study, two-thirds of the 400 notes were used for training the NER systems and one-third were used for testing. We investigated the effects of different types of features including bag-of-characters, word segmentation, part-of-speech, and section information, with different machine learning (ML) algorithms including Conditional Random Fields (CRF), Support Vector Machines (SVM), Maximum Entropy (ME), and Structural Support Vector Machines (SSVM) on the Chinese clinical NER task. All classifiers were trained on the training dataset, evaluated on the test set, and microaveraged precision, recall, and F-measure were reported.
Results: Our evaluation on the independent test set showed that most types of features were beneficial to Chinese NER systems, although the improvements were limited. By combining word segmentation and section information, the system achieved the highest performance, indicating that these two types of features are complementary to each other. When the same types of optimized features were used, CRF and SSVM outperformed SVM and ME. More specifically, SSVM reached the highest performance among the four algorithms, with F-measures of 93.51% and 90.01% for admission notes and discharge summaries respectively.
Conclusions: In this study, we created large annotated datasets of Chinese admission notes and discharge summaries and then systematically evaluated different types of features (e.g., syntactic, semantic, and segmentation information) and four ML algorithms including CRF, SVM, SSVM, and ME for clinical NER in Chinese. To the best of our knowledge, this is one of the earliest comprehensive effort in Chinese clinical NER research and we believe it will provide valuable insights to NLP research in Chinese clinical text. Our results suggest that both word segmentation and section information improves NER in Chinese clinical text, and SSVM, a recent sequential labelling algorithm, outperformed CRF and other classification algorithms. Our best system achieved F-measures of 90.01% and 93.52% on Chinese discharge summaries and admission notes, respectively, indicating a promising start on Chinese NLP research.