For a full list of publications, visit Dr. Cui Tao's Google Scholar profile.

Selected papers
  • Issues in melanoma detection: semisupervised deep learning algorithm development via a combination of human and artificial intelligence

    Xinyuan Zhang, Ziqian Xie, Yang Xiang, Imran Baig, Mena Kozman, Carly Stender, Luca Giancardo, Cui Tao; JMIR Dermatol 2022;5(4):e39113
    Funded by: This research was partially supported by UTHealth Innovation for Cancer Prevention Research Training Program Pre-doctoral Fellowship (Cancer Prevention and Research Institute of Texas Grant No. RP160015 and No. RP210042)


    Automatic skin lesion recognition has been shown to be effective in increasing access to reliable dermatology evaluation; however, most existing algorithms rely solely on images. Many diagnostic rules, including the 3-point checklist, comprise human knowledge and reflect the diagnosis process of human experts, yet they are not considered by artificial intelligence algorithms. In this paper, we aimed to develop a semisupervised model that can not only integrate the dermoscopic features and scoring rule from the 3-point checklist but also automate the feature-annotation process. We first trained the semisupervised model on a small, annotated data set with disease and dermoscopic feature labels and tried to improve the classification accuracy by integrating the 3-point checklist using a ranking loss function. We then used a large, unlabeled data set with only disease labels to learn from the trained algorithm to automatically classify skin lesions and features. After adding the 3-point checklist to our model, its performance for melanoma classification improved from a mean of 0.8867 (SD 0.0191) to 0.8943 (SD 0.0115) under 5-fold cross-validation. The trained semisupervised model can automatically detect 3 dermoscopic features from the 3-point checklist, with best performances of 0.80 (area under the curve [AUC] 0.8380), 0.89 (AUC 0.9036), and 0.76 (AUC 0.8444), in some cases outperforming human annotators. Our proposed semisupervised learning framework can help with the automatic diagnosis of skin disease based on its ability to detect dermoscopic features and automate the label-annotation process. The framework can also help combine semantic knowledge with a computer algorithm to arrive at a more accurate and more interpretable diagnostic result, which can be applied to broader use cases.

    View details at DOI: 10.2196/39113
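
The idea of combining a checklist score with a ranking loss, as described in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the three feature names follow the commonly cited 3-point checklist (asymmetry, atypical network, blue-white structures), the threshold of 2 is the usual rule of thumb, and `pairwise_ranking_loss`, its `margin`, and the toy numbers are hypothetical.

```python
def checklist_score(asymmetry, atypical_network, blue_white):
    """Count how many of the three checklist features are present (0 to 3)."""
    return int(asymmetry) + int(atypical_network) + int(blue_white)


def is_suspicious(score, threshold=2):
    """Common rule of thumb: two or more features flag a suspicious lesion."""
    return score >= threshold


def pairwise_ranking_loss(pred_scores, checklist_scores, margin=0.1):
    """Mean hinge loss over ordered pairs (i, j) where lesion i has the higher
    checklist score. The loss is zero when the model's melanoma score for i
    exceeds that for j by at least `margin` (hypothetical formulation)."""
    losses = []
    n = len(pred_scores)
    for i in range(n):
        for j in range(n):
            if checklist_scores[i] > checklist_scores[j]:
                gap = pred_scores[i] - pred_scores[j]
                losses.append(max(0.0, margin - gap))
    return sum(losses) / len(losses) if losses else 0.0


# Toy lesions with checklist scores 3, 1, and 0 (made-up values).
scores = [
    checklist_score(True, True, True),
    checklist_score(True, False, False),
    checklist_score(False, False, False),
]
loss_consistent = pairwise_ranking_loss([0.9, 0.4, 0.1], scores)
loss_inconsistent = pairwise_ranking_loss([0.2, 0.4, 0.1], scores)
```

Predictions ordered consistently with the checklist incur no penalty, while predictions that rank a high-checklist lesion below a low-checklist one are penalized.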

  • Toward a standard formal semantic representation of the model card report

    Muhammad Tuan Amith, Licong Cui, Degui Zhi, Kirk Roberts, Xiaoqian Jiang, Fang Li, Evan Yu & Cui Tao; BMC Bioinformatics 23 (Suppl 6), 281 (2022)
    Funded by: This research was partially supported by NIH award No. RF1AG072799.


    Model card reports aim to provide informative and transparent descriptions of machine learning models to stakeholders. This report document is of interest to the National Institutes of Health’s Bridge2AI initiative to address the FAIR challenges with artificial intelligence-based machine learning models for biomedical research. We present our early undertaking in developing an ontology for capturing the conceptual-level information embedded in model card reports. Sourcing from existing ontologies and developing the core framework, we generated the Model Card Report Ontology. Our development efforts yielded an OWL2-based artifact that represents and formalizes model card report information. The current release of this ontology utilizes standard concepts and properties from OBO Foundry ontologies. Also, the software reasoner indicated no logical inconsistencies with the ontology. With sample model cards of machine learning models for bioinformatics research (HIV social networks and adverse outcome prediction for stent implantation), we showed the coverage and usefulness of our model in transforming static model card reports to a computable format for machine-based processing. The benefit of our work is that it utilizes expansive and standard terminologies and scientific rigor promoted by biomedical ontologists, as well as generating an avenue to make model cards machine-readable using semantic web technology. Our future goal is to assess the veracity of our model and later expand the model to include additional concepts to address terminological gaps. We discuss tools and software that will utilize our ontology for potential …

    View details at DOI: 10.1186/s12859-022-04797-6

  • Mining on Alzheimer’s diseases related knowledge graph to identity potential AD-related semantic triples for drug repurposing

    Yi Nian, Xinyue Hu, Rui Zhang, Jingna Feng, Jingcheng Du, Fang Li, Larry Bu, Yuji Zhang, Yong Chen & Cui Tao; BMC Bioinformatics 23 (Suppl 6), 407 (2022)


    To date, there are no effective treatments for most neurodegenerative diseases. Knowledge graphs can provide comprehensive and semantic representation for heterogeneous data, and have been successfully leveraged in many biomedical applications including drug repurposing. Our objective is to construct a knowledge graph from literature to study the relations between Alzheimer’s disease (AD) and chemicals, drugs and dietary supplements in order to identify opportunities to prevent or delay neurodegenerative progression. We collected biomedical annotations and extracted their relations using SemRep via SemMedDB. We used both a BERT-based classifier and rule-based methods during data preprocessing to exclude noise while preserving most AD-related semantic triples. The 1,672,110 filtered triples were used to train knowledge graph completion algorithms (i.e., TransE, DistMult, and ComplEx) to predict candidates that might be helpful for AD treatment or prevention. Among the three knowledge graph completion models, TransE outperformed the other two (MR = 10.53, Hits@1 = 0.28). We leveraged the time-slicing technique to further evaluate the prediction results. We found supporting evidence for most highly ranked candidates predicted by our model, which indicates that our approach can inform reliable new knowledge. This paper shows that our graph mining model can predict reliable new relationships between AD and other entities (i.e., dietary supplements, chemicals, and drugs). The knowledge graph constructed can facilitate data-driven knowledge discoveries and the generation of novel hypotheses.

    View details at DOI: 10.1186/s12859-022-04934-1
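
For readers unfamiliar with the metrics quoted in the abstract above (MR and Hits@1), the following minimal sketch shows how TransE scores a triple and how mean rank and Hits@k are computed from tail-prediction ranks. It is not the authors' code: the two-dimensional embeddings, the entity names (`drug_a`, `disease_x`, and so on), and the unfiltered ranking over all entities are assumptions made purely for illustration.

```python
import math


def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance between h + r and t.
    Higher (closer to zero) means the triple is judged more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_of_true_tail(head, rel, true_tail, entities, relations):
    """1-based rank of the true tail among all candidate tail entities."""
    h, r = entities[head], relations[rel]
    scores = {name: transe_score(h, r, vec) for name, vec in entities.items()}
    true_score = scores[true_tail]
    return 1 + sum(1 for s in scores.values() if s > true_score)


def mean_rank(ranks):
    """MR: average rank of the true entity (lower is better)."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Hits@k: fraction of test triples whose true entity ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy embeddings; real models learn these from training triples.
entities = {
    "drug_a": [0.0, 0.0],
    "drug_b": [1.0, 1.0],
    "disease_x": [1.0, 0.0],
    "disease_y": [0.0, 1.0],
}
relations = {"treats": [1.0, 0.0]}
test_triples = [
    ("drug_a", "treats", "disease_x"),
    ("drug_b", "treats", "disease_y"),
]

ranks = [rank_of_true_tail(h, r, t, entities, relations) for h, r, t in test_triples]
mr = mean_rank(ranks)
hits1 = hits_at_k(ranks, 1)
```

In this toy setup the first triple is ranked first and the second is ranked third, so MR is 2.0 and Hits@1 is 0.5.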

  • Understanding public perceptions of measles from Twitter using multi-task Convolutional Neural Networks

    Samuel Wang, Jingcheng Du, Lu Tang, Cui Tao; Studies in health technology and informatics, 2022 Jun 6;290:607-611.


    Measles is a highly contagious cause of febrile illness typically seen in young children. Recent years have witnessed the resurgence of measles cases in the United States. Prompt understanding of public perceptions of measles will allow public health agencies to respond appropriately and promptly. We proposed a multi-task Convolutional Neural Network (MT-CNN) model to classify measles-related tweets in terms of three characteristics: Type of Message (6 subclasses), Emotion Expressed (6 subclasses), and Attitude towards Vaccination (3 subclasses). A gold standard corpus that contains 2,997 tweets with annotation in these dimensions was manually curated. A variety of conventional machine learning and deep learning models were evaluated as baseline models. The MT-CNN model performed better than the baseline conventional machine learning models and the single-task CNN models. It was then applied to predict unlabeled measles-related Twitter discussions that were crawled from 2007 to 2019, and the trends of public perceptions were analyzed along the three dimensions.

    View details at DOI: 10.3233/SHTI220149

  • Application of artificial intelligence and machine learning for HIV prevention interventions

    Yang Xiang, Jingcheng Du, Kayo Fujimoto, Fang Li, John Schneider, Cui Tao; The Lancet HIV, November 08, 2021
    Project: Using big data and deep learning on predicting HIV transmission risk in MSM population
    Funded by: NIH award 1R56AI150272-01A1


    In 2019, the US Government announced its goal to end the HIV epidemic within 10 years, mirroring the initiatives set forth by UNAIDS. Public health prevention interventions are a crucial part of this ambitious goal. However, numerous challenges to this goal exist, including improving HIV awareness, increasing early HIV infection detection, ensuring rapid treatment, optimising resource distribution, and providing efficient prevention services for vulnerable populations. Artificial intelligence has had a pivotal role in revolutionising health care and has shown great potential in developing effective HIV prevention intervention strategies. Although artificial intelligence has been used in a few HIV prevention intervention areas, there are challenges to address and opportunities to explore.

    View details at DOI: 10.1016/S2352-3018(21)00247-2

  • COVID-19 trial graph: a linked graph for COVID-19 clinical trials

    Jingcheng Du, Qing Wang, Jingqi Wang, Prerana Ramesh, Yang Xiang, Xiaoqian Jiang, Cui Tao; Journal of the American Medical Informatics Association, Volume 28, Issue 9, September 2021, Pages 1964–1969
    Funded by: This research was partially supported by NIH award Nos. R56AI150272 and R01AI130460


    Clinical trials are an essential part of the effort to find safe and effective prevention and treatment for COVID-19. Given the rapid growth of COVID-19 clinical trials, there is an urgent need for a better clinical trial information retrieval tool that supports searching by specifying criteria, including both eligibility criteria and structured trial information. We built a linked graph for registered COVID-19 clinical trials: the COVID-19 Trial Graph, to facilitate retrieval of clinical trials. Natural language processing tools were leveraged to extract and normalize the clinical trial information from both their eligibility criteria free texts and structured information from ClinicalTrials.gov. We linked the extracted data using the COVID-19 Trial Graph and imported it to a graph database, which supports both querying and visualization. We evaluated the trial graph using case queries and graph embedding. The graph currently (as of October 5, 2020) contains 3392 registered COVID-19 clinical trials, with 17 480 nodes and 65 236 relationships. Manual evaluation of case queries found high precision and recall scores on retrieving relevant clinical trials searching from both eligibility criteria and trial-structured information. We observed clustering in clinical trials via graph embedding, which also showed superiority over the baseline (0.870 vs 0.820) in evaluating whether a trial can complete its recruitment successfully. The COVID-19 Trial Graph is a novel representation of clinical trials that allows diverse search queries and provides a graph-based visualization of COVID-19 clinical trials. High-dimensional vectors mapped by graph embedding for clinical trials would be potentially beneficial for many downstream applications, such as trial end recruitment status prediction and trial similarity comparison. Our methodology is also generalizable to other clinical trials.

    View details at DOI: 10.1093/jamia/ocab078

  • Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (VAERS) using deep learning

    Jingcheng Du, Yang Xiang, Madhuri Sankaranarayanapillai, Meng Zhang, Jingqi Wang, Yuqi Si, Huy Anh Pham, Hua Xu, Yong Chen, Cui Tao; Journal of the American Medical Informatics Association, Volume 28, Issue 7, July 2021, Pages 1393–1400
    Project: Dynamic learning for post-vaccine event prediction using temporal information in VAERS
    Funded by: This research was funded by NIH under award Nos. R01AI130460 and R01LM011829


    Automated analysis of vaccine postmarketing surveillance narrative reports is important to understand the progression of rare but severe vaccine adverse events (AEs). This study implemented and evaluated state-of-the-art deep learning algorithms for named entity recognition to extract nervous system disorder-related events from vaccine safety reports. We collected Guillain-Barré syndrome (GBS) related influenza vaccine safety reports from the Vaccine Adverse Event Reporting System (VAERS) from 1990 to 2016. VAERS reports were selected and manually annotated with major entities related to nervous system disorders, including investigation, nervous_AE, other_AE, procedure, social_circumstance, and temporal_expression. A variety of conventional machine learning and deep learning algorithms were then evaluated for the extraction of the above entities. We further pretrained domain-specific BERT (Bidirectional Encoder Representations from Transformers) using VAERS reports (VAERS BERT) and compared its performance with existing models. Ninety-one VAERS reports were annotated, resulting in 2512 entities. The corpus was made publicly available to promote community efforts on vaccine AE identification. Deep learning-based methods (e.g., bi-long short-term memory and BERT models) outperformed conventional machine learning-based methods (i.e., conditional random fields with extensive features). The BioBERT large model achieved the highest exact match F-1 scores on nervous_AE, procedure, social_circumstance, and temporal_expression, while VAERS BERT large models achieved the highest exact match F-1 scores on investigation and other_AE. An ensemble of these 2 models achieved the highest exact match microaveraged F-1 score at 0.6802 and the second highest lenient match microaveraged F-1 score at 0.8078 among peer models.

    View details at DOI: 10.1093/jamia/ocab014
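
The difference between the exact match and lenient match microaveraged F-1 scores reported in the abstract above can be illustrated with a short sketch. The definitions used here are common conventions and are assumptions, not taken from the paper: an exact match requires identical offsets and label, a lenient match requires overlapping offsets and the same label, and microaveraging pools counts across all entity types. The spans below are made up.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def exact_micro_f1(gold, pred):
    """Exact match: a prediction counts only if (start, end, label) is identical."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return f1(precision, recall)


def overlaps(a, b):
    """True if two (start, end, label) spans share a label and overlap.
    Offsets are treated as half-open intervals [start, end)."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def lenient_micro_f1(gold, pred):
    """Lenient match: partial overlap with the same label is enough."""
    matched_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return f1(precision, recall)


# Hypothetical annotations as (start offset, end offset, entity label).
gold = {(0, 24, "nervous_AE"), (30, 40, "temporal_expression"), (50, 60, "procedure")}
pred = {(0, 24, "nervous_AE"), (28, 40, "temporal_expression"), (70, 80, "other_AE")}

exact = exact_micro_f1(gold, pred)
lenient = lenient_micro_f1(gold, pred)
```

Here one of three predictions matches exactly, while a second overlaps its gold span with a different start offset, so the lenient score is higher than the exact score.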

Conference proceedings
  • Chemical-Protein Relation Extraction with Pre-trained Prompt Tuning

    Jianping He, Fang Li, Xinyue Hu, Jianfu Li, Yi Nian, Jingqi Wang, Yang Xiang, Qiang Wei, Hua Xu, Cui Tao; 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI)


    Biomedical relation extraction plays a critical role in the construction of high-quality knowledge graphs and databases, which can further support many downstream applications. Pre-trained prompt tuning, as a new paradigm, has shown great potential in many natural language processing (NLP) tasks. By inserting a piece of text into the original input, a prompt converts NLP tasks into masked language problems, which could be better addressed by pre-trained language models (PLMs). In this study, we applied pre-trained prompt tuning to chemical-protein relation extraction using the BioCreative VI CHEMPROT dataset. The experimental results showed that the pre-trained prompt tuning outperformed the baseline approach in chemical-protein interaction classification. We conclude that the prompt tuning can improve the efficiency of the PLMs on chemical-protein relation extraction tasks.

    View details at DOI: 10.1109/ICHI54592.2022.00120
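
The abstract above describes how a prompt turns relation extraction into a masked language problem. The sketch below shows that conversion only; it does not train or call a language model. The template wording, the label words, the `verbalizer` mapping to CHEMPROT relation groups, the placeholder entity names, and the toy mask scores are all hypothetical choices for illustration and are not taken from the paper.

```python
MASK = "[MASK]"

# Hypothetical verbalizer: label words a PLM might predict at the mask
# position, mapped to CHEMPROT relation groups.
verbalizer = {
    "activator": "CPR:3",
    "inhibitor": "CPR:4",
    "agonist": "CPR:5",
    "antagonist": "CPR:6",
    "substrate": "CPR:9",
}


def build_prompt(sentence, chemical, protein, mask_token=MASK):
    """Append a cloze-style template to the original input sentence."""
    return f"{sentence} {chemical} is a {mask_token} of {protein}."


def predict_relation(mask_word_scores, verbalizer):
    """Pick the label word with the highest score at the mask position and
    map it to its relation class. Words outside the verbalizer are ignored."""
    candidates = {w: s for w, s in mask_word_scores.items() if w in verbalizer}
    best_word = max(candidates, key=candidates.get)
    return verbalizer[best_word]


prompt = build_prompt(
    "CHEM_X reduced the activity of PROT_Y.", "CHEM_X", "PROT_Y"
)

# Stand-in for the scores a PLM would assign to words at the mask position.
toy_scores = {"inhibitor": 0.62, "activator": 0.21, "substrate": 0.05, "the": 0.9}
relation = predict_relation(toy_scores, verbalizer)
```

In an actual prompt-tuning setup, the scores at the mask position would come from a pre-trained language model whose parameters (or prompt parameters) are tuned on the labeled data.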