
Data Science and Informatics Core for Cancer Research

Resources


Data

DSICCR and SBMI established a Data Service Coordination Office that will host, manage, and provide access to a variety of data for cancer research, ranging from electronic health records to -omics data. The SBMI Data Service Coordination Office is directed by DSICCR co-investigator Dr. Hua Xu. For more information or to access data, please visit the office’s website at: https://sbmi.uth.edu/sbmi-data-service/



Hardware Infrastructure

DSICCR established a robust computing infrastructure to support data science and AI research. The infrastructure includes the following cutting-edge computing hardware.

  1. The world's first Nvidia DGX-H100 GPU server, the fastest and most complete platform for enterprise AI. The DGX-H100 system has 8 Nvidia H100 Tensor Core GPUs with 640 GB of GPU memory, 32 petaFLOPS of FP8 performance, 4x Nvidia NVSwitch, and 2 TB of system memory.
  2. One Dell PE R750 with 5 TB of system memory, two Intel Xeon Platinum 8380 processors (80 cores/160 threads total), 180 TB of system storage, and 2 Nvidia H100 GPU cards.
  3. One Nvidia DGX-A100 GPU server with 5 petaFLOPS AI/10 petaOPS INT8 performance, 8x NVIDIA A100 GPUs, 12 NVLinks per GPU (600 GB/s GPU-to-GPU communication), 320 GB of total GPU memory, 65,536 NVIDIA CUDA cores, 4,096 NVIDIA Tensor Cores, 1 TB of DDR4 memory, and 30 TB of SSD storage.
  4. One EXXACT TS4 deep learning certified system with 24 CPU cores, 10 Nvidia RTX 2080 Ti GPUs (43,520 cores), 376 GB of memory, and 12 TB of internal storage.
  5. One Supermicro 8-way system with 448 CPU cores, 1 Xilinx Alveo U250 FPGA, 6 TB of system memory, and 60 TB of local SSD storage. It also has an Nvidia V100 GPU (20,480 cores) and an additional 200 TB of external storage.
  6. 3 PB of storage space provided by various direct-attached storage devices, an NFS server, and a Dell ME5084 storage array.
  7. One 36-node Dell EMC Hadoop cluster with 864 total computing cores, 12.8 TB of combined memory, and 1.5 PB of raw storage.

[Figure: Advanced Hardware Infrastructure for AI and Big Data]


In addition, the Texas Advanced Computing Center (TACC, https://www.tacc.utexas.edu) is available through a high-speed network (Internet2). TACC is equipped with many robust, high-performance computing systems, including Frontera, which ranked as the fifth most powerful supercomputer in the world in 2019. TACC's science environment includes high-performance computing, visualization, data analysis, storage systems, software, and portal interfaces that enable researchers to answer questions more efficiently and effectively using advanced computing resources. TACC provides systems and software to researchers and has worked on over 3000 projects by more than 1000 researchers at over 350 institutions nationally and worldwide that address scientific questions to improve the quality of life.


All these systems are connected through a multi-platform computer network (1 Gbps, 10 Gbps, and 25 Gbps) that is continuously being upgraded to state-of-the-art technology.



Software/Algorithms by DSICCR Faculty

DSICCR faculty conduct cutting-edge health data science and informatics research and have developed many software tools and algorithms for biomedical data analysis. These tools are listed below and are available to cancer researchers.

Name Description
MSEA: Mutation Set Enrichment Analysis
MSEA identifies cancer driver genes by detecting groups of somatic mutations, also known as hotspots.
  1. Faculty/PI Name: Peilin Jia
  2. Publications: Jia P, Wang Q, Chen Q, Hutchinson K, Pao W, Zhao Z (2014) MSEA: detection and quantification of mutation hotspots through mutation set enrichment analysis. Genome Biology 15(10):489
  3. Website Link: https://github.com/bsml320/MSEA

VarWalker
VarWalker performs mutation network analysis of putative cancer genes from next-generation sequencing data.
  1. Faculty/PI Name: Peilin Jia
  2. Publications: Jia P, Zhao Z (2014) VarWalker: Personalized Mutation Network Analysis of Putative Cancer Genes from Next-generation Sequencing Data. PLoS Computational Biology 10(2): e1003460
  3. Website Link: https://bioinfo.uth.edu/VarWalker.html

GATHER
GATHER is a Gene Annotation Tool to Help Explain Relationships.
  1. Faculty/PI Name: Jeffrey Chang
  2. Publications:
    Chang, J. and Nevins, J. (2006). GATHER: a systems approach to interpreting genomic signatures. Bioinformatics, 22(23), pp.2926-2933.
  3. Website Link: http://changlab.uth.tmc.edu/gather/

SIGNATURE
SIGNATURE is a web-based resource that simplifies gene expression signature analysis by providing the software, data, and protocols needed to perform the analysis successfully.
  1. Faculty/PI Name: Jeffrey Chang
  2. Publications:
    Chang JT, Gatza ML, Lucas JE, Barry W, Vaughn P, and Nevins JR. "SIGNATURE: A Workbench for Gene Expression Signature Analysis." BMC Bioinformatics 12(443), 2011
  3. Website Link: https://uth.tmc.edu

TREC Precision Medicine
An information retrieval (IR) tool for finding relevant precision medicine scientific literature and clinical trials for specific cancer patients.
  1. Faculty/PI Name: Kirk Roberts
  2. Publications:
    • Roberts K, Demner-Fushman D, Voorhees E, Hersh WR, Bedrick S, Lazar A, Pant S. (2017). Overview of the TREC 2017 Precision Medicine Track. Proceedings of the Text Retrieval Conference.
    • Roberts K. Automatic Identification of Cancer Precision Medicine Literature Articles. In Submission.

Cancer FrameNet
A natural language processing (NLP) tool for extracting frame-based information about cancer from clinical text.
  1. Faculty/PI Name: Kirk Roberts
  2. Publications:
    • Roberts K, Si Y, Gandhi A, Bernstam EV. (2018). A FrameNet for Cancer Information in Clinical Narratives: Schema and Annotation. Proceedings of the Language Resources and Evaluation Conference.
    • Si Y, Roberts K. A Frame-Based NLP System for Cancer-Related Information Extraction. In Submission.

EpiphaNet
EpiphaNet is an interactive knowledge discovery system that enables researchers to visually explore sets of relations extracted from MEDLINE using a combination of language processing techniques.
  1. Faculty/PI Name: Trevor Cohen
  2. Publications:
    • Cohen, T., Whitfield, G. K., Schvaneveldt, R. W., Mukund, K., & Rindflesch, T. (2010). EpiphaNet: An Interactive Tool to Support Biomedical Discoveries. Journal of Biomedical Discovery and Collaboration, 5, 21–49.

CLAMP: Clinical Language Annotation, Modeling, and Processing Toolkit
CLAMP is a comprehensive clinical natural language processing (NLP) software package that enables recognition and automatic encoding of clinical information in narrative patient reports.
  1. Faculty/PI Name: Hua Xu
  2. Publications:
    • Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, Hua Xu. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. JAMIA, Doi: 10.1093/jamia/
    • Jun Xu*, Hee-Jin Lee*, Zongcheng Ji*, Jingqi Wang, Qiang Wei, and Hua Xu. UTH_CCB System for Adverse Drug Reaction Extraction from Drug Labels at TAC-ADR 2017. Proceedings of TAC, 2017. (* denotes equal contribution)
  3. Website Link: https://clamp.uth.edu/

SBMI Data Service
SBMI Data Service provides technical assistance and consulting services to qualified clients about the health data sets available at the school.
  1. Faculty/PI Name: Hua Xu
  2. Website Link: SBMI Data Service (https://sbmi.uth.edu/sbmi-data-service/)

R, SAS
We developed code in R and SAS for various projects in cancer research and other fields.
  1. Faculty/PI Name: Dejian Lai
  2. Publications:

A. Recent Projects in Cancer Research:

  1. Early Life Exposures to Air Toxics and Risk of Early Childhood Leukemia (2018).
  2. Spatial Analysis of Ambient Benzene and Cancer Incidence Rates in Texas (student thesis, graduated in 2017).
  3. Longitudinal Study of Melatonin, Cortisol and Risk of Colorectal Cancer (proposal submitted in 2017).
  4. Hazardous Air Pollutants and Lymphohematopoietic Cancer Incidence in Houston (NIH-funded project).
  5. Using Spatial Linear Models with SAR and CAR Structure to Examine Texas Lung Cancer Incidence Rates (student thesis, graduated in 2017).

B. Selected Cancer-Related Publications in the Last Five Years:

  1. Symanski E, Lewis PGT, Chen TY, Chan W, Lai D, Ma XM: Air Toxics and Early Childhood Acute Lymphocytic Leukemia in Texas, A Population Based Case Control Study. Environmental Health: A Global Access Science Source. 2016, Vol. 15, No. 70.
  2. Tong L, Ahn C, Symanski E, Lai D, Du XL: Relative Impact of Earlier Diagnosis and Improved Treatment on Survival for Colorectal Cancer: A US Database Study among Elderly Patients. Cancer Epidemiology. 2014, Vol. 38, 733-740.
  3. Tong LY, Ahn C, Symanski E, Lai D, Du XL: Temporal Trends in the Leading Causes of Death among a Large National Cohort of Patients with Colorectal Cancer from 1975 to 2009 in the United States. Annals of Epidemiology. 2014, Vol. 24, 411-417.
  4. Tong LY, Ahn C, Symanski E, Lai D, Du XL: Effects of Newly Developed Chemotherapy Regimens, Comorbidities, Chemotherapy-related Toxicity on the Changing Patterns of the Leading Causes of Death in Elderly Patients with Colorectal Cancer. Annals of Oncology. 2014, Vol. 25, 1234-1242.
  5. Wang GD, Lai D, Burau K, Du XL: Potential Gains in Life Expectancy from Reducing Heart Disease, Cancer, Alzheimer's Disease, Kidney Disease or HIV/AIDS as Major Causes of Death in the USA. Public Health. 2013, Vol. 127, 348-356.

Machine Learning Tools for Longitudinal Brain Connectivity
Methods to identify neuroplasticity patterns in the brain are of the utmost importance in understanding and potentially treating diseases. Diffusion tensor imaging (DTI) allows for an in vivo estimation of the structural connectome inside the brain and may serve to quantify degenerative processes before the appearance of clinical symptoms. We have developed novel machine learning-based strategies to compute longitudinal structural connectomes that allow the discovery and quantification of longitudinal patterns.
  1. Faculty/PI Name: Luca Giancardo
  2. Publications:
    • Giancardo, L.*, Ellmore, T. M., Suescun, J., Ocasio, L., Kamali, A., Riascos-Castaneda, R. & Schiess, M. C. Longitudinal Connectome-based Predictive Modeling for REM Sleep Behavior Disorder from Structural Brain Connectivity. Proceedings of SPIE Medical Imaging (2018).

Machine learning-based image (and video) segmentation
We develop and adapt pipelines for image and video segmentation. These pipelines can be adapted to multiple use cases by leveraging machine learning techniques that learn from examples. These tools allow for high-throughput quantitative analysis of large datasets. We have experience with optical images, MRI, and videos.
  1. Faculty/PI Name: Luca Giancardo
  2. Publications:

Here are some examples of newly developed image segmentation pipelines:

[Figure: Machine learning-based image segmentation]

Image/signal-based computational biomarker development
Using modern machine learning approaches, we can discover patterns in unstructured image or time-signal data to develop new types of computational biomarkers for generating hypotheses or predicting outcomes.
  1. Faculty/PI Name: Luca Giancardo
  2. Publications:
    • L Giancardo*, K Roberts and Z Zhao, “Representation Learning for Retinal Vasculature Embeddings”. Fetal, Infant and Ophthalmic Medical Image Analysis. FIFI 2017, OMIA 2017. Lecture Notes in Computer Science, vol 10554. Springer, Cham, 2017.
    • T Arroyo-Gallego, M Ledesma-Carbayo, A Sanchez-Ferro, I Butterworth, C Mendoza, M Matarazzo, P Montero, R Lopez-Blanco, V Puertas-Martin, R Trincado and L Giancardo*. Detection of Motor Impairment in Parkinson’s Disease via Mobile Touchscreen Typing. IEEE Transactions on Biomedical Engineering, 64, 1994–2002, 2017.
    • L Giancardo*, A Sanchez-Ferro, T. Arroyo-Gallego, I. Butterworth, C.S. Mendoza, P. Montero, M. Matarazzo, A. Obeso, M. L. Gray and San José Estepar, “Computer keyboard interaction as an indicator of early Parkinson's disease”, Scientific Reports, 6(34468), 2016.
    • L Giancardo*, A Sanchez-Ferro, I Butterworth, C Sanchez-Mendoza and J M Hooker, “Psychomotor Impairment Detection via Finger Interactions with a Computer Keyboard”, Scientific Reports, 5(9678), 2015.

Genome3D Project
Genome3D is a model-view framework for displaying genomic and epigenomic data within a three-dimensional physical model of the human genome.
  1. Faculty/PI Name: Jim Zheng
  2. Publications:
    • Asbury, T., Mitman, M., Tang, J., & Zheng, W. (2010). Genome3D: a viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome. BMC Bioinformatics, 11(1), 444. http://dx.doi.org/10.1186/1471-2105-11-444
  3. Website Link: http://www.genome3d.org/

Ontology Fingerprints
An Ontology Fingerprint for a gene or a disease is the set of Gene Ontology terms overrepresented in the PubMed abstracts linked to that gene or disease, along with those terms' corresponding enrichment p-values.
  1. Faculty/PI Name: Jim Zheng
  2. Publications:
    • Tsoi, L., Boehnke, M., Klein, R., & Zheng, W. (2009). Evaluation of genome-wide association study results through development of ontology fingerprints. Bioinformatics, 25(10), 1314-1320. http://dx.doi.org/10.1093/bioinformatics/btp158
    • Qin, T., Matmati, N., Tsoi, L., Mohanty, B., Gao, N., & Tang, J. et al. (2014). Finding pathway-modulating genes from a novel Ontology Fingerprint-derived gene network. Nucleic Acids Research, 42(18), e138-e138. http://dx.doi.org/10.1093/nar/gku678
  3. Website Link: http://www.ontologyfingerprint.org/
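
The description above does not say which statistical test produces the enrichment p-values. As an illustration only, one common way to score overrepresentation of a term is a one-sided hypergeometric test. The sketch below assumes that test; the function name, parameter names, and counts are hypothetical and are not taken from the Ontology Fingerprints software.

```python
from math import comb


def enrichment_p_value(total_abstracts, abstracts_with_term,
                       linked_abstracts, linked_with_term):
    """One-sided hypergeometric p-value, P(X >= linked_with_term).

    Assumption: the abstracts linked to a gene are treated as a random draw
    (without replacement) from the whole corpus. This is an illustrative
    choice, not necessarily the test used by the original tool.
    """
    N, K = total_abstracts, abstracts_with_term
    n, k = linked_abstracts, linked_with_term
    upper = min(K, n)  # largest possible number of term hits in the draw
    # Sum the probability of every outcome at least as extreme as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Hypothetical counts: 10,000 abstracts in the corpus, 200 mention the
    # term, 50 are linked to the gene, and 8 of those mention the term.
    print(enrichment_p_value(10_000, 200, 50, 8))
```

A small p-value here would suggest that the term appears among the gene's linked abstracts more often than chance alone would explain, which is the sense of "overrepresented" used in the description.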

RaPID
RaPID is an ultra-fast tool for the identification of identity-by-descent segments among genotyped individuals.
  1. Faculty/PI Name: Degui Zhi
  2. Publications:
    • Naseri, A., Liu, X., Zhang, S., & Zhi, D. (2017). Ultra-fast Identity by Descent Detection in Biobank-Scale Cohorts using Positional Burrows-Wheeler Transform. http://dx.doi.org/10.1101/103325
  3. Website Link: https://github.com/ZhiGroup/RaPID

HapSeq2
HapSeq2 is a program for genotype calling and haplotype phasing from next-generation sequencing data using haplotype information from jumping reads.
  1. Faculty/PI Name: Degui Zhi
  2. Publications:
  3. Website Link: https://github.com/ZhiGroup/HapSeq2