Our current research activities largely fall into three major groups focusing on different but synergistic areas:
Genome Informatics: Population genetic informatics powered by efficient DNA segment matching. Modern precision medicine is powered by resources from biobanks, epidemiological cohorts, and consumer genetic testing companies that collect genetic and phenotypic information from thousands to millions of people. When sample sizes reach about 0.1% to 1% of a large population, genetic relatedness among samples is ubiquitous. We developed ultra-efficient methods for detecting Identical-by-Descent (IBD) segments, a primary embodiment of genetic relatedness, opening up new research opportunities in population genetic informatics. Funded by NIH grants, we are well-positioned to develop next-generation informatics tools for discovering IBD sharing information, which serves as a foundational data structure enabling new approaches to genetic discovery. We will develop new methods that make IBD segment detection even faster and more accurate, methods for detecting multiway IBD segments shared by a cluster of individuals, genetic association methods that test the correlation between IBD sharing and phenotype sharing, and methods that leverage IBD information for haplotype phasing, genotype imputation, and relatedness inference.
Imaging genetics powered by deep learning (DL)-based phenotyping. Although the phenotypes available for genome-wide association studies (GWAS) seem almost exhausted after more than 15 years of intensive investigation by the community, we believe that new AI and deep learning approaches will offer new opportunities for automatic phenotyping and new phenotype discovery. We use DL models to generate disease “risk scores” from images of unlabeled healthy people. Because these scores capture subtle patterns that are easily missed, they can boost the power of GWAS on relatively rare and insufficiently phenotyped diseases such as Alzheimer’s disease (AD) and diabetic retinopathy (DR). Moreover, we developed a DL approach that can discover new phenotypes via self-supervised learning.
Medical AI: Deep learning systems for predictive modeling of EHR data. We are developing biomedical informatics methods for predictive modeling of Electronic Health Records (EHR). We have been early adopters of deep learning (DL) methodology for predicting disease risk from EHR data since 2018. Our Med-BERT model was the first convincing demonstration that pre-trained contextualized embeddings from large unlabeled EHR data can significantly boost the performance of predictive models with limited local training sets, potentially leading to a new paradigm of self-supervised learning for EHRs. In particular, we develop risk prediction models for COVID-19 clinical outcomes and therapeutic levels of drugs.