Development of Enabling Technologies for Genomic Analysis
Author: Su Wang, MBE (2022)
Primary advisor: Arif Harmanci, PhD
Committee members: W. Jim Zheng, PhD; Degui Zhi, PhD
PhD thesis, The University of Texas School of Biomedical Informatics at Houston.
ABSTRACT
Background: Advances in sequencing techniques have promoted gene expression profiling. Numerous methods have also been developed to estimate genetic relatedness, or kinship. However, many problems remain that limit gene expression data utility, governance, and kinship inference. The expanding size of gene expression datasets, driven by the growing number of samples and diverse sequencing conditions, makes both local storage and online sharing expensive. In addition, current kinship estimation approaches suffer from high computational requirements, untenable assumptions of homogeneous population ancestry, and genetic privacy concerns, especially when finding unknown familial relationships in third-party databases and reporting sensitive population-level statistics.
Methods and Results: Here, we present a set of enabling methods based on statistical modeling to address these limitations. To improve gene expression data utility and governance, we develop a model-driven approach for data compression. The model-driven compression reduces data redundancy and lowers the cost of both data storage and transmission. For kinship inference, we present a secure federated kinship estimation framework and implement a secure kinship estimator that uses homomorphic encryption-based primitives to compute relatedness between samples held at two different sites while the genotype data is kept confidential.
Conclusion: Collectively, our model-driven methods enable broader use of genetics and genomics data.