
Exploring the “Other” Side of Data Science


Tuesday, November 12, 2019

“The responsibility to the personal information is increasing at a great pace.”

Arif Harmanci, M.S., Ph.D.
Assistant Professor, Center for Precision Health

We are fortunate to live in an era of healthcare revolution. Big data and machine learning are disrupting every aspect of healthcare, both in research labs and in the clinic. For example, large computers at SBMI churn through data from millions of people around the clock. We draw insights from these databases to make healthcare less expensive, more effective, and more accessible.

To make the next generation of healthcare practices a reality, the computers need huge databases. For this, everyone is encouraged to share their health-related data, such as genetic, imaging, and electronic health record (EHR) data, with researchers. While there is great promise in using large datasets to improve healthcare, there are also many great challenges. The biggest of these is the sheer size and dimensionality of the data. It is hard to grasp the data computationally even when we use the vast resources at UTHealth. EHR databases hold as many as hundreds of data entries per patient for millions of patients. Genetic and genomic databases are even larger, as they may hold tens of thousands to millions of entries per individual for thousands of individuals.

This is where the “other” side of big data science surfaces. The data come from real people, whether they are patients or healthy individuals, and as such must be treated with the utmost respect. Healthcare data is highly “identifying” of its owner: even a small portion of the data can identify the owner very easily. Moreover, healthcare data is very sensitive. In the wrong hands, it may lead to discrimination against the owner, for example through insurance denials or public defamation. The risk of identification extends from individuals to their families. Healthcare information also cannot be reissued. It is not like a credit card; once it is stolen, it is lost for good. Law enforcement is now tapping into big data science as a means to solve crimes. The Federal Bureau of Investigation (FBI) is using genetic data to solve cold cases, such as that of the Golden State Killer. As good as this sounds, the methods that the FBI uses are fallible, as we have seen people wrongly convicted as a result of using genetic data. As incentives and public pressure encourage more public sharing of medical data, we have to understand and teach the risks associated with using and sharing personal and medical data in the public domain.

So what are we doing about this? At SBMI, we are actively researching ways to build “privacy-aware” data mining and machine learning methods. These methods can analyze medical and personal data while respecting and protecting patients’ privacy. For example, we are using state-of-the-art encryption methods so that data is always encrypted and never seen in the clear, even by those analyzing it. We are developing new ways to hide and share personal data so that individual privacy cannot be breached. It is, however, not an easy task to address all the problems in the field of biomedical data privacy. We are constantly looking for motivated students and researchers with fresh ideas and energy to contribute to these projects. We also offer courses at SBMI related to biomedical data security, ethics, and privacy, such as Biomedical Data Privacy (BMI 5334), Security for Health Information Systems (BMI 5306), and Legal and Ethical Aspects of Health Informatics (BMI 5305).
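For readers curious about how data can be analyzed while it stays encrypted, here is a minimal sketch of one well-known idea, additively homomorphic encryption, using a toy version of the Paillier scheme. It is an illustration only and not necessarily the method used in the research described above. The function names, the tiny primes, and the patient counts are all made up for this example, and keys this small offer no real security.

```python
import math
import random

def keygen(p, q):
    """Build a toy Paillier key pair from two primes (illustrative sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because we use g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:  # the random mask must be invertible mod n
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def add_encrypted(n, c1, c2):
    """Combine two ciphertexts so the result decrypts to the sum of the plaintexts."""
    return (c1 * c2) % (n * n)

def decrypt(n, priv, c):
    """Recover the plaintext using the private key (lam, mu)."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

# Hypothetical scenario: three clinics each encrypt a local patient count.
n, priv = keygen(1009, 1013)          # toy primes, far too small for real use
counts = [120, 85, 42]                # made-up numbers for illustration
ciphertexts = [encrypt(n, c) for c in counts]

# The analyst adds the counts without ever seeing any of them in the clear.
total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add_encrypted(n, total, c)

# Only the private-key holder can read the combined result.
print(decrypt(n, priv, total))
```

In this sketch the party doing the arithmetic handles only ciphertexts, and the plaintext sum becomes visible only to whoever holds the private key. Production systems rely on vetted cryptographic libraries and much larger keys rather than hand-written code like this.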

Arif Harmanci, PhD, MS

"Dr. Harmanci received his Master’s and PhD degrees in Electrical Engineering from the University of Rochester, NY. He next received postdoctoral training at Yale University. His research is focused on building statistical and machine learning methods for the analysis of high-throughput genomic data such as whole genome and transcriptome sequencing."