
SBMI Healthcare Machine Learning Datathon

February 1 - 2, 2020

School of Biomedical Informatics (SBMI), University of Texas Health Science Center at Houston (UTHealth)

E6 Level, 7000 Fannin St., Houston, TX 77030

Co-organizers: Xiaoqian Jiang, Yejin Kim

Project Manager: Marijane deTranaltes

Sponsor: Vir Biotechnology

First prize: $1200
Second prize: $600
Third prize: $300

Architectural/Content/Logistical Support: Judy Young, David Ha, Luyao Chen, Queen Chambliss, Marcos Hernandez, Angela Wilkes

Steering Committee: Jing Tang, Shuyu Zheng, Pora Kim, Fei Wang, Jim Zheng, Shaghayegh Agah, Steve Wong, Rong Xu, Santiago Segarra


The 2nd SBMI Healthcare Machine Learning Datathon is calling capable and motivated undergraduate and graduate students from Gulf Coast Consortia institutions and other Houston-area universities. Come join us for this great opportunity to challenge your coding skills, meet new people, and enjoy the gathering of young hackers. This 24-hour Datathon is organized by the Center for Secure Artificial intelligence For hEalthcare (SAFE) at the School of Biomedical Informatics at UTHealth. The event is sponsored by Vir Biotechnology, with a total of $2,100 in prizes for the winners. Undergraduate, master’s, and doctoral (1st and 2nd year only) students from institutes within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor) and colleges in the vicinity of TMC are highly encouraged to apply. This is an individual-based event (no team participation).


Identifying new treatments for diseases is a long-standing goal of medicine. In the traditional setting, drug discovery is very costly. Systematically integrating previous results and knowledge could change the game by identifying highly promising drugs, saving cost and speeding up discovery.

High-throughput drug screening using computational approaches has the potential to substantially improve cost-efficiency by automatically estimating drug sensitivity based on genomic and pharmacological data [1–7]. These computational methods use drug sensitivity data from certain cell lines to predict promising drugs that are likely to have high sensitivity in other cell lines. In this Datathon, participants are asked to build a prediction model that ranks promising drugs for given cancer cell lines.

Key words: drug repositioning, collaborative filtering, recommender system, cold start, graph convolutional neural network, tensor factorization, random walk.



The goal is to use machine learning to predict drug sensitivity in given cell lines. Participants should rank the drugs that are likely to be sensitive (i.e., relative inhibition > 50%) in the cell lines given in the test set. A key challenge is that the test cell lines have limited experimental drug response data in the training set. Participants are encouraged to use the relatively dense observations from other types of cancer cell lines to make predictions for these data-poor cancer cell lines.

Cold-start problem

A key challenge lies in the imbalance of experimental data across tissues. For example, a large amount of drug sensitivity data has accumulated for cell lines from major tissues (such as lung and breast). Traditional methods focus on these commonly studied tissues [8–15]: they take known drug response data for certain cell lines and attempt to predict the responses to other drugs in other cell lines within the data-rich tissues.

In this Datathon, we will focus on ranking drugs for pancreatic cancer, which is deadly and incurable. Studies on pancreatic cancer cell lines are limited, and the training data for machine learning models is insufficient. A potential way to mitigate this data paucity is to use a common feature, the gene expression level, because different tissues partly share biological commonality in terms of gene expression and therefore respond to drugs in similar ways [16]. We will provide the cells’ gene expression levels as external features, and participants need to incorporate these features into the prediction model.
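One simple way to act on this idea is to score drugs for an unseen cell line by borrowing labels from the training cell lines whose expression profiles are most similar. The sketch below is only an illustration under stated assumptions: the function names, the cosine similarity, the choice of k, and the toy identifiers and values are all hypothetical and are not part of the provided data or of any required method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def transfer_scores(target_expr, train_expr, train_labels, k=2):
    """Score drugs for a cold-start cell line from its k most similar training cell lines.

    target_expr : expression vector of the cold-start cell line
    train_expr  : {cell_id: expression vector} for data-rich cell lines
    train_labels: {cell_id: {drug_id: 0 or 1}} binarized sensitivity labels
    Returns {drug_id: similarity-weighted mean label}.
    """
    neighbours = sorted(
        ((cosine(target_expr, vec), cell) for cell, vec in train_expr.items()),
        reverse=True,
    )[:k]
    totals, weights = {}, {}
    for sim, cell in neighbours:
        if sim <= 0:
            continue  # ignore dissimilar neighbours
        for drug, label in train_labels.get(cell, {}).items():
            totals[drug] = totals.get(drug, 0.0) + sim * label
            weights[drug] = weights.get(drug, 0.0) + sim
    return {drug: totals[drug] / weights[drug] for drug in totals}

# Toy example (hypothetical identifiers and values)
train_expr = {
    "lung_1": [5.0, 0.1, 2.0],
    "breast_1": [0.2, 4.0, 0.1],
    "lung_2": [4.5, 0.3, 1.8],
}
train_labels = {
    "lung_1": {"drugA": 1, "drugB": 0},
    "breast_1": {"drugA": 0, "drugB": 1},
    "lung_2": {"drugA": 1},
}
pancreas_expr = [4.8, 0.2, 2.1]
scores = transfer_scores(pancreas_expr, train_expr, train_labels, k=2)
print(scores)
```

In the toy data the two lung-like profiles are the nearest neighbours, so drugA receives a higher score than drugB. A real model would likely combine such expression-based similarity with drug features.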

Figure 1. Number of tested drugs in each cell line. The drug distribution has a long tail.

Figure 2. Cold start problem. Participants are required to predict drug sensitivity of cells from pancreatic tissues, which are not included in the training set.


Sensitivity of drug and cell pair

The training data give the sensitivity for each (drug, cell) pair. We use relative inhibition (RI) as the drug sensitivity measure. The RI has been binarized: 1 if the drug shows efficacy in the cell, and 0 otherwise.

The training data are in a single CSV file (train.csv) or a binary pickle file (train.pkl) with the format below:

Drug’s features
./drug/fingerprint.tsv (optional)
./drug/profiles.csv (optional)

One of the most important features for drug repositioning is a drug’s target genes. Drugs suppress specific target genes, which in turn modifies the disease. We will provide the drugs’ features as TSV files and binary pickle files.

(Optional) Additionally, one can further investigate the drugs’ chemical features using molecular structure. Contestants can freely use Molecular ACCess System (MACCS) fingerprints [17] and native chemical structures to boost prediction performance. MACCS fingerprints contain 166 chemical structure features, such as the number of oxygens, S-S bonds, and rings. In addition, we represent each drug’s native chemical structure using SMILES. SMILES is a linear notation that represents a chemical compound in a unique way: atoms are represented by their atomic symbols (e.g., C for carbon), and special characters represent relationships between atoms (e.g., “=”: double bond; “#”: triple bond; “.”: ionic bond; “:”: aromatic bond) [18]. SMILES can provide a richer feature space that strictly represents functional substructures and expresses structural differences such as a compound’s chirality [19].
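A common way to use binary fingerprints such as MACCS keys is to compare drugs by Tanimoto similarity, the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below assumes each fingerprint has already been loaded as a list of 0/1 values; the function name and the toy 8-bit vectors are hypothetical, and the layout of fingerprint.tsv is not assumed here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length 0/1 fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)      # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)     # bits set in at least one
    return both / either if either else 0.0

# Toy 8-bit fingerprints (real MACCS fingerprints have 166 keys)
fp_x = [1, 1, 0, 0, 1, 0, 0, 1]
fp_y = [1, 0, 0, 0, 1, 1, 0, 1]
print(tanimoto(fp_x, fp_y))  # 3 shared bits / 5 bits set in either = 0.6
```

Such pairwise similarities could serve, for example, as drug-side information in a collaborative filtering model.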

Cell line’s features.
./cell/profiles.csv (optional)

The cell lines come from 14 different tissues, including lung, ovary, and skin, and 14 different types of cancer, including carcinoma, adenocarcinoma, and melanoma. We provide the gene expression profile of each cell line in Fragments Per Kilobase of transcript per Million mapped reads (FPKM) [20,21]. These gene expression profiles are an accurate quantification of the cell’s genetic status. Note that some cell lines have missing gene expression.
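FPKM values are typically heavy-tailed, so a log2(FPKM + 1) transform is a common preprocessing step, and missing values need some form of handling before modelling. The sketch below is one possible approach, not a prescribed one: it assumes profiles are held as lists with None for missing entries, and it uses per-gene mean imputation purely as an illustrative choice. All names and values are hypothetical.

```python
import math

def log_fpkm(profile):
    """Apply log2(FPKM + 1) to one profile, keeping None for missing genes."""
    return [None if v is None else math.log2(v + 1.0) for v in profile]

def fill_missing(profiles):
    """Replace None with the per-gene mean over the cell lines that have that gene."""
    n_genes = len(next(iter(profiles.values())))
    means = []
    for g in range(n_genes):
        observed = [p[g] for p in profiles.values() if p[g] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return {
        cell: [means[g] if v is None else v for g, v in enumerate(p)]
        for cell, p in profiles.items()
    }

# Toy FPKM profiles for two genes (hypothetical values)
raw = {"cell_1": [0.0, 3.0], "cell_2": [1.0, None], "cell_3": [3.0, 15.0]}
logged = {cell: log_fpkm(p) for cell, p in raw.items()}
filled = fill_missing(logged)
print(filled["cell_2"])  # [1.0, 3.0]: the missing gene takes the mean of 2.0 and 4.0
```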

(Optional) We also provide cell line profiles with identifiers that link to external databases.


Contestants are asked to select the 20 most efficacious drugs for each of the 44 pancreatic cancer cell lines given in the test set. Contestants should predict and rank the efficacious drugs with a computational method. The test data contain the drug and cell line pairs whose sensitivity we would like to predict. It is a CSV file (test.csv) with the format below:

The submission file (submission.csv) should be formatted as:

The lowest ranks (0, 1, 2, 3, ...) correspond to the most efficacious drugs. Submissions will be judged on a ranking score, the normalized discounted cumulative gain (NDCG).
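Turning model scores into zero-based ranks can be sketched as follows. This is a minimal illustration that assumes higher scores mean more likely efficacy; the function name and the toy scores are hypothetical, and the column layout of submission.csv is deliberately not assumed.

```python
def top_k_ranking(scores, k=20):
    """Return [(drug_id, rank), ...] for the k highest-scoring drugs.

    Rank 0 is the most efficacious drug. Ties are broken by drug_id
    so that the output is deterministic.
    """
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(drug, rank) for rank, (drug, _) in enumerate(ordered[:k])]

# Toy predicted scores for one cell line (hypothetical values)
cell_scores = {"drugA": 0.91, "drugB": 0.15, "drugC": 0.67}
print(top_k_ranking(cell_scores, k=2))  # [('drugA', 0), ('drugC', 1)]
```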

As our ultimate goal is to rank drugs by likelihood of efficacy and help prioritize in vitro drug experiments, we will measure the accuracy of the model as ranking performance, using normalized discounted cumulative gain (NDCG). In information retrieval, cumulative gain is the sum of the relevance values (here, relative inhibition) of the high-ranked drugs for each query (cell). Discounted cumulative gain (DCG) is the sum of the relevance scores in the top of the ranking list, each discounted according to its position. The formula for DCG accumulated over the top-20 ranking list is
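One standard form, assumed here, uses a logarithmic position discount, where rel_i is the relevance of the drug at position i of the ranking (i = 1 is the top-ranked drug, which corresponds to rank 0 in the submission):

```latex
\mathrm{DCG}_{20} = \sum_{i=1}^{20} \frac{rel_i}{\log_2(i + 1)}
```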

DCG will be high if highly efficacious drugs are ranked high and/or if highly efficacious drugs are ranked higher than marginally efficacious drugs. The DCG score can be normalized by the maximum DCG or ideal DCG (IDCG) in which the ranking is perfectly matched:
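In symbols, writing IDCG_20 for the DCG of the ideal ordering of the same cell line’s drugs over the top 20 positions (the subscript notation is chosen here for illustration):

```latex
\mathrm{NDCG}_{20} = \frac{\mathrm{DCG}_{20}}{\mathrm{IDCG}_{20}}
```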

We will use the NDCG at top 20 (NDCG20), averaged over all 44 cell lines, as the final performance measure of this Datathon.
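For local validation, the metric can be sketched as below. This is not the official scorer: it assumes the common log2 position discount and binarized relevance, and the function names and toy data are hypothetical.

```python
import math

def dcg_at_k(relevances, k=20):
    """DCG over the first k relevances; list position 0 is the top-ranked drug."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_drugs, true_relevance, k=20):
    """NDCG@k for one cell line.

    ranked_drugs  : drug ids ordered from most to least efficacious
    true_relevance: {drug_id: relevance}, e.g. binarized relative inhibition
    """
    gains = [true_relevance.get(drug, 0.0) for drug in ranked_drugs]
    ideal = sorted(true_relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mean_ndcg(predictions, truth, k=20):
    """Average NDCG@k over all cell lines present in truth."""
    return sum(ndcg_at_k(predictions.get(c, []), truth[c], k) for c in truth) / len(truth)

# Toy check with one cell line (hypothetical labels)
truth = {"cell_1": {"drugA": 1, "drugB": 0, "drugC": 1}}
perfect = {"cell_1": ["drugA", "drugC", "drugB"]}
worse = {"cell_1": ["drugB", "drugA", "drugC"]}
print(mean_ndcg(perfect, truth))  # 1.0
print(mean_ndcg(worse, truth))    # below 1.0, since an inactive drug is ranked first
```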

Useful links and literature:


  • Participants are required to submit source code (e.g., a Jupyter Notebook) in a self-contained form.
  • Downloading data from our server and saving it locally for use after the competition is not permitted.
  • Privately sharing data outside of our provided environment during the competition is not permitted.
  • Use of external data is permitted, provided it does not directly relate to the labels in the official competition data.
  • Participants must use an algorithmic approach to classify the segments. Any changes to the methodology must be done in an automated way, so that your approach will generalize to new subjects.
  • Top contestants are asked to prepare summary slides to describe their models at the end of the Datathon.
  • The top three participants will be asked to give a short presentation, and those on the leaderboard (top 10) may have an opportunity to publish their results in a special issue of a journal.





School of Biomedical Informatics (SBMI), University of Texas Health Science Center at Houston (UTHealth)
UCT Classrooms 612 & 614
7000 Fannin St., Houston, TX 77030

February 1, 2020

  • 1:00pm Opening remarks by the organizer Dr. Xiaoqian Jiang and Dean Jiajie Zhang
    • Challenge projects announcement and greetings from the sponsor
  • 1:30pm Warm up
    • Environment preparation
  • 2:00pm Enjoy Datathon!!

February 2, 2020

  • 2:00pm End of Datathon
  • 2:30pm Announcement of Awardees
    • Mrs. Marijane deTranaltes
  • 3:00pm Demonstrations



Undergraduate or graduate students (including 1st- and 2nd-year PhD students) from a Texas institute, from a program at an institute within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor), or from colleges in the vicinity of TMC. Those who are affiliated with the Center for Secure Healthcare Machine Learning are not eligible to participate.
No, this Datathon is completely free!
Just bring your laptop and charger, student ID, and your brilliant mind. Of course, feel free to bring things that can help you - earbuds, clothes/blankets, sleeping bag if you plan on sleeping, etc.
This is a coding Datathon. You are expected to have basic programming skills and machine learning knowledge.
Yes! There will be meals, snacks, drinks, coffee and more!
YES! There is a total prize of $2,100 for the winners sponsored by Vir Biotechnology.
The Datathon will be held at the University Center Tower (UCT) of UTHealth, Houston. The address is Room 614, 7000 Fannin St., Houston, TX 77030. Public parking is available in the UCT garage. We will validate UCT parking tickets for those who drive. If you come by Houston METRORail, the TMC Transit Center Station on the Red Line is just across the street. We will not provide travel reimbursement.
Our panel of judges consists of faculty members of the School of Biomedical Informatics at UTHealth. Your projects will be judged on usefulness, design, difficulty, and creativity. Top selected projects will be demonstrated to the group at the end of the event.
If you have a question that is not listed here, contact Dr. Xiaohong Bi.


  1. Guan N-N, Zhao Y, Wang C-C, Li J-Q, Chen X, Piao X. Anticancer Drug Response Prediction in Cell Lines Using Weighted Graph Regularized Matrix Factorization. Mol Ther Nucleic Acids. 2019;17:164-174.
  2. Menden MP, Iorio F, Garnett M, et al. Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One. 2013;8(4):e61318.
  3. Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol. 2014;15(3):R47.
  4. Donner Y, Kazmierczak S, Fortney K. Drug Repurposing Using Deep Embeddings of Gene Expression Profiles. Mol Pharm. 2018;15(10):4314-4325.
  5. Pushpakom S, Iorio F, Eyers PA, et al. Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov. 2019;18(1):41-58.
  6. Yang J, Li A, Li Y, Guo X, Wang M. A novel approach for drug response prediction in cancer cell lines via network representation learning. Bioinformatics. 2019;35(9):1527-1535.
  7. Madhukar NS, Khade PK, Huang L, et al. A Bayesian machine learning approach for drug target identification using diverse data types. Nat Commun. 2019;10(1):5221.
  8. Sun X, Vilar S, Tatonetti NP. High-throughput methods for combinatorial drug discovery. Sci Transl Med. 2013;5(205):205rv1.
  9. Celebi R, Bear Don’t Walk O 4th, Movva R, Alpsoy S, Dumontier M. In-silico Prediction of Synergistic Anti-Cancer Drug Combinations Using Multi-omics Data. Sci Rep. 2019;9(1):8949.
  10. Preuer K, Lewis RPI, Hochreiter S, Bender A, Bulusu KC, Klambauer G. DeepSynergy: predicting anti-cancer drug synergy with Deep Learning. Bioinformatics. 2018;34(9):1538-1546.
  11. Huang L, Li F, Sheng J, et al. DrugComboRanker: drug combination discovery based on target network analysis. Bioinformatics. 2014;30(12):i228-i236.
  12. Bansal M, Yang J, Karan C, et al. A community computational challenge to predict the activity of pairs of compounds. Nat Biotechnol. 2014;32(12):1213-1222.
  13. Zhao X-M, Iskar M, Zeller G, Kuhn M, van Noort V, Bork P. Prediction of drug combinations by integrating molecular and pharmacological data. PLoS Comput Biol. 2011;7(12):e1002323.
  14. Chen G, Tsoi A, Xu H, Jim Zheng W. Predict effective drug combination by deep belief network and ontology fingerprints. Journal of Biomedical Informatics. 2018;85:149-154. doi: 10.1016/j.jbi.2018.07.024
  15. Tang J, Gautam P, Gupta A, et al. Network pharmacology modeling identifies synergistic Aurora B and ZAK interaction in triple-negative breast cancer. NPJ Syst Biol Appl. 2019;5:20.
  16. Xu C, Ai D, Shi D, et al. Accurate Drug Repositioning through Non-tissue-Specific Core Signatures from Cancer Transcriptomes. Cell Rep. 2019;29(4):1055.
  17. Polton DJ. Installation and operational experiences with MACCS (Molecular Access System). Online Review. 1982;6(3):235-242. doi: 10.1108/eb024099
  18. Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent Sci. 2018;4(1):120-131.
  19. Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics. 2018;19(Suppl 19):526.
  20. Barretina J, Caponigro G, Stransky N, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603-607.