
SBMI Datathon 2021 for Stroke Prediction

Registration Deadline: March 25th, 2021

Competition Date: March 27th – 28th, 2021

Shayan Shams, Yejin Kim, Xiaoqian Jiang, Sean Savitz

About the Datathon

In 2017, 7.8 million adults in the U.S. reported having survived a stroke. While deaths attributable to stroke have declined, stroke remains a leading cause of morbidity and disability. By 2030, stroke-related costs are expected to reach $183 billion. Despite early treatment, stroke survivors often have severe long-term disabilities, including both physical and cognitive issues, that require constant monitoring and care from the community. Rehabilitation is essential to recovery and begins soon after the injury, when the brain is especially receptive to processes that can enhance repair. The quantity, quality, and timing of rehabilitation therapy needed to optimize outcomes and remedy disabilities effectively remain unknown. An accurate prediction of functional and cognitive outcomes at the acute stage of stroke is important for planning personalized rehabilitation and for improving communication among patients, families, and clinicians regarding possible outcomes and expectations.


In this Datathon, participants compete to develop algorithms that predict changes in cognitive and Functional Independence Measure (FIM) scores across 18 subcategories during inpatient rehabilitation, where each change is the difference between the admission and discharge FIM scores for that subcategory. The FIM score is used extensively across North America to measure disability. It comprises eighteen assessment items (subcategories) grouped into six sections. The FIM assesses both motor and cognitive functions; an increasing FIM score implies functional improvement, while a decreasing score implies a decline in the patient's functional status.

The FIM score for each subcategory ranges from 1 to 7, where:

7  Complete Independence
6  Modified Independence
5  Supervision
4  Minimal Assistance
3  Moderate Assistance
2  Maximal Assistance
1  Total Assistance or not Testable
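As a minimal sketch of how the prediction target could be derived, the snippet below computes the per-subcategory change from admission and discharge scores. It assumes the change is taken as discharge minus admission (so that a positive value means improvement, in line with the description above); the function name `fim_change` and the sample scores are hypothetical and not part of the competition materials.

```python
FIM_MIN, FIM_MAX = 1, 7  # score range stated in the scale above


def fim_change(admission, discharge):
    """Return the per-subcategory change, assumed to be discharge minus admission."""
    if len(admission) != len(discharge):
        raise ValueError("admission and discharge must have the same length")
    for score in (*admission, *discharge):
        if not FIM_MIN <= score <= FIM_MAX:
            raise ValueError(f"FIM score out of range: {score}")
    return [d - a for a, d in zip(admission, discharge)]


# Hypothetical scores for three subcategories (the real data has 18).
admission = [2, 3, 5]
discharge = [5, 3, 4]
changes = fim_change(admission, discharge)
print(changes)  # [3, 0, -1]
```

With scores bounded between 1 and 7, each change necessarily falls between -6 and 6.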


The participants are expected to develop algorithms to jointly predict changes in FIM score during inpatient rehabilitation in each subcategory from admission to discharge.

Predictive variables

The predictive variables consist of both continuous and categorical variables. Although considerable effort has gone into organizing and cleaning the dataset, participants are expected to use novel strategies to handle missing values in the predictive variables.
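For orientation only, here is a simple baseline for missing values, not the novel strategy the organizers ask for: fill continuous variables with the training median and categorical variables with the training mode. The column names (`age`, `sex`), the use of `None` to mark a missing value, and both helper functions are assumptions made for illustration.

```python
from statistics import median, mode


def fit_imputer(rows, continuous, categorical):
    """Learn one fill value per column from the training rows only."""
    fills = {}
    for col in continuous:
        fills[col] = median([r[col] for r in rows if r[col] is not None])
    for col in categorical:
        fills[col] = mode([r[col] for r in rows if r[col] is not None])
    return fills


def apply_imputer(rows, fills):
    """Return copies of the rows with missing values replaced by the learned fills."""
    return [
        {col: (fills[col] if col in fills and val is None else val)
         for col, val in row.items()}
        for row in rows
    ]


# Hypothetical training rows; None marks a missing value.
train = [
    {"age": 61, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 75, "sex": None},
    {"age": 70, "sex": "F"},
]
fills = fit_imputer(train, continuous=["age"], categorical=["sex"])
filled = apply_imputer(train, fills)
print(fills)  # {'age': 70, 'sex': 'F'}
```

Learning the fill values on the training rows and reusing them on new subjects keeps the procedure automated. A column with no observed values at all would raise `statistics.StatisticsError` in this sketch and would need separate handling.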


In this machine learning challenge, we ask participants to build models in a justifiable manner. Final performance is evaluated using the L1 (Manhattan) distance between the actual and predicted changes of FIM scores across the P subcategories. If participants are tied on performance, additional consideration will be given to model interpretability and the identification of predictive variable importance.
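The L1 (Manhattan) distance can be sketched as follows. The per-patient distance is the sum of absolute differences over the subcategories; how the leaderboard aggregates across patients is not stated here, so averaging over patients in `mean_l1` is an assumption. The last helper, `median_baseline`, is a hypothetical reference point rather than a recommended model: it relies on the fact that a median minimizes the sum of absolute deviations.

```python
from statistics import median


def l1_distance(actual, predicted):
    """Sum of absolute differences between two equal-length vectors."""
    if len(actual) != len(predicted):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mean_l1(actual_rows, predicted_rows):
    """Average per-patient L1 distance (the aggregation is an assumption)."""
    distances = [l1_distance(a, p) for a, p in zip(actual_rows, predicted_rows)]
    return sum(distances) / len(distances)


def median_baseline(train_changes):
    """Constant prediction: the per-subcategory median of the training changes."""
    return [median(column) for column in zip(*train_changes)]


# Hypothetical changes for two patients and three subcategories.
actual = [[3, 0, -1], [1, 2, 0]]
predicted = [[2, 0, 1], [1, 1, 0]]
print(l1_distance(actual[0], predicted[0]))  # 3
print(mean_l1(actual, predicted))            # 2.0

baseline = median_baseline([[3, 0, -1], [1, 2, 0], [2, 1, 0]])
print(baseline)  # [2, 1, 0]
```

A model that cannot beat the constant median prediction on held-out data is unlikely to be competitive under this metric.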

Example of final output:

ID Eating-Change Bathing-Change       Memory-Change
100 5 7 1 ... 3 2
101 2 7 3   2 5
102 4 3 1   1 2
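A minimal sketch of writing predictions in a layout like the one above. The submission file type is not specified here, so CSV is an assumption; only the three column names visible in the example are used, since the remaining subcategory columns are elided in the source. The function `write_predictions` and the sample values are hypothetical.

```python
import csv
import io

# Only the columns named in the example above; the full header has 18 change columns.
COLUMNS = ["ID", "Eating-Change", "Bathing-Change", "Memory-Change"]


def write_predictions(predictions, out):
    """Write one row per patient ID, followed by its predicted changes."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for patient_id, changes in predictions.items():
        if len(changes) != len(COLUMNS) - 1:
            raise ValueError(f"wrong number of predictions for ID {patient_id}")
        writer.writerow([patient_id, *changes])


# Hypothetical predictions keyed by patient ID.
predictions = {100: [2, 1, 0], 101: [0, 3, 1]}
buffer = io.StringIO()
write_predictions(predictions, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file handle instead would produce a file on disk.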

Data Description

The training data are provided in a single CSV file (train.csv) in the format below:

Image of the Train.CSV file

The label comprises the 18 FIM subcategories. Participants are expected to predict a vector of length 18, where each value represents the difference between the admission and discharge FIM scores for one subcategory.


  • Participants are required to submit source code (e.g., Jupyter Notebook) in a self-contained manner.
  • Downloading data from our server and saving those data locally for use after the competition is strictly prohibited.
  • Privately sharing data outside our provided environment during the competition is not permitted.
  • Participants must use an algorithmic approach for prediction. Any changes to the methodology must be done in an automated way, so that the approach can be generalized to new subjects.
  • Use of external data is permitted.
  • All participants are asked to prepare summary slides that will describe their models.
  • The top three participants will be asked to give a short presentation, and the top ten participants on the leaderboard may have an opportunity to publish their results in a special issue of a journal (under negotiation).
  • Each participant may submit a maximum of 10 entries. The entry with the best performance counts for the purposes of final judgment.


A total of $1,500 in prizes, sponsored by UTHealth:

  • First place: $1,000
  • Second place: $300
  • Third place: $200



The competition is open to undergraduate students and graduate students currently enrolled in the first or second year of a master's program or in the first two years of a Ph.D. program at institutions within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor). In addition, qualifying students from the Houston area (e.g., HBU, SHSU, TSU, PVAMU, University of St. Thomas, UH-Clear Lake, UH-Sugar Land, and UH-Victoria) are encouraged to apply.
This competition is free.
Due to the COVID-19 pandemic, the competition will be held remotely. Our team will provide you with the required VPN and access to the coding environment.
This is a coding datathon. You are expected to have mastered basic programming skills and have knowledge of machine learning.
Our panel of experts is composed of faculty members from UTHealth School of Biomedical Informatics. Your project will be judged via an automated leaderboard program; each contestant can only submit 10 times. The top 3 contestants will be asked to make a short presentation on their solution at the end of the event.