August 2020

COVID-19 Houston Datathon

Registration Deadline: August 25, 2020

Co-organizers:
Xiaoqian Jiang¹ (UTHealth), Genevera Allen² (Rice University), Devika Subramanian (Rice University),
Assaf Gottlieb (UTHealth), Ioannis Kakadiaris (University of Houston), Yejin Kim (UTHealth)

Sponsor:
¹Center for Secure Artificial intelligence For hEalthcare (SAFE)
School of Biomedical Informatics, UTHealth
And
²Center for Transforming Data to Knowledge (the D2K Lab), Rice University
And
Gulf Coast Consortia (GCC) and the GCC cluster of AI in Healthcare

Project Manager:
Marijane Detranaltes (UTHealth)

Steering Committee:
Shayan Shams (UTHealth), Ananth V Annapragada (Texas Children’s Hospital), Kai Zhang (UTHealth)

Architectural Support: Robert III Jolly, David Ha, Luyao Chen, Marcos Hernandez
Logistic Support: Queen Chambliss, Angela Wilkes, Xiaohong Bi
Student Volunteers: Tongtong Huang, Yan Chu

The competition is hosted on Kaggle at the link below; please read the notes that follow.

https://www.kaggle.com/c/covid19houstondatathon/overview

Each participant should sign up under “Team”. Note that a “team” here means an individual participant; the Datathon is not intended for group participation.
Download the data from “Data”; there are 7 datasets in total.
Ask any questions under “Discussion”.
Submit notebooks and predictions via “Notebooks” and “Submit Predictions”.

About the Datathon

The COVID-19 Houston Datathon is an online challenge to predict the regional hospitalization and mortality patterns of COVID-19 in Houston, Texas. This Datathon is jointly organized and sponsored by the Center for Secure Artificial intelligence For hEalthcare at the UTHealth School of Biomedical Informatics and the Data to Knowledge lab at Rice University. Undergraduate, master's, and doctoral students from the institutes within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor) and colleges near TMC are highly encouraged to apply. The event will have up to $1,500 in prizes for the winners. This is an individual-based event (no team participation).

THEME

Objective

The goal is to develop a prediction model using local county-level data to estimate the changes in hospitalization and mortality rates in the greater Houston area encompassing 8 counties (Harris, Fort Bend, Montgomery, Brazoria, Galveston, Liberty, Chambers, and Austin) in the state of Texas, USA.

Problem

Accurate and timely prediction of local pandemic trends has profound implications for medical resource preparation and for evaluating policy adjustments. In this Datathon, we will focus on predicting daily hospitalization cases (COVID-19 general beds + ICU beds) and cumulative mortality cases based on previous observations. We will provide daily hospitalization and mortality statistics (together with infection cases, recovery cases, active cases, and test counts) for the eight Greater Houston counties listed above. In addition, we will provide data on population mobility, demographics, and mask usage, which might contain features related to behavioral patterns affecting transmission.
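As a point of reference for the prediction task described above, the following is a minimal sketch of a naive trend-extrapolation baseline. It is not the organizers' method; the function name, the 7-day window, and the `monotone` flag are illustrative assumptions. The flag reflects the fact that a cumulative mortality series should not decrease.

```python
def linear_trend_forecast(history, horizon=7, window=7, monotone=False):
    """Extrapolate the average daily change seen over the last `window` days.

    history  -- list of past daily values, oldest first
    horizon  -- number of future days to predict
    monotone -- if True, never predict a decrease (useful for cumulative counts)
    """
    if len(history) < 2 or window < 2:
        raise ValueError("need at least two observations and window >= 2")
    recent = history[-window:]
    # Average day-over-day change across the recent window.
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    if monotone:
        slope = max(slope, 0.0)
    last = history[-1]
    # Counts cannot be negative, so clip predictions at zero.
    return [max(last + slope * step, 0.0) for step in range(1, horizon + 1)]


# Hypothetical values, for illustration only.
hospitalization_forecast = linear_trend_forecast([10, 12, 14, 16], horizon=3)
mortality_forecast = linear_trend_forecast([20, 18, 16], horizon=3, monotone=True)
```

A baseline like this is mainly useful as a sanity check: a model that cannot beat simple extrapolation on held-out days probably needs more work.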

Data Sources

County-level mortality, infection, recovery, active cases, test counts, hospitalization: Johns Hopkins COVID-19 tracking data [link]
County-level mask usage: The New York Times [link]
County-level population mobility: Google Mobility Report [link]
County-level data dashboard: School of Public Health, UTHealth [link]
(optional) COVID-19 Control Policies KFF [link]
(optional) Demonstration and protest [link]
(optional) Weather [link]

DATA DESCRIPTION

COVID-19 confirmed cases data

./data/time_series_covid19_confirmed_HOU.csv

Confirmed cases data consists of cumulative confirmed cases for the 8 counties in Greater Houston between 04/01/2020 and 09/06/2020. In addition, longitude, latitude, and FIPS are provided; FIPS may serve as a foreign key to query the mask survey data.
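Because the counts are cumulative, many models work on day-over-day increments instead. The sketch below shows one way to derive them from a list of cumulative values. The function name is hypothetical, and clipping negative differences to zero (which can arise from retroactive data corrections) is an assumption rather than anything the organizers prescribe.

```python
def daily_increments(cumulative):
    """Convert cumulative counts into day-over-day increments.

    The first day has no predecessor, so the result is one element shorter
    than the input. Negative differences are clipped to zero (an assumption).
    """
    return [
        max(today - yesterday, 0)
        for yesterday, today in zip(cumulative, cumulative[1:])
    ]


# Hypothetical cumulative series, for illustration only.
new_cases = daily_increments([100, 105, 105, 103, 110])
```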

Confirmed cases data is in a single .csv file (time_series_covid19_confirmed_HOU.csv) with the format below:

COVID-19 deceased cases data

./data/time_series_covid19_deaths_HOU.csv

Deceased data consists of cumulative deceased cases for the 8 counties in Greater Houston between 04/01/2020 and 09/06/2020. In addition, longitude, latitude, and FIPS are provided; FIPS may serve as a foreign key to query the mask survey data.

Deceased cases data is in a single csv file (time_series_covid19_deaths_HOU.csv) with the format below:

COVID-19 mask usage survey

./data/mask-use-HOU.csv

The mask usage survey was conducted by The New York Times to estimate mask usage by county in the United States. The data comes from over 250,000 online interviews conducted between 07/02/2020 and 07/14/2020. Each interview asked how often the participant wears a mask in public when he or she expects to be within six feet of another person.

The data includes the following fields:

COUNTYFP: The county FIPS code.

NEVER: The estimated share of people in this county who would answer “never” to the question “How often do you wear a mask in public when you expect to be within six feet of another person?”

RARELY: The estimated share of people in this county who would say rarely

SOMETIMES: The estimated share of people in this county who would say sometimes

FREQUENTLY: The estimated share of people in this county who would say frequently

ALWAYS: The estimated share of people in this county who would say always
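The fields above can be joined to the case data through the county FIPS code. The sketch below is illustrative only: it assumes the FIPS value may appear as an integer, a float-like string such as "48201.0", or a zero-padded string depending on the file, and normalizes all of these to one key. The helper names and the sample shares are made up for the example; they are not survey values.

```python
import csv
import io


def normalize_fips(value):
    """Return a 5-character, zero-padded county FIPS string."""
    return str(int(float(value))).zfill(5)


def load_mask_shares(csv_text):
    """Map normalized FIPS -> share answering FREQUENTLY or ALWAYS."""
    shares = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = normalize_fips(row["COUNTYFP"])
        shares[key] = float(row["FREQUENTLY"]) + float(row["ALWAYS"])
    return shares


# Illustrative rows only; the numbers are invented, not taken from the survey.
SAMPLE = """COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
48201,0.05,0.05,0.10,0.20,0.60
48157,0.04,0.06,0.10,0.25,0.55
"""
mask_shares = load_mask_shares(SAMPLE)
```

With the keys normalized, a lookup such as `mask_shares[normalize_fips(fips_from_case_file)]` works regardless of how the case file stores the code.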

Mask usage survey data is in a single csv file (mask-use-HOU.csv) with the format below:

COVID-19 Hospitalization data

./data/{county_name}_hosp_{end_date}.xlsx

The county-level hospitalization data for the 8 counties in Greater Houston includes COVID-19 patients in general beds, COVID-19 patients in ICU beds (counted separately from general beds), total general beds, and total hospitalization patient census. The dataset is available from SETRAC.

Hospitalization data in each county is stored as a separate xlsx file ({county_name}_hosp_{end_date}.xlsx) with the format below:

[Image: example hospitalization data for each county]

County FIPS and population data

./data/UID_ISO_FIPS_LookUp_Table.csv

FIPS data is used to check county code and population. It’s in a single csv file with the following format:

EVALUATION

Leaderboard

The Datathon will involve two rounds of competition, one for each week after 09/07/2020. Participants will have 2 weeks to prepare and fine-tune their models.

In the first round, the evaluation will use data between 09/07/2020 (the beginning of the competition) and 09/13/2020 (the end of the first week), and the top candidates’ performance will be published on a dashboard. Participants should only use data on or before 09/06/2020 to predict the upcoming week.

In the second round, participants can update their models and incorporate data from the first period to make predictions for the next week (09/14/2020 - 09/20/2020). Similarly, participants should only use data on or before 09/13/2020. The submitted solutions will be evaluated based on the ranking score (described under Ranking Score Calculation).

Model preparation	08/26/2020 - 09/06/2020
Round 1 evaluation	09/07/2020 - 09/13/2020
Round 2 evaluation	09/14/2020 - 09/20/2020
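To make the data cutoff concrete, here is a minimal sketch of how training rows and forecast dates might be derived for Round 1, whose cutoff is 09/06/2020. The row layout (a dict with a `date` key) and both function names are assumptions for illustration.

```python
from datetime import date, timedelta


def training_rows(rows, cutoff):
    """Keep only observations dated on or before the cutoff."""
    return [row for row in rows if row["date"] <= cutoff]


def forecast_dates(cutoff, horizon=7):
    """Return the `horizon` calendar days that follow the cutoff."""
    return [cutoff + timedelta(days=k) for k in range(1, horizon + 1)]


ROUND_1_CUTOFF = date(2020, 9, 6)

# Hypothetical observations, for illustration only.
observations = [
    {"date": date(2020, 9, 5), "value": 10},
    {"date": date(2020, 9, 6), "value": 12},
    {"date": date(2020, 9, 7), "value": 15},  # falls in the evaluation week
]
usable = training_rows(observations, ROUND_1_CUTOFF)
round_1_days = forecast_dates(ROUND_1_CUTOFF)
```

Filtering by date before any feature engineering helps avoid accidentally leaking evaluation-week values into the model.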

Round 1 Ranking (09/07/2020 – 09/13/2020)

Rank	ID	Score
1	0003	16
2	0009	20
3	0008	24
4	0006	28
5	0005	32
6	0010	55
7	0007	64
8	0012	71
9	0011	68
10	0013	72
11	0004	78
12	0002	99
13	0001	101

Round 2 Ranking (09/14/2020 – 09/21/2020)

Rank	ID	Score
1	0008	24
2	0006	28
3	0010	28
4	0009	29
5	0005	31
6	0003	33
7	0001	51
8	0007	76
9	0004	79
10	0014	79
11	0011	81
12	0012	85

Combined Ranking (09/07/2020 – 09/21/2020)

Rank	ID	Score
1	0003	16.5
2	0008	20.5
3	0009	24
4	0006	27
5	0005	32
6	0010	52
7	0007	60
8	0011	64
9	0012	71
10	0004	73
11	0001	88

Ranking Score Calculation

We will use the mean squared logarithmic error (MSLE) of the hospitalization and deceased case predictions to evaluate the performance of submitted models on each county. Final scores will be based on the sum of each participant's rankings across counties. We will provide the evaluation code.
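The sum-of-rankings idea can be sketched as follows. This is an illustration, not the official evaluation code: the function name and input layout are hypothetical, ties are left in input order rather than resolved by a secondary metric, and the text does not state whether hospitalization and mortality are ranked separately, so the sketch ranks a single error value per county.

```python
def rank_sum_scores(county_errors):
    """Sum each participant's rank over all counties (lower is better).

    county_errors -- {county: {participant_id: error}}
    returns       -- {participant_id: sum of per-county ranks}
    """
    totals = {}
    for errors in county_errors.values():
        # Rank 1 goes to the smallest error within this county.
        ordered = sorted(errors.items(), key=lambda item: item[1])
        for position, (participant, _error) in enumerate(ordered, start=1):
            totals[participant] = totals.get(participant, 0) + position
    return totals


# Hypothetical error values, for illustration only.
example_scores = rank_sum_scores({
    "Harris": {"A": 0.1, "B": 0.3, "C": 0.2},
    "Galveston": {"A": 0.2, "B": 0.1, "C": 0.4},
})
```

Summing ranks rather than raw errors keeps a single large county from dominating the final score.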

MSLE is the mean, over the observed data, of the squared differences between the log-transformed true and predicted values, or, written as a formula:

[Image: MSLE formula]

where:

N is the total number of observations

H_i is actual hospitalization value at time i

Ĥ_i is your hospitalization prediction at time i

D_i is actual mortality value at time i

D̂_i is your mortality prediction at time i
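Using the symbols defined above, one plausible written form of the per-target MSLE is sketched below. It assumes the common log(1 + x) convention, which keeps zero counts well defined, and it shows hospitalization and mortality as separate terms because the text does not state how the organizers combine them.

```latex
\mathrm{MSLE}_H = \frac{1}{N}\sum_{i=1}^{N}\bigl(\log(1+H_i) - \log(1+\hat{H}_i)\bigr)^2,
\qquad
\mathrm{MSLE}_D = \frac{1}{N}\sum_{i=1}^{N}\bigl(\log(1+D_i) - \log(1+\hat{D}_i)\bigr)^2
```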

In case of equal MSLE scores on the leaderboard, we will apply a secondary evaluation metric: the mean squared error (MSE) of the hospitalization and deceased case predictions.

MSE is the mean, over the observed data, of the squared differences between the target and predicted values, or, written as a formula:

[Image: MSE formula]

where all symbols have the same meaning as above.
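For local validation, both metrics can be approximated in a few lines. This is a sketch under stated assumptions, not the official evaluation code: it uses the log(1 + x) convention via `math.log1p` and scores one target (hospitalization or mortality) at a time.

```python
import math


def msle(actual, predicted):
    """Mean squared logarithmic error, using log(1 + x) on both series."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        (math.log1p(a) - math.log1p(p)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return sum(squared) / len(squared)


def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Because MSLE compares values on a log scale, it penalizes relative errors, so a miss of 10 patients matters more in a small county than in a large one.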

Submission

In each competition round, participants are asked to provide predicted hospitalization and mortality cases for the next 7 days. The test/submission file format is identical in both competition rounds. Note that our evaluation metric is independent of Kaggle's default leaderboard ranking settings, so please wait for our final announcement for your correct ranking scores.

Participants can make predictions with any computational method(s). The test data contains an ID column with the format (county_name+date), a hospitalization column, and a mortality column, on which we will compute the error. Note that the date and county information is required, as it determines how submission results are matched to the actual data. The default hospitalization and mortality values in the file are all set to 0. It is a .csv file (test.csv) with the format below:

The submission file should cover all 8 counties (i.e., Harris, Fort Bend, Montgomery, Brazoria, Galveston, Liberty, Chambers, and Austin) in Texas and be saved as a single csv file (submissions.csv) with the format below:

[Image: example test file for submission]
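As a rough illustration of assembling such a file, the sketch below builds the CSV text for all 8 counties over a list of dates. The column names, the underscore separator in the ID, and the date format are assumptions, since the example image is not reproduced here; check them against the provided test.csv before submitting.

```python
import csv
import io

COUNTIES = [
    "Harris", "Fort Bend", "Montgomery", "Brazoria",
    "Galveston", "Liberty", "Chambers", "Austin",
]


def build_submission(predictions, dates):
    """Return CSV text with one row per (county, date).

    predictions -- {(county, date_str): (hospitalization, mortality)}
    Missing entries fall back to 0, mirroring the defaults in the test file.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names are hypothetical; use the ones from test.csv.
    writer.writerow(["ID", "hospitalization", "mortality"])
    for county in COUNTIES:
        for day in dates:
            hosp, deaths = predictions.get((county, day), (0, 0))
            writer.writerow([f"{county}_{day}", hosp, deaths])
    return buffer.getvalue()


# Hypothetical prediction values, for illustration only.
submission_text = build_submission(
    {("Harris", "2020-09-07"): (100, 2000)},
    ["2020-09-07", "2020-09-08"],
)
```

Writing `submission_text` to submissions.csv then produces the single file the rules ask for.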

RULES

Participants are asked to submit self-contained source code (e.g., a Jupyter Notebook as a kernel).
Use of external data is encouraged; the goal is to predict future trends.
Top contestants are asked to prepare summary slides to describe their models at the end of the Datathon and make a presentation to other contestants in a virtual session.

PRIZES

A total of $1,500

First place: $500 (GCC sponsored)
Second place: $300 (UTHealth sponsored)
Third Place: $200 (Rice sponsored)

Institution Specific Prizes:

Top Rice Student: $250
Top UTHealth Student: $250

In addition, participating students will receive souvenirs sponsored by the GCC and the GCC cluster of AI in Healthcare.

FAQS


FREQUENTLY ASKED QUESTIONS

Undergraduate, master's, and doctoral students from the institutes within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor) and colleges near TMC are highly encouraged to apply. Those who are affiliated with the Center for Secure Healthcare Machine Learning are not eligible to participate.

No, this datathon is completely free!

This is a coding datathon. You are expected to have basic programming skills and machine learning knowledge.

Yes! We have cash prizes for the winners.

The evaluation will be conducted fairly with predetermined ranking scores on future observational data.

If you have a question that is not listed here, contact Dr. Xiaohong Bi.

Find additional frequently asked questions (FAQs) and answers here:
https://docs.google.com/document/d/1k1yJu7igk2uwUWde4FmKwN1dN-vrqGglBHzAgWIEWUo