Challenge Use Case Task

…A national health data science challenge established to advance human health through machine learning

Sepsis 2 Onset and Mortality Among Adult Inpatients

Introduction

Early identification of sepsis can mean the difference between life and death for the patient, and it is mission critical to healthcare providers in terms of both quality and cost. This use case is also well supported by Cerner Health Facts data. (Note: 1) these data are deidentified, and 2) we have more complete information for inpatients than for outpatients.) We focus on inpatients because they form a much larger sample of vulnerable patients, in an environment that may feature a smaller nurse/patient ratio.

Challenge Tasks and Data

The Challenge has three tasks:

Sepsis 2 onset risk prediction (4 hours before onset);
30-day mortality risk prediction among sepsis patients (at the time of onset); and
Innovation regarding interpretability.

We included all hospitalized adult patients (at least 16 years old) with suspected infection. Sepsis 2 patients must meet at least 2 of the following SIRS criteria:

Body temperature > 100.4 °F or < 95.0 °F
Respiratory rate (RR) > 20 breaths/min or PaCO2 < 32 mmHg
Heart rate (HR) > 90 beats/min
White blood cell count (WBC) > 12k or < 4k cells/mm3, or bands > 10%
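
The criteria above can be read as a simple counting rule. The following is a minimal sketch, not the organizers' cohort-building code: the function names `count_sirs` and `meets_sirs` and all parameter names are hypothetical, temperature is assumed to be in degrees Fahrenheit, WBC in thousands of cells, and a missing measurement (`None`) is assumed never to satisfy a criterion.

```python
from typing import Optional


def count_sirs(temp_f: Optional[float] = None,
               rr: Optional[float] = None,
               paco2: Optional[float] = None,
               hr: Optional[float] = None,
               wbc_k: Optional[float] = None,
               band_pct: Optional[float] = None) -> int:
    """Count how many of the four SIRS criteria are met.

    Names and units are assumptions for illustration only.
    A missing value (None) never satisfies a criterion.
    """
    def gt(x, limit):
        return x is not None and x > limit

    def lt(x, limit):
        return x is not None and x < limit

    criteria = [
        gt(temp_f, 100.4) or lt(temp_f, 95.0),               # temperature
        gt(rr, 20) or lt(paco2, 32),                         # respiration
        gt(hr, 90),                                          # heart rate
        gt(wbc_k, 12) or lt(wbc_k, 4) or gt(band_pct, 10),   # white cells
    ]
    return sum(criteria)


def meets_sirs(**measurements) -> bool:
    """True when at least 2 of the 4 SIRS criteria are met."""
    return count_sirs(**measurements) >= 2
```

For example, `meets_sirs(temp_f=101.0, hr=95)` is true because two separate criteria are met, while `count_sirs(rr=22, paco2=30)` is 1 because both values fall under the same respiratory criterion.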

We excluded patients who 1) are children, and 2) have been in the hospital for less than 8 hours or more than 30 days.

There are 3 critical time points for each patient:

T_admission: The time at which the patient was admitted to the hospital

T_onset: The time at which sepsis 2 onset was identified in the patient

T_discharge: The time at which the patient was discharged from the hospital

We will provide patient demographic and admission data for both tasks.

adm_id	gender	race	admission_type	addission_source	care_setting	age_grp
A100019	Male	Caucasian	Elective	Physician Referral	Care Setting Undefined	60~70
A100032	Female	African American	Emergency	Physician Referral	Care Setting Undefined	50~60
A100034	Male	Caucasian	Elective	Others/unknown	Care Setting Undefined	40~50
A100035	Male	Caucasian	Emergency	Others/unknown	Care Setting Undefined	70~80

Task 1: Sepsis 2 Onset Risk Prediction (4 hours before onset)

Goal: To predict sepsis-2 onset 4 hours before it occurs

We provide clinical events and lab test results between T_admission and T_onset - 4 for each patient, in matrix format. Times are offset by T_admission, so event_time counts from the moment of admission.

adm_id	event_time	A/G Ratio	ALT/SGPT	AST/SGOT	Albumin Quant	Albumin, Serum	Alk Phos, Serum	Amylase, Serum	Anion Gap	...
A100008	0.5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
A100008	2.0	1.2	26.0	38.0	NaN	2.9	75.0	NaN	9.0	...
A100008	3.5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
A100008	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
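
One common way to work with a matrix like this is to collapse each admission's rows into a single feature vector. The sketch below is only an illustration under stated assumptions: the file is assumed to be tab-separated with the literal string `NaN` marking missing values, and the helper name `last_observed` and the inline sample are hypothetical. It keeps, for each admission and each lab column, the most recent non-missing value.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical sample in the layout shown above (tab-separated, "NaN" = missing).
SAMPLE = (
    "adm_id\tevent_time\tA/G Ratio\tALT/SGPT\n"
    "A100008\t0.5\tNaN\tNaN\n"
    "A100008\t2.0\t1.2\t26.0\n"
    "A100008\t3.5\tNaN\tNaN\n"
)


def last_observed(handle):
    """Return {adm_id: {column: latest non-missing value}}."""
    # Sort by admission, then by time, so later rows overwrite earlier ones.
    rows = sorted(csv.DictReader(handle, delimiter="\t"),
                  key=lambda r: (r["adm_id"], float(r["event_time"])))
    features = defaultdict(dict)
    for row in rows:
        for column, raw in row.items():
            if column in ("adm_id", "event_time"):
                continue
            value = float(raw)  # float("NaN") parses to nan
            if not math.isnan(value):
                features[row["adm_id"]][column] = value
    return dict(features)


features = last_observed(io.StringIO(SAMPLE))
print(features)
```

Columns that were never measured for an admission are simply absent from its dictionary, so any downstream model still needs an explicit strategy for missing values.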

Each patient is labeled according to whether sepsis 2 onset was identified.

adm_id	sepsis2
A100001	0
A100002	0
A100003	0
A100004	0
A100005	0
A100006	0

Total data size:

Training data: 106,291 patients (4,910,670 records)

Evaluation data: 35,781 patients (1,651,497 records)

Evaluation: Standard AUC, with randomly supplied samples from the testing cohort. We will test:

Case and control segments from the same patient: segments from the longer term (> 4 hours before sepsis onset) vs. segments close to sepsis onset (= 4 hours before)
Case and control segments from different patients: those who have sepsis onset in the next 4 hours vs. those who do not develop sepsis
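
For reference, the standard AUC equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The following sketch computes it directly from that definition. It is not the organizers' scoring script, and the function name `auc` and the toy labels are illustrative.

```python
def auc(labels, scores):
    """AUC as P(score_case > score_control), ties counted as 0.5.

    Runs in O(n_cases * n_controls): fine for illustration,
    too slow for the full evaluation cohort.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because the AUC depends only on how scores are ranked, submitted probabilities do not need to be calibrated to score well on this metric.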

Task 2: 30-day Mortality Risk Prediction for Patients Identified with Sepsis 2

[Figure: Sepsis 2 identified in patients]

Goal: To predict whether the patient will die in the hospital within 30 days, using up to 48 hours of data before sepsis onset.

We provide the clinical events and lab test results between T_onset - 48 and T_onset - 4 for each sepsis 2 patient, in matrix format. Times are offset by T_onset, so event_time values are negative.

adm_id	event_time	A/G Ratio	ALT/SGPT	AST/SGOT	Albumin Quant	Albumin, Serum	Alk Phos, Serum	Amylase, Serum	Anion Gap	...
A1000019	-47.5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
A1000019	-46.5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
A1000019	-45.5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
A1000019	-45.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...
A1000019	-44.5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...

Each patient is labeled with their mortality status and the time between T_onset and T_discharge.

adm_id	time	mortality
A100079	200.5	0
A100244	78.5	0
A100328	78.5	0
A100388	55.5	0
A100398	117.0	0

Total data size:

Training data: 31,614 patients (940,567 records)

Evaluation data: 10,643 patients (313,991 records)

Evaluation:

Cumulative case/dynamic control ROC. We judge performance at multiple time points to see how well and how early (relative to mortality/discharge) the model M_i can produce a good prediction, measured from T_onset.

Evaluate and compare using R Package timeROC
https://cran.r-project.org/web/packages/timeROC/index.html

sensitivity^C(c,t) = P(M_i > c|T_i < t)
specificity^D(c,t) = P(M_i < c|T_i > t)

Using different time cutoffs t to calculate AUC (in the traditional way) allows one to assess the model's performance in predicting short-term, medium-term, and long-term mortality risk after sepsis onset.
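
To make the cumulative case/dynamic control idea concrete, here is a simplified sketch of the AUC at a single cutoff t. It is not a substitute for timeROC, which additionally reweights patients to account for censoring. This version simply drops patients who were discharged alive at or before t, which is an assumption made for brevity. The function name `cd_auc`, the toy data, and the reading of `time` as hours since onset are all hypothetical.

```python
def cd_auc(times, died, scores, t):
    """Cumulative/dynamic AUC at horizon t, ignoring censoring weights.

    Cases:    died with time <= t (cumulative cases).
    Controls: still observed after t, i.e. time > t (dynamic controls).
    Patients discharged alive at or before t are dropped here;
    this is a simplification of what timeROC does.
    """
    cases = [s for ti, d, s in zip(times, died, scores) if d == 1 and ti <= t]
    controls = [s for ti, d, s in zip(times, died, scores) if ti > t]
    if not cases or not controls:
        raise ValueError("no cases or no controls at this horizon")
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data: time since onset (assumed hours), death flag, model score M_i.
times = [24.0, 60.0, 100.0, 200.0, 300.0]
died = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1]

for horizon in (48.0, 168.0):
    print(horizon, cd_auc(times, died, scores, horizon))
```

In the toy data, the patient who dies at 100 hours is a control at the 48-hour cutoff and a case at the 168-hour cutoff, which is exactly what "dynamic control" means.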

Task 3: Innovation Regarding Interpretability

While many machine learning models can perform classification and regression tasks, not all of them yield valid interpretations that could allow findings to better inform clinical decision support.

Interpretability (e.g., automatic decisions on the threshold, finding combined patterns, designing novel visualizations, etc.) cannot be judged without evaluation by human experts. We have assembled a group of machine learning and clinical experts to judge the Challenge innovation track, which will focus on interpretability.

Submitting Your Entry

The prediction result must be submitted via the SECURESTOR submission directory (one is assigned for each team).

For Task 1, please submit the probability that the patient will have sepsis 2 onset in the next 4 hours
For Task 2, please submit the probability for the patient’s mortality within 30 days

The submission must be in CSV (comma-separated) format, with column headers. Below is the sample layout for both tasks 1 and 2.

adm_id	probability
0.98330
A100093	0.34455
A100044	0.12333
A100046	0.23322
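
As an illustration of the required layout, the sketch below writes a header row followed by one comma-separated line per admission using Python's standard `csv` module. The function name `write_submission`, the five-decimal formatting, and the range check are assumptions rather than official requirements.

```python
import csv
import io


def write_submission(handle, predictions):
    """Write (adm_id, probability) pairs as comma-separated text with a header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["adm_id", "probability"])
    for adm_id, prob in predictions:
        # Reject values that cannot be probabilities.
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {adm_id}: {prob}")
        writer.writerow([adm_id, f"{prob:.5f}"])


buffer = io.StringIO()
write_submission(buffer, [("A100093", 0.34455), ("A100044", 0.12333)])
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="")`, where `path` is whatever file name your team uses inside its submission directory.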

Rules

Participants must not download the dataset
Participants are responsible for any additional access/logons created on their server and for keeping their password secret
Solutions must be submitted in the required format by the designated deadline