…A national data science challenge established to advance human health through machine learning

Sepsis 2 Onset and Mortality Among Adult Inpatients

Introduction


Early identification of sepsis can be the difference between life and death for the patient, and it is mission critical to healthcare providers in terms of quality and cost. This use case is also well supported by Cerner Health Facts data. Note that 1) these data are de-identified, and 2) we have more complete information for inpatients than for outpatients. We focus on inpatients to cover a cohort with a much larger sample of vulnerable patients, in an environment that may feature a smaller nurse/patient ratio.

Challenge Tasks and Data


The Challenge has three tasks:

  1. Sepsis 2 onset risk prediction (4 hours before onset)
  2. 30-day mortality risk prediction among sepsis patients (at the time of onset)
  3. Innovation regarding interpretability

We included all hospitalized adult patients (at least 16 years old) with suspected infection. Sepsis 2 patients must meet at least 2 of the following SIRS criteria:

  1. Body temperature > 100.4 °F or < 95.0 °F
  2. Respiratory rate (RR) > 20/min or PaCO2 < 32 mmHg
  3. Heart rate (HR) > 90/min
  4. White blood cell count (WBC) > 12k or < 4k, or bands > 10%
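
The criteria above can be sketched as a small counting function. This is only an illustration, not the Challenge's cohort-selection code: the function and parameter names are hypothetical, temperature is assumed to be in degrees Fahrenheit, WBC is assumed to be in thousands, and a missing measurement is treated as "criterion not met".

```python
from typing import Optional


def count_sirs_criteria(
    temp_f: Optional[float] = None,      # body temperature, assumed Fahrenheit
    resp_rate: Optional[float] = None,   # breaths per minute
    paco2_mmhg: Optional[float] = None,  # PaCO2 in mmHg
    heart_rate: Optional[float] = None,  # beats per minute
    wbc_k: Optional[float] = None,       # white blood cell count, assumed in thousands
    band_pct: Optional[float] = None,    # band percentage
) -> int:
    """Count how many of the four SIRS criteria are met.

    Assumption of this sketch: a missing value (None) never satisfies
    a criterion.
    """
    def met(value, test):
        return value is not None and test(value)

    criteria = [
        met(temp_f, lambda v: v > 100.4 or v < 95.0),
        met(resp_rate, lambda v: v > 20) or met(paco2_mmhg, lambda v: v < 32),
        met(heart_rate, lambda v: v > 90),
        met(wbc_k, lambda v: v > 12 or v < 4) or met(band_pct, lambda v: v > 10),
    ]
    return sum(criteria)


def meets_sirs(**measurements) -> bool:
    """True when at least 2 of the 4 criteria are met."""
    return count_sirs_criteria(**measurements) >= 2
```

For example, `meets_sirs(temp_f=101.0, heart_rate=95)` satisfies criteria 1 and 3, while a patient with only an elevated heart rate does not qualify.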

We excluded patients who 1) are children, and 2) have been in the hospital for less than 8 hours or more than 30 days.

There are 3 critical time points for each patient:

Tadmission: The time at which the patient was admitted to the hospital
Tonset: The time at which sepsis 2 onset was identified in the patient
Tdischarge: The time at which the patient was discharged from the hospital

We will provide patient demographic and admission data for both tasks.

adm_id   gender  race              admission_type  addission_source   care_setting            age_grp
A100019  Male    Caucasian         Elective        Physician Referal  Care Setting Undefined  60~70
A100032  Female  African American  Emergency       Physician Referal  Care Setting Undefined  50~60
A100034  Male    Caucasian         Elective        Others/unknown     Care Setting Undefined  40~50
A100035  Male    Caucasian         Emergency       Others/unknown     Care Setting Undefined  70~80

Task 1: Sepsis 2 Onset Risk Prediction (4 hours before onset)


Graph of Sepsis 2 Prediction

Goal: To predict sepsis-2 onset 4 hours before it occurs

We provide clinical events and lab test results between Tadmission and Tonset - 4 for each patient, in matrix format. Times are expressed as offsets (in hours) from Tadmission.

adm_id   event_time  A/G Ratio  ALT/SGPT  AST/SGOT  Albumin Quant  Albumin, Serum  Alk Phos, Serum  Amylase, Serum  Anion Gap  ...
A100008  0.5         NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
A100008  2.0         1.2        26.0      38.0      NaN            2.9             75.0             NaN             9.0        ...
A100008  3.5         NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
A100008  4.0         NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
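
Because most cells in this matrix are NaN, one simple starting point is to collapse the time-stamped rows into a single feature vector per admission by keeping the most recent observed value of each measurement. The sketch below assumes the rows have already been parsed into dictionaries keyed by column name; `last_observed_values` is a hypothetical helper, not part of any Challenge tooling.

```python
import math
from collections import defaultdict


def last_observed_values(rows):
    """Collapse time-stamped rows into one feature dict per admission.

    For every (adm_id, measurement) pair, keep the most recent non-NaN
    value. Measurements that were never observed are simply absent.
    """
    latest = defaultdict(dict)  # adm_id -> {measurement: (event_time, value)}
    for row in rows:
        feats = latest[row["adm_id"]]  # ensures every admission appears
        t = row["event_time"]
        for name, value in row.items():
            if name in ("adm_id", "event_time"):
                continue
            if isinstance(value, float) and math.isnan(value):
                continue  # skip missing measurements
            prev = feats.get(name)
            if prev is None or t >= prev[0]:
                feats[name] = (t, value)
    return {
        adm: {name: value for name, (_, value) in feats.items()}
        for adm, feats in latest.items()
    }


nan = float("nan")
example_rows = [
    {"adm_id": "A100008", "event_time": 0.5, "A/G Ratio": nan, "ALT/SGPT": nan},
    {"adm_id": "A100008", "event_time": 2.0, "A/G Ratio": 1.2, "ALT/SGPT": 26.0},
    {"adm_id": "A100008", "event_time": 3.5, "A/G Ratio": nan, "ALT/SGPT": nan},
]
features = last_observed_values(example_rows)
```

Last-value carry-forward discards trends and measurement frequency, both of which may be informative; it is shown here only as a baseline way to handle the sparsity.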

Each patient is labeled according to whether sepsis 2 onset was identified (1) or not (0).

adm_id sepsis2
A100001 0
A100002 0
A100003 0
A100004 0
A100005 0
A100006 0

Total data size:

Training data: 106,291 patients (4,910,670 records)
Evaluation data: 35,781 patients (1,651,497 records)

Evaluation: Standard AUC, computed on randomly drawn samples from the testing cohort. We will test:

  1. Case and control segments from the same patient: segments from the longer term (> 4 hours before sepsis onset) vs. segments close to sepsis onset (= 4 hours before)
  2. Case and control segments from different patients: those who have sepsis onset in the next 4 hours vs. those who do not develop sepsis
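
As a reminder of what the standard AUC measures, the sketch below computes it directly from its definition: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name is illustrative, and the pairwise loop is written for clarity rather than speed; the organizers' actual scoring code is not shown in this document.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random case > score of a random control).

    labels: iterable of 0/1 (1 = case), scores: predicted probabilities.
    Ties between a case and a control count as 0.5.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
```

Because AUC depends only on how the scores rank cases against controls, submitted probabilities do not need to be calibrated to score well on this metric.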

Task 2: 30-day Mortality Risk Prediction for Patients Identified with Sepsis 2


Graph of Sepsis 2 identified in patients

Goal: To predict whether the patient will die in the hospital within 30 days, using up to 48 hours of data before sepsis onset.

We provide the clinical events and lab test results between Tonset - 48 and Tonset - 4 for each sepsis 2 patient, in matrix format. Times are expressed as offsets (in hours) from Tonset, so they are negative.

adm_id    event_time  A/G Ratio  ALT/SGPT  AST/SGOT  Albumin Quant  Albumin, Serum  Alk Phos, Serum  Amylase, Serum  Anion Gap  ...
A1000019  -47.5       NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
A1000019  -46.5       NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
A1000019  -45.5       NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
A1000019  -45.0       NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...
A1000019  -44.5       NaN        NaN       NaN       NaN            NaN             NaN              NaN             NaN        ...

Each patient is labeled with their mortality status and the time between Tonset and Tdischarge.

adm_id time mortality
A100079 200.5 0
A100244 78.5 0
A100328 78.5 0
A100388 55.5 0
A100398 117.0 0
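
A binary 30-day outcome can be derived from these two columns. The sketch below rests on two assumptions that the table itself does not state: that `time` is measured in hours (as the other time offsets are), and that for patients with `mortality` equal to 1 the discharge time coincides with the time of death. The helper name is hypothetical.

```python
HOURS_IN_30_DAYS = 30 * 24  # 720 hours


def died_within_30_days(time_hours: float, mortality: int) -> bool:
    """True if the patient died in hospital within 30 days of sepsis onset.

    Assumes `time_hours` is the Tonset-to-Tdischarge interval in hours and
    that discharge marks the time of death when mortality == 1.
    """
    return mortality == 1 and time_hours <= HOURS_IN_30_DAYS


label = died_within_30_days(200.5, 0)  # first row of the table: survived
```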

Total data size:

Training data: 31,614 patients (940,567 records)
Evaluation data: 10,643 patients (313,991 records)

Evaluation:

Cumulative case/dynamic control ROC. Performance is judged at multiple time points to see how well, and how early relative to mortality or discharge, the model Mi can produce a good prediction from Tonset.

Chart of the Evaluation for Patient

Evaluate and compare using R Package timeROC
https://cran.r-project.org/web/packages/timeROC/index.html

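sensitivity_C(c, t) = P(Mi > c | Ti < t)   (cumulative cases: patients who died before time t)
specificity_D(c, t) = P(Mi < c | Ti > t)   (dynamic controls: patients still alive at time t)

Here Mi is the model's predicted risk for patient i, Ti is that patient's time to death measured from Tonset, and c is the decision threshold.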

Calculating AUC (in the traditional way) at different time cutoffs t allows one to assess the model's performance in predicting short-term, medium-term, and long-term mortality risk after sepsis onset.
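
To make the cumulative/dynamic definitions concrete, the sketch below computes an AUC at a single cutoff t. It is a deliberately simplified illustration and not the timeROC estimator, which handles censoring through inverse probability of censoring weighting. Here, patients discharged alive are assumed never to die in hospital, so they always count as controls, and times are assumed to be in hours since Tonset. The function name is hypothetical.

```python
def cumulative_dynamic_auc(times, died, scores, t):
    """Naive cumulative/dynamic AUC at cutoff t (no censoring weights).

    Cases:    patients who died before t      (died == 1 and time < t)
    Controls: patients still alive at t       (died == 0, or time > t)
    A patient who died exactly at t is in neither group, mirroring the
    strict inequalities in the definitions above.
    """
    cases = [m for T, d, m in zip(times, died, scores) if d == 1 and T < t]
    controls = [m for T, d, m in zip(times, died, scores) if d == 0 or T > t]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this cutoff")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


times = [10.0, 50.0, 200.0, 80.0]   # hours from onset to death or discharge
died = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.2, 0.7]       # model outputs Mi
auc_24h = cumulative_dynamic_auc(times, died, scores, t=24)
auc_72h = cumulative_dynamic_auc(times, died, scores, t=72)
```

Note how the second patient is a control at the 24-hour cutoff but a case at the 72-hour cutoff, which is what "dynamic control" refers to.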

Task 3: Innovation Regarding Interpretability


While many machine learning models can perform classification and regression tasks, not all of them yield valid interpretations that allow their findings to be applied to decision support in the clinical setting.

Interpretability (e.g., automatic decisions on the threshold, finding combined patterns, designing novel visualizations, etc.) cannot be assessed without evaluation by human experts. We have assembled a group of machine learning and clinical experts to judge the Challenge innovation track, which will focus on interpretability.

Submitting Your Entry


The prediction result must be submitted via the SECURESTOR submission directory (one is assigned for each team).

For Task 1, please submit the probability that the patient will have sepsis 2 onset in the next 4 hours.
For Task 2, please submit the probability of the patient’s mortality within 30 days.

The submission must be in CSV (comma-separated) format, with column headers. Below is the sample layout for both tasks 1 and 2.

adm_id probability
A100079 0.98330
A100093 0.34455
A100044 0.12333
A100046 0.23322
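
A submission file with this layout can be produced with Python's standard `csv` module. This is a minimal sketch: the function name and the five-decimal formatting (chosen to match the sample rows) are assumptions, and the file name and location within the submission directory are not specified here.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (adm_id, probability) pairs as CSV with a header row.

    `out` is any text file object, e.g. open(path, "w", newline="").
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["adm_id", "probability"])
    for adm_id, prob in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {adm_id}: {prob}")
        writer.writerow([adm_id, f"{prob:.5f}"])


buffer = io.StringIO()
write_submission([("A100079", 0.9833), ("A100093", 0.34455)], buffer)
csv_text = buffer.getvalue()
```

Validating the range before writing catches raw scores that were never converted to probabilities.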

Rules


  1. Participants must not download the dataset
  2. Participants are responsible for any additional access/logons created on their server and for keeping their password secret
  3. Solutions must be submitted in the required format by the designated deadline