The Lab Data Transmogrification

Data cleaning on the lab and vital-periodic tables was done this week.

From the ‘lab.csv’ file, the wbcx1000, lactate, and creatinine values were extracted from the ‘labname’ column per patient by pivoting rows of data into columns. Each value was paired with the lab’s offset value so overlapping patient data can be synchronized in time.
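A minimal pandas sketch of that pivot, on toy data. The column names (‘patientunitstayid’, ‘labresultoffset’, ‘labname’, ‘labresult’) and the ‘WBC x 1000’ label are assumptions based on the eICU schema, not taken from our actual cleaning code:

```python
import pandas as pd

# Toy stand-in for lab.csv: one row per measurement, keyed by patient
# stay and offset (minutes since ICU admission).
lab = pd.DataFrame({
    "patientunitstayid": [1, 1, 1, 2, 2],
    "labresultoffset":   [60, 60, 120, 30, 30],
    "labname":           ["WBC x 1000", "lactate", "creatinine",
                          "WBC x 1000", "lactate"],
    "labresult":         [11.2, 1.8, 0.9, 4.5, 2.3],
})

wanted = ["WBC x 1000", "lactate", "creatinine"]
subset = lab[lab["labname"].isin(wanted)]

# Pivot so each lab test becomes its own column, keeping the offset so
# rows from different files can later be aligned in time.
wide = subset.pivot_table(index=["patientunitstayid", "labresultoffset"],
                          columns="labname",
                          values="labresult").reset_index()

# Write the cleaned, widened table to its own CSV.
wide.to_csv("lab_cleaned.csv", index=False)
```

Labs not measured at a given offset simply come out as NaN in their column, which is what the later missing-value handling has to deal with.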

Within the ‘vitalPeriodic.csv’ file, temperature, heartrate, respiration and systemicsystolic were gathered as well.

All extracted values were written to separate CSV files.

Nani!!!

So the creators of the code finally got back to us. They recommended running the code on a machine with 20 CPU cores and 256 GB of RAM. I’m confused how providing more resources would fix missing-value errors; that seems like a dead end. I’m beginning to doubt how accurate the results they presented are. Nevertheless, that is a problem for a later point.

We continue to clean some of the data using the sample data, since we realize it’s not possible to run the full data on our own machines: the files are too large to be loaded into RAM properly. It seems at least 32 GB of RAM would be needed to load and manipulate some of the files, maybe even as much as 64 GB. Hopefully we find a solution for that soon. Seems like Chrome has some competition for RAM.
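Until bigger hardware turns up, one workaround for the RAM limit is to stream a large CSV in chunks rather than loading it whole. A sketch with pandas, assuming eICU-style column names (our actual cleaning code may differ):

```python
import pandas as pd

# Columns we actually need from vitalPeriodic.csv (assumed eICU names);
# restricting columns already cuts memory use substantially.
usecols = ["patientunitstayid", "observationoffset", "temperature",
           "heartrate", "respiration", "systemicsystolic"]

def filter_large_csv(path, keep_ids, chunksize=1_000_000):
    """Stream the CSV in chunks, keeping only rows for selected stays."""
    pieces = []
    for chunk in pd.read_csv(path, usecols=usecols, chunksize=chunksize):
        pieces.append(chunk[chunk["patientunitstayid"].isin(keep_ids)])
    return pd.concat(pieces, ignore_index=True)

# Tiny demo file standing in for the far larger vitalPeriodic.csv.
demo = pd.DataFrame({
    "patientunitstayid": [1, 2, 1, 3],
    "observationoffset": [0, 0, 5, 0],
    "temperature": [36.5, 38.2, 36.7, 37.0],
    "heartrate": [80, 95, 82, 70],
    "respiration": [16, 22, 17, 14],
    "systemicsystolic": [120, 88, 118, 110],
})
demo.to_csv("vitals_demo.csv", index=False)

vitals = filter_large_csv("vitals_demo.csv", keep_ids={1, 2}, chunksize=2)
```

Only one chunk lives in memory at a time, so peak RAM is governed by `chunksize` rather than the file size.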

The Intake-Output Extraction

Continuing the battle to acquire urine output values, we loaded the ‘intakeOutput.csv’ file. This served as a way of gathering the urine output values for only the patient IDs that we specified.
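A sketch of that filtering step, assuming the eICU ‘intakeOutput.csv’ layout of one event per row labelled by ‘celllabel’, with urine events labelled “Urine” (the exact label is an assumption and should be verified against the real file):

```python
import pandas as pd

# Toy stand-in for intakeOutput.csv.
io = pd.DataFrame({
    "patientunitstayid":  [1, 1, 2, 3],
    "intakeoutputoffset": [60, 120, 90, 30],
    "celllabel":          ["Urine", "Oral intake", "Urine", "Urine"],
    "cellvaluenumeric":   [150.0, 200.0, 80.0, 60.0],
})

# The patient stays selected earlier (e.g. from diagnosis.csv).
keep_ids = {1, 2}

# Keep only urine-output events for the specified stays.
urine = io[(io["celllabel"] == "Urine") &
           (io["patientunitstayid"].isin(keep_ids))]
```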

More tests were done on the validity of all data cleaning processes.

We are currently in the process of researching how we should merge the data in the weeks to come.

Show Me The Data!

We decided to move on while we wait for a reply from the creators of the competition code. We started looking through the database for all the information the doctor told us would be useful, and thankfully we were able to find everything within the dataset. The urine output by weight was derived from the patient’s weight and their urine output, which were in separate columns.
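The derivation itself is a single column division once the two sources are joined on the stay ID. A hypothetical sketch (the column names here are illustrative):

```python
import pandas as pd

# Urine output (mL) and admission weight (kg) start out in separate
# tables and are joined on the stay ID before dividing.
urine = pd.DataFrame({"patientunitstayid": [1, 2],
                      "urineoutput": [150.0, 80.0]})
weights = pd.DataFrame({"patientunitstayid": [1, 2],
                        "admissionweight": [75.0, 80.0]})

merged = urine.merge(weights, on="patientunitstayid")
# mL divided by kg gives the mL/kg value the sepsis definition needs.
merged["urineoutputbyweight"] = (merged["urineoutput"]
                                 / merged["admissionweight"])
```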

We also spent some time cleaning some of the data. Things finally feel like they are going places. We are only looking at the sample dataset since it is easier to work with at this point: the full database is quite large, so everything would take longer to run. The sample should be a reasonably good representation of the full data anyway.

Demo Usage

Having encountered many issues using the full datasets for cleaning and feature extraction, we decided to use the demo datasets to create the framework and pipelines for our data cleaning and neural network models. A professor at the University kindly offered the use of their personal lab machine, with its abundant resources, to run the code on the full datasets.

The Patient Data Insufficiency

In the hunt for training data, we first identified the patients diagnosed with sepsis in order to shrink the dataset; for this we used the ‘diagnosis.csv’ file.

This represented patients with sepsis as ‘diagnosis : 1’ and patients without sepsis as ‘diagnosis : 0’.
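A sketch of how that labelling could be done with pandas, assuming sepsis cases can be found by substring match on eICU’s ‘diagnosisstring’ column (the exact matching rule in our code may differ):

```python
import pandas as pd

# Toy stand-in for diagnosis.csv; 'diagnosisstring' holds pipe-separated
# diagnosis paths (assumed format).
diag = pd.DataFrame({
    "patientunitstayid": [1, 2, 3],
    "diagnosisstring": [
        "infectious diseases|systemic|sepsis",
        "cardiovascular|arrhythmias|atrial fibrillation",
        "infectious diseases|systemic|sepsis, severe",
    ],
})

# Flag each row: 1 if the diagnosis text mentions sepsis, else 0.
diag["diagnosis"] = diag["diagnosisstring"].str.contains(
    "sepsis", case=False).astype(int)

# A stay counts as septic if any of its diagnosis rows was flagged.
labels = diag.groupby("patientunitstayid")["diagnosis"].max().reset_index()
```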

To begin our data cleaning process, we handled the patient data set first. Because a patient’s urine output is recorded in millilitres (mL), it needed to be converted to millilitres per kilogram (mL/kg) in order to work with the sepsis definitions. This meant that the patient’s admission weight was necessary.

This was found in the ‘patient.csv’ file.

Before filling a patient’s missing admission weight, null age values were filled and the gender attribute was cleaned so that all of its values were consistent.

Missing admission weight values were then filled with the average weight for each gender.
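A sketch of that fill, assuming ‘patient.csv’-style columns; gender strings are normalized first so the per-gender averages group correctly:

```python
import pandas as pd

# Toy stand-in for patient.csv with inconsistent gender casing and a
# missing admission weight.
patients = pd.DataFrame({
    "patientunitstayid": [1, 2, 3, 4],
    "gender": ["Female", "male", "Male", "female"],
    "admissionweight": [60.0, 90.0, None, 70.0],
})

# Make gender values consistent before grouping.
patients["gender"] = patients["gender"].str.lower()

# Fill each missing weight with the mean weight of that gender.
patients["admissionweight"] = (
    patients.groupby("gender")["admissionweight"]
            .transform(lambda s: s.fillna(s.mean()))
)
```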

Sorting Out Plan of Action

We heard back from the doctor about what information we should look for in the dataset. This includes:

  • urine output by weight
  • temperature
  • respiration rate
  • white blood cell count
  • creatinine
  • lactate
  • systemic systolic
  • heart rate

Upon further research, the following are the levels that could indicate septic shock:

  • Body temperature: <36 °Celsius or >38 °Celsius
  • Heart rate: >90 BPM
  • Respiratory rate: >20 BPM, or arterial CO2 pressure (PaCO2) <32 mmHg
  • White blood cell count: <4,000/mL or >12,000/mL
  • Systolic blood pressure: <90 mmHg
  • Blood lactate: >2.0 mmol/L
  • Urine output: <0.5 mL/kg over the last two hours, despite adequate fluid resuscitation
  • Creatinine: >2.0 mg/dL, without the presence of chronic dialysis or renal insufficiency
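As a sketch, these thresholds can be applied to a single time-stamped row of the merged features. The column names are assumed from our cleaned files, wbcx1000 is taken to be the count in thousands per mL, and the clinical qualifiers (adequate fluid resuscitation, absence of chronic dialysis) are ignored here:

```python
# Flag which septic-shock thresholds a single observation meets.
# Thresholds follow the table above; clinical qualifiers are ignored.
def septic_shock_flags(row):
    return {
        "temperature": row["temperature"] < 36 or row["temperature"] > 38,
        "heartrate":   row["heartrate"] > 90,
        "respiration": row["respiration"] > 20,
        # wbcx1000 is assumed to be in thousands per mL, so the
        # 4,000/mL and 12,000/mL cutoffs become 4 and 12.
        "wbc":         row["wbcx1000"] < 4 or row["wbcx1000"] > 12,
        "systolic":    row["systemicsystolic"] < 90,
        "lactate":     row["lactate"] > 2.0,
        "urine":       row["urineoutputbyweight"] < 0.5,
        "creatinine":  row["creatinine"] > 2.0,
    }

# One illustrative observation that exceeds every threshold.
sample = {"temperature": 38.6, "heartrate": 104, "respiration": 24,
          "wbcx1000": 14.1, "systemicsystolic": 85, "lactate": 3.2,
          "urineoutputbyweight": 0.3, "creatinine": 2.4}
flags = septic_shock_flags(sample)
```

Such per-row flags could later become either features for the model or part of the labelling logic, depending on how we decide to merge the data.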

Some more time was spent looking into how time-series data is represented and how it is used in models.

The Sepsis Hypothesis

The physician advised us on how sepsis is determined, summarizing septic shock as follows.

Septic shock, a subset of sepsis, is a severe condition that can occur should sepsis lead to life-threatening hypotension. A patient in septic shock would exhibit the following attributes.

  • Body temperature: <36 °Celsius or >38 °Celsius
  • Heart rate: >90 BPM
  • Respiratory rate: >20 BPM, or arterial CO2 pressure (PaCO2) <32 mmHg
  • White blood cell count: <4,000/mL or >12,000/mL

Severe sepsis, being the presence of sepsis together with sepsis-related organ dysfunction, is characterized by the following.

  • Systolic blood pressure: <90 mmHg
  • Blood lactate: >2.0 mmol/L
  • Urine output: <0.5 mL/kg over the last two hours, despite adequate fluid resuscitation
  • Creatinine: >2.0 mg/dL, without the presence of chronic dialysis or renal insufficiency

The physician also assessed the eICU database in order to help us pick out valuable data attributes that we would need in predicting sepsis.

Each data attribute and the database file it comes from:

  • patientunitstayid: Table 18: patient.csv Cleaned File Description
  • observationoffset: Table 18: patient.csv Cleaned File Description; Table 12: patient.csv File Description
  • temperature: Table 21: vitalPeriodic.csv Cleaned File Description
  • heartrate: Table 21: vitalPeriodic.csv Cleaned File Description
  • respiration: Table 21: vitalPeriodic.csv Cleaned File Description
  • systemicsystolic: Table 21: vitalPeriodic.csv Cleaned File Description
  • wbcx1000: Table 19: lab.csv Cleaned File Description
  • lactate: Table 19: lab.csv Cleaned File Description
  • creatinine: Table 19: lab.csv Cleaned File Description
  • urineoutputbyweight: Table 20: intakeOutputUrine.csv File Description
  • diagnosis: Table 16: diagnosis.csv Cleaned File Description

If we can indeed use these values to train the model, then our predictions would be successful.

Course Assignments & Supercomputers

Not much work was done this week, other than reaching out to the competitors for assistance in running and understanding their code, as other University courses had assignments due this week and time was spent working on them.

Their reply to our email recommended using a virtual machine with more resources, as they claim their code takes two days to run using 256 GB of RAM and a CPU with more than 20 cores. This does not make any sense to us, as there was a compilation error when running their code.

Also, there is no way for us to gain access to a machine this powerful from outside the US: most of the services that come close to this required a US-based credit card, which, as international students from the Caribbean, we do not have access to.
