Thieu Nguyen - May-03-2025
COVID-19 REPORTS 2020 - PYTHONΒΆ
Use the SyntheaMass data and rerun the code from this website to practice analyzing healthcare data using Python. I modified some code blocks to handle analysis involving several related tables in the healthcare database.
In this post, I learned how to:
- Define specific groups of patients, such as those who contracted COVID-19, were hospitalized, admitted to the ICU, placed on ventilators, and more
- Calculate age, age at death, time points of COVID-19 events, symptom timelines, and hospital stay lengths
- Aggregate metrics such as counts, means, and sums of symptoms and COVID-19 outcomes
- Explore lab values
- Plot key metrics
- Use Python functions to simplify code blocks in the report by calling external .py files such as analysis.py
Loading data and packagesΒΆ
%pip install pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 1
# for loading analysis.py containing the analysis functions
%aimport analysis
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
devices = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/devices.csv")
supplies = pd.read_csv('/Users/nnthieu/SyntheaData/SyntheaCovid19/supplies.csv')
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")
Firstly, create patient Ids for 'covid-19', 'isolation', 'admitted', 'icu_admitted', 'ventilated', 'recovered' and 'death' groups, then create related columns in the covid_patients table. To get that, I have to know CODEs, DESCRIPTIONs related to patient groups, for instance, conditions.CODE = '840539006' for covid-19.
Grab the IDs of patients that have been diagnosed with covid-19.
covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()
inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()
isolation_ids = care_plans[(care_plans.CODE == 736376001) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
icu_patient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 305351004)].PATIENT.unique()
vent_ids = procedures[procedures.CODE == 26763009].PATIENT.unique()
deceased_ids = patients[patients.DEATHDATE.notna()].Id.unique()
negative_covid_patient_ids = observations[(observations.CODE == '94531-1') &
(observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) &
(care_plans.REASONCODE == 840539006)].PATIENT.unique()
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)
# Filter patients who have COVID-19
covid_patients = patients[patients.Id.isin(covid_patient_ids)] # Correct column name to 'Id'
len(covid_patient_ids)
88166
The number of inpatient survivors
np.intersect1d(inpatient_ids, survivor_ids).shape
(14654,)
The number of inpatient non-survivors
np.intersect1d(inpatient_ids, deceased_ids).shape[0]
3548
Mortality rate of covid-19 patients
(np.intersect1d(covid_patient_ids, deceased_ids).shape[0]/len(covid_patient_ids))* 100
4.129709865481025
import datetime
# 1. Find the earliest COVID-19 start date
covid_start_date = pd.to_datetime(conditions[conditions['DESCRIPTION'] == 'COVID-19']['START'].min())
# 2. Set the maximum allowed COVID-19 date (80 days later)
covid_max_date = covid_start_date + pd.Timedelta(days=80)
# Ensure the 'START' column is in datetime format
conditions['START'] = pd.to_datetime(conditions['START'], errors='coerce')
observations['DATE'] = pd.to_datetime(observations['DATE'], errors='coerce')
# 3. Create a boolean mask for rows within the COVID-19 date range and for COVID-19 diagnosis
covid_date_mask = (conditions['START'] >= covid_start_date) & (conditions['START'] <= covid_max_date)
# 4. Filter `conditions` and `observations` based on the date range
covid_cons = conditions[covid_date_mask]
covid_obs = observations[
(observations['DATE'] >= covid_start_date) & (observations['DATE'] <= covid_max_date)
]
# 5. Display first few rows
print(covid_cons.head(3))
covid_cons.shape
START STOP PATIENT \ 2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 ENCOUNTER CODE DESCRIPTION 2 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding) 3 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19 4 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19
(654569, 6)
Health outcomes experienced by COVID-19 patientsΒΆ
The following table shows different health outcomes that were experienced by COVID-19 patients during the course of the disease.
File analysis.py is written and saved in a local working folder.
write a function in analysis.py like import analysis analysis.outcome_table(inpatient_ids, survivor_ids, deceased_patients, covid_cons) to create a table as: outcome total percent of inpatient survivors percent survivors non survivors percent non survivors with outcome are values of covid_condition.DESCRIPTION containing (disorder)
analysis.outcome_table(inpatient_ids, survivor_ids, deceased_ids, covid_cons)
outcome | total | percent of inpatient | survivors | percent survivors | non survivors | percent non survivors | |
---|---|---|---|---|---|---|---|
1 | Acute respiratory failure | 8701 | 47.87 | 5237 | 5.99 | 3473 | 14.38 |
0 | Sepsis caused by virus | 6941 | 38.19 | 3419 | 3.91 | 3526 | 14.60 |
6 | Acute respiratory distress syndrome | 2400 | 13.20 | 85 | 0.10 | 2315 | 9.59 |
2 | Septic shock | 1746 | 9.61 | 0 | 0.00 | 1746 | 7.23 |
5 | Heart failure | 1431 | 7.87 | 122 | 0.14 | 1309 | 5.42 |
4 | Injury of heart | 1288 | 7.09 | 20 | 0.02 | 1268 | 5.25 |
3 | Injury of kidney | 1252 | 6.89 | 8 | 0.01 | 1244 | 5.15 |
#import analysis
def outcome_table(inpatient_ids, survivor_ids, deceased_ids, covid_cons):
# Filter conditions containing "(disorder)" in the DESCRIPTION
disorder_conditions = covid_cons[
(covid_cons['DESCRIPTION'].str.contains(r'\(disorder\)', na=False)) &
(~covid_cons['DESCRIPTION'].str.contains(
r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis', case=False, na=False))
]
# Initialize the result table
table_rows = []
for outcome in disorder_conditions['DESCRIPTION'].unique():
# Filter patients with the specific outcome
outcome_patients = disorder_conditions[disorder_conditions['DESCRIPTION']
== outcome]['PATIENT'].unique()
# Calculate metrics
total = len(outcome_patients)
inpatient = len(set(outcome_patients) & set(inpatient_ids))
survivors = len(set(outcome_patients) & set(survivor_ids))
non_survivors = len(set(outcome_patients) & set(deceased_ids))
percent_inpatient = (inpatient / len(inpatient_ids)) * \
100 if len(inpatient_ids) > 0 else 0
percent_survivors = (survivors / len(survivor_ids)) * \
100 if len(survivor_ids) > 0 else 0
percent_non_survivors = (
non_survivors / len(deceased_ids)) * 100 if len(deceased_ids) > 0 else 0
# Append row to the table
table_rows.append({
'outcome': outcome,
'total': total,
'percent of inpatient': round(percent_inpatient, 2),
'survivors': survivors,
'percent survivors': round(percent_survivors, 2),
'non survivors': non_survivors,
'percent non survivors': round(percent_non_survivors, 2)
})
# Convert to DataFrame
outcome_table = pd.DataFrame(table_rows).head(7)
outcome_table = outcome_table.sort_values(by='total', ascending=False)
outcome_table['outcome'] = outcome_table['outcome'].str.replace(
r'\(disorder\)', '', regex=True).str.strip()
return outcome_table
outcome_table(inpatient_ids, survivor_ids, deceased_ids, covid_cons)
Outcomes for all COVID-19 PatientsΒΆ
This code builds a new DataFrame for the purposes of display. The DataFrame contains the percentages of patients that experience a particular outcome. Percentages are then provided for only hospitalized patients, ICU admitted patients and ventilated patients.
# Recode the boolean columns from True/False to text to read easily
hospitalized = (cp.admit == True)
icu = (cp.icu_admit == True)
vent = (cp.ventilated == True)
covid_count = cp.Id.size
row_filters = {'Home Isolation': (cp.isolation == True),
'Hospital Admission': hospitalized,
'ICU Admission': icu,
'Ventilated': vent,
'Recovered': (cp.recovered == True),
'Death': (cp.death == True)}
table_rows = []
for category, row_filter in row_filters.items():
row = {'Outcome': category}
row['All Patients'] = round(cp[row_filter].Id.size / covid_count, 2)
row['Hospitalized'] = round(cp[row_filter & hospitalized].Id.size / hospitalized.value_counts()[True], 2)
row['ICU Admitted'] = round(cp[row_filter & icu].Id.size / icu.value_counts()[True], 2)
row['Ventilated'] = round(cp[row_filter & vent].Id.size / vent.value_counts()[True], 2)
table_rows.append(row)
pd.DataFrame.from_records(table_rows)
Outcome | All Patients | Hospitalized | ICU Admitted | Ventilated | |
---|---|---|---|---|---|
0 | Home Isolation | 0.80 | 0.03 | 0.03 | 0.03 |
1 | Hospital Admission | 0.21 | 1.00 | 1.00 | 1.00 |
2 | ICU Admission | 0.04 | 0.20 | 1.00 | 1.00 |
3 | Ventilated | 0.03 | 0.16 | 0.79 | 1.00 |
4 | Recovered | 0.96 | 0.81 | 0.32 | 0.15 |
5 | Death | 0.04 | 0.20 | 0.68 | 0.85 |
Outcomes for ICU Admitted PatientsΒΆ
Essentially a sub table from above, looking only at ICU patients.
icu_only = cp[cp.icu_admit == True]
vent = (icu_only.ventilated == True)
covid_count = icu_only.Id.size
row_filters = {'Ventilated': vent,
'Recovered': (icu_only.recovered == True),
'Death': (icu_only.death == True)}
table_rows = []
for category, row_filter in row_filters.items():
row = {'Outcome': category}
row['ICU Admitted'] = round(icu_only[row_filter].Id.size / covid_count, 2)
row['Ventilated'] = round(icu_only[row_filter & vent].Id.size / vent.value_counts()[True], 2)
table_rows.append(row)
pd.DataFrame.from_records(table_rows)
Outcome | ICU Admitted | Ventilated | |
---|---|---|---|
0 | Ventilated | 0.79 | 1.00 |
1 | Recovered | 0.32 | 0.15 |
2 | Death | 0.68 | 0.85 |
Start to build a DataFrame that we can use to look at other conditions in relation to COVID-19
covid_info = cp[['PATIENT', 'recovered', 'death', 'START', 'DEATHDATE', 'BIRTHDATE', 'GENDER', 'admit', 'icu_admit']]
covid_info = covid_info.rename(columns={'START': 'covid_start'})
covid_info.head(2)
PATIENT | recovered | death | covid_start | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | |
---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False |
1 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | True | False | 2020-03-13 | NaN | 2016-08-01 | F | False | False |
Grab all of the conditions starting after January 20, 2020. This is a hack to get only conditions that are related to COVID-19. We will end up merging these with the COVID patients.
covid_related_conditions = conditions[pd.to_datetime(conditions.START) > pd.to_datetime('2020-01-20')]
This DataFrame will contain all conditions for COVID-19 patients, where START can be compared to covid_start to see how long after the COVID-19 diagnosis something happened.
covid_patient_conditions = covid_info.merge(covid_related_conditions, on='PATIENT')
covid_patient_conditions.head(2)
PATIENT | recovered | death | covid_start | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | START | STOP | ENCOUNTER | CODE | DESCRIPTION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False | 2020-03-01 | 2020-03-30 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) |
1 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False | 2020-03-01 | 2020-03-01 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 |
Lab values for COVID-19 patientsΒΆ
The following code presents lab values taken for COVID-19 patients. Values are separated into survivors and non survivors. The first block of code selects lab values of interest from all observations in the simulation.
lab_obs = observations[(observations.CODE == '48065-7') | (observations.CODE == '26881-3') |
(observations.CODE == '2276-4') | (observations.CODE == '89579-7') |
(observations.CODE == '731-0') | (observations.CODE == '14804-9')
]
lab_obs.head(3)
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | |
---|---|---|---|---|---|---|---|---|
254 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric |
273 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric |
274 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric |
Select COVID-19 conditions out of all conditions in the simulation
covid_conditions = conditions[conditions.CODE == 840539006]
covid_patients = covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
covid_patients['survivor'] = covid_patients.PATIENT.isin(survivor_ids)
covid_patients = covid_patients[['START', 'PATIENT', 'survivor', 'CODE']]
Calculate attributes needed to support the plot. Also coerce all lab values into a numeric data type.
covid_patients_obs = covid_patients.merge(lab_obs, on='PATIENT')
covid_patients_obs['START'] = pd.to_datetime(covid_patients_obs.START)
covid_patients_obs['DATE'] = pd.to_datetime(covid_patients_obs.DATE)
covid_patients_obs['lab_days'] = covid_patients_obs.DATE - covid_patients_obs.START
covid_patients_obs['days'] = covid_patients_obs.lab_days / np.timedelta64(1, 'D')
covid_patients_obs['VALUE'] = pd.to_numeric(covid_patients_obs['VALUE'], errors='coerce')
loinc_to_display = {'CODE_y = 48065-7': 'D-dimer', 'CODE_y = 2276-4': 'Serum Ferritin',
'CODE_y = 89579-7': 'High Sensitivity Cardiac Troponin I',
'CODE_y = 26881-3': 'IL-6', 'CODE_y = 731-0': 'Lymphocytes',
'CODE_y = 14804-9': 'Lactate dehydrogenase'}
catplt = sns.catplot(x="days", y="VALUE", hue="survivor", kind="box", col='CODE_y',
col_wrap=2, sharey=False, sharex=False, data=covid_patients_obs, palette=["C1", "C0"])
# Set the title for the entire figure
catplt.fig.suptitle("Labs Values of Covid-19 Patients", y=1.02)
for axis in catplt.fig.axes:
axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
axis.set_title(loinc_to_display[axis.get_title()])
plt.show()
Line plots
loinc_to_display = {'CODE_y = 48065-7': 'D-dimer', 'CODE_y = 2276-4': 'Serum Ferritin',
'CODE_y = 89579-7': 'High Sensitivity Cardiac Troponin I',
'CODE_y = 26881-3': 'IL-6', 'CODE_y = 731-0': 'Lymphocytes',
'CODE_y = 14804-9': 'Lactate dehydrogenase'}
g = sns.FacetGrid(covid_patients_obs, col="CODE_y", col_wrap=2, sharey=False, height=3, aspect=2)
g = g.map_dataframe(sns.lineplot, x="days", y="VALUE", marker=".", hue="survivor", palette=["C1", "C0"], alpha=0.5)
g.fig.suptitle("Labs Values of Covid-19 Patients", y=1.02)
for axis in g.axes.flat:
axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
title = axis.get_title().replace("DESCRIPTION = ", "").strip()
if title in loinc_to_display:
axis.set_title(loinc_to_display[title])
else:
axis.set_title(title) # Fallback to the original title if not found in the dictionary
Another way to create line plots
loinc_to_display = {'CODE_y = 48065-7': 'D-dimer', 'CODE_y = 2276-4': 'Serum Ferritin',
'CODE_y = 89579-7': 'High Sensitivity Cardiac Troponin I',
'CODE_y = 26881-3': 'IL-6', 'CODE_y = 731-0': 'Lymphocytes',
'CODE_y = 14804-9': 'Lactate dehydrogenase'}
catplt = sns.catplot(x="days", y="VALUE", hue="survivor", kind="point", col='CODE_y',
col_wrap=2, sharey=False, sharex=False, data=covid_patients_obs, palette=["C1", "C0"])
# Set the title for the entire figure
catplt.fig.suptitle("Labs Values of Covid-19 Patients", y=1.02)
for axis in catplt.fig.axes:
axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
axis.set_title(loinc_to_display[axis.title.get_text()])
plt.show()
Set up a new DataFrame with boolean columns representing various outcomes, like admit, recovery or death
First, create patient Ids for 'isolation', 'admitted', 'icu_admitted', 'ventilated', 'recovered' and 'death', then create related columns in the covid_patient_ids table.
cp = covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
isolation_ids = care_plans[(care_plans.CODE == 736376001) & (care_plans.REASONCODE == 840539006)].PATIENT
cp['isolation'] = cp.Id.isin(isolation_ids)
cp['admit'] = cp.Id.isin(inpatient_ids)
cp['recovered'] = cp.Id.isin(survivor_ids)
cp['death'] = cp.DEATHDATE.notna()
icu_ids = encounters[encounters.CODE == 305351004].PATIENT
cp['icu_admit'] = cp.Id.isin(icu_ids)
vent_ids = procedures[procedures.CODE == 26763009].PATIENT
cp['ventilated'] = cp.Id.isin(vent_ids)
cp.head(2)
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | Id | BIRTHDATE | DEATHDATE | SSN | ... | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | isolation | admit | recovered | death | icu_admit | ventilated | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | 999-68-6630 | ... | 42.151961 | -72.598959 | 8446.49 | 1499.08 | True | False | True | False | False | False |
1 | 2020-03-13 | 2020-04-14 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | a78e78d9-33bb-40bc-9e42-e47ff7e910cc | 840539006 | COVID-19 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 2016-08-01 | NaN | 999-15-5895 | ... | 42.177370 | -71.281353 | 94568.40 | 1870.72 | True | False | True | False | False | False |
2 rows Γ 37 columns
Symptoms for all COVID-19 PatientsΒΆ
Generates a DataFrame with percentages of co-occurring conditions
icu_patient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 305351004)].PATIENT
icu_patient_ids.shape
(3677,)
analysis.symptom_table(inpatient_ids, survivor_ids, deceased_ids, covid_cons)
outcome | total | percent of inpatient | survivors | percent survivors | non survivors | percent non survivors | |
---|---|---|---|---|---|---|---|
0 | Fever | 80865 | 92.07 | 77629 | 88.73 | 3317 | 13.73 |
1 | Cough | 61710 | 70.81 | 59170 | 67.63 | 2601 | 10.77 |
4 | Loss of taste | 46121 | 51.25 | 44299 | 50.64 | 1868 | 7.73 |
8 | Fatigue | 34951 | 39.86 | 33551 | 38.35 | 1432 | 5.93 |
2 | Sputum finding | 30653 | 34.78 | 29418 | 33.63 | 1264 | 5.23 |
7 | Respiratory distress | 18177 | 100.00 | 14654 | 16.75 | 3548 | 14.69 |
6 | Wheezing | 17939 | 37.86 | 16621 | 19.00 | 1339 | 5.54 |
5 | Dyspnea | 17939 | 37.86 | 16621 | 19.00 | 1339 | 5.54 |
3 | Diarrhea symptom | 3521 | 5.80 | 3319 | 3.79 | 208 | 0.86 |
9 | Hemoptysis | 874 | 2.56 | 781 | 0.89 | 95 | 0.39 |
def symptom_table(inpatient_ids, survivor_ids, deceased_ids, covid_cons):
# Filter conditions containing "(disorder)" in the DESCRIPTION
disorder_conditions = covid_cons[
(covid_cons['DESCRIPTION'].str.contains(r'\(finding\)', na=False)) &
(~covid_cons['DESCRIPTION'].str.contains(
r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis', case=False, na=False))
]
# Initialize the result table
table_rows = []
for outcome in disorder_conditions['DESCRIPTION'].unique():
# Filter patients with the specific outcome
outcome_patients = disorder_conditions[disorder_conditions['DESCRIPTION']
== outcome]['PATIENT'].unique()
# Calculate metrics
total = len(outcome_patients)
inpatient = len(set(outcome_patients) & set(inpatient_ids))
survivors = len(set(outcome_patients) & set(survivor_ids))
non_survivors = len(set(outcome_patients) & set(deceased_ids))
percent_inpatient = (inpatient / len(inpatient_ids)) * \
100 if len(inpatient_ids) > 0 else 0
percent_survivors = (survivors / len(survivor_ids)) * \
100 if len(survivor_ids) > 0 else 0
percent_non_survivors = (
non_survivors / len(deceased_ids)) * 100 if len(deceased_ids) > 0 else 0
# Append row to the table
table_rows.append({
'outcome': outcome,
'total': total,
'percent of inpatient': round(percent_inpatient, 2),
'survivors': survivors,
'percent survivors': round(percent_survivors, 2),
'non survivors': non_survivors,
'percent non survivors': round(percent_non_survivors, 2)
})
# Convert to DataFrame
outcome_table = pd.DataFrame(table_rows).head(10)
outcome_table = outcome_table.sort_values(by='total', ascending=False)
outcome_table['outcome'] = outcome_table['outcome'].str.replace(
r'\(finding\)', '', regex=True).str.strip()
return outcome_table
symptom_table(inpatient_ids, survivor_ids, deceased_ids, covid_cons)
outcome | total | percent of inpatient | survivors | percent survivors | non survivors | percent non survivors | |
---|---|---|---|---|---|---|---|
0 | Fever | 80865 | 92.07 | 77629 | 88.73 | 3317 | 13.73 |
1 | Cough | 61710 | 70.81 | 59170 | 67.63 | 2601 | 10.77 |
4 | Loss of taste | 46121 | 51.25 | 44299 | 50.64 | 1868 | 7.73 |
8 | Fatigue | 34951 | 39.86 | 33551 | 38.35 | 1432 | 5.93 |
2 | Sputum finding | 30653 | 34.78 | 29418 | 33.63 | 1264 | 5.23 |
7 | Respiratory distress | 18177 | 100.00 | 14654 | 16.75 | 3548 | 14.69 |
6 | Wheezing | 17939 | 37.86 | 16621 | 19.00 | 1339 | 5.54 |
5 | Dyspnea | 17939 | 37.86 | 16621 | 19.00 | 1339 | 5.54 |
3 | Diarrhea symptom | 3521 | 5.80 | 3319 | 3.79 | 208 | 0.86 |
9 | Hemoptysis | 874 | 2.56 | 781 | 0.89 | 95 | 0.39 |
Symptoms for ICU-admitted COVID-19 PatientsΒΆ
covid_conditions = covid_cons
icu_patient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 305351004)].PATIENT
icu_patients = covid_conditions[covid_conditions.PATIENT.isin(icu_patient_ids)]
len(icu_patient_ids)
3677
analysis.calculate_symptom_stats(icu_patient_ids, survivor_ids, deceased_ids, covid_conditions)
symptom | icu_count | percent_icu | survivor_count | percent_survivor | deceased_count | percent_deceased | |
---|---|---|---|---|---|---|---|
7 | Respiratory distress | 3677 | 100.00 | 1180 | 1.35 | 2498 | 10.34 |
0 | Fever | 3345 | 90.97 | 1068 | 1.22 | 2278 | 9.43 |
1 | Cough | 2637 | 71.72 | 853 | 0.98 | 1785 | 7.39 |
4 | Loss of taste | 1940 | 52.76 | 637 | 0.73 | 1303 | 5.40 |
8 | Fatigue | 1482 | 40.30 | 498 | 0.57 | 985 | 4.08 |
6 | Wheezing | 1381 | 37.56 | 466 | 0.53 | 916 | 3.79 |
5 | Dyspnea | 1381 | 37.56 | 466 | 0.53 | 916 | 3.79 |
2 | Sputum finding | 1271 | 34.57 | 401 | 0.46 | 870 | 3.60 |
12 | Joint pain | 656 | 17.84 | 205 | 0.23 | 451 | 1.87 |
11 | Muscle pain | 656 | 17.84 | 205 | 0.23 | 451 | 1.87 |
import pandas as pd
def calculate_symptom_stats(icu_patient_ids, survivor_ids, deceased_ids, covid_conditions):
# Filter conditions for "(finding)" only
finding_conditions = covid_conditions[
covid_conditions['DESCRIPTION'].str.contains(r'\(finding\)', na=False)
]
# Ensure there are symptoms to process
if finding_conditions.empty:
print("No symptoms found with '(finding)' in the description.")
return pd.DataFrame(columns=[
'symptom', 'icu_count', 'percent_icu',
'survivor_count', 'percent_survivor',
'deceased_count', 'percent_deceased'
])
# Convert inputs to sets for faster lookup
icu_set = set(icu_patient_ids)
survivor_set = set(survivor_ids)
deceased_set = set(deceased_ids)
table_rows = []
for symptom in finding_conditions['DESCRIPTION'].unique():
# Get all patients with the symptom
symptom_patients = finding_conditions[
finding_conditions['DESCRIPTION'] == symptom
]['PATIENT'].unique()
symptom_patients_set = set(symptom_patients)
# Intersections
icu_with_symptom = symptom_patients_set & icu_set
survivors_with_symptom = icu_with_symptom & survivor_set
deceased_with_symptom = icu_with_symptom & deceased_set
icu_count = len(icu_with_symptom)
survivor_count = len(survivors_with_symptom)
deceased_count = len(deceased_with_symptom)
# Safe percentage calculations
percent_icu = (icu_count / len(icu_patient_ids)) * 100 if len(icu_patient_ids) > 0 else 0
percent_survivor = (survivor_count / len(survivor_ids)) * 100 if len(survivor_ids) > 0 else 0
percent_deceased = (deceased_count / len(deceased_ids)) * 100 if len(deceased_ids) > 0 else 0
table_rows.append({
'symptom': symptom.replace('(finding)', '').strip(),
'icu_count': icu_count,
'percent_icu': round(percent_icu, 2),
'survivor_count': survivor_count,
'percent_survivor': round(percent_survivor, 2),
'deceased_count': deceased_count,
'percent_deceased': round(percent_deceased, 2)
})
# Create and sort the DataFrame
df = pd.DataFrame(table_rows)
if df.empty:
print("No matching symptoms found for ICU patients.")
return pd.DataFrame(columns=[
'symptom', 'icu_count', 'percent_icu',
'survivor_count', 'percent_survivor',
'deceased_count', 'percent_deceased'
])
df = df.sort_values(by='icu_count', ascending=False).head(10)
return df
# Call the function
calculate_symptom_stats(icu_patient_ids, survivor_ids, deceased_ids, covid_conditions)
symptom | icu_count | percent_icu | survivor_count | percent_survivor | deceased_count | percent_deceased | |
---|---|---|---|---|---|---|---|
7 | Respiratory distress | 3677 | 100.00 | 1180 | 1.35 | 2498 | 10.34 |
0 | Fever | 3345 | 90.97 | 1068 | 1.22 | 2278 | 9.43 |
1 | Cough | 2637 | 71.72 | 853 | 0.98 | 1785 | 7.39 |
4 | Loss of taste | 1940 | 52.76 | 637 | 0.73 | 1303 | 5.40 |
8 | Fatigue | 1482 | 40.30 | 498 | 0.57 | 985 | 4.08 |
6 | Wheezing | 1381 | 37.56 | 466 | 0.53 | 916 | 3.79 |
5 | Dyspnea | 1381 | 37.56 | 466 | 0.53 | 916 | 3.79 |
2 | Sputum finding | 1271 | 34.57 | 401 | 0.46 | 870 | 3.60 |
12 | Joint pain | 656 | 17.84 | 205 | 0.23 | 451 | 1.87 |
11 | Muscle pain | 656 | 17.84 | 205 | 0.23 | 451 | 1.87 |
Create a DataFrame with columns that show a condition's start and end in days relative to COVID-19 diagnosis. Also create a column that calculates the number of days between COVID-19 diagnosis and a person's death.
covid_patient_conditions['start_days'] = (pd.to_datetime(covid_patient_conditions.START) - pd.to_datetime(covid_patient_conditions.covid_start)) / np.timedelta64(1, 'D')
covid_patient_conditions['end_days'] = (pd.to_datetime(covid_patient_conditions.STOP) - pd.to_datetime(covid_patient_conditions.covid_start)) / np.timedelta64(1, 'D')
covid_patient_conditions['death_days'] = (pd.to_datetime(covid_patient_conditions.DEATHDATE) - pd.to_datetime(covid_patient_conditions.covid_start)) / np.timedelta64(1, 'D')
covid_patient_conditions = covid_patient_conditions[covid_patient_conditions['DESCRIPTION'].str.contains(r'\(finding\)', na=False)]
covid_patient_conditions.head(2)
PATIENT | recovered | death | covid_start | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | START | STOP | ENCOUNTER | CODE | DESCRIPTION | start_days | end_days | death_days | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False | 2020-03-01 | 2020-03-30 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 0.0 | 29.0 | NaN |
4 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | True | False | 2020-03-13 | NaN | 2016-08-01 | F | False | False | 2020-03-13 | 2020-04-14 | a78e78d9-33bb-40bc-9e42-e47ff7e910cc | 49727002 | Cough (finding) | 0.0 | 32.0 | NaN |
# Step 1 & 2: Filter and remove duplicate patient-condition pairs
filtered = covid_patient_conditions[
(covid_patient_conditions['death'] == False) &
(covid_patient_conditions['icu_admit'] == True)
][['PATIENT', 'DESCRIPTION']].drop_duplicates()
# Step 3: Group by condition and count unique patients
df_survivor = filtered.groupby('DESCRIPTION') \
.agg(survivor_count=('PATIENT', 'nunique')) \
.reset_index()
# Step 4: Add percentage column
total_unique_patients = filtered['PATIENT'].nunique()
df_survivor['survivor_percent'] = round((df_survivor['survivor_count'] / total_unique_patients) * 100, 2)
# Step 5: Sort and take top 10
df_survivor = df_survivor.sort_values(by='survivor_count', ascending=False).head(15)
df_survivor
DESCRIPTION | survivor_count | survivor_percent | |
---|---|---|---|
15 | Respiratory distress (finding) | 1179 | 100.00 |
6 | Fever (finding) | 1067 | 90.50 |
2 | Cough (finding) | 852 | 72.26 |
10 | Loss of taste (finding) | 637 | 54.03 |
5 | Fatigue (finding) | 497 | 42.15 |
19 | Wheezing (finding) | 465 | 39.44 |
4 | Dyspnea (finding) | 465 | 39.44 |
17 | Sputum finding (finding) | 401 | 34.01 |
9 | Joint pain (finding) | 205 | 17.39 |
11 | Muscle pain (finding) | 205 | 17.39 |
1 | Chill (finding) | 171 | 14.50 |
7 | Headache (finding) | 166 | 14.08 |
16 | Sore throat symptom (finding) | 139 | 11.79 |
13 | Nausea (finding) | 82 | 6.96 |
18 | Vomiting symptom (finding) | 82 | 6.96 |
# Step 1 & 2: Filter and remove duplicate patient-condition pairs
filtered = covid_patient_conditions[
(covid_patient_conditions['death'] == True) &
(covid_patient_conditions['icu_admit'] == True)
][['PATIENT', 'DESCRIPTION']].drop_duplicates()
# Step 3: Group by condition and count unique patients
df_death = filtered.groupby('DESCRIPTION') \
.agg(death_count=('PATIENT', 'nunique')) \
.reset_index()
# Step 4: Add percentage column
total_unique_patients = filtered['PATIENT'].nunique()
df_death['death_percent'] = round((df_death['death_count'] / total_unique_patients) * 100, 2)
# Step 5: Sort and take top 10
df_death = df_death.sort_values(by='death_count', ascending=False).head(15)
df_death
DESCRIPTION | death_count | death_percent | |
---|---|---|---|
15 | Respiratory distress (finding) | 2498 | 100.00 |
6 | Fever (finding) | 2278 | 91.19 |
2 | Cough (finding) | 1785 | 71.46 |
10 | Loss of taste (finding) | 1303 | 52.16 |
5 | Fatigue (finding) | 985 | 39.43 |
19 | Wheezing (finding) | 916 | 36.67 |
4 | Dyspnea (finding) | 916 | 36.67 |
17 | Sputum finding (finding) | 870 | 34.83 |
9 | Joint pain (finding) | 451 | 18.05 |
11 | Muscle pain (finding) | 451 | 18.05 |
7 | Headache (finding) | 412 | 16.49 |
1 | Chill (finding) | 370 | 14.81 |
16 | Sore throat symptom (finding) | 348 | 13.93 |
13 | Nausea (finding) | 158 | 6.33 |
18 | Vomiting symptom (finding) | 158 | 6.33 |
# Step 1 & 2: Filter and remove duplicate patient-condition pairs
filtered = covid_patient_conditions[
(covid_patient_conditions['icu_admit'] == True)
][['PATIENT', 'DESCRIPTION']].drop_duplicates()
# Step 3: Group by condition and count unique patients
df_total = filtered.groupby('DESCRIPTION') \
.agg(total_count=('PATIENT', 'nunique')) \
.reset_index()
# Step 4: Add percentage column
total_unique_patients = filtered['PATIENT'].nunique()
df_total['total_percent'] = round((df_total['total_count'] / total_unique_patients) * 100, 2)
# Step 5: Sort and take top 10
df_total = df_total.sort_values(by='total_count', ascending=False).head(15)
df_total
DESCRIPTION | total_count | total_percent | |
---|---|---|---|
15 | Respiratory distress (finding) | 3677 | 100.00 |
6 | Fever (finding) | 3345 | 90.97 |
2 | Cough (finding) | 2637 | 71.72 |
10 | Loss of taste (finding) | 1940 | 52.76 |
5 | Fatigue (finding) | 1482 | 40.30 |
19 | Wheezing (finding) | 1381 | 37.56 |
4 | Dyspnea (finding) | 1381 | 37.56 |
17 | Sputum finding (finding) | 1271 | 34.57 |
9 | Joint pain (finding) | 656 | 17.84 |
11 | Muscle pain (finding) | 656 | 17.84 |
7 | Headache (finding) | 578 | 15.72 |
1 | Chill (finding) | 541 | 14.71 |
16 | Sore throat symptom (finding) | 487 | 13.24 |
13 | Nausea (finding) | 240 | 6.53 |
18 | Vomiting symptom (finding) | 240 | 6.53 |
# Merge df_total and df_survivor on 'DESCRIPTION'
merged_df = pd.merge(df_total, df_survivor, on='DESCRIPTION', how='left')
# Merge the result with df_death on 'DESCRIPTION'
merged_df = pd.merge(merged_df, df_death, on='DESCRIPTION', how='left')
merged_df['DESCRIPTION'] = merged_df['DESCRIPTION'].str.replace(
r'\(finding\)', '', regex=True).str.strip()
# Display the merged DataFrame
merged_df.head(10)
DESCRIPTION | total_count | total_percent | survivor_count | survivor_percent | death_count | death_percent | |
---|---|---|---|---|---|---|---|
0 | Respiratory distress | 3677 | 100.00 | 1179 | 100.00 | 2498 | 100.00 |
1 | Fever | 3345 | 90.97 | 1067 | 90.50 | 2278 | 91.19 |
2 | Cough | 2637 | 71.72 | 852 | 72.26 | 1785 | 71.46 |
3 | Loss of taste | 1940 | 52.76 | 637 | 54.03 | 1303 | 52.16 |
4 | Fatigue | 1482 | 40.30 | 497 | 42.15 | 985 | 39.43 |
5 | Wheezing | 1381 | 37.56 | 465 | 39.44 | 916 | 36.67 |
6 | Dyspnea | 1381 | 37.56 | 465 | 39.44 | 916 | 36.67 |
7 | Sputum finding | 1271 | 34.57 | 401 | 34.01 | 870 | 34.83 |
8 | Joint pain | 656 | 17.84 | 205 | 17.39 | 451 | 18.05 |
9 | Muscle pain | 656 | 17.84 | 205 | 17.39 | 451 | 18.05 |
Combined tableΒΆ
# --- TOTAL ICU PATIENTS ---
filtered_total = covid_patient_conditions[
covid_patient_conditions['icu_admit'] == True
][['PATIENT', 'DESCRIPTION']].drop_duplicates()
df_total = filtered_total.groupby('DESCRIPTION') \
.agg(total_count=('PATIENT', 'nunique')) \
.reset_index()
total_unique_patients = filtered_total['PATIENT'].nunique()
df_total['total_percent'] = round((df_total['total_count'] / total_unique_patients) * 100, 2)
# --- ICU SURVIVORS ---
filtered_survivor = covid_patient_conditions[
(covid_patient_conditions['death'] == False) &
(covid_patient_conditions['icu_admit'] == True)
][['PATIENT', 'DESCRIPTION']].drop_duplicates()
df_survivor = filtered_survivor.groupby('DESCRIPTION') \
.agg(survivor_count=('PATIENT', 'nunique')) \
.reset_index()
survivor_unique_patients = filtered_survivor['PATIENT'].nunique()
df_survivor['survivor_percent'] = round((df_survivor['survivor_count'] / survivor_unique_patients) * 100, 2)
# --- ICU DEATHS ---
filtered_death = covid_patient_conditions[
(covid_patient_conditions['death'] == True) &
(covid_patient_conditions['icu_admit'] == True)
][['PATIENT', 'DESCRIPTION']].drop_duplicates()
df_death = filtered_death.groupby('DESCRIPTION') \
.agg(death_count=('PATIENT', 'nunique')) \
.reset_index()
death_unique_patients = filtered_death['PATIENT'].nunique()
df_death['death_percent'] = round((df_death['death_count'] / death_unique_patients) * 100, 2)
# --- MERGE ALL ---
from functools import reduce
dfs = [df_total, df_survivor, df_death]
df_combined = reduce(lambda left, right: pd.merge(left, right, on='DESCRIPTION', how='outer'), dfs)
# Optional: Fill NaNs with 0
df_combined = df_combined.fillna(0)
# Sort by total_count and show top 15
df_combined = df_combined.sort_values(by='total_count', ascending=False).head(10)
df_combined['DESCRIPTION'] = df_combined['DESCRIPTION'].str.replace(
r'\(finding\)', '', regex=True).str.strip()
# Final output
df_combined.head(10)
DESCRIPTION | total_count | total_percent | survivor_count | survivor_percent | death_count | death_percent | |
---|---|---|---|---|---|---|---|
15 | Respiratory distress | 3677 | 100.00 | 1179 | 100.00 | 2498 | 100.00 |
6 | Fever | 3345 | 90.97 | 1067 | 90.50 | 2278 | 91.19 |
2 | Cough | 2637 | 71.72 | 852 | 72.26 | 1785 | 71.46 |
10 | Loss of taste | 1940 | 52.76 | 637 | 54.03 | 1303 | 52.16 |
5 | Fatigue | 1482 | 40.30 | 497 | 42.15 | 985 | 39.43 |
19 | Wheezing | 1381 | 37.56 | 465 | 39.44 | 916 | 36.67 |
4 | Dyspnea | 1381 | 37.56 | 465 | 39.44 | 916 | 36.67 |
17 | Sputum finding | 1271 | 34.57 | 401 | 34.01 | 870 | 34.83 |
9 | Joint pain | 656 | 17.84 | 205 | 17.39 | 451 | 18.05 |
11 | Muscle pain | 656 | 17.84 | 205 | 17.39 | 451 | 18.05 |
Symptom timelines for hospitalized patientsΒΆ
These plots show the progression of COVID-19 related complications in hospitalized patients. The bars represent the average start and end time for the particular item.
covid_patient_conditions.head(2)
PATIENT | recovered | death | covid_start | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | START | STOP | ENCOUNTER | CODE | DESCRIPTION | start_days | end_days | death_days | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False | 2020-03-01 | 2020-03-30 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 0.0 | 29.0 | NaN |
4 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | True | False | 2020-03-13 | NaN | 2016-08-01 | F | False | False | 2020-03-13 | 2020-04-14 | a78e78d9-33bb-40bc-9e42-e47ff7e910cc | 49727002 | Cough (finding) | 0.0 | 32.0 | NaN |
symptoms = ['Fever (finding)', 'Cough (finding)', 'Dyspnea (finding)']
filtered = covid_patient_conditions[(covid_patient_conditions['DESCRIPTION'].isin(symptoms)) &
(covid_patient_conditions['icu_admit'] == True)]
filtered.groupby('DESCRIPTION')['end_days'].mean().round(2)
DESCRIPTION Cough (finding) 16.29 Dyspnea (finding) 16.43 Fever (finding) 16.33 Name: end_days, dtype: float64
# Step 1: Create Boolean mask for "(finding)" and exclusion terms
condition_mask = (
covid_patient_conditions['DESCRIPTION'].str.contains(r'\(finding\)', na=False) &
~covid_patient_conditions['DESCRIPTION'].str.contains(
r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis',
case=False, na=False)
)
# Step 2: Define function with additional filtering for specific descriptions
def select_condition_averages(df, mask):
mask = mask.reindex(df.index, fill_value=False)
filtered_df = df[mask]
# Filter for specific conditions only
symptoms = ['Fever (finding)', 'Cough (finding)', 'Dyspnea (finding)']
filtered_df = filtered_df[filtered_df['DESCRIPTION'].isin(symptoms)]
averages = filtered_df.groupby('DESCRIPTION').agg(
avg_start_days= ('start_days', 'mean'),
avg_end_days= ('end_days', 'mean'),
count=('PATIENT', 'size')
).reset_index()
return averages
# Step 3: Apply function
averages = select_condition_averages(covid_patient_conditions, condition_mask)
# Step 4: View result
averages.head(3)
DESCRIPTION | avg_start_days | avg_end_days | count | |
---|---|---|---|---|
0 | Cough (finding) | -0.041698 | 22.529629 | 59763 |
1 | Dyspnea (finding) | -0.041395 | 20.302278 | 17514 |
2 | Fever (finding) | -0.041540 | 22.541085 | 78334 |
averages = select_condition_averages(
covid_patient_conditions[
(covid_patient_conditions['recovered'] == True) &
(covid_patient_conditions['icu_admit'] == True)
],
condition_mask
)
averages.head(3)
DESCRIPTION | avg_start_days | avg_end_days | count | |
---|---|---|---|---|
0 | Cough (finding) | -0.046893 | 16.293083 | 853 |
1 | Dyspnea (finding) | -0.049356 | 16.431330 | 466 |
2 | Fever (finding) | -0.048689 | 16.327715 | 1068 |
averages = select_condition_averages(covid_patient_conditions[(covid_patient_conditions['death'] == True) & (covid_patient_conditions['icu_admit'] == True )], condition_mask)
averages.head(3)
DESCRIPTION | avg_start_days | avg_end_days | count | |
---|---|---|---|---|
0 | Cough (finding) | -0.042017 | 21.0 | 1785 |
1 | Dyspnea (finding) | -0.036026 | 21.0 | 916 |
2 | Fever (finding) | -0.046971 | 21.0 | 2278 |
# Deaths of icu patients
np.intersect1d(icu_patient_ids, deceased_ids).shape
(2498,)
Add an age column to the DataFrame for rows where the patient has died
covid_info.loc[covid_info.death == True, 'age'] = (pd.to_datetime(covid_info.DEATHDATE) - pd.to_datetime(covid_info.BIRTHDATE)).dt.days / 365.25
Populate ages for survivors based on the current date
covid_info.loc[covid_info.recovered == True, 'age'] = (datetime.datetime.now() - pd.to_datetime(covid_info.BIRTHDATE)).dt.days / 365.25
Create an age_range column that places individuals into 10 year age ranges, such as 0 - 10, 10 - 20, etc.
bins = list(range(0, 120, 10))
covid_info['age_range'] = pd.cut(covid_info.age, bins=bins)
covid_info.head(3)
PATIENT | recovered | death | covid_start | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | age | age_range | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False | 7.685147 | (0, 10] |
1 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | True | False | 2020-03-13 | NaN | 2016-08-01 | F | False | False | 8.747433 | (0, 10] |
2 | 5163c501-353c-4a82-b863-a3f1df2d6cf1 | True | False | 2020-03-10 | NaN | 2004-01-09 | F | False | False | 21.308693 | (20, 30] |
Mortality by Age and SexΒΆ
A plot of deaths grouped by age range and gender.
# Mortality rate
rate = (np.intersect1d(covid_patient_ids, deceased_ids).shape[0]/len(covid_patient_ids))* 100
print(round(rate, 2),'% for all covid-19 patients')
4.13 % for all covid-19 patients
import seaborn as sns
chart = sns.catplot(x="age_range", kind="count", hue="GENDER", data=covid_info[covid_info.death == True])
for axes in chart.axes.flat:
axes.tick_params(axis='x', rotation=90)
A table view of the same information from above
covid_info[covid_info.death==True].groupby(['age_range', 'GENDER'], observed=True).count()[['PATIENT']]
PATIENT | ||
---|---|---|
age_range | GENDER | |
(0, 10] | F | 1 |
M | 1 | |
(10, 20] | F | 2 |
M | 4 | |
(20, 30] | F | 16 |
M | 102 | |
(30, 40] | F | 35 |
M | 61 | |
(40, 50] | F | 32 |
M | 128 | |
(50, 60] | F | 200 |
M | 299 | |
(60, 70] | F | 250 |
M | 425 | |
(70, 80] | F | 431 |
M | 479 | |
(80, 90] | F | 289 |
M | 437 | |
(90, 100] | F | 112 |
M | 117 | |
(100, 110] | F | 103 |
M | 106 |
Another table view of the mortality data, this time just grouped by age range
covid_info[covid_info.death == True].groupby(['age_range'], observed=True).count()[['PATIENT']]
PATIENT | |
---|---|
age_range | |
(0, 10] | 2 |
(10, 20] | 6 |
(20, 30] | 118 |
(30, 40] | 96 |
(40, 50] | 160 |
(50, 60] | 499 |
(60, 70] | 675 |
(70, 80] | 910 |
(80, 90] | 726 |
(90, 100] | 229 |
(100, 110] | 209 |
Build a DataFrame that shows the total count of a supply used on a given day
supplies.head(3)
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | QUANTITY | |
---|---|---|---|---|---|---|
0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 409534002 | Disposable air-purifying respirator (physical ... | 2 |
1 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 713779008 | Nitrile examination/treatment glove non-powde... | 24 |
2 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 469673003 | Isolation gown single-use (physical object) | 12 |
grouped_supplies = supplies.groupby(['DESCRIPTION', 'DATE']).sum()
Supply UsageΒΆ
Small multiples plot of supply usage over time.
gs = grouped_supplies.reset_index()
gs['DATE'] = pd.to_datetime(gs.DATE)
g = sns.FacetGrid(gs, col="DESCRIPTION", col_wrap=3, sharey=False, height=3, aspect=2)
g = g.map(sns.lineplot, "DATE", "QUANTITY", marker=".")
for axes in g.axes.flat:
title = axes.get_title()
if 'glove' in title:
axes.set_title('Gloves')
else:
axes.set_title(title.replace("DESCRIPTION = ", "").replace(" (physical object)", ""))
for tick in axes.get_xticklabels():
tick.set_rotation(90)
A table showing total supplies used over the entire simulation
Build a DataFrame that has cumulative case counts over time
case_counts = conditions[conditions.CODE == 840539006].groupby('START').count()[['PATIENT']]
case_counts['total'] = case_counts['PATIENT'].cumsum()
case_counts = case_counts.rename(columns={'PATIENT': 'daily'})
case_counts = case_counts.reset_index()
case_counts['START'] = pd.to_datetime(case_counts.START)
Cumulative Case CountΒΆ
Show total cases over time
axes = sns.lineplot(x='START', y='total', data=case_counts)
plt.xticks(rotation=90)
plt.show()
Medication DispensesΒΆ
This table shows medications dispensed to patients with COVID-19 since January 20, 2020.
covid_meds = medications[pd.to_datetime(medications.START) > pd.to_datetime('2020-01-20')]
covid_meds = covid_info.merge(covid_meds, on='PATIENT')
# Use only numeric columns for aggregation
covid_meds.groupby(['DESCRIPTION'])[['DISPENSES']].sum().sort_values('DISPENSES', ascending=False).head(10)
DISPENSES | |
---|---|
DESCRIPTION | |
Acetaminophen 500 MG Oral Tablet | 35509 |
0.4 ML Enoxaparin sodium 100 MG/ML Prefilled Syringe | 34733 |
1 ML Enoxaparin sodium 150 MG/ML Prefilled Syringe | 20332 |
NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT Metered Dose Inhaler | 17163 |
Hydrochlorothiazide 25 MG Oral Tablet | 6832 |
Simvastatin 10 MG Oral Tablet | 5779 |
insulin human isophane 70 UNT/ML / Regular Insulin Human 30 UNT/ML Injectable Suspension [Humulin] | 5392 |
1 ML Epoetin Alfa 4000 UNT/ML Injection [Epogen] | 4817 |
amLODIPine 5 MG / Hydrochlorothiazide 12.5 MG / Olmesartan medoxomil 20 MG Oral Tablet | 4799 |
Atenolol 50 MG / Chlorthalidone 25 MG Oral Tablet | 4550 |
Hospital Day StatsΒΆ
For patients with COVID-19, calculate the average hospital length of stay as well as total hospital days for all COVID-19 patients. Provide the same information for ICU patients
covid_info.head(3)
PATIENT | recovered | death | covid_start | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | age | age_range | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False | 7.685147 | (0, 10] |
1 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | True | False | 2020-03-13 | NaN | 2016-08-01 | F | False | False | 8.747433 | (0, 10] |
2 | 5163c501-353c-4a82-b863-a3f1df2d6cf1 | True | False | 2020-03-10 | NaN | 2004-01-09 | F | False | False | 21.308693 | (20, 30] |
covid_hosp = covid_info[covid_info['admit'] == True].merge(encounters, on='PATIENT')
covid_hosp = covid_hosp[covid_hosp['REASONCODE'] == 840539006]
covid_hosp['START'] = pd.to_datetime(covid_hosp.START)
covid_hosp['STOP'] = pd.to_datetime(covid_hosp.STOP)
covid_hosp['days'] = (covid_hosp.STOP - covid_hosp.START) / np.timedelta64(1, 'D')
covid_hosp = covid_hosp[['PATIENT', 'icu_admit', 'START', 'STOP', 'days']]
covid_hosp = covid_hosp.rename(columns={'START': 'hospital_start', 'STOP': 'hospital_stop'})
covid_hosp = covid_hosp.drop_duplicates(subset=['PATIENT', 'hospital_start', 'hospital_stop'])
covid_hosp = covid_hosp.sort_values(by='hospital_start')
covid_hosp = covid_hosp.reset_index(drop=True)
covid_hosp.head(3)
PATIENT | icu_admit | hospital_start | hospital_stop | days | |
---|---|---|---|---|---|
0 | 641699c3-03c0-40fc-9f14-38247e0a8cfa | False | 2020-01-18 20:55:04+00:00 | 2020-02-02 05:14:04+00:00 | 14.346528 |
1 | d19da9db-c5a7-468f-aadf-3c7cd5b5a8d2 | False | 2020-01-20 12:24:28+00:00 | 2020-02-08 22:59:28+00:00 | 19.440972 |
2 | a654fdb6-0796-4fb5-9ad8-435e38ed5168 | False | 2020-01-22 23:57:08+00:00 | 2020-01-25 00:27:08+00:00 | 2.020833 |
# Group by hospital admission and ICU admission
grouped_data = covid_hosp.groupby(['icu_admit']).agg(
mean_days=('days', 'mean'),
total_days=('days', 'sum'),
patient_count=('PATIENT', 'nunique')
).reset_index()
# Rename the columns for better readability
grouped_data = grouped_data.rename(columns={
'icu_admit': 'Admission'
})
grouped_data['Admission'] = grouped_data['Admission'].map({False: 'Non-ICU Inpatients', True: 'ICU Admission'})
# Calculate the total row
total_row = {
'Admission': 'Total',
'mean_days': grouped_data['mean_days'].mean(),
'total_days': grouped_data['total_days'].sum(),
'patient_count': grouped_data['patient_count'].sum()
}
# Append the total row to the DataFrame using pd.concat
grouped_total = pd.concat(
[grouped_data, pd.DataFrame([total_row])],
ignore_index=True
)
# Display the updated DataFrame
grouped_total
Admission | mean_days | total_days | patient_count | |
---|---|---|---|---|
0 | Non-ICU Inpatients | 13.817881 | 214412.061111 | 14498 |
1 | ICU Admission | 5.376755 | 59305.611806 | 3677 |
2 | Total | 9.597318 | 273717.672917 | 18175 |
device_codes = [448907002, 449071006, 36965003]
grouped_dev = devices[devices.CODE.isin(device_codes)].groupby(['DESCRIPTION', 'START']).count()
grouped_dev = grouped_dev.reset_index()
grouped_dev['START'] = pd.to_datetime(grouped_dev.START)
Device UsageΒΆ
Show the number of devices used to treat COVID-19 over time.
g = sns.FacetGrid(grouped_dev.reset_index(), col="DESCRIPTION", col_wrap=3, sharey=False, height=3, aspect=2)
g = g.map(sns.lineplot, "START", "PATIENT", marker=".")
for axes in g.axes.flat:
title = axes.get_title()
axes.set_title(title.replace("DESCRIPTION = ", "").replace(" (physical object)", ""))
for tick in axes.get_xticklabels():
tick.set_rotation(90)
ConclusionΒΆ
In this project, I learned how to:
- extract data for specific analyses, such as filtering, merging, and recoding data based on fields like 'CODE', 'REASONCODE', and 'DESCRIPTION', for different groups of patients including inpatients, COVID patients, ICU-admitted, and ventilated individuals.