import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime
Chapter_1 - Extracting Medical Data
Extracting correct data
Medical data is stored across multiple tables in a hospital database. In an EHR system like EPIC, there can be thousands of tables. Therefore, extracting the correct data we need can sometimes be challenging.
For example, if I want to analyze COVID-19 data, I first need to identify where the relevant data is stored and in which tables. In the SyntheaMass database, I start with the ‘conditions’ table, where all diseases are defined with fields such as code, description, start (date), stop (date), patient (ID), and more.
The skill to identify sources of specific data comes with training, time and experience working with hospital databases.
Loading the data
Import necessary python packages
Loading the database with multiple tables is quite simple in python.
# Load the data
= pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv") medications
Covid patient IDs
First, I have to identify code of the covid disease in the table ‘conditions’.
print(conditions.columns)
conditions.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | |
---|---|---|---|---|---|---|
0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media |
1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media |
2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) |
3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 |
4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 |
In order to extract IDs for only covid19 patients I code like this:
= conditions[conditions.CODE == 840539006].PATIENT.unique()
covid_patient_ids print(covid_patient_ids.shape[0])
0:5] covid_patient_ids[
88166
array(['1ff7f10f-a204-4bb1-aa72-dd763fa99482',
'9bcf6ed5-d808-44af-98a0-7d78a29ede72',
'5163c501-353c-4a82-b863-a3f1df2d6cf1',
'cc3c806f-4a09-4a89-a990-4286450956be',
'bd1c4ffc-7f1d-4590-adbb-1d6533fb623e'], dtype=object)
I want to have data of the covid patients I code as:
# Filter patients who have COVID-19
= patients[patients.Id.isin(covid_patient_ids)]
covid_patients print(covid_patients.columns)
print(covid_patients.shape[0])
covid_patients.head()
Index(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY',
'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP',
'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
dtype='object')
88166
Id | BIRTHDATE | DEATHDATE | SSN | DRIVERS | PASSPORT | PREFIX | FIRST | LAST | SUFFIX | ... | BIRTHPLACE | ADDRESS | CITY | STATE | COUNTY | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | 999-68-6630 | NaN | NaN | NaN | Jacinto644 | Kris249 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
1 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 2016-08-01 | NaN | 999-15-5895 | NaN | NaN | NaN | Alva958 | Krajcik437 | NaN | ... | Boston Massachusetts US | 1048 Skiles Trailer | Walpole | Massachusetts | Norfolk County | 2081.0 | 42.177370 | -71.281353 | 94568.40 | 1870.72 |
2 | 5163c501-353c-4a82-b863-a3f1df2d6cf1 | 2004-01-09 | NaN | 999-73-2461 | S99999063 | NaN | NaN | Jimmie93 | Harris789 | NaN | ... | Worcester Massachusetts US | 201 Mitchell Lodge Unit 67 | Pembroke | Massachusetts | Plymouth County | NaN | 42.075292 | -70.757035 | 375754.62 | 3131.44 |
3 | cc3c806f-4a09-4a89-a990-4286450956be | 1996-11-15 | NaN | 999-60-7372 | S99924941 | X9952947X | Mr. | Gregorio366 | Auer97 | NaN | ... | Patras Achaea GR | 1050 Lindgren Extension Apt 38 | Boston | Massachusetts | Suffolk County | 2135.0 | 42.352434 | -71.028610 | 484758.46 | 3632.96 |
4 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | 2019-06-12 | NaN | 999-81-4349 | NaN | NaN | NaN | Karyn217 | Mueller846 | NaN | ... | New Bedford Massachusetts US | 570 Abshire Forge Suite 32 | Colrain | Massachusetts | Franklin County | NaN | 42.740220 | -72.722648 | 24130.00 | 903.28 |
5 rows × 25 columns
There are 88,166 COVID-19 patients, including both those who were isolated at home and those who were hospitalized.
Covid inpatients
To extract data for COVID-19 patients who were hospitalized due to the virus, I use the ‘encounters’ table. Specifically, I filter for records where the encounter class is inpatient (CODE = 1505002) and the reason is COVID-19 (REASONCODE = 840539006).
print(encounters.columns)
encounters.head()
Index(['Id', 'START', 'STOP', 'PATIENT', 'ORGANIZATION', 'PROVIDER', 'PAYER',
'ENCOUNTERCLASS', 'CODE', 'DESCRIPTION', 'BASE_ENCOUNTER_COST',
'TOTAL_CLAIM_COST', 'PAYER_COVERAGE', 'REASONCODE',
'REASONDESCRIPTION'],
dtype='object')
Id | START | STOP | PATIENT | ORGANIZATION | PROVIDER | PAYER | ENCOUNTERCLASS | CODE | DESCRIPTION | BASE_ENCOUNTER_COST | TOTAL_CLAIM_COST | PAYER_COVERAGE | REASONCODE | REASONDESCRIPTION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | d317dacb-d801-4136-be7d-440094f7ae83 | 2019-02-16T01:43:20Z | 2019-02-16T01:58:20Z | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 5103c940-0c08-392f-95cd-446e0cea042a | 8b532fbe-4254-3a60-a442-33028916d24e | 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a | outpatient | 185345009 | Encounter for symptom | 129.16 | 129.16 | 69.16 | 65363002.0 | Otitis media |
1 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 2019-08-02T01:43:20Z | 2019-08-02T02:13:20Z | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 0b9f3f7c-8ab6-30a5-b3ae-4dc0e0c00cb3 | 872f1318-45f7-34cc-adc7-26d59fc7c203 | 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a | wellness | 410620009 | Well child visit (procedure) | 129.16 | 129.16 | 129.16 | NaN | NaN |
2 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 2019-10-31T01:43:20Z | 2019-10-31T01:58:20Z | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 5103c940-0c08-392f-95cd-446e0cea042a | 8b532fbe-4254-3a60-a442-33028916d24e | 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a | outpatient | 185345009 | Encounter for symptom | 129.16 | 129.16 | 69.16 | 65363002.0 | Otitis media |
3 | 20032279-0deb-41f9-bea4-93710396eb95 | 2020-01-31T01:43:20Z | 2020-01-31T01:58:20Z | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 0b9f3f7c-8ab6-30a5-b3ae-4dc0e0c00cb3 | 872f1318-45f7-34cc-adc7-26d59fc7c203 | 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a | wellness | 410620009 | Well child visit (procedure) | 129.16 | 129.16 | 129.16 | NaN | NaN |
4 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 2020-03-02T01:43:20Z | 2020-03-02T02:39:20Z | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | fd328395-ab1d-35c6-a2d0-d05a9a79cf11 | 595c4f10-bcce-3892-a5d2-f95e3ab2ab74 | 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a | ambulatory | 185345009 | Encounter for symptom (procedure) | 129.16 | 129.16 | 69.16 | NaN | NaN |
I code as follow:
= encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()
covid_inpatient_ids
= patients[patients.Id.isin(covid_inpatient_ids)]
covid_inpatients
print(len(covid_inpatient_ids))
print(covid_inpatients.columns)
covid_inpatients.head()
18177
Index(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY',
'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP',
'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
dtype='object')
Id | BIRTHDATE | DEATHDATE | SSN | DRIVERS | PASSPORT | PREFIX | FIRST | LAST | SUFFIX | ... | BIRTHPLACE | ADDRESS | CITY | STATE | COUNTY | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | 2019-06-12 | NaN | 999-81-4349 | NaN | NaN | NaN | Karyn217 | Mueller846 | NaN | ... | New Bedford Massachusetts US | 570 Abshire Forge Suite 32 | Colrain | Massachusetts | Franklin County | NaN | 42.740220 | -72.722648 | 24130.00 | 903.28 |
16 | 3c3b89b1-cb41-4f94-9193-2f3da4fe38e5 | 1983-09-30 | NaN | 999-57-9112 | S99937871 | X59867376X | Mrs. | Lisbeth69 | Rowe323 | NaN | ... | Agawam Massachusetts US | 290 Hyatt Lane | Malden | Massachusetts | Middlesex County | 2155.0 | 42.414955 | -71.067066 | 752343.27 | 6611.76 |
20 | 7e2c6949-dce9-44be-b6ef-50aca95840d7 | 1958-05-09 | NaN | 999-49-5916 | S99935194 | X2603260X | Mrs. | Danae973 | Franecki195 | NaN | ... | Holyoke Massachusetts US | 1062 Conn Well Suite 4 | Framingham | Massachusetts | Middlesex County | 1702.0 | 42.309678 | -71.423073 | 1507529.08 | 19755.11 |
21 | 0ab3b40f-3afd-42b8-9755-5834aa42eb67 | 1959-06-23 | NaN | 999-59-1261 | S99928544 | X30032284X | Mrs. | Anissa357 | Mueller846 | NaN | ... | Pembroke Massachusetts US | 661 Schulist Highlands | Waltham | Massachusetts | Middlesex County | 2452.0 | 42.425915 | -71.288673 | 1363750.36 | 15054.36 |
28 | 346b2b95-b8c5-4fe1-acf3-94de7e6e965b | 1987-08-31 | NaN | 999-47-3790 | S99987128 | X31069450X | Mr. | Erwin847 | Weimann465 | NaN | ... | Quincy Massachusetts US | 297 Fritsch Estate Apt 24 | Lynnfield | Massachusetts | Essex County | NaN | 42.570672 | -71.044043 | 678188.97 | 4118.76 |
5 rows × 25 columns
There are 18,177 patients who are admitted to hospital for covid.
Covid ICU-admitted patients
= encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 305351004)].PATIENT.unique()
icu_patient_ids len(icu_patient_ids)
3677
Among covid inpatients, 3,677 patients were admitted to ICU.
Covid ventilated patients
print(procedures.columns)
procedures.head()
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'BASE_COST',
'REASONCODE', 'REASONDESCRIPTION'],
dtype='object')
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | BASE_COST | REASONCODE | REASONDESCRIPTION | |
---|---|---|---|---|---|---|---|---|
0 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 430193006 | Medication Reconciliation (procedure) | 495.80 | NaN | NaN |
1 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 261352009 | Face mask (physical object) | 516.65 | 840544004.0 | Suspected COVID-19 |
2 | 2020-01-06 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 723060b3-7be8-4d12-8e1e-04f9da2c52d5 | 430193006 | Medication Reconciliation (procedure) | 453.37 | NaN | NaN |
3 | 2020-03-13 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | a78e78d9-33bb-40bc-9e42-e47ff7e910cc | 261352009 | Face mask (physical object) | 516.65 | 840544004.0 | Suspected COVID-19 |
4 | 2020-04-28 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 794aac2b-0eb8-4c36-98bb-a4317b344e6e | 117015009 | Throat culture (procedure) | 2169.16 | 43878008.0 | Streptococcal sore throat (disorder) |
= procedures[procedures.CODE == 26763009].PATIENT.unique()
vent_ids = np.intersect1d(vent_ids, covid_patient_ids)
vent_covid_ids len(vent_covid_ids)
2914
There are 2,914 covid patients who were ventilated.
Covid isolated patients
Patients who are isolated at home for covid are identified in the table ‘care_plans’
= care_plans[(care_plans.CODE == 736376001) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
isolation_ids len(isolation_ids)
70536
There are 70,536 covid patients isolated at their home.
Completed isolation patients
= care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
completed_isolation_patients len(completed_isolation_patients)
70398
There are 70,398 covid patients who completed their home isolation.
Negative covid patients
I look at the table ‘observations’ for lab tests.
print(observations.columns)
observations.head()
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'VALUE', 'UNITS',
'TYPE'],
dtype='object')
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | |
---|---|---|---|---|---|---|---|---|
0 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 8302-2 | Body Height | 82.7 | cm | numeric |
1 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 72514-3 | Pain severity - 0-10 verbal numeric rating [Sc... | 2.0 | {score} | numeric |
2 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 29463-7 | Body Weight | 11.5 | kg | numeric |
3 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 77606-2 | Weight-for-length Per age and sex | 47.0 | % | numeric |
4 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 52051c30-c6c3-45fe-b5da-a790f1680e91 | 9843-4 | Head Occipital-frontal circumference | 46.9 | cm | numeric |
Then filtering for negative covid patient Ids.
= observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
negative_covid_patient_ids len(negative_covid_patient_ids)
17524
There are 17,524 COVID-19 patients who were tested for the virus and received negative results.
Covid survivors
= np.union1d(completed_isolation_patients, negative_covid_patient_ids)
survivor_ids len(survivor_ids)
87485
There are 87,485 covid patients who survived the virus
Covid deaths
= patients[patients.DEATHDATE.notna()].Id.unique()
deceased_ids = np.intersect1d(deceased_ids, covid_inpatient_ids)
covid_deceased_ids len(covid_deceased_ids)
3548
There are 3,548 covid patients who died.
Lab values for COVID-19 patients
I find labs values in the table ’observations’
LOINC codes = 'CODE = 48065-7': 'D-dimer',
'CODE = 2276-4': 'Serum Ferritin',
'CODE = 89579-7': 'High Sensitivity Cardiac Troponin I',
'CODE = 26881-3': 'IL-6', 'CODE = 731-0': 'Lymphocytes',
'CODE = 14804-9': 'Lactate dehydrogenase'
= observations[
lab_obs == '48065-7') |
(observations.CODE == '26881-3') |
(observations.CODE == '2276-4') |
(observations.CODE == '89579-7') |
(observations.CODE == '731-0') |
(observations.CODE == '14804-9')
(observations.CODE
]
print(lab_obs.columns)
lab_obs.head()
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'VALUE', 'UNITS',
'TYPE'],
dtype='object')
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | |
---|---|---|---|---|---|---|---|---|
254 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric |
273 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric |
274 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric |
275 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 89579-7 | Troponin I.cardiac [Mass/volume] in Serum or P... | 2.3 | pg/mL | numeric |
276 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 14804-9 | Lactate dehydrogenase [Enzymatic activity/volu... | 223.9 | U/L | numeric |
Medication dispenses
The earliest date of covid is ‘2020-01-20’
= medications[pd.to_datetime(medications.START) > pd.to_datetime('2020-01-20')]
covid_meds = covid_meds[covid_meds['PATIENT'].isin(covid_patient_ids)]
covid_meds print(covid_meds.columns)
covid_meds.head()
Index(['START', 'STOP', 'PATIENT', 'PAYER', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
'BASE_COST', 'PAYER_COVERAGE', 'DISPENSES', 'TOTALCOST', 'REASONCODE',
'REASONDESCRIPTION'],
dtype='object')
START | STOP | PATIENT | PAYER | ENCOUNTER | CODE | DESCRIPTION | BASE_COST | PAYER_COVERAGE | DISPENSES | TOTALCOST | REASONCODE | REASONDESCRIPTION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2020-02-12 | 2020-02-26 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 4d71f845-a6a9-3c39-b242-14d25ef86a8d | 5d9dfe80-7edf-4181-9753-770e361934d1 | 313820 | Acetaminophen 160 MG Chewable Tablet | 5.25 | 0.0 | 1 | 5.25 | NaN | NaN |
3 | 2020-04-28 | 2020-05-08 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 4d71f845-a6a9-3c39-b242-14d25ef86a8d | 794aac2b-0eb8-4c36-98bb-a4317b344e6e | 834061 | Penicillin V Potassium 250 MG Oral Tablet | 13.90 | 0.0 | 1 | 13.90 | 43878008.0 | Streptococcal sore throat (disorder) |
5 | 2020-05-03 | 2020-05-17 | 5163c501-353c-4a82-b863-a3f1df2d6cf1 | 5059a55e-5d6e-34d1-b6cb-d83d16e57bcf | f2df2577-f1ff-4e0f-a467-8f72d494f478 | 313782 | Acetaminophen 325 MG Oral Tablet | 6.37 | 0.0 | 1 | 6.37 | 10509002.0 | Acute bronchitis (disorder) |
6 | 2020-02-19 | NaN | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | 4d71f845-a6a9-3c39-b242-14d25ef86a8d | b7455838-3607-47f4-aaa5-fd89abea7d29 | 854235 | 0.4 ML Enoxaparin sodium 100 MG/ML Prefilled S... | 263.49 | 0.0 | 3 | 790.47 | NaN | NaN |
7 | 2020-02-19 | NaN | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | 4d71f845-a6a9-3c39-b242-14d25ef86a8d | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2123111 | NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT ... | 302.07 | 0.0 | 3 | 906.21 | 389087006.0 | Hypoxemia (disorder) |
Combined covid patient data
= conditions[conditions.CODE == 840539006].copy()
covid_cond 'covid_date'] = covid_cond['START']
covid_cond[= conditions[conditions['PATIENT'].isin(covid_patient_ids)].copy()
covid_conditions = covid_conditions.merge(covid_cond[['PATIENT', 'covid_date']], on='PATIENT', how='left')
covid_conditions covid_conditions.head()
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | |
---|---|---|---|---|---|---|---|
0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media | 2020-03-01 |
1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media | 2020-03-01 |
2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 2020-03-01 |
3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 | 2020-03-01 |
4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 | 2020-03-01 |
= covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
cp print(cp.columns)
cp.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
'covid_date', 'Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS',
'PASSPORT', 'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL',
'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE',
'COUNTY', 'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES',
'HEALTHCARE_COVERAGE'],
dtype='object')
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | Id | BIRTHDATE | DEATHDATE | ... | BIRTHPLACE | ADDRESS | CITY | STATE | COUNTY | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | Beverly Massachusetts US | 888 Hickle Ferry Suite 38 | Springfield | Massachusetts | Hampden County | 1106.0 | 42.151961 | -72.598959 | 8446.49 | 1499.08 |
5 rows × 32 columns
'isolation'] = cp.Id.isin(isolation_ids)
cp['admit'] = cp.Id.isin(covid_inpatient_ids)
cp['recovered'] = cp.Id.isin(survivor_ids)
cp['death'] = cp.DEATHDATE.notna()
cp['icu_admit'] = cp.Id.isin(icu_patient_ids)
cp['ventilated'] = cp.Id.isin(vent_covid_ids)
cp[print(cp.columns)
cp.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
'covid_date', 'Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS',
'PASSPORT', 'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL',
'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE',
'COUNTY', 'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES',
'HEALTHCARE_COVERAGE', 'isolation', 'admit', 'recovered', 'death',
'icu_admit', 'ventilated'],
dtype='object')
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | Id | BIRTHDATE | DEATHDATE | ... | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | isolation | admit | recovered | death | icu_admit | ventilated | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | 42.151961 | -72.598959 | 8446.49 | 1499.08 | True | False | True | False | False | False |
1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | 42.151961 | -72.598959 | 8446.49 | 1499.08 | True | False | True | False | False | False |
2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | 42.151961 | -72.598959 | 8446.49 | 1499.08 | True | False | True | False | False | False |
3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | 42.151961 | -72.598959 | 8446.49 | 1499.08 | True | False | True | False | False | False |
4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 2017-08-24 | NaN | ... | 42.151961 | -72.598959 | 8446.49 | 1499.08 | True | False | True | False | False | False |
5 rows × 38 columns
The condensed dataset
= cp[['PATIENT', 'recovered', 'death', 'covid_date', 'DEATHDATE', 'BIRTHDATE', 'GENDER', 'admit', 'icu_admit']]
covid_info print(covid_info.columns)
covid_info.head()
Index(['PATIENT', 'recovered', 'death', 'covid_date', 'DEATHDATE', 'BIRTHDATE',
'GENDER', 'admit', 'icu_admit'],
dtype='object')
PATIENT | recovered | death | covid_date | DEATHDATE | BIRTHDATE | GENDER | admit | icu_admit | |
---|---|---|---|---|---|---|---|---|---|
0 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False |
1 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False |
2 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False |
3 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False |
4 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | True | False | 2020-03-01 | NaN | 2017-08-24 | M | False | False |
Conclusion
The most important thing is to know where the data we need is stored and which tables contain it. We use fields like CODE, DESCRIPTION, and REASONCODE to identify the relevant information. Then, we filter and join these tables to create the datasets we want.
Extracting exact medical data from a hospital database can sometimes be challenging. No one is immune to making mistakes when working with healthcare data. Therefore, it’s important to double-check codes and results multiple times while performing data-related tasks.