Chapter_1 - Extracting Medical Data

Author

Thieu Nguyen

Extracting correct data

Medical data is stored across multiple tables in a hospital database. In an EHR system like EPIC, there can be thousands of tables. Therefore, extracting the correct data we need can sometimes be challenging.

For example, if I want to analyze COVID-19 data, I first need to identify where the relevant data is stored and in which tables. In the SyntheaMass database, I start with the ‘conditions’ table, where all diseases are defined with fields such as codedescriptionstart (date), stop (date), patient (ID), and more.

The skill to identify sources of specific data comes with training, time and experience working with hospital databases.

Loading the data

Import necessary python packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime

Loading the database with multiple tables is quite simple in python.

# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")

Covid patient IDs

First, I have to identify code of the covid disease in the table ‘conditions’.

print(conditions.columns)
conditions.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')
START STOP PATIENT ENCOUNTER CODE DESCRIPTION
0 2019-02-15 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 d317dacb-d801-4136-be7d-440094f7ae83 65363002 Otitis media
1 2019-10-30 2020-01-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 759cfd9c-b40b-45df-a5b2-fca418027a6a 65363002 Otitis media
2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding)
3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19
4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19

In order to extract IDs for only covid19 patients I code like this:

covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()
print(covid_patient_ids.shape[0])
covid_patient_ids[0:5]
88166
array(['1ff7f10f-a204-4bb1-aa72-dd763fa99482',
       '9bcf6ed5-d808-44af-98a0-7d78a29ede72',
       '5163c501-353c-4a82-b863-a3f1df2d6cf1',
       'cc3c806f-4a09-4a89-a990-4286450956be',
       'bd1c4ffc-7f1d-4590-adbb-1d6533fb623e'], dtype=object)

I want to have data of the covid patients I code as:

# Filter patients who have COVID-19
covid_patients = patients[patients.Id.isin(covid_patient_ids)]  
print(covid_patients.columns)
print(covid_patients.shape[0])
covid_patients.head()
Index(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
       'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY',
       'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP',
       'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
      dtype='object')
88166
Id BIRTHDATE DEATHDATE SSN DRIVERS PASSPORT PREFIX FIRST LAST SUFFIX ... BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
0 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN 999-68-6630 NaN NaN NaN Jacinto644 Kris249 NaN ... Beverly Massachusetts US 888 Hickle Ferry Suite 38 Springfield Massachusetts Hampden County 1106.0 42.151961 -72.598959 8446.49 1499.08
1 9bcf6ed5-d808-44af-98a0-7d78a29ede72 2016-08-01 NaN 999-15-5895 NaN NaN NaN Alva958 Krajcik437 NaN ... Boston Massachusetts US 1048 Skiles Trailer Walpole Massachusetts Norfolk County 2081.0 42.177370 -71.281353 94568.40 1870.72
2 5163c501-353c-4a82-b863-a3f1df2d6cf1 2004-01-09 NaN 999-73-2461 S99999063 NaN NaN Jimmie93 Harris789 NaN ... Worcester Massachusetts US 201 Mitchell Lodge Unit 67 Pembroke Massachusetts Plymouth County NaN 42.075292 -70.757035 375754.62 3131.44
3 cc3c806f-4a09-4a89-a990-4286450956be 1996-11-15 NaN 999-60-7372 S99924941 X9952947X Mr. Gregorio366 Auer97 NaN ... Patras Achaea GR 1050 Lindgren Extension Apt 38 Boston Massachusetts Suffolk County 2135.0 42.352434 -71.028610 484758.46 3632.96
4 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e 2019-06-12 NaN 999-81-4349 NaN NaN NaN Karyn217 Mueller846 NaN ... New Bedford Massachusetts US 570 Abshire Forge Suite 32 Colrain Massachusetts Franklin County NaN 42.740220 -72.722648 24130.00 903.28

5 rows × 25 columns

There are 88,166 COVID-19 patients, including both those who were isolated at home and those who were hospitalized.

Covid inpatients

To extract data for COVID-19 patients who were hospitalized due to the virus, I use the ‘encounters’ table. Specifically, I filter for records where the encounter class is inpatient (CODE = 1505002) and the reason is COVID-19 (REASONCODE = 840539006).

print(encounters.columns)
encounters.head()
Index(['Id', 'START', 'STOP', 'PATIENT', 'ORGANIZATION', 'PROVIDER', 'PAYER',
       'ENCOUNTERCLASS', 'CODE', 'DESCRIPTION', 'BASE_ENCOUNTER_COST',
       'TOTAL_CLAIM_COST', 'PAYER_COVERAGE', 'REASONCODE',
       'REASONDESCRIPTION'],
      dtype='object')
Id START STOP PATIENT ORGANIZATION PROVIDER PAYER ENCOUNTERCLASS CODE DESCRIPTION BASE_ENCOUNTER_COST TOTAL_CLAIM_COST PAYER_COVERAGE REASONCODE REASONDESCRIPTION
0 d317dacb-d801-4136-be7d-440094f7ae83 2019-02-16T01:43:20Z 2019-02-16T01:58:20Z 1ff7f10f-a204-4bb1-aa72-dd763fa99482 5103c940-0c08-392f-95cd-446e0cea042a 8b532fbe-4254-3a60-a442-33028916d24e 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a outpatient 185345009 Encounter for symptom 129.16 129.16 69.16 65363002.0 Otitis media
1 52051c30-c6c3-45fe-b5da-a790f1680e91 2019-08-02T01:43:20Z 2019-08-02T02:13:20Z 1ff7f10f-a204-4bb1-aa72-dd763fa99482 0b9f3f7c-8ab6-30a5-b3ae-4dc0e0c00cb3 872f1318-45f7-34cc-adc7-26d59fc7c203 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a wellness 410620009 Well child visit (procedure) 129.16 129.16 129.16 NaN NaN
2 759cfd9c-b40b-45df-a5b2-fca418027a6a 2019-10-31T01:43:20Z 2019-10-31T01:58:20Z 1ff7f10f-a204-4bb1-aa72-dd763fa99482 5103c940-0c08-392f-95cd-446e0cea042a 8b532fbe-4254-3a60-a442-33028916d24e 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a outpatient 185345009 Encounter for symptom 129.16 129.16 69.16 65363002.0 Otitis media
3 20032279-0deb-41f9-bea4-93710396eb95 2020-01-31T01:43:20Z 2020-01-31T01:58:20Z 1ff7f10f-a204-4bb1-aa72-dd763fa99482 0b9f3f7c-8ab6-30a5-b3ae-4dc0e0c00cb3 872f1318-45f7-34cc-adc7-26d59fc7c203 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a wellness 410620009 Well child visit (procedure) 129.16 129.16 129.16 NaN NaN
4 60584c6a-e26b-4176-a660-86ad84c7ceca 2020-03-02T01:43:20Z 2020-03-02T02:39:20Z 1ff7f10f-a204-4bb1-aa72-dd763fa99482 fd328395-ab1d-35c6-a2d0-d05a9a79cf11 595c4f10-bcce-3892-a5d2-f95e3ab2ab74 7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a ambulatory 185345009 Encounter for symptom (procedure) 129.16 129.16 69.16 NaN NaN

I code as follow:

covid_inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()

covid_inpatients = patients[patients.Id.isin(covid_inpatient_ids)] 

print(len(covid_inpatient_ids))
print(covid_inpatients.columns)
covid_inpatients.head()
18177
Index(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
       'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY',
       'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP',
       'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
      dtype='object')
Id BIRTHDATE DEATHDATE SSN DRIVERS PASSPORT PREFIX FIRST LAST SUFFIX ... BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
4 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e 2019-06-12 NaN 999-81-4349 NaN NaN NaN Karyn217 Mueller846 NaN ... New Bedford Massachusetts US 570 Abshire Forge Suite 32 Colrain Massachusetts Franklin County NaN 42.740220 -72.722648 24130.00 903.28
16 3c3b89b1-cb41-4f94-9193-2f3da4fe38e5 1983-09-30 NaN 999-57-9112 S99937871 X59867376X Mrs. Lisbeth69 Rowe323 NaN ... Agawam Massachusetts US 290 Hyatt Lane Malden Massachusetts Middlesex County 2155.0 42.414955 -71.067066 752343.27 6611.76
20 7e2c6949-dce9-44be-b6ef-50aca95840d7 1958-05-09 NaN 999-49-5916 S99935194 X2603260X Mrs. Danae973 Franecki195 NaN ... Holyoke Massachusetts US 1062 Conn Well Suite 4 Framingham Massachusetts Middlesex County 1702.0 42.309678 -71.423073 1507529.08 19755.11
21 0ab3b40f-3afd-42b8-9755-5834aa42eb67 1959-06-23 NaN 999-59-1261 S99928544 X30032284X Mrs. Anissa357 Mueller846 NaN ... Pembroke Massachusetts US 661 Schulist Highlands Waltham Massachusetts Middlesex County 2452.0 42.425915 -71.288673 1363750.36 15054.36
28 346b2b95-b8c5-4fe1-acf3-94de7e6e965b 1987-08-31 NaN 999-47-3790 S99987128 X31069450X Mr. Erwin847 Weimann465 NaN ... Quincy Massachusetts US 297 Fritsch Estate Apt 24 Lynnfield Massachusetts Essex County NaN 42.570672 -71.044043 678188.97 4118.76

5 rows × 25 columns

There are 18,177 patients who are admitted to hospital for covid.

Covid ICU-admitted patients

icu_patient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 305351004)].PATIENT.unique()
len(icu_patient_ids)
3677

Among covid inpatients, 3,677 patients were admitted to ICU.

Covid ventilated patients

print(procedures.columns)
procedures.head()
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'BASE_COST',
       'REASONCODE', 'REASONDESCRIPTION'],
      dtype='object')
DATE PATIENT ENCOUNTER CODE DESCRIPTION BASE_COST REASONCODE REASONDESCRIPTION
0 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 52051c30-c6c3-45fe-b5da-a790f1680e91 430193006 Medication Reconciliation (procedure) 495.80 NaN NaN
1 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 261352009 Face mask (physical object) 516.65 840544004.0 Suspected COVID-19
2 2020-01-06 9bcf6ed5-d808-44af-98a0-7d78a29ede72 723060b3-7be8-4d12-8e1e-04f9da2c52d5 430193006 Medication Reconciliation (procedure) 453.37 NaN NaN
3 2020-03-13 9bcf6ed5-d808-44af-98a0-7d78a29ede72 a78e78d9-33bb-40bc-9e42-e47ff7e910cc 261352009 Face mask (physical object) 516.65 840544004.0 Suspected COVID-19
4 2020-04-28 9bcf6ed5-d808-44af-98a0-7d78a29ede72 794aac2b-0eb8-4c36-98bb-a4317b344e6e 117015009 Throat culture (procedure) 2169.16 43878008.0 Streptococcal sore throat (disorder)
vent_ids = procedures[procedures.CODE == 26763009].PATIENT.unique()
vent_covid_ids = np.intersect1d(vent_ids, covid_patient_ids)
len(vent_covid_ids)
2914

There are 2,914 covid patients who were ventilated.

Covid isolated patients

Patients who are isolated at home for covid are identified in the table ‘care_plans’

isolation_ids = care_plans[(care_plans.CODE == 736376001) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
len(isolation_ids)
70536

There are 70,536 covid patients isolated at their home.

Completed isolation patients

completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
len(completed_isolation_patients)
70398

There are 70,398 covid patients who completed their home isolation.

Negative covid patients

I look at the table ‘observations’ for lab tests.

print(observations.columns)
observations.head()
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'VALUE', 'UNITS',
       'TYPE'],
      dtype='object')
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE
0 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 52051c30-c6c3-45fe-b5da-a790f1680e91 8302-2 Body Height 82.7 cm numeric
1 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 52051c30-c6c3-45fe-b5da-a790f1680e91 72514-3 Pain severity - 0-10 verbal numeric rating [Sc... 2.0 {score} numeric
2 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 52051c30-c6c3-45fe-b5da-a790f1680e91 29463-7 Body Weight 11.5 kg numeric
3 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 52051c30-c6c3-45fe-b5da-a790f1680e91 77606-2 Weight-for-length Per age and sex 47.0 % numeric
4 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 52051c30-c6c3-45fe-b5da-a790f1680e91 9843-4 Head Occipital-frontal circumference 46.9 cm numeric

Then filtering for negative covid patient Ids.

negative_covid_patient_ids = observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
len(negative_covid_patient_ids)
17524

There are 17,524 COVID-19 patients who were tested for the virus and received negative results.

Covid survivors

survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)
len(survivor_ids)
87485

There are 87,485 covid patients who survived the virus

Covid deaths

deceased_ids = patients[patients.DEATHDATE.notna()].Id.unique()
covid_deceased_ids = np.intersect1d(deceased_ids, covid_inpatient_ids)
len(covid_deceased_ids)
3548

There are 3,548 covid patients who died.

Lab values for COVID-19 patients

I find labs values in the table ’observations’

LOINC codes = 'CODE = 48065-7': 'D-dimer', 
              'CODE = 2276-4': 'Serum Ferritin',
              'CODE = 89579-7': 'High Sensitivity Cardiac Troponin I',
              'CODE = 26881-3': 'IL-6', 'CODE = 731-0': 'Lymphocytes',
              'CODE = 14804-9': 'Lactate dehydrogenase'
lab_obs = observations[
    (observations.CODE == '48065-7') | 
    (observations.CODE == '26881-3') | 
    (observations.CODE == '2276-4') | 
    (observations.CODE == '89579-7') |
    (observations.CODE == '731-0') | 
    (observations.CODE == '14804-9')
]

print(lab_obs.columns)
lab_obs.head()
Index(['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'VALUE', 'UNITS',
       'TYPE'],
      dtype='object')
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE
254 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 731-0 Lymphocytes [#/volume] in Blood by Automated c... 1.1 10*3/uL numeric
273 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 48065-7 Fibrin D-dimer FEU [Mass/volume] in Platelet p... 0.4 ug/mL numeric
274 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 2276-4 Ferritin [Mass/volume] in Serum or Plasma 332.4 ug/L numeric
275 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 89579-7 Troponin I.cardiac [Mass/volume] in Serum or P... 2.3 pg/mL numeric
276 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 14804-9 Lactate dehydrogenase [Enzymatic activity/volu... 223.9 U/L numeric

Medication dispenses

The earliest date of covid is ‘2020-01-20’

covid_meds = medications[pd.to_datetime(medications.START) > pd.to_datetime('2020-01-20')]
covid_meds = covid_meds[covid_meds['PATIENT'].isin(covid_patient_ids)]
print(covid_meds.columns)
covid_meds.head()
Index(['START', 'STOP', 'PATIENT', 'PAYER', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
       'BASE_COST', 'PAYER_COVERAGE', 'DISPENSES', 'TOTALCOST', 'REASONCODE',
       'REASONDESCRIPTION'],
      dtype='object')
START STOP PATIENT PAYER ENCOUNTER CODE DESCRIPTION BASE_COST PAYER_COVERAGE DISPENSES TOTALCOST REASONCODE REASONDESCRIPTION
2 2020-02-12 2020-02-26 9bcf6ed5-d808-44af-98a0-7d78a29ede72 4d71f845-a6a9-3c39-b242-14d25ef86a8d 5d9dfe80-7edf-4181-9753-770e361934d1 313820 Acetaminophen 160 MG Chewable Tablet 5.25 0.0 1 5.25 NaN NaN
3 2020-04-28 2020-05-08 9bcf6ed5-d808-44af-98a0-7d78a29ede72 4d71f845-a6a9-3c39-b242-14d25ef86a8d 794aac2b-0eb8-4c36-98bb-a4317b344e6e 834061 Penicillin V Potassium 250 MG Oral Tablet 13.90 0.0 1 13.90 43878008.0 Streptococcal sore throat (disorder)
5 2020-05-03 2020-05-17 5163c501-353c-4a82-b863-a3f1df2d6cf1 5059a55e-5d6e-34d1-b6cb-d83d16e57bcf f2df2577-f1ff-4e0f-a467-8f72d494f478 313782 Acetaminophen 325 MG Oral Tablet 6.37 0.0 1 6.37 10509002.0 Acute bronchitis (disorder)
6 2020-02-19 NaN bd1c4ffc-7f1d-4590-adbb-1d6533fb623e 4d71f845-a6a9-3c39-b242-14d25ef86a8d b7455838-3607-47f4-aaa5-fd89abea7d29 854235 0.4 ML Enoxaparin sodium 100 MG/ML Prefilled S... 263.49 0.0 3 790.47 NaN NaN
7 2020-02-19 NaN bd1c4ffc-7f1d-4590-adbb-1d6533fb623e 4d71f845-a6a9-3c39-b242-14d25ef86a8d b7455838-3607-47f4-aaa5-fd89abea7d29 2123111 NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT ... 302.07 0.0 3 906.21 389087006.0 Hypoxemia (disorder)

Combined covid patient data

covid_cond = conditions[conditions.CODE == 840539006].copy()
covid_cond['covid_date'] = covid_cond['START']
covid_conditions = conditions[conditions['PATIENT'].isin(covid_patient_ids)].copy()
covid_conditions = covid_conditions.merge(covid_cond[['PATIENT', 'covid_date']], on='PATIENT', how='left')
covid_conditions.head()
START STOP PATIENT ENCOUNTER CODE DESCRIPTION covid_date
0 2019-02-15 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 d317dacb-d801-4136-be7d-440094f7ae83 65363002 Otitis media 2020-03-01
1 2019-10-30 2020-01-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 759cfd9c-b40b-45df-a5b2-fca418027a6a 65363002 Otitis media 2020-03-01
2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding) 2020-03-01
3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19 2020-03-01
4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19 2020-03-01
cp = covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
print(cp.columns)
cp.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
       'covid_date', 'Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS',
       'PASSPORT', 'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL',
       'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE',
       'COUNTY', 'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES',
       'HEALTHCARE_COVERAGE'],
      dtype='object')
START STOP PATIENT ENCOUNTER CODE DESCRIPTION covid_date Id BIRTHDATE DEATHDATE ... BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
0 2019-02-15 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 d317dacb-d801-4136-be7d-440094f7ae83 65363002 Otitis media 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... Beverly Massachusetts US 888 Hickle Ferry Suite 38 Springfield Massachusetts Hampden County 1106.0 42.151961 -72.598959 8446.49 1499.08
1 2019-10-30 2020-01-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 759cfd9c-b40b-45df-a5b2-fca418027a6a 65363002 Otitis media 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... Beverly Massachusetts US 888 Hickle Ferry Suite 38 Springfield Massachusetts Hampden County 1106.0 42.151961 -72.598959 8446.49 1499.08
2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding) 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... Beverly Massachusetts US 888 Hickle Ferry Suite 38 Springfield Massachusetts Hampden County 1106.0 42.151961 -72.598959 8446.49 1499.08
3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... Beverly Massachusetts US 888 Hickle Ferry Suite 38 Springfield Massachusetts Hampden County 1106.0 42.151961 -72.598959 8446.49 1499.08
4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... Beverly Massachusetts US 888 Hickle Ferry Suite 38 Springfield Massachusetts Hampden County 1106.0 42.151961 -72.598959 8446.49 1499.08

5 rows × 32 columns

cp['isolation'] = cp.Id.isin(isolation_ids)
cp['admit'] = cp.Id.isin(covid_inpatient_ids)
cp['recovered'] = cp.Id.isin(survivor_ids)
cp['death'] = cp.DEATHDATE.notna()
cp['icu_admit'] = cp.Id.isin(icu_patient_ids)
cp['ventilated'] = cp.Id.isin(vent_covid_ids)
print(cp.columns)
cp.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION',
       'covid_date', 'Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS',
       'PASSPORT', 'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL',
       'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE',
       'COUNTY', 'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES',
       'HEALTHCARE_COVERAGE', 'isolation', 'admit', 'recovered', 'death',
       'icu_admit', 'ventilated'],
      dtype='object')
START STOP PATIENT ENCOUNTER CODE DESCRIPTION covid_date Id BIRTHDATE DEATHDATE ... LAT LON HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE isolation admit recovered death icu_admit ventilated
0 2019-02-15 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 d317dacb-d801-4136-be7d-440094f7ae83 65363002 Otitis media 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... 42.151961 -72.598959 8446.49 1499.08 True False True False False False
1 2019-10-30 2020-01-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 759cfd9c-b40b-45df-a5b2-fca418027a6a 65363002 Otitis media 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... 42.151961 -72.598959 8446.49 1499.08 True False True False False False
2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding) 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... 42.151961 -72.598959 8446.49 1499.08 True False True False False False
3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... 42.151961 -72.598959 8446.49 1499.08 True False True False False False
4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 2017-08-24 NaN ... 42.151961 -72.598959 8446.49 1499.08 True False True False False False

5 rows × 38 columns

The condensed dataset

covid_info = cp[['PATIENT', 'recovered', 'death', 'covid_date', 'DEATHDATE', 'BIRTHDATE', 'GENDER', 'admit', 'icu_admit']]
print(covid_info.columns)
covid_info.head()
Index(['PATIENT', 'recovered', 'death', 'covid_date', 'DEATHDATE', 'BIRTHDATE',
       'GENDER', 'admit', 'icu_admit'],
      dtype='object')
PATIENT recovered death covid_date DEATHDATE BIRTHDATE GENDER admit icu_admit
0 1ff7f10f-a204-4bb1-aa72-dd763fa99482 True False 2020-03-01 NaN 2017-08-24 M False False
1 1ff7f10f-a204-4bb1-aa72-dd763fa99482 True False 2020-03-01 NaN 2017-08-24 M False False
2 1ff7f10f-a204-4bb1-aa72-dd763fa99482 True False 2020-03-01 NaN 2017-08-24 M False False
3 1ff7f10f-a204-4bb1-aa72-dd763fa99482 True False 2020-03-01 NaN 2017-08-24 M False False
4 1ff7f10f-a204-4bb1-aa72-dd763fa99482 True False 2020-03-01 NaN 2017-08-24 M False False

Conclusion

The most important thing is to know where the data we need is stored and which tables contain it. We use fields like CODEDESCRIPTION, and REASONCODE to identify the relevant information. Then, we filter and join these tables to create the datasets we want.

Extracting exact medical data from a hospital database can sometimes be challenging. No one is immune to making mistakes when working with healthcare data. Therefore, it’s important to double-check codes and results multiple times while performing data-related tasks.