Chaper_2 - Creating Data Aggregation Tables

Author

Thieu Nguyen

Creating date aggregation tables

Creating aggregated tables is a common task in data analysis. In this chapter, I demonstrate how to build aggregation tables across groups of patients. To prepare the data for further analysis involving both categorical and continuous variables, I use various tools to combine, join, and restructure the data.

Functions such as groupby() and agg()—with methods like count()sum()mean(), and std()—are commonly used to generate these summary tables.

Loading Data

Loading the necessary Python packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime as dt
# Ensure all columns are shown
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

Loading csv data

# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")

Outcomes - Data aggregation and group operations (counts and percents) for categorical data

I want to calculate both the counts and percentages of COVID-related outcomes among different patient groups, specifically:

  • Inpatients

  • Survivors

  • Non-survivors

This analysis will help understand how outcomes vary across these categories and can be used to support further epidemiological insights or health system planning.

Define covid patients

covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()
covid_conditions = conditions[conditions.PATIENT.isin(covid_patient_ids)]
print(covid_conditions.columns)
covid_conditions.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')
START STOP PATIENT ENCOUNTER CODE DESCRIPTION
0 2019-02-15 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 d317dacb-d801-4136-be7d-440094f7ae83 65363002 Otitis media
1 2019-10-30 2020-01-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 759cfd9c-b40b-45df-a5b2-fca418027a6a 65363002 Otitis media
2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding)
3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19
4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19

Outcomes

covid_conditions['DESCRIPTION'].value_counts().head(10)
DESCRIPTION
Suspected COVID-19                         88166
COVID-19                                   88166
Fever (finding)                            78341
Cough (finding)                            59766
Loss of taste (finding)                    44646
Body mass index 30+ - obesity (finding)    33903
Fatigue (finding)                          33855
Sputum finding (finding)                   29666
Prediabetes                                25442
Hypertension                               23019
Name: count, dtype: int64

The keywords for identifying COVID-related outcomes are collected from conditions['DESCRIPTION'].values that contain “(disorder)” and related features.

Define covid_date as the earliest date each patient was diagnosed with COVID. This is important for analyses involving subsequent time points.

# Filter COVID-19 rows
covid_conds = conditions[conditions['DESCRIPTION'] == 'COVID-19'].copy()

# Convert START to datetime
covid_conds['START'] = pd.to_datetime(covid_conditions['START'])

# Group by PATIENT and get the earliest COVID date
covid_dates = covid_conds.groupby('PATIENT')['START'].min().reset_index()
covid_dates.rename(columns={'START': 'covid_date'}, inplace=True)

# Merge back to the original conditions table
covid_conditions = conditions.merge(covid_dates, on='PATIENT', how='left')

# Preview result
covid_conditions.head()
START STOP PATIENT ENCOUNTER CODE DESCRIPTION covid_date
0 2019-02-15 2019-08-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 d317dacb-d801-4136-be7d-440094f7ae83 65363002 Otitis media 2020-03-01
1 2019-10-30 2020-01-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 759cfd9c-b40b-45df-a5b2-fca418027a6a 65363002 Otitis media 2020-03-01
2 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006 Fever (finding) 2020-03-01
3 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004 Suspected COVID-19 2020-03-01
4 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006 COVID-19 2020-03-01

Define specific patient groups

# inpatients
inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()

completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()

negative_covid_patient_ids = observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()

# survivors
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)

# deceased patients
deceased_ids = patients[
    patients['Id'].isin(covid_patient_ids) & patients['DEATHDATE'].notna()
]['Id'].unique()

Further filter covid-related outcomes

dis_cons = covid_conditions[
        (covid_conditions['DESCRIPTION'].str.contains(r'\(disorder\)', na=False)) &
        (~covid_conditions['DESCRIPTION'].str.contains(
            r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis', case=False, na=False))
        ].copy() 

dis_cons['inpatient'] = dis_cons.PATIENT.isin(inpatient_ids)
dis_cons['survivor'] = dis_cons.PATIENT.isin(survivor_ids)
dis_cons['death'] = dis_cons.PATIENT.isin(deceased_ids)
dis_cons.head()
START STOP PATIENT ENCOUNTER CODE DESCRIPTION covid_date inpatient survivor death
12 2020-04-28 2020-05-08 9bcf6ed5-d808-44af-98a0-7d78a29ede72 794aac2b-0eb8-4c36-98bb-a4317b344e6e 43878008 Streptococcal sore throat (disorder) 2020-03-13 False True False
98 2015-02-11 NaN da454f92-9e90-43f8-9514-550fbaa6de3d 6131ae6d-0623-4772-bc9a-c3157aee44fe 87433001 Pulmonary emphysema (disorder) 2020-03-13 False True False
107 2010-12-17 NaN 29581d13-2687-402f-9c4c-06141a319e61 3af37864-e57e-4c9b-91bc-3313aa3fe518 254837009 Malignant neoplasm of breast (disorder) 2020-03-13 False True False
141 2020-03-07 2020-03-20 3c3b89b1-cb41-4f94-9193-2f3da4fe38e5 60b6cbc7-bbee-4fdd-9283-e8462a56dc34 770349000 Sepsis caused by virus (disorder) 2020-03-07 True True False
166 2007-04-27 NaN 7e2c6949-dce9-44be-b6ef-50aca95840d7 ef7e86ea-37d3-491d-bd6e-60fb09fb21cf 254837009 Malignant neoplasm of breast (disorder) 2020-02-29 True True False

Group by ‘DESCRIPTION’ for outcomes and count patients using an aggregation function to create the outcomes table

# Add non_survivor column
dis_cons['non_survivor'] = dis_cons['death'] & (~dis_cons['survivor'])

# Total patients per group
total_inpatients = len(inpatient_ids)
total_survivors = dis_cons[dis_cons['survivor']]['PATIENT'].nunique()
total_non_survivors = dis_cons[dis_cons['non_survivor']]['PATIENT'].nunique()

# Group by outcome and count patients per group
summary_df = dis_cons.groupby('DESCRIPTION').agg(
    total=('PATIENT', 'nunique'),
    inpatient=('inpatient', 'sum'),
    survivors=('survivor', 'sum'),
    non_survivors=('non_survivor', 'sum')
).reset_index()

# Calculate percentages using group totals
summary_df['percent of inpatient'] = round(summary_df['inpatient'] / total_inpatients * 100, 2)
summary_df['percent survivors'] = round(summary_df['survivors'] / total_survivors * 100, 2)
summary_df['percent non survivors'] = round(summary_df['non_survivors'] / total_non_survivors * 100, 2)

# Rename for clarity
summary_df.rename(columns={'DESCRIPTION': 'outcome'}, inplace=True)

# Sort by inpatient count or total
summary_df = summary_df.sort_values(by='inpatient', ascending=False)

# Final columns
summary_df = summary_df[[
    'outcome', 'total', 'percent of inpatient',
    'survivors', 'percent survivors',
    'non_survivors', 'percent non survivors'
]]

# Show result
summary_df.sort_values(by='percent non survivors', ascending=False).head(7)
outcome total percent of inpatient survivors percent survivors non_survivors percent non survivors
26 Sepsis caused by virus (disorder) 6941 38.19 3419 24.20 3520 99.83
2 Acute respiratory failure (disorder) 8714 47.87 5237 37.07 3462 98.18
1 Acute respiratory distress syndrome (disorder) 2401 13.21 85 0.60 2316 65.68
27 Septic shock (disorder) 1746 9.61 0 0.00 1746 49.52
8 Heart failure (disorder) 1434 7.89 124 0.88 1310 37.15
0 Acquired coagulation disorder (disorder) 1389 7.64 91 0.64 1298 36.81
10 Injury of heart (disorder) 1288 7.09 20 0.14 1268 35.96

Use the groupby() and sum() to create counts and percents table

# Calculate total unique patients per category
total_inpatients = len(inpatient_ids)
total_survivors = dis_cons[dis_cons['survivor'] == True]['PATIENT'].nunique()
total_non_survivors = dis_cons[dis_cons['death'] == True]['PATIENT'].nunique()  

# Group inpatient counts
df1 = dis_cons.groupby('DESCRIPTION')['inpatient'].sum().reset_index()
df1['percent_inpatient'] = round(df1['inpatient'] / total_inpatients * 100, 2)

# Group survivor counts
df2 = dis_cons.groupby('DESCRIPTION')['survivor'].sum().reset_index()
df2['percent_survivor'] = round(df2['survivor'] / total_survivors * 100, 2)

# Group death counts
df3 = dis_cons.groupby('DESCRIPTION')['death'].sum().reset_index()
df3['percent_non_survivor'] = round(df3['death'] / total_non_survivors * 100, 2)

# Merge all on DESCRIPTION
output_df = df1.merge(df2, on='DESCRIPTION').merge(df3, on='DESCRIPTION')

# Optional sorting by inpatient count
output_df = output_df.sort_values(by='percent_non_survivor', ascending=False)

# Display top 10
output_df.head(7)
DESCRIPTION inpatient percent_inpatient survivor percent_survivor death percent_non_survivor
26 Sepsis caused by virus (disorder) 6941 38.19 3419 24.20 3524 99.07
2 Acute respiratory failure (disorder) 8701 47.87 5237 37.07 3471 97.58
1 Acute respiratory distress syndrome (disorder) 2401 13.21 85 0.60 2316 65.11
27 Septic shock (disorder) 1746 9.61 0 0.00 1746 49.09
8 Heart failure (disorder) 1434 7.89 124 0.88 1310 36.83
0 Acquired coagulation disorder (disorder) 1389 7.64 91 0.64 1298 36.49
10 Injury of heart (disorder) 1288 7.09 20 0.14 1268 35.65

Labs values - Aggregation for continuous data

# Filter observations by specific CODE values
lab_obs = observations[
    observations['CODE'].isin(['48065-7', '26881-3', '2276-4', '89579-7', '731-0', '14804-9'])
]

# Further filter by COVID patient IDs
lab_obs = lab_obs[lab_obs['PATIENT'].isin(covid_patient_ids)]

# Ensure the DATE column is in datetime format
lab_obs['DATE'] = pd.to_datetime(lab_obs['DATE'], errors='coerce')

# Filter rows after 2020-01-20 for covid-related data
lab_obs = lab_obs[lab_obs['DATE'] > '2020-01-20']

lab_obs['survivor'] = lab_obs.PATIENT.isin(survivor_ids)

# Preview first 3 rows
lab_obs.head(3)
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE survivor
254 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 731-0 Lymphocytes [#/volume] in Blood by Automated c... 1.1 10*3/uL numeric True
273 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 48065-7 Fibrin D-dimer FEU [Mass/volume] in Platelet p... 0.4 ug/mL numeric True
274 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 2276-4 Ferritin [Mass/volume] in Serum or Plasma 332.4 ug/L numeric True

Compare labs values between groups of patients who are ‘survivor’ or ‘non-survivor’.

# transform value to numeric
lab_obs['VALUE'] = pd.to_numeric(lab_obs['VALUE'], errors='coerce')

# aggregating data
lab_obs.pivot_table(index="DESCRIPTION", columns="survivor", values="VALUE", aggfunc="mean").round(2)
survivor False True
DESCRIPTION
Ferritin [Mass/volume] in Serum or Plasma 1441.62 461.64
Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma 8.81 0.43
Interleukin 6 [Mass/volume] in Serum or Plasma 11.73 5.65
Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction 356.88 243.38
Lymphocytes [#/volume] in Blood by Automated count 0.58 1.01
Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method 30.40 2.99

Compare labs values between time points as ‘baseline’ and ‘week_1’ later

lab_obs = lab_obs.copy().merge(covid_dates, on = 'PATIENT')
lab_obs['days'] = (pd.to_datetime(lab_obs['DATE']) - pd.to_datetime(lab_obs['covid_date'])).dt.days
lab_obs.head()
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE survivor covid_date days
0 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 731-0 Lymphocytes [#/volume] in Blood by Automated c... 1.1 10*3/uL numeric True 2020-02-19 0
1 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 48065-7 Fibrin D-dimer FEU [Mass/volume] in Platelet p... 0.4 ug/mL numeric True 2020-02-19 0
2 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 2276-4 Ferritin [Mass/volume] in Serum or Plasma 332.4 ug/L numeric True 2020-02-19 0
3 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 89579-7 Troponin I.cardiac [Mass/volume] in Serum or P... 2.3 pg/mL numeric True 2020-02-19 0
4 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 14804-9 Lactate dehydrogenase [Enzymatic activity/volu... 223.9 U/L numeric True 2020-02-19 0

Take a look at ‘days’ column.

lab_obs['days'].describe()
count    716748.000000
mean          6.085112
std           4.248569
min           0.000000
25%           2.000000
50%           6.000000
75%           9.000000
max          26.000000
Name: days, dtype: float64

Compare labs values between time points as ‘baseline’ and ‘week_1’ later.

# Assign Timepoints 
def assign_timepoint(days):
    if days == 0:
        return 'baseline'
    elif 6 <= days <= 8:
        return 'week_1'
    else:
        return None

lab_obs['TIMEPOINT'] = lab_obs['days'].apply(assign_timepoint)
lab_obs = lab_obs.dropna(subset=['TIMEPOINT'])
lab_obs.head()
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE survivor covid_date days TIMEPOINT
0 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 731-0 Lymphocytes [#/volume] in Blood by Automated c... 1.1 10*3/uL numeric True 2020-02-19 0 baseline
1 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 48065-7 Fibrin D-dimer FEU [Mass/volume] in Platelet p... 0.4 ug/mL numeric True 2020-02-19 0 baseline
2 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 2276-4 Ferritin [Mass/volume] in Serum or Plasma 332.4 ug/L numeric True 2020-02-19 0 baseline
3 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 89579-7 Troponin I.cardiac [Mass/volume] in Serum or P... 2.3 pg/mL numeric True 2020-02-19 0 baseline
4 2020-02-19 bd1c4ffc-7f1d-4590-adbb-1d6533fb623e b7455838-3607-47f4-aaa5-fd89abea7d29 14804-9 Lactate dehydrogenase [Enzymatic activity/volu... 223.9 U/L numeric True 2020-02-19 0 baseline
# aggregating data
lab_obs.pivot_table(index="DESCRIPTION", columns=["survivor", "TIMEPOINT"], values="VALUE", aggfunc="mean").round(2)
survivor False True
TIMEPOINT baseline week_1 baseline week_1
DESCRIPTION
Ferritin [Mass/volume] in Serum or Plasma 998.97 1586.77 400.51 497.94
Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma 1.50 4.82 0.30 0.50
Interleukin 6 [Mass/volume] in Serum or Plasma NaN 11.79 NaN 5.99
Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction 329.85 330.42 224.87 237.25
Lymphocytes [#/volume] in Blood by Automated count 0.60 0.63 1.05 0.90
Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method 8.00 23.26 2.50 2.99

Ferritin [Mass/volume] in Serum or Plasma, Fibrin D-dimer FEU [Mass/volume] in Platelet and Troponin I. cardiac [Mass/volume] in Serum increase significantly at one week later among covid patients who died.

Table of N, means and SD between time points

# Aggregate Statistics 
summary = lab_obs.groupby(['DESCRIPTION','TIMEPOINT'])['VALUE'].agg(
    N='count',
    mean='mean',
    sd='std'
).reset_index()

# Output Results 
order = ['baseline', 'week_1']
summary = summary.set_index('TIMEPOINT').loc[order].reset_index().round(2)
print(summary)
   TIMEPOINT                                        DESCRIPTION      N  \
0   baseline          Ferritin [Mass/volume] in Serum or Plasma  17628   
1   baseline  Fibrin D-dimer FEU [Mass/volume] in Platelet p...  17628   
2   baseline  Lactate dehydrogenase [Enzymatic activity/volu...  17628   
3   baseline  Lymphocytes [#/volume] in Blood by Automated c...  17628   
4   baseline  Troponin I.cardiac [Mass/volume] in Serum or P...  17628   
5     week_1          Ferritin [Mass/volume] in Serum or Plasma  36136   
6     week_1  Fibrin D-dimer FEU [Mass/volume] in Platelet p...  36136   
7     week_1     Interleukin 6 [Mass/volume] in Serum or Plasma    326   
8     week_1  Lactate dehydrogenase [Enzymatic activity/volu...  36136   
9     week_1  Lymphocytes [#/volume] in Blood by Automated c...  53945   
10    week_1  Troponin I.cardiac [Mass/volume] in Serum or P...  36136   

      mean      sd  
0   516.42  247.60  
1     0.53    0.50  
2   245.20   43.67  
3     0.96    0.19  
4     3.57    2.29  
5   719.98  444.59  
6     1.38    2.74  
7     7.34    2.60  
8   256.25   40.67  
9     0.85    0.13  
10    7.13    8.27  

Conclusion

In this chapter, I explore tools for combining, joining, and restructuring data, along with functions for generating summary tables—such as pivot_table and techniques for recoding datetime variables.

I demonstrate data aggregation for both categorical and continuous variables, using functions like groupby()agg()count()sum()mean(), and std() to create meaningful summary tables.