Chaper_2 - Creating Data Aggregation Tables

Author

Thieu Nguyen

Creating date aggregation tables

Creating aggregated tables is a common task in data analysis. In this chapter, I demonstrate how to build aggregation tables across groups of patients. To prepare the data for further analysis involving both categorical and continuous variables, I use various tools to combine, join, and restructure the data.

Functions such as groupby() and agg()—with methods like count(), sum(), mean(), and std()—are commonly used to generate these summary tables.

Loading Data

Loading the necessary Python packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime as dt

# Ensure all columns are shown
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

Loading csv data

# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")

Outcomes - Data aggregation and group operations (counts and percents) for categorical data

I want to calculate both the counts and percentages of COVID-related outcomes among different patient groups, specifically:

Inpatients
Survivors
Non-survivors

This analysis will help understand how outcomes vary across these categories and can be used to support further epidemiological insights or health system planning.

Define covid patients

covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()
covid_conditions = conditions[conditions.PATIENT.isin(covid_patient_ids)]
print(covid_conditions.columns)
covid_conditions.head()

Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')

	START	STOP	PATIENT	ENCOUNTER	CODE	DESCRIPTION
0	2019-02-15	2019-08-01	1ff7f10f-a204-4bb1-aa72-dd763fa99482	d317dacb-d801-4136-be7d-440094f7ae83	65363002	Otitis media
1	2019-10-30	2020-01-30	1ff7f10f-a204-4bb1-aa72-dd763fa99482	759cfd9c-b40b-45df-a5b2-fca418027a6a	65363002	Otitis media
2	2020-03-01	2020-03-30	1ff7f10f-a204-4bb1-aa72-dd763fa99482	60584c6a-e26b-4176-a660-86ad84c7ceca	386661006	Fever (finding)
3	2020-03-01	2020-03-01	1ff7f10f-a204-4bb1-aa72-dd763fa99482	60584c6a-e26b-4176-a660-86ad84c7ceca	840544004	Suspected COVID-19
4	2020-03-01	2020-03-30	1ff7f10f-a204-4bb1-aa72-dd763fa99482	60584c6a-e26b-4176-a660-86ad84c7ceca	840539006	COVID-19

Outcomes

covid_conditions['DESCRIPTION'].value_counts().head(10)

DESCRIPTION
Suspected COVID-19                         88166
COVID-19                                   88166
Fever (finding)                            78341
Cough (finding)                            59766
Loss of taste (finding)                    44646
Body mass index 30+ - obesity (finding)    33903
Fatigue (finding)                          33855
Sputum finding (finding)                   29666
Prediabetes                                25442
Hypertension                               23019
Name: count, dtype: int64

The keywords for identifying COVID-related outcomes are collected from conditions['DESCRIPTION'].values that contain “(disorder)” and related features.

Define covid_date as the earliest date each patient was diagnosed with COVID. This is important for analyses involving subsequent time points.

# Filter COVID-19 rows
covid_conds = conditions[conditions['DESCRIPTION'] == 'COVID-19'].copy()

# Convert START to datetime
covid_conds['START'] = pd.to_datetime(covid_conditions['START'])

# Group by PATIENT and get the earliest COVID date
covid_dates = covid_conds.groupby('PATIENT')['START'].min().reset_index()
covid_dates.rename(columns={'START': 'covid_date'}, inplace=True)

# Merge back to the original conditions table
covid_conditions = conditions.merge(covid_dates, on='PATIENT', how='left')

# Preview result
covid_conditions.head()

	START	STOP	PATIENT	ENCOUNTER	CODE	DESCRIPTION	covid_date
0	2019-02-15	2019-08-01	1ff7f10f-a204-4bb1-aa72-dd763fa99482	d317dacb-d801-4136-be7d-440094f7ae83	65363002	Otitis media	2020-03-01
1	2019-10-30	2020-01-30	1ff7f10f-a204-4bb1-aa72-dd763fa99482	759cfd9c-b40b-45df-a5b2-fca418027a6a	65363002	Otitis media	2020-03-01
2	2020-03-01	2020-03-30	1ff7f10f-a204-4bb1-aa72-dd763fa99482	60584c6a-e26b-4176-a660-86ad84c7ceca	386661006	Fever (finding)	2020-03-01
3	2020-03-01	2020-03-01	1ff7f10f-a204-4bb1-aa72-dd763fa99482	60584c6a-e26b-4176-a660-86ad84c7ceca	840544004	Suspected COVID-19	2020-03-01
4	2020-03-01	2020-03-30	1ff7f10f-a204-4bb1-aa72-dd763fa99482	60584c6a-e26b-4176-a660-86ad84c7ceca	840539006	COVID-19	2020-03-01

Define specific patient groups

# inpatients
inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()

completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()

negative_covid_patient_ids = observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()

# survivors
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)

# deceased patients
deceased_ids = patients[
    patients['Id'].isin(covid_patient_ids) & patients['DEATHDATE'].notna()
]['Id'].unique()

Further filter covid-related outcomes

dis_cons = covid_conditions[
        (covid_conditions['DESCRIPTION'].str.contains(r'\(disorder\)', na=False)) &
        (~covid_conditions['DESCRIPTION'].str.contains(
            r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis', case=False, na=False))
        ].copy() 

dis_cons['inpatient'] = dis_cons.PATIENT.isin(inpatient_ids)
dis_cons['survivor'] = dis_cons.PATIENT.isin(survivor_ids)
dis_cons['death'] = dis_cons.PATIENT.isin(deceased_ids)
dis_cons.head()

	START	STOP	PATIENT	ENCOUNTER	CODE	DESCRIPTION	covid_date	inpatient	survivor	death
12	2020-04-28	2020-05-08	9bcf6ed5-d808-44af-98a0-7d78a29ede72	794aac2b-0eb8-4c36-98bb-a4317b344e6e	43878008	Streptococcal sore throat (disorder)	2020-03-13	False	True	False
98	2015-02-11	NaN	da454f92-9e90-43f8-9514-550fbaa6de3d	6131ae6d-0623-4772-bc9a-c3157aee44fe	87433001	Pulmonary emphysema (disorder)	2020-03-13	False	True	False
107	2010-12-17	NaN	29581d13-2687-402f-9c4c-06141a319e61	3af37864-e57e-4c9b-91bc-3313aa3fe518	254837009	Malignant neoplasm of breast (disorder)	2020-03-13	False	True	False
141	2020-03-07	2020-03-20	3c3b89b1-cb41-4f94-9193-2f3da4fe38e5	60b6cbc7-bbee-4fdd-9283-e8462a56dc34	770349000	Sepsis caused by virus (disorder)	2020-03-07	True	True	False
166	2007-04-27	NaN	7e2c6949-dce9-44be-b6ef-50aca95840d7	ef7e86ea-37d3-491d-bd6e-60fb09fb21cf	254837009	Malignant neoplasm of breast (disorder)	2020-02-29	True	True	False

Group by ‘DESCRIPTION’ for outcomes and count patients using an aggregation function to create the outcomes table

# Add non_survivor column
dis_cons['non_survivor'] = dis_cons['death'] & (~dis_cons['survivor'])

# Total patients per group
total_inpatients = len(inpatient_ids)
total_survivors = dis_cons[dis_cons['survivor']]['PATIENT'].nunique()
total_non_survivors = dis_cons[dis_cons['non_survivor']]['PATIENT'].nunique()

# Group by outcome and count patients per group
summary_df = dis_cons.groupby('DESCRIPTION').agg(
    total=('PATIENT', 'nunique'),
    inpatient=('inpatient', 'sum'),
    survivors=('survivor', 'sum'),
    non_survivors=('non_survivor', 'sum')
).reset_index()

# Calculate percentages using group totals
summary_df['percent of inpatient'] = round(summary_df['inpatient'] / total_inpatients * 100, 2)
summary_df['percent survivors'] = round(summary_df['survivors'] / total_survivors * 100, 2)
summary_df['percent non survivors'] = round(summary_df['non_survivors'] / total_non_survivors * 100, 2)

# Rename for clarity
summary_df.rename(columns={'DESCRIPTION': 'outcome'}, inplace=True)

# Sort by inpatient count or total
summary_df = summary_df.sort_values(by='inpatient', ascending=False)

# Final columns
summary_df = summary_df[[
    'outcome', 'total', 'percent of inpatient',
    'survivors', 'percent survivors',
    'non_survivors', 'percent non survivors'
]]

# Show result
summary_df.sort_values(by='percent non survivors', ascending=False).head(7)

	outcome	total	percent of inpatient	survivors	percent survivors	non_survivors	percent non survivors
26	Sepsis caused by virus (disorder)	6941	38.19	3419	24.20	3520	99.83
2	Acute respiratory failure (disorder)	8714	47.87	5237	37.07	3462	98.18
1	Acute respiratory distress syndrome (disorder)	2401	13.21	85	0.60	2316	65.68
27	Septic shock (disorder)	1746	9.61	0	0.00	1746	49.52
8	Heart failure (disorder)	1434	7.89	124	0.88	1310	37.15
0	Acquired coagulation disorder (disorder)	1389	7.64	91	0.64	1298	36.81
10	Injury of heart (disorder)	1288	7.09	20	0.14	1268	35.96

Use the groupby() and sum() to create counts and percents table

# Calculate total unique patients per category
total_inpatients = len(inpatient_ids)
total_survivors = dis_cons[dis_cons['survivor'] == True]['PATIENT'].nunique()
total_non_survivors = dis_cons[dis_cons['death'] == True]['PATIENT'].nunique()  

# Group inpatient counts
df1 = dis_cons.groupby('DESCRIPTION')['inpatient'].sum().reset_index()
df1['percent_inpatient'] = round(df1['inpatient'] / total_inpatients * 100, 2)

# Group survivor counts
df2 = dis_cons.groupby('DESCRIPTION')['survivor'].sum().reset_index()
df2['percent_survivor'] = round(df2['survivor'] / total_survivors * 100, 2)

# Group death counts
df3 = dis_cons.groupby('DESCRIPTION')['death'].sum().reset_index()
df3['percent_non_survivor'] = round(df3['death'] / total_non_survivors * 100, 2)

# Merge all on DESCRIPTION
output_df = df1.merge(df2, on='DESCRIPTION').merge(df3, on='DESCRIPTION')

# Optional sorting by inpatient count
output_df = output_df.sort_values(by='percent_non_survivor', ascending=False)

# Display top 10
output_df.head(7)

	DESCRIPTION	inpatient	percent_inpatient	survivor	percent_survivor	death	percent_non_survivor
26	Sepsis caused by virus (disorder)	6941	38.19	3419	24.20	3524	99.07
2	Acute respiratory failure (disorder)	8701	47.87	5237	37.07	3471	97.58
1	Acute respiratory distress syndrome (disorder)	2401	13.21	85	0.60	2316	65.11
27	Septic shock (disorder)	1746	9.61	0	0.00	1746	49.09
8	Heart failure (disorder)	1434	7.89	124	0.88	1310	36.83
0	Acquired coagulation disorder (disorder)	1389	7.64	91	0.64	1298	36.49
10	Injury of heart (disorder)	1288	7.09	20	0.14	1268	35.65

Labs values - Aggregation for continuous data

# Filter observations by specific CODE values
lab_obs = observations[
    observations['CODE'].isin(['48065-7', '26881-3', '2276-4', '89579-7', '731-0', '14804-9'])
]

# Further filter by COVID patient IDs
lab_obs = lab_obs[lab_obs['PATIENT'].isin(covid_patient_ids)]

# Ensure the DATE column is in datetime format
lab_obs['DATE'] = pd.to_datetime(lab_obs['DATE'], errors='coerce')

# Filter rows after 2020-01-20 for covid-related data
lab_obs = lab_obs[lab_obs['DATE'] > '2020-01-20']

lab_obs['survivor'] = lab_obs.PATIENT.isin(survivor_ids)

# Preview first 3 rows
lab_obs.head(3)

	DATE	PATIENT	ENCOUNTER	CODE	DESCRIPTION	VALUE	UNITS	TYPE	survivor
254	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	731-0	Lymphocytes [#/volume] in Blood by Automated c...	1.1	10*3/uL	numeric	True
273	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	48065-7	Fibrin D-dimer FEU [Mass/volume] in Platelet p...	0.4	ug/mL	numeric	True
274	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	2276-4	Ferritin [Mass/volume] in Serum or Plasma	332.4	ug/L	numeric	True

Compare labs values between groups of patients who are ‘survivor’ or ‘non-survivor’.

# transform value to numeric
lab_obs['VALUE'] = pd.to_numeric(lab_obs['VALUE'], errors='coerce')

# aggregating data
lab_obs.pivot_table(index="DESCRIPTION", columns="survivor", values="VALUE", aggfunc="mean").round(2)

survivor	False	True
DESCRIPTION
Ferritin [Mass/volume] in Serum or Plasma	1441.62	461.64
Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma	8.81	0.43
Interleukin 6 [Mass/volume] in Serum or Plasma	11.73	5.65
Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction	356.88	243.38
Lymphocytes [#/volume] in Blood by Automated count	0.58	1.01
Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method	30.40	2.99

Compare labs values between time points as ‘baseline’ and ‘week_1’ later

lab_obs = lab_obs.copy().merge(covid_dates, on = 'PATIENT')
lab_obs['days'] = (pd.to_datetime(lab_obs['DATE']) - pd.to_datetime(lab_obs['covid_date'])).dt.days
lab_obs.head()

	DATE	PATIENT	ENCOUNTER	CODE	DESCRIPTION	VALUE	UNITS	TYPE	survivor	covid_date
0	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	731-0	Lymphocytes [#/volume] in Blood by Automated c...	1.1	10*3/uL	numeric	True	2020-02-19
1	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	48065-7	Fibrin D-dimer FEU [Mass/volume] in Platelet p...	0.4	ug/mL	numeric	True	2020-02-19
2	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	2276-4	Ferritin [Mass/volume] in Serum or Plasma	332.4	ug/L	numeric	True	2020-02-19
3	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	89579-7	Troponin I.cardiac [Mass/volume] in Serum or P...	2.3	pg/mL	numeric	True	2020-02-19
4	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	14804-9	Lactate dehydrogenase [Enzymatic activity/volu...	223.9	U/L	numeric	True	2020-02-19

Take a look at ‘days’ column.

lab_obs['days'].describe()

count    716748.000000
mean          6.085112
std           4.248569
min           0.000000
25%           2.000000
50%           6.000000
75%           9.000000
max          26.000000
Name: days, dtype: float64

Compare labs values between time points as ‘baseline’ and ‘week_1’ later.

# Assign Timepoints 
def assign_timepoint(days):
    if days == 0:
        return 'baseline'
    elif 6 <= days <= 8:
        return 'week_1'
    else:
        return None

lab_obs['TIMEPOINT'] = lab_obs['days'].apply(assign_timepoint)
lab_obs = lab_obs.dropna(subset=['TIMEPOINT'])
lab_obs.head()

	DATE	PATIENT	ENCOUNTER	CODE	DESCRIPTION	VALUE	UNITS	TYPE	survivor	covid_date	TIMEPOINT
0	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	731-0	Lymphocytes [#/volume] in Blood by Automated c...	1.1	10*3/uL	numeric	True	2020-02-19	baseline
1	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	48065-7	Fibrin D-dimer FEU [Mass/volume] in Platelet p...	0.4	ug/mL	numeric	True	2020-02-19	baseline
2	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	2276-4	Ferritin [Mass/volume] in Serum or Plasma	332.4	ug/L	numeric	True	2020-02-19	baseline
3	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	89579-7	Troponin I.cardiac [Mass/volume] in Serum or P...	2.3	pg/mL	numeric	True	2020-02-19	baseline
4	2020-02-19	bd1c4ffc-7f1d-4590-adbb-1d6533fb623e	b7455838-3607-47f4-aaa5-fd89abea7d29	14804-9	Lactate dehydrogenase [Enzymatic activity/volu...	223.9	U/L	numeric	True	2020-02-19	baseline

# aggregating data
lab_obs.pivot_table(index="DESCRIPTION", columns=["survivor", "TIMEPOINT"], values="VALUE", aggfunc="mean").round(2)

survivor	False		True
TIMEPOINT	baseline	week_1	baseline	week_1
DESCRIPTION
Ferritin [Mass/volume] in Serum or Plasma	998.97	1586.77	400.51	497.94
Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma	1.50	4.82	0.30	0.50
Interleukin 6 [Mass/volume] in Serum or Plasma	NaN	11.79	NaN	5.99
Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction	329.85	330.42	224.87	237.25
Lymphocytes [#/volume] in Blood by Automated count	0.60	0.63	1.05	0.90
Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method	8.00	23.26	2.50	2.99

Ferritin [Mass/volume] in Serum or Plasma, Fibrin D-dimer FEU [Mass/volume] in Platelet and Troponin I. cardiac [Mass/volume] in Serum increase significantly at one week later among covid patients who died.

Table of N, means and SD between time points

# Aggregate Statistics 
summary = lab_obs.groupby(['DESCRIPTION','TIMEPOINT'])['VALUE'].agg(
    N='count',
    mean='mean',
    sd='std'
).reset_index()

# Output Results 
order = ['baseline', 'week_1']
summary = summary.set_index('TIMEPOINT').loc[order].reset_index().round(2)
print(summary)

   TIMEPOINT                                        DESCRIPTION      N  \
0   baseline          Ferritin [Mass/volume] in Serum or Plasma  17628   
1   baseline  Fibrin D-dimer FEU [Mass/volume] in Platelet p...  17628   
2   baseline  Lactate dehydrogenase [Enzymatic activity/volu...  17628   
3   baseline  Lymphocytes [#/volume] in Blood by Automated c...  17628   
4   baseline  Troponin I.cardiac [Mass/volume] in Serum or P...  17628   
5     week_1          Ferritin [Mass/volume] in Serum or Plasma  36136   
6     week_1  Fibrin D-dimer FEU [Mass/volume] in Platelet p...  36136   
7     week_1     Interleukin 6 [Mass/volume] in Serum or Plasma    326   
8     week_1  Lactate dehydrogenase [Enzymatic activity/volu...  36136   
9     week_1  Lymphocytes [#/volume] in Blood by Automated c...  53945   
10    week_1  Troponin I.cardiac [Mass/volume] in Serum or P...  36136   

      mean      sd  
0   516.42  247.60  
1     0.53    0.50  
2   245.20   43.67  
3     0.96    0.19  
4     3.57    2.29  
5   719.98  444.59  
6     1.38    2.74  
7     7.34    2.60  
8   256.25   40.67  
9     0.85    0.13  
10    7.13    8.27

Conclusion

In this chapter, I explore tools for combining, joining, and restructuring data, along with functions for generating summary tables—such as pivot_table and techniques for recoding datetime variables.

I demonstrate data aggregation for both categorical and continuous variables, using functions like groupby(), agg(), count(), sum(), mean(), and std() to create meaningful summary tables.