import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime as dtChaper_2 - Creating Data Aggregation Tables
Creating date aggregation tables
Creating aggregated tables is a common task in data analysis. In this chapter, I demonstrate how to build aggregation tables across groups of patients. To prepare the data for further analysis involving both categorical and continuous variables, I use various tools to combine, join, and restructure the data.
Functions such as groupby() and agg()—with methods like count(), sum(), mean(), and std()—are commonly used to generate these summary tables.
Loading Data
Loading the necessary Python packages:
# Ensure all columns are shown
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)Loading csv data
# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")Outcomes - Data aggregation and group operations (counts and percents) for categorical data
I want to calculate both the counts and percentages of COVID-related outcomes among different patient groups, specifically:
Inpatients
Survivors
Non-survivors
This analysis will help understand how outcomes vary across these categories and can be used to support further epidemiological insights or health system planning.
Define covid patients
covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()
covid_conditions = conditions[conditions.PATIENT.isin(covid_patient_ids)]
print(covid_conditions.columns)
covid_conditions.head()Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')
| START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | |
|---|---|---|---|---|---|---|
| 0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media |
| 1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media |
| 2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) |
| 3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 |
| 4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 |
Outcomes
covid_conditions['DESCRIPTION'].value_counts().head(10)DESCRIPTION
Suspected COVID-19 88166
COVID-19 88166
Fever (finding) 78341
Cough (finding) 59766
Loss of taste (finding) 44646
Body mass index 30+ - obesity (finding) 33903
Fatigue (finding) 33855
Sputum finding (finding) 29666
Prediabetes 25442
Hypertension 23019
Name: count, dtype: int64
The keywords for identifying COVID-related outcomes are collected from conditions['DESCRIPTION'].values that contain “(disorder)” and related features.
Define covid_date as the earliest date each patient was diagnosed with COVID. This is important for analyses involving subsequent time points.
# Filter COVID-19 rows
covid_conds = conditions[conditions['DESCRIPTION'] == 'COVID-19'].copy()
# Convert START to datetime
covid_conds['START'] = pd.to_datetime(covid_conditions['START'])
# Group by PATIENT and get the earliest COVID date
covid_dates = covid_conds.groupby('PATIENT')['START'].min().reset_index()
covid_dates.rename(columns={'START': 'covid_date'}, inplace=True)
# Merge back to the original conditions table
covid_conditions = conditions.merge(covid_dates, on='PATIENT', how='left')
# Preview result
covid_conditions.head()| START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | |
|---|---|---|---|---|---|---|---|
| 0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media | 2020-03-01 |
| 1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media | 2020-03-01 |
| 2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 2020-03-01 |
| 3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 | 2020-03-01 |
| 4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 | 2020-03-01 |
Define specific patient groups
# inpatients
inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()
completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
negative_covid_patient_ids = observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
# survivors
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)
# deceased patients
deceased_ids = patients[
patients['Id'].isin(covid_patient_ids) & patients['DEATHDATE'].notna()
]['Id'].unique()Further filter covid-related outcomes
dis_cons = covid_conditions[
(covid_conditions['DESCRIPTION'].str.contains(r'\(disorder\)', na=False)) &
(~covid_conditions['DESCRIPTION'].str.contains(
r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis', case=False, na=False))
].copy()
dis_cons['inpatient'] = dis_cons.PATIENT.isin(inpatient_ids)
dis_cons['survivor'] = dis_cons.PATIENT.isin(survivor_ids)
dis_cons['death'] = dis_cons.PATIENT.isin(deceased_ids)
dis_cons.head()| START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | inpatient | survivor | death | |
|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 2020-04-28 | 2020-05-08 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 794aac2b-0eb8-4c36-98bb-a4317b344e6e | 43878008 | Streptococcal sore throat (disorder) | 2020-03-13 | False | True | False |
| 98 | 2015-02-11 | NaN | da454f92-9e90-43f8-9514-550fbaa6de3d | 6131ae6d-0623-4772-bc9a-c3157aee44fe | 87433001 | Pulmonary emphysema (disorder) | 2020-03-13 | False | True | False |
| 107 | 2010-12-17 | NaN | 29581d13-2687-402f-9c4c-06141a319e61 | 3af37864-e57e-4c9b-91bc-3313aa3fe518 | 254837009 | Malignant neoplasm of breast (disorder) | 2020-03-13 | False | True | False |
| 141 | 2020-03-07 | 2020-03-20 | 3c3b89b1-cb41-4f94-9193-2f3da4fe38e5 | 60b6cbc7-bbee-4fdd-9283-e8462a56dc34 | 770349000 | Sepsis caused by virus (disorder) | 2020-03-07 | True | True | False |
| 166 | 2007-04-27 | NaN | 7e2c6949-dce9-44be-b6ef-50aca95840d7 | ef7e86ea-37d3-491d-bd6e-60fb09fb21cf | 254837009 | Malignant neoplasm of breast (disorder) | 2020-02-29 | True | True | False |
Group by ‘DESCRIPTION’ for outcomes and count patients using an aggregation function to create the outcomes table
# Add non_survivor column
dis_cons['non_survivor'] = dis_cons['death'] & (~dis_cons['survivor'])
# Total patients per group
total_inpatients = len(inpatient_ids)
total_survivors = dis_cons[dis_cons['survivor']]['PATIENT'].nunique()
total_non_survivors = dis_cons[dis_cons['non_survivor']]['PATIENT'].nunique()
# Group by outcome and count patients per group
summary_df = dis_cons.groupby('DESCRIPTION').agg(
total=('PATIENT', 'nunique'),
inpatient=('inpatient', 'sum'),
survivors=('survivor', 'sum'),
non_survivors=('non_survivor', 'sum')
).reset_index()
# Calculate percentages using group totals
summary_df['percent of inpatient'] = round(summary_df['inpatient'] / total_inpatients * 100, 2)
summary_df['percent survivors'] = round(summary_df['survivors'] / total_survivors * 100, 2)
summary_df['percent non survivors'] = round(summary_df['non_survivors'] / total_non_survivors * 100, 2)
# Rename for clarity
summary_df.rename(columns={'DESCRIPTION': 'outcome'}, inplace=True)
# Sort by inpatient count or total
summary_df = summary_df.sort_values(by='inpatient', ascending=False)
# Final columns
summary_df = summary_df[[
'outcome', 'total', 'percent of inpatient',
'survivors', 'percent survivors',
'non_survivors', 'percent non survivors'
]]
# Show result
summary_df.sort_values(by='percent non survivors', ascending=False).head(7)| outcome | total | percent of inpatient | survivors | percent survivors | non_survivors | percent non survivors | |
|---|---|---|---|---|---|---|---|
| 26 | Sepsis caused by virus (disorder) | 6941 | 38.19 | 3419 | 24.20 | 3520 | 99.83 |
| 2 | Acute respiratory failure (disorder) | 8714 | 47.87 | 5237 | 37.07 | 3462 | 98.18 |
| 1 | Acute respiratory distress syndrome (disorder) | 2401 | 13.21 | 85 | 0.60 | 2316 | 65.68 |
| 27 | Septic shock (disorder) | 1746 | 9.61 | 0 | 0.00 | 1746 | 49.52 |
| 8 | Heart failure (disorder) | 1434 | 7.89 | 124 | 0.88 | 1310 | 37.15 |
| 0 | Acquired coagulation disorder (disorder) | 1389 | 7.64 | 91 | 0.64 | 1298 | 36.81 |
| 10 | Injury of heart (disorder) | 1288 | 7.09 | 20 | 0.14 | 1268 | 35.96 |
Use the groupby() and sum() to create counts and percents table
# Calculate total unique patients per category
total_inpatients = len(inpatient_ids)
total_survivors = dis_cons[dis_cons['survivor'] == True]['PATIENT'].nunique()
total_non_survivors = dis_cons[dis_cons['death'] == True]['PATIENT'].nunique()
# Group inpatient counts
df1 = dis_cons.groupby('DESCRIPTION')['inpatient'].sum().reset_index()
df1['percent_inpatient'] = round(df1['inpatient'] / total_inpatients * 100, 2)
# Group survivor counts
df2 = dis_cons.groupby('DESCRIPTION')['survivor'].sum().reset_index()
df2['percent_survivor'] = round(df2['survivor'] / total_survivors * 100, 2)
# Group death counts
df3 = dis_cons.groupby('DESCRIPTION')['death'].sum().reset_index()
df3['percent_non_survivor'] = round(df3['death'] / total_non_survivors * 100, 2)
# Merge all on DESCRIPTION
output_df = df1.merge(df2, on='DESCRIPTION').merge(df3, on='DESCRIPTION')
# Optional sorting by inpatient count
output_df = output_df.sort_values(by='percent_non_survivor', ascending=False)
# Display top 10
output_df.head(7)| DESCRIPTION | inpatient | percent_inpatient | survivor | percent_survivor | death | percent_non_survivor | |
|---|---|---|---|---|---|---|---|
| 26 | Sepsis caused by virus (disorder) | 6941 | 38.19 | 3419 | 24.20 | 3524 | 99.07 |
| 2 | Acute respiratory failure (disorder) | 8701 | 47.87 | 5237 | 37.07 | 3471 | 97.58 |
| 1 | Acute respiratory distress syndrome (disorder) | 2401 | 13.21 | 85 | 0.60 | 2316 | 65.11 |
| 27 | Septic shock (disorder) | 1746 | 9.61 | 0 | 0.00 | 1746 | 49.09 |
| 8 | Heart failure (disorder) | 1434 | 7.89 | 124 | 0.88 | 1310 | 36.83 |
| 0 | Acquired coagulation disorder (disorder) | 1389 | 7.64 | 91 | 0.64 | 1298 | 36.49 |
| 10 | Injury of heart (disorder) | 1288 | 7.09 | 20 | 0.14 | 1268 | 35.65 |
Labs values - Aggregation for continuous data
# Filter observations by specific CODE values
lab_obs = observations[
observations['CODE'].isin(['48065-7', '26881-3', '2276-4', '89579-7', '731-0', '14804-9'])
]
# Further filter by COVID patient IDs
lab_obs = lab_obs[lab_obs['PATIENT'].isin(covid_patient_ids)]
# Ensure the DATE column is in datetime format
lab_obs['DATE'] = pd.to_datetime(lab_obs['DATE'], errors='coerce')
# Filter rows after 2020-01-20 for covid-related data
lab_obs = lab_obs[lab_obs['DATE'] > '2020-01-20']
lab_obs['survivor'] = lab_obs.PATIENT.isin(survivor_ids)
# Preview first 3 rows
lab_obs.head(3)| DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | survivor | |
|---|---|---|---|---|---|---|---|---|---|
| 254 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric | True |
| 273 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | True |
| 274 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric | True |
Compare labs values between groups of patients who are ‘survivor’ or ‘non-survivor’.
# transform value to numeric
lab_obs['VALUE'] = pd.to_numeric(lab_obs['VALUE'], errors='coerce')
# aggregating data
lab_obs.pivot_table(index="DESCRIPTION", columns="survivor", values="VALUE", aggfunc="mean").round(2)| survivor | False | True |
|---|---|---|
| DESCRIPTION | ||
| Ferritin [Mass/volume] in Serum or Plasma | 1441.62 | 461.64 |
| Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma | 8.81 | 0.43 |
| Interleukin 6 [Mass/volume] in Serum or Plasma | 11.73 | 5.65 |
| Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction | 356.88 | 243.38 |
| Lymphocytes [#/volume] in Blood by Automated count | 0.58 | 1.01 |
| Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method | 30.40 | 2.99 |
Compare labs values between time points as ‘baseline’ and ‘week_1’ later
lab_obs = lab_obs.copy().merge(covid_dates, on = 'PATIENT')
lab_obs['days'] = (pd.to_datetime(lab_obs['DATE']) - pd.to_datetime(lab_obs['covid_date'])).dt.days
lab_obs.head()| DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | survivor | covid_date | days | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric | True | 2020-02-19 | 0 |
| 1 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | True | 2020-02-19 | 0 |
| 2 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric | True | 2020-02-19 | 0 |
| 3 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 89579-7 | Troponin I.cardiac [Mass/volume] in Serum or P... | 2.3 | pg/mL | numeric | True | 2020-02-19 | 0 |
| 4 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 14804-9 | Lactate dehydrogenase [Enzymatic activity/volu... | 223.9 | U/L | numeric | True | 2020-02-19 | 0 |
Take a look at ‘days’ column.
lab_obs['days'].describe()count 716748.000000
mean 6.085112
std 4.248569
min 0.000000
25% 2.000000
50% 6.000000
75% 9.000000
max 26.000000
Name: days, dtype: float64
Compare labs values between time points as ‘baseline’ and ‘week_1’ later.
# Assign Timepoints
def assign_timepoint(days):
if days == 0:
return 'baseline'
elif 6 <= days <= 8:
return 'week_1'
else:
return None
lab_obs['TIMEPOINT'] = lab_obs['days'].apply(assign_timepoint)
lab_obs = lab_obs.dropna(subset=['TIMEPOINT'])
lab_obs.head()| DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | survivor | covid_date | days | TIMEPOINT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric | True | 2020-02-19 | 0 | baseline |
| 1 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | True | 2020-02-19 | 0 | baseline |
| 2 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric | True | 2020-02-19 | 0 | baseline |
| 3 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 89579-7 | Troponin I.cardiac [Mass/volume] in Serum or P... | 2.3 | pg/mL | numeric | True | 2020-02-19 | 0 | baseline |
| 4 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 14804-9 | Lactate dehydrogenase [Enzymatic activity/volu... | 223.9 | U/L | numeric | True | 2020-02-19 | 0 | baseline |
# aggregating data
lab_obs.pivot_table(index="DESCRIPTION", columns=["survivor", "TIMEPOINT"], values="VALUE", aggfunc="mean").round(2)| survivor | False | True | ||
|---|---|---|---|---|
| TIMEPOINT | baseline | week_1 | baseline | week_1 |
| DESCRIPTION | ||||
| Ferritin [Mass/volume] in Serum or Plasma | 998.97 | 1586.77 | 400.51 | 497.94 |
| Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma | 1.50 | 4.82 | 0.30 | 0.50 |
| Interleukin 6 [Mass/volume] in Serum or Plasma | NaN | 11.79 | NaN | 5.99 |
| Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction | 329.85 | 330.42 | 224.87 | 237.25 |
| Lymphocytes [#/volume] in Blood by Automated count | 0.60 | 0.63 | 1.05 | 0.90 |
| Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method | 8.00 | 23.26 | 2.50 | 2.99 |
Ferritin [Mass/volume] in Serum or Plasma, Fibrin D-dimer FEU [Mass/volume] in Platelet and Troponin I. cardiac [Mass/volume] in Serum increase significantly at one week later among covid patients who died.
Table of N, means and SD between time points
# Aggregate Statistics
summary = lab_obs.groupby(['DESCRIPTION','TIMEPOINT'])['VALUE'].agg(
N='count',
mean='mean',
sd='std'
).reset_index()
# Output Results
order = ['baseline', 'week_1']
summary = summary.set_index('TIMEPOINT').loc[order].reset_index().round(2)
print(summary) TIMEPOINT DESCRIPTION N \
0 baseline Ferritin [Mass/volume] in Serum or Plasma 17628
1 baseline Fibrin D-dimer FEU [Mass/volume] in Platelet p... 17628
2 baseline Lactate dehydrogenase [Enzymatic activity/volu... 17628
3 baseline Lymphocytes [#/volume] in Blood by Automated c... 17628
4 baseline Troponin I.cardiac [Mass/volume] in Serum or P... 17628
5 week_1 Ferritin [Mass/volume] in Serum or Plasma 36136
6 week_1 Fibrin D-dimer FEU [Mass/volume] in Platelet p... 36136
7 week_1 Interleukin 6 [Mass/volume] in Serum or Plasma 326
8 week_1 Lactate dehydrogenase [Enzymatic activity/volu... 36136
9 week_1 Lymphocytes [#/volume] in Blood by Automated c... 53945
10 week_1 Troponin I.cardiac [Mass/volume] in Serum or P... 36136
mean sd
0 516.42 247.60
1 0.53 0.50
2 245.20 43.67
3 0.96 0.19
4 3.57 2.29
5 719.98 444.59
6 1.38 2.74
7 7.34 2.60
8 256.25 40.67
9 0.85 0.13
10 7.13 8.27
Conclusion
In this chapter, I explore tools for combining, joining, and restructuring data, along with functions for generating summary tables—such as pivot_table and techniques for recoding datetime variables.
I demonstrate data aggregation for both categorical and continuous variables, using functions like groupby(), agg(), count(), sum(), mean(), and std() to create meaningful summary tables.