import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime as dt
Chaper_2 - Creating Data Aggregation Tables
Creating date aggregation tables
Creating aggregated tables is a common task in data analysis. In this chapter, I demonstrate how to build aggregation tables across groups of patients. To prepare the data for further analysis involving both categorical and continuous variables, I use various tools to combine, join, and restructure the data.
Functions such as groupby()
and agg()
—with methods like count()
, sum()
, mean()
, and std()
—are commonly used to generate these summary tables.
Loading Data
Loading the necessary Python packages:
# Ensure all columns are shown
'display.max_columns', None)
pd.set_option('display.width', None) pd.set_option(
Loading csv data
# Load the data
= pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv") medications
Outcomes - Data aggregation and group operations (counts and percents) for categorical data
I want to calculate both the counts and percentages of COVID-related outcomes among different patient groups, specifically:
Inpatients
Survivors
Non-survivors
This analysis will help understand how outcomes vary across these categories and can be used to support further epidemiological insights or health system planning.
Define covid patients
= conditions[conditions.CODE == 840539006].PATIENT.unique()
covid_patient_ids = conditions[conditions.PATIENT.isin(covid_patient_ids)]
covid_conditions print(covid_conditions.columns)
covid_conditions.head()
Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | |
---|---|---|---|---|---|---|
0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media |
1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media |
2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) |
3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 |
4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 |
Outcomes
'DESCRIPTION'].value_counts().head(10) covid_conditions[
DESCRIPTION
Suspected COVID-19 88166
COVID-19 88166
Fever (finding) 78341
Cough (finding) 59766
Loss of taste (finding) 44646
Body mass index 30+ - obesity (finding) 33903
Fatigue (finding) 33855
Sputum finding (finding) 29666
Prediabetes 25442
Hypertension 23019
Name: count, dtype: int64
The keywords for identifying COVID-related outcomes are collected from conditions['DESCRIPTION'].values
that contain “(disorder)” and related features.
Define covid_date
as the earliest date each patient was diagnosed with COVID. This is important for analyses involving subsequent time points.
# Filter COVID-19 rows
= conditions[conditions['DESCRIPTION'] == 'COVID-19'].copy()
covid_conds
# Convert START to datetime
'START'] = pd.to_datetime(covid_conditions['START'])
covid_conds[
# Group by PATIENT and get the earliest COVID date
= covid_conds.groupby('PATIENT')['START'].min().reset_index()
covid_dates ={'START': 'covid_date'}, inplace=True)
covid_dates.rename(columns
# Merge back to the original conditions table
= conditions.merge(covid_dates, on='PATIENT', how='left')
covid_conditions
# Preview result
covid_conditions.head()
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | |
---|---|---|---|---|---|---|---|
0 | 2019-02-15 | 2019-08-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | d317dacb-d801-4136-be7d-440094f7ae83 | 65363002 | Otitis media | 2020-03-01 |
1 | 2019-10-30 | 2020-01-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 759cfd9c-b40b-45df-a5b2-fca418027a6a | 65363002 | Otitis media | 2020-03-01 |
2 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 386661006 | Fever (finding) | 2020-03-01 |
3 | 2020-03-01 | 2020-03-01 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840544004 | Suspected COVID-19 | 2020-03-01 |
4 | 2020-03-01 | 2020-03-30 | 1ff7f10f-a204-4bb1-aa72-dd763fa99482 | 60584c6a-e26b-4176-a660-86ad84c7ceca | 840539006 | COVID-19 | 2020-03-01 |
Define specific patient groups
# inpatients
= encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT.unique()
inpatient_ids
= care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT.unique()
completed_isolation_patients
= observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
negative_covid_patient_ids
# survivors
= np.union1d(completed_isolation_patients, negative_covid_patient_ids)
survivor_ids
# deceased patients
= patients[
deceased_ids 'Id'].isin(covid_patient_ids) & patients['DEATHDATE'].notna()
patients['Id'].unique() ][
Further filter covid-related outcomes
= covid_conditions[
dis_cons 'DESCRIPTION'].str.contains(r'\(disorder\)', na=False)) &
(covid_conditions[~covid_conditions['DESCRIPTION'].str.contains(
(r'pneumonia|hypoxemia|chronic|Hypertriglyceridemia|Hyperlipidemia|Hypercholesterolemia|Metabolic|osteoporosis|Alzheimer|Anemia|sinusitis|thrombosis|embolism|pharyngitis|bronchitis', case=False, na=False))
].copy()
'inpatient'] = dis_cons.PATIENT.isin(inpatient_ids)
dis_cons['survivor'] = dis_cons.PATIENT.isin(survivor_ids)
dis_cons['death'] = dis_cons.PATIENT.isin(deceased_ids)
dis_cons[ dis_cons.head()
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | covid_date | inpatient | survivor | death | |
---|---|---|---|---|---|---|---|---|---|---|
12 | 2020-04-28 | 2020-05-08 | 9bcf6ed5-d808-44af-98a0-7d78a29ede72 | 794aac2b-0eb8-4c36-98bb-a4317b344e6e | 43878008 | Streptococcal sore throat (disorder) | 2020-03-13 | False | True | False |
98 | 2015-02-11 | NaN | da454f92-9e90-43f8-9514-550fbaa6de3d | 6131ae6d-0623-4772-bc9a-c3157aee44fe | 87433001 | Pulmonary emphysema (disorder) | 2020-03-13 | False | True | False |
107 | 2010-12-17 | NaN | 29581d13-2687-402f-9c4c-06141a319e61 | 3af37864-e57e-4c9b-91bc-3313aa3fe518 | 254837009 | Malignant neoplasm of breast (disorder) | 2020-03-13 | False | True | False |
141 | 2020-03-07 | 2020-03-20 | 3c3b89b1-cb41-4f94-9193-2f3da4fe38e5 | 60b6cbc7-bbee-4fdd-9283-e8462a56dc34 | 770349000 | Sepsis caused by virus (disorder) | 2020-03-07 | True | True | False |
166 | 2007-04-27 | NaN | 7e2c6949-dce9-44be-b6ef-50aca95840d7 | ef7e86ea-37d3-491d-bd6e-60fb09fb21cf | 254837009 | Malignant neoplasm of breast (disorder) | 2020-02-29 | True | True | False |
Group by ‘DESCRIPTION’ for outcomes and count patients using an aggregation function to create the outcomes table
# Add non_survivor column
'non_survivor'] = dis_cons['death'] & (~dis_cons['survivor'])
dis_cons[
# Total patients per group
= len(inpatient_ids)
total_inpatients = dis_cons[dis_cons['survivor']]['PATIENT'].nunique()
total_survivors = dis_cons[dis_cons['non_survivor']]['PATIENT'].nunique()
total_non_survivors
# Group by outcome and count patients per group
= dis_cons.groupby('DESCRIPTION').agg(
summary_df =('PATIENT', 'nunique'),
total=('inpatient', 'sum'),
inpatient=('survivor', 'sum'),
survivors=('non_survivor', 'sum')
non_survivors
).reset_index()
# Calculate percentages using group totals
'percent of inpatient'] = round(summary_df['inpatient'] / total_inpatients * 100, 2)
summary_df['percent survivors'] = round(summary_df['survivors'] / total_survivors * 100, 2)
summary_df['percent non survivors'] = round(summary_df['non_survivors'] / total_non_survivors * 100, 2)
summary_df[
# Rename for clarity
={'DESCRIPTION': 'outcome'}, inplace=True)
summary_df.rename(columns
# Sort by inpatient count or total
= summary_df.sort_values(by='inpatient', ascending=False)
summary_df
# Final columns
= summary_df[[
summary_df 'outcome', 'total', 'percent of inpatient',
'survivors', 'percent survivors',
'non_survivors', 'percent non survivors'
]]
# Show result
='percent non survivors', ascending=False).head(7) summary_df.sort_values(by
outcome | total | percent of inpatient | survivors | percent survivors | non_survivors | percent non survivors | |
---|---|---|---|---|---|---|---|
26 | Sepsis caused by virus (disorder) | 6941 | 38.19 | 3419 | 24.20 | 3520 | 99.83 |
2 | Acute respiratory failure (disorder) | 8714 | 47.87 | 5237 | 37.07 | 3462 | 98.18 |
1 | Acute respiratory distress syndrome (disorder) | 2401 | 13.21 | 85 | 0.60 | 2316 | 65.68 |
27 | Septic shock (disorder) | 1746 | 9.61 | 0 | 0.00 | 1746 | 49.52 |
8 | Heart failure (disorder) | 1434 | 7.89 | 124 | 0.88 | 1310 | 37.15 |
0 | Acquired coagulation disorder (disorder) | 1389 | 7.64 | 91 | 0.64 | 1298 | 36.81 |
10 | Injury of heart (disorder) | 1288 | 7.09 | 20 | 0.14 | 1268 | 35.96 |
Use the groupby() and sum() to create counts and percents table
# Calculate total unique patients per category
= len(inpatient_ids)
total_inpatients = dis_cons[dis_cons['survivor'] == True]['PATIENT'].nunique()
total_survivors = dis_cons[dis_cons['death'] == True]['PATIENT'].nunique()
total_non_survivors
# Group inpatient counts
= dis_cons.groupby('DESCRIPTION')['inpatient'].sum().reset_index()
df1 'percent_inpatient'] = round(df1['inpatient'] / total_inpatients * 100, 2)
df1[
# Group survivor counts
= dis_cons.groupby('DESCRIPTION')['survivor'].sum().reset_index()
df2 'percent_survivor'] = round(df2['survivor'] / total_survivors * 100, 2)
df2[
# Group death counts
= dis_cons.groupby('DESCRIPTION')['death'].sum().reset_index()
df3 'percent_non_survivor'] = round(df3['death'] / total_non_survivors * 100, 2)
df3[
# Merge all on DESCRIPTION
= df1.merge(df2, on='DESCRIPTION').merge(df3, on='DESCRIPTION')
output_df
# Optional sorting by inpatient count
= output_df.sort_values(by='percent_non_survivor', ascending=False)
output_df
# Display top 10
7) output_df.head(
DESCRIPTION | inpatient | percent_inpatient | survivor | percent_survivor | death | percent_non_survivor | |
---|---|---|---|---|---|---|---|
26 | Sepsis caused by virus (disorder) | 6941 | 38.19 | 3419 | 24.20 | 3524 | 99.07 |
2 | Acute respiratory failure (disorder) | 8701 | 47.87 | 5237 | 37.07 | 3471 | 97.58 |
1 | Acute respiratory distress syndrome (disorder) | 2401 | 13.21 | 85 | 0.60 | 2316 | 65.11 |
27 | Septic shock (disorder) | 1746 | 9.61 | 0 | 0.00 | 1746 | 49.09 |
8 | Heart failure (disorder) | 1434 | 7.89 | 124 | 0.88 | 1310 | 36.83 |
0 | Acquired coagulation disorder (disorder) | 1389 | 7.64 | 91 | 0.64 | 1298 | 36.49 |
10 | Injury of heart (disorder) | 1288 | 7.09 | 20 | 0.14 | 1268 | 35.65 |
Labs values - Aggregation for continuous data
# Filter observations by specific CODE values
= observations[
lab_obs 'CODE'].isin(['48065-7', '26881-3', '2276-4', '89579-7', '731-0', '14804-9'])
observations[
]
# Further filter by COVID patient IDs
= lab_obs[lab_obs['PATIENT'].isin(covid_patient_ids)]
lab_obs
# Ensure the DATE column is in datetime format
'DATE'] = pd.to_datetime(lab_obs['DATE'], errors='coerce')
lab_obs[
# Filter rows after 2020-01-20 for covid-related data
= lab_obs[lab_obs['DATE'] > '2020-01-20']
lab_obs
'survivor'] = lab_obs.PATIENT.isin(survivor_ids)
lab_obs[
# Preview first 3 rows
3) lab_obs.head(
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | survivor | |
---|---|---|---|---|---|---|---|---|---|
254 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric | True |
273 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | True |
274 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric | True |
Compare labs values between groups of patients who are ‘survivor’ or ‘non-survivor’.
# transform value to numeric
'VALUE'] = pd.to_numeric(lab_obs['VALUE'], errors='coerce')
lab_obs[
# aggregating data
="DESCRIPTION", columns="survivor", values="VALUE", aggfunc="mean").round(2) lab_obs.pivot_table(index
survivor | False | True |
---|---|---|
DESCRIPTION | ||
Ferritin [Mass/volume] in Serum or Plasma | 1441.62 | 461.64 |
Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma | 8.81 | 0.43 |
Interleukin 6 [Mass/volume] in Serum or Plasma | 11.73 | 5.65 |
Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction | 356.88 | 243.38 |
Lymphocytes [#/volume] in Blood by Automated count | 0.58 | 1.01 |
Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method | 30.40 | 2.99 |
Compare labs values between time points as ‘baseline’ and ‘week_1’ later
= lab_obs.copy().merge(covid_dates, on = 'PATIENT')
lab_obs 'days'] = (pd.to_datetime(lab_obs['DATE']) - pd.to_datetime(lab_obs['covid_date'])).dt.days
lab_obs[ lab_obs.head()
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | survivor | covid_date | days | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric | True | 2020-02-19 | 0 |
1 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | True | 2020-02-19 | 0 |
2 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric | True | 2020-02-19 | 0 |
3 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 89579-7 | Troponin I.cardiac [Mass/volume] in Serum or P... | 2.3 | pg/mL | numeric | True | 2020-02-19 | 0 |
4 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 14804-9 | Lactate dehydrogenase [Enzymatic activity/volu... | 223.9 | U/L | numeric | True | 2020-02-19 | 0 |
Take a look at ‘days’ column.
'days'].describe() lab_obs[
count 716748.000000
mean 6.085112
std 4.248569
min 0.000000
25% 2.000000
50% 6.000000
75% 9.000000
max 26.000000
Name: days, dtype: float64
Compare labs values between time points as ‘baseline’ and ‘week_1’ later.
# Assign Timepoints
def assign_timepoint(days):
if days == 0:
return 'baseline'
elif 6 <= days <= 8:
return 'week_1'
else:
return None
'TIMEPOINT'] = lab_obs['days'].apply(assign_timepoint)
lab_obs[= lab_obs.dropna(subset=['TIMEPOINT'])
lab_obs lab_obs.head()
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | survivor | covid_date | days | TIMEPOINT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 731-0 | Lymphocytes [#/volume] in Blood by Automated c... | 1.1 | 10*3/uL | numeric | True | 2020-02-19 | 0 | baseline |
1 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | True | 2020-02-19 | 0 | baseline |
2 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 2276-4 | Ferritin [Mass/volume] in Serum or Plasma | 332.4 | ug/L | numeric | True | 2020-02-19 | 0 | baseline |
3 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 89579-7 | Troponin I.cardiac [Mass/volume] in Serum or P... | 2.3 | pg/mL | numeric | True | 2020-02-19 | 0 | baseline |
4 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 14804-9 | Lactate dehydrogenase [Enzymatic activity/volu... | 223.9 | U/L | numeric | True | 2020-02-19 | 0 | baseline |
# aggregating data
="DESCRIPTION", columns=["survivor", "TIMEPOINT"], values="VALUE", aggfunc="mean").round(2) lab_obs.pivot_table(index
survivor | False | True | ||
---|---|---|---|---|
TIMEPOINT | baseline | week_1 | baseline | week_1 |
DESCRIPTION | ||||
Ferritin [Mass/volume] in Serum or Plasma | 998.97 | 1586.77 | 400.51 | 497.94 |
Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma | 1.50 | 4.82 | 0.30 | 0.50 |
Interleukin 6 [Mass/volume] in Serum or Plasma | NaN | 11.79 | NaN | 5.99 |
Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction | 329.85 | 330.42 | 224.87 | 237.25 |
Lymphocytes [#/volume] in Blood by Automated count | 0.60 | 0.63 | 1.05 | 0.90 |
Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method | 8.00 | 23.26 | 2.50 | 2.99 |
Ferritin [Mass/volume] in Serum or Plasma, Fibrin D-dimer FEU [Mass/volume] in Platelet and Troponin I. cardiac [Mass/volume] in Serum increase significantly at one week later among covid patients who died.
Table of N, means and SD between time points
# Aggregate Statistics
= lab_obs.groupby(['DESCRIPTION','TIMEPOINT'])['VALUE'].agg(
summary ='count',
N='mean',
mean='std'
sd
).reset_index()
# Output Results
= ['baseline', 'week_1']
order = summary.set_index('TIMEPOINT').loc[order].reset_index().round(2)
summary print(summary)
TIMEPOINT DESCRIPTION N \
0 baseline Ferritin [Mass/volume] in Serum or Plasma 17628
1 baseline Fibrin D-dimer FEU [Mass/volume] in Platelet p... 17628
2 baseline Lactate dehydrogenase [Enzymatic activity/volu... 17628
3 baseline Lymphocytes [#/volume] in Blood by Automated c... 17628
4 baseline Troponin I.cardiac [Mass/volume] in Serum or P... 17628
5 week_1 Ferritin [Mass/volume] in Serum or Plasma 36136
6 week_1 Fibrin D-dimer FEU [Mass/volume] in Platelet p... 36136
7 week_1 Interleukin 6 [Mass/volume] in Serum or Plasma 326
8 week_1 Lactate dehydrogenase [Enzymatic activity/volu... 36136
9 week_1 Lymphocytes [#/volume] in Blood by Automated c... 53945
10 week_1 Troponin I.cardiac [Mass/volume] in Serum or P... 36136
mean sd
0 516.42 247.60
1 0.53 0.50
2 245.20 43.67
3 0.96 0.19
4 3.57 2.29
5 719.98 444.59
6 1.38 2.74
7 7.34 2.60
8 256.25 40.67
9 0.85 0.13
10 7.13 8.27
Conclusion
In this chapter, I explore tools for combining, joining, and restructuring data, along with functions for generating summary tables—such as pivot_table
and techniques for recoding datetime variables.
I demonstrate data aggregation for both categorical and continuous variables, using functions like groupby()
, agg()
, count()
, sum()
, mean()
, and std()
to create meaningful summary tables.