import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings("ignore")
Chapter 5 - Inferential Statistical Analysis in Healthcare
Introduction
Inferential statistical analysis plays a vital role in healthcare by enabling researchers and analysts to draw conclusions about populations based on data collected from samples. Unlike descriptive statistics, which summarize data, inferential statistics allow us to make predictions, test hypotheses, and estimate relationships within the broader patient population.
In healthcare, inferential methods are essential for evaluating treatment effectiveness, identifying risk factors, comparing patient outcomes across groups, and supporting evidence-based decision-making. Techniques such as hypothesis testing, confidence intervals, regression analysis, and survival analysis help clinicians and policymakers make informed judgments, even when working with limited or incomplete data.
By applying inferential statistics, healthcare professionals can move beyond what is immediately observed in the data and make reliable generalizations that guide clinical practice, improve patient care, and inform public health strategies.
Load the necessary Python packages and data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")
# Ensure all columns are shown
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
Basic descriptive statistics: count, mean, min, max, median, sd
# Keep the six lab codes of interest
lab_obs = observations[
    observations['CODE'].isin(['48065-7', '26881-3', '2276-4', '89579-7', '731-0', '14804-9'])
]
lab_obs = lab_obs[['PATIENT', 'CODE', 'DESCRIPTION', 'VALUE']].dropna()
# Convert VALUE to numeric and drop anything that cannot be parsed
lab_obs['VALUE'] = pd.to_numeric(lab_obs['VALUE'], errors='coerce')
lab_obs = lab_obs.dropna(subset=['VALUE'])
# Summarize each lab test
lab_obs_summary = lab_obs.groupby('DESCRIPTION').agg(
    count=('VALUE', 'count'),
    mean=('VALUE', 'mean'),
    min=('VALUE', 'min'),
    max=('VALUE', 'max'),
    median=('VALUE', 'median'),
    sd=('VALUE', 'std')
).reset_index()
lab_obs_summary['mean'] = lab_obs_summary['mean'].round(2)
lab_obs_summary['sd'] = lab_obs_summary['sd'].round(2)
lab_obs_summary
| | DESCRIPTION | count | mean | min | max | median | sd |
---|---|---|---|---|---|---|---|
0 | Ferritin [Mass/volume] in Serum or Plasma | 124242 | 706.51 | 300.0 | 2000.0 | 495.9 | 463.60 |
1 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 124242 | 2.52 | 0.2 | 45.0 | 0.5 | 5.67 |
2 | Interleukin 6 [Mass/volume] in Serum or Plasma | 984 | 7.44 | 4.1 | 29.5 | 6.1 | 3.46 |
3 | Lactate dehydrogenase [Enzymatic activity/volu... | 124242 | 271.74 | 180.1 | 599.3 | 243.9 | 60.77 |
4 | Lymphocytes [#/volume] in Blood by Automated c... | 218848 | 0.92 | 0.3 | 1.5 | 1.0 | 0.20 |
5 | Troponin I.cardiac [Mass/volume] in Serum or P... | 124242 | 9.84 | 1.5 | 349.9 | 3.3 | 23.29 |
Confidence interval
def confidence_interval(data, confidence=0.95):
    # t-based interval for the mean: mean ± t * standard error
    mean = np.mean(data)
    sem = stats.sem(data)
    margin_of_error = sem * stats.t.ppf((1 + confidence) / 2., len(data) - 1)
    return mean - margin_of_error, mean + margin_of_error
lab_obs_ci = lab_obs.groupby('DESCRIPTION')['VALUE'].apply(confidence_interval).reset_index()
lab_obs_ci.columns = ['DESCRIPTION', 'CI']
lab_obs_ci['CI'] = lab_obs_ci['CI'].apply(lambda x: f"{x[0]:.2f} - {x[1]:.2f}")
lab_obs_ci
| | DESCRIPTION | CI |
---|---|---|
0 | Ferritin [Mass/volume] in Serum or Plasma | 703.94 - 709.09 |
1 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 2.49 - 2.55 |
2 | Interleukin 6 [Mass/volume] in Serum or Plasma | 7.22 - 7.66 |
3 | Lactate dehydrogenase [Enzymatic activity/volu... | 271.40 - 272.08 |
4 | Lymphocytes [#/volume] in Blood by Automated c... | 0.92 - 0.92 |
5 | Troponin I.cardiac [Mass/volume] in Serum or P... | 9.71 - 9.97 |
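As a cross-check on the interval formula, the sketch below recomputes the same t-based interval on synthetic data and compares it with SciPy's `stats.t.interval`. The sample values and the random seed are assumptions made for illustration; they are not Synthea data.

```python
import numpy as np
from scipy import stats

def confidence_interval(data, confidence=0.95):
    """t-based confidence interval for the mean: mean ± t * SEM."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    sem = stats.sem(data)  # sample SD / sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, len(data) - 1)
    return mean - t_crit * sem, mean + t_crit * sem

# Synthetic, hypothetical lab values (not taken from the Synthea files)
rng = np.random.default_rng(0)
sample = rng.normal(loc=270, scale=60, size=200)

manual = confidence_interval(sample)
builtin = stats.t.interval(0.95, len(sample) - 1,
                           loc=sample.mean(), scale=stats.sem(sample))
print("manual :", manual)
print("builtin:", builtin)
```

The interval describes the precision of the estimated mean, not the spread of individual values. With more than 100,000 observations per test, the intervals in the table above are therefore very narrow.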
T-test
The t-test is a statistical method used to compare the means of two groups to determine whether the difference between them is statistically significant. In healthcare, it is widely used to analyze patient outcomes, treatment effects, and biomarker levels.
For example, a t-test can help evaluate whether:
- The average blood pressure differs between treated and untreated patients.
- Lab values are significantly different in patients who survived versus those who did not.
- Recovery times differ between two types of surgical procedures.
There are different types of t-tests, such as:
- Independent t-test (two separate groups)
- Paired t-test (before-and-after measurements on the same subjects)
- One-sample t-test (comparing to a known value or norm)
By using the t-test correctly, healthcare professionals and researchers can draw meaningful insights from patient data and make data-driven decisions.
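The three variants map onto three SciPy functions. The following is a minimal sketch on synthetic blood pressure readings; the group means, sample sizes, and the reference value of 120 are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical systolic blood pressure readings (mmHg), synthetic data
treated = rng.normal(loc=128, scale=12, size=60)
untreated = rng.normal(loc=135, scale=12, size=60)

# Independent t-test: two separate groups
ind = stats.ttest_ind(treated, untreated)

# Paired t-test: before/after measurements on the same 60 subjects
before = untreated
after = before - rng.normal(loc=5, scale=4, size=60)
paired = stats.ttest_rel(before, after)

# One-sample t-test: compare one group against a reference value
one = stats.ttest_1samp(treated, popmean=120)

for name, res in [("independent", ind), ("paired", paired), ("one-sample", one)]:
    print(f"{name:12s} t = {res.statistic:7.3f}  p = {res.pvalue:.4g}")
```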
Use the function stats.ttest_ind(*groups) for the independent t-test.
from scipy import stats
# Identify deceased patients
deceased_ids = patients[patients.DEATHDATE.notna()].Id.unique()
# Label observations as deceased (1) or not (0)
lab_obs['DECEASED'] = lab_obs['PATIENT'].isin(deceased_ids).astype(int)
# Keep LOINC 48065-7, which the observations table describes as Fibrin D-dimer FEU [Mass/volume]
lab_obs = lab_obs[lab_obs['CODE'] == '48065-7']
# Define t-test function
def t_test(data, group_col, value_col):
    # groupby sorts the keys, so the groups are passed as DECEASED = 0, then DECEASED = 1
    groups = data.groupby(group_col)[value_col].apply(list)
    return stats.ttest_ind(*groups)
# Perform t-test on VALUE between DECEASED = 1 and DECEASED = 0
t_test_results = t_test(lab_obs, 'DECEASED', 'VALUE')
# Store results in a DataFrame
t_test_results_df = pd.DataFrame({
    'Statistic': [t_test_results.statistic],
    'P-value': [t_test_results.pvalue]
})
# Show results
t_test_results_df
| | Statistic | P-value |
---|---|---|
0 | -291.550161 | 0.0 |
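The difference in Fibrin D-dimer FEU (LOINC 48065-7) between deceased and non-deceased patients is statistically significant. The p-value is reported as 0.0, meaning it is smaller than floating-point precision can represent (p < 0.0001). Because the groups are passed in the order DECEASED = 0, then DECEASED = 1, the negative t statistic indicates that deceased patients tend to have higher D-dimer levels than those who survived.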
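With samples this large, even a small difference produces a very small p-value, so it helps to report an effect size alongside the test. The sketch below uses Welch's t-test (`equal_var=False`) and Cohen's d on synthetic, right-skewed data. The helper `cohens_d` and the simulated groups are hypothetical and are not part of the analysis above.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
# Synthetic right-skewed lab values for two hypothetical groups
survived = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
deceased = rng.lognormal(mean=0.3, sigma=0.7, size=500)

# Welch's t-test does not assume equal variances in the two groups
welch = stats.ttest_ind(deceased, survived, equal_var=False)
d = cohens_d(deceased, survived)

print(f"Welch t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
print(f"Cohen's d = {d:.3f}")
```

A common rule of thumb treats d near 0.2 as small, 0.5 as medium, and 0.8 as large, although clinical relevance depends on the measurement.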
ANOVA
ANOVA (Analysis of Variance) is a statistical method used to determine whether there are significant differences between the means of three or more independent groups. Unlike a t-test, which compares only two groups, ANOVA helps assess variability across multiple groups in a single analysis.
In healthcare, ANOVA is commonly used to compare treatment outcomes, lab values, or patient responses across different categories (e.g., age groups, medication types, or hospital departments). A significant result suggests that at least one group mean is different, prompting further investigation.
ANOVA helps researchers make data-driven decisions and identify meaningful patterns without increasing the risk of error from multiple t-tests.
To compare Fibrin D-dimer FEU levels (LOINC 48065-7) across different timepoints relative to COVID-19 diagnosis, we perform a one-way ANOVA test. The steps are: filter the lab observations for this code, calculate the number of days since each patient's first COVID-19 diagnosis, and assign a timepoint based on that number of days.
# Keep LOINC 48065-7, which the observations table describes as Fibrin D-dimer FEU [Mass/volume]
anova_lab_obs = observations[
    observations['CODE'] == '48065-7']
# Filter rows with COVID-19 diagnosis
covid_conds = conditions[conditions['DESCRIPTION'] == 'COVID-19'].copy()
# Convert 'START' to datetime
covid_conds['START'] = pd.to_datetime(covid_conds['START'])
# Get earliest COVID diagnosis date per patient
covid_dates = covid_conds.groupby('PATIENT')['START'].min().reset_index()
covid_dates.rename(columns={'START': 'covid_date'}, inplace=True)
# Merge COVID dates into lab observations
anova_lab_obs = anova_lab_obs.merge(covid_dates, on='PATIENT', how='left')
# Convert lab date to datetime (replace 'DATE' with actual column name if needed)
anova_lab_obs['DATE'] = pd.to_datetime(anova_lab_obs['DATE'])
# Calculate days between lab observation and COVID diagnosis
anova_lab_obs['days'] = (anova_lab_obs['DATE'] - anova_lab_obs['covid_date']).dt.days
# Drop rows with missing 'days' values (patients without COVID diagnosis)
anova_lab_obs = anova_lab_obs.dropna(subset=['days']).copy()
# Convert 'days' to integer
anova_lab_obs['days'] = anova_lab_obs['days'].astype(int)
# Preview
anova_lab_obs.head(3)
| | DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | covid_date | days |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | 2020-02-19 | 0 |
1 | 2020-02-21 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.3 | ug/mL | numeric | 2020-02-19 | 2 |
2 | 2020-02-23 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.2 | ug/mL | numeric | 2020-02-19 | 4 |
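Define three timepoints in the course of COVID-19: baseline (day 0), week 1 (days 6 to 8), and later (day 9 onward). Observations from days 1 to 5, or from before the diagnosis, fall outside these windows and are dropped.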
# Assign timepoints based on number of days
def assign_timepoint(days):
    if days == 0:
        return 'baseline'
    elif 6 <= days <= 8:
        return 'week_1'
    elif 9 <= days:
        return 'later'
    else:
        return None
# Apply the timepoint assignment
anova_lab_obs['TIMEPOINT'] = anova_lab_obs['days'].apply(assign_timepoint)
# Drop rows without a valid timepoint
anova_lab_obs = anova_lab_obs.dropna(subset=['TIMEPOINT']).copy()
# Make sure VALUE is numeric before testing
anova_lab_obs['VALUE'] = pd.to_numeric(anova_lab_obs['VALUE'], errors='coerce')
anova_lab_obs.head(3)
| | DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | covid_date | days | TIMEPOINT |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-19 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.4 | ug/mL | numeric | 2020-02-19 | 0 | baseline |
3 | 2020-02-25 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.5 | ug/mL | numeric | 2020-02-19 | 6 | week_1 |
4 | 2020-02-27 | bd1c4ffc-7f1d-4590-adbb-1d6533fb623e | b7455838-3607-47f4-aaa5-fd89abea7d29 | 48065-7 | Fibrin D-dimer FEU [Mass/volume] in Platelet p... | 0.5 | ug/mL | numeric | 2020-02-19 | 8 | week_1 |
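The row-by-row assignment can also be written with boolean masks, which tends to be faster on large tables. The sketch below is one possible alternative, run on a small made-up `demo` frame (an assumption for illustration), and it checks that the result matches `assign_timepoint`.

```python
import pandas as pd

def assign_timepoint(days):
    if days == 0:
        return 'baseline'
    elif 6 <= days <= 8:
        return 'week_1'
    elif 9 <= days:
        return 'later'
    return None

# Hypothetical day offsets, including values that fall outside every window
demo = pd.DataFrame({'days': [-3, 0, 1, 5, 6, 7, 8, 9, 30]})

# Start with no label, then fill each window using a boolean mask
demo['TIMEPOINT'] = None
demo.loc[demo['days'] == 0, 'TIMEPOINT'] = 'baseline'
demo.loc[demo['days'].between(6, 8), 'TIMEPOINT'] = 'week_1'
demo.loc[demo['days'] >= 9, 'TIMEPOINT'] = 'later'

expected = [assign_timepoint(d) for d in demo['days']]
print(demo)
print("matches row-wise function:", demo['TIMEPOINT'].tolist() == expected)
```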
Use the function stats.f_oneway(*groups) for the one-way ANOVA test.
# Drop missing values to avoid errors
anova_data = anova_lab_obs[['TIMEPOINT', 'VALUE']].dropna()
# Group by TIMEPOINT and extract VALUE lists
groups = anova_data.groupby('TIMEPOINT')['VALUE'].apply(list)
# Perform one-way ANOVA test
anova_result = stats.f_oneway(*groups)
# Show result
anova_df = pd.DataFrame({
    'F-statistic': [anova_result.statistic],
    'P-value': [anova_result.pvalue]
})
print(anova_df)
F-statistic P-value
0 9808.251462 0.0
There is strong statistical evidence (F ≈ 9808, p-value reported as 0.0) that mean D-dimer levels differ between at least two of the TIMEPOINT groups. ANOVA alone does not say which groups differ.
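A significant F-test is usually followed by a post hoc comparison to find out which groups differ. The sketch below uses `scipy.stats.tukey_hsd` (available in SciPy 1.8 and newer) on synthetic values. The three simulated groups are assumptions for illustration, and Tukey's HSD itself assumes roughly equal variances across groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic lab values for three hypothetical timepoints
baseline = rng.normal(loc=0.5, scale=0.2, size=100)
week_1 = rng.normal(loc=0.6, scale=0.2, size=100)
later = rng.normal(loc=1.5, scale=0.4, size=100)

f_res = stats.f_oneway(baseline, week_1, later)
print(f"ANOVA: F = {f_res.statistic:.2f}, p = {f_res.pvalue:.3g}")

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate
tukey = stats.tukey_hsd(baseline, week_1, later)
names = ['baseline', 'week_1', 'later']
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{names[i]} vs {names[j]}: p = {tukey.pvalue[i, j]:.3g}")
```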
Chi-square test
The Chi-Square test is a statistical method used to examine the association between two categorical variables. In healthcare, it’s often applied to determine whether there is a significant relationship between variables like treatment outcomes, patient demographics, and disease status.
For example, the Chi-Square test can help answer questions like:
- Is there a relationship between smoking status and the presence of lung disease?
- Are ICU admission rates different across age groups?
- Does the recovery rate differ by gender?
The test compares the observed frequencies in a contingency table with the expected frequencies that would occur if there were no association. A significant result suggests that the variables are not independent.
This makes the Chi-Square test a valuable tool in epidemiology, public health studies, and clinical research, where categorical data is common.
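To make the comparison of observed and expected frequencies concrete, the sketch below computes the expected counts and the test statistic by hand for a small contingency table and checks them against `chi2_contingency`. The counts are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table: rows = outcome (recovered, not recovered),
# columns = age group (young, middle, older). Counts are made up.
observed = np.array([[90, 70, 40],
                     [10, 30, 60]])

chi2, p, dof, expected = chi2_contingency(observed)

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print("expected counts:\n", expected_manual.round(2))
print(f"chi2 = {chi2:.3f} (manual {chi2_manual:.3f}), dof = {dof}, p = {p:.3g}")
```

For 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default, so the manual statistic would match only with `correction=False`.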
Use the function stats.chi2_contingency(observed) for Chi-Square test.
anova_lab_obs['DECEASED'] = anova_lab_obs['PATIENT'].isin(deceased_ids).astype(int)
# Create a contingency table for DECEASED and TIMEPOINT
contingency_table = pd.crosstab(anova_lab_obs['DECEASED'], anova_lab_obs['TIMEPOINT'])
# Perform Chi-Square test
chi2_stat, p_value, _, _ = stats.chi2_contingency(contingency_table)
chi2_results = pd.DataFrame({
    'Chi2 Statistic': [chi2_stat],
    'P-value': [p_value]
})
chi2_results
| | Chi2 Statistic | P-value |
---|---|---|
0 | 3034.652458 | 0.0 |
The Chi-Square test results indicate a significant association between TIMEPOINT and DECEASED status, with a p-value reported as 0.0 (p < 0.0001). Because each row of the table is a lab observation rather than a patient, this means that the share of observations coming from deceased patients varies significantly across the timepoints relative to COVID-19 diagnosis.
Correlation and Regression Analysis
Correlation and regression analysis are essential statistical methods used to explore relationships between variables in healthcare data. They help identify patterns, predict outcomes, and inform clinical decisions.
Correlation measures the strength and direction of a linear relationship between two continuous variables. In healthcare, it can reveal associations such as:
- The relationship between age and blood pressure.
- The correlation between medication dosage and patient recovery time.
from scipy.stats import pearsonr
# Drop rows with missing values in VALUE or days
df = anova_lab_obs[['VALUE', 'days']].dropna()
# Calculate Pearson correlation
correlation, p_value = pearsonr(df['VALUE'], df['days'])
print(f"Pearson correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4g}")
Pearson correlation: 0.4308
P-value: 0
Regression analysis extends correlation by modeling the relationship between a dependent variable and one or more independent variables. It allows for predictions and understanding of how changes in predictors affect the outcome. In healthcare, regression can be used to:
- Predict patient outcomes based on treatment variables.
- Assess the impact of lifestyle factors on disease progression.
import statsmodels.api as sm
# Drop missing values
df = anova_lab_obs[['VALUE', 'days']].dropna()
# Define independent (X) and dependent (y) variables
X = df['days']
y = df['VALUE']
# Add constant term for intercept
X = sm.add_constant(X)
# Fit linear regression model
model = sm.OLS(y, X).fit()
# Print model summary
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: VALUE R-squared: 0.186
Model: OLS Adj. R-squared: 0.186
Method: Least Squares F-statistic: 2.007e+04
Date: Mon, 09 Jun 2025 Prob (F-statistic): 0.00
Time: 09:13:29 Log-Likelihood: -2.8151e+05
No. Observations: 88050 AIC: 5.630e+05
Df Residuals: 88048 BIC: 5.631e+05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.2899 0.038 -33.782 0.000 -1.365 -1.215
days 0.6214 0.004 141.660 0.000 0.613 0.630
==============================================================================
Omnibus: 32819.351 Durbin-Watson: 0.631
Prob(Omnibus): 0.000 Jarque-Bera (JB): 139982.951
Skew: 1.815 Prob(JB): 0.00
Kurtosis: 7.998 Cond. No. 16.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
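In a simple linear regression with a single predictor, R-squared equals the square of the Pearson correlation. That is why the reported correlation of 0.4308 and the R-squared of 0.186 agree (0.4308² ≈ 0.186). The sketch below checks this identity on synthetic data. The simulated `days` and `value` arrays are assumptions for illustration, and `scipy.stats.linregress` stands in for the statsmodels fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic data: hypothetical days since diagnosis and a noisy linear response
days = rng.integers(0, 30, size=500).astype(float)
value = -1.0 + 0.6 * days + rng.normal(scale=8.0, size=500)

r, _ = stats.pearsonr(days, value)
fit = stats.linregress(days, value)

# R-squared computed from the residuals of the fitted line
predicted = fit.intercept + fit.slope * days
ss_res = np.sum((value - predicted) ** 2)
ss_tot = np.sum((value - value.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"Pearson r = {r:.4f}, r^2 = {r**2:.4f}, R-squared = {r_squared:.4f}")
```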
Plotting the regression line on the scatter plot of D-dimer vs. days since COVID-19 diagnosis.
import matplotlib.pyplot as plt
import seaborn as sns
sns.regplot(x='days', y='VALUE', data=df, ci=None, line_kws={"color": "red"})
plt.title("Linear Regression: D-dimer vs days since COVID-19 diagnosis")
plt.xlabel("Days since COVID-19 diagnosis")
plt.ylabel("Fibrin D-dimer FEU (ug/mL)")
plt.show()
Non-parametric tests
Non-parametric tests are statistical methods that do not assume a specific distribution for the data. They are particularly useful in healthcare when dealing with small sample sizes, ordinal data, or when the assumptions of parametric tests (like normality) are violated. These tests are robust and can be applied to a wide range of data types, making them valuable in clinical research, epidemiology, and public health studies.
Mann-Whitney U Test
# Mann-Whitney U test for D-dimer levels between deceased and non-deceased patients
# Note: this filter overwrites anova_lab_obs, so every later cell only sees observations after 2020-04-01
anova_lab_obs['DATE'] = pd.to_datetime(anova_lab_obs['DATE'], errors='coerce')
anova_lab_obs = anova_lab_obs[anova_lab_obs['DATE'] > pd.to_datetime('2020-04-01')]
mann_whitney_results = stats.mannwhitneyu(
    anova_lab_obs[anova_lab_obs['DECEASED'] == 1]['VALUE'],
    anova_lab_obs[anova_lab_obs['DECEASED'] == 0]['VALUE'],
    alternative='two-sided'
)
mann_whitney_df = pd.DataFrame({
    'Statistic': [mann_whitney_results.statistic],
    'P-value': [mann_whitney_results.pvalue]
})
mann_whitney_df
| | Statistic | P-value |
---|---|---|
0 | 15664.0 | 3.449366e-41 |
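The U statistic can be read as a count of pairs. Assuming SciPy 1.7 or newer, where the returned statistic is U for the first sample, dividing it by the number of possible pairs estimates the probability that a value from the first group exceeds a value from the second. The sketch below illustrates this on synthetic, skewed data; the two groups are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Synthetic right-skewed values for two hypothetical groups
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=80)
group_b = rng.lognormal(mean=0.5, sigma=1.0, size=80)

mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# U / (n_a * n_b) estimates P(a > b) when there are no ties
prob_a_greater = mw.statistic / (len(group_a) * len(group_b))

# Same quantity by brute force over all (a, b) pairs
manual = np.mean(group_a[:, None] > group_b[None, :])

print(f"U = {mw.statistic:.1f}, p = {mw.pvalue:.3g}")
print(f"P(a > b) from U = {prob_a_greater:.3f}, by counting pairs = {manual:.3f}")
```

A probability near 0.5 means the two groups overlap heavily, even if the p-value is small.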
Kruskal-Wallis Test
The Kruskal-Wallis test is a non-parametric method used to compare three or more independent groups. It is an extension of the Mann-Whitney U test and is particularly useful when the assumptions of ANOVA are not met, such as when the data is not normally distributed or when dealing with ordinal data.
# Count the observations in each TIMEPOINT group
anova_lab_obs['TIMEPOINT'].value_counts()
TIMEPOINT
later 247
week_1 19
Name: count, dtype: int64
# Kruskal-Wallis test for D-dimer levels across different timepoints
kruskal_results = stats.kruskal(
    anova_lab_obs[anova_lab_obs['TIMEPOINT'] == 'baseline']['VALUE'],
    anova_lab_obs[anova_lab_obs['TIMEPOINT'] == 'week_1']['VALUE'],
    anova_lab_obs[anova_lab_obs['TIMEPOINT'] == 'later']['VALUE']
)
kruskal_df = pd.DataFrame({
    'Statistic': [kruskal_results.statistic],
    'P-value': [kruskal_results.pvalue]
})
kruskal_df
| | Statistic | P-value |
---|---|---|
0 | NaN | NaN |
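The test returns NaN because the date filter applied in the Mann-Whitney section removed every baseline observation: the counts above show only the later (247) and week_1 (19) groups, so one of the three samples passed to the test is empty. No conclusion about differences across timepoints can be drawn from this output. To obtain a valid statistic, run the test on the data before the date filter, or compare only the groups that still contain observations.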
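One way to avoid a NaN result from an empty sample is to build the groups from the data and drop any that have no values before calling the test. The helper `kruskal_nonempty` below is a hypothetical sketch, not a SciPy function, and the `demo` frame is made up.

```python
import pandas as pd
from scipy import stats

def kruskal_nonempty(df, group_col, value_col, min_size=1):
    """Run Kruskal-Wallis on groups with at least min_size values.

    Returns (result, kept_group_names); result is None when fewer
    than two groups remain. Hypothetical helper for illustration.
    """
    samples = {
        name: g[value_col].dropna().to_numpy()
        for name, g in df.groupby(group_col)
    }
    kept = {k: v for k, v in samples.items() if len(v) >= min_size}
    if len(kept) < 2:
        return None, list(kept)
    return stats.kruskal(*kept.values()), list(kept)

# Made-up data with no 'baseline' rows, mirroring the situation above
demo = pd.DataFrame({
    'TIMEPOINT': ['week_1'] * 5 + ['later'] * 6,
    'VALUE': [0.4, 0.5, 0.6, 0.5, 0.7, 1.2, 1.5, 0.9, 2.0, 1.8, 1.1],
})

result, kept = kruskal_nonempty(demo, 'TIMEPOINT', 'VALUE')
print("groups compared:", kept)
if result is not None:
    print(f"H = {result.statistic:.3f}, p = {result.pvalue:.3g}")
```

Because the groups come from `groupby`, a timepoint with no rows never reaches the test, and the list of compared groups is reported alongside the result.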
Wilcoxon Signed-Rank Test
The Wilcoxon Signed-Rank test is a non-parametric statistical method used to compare two related samples or matched observations. It is particularly useful when the data does not meet the assumptions of normality required for paired t-tests. In healthcare, it can be applied to assess changes in patient outcomes before and after treatment, or to compare measurements taken at two different time points on the same subjects.
# Filter data
baseline = anova_lab_obs[anova_lab_obs['TIMEPOINT'] == 'baseline'][['PATIENT', 'VALUE']]
week_1 = anova_lab_obs[anova_lab_obs['TIMEPOINT'] == 'week_1'][['PATIENT', 'VALUE']]
# Rename columns
baseline = baseline.rename(columns={'VALUE': 'baseline_value'})
week_1 = week_1.rename(columns={'VALUE': 'week_1_value'})
# Merge on PATIENT to ensure pairing
paired = pd.merge(baseline, week_1, on='PATIENT')
# Drop NaNs
paired = paired.dropna(subset=['baseline_value', 'week_1_value'])
# Run Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon
wilcoxon_results = wilcoxon(paired['baseline_value'], paired['week_1_value'])
# Create results DataFrame
wilcoxon_df = pd.DataFrame({
    'Statistic': [wilcoxon_results.statistic],
    'P-value': [wilcoxon_results.pvalue]
})
print(wilcoxon_df)
Statistic P-value
0 NaN NaN
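The Wilcoxon Signed-Rank test returns NaN here. No baseline observations remain after the date filter applied in the Mann-Whitney section, so the merge on PATIENT produces no baseline/week 1 pairs and the test has nothing to compare. This output therefore provides no evidence for or against a change in levels between baseline and week 1. The test needs to be run on the data before the date filter to be informative.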
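A paired test needs exactly one value per patient at each timepoint, and at least one complete pair. The sketch below shows one possible way to build such pairs and to check that they exist before running the test. The `demo` data are made up, and averaging repeated week 1 measurements is an assumption, not the approach used above.

```python
import pandas as pd
from scipy.stats import wilcoxon

# Hypothetical long-format lab data; patient C has no week_1 value
# and patient A has two week_1 measurements
demo = pd.DataFrame({
    'PATIENT':   ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
    'TIMEPOINT': ['baseline', 'week_1', 'week_1', 'baseline', 'week_1', 'baseline',
                  'baseline', 'week_1', 'baseline', 'week_1', 'baseline', 'week_1'],
    'VALUE':     [0.4, 0.6, 0.8, 0.3, 0.9, 0.5, 0.2, 0.35, 0.6, 0.55, 0.45, 1.0],
})

# One row per patient: average repeated measurements, then keep complete pairs
wide = (demo.groupby(['PATIENT', 'TIMEPOINT'])['VALUE'].mean()
            .unstack('TIMEPOINT')
            .dropna(subset=['baseline', 'week_1']))

if len(wide) == 0:
    print("No complete baseline/week_1 pairs; the test cannot be run.")
else:
    res = wilcoxon(wide['baseline'], wide['week_1'])
    print(f"pairs = {len(wide)}, W = {res.statistic:.1f}, p = {res.pvalue:.3g}")
```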
Conclusion
In this chapter, we explored various inferential statistical methods used in healthcare data analysis. We covered basic descriptive statistics, confidence intervals, t-tests, ANOVA, Chi-Square tests, correlation and regression analysis, and non-parametric tests like the Mann-Whitney U test, Kruskal-Wallis test, and Wilcoxon Signed-Rank test. These methods are essential for drawing meaningful conclusions from healthcare data, allowing researchers and clinicians to make informed decisions based on statistical evidence. By applying these techniques, we can better understand patient outcomes, treatment effectiveness, and disease patterns, ultimately improving healthcare delivery and patient care.