Hypothesis Testing with NHANES Dataset

2025-11-15

Introduction

Hypothesis Testing with NHANES

NHANES (the National Health and Nutrition Examination Survey) is a real-world health survey run by the National Center for Health Statistics (NCHS), part of the CDC, and designed to assess the health and nutritional status of adults and children in the United States.
The data include cholesterol, BMI, blood pressure, glucose, diabetes status, smoking and other health-related measurements.
To demonstrate hypothesis testing, we'll run a two-sample t-test.

What is Hypothesis Testing?

Hypothesis testing is a statistical method for testing assumptions or theories about a population parameter (e.g., the mean or proportion).

It starts with a Null Hypothesis (\(H_0\)), which represents the “status quo” or a statement of no effect.
We also form an Alternative Hypothesis (\(H_a\)), which is the claim we want to test against the null.
The goal is to use sample data to decide whether we have enough evidence to reject the null hypothesis.

The Core Hypotheses

Null Hypothesis (\(H_0\)): This is the default assumption of no difference or no effect. We treat it as true unless the data provide strong evidence against it.
Alternative Hypothesis (\(H_a\)): This is the claim we are testing for. It’s what we will “accept” if we find enough evidence to reject the null hypothesis.

Understanding the P-Value

What is a P-Value?

The p-value is the probability of observing data as extreme as, or more extreme than, what was actually observed, while assuming the null hypothesis to be true.
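To make "as extreme as, or more extreme than" concrete, here is a minimal sketch that uses a permutation approach rather than the t-test used later in this deck. The data are made-up toy numbers, and the names `group_a`, `group_b` and `perm_diffs` are illustrative assumptions, not part of the NHANES analysis.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: two small groups of made-up measurements
group_a <- c(118, 121, 115, 124, 119, 122)
group_b <- c(112, 117, 114, 116, 111, 115)

observed_diff <- mean(group_a) - mean(group_b)

# Under the null hypothesis the group labels are interchangeable, so
# shuffle the pooled values many times and record the difference in
# means that each shuffle produces.
pooled <- c(group_a, group_b)
n_a <- length(group_a)
perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:n_a]) - mean(shuffled[-(1:n_a)])
})

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(perm_diffs) >= abs(observed_diff))
p_value
```

The smaller this share, the harder it is to explain the observed difference by chance alone.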

We compare our p-value to a pre-determined significance level (\(\alpha\)), which is usually 0.05.

If \(\text{p-value} < \alpha\), the result is statistically significant and hence we reject the null hypothesis.
If \(\text{p-value} \geq \alpha\), the result is not statistically significant and hence we fail to reject the null hypothesis.
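The decision rule above can be sketched as a small function. `decide()` is a hypothetical helper written for illustration; it is not part of base R or of the original analysis.

```r
# Hypothetical helper that applies the decision rule to a p-value
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "reject the null hypothesis"
  } else {
    "fail to reject the null hypothesis"
  }
}

decide(0.003)  # "reject the null hypothesis"
decide(0.20)   # "fail to reject the null hypothesis"
decide(0.05)   # "fail to reject the null hypothesis" (p equals alpha)
```

Note that a p-value exactly equal to the significance level falls on the "fail to reject" side of the rule.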

Case Study with NHANES

Question: Is there a difference in blood pressure between genders?

Our research question: Is there a statistically significant difference in the mean of the average systolic blood pressure reading (BPSysAve) between males and females (Gender)?
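Stated formally for this question, and writing \(\mu_{\text{female}}\) and \(\mu_{\text{male}}\) for the population mean of BPSysAve in each group (symbols introduced here for illustration, not taken from the original slides), the two-sided hypotheses would be:

\[
H_0: \mu_{\text{female}} = \mu_{\text{male}}
\qquad \text{versus} \qquad
H_a: \mu_{\text{female}} \neq \mu_{\text{male}}
\]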

We will first use a boxplot and a distribution plot to help us see the median, spread (IQR), outliers and data distribution for both groups.
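The code that follows uses a data frame called `nhanes_clean`, whose construction is not shown. One plausible preparation, sketched here as an assumption, is to keep the two columns of interest and drop rows with missing values. `clean_bp()` and the toy data frame are hypothetical; only the column names `Gender` and `BPSysAve` come from the analysis itself.

```r
# Hypothetical helper: keep only the two columns used in this case study
# and drop rows where either one is missing.
clean_bp <- function(df) {
  out <- df[, c("Gender", "BPSysAve")]
  out[complete.cases(out), , drop = FALSE]
}

# Toy stand-in for the survey data (made-up values)
toy <- data.frame(
  Gender   = factor(c("female", "male", "female", "male")),
  BPSysAve = c(112, NA, 118, 125),
  BMI      = c(22.1, 27.4, NA, 30.2)
)

toy_clean <- clean_bp(toy)
toy_clean  # the row with a missing BPSysAve is dropped

# Assumption: with the NHANES package installed, the same helper could be
# applied to its data frame, e.g. nhanes_clean <- clean_bp(NHANES::NHANES)
```

A missing BMI does not remove a row here, because only the two selected columns are checked.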

Box Plot (Code)

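# Load ggplot2 for plotting; nhanes_clean is assumed to be prepared earlier
library(ggplot2)

# Boxplot of average systolic BP, one box per gender
box_plot <- ggplot(nhanes_clean, aes(x = Gender,
                                     y = BPSysAve,
                                     fill = Gender)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Systolic Blood Pressure by Gender",
       y = "Average Systolic BP (mmHg)",
       x = "Gender") +
  scale_fill_manual(values = c("female" = "#00A3E0",
                               "male" = "#8C1D40"))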

Box Plot (Plot)

The plot shows that males have a slightly higher median BP, while the outliers look similar for both groups.

Distribution Plot (Code)

# Overlaid density curves of average systolic BP, one per gender
# (assumes ggplot2 is already loaded)
dist_plot <- ggplot(nhanes_clean, aes(x = BPSysAve,
                                      fill = Gender,
                                      color = Gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Distribution of Systolic BP by Gender",
       x = "Average Systolic BP (mmHg)",
       y = "Density") +
  scale_fill_manual(values = c("female" = "#00A3E0",
                               "male" = "#8C1D40")) +
  scale_color_manual(values = c("female" = "#00A3E0",
                                "male" = "#8C1D40"))

Distribution Plot (Plot)

We can see that the distributions are very similar, but the peak for males sits at a slightly higher BP than the peak for females. Hypothesis testing will help us decide whether this difference is significant or just due to random chance.

3D Plot

The 3D plot shows that males and females have broadly similar distributions, with systolic blood pressure positively correlated with Age and weakly positively correlated with BMI.

T-test comparing BPSysAve by Gender

t_test_result <- t.test(BPSysAve ~ Gender, data = nhanes_clean)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  BPSysAve by Gender
## t = -10.028, df = 8508.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -4.444378 -2.990898
## sample estimates:
## mean in group female   mean in group male 
##             116.3064             120.0240
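To show what `t.test()` computes when it reports a Welch two-sample test, here is a minimal sketch that rebuilds the statistic, the degrees of freedom and the p-value by hand. The data are simulated stand-ins, not NHANES values, and the names `female`, `male`, `v_f` and `v_m` are illustrative assumptions.

```r
set.seed(1)

# Hypothetical simulated data standing in for the two groups
female <- rnorm(200, mean = 116, sd = 17)
male   <- rnorm(180, mean = 120, sd = 16)

# Per-group variance of the sample mean (no equal-variance assumption)
v_f <- var(female) / length(female)
v_m <- var(male) / length(male)

# Welch t-statistic: difference in means over its standard error
t_stat <- (mean(female) - mean(male)) / sqrt(v_f + v_m)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_f + v_m)^2 /
  (v_f^2 / (length(female) - 1) + v_m^2 / (length(male) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Compare against R's built-in Welch test (var.equal = FALSE by default)
check <- t.test(female, male)
all.equal(unname(check$statistic), t_stat)
all.equal(unname(check$parameter), df_welch)
all.equal(check$p.value, p_value)
```

The non-integer degrees of freedom in the output above (8508.8) come from this same approximation.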

Interpreting Our T-Test Results

p-value: <0.001
t-statistic: -10.03
Mean (Female): 116.3 mmHg
Mean (Male): 120.0 mmHg

Our p-value is much smaller than the significance level (\(\alpha = 0.05\)), indicating a statistically significant difference in average systolic blood pressure between males and females. The t-statistic is negative only because R subtracts in the order female minus male.
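As a quick check, the reported p-value can be recovered from the t-statistic and degrees of freedom printed in the output. This is a sketch using the rounded values shown there; the variable names are illustrative.

```r
# Values copied from the printed t-test output
t_reported  <- -10.028
df_reported <- 8508.8

# Two-sided p-value: probability in both tails beyond |t|
p_two_sided <- 2 * pt(-abs(t_reported), df = df_reported)

p_two_sided < 2.2e-16       # TRUE
p_two_sided < 0.05          # TRUE

# R prints "< 2.2e-16" because that is roughly machine epsilon
.Machine$double.eps
```

The exact value is far below the printed threshold, which is why R reports only an upper bound.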

Conclusion

Decision: Since our p-value (< 0.001) is less than \(\alpha\) (0.05), we reject the null hypothesis (\(H_0\)).
Interpretation: There is a statistically significant difference in mean systolic blood pressure between males and females.
Context: On average, males in this sample had a systolic blood pressure approximately 3.7 mmHg higher than females (120.0 − 116.3 = 3.7 mmHg).
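The size of the difference and its uncertainty can be read straight from the t-test output. The sketch below copies the printed means and confidence limits and flips the sign of the interval so that it describes male minus female; the variable names are illustrative.

```r
# Values copied from the printed t-test output
mean_female <- 116.3064
mean_male   <- 120.0240
ci_female_minus_male <- c(-4.444378, -2.990898)

# Difference expressed as male minus female
diff_male_minus_female <- mean_male - mean_female

# Negate and reverse the interval to match the male-minus-female direction
ci_male_minus_female <- rev(-ci_female_minus_male)

round(diff_male_minus_female, 1)  # 3.7
round(ci_male_minus_female, 2)    # 2.99 4.44
```

Because the 95 percent interval excludes zero, it leads to the same decision as the p-value.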