# Load the tidyverse (readr, dplyr, ggplot2) and read the CSV into a data frame
library(tidyverse)

data <- read_csv(
  "NCHS_final_2000_2017_with_population_enriched.csv",
  show_col_types = FALSE
)
Do smaller‑population states have higher age‑adjusted suicide death rates than larger‑population states?
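Null hypothesis: the mean age‑adjusted suicide death rate is the same in small‑population and large‑population states.
\[ H_0: \mu_{Small} = \mu_{Large} \]
Alternative hypothesis: small‑population states have a higher mean age‑adjusted suicide death rate than large‑population states.
\[ H_A: \mu_{Small} > \mu_{Large} \]
Because the alternative specifies a direction, this is a one-sided (right-tailed) test.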
# Filter to suicide only
suicide_data <- data |>
  filter(Cause.Name == "Suicide")
# Create median population split
median_pop <- suicide_data$Total_Population |>
  median(na.rm = TRUE)
suicide_data <- suicide_data |>
  mutate(
    pop_group = ifelse(
      Total_Population <= median_pop,
      "Small",
      "Large"
    )
  )
# Check group sizes
suicide_data |>
  count(pop_group)
## # A tibble: 2 × 2
##   pop_group     n
##   <chr>     <int>
## 1 Large       468
## 2 Small       468
# Descriptive statistics
suicide_data |>
  group_by(pop_group) |>
  summarise(
    mean_rate = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
    sd_rate = sd(Age.adjusted.Death.Rate, na.rm = TRUE),
    n = n()
  )
## # A tibble: 2 × 4
##   pop_group mean_rate sd_rate     n
##   <chr>         <dbl>   <dbl> <int>
## 1 Large          12.0    2.83   468
## 2 Small          15.0    4.59   468
# Put "Small" first so the t-test computes the difference as Small minus Large,
# matching the direction of the alternative hypothesis
suicide_data <- suicide_data |>
  mutate(pop_group = factor(pop_group, levels = c("Small", "Large")))
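The level order matters because, with the formula interface, `t.test()` takes the mean of the first factor level minus the mean of the second, and `alternative = "greater"` is read relative to that order. The following is a minimal sketch on synthetic data; the `toy` data frame and its values are made up for illustration and are not taken from the NCHS file.

```r
# Synthetic illustration only: 'toy' is hypothetical data, not the NCHS data.
set.seed(1)
toy <- data.frame(
  rate = c(rnorm(20, mean = 15), rnorm(20, mean = 12)),
  grp  = rep(c("Small", "Large"), each = 20)
)

# Character groups are converted to a factor with alphabetical levels,
# so the difference tested here is Large - Small.
default_order <- t.test(rate ~ grp, data = toy, alternative = "greater")

# Putting "Small" first makes the difference Small - Large.
toy$grp <- factor(toy$grp, levels = c("Small", "Large"))
releveled <- t.test(rate ~ grp, data = toy, alternative = "greater")

# Same magnitude, opposite sign; only the releveled test matches H_A.
c(default   = unname(default_order$statistic),
  releveled = unname(releveled$statistic))
```

Without the releveling step, the one-sided test would be asking whether large states have the higher rate, which is the opposite of the stated alternative.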
I conducted this hypothesis test using the Neyman–Pearson framework.
I set the significance level at:
\[ \alpha = 0.05 \]
This means I am willing to accept a 5% probability of committing a Type I error (rejecting the null hypothesis when it is actually true).
I set the desired power at:
\[ 1 - \beta = 0.80 \]
This means I want at least an 80% probability of rejecting the null hypothesis when the true difference is at least as large as the minimum effect size defined below.
I defined a practically meaningful minimum effect size as:
\[ \delta = 1 \text{ death per 100,000 population} \]
I consider a difference of one death per 100,000 to be substantively meaningful from a public health perspective.
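To make the two error rates concrete, the sketch below simulates the one-sided Welch test many times, once with no true difference and once with a true difference of \(\delta = 1\). It assumes normally distributed rates with a common standard deviation of 4.07731 (the overall value used in the power calculation in this section); the seed, the number of replications, and the helper name `reject_rate` are choices made for illustration.

```r
# Simulation sketch; assumes normal data with a common SD (an idealization).
set.seed(2024)
n     <- 468      # observations per group, as in the analysis
sigma <- 4.07731  # overall SD used in the power calculation
alpha <- 0.05

# Hypothetical helper: share of simulated one-sided Welch tests that reject H0
# when the true mean difference (Small - Large) equals 'delta'.
reject_rate <- function(delta, reps = 2000) {
  p_values <- replicate(reps, {
    small <- rnorm(n, mean = delta, sd = sigma)
    large <- rnorm(n, mean = 0,     sd = sigma)
    t.test(small, large, alternative = "greater")$p.value
  })
  mean(p_values < alpha)
}

type1_rate <- reject_rate(delta = 0)  # should land near alpha
power_sim  <- reject_rate(delta = 1)  # should land near the analytic power
c(type1_rate = type1_rate, power_sim = power_sim)
```

The first rate estimates how often the test rejects a true null hypothesis, and the second estimates how often it detects a difference of exactly the minimum effect size.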
Using the available dataset, there are 468 observations in each population group (Small and Large states), for a total of 936 observations.
A power calculation for a one-sided, two-sample t-test, using the overall standard deviation of the suicide rates (4.08), was conducted to determine whether this sample size is sufficient to detect the specified minimum effect size of 1 death per 100,000 at the \(\alpha = 0.05\) significance level.
The resulting statistical power was:
\[ \text{Power} = 0.982 \]
Since this exceeds the desired power of 0.80, the available sample size is more than sufficient to detect a difference of this size, and I proceed with the formal hypothesis test.
# Power calculation
power.t.test(
  n = min(table(suicide_data$pop_group)),                        # per-group sample size
  delta = 1,                                                     # minimum effect size
  sd = sd(suicide_data$Age.adjusted.Death.Rate, na.rm = TRUE),   # overall SD across both groups
  sig.level = 0.05,
  type = "two.sample",
  alternative = "one.sided"
)
##
##      Two-sample t test power calculation
##
##               n = 468
##           delta = 1
##              sd = 4.07731
##       sig.level = 0.05
##           power = 0.9823186
##     alternative = one.sided
##
## NOTE: n is number in *each* group
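As a rough cross-check of the output above, the power of a one-sided two-sample test can be approximated with the normal distribution as \(\Phi\left(\delta / (\sigma\sqrt{2/n}) - z_{1-\alpha}\right)\). The sketch below plugs in the reported values and also asks `power.t.test()` for the per-group sample size needed to reach 80% power; the variable names are illustrative.

```r
# Inputs copied from the power calculation output
n     <- 468
delta <- 1
sigma <- 4.07731
alpha <- 0.05

# Normal approximation: standard error of the difference in means,
# then the probability that the test statistic clears the critical value
se_diff      <- sigma * sqrt(2 / n)
power_approx <- pnorm(delta / se_diff - qnorm(1 - alpha))

# Solve for n instead of power by leaving n unspecified
needed <- power.t.test(
  delta = delta, sd = sigma, sig.level = alpha, power = 0.80,
  type = "two.sample", alternative = "one.sided"
)

round(power_approx, 3)  # close to the 0.982 reported above
ceiling(needed$n)       # per-group n required for 80% power
```

Under these inputs the required per-group sample size comes out at a little over 200, which is why 468 observations per group give power well above the 0.80 target.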
# One-sided Welch t-test (t.test() does not assume equal variances by default);
# with "Small" as the first factor level, the difference tested is Small minus Large
t_test_result <- t.test(
  Age.adjusted.Death.Rate ~ pop_group,
  data = suicide_data,
  alternative = "greater"
)
t_test_result
##
##  Welch Two Sample t-test
##
## data:  Age.adjusted.Death.Rate by pop_group
## t = 11.664, df = 777.86, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Small and group Large is greater than 0
## 95 percent confidence interval:
##  2.495885      Inf
## sample estimates:
## mean in group Small mean in group Large
##            14.95085            12.04466
At the \(\alpha = 0.05\) significance level, I reject the null hypothesis.
The Welch two-sample t-test produced a test statistic of \(t = 11.664\) with \(df = 777.86\), and a p-value less than \(2.2 \times 10^{-16}\). Since the p-value is far below the chosen significance level of 0.05, there is strong statistical evidence that small population states have higher mean age-adjusted suicide death rates than large population states.
The estimated difference in means is approximately 2.91 deaths per 100,000 population (14.95 for small states versus 12.04 for large states). The lower bound of the one-sided 95% confidence interval indicates that the true difference is at least about 2.50 deaths per 100,000.
The power of 0.982 computed above refers to a true difference of 1 death per 100,000: at that effect size, the probability of a Type II error would be about 1.8%. Because the null hypothesis was rejected, a Type II error is not possible for this decision; the relevant risk is a Type I error, which is controlled at 5%. The evidence therefore strongly supports the conclusion that population size is associated with differences in suicide death rates.
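The reported statistic, degrees of freedom, and confidence bound can be approximately reproduced from the group summaries alone. The sketch below applies the Welch formulas, \(t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}\) with Welch–Satterthwaite degrees of freedom, to the means from the test output and the rounded standard deviations from the descriptive table. Because the standard deviations are rounded, the results should match the `t.test()` output only to about two decimal places. The function name `welch_from_summary` is hypothetical.

```r
# Hypothetical helper: one-sided (greater) Welch test from summary statistics
welch_from_summary <- function(m1, s1, n1, m2, s2, n2, conf = 0.95) {
  v1 <- s1^2 / n1                      # variance of the first group mean
  v2 <- s2^2 / n2                      # variance of the second group mean
  se <- sqrt(v1 + v2)                  # standard error of the difference
  t_stat <- (m1 - m2) / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  list(
    diff    = m1 - m2,
    t       = t_stat,
    df      = df,
    p_value = pt(t_stat, df, lower.tail = FALSE),
    lower   = (m1 - m2) - qt(conf, df) * se  # one-sided lower confidence bound
  )
}

# Small vs. Large, using values printed earlier in this section
res <- welch_from_summary(m1 = 14.95085, s1 = 4.59, n1 = 468,
                          m2 = 12.04466, s2 = 2.83, n2 = 468)
round(unlist(res[c("diff", "t", "df", "lower")]), 2)
```

Working from summaries like this is only a sanity check; the `t.test()` call on the full data remains the result of record.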
# Building visualization: boxplots by population group, with group means as white diamonds
suicide_data |>
  ggplot(aes(x = pop_group, y = Age.adjusted.Death.Rate, fill = pop_group)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  labs(
    title = "Age-Adjusted Suicide Death Rates by Population Group",
    x = "Population Group",
    y = "Age-Adjusted Death Rate (per 100,000)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
The boxplot visually reinforces the statistical findings of the hypothesis test. Small population states exhibit consistently higher age-adjusted suicide death rates compared to large population states. The median and mean for small states are clearly higher, and the overall distribution of values for small states is shifted upward relative to large states.
The visualization also shows that small states have greater variability (standard deviation 4.59 versus 2.83) and more extreme high-end values, so rates in small states are both higher on average and more dispersed.
This graphical evidence supports the statistical conclusion that population size is associated with meaningful differences in suicide death rates.
Further investigation could explore potential underlying factors contributing to this difference, such as rurality, access to mental health services, socioeconomic conditions, or firearm prevalence.
Did influenza and pneumonia death rates per 100,000 population change after 2009?
\[ H_0: \mu_{Pre2010} = \mu_{Post2010} \]
\[ H_A: \mu_{Pre2010} \neq \mu_{Post2010} \]
A two‑sided test is performed since I am assessing whether rates changed in either direction.
# Filter to Influenza & Pneumonia
flu_data <- data |>
  filter(Cause.Name == "Influenza and pneumonia")
# Create Pre-2010 vs Post-2010 grouping variable
flu_data <- flu_data |>
  mutate(
    period = ifelse(Year <= 2009, "Pre-2010", "Post-2010")
  )
# Verify year ranges by group
flu_data |>
  group_by(period) |>
  summarise(
    min_year = min(Year),
    max_year = max(Year),
    n = n()
  )
## # A tibble: 2 × 4
##   period    min_year max_year     n
##   <chr>        <dbl>    <dbl> <int>
## 1 Post-2010     2010     2017   416
## 2 Pre-2010      2000     2009   520
# Descriptive Statistics
flu_data |>
  group_by(period) |>
  summarise(
    mean_rate = mean(Deaths_per_100k, na.rm = TRUE),
    sd_rate = sd(Deaths_per_100k, na.rm = TRUE),
    n = n()
  )
## # A tibble: 2 × 4
##   period    mean_rate sd_rate     n
##   <chr>         <dbl>   <dbl> <int>
## 1 Post-2010      17.5    4.74   416
## 2 Pre-2010       20.5    5.08   520
This hypothesis test is conducted using the Fisher significance testing framework.
Unlike the Neyman–Pearson framework used in Hypothesis 1, this approach focuses on evaluating the strength of evidence against the null hypothesis by examining the p-value.
I set the significance level at:
\[ \alpha = 0.05 \]
This means that if the probability of observing a test statistic at least as extreme as the one calculated (assuming the null hypothesis is true) is less than 0.05, I will reject the null hypothesis.
No minimum effect size or power calculation is pre-specified under this framework, as the emphasis is on assessing the evidence provided by the data.
# Perform two-sided Welch t-test under Fisher framework
t_test_flu <- t.test(
  Deaths_per_100k ~ period,
  data = flu_data,
  alternative = "two.sided"
)
t_test_flu
##
##  Welch Two Sample t-test
##
## data:  Deaths_per_100k by period
## t = -9.3394, df = 912.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Post-2010 and group Pre-2010 is not equal to 0
## 95 percent confidence interval:
##  -3.636774 -2.373736
## sample estimates:
## mean in group Post-2010  mean in group Pre-2010
##                17.50161                20.50687
At the \(\alpha = 0.05\) significance level, I reject the null hypothesis.
The Welch two-sample t-test produced a test statistic of \(t = -9.34\) with \(df = 912.15\), and a p-value less than \(2.2 \times 10^{-16}\). Since the p-value is far below the chosen significance level of 0.05, there is extremely strong statistical evidence that influenza and pneumonia death rates differed between the Pre‑2010 (2000–2009) and Post‑2010 (2010–2017) periods.
The estimated mean death rate in 2000–2009 was approximately 20.51 deaths per 100,000 population, compared to 17.50 deaths per 100,000 population in 2010–2017, a difference of about −3.01. The 95% confidence interval for the difference in means (Post‑2010 minus Pre‑2010) ranges from −3.64 to −2.37; because the interval is entirely below zero, death rates in the later period were significantly lower than in the preceding period. The negative test statistic likewise reflects that the Post‑2010 mean is lower than the Pre‑2010 mean.
Under the Fisher framework, this very small p-value provides strong evidence against the null hypothesis of equal means, suggesting that influenza and pneumonia mortality rates changed following 2009.
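For reference, the p-value behind the `< 2.2e-16` display can be recomputed from the reported statistic and degrees of freedom. R prints `< 2.2e-16` when a p-value falls below the double-precision machine epsilon, so the printed bound is a display limit rather than the p-value itself. A minimal sketch using the rounded values from the output (the variable names are illustrative):

```r
# Values copied from the Welch test output
t_stat <- -9.3394
df     <- 912.15

# Two-sided p-value: probability of a statistic at least this extreme
# in either tail, assuming the null hypothesis is true
p_two_sided <- 2 * pt(-abs(t_stat), df)

p_two_sided                        # the actual (very small) p-value
p_two_sided < .Machine$double.eps  # TRUE, hence the "< 2.2e-16" display
```

Under the Fisher framework the magnitude of this value, not just the fact that it is below 0.05, is what conveys the strength of evidence against the null hypothesis.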
# Building Visualization
flu_data |>
  ggplot(aes(x = period, y = Deaths_per_100k, fill = period)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  labs(
    title = "Influenza & Pneumonia Death Rates Before and After 2010",
    x = "Time Period",
    y = "Deaths per 100,000 Population"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
The boxplot visually reinforces the results of the hypothesis test. The Pre‑2010 period exhibits higher influenza and pneumonia death rates compared to the Post‑2010 period. Both the median and mean are clearly higher in the Pre‑2010 group, and the overall distribution appears shifted downward after 2010.
This graphical evidence supports the statistical conclusion that influenza and pneumonia mortality rates changed following 2009, with rates declining from 2010 onward. The very small p-value establishes that the difference is statistically significant; the size of the difference, about 3 deaths per 100,000 (roughly 15% of the Pre‑2010 mean), is what makes it substantively important.
Further investigation could explore whether this decline reflects improved vaccination rates, changes in public health interventions, demographic shifts, or reporting differences following the 2009 pandemic. Additional analysis examining year‑by‑year trends rather than aggregated periods may also provide deeper insight into the timing and persistence of the observed decline.