# Load the tidyverse (provides read_csv, the dplyr verbs, and ggplot2)
library(tidyverse)

# Read the CSV into `data`
data <- read_csv(
  "NCHS_final_2000_2017_with_population_enriched.csv",
  show_col_types = FALSE
)

Hypothesis 1: Population Size and Suicide Rates

Research Question

Do smaller‑population states have higher age‑adjusted suicide death rates than larger‑population states?

Hypothesis

Null Hypothesis \(H_0\)

The mean age‑adjusted suicide death rate is the same in small and large population states.

\[ H_0: \mu_{Small} = \mu_{Large} \]

Alternative Hypothesis \(H_A\)

Small population states have a higher mean age‑adjusted suicide rate.

\[ H_A: \mu_{Small} > \mu_{Large} \]

Because the alternative hypothesis is directional, I use a one-sided (upper-tailed) test.

Data Prep

# Filter to suicide only
suicide_data <- data |>
  filter(Cause.Name == "Suicide")
# Create median population split
median_pop <- suicide_data$Total_Population |>
  median(na.rm = TRUE)

suicide_data <- suicide_data |>
  mutate(
    pop_group = ifelse(
      Total_Population <= median_pop,
      "Small",
      "Large"
    )
  )
# Check group sizes
suicide_data |>
  count(pop_group)
## # A tibble: 2 × 2
##   pop_group     n
##   <chr>     <int>
## 1 Large       468
## 2 Small       468
# Descriptive statistics
suicide_data |>
  group_by(pop_group) |>
  summarise(
    mean_rate = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
    sd_rate = sd(Age.adjusted.Death.Rate, na.rm = TRUE),
    n = n()
  )
## # A tibble: 2 × 4
##   pop_group mean_rate sd_rate     n
##   <chr>         <dbl>   <dbl> <int>
## 1 Large          12.0    2.83   468
## 2 Small          15.0    4.59   468
# Order the factor levels so t.test() computes Small minus Large,
# matching the direction of the one-sided alternative
suicide_data <- suicide_data |>
  mutate(pop_group = factor(pop_group, levels = c("Small", "Large")))
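
The level order matters because `t.test()` with a formula takes the mean of the first factor level minus the mean of the second. The following is a minimal sketch with made-up numbers (the `toy` data frame and its values are hypothetical, not NCHS data) showing how the sign of the statistic flips with the level order:

```r
# Hypothetical toy data for illustration only (not the NCHS values)
toy <- data.frame(
  rate = c(15, 16, 14, 17, 11, 12, 13, 12),
  grp  = rep(c("Small", "Large"), each = 4)
)

# Default factor levels are alphabetical, so "Large" is the first level
toy$grp_default <- factor(toy$grp)

# Explicit levels put "Small" first, so the difference is Small minus Large
toy$grp_ordered <- factor(toy$grp, levels = c("Small", "Large"))

fit_default <- t.test(rate ~ grp_default, data = toy)
fit_ordered <- t.test(rate ~ grp_ordered, data = toy)

# The two statistics have equal magnitude and opposite sign
unname(fit_default$statistic)
unname(fit_ordered$statistic)
```

With `alternative = "greater"`, only the second ordering tests whether the Small mean exceeds the Large mean.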

Neyman–Pearson Framework

I conducted this hypothesis test using the Neyman–Pearson framework.

I set the significance level at:

\[ \alpha = 0.05 \]

This means I am willing to accept a 5% probability of committing a Type I error (rejecting the null hypothesis when it is actually true).

I set the desired power at:

\[ 1 - \beta = 0.80 \]

This means I want at least an 80% probability of correctly detecting a true difference if one exists.

I defined a practically meaningful minimum effect size as:

\[ \delta = 1 \text{ death per 100,000 population} \]

I treat a difference of one death per 100,000 as the smallest gap that would be substantively meaningful from a public health perspective.

Sample Size and Data Sufficiency

Using the available dataset, there are 468 observations in each population group (Small and Large states), for a total of 936 observations.

A power calculation was conducted to determine whether this sample size is sufficient to detect the specified minimum effect size of 1 death per 100,000 at the \(\alpha = 0.05\) significance level.

The resulting statistical power was:

\[ \text{Power} = 0.982 \]

Since this exceeds the desired power level of 0.80, the available sample is more than sufficient to detect a difference of 1 death per 100,000, so I proceed with formal hypothesis testing.

# Power Calculation
power.t.test(
  n = min(table(suicide_data$pop_group)),
  delta = 1,
  sd = sd(suicide_data$Age.adjusted.Death.Rate, na.rm = TRUE),
  sig.level = 0.05,
  type = "two.sample",
  alternative = "one.sided"
)
## 
##      Two-sample t test power calculation 
## 
##               n = 468
##           delta = 1
##              sd = 4.07731
##       sig.level = 0.05
##           power = 0.9823186
##     alternative = one.sided
## 
## NOTE: n is number in *each* group
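
The same function can be turned around to ask how many observations per group the 80% power target would require. This is a sketch that reuses the settings above and plugs in the overall standard deviation reported in the output (4.07731); `required` and `n_required` are names introduced here for illustration. The answer should come out well below the 468 observations available in each group.

```r
# Leave n unspecified and supply power instead, so power.t.test() solves for n.
# sd = 4.07731 is the overall SD reported in the power output above.
required <- power.t.test(
  delta = 1,
  sd = 4.07731,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "one.sided"
)

# Round up to whole observations per group
n_required <- ceiling(required$n)
n_required
```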

Performing Hypothesis Testing

# Performing one-sided test
t_test_result <- t.test(
  Age.adjusted.Death.Rate ~ pop_group,
  data = suicide_data,
  alternative = "greater"
)

t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  Age.adjusted.Death.Rate by pop_group
## t = 11.664, df = 777.86, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Small and group Large is greater than 0
## 95 percent confidence interval:
##  2.495885      Inf
## sample estimates:
## mean in group Small mean in group Large 
##            14.95085            12.04466
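
As a sanity check, the Welch statistic, its degrees of freedom, and the confidence bound can be approximately reproduced from the descriptive statistics alone. The sketch below uses the group means from the test output and the standard deviations from the summary table; because those SDs are rounded to two decimals, the results should agree with the output above only to about two decimal places. All variable names are introduced here for illustration.

```r
# Summary statistics copied from the output above (SDs are rounded)
mean_small <- 14.95085
mean_large <- 12.04466
sd_small <- 4.59
sd_large <- 2.83
n_small <- 468
n_large <- 468

# Variance of each sample mean
v_small <- sd_small^2 / n_small
v_large <- sd_large^2 / n_large

# Welch t statistic and Welch-Satterthwaite degrees of freedom
se_diff  <- sqrt(v_small + v_large)
t_stat   <- (mean_small - mean_large) / se_diff
df_welch <- (v_small + v_large)^2 /
  (v_small^2 / (n_small - 1) + v_large^2 / (n_large - 1))

# Lower bound of the one-sided 95% confidence interval
ci_lower <- (mean_small - mean_large) - qt(0.95, df_welch) * se_diff

round(c(t = t_stat, df = df_welch, ci_lower = ci_lower), 3)
```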

Interpretation of Results

At the \(\alpha = 0.05\) significance level, I reject the null hypothesis.

The Welch two-sample t-test produced a test statistic of \(t = 11.664\) with \(df = 777.86\), and a p-value less than \(2.2 \times 10^{-16}\). Since the p-value is far below the chosen significance level of 0.05, there is strong statistical evidence that small population states have higher mean age-adjusted suicide death rates than large population states.

The estimated difference in means is approximately 2.91 deaths per 100,000 population (14.95 for small states versus 12.04 for large states). The one-sided 95% confidence interval puts the true difference at no less than about 2.50 deaths per 100,000, which exceeds the minimum meaningful effect of 1 death per 100,000 specified earlier.

The power of 0.982 refers to a true difference of 1 death per 100,000: if the true difference were that size, the test would miss it only about 1.8% of the time (the Type II error rate). Because the null hypothesis was rejected, a Type II error is not at issue for this result. The evidence therefore strongly supports the conclusion that smaller population size is associated with higher suicide death rates.

Visualization of Hypothesis 1 Results

suicide_data |>
  ggplot(aes(x = pop_group, y = Age.adjusted.Death.Rate, fill = pop_group)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  labs(
    title = "Age-Adjusted Suicide Death Rates by Population Group",
    x = "Population Group",
    y = "Age-Adjusted Death Rate (per 100,000)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation of Visualization

The boxplot visually reinforces the statistical findings of the hypothesis test. Small population states exhibit consistently higher age-adjusted suicide death rates compared to large population states. The median and mean for small states are clearly higher, and the overall distribution of values for small states is shifted upward relative to large states.

The visualization also shows that small states tend to have greater variability and more extreme high-end values, suggesting that not only are rates higher on average, but they may also be more dispersed.

This graphical evidence supports the statistical conclusion that population size is associated with meaningful differences in suicide death rates.

Further investigation could explore potential underlying factors contributing to this difference, such as rurality, access to mental health services, socioeconomic conditions, or firearm prevalence.


Hypothesis 2: Influenza and Pneumonia Death Rates Before and After 2010

Research Question

Did influenza and pneumonia death rates per 100,000 population change after 2009?

Null Hypothesis

\[ H_0: \mu_{Pre2010} = \mu_{Post2010} \]

Alternative Hypothesis

\[ H_A: \mu_{Pre2010} \neq \mu_{Post2010} \]

A two‑sided test is performed since I am assessing whether rates changed in either direction.

Data Prep

# Filter to Influenza & Pneumonia
flu_data <- data |>
  filter(Cause.Name == "Influenza and pneumonia")

# Create Pre-2010 vs Post-2010 grouping variable
flu_data <- flu_data |>
  mutate(
    period = ifelse(Year <= 2009, "Pre-2010", "Post-2010")
  )
# Verify year ranges by group
flu_data |>
  group_by(period) |>
  summarise(
    min_year = min(Year),
    max_year = max(Year),
    n = n()
  )
## # A tibble: 2 × 4
##   period    min_year max_year     n
##   <chr>        <dbl>    <dbl> <int>
## 1 Post-2010     2010     2017   416
## 2 Pre-2010      2000     2009   520
# Descriptive Statistics
flu_data |>
  group_by(period) |>
  summarise(
    mean_rate = mean(Deaths_per_100k, na.rm = TRUE),
    sd_rate = sd(Deaths_per_100k, na.rm = TRUE),
    n = n()
  )
## # A tibble: 2 × 4
##   period    mean_rate sd_rate     n
##   <chr>         <dbl>   <dbl> <int>
## 1 Post-2010      17.5    4.74   416
## 2 Pre-2010       20.5    5.08   520

Fisher Significance Testing Framework

This hypothesis test is conducted using the Fisher significance testing framework.

Unlike the Neyman–Pearson framework used in Hypothesis 1, this approach focuses on evaluating the strength of evidence against the null hypothesis by examining the p-value.

I set the significance level at:

\[ \alpha = 0.05 \]

This means that if the probability of observing a test statistic at least as extreme as the one calculated (assuming the null hypothesis is true) is less than 0.05, I will reject the null hypothesis.

No minimum effect size or power calculation is pre-specified under this framework, as the emphasis is on assessing the evidence provided by the data.
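
For reference, the two-sided p-value described above can be written as follows, assuming the usual t reference distribution. The symbols \(T_{\nu}\) (a t-distributed random variable with \(\nu\) degrees of freedom) and \(t_{obs}\) (the observed statistic) are notation introduced here, not taken from the original analysis:

\[ p = P\left(|T_{\nu}| \ge |t_{obs}| \mid H_0\right) = 2\,P\left(T_{\nu} \ge |t_{obs}|\right) \]

The factor of 2 reflects that deviations in either direction count as evidence against \(H_0\).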

Hypothesis Testing

# Perform two-sided Welch t-test under Fisher framework
t_test_flu <- t.test(
  Deaths_per_100k ~ period,
  data = flu_data,
  alternative = "two.sided"
)

t_test_flu
## 
##  Welch Two Sample t-test
## 
## data:  Deaths_per_100k by period
## t = -9.3394, df = 912.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Post-2010 and group Pre-2010 is not equal to 0
## 95 percent confidence interval:
##  -3.636774 -2.373736
## sample estimates:
## mean in group Post-2010  mean in group Pre-2010 
##                17.50161                20.50687
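
The reported p-value and confidence interval can be recovered from the printed statistic, degrees of freedom, and group means. This is a sketch that copies those four numbers from the output above; the variable names are introduced here for illustration, and the standard error is backed out as the mean difference divided by t.

```r
# Values copied from the Welch t-test output above
t_stat    <- -9.3394
df_welch  <- 912.15
mean_post <- 17.50161
mean_pre  <- 20.50687

# Two-sided p-value: probability in both tails beyond |t|
p_two_sided <- 2 * pt(-abs(t_stat), df_welch)

# Standard error implied by the statistic, then the 95% interval
diff_means <- mean_post - mean_pre
se_diff    <- diff_means / t_stat
ci <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_diff

p_two_sided
round(ci, 3)
```

R prints `p-value < 2.2e-16` whenever the computed value falls below machine precision, which is why the output does not show the exact figure.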

Interpretation of Results

At the \(\alpha = 0.05\) significance level, I reject the null hypothesis.

The Welch two-sample t-test produced a test statistic of \(t = -9.34\) with \(df = 912.15\), and a p-value less than \(2.2 \times 10^{-16}\). Since the p-value is far below the chosen significance level of 0.05, there is extremely strong statistical evidence that influenza and pneumonia death rates differed between the Pre‑2010 (2000–2009) and Post‑2010 (2010–2017) periods.

The estimated mean death rate for 2000–2009 was approximately 20.51 deaths per 100,000 population, compared with 17.50 deaths per 100,000 for 2010–2017, a difference of about 3.01. The 95% confidence interval for the difference in means (Post‑2010 minus Pre‑2010) ranges from −3.64 to −2.37; because the interval lies entirely below zero, death rates in the later period were significantly lower. The negative test statistic reflects the same ordering: the Post‑2010 mean is lower than the Pre‑2010 mean.

Under the Fisher framework, this very small p-value provides strong evidence against the null hypothesis of equal means, suggesting that influenza and pneumonia mortality rates changed following 2009.

Visualization of Hypothesis 2 Results

# Building Visualization
flu_data |>
  ggplot(aes(x = period, y = Deaths_per_100k, fill = period)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  labs(
    title = "Influenza & Pneumonia Death Rates Before and After 2010",
    x = "Time Period",
    y = "Deaths per 100,000 Population"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation of Visualization

The boxplot visually reinforces the results of the hypothesis test. The Pre‑2010 period exhibits higher influenza and pneumonia death rates compared to the Post‑2010 period. Both the median and mean are clearly higher in the Pre‑2010 group, and the overall distribution appears shifted downward after 2010.

This graphical evidence supports the statistical conclusion that influenza and pneumonia mortality rates changed following 2009, with rates declining from 2010 onward. The visual separation of the groups is consistent with the very small p-value and with the estimated decline of roughly 3 deaths per 100,000 population.

Further investigation could explore whether this decline reflects improved vaccination rates, changes in public health interventions, demographic shifts, or reporting differences following the 2009 pandemic. Additional analysis examining year‑by‑year trends rather than aggregated periods may also provide deeper insight into the timing and persistence of the observed decline.
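
One way to start the year‑by‑year analysis is to average the rate within each year and fit a simple linear trend. The sketch below uses a small made-up data frame (`toy_flu`, with hypothetical values) in place of `flu_data`; only the column names `Year` and `Deaths_per_100k` are taken from the analysis above.

```r
# Hypothetical toy records standing in for flu_data (values are made up)
toy_flu <- data.frame(
  Year = rep(2008:2011, each = 2),
  Deaths_per_100k = c(21, 20, 20, 19, 18, 17, 17, 16)
)

# Mean death rate per year
yearly_means <- aggregate(Deaths_per_100k ~ Year, data = toy_flu, FUN = mean)

# Linear trend across years; the slope is the average change per year
trend_fit <- lm(Deaths_per_100k ~ Year, data = yearly_means)
yearly_slope <- coef(trend_fit)[["Year"]]

yearly_means
yearly_slope
```

Applied to the real data, a plot of the yearly means would show whether the decline is gradual or concentrated around 2009–2010, which the two-period comparison cannot distinguish.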