The response variable selected for this analysis is Age.adjusted.Death.Rate. This variable represents the number of deaths per 100,000 individuals, standardized to account for differences in age distribution across populations. Unlike raw death counts, which are heavily influenced by population size, age-adjusted rates provide a more meaningful measure of mortality risk. Since public health decisions are typically based on risk rather than total volume alone, this variable best represents the underlying impact of disease across states and over time.
The categorical explanatory variable selected for the ANOVA is Cause.Name. This variable categorizes deaths into major public health causes such as heart disease, cancer, stroke, influenza and pneumonia, and others. These categories represent distinct disease processes and risk profiles, and it is reasonable to expect that age-adjusted death rates differ across them because of differences in prevalence, preventability, and treatment effectiveness. The aggregate category “All causes” was excluded from this analysis because it represents total mortality across all categories rather than a distinct cause of death. After excluding this aggregate category, ten cause categories remain, which is a manageable number of groups to interpret in a one-way ANOVA framework.
For the regression analysis, the continuous explanatory variable selected is Year. Mortality rates often change over time due to medical advancements, public health interventions, demographic shifts, and policy changes. Modeling the age-adjusted death rate as a function of year makes it possible to assess whether there is a meaningful temporal trend. Since the dataset spans 2000 to 2017, it is reasonable to expect that any long-term trend in mortality rates is approximately linear over this period.
# Remove the aggregate "All causes" rows before running the ANOVA
anova_data <- data |>
  filter(Cause.Name != "All causes")
Do mean age-adjusted death rates differ across causes of death?
The mean age-adjusted death rate is equal across all ten causes of death.
\[ H_0 : \mu_1 = \mu_2 = \cdots = \mu_{10} \]
where each \(\mu_i\) represents the mean age-adjusted death rate for a specific cause.
\[ H_A : \exists \ i,j \text{ such that } \mu_i \ne \mu_j \]
ANOVA tests whether at least one group mean differs, but it does not identify which specific means differ.
# Running the anova model
anova_model <- aov(Age.adjusted.Death.Rate ~ Cause.Name, data = anova_data)
# Summary results of anova
summary(anova_model)
##               Df   Sum Sq Mean Sq F value Pr(>F)
## Cause.Name     9 38604959 4289440   15924 <2e-16 ***
## Residuals   9350  2518664     269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA results show a statistically significant effect of cause of death on the age-adjusted death rate, \(F(9, 9350) = 15924\), \(p < 2 \times 10^{-16}\).
Since the p-value is far below the 0.05 significance level, I reject the null hypothesis that all mean age-adjusted death rates are equal across causes. This provides strong statistical evidence that at least one cause of death has a different mean age-adjusted death rate.
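ANOVA alone does not say which causes differ from which. One common follow-up, which is not part of the original analysis, is Tukey's HSD applied to the fitted aov object. The sketch below is a minimal, self-contained illustration: the data frame toy, its three cause labels, and every rate in it are made up for the example and are not taken from the dataset.

```r
# Toy data only: three hypothetical causes with invented rates.
set.seed(1)
toy <- data.frame(
  Cause.Name = factor(rep(c("Cause A", "Cause B", "Cause C"), each = 30)),
  Age.adjusted.Death.Rate = c(
    rnorm(30, mean = 170, sd = 15),
    rnorm(30, mean = 160, sd = 15),
    rnorm(30, mean = 20, sd = 5)
  )
)

# Same model form as the report's one-way ANOVA
toy_model <- aov(Age.adjusted.Death.Rate ~ Cause.Name, data = toy)

# Pairwise differences in group means with family-wise 95% intervals
tukey <- TukeyHSD(toy_model, "Cause.Name", conf.level = 0.95)
print(tukey)
```

With the model fitted above, the analogous call would be TukeyHSD(anova_model, "Cause.Name"), which returns one row per pair of causes with the estimated difference, its interval, and an adjusted p-value.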
anova_data |>
  ggplot(aes(x = Cause.Name, y = Age.adjusted.Death.Rate)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Age-Adjusted Death Rate by Cause of Death",
    x = "Cause of Death",
    y = "Age-Adjusted Death Rate (per 100,000)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
The boxplot visually reinforces the ANOVA results by illustrating substantial differences in age-adjusted death rates across causes of death. Heart disease and cancer exhibit noticeably higher median death rates compared to other causes, while kidney disease, influenza and pneumonia, and suicide show considerably lower median rates. The separation between group medians is large relative to the within-group variability, which aligns with the extremely large F-statistic observed in the ANOVA. This visualization supports the conclusion that age-adjusted death rates are not equal across causes of death.
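As a rough measure of how large the between-cause separation is, the sums of squares printed in the ANOVA table can be converted into an eta-squared value. This statistic is not reported in the original output; it is derived here only from the table's Sum Sq column.

\[ \eta^2 = \frac{SS_{\text{Cause}}}{SS_{\text{Cause}} + SS_{\text{Residuals}}} = \frac{38604959}{38604959 + 2518664} = \frac{38604959}{41123623} \approx 0.939 \]

By this measure, roughly 94% of the variability in age-adjusted death rate in the ANOVA sample is associated with cause of death.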
data |>
  ggplot(aes(x = Year, y = Age.adjusted.Death.Rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Age-Adjusted Death Rate Over Time",
    x = "Year",
    y = "Age-Adjusted Death Rate (per 100,000)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Fit a simple linear regression of death rate on year
lm_model <- lm(Age.adjusted.Death.Rate ~ Year, data = data)
# Summary of results
summary(lm_model)
##
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -136.08 -107.08  -90.12   29.85  921.32
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3246.7742   847.2705   3.832 0.000128 ***
## Year          -1.5534     0.4218  -3.683 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 222.1 on 10294 degrees of freedom
## Multiple R-squared: 0.001316, Adjusted R-squared: 0.001219
## F-statistic: 13.56 on 1 and 10294 DF, p-value: 0.0002321
The linear regression results indicate a statistically significant relationship between year and age-adjusted death rate, \(F(1, 10294) = 13.56\), \(p = 0.000232\). The slope coefficient of −1.5534 suggests that, on average, the age-adjusted death rate decreases by approximately 1.55 deaths per 100,000 people each year.
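To put the slope on the scale of the whole study period, it can be multiplied by the 17 one-year steps between 2000 and 2017. This is a back-of-the-envelope figure that takes the fitted line at face value:

\[ \Delta = -1.5534 \times (2017 - 2000) = -1.5534 \times 17 \approx -26.4 \text{ deaths per 100,000} \]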
However, the model explains only a very small proportion of the variability in death rates, with \(R^2 = 0.001316\), or about 0.13%. So although the downward trend over time is statistically significant, year alone is not a strong predictor of age-adjusted death rate. This is consistent with the ANOVA result: the regression is fit to the full dataset (data, not anova_data), which pools every cause of death and still includes the “All causes” rows, so most of the spread in rates reflects differences between causes rather than change over time.
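A very small \(R^2\) alongside a small p-value follows from the sample size. For a regression with a single predictor, the overall F-statistic can be recovered from \(R^2\) and the residual degrees of freedom; the check below uses only the values printed in the summary output and agrees with the reported statistic up to rounding:

\[ F = \frac{R^2}{1 - R^2} \times df_{\text{residual}} = \frac{0.001316}{1 - 0.001316} \times 10294 \approx 13.56 \]

With more than 10,000 observations, even a predictor that accounts for about 0.13% of the variance yields a significant F-statistic.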
Therefore, policy decisions or public health planning should consider additional factors beyond time, such as cause of death, geographic differences, or demographic characteristics.
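One way to act on this, sketched under assumptions rather than taken from the original analysis, is to add cause of death to the regression so that the year slope is estimated within causes. In the example below, the data frame toy, its cause labels, baseline rates, slope, and noise level are all invented for illustration and are not drawn from the dataset.

```r
# Toy data only: invented baselines and an invented common slope of -1.5 per year.
set.seed(42)
toy <- expand.grid(
  Year = 2000:2017,
  Cause.Name = c("Cause A", "Cause B", "Cause C"),
  rep_id = 1:5
)
baseline <- c("Cause A" = 180, "Cause B" = 160, "Cause C" = 15)
toy$Age.adjusted.Death.Rate <-
  unname(baseline[as.character(toy$Cause.Name)]) -
  1.5 * (toy$Year - 2000) +
  rnorm(nrow(toy), sd = 5)

# Year alone: between-cause differences end up in the residuals
year_only <- lm(Age.adjusted.Death.Rate ~ Year, data = toy)

# Year plus cause: each cause gets its own intercept, with a shared year slope
year_cause <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)

# Compare explained variance, then test whether adding cause improves the fit
r2_year_only <- summary(year_only)$r.squared
r2_year_cause <- summary(year_cause)$r.squared
print(c(year_only = r2_year_only, year_cause = r2_year_cause))
print(anova(year_only, year_cause))
```

With the report's data, the analogous call would presumably be lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = anova_data). Its results are not shown here because that model was not fitted in the source.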