Week 8 Data Dive

Variables for This Week’s Dive

Response Variable

The response variable selected for this analysis is Age.adjusted.Death.Rate. This variable represents the number of deaths per 100,000 individuals, standardized to account for differences in age distribution across populations. Unlike raw death counts, which are heavily influenced by population size, age-adjusted rates provide a more meaningful measure of mortality risk. Since public health decisions are typically based on risk rather than total volume alone, this variable best represents the underlying impact of disease across states and over time.
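To make the idea of age adjustment concrete, the sketch below applies direct standardization to two hypothetical populations. Every number here (the age groups, rates, population sizes, and standard-population weights) is invented for illustration and is not taken from the dataset; the rates in the data were computed by the data provider and may rely on a different standard population.

```r
# Hypothetical age-specific death rates (per 100,000), identical in both populations
rates <- c(young = 50, old = 2000)

# Hypothetical population sizes: B has a much larger share of older residents
pop_a <- c(young = 900000, old = 100000)
pop_b <- c(young = 600000, old = 400000)

# Hypothetical standard-population weights (must sum to 1)
std_weights <- c(young = 0.8, old = 0.2)

# Crude rate: total deaths over total population, expressed per 100,000
crude_rate <- function(rates, pop) sum(rates * pop) / sum(pop)

# Directly standardized rate: age-specific rates weighted by the standard population
adjusted_rate <- function(rates, weights) sum(rates * weights)

crude_a <- crude_rate(rates, pop_a)
crude_b <- crude_rate(rates, pop_b)
adjusted_a <- adjusted_rate(rates, std_weights)
adjusted_b <- adjusted_rate(rates, std_weights)

print(c(crude_a = crude_a, crude_b = crude_b))
print(c(adjusted_a = adjusted_a, adjusted_b = adjusted_b))
```

In this toy setup the crude rates differ (245 versus 830 per 100,000) purely because population B is older, while the adjusted rates are identical (440 per 100,000) because the underlying age-specific risk is the same.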

Explanatory Variable

The categorical explanatory variable selected for the ANOVA analysis is Cause.Name. This variable categorizes deaths into major public health causes such as heart disease, cancer, stroke, influenza and pneumonia, and others. These categories represent distinct disease processes and risk profiles, and it is reasonable to expect that age-adjusted death rates may differ across them due to differences in prevalence, preventability, and treatment effectiveness. The aggregate category “All causes” was excluded from this analysis since it represents total mortality across all categories rather than a distinct cause of death. After excluding this aggregate category, 10 meaningful cause categories remain, which is appropriate for interpretation in a one-way ANOVA framework.

Regression Variable

For the regression analysis, the continuous explanatory variable selected is Year. Mortality rates often change over time due to medical advancements, public health interventions, demographic shifts, and policy changes. Modeling the age-adjusted death rate as a function of year makes it possible to assess whether there is a meaningful temporal trend. Since the dataset spans only 2000 to 2017, it is reasonable to expect that any long-term trend in mortality rates may be approximately linear over this period.

ANOVA Setup and Process

# Drop the aggregate "All causes" rows so every remaining row belongs to a distinct cause
anova_data <- data |>
  filter(Cause.Name != "All causes")

Research Question

Do mean age-adjusted death rates differ across causes of death?

Null Hypothesis (ANOVA)

The mean age-adjusted death rate is equal across all ten causes of death.

\[ H_0 : \mu_1 = \mu_2 = \cdots = \mu_{10} \]

where each \(\mu_i\) represents the mean age-adjusted death rate for a specific cause.

Alternative Hypothesis

\[ H_A : \exists \ i,j \text{ such that } \mu_i \ne \mu_j \]

ANOVA tests whether at least one group mean differs, but it does not identify which specific means differ.

ANOVA Test

# Fit a one-way ANOVA of age-adjusted death rate on cause of death
anova_model <- aov(Age.adjusted.Death.Rate ~ Cause.Name, data = anova_data)
# Print the ANOVA table (degrees of freedom, sums of squares, F statistic, p-value)
summary(anova_model)
##               Df   Sum Sq Mean Sq F value Pr(>F)    
## Cause.Name     9 38604959 4289440   15924 <2e-16 ***
## Residuals   9350  2518664     269                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of Results

The one-way ANOVA results show a statistically significant effect of cause of death on the age-adjusted death rate:

\[ F(9, 9350) = 15924, \quad p < 2 \times 10^{-16} \]

Since the p-value is far below the 0.05 significance level, I reject the null hypothesis that all mean age-adjusted death rates are equal across causes. This provides strong statistical evidence that at least one cause of death has a different mean age-adjusted death rate.
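As a check on how the F statistic arises, the arithmetic can be reproduced directly from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below also computes eta-squared, a descriptive effect size that is not part of the original output and is added here only as an illustration; the variable names are hypothetical, and the inputs are the rounded values shown in the table.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 38604959   # Sum Sq for Cause.Name
ss_within  <- 2518664    # Sum Sq for Residuals
df_between <- 9
df_within  <- 9350

# F is the ratio of the between-group mean square to the within-group mean square
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_stat <- ms_between / ms_within

# Eta-squared: share of total variability associated with cause of death
eta_sq <- ss_between / (ss_between + ss_within)

print(round(f_stat))
print(round(eta_sq, 3))
```

The recomputed F rounds to the reported 15924, and eta-squared comes out at roughly 0.94, which suggests that cause of death accounts for most of the variability in the age-adjusted death rate in this subset of the data.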

Visualization

anova_data |>
  ggplot(aes(x = Cause.Name, y = Age.adjusted.Death.Rate)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Age-Adjusted Death Rate by Cause of Death",
    x = "Cause of Death",
    y = "Age-Adjusted Death Rate (per 100,000)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Visualization Results Summary

The boxplot visually reinforces the ANOVA results by illustrating substantial differences in age-adjusted death rates across causes of death. Heart disease and cancer exhibit noticeably higher median death rates compared to other causes, while kidney disease, influenza and pneumonia, and suicide show considerably lower median rates. The separation between group medians is large relative to the within-group variability, which aligns with the extremely large F-statistic observed in the ANOVA. This visualization supports the conclusion that age-adjusted death rates are not equal across causes of death.
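The boxplot suggests which causes differ, but neither it nor the ANOVA formally identifies the specific pairs of means that differ. One common follow-up is Tukey's HSD, available in base R as TukeyHSD(). The sketch below runs it on a small simulated data frame; toy, its three causes, and the simulated rates are all hypothetical and stand in for anova_data only to keep the example self-contained.

```r
set.seed(1)

# Hypothetical toy data: three causes, 30 simulated rates each (not from the dataset)
toy <- data.frame(
  Cause.Name = factor(rep(c("Heart disease", "Cancer", "Suicide"), each = 30)),
  Age.adjusted.Death.Rate = c(
    rnorm(30, mean = 200, sd = 20),
    rnorm(30, mean = 180, sd = 20),
    rnorm(30, mean = 12,  sd = 3)
  )
)

# Fit the same kind of one-way ANOVA as in the main analysis
toy_model <- aov(Age.adjusted.Death.Rate ~ Cause.Name, data = toy)

# Tukey's HSD: all pairwise differences in means with family-wise adjusted p-values
tukey <- TukeyHSD(toy_model)
print(tukey)
```

Applied to the real model, the analogous call would be TukeyHSD(anova_model); this assumes Cause.Name is stored as a factor, so a conversion with factor() may be needed first. With ten causes there are 45 pairwise comparisons to interpret.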

Continuous Variable

Variable Linearity

data |>
  ggplot(aes(x = Year, y = Age.adjusted.Death.Rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Age-Adjusted Death Rate Over Time",
    x = "Year",
    y = "Age-Adjusted Death Rate (per 100,000)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Linear Regression Model

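# Fit a simple linear regression of age-adjusted death rate on year
lm_model <- lm(Age.adjusted.Death.Rate ~ Year, data = data)
# Print coefficients, residual summary, R-squared, and the overall F test
summary(lm_model)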
## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -136.08 -107.08  -90.12   29.85  921.32 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3246.7742   847.2705   3.832 0.000128 ***
## Year          -1.5534     0.4218  -3.683 0.000232 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 222.1 on 10294 degrees of freedom
## Multiple R-squared:  0.001316,   Adjusted R-squared:  0.001219 
## F-statistic: 13.56 on 1 and 10294 DF,  p-value: 0.0002321

Linear Regression Summary Analysis

The linear regression results indicate a statistically significant relationship between year and age-adjusted death rate, \(F(1, 10294) = 13.56\), \(p = 0.000232\). The slope coefficient of −1.5534 suggests that, on average, the age-adjusted death rate decreases by approximately 1.55 deaths per 100,000 people each year.
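To see what the slope implies over the full study period, the fitted line can be evaluated at the first and last years. The sketch below uses the rounded coefficients printed in the regression summary, so the results are approximate; fitted_rate is a hypothetical helper name introduced only for this illustration.

```r
# Coefficients copied from the regression summary above (rounded as printed)
intercept <- 3246.7742
slope     <- -1.5534

# Fitted age-adjusted death rate (per 100,000) for a given year
fitted_rate <- function(year) intercept + slope * year

rate_2000 <- fitted_rate(2000)
rate_2017 <- fitted_rate(2017)

# Total fitted change over the 17 years between 2000 and 2017 (equals slope * 17)
total_change <- rate_2017 - rate_2000

print(round(c(rate_2000 = rate_2000, rate_2017 = rate_2017, total_change = total_change), 2))
```

Under these rounded coefficients the fitted rate falls from roughly 140 to roughly 114 per 100,000, a decline of about 26 deaths per 100,000 over the period.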

However, the model explains only a very small proportion of the variability in death rates, with \(R^2 = 0.001316\) (about 0.13%). So although there is statistically significant evidence of a slight downward trend over time, year alone is not a strong predictor of age-adjusted death rate. The model is also fit to the full dataset, pooling all causes of death together (including the aggregate “All causes” rows), so much of the unexplained variability likely reflects differences between causes rather than scatter around the time trend.

Therefore, policy decisions or public health planning should consider additional factors beyond time, such as cause of death, geographic differences, or demographic characteristics.
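One way to act on this recommendation would be to add cause of death to the regression alongside year. The sketch below illustrates the idea on simulated data only: toy is a hypothetical data frame, and the cause baselines, yearly trend, and noise level are invented, so the resulting numbers say nothing about the real dataset.

```r
set.seed(42)

# Hypothetical toy data: three causes observed yearly from 2000 to 2017
toy <- expand.grid(
  Year = 2000:2017,
  Cause.Name = c("Heart disease", "Cancer", "Suicide")
)

# Invented baseline rate per cause, a shared downward trend, and random noise
baseline <- c("Heart disease" = 200, "Cancer" = 180, "Suicide" = 12)
toy$Age.adjusted.Death.Rate <- unname(baseline[as.character(toy$Cause.Name)]) -
  1.5 * (toy$Year - 2000) +
  rnorm(nrow(toy), sd = 5)

# Model 1: year only, as in the analysis above
fit_year <- lm(Age.adjusted.Death.Rate ~ Year, data = toy)

# Model 2: year plus cause of death as a categorical predictor
fit_full <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)

r2_year <- summary(fit_year)$r.squared
r2_full <- summary(fit_full)$r.squared

print(c(r2_year = r2_year, r2_full = r2_full))
```

In a setup like this, the year-only model leaves most of the variability unexplained because the causes sit at very different levels, while adding Cause.Name lets the model separate the level of each cause from the shared time trend.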