Week 9 Data Dive

Expanding the Linear Regression Model

In Week 8, a simple linear regression model was used to examine whether age-adjusted death rates changed over time:

Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \varepsilon\)

While this model identified a statistically significant temporal trend, the model explained very little of the variability in death rates. This suggests that additional explanatory variables are needed to better understand differences in mortality risk.

For this week’s analysis, the model will be expanded by including Cause of Death as a categorical explanatory variable. Rather than using all possible causes, four major public health categories are selected:

Heart disease
Cancer
Stroke
Suicide

These causes were selected because they represent major and distinct contributors to mortality and are likely to have different baseline death rates.

The updated regression model is:

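Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \beta_2(\text{Heart disease}) + \beta_3(\text{Stroke}) + \beta_4(\text{Suicide}) + \varepsilon\)

Because Cause of Death is categorical with four levels, it enters the model as three 0/1 indicator variables, with Cancer serving as the reference category.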

Including Cause of Death allows the model to account for systematic differences in mortality levels across disease categories, while still evaluating the overall trend over time.
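To make the indicator coding concrete, here is a minimal sketch of how R expands a four-level factor into dummy columns. The factor `cause` is a toy object created only for this illustration and is not the course dataset.

```r
# Toy factor holding the four causes used in this analysis (illustrative only)
cause <- factor(c("Cancer", "Heart disease", "Stroke", "Suicide"))

# factor() sorts levels alphabetically, so "Cancer" becomes the reference level
levels(cause)

# model.matrix() shows the columns lm() builds internally:
# an intercept plus one 0/1 column for each non-reference level
X <- model.matrix(~ cause)
X
```

The Cancer row contains zeros in every indicator column, which is why each cause coefficient is read as a difference from Cancer.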

The aggregate category All causes is excluded to avoid redundancy, since it represents total mortality rather than a distinct disease process.

No additional derived rate variables are included. Because they overlap with the response variable, they would make the model partly circular rather than add independent explanatory information.

Filtering Data

# Keep only the four selected causes; this also drops "All causes"
library(dplyr)  # provides filter()

week9_data <- data |>
  filter(Cause.Name %in% c("Heart disease",
                           "Cancer",
                           "Stroke",
                           "Suicide"))

Explanation of Filtering

To expand the regression model while maintaining interpretability, the dataset is restricted to four major causes of death: heart disease, cancer, stroke, and suicide. These causes represent distinct disease categories with meaningful differences in mortality patterns.

The aggregate category All causes is excluded, since it represents total mortality rather than a specific cause of death. Including it would introduce conceptual redundancy into the model.

Restricting the dataset to four categories also ensures the regression model remains parsimonious and avoids unnecessary complexity from excessive dummy variables.

Baseline Model

# Week 8 baseline model for continuity
lm_week8 <- lm(Age.adjusted.Death.Rate ~ Year,
               data = week9_data)

summary(lm_week8)

## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year, data = week9_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123.13  -74.54  -11.09   72.41  214.27 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4667.8700   520.4727   8.969   <2e-16 ***
## Year          -2.2705     0.2591  -8.762   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.26 on 3742 degrees of freedom
## Multiple R-squared:  0.0201, Adjusted R-squared:  0.01984 
## F-statistic: 76.77 on 1 and 3742 DF,  p-value: < 2.2e-16

Explanation

To maintain continuity with Week 8, the original simple linear regression model is refit using the restricted dataset:

Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \varepsilon\)

This model evaluates whether age-adjusted death rates change over time, without accounting for differences across causes of death.

Refitting this model allows for a direct comparison between the simple regression from Week 8 and the expanded model developed in Week 9.

Expanded Model

# Week 9 expanded model
lm_week9 <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name,
               data = week9_data)

summary(lm_week9)

## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year + Cause.Name, data = week9_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.090 -11.519  -1.045  10.137 127.135 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4737.55422  134.87729   35.12   <2e-16 ***
## Year                      -2.27047    0.06715  -33.81   <2e-16 ***
## Cause.NameHeart disease   17.45011    0.98541   17.71   <2e-16 ***
## Cause.NameStroke        -132.36880    0.98541 -134.33   <2e-16 ***
## Cause.NameSuicide       -163.81816    0.98541 -166.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.32 on 3739 degrees of freedom
## Multiple R-squared:  0.9342, Adjusted R-squared:  0.9342 
## F-statistic: 1.328e+04 on 4 and 3739 DF,  p-value: < 2.2e-16

Explanation and Analysis

The expanded regression model includes both Year and Cause of Death as explanatory variables:

Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \beta_2(\text{Cause}) + \varepsilon\)

After including Cause.Name, the model fit improves substantially. The multiple \(R^2\) increases from 0.0201 in the Week 8 model to 0.9342, indicating that approximately 93.4% of the variability in age-adjusted death rates is explained by Year and Cause of Death combined. This represents a dramatic improvement in explanatory power.
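The jump in \(R^2\) can also be checked formally with a nested-model F-test. The sketch below uses simulated stand-in data (the data frame `toy`, the year range, and the values in `base_rate` are arbitrary assumptions, not the real dataset); with the objects from this report, the equivalent call would be `anova(lm_week8, lm_week9)`.

```r
set.seed(42)

# Simulated stand-in data: arbitrary years, four causes, 10 replicates each
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")
toy <- expand.grid(Year = 1999:2017, Cause.Name = causes, rep = 1:10)

# Hypothetical baseline levels, chosen only so the causes differ clearly
base_rate <- c(Cancer = 170, `Heart disease` = 190, Stroke = 40, Suicide = 12)
toy$Rate <- unname(base_rate[as.character(toy$Cause.Name)]) -
  2 * (toy$Year - 1999) + rnorm(nrow(toy), sd = 20)

fit_small <- lm(Rate ~ Year, data = toy)               # Year only
fit_big   <- lm(Rate ~ Year + Cause.Name, data = toy)  # Year + Cause

# F-test of whether the three cause indicators jointly improve the fit
cmp <- anova(fit_small, fit_big)
print(cmp)

# R-squared for each model, side by side
c(small = summary(fit_small)$r.squared, big = summary(fit_big)$r.squared)
```

The comparison spends three degrees of freedom, one per cause indicator, which matches the change from 3742 to 3739 residual degrees of freedom in the two summaries above.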

The overall model is highly statistically significant:

\(F(4, 3739) = 13{,}280,\ p < 2.2 \times 10^{-16}\)

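The coefficient for Year remains negative and statistically significant (β₁ = −2.27047, p < 2e−16), indicating that, with cause of death held fixed, the age-adjusted death rate declines by about 2.27 units per year on average. The estimate is the same as in the baseline model, but its standard error falls from 0.2591 to 0.06715 because Cause of Death absorbs most of the previously unexplained variation.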
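As a rough worked calculation from the coefficient table above, an approximate 95% interval for the Year slope and the implied change over a decade can be written out. The critical value 1.96 is a normal approximation, which should be close with 3,739 residual degrees of freedom.

\[
\hat{\beta}_1 \pm 1.96 \cdot SE(\hat{\beta}_1) = -2.27047 \pm 1.96(0.06715) \approx (-2.402,\ -2.139)
\]

\[
10 \times \hat{\beta}_1 = 10 \times (-2.27047) \approx -22.7
\]

The interval uses the classical standard error reported by `summary()`, so it is only as reliable as that standard error.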

The cause coefficients represent differences relative to the reference category (Cancer). Specifically:

Heart disease has a significantly higher age-adjusted death rate than Cancer.
Stroke and Suicide have significantly lower age-adjusted death rates than Cancer.

All cause indicators are highly statistically significant (p < 2e−16), suggesting substantial differences in baseline mortality levels across these disease categories.
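If a different comparison group is more useful, the reference category can be changed with `relevel()`. The sketch below uses a simulated placeholder data frame (`toy`, with a random response) purely to show the mechanics; it is not the course data.

```r
set.seed(7)

# Placeholder data: arbitrary years, four causes, random response
toy <- expand.grid(Year = 1999:2017,
                   Cause.Name = c("Cancer", "Heart disease", "Stroke", "Suicide"))
toy$Rate <- rnorm(nrow(toy), mean = 100, sd = 10)

# Default coding: Cancer (first level alphabetically) is the reference
fit_cancer <- lm(Rate ~ Year + Cause.Name, data = toy)

# Make Suicide the reference instead, then refit
toy$Cause.Name <- relevel(toy$Cause.Name, ref = "Suicide")
fit_suicide <- lm(Rate ~ Year + Cause.Name, data = toy)

# The cause coefficients are now differences from Suicide
names(coef(fit_suicide))

# The fitted values are unchanged; only the parameterization differs
all.equal(fitted(fit_cancer), fitted(fit_suicide))
```

Changing the reference level alters how the cause coefficients are read but not the model's fit or predictions.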

The residual standard error also decreases substantially (from 82.26 in the Week 8 model to 21.32), further indicating that including Cause of Death meaningfully reduces unexplained variability.

Overall, this expanded model suggests that differences across causes of death account for a large portion of the variability in mortality rates, while the downward time trend remains present.

Model Evaluation

Visualization Plot 1: Residuals vs Fitted

# Checks linearity and constant variance (homoscedasticity)
plot(lm_week9, which = 1)

Residuals vs Fitted Summary and Explanation

This plot evaluates the linearity and constant variance (homoscedasticity) assumptions.

The residuals are centered around zero and the red smoothing line is approximately flat, indicating no strong curvature. This supports the assumption that the linear form of the model is appropriate. I have a high level of confidence that the linearity assumption is reasonably satisfied.

However, the spread of residuals increases at higher fitted values, particularly among observations with larger predicted death rates. This suggests moderate heteroscedasticity, meaning the variance is not perfectly constant across fitted values.

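The severity of this issue appears moderate rather than severe, as the increase in spread is noticeable but not extreme. Heteroscedasticity does not bias the coefficient estimates, but it can make the reported standard errors and p-values inaccurate, and a large sample size does not correct this on its own. Because the test statistics in this model are very large, the main conclusions are unlikely to change, but the violation should be acknowledged as a limitation.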

Overall, linearity appears well satisfied, while constant variance is moderately violated.
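A visual impression of non-constant variance can be backed up with a numeric check. Below is a minimal base-R sketch of a Breusch-Pagan-style test in Koenker's studentized form, using the fitted values as the single variance predictor. The data are simulated with heteroscedasticity deliberately built in; `toy`, `x`, and `y` are illustrative names, and in this report the fitted model would be `lm_week9`.

```r
set.seed(3)

# Simulated data whose error spread grows with x (heteroscedastic on purpose)
toy <- data.frame(x = runif(500, 0, 10))
toy$y <- 5 + 2 * toy$x + rnorm(500, sd = 0.5 + 0.5 * toy$x)
fit <- lm(y ~ x, data = toy)

# Auxiliary regression: do the squared residuals depend on the fitted values?
aux <- lm(resid(fit)^2 ~ fitted(fit))

# Test statistic n * R^2, compared with a chi-square on 1 degree of freedom
lm_stat <- nrow(toy) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p.value = p_value)
```

A small p-value indicates that the residual variance changes with the fitted values. With thousands of observations, even mild heteroscedasticity tends to be flagged, so the plot remains the better guide to severity.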

Visualization Plot 2: Normal Q-Q

# Checks normality of residuals
plot(lm_week9, which = 2)

Normal Q-Q Plot Summary and Explanation

This plot evaluates the normality of the residuals. If the normality assumption is satisfied, the points should closely follow the diagonal reference line.

In this plot, the residuals follow the reference line closely through the center of the distribution, indicating that the majority of residuals are approximately normally distributed. However, there is noticeable deviation in the upper tail, where several points rise above the line. This suggests some right-tail skewness or the presence of larger-than-expected positive residuals.

The deviation is primarily confined to the extreme tail rather than the bulk of the data. Therefore, the violation appears mild to moderate rather than severe.

Given the large sample size (n = 3,744), minor departures from normality are unlikely to substantially affect inference, because the Central Limit Theorem makes the sampling distributions of the coefficient estimates approximately normal even when the residuals are not. I have moderate to high confidence that the normality assumption is sufficiently satisfied for reliable regression inference.

Overall, the normality assumption appears reasonably met, with some mild tail deviation.

Visualization Plot 3: Scale-Location

# Checks homoscedasticity (constant variance)
plot(lm_week9, which = 3)

Scale-Location Plot Summary and Explanation

This plot evaluates the homoscedasticity (constant variance) assumption by examining whether the spread of standardized residuals remains constant across fitted values.

In this plot, the red smoothing line shows a noticeable upward trend as fitted values increase. Additionally, the spread of points is visibly larger for higher fitted values (approximately 170–210) compared to lower fitted values. This indicates that the variance of the residuals increases with the fitted values.

This pattern provides further evidence of heteroscedasticity, meaning the constant variance assumption is not fully satisfied.

The severity of this issue appears moderate rather than extreme. The increase in spread is clear but not sharply funnel-shaped or explosive. Because the coefficient estimates remain unbiased and the t-statistics are very large, this level of heteroscedasticity is unlikely to overturn the main conclusions, though the reported standard errors should be treated with caution and it does represent a limitation of the model.

Overall, the constant variance assumption is only partly satisfied, with a clear but not severe deviation from perfect homoscedasticity.
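One common response to heteroscedasticity is to keep the fitted coefficients and replace the classical standard errors with heteroscedasticity-consistent ones. The sketch below computes the basic HC0 "sandwich" estimator by hand in base R on simulated data. `toy`, `x`, and `y` are illustrative names, and HC0 is just one of several variants, so this is a sketch of the idea rather than a recommended final analysis.

```r
set.seed(11)

# Simulated data with error variance that increases with x
toy <- data.frame(x = runif(400, 0, 10))
toy$y <- 1 + 3 * toy$x + rnorm(400, sd = 0.5 + 0.5 * toy$x)
fit <- lm(y ~ x, data = toy)

X <- model.matrix(fit)   # design matrix (intercept and x)
e <- resid(fit)          # ordinary residuals

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X: each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

# Classical vs. robust standard errors, side by side
cbind(classical  = sqrt(diag(vcov(fit))),
      robust_HC0 = sqrt(diag(vcov_hc0)))
```

If the robust and classical standard errors are close, the heteroscedasticity has little practical effect on inference; if they differ noticeably, the robust values are the safer ones to report.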

Visualization Plot 4: Residuals vs Leverage

# Checks for influential observations and high-leverage points
plot(lm_week9, which = 5)

Residuals vs. Leverage Plot Summary and Explanation

This plot evaluates the presence of high-leverage and influential observations, using leverage values and Cook’s distance contours.

Most observations cluster within a narrow leverage range (approximately 0.001–0.0017), and there are no points clearly exceeding the Cook’s distance reference lines. While a few observations have relatively large standardized residuals, their leverage values are not extreme.

There is no clear evidence of observations that simultaneously have high leverage and large residuals, which would indicate strong influence on the model.

The severity of influence appears low, as no points stand out as extreme or outside the Cook’s distance thresholds. I have high confidence that the regression results are not being driven by a small number of influential observations.

Overall, there is no indication of problematic leverage or influence.

Visualization Plot 5: Cook’s Distance

# Identifies influential observations based on Cook’s D values
plot(lm_week9, which = 4)

Cook’s Distance Plot Summary and Explanation

This plot evaluates influential observations by measuring how much each observation affects the fitted regression model. Larger Cook’s distance values indicate greater influence.

Most observations have Cook’s distance values very close to zero, suggesting minimal influence on the model. However, there is a noticeable spike among the final observations, with a few points (e.g., 3742–3744) showing substantially larger Cook’s distance values relative to the rest of the data.

Although these spikes are clearly larger than most other values, their absolute magnitude remains relatively small (well below 1, which is often used as a rule-of-thumb threshold for serious influence). Therefore, while these observations are more influential than others, they do not appear extreme enough to severely distort the regression results.

The severity of influence appears mild to moderate, but not critical. Given the large overall sample size, the model is unlikely to be driven by these few observations.

Overall, I have moderate to high confidence that influential points do not materially compromise the validity of the regression model, though these observations could be examined further in a deeper analysis.
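The quantities behind the two influence plots can also be pulled out numerically, which makes it easier to list the specific observations worth a second look. The sketch below uses a small simulated model (`toy`, `x`, `y` are illustrative names); with this report's objects, the same functions would be applied to `lm_week9`. The 4/n cutoff is a common screening convention and an assumption here, not a rule taken from this analysis.

```r
set.seed(5)

# Small simulated regression, used only to demonstrate the functions
toy <- data.frame(x = rnorm(200))
toy$y <- 2 + toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

cd  <- cooks.distance(fit)   # influence of each observation
lev <- hatvalues(fit)        # leverage of each observation
n <- nrow(toy)
p <- length(coef(fit))

# The three most influential observations
head(sort(cd, decreasing = TRUE), 3)

# Counts above a loose screening cutoff (4/n) and the stricter cutoff of 1
sum(cd > 4 / n)
sum(cd > 1)

# Average leverage always equals p / n
mean(lev)
```

For the expanded model in this report, with 5 coefficients and 3,744 observations, the average leverage is 5/3744 ≈ 0.0013, which is consistent with the narrow leverage range seen in the Residuals vs Leverage plot.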

Week 9 Data Dive Summary

The expanded regression model substantially improves upon the Week 8 model by incorporating Cause of Death as an explanatory variable. The increase in \(R^2\) from 0.0201 to 0.9342 demonstrates that differences across causes account for a large proportion of the variability in age-adjusted death rates. This indicates that mortality risk is driven far more by disease category than by time alone, although a statistically significant downward trend over time remains present even after controlling for cause.

Diagnostic evaluation suggests that the model assumptions are largely reasonable, though not perfect. The linearity assumption appears well satisfied, and the residuals are approximately normal with only mild tail deviations. However, moderate heteroscedasticity is present, as variability increases for higher fitted values. While this represents a limitation, the severity does not appear extreme, and the very large test statistics make it unlikely that the main conclusions would change, although the reported standard errors may be somewhat inaccurate. Additionally, although a small number of observations exhibit higher Cook’s distance values, there is no strong evidence that the model is driven by a few influential points.

Overall, I have moderate to high confidence in the validity of this regression model. The analysis highlights the importance of pairing regression modeling with diagnostic evaluation to ensure conclusions are supported by a reasonably well-behaved model. Future investigation could explore variance-stabilizing transformations or robust standard errors to address the observed heteroscedasticity, as well as potential interaction effects between Year and Cause of Death to determine whether trends differ across disease categories.
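The interaction question could be explored along the following lines. This is a sketch on simulated data in which the cause-specific slopes are hypothetical values chosen so that the interaction is visible. `toy` and `slope` are illustrative names, and nothing here reflects results from the real dataset.

```r
set.seed(9)

# Simulated stand-in data: arbitrary years, four causes, 10 replicates each
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")
toy <- expand.grid(Year = 1999:2017, Cause.Name = causes, rep = 1:10)

# Hypothetical cause-specific time trends (assumed, for illustration only)
slope <- c(Cancer = -2, `Heart disease` = -4, Stroke = -1, Suicide = 0.2)
toy$Rate <- 100 +
  unname(slope[as.character(toy$Cause.Name)]) * (toy$Year - 1999) +
  rnorm(nrow(toy), sd = 5)

fit_add <- lm(Rate ~ Year + Cause.Name, data = toy)  # one shared slope
fit_int <- lm(Rate ~ Year * Cause.Name, data = toy)  # a slope per cause

# F-test of whether allowing separate slopes improves the fit
cmp <- anova(fit_add, fit_int)
print(cmp)
```

In the interaction model, the `Year` coefficient is the slope for the reference cause, and each `Year:Cause.Name` term is the difference between that cause's slope and the reference slope.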