Week 9 Data Dive

Expanding the Linear Regression Model

In Week 8, a simple linear regression model was used to examine whether age-adjusted death rates changed over time:

Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \varepsilon\)

While this model identified a statistically significant temporal trend, it explained very little of the variability in death rates. This suggests that additional explanatory variables are needed to better understand differences in mortality risk.

For this week’s analysis, the model will be expanded by including Cause of Death as a categorical explanatory variable. Rather than using all possible causes, four major public health categories are selected:

Heart disease
Cancer
Stroke
Suicide

These causes were selected since they represent major and distinct contributors to mortality and are likely to have different baseline death rates.

The updated regression model is:

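Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \beta_2(\text{Heart disease}) + \beta_3(\text{Stroke}) + \beta_4(\text{Suicide}) + \varepsilon\)

Because Cause is categorical with four levels, it enters the model as three 0/1 indicator variables. Cancer, the alphabetically first level and therefore R's default, serves as the reference category.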

Including Cause of Death allows the model to account for systematic differences in mortality levels across disease categories, while still evaluating the overall trend over time.
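To make the role of a categorical predictor concrete, the following minimal sketch shows how R expands a four-level factor into indicator columns. The object names (`cause`, `X`, `cause_releveled`) are illustrative only and are not part of the analysis; no real data is used.

```r
# Toy factor holding only the four cause names (no real data)
cause <- factor(c("Cancer", "Heart disease", "Stroke", "Suicide"))

# R orders factor levels alphabetically by default; the first is the reference
reference_level <- levels(cause)[1]
print(reference_level)

# Treatment (dummy) coding: an intercept plus one 0/1 column per non-reference level
X <- model.matrix(~ cause)
print(X)

# Picking another reference changes what each coefficient is compared against,
# but not the fitted values of the model
cause_releveled <- relevel(cause, ref = "Heart disease")
print(colnames(model.matrix(~ cause_releveled)))
```

In a fitted model, each indicator coefficient is the estimated difference from the reference level at a fixed Year.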

The aggregate category All causes is excluded to avoid redundancy, since it represents total mortality rather than a distinct disease process.

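No additional derived rate variables are included. Because such variables overlap with the response variable, including them would make the model nearly circular rather than explanatory. (Multicollinearity, strictly speaking, refers to correlation among predictors, which would also become a concern if several overlapping rate variables were added together.)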

Filtering Data

# Load dplyr for filter(); the native pipe |> requires R 4.1 or later
library(dplyr)

# Keep only the four selected causes (this also drops "All causes")
week9_data <- data |>
  filter(Cause.Name %in% c("Heart disease",
                           "Cancer",
                           "Stroke",
                           "Suicide"))

Explanation of Filtering

To expand the regression model while maintaining interpretability, the dataset is restricted to four major causes of death: heart disease, cancer, stroke, and suicide. These causes represent distinct disease categories with meaningful differences in mortality patterns.

The aggregate category All causes is excluded, since it represents total mortality rather than a specific cause of death. Including it would introduce conceptual redundancy into the model.

Restricting the dataset to four categories also ensures the regression model remains parsimonious and avoids unnecessary complexity from excessive dummy variables.
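A quick sanity check after filtering is to count rows per cause. The sketch below uses a tiny made-up data frame as a stand-in for the course dataset (all values are hypothetical) and base R subsetting rather than dplyr, so that it runs without extra packages.

```r
# Hypothetical stand-in for the course dataset; the values are made up
data <- data.frame(
  Year = rep(2000:2002, times = 6),
  Cause.Name = rep(c("All causes", "Cancer", "Heart disease",
                     "Stroke", "Suicide", "Diabetes"), each = 3),
  Age.adjusted.Death.Rate = c(800, 790, 780, 200, 195, 190, 250, 240, 230,
                              60, 58, 55, 11, 11, 12, 25, 24, 23),
  stringsAsFactors = FALSE
)

# Same filtering rule as in the report, written with base R subsetting
keep <- c("Heart disease", "Cancer", "Stroke", "Suicide")
week9_data <- data[data$Cause.Name %in% keep, ]

# Only the four selected causes should remain, each with its own row count
cause_counts <- table(week9_data$Cause.Name)
print(cause_counts)

# The aggregate category must not survive the filter
stopifnot(!("All causes" %in% week9_data$Cause.Name))
```

On the real data, the same `table()` call would show whether the four causes contribute equal numbers of rows.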

Baseline Model

# Week 8 baseline model for continuity
lm_week8 <- lm(Age.adjusted.Death.Rate ~ Year,
               data = week9_data)

summary(lm_week8)

## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year, data = week9_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123.13  -74.54  -11.09   72.41  214.27 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4667.8700   520.4727   8.969   <2e-16 ***
## Year          -2.2705     0.2591  -8.762   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.26 on 3742 degrees of freedom
## Multiple R-squared:  0.0201, Adjusted R-squared:  0.01984 
## F-statistic: 76.77 on 1 and 3742 DF,  p-value: < 2.2e-16

Explanation

To maintain continuity with Week 8, the original simple linear regression model is refit using the restricted dataset:

Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \varepsilon\)

This model evaluates whether age-adjusted death rates change over time, without accounting for differences across causes of death.

Refitting this model allows for a direct comparison between the simple regression from Week 8 and the expanded model developed in Week 9.
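One way to make such a comparison formal is a nested-model F-test, since the Year-only model is a special case of the model with Year and Cause. The sketch below runs the test on simulated data; the data frame `toy`, the replicate index `unit`, and the cause levels and slope used to generate it are assumptions for illustration, not estimates from the real dataset.

```r
set.seed(42)

# Simulated stand-in data: 4 causes x 16 years x 10 hypothetical reporting units
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")
toy <- expand.grid(Year = 2000:2015, Cause.Name = causes, unit = 1:10,
                   stringsAsFactors = FALSE)
toy$Cause.Name <- factor(toy$Cause.Name, levels = causes)

# Made-up baseline level per cause and a common downward time trend
level <- c(180, 200, 50, 12)[match(as.character(toy$Cause.Name), causes)]
toy$Age.adjusted.Death.Rate <- level - 2 * (toy$Year - 2000) +
  rnorm(nrow(toy), sd = 15)

fit_small <- lm(Age.adjusted.Death.Rate ~ Year, data = toy)
fit_big   <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)

# F-test of whether the three cause indicators jointly improve the fit
cmp <- anova(fit_small, fit_big)
print(cmp)

# R-squared of both models side by side
r2 <- c(small = summary(fit_small)$r.squared,
        big   = summary(fit_big)$r.squared)
print(round(r2, 4))
```

With the report's own objects, the equivalent call would be `anova(lm_week8, lm_week9)`, which is valid because both models are fit to the same rows of `week9_data`.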

Expanded Model

# Week 9 expanded model
lm_week9 <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name,
               data = week9_data)

summary(lm_week9)

## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year + Cause.Name, data = week9_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.090 -11.519  -1.045  10.137 127.135 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4737.55422  134.87729   35.12   <2e-16 ***
## Year                      -2.27047    0.06715  -33.81   <2e-16 ***
## Cause.NameHeart disease   17.45011    0.98541   17.71   <2e-16 ***
## Cause.NameStroke        -132.36880    0.98541 -134.33   <2e-16 ***
## Cause.NameSuicide       -163.81816    0.98541 -166.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.32 on 3739 degrees of freedom
## Multiple R-squared:  0.9342, Adjusted R-squared:  0.9342 
## F-statistic: 1.328e+04 on 4 and 3739 DF,  p-value: < 2.2e-16

Explanation and Analysis

The expanded regression model includes both Year and Cause of Death as explanatory variables:

Age.adjusted.Death.Rate = \(\beta_0 + \beta_1(\text{Year}) + \beta_2(\text{Cause}) + \varepsilon\)

After including Cause.Name, the model fit improves substantially. The multiple \(R^2\) increases from 0.0201 in the Week 8 baseline model (refit on the restricted dataset) to 0.9342, indicating that approximately 93.4% of the variability in age-adjusted death rates is explained by Year and Cause of Death combined. This represents a dramatic improvement in explanatory power.

The overall model is highly statistically significant:

\(F(4, 3739) = 13{,}280,\ p < 2.2 \times 10^{-16}\)

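The coefficient for Year remains negative and statistically significant (β₁ = −2.27047, p < 2e−16): holding cause of death fixed, each additional year is associated with an average decrease of about 2.27 in the age-adjusted death rate. The estimate is essentially unchanged from the baseline model (−2.2705), but its standard error falls from 0.2591 to 0.06715 because the cause indicators absorb most of the residual variation.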

The cause coefficients represent differences relative to the reference category (Cancer), holding Year fixed. Specifically:

Heart disease has a significantly higher age-adjusted death rate than Cancer (by about 17.45).
Stroke and Suicide have significantly lower age-adjusted death rates than Cancer (by about 132.37 and 163.82, respectively).

All cause indicators are highly statistically significant (p < 2e−16), suggesting substantial differences in baseline mortality levels across these disease categories.

The residual standard error also decreases substantially (from 82.26 in the baseline model to 21.32), further indicating that including Cause of Death meaningfully reduces unexplained variability.

Overall, this expanded model suggests that differences across causes of death account for a large portion of the variability in mortality rates, while the downward time trend remains present.
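As a worked illustration of how the coefficients combine, the sketch below computes fitted values by hand from the estimates printed in the `summary(lm_week9)` output. The helper `predict_rate()` and the year 2010 are hypothetical choices made for this example; 2010 is assumed to fall within the observed range of years.

```r
# Estimates copied from the summary(lm_week9) output above
b <- c(intercept = 4737.55422, year = -2.27047,
       heart = 17.45011, stroke = -132.36880, suicide = -163.81816)

# Hypothetical helper: fitted value = intercept + slope * year + cause shift
predict_rate <- function(year, cause) {
  shift <- switch(cause,
    "Cancer"        = 0,               # reference category, no shift
    "Heart disease" = b[["heart"]],
    "Stroke"        = b[["stroke"]],
    "Suicide"       = b[["suicide"]],
    stop("unknown cause: ", cause)
  )
  b[["intercept"]] + b[["year"]] * year + shift
}

yr <- 2010  # illustrative year, assumed to lie within the observed data
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")
preds <- sapply(causes, function(cs) predict_rate(yr, cs))
print(round(preds, 2))
```

Because the model has no interaction term, the four fitted lines are parallel: the gap between any two causes is the same in every year.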

Model Evaluation

Visualization Plot 1: Residuals vs Fitted

# Checks linearity and constant variance (homoscedasticity)
plot(lm_week9, which = 1)

Residuals vs Fitted Summary and Explanation

This plot evaluates the linearity and constant variance (homoscedasticity) assumptions.

The residuals are centered around zero and the red smoothing line is approximately flat, indicating no strong curvature. This supports the assumption that the linear form of the model is appropriate. I have a high level of confidence that the linearity assumption is reasonably satisfied.

However, the spread of residuals increases at higher fitted values, particularly among observations with larger predicted death rates. This suggests moderate heteroscedasticity, meaning the variance is not perfectly constant across fitted values.

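The severity of this issue appears moderate rather than severe, as the increase in spread is noticeable but not extreme. Heteroscedasticity does not bias the coefficient estimates, but it can make the reported standard errors and p-values unreliable, and a large sample size does not correct this on its own. Because the test statistics in this model are very large, the main conclusions are unlikely to change, but the violation should be acknowledged as a limitation.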

Overall, linearity appears well satisfied, while constant variance is moderately violated.
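A visual impression of non-constant variance can be backed up with a formal check. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R (the squared residuals are regressed on the same predictors, and n times the auxiliary \(R^2\) is compared to a chi-square distribution). It runs on simulated data in which the noise is deliberately made to grow with the mean; the data frame `toy` and all of its parameters are assumptions for illustration only.

```r
set.seed(7)
n <- 800
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")

# Simulated stand-in data with noise that grows with the expected rate
toy <- data.frame(
  Year = sample(2000:2015, n, replace = TRUE),
  Cause.Name = factor(sample(causes, n, replace = TRUE), levels = causes)
)
mu <- c(180, 200, 50, 12)[as.integer(toy$Cause.Name)] - 2 * (toy$Year - 2000)
toy$Age.adjusted.Death.Rate <- mu + rnorm(n, sd = 2 + 0.08 * mu)

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ Year + Cause.Name, data = toy)

# Studentized Breusch-Pagan statistic and its chi-square p-value
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of predictors, excluding the intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(statistic = bp_stat, df = bp_df, p.value = bp_p))
```

A small p-value indicates that the residual variance depends on the predictors. Applying the same steps to `lm_week9` would test the pattern described above rather than relying on the plot alone.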

Visualization Plot 2: Normal Q-Q

# Checks normality of residuals
plot(lm_week9, which = 2)

Normal Q-Q Plot Summary and Explanation

This plot evaluates the normality of the residuals. If the normality assumption is satisfied, the points should closely follow the diagonal reference line.

In this plot, the residuals follow the reference line closely through the center of the distribution, indicating that the majority of residuals are approximately normally distributed. However, there is noticeable deviation in the upper tail, where several points rise above the line. This suggests some right-tail skewness or the presence of larger-than-expected positive residuals.

The deviation is primarily confined to the extreme tail rather than the bulk of the data. Therefore, the violation appears mild to moderate rather than severe.

Given the large sample size (n = 3,744), minor departures from normality are unlikely to substantially affect inference, because the Central Limit Theorem makes the sampling distributions of the coefficient estimates approximately normal. I have moderate to high confidence that the normality assumption is sufficiently satisfied for reliable regression inference.

Overall, the normality assumption appears reasonably met, with some mild tail deviation.
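The tail behavior seen in a Q-Q plot can also be summarized with a single number, the sample skewness. The sketch below defines a small helper, `sample_skewness()` (a hypothetical name, not a base R function), and applies it to a symmetric vector and to simulated right-skewed values so that the sign convention is clear.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# A perfectly symmetric vector has zero skewness
symmetric_values <- c(-2, -1, 0, 1, 2)
skew_symmetric <- sample_skewness(symmetric_values)

# Simulated right-skewed values (centered exponential draws, not real residuals)
set.seed(3)
skewed_values <- rexp(2000) - 1
skew_right <- sample_skewness(skewed_values)

print(c(symmetric = skew_symmetric, right_skewed = skew_right))
```

Calling `sample_skewness(residuals(lm_week9))` would give a comparable summary for the fitted model; a positive value would agree with the upper-tail deviation described above.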

Visualization Plot 3: Scale-Location

# Checks homoscedasticity (constant variance)
plot(lm_week9, which = 3)

Scale-Location Plot Summary and Explanation

This plot evaluates the homoscedasticity (constant variance) assumption by examining whether the spread of standardized residuals remains constant across fitted values.

In this plot, the red smoothing line shows a noticeable upward trend as fitted values increase. Additionally, the spread of points is visibly larger for higher fitted values (approximately 170–210) compared to lower fitted values. This indicates that the variance of the residuals increases with the fitted values.

This pattern provides further evidence of heteroscedasticity, meaning the constant variance assumption is not fully satisfied.

The severity of this issue appears moderate rather than extreme. The increase in spread is clear but not sharply funnel-shaped. Because heteroscedasticity affects the standard errors rather than the coefficient estimates, and the test statistics here are very large, it is unlikely to overturn the overall conclusions, though it does represent a limitation of the model.

Overall, the constant variance assumption is moderately violated, with a clear but not severe departure from homoscedasticity.
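One common response to heteroscedasticity of this kind is to report heteroscedasticity-consistent (robust) standard errors alongside the usual ones. The sketch below computes the simplest variant, HC0, by hand in base R on simulated data. The data frame `toy` and its parameters are assumptions for illustration, and this is not a step the original analysis performed.

```r
set.seed(21)
n <- 600
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")

# Simulated stand-in data whose noise grows with the expected rate
toy <- data.frame(
  Year = sample(2000:2015, n, replace = TRUE),
  Cause.Name = factor(sample(causes, n, replace = TRUE), levels = causes)
)
mu <- c(180, 200, 50, 12)[as.integer(toy$Cause.Name)] - 2 * (toy$Year - 2000)
toy$Age.adjusted.Death.Rate <- mu + rnorm(n, sd = 2 + 0.08 * mu)

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)

# HC0 sandwich estimator: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
X <- model.matrix(fit)
e <- residuals(fit)
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)        # multiplies row i of X by e[i], then X'X
vc_hc0 <- bread %*% meat %*% bread

robust_se  <- sqrt(diag(vc_hc0))
names(robust_se) <- colnames(X)
classic_se <- sqrt(diag(vcov(fit)))

# Side-by-side comparison of classical and robust standard errors
print(round(cbind(classic = classic_se, robust = robust_se), 4))
```

If the robust and classical standard errors for `lm_week9` turned out to be similar, that would support the view that the heteroscedasticity is not severe enough to change the conclusions.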

Visualization Plot 4: Residuals vs Leverage

# Checks for influential observations and high-leverage points
plot(lm_week9, which = 5)

Residuals vs. Leverage Plot Summary and Explanation

This plot evaluates the presence of high-leverage and influential observations, using leverage values and Cook’s distance contours.

Most observations cluster within a narrow leverage range (approximately 0.001–0.0017), and there are no points clearly exceeding the Cook’s distance reference lines. While a few observations have relatively large standardized residuals, their leverage values are not extreme.

There is no clear evidence of observations that simultaneously have high leverage and large residuals, which would indicate strong influence on the model.

The severity of influence appears low, as no points stand out as extreme or outside the Cook’s distance thresholds. I have high confidence that the regression results are not being driven by a small number of influential observations.

Overall, there is no evidence of problematic leverage or influence in this model.
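The leverage range seen in the plot can be put in context with a short calculation: leverages always average to p/n, where p is the number of estimated coefficients and n the number of observations. The sketch below computes that benchmark from the counts in the model output (5 coefficients, 3739 residual degrees of freedom) and then confirms the identity on a small simulated fit. The 2p/n cutoff is a common rule of thumb, not a threshold used in the original analysis, and the toy data is hypothetical.

```r
# Counts taken from the summary(lm_week9) output
p <- 5            # intercept, Year, and three cause indicators
n <- 3739 + p     # residual degrees of freedom plus estimated coefficients
mean_leverage <- p / n
print(round(c(mean = mean_leverage,
              twice_mean = 2 * mean_leverage,
              triple_mean = 3 * mean_leverage), 5))

# Confirm on a toy fit that hat values average to p/n
set.seed(11)
toy <- data.frame(x = rnorm(60), g = factor(rep(c("a", "b", "c"), 20)))
toy$y <- 1 + 2 * toy$x + rnorm(60)
fit <- lm(y ~ x + g, data = toy)

h <- hatvalues(fit)
print(c(mean_hat = mean(h), p_over_n = length(coef(fit)) / nrow(toy)))

# Observations above twice the mean leverage (rule-of-thumb screen)
flagged <- which(h > 2 * mean(h))
print(flagged)
```

The benchmark of roughly 0.0013 sits inside the 0.001 to 0.0017 range read from the plot, which is consistent with no observation having unusually high leverage.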

Visualization Plot 5: Cook’s Distance

# Identifies influential observations based on Cook’s D values
plot(lm_week9, which = 4)

Cook’s Distance Plot Summary and Explanation

This plot evaluates influential observations by measuring how much each observation affects the fitted regression model. Larger Cook’s distance values indicate greater influence.

Most observations have Cook’s distance values very close to zero, suggesting minimal influence on the model. However, there is a noticeable spike among the final observations, with a few points (e.g., 3742–3744) showing substantially larger Cook’s distance values relative to the rest of the data.

Although these spikes are clearly larger than most other values, their absolute magnitude remains relatively small (well below 1, which is often used as a rule-of-thumb threshold for serious influence). Therefore, while these observations are more influential than others, they do not appear extreme enough to severely distort the regression results.

The severity of influence appears mild to moderate, but not critical. Given the large overall sample size, the model is unlikely to be driven by these few observations.

Overall, I have moderate to high confidence that influential points do not materially compromise the validity of the regression model, though these observations could be examined further in a deeper analysis.
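To examine such observations further, Cook's distances can be extracted and ranked directly. The sketch below does this on simulated data with one planted outlier. The 4/n cutoff is a common screening convention assumed here for illustration (the text above uses 1 as its threshold), and the toy data is hypothetical.

```r
set.seed(5)
n <- 200

# Simulated stand-in data with a linear downward trend
toy <- data.frame(Year = rep(2000:2019, each = 10))
toy$y <- 100 - 2 * (toy$Year - 2000) + rnorm(n, sd = 10)

# Plant one outlier at the earliest year so it is both unusual and influential
toy$y[1] <- toy$y[1] + 200

fit <- lm(y ~ Year, data = toy)
d <- cooks.distance(fit)

# Five most influential observations, largest first
top <- head(sort(d, decreasing = TRUE), 5)
print(round(top, 4))

# Screening cutoff (assumed convention): flag anything above 4/n
cutoff <- 4 / n
flagged <- which(d > cutoff)
print(flagged)
```

Running `head(sort(cooks.distance(lm_week9), decreasing = TRUE))` would list the most influential rows of the real model, which could then be looked up in `week9_data`.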

Visualization Plot 6: Residuals vs. X Values

# gg_resX() is provided by the lindia package; geom_smooth() by ggplot2
library(lindia)
library(ggplot2)

# Residuals vs. Year: checks linearity and constant variance across the predictor
plots <- gg_resX(lm_week9, plot.all = FALSE)

plots$Year +
  geom_smooth(se = FALSE)

Residuals vs. X Values Plot Summary and Explanation

This plot evaluates the relationship between the residuals and the explanatory variable Year to assess whether the linearity assumption is reasonable and whether the residuals remain centered around zero across the range of x-values. Ideally, the residuals should appear randomly scattered around the horizontal zero line without a strong visible pattern.

In this plot, the residuals are generally spread around 0 for each year, which suggests that the model is not showing a severe systematic error across time. The blue smooth curve does show a slight curved pattern, dipping somewhat below zero in the middle years and rising again toward the later years, which may indicate a mild non-linear trend that the model does not fully capture.

The overall spread of the residuals appears fairly similar across most years, although there are some larger positive residuals in the earlier years. This suggests that the constant variance assumption is reasonably acceptable, even if not perfectly uniform.

Overall, I have moderate confidence that the relationship between Year and the response is being modeled reasonably well, though the slight curvature suggests that the effect of Year may not be perfectly linear and could be explored further in a more advanced analysis.
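Whether such curvature is worth modeling can be checked by adding a quadratic term in Year and comparing the two fits. The sketch below does this on simulated data that contains a built-in curve; the data frame `toy` and every generating parameter are assumptions for illustration only.

```r
set.seed(9)
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")

# Simulated stand-in data: 4 causes x 16 years x 5 hypothetical units
toy <- expand.grid(Year = 2000:2015, Cause.Name = causes, unit = 1:5,
                   stringsAsFactors = FALSE)
toy$Cause.Name <- factor(toy$Cause.Name, levels = causes)

# Made-up cause levels, a linear decline, and a mild U-shaped curve over time
level <- c(180, 200, 50, 12)[as.integer(toy$Cause.Name)]
year_c <- toy$Year - mean(toy$Year)
toy$Age.adjusted.Death.Rate <- level - 2 * year_c + 0.3 * year_c^2 +
  rnorm(nrow(toy), sd = 5)

fit_linear <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)
fit_quad   <- lm(Age.adjusted.Death.Rate ~ poly(Year, 2) + Cause.Name, data = toy)

# F-test for the single added quadratic term
cmp <- anova(fit_linear, fit_quad)
print(cmp)
```

`poly(Year, 2)` uses orthogonal polynomials, which avoids the numerical problems that raw `Year^2` can cause when years are around 2000.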

Visualization Plot 7: Residual Histogram

# Creates a histogram of the residuals to assess whether the residuals appear approximately normal
hist(
  residuals(lm_week9),
  main = "Histogram of Residuals",
  xlab = "Residuals",
  col = "lightgray",
  border = "black"
)

Residual Histogram Plot Summary and Explanation

This histogram evaluates whether the residuals from the regression model appear approximately normally distributed, which is one of the standard assumptions checked in linear regression. Ideally, the residuals should form a roughly symmetric, bell-shaped distribution centered near zero.

In this plot, most of the residuals are concentrated around the center, with the highest frequencies occurring near values slightly below and above 0, which suggests that the model errors are generally clustered near zero. However, the distribution is not perfectly symmetric, as there appears to be a right-skewed tail extending toward larger positive residual values, along with a smaller tail on the negative side.

This means the residuals are approximately normal, but not perfectly so, since there are some larger positive residuals that stretch the distribution to the right. Overall, I have moderate confidence that the normality assumption is reasonably satisfied for this model, though the mild skew suggests that a few observations may be contributing somewhat larger errors than the rest.

Week 9 Data Dive Summary

The expanded regression model improves substantially on the Week 8 model by adding Cause.Name as an explanatory variable alongside Year. The increase in \(R^2\) from 0.0201 to 0.9342 shows that cause of death explains a very large share of the variation in Age.adjusted.Death.Rate, indicating that mortality differences are driven much more strongly by disease category than by time alone. At the same time, the model still suggests a statistically significant downward trend over time after controlling for cause.

The diagnostic plots suggest that the model is reasonably strong overall, while also showing a few limitations that should be acknowledged. The Residuals vs Fitted and Residuals vs X Values plots suggest that the linear form is generally reasonable, although the residuals vs year plot shows a slight curve that may indicate a mild non-linear time pattern. The Residuals vs Fitted and Scale-Location plots both show the residual spread increasing with the fitted values, indicating moderate heteroscedasticity. The Normal Q-Q Plot and Residual Histogram both suggest that the residuals are approximately normal, though not perfectly so, since there are some departures in the tails and a slight right-skew in the histogram. The Residuals vs Leverage plot shows no high-leverage points, and the Cook’s Distance plot indicates that a small number of observations are more influential than others, but their magnitudes do not appear extreme enough to suggest that the model is being driven by only a few data points.

Overall, I have moderate to high confidence in the validity of this regression model and in the main conclusion that Cause.Name is an essential predictor of age-adjusted death rate. The diagnostic analysis shows that the model assumptions are mostly reasonable, even if they are not perfectly satisfied in every respect. Future work could explore whether a nonlinear time term, interaction effects between Year and Cause.Name, or other refinements might improve fit even further.
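The interaction idea can be sketched as follows: letting the Year slope differ by cause replaces the parallel-lines model with one line per cause. The example below uses simulated data with deliberately different slopes, so the data frame `toy` and all generating values are assumptions, not findings about the real dataset.

```r
set.seed(13)
causes <- c("Cancer", "Heart disease", "Stroke", "Suicide")

# Simulated stand-in data: 4 causes x 16 years x 10 hypothetical units
toy <- expand.grid(Year = 2000:2015, Cause.Name = causes, unit = 1:10,
                   stringsAsFactors = FALSE)
toy$Cause.Name <- factor(toy$Cause.Name, levels = causes)

# Made-up cause-specific levels and slopes
idx   <- as.integer(toy$Cause.Name)
level <- c(180, 200, 50, 12)[idx]
slope <- c(-2, -4, -1, 0.2)[idx]
toy$Age.adjusted.Death.Rate <- level + slope * (toy$Year - 2000) +
  rnorm(nrow(toy), sd = 5)

fit_additive    <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name, data = toy)
fit_interaction <- lm(Age.adjusted.Death.Rate ~ Year * Cause.Name, data = toy)

# F-test for the three Year-by-cause interaction terms
cmp <- anova(fit_additive, fit_interaction)
print(cmp)

# Recover one Year slope per cause from the interaction model
b <- coef(fit_interaction)
slopes <- c(b[["Year"]], b[["Year"]] + b[grep("^Year:", names(b))])
names(slopes) <- causes
print(round(slopes, 2))
```

If the interaction terms were significant in the real data, the single Year coefficient reported above would be better read as an average of cause-specific trends.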