Week 11 Data Dive

Linear Regression Model Build

# Build a linear regression model using Year, Cause.Name, and Total_Population to predict Age.adjusted.Death.Rate
lm_week11 <- lm(
  Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population,
  data = data
)

# Display the linear regression model summary so the coefficients, significance, and overall model fit can be reviewed
summary(lm_week11)

## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -210.573   -9.321   -0.719    7.972  253.782 
## 
## Coefficients:
##                                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                        3.911e+03  1.217e+02   32.150   <2e-16 ***
## Year                              -1.552e+00  6.057e-02  -25.622   <2e-16 ***
## Cause.NameAlzheimer's disease     -7.686e+02  1.474e+00 -521.503   <2e-16 ***
## Cause.NameCancer                  -6.168e+02  1.474e+00 -418.451   <2e-16 ***
## Cause.NameCLRD                    -7.496e+02  1.474e+00 -508.589   <2e-16 ***
## Cause.NameDiabetes                -7.708e+02  1.474e+00 -522.965   <2e-16 ***
## Cause.NameHeart disease           -5.993e+02  1.474e+00 -406.612   <2e-16 ***
## Cause.NameInfluenza and pneumonia -7.762e+02  1.474e+00 -526.633   <2e-16 ***
## Cause.NameKidney disease          -7.799e+02  1.474e+00 -529.171   <2e-16 ***
## Cause.NameStroke                  -7.491e+02  1.474e+00 -508.260   <2e-16 ***
## Cause.NameSuicide                 -7.806e+02  1.474e+00 -529.598   <2e-16 ***
## Cause.NameUnintentional injuries  -7.504e+02  1.474e+00 -509.107   <2e-16 ***
## Total_Population                  -1.408e-08  7.550e-09   -1.865   0.0623 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.89 on 10283 degrees of freedom
## Multiple R-squared:  0.9794, Adjusted R-squared:  0.9794 
## F-statistic: 4.081e+04 on 12 and 10283 DF,  p-value: < 2.2e-16

Variable Selection Explanation

For this Week 11 data dive, the response variable chosen was Age.adjusted.Death.Rate since it is a continuous measure of mortality risk and is therefore appropriate for a linear regression model. The explanatory variables selected were Year, Cause.Name, and Total_Population. Year was included to capture possible changes in mortality over time, Cause.Name was included since different causes of death are expected to have very different death-rate patterns, and Total_Population was added as a numeric variable that could help account for differences in the size of the population associated with each observation.

Model Results Summary and Analysis

The Week 11 linear regression model performed very well overall, with an \(R^2\) of 0.9794, meaning that the predictors in the model explain about 97.94% of the variation in Age.adjusted.Death.Rate. This indicates that the combination of Year, Cause.Name, and Total_Population provides a very strong fit for the data. The overall F-statistic was also highly significant, showing that the model as a whole explains substantially more variation than a model with no predictors.
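
As a rough illustration of where these two summary numbers come from, the sketch below recomputes \(R^2\) as 1 - RSS/TSS and the overall F-statistic as a comparison against an intercept-only model. It runs on a small simulated data frame, `sim`, which is a hypothetical stand-in for the real `data` object (the cause names, rates, and populations in it are assumptions made up for the example), so the printed values will not match the Week 11 output.

```r
# Hypothetical stand-in data; the real `data` object is not reproduced here
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) + rnorm(nrow(sim), sd = 20)

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population, data = sim)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y <- sim$Age.adjusted.Death.Rate
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2_manual <- 1 - rss / tss

# Overall F-test: full model versus a model with only an intercept
null_fit <- lm(Age.adjusted.Death.Rate ~ 1, data = sim)
f_manual <- anova(null_fit, fit)$F[2]

# Both pairs should agree with what summary() reports
c(manual = r2_manual, reported = summary(fit)$r.squared)
c(manual = f_manual, reported = unname(summary(fit)$fstatistic["value"]))
```

The same two comparisons could be run on `lm_week11` directly by swapping `sim` for `data`.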

Looking at the individual coefficients, Year was negative and highly significant, which suggests that, after controlling for cause of death and population, age-adjusted death rates tend to decline over time. Each Cause.Name coefficient is a contrast with the reference level that R omits from the coefficient table, not an absolute rate. Because all ten estimates are large and negative (roughly -599 to -781), the reference level has by far the highest age-adjusted death rate. These contrasts were also highly significant, and their magnitudes dwarf the Year slope of about -1.55 per year. This is an important insight since it shows that variation in mortality is driven much more strongly by the type of cause than by time alone.
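
Since the Cause.Name coefficients only make sense relative to a baseline, it can help to check which level R is treating as the reference. The sketch below shows one way to do that and to switch the baseline with `relevel()`. It uses a simulated data frame, `sim`, as a hypothetical stand-in for `data`; the level names, including "All causes", are assumptions for the example and are not confirmed by the Week 11 output.

```r
# Hypothetical stand-in data; level names and rates are made up for illustration
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) + rnorm(nrow(sim), sd = 20)

# The first factor level is the reference level under R's default contrasts
sim$Cause.Name <- factor(sim$Cause.Name)
ref_level <- levels(sim$Cause.Name)[1]
ref_level

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population, data = sim)
names(coef(fit))  # the reference level has no coefficient of its own

# Changing the baseline changes the contrasts but not the fitted values
sim$Cause.Name <- relevel(sim$Cause.Name, ref = "Heart disease")
fit_releveled <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population,
                    data = sim)
same_fit <- isTRUE(all.equal(fitted(fit), fitted(fit_releveled)))
same_fit
```

Choosing a baseline that is a natural comparison group tends to make the contrasts easier to describe in words.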

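The variable Total_Population had a very small negative coefficient (about -1.41e-08) and a p-value of 0.0623, so it falls short of significance at the conventional 0.05 level and is only marginally significant at the 0.10 level. The tiny size of the estimate largely reflects the scale of the predictor, since a one-unit change in Total_Population is a very small change. This suggests that population size may have some relationship with the response, but its independent contribution appears much weaker than the effects of Year and Cause.Name. Even so, keeping it in the model is still reasonable since it adds an additional quantitative dimension to the analysis and may help account for some background variation across observations.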
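
One way to make the contribution of Total_Population more concrete is a nested-model comparison: fit the model with and without it and compare the two with `anova()`. The sketch below does this on a simulated data frame, `sim`, which is a hypothetical stand-in for `data`, and also refits with population expressed in millions to show that rescaling changes the size of the coefficient but not its test. All values in `sim` are assumptions made up for the example.

```r
# Hypothetical stand-in data; the real `data` object is not reproduced here
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) + rnorm(nrow(sim), sd = 20)

fit_full <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population,
               data = sim)
fit_reduced <- update(fit_full, . ~ . - Total_Population)

# Partial F-test for dropping one predictor; its F equals the squared t-value
cmp <- anova(fit_reduced, fit_full)
t_pop <- summary(fit_full)$coefficients["Total_Population", "t value"]
c(F = cmp$F[2], t_squared = t_pop^2)

# Rescaling to millions multiplies the coefficient by 1e6 and leaves the fit unchanged
fit_scaled <- update(fit_reduced, . ~ . + I(Total_Population / 1e6))
c(per_person = unname(coef(fit_full)["Total_Population"]),
  per_million = unname(tail(coef(fit_scaled), 1)))
```

Adjusted R-squared or AIC from the two fits could be compared in the same spirit.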

Overall, this model provides strong evidence that both time and cause of death are important for understanding changes in age-adjusted death rates. The main significance of this result is that it supports the idea that mortality patterns are not random, but instead are shaped by both long-term trends and meaningful differences across disease categories. A useful follow-up question would be whether the effect of Year differs depending on the Cause.Name, which could be explored in a future model using interaction terms.
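
A minimal sketch of that follow-up is shown here, assuming the interaction is specified as `Year * Cause.Name` and tested against the additive model with `anova()`. The data frame `sim` is a hypothetical stand-in for `data`, simulated with cause-specific slopes that are made up for the example, so the result says nothing about the real dataset.

```r
# Hypothetical stand-in data with a different (made-up) Year slope for each cause
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
slopes <- c("All causes" = -6, "Cancer" = -2, "Heart disease" = -4, "Stroke" = -1)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) +
  unname(slopes[sim$Cause.Name]) * (sim$Year - 1999) + rnorm(nrow(sim), sd = 20)

# Additive model: one Year slope shared by every cause
fit_add <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population,
              data = sim)

# Interaction model: each cause gets its own Year slope
fit_int <- lm(Age.adjusted.Death.Rate ~ Year * Cause.Name + Total_Population,
              data = sim)

# F-test of whether the extra slope terms improve the fit
cmp <- anova(fit_add, fit_int)
cmp
```

With k cause levels the interaction adds k - 1 slope terms, so the comparison uses k - 1 numerator degrees of freedom.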

Diagnostics of Model

Visualization Plot 1: Residuals vs. Fitted Values

# Creates a residuals versus fitted values plot to check linearity and whether the residual spread stays fairly constant
plot(lm_week11, which = 1)

Residuals vs. Fitted Values Summary and Explanation

This plot is used to assess whether the linearity assumption is reasonable and whether the residuals have roughly constant variance across the range of fitted values. Ideally, the points should appear randomly scattered around the horizontal zero line without a strong pattern or major change in spread.

In this plot, the residuals are generally centered around zero, but they appear in distinct vertical clusters rather than one continuous cloud. This makes sense since Cause.Name is a categorical predictor with several levels, which creates groups of fitted values. The spread of the residuals is noticeably wider for the larger fitted values, especially in the far-right cluster, which suggests some heteroscedasticity, meaning the variance is not completely constant across the fitted range.

Overall, the plot suggests that the model captures the main structure of the data reasonably well, but the constant variance assumption is not perfectly satisfied. This is an important issue to note since it means prediction errors are somewhat larger for some fitted-value groups than for others, even though the model still appears to fit the data strongly overall.
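
To put a number on the visual impression of unequal spread, one simple option is to compare the standard deviation of the residuals within each Cause.Name group. The sketch below does this on a simulated data frame, `sim`, a hypothetical stand-in for `data` in which the noise level is deliberately larger for the highest-rate group. The group names and noise levels are assumptions made up for the example.

```r
# Hypothetical stand-in data with cause-specific (made-up) noise levels
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
noise_sd <- c("All causes" = 40, "Cancer" = 12, "Heart disease" = 15, "Stroke" = 5)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) +
  rnorm(nrow(sim), sd = unname(noise_sd[sim$Cause.Name]))

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population, data = sim)

# Residual standard deviation within each cause group;
# roughly equal values would be consistent with constant variance
res_sd <- tapply(residuals(fit), sim$Cause.Name, sd)
round(res_sd, 1)

# Ratio of the largest to the smallest group spread
max(res_sd) / min(res_sd)
```

A formal alternative, if the lmtest package is available, would be a Breusch-Pagan test via `lmtest::bptest()`.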

Visualization Plot 2: Normal Q-Q Plot

# Creates a normal Q-Q plot to assess whether the residuals appear approximately normal
plot(lm_week11, which = 2)

Normal Q-Q Plot Summary and Explanation

This plot is used to assess whether the residuals from the model are approximately normally distributed. If the normality assumption were well satisfied, the points would fall roughly along the diagonal reference line.

In this plot, the points deviate substantially from the reference line in both tails, creating a strong curved pattern rather than a straight line. This indicates that the residuals are not normally distributed and that their distribution has heavier tails, meaning more extreme residual values, than would be expected under a normal distribution.

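This is an important issue to note since the t-tests and confidence intervals reported in the model summary assume normally distributed errors, so heavy tails make them approximate rather than exact. However, given the very large sample size (10,296 observations, based on the 10,283 residual degrees of freedom plus 13 estimated coefficients), this violation may be less damaging to the overall usefulness of the model than it would be in a much smaller dataset, although prediction intervals for individual observations remain sensitive to it.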
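
A simple numeric companion to the Q-Q plot is to count how many standardized residuals fall beyond three standard deviations and compare that share with what a normal distribution would give. The sketch below does this on a simulated data frame, `sim`, which is a hypothetical stand-in for `data` with heavy-tailed noise drawn from a t distribution. The distribution and all values are assumptions made up for the example.

```r
# Hypothetical stand-in data with heavy-tailed (t-distributed) noise
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) + 20 * rt(nrow(sim), df = 3)

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population, data = sim)

# Standardized residuals put every observation on a common scale
z <- rstandard(fit)

# Share of residuals beyond +/- 3 versus the share expected under normality
tail_share <- mean(abs(z) > 3)
normal_share <- 2 * pnorm(-3)
c(observed = tail_share, expected_if_normal = normal_share)
```

An observed share well above the normal benchmark of about 0.27% would point to the same heavy tails that the Q-Q plot shows.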

Visualization Plot 3: Cook’s Distance

# Creates a Cook's distance plot to identify observations that may be especially influential
plot(lm_week11, which = 4)

Cook’s Distance Summary and Explanation

This plot is used to identify influential observations, or points that have a relatively large impact on the fitted regression model. Larger Cook’s distance values indicate observations that, if removed, could change the fitted model more noticeably.

In this plot, most observations have Cook’s distance values very close to zero, which suggests that the majority of the data points have little individual influence on the model. However, there are several noticeably larger spikes near the end of the observation range, including observations such as 10293, 10295, and 10296, indicating that these points are more influential than the rest.

Even so, the absolute Cook’s distance values remain quite small overall, all far below commonly cited thresholds such as 1, so these observations do not appear to be extreme enough to invalidate the model. This means that while a small number of points deserve attention, there is not strong evidence that the model’s main conclusions are being driven by only a few influential observations.
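
The plot can be supplemented by pulling the Cook's distance values out directly and checking them against explicit cutoffs. The sketch below uses the threshold of 1 mentioned above along with the stricter 4/n rule of thumb, which would be about 0.0004 for the 10,296 observations implied by the summary output. It runs on a simulated data frame, `sim`, a hypothetical stand-in for `data`, so the flagged rows have no connection to observations 10293, 10295, or 10296.

```r
# Hypothetical stand-in data; the real `data` object is not reproduced here
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) + rnorm(nrow(sim), sd = 20)

fit <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population, data = sim)

# One Cook's distance per observation
cd <- cooks.distance(fit)
n <- nobs(fit)

# Rows exceeding each rule of thumb
flag_4n <- which(cd > 4 / n)
flag_1 <- which(cd > 1)
c(above_4_over_n = length(flag_4n), above_1 = length(flag_1))

# The three most influential observations and their values
head(sort(cd, decreasing = TRUE), 3)
```

Refitting the model without the flagged rows and comparing coefficients would show whether they change any conclusion.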

Visualization Plot 4: Residuals vs. X Values

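# Load lindia for gg_resX() and ggplot2 for geom_smooth()
library(lindia)
library(ggplot2)

# Creates a residuals versus x-values plot to assess linearity and whether residual spread changes across Year
plots <- gg_resX(lm_week11, plot.all = FALSE)

# Add a smooth trend line to the Year panel; se = FALSE hides the confidence band
plots$Year +
  geom_smooth(se = FALSE)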

Residuals vs. X Values Summary and Explanation

This plot is used to examine whether the residuals remain centered around zero across the predictor Year and to check for any visible non-linear pattern or major change in spread over time. Ideally, the residuals should appear randomly scattered around the zero line, and the smooth curve should stay fairly flat.

In this plot, the residuals are generally distributed around 0 for each year, and the blue smooth line stays close to the horizontal reference line across most of the time range. There is a slight dip in the middle years and a mild rise again toward the later years, which suggests a small amount of curvature, but the pattern is not especially strong.

The spread of the residuals appears somewhat wider in some years than others, though not drastically so. This suggests that the relationship with Year is modeled reasonably well overall, but the plot also hints that the effect of time may not be perfectly linear and that there may be mild variation in error spread across years.
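
One way to test whether the mild curvature matters is to add a squared Year term and compare the two fits. The sketch below assumes a centered quadratic term, `I((Year - mean(Year))^2)`, which is only one of several reasonable choices. It runs on a simulated data frame, `sim`, a hypothetical stand-in for `data` with a small amount of curvature built in, so the result does not describe the real dataset.

```r
# Hypothetical stand-in data with mild (made-up) curvature in Year
set.seed(11)
causes <- c("All causes" = 800, "Cancer" = 180, "Heart disease" = 200, "Stroke" = 45)
sim <- expand.grid(Year = 1999:2017, Cause.Name = names(causes),
                   State = 1:10, stringsAsFactors = FALSE)
sim$Total_Population <- round(runif(nrow(sim), 5e5, 4e7))
sim$Age.adjusted.Death.Rate <- unname(causes[sim$Cause.Name]) -
  1.5 * (sim$Year - 1999) + 0.15 * (sim$Year - 2008)^2 +
  rnorm(nrow(sim), sd = 20)

# Linear-in-Year model, matching the structure of the Week 11 formula
fit_lin <- lm(Age.adjusted.Death.Rate ~ Year + Cause.Name + Total_Population,
              data = sim)

# Same model plus a centered squared Year term
fit_quad <- update(fit_lin, . ~ . + I((Year - mean(Year))^2))

# F-test of whether the squared term improves the fit
cmp <- anova(fit_lin, fit_quad)
cmp
```

Centering Year before squaring keeps the linear and squared terms from being nearly collinear.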

Visualization Plot 5: Residual Histogram

# Creates a histogram of residuals to assess whether the residuals appear approximately normal
hist(
  residuals(lm_week11),
  main = "Histogram of Residuals",
  xlab = "Residuals",
  col = "lightgray",
  border = "black"
)

Residual Histogram Summary and Explanation

This plot is used to assess whether the residuals from the Week 11 model are approximately normally distributed. Ideally, the histogram would appear roughly bell-shaped and centered around zero, with most residuals clustered near the middle and fewer observations in the tails.

In this histogram, the residuals are heavily concentrated near the center, which suggests that the model does fit many observations fairly closely. However, the distribution is not perfectly bell-shaped, as there are noticeable tails on both sides and some asymmetry, with the right tail extending farther than the left.

This supports the conclusion from the Normal Q-Q plot that the residuals are not perfectly normal, even though most are still centered near zero. The main issue highlighted here is that the model has some extreme residual values, which means a small subset of observations is not fitted as closely as the bulk of the data.

Coefficient Interpretation

The coefficient for Year is approximately -1.552, which means that, holding Cause.Name and Total_Population constant, the model predicts that Age.adjusted.Death.Rate decreases by about 1.552 units (for an age-adjusted rate, typically deaths per 100,000 population) for each one-year increase in Year. In other words, after accounting for differences in cause of death and population size, the model suggests that age-adjusted death rates tend to decline over time. Since the p-value for Year is less than 2e-16, this negative relationship is statistically significant and provides very strong evidence that later years are associated with lower age-adjusted death rates in this dataset.
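
The interpretation above can be extended with a little arithmetic on the values printed in the summary table. The sketch below takes the reported estimate, standard error, and residual degrees of freedom for Year and works out an approximate 95% confidence interval and the implied change over a decade. The variable names are introduced here for illustration only.

```r
# Values copied from the Year row of the summary output
est <- -1.552     # estimated change in rate per one-year increase
se <- 6.057e-02   # standard error of that estimate
df <- 10283       # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
ci <- est + c(-1, 1) * qt(0.975, df) * se
round(ci, 3)

# Implied change over ten years, holding the other predictors fixed
ten_year_change <- est * 10
ten_year_change
```

The interval runs from roughly -1.67 to -1.43 and stays well away from zero, which agrees with the very small p-value, and the ten-year change works out to about -15.5 units. On the fitted object itself, `confint(lm_week11, "Year")` would give the same interval without manual arithmetic.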

Week 11 Data Dive Summary

Compared with Week 9, this Week 11 model builds on a very similar linear regression framework while adding Total_Population as an additional explanatory variable. Both models used Age.adjusted.Death.Rate as the response and showed that Cause.Name is a major driver of variation in mortality rates. The Week 11 model produced an even stronger overall fit, but because \(R^2\) can never decrease when a predictor is added, and Total_Population was not significant at the 0.05 level (p = 0.0623), the added predictor appears to capture only a small amount of additional structure in the data. At the same time, the main substantive conclusion remained consistent: differences across causes of death explain far more variation in age-adjusted death rates than time alone.

The diagnostic results for Week 11 were also similar in spirit to Week 9, in that the model was strong overall but still showed some imperfections. As in Week 9, the visualizations suggested that the model is useful and informative, while also revealing issues such as non-constant variance, non-normal residuals, and a small number of somewhat more influential observations. Overall, Week 11 reinforces the main findings from Week 9 while extending the model slightly, showing that even a very strong regression model should still be evaluated carefully through diagnostic checks before drawing final conclusions.