How has the death rate for major causes of death in the United States changed over time?
Understanding how death patterns change is essential for evaluating public health trends and identifying which causes of death are becoming more or less prevalent over time. This project uses the NCHS – Leading Causes of Death: United States dataset, which contains 10,868 observations and 6 variables; each row records the number of deaths and the age-adjusted death rate for a specific cause in a given U.S. state and year. Although the dataset includes variables such as state, number of deaths, cause groups, and detailed cause names, this analysis focuses on three variables central to the research question: Year, Cause Name, and Age-adjusted Death Rate. Year allows trends to be measured over time, Cause Name identifies major causes of death such as heart disease, cancer, and accidents, and Age-adjusted Death Rate serves as the continuous outcome of interest.
The dataset is maintained by the National Center for Health Statistics (NCHS) and is publicly accessible through the CDC’s open data portal. It can be accessed at: https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/xkkf-xrst
In this section, I conduct exploratory data analysis (EDA) to investigate how death rates for major causes of death in the United States have changed over time. I begin by cleaning the dataset, selecting the variables relevant to this question—Year, Cause Name, and Age-adjusted Death Rate—and removing missing values to ensure accuracy. I then summarize the death rate data and examine trends by calculating average death rates across years and across different causes. To visualize these patterns, I create plots that display how death rates vary over time and how major causes compare in terms of average mortality. Throughout this analysis, I use several dplyr functions, including select(), mutate(), group_by(), and summarise(), to transform, organize, and explore the dataset in preparation for regression modeling.
Loading Libraries and Dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("NCHS_-_Leading_Causes_of_Death__United_States.csv")
## Rows: 10868 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): 113 Cause Name, Cause Name, State
## dbl (3): Year, Deaths, Age-adjusted Death Rate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
## Rows: 10,868
## Columns: 6
## $ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20…
## $ `113 Cause Name` <chr> "Accidents (unintentional injuries) (V01-X59…
## $ `Cause Name` <chr> "Unintentional injuries", "Unintentional inj…
## $ State <chr> "United States", "Alabama", "Alaska", "Arizo…
## $ Deaths <dbl> 169936, 2703, 436, 4184, 1625, 13840, 3037, …
## $ `Age-adjusted Death Rate` <dbl> 49.4, 53.8, 63.7, 56.2, 51.8, 33.2, 53.6, 53…
Clean the Dataset
clean_df <- df %>%
  select(Year, `Cause Name`, State, `Age-adjusted Death Rate`) %>%
  drop_na() %>%
  mutate(
    Cause = as.factor(`Cause Name`),
    State = as.factor(State)
  )
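Before dropping rows, it can help to see how many missing values each column actually contains. The following is a minimal base R sketch on a small made-up data frame; `toy` and its `Rate` column are hypothetical stand-ins for `df` and `Age-adjusted Death Rate`, and the values are invented for illustration.

```r
# Made-up stand-in for the raw data (values are illustrative, not real rates).
toy <- data.frame(
  Year  = c(2016, 2017, 2017),
  Cause = c("Cancer", "Cancer", "Stroke"),
  Rate  = c(150, NA, 40)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows that a complete-case filter would remove.
rows_dropped <- nrow(toy) - nrow(na.omit(toy))

print(na_per_column)
print(rows_dropped)
```

Applied to the real data, a count of zero would mean the missing-value step leaves the dataset unchanged.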
Summary Statistics
clean_df %>%
  summarise(
    mean_rate = mean(`Age-adjusted Death Rate`),
    sd_rate = sd(`Age-adjusted Death Rate`),
    min_rate = min(`Age-adjusted Death Rate`),
    max_rate = max(`Age-adjusted Death Rate`)
  )
## # A tibble: 1 × 4
## mean_rate sd_rate min_rate max_rate
## <dbl> <dbl> <dbl> <dbl>
## 1 128. 224. 2.6 1087.
Mean Death Rate by Cause
clean_df %>%
  group_by(Cause) %>%
  summarise(avg_rate = mean(`Age-adjusted Death Rate`)) %>%
  arrange(desc(avg_rate))
## # A tibble: 11 × 2
## Cause avg_rate
## <fct> <dbl>
## 1 All causes 799.
## 2 Heart disease 198.
## 3 Cancer 179.
## 4 Stroke 45.9
## 5 CLRD 44.6
## 6 Unintentional injuries 43.4
## 7 Alzheimer's disease 25.0
## 8 Diabetes 23.4
## 9 Influenza and pneumonia 18.2
## 10 Kidney disease 14.1
## 11 Suicide 13.4
Plot Average Death Rate Over Time
clean_df %>%
  group_by(Year) %>%
  summarise(mean_rate = mean(`Age-adjusted Death Rate`)) %>%
  ggplot(aes(x = Year, y = mean_rate)) +
  geom_line(color = "blue") +
  labs(title = "Average Age-Adjusted Death Rate Over Time",
       x = "Year",
       y = "Mean Death Rate")
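The plot above averages over every cause (including "All causes") and every state, so a single line can hide causes that move in opposite directions. As a rough sketch of a per-cause view, the base R code below builds a Year by Cause table from national rows only. The data frame `toy`, its `Rate` column, and all values are made up for illustration; they stand in for `clean_df` and `Age-adjusted Death Rate`.

```r
# Synthetic stand-in for clean_df (values are invented for illustration only).
toy <- expand.grid(
  Year  = 1999:2003,
  Cause = c("Heart disease", "Suicide"),
  State = c("United States", "Alabama"),
  stringsAsFactors = FALSE
)
toy$Rate <- ifelse(toy$Cause == "Heart disease",
                   260 - 5 * (toy$Year - 1999),    # a falling rate
                   10 + 0.5 * (toy$Year - 1999))   # a rising rate

# Keep national rows only, then tabulate the rate by Year (rows) and Cause (columns).
national <- toy[toy$State == "United States", ]
by_cause <- tapply(national$Rate, list(national$Year, national$Cause), mean)

# Change from the first to the last year, separately for each cause.
change <- by_cause[nrow(by_cause), ] - by_cause[1, ]

print(by_cause)
print(change)
```

In this toy example one cause falls while the other rises, which is exactly the kind of pattern an all-cause average would blur.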
To address the research question, “How has the death rate for major causes of death in the United States changed over time?”, I fit a multiple linear regression model with Age-adjusted Death Rate as the dependent variable. The predictors are Year (numeric) and Cause (categorical), so the model estimates one time trend shared by all causes together with a separate level for each cause. Before fitting the model, I restricted the dataset to rows where State is “United States” so that the regression reflects national rather than state-level trends. This leaves one observation per cause per year at the national level, consistent with the goal of evaluating national mortality trends. The model is fit with the lm() function. Because Cause has 11 levels, R represents it with 10 indicator (dummy) variables, using the alphabetically first level, “All causes”, as the reference category: Age-adjusted Death Rate = β0 + β1(Year) + γ1(Cause = Alzheimer's disease) + … + γ10(Cause = Unintentional injuries) + ϵ, where each γ is the difference in level between that cause and “All causes”.
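To make the role of the categorical Cause term concrete, here is a small sketch of how R expands a factor into indicator columns when it builds the design matrix. The data frame `toy` is a made-up three-cause example, not the project data; `model.matrix()` and `relevel()` are base R functions.

```r
# Tiny made-up example: three causes observed in two years.
toy <- data.frame(
  Year  = c(1999, 2000, 1999, 2000, 1999, 2000),
  Cause = factor(c("All causes", "All causes", "Cancer", "Cancer", "Stroke", "Stroke"))
)

# Design matrix that lm() would use for the formula ~ Year + Cause.
X <- model.matrix(~ Year + Cause, data = toy)
print(colnames(X))

# The first factor level has no column of its own: it is the reference category.
print(levels(toy$Cause)[1])

# Choosing a different reference level changes which column is omitted.
toy$Cause2 <- relevel(toy$Cause, ref = "Cancer")
X2 <- model.matrix(~ Year + Cause2, data = toy)
print(colnames(X2))
```

Each indicator coefficient is then read as a difference from the omitted reference level, which is why the choice of reference matters for interpretation but not for the fitted values.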
The model results show a strong negative relationship between Year and Age-adjusted Death Rate, which aligns with the exploratory plot, where the average death rate declines from 1999 to 2017. The negative coefficient for Year indicates that, averaged over the cause categories, the national age-adjusted death rate decreases each year, meaning that death rates from major causes have fallen over time. The coefficients for each cause reflect the differences observed in the descriptive statistics. For instance, all causes, heart disease, and cancer have much higher mean death rates, while causes like influenza or kidney disease have substantially lower rates. The reference category is “All causes”, so every cause coefficient is negative: each one measures how far that cause's rate sits below the all-cause rate. The significant cause coefficients confirm these differences and quantify their size. All in all, the regression supports the conclusion that American death rates have generally decreased over time, although the magnitude of those rates varies widely across causes.
Create Modeling Dataset
model_df <- clean_df %>%
filter(State == "United States") %>%
select(Year, Cause, `Age-adjusted Death Rate`)
Fit the Regression Model
model_lm <- lm(`Age-adjusted Death Rate` ~ Year + Cause, data = model_df)
Model Summary
summary(model_lm)
##
## Call:
## lm(formula = `Age-adjusted Death Rate` ~ Year + Cause, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.764 -8.285 -0.679 7.081 75.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4316.9122 452.5282 9.540 < 2e-16 ***
## Year -1.7595 0.2254 -7.808 3.37e-13 ***
## CauseAlzheimer's disease -760.0211 5.7894 -131.277 < 2e-16 ***
## CauseCancer -606.8421 5.7894 -104.819 < 2e-16 ***
## CauseCLRD -741.3789 5.7894 -128.057 < 2e-16 ***
## CauseDiabetes -761.0053 5.7894 -131.447 < 2e-16 ***
## CauseHeart disease -582.4211 5.7894 -100.601 < 2e-16 ***
## CauseInfluenza and pneumonia -765.7526 5.7894 -132.267 < 2e-16 ***
## CauseKidney disease -769.8947 5.7894 -132.983 < 2e-16 ***
## CauseStroke -738.6053 5.7894 -127.578 < 2e-16 ***
## CauseSuicide -772.1105 5.7894 -133.365 < 2e-16 ***
## CauseUnintentional injuries -744.3579 5.7894 -128.572 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.84 on 197 degrees of freedom
## Multiple R-squared: 0.9937, Adjusted R-squared: 0.9934
## F-statistic: 2828 on 11 and 197 DF, p-value: < 2.2e-16
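As a worked check on how to read these coefficients, the sketch below plugs the rounded estimates from the summary output into the fitted equation by hand. The helper `predict_rate()` is a hypothetical name introduced only for this illustration; in practice `predict(model_lm, newdata = ...)` would do the same job from the unrounded fit.

```r
# Rounded estimates copied from the summary() output above.
b0      <- 4316.9122   # intercept (reference category: "All causes")
b_year  <- -1.7595     # common slope per calendar year
b_heart <- -582.4211   # offset for Heart disease
b_suic  <- -772.1105   # offset for Suicide

# Hypothetical helper: fitted rate for a given year and cause offset.
predict_rate <- function(year, cause_offset = 0) {
  b0 + b_year * year + cause_offset
}

all_2017   <- predict_rate(2017)
heart_2017 <- predict_rate(2017, b_heart)
suic_2017  <- predict_rate(2017, b_suic)

print(round(c(all = all_2017, heart = heart_2017, suicide = suic_2017), 2))
```

With these numbers the fitted all-cause rate for 2017 is about 768.0 and the fitted heart disease rate is about 185.6, while the fitted suicide rate is about -4.1. A negative death rate is impossible, which illustrates a limitation of forcing every cause to share the same slope of -1.76 per year.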
To decide whether the multiple linear regression model is appropriate for answering the research question, I examined five regression assumptions: linearity, independence, homoscedasticity, normality of residuals, and absence of multicollinearity. Linearity requires that the predictors have a linear relationship with the age-adjusted death rate. Since the model includes Year as a numeric predictor and Cause as a categorical factor, linearity mainly applies to the time trend. The regression output shows a strong negative coefficient for Year (–1.76, p < 0.001), which indicates a downward trend and agrees with the exploratory plot; a significant slope does not by itself show that the trend is linear, so the Residuals vs. Fitted plot remains the relevant check. Independence is assumed on the grounds that each row represents a unique cause–year combination at the national level. This is only approximate: rates for the same cause in consecutive years are likely to be correlated, and “All causes” aggregates the other categories. Homoscedasticity and normality were assessed using residual diagnostic plots. The residual standard error (17.84) and the residual quartiles (–8.29 and 7.08 around a median of –0.68) suggest a reasonably symmetric central spread, although the maximum (75.84) is larger in magnitude than the minimum (–48.76), so the plots are needed for confirmation. Finally, multicollinearity was checked using VIF scores; because every cause is observed in every year, Year is uncorrelated with Cause, and the reported GVIF values are 1 for both predictors.
Diagnostic plots generated from the model provide visual evidence for evaluating these assumptions. The Residuals vs. Fitted plot helps detect non-linearity and heteroscedasticity; a random scatter of points without patterns supports both assumptions. The Normal Q–Q plot is used to assess whether residuals follow a straight line, indicating approximate normality. The Scale–Location plot provides another check for homoscedasticity by displaying the spread of residuals across fitted values. Finally, the Residuals vs. Leverage plot identifies influential observations that may have undue influence on the model’s slope estimates. Variance Inflation Factors (VIF) are computed to assess multicollinearity among predictors; VIF values below 5 indicate no concerning levels of collinearity. Together, these diagnostics guide whether the model meets the assumptions necessary for valid inference.
Diagnostic Plots
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model_lm)
par(mfrow = c(1, 1))
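The four default plots pool all causes together, so they do not directly show whether residuals drift over time within a single cause. The following is a minimal sketch of such a check on synthetic data. The causes "A" and "B", the `Rate` column, and all values are made up; the point is only to show what the residuals look like when two causes have different true slopes but the model allows one common slope.

```r
# Synthetic data: cause A falls by 4 per year, cause B rises by 1 per year.
toy <- expand.grid(Year = 1999:2017, Cause = c("A", "B"))
toy$Rate <- ifelse(toy$Cause == "A",
                   200 - 4 * (toy$Year - 1999),
                   20 + 1 * (toy$Year - 1999))

# Additive model with a single shared slope, as in the main analysis.
fit <- lm(Rate ~ Year + Cause, data = toy)
toy$resid <- residuals(fit)

# Slope of the residuals against Year, computed separately for each cause.
resid_slopes <- sapply(split(toy, toy$Cause), function(d) {
  coef(lm(resid ~ Year, data = d))[["Year"]]
})

print(round(resid_slopes, 2))
```

Residual slopes close to zero for every cause would support the shared-slope assumption. Clearly nonzero slopes, as in this toy example, suggest that the time trend differs by cause and that the residuals are not independent of Year within a cause.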
Check Multicollinearity (VIF)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(model_lm)
## GVIF Df GVIF^(1/(2*Df))
## Year 1 1 1
## Cause 1 10 1
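A GVIF of exactly 1 is what a balanced design produces: when every cause appears once in every year, the Year column is uncorrelated with each cause indicator. The sketch below demonstrates this on a small balanced grid; the three causes chosen here are only an illustrative subset, and `grid` is a hypothetical name.

```r
# Balanced grid: every cause is observed once in every year.
grid <- expand.grid(
  Year  = 1999:2017,
  Cause = factor(c("All causes", "Cancer", "Stroke"))
)

# Design matrix with Year and the cause indicator columns.
X <- model.matrix(~ Year + Cause, data = grid)

# Correlation between Year and each cause indicator.
cors <- cor(X[, "Year"], X[, c("CauseCancer", "CauseStroke")])
print(cors)
```

The correlations are zero up to floating-point error, so Year carries no information about Cause and the variance of neither coefficient is inflated. If some cause-year combinations were missing, the design would no longer be balanced and the GVIF values could rise above 1.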
In conclusion, the multiple linear regression analysis shows clear evidence that age-adjusted death rates for major causes of death in the United States have declined over time. The negative and statistically significant coefficient for Year (–1.76, p < 0.001) indicates that, averaged over the cause categories, national mortality rates decrease by roughly 1.76 deaths per 100,000 people each year, a finding consistent with the downward trend observed in the exploratory plot. The model also revealed substantial and significant differences across causes of death: conditions such as heart disease, cancer, chronic lower respiratory disease, and stroke had much higher death rates than causes like diabetes, kidney disease, influenza, and suicide. With an R² of 0.9937, the model explains nearly all variation in death rates using only Year and Cause, which largely reflects the very large differences in level between cause categories rather than the time trend alone. However, some diagnostic plots showed mild deviations from homoscedasticity and normality, indicating that while the model fits extremely well, it is not without limitations.
Future research could extend this analysis by allowing time trends to vary across specific causes using a Year × Cause interaction, which would reveal whether some causes are improving faster than others. Additional predictors—such as demographic characteristics, regional factors, or measures of healthcare access—could further enhance the model’s explanatory power and address remaining residual variability. Alternative modeling approaches, such as generalized additive models (GAMs) for nonlinear trends or regularized regression methods, could also help address potential violations of model assumptions. Incorporating these extensions would provide a more nuanced understanding of how specific causes of death evolve over time and could offer more actionable insights for public health policy.
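As a sketch of the first extension, the code below fits both the additive model and a Year × Cause interaction model and compares them with an F-test. It runs on simulated data: the causes "A", "B", and "C", their starting levels and slopes, and the noise level are all assumptions made up for illustration, and `additive_fit` and `interact_fit` are hypothetical names.

```r
set.seed(1)

# Simulated national series with a different true slope for each cause.
toy    <- expand.grid(Year = 1999:2017, Cause = c("A", "B", "C"))
start  <- c(A = 250, B = 12, C = 60)
slopes <- c(A = -4, B = 0.5, C = -1)
cause  <- as.character(toy$Cause)
toy$Rate <- unname(start[cause] + slopes[cause] * (toy$Year - 1999)) +
  rnorm(nrow(toy), sd = 1)

# Shared slope versus cause-specific slopes.
additive_fit <- lm(Rate ~ Year + Cause, data = toy)
interact_fit <- lm(Rate ~ Year * Cause, data = toy)

# F-test: does allowing separate slopes improve the fit?
cmp <- anova(additive_fit, interact_fit)
print(cmp)

# Cause-specific slopes: the Year coefficient is the slope for the reference
# cause (A); the interaction terms are differences from that slope.
b <- coef(interact_fit)
cause_slopes <- c(A = b[["Year"]],
                  B = b[["Year"]] + b[["Year:CauseB"]],
                  C = b[["Year"]] + b[["Year:CauseC"]])
print(round(cause_slopes, 2))
```

On the real data the same comparison would use `Age-adjusted Death Rate` as the response and `model_df` as the data. A small p-value in the comparison would indicate that the causes do not share one trend.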
R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/