Heart Disease Death Rates in the United States

Author

Naomi Surendorj

This image presents trends in heart disease mortality in the United States across several decades.

Source: American Heart Association. Journal of the American Heart Association (2024) https://www.ahajournals.org/doi/10.1161/JAHA.124.038644

Introduction

For the final project, I chose a dataset on heart disease mortality among adults aged 35 and older in the United States. Heart disease is one of the leading causes of death in the country, and it continues to affect many families, including my own. There is a history of heart disease in my family, which is why I was interested in learning more about how heart disease impacts different groups of people and regions across the U.S.

The dataset used in this project comes from the Centers for Disease Control and Prevention (CDC), through the National Center for Chronic Disease Prevention and Health Promotion. The data cover the years 2019 to 2021 and include information on heart disease deaths by state, sex, race, and type of heart disease. The dataset also includes the total number of deaths and age-adjusted death rates, which help compare different groups more fairly.

The goal of this project is to explore how heart disease mortality rates vary across states and demographic groups and to identify factors that are associated with higher mortality rates.

Variables

Name              Description                               Type
Year              Year the data was recorded (2019-2021)    Categorical
LocationDesc      U.S. state or territory                   Categorical
Topic             Type of heart disease                     Categorical
Stratification1   Sex                                       Categorical
Stratification2   Race or ethnic group                      Categorical
Data_Value        Age-adjusted heart disease death rate     Quantitative
Deaths            Total number of heart disease deaths      Quantitative

Questions I Want to Explore

  • How do heart disease death rates differ across U.S. states?

  • Are there differences in death rates by sex and race?

  • Do certain types of heart disease have higher death rates than others?

  • What factors are related to higher heart disease death rates?

How the Data Was Collected

The data was collected by the Centers for Disease Control and Prevention (CDC). The CDC gathers heart disease death data using death certificate records that are reported by each U.S. state and territory. When a person dies, the cause of death is listed on the death certificate, and this information is later sent to the CDC.

The CDC then groups these records by year, state, sex, race, and type of heart disease. The dataset also includes age-adjusted death rates. These rates are calculated so that states and groups with different age distributions can be compared more fairly.
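To make the idea of age adjustment concrete, here is a minimal sketch of direct age standardization. All numbers, the age groups, and the standard population weights are hypothetical and chosen only for illustration. The CDC's actual standard population and calculation steps are not reproduced here.

```r
# Hypothetical illustration of direct age standardization.
# None of these numbers come from the CDC file.

# Age-specific death rates per 100,000 (assumed identical in both places)
rates <- c("35-54" = 50, "55-74" = 400, "75+" = 3000)

# Share of each place's population in each age group (assumed)
pop_younger_place <- c(0.70, 0.25, 0.05)
pop_older_place   <- c(0.40, 0.35, 0.25)

# Shares of an assumed standard population (must sum to 1)
std_weights <- c(0.50, 0.35, 0.15)

# Crude rate: weight the age-specific rates by the place's own age mix
crude_younger <- sum(rates * pop_younger_place)
crude_older   <- sum(rates * pop_older_place)

# Age-adjusted rate: weight the same rates by the standard population
adjusted_younger <- sum(rates * std_weights)
adjusted_older   <- sum(rates * std_weights)

c(crude_younger = crude_younger, crude_older = crude_older,
  adjusted_younger = adjusted_younger, adjusted_older = adjusted_older)
```

In this sketch the crude rates differ a lot only because one place has an older population, while the age-adjusted rates are equal because the underlying age-specific rates are the same.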

Loading Libraries and Dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
heart_data <- read_csv(
  "~/Downloads/final project /Heart_Disease_Mortality_Data_Among_US_Adults__35___by_State_Territory_and_County___2019-2021.csv"
)
Rows: 78792 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): LocationAbbr, LocationDesc, GeographicLevel, DataSource, Class, To...
dbl  (4): Year, Data_Value, Y_lat, X_lon

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heart_data)
# A tibble: 6 × 21
   Year LocationAbbr LocationDesc     GeographicLevel DataSource Class     Topic
  <dbl> <chr>        <chr>            <chr>           <chr>      <chr>     <chr>
1  2020 AK           Denali           County          NVSS       Cardiova… Hear…
2  2020 CA           California       State           NVSS       Cardiova… Hear…
3  2020 CO           Park County      County          NVSS       Cardiova… Hear…
4  2020 FL           Walton County    County          NVSS       Cardiova… Hear…
5  2020 GA           Whitfield County County          NVSS       Cardiova… Hear…
6  2020 GA           Ware County      County          NVSS       Cardiova… Hear…
# ℹ 14 more variables: Data_Value <dbl>, Data_Value_Unit <chr>,
#   Data_Value_Type <chr>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, StratificationCategory1 <chr>,
#   Stratification1 <chr>, StratificationCategory2 <chr>,
#   Stratification2 <chr>, TopicID <chr>, LocationID <chr>, Y_lat <dbl>,
#   X_lon <dbl>, Georeference <chr>

Below, I keep only the variables that I will use later.

heart <- heart_data |>
  select(
    Year,
    LocationDesc,
    GeographicLevel,
    Topic,
    Stratification1,
    Stratification2,
    Data_Value)

Cleaning the Data

I cleaned the dataset. I kept only state-level data, removed rows with missing death rates, and converted the categorical variables to factors.

heart_clean <- heart |>
  filter(GeographicLevel == "State") |>
  filter(!is.na(Data_Value)) |>
  mutate(
    Year = factor(Year),
    LocationDesc = factor(LocationDesc),
    Topic = factor(Topic),
    Stratification1 = factor(Stratification1),
    Stratification2 = factor(Stratification2))

I checked to make sure the data looks correct.

glimpse(heart_clean)
Rows: 975
Columns: 7
$ Year            <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, …
$ LocationDesc    <fct> California, Mississippi, Nevada, Iowa, Alabama, Alabam…
$ GeographicLevel <chr> "State", "State", "State", "State", "State", "State", …
$ Topic           <fct> Heart Disease Mortality, Heart Disease Mortality, Hear…
$ Stratification1 <fct> Male, Female, Male, Overall, Male, Female, Overall, Ma…
$ Stratification2 <fct> More than one race, Asian, More than one race, Asian, …
$ Data_Value      <dbl> 230.1, 167.6, 281.6, 190.6, 164.9, 397.3, 226.8, 170.1…
summary(heart_clean$Data_Value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   26.3   163.6   257.2   278.4   369.6   905.4 

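Summary of Death Rates by Year

Here I create a summary of death rates by year. Every record in the cleaned data is labeled with a single year value (2020), so the summary has only one row and cannot show how rates change from year to year.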

heart_clean |>
  group_by(Year) |>
  summarise(
    avg_rate = mean(Data_Value),
    min_rate = min(Data_Value),
    max_rate = max(Data_Value),
    n = n())
# A tibble: 1 × 5
  Year  avg_rate min_rate max_rate     n
  <fct>    <dbl>    <dbl>    <dbl> <int>
1 2020      278.     26.3     905.   975

Multiple Linear Regression

I run a multiple linear regression model to study heart disease death rates. The outcome variable in this model is the age-adjusted heart disease death rate. The predictors used in this model are sex and race. These variables are included to see how heart disease death rates differ across different groups of people.

heart_clean <- heart_clean |>
  mutate(
    Stratification1 = droplevels(Stratification1),
    Stratification2 = droplevels(Stratification2)
  )

# Multiple linear regression (2+ predictors)
model1 <- lm(
  Data_Value ~ Stratification1 + Stratification2,
  data = heart_clean
)

summary(model1)

Call:
lm(formula = Data_Value ~ Stratification1 + Stratification2, 
    data = heart_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-258.47  -50.40   -6.82   41.16  554.33 

Coefficients:
                                                         Estimate Std. Error
(Intercept)                                               216.736      9.923
Stratification1Male                                       127.329      7.434
Stratification1Overall                                     49.618      7.310
Stratification2Asian                                     -112.838     11.962
Stratification2Black                                      136.975     11.860
Stratification2Hispanic                                   -89.590     11.803
Stratification2More than one race                        -134.088     12.807
Stratification2Native Hawaiian or Other Pacific Islander  148.820     17.839
Stratification2Overall                                     57.461     11.403
Stratification2White                                       59.970     11.582
                                                         t value Pr(>|t|)    
(Intercept)                                               21.841  < 2e-16 ***
Stratification1Male                                       17.129  < 2e-16 ***
Stratification1Overall                                     6.788 1.99e-11 ***
Stratification2Asian                                      -9.433  < 2e-16 ***
Stratification2Black                                      11.550  < 2e-16 ***
Stratification2Hispanic                                   -7.590 7.52e-14 ***
Stratification2More than one race                        -10.470  < 2e-16 ***
Stratification2Native Hawaiian or Other Pacific Islander   8.342 2.50e-16 ***
Stratification2Overall                                     5.039 5.58e-07 ***
Stratification2White                                       5.178 2.73e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 93.09 on 965 degrees of freedom
Multiple R-squared:  0.5816,    Adjusted R-squared:  0.5777 
F-statistic: 149.1 on 9 and 965 DF,  p-value: < 2.2e-16

Model Form

Data_Value = b0 + b1(Stratification1) + b2(Stratification2) + error

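This model shows how the heart disease death rate is estimated using sex and race. In this equation, b0 is the intercept: the estimated death rate for the reference group, which is made up of the sex level and the race level that do not appear in the coefficient table (Female is the reference level for sex). Because sex and race are categorical, R fits a separate coefficient for each remaining level, and each coefficient shows how much higher or lower the estimated rate is for that group compared with the reference group. The error part stands for other things that affect heart disease deaths but are not included in this model.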
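As a small sketch of how the model form works with categorical predictors, the code below first uses a toy data frame (hypothetical, not the CDC data) to show how R turns a factor into 0/1 indicator columns. It then adds up coefficients copied from the summary(model1) output above to get one estimated rate by hand. The variable names here are made up for the example.

```r
# Toy factor with the same three sex levels seen in the data (illustration only)
toy <- data.frame(sex = factor(c("Female", "Male", "Overall")))

# model.matrix() shows the indicator (dummy) columns that lm() builds.
# The first level, Female, has no column of its own: it is the reference level.
mm <- model.matrix(~ sex, data = toy)
mm

# Coefficients copied from the regression output in this report
b0      <- 216.736   # intercept: reference sex (Female) and reference race level
b_male  <- 127.329   # Stratification1Male
b_black <- 136.975   # Stratification2Black

# Estimated age-adjusted rate for the Male, Black group:
# start at the intercept and add the two group coefficients
pred_male_black <- b0 + b_male + b_black
pred_male_black
```

The same addition works for any other combination of sex and race: use the intercept alone for the reference group, and add one coefficient for each non-reference level.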

Diagnostic Plots

par(mfrow = c(2, 2))
plot(model1)

par(mfrow = c(1, 1))

Adjusted R-squared

summary(model1)$adj.r.squared
[1] 0.5777104
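The adjusted R-squared can also be checked by hand. This is a minimal sketch that assumes the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), and uses the rounded values printed in the regression output: R² = 0.5816, n = 975 rows, and p = 9 coefficients besides the intercept. The variable names are only for this example.

```r
# Values read from the summary(model1) output in this report
r2 <- 0.5816   # Multiple R-squared (rounded as printed)
n  <- 975      # number of rows in heart_clean
p  <- 9        # number of coefficients, not counting the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2   # about 0.5777, matching the model output up to rounding

# Share of variation the model leaves unexplained
unexplained <- 1 - adj_r2
unexplained
```

The denominator n - p - 1 equals 965, which matches the residual degrees of freedom reported by the model.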

Analyzing

The results show that the model explains a little more than half of the differences in heart disease death rates. The p-values show that sex and race both matter in the model, so they are related to differences in death rates. The adjusted R-squared is about 0.58, which means sex and race together account for roughly 58% of the variation and about 42% is left unexplained. This suggests that other things that may affect heart disease death rates, like healthcare, money, or daily habits, are not included in this data. The model is helpful, but it has limits.

Visualization 1: Heatmap of Heart Disease Death Rates by State and Race (Top 10 States)

I made a heatmap focusing on only the 10 states with the highest average heart disease death rates. I think the heatmap makes it easier to see patterns because color shows how high or low the rates are.

top_states <- heart_clean |>
  group_by(LocationDesc) |>
  summarise(avg_rate = mean(Data_Value), .groups = "drop") |>
  arrange(desc(avg_rate)) |>
  slice(1:10)

heat_df <- heart_clean |>
  filter(LocationDesc %in% top_states$LocationDesc) |>
  group_by(LocationDesc, Stratification2) |>
  summarise(avg_rate = mean(Data_Value), .groups = "drop")

ggplot(heat_df, aes(x = Stratification2,
                    y = reorder(LocationDesc, avg_rate),
                    fill = avg_rate)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(
    colors = c("#f1eef6", "#bdc9e1", "#74a9cf", "#0570b0"),
    name = "Avg rate"
  ) +
  labs(
    title = "Average Heart Disease Death Rates by State and Race (Top 10 States)",
    x = "Race",
    y = "State",
    caption = "Source: CDC Heart Disease Mortality Data, 2019–2021"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 8))

What it shows:

This heatmap shows how heart disease death rates change by race in the 10 states and territories with the highest overall rates. Guam, a U.S. territory, has the highest overall death rate of the group. Some race groups, like Black and White populations, have higher rates in several states. The pattern also changes from state to state, which suggests that where you live matters. Overall, the graph shows that both race and state are related to heart disease death rates.

Visualization 2: Top 10 States by Average Heart Disease Death Rate

I used a bar chart to compare the 10 states with the highest average heart disease death rates.

top_states <- heart_clean |>
  group_by(LocationDesc) |>
  summarise(avg_rate = mean(Data_Value), .groups = "drop") |>
  arrange(desc(avg_rate)) |>
  slice(1:10)

highest_state <- top_states |> slice(1)
ggplot(top_states, aes(
  x = reorder(LocationDesc, avg_rate),
  y = avg_rate,
  fill = LocationDesc
)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Top 10 States by Average Heart Disease Death Rate (2019–2021)",
    x = "State",
    y = "Average age-adjusted death rate",
    caption = "Source: CDC Heart Disease Mortality Data, 2019–2021") +
  theme_bw() +
  geom_text(
    data = highest_state,
    aes(label = paste("Highest:", round(avg_rate, 1))),
    hjust = -0.1,
    size = 3) +
  ylim(0, max(top_states$avg_rate) * 1.15)

What it shows:

The bar chart shows the 10 states and territories with the highest average heart disease death rates. Guam stands out with the highest rate at about 578.4, which is much higher than the others in the chart. The rest have lower rates, but there are still clear differences between them. This means that heart disease death rates vary a lot depending on the state.

Conclusion:

The most interesting finding for me was how much death rates change depending on the state. Guam stood out with the highest average heart disease death rate, which was much higher than the rates in the other top 10 states. I also found it interesting that death rates were different across race groups, and these patterns were not the same in every state.

The regression model showed that sex and race are related to heart disease death rates, but the model did not explain everything. The adjusted R-squared of about 0.58 leaves roughly 42% of the variation unexplained, which may reflect factors that are not included in this data, like access to healthcare, income, diet, and daily habits.

One limitation I faced was that the data does not include personal health behaviors or information about healthcare access. If I could continue this project, I would want to include those factors to better understand why heart disease death rates are higher in some places and groups. Overall, this project helped me see that heart disease is a serious issue and that many factors play a role in the differences we see across states and groups.