For the final project, I chose a dataset on heart disease mortality among adults aged 35 and older in the United States. Heart disease is one of the leading causes of death in the country, and it continues to affect many families, including my own. There is a history of heart disease in my family, which is why I was interested in learning more about how heart disease impacts different groups of people and regions across the U.S.
The dataset used in this project comes from the Centers for Disease Control and Prevention (CDC), through the National Center for Chronic Disease Prevention and Health Promotion. It covers the years 2019 to 2021 and includes information on heart disease deaths by state, sex, race, and type of heart disease. It also includes the total number of deaths and age-adjusted death rates, which make comparisons between groups with different age structures fairer.
The goal of this project is to explore how heart disease mortality rates vary across states and demographic groups and to identify factors that are associated with higher mortality rates.
Variables
Name              Description                               Type
Year              Year the data was recorded (2019-2021)    Categorical
LocationDesc      U.S. state or territory                   Categorical
Topic             Type of heart disease                     Categorical
Stratification1   Sex                                       Categorical
Stratification2   Race or ethnic group                      Categorical
Data_Value        Age-adjusted heart disease death rate     Quantitative
Deaths            Total number of heart disease deaths      Quantitative
Questions I Want to Explore
How do heart disease death rates differ across U.S. states?
Are there differences in death rates by sex and race?
Do certain types of heart disease have higher death rates than others?
What factors are related to higher heart disease death rates?
How the Data Was Collected
The data was collected by the Centers for Disease Control and Prevention (CDC). The CDC gathers heart disease death data using death certificate records that are reported by each U.S. state and territory. When a person dies, the cause of death is listed on the death certificate, and this information is later sent to the CDC.
The CDC then groups these records by year, state, sex, race, and type of heart disease. The dataset also includes age-adjusted death rates, which are calculated so that states and groups with different age distributions can be compared more fairly.
Loading Libraries and Dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 78792 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): LocationAbbr, LocationDesc, GeographicLevel, DataSource, Class, To...
dbl (4): Year, Data_Value, Y_lat, X_lon
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heart_data)
# A tibble: 6 × 21
   Year LocationAbbr LocationDesc     GeographicLevel DataSource Class     Topic
  <dbl> <chr>        <chr>            <chr>           <chr>      <chr>     <chr>
1  2020 AK           Denali           County          NVSS       Cardiova… Hear…
2  2020 CA           California       State           NVSS       Cardiova… Hear…
3  2020 CO           Park County      County          NVSS       Cardiova… Hear…
4  2020 FL           Walton County    County          NVSS       Cardiova… Hear…
5  2020 GA           Whitfield County County          NVSS       Cardiova… Hear…
6  2020 GA           Ware County      County          NVSS       Cardiova… Hear…
# ℹ 14 more variables: Data_Value <dbl>, Data_Value_Unit <chr>,
# Data_Value_Type <chr>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, StratificationCategory1 <chr>,
# Stratification1 <chr>, StratificationCategory2 <chr>,
# Stratification2 <chr>, TopicID <chr>, LocationID <chr>, Y_lat <dbl>,
# X_lon <dbl>, Georeference <chr>
Below, I keep only the variables that I will use later and summarize the death rate.
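A minimal sketch of how this selection and summary step could look is given here. It runs on a tiny made-up data frame (the values in toy_data are invented for illustration and are not CDC figures). The names heart_small and heart_summary are hypothetical, and the filter to state-level rows with a non-missing rate is an assumption about how the data were narrowed down.

```r
library(dplyr)

# Toy stand-in for heart_data: values are invented for illustration only.
toy_data <- data.frame(
  Year            = c(2020, 2020, 2020, 2020),
  LocationDesc    = c("State A", "State A", "County B", "State C"),
  GeographicLevel = c("State", "State", "County", "State"),
  Topic           = "Heart Disease Mortality",
  Stratification1 = c("Male", "Female", "Overall", "Overall"),
  Stratification2 = c("Overall", "Overall", "Overall", "Overall"),
  Data_Value      = c(300, 200, NA, 400)
)

# Assumption: keep state-level rows with a reported rate,
# then keep only the columns used later in the analysis.
heart_small <- toy_data %>%
  filter(GeographicLevel == "State", !is.na(Data_Value)) %>%
  select(Year, LocationDesc, Topic, Stratification1, Stratification2, Data_Value) %>%
  mutate(Year = factor(Year))

# One summary row per year: mean, minimum, maximum, and row count.
heart_summary <- heart_small %>%
  group_by(Year) %>%
  summarise(
    avg_rate = mean(Data_Value),
    min_rate = min(Data_Value),
    max_rate = max(Data_Value),
    n        = n()
  )

heart_summary
```

Converting Year to a factor matches the <fct> column type in the summary output, since the year is treated as a category rather than a number.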
# A tibble: 1 × 5
Year avg_rate min_rate max_rate n
<fct> <dbl> <dbl> <dbl> <int>
1 2020 278. 26.3 905. 975
Multiple Linear Regression
I run a multiple linear regression model to study heart disease death rates. The outcome variable in this model is the age-adjusted heart disease death rate. The predictors used in this model are sex and race. These variables are included to see how heart disease death rates differ across different groups of people.
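Written out, a model of this kind could take the general form below. This is a sketch rather than the exact fitted equation: the symbols are assumptions, and because sex and race are categorical, each term stands in for one or more indicator (dummy) variables, each with its own coefficient.

```latex
\[
\text{DeathRate}_i = b_0 + b_1\,\text{Sex}_i + b_2\,\text{Race}_i + \varepsilon_i
\]
```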
This model shows how the heart disease death rate is estimated from sex and race. In this equation, b0 is the intercept: the expected death rate for the reference group (the baseline category of each predictor). The sex and race terms show how the expected death rate shifts for other groups relative to that baseline. The error term stands for other influences on heart disease deaths that are not included in this model.
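As a rough illustration, a model like this could be fit with lm() as sketched below. The data frame toy_rates holds invented numbers, not CDC values, and using Data_Value, Stratification1, and Stratification2 as the outcome and predictors is an assumption based on the variable table.

```r
# Toy data: rates are invented for illustration, not CDC values.
toy_rates <- data.frame(
  Stratification1 = rep(c("Female", "Male"), times = 6),
  Stratification2 = rep(c("Black", "Hispanic", "White"), each = 4),
  Data_Value = c(310, 420, 300, 410, 180, 260, 190, 270, 250, 350, 240, 360)
)

# Regress the rate on sex and race. R expands each categorical predictor
# into indicator variables, using the first level as the reference group.
model1 <- lm(Data_Value ~ Stratification1 + Stratification2, data = toy_rates)

coef(model1)                   # intercept plus one coefficient per non-reference level
summary(model1)$adj.r.squared  # adjusted R-squared of the fitted model
```

Here the intercept is the expected rate for the reference group (Female and Black in this toy data, because those levels come first alphabetically), and every other coefficient is a shift relative to that group.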
Diagnostic Plots
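# Show the four standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model1)
# Reset the plotting layout to a single panel
par(mfrow = c(1, 1))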
Adjusted R-squared
summary(model1)$adj.r.squared
[1] 0.5777104
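To make the adjusted R-squared less of a black box, the sketch below recomputes it by hand from the ordinary R-squared and compares it with the value that summary() reports. It uses invented toy data and hypothetical names (toy_rates, fit, adj_r2_manual), so the numbers will not match the result above.

```r
# Toy data: rates are invented for illustration, not CDC values.
toy_rates <- data.frame(
  sex  = rep(c("Female", "Male"), times = 6),
  race = rep(c("Black", "Hispanic", "White"), each = 4),
  rate = c(310, 420, 300, 410, 180, 260, 190, 270, 250, 350, 240, 360)
)
fit <- lm(rate ~ sex + race, data = toy_rates)

r2 <- summary(fit)$r.squared
n  <- nrow(toy_rates)          # number of observations
p  <- length(coef(fit)) - 1    # number of estimated slopes (intercept excluded)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Compare the manual value with the one reported by summary()
all.equal(adj_r2_manual, summary(fit)$adj.r.squared)
```

Because of the penalty, the adjusted value is never larger than the plain R-squared, which makes it a fairer measure when a model has several predictors.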
Analysis
The adjusted R-squared of about 0.58 means the model explains a little more than half of the variation in heart disease death rates. The p-values show that sex and race do matter in the model, so they are related to differences in death rates. Still, roughly 42% of the variation is left unexplained. This suggests that other things affect heart disease death rates, like healthcare, income, or daily habits, that are not included in this data. The model is helpful, but it has limits.
Visualization 1: Heatmap of Heart Disease Death Rates by State and Race (Top 10 States)
I made a heatmap focusing on only the 10 states with the highest average heart disease death rates. I think the heatmap makes it easier to see patterns because color shows how high or low the rates are.
This heatmap shows how heart disease death rates change by race in the 10 states with the highest overall rates. Guam has the highest overall death rate of the locations shown. Some race groups, like Black and White populations, have higher rates in several states. The pattern also changes from state to state, which suggests that where you live matters. Overall, the graph shows that death rates vary with both race and state.
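A heatmap like this can be drawn with geom_tile() in ggplot2. The sketch below is only an illustration: toy_heat and heatmap_plot are hypothetical names, the location labels are placeholders, the rates are invented, and the color scale is an assumption rather than the one used in the original figure.

```r
library(ggplot2)

# Toy data: placeholder locations and invented rates, not CDC values.
toy_heat <- expand.grid(
  LocationDesc    = c("State A", "State B", "State C"),
  Stratification2 = c("Black", "White"),
  stringsAsFactors = FALSE
)
toy_heat$avg_rate <- c(560, 480, 450, 500, 430, 440)

# One tile per state-race combination; fill color encodes the rate.
heatmap_plot <- ggplot(
  toy_heat,
  aes(x = Stratification2, y = LocationDesc, fill = avg_rate)
) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(
    title = "Heart disease death rates by state and race (toy data)",
    x = "Race or ethnic group",
    y = "State or territory",
    fill = "Avg. rate"
  ) +
  theme_minimal()

heatmap_plot
```

A sequential light-to-dark scale is used here so that darker tiles read as higher rates at a glance.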
Visualization 2: Top 10 States by Average Heart Disease Death Rate
I used a bar chart to compare the 10 states with the highest average heart disease death rates.
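The chart relies on two data frames, top_states and highest_state. One way they could be built is sketched here, under the assumption that the average is taken over all rows for each state. The toy data uses placeholder state names and invented rates, so only two states are kept instead of ten.

```r
library(dplyr)

# Toy data: placeholder states and invented rates, not CDC values.
toy_states <- data.frame(
  LocationDesc = rep(c("State A", "State B", "State C"), each = 2),
  Data_Value   = c(400, 420, 300, 320, 500, 540)
)

# Average rate per state, keeping the highest averages
# (n = 10 in the report; n = 2 here because the toy data has three states).
top_states <- toy_states %>%
  group_by(LocationDesc) %>%
  summarise(avg_rate = mean(Data_Value, na.rm = TRUE)) %>%
  slice_max(avg_rate, n = 2)

# Single row with the highest average, used for the text label.
highest_state <- top_states %>%
  slice_max(avg_rate, n = 1)

top_states
highest_state
```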
ggplot(top_states,
       aes(x = reorder(LocationDesc, avg_rate), y = avg_rate, fill = LocationDesc)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Top 10 States by Average Heart Disease Death Rate (2019–2021)",
       x = "State",
       y = "Average age-adjusted death rate",
       caption = "Source: CDC Heart Disease Mortality Data, 2019–2021") +
  theme_bw() +
  # Label only the bar with the highest average rate
  geom_text(data = highest_state,
            aes(label = paste("Highest:", round(avg_rate, 1))),
            hjust = -0.1, size = 3) +
  # Leave room to the right of the longest bar for the label
  ylim(0, max(top_states$avg_rate) * 1.15)
What it shows:
The bar chart shows the 10 states with the highest average heart disease death rates. Guam stands out with the highest rate, about 578.4 per 100,000 population, which is much higher than the other states in the chart. The rest of the states have lower rates, but there are still clear differences between them. This means that heart disease death rates vary a lot depending on the state.
Conclusion:
The most interesting finding for me was how much death rates change depending on the state. Guam stood out with the highest average heart disease death rate, which was much higher than the other states. I also found it interesting that death rates were different across race groups, and these patterns were not the same in every state.
The regression model showed that sex and race are related to heart disease death rates, but the model did not explain everything. With an adjusted R-squared of about 0.58, a large share of the variation remains unexplained, which points to factors that are not included in this data, like access to healthcare, income, diet, and daily habits.
One limitation I faced was that the data does not include personal health behaviors or information about healthcare access. If I could continue this project, I would want to include those factors to better understand why heart disease death rates are higher in some places and groups. Overall, this project helped me see that heart disease is a serious issue and that many factors play a role in the differences we see across states and groups.