Heart Health in America: Investigating Elderly Diseases Among Different Demographic Locations

Author

N Bellot Norman

Published

July 7, 2024

Eat Your Way to a Healthy Heart

Source: https://www.uab.edu/news/images/eating_healthy_heart.jpg

Introduction

My topic, Heart Health in America: Investigating Elderly Diseases Among Different Demographic Locations,” takes a longitudinal review of data from 2016 to 2021. This data originates from the National Cardiovascular Disease Surveillance System, a crucial component of the Centers for Disease Control and Prevention (CDC), which leverages data from the Centers for Medicare & Medicaid Services (CMS). The dataset was compiled from Medicare and Medicaid claims data, including inpatient and outpatient claims and master beneficiary summary files. The CDC’s Division for Heart Disease and Stroke Prevention (DHDSP) then computed indicators from this data source to ensure comprehensive and accurate data (Centers for Disease Control and Prevention, 2022).

My filtered dataset includes quantitative variables such as the year and data value and categorical variables such as disease type (Heart Disease, Heart Attack, Stroke), categories (High Healthcare Expenditures, Growth Rate of Elderly Population, High Cardiovascular Disease), locations (multiple states), and gender (male and female).

To conduct my analysis, I modified the data by removing unwanted columns with missing or irrelevant information, such as RowID, LocationAbbr, and various PriorityArea columns that had no value for my analysis. I renamed the YearStart column for better readability, shortened the names of major disease categories, and focused on Heart Disease, Heart Attack, and Stroke. I also filtered the data to include only male and female gender categorization and separated the GeoLocation column into latitude and longitude. Additionally, I categorized states based on their characteristics related to elderly populations and cardiovascular health. The dataset also includes information for individuals ages 75 and older.

This topic is significant to me due to a personal experience with a severe health episode in my family, which highlighted the importance of understanding and addressing cardiovascular diseases in the elderly. Witnessing an elderly person having a significant health episode in my living room on a sunny Sunday morning in 2024 changed my life forever and led me to this topic. Also, in my line of work as an investigative analyst, I am drawn to the vulnerability among the elderly, many of whom fall victim to fraud and scams yearly. Thankfully, federal and local governments have instilled measures and provided resources to help thwart these threats and punish offenders. My analysis and recommendations will hopefully contribute to ongoing efforts to safeguard many in our community.

Background Research

The United States population aged 65 and older has grown significantly, increasing five times faster than the total population over the past century (Census, 2020). During the COVID-19 pandemic, this demographic reached 55.8 million, or 16.8% of the total population (Census, 2020). Although older Americans were negatively affected by the COVID-19 Pandemic, they are resilient and strong.

As the American population ages, the focus on health insurance and related care for older adults intensifies. Heart disease, heart attack, and stroke are leading causes of death among the elderly (CDC, 2022; NIH, 2022; AHA, 2021). According to the CDC, heart disease is the leading cause of death for both men and women, accounting for 25% of deaths (CDC, 2022). Understanding the rate in which they are affected is key. The National Institute of Health reports that approximately 805,000 Americans have a heart attack each year, with a significant portion occurring in individuals aged 65 and older (NIH, 2022). In the most staggering statistic in this essay, the American Heart Association (AHA) states that nearly 75% of all strokes occur in people over the age of 65 (AHA, 2021). Research shows that a person in the United States experiences a stroke every 40 seconds (AHA, 2021). Every 40 seconds. That statistic is very high. Given this information, I would like to explore factors that contribute to these numbers in my analysis.

Load libraries

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.1

Warning: package 'dplyr' was built under R version 4.4.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(ggalluvial)
library(leaflet)

Warning: package 'leaflet' was built under R version 4.4.1

library(RColorBrewer)
library(ggdark)

Warning: package 'ggdark' was built under R version 4.4.1

Set Working Directory & Upload Data

setwd("C:/Users/naomi/OneDrive/Desktop/Desktop of 11-08-2022/Community College Classes/DATA 110/Submitted Assignments/Project # 2")
elderly <- read_csv("Elderly_Heart_Disease.csv")

Rows: 33454 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): LocationAbbr, LocationDesc, DataSource, PriorityArea1, PriorityAre...
dbl  (6): YearStart, Data_Value, Data_Value_Alt, Low_Confidence_Limit, High_...
lgl  (5): RowId, PriorityArea2, PriorityArea4, Data_Value_Footnote_Symbol, D...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Remove and rename unwanted columns

data_step1 <- elderly |>
  select(-RowId, -LocationAbbr, -DataSource, -PriorityArea1, -PriorityArea2, -PriorityArea3, -PriorityArea4, -Class, -Data_Value_Type, -Data_Value_Alt, -Data_Value_Footnote_Symbol, -Data_Value_Footnote, -Low_Confidence_Limit, -High_Confidence_Limit, -ClassId, -TopicId, -BreakOutCategoryId, -BreakOutId, -QuestionId) |>
  rename(Year = YearStart)

Filter column to select specific data

data_step2 <- data_step1 |>
  filter(Break_Out %in% c("Male", "Female")) |>
  rename(Gender = Break_Out)

Shorten name of diseases to facilitate understanding and removed categories that do not fit the criteria

data_step3 <- data_step2 |>
  mutate(
    Gender = case_when(
      Gender == "Male" ~ "Male",
      Gender == "Female" ~ "Female"
    ),
    Disease = case_when(
      Topic == "Diseases of the Heart (Heart Disease)" ~ "Heart Disease",
      Topic == "Acute Myocardial Infarction (Heart Attack)" ~ "Heart Attack",
      Topic == "Stroke" ~ "Stroke"
    )
  ) |>
  filter(!is.na(Disease)) # Remove rows where Disease is NA

Convert the location into acceptable format for mapping

data_step4 <- data_step3 |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "POINT \\(|\\)", "")) |>
  separate(GeoLocation, into = c("Longitude", "Latitude"), sep = " ", convert = TRUE)

Change categoric into numeric variable

data_step5 <- data_step4 |>
  mutate(Year = as.numeric(as.character(Year)))

Filter locationdesc into categories that fit the description for the current analysis

data_high_elderly <- data_step5 |>
  filter(LocationDesc %in% c("Florida", "Maine", "West Virginia", "Vermont", "Montana")) |>
  mutate(Category = "High Elderly Populations")

data_high_health_care <- data_step5 |>
  filter(LocationDesc %in% c("Massachusetts", "Alaska", "Delaware", "Connecticut", "New York")) |>
  mutate(Category = "High Health Care Exp")

data_growth_rate_elderly <- data_step5 |>
  filter(LocationDesc %in% c("Arizona", "Nevada", "Colorado", "Georgia", "Texas")) |>
  mutate(Category = "Growth Rate of Elderly Pop")

data_high_cardiovascular <- data_step5 |>
  filter(LocationDesc %in% c("Mississippi", "Louisiana", "Arkansas", "Alabama", "Oklahoma")) |>
  mutate(Category = "High Cardiovascular Disease")

Cleaned Data

cleaned_data <- bind_rows(data_high_elderly, data_high_health_care, data_growth_rate_elderly, data_high_cardiovascular)

Assign mean values for outliers or values that are lower than 100 to reduce skewed numbers

mean_value <- mean(cleaned_data$Data_Value[cleaned_data$Data_Value < 100], na.rm = TRUE)
cleaned_data <- cleaned_data |>
  mutate(Data_Value = ifelse(Data_Value < 100, mean_value, Data_Value))

Print to ensure data is filtered correctly

print(head(cleaned_data))

# A tibble: 6 × 14
   Year LocationDesc  Topic                  Question Data_Value_Unit Data_Value
  <dbl> <chr>         <chr>                  <chr>    <chr>                <dbl>
1  2021 Maine         Stroke                 Cerebro… Rate per 100,0…     803.  
2  2020 Maine         Diseases of the Heart… Disease… Rate per 100,0…    3275.  
3  2017 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    4890.  
4  2017 Florida       Diseases of the Heart… Disease… Rate per 100,0…    3409.  
5  2020 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    3565.  
6  2019 Florida       Diseases of the Heart… Prevale… Percent (%)           6.08
# ℹ 8 more variables: Break_Out_Category <chr>, Gender <chr>,
#   Data_Value_TypeID <chr>, LocationId <dbl>, Longitude <dbl>, Latitude <dbl>,
#   Disease <chr>, Category <chr>

print(colnames(cleaned_data))

 [1] "Year"               "LocationDesc"       "Topic"             
 [4] "Question"           "Data_Value_Unit"    "Data_Value"        
 [7] "Break_Out_Category" "Gender"             "Data_Value_TypeID" 
[10] "LocationId"         "Longitude"          "Latitude"          
[13] "Disease"            "Category"

Linear Regression Analysis

model <- lm(Data_Value ~ Disease + Gender, data = cleaned_data)
summary(model)


Call:
lm(formula = Data_Value ~ Disease + Gender, data = cleaned_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1784.5  -464.7  -106.3   524.1  3099.1 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            228.16      56.57   4.033 5.79e-05 ***
DiseaseHeart Disease  1361.50      69.29  19.650  < 2e-16 ***
DiseaseStroke          167.13      69.29   2.412 0.015980 *  
GenderMale             200.88      56.57   3.551 0.000396 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1073 on 1436 degrees of freedom
Multiple R-squared:  0.2474,    Adjusted R-squared:  0.2458 
F-statistic: 157.3 on 3 and 1436 DF,  p-value: < 2.2e-16

p_values <- summary(model)$coefficients[, 4]
adjusted_r_squared <- summary(model)$adj.r.squared

The p values < 2.2e-16 means that the model is highly significant. It suggests that one of the values is related to the Data Value.

The linear regression equation for my model is Data_Value=β0+β1×DiseaseHeartDisease+β2×DiseaseStroke+β3×GenderMale Data_Value=228.16+1361.50×DiseaseHeartDisease+167.13×DiseaseStroke+200.88×GenderMale

The relationship between the variables are positive. The strongest relationship is DiseaseHeartDisease because it has the highest value 1361.50 and smallest p value (< 2e-16). Conversely, the weakest predictor is stroke. It has the lowest value at 167.13, but more importantly, the p-value = 0.015980). Male falls in the middle of these two categories as p-value (0.000396), and it is statistically significant.

R2 = 1 − SS - Sum of Squares residuals/Sum of Squares Total

The multiple R-squared value of 0.2474 means that approximately 24.74% of the variability in the Data value can be explained by the model. For the adjusted R-squared, 0.2458 adjusts the R-squared value based on the number of predictors in the model. In essence, heart disease, stroke, gender are significant predictors of the data value. Heart disease has the highest and most positive effect on data value followed by gender, specifically male then stroke. The model explains 24.74% of the variability.

summary(model)


Call:
lm(formula = Data_Value ~ Disease + Gender, data = cleaned_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1784.5  -464.7  -106.3   524.1  3099.1 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            228.16      56.57   4.033 5.79e-05 ***
DiseaseHeart Disease  1361.50      69.29  19.650  < 2e-16 ***
DiseaseStroke          167.13      69.29   2.412 0.015980 *  
GenderMale             200.88      56.57   3.551 0.000396 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1073 on 1436 degrees of freedom
Multiple R-squared:  0.2474,    Adjusted R-squared:  0.2458 
F-statistic: 157.3 on 3 and 1436 DF,  p-value: < 2.2e-16

equation <- paste("Data_Value = ", round(coef(model)[1], 2), 
                  "+", round(coef(model)[2], 2), "* DiseaseHeartDisease",
                  "+", round(coef(model)[3], 2), "* DiseaseHeartAttack",
                  "+", round(coef(model)[4], 2), "* DiseaseStroke",
                  "+", round(coef(model)[5], 2), "* GenderMale")

cat("Equation for the model:\n", equation, "\n")

Equation for the model:
 Data_Value =  228.16 + 1361.5 * DiseaseHeartDisease + 167.13 * DiseaseHeartAttack + 200.88 * DiseaseStroke + NA * GenderMale

cat("P-values:\n", p_values, "\n")

P-values:
 5.793905e-05 2.504147e-76 0.01597978 0.0003963393

cat("Adjusted R-squared:\n", adjusted_r_squared, "\n")

Adjusted R-squared:
 0.2458134

par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid

plot(model, which = 1, main = "Residuals vs Fitted")

I am not versed in this area and may have to take a class to further my understanding. Here is my attempt based on my research. The Residuals vs. Fitted graph helps validate the assumptions of a linear regression model. It appears that the values are clustered around the 500 value and under. The fitted red line veers under the negative values the further away it gets from the lower values. However, on the right hand side of the graph, the values are holding strong past the 1500 point at the 0 intercept.

plot(model, which = 2, main = "Normal Q-Q")

Similarly, a normal Q-Q Plot is used to determine if the residuals of a regression model (or any dataset) follow a normal distribution. In this graph, the points that are close to the reference line indicates normality and are stronger. However, the further out or away from the values, the more abnormal or weaker the relationship is.

plot(model, which = 3, main = "Scale-Location")

In my view, the scale-location is similar to residuals and fitted graph.

plot(model, which = 5, main = "Residuals vs Leverage")

For the Residuals vs. Leverage, Heart Disease remains higher than the other two categories.

custom_theme <- theme(
  panel.background = element_rect(fill = "lightblue"),
  panel.grid.major = element_line(color = "white", linewidth = 0.5),
  panel.grid.minor = element_line(color = "white", linewidth = 0.2),
  axis.text = element_text(color = "darkblue"),
  axis.title = element_text(color = "darkblue", face = "bold"),
  plot.title = element_text(hjust = 0.5, color = "darkblue", face = "bold"),
  legend.position = "right",
  legend.title = element_text(face = "bold", color = "darkblue"),
  legend.background = element_rect(fill = "lightblue")
)

line_graph <- ggplot(cleaned_data, aes(x = Year, y = Data_Value, color = Gender)) +
  geom_line(aes(group = Gender), size = 1) +
  facet_wrap(~ Disease) +
  scale_color_manual(values = c("darkgreen", "purple")) +
  labs(
    title = "Gender and Disease Demographics",
    x = "Year",
    y = "Prevalence",
    color = "Gender"
  ) +
  custom_theme

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

print(line_graph)

The time series measures three categories: heart attack, heart disease, and stroke among men and women from 2016 through 2021. The line graph shows that men consistently have higher rates of heart disease compared to women. Notably, there are periods of reductions for both men and women.

scatter_plot_1 <- ggplot(cleaned_data, aes(x = Year, y = Data_Value, color = Category)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("darkorange", "cyan", "magenta", "limegreen")) +
  labs(
    title = "Scatter Plot: Year vs. Data Value (High Elderly Populations)",
    x = "Year",
    y = "Prevalence",
    color = "Category"
  ) +
  facet_wrap(~Category) +
  custom_theme

print(scatter_plot_1)

scatter_plot_2 <- ggplot(cleaned_data, aes(x = Year, y = Data_Value, color = Category)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("darkorange", "cyan", "magenta", "limegreen")) +
  labs(
    title = "Scatter Plot: Year vs. Data Value (High Health Care Exp)",
    x = "Year",
    y = "Prevalence",
    color = "Category"
  ) +
  facet_wrap(~Category) +
  custom_theme

print(scatter_plot_2)

scatter_plot_3 <- ggplot(cleaned_data, aes(x = Year, y = Data_Value, color = Category)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("darkorange", "cyan", "magenta", "limegreen")) +
  labs(
    title = "Scatter Plot: Year vs. Data Value (Growth Rate of Elderly Pop)",
    x = "Year",
    y = "Prevalence",
    color = "Category"
  ) +
  facet_wrap(~Category) +
  custom_theme

print(scatter_plot_3)

scatter_plot_4 <- ggplot(cleaned_data, aes(x = Year, y = Data_Value, color = Category)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("darkorange", "cyan", "magenta", "limegreen")) +
  labs(
    title = "Exploring Demographic Factors in Heart Disease Incidence",
    x = "Year",
    y = "Prevalence",
    color = "Category"
  ) +
  facet_wrap(~Category) +
  custom_theme

print(scatter_plot_4)

The scatter plot compares states categorized by “Growth Rate of Elderly Population,” “High Cardiovascular Disease,” “High Elderly Populations,” and “High Health Care Expenditure.” States categorized as High Cardiovascular accounts for the highest prevalence followed by High Elderly. States that are described as having the fastest growth rate of elderly population can learn from other states with high elderly population based on the trends.There are noticeable declines in 2020 and 2021 for all categories. One reason may be due to the COVID-19 Pandemic. It will be interesting to see what 2022 and 2023 data shows.

heatmap_data <- cleaned_data %>%
  group_by(Year, Disease, Gender) %>%
  summarise(mean_value = mean(Data_Value, na.rm = TRUE))

`summarise()` has grouped output by 'Year', 'Disease'. You can override using
the `.groups` argument.

heatmap_plot <- ggplot(heatmap_data, aes(x = Year, y = Disease, fill = mean_value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  facet_wrap(~ Gender) +
  labs(
    title = "Heatmap Analysis of Heart Disease & Gender Prevalence",
    x = "Year",
    y = "Disease Type",
    fill = "Mean Data Value"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold")
  )

print(heatmap_plot)

The heatmap visualizes the mean data by year, gender, and disease type, showing that heart disease consistently dominates across all years, peaking between 1500 to 2000. The chart also consistently shows darker colors for males, indicating higher rates of heart disease among men.

box_plot <- ggplot(cleaned_data, aes(x = Disease, y = Data_Value, fill = Gender)) +
  geom_boxplot() +
  scale_fill_manual(values = c("pink", "green")) +
  labs(
    title = "Gender & Heart Disease",
    x = "Disease",
    y = "Prevalence",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold")
  )

print(box_plot)

The box plot compares gender and disease type, reaffirming that men dominate in all categories. For heart disease, men have significantly higher values, almost 1000 more than women. This is an opportunity for health campaigns to target their material toward men. Perhaps, encourage men to seek healthcare treatment early as a preventative measure. Again, it is important to reach everyone in an equitable manner.

max_data <- cleaned_data %>%
  group_by(Category, Disease, LocationDesc) %>%
  summarise(max_value = max(Data_Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Category, Disease) %>%
  slice_max(max_value, with_ties = FALSE) %>%
  ungroup()

`summarise()` has grouped output by 'Category', 'Disease'. You can override
using the `.groups` argument.

print(max_data)

# A tibble: 12 × 4
   Category                    Disease       LocationDesc  max_value
   <chr>                       <chr>         <chr>             <dbl>
 1 Growth Rate of Elderly Pop  Heart Attack  Texas              870.
 2 Growth Rate of Elderly Pop  Heart Disease Georgia           4205.
 3 Growth Rate of Elderly Pop  Stroke        Texas             1168.
 4 High Cardiovascular Disease Heart Attack  Arkansas          1049.
 5 High Cardiovascular Disease Heart Disease Arkansas          4733.
 6 High Cardiovascular Disease Stroke        Louisiana         1426 
 7 High Elderly Populations    Heart Attack  West Virginia     1341.
 8 High Elderly Populations    Heart Disease West Virginia     4890.
 9 High Elderly Populations    Stroke        West Virginia     1236.
10 High Health Care Exp        Heart Attack  Massachusetts      892.
11 High Health Care Exp        Heart Disease Massachusetts     4469.
12 High Health Care Exp        Stroke        Delaware          1367

This allows me to review the max data for each category to see the states that are most impacted.The summary table highlights the top values for each category by state. Texas leads in both heart attacks and strokes for states with a high growth rate of the elderly population. Arkansas dominates heart attack and heart disease in states with high cardiovascular disease rates. West Virginia leads in all three disease categories among states with high elderly populations. Massachusetts leads in heart attack and heart disease for states with high health care expenditures, while Delaware dominates in stroke for this category.

cleaned_data <- cleaned_data %>%
  mutate(Category = case_when(
    Category == "High Elderly Populations" ~ "High Elderly",
    Category == "High Health Care Exp" ~ "High Health Care",
    Category == "Growth Rate of Elderly Pop" ~ "Growth Rate Elderly",
    Category == "High Cardiovascular Disease" ~ "High Cardiovascular",
    TRUE ~ Category
  ))

max_data <- cleaned_data %>%
  group_by(Category, Disease, LocationDesc) %>%
  summarise(max_value = max(Data_Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Category, Disease) %>%
  slice_max(max_value, with_ties = FALSE) %>%
  ungroup()

`summarise()` has grouped output by 'Category', 'Disease'. You can override
using the `.groups` argument.

max_data <- max_data %>%
  mutate(max_value = round(max_value))

print(max_data)

# A tibble: 12 × 4
   Category            Disease       LocationDesc  max_value
   <chr>               <chr>         <chr>             <dbl>
 1 Growth Rate Elderly Heart Attack  Texas               870
 2 Growth Rate Elderly Heart Disease Georgia            4205
 3 Growth Rate Elderly Stroke        Texas              1168
 4 High Cardiovascular Heart Attack  Arkansas           1049
 5 High Cardiovascular Heart Disease Arkansas           4733
 6 High Cardiovascular Stroke        Louisiana          1426
 7 High Elderly        Heart Attack  West Virginia      1341
 8 High Elderly        Heart Disease West Virginia      4890
 9 High Elderly        Stroke        West Virginia      1236
10 High Health Care    Heart Attack  Massachusetts       892
11 High Health Care    Heart Disease Massachusetts      4469
12 High Health Care    Stroke        Delaware           1367

Create the theme for upcoming visualization

custom_theme <- theme(
  panel.background = element_blank(),
  panel.grid.major = element_line(color = "grey", size = 0.2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(color = "black"),
  axis.title = element_text(color = "black", face = "bold"),
  plot.title = element_text(hjust = 0.5, color = "black", face = "bold"),
  legend.position = "right",
  legend.title = element_text(face = "bold", color = "black"),
  legend.background = element_blank()
)

Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

New visualization. Lollipop plot.

lollipop_plot <- ggplot(max_data, aes(x = Category, y = max_value, color = Disease)) +
  geom_segment(aes(x = Category, xend = Category, y = 0, yend = max_value), size = 1) +
  geom_point(size = 4) +
  scale_color_manual(values = c("darkorange", "cyan", "magenta")) +
  labs(
    title = "Leading States in 2 Demographic Categories",
    x = "Category",
    y = "Prevalence",
    color = "Disease"
  ) +
  geom_text(aes(label = paste(LocationDesc, "\n", max_value)), 
            position = position_stack(vjust = 1.1), 
            color = "black", size = 2.5) +
  custom_theme

print(lollipop_plot)

The lollipop plot visualizes the leading states are impacting diseases within the high cardiovascular and high elderly categories.

Use the max data to create an alluvial data

alluvial_data <- max_data %>%
  group_by(Category, Disease, LocationDesc) %>%
  summarise(max_value = max(max_value, na.rm = TRUE)) %>%
  ungroup()

`summarise()` has grouped output by 'Category', 'Disease'. You can override
using the `.groups` argument.

Use the cleaned data to slice the top data for grouping

top3_data <- cleaned_data %>%
  group_by(LocationDesc, Disease) %>%
  summarise(max_value = max(Data_Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Disease) %>%
  slice_max(max_value, n = 3, with_ties = FALSE) %>%
  ungroup()

`summarise()` has grouped output by 'LocationDesc'. You can override using the
`.groups` argument.

Create an alluvial plot

alluvial_plot <- ggplot(alluvial_data,
                        aes(axis1 = Category, axis2 = Disease, axis3 = LocationDesc, y = max_value)) +
  geom_alluvium(aes(fill = Disease), width = 1/12) +
  geom_stratum(width = 1/12, fill = "grey", color = "black") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Category", "Disease", "LocationDesc"), expand = c(0.15, 0.05)) +
  scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF")) +
  labs(
    title = "Flow of Heart Diseases Across Categories and Locations",
    x = "Attributes",
    y = "Prevalance",
    fill = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    axis.title.x = element_blank()
  )

print(alluvial_plot)

The alluvial plot, entitled “Flow of Heart Diseases Across Categories and Locations,” provides a glimpse of pathways for heart attack, heart disease, and stroke. In keeping with earlier visualizations, heart diseases dominate each category, as depicted by the light green colors. Arkansas, Georgia, Massachusetts, and West Virginia have high heart disease rates. Red signifies the heart attack category. The red color weaves throughout the graph in a scarce pattern, though it focuses on West Virginia, Texas, and Massachusetts. It simultaneously touches on each category: Growth Rate Elderly, High Cardiovascular, High Elderly, and High Health Care to the left. However, the numbers for heart attacks are lower when compared to strokes. Delaware, Louisiana, Texas, and West Virginia’s elderly populations have been affected by strokes.

cleaned_data <- cleaned_data %>%
  mutate(Category = case_when(
    Category == "High Elderly Populations" ~ "High Elderly",
    Category == "High Health Care Exp" ~ "High Health Care",
    Category == "Growth Rate of Elderly Pop" ~ "Growth Rate Elderly",
    Category == "High Cardiovascular Disease" ~ "High Cardiovascular",
    TRUE ~ Category
  ))

filtered_data <- cleaned_data %>%
  filter(Category %in% c("High Elderly", "High Cardiovascular"))

max_data <- filtered_data %>%
  group_by(Category, Disease, LocationDesc) %>%
  summarise(max_value = max(Data_Value, na.rm = TRUE)) %>%
  ungroup()

`summarise()` has grouped output by 'Category', 'Disease'. You can override
using the `.groups` argument.

print(max_data)

# A tibble: 30 × 4
   Category            Disease       LocationDesc max_value
   <chr>               <chr>         <chr>            <dbl>
 1 High Cardiovascular Heart Attack  Alabama           928.
 2 High Cardiovascular Heart Attack  Arkansas         1049.
 3 High Cardiovascular Heart Attack  Louisiana         940 
 4 High Cardiovascular Heart Attack  Mississippi       882.
 5 High Cardiovascular Heart Attack  Oklahoma          963.
 6 High Cardiovascular Heart Disease Alabama          4619.
 7 High Cardiovascular Heart Disease Arkansas         4733.
 8 High Cardiovascular Heart Disease Louisiana        4537.
 9 High Cardiovascular Heart Disease Mississippi      4345.
10 High Cardiovascular Heart Disease Oklahoma         4208.
# ℹ 20 more rows

alluvial_plot <- ggplot(max_data,
                        aes(axis1 = Category, axis2 = Disease, axis3 = LocationDesc, y = max_value)) +
  geom_alluvium(aes(fill = Disease), width = 1/12) +
  geom_stratum(width = 1/12, fill = "grey", color = "black") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Category", "Disease", "LocationDesc"), expand = c(0.15, 0.05)) +
  scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF")) +
  labs(
    title = "Heart Disease Dynamics Across U.S. States and Categories",
    x = "Attributes",
    y = "Prevalance",
    fill = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    axis.title.x = element_blank()
  )

print(alluvial_plot)

This repeated alluvial plot accounts for the max data values to determine the affected states. I will discuss later in the essay.

alluvial_plot <- ggplot(max_data,
                        aes(axis1 = Category, axis2 = Disease, axis3 = LocationDesc, y = max_value)) +
  geom_alluvium(aes(fill = Disease), width = 1/12) +
  geom_stratum(width = 1/12, fill = "grey", color = "black") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Category", "Disease", "LocationDesc"), expand = c(0.15, 0.05)) +
  scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF")) +
  labs(
    title = "Heart Disease Dynamics Across U.S. States and Categories",
    x = "Attributes",
    y = "Prevalence",
    fill = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    axis.title.x = element_blank()
  )

print(alluvial_plot)

interactive_alluvial_plot <- ggplotly(alluvial_plot)

Add Interactive Application to Alluvial Plot

interactive_alluvial_plot

print(cleaned_data)

# A tibble: 1,440 × 14
    Year LocationDesc  Topic                 Question Data_Value_Unit Data_Value
   <dbl> <chr>         <chr>                 <chr>    <chr>                <dbl>
 1  2021 Maine         Stroke                Cerebro… Rate per 100,0…     803.  
 2  2020 Maine         Diseases of the Hear… Disease… Rate per 100,0…    3275.  
 3  2017 West Virginia Diseases of the Hear… Disease… Rate per 100,0…    4890.  
 4  2017 Florida       Diseases of the Hear… Disease… Rate per 100,0…    3409.  
 5  2020 West Virginia Diseases of the Hear… Disease… Rate per 100,0…    3565.  
 6  2019 Florida       Diseases of the Hear… Prevale… Percent (%)           6.08
 7  2018 Florida       Diseases of the Hear… Prevale… Percent (%)           6.08
 8  2018 Montana       Diseases of the Hear… Disease… Rate per 100,0…    3276.  
 9  2016 Vermont       Diseases of the Hear… Disease… Rate per 100,0…    3839.  
10  2016 Maine         Acute Myocardial Inf… Acute m… Rate per 100,0…    1086.  
# ℹ 1,430 more rows
# ℹ 8 more variables: Break_Out_Category <chr>, Gender <chr>,
#   Data_Value_TypeID <chr>, LocationId <dbl>, Longitude <dbl>, Latitude <dbl>,
#   Disease <chr>, Category <chr>

cleaned_data <- cleaned_data |>
  mutate(Latitude = as.numeric(Latitude), Longitude = as.numeric(Longitude))

print(sum(is.na(cleaned_data$Latitude)))

[1] 0

print(sum(is.na(cleaned_data$Longitude)))

[1] 0

print(head(cleaned_data))

# A tibble: 6 × 14
   Year LocationDesc  Topic                  Question Data_Value_Unit Data_Value
  <dbl> <chr>         <chr>                  <chr>    <chr>                <dbl>
1  2021 Maine         Stroke                 Cerebro… Rate per 100,0…     803.  
2  2020 Maine         Diseases of the Heart… Disease… Rate per 100,0…    3275.  
3  2017 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    4890.  
4  2017 Florida       Diseases of the Heart… Disease… Rate per 100,0…    3409.  
5  2020 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    3565.  
6  2019 Florida       Diseases of the Heart… Prevale… Percent (%)           6.08
# ℹ 8 more variables: Break_Out_Category <chr>, Gender <chr>,
#   Data_Value_TypeID <chr>, LocationId <dbl>, Longitude <dbl>, Latitude <dbl>,
#   Disease <chr>, Category <chr>

summary(cleaned_data$Data_Value)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   6.083    6.083  137.491  838.145 1013.975 4889.600

cleaned_data |>
  group_by(Disease) |>
  summarise(
    Min = min(Data_Value, na.rm = TRUE),
    Max = max(Data_Value, na.rm = TRUE),
    Mean = mean(Data_Value, na.rm = TRUE),
    Median = median(Data_Value, na.rm = TRUE)
  )

# A tibble: 3 × 5
  Disease         Min   Max  Mean Median
  <chr>         <dbl> <dbl> <dbl>  <dbl>
1 Heart Attack   6.08 1341.  329.   137.
2 Heart Disease  6.08 4890. 1690.   876.
3 Stroke         6.08 1426   496.   305.

normalized_data <- filtered_data |>
  group_by(Disease) |>
  mutate(Normalized_Value = (Data_Value - min(Data_Value, na.rm = TRUE)) / 
                              (max(Data_Value, na.rm = TRUE) - min(Data_Value, na.rm = TRUE))) |>
  ungroup()

# View the normalized data
print(head(normalized_data))

# A tibble: 6 × 15
   Year LocationDesc  Topic                  Question Data_Value_Unit Data_Value
  <dbl> <chr>         <chr>                  <chr>    <chr>                <dbl>
1  2021 Maine         Stroke                 Cerebro… Rate per 100,0…     803.  
2  2020 Maine         Diseases of the Heart… Disease… Rate per 100,0…    3275.  
3  2017 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    4890.  
4  2017 Florida       Diseases of the Heart… Disease… Rate per 100,0…    3409.  
5  2020 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    3565.  
6  2019 Florida       Diseases of the Heart… Prevale… Percent (%)           6.08
# ℹ 9 more variables: Break_Out_Category <chr>, Gender <chr>,
#   Data_Value_TypeID <chr>, LocationId <dbl>, Longitude <dbl>, Latitude <dbl>,
#   Disease <chr>, Category <chr>, Normalized_Value <dbl>

##Create Map based on Normalized data

leaflet_map <- leaflet(normalized_data) |>
  addTiles() |>
  addCircleMarkers(
    ~Longitude, ~Latitude,
    color = ~case_when(
      Disease == "Heart Disease" ~ "#33FF57",  # Bright green for Heart Disease
      Disease == "Heart Attack" ~ "#FF5733"  # Bright red for Heart Attack
    ),
    popup = ~paste(
      "<strong>Category:</strong>", Category, "<br>",
      "<strong>Disease:</strong>", Disease, "<br>",
      "<strong>Data Value (%):</strong>", round(Data_Value, 2), "<br>",
      "<strong>Normalized Value:</strong>", round(Normalized_Value, 2), "<br>",
      "<strong>Latitude:</strong>", Latitude, "<br>",
      "<strong>Longitude:</strong>", Longitude
    ),
    radius = ~Normalized_Value * 10,  # Scale the radius for better visualization
    stroke = FALSE,
    fillOpacity = 0.8
  ) |>
  addLegend(
    "bottomright",
    colors = c("#33FF57", "#FF5733"),
    labels = c("Heart Disease", "Heart Attack"),
    title = "Disease"
  )

leaflet_map

Cardiovascular disease is a widespread and serious health concern. The dataset used in this analysis includes a wide range of values, from 100 to over 4000, which initially limited the data captured in the interactive map to stroke cases. The interactive map, upon clicking, categorizes the information based on the filtered categories provided: states with High Elderly Populations, High Health Care Expenses, Growth Rate of Elderly Population, and High Cardiovascular Disease rates. The ten states represented in this analysis are Alabama, Arkansas, Florida, Louisiana, Maine, Mississippi, Montana, Oklahoma, Vermont, and West Virginia.

filtered_data <- cleaned_data |>
  filter(Disease %in% c("Heart Disease", "Heart Attack", "Stroke"))

normalized_data <- filtered_data |>
  group_by(Disease) |>
  mutate(Normalized_Value = (Data_Value - min(Data_Value, na.rm = TRUE)) / 
                              (max(Data_Value, na.rm = TRUE) - min(Data_Value, na.rm = TRUE))) |>
  ungroup()

print(head(normalized_data))

# A tibble: 6 × 15
   Year LocationDesc  Topic                  Question Data_Value_Unit Data_Value
  <dbl> <chr>         <chr>                  <chr>    <chr>                <dbl>
1  2021 Maine         Stroke                 Cerebro… Rate per 100,0…     803.  
2  2020 Maine         Diseases of the Heart… Disease… Rate per 100,0…    3275.  
3  2017 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    4890.  
4  2017 Florida       Diseases of the Heart… Disease… Rate per 100,0…    3409.  
5  2020 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    3565.  
6  2019 Florida       Diseases of the Heart… Prevale… Percent (%)           6.08
# ℹ 9 more variables: Break_Out_Category <chr>, Gender <chr>,
#   Data_Value_TypeID <chr>, LocationId <dbl>, Longitude <dbl>, Latitude <dbl>,
#   Disease <chr>, Category <chr>, Normalized_Value <dbl>

selected_states <- c("Alabama", "Arkansas", "Florida", "Louisiana", "Maine", 
                     "Mississippi", "Montana", "Oklahoma", "Vermont", "West Virginia")

filtered_data <- normalized_data |>
filter(LocationDesc %in% selected_states)

print(head(filtered_data))

# A tibble: 6 × 15
   Year LocationDesc  Topic                  Question Data_Value_Unit Data_Value
  <dbl> <chr>         <chr>                  <chr>    <chr>                <dbl>
1  2021 Maine         Stroke                 Cerebro… Rate per 100,0…     803.  
2  2020 Maine         Diseases of the Heart… Disease… Rate per 100,0…    3275.  
3  2017 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    4890.  
4  2017 Florida       Diseases of the Heart… Disease… Rate per 100,0…    3409.  
5  2020 West Virginia Diseases of the Heart… Disease… Rate per 100,0…    3565.  
6  2019 Florida       Diseases of the Heart… Prevale… Percent (%)           6.08
# ℹ 9 more variables: Break_Out_Category <chr>, Gender <chr>,
#   Data_Value_TypeID <chr>, LocationId <dbl>, Longitude <dbl>, Latitude <dbl>,
#   Disease <chr>, Category <chr>, Normalized_Value <dbl>

Utilize New Palette for Variation

color_palette <- brewer.pal(n = 3, name = "Set1")

p <- ggplot(filtered_data, aes(x = LocationDesc, y = Data_Value, fill = Disease)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = color_palette) +
  labs(
    title = "A Deeper Took at Cardiovascular Diseases in 
    Lower Income States/Rural Communities",
    x = "Lower Income States/Rural Communities",
    y = "Prevalence",
    fill = "Disease"
  ) +
  theme_minimal() +
  ggdark::dark_theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 10),
    legend.text = element_text(size = 10)
  )

Inverted geom defaults of fill and color/colour.
To change them back, use invert_geom_defaults().

interactive_plot <- ggplotly(p, tooltip = c("x", "y", "fill"))

Activate Interactive Element to Visualization

interactive_plot

This visualization examines states characterized by lower income levels and large rural communities. The disease breakdown reveals consistent trends and patterns throughout the analysis. Heart disease overwhelmingly leads as the primary cause of illness across all categories and states analyzed. This underscores the need for awareness and educational campaigns that promote healthy behaviors, such as better food choices and regular exercise.

Healthy foods need to be normalized. Unfortunately, companies often prioritize cheaper, mass-produced food to maximize profits, compromising quality. For individuals with limited resources and budgets, the increasing prices of healthy food options create a significant barrier due to competing financial demands. Despite this challenge, heart disease remains prevalent even in states with higher health care expenditures or greater earning potential.

It is crucial to consider both the financial costs and the psychological and emotional impacts of heart disease. These ramifications deprive society of healthy aging adults who, due to intentional or culturally influenced behavioral choices, often face a wide array of illnesses. Addressing these issues requires a multifaceted approach that includes improving access to healthy foods and encouraging lifestyle changes across all demographics.

Interesting Findings and Future Analysis

One surprising finding in the analysis is the pervasive impact of heart disease across all categories and states analyzed. Even in states with higher health care expenditures and potentially greater access to medical resources, heart disease remains the leading cause of illness. This suggests that factors beyond just access to healthcare, such as lifestyle choices and socioeconomic conditions, play a significant role in the prevalence of heart disease.

Another notable surprise is the significant gender disparity, with men consistently exhibiting higher rates of cardiovascular diseases compared to women. I would like to explore if this is attributed to biological differences or behavioral and lifestyle factors.

For future research, I would like to investigate access to healthy foods in different regions, including lower income communities (inner city vs. rural). I am also interested in understanding mental health burden on patients and their families as well as the benefits of mental health support (including diet and exercise) in averting or reducing heart diseases.

Sources

Centers for Disease Control and Prevention. (2022).

National Center for Health Statistics. Retrieved from https://www.cdc.gov/nchs/index.htm National Institutes of Health. (2022).

NIH Fact Sheets - Heart Attack. Retrieved from https://report.nih.gov/nihfactsheets/viewfactsheet.aspx?csid=116

American Heart Association. (2021). Heart Disease and Stroke Statistics. Retrieved from https://www.heart.org/en/about-us/heart-and-stroke-association-statistics

United States Census Bureau. (2020). The Aging Population in the United States. Retrieved from https://www.census.gov/topics/population/age-and-sex.html

Health Resources and Services Administration. (2020). Rural Health. Retrieved from https://www.hrsa.gov/rural-health

https://www.uab.edu/news/images/eating_healthy_heart.jpg