Introduction

The dataset in question focuses on various health factors associated with heart disease, a leading cause of death in the United States, particularly affecting racial and demographic groups such as African Americans, American Indians, Alaska Natives, and whites. According to the Centers for Disease Control and Prevention (CDC), nearly half of all Americans (47%) have at least one major risk factor for heart disease, including high blood pressure, high cholesterol, and smoking. Other contributing factors include diabetes, obesity (high BMI), insufficient physical activity, and excessive alcohol consumption. Identifying and understanding these risk factors is crucial for healthcare, and the use of machine learning methods has enabled the detection of patterns in large datasets, helping predict heart disease risks more accurately.

Quesion

How does BMI influence the prevalence of heart attacks across different age groups, and what is the statistical relationship between age and heart attack risk in the United States for 2022? This question addresses the core elements from the analysis: The influence of BMI on heart attack prevalence. The relationship between age and heart attack risk, as suggested by the p-values. It ties together statistical significance, age, and BMI as key factors in heart attack risk across age groups.

loading libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

seting part

setwd("C:/Users/eyong/Downloads/heart_2022_no_nans.csv")
gf<- read_csv("heart_2022_no_nans.csv")
## Rows: 246022 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): State, Sex, GeneralHealth, LastCheckupTime, PhysicalActivities, Re...
## dbl  (6): PhysicalHealthDays, MentalHealthDays, SleepHours, HeightInMeters, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cleaning column names

gf<-janitor::clean_names(gf)

Select relevant columns from your dataframe

ha <- gf %>% select(state, sex, race_ethnicity_category,had_heart_attack,age_category,bmi)

Are there any significant differences in heart attack rate based on age groups

age_p_summary <- ha %>%
  group_by(age_category, had_heart_attack) %>%
  summarise(count = n(), .groups = "drop")

creating the plot for age category

hchart(age_p_summary, "column", hcaes(x = age_category, y = log(count), group = had_heart_attack)) %>%
  hc_title(text = "Heart Attack by Age Category") %>%
  hc_xAxis(title = list(text = "Age Category")) %>%
  hc_yAxis(title = list(text = "Count")) %>%
  hc_plotOptions(column = list(stacking = "normal")) %>%
  hc_tooltip(
    pointFormat = "<b>Count:</b> {point.count}<br><b>Age Category:</b> {point.age_category}<br><b>Heart Attack:</b> {point.had_heart_attack}"# recent chart gbt to create and understanble number unlike log
  ) %>%
  hc_caption(text = "CDC: 2022 heart attack incidence by age group.")

Exslanation

This bar graph illustrates the relationship between age group and heart attack incidence in 2022. It shows that individuals in younger age groups are less likely to report having had a heart attack, while older age groups exhibit a higher count of individuals affirming they have experienced one. Overall, a significant number of people in all age groups reported “no” to having had a heart attack, but the data suggests that the likelihood of saying “yes” increases with age. This trend could potentially be used to predict heart attack risk in future years.

what is the heart attack in males and females pair state

# Convert 'had_heart_attack' column to a factor with "Yes" and "No"
ha$had_heart_attack <- factor(ha$had_heart_attack, levels = c("Yes", "No"))

Aggregate data to count occurrences by state, sex, and had_heart_attack

agg_dt <- gf %>%
  group_by(state, sex, had_heart_attack) %>%
  summarise(count = n(), .groups = 'drop')

Create a new column to get the total count of male and female for each state

total_data <- agg_dt %>%
  group_by(state, sex) %>%
  summarise(total_count = sum(count), .groups = 'drop')

Merge aggregated data and total counts

agg_dt_2 <- agg_dt %>%
  left_join(total_data, by = c("state", "sex"))

Filter the data for “Yes” and “No” heart attack statuses

agg_yes <- agg_dt_2 %>% filter(had_heart_attack == "Yes")
agg_no <- agg_dt_2 %>% filter(had_heart_attack == "No")

# Combine the "Yes" and "No" data for facet wrapping

agg_data_combined <- bind_rows(
  agg_yes %>% mutate(heart_attack_status = "Yes"),
  agg_no %>% mutate(heart_attack_status = "No")
)

Create the bargraph using ggplot2

pg <- ggplot(agg_data_combined, aes(x = state, y = count, fill = sex)) +
  geom_bar(stat = "identity", position = "dodge") +  # Bar graph with count as height and sex as fill
  scale_fill_viridis_d(name = "Sex", option = "C") +  # Viridis color scale for discrete values (for 'sex')
  facet_wrap(~heart_attack_status, scales = "free_y", ncol = 2, 
             labeller = labeller(heart_attack_status = c("Yes" = "Yes", "No" = "No"))) +  # Customize facet labels
  labs(title = "Heart Attack by Sex and State",
       x = "State",
       y = "Count",
       caption = "CDC Health Dataset") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),  # Rotate x-axis labels for readability
        strip.text = element_text(size = 12)) +  # Adjust facet label size
  theme(legend.position = "bottom")  # Position the legend at the bottom
pg

Convert the ggplot to a plotly interactive plot

interactive_plot1 <- ggplotly(pg)

# Display the interactive plot
interactive_plot1

Explanation of the Bar Graph for Sex

The bar graph highlights a significant difference in the prevalence of heart attacks between males and females across various states, with particular emphasis on Washington in 2022. In Washington, a higher number of males reported having had a heart attack compared to females, as shown by the “yes” responses. Conversely, more females than males answered “no,” suggesting that heart attacks were more common among males than females in this state, according to the 2022 CDC data.

Differnce in heart attaack according to ethnicity

ha$had_heart_attack <- factor(ha$had_heart_attack, levels = c("Yes", "No"))

Aggregate data to count occurrences by state, sex, race_ethnicity_category, and had_heart_attack

agg_data <- gf %>%
  group_by(state, sex, race_ethnicity_category, had_heart_attack) %>%
  summarise(count = n(), .groups = 'drop')

Create the bar graph plot with ggplot2

p <- ggplot(agg_data, aes(x = state, fill = race_ethnicity_category)) +
  geom_bar(aes(y = count), stat = "identity") +  # Bar graph with heights defined by 'count'
  scale_fill_viridis_d(name = "Race/Ethnicity") +  # Color scale for different race/ethnicity categories
  facet_wrap(~had_heart_attack, scales = "free_y", ncol = 2) +  # Separate plots for 'Yes' and 'No'
  labs(title = "Bar Graph of Number of Heart Attacks per State and Race/Ethnicity",
       x = "State",
       y = "Count",
       caption = " CDC Health Dataset") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  
 #coord_flip()
p

Convert the ggplot to a plotly interactive plot

interactive_plot1 <- ggplotly(p)

Display the interactive plot

interactive_plot1

EXplanation of the heatmap for race and heart attack

The bargraph illustrates that a larger proportion of white individuals responded “Yes” and “No” to having had a heart attack, which is expected given that they represent a significant demographic. However, it is noteworthy that a higher volume of responses comes from Washington, as indicated by the CDC data report for 2022. This suggests that Washington has a higher concentration of responses regarding heart attack history, particularly among the white population.

regresion data set

rd <- gf %>% select(state, sex, race_ethnicity_category,had_heart_attack,age_category,bmi,smoker_status,smoker_status,alcohol_drinkers)

Convert ‘had_heart_attack’ column from yes and no to 1 and 0

rd1 <- rd |>
  mutate(heart_att = ifelse(had_heart_attack == "yes", 1, 0))

regresion model

log_mod <- glm(heart_att ~ sex + race_ethnicity_category + alcohol_drinkers + age_category+bmi+smoker_status, family = binomial(), data = rd1)
## Warning: glm.fit: algorithm did not converge
summary(log_mod)
## 
## Call:
## glm(formula = heart_att ~ sex + race_ethnicity_category + alcohol_drinkers + 
##     age_category + bmi + smoker_status, family = binomial(), 
##     data = rd1)
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                          -2.657e+01  5.710e+03
## sexMale                                              -1.166e-12  1.451e+03
## race_ethnicity_categoryHispanic                       1.545e-13  3.508e+03
## race_ethnicity_categoryMultiracial, Non-Hispanic      4.305e-14  5.424e+03
## race_ethnicity_categoryOther race only, Non-Hispanic  9.819e-15  4.139e+03
## race_ethnicity_categoryWhite only, Non-Hispanic       7.167e-13  2.720e+03
## alcohol_drinkersYes                                  -1.344e-12  1.482e+03
## age_categoryAge 25 to 29                             -2.650e-14  4.604e+03
## age_categoryAge 30 to 34                             -1.966e-13  4.405e+03
## age_categoryAge 35 to 39                             -3.254e-13  4.257e+03
## age_categoryAge 40 to 44                             -4.437e-13  4.192e+03
## age_categoryAge 45 to 49                             -4.206e-13  4.204e+03
## age_categoryAge 50 to 54                             -4.337e-13  4.060e+03
## age_categoryAge 55 to 59                             -5.289e-13  3.982e+03
## age_categoryAge 60 to 64                             -6.831e-13  3.865e+03
## age_categoryAge 65 to 69                              4.250e-12  3.829e+03
## age_categoryAge 70 to 74                             -9.599e-13  3.898e+03
## age_categoryAge 75 to 79                             -1.174e-12  4.167e+03
## age_categoryAge 80 or older                          -1.331e-12  4.188e+03
## bmi                                                  -2.295e-14  1.129e+02
## smoker_statusCurrent smoker - now smokes some days    2.594e-13  4.648e+03
## smoker_statusFormer smoker                            2.291e-12  2.814e+03
## smoker_statusNever smoked                             9.220e-14  2.615e+03
##                                                      z value Pr(>|z|)
## (Intercept)                                           -0.005    0.996
## sexMale                                                0.000    1.000
## race_ethnicity_categoryHispanic                        0.000    1.000
## race_ethnicity_categoryMultiracial, Non-Hispanic       0.000    1.000
## race_ethnicity_categoryOther race only, Non-Hispanic   0.000    1.000
## race_ethnicity_categoryWhite only, Non-Hispanic        0.000    1.000
## alcohol_drinkersYes                                    0.000    1.000
## age_categoryAge 25 to 29                               0.000    1.000
## age_categoryAge 30 to 34                               0.000    1.000
## age_categoryAge 35 to 39                               0.000    1.000
## age_categoryAge 40 to 44                               0.000    1.000
## age_categoryAge 45 to 49                               0.000    1.000
## age_categoryAge 50 to 54                               0.000    1.000
## age_categoryAge 55 to 59                               0.000    1.000
## age_categoryAge 60 to 64                               0.000    1.000
## age_categoryAge 65 to 69                               0.000    1.000
## age_categoryAge 70 to 74                               0.000    1.000
## age_categoryAge 75 to 79                               0.000    1.000
## age_categoryAge 80 or older                            0.000    1.000
## bmi                                                    0.000    1.000
## smoker_statusCurrent smoker - now smokes some days     0.000    1.000
## smoker_statusFormer smoker                             0.000    1.000
## smoker_statusNever smoked                              0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 246021  degrees of freedom
## Residual deviance: 1.4273e-06  on 245999  degrees of freedom
## AIC: 46
## 
## Number of Fisher Scoring iterations: 25

Explanation

The estimate, although low, suggests that males have a higher likelihood of having a heart attack compared to females. Age also plays a significant role—people aged 65 to 69 face a higher risk of heart attack than those in other age groups. This is surprising, as I expected individuals aged 80 and older to be at greater risk. The Body Mass Index (BMI), used to measure obesity, also shows a low estimate, but logically, as BMI increases, the risk of a heart attack rises. For smokers, the estimate is higher than for former smokers, indicating that current smokers have a higher risk of heart attack. Regarding race and ethnicity, the data shows that multiracial individuals have a higher likelihood of having a heart attack, followed by white people. This is unexpected, as I assumed the opposite. Interestingly, white males have a lower estimate compared to other groups. This suggests that when comparing non-mixed populations, white individuals may be at a higher risk, but multiracial individuals overall face a higher risk. This could be due to genetic factors. Additionally, alcohol consumption is associated with a higher likelihood of heart attacks compared to those who don’t drink

conclusion

Based on the CDC data report, I conducted an analysis of heart attack risks across different states, sex, and race using regression analysis, along with observations from the bar graph. The findings indicate that males aged 65 to 69, who are current smokers, drink alcohol, and are of mixed race, face a significantly higher risk of having a heart attack. This conclusion is supported by both the regression analysis and the 2022 CDC data report

Quesion that I will like to explor

I will love to know what race type of mixture is in the demographic of people who are mixed raced and also I will like to see what diabetes and other health factors affect heart attack.