The dataset in question focuses on various health factors associated with heart disease, a leading cause of death in the United States, particularly affecting racial and demographic groups such as African Americans, American Indians, Alaska Natives, and whites. According to the Centers for Disease Control and Prevention (CDC), nearly half of all Americans (47%) have at least one major risk factor for heart disease, including high blood pressure, high cholesterol, and smoking. Other contributing factors include diabetes, obesity (high BMI), insufficient physical activity, and excessive alcohol consumption. Identifying and understanding these risk factors is crucial for healthcare, and the use of machine learning methods has enabled the detection of patterns in large datasets, helping predict heart disease risks more accurately.
How does BMI influence the prevalence of heart attacks across different age groups, and what is the statistical relationship between age and heart attack risk in the United States for 2022? This question addresses the core elements from the analysis: The influence of BMI on heart attack prevalence. The relationship between age and heart attack risk, as suggested by the p-values. It ties together statistical significance, age, and BMI as key factors in heart attack risk across age groups.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
setwd("C:/Users/eyong/Downloads/heart_2022_no_nans.csv")
gf<- read_csv("heart_2022_no_nans.csv")
## Rows: 246022 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): State, Sex, GeneralHealth, LastCheckupTime, PhysicalActivities, Re...
## dbl (6): PhysicalHealthDays, MentalHealthDays, SleepHours, HeightInMeters, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gf<-janitor::clean_names(gf)
ha <- gf %>% select(state, sex, race_ethnicity_category,had_heart_attack,age_category,bmi)
age_p_summary <- ha %>%
group_by(age_category, had_heart_attack) %>%
summarise(count = n(), .groups = "drop")
hchart(age_p_summary, "column", hcaes(x = age_category, y = log(count), group = had_heart_attack)) %>%
hc_title(text = "Heart Attack by Age Category") %>%
hc_xAxis(title = list(text = "Age Category")) %>%
hc_yAxis(title = list(text = "Count")) %>%
hc_plotOptions(column = list(stacking = "normal")) %>%
hc_tooltip(
pointFormat = "<b>Count:</b> {point.count}<br><b>Age Category:</b> {point.age_category}<br><b>Heart Attack:</b> {point.had_heart_attack}"# recent chart gbt to create and understanble number unlike log
) %>%
hc_caption(text = "CDC: 2022 heart attack incidence by age group.")
This bar graph illustrates the relationship between age group and heart attack incidence in 2022. It shows that individuals in younger age groups are less likely to report having had a heart attack, while older age groups exhibit a higher count of individuals affirming they have experienced one. Overall, a significant number of people in all age groups reported “no” to having had a heart attack, but the data suggests that the likelihood of saying “yes” increases with age. This trend could potentially be used to predict heart attack risk in future years.
# Convert 'had_heart_attack' column to a factor with "Yes" and "No"
ha$had_heart_attack <- factor(ha$had_heart_attack, levels = c("Yes", "No"))
agg_dt <- gf %>%
group_by(state, sex, had_heart_attack) %>%
summarise(count = n(), .groups = 'drop')
total_data <- agg_dt %>%
group_by(state, sex) %>%
summarise(total_count = sum(count), .groups = 'drop')
agg_dt_2 <- agg_dt %>%
left_join(total_data, by = c("state", "sex"))
agg_yes <- agg_dt_2 %>% filter(had_heart_attack == "Yes")
agg_no <- agg_dt_2 %>% filter(had_heart_attack == "No")
# Combine the "Yes" and "No" data for facet wrapping
agg_data_combined <- bind_rows(
agg_yes %>% mutate(heart_attack_status = "Yes"),
agg_no %>% mutate(heart_attack_status = "No")
)
pg <- ggplot(agg_data_combined, aes(x = state, y = count, fill = sex)) +
geom_bar(stat = "identity", position = "dodge") + # Bar graph with count as height and sex as fill
scale_fill_viridis_d(name = "Sex", option = "C") + # Viridis color scale for discrete values (for 'sex')
facet_wrap(~heart_attack_status, scales = "free_y", ncol = 2,
labeller = labeller(heart_attack_status = c("Yes" = "Yes", "No" = "No"))) + # Customize facet labels
labs(title = "Heart Attack by Sex and State",
x = "State",
y = "Count",
caption = "CDC Health Dataset") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis labels for readability
strip.text = element_text(size = 12)) + # Adjust facet label size
theme(legend.position = "bottom") # Position the legend at the bottom
pg
interactive_plot1 <- ggplotly(pg)
# Display the interactive plot
interactive_plot1
The bar graph highlights a significant difference in the prevalence of heart attacks between males and females across various states, with particular emphasis on Washington in 2022. In Washington, a higher number of males reported having had a heart attack compared to females, as shown by the “yes” responses. Conversely, more females than males answered “no,” suggesting that heart attacks were more common among males than females in this state, according to the 2022 CDC data.
ha$had_heart_attack <- factor(ha$had_heart_attack, levels = c("Yes", "No"))
agg_data <- gf %>%
group_by(state, sex, race_ethnicity_category, had_heart_attack) %>%
summarise(count = n(), .groups = 'drop')
p <- ggplot(agg_data, aes(x = state, fill = race_ethnicity_category)) +
geom_bar(aes(y = count), stat = "identity") + # Bar graph with heights defined by 'count'
scale_fill_viridis_d(name = "Race/Ethnicity") + # Color scale for different race/ethnicity categories
facet_wrap(~had_heart_attack, scales = "free_y", ncol = 2) + # Separate plots for 'Yes' and 'No'
labs(title = "Bar Graph of Number of Heart Attacks per State and Race/Ethnicity",
x = "State",
y = "Count",
caption = " CDC Health Dataset") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#coord_flip()
p
interactive_plot1 <- ggplotly(p)
interactive_plot1
The bargraph illustrates that a larger proportion of white individuals responded “Yes” and “No” to having had a heart attack, which is expected given that they represent a significant demographic. However, it is noteworthy that a higher volume of responses comes from Washington, as indicated by the CDC data report for 2022. This suggests that Washington has a higher concentration of responses regarding heart attack history, particularly among the white population.
rd <- gf %>% select(state, sex, race_ethnicity_category,had_heart_attack,age_category,bmi,smoker_status,smoker_status,alcohol_drinkers)
rd1 <- rd |>
mutate(heart_att = ifelse(had_heart_attack == "yes", 1, 0))
log_mod <- glm(heart_att ~ sex + race_ethnicity_category + alcohol_drinkers + age_category+bmi+smoker_status, family = binomial(), data = rd1)
## Warning: glm.fit: algorithm did not converge
summary(log_mod)
##
## Call:
## glm(formula = heart_att ~ sex + race_ethnicity_category + alcohol_drinkers +
## age_category + bmi + smoker_status, family = binomial(),
## data = rd1)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -2.657e+01 5.710e+03
## sexMale -1.166e-12 1.451e+03
## race_ethnicity_categoryHispanic 1.545e-13 3.508e+03
## race_ethnicity_categoryMultiracial, Non-Hispanic 4.305e-14 5.424e+03
## race_ethnicity_categoryOther race only, Non-Hispanic 9.819e-15 4.139e+03
## race_ethnicity_categoryWhite only, Non-Hispanic 7.167e-13 2.720e+03
## alcohol_drinkersYes -1.344e-12 1.482e+03
## age_categoryAge 25 to 29 -2.650e-14 4.604e+03
## age_categoryAge 30 to 34 -1.966e-13 4.405e+03
## age_categoryAge 35 to 39 -3.254e-13 4.257e+03
## age_categoryAge 40 to 44 -4.437e-13 4.192e+03
## age_categoryAge 45 to 49 -4.206e-13 4.204e+03
## age_categoryAge 50 to 54 -4.337e-13 4.060e+03
## age_categoryAge 55 to 59 -5.289e-13 3.982e+03
## age_categoryAge 60 to 64 -6.831e-13 3.865e+03
## age_categoryAge 65 to 69 4.250e-12 3.829e+03
## age_categoryAge 70 to 74 -9.599e-13 3.898e+03
## age_categoryAge 75 to 79 -1.174e-12 4.167e+03
## age_categoryAge 80 or older -1.331e-12 4.188e+03
## bmi -2.295e-14 1.129e+02
## smoker_statusCurrent smoker - now smokes some days 2.594e-13 4.648e+03
## smoker_statusFormer smoker 2.291e-12 2.814e+03
## smoker_statusNever smoked 9.220e-14 2.615e+03
## z value Pr(>|z|)
## (Intercept) -0.005 0.996
## sexMale 0.000 1.000
## race_ethnicity_categoryHispanic 0.000 1.000
## race_ethnicity_categoryMultiracial, Non-Hispanic 0.000 1.000
## race_ethnicity_categoryOther race only, Non-Hispanic 0.000 1.000
## race_ethnicity_categoryWhite only, Non-Hispanic 0.000 1.000
## alcohol_drinkersYes 0.000 1.000
## age_categoryAge 25 to 29 0.000 1.000
## age_categoryAge 30 to 34 0.000 1.000
## age_categoryAge 35 to 39 0.000 1.000
## age_categoryAge 40 to 44 0.000 1.000
## age_categoryAge 45 to 49 0.000 1.000
## age_categoryAge 50 to 54 0.000 1.000
## age_categoryAge 55 to 59 0.000 1.000
## age_categoryAge 60 to 64 0.000 1.000
## age_categoryAge 65 to 69 0.000 1.000
## age_categoryAge 70 to 74 0.000 1.000
## age_categoryAge 75 to 79 0.000 1.000
## age_categoryAge 80 or older 0.000 1.000
## bmi 0.000 1.000
## smoker_statusCurrent smoker - now smokes some days 0.000 1.000
## smoker_statusFormer smoker 0.000 1.000
## smoker_statusNever smoked 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.0000e+00 on 246021 degrees of freedom
## Residual deviance: 1.4273e-06 on 245999 degrees of freedom
## AIC: 46
##
## Number of Fisher Scoring iterations: 25
The estimate, although low, suggests that males have a higher likelihood of having a heart attack compared to females. Age also plays a significant role—people aged 65 to 69 face a higher risk of heart attack than those in other age groups. This is surprising, as I expected individuals aged 80 and older to be at greater risk. The Body Mass Index (BMI), used to measure obesity, also shows a low estimate, but logically, as BMI increases, the risk of a heart attack rises. For smokers, the estimate is higher than for former smokers, indicating that current smokers have a higher risk of heart attack. Regarding race and ethnicity, the data shows that multiracial individuals have a higher likelihood of having a heart attack, followed by white people. This is unexpected, as I assumed the opposite. Interestingly, white males have a lower estimate compared to other groups. This suggests that when comparing non-mixed populations, white individuals may be at a higher risk, but multiracial individuals overall face a higher risk. This could be due to genetic factors. Additionally, alcohol consumption is associated with a higher likelihood of heart attacks compared to those who don’t drink
Based on the CDC data report, I conducted an analysis of heart attack risks across different states, sex, and race using regression analysis, along with observations from the bar graph. The findings indicate that males aged 65 to 69, who are current smokers, drink alcohol, and are of mixed race, face a significantly higher risk of having a heart attack. This conclusion is supported by both the regression analysis and the 2022 CDC data report
I will love to know what race type of mixture is in the demographic of people who are mixed raced and also I will like to see what diabetes and other health factors affect heart attack.