The Heart Disease UCI dataset, consolidated from four sources—Cleveland, Hungarian, Switzerland, and VA—comprises 920 observations and 14 attributes, including the target variable. Each observation represents an individual patient, providing a comprehensive collection of demographic, clinical, and diagnostic data.
The dataset includes key variables such as age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise-induced angina, ST depression, and the slope of the ST segment. These variables are critical for understanding the factors contributing to heart disease and their interrelationships. The target variable indicates the presence or absence of heart disease, making the dataset valuable for both descriptive and predictive modeling.
This expanded dataset offers a richer foundation for analyzing the interactions between age, gender, cholesterol, blood sugar, and other health metrics. It provides an opportunity to uncover patterns that can enhance heart disease prevention, diagnosis, and management strategies across diverse patient populations.
Heart disease remains a leading cause of morbidity and mortality worldwide, with age and health factors playing critical roles in its onset and progression. Age is a non-modifiable risk factor strongly correlated with increased cardiovascular event susceptibility (D’Agostino et al., 2008; Lloyd-Jones et al., 2006). Modifiable health factors such as elevated cholesterol levels, hypertension, diabetes, obesity, sedentary lifestyles, and smoking are also well-established contributors to heart disease pathogenesis (Yusuf et al., 2004). Gender disparities in heart disease prevalence, clinical presentation, and outcomes have been consistently documented, underlining the need for gender-sensitive diagnostic and treatment frameworks (Hemingway & Marmot, 1999). Furthermore, physiological markers like maximum heart rate achieved during exercise have been shown to influence heart disease risk, with notable gender-specific differences (Gupta et al., 2020). Despite these insights, the complex interplay between age, gender, and other risk factors in predicting heart disease outcomes remains underexplored. Additionally, limited research has investigated how combinations of these factors (e.g., age and maximum heart rate or cholesterol and exercise-induced angina) interact to influence heart disease diagnosis and prognosis.
Attributes:
Research Questions
How do age interacted with maximum heart rate, and age interacted with chest pain type influence the risk of developing heart disease?
How do key health factors, including cholesterol levels, blood pressure, blood sugar, and exercise-induced angina, collectively impact the likelihood of heart disease, and how does this relationship vary among different age groups?
Hypotheses
Age and Maximum Heart Rate:
The interaction between age and maximum heart rate significantly
predicts the risk of heart disease, with maximum heart rates posing a
higher risk in older individuals compared to younger ones.
Age and Chest Pain Type: The interaction between age and chest pain type significantly predicts the risk of heart disease, with atypical and asymptomatic chest pain posing a higher risk in older individuals compared to younger ones.
Combinatory Health Factors:
The combined effect of elevated cholesterol levels, high blood pressure,
elevated blood sugar, and exercise-induced angina significantly
increases the likelihood of heart disease, and this effect is more
pronounced in individuals aged 60 and above compared to younger age
groups.
# Maximum Heart Rate by Age Group
p1 <- ggplot(heart_data_final, aes(x = age_group, y = max_heart_rate, fill=as.factor(age_group))) +
geom_boxplot(fill = "lightblue", color = "darkblue", outlier.color = "red", outlier.size = 2) +
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +
labs(
title = "Maximum Heart Rate by Age Group",
x = "Age Group",
y = "Maximum Heart Rate",
fill = "Age Group"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p1)
# Grouped Bar Plot for Chest Pain Type and Heart Disease
p2 <- ggplot(heart_data_final, aes(x = chest_paintype, fill = as.factor(heart_disease))) +
geom_bar(position = "dodge", color = "black") +
scale_fill_brewer(palette = "Pastel1",labels = c("No Heart Disease", "Heart Disease")) +
scale_x_discrete(labels = c(
"1" = "Typical Angina",
"2" = "Atypical Angina",
"3" = "Non-Anginal Pain",
"4" = "Asymptomatic"
)) +
labs(
title = "Chest Pain Type and Heart Disease",
x = "Chest Pain Type",
y = "Count",
fill = "Heart Disease"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Convert to interactive plot
ggplotly(p2)
# Filter the data to have maximum heart rate of 150+
#maximum heart rate for older individuals is 220-60 which is 160, so we only included data that has more than 150 max heart rate
max_heart_data <- heart_data_final %>% filter(max_heart_rate > 150)
# Logistic regression model for Adults with chest_paintype and max_heart_rate
model_max_heartrate <- glm(heart_disease ~ age_group*max_heart_rate,
data = max_heart_data,
family = binomial(),
x = TRUE)
#interaction effects
heartrate_effects <- allEffects(model_max_heartrate)
#dataframe of interaction effects
heartrate_df <- as.data.frame(heartrate_effects$`age_group:max_heart_rate`)
# Plot using ggplot2, including confidence intervals
p3 <- ggplot(heartrate_df, aes(x = max_heart_rate, y = fit, color = age_group)) +
geom_point(alpha = 0.6) + # Scatter plot with semi-transparent points
geom_line(aes(group = age_group)) + # Line for predicted probabilities
geom_ribbon(aes(ymin = lower, ymax = upper, fill = age_group), alpha = 0.2, color = NA) + # Confidence intervals
labs(title = "Effect of Maximum Heart Rate and Age Group on Heart Disease",
x = "Maximum Heart Rate",
y = "Predicted Probability of Heart Disease",
color = "Age Group",
fill = "Age Group") + # Legend for confidence intervals
theme_minimal() +
theme(legend.position = "top")
ggplotly(p3)
# Filter the data to include only "asymptomatic" chest pain type and max heart rate is more than 150
chestpain4_data <- heart_data_final %>% filter(chest_paintype == "asymptomatic" & max_heart_rate > 150)
# Logistic regression model for chest_paintype "asymptomatic" and age_group
model_chestpain4 <- glm(heart_disease ~ age_group*max_heart_rate,
data = chestpain4_data,
family = binomial(),
x = TRUE)
#interaction effects
chestpain4_effects <- allEffects(model_chestpain4)
#dataframe for interaction effects
chestpain4_df <- as.data.frame(chestpain4_effects$`age_group:max_heart_rate`)
# Plot using ggplot2, including confidence intervals
p4 <- ggplot(chestpain4_df, aes(x = max_heart_rate, y = fit, color = age_group)) +
geom_point(alpha = 0.6) + # Scatter plot with semi-transparent points
geom_line(aes(group = age_group)) + # Line for predicted probabilities
geom_ribbon(aes(ymin = lower, ymax = upper, fill = age_group), alpha = 0.2, color = NA) + # Confidence intervals
labs(title = "Heart Disease with Asymptomatic Chest Pain",
x = "Maximum Heart Rate",
y = "Predicted Probability of Heart Disease",
color = "Age Group",
fill = "Age Group") + # Legend for confidence intervals
theme_minimal() +
theme(legend.position = "top") # Hide redundant legend if not needed
ggplotly(p4)
#chest pain type: atypical angina
chestpain2_data <- heart_data_final %>% filter(chest_paintype == "atypical angina" &
max_heart_rate>150)
# Logistic regression model for chest_paintype "atypical angina" and age_group
model_chestpain2 <- glm(heart_disease ~ age_group*max_heart_rate,
data = chestpain2_data,
family = binomial(),
x = TRUE)
#interaction effects
chestpain2_effects <- allEffects(model_chestpain2)
#dataframe for interaction effects
chestpain2_df <- as.data.frame(chestpain2_effects$`age_group:max_heart_rate`)
# Plot using ggplot2, including confidence intervals
p5<-ggplot(chestpain2_df, aes(x = max_heart_rate, y = fit, color = age_group)) +
geom_point(alpha = 0.6) + # Scatter plot with semi-transparent points
geom_line(aes(group = age_group)) + # Line for predicted probabilities
geom_ribbon(aes(ymin = lower, ymax = upper, fill = age_group), alpha = 0.2, color = NA) + # Confidence intervals
labs(title = "Heart Disease with Atypical Angina Chest Pain",
x = "Maximum Heart Rate",
y = "Predicted Probability of Heart Disease",
color = "Age Group",
fill = "Age Group") + # Legend for confidence intervals
theme_minimal() +
theme(legend.position = "top") # Hide redundant legend if not needed
ggplotly(p5)
# Logistic regression with individual predictors
model_health <- glm(heart_disease ~ sex + resting_BP + age_group + cholestrol
+ fasting_blood_sugar + exercise_induced_angina,
data = heart_data_final,
family = binomial(),
x=TRUE)
shinyApp(
# Define UI for the app
ui <- fluidPage(
titlePanel("Heart Disease Prediction App"),
sidebarLayout(
sidebarPanel(
h2("Enter Your Health Data"),
br(),
selectInput("sex", "Sex:", choices = c("Female" = "Female", "Male" = "Male")),
numericInput("resting_BP", "Resting Blood Pressure:", value = 120, min = 50, max = 200, step = 1),
selectInput("age_group", "Age Group:",
choices = c("Adults(18-39)" = "Adults(18-39)",
"Mid-aged(40-59)" = "Mid-aged(40-59)",
"Older(60+)" = "Older(60+)")),
numericInput("cholestrol", "Cholesterol Level:", value = 200, min = 85, max = 605, step = 1),
selectInput("fasting_blood_sugar", "Fasting Blood Sugar:",
choices = c("<=120mg/dL" = "<=120mg/dL", ">120mg/dL" = ">120mg/dL")),
selectInput("exercise_induced_angina", "Exercise Induced Angina:",
choices = c("No" = "No", "Yes" = "Yes")),
br(),
actionButton("predict", "Predict", class = "btn-primary")
),
mainPanel(
h3("Prediction Result:"),
textOutput("prediction"),
br(),
uiOutput("additional_info") # Placeholder for additional information or suggestions
)
)
),
# Define server logic
server <- function(input, output) {
observeEvent(input$predict, {
# Create a new data frame with user inputs
new_data <- data.frame(
sex = input$sex,
resting_BP = as.numeric(input$resting_BP),
age_group = input$age_group,
cholestrol = as.numeric(input$cholestrol),
fasting_blood_sugar = input$fasting_blood_sugar,
exercise_induced_angina = input$exercise_induced_angina,
stringsAsFactors = FALSE # Prevent automatic conversion
)
# Match factor levels with training dataset
new_data$sex <- factor(new_data$sex, levels = levels(heart_data_final$sex))
new_data$age_group <- factor(new_data$age_group, levels = levels(heart_data_final$age_group))
new_data$fasting_blood_sugar <- factor(new_data$fasting_blood_sugar, levels = levels(heart_data_final$fasting_blood_sugar))
new_data$exercise_induced_angina <- factor(new_data$exercise_induced_angina, levels = levels(heart_data_final$exercise_induced_angina))
# Predict probability using the logistic regression model
prediction <- predict(model_health, newdata = new_data, type = "response")
# Check if prediction is NA
if (is.na(prediction)) {
output$prediction <- renderText("Error: Unable to calculate probability. Please check your inputs.")
} else {
# Render prediction output
output$prediction <- renderText({
paste("The predicted probability of heart disease is:", round(prediction * 100, 2), "%")
})
}
# Provide additional information or suggestions
output$additional_info <- renderUI({
if (prediction > 0.5) {
p("The prediction indicates a higher risk of heart disease. It is recommended to consult a healthcare professional for further advice.")
} else {
p("The prediction indicates a lower risk of heart disease. Maintain a healthy lifestyle and monitor your health regularly.")
}
})
})
},
options = list(vwidth = NULL, height = 500))