| Member | Contribution |
|---|---|
| Liu Tian Shuo | Data understanding, missing value checking, duplicate checking, outlier detection, and data preprocessing |
| Zhao Ning Na | Exploratory data analysis, univariate analysis, bivariate analysis, multivariate analysis, correlation analysis, and visualization |
| Teoh Xiao Qian | Regression modelling for predicting PhysicalHealth,
including Multiple Linear Regression, Stepwise Regression, and Random
Forest Regression |
| Zhao Yi Xuan | Classification modelling for predicting HeartDisease,
including Logistic Regression, Random Forest, and Support Vector
Machine |
| Yang Shan | Final model evaluation, model comparison, visualization, discussion of results, conclusion, and report integration |
Heart disease is one of the major health concerns worldwide. Lifestyle habits, physical condition, mental health, and existing medical conditions may influence the risk of heart disease. In this project, a heart disease dataset is used to explore the relationship between lifestyle and health-related indicators.
This project focuses on two predictive tasks. First, regression
models are developed to predict PhysicalHealth, which
represents the number of physically unhealthy days. Second,
classification models are developed to predict
HeartDisease, which indicates whether an individual has
heart disease.
For this project, we used the Personal Key Indicators of Heart Disease dataset from Kaggle. The dataset was uploaded by Kamil Pytlak and is based on CDC survey data related to adult health status. It includes personal health and lifestyle variables such as BMI, smoking, alcohol drinking, stroke, physical health, mental health, diabetes, physical activity, and general health. The main target variable is HeartDisease, which is recorded as a binary value to show whether a person has heart disease risk or not. This dataset was selected because it is large, health-related, and suitable for both regression and classification analysis.
The dataset link is https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/data
The objectives of this project are:
PhysicalHealth.HeartDisease.packages <- c(
"tidyverse",
"caret",
"MASS",
"randomForest",
"Metrics",
"corrplot",
"glmnet",
"pROC",
"e1071",
"GGally"
)
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
lapply(packages, library, character.only = TRUE)
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "caret" "lattice" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[3]]
## [1] "MASS" "caret" "lattice" "lubridate" "forcats" "stringr"
## [7] "dplyr" "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [13] "tidyverse" "stats" "graphics" "grDevices" "utils" "datasets"
## [19] "methods" "base"
##
## [[4]]
## [1] "randomForest" "MASS" "caret" "lattice" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[5]]
## [1] "Metrics" "randomForest" "MASS" "caret" "lattice"
## [6] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [11] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [16] "stats" "graphics" "grDevices" "utils" "datasets"
## [21] "methods" "base"
##
## [[6]]
## [1] "corrplot" "Metrics" "randomForest" "MASS" "caret"
## [6] "lattice" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[7]]
## [1] "glmnet" "Matrix" "corrplot" "Metrics" "randomForest"
## [6] "MASS" "caret" "lattice" "lubridate" "forcats"
## [11] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [16] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [21] "grDevices" "utils" "datasets" "methods" "base"
##
## [[8]]
## [1] "pROC" "glmnet" "Matrix" "corrplot" "Metrics"
## [6] "randomForest" "MASS" "caret" "lattice" "lubridate"
## [11] "forcats" "stringr" "dplyr" "purrr" "readr"
## [16] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [21] "graphics" "grDevices" "utils" "datasets" "methods"
## [26] "base"
##
## [[9]]
## [1] "e1071" "pROC" "glmnet" "Matrix" "corrplot"
## [6] "Metrics" "randomForest" "MASS" "caret" "lattice"
## [11] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [16] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [21] "stats" "graphics" "grDevices" "utils" "datasets"
## [26] "methods" "base"
##
## [[10]]
## [1] "GGally" "e1071" "pROC" "glmnet" "Matrix"
## [6] "corrplot" "Metrics" "randomForest" "MASS" "caret"
## [11] "lattice" "lubridate" "forcats" "stringr" "dplyr"
## [16] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [21] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [26] "datasets" "methods" "base"
heart_data <- read.csv("heart_attack_2020.csv")
# View data structure
str(heart_data)
## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : chr "No" "No" "No" "No" ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : chr "Yes" "No" "Yes" "No" ...
## $ AlcoholDrinking : chr "No" "No" "No" "No" ...
## $ Stroke : chr "No" "Yes" "No" "No" ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : chr "No" "No" "No" "No" ...
## $ Sex : chr "Female" "Female" "Male" "Female" ...
## $ AgeCategory : chr "55-59" "80 or older" "65-69" "75-79" ...
## $ Race : chr "White" "White" "White" "White" ...
## $ Diabetic : chr "Yes" "No" "Yes" "No" ...
## $ PhysicalActivity: chr "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr "Very good" "Very good" "Fair" "Good" ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : chr "Yes" "No" "Yes" "No" ...
## $ KidneyDisease : chr "No" "No" "No" "No" ...
## $ SkinCancer : chr "Yes" "No" "No" "Yes" ...
# Check dimensions of dataset
dim(heart_data)
## [1] 319795 18
# Preview first few rows
head(heart_data)
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 1 No 16.60 Yes No No 3 30
## 2 No 20.34 No No Yes 0 0
## 3 No 26.58 Yes No No 20 30
## 4 No 24.21 No No No 0 0
## 5 No 23.71 No No No 28 0
## 6 Yes 28.87 Yes No No 6 0
## DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth
## 1 No Female 55-59 White Yes Yes Very good
## 2 No Female 80 or older White No Yes Very good
## 3 No Male 65-69 White Yes Yes Fair
## 4 No Female 75-79 White No No Good
## 5 Yes Female 40-44 White No Yes Very good
## 6 Yes Female 75-79 Black No No Fair
## SleepTime Asthma KidneyDisease SkinCancer
## 1 5 Yes No Yes
## 2 7 No No No
## 3 8 Yes No No
## 4 6 No No Yes
## 5 8 No No No
## 6 12 No No No
# Count missing values in each column
colSums(is.na(heart_data))
## HeartDisease BMI Smoking AlcoholDrinking
## 0 0 0 0
## Stroke PhysicalHealth MentalHealth DiffWalking
## 0 0 0 0
## Sex AgeCategory Race Diabetic
## 0 0 0 0
## PhysicalActivity GenHealth SleepTime Asthma
## 0 0 0 0
## KidneyDisease SkinCancer
## 0 0
# Total missing values in dataset
sum(is.na(heart_data))
## [1] 0
The missing value analysis confirmed that no missing values were present in the dataset. Therefore, no imputation was required.
# Count duplicated rows
sum(duplicated(heart_data))
## [1] 18078
# Remove duplicates if any exist
heart_data <- distinct(heart_data)
# Detect outliers in numerical variables
boxplot(
heart_data$BMI,
main = "BMI Outlier Detection"
)

boxplot(
heart_data$PhysicalHealth,
main = "Physical Health Outlier Detection"
)

boxplot(
heart_data$MentalHealth,
main = "Mental Health Outlier Detection"
)

boxplot(
heart_data$SleepTime,
main = "Sleep Time Outlier Detection"
)

A total of 18,078 duplicate records were identified and removed. Outliers were examined using boxplots and retained because they may represent valid medical observations rather than data entry errors.
# No = 0, Yes = 1
heart_data$HeartDisease <- ifelse(
heart_data$HeartDisease == "Yes",
1,
0
)
# Convert to factor
heart_data$HeartDisease <- as.factor(heart_data$HeartDisease)
heart_data$Smoking <- as.factor(heart_data$Smoking)
heart_data$AlcoholDrinking <- as.factor(heart_data$AlcoholDrinking)
heart_data$Stroke <- as.factor(heart_data$Stroke)
heart_data$DiffWalking <- as.factor(heart_data$DiffWalking)
heart_data$Sex <- as.factor(heart_data$Sex)
heart_data$AgeCategory <- as.factor(heart_data$AgeCategory)
heart_data$Race <- as.factor(heart_data$Race)
heart_data$Diabetic <- as.factor(heart_data$Diabetic)
heart_data$PhysicalActivity <- as.factor(heart_data$PhysicalActivity)
heart_data$GenHealth <- as.factor(heart_data$GenHealth)
heart_data$Asthma <- as.factor(heart_data$Asthma)
heart_data$KidneyDisease <- as.factor(heart_data$KidneyDisease)
heart_data$SkinCancer <- as.factor(heart_data$SkinCancer)
str(heart_data)
## 'data.frame': 301717 obs. of 18 variables:
## $ HeartDisease : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
## $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Stroke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
## $ AgeCategory : Factor w/ 13 levels "18-24","25-29",..: 8 13 10 12 5 12 11 13 13 10 ...
## $ Race : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 6 6 3 6 6 6 6 ...
## $ Diabetic : Factor w/ 4 levels "No","No, borderline diabetes",..: 3 1 3 1 1 1 1 3 2 1 ...
## $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
## $ GenHealth : Factor w/ 5 levels "Excellent","Fair",..: 5 5 2 3 5 2 2 3 2 3 ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
## $ KidneyDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ SkinCancer : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...
numeric_cols <- c(
"BMI",
"PhysicalHealth",
"MentalHealth",
"SleepTime"
)
# Apply Min-Max Normalization
heart_data[numeric_cols] <- lapply(
heart_data[numeric_cols],
function(x) (x - min(x)) / (max(x) - min(x))
)
summary(heart_data)
## HeartDisease BMI Smoking AlcoholDrinking Stroke
## 0:274456 Min. :0.0000 No :174312 No :280136 No :289653
## 1: 27261 1st Qu.:0.1450 Yes:127405 Yes: 21581 Yes: 12064
## Median :0.1858
## Mean :0.1983
## 3rd Qu.:0.2370
## Max. :1.0000
##
## PhysicalHealth MentalHealth DiffWalking Sex
## Min. :0.00000 Min. :0.0000 No :257362 Female:159671
## 1st Qu.:0.00000 1st Qu.:0.0000 Yes: 44355 Male :142046
## Median :0.00000 Median :0.0000
## Mean :0.11908 Mean :0.1374
## 3rd Qu.:0.06667 3rd Qu.:0.1333
## Max. :1.00000 Max. :1.0000
##
## AgeCategory Race
## 65-69 : 31670 American Indian/Alaskan Native: 5192
## 60-64 : 31219 Asian : 7993
## 70-74 : 29273 Black : 22810
## 55-59 : 27610 Hispanic : 27107
## 50-54 : 23736 Other : 10891
## 80 or older: 23352 White :227724
## (Other) :134857
## Diabetic PhysicalActivity GenHealth
## No :251796 No : 71305 Excellent: 59737
## No, borderline diabetes: 6776 Yes:230412 Fair : 34659
## Yes : 40589 Good : 91239
## Yes (during pregnancy) : 2556 Poor : 11286
## Very good:104796
##
##
## SleepTime Asthma KidneyDisease SkinCancer
## Min. :0.0000 No :259066 No :289941 No :272425
## 1st Qu.:0.2174 Yes: 42651 Yes: 11776 Yes: 29292
## Median :0.2609
## Mean :0.2645
## 3rd Qu.:0.3043
## Max. :1.0000
##
# Display first few rows after preprocessing
head(heart_data)
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth
## 1 0 0.05529398 Yes No No 0.1000000
## 2 0 0.10044670 No No Yes 0.0000000
## 3 0 0.17578172 Yes No No 0.6666667
## 4 0 0.14716890 No No No 0.0000000
## 5 0 0.14113244 No No No 0.9333333
## 6 1 0.20342871 Yes No No 0.2000000
## MentalHealth DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity
## 1 1 No Female 55-59 White Yes Yes
## 2 0 No Female 80 or older White No Yes
## 3 1 No Male 65-69 White Yes Yes
## 4 0 No Female 75-79 White No No
## 5 0 Yes Female 40-44 White No Yes
## 6 0 Yes Female 75-79 Black No No
## GenHealth SleepTime Asthma KidneyDisease SkinCancer
## 1 Very good 0.1739130 Yes No Yes
## 2 Very good 0.2608696 No No No
## 3 Fair 0.3043478 Yes No No
## 4 Good 0.2173913 No No Yes
## 5 Very good 0.3043478 No No No
## 6 Fair 0.4782609 No No No
# Save the preprocessed dataset as a CSV file
write.csv(
heart_data,
"heart_attack_2020_preprocessed.csv",
row.names = FALSE
)
heart <- read.csv(
"heart_attack_2020_preprocessed_non_normalization.csv",
stringsAsFactors = FALSE
)
dim(heart)
## [1] 301717 18
str(heart)
## 'data.frame': 301717 obs. of 18 variables:
## $ HeartDisease : int 0 0 0 0 0 1 0 0 0 0 ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : chr "Yes" "No" "Yes" "No" ...
## $ AlcoholDrinking : chr "No" "No" "No" "No" ...
## $ Stroke : chr "No" "Yes" "No" "No" ...
## $ PhysicalHealth : int 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : int 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : chr "No" "No" "No" "No" ...
## $ Sex : chr "Female" "Female" "Male" "Female" ...
## $ AgeCategory : chr "55-59" "80 or older" "65-69" "75-79" ...
## $ Race : chr "White" "White" "White" "White" ...
## $ Diabetic : chr "Yes" "No" "Yes" "No" ...
## $ PhysicalActivity: chr "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr "Very good" "Very good" "Fair" "Good" ...
## $ SleepTime : int 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : chr "Yes" "No" "Yes" "No" ...
## $ KidneyDisease : chr "No" "No" "No" "No" ...
## $ SkinCancer : chr "Yes" "No" "No" "Yes" ...
summary(heart)
## HeartDisease BMI Smoking AlcoholDrinking
## Min. :0.00000 Min. :12.02 Length:301717 Length:301717
## 1st Qu.:0.00000 1st Qu.:24.03 Class :character Class :character
## Median :0.00000 Median :27.41 Mode :character Mode :character
## Mean :0.09035 Mean :28.44
## 3rd Qu.:0.00000 3rd Qu.:31.65
## Max. :1.00000 Max. :94.85
## Stroke PhysicalHealth MentalHealth DiffWalking
## Length:301717 Min. : 0.000 Min. : 0.000 Length:301717
## Class :character 1st Qu.: 0.000 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Median : 0.000 Mode :character
## Mean : 3.572 Mean : 4.121
## 3rd Qu.: 2.000 3rd Qu.: 4.000
## Max. :30.000 Max. :30.000
## Sex AgeCategory Race Diabetic
## Length:301717 Length:301717 Length:301717 Length:301717
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PhysicalActivity GenHealth SleepTime Asthma
## Length:301717 Length:301717 Min. : 1.000 Length:301717
## Class :character Class :character 1st Qu.: 6.000 Class :character
## Mode :character Mode :character Median : 7.000 Mode :character
## Mean : 7.085
## 3rd Qu.: 8.000
## Max. :24.000
## KidneyDisease SkinCancer
## Length:301717 Length:301717
## Class :character Class :character
## Mode :character Mode :character
##
##
##
head(heart)
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 1 0 16.60 Yes No No 3 30
## 2 0 20.34 No No Yes 0 0
## 3 0 26.58 Yes No No 20 30
## 4 0 24.21 No No No 0 0
## 5 0 23.71 No No No 28 0
## 6 1 28.87 Yes No No 6 0
## DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth
## 1 No Female 55-59 White Yes Yes Very good
## 2 No Female 80 or older White No Yes Very good
## 3 No Male 65-69 White Yes Yes Fair
## 4 No Female 75-79 White No No Good
## 5 Yes Female 40-44 White No Yes Very good
## 6 Yes Female 75-79 Black No No Fair
## SleepTime Asthma KidneyDisease SkinCancer
## 1 5 Yes No Yes
## 2 7 No No No
## 3 8 Yes No No
## 4 6 No No Yes
## 5 8 No No No
## 6 12 No No No
ggplot(heart, aes(x = BMI)) +
geom_histogram(
bins = 30,
fill = "steelblue",
color = "black"
) +
labs(
title = "Distribution of BMI",
x = "BMI",
y = "Frequency"
)

The histogram shows the distribution of BMI values among individuals in the dataset. Most observations are concentrated within the middle BMI range, suggesting that the majority of participants fall within average or overweight categories. A small number of extreme values may indicate obesity-related health concerns.
ggplot(heart, aes(x = PhysicalHealth)) +
geom_histogram(
bins = 30,
fill = "orange",
color = "black"
) +
labs(
title = "Distribution of Physical Health",
x = "Physical Health",
y = "Frequency"
)

The distribution of physical health values indicates that many individuals reported relatively lower physical health issues, while fewer participants experienced severe physical health problems.
ggplot(heart, aes(x = MentalHealth)) +
geom_histogram(
bins = 30,
fill = "purple",
color = "black"
) +
labs(
title = "Distribution of Mental Health",
x = "Mental Health",
y = "Frequency"
)

The histogram demonstrates the variation in mental health conditions across participants. Most individuals appear to experience lower levels of mental health difficulties, while a smaller portion reports higher mental health concerns.
ggplot(heart, aes(x = SleepTime)) +
geom_histogram(
bins = 30,
fill = "darkgreen",
color = "black"
) +
labs(
title = "Distribution of Sleep Time",
x = "Sleep Time",
y = "Frequency"
)

The sleep time distribution indicates that most individuals maintain moderate sleep durations. Extremely low or high sleep durations appear less common and may potentially relate to health conditions.
ggplot(heart, aes(x = Smoking)) +
geom_bar(fill = "darkred") +
labs(
title = "Smoking Distribution",
x = "Smoking",
y = "Count"
)

The bar chart illustrates the number of smokers and non-smokers in the dataset. The comparison provides insight into the prevalence of smoking behavior among participants.
ggplot(heart, aes(x = AlcoholDrinking)) +
geom_bar(fill = "brown") +
labs(
title = "Alcohol Drinking Distribution",
x = "Alcohol Drinking",
y = "Count"
)

The chart displays the distribution of alcohol consumption status among individuals. It helps identify whether alcohol drinking behavior is common within the dataset.
ggplot(heart, aes(x = Sex)) +
geom_bar(fill = "steelblue") +
labs(
title = "Sex Distribution",
x = "Sex",
y = "Count"
)

The dataset contains both male and female participants. The distribution indicates whether the dataset is balanced in terms of gender representation.
ggplot(heart, aes(x = PhysicalActivity)) +
geom_bar(fill = "darkgreen") +
labs(
title = "Physical Activity Distribution",
x = "Physical Activity",
y = "Count"
)

This visualization shows the number of individuals who engage in physical activity compared to those who do not. Physical activity is an important lifestyle factor associated with cardiovascular health.
ggplot(heart, aes(x = Diabetic)) +
geom_bar(fill = "purple") +
labs(
title = "Diabetic Distribution",
x = "Diabetic",
y = "Count"
)

The chart demonstrates the distribution of diabetic conditions among participants, including non-diabetic and diabetic individuals.
ggplot(
heart,
aes(
x = HeartDisease,
y = BMI,
fill = HeartDisease
)
) +
geom_boxplot() +
labs(
title = "BMI vs Heart Disease",
x = "Heart Disease",
y = "BMI"
)
The boxplot compares BMI values between individuals with and without heart disease. Differences in median BMI and spread may indicate a relationship between body weight and heart disease risk.
ggplot(
heart,
aes(
x = Smoking,
fill = factor(HeartDisease)
)
) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("pink", "black")) +
labs(
title = "Smoking vs Heart Disease",
x = "Smoking",
y = "Count",
fill = "Heart Disease"
)
The visualization compares smoking habits with heart disease occurrence. Smokers may demonstrate a relatively higher frequency of heart disease compared to non-smokers.
ggplot(
heart,
aes(
x = AgeCategory,
fill = factor(HeartDisease)
)
) +
geom_bar(position = "stack") +
scale_fill_manual(values = c("skyblue", "red")) +
labs(
title = "Age Category vs Heart Disease",
x = "Age Category",
y = "Count",
fill = "Heart Disease"
) +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 1
)
)
The stacked bar chart indicates that heart disease occurrences tend to increase across older age categories. Age appears to be an important factor influencing cardiovascular risk.
ggplot(
heart,
aes(
x = Diabetic,
fill = factor(HeartDisease)
)
) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Diabetic Status vs Heart Disease",
x = "Diabetic Status",
y = "Percentage",
fill = "Heart Disease"
) +
scale_fill_manual(values = c("steelblue", "tomato")) +
theme(axis.text.x = element_text(angle = 15, hjust = 1))
The proportional bar chart shows the relationship between diabetic status and heart disease prevalence. Individuals with diabetes may exhibit higher proportions of heart disease cases.
ggplot(
heart,
aes(
x = GenHealth,
fill = factor(HeartDisease)
)
) +
geom_bar(position = "stack") +
scale_fill_manual(
values = c("steelblue", "tomato")
) +
labs(
title = "General Health vs Heart Disease",
x = "General Health",
y = "Count",
fill = "Heart Disease"
) +
theme(
axis.text.x = element_text(
angle = 20,
hjust = 1
)
)
Participants reporting poorer general health conditions tend to show higher counts of heart disease. This suggests that overall health perception may be associated with cardiovascular health outcomes.
ggplot(
heart,
aes(
x = PhysicalActivity,
fill = factor(HeartDisease)
)
) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("darkgreen", "orange")) +
labs(
title = "Physical Activity vs Heart Disease",
x = "Physical Activity",
y = "Count",
fill = "Heart Disease"
)
The chart compares heart disease occurrence between physically active and inactive individuals. Lower heart disease frequency among active individuals may suggest the protective effect of regular physical activity.
selected_data <- heart_data %>%
dplyr::select(
BMI,
PhysicalHealth,
MentalHealth,
SleepTime,
HeartDisease
)
ggpairs(selected_data)
The pairplot visualizes pairwise relationships among multiple variables simultaneously. It helps identify potential correlations, distributions, and trends between health-related factors and heart disease.
numeric_data <- heart_data %>%
dplyr::select(
BMI,
PhysicalHealth,
MentalHealth,
SleepTime
)
cor_matrix <- cor(numeric_data)
cor_matrix
## BMI PhysicalHealth MentalHealth SleepTime
## BMI 1.00000000 0.10381252 0.05672448 -0.04865342
## PhysicalHealth 0.10381252 1.00000000 0.27965749 -0.05840552
## MentalHealth 0.05672448 0.27965749 1.00000000 -0.11707831
## SleepTime -0.04865342 -0.05840552 -0.11707831 1.00000000
corrplot(
cor_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45
)
The heatmap illustrates the correlation strength between numerical variables. Positive correlations indicate variables that increase together, while negative correlations indicate inverse relationships.
heart %>%
group_by(HeartDisease) %>%
summarise(
Mean_BMI = mean(BMI),
Mean_PhysicalHealth = mean(PhysicalHealth),
Mean_MentalHealth = mean(MentalHealth),
Mean_SleepTime = mean(SleepTime)
)
## # A tibble: 2 × 5
## HeartDisease Mean_BMI Mean_PhysicalHealth Mean_MentalHealth Mean_SleepTime
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 28.3 3.15 4.07 7.08
## 2 1 29.4 7.84 4.66 7.13
The summary table compares average health indicators between individuals with and without heart disease. Differences in mean values may highlight potential risk factors associated with cardiovascular disease.
The EDA findings reveal several important patterns associated with heart disease risk. Variables such as age category, smoking habits, diabetes status, general health condition, and physical activity demonstrate meaningful relationships with heart disease occurrence. These findings provide useful insights for subsequent machine learning classification and predictive modeling tasks.
In this section, three regression models, namely Multiple Linear
Regression, Stepwise Regression and Random Forest Regression, were
developed to predict the target variable
PhysicalHealth.
heart <- read.csv("heart_attack_2020_preprocessed.csv")
set.seed(123)
heart_sample <- heart %>%
sample_n(5000)
str(heart_sample)
## 'data.frame': 5000 obs. of 18 variables:
## $ HeartDisease : int 0 0 1 1 0 0 0 1 0 0 ...
## $ BMI : num 0.108 0.304 0.252 0.149 0.107 ...
## $ Smoking : chr "No" "Yes" "No" "Yes" ...
## $ AlcoholDrinking : chr "No" "No" "No" "No" ...
## $ Stroke : chr "No" "No" "No" "No" ...
## $ PhysicalHealth : num 0 0 0 0 0 0 0 1 0 0 ...
## $ MentalHealth : num 0.5 0 0 0.833 0.4 ...
## $ DiffWalking : chr "No" "No" "No" "No" ...
## $ Sex : chr "Female" "Female" "Female" "Female" ...
## $ AgeCategory : chr "75-79" "55-59" "70-74" "50-54" ...
## $ Race : chr "White" "White" "White" "White" ...
## $ Diabetic : chr "No" "No" "No" "No" ...
## $ PhysicalActivity: chr "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr "Very good" "Good" "Fair" "Good" ...
## $ SleepTime : num 0.217 0.261 0.304 0.304 0.217 ...
## $ Asthma : chr "No" "No" "No" "No" ...
## $ KidneyDisease : chr "No" "No" "No" "No" ...
## $ SkinCancer : chr "No" "No" "No" "No" ...
head(heart_sample)
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth
## 1 0 0.1081734 No No No 0
## 2 0 0.3041169 Yes No No 0
## 3 1 0.2518411 No No No 0
## 4 1 0.1486177 Yes No No 0
## 5 0 0.1074490 Yes No No 0
## 6 0 0.1885790 No No No 0
## MentalHealth DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity
## 1 0.50000000 No Female 75-79 White No Yes
## 2 0.00000000 No Female 55-59 White No Yes
## 3 0.00000000 No Female 70-74 White No Yes
## 4 0.83333333 No Female 50-54 White No No
## 5 0.40000000 No Male 30-34 White No Yes
## 6 0.03333333 No Female 75-79 Other No Yes
## GenHealth SleepTime Asthma KidneyDisease SkinCancer
## 1 Very good 0.2173913 No No No
## 2 Good 0.2608696 No No No
## 3 Fair 0.3043478 No No No
## 4 Good 0.3043478 No No No
## 5 Very good 0.2173913 No No No
## 6 Very good 0.3043478 Yes Yes No
names(heart_sample)
## [1] "HeartDisease" "BMI" "Smoking" "AlcoholDrinking"
## [5] "Stroke" "PhysicalHealth" "MentalHealth" "DiffWalking"
## [9] "Sex" "AgeCategory" "Race" "Diabetic"
## [13] "PhysicalActivity" "GenHealth" "SleepTime" "Asthma"
## [17] "KidneyDisease" "SkinCancer"
set.seed(123)
trainIndex <- createDataPartition(
heart_sample$PhysicalHealth,
p = 0.8,
list = FALSE
)
train <- heart_sample[trainIndex, ]
test <- heart_sample[-trainIndex, ]
cat("Training observations:", nrow(train), "\n")
## Training observations: 4000
cat("Testing observations:", nrow(test), "\n")
## Testing observations: 1000
numeric_data <- heart_sample %>%
select_if(is.numeric)
cor_matrix <- cor(
numeric_data,
use = "complete.obs"
)
corrplot(
cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.cex = 0.7,
number.cex = 0.6
)
The correlation plot helps identify numerical variables that are
related to PhysicalHealth.
lm_model <- lm(
PhysicalHealth ~ BMI +
MentalHealth +
SleepTime +
Smoking +
AlcoholDrinking +
Stroke +
DiffWalking +
Diabetic +
PhysicalActivity +
GenHealth +
HeartDisease,
data = train
)
summary(lm_model)
##
## Call:
## lm(formula = PhysicalHealth ~ BMI + MentalHealth + SleepTime +
## Smoking + AlcoholDrinking + Stroke + DiffWalking + Diabetic +
## PhysicalActivity + GenHealth + HeartDisease, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88464 -0.06442 -0.02584 -0.00393 0.98558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.347e-02 1.971e-02 3.221 0.001289 **
## BMI -7.943e-02 4.439e-02 -1.789 0.073667 .
## MentalHealth 1.284e-01 1.287e-02 9.981 < 2e-16 ***
## SleepTime -3.573e-02 5.099e-02 -0.701 0.483444
## SmokingYes 1.302e-02 6.849e-03 1.901 0.057415 .
## AlcoholDrinkingYes -1.258e-05 1.332e-02 -0.001 0.999246
## StrokeYes 3.172e-02 1.808e-02 1.754 0.079497 .
## DiffWalkingYes 1.685e-01 1.090e-02 15.462 < 2e-16 ***
## DiabeticNo, borderline diabetes -3.757e-02 2.223e-02 -1.690 0.091034 .
## DiabeticYes -2.906e-03 1.055e-02 -0.275 0.782965
## DiabeticYes (during pregnancy) -4.701e-02 4.301e-02 -1.093 0.274449
## PhysicalActivityYes -3.041e-02 8.241e-03 -3.691 0.000227 ***
## GenHealthFair 2.221e-01 1.393e-02 15.942 < 2e-16 ***
## GenHealthGood 3.722e-02 9.960e-03 3.737 0.000189 ***
## GenHealthPoor 5.077e-01 2.089e-02 24.306 < 2e-16 ***
## GenHealthVery good 3.148e-03 9.382e-03 0.336 0.737251
## HeartDisease -5.937e-03 1.260e-02 -0.471 0.637556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2098 on 3983 degrees of freedom
## Multiple R-squared: 0.3924, Adjusted R-squared: 0.39
## F-statistic: 160.8 on 16 and 3983 DF, p-value: < 2.2e-16
lm_pred <- predict(
lm_model,
newdata = test
)
lm_rmse <- rmse(test$PhysicalHealth, lm_pred)
lm_mae <- mae(test$PhysicalHealth, lm_pred)
lm_r2 <- R2(lm_pred, test$PhysicalHealth)
lm_rmse
## [1] 0.2153915
lm_mae
## [1] 0.1244887
lm_r2
## [1] 0.380274
lm_plot <- data.frame(
Actual = test$PhysicalHealth,
Predicted = lm_pred
)
ggplot(
lm_plot,
aes(
x = Actual,
y = Predicted
)
) +
geom_point(
color = "#2E86DE",
alpha = 0.25,
size = 1.5
) +
geom_abline(
intercept = 0,
slope = 1,
color = "#E74C3C",
linetype = "dashed",
linewidth = 1
) +
theme_minimal() +
labs(
title = "Multiple Linear Regression: Predicted vs Actual",
x = "Actual PhysicalHealth",
y = "Predicted PhysicalHealth"
)
The predicted vs actual plot helps visualize the accuracy of the Multiple Linear Regression model. Points closer to the red dashed diagonal line indicate better prediction accuracy.
lm_coeff <- data.frame(
Variable = names(coef(lm_model)),
Coefficient = as.numeric(coef(lm_model))
)
lm_coeff <- lm_coeff %>%
filter(Variable != "(Intercept)")
ggplot(
lm_coeff,
aes(
x = reorder(Variable, Coefficient),
y = Coefficient,
fill = Coefficient > 0
)
) +
geom_col() +
coord_flip() +
theme_minimal() +
scale_fill_manual(
values = c("#2E86DE", "#E74C3C"),
labels = c("Negative", "Positive"),
name = "Effect"
) +
labs(
title = "Regression Coefficient Plot",
x = "Predictor Variable",
y = "Coefficient"
)
The regression coefficient plot shows how each predictor affects
PhysicalHealth. Positive coefficients increase the
predicted value, while negative coefficients decrease it.
lm_residual <- data.frame(
Predicted = lm_pred,
Residual = test$PhysicalHealth - lm_pred
)
ggplot(
lm_residual,
aes(
x = Predicted,
y = Residual
)
) +
geom_point(
color = "#2E86DE",
alpha = 0.25,
size = 1.5
) +
geom_hline(
yintercept = 0,
color = "#E74C3C",
linetype = "dashed",
linewidth = 1
) +
theme_minimal() +
labs(
title = "Multiple Linear Regression: Residual Plot",
x = "Predicted PhysicalHealth",
y = "Residual"
)
The residual plot shows the difference between actual and predicted values. Residuals closer to zero indicate better predictions.
step_model <- stepAIC(
lm_model,
direction = "both",
trace = FALSE
)
summary(step_model)
##
## Call:
## lm(formula = PhysicalHealth ~ BMI + MentalHealth + Smoking +
## Stroke + DiffWalking + PhysicalActivity + GenHealth, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87992 -0.06260 -0.02590 -0.00489 0.98697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.053707 0.013444 3.995 6.59e-05 ***
## BMI -0.085607 0.043659 -1.961 0.049972 *
## MentalHealth 0.129603 0.012751 10.164 < 2e-16 ***
## SmokingYes 0.012804 0.006796 1.884 0.059627 .
## StrokeYes 0.030451 0.017759 1.715 0.086486 .
## DiffWalkingYes 0.168068 0.010826 15.524 < 2e-16 ***
## PhysicalActivityYes -0.030476 0.008227 -3.704 0.000215 ***
## GenHealthFair 0.220048 0.013633 16.141 < 2e-16 ***
## GenHealthGood 0.036897 0.009896 3.728 0.000195 ***
## GenHealthPoor 0.504303 0.020517 24.580 < 2e-16 ***
## GenHealthVery good 0.003238 0.009371 0.346 0.729678
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2097 on 3989 degrees of freedom
## Multiple R-squared: 0.3917, Adjusted R-squared: 0.3902
## F-statistic: 256.9 on 10 and 3989 DF, p-value: < 2.2e-16
step_pred <- predict(
step_model,
newdata = test
)
step_rmse <- rmse(test$PhysicalHealth, step_pred)
step_mae <- mae(test$PhysicalHealth, step_pred)
step_r2 <- R2(step_pred, test$PhysicalHealth)
step_rmse
## [1] 0.2151523
step_mae
## [1] 0.1241588
step_r2
## [1] 0.3816195
step_plot <- data.frame(
Actual = test$PhysicalHealth,
Predicted = step_pred
)
ggplot(
step_plot,
aes(
x = Actual,
y = Predicted
)
) +
geom_point(
color = "#2E86DE",
alpha = 0.25,
size = 1.5
) +
geom_abline(
intercept = 0,
slope = 1,
color = "#E74C3C",
linetype = "dashed",
linewidth = 1
) +
theme_minimal() +
labs(
title = "Stepwise Regression: Predicted vs Actual",
x = "Actual PhysicalHealth",
y = "Predicted PhysicalHealth"
)
Stepwise Regression selects the most relevant predictors and removes less useful variables.
step_residual <- data.frame(
Predicted = step_pred,
Residual = test$PhysicalHealth - step_pred
)
ggplot(
step_residual,
aes(
x = Predicted,
y = Residual
)
) +
geom_point(
color = "#2E86DE",
alpha = 0.25,
size = 1.5
) +
geom_hline(
yintercept = 0,
color = "#E74C3C",
linetype = "dashed",
linewidth = 1
) +
theme_minimal() +
labs(
title = "Stepwise Regression: Residual Plot",
x = "Predicted PhysicalHealth",
y = "Residual"
)
set.seed(123)
rf_model <- randomForest(
PhysicalHealth ~ BMI +
MentalHealth +
SleepTime +
Smoking +
AlcoholDrinking +
Stroke +
DiffWalking +
Diabetic +
PhysicalActivity +
GenHealth +
HeartDisease,
data = train,
ntree = 10,
importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = PhysicalHealth ~ BMI + MentalHealth + SleepTime + Smoking + AlcoholDrinking + Stroke + DiffWalking + Diabetic + PhysicalActivity + GenHealth + HeartDisease, data = train, ntree = 10, importance = TRUE)
## Type of random forest: regression
## Number of trees: 10
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.05824892
## % Var explained: 19.24
rf_pred <- predict(
rf_model,
newdata = test
)
rf_rmse <- rmse(test$PhysicalHealth, rf_pred)
rf_mae <- mae(test$PhysicalHealth, rf_pred)
rf_r2 <- R2(rf_pred, test$PhysicalHealth)
rf_rmse
## [1] 0.2294505
rf_mae
## [1] 0.1359131
rf_r2
## [1] 0.2959072
rf_plot <- data.frame(
Actual = test$PhysicalHealth,
Predicted = rf_pred
)
ggplot(
rf_plot,
aes(
x = Actual,
y = Predicted
)
) +
geom_point(
color = "#2E86DE",
alpha = 0.25,
size = 1.5
) +
geom_abline(
intercept = 0,
slope = 1,
color = "#E74C3C",
linetype = "dashed",
linewidth = 1
) +
theme_minimal() +
labs(
title = "Random Forest Regression: Predicted vs Actual",
x = "Actual PhysicalHealth",
y = "Predicted PhysicalHealth"
)
Random Forest can capture non-linear relationships and interactions between predictors.
varImpPlot(
rf_model,
main = "Random Forest Variable Importance"
)
The variable importance plot identifies the most influential
predictors of PhysicalHealth.
rf_residual <- data.frame(
Predicted = rf_pred,
Residual = test$PhysicalHealth - rf_pred
)
ggplot(
rf_residual,
aes(
x = Predicted,
y = Residual
)
) +
geom_point(
color = "#2E86DE",
alpha = 0.25,
size = 1.5
) +
geom_hline(
yintercept = 0,
color = "#E74C3C",
linetype = "dashed",
linewidth = 1
) +
theme_minimal() +
labs(
title = "Random Forest Regression: Residual Plot",
x = "Predicted PhysicalHealth",
y = "Residual"
)
## Evaluation of Regression Models
results <- data.frame(
Model = c(
"Multiple Linear Regression",
"Stepwise Regression",
"Random Forest Regression"
),
RMSE = c(
lm_rmse,
step_rmse,
rf_rmse
),
MAE = c(
lm_mae,
step_mae,
rf_mae
),
R2 = c(
lm_r2,
step_r2,
rf_r2
)
)
results
## Model RMSE MAE R2
## 1 Multiple Linear Regression 0.2153915 0.1244887 0.3802740
## 2 Stepwise Regression 0.2151523 0.1241588 0.3816195
## 3 Random Forest Regression 0.2294505 0.1359131 0.2959072
ggplot(
results,
aes(
x = reorder(Model, RMSE),
y = RMSE,
fill = Model
)
) +
geom_col() +
theme_minimal() +
labs(
title = "RMSE Comparison of Regression Models",
x = "Regression Model",
y = "RMSE"
) +
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
ggplot(
results,
aes(
x = reorder(Model, MAE),
y = MAE,
fill = Model
)
) +
geom_col() +
theme_minimal() +
labs(
title = "MAE Comparison of Regression Models",
x = "Regression Model",
y = "MAE"
) +
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
ggplot(
results,
aes(
x = reorder(Model, R2),
y = R2,
fill = Model
)
) +
geom_col() +
theme_minimal() +
labs(
title = "R² Comparison of Regression Models",
x = "Regression Model",
y = "R²"
) +
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
The predicted vs actual plots show how closely each model’s
predictions match the actual PhysicalHealth values. A model
with points closer to the diagonal red reference line indicates stronger
predictive accuracy.
The Multiple Linear Regression coefficient plot provides
interpretability by showing the direction and magnitude of each
predictor’s effect on PhysicalHealth. Positive coefficients
indicate an increase in predicted unhealthy physical health days, while
negative coefficients indicate a decrease.
The Random Forest variable importance plot provides additional insight into the most influential predictors. This is useful because Random Forest can capture non-linear relationships and interactions that may not be fully captured by linear models.
Overall, the best-performing model is identified based on the lowest RMSE and MAE, together with the highest R² value.
Three regression models were developed to predict
PhysicalHealth. Multiple Linear Regression was used as the
baseline model, Stepwise Regression was used for variable selection, and
Random Forest Regression was used to capture more complex
relationships.
The model comparison results indicate which model provides the most accurate prediction of physical health outcomes. These findings suggest that health-related and lifestyle factors such as mental health, sleep duration, general health status, smoking status and heart disease status may help explain variation in physical health conditions.
Based on the EDA findings, variables such as age category, smoking habits, diabetes status, general health condition, and physical activity demonstrate meaningful relationships with heart disease. This section develops three classification models to predict whether an individual is at risk of heart disease using these factors.
Target Variable: HeartDisease (1 = At
Risk, 0 = Not at Risk)
heart <- read.csv(
"heart_attack_2020_preprocessed.csv",
stringsAsFactors = FALSE
)
cat("Dataset Dimensions:", nrow(heart), "rows x", ncol(heart), "columns\n")
## Dataset Dimensions: 301717 rows x 18 columns
cat("\nTarget Variable Distribution:\n")
##
## Target Variable Distribution:
print(base::table(heart$HeartDisease))
##
## 0 1
## 274456 27261
cat("\nClass Ratio (%):\n")
##
## Class Ratio (%):
print(round(prop.table(base::table(heart$HeartDisease)) * 100, 2))
##
## 0 1
## 90.96 9.04
model_df <- heart %>%
mutate(
Smoking = ifelse(Smoking == "Yes", 1, 0),
AlcoholDrinking = ifelse(AlcoholDrinking == "Yes", 1, 0),
Stroke = ifelse(Stroke == "Yes", 1, 0),
DiffWalking = ifelse(DiffWalking == "Yes", 1, 0),
Sex = ifelse(Sex == "Male", 1, 0),
PhysicalActivity = ifelse(PhysicalActivity == "Yes", 1, 0),
Asthma = ifelse(Asthma == "Yes", 1, 0),
KidneyDisease = ifelse(KidneyDisease == "Yes", 1, 0),
SkinCancer = ifelse(SkinCancer == "Yes", 1, 0),
Diabetic = ifelse(Diabetic == "Yes", 1, 0),
GenHealth = case_when(
GenHealth == "Poor" ~ 1,
GenHealth == "Fair" ~ 2,
GenHealth == "Good" ~ 3,
GenHealth == "Very good" ~ 4,
GenHealth == "Excellent" ~ 5,
TRUE ~ NA_real_
),
AgeCategory = case_when(
AgeCategory == "18-24" ~ 1,
AgeCategory == "25-29" ~ 2,
AgeCategory == "30-34" ~ 3,
AgeCategory == "35-39" ~ 4,
AgeCategory == "40-44" ~ 5,
AgeCategory == "45-49" ~ 6,
AgeCategory == "50-54" ~ 7,
AgeCategory == "55-59" ~ 8,
AgeCategory == "60-64" ~ 9,
AgeCategory == "65-69" ~ 10,
AgeCategory == "70-74" ~ 11,
AgeCategory == "75-79" ~ 12,
AgeCategory == "80 or older" ~ 13,
TRUE ~ NA_real_
),
HeartDisease = as.factor(HeartDisease)
) %>%
dplyr::select(-Race)
levels(model_df$HeartDisease) <- c("No_Risk", "At_Risk")
cat("Feature engineering complete.\n")
## Feature engineering complete.
cat("Dimensions:", nrow(model_df), "x", ncol(model_df), "\n")
## Dimensions: 301717 x 17
features <- c(
"BMI", "Smoking", "AlcoholDrinking", "Stroke",
"PhysicalHealth", "MentalHealth", "DiffWalking",
"Sex", "AgeCategory", "Diabetic", "PhysicalActivity",
"GenHealth", "SleepTime", "Asthma", "KidneyDisease", "SkinCancer"
)
target <- "HeartDisease"
set.seed(42)
train_idx <- createDataPartition(model_df[[target]], p = 0.75, list = FALSE)
train_data <- model_df[train_idx, ]
test_data <- model_df[-train_idx, ]
# Subsample training set to 30000 rows to reduce computation time
set.seed(42)
sub_idx <- createDataPartition(train_data[[target]], p = 30000 / nrow(train_data), list = FALSE)
train_data <- train_data[sub_idx, ]
cat("Training Set:", nrow(train_data), "rows\n")
## Training Set: 30001 rows
cat("Test Set: ", nrow(test_data), "rows\n")
## Test Set: 75429 rows
cat("\nTraining Target Distribution:\n")
##
## Training Target Distribution:
print(base::table(train_data[[target]]))
##
## No_Risk At_Risk
## 27290 2711
Logistic Regression maps a linear combination of features to a probability in [0,1] via the Sigmoid function. This model applies Elastic Net regularization (combining L1 and L2 penalties) to prevent overfitting while performing automatic feature selection.
\[P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}\]
Given the significant class imbalance in this dataset (90.96% No Risk vs 9.04% At Risk), sample weights proportional to the inverse class frequency are applied during training so that the minority class (At Risk) receives greater influence on the learned coefficients.
X_train <- model.matrix(HeartDisease ~ . - 1, data = train_data[, c(features, target)])
y_train <- train_data[[target]]
X_test <- model.matrix(HeartDisease ~ . - 1, data = test_data[, c(features, target)])
y_test <- test_data[[target]]
# Inverse-frequency class weights to handle imbalance
class_ratio <- as.numeric(base::table(y_train)[1] / base::table(y_train)[2])
weights_train <- ifelse(y_train == "At_Risk", class_ratio, 1)
set.seed(42)
cv_logit <- cv.glmnet(
x = X_train,
y = y_train,
family = "binomial",
alpha = 0.5,
nfolds = 10,
type.measure = "auc",
weights = weights_train
)
best_lambda <- cv_logit$lambda.min * 0.1
logit_model <- glmnet(
x = X_train,
y = y_train,
family = "binomial",
alpha = 0.5,
lambda = best_lambda,
weights = weights_train
)
cat("Optimal Lambda:", round(best_lambda, 6), "\n")
## Optimal Lambda: 0.000345
cat("Non-zero Coefficients:", sum(coef(logit_model) != 0), "\n")
## Non-zero Coefficients: 17
coef_vals <- as.matrix(coef(logit_model))
coef_df <- data.frame(
Feature = rownames(coef_vals),
Coefficient = coef_vals[, 1]
) %>%
filter(Feature != "(Intercept)") %>%
arrange(desc(abs(Coefficient))) %>%
slice(1:15)
ggplot(coef_df, aes(x = reorder(Feature, Coefficient), y = Coefficient,
fill = Coefficient > 0)) +
geom_col(alpha = 0.85) +
coord_flip() +
scale_fill_manual(values = c("#2196F3", "#F44336"),
labels = c("Protective", "Risk Factor")) +
labs(title = "Logistic Regression: Feature Coefficients",
subtitle = "Elastic Net (alpha = 0.5) -- Top 15 Features",
x = NULL, y = "Coefficient Value", fill = "Direction") +
theme_minimal(base_size = 11)
logit_prob <- predict(logit_model, newx = X_test, type = "response", s = best_lambda)
logit_pred <- factor(ifelse(logit_prob > 0.5, "At_Risk", "No_Risk"),
levels = c("No_Risk", "At_Risk"))
Random Forest is an ensemble learning method that builds multiple decision trees and combines their results via majority voting. Its two core mechanisms are:
\[\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]
Class weights are set proportional to the inverse class frequency (1:10) to compensate for the severe imbalance between the majority and minority classes.
set.seed(42)
rf_model <- randomForest(
x = train_data[, features],
y = train_data[[target]],
ntree = 500,
mtry = floor(sqrt(length(features))),
importance = TRUE,
nodesize = 5,
classwt = c("No_Risk" = 1, "At_Risk" = 10)
)
print(rf_model)
##
## Call:
## randomForest(x = train_data[, features], y = train_data[[target]], ntree = 500, mtry = floor(sqrt(length(features))), classwt = c(No_Risk = 1, At_Risk = 10), nodesize = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 18.19%
## Confusion matrix:
## No_Risk At_Risk class.error
## No_Risk 23349 3941 0.1444119
## At_Risk 1515 1196 0.5588344
oob_df <- data.frame(
Trees = 1:rf_model$ntree,
OOB = rf_model$err.rate[, "OOB"],
No_Risk = rf_model$err.rate[, "No_Risk"],
At_Risk = rf_model$err.rate[, "At_Risk"]
) %>%
pivot_longer(-Trees, names_to = "Type", values_to = "Error")
ggplot(oob_df, aes(x = Trees, y = Error, color = Type)) +
geom_line(linewidth = 0.7) +
scale_color_manual(values = c("OOB" = "#333333",
"No_Risk" = "#2196F3",
"At_Risk" = "#F44336")) +
labs(title = "Random Forest: OOB Error vs. Number of Trees",
x = "Number of Trees", y = "Error Rate", color = "Error Type") +
theme_minimal(base_size = 12)
imp_df <- as.data.frame(importance(rf_model)) %>%
rownames_to_column("Feature") %>%
arrange(desc(MeanDecreaseGini))
ggplot(imp_df[1:15, ], aes(x = reorder(Feature, MeanDecreaseGini),
y = MeanDecreaseGini)) +
geom_col(fill = "#4CAF50", alpha = 0.85) +
coord_flip() +
labs(title = "Random Forest: Top 15 Feature Importances",
subtitle = "Metric: Mean Decrease in Gini Impurity",
x = NULL, y = "Mean Decrease Gini") +
theme_minimal(base_size = 11)
rf_pred <- predict(rf_model, newdata = test_data[, features], type = "class")
rf_prob <- predict(rf_model, newdata = test_data[, features], type = "prob")[, "At_Risk"]
Support Vector Machine (SVM) finds the optimal hyperplane that maximizes the margin between classes. For non-linearly separable problems, the Radial Basis Function (RBF) kernel implicitly maps data into a higher-dimensional space where linear separation becomes possible.
\[K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)\]
Hyperparameters cost and gamma are selected
via 3-fold cross-validation on a 10% stratified subsample of the
training data to keep computation time manageable. Class weights (1:10)
are applied to address class imbalance.
set.seed(42)
tune_idx <- createDataPartition(train_data[[target]], p = 0.1, list = FALSE)
tune_data <- train_data[tune_idx, ]
svm_tune <- tune(
svm,
train.x = tune_data[, features],
train.y = tune_data[[target]],
kernel = "radial",
ranges = list(
cost = c(0.1, 1),
gamma = c(0.01, 0.1)
),
tunecontrol = tune.control(
sampling = "cross",
cross = 3
)
)
best_cost <- svm_tune$best.parameters$cost
best_gamma <- svm_tune$best.parameters$gamma
cat("Best Parameters: cost =", best_cost, ", gamma =", best_gamma, "\n")
## Best Parameters: cost = 1 , gamma = 0.1
ggplot(svm_tune$performances,
aes(x = factor(cost), y = factor(gamma), fill = error)) +
geom_tile(color = "white") +
geom_text(aes(label = round(error, 4)), size = 3.5) +
scale_fill_gradient(low = "#4CAF50", high = "#F44336") +
labs(title = "SVM Hyperparameter Tuning: 3-Fold CV Error",
subtitle = "RBF Kernel - Grid Search over Cost x Gamma",
x = "Cost (C)", y = "Gamma", fill = "CV Error") +
theme_minimal(base_size = 12)
svm_model <- svm(
x = train_data[, features],
y = train_data[[target]],
kernel = "radial",
cost = best_cost,
gamma = best_gamma,
probability = TRUE,
class.weights = c("No_Risk" = 1, "At_Risk" = 10)
)
print(svm_model)
##
## Call:
## svm.default(x = train_data[, features], y = train_data[[target]],
## kernel = "radial", gamma = best_gamma, cost = best_cost, class.weights = c(No_Risk = 1,
## At_Risk = 10), probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 15306
svm_pred <- predict(svm_model, newdata = test_data[, features])
svm_prob_obj <- predict(svm_model, newdata = test_data[, features], probability = TRUE)
svm_prob <- attr(svm_prob_obj, "probabilities")[, "At_Risk"]
pred_summary <- data.frame(
Actual = y_test,
LR_Predicted = logit_pred,
LR_Probability = round(as.vector(logit_prob), 4),
RF_Predicted = rf_pred,
RF_Probability = round(rf_prob, 4),
SVM_Predicted = svm_pred,
SVM_Probability = round(svm_prob, 4)
)
cat("Sample Predictions (First 10 Rows):\n")
## Sample Predictions (First 10 Rows):
head(pred_summary, 10)
## Actual LR_Predicted LR_Probability RF_Predicted RF_Probability
## 3 No_Risk At_Risk 0.8777 At_Risk 0.538
## 4 No_Risk At_Risk 0.5387 No_Risk 0.498
## 6 At_Risk At_Risk 0.7436 At_Risk 0.598
## 11 At_Risk At_Risk 0.8870 No_Risk 0.428
## 13 No_Risk At_Risk 0.7758 At_Risk 0.628
## 31 No_Risk At_Risk 0.6924 At_Risk 0.604
## 32 No_Risk At_Risk 0.7914 No_Risk 0.392
## 41 No_Risk At_Risk 0.5003 No_Risk 0.136
## 42 No_Risk No_Risk 0.2321 At_Risk 0.598
## 45 No_Risk At_Risk 0.7059 At_Risk 0.678
## SVM_Predicted SVM_Probability
## 3 No_Risk 0.0953
## 4 No_Risk 0.0712
## 6 At_Risk 0.1394
## 11 No_Risk 0.0983
## 13 At_Risk 0.2313
## 31 At_Risk 0.2625
## 32 At_Risk 0.1908
## 41 No_Risk 0.0754
## 42 No_Risk 0.0558
## 45 At_Risk 0.1795
prob_df <- data.frame(
Model = rep(c("Logistic Regression", "Random Forest", "SVM"),
each = nrow(test_data)),
Probability = c(as.vector(logit_prob), rf_prob, svm_prob),
Actual = rep(as.character(y_test), 3)
)
ggplot(prob_df, aes(x = Probability, fill = Actual)) +
geom_density(alpha = 0.55) +
facet_wrap(~ Model, ncol = 3) +
scale_fill_manual(values = c("No_Risk" = "#2196F3", "At_Risk" = "#F44336")) +
labs(title = "Predicted Probability Distributions by Model",
subtitle = "Separability between risk classes across three classifiers",
x = "Predicted Probability (At Risk)",
y = "Density", fill = "Actual Label") +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom")
evaluate_classification <- function(actual, predicted, probability, model_name) {
cm <- confusionMatrix(
predicted,
actual,
positive = "At_Risk"
)
roc_obj <- roc(
response = actual,
predictor = as.numeric(probability),
levels = c("No_Risk", "At_Risk"),
direction = "<"
)
data.frame(
Model = model_name,
Accuracy = round(as.numeric(cm$overall["Accuracy"]), 4),
Precision = round(as.numeric(cm$byClass["Precision"]), 4),
Recall = round(as.numeric(cm$byClass["Recall"]), 4),
F1_Score = round(as.numeric(cm$byClass["F1"]), 4),
AUC = round(as.numeric(auc(roc_obj)), 4)
)
}
classification_results <- rbind(
evaluate_classification(
y_test,
logit_pred,
as.vector(logit_prob),
"Logistic Regression"
),
evaluate_classification(
y_test,
rf_pred,
rf_prob,
"Random Forest"
),
evaluate_classification(
y_test,
svm_pred,
svm_prob,
"Support Vector Machine"
)
)
classification_results
## Model Accuracy Precision Recall F1_Score AUC
## 1 Logistic Regression 0.7458 0.2312 0.7799 0.3567 0.8364
## 2 Random Forest 0.8187 0.2298 0.4285 0.2992 0.7746
## 3 Support Vector Machine 0.7818 0.2282 0.5938 0.3297 0.7759
# visualization
classification_results_long <- classification_results %>%
pivot_longer(
cols = c(Accuracy, Precision, Recall, F1_Score, AUC),
names_to = "Metric",
values_to = "Value"
)
ggplot(
classification_results_long,
aes(
x = Model,
y = Value,
fill = Metric
)
) +
geom_col(position = "dodge") +
theme_minimal() +
labs(
title = "Classification Model Performance Comparison",
x = "Classification Model",
y = "Score"
) +
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
# Final Conclusion for Model Evaluation
# Select best regression model
best_regression_model <- results %>%
arrange(RMSE, MAE, desc(R2)) %>%
slice(1)
best_regression_model
## Model RMSE MAE R2
## 1 Stepwise Regression 0.2151523 0.1241588 0.3816195
For the regression part, the target variable is PhysicalHealth. Since PhysicalHealth is a numerical value, the models were evaluated using RMSE, MAE, and R².
RMSE and MAE show the prediction error. A lower RMSE or MAE means the model prediction is closer to the actual value. R² shows how well the model explains the changes in PhysicalHealth. A higher R² means the model performs better.
Three regression models were compared: Multiple Linear Regression, Stepwise Regression, and Random Forest Regression. The best regression model should have the lowest RMSE and MAE, and the highest R², which is Stepwise Regression.
The predicted vs actual plots were also used to check the model performance. If the points are closer to the diagonal line, it means the predicted values are closer to the actual values. The residual plots were used to check the prediction errors. Smaller residuals mean better prediction.
The regression results show that variables such as MentalHealth, DiffWalking, GenHealth, PhysicalActivity, and HeartDisease are useful for predicting PhysicalHealth.
# Select best classification model
best_classification_model <- classification_results %>%
arrange(desc(F1_Score), desc(Recall), desc(AUC)) %>%
slice(1)
best_classification_model
## Model Accuracy Precision Recall F1_Score AUC
## 1 Logistic Regression 0.7458 0.2312 0.7799 0.3567 0.8364
For the classification analysis, HeartDisease was used as the target variable and divided into two classes: No_Risk and At_Risk. Since the dataset was imbalanced, accuracy was not the only metric used to evaluate the models. Most records belonged to the No_Risk group, so a model with high accuracy could still perform poorly in identifying actual At_Risk cases.
To make the evaluation more reliable, recall, F1-score, and AUC were also considered. Recall was especially important because the purpose of this classification task was to identify people who may have heart disease risk. After comparing Logistic Regression, Random Forest, and Support Vector Machine, Logistic Regression was selected as the best classification model. Although it did not have the highest accuracy, it provided a better balance between recall, F1-score, AUC, and interpretability. This makes it more suitable for this project, where identifying potential risk cases is more important than simply maximizing overall accuracy.
In conclusion, this project shows that heart disease risk is related to a combination of health and lifestyle factors, rather than a single variable. From the EDA, variables such as age category, smoking, diabetes, general health, physical activity, and difficulty walking showed clear patterns with heart disease risk. These findings also supported the use of predictive models in the next stage.
For modelling, the regression task was mainly used to analyse PhysicalHealth, while the classification task was the main part for predicting HeartDisease risk. Stepwise Regression performed best for the regression part, and Logistic Regression was selected as the best classification model because it gave a reasonable balance of recall, F1-score, AUC, and interpretability. Since the dataset was imbalanced, recall was especially important because the aim was to identify people who may be at risk.
Overall, the project met its main objective by using R to clean, explore, visualise, and model a large health dataset. The results suggest that machine learning can help identify possible heart disease risk patterns, but the model should only be used as analytical support, not as a medical diagnosis tool. Future work could improve the result by better handling class imbalance and testing the model with other health datasets.