Members

Contribution Table

Member Contribution
Liu Tian Shuo Data understanding, missing value checking, duplicate checking, outlier detection, and data preprocessing
Zhao Ning Na Exploratory data analysis, univariate analysis, bivariate analysis, multivariate analysis, correlation analysis, and visualization
Teoh Xiao Qian Regression modelling for predicting PhysicalHealth, including Multiple Linear Regression, Stepwise Regression, and Random Forest Regression
Zhao Yi Xuan Classification modelling for predicting HeartDisease, including Logistic Regression, Random Forest, and Support Vector Machine
Yang Shan Final model evaluation, model comparison, visualization, discussion of results, conclusion, and report integration

Introduction

Heart disease is one of the major health concerns worldwide. Lifestyle habits, physical condition, mental health, and existing medical conditions may influence the risk of heart disease. In this project, a heart disease dataset is used to explore the relationship between lifestyle and health-related indicators.

This project focuses on two predictive tasks. First, regression models are developed to predict PhysicalHealth, which represents the number of physically unhealthy days. Second, classification models are developed to predict HeartDisease, which indicates whether an individual has heart disease.

Dataset Source

For this project, we used the Personal Key Indicators of Heart Disease dataset from Kaggle. The dataset was uploaded by Kamil Pytlak and is based on CDC survey data related to adult health status. It includes personal health and lifestyle variables such as BMI, smoking, alcohol drinking, stroke, physical health, mental health, diabetes, physical activity, and general health. The main target variable is HeartDisease, which is recorded as a binary value to show whether a person has heart disease risk or not. This dataset was selected because it is large, health-related, and suitable for both regression and classification analysis.

The dataset link is https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/data

Project Objectives

The objectives of this project are:

  1. To understand the structure, content, and quality of the heart disease dataset.
  2. To clean and preprocess the dataset for further analysis and modelling.
  3. To explore important lifestyle and health-related factors associated with heart disease.
  4. To build regression models to predict PhysicalHealth.
  5. To build classification models to predict HeartDisease.
  6. To evaluate and compare model performance using suitable metrics and visualizations.

Research Questions

  1. What health and lifestyle factors are associated with heart disease risk?
  2. Can physical health condition be predicted using lifestyle and health-related variables?
  3. Can machine learning models predict whether a person has heart disease risk based on health and lifestyle factors?
packages <- c(
  "tidyverse",
  "caret",
  "MASS",
  "randomForest",
  "Metrics",
  "corrplot",
  "glmnet",
  "pROC",
  "e1071",
  "GGally"
)

installed_packages <- packages %in% rownames(installed.packages())

if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

lapply(packages, library, character.only = TRUE)
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "caret"     "lattice"   "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[3]]
##  [1] "MASS"      "caret"     "lattice"   "lubridate" "forcats"   "stringr"  
##  [7] "dplyr"     "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"  
## [13] "tidyverse" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [19] "methods"   "base"     
## 
## [[4]]
##  [1] "randomForest" "MASS"         "caret"        "lattice"      "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[5]]
##  [1] "Metrics"      "randomForest" "MASS"         "caret"        "lattice"     
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[6]]
##  [1] "corrplot"     "Metrics"      "randomForest" "MASS"         "caret"       
##  [6] "lattice"      "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[7]]
##  [1] "glmnet"       "Matrix"       "corrplot"     "Metrics"      "randomForest"
##  [6] "MASS"         "caret"        "lattice"      "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "pROC"         "glmnet"       "Matrix"       "corrplot"     "Metrics"     
##  [6] "randomForest" "MASS"         "caret"        "lattice"      "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [21] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [26] "base"        
## 
## [[9]]
##  [1] "e1071"        "pROC"         "glmnet"       "Matrix"       "corrplot"    
##  [6] "Metrics"      "randomForest" "MASS"         "caret"        "lattice"     
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[10]]
##  [1] "GGally"       "e1071"        "pROC"         "glmnet"       "Matrix"      
##  [6] "corrplot"     "Metrics"      "randomForest" "MASS"         "caret"       
## [11] "lattice"      "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"

Data Preprocessing

heart_data <- read.csv("heart_attack_2020.csv")

# View data structure
str(heart_data)
## 'data.frame':    319795 obs. of  18 variables:
##  $ HeartDisease    : chr  "No" "No" "No" "No" ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : chr  "Yes" "No" "Yes" "No" ...
##  $ AlcoholDrinking : chr  "No" "No" "No" "No" ...
##  $ Stroke          : chr  "No" "Yes" "No" "No" ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : chr  "No" "No" "No" "No" ...
##  $ Sex             : chr  "Female" "Female" "Male" "Female" ...
##  $ AgeCategory     : chr  "55-59" "80 or older" "65-69" "75-79" ...
##  $ Race            : chr  "White" "White" "White" "White" ...
##  $ Diabetic        : chr  "Yes" "No" "Yes" "No" ...
##  $ PhysicalActivity: chr  "Yes" "Yes" "Yes" "No" ...
##  $ GenHealth       : chr  "Very good" "Very good" "Fair" "Good" ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : chr  "Yes" "No" "Yes" "No" ...
##  $ KidneyDisease   : chr  "No" "No" "No" "No" ...
##  $ SkinCancer      : chr  "Yes" "No" "No" "Yes" ...
# Check dimensions of dataset
dim(heart_data)
## [1] 319795     18
# Preview first few rows
head(heart_data)
##   HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 1           No 16.60     Yes              No     No              3           30
## 2           No 20.34      No              No    Yes              0            0
## 3           No 26.58     Yes              No     No             20           30
## 4           No 24.21      No              No     No              0            0
## 5           No 23.71      No              No     No             28            0
## 6          Yes 28.87     Yes              No     No              6            0
##   DiffWalking    Sex AgeCategory  Race Diabetic PhysicalActivity GenHealth
## 1          No Female       55-59 White      Yes              Yes Very good
## 2          No Female 80 or older White       No              Yes Very good
## 3          No   Male       65-69 White      Yes              Yes      Fair
## 4          No Female       75-79 White       No               No      Good
## 5         Yes Female       40-44 White       No              Yes Very good
## 6         Yes Female       75-79 Black       No               No      Fair
##   SleepTime Asthma KidneyDisease SkinCancer
## 1         5    Yes            No        Yes
## 2         7     No            No         No
## 3         8    Yes            No         No
## 4         6     No            No        Yes
## 5         8     No            No         No
## 6        12     No            No         No

Check Missing Values

# Count missing values in each column
colSums(is.na(heart_data))
##     HeartDisease              BMI          Smoking  AlcoholDrinking 
##                0                0                0                0 
##           Stroke   PhysicalHealth     MentalHealth      DiffWalking 
##                0                0                0                0 
##              Sex      AgeCategory             Race         Diabetic 
##                0                0                0                0 
## PhysicalActivity        GenHealth        SleepTime           Asthma 
##                0                0                0                0 
##    KidneyDisease       SkinCancer 
##                0                0
# Total missing values in dataset
sum(is.na(heart_data))
## [1] 0

The missing value analysis confirmed that no missing values were present in the dataset. Therefore, no imputation was required.

Check Duplicate Records And Outliers

# Count duplicated rows
sum(duplicated(heart_data))
## [1] 18078
# Remove duplicates if any exist
heart_data <- distinct(heart_data)

# Detect outliers in numerical variables

boxplot(
  heart_data$BMI,
  main = "BMI Outlier Detection"
)

boxplot(
  heart_data$PhysicalHealth,
  main = "Physical Health Outlier Detection"
)

boxplot(
  heart_data$MentalHealth,
  main = "Mental Health Outlier Detection"
)

boxplot(
  heart_data$SleepTime,
  main = "Sleep Time Outlier Detection"
)

A total of 18,078 duplicate records were identified and removed. Outliers were examined using boxplots and retained because they may represent valid medical observations rather than data entry errors.

Convert Target Variable

# No = 0, Yes = 1
heart_data$HeartDisease <- ifelse(
  heart_data$HeartDisease == "Yes",
  1,
  0
)

# Convert to factor
heart_data$HeartDisease <- as.factor(heart_data$HeartDisease)

Convert Categorical Variables

heart_data$Smoking <- as.factor(heart_data$Smoking)
heart_data$AlcoholDrinking <- as.factor(heart_data$AlcoholDrinking)
heart_data$Stroke <- as.factor(heart_data$Stroke)
heart_data$DiffWalking <- as.factor(heart_data$DiffWalking)
heart_data$Sex <- as.factor(heart_data$Sex)
heart_data$AgeCategory <- as.factor(heart_data$AgeCategory)
heart_data$Race <- as.factor(heart_data$Race)
heart_data$Diabetic <- as.factor(heart_data$Diabetic)
heart_data$PhysicalActivity <- as.factor(heart_data$PhysicalActivity)
heart_data$GenHealth <- as.factor(heart_data$GenHealth)
heart_data$Asthma <- as.factor(heart_data$Asthma)
heart_data$KidneyDisease <- as.factor(heart_data$KidneyDisease)
heart_data$SkinCancer <- as.factor(heart_data$SkinCancer)

Check Data Types

str(heart_data)
## 'data.frame':    301717 obs. of  18 variables:
##  $ HeartDisease    : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
##  $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stroke          : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
##  $ Sex             : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
##  $ AgeCategory     : Factor w/ 13 levels "18-24","25-29",..: 8 13 10 12 5 12 11 13 13 10 ...
##  $ Race            : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 6 6 3 6 6 6 6 ...
##  $ Diabetic        : Factor w/ 4 levels "No","No, borderline diabetes",..: 3 1 3 1 1 1 1 3 2 1 ...
##  $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
##  $ GenHealth       : Factor w/ 5 levels "Excellent","Fair",..: 5 5 2 3 5 2 2 3 2 3 ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
##  $ KidneyDisease   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ SkinCancer      : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...

Normalize Numerical Variables

numeric_cols <- c(
  "BMI",
  "PhysicalHealth",
  "MentalHealth",
  "SleepTime"
)

# Apply Min-Max Normalization

heart_data[numeric_cols] <- lapply(
  heart_data[numeric_cols],
  function(x) (x - min(x)) / (max(x) - min(x))
)

Verify Processed Data

summary(heart_data)
##  HeartDisease      BMI         Smoking      AlcoholDrinking Stroke      
##  0:274456     Min.   :0.0000   No :174312   No :280136      No :289653  
##  1: 27261     1st Qu.:0.1450   Yes:127405   Yes: 21581      Yes: 12064  
##               Median :0.1858                                            
##               Mean   :0.1983                                            
##               3rd Qu.:0.2370                                            
##               Max.   :1.0000                                            
##                                                                         
##  PhysicalHealth     MentalHealth    DiffWalking      Sex        
##  Min.   :0.00000   Min.   :0.0000   No :257362   Female:159671  
##  1st Qu.:0.00000   1st Qu.:0.0000   Yes: 44355   Male  :142046  
##  Median :0.00000   Median :0.0000                               
##  Mean   :0.11908   Mean   :0.1374                               
##  3rd Qu.:0.06667   3rd Qu.:0.1333                               
##  Max.   :1.00000   Max.   :1.0000                               
##                                                                 
##       AgeCategory                                 Race       
##  65-69      : 31670   American Indian/Alaskan Native:  5192  
##  60-64      : 31219   Asian                         :  7993  
##  70-74      : 29273   Black                         : 22810  
##  55-59      : 27610   Hispanic                      : 27107  
##  50-54      : 23736   Other                         : 10891  
##  80 or older: 23352   White                         :227724  
##  (Other)    :134857                                          
##                     Diabetic      PhysicalActivity     GenHealth     
##  No                     :251796   No : 71305       Excellent: 59737  
##  No, borderline diabetes:  6776   Yes:230412       Fair     : 34659  
##  Yes                    : 40589                    Good     : 91239  
##  Yes (during pregnancy) :  2556                    Poor     : 11286  
##                                                    Very good:104796  
##                                                                      
##                                                                      
##    SleepTime      Asthma       KidneyDisease SkinCancer  
##  Min.   :0.0000   No :259066   No :289941    No :272425  
##  1st Qu.:0.2174   Yes: 42651   Yes: 11776    Yes: 29292  
##  Median :0.2609                                          
##  Mean   :0.2645                                          
##  3rd Qu.:0.3043                                          
##  Max.   :1.0000                                          
## 
# Display first few rows after preprocessing
head(heart_data)
##   HeartDisease        BMI Smoking AlcoholDrinking Stroke PhysicalHealth
## 1            0 0.05529398     Yes              No     No      0.1000000
## 2            0 0.10044670      No              No    Yes      0.0000000
## 3            0 0.17578172     Yes              No     No      0.6666667
## 4            0 0.14716890      No              No     No      0.0000000
## 5            0 0.14113244      No              No     No      0.9333333
## 6            1 0.20342871     Yes              No     No      0.2000000
##   MentalHealth DiffWalking    Sex AgeCategory  Race Diabetic PhysicalActivity
## 1            1          No Female       55-59 White      Yes              Yes
## 2            0          No Female 80 or older White       No              Yes
## 3            1          No   Male       65-69 White      Yes              Yes
## 4            0          No Female       75-79 White       No               No
## 5            0         Yes Female       40-44 White       No              Yes
## 6            0         Yes Female       75-79 Black       No               No
##   GenHealth SleepTime Asthma KidneyDisease SkinCancer
## 1 Very good 0.1739130    Yes            No        Yes
## 2 Very good 0.2608696     No            No         No
## 3      Fair 0.3043478    Yes            No         No
## 4      Good 0.2173913     No            No        Yes
## 5 Very good 0.3043478     No            No         No
## 6      Fair 0.4782609     No            No         No

Export Processed Dataset

# Save the preprocessed dataset as a CSV file

write.csv(
  heart_data,
  "heart_attack_2020_preprocessed.csv",
  row.names = FALSE
)

Exploratory Data Analysis

Import Dataset

heart <- read.csv(
  "heart_attack_2020_preprocessed_non_normalization.csv",
  stringsAsFactors = FALSE
)

Dataset Overview

Dataset Dimensions

dim(heart)
## [1] 301717     18

Dataset Structure

str(heart)
## 'data.frame':    301717 obs. of  18 variables:
##  $ HeartDisease    : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : chr  "Yes" "No" "Yes" "No" ...
##  $ AlcoholDrinking : chr  "No" "No" "No" "No" ...
##  $ Stroke          : chr  "No" "Yes" "No" "No" ...
##  $ PhysicalHealth  : int  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : int  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : chr  "No" "No" "No" "No" ...
##  $ Sex             : chr  "Female" "Female" "Male" "Female" ...
##  $ AgeCategory     : chr  "55-59" "80 or older" "65-69" "75-79" ...
##  $ Race            : chr  "White" "White" "White" "White" ...
##  $ Diabetic        : chr  "Yes" "No" "Yes" "No" ...
##  $ PhysicalActivity: chr  "Yes" "Yes" "Yes" "No" ...
##  $ GenHealth       : chr  "Very good" "Very good" "Fair" "Good" ...
##  $ SleepTime       : int  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : chr  "Yes" "No" "Yes" "No" ...
##  $ KidneyDisease   : chr  "No" "No" "No" "No" ...
##  $ SkinCancer      : chr  "Yes" "No" "No" "Yes" ...

Statistical Summary

summary(heart)
##   HeartDisease          BMI          Smoking          AlcoholDrinking   
##  Min.   :0.00000   Min.   :12.02   Length:301717      Length:301717     
##  1st Qu.:0.00000   1st Qu.:24.03   Class :character   Class :character  
##  Median :0.00000   Median :27.41   Mode  :character   Mode  :character  
##  Mean   :0.09035   Mean   :28.44                                        
##  3rd Qu.:0.00000   3rd Qu.:31.65                                        
##  Max.   :1.00000   Max.   :94.85                                        
##     Stroke          PhysicalHealth    MentalHealth    DiffWalking       
##  Length:301717      Min.   : 0.000   Min.   : 0.000   Length:301717     
##  Class :character   1st Qu.: 0.000   1st Qu.: 0.000   Class :character  
##  Mode  :character   Median : 0.000   Median : 0.000   Mode  :character  
##                     Mean   : 3.572   Mean   : 4.121                     
##                     3rd Qu.: 2.000   3rd Qu.: 4.000                     
##                     Max.   :30.000   Max.   :30.000                     
##      Sex            AgeCategory            Race             Diabetic        
##  Length:301717      Length:301717      Length:301717      Length:301717     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  PhysicalActivity    GenHealth           SleepTime         Asthma         
##  Length:301717      Length:301717      Min.   : 1.000   Length:301717     
##  Class :character   Class :character   1st Qu.: 6.000   Class :character  
##  Mode  :character   Mode  :character   Median : 7.000   Mode  :character  
##                                        Mean   : 7.085                     
##                                        3rd Qu.: 8.000                     
##                                        Max.   :24.000                     
##  KidneyDisease       SkinCancer       
##  Length:301717      Length:301717     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

First Few Rows

head(heart)
##   HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 1            0 16.60     Yes              No     No              3           30
## 2            0 20.34      No              No    Yes              0            0
## 3            0 26.58     Yes              No     No             20           30
## 4            0 24.21      No              No     No              0            0
## 5            0 23.71      No              No     No             28            0
## 6            1 28.87     Yes              No     No              6            0
##   DiffWalking    Sex AgeCategory  Race Diabetic PhysicalActivity GenHealth
## 1          No Female       55-59 White      Yes              Yes Very good
## 2          No Female 80 or older White       No              Yes Very good
## 3          No   Male       65-69 White      Yes              Yes      Fair
## 4          No Female       75-79 White       No               No      Good
## 5         Yes Female       40-44 White       No              Yes Very good
## 6         Yes Female       75-79 Black       No               No      Fair
##   SleepTime Asthma KidneyDisease SkinCancer
## 1         5    Yes            No        Yes
## 2         7     No            No         No
## 3         8    Yes            No         No
## 4         6     No            No        Yes
## 5         8     No            No         No
## 6        12     No            No         No

Univariate Analysis

Numerical Variable Analysis

BMI Distribution

ggplot(heart, aes(x = BMI)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "black"
  ) +
  labs(
    title = "Distribution of BMI",
    x = "BMI",
    y = "Frequency"
  )

Interpretation

The histogram shows the distribution of BMI values among individuals in the dataset. Most observations are concentrated within the middle BMI range, suggesting that the majority of participants fall within average or overweight categories. A small number of extreme values may indicate obesity-related health concerns.


Physical Health Distribution

ggplot(heart, aes(x = PhysicalHealth)) +
  geom_histogram(
    bins = 30,
    fill = "orange",
    color = "black"
  ) +
  labs(
    title = "Distribution of Physical Health",
    x = "Physical Health",
    y = "Frequency"
  )

Interpretation

The distribution of physical health values indicates that many individuals reported relatively lower physical health issues, while fewer participants experienced severe physical health problems.


Mental Health Distribution

ggplot(heart, aes(x = MentalHealth)) +
  geom_histogram(
    bins = 30,
    fill = "purple",
    color = "black"
  ) +
  labs(
    title = "Distribution of Mental Health",
    x = "Mental Health",
    y = "Frequency"
  )

Interpretation

The histogram demonstrates the variation in mental health conditions across participants. Most individuals appear to experience lower levels of mental health difficulties, while a smaller portion reports higher mental health concerns.


Sleep Time Distribution

ggplot(heart, aes(x = SleepTime)) +
  geom_histogram(
    bins = 30,
    fill = "darkgreen",
    color = "black"
  ) +
  labs(
    title = "Distribution of Sleep Time",
    x = "Sleep Time",
    y = "Frequency"
  )

Interpretation

The sleep time distribution indicates that most individuals maintain moderate sleep durations. Extremely low or high sleep durations appear less common and may potentially relate to health conditions.


Categorical Variable Analysis

Smoking Distribution

ggplot(heart, aes(x = Smoking)) +
  geom_bar(fill = "darkred") +
  labs(
    title = "Smoking Distribution",
    x = "Smoking",
    y = "Count"
  )

Interpretation

The bar chart illustrates the number of smokers and non-smokers in the dataset. The comparison provides insight into the prevalence of smoking behavior among participants.


Alcohol Drinking Distribution

ggplot(heart, aes(x = AlcoholDrinking)) +
  geom_bar(fill = "brown") +
  labs(
    title = "Alcohol Drinking Distribution",
    x = "Alcohol Drinking",
    y = "Count"
  )

Interpretation

The chart displays the distribution of alcohol consumption status among individuals. It helps identify whether alcohol drinking behavior is common within the dataset.


Sex Distribution

ggplot(heart, aes(x = Sex)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Sex Distribution",
    x = "Sex",
    y = "Count"
  )

Interpretation

The dataset contains both male and female participants. The distribution indicates whether the dataset is balanced in terms of gender representation.


Physical Activity Distribution

ggplot(heart, aes(x = PhysicalActivity)) +
  geom_bar(fill = "darkgreen") +
  labs(
    title = "Physical Activity Distribution",
    x = "Physical Activity",
    y = "Count"
  )

Interpretation

This visualization shows the number of individuals who engage in physical activity compared to those who do not. Physical activity is an important lifestyle factor associated with cardiovascular health.


Diabetic Distribution

ggplot(heart, aes(x = Diabetic)) +
  geom_bar(fill = "purple") +
  labs(
    title = "Diabetic Distribution",
    x = "Diabetic",
    y = "Count"
  )

Interpretation

The chart demonstrates the distribution of diabetic conditions among participants, including non-diabetic and diabetic individuals.


Bivariate Analysis

BMI vs Heart Disease

ggplot(
  heart,
  aes(
    x = HeartDisease,
    y = BMI,
    fill = HeartDisease
  )
) +
  geom_boxplot() +
  labs(
    title = "BMI vs Heart Disease",
    x = "Heart Disease",
    y = "BMI"
  )

Interpretation

The boxplot compares BMI values between individuals with and without heart disease. Differences in median BMI and spread may indicate a relationship between body weight and heart disease risk.


Smoking vs Heart Disease

ggplot(
  heart,
  aes(
    x = Smoking,
    fill = factor(HeartDisease)
  )
) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("pink", "black")) +
  labs(
    title = "Smoking vs Heart Disease",
    x = "Smoking",
    y = "Count",
    fill = "Heart Disease"
  )

Interpretation

The visualization compares smoking habits with heart disease occurrence. Smokers may demonstrate a relatively higher frequency of heart disease compared to non-smokers.


Age Category vs Heart Disease

ggplot(
  heart,
  aes(
    x = AgeCategory,
    fill = factor(HeartDisease)
  )
) +
  geom_bar(position = "stack") +
  scale_fill_manual(values = c("skyblue", "red")) +
  labs(
    title = "Age Category vs Heart Disease",
    x = "Age Category",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1
    )
  )

Interpretation

The stacked bar chart indicates that heart disease occurrences tend to increase across older age categories. Age appears to be an important factor influencing cardiovascular risk.


Diabetic vs Heart Disease

ggplot(
  heart,
  aes(
    x = Diabetic,
    fill = factor(HeartDisease)
  )
) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Diabetic Status vs Heart Disease",
    x = "Diabetic Status",
    y = "Percentage",
    fill = "Heart Disease"
  ) +
  scale_fill_manual(values = c("steelblue", "tomato")) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

Interpretation

The proportional bar chart shows the relationship between diabetic status and heart disease prevalence. Individuals with diabetes may exhibit higher proportions of heart disease cases.


General Health vs Heart Disease

ggplot(
  heart,
  aes(
    x = GenHealth,
    fill = factor(HeartDisease)
  )
) +
  geom_bar(position = "stack") +
  scale_fill_manual(
    values = c("steelblue", "tomato")
  ) +
  labs(
    title = "General Health vs Heart Disease",
    x = "General Health",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme(
    axis.text.x = element_text(
      angle = 20,
      hjust = 1
    )
  )

Interpretation

Participants reporting poorer general health conditions tend to show higher counts of heart disease. This suggests that overall health perception may be associated with cardiovascular health outcomes.


Physical Activity vs Heart Disease

ggplot(
  heart,
  aes(
    x = PhysicalActivity,
    fill = factor(HeartDisease)
  )
) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("darkgreen", "orange")) +
  labs(
    title = "Physical Activity vs Heart Disease",
    x = "Physical Activity",
    y = "Count",
    fill = "Heart Disease"
  )

Interpretation

The chart compares heart disease occurrence between physically active and inactive individuals. Lower heart disease frequency among active individuals may suggest the protective effect of regular physical activity.


Multivariate Analysis

Pairwise Relationship Analysis

selected_data <- heart_data %>%
  dplyr::select(
    BMI,
    PhysicalHealth,
    MentalHealth,
    SleepTime,
    HeartDisease
  )

ggpairs(selected_data)

Interpretation

The pairplot visualizes pairwise relationships among multiple variables simultaneously. It helps identify potential correlations, distributions, and trends between health-related factors and heart disease.


Correlation Analysis

Correlation Matrix

numeric_data <- heart_data %>%
  dplyr::select(
    BMI,
    PhysicalHealth,
    MentalHealth,
    SleepTime
  )

cor_matrix <- cor(numeric_data)

cor_matrix
##                        BMI PhysicalHealth MentalHealth   SleepTime
## BMI             1.00000000     0.10381252   0.05672448 -0.04865342
## PhysicalHealth  0.10381252     1.00000000   0.27965749 -0.05840552
## MentalHealth    0.05672448     0.27965749   1.00000000 -0.11707831
## SleepTime      -0.04865342    -0.05840552  -0.11707831  1.00000000

Correlation Heatmap

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.col = "black",
  tl.srt = 45
)

Interpretation

The heatmap illustrates the correlation strength between numerical variables. Positive correlations indicate variables that increase together, while negative correlations indicate inverse relationships.


Group Summary Analysis

heart %>%
  group_by(HeartDisease) %>%
  summarise(
    Mean_BMI = mean(BMI),
    Mean_PhysicalHealth = mean(PhysicalHealth),
    Mean_MentalHealth = mean(MentalHealth),
    Mean_SleepTime = mean(SleepTime)
  )
## # A tibble: 2 × 5
##   HeartDisease Mean_BMI Mean_PhysicalHealth Mean_MentalHealth Mean_SleepTime
##          <int>    <dbl>               <dbl>             <dbl>          <dbl>
## 1            0     28.3                3.15              4.07           7.08
## 2            1     29.4                7.84              4.66           7.13

Interpretation

The summary table compares average health indicators between individuals with and without heart disease. Differences in mean values may highlight potential risk factors associated with cardiovascular disease.


Conclusion

The EDA findings reveal several important patterns associated with heart disease risk. Variables such as age category, smoking habits, diabetes status, general health condition, and physical activity demonstrate meaningful relationships with heart disease occurrence. These findings provide useful insights for subsequent machine learning classification and predictive modeling tasks.

Regression Modelling: Predicting PhysicalHealth

In this section, three regression models, namely Multiple Linear Regression, Stepwise Regression and Random Forest Regression, were developed to predict the target variable PhysicalHealth.

heart <- read.csv("heart_attack_2020_preprocessed.csv")

set.seed(123)

heart_sample <- heart %>%
  sample_n(5000)

str(heart_sample)
## 'data.frame':    5000 obs. of  18 variables:
##  $ HeartDisease    : int  0 0 1 1 0 0 0 1 0 0 ...
##  $ BMI             : num  0.108 0.304 0.252 0.149 0.107 ...
##  $ Smoking         : chr  "No" "Yes" "No" "Yes" ...
##  $ AlcoholDrinking : chr  "No" "No" "No" "No" ...
##  $ Stroke          : chr  "No" "No" "No" "No" ...
##  $ PhysicalHealth  : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ MentalHealth    : num  0.5 0 0 0.833 0.4 ...
##  $ DiffWalking     : chr  "No" "No" "No" "No" ...
##  $ Sex             : chr  "Female" "Female" "Female" "Female" ...
##  $ AgeCategory     : chr  "75-79" "55-59" "70-74" "50-54" ...
##  $ Race            : chr  "White" "White" "White" "White" ...
##  $ Diabetic        : chr  "No" "No" "No" "No" ...
##  $ PhysicalActivity: chr  "Yes" "Yes" "Yes" "No" ...
##  $ GenHealth       : chr  "Very good" "Good" "Fair" "Good" ...
##  $ SleepTime       : num  0.217 0.261 0.304 0.304 0.217 ...
##  $ Asthma          : chr  "No" "No" "No" "No" ...
##  $ KidneyDisease   : chr  "No" "No" "No" "No" ...
##  $ SkinCancer      : chr  "No" "No" "No" "No" ...
head(heart_sample)
##   HeartDisease       BMI Smoking AlcoholDrinking Stroke PhysicalHealth
## 1            0 0.1081734      No              No     No              0
## 2            0 0.3041169     Yes              No     No              0
## 3            1 0.2518411      No              No     No              0
## 4            1 0.1486177     Yes              No     No              0
## 5            0 0.1074490     Yes              No     No              0
## 6            0 0.1885790      No              No     No              0
##   MentalHealth DiffWalking    Sex AgeCategory  Race Diabetic PhysicalActivity
## 1   0.50000000          No Female       75-79 White       No              Yes
## 2   0.00000000          No Female       55-59 White       No              Yes
## 3   0.00000000          No Female       70-74 White       No              Yes
## 4   0.83333333          No Female       50-54 White       No               No
## 5   0.40000000          No   Male       30-34 White       No              Yes
## 6   0.03333333          No Female       75-79 Other       No              Yes
##   GenHealth SleepTime Asthma KidneyDisease SkinCancer
## 1 Very good 0.2173913     No            No         No
## 2      Good 0.2608696     No            No         No
## 3      Fair 0.3043478     No            No         No
## 4      Good 0.3043478     No            No         No
## 5 Very good 0.2173913     No            No         No
## 6 Very good 0.3043478    Yes           Yes         No
names(heart_sample)
##  [1] "HeartDisease"     "BMI"              "Smoking"          "AlcoholDrinking" 
##  [5] "Stroke"           "PhysicalHealth"   "MentalHealth"     "DiffWalking"     
##  [9] "Sex"              "AgeCategory"      "Race"             "Diabetic"        
## [13] "PhysicalActivity" "GenHealth"        "SleepTime"        "Asthma"          
## [17] "KidneyDisease"    "SkinCancer"

Data Preparation

set.seed(123)

trainIndex <- createDataPartition(
  heart_sample$PhysicalHealth,
  p = 0.8,
  list = FALSE
)

train <- heart_sample[trainIndex, ]
test <- heart_sample[-trainIndex, ]

cat("Training observations:", nrow(train), "\n")
## Training observations: 4000
cat("Testing observations:", nrow(test), "\n")
## Testing observations: 1000

Regression Correlation Analysis

numeric_data <- heart_sample %>%
  select_if(is.numeric)

cor_matrix <- cor(
  numeric_data,
  use = "complete.obs"
)

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.cex = 0.7,
  number.cex = 0.6
)

The correlation plot helps identify numerical variables that are related to PhysicalHealth.

Multiple Linear Regression Model

lm_model <- lm(
  PhysicalHealth ~ BMI +
    MentalHealth +
    SleepTime +
    Smoking +
    AlcoholDrinking +
    Stroke +
    DiffWalking +
    Diabetic +
    PhysicalActivity +
    GenHealth +
    HeartDisease,
  data = train
)

summary(lm_model)
## 
## Call:
## lm(formula = PhysicalHealth ~ BMI + MentalHealth + SleepTime + 
##     Smoking + AlcoholDrinking + Stroke + DiffWalking + Diabetic + 
##     PhysicalActivity + GenHealth + HeartDisease, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88464 -0.06442 -0.02584 -0.00393  0.98558 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.347e-02  1.971e-02   3.221 0.001289 ** 
## BMI                             -7.943e-02  4.439e-02  -1.789 0.073667 .  
## MentalHealth                     1.284e-01  1.287e-02   9.981  < 2e-16 ***
## SleepTime                       -3.573e-02  5.099e-02  -0.701 0.483444    
## SmokingYes                       1.302e-02  6.849e-03   1.901 0.057415 .  
## AlcoholDrinkingYes              -1.258e-05  1.332e-02  -0.001 0.999246    
## StrokeYes                        3.172e-02  1.808e-02   1.754 0.079497 .  
## DiffWalkingYes                   1.685e-01  1.090e-02  15.462  < 2e-16 ***
## DiabeticNo, borderline diabetes -3.757e-02  2.223e-02  -1.690 0.091034 .  
## DiabeticYes                     -2.906e-03  1.055e-02  -0.275 0.782965    
## DiabeticYes (during pregnancy)  -4.701e-02  4.301e-02  -1.093 0.274449    
## PhysicalActivityYes             -3.041e-02  8.241e-03  -3.691 0.000227 ***
## GenHealthFair                    2.221e-01  1.393e-02  15.942  < 2e-16 ***
## GenHealthGood                    3.722e-02  9.960e-03   3.737 0.000189 ***
## GenHealthPoor                    5.077e-01  2.089e-02  24.306  < 2e-16 ***
## GenHealthVery good               3.148e-03  9.382e-03   0.336 0.737251    
## HeartDisease                    -5.937e-03  1.260e-02  -0.471 0.637556    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2098 on 3983 degrees of freedom
## Multiple R-squared:  0.3924, Adjusted R-squared:   0.39 
## F-statistic: 160.8 on 16 and 3983 DF,  p-value: < 2.2e-16
lm_pred <- predict(
  lm_model,
  newdata = test
)

lm_rmse <- rmse(test$PhysicalHealth, lm_pred)
lm_mae <- mae(test$PhysicalHealth, lm_pred)
lm_r2 <- R2(lm_pred, test$PhysicalHealth)

lm_rmse
## [1] 0.2153915
lm_mae
## [1] 0.1244887
lm_r2
## [1] 0.380274

Predicted vs Actual Plot - Multiple Linear Regression

lm_plot <- data.frame(
  Actual = test$PhysicalHealth,
  Predicted = lm_pred
)

ggplot(
  lm_plot,
  aes(
    x = Actual,
    y = Predicted
  )
) +
  geom_point(
    color = "#2E86DE",
    alpha = 0.25,
    size = 1.5
  ) +
  geom_abline(
    intercept = 0,
    slope = 1,
    color = "#E74C3C",
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_minimal() +
  labs(
    title = "Multiple Linear Regression: Predicted vs Actual",
    x = "Actual PhysicalHealth",
    y = "Predicted PhysicalHealth"
  )

The predicted vs actual plot helps visualize the accuracy of the Multiple Linear Regression model. Points closer to the red dashed diagonal line indicate better prediction accuracy.

Regression Coefficient Plot

lm_coeff <- data.frame(
  Variable = names(coef(lm_model)),
  Coefficient = as.numeric(coef(lm_model))
)

lm_coeff <- lm_coeff %>%
  filter(Variable != "(Intercept)")

ggplot(
  lm_coeff,
  aes(
    x = reorder(Variable, Coefficient),
    y = Coefficient,
    fill = Coefficient > 0
  )
) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  scale_fill_manual(
    values = c("#2E86DE", "#E74C3C"),
    labels = c("Negative", "Positive"),
    name = "Effect"
  ) +
  labs(
    title = "Regression Coefficient Plot",
    x = "Predictor Variable",
    y = "Coefficient"
  )

The regression coefficient plot shows how each predictor affects PhysicalHealth. Positive coefficients increase the predicted value, while negative coefficients decrease it.

Residual Plot - Multiple Linear Regression

lm_residual <- data.frame(
  Predicted = lm_pred,
  Residual = test$PhysicalHealth - lm_pred
)

ggplot(
  lm_residual,
  aes(
    x = Predicted,
    y = Residual
  )
) +
  geom_point(
    color = "#2E86DE",
    alpha = 0.25,
    size = 1.5
  ) +
  geom_hline(
    yintercept = 0,
    color = "#E74C3C",
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_minimal() +
  labs(
    title = "Multiple Linear Regression: Residual Plot",
    x = "Predicted PhysicalHealth",
    y = "Residual"
  )

The residual plot shows the difference between actual and predicted values. Residuals closer to zero indicate better predictions.

Stepwise Regression Model

step_model <- stepAIC(
  lm_model,
  direction = "both",
  trace = FALSE
)

summary(step_model)
## 
## Call:
## lm(formula = PhysicalHealth ~ BMI + MentalHealth + Smoking + 
##     Stroke + DiffWalking + PhysicalActivity + GenHealth, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87992 -0.06260 -0.02590 -0.00489  0.98697 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.053707   0.013444   3.995 6.59e-05 ***
## BMI                 -0.085607   0.043659  -1.961 0.049972 *  
## MentalHealth         0.129603   0.012751  10.164  < 2e-16 ***
## SmokingYes           0.012804   0.006796   1.884 0.059627 .  
## StrokeYes            0.030451   0.017759   1.715 0.086486 .  
## DiffWalkingYes       0.168068   0.010826  15.524  < 2e-16 ***
## PhysicalActivityYes -0.030476   0.008227  -3.704 0.000215 ***
## GenHealthFair        0.220048   0.013633  16.141  < 2e-16 ***
## GenHealthGood        0.036897   0.009896   3.728 0.000195 ***
## GenHealthPoor        0.504303   0.020517  24.580  < 2e-16 ***
## GenHealthVery good   0.003238   0.009371   0.346 0.729678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2097 on 3989 degrees of freedom
## Multiple R-squared:  0.3917, Adjusted R-squared:  0.3902 
## F-statistic: 256.9 on 10 and 3989 DF,  p-value: < 2.2e-16
step_pred <- predict(
  step_model,
  newdata = test
)

step_rmse <- rmse(test$PhysicalHealth, step_pred)
step_mae <- mae(test$PhysicalHealth, step_pred)
step_r2 <- R2(step_pred, test$PhysicalHealth)

step_rmse
## [1] 0.2151523
step_mae
## [1] 0.1241588
step_r2
## [1] 0.3816195

Predicted vs Actual Plot - Stepwise Regression

step_plot <- data.frame(
  Actual = test$PhysicalHealth,
  Predicted = step_pred
)

ggplot(
  step_plot,
  aes(
    x = Actual,
    y = Predicted
  )
) +
  geom_point(
    color = "#2E86DE",
    alpha = 0.25,
    size = 1.5
  ) +
  geom_abline(
    intercept = 0,
    slope = 1,
    color = "#E74C3C",
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_minimal() +
  labs(
    title = "Stepwise Regression: Predicted vs Actual",
    x = "Actual PhysicalHealth",
    y = "Predicted PhysicalHealth"
  )

Stepwise Regression selects the most relevant predictors and removes less useful variables.

Residual Plot - Stepwise Regression

step_residual <- data.frame(
  Predicted = step_pred,
  Residual = test$PhysicalHealth - step_pred
)

ggplot(
  step_residual,
  aes(
    x = Predicted,
    y = Residual
  )
) +
  geom_point(
    color = "#2E86DE",
    alpha = 0.25,
    size = 1.5
  ) +
  geom_hline(
    yintercept = 0,
    color = "#E74C3C",
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_minimal() +
  labs(
    title = "Stepwise Regression: Residual Plot",
    x = "Predicted PhysicalHealth",
    y = "Residual"
  )

Random Forest Regression Model

set.seed(123)

rf_model <- randomForest(
  PhysicalHealth ~ BMI +
    MentalHealth +
    SleepTime +
    Smoking +
    AlcoholDrinking +
    Stroke +
    DiffWalking +
    Diabetic +
    PhysicalActivity +
    GenHealth +
    HeartDisease,
  data = train,
  ntree = 10,
  importance = TRUE
)

rf_model
## 
## Call:
##  randomForest(formula = PhysicalHealth ~ BMI + MentalHealth +      SleepTime + Smoking + AlcoholDrinking + Stroke + DiffWalking +      Diabetic + PhysicalActivity + GenHealth + HeartDisease, data = train,      ntree = 10, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 10
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.05824892
##                     % Var explained: 19.24
rf_pred <- predict(
  rf_model,
  newdata = test
)

rf_rmse <- rmse(test$PhysicalHealth, rf_pred)
rf_mae <- mae(test$PhysicalHealth, rf_pred)
rf_r2 <- R2(rf_pred, test$PhysicalHealth)

rf_rmse
## [1] 0.2294505
rf_mae
## [1] 0.1359131
rf_r2
## [1] 0.2959072

Predicted vs Actual Plot - Random Forest Regression

rf_plot <- data.frame(
  Actual = test$PhysicalHealth,
  Predicted = rf_pred
)

ggplot(
  rf_plot,
  aes(
    x = Actual,
    y = Predicted
  )
) +
  geom_point(
    color = "#2E86DE",
    alpha = 0.25,
    size = 1.5
  ) +
  geom_abline(
    intercept = 0,
    slope = 1,
    color = "#E74C3C",
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_minimal() +
  labs(
    title = "Random Forest Regression: Predicted vs Actual",
    x = "Actual PhysicalHealth",
    y = "Predicted PhysicalHealth"
  )

Random Forest can capture non-linear relationships and interactions between predictors.

Variable Importance Plot

varImpPlot(
  rf_model,
  main = "Random Forest Variable Importance"
)

The variable importance plot identifies the most influential predictors of PhysicalHealth.

Residual Plot - Random Forest Regression

rf_residual <- data.frame(
  Predicted = rf_pred,
  Residual = test$PhysicalHealth - rf_pred
)

ggplot(
  rf_residual,
  aes(
    x = Predicted,
    y = Residual
  )
) +
  geom_point(
    color = "#2E86DE",
    alpha = 0.25,
    size = 1.5
  ) +
  geom_hline(
    yintercept = 0,
    color = "#E74C3C",
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_minimal() +
  labs(
    title = "Random Forest Regression: Residual Plot",
    x = "Predicted PhysicalHealth",
    y = "Residual"
  )

## Evaluation of Regression Models

results <- data.frame(
  Model = c(
    "Multiple Linear Regression",
    "Stepwise Regression",
    "Random Forest Regression"
  ),
  RMSE = c(
    lm_rmse,
    step_rmse,
    rf_rmse
  ),
  MAE = c(
    lm_mae,
    step_mae,
    rf_mae
  ),
  R2 = c(
    lm_r2,
    step_r2,
    rf_r2
  )
)

results
##                        Model      RMSE       MAE        R2
## 1 Multiple Linear Regression 0.2153915 0.1244887 0.3802740
## 2        Stepwise Regression 0.2151523 0.1241588 0.3816195
## 3   Random Forest Regression 0.2294505 0.1359131 0.2959072

RMSE Comparison

ggplot(
  results,
  aes(
    x = reorder(Model, RMSE),
    y = RMSE,
    fill = Model
  )
) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "RMSE Comparison of Regression Models",
    x = "Regression Model",
    y = "RMSE"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

MAE Comparison

ggplot(
  results,
  aes(
    x = reorder(Model, MAE),
    y = MAE,
    fill = Model
  )
) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "MAE Comparison of Regression Models",
    x = "Regression Model",
    y = "MAE"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

R² Comparison

ggplot(
  results,
  aes(
    x = reorder(Model, R2),
    y = R2,
    fill = Model
  )
) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "R² Comparison of Regression Models",
    x = "Regression Model",
    y = "R²"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

Discussion

The predicted vs actual plots show how closely each model’s predictions match the actual PhysicalHealth values. A model with points closer to the diagonal red reference line indicates stronger predictive accuracy.

The Multiple Linear Regression coefficient plot provides interpretability by showing the direction and magnitude of each predictor’s effect on PhysicalHealth. Positive coefficients indicate an increase in predicted unhealthy physical health days, while negative coefficients indicate a decrease.

The Random Forest variable importance plot provides additional insight into the most influential predictors. This is useful because Random Forest can capture non-linear relationships and interactions that may not be fully captured by linear models.

Overall, the best-performing model is identified based on the lowest RMSE and MAE, together with the highest R² value.

Conclusion

Three regression models were developed to predict PhysicalHealth. Multiple Linear Regression was used as the baseline model, Stepwise Regression was used for variable selection, and Random Forest Regression was used to capture more complex relationships.

The model comparison results indicate which model provides the most accurate prediction of physical health outcomes. These findings suggest that health-related and lifestyle factors such as mental health, sleep duration, general health status, smoking status and heart disease status may help explain variation in physical health conditions.

Classification Modelling: Predicting HeartDisease


Modeling Objective

Based on the EDA findings, variables such as age category, smoking habits, diabetes status, general health condition, and physical activity demonstrate meaningful relationships with heart disease. This section develops three classification models to predict whether an individual is at risk of heart disease using these factors.

Target Variable: HeartDisease (1 = At Risk, 0 = Not at Risk)


Data Setup

Load Data

heart <- read.csv(
  "heart_attack_2020_preprocessed.csv",
  stringsAsFactors = FALSE
)

cat("Dataset Dimensions:", nrow(heart), "rows x", ncol(heart), "columns\n")
## Dataset Dimensions: 301717 rows x 18 columns
cat("\nTarget Variable Distribution:\n")
## 
## Target Variable Distribution:
print(base::table(heart$HeartDisease))
## 
##      0      1 
## 274456  27261
cat("\nClass Ratio (%):\n")
## 
## Class Ratio (%):
print(round(prop.table(base::table(heart$HeartDisease)) * 100, 2))
## 
##     0     1 
## 90.96  9.04

Feature Engineering

model_df <- heart %>%
  mutate(
    Smoking          = ifelse(Smoking == "Yes", 1, 0),
    AlcoholDrinking  = ifelse(AlcoholDrinking == "Yes", 1, 0),
    Stroke           = ifelse(Stroke == "Yes", 1, 0),
    DiffWalking      = ifelse(DiffWalking == "Yes", 1, 0),
    Sex              = ifelse(Sex == "Male", 1, 0),
    PhysicalActivity = ifelse(PhysicalActivity == "Yes", 1, 0),
    Asthma           = ifelse(Asthma == "Yes", 1, 0),
    KidneyDisease    = ifelse(KidneyDisease == "Yes", 1, 0),
    SkinCancer       = ifelse(SkinCancer == "Yes", 1, 0),
    Diabetic = ifelse(Diabetic == "Yes", 1, 0),
    GenHealth = case_when(
      GenHealth == "Poor"      ~ 1,
      GenHealth == "Fair"      ~ 2,
      GenHealth == "Good"      ~ 3,
      GenHealth == "Very good" ~ 4,
      GenHealth == "Excellent" ~ 5,
      TRUE ~ NA_real_
    ),
    AgeCategory = case_when(
      AgeCategory == "18-24"       ~ 1,
      AgeCategory == "25-29"       ~ 2,
      AgeCategory == "30-34"       ~ 3,
      AgeCategory == "35-39"       ~ 4,
      AgeCategory == "40-44"       ~ 5,
      AgeCategory == "45-49"       ~ 6,
      AgeCategory == "50-54"       ~ 7,
      AgeCategory == "55-59"       ~ 8,
      AgeCategory == "60-64"       ~ 9,
      AgeCategory == "65-69"       ~ 10,
      AgeCategory == "70-74"       ~ 11,
      AgeCategory == "75-79"       ~ 12,
      AgeCategory == "80 or older" ~ 13,
      TRUE ~ NA_real_
    ),
    HeartDisease = as.factor(HeartDisease)
  ) %>%
  dplyr::select(-Race)

levels(model_df$HeartDisease) <- c("No_Risk", "At_Risk")

cat("Feature engineering complete.\n")
## Feature engineering complete.
cat("Dimensions:", nrow(model_df), "x", ncol(model_df), "\n")
## Dimensions: 301717 x 17

Train-Test Split

features <- c(
  "BMI", "Smoking", "AlcoholDrinking", "Stroke",
  "PhysicalHealth", "MentalHealth", "DiffWalking",
  "Sex", "AgeCategory", "Diabetic", "PhysicalActivity",
  "GenHealth", "SleepTime", "Asthma", "KidneyDisease", "SkinCancer"
)

target <- "HeartDisease"

set.seed(42)
train_idx  <- createDataPartition(model_df[[target]], p = 0.75, list = FALSE)
train_data <- model_df[train_idx, ]
test_data  <- model_df[-train_idx, ]

# Subsample training set to 30000 rows to reduce computation time
set.seed(42)
sub_idx    <- createDataPartition(train_data[[target]], p = 30000 / nrow(train_data), list = FALSE)
train_data <- train_data[sub_idx, ]

cat("Training Set:", nrow(train_data), "rows\n")
## Training Set: 30001 rows
cat("Test Set:    ", nrow(test_data),  "rows\n")
## Test Set:     75429 rows
cat("\nTraining Target Distribution:\n")
## 
## Training Target Distribution:
print(base::table(train_data[[target]]))
## 
## No_Risk At_Risk 
##   27290    2711

Model 1: Logistic Regression

Method Overview

Logistic Regression maps a linear combination of features to a probability in [0,1] via the Sigmoid function. This model applies Elastic Net regularization (combining L1 and L2 penalties) to prevent overfitting while performing automatic feature selection.

\[P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}\]

Given the significant class imbalance in this dataset (90.96% No Risk vs 9.04% At Risk), sample weights proportional to the inverse class frequency are applied during training so that the minority class (At Risk) receives greater influence on the learned coefficients.

Model Training

X_train <- model.matrix(HeartDisease ~ . - 1, data = train_data[, c(features, target)])
y_train <- train_data[[target]]
X_test  <- model.matrix(HeartDisease ~ . - 1, data = test_data[, c(features, target)])
y_test  <- test_data[[target]]

# Inverse-frequency class weights to handle imbalance
class_ratio    <- as.numeric(base::table(y_train)[1] / base::table(y_train)[2])
weights_train  <- ifelse(y_train == "At_Risk", class_ratio, 1)

set.seed(42)
cv_logit <- cv.glmnet(
  x            = X_train,
  y            = y_train,
  family       = "binomial",
  alpha        = 0.5,
  nfolds       = 10,
  type.measure = "auc",
  weights      = weights_train
)

best_lambda <- cv_logit$lambda.min * 0.1

logit_model <- glmnet(
  x       = X_train,
  y       = y_train,
  family  = "binomial",
  alpha   = 0.5,
  lambda  = best_lambda,
  weights = weights_train
)

cat("Optimal Lambda:", round(best_lambda, 6), "\n")
## Optimal Lambda: 0.000345
cat("Non-zero Coefficients:", sum(coef(logit_model) != 0), "\n")
## Non-zero Coefficients: 17
coef_vals <- as.matrix(coef(logit_model))
coef_df <- data.frame(
  Feature     = rownames(coef_vals),
  Coefficient = coef_vals[, 1]
) %>%
  filter(Feature != "(Intercept)") %>%
  arrange(desc(abs(Coefficient))) %>%
  slice(1:15)

ggplot(coef_df, aes(x = reorder(Feature, Coefficient), y = Coefficient,
                    fill = Coefficient > 0)) +
  geom_col(alpha = 0.85) +
  coord_flip() +
  scale_fill_manual(values = c("#2196F3", "#F44336"),
                    labels = c("Protective", "Risk Factor")) +
  labs(title = "Logistic Regression: Feature Coefficients",
       subtitle = "Elastic Net (alpha = 0.5) -- Top 15 Features",
       x = NULL, y = "Coefficient Value", fill = "Direction") +
  theme_minimal(base_size = 11)

Prediction

logit_prob <- predict(logit_model, newx = X_test, type = "response", s = best_lambda)
logit_pred <- factor(ifelse(logit_prob > 0.5, "At_Risk", "No_Risk"),
                     levels = c("No_Risk", "At_Risk"))

Model 2: Random Forest

Method Overview

Random Forest is an ensemble learning method that builds multiple decision trees and combines their results via majority voting. Its two core mechanisms are:

  • Bootstrap Aggregating (Bagging): Each tree trains on a bootstrap sample drawn with replacement from the training data.
  • Random Feature Subsets: At each node split, only a random subset of \(\sqrt{p}\) features is considered, reducing correlation between trees.

\[\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]

Class weights are set proportional to the inverse class frequency (1:10) to compensate for the severe imbalance between the majority and minority classes.

Model Training

set.seed(42)

rf_model <- randomForest(
  x          = train_data[, features],
  y          = train_data[[target]],
  ntree      = 500,
  mtry       = floor(sqrt(length(features))),
  importance = TRUE,
  nodesize   = 5,
  classwt    = c("No_Risk" = 1, "At_Risk" = 10)
)

print(rf_model)
## 
## Call:
##  randomForest(x = train_data[, features], y = train_data[[target]],      ntree = 500, mtry = floor(sqrt(length(features))), classwt = c(No_Risk = 1,          At_Risk = 10), nodesize = 5, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.19%
## Confusion matrix:
##         No_Risk At_Risk class.error
## No_Risk   23349    3941   0.1444119
## At_Risk    1515    1196   0.5588344
oob_df <- data.frame(
  Trees   = 1:rf_model$ntree,
  OOB     = rf_model$err.rate[, "OOB"],
  No_Risk = rf_model$err.rate[, "No_Risk"],
  At_Risk = rf_model$err.rate[, "At_Risk"]
) %>%
  pivot_longer(-Trees, names_to = "Type", values_to = "Error")

ggplot(oob_df, aes(x = Trees, y = Error, color = Type)) +
  geom_line(linewidth = 0.7) +
  scale_color_manual(values = c("OOB"     = "#333333",
                                "No_Risk" = "#2196F3",
                                "At_Risk" = "#F44336")) +
  labs(title = "Random Forest: OOB Error vs. Number of Trees",
       x = "Number of Trees", y = "Error Rate", color = "Error Type") +
  theme_minimal(base_size = 12)

imp_df <- as.data.frame(importance(rf_model)) %>%
  rownames_to_column("Feature") %>%
  arrange(desc(MeanDecreaseGini))

ggplot(imp_df[1:15, ], aes(x = reorder(Feature, MeanDecreaseGini),
                            y = MeanDecreaseGini)) +
  geom_col(fill = "#4CAF50", alpha = 0.85) +
  coord_flip() +
  labs(title = "Random Forest: Top 15 Feature Importances",
       subtitle = "Metric: Mean Decrease in Gini Impurity",
       x = NULL, y = "Mean Decrease Gini") +
  theme_minimal(base_size = 11)

Prediction

rf_pred <- predict(rf_model, newdata = test_data[, features], type = "class")
rf_prob <- predict(rf_model, newdata = test_data[, features], type = "prob")[, "At_Risk"]

Model 3: Support Vector Machine

Method Overview

Support Vector Machine (SVM) finds the optimal hyperplane that maximizes the margin between classes. For non-linearly separable problems, the Radial Basis Function (RBF) kernel implicitly maps data into a higher-dimensional space where linear separation becomes possible.

\[K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)\]

Hyperparameters cost and gamma are selected via 3-fold cross-validation on a 10% stratified subsample of the training data to keep computation time manageable. Class weights (1:10) are applied to address class imbalance.

Model Training

set.seed(42)

tune_idx  <- createDataPartition(train_data[[target]], p = 0.1, list = FALSE)
tune_data <- train_data[tune_idx, ]

svm_tune <- tune(
  svm,
  train.x     = tune_data[, features],
  train.y     = tune_data[[target]],
  kernel      = "radial",
  ranges      = list(
    cost  = c(0.1, 1),
    gamma = c(0.01, 0.1)
  ),
  tunecontrol = tune.control(
    sampling = "cross",
    cross    = 3
  )
)

best_cost  <- svm_tune$best.parameters$cost
best_gamma <- svm_tune$best.parameters$gamma
cat("Best Parameters: cost =", best_cost, ", gamma =", best_gamma, "\n")
## Best Parameters: cost = 1 , gamma = 0.1
ggplot(svm_tune$performances,
       aes(x = factor(cost), y = factor(gamma), fill = error)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(error, 4)), size = 3.5) +
  scale_fill_gradient(low = "#4CAF50", high = "#F44336") +
  labs(title = "SVM Hyperparameter Tuning: 3-Fold CV Error",
       subtitle = "RBF Kernel - Grid Search over Cost x Gamma",
       x = "Cost (C)", y = "Gamma", fill = "CV Error") +
  theme_minimal(base_size = 12)

svm_model <- svm(
  x             = train_data[, features],
  y             = train_data[[target]],
  kernel        = "radial",
  cost          = best_cost,
  gamma         = best_gamma,
  probability   = TRUE,
  class.weights = c("No_Risk" = 1, "At_Risk" = 10)
)

print(svm_model)
## 
## Call:
## svm.default(x = train_data[, features], y = train_data[[target]], 
##     kernel = "radial", gamma = best_gamma, cost = best_cost, class.weights = c(No_Risk = 1, 
##         At_Risk = 10), probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  15306

Prediction

svm_pred     <- predict(svm_model, newdata = test_data[, features])
svm_prob_obj <- predict(svm_model, newdata = test_data[, features], probability = TRUE)
svm_prob     <- attr(svm_prob_obj, "probabilities")[, "At_Risk"]

Prediction Summary

pred_summary <- data.frame(
  Actual          = y_test,
  LR_Predicted    = logit_pred,
  LR_Probability  = round(as.vector(logit_prob), 4),
  RF_Predicted    = rf_pred,
  RF_Probability  = round(rf_prob, 4),
  SVM_Predicted   = svm_pred,
  SVM_Probability = round(svm_prob, 4)
)

cat("Sample Predictions (First 10 Rows):\n")
## Sample Predictions (First 10 Rows):
head(pred_summary, 10)
##     Actual LR_Predicted LR_Probability RF_Predicted RF_Probability
## 3  No_Risk      At_Risk         0.8777      At_Risk          0.538
## 4  No_Risk      At_Risk         0.5387      No_Risk          0.498
## 6  At_Risk      At_Risk         0.7436      At_Risk          0.598
## 11 At_Risk      At_Risk         0.8870      No_Risk          0.428
## 13 No_Risk      At_Risk         0.7758      At_Risk          0.628
## 31 No_Risk      At_Risk         0.6924      At_Risk          0.604
## 32 No_Risk      At_Risk         0.7914      No_Risk          0.392
## 41 No_Risk      At_Risk         0.5003      No_Risk          0.136
## 42 No_Risk      No_Risk         0.2321      At_Risk          0.598
## 45 No_Risk      At_Risk         0.7059      At_Risk          0.678
##    SVM_Predicted SVM_Probability
## 3        No_Risk          0.0953
## 4        No_Risk          0.0712
## 6        At_Risk          0.1394
## 11       No_Risk          0.0983
## 13       At_Risk          0.2313
## 31       At_Risk          0.2625
## 32       At_Risk          0.1908
## 41       No_Risk          0.0754
## 42       No_Risk          0.0558
## 45       At_Risk          0.1795
prob_df <- data.frame(
  Model       = rep(c("Logistic Regression", "Random Forest", "SVM"),
                    each = nrow(test_data)),
  Probability = c(as.vector(logit_prob), rf_prob, svm_prob),
  Actual      = rep(as.character(y_test), 3)
)

ggplot(prob_df, aes(x = Probability, fill = Actual)) +
  geom_density(alpha = 0.55) +
  facet_wrap(~ Model, ncol = 3) +
  scale_fill_manual(values = c("No_Risk" = "#2196F3", "At_Risk" = "#F44336")) +
  labs(title = "Predicted Probability Distributions by Model",
       subtitle = "Separability between risk classes across three classifiers",
       x = "Predicted Probability (At Risk)",
       y = "Density", fill = "Actual Label") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")

Evaluation of Classification Models

evaluate_classification <- function(actual, predicted, probability, model_name) {
  
  cm <- confusionMatrix(
    predicted,
    actual,
    positive = "At_Risk"
  )
  
  roc_obj <- roc(
    response = actual,
    predictor = as.numeric(probability),
    levels = c("No_Risk", "At_Risk"),
    direction = "<"
  )
  
  data.frame(
    Model = model_name,
    Accuracy = round(as.numeric(cm$overall["Accuracy"]), 4),
    Precision = round(as.numeric(cm$byClass["Precision"]), 4),
    Recall = round(as.numeric(cm$byClass["Recall"]), 4),
    F1_Score = round(as.numeric(cm$byClass["F1"]), 4),
    AUC = round(as.numeric(auc(roc_obj)), 4)
  )
}

classification_results <- rbind(
  evaluate_classification(
    y_test,
    logit_pred,
    as.vector(logit_prob),
    "Logistic Regression"
  ),
  evaluate_classification(
    y_test,
    rf_pred,
    rf_prob,
    "Random Forest"
  ),
  evaluate_classification(
    y_test,
    svm_pred,
    svm_prob,
    "Support Vector Machine"
  )
)

classification_results
##                    Model Accuracy Precision Recall F1_Score    AUC
## 1    Logistic Regression   0.7458    0.2312 0.7799   0.3567 0.8364
## 2          Random Forest   0.8187    0.2298 0.4285   0.2992 0.7746
## 3 Support Vector Machine   0.7818    0.2282 0.5938   0.3297 0.7759
# visualization
classification_results_long <- classification_results %>%
  pivot_longer(
    cols = c(Accuracy, Precision, Recall, F1_Score, AUC),
    names_to = "Metric",
    values_to = "Value"
  )

ggplot(
  classification_results_long,
  aes(
    x = Model,
    y = Value,
    fill = Metric
  )
) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(
    title = "Classification Model Performance Comparison",
    x = "Classification Model",
    y = "Score"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

# Final Conclusion for Model Evaluation

Regression Model Evaluation

# Select best regression model
best_regression_model <- results %>%
  arrange(RMSE, MAE, desc(R2)) %>%
  slice(1)

best_regression_model
##                 Model      RMSE       MAE        R2
## 1 Stepwise Regression 0.2151523 0.1241588 0.3816195

For the regression part, the target variable is PhysicalHealth. Since PhysicalHealth is a numerical value, the models were evaluated using RMSE, MAE, and R².

RMSE and MAE show the prediction error. A lower RMSE or MAE means the model prediction is closer to the actual value. R² shows how well the model explains the changes in PhysicalHealth. A higher R² means the model performs better.

Three regression models were compared: Multiple Linear Regression, Stepwise Regression, and Random Forest Regression. The best regression model should have the lowest RMSE and MAE, and the highest R², which is Stepwise Regression.

The predicted vs actual plots were also used to check the model performance. If the points are closer to the diagonal line, it means the predicted values are closer to the actual values. The residual plots were used to check the prediction errors. Smaller residuals mean better prediction.

The regression results show that variables such as MentalHealth, DiffWalking, GenHealth, PhysicalActivity, and HeartDisease are useful for predicting PhysicalHealth.

Classification Model Evaluation

# Select best classification model
best_classification_model <- classification_results %>%
  arrange(desc(F1_Score), desc(Recall), desc(AUC)) %>%
  slice(1)

best_classification_model
##                 Model Accuracy Precision Recall F1_Score    AUC
## 1 Logistic Regression   0.7458    0.2312 0.7799   0.3567 0.8364

For the classification analysis, HeartDisease was used as the target variable and divided into two classes: No_Risk and At_Risk. Since the dataset was imbalanced, accuracy was not the only metric used to evaluate the models. Most records belonged to the No_Risk group, so a model with high accuracy could still perform poorly in identifying actual At_Risk cases.

To make the evaluation more reliable, recall, F1-score, and AUC were also considered. Recall was especially important because the purpose of this classification task was to identify people who may have heart disease risk. After comparing Logistic Regression, Random Forest, and Support Vector Machine, Logistic Regression was selected as the best classification model. Although it did not have the highest accuracy, it provided a better balance between recall, F1-score, AUC, and interpretability. This makes it more suitable for this project, where identifying potential risk cases is more important than simply maximizing overall accuracy.

Conclusion

In conclusion, this project shows that heart disease risk is related to a combination of health and lifestyle factors, rather than a single variable. From the EDA, variables such as age category, smoking, diabetes, general health, physical activity, and difficulty walking showed clear patterns with heart disease risk. These findings also supported the use of predictive models in the next stage.

For modelling, the regression task was mainly used to analyse PhysicalHealth, while the classification task was the main part for predicting HeartDisease risk. Stepwise Regression performed best for the regression part, and Logistic Regression was selected as the best classification model because it gave a reasonable balance of recall, F1-score, AUC, and interpretability. Since the dataset was imbalanced, recall was especially important because the aim was to identify people who may be at risk.

Overall, the project met its main objective by using R to clean, explore, visualise, and model a large health dataset. The results suggest that machine learning can help identify possible heart disease risk patterns, but the model should only be used as analytical support, not as a medical diagnosis tool. Future work could improve the result by better handling class imbalance and testing the model with other health datasets.