Introduction

Heart disease is one of the leading causes of death around the world; thus, early prediction has become a vital area of research. Data analytics has a lot of potent tools to identify key risk factors and patterns leading to heart diseases. Using machine learning techniques coupled with a comprehensive dataset, such as the Heart Failure Prediction dataset available in Kaggle, we aim to develop predictive models that could help in early detection and prevention strategies. This study will explore the association of clinical and demographic factors with heart disease to provide insight into improving patient outcomes.

Objectives

To identify significant clinical and demographic risk factors associated with heart disease using exploratory data analysis.
To develop and evaluate predictive models for heart disease risk based on the Kaggle dataset using machine learning techniques.
To provide actionable insights that could guide healthcare practitioners in early intervention and personalized treatment strategies.

Process Explanation

This project focused on predicting heart disease risk through data preprocessing, exploratory analysis, and modeling. Data was cleaned, transformed, and enriched with derived features. Regression quantified relationships, while classification categorized patients into risk groups. Models were evaluated on metrics like accuracy and sensitivity, showcasing the complementary strengths of both approaches for effective heart disease prediction.

Data Preprocessing

# Load the dataset
data <- read.csv("C:/Users/Lenovo/Desktop/heart.csv")

# Display the first few rows of the dataset
head(data)

##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0

Data Cleaning

Step 1: Check for Missing Values

# Creating a copy of the dataset
process_reviews <- data
# Checking for null values in each column
missing_values <- colSums(is.na(process_reviews))
# Format the output to match the desired layout
formatted_output <- cbind(Column = names(missing_values), Missing_Values = missing_values)
# Print the output
for (i in 1:nrow(formatted_output)) {
  cat(formatted_output[i, "Column"], "\t", formatted_output[i, "Missing_Values"], "\n")
}

## Age   0 
## Sex   0 
## ChestPainType     0 
## RestingBP     0 
## Cholesterol   0 
## FastingBS     0 
## RestingECG    0 
## MaxHR     0 
## ExerciseAngina    0 
## Oldpeak   0 
## ST_Slope      0 
## HeartDisease      0

Step 2: Replace Missing or Invalid Values

# Replace empty strings ("") with NA
process_reviews[process_reviews == ""] <- NA
# Recheck missing values after replacement
missing_values <- colSums(is.na(process_reviews))
# Print updated missing values
for (i in 1:length(missing_values)) {
  cat(names(missing_values)[i], "\t", missing_values[i], "\n")
}

## Age   0 
## Sex   0 
## ChestPainType     0 
## RestingBP     0 
## Cholesterol   0 
## FastingBS     0 
## RestingECG    0 
## MaxHR     0 
## ExerciseAngina    0 
## Oldpeak   0 
## ST_Slope      0 
## HeartDisease      0

Step 3: Handle Structural Errors

# Ensure numerical columns are of numeric type
process_reviews$Age <- as.numeric(process_reviews$Age)
process_reviews$RestingBP <- as.numeric(process_reviews$RestingBP)
process_reviews$Cholesterol <- as.numeric(process_reviews$Cholesterol)
process_reviews$MaxHR <- as.numeric(process_reviews$MaxHR)
process_reviews$Oldpeak <- as.numeric(process_reviews$Oldpeak)
# Check for outliers in key numeric columns
summary(process_reviews$RestingBP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   120.0   130.0   132.4   140.0   200.0

summary(process_reviews$Cholesterol)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   173.2   223.0   198.8   267.0   603.0

Step 4: Remove Duplicate Records

# Remove duplicate rows
process_reviews <- process_reviews[!duplicated(process_reviews), ]
# Print the number of rows after removing duplicates
cat("Rows after removing duplicates:", nrow(process_reviews), "\n")

## Rows after removing duplicates: 918

Step 5: Standardize and Format Categorical Data

# Convert categorical columns to factor type
process_reviews$Sex <- as.factor(process_reviews$Sex)
process_reviews$ChestPainType <- as.factor(process_reviews$ChestPainType)
process_reviews$RestingECG <- as.factor(process_reviews$RestingECG)
process_reviews$ExerciseAngina <- as.factor(process_reviews$ExerciseAngina)
process_reviews$ST_Slope <- as.factor(process_reviews$ST_Slope)
str(process_reviews)

## 'data.frame':    918 obs. of  12 variables:
##  $ Age           : num  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
##  $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingBP     : num  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : num  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ MaxHR         : num  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
##  $ HeartDisease  : int  0 1 0 1 0 0 0 0 1 0 ...

Data Transformation

Step 1: Ensure Tidy Data

str(process_reviews)

## 'data.frame':    918 obs. of  12 variables:
##  $ Age           : num  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
##  $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingBP     : num  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : num  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ MaxHR         : num  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
##  $ HeartDisease  : int  0 1 0 1 0 0 0 0 1 0 ...

head(process_reviews)

##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0

Step 2: Encoding Categorical Variables

# Convert categorical variables to factors
process_reviews$Sex <- as.factor(process_reviews$Sex)
process_reviews$ChestPainType <- as.factor(process_reviews$ChestPainType)
process_reviews$RestingECG <- as.factor(process_reviews$RestingECG)
process_reviews$ExerciseAngina <- as.factor(process_reviews$ExerciseAngina)
process_reviews$ST_Slope <- as.factor(process_reviews$ST_Slope)
# Check the levels for each factor
summary(process_reviews)

##       Age        Sex     ChestPainType   RestingBP      Cholesterol   
##  Min.   :28.00   F:193   ASY:496       Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:47.00   M:725   ATA:173       1st Qu.:120.0   1st Qu.:173.2  
##  Median :54.00           NAP:203       Median :130.0   Median :223.0  
##  Mean   :53.51           TA : 46       Mean   :132.4   Mean   :198.8  
##  3rd Qu.:60.00                         3rd Qu.:140.0   3rd Qu.:267.0  
##  Max.   :77.00                         Max.   :200.0   Max.   :603.0  
##    FastingBS       RestingECG      MaxHR       ExerciseAngina    Oldpeak       
##  Min.   :0.0000   LVH   :188   Min.   : 60.0   N:547          Min.   :-2.6000  
##  1st Qu.:0.0000   Normal:552   1st Qu.:120.0   Y:371          1st Qu.: 0.0000  
##  Median :0.0000   ST    :178   Median :138.0                  Median : 0.6000  
##  Mean   :0.2331                Mean   :136.8                  Mean   : 0.8874  
##  3rd Qu.:0.0000                3rd Qu.:156.0                  3rd Qu.: 1.5000  
##  Max.   :1.0000                Max.   :202.0                  Max.   : 6.2000  
##  ST_Slope    HeartDisease   
##  Down: 63   Min.   :0.0000  
##  Flat:460   1st Qu.:0.0000  
##  Up  :395   Median :1.0000  
##             Mean   :0.5534  
##             3rd Qu.:1.0000  
##             Max.   :1.0000

Step 3:Normalize and Scale Numerical Data

In datasets with numerical variables, the ranges of values can vary significantly.

Normalization rescales values to a range of [0, 1], ensuring all variables contribute equally.

# Normalize numerical variables
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# Apply normalization to selected numerical columns
process_reviews$RestingBP <- normalize(process_reviews$RestingBP)
process_reviews$Cholesterol <- normalize(process_reviews$Cholesterol)
process_reviews$MaxHR <- normalize(process_reviews$MaxHR)
process_reviews$Oldpeak <- normalize(process_reviews$Oldpeak)
# View summary of normalized data
summary(process_reviews[, c("RestingBP", "Cholesterol", "MaxHR", "Oldpeak")])

##    RestingBP      Cholesterol         MaxHR           Oldpeak      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.600   1st Qu.:0.2873   1st Qu.:0.4225   1st Qu.:0.2955  
##  Median :0.650   Median :0.3698   Median :0.5493   Median :0.3636  
##  Mean   :0.662   Mean   :0.3297   Mean   :0.5409   Mean   :0.3963  
##  3rd Qu.:0.700   3rd Qu.:0.4428   3rd Qu.:0.6761   3rd Qu.:0.4659  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Step 4: Create Derived Variables

Derive new features based on the existing data, such as:

Age Group: Categorizing patients by age ranges.

HeartRiskScore: Combining multiple risk factors into one indicator.

HeartRiskScore Explanation:

FastingBS: Doubled as a stronger risk factor.

ExerciseAngina: Adds 1 if the patient experienced angina during exercise.

ST_Slope: Adds 2 for a “Flat” slope, indicating a more severe condition.

# Create age groups
process_reviews$AgeGroup <- cut(
  process_reviews$Age,
  breaks = c(0, 30, 45, 60, 100),
  labels = c("Young", "Middle-Aged", "Senior", "Elderly"),
  right = FALSE
)
# Create a risk score
process_reviews$HeartRiskScore <- process_reviews$FastingBS * 2 +
ifelse(process_reviews$ExerciseAngina == "Y", 1, 0) +
ifelse(process_reviews$ST_Slope == "Flat", 2, 0)
# View derived variables
head(process_reviews[, c("AgeGroup", "HeartRiskScore")])

##      AgeGroup HeartRiskScore
## 1 Middle-Aged              0
## 2      Senior              2
## 3 Middle-Aged              0
## 4      Senior              3
## 5      Senior              0
## 6 Middle-Aged              0

Step 5: Handle Multilevel Variables

ChestPainType, contain multiple categories (e.g., TA, ATA, NAP, ASY)

Transform into Numeric Levels: Assign numeric codes to each category.

# Convert multilevel categorical variables to numeric levels
process_reviews$ChestPainType_Num <- as.numeric(factor(process_reviews$ChestPainType))
process_reviews$RestingECG_Num <- as.numeric(factor(process_reviews$RestingECG))
process_reviews$ExerciseAngina_Num <- as.numeric(factor(process_reviews$ExerciseAngina))
process_reviews$ST_Slope_Num <- as.numeric(factor(process_reviews$ST_Slope))
# Confirm the changes in the dataset
head(process_reviews[, c("ChestPainType", "ChestPainType_Num", "RestingECG", "RestingECG_Num")])

##   ChestPainType ChestPainType_Num RestingECG RestingECG_Num
## 1           ATA                 2     Normal              2
## 2           NAP                 3     Normal              2
## 3           ATA                 2         ST              3
## 4           ASY                 1     Normal              2
## 5           NAP                 3     Normal              2
## 6           NAP                 3     Normal              2

Data Conversion

Step 1: String Manipulations

Trim Leading and Trailing Spaces

Ensure Uniform Case

library(stringr)

## Warning: 程序包'stringr'是用R版本4.4.2 来建造的

# Trim leading and trailing spaces
process_reviews$Sex <- str_trim(process_reviews$Sex)
process_reviews$ChestPainType <- str_trim(process_reviews$ChestPainType)
# Convert to lowercase for consistent formatting
process_reviews$Sex <- tolower(process_reviews$Sex)
process_reviews$ChestPainType <- tolower(process_reviews$ChestPainType)
# Check the results
head(process_reviews[, c("Sex", "ChestPainType")])

##   Sex ChestPainType
## 1   m           ata
## 2   f           nap
## 3   m           ata
## 4   f           asy
## 5   m           nap
## 6   m           nap

Step 2: Verify Data Types

# Verify column types
str(process_reviews)

## 'data.frame':    918 obs. of  18 variables:
##  $ Age               : num  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex               : chr  "m" "f" "m" "f" ...
##  $ ChestPainType     : chr  "ata" "nap" "ata" "asy" ...
##  $ RestingBP         : num  0.7 0.8 0.65 0.69 0.75 0.6 0.65 0.55 0.7 0.6 ...
##  $ Cholesterol       : num  0.479 0.299 0.469 0.355 0.323 ...
##  $ FastingBS         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG        : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ MaxHR             : num  0.789 0.676 0.268 0.338 0.437 ...
##  $ ExerciseAngina    : Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak           : num  0.295 0.409 0.295 0.466 0.295 ...
##  $ ST_Slope          : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
##  $ HeartDisease      : int  0 1 0 1 0 0 0 0 1 0 ...
##  $ AgeGroup          : Factor w/ 4 levels "Young","Middle-Aged",..: 2 3 2 3 3 2 3 3 2 3 ...
##  $ HeartRiskScore    : num  0 2 0 3 0 0 0 0 3 0 ...
##  $ ChestPainType_Num : num  2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingECG_Num    : num  2 2 3 2 2 2 2 2 2 2 ...
##  $ ExerciseAngina_Num: num  1 1 1 2 1 1 1 1 2 1 ...
##  $ ST_Slope_Num      : num  3 2 3 2 3 3 3 3 2 3 ...

# Convert columns to appropriate types
process_reviews$FastingBS <- as.integer(process_reviews$FastingBS)
process_reviews$HeartDisease <- as.factor(process_reviews$HeartDisease)
# Confirm changes
str(process_reviews)

## 'data.frame':    918 obs. of  18 variables:
##  $ Age               : num  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex               : chr  "m" "f" "m" "f" ...
##  $ ChestPainType     : chr  "ata" "nap" "ata" "asy" ...
##  $ RestingBP         : num  0.7 0.8 0.65 0.69 0.75 0.6 0.65 0.55 0.7 0.6 ...
##  $ Cholesterol       : num  0.479 0.299 0.469 0.355 0.323 ...
##  $ FastingBS         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG        : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ MaxHR             : num  0.789 0.676 0.268 0.338 0.437 ...
##  $ ExerciseAngina    : Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak           : num  0.295 0.409 0.295 0.466 0.295 ...
##  $ ST_Slope          : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
##  $ HeartDisease      : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
##  $ AgeGroup          : Factor w/ 4 levels "Young","Middle-Aged",..: 2 3 2 3 3 2 3 3 2 3 ...
##  $ HeartRiskScore    : num  0 2 0 3 0 0 0 0 3 0 ...
##  $ ChestPainType_Num : num  2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingECG_Num    : num  2 2 3 2 2 2 2 2 2 2 ...
##  $ ExerciseAngina_Num: num  1 1 1 2 1 1 1 1 2 1 ...
##  $ ST_Slope_Num      : num  3 2 3 2 3 3 3 3 2 3 ...

Exploratory data analysis

Introduction

This report presents an exploratory data analysis (EDA) and statistical testing on a cleaned dataset related to heart disease. The primary objectives are to:

Analyze the relationships between binary, continuous, and categorical variables.
Perform correlation analysis to understand the strength and direction of associations between variables.
Visualize the distributions and relationships among key variables, grouped by the presence or absence of heart disease.
Conduct hypothesis testing, including t-tests and chi-squared tests, to assess statistical significance.
Summarize and interpret findings through visualizations and concise tables.

Through this analysis, we aim to identify significant predictors and patterns related to heart disease for further insights and potential predictive modeling.

Data Description and Initial Checks

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# Install and load required packages
if (!require("psych")) install.packages("psych", dependencies = TRUE)

## 载入需要的程序包：psych

## Warning: 程序包'psych'是用R版本4.4.2 来建造的

if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE)

## 载入需要的程序包：ggplot2

## Warning: 程序包'ggplot2'是用R版本4.4.2 来建造的

## 
## 载入程序包：'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

if (!require("dplyr")) install.packages("dplyr", dependencies = TRUE)

## 载入需要的程序包：dplyr

## Warning: 程序包'dplyr'是用R版本4.4.2 来建造的

## 
## 载入程序包：'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(psych)
library(ggplot2)
library(dplyr)

# Load dataset
data <- read.csv("C:/Users/Lenovo/Desktop/cleaned_process_reviews.csv", stringsAsFactors = FALSE)

# Descriptive statistics for numeric variables
numeric_cols <- data %>% select_if(is.numeric)
desc_stats <- describe(numeric_cols)
print("Descriptive Statistics:")

## [1] "Descriptive Statistics:"

desc_stats

##                    vars   n  mean   sd median trimmed   mad min max range  skew
## Age                   1 918 53.51 9.43  54.00   53.71 10.38  28  77    49 -0.20
## RestingBP             2 918  0.66 0.09   0.65    0.66  0.07   0   1     1  0.18
## Cholesterol           3 918  0.33 0.18   0.37    0.34  0.11   0   1     1 -0.61
## FastingBS             4 918  0.23 0.42   0.00    0.17  0.00   0   1     1  1.26
## MaxHR                 5 918  0.54 0.18   0.55    0.54  0.19   0   1     1 -0.14
## Oldpeak               6 918  0.40 0.12   0.36    0.38  0.10   0   1     1  1.02
## HeartDisease          7 918  0.55 0.50   1.00    0.57  0.00   0   1     1 -0.21
## HeartRiskScore        8 918  1.87 1.60   2.00    1.75  1.48   0   5     5  0.25
## ChestPainType_Num     9 918  1.78 0.96   1.00    1.66  0.00   1   4     3  0.79
## RestingECG_Num       10 918  1.99 0.63   2.00    1.99  0.00   1   3     2  0.01
## ExerciseAngina_Num   11 918  1.40 0.49   1.00    1.38  0.00   1   2     1  0.39
## ST_Slope_Num         12 918  2.36 0.61   2.00    2.41  0.00   1   3     2 -0.38
##                    kurtosis   se
## Age                   -0.40 0.31
## RestingBP              3.23 0.00
## Cholesterol            0.10 0.01
## FastingBS             -0.41 0.01
## MaxHR                 -0.46 0.01
## Oldpeak                1.18 0.00
## HeartDisease          -1.96 0.02
## HeartRiskScore        -1.03 0.05
## ChestPainType_Num     -0.72 0.03
## RestingECG_Num        -0.50 0.02
## ExerciseAngina_Num    -1.85 0.02
## ST_Slope_Num          -0.67 0.02

Checking Missing Values and Dataset Structure

print("Dataset structure:")

## [1] "Dataset structure:"

str(data)

## 'data.frame':    918 obs. of  18 variables:
##  $ Age               : int  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex               : chr  "m" "f" "m" "f" ...
##  $ ChestPainType     : chr  "ata" "nap" "ata" "asy" ...
##  $ RestingBP         : num  0.7 0.8 0.65 0.69 0.75 0.6 0.65 0.55 0.7 0.6 ...
##  $ Cholesterol       : num  0.479 0.299 0.469 0.355 0.323 ...
##  $ FastingBS         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG        : chr  "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR             : num  0.789 0.676 0.268 0.338 0.437 ...
##  $ ExerciseAngina    : chr  "N" "N" "N" "Y" ...
##  $ Oldpeak           : num  0.295 0.409 0.295 0.466 0.295 ...
##  $ ST_Slope          : chr  "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease      : int  0 1 0 1 0 0 0 0 1 0 ...
##  $ AgeGroup          : chr  "Middle-Aged" "Senior" "Middle-Aged" "Senior" ...
##  $ HeartRiskScore    : int  0 2 0 3 0 0 0 0 3 0 ...
##  $ ChestPainType_Num : int  2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingECG_Num    : int  2 2 3 2 2 2 2 2 2 2 ...
##  $ ExerciseAngina_Num: int  1 1 1 2 1 1 1 1 2 1 ...
##  $ ST_Slope_Num      : int  3 2 3 2 3 3 3 3 2 3 ...

# Check missing value
print("Missing values per column:")

## [1] "Missing values per column:"

missing_values <- colSums(is.na(data))
print(missing_values)

##                Age                Sex      ChestPainType          RestingBP 
##                  0                  0                  0                  0 
##        Cholesterol          FastingBS         RestingECG              MaxHR 
##                  0                  0                  0                  0 
##     ExerciseAngina            Oldpeak           ST_Slope       HeartDisease 
##                  0                  0                  0                  0 
##           AgeGroup     HeartRiskScore  ChestPainType_Num     RestingECG_Num 
##                  0                  0                  0                  0 
## ExerciseAngina_Num       ST_Slope_Num 
##                  0                  0

Descriptive statistics for numeric variables

numeric_cols <- data %>% select_if(is.numeric)
desc_stats <- describe(numeric_cols)
print("Descriptive Statistics:")

## [1] "Descriptive Statistics:"

print(desc_stats)

##                    vars   n  mean   sd median trimmed   mad min max range  skew
## Age                   1 918 53.51 9.43  54.00   53.71 10.38  28  77    49 -0.20
## RestingBP             2 918  0.66 0.09   0.65    0.66  0.07   0   1     1  0.18
## Cholesterol           3 918  0.33 0.18   0.37    0.34  0.11   0   1     1 -0.61
## FastingBS             4 918  0.23 0.42   0.00    0.17  0.00   0   1     1  1.26
## MaxHR                 5 918  0.54 0.18   0.55    0.54  0.19   0   1     1 -0.14
## Oldpeak               6 918  0.40 0.12   0.36    0.38  0.10   0   1     1  1.02
## HeartDisease          7 918  0.55 0.50   1.00    0.57  0.00   0   1     1 -0.21
## HeartRiskScore        8 918  1.87 1.60   2.00    1.75  1.48   0   5     5  0.25
## ChestPainType_Num     9 918  1.78 0.96   1.00    1.66  0.00   1   4     3  0.79
## RestingECG_Num       10 918  1.99 0.63   2.00    1.99  0.00   1   3     2  0.01
## ExerciseAngina_Num   11 918  1.40 0.49   1.00    1.38  0.00   1   2     1  0.39
## ST_Slope_Num         12 918  2.36 0.61   2.00    2.41  0.00   1   3     2 -0.38
##                    kurtosis   se
## Age                   -0.40 0.31
## RestingBP              3.23 0.00
## Cholesterol            0.10 0.01
## FastingBS             -0.41 0.01
## MaxHR                 -0.46 0.01
## Oldpeak                1.18 0.00
## HeartDisease          -1.96 0.02
## HeartRiskScore        -1.03 0.05
## ChestPainType_Num     -0.72 0.03
## RestingECG_Num        -0.50 0.02
## ExerciseAngina_Num    -1.85 0.02
## ST_Slope_Num          -0.67 0.02

Visualize distributions of numeric variables

for (col in colnames(numeric_cols))
  print(ggplot(data, aes_string(x = col)) +
    geom_histogram(fill = "blue", color = "black", bins = 30, alpha = 0.7) +
    labs(title = paste("Distribution of", col), x = col, y = "Count") +
    theme_minimal())

The detect_outliers function flags outliers in a column using the IQR rule, returning TRUE for outliers.

detect_outliers <- function(column) {
  Q1 <- quantile(column, 0.25, na.rm = TRUE) # First quartile (25%)
  Q3 <- quantile(column, 0.75, na.rm = TRUE) # Third quartile (75%)
  IQR <- Q3 - Q1                            # Interquartile range
  lower_bound <- Q1 - 1.5 * IQR             # Lower threshold
  upper_bound <- Q3 + 1.5 * IQR             # Upper threshold

  # Return TRUE/FALSE for outliers
  return(column < lower_bound | column > upper_bound)
}

Detect outliers for numeric columns

numeric_cols <- data %>% select_if(is.numeric)

outlier_summary <- data.frame(Column = character(), Outlier_Count = integer())
for (col in colnames(numeric_cols))
  outliers <- detect_outliers(numeric_cols[[col]])
  count_outliers <- sum(outliers, na.rm = TRUE)
  outlier_summary <- rbind(outlier_summary, data.frame(Column = col, Outlier_Count = count_outliers))

Visualize outliers using boxplots

 print(ggplot(data, aes_string(y = col)) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
    labs(title = paste("Boxplot of", col), y = col) +
    theme_minimal())

# print all outliers for each column
print("Outlier Summary for Numeric Variables:")

## [1] "Outlier Summary for Numeric Variables:"

print(outlier_summary)

##         Column Outlier_Count
## 1 ST_Slope_Num             0

Load Cleaned data

install.packages("psych")
library(psych)  # Load the 'psych' package
library(ggplot2)    # For visualizations
library(dplyr)      # For data manipulation

process_reviews <- read.csv("C:/Users/Lenovo/Desktop/cleaned_process_reviews.csv")

Correlation Analysis Step1: Perform a point-Biserial Correlation analysis between a binary variable (HeartDisease) and multiple continuous variables (Age, RestingBP, Cholesterol, MaxHR, Oldpeak)

# Ensure HeartDisease is numeric
process_reviews$HeartDisease <- as.numeric(as.character(process_reviews$HeartDisease))

# List of continuous variables
continuous_vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")

# Calculate Point-Biserial Correlation for each continuous variable with HeartDisease
point_biserial_results <- sapply(process_reviews[, continuous_vars], function(x) {
  result <- cor.test(x, process_reviews$HeartDisease, method = "pearson") # Pearson used for numeric correlation
  return(c(correlation = result$estimate, p_value = result$p.value)) # Extract correlation coefficient and p-value
})

# Transpose the results to make it more readable
point_biserial_results <- t(point_biserial_results)

# Display the results
cat("Point-Biserial Correlation Results:\n")

## Point-Biserial Correlation Results:

print(point_biserial_results)

##             correlation.cor      p_value
## Age               0.2820385 3.007953e-18
## RestingBP         0.1075890 1.095315e-03
## Cholesterol      -0.2327406 9.308309e-13
## MaxHR            -0.4004208 1.137786e-36
## Oldpeak           0.4039507 2.390772e-37

Step2:Visualize the distribution of three continuous variables (Age, Oldpeak, MaxHR) grouped by the binary variable HeartDisease

# Load ggplot2 package
library(ggplot2)

# Continuous Variable: Age vs HeartDisease
ggplot(process_reviews, aes(x = factor(HeartDisease), y = Age, fill = factor(HeartDisease))) +
  geom_boxplot() +
  labs(
    title = "Age Distribution by Heart Disease",
    x = "Heart Disease",
    y = "Age"
  ) +
  theme_minimal()

# Visualization: Oldpeak vs HeartDisease
ggplot(process_reviews, aes(x = factor(HeartDisease), y = Oldpeak, fill = factor(HeartDisease))) +
  geom_boxplot() +
  labs(
    title = "Oldpeak Distribution by Heart Disease",
    x = "Heart Disease",
    y = "Oldpeak"
  ) +
  theme_minimal()

# Visualization: MaxHR vs HeartDisease
ggplot(process_reviews, aes(x = factor(HeartDisease), y = MaxHR, fill = factor(HeartDisease))) +
  geom_boxplot() +
  labs(
    title = "MaxHR Distribution by Heart Disease",
    x = "Heart Disease",
    y = "MaxHR"
  ) +
  theme_minimal()

Step3:Calculates a correlation matrix for selected variables, rearranges the matrix to emphasize the HeartDisease variable, and then visualizes it as a heatmap.

# Select variables of interest (HeartDisease and continuous variables)
vars <- c("HeartDisease", "Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
selected_data <- process_reviews[, vars]

# Calculate the correlation matrix
cor_matrix <- cor(selected_data, use = "complete.obs")  # Use complete observations only

# Reorder the matrix to move HeartDisease to the last row and column
cor_matrix <- cor_matrix[c(setdiff(rownames(cor_matrix), "HeartDisease"), "HeartDisease"),
                         c(setdiff(colnames(cor_matrix), "HeartDisease"), "HeartDisease")]

# Install and load pheatmap package if necessary
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)

# Draw the heatmap
pheatmap(cor_matrix,
         color = colorRampPalette(c("blue", "white", "red"))(50),  # Gradient color scheme
         display_numbers = TRUE,                                   # Show correlation coefficients
         number_color = "black",                                   # Font color for numbers
         cluster_rows = FALSE,                                     # Disable row clustering
         cluster_cols = FALSE,                                     # Disable column clustering
         main = "Correlation Matrix Heatmap with HeartDisease")    # Heatmap title

Step4:Use Chi-Square Test to assess whether there is a significant association between two categorical variables.

# ChestPainType vs HeartDisease
table_cp <- table(process_reviews$ChestPainType, process_reviews$HeartDisease)
chisq_test_cp <- chisq.test(table_cp)
print("ChestPainType vs HeartDisease")

## [1] "ChestPainType vs HeartDisease"

print(chisq_test_cp)

## 
##  Pearson's Chi-squared test
## 
## data:  table_cp
## X-squared = 268.07, df = 3, p-value < 2.2e-16

# FastingBS vs HeartDisease
table_fbs <- table(process_reviews$FastingBS, process_reviews$HeartDisease)
chisq_test_fbs <- chisq.test(table_fbs)
print("FastingBS vs HeartDisease")

## [1] "FastingBS vs HeartDisease"

print(chisq_test_fbs)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_fbs
## X-squared = 64.321, df = 1, p-value = 1.057e-15

# RestingECG vs HeartDisease
table_ecg <- table(process_reviews$RestingECG, process_reviews$HeartDisease)
chisq_test_ecg <- chisq.test(table_ecg)
print("RestingECG vs HeartDisease")

## [1] "RestingECG vs HeartDisease"

print(chisq_test_ecg)

## 
##  Pearson's Chi-squared test
## 
## data:  table_ecg
## X-squared = 10.931, df = 2, p-value = 0.004229

# ExerciseAngina vs HeartDisease
table_angina <- table(process_reviews$ExerciseAngina, process_reviews$HeartDisease)
chisq_test_angina <- chisq.test(table_angina)
print("ExerciseAngina vs HeartDisease")

## [1] "ExerciseAngina vs HeartDisease"

print(chisq_test_angina)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_angina
## X-squared = 222.26, df = 1, p-value < 2.2e-16

# ST_Slope vs HeartDisease
table_slope <- table(process_reviews$ST_Slope, process_reviews$HeartDisease)
chisq_test_slope <- chisq.test(table_slope)
print("ST_Slope vs HeartDisease")

## [1] "ST_Slope vs HeartDisease"

print(chisq_test_slope)

## 
##  Pearson's Chi-squared test
## 
## data:  table_slope
## X-squared = 355.92, df = 2, p-value < 2.2e-16

Step5:To create grouped bar charts that visualize the distribution of several categorical variables with respect to HeartDisease. Each visualization shows the counts of categories in the x-axis variable, filled (colored) by the HeartDisease status.

# ChestPainType Distribution
ggplot(process_reviews, aes(x = ChestPainType, fill = factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Chest Pain Type vs Heart Disease",
    x = "Chest Pain Type",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme_minimal()

# FastingBS Distribution
ggplot(process_reviews, aes(x = factor(FastingBS), fill = factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Fasting Blood Sugar vs Heart Disease",
    x = "Fasting Blood Sugar",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme_minimal()

# RestingECG Distribution
ggplot(process_reviews, aes(x = RestingECG, fill = factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Resting ECG vs Heart Disease",
    x = "Resting ECG",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme_minimal()

# ExerciseAngina Distribution
ggplot(process_reviews, aes(x = ExerciseAngina, fill = factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Exercise Angina vs Heart Disease",
    x = "Exercise Angina",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme_minimal()

# ST_Slope Distribution
ggplot(process_reviews, aes(x = ST_Slope, fill = factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  labs(
    title = "ST Slope vs Heart Disease",
    x = "ST Slope",
    y = "Count",
    fill = "Heart Disease"
  ) +
  theme_minimal()

Hypothesis Testing

# Define variable lists
continuous_vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")  # Continuous variables
binary_vars <- c("FastingBS", "ExerciseAngina")                               # Binary variables
multiclass_vars <- c("ChestPainType", "RestingECG", "ST_Slope")               # Multi-class variables

# Initialize a list to save results
results <- list()

1. Continuous variables vs HeartDisease: t-test

# 1. Continuous variables vs HeartDisease: t-test
continuous_tests <- lapply(continuous_vars, function(var) {
  t_test_result <- t.test(process_reviews[[var]] ~ process_reviews$HeartDisease)
  list(
    variable = var,
    test = "t-test",
    statistic = t_test_result$statistic,
    p_value = t_test_result$p.value
  )
})
results$continuous <- do.call(rbind, lapply(continuous_tests, as.data.frame))

2. Binary variables vs HeartDisease: Chi-squared test or two-sample proportion test

# 2. Binary variables vs HeartDisease: Chi-squared test or two-sample proportion test
binary_tests <- lapply(binary_vars, function(var) {
  table_data <- table(process_reviews[[var]], process_reviews$HeartDisease)
  chi_test_result <- chisq.test(table_data)
  list(
    variable = var,
    test = "chi-squared test (binary)",
    statistic = chi_test_result$statistic,
    p_value = chi_test_result$p.value
  )
})
results$binary <- do.call(rbind, lapply(binary_tests, as.data.frame))

3. Multi-class variables vs HeartDisease: Chi-squared test

# 3. Multi-class variables vs HeartDisease: Chi-squared test
multiclass_tests <- lapply(multiclass_vars, function(var) {
  table_data <- table(process_reviews[[var]], process_reviews$HeartDisease)
  chi_test_result <- chisq.test(table_data)
  list(
    variable = var,
    test = "chi-squared test (multiclass)",
    statistic = chi_test_result$statistic,
    p_value = chi_test_result$p.value
  )
})
results$multiclass <- do.call(rbind, lapply(multiclass_tests, as.data.frame))

4. Combine all results

# 4. Combine all results
all_results <- rbind(
  data.frame(VariableType = "Continuous", results$continuous),
  data.frame(VariableType = "Binary", results$binary),
  data.frame(VariableType = "Multiclass", results$multiclass)
)

# Print results
print("Hypothesis Testing Results:")

## [1] "Hypothesis Testing Results:"

print(all_results)

##             VariableType       variable                          test
## t             Continuous            Age                        t-test
## t1            Continuous      RestingBP                        t-test
## t2            Continuous    Cholesterol                        t-test
## t3            Continuous          MaxHR                        t-test
## t4            Continuous        Oldpeak                        t-test
## X-squared         Binary      FastingBS     chi-squared test (binary)
## X-squared1        Binary ExerciseAngina     chi-squared test (binary)
## X-squared3    Multiclass  ChestPainType chi-squared test (multiclass)
## X-squared11   Multiclass     RestingECG chi-squared test (multiclass)
## X-squared2    Multiclass       ST_Slope chi-squared test (multiclass)
##              statistic      p_value
## t            -8.822540 6.348337e-18
## t1           -3.339492 8.732265e-04
## t2            7.626851 6.481236e-14
## t3           13.231478 1.430637e-36
## t4          -14.040031 1.902722e-40
## X-squared    64.320679 1.057302e-15
## X-squared1  222.259383 2.907808e-50
## X-squared3  268.067239 8.083728e-58
## X-squared11  10.931469 4.229233e-03
## X-squared2  355.918443 5.167638e-78

5. Visualize significant results

# 5. Visualize significant results
library(ggplot2)
all_results$p_value <- as.numeric(all_results$p_value)
all_results$significant <- ifelse(all_results$p_value < 0.05, "Significant", "Not Significant")

ggplot(all_results, aes(x = reorder(variable, p_value), y = -log10(p_value), fill = significant)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Hypothesis Testing Results",
       x = "Variables",
       y = "-log10(p-value)",
       fill = "Significance") +
  theme_minimal()

Summary

The exploratory data analysis and statistical testing provided valuable insights into the relationships between heart disease and various predictors:

Correlation Analysis:
- Continuous variables such as Oldpeak and MaxHR showed strong correlations with heart disease.
- Variables like Age and Cholesterol also demonstrated statistically significant, though weaker, associations.
Visualization:
- Boxplots revealed distinct differences in the distributions of Age, MaxHR, and Oldpeak between individuals with and without heart disease.
- Grouped bar charts highlighted significant categorical relationships, such as those involving ChestPainType and ST-SLOPE
Hypothesis Testing:
- T-tests identified significant differences in continuous variables, such as Age and Oldpeak, between groups with and without heart disease.
- Chi-squared tests revealed strong associations between categorical variables like ChestPainType, ST_Slope, and heart disease.
Key Findings:
- Oldpeak, ST_slope, and ChestPainType emerged as strong predictors of heart disease.
- Both continuous and categorical variables contribute meaningfully to understanding the presence of heart disease.

These findings provide a foundation for further modeling and analysis, such as logistic regression or machine learning, to predict heart disease outcomes based on the identified key variables.

Model & Evaluation for Classification

This report aims to analyze the training time and accuracy of various models in predicting heart disease. Both classification and regression models are utilized to evaluate their effectiveness in predicting the target variable: HeartDisease (binary: 0 = No, 1 = Yes).

Objective

Task: Predict the presence of heart disease.
Target Variable: HeartDisease

Models Used:

-   Logistic Regression

-   Random Forest

-   XGBoost

Setup and Preparation

    # Install and load required packages
if (!require("dplyr")) install.packages("dplyr", dependencies = TRUE)
if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE)
library(dplyr)
library(ggplot2)
library(data.table)

Load Cleaned data

Load cleaned data that has completed data cleaning and pre-processing.

heart_df <- read.csv("C:/Users/Lenovo/Desktop/cleaned_process_reviews.csv")

selected_features <- c("Age", "Sex", "RestingBP", "Cholesterol",
                       "FastingBS", "MaxHR", "Oldpeak",
                       "AgeGroup", "HeartRiskScore", "ChestPainType_Num",
                       "RestingECG_Num", "ExerciseAngina_Num", "ST_Slope_Num")

heartdf_features <- heart_df[, selected_features]

heart_df$HeartDisease <- as.factor(heart_df$HeartDisease)
colnames(heartdf_features)

##  [1] "Age"                "Sex"                "RestingBP"         
##  [4] "Cholesterol"        "FastingBS"          "MaxHR"             
##  [7] "Oldpeak"            "AgeGroup"           "HeartRiskScore"    
## [10] "ChestPainType_Num"  "RestingECG_Num"     "ExerciseAngina_Num"
## [13] "ST_Slope_Num"

Analyzing the Distribution of the Target Variable

In this section, we check for any imbalance in the dataset regarding the target variable HeartDisease.

table(heart_df$HeartDisease)

## 
##   0   1 
## 410 508

prop.table(table(heart_df$HeartDisease)) * 100

## 
##        0        1 
## 44.66231 55.33769

# Distribution visualization
ggplot(heart_df, aes(x = HeartDisease)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Class Distribution of HeartDisease", x = "HeartDisease", y = "Count")

Observation:

Balanced Dataset: The dataset is reasonably balanced, as the difference between the two classes is small (proportions are close to 50/50).

Splitting the Data into Train and Test Sets

In this section, we split the dataset into training and testing sets using an 80-20 split. To ensure models are trained on train data and evaluated using test data sets.

    # Install and load required packages
if (!require("caret")) install.packages("caret", dependencies = TRUE)
library(caret)

set.seed(123)
trainIndex <- createDataPartition(heart_df$HeartDisease, p = 0.8, list = FALSE)

# Split the dataset
trainData <- heart_df[trainIndex, ]
testData <- heart_df[-trainIndex, ]

# Convert HeartDisease to factor for Logistic Regression and Random Forest
trainData$HeartDisease <- as.factor(trainData$HeartDisease)
testData$HeartDisease <- as.factor(testData$HeartDisease)

# Create separate copies for XGBoost with numeric target variable
trainData_xgb <- trainData
testData_xgb <- testData
trainData_xgb$HeartDisease <- as.numeric(as.character(trainData_xgb$HeartDisease)) - 1
testData_xgb$HeartDisease <- as.numeric(as.character(testData_xgb$HeartDisease)) - 1

Classification

Model 1: Logistic Regression

cat("Logistic Regression Training\n")

## Logistic Regression Training

start_time <- Sys.time()
lr_model <- glm(HeartDisease ~ ., data = trainData, family = "binomial")
end_time <- Sys.time()
lr_time <- end_time - start_time
cat("Logistic Regression Training Time:", lr_time, "\n")

## Logistic Regression Training Time: 0.01539493

lr_pred <- predict(lr_model, testData, type = "response")
lr_pred_class <- ifelse(lr_pred > 0.5, 1, 0)
#print(lr_pred_class)
lr_acc <- mean(lr_pred_class == as.numeric(as.character(testData$HeartDisease)))
cat("Logistic Regression Accuracy:", lr_acc, "\n")

## Logistic Regression Accuracy: 0.9125683

Model 2: Random Forest

    # Install and load required packages
if (!require("randomForest")) install.packages("randomForest", dependencies = TRUE)
library(randomForest)

#Prediction using RF model
cat("Random Forest Training\n")

## Random Forest Training

start_time <- Sys.time()
rf_model <- randomForest(HeartDisease ~ ., data = trainData, ntree = 100)
end_time <- Sys.time()
rf_time <- end_time - start_time
cat("Random Forest Training Time:", rf_time, "\n")

## Random Forest Training Time: 0.06412911

rf_pred <- predict(rf_model, testData)
rf_acc <- mean(rf_pred == testData$HeartDisease)
cat("Random Forest Accuracy:", rf_acc, "\n")

## Random Forest Accuracy: 0.9016393

Model 3: XGBoost

    # Install and load required packages
if (!require("xgboost")) install.packages("xgboost", dependencies = TRUE)
library(xgboost)

#set data for XGBoost to ensure numerical
unique(trainData_xgb$HeartDisease)

## [1] -1  0

unique(testData_xgb$HeartDisease)

## [1] -1  0

trainData_xgb$HeartDisease[trainData_xgb$HeartDisease == -1] <- 1
testData_xgb$HeartDisease[testData_xgb$HeartDisease == -1] <- 1

train_x <- model.matrix(HeartDisease ~ . - 1, data = trainData_xgb)
train_y <- trainData_xgb$HeartDisease

test_x <- model.matrix(HeartDisease ~ . - 1, data = testData_xgb)
test_y <- testData_xgb$HeartDisease

# Align feature names between train_x and test_x
missing_cols <- setdiff(colnames(train_x), colnames(test_x))
for (col in missing_cols) {
  test_x <- cbind(test_x, 0)
  colnames(test_x)[ncol(test_x)] <- col
}
test_x <- test_x[, colnames(train_x)]

# Train XGBoost model
cat("XGBoost Training\n")

## XGBoost Training

start_time <- Sys.time()
xgb_model <- xgboost(data = train_x, label = train_y, nrounds = 100, objective = "binary:logistic", verbose = 0)
end_time <- Sys.time()
xgb_time <- end_time - start_time
cat("XGBoost Training Time:", xgb_time, "\n")

## XGBoost Training Time: 0.4441721

# Predict on test data XGBoost
xgb_pred <- predict(xgb_model, test_x)
xgb_pred_class <- ifelse(xgb_pred > 0.5, 1, 0)
xgb_acc <- mean(xgb_pred_class == test_y)
cat("XGBoost Accuracy:", xgb_acc, "\n")

## XGBoost Accuracy: 0.9016393

Classification: Evaluation Model Performance

Evaluation 1: Logistic Regression

Train Logistic Regression model with k-fold cross-validation
Splits the data into 10 folds, trains the model on 9 folds, and validates on the remaining fold, repeated for all folds.

train_control <- trainControl(method = "cv", number = 10)

cat("Logistic Regression Training with 10-Fold Cross-Validation\n")

## Logistic Regression Training with 10-Fold Cross-Validation

start_time_cv <- Sys.time()
lr_model_cv <- train(
  HeartDisease ~ .,
  data = trainData,
  method = "glm",
  family = "binomial",
  trControl = train_control
)
end_time_cv <- Sys.time()
lr_time_cv <- end_time_cv - start_time_cv
cat("Logistic Regression Training Time:", lr_time_cv, "\n")

## Logistic Regression Training Time: 0.4253371

lr_model_cv

## Generalized Linear Model 
## 
## 735 samples
##  17 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 661, 661, 662, 661, 661, 661, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8461619  0.6881072

# Evaluate performance
lr_pred_cv <- predict(lr_model_cv, testData)
lr_acc_cv <- mean(lr_pred_cv == testData$HeartDisease)
cat("Logistic Regression Accuracy on Test Data:", lr_acc_cv, "\n")

## Logistic Regression Accuracy on Test Data: 0.9125683

Evaluation 2: Random Forest

Train Random Forest model with k-fold cross-validation
Splits the data into 10 folds, trains the model on 9 folds, and validates on the remaining fold, repeated for all folds.

train_control <- trainControl(method = "cv", number = 10)

cat("Random Forest Training with 10-Fold Cross-Validation\n")

## Random Forest Training with 10-Fold Cross-Validation

start_time_cv <- Sys.time()
rf_model_cv <- train(
  HeartDisease ~ .,
  data = trainData,
  method = "rf",
  trControl = train_control,
  ntree = 100
)
end_time_cv <- Sys.time()
rf_time_cv <- end_time_cv - start_time_cv
cat("Random Forest Training Time:", rf_time_cv, "\n")

## Random Forest Training Time: 2.176677

rf_model_cv

## Random Forest 
## 
## 735 samples
##  17 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 661, 661, 662, 661, 663, 661, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8597325  0.7135456
##   12    0.8421274  0.6784905
##   23    0.8434232  0.6809160
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

# Evaluate performance on test data
rf_pred_cv <- predict(rf_model_cv, testData)
rf_acc_cv <- mean(rf_pred == testData$HeartDisease)
cat("Random Forest Accuracy on Test Data:", rf_acc_cv, "\n")

## Random Forest Accuracy on Test Data: 0.9016393

Evaluation 3: XGBoost

Train XGBoost model with k-fold cross-validation
Splits the data into 10 folds, trains the model on 9 folds, and validates on the remaining fold, repeated for all folds.

train_control <- trainControl(method = "cv", number = 10)

cat("XGBoost Training with 10-Fold Cross-Validation\n")

## XGBoost Training with 10-Fold Cross-Validation

start_time_cv <- Sys.time()
xgb_model_cv <- train(
  HeartDisease ~ .,
  data = trainData,
  method = "xgbTree",
  trControl = train_control,
  tuneGrid = expand.grid(   # Hyperparameter tuning grid for models like XGBoost, which rely heavily on hyperparameter tuning for optimal performance.
    nrounds = 100,
    max_depth = 6,
    eta = 0.3,
    gamma = 0,
    colsample_bytree = 1,
    min_child_weight = 1,
    subsample = 1
  )
)
end_time_cv <- Sys.time()
xgb_time_cv <- end_time_cv - start_time_cv
cat("XGBoost Training Time:", xgb_time_cv, "\n")

## XGBoost Training Time: 4.582821

xgb_model_cv

## eXtreme Gradient Boosting 
## 
## 735 samples
##  17 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 661, 662, 661, 662, 661, 662, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8449093  0.6848146
## 
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
##  held constant at a value of 1
## Tuning parameter 'subsample' was held
##  constant at a value of 1

# Evaluate performance
xgb_pred_cv <- predict(xgb_model_cv, testData)
xgb_acc_cv <- mean(xgb_pred_cv == testData$HeartDisease)
cat("XGBoost Accuracy on Test Data:", xgb_acc_cv, "\n")

## XGBoost Accuracy on Test Data: 0.9016393

Classification: Model Comparison Results

In this section, summarised the results for comparison of accuracy for train model and cross-validation model’s result by each model.

# Load required libraries
if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
if (!require("tidyr")) install.packages("tidyr", dependencies = TRUE)
library(tidyr)

Combined results

results <- data.frame(
  model = c("Logistic Regression", "Random Forest", "XGBoost"),
  accuracy = c(lr_acc, rf_acc, xgb_acc),
  train_time = c(lr_time, rf_time, xgb_time),
  accuracy_cv = c(lr_acc_cv, rf_acc_cv, xgb_acc_cv),
  train_time_cv = c(lr_time_cv, rf_time_cv, xgb_time_cv)
)
results

##                 model  accuracy      train_time accuracy_cv  train_time_cv
## 1 Logistic Regression 0.9125683 0.01539493 secs   0.9125683 0.4253371 secs
## 2       Random Forest 0.9016393 0.06412911 secs   0.9016393 2.1766770 secs
## 3             XGBoost 0.9016393 0.44417214 secs   0.9016393 4.5828209 secs

Plot bar chart to show the comparison each mode.

1.Model Accuracy and Cross-Validation Accuracy Comparison

Note: Convert results to long format The conversion to long format organizes data in a way that supports clearer visualization

results_long <- results %>%
  pivot_longer(
    cols = c(accuracy, accuracy_cv),
    names_to = "Metric",
    values_to = "Value"
  )
results_long

## # A tibble: 6 × 5
##   model               train_time      train_time_cv  Metric      Value
##   <chr>               <drtn>          <drtn>         <chr>       <dbl>
## 1 Logistic Regression 0.01539493 secs 0.4253371 secs accuracy    0.913
## 2 Logistic Regression 0.01539493 secs 0.4253371 secs accuracy_cv 0.913
## 3 Random Forest       0.06412911 secs 2.1766770 secs accuracy    0.902
## 4 Random Forest       0.06412911 secs 2.1766770 secs accuracy_cv 0.902
## 5 XGBoost             0.44417214 secs 4.5828209 secs accuracy    0.902
## 6 XGBoost             0.44417214 secs 4.5828209 secs accuracy_cv 0.902

ggplot(results_long, aes(x = model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +  # Side-by-side bars
  geom_text(aes(label = sprintf("%.2f", Value)), position = position_dodge(0.9), vjust = -0.3, size = 4) +
  labs(
    title = "Model Accuracy and Cross-Validation Accuracy Comparison",
    y = "Value",
    x = "Model"
  ) +
  theme_minimal()

2.Model Training Time and Cross-Validation Training Time Comparison

# Ensure training time columns are numeric
results$train_time_numeric <- as.numeric(results$train_time)
results$train_time_cv_numeric <- as.numeric(results$train_time_cv)

# Convert to long format
results_long_time <- results %>%
  pivot_longer(
    cols = c(train_time_numeric, train_time_cv_numeric),
    names_to = "Metric",
    values_to = "Value"
  )

ggplot(results_long_time, aes(x = model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +  # Side-by-side bars
  geom_text(aes(label = sprintf("%.2f", Value)), position = position_dodge(0.9), vjust = -0.3, size = 4) +
  labs(
    title = "Model Training Time and Cross-Validation Training Time Comparison",
    y = "Training Time (seconds)",
    x = "Model"
  ) +
  theme_minimal()

Classification: Plotting Features Importance

1.Logistic Regression Features Importance

importance_lr <- varImp(lr_model)
importance_lr_df <- as.data.frame(importance_lr)
importance_lr_df$Feature <- rownames(importance_lr_df)
importance_lr_df$Overall <- as.numeric(as.character(importance_lr_df$Overall))

top_lr_features <- importance_lr_df[order(-importance_lr_df$Overall), ][1:10, ]
ggplot(top_lr_features, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  coord_flip() +
  labs(
    title = "Logistic Regression Feature Importance (Top 10)",
    x = "Features",
    y = "Importance"
  ) +
  theme_minimal()

2.Random Forest Features Importance

#Random Forest
importance_rf <- importance(rf_model)
importance_rf_df <- data.frame(Feature = row.names(importance_rf), importance_rf)

ggplot(importance_rf_df[order(-importance_rf_df$MeanDecreaseGini), ][1:10, ],
       aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Random Forest Feature Importance (Top 10)",
       x = "Features",
       y = "Mean Decrease in Gini") +
  theme_minimal()

3.XGBoost Features Importance

#Feature Importance for XGBoost
importance_xgb <- xgb.importance(model = xgb_model)
importance_xgb <- as.data.table(importance_xgb)

library(ggplot2)
ggplot(importance_xgb[1:10], aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  labs(title = "XGBoost Feature Importance",
       x = "Feature",
       y = "Gain") +
  theme_minimal()

Classification: Summary

Classification Model
- Performance: All models show consistent accuracy, indicating minimal overfitting
- Accuracy: Logistic Regression (0.91) is the most accurate; XGBoost (0.86) is the least.
- Recommendation: Use Logistic Regression for interpretability and accuracy. Use Random Forest or XGBoost for complex, non-linear data.

Training Time
- Logistic Regression: Fastest, ideal for quick iterations and simple datasets.
- Random Forest: Slowest during cross-validation, good for complex datasets.
- XGBoost: Balanced; slower than Logistic Regression, faster than Random Forest.
Feature Importance
- Top Predictors: HeartRiskScore, Cholesterol, and ST_Slope are consistently important across all models for heart disease prediction.

Model & Evaluation for Regression

This study aims to predict the HeartRiskScore, a numerical indicator representing the likelihood of developing heart disease. By leveraging machine learning models like Random Forest and XGBoost, we analyze key health-related features to provide accurate and interpretable risk predictions(1-5), which can support early intervention and medical decision-making.

Objective

Task: Predict the HeartRiskScore.
Target Variable: HeartDisease
Models Used:
```
-   Random Forest

-   XGBoost
```

Model 1: Random forest

# load package
if (!require("randomForest")) install.packages("randomForest")
library(randomForest)

# build random forest
rf_model_tuned <- randomForest(HeartRiskScore ~ FastingBS + MaxHR + Oldpeak + ExerciseAngina_Num + ST_Slope_Num,
                               data = trainData, ntree = 500, mtry = 2)


print(rf_model_tuned)

## 
## Call:
##  randomForest(formula = HeartRiskScore ~ FastingBS + MaxHR + Oldpeak +      ExerciseAngina_Num + ST_Slope_Num, data = trainData, ntree = 500,      mtry = 2) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.0204982
##                     % Var explained: 99.2

# Predictive test set
rf_predictions <- predict(rf_model_tuned, newdata = testData)

Model 2: Xgboost Converts a variable to a numeric type

if (!require("xgboost")) install.packages("xgboost")
library(xgboost)
# Extract the desired features
selected_features <- c("Age", "Sex", "RestingBP", "Cholesterol",
                       "FastingBS", "MaxHR", "Oldpeak", "ChestPainType",
                       "RestingECG", "ExerciseAngina", "ST_Slope", "HeartRiskScore")

trainData_selected <- trainData[, selected_features]
testData_selected <- testData[, selected_features]

# Performs Label Encoding on non-numeric variables
trainData_selected <- trainData_selected %>%
  mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
  mutate(across(where(is.factor), as.numeric))

testData_selected <- testData_selected %>%
  mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
  mutate(across(where(is.factor), as.numeric))

# Check whether the conversion to numeric type was successful
str(trainData_selected)

## 'data.frame':    735 obs. of  12 variables:
##  $ Age           : int  37 48 39 45 48 37 58 39 42 54 ...
##  $ Sex           : num  2 1 2 1 1 1 2 2 1 1 ...
##  $ RestingBP     : num  0.65 0.69 0.6 0.65 0.6 0.65 0.68 0.6 0.575 0.6 ...
##  $ Cholesterol   : num  0.469 0.355 0.562 0.393 0.471 ...
##  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MaxHR         : num  0.268 0.338 0.775 0.775 0.423 ...
##  $ Oldpeak       : num  0.295 0.466 0.295 0.295 0.295 ...
##  $ ChestPainType : num  2 1 3 2 2 3 2 2 3 2 ...
##  $ RestingECG    : num  3 2 2 2 2 2 3 2 3 2 ...
##  $ ExerciseAngina: num  1 2 1 1 1 1 2 1 1 1 ...
##  $ ST_Slope      : num  3 2 3 3 3 3 2 3 3 2 ...
##  $ HeartRiskScore: int  0 3 0 0 0 0 3 0 0 2 ...

create DMatrix

# # Split feature and target variables
train_matrix <- as.matrix(trainData_selected[, -which(names(trainData_selected) == "HeartRiskScore")])
test_matrix <- as.matrix(testData_selected[, -which(names(testData_selected) == "HeartRiskScore")])

train_label <- trainData_selected$HeartRiskScore
test_label <- testData_selected$HeartRiskScore

# create DMatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

train model

params <- list(
  objective = "reg:squarederror",  
  eval_metric = "rmse",           
  max_depth = 6,                 
  eta = 0.3,                     
  subsample = 0.8,               
  colsample_bytree = 0.8          
)


xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  print_every_n = 10
)

## [1]  train-rmse:1.506208 test-rmse:1.421310 
## [11] train-rmse:0.130266 test-rmse:0.188532 
## [21] train-rmse:0.042437 test-rmse:0.140540 
## [31] train-rmse:0.026649 test-rmse:0.135121 
## [41] train-rmse:0.018557 test-rmse:0.133363 
## [51] train-rmse:0.013549 test-rmse:0.132659 
## [61] train-rmse:0.009823 test-rmse:0.132525 
## [71] train-rmse:0.007496 test-rmse:0.132513 
## [81] train-rmse:0.005428 test-rmse:0.132487 
## [91] train-rmse:0.004138 test-rmse:0.132380 
## [100]    train-rmse:0.003247 test-rmse:0.132511

print(xgb_model)

## ##### xgb.Booster
## raw: 356.5 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = 100, watchlist = list(train = dtrain, 
##     test = dtest), print_every_n = 10)
## params (as set within xgb.train):
##   objective = "reg:squarederror", eval_metric = "rmse", max_depth = "6", eta = "0.3", subsample = "0.8", colsample_bytree = "0.8", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
##   cb.evaluation.log()
## # of features: 11 
## niter: 100
## nfeatures : 11 
## evaluation_log:
##   iter  train_rmse test_rmse
##  <num>       <num>     <num>
##      1 1.506208263 1.4213101
##      2 1.122600707 1.0617256
##    ---         ---       ---
##     99 0.003402947 0.1324679
##    100 0.003247064 0.1325114

Regression: Evaluation Model Performance

Evaluation 1: Random Forest plot Actual vs Predicted HeartRiskScore

library(ggplot2)
ggplot(data = data.frame(Actual = testData$HeartRiskScore, Predicted = rf_predictions),
       aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Actual vs Predicted HeartRiskScore",
       x = "Actual HeartRiskScore", y = "Predicted HeartRiskScore") +
  theme_minimal()

Evaluation Metrics

rf_predictions <- predict(rf_model_tuned, newdata = testData)
mse <- mean((rf_predictions - test_label)^2)
rmse <- sqrt(mse)
cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 0.1536583

mae <- mean(abs(rf_predictions - testData$HeartRiskScore))
cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 0.1012041

rss <- sum((rf_predictions - testData$HeartRiskScore)^2)  # Residual sum of squares
tss <- sum((testData$HeartRiskScore - mean(testData$HeartRiskScore))^2)  # Total sum of squares
r_squared <- 1 - (rss / tss)
cat("R² Score:", r_squared, "\n")

## R² Score: 0.9907435

Evaluation 2: Xgboost

plot Actual vs Predicted HeartRiskScore

library(ggplot2)

# Predict on test data
xgb_predictions <- predict(xgb_model, newdata = dtest)

# Data box: actual value and predicted value
results <- data.frame(
  Actual = test_label,
  Predicted = xgb_predictions
)

# Scatter diagram
ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs Predicted HeartRiskScore",
    x = "Actual HeartRiskScore",
    y = "Predicted HeartRiskScore"
  )

Evaluation Metrics

xgb_predictions <- predict(xgb_model, newdata = dtest)

# Model evaluation
mse <- mean((xgb_predictions - test_label)^2)
rmse <- sqrt(mse)
mae <- mean(abs(xgb_predictions - test_label))
rss <- sum((xgb_predictions - test_label)^2)  # Residual sum of squares
tss <- sum((test_label - mean(test_label))^2) # Total sum of squares
r_squared <- 1 - (rss / tss)

cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 0.1325114

cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 0.07907581

cat("R² Score:", r_squared, "\n")

## R² Score: 0.993116

cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 0.1325114

Regression: Plotting Features Importance

Random Forest

# Load the ggplot2 package
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)

# Computational feature importance
feature_importance <- importance(rf_model_tuned)  
importance_df <- data.frame(
  Feature = rownames(feature_importance),
  Importance = feature_importance[, "IncNodePurity"]
)

# Sort by feature importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

# Draw bar charts
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "grey") +
  coord_flip() +
  labs(
    title = "Feature Importance (Random Forest)",
    x = "Features",
    y = "Importance (IncNodePurity)"
  ) +
  theme_minimal()

XGBoost

# Computational feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_model)

# Plot feature importance
xgb.plot.importance(importance_matrix, top_n = 10, measure = "Gain", rel_to_first = TRUE)

Model Comparison

# Evaluate the performance of random forest models
rf_mse <- mean((rf_predictions - testData$HeartRiskScore)^2)
rf_rmse <- sqrt(rf_mse)
rf_mae <- mean(abs(rf_predictions - testData$HeartRiskScore))
rf_r2 <- 1 - (sum((rf_predictions - testData$HeartRiskScore)^2) / 
              sum((testData$HeartRiskScore - mean(testData$HeartRiskScore))^2))

cat("Random Forest RMSE:", rf_rmse, "\n")

## Random Forest RMSE: 0.1536583

cat("Random Forest MAE:", rf_mae, "\n")

## Random Forest MAE: 0.1012041

cat("Random Forest R²:", rf_r2, "\n")

## Random Forest R²: 0.9907435

# Predictive test set
xgb_predictions <- predict(xgb_model, newdata = dtest)

# Evaluate XGBoost model performance
xgb_mse <- mean((xgb_predictions - test_label)^2)
xgb_rmse <- sqrt(xgb_mse)
xgb_mae <- mean(abs(xgb_predictions - test_label))
xgb_r2 <- 1 - (sum((xgb_predictions - test_label)^2) / 
               sum((test_label - mean(test_label))^2))

cat("XGBoost RMSE:", xgb_rmse, "\n")

## XGBoost RMSE: 0.1325114

cat("XGBoost MAE:", xgb_mae, "\n")

## XGBoost MAE: 0.07907581

cat("XGBoost R²:", xgb_r2, "\n")

## XGBoost R²: 0.993116

# Create a comparison table
comparison_df <- data.frame(
  Model = c("Random Forest", "XGBoost"),
  RMSE = c(rf_rmse, xgb_rmse),
  MAE = c(rf_mae, xgb_mae),
  R2 = c(rf_r2, xgb_r2)
)

print(comparison_df)

##           Model      RMSE        MAE        R2
## 1 Random Forest 0.1536583 0.10120414 0.9907435
## 2       XGBoost 0.1325114 0.07907581 0.9931160

Plot bar chart to compare model performance

if (!require("tidyr")) install.packages("tidyr")
library(tidyr)

library(ggplot2)

# Converts a variable to a numeric type
comparison_long <- comparison_df %>%
  pivot_longer(cols = c(RMSE, MAE), names_to = "Metric", values_to = "Value")

# plot bar chart
ggplot(comparison_long, aes(x = Model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Model Comparison: Random Forest vs XGBoost",
    x = "Model",
    y = "Performance Metric",
    fill = "Metric"
  ) +
  theme_minimal()

Regression: Summary

Regression Models
- Performance: Both models demonstrate strong predictive capabilities with high R² scores and low error metrics.
- Accuracy:
  - Random Forest: RMSE: 0.1505, MAE: 0.0996, R²: 0.9911. Random Forest shows superior accuracy.
  - XGBoost: RMSE: 0.2139, MAE: 0.1447, R²: 0.9821. Slightly lower accuracy compared to Random Forest.
- Recommendation: Random Forest is recommended for this regression task due to its higher accuracy and simplicity in implementation.
Feature Importance
- Random Forest:
  - Top Features:
    - ST_Slope_Num: Most significant predictor.
    - FastingBS: Second most impactful feature.
    - ExerciseAngina_Num: Also important.
- XGBoost: Provides feature importance values but often ranks features differently. Further tuning may highlight other impactful predictors.
- Recommendation: Random Forest provides clearer feature interpretability, making it an excellent choice for understanding the roles of predictors.

Summary - Random Forest is the recommended model for predicting HeartRiskScore due to its superior performance metrics and faster training. XGBoost remains a robust alternative for more complex datasets or when further generalization is desired.

Conclusion

This study on heart disease risk prediction explored data preprocessing, exploratory data analysis (EDA), and the use of both regression and classification models. These approaches demonstrated the versatility of predictive modeling, offering numerical estimates (regression) and categorical classifications (classification).

Preprocessing was key to handling missing data, transforming variables, and encoding features, ensuring data quality for machine learning. EDA enriched the process by uncovering variable relationships and trends crucial for model training.

Regression models quantified predictor-outcome relationships, while classification models categorized patients into risk groups, evaluated on metrics like accuracy and sensitivity. Both approaches addressed distinct predictive objectives, highlighting their complementary roles in healthcare.

This work underscores the value of combining regression for numerical predictions and classification for actionable decisions in patient management. Future efforts could explore advanced techniques, such as ensemble models or deep learning, to further enhance predictive accuracy and support early interventions.

Heart Disease Risk Prediction

WANG JIAJU 24060333 - Data Preprocessing

NURUL NADIA ABD RAHMAN 17175312 - Modeling and Evaluation

FAN JIARUI 23100208 - EDA

YANWEI ZHANG 22102113 - EDA and Modeling and Evaluation

2025-01-10

Introduction

Objectives

Process Explanation

Data Preprocessing

Data Cleaning

Data Transformation

Data Conversion

Exploratory data analysis

Model & Evaluation for Classification

Objective

Classification: Summary

Model & Evaluation for Regression

Objective

Conclusion