Heart disease is one of the leading causes of death around the world; thus, early prediction has become a vital area of research. Data analytics has a lot of potent tools to identify key risk factors and patterns leading to heart diseases. Using machine learning techniques coupled with a comprehensive dataset, such as the Heart Failure Prediction dataset available in Kaggle, we aim to develop predictive models that could help in early detection and prevention strategies. This study will explore the association of clinical and demographic factors with heart disease to provide insight into improving patient outcomes.
This project focused on predicting heart disease risk through data preprocessing, exploratory analysis, and modeling. Data was cleaned, transformed, and enriched with derived features. Regression quantified relationships, while classification categorized patients into risk groups. Models were evaluated on metrics like accuracy and sensitivity, showcasing the complementary strengths of both approaches for effective heart disease prediction.
# Load the dataset
data <- read.csv("C:/Users/Lenovo/Desktop/heart.csv")
# Display the first few rows of the dataset
head(data)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0
Step 1: Check for Missing Values
# Creating a copy of the dataset
process_reviews <- data
# Checking for null values in each column
missing_values <- colSums(is.na(process_reviews))
# Format the output to match the desired layout
formatted_output <- cbind(Column = names(missing_values), Missing_Values = missing_values)
# Print the output
for (i in 1:nrow(formatted_output)) {
cat(formatted_output[i, "Column"], "\t", formatted_output[i, "Missing_Values"], "\n")
}
## Age 0
## Sex 0
## ChestPainType 0
## RestingBP 0
## Cholesterol 0
## FastingBS 0
## RestingECG 0
## MaxHR 0
## ExerciseAngina 0
## Oldpeak 0
## ST_Slope 0
## HeartDisease 0
Step 2: Replace Missing or Invalid Values
# Replace empty strings ("") with NA
process_reviews[process_reviews == ""] <- NA
# Recheck missing values after replacement
missing_values <- colSums(is.na(process_reviews))
# Print updated missing values
for (i in 1:length(missing_values)) {
cat(names(missing_values)[i], "\t", missing_values[i], "\n")
}
## Age 0
## Sex 0
## ChestPainType 0
## RestingBP 0
## Cholesterol 0
## FastingBS 0
## RestingECG 0
## MaxHR 0
## ExerciseAngina 0
## Oldpeak 0
## ST_Slope 0
## HeartDisease 0
Step 3: Handle Structural Errors
# Ensure numerical columns are of numeric type
process_reviews$Age <- as.numeric(process_reviews$Age)
process_reviews$RestingBP <- as.numeric(process_reviews$RestingBP)
process_reviews$Cholesterol <- as.numeric(process_reviews$Cholesterol)
process_reviews$MaxHR <- as.numeric(process_reviews$MaxHR)
process_reviews$Oldpeak <- as.numeric(process_reviews$Oldpeak)
# Check for outliers in key numeric columns
summary(process_reviews$RestingBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 120.0 130.0 132.4 140.0 200.0
summary(process_reviews$Cholesterol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 173.2 223.0 198.8 267.0 603.0
Step 4: Remove Duplicate Records
# Remove duplicate rows
process_reviews <- process_reviews[!duplicated(process_reviews), ]
# Print the number of rows after removing duplicates
cat("Rows after removing duplicates:", nrow(process_reviews), "\n")
## Rows after removing duplicates: 918
Step 5: Standardize and Format Categorical Data
# Convert categorical columns to factor type
process_reviews$Sex <- as.factor(process_reviews$Sex)
process_reviews$ChestPainType <- as.factor(process_reviews$ChestPainType)
process_reviews$RestingECG <- as.factor(process_reviews$RestingECG)
process_reviews$ExerciseAngina <- as.factor(process_reviews$ExerciseAngina)
process_reviews$ST_Slope <- as.factor(process_reviews$ST_Slope)
str(process_reviews)
## 'data.frame': 918 obs. of 12 variables:
## $ Age : num 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
## $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingBP : num 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : num 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ MaxHR : num 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
## $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
Step 1: Ensure Tidy Data
str(process_reviews)
## 'data.frame': 918 obs. of 12 variables:
## $ Age : num 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
## $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingBP : num 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : num 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ MaxHR : num 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
## $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
head(process_reviews)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0
Step 2: Encoding Categorical Variables
# Convert categorical variables to factors
process_reviews$Sex <- as.factor(process_reviews$Sex)
process_reviews$ChestPainType <- as.factor(process_reviews$ChestPainType)
process_reviews$RestingECG <- as.factor(process_reviews$RestingECG)
process_reviews$ExerciseAngina <- as.factor(process_reviews$ExerciseAngina)
process_reviews$ST_Slope <- as.factor(process_reviews$ST_Slope)
# Check the levels for each factor
summary(process_reviews)
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 F:193 ASY:496 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 M:725 ATA:173 1st Qu.:120.0 1st Qu.:173.2
## Median :54.00 NAP:203 Median :130.0 Median :223.0
## Mean :53.51 TA : 46 Mean :132.4 Mean :198.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## Min. :0.0000 LVH :188 Min. : 60.0 N:547 Min. :-2.6000
## 1st Qu.:0.0000 Normal:552 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## Median :0.0000 ST :178 Median :138.0 Median : 0.6000
## Mean :0.2331 Mean :136.8 Mean : 0.8874
## 3rd Qu.:0.0000 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :1.0000 Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Down: 63 Min. :0.0000
## Flat:460 1st Qu.:0.0000
## Up :395 Median :1.0000
## Mean :0.5534
## 3rd Qu.:1.0000
## Max. :1.0000
Step 3:Normalize and Scale Numerical Data
In datasets with numerical variables, the ranges of values can vary significantly.
Normalization rescales values to a range of [0, 1], ensuring all variables contribute equally.
# Normalize numerical variables
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
# Apply normalization to selected numerical columns
process_reviews$RestingBP <- normalize(process_reviews$RestingBP)
process_reviews$Cholesterol <- normalize(process_reviews$Cholesterol)
process_reviews$MaxHR <- normalize(process_reviews$MaxHR)
process_reviews$Oldpeak <- normalize(process_reviews$Oldpeak)
# View summary of normalized data
summary(process_reviews[, c("RestingBP", "Cholesterol", "MaxHR", "Oldpeak")])
## RestingBP Cholesterol MaxHR Oldpeak
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.600 1st Qu.:0.2873 1st Qu.:0.4225 1st Qu.:0.2955
## Median :0.650 Median :0.3698 Median :0.5493 Median :0.3636
## Mean :0.662 Mean :0.3297 Mean :0.5409 Mean :0.3963
## 3rd Qu.:0.700 3rd Qu.:0.4428 3rd Qu.:0.6761 3rd Qu.:0.4659
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Step 4: Create Derived Variables
Derive new features based on the existing data, such as:
Age Group: Categorizing patients by age ranges.
HeartRiskScore: Combining multiple risk factors into one indicator.
HeartRiskScore Explanation:
FastingBS: Doubled as a stronger risk factor.
ExerciseAngina: Adds 1 if the patient experienced angina during exercise.
ST_Slope: Adds 2 for a “Flat” slope, indicating a more severe condition.
# Create age groups
process_reviews$AgeGroup <- cut(
process_reviews$Age,
breaks = c(0, 30, 45, 60, 100),
labels = c("Young", "Middle-Aged", "Senior", "Elderly"),
right = FALSE
)
# Create a risk score
process_reviews$HeartRiskScore <- process_reviews$FastingBS * 2 +
ifelse(process_reviews$ExerciseAngina == "Y", 1, 0) +
ifelse(process_reviews$ST_Slope == "Flat", 2, 0)
# View derived variables
head(process_reviews[, c("AgeGroup", "HeartRiskScore")])
## AgeGroup HeartRiskScore
## 1 Middle-Aged 0
## 2 Senior 2
## 3 Middle-Aged 0
## 4 Senior 3
## 5 Senior 0
## 6 Middle-Aged 0
Step 5: Handle Multilevel Variables
ChestPainType, contain multiple categories (e.g., TA, ATA, NAP, ASY)
Transform into Numeric Levels: Assign numeric codes to each category.
# Convert multilevel categorical variables to numeric levels
process_reviews$ChestPainType_Num <- as.numeric(factor(process_reviews$ChestPainType))
process_reviews$RestingECG_Num <- as.numeric(factor(process_reviews$RestingECG))
process_reviews$ExerciseAngina_Num <- as.numeric(factor(process_reviews$ExerciseAngina))
process_reviews$ST_Slope_Num <- as.numeric(factor(process_reviews$ST_Slope))
# Confirm the changes in the dataset
head(process_reviews[, c("ChestPainType", "ChestPainType_Num", "RestingECG", "RestingECG_Num")])
## ChestPainType ChestPainType_Num RestingECG RestingECG_Num
## 1 ATA 2 Normal 2
## 2 NAP 3 Normal 2
## 3 ATA 2 ST 3
## 4 ASY 1 Normal 2
## 5 NAP 3 Normal 2
## 6 NAP 3 Normal 2
Step 1: String Manipulations
Trim Leading and Trailing Spaces
Ensure Uniform Case
library(stringr)
## Warning: 程序包'stringr'是用R版本4.4.2 来建造的
# Trim leading and trailing spaces
process_reviews$Sex <- str_trim(process_reviews$Sex)
process_reviews$ChestPainType <- str_trim(process_reviews$ChestPainType)
# Convert to lowercase for consistent formatting
process_reviews$Sex <- tolower(process_reviews$Sex)
process_reviews$ChestPainType <- tolower(process_reviews$ChestPainType)
# Check the results
head(process_reviews[, c("Sex", "ChestPainType")])
## Sex ChestPainType
## 1 m ata
## 2 f nap
## 3 m ata
## 4 f asy
## 5 m nap
## 6 m nap
Step 2: Verify Data Types
# Verify column types
str(process_reviews)
## 'data.frame': 918 obs. of 18 variables:
## $ Age : num 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr "m" "f" "m" "f" ...
## $ ChestPainType : chr "ata" "nap" "ata" "asy" ...
## $ RestingBP : num 0.7 0.8 0.65 0.69 0.75 0.6 0.65 0.55 0.7 0.6 ...
## $ Cholesterol : num 0.479 0.299 0.469 0.355 0.323 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ MaxHR : num 0.789 0.676 0.268 0.338 0.437 ...
## $ ExerciseAngina : Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num 0.295 0.409 0.295 0.466 0.295 ...
## $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
## $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
## $ AgeGroup : Factor w/ 4 levels "Young","Middle-Aged",..: 2 3 2 3 3 2 3 3 2 3 ...
## $ HeartRiskScore : num 0 2 0 3 0 0 0 0 3 0 ...
## $ ChestPainType_Num : num 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingECG_Num : num 2 2 3 2 2 2 2 2 2 2 ...
## $ ExerciseAngina_Num: num 1 1 1 2 1 1 1 1 2 1 ...
## $ ST_Slope_Num : num 3 2 3 2 3 3 3 3 2 3 ...
# Convert columns to appropriate types
process_reviews$FastingBS <- as.integer(process_reviews$FastingBS)
process_reviews$HeartDisease <- as.factor(process_reviews$HeartDisease)
# Confirm changes
str(process_reviews)
## 'data.frame': 918 obs. of 18 variables:
## $ Age : num 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr "m" "f" "m" "f" ...
## $ ChestPainType : chr "ata" "nap" "ata" "asy" ...
## $ RestingBP : num 0.7 0.8 0.65 0.69 0.75 0.6 0.65 0.55 0.7 0.6 ...
## $ Cholesterol : num 0.479 0.299 0.469 0.355 0.323 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ MaxHR : num 0.789 0.676 0.268 0.338 0.437 ...
## $ ExerciseAngina : Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num 0.295 0.409 0.295 0.466 0.295 ...
## $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
## $ HeartDisease : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
## $ AgeGroup : Factor w/ 4 levels "Young","Middle-Aged",..: 2 3 2 3 3 2 3 3 2 3 ...
## $ HeartRiskScore : num 0 2 0 3 0 0 0 0 3 0 ...
## $ ChestPainType_Num : num 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingECG_Num : num 2 2 3 2 2 2 2 2 2 2 ...
## $ ExerciseAngina_Num: num 1 1 1 2 1 1 1 1 2 1 ...
## $ ST_Slope_Num : num 3 2 3 2 3 3 3 3 2 3 ...
Introduction
This report presents an exploratory data analysis (EDA) and statistical testing on a cleaned dataset related to heart disease. The primary objectives are to:
Through this analysis, we aim to identify significant predictors and patterns related to heart disease for further insights and potential predictive modeling.
Data Description and Initial Checks
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# Install and load required packages
if (!require("psych")) install.packages("psych", dependencies = TRUE)
## 载入需要的程序包:psych
## Warning: 程序包'psych'是用R版本4.4.2 来建造的
if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE)
## 载入需要的程序包:ggplot2
## Warning: 程序包'ggplot2'是用R版本4.4.2 来建造的
##
## 载入程序包:'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
if (!require("dplyr")) install.packages("dplyr", dependencies = TRUE)
## 载入需要的程序包:dplyr
## Warning: 程序包'dplyr'是用R版本4.4.2 来建造的
##
## 载入程序包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
library(ggplot2)
library(dplyr)
# Load dataset
data <- read.csv("C:/Users/Lenovo/Desktop/cleaned_process_reviews.csv", stringsAsFactors = FALSE)
# Descriptive statistics for numeric variables
numeric_cols <- data %>% select_if(is.numeric)
desc_stats <- describe(numeric_cols)
print("Descriptive Statistics:")
## [1] "Descriptive Statistics:"
desc_stats
## vars n mean sd median trimmed mad min max range skew
## Age 1 918 53.51 9.43 54.00 53.71 10.38 28 77 49 -0.20
## RestingBP 2 918 0.66 0.09 0.65 0.66 0.07 0 1 1 0.18
## Cholesterol 3 918 0.33 0.18 0.37 0.34 0.11 0 1 1 -0.61
## FastingBS 4 918 0.23 0.42 0.00 0.17 0.00 0 1 1 1.26
## MaxHR 5 918 0.54 0.18 0.55 0.54 0.19 0 1 1 -0.14
## Oldpeak 6 918 0.40 0.12 0.36 0.38 0.10 0 1 1 1.02
## HeartDisease 7 918 0.55 0.50 1.00 0.57 0.00 0 1 1 -0.21
## HeartRiskScore 8 918 1.87 1.60 2.00 1.75 1.48 0 5 5 0.25
## ChestPainType_Num 9 918 1.78 0.96 1.00 1.66 0.00 1 4 3 0.79
## RestingECG_Num 10 918 1.99 0.63 2.00 1.99 0.00 1 3 2 0.01
## ExerciseAngina_Num 11 918 1.40 0.49 1.00 1.38 0.00 1 2 1 0.39
## ST_Slope_Num 12 918 2.36 0.61 2.00 2.41 0.00 1 3 2 -0.38
## kurtosis se
## Age -0.40 0.31
## RestingBP 3.23 0.00
## Cholesterol 0.10 0.01
## FastingBS -0.41 0.01
## MaxHR -0.46 0.01
## Oldpeak 1.18 0.00
## HeartDisease -1.96 0.02
## HeartRiskScore -1.03 0.05
## ChestPainType_Num -0.72 0.03
## RestingECG_Num -0.50 0.02
## ExerciseAngina_Num -1.85 0.02
## ST_Slope_Num -0.67 0.02
Checking Missing Values and Dataset Structure
print("Dataset structure:")
## [1] "Dataset structure:"
str(data)
## 'data.frame': 918 obs. of 18 variables:
## $ Age : int 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr "m" "f" "m" "f" ...
## $ ChestPainType : chr "ata" "nap" "ata" "asy" ...
## $ RestingBP : num 0.7 0.8 0.65 0.69 0.75 0.6 0.65 0.55 0.7 0.6 ...
## $ Cholesterol : num 0.479 0.299 0.469 0.355 0.323 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : chr "Normal" "Normal" "ST" "Normal" ...
## $ MaxHR : num 0.789 0.676 0.268 0.338 0.437 ...
## $ ExerciseAngina : chr "N" "N" "N" "Y" ...
## $ Oldpeak : num 0.295 0.409 0.295 0.466 0.295 ...
## $ ST_Slope : chr "Up" "Flat" "Up" "Flat" ...
## $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
## $ AgeGroup : chr "Middle-Aged" "Senior" "Middle-Aged" "Senior" ...
## $ HeartRiskScore : int 0 2 0 3 0 0 0 0 3 0 ...
## $ ChestPainType_Num : int 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingECG_Num : int 2 2 3 2 2 2 2 2 2 2 ...
## $ ExerciseAngina_Num: int 1 1 1 2 1 1 1 1 2 1 ...
## $ ST_Slope_Num : int 3 2 3 2 3 3 3 3 2 3 ...
# Check missing value
print("Missing values per column:")
## [1] "Missing values per column:"
missing_values <- colSums(is.na(data))
print(missing_values)
## Age Sex ChestPainType RestingBP
## 0 0 0 0
## Cholesterol FastingBS RestingECG MaxHR
## 0 0 0 0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 0 0 0 0
## AgeGroup HeartRiskScore ChestPainType_Num RestingECG_Num
## 0 0 0 0
## ExerciseAngina_Num ST_Slope_Num
## 0 0
Descriptive statistics for numeric variables
numeric_cols <- data %>% select_if(is.numeric)
desc_stats <- describe(numeric_cols)
print("Descriptive Statistics:")
## [1] "Descriptive Statistics:"
print(desc_stats)
## vars n mean sd median trimmed mad min max range skew
## Age 1 918 53.51 9.43 54.00 53.71 10.38 28 77 49 -0.20
## RestingBP 2 918 0.66 0.09 0.65 0.66 0.07 0 1 1 0.18
## Cholesterol 3 918 0.33 0.18 0.37 0.34 0.11 0 1 1 -0.61
## FastingBS 4 918 0.23 0.42 0.00 0.17 0.00 0 1 1 1.26
## MaxHR 5 918 0.54 0.18 0.55 0.54 0.19 0 1 1 -0.14
## Oldpeak 6 918 0.40 0.12 0.36 0.38 0.10 0 1 1 1.02
## HeartDisease 7 918 0.55 0.50 1.00 0.57 0.00 0 1 1 -0.21
## HeartRiskScore 8 918 1.87 1.60 2.00 1.75 1.48 0 5 5 0.25
## ChestPainType_Num 9 918 1.78 0.96 1.00 1.66 0.00 1 4 3 0.79
## RestingECG_Num 10 918 1.99 0.63 2.00 1.99 0.00 1 3 2 0.01
## ExerciseAngina_Num 11 918 1.40 0.49 1.00 1.38 0.00 1 2 1 0.39
## ST_Slope_Num 12 918 2.36 0.61 2.00 2.41 0.00 1 3 2 -0.38
## kurtosis se
## Age -0.40 0.31
## RestingBP 3.23 0.00
## Cholesterol 0.10 0.01
## FastingBS -0.41 0.01
## MaxHR -0.46 0.01
## Oldpeak 1.18 0.00
## HeartDisease -1.96 0.02
## HeartRiskScore -1.03 0.05
## ChestPainType_Num -0.72 0.03
## RestingECG_Num -0.50 0.02
## ExerciseAngina_Num -1.85 0.02
## ST_Slope_Num -0.67 0.02
Visualize distributions of numeric variables
for (col in colnames(numeric_cols))
print(ggplot(data, aes_string(x = col)) +
geom_histogram(fill = "blue", color = "black", bins = 30, alpha = 0.7) +
labs(title = paste("Distribution of", col), x = col, y = "Count") +
theme_minimal())
The detect_outliers function flags outliers in a column using the IQR rule, returning TRUE for outliers.
detect_outliers <- function(column) {
Q1 <- quantile(column, 0.25, na.rm = TRUE) # First quartile (25%)
Q3 <- quantile(column, 0.75, na.rm = TRUE) # Third quartile (75%)
IQR <- Q3 - Q1 # Interquartile range
lower_bound <- Q1 - 1.5 * IQR # Lower threshold
upper_bound <- Q3 + 1.5 * IQR # Upper threshold
# Return TRUE/FALSE for outliers
return(column < lower_bound | column > upper_bound)
}
Detect outliers for numeric columns
numeric_cols <- data %>% select_if(is.numeric)
outlier_summary <- data.frame(Column = character(), Outlier_Count = integer())
for (col in colnames(numeric_cols))
outliers <- detect_outliers(numeric_cols[[col]])
count_outliers <- sum(outliers, na.rm = TRUE)
outlier_summary <- rbind(outlier_summary, data.frame(Column = col, Outlier_Count = count_outliers))
Visualize outliers using boxplots
print(ggplot(data, aes_string(y = col)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
labs(title = paste("Boxplot of", col), y = col) +
theme_minimal())
# print all outliers for each column
print("Outlier Summary for Numeric Variables:")
## [1] "Outlier Summary for Numeric Variables:"
print(outlier_summary)
## Column Outlier_Count
## 1 ST_Slope_Num 0
Load Cleaned data
install.packages("psych")
library(psych) # Load the 'psych' package
library(ggplot2) # For visualizations
library(dplyr) # For data manipulation
process_reviews <- read.csv("C:/Users/Lenovo/Desktop/cleaned_process_reviews.csv")
Correlation Analysis Step1: Perform a point-Biserial Correlation analysis between a binary variable (HeartDisease) and multiple continuous variables (Age, RestingBP, Cholesterol, MaxHR, Oldpeak)
# Ensure HeartDisease is numeric
process_reviews$HeartDisease <- as.numeric(as.character(process_reviews$HeartDisease))
# List of continuous variables
continuous_vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
# Calculate Point-Biserial Correlation for each continuous variable with HeartDisease
point_biserial_results <- sapply(process_reviews[, continuous_vars], function(x) {
result <- cor.test(x, process_reviews$HeartDisease, method = "pearson") # Pearson used for numeric correlation
return(c(correlation = result$estimate, p_value = result$p.value)) # Extract correlation coefficient and p-value
})
# Transpose the results to make it more readable
point_biserial_results <- t(point_biserial_results)
# Display the results
cat("Point-Biserial Correlation Results:\n")
## Point-Biserial Correlation Results:
print(point_biserial_results)
## correlation.cor p_value
## Age 0.2820385 3.007953e-18
## RestingBP 0.1075890 1.095315e-03
## Cholesterol -0.2327406 9.308309e-13
## MaxHR -0.4004208 1.137786e-36
## Oldpeak 0.4039507 2.390772e-37
Step2:Visualize the distribution of three continuous variables (Age, Oldpeak, MaxHR) grouped by the binary variable HeartDisease
# Load ggplot2 package
library(ggplot2)
# Continuous Variable: Age vs HeartDisease
ggplot(process_reviews, aes(x = factor(HeartDisease), y = Age, fill = factor(HeartDisease))) +
geom_boxplot() +
labs(
title = "Age Distribution by Heart Disease",
x = "Heart Disease",
y = "Age"
) +
theme_minimal()
# Visualization: Oldpeak vs HeartDisease
ggplot(process_reviews, aes(x = factor(HeartDisease), y = Oldpeak, fill = factor(HeartDisease))) +
geom_boxplot() +
labs(
title = "Oldpeak Distribution by Heart Disease",
x = "Heart Disease",
y = "Oldpeak"
) +
theme_minimal()
# Visualization: MaxHR vs HeartDisease
ggplot(process_reviews, aes(x = factor(HeartDisease), y = MaxHR, fill = factor(HeartDisease))) +
geom_boxplot() +
labs(
title = "MaxHR Distribution by Heart Disease",
x = "Heart Disease",
y = "MaxHR"
) +
theme_minimal()
Step3:Calculates a correlation matrix for selected variables, rearranges the matrix to emphasize the HeartDisease variable, and then visualizes it as a heatmap.
# Select variables of interest (HeartDisease and continuous variables)
vars <- c("HeartDisease", "Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
selected_data <- process_reviews[, vars]
# Calculate the correlation matrix
cor_matrix <- cor(selected_data, use = "complete.obs") # Use complete observations only
# Reorder the matrix to move HeartDisease to the last row and column
cor_matrix <- cor_matrix[c(setdiff(rownames(cor_matrix), "HeartDisease"), "HeartDisease"),
c(setdiff(colnames(cor_matrix), "HeartDisease"), "HeartDisease")]
# Install and load pheatmap package if necessary
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)
# Draw the heatmap
pheatmap(cor_matrix,
color = colorRampPalette(c("blue", "white", "red"))(50), # Gradient color scheme
display_numbers = TRUE, # Show correlation coefficients
number_color = "black", # Font color for numbers
cluster_rows = FALSE, # Disable row clustering
cluster_cols = FALSE, # Disable column clustering
main = "Correlation Matrix Heatmap with HeartDisease") # Heatmap title
Step4:Use Chi-Square Test to assess whether there is a significant association between two categorical variables.
# ChestPainType vs HeartDisease
table_cp <- table(process_reviews$ChestPainType, process_reviews$HeartDisease)
chisq_test_cp <- chisq.test(table_cp)
print("ChestPainType vs HeartDisease")
## [1] "ChestPainType vs HeartDisease"
print(chisq_test_cp)
##
## Pearson's Chi-squared test
##
## data: table_cp
## X-squared = 268.07, df = 3, p-value < 2.2e-16
# FastingBS vs HeartDisease
table_fbs <- table(process_reviews$FastingBS, process_reviews$HeartDisease)
chisq_test_fbs <- chisq.test(table_fbs)
print("FastingBS vs HeartDisease")
## [1] "FastingBS vs HeartDisease"
print(chisq_test_fbs)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_fbs
## X-squared = 64.321, df = 1, p-value = 1.057e-15
# RestingECG vs HeartDisease
table_ecg <- table(process_reviews$RestingECG, process_reviews$HeartDisease)
chisq_test_ecg <- chisq.test(table_ecg)
print("RestingECG vs HeartDisease")
## [1] "RestingECG vs HeartDisease"
print(chisq_test_ecg)
##
## Pearson's Chi-squared test
##
## data: table_ecg
## X-squared = 10.931, df = 2, p-value = 0.004229
# ExerciseAngina vs HeartDisease
table_angina <- table(process_reviews$ExerciseAngina, process_reviews$HeartDisease)
chisq_test_angina <- chisq.test(table_angina)
print("ExerciseAngina vs HeartDisease")
## [1] "ExerciseAngina vs HeartDisease"
print(chisq_test_angina)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_angina
## X-squared = 222.26, df = 1, p-value < 2.2e-16
# ST_Slope vs HeartDisease
table_slope <- table(process_reviews$ST_Slope, process_reviews$HeartDisease)
chisq_test_slope <- chisq.test(table_slope)
print("ST_Slope vs HeartDisease")
## [1] "ST_Slope vs HeartDisease"
print(chisq_test_slope)
##
## Pearson's Chi-squared test
##
## data: table_slope
## X-squared = 355.92, df = 2, p-value < 2.2e-16
Step5:To create grouped bar charts that visualize the distribution of several categorical variables with respect to HeartDisease. Each visualization shows the counts of categories in the x-axis variable, filled (colored) by the HeartDisease status.
# ChestPainType Distribution
ggplot(process_reviews, aes(x = ChestPainType, fill = factor(HeartDisease))) +
geom_bar(position = "dodge") +
labs(
title = "Chest Pain Type vs Heart Disease",
x = "Chest Pain Type",
y = "Count",
fill = "Heart Disease"
) +
theme_minimal()
# FastingBS Distribution
ggplot(process_reviews, aes(x = factor(FastingBS), fill = factor(HeartDisease))) +
geom_bar(position = "dodge") +
labs(
title = "Fasting Blood Sugar vs Heart Disease",
x = "Fasting Blood Sugar",
y = "Count",
fill = "Heart Disease"
) +
theme_minimal()
# RestingECG Distribution
ggplot(process_reviews, aes(x = RestingECG, fill = factor(HeartDisease))) +
geom_bar(position = "dodge") +
labs(
title = "Resting ECG vs Heart Disease",
x = "Resting ECG",
y = "Count",
fill = "Heart Disease"
) +
theme_minimal()
# ExerciseAngina Distribution
ggplot(process_reviews, aes(x = ExerciseAngina, fill = factor(HeartDisease))) +
geom_bar(position = "dodge") +
labs(
title = "Exercise Angina vs Heart Disease",
x = "Exercise Angina",
y = "Count",
fill = "Heart Disease"
) +
theme_minimal()
# ST_Slope Distribution
ggplot(process_reviews, aes(x = ST_Slope, fill = factor(HeartDisease))) +
geom_bar(position = "dodge") +
labs(
title = "ST Slope vs Heart Disease",
x = "ST Slope",
y = "Count",
fill = "Heart Disease"
) +
theme_minimal()
Hypothesis Testing
# Define variable lists
continuous_vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak") # Continuous variables
binary_vars <- c("FastingBS", "ExerciseAngina") # Binary variables
multiclass_vars <- c("ChestPainType", "RestingECG", "ST_Slope") # Multi-class variables
# Initialize a list to save results
results <- list()
1. Continuous variables vs HeartDisease: t-test
# 1. Continuous variables vs HeartDisease: t-test
continuous_tests <- lapply(continuous_vars, function(var) {
t_test_result <- t.test(process_reviews[[var]] ~ process_reviews$HeartDisease)
list(
variable = var,
test = "t-test",
statistic = t_test_result$statistic,
p_value = t_test_result$p.value
)
})
results$continuous <- do.call(rbind, lapply(continuous_tests, as.data.frame))
2. Binary variables vs HeartDisease: Chi-squared test or two-sample proportion test
# 2. Binary variables vs HeartDisease: Chi-squared test or two-sample proportion test
binary_tests <- lapply(binary_vars, function(var) {
table_data <- table(process_reviews[[var]], process_reviews$HeartDisease)
chi_test_result <- chisq.test(table_data)
list(
variable = var,
test = "chi-squared test (binary)",
statistic = chi_test_result$statistic,
p_value = chi_test_result$p.value
)
})
results$binary <- do.call(rbind, lapply(binary_tests, as.data.frame))
3. Multi-class variables vs HeartDisease: Chi-squared test
# 3. Multi-class variables vs HeartDisease: Chi-squared test
multiclass_tests <- lapply(multiclass_vars, function(var) {
table_data <- table(process_reviews[[var]], process_reviews$HeartDisease)
chi_test_result <- chisq.test(table_data)
list(
variable = var,
test = "chi-squared test (multiclass)",
statistic = chi_test_result$statistic,
p_value = chi_test_result$p.value
)
})
results$multiclass <- do.call(rbind, lapply(multiclass_tests, as.data.frame))
4. Combine all results
# 4. Combine all results
all_results <- rbind(
data.frame(VariableType = "Continuous", results$continuous),
data.frame(VariableType = "Binary", results$binary),
data.frame(VariableType = "Multiclass", results$multiclass)
)
# Print results
print("Hypothesis Testing Results:")
## [1] "Hypothesis Testing Results:"
print(all_results)
## VariableType variable test
## t Continuous Age t-test
## t1 Continuous RestingBP t-test
## t2 Continuous Cholesterol t-test
## t3 Continuous MaxHR t-test
## t4 Continuous Oldpeak t-test
## X-squared Binary FastingBS chi-squared test (binary)
## X-squared1 Binary ExerciseAngina chi-squared test (binary)
## X-squared3 Multiclass ChestPainType chi-squared test (multiclass)
## X-squared11 Multiclass RestingECG chi-squared test (multiclass)
## X-squared2 Multiclass ST_Slope chi-squared test (multiclass)
## statistic p_value
## t -8.822540 6.348337e-18
## t1 -3.339492 8.732265e-04
## t2 7.626851 6.481236e-14
## t3 13.231478 1.430637e-36
## t4 -14.040031 1.902722e-40
## X-squared 64.320679 1.057302e-15
## X-squared1 222.259383 2.907808e-50
## X-squared3 268.067239 8.083728e-58
## X-squared11 10.931469 4.229233e-03
## X-squared2 355.918443 5.167638e-78
5. Visualize significant results
# 5. Visualize significant results
library(ggplot2)
all_results$p_value <- as.numeric(all_results$p_value)
all_results$significant <- ifelse(all_results$p_value < 0.05, "Significant", "Not Significant")
ggplot(all_results, aes(x = reorder(variable, p_value), y = -log10(p_value), fill = significant)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Hypothesis Testing Results",
x = "Variables",
y = "-log10(p-value)",
fill = "Significance") +
theme_minimal()
Summary
The exploratory data analysis and statistical testing provided valuable insights into the relationships between heart disease and various predictors:
Oldpeak and
MaxHR showed strong correlations with heart disease.Age and Cholesterol also
demonstrated statistically significant, though weaker,
associations.Age, MaxHR, and Oldpeak between
individuals with and without heart disease.ChestPainType and
ST-SLOPEAge and Oldpeak, between groups with
and without heart disease.ChestPainType, ST_Slope, and
heart disease.Oldpeak, ST_slope, and
ChestPainType emerged as strong predictors of heart
disease.These findings provide a foundation for further modeling and analysis, such as logistic regression or machine learning, to predict heart disease outcomes based on the identified key variables.
This report aims to analyze the training time and accuracy of various models in predicting heart disease. Both classification and regression models are utilized to evaluate their effectiveness in predicting the target variable: HeartDisease (binary: 0 = No, 1 = Yes).
Task: Predict the presence of heart disease.
Target Variable: HeartDisease
Models Used:
- Logistic Regression
- Random Forest
- XGBoostSetup and Preparation
# Install and load required packages
if (!require("dplyr")) install.packages("dplyr", dependencies = TRUE)
if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE)
library(dplyr)
library(ggplot2)
library(data.table)
Load Cleaned data
Load cleaned data that has completed data cleaning and pre-processing.
heart_df <- read.csv("C:/Users/Lenovo/Desktop/cleaned_process_reviews.csv")
selected_features <- c("Age", "Sex", "RestingBP", "Cholesterol",
"FastingBS", "MaxHR", "Oldpeak",
"AgeGroup", "HeartRiskScore", "ChestPainType_Num",
"RestingECG_Num", "ExerciseAngina_Num", "ST_Slope_Num")
heartdf_features <- heart_df[, selected_features]
heart_df$HeartDisease <- as.factor(heart_df$HeartDisease)
colnames(heartdf_features)
## [1] "Age" "Sex" "RestingBP"
## [4] "Cholesterol" "FastingBS" "MaxHR"
## [7] "Oldpeak" "AgeGroup" "HeartRiskScore"
## [10] "ChestPainType_Num" "RestingECG_Num" "ExerciseAngina_Num"
## [13] "ST_Slope_Num"
Analyzing the Distribution of the Target Variable
In this section, we check for any imbalance in the dataset regarding the target variable HeartDisease.
table(heart_df$HeartDisease)
##
## 0 1
## 410 508
prop.table(table(heart_df$HeartDisease)) * 100
##
## 0 1
## 44.66231 55.33769
# Distribution visualization
ggplot(heart_df, aes(x = HeartDisease)) +
geom_bar(fill = "skyblue") +
labs(title = "Class Distribution of HeartDisease", x = "HeartDisease", y = "Count")
Observation:
Splitting the Data into Train and Test Sets
In this section, we split the dataset into training and testing sets using an 80-20 split. To ensure models are trained on train data and evaluated using test data sets.
# Install and load required packages
if (!require("caret")) install.packages("caret", dependencies = TRUE)
library(caret)
set.seed(123)
trainIndex <- createDataPartition(heart_df$HeartDisease, p = 0.8, list = FALSE)
# Split the dataset
trainData <- heart_df[trainIndex, ]
testData <- heart_df[-trainIndex, ]
# Convert HeartDisease to factor for Logistic Regression and Random Forest
trainData$HeartDisease <- as.factor(trainData$HeartDisease)
testData$HeartDisease <- as.factor(testData$HeartDisease)
# Create separate copies for XGBoost with numeric target variable
trainData_xgb <- trainData
testData_xgb <- testData
trainData_xgb$HeartDisease <- as.numeric(as.character(trainData_xgb$HeartDisease)) - 1
testData_xgb$HeartDisease <- as.numeric(as.character(testData_xgb$HeartDisease)) - 1
Classification
Model 1: Logistic Regression
cat("Logistic Regression Training\n")
## Logistic Regression Training
start_time <- Sys.time()
lr_model <- glm(HeartDisease ~ ., data = trainData, family = "binomial")
end_time <- Sys.time()
lr_time <- end_time - start_time
cat("Logistic Regression Training Time:", lr_time, "\n")
## Logistic Regression Training Time: 0.01539493
lr_pred <- predict(lr_model, testData, type = "response")
lr_pred_class <- ifelse(lr_pred > 0.5, 1, 0)
#print(lr_pred_class)
lr_acc <- mean(lr_pred_class == as.numeric(as.character(testData$HeartDisease)))
cat("Logistic Regression Accuracy:", lr_acc, "\n")
## Logistic Regression Accuracy: 0.9125683
Model 2: Random Forest
# Install and load required packages
if (!require("randomForest")) install.packages("randomForest", dependencies = TRUE)
library(randomForest)
#Prediction using RF model
cat("Random Forest Training\n")
## Random Forest Training
start_time <- Sys.time()
rf_model <- randomForest(HeartDisease ~ ., data = trainData, ntree = 100)
end_time <- Sys.time()
rf_time <- end_time - start_time
cat("Random Forest Training Time:", rf_time, "\n")
## Random Forest Training Time: 0.06412911
rf_pred <- predict(rf_model, testData)
rf_acc <- mean(rf_pred == testData$HeartDisease)
cat("Random Forest Accuracy:", rf_acc, "\n")
## Random Forest Accuracy: 0.9016393
Model 3: XGBoost
# Install and load required packages
if (!require("xgboost")) install.packages("xgboost", dependencies = TRUE)
library(xgboost)
#set data for XGBoost to ensure numerical
unique(trainData_xgb$HeartDisease)
## [1] -1 0
unique(testData_xgb$HeartDisease)
## [1] -1 0
trainData_xgb$HeartDisease[trainData_xgb$HeartDisease == -1] <- 1
testData_xgb$HeartDisease[testData_xgb$HeartDisease == -1] <- 1
train_x <- model.matrix(HeartDisease ~ . - 1, data = trainData_xgb)
train_y <- trainData_xgb$HeartDisease
test_x <- model.matrix(HeartDisease ~ . - 1, data = testData_xgb)
test_y <- testData_xgb$HeartDisease
# Align feature names between train_x and test_x
missing_cols <- setdiff(colnames(train_x), colnames(test_x))
for (col in missing_cols) {
test_x <- cbind(test_x, 0)
colnames(test_x)[ncol(test_x)] <- col
}
test_x <- test_x[, colnames(train_x)]
# Train XGBoost model
cat("XGBoost Training\n")
## XGBoost Training
start_time <- Sys.time()
xgb_model <- xgboost(data = train_x, label = train_y, nrounds = 100, objective = "binary:logistic", verbose = 0)
end_time <- Sys.time()
xgb_time <- end_time - start_time
cat("XGBoost Training Time:", xgb_time, "\n")
## XGBoost Training Time: 0.4441721
# Predict on test data XGBoost
xgb_pred <- predict(xgb_model, test_x)
xgb_pred_class <- ifelse(xgb_pred > 0.5, 1, 0)
xgb_acc <- mean(xgb_pred_class == test_y)
cat("XGBoost Accuracy:", xgb_acc, "\n")
## XGBoost Accuracy: 0.9016393
Classification: Evaluation Model Performance
Evaluation 1: Logistic Regression
Train Logistic Regression model with k-fold cross-validation
Splits the data into 10 folds, trains the model on 9 folds, and validates on the remaining fold, repeated for all folds.
train_control <- trainControl(method = "cv", number = 10)
cat("Logistic Regression Training with 10-Fold Cross-Validation\n")
## Logistic Regression Training with 10-Fold Cross-Validation
start_time_cv <- Sys.time()
lr_model_cv <- train(
HeartDisease ~ .,
data = trainData,
method = "glm",
family = "binomial",
trControl = train_control
)
end_time_cv <- Sys.time()
lr_time_cv <- end_time_cv - start_time_cv
cat("Logistic Regression Training Time:", lr_time_cv, "\n")
## Logistic Regression Training Time: 0.4253371
lr_model_cv
## Generalized Linear Model
##
## 735 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 661, 661, 662, 661, 661, 661, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8461619 0.6881072
# Evaluate performance
lr_pred_cv <- predict(lr_model_cv, testData)
lr_acc_cv <- mean(lr_pred_cv == testData$HeartDisease)
cat("Logistic Regression Accuracy on Test Data:", lr_acc_cv, "\n")
## Logistic Regression Accuracy on Test Data: 0.9125683
Evaluation 2: Random Forest
Train Random Forest model with k-fold cross-validation
Splits the data into 10 folds, trains the model on 9 folds, and validates on the remaining fold, repeated for all folds.
train_control <- trainControl(method = "cv", number = 10)
cat("Random Forest Training with 10-Fold Cross-Validation\n")
## Random Forest Training with 10-Fold Cross-Validation
start_time_cv <- Sys.time()
rf_model_cv <- train(
HeartDisease ~ .,
data = trainData,
method = "rf",
trControl = train_control,
ntree = 100
)
end_time_cv <- Sys.time()
rf_time_cv <- end_time_cv - start_time_cv
cat("Random Forest Training Time:", rf_time_cv, "\n")
## Random Forest Training Time: 2.176677
rf_model_cv
## Random Forest
##
## 735 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 661, 661, 662, 661, 663, 661, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8597325 0.7135456
## 12 0.8421274 0.6784905
## 23 0.8434232 0.6809160
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Evaluate performance on test data
rf_pred_cv <- predict(rf_model_cv, testData)
rf_acc_cv <- mean(rf_pred == testData$HeartDisease)
cat("Random Forest Accuracy on Test Data:", rf_acc_cv, "\n")
## Random Forest Accuracy on Test Data: 0.9016393
Evaluation 3: XGBoost
Train XGBoost model with k-fold cross-validation
Splits the data into 10 folds, trains the model on 9 folds, and validates on the remaining fold, repeated for all folds.
train_control <- trainControl(method = "cv", number = 10)
cat("XGBoost Training with 10-Fold Cross-Validation\n")
## XGBoost Training with 10-Fold Cross-Validation
start_time_cv <- Sys.time()
xgb_model_cv <- train(
HeartDisease ~ .,
data = trainData,
method = "xgbTree",
trControl = train_control,
tuneGrid = expand.grid( # Hyperparameter tuning grid for models like XGBoost, which rely heavily on hyperparameter tuning for optimal performance.
nrounds = 100,
max_depth = 6,
eta = 0.3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
)
end_time_cv <- Sys.time()
xgb_time_cv <- end_time_cv - start_time_cv
cat("XGBoost Training Time:", xgb_time_cv, "\n")
## XGBoost Training Time: 4.582821
xgb_model_cv
## eXtreme Gradient Boosting
##
## 735 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 661, 662, 661, 662, 661, 662, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8449093 0.6848146
##
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
## held constant at a value of 1
## Tuning parameter 'subsample' was held
## constant at a value of 1
# Evaluate performance
xgb_pred_cv <- predict(xgb_model_cv, testData)
xgb_acc_cv <- mean(xgb_pred_cv == testData$HeartDisease)
cat("XGBoost Accuracy on Test Data:", xgb_acc_cv, "\n")
## XGBoost Accuracy on Test Data: 0.9016393
Classification: Model Comparison Results
In this section, summarised the results for comparison of accuracy for train model and cross-validation model’s result by each model.
# Load required libraries
if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
if (!require("tidyr")) install.packages("tidyr", dependencies = TRUE)
library(tidyr)
Combined results
results <- data.frame(
model = c("Logistic Regression", "Random Forest", "XGBoost"),
accuracy = c(lr_acc, rf_acc, xgb_acc),
train_time = c(lr_time, rf_time, xgb_time),
accuracy_cv = c(lr_acc_cv, rf_acc_cv, xgb_acc_cv),
train_time_cv = c(lr_time_cv, rf_time_cv, xgb_time_cv)
)
results
## model accuracy train_time accuracy_cv train_time_cv
## 1 Logistic Regression 0.9125683 0.01539493 secs 0.9125683 0.4253371 secs
## 2 Random Forest 0.9016393 0.06412911 secs 0.9016393 2.1766770 secs
## 3 XGBoost 0.9016393 0.44417214 secs 0.9016393 4.5828209 secs
Plot bar chart to show the comparison each mode.
1.Model Accuracy and Cross-Validation Accuracy Comparison
Note: Convert results to long format The conversion to long format organizes data in a way that supports clearer visualization
results_long <- results %>%
pivot_longer(
cols = c(accuracy, accuracy_cv),
names_to = "Metric",
values_to = "Value"
)
results_long
## # A tibble: 6 × 5
## model train_time train_time_cv Metric Value
## <chr> <drtn> <drtn> <chr> <dbl>
## 1 Logistic Regression 0.01539493 secs 0.4253371 secs accuracy 0.913
## 2 Logistic Regression 0.01539493 secs 0.4253371 secs accuracy_cv 0.913
## 3 Random Forest 0.06412911 secs 2.1766770 secs accuracy 0.902
## 4 Random Forest 0.06412911 secs 2.1766770 secs accuracy_cv 0.902
## 5 XGBoost 0.44417214 secs 4.5828209 secs accuracy 0.902
## 6 XGBoost 0.44417214 secs 4.5828209 secs accuracy_cv 0.902
ggplot(results_long, aes(x = model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") + # Side-by-side bars
geom_text(aes(label = sprintf("%.2f", Value)), position = position_dodge(0.9), vjust = -0.3, size = 4) +
labs(
title = "Model Accuracy and Cross-Validation Accuracy Comparison",
y = "Value",
x = "Model"
) +
theme_minimal()
2.Model Training Time and Cross-Validation Training Time Comparison
# Ensure training time columns are numeric
results$train_time_numeric <- as.numeric(results$train_time)
results$train_time_cv_numeric <- as.numeric(results$train_time_cv)
# Convert to long format
results_long_time <- results %>%
pivot_longer(
cols = c(train_time_numeric, train_time_cv_numeric),
names_to = "Metric",
values_to = "Value"
)
ggplot(results_long_time, aes(x = model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") + # Side-by-side bars
geom_text(aes(label = sprintf("%.2f", Value)), position = position_dodge(0.9), vjust = -0.3, size = 4) +
labs(
title = "Model Training Time and Cross-Validation Training Time Comparison",
y = "Training Time (seconds)",
x = "Model"
) +
theme_minimal()
Classification: Plotting Features Importance
1.Logistic Regression Features Importance
importance_lr <- varImp(lr_model)
importance_lr_df <- as.data.frame(importance_lr)
importance_lr_df$Feature <- rownames(importance_lr_df)
importance_lr_df$Overall <- as.numeric(as.character(importance_lr_df$Overall))
top_lr_features <- importance_lr_df[order(-importance_lr_df$Overall), ][1:10, ]
ggplot(top_lr_features, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "dodgerblue") +
coord_flip() +
labs(
title = "Logistic Regression Feature Importance (Top 10)",
x = "Features",
y = "Importance"
) +
theme_minimal()
2.Random Forest Features Importance
#Random Forest
importance_rf <- importance(rf_model)
importance_rf_df <- data.frame(Feature = row.names(importance_rf), importance_rf)
ggplot(importance_rf_df[order(-importance_rf_df$MeanDecreaseGini), ][1:10, ],
aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Random Forest Feature Importance (Top 10)",
x = "Features",
y = "Mean Decrease in Gini") +
theme_minimal()
3.XGBoost Features Importance
#Feature Importance for XGBoost
importance_xgb <- xgb.importance(model = xgb_model)
importance_xgb <- as.data.table(importance_xgb)
library(ggplot2)
ggplot(importance_xgb[1:10], aes(x = reorder(Feature, Gain), y = Gain)) +
geom_bar(stat = "identity", fill = "red") +
coord_flip() +
labs(title = "XGBoost Feature Importance",
x = "Feature",
y = "Gain") +
theme_minimal()
This study aims to predict the HeartRiskScore, a numerical indicator representing the likelihood of developing heart disease. By leveraging machine learning models like Random Forest and XGBoost, we analyze key health-related features to provide accurate and interpretable risk predictions(1-5), which can support early intervention and medical decision-making.
Task: Predict the HeartRiskScore.
Target Variable: HeartDisease
Models Used:
- Random Forest
- XGBoostModel 1: Random forest
# load package
if (!require("randomForest")) install.packages("randomForest")
library(randomForest)
# build random forest
rf_model_tuned <- randomForest(HeartRiskScore ~ FastingBS + MaxHR + Oldpeak + ExerciseAngina_Num + ST_Slope_Num,
data = trainData, ntree = 500, mtry = 2)
print(rf_model_tuned)
##
## Call:
## randomForest(formula = HeartRiskScore ~ FastingBS + MaxHR + Oldpeak + ExerciseAngina_Num + ST_Slope_Num, data = trainData, ntree = 500, mtry = 2)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.0204982
## % Var explained: 99.2
# Predictive test set
rf_predictions <- predict(rf_model_tuned, newdata = testData)
Model 2: Xgboost Converts a variable to a numeric type
if (!require("xgboost")) install.packages("xgboost")
library(xgboost)
# Extract the desired features
selected_features <- c("Age", "Sex", "RestingBP", "Cholesterol",
"FastingBS", "MaxHR", "Oldpeak", "ChestPainType",
"RestingECG", "ExerciseAngina", "ST_Slope", "HeartRiskScore")
trainData_selected <- trainData[, selected_features]
testData_selected <- testData[, selected_features]
# Performs Label Encoding on non-numeric variables
trainData_selected <- trainData_selected %>%
mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
mutate(across(where(is.factor), as.numeric))
testData_selected <- testData_selected %>%
mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
mutate(across(where(is.factor), as.numeric))
# Check whether the conversion to numeric type was successful
str(trainData_selected)
## 'data.frame': 735 obs. of 12 variables:
## $ Age : int 37 48 39 45 48 37 58 39 42 54 ...
## $ Sex : num 2 1 2 1 1 1 2 2 1 1 ...
## $ RestingBP : num 0.65 0.69 0.6 0.65 0.6 0.65 0.68 0.6 0.575 0.6 ...
## $ Cholesterol : num 0.469 0.355 0.562 0.393 0.471 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MaxHR : num 0.268 0.338 0.775 0.775 0.423 ...
## $ Oldpeak : num 0.295 0.466 0.295 0.295 0.295 ...
## $ ChestPainType : num 2 1 3 2 2 3 2 2 3 2 ...
## $ RestingECG : num 3 2 2 2 2 2 3 2 3 2 ...
## $ ExerciseAngina: num 1 2 1 1 1 1 2 1 1 1 ...
## $ ST_Slope : num 3 2 3 3 3 3 2 3 3 2 ...
## $ HeartRiskScore: int 0 3 0 0 0 0 3 0 0 2 ...
create DMatrix
# # Split feature and target variables
train_matrix <- as.matrix(trainData_selected[, -which(names(trainData_selected) == "HeartRiskScore")])
test_matrix <- as.matrix(testData_selected[, -which(names(testData_selected) == "HeartRiskScore")])
train_label <- trainData_selected$HeartRiskScore
test_label <- testData_selected$HeartRiskScore
# create DMatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
train model
params <- list(
objective = "reg:squarederror",
eval_metric = "rmse",
max_depth = 6,
eta = 0.3,
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(train = dtrain, test = dtest),
print_every_n = 10
)
## [1] train-rmse:1.506208 test-rmse:1.421310
## [11] train-rmse:0.130266 test-rmse:0.188532
## [21] train-rmse:0.042437 test-rmse:0.140540
## [31] train-rmse:0.026649 test-rmse:0.135121
## [41] train-rmse:0.018557 test-rmse:0.133363
## [51] train-rmse:0.013549 test-rmse:0.132659
## [61] train-rmse:0.009823 test-rmse:0.132525
## [71] train-rmse:0.007496 test-rmse:0.132513
## [81] train-rmse:0.005428 test-rmse:0.132487
## [91] train-rmse:0.004138 test-rmse:0.132380
## [100] train-rmse:0.003247 test-rmse:0.132511
print(xgb_model)
## ##### xgb.Booster
## raw: 356.5 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = 100, watchlist = list(train = dtrain,
## test = dtest), print_every_n = 10)
## params (as set within xgb.train):
## objective = "reg:squarederror", eval_metric = "rmse", max_depth = "6", eta = "0.3", subsample = "0.8", colsample_bytree = "0.8", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## cb.evaluation.log()
## # of features: 11
## niter: 100
## nfeatures : 11
## evaluation_log:
## iter train_rmse test_rmse
## <num> <num> <num>
## 1 1.506208263 1.4213101
## 2 1.122600707 1.0617256
## --- --- ---
## 99 0.003402947 0.1324679
## 100 0.003247064 0.1325114
Regression: Evaluation Model Performance
Evaluation 1: Random Forest plot Actual vs Predicted HeartRiskScore
library(ggplot2)
ggplot(data = data.frame(Actual = testData$HeartRiskScore, Predicted = rf_predictions),
aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red") +
labs(title = "Actual vs Predicted HeartRiskScore",
x = "Actual HeartRiskScore", y = "Predicted HeartRiskScore") +
theme_minimal()
Evaluation Metrics
rf_predictions <- predict(rf_model_tuned, newdata = testData)
mse <- mean((rf_predictions - test_label)^2)
rmse <- sqrt(mse)
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.1536583
mae <- mean(abs(rf_predictions - testData$HeartRiskScore))
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 0.1012041
rss <- sum((rf_predictions - testData$HeartRiskScore)^2) # Residual sum of squares
tss <- sum((testData$HeartRiskScore - mean(testData$HeartRiskScore))^2) # Total sum of squares
r_squared <- 1 - (rss / tss)
cat("R² Score:", r_squared, "\n")
## R² Score: 0.9907435
Evaluation 2: Xgboost
plot Actual vs Predicted HeartRiskScore
library(ggplot2)
# Predict on test data
xgb_predictions <- predict(xgb_model, newdata = dtest)
# Data box: actual value and predicted value
results <- data.frame(
Actual = test_label,
Predicted = xgb_predictions
)
# Scatter diagram
ggplot(results, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Actual vs Predicted HeartRiskScore",
x = "Actual HeartRiskScore",
y = "Predicted HeartRiskScore"
)
Evaluation Metrics
xgb_predictions <- predict(xgb_model, newdata = dtest)
# Model evaluation
mse <- mean((xgb_predictions - test_label)^2)
rmse <- sqrt(mse)
mae <- mean(abs(xgb_predictions - test_label))
rss <- sum((xgb_predictions - test_label)^2) # Residual sum of squares
tss <- sum((test_label - mean(test_label))^2) # Total sum of squares
r_squared <- 1 - (rss / tss)
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.1325114
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 0.07907581
cat("R² Score:", r_squared, "\n")
## R² Score: 0.993116
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.1325114
Regression: Plotting Features Importance
Random Forest
# Load the ggplot2 package
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Computational feature importance
feature_importance <- importance(rf_model_tuned)
importance_df <- data.frame(
Feature = rownames(feature_importance),
Importance = feature_importance[, "IncNodePurity"]
)
# Sort by feature importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
# Draw bar charts
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "grey") +
coord_flip() +
labs(
title = "Feature Importance (Random Forest)",
x = "Features",
y = "Importance (IncNodePurity)"
) +
theme_minimal()
XGBoost
# Computational feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_model)
# Plot feature importance
xgb.plot.importance(importance_matrix, top_n = 10, measure = "Gain", rel_to_first = TRUE)
Model Comparison
# Evaluate the performance of random forest models
rf_mse <- mean((rf_predictions - testData$HeartRiskScore)^2)
rf_rmse <- sqrt(rf_mse)
rf_mae <- mean(abs(rf_predictions - testData$HeartRiskScore))
rf_r2 <- 1 - (sum((rf_predictions - testData$HeartRiskScore)^2) /
sum((testData$HeartRiskScore - mean(testData$HeartRiskScore))^2))
cat("Random Forest RMSE:", rf_rmse, "\n")
## Random Forest RMSE: 0.1536583
cat("Random Forest MAE:", rf_mae, "\n")
## Random Forest MAE: 0.1012041
cat("Random Forest R²:", rf_r2, "\n")
## Random Forest R²: 0.9907435
# Predictive test set
xgb_predictions <- predict(xgb_model, newdata = dtest)
# Evaluate XGBoost model performance
xgb_mse <- mean((xgb_predictions - test_label)^2)
xgb_rmse <- sqrt(xgb_mse)
xgb_mae <- mean(abs(xgb_predictions - test_label))
xgb_r2 <- 1 - (sum((xgb_predictions - test_label)^2) /
sum((test_label - mean(test_label))^2))
cat("XGBoost RMSE:", xgb_rmse, "\n")
## XGBoost RMSE: 0.1325114
cat("XGBoost MAE:", xgb_mae, "\n")
## XGBoost MAE: 0.07907581
cat("XGBoost R²:", xgb_r2, "\n")
## XGBoost R²: 0.993116
# Create a comparison table
comparison_df <- data.frame(
Model = c("Random Forest", "XGBoost"),
RMSE = c(rf_rmse, xgb_rmse),
MAE = c(rf_mae, xgb_mae),
R2 = c(rf_r2, xgb_r2)
)
print(comparison_df)
## Model RMSE MAE R2
## 1 Random Forest 0.1536583 0.10120414 0.9907435
## 2 XGBoost 0.1325114 0.07907581 0.9931160
Plot bar chart to compare model performance
if (!require("tidyr")) install.packages("tidyr")
library(tidyr)
library(ggplot2)
# Converts a variable to a numeric type
comparison_long <- comparison_df %>%
pivot_longer(cols = c(RMSE, MAE), names_to = "Metric", values_to = "Value")
# plot bar chart
ggplot(comparison_long, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Model Comparison: Random Forest vs XGBoost",
x = "Model",
y = "Performance Metric",
fill = "Metric"
) +
theme_minimal()
Regression: Summary
Summary - Random Forest is the recommended model for predicting HeartRiskScore due to its superior performance metrics and faster training. XGBoost remains a robust alternative for more complex datasets or when further generalization is desired.
This study on heart disease risk prediction explored data preprocessing, exploratory data analysis (EDA), and the use of both regression and classification models. These approaches demonstrated the versatility of predictive modeling, offering numerical estimates (regression) and categorical classifications (classification).
Preprocessing was key to handling missing data, transforming variables, and encoding features, ensuring data quality for machine learning. EDA enriched the process by uncovering variable relationships and trends crucial for model training.
Regression models quantified predictor-outcome relationships, while classification models categorized patients into risk groups, evaluated on metrics like accuracy and sensitivity. Both approaches addressed distinct predictive objectives, highlighting their complementary roles in healthcare.
This work underscores the value of combining regression for numerical predictions and classification for actionable decisions in patient management. Future efforts could explore advanced techniques, such as ensemble models or deep learning, to further enhance predictive accuracy and support early interventions.