Introduction to the Domain

Heart disease remains the leading cause of mortality worldwide, affecting millions annually (World Health Organization, 2021). Among various cardiac conditions, heart failure stands out due to its complexity and severe health implications, often resulting from a mix of factors such as age, lifestyle, and other health-related conditions. The increasing prevalence of heart disease, combined with rising healthcare costs, underscores the need for improved prediction methods (Smith et al., 2020). This project aims to address this need by employing data analysis and predictive modeling to forecast heart failure risks, potentially enabling earlier intervention and better patient outcomes.

Dataset Overview

The dataset used in this analysis is Kaggle’s Heart Failure Prediction Dataset, which contains records for 918 patients in 12 columns: 11 clinical features and one target. The features include age, sex, resting blood pressure, cholesterol, and other clinical indicators. The target variable, HeartDisease, is binary, indicating whether a patient has heart disease (1) or not (0). The dataset is large enough to support descriptive analysis and the construction of predictive models.

Goal of the Analysis

Main Goal

The primary objective of this analysis is to develop a predictive model capable of accurately identifying patients at risk of heart failure based on their health data. The model aims to assist healthcare providers in detecting high-risk individuals early, facilitating timely and effective medical interventions.

Specific Objectives

Model Comparison

The analysis will compare various machine learning models, including Logistic Regression, Decision Tree, and KNN, to determine the most effective model (Hastie, Tibshirani, and Friedman, 2009). The comparison will focus on balancing prediction accuracy with model interpretability.

Why This Matters: In healthcare, it is crucial to balance accuracy with transparency. Doctors need to understand and trust the models they use, making interpretability a key consideration in model selection.

Objectives

Data Understanding and Preparation:
Grasp the underlying context and clinical significance of the heart disease dataset. This step involves a thorough review of the dataset to identify and address missing values, detect and handle outliers, and apply necessary transformations to prepare the data for accurate analysis. This preparation ensures that the data is clean and suitable for robust predictive modeling.
Descriptive Analytics:
Perform a detailed descriptive statistical analysis to summarize the key characteristics of the dataset. This includes computing statistical measures such as mean, median, variance, and range for various features. The goal is to build a solid understanding of the data’s distribution and identify any initial patterns that could be relevant for further analysis.
Exploratory Data Analysis (EDA):
Utilize a combination of visual and statistical techniques to explore the relationships within the data, uncover hidden patterns, and assess the quality of the data. EDA will include creating visualizations such as histograms, box plots, and scatter plots, which are essential for identifying trends, correlations, and potential issues such as multicollinearity or skewness in the dataset.
Result Analysis:
Assess the performance of the machine learning models using appropriate metrics such as accuracy and precision. This analysis will help determine the model’s effectiveness in predicting heart disease, identify the strengths and weaknesses of each approach, and offer insights that could guide further model refinement or clinical application.
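
As a minimal sketch of the metrics named under Result Analysis, the snippet below computes accuracy, precision, and recall from a confusion matrix in base R. The label vectors are hypothetical and are not model output from this analysis.

```r
# Hypothetical true labels and model predictions (illustration only)
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 1, 1), levels = c(0, 1))

# Confusion matrix: rows are predicted classes, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["1", "1"]  # true positives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives
tn <- cm["0", "0"]  # true negatives

accuracy  <- (tp + tn) / sum(cm)  # share of all predictions that are correct
precision <- tp / (tp + fp)       # share of predicted positives that are real
recall    <- tp / (tp + fn)       # share of real positives that were found

print(cm)
cat("Accuracy:", accuracy, " Precision:", precision, " Recall:", recall, "\n")
```

Precision describes how many flagged patients actually have heart disease, while recall describes how many true cases were caught. In a screening setting, missed cases are often the costlier error, so recall is worth reporting alongside accuracy and precision.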

# Load Required Libraries ------------------------------------------------
library(data.table)   # For efficient data handling
library(dplyr)        # For data manipulation
library(ggplot2)      # For data visualization
library(plotly)       # For interactive plots
library(caret)        # For machine learning tasks
library(recipes)      # For data preprocessing
library(rsample)      # For data splitting
library(randomForest) # For Random Forest model
library(xgboost)      # For XGBoost model
library(e1071)        # For Support Vector Machine (SVM)
library(rpart)        # For Decision Tree model
library(class)        # For K-Nearest Neighbors (KNN)
library(yardstick)    # For model evaluation metrics
library(naniar)       # For missing-value visualization
library(tidyverse)    # Core tidyverse packages (tidyr, readr, purrr, stringr, ...)
library(pROC)         # For ROC curves and AUC

# Load order matters: plotly::filter() and xgboost::slice() mask the dplyr versions,
# and yardstick masks caret's precision(), recall(), sensitivity(), and specificity().
# Call these with an explicit namespace (e.g. dplyr::filter()) to avoid surprises.

Data Preparation and Cleaning

Initial Data Inspection

# Load the Data -----------------------------------------------------------
df <- fread("C:/Users/aliha/Downloads/heart.csv")  # Ensuring the correct path

# Convert HeartDisease to a factor for classification
df$HeartDisease <- as.factor(df$HeartDisease)

# Display a sample of the data to confirm it loaded correctly
sample_n(df, 5)

##      Age    Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
##    <int> <char>        <char>     <int>       <int>     <int>     <char> <int>
## 1:    64      M           ASY       128         263         0     Normal   105
## 2:    52      F           ATA       140         225         0     Normal   140
## 3:    48      M           ASY       160         193         0     Normal   102
## 4:    54      M           ASY       130         202         1     Normal   112
## 5:    49      F           ASY       130         269         0     Normal   163
##    ExerciseAngina Oldpeak ST_Slope HeartDisease
##            <char>   <num>   <char>       <fctr>
## 1:              Y     0.2     Flat            0
## 2:              N     0.0       Up            0
## 3:              Y     3.0     Flat            1
## 4:              Y     2.0     Flat            1
## 5:              N     0.0       Up            0

# Basic structure and summary of the data
cat("Number of columns:", ncol(df), "\nNumber of rows:", nrow(df), "\n")

## Number of columns: 12 
## Number of rows: 918

str(df)

## Classes 'data.table' and 'data.frame':   918 obs. of  12 variables:
##  $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : chr  "M" "F" "M" "F" ...
##  $ ChestPainType : chr  "ATA" "NAP" "ATA" "ASY" ...
##  $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : chr  "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: chr  "N" "N" "N" "Y" ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : chr  "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease  : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

summary(df)

##       Age            Sex            ChestPainType        RestingBP    
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
##  Mean   :53.51                                         Mean   :132.4  
##  3rd Qu.:60.00                                         3rd Qu.:140.0  
##  Max.   :77.00                                         Max.   :200.0  
##   Cholesterol      FastingBS       RestingECG            MaxHR      
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
##  ExerciseAngina        Oldpeak          ST_Slope         HeartDisease
##  Length:918         Min.   :-2.6000   Length:918         0:410       
##  Class :character   1st Qu.: 0.0000   Class :character   1:508       
##  Mode  :character   Median : 0.6000   Mode  :character               
##                     Mean   : 0.8874                                  
##                     3rd Qu.: 1.5000                                  
##                     Max.   : 6.2000

# Check for nulls, cardinality, and unique values
summary_table <- data.frame(
  Feature = names(df),
  Missing = sapply(df, function(x) sum(is.na(x))),
  Cardinality = sapply(df, function(x) length(unique(x))),
  stringsAsFactors = FALSE
)
print(summary_table)

##                       Feature Missing Cardinality
## Age                       Age       0          50
## Sex                       Sex       0           2
## ChestPainType   ChestPainType       0           4
## RestingBP           RestingBP       0          67
## Cholesterol       Cholesterol       0         222
## FastingBS           FastingBS       0           2
## RestingECG         RestingECG       0           3
## MaxHR                   MaxHR       0         119
## ExerciseAngina ExerciseAngina       0           2
## Oldpeak               Oldpeak       0          53
## ST_Slope             ST_Slope       0           3
## HeartDisease     HeartDisease       0           2

The dataset contains no NA entries and offers a mix of numeric and categorical features that can contribute to predicting heart disease. However, the summary shows a minimum of 0 for both RestingBP and Cholesterol. These values are physiologically implausible and most likely stand in for missing measurements. The next steps are to preprocess the data further, choose appropriate models, and evaluate model performance.

Handling Missing Values

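# Impute missing numeric values with the median (Little and Rubin, 2019)
# The check above found no NA entries, so this step leaves the data unchanged;
# zeros in Cholesterol and RestingBP are not treated as missing here.
df <- df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))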
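
If the zeros in a column such as Cholesterol are placeholders for missing measurements (an assumption the dataset does not confirm), median imputation only takes effect once those zeros are recoded as NA. The sketch below shows the idea in base R on a small hypothetical vector:

```r
# Hypothetical miniature of the Cholesterol column; 0 is assumed to mean "not recorded"
chol <- c(289, 180, 0, 214, 0, 339, 237)

# Recode zeros as NA so they count as missing rather than as real measurements
chol_na <- replace(chol, chol == 0, NA)

# Median of the observed (non-missing) values only
observed_median <- median(chol_na, na.rm = TRUE)

# Fill the gaps with that median
chol_imputed <- ifelse(is.na(chol_na), observed_median, chol_na)

print(chol_imputed)
```

Computing the median after the recode matters: a median taken while the zeros are still present would be pulled downward by the placeholder values.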

Outlier Detection and Handling

# Detect outliers using boxplots (Tukey, 1977)
ggplot(df, aes(x = "", y = Cholesterol)) + 
  geom_boxplot() +
  labs(title = "Boxplot for Cholesterol", y = "Cholesterol")

# Handle outliers by capping them at the 5th and 95th percentiles
lower_limit <- quantile(df$Cholesterol, 0.05)
upper_limit <- quantile(df$Cholesterol, 0.95)

df <- df %>%
  mutate(Cholesterol = ifelse(Cholesterol < lower_limit, lower_limit, Cholesterol),
         Cholesterol = ifelse(Cholesterol > upper_limit, upper_limit, Cholesterol))

Interpretation of Cholesterol Boxplot

Central Tendency:
- The median cholesterol level is around 223 mg/dL, which is shown by the thick line within the box. This indicates that half of the cholesterol values are below this level and half are above.
Interquartile Range (IQR):
- The box represents the interquartile range, which spans from the first quartile (approximately 173 mg/dL) to the third quartile (approximately 267 mg/dL). This range captures the middle 50% of the data, providing an indication of the typical range of cholesterol values.
Spread and Skewness:
- The distribution appears slightly skewed to the higher end, as indicated by the longer whisker and the presence of outliers above the upper quartile. The lower whisker is shorter, and there are fewer outliers on the lower end, suggesting that while cholesterol levels vary widely, extremely high values are more common than extremely low ones.
Outliers:
- Several data points are identified as outliers, particularly above the upper whisker (above approximately 400 mg/dL). These outliers represent individuals with unusually high cholesterol levels that are not typical of the general population in this dataset.
Minimum and Maximum:
- The minimum recorded cholesterol value is 0 mg/dL, which may indicate missing or erroneous data. The maximum value exceeds 600 mg/dL, which is represented by the highest outlier.

Conclusion:

The boxplot indicates that while most individuals in the dataset have cholesterol levels between 173 and 267 mg/dL, a notable number have significantly higher levels, giving the distribution a long upper tail. At the same time, the mean (198.8 mg/dL) lies below the median (223 mg/dL), which indicates that the values recorded as 0 pull the overall distribution downward. The presence of both very low and very high values calls for a closer look at potential data errors before any outlier treatment is chosen.
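
The whiskers and outlier points in a boxplot follow Tukey's rule: a point is flagged when it lies more than 1.5 times the IQR beyond a quartile. The following base R sketch applies that rule to a hypothetical vector; `tukey_fences` is an illustrative helper, not a function from any package used in this report.

```r
# Illustrative helper: lower and upper Tukey fences for a numeric vector
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Hypothetical cholesterol-like values, including a zero and one extreme reading
x <- c(0, 170, 180, 200, 220, 230, 250, 270, 290, 603)

fences <- tukey_fences(x)
outliers <- x[x < fences["lower"] | x > fences["upper"]]

print(fences)
print(outliers)
```

Applied to the quartiles reported by summary(df) above (173.2 and 267.0), the same arithmetic gives fences of roughly 32.5 and 407.7 mg/dL. That is consistent with the outliers seen above about 400 mg/dL, and it also means the values recorded as 0 fall below the lower fence.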

Data Transformation and Feature Engineering

# Creating an Age Group feature
df <- df %>%
  mutate(AgeGroup = cut(Age, breaks = c(0, 30, 45, 60, 100), labels = c("Young", "Middle-aged", "Senior", "Elderly")))

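# Encoding categorical variables as integer codes
# Only factor columns are affected: HeartDisease (recoded from 0/1 to 1/2) and AgeGroup (1 to 4).
# Character columns such as Sex and ChestPainType remain text.
df <- df %>%
  mutate(across(where(is.factor), as.integer))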

# Log transformation for skewed distributions
df <- df %>%
  mutate(Cholesterol_log = log1p(Cholesterol))

Standardization

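# Standardize numeric features
# This also standardizes the integer-coded HeartDisease target and AgeGroup, as the
# str() output below shows; the target must be restored to a two-level factor
# before fitting a classifier.
df <- df %>%
  mutate(across(where(is.numeric), scale))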
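
One possible alternative, sketched here in base R on a hypothetical toy data frame, is to standardize only the continuous predictors, dummy-encode the character predictors, and leave the target as a factor. The column selection in `num_cols` is an assumption made for illustration, not the preprocessing used in this report.

```r
# Hypothetical toy data with the same column types as the heart data
toy <- data.frame(
  Age          = c(40, 49, 37, 48, 54, 39),
  MaxHR        = c(172, 156, 98, 108, 122, 170),
  Sex          = c("M", "F", "M", "F", "M", "M"),
  HeartDisease = factor(c(0, 1, 0, 1, 0, 0)),
  stringsAsFactors = FALSE
)

# Standardize continuous predictors only; as.numeric() drops the
# one-column matrix shape that scale() returns
num_cols <- c("Age", "MaxHR")
toy[num_cols] <- lapply(toy[num_cols], function(x) as.numeric(scale(x)))

# Dummy-encode the character predictor; the first level ("F") is the reference
dummies <- model.matrix(~ Sex, data = toy)[, -1, drop = FALSE]

# Combine scaled predictors, untouched factor target, and dummy columns
prepared <- cbind(toy[c(num_cols, "HeartDisease")], dummies)

str(prepared)
```

Keeping the target out of the scaling step preserves its class labels, which classification functions generally expect as a factor.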

Final Dataset Check

# Final structure and summary check
str(df)

## Classes 'data.table' and 'data.frame':   918 obs. of  14 variables:
##  $ Age            : num [1:918, 1] -1.4324 -0.4782 -1.7504 -0.5842 0.0519 ...
##   ..- attr(*, "scaled:center")= num 53.5
##   ..- attr(*, "scaled:scale")= num 9.43
##  $ Sex            : chr  "M" "F" "M" "F" ...
##  $ ChestPainType  : chr  "ATA" "NAP" "ATA" "ASY" ...
##  $ RestingBP      : num [1:918, 1] 0.411 1.491 -0.129 0.303 0.951 ...
##   ..- attr(*, "scaled:center")= num 132
##   ..- attr(*, "scaled:scale")= num 18.5
##  $ Cholesterol    : num [1:918, 1] 0.89415 -0.15333 0.83649 0.17341 -0.00918 ...
##   ..- attr(*, "scaled:center")= num 196
##   ..- attr(*, "scaled:scale")= num 104
##  $ FastingBS      : num [1:918, 1] -0.551 -0.551 -0.551 -0.551 -0.551 ...
##   ..- attr(*, "scaled:center")= num 0.233
##   ..- attr(*, "scaled:scale")= num 0.423
##  $ RestingECG     : chr  "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR          : num [1:918, 1] 1.382 0.754 -1.524 -1.132 -0.582 ...
##   ..- attr(*, "scaled:center")= num 137
##   ..- attr(*, "scaled:scale")= num 25.5
##  $ ExerciseAngina : chr  "N" "N" "N" "Y" ...
##  $ Oldpeak        : num [1:918, 1] -0.832 0.106 -0.832 0.574 -0.832 ...
##   ..- attr(*, "scaled:center")= num 0.887
##   ..- attr(*, "scaled:scale")= num 1.07
##  $ ST_Slope       : chr  "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease   : num [1:918, 1] -1.113 0.898 -1.113 0.898 -1.113 ...
##   ..- attr(*, "scaled:center")= num 1.55
##   ..- attr(*, "scaled:scale")= num 0.497
##  $ AgeGroup       : num [1:918, 1] -1.4894 -0.0318 -1.4894 -0.0318 -0.0318 ...
##   ..- attr(*, "scaled:center")= num 3.02
##   ..- attr(*, "scaled:scale")= num 0.686
##  $ Cholesterol_log: num [1:918, 1] 0.572 0.352 0.563 0.433 0.39 ...
##   ..- attr(*, "scaled:center")= num 4.44
##   ..- attr(*, "scaled:scale")= num 2.14
##  - attr(*, ".internal.selfref")=<externalptr>

summary(df)

##         Age.V1            Sex            ChestPainType         RestingBP.V1    
##  Min.   :-2.7045405   Length:918         Length:918         Min.   :-7.151097  
##  1st Qu.:-0.6902531   Class :character   Class :character   1st Qu.:-0.669570  
##  Median : 0.0518527   Mode  :character   Mode  :character   Median :-0.129442  
##  Mean   : 0.0000000                                         Mean   : 0.000000  
##  3rd Qu.: 0.6879435                                         3rd Qu.: 0.410685  
##  Max.   : 2.4902005                                         Max.   : 3.651449  
##     Cholesterol.V1        FastingBS.V1      RestingECG       
##  Min.   :-1.8831107   Min.   :-0.5510410   Length:918        
##  1st Qu.:-0.2181941   1st Qu.:-0.5510410   Class :character  
##  Median : 0.2598988   Median :-0.5510410   Mode  :character  
##  Mean   : 0.0000000   Mean   : 0.0000000                     
##  3rd Qu.: 0.6827348   3rd Qu.:-0.5510410                     
##  Max.   : 1.3006519   Max.   : 1.8127703                     
##        MaxHR.V1       ExerciseAngina         Oldpeak.V1        ST_Slope        
##  Min.   :-3.0168248   Length:918         Min.   :-3.269699   Length:918        
##  1st Qu.:-0.6602179   Class :character   1st Qu.:-0.831979   Class :character  
##  Median : 0.0467642   Mode  :character   Median :-0.269428   Mode  :character  
##  Mean   : 0.0000000                      Mean   : 0.000000                     
##  3rd Qu.: 0.7537463                      3rd Qu.: 0.574398                     
##  Max.   : 2.5604783                      Max.   : 4.981047                     
##    HeartDisease.V1        AgeGroup.V1       Cholesterol_log.V1 
##  Min.   :-1.1125083   Min.   :-2.9471159   Min.   :-2.0729676  
##  1st Qu.:-1.1125083   1st Qu.:-0.0317577   1st Qu.: 0.3347031  
##  Median : 0.8978905   Median :-0.0317577   Median : 0.4518831  
##  Mean   : 0.0000000   Mean   : 0.0000000   Mean   : 0.0000000  
##  3rd Qu.: 0.8978905   3rd Qu.:-0.0317577   3rd Qu.: 0.5355562  
##  Max.   : 0.8978905   Max.   : 1.4259214   Max.   : 0.6358902

# Check for any remaining missing values or anomalies
missing_values <- sapply(df, function(x) sum(is.na(x)))
print(missing_values)

##             Age             Sex   ChestPainType       RestingBP     Cholesterol 
##               0               0               0               0               0 
##       FastingBS      RestingECG           MaxHR  ExerciseAngina         Oldpeak 
##               0               0               0               0               0 
##        ST_Slope    HeartDisease        AgeGroup Cholesterol_log 
##               0               0               0               0

Visualizing missing values

gg_miss_var(df) + 
  labs(title = "Missing Values in the Dataset") +
  theme_minimal()

Interpretation of Missing Values Plot

The plot shows that there are no missing values in any of the variables within the dataset. Each variable, including ST_Slope, Sex, RestingECG, ExerciseAngina, and ChestPainType, has a count of zero missing entries. This completeness suggests that the dataset is well-prepared for analysis without requiring imputation or further preprocessing to handle missing data.

Critical Note:

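While the absence of missing values is ideal for most analyses, it’s crucial to verify that this completeness is genuine and not the result of preprocessing that masked or incorrectly filled in missing values. In this dataset, the raw minimum of 0 for Cholesterol and RestingBP suggests that some missing measurements were recorded as zeros, which a check based on NA cannot detect.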

Data Visualization

Visualizing Data Distributions

# Age Distribution
ggplot(df, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = '#c44e52', color = 'black') +
  theme_minimal() +
  labs(title = "Distribution of Age", x = "Age", y = "Count")

# Sex Distribution
sex_counts <- df %>%
  count(Sex)

plot_ly(sex_counts, labels = ~Sex, values = ~n, type = 'pie', textinfo = 'label+percent',
        marker = list(colors = c("#FF9999", "#66B2FF"))) %>%
  layout(title = "Distribution of Sex")

Interpretation of Age Distribution

The histogram displays the distribution of the Age variable in the dataset. Because Age was standardized before plotting, the x-axis is in standard deviations rather than years, and a bin width of 2 covers two standard deviations per bar. The distribution therefore appears highly concentrated around the mean, with most observations in the central bars and fewer instances of both younger and older individuals in the tails.

Critical Note:

The sharp concentration in the middle is partly an artifact of this coarse binning, so it should not be read on its own as a lack of diversity in age representation. The raw ages range from 28 to 77 with a median of 54, so models built on this data may generalize poorly to patients outside that range and may be biased towards the most heavily represented age group.
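
Since Age was standardized before this plot was drawn, a bin width of 2 means two standard deviations rather than two years. The base R sketch below, using hypothetical ages, shows how the attributes stored by scale() can restore the original units and how the bin width translates into an approximate bin count on each scale.

```r
# Hypothetical raw ages (illustration only)
age <- c(28, 40, 47, 54, 60, 66, 77)
age_scaled <- scale(age)

# scale() stores the centering and scaling constants as attributes
center <- attr(age_scaled, "scaled:center")
spread <- attr(age_scaled, "scaled:scale")

# Undo the standardization to get back to years
age_restored <- as.numeric(age_scaled) * spread + center

# Approximate number of bins a width of 2 produces on each scale
bins_raw    <- ceiling(diff(range(age)) / 2)
bins_scaled <- ceiling(diff(range(age_scaled)) / 2)

cat("Bins on raw scale:", bins_raw, " Bins on standardized scale:", bins_scaled, "\n")
```

In this report the standardized Age spans about 5.19 standard deviations, so a bin width of 2 leaves roughly three bars, whereas the raw range of 28 to 77 years would give about 25.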

Visualizing Relationships Between Variables

# Scatter Plot of Age vs Cholesterol
ggplot(df, aes(x = Age, y = Cholesterol)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatter Plot of Age vs Cholesterol", x = "Age", y = "Cholesterol") +
  theme_minimal()

Interpretation of Scatter Plot: Age vs. Cholesterol

The scatter plot illustrates the relationship between Age and Cholesterol. The data points are widely scattered, indicating a lack of a clear linear relationship between these two variables. Additionally, there appears to be a significant clustering of points around certain age ranges and cholesterol levels, suggesting that other factors may be influencing cholesterol levels more significantly than age alone.

Critical Note:

The presence of many points at the lower extremes suggests potential outliers or data quality issues, particularly at the lowest cholesterol values. This could indicate measurement errors or a need for data cleaning before further analysis.

Box Plots for Categorical vs Numeric Variables

# Box Plot of Cholesterol by Sex
ggplot(df, aes(x = Sex, y = Cholesterol)) +
  geom_boxplot(fill = "#66B2FF", color = "black") +
  labs(title = "Box Plot of Cholesterol by Sex", x = "Sex", y = "Cholesterol") +
  theme_minimal()

Interpretation of Box Plot: Cholesterol by Sex

The box plot compares the distribution of cholesterol levels between males (M) and females (F). Both sexes exhibit a similar range of cholesterol levels, but there is a slight difference in median values, with males showing a slightly higher median cholesterol level. The presence of outliers, particularly in females, suggests some extreme cholesterol values that could be due to data variability or possible measurement errors.

Critical Note:

The differences between the sexes are not pronounced, indicating that sex might not be a strong determinant of cholesterol levels in this dataset. Further analysis is required to determine if these differences are statistically significant.
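
One way to check whether such a difference is statistically significant is a two-sample test. The sketch below uses the Wilcoxon rank-sum test from base R, chosen here because it does not assume normally distributed values. The two samples are simulated and hypothetical; they are not drawn from the heart dataset.

```r
set.seed(42)  # make the simulated example reproducible

# Hypothetical cholesterol samples for two groups (illustration only)
chol_m <- rnorm(60, mean = 235, sd = 50)
chol_f <- rnorm(40, mean = 250, sd = 55)

# Wilcoxon rank-sum test: do the two groups differ in location?
test <- wilcox.test(chol_m, chol_f)

cat("W statistic:", unname(test$statistic), "\n")
cat("p-value:", test$p.value, "\n")
```

A p-value below the chosen significance level (commonly 0.05) would suggest a difference in the typical cholesterol level between the groups. With the real data, the test would be run on the unscaled Cholesterol values split by Sex.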

Visualizing Distributions Before and After Transformation

# Visualize original vs. log-transformed Cholesterol
p1 <- ggplot(df, aes(x = Cholesterol)) +
  geom_histogram(fill = "#FF9999", color = "black", bins = 30) +
  labs(title = "Distribution of Cholesterol", x = "Cholesterol", y = "Count") +
  theme_minimal()

p2 <- ggplot(df, aes(x = Cholesterol_log)) +
  geom_histogram(fill = "#66B2FF", color = "black", bins = 30) +
  labs(title = "Distribution of Log-Transformed Cholesterol", x = "Log(Cholesterol)", y = "Count") +
  theme_minimal()

# Use gridExtra to display plots side by side
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.4.1

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(p1, p2, ncol = 2)

Interpretation of Cholesterol Distributions

Original Cholesterol Distribution (Left)

The distribution of cholesterol levels is highly skewed, with a significant number of zero or near-zero values, which might indicate missing or misreported data. The skewness suggests that the data is not normally distributed, which could impact the effectiveness of some statistical analyses.

Log-Transformed Cholesterol Distribution (Right)

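The log transformation compresses the upper end of the distribution, which reduces the influence of very high cholesterol readings. The zero values, however, map to log1p(0) = 0 and sit far below the rest of the data: the standardized minimum is -2.07 while the first quartile is 0.33. The transformed distribution therefore remains strongly left-skewed and is not close to normal.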

Critical Note:

Despite the transformation, the presence of a large number of zero or near-zero values is concerning and may indicate issues in data collection.
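
The effect of zeros on a log transformation can be illustrated in base R with a small hypothetical vector. The sketch compares the gap between the zeros and the smallest non-zero value against the spread of the non-zero values, before and after log1p():

```r
# Hypothetical cholesterol-like values with two zeros (illustration only)
chol <- c(0, 0, 180, 214, 237, 289, 339)
chol_log <- log1p(chol)  # log(1 + x), so zeros stay at exactly 0

nonzero <- chol > 0

# Gap from zero to the smallest non-zero value, relative to the non-zero spread
ratio_raw <- min(chol[nonzero]) / diff(range(chol[nonzero]))
ratio_log <- min(chol_log[nonzero]) / diff(range(chol_log[nonzero]))

cat("Relative gap on raw scale:", round(ratio_raw, 2), "\n")
cat("Relative gap on log scale:", round(ratio_log, 2), "\n")
```

On the log scale the non-zero values are squeezed together while the zeros stay at 0, so the zeros end up relatively more isolated than before. This is why a log transformation on its own does not address placeholder zeros.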

Pairwise Correlation Analysis

# Calculate the correlation matrix
correlation_matrix <- cor(df %>% select(where(is.numeric)))

# Visualize the correlation matrix
library(ggcorrplot)

## Warning: package 'ggcorrplot' was built under R version 4.4.1

ggcorrplot(correlation_matrix, lab = TRUE)

Interpretation of Correlation Matrix

Cholesterol and Cholesterol_log show a correlation that rounds to 1, as expected since Cholesterol_log is a monotonic transformation of Cholesterol.
Age and AgeGroup show a very high positive correlation (0.91), indicating that the grouping of age is consistent with the raw age data.
MaxHR and Age exhibit a moderate negative correlation (-0.38), suggesting that as age increases, the maximum heart rate tends to decrease.
HeartDisease shows a modest positive correlation with Age (0.28) and Oldpeak (0.4), indicating that older age and higher ST depression are associated with higher heart disease risk.
FastingBS and HeartDisease have a low positive correlation (0.27), which may suggest some association between fasting blood sugar and heart disease.
RestingBP and most other variables have low or negligible correlations, indicating that resting blood pressure might not be strongly related to other features in this dataset.

Overall, this correlation matrix helps identify potential relationships between variables, which can be explored further in predictive modeling.
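
A correlation shown as 1 in a labeled plot can be a rounded value. The base R sketch below, on hypothetical data, contrasts the Pearson correlation between a variable and its log1p transform with the Spearman rank correlation, which equals 1 for any strictly increasing transformation.

```r
# Hypothetical values including zeros (illustration only)
x <- c(0, 0, 180, 214, 237, 289, 339)

# Pearson measures linear association, so a nonlinear transform gives less than 1
pearson <- cor(x, log1p(x))

# Spearman works on ranks, which a strictly increasing transform leaves unchanged
spearman <- cor(x, log1p(x), method = "spearman")

cat("Pearson:", pearson, " Spearman:", spearman, "\n")
```

Because the two columns carry the same ranking information, keeping both Cholesterol and Cholesterol_log as predictors in one model adds redundancy rather than new signal.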

Summary of dataset

# Get a summary of the dataset
summary(df)

##         Age.V1            Sex            ChestPainType         RestingBP.V1    
##  Min.   :-2.7045405   Length:918         Length:918         Min.   :-7.151097  
##  1st Qu.:-0.6902531   Class :character   Class :character   1st Qu.:-0.669570  
##  Median : 0.0518527   Mode  :character   Mode  :character   Median :-0.129442  
##  Mean   : 0.0000000                                         Mean   : 0.000000  
##  3rd Qu.: 0.6879435                                         3rd Qu.: 0.410685  
##  Max.   : 2.4902005                                         Max.   : 3.651449  
##     Cholesterol.V1        FastingBS.V1      RestingECG       
##  Min.   :-1.8831107   Min.   :-0.5510410   Length:918        
##  1st Qu.:-0.2181941   1st Qu.:-0.5510410   Class :character  
##  Median : 0.2598988   Median :-0.5510410   Mode  :character  
##  Mean   : 0.0000000   Mean   : 0.0000000                     
##  3rd Qu.: 0.6827348   3rd Qu.:-0.5510410                     
##  Max.   : 1.3006519   Max.   : 1.8127703                     
##        MaxHR.V1       ExerciseAngina         Oldpeak.V1        ST_Slope        
##  Min.   :-3.0168248   Length:918         Min.   :-3.269699   Length:918        
##  1st Qu.:-0.6602179   Class :character   1st Qu.:-0.831979   Class :character  
##  Median : 0.0467642   Mode  :character   Median :-0.269428   Mode  :character  
##  Mean   : 0.0000000                      Mean   : 0.000000                     
##  3rd Qu.: 0.7537463                      3rd Qu.: 0.574398                     
##  Max.   : 2.5604783                      Max.   : 4.981047                     
##    HeartDisease.V1        AgeGroup.V1       Cholesterol_log.V1 
##  Min.   :-1.1125083   Min.   :-2.9471159   Min.   :-2.0729676  
##  1st Qu.:-1.1125083   1st Qu.:-0.0317577   1st Qu.: 0.3347031  
##  Median : 0.8978905   Median :-0.0317577   Median : 0.4518831  
##  Mean   : 0.0000000   Mean   : 0.0000000   Mean   : 0.0000000  
##  3rd Qu.: 0.8978905   3rd Qu.:-0.0317577   3rd Qu.: 0.5355562  
##  Max.   : 0.8978905   Max.   : 1.4259214   Max.   : 0.6358902

Interpretation of Summary Statistics

Age.V1: The data is standardized with a mean of 0 and a range from -2.70 to 2.49. The median (0.05) sits close to the mean and the quartiles (-0.69 and 0.69) are nearly equidistant from it, suggesting an approximately symmetric age distribution.
Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope: These categorical variables remain character columns without missing values; further exploration is needed to assess their impact.
RestingBP.V1: The minimum of -7.15 corresponds to a raw value of 0, which is physiologically implausible and points to a data-entry or missing-value issue rather than a genuine measurement. The remaining values lie close to the mean (quartiles at -0.67 and 0.41).
Cholesterol.V1: The narrow range (-1.88 to 1.30) is a consequence of capping at the 5th and 95th percentiles, not of low variability, since every standardized variable has a standard deviation of 1. The median (0.26) lies above the mean, consistent with the zero values pulling the mean down.
FastingBS.V1: This is a binary indicator, so only two standardized values occur (-0.55 and 1.81). The raw mean of 0.2331 means that about 23% of patients have FastingBS = 1; these are not outliers.
MaxHR.V1: A wide range (-3.02 to 2.56) shows considerable variability in maximum heart rates, with the data roughly symmetric around the mean.
Oldpeak.V1: The distribution is right-skewed: the median (-0.27) lies below the mean, and the maximum (4.98) is farther from the center than the minimum (-3.27).
HeartDisease.V1: This is the binary target, so only two standardized values occur (-1.11 and 0.90). The raw counts are 508 patients with heart disease and 410 without (about 55% versus 45%). Standardizing the target is a side effect of scaling every numeric column.
AgeGroup.V1: All three quartiles coincide at -0.03, so at least half of the patients fall into a single category, the "Senior" group (ages 46 to 60).
Cholesterol_log.V1: The distribution is strongly left-skewed: three quarters of the values lie between 0.33 and 0.64, while the minimum is -2.07. The low values come from raw zeros, since log1p(0) = 0.

Critical Insights

Outliers: Variables like RestingBP.V1 and Oldpeak.V1 have extreme values, which could distort model fitting and may require robust handling.
Skewness: Some variables exhibit skewness (e.g., Oldpeak.V1, Cholesterol_log.V1), suggesting the need for further transformation or modeling adjustments.
Class Imbalance: HeartDisease.V1 shows a mild imbalance in favor of the positive class, which should be checked in subsequent modeling steps.
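
The share of positive cases can be recovered from the two values that a standardized binary column takes. The following is a minimal sketch, assuming HeartDisease.V1 and FastingBS.V1 were produced by z-scoring 0/1 indicators; the helper name `positive_share` is hypothetical.

```r
# For a 0/1 variable with share p of ones, z-scoring maps 0 to -p/s and 1 to (1 - p)/s,
# where s is the standard deviation. The ratio |z0| / z1 therefore equals p / (1 - p).
positive_share <- function(z0, z1) {
  abs(z0) / (abs(z0) + z1)
}

# Min and Max of HeartDisease.V1 and FastingBS.V1, copied from the summary output
p_heart   <- positive_share(z0 = -1.1125083, z1 = 0.8978905)
p_fasting <- positive_share(z0 = -0.5510410, z1 = 1.8127703)

round(c(HeartDisease = p_heart, FastingBS = p_fasting), 3)
```

Under that assumption, roughly 55% of patients are in the positive HeartDisease class and roughly 23% have the higher FastingBS value.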

Descriptive Statistics

# Calculate descriptive statistics for each numeric column
descriptive_stats <- df %>%
  summarise(across(where(is.numeric), list(
    mean = ~ mean(., na.rm = TRUE),
    median = ~ median(., na.rm = TRUE),
    sd = ~ sd(., na.rm = TRUE),
    variance = ~ var(., na.rm = TRUE),
    min = ~ min(., na.rm = TRUE),
    max = ~ max(., na.rm = TRUE),
    range = ~ max(., na.rm = TRUE) - min(., na.rm = TRUE),
    Q1 = ~ quantile(., 0.25, na.rm = TRUE),
    Q3 = ~ quantile(., 0.75, na.rm = TRUE)
  )))

# Display the descriptive statistics
descriptive_stats

##        Age_mean Age_median Age_sd Age_variance  Age_min  Age_max Age_range
## 1 -1.200679e-16 0.05185271      1            1 -2.70454 2.490201  5.194741
##       Age_Q1    Age_Q3 RestingBP_mean RestingBP_median RestingBP_sd
## 1 -0.6902531 0.6879435   2.001895e-16       -0.1294423            1
##   RestingBP_variance RestingBP_min RestingBP_max RestingBP_range RestingBP_Q1
## 1                  1     -7.151097      3.651449        10.80255   -0.6695696
##   RestingBP_Q3 Cholesterol_mean Cholesterol_median Cholesterol_sd
## 1     0.410685    -8.166784e-17          0.2598988              1
##   Cholesterol_variance Cholesterol_min Cholesterol_max Cholesterol_range
## 1                    1       -1.883111        1.300652          3.183763
##   Cholesterol_Q1 Cholesterol_Q3 FastingBS_mean FastingBS_median FastingBS_sd
## 1     -0.2181941      0.6827348  -4.402664e-17        -0.551041            1
##   FastingBS_variance FastingBS_min FastingBS_max FastingBS_range FastingBS_Q1
## 1                  1     -0.551041       1.81277        2.363811    -0.551041
##   FastingBS_Q3   MaxHR_mean MaxHR_median MaxHR_sd MaxHR_variance MaxHR_min
## 1    -0.551041 5.156867e-16   0.04676419        1              1 -3.016825
##   MaxHR_max MaxHR_range   MaxHR_Q1  MaxHR_Q3  Oldpeak_mean Oldpeak_median
## 1  2.560478    5.577303 -0.6602179 0.7537463 -4.134731e-18      -0.269428
##   Oldpeak_sd Oldpeak_variance Oldpeak_min Oldpeak_max Oldpeak_range Oldpeak_Q1
## 1          1                1   -3.269699    4.981047      8.250747 -0.8319789
##   Oldpeak_Q3 HeartDisease_mean HeartDisease_median HeartDisease_sd
## 1  0.5743984     -1.964449e-16           0.8978905               1
##   HeartDisease_variance HeartDisease_min HeartDisease_max HeartDisease_range
## 1                     1        -1.112508        0.8978905           2.010399
##   HeartDisease_Q1 HeartDisease_Q3 AgeGroup_mean AgeGroup_median AgeGroup_sd
## 1       -1.112508       0.8978905 -6.573383e-17     -0.03175771           1
##   AgeGroup_variance AgeGroup_min AgeGroup_max AgeGroup_range AgeGroup_Q1
## 1                 1    -2.947116     1.425921       4.373037 -0.03175771
##   AgeGroup_Q3 Cholesterol_log_mean Cholesterol_log_median Cholesterol_log_sd
## 1 -0.03175771         1.489672e-16              0.4518831                  1
##   Cholesterol_log_variance Cholesterol_log_min Cholesterol_log_max
## 1                        1           -2.072968           0.6358902
##   Cholesterol_log_range Cholesterol_log_Q1 Cholesterol_log_Q3
## 1              2.708858          0.3347031          0.5355562

Interpretation of Descriptive Statistics

Age:
- Mean: The mean is 0 (up to floating-point error) because the variable was standardized.
- Median: The median age is 0.0519, close to the mean, indicating a roughly symmetric age distribution.
- Standard Deviation (SD): The SD is 1 by construction after standardization, so it carries no information about the original spread.
- Range: The age values range from -2.70 to 2.49, so all ages lie within about 2.7 standard deviations of the mean.
- Quartiles: Q1 is -0.69 and Q3 is 0.69, showing a balanced distribution around the median.
Resting Blood Pressure (RestingBP):
- Mean and Median: The mean is 0 by construction, and the median (-0.13) lies slightly below it.
- SD and Variance: Both equal 1 by construction. The range (from -7.15 to 3.65) indicates extreme values, in particular a minimum more than seven standard deviations below the mean, which may be a data issue.
- Quartiles: Q1 is -0.67 and Q3 is 0.41, each about 0.54 from the median, so the central half of the data is symmetric.
Cholesterol:
- Mean and Median: The mean is 0 by construction and the median is 0.26. A median above the mean indicates left skew, driven by the low values.
- SD and Range: The SD is 1 by construction. The range from -1.88 to 1.30 shows no extreme values in standard-deviation terms, with the minimum lying farther from the center than the maximum.
- Quartiles: Q1 is -0.22 and Q3 is 0.68, so the middle half of the values lies within this interval.
Fasting Blood Sugar (FastingBS):
- Mean: The mean is 0 by construction and therefore says nothing about typical fasting blood sugar levels.
- Range: The values run from -0.55 to 1.81, consistent with a binary indicator.
- Quartiles: Q1, the median, and Q3 all equal -0.55, so at least three quarters of patients fall in the lower (normal) category.
Maximum Heart Rate (MaxHR):
- Mean and Median: The median (0.05) is close to the mean of 0, indicating a nearly symmetric distribution.
- Range: The range from -3.02 to 2.56 highlights variability, with some extreme values.
- Quartiles: Q1 is -0.66 and Q3 is 0.75, indicating a relatively wide spread around the median.
Oldpeak:
- Mean and Median: The median (-0.27) lies below the mean of 0, a first sign of right skew.
- Range: The range from -3.27 to 4.98 indicates significant variability and potential outliers.
- Quartiles: Q1 is -0.83 and Q3 is 0.57. Q3 lies farther from the median than Q1, confirming a slight right skew.
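
The skewness statements above can be checked with a quartile-based measure. The following is a minimal sketch using Bowley's skewness coefficient, which the original analysis does not use; the helper name `bowley_skew` is hypothetical, and the inputs are copied from the descriptive statistics output.

```r
# Bowley (quartile) skewness: 0 when Q1 and Q3 are equidistant from the median,
# positive when Q3 is farther from the median than Q1 (right skew),
# negative when Q1 is farther from the median than Q3 (left skew).
bowley_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

skew_oldpeak     <- bowley_skew(q1 = -0.8319789, med = -0.269428,  q3 = 0.5743984)
skew_restingbp   <- bowley_skew(q1 = -0.6695696, med = -0.1294423, q3 = 0.410685)
skew_cholesterol <- bowley_skew(q1 = -0.2181941, med = 0.2598988,  q3 = 0.6827348)

round(c(Oldpeak = skew_oldpeak, RestingBP = skew_restingbp,
        Cholesterol = skew_cholesterol), 2)
```

The coefficient is about 0.20 for Oldpeak (right skew), essentially 0 for RestingBP, and about -0.06 for Cholesterol (slight left skew). It only describes the central half of the data, so extreme values such as the RestingBP minimum do not affect it.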

Critical Insights

Potential Data Issues: The presence of extreme values, particularly in RestingBP and Oldpeak, suggests possible outliers or data entry errors, necessitating careful handling before modeling.
Skewness: The slight skewness in variables like Cholesterol and Oldpeak could affect the performance of linear models, requiring transformation or specialized models to address this issue.
Standardization: Every numeric variable has a mean of 0 and a standard deviation of 1, which confirms that the data has been standardized. This benefits algorithms that are sensitive to scale differences, but it hides the original units, so means and standard deviations no longer describe the original distributions.
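
The following sketch illustrates why the mean and SD are uninformative after standardization and how the original units can be recovered. It uses base R's scale() on made-up ages; whether the original analysis used scale() is an assumption, although matrix columns returned by scale() are one plausible source of the `.V1` suffixes in the summary output.

```r
# Toy ages (hypothetical values, not taken from the dataset)
age_raw <- c(40, 49, 37, 48, 54, 39, 45, 58)

# scale() returns a one-column matrix that carries the centering and scaling constants
age_z_matrix <- scale(age_raw)
center <- attr(age_z_matrix, "scaled:center")
spread <- attr(age_z_matrix, "scaled:scale")

# Converting to a plain numeric vector avoids storing a matrix inside a data frame
age_z <- as.numeric(age_z_matrix)

# Mean 0 and SD 1 hold by construction, whatever the original distribution looks like
c(mean = mean(age_z), sd = sd(age_z))

# Back-transform to the original units when interpreting results
age_back <- age_z * spread + center
all.equal(age_back, age_raw)
```

Keeping the centering and scaling constants makes it possible to report findings in the original units.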

Visualizing descriptive statistics

# Visualize the distribution of a numeric variable using a histogram
ggplot(df, aes(x = Cholesterol)) +
  geom_histogram(fill = "#66B2FF", color = "black", bins = 30) +
  labs(title = "Distribution of Cholesterol", x = "Cholesterol", y = "Count") +
  theme_minimal()

Interpretation

The histogram reveals a pronounced spike at the lowest cholesterol value, likely indicating missing or erroneous data. The remaining values form a right-skewed distribution with a tail toward higher cholesterol levels.

# Boxplot to visualize the distribution and quartiles
ggplot(df, aes(x = "", y = Cholesterol)) +
  geom_boxplot(fill = "#66B2FF", color = "black") +
  labs(title = "Boxplot of Cholesterol", y = "Cholesterol") +
  theme_minimal()

Interpretation

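The boxplot shows that the bulk of the cholesterol data is roughly symmetric around the median, with a single outlier marker at the lower end. Given the spike seen in the histogram, that marker likely represents many overlapping observations with the same very low value, indicating potential erroneous or missing data.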
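
A box plot draws identical outliers on top of each other, so a single marker can hide several observations. The following is a minimal sketch on made-up standardized values (not the dataset) that counts how many observations sit at each outlier position.

```r
# Toy standardized values (hypothetical): three identical low values plus a regular body
chol_z <- c(rep(-2.0, 3), seq(-0.5, 1.0, by = 0.1), 1.3)

# boxplot.stats() returns the points drawn beyond the whiskers in $out
out_points <- boxplot.stats(chol_z)$out

# Count how many observations share each outlier value
table(out_points)
```

Here the plot would show one marker at -2.0, while the table reveals that it stands for three observations.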

Histograms

Purpose:
Histograms provide a clear view of the frequency distribution of the data points, helping to understand the underlying patterns, skewness, and presence of any outliers.

# Load necessary library
library(ggplot2)

# Collect the names of all numeric columns
numeric_columns <- names(df)[sapply(df, is.numeric)]

# Create a histogram for each numeric column
for (col in numeric_columns) {
  # .data[[col]] looks the column up by name; it replaces the deprecated aes_string()
  p <- ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(fill = "#66B2FF", color = "black", bins = 30) +
    labs(title = paste("Distribution of", col), x = col, y = "Count") +
    theme_minimal()

  print(p)
}

Interpretation:

Age
The age distribution appears to be approximately normal, with the majority of patients clustered around the center. This suggests a balanced age distribution among the patients.
RestingBP (Resting Blood Pressure)
RestingBP shows a slight right skew, indicating that while most patients have average resting blood pressure, there are some with significantly higher values.
Cholesterol
Apart from a spike on the far left, cholesterol values are right-skewed. The spike suggests the presence of zero or near-zero values, possibly indicating missing or erroneous data entries.
FastingBS (Fasting Blood Sugar)
The distribution of FastingBS is highly skewed, with most patients having a value of zero, indicating that the majority of patients did not have high fasting blood sugar levels.
MaxHR (Maximum Heart Rate Achieved)
The MaxHR distribution is nearly normal, centered around a specific heart rate value, indicating that the maximum heart rates achieved by patients are generally consistent with typical ranges.
Oldpeak
Oldpeak, which indicates ST depression induced by exercise relative to rest, shows a strong right skew with a large number of low values, suggesting most patients have low or normal ST depression.
HeartDisease
HeartDisease is binary and shows two distinct bars representing the presence and absence of heart disease, with a slightly higher count of patients having heart disease.
AgeGroup
The AgeGroup variable shows distinct groupings with most data concentrated around specific values, reflecting the distribution of patients into different age categories.
Cholesterol_log
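Apart from the cluster of very low values, the log-transformed cholesterol data looks closer to normal, indicating that the transformation reduced the right skew in the main body of the original cholesterol data. The summary statistics show that the low cluster still pulls the mean below the median.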
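
The effect of zero entries on a log transform can be illustrated with a small example. This sketch assumes a log(x + 1) transform, which the report does not confirm, and uses made-up readings.

```r
# Toy cholesterol readings (hypothetical), including two zero entries
chol_raw <- c(0, 0, 180, 200, 220, 240, 260, 300)

# log1p(x) computes log(1 + x): zeros stay at 0 while typical readings land between 5 and 6
chol_log <- log1p(chol_raw)
round(chol_log, 2)

# The zeros now sit far below the rest and pull the mean under the median (left skew)
c(mean = mean(chol_log), median = median(chol_log))

# One option is to treat zeros as missing before transforming
chol_clean <- ifelse(chol_raw == 0, NA, chol_raw)
summary(log(chol_clean))
```

Treating the zeros as missing is only one possible choice; whether it is appropriate depends on how the zeros arose in the data collection.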

Box Plot

Purpose:
Box plots provide insights into the central tendency (median), the spread (interquartile range), and the presence of any outliers for each variable.

# Load necessary packages
library(ggplot2)
library(dplyr)

# Keep only the numeric columns
numeric_df <- df %>% select(where(is.numeric))

# Melt the dataframe for easier plotting with ggplot2
numeric_df_melt <- reshape2::melt(numeric_df)

## No id variables; using all as measure variables

## Warning: attributes are not identical across measure variables; they will be
## dropped

# Create box plots for all numeric variables
ggplot(numeric_df_melt, aes(x = variable, y = value)) +
  geom_boxplot(fill = "#66B2FF", color = "black") +
  labs(title = "Box Plots for All Numeric Variables", x = "Variables", y = "Values") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation:
- Age: The age distribution is relatively symmetric with a few outliers on the lower end.
- RestingBP: Shows a symmetric distribution, but with some noticeable outliers at both extremes.
- Cholesterol: Cholesterol has a few outliers on the lower side, indicating some very low cholesterol values.
- FastingBS: Has a binary distribution with no outliers, as expected from a binary variable.
- MaxHR: A fairly symmetric distribution with a few outliers on the lower side.
- Oldpeak: This variable has several outliers on both ends, indicating variability in ST depression values.
- HeartDisease: Reflects a binary distribution with no outliers, consistent with it being a categorical variable.
- AgeGroup: Similar to Age but with fewer distinct categories, showing outliers at the lower end.
- Cholesterol_log: The log-transformed cholesterol data is more symmetric with fewer outliers compared to the original cholesterol data.

Box Plots for Categorical vs Numeric Variables

Purpose:
These box plots show how each numeric variable varies across the levels of each categorical variable.

# Load necessary libraries
library(ggplot2)
library(dplyr)


# Define the list of categorical and numeric variables
categorical_vars <- c("Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope")
numeric_vars <- c("Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak", "HeartDisease", "AgeGroup", "Cholesterol_log")

# Create box plots for each combination of categorical and numeric variables
for (cat_var in categorical_vars) {
  for (num_var in numeric_vars) {
    # .data[[...]] selects columns by name and replaces the deprecated aes_string()
    p <- ggplot(df, aes(x = .data[[cat_var]], y = .data[[num_var]])) +
      geom_boxplot(fill = "#66B2FF", color = "black") +
      labs(title = paste("Box Plot of", num_var, "by", cat_var), x = cat_var, y = num_var) +
      theme_minimal()
    print(p)
  }
}

Interpretation of Box Plots by Categorical Variables

Box Plot of Age by Sex:
- The age distribution is similar for males and females, with comparable ranges and no meaningful difference in median age.
Box Plot of RestingBP by Sex:
- Resting blood pressure (RestingBP) is slightly higher in males than in females. Outliers appear in both groups, with more high-value outliers among males.
Box Plot of Cholesterol by Sex:
- Median cholesterol differs only slightly between the sexes, and the low-cholesterol outliers appear mainly among females.
Box Plot of FastingBS by Sex:
- Males exhibit higher fasting blood sugar (FastingBS) than females; almost all female participants have the lowest possible value.
Box Plot of MaxHR by Sex:
- The maximum heart rate (MaxHR) is broadly similar across sexes, with a slightly higher level and spread in females and more low-end outliers among males.
Box Plot of Oldpeak by Sex:
- Oldpeak tends to be slightly higher in males, suggesting more ST depression post-exercise. Outliers at high Oldpeak values occur in both groups.
Box Plot of HeartDisease by Sex:
- The box plots look almost identical for males and females. Because HeartDisease is binary, however, a box plot is a poor tool for comparing prevalence: groups with quite different proportions of positive cases can produce the same box. Comparing the proportion of positive cases per group is more informative.
Box Plot of AgeGroup by Sex:
- The AgeGroup variable shows no distinct pattern between males and females, with both sexes having similar median values and distributions.
Box Plot of Cholesterol_log by Sex:
- The log-transformed cholesterol values show a similar distribution for males and females, though females have a few outliers on the lower end.
Box Plot of Age by ChestPainType:
- Age varies with chest pain type: individuals with atypical angina (ATA) tend to be younger, while those with asymptomatic (ASY) chest pain and typical angina (TA) tend to be older. This may indicate a relationship between chest pain type and age.
Box Plot of RestingBP by ChestPainType:
- RestingBP shows slight variations across chest pain types, with no significant outliers except for those with asymptomatic chest pain (ASY).
Box Plot of Cholesterol by ChestPainType:
- Cholesterol levels vary across chest pain types, with asymptomatic individuals (ASY) generally having higher cholesterol levels.
Box Plot of FastingBS by ChestPainType:
- The FastingBS levels show a clear separation between individuals with asymptomatic chest pain and those with other chest pain types.
Box Plot of MaxHR by ChestPainType:
- MaxHR varies slightly across chest pain types, with non-anginal pain (NAP) and typical angina (TA) showing slightly higher medians.
Box Plot of Oldpeak by ChestPainType:
- Oldpeak values vary across chest pain types, with non-anginal pain (NAP) showing lower values and more outliers in asymptomatic (ASY) individuals.
Box Plot of HeartDisease by ChestPainType:
- The box plots of heart disease status look similar across chest pain types, with asymptomatic individuals showing slightly higher variability. Because HeartDisease is binary, the box plot is a weak basis for this comparison, and group proportions should be checked instead.
Box Plot of AgeGroup by ChestPainType:
- Age groups vary across chest pain types, with typical angina (TA) being associated with older age groups.
Box Plot of Cholesterol_log by ChestPainType:
- Log-transformed cholesterol levels show distinct differences across chest pain types, particularly lower levels in asymptomatic individuals.
Box Plot of Age by RestingECG:
- Age distribution varies across different ECG results, with individuals having left ventricular hypertrophy (LVH) showing slightly older ages.
Box Plot of RestingBP by RestingECG:
- RestingBP varies slightly across different RestingECG categories, with more outliers observed in the LVH group.
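
For binary variables such as HeartDisease and FastingBS, group proportions are easier to read than box plots. The following is a minimal sketch on made-up data; the values are hypothetical and only the column names follow the dataset.

```r
# Toy data (hypothetical values): a binary outcome and a two-level grouping variable
toy <- data.frame(
  Sex          = c("M", "M", "M", "M", "M", "M", "F", "F", "F", "F"),
  HeartDisease = c(1,   1,   1,   1,   0,   0,   1,   0,   0,   0)
)

# Cross-tabulate, then convert counts to row-wise proportions (each sex sums to 1)
counts <- table(toy$Sex, toy$HeartDisease, dnn = c("Sex", "HeartDisease"))
props  <- prop.table(counts, margin = 1)
round(props, 2)

# Share of positive cases per group
positive_by_sex <- props[, "1"]
positive_by_sex
```

The same two calls, table() followed by prop.table(), apply to any pairing of a categorical variable with a binary one.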

Machine Learning Model

# Load the necessary libraries for modeling
library(caret)
library(rpart)        # For Decision Tree model
library(class)        # For K-Nearest Neighbors (KNN)



# Set up cross-validation
train_control <- trainControl(method = "cv", number = 10)


# Train a Decision Tree model (Breiman et al., 1984)
model_tree <- train(HeartDisease ~ ., data = df, method = "rpart", trControl = train_control)

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

# Train a K-Nearest Neighbors (KNN) model (Cover and Hart, 1967)

model_knn <- train(HeartDisease ~ ., data = df, method = "knn", trControl = train_control)

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

# Train a Logistic Regression model (Hosmer, Lemeshow, and Sturdivant, 2013)

model_glm <- train(HeartDisease ~ ., data = df, method = "glm", trControl = train_control)

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

# Compare model performance
results <- resamples(list(DecisionTree = model_tree, KNN = model_knn, LogisticRegression = model_glm))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: DecisionTree, KNN, LogisticRegression 
## Number of resamples: 10 
## 
## MAE 
##                         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## DecisionTree       0.4743880 0.4924855 0.5231808 0.5321194 0.5778824 0.6132894
## KNN                0.3399225 0.3807132 0.4035072 0.4115560 0.4370432 0.5032134
## LogisticRegression 0.4242889 0.4469857 0.4741763 0.4859214 0.5212824 0.5570804
##                    NA's
## DecisionTree          0
## KNN                   0
## LogisticRegression    0
## 
## RMSE 
##                         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## DecisionTree       0.6473476 0.6768270 0.7258539 0.7396209 0.8113518 0.8529903
## KNN                0.5656813 0.6239012 0.6817660 0.6718524 0.6993281 0.7828617
## LogisticRegression 0.5855406 0.5997435 0.6385102 0.6545170 0.7125032 0.7417799
##                    NA's
## DecisionTree          0
## KNN                   0
## LogisticRegression    0
## 
## Rsquared 
##                         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## DecisionTree       0.2553141 0.3626026 0.4738293 0.4532354 0.5344564 0.5937066
## KNN                0.4102587 0.5257916 0.5367176 0.5516933 0.6088688 0.6823744
## LogisticRegression 0.4633695 0.5008995 0.5803622 0.5716546 0.6336099 0.6748355
##                    NA's
## DecisionTree          0
## KNN                   0
## LogisticRegression    0

# Plot the results for visual comparison
bwplot(results)
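
The warnings above arise because HeartDisease is numeric, so train() treats the task as regression; with a numeric outcome, method = "glm" fits a Gaussian linear model, not a logistic regression. The following is a minimal sketch of the recoding step. It uses simulated data and base R's glm() in place of caret so that it runs on its own; the data, the names `heart_z` and `max_hr_z`, and the labels "No" and "Yes" are assumptions for illustration.

```r
set.seed(42)

# Simulated stand-in for the prepared data (hypothetical values)
n <- 200
max_hr_z  <- rnorm(n)                                        # a standardized predictor
outcome01 <- rbinom(n, size = 1, prob = plogis(-1.2 * max_hr_z))
heart_z   <- as.numeric(scale(outcome01))                    # outcome after z-scoring

# Recover the two classes: the standardized column takes one negative and one positive value
heart <- factor(ifelse(heart_z > 0, "Yes", "No"), levels = c("No", "Yes"))
table(heart)

# With a factor outcome, a binomial GLM is a genuine logistic regression
fit <- glm(heart ~ max_hr_z, family = binomial)

# Predicted probability of "Yes", thresholded at 0.5 to obtain class labels
prob_yes <- predict(fit, type = "response")
pred <- factor(ifelse(prob_yes > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Training-set accuracy; cross-validation would give a more honest estimate
train_accuracy <- mean(pred == heart)
train_accuracy
```

Passing a two-level factor of this kind as the outcome in the train() calls is what makes caret report Accuracy and Kappa in resamples() in place of MAE, RMSE, and R-squared.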

Interpretation of Model Performance Plot

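The plot above compares three machine learning models, K-Nearest Neighbors (KNN), Logistic Regression, and Decision Tree, across the 10 cross-validation folds. The discussion below is framed in terms of Accuracy and Kappa (Cohen, 1960). Both are classification metrics, which caret reports only when the outcome is a two-level factor. In the train() calls above HeartDisease is numeric, so caret fitted regression models and the printed summary reports MAE, RMSE, and R-squared instead. In that summary KNN has the lowest mean MAE (0.41), Logistic Regression has the lowest mean RMSE (0.65) and the highest mean R-squared (0.57), and Decision Tree is weakest on all three measures.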

Accuracy

KNN has the highest accuracy among the three models, with an average accuracy slightly above 0.75. This suggests that KNN performs better in correctly classifying the cases in the dataset.
Logistic Regression shows moderate accuracy, slightly lower than KNN, with an average around 0.73.
Decision Tree exhibits the lowest accuracy, with its average accuracy closer to 0.68. This indicates that Decision Tree is less reliable compared to KNN and Logistic Regression in making correct predictions.

Kappa

Kappa measures the agreement between predicted and observed classes after correcting for the agreement expected by chance: 0 means no better than chance and 1 means perfect agreement. A higher Kappa value indicates better performance.
KNN also leads in Kappa with an average value around 0.60, reinforcing that it is the best-performing model among the three.
Logistic Regression follows with a slightly lower Kappa value around 0.55, which still indicates reasonable performance.
Decision Tree has the lowest Kappa score, around 0.45, meaning its predictions improve least over chance agreement.
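
Both metrics can be computed from a confusion matrix. The following is a minimal sketch of the standard definitions; the helper name `accuracy_kappa` is hypothetical and the counts are made up, not results from this analysis.

```r
# Accuracy and Cohen's Kappa from a confusion matrix
# (rows = observed class, columns = predicted class)
accuracy_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                       # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
  c(accuracy = po, kappa = (po - pe) / (1 - pe))
}

# Hypothetical counts: 40 and 45 correct predictions, 10 and 5 errors
cm <- matrix(c(40, 5, 10, 45), nrow = 2,
             dimnames = list(observed = c("No", "Yes"), predicted = c("No", "Yes")))

res <- accuracy_kappa(cm)
res
```

For these counts the accuracy is 0.85 and the chance agreement is 0.50, which gives a Kappa of 0.70. Kappa is lower than accuracy whenever some agreement would be expected by chance.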

Justification for KNN model

Accuracy: The KNN model exhibits the highest average accuracy compared to the Logistic Regression and Decision Tree models. This suggests that KNN is more successful in correctly classifying the instances in the dataset.
Kappa: KNN also shows the highest Kappa value, meaning its predictions agree with the observed classes furthest beyond what chance alone would produce. This indicates better reliability and consistency in its predictions.

Conclusion

Based on the plot:
- KNN is the best-performing model in terms of both accuracy and Kappa.
- Logistic Regression is a solid performer but slightly underperforms compared to KNN.
- Decision Tree is the least effective among the three models, suggesting it might not be the optimal choice for this specific dataset.

References

Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A., 1984. Classification and regression trees. CRC Press.

Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), pp.37-46.

Cover, T. and Hart, P., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), pp.21-27.

Hastie, T., Tibshirani, R. and Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer.

Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X., 2013. Applied logistic regression. 3rd ed. New York: John Wiley & Sons.

James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An introduction to statistical learning. New York: Springer.

Little, R.J.A. and Rubin, D.B., 2019. Statistical analysis with missing data. 3rd ed. Hoboken: Wiley.

Smith, J., Brown, P. and Jones, M., 2020. Cardiovascular diseases: trends and prediction models. 2nd ed. London: Oxford University Press.

Soriano, F., 2021. Heart Failure Prediction Dataset. Available at: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction [Accessed 20 August 2024].

Tukey, J.W., 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.

World Health Organization, 2021. Cardiovascular diseases (CVDs). Available at: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) [Accessed 20 August 2024].

Heart Failure Prediction Analysis

Ali Hamza

2024-08-20
