The following analyses are based on the Heart Disease Prediction Dataset. The target variable indicates the presence of heart disease (0 = no, 1 = yes).
The data set also includes several predictor variables (features).
We will go step by step through the analysis of this data using different ML algorithms. The steps include:
Load the required libraries and the dataset.
Take a look at the data structure. Rename some of the columns to an easier-to-understand format.
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : int 1 1 1 1 0 0 1 1 1 1 ...
## $ cp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : int 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : int 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : int 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : int 3 3 3 3 2 2 1 3 3 2 ...
## $ target : int 0 0 0 0 0 1 0 0 0 0 ...
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
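The renaming step can be sketched as follows. This is a minimal illustration, not the original code: the file name in the comment is hypothetical, the small data frame is built from the first three rows of the `str()` output above, and the new names are taken from the `summary()` output shown later.

```r
# In the real analysis the data would be read from disk, for example:
# heart_data <- read.csv("heart.csv")   # hypothetical file name
# Stand-in: first three rows copied from the str() output above
heart_data <- data.frame(
  age = c(52L, 53L, 70L), sex = c(1L, 1L, 1L), cp = c(0L, 0L, 0L),
  trestbps = c(125L, 140L, 145L), chol = c(212L, 203L, 174L),
  fbs = c(0L, 1L, 0L), restecg = c(1L, 0L, 1L),
  thalach = c(168L, 155L, 125L), exang = c(0L, 1L, 1L),
  oldpeak = c(1, 3.1, 2.6), slope = c(2L, 0L, 0L), ca = c(2L, 0L, 0L),
  thal = c(3L, 3L, 3L), target = c(0L, 0L, 0L)
)

# Lookup table: original column name -> more readable name
rename_map <- c(
  age = "Age", sex = "Sex", cp = "ChestPainType", trestbps = "RestingBP",
  chol = "Cholesterol", fbs = "FastingBS", restecg = "RestingECG",
  thalach = "MaxHR", exang = "ExerciseAngina", oldpeak = "Oldpeak",
  slope = "ST_Slope", ca = "NumberofVessels", thal = "ThalScan",
  target = "HeartDisease"
)

# Rename by lookup so the result does not depend on column order
names(heart_data) <- unname(rename_map[names(heart_data)])
names(heart_data)
```

Renaming through a lookup table rather than by position keeps the mapping explicit and makes a misspelled source column show up as `NA` instead of silently shifting names.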
Check for missing values and run summary statistics.
# Check for missing values
sum(is.na(heart_data)) # Returns the total number of missing values
## [1] 0
summary(heart_data)
## Age Sex ChestPainType RestingBP
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.43 Mean :0.6956 Mean :0.9424 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. :126 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0
## Median :240 Median :0.0000 Median :1.0000 Median :152.0
## Mean :246 Mean :0.1493 Mean :0.5298 Mean :149.1
## 3rd Qu.:275 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564 Max. :1.0000 Max. :2.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope NumberofVessels
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3366 Mean :1.072 Mean :1.385 Mean :0.7541
## 3rd Qu.:1.0000 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## ThalScan HeartDisease
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.324 Mean :0.5132
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
We can get a more detailed summary of the data using the dfSummary() function from the summarytools package.
# Detailed summary using summarytools
summarytools::dfSummary(heart_data)
## Data Frame Summary
## heart_data
## Dimensions: 1025 x 14
## Duplicates: 723
##
## ------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ----------------- -------------------------- --------------------- --------------------- ---------- ---------
## 1 Age Mean (sd) : 54.4 (9.1) 41 distinct values : 1025 0
## [integer] min < med < max: . : : . (100.0%) (0.0%)
## 29 < 56 < 77 . : : : : :
## IQR (CV) : 13 (0.2) : : : : : :
## : : : : : : : :
##
## 2 Sex Min : 0 0 : 312 (30.4%) IIIIII 1025 0
## [integer] Mean : 0.7 1 : 713 (69.6%) IIIIIIIIIIIII (100.0%) (0.0%)
## Max : 1
##
## 3 ChestPainType Mean (sd) : 0.9 (1) 0 : 497 (48.5%) IIIIIIIII 1025 0
## [integer] min < med < max: 1 : 167 (16.3%) III (100.0%) (0.0%)
## 0 < 1 < 3 2 : 284 (27.7%) IIIII
## IQR (CV) : 2 (1.1) 3 : 77 ( 7.5%) I
##
## 4 RestingBP Mean (sd) : 131.6 (17.5) 49 distinct values . : 1025 0
## [integer] min < med < max: : : : (100.0%) (0.0%)
## 94 < 130 < 200 : : : :
## IQR (CV) : 20 (0.1) : : : : :
## . : : : : : : .
##
## 5 Cholesterol Mean (sd) : 246 (51.6) 152 distinct values : 1025 0
## [integer] min < med < max: . : (100.0%) (0.0%)
## 126 < 240 < 564 : : :
## IQR (CV) : 64 (0.2) : : : .
## . : : : :
##
## 6 FastingBS Min : 0 0 : 872 (85.1%) IIIIIIIIIIIIIIIII 1025 0
## [integer] Mean : 0.1 1 : 153 (14.9%) II (100.0%) (0.0%)
## Max : 1
##
## 7 RestingECG Mean (sd) : 0.5 (0.5) 0 : 497 (48.5%) IIIIIIIII 1025 0
## [integer] min < med < max: 1 : 513 (50.0%) IIIIIIIIII (100.0%) (0.0%)
## 0 < 1 < 2 2 : 15 ( 1.5%)
## IQR (CV) : 1 (1)
##
## 8 MaxHR Mean (sd) : 149.1 (23) 91 distinct values : 1025 0
## [integer] min < med < max: . : : (100.0%) (0.0%)
## 71 < 152 < 202 . : : :
## IQR (CV) : 34 (0.2) . : : : : .
## . : : : : : : : .
##
## 9 ExerciseAngina Min : 0 0 : 680 (66.3%) IIIIIIIIIIIII 1025 0
## [integer] Mean : 0.3 1 : 345 (33.7%) IIIIII (100.0%) (0.0%)
## Max : 1
##
## 10 Oldpeak Mean (sd) : 1.1 (1.2) 40 distinct values : 1025 0
## [numeric] min < med < max: : (100.0%) (0.0%)
## 0 < 0.8 < 6.2 :
## IQR (CV) : 1.8 (1.1) : : .
## : : : : : .
##
## 11 ST_Slope Mean (sd) : 1.4 (0.6) 0 : 74 ( 7.2%) I 1025 0
## [integer] min < med < max: 1 : 482 (47.0%) IIIIIIIII (100.0%) (0.0%)
## 0 < 1 < 2 2 : 469 (45.8%) IIIIIIIII
## IQR (CV) : 1 (0.4)
##
## 12 NumberofVessels Mean (sd) : 0.8 (1) 0 : 578 (56.4%) IIIIIIIIIII 1025 0
## [integer] min < med < max: 1 : 226 (22.0%) IIII (100.0%) (0.0%)
## 0 < 0 < 4 2 : 134 (13.1%) II
## IQR (CV) : 1 (1.4) 3 : 69 ( 6.7%) I
## 4 : 18 ( 1.8%)
##
## 13 ThalScan Mean (sd) : 2.3 (0.6) 0 : 7 ( 0.7%) 1025 0
## [integer] min < med < max: 1 : 64 ( 6.2%) I (100.0%) (0.0%)
## 0 < 2 < 3 2 : 544 (53.1%) IIIIIIIIII
## IQR (CV) : 1 (0.3) 3 : 410 (40.0%) IIIIIIII
##
## 14 HeartDisease Min : 0 0 : 499 (48.7%) IIIIIIIII 1025 0
## [integer] Mean : 0.5 1 : 526 (51.3%) IIIIIIIIII (100.0%) (0.0%)
## Max : 1
## ------------------------------------------------------------------------------------------------------------------
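The summary above reports 723 duplicate rows out of 1025. A quick way to inspect duplicates in base R is sketched below; the toy data frame is a hypothetical stand-in, and whether to drop duplicates in the real data is a judgment call this sketch does not settle.

```r
# Hypothetical toy data: rows 3 and 4 repeat row 1
toy <- data.frame(
  Age          = c(52, 53, 52, 52),
  Sex          = c(1, 1, 1, 1),
  HeartDisease = c(0, 0, 0, 0)
)

# duplicated() flags every row that repeats an earlier row
n_dup <- sum(duplicated(toy))
n_dup

# unique() keeps the first occurrence of each distinct row
toy_unique <- unique(toy)
nrow(toy_unique)
```

Applied to `heart_data`, the same two calls would show how many distinct patient records remain once repeats are removed.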
Some example data visualizations are shown below. When there are only a few predictor variables, it is a good idea to visualize each of them against the outcome variable to discern any trends.
library(ggcorrplot) # Correlation heatmaps built on ggplot2
library(dplyr)      # Provides %>% and select_if()
# Keep only the numeric columns
numeric_vars <- heart_data %>%
  select_if(is.numeric)
# Compute correlation matrix
corr_matrix <- cor(numeric_vars, use = "complete.obs")
# Ensure row and column names of the matrix are set correctly
rownames(corr_matrix) <- colnames(numeric_vars)
colnames(corr_matrix) <- colnames(numeric_vars)
# Plot the correlation matrix as a heatmap, adjusting label size and rotation
ggcorrplot(corr_matrix,
           lab = TRUE,      # Show correlation values inside the plot
           lab_size = 3,    # Adjust label size to prevent overlap
           type = "lower",  # Only show the lower triangle
           tl.cex = 10,     # Adjust the size of axis labels
           title = "Correlation Matrix"
) +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # Make x-axis labels vertical
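To look at a single predictor against the outcome, a grouped summary plus a boxplot is one simple option. The sketch below uses base R graphics for brevity (an assumption, since the plots above use ggplot2) and only the first ten `thalach` and `target` values from the `str()` output as stand-in data.

```r
# Stand-in data: first ten rows shown in the str() output
toy <- data.frame(
  MaxHR        = c(168, 155, 125, 161, 106, 122, 140, 145, 144, 116),
  HeartDisease = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
)

# Median maximum heart rate within each outcome group
med_by_group <- tapply(toy$MaxHR, toy$HeartDisease, median)
med_by_group

# Side-by-side boxplots of the predictor for each outcome class
boxplot(MaxHR ~ HeartDisease, data = toy,
        xlab = "HeartDisease (0 = no, 1 = yes)",
        ylab = "MaxHR",
        main = "MaxHR by outcome (toy subset)")
```

With only ten rows the picture is not meaningful; the same two calls on the full `heart_data` would show whether the distributions differ between classes.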
## 2. Cleaning/Processing the data
# Split the dataset into training and validation sets for running models
# Set a seed for reproducibility
set.seed(123)
# Setting a seed ensures that the random operations (such as splitting data) will yield the same results
# every time the code is run. This is important for reproducibility of the results.
# Create a data partition (80% for training and 20% for validation)
trainIndex <- createDataPartition(heart_data$HeartDisease, # Target variable (response)
p = 0.8, # 80% of the data will be used for training
list = FALSE, # Return the indices as a matrix, not a list
times = 1) # Single split
# `createDataPartition()` splits the dataset based on the target variable.
# When the target is a factor, it samples within each class, so both sets keep a similar class distribution.
# For a numeric target (such as the 0/1 integer used here) it samples within percentile-based groups instead,
# so the share of 'HeartDisease' cases is only approximately preserved.
# Split the data into training and validation sets
heart_train <- heart_data[trainIndex, ] # Use the indices from `trainIndex` to subset training data
heart_validation <- heart_data[-trainIndex, ] # Use the rest of the data for the validation set
# Check the dimensions of the resulting datasets
dim(heart_train) # Check the number of rows and columns in the training set
## [1] 820 14
dim(heart_validation) # Check the number of rows and columns in the validation set
## [1] 205 14
# Check the distribution of HeartDisease in both datasets
table(heart_train$HeartDisease) # Distribution of HeartDisease in the training set
##
## 0 1
## 404 416
table(heart_validation$HeartDisease) # Distribution of HeartDisease in the validation set
##
## 0 1
## 95 110
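The class balance of the two sets can be compared directly by converting the counts above to proportions. This sketch simply reuses the printed counts; on the real objects, `prop.table(table(heart_train$HeartDisease))` would give the same result.

```r
# Counts copied from the table() output above
train_counts <- c("0" = 404, "1" = 416)
valid_counts <- c("0" = 95,  "1" = 110)

# Share of each class in the training and validation sets
train_prop <- prop.table(train_counts)
valid_prop <- prop.table(valid_counts)

round(rbind(train = train_prop, validation = valid_prop), 4)
```

The training set has roughly 50.7% positive cases and the validation set roughly 53.7%, so the split is close but not exactly balanced.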
We are going to fit a series of machine learning models to the data. We will start with a simple logistic regression model.
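The model-fitting code is not shown in this report, but the printed output (10-fold cross-validation, a "No"/"Yes" factor outcome, `Call: NULL`) is consistent with fitting through `caret::train()`. The following is a minimal sketch of that workflow on synthetic data; the data, the object names, and the exact arguments are assumptions, not the original code, and the output that follows in this report comes from the real heart data.

```r
library(caret)

set.seed(123)

# Synthetic stand-in data (NOT the heart dataset): two numeric predictors
# and a binary outcome with levels "No" and "Yes"
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
outcome <- factor(ifelse(x1 - x2 + rnorm(n) > 0, "Yes", "No"),
                  levels = c("No", "Yes"))
toy <- data.frame(x1 = x1, x2 = x2, Outcome = outcome)

# 80/20 split, stratified on the factor outcome
idx       <- createDataPartition(toy$Outcome, p = 0.8, list = FALSE)
toy_train <- toy[idx, ]
toy_valid <- toy[-idx, ]

# 10-fold cross-validation on the training set
ctrl <- trainControl(method = "cv", number = 10)

# Logistic regression: method = "glm" with a binomial family
fit_glm <- train(Outcome ~ ., data = toy_train,
                 method = "glm", family = "binomial",
                 trControl = ctrl)

# Predict on the held-out set and tabulate against the truth
pred <- predict(fit_glm, newdata = toy_valid)
cm   <- confusionMatrix(data = pred, reference = toy_valid$Outcome)
cm$table
```

The other models in this report would follow the same pattern with a different `method` value, for example `"rpart"`, `"rf"`, `"knn"`, `"svmRadial"`, `"gbm"`, `"nb"`, or `"nnet"`.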
## Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 1 2 ...
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.502797 1.498938 2.337 0.019447 *
## Age -0.008691 0.013971 -0.622 0.533902
## Sex -1.819684 0.283816 -6.412 1.44e-10 ***
## ChestPainType 0.786254 0.107476 7.316 2.56e-13 ***
## RestingBP -0.017666 0.006033 -2.928 0.003407 **
## Cholesterol -0.005499 0.002193 -2.508 0.012133 *
## FastingBS 0.060584 0.308917 0.196 0.844518
## RestingECG 0.410917 0.208930 1.967 0.049210 *
## MaxHR 0.023392 0.006061 3.859 0.000114 ***
## ExerciseAngina -0.982808 0.246911 -3.980 6.88e-05 ***
## Oldpeak -0.512094 0.127768 -4.008 6.12e-05 ***
## ST_Slope 0.558013 0.204138 2.734 0.006266 **
## NumberofVessels -0.693303 0.107595 -6.444 1.17e-10 ***
## ThalScan -0.864847 0.173339 -4.989 6.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.59 on 819 degrees of freedom
## Residual deviance: 595.25 on 806 degrees of freedom
## AIC: 623.25
##
## Number of Fisher Scoring iterations: 6
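The coefficients above are on the log-odds scale for the "Yes" class (the second factor level). Exponentiating them gives odds ratios, which are often easier to read. The sketch below copies three estimates from the table; because the categorical codes were entered as integers, each ratio refers to a one-unit step in the coded value.

```r
# Estimates copied from the coefficient table above (log-odds scale)
coefs <- c(Sex = -1.819684, ChestPainType = 0.786254, Oldpeak = -0.512094)

# exp() converts a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

An odds ratio below 1 (as for `Sex` and `Oldpeak`) means the odds of the "Yes" class fall as the predictor increases by one unit, holding the other predictors fixed; a ratio above 1 (as for `ChestPainType`) means the odds rise.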
## Prediction Reference Freq
## 1 No No 78
## 2 Yes No 17
## 3 No Yes 8
## 4 Yes Yes 102
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.8780488 0.7531905 0.8252579 0.9195025 0.5365854 9.998134e-26
## McnemarPValue
## 1 0.1095986
## Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall
## 1 0.8210526 0.9272727 0.9069767 0.8571429 0.9069767 0.8210526
## F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
## 1 0.8618785 0.4634146 0.3804878 0.4195122 0.8741627
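The headline statistics above can be reproduced by hand from the four confusion-table counts. The sketch below does this for the logistic regression model, treating "No" as the positive class as the later confusion matrix printouts indicate; the variable names are illustrative.

```r
# Counts copied from the logistic regression confusion table above
tp <- 78    # predicted No,  reference No
fn <- 17    # predicted Yes, reference No
fp <- 8     # predicted No,  reference Yes
tn <- 102   # predicted Yes, reference Yes
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of reference "No" cases predicted "No"
specificity <- tn / (tn + fp)   # share of reference "Yes" cases predicted "Yes"

# Cohen's kappa: agreement beyond what the marginal totals alone would give
p_expected  <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 7)
```

Because "No" is the positive class here, sensitivity describes how well the model identifies patients without heart disease; it is worth keeping this in mind when comparing the sensitivity and specificity columns across models.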
## CART
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 739, 738, 737, 737, 739, 738, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.002475248 0.8719848 0.7439462
## 0.003712871 0.8682958 0.7365290
## 0.004950495 0.8658861 0.7316801
## 0.009900990 0.8587017 0.7172563
## 0.011138614 0.8537935 0.7074324
## 0.017326733 0.8415966 0.6828411
## 0.019801980 0.8269459 0.6532174
## 0.024752475 0.8148239 0.6284327
## 0.056930693 0.7768538 0.5527527
## 0.502475248 0.5658894 0.1220406
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002475248.
## Prediction Reference Freq
## 1 No No 85
## 2 Yes No 10
## 3 No Yes 20
## 4 Yes Yes 90
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.8536585 0.7078385 0.7977251 0.8990351 0.5365854 5.249217e-22
## McnemarPValue
## 1 0.1003482
## Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall
## 1 0.8947368 0.8181818 0.8095238 0.9 0.8095238 0.8947368
## F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
## 1 0.85 0.4634146 0.4146341 0.5121951 0.8564593
## Random Forest
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 738, 738, 738, 738, 738, 739, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9890685 0.9781222
## 4 0.9890685 0.9781222
## 7 0.9890685 0.9781222
## 10 0.9878636 0.9757143
## 13 0.9878636 0.9757143
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Prediction Reference Freq
## 1 No No 95
## 2 Yes No 0
## 3 No Yes 0
## 4 Yes Yes 110
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 1 1 0.9821664 1 0.5365854 3.766682e-56
## McnemarPValue
## 1 NaN
## Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1
## 1 1 1 1 1 1 1 1
## Prevalence Detection Rate Detection Prevalence Balanced Accuracy
## 1 0.4634146 0.4634146 0.4634146 1
## k-Nearest Neighbors
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 738, 738, 738, 738, 738, 739, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6890650 0.3787409
## 7 0.7245950 0.4490015
## 9 0.7026288 0.4057456
## 11 0.6903731 0.3809917
## 13 0.7049789 0.4099462
## 15 0.7050380 0.4097678
## 17 0.7111955 0.4219552
## 19 0.7013498 0.4022075
## 21 0.7050380 0.4095982
## 23 0.7075366 0.4146636
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 69 20
## Yes 26 90
##
## Accuracy : 0.7756
## 95% CI : (0.7123, 0.8308)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : 1.108e-12
##
## Kappa : 0.5469
##
## Mcnemar's Test P-Value : 0.461
##
## Sensitivity : 0.7263
## Specificity : 0.8182
## Pos Pred Value : 0.7753
## Neg Pred Value : 0.7759
## Prevalence : 0.4634
## Detection Rate : 0.3366
## Detection Prevalence : 0.4341
## Balanced Accuracy : 0.7722
##
## 'Positive' Class : No
##
## Support Vector Machines with Radial Basis Function Kernel
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 738, 738, 738, 738, 738, 739, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8537695 0.7066352
## 0.50 0.8756774 0.7509335
## 1.00 0.8792908 0.7583513
## 2.00 0.9157894 0.8314137
## 4.00 0.9402403 0.8804481
##
## Tuning parameter 'sigma' was held constant at a value of 0.05215777
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05215777 and C = 4.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 90 9
## Yes 5 101
##
## Accuracy : 0.9317
## 95% CI : (0.8881, 0.9622)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8631
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.9474
## Specificity : 0.9182
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.9528
## Prevalence : 0.4634
## Detection Rate : 0.4390
## Detection Prevalence : 0.4829
## Balanced Accuracy : 0.9328
##
## 'Positive' Class : No
##
## Stochastic Gradient Boosting
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 738, 737, 739, 739, 739, 738, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8635184 0.7266483
## 1 100 0.8646931 0.7290636
## 1 150 0.8622838 0.7242327
## 1 200 0.8793441 0.7584956
## 1 250 0.8867659 0.7733511
## 2 50 0.8659126 0.7315107
## 2 100 0.8891457 0.7781621
## 2 150 0.9001375 0.8002071
## 2 200 0.9159776 0.8319518
## 2 250 0.9317735 0.8635564
## 3 50 0.8842376 0.7682127
## 3 100 0.9013126 0.8024969
## 3 150 0.9329042 0.8658082
## 3 200 0.9500236 0.9000769
## 3 250 0.9720225 0.9440566
## 4 50 0.8927003 0.7852327
## 4 100 0.9390322 0.8780984
## 4 150 0.9646599 0.9293525
## 4 200 0.9792959 0.9585872
## 4 250 0.9878185 0.9756437
## 5 50 0.9122595 0.8244606
## 5 100 0.9573719 0.9147962
## 5 150 0.9768866 0.9537948
## 5 200 0.9866434 0.9732984
## 5 250 0.9914627 0.9829341
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 95 0
## Yes 0 110
##
## Accuracy : 1
## 95% CI : (0.9822, 1)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4634
## Detection Rate : 0.4634
## Detection Prevalence : 0.4634
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : No
##
## Naive Bayes
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 738, 738, 739, 738, 737, 737, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.8158979 0.6314466
## TRUE 0.8401860 0.6800554
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 81 7
## Yes 14 103
##
## Accuracy : 0.8976
## 95% CI : (0.8477, 0.9355)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.793
##
## Mcnemar's Test P-Value : 0.1904
##
## Sensitivity : 0.8526
## Specificity : 0.9364
## Pos Pred Value : 0.9205
## Neg Pred Value : 0.8803
## Prevalence : 0.4634
## Detection Rate : 0.3951
## Detection Prevalence : 0.4293
## Balanced Accuracy : 0.8945
##
## 'Positive' Class : No
##
We will use scaling for the neural network model
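The scaling code itself is not shown in this report. One common approach, sketched here in base R as an assumption rather than the original method, is to center and scale each predictor using statistics from the training set only and then apply the same transformation to the validation set. The values are the first ten `trestbps` and `chol` entries from the `str()` output, split arbitrarily into eight training and two validation rows.

```r
# Stand-in data copied from the str() output near the top of the report
train_x <- data.frame(
  RestingBP   = c(125, 140, 145, 148, 138, 100, 114, 160),
  Cholesterol = c(212, 203, 174, 203, 294, 248, 318, 289)
)
valid_x <- data.frame(
  RestingBP   = c(120, 122),
  Cholesterol = c(249, 286)
)

# Center and scale the training predictors (mean 0, sd 1 per column)
train_scaled <- scale(train_x)

# Reuse the training means and standard deviations for the validation set,
# so no information from the validation data leaks into the preprocessing
centers <- attr(train_scaled, "scaled:center")
sds     <- attr(train_scaled, "scaled:scale")
valid_scaled <- scale(valid_x, center = centers, scale = sds)

round(colMeans(train_scaled), 10)
valid_scaled
```

With caret, the same effect is usually requested by passing `preProcess = c("center", "scale")` to `train()`, which estimates the statistics within each resampling fold.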
## Neural Network
##
## 820 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 737, 738, 738, 739, 739, 739, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.5946378 0.1810603
## 1 1e-04 0.5981053 0.1860035
## 1 1e-03 0.6716980 0.3396106
## 1 1e-02 0.8117400 0.6218613
## 1 1e-01 0.8525248 0.7047625
## 3 0e+00 0.6052800 0.1993786
## 3 1e-04 0.5753827 0.1392475
## 3 1e-03 0.7316012 0.4629285
## 3 1e-02 0.8466061 0.6929022
## 3 1e-01 0.8610639 0.7217552
## 5 0e+00 0.6254597 0.2423645
## 5 1e-04 0.7934444 0.5851843
## 5 1e-03 0.8054215 0.6103869
## 5 1e-02 0.8499815 0.6994886
## 5 1e-01 0.8512763 0.7024581
## 7 0e+00 0.7135594 0.4221843
## 7 1e-04 0.7738992 0.5453271
## 7 1e-03 0.8338880 0.6677880
## 7 1e-02 0.8525549 0.7048655
## 7 1e-01 0.8671310 0.7337950
## 9 0e+00 0.7194356 0.4337219
## 9 1e-04 0.6845550 0.3614728
## 9 1e-03 0.8258880 0.6526302
## 9 1e-02 0.8452360 0.6901949
## 9 1e-01 0.8524209 0.7047164
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
We will now use each model to make predictions on the validation dataset and then compare the models.
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 0.8780488 0.8210526 0.9272727
## 2 KNN 0.7707317 0.7263158 0.8090909
## 3 Decision Tree 0.8536585 0.8947368 0.8181818
## 4 Random Forest 1.0000000 1.0000000 1.0000000
## 5 SVM 0.9317073 0.9473684 0.9181818
## 6 GBM 1.0000000 1.0000000 1.0000000
## 7 Naive Bayes 0.8975610 0.8526316 0.9363636
## 8 Neural Network 0.8975610 0.8526316 0.9363636
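We see that Random Forest and Gradient Boosting were the best among
the models tested. Both models correctly predicted all the validation
cases. Both techniques are tree-based ensemble models.
These perfect scores should be read with caution: the data summary
reported 723 duplicate rows out of 1025, so many validation rows
probably also appear in the training set, and flexible models can
effectively memorize them.
Here is a comparison between the two modeling algorithms.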
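Since the data summary reported 723 duplicate rows, it can be worth checking how many validation rows also occur in the training set before trusting a perfect score. The sketch below shows one way to do this in base R; the helper `row_key()` and the toy data frames are hypothetical.

```r
# Collapse each row into a single string so rows can be compared across sets
row_key <- function(df) do.call(paste, c(df, sep = "|"))

# Hypothetical toy split: the first validation row also appears in training
toy_train <- data.frame(Age = c(52, 53, 70, 61),
                        Sex = c(1, 1, 1, 1),
                        HeartDisease = c(0, 0, 0, 0))
toy_valid <- data.frame(Age = c(53, 62),
                        Sex = c(1, 0),
                        HeartDisease = c(0, 0))

# TRUE for validation rows that are exact copies of some training row
leaked <- row_key(toy_valid) %in% row_key(toy_train)

sum(leaked)    # number of validation rows also present in training
mean(leaked)   # share of the validation set affected
```

Run on `heart_validation` and `heart_train`, a high share would suggest that the validation accuracy partly reflects rows the model has already seen.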
| Aspect | Random Forest (RF) | Gradient Boosting Machine (GBM) |
|---|---|---|
| Ensemble Strategy | Bagging (Parallel tree building) | Boosting (Sequential tree building) |
| Training Speed | Faster (due to parallelization) | Slower (sequential, harder to parallelize) |
| Prediction Speed | Faster | Slower (many trees evaluated sequentially) |
| Overfitting | Less prone to overfitting | Prone to overfitting without careful tuning |
| Hyperparameter Tuning | Works well with default parameters | Requires careful tuning of learning rate, tree depth, etc. |
| Accuracy | Good, especially for less complex tasks | Often higher accuracy, especially for complex tasks |
| Feature Importance | Provides feature importance, easier to interpret | Provides feature importance, but harder to interpret |
| Handling of Outliers | Robust to outliers | More sensitive to outliers |
| Parallelization | Easily parallelizable | Not easily parallelizable |
| Interpretability | Hard to interpret individual trees but offers global feature importance | Harder to interpret individual predictions |
The above exercise shows how different machine learning models can be applied to datasets in healthcare and development. Many of these algorithms can be used for both classification and regression tasks. The models presented above are not exhaustive, but they give an idea of some of the possibilities. Other techniques, such as further ensemble approaches and deep neural networks, were not explored. However, if a simpler, faster model delivers the required accuracy and precision, there is no need to develop a more complex one.