Introduction

The following analyses are based on the Heart Disease Prediction Dataset. The target variable is the presence of heart disease (0: no; 1: yes).

There are several predictor variables (features) included in the data set.

  1. Age: age of the patient in years
  2. Sex: sex of the patient (1: male; 0: female)
  3. cp: Chest pain type (0: typical angina, 1: atypical angina, 2: non-anginal pain; 3: asymptomatic)
  4. trestbps: resting blood pressure (systolic, in mm Hg)
  5. chol: total cholesterol (mg/dL)
  6. fbs: fasting blood sugar > 120 mg/dL (1: true; 0: false)
  7. restecg: resting electrocardiograph results (0: normal; 1: ST-T wave abnormality; 2: probable or definite left ventricular hypertrophy)
  8. thalach: maximum exercise-induced heart rate
  9. exang: exercise-induced angina (1: yes; 0: no)
  10. oldpeak: ST depression (magnitude) induced by exercise
  11. slope: the slope of the ST segment
  12. ca: the number of major vessels (0-3) colored by fluoroscopy; this copy of the data also contains the value 4
  13. thal: thallium heart scan result (coded 0-3)

We will work through the analysis of this data step by step, using several ML algorithms. The steps are:

  1. Exploratory data analysis and preprocessing
  2. Cleaning/processing data
  3. Modeling
  4. Model selection and Improvement

1. Exploratory data analysis and preprocessing

Load the required libraries and the dataset
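
The rendered report does not show the loading code. The following is a minimal sketch, assuming the data is stored locally as a CSV file. The file name `heart.csv` and the helper `load_heart_data()` are hypothetical, and only base R is used here.

```r
# Hypothetical file name; adjust to wherever the dataset is stored.
data_path <- "heart.csv"

# Read the CSV into a data frame and fail early if it is empty.
load_heart_data <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot(is.data.frame(df), nrow(df) > 0)
  df
}

# Only attempt the load when the file is actually present.
if (file.exists(data_path)) {
  heart_data <- load_heart_data(data_path)
  str(heart_data)  # inspect column types and the first few values
}
```

The packages used later in the report (for example summarytools, ggcorrplot and caret) would be attached with `library()` at this point as well.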

Take a look at the data structure. Rename some of the columns to an easier-to-understand format.

## 'data.frame':    1025 obs. of  14 variables:
##  $ age     : int  52 53 70 61 62 58 58 55 46 54 ...
##  $ sex     : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ cp      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...
##  $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...
##  $ fbs     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ restecg : int  1 0 1 1 1 0 2 0 0 0 ...
##  $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
##  $ slope   : int  2 0 0 2 1 1 0 1 2 1 ...
##  $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...
##  $ thal    : int  3 3 3 3 2 2 1 3 3 2 ...
##  $ target  : int  0 0 0 0 0 1 0 0 0 0 ...
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
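
The renaming step itself is not shown in the rendered output. The sketch below shows one way to do it in base R, using the original names printed above and the new names that appear in the later summaries. The helper `rename_heart_columns()` is a hypothetical name, not part of the original analysis.

```r
# Mapping from the original column names to the more readable ones.
new_names <- c(
  age = "Age", sex = "Sex", cp = "ChestPainType", trestbps = "RestingBP",
  chol = "Cholesterol", fbs = "FastingBS", restecg = "RestingECG",
  thalach = "MaxHR", exang = "ExerciseAngina", oldpeak = "Oldpeak",
  slope = "ST_Slope", ca = "NumberofVessels", thal = "ThalScan",
  target = "HeartDisease"
)

# Rename only the columns that appear in the mapping; leave others untouched.
rename_heart_columns <- function(df) {
  hits <- names(df) %in% names(new_names)
  names(df)[hits] <- unname(new_names[names(df)[hits]])
  df
}

# Small illustration with a toy data frame.
toy <- data.frame(age = 52, cp = 0, target = 0, other = 1)
names(rename_heart_columns(toy))
```

Applied to the full data, this would be `heart_data <- rename_heart_columns(heart_data)`.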

Check for missing values. Run summary statistics

# Check for missing values
sum(is.na(heart_data))  # Returns the total number of missing values
## [1] 0
summary(heart_data)
##       Age             Sex         ChestPainType      RestingBP    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##   Cholesterol    FastingBS        RestingECG         MaxHR      
##  Min.   :126   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0  
##  Median :240   Median :0.0000   Median :1.0000   Median :152.0  
##  Mean   :246   Mean   :0.1493   Mean   :0.5298   Mean   :149.1  
##  3rd Qu.:275   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##  ExerciseAngina      Oldpeak         ST_Slope     NumberofVessels 
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541  
##  3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##     ThalScan      HeartDisease   
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.324   Mean   :0.5132  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

We can get a more detailed summary of the data using the dfSummary() function from the summarytools package.

# Detailed summary using summarytools
summarytools::dfSummary(heart_data)
## Data Frame Summary  
## heart_data  
## Dimensions: 1025 x 14  
## Duplicates: 723  
## 
## ------------------------------------------------------------------------------------------------------------------
## No   Variable          Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
## ---- ----------------- -------------------------- --------------------- --------------------- ---------- ---------
## 1    Age               Mean (sd) : 54.4 (9.1)     41 distinct values                :         1025       0        
##      [integer]         min < med < max:                                         . : : .       (100.0%)   (0.0%)   
##                        29 < 56 < 77                                         . : : : : :                           
##                        IQR (CV) : 13 (0.2)                                  : : : : : :                           
##                                                                           : : : : : : : :                         
## 
## 2    Sex               Min  : 0                   0 : 312 (30.4%)       IIIIII                1025       0        
##      [integer]         Mean : 0.7                 1 : 713 (69.6%)       IIIIIIIIIIIII         (100.0%)   (0.0%)   
##                        Max  : 1                                                                                   
## 
## 3    ChestPainType     Mean (sd) : 0.9 (1)        0 : 497 (48.5%)       IIIIIIIII             1025       0        
##      [integer]         min < med < max:           1 : 167 (16.3%)       III                   (100.0%)   (0.0%)   
##                        0 < 1 < 3                  2 : 284 (27.7%)       IIIII                                     
##                        IQR (CV) : 2 (1.1)         3 :  77 ( 7.5%)       I                                         
## 
## 4    RestingBP         Mean (sd) : 131.6 (17.5)   49 distinct values        . :               1025       0        
##      [integer]         min < med < max:                                     : : :             (100.0%)   (0.0%)   
##                        94 < 130 < 200                                     : : : :                                 
##                        IQR (CV) : 20 (0.1)                                : : : : :                               
##                                                                         . : : : : : : .                           
## 
## 5    Cholesterol       Mean (sd) : 246 (51.6)     152 distinct values       :                 1025       0        
##      [integer]         min < med < max:                                   . :                 (100.0%)   (0.0%)   
##                        126 < 240 < 564                                    : : :                                   
##                        IQR (CV) : 64 (0.2)                                : : : .                                 
##                                                                         . : : : :                                 
## 
## 6    FastingBS         Min  : 0                   0 : 872 (85.1%)       IIIIIIIIIIIIIIIII     1025       0        
##      [integer]         Mean : 0.1                 1 : 153 (14.9%)       II                    (100.0%)   (0.0%)   
##                        Max  : 1                                                                                   
## 
## 7    RestingECG        Mean (sd) : 0.5 (0.5)      0 : 497 (48.5%)       IIIIIIIII             1025       0        
##      [integer]         min < med < max:           1 : 513 (50.0%)       IIIIIIIIII            (100.0%)   (0.0%)   
##                        0 < 1 < 2                  2 :  15 ( 1.5%)                                                 
##                        IQR (CV) : 1 (1)                                                                           
## 
## 8    MaxHR             Mean (sd) : 149.1 (23)     91 distinct values                :         1025       0        
##      [integer]         min < med < max:                                           . : :       (100.0%)   (0.0%)   
##                        71 < 152 < 202                                           . : : :                           
##                        IQR (CV) : 34 (0.2)                                    . : : : : .                         
##                                                                           . : : : : : : : .                       
## 
## 9    ExerciseAngina    Min  : 0                   0 : 680 (66.3%)       IIIIIIIIIIIII         1025       0        
##      [integer]         Mean : 0.3                 1 : 345 (33.7%)       IIIIII                (100.0%)   (0.0%)   
##                        Max  : 1                                                                                   
## 
## 10   Oldpeak           Mean (sd) : 1.1 (1.2)      40 distinct values    :                     1025       0        
##      [numeric]         min < med < max:                                 :                     (100.0%)   (0.0%)   
##                        0 < 0.8 < 6.2                                    :                                         
##                        IQR (CV) : 1.8 (1.1)                             : : .                                     
##                                                                         : : : : : .                               
## 
## 11   ST_Slope          Mean (sd) : 1.4 (0.6)      0 :  74 ( 7.2%)       I                     1025       0        
##      [integer]         min < med < max:           1 : 482 (47.0%)       IIIIIIIII             (100.0%)   (0.0%)   
##                        0 < 1 < 2                  2 : 469 (45.8%)       IIIIIIIII                                 
##                        IQR (CV) : 1 (0.4)                                                                         
## 
## 12   NumberofVessels   Mean (sd) : 0.8 (1)        0 : 578 (56.4%)       IIIIIIIIIII           1025       0        
##      [integer]         min < med < max:           1 : 226 (22.0%)       IIII                  (100.0%)   (0.0%)   
##                        0 < 0 < 4                  2 : 134 (13.1%)       II                                        
##                        IQR (CV) : 1 (1.4)         3 :  69 ( 6.7%)       I                                         
##                                                   4 :  18 ( 1.8%)                                                 
## 
## 13   ThalScan          Mean (sd) : 2.3 (0.6)      0 :   7 ( 0.7%)                             1025       0        
##      [integer]         min < med < max:           1 :  64 ( 6.2%)       I                     (100.0%)   (0.0%)   
##                        0 < 2 < 3                  2 : 544 (53.1%)       IIIIIIIIII                                
##                        IQR (CV) : 1 (0.3)         3 : 410 (40.0%)       IIIIIIII                                  
## 
## 14   HeartDisease      Min  : 0                   0 : 499 (48.7%)       IIIIIIIII             1025       0        
##      [integer]         Mean : 0.5                 1 : 526 (51.3%)       IIIIIIIIII            (100.0%)   (0.0%)   
##                        Max  : 1                                                                                   
## ------------------------------------------------------------------------------------------------------------------
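
The summary above reports 723 duplicate rows out of 1025. Duplicates matter for the later train/validation split, because copies of the same row can land on both sides of the split. Below is a minimal base-R sketch for counting and dropping exact duplicates. The helper names and the toy data are illustrative and not part of the original analysis.

```r
# Number of rows that repeat an earlier row exactly.
count_duplicates <- function(df) sum(duplicated(df))

# Keep the first occurrence of every distinct row.
drop_duplicates <- function(df) df[!duplicated(df), , drop = FALSE]

# Toy example: two distinct patients recorded five times in total.
toy <- data.frame(Age = c(52, 52, 60, 60, 60),
                  HeartDisease = c(0, 0, 1, 1, 1))
n_dup <- count_duplicates(toy)
toy_unique <- drop_duplicates(toy)
n_dup
nrow(toy_unique)
```

Whether to drop duplicates before modeling is a design choice. Keeping them reproduces the numbers in this report, while removing them gives a stricter estimate of how the models generalize.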

Data Visualization

Some example data visualizations are shown below. When there are only a few predictor variables, it is a good idea to plot each of them against the outcome variable to discern any trends.
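
One simple way to look at a categorical predictor against the outcome is the share of positive cases within each category. Here is a minimal base-R sketch with toy data. The helper `outcome_rate_by()` and the toy values are illustrative and not taken from the dataset.

```r
# Share of rows with outcome == 1 within each level of a predictor.
outcome_rate_by <- function(df, predictor, outcome = "HeartDisease") {
  tapply(df[[outcome]], df[[predictor]], mean)
}

# Toy data standing in for the real data frame.
toy <- data.frame(ChestPainType = c(0, 0, 1, 1, 2, 2),
                  HeartDisease  = c(0, 0, 0, 1, 1, 1))

rates <- outcome_rate_by(toy, "ChestPainType")

# Bar chart of the outcome rate per category.
barplot(rates, ylim = c(0, 1),
        xlab = "ChestPainType",
        ylab = "Share with HeartDisease = 1")
```

With the real data, the same call would be `outcome_rate_by(heart_data, "ChestPainType")`, assuming the renamed columns.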

Plotting the correlation heatmap

library(dplyr)       # provides %>% and select()
library(ggplot2)     # provides theme_minimal(), theme() and element_text()
library(ggcorrplot)

# Keep only the numeric columns
numeric_vars <- heart_data %>%
  select(where(is.numeric))

# Compute the correlation matrix (row and column names are carried over from the data)
corr_matrix <- cor(numeric_vars, use = "complete.obs")

# Plot the correlation heatmap with adjustments for label size and rotation
ggcorrplot(corr_matrix,
           lab = TRUE,            # Show correlation coefficients inside the plot
           lab_size = 3,          # Adjust label size to prevent overlap
           type = "lower",        # Only show the lower triangle
           tl.cex = 10,           # Adjust the size of axis labels
           title = "Correlation Matrix"
) +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))  # Make x-axis labels vertical

2. Cleaning/Processing the data

Split the dataset into training and validation sets for running the models

# Set a seed for reproducibility
set.seed(123)  
# Setting a seed ensures that the random operations (such as splitting data) will yield the same results 
# every time the code is run. This is important for reproducibility of the results.

# Create a data partition (80% for training and 20% for validation)
trainIndex <- createDataPartition(heart_data$HeartDisease,  # Target variable (response)
                                  p = 0.8,                 # 80% of the data will be used for training
                                  list = FALSE,            # Return the indices as a matrix, not a list
                                  times = 1)               # Single split

# `createDataPartition()` function splits the dataset based on the target variable.
# It ensures that the training and validation sets have a similar distribution of the target classes
# (i.e., both will have roughly the same percentage of 'HeartDisease' cases as the original dataset).

# Split the data into training and validation sets
heart_train <- heart_data[trainIndex, ]    # Use the indices from `trainIndex` to subset training data
heart_validation <- heart_data[-trainIndex, ] # Use the rest of the data for the validation set

# Check the dimensions of the resulting datasets
dim(heart_train)      # Check the number of rows and columns in the training set
## [1] 820  14
dim(heart_validation) # Check the number of rows and columns in the validation set
## [1] 205  14
# Check the distribution of HeartDisease in both datasets
table(heart_train$HeartDisease)      # Distribution of HeartDisease in the training set
## 
##   0   1 
## 404 416
table(heart_validation$HeartDisease) # Distribution of HeartDisease in the validation set
## 
##   0   1 
##  95 110

3. Modeling

We are going to fit a series of machine learning models to the data, starting with a simple logistic regression model.
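
The fitting code is not shown in the rendered output. The printed summary below (`Call: NULL`) suggests the model was fitted through a wrapper such as caret. The sketch here instead uses base R `glm()` to show the same kind of model. It assumes the 0/1 target is first recoded as a factor with levels `No` and `Yes`, as the `Factor w/ 2 levels` line indicates. The helper names and the simulated data are illustrative only.

```r
# Recode a 0/1 outcome as a factor so that "Yes" is the modelled event.
as_yes_no <- function(x) factor(x, levels = c(0, 1), labels = c("No", "Yes"))

# Fit a logistic regression of the outcome on all other columns.
fit_logistic <- function(train_df, outcome = "HeartDisease") {
  train_df[[outcome]] <- as_yes_no(train_df[[outcome]])
  glm(as.formula(paste(outcome, "~ .")), data = train_df, family = binomial)
}

# Turn predicted probabilities of "Yes" into class labels at a given cutoff.
predict_class <- function(model, new_df, cutoff = 0.5) {
  p_yes <- predict(model, newdata = new_df, type = "response")
  factor(ifelse(p_yes >= cutoff, "Yes", "No"), levels = c("No", "Yes"))
}

# Simulated stand-in data (not the heart dataset): higher MaxHR raises the odds.
set.seed(123)
n <- 400
sim <- data.frame(MaxHR = rnorm(n, mean = 150, sd = 20))
sim$HeartDisease <- rbinom(n, 1, plogis((sim$MaxHR - 150) / 10))

model <- fit_logistic(sim)
pred  <- predict_class(model, sim)

# Confusion table of predictions against the reference labels.
table(Prediction = pred, Reference = as_yes_no(sim$HeartDisease))
```

With the real data, the model would be fitted on `heart_train` and the predictions made on `heart_validation`.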

##  Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 1 2 ...
## 
## Call:
## NULL
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.502797   1.498938   2.337 0.019447 *  
## Age             -0.008691   0.013971  -0.622 0.533902    
## Sex             -1.819684   0.283816  -6.412 1.44e-10 ***
## ChestPainType    0.786254   0.107476   7.316 2.56e-13 ***
## RestingBP       -0.017666   0.006033  -2.928 0.003407 ** 
## Cholesterol     -0.005499   0.002193  -2.508 0.012133 *  
## FastingBS        0.060584   0.308917   0.196 0.844518    
## RestingECG       0.410917   0.208930   1.967 0.049210 *  
## MaxHR            0.023392   0.006061   3.859 0.000114 ***
## ExerciseAngina  -0.982808   0.246911  -3.980 6.88e-05 ***
## Oldpeak         -0.512094   0.127768  -4.008 6.12e-05 ***
## ST_Slope         0.558013   0.204138   2.734 0.006266 ** 
## NumberofVessels -0.693303   0.107595  -6.444 1.17e-10 ***
## ThalScan        -0.864847   0.173339  -4.989 6.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1136.59  on 819  degrees of freedom
## Residual deviance:  595.25  on 806  degrees of freedom
## AIC: 623.25
## 
## Number of Fisher Scoring iterations: 6
##   Prediction Reference Freq
## 1         No        No   78
## 2        Yes        No   17
## 3         No       Yes    8
## 4        Yes       Yes  102
##    Accuracy     Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.8780488 0.7531905     0.8252579     0.9195025    0.5365854   9.998134e-26
##   McnemarPValue
## 1     0.1095986
##   Sensitivity Specificity Pos Pred Value Neg Pred Value Precision    Recall
## 1   0.8210526   0.9272727      0.9069767      0.8571429 0.9069767 0.8210526
##          F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
## 1 0.8618785  0.4634146      0.3804878            0.4195122         0.8741627
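
The statistics above can be recomputed by hand from the four confusion-table counts. The sketch below does this in base R. The reported sensitivity matches only if `No` is treated as the positive class, so sensitivity here is the share of patients without heart disease who were predicted as `No`.

```r
# Counts from the logistic regression confusion table above.
no_no   <- 78   # predicted No,  reference No
yes_no  <- 17   # predicted Yes, reference No
no_yes  <- 8    # predicted No,  reference Yes
yes_yes <- 102  # predicted Yes, reference Yes

total <- no_no + yes_no + no_yes + yes_yes

accuracy    <- (no_no + yes_yes) / total
sensitivity <- no_no / (no_no + yes_no)      # reference "No" correctly predicted
specificity <- yes_yes / (yes_yes + no_yes)  # reference "Yes" correctly predicted
pos_pred_value    <- no_no / (no_no + no_yes)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred_value = pos_pred_value,
        balanced_accuracy = balanced_accuracy), 4)
```

If detecting heart disease is the clinical priority, the column labelled specificity (102 of 110 `Yes` cases found) is the one to watch, or the positive class can be switched to `Yes` when computing the metrics.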

Decision Tree Model
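
The outputs in this and the following model sections look like printed caret `train` objects with 10-fold cross-validation, but the fitting code is not shown. The sketch below is one way such models could be fitted. The method strings are caret's names for the model types printed in each section. The `tuneLength` values are guesses based on the size of the printed tuning grids, and `model_specs` and `fit_with_cv()` are hypothetical names. Each method needs its backing package installed (rpart, randomForest, kernlab, gbm, klaR, nnet).

```r
# One entry per model section; tuneLength values are assumptions.
model_specs <- list(
  "Decision Tree"  = list(method = "rpart",     tuneLength = 10),
  "Random Forest"  = list(method = "rf",        tuneLength = 5),
  "KNN"            = list(method = "knn",       tuneLength = 10),
  "SVM"            = list(method = "svmRadial", tuneLength = 5),
  "GBM"            = list(method = "gbm",       tuneLength = 5),
  "Naive Bayes"    = list(method = "nb",        tuneLength = 3),
  "Neural Network" = list(method = "nnet",      tuneLength = 5)
)

# Fit one model with 10-fold cross-validation.
# Assumes train_df$HeartDisease is a factor with levels "No" and "Yes".
fit_with_cv <- function(spec, train_df, seed = 123) {
  set.seed(seed)  # same seed so runs are repeatable
  caret::train(
    HeartDisease ~ .,
    data = train_df,
    method = spec$method,
    tuneLength = spec$tuneLength,
    trControl = caret::trainControl(method = "cv", number = 10)
  )
}

# Example usage (requires caret and the real training data):
# models <- lapply(model_specs, fit_with_cv, train_df = heart_train)
# pred   <- predict(models[["Decision Tree"]], newdata = heart_validation)
# caret::confusionMatrix(pred, heart_validation$HeartDisease)
```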

## CART 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 739, 738, 737, 737, 739, 738, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.002475248  0.8719848  0.7439462
##   0.003712871  0.8682958  0.7365290
##   0.004950495  0.8658861  0.7316801
##   0.009900990  0.8587017  0.7172563
##   0.011138614  0.8537935  0.7074324
##   0.017326733  0.8415966  0.6828411
##   0.019801980  0.8269459  0.6532174
##   0.024752475  0.8148239  0.6284327
##   0.056930693  0.7768538  0.5527527
##   0.502475248  0.5658894  0.1220406
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002475248.
##   Prediction Reference Freq
## 1         No        No   85
## 2        Yes        No   10
## 3         No       Yes   20
## 4        Yes       Yes   90
##    Accuracy     Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.8536585 0.7078385     0.7977251     0.8990351    0.5365854   5.249217e-22
##   McnemarPValue
## 1     0.1003482
##   Sensitivity Specificity Pos Pred Value Neg Pred Value Precision    Recall
## 1   0.8947368   0.8181818      0.8095238            0.9 0.8095238 0.8947368
##     F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
## 1 0.85  0.4634146      0.4146341            0.5121951         0.8564593

Random Forest Model

## Random Forest 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 738, 738, 738, 738, 738, 739, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9890685  0.9781222
##    4    0.9890685  0.9781222
##    7    0.9890685  0.9781222
##   10    0.9878636  0.9757143
##   13    0.9878636  0.9757143
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
##   Prediction Reference Freq
## 1         No        No   95
## 2        Yes        No    0
## 3         No       Yes    0
## 4        Yes       Yes  110
##   Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1        1     1     0.9821664             1    0.5365854   3.766682e-56
##   McnemarPValue
## 1           NaN
##   Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1
## 1           1           1              1              1         1      1  1
##   Prevalence Detection Rate Detection Prevalence Balanced Accuracy
## 1  0.4634146      0.4634146            0.4634146                 1

K-nearest neighbors (KNN) model

## k-Nearest Neighbors 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 738, 738, 738, 738, 738, 739, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.6890650  0.3787409
##    7  0.7245950  0.4490015
##    9  0.7026288  0.4057456
##   11  0.6903731  0.3809917
##   13  0.7049789  0.4099462
##   15  0.7050380  0.4097678
##   17  0.7111955  0.4219552
##   19  0.7013498  0.4022075
##   21  0.7050380  0.4095982
##   23  0.7075366  0.4146636
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  69  20
##        Yes 26  90
##                                           
##                Accuracy : 0.7756          
##                  95% CI : (0.7123, 0.8308)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 1.108e-12       
##                                           
##                   Kappa : 0.5469          
##                                           
##  Mcnemar's Test P-Value : 0.461           
##                                           
##             Sensitivity : 0.7263          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7753          
##          Neg Pred Value : 0.7759          
##              Prevalence : 0.4634          
##          Detection Rate : 0.3366          
##    Detection Prevalence : 0.4341          
##       Balanced Accuracy : 0.7722          
##                                           
##        'Positive' Class : No              
## 

Support Vector Machines (SVM) Model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 738, 738, 738, 738, 738, 739, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8537695  0.7066352
##   0.50  0.8756774  0.7509335
##   1.00  0.8792908  0.7583513
##   2.00  0.9157894  0.8314137
##   4.00  0.9402403  0.8804481
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05215777
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05215777 and C = 4.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   90   9
##        Yes   5 101
##                                           
##                Accuracy : 0.9317          
##                  95% CI : (0.8881, 0.9622)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8631          
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.9182          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.9528          
##              Prevalence : 0.4634          
##          Detection Rate : 0.4390          
##    Detection Prevalence : 0.4829          
##       Balanced Accuracy : 0.9328          
##                                           
##        'Positive' Class : No              
## 

Gradient Boosting Machines

## Stochastic Gradient Boosting 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 738, 737, 739, 739, 739, 738, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8635184  0.7266483
##   1                  100      0.8646931  0.7290636
##   1                  150      0.8622838  0.7242327
##   1                  200      0.8793441  0.7584956
##   1                  250      0.8867659  0.7733511
##   2                   50      0.8659126  0.7315107
##   2                  100      0.8891457  0.7781621
##   2                  150      0.9001375  0.8002071
##   2                  200      0.9159776  0.8319518
##   2                  250      0.9317735  0.8635564
##   3                   50      0.8842376  0.7682127
##   3                  100      0.9013126  0.8024969
##   3                  150      0.9329042  0.8658082
##   3                  200      0.9500236  0.9000769
##   3                  250      0.9720225  0.9440566
##   4                   50      0.8927003  0.7852327
##   4                  100      0.9390322  0.8780984
##   4                  150      0.9646599  0.9293525
##   4                  200      0.9792959  0.9585872
##   4                  250      0.9878185  0.9756437
##   5                   50      0.9122595  0.8244606
##   5                  100      0.9573719  0.9147962
##   5                  150      0.9768866  0.9537948
##   5                  200      0.9866434  0.9732984
##   5                  250      0.9914627  0.9829341
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250, interaction.depth =
##  5, shrinkage = 0.1 and n.minobsinnode = 10.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   95   0
##        Yes   0 110
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9822, 1)
##     No Information Rate : 0.5366     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4634     
##          Detection Rate : 0.4634     
##    Detection Prevalence : 0.4634     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : No         
## 

Naive Bayes Model

## Naive Bayes 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 738, 738, 739, 738, 737, 737, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.8158979  0.6314466
##    TRUE      0.8401860  0.6800554
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   81   7
##        Yes  14 103
##                                           
##                Accuracy : 0.8976          
##                  95% CI : (0.8477, 0.9355)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.793           
##                                           
##  Mcnemar's Test P-Value : 0.1904          
##                                           
##             Sensitivity : 0.8526          
##             Specificity : 0.9364          
##          Pos Pred Value : 0.9205          
##          Neg Pred Value : 0.8803          
##              Prevalence : 0.4634          
##          Detection Rate : 0.3951          
##    Detection Prevalence : 0.4293          
##       Balanced Accuracy : 0.8945          
##                                           
##        'Positive' Class : No              
## 

Simple neural network

We will use scaling for the neural network model
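
The printed output below reports "No pre-processing", so any scaling would have been applied to the data before the model was fitted. The following base-R sketch standardizes columns using statistics from the training set only, so that no information from the validation set leaks into the scaling. The helper name and the toy data are hypothetical.

```r
# Standardize `cols` in new_df using means and SDs computed on train_df.
standardize_with_train <- function(train_df, new_df, cols) {
  centers <- colMeans(train_df[cols])
  spreads <- vapply(train_df[cols], sd, numeric(1))
  scaled  <- scale(as.matrix(new_df[cols]), center = centers, scale = spreads)
  new_df[cols] <- as.data.frame(scaled)
  new_df
}

# Toy data standing in for heart_train and heart_validation.
toy_train <- data.frame(RestingBP = c(110, 120, 130),
                        MaxHR = c(140, 150, 160),
                        HeartDisease = c(0, 1, 1))
toy_valid <- data.frame(RestingBP = 140, MaxHR = 150, HeartDisease = 0)

cols <- c("RestingBP", "MaxHR")
train_scaled <- standardize_with_train(toy_train, toy_train, cols)
valid_scaled <- standardize_with_train(toy_train, toy_valid, cols)
train_scaled
valid_scaled
```

If the models are fitted with caret, passing `preProcess = c("center", "scale")` to `train()` is an alternative that applies the same idea inside each resampling fold.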

## Neural Network 
## 
## 820 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 737, 738, 738, 739, 739, 739, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.5946378  0.1810603
##   1     1e-04  0.5981053  0.1860035
##   1     1e-03  0.6716980  0.3396106
##   1     1e-02  0.8117400  0.6218613
##   1     1e-01  0.8525248  0.7047625
##   3     0e+00  0.6052800  0.1993786
##   3     1e-04  0.5753827  0.1392475
##   3     1e-03  0.7316012  0.4629285
##   3     1e-02  0.8466061  0.6929022
##   3     1e-01  0.8610639  0.7217552
##   5     0e+00  0.6254597  0.2423645
##   5     1e-04  0.7934444  0.5851843
##   5     1e-03  0.8054215  0.6103869
##   5     1e-02  0.8499815  0.6994886
##   5     1e-01  0.8512763  0.7024581
##   7     0e+00  0.7135594  0.4221843
##   7     1e-04  0.7738992  0.5453271
##   7     1e-03  0.8338880  0.6677880
##   7     1e-02  0.8525549  0.7048655
##   7     1e-01  0.8671310  0.7337950
##   9     0e+00  0.7194356  0.4337219
##   9     1e-04  0.6845550  0.3614728
##   9     1e-03  0.8258880  0.6526302
##   9     1e-02  0.8452360  0.6901949
##   9     1e-01  0.8524209  0.7047164
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.

Running the models on the validation/test Dataset

We will run all models to make predictions on the validation dataset and then compare the models

##                 Model  Accuracy Sensitivity Specificity
## 1 Logistic Regression 0.8780488   0.8210526   0.9272727
## 2                 KNN 0.7707317   0.7263158   0.8090909
## 3       Decision Tree 0.8536585   0.8947368   0.8181818
## 4       Random Forest 1.0000000   1.0000000   1.0000000
## 5                 SVM 0.9317073   0.9473684   0.9181818
## 6                 GBM 1.0000000   1.0000000   1.0000000
## 7         Naive Bayes 0.8975610   0.8526316   0.9363636
## 8      Neural Network 0.8975610   0.8526316   0.9363636

Visualizing Model Accuracies
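
The plot itself is not included in the rendered text. The sketch below draws a horizontal bar chart of the validation accuracies from the comparison table above, using base R graphics. The plotting choices (ordering, margins, labels) are assumptions, not a reconstruction of the original figure.

```r
# Validation accuracies copied from the model comparison table.
results <- data.frame(
  Model = c("Logistic Regression", "KNN", "Decision Tree", "Random Forest",
            "SVM", "GBM", "Naive Bayes", "Neural Network"),
  Accuracy = c(0.8780488, 0.7707317, 0.8536585, 1.0000000,
               0.9317073, 1.0000000, 0.8975610, 0.8975610)
)

# Sort so the most accurate models end up at the top of the chart.
results <- results[order(results$Accuracy), ]

# Widen the left margin to make room for the model names.
old_par <- par(mar = c(4, 10, 2, 1))
barplot(results$Accuracy, names.arg = results$Model,
        horiz = TRUE, las = 1, xlim = c(0, 1),
        xlab = "Validation accuracy", main = "Model accuracies")
par(old_par)
```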

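We see that Random Forest and the Gradient Boosting Machine (GBM) performed best among the models tested: both classified every validation case correctly. Both are tree-based ensemble methods. These perfect scores should be read with caution, however. The data summary reported 723 duplicate rows out of 1025, so many validation rows probably also appear in the training set, which would inflate validation accuracy for models that can memorize the training data.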

Here is a comparison between the two modeling algorithms.

| Aspect | Random Forest (RF) | Gradient Boosting Machine (GBM) |
|---|---|---|
| Ensemble Strategy | Bagging (parallel tree building) | Boosting (sequential tree building) |
| Training Speed | Faster (due to parallelization) | Slower (sequential, harder to parallelize) |
| Prediction Speed | Faster | Slower (many trees evaluated sequentially) |
| Overfitting | Less prone to overfitting | Prone to overfitting without careful tuning |
| Hyperparameter Tuning | Works well with default parameters | Requires careful tuning of learning rate, tree depth, etc. |
| Accuracy | Good, especially for less complex tasks | Often higher accuracy, especially for complex tasks |
| Feature Importance | Provides feature importance, easier to interpret | Provides feature importance, but harder to interpret |
| Handling of Outliers | Robust to outliers | More sensitive to outliers |
| Parallelization | Easily parallelizable | Not easily parallelizable |
| Interpretability | Hard to interpret individual trees but offers global feature importance | Harder to interpret individual predictions |

Discussion and Conclusions

The above exercise shows how different machine learning models can be applied to datasets in healthcare and development. Many of these algorithms can be used for both classification and regression tasks. The models presented above are not exhaustive, but they give an idea of the possibilities. Other techniques, such as additional ensemble methods and deep neural networks, were not explored. However, if simpler and faster models deliver the required accuracy and precision, there is no need to develop more complex ones.