Comparing Multiple Models:

Objective:

The objective of this exercise is to predict if a patient has diabetes for the Pima Indians Diabetes Data set using multiple ML algorithms.

We will build and train following Machine Learning models during this analysis.
1. Random Forest
2. eXtreme Gradient Boosting Machine (XGBOOST)
3. K-Nearest Neighbours(KNN)
4. Logistic Regression (glm)
5. R part Classification and Regression Trees (rpart)

Data Set

The data set is comprised of 768 observations and 9 variables. It is available in the package mlbench. We will be using diabetes as our response/target variable.

Data Description for the 9 variables are as follows.

pregnant - Number of times pregnant
glucose - Plasma glucose concentration (glucose tolerance test)
pressure - Diastolic blood pressure (mm Hg)
triceps - Triceps skin fold thickness (mm)
insulin - 2-Hour serum insulin (mu U/ml)
mass - Body mass index (weight in kg/(height in m)^2)
pedigree - Diabetes pedigree function
age - Age (years)
diabetes - Class variable (test for diabetes)

More information on the dataset can be read by accessing the website: http://math.furman.edu/~dcs/courses/math47/R/library/mlbench/html/PimaIndiansDiabetes.html

Load Packages

we are going to use the CARET and mlbench R packages. mlbench package holds the PimaIndiansDiabetes dataset and we would be building models using the CARET package.

# libraries used

# library(caret) #ML Model buidling package
# library(tidyverse) #ggplot and dplyr
# library(MASS) #Modern Applied Statistics with S
# library(mlbench) #data sets from the UCI repository.
# library(summarytools)
# library(corrplot) #Correlation plot
# library(gridExtra) #Multiple plot in single grip space
# library(timeDate) 
# library(pROC) #ROC
# library(caTools) #AUC
# library(rpart.plot) #CART Decision Tree
# library(e1071) #imports graphics, grDevices, class, stats, methods, utils
# library(graphics) #fourfoldplot

## Split the Data into Training and Testing sets

Data split(70/30) using the function createDataPartition from the CARET package.

# Usually set.seed(123) for reproducibility. I am not doing it here due to software issue as I can not publish this document on its inclusion.
data(PimaIndiansDiabetes)
df <- PimaIndiansDiabetes
str(df)

'data.frame':   768 obs. of  9 variables:
 $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
 $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
 $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
 $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
 $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
 $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

head(df)

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg

#store rows for partition
partition <- caret::createDataPartition(y = df$diabetes, times = 1, p = 0.7, list = FALSE)

# create training data set
train_set <- df[partition,]

# create testing data set, subtracting the rows partition to get remaining 30% of the data
test_set <- df[-partition,]

str(train_set)

'data.frame':   538 obs. of  9 variables:
 $ pregnant: num  6 5 3 10 2 8 10 1 5 7 ...
 $ glucose : num  148 116 78 115 197 125 139 189 166 100 ...
 $ pressure: num  72 74 50 0 70 96 80 60 72 0 ...
 $ triceps : num  35 0 32 0 45 0 0 23 19 0 ...
 $ insulin : num  0 0 88 0 543 0 0 846 175 0 ...
 $ mass    : num  33.6 25.6 31 35.3 30.5 0 27.1 30.1 25.8 30 ...
 $ pedigree: num  0.627 0.201 0.248 0.134 0.158 ...
 $ age     : num  50 30 26 29 53 54 57 59 51 32 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 2 1 2 2 2 ...

str(test_set)

'data.frame':   230 obs. of  9 variables:
 $ pregnant: num  1 8 1 0 4 10 7 10 4 11 ...
 $ glucose : num  85 183 89 137 110 168 196 125 103 138 ...
 $ pressure: num  66 64 66 40 92 74 90 70 60 76 ...
 $ triceps : num  29 0 23 35 0 0 0 26 33 0 ...
 $ insulin : num  0 0 94 168 0 0 0 115 192 0 ...
 $ mass    : num  26.6 23.3 28.1 43.1 37.6 38 39.8 31.1 24 33.2 ...
 $ pedigree: num  0.351 0.672 0.167 2.288 0.191 ...
 $ age     : num  31 32 21 33 30 34 41 41 33 35 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 1 2 1 2 1 2 2 2 1 1 ...

Descriptive Statistics

summary(train_set)

    pregnant        glucose         pressure         triceps     
 Min.   : 0.00   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.00   1st Qu.:100.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.00   Median :117.0   Median : 72.00   Median :22.00  
 Mean   : 3.87   Mean   :120.7   Mean   : 69.17   Mean   :20.34  
 3rd Qu.: 6.00   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.00   Max.   :198.0   Max.   :122.00   Max.   :99.00  
    insulin            mass          pedigree           age       
 Min.   :  0.00   Min.   : 0.00   Min.   :0.0840   Min.   :21.00  
 1st Qu.:  0.00   1st Qu.:27.30   1st Qu.:0.2462   1st Qu.:24.00  
 Median : 14.50   Median :32.00   Median :0.3630   Median :29.00  
 Mean   : 80.34   Mean   :31.94   Mean   :0.4658   Mean   :33.36  
 3rd Qu.:130.00   3rd Qu.:36.38   3rd Qu.:0.6065   3rd Qu.:40.75  
 Max.   :846.00   Max.   :59.40   Max.   :2.4200   Max.   :81.00  
 diabetes 
 neg:350  
 pos:188

summarytools::descr(train_set)

Non-numerical variable(s) ignored: diabetes

Descriptive Statistics  
train_set  
N: 538  

                       age   glucose   insulin     mass   pedigree   pregnant   pressure   triceps
----------------- -------- --------- --------- -------- ---------- ---------- ---------- ---------
             Mean    33.36    120.74     80.34    31.94       0.47       3.87      69.17     20.34
          Std.Dev    11.81     32.66    119.63     7.78       0.32       3.39      19.96     16.35
              Min    21.00      0.00      0.00     0.00       0.08       0.00       0.00      0.00
               Q1    24.00    100.00      0.00    27.30       0.25       1.00      62.00      0.00
           Median    29.00    117.00     14.50    32.00       0.36       3.00      72.00     22.00
               Q3    41.00    141.00    130.00    36.40       0.61       6.00      80.00     32.00
              Max    81.00    198.00    846.00    59.40       2.42      17.00     122.00     99.00
              MAD    10.38     29.65     21.50     6.82       0.24       2.97      11.86     20.76
              IQR    16.75     41.00    130.00     9.07       0.36       5.00      18.00     32.00
               CV     0.35      0.27      1.49     0.24       0.69       0.88       0.29      0.80
         Skewness     1.14      0.04      2.34    -0.59       1.92       0.85      -1.85      0.21
      SE.Skewness     0.11      0.11      0.11     0.11       0.11       0.11       0.11      0.11
         Kurtosis     0.69      0.88      7.54     3.10       5.87       0.05       4.88     -0.34
          N.Valid   538.00    538.00    538.00   538.00     538.00     538.00     538.00    538.00
        Pct.Valid   100.00    100.00    100.00   100.00     100.00     100.00     100.00    100.00

There are no missing values in the dataset and hence we can proceed further.

Exploratory data analysis

Diabetes Distribution

ggplot(train_set, aes(train_set$diabetes, fill = diabetes)) + 
  geom_bar() +
  theme_bw() +
  labs(title = "Diabetes Classification", x = "Diabetes") +
  theme(plot.title = element_text(hjust = 0.5))

### Correlation Matrix
Let’s now see the correlation between all the numerical variables in the dataset.
Correlation Matrix can be ploted as follows.

cor_data <- cor(train_set[,setdiff(names(train_set), 'diabetes')])
#Numerical Correlation Matrix
cor_data

            pregnant    glucose   pressure     triceps     insulin
pregnant  1.00000000 0.12892659 0.12857993 -0.08967786 -0.06748752
glucose   0.12892659 1.00000000 0.11686557  0.05590162  0.36373586
pressure  0.12857993 0.11686557 1.00000000  0.23569618  0.10854547
triceps  -0.08967786 0.05590162 0.23569618  1.00000000  0.45291162
insulin  -0.06748752 0.36373586 0.10854547  0.45291162  1.00000000
mass     -0.02680956 0.20155969 0.27367542  0.41147508  0.25628831
pedigree -0.00602656 0.12650476 0.05610127  0.18672512  0.16943674
age       0.53487974 0.27353768 0.22838558 -0.09908890 -0.02301072
                mass    pedigree         age
pregnant -0.02680956 -0.00602656  0.53487974
glucose   0.20155969  0.12650476  0.27353768
pressure  0.27367542  0.05610127  0.22838558
triceps   0.41147508  0.18672512 -0.09908890
insulin   0.25628831  0.16943674 -0.02301072
mass      1.00000000  0.16300107  0.01782932
pedigree  0.16300107  1.00000000  0.05307082
age       0.01782932  0.05307082  1.00000000

# Correlation matrix plots
corrplot::corrplot(cor_data)

corrplot::corrplot(cor_data, type = "lower", method = "number")

corrplot::corrplot(cor_data, type = "lower", method = "pie")

We can see that there is a moderately positive high correlation between age and pregn.

Univariate Analysis

univar_graph <- function(univar_name, univar, data, output_var) {
  
  g_1 <- ggplot(data, aes(x=univar)) + 
    geom_density() + 
    xlab(univar_name) + 
    theme_bw()
  
  g_2 <- ggplot(data, aes(x=univar, fill=output_var)) + 
    geom_density(alpha=0.4) + 
    xlab(univar_name) + 
    theme_bw()
  
  gridExtra::grid.arrange(g_1, g_2, ncol=2, top = paste(univar_name,"variable", "/ [ Skew:",timeDate::skewness(univar),"]"))
  
}

for (x in 1:(ncol(train_set)-1)) {
  univar_graph(univar_name = names(train_set)[x], univar = train_set[,x], data = train_set, output_var = train_set[,'diabetes'])
}

Variables such as Insulin, pedigree and age have high right skewness.
pressure and mass have negative skewness.
pregnant, glucose, and triceps have moderate to low right skewness.

Univariable Analysis using Bar plots

# ggplot(data = train_set, aes(x = age, fill = diabetes)) +
#   geom_bar(stat='count', position='dodge') +
#   ggtitle("Age Vs Diabetes") +
#   theme_bw() +
#   labs(x = "Age") +
#   theme(plot.title = element_text(hjust = 0.5))  

bivar_graph <- function(bivar_name, bivar, data, output_var) {
  
  g_1 <- ggplot(data = data, aes(x = bivar, fill = output_var)) +
          geom_bar(stat='count', position='dodge') +
          theme_bw() +
          labs( title = paste(bivar_name,"- Diabetes", sep =" "), x = bivar_name) +
          theme(plot.title = element_text(hjust = 0.5))
   
  plot(g_1)
}

for (x in 1:(ncol(train_set)-1)) {
  bivar_graph(bivar_name = names(train_set)[x], bivar = train_set[,x], data = train_set, output_var = train_set[,'diabetes'])
}

* We can not see any trend in pregnant vs diabetes plot.
* Higher the glucose level, higher are the chances of positive diabetes. This is line with expectations.
* Diabetic pressure levels seems to be normally distributed for negative results and the positive cases range from bp of 60 to 110.
* As the BMI increases, the chances for a positive Diabetes case also increases.
* We can see that the positive diabetes cases increases as the age increases > 25.

*The reason that some of the graphs are small is due to fact that the maximum value for that variable is very high and the count for each is very small.
This can be confirmed by plotting box plots for each of the variables in the following section.

### Outlier Detection

box_plot <- function(bivar_name, bivar, data, output_var) {
  
  g_1 <- ggplot(data = data, aes(y = bivar, fill = output_var)) +
          geom_boxplot() +
          theme_bw() +
          labs( title = paste(bivar_name,"Outlier Detection", sep =" "), y = bivar_name) +
          theme(plot.title = element_text(hjust = 0.5))
   
  plot(g_1)
}

for (x in 1:(ncol(train_set)-1)) {
  box_plot(bivar_name = names(train_set)[x], bivar = train_set[,x], data = train_set, output_var = train_set[,'diabetes'])
}

* We can observe that insulin, pedigree and age have the highest outliers.
* We will take care of them in the train() function of the CARET package.

ML Model Building on Train Data Set

#####################################################################################################################################
# Random Forest Model
#####################################################################################################################################
model_forest <- caret::train(diabetes ~., data = train_set,
                         method = "ranger",
                         metric = "ROC",
                         trControl = trainControl(method = "cv", number = 10,
                                                  classProbs = T, summaryFunction = twoClassSummary),
                          preProcess = c("center","scale","pca"))
model_forest

Random Forest 

538 samples
  8 predictor
  2 classes: 'neg', 'pos' 

Pre-processing: centered (8), scaled (8), principal component
 signal extraction (8) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 485, 484, 484, 484, 484, 484, ... 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec     
  2     gini        0.8098789  0.8428571  0.5739766
  2     extratrees  0.8129741  0.8828571  0.5315789
  4     gini        0.8029574  0.8371429  0.5850877
  4     extratrees  0.8143150  0.8600000  0.5529240
  7     gini        0.7964495  0.8342857  0.5687135
  7     extratrees  0.8097828  0.8571429  0.5637427

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 4, splitrule =
 extratrees and min.node.size = 1.

# final ROC Value
model_forest$results[6,4]

[1] 0.8097828

#####################################################################################################################################
# XGBOOST - eXtreme Gradient BOOSTing 
#####################################################################################################################################
xgb_grid_1  <-  expand.grid(
                  nrounds = 50,
                  eta = c(0.03),
                  max_depth = 1,
                  gamma = 0,
                  colsample_bytree = 0.6,
                  min_child_weight = 1,
                  subsample = 0.5
                )

model_xgb <- caret::train(diabetes ~., data = train_set,
                         method = "xgbTree",
                         metric = "ROC",
                         tuneGrid=xgb_grid_1,
                         trControl = trainControl(method = "cv", number = 10,
                                                  classProbs = T, summaryFunction = twoClassSummary),
                         preProcess = c("center","scale","pca"))
model_xgb

eXtreme Gradient Boosting 

538 samples
  8 predictor
  2 classes: 'neg', 'pos' 

Pre-processing: centered (8), scaled (8), principal component
 signal extraction (8) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 484, 484, 484, 484, 485, 484, ... 
Resampling results:

  ROC        Sens       Spec     
  0.7852464  0.9514286  0.2827485

Tuning parameter 'nrounds' was held constant at a value of 50
 0.6
Tuning parameter 'min_child_weight' was held constant at a value of
 1
Tuning parameter 'subsample' was held constant at a value of 0.5

# final ROC value
model_xgb$results["ROC"]

        ROC
1 0.7852464

#####################################################################################################################################
# KNN - K Nearest Neighbours
#####################################################################################################################################
model_knn <- caret::train(diabetes ~., data = train_set,
                         method = "knn",
                         metric = "ROC",
                         tuneGrid = expand.grid(.k = c(3:10)),
                         trControl = trainControl(method = "cv", number = 10,
                                                  classProbs = T, summaryFunction = twoClassSummary),
                          preProcess = c("center","scale","pca"))

model_knn

k-Nearest Neighbors 

538 samples
  8 predictor
  2 classes: 'neg', 'pos' 

Pre-processing: centered (8), scaled (8), principal component
 signal extraction (8) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 484, 484, 484, 484, 484, 485, ... 
Resampling results across tuning parameters:

  k   ROC        Sens       Spec     
   3  0.7245029  0.7971429  0.5418129
   4  0.7390434  0.8028571  0.5204678
   5  0.7516625  0.8200000  0.5461988
   6  0.7538179  0.8200000  0.5040936
   7  0.7573559  0.8342857  0.5198830
   8  0.7705931  0.8428571  0.5146199
   9  0.7742774  0.8485714  0.5248538
  10  0.7766708  0.8542857  0.5298246

ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 10.

#final ROC value
model_knn$results[8,2]

[1] 0.7766708

#####################################################################################################################################
# Logistic Regression
#####################################################################################################################################
model_glm <- caret::train(diabetes ~., data = train_set,
                         method = "glm",
                         metric = "ROC",
                         tuneLength = 10,
                         trControl = trainControl(method = "cv", number = 10,
                                                   classProbs = T, summaryFunction = twoClassSummary),
                          preProcess = c("center","scale","pca"))

model_glm

Generalized Linear Model 

538 samples
  8 predictor
  2 classes: 'neg', 'pos' 

Pre-processing: centered (8), scaled (8), principal component
 signal extraction (8) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 484, 484, 484, 484, 484, 485, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8226149  0.8828571  0.5681287

#final ROC Value
model_glm$results[2]

        ROC
1 0.8226149

#####################################################################################################################################
# Rpart CART - classification and Regression Trees
#####################################################################################################################################
model_rpart <- caret::train(diabetes ~., data = train_set,
                         method = "rpart",
                         metric = "ROC",
                         tuneLength = 20,
                         trControl = trainControl(method = "cv", number = 10,
                                                   classProbs = T, summaryFunction = twoClassSummary))
                          # preProcess = c("center","scale","pca"))

model_rpart

CART 

538 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 484, 484, 485, 484, 484, 484, ... 
Resampling results across tuning parameters:

  cp          ROC        Sens       Spec     
  0.00000000  0.7507310  0.7800000  0.6160819
  0.01427772  0.7115581  0.8314286  0.5257310
  0.02855543  0.7163409  0.8514286  0.5517544
  0.04283315  0.7188972  0.8600000  0.5412281
  0.05711086  0.7188972  0.8600000  0.5412281
  0.07138858  0.7188972  0.8600000  0.5412281
  0.08566629  0.7184211  0.8571429  0.5467836
  0.09994401  0.7106809  0.8257143  0.5681287
  0.11422172  0.7063200  0.7971429  0.6154971
  0.12849944  0.7063200  0.7971429  0.6154971
  0.14277716  0.7063200  0.7971429  0.6154971
  0.15705487  0.7063200  0.7971429  0.6154971
  0.17133259  0.7063200  0.7971429  0.6154971
  0.18561030  0.7063200  0.7971429  0.6154971
  0.19988802  0.7063200  0.7971429  0.6154971
  0.21416573  0.7063200  0.7971429  0.6154971
  0.22844345  0.7063200  0.7971429  0.6154971
  0.24272116  0.7063200  0.7971429  0.6154971
  0.25699888  0.6473726  0.8371429  0.4576023
  0.27127660  0.6473726  0.8371429  0.4576023

ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.

# Plot model accuracy vs different values of cp (complexity parameter)
plot(model_rpart)

# Best Model Cp (Complexity Parameter)
model_rpart$bestTune

  cp
1  0

# Structure of final model selected
model_rpart$finalModel

n= 538 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 538 188 neg (0.65055762 0.34944238)  
    2) glucose< 127.5 341  64 neg (0.81231672 0.18768328)  
      4) age< 28.5 188  17 neg (0.90957447 0.09042553) *
      5) age>=28.5 153  47 neg (0.69281046 0.30718954)  
       10) glucose< 99.5 45   4 neg (0.91111111 0.08888889) *
       11) glucose>=99.5 108  43 neg (0.60185185 0.39814815)  
         22) mass< 26.25 18   2 neg (0.88888889 0.11111111) *
         23) mass>=26.25 90  41 neg (0.54444444 0.45555556)  
           46) pedigree< 0.561 64  23 neg (0.64062500 0.35937500)  
             92) age>=47.5 12   1 neg (0.91666667 0.08333333) *
             93) age< 47.5 52  22 neg (0.57692308 0.42307692)  
              186) pedigree< 0.2 10   1 neg (0.90000000 0.10000000) *
              187) pedigree>=0.2 42  21 neg (0.50000000 0.50000000)  
                374) pedigree>=0.371 19   6 neg (0.68421053 0.31578947) *
                375) pedigree< 0.371 23   8 pos (0.34782609 0.65217391)  
                  750) mass>=34.85 9   4 neg (0.55555556 0.44444444) *
                  751) mass< 34.85 14   3 pos (0.21428571 0.78571429) *
           47) pedigree>=0.561 26   8 pos (0.30769231 0.69230769)  
             94) pregnant< 5 11   4 neg (0.63636364 0.36363636) *
             95) pregnant>=5 15   1 pos (0.06666667 0.93333333) *
    3) glucose>=127.5 197  73 pos (0.37055838 0.62944162)  
      6) mass< 29.95 50  16 neg (0.68000000 0.32000000)  
       12) glucose< 145.5 29   4 neg (0.86206897 0.13793103) *
       13) glucose>=145.5 21   9 pos (0.42857143 0.57142857)  
         26) triceps< 7 14   6 neg (0.57142857 0.42857143) *
         27) triceps>=7 7   1 pos (0.14285714 0.85714286) *
      7) mass>=29.95 147  39 pos (0.26530612 0.73469388)  
       14) glucose< 157.5 82  30 pos (0.36585366 0.63414634)  
         28) pedigree< 0.436 42  21 neg (0.50000000 0.50000000)  
           56) mass< 41.8 32  12 neg (0.62500000 0.37500000)  
            112) pressure>=71 24   6 neg (0.75000000 0.25000000) *
            113) pressure< 71 8   2 pos (0.25000000 0.75000000) *
           57) mass>=41.8 10   1 pos (0.10000000 0.90000000) *
         29) pedigree>=0.436 40   9 pos (0.22500000 0.77500000) *
       15) glucose>=157.5 65   9 pos (0.13846154 0.86153846) *

# Should have refactored diabetes to make pos as the primary factor level to get straight forward decision tree
rpart.plot::rpart.plot(model_rpart$finalModel, type = 2, fallen.leaves = T, extra = 2, cex = 0.70)

Training Data set - Model Comparision by ROC value

model_list <- list(Random_Forest = model_forest, XGBoost = model_xgb, KNN = model_knn, Logistic_Regression = model_glm, Rpart_DT = model_rpart)
resamples <- resamples(model_list)

#box plot
bwplot(resamples, metric="ROC")

#dot plot
dotplot(resamples, metric="ROC")

Based on ROC value, we can conclude that Logistic Regression performed best for the training data set among all the five ML models.

Predition on Test Data Set

#####################################################################################################################################
# Random Forest
#####################################################################################################################################

# prediction on Test data set
pred_rf <- predict(model_forest, test_set)
# Confusion Matrix 
cm_rf <- confusionMatrix(pred_rf, test_set$diabetes, positive="pos")

# Prediction Probabilities
pred_prob_rf <- predict(model_forest, test_set, type="prob")
# ROC value
roc_rf <- roc(test_set$diabetes, pred_prob_rf$pos)

# Confusion Matrix for Random Forest Model
cm_rf

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 133  35
       pos  17  45
                                          
               Accuracy : 0.7739          
                 95% CI : (0.7143, 0.8263)
    No Information Rate : 0.6522          
    P-Value [Acc > NIR] : 4.208e-05       
                                          
                  Kappa : 0.4741          
                                          
 Mcnemar's Test P-Value : 0.0184          
                                          
            Sensitivity : 0.5625          
            Specificity : 0.8867          
         Pos Pred Value : 0.7258          
         Neg Pred Value : 0.7917          
             Prevalence : 0.3478          
         Detection Rate : 0.1957          
   Detection Prevalence : 0.2696          
      Balanced Accuracy : 0.7246          
                                          
       'Positive' Class : pos

# ROC Value for Random Forest
roc_rf


Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_rf$pos)

Data: pred_prob_rf$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.827

# AUC - Area under the curve
caTools::colAUC(pred_prob_rf$pos, test_set$diabetes, plotROC = T)

                 [,1]
neg vs. pos 0.8269583

#####################################################################################################################################
# XGBOOST - eXtreme Gradient BOOSTing 
#####################################################################################################################################

# prediction on Test data set
pred_xgb <- predict(model_xgb, test_set)
# Confusion Matrix 
cm_xgb <- confusionMatrix(pred_xgb, test_set$diabetes, positive="pos")

# Prediction Probabilities
pred_prob_xgb <- predict(model_xgb, test_set, type="prob")
# ROC value
roc_xgb <- roc(test_set$diabetes, pred_prob_xgb$pos)

# Confusion matrix 
cm_xgb

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 141  51
       pos   9  29
                                          
               Accuracy : 0.7391          
                 95% CI : (0.6773, 0.7946)
    No Information Rate : 0.6522          
    P-Value [Acc > NIR] : 0.002949        
                                          
                  Kappa : 0.3447          
                                          
 Mcnemar's Test P-Value : 1.203e-07       
                                          
            Sensitivity : 0.3625          
            Specificity : 0.9400          
         Pos Pred Value : 0.7632          
         Neg Pred Value : 0.7344          
             Prevalence : 0.3478          
         Detection Rate : 0.1261          
   Detection Prevalence : 0.1652          
      Balanced Accuracy : 0.6512          
                                          
       'Positive' Class : pos

# ROC Value for for XGBoost
roc_xgb


Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_xgb$pos)

Data: pred_prob_xgb$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.8155

# AUC - Area under the curve
caTools::colAUC(pred_prob_xgb$pos, test_set$diabetes, plotROC = T)

                 [,1]
neg vs. pos 0.8155417

#####################################################################################################################################
# KNN - K Nearest Neighbours
#####################################################################################################################################

# prediction on Test data set
pred_knn <- predict(model_knn, test_set)
# Confusion Matrix 
cm_knn <- confusionMatrix(pred_knn, test_set$diabetes, positive="pos")

# Prediction Probabilities
pred_prob_knn <- predict(model_knn, test_set, type="prob")
# ROC value
roc_knn <- roc(test_set$diabetes, pred_prob_knn$pos)

# Confusion matrix 
cm_knn

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 131  37
       pos  19  43
                                          
               Accuracy : 0.7565          
                 95% CI : (0.6958, 0.8105)
    No Information Rate : 0.6522          
    P-Value [Acc > NIR] : 0.000421        
                                          
                  Kappa : 0.4336          
                                          
 Mcnemar's Test P-Value : 0.023103        
                                          
            Sensitivity : 0.5375          
            Specificity : 0.8733          
         Pos Pred Value : 0.6935          
         Neg Pred Value : 0.7798          
             Prevalence : 0.3478          
         Detection Rate : 0.1870          
   Detection Prevalence : 0.2696          
      Balanced Accuracy : 0.7054          
                                          
       'Positive' Class : pos

# ROC Value for for XGBoost
roc_knn


Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_knn$pos)

Data: pred_prob_knn$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.8056

# AUC - Area under the curve
caTools::colAUC(pred_prob_knn$pos, test_set$diabetes, plotROC = T)

                [,1]
neg vs. pos 0.805625

#####################################################################################################################################
# Logistic Regression
#####################################################################################################################################

# prediction on Test data set
pred_glm <- predict(model_glm, test_set)
# Confusion Matrix 
cm_glm <- confusionMatrix(pred_glm, test_set$diabetes, positive="pos")

# Prediction Probabilities
pred_prob_glm <- predict(model_glm, test_set, type="prob")
# ROC value
roc_glm <- roc(test_set$diabetes, pred_prob_glm$pos)

# Confusion matrix 
cm_glm

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 131  38
       pos  19  42
                                          
               Accuracy : 0.7522          
                 95% CI : (0.6912, 0.8066)
    No Information Rate : 0.6522          
    P-Value [Acc > NIR] : 0.0007075       
                                          
                  Kappa : 0.4217          
                                          
 Mcnemar's Test P-Value : 0.0171182       
                                          
            Sensitivity : 0.5250          
            Specificity : 0.8733          
         Pos Pred Value : 0.6885          
         Neg Pred Value : 0.7751          
             Prevalence : 0.3478          
         Detection Rate : 0.1826          
   Detection Prevalence : 0.2652          
      Balanced Accuracy : 0.6992          
                                          
       'Positive' Class : pos

# ROC Value for for XGBoost
roc_glm


Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_glm$pos)

Data: pred_prob_glm$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.8369

# AUC - Area under the curve
caTools::colAUC(pred_prob_glm$pos, test_set$diabetes, plotROC = T)

                 [,1]
neg vs. pos 0.8369167

#####################################################################################################################################
# Rpart CART - classification and Regression Trees
#####################################################################################################################################

# prediction on Test data set
pred_rpart <- predict(model_rpart, test_set)
# Confusion Matrix 
cm_rpart <- confusionMatrix(pred_rpart, test_set$diabetes, positive="pos")

# Prediction Probabilities
pred_prob_rpart <- predict(model_rpart, test_set, type="prob")
# ROC value
roc_rpart <- roc(test_set$diabetes, pred_prob_rpart$pos)

# Confusion matrix 
cm_rpart

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 129  30
       pos  21  50
                                         
               Accuracy : 0.7783         
                 95% CI : (0.719, 0.8302)
    No Information Rate : 0.6522         
    P-Value [Acc > NIR] : 2.232e-05      
                                         
                  Kappa : 0.4981         
                                         
 Mcnemar's Test P-Value : 0.2626         
                                         
            Sensitivity : 0.6250         
            Specificity : 0.8600         
         Pos Pred Value : 0.7042         
         Neg Pred Value : 0.8113         
             Prevalence : 0.3478         
         Detection Rate : 0.2174         
   Detection Prevalence : 0.3087         
      Balanced Accuracy : 0.7425         
                                         
       'Positive' Class : pos

# ROC Value for for XGBoost
roc_rpart


Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_rpart$pos)

Data: pred_prob_rpart$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.7902

# AUC - Area under the curve
caTools::colAUC(pred_prob_rpart$pos, test_set$diabetes, plotROC = T)

                 [,1]
neg vs. pos 0.7901667

Test Set Results Comparision

result_rf <- c(cm_rf$byClass['Sensitivity'], cm_rf$byClass['Specificity'], cm_rf$byClass['Precision'], 
               cm_rf$byClass['Recall'], cm_rf$byClass['F1'], roc_rf$auc)

result_xgb <- c(cm_xgb$byClass['Sensitivity'], cm_xgb$byClass['Specificity'], cm_xgb$byClass['Precision'], 
               cm_xgb$byClass['Recall'], cm_xgb$byClass['F1'], roc_xgb$auc)

result_knn <- c(cm_knn$byClass['Sensitivity'], cm_knn$byClass['Specificity'], cm_knn$byClass['Precision'], 
               cm_knn$byClass['Recall'], cm_knn$byClass['F1'], roc_knn$auc)

result_glm <- c(cm_glm$byClass['Sensitivity'], cm_glm$byClass['Specificity'], cm_glm$byClass['Precision'], 
               cm_glm$byClass['Recall'], cm_glm$byClass['F1'], roc_glm$auc)

result_rpart <- c(cm_rpart$byClass['Sensitivity'], cm_rpart$byClass['Specificity'], cm_rpart$byClass['Precision'], 
               cm_rpart$byClass['Recall'], cm_rpart$byClass['F1'], roc_rpart$auc)


all_results <- data.frame(rbind(result_rf, result_xgb, result_knn, result_glm, result_rpart))
names(all_results) <- c("Sensitivity", "Specificity", "Precision", "Recall", "F1", "AUC")
all_results

             Sensitivity Specificity Precision Recall        F1       AUC
result_rf         0.5625   0.8866667 0.7258065 0.5625 0.6338028 0.8269583
result_xgb        0.3625   0.9400000 0.7631579 0.3625 0.4915254 0.8155417
result_knn        0.5375   0.8733333 0.6935484 0.5375 0.6056338 0.8056250
result_glm        0.5250   0.8733333 0.6885246 0.5250 0.5957447 0.8369167
result_rpart      0.6250   0.8600000 0.7042254 0.6250 0.6622517 0.7901667

Logistic Regression model looks best for the test data set as well based on Sensitivity, Precision, Recall, and F1 score.
The AUC value and Specificity for Logistic Regression is also pretty good and is at second position among all the models built and tested.

Visualization to compare accuracy of ML models

col <- c("#ed3b3b", "#0099ff")

graphics::fourfoldplot(cm_rf$table, color = col, conf.level = 0.95, margin = 1, 
                       main = paste("Random Forest Accuracy(",round(cm_rf$overall[1]*100),"%)", sep = ""))

graphics::fourfoldplot(cm_xgb$table, color = col, conf.level = 0.95, margin = 1, 
                       main = paste("XGB Accuracy(",round(cm_xgb$overall[1]*100),"%)", sep = ""))

graphics::fourfoldplot(cm_knn$table, color = col, conf.level = 0.95, margin = 1, 
                       main = paste("KNN Accuracy(",round(cm_knn$overall[1]*100),"%)", sep = ""))

graphics::fourfoldplot(cm_glm$table, color = col, conf.level = 0.95, margin = 1, 
                       main = paste("Logistic Regression Accuracy(",round(cm_glm$overall[1]*100),"%)", sep = ""))

graphics::fourfoldplot(cm_rpart$table, color = col, conf.level = 0.95, margin = 1, 
                       main = paste("Rpart DT Accuracy(",round(cm_rpart$overall[1]*100),"%)", sep = ""))

From the above plot we can confirm that Logistic Regression model performed best, based on overall Accuracy of the model also.

Conclusion

We have deployed multiple Machine Learning Models such as - Random Forest, eXtreme Gradient Boosting Machine, K-Nearest Neighbours, Logistic Regression and R part Classification and Regression Trees, and found that Logistic Regression has the best performance based on ROC value.
As a next step, we could go further and try to compare other ML models as well.

Glossary

Recall: Recall gives us an idea about when it’s actually Yes, how often does it predict Yes.
Precision: Precision tells us about when it predicts Yes, how often is it correct.
Sensitivity: Ability of the test to correctly identify the true positive rate.
Specificity: Ability of the test to correctly identify the true negative rate.

Abbreviations used
ROC: Reciever Operating Characteristics curve
AUC: Area under the ROC curve
AUC denotes the rate of successful classification by the logistic model.

Predict Diabetes using Pima Indians Data Set

Mayank Agrawal

21/06/2020