Objective:
The objective of this exercise is to predict if a patient has diabetes for the Pima Indians Diabetes Data set using multiple ML algorithms.
We will build and train following Machine Learning models during this analysis.
1. Random Forest
2. eXtreme Gradient Boosting Machine (XGBOOST)
3. K-Nearest Neighbours(KNN)
4. Logistic Regression (glm)
5. R part Classification and Regression Trees (rpart)
Data Set
The data set is comprised of 768 observations and 9 variables. It is available in the package mlbench. We will be using diabetes as our response/target variable.
Data Description for the 9 variables are as follows.
More information on the dataset can be read by accessing the website: http://math.furman.edu/~dcs/courses/math47/R/library/mlbench/html/PimaIndiansDiabetes.html
we are going to use the CARET and mlbench R packages. mlbench package holds the PimaIndiansDiabetes dataset and we would be building models using the CARET package.
# libraries used
# library(caret) #ML Model buidling package
# library(tidyverse) #ggplot and dplyr
# library(MASS) #Modern Applied Statistics with S
# library(mlbench) #data sets from the UCI repository.
# library(summarytools)
# library(corrplot) #Correlation plot
# library(gridExtra) #Multiple plot in single grip space
# library(timeDate)
# library(pROC) #ROC
# library(caTools) #AUC
# library(rpart.plot) #CART Decision Tree
# library(e1071) #imports graphics, grDevices, class, stats, methods, utils
# library(graphics) #fourfoldplot
## Split the Data into Training and Testing sets
Data split(70/30) using the function createDataPartition from the CARET package.
# Usually set.seed(123) for reproducibility. I am not doing it here due to software issue as I can not publish this document on its inclusion.
data(PimaIndiansDiabetes)
df <- PimaIndiansDiabetes
str(df)
'data.frame': 768 obs. of 9 variables:
$ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
$ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
$ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
$ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
$ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
$ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
$ age : num 50 31 32 21 33 30 26 29 53 54 ...
$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
head(df)
pregnant glucose pressure triceps insulin mass pedigree age diabetes
1 6 148 72 35 0 33.6 0.627 50 pos
2 1 85 66 29 0 26.6 0.351 31 neg
3 8 183 64 0 0 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.288 33 pos
6 5 116 74 0 0 25.6 0.201 30 neg
#store rows for partition
partition <- caret::createDataPartition(y = df$diabetes, times = 1, p = 0.7, list = FALSE)
# create training data set
train_set <- df[partition,]
# create testing data set, subtracting the rows partition to get remaining 30% of the data
test_set <- df[-partition,]
str(train_set)
'data.frame': 538 obs. of 9 variables:
$ pregnant: num 6 5 3 10 2 8 10 1 5 7 ...
$ glucose : num 148 116 78 115 197 125 139 189 166 100 ...
$ pressure: num 72 74 50 0 70 96 80 60 72 0 ...
$ triceps : num 35 0 32 0 45 0 0 23 19 0 ...
$ insulin : num 0 0 88 0 543 0 0 846 175 0 ...
$ mass : num 33.6 25.6 31 35.3 30.5 0 27.1 30.1 25.8 30 ...
$ pedigree: num 0.627 0.201 0.248 0.134 0.158 ...
$ age : num 50 30 26 29 53 54 57 59 51 32 ...
$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 2 1 2 2 2 ...
str(test_set)
'data.frame': 230 obs. of 9 variables:
$ pregnant: num 1 8 1 0 4 10 7 10 4 11 ...
$ glucose : num 85 183 89 137 110 168 196 125 103 138 ...
$ pressure: num 66 64 66 40 92 74 90 70 60 76 ...
$ triceps : num 29 0 23 35 0 0 0 26 33 0 ...
$ insulin : num 0 0 94 168 0 0 0 115 192 0 ...
$ mass : num 26.6 23.3 28.1 43.1 37.6 38 39.8 31.1 24 33.2 ...
$ pedigree: num 0.351 0.672 0.167 2.288 0.191 ...
$ age : num 31 32 21 33 30 34 41 41 33 35 ...
$ diabetes: Factor w/ 2 levels "neg","pos": 1 2 1 2 1 2 2 2 1 1 ...
summary(train_set)
pregnant glucose pressure triceps
Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.00 1st Qu.:100.0 1st Qu.: 62.00 1st Qu.: 0.00
Median : 3.00 Median :117.0 Median : 72.00 Median :22.00
Mean : 3.87 Mean :120.7 Mean : 69.17 Mean :20.34
3rd Qu.: 6.00 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:32.00
Max. :17.00 Max. :198.0 Max. :122.00 Max. :99.00
insulin mass pedigree age
Min. : 0.00 Min. : 0.00 Min. :0.0840 Min. :21.00
1st Qu.: 0.00 1st Qu.:27.30 1st Qu.:0.2462 1st Qu.:24.00
Median : 14.50 Median :32.00 Median :0.3630 Median :29.00
Mean : 80.34 Mean :31.94 Mean :0.4658 Mean :33.36
3rd Qu.:130.00 3rd Qu.:36.38 3rd Qu.:0.6065 3rd Qu.:40.75
Max. :846.00 Max. :59.40 Max. :2.4200 Max. :81.00
diabetes
neg:350
pos:188
summarytools::descr(train_set)
Non-numerical variable(s) ignored: diabetes
Descriptive Statistics
train_set
N: 538
age glucose insulin mass pedigree pregnant pressure triceps
----------------- -------- --------- --------- -------- ---------- ---------- ---------- ---------
Mean 33.36 120.74 80.34 31.94 0.47 3.87 69.17 20.34
Std.Dev 11.81 32.66 119.63 7.78 0.32 3.39 19.96 16.35
Min 21.00 0.00 0.00 0.00 0.08 0.00 0.00 0.00
Q1 24.00 100.00 0.00 27.30 0.25 1.00 62.00 0.00
Median 29.00 117.00 14.50 32.00 0.36 3.00 72.00 22.00
Q3 41.00 141.00 130.00 36.40 0.61 6.00 80.00 32.00
Max 81.00 198.00 846.00 59.40 2.42 17.00 122.00 99.00
MAD 10.38 29.65 21.50 6.82 0.24 2.97 11.86 20.76
IQR 16.75 41.00 130.00 9.07 0.36 5.00 18.00 32.00
CV 0.35 0.27 1.49 0.24 0.69 0.88 0.29 0.80
Skewness 1.14 0.04 2.34 -0.59 1.92 0.85 -1.85 0.21
SE.Skewness 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11
Kurtosis 0.69 0.88 7.54 3.10 5.87 0.05 4.88 -0.34
N.Valid 538.00 538.00 538.00 538.00 538.00 538.00 538.00 538.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
There are no missing values in the dataset and hence we can proceed further.
ggplot(train_set, aes(train_set$diabetes, fill = diabetes)) +
geom_bar() +
theme_bw() +
labs(title = "Diabetes Classification", x = "Diabetes") +
theme(plot.title = element_text(hjust = 0.5))
### Correlation Matrix
Let’s now see the correlation between all the numerical variables in the dataset.
Correlation Matrix can be ploted as follows.
cor_data <- cor(train_set[,setdiff(names(train_set), 'diabetes')])
#Numerical Correlation Matrix
cor_data
pregnant glucose pressure triceps insulin
pregnant 1.00000000 0.12892659 0.12857993 -0.08967786 -0.06748752
glucose 0.12892659 1.00000000 0.11686557 0.05590162 0.36373586
pressure 0.12857993 0.11686557 1.00000000 0.23569618 0.10854547
triceps -0.08967786 0.05590162 0.23569618 1.00000000 0.45291162
insulin -0.06748752 0.36373586 0.10854547 0.45291162 1.00000000
mass -0.02680956 0.20155969 0.27367542 0.41147508 0.25628831
pedigree -0.00602656 0.12650476 0.05610127 0.18672512 0.16943674
age 0.53487974 0.27353768 0.22838558 -0.09908890 -0.02301072
mass pedigree age
pregnant -0.02680956 -0.00602656 0.53487974
glucose 0.20155969 0.12650476 0.27353768
pressure 0.27367542 0.05610127 0.22838558
triceps 0.41147508 0.18672512 -0.09908890
insulin 0.25628831 0.16943674 -0.02301072
mass 1.00000000 0.16300107 0.01782932
pedigree 0.16300107 1.00000000 0.05307082
age 0.01782932 0.05307082 1.00000000
# Correlation matrix plots
corrplot::corrplot(cor_data)
corrplot::corrplot(cor_data, type = "lower", method = "number")
corrplot::corrplot(cor_data, type = "lower", method = "pie")
We can see that there is a moderately positive high correlation between age and pregn.
univar_graph <- function(univar_name, univar, data, output_var) {
g_1 <- ggplot(data, aes(x=univar)) +
geom_density() +
xlab(univar_name) +
theme_bw()
g_2 <- ggplot(data, aes(x=univar, fill=output_var)) +
geom_density(alpha=0.4) +
xlab(univar_name) +
theme_bw()
gridExtra::grid.arrange(g_1, g_2, ncol=2, top = paste(univar_name,"variable", "/ [ Skew:",timeDate::skewness(univar),"]"))
}
for (x in 1:(ncol(train_set)-1)) {
univar_graph(univar_name = names(train_set)[x], univar = train_set[,x], data = train_set, output_var = train_set[,'diabetes'])
}
Variables such as Insulin, pedigree and age have high right skewness.
pressure and mass have negative skewness.
pregnant, glucose, and triceps have moderate to low right skewness.
# ggplot(data = train_set, aes(x = age, fill = diabetes)) +
# geom_bar(stat='count', position='dodge') +
# ggtitle("Age Vs Diabetes") +
# theme_bw() +
# labs(x = "Age") +
# theme(plot.title = element_text(hjust = 0.5))
bivar_graph <- function(bivar_name, bivar, data, output_var) {
g_1 <- ggplot(data = data, aes(x = bivar, fill = output_var)) +
geom_bar(stat='count', position='dodge') +
theme_bw() +
labs( title = paste(bivar_name,"- Diabetes", sep =" "), x = bivar_name) +
theme(plot.title = element_text(hjust = 0.5))
plot(g_1)
}
for (x in 1:(ncol(train_set)-1)) {
bivar_graph(bivar_name = names(train_set)[x], bivar = train_set[,x], data = train_set, output_var = train_set[,'diabetes'])
}
* We can not see any trend in pregnant vs diabetes plot.
* Higher the glucose level, higher are the chances of positive diabetes. This is line with expectations.
* Diabetic pressure levels seems to be normally distributed for negative results and the positive cases range from bp of 60 to 110.
* As the BMI increases, the chances for a positive Diabetes case also increases.
* We can see that the positive diabetes cases increases as the age increases > 25.
*The reason that some of the graphs are small is due to fact that the maximum value for that variable is very high and the count for each is very small.
This can be confirmed by plotting box plots for each of the variables in the following section.
### Outlier Detection
box_plot <- function(bivar_name, bivar, data, output_var) {
g_1 <- ggplot(data = data, aes(y = bivar, fill = output_var)) +
geom_boxplot() +
theme_bw() +
labs( title = paste(bivar_name,"Outlier Detection", sep =" "), y = bivar_name) +
theme(plot.title = element_text(hjust = 0.5))
plot(g_1)
}
for (x in 1:(ncol(train_set)-1)) {
box_plot(bivar_name = names(train_set)[x], bivar = train_set[,x], data = train_set, output_var = train_set[,'diabetes'])
}
* We can observe that insulin, pedigree and age have the highest outliers.
* We will take care of them in the train() function of the CARET package.
#####################################################################################################################################
# Random Forest Model
#####################################################################################################################################
model_forest <- caret::train(diabetes ~., data = train_set,
method = "ranger",
metric = "ROC",
trControl = trainControl(method = "cv", number = 10,
classProbs = T, summaryFunction = twoClassSummary),
preProcess = c("center","scale","pca"))
model_forest
Random Forest
538 samples
8 predictor
2 classes: 'neg', 'pos'
Pre-processing: centered (8), scaled (8), principal component
signal extraction (8)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 485, 484, 484, 484, 484, 484, ...
Resampling results across tuning parameters:
mtry splitrule ROC Sens Spec
2 gini 0.8098789 0.8428571 0.5739766
2 extratrees 0.8129741 0.8828571 0.5315789
4 gini 0.8029574 0.8371429 0.5850877
4 extratrees 0.8143150 0.8600000 0.5529240
7 gini 0.7964495 0.8342857 0.5687135
7 extratrees 0.8097828 0.8571429 0.5637427
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 4, splitrule =
extratrees and min.node.size = 1.
# final ROC Value
model_forest$results[6,4]
[1] 0.8097828
#####################################################################################################################################
# XGBOOST - eXtreme Gradient BOOSTing
#####################################################################################################################################
xgb_grid_1 <- expand.grid(
nrounds = 50,
eta = c(0.03),
max_depth = 1,
gamma = 0,
colsample_bytree = 0.6,
min_child_weight = 1,
subsample = 0.5
)
model_xgb <- caret::train(diabetes ~., data = train_set,
method = "xgbTree",
metric = "ROC",
tuneGrid=xgb_grid_1,
trControl = trainControl(method = "cv", number = 10,
classProbs = T, summaryFunction = twoClassSummary),
preProcess = c("center","scale","pca"))
model_xgb
eXtreme Gradient Boosting
538 samples
8 predictor
2 classes: 'neg', 'pos'
Pre-processing: centered (8), scaled (8), principal component
signal extraction (8)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 484, 484, 484, 484, 485, 484, ...
Resampling results:
ROC Sens Spec
0.7852464 0.9514286 0.2827485
Tuning parameter 'nrounds' was held constant at a value of 50
0.6
Tuning parameter 'min_child_weight' was held constant at a value of
1
Tuning parameter 'subsample' was held constant at a value of 0.5
# final ROC value
model_xgb$results["ROC"]
ROC
1 0.7852464
#####################################################################################################################################
# KNN - K Nearest Neighbours
#####################################################################################################################################
model_knn <- caret::train(diabetes ~., data = train_set,
method = "knn",
metric = "ROC",
tuneGrid = expand.grid(.k = c(3:10)),
trControl = trainControl(method = "cv", number = 10,
classProbs = T, summaryFunction = twoClassSummary),
preProcess = c("center","scale","pca"))
model_knn
k-Nearest Neighbors
538 samples
8 predictor
2 classes: 'neg', 'pos'
Pre-processing: centered (8), scaled (8), principal component
signal extraction (8)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 484, 484, 484, 484, 484, 485, ...
Resampling results across tuning parameters:
k ROC Sens Spec
3 0.7245029 0.7971429 0.5418129
4 0.7390434 0.8028571 0.5204678
5 0.7516625 0.8200000 0.5461988
6 0.7538179 0.8200000 0.5040936
7 0.7573559 0.8342857 0.5198830
8 0.7705931 0.8428571 0.5146199
9 0.7742774 0.8485714 0.5248538
10 0.7766708 0.8542857 0.5298246
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 10.
#final ROC value
model_knn$results[8,2]
[1] 0.7766708
#####################################################################################################################################
# Logistic Regression
#####################################################################################################################################
model_glm <- caret::train(diabetes ~., data = train_set,
method = "glm",
metric = "ROC",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10,
classProbs = T, summaryFunction = twoClassSummary),
preProcess = c("center","scale","pca"))
model_glm
Generalized Linear Model
538 samples
8 predictor
2 classes: 'neg', 'pos'
Pre-processing: centered (8), scaled (8), principal component
signal extraction (8)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 484, 484, 484, 484, 484, 485, ...
Resampling results:
ROC Sens Spec
0.8226149 0.8828571 0.5681287
#final ROC Value
model_glm$results[2]
ROC
1 0.8226149
#####################################################################################################################################
# Rpart CART - classification and Regression Trees
#####################################################################################################################################
model_rpart <- caret::train(diabetes ~., data = train_set,
method = "rpart",
metric = "ROC",
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10,
classProbs = T, summaryFunction = twoClassSummary))
# preProcess = c("center","scale","pca"))
model_rpart
CART
538 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 484, 484, 485, 484, 484, 484, ...
Resampling results across tuning parameters:
cp ROC Sens Spec
0.00000000 0.7507310 0.7800000 0.6160819
0.01427772 0.7115581 0.8314286 0.5257310
0.02855543 0.7163409 0.8514286 0.5517544
0.04283315 0.7188972 0.8600000 0.5412281
0.05711086 0.7188972 0.8600000 0.5412281
0.07138858 0.7188972 0.8600000 0.5412281
0.08566629 0.7184211 0.8571429 0.5467836
0.09994401 0.7106809 0.8257143 0.5681287
0.11422172 0.7063200 0.7971429 0.6154971
0.12849944 0.7063200 0.7971429 0.6154971
0.14277716 0.7063200 0.7971429 0.6154971
0.15705487 0.7063200 0.7971429 0.6154971
0.17133259 0.7063200 0.7971429 0.6154971
0.18561030 0.7063200 0.7971429 0.6154971
0.19988802 0.7063200 0.7971429 0.6154971
0.21416573 0.7063200 0.7971429 0.6154971
0.22844345 0.7063200 0.7971429 0.6154971
0.24272116 0.7063200 0.7971429 0.6154971
0.25699888 0.6473726 0.8371429 0.4576023
0.27127660 0.6473726 0.8371429 0.4576023
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
# Plot model accuracy vs different values of cp (complexity parameter)
plot(model_rpart)
# Best Model Cp (Complexity Parameter)
model_rpart$bestTune
cp
1 0
# Structure of final model selected
model_rpart$finalModel
n= 538
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 538 188 neg (0.65055762 0.34944238)
2) glucose< 127.5 341 64 neg (0.81231672 0.18768328)
4) age< 28.5 188 17 neg (0.90957447 0.09042553) *
5) age>=28.5 153 47 neg (0.69281046 0.30718954)
10) glucose< 99.5 45 4 neg (0.91111111 0.08888889) *
11) glucose>=99.5 108 43 neg (0.60185185 0.39814815)
22) mass< 26.25 18 2 neg (0.88888889 0.11111111) *
23) mass>=26.25 90 41 neg (0.54444444 0.45555556)
46) pedigree< 0.561 64 23 neg (0.64062500 0.35937500)
92) age>=47.5 12 1 neg (0.91666667 0.08333333) *
93) age< 47.5 52 22 neg (0.57692308 0.42307692)
186) pedigree< 0.2 10 1 neg (0.90000000 0.10000000) *
187) pedigree>=0.2 42 21 neg (0.50000000 0.50000000)
374) pedigree>=0.371 19 6 neg (0.68421053 0.31578947) *
375) pedigree< 0.371 23 8 pos (0.34782609 0.65217391)
750) mass>=34.85 9 4 neg (0.55555556 0.44444444) *
751) mass< 34.85 14 3 pos (0.21428571 0.78571429) *
47) pedigree>=0.561 26 8 pos (0.30769231 0.69230769)
94) pregnant< 5 11 4 neg (0.63636364 0.36363636) *
95) pregnant>=5 15 1 pos (0.06666667 0.93333333) *
3) glucose>=127.5 197 73 pos (0.37055838 0.62944162)
6) mass< 29.95 50 16 neg (0.68000000 0.32000000)
12) glucose< 145.5 29 4 neg (0.86206897 0.13793103) *
13) glucose>=145.5 21 9 pos (0.42857143 0.57142857)
26) triceps< 7 14 6 neg (0.57142857 0.42857143) *
27) triceps>=7 7 1 pos (0.14285714 0.85714286) *
7) mass>=29.95 147 39 pos (0.26530612 0.73469388)
14) glucose< 157.5 82 30 pos (0.36585366 0.63414634)
28) pedigree< 0.436 42 21 neg (0.50000000 0.50000000)
56) mass< 41.8 32 12 neg (0.62500000 0.37500000)
112) pressure>=71 24 6 neg (0.75000000 0.25000000) *
113) pressure< 71 8 2 pos (0.25000000 0.75000000) *
57) mass>=41.8 10 1 pos (0.10000000 0.90000000) *
29) pedigree>=0.436 40 9 pos (0.22500000 0.77500000) *
15) glucose>=157.5 65 9 pos (0.13846154 0.86153846) *
# Should have refactored diabetes to make pos as the primary factor level to get straight forward decision tree
rpart.plot::rpart.plot(model_rpart$finalModel, type = 2, fallen.leaves = T, extra = 2, cex = 0.70)
model_list <- list(Random_Forest = model_forest, XGBoost = model_xgb, KNN = model_knn, Logistic_Regression = model_glm, Rpart_DT = model_rpart)
resamples <- resamples(model_list)
#box plot
bwplot(resamples, metric="ROC")
#dot plot
dotplot(resamples, metric="ROC")
#####################################################################################################################################
# Random Forest
#####################################################################################################################################
# prediction on Test data set
pred_rf <- predict(model_forest, test_set)
# Confusion Matrix
cm_rf <- confusionMatrix(pred_rf, test_set$diabetes, positive="pos")
# Prediction Probabilities
pred_prob_rf <- predict(model_forest, test_set, type="prob")
# ROC value
roc_rf <- roc(test_set$diabetes, pred_prob_rf$pos)
# Confusion Matrix for Random Forest Model
cm_rf
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 133 35
pos 17 45
Accuracy : 0.7739
95% CI : (0.7143, 0.8263)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 4.208e-05
Kappa : 0.4741
Mcnemar's Test P-Value : 0.0184
Sensitivity : 0.5625
Specificity : 0.8867
Pos Pred Value : 0.7258
Neg Pred Value : 0.7917
Prevalence : 0.3478
Detection Rate : 0.1957
Detection Prevalence : 0.2696
Balanced Accuracy : 0.7246
'Positive' Class : pos
# ROC Value for Random Forest
roc_rf
Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_rf$pos)
Data: pred_prob_rf$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.827
# AUC - Area under the curve
caTools::colAUC(pred_prob_rf$pos, test_set$diabetes, plotROC = T)
[,1]
neg vs. pos 0.8269583
#####################################################################################################################################
# XGBOOST - eXtreme Gradient BOOSTing
#####################################################################################################################################
# prediction on Test data set
pred_xgb <- predict(model_xgb, test_set)
# Confusion Matrix
cm_xgb <- confusionMatrix(pred_xgb, test_set$diabetes, positive="pos")
# Prediction Probabilities
pred_prob_xgb <- predict(model_xgb, test_set, type="prob")
# ROC value
roc_xgb <- roc(test_set$diabetes, pred_prob_xgb$pos)
# Confusion matrix
cm_xgb
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 141 51
pos 9 29
Accuracy : 0.7391
95% CI : (0.6773, 0.7946)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 0.002949
Kappa : 0.3447
Mcnemar's Test P-Value : 1.203e-07
Sensitivity : 0.3625
Specificity : 0.9400
Pos Pred Value : 0.7632
Neg Pred Value : 0.7344
Prevalence : 0.3478
Detection Rate : 0.1261
Detection Prevalence : 0.1652
Balanced Accuracy : 0.6512
'Positive' Class : pos
# ROC Value for for XGBoost
roc_xgb
Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_xgb$pos)
Data: pred_prob_xgb$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.8155
# AUC - Area under the curve
caTools::colAUC(pred_prob_xgb$pos, test_set$diabetes, plotROC = T)
[,1]
neg vs. pos 0.8155417
#####################################################################################################################################
# KNN - K Nearest Neighbours
#####################################################################################################################################
# prediction on Test data set
pred_knn <- predict(model_knn, test_set)
# Confusion Matrix
cm_knn <- confusionMatrix(pred_knn, test_set$diabetes, positive="pos")
# Prediction Probabilities
pred_prob_knn <- predict(model_knn, test_set, type="prob")
# ROC value
roc_knn <- roc(test_set$diabetes, pred_prob_knn$pos)
# Confusion matrix
cm_knn
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 131 37
pos 19 43
Accuracy : 0.7565
95% CI : (0.6958, 0.8105)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 0.000421
Kappa : 0.4336
Mcnemar's Test P-Value : 0.023103
Sensitivity : 0.5375
Specificity : 0.8733
Pos Pred Value : 0.6935
Neg Pred Value : 0.7798
Prevalence : 0.3478
Detection Rate : 0.1870
Detection Prevalence : 0.2696
Balanced Accuracy : 0.7054
'Positive' Class : pos
# ROC Value for for XGBoost
roc_knn
Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_knn$pos)
Data: pred_prob_knn$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.8056
# AUC - Area under the curve
caTools::colAUC(pred_prob_knn$pos, test_set$diabetes, plotROC = T)
[,1]
neg vs. pos 0.805625
#####################################################################################################################################
# Logistic Regression
#####################################################################################################################################
# prediction on Test data set
pred_glm <- predict(model_glm, test_set)
# Confusion Matrix
cm_glm <- confusionMatrix(pred_glm, test_set$diabetes, positive="pos")
# Prediction Probabilities
pred_prob_glm <- predict(model_glm, test_set, type="prob")
# ROC value
roc_glm <- roc(test_set$diabetes, pred_prob_glm$pos)
# Confusion matrix
cm_glm
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 131 38
pos 19 42
Accuracy : 0.7522
95% CI : (0.6912, 0.8066)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 0.0007075
Kappa : 0.4217
Mcnemar's Test P-Value : 0.0171182
Sensitivity : 0.5250
Specificity : 0.8733
Pos Pred Value : 0.6885
Neg Pred Value : 0.7751
Prevalence : 0.3478
Detection Rate : 0.1826
Detection Prevalence : 0.2652
Balanced Accuracy : 0.6992
'Positive' Class : pos
# ROC Value for for XGBoost
roc_glm
Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_glm$pos)
Data: pred_prob_glm$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.8369
# AUC - Area under the curve
caTools::colAUC(pred_prob_glm$pos, test_set$diabetes, plotROC = T)
[,1]
neg vs. pos 0.8369167
#####################################################################################################################################
# Rpart CART - classification and Regression Trees
#####################################################################################################################################
# prediction on Test data set
pred_rpart <- predict(model_rpart, test_set)
# Confusion Matrix
cm_rpart <- confusionMatrix(pred_rpart, test_set$diabetes, positive="pos")
# Prediction Probabilities
pred_prob_rpart <- predict(model_rpart, test_set, type="prob")
# ROC value
roc_rpart <- roc(test_set$diabetes, pred_prob_rpart$pos)
# Confusion matrix
cm_rpart
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 129 30
pos 21 50
Accuracy : 0.7783
95% CI : (0.719, 0.8302)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 2.232e-05
Kappa : 0.4981
Mcnemar's Test P-Value : 0.2626
Sensitivity : 0.6250
Specificity : 0.8600
Pos Pred Value : 0.7042
Neg Pred Value : 0.8113
Prevalence : 0.3478
Detection Rate : 0.2174
Detection Prevalence : 0.3087
Balanced Accuracy : 0.7425
'Positive' Class : pos
# ROC Value for for XGBoost
roc_rpart
Call:
roc.default(response = test_set$diabetes, predictor = pred_prob_rpart$pos)
Data: pred_prob_rpart$pos in 150 controls (test_set$diabetes neg) < 80 cases (test_set$diabetes pos).
Area under the curve: 0.7902
# AUC - Area under the curve
caTools::colAUC(pred_prob_rpart$pos, test_set$diabetes, plotROC = T)
[,1]
neg vs. pos 0.7901667
result_rf <- c(cm_rf$byClass['Sensitivity'], cm_rf$byClass['Specificity'], cm_rf$byClass['Precision'],
cm_rf$byClass['Recall'], cm_rf$byClass['F1'], roc_rf$auc)
result_xgb <- c(cm_xgb$byClass['Sensitivity'], cm_xgb$byClass['Specificity'], cm_xgb$byClass['Precision'],
cm_xgb$byClass['Recall'], cm_xgb$byClass['F1'], roc_xgb$auc)
result_knn <- c(cm_knn$byClass['Sensitivity'], cm_knn$byClass['Specificity'], cm_knn$byClass['Precision'],
cm_knn$byClass['Recall'], cm_knn$byClass['F1'], roc_knn$auc)
result_glm <- c(cm_glm$byClass['Sensitivity'], cm_glm$byClass['Specificity'], cm_glm$byClass['Precision'],
cm_glm$byClass['Recall'], cm_glm$byClass['F1'], roc_glm$auc)
result_rpart <- c(cm_rpart$byClass['Sensitivity'], cm_rpart$byClass['Specificity'], cm_rpart$byClass['Precision'],
cm_rpart$byClass['Recall'], cm_rpart$byClass['F1'], roc_rpart$auc)
all_results <- data.frame(rbind(result_rf, result_xgb, result_knn, result_glm, result_rpart))
names(all_results) <- c("Sensitivity", "Specificity", "Precision", "Recall", "F1", "AUC")
all_results
Sensitivity Specificity Precision Recall F1 AUC
result_rf 0.5625 0.8866667 0.7258065 0.5625 0.6338028 0.8269583
result_xgb 0.3625 0.9400000 0.7631579 0.3625 0.4915254 0.8155417
result_knn 0.5375 0.8733333 0.6935484 0.5375 0.6056338 0.8056250
result_glm 0.5250 0.8733333 0.6885246 0.5250 0.5957447 0.8369167
result_rpart 0.6250 0.8600000 0.7042254 0.6250 0.6622517 0.7901667
col <- c("#ed3b3b", "#0099ff")
graphics::fourfoldplot(cm_rf$table, color = col, conf.level = 0.95, margin = 1,
main = paste("Random Forest Accuracy(",round(cm_rf$overall[1]*100),"%)", sep = ""))
graphics::fourfoldplot(cm_xgb$table, color = col, conf.level = 0.95, margin = 1,
main = paste("XGB Accuracy(",round(cm_xgb$overall[1]*100),"%)", sep = ""))
graphics::fourfoldplot(cm_knn$table, color = col, conf.level = 0.95, margin = 1,
main = paste("KNN Accuracy(",round(cm_knn$overall[1]*100),"%)", sep = ""))
graphics::fourfoldplot(cm_glm$table, color = col, conf.level = 0.95, margin = 1,
main = paste("Logistic Regression Accuracy(",round(cm_glm$overall[1]*100),"%)", sep = ""))
graphics::fourfoldplot(cm_rpart$table, color = col, conf.level = 0.95, margin = 1,
main = paste("Rpart DT Accuracy(",round(cm_rpart$overall[1]*100),"%)", sep = ""))
We have deployed multiple Machine Learning Models such as - Random Forest, eXtreme Gradient Boosting Machine, K-Nearest Neighbours, Logistic Regression and R part Classification and Regression Trees, and found that Logistic Regression has the best performance based on ROC value.
As a next step, we could go further and try to compare other ML models as well.
Recall: Recall gives us an idea about when it’s actually Yes, how often does it predict Yes.
Precision: Precision tells us about when it predicts Yes, how often is it correct.
Sensitivity: Ability of the test to correctly identify the true positive rate.
Specificity: Ability of the test to correctly identify the true negative rate.
Abbreviations used
ROC: Reciever Operating Characteristics curve
AUC: Area under the ROC curve
AUC denotes the rate of successful classification by the logistic model.