Today we are going to fit some of the most widely used classification models and compare how they perform. The models we will create are: logistic regression, k-nearest neighbours (k-NN), support vector machine (SVM) and a neural network.
In order to do our analysis, we will work with the pima dataset from the faraway library, which contains a sample of 768 adult female Pima Indians living near Phoenix. The objective is to create a machine learning tool to diagnose diabetes. Let’s get started.
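The chunks below assume a setup along these lines (not shown explicitly in the post; the exact package list is inferred from the functions used later):
library(faraway)    # pima dataset
library(DMwR)       # manyNAs, knnImputation
library(dplyr)      # sample_frac, anti_join, mutate_all
library(ggplot2)
library(GGally)     # ggpairs
library(pscl)       # pR2
library(caret)      # trainControl, train, confusionMatrix
library(class)      # knn
library(e1071)      # svm, tune
library(vegan)      # decostand
library(neuralnet)  # neuralnet, compute
data(pima)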
str(pima)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant : int 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ diastolic: int 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : int 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ bmi : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ diabetes : num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ test : int 1 0 1 0 1 0 1 0 1 1 ...
There are a few things we have to do to clean the data. First, the zeros in several variables have to be recoded as NA (they are not real zeros but missing values), and then the test variable has to be converted into a factor so it does not cause problems later on.
pima$glucose[pima$glucose == 0] <- NA
pima$diastolic[pima$diastolic == 0] <- NA
pima$triceps[pima$triceps == 0] <- NA
pima$insulin[pima$insulin == 0] <- NA
pima$bmi[pima$bmi == 0] <- NA
sapply(pima, function(x) sum(is.na(x))) # To count NA values of the sample
## pregnant glucose diastolic triceps insulin bmi diabetes
## 0 5 35 227 374 11 0
## age test
## 0 0
Next we need a dataset without NA values to do our analysis correctly. A good way to achieve this is to impute each missing value from its k nearest neighbours with the knnImputation function from the DMwR package.
pima <- pima[-manyNAs(pima), ]             # drop rows with too many missing values
pima.clean <- knnImputation(pima, k = 10)  # impute the rest from the 10 nearest neighbours
The last step will be to transform the test variable into a factor.
pima.clean$test <- as.factor(pima.clean$test)
levels(pima.clean$test) <- c("Negative", "Positive")
table(pima.clean$test)
##
## Negative Positive
## 357 177
round(prop.table(table(pima.clean$test)) * 100, digits = 2)
##
## Negative Positive
## 66.85 33.15
Now we have our data spotless and ready to fit the models.
my_fn <- function(data, mapping, ...){
  # Scatterplot panel with both a loess (red) and a linear (blue) smoother
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = loess, fill = "red", color = "red", ...) +
    geom_smooth(method = lm, fill = "blue", color = "blue", ...) +
    ggtitle("Matrix of correlations")
  p
}
ggpairs(pima.clean, lower = list(continuous = my_fn))
At a glance, there are some outliers that we should analyse, for example in the triceps and insulin variables. Glucose levels, the number of pregnancies (pregnant) and the triceps measurement appear to be the variables most related to the test result. triceps and bmi in particular show some outliers which should be checked (perhaps they are errors rather than actual data).
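One quick way to flag the extreme values mentioned above is a simple 1.5 × IQR rule (a rough sketch; the cut-off and the chosen columns are our own assumption, not part of the original analysis):
# Count values outside 1.5 * IQR of the quartiles for the flagged variables
sapply(pima.clean[, c("triceps", "insulin", "bmi")], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})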
We can also plot boxplots for all the variables in the dataset. This can be done easily with the following code chunk (we won’t run it here for reasons of space):
for (i in 1:8) {
  print(ggplot(pima.clean, aes(x = test, y = pima.clean[, i], col = test)) +
          geom_boxplot() +
          geom_jitter(shape = 16, position = position_jitter(0.2)) +
          coord_flip())
}
As an example, we will plot just one boxplot to show the relationship between the test result and one of the variables, and how the plot would look:
ggplot(pima.clean, aes(x = test, y = bmi)) +
  geom_boxplot() +
  geom_point(aes(fill = test), alpha = 0.3, size = 2, shape = 21, colour = "grey20",
             position = position_jitter(width = 0.2, height = 0.1)) +
  coord_flip() +
  ggtitle("BMI vs. diabetes test result")
Before any analysis, we have to perform a hold-out split so that we can test the predictive performance of each model.
set.seed(1)
pima.train <- sample_frac(tbl = pima.clean, replace = FALSE, size = 0.80)  # 80% for training
pima.test <- anti_join(pima.clean, pima.train)                             # remaining 20% for testing
pima.train.lab <- pima.train[, 9]
pima.test.lab <- pima.test[, 9]
pima.train.data <- pima.train[, 1:8]
pima.test.data <- pima.test[, 1:8]
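It is worth a quick check that the split roughly preserves the class balance (an extra check, not in the original post; output not shown):
round(prop.table(table(pima.train$test)) * 100, digits = 2)
round(prop.table(table(pima.test$test)) * 100, digits = 2)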
Logistic regression is a simple and fast method to classify binary outcomes such as diagnostic test results.
LR <- glm(test ~ ., family = binomial(link='logit'), data = pima.train)
par(mfrow = c(2, 2))
plot(LR)
summary(LR)
##
## Call:
## glm(formula = test ~ ., family = binomial(link = "logit"), data = pima.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5151 -0.6129 -0.3250 0.5103 2.4654
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.315382 1.214292 -8.495 < 2e-16 ***
## pregnant 0.125569 0.048459 2.591 0.009563 **
## glucose 0.038702 0.005762 6.717 1.86e-11 ***
## diastolic -0.001471 0.012604 -0.117 0.907097
## triceps 0.008856 0.017319 0.511 0.609107
## insulin 0.001848 0.001627 1.136 0.255914
## bmi 0.071924 0.029211 2.462 0.013806 *
## diabetes 1.447537 0.424914 3.407 0.000658 ***
## age 0.018257 0.015447 1.182 0.237243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 537.38 on 426 degrees of freedom
## Residual deviance: 346.59 on 418 degrees of freedom
## AIC: 364.59
##
## Number of Fisher Scoring iterations: 5
pR2(LR)
## llh llhNull G2 McFadden r2ML
## -173.2965476 -268.6884522 190.7838093 0.3550279 0.3603285
## r2CU
## 0.5033094
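For reference, the McFadden \(R^2\) reported by pR2 is one minus the ratio of the model and null log-likelihoods. We can reproduce it by hand (a quick cross-check, not in the original post; LR.null is just a helper name for the intercept-only model):
LR.null <- glm(test ~ 1, family = binomial(link = 'logit'), data = pima.train)
1 - as.numeric(logLik(LR)) / as.numeric(logLik(LR.null))  # McFadden's R^2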
The model has a McFadden \(R^2\) of roughly 0.36, which is a fairly poor outcome.
In addition to this, the model has some outliers which distort the fit. Furthermore, not all the variables are statistically significant. We can make a deeper analysis with the step function.
step(LR, trace = 0)
##
## Call: glm(formula = test ~ pregnant + glucose + bmi + diabetes, family = binomial(link = "logit"),
## data = pima.train)
##
## Coefficients:
## (Intercept) pregnant glucose bmi diabetes
## -10.33715 0.16503 0.04298 0.08348 1.49359
##
## Degrees of Freedom: 426 Total (i.e. Null); 422 Residual
## Null Deviance: 537.4
## Residual Deviance: 349.8 AIC: 359.8
The model suggested by step is test ~ pregnant + glucose + bmi + diabetes. We will now refit the model with those variables, keeping age as well:
LR2 <- glm(test ~ pregnant + glucose + bmi + diabetes + age, family = binomial(link='logit'), data = pima.train)
par(mfrow = c(2, 2))
plot(LR2)
summary(LR2)
The model still seems to underfit the data. Let’s compare it with the previous LR model.
pR2(LR2)
## llh llhNull G2 McFadden r2ML
## -174.0839010 -268.6884522 189.2091026 0.3520976 0.3579651
## r2CU
## 0.5000082
AIC(LR, LR2)
## df AIC
## LR 9 364.5931
## LR2 6 360.1678
The variance explained is still too low, so we will try another type of model.
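Before moving on, we could also score the logistic model on the hold-out set, so it is directly comparable with the models below (a sketch, not run in the original post; the 0.5 cut-off is our own choice):
lr.prob <- predict(LR2, newdata = pima.test, type = "response")  # predicted probabilities
lr.pred <- ifelse(lr.prob > 0.5, "Positive", "Negative")         # classify at 0.5
mean(lr.pred == pima.test.lab)                                   # hold-out accuracy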
First up, for the k-nearest neighbours (k-NN) model we have to standardise the values so that the variables are on comparable scales.
pima.train.norm <- decostand(pima.train.data, "normalize")
pima.test.norm <- decostand(pima.test.data, "normalize")
We have to remember that the k-NN model is sensitive to outliers, since they can easily distort which neighbours are selected. We should check again whether any of them can be removed.
There are many libraries for k-NN analysis. This time we will use the caret package, which lets us define a trainControl object to perform cross-validation beforehand and find the optimal value of \(k\).
cv <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                   classProbs = TRUE, summaryFunction = twoClassSummary)
(knn.pima <- train(test ~ ., data = pima.train, method = "knn",
                   preProcess = c("center", "scale"), trControl = cv,
                   metric = "ROC", tuneLength = 10))
## k-Nearest Neighbors
##
## 427 samples
## 8 predictor
## 2 classes: 'Negative', 'Positive'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 385, 385, 384, 384, 384, 384, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8379911 0.8766831 0.5565934
## 7 0.8468237 0.8743432 0.5877289
## 9 0.8498677 0.8777915 0.5780220
## 11 0.8499524 0.8766831 0.5727106
## 13 0.8542891 0.8766420 0.5728938
## 15 0.8592811 0.8916256 0.5589744
## 17 0.8642064 0.8974138 0.5516484
## 19 0.8663315 0.9032020 0.5615385
## 21 0.8690401 0.9032020 0.5520147
## 23 0.8691843 0.9089491 0.5349817
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
The best value of \(k\) is 23.
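We can also visualise how the cross-validated ROC changes with \(k\) (an optional check, not in the original post; output not shown):
plot(knn.pima)  # cross-validated ROC as a function of the number of neighbours
With \(k = 23\) fixed, we refit the final k-NN model on the normalised data: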
knn.pred <- knn(train = pima.train.norm, test = pima.test.norm, cl = pima.train.lab, k = 23, prob = TRUE)
And we can check the accuracy very easily with mean:
mean(knn.pred == pima.test.lab)
## [1] 0.7009346
The model correctly predicted about 70% of the test results.
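For a more detailed view than the overall accuracy, we can also cross-tabulate the k-NN predictions against the true labels (an extra check, output omitted):
table(Predicted = knn.pred, Actual = pima.test.lab)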
The support vector machine (SVM) is perhaps the most widely used machine learning algorithm in the field of diagnostics. It is a robust method which, among other things, is not affected by multicollinearity.
We will first try to fit a model with a linear kernel.
x.pima <- subset(pima.train, select=-test)
y.pima <- pima.train.lab
pima.svm <- svm(x.pima, y.pima, kernel = "linear")
summary(pima.svm)
##
## Call:
## svm.default(x = x.pima, y = y.pima, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 191
##
## ( 97 94 )
##
##
## Number of Classes: 2
##
## Levels:
## Negative Positive
pima.svm.pred <- predict(pima.svm, x.pima)
prop.table(table(pima.svm.pred, y.pima))
## y.pima
## pima.svm.pred Negative Positive
## Negative 0.61124122 0.12177986
## Positive 0.06557377 0.20140515
agree <- pima.svm.pred == y.pima
mean(agree)
## [1] 0.8126464
Our model misclassified about 19% of the training data, somewhat better than the k-NN result. Let’s use the tune function to fine-tune the model with a cross-validation method.
(svm.tune <- tune(svm, train.x = x.pima, train.y = y.pima,
kernel = "radial", ranges = list(cost = 10^(-1:2), gamma = c(.5,1,2))))
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.5
##
## - best performance: 0.2343854
The suggested SVM model uses \(cost = 1\) and \(\gamma = 0.5\).
pima.svm2 <- svm(x.pima, y.pima, kernel = "radial", cost = 1, gamma = 0.5)
summary(pima.svm2)
##
## Call:
## svm.default(x = x.pima, y = y.pima, kernel = "radial", gamma = 0.5,
## cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 309
##
## ( 174 135 )
##
##
## Number of Classes: 2
##
## Levels:
## Negative Positive
pima.svm.pred2 <- predict(pima.svm2, x.pima)
prop.table(table(pima.svm.pred2, y.pima))
## y.pima
## pima.svm.pred2 Negative Positive
## Negative 0.66276347 0.03981265
## Positive 0.01405152 0.28337237
agree2 <- pima.svm.pred2 == y.pima
mean(agree2)
## [1] 0.9461358
The model has improved significantly with a different kernel and the parameters suggested by the tune function.
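Bear in mind that the accuracies above are computed on the training data. As an extra check (not in the original post), we could also score the tuned SVM on the hold-out set:
pima.svm2.test <- predict(pima.svm2, pima.test.data)  # predictions on the 20% hold-out
mean(pima.svm2.test == pima.test.lab)                 # hold-out accuracy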
Although neural networks were of little practical relevance at first, they have recently gained popularity thanks to the spread of big data tools such as Hadoop.
First up, we have to normalise the variables; otherwise we cannot compare them, since their scales are quite different.
scale01 <- function(x){
  (x - min(x)) / (max(x) - min(x))   # min-max scaling to the [0, 1] range
}
pima2 <- pima.clean
pima2[, 9] <- unclass(pima2[, 9])    # recode the test factor as numeric
pima.norm <- pima2 %>%
  mutate_all(scale01)
pima.size <- floor(0.75 * nrow(pima.norm))
train <- sample(seq_len(nrow(pima.norm)), size = pima.size)
pima.train.n <- pima.norm[train, ]
pima.test.n <- pima.norm[-train, ]
Once we have normalised our data, we are going to fit the neural network. There is no fixed rule for choosing the number of hidden nodes, but as a rough guideline it should be around 2/3 of the number of input nodes, so let’s create an 8-4-1 neural network.
pima.nn <- neuralnet(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age, hidden = 4, data = pima.train.n, linear.output = TRUE)
A good way to see how our model performs is to calculate the correlation between predictions and actual data.
predict.nn <- compute(pima.nn, pima.test.n[,1:8])$net.result
cor(predict.nn, pima.test.n$test)
## [,1]
## [1,] 0.429594926
As we can see, a correlation of about 0.43 is only moderate. Let’s see whether the model can be improved.
pima.nn2 <- neuralnet(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age, hidden = 5, data = pima.train.n, linear.output = TRUE)
plot(pima.nn2)
summary(pima.nn2)
## Length Class Mode
## call 5 -none- call
## response 400 -none- numeric
## covariate 3200 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 9 data.frame list
## net.result 1 -none- list
## weights 1 -none- list
## startweights 1 -none- list
## generalized.weights 1 -none- list
## result.matrix 54 -none- numeric
predict.nn2 <- compute(pima.nn2, pima.test.n[,1:8])$net.result
cor(predict.nn2, pima.test.n$test)
## [,1]
## [1,] -0.05857173873
The model does not improve when we add more hidden nodes, so we will continue with the first model.
pr.nn <- compute(pima.nn, pima.test.n[, 1:8])
# Rescale the predictions and the observed test values back to the original coding
pr.nn_ <- pr.nn$net.result * (max(pima2$test) - min(pima2$test)) + min(pima2$test)
test.r <- (pima.test.n$test) * (max(pima2$test) - min(pima2$test)) + min(pima2$test)
(MSE.nn <- sum((test.r - pr.nn_)^2) / nrow(pima.test.n))
## [1] 0.2051463725
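To put the network on the same footing as the other classifiers, we could also threshold its output at 0.5 and compute a plain accuracy (an extra check, not in the original post; the 0.5 cut-off is our own assumption):
nn.class <- ifelse(pr.nn$net.result > 0.5, 1, 0)  # convert scores to 0/1 classes
mean(nn.class == pima.test.n$test)                # accuracy on the scaled test labels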
As expected, the best model was the SVM. This is due to its reliability and its robustness to interdependence, redundancy, irrelevant variables and small sample sizes. To explore the accuracy of the model further, we can compute a confusion matrix.
caret::confusionMatrix(pima.svm.pred2, y.pima)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 283 17
## Positive 6 121
##
## Accuracy : 0.9461358
## 95% CI : (0.920271, 0.965552)
## No Information Rate : 0.676815
## P-Value [Acc > NIR] : < 0.000000000000000222
##
## Kappa : 0.8742558
## Mcnemar's Test P-Value : 0.03705622
##
## Sensitivity : 0.9792388
## Specificity : 0.8768116
## Pos Pred Value : 0.9433333
## Neg Pred Value : 0.9527559
## Prevalence : 0.6768150
## Detection Rate : 0.6627635
## Detection Prevalence : 0.7025761
## Balanced Accuracy : 0.9280252
##
## 'Positive' Class : Negative
##
As we can see, the Kappa coefficient (that is, the model’s agreement after subtracting the agreement expected by chance) is 0.87, which is quite a good result.
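As a sanity check (not in the original post), Cohen’s kappa can be reproduced by hand from the confusion table; it should match the 0.87 reported above:
tab <- table(pima.svm.pred2, y.pima)
po <- sum(diag(tab)) / sum(tab)                      # observed agreement
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
(po - pe) / (1 - pe)                                 # Cohen's kappa
We can also draw the confusion matrix graphically: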
ctable <- table(pima.svm.pred2, y.pima)
fourfoldplot(ctable, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "Confusion Matrix")
set.seed(1)
cm <- confusionMatrix(data = pima.svm.pred2, reference = y.pima)
draw_confusion_matrix <- function(cm) {
  layout(matrix(c(1, 1, 2)))
  par(mar = c(2, 2, 2, 2))
  # First panel: the 2 x 2 confusion matrix itself
  plot(c(100, 345), c(300, 450), type = "n", xlab = "", ylab = "", xaxt = 'n', yaxt = 'n')
  title('Confusion Matrix', cex.main = 2)
  rect(150, 430, 240, 370, col = '#3F97D0')
  text(195, 435, 'Class1', cex = 1.2)
  rect(250, 430, 340, 370, col = '#F7AD50')
  text(295, 435, 'Class2', cex = 1.2)
  text(125, 370, 'Predicted', cex = 1.3, srt = 90, font = 2)
  text(245, 450, 'Actual', cex = 1.3, font = 2)
  rect(150, 305, 240, 365, col = '#F7AD50')
  rect(250, 305, 340, 365, col = '#3F97D0')
  text(140, 400, 'Class1', cex = 1.2, srt = 90)
  text(140, 335, 'Class2', cex = 1.2, srt = 90)
  # Fill in the four cell counts
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex = 1.6, font = 2, col = 'white')
  text(195, 335, res[2], cex = 1.6, font = 2, col = 'white')
  text(295, 400, res[3], cex = 1.6, font = 2, col = 'white')
  text(295, 335, res[4], cex = 1.6, font = 2, col = 'white')
  # Second panel: per-class and overall statistics from the caret object
  plot(c(100, 0), c(100, 0), type = "n", xlab = "", ylab = "", main = "DETAILS", xaxt = 'n', yaxt = 'n')
  text(10, 85, names(cm$byClass[1]), cex = 1.2, font = 2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex = 1.2)
  text(30, 85, names(cm$byClass[2]), cex = 1.2, font = 2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex = 1.2)
  text(50, 85, names(cm$byClass[5]), cex = 1.2, font = 2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex = 1.2)
  text(70, 85, names(cm$byClass[6]), cex = 1.2, font = 2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex = 1.2)
  text(90, 85, names(cm$byClass[7]), cex = 1.2, font = 2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex = 1.2)
  text(30, 35, names(cm$overall[1]), cex = 1.5, font = 2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex = 1.4)
  text(70, 35, names(cm$overall[2]), cex = 1.5, font = 2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex = 1.4)
}
draw_confusion_matrix(cm)