1. Data preprocessing
  2. Data exploration
  3. Model creation
    1. Hold-out split
    2. Logistic Regression model (LR)
    3. K-Nearest Neighbour (k-NN)
    4. Support Vector Machine (SVM)
    5. Neural Network (NN)
  4. Conclusion

Today we are going to fit some of the most widely used classification models and compare how they perform. The models we will create are: a Logistic Regression (LR), a k-Nearest Neighbour classifier (k-NN), a Support Vector Machine (SVM) and a Neural Network (NN).

In order to do our analysis, we will work with the pima dataset from the faraway library, which contains a sample of 768 adult female Pima Indians living near Phoenix. The objective is to build a machine learning tool to diagnose diabetes. Let’s get started.

1. Data preprocessing
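
Throughout the post we will lean on a handful of packages. A minimal setup chunk, assuming all of them are already installed, would be:

library(faraway)    # pima dataset
library(DMwR)       # manyNAs(), knnImputation()
library(dplyr)      # sample_frac(), anti_join(), mutate_all()
library(ggplot2)    # plotting
library(GGally)     # ggpairs()
library(pscl)       # pR2()
library(caret)      # trainControl(), train(), confusionMatrix()
library(class)      # knn()
library(vegan)      # decostand()
library(e1071)      # svm(), tune()
library(neuralnet)  # neuralnet(), compute()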

str(pima)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose  : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ diastolic: int  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps  : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin  : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ bmi      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ diabetes : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ test     : int  1 0 1 0 1 0 1 0 1 1 ...

There are a few things we have to do to clean our dataset. First, we have to turn the zeros into NA (they are not real zeros but missing values), and then we have to convert the test variable into a factor so that it is treated as a class label rather than a number.

pima$glucose[pima$glucose == 0] <- NA
pima$diastolic[pima$diastolic == 0] <- NA
pima$triceps[pima$triceps == 0] <- NA
pima$insulin[pima$insulin == 0] <- NA
pima$bmi[pima$bmi == 0] <- NA
sapply(pima, function(x) sum(is.na(x))) # To count NA values of the sample
##  pregnant   glucose diastolic   triceps   insulin       bmi  diabetes 
##         0         5        35       227       374        11         0 
##       age      test 
##         0         0

Next, we need a dataset without NA values to do our analysis correctly. A good way to solve this is to impute each NA from its k nearest neighbours with knnImputation from the DMwR package, after dropping the rows with too many missing values via manyNAs:

pima <- pima[-manyNAs(pima),] # drop rows with an excessive number of NAs
pima.clean <- knnImputation(pima, k = 10)

The last step will be to transform the test variable into a factor.

pima.clean$test <- as.factor(pima.clean$test)
levels(pima.clean$test) <- c("Negative", "Positive")

table(pima.clean$test)
## 
## Negative Positive 
##      357      177
round(prop.table(table(pima.clean$test)) * 100, digits = 2)
## 
## Negative Positive 
##    66.85    33.15

Now we have our data spotless and ready to fit the models.

2. Data exploration

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() + 
    geom_smooth(method = loess, fill = "red", color = "red", ...) +
    geom_smooth(method = lm, fill = "blue", color = "blue", ...) +
    ggtitle("Matrix of correlations")
  p
}

ggpairs(pima.clean, lower = list(continuous = my_fn))

At a glance, there are some outliers that we have to analyze, for example those in the triceps or insulin variables. Glucose level, number of pregnancies (pregnant) and the triceps measurement look like the most influential variables. triceps and bmi show some outliers which should be checked (perhaps they are errors rather than actual data).

We can also plot all the boxplots of the dataset. This can be done easily with the following code chunk (we won’t run it here for reasons of space):

for (i in 1:8) {
    print(ggplot(pima.clean, aes(x = test, y = pima.clean[, i], col = test)) +
        geom_boxplot() + geom_jitter(shape = 16, position = position_jitter(0.2)) +
        coord_flip() + ylab(names(pima.clean)[i]))
}

As an example, we will plot just one boxplot to show the relationship between the test result and one of the variables, and how the plot would look:

ggplot(pima.clean, aes(x = test, y = bmi)) + geom_boxplot() +
    geom_point(aes(fill = test), alpha = 0.3, size = 2, shape = 21, colour = "grey20",
               position = position_jitter(width = 0.2, height = 0.1)) +
    coord_flip() + ggtitle("BMI vs. diabetes test result")

3. Model creation

3.1. Hold-out split

Before any analysis, we have to do the hold-out split (80% training, 20% test) so that we can assess the predictive performance of each model on unseen data.

set.seed(1)
pima.train <- sample_frac(tbl = pima.clean, replace = FALSE, size = 0.80)
pima.test <- anti_join(pima.clean, pima.train) # the rows not sampled into the training set

pima.train.lab <- pima.train[,9]
pima.test.lab <- pima.test[,9]

pima.train.data <- pima.train[,1:8]
pima.test.data <- pima.test[,1:8]

3.2. Logistic Regression model (LR)

A simple and fast method to classify binary outcomes such as diagnostic test results.
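
The model estimates the probability of a positive test through the logit link: \(\Pr(\text{test} = \text{Positive} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}\), so each coefficient acts additively on the log-odds scale.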

LR <- glm(test ~ ., family = binomial(link='logit'), data = pima.train)
par(mfrow = c(2, 2))
plot(LR)

summary(LR)
## 
## Call:
## glm(formula = test ~ ., family = binomial(link = "logit"), data = pima.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5151  -0.6129  -0.3250   0.5103   2.4654  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.315382   1.214292  -8.495  < 2e-16 ***
## pregnant      0.125569   0.048459   2.591 0.009563 ** 
## glucose       0.038702   0.005762   6.717 1.86e-11 ***
## diastolic    -0.001471   0.012604  -0.117 0.907097    
## triceps       0.008856   0.017319   0.511 0.609107    
## insulin       0.001848   0.001627   1.136 0.255914    
## bmi           0.071924   0.029211   2.462 0.013806 *  
## diabetes      1.447537   0.424914   3.407 0.000658 ***
## age           0.018257   0.015447   1.182 0.237243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 537.38  on 426  degrees of freedom
## Residual deviance: 346.59  on 418  degrees of freedom
## AIC: 364.59
## 
## Number of Fisher Scoring iterations: 5
pR2(LR)
##          llh      llhNull           G2     McFadden         r2ML 
## -173.2965476 -268.6884522  190.7838093    0.3550279    0.3603285 
##         r2CU 
##    0.5033094

The model has a McFadden \(R^2\) of 0.36, which is a rather poor outcome.
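
For reference, McFadden’s \(R^2\) compares the log-likelihood of the fitted model against that of the intercept-only model: \(R^2_{\text{McF}} = 1 - \frac{\log \hat{L}}{\log \hat{L}_0}\). With the values above, \(1 - (-173.30)/(-268.69) \approx 0.355\).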

In addition to this, the model has some outliers which are distorting the fit. Furthermore, not all the variables are statistically significant. We can make a deeper analysis with the step function.
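
By default, step performs backward selection, dropping variables while the AIC decreases. The AIC penalizes model complexity: \(\text{AIC} = 2k - 2\log \hat{L}\), where \(k\) is the number of estimated parameters; for the full model, \(2 \times 9 - 2 \times (-173.30) \approx 364.6\), matching the summary above.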

step(LR, trace = 0)
## 
## Call:  glm(formula = test ~ pregnant + glucose + bmi + diabetes, family = binomial(link = "logit"), 
##     data = pima.train)
## 
## Coefficients:
## (Intercept)     pregnant      glucose          bmi     diabetes  
##   -10.33715      0.16503      0.04298      0.08348      1.49359  
## 
## Degrees of Freedom: 426 Total (i.e. Null);  422 Residual
## Null Deviance:       537.4 
## Residual Deviance: 349.8     AIC: 359.8

The model suggested is test ~ pregnant + glucose + bmi + diabetes. Now we will refit the model with those variables, keeping age in as well:

LR2 <- glm(test ~ pregnant + glucose + bmi + diabetes + age, family = binomial(link='logit'), data = pima.train)
par(mfrow = c(2, 2))
plot(LR2)


The model still seems to underfit. Let’s compare it with the previous LR model.

pR2(LR2)
##          llh      llhNull           G2     McFadden         r2ML 
## -174.0839010 -268.6884522  189.2091026    0.3520976    0.3579651 
##         r2CU 
##    0.5000082
AIC(LR, LR2)
##     df      AIC
## LR   9 364.5931
## LR2  6 360.1678

The variance explained is still too low. We have to try another model.

3.3. K-Nearest Neighbour (k-NN)

First up, we have to standardize the values so that variables measured on very different scales become comparable.
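
The reason is that k-NN classifies each observation by majority vote among its \(k\) closest neighbours, typically under the Euclidean distance \(d(x, x') = \sqrt{\sum_j (x_j - x'_j)^2}\); without rescaling, wide-ranging variables such as insulin would dominate the distance.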

pima.train.norm <- decostand(pima.train.data, "normalize") # decostand() comes from the vegan package
pima.test.norm <- decostand(pima.test.data, "normalize")

We have to remember that the k-NN model is sensitive to outliers, since they can distort the distances (and therefore the neighbourhoods) very easily. We should check again whether we can remove any of them.

There are many libraries to do a k-NN analysis. This time we will use the caret library, which gives us the trainControl function to set up a cross-validation scheme beforehand and find the optimal value of \(k\).

cv <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = T, summaryFunction = twoClassSummary)

(knn.pima <- train(test ~ ., data = pima.train, method = "knn", preProcess = c("center", "scale"), trControl = cv, metric = "ROC", tuneLength = 10))
## k-Nearest Neighbors 
## 
## 427 samples
##   8 predictor
##   2 classes: 'Negative', 'Positive' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 385, 385, 384, 384, 384, 384, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    5  0.8379911  0.8766831  0.5565934
##    7  0.8468237  0.8743432  0.5877289
##    9  0.8498677  0.8777915  0.5780220
##   11  0.8499524  0.8766831  0.5727106
##   13  0.8542891  0.8766420  0.5728938
##   15  0.8592811  0.8916256  0.5589744
##   17  0.8642064  0.8974138  0.5516484
##   19  0.8663315  0.9032020  0.5615385
##   21  0.8690401  0.9032020  0.5520147
##   23  0.8691843  0.9089491  0.5349817
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.

The best value of \(k\) is 23.

knn.pima <- knn(train = pima.train.norm, test = pima.test.norm, cl = pima.train.lab, k = 23, prob = TRUE) # knn() from the class package

And we can check the accuracy very easily with mean:

mean(knn.pima == pima.test.lab)
## [1] 0.7009346

The model correctly predicted about 70% of the test results.
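
Beyond raw accuracy, a quick cross-tabulation of predictions against the true labels shows where the errors concentrate. A minimal sketch (output not shown):

table(Predicted = knn.pima, Actual = pima.test.lab)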

3.4. Support Vector Machine (SVM)

The SVM is perhaps the most widely used machine learning algorithm in the field of diagnostics. It is a robust method which, among other things, is not affected by multicollinearity issues.
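
In its soft-margin form, the SVM looks for the hyperplane that maximizes the margin between the two classes while tolerating some misclassification: \(\min_{w, b, \xi} \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i\) subject to \(y_i(w^\top x_i + b) \ge 1 - \xi_i\) and \(\xi_i \ge 0\), where the cost parameter \(C\) controls that trade-off.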

We will first try to fit a model with a linear kernel.

x.pima <- subset(pima.train, select=-test)
y.pima <- pima.train.lab

pima.svm <- svm(x.pima, y.pima, kernel = "linear")
summary(pima.svm)
## 
## Call:
## svm.default(x = x.pima, y = y.pima, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.125 
## 
## Number of Support Vectors:  191
## 
##  ( 97 94 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Negative Positive
pima.svm.pred <- predict(pima.svm, x.pima)
prop.table(table(pima.svm.pred, y.pima))
##              y.pima
## pima.svm.pred   Negative   Positive
##      Negative 0.61124122 0.12177986
##      Positive 0.06557377 0.20140515
agree <- pima.svm.pred == y.pima
mean(agree)
## [1] 0.8126464

Our model misclassified about 19% of the training data, a result in the same range as the k-NN model. Let’s use the tune function to brush up our model with cross-validation.

(svm.tune <- tune(svm, train.x = x.pima, train.y = y.pima, 
              kernel = "radial", ranges = list(cost = 10^(-1:2), gamma = c(.5,1,2))))
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.2343854

The suggested SVM parameters are \(cost = 1\) and \(\gamma = 0.5\).
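
The radial kernel replaces the inner product with \(K(x, x') = e^{-\gamma \lVert x - x' \rVert^2}\), so \(\gamma\) controls how far the influence of each support vector reaches (larger values give more flexible decision boundaries).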

pima.svm2 <- svm(x.pima, y.pima, kernel = "radial", cost = 1, gamma = 0.5)
summary(pima.svm2)
## 
## Call:
## svm.default(x = x.pima, y = y.pima, kernel = "radial", gamma = 0.5, 
##     cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  309
## 
##  ( 174 135 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Negative Positive
pima.svm.pred2 <- predict(pima.svm2, x.pima)
prop.table(table(pima.svm.pred2, y.pima))
##               y.pima
## pima.svm.pred2   Negative   Positive
##       Negative 0.66276347 0.03981265
##       Positive 0.01405152 0.28337237
agree2 <- pima.svm.pred2 == y.pima
mean(agree2)
## [1] 0.9461358

The model has improved significantly with the radial kernel and the parameters suggested by the tune function.
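
Bear in mind that this accuracy is computed on the training data. A sensible sanity check, sketched here with the hold-out objects created earlier (output not shown), is to score the tuned model on the test split:

pima.svm2.test <- predict(pima.svm2, pima.test.data) # predictions on unseen data
mean(pima.svm2.test == pima.test.lab)                # hold-out accuracy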

3.5. Neural Network (NN)

Although the neural network model received little attention in its early days, it has recently gained notoriety thanks to the popularization of big data tools like Hadoop.

First up, we have to normalize the variables. Otherwise we cannot compare between them, since they live on very different scales.
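
The usual choice is min-max scaling, \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\), which maps every variable onto \([0, 1]\); the scale01 helper below implements exactly this.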

scale01 <- function(x){
  (x - min(x)) / (max(x) - min(x))
}

pima2 <- pima.clean
pima2[,9] <- unclass(pima2[,9]) # factor -> numeric so the label can be rescaled too

pima.norm <- pima2 %>%
  mutate_all(scale01)

pima.size <- floor(0.75 * nrow(pima.norm))
train <- sample(seq_len(nrow(pima.norm)), size = pima.size)

pima.train.n <- pima.norm[train, ]
pima.test.n <- pima.norm[-train, ]

Once we have normalized our data, we are going to fit the neural network. There is no fixed rule for choosing the number of hidden nodes, but as a general rule it should be around 2/3 of the number of input nodes, so let’s create an 8-4-1 neural network.

pima.nn <- neuralnet(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age, hidden = 4, data = pima.train.n, linear.output = TRUE)

A good way to see how our model performs is to calculate the correlation between predictions and actual data.

predict.nn <- compute(pima.nn, pima.test.n[,1:8])$net.result
cor(predict.nn, pima.test.n$test)
##             [,1]
## [1,] 0.429594926

As we can see here, a correlation of 0.43 represents only a moderate association. However, the model might still be improved.

pima.nn2 <- neuralnet(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age, hidden = 5, data = pima.train.n, linear.output = TRUE)

plot(pima.nn2)
summary(pima.nn2)
##                     Length Class      Mode    
## call                   5   -none-     call    
## response             400   -none-     numeric 
## covariate           3200   -none-     numeric 
## model.list             2   -none-     list    
## err.fct                1   -none-     function
## act.fct                1   -none-     function
## linear.output          1   -none-     logical 
## data                   9   data.frame list    
## net.result             1   -none-     list    
## weights                1   -none-     list    
## startweights           1   -none-     list    
## generalized.weights    1   -none-     list    
## result.matrix         54   -none-     numeric
predict.nn2 <- compute(pima.nn2, pima.test.n[,1:8])$net.result
cor(predict.nn2, pima.test.n$test)
##                [,1]
## [1,] -0.05857173873

The model didn’t improve by adding more hidden nodes. We will continue with the first model and compute its mean squared error on the test set, rescaling the predictions and labels back to their original range:

pr.nn <- compute(pima.nn, pima.test.n[ ,1:8])
# Rescale predictions and labels back from [0, 1] to the original range
pr.nn_ <- pr.nn$net.result*(max(pima2$test)-min(pima2$test))+min(pima2$test)
test.r <- (pima.test.n$test)*(max(pima2$test)-min(pima2$test))+min(pima2$test)
(MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(pima.test.n))
## [1] 0.2051463725
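
The quantity computed above is the mean squared error of the predictions, \(\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\); the lower it is, the better the model fits the test data.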

4. Conclusion

As expected, the best model was the SVM. This is due to its reliability and its robustness to interdependence, redundancy, irrelevant variables and sample size (especially when the sample is small). To explore the accuracy of the model further, we can calculate a confusion matrix.

caret::confusionMatrix(pima.svm.pred2, y.pima)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative      283       17
##   Positive        6      121
##                                                 
##                Accuracy : 0.9461358             
##                  95% CI : (0.920271, 0.965552)  
##     No Information Rate : 0.676815              
##     P-Value [Acc > NIR] : < 0.000000000000000222
##                                                 
##                   Kappa : 0.8742558             
##  Mcnemar's Test P-Value : 0.03705622            
##                                                 
##             Sensitivity : 0.9792388             
##             Specificity : 0.8768116             
##          Pos Pred Value : 0.9433333             
##          Neg Pred Value : 0.9527559             
##              Prevalence : 0.6768150             
##          Detection Rate : 0.6627635             
##    Detection Prevalence : 0.7025761             
##       Balanced Accuracy : 0.9280252             
##                                                 
##        'Positive' Class : Negative              
## 

As we can see, the Kappa coefficient (that is to say, the model’s hit rate after subtracting the hits expected by chance) is 0.87, which is quite a good result.
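
Formally, \(\kappa = \frac{p_o - p_e}{1 - p_e}\), where \(p_o\) is the observed accuracy and \(p_e\) the accuracy expected by chance from the marginal frequencies of the table; here \((0.946 - 0.572)/(1 - 0.572) \approx 0.874\).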

ctable <- table(pima.svm.pred2, y.pima)
fourfoldplot(ctable, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Confusion Matrix")

set.seed(1)

cm <- confusionMatrix(data = pima.svm.pred2, reference = y.pima)

draw_confusion_matrix <- function(cm) {
  layout(matrix(c(1,1,2)))  # top panel twice the height of the bottom one
  par(mar=c(2,2,2,2))
  # Top panel: the 2x2 confusion matrix drawn as coloured rectangles
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('Confusion Matrix', cex.main = 2)

  rect(150, 430, 240, 370, col = '#3F97D0')
  text(195, 435, 'Class1', cex = 1.2)
  rect(250, 430, 340, 370, col = '#F7AD50')
  text(295, 435, 'Class2', cex = 1.2)
  text(125, 370, 'Predicted', cex = 1.3, srt = 90, font = 2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col = '#F7AD50')
  rect(250, 305, 340, 365, col = '#3F97D0')
  text(140, 400, 'Class1', cex = 1.2, srt = 90)
  text(140, 335, 'Class2', cex = 1.2, srt = 90)

  res <- as.numeric(cm$table)  # cell counts in column-major order
  text(195, 400, res[1], cex = 1.6, font = 2, col = 'white')
  text(195, 335, res[2], cex = 1.6, font = 2, col = 'white')
  text(295, 400, res[3], cex = 1.6, font = 2, col = 'white')
  text(295, 335, res[4], cex = 1.6, font = 2, col = 'white')

  # Bottom panel: selected statistics from cm$byClass and cm$overall
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt = 'n', yaxt = 'n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  
draw_confusion_matrix(cm)