Spot_Check

Spot Checking for ML Algorithms in R

This document is based on the blog written by Jason Brownlee: Spot Check Machine Learning Algorithms in R. I think that this is a very useful one as the starting point for any new ML project. The original blog is very well organized, here I tried to make some extra notes in each step.

The Best Algorithm for Your Dataset

Jason defined the term "Spot Checking" as a process that using trial and error to discover a short list of algorithms that do well on the problem.

Which Algorithms to Spot Check

Jason suggested trying a mixture of algorithms and see which is good at picking out the structure in the data.

  • A mixture of algorithm representations (e.g. instances and trees).
  • A mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
  • A mixture of modeling types (e.g., linear and non-linear function or parametric and non-parametric).

In following sections, we listed several linear and nonlinear algorithms one should spot check on his own project in R. It excludes ensemble algorithms such as boosting and bagging.

Each algorithm will be presented in two ways.

  • The package itself and function used to train and predict.
  • The caret wrapper for the algorithm and its usage.

The reason putting the caret wrapper for each algorithm is that we can efficiently evaluate the accuracy of the algorithm on unseen data using the preprocessing, algorithm evaluation, and tuning capabilities of caret.

Datasets

Two datasets from mlbench library were used to demonstrate the algorithms.

  • Boston Housing dataset for regression problem.
  • Pima Indians Diabetes dataset for classification problem.

Algorithms

Algorithms were presented in two groups:

  • Linear Algorithms: simple methods that have strong bias but are fast to train.
  • Nonlinear Algorithms: more complex methods that have a large variance but are often more accurate.

Linear Algorithms

Linear Regression

lm() function in stats library.

# load the dataset
library(mlbench)
data(BostonHousing)
# fit model
lm.fit = lm(medv~., BostonHousing)
# summarize the fit
print(lm.fit)
## 
## Call:
## lm(formula = medv ~ ., data = BostonHousing)
## 
## Coefficients:
## (Intercept)         crim           zn        indus        chas1  
##   3.646e+01   -1.080e-01    4.642e-02    2.056e-02    2.687e+00  
##         nox           rm          age          dis          rad  
##  -1.777e+01    3.810e+00    6.922e-04   -1.476e+00    3.060e-01  
##         tax      ptratio            b        lstat  
##  -1.233e-02   -9.527e-01    9.312e-03   -5.248e-01
# make predictions
lm.pred = predict(lm.fit, BostonHousing)
# accuracy
lm.mse = mean((BostonHousing$medv - lm.pred)^2)
print(lm.mse)
## [1] 21.89483

lm() implementation in caret.

# load the libraies and dataset
library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(BostonHousing)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
lm.fit.c = train(medv~., data=BostonHousing, method="lm", metric="RMSE", preProc=c("center","scale"), trControl=control)
summary(lm.fit.c)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.53281    0.21095 106.814  < 2e-16 ***
## crim        -0.92906    0.28269  -3.287 0.001087 ** 
## zn           1.08264    0.32016   3.382 0.000778 ***
## indus        0.14104    0.42188   0.334 0.738288    
## chas1        0.68241    0.21884   3.118 0.001925 ** 
## nox         -2.05875    0.44262  -4.651 4.25e-06 ***
## rm           2.67688    0.29364   9.116  < 2e-16 ***
## age          0.01949    0.37184   0.052 0.958229    
## dis         -3.10712    0.41999  -7.398 6.01e-13 ***
## rad          2.66485    0.57770   4.613 5.07e-06 ***
## tax         -2.07884    0.63379  -3.280 0.001112 ** 
## ptratio     -2.06265    0.28323  -7.283 1.31e-12 ***
## b            0.85011    0.24521   3.467 0.000573 ***
## lstat       -3.74733    0.36216 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

The caret wrapper integrates the pre-processing(preProc option), cross-validation(trainControl()), et al, all in one function call.

preProc: a string vector that defines a pre-process of the predictor data. Possibilities are "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "baglmpute", "medianImpute", "pca", "ica", "spatialSign". The default is no pre-processing.

metric: a string that specifies what summary metric will be used to select the optimal model. By default, possible values are "RMSE" and "RSquared" for regression and "Accuracy" and "Kappa" for classification. If custom performance metrics are used (via the summaryFunction argument in trainControl, the value of metric should match one of the arguments. If it does not, a warning is issued and the first metric given by the summaryFunction is used.

Logistic Regression

glm() function in stats library creates a generalized linear model. It can be configured to perform a logistic regression suitable for binary classification problems

# load the dataset
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
glm.fit = glm(diabetes~., data=PimaIndiansDiabetes, family=binomial(link='logit'))
# summarize the fit
print(glm.fit)
## 
## Call:  glm(formula = diabetes ~ ., family = binomial(link = "logit"), 
##     data = PimaIndiansDiabetes)
## 
## Coefficients:
## (Intercept)     pregnant      glucose     pressure      triceps  
##   -8.404696     0.123182     0.035164    -0.013296     0.000619  
##     insulin         mass     pedigree          age  
##   -0.001192     0.089701     0.945180     0.014869  
## 
## Degrees of Freedom: 767 Total (i.e. Null);  759 Residual
## Null Deviance:       993.5 
## Residual Deviance: 723.4     AIC: 741.4
# make prediction
probabilities = predict(glm.fit, PimaIndiansDiabetes[,1:8], type='response')
glm.pred = ifelse(probabilities > 0.5, 'pos','neg')
# accuracy
table(glm.pred, PimaIndiansDiabetes$diabetes)
##         
## glm.pred neg pos
##      neg 445 112
##      pos  55 156
glm.accuracy = mean(glm.pred == PimaIndiansDiabetes$diabetes)
glm.accuracy
## [1] 0.7825521

Using glm() in the caret.

# load the libraries and data
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
glm.fit.c = train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy", preProc = c("center", "scale"), trControl=control)
# summarize the fit
print(glm.fit.c)
## Generalized Linear Model 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 615, 615, 614, 614, 614 
## Resampling results
## 
##   Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.7734997  0.4750097  0.03615513   0.08432477
## 
## 

Please note here, the metric has been set as "Accuracy" for the classification problem.

Linear Discriminant Analysis

lda() function in MASS library creates a linear model of a classification problem.

# load the libraries and data
library(MASS)
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
lda.fit = lda(diabetes~., data=PimaIndiansDiabetes)
# make predictions
lda.pred = predict(lda.fit, PimaIndiansDiabetes[,1:8])$class
# summarize accuracy
table(lda.pred, PimaIndiansDiabetes$diabetes)
##         
## lda.pred neg pos
##      neg 446 112
##      pos  54 156
lda.accuracy = mean(lda.pred == PimaIndiansDiabetes$diabetes)
lda.accuracy
## [1] 0.7838542

Using lda() in the caret.

# load libraies and data
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
lda.fit.c = train(diabetes~., data=PimaIndiansDiabetes, method="lda",metric="Accuracy", preProc = c("center", "scale"), trControl=control)
# summarize fit
print(lda.fit.c)
## Linear Discriminant Analysis 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 615, 615, 614, 614, 614 
## Resampling results
## 
##   Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.7734997  0.4761294  0.03615513   0.08247278
## 
## 

Regularized Regression

glmnet() function in glmnet library can be used for classification or regression.

Classification Example

# load the libraries and data
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-3
library(mlbench)
data(PimaIndiansDiabetes)
x = as.matrix(PimaIndiansDiabetes[,1:8])
y = as.matrix(PimaIndiansDiabetes[,9])
# fit model
glmnet.fit.1 = glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
# summarize the fit
print(glmnet.fit.1)
## 
## Call:  glmnet(x = x, y = y, family = "binomial", alpha = 0.5, lambda = 0.001) 
## 
##      Df   %Dev Lambda
## [1,]  8 0.2718  0.001
# make prediction
glmnet.pred.1 = predict(glmnet.fit.1, x, type="class")
# summarize accuracy
table(glmnet.pred.1, PimaIndiansDiabetes$diabetes)
##              
## glmnet.pred.1 neg pos
##           neg 446 112
##           pos  54 156
glmnet.accuracy = mean(glmnet.pred.1 == PimaIndiansDiabetes$diabetes)
glmnet.accuracy
## [1] 0.7838542

For generalized linear model, the elasticnet penalty is controlled by alpha, which bridges the gap between lasso (alpha = 1, the default) and ridge (alpha = 0). The tuning parameter lambda controls the overall strength of the penalty. For the detailed explaination of glmnet package, please see page.

Regression Example

# load the libraries and data
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas = as.numeric(as.character(BostonHousing$chas))
x = as.matrix(BostonHousing[,1:13])
y = as.matrix(BostonHousing[,14])
# fit model
glmnet.fit.2 = glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
# summarize the fit
print(glmnet.fit.2)
## 
## Call:  glmnet(x = x, y = y, family = "gaussian", alpha = 0.5, lambda = 0.001) 
## 
##      Df   %Dev Lambda
## [1,] 13 0.7406  0.001
# make prediction
glmnet.pred.2 = predict(glmnet.fit.2, x, type="link")
# summarize accuracy
glmnet.mse = mean((y - glmnet.pred.2)^2)
print(glmnet.mse)
## [1] 21.89497

Using glmnet() in the caret.

For classification

library(caret)
library(mlbench)
library(glmnet)
data(PimaIndiansDiabetes)
set.seed(10)
control = trainControl(method="cv", number=5)
fit.glmnet.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="glmnet", metric = "Accuracy", preProc=c("center","scale"),trControl=control)
print(fit.glmnet.1.c)
## glmnet 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 615, 615, 614, 614, 614 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.10   0.0004447834  0.7734997  0.4750097  0.03615513   0.08432477
##   0.10   0.0044478343  0.7747984  0.4784315  0.03709942   0.08687118
##   0.10   0.0444783425  0.7682794  0.4527797  0.03537012   0.08554097
##   0.55   0.0004447834  0.7734997  0.4750097  0.03615513   0.08432477
##   0.55   0.0044478343  0.7760971  0.4819463  0.03477662   0.08117211
##   0.55   0.0444783425  0.7682879  0.4491904  0.03763233   0.08761218
##   1.00   0.0004447834  0.7760971  0.4821814  0.03322648   0.07563185
##   1.00   0.0044478343  0.7734827  0.4758894  0.03291030   0.07692884
##   1.00   0.0444783425  0.7591631  0.4265520  0.03205871   0.07860213
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were alpha = 0.55 and lambda
##  = 0.004447834.

For regression

library(caret)
library(mlbench)
library(glmnet)
data(BostonHousing)
set.seed(10)
control = trainControl(method="cv", number=5)
fit.glmnet.2.c = train(medv~., data=BostonHousing, method="glmnet", metric = "RMSE", preProc=c("center","scale"),trControl=control)
print(fit.glmnet.2.c)
## glmnet 
## 
## 506 samples
##  13 predictors
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 405, 404, 405, 406, 404 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared   RMSE SD    Rsquared SD
##   0.10   0.01355531  4.842012  0.7289666  0.5355764  0.03990201 
##   0.10   0.13555307  4.841192  0.7289002  0.5596531  0.04191900 
##   0.10   1.35553073  5.002550  0.7144793  0.7088463  0.05757475 
##   0.55   0.01355531  4.841541  0.7290290  0.5360550  0.03995472 
##   0.55   0.13555307  4.857595  0.7273937  0.5842832  0.04428148 
##   0.55   1.35553073  5.310516  0.6883996  0.7020526  0.05979093 
##   1.00   0.01355531  4.840432  0.7291585  0.5387300  0.04020647 
##   1.00   0.13555307  4.917030  0.7209923  0.5961759  0.04538986 
##   1.00   1.35553073  5.498308  0.6804657  0.6656494  0.05176136 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.01355531.

As you can see from the caret output, running glmnet in caret will automatically go over alpha and lambda values and select the optimal combination for the best model performance.

Another thing you need to be careful is that for glmnet() function, it only takes matrix as input, not data.frame. It is because the core in glmnet package is a set of fortran subroutines, which was made for very fast execution. But you do not need to worry about it if you implement glmnet() in caret.

Nonlinear Algorithms

Nonlinear algorithms make fewer assumptions about the function being modeled. As such, they have a higher variance but are often in higher accuracy. The increased flexibility also can make them slower to train or increase their memory requirements.

k-Nearest Neighbors

knn3() function in caret library does not create model, it makes predictions from the training set directly. It can be used for classification or regression.

classification

# load the libraries and data
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
knn.fit.1 = knn3(diabetes~., data=PimaIndiansDiabetes, k=3)
# summarize the fit
print(knn.fit.1)
## 3-nearest neighbor classification model
## Training set class distribution:
## 
## neg pos 
## 500 268
# make predictions
knn.pred.1 = predict(knn.fit.1, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(knn.pred.1, PimaIndiansDiabetes$diabetes)
##           
## knn.pred.1 neg pos
##        neg 459  67
##        pos  41 201
knn.accuracy = mean(knn.pred.1 == PimaIndiansDiabetes$diabetes)
knn.accuracy
## [1] 0.859375

regression

Using the knnreg() function in caret.

library(caret)
library(mlbench)
data(BostonHousing)
BostonHousing$chas = as.numeric(as.character(BostonHousing$chas))
x = as.matrix(BostonHousing[,1:13])
y = as.matrix(BostonHousing[,14])
# fit model
knn.fit.2 = knnreg(x, y, k=3)
# summarize the fit
print(knn.fit.2)
## 3-nearest neighbor regression model
# make predictions
knn.pred.2 = predict(knn.fit.2, x)
# summarize accuracy
knn.mse = mean((BostonHousing$medv - knn.pred.2)^2)
print(knn.mse)
## [1] 17.9939

knn() implementation in caret

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(7)
control = trainControl(method="cv", number=5)
knn.fit.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="knn", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(knn.fit.1.c)
## k-Nearest Neighbors 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 614, 614, 615, 615, 614 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   5  0.7096426  0.3407639  0.01240967   0.03398836
##   7  0.7200407  0.3649273  0.01476191   0.02673364
##   9  0.7239453  0.3671314  0.02063387   0.04816333
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
library(caret)
data(BostonHousing)
data(BostonHousing)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
knn.fit.2.c = train(medv~., data=BostonHousing, method="knn", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(knn.fit.2.c)
## k-Nearest Neighbors 
## 
## 506 samples
##  13 predictors
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 405, 404, 405, 406, 404 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   RMSE SD    Rsquared SD
##   5  4.679528  0.7521730  0.7693703  0.07520371 
##   7  4.659583  0.7585942  0.9032495  0.08238999 
##   9  4.698140  0.7575586  0.9825914  0.08791694 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 7.

When using the knn() in caret, the caret will automatically select the k number for the best performance.

Naive Bayes

naiveBayes() function in e1071 library models the probabilistic of each attribute to the outcome variable independently. It can be used for classification problem.

library(e1071)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
nb.fit = naiveBayes(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(nb.fit)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##       neg       pos 
## 0.6510417 0.3489583 
## 
## Conditional probabilities:
##      pregnant
## Y         [,1]     [,2]
##   neg 3.298000 3.017185
##   pos 4.865672 3.741239
## 
##      glucose
## Y         [,1]     [,2]
##   neg 109.9800 26.14120
##   pos 141.2575 31.93962
## 
##      pressure
## Y         [,1]     [,2]
##   neg 68.18400 18.06308
##   pos 70.82463 21.49181
## 
##      triceps
## Y         [,1]     [,2]
##   neg 19.66400 14.88995
##   pos 22.16418 17.67971
## 
##      insulin
## Y         [,1]      [,2]
##   neg  68.7920  98.86529
##   pos 100.3358 138.68912
## 
##      mass
## Y         [,1]     [,2]
##   neg 30.30420 7.689855
##   pos 35.14254 7.262967
## 
##      pedigree
## Y         [,1]      [,2]
##   neg 0.429734 0.2990853
##   pos 0.550500 0.3723545
## 
##      age
## Y         [,1]     [,2]
##   neg 31.19000 11.66765
##   pos 37.06716 10.96825
# make predictions
nb.pred = predict(nb.fit, PimaIndiansDiabetes[,1:8])
# summarize accuracy
table(nb.pred, PimaIndiansDiabetes$diabetes)
##        
## nb.pred neg pos
##     neg 421 104
##     pos  79 164
nb.accuracy = mean(nb.pred == PimaIndiansDiabetes$diabetes)
nb.accuracy
## [1] 0.7617188

The implementation of naive bayes in caret uses the NaiveBayes from the lkaR package.

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
nb.fit.c = train(diabetes~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=control)
## Loading required package: klaR
# summarize fit
print(nb.fit.c)
## Naive Bayes 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 615, 615, 614, 614, 614 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   FALSE      0.7487310  0.4346715  0.02414925   0.05678239
##    TRUE      0.7618369  0.4569687  0.06089959   0.13574150
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were fL = 0 and usekernel = TRUE.

Support Vector Machine

ksvm() function in kernlab library can be used for classification or regression. It is a wrapper for the LIBSVM library and provides a suite of kernel types and configuration options.

In below examples, it is using the Radial Basic kernel.

Classification

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
ksvm.fit.1 = ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot")
# summarize the fit
print(ksvm.fit.1)
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.121151613088805 
## 
## Number of Support Vectors : 436 
## 
## Objective Function Value : -353.563 
## Training error : 0.175781
# make predictions
ksvm.pred.1 = predict(ksvm.fit.1, PimaIndiansDiabetes[,1:8], type="response")
# summarize accuracy
table(ksvm.pred.1, PimaIndiansDiabetes$diabetes)
##            
## ksvm.pred.1 neg pos
##         neg 463  98
##         pos  37 170
ksvm.accuracy = mean(ksvm.pred.1 == PimaIndiansDiabetes$diabetes)
ksvm.accuracy
## [1] 0.8242188

Regression

library(kernlab)
library(mlbench)
data(BostonHousing)
# fit model
ksvm.fit.2 = ksvm(medv~., BostonHousing, kernel="rbfdot")
# summarize the fit
print(ksvm.fit.2)
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0963515407067046 
## 
## Number of Support Vectors : 337 
## 
## Objective Function Value : -76.1578 
## Training error : 0.094743
# make predictions
ksvm.pred.2 = predict(ksvm.fit.2, BostonHousing)
# summarize accuracy
ksvm.mse = mean((BostonHousing$medv - ksvm.pred.2)^2)
print(ksvm.mse)
## [1] 8.014004
Using the SVM with Radial Basis kernel in caret.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
ksvm.fit.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", metric="Accuracy", trControl=control)
# summarize fit
print(ksvm.fit.1.c)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 615, 615, 614, 614, 614 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.25  0.7682709  0.4536716  0.03665128   0.08083154
##   0.50  0.7669807  0.4550586  0.03379924   0.07405079
##   1.00  0.7565572  0.4361162  0.02732086   0.06557150
## 
## Tuning parameter 'sigma' was held constant at a value of 0.129025
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.129025 and C = 0.25.
library(caret)
library(mlbench)
data(BostonHousing)
set.seed(10)
control = trainControl(method="cv", number=5)
ksvm.fit.2.c = train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE", trControl=control)
# summarize fit
print(ksvm.fit.2.c)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 506 samples
##  13 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 405, 404, 405, 406, 404 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared   RMSE SD    Rsquared SD
##   0.25  4.820191  0.7540812  0.9281945  0.08355636 
##   0.50  4.293715  0.7965549  0.9109565  0.07587571 
##   1.00  3.859969  0.8299751  0.8611984  0.06989741 
## 
## Tuning parameter 'sigma' was held constant at a value of 0.09784294
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.09784294 and C = 1.

Classification and Regression Trees

rpart() function in rpart library provides an implementation of CART for classification and regression.

Classification

library(rpart)
library(mlbench)
data(PimaIndiansDiabetes)
rp.fit.1 = rpart(diabetes~., data=PimaIndiansDiabetes)

print(rp.fit.1)
## n= 768 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 768 268 neg (0.65104167 0.34895833)  
##     2) glucose< 127.5 485  94 neg (0.80618557 0.19381443)  
##       4) age< 28.5 271  23 neg (0.91512915 0.08487085) *
##       5) age>=28.5 214  71 neg (0.66822430 0.33177570)  
##        10) mass< 26.35 41   2 neg (0.95121951 0.04878049) *
##        11) mass>=26.35 173  69 neg (0.60115607 0.39884393)  
##          22) glucose< 99.5 55  10 neg (0.81818182 0.18181818) *
##          23) glucose>=99.5 118  59 neg (0.50000000 0.50000000)  
##            46) pedigree< 0.561 84  34 neg (0.59523810 0.40476190)  
##              92) pedigree< 0.2 21   4 neg (0.80952381 0.19047619) *
##              93) pedigree>=0.2 63  30 neg (0.52380952 0.47619048)  
##               186) pregnant>=1.5 52  21 neg (0.59615385 0.40384615)  
##                 372) pressure>=67 40  12 neg (0.70000000 0.30000000) *
##                 373) pressure< 67 12   3 pos (0.25000000 0.75000000) *
##               187) pregnant< 1.5 11   2 pos (0.18181818 0.81818182) *
##            47) pedigree>=0.561 34   9 pos (0.26470588 0.73529412) *
##     3) glucose>=127.5 283 109 pos (0.38515901 0.61484099)  
##       6) mass< 29.95 76  24 neg (0.68421053 0.31578947)  
##        12) glucose< 145.5 41   6 neg (0.85365854 0.14634146) *
##        13) glucose>=145.5 35  17 pos (0.48571429 0.51428571)  
##          26) insulin< 14.5 21   8 neg (0.61904762 0.38095238) *
##          27) insulin>=14.5 14   4 pos (0.28571429 0.71428571) *
##       7) mass>=29.95 207  57 pos (0.27536232 0.72463768)  
##        14) glucose< 157.5 115  45 pos (0.39130435 0.60869565)  
##          28) age< 30.5 50  23 neg (0.54000000 0.46000000)  
##            56) pressure>=61 40  13 neg (0.67500000 0.32500000)  
##             112) mass< 41.8 31   7 neg (0.77419355 0.22580645) *
##             113) mass>=41.8 9   3 pos (0.33333333 0.66666667) *
##            57) pressure< 61 10   0 pos (0.00000000 1.00000000) *
##          29) age>=30.5 65  18 pos (0.27692308 0.72307692) *
##        15) glucose>=157.5 92  12 pos (0.13043478 0.86956522) *
# make predictions
rp.pred.1 = predict(rp.fit.1, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(rp.pred.1, PimaIndiansDiabetes$diabetes)
##          
## rp.pred.1 neg pos
##       neg 449  72
##       pos  51 196
rp.accuracy = mean(rp.pred.1 == PimaIndiansDiabetes$diabetes)
rp.accuracy
## [1] 0.8398438

Regression

library(rpart)
library(mlbench)
data(BostonHousing)

rp.fit.2 = rpart(medv~., data=BostonHousing, control=rpart.control(minsplit=5))

print(rp.fit.2)
## n= 506 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 506 42716.3000 22.53281  
##    2) rm< 6.941 430 17317.3200 19.93372  
##      4) lstat>=14.4 175  3373.2510 14.95600  
##        8) crim>=6.99237 74  1085.9050 11.97838 *
##        9) crim< 6.99237 101  1150.5370 17.13762 *
##      5) lstat< 14.4 255  6632.2170 23.34980  
##       10) dis>=1.38485 250  3721.1630 22.90520  
##         20) rm< 6.543 195  1636.0670 21.62974 *
##         21) rm>=6.543 55   643.1691 27.42727 *
##       11) dis< 1.38485 5   390.7280 45.58000 *
##    3) rm>=6.941 76  6059.4190 37.23816  
##      6) rm< 7.437 46  1899.6120 32.11304  
##       12) crim>=7.393425 3    27.9200 14.40000 *
##       13) crim< 7.393425 43   864.7674 33.34884 *
##      7) rm>=7.437 30  1098.8500 45.09667  
##       14) ptratio>=18.3 3   223.8200 33.30000 *
##       15) ptratio< 18.3 27   411.1585 46.40741 *
# make predictions
rp.pred.2 = predict(rp.fit.2, BostonHousing[,1:13])
# summarize accuracy
rp.mse = mean((BostonHousing$medv - rp.pred.2)^2)
print(rp.mse)
## [1] 12.71556
Implementation using caret.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
rp.fit.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)
# summarize fit
print(rp.fit.1.c)
## CART 
## 
## 768 samples
##   8 predictors
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 615, 615, 614, 614, 614 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.01741294  0.7435362  0.4121356  0.02808437   0.05483307
##   0.10447761  0.7370427  0.3910463  0.03609679   0.07754173
##   0.24253731  0.6914354  0.2031678  0.04410346   0.18774197
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01741294.
library(caret)
library(mlbench)
data(BostonHousing)

set.seed(10)
control = trainControl(method="cv", number=2)
rp.fit.2.c = train(medv~., data=BostonHousing, method="rpart", metric="RMSE", trControl=control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
# summarize fit
print(rp.fit.2.c)
## CART 
## 
## 506 samples
##  13 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 254, 252 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   RMSE SD    Rsquared SD
##   0.07165784  5.652481  0.6289471  0.7270540  0.10372090 
##   0.17117244  7.190488  0.4109681  0.3156617  0.06524176 
##   0.45274420  8.370570  0.3648352  1.3532270          NA 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.07165784.

Conclusion

There are many, many more algorithms out there in R, and available in caret. Here is the whole list of the models in the caret package. It is helpful to look at this list.

Compare The Performance of machine Learning Algorithms

There are several techniques that you can use to compare machine learning algorithms in R.

Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use the estimates to choose one or two best models from the suite of models that you have created.

The way that you can do that is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.

Function resamples() in caret provides methods for collection, analyzing and visualizing a set of resampling results from a common data set. Then the summary() function computes summary statistics across each model/metric combination.

# prepare training scheme
control = trainControl(method="repeatedcv", number=10, repeats=3)
# CART
set.seed(7)
fit.cart = train(diabetes~., data=PimaIndiansDiabetes, method="rpart", trControl=control)
# LDA
set.seed(7)
fit.lda = train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=control)
# SVM
set.seed(7)
fit.svm = train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", trControl=control)
# kNN
set.seed(7)
fit.knn = train(diabetes~., data=PimaIndiansDiabetes, method="knn", trControl=control)
# Random Forest
set.seed(7)
fit.rf = train(diabetes~., data=PimaIndiansDiabetes, method="rf", trControl=control)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# collect resamples
results = resamples(list(CART=fit.cart, LDA=fit.lda, SVM=fit.svm, KNN=fit.knn, RF=fit.rf))

# summarize differences between models
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CART, LDA, SVM, KNN, RF 
## Number of resamples: 30 
## 
## Accuracy 
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## CART 0.6234  0.7115 0.7403 0.7382  0.7760 0.8442    0
## LDA  0.6711  0.7532 0.7662 0.7759  0.8052 0.8701    0
## SVM  0.6711  0.7403 0.7582 0.7651  0.7890 0.8961    0
## KNN  0.6184  0.6984 0.7321 0.7299  0.7532 0.8182    0
## RF   0.6711  0.7273 0.7516 0.7617  0.7890 0.8571    0
## 
## Kappa 
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## CART 0.1585  0.3296 0.3765 0.3934  0.4685 0.6393    0
## LDA  0.2484  0.4196 0.4516 0.4801  0.5512 0.7048    0
## SVM  0.2187  0.3889 0.4167 0.4520  0.5003 0.7638    0
## KNN  0.1113  0.3228 0.3867 0.3819  0.4382 0.5867    0
## RF   0.2624  0.3787 0.4516 0.4588  0.5193 0.6781    0

Visualizing the differences

Density Plots

# density plots of accuracy
scales = list(x=list(relation="free"), y=list(relation="free"))
densityplot(results, scales=scales, pch = "|")

Dot Plots

It shows both the mean estimated accuracy as well as the 95% confidence interval (e.g. the range in which 95% of observed scores fell).
# dot plots of accuracy
scales <- list(x=list(relation="free"), y=list(relation="free"))
dotplot(results, scales=scales)

Scatterplot Matrix

This is invaluable when considering whether the predictions from two different algorithms are correlated. If weakly correlated, they are good candidates for being combined in an ensemble prediction.
# Pairwise scatterplots of predictions to compare models
splom(results)

Pairwise xyPlots

We can plot pair-wise comparison of the accuracy of trial-folds for two machine learning algorithms with an xyplot.
xyplot(results, models=c("LDA", "SVM"))

Statistical Significance Tests

We can calculate the significance of the differences between the metric distributions of different machine learning algorithms. Results can be summarized directly by calling the summary() function.

# difference in model predictions
diffs <- diff(results)
# summarize p-values for pair-wise comparisons
summary(diffs)
## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##      CART      LDA       SVM       KNN       RF       
## CART           -0.037759 -0.026908  0.008248 -0.023473
## LDA  0.0050068            0.010851  0.046007  0.014286
## SVM  0.0919580 0.3390336            0.035156  0.003435
## KNN  1.0000000 1.218e-05 0.0007092           -0.031721
## RF   0.1722106 0.1349151 1.0000000 0.0034441          
## 
## Kappa 
##      CART     LDA       SVM       KNN       RF       
## CART          -0.086692 -0.058612  0.011552 -0.065345
## LDA  0.001548            0.028079  0.098243  0.021347
## SVM  0.083995 0.221740             0.070164 -0.006733
## KNN  1.000000 4.122e-05 0.005469            -0.076897
## RF   0.031922 1.000000  1.000000  0.001012

The upper diagonal of the table shows the estimated difference between the distributions. If we think that LDA is the most accurate model from looking at the previous graphs, we can get an estimate of how much better than specific other models in terms of absolute accuracy.

sessionInfo()
## R version 3.2.4 RC (2016-03-02 r70278)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] randomForest_4.6-12 rpart_4.1-10        kernlab_0.9-23     
##  [4] klaR_0.6-12         e1071_1.6-7         glmnet_2.0-3       
##  [7] foreach_1.4.3       Matrix_1.2-4        MASS_7.3-29        
## [10] caret_6.0-64        ggplot2_2.0.0       lattice_0.20-24    
## [13] mlbench_2.1-1      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3          compiler_3.2.4       formatR_1.2.1       
##  [4] nloptr_1.0.4         plyr_1.8.3           class_7.3-9         
##  [7] iterators_1.0.8      tools_3.2.4          digest_0.6.9        
## [10] lme4_1.1-11          evaluate_0.8         gtable_0.1.2        
## [13] nlme_3.1-125         mgcv_1.7-28          yaml_2.1.13         
## [16] parallel_3.2.4       SparseM_1.7          stringr_1.0.0       
## [19] knitr_1.12.3         MatrixModels_0.4-1   combinat_0.0-8      
## [22] stats4_3.2.4         grid_3.2.4           nnet_7.3-7          
## [25] rmarkdown_0.9.5      knitrBootstrap_1.0.0 minqa_1.2.4         
## [28] reshape2_1.4.1       car_2.1-1            magrittr_1.5        
## [31] scales_0.4.0         codetools_0.2-8      htmltools_0.3       
## [34] splines_3.2.4        pbkrtest_0.4-4       mime_0.4            
## [37] colorspace_1.2-6     quantreg_5.21        stringi_1.0-1       
## [40] munsell_0.4.3        markdown_0.7.7

Last updated: 2016-03-14