This document is based on the blog written by Jason Brownlee: Spot Check Machine Learning Algorithms in R. I think that this is a very useful one as the starting point for any new ML project. The original blog is very well organized, here I tried to make some extra notes in each step.
Jason defined the term "Spot Checking" as a process that using trial and error to discover a short list of algorithms that do well on the problem.
Jason suggested trying a mixture of algorithms and see which is good at picking out the structure in the data.
In following sections, we listed several linear and nonlinear algorithms one should spot check on his own project in R. It excludes ensemble algorithms such as boosting and bagging.
Each algorithm will be presented in two ways.
The reason putting the caret wrapper for each algorithm is that we can efficiently evaluate the accuracy of the algorithm on unseen data using the preprocessing, algorithm evaluation, and tuning capabilities of caret.
Two datasets from mlbench library were used to demonstrate the algorithms.
Algorithms were presented in two groups:
lm() function in stats library.
# load the dataset
library(mlbench)
data(BostonHousing)
# fit model
lm.fit = lm(medv~., BostonHousing)
# summarize the fit
print(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = BostonHousing)
##
## Coefficients:
## (Intercept) crim zn indus chas1
## 3.646e+01 -1.080e-01 4.642e-02 2.056e-02 2.687e+00
## nox rm age dis rad
## -1.777e+01 3.810e+00 6.922e-04 -1.476e+00 3.060e-01
## tax ptratio b lstat
## -1.233e-02 -9.527e-01 9.312e-03 -5.248e-01
# make predictions
lm.pred = predict(lm.fit, BostonHousing)
# accuracy
lm.mse = mean((BostonHousing$medv - lm.pred)^2)
print(lm.mse)
## [1] 21.89483
lm() implementation in caret.
# load the libraies and dataset
library(mlbench)
library(caret)
data(BostonHousing)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
lm.fit.c = train(medv~., data=BostonHousing, method="lm", metric="RMSE", preProc=c("center","scale"), trControl=control)
summary(lm.fit.c)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.53281 0.21095 106.814 < 2e-16 ***
## crim -0.92906 0.28269 -3.287 0.001087 **
## zn 1.08264 0.32016 3.382 0.000778 ***
## indus 0.14104 0.42188 0.334 0.738288
## chas1 0.68241 0.21884 3.118 0.001925 **
## nox -2.05875 0.44262 -4.651 4.25e-06 ***
## rm 2.67688 0.29364 9.116 < 2e-16 ***
## age 0.01949 0.37184 0.052 0.958229
## dis -3.10712 0.41999 -7.398 6.01e-13 ***
## rad 2.66485 0.57770 4.613 5.07e-06 ***
## tax -2.07884 0.63379 -3.280 0.001112 **
## ptratio -2.06265 0.28323 -7.283 1.31e-12 ***
## b 0.85011 0.24521 3.467 0.000573 ***
## lstat -3.74733 0.36216 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
The caret wrapper integrates the pre-processing(preProc option), cross-validation(trainControl()), et al, all in one function call.
preProc: a string vector that defines a pre-process of the predictor data. Possibilities are "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "baglmpute", "medianImpute", "pca", "ica", "spatialSign". The default is no pre-processing.
metric: a string that specifies what summary metric will be used to select the optimal model. By default, possible values are "RMSE" and "RSquared" for regression and "Accuracy" and "Kappa" for classification. If custom performance metrics are used (via the summaryFunction argument in trainControl, the value of metric should match one of the arguments. If it does not, a warning is issued and the first metric given by the summaryFunction is used.
glm() function in stats library creates a generalized linear model. It can be configured to perform a logistic regression suitable for binary classification problems
# load the dataset
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
glm.fit = glm(diabetes~., data=PimaIndiansDiabetes, family=binomial(link='logit'))
# summarize the fit
print(glm.fit)
##
## Call: glm(formula = diabetes ~ ., family = binomial(link = "logit"),
## data = PimaIndiansDiabetes)
##
## Coefficients:
## (Intercept) pregnant glucose pressure triceps
## -8.404696 0.123182 0.035164 -0.013296 0.000619
## insulin mass pedigree age
## -0.001192 0.089701 0.945180 0.014869
##
## Degrees of Freedom: 767 Total (i.e. Null); 759 Residual
## Null Deviance: 993.5
## Residual Deviance: 723.4 AIC: 741.4
# make prediction
probabilities = predict(glm.fit, PimaIndiansDiabetes[,1:8], type='response')
glm.pred = ifelse(probabilities > 0.5, 'pos','neg')
# accuracy
table(glm.pred, PimaIndiansDiabetes$diabetes)
##
## glm.pred neg pos
## neg 445 112
## pos 55 156
glm.accuracy = mean(glm.pred == PimaIndiansDiabetes$diabetes)
glm.accuracy
## [1] 0.7825521
Using glm() in the caret.
# load the libraries and data
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
glm.fit.c = train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy", preProc = c("center", "scale"), trControl=control)
# summarize the fit
print(glm.fit.c)
## Generalized Linear Model
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 615, 615, 614, 614, 614
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.7734997 0.4750097 0.03615513 0.08432477
##
##
Please note here, the metric has been set as "Accuracy" for the classification problem.
lda() function in MASS library creates a linear model of a classification problem.
# load the libraries and data
library(MASS)
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
lda.fit = lda(diabetes~., data=PimaIndiansDiabetes)
# make predictions
lda.pred = predict(lda.fit, PimaIndiansDiabetes[,1:8])$class
# summarize accuracy
table(lda.pred, PimaIndiansDiabetes$diabetes)
##
## lda.pred neg pos
## neg 446 112
## pos 54 156
lda.accuracy = mean(lda.pred == PimaIndiansDiabetes$diabetes)
lda.accuracy
## [1] 0.7838542
Using lda() in the caret.
# load libraies and data
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
lda.fit.c = train(diabetes~., data=PimaIndiansDiabetes, method="lda",metric="Accuracy", preProc = c("center", "scale"), trControl=control)
# summarize fit
print(lda.fit.c)
## Linear Discriminant Analysis
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 615, 615, 614, 614, 614
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.7734997 0.4761294 0.03615513 0.08247278
##
##
glmnet() function in glmnet library can be used for classification or regression.
# load the libraries and data
library(glmnet)
library(mlbench)
data(PimaIndiansDiabetes)
x = as.matrix(PimaIndiansDiabetes[,1:8])
y = as.matrix(PimaIndiansDiabetes[,9])
# fit model
glmnet.fit.1 = glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
# summarize the fit
print(glmnet.fit.1)
##
## Call: glmnet(x = x, y = y, family = "binomial", alpha = 0.5, lambda = 0.001)
##
## Df %Dev Lambda
## [1,] 8 0.2718 0.001
# make prediction
glmnet.pred.1 = predict(glmnet.fit.1, x, type="class")
# summarize accuracy
table(glmnet.pred.1, PimaIndiansDiabetes$diabetes)
##
## glmnet.pred.1 neg pos
## neg 446 112
## pos 54 156
glmnet.accuracy = mean(glmnet.pred.1 == PimaIndiansDiabetes$diabetes)
glmnet.accuracy
## [1] 0.7838542
For generalized linear model, the elasticnet penalty is controlled by alpha, which bridges the gap between lasso (alpha = 1, the default) and ridge (alpha = 0). The tuning parameter lambda controls the overall strength of the penalty. For the detailed explaination of glmnet package, please see page.
# load the libraries and data
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas = as.numeric(as.character(BostonHousing$chas))
x = as.matrix(BostonHousing[,1:13])
y = as.matrix(BostonHousing[,14])
# fit model
glmnet.fit.2 = glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
# summarize the fit
print(glmnet.fit.2)
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 0.5, lambda = 0.001)
##
## Df %Dev Lambda
## [1,] 13 0.7406 0.001
# make prediction
glmnet.pred.2 = predict(glmnet.fit.2, x, type="link")
# summarize accuracy
glmnet.mse = mean((y - glmnet.pred.2)^2)
print(glmnet.mse)
## [1] 21.89497
Using glmnet() in the caret.
For classification
library(caret)
library(mlbench)
library(glmnet)
data(PimaIndiansDiabetes)
set.seed(10)
control = trainControl(method="cv", number=5)
fit.glmnet.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="glmnet", metric = "Accuracy", preProc=c("center","scale"),trControl=control)
print(fit.glmnet.1.c)
## glmnet
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 615, 615, 614, 614, 614
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa Accuracy SD Kappa SD
## 0.10 0.0004447834 0.7734997 0.4750097 0.03615513 0.08432477
## 0.10 0.0044478343 0.7747984 0.4784315 0.03709942 0.08687118
## 0.10 0.0444783425 0.7682794 0.4527797 0.03537012 0.08554097
## 0.55 0.0004447834 0.7734997 0.4750097 0.03615513 0.08432477
## 0.55 0.0044478343 0.7760971 0.4819463 0.03477662 0.08117211
## 0.55 0.0444783425 0.7682879 0.4491904 0.03763233 0.08761218
## 1.00 0.0004447834 0.7760971 0.4821814 0.03322648 0.07563185
## 1.00 0.0044478343 0.7734827 0.4758894 0.03291030 0.07692884
## 1.00 0.0444783425 0.7591631 0.4265520 0.03205871 0.07860213
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.55 and lambda
## = 0.004447834.
For regression
library(caret)
library(mlbench)
library(glmnet)
data(BostonHousing)
set.seed(10)
control = trainControl(method="cv", number=5)
fit.glmnet.2.c = train(medv~., data=BostonHousing, method="glmnet", metric = "RMSE", preProc=c("center","scale"),trControl=control)
print(fit.glmnet.2.c)
## glmnet
##
## 506 samples
## 13 predictors
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 405, 404, 405, 406, 404
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared RMSE SD Rsquared SD
## 0.10 0.01355531 4.842012 0.7289666 0.5355764 0.03990201
## 0.10 0.13555307 4.841192 0.7289002 0.5596531 0.04191900
## 0.10 1.35553073 5.002550 0.7144793 0.7088463 0.05757475
## 0.55 0.01355531 4.841541 0.7290290 0.5360550 0.03995472
## 0.55 0.13555307 4.857595 0.7273937 0.5842832 0.04428148
## 0.55 1.35553073 5.310516 0.6883996 0.7020526 0.05979093
## 1.00 0.01355531 4.840432 0.7291585 0.5387300 0.04020647
## 1.00 0.13555307 4.917030 0.7209923 0.5961759 0.04538986
## 1.00 1.35553073 5.498308 0.6804657 0.6656494 0.05176136
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.01355531.
As you can see from the caret output, running glmnet in caret will automatically go over alpha and lambda values and select the optimal combination for the best model performance.
Another thing you need to be careful is that for glmnet() function, it only takes matrix as input, not data.frame. It is because the core in glmnet package is a set of fortran subroutines, which was made for very fast execution. But you do not need to worry about it if you implement glmnet() in caret.
Nonlinear algorithms make fewer assumptions about the function being modeled. As such, they have a higher variance but are often in higher accuracy. The increased flexibility also can make them slower to train or increase their memory requirements.
knn3() function in caret library does not create model, it makes predictions from the training set directly. It can be used for classification or regression.
# load the libraries and data
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
knn.fit.1 = knn3(diabetes~., data=PimaIndiansDiabetes, k=3)
# summarize the fit
print(knn.fit.1)
## 3-nearest neighbor classification model
## Training set class distribution:
##
## neg pos
## 500 268
# make predictions
knn.pred.1 = predict(knn.fit.1, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(knn.pred.1, PimaIndiansDiabetes$diabetes)
##
## knn.pred.1 neg pos
## neg 459 67
## pos 41 201
knn.accuracy = mean(knn.pred.1 == PimaIndiansDiabetes$diabetes)
knn.accuracy
## [1] 0.859375
Using the knnreg() function in caret.
library(caret)
library(mlbench)
data(BostonHousing)
BostonHousing$chas = as.numeric(as.character(BostonHousing$chas))
x = as.matrix(BostonHousing[,1:13])
y = as.matrix(BostonHousing[,14])
# fit model
knn.fit.2 = knnreg(x, y, k=3)
# summarize the fit
print(knn.fit.2)
## 3-nearest neighbor regression model
# make predictions
knn.pred.2 = predict(knn.fit.2, x)
# summarize accuracy
knn.mse = mean((BostonHousing$medv - knn.pred.2)^2)
print(knn.mse)
## [1] 17.9939
knn() implementation in caret
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(7)
control = trainControl(method="cv", number=5)
knn.fit.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="knn", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(knn.fit.1.c)
## k-Nearest Neighbors
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 614, 614, 615, 615, 614
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Accuracy SD Kappa SD
## 5 0.7096426 0.3407639 0.01240967 0.03398836
## 7 0.7200407 0.3649273 0.01476191 0.02673364
## 9 0.7239453 0.3671314 0.02063387 0.04816333
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
library(caret)
data(BostonHousing)
data(BostonHousing)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
knn.fit.2.c = train(medv~., data=BostonHousing, method="knn", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(knn.fit.2.c)
## k-Nearest Neighbors
##
## 506 samples
## 13 predictors
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 405, 404, 405, 406, 404
## Resampling results across tuning parameters:
##
## k RMSE Rsquared RMSE SD Rsquared SD
## 5 4.679528 0.7521730 0.7693703 0.07520371
## 7 4.659583 0.7585942 0.9032495 0.08238999
## 9 4.698140 0.7575586 0.9825914 0.08791694
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
When using the knn() in caret, the caret will automatically select the k number for the best performance.
naiveBayes() function in e1071 library models the probabilistic of each attribute to the outcome variable independently. It can be used for classification problem.
library(e1071)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
nb.fit = naiveBayes(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(nb.fit)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## neg pos
## 0.6510417 0.3489583
##
## Conditional probabilities:
## pregnant
## Y [,1] [,2]
## neg 3.298000 3.017185
## pos 4.865672 3.741239
##
## glucose
## Y [,1] [,2]
## neg 109.9800 26.14120
## pos 141.2575 31.93962
##
## pressure
## Y [,1] [,2]
## neg 68.18400 18.06308
## pos 70.82463 21.49181
##
## triceps
## Y [,1] [,2]
## neg 19.66400 14.88995
## pos 22.16418 17.67971
##
## insulin
## Y [,1] [,2]
## neg 68.7920 98.86529
## pos 100.3358 138.68912
##
## mass
## Y [,1] [,2]
## neg 30.30420 7.689855
## pos 35.14254 7.262967
##
## pedigree
## Y [,1] [,2]
## neg 0.429734 0.2990853
## pos 0.550500 0.3723545
##
## age
## Y [,1] [,2]
## neg 31.19000 11.66765
## pos 37.06716 10.96825
# make predictions
nb.pred = predict(nb.fit, PimaIndiansDiabetes[,1:8])
# summarize accuracy
table(nb.pred, PimaIndiansDiabetes$diabetes)
##
## nb.pred neg pos
## neg 421 104
## pos 79 164
nb.accuracy = mean(nb.pred == PimaIndiansDiabetes$diabetes)
nb.accuracy
## [1] 0.7617188
The implementation of naive bayes in caret uses the NaiveBayes from the lkaR package.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
nb.fit.c = train(diabetes~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=control)
# summarize fit
print(nb.fit.c)
## Naive Bayes
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 615, 615, 614, 614, 614
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa Accuracy SD Kappa SD
## FALSE 0.7487310 0.4346715 0.02414925 0.05678239
## TRUE 0.7618369 0.4569687 0.06089959 0.13574150
##
## Tuning parameter 'fL' was held constant at a value of 0
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0 and usekernel = TRUE.
ksvm() function in kernlab library can be used for classification or regression. It is a wrapper for the LIBSVM library and provides a suite of kernel types and configuration options.
In below examples, it is using the Radial Basic kernel.
library(kernlab)
library(mlbench)
data(PimaIndiansDiabetes)
# fit model
ksvm.fit.1 = ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot")
# summarize the fit
print(ksvm.fit.1)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.121151613088805
##
## Number of Support Vectors : 436
##
## Objective Function Value : -353.563
## Training error : 0.175781
# make predictions
ksvm.pred.1 = predict(ksvm.fit.1, PimaIndiansDiabetes[,1:8], type="response")
# summarize accuracy
table(ksvm.pred.1, PimaIndiansDiabetes$diabetes)
##
## ksvm.pred.1 neg pos
## neg 463 98
## pos 37 170
ksvm.accuracy = mean(ksvm.pred.1 == PimaIndiansDiabetes$diabetes)
ksvm.accuracy
## [1] 0.8242188
library(kernlab)
library(mlbench)
data(BostonHousing)
# fit model
ksvm.fit.2 = ksvm(medv~., BostonHousing, kernel="rbfdot")
# summarize the fit
print(ksvm.fit.2)
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0963515407067046
##
## Number of Support Vectors : 337
##
## Objective Function Value : -76.1578
## Training error : 0.094743
# make predictions
ksvm.pred.2 = predict(ksvm.fit.2, BostonHousing)
# summarize accuracy
ksvm.mse = mean((BostonHousing$medv - ksvm.pred.2)^2)
print(ksvm.mse)
## [1] 8.014004
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
ksvm.fit.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", metric="Accuracy", trControl=control)
# summarize fit
print(ksvm.fit.1.c)
## Support Vector Machines with Radial Basis Function Kernel
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 615, 615, 614, 614, 614
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 0.25 0.7682709 0.4536716 0.03665128 0.08083154
## 0.50 0.7669807 0.4550586 0.03379924 0.07405079
## 1.00 0.7565572 0.4361162 0.02732086 0.06557150
##
## Tuning parameter 'sigma' was held constant at a value of 0.129025
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.129025 and C = 0.25.
library(caret)
library(mlbench)
data(BostonHousing)
set.seed(10)
control = trainControl(method="cv", number=5)
ksvm.fit.2.c = train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE", trControl=control)
# summarize fit
print(ksvm.fit.2.c)
## Support Vector Machines with Radial Basis Function Kernel
##
## 506 samples
## 13 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 405, 404, 405, 406, 404
## Resampling results across tuning parameters:
##
## C RMSE Rsquared RMSE SD Rsquared SD
## 0.25 4.820191 0.7540812 0.9281945 0.08355636
## 0.50 4.293715 0.7965549 0.9109565 0.07587571
## 1.00 3.859969 0.8299751 0.8611984 0.06989741
##
## Tuning parameter 'sigma' was held constant at a value of 0.09784294
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.09784294 and C = 1.
rpart() function in rpart library provides an implementation of CART for classification and regression.
library(rpart)
library(mlbench)
data(PimaIndiansDiabetes)
rp.fit.1 = rpart(diabetes~., data=PimaIndiansDiabetes)
print(rp.fit.1)
## n= 768
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 768 268 neg (0.65104167 0.34895833)
## 2) glucose< 127.5 485 94 neg (0.80618557 0.19381443)
## 4) age< 28.5 271 23 neg (0.91512915 0.08487085) *
## 5) age>=28.5 214 71 neg (0.66822430 0.33177570)
## 10) mass< 26.35 41 2 neg (0.95121951 0.04878049) *
## 11) mass>=26.35 173 69 neg (0.60115607 0.39884393)
## 22) glucose< 99.5 55 10 neg (0.81818182 0.18181818) *
## 23) glucose>=99.5 118 59 neg (0.50000000 0.50000000)
## 46) pedigree< 0.561 84 34 neg (0.59523810 0.40476190)
## 92) pedigree< 0.2 21 4 neg (0.80952381 0.19047619) *
## 93) pedigree>=0.2 63 30 neg (0.52380952 0.47619048)
## 186) pregnant>=1.5 52 21 neg (0.59615385 0.40384615)
## 372) pressure>=67 40 12 neg (0.70000000 0.30000000) *
## 373) pressure< 67 12 3 pos (0.25000000 0.75000000) *
## 187) pregnant< 1.5 11 2 pos (0.18181818 0.81818182) *
## 47) pedigree>=0.561 34 9 pos (0.26470588 0.73529412) *
## 3) glucose>=127.5 283 109 pos (0.38515901 0.61484099)
## 6) mass< 29.95 76 24 neg (0.68421053 0.31578947)
## 12) glucose< 145.5 41 6 neg (0.85365854 0.14634146) *
## 13) glucose>=145.5 35 17 pos (0.48571429 0.51428571)
## 26) insulin< 14.5 21 8 neg (0.61904762 0.38095238) *
## 27) insulin>=14.5 14 4 pos (0.28571429 0.71428571) *
## 7) mass>=29.95 207 57 pos (0.27536232 0.72463768)
## 14) glucose< 157.5 115 45 pos (0.39130435 0.60869565)
## 28) age< 30.5 50 23 neg (0.54000000 0.46000000)
## 56) pressure>=61 40 13 neg (0.67500000 0.32500000)
## 112) mass< 41.8 31 7 neg (0.77419355 0.22580645) *
## 113) mass>=41.8 9 3 pos (0.33333333 0.66666667) *
## 57) pressure< 61 10 0 pos (0.00000000 1.00000000) *
## 29) age>=30.5 65 18 pos (0.27692308 0.72307692) *
## 15) glucose>=157.5 92 12 pos (0.13043478 0.86956522) *
# make predictions
rp.pred.1 = predict(rp.fit.1, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(rp.pred.1, PimaIndiansDiabetes$diabetes)
##
## rp.pred.1 neg pos
## neg 449 72
## pos 51 196
rp.accuracy = mean(rp.pred.1 == PimaIndiansDiabetes$diabetes)
rp.accuracy
## [1] 0.8398438
library(rpart)
library(mlbench)
data(BostonHousing)
rp.fit.2 = rpart(medv~., data=BostonHousing, control=rpart.control(minsplit=5))
print(rp.fit.2)
## n= 506
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 506 42716.3000 22.53281
## 2) rm< 6.941 430 17317.3200 19.93372
## 4) lstat>=14.4 175 3373.2510 14.95600
## 8) crim>=6.99237 74 1085.9050 11.97838 *
## 9) crim< 6.99237 101 1150.5370 17.13762 *
## 5) lstat< 14.4 255 6632.2170 23.34980
## 10) dis>=1.38485 250 3721.1630 22.90520
## 20) rm< 6.543 195 1636.0670 21.62974 *
## 21) rm>=6.543 55 643.1691 27.42727 *
## 11) dis< 1.38485 5 390.7280 45.58000 *
## 3) rm>=6.941 76 6059.4190 37.23816
## 6) rm< 7.437 46 1899.6120 32.11304
## 12) crim>=7.393425 3 27.9200 14.40000 *
## 13) crim< 7.393425 43 864.7674 33.34884 *
## 7) rm>=7.437 30 1098.8500 45.09667
## 14) ptratio>=18.3 3 223.8200 33.30000 *
## 15) ptratio< 18.3 27 411.1585 46.40741 *
# make predictions
rp.pred.2 = predict(rp.fit.2, BostonHousing[,1:13])
# summarize accuracy
rp.mse = mean((BostonHousing$medv - rp.pred.2)^2)
print(rp.mse)
## [1] 12.71556
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
# train
set.seed(10)
control = trainControl(method="cv", number=5)
rp.fit.1.c = train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)
# summarize fit
print(rp.fit.1.c)
## CART
##
## 768 samples
## 8 predictors
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 615, 615, 614, 614, 614
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.01741294 0.7435362 0.4121356 0.02808437 0.05483307
## 0.10447761 0.7370427 0.3910463 0.03609679 0.07754173
## 0.24253731 0.6914354 0.2031678 0.04410346 0.18774197
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01741294.
library(caret)
library(mlbench)
data(BostonHousing)
set.seed(10)
control = trainControl(method="cv", number=2)
rp.fit.2.c = train(medv~., data=BostonHousing, method="rpart", metric="RMSE", trControl=control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
# summarize fit
print(rp.fit.2.c)
## CART
##
## 506 samples
## 13 predictors
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 254, 252
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared RMSE SD Rsquared SD
## 0.07165784 5.652481 0.6289471 0.7270540 0.10372090
## 0.17117244 7.190488 0.4109681 0.3156617 0.06524176
## 0.45274420 8.370570 0.3648352 1.3532270 NA
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07165784.
There are many, many more algorithms out there in R, and available in caret. Here is the whole list of the models in the caret package. It is helpful to look at this list.
There are several techniques that you can use to compare machine learning algorithms in R.
Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use the estimates to choose one or two best models from the suite of models that you have created.
The way that you can do that is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.
Function resamples() in caret provides methods for collection, analyzing and visualizing a set of resampling results from a common data set. Then the summary() function computes summary statistics across each model/metric combination.
# prepare training scheme
control = trainControl(method="repeatedcv", number=10, repeats=3)
# CART
set.seed(7)
fit.cart = train(diabetes~., data=PimaIndiansDiabetes, method="rpart", trControl=control)
# LDA
set.seed(7)
fit.lda = train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=control)
# SVM
set.seed(7)
fit.svm = train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", trControl=control)
# kNN
set.seed(7)
fit.knn = train(diabetes~., data=PimaIndiansDiabetes, method="knn", trControl=control)
# Random Forest
set.seed(7)
fit.rf = train(diabetes~., data=PimaIndiansDiabetes, method="rf", trControl=control)
# collect resamples
results = resamples(list(CART=fit.cart, LDA=fit.lda, SVM=fit.svm, KNN=fit.knn, RF=fit.rf))
# summarize differences between models
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CART, LDA, SVM, KNN, RF
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.6234 0.7115 0.7403 0.7382 0.7760 0.8442 0
## LDA 0.6711 0.7532 0.7662 0.7759 0.8052 0.8701 0
## SVM 0.6711 0.7403 0.7582 0.7651 0.7890 0.8961 0
## KNN 0.6184 0.6984 0.7321 0.7299 0.7532 0.8182 0
## RF 0.6711 0.7273 0.7516 0.7617 0.7890 0.8571 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.1585 0.3296 0.3765 0.3934 0.4685 0.6393 0
## LDA 0.2484 0.4196 0.4516 0.4801 0.5512 0.7048 0
## SVM 0.2187 0.3889 0.4167 0.4520 0.5003 0.7638 0
## KNN 0.1113 0.3228 0.3867 0.3819 0.4382 0.5867 0
## RF 0.2624 0.3787 0.4516 0.4588 0.5193 0.6781 0
# density plots of accuracy
scales = list(x=list(relation="free"), y=list(relation="free"))
densityplot(results, scales=scales, pch = "|")
# dot plots of accuracy
scales <- list(x=list(relation="free"), y=list(relation="free"))
dotplot(results, scales=scales)
We can calculate the significance of the differences between the metric distributions of different machine learning algorithms. Results can be summarized directly by calling the summary() function.
# difference in model predictions
diffs <- diff(results)
# summarize p-values for pair-wise comparisons
summary(diffs)
##
## Call:
## summary.diff.resamples(object = diffs)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## CART LDA SVM KNN RF
## CART -0.037759 -0.026908 0.008248 -0.023473
## LDA 0.0050068 0.010851 0.046007 0.014286
## SVM 0.0919580 0.3390336 0.035156 0.003435
## KNN 1.0000000 1.218e-05 0.0007092 -0.031721
## RF 0.1722106 0.1349151 1.0000000 0.0034441
##
## Kappa
## CART LDA SVM KNN RF
## CART -0.086692 -0.058612 0.011552 -0.065345
## LDA 0.001548 0.028079 0.098243 0.021347
## SVM 0.083995 0.221740 0.070164 -0.006733
## KNN 1.000000 4.122e-05 0.005469 -0.076897
## RF 0.031922 1.000000 1.000000 0.001012
The upper diagonal of the table shows the estimated difference between the distributions. If we think that LDA is the most accurate model from looking at the previous graphs, we can get an estimate of how much better than specific other models in terms of absolute accuracy.
sessionInfo()
## R version 3.2.4 RC (2016-03-02 r70278)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.6-12 rpart_4.1-10 kernlab_0.9-23
## [4] klaR_0.6-12 e1071_1.6-7 glmnet_2.0-3
## [7] foreach_1.4.3 Matrix_1.2-4 MASS_7.3-29
## [10] caret_6.0-64 ggplot2_2.0.0 lattice_0.20-24
## [13] mlbench_2.1-1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 compiler_3.2.4 formatR_1.2.1
## [4] nloptr_1.0.4 plyr_1.8.3 class_7.3-9
## [7] iterators_1.0.8 tools_3.2.4 digest_0.6.9
## [10] lme4_1.1-11 evaluate_0.8 gtable_0.1.2
## [13] nlme_3.1-125 mgcv_1.7-28 yaml_2.1.13
## [16] parallel_3.2.4 SparseM_1.7 stringr_1.0.0
## [19] knitr_1.12.3 MatrixModels_0.4-1 combinat_0.0-8
## [22] stats4_3.2.4 grid_3.2.4 nnet_7.3-7
## [25] rmarkdown_0.9.5 knitrBootstrap_1.0.0 minqa_1.2.4
## [28] reshape2_1.4.1 car_2.1-1 magrittr_1.5
## [31] scales_0.4.0 codetools_0.2-8 htmltools_0.3
## [34] splines_3.2.4 pbkrtest_0.4-4 mime_0.4
## [37] colorspace_1.2-6 quantreg_5.21 stringi_1.0-1
## [40] munsell_0.4.3 markdown_0.7.7
Last updated: 2016-03-14