Complete all exercises and submit your answers to VtopBeta.

Datasets

### load packages
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(knitr)
Iris dataset for training and testing
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa

Split the data into a training set and a testing set

ir_data=iris
set.seed(100)
head(ir_data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
intrain <- createDataPartition(y = ir_data$Species, p = 0.7, list = FALSE)
training <- ir_data[intrain, ]
testing <- ir_data[-intrain, ]
dim(training);dim(testing)
## [1] 105   5
## [1] 45  5
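As a rough illustration of what a stratified 70/30 split does, the sketch below rebuilds one in base R by sampling 70% of the row indices within each species. It is not caret's implementation and will not pick the same rows as `createDataPartition`; the names `idx_by_class`, `train_idx`, `train_set` and `test_set` are made up for this example.

```r
# Stratified 70/30 split in base R (illustrative; not caret's algorithm)
set.seed(100)

# Row indices grouped by class label
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample 70% of the rows inside each class, then pool the indices
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Each species contributes 35 rows to training and 15 to testing
table(train_set$Species)
table(test_set$Species)
```

Sampling within each class is what keeps the three species balanced in both subsets, which matches the 105 / 45 row counts reported above.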
summary(ir_data)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
training[["Species"]] = factor(training[["Species"]])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
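To make the `repeatedcv` setting more concrete, here is a minimal base-R sketch of how 10 folds repeated 3 times yield 30 resamples. It assumes a plain (unstratified) shuffle, whereas caret stratifies folds by class, so the exact sample sizes differ slightly from caret's; `resamples`, `fold_id` and the other names are hypothetical.

```r
# Repeated k-fold cross-validation indices in base R (illustrative, unstratified)
set.seed(100)
n_rows  <- 105   # size of the training set in this document
n_folds <- 10
n_reps  <- 3

resamples <- list()
for (r in seq_len(n_reps)) {
  # Shuffle the fold labels 1..10 across all rows for this repeat
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))
  for (f in seq_len(n_folds)) {
    name <- sprintf("Fold%02d.Rep%d", f, r)
    # Rows used for fitting; the rows with fold_id == f are held out
    resamples[[name]] <- which(fold_id != f)
  }
}

length(resamples)           # 30 fits per candidate model
range(lengths(resamples))   # 94 95 rows used for fitting
```

Each candidate model is therefore fitted 30 times, and the accuracy caret reports is the average over those 30 held-out folds.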

The confusion matrix below shows that the linear SVM reaches 95.56% accuracy on the test set.

Support Vector Machine

set.seed(3233)
svm_Linear <- train(Species ~ ., data = training, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
svm_Linear
## Support Vector Machines with Linear Kernel 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 94, 93, 95, 95, 94, 96, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9589562  0.9381692
## 
## Tuning parameter 'C' was held constant at a value of 1
test_pred <- predict(svm_Linear, newdata = testing)
test_pred
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor virginica  versicolor
## [25] versicolor versicolor versicolor virginica  versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica
confusionMatrix(test_pred, testing$Species )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         0
##   virginica       0          2        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8667           1.0000
## Specificity                 1.0000            1.0000           0.9333
## Pos Pred Value              1.0000            1.0000           0.8824
## Neg Pred Value              1.0000            0.9375           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2889           0.3333
## Detection Prevalence        0.3333            0.2889           0.3778
## Balanced Accuracy           1.0000            0.9333           0.9667
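The headline numbers in the `confusionMatrix` output can be reproduced by hand from the 3 x 3 table. The sketch below re-enters the counts shown above and recomputes accuracy, Cohen's kappa, per-class sensitivity and positive predictive value; the variable names are chosen for this example only.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the table above; matrix() fills column by column,
# so each group of three values is one Reference class
cm <- matrix(c(15,  0,  0,    # reference setosa
                0, 13,  2,    # reference versicolor
                0,  0, 15),   # reference virginica
             nrow = 3,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / n

# Kappa: accuracy corrected for the agreement expected by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

# Sensitivity divides by column totals, positive predictive value by row totals
sensitivity    <- diag(cm) / colSums(cm)
pos_pred_value <- diag(cm) / rowSums(cm)

round(c(accuracy = accuracy, kappa = kappa_stat), 4)
round(sensitivity, 4)
round(pos_pred_value, 4)
```

The two versicolor flowers predicted as virginica are the only errors, which is why versicolor sensitivity (13/15) and virginica positive predictive value (15/17) are the only per-class values below 1.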
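# C = 0 is not a valid cost: every fit with C = 0 fails, which produces the warnings and the NaN row below
grid <- expand.grid(C = c(0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5))
set.seed(3233)

# tuneLength is ignored when tuneGrid is supplied, so it is omitted here
svm_Linear_Grid <- train(Species ~ ., data = training, method = "svmLinear",
                         trControl = trctrl,
                         preProcess = c("center", "scale"),
                         tuneGrid = grid)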
## Warning: model fit failed for Fold01.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## (the same warning repeats for the remaining 29 resamples, Fold02.Rep1 through Fold10.Rep3)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
svm_Linear_Grid
## Support Vector Machines with Linear Kernel 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 94, 93, 95, 95, 94, 96, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.00        NaN        NaN
##   0.01  0.8820539  0.8228964
##   0.05  0.9375253  0.9060775
##   0.10  0.9612626  0.9416873
##   0.25  0.9589562  0.9381692
##   0.50  0.9626599  0.9437247
##   0.75  0.9519192  0.9276385
##   1.00  0.9589562  0.9381692
##   1.25  0.9619865  0.9428105
##   1.50  0.9619865  0.9428105
##   1.75  0.9619865  0.9428105
##   2.00  0.9619865  0.9428105
##   5.00  0.9717508  0.9575908
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 5.
plot(svm_Linear_Grid)
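The selection rule ("largest accuracy wins") can be replayed on the printed results. The sketch below copies the resampled accuracies from the output above into a data frame, picks the best cost, and builds a grid without the invalid `C = 0` entry; `cv_results`, `best_C` and `grid_pos` are names invented for this illustration.

```r
# Resampled accuracy per cost value, copied from the svm_Linear_Grid output above
cv_results <- data.frame(
  C = c(0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5),
  Accuracy = c(NaN, 0.8820539, 0.9375253, 0.9612626, 0.9589562, 0.9626599,
               0.9519192, 0.9589562, 0.9619865, 0.9619865, 0.9619865,
               0.9619865, 0.9717508)
)

# which.max() skips NaN, so the failed C = 0 row cannot be selected
best_C <- cv_results$C[which.max(cv_results$Accuracy)]
best_C

# A grid restricted to strictly positive costs avoids the failed fits entirely
grid_pos <- expand.grid(C = cv_results$C[cv_results$C > 0])
nrow(grid_pos)
```

Since the selected value sits at the upper edge of the grid, it may be worth trying costs larger than 5 as well before settling on a final model.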

test_pred_grid <- predict(svm_Linear_Grid, newdata = testing)
test_pred_grid
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor virginica  versicolor
## [25] versicolor versicolor versicolor virginica  versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica
confusionMatrix(test_pred_grid, testing$Species )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         0
##   virginica       0          2        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8667           1.0000
## Specificity                 1.0000            1.0000           0.9333
## Pos Pred Value              1.0000            1.0000           0.8824
## Neg Pred Value              1.0000            0.9375           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2889           0.3333
## Detection Prevalence        0.3333            0.2889           0.3778
## Balanced Accuracy           1.0000            0.9333           0.9667

Random forest

library(randomForest)
model <- randomForest(Species ~., data = training)
pred <- predict(model, newdata = testing)
table(pred, testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         0
##   virginica       0          2        15
(15 + 13 + 15) / nrow(testing) # sum of the diagonal elements of the table above
## [1] 0.9555556
plot(model)

So 95.56% accuracy is found on the test set (43 of 45 correct).

Naive Bayes

library(e1071)
model <- naiveBayes(Species ~., data = training)
class(model)
## [1] "naiveBayes"
summary(model)
##         Length Class  Mode     
## apriori 3      table  numeric  
## tables  4      -none- list     
## levels  3      -none- character
## call    4      -none- call
print(model)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Conditional probabilities:
##             Sepal.Length
## Y                [,1]      [,2]
##   setosa     5.071429 0.3409083
##   versicolor 5.825714 0.4667427
##   virginica  6.540000 0.6611932
## 
##             Sepal.Width
## Y                [,1]      [,2]
##   setosa     3.517143 0.3416962
##   versicolor 2.748571 0.2974118
##   virginica  2.962857 0.3263756
## 
##             Petal.Length
## Y                [,1]      [,2]
##   setosa     1.471429 0.1856173
##   versicolor 4.182857 0.4712223
##   virginica  5.525714 0.5653437
## 
##             Petal.Width
## Y                 [,1]      [,2]
##   setosa     0.2514286 0.1039554
##   versicolor 1.3114286 0.1794951
##   virginica  1.9885714 0.2857101
preds <- predict(model, newdata = training)
table(preds,training$Species)
##             
## preds        setosa versicolor virginica
##   setosa         35          0         0
##   versicolor      0         33         3
##   virginica       0          2        32
(35+33+32)/(35+33+2+32+3)#change this according to the diagonal element of the previous statement result 
## [1] 0.952381

So this method reaches 95.24% accuracy, measured on the training set rather than on the held-out test set.
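For numeric predictors, the two columns in each "Conditional probabilities" table are the per-class mean and standard deviation, and prediction multiplies the prior by one Gaussian density per feature. The following sketch redoes that calculation by hand for the first iris flower, using the values printed above; it is an illustration of the idea rather than the e1071 source, and all variable names are made up here.

```r
# Class-conditional means and standard deviations, copied from print(model) above
classes <- c("setosa", "versicolor", "virginica")
means <- rbind(
  Sepal.Length = c(5.071429, 5.825714, 6.540000),
  Sepal.Width  = c(3.517143, 2.748571, 2.962857),
  Petal.Length = c(1.471429, 4.182857, 5.525714),
  Petal.Width  = c(0.2514286, 1.3114286, 1.9885714)
)
sds <- rbind(
  Sepal.Length = c(0.3409083, 0.4667427, 0.6611932),
  Sepal.Width  = c(0.3416962, 0.2974118, 0.3263756),
  Petal.Length = c(0.1856173, 0.4712223, 0.5653437),
  Petal.Width  = c(0.1039554, 0.1794951, 0.2857101)
)
colnames(means) <- colnames(sds) <- classes
prior <- c(setosa = 1/3, versicolor = 1/3, virginica = 1/3)

# Measurements of the first row of iris
x <- c(Sepal.Length = 5.1, Sepal.Width = 3.5, Petal.Length = 1.4, Petal.Width = 0.2)

# Log posterior (up to a constant): log prior + sum of log Gaussian densities
log_post <- sapply(classes, function(cl) {
  log(prior[[cl]]) + sum(dnorm(x, mean = means[, cl], sd = sds[, cl], log = TRUE))
})

# Normalise to probabilities; subtracting the maximum avoids numerical underflow
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)

names(which.max(posterior))   # "setosa"
```

Working in log space is a common design choice here: the product of four small densities can underflow, while the sum of their logarithms does not.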

Decision tree

dtree_fit <- train(Species ~ ., data = training, method = "rpart",
                   parms = list(split = "information"),
                   trControl = trctrl,
                   tuneLength = 10)
dtree_fit
## CART 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 96, 96, 94, 93, 93, 94, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.00000000  0.9432323  0.9135621
##   0.05555556  0.9432323  0.9135621
##   0.11111111  0.9432323  0.9135621
##   0.16666667  0.9432323  0.9135621
##   0.22222222  0.9432323  0.9135621
##   0.27777778  0.9432323  0.9135621
##   0.33333333  0.9432323  0.9135621
##   0.38888889  0.9432323  0.9135621
##   0.44444444  0.8632323  0.7992764
##   0.50000000  0.3920202  0.1285714
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.3888889.
library(rpart.plot)
library(RColorBrewer)
prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
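The `split = "information"` option asks the tree to choose splits by entropy reduction. As a worked example under simple assumptions, the sketch below computes the information gain of a hypothetical first split that separates all 35 training setosa flowers from the other two species; rpart's internal scaling of this quantity may differ, and the function and variable names are invented for this illustration.

```r
# Shannon entropy (in bits) of a vector of class counts
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Training set: 35 flowers per species
parent <- c(setosa = 35, versicolor = 35, virginica = 35)

# Hypothetical first split that isolates setosa perfectly
left  <- c(setosa = 35, versicolor = 0,  virginica = 0)
right <- c(setosa = 0,  versicolor = 35, virginica = 35)

# Weighted average entropy of the two child nodes
n <- sum(parent)
children_entropy <- (sum(left) / n) * entropy(left) +
  (sum(right) / n) * entropy(right)

# Information gain: entropy removed by the split
info_gain <- entropy(parent) - children_entropy

entropy(parent)   # log2(3), about 1.585 bits
info_gain         # about 0.918 bits
```

The pure setosa node contributes zero entropy, so all remaining uncertainty (1 bit per flower) lies in the node that still mixes versicolor and virginica.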

test_pred <- predict(dtree_fit, newdata = testing)
confusionMatrix(test_pred, testing$Species) # test-set evaluation of the tree (output not shown)

The selected tree (cp = 0.3888889) reached 94.32% accuracy under repeated cross-validation.

K Nearest Neighbors

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
knn_fit <- train(Species ~., data = training, method = "knn",
                    trControl=trctrl,
                    preProcess =    c("center", "scale"),
                    tuneLength =    10)
knn_fit
## k-Nearest Neighbors 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 95, 95, 93, 94, 93, 95, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9602862  0.9400659
##    7  0.9512626  0.9263538
##    9  0.9401515  0.9096871
##   11  0.9425589  0.9133488
##   13  0.9483670  0.9220234
##   15  0.9434343  0.9145537
##   17  0.9337374  0.8999448
##   19  0.9236700  0.8848134
##   21  0.9239226  0.8853054
##   23  0.9183670  0.8768944
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn_fit)
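To show what the fitted model does at prediction time, here is a small base-R sketch of k-nearest-neighbours with centering and scaling. It assumes Euclidean distance and a simple majority vote (ties go to the first class level), uses its own random split rather than the `training` and `testing` objects above, and is not the code caret runs; `knn_predict` is a hypothetical helper.

```r
# k-nearest-neighbours by hand: standardise, measure distances, take a majority vote
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Centre and scale with statistics from the training data only
  centers <- colMeans(train_x)
  scales  <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centers, scale = scales)
  new_z   <- scale(new_x, center = centers, scale = scales)

  apply(new_z, 1, function(row) {
    # Euclidean distance from this point to every training point
    d <- sqrt(rowSums(sweep(train_z, 2, row)^2))
    nearest <- order(d)[seq_len(k)]
    # Most frequent class among the k nearest neighbours
    names(which.max(table(train_y[nearest])))
  })
}

# Usage on a simple random 105 / 45 split of iris
set.seed(3333)
train_idx <- sample(nrow(iris), 105)
pred <- knn_predict(iris[train_idx, 1:4], iris$Species[train_idx],
                    iris[-train_idx, 1:4], k = 5)
mean(pred == iris$Species[-train_idx])   # share of correct test predictions
```

Scaling matters for this method because the distance would otherwise be dominated by whichever measurement has the largest numeric range.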

test_pred <- predict(knn_fit, newdata = testing)
test_pred
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] versicolor setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor virginica  versicolor
## [25] versicolor versicolor versicolor virginica  versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica
confusionMatrix(test_pred, testing$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      1         13         0
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9333         
##                  95% CI : (0.8173, 0.986)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9            
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.9333            0.8667           1.0000
## Specificity                 1.0000            0.9667           0.9333
## Pos Pred Value              1.0000            0.9286           0.8824
## Neg Pred Value              0.9677            0.9355           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3111            0.2889           0.3333
## Detection Prevalence        0.3111            0.3111           0.3778
## Balanced Accuracy           0.9667            0.9167           0.9667

So 93.33% accuracy was found on the test set using this method (42 of 45 correct).

Inference

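According to the test-set confusion tables above, the linear SVM and Random Forest perform best on this dataset (43 of 45 correct, 95.56%), ahead of KNN (42 of 45, 93.33%). Naive Bayes was scored on the training set only (95.24%) and the decision tree is reported by its cross-validated accuracy (94.32%), so those two figures are not directly comparable with the test-set results.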