People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. This is the “classe” variable in the training set; we may use any of the other variables to predict with. We will describe how we built our model, how we used cross-validation, what we think the expected out-of-sample error is, and why we chose our particular prediction model over the others. We will also use our prediction model to predict 20 different test cases.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
After loading the data, we rename the problem_id variable of the test data to classe and convert its class from integer to factor (the class of classe in the training data). This will help us later when we predict the test cases with our model.
train_data<- read.csv("pml-training.csv", header = TRUE, na.strings=c(" ","","NA"))
test_data<- read.csv("pml-testing.csv", header = TRUE, na.strings=c(" ","","NA"))
# renaming the problem_id column of test_data to classe
names(test_data)[160]<- "classe"
# converting the class of that column to match the class of classe in train_data
test_data$classe<- as.factor(test_data$classe)
Counting the number of variables in the train and test data that have more than 5% NAs:
dim(train_data)
## [1] 19622 160
dim(test_data)
## [1] 20 160
isNA<- function(x){sum(is.na(x))/length(x)}
sum(sapply(train_data,isNA)> 0.05)
## [1] 100
sum(sapply(test_data,isNA)> 0.05)
## [1] 100
We remove the variables with more than 5% NAs, and also drop the first five columns of the test and training data: they hold identifiers such as the row number, participant name, and timestamps, which have nothing to do with predicting the classe variable.
train_data<- train_data[,sapply(train_data,isNA)< 0.05]
test_data<- test_data[,sapply(test_data,isNA)< 0.05]
train_data<- train_data[,-c(1:5)]
test_data<- test_data[,-c(1:5)]
dim(train_data)
## [1] 19622 55
dim(test_data)
## [1] 20 55
Now let us take a look at the clean dataset using the summary() function.
summary(train_data)
## new_window num_window roll_belt pitch_belt
## no :19216 Min. : 1.0 Min. :-28.90 Min. :-55.8000
## yes: 406 1st Qu.:222.0 1st Qu.: 1.10 1st Qu.: 1.7600
## Median :424.0 Median :113.00 Median : 5.2800
## Mean :430.6 Mean : 64.41 Mean : 0.3053
## 3rd Qu.:644.0 3rd Qu.:123.00 3rd Qu.: 14.9000
## Max. :864.0 Max. :162.00 Max. : 60.3000
## yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## Min. :-180.00 Min. : 0.00 Min. :-1.040000 Min. :-0.64000
## 1st Qu.: -88.30 1st Qu.: 3.00 1st Qu.:-0.030000 1st Qu.: 0.00000
## Median : -13.00 Median :17.00 Median : 0.030000 Median : 0.02000
## Mean : -11.21 Mean :11.31 Mean :-0.005592 Mean : 0.03959
## 3rd Qu.: 12.90 3rd Qu.:18.00 3rd Qu.: 0.110000 3rd Qu.: 0.11000
## Max. : 179.00 Max. :29.00 Max. : 2.220000 Max. : 0.64000
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## Min. :-1.4600 Min. :-120.000 Min. :-69.00 Min. :-275.00
## 1st Qu.:-0.2000 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00
## Median :-0.1000 Median : -15.000 Median : 35.00 Median :-152.00
## Mean :-0.1305 Mean : -5.595 Mean : 30.15 Mean : -72.59
## 3rd Qu.:-0.0200 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00
## Max. : 1.6200 Max. : 85.000 Max. :164.00 Max. : 105.00
## magnet_belt_x magnet_belt_y magnet_belt_z roll_arm
## Min. :-52.0 Min. :354.0 Min. :-623.0 Min. :-180.00
## 1st Qu.: 9.0 1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77
## Median : 35.0 Median :601.0 Median :-320.0 Median : 0.00
## Mean : 55.6 Mean :593.7 Mean :-345.5 Mean : 17.83
## 3rd Qu.: 59.0 3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30
## Max. :485.0 Max. :673.0 Max. : 293.0 Max. : 180.00
## pitch_arm yaw_arm total_accel_arm gyros_arm_x
## Min. :-88.800 Min. :-180.0000 Min. : 1.00 Min. :-6.37000
## 1st Qu.:-25.900 1st Qu.: -43.1000 1st Qu.:17.00 1st Qu.:-1.33000
## Median : 0.000 Median : 0.0000 Median :27.00 Median : 0.08000
## Mean : -4.612 Mean : -0.6188 Mean :25.51 Mean : 0.04277
## 3rd Qu.: 11.200 3rd Qu.: 45.8750 3rd Qu.:33.00 3rd Qu.: 1.57000
## Max. : 88.500 Max. : 180.0000 Max. :66.00 Max. : 4.87000
## gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
## Min. :-3.4400 Min. :-2.3300 Min. :-404.00 Min. :-318.0
## 1st Qu.:-0.8000 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0
## Median :-0.2400 Median : 0.2300 Median : -44.00 Median : 14.0
## Mean :-0.2571 Mean : 0.2695 Mean : -60.24 Mean : 32.6
## 3rd Qu.: 0.1400 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0
## Max. : 2.8400 Max. : 3.0200 Max. : 437.00 Max. : 308.0
## accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z
## Min. :-636.00 Min. :-584.0 Min. :-392.0 Min. :-597.0
## 1st Qu.:-143.00 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2
## Median : -47.00 Median : 289.0 Median : 202.0 Median : 444.0
## Mean : -71.25 Mean : 191.7 Mean : 156.6 Mean : 306.5
## 3rd Qu.: 23.00 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0
## Max. : 292.00 Max. : 782.0 Max. : 583.0 Max. : 694.0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## Min. :-153.71 Min. :-149.59 Min. :-150.871
## 1st Qu.: -18.49 1st Qu.: -40.89 1st Qu.: -77.644
## Median : 48.17 Median : -20.96 Median : -3.324
## Mean : 23.84 Mean : -10.78 Mean : 1.674
## 3rd Qu.: 67.61 3rd Qu.: 17.50 3rd Qu.: 79.643
## Max. : 153.55 Max. : 149.40 Max. : 154.952
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## Min. : 0.00 Min. :-204.0000 Min. :-2.10000
## 1st Qu.: 4.00 1st Qu.: -0.0300 1st Qu.:-0.14000
## Median :10.00 Median : 0.1300 Median : 0.03000
## Mean :13.72 Mean : 0.1611 Mean : 0.04606
## 3rd Qu.:19.00 3rd Qu.: 0.3500 3rd Qu.: 0.21000
## Max. :58.00 Max. : 2.2200 Max. :52.00000
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## Min. : -2.380 Min. :-419.00 Min. :-189.00 Min. :-334.00
## 1st Qu.: -0.310 1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00
## Median : -0.130 Median : -8.00 Median : 41.50 Median : -1.00
## Mean : -0.129 Mean : -28.62 Mean : 52.63 Mean : -38.32
## 3rd Qu.: 0.030 3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00
## Max. :317.000 Max. : 235.00 Max. : 315.00 Max. : 318.00
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## Min. :-643.0 Min. :-3600 Min. :-262.00 Min. :-180.0000
## 1st Qu.:-535.0 1st Qu.: 231 1st Qu.: -45.00 1st Qu.: -0.7375
## Median :-479.0 Median : 311 Median : 13.00 Median : 21.7000
## Mean :-328.5 Mean : 221 Mean : 46.05 Mean : 33.8265
## 3rd Qu.:-304.0 3rd Qu.: 390 3rd Qu.: 95.00 3rd Qu.: 140.0000
## Max. : 592.0 Max. : 633 Max. : 452.00 Max. : 180.0000
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x
## Min. :-72.50 Min. :-180.00 Min. : 0.00 Min. :-22.000
## 1st Qu.: 0.00 1st Qu.: -68.60 1st Qu.: 29.00 1st Qu.: -0.220
## Median : 9.24 Median : 0.00 Median : 36.00 Median : 0.050
## Mean : 10.71 Mean : 19.21 Mean : 34.72 Mean : 0.158
## 3rd Qu.: 28.40 3rd Qu.: 110.00 3rd Qu.: 41.00 3rd Qu.: 0.560
## Max. : 89.80 Max. : 180.00 Max. :108.00 Max. : 3.970
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
## 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
## Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
## Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
## 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
## Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0
## 1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0
## Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0
## Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6
## 3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0
## Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
## classe
## A:5580
## B:3797
## C:3422
## D:3216
## E:3607
##
We should check the correlations among the variables before proceeding to modeling, as this helps us assess the scope for further dimension reduction of the training data using PCA.
library(corrgram)
corrgram(train_data, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Correlation among predictors of training data")
[Figure: corrgram of pairwise correlations among the training-data predictors]
The order of the variables along the diagonal of the correlation plot is the same as the order of the variables in the summary of train_data shown above. The plot shows little correlation among most pairs of variables, so we can move on to modeling without worrying much about multicollinearity.
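For a quick numeric check of the same point, we can count the predictors that are highly correlated with some other predictor using caret's findCorrelation(); a minimal sketch, assuming a |r| > 0.8 cutoff (the cutoff is our choice, not part of the original analysis):

suppressMessages(library(caret))
# correlation matrix over the numeric predictors only (new_window and classe are factors)
numeric_cols <- sapply(train_data, is.numeric)
corMat <- cor(train_data[, numeric_cols])
# indices of the columns findCorrelation() would flag for removal at the 0.8 cutoff
highCor <- findCorrelation(corMat, cutoff = 0.8)
length(highCor)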
We first use classification trees to analyze the training data. We have to predict the classe variable from the rest of the variables in the data set. We use the rpart package to fit the decision tree, and the rattle and rpart.plot packages to plot it.
We divide train_data into a training set and a cross-validation set, using the validation-set approach.
suppressMessages(library(randomForest))
suppressMessages(library(caret))
set.seed(1)
inTrain <- sample(1:nrow(train_data), .7*nrow(train_data),replace = FALSE)
train<- train_data[inTrain,]
cv<- train_data[-inTrain,]
suppressMessages(library(rattle))
suppressMessages(library(rpart.plot))
suppressMessages(library(rpart))
set.seed(2)
tree.WLE<- rpart(classe~. , train, method="class")
Calling summary(tree.WLE) lists the splits and the variables used as internal nodes of the tree, while printcp(tree.WLE) reports the (training) error rate at each size of the tree.
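For instance, the fitted tree can be inspected numerically before plotting; an optional check, not run in the original analysis:

# complexity-parameter table: relative and cross-validated error at each split
printcp(tree.WLE)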
suppressWarnings(fancyRpartPlot(tree.WLE))
tree.pred <- predict(tree.WLE ,cv, type ="class")
DecTreeConfMat<- confusionMatrix(tree.pred, cv$classe)
DecTreeConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1508 262 47 99 42
## B 44 626 37 33 25
## C 11 61 869 131 57
## D 86 132 62 586 111
## E 17 59 67 117 798
##
## Overall Statistics
##
## Accuracy : 0.7452
## 95% CI : (0.7339, 0.7563)
## No Information Rate : 0.283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6761
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9052 0.5491 0.8031 0.60663 0.7725
## Specificity 0.8934 0.9707 0.9459 0.92054 0.9464
## Pos Pred Value 0.7702 0.8183 0.7697 0.59980 0.7543
## Neg Pred Value 0.9598 0.8996 0.9552 0.92261 0.9513
## Prevalence 0.2830 0.1936 0.1838 0.16409 0.1755
## Detection Rate 0.2562 0.1063 0.1476 0.09954 0.1356
## Detection Prevalence 0.3326 0.1299 0.1918 0.16596 0.1797
## Balanced Accuracy 0.8993 0.7599 0.8745 0.76358 0.8595
plot(DecTreeConfMat$table, col = DecTreeConfMat$byClass,
main = paste("Decision Tree - Accuracy =",
round(DecTreeConfMat$overall['Accuracy'], 4)))
We use the caret package because it is difficult to guess the n.trees and interaction.depth arguments of the gbm() function up front; caret handles the parameter tuning for us. With the gbm package alone we would have to pick initial n.trees and interaction.depth values and select the best ones by cross-validation ourselves, which is tedious compared with boosting through caret, where the cross-validation and the selection of appropriate n.trees and interaction.depth are done by the train() function itself.
set.seed(5)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
GBM.WLE <- train(classe ~ ., data= train, method = "gbm",
trControl = controlGBM, verbose = FALSE)
GBM.WLE
## Stochastic Gradient Boosting
##
## 13735 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 10988, 10988, 10988, 10987, 10989
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7635965 0.7000433
## 1 100 0.8326170 0.7881208
## 1 150 0.8728797 0.8390671
## 2 50 0.8861303 0.8557128
## 2 100 0.9394981 0.9234230
## 2 150 0.9624317 0.9524555
## 3 50 0.9303973 0.9118459
## 3 100 0.9696394 0.9615733
## 3 150 0.9868946 0.9834172
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
GBM.pred <- predict(GBM.WLE, newdata = cv)
GBMConfMat <- confusionMatrix(GBM.pred, cv$classe)
GBMConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1662 12 0 0 0
## B 3 1114 4 2 5
## C 0 13 1073 10 4
## D 1 1 5 953 12
## E 0 0 0 1 1012
##
## Overall Statistics
##
## Accuracy : 0.9876
## 95% CI : (0.9844, 0.9903)
## No Information Rate : 0.283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9843
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9772 0.9917 0.9865 0.9797
## Specificity 0.9972 0.9971 0.9944 0.9961 0.9998
## Pos Pred Value 0.9928 0.9876 0.9755 0.9805 0.9990
## Neg Pred Value 0.9991 0.9945 0.9981 0.9974 0.9957
## Prevalence 0.2830 0.1936 0.1838 0.1641 0.1755
## Detection Rate 0.2823 0.1892 0.1823 0.1619 0.1719
## Detection Prevalence 0.2844 0.1916 0.1869 0.1651 0.1721
## Balanced Accuracy 0.9974 0.9871 0.9930 0.9913 0.9897
plot(GBMConfMat$table, col = GBMConfMat$byClass,
main = paste("GBM - Accuracy =", round(GBMConfMat$overall['Accuracy'], 4)))
In bagging we build a number of decision trees on bootstrapped training samples. Random forests provide an improvement over bagged trees via a small tweak that decorrelates the trees: each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. Bagging can also be done with the caret package, and is in fact easier there because caret includes cross-validation; here, however, we use the randomForest package to show that bagging is simply a special case of a random forest with m = p (in the randomForest() function, m is the mtry argument).
suppressMessages(library(randomForest))
set.seed(3)
bag.WLE<- randomForest(classe ~ ., train, mtry = dim(train)[2]-1, importance =TRUE)
bag.WLE
##
## Call:
## randomForest(formula = classe ~ ., data = train, mtry = dim(train)[2] - 1, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 54
##
## OOB estimate of error rate: 0.46%
## Confusion matrix:
## A B C D E class.error
## A 3909 3 0 1 1 0.001277466
## B 15 2636 5 1 0 0.007903651
## C 0 7 2327 6 0 0.005555556
## D 0 1 12 2236 1 0.006222222
## E 1 1 0 8 2564 0.003885004
bag.pred <- predict(bag.WLE,newdata = cv)
BagConfMat <- confusionMatrix(bag.pred, cv$classe)
BagConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1665 0 0 0 0
## B 1 1138 3 1 2
## C 0 1 1079 7 0
## D 0 1 0 958 4
## E 0 0 0 0 1027
##
## Overall Statistics
##
## Accuracy : 0.9966
## 95% CI : (0.9948, 0.9979)
## No Information Rate : 0.283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9957
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9982 0.9972 0.9917 0.9942
## Specificity 1.0000 0.9985 0.9983 0.9990 1.0000
## Pos Pred Value 1.0000 0.9939 0.9926 0.9948 1.0000
## Neg Pred Value 0.9998 0.9996 0.9994 0.9984 0.9988
## Prevalence 0.2830 0.1936 0.1838 0.1641 0.1755
## Detection Rate 0.2828 0.1933 0.1833 0.1627 0.1745
## Detection Prevalence 0.2828 0.1945 0.1846 0.1636 0.1745
## Balanced Accuracy 0.9997 0.9984 0.9978 0.9954 0.9971
plot(BagConfMat$table, col =BagConfMat$byClass,
main = paste("Bagging - Accuracy =",
round(BagConfMat$overall['Accuracy'], 4)))
Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the mtry argument. By default, randomForest() uses p/3 variables when building a random forest of regression trees, and sqrt(p) variables when building a random forest of classification trees. Since ours is a classification problem, we use mtry = sqrt(p).
set.seed(4)
rForest.WLE<- randomForest(classe ~ ., train, mtry = sqrt(dim(train)[2]-1), importance =TRUE)
rForest.WLE
##
## Call:
## randomForest(formula = classe ~ ., data = train, mtry = sqrt(dim(train)[2] - 1), importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.31%
## Confusion matrix:
## A B C D E class.error
## A 3913 1 0 0 0 0.0002554931
## B 7 2650 0 0 0 0.0026345502
## C 0 10 2330 0 0 0.0042735043
## D 0 0 16 2232 2 0.0080000000
## E 0 0 0 7 2567 0.0027195027
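Since the forest was grown with importance = TRUE, we can also inspect which predictors matter most; a supplementary check, not part of the original write-up:

# top 10 predictors by mean decrease in accuracy and in Gini index
varImpPlot(rForest.WLE, n.var = 10, main = "Random forest: top 10 predictors")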
rForest.pred <- predict(rForest.WLE, newdata = cv)
rForestConfMat <- confusionMatrix(rForest.pred, cv$classe)
rForestConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1666 1 0 0 0
## B 0 1138 3 0 0
## C 0 1 1079 3 0
## D 0 0 0 963 2
## E 0 0 0 0 1031
##
## Overall Statistics
##
## Accuracy : 0.9983
## 95% CI : (0.9969, 0.9992)
## No Information Rate : 0.283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9979
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 0.9972 0.9969 0.9981
## Specificity 0.9998 0.9994 0.9992 0.9996 1.0000
## Pos Pred Value 0.9994 0.9974 0.9963 0.9979 1.0000
## Neg Pred Value 1.0000 0.9996 0.9994 0.9994 0.9996
## Prevalence 0.2830 0.1936 0.1838 0.1641 0.1755
## Detection Rate 0.2830 0.1933 0.1833 0.1636 0.1751
## Detection Prevalence 0.2832 0.1938 0.1840 0.1639 0.1751
## Balanced Accuracy 0.9999 0.9988 0.9982 0.9982 0.9990
plot(rForestConfMat$table, col =rForestConfMat$byClass,
main = paste("Random Forest - Accuracy =",
round(rForestConfMat$overall['Accuracy'], 4)))
From the above we get the accuracies of our four prediction models (also collected programmatically below):
1. Classification Tree : 0.7452
2. Boosting : 0.9876
3. Bagging : 0.9966
4. Random Forest : 0.9983
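The same comparison can be assembled directly from the confusion-matrix objects computed above:

data.frame(model = c("Classification Tree", "Boosting (GBM)",
                     "Bagging", "Random Forest"),
           accuracy = round(c(DecTreeConfMat$overall["Accuracy"],
                              GBMConfMat$overall["Accuracy"],
                              BagConfMat$overall["Accuracy"],
                              rForestConfMat$overall["Accuracy"]), 4))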
It is clear that the random forest has the highest accuracy, with an expected out-of-sample error of about 1 - 0.9983 = 0.0017 (0.17%) based on the validation set. So we apply the random forest model to the 20 test cases to predict classe.
# workaround for this error when calling predict():
#   Error in predict.randomForest(rForest.WLE, newdata = test_data) :
#   Type of predictors in new data do not match that of the training data.
# rbind-ing one training row onto test_data coerces the column classes and
# factor levels to match the training data; the extra row is then dropped.
test_data <- rbind(train_data[1, ] , test_data)
test_data <- test_data[-1,]
test.pred <- predict(rForest.WLE, newdata=test_data)
# test.pred carries row names 2 to 21, left over from the rbind workaround;
# unname() strips them:
#  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
#  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
# Levels: A B C D E
unname(test.pred)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
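As a final convenience, the 20 predictions could be written out one per file for submission; a hypothetical helper sketch (write_predictions is our own name, not part of the original analysis):

# hypothetical helper: write each prediction to problem_id_<i>.txt
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(unname(test.pred))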