test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
train <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
train <- train[ ,colSums(is.na(train)) <= nrow(train)*0.6]
test <- test[ ,colSums(is.na(test)) != nrow(test)]
If the count of NAs in a column equals the number of rows, the column must be entirely NA; such columns are dropped from the \(test\) data set, while the \(train\) data set keeps only the columns that are at most 60% missing.
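As a quick sanity check, the short sketch below (using only the objects loaded above) verifies that both filters did what we intended.
stopifnot(all(colMeans(is.na(train)) <= 0.6))      # train: every remaining column is at most 60% NA
stopifnot(all(colSums(is.na(test)) < nrow(test)))  # test: no remaining column is entirely NA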
library(caret)
## Warning: package 'caret' was built under R version 3.2.1
## Loading required package: lattice
## Loading required package: ggplot2
col.NZV <- nearZeroVar(train, saveMetrics = TRUE) # flag the columns that have near zero variance
drop.columns <- c("X", "problem_id", rownames(col.NZV)[col.NZV$nzv]) # the flagged columns are the row names of col.NZV
test <- test[, !colnames(test) %in% drop.columns]
classe <- train$classe
train <- train[, colnames(train) %in% names(test)] # keep only the predictors that also appear in test
train <- data.frame(train, classe)                 # re-attach the outcome
Drop the unnecessary columns: the observation id, the problem id, and the near-zero-variance variables. Since the \(test\) data set contains fewer variables than the \(train\) data set, we fit the model using only the variables that appear in both data sets.
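A small check (a sketch over the objects created above) shows which variables were flagged as near zero variance and confirms that the two data sets now share the same predictors.
head(rownames(col.NZV)[col.NZV$nzv])            # a few of the near-zero-variance columns
setdiff(names(train), c(names(test), "classe")) # should be empty: the predictors match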
set.seed(123)
inTrain <- createDataPartition(train$classe, p = 0.6, list = FALSE) # stratify the split on the outcome
training <- train[inTrain, ]
validation <- train[-inTrain, ]
Create new \(training\) and \(validation\) data sets from the original training data.
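As a sketch, we can verify that the 60/40 split preserves the class distribution of the outcome.
round(prop.table(table(training$classe)), 3)   # class proportions in the training partition
round(prop.table(table(validation$classe)), 3) # should be nearly identical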
library(rpart)
# tree <- train(classe ~ ., data = training, method = "rpart") # caret's train() errored here, so we call rpart directly
tree <- rpart(classe ~ ., data = training, method = "class")
rattle::fancyRpartPlot(tree) # requires the rattle package
tree.predict <- predict(tree, newdata = validation, type = "class")
confusionMatrix(validation$classe, tree.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2143 59 19 0 0
## B 64 1271 174 9 0
## C 9 96 1224 7 7
## D 1 70 139 868 206
## E 0 0 60 80 1340
##
## Overall Statistics
##
## Accuracy : 0.8725
## 95% CI : (0.865, 0.8798)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8388
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9666 0.8496 0.7574 0.9004 0.8628
## Specificity 0.9861 0.9611 0.9809 0.9396 0.9778
## Pos Pred Value 0.9649 0.8373 0.9114 0.6760 0.9054
## Neg Pred Value 0.9868 0.9644 0.9397 0.9854 0.9665
## Prevalence 0.2826 0.1907 0.2060 0.1229 0.1979
## Detection Rate 0.2731 0.1620 0.1560 0.1106 0.1708
## Detection Prevalence 0.2831 0.1935 0.1712 0.1637 0.1886
## Balanced Accuracy 0.9764 0.9054 0.8692 0.9200 0.9203
The decision tree performs quite well: the accuracy on the validation set is \(0.8725\).
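To pull that number out programmatically, a small convenience sketch: the \(confusionMatrix\) object stores the accuracy in its \(overall\) component, and \(1 - accuracy\) gives an estimate of the out-of-sample error.
tree.cm <- confusionMatrix(validation$classe, tree.predict)
unname(tree.cm$overall["Accuracy"])     # ~0.8725
unname(1 - tree.cm$overall["Accuracy"]) # estimated out-of-sample error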
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1234)
rf <- randomForest(classe ~ ., data = training, ntree = 200) # 200 trees instead of the default 500
rf.predict <- predict(rf, newdata = validation)
confusionMatrix(validation$classe, rf.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2221 0 0 0 0
## B 1 1516 1 0 0
## C 0 2 1338 3 0
## D 0 0 1 1282 1
## E 0 0 0 0 1480
##
## Overall Statistics
##
## Accuracy : 0.9989
## 95% CI : (0.9978, 0.9995)
## No Information Rate : 0.2832
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9995 0.9987 0.9985 0.9977 0.9993
## Specificity 1.0000 0.9997 0.9992 0.9997 1.0000
## Pos Pred Value 1.0000 0.9987 0.9963 0.9984 1.0000
## Neg Pred Value 0.9998 0.9997 0.9997 0.9995 0.9998
## Prevalence 0.2832 0.1935 0.1708 0.1638 0.1888
## Detection Rate 0.2831 0.1932 0.1705 0.1634 0.1886
## Detection Prevalence 0.2831 0.1935 0.1712 0.1637 0.1886
## Balanced Accuracy 0.9998 0.9992 0.9989 0.9987 0.9997
plot(rf, main = "Error rate versus number of trees of Random Forest")
Random forest performs much better than the decision tree: the accuracy reaches \(0.9989\). From the plot we can see that after fitting only about 30 trees, the random forest already attains a very small error rate (for classification, the plot shows error rates rather than MSE). The default number of trees (\(ntree\)) is 500; however, a much smaller number is enough, which reduces the computing time.
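Optionally, we can inspect which predictors the forest relies on most; the sketch below uses \(importance\) from the randomForest package (MeanDecreaseGini is the default measure for classification).
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE])
# varImpPlot(rf) draws the same information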
library(gbm)
## Warning: package 'gbm' was built under R version 3.2.3
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
set.seed(12345)
boost <- gbm(classe ~., data = training, distribution = "multinomial", n.trees = 100,
shrinkage = 0.2, interaction.depth = 3,verbose = FALSE)
# note that boost.predict is a 3-dimensional array (observations x classes x n.trees)
boost.predict <- predict(boost, newdata = validation, n.trees = 100, type = "response")
boost.predict[1:6, ,]
## A B C D E
## [1,] 0.9997076 0.0001108933 7.413240e-05 2.352405e-05 8.382577e-05
## [2,] 0.9996852 0.0001108908 9.654712e-05 2.352352e-05 8.382389e-05
## [3,] 0.9997006 0.0001108925 8.114757e-05 2.352388e-05 8.382519e-05
## [4,] 0.9996852 0.0001108908 9.654712e-05 2.352352e-05 8.382389e-05
## [5,] 0.9997023 0.0001108927 7.946069e-05 2.352392e-05 8.382533e-05
## [6,] 0.9996852 0.0001108908 9.654712e-05 2.352352e-05 8.382389e-05
boost.predIndex <- apply(boost.predict, MARGIN = 1, which.max) # index of the maximal probability in each row
boost.predClass <- colnames(boost.predict)[boost.predIndex]    # convert the index into a class label A-E
boost.predClass <- factor(boost.predClass, levels = levels(validation$classe)) # same levels as the reference
confusionMatrix(validation$classe, boost.predClass)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2218 3 0 0 0
## B 2 1509 6 1 0
## C 0 1 1335 7 0
## D 0 0 1 1283 0
## E 0 0 0 7 1473
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.9948, 0.9976)
## No Information Rate : 0.2829
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9974 0.9948 0.9884 1.0000
## Specificity 0.9995 0.9986 0.9988 0.9998 0.9989
## Pos Pred Value 0.9986 0.9941 0.9940 0.9992 0.9953
## Neg Pred Value 0.9996 0.9994 0.9989 0.9977 1.0000
## Prevalence 0.2829 0.1928 0.1710 0.1654 0.1877
## Detection Rate 0.2827 0.1923 0.1702 0.1635 0.1877
## Detection Prevalence 0.2831 0.1935 0.1712 0.1637 0.1886
## Balanced Accuracy 0.9993 0.9980 0.9968 0.9941 0.9995
Although the overall accuracy here is \(0.9964\), slightly below the random forest, boosting can perform a little better than the random forest if we re-fit the model with different parameters. When fitting boosting, you need to tune the parameters carefully, including \(n.trees\), \(shrinkage\), \(interaction.depth\), etc. The running time increases considerably if you increase \(n.trees\) or \(interaction.depth\).
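One way to pick \(n.trees\) is gbm's built-in cross-validation; the sketch below is illustrative only (the \(cv.folds\) value and parameter settings are assumptions, not tuned results) and is noticeably slower than the single fit above.
set.seed(12345)
boost.cv <- gbm(classe ~ ., data = training, distribution = "multinomial",
                n.trees = 150, shrinkage = 0.2, interaction.depth = 3,
                cv.folds = 3, verbose = FALSE)
best.iter <- gbm.perf(boost.cv, method = "cv") # cross-validated estimate of the best number of trees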
levels(test$cvtd_timestamp) <- levels(train$cvtd_timestamp) # align factor levels with the training data so predict() accepts test
levels(test$new_window) <- levels(train$new_window)
test.predicted <- predict(rf, newdata = test, type = "class")
test.predicted
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Considering both accuracy and running time, we choose \(RandomForest\) as the final model. Please note that since the test data set is very small, we need to manually assign the levels of certain factor columns in the \(train\) data set to the corresponding columns in the \(test\) data set so that the test data set can be predicted. Another approach is to combine the training and testing data sets at the very beginning and then split them again, which guarantees they share the same levels, as sketched below.
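A sketch of that combine-then-split alternative (the names \(common\), \(combined\), \(train2\), and \(test2\) are hypothetical; this would run right after reading the CSVs, before any other processing).
common <- intersect(names(train), names(test))     # columns shared by both sets
combined <- rbind(train[, common], test[, common]) # rbind unifies the factor levels
train2 <- combined[seq_len(nrow(train)), ]
test2  <- combined[nrow(train) + seq_len(nrow(test)), ]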