# Read the data, treating both "NA" and "#DIV/0!" as missing values
test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
train <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
# Keep train columns that are at most 60% NA, and test columns that are not entirely NA
train <- train[, colSums(is.na(train)) <= nrow(train) * 0.6]
test <- test[, colSums(is.na(test)) != nrow(test)]
If the count of NAs in a column equals the number of rows, the column is entirely NA and carries no information. We therefore drop such columns from the test set, and drop any training column that is more than 60% NA.
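As a toy illustration of the column-wise NA counting (made-up data, not the project files):

df <- data.frame(a = c(1, 2, 3), b = c(NA, NA, NA), c = c(1, NA, 3))
colSums(is.na(df))                   # a = 0, b = 3, c = 1
df[, colSums(is.na(df)) != nrow(df)] # drops b, which is entirely NA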
library(caret)
## Warning: package 'caret' was built under R version 3.2.1
## Loading required package: lattice
## Loading required package: ggplot2
col.NZV <- nearZeroVar(train, saveMetrics = TRUE) # per-column metrics, including an nzv flag
nzv.columns <- rownames(col.NZV[col.NZV$nzv, ])   # names of the near-zero-variance columns
drop.columns <- c("X", "problem_id", nzv.columns)
test <- test[, !colnames(test) %in% drop.columns]
classe <- train$classe                             # set the response aside
train <- train[, colnames(train) %in% names(test)] # keep only the predictors shared with test
train <- data.frame(train, classe)                 # re-attach the response
Drop the unnecessary columns: the observation id (X), problem_id, and the near-zero-variance columns. Since the test set contains fewer variables than the training set, we fit the model using only the variables that appear in both data sets.
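As a quick sanity check (optional; the expected result is an assumption about the cleaned data, not output from the original run), the only training column absent from the test set should now be the response:

setdiff(colnames(train), colnames(test)) # should return just "classe"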
set.seed(123)
folds <- createFolds(train$user_name, k=5, list = TRUE)
library(rpart)
# tree <- train(classe ~ ., data = training, method = "rpart") # caret's train() seemed to throw an error here, so we call rpart directly
# tree <- rpart(classe ~ ., data = training, method = "class")
# rattle::fancyRpartPlot(tree)
# tree.predict <- predict(tree, newdata = validation, type = "class")
# confusionMatrix(validation$classe, tree.predict)
k <- 5
accuracy <- rep(NA, k)
for (i in 1:k) {
    kfolds.train <- train[-folds[[i]], ]  # train on everything outside fold i
    kfolds.test <- train[folds[[i]], ]    # evaluate on fold i
    tree <- rpart(classe ~ ., data = kfolds.train, method = "class")
    tree.predict <- predict(tree, newdata = kfolds.test, type = "class")
    result <- confusionMatrix(tree.predict, kfolds.test$classe) # caret expects (prediction, reference)
    accuracy[i] <- result$overall[1]
}
accuracy
## [1] 0.8707622 0.8650025 0.8679918 0.8749045 0.8827727
mean(accuracy)
## [1] 0.8722867
The decision tree performs quite well: the 5-fold average accuracy is 0.872, so the average out-of-sample error is 0.128.
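For the record, the quoted out-of-sample error is simply the complement of the cross-validated accuracy:

1 - mean(accuracy) # average out-of-sample error across the 5 folds, about 0.128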
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
k <- 5
accuracy <- rep(NA, k)
for (i in 1:k) {
    set.seed(1234)
    kfolds.train <- train[-folds[[i]], ]
    kfolds.test <- train[folds[[i]], ]
    rf <- randomForest(classe ~ ., data = kfolds.train, ntree = 200)
    rf.predict <- predict(rf, newdata = kfolds.test)
    result <- confusionMatrix(rf.predict, kfolds.test$classe) # (prediction, reference)
    accuracy[i] <- result$overall[1]
}
accuracy
## [1] 0.9984706 0.9989812 0.9989806 0.9992357 0.9992355
mean(accuracy)
## [1] 0.9989807
Random forest performs much better than the decision tree: the 5-fold average accuracy reaches 0.999, and the average out-of-sample error is only 0.001.
The 5-fold cross-validation shows that the random forest performs very well. We fit a random forest on the entire training set to obtain the final model, which will be used later to predict the test set.
set.seed(1234)
rf <- randomForest(classe ~ ., data = train, ntree = 200) # final model on the full training set
plot(rf, main = "Error rate versus number of trees of the random forest")
We can see that after fitting only about 30 trees, the random forest already reaches a very small error rate. The default number of trees (ntree) is 500, but a small value is sufficient here and reduces computing time.
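The same information can be read off the fitted object: rf$err.rate stores the cumulative out-of-bag error after each tree. A quick check (output not part of the original run):

rf$err.rate[c(30, 200), "OOB"] # OOB error after 30 trees vs. all 200 -- nearly identical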
We use caret to fit boosting and request k-fold cross-validation through its trainControl parameter. We split the training data into a training part and a validation part, run boosting with 3-fold cross-validation on the training part, and then evaluate performance on the validation set. (To speed up the run we use only 3 folds.)
If you run k-fold cross-validation by hand, it generates k models, and the average out-of-sample error rate tells you how well that kind of model performs; you then fit the model on the whole data set and use that fit to predict the test set. When you request k-fold cross-validation in caret through its parameters, however, it returns one final model: caret uses the resampling only to choose the best combination of tuning parameters, then refits a single model on all of the training data with those parameters (see the check after the model is fitted below).
set.seed(12345)
inTrain <- createDataPartition(train$user_name, p = 0.6, list = FALSE)
training <- train[inTrain, ]
validation <- train[-inTrain, ]
set.seed(12345)
fitControl <- trainControl(method = "repeatedcv",
                           number = 3,
                           repeats = 1)
boost <- train(classe ~ ., data = training, method = "gbm",
               trControl = fitControl,
               verbose = FALSE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.2.3
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.2.1
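To see how caret arrives at the single model it returns, inspect the tuning results (a quick check; this output was not part of the original run):

boost$bestTune # the tuning-parameter combination with the best cross-validated accuracy
boost$results  # cross-validated accuracy for every combination that was tried
# boost$finalModel is a gbm refit on all of `training` with the bestTune parameters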
boostingPred <- predict(boost, newdata=validation)
confusionMatrix(boostingPred, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2228 0 0 0 0
## B 0 1517 1 0 0
## C 0 1 1340 5 0
## D 0 3 4 1315 4
## E 0 0 0 6 1422
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9955, 0.998)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.9974 0.9963 0.9917 0.9972
## Specificity 1.000 0.9998 0.9991 0.9983 0.9991
## Pos Pred Value 1.000 0.9993 0.9955 0.9917 0.9958
## Neg Pred Value 1.000 0.9994 0.9992 0.9983 0.9994
## Prevalence 0.284 0.1939 0.1714 0.1690 0.1817
## Detection Rate 0.284 0.1933 0.1708 0.1676 0.1812
## Detection Prevalence 0.284 0.1935 0.1716 0.1690 0.1820
## Balanced Accuracy 1.000 0.9986 0.9977 0.9950 0.9981
# library(gbm)
# set.seed(12345)
# boost <- gbm(classe ~ ., data = training, distribution = "multinomial", cv.folds = 2,
#              n.trees = 100, shrinkage = 0.2, interaction.depth = 3, verbose = FALSE)
# # note that boost.predict is an array (rows x classes x n.trees)
# boost.predict <- predict(boost, newdata = validation, n.trees = 100, type = "response")
# boost.predict[1:6, , ]
# boost.predIndex <- apply(boost.predict, MARGIN = 1, which.max) # index of the maximum probability in each row
# boost.predClass <- colnames(boost.predict)[boost.predIndex]   # convert the index into a class label A-E
# boost.predClass <- factor(boost.predClass, levels = levels(validation$classe)) # ensure matching levels
# confusionMatrix(validation$classe, boost.predClass)
Although the overall accuracy here is 0.9969, boosting can actually perform a little better than random forest if the model is re-fitted with other parameters. When fitting boosting you need to tune the parameters carefully, including n.trees, shrinkage, and interaction.depth; the running time increases considerably as n.trees and interaction.depth grow.
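A minimal sketch of how such tuning could be set up in caret (the grid values are illustrative, not the settings used above):

gbmGrid <- expand.grid(n.trees = c(100, 200),       # illustrative values only
                       interaction.depth = c(3, 5),
                       shrinkage = c(0.1, 0.2),
                       n.minobsinnode = 10)
boost.tuned <- train(classe ~ ., data = training, method = "gbm",
                     trControl = fitControl, tuneGrid = gbmGrid,
                     verbose = FALSE)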
# The test set is tiny, so its factor columns (e.g. cvtd_timestamp) lack levels seen
# in training; rebuild each factor with the training levels before predicting
for (col in intersect(names(test), names(train))) {
    if (is.factor(test[[col]])) test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}
test.predicted <- predict(rf, newdata = test, type = "class")
test.predicted
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Considering both accuracy and running time, we choose the random forest as the final model. Note that because the test set is so small, its factor columns do not contain every level seen in training, so we must manually align the levels of those columns with the corresponding training columns before the test set can be predicted. Alternatively, you can combine the training and testing data sets at the very beginning and then split them again, which guarantees that their factor levels match.
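A minimal sketch of that alternative, assuming it is applied immediately after read.csv() and before any columns are dropped (train2 and test2 are illustrative names):

shared <- intersect(colnames(train), colnames(test)) # everything except classe / problem_id
combined <- rbind(train[, shared], test[, shared])   # rbind merges the factor levels
train2 <- cbind(combined[1:nrow(train), ], classe = train$classe)
test2 <- combined[(nrow(train) + 1):nrow(combined), ]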