Using data provided by http://groupware.les.inf.puc-rio.br/har [1], we will build a model to predict how well a bicep curl is performed. The data were recorded with accelerometers placed at several points on the body. Each repetition of the exercise falls into one of five categories:
Class A - Exactly to specification
Class B - Throwing elbows to the front
Class C - Lifting dumbbell only halfway
Class D - Lowering dumbbell only halfway
Class E - Throwing hips to the front
library(caret)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv","training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv","testing.csv")
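#stringsAsFactors=TRUE keeps text columns (including classe) as factors on R >= 4.0, where the default changed to FALSE
train <- read.csv("training.csv", stringsAsFactors=TRUE)
test <- read.csv("testing.csv", stringsAsFactors=TRUE)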
#Give every test column the same class as the matching training column
for(i in seq_along(test)){
  if(is.factor(train[[i]])){
    test[[i]] <- as.factor(test[[i]])
  } else {
    class(test[[i]]) <- class(train[[i]])
  }
}
#Blanks to NAs
train[train==""] <- NA
test[test==""] <- NA
#Determine unnecessary variables
var <- nearZeroVar(train)
train <- train[,-var]
test <- test[,-var]
#Remove individual identifiers
train <- train[,-c(1:5)]
test <- test[,-c(1:5)]
#Determine variables that are more than 90% NA
nearNA <- sapply(train,FUN=function(x)mean(is.na(x))>.9)
train <- train[,!nearNA]
test <- test[,!nearNA]
dim(train)
## [1] 19622 54
dim(test)
## [1] 20 54
table(train$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
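The mostly-NA filter can be tried out on a small example. This is a minimal sketch in base R: the data frame `toy`, its columns, and the names `na_fraction` and `mostly_na` are made up for illustration; only the 0.9 threshold comes from the code above. The comparison is applied to the per-column fraction of missing values, so only columns that are more than 90% NA are flagged.

```r
# Toy data (hypothetical): column b is 95% NA, column c has a single NA
toy <- data.frame(
  a = 1:20,
  b = c(1, rep(NA, 19)),
  c = c(NA, 2:20)
)

# Fraction of missing values in each column
na_fraction <- sapply(toy, function(x) mean(is.na(x)))

# Flag columns above the 90% threshold and drop them
mostly_na <- na_fraction > 0.9
toy_clean <- toy[, !mostly_na, drop = FALSE]

print(round(na_fraction, 2))
print(names(toy_clean))
```

Column `c` survives even though it contains an NA, which is the behaviour a "mostly NA" filter should have.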
The dataset is fairly large, so we register a cluster of three cores to train the random forest in parallel.
library(parallel)
library(doParallel)
cluster <- makeCluster(3)
registerDoParallel(cluster)
library(caret)
set.seed(123)
#3-fold CV with parallel processing
ctrl <- trainControl(method="cv",
                     number=3,
                     allowParallel=TRUE)
#Random Forest model build
fit <- train(classe~.,
             data=train,
             method="rf",
             trControl=ctrl)
stopCluster(cluster)
registerDoSEQ()
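The number of workers is hard-coded to three above. On a different machine it may be safer to derive it from the hardware. The following is a minimal sketch, assuming we want to leave one core free and never use more than three; the name `n_cores` and the cap of three are illustrative choices, not part of the original analysis.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1 worker
available <- detectCores()
n_cores <- max(1L, available - 1L, na.rm = TRUE)

# Cap at three workers to mirror the setup used in this report (assumption)
n_cores <- min(n_cores, 3L)

print(n_cores)
# n_cores could then be passed to makeCluster() in place of the literal 3
```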
library(caret)
fit
## Random Forest
##
## 19622 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 13081, 13082, 13081
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa    
##    2    0.9952095  0.9939402
##   27    0.9974009  0.9967123
##   53    0.9955152  0.9943270
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
fit$resample
## Accuracy Kappa Resample
## 1 0.9977068 0.9970994 Fold1
## 2 0.9975539 0.9969059 Fold3
## 3 0.9969419 0.9961316 Fold2
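The accuracy reported for mtry = 27 is the average of these three fold accuracies. A quick check in base R, with the values copied by hand from the output above (the vector name `fold_accuracy` is just for illustration):

```r
# Per-fold accuracies for the selected model, copied from fit$resample
fold_accuracy <- c(Fold1 = 0.9977068, Fold3 = 0.9975539, Fold2 = 0.9969419)

# Cross-validated accuracy is the mean over folds
cv_accuracy <- mean(fold_accuracy)

# Spread across folds gives a rough sense of how stable the estimate is
cv_sd <- sd(fold_accuracy)

print(round(cv_accuracy, 7))
print(cv_sd)
```

With only three folds the standard deviation is a coarse indicator, but it shows that the folds agree closely.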
confusionMatrix.train(fit)
## Cross-Validated (3 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
##           Reference
## Prediction    A    B    C    D    E
##          A 28.4  0.1  0.0  0.0  0.0
##          B  0.0 19.3  0.0  0.0  0.0
##          C  0.0  0.0 17.4  0.1  0.0
##          D  0.0  0.0  0.0 16.3  0.1
##          E  0.0  0.0  0.0  0.0 18.3
##
## Accuracy (average) : 0.9974
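The average accuracy is the sum of the diagonal of this table, since each cell is a percentage of all held-out cases. The sketch below rebuilds the matrix by hand from the rounded values printed above (so results are only accurate to about one decimal place) and also derives a per-class recall; the names `cm`, `accuracy_pct`, and `recall` are illustrative.

```r
classes <- c("A", "B", "C", "D", "E")

# Rows are predictions, columns are reference classes (percent of all cases)
cm <- matrix(
  c(28.4,  0.1,  0.0,  0.0,  0.0,
     0.0, 19.3,  0.0,  0.0,  0.0,
     0.0,  0.0, 17.4,  0.1,  0.0,
     0.0,  0.0,  0.0, 16.3,  0.1,
     0.0,  0.0,  0.0,  0.0, 18.3),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Overall accuracy: share of cases on the diagonal
accuracy_pct <- sum(diag(cm))

# Recall per class: correct predictions divided by the reference column total
recall <- diag(cm) / colSums(cm)

print(accuracy_pct)
print(round(recall, 3))
```

The diagonal sums to 99.7, which agrees with the reported 0.9974 up to rounding of the printed cells.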
library(caret)
pred <- predict(fit,test[,-length(test)])
write.csv(pred,"predictions.csv")
How I built the model - I used the caret package to train a random forest (method="rf"), because random forests cope well with many predictors and need little preprocessing, and caret's resampling gives a realistic estimate of the out of sample error.
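How I used Cross Validation - I specified 3-fold cross validation in trainControl, so each candidate value of mtry was fit on about two thirds of the rows and scored on the remaining third. I used only 3 folds to keep training time manageable given the size of the training data. With a smaller dataset, I would have used 10.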
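The training-set sizes reported in the model summary (13081, 13082, 13081) follow directly from splitting 19622 rows into three folds. A minimal base R sketch of the split is shown here. It assigns folds purely at random, whereas caret stratifies the folds by class, so this reproduces the sizes but not the exact membership; the names `fold_id`, `held_out`, and `train_size` are illustrative.

```r
set.seed(123)

n <- 19622  # rows in the training data
k <- 3      # number of folds

# Assign each row to one of k folds of (nearly) equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out in each fold, and rows left for fitting
held_out <- as.vector(table(fold_id))
train_size <- n - held_out

print(held_out)
print(train_size)
```

With 10 folds each model would be fit on about 90% of the rows instead of about 67%, at the cost of more than three times as many fits.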
Expected out of sample rate - The accuracy above is measured on held-out folds, so it is much more realistic than the training accuracy of a highly fitted single tree. It is still likely to be slightly optimistic, because the same folds were used to choose mtry. So my expected accuracy on unseen data is at best about 99.74%, that is, an out of sample error of roughly 0.26% or a little more, which is excellent.
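To put that figure in terms of the 20 test cases, the arithmetic can be sketched as follows. It assumes the cross-validated accuracy for mtry = 27 applies unchanged to the test set and that the 20 predictions are independent, both of which are simplifying assumptions; the variable names are illustrative.

```r
cv_accuracy <- 0.9974009  # cross-validated accuracy for mtry = 27
n_test <- 20              # rows in the test set

# Estimated out-of-sample error rate
error_rate <- 1 - cv_accuracy

# Expected number of misclassified test cases
expected_errors <- n_test * error_rate

# Probability that all 20 test cases are classified correctly
p_all_correct <- cv_accuracy^n_test

print(round(error_rate, 4))
print(round(expected_errors, 3))
print(round(p_all_correct, 3))
```

Under these assumptions the chance of a perfect score on the test set is roughly 95%.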
[1] Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.