Can we quantify how well people perform barbell lifts? Can we use quantitative characteristics of this performance to classify the exercise?
setwd("C:/Users/dpierre/Documents")
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
destrain <- file.path(getwd(), "training.csv")
destest  <- file.path(getwd(), "testing.csv")
# download.file(trainUrl, destrain)
# download.file(testUrl, destest)
training <- read.csv("training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing  <- read.csv("testing.csv",  na.strings = c("NA", "#DIV/0!", ""))
Summarizing the training and testing data sets, we notice that a number of variables contain NA values. We will remove all variables with 70% or more missing values in the training and testing data sets, and keep the “classe” and “problem_id” variables together with those built from the accelerometer data on the belt, forearm, arm, and dumbbell of the 6 participants.
# initialize the vector of variables to drop
drops <- NULL
# flag variables with 70% or more NA values
for (i in 1:length(training)) {
  if (sum(is.na(training[, i])) / nrow(training) >= 0.70) {
    drops <- c(drops, colnames(training)[i])
  }
}
training <- training[, !(names(training) %in% drops)]
testing  <- testing[, !(names(testing) %in% drops)]
# keep variables built from the accelerometer data (plus the outcome / id column)
trainCol <- grep("belt|forearm|arm|dumbbell|classe", colnames(training))
training <- training[, trainCol]
testCol  <- grep("belt|forearm|arm|dumbbell|problem_id", colnames(testing))
testing  <- testing[, testCol]
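As a quick sanity check (using only the cleaned objects created above), we can confirm that both data sets now contain the same reduced number of columns:
# dimensions of the cleaned training and testing sets
dim(training)
dim(testing)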
Some of the variables might have no variability in them and would not be useful covariates. We will identify variables that have very little variability and are therefore unlikely to be good predictors.
# Check the variability of the predictors
library(caret)
NZV<-nearZeroVar(training,saveMetrics=TRUE)
sum(NZV[,"nzv"])
## [1] 0
All the predictors have good variability and should be used in the prediction algorithm.
# Check if the predictors are identical in the training and test set
identical(colnames(training)[-length(training)], colnames(testing)[-length(testing)])
## [1] TRUE
We will take our training set and split it into training and test subsets. We will build our model on the sub-training set and evaluate it on the sub-test set, repeat this over and over, and average the estimated errors. Finally, we will estimate what is going to happen when we get a new test set.
set.seed(1300)
library(caret)
intrain <- createDataPartition(y = training$classe, p = 0.6, list = FALSE)
sub_training <- training[-intrain, ]
sub_testing  <- training[intrain, ]
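A short check of this kind (using only the subsets created above) can confirm that createDataPartition keeps the class proportions roughly equal in the two subsets:
# class proportions should be nearly identical in both subsets
prop.table(table(sub_training$classe))
prop.table(table(sub_testing$classe))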
We build a model with trees. The basic idea is to take each of the predictors and use it to split the classe outcome into different groups; in other words, the outcome is split into five different exercise classes. We will evaluate the homogeneity of classe within each type of exercise. The first thing we can do is plot “roll_belt” versus “pitch_belt” and colour the points by classe.
qplot(roll_belt,pitch_belt,colour=classe,data=sub_training)
We can see that the groups are identifiable but a little scattered. We can fit the model using the rpart method.
set.seed(1300)
mod_tree <- train(classe ~ ., data = sub_training, method="rpart")
We also build a model with Random Forest, an extension of classification and regression trees. We use the train function and set the trainControl argument, specifying the cross-validation method “cv” and the number of folds k = 10, since a larger k gives less bias and more variance.
set.seed(1300)
mod_rf <- train(classe ~ ., data = sub_training, method = "rf",
                trControl = trainControl(method = "cv", number = 10), ntree = 500)
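Since trainControl already runs 10-fold cross-validation inside train, the resampled accuracy can be inspected directly from the fitted object; a minimal sketch (output not shown):
# cross-validated accuracy and kappa for each value of mtry tried by caret
mod_rf$results
# accuracy on each of the ten held-out folds for the selected model
mod_rf$resample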
# Predict sub_testing values with the tree model (held-out portion of the original training data)
library(rattle)
pred_tree <- predict(mod_tree,sub_testing)
cm_tree <- confusionMatrix(pred_tree,sub_testing$classe)
cm_tree$table
## Reference
## Prediction A B C D E
## A 3042 986 937 848 302
## B 49 755 64 350 308
## C 248 538 1053 732 590
## D 0 0 0 0 0
## E 9 0 0 0 965
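To put a number on this, the overall accuracy can be read directly from the confusionMatrix object created above:
# overall accuracy of the tree model on sub_testing
cm_tree$overall["Accuracy"]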
We can compare our predictions with the actual classe values in the table above to see how the model would behave on new data. Overall the model is not accurate enough: class D, for instance, is never predicted at all. The plot below shows the splits the classification tree is making.
fancyRpartPlot(mod_tree$finalModel)
# Predict sub_testing values with the random forest model (held-out portion of the original training data)
pred_rf <- predict(mod_rf, sub_testing)
cm_rf <- confusionMatrix(pred_rf,sub_testing$classe)
cm_rf$table
## Reference
## Prediction A B C D E
## A 3332 39 0 0 0
## B 10 2212 33 2 11
## C 1 24 2009 55 6
## D 0 0 12 1866 13
## E 5 4 0 7 2135
We can compare our predictions with the actual classe values in the table above to see how the model would behave on new data. Overall the model is highly accurate.
sum(pred_rf == sub_testing$classe)/length(pred_rf) # Accuracy of the Random Forest model
## [1] 0.9811481
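To see which sensor readings the forest relies on most, caret's varImp function can be applied to the fitted model; a brief sketch (output omitted):
# scaled importance of each predictor in the random forest model
varImp(mod_rf)
# plot the 20 most important predictors
plot(varImp(mod_rf), top = 20)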
# predict the classe of the 20 cases in the downloaded testing set
predict(mod_rf, testing)
## [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The out-of-sample error is the error we get on a new data set. Our new data set, i.e. the downloaded testing data set, does not contain the outcome variable classe. This means that after predicting the likely outcome value (A, B, C, D or E) for the new data set, we will not be able to compare it to the true values. The approach we are going to use is called K-fold cross-validation. The idea is to break our training data set into ten (10) equal-sized folds. For each fold (train & test), we build our prediction model on the training part and apply it to the held-out test part. We then average the error we get across all those experiments to obtain an estimate of the average error rate in an out-of-sample estimation procedure.
set.seed(32323)
# ten folds of training indices; each fold keeps roughly 90% of sub_training
train_folds <- createFolds(y = sub_training$classe, k = 10, list = TRUE, returnTrain = TRUE)
library(randomForest)
vec <- c()
# average the accuracy across all ten folds
for (i in 1:10) {
  # the held-out test set is the complement of the corresponding training fold
  train <- sub_training[train_folds[[i]], ]
  test  <- sub_training[-train_folds[[i]], ]
  mod <- randomForest(classe ~ ., data = train)
  pred <- predict(mod, test)
  accuracy <- sum(pred == test$classe) / length(pred)
  vec <- c(vec, accuracy)
}
(1 - mean(vec)) * 100  # out-of-sample error rate in %
## [1] 0.1147475
The estimated out-of-sample error rate is equal to 0.11%.