In this project, the goal is to predict the manner in which a person performed a specific exercise movement, using data collected by wearable devices. Six participants were asked to perform barbell lifts in 5 different ways, 10 repetitions each. Class A is the correct movement, while the other 4 classes are common mistakes. Classification models are fitted to this data set, and the best-performing model is selected to predict 20 test cases. The data for this project come from this source: linked phrase. For more information, visit linked phrase.
library(caret)
library(rattle)
Load the data.
pml.training <- read.csv("pml-training.csv",na.strings=c("NA","#DIV/0!",""))
pml.testing <- read.csv("pml-testing.csv",na.strings=c("NA","#DIV/0!",""))
dim(pml.training)
## [1] 19622 160
A brief look at the dataset shows that many variables are almost empty. Variables with more than 80% “NA” are dropped, which removes 100 of the 160 columns. The first 7 of the remaining columns are dropped as well, leaving 52 predictors plus the outcome classe.
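# Flag columns in which more than 80% of the values are missing.
NAcols <- colSums(is.na(pml.training)) > nrow(pml.training) * 0.8
pml.training <- pml.training[, !NAcols]
# Drop the first 7 columns (row index, user name, timestamps, window markers).
pml.training <- pml.training[, -(1:7)]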
print(ncol(pml.training))
## [1] 53
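The column filter above can be tried on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical column names and values) and keeps only the columns whose share of missing values does not exceed 80%; it is an illustration of the idea, not part of the analysis.

```r
# Toy data frame standing in for pml.training (hypothetical values).
toy <- data.frame(
  id     = 1:10,
  full   = rnorm(10),
  sparse = c(1.5, rep(NA, 9)),   # 90% missing
  partly = c(rep(NA, 5), 1:5)    # 50% missing
)

# Fraction of missing values per column.
na_frac <- colMeans(is.na(toy))

# Keep columns with at most 80% missing values.
kept <- toy[, na_frac <= 0.8, drop = FALSE]
print(names(kept))
```

Only `sparse` crosses the 80% threshold here, so it is the only column removed.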
I fit models with several algorithms and compare their accuracy to select the best one. Classification accuracy is the ratio of correct predictions to total predictions made.
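As a small illustration of that definition, the sketch below computes accuracy for two short label vectors. The vectors `actual` and `predicted` are invented for this example and do not come from the data set.

```r
# Hypothetical true and predicted labels, for illustration only.
actual    <- factor(c("A", "B", "A", "C", "E", "D", "A", "B"), levels = LETTERS[1:5])
predicted <- factor(c("A", "B", "C", "C", "E", "A", "A", "B"), levels = LETTERS[1:5])

# Accuracy = correct predictions / total predictions.
accuracy <- mean(predicted == actual)

# The same number from the confusion table: diagonal over total.
tab <- table(predicted, actual)
accuracy_from_table <- sum(diag(tab)) / sum(tab)

print(accuracy)
```

Six of the eight labels agree, so both calculations give 0.75.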
Because the classes of the “testing” set are unknown, the “training” set has to be divided into actual training and test subsets (70% and 30%). The held-out 30% gives an estimate of the out-of-sample error of each fitted model. In addition, 5-fold cross-validation is used during model fitting.
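To make the 5-fold scheme concrete, here is a minimal base-R sketch of how rows could be assigned to folds. It assumes the 13737 training rows that caret reports in its output; the seed and the names `fold_id` and `train_sizes` are arbitrary choices for this illustration. caret additionally balances the class proportions within folds, which this plain random assignment does not attempt.

```r
set.seed(1)   # arbitrary seed for this illustration
n <- 13737    # rows in the 70% training split, as reported by caret
k <- 5

# Assign each row to one of k folds, as evenly as possible, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# In each round one fold is held out and the model is fit on the rest.
train_sizes <- n - as.vector(table(fold_id))
print(train_sizes)
```

Each round therefore fits on roughly four fifths of the training rows, which is in line with the sample sizes of 10989 and 10990 shown in the caret summary.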
set.seed(3234)
InTrain <- createDataPartition(pml.training$classe, p=0.7, list=FALSE)
train <- pml.training[InTrain,]
test <- pml.training[-InTrain,]
control <- trainControl(method = "cv", number = 5)
set.seed(61287)
fit.rpart <- train(classe~., method="rpart", trControl=control, data=train)
print(fit.rpart)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10990, 10990, 10989
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03102431 0.5161963 0.36917742
## 0.06086190 0.3905549 0.16616499
## 0.11433221 0.3154244 0.04736344
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03102431.
fancyRpartPlot(fit.rpart$finalModel)
pdt.rpart <- predict(fit.rpart, test)
confusionMatrix(pdt.rpart, test$classe)$overall[1]
## Accuracy
## 0.4953271
The accuracy of the classification tree is quite poor in this case: about 0.50 on the held-out set.
fit.gbm <- train(classe~., method="gbm", trControl=control, data=train, verbose=FALSE)
pdt.gbm <- predict(fit.gbm, test)
confusionMatrix(pdt.gbm, test$classe)$overall[1]
## Accuracy
## 0.9587086
fit.rf <- train(classe~., method="rf", trControl=control, data=train)
pdt.rf <- predict(fit.rf, test)
confusionMatrix(pdt.rf, test$classe)$overall[1]
## Accuracy
## 0.9913339
Random forest has the highest accuracy on the held-out set (0.991, compared with 0.959 for gbm and 0.495 for the classification tree). Highest accuracy means lowest estimated out-of-sample error, so random forest is selected to predict the 20 cases.
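The comparison can be written out as a short calculation. The sketch below copies the three hold-out accuracies from the output above and converts them to error rates; the variable names are chosen for this illustration, and the expected number of mistakes among 20 cases assumes those cases resemble the held-out data.

```r
# Hold-out accuracies copied from the output above.
acc <- c(rpart = 0.4953271, gbm = 0.9587086, rf = 0.9913339)

# Estimated out-of-sample error is the misclassification rate.
oos_error <- 1 - acc

# Model with the lowest estimated error.
best <- names(which.min(oos_error))

# Rough expected number of misclassified cases among 20 predictions.
expected_errors_20 <- 20 * oos_error[best]

print(round(100 * oos_error, 2))   # error rates in percent
print(best)
```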
predict(fit.rf, pml.testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E