This is the report for the Peer Assessment project of the Practical Machine Learning course.
The main goal of the project is to predict the manner in which 6 participants performed a weight-lifting exercise, as described below. This is the “classe” variable in the training set.
A short description of the dataset’s content from the authors’ website:
“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).”
rm(list=ls())
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
set.seed(12345)
training <- read.csv("pmltraining.csv")
testing <- read.csv("pmltesting.csv")
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
## [1] 13737 160
dim(TestSet)
## [1] 5885 160
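# Flag columns that are more than 95% NA in the training partition
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, !AllNA]
TestSet <- TestSet[, !AllNA]
# Drop the first five columns (identification data, not predictors)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]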
dim(TrainSet)
## [1] 13737 88
dim(TestSet)
## [1] 5885 88
After removing the columns that are almost entirely NA and the first five columns, the number of variables for the analysis is reduced from 160 to 88.
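The 95% NA filter can be checked on a small example. The sketch below is only an illustration on a made-up data frame: `toy` and its columns `id`, `dense`, `sparse`, and `partial` are hypothetical and are not part of the project data. It applies the same `mean(is.na(x)) > 0.95` rule used above.

```r
# Hypothetical toy data: 100 rows, one column that is 98% NA
n <- 100
toy <- data.frame(
  id      = seq_len(n),
  dense   = seq(0, 1, length.out = n),      # no missing values
  sparse  = c(rep(NA, 98), 1, 2),           # 98% NA, above the threshold
  partial = c(rep(NA, 50), seq_len(50))     # 50% NA, below the threshold
)

# Fraction of missing values per column
na_fraction <- sapply(toy, function(x) mean(is.na(x)))

# Same rule as in the report: flag columns with more than 95% NA
mostly_na <- na_fraction > 0.95

# Keep only the columns that are not flagged
filtered <- toy[, !mostly_na, drop = FALSE]
print(na_fraction)
print(names(filtered))
```

Only `sparse` is removed here; a column that is half missing, such as `partial`, survives the filter and would still need imputation or separate handling before modelling.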
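# Correlation matrix of the numeric columns only (is.numeric also skips
# character columns, which read.csv produces instead of factors in R >= 4.0)
corMatrix <- cor(TrainSet[sapply(TrainSet, is.numeric)])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))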
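Highly correlated pairs of variables appear in dark colors in the plot: with corrplot's default palette, dark blue marks a strong positive correlation and dark red a strong negative one.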
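The dark cells of a correlation plot can also be listed numerically. The following is a minimal sketch on simulated data, not on the project data: the data frame `toy`, its column names, and the cutoff of 0.9 are all assumptions chosen for illustration. The same steps could be applied to `corMatrix`.

```r
# Simulated data (hypothetical): x_noisy is almost a copy of x
set.seed(1)
x <- rnorm(200)
toy <- data.frame(
  x         = x,
  x_noisy   = x + rnorm(200, sd = 0.05),
  unrelated = rnorm(200)
)

cm <- cor(toy)

# Blank out the diagonal and the upper triangle so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Positions whose absolute correlation exceeds the (assumed) 0.9 cutoff
high <- which(abs(cm) > 0.9, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

A list like this makes it easier to decide whether to drop one variable of each pair or to apply a dimension reduction step; tree-based models usually tolerate correlated predictors, so this is a judgment call rather than a requirement.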
Three methods will be applied to this classification problem, and the best one (the one with the highest accuracy on the Test dataset) will be used for the quiz predictions. The methods are Random Forests, Decision Tree, and Generalized Boosted Model, as described below. A confusion matrix is plotted at the end of each analysis to better visualize the accuracy of each model.
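The selection rule, picking the model with the highest accuracy on held-out data, can be sketched in base R. The labels and predictions below are invented for illustration (`truth`, `pred_tree`, and `pred_forest` are hypothetical and are not results from this project); accuracy is the sum of the confusion matrix diagonal divided by the total count.

```r
# Hypothetical held-out labels and predictions from two models
classes     <- c("A", "B", "C", "D", "E")
truth       <- factor(c("A", "A", "B", "B", "C", "C", "D", "E"), levels = classes)
pred_tree   <- factor(c("A", "B", "B", "B", "C", "D", "D", "E"), levels = classes)
pred_forest <- factor(c("A", "A", "B", "B", "C", "C", "D", "A"), levels = classes)

# Accuracy = correctly classified cases / all cases,
# read off the diagonal of the confusion matrix
accuracy_from_table <- function(pred, ref) {
  cm <- table(Prediction = pred, Reference = ref)
  sum(diag(cm)) / sum(cm)
}

accuracies <- c(
  tree   = accuracy_from_table(pred_tree, truth),
  forest = accuracy_from_table(pred_forest, truth)
)
print(accuracies)

# Name of the model with the highest held-out accuracy
best_model <- names(which.max(accuracies))
print(best_model)
```

Giving both factors the same `levels` keeps the confusion matrix square, so its diagonal lines up with the correct predictions even when a model never predicts one of the classes.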
# set.seed(12345)
# modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
# fancyRpartPlot(modFitDecTree)
#
# # prediction on Test dataset
# predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
# confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
# confMatDecTree
#
# # plot matrix results
# plot(confMatDecTree$table, col = confMatDecTree$byClass,
# main = paste("Decision Tree - Accuracy =",
# round(confMatDecTree$overall['Accuracy'], 4)))