The use of wearable devices has made it possible to collect large amounts of data about personal activity relatively inexpensively. These devices are used by a group of enthusiasts who take measurements about themselves regularly to improve their health. In this project the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the data set.
This report describes how the models were built and cross-validated, and what the expected out-of-sample error is. The selected prediction model is then used to predict 20 different test cases.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
An extensive data set of almost 20,000 observations was used for model construction and testing.
For this data set, “participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions:
- exactly according to the specification (Class A),
- throwing the elbows to the front (Class B),
- lifting the dumbbell only halfway (Class C),
- lowering the dumbbell only halfway (Class D),
- throwing the hips to the front (Class E).”
The outcome variable is “classe”.
Two models were built and tested: Decision Tree and Random Forest. The model with the highest accuracy (Random Forest) was chosen as the final model. Prediction for the testing data set was done and accuracy was assessed.
Cross-validation was done by sub-setting the training data set into 2 sub-samples: trainingNew data set (70%) and validationNew data set (30%). Models were trained on the trainingNew data set, and validated on the validationNew data set. The most accurate model was tested on the original Testing data set.
The expected out-of-sample error can be assessed as 1 - accuracy on the validation data set, where accuracy is the proportion of correctly classified observations over the total number of observations in the validation data set. Since the selected model is then applied to out-of-sample data (the original testing data set), the expected out-of-sample error corresponds to the expected proportion of mis-classified observations in the Test data set.
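As a minimal illustration (assuming hypothetical vectors pred and truth holding the predicted and actual classes for the validation set), the estimate reduces to:
## hypothetical vectors: pred = predicted classes, truth = actual classes (validation set)
accuracy <- mean(pred == truth)   ## proportion of correctly classified observations
oosError <- 1 - accuracy          ## equivalently, mean(pred != truth)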
library(caret)
library(randomForest)
library(rpart)
library(rattle)
library(RColorBrewer)
library(ggplot2)
library(lattice)
Data was downloaded from the links above and loaded into R. Columns with NA values and columns not related to model construction were removed. The variable “classe” was converted into a factor.
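One possible way to obtain the files, assuming they are not already in the working directory (the URL variable names below are introduced only for illustration):
## download the data files if they are not yet present locally
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainURL, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testURL, destfile = "pml-testing.csv")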
## reading data, treating missing and invalid values as NA
training <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
testing <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
## removing the columns with NAs
training <- training[,colSums(is.na(training))==0]
testing <- testing[,colSums(is.na(testing))==0]
## removing columns 1-7 (identifiers, timestamps, and window data not used for prediction)
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
## making "classe" the factor
training$classe <- as.factor(training$classe)
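A quick sanity check of the cleaned data sets (a sketch; output omitted):
## check dimensions and confirm no missing values remain
dim(training); dim(testing)
sum(is.na(training)); sum(is.na(testing))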
For cross-validation, the training data set was split into two sub-samples: trainingNew (70%) and validationNew (30%).
set.seed(12345)
## split training data into training and validation parts
trainingset <- createDataPartition(y=training$classe, p=0.7, list=FALSE)
trainingNew <- training[trainingset, ]
validationNew <- training[-trainingset, ]
plot(as.factor(trainingNew$classe), col="lightblue", main = "Classe distribution in training data set", xlab="classe", ylab="Frequency")
Observations are distributed across all five values of the “classe” variable, with class A the most frequent.
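The same distribution can be checked numerically (a sketch):
## class counts and proportions in the training sub-set
table(trainingNew$classe)
round(prop.table(table(trainingNew$classe)), 3)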
The Decision Tree model was constructed on the training data sub-set; predictions and accuracy were then assessed on the validation data sub-set.
model1 <- rpart(classe ~ ., data=trainingNew, method="class")
prediction1 <- predict(model1, validationNew, type = "class")
cm1 <- confusionMatrix(prediction1, validationNew$classe)
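Since the rattle package is loaded above, the fitted tree can also be visualised; a minimal sketch:
## visualise the fitted decision tree (rattle package)
fancyRpartPlot(model1, sub = "")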
We see that the accuracy of this model is relatively low (about 74%).
##           Reference
## Prediction    A    B    C    D    E
##          A 1532  176   28   48   41
##          B   54  585   57   64   76
##          C   35  154  819  134  126
##          D   25   76   58  631   56
##          E   28  148   64   87  783
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.391674e-01 6.692057e-01 7.277473e-01 7.503499e-01 2.844520e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.005297e-37
The Random Forest model was constructed on the training data sub-set; predictions and accuracy were then assessed on the validation data sub-set.
## randomForest() performs classification automatically because classe is a factor
model2 <- randomForest(classe ~ ., data=trainingNew)
prediction2 <- predict(model2, validationNew, type = "class")
cm2 <- confusionMatrix(prediction2, validationNew$classe)
The calculation takes more time, but the accuracy is much higher.
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    3    0    0    0
##          B    1 1135    4    0    0
##          C    0    1 1022   10    0
##          D    0    0    0  954    1
##          E    0    0    0    0 1081
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9966015 0.9957011 0.9947562 0.9979229 0.2844520
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
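The contribution of individual predictors can also be inspected with the randomForest package; a sketch:
## plot the 10 most important predictors according to the Random Forest model
varImpPlot(model2, n.var = 10, main = "Random Forest: top 10 predictors")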
The Random Forest algorithm performed much better than the Decision Tree: accuracy was 0.996 for the Random Forest model compared to 0.74 for the Decision Tree model. The Random Forest model was therefore selected as the final model.
The expected out-of-sample error is estimated as 1 - 0.9966 ≈ 0.34%.
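This estimate can be extracted directly from the validation confusion matrix; a sketch:
## expected out-of-sample error = 1 - validation accuracy
1 - as.numeric(cm2$overall["Accuracy"])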
The Test data set comprises 20 cases. With such a small expected error, very few, if any, of the test cases should be mis-classified. Prediction results are presented below.
finalpredict <- predict(model2, testing, type="class")
finalpredict
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E