Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants.
These participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal is to predict how an exercise was executed (i.e., which class) from the output of the four accelerometers.
The following steps were taken to complete the project: load the required libraries and data, clean the data, split it into training and testing sets, fit and compare several models, and apply the best model to the 20 quiz cases.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(knitr)
setwd("/Users/jvanstee/datasciencecoursera/practicalmachinelearning")
trainURL <-
"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
A quick review of the CSV files shows that many columns are filled with NA, #DIV/0!, and blank values. Since these are not valid values, we convert them to NA with the na.strings parameter when reading the files and then drop every column that contains NA. Since the first 7 columns contain identifiers and timestamps rather than predictors, we remove them from the data set as well.
training <- read.csv(url(trainURL),na.strings = c("NA","#DIV/0!",""))
quizz <- read.csv(url(testURL),na.strings = c("NA","#DIV/0!",""))
#remove columns that contain non-valid (NA) values
training <- training[,colSums(is.na(training))==0]
quizz <- quizz[,colSums(is.na(quizz))==0]
#remove first 7 columns
training <- training[,-c(1:7)]
dim(training)
## [1] 19622 53
quizz <- quizz[,-c(1:7)]
dim(quizz)
## [1] 20 53
After this operation the remaining training set still contains 19,622 observations, but the number of columns is reduced from 160 to 53 (52 predictors plus the classe outcome).
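As a quick sanity check, the cleaned quiz set should keep the same predictor columns as the cleaned training set; only the last column should differ (classe in the training file, problem_id in the quiz file). A minimal sketch:
# columns present in one cleaned data set but not the other;
# the expected differences are only the outcome and the quiz id
setdiff(names(training), names(quizz))
setdiff(names(quizz), names(training))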
We split the training data into a training set (Train.set, 60% of the observations) and a testing set (Test.set, the remaining 40%).
inTrain <- createDataPartition(training$classe, p= 0.6, list = FALSE)
Train.set <- training[inTrain,]
Test.set <- training[-inTrain,]
dim(Train.set)
## [1] 11776 53
dim(Test.set)
## [1] 7846 53
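Since createDataPartition samples within each class, the class proportions should be nearly identical in the two partitions; a quick illustrative check:
# class proportions in each partition (should closely match each other)
round(prop.table(table(Train.set$classe)), 3)
round(prop.table(table(Test.set$classe)), 3)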
Since many algorithms in the caret package are computationally intensive, we enable multicore processing to expedite the computations. This project was run on a MacBook Air with one Intel Core i7 processor (two cores).
#multicore Parallel processing
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
doMC::registerDoMC(cores=2)
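If the number of available cores is not known in advance, it could also be detected at run time; a small sketch (leaving one core free for the OS is an assumption, not a requirement):
# detect available cores at run time and keep one free for the OS
n.cores <- max(1, parallel::detectCores() - 1)
doMC::registerDoMC(cores = n.cores)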
We fit several prediction models and pick the one with the best accuracy on the held-out test set.
#Model Decision Trees (rpart) with scaling and cross-validation
set.seed(1234)
Model.rpart1 <- train(classe ~ ., preProcess = c("center","scale"),trControl = trainControl(method = "cv",number = 4),data = Train.set, method = "rpart")
print(Model.rpart1)
## CART
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 8833, 8831, 8832, 8832
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03500237 0.5091706 0.36251629
## 0.06003797 0.4278303 0.22869990
## 0.11414333 0.3446776 0.09215309
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03500237.
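Before evaluating it on the test set, the fitted tree can be visualized to see which splits it uses (a minimal sketch using base graphics; rpart.plot or rattle would give a prettier plot if installed):
# plot the final classification tree from the caret fit
plot(Model.rpart1$finalModel, uniform = TRUE, margin = 0.1)
text(Model.rpart1$finalModel, cex = 0.7)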
predictions.rpart1 <- predict(Model.rpart1,newdata = Test.set)
confusionMatrix(predictions.rpart1,Test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2043 663 615 578 213
## B 37 492 56 230 197
## C 148 363 697 478 373
## D 0 0 0 0 0
## E 4 0 0 0 659
##
## Overall Statistics
##
## Accuracy : 0.4959
## 95% CI : (0.4848, 0.507)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3408
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9153 0.32411 0.50950 0.0000 0.45700
## Specificity 0.6315 0.91783 0.78975 1.0000 0.99938
## Pos Pred Value 0.4968 0.48617 0.33851 NaN 0.99397
## Neg Pred Value 0.9494 0.84987 0.88405 0.8361 0.89099
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2604 0.06271 0.08884 0.0000 0.08399
## Detection Prevalence 0.5241 0.12898 0.26243 0.0000 0.08450
## Balanced Accuracy 0.7734 0.62097 0.64963 0.5000 0.72819
The accuracy of this model, at roughly 50%, is low. Next we fit a Random Forest model.
#Model Random Forest
set.seed(12345)
Model.rf <- train(classe ~ ., data = Train.set, method = "rf", metric = "Accuracy", preProcess = c("center","scale"),trControl = trainControl(method = "cv",number = 4, p = 0.6, allowParallel = TRUE))
print(Model.rf)
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 8833, 8832, 8831, 8832
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9891305 0.9862496
## 27 0.9898949 0.9872167
## 52 0.9802145 0.9749688
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
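Before evaluating the model on the test set, we can also inspect which predictors carry the most weight; a small sketch using caret's varImp:
# variable importance from the fitted Random Forest
rf.importance <- varImp(Model.rf)
plot(rf.importance, top = 20)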
predictions.rf <- predict(Model.rf,newdata = Test.set)
confusionMatrix(predictions.rf,Test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2224 18 0 1 0
## B 6 1494 23 0 2
## C 0 5 1341 17 3
## D 0 1 4 1268 10
## E 2 0 0 0 1427
##
## Overall Statistics
##
## Accuracy : 0.9883
## 95% CI : (0.9856, 0.9905)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9852
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9842 0.9803 0.9860 0.9896
## Specificity 0.9966 0.9951 0.9961 0.9977 0.9997
## Pos Pred Value 0.9915 0.9797 0.9817 0.9883 0.9986
## Neg Pred Value 0.9986 0.9962 0.9958 0.9973 0.9977
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1904 0.1709 0.1616 0.1819
## Detection Prevalence 0.2859 0.1944 0.1741 0.1635 0.1821
## Balanced Accuracy 0.9965 0.9896 0.9882 0.9919 0.9946
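The estimated out-of-sample error follows directly from the test-set confusion matrix; a small illustrative computation:
# out-of-sample error estimated on the held-out Test.set
cm.rf <- confusionMatrix(predictions.rf, Test.set$classe)
round(1 - as.numeric(cm.rf$overall["Accuracy"]), 4)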
The accuracy of this model on the held-out test set is 98.8%, which is very good, so the estimated out-of-sample error is about 1.2%. The Random Forest model may therefore be the right choice, but before deciding we also run a Generalized Boosted Regression model.
#Model Generalized Boosted Regression
set.seed(12)
Model.Control <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
Model.gbm <- train(classe ~ ., data = Train.set, method = "gbm", trControl = Model.Control, verbose = FALSE)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loaded gbm 2.1.3
predictions.gbm <- predict(Model.gbm, newdata = Test.set )
confusionMatrix(predictions.gbm,Test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2185 56 0 2 4
## B 35 1424 48 4 16
## C 5 36 1304 55 9
## D 3 1 14 1220 27
## E 4 1 2 5 1386
##
## Overall Statistics
##
## Accuracy : 0.9583
## 95% CI : (0.9537, 0.9626)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9473
## Mcnemar's Test P-Value : 3.194e-11
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9789 0.9381 0.9532 0.9487 0.9612
## Specificity 0.9890 0.9837 0.9838 0.9931 0.9981
## Pos Pred Value 0.9724 0.9325 0.9255 0.9644 0.9914
## Neg Pred Value 0.9916 0.9851 0.9901 0.9900 0.9913
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2785 0.1815 0.1662 0.1555 0.1767
## Detection Prevalence 0.2864 0.1946 0.1796 0.1612 0.1782
## Balanced Accuracy 0.9839 0.9609 0.9685 0.9709 0.9796
The accuracy of this model on the test set is 95.8%, lower than that of the Random Forest model.
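To summarize the comparison, the test-set accuracies of the three models can be collected in one place (an illustrative sketch):
# test-set accuracy of each model, side by side
acc <- sapply(list(rpart = predictions.rpart1,
                   rf    = predictions.rf,
                   gbm   = predictions.gbm),
              function(p) confusionMatrix(p, Test.set$classe)$overall["Accuracy"])
round(acc, 4)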
Based on its test-set accuracy of 98.8%, the Random Forest model is chosen.
We apply the Random Forest model to the 20 test cases in the quiz data set.
#apply Model.rf to quizz
prediction.quizz <- predict(Model.rf,newdata = quizz)
prediction.quizz
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E