Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants.
These participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal is to predict how an exercise was executed (i.e., which class) from the output of the four accelerometers.
The following steps were taken to complete the project: load the required libraries and data, clean the data, split it into training and testing sets, fit and compare several models, and apply the best model to the 20 quiz cases.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(knitr)
setwd("/Users/jvanstee/datasciencecoursera/practicalmachinelearning")
trainURL <-
"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
A quick review of the CSV files shows that many columns are filled with NA, #DIV/0!, and blank values. Since these are not valid values, we convert them to NA with the na.strings parameter when reading the files and then drop every column that contains NA. Since the first 7 columns contain identifiers and timestamps rather than predictors, we remove them from the data set as well.
training <- read.csv(url(trainURL),na.strings = c("NA","#DIV/0!",""))
quizz <- read.csv(url(testURL),na.strings = c("NA","#DIV/0!",""))
#remove columns that contain non-valid (NA) values
training <- training[,colSums(is.na(training))==0]
quizz <- quizz[,colSums(is.na(quizz))==0]
#remove first 7 columns
training <- training[,-c(1:7)]
dim(training)
## [1] 19622 53
quizz <- quizz[,-c(1:7)]
dim(quizz)
## [1] 20 53
After this operation the remaining training set still contains 19,622 observations, but the number of columns is reduced from 160 to 53 (52 predictors plus the classe outcome).
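As a quick sanity check, the cleaned quiz set should keep the same predictor columns as the cleaned training set; only the last column should differ (classe in the training file, problem_id in the quiz file). A minimal sketch:
# columns present in one cleaned data set but not the other;
# the expected differences are only the outcome and the quiz id
setdiff(names(training), names(quizz))
setdiff(names(quizz), names(training))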
We split the training data into a training set (Train.set, 60% of the observations) and a testing set (Test.set, the remaining 40%).
inTrain <- createDataPartition(training$classe, p= 0.6, list = FALSE)
Train.set <- training[inTrain,]
Test.set <- training[-inTrain,]
dim(Train.set)
## [1] 11776 53
dim(Test.set)
## [1] 7846 53
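Since createDataPartition samples within each class, the class proportions should be nearly identical in the two partitions; a quick illustrative check:
# class proportions in each partition (should closely match each other)
round(prop.table(table(Train.set$classe)), 3)
round(prop.table(table(Test.set$classe)), 3)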
Since many algorithms in the caret package are computationally intensive, we enable multicore processing to expedite the computations. This project was run on a MacBook Air with one Intel Core i7 processor (two cores).
#multicore Parallel processing
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
doMC::registerDoMC(cores=2)
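If the number of available cores is not known in advance, it could also be detected at run time; a small sketch (leaving one core free for the OS is an assumption, not a requirement):
# detect available cores at run time and keep one free for the OS
n.cores <- max(1, parallel::detectCores() - 1)
doMC::registerDoMC(cores = n.cores)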
We fit several prediction models and pick the one with the best accuracy on the held-out test set.
#Model Decision Trees (rpart) with scaling and cross-validation
set.seed(1234)
Model.rpart1 <- train(classe ~ ., preProcess = c("center","scale"),trControl = trainControl(method = "cv",number = 4),data = Train.set, method = "rpart")
print(Model.rpart1)
## CART
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 8833, 8831, 8832, 8832
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03500237 0.5091706 0.36251629
## 0.06003797 0.4278303 0.22869990
## 0.11414333 0.3446776 0.09215309
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03500237.
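Before evaluating it on the test set, the fitted tree can be visualized to see which splits it uses (a minimal sketch using base graphics; rpart.plot or rattle would give a prettier plot if installed):
# plot the final classification tree from the caret fit
plot(Model.rpart1$finalModel, uniform = TRUE, margin = 0.1)
text(Model.rpart1$finalModel, cex = 0.7)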
predictions.rpart1 <- predict(Model.rpart1,newdata = Test.set)
confusionMatrix(predictions.rpart1,Test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2043 663 615 578 213
## B 37 492 56 230 197
## C 148 363 697 478 373
## D 0 0 0 0 0
## E 4 0 0 0 659
##
## Overall Statistics
##
## Accuracy : 0.4959
## 95% CI : (0.4848, 0.507)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3408
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9153 0.32411 0.50950 0.0000 0.45700
## Specificity 0.6315 0.91783 0.78975 1.0000 0.99938
## Pos Pred Value 0.4968 0.48617 0.33851 NaN 0.99397
## Neg Pred Value 0.9494 0.84987 0.88405 0.8361 0.89099
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2604 0.06271 0.08884 0.0000 0.08399
## Detection Prevalence 0.5241 0.12898 0.26243 0.0000 0.08450
## Balanced Accuracy 0.7734 0.62097 0.64963 0.5000 0.72819
The accuracy of this model, at roughly 50%, is low. Next we fit a Random Forest model.
#Model Random Forest
set.seed(12345)
Model.rf <- train(classe ~ ., data = Train.set, method = "rf", metric = "Accuracy", preProcess = c("center","scale"),trControl = trainControl(method = "cv",number = 4, p = 0.6, allowParallel = TRUE))
print(Model.rf)
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 8833, 8832, 8831, 8832
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9891305 0.9862496
## 27 0.9898949 0.9872167
## 52 0.9802145 0.9749688
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
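Before evaluating the model on the test set, we can also inspect which predictors carry the most weight; a small sketch using caret's varImp:
# variable importance from the fitted Random Forest
rf.importance <- varImp(Model.rf)
plot(rf.importance, top = 20)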
predictions.rf <- predict(Model.rf,newdata = Test.set)
confusionMatrix(predictions.rf,Test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2224 18 0 1 0
## B 6 1494 23 0 2
## C 0 5 1341 17 3
## D 0 1 4 1268 10
## E 2 0 0 0 1427
##
## Overall Statistics
##
## Accuracy : 0.9883
## 95% CI : (0.9856, 0.9905)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9852
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9842 0.9803 0.9860 0.9896
## Specificity 0.9966 0.9951 0.9961 0.9977 0.9997
## Pos Pred Value 0.9915 0.9797 0.9817 0.9883 0.9986
## Neg Pred Value 0.9986 0.9962 0.9958 0.9973 0.9977
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1904 0.1709 0.1616 0.1819
## Detection Prevalence 0.2859 0.1944 0.1741 0.1635 0.1821
## Balanced Accuracy 0.9965 0.9896 0.9882 0.9919 0.9946
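The estimated out-of-sample error follows directly from the test-set confusion matrix; a small illustrative computation:
# out-of-sample error estimated on the held-out Test.set
cm.rf <- confusionMatrix(predictions.rf, Test.set$classe)
round(1 - as.numeric(cm.rf$overall["Accuracy"]), 4)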
The accuracy of this model on the held-out test set is 98.8%, which is very good, so the estimated out-of-sample error is about 1.2%. The Random Forest model may therefore be the right choice, but before deciding we also run a Generalized Boosted Regression model.
#Model Generalized Boosted Regression
set.seed(12)
Model.Control <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
Model.gbm <- train(classe ~ ., data = Train.set, method = "gbm", trControl = Model.Control, verbose = FALSE)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loaded gbm 2.1.3
predictions.gbm <- predict(Model.gbm, newdata = Test.set )
confusionMatrix(predictions.gbm,Test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2185 56 0 2 4
## B 35 1424 48 4 16
## C 5 36 1304 55 9
## D 3 1 14 1220 27
## E 4 1 2 5 1386
##
## Overall Statistics
##
## Accuracy : 0.9583
## 95% CI : (0.9537, 0.9626)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9473
## Mcnemar's Test P-Value : 3.194e-11
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9789 0.9381 0.9532 0.9487 0.9612
## Specificity 0.9890 0.9837 0.9838 0.9931 0.9981
## Pos Pred Value 0.9724 0.9325 0.9255 0.9644 0.9914
## Neg Pred Value 0.9916 0.9851 0.9901 0.9900 0.9913
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2785 0.1815 0.1662 0.1555 0.1767
## Detection Prevalence 0.2864 0.1946 0.1796 0.1612 0.1782
## Balanced Accuracy 0.9839 0.9609 0.9685 0.9709 0.9796
The accuracy of this model on the test set is 95.8%, lower than that of the Random Forest model.
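To summarize the comparison, the test-set accuracies of the three models can be collected in one place (an illustrative sketch):
# test-set accuracy of each model, side by side
acc <- sapply(list(rpart = predictions.rpart1,
                   rf    = predictions.rf,
                   gbm   = predictions.gbm),
              function(p) confusionMatrix(p, Test.set$classe)$overall["Accuracy"])
round(acc, 4)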
Based on its test-set accuracy of 98.8%, the Random Forest model is chosen.
We apply the Random Forest model to the 20 test cases in the quiz data set.
#apply Model.rf to quizz
prediction.quizz <- predict(Model.rf,newdata = quizz)
prediction.quizz
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E