Executive Summary
Wearable devices make it possible to collect a large amount of data about personal activity. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. The goal of this paper is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how an exercise was performed. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Loading the data
PmlTrain <- read.csv("pml-training.csv", header=TRUE)
PmlTest <- read.csv("pml-testing.csv", header=TRUE)
dim(PmlTrain); dim(PmlTest)
## [1] 19622 160
## [1] 20 160
Preparing and Cleaning the data
The paper uses the accelerometer measurements from the belt, forearm, arm and dumbbell as predictors; the outcome is classe. Accelerometer columns that contain NA values are removed.
library(plyr); library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.3
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
PmlTrainAccel <- select(PmlTrain, contains("accel"), contains("classe"))
PmlTrainAccel <- PmlTrainAccel[ , colSums(is.na(PmlTrainAccel)) == 0]
summary(PmlTrainAccel)
## total_accel_belt accel_belt_x accel_belt_y accel_belt_z
## Min. : 0.00 Min. :-120.000 Min. :-69.00 Min. :-275.00
## 1st Qu.: 3.00 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00
## Median :17.00 Median : -15.000 Median : 35.00 Median :-152.00
## Mean :11.31 Mean : -5.595 Mean : 30.15 Mean : -72.59
## 3rd Qu.:18.00 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00
## Max. :29.00 Max. : 85.000 Max. :164.00 Max. : 105.00
## total_accel_arm accel_arm_x accel_arm_y accel_arm_z
## Min. : 1.00 Min. :-404.00 Min. :-318.0 Min. :-636.00
## 1st Qu.:17.00 1st Qu.:-242.00 1st Qu.: -54.0 1st Qu.:-143.00
## Median :27.00 Median : -44.00 Median : 14.0 Median : -47.00
## Mean :25.51 Mean : -60.24 Mean : 32.6 Mean : -71.25
## 3rd Qu.:33.00 3rd Qu.: 84.00 3rd Qu.: 139.0 3rd Qu.: 23.00
## Max. :66.00 Max. : 437.00 Max. : 308.0 Max. : 292.00
## total_accel_dumbbell accel_dumbbell_x accel_dumbbell_y
## Min. : 0.00 Min. :-419.00 Min. :-189.00
## 1st Qu.: 4.00 1st Qu.: -50.00 1st Qu.: -8.00
## Median :10.00 Median : -8.00 Median : 41.50
## Mean :13.72 Mean : -28.62 Mean : 52.63
## 3rd Qu.:19.00 3rd Qu.: 11.00 3rd Qu.: 111.00
## Max. :58.00 Max. : 235.00 Max. : 315.00
## accel_dumbbell_z total_accel_forearm accel_forearm_x accel_forearm_y
## Min. :-334.00 Min. : 0.00 Min. :-498.00 Min. :-632.0
## 1st Qu.:-142.00 1st Qu.: 29.00 1st Qu.:-178.00 1st Qu.: 57.0
## Median : -1.00 Median : 36.00 Median : -57.00 Median : 201.0
## Mean : -38.32 Mean : 34.72 Mean : -61.65 Mean : 163.7
## 3rd Qu.: 38.00 3rd Qu.: 41.00 3rd Qu.: 76.00 3rd Qu.: 312.0
## Max. : 318.00 Max. :108.00 Max. : 477.00 Max. : 923.0
## accel_forearm_z classe
## Min. :-446.00 A:5580
## 1st Qu.:-182.00 B:3797
## Median : -39.00 C:3422
## Mean : -55.29 D:3216
## 3rd Qu.: 26.00 E:3607
## Max. : 291.00
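The column filter used above, colSums(is.na(...)) == 0, keeps only the columns that contain no missing values. A minimal sketch on a made-up data frame (the name toy and its values are illustrative assumptions, not part of the PML data):

```r
# Toy data frame (hypothetical values) with one column that contains an NA.
toy <- data.frame(
  accel_x = c(1, 2, 3),
  accel_y = c(4, NA, 6),
  classe  = factor(c("A", "B", "A"))
)

# Count the NAs in each column, then keep the columns whose count is zero.
# drop = FALSE keeps the result a data frame even if one column survives.
na_counts <- colSums(is.na(toy))
toy_clean <- toy[, na_counts == 0, drop = FALSE]

names(toy_clean)
```

Here accel_y is dropped because it holds one NA, while the other two columns are kept unchanged.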
Exploring the data
dim(PmlTrainAccel); names(PmlTrainAccel)
## [1] 19622 17
## [1] "total_accel_belt" "accel_belt_x" "accel_belt_y"
## [4] "accel_belt_z" "total_accel_arm" "accel_arm_x"
## [7] "accel_arm_y" "accel_arm_z" "total_accel_dumbbell"
## [10] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [13] "total_accel_forearm" "accel_forearm_x" "accel_forearm_y"
## [16] "accel_forearm_z" "classe"
pie(summary(PmlTrainAccel$classe), main="5 different classes of barbell lifts")
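The class balance shown by the pie chart can also be expressed as proportions. A small sketch, using the class counts from the summary output above; the object names classe_counts and classe_share are illustrative:

```r
# Class counts copied from the summary(PmlTrainAccel) output.
classe_counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Convert the counts to proportions of all observations.
classe_share <- prop.table(classe_counts)
round(classe_share, 3)

# A bar chart is one alternative to the pie chart for comparing class sizes.
barplot(classe_share, main = "Share of observations per classe",
        ylab = "Proportion")
```

Class A is the largest class, and the remaining four classes are of broadly similar size.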

Splitting the data and plotting the predictors
The data are split into a training set (60% of the rows) and a testing set (the remaining 40%). The training set is used to build the prediction model and the testing set is used to determine the error rate.
library(caret)
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.1.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.2
inTrain <- createDataPartition(y=PmlTrainAccel$classe, p=0.6, list=FALSE)
training <- PmlTrainAccel[inTrain,]; testing <- PmlTrainAccel[-inTrain,]
dim(training); dim(testing)
## [1] 11776 17
## [1] 7846 17
featurePlot(x=training[, c(-17)], y=training$classe, plot="pairs")
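The split above samples within each class, so both parts keep roughly the same class proportions. The sketch below illustrates that idea in base R; it is not caret's implementation, and the helper stratified_indices and the made-up outcome y are assumptions for illustration only. Setting the seed before sampling makes the split reproducible.

```r
# Hypothetical helper: draw a fraction p of the row indices within each class,
# so the class proportions of y are roughly preserved in the training part.
stratified_indices <- function(y, p) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(12345)  # fix the seed before sampling so the split is reproducible
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))  # made-up outcome
in_train <- stratified_indices(y, p = 0.6)

table(y[in_train])   # per-class counts in the training part
table(y[-in_train])  # per-class counts in the testing part
```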

Fitting the Random Forest model to the training data set (training)
library(randomForest); library(ipred); set.seed(12345)
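# Note: preProcess is an argument of caret::train(); randomForest() has no such
# parameter, so the predictors are not centered or scaled by this call.
ModFit <- randomForest(classe ~., data=training, preProcess=c("center","scale")); ModFit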
##
## Call:
## randomForest(formula = classe ~ ., data = training, preProcess = c("center", "scale"))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.99%
## Confusion matrix:
## A B C D E class.error
## A 3207 35 45 58 3 0.04211470
## B 93 2068 75 22 21 0.09258447
## C 29 63 1938 18 6 0.05647517
## D 49 12 84 1774 11 0.08082902
## E 5 33 20 23 2084 0.03741339
Accuracy is equal to 1 - error rate. With an out-of-bag (OOB) error estimate of 5.99%, the estimated accuracy of the model on the training data set is 100% - 5.99% = 94.01%, which is more than 90%.
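The accuracy figure can be recomputed from the confusion matrix printed above. A minimal sketch, with the counts copied from the ModFit output; the object names are illustrative:

```r
# Confusion matrix copied from the ModFit output
# (rows = actual class, columns = predicted class).
classes <- c("A", "B", "C", "D", "E")
conf <- matrix(
  c(3207,   35,   45,   58,    3,
      93, 2068,   75,   22,   21,
      29,   63, 1938,   18,    6,
      49,   12,   84, 1774,   11,
       5,   33,   20,   23, 2084),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

accuracy    <- sum(diag(conf)) / sum(conf)     # share of correctly classified rows
error_rate  <- 1 - accuracy                    # overall error rate
class_error <- 1 - diag(conf) / rowSums(conf)  # per-class error, as in class.error

round(100 * c(accuracy = accuracy, error_rate = error_rate), 2)
round(class_error, 4)
```

The diagonal holds the correctly classified rows, so the error rate is the share of off-diagonal counts (705 of 11776 rows).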
Applying the Random Forest method to the testing data set (testing)
ModFittest <- randomForest(classe ~ ., data=testing, importance=T, prox=T); ModFittest
##
## Call:
## randomForest(formula = classe ~ ., data = testing, importance = T, prox = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 7.51%
## Confusion matrix:
## A B C D E class.error
## A 2123 17 34 53 5 0.04883513
## B 72 1359 58 16 13 0.10474308
## C 32 44 1271 17 4 0.07090643
## D 54 10 60 1149 13 0.10653188
## E 4 38 23 22 1355 0.06033287
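Accuracy is equal to 1 - error rate. With an OOB error estimate of 7.51%, the estimated accuracy is 100% - 7.51% = 92.49%, which is more than 90%. Note that ModFittest is a second forest fitted to the testing data set, so this figure is the OOB estimate for that second model rather than the error of ModFit on held-out data.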
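A common alternative for estimating held-out error is to predict the testing rows with the model that was fitted on the training rows, instead of fitting a new forest. A minimal sketch, assuming the randomForest package is installed and using the built-in iris data as a stand-in for the PML data; all object names here are illustrative:

```r
library(randomForest)

set.seed(12345)
# Stand-in data: iris replaces PmlTrainAccel and Species replaces classe.
in_train  <- sample(nrow(iris), size = 0.6 * nrow(iris))
train_set <- iris[in_train, ]
test_set  <- iris[-in_train, ]

# Fit on the training rows only.
fit <- randomForest(Species ~ ., data = train_set)

# Predict the held-out rows with the already fitted model.
pred <- predict(fit, newdata = test_set)

# Cross-tabulate predictions against the true labels and compute accuracy.
holdout_table    <- table(actual = test_set$Species, predicted = pred)
holdout_accuracy <- mean(pred == test_set$Species)
holdout_table
holdout_accuracy
```

With the objects in this report, the analogous step would be to compare predict(ModFit, newdata=testing) against testing$classe.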
Predicting the “classe” with the PmlTest data set (validation)
PMLA2 <- predict(ModFit, newdata=PmlTest); PMLA2
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Cross validating ModFit and ModFittest
The PMLA2 values are the predicted answers to the 20 questions in the PmlTest data set; all 20 of the 20 answers are correct. This is consistent with the confusion matrices above, which show an accuracy of more than 90% for the Random Forest method.