Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. The classe variable in the training set is the target. The report describes how the model was built, how cross validation was used, the expected out of sample error, and why the data processing and modeling choices were made. The prediction model will be used to predict 20 different test cases.
The training and test data analyzed in this report are available here:
This section will go through the required steps to get the raw data loaded, processed, and ready to be modeled.
First, the required packages/libraries are loaded.
# Load libraries ---------------------------------------------------------------
library(caret)
library(dplyr)
Second, the raw training and testing data sets are loaded.
# Load data --------------------------------------------------------------------
training <- read.csv("./data/pml-training.csv")
testing <- read.csv("./data/pml-testing.csv")
The training data set has a large number of columns, many of which are either (1) not predictors or (2) have NAs or blank values. We need to clean the training and testing data sets to remove non-predictors / targets and any columns with empty values (e.g. NA or blank values).
# Initial data frame sizes and structure ---------------------------------------
dim(training)
## [1] 19622 160
str(training[ ,1:20]) # Structure of first 20 columns shown
## 'data.frame': 19622 obs. of 20 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1: int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2: int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1: Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
First, the non-predictor / target columns (columns 1 through 7) are removed.
# Remove non-predictors and non-target columns ---------------------------------
training <- training[,-(1:7)]
testing <- testing[,-(1:7)]
Second, the columns with NAs and blanks are removed.
# Remove columns with NA or blank values ---------------------------------------
training <- training[ , !apply(training, 2, function(x) any(is.na(x)))]
training <- training[ , !apply(training, 2, function(x) any(x==""))]
testing <- testing[ , !apply(testing, 2, function(x) any(is.na(x)))]
testing <- testing[ , !apply(testing, 2, function(x) any(x==""))]
The final training set dimensions are as follows. As we can see, it cut down the columns substantially, leaving a well formed data frame for modeling.
# Final dimensions and structure of training set -------------------------------
dim(training)
## [1] 19622 53
str(training[,1:20]) # Structure of first 20 columns shown
## 'data.frame': 19622 obs. of 20 variables:
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt: int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
Once the data is cleaned, the first step is to partition the training set into a training and a test set. The set.seed() function is used for reproducibility. A random sample is taken using split of 60/40 for training/testing data sets using the createDataPartition() function. The train.set data will be used for training the models. The test.set data will be used to determine accuracy of the test set.
# Partition data into training, testing, and CV sets ---------------------------
set.seed(100)
inTrain <- createDataPartition(y=training$classe, p=0.6, list=FALSE)
train.set <- training[inTrain, ] # 60% of training set
test.set <- training[-inTrain, ] # 40% of training set
No additional pre-processing is performed. PCA was not used because this would reduce the interpretability of the model. Feature reduction using variable importance was not used since this increases the bias of the model.
Three models were created using the train.set: random forest, linear discriminant analysis, and gradient boosting machine (boosting). Cross validation was performed on all models using k = 5 folds. For computational considerations with random forest, rather than limiting the number of factors, limiting the number of trees provided a computationally efficient method to develop the model. The number of trees was limited to 300.
# Modeling ---------------------------------------------------------------------
folds <- 5
ntree <- 300
set.seed(222)
fitControl <- trainControl(method="cv", number = folds)
# Random Forest
model.rf <- train(classe ~ .,
data=train.set,
method="rf",
trControl=fitControl,
ntree=ntree)
# LDA
model.lda <- train(classe ~ .,
data=train.set,
method="lda",
trControl=fitControl)
# GBM (Boosting)
model.gbm <- train(classe ~ .,
data=train.set,
method="gbm",
trControl=fitControl,
verbose=FALSE)
In this section, we assess the performance of each model on the test.set. The model with the lowest out-of-sample error rate (1 - Testing Accuracy Rate) is considered to have the best performance.
The following code predicts the classe classification on the test.set for the Random Forest model.
# Prediction on test set -------------------------------------------------------
pred.rf <- predict(model.rf, newdata=test.set)
The confusionMatrix() function shows the accuracy of the model on the test data set.
# Confusion Matrix to show accuracy --------------------------------------------
confusionMatrix(pred.rf, test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2231 10 0 0 0
## B 1 1502 9 0 0
## C 0 6 1350 29 1
## D 0 0 9 1257 6
## E 0 0 0 0 1435
##
## Overall Statistics
##
## Accuracy : 0.991
## 95% CI : (0.9886, 0.9929)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9886
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9996 0.9895 0.9868 0.9774 0.9951
## Specificity 0.9982 0.9984 0.9944 0.9977 1.0000
## Pos Pred Value 0.9955 0.9934 0.9740 0.9882 1.0000
## Neg Pred Value 0.9998 0.9975 0.9972 0.9956 0.9989
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1914 0.1721 0.1602 0.1829
## Detection Prevalence 0.2856 0.1927 0.1767 0.1621 0.1829
## Balanced Accuracy 0.9989 0.9939 0.9906 0.9876 0.9976
Out of sample error for RF is 0.009.
# Out of sample error ----------------------------------------------------------
round(1-confusionMatrix(pred.rf, test.set$classe)$overall[[1]], 4)
## [1] 0.009
The following code predicts the classe classification on the test.set for the LDA model.
# Prediction on test set -------------------------------------------------------
pred.lda <- predict(model.lda, newdata=test.set)
The confusionMatrix() function shows the accuracy of the model on the test data set.
# Confusion Matrix to show accuracy --------------------------------------------
confusionMatrix(pred.lda, test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1792 231 115 71 43
## B 48 960 131 58 259
## C 211 181 925 140 114
## D 177 67 166 971 156
## E 4 79 31 46 870
##
## Overall Statistics
##
## Accuracy : 0.7033
## 95% CI : (0.693, 0.7134)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.625
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8029 0.6324 0.6762 0.7551 0.6033
## Specificity 0.9181 0.9216 0.9003 0.9137 0.9750
## Pos Pred Value 0.7957 0.6593 0.5888 0.6318 0.8447
## Neg Pred Value 0.9213 0.9127 0.9294 0.9501 0.9161
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2284 0.1224 0.1179 0.1238 0.1109
## Detection Prevalence 0.2870 0.1856 0.2002 0.1959 0.1313
## Balanced Accuracy 0.8605 0.7770 0.7882 0.8344 0.7892
Out of sample error for LDA is 0.2967.
# Out of sample error ----------------------------------------------------------
round(1-confusionMatrix(pred.lda, test.set$classe)$overall[[1]], 4)
## [1] 0.2967
The following code predicts the classe classification on the test.set for the GBM model.
# Prediction on test set -------------------------------------------------------
pred.gbm <- predict(model.gbm, newdata=test.set)
The confusionMatrix() function shows the accuracy of the model on the test data set.
# Confusion Matrix to show accuracy --------------------------------------------
confusionMatrix(pred.gbm, test.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2204 54 0 3 1
## B 20 1410 41 2 19
## C 4 50 1304 39 12
## D 3 4 20 1228 15
## E 1 0 3 14 1395
##
## Overall Statistics
##
## Accuracy : 0.9611
## 95% CI : (0.9566, 0.9653)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9508
## Mcnemar's Test P-Value : 1.279e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9875 0.9289 0.9532 0.9549 0.9674
## Specificity 0.9897 0.9870 0.9838 0.9936 0.9972
## Pos Pred Value 0.9744 0.9450 0.9255 0.9669 0.9873
## Neg Pred Value 0.9950 0.9830 0.9901 0.9912 0.9927
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2809 0.1797 0.1662 0.1565 0.1778
## Detection Prevalence 0.2883 0.1902 0.1796 0.1619 0.1801
## Balanced Accuracy 0.9886 0.9579 0.9685 0.9742 0.9823
Out of sample error for GBM is 0.0389.
# Out of sample error ----------------------------------------------------------
round(1-confusionMatrix(pred.gbm, test.set$classe)$overall[[1]], 4)
## [1] 0.0389
The Random Forest model produced the most accurate model, followed closely by the GBM model and more distantly by the LDA model. The out of sample error rates were as follows:
The Random Forest model will be used to predict the testing set.
The testing set is predicted below.
pred.testing <- predict(model.rf, newdata=testing)
data.frame(Observation = 1:20 , Prediction = pred.testing)
## Observation Prediction
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B