Practical Machine Learning Course Project

Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity. For this project, we use the Weight Lifting Exercises data set found at http://groupware.les.inf.puc-rio.br/har. In this data set, six participants wore accelerometers and were asked to perform barbell lifts correctly and incorrectly in 5 different ways, with Class A being the correct method and Classes B-E using an incorrect method. Our goal is to use data from the Weight Lifting Exercises dataset to predict the manner in which the participants performed the exercise.

Data Transformation

To begin, download the testing and training data sets and load required packages. We will set the testing dataset aside for now as it will be used for final validation.

training<-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                   na.strings=c("NA","#DIV/0!", ""))
testing<-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                  na.strings=c("NA","#DIV/0!", ""))
library(AppliedPredictiveModeling)
library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

library(randomForest)

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

library(rattle)

## Loading required package: RGtk2
## Rattle: A free graphical interface for data mining with R.
## Version 3.5.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

str(training)

## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

A review of the structure of the training set reveals a large number of missing values, which we remove. We also remove the first seven columns of the data set as not relevant to our review since the person performing the maneuver or the time it was performed should not impact the class to which the activity belongs. Finally, we divide the cleaned training set into two new datasets for cross-validation purposes. We split the data based on the “classe” variable and create a new training set using 60% of the original data, with the remaining 40% of the data split onto a new testing set.

trainingnoNA<-training[,colSums(is.na(training))==0]
testingnoNA<-testing[,colSums(is.na(testing))==0]

trainingnoNA<-trainingnoNA[,-c(1:7)]
testingnoNA<-testingnoNA[,-c(1:7)]

inTrain<-createDataPartition(y=trainingnoNA$classe,p=0.6,list=FALSE)
newtrain<-trainingnoNA[inTrain,]
newtest<-trainingnoNA[-inTrain,]

Before beginning our modeling, let’s check the frequency of the appearance of Classes A-E. If there is severe imbalance, weighting may need to be considered. It would also be difficult to make an accurate model if one of the classes has little to no appearance in our new training data set.

plot(newtrain$classe,main="Frequency of 'Classe' Variable",xlab="'Classe' Variable",ylab="Frequency",col="darkorchid")

Per our histogram, Class A appears most frequently, with the other variables all having a strong showing in our training data set. With these frequencies, accurate model prediction for each variable should not be difficult to achieve. The histogram also gives us a basic idea of what we can expect to see when running our final model on the test/validation data set.

Model Creation

Let’s begin with a basic classification tree model. We have 52 variables to consider (with column 53 being our ‘classe’ variable), therefore our initial model with all variables included will most likely not be our final model due to potential variable correlation. Since we are trying to predict class type, let’s start with the rpart method.

dim(newtrain)

## [1] 11776    53

set.seed(777)
modFitstart <- train(classe ~ .,method="rpart",data=newtrain)

## Loading required package: rpart

fancyRpartPlot(modFitstart$finalModel)

Using the rattle package, we can make a dendrogram that shows the predicted class selection based on the rpart model. However, one big issue sticks out from this plot - given the frequencies in our histogram, how is it not possible to reach class D as an outcome using this method? A review of model accuracy shows that the rpart method using all available variables is little better than random chance guessing. Pretty classification trees do not indicate accuracy, and this is not the model for us.

modFitstart

## CART 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.03500237  0.5219837  0.3812620  0.02941830   0.04819701
##   0.05995887  0.4020024  0.1827887  0.05657764   0.09577239
##   0.11366872  0.3145447  0.0431631  0.03833566   0.05875299
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03500237.

Let’s use the caret package’s preprocessing abilities to help in our model building. All 52 variables may not be necessary - instead, we can eliminate some redundant variables that may unnecessarily complicate our model. Only those features that impact class variability should be considered.

preProc<-preProcess(newtrain[,-53],method="pca",thresh=0.95)
preProc

## Created from 11776 samples and 52 variables
## 
## Pre-processing:
##   - centered (52)
##   - ignored (0)
##   - principal component signal extraction (52)
##   - scaled (52)
## 
## PCA needed 26 components to capture 95 percent of the variance

Per the above principal component analysis, only 26 components are needed to capture 95% of the variance. Since we only want to use relevant features, we remove those variables that aren’t useful, may lead to overfitting, or may otherwise mislead our final results.

predtrain<-predict(preProc,newtrain[,-53])

Since it is accuracy we want, perhaps the best method to use is Random Forest as it is one of the best performing algorithms for prediction purposes. The authors of the original paper** also use the random forest method in their review. If running the below code, please be aware that this does take some computational time to run.

set.seed(777)
modFit<-train(newtrain$classe~.,data=predtrain,method="rf",trControl=trainControl(method="cv"),importance=TRUE)
modFit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.01%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3320   10   14    2    2 0.008363202
## B   43 2188   44    2    2 0.039929794
## C    8   32 1977   33    4 0.037487829
## D    5    4   90 1827    4 0.053367876
## E    3   10   21   21 2110 0.025404157

Our results of the random forest method show excellent accuracy, with an OOB error rate of 3%. Although we assume this to be slightly underestimated when applied to the testing set, we should still be within our 95% accuracy goal.

Since the training and test sets must be preprocessed in the same way for cross-validation purposes, we use the same preprocessing techniques on the test set as used on the training set. We then run a confusion matrix to see how well our model performs on the testing data set we created in the Data Transformation section.

predtest <- predict(preProc,newtest[,-53])
confusion<-confusionMatrix(newtest$classe,predict(modFit,predtest))
confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2208    8    8    3    5
##          B   40 1452   25    0    1
##          C    2   29 1314   21    2
##          D    3    2   60 1215    6
##          E    1    6   22   15 1398
## 
## Overall Statistics
##                                           
##                Accuracy : 0.967           
##                  95% CI : (0.9628, 0.9708)
##     No Information Rate : 0.2873          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9582          
##  Mcnemar's Test P-Value : 1.291e-11       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9796   0.9699   0.9195   0.9689   0.9901
## Specificity            0.9957   0.9896   0.9916   0.9892   0.9932
## Pos Pred Value         0.9892   0.9565   0.9605   0.9448   0.9695
## Neg Pred Value         0.9918   0.9929   0.9822   0.9941   0.9978
## Prevalence             0.2873   0.1908   0.1821   0.1598   0.1800
## Detection Rate         0.2814   0.1851   0.1675   0.1549   0.1782
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9876   0.9798   0.9556   0.9791   0.9916

Our results appear to be solid, with most predictions falling squarely in the correct class. Our accuracy rate is 96.7%, and our out of sample error rate is 1-accuracy, or 3.3%.

Final Results

The final step of our project is to take the 20 test cases set aside as a test/validation set and submit them for automatic review. Let’s run our machine learning algorithm on the 20 test cases and see how we did:

finalpredict<-predict(preProc,testingnoNA)
finalresults<-predict(modFit,finalpredict)
finalresults

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(finalresults)

20/20 is our final result - perfect!

**Source: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.

Practical Machine Learning Course Project

Tiffany Stoeke

November 12, 2015

Executive Summary

Data Transformation

Model Creation

Final Results