Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

The goal of this project is to predict the manner in which the participants did the exercise. This is the classe variable in the training set; any of the other variables may be used as predictors. This report describes how the model was built, how cross-validation was used, what the expected out-of-sample error is, and why the various choices were made. The prediction model is then used to predict 20 different test cases.

Clear R memory and load the training and testing datasets…
rm(list = ls())
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
training_raw = read.csv("/Users/woodzsan/Desktop/Data Science with R/Practical Machine Learning/Course Project/pml-training.csv")
testing = read.csv("/Users/woodzsan/Desktop/Data Science with R/Practical Machine Learning/Course Project/pml-testing.csv")
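As an aside, this dataset encodes missing values both as empty strings and as Excel-style "#DIV/0!" entries; passing na.strings to read.csv collapses all of these to NA, so a single NA filter would suffice later. A minimal sketch, using the same file path (the variable name is our own):

# Read the CSV treating "NA", "", and "#DIV/0!" all as missing values
training_raw_alt <- read.csv("/Users/woodzsan/Desktop/Data Science with R/Practical Machine Learning/Course Project/pml-training.csv",
                             na.strings = c("NA", "", "#DIV/0!"))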

Some statistics on the raw training dataset, training_raw:

dim(training_raw)
## [1] 19622   160
table(training_raw$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

Exploratory Analysis of the ‘training_raw’ dataset…

We see that a number of columns in training_raw are missing most of their values: they contain either ‘NA’ or the empty string in about 19216 of the 19622 rows, leaving data in only 406/19622 ≈ 2.1% of the rows. We will eliminate these columns.

# Drop columns with more than 19200 'NA' values
training <- training_raw[ , colSums(is.na(training_raw)) < 19200]
# Keep only columns with more than 19200 non-empty values
# (i.e., drop the mostly-empty-string columns)
training <- training[, colSums(training != "") > 19200]
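As a quick sanity check, we can confirm the filter removed the mostly-missing columns (a sketch; the expected count of 60 columns is consistent with the column indices used later in this report):

dim(training)
# Expect 19622 rows and 60 columns: the ~100 mostly-missing columns are gone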

Separation of the Raw Training Dataset into Training and Cross-Validation Datasets

We need to establish separate training and cross-validation datasets so that we can estimate the accuracy of our prediction model on data it was not trained on.

inTrain <- createDataPartition(y=training$classe,p=0.8,list=FALSE)
# Extract the held-out rows *before* overwriting `training`, so that the
# training and cross-validation sets are disjoint
cross_val <- training[-inTrain,]
training <- training[inTrain,]

y_training <- training$classe
y_cross_val <- cross_val$classe

# Drop the index, name, and timestamp columns (1:6) and the `classe`
# outcome (column 60), leaving 53 numeric predictors
X_train <- training[-c(1:6,60)]
X_cv <- cross_val[-c(1:6,60)]
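Because a common pitfall at this step is overwriting `training` before extracting the held-out rows, a quick sanity check (a sketch) confirms the two sets are disjoint:

# Row names survive subsetting, so disjoint splits share none of them
length(intersect(rownames(training), rownames(cross_val)))  # expect 0
nrow(training) + nrow(cross_val)  # expect 19622, split roughly 80/20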

Now, let’s look at the correlation values among the predictors…

# Pairwise absolute correlations among the predictors; zero the diagonal
# so self-correlations are not flagged
M <- abs(cor(X_train))
diag(M) <- 0
which(M>0.8,arr.ind = TRUE)
##                  row col
## yaw_belt           4   2
## total_accel_belt   5   2
## accel_belt_y      10   2
## accel_belt_z      11   2
## accel_belt_x       9   3
## magnet_belt_x     12   3
## roll_belt          2   4
## roll_belt          2   5
## accel_belt_y      10   5
## accel_belt_z      11   5
## pitch_belt         3   9
## magnet_belt_x     12   9
## roll_belt          2  10
## total_accel_belt   5  10
## accel_belt_z      11  10
## roll_belt          2  11
## total_accel_belt   5  11
## accel_belt_y      10  11
## pitch_belt         3  12
## accel_belt_x       9  12
## gyros_arm_y       20  19
## gyros_arm_x       19  20
## magnet_arm_x      25  22
## accel_arm_x       22  25
## magnet_arm_z      27  26
## magnet_arm_y      26  27
## accel_dumbbell_x  35  29
## accel_dumbbell_z  37  30
## gyros_dumbbell_z  34  32
## gyros_forearm_z   47  32
## gyros_dumbbell_x  32  34
## gyros_forearm_z   47  34
## pitch_dumbbell    29  35
## yaw_dumbbell      30  37
## gyros_forearm_z   47  46
## gyros_dumbbell_x  32  47
## gyros_dumbbell_z  34  47
## gyros_forearm_y   46  47
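The belt sensors dominate this list. One direct remedy is caret's findCorrelation(), which selects a minimal set of columns to drop so that no remaining pair exceeds a cutoff (a sketch; we take the PCA route below instead):

# Indices of predictors to drop so remaining pairwise |cor| <= 0.8
high_cor <- findCorrelation(cor(X_train), cutoff = 0.8)
X_train_reduced <- X_train[ , -high_cor]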

We see that many of the predictors, especially the belt sensors, are highly correlated with each other. Rather than dropping columns, let’s reduce dimensionality by fitting a PCA model with 2 principal components:

preProc <- preProcess(X_train,method="pca",pcaComp=2)
pred_training <- predict(preProc,X_train)
plot(pred_training[,1],pred_training[,2])

We see from the plot of the two principal components that there are 5 relatively distinct groupings, which we take to correspond to the different values of the classe variable; the sketch below colors the points by class to check this.
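A quick way to test this interpretation (a sketch; the axis labels and legend are our own additions):

# Color the PCA scores by classe to see whether the visible clusters
# actually track the outcome
classe_f <- as.factor(y_training)
plot(pred_training[,1],pred_training[,2],col=classe_f,xlab="PC1",ylab="PC2")
legend("topright",legend=levels(classe_f),col=seq_along(levels(classe_f)),pch=1)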

# Add `classe` vector to the principal component predictors...
training_pca <- cbind(pred_training,y_training)
# Rename the `y_training` column to `classe`
names(training_pca)[names(training_pca) == 'y_training'] <- 'classe'

Let’s train a Random Forest classifier using the principal components as our predictors.

modelFit <- train(classe~.,method="rf",data=training_pca)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
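As an aside, train() defaults to 25 bootstrap resamples, which is slow for a Random Forest on a dataset of this size; a lighter resampling specification (a sketch, anticipating the K-fold approach used later; the variable names are our own) typically cuts the fit time substantially:

# Swap the default bootstrap (25 reps) for 3-fold cross-validation
fast_ctrl <- trainControl(method = "cv", number = 3)
modelFit_fast <- train(classe~.,method="rf",data=training_pca,trControl=fast_ctrl)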

Cross-Validation Dataset Accuracy

First, let’s determine the principal component predictions for the cross-validation dataset based on our preprocessing parameters for the training dataset.

pred_cross_val <- predict(preProc,X_cv)
plot(pred_cross_val[,1],pred_cross_val[,2])

Let’s generate a confusion matrix for the cross-validation predictions for classe vs. the actual values.

confusionMatrix(y_cross_val,predict(modelFit,pred_cross_val))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 886   0   0   0   0
##          B   0 633   0   0   0
##          C   0   0 532   0   0
##          D   0   0   0 512   0
##          E   0   0   0   0 575
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9988, 1)
##     No Information Rate : 0.2823     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2823   0.2017   0.1695   0.1632   0.1832
## Detection Rate         0.2823   0.2017   0.1695   0.1632   0.1832
## Detection Prevalence   0.2823   0.2017   0.1695   0.1632   0.1832
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Wow, we have a cross-validation accuracy of 1! Perfect accuracy on a held-out set is suspicious: it suggests that the Random Forest is overfitting, or that the validation set is not truly independent of the training data.
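One diagnostic worth running before trusting this model (a sketch; the component count quoted in the comment is our assumption, not an output of the original analysis) is how many principal components are needed to retain most of the variance. If the answer is far more than 2, our 2-component model is discarding most of the signal:

# How many PCs does caret need to capture 95% of the variance?
preProc95 <- preProcess(X_train, method = "pca", thresh = 0.95)
preProc95$numComp  # typically around 25 for this dataset, far more than 2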

Let’s go ahead and test this RF model on the test data anyway. We need to apply the same column selection to the test set that we used for the training set, so that the columns are identical.

# Keep only columns with no NA values (in the test set, the mostly-missing
# columns are entirely NA)
testing_modified <- testing[ , colSums(is.na(testing)) < 1]
# Drop the index, name, and timestamp columns (1:6) and `problem_id`
# (column 60); the test set has `problem_id` where training had `classe`
testing_modified <- testing_modified[-c(1:6,60)]
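A quick sanity check (a sketch) confirms that the test predictors line up with the training predictors by name:

# Both differences should be empty (character(0))
setdiff(names(X_train), names(testing_modified))
setdiff(names(testing_modified), names(X_train))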

Generate principal components for the test set using the preProc object fitted on the training set.

testing_pca <- predict(preProc,testing_modified)
plot(testing_pca[,1],testing_pca[,2])

Make predictions on classe based on modelFit:

testing_classe_pred <- predict(modelFit,testing_pca)
testing_classe_pred
##  [1] A C A A D D D B A A A C D A E E C E D B
## Levels: A B C D E

This set of predictions gives us an accuracy of only 0.55 on the prediction quiz. Therefore, we conclude that we are overfitting.

Training Using K-Fold Cross-Validation

Define the training control for K-fold cross-validation with K = 3 folds.

set.seed(123)
train.control <- trainControl(method = "cv", number = 3, verboseIter=FALSE)
# Drop the index, name, and timestamp columns; keep all 53 predictors
# plus the `classe` outcome
training <- training[-c(1:6)]

Train the model…

modelFit2 <- train(classe ~., data = training, method = "rf",
               trControl = train.control)

For this model, the cross-validated accuracy is above 0.99 for every value of the mtry tuning parameter.

modelFit2
## Random Forest 
## 
## 15699 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 10468, 10465, 10465 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9945219  0.9930701
##   27    0.9970064  0.9962133
##   53    0.9956687  0.9945211
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
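Out of curiosity, we can ask which predictors drive the forest using caret's varImp() (a sketch, not part of the original analysis):

# Rank predictor importance for the fitted forest and plot the top 20
imp <- varImp(modelFit2)
plot(imp, top = 20)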

Let’s look at the confusion matrix for the cross-validation dataset.

confusionMatrix(y_cross_val,predict(modelFit2,X_cv))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 886   0   0   0   0
##          B   0 633   0   0   0
##          C   0   0 532   0   0
##          D   0   0   0 512   0
##          E   0   0   0   0 575
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9988, 1)
##     No Information Rate : 0.2823     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2823   0.2017   0.1695   0.1632   0.1832
## Detection Rate         0.2823   0.2017   0.1695   0.1632   0.1832
## Detection Prevalence   0.2823   0.2017   0.1695   0.1632   0.1832
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Now, we generate predictions on classe based on modelFit2:

testing_classe_pred <- predict(modelFit2,testing_modified)
testing_classe_pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

We find that these predictions give us a 100% score on the Prediction Quiz.
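For submission, each prediction can be written to its own text file, one per test case. A minimal sketch (the helper name and file-naming scheme are conventions for this assignment, not part of the analysis above):

# Write one single-letter file per test case for quiz upload
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(testing_classe_pred)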