Summary of Problem Statement

Personal activity trackers are often used to quantify how much activity wearers are performing, but they can also be used to determine whether or not the wearers are performing those activities correctly. Velloso et al. (2013) had six participants perform one set of repetitions of dumbbell biceps curls in five different ways (one correct form and four common mistakes).

The goal of this course project is to use machine learning to predict which class of activity was performed based on the data recorded by the personal activity trackers.

Data Collection and Preparation

Load packages

Load required packages:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

Download data files

Download the required data from the URLs specified in the Coursera problem statement. Note that the source *.csv files encode missing values in three ways (“NA” character strings, the division-by-zero string “#DIV/0!”, and blank strings “”), all of which are converted to NA values when the files are read with read.csv.

# Check - do files exist? Assuming that if one file exists then the other does
# as well
if (!file.exists("pml-training.csv") | !file.exists("pml-testing.csv")) {
    
    # If not, download to working directory
    download.file(url      = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                  destfile = "pml-training.csv")
    download.file(url      = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                  destfile = "pml-testing.csv")
}

# Read downloaded *.csv files, replace NA strings with NA values
dTrain <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", "")) # Training data
dTest  <- read.csv("pml-testing.csv",  na.strings = c("NA", "#DIV/0!", "")) # Test data
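
As a quick, optional sanity check (output not shown), the dimensions of the two data frames can be inspected after loading:

# Optional check: dimensions of the loaded data sets (output not shown)
dim(dTrain)
dim(dTest)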

Partition Data

Next, the training data set needs to be partitioned into the in-training data set and the cross-validation data set. This can be accomplished with the createDataPartition function in the caret package. Here, I’ve elected to use a 70% | 30% split (in-training | cross-validation). Additionally, for reproducibility we set the random seed to an arbitrary value (1234).

# Set seed for reproducibility
set.seed(1234)

# Get rows for in-training set
inTrain <- createDataPartition(y    = dTrain$classe, # Our outcome variable
                               p    = 0.7,           # Fraction to go into in-training set
                               list = FALSE)         # Want a vector, not list

# Create in-training (myTrain) and cross-validating (myCross) data.frames
myTrain <- dTrain[inTrain,]
myCross <- dTrain[-inTrain,]
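
Because createDataPartition samples within each class, the split should preserve the class proportions; an optional check (output not shown):

# Optional check: class proportions in the two partitions (output not shown)
prop.table(table(myTrain$classe))
prop.table(table(myCross$classe))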

Filter out columns

Looking at the data, we see that the first five columns:

names(myTrain)[1:5]
## [1] "X"                    "user_name"            "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"

are a row index, user names, and timestamps, which should have nothing to do with the outcome (we want predictive features that generalize to other people at other times).

Second, there are many columns that have a large number of NA values. The following code shows the fraction of entries in each column of myTrain that are NA:

# Get fraction of entries in each column that are NA
NAfrac <- data.frame(column = names(myTrain[, 6:ncol(myTrain)]),
                     frac   = as.numeric(apply(X = myTrain[, 6:ncol(myTrain)],
                                               MARGIN = 2,
                                               FUN = function(x) {
                                                   length(which(is.na(x)))
                                               }
                                               ) / nrow(myTrain)
                                         )
                     )

# Summarize result
summary(NAfrac$frac)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.9793  0.6330  0.9793  1.0000

Any column that is essentially all NA carries no information for prediction and should be dropped. Here, I drop any column in which 90% or more of the entries are NA.
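
Before applying the filter, one can count how many columns exceed this threshold (output not shown):

# Optional check: number of columns with >= 90% NA entries (output not shown)
sum(NAfrac$frac >= 0.9)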

Lastly, some columns are constant (i.e. have only 1 unique value), and therefore have no predictive value:

# Get number of unique values in each column
unqCount <- data.frame(column = names(myTrain), 
                       count  = as.numeric(apply(X = myTrain,
                                                 MARGIN = 2,
                                                 FUN = function(x) {
                                                     length(unique(x))
                                                 }
                                                 )
                                           )
                       )

# Summary
summary(unqCount$count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   131.5   235.0   758.8   289.5 13740.0
# Columns with 10 or fewer unique values
apply(myTrain[, which(unqCount$count <= 10)], 2, unique)
## $user_name
## [1] "carlitos" "pedro"    "adelmo"   "charles"  "eurico"   "jeremy"  
## 
## $new_window
## [1] "no"  "yes"
## 
## $kurtosis_yaw_belt
## [1] NA
## 
## $skewness_yaw_belt
## [1] NA
## 
## $amplitude_yaw_belt
## [1] NA   " 0"
## 
## $kurtosis_yaw_dumbbell
## [1] NA
## 
## $skewness_yaw_dumbbell
## [1] NA
## 
## $amplitude_yaw_dumbbell
## [1] NA   " 0"
## 
## $kurtosis_yaw_forearm
## [1] NA
## 
## $skewness_yaw_forearm
## [1] NA
## 
## $amplitude_yaw_forearm
## [1] NA   " 0"
## 
## $classe
## [1] "A" "B" "C" "D" "E"

Of the columns with <= 10 unique values, the only ones that we would want to remove (and that wouldn’t already be filtered out by the previous steps) are the amplitude_yaw_* columns. However, these columns are effectively removed by the NA-column step anyway, since they are almost entirely NA.
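
As an aside, caret’s nearZeroVar function automates this kind of near-constant-column check; a minimal sketch of that alternative (not used in the filtering below, output not shown):

# Alternative check using caret: flag zero- and near-zero-variance columns
# (nzv is just an illustrative name for the metrics data.frame)
nzv <- nearZeroVar(myTrain, saveMetrics = TRUE)

# Names of the columns flagged as (near-)zero variance
rownames(nzv)[nzv$nzv]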

So the two filtering steps to apply are:

  1. Drop the first five columns
  2. Remove columns that are >= 90% NA
# Filter function
filterFun <- function(x) {
    
    # Drop first five columns (row index, user name, timestamps)
    x <- x[, -(1:5)]
    
    # Drop columns that are mostly NA. NAfrac was computed on columns 6 onward
    # of myTrain, so its indices line up with x once the first five columns
    # are removed.
    x <- x[, -which(NAfrac$frac >= 0.9)]
    
    # Return the filtered data.frame explicitly
    x
}

# Apply to myTrain, myCross, and dTest data sets
myTrain <- filterFun(myTrain)
myCross <- filterFun(myCross)
dTest   <- filterFun(dTest)
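
An optional check (output not shown) that the filtered in-training and cross-validation sets kept the same columns:

# Optional check: filtered partitions should share identical column names
identical(names(myTrain), names(myCross))
dim(myTrain)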

Predicting with Random Forests

Training model

We’ll be using random forests to make predictions. The key settings are the number of cross-validation folds that caret uses to tune the model and the number of trees in each forest. Here I assume that 4 folds with 200 trees will be sufficient.

# Build cross-validation folds for myTrain data set
cvf <- trainControl(method = "cv",
                    number = 4)

# Fit random forest model
rfm <- train(classe ~.,
             data      = myTrain,
             method    = "rf", 
             trControl = cvf,
             ntree     = 200)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Print random forest results
rfm
## Random Forest 
## 
## 13737 samples
##    54 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## Summary of sample sizes: 10302, 10302, 10304, 10303 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9925750  0.9906067
##   28    0.9959234  0.9948436
##   54    0.9945405  0.9930940
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 28.
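
As an optional diagnostic (output not shown), caret’s varImp function reports which predictors the fitted forest relies on most:

# Optional diagnostic: relative variable importance of the fitted model
varImp(rfm)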

Testing against cross-validation data set

Predict the values for the classe column in the myCross data set using the random forest model rfm:

# Make prediction
cvPredict <- predict(rfm, myCross)

# Get confusion matrix result
cvCF <- confusionMatrix(myCross$classe, cvPredict)

# Print result
cvCF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    3 1135    1    0    0
##          C    0    4 1022    0    0
##          D    0    0    1  963    0
##          E    0    0    0    2 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9981          
##                  95% CI : (0.9967, 0.9991)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9976          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9965   0.9980   0.9979   1.0000
## Specificity            1.0000   0.9992   0.9992   0.9998   0.9996
## Pos Pred Value         1.0000   0.9965   0.9961   0.9990   0.9982
## Neg Pred Value         0.9993   0.9992   0.9996   0.9996   1.0000
## Prevalence             0.2850   0.1935   0.1740   0.1640   0.1835
## Detection Rate         0.2845   0.1929   0.1737   0.1636   0.1835
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9991   0.9978   0.9986   0.9989   0.9998

So, evaluated against the cross-validation set, the accuracy of the model is 99.81%, which corresponds to an estimated out-of-sample error rate of about 0.19%.
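
For reference, these values can be pulled directly from the confusion matrix object (the cvAccuracy and cvError names below are just for illustration):

# Extract accuracy and estimated out-of-sample error from the confusion matrix
cvAccuracy <- as.numeric(cvCF$overall["Accuracy"])
cvError    <- 1 - cvAccuracy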

Final Test

Finally, we test once against the test data set dTest:

# Make predictions against test data set
result <- predict(rfm, dTest)

# Print result
result
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

References

1. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.