Executive Summary

Data collected from wearable devices like fitbit can be used to predict if the owner of the device is using proper form during weight-lifting exercises. The ability to determine when someone is doing an exercise incorrectly and subsequently notify them can help prevent serious injury, bettering the health of individuals and avoiding costly medical services. This paper goes over the model selection and evaluation process for this project as well as the final results obtained from the model.

System

This analysis was run with the following system specifications:

#Load required packages
library(caret)
## Warning: package 'caret' was built under R version 3.3.1
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
## Warning: package 'rpart' was built under R version 3.3.1
library(rattle)
## Warning: package 'rattle' was built under R version 3.3.1
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
R.Version()
## $platform
## [1] "x86_64-w64-mingw32"
## 
## $arch
## [1] "x86_64"
## 
## $os
## [1] "mingw32"
## 
## $system
## [1] "x86_64, mingw32"
## 
## $status
## [1] ""
## 
## $major
## [1] "3"
## 
## $minor
## [1] "3.0"
## 
## $year
## [1] "2016"
## 
## $month
## [1] "05"
## 
## $day
## [1] "03"
## 
## $`svn rev`
## [1] "70573"
## 
## $language
## [1] "R"
## 
## $version.string
## [1] "R version 3.3.0 (2016-05-03)"
## 
## $nickname
## [1] "Supposedly Educational"

Exploratory Analysis

Our data contains 160 variables, incuding our target variable. Before we begin to build our model, it’s a good idea to learn a little bit more about our data. We begin by printing the variable names along with their class and sample values.

#Load training & test data sets
df <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("NA","#DIV/0!",""))
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", na.strings=c("NA","#DIV/0!",""))

#Inspect data
str(df)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]
#Print possible values for target var
unique(df$classe)
## [1] A B C D E
## Levels: A B C D E

Our target variable is categorical, so a decision tree type model will be a good start.

Methodology

Data Pre-Processing

We begin by splitting our data into a training and test set (called validation) so we can cross-validate our model. We also remove variables that contain only NAs since they do not have any predictive power.

#Remove columns in the test table that are entirely NAs. Also remove the ID column which has no predictive power.
df <- df[ ,colSums(is.na(df)) != nrow(df)]
df <- df[, -which(names(df) == "X")]
test <- test[ ,colSums(is.na(test)) != nrow(test)]

#Remove the same columns for the test set
test <- test[ ,which(names(test) %in% names(df))]

#Remove columns in df that are also not in test set
df <- df[ ,which(names(df) %in% c(names(test), "classe"))]

#Data partition
index <- createDataPartition(df$classe, times=1, p=0.5)[[1]]
training <- df[index, ]
validation <- df[-index, ]

Basic Decision Tree Model

Next, we create a decision tree model with the default specifications. Because our target variable ‘classe’ is a factor, we use the method ‘class’ for building model and a confusion matrix to evaluate it.

#Model
tree <- rpart(classe ~ ., data=training, method="class")

#Visualize
fancyRpartPlot(tree, main="Basic Decision Tree Model")

#Predict
pred1 <- predict(tree, validation, type = "class")

#Evaluate the model
confusionMatrix(validation$classe, pred1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2692   69   29    0    0
##          B  260 1404  221   13    0
##          C    8  106 1569   28    0
##          D    5   76  233 1209   85
##          E    0    0    3  221 1579
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8617          
##                  95% CI : (0.8547, 0.8684)
##     No Information Rate : 0.3022          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8247          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9079   0.8483   0.7635   0.8219   0.9489
## Specificity            0.9857   0.9394   0.9817   0.9522   0.9725
## Pos Pred Value         0.9649   0.7397   0.9170   0.7519   0.8758
## Neg Pred Value         0.9611   0.9683   0.9400   0.9681   0.9894
## Prevalence             0.3022   0.1687   0.2095   0.1499   0.1696
## Detection Rate         0.2744   0.1431   0.1599   0.1232   0.1610
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9468   0.8939   0.8726   0.8870   0.9607

Our model proved to be accurate on this validation set, however it might perform differently on other test sets. Cross-validation can be used to get a better estimate of our model’s performance. It entails training our model mutliple times, using the same specifications, each time using a different section of our data as our test or validation data set. It is typical to use 5-fold validation, where we’ll create 5 different models and average the model performance measures, like accuracy, to give us a better idea of how good our model actually is. The goal is to then tune our model to maximize the average performance measures. In some cases, a blended model can be created but it is out of the scope of this exercise. Instead, our goal is to use cross-validation to simply validate our model’s performance. Although some cross-validation functions exist in R, we will write our own function to explicitly show how cross-validation works.

crossValidate <- function(x) {
#Set random seed for reproducibility
set.seed(123)

#Initialize the accs vector
accs <- rep(0,6)

#Shuffle the dataframe in case it was sorted in any particular order
n <- nrow(x)
shuffled <- x[sample(n),]

for (i in 1:6) {
  # These indices indicate the interval of the test set
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  
  # Exclude them from the train set
  train <- shuffled[-indices,]
  
  # Include them in the test set
  validation <- shuffled[indices,]
  
  # A model is learned using each training set
  tree <<- rpart(classe ~ ., train, method = "class")
  
  # Make a prediction on the test set using tree
 pred <- predict(tree, validation, type="class")
  
  # Assign the confusion matrix to conf
conf <- table(validation$classe, pred)
  
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf))/sum(conf)
}

# Print out the mean of accs
meanAccuracy <<- mean(accs) #save as global var
print(meanAccuracy)
}

crossValidate(df)
## [1] 0.8707441

Our current model has a mean accuracy of 0.8707441 and an out of sample error of 0.1292559. Let’s see if we can improve our model by using the random forest algorithm, which combines many learning models together to improve classification accuracy.

Random Forest Model

#Start timer

ptm <- proc.time()
#set seed for reproducibility
set.seed(12345)

#Subset for faster training
# training_subset <- df[sample(nrow(training), 2000),]

#Train model
randomForestModel <- train(classe ~ ., data=training, method = "rf")
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.1
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
print(randomForestModel, digits=3)
## Random Forest 
## 
## 9812 samples
##   58 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 9812, 9812, 9812, 9812, 9812, 9812, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    2    0.982     0.977
##   41    0.996     0.995
##   80    0.995     0.994
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 41.
#Make predictions using new model and calculate accuracy
predictions <- predict(randomForestModel, validation)
cf <- confusionMatrix(validation$classe, predictions)
print(cf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2790    0    0    0    0
##          B    1 1895    2    0    0
##          C    0    2 1708    1    0
##          D    0    0    3 1605    0
##          E    0    0    0    1 1802
## 
## Overall Statistics
##                                           
##                Accuracy : 0.999           
##                  95% CI : (0.9981, 0.9995)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9989   0.9971   0.9988   1.0000
## Specificity            1.0000   0.9996   0.9996   0.9996   0.9999
## Pos Pred Value         1.0000   0.9984   0.9982   0.9981   0.9994
## Neg Pred Value         0.9999   0.9997   0.9994   0.9998   1.0000
## Prevalence             0.2845   0.1934   0.1746   0.1638   0.1837
## Detection Rate         0.2844   0.1932   0.1741   0.1636   0.1837
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9998   0.9993   0.9984   0.9992   0.9999
#save accuracy
accuracy <- (cf$overall['Accuracy'])

#Plot results
# MDSplot(randomForestModel, training$classe)

#Print time elapsed
proc.time() - ptm
##    user  system elapsed 
## 5263.76   28.40 5521.29

Our random forest model, although time consuming to train, obtained an accuracy of 0.9989806 and an out of sample error of 0.0010194. This is a very big improvement over our basic decision tree mode. Let’s use our new and improved model to make our predictions on the test set:

Make predictions

#Apply to test set
predTest <- predict(randomForestModel, test)
results <- data.frame(test$user_name, prediction = predTest)
print(results)
##    test.user_name prediction
## 1           pedro          B
## 2          jeremy          A
## 3          jeremy          B
## 4          adelmo          A
## 5          eurico          A
## 6          jeremy          E
## 7          jeremy          D
## 8          jeremy          B
## 9        carlitos          A
## 10        charles          A
## 11       carlitos          B
## 12         jeremy          C
## 13         eurico          B
## 14         jeremy          A
## 15         jeremy          E
## 16         eurico          E
## 17          pedro          A
## 18       carlitos          B
## 19          pedro          B
## 20         eurico          B

These predictions obtained a perfect 100% score when submitted to Coursera, which was significantly better than the 90% score obtained from submitting the predictions from our decision tree model.