Executive Summary

The prime objective of this analysis was to build a model that predicts the manner in which a weight lifting exercise was performed.

Six male participants were asked to perform barbell lifts in five different ways, one correct and four incorrect. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Training data retrieved from accelerometers on the belt, forearm, arm, and dumbbell of the participants were fed into a Random Forest model built with R’s caret package.

Five-fold cross validation with five repeats was used to train the Random Forest model on the training dataset. The estimated accuracy of the model on unseen data was 0.99. The model was then applied to a separate test set of twenty observations and correctly predicted the class of all twenty.

Exploratory Data Analysis

The training data contains 19622 observations with 159 predictor variables and one outcome variable.

# Load the training data:
url_csvtr <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url_csvtr, destfile= "./pml-training.csv")
dattr <- read.csv("./pml-training.csv")
names(dattr)
##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "new_window"              
##   [7] "num_window"               "roll_belt"               
##   [9] "pitch_belt"               "yaw_belt"                
##  [11] "total_accel_belt"         "kurtosis_roll_belt"      
##  [13] "kurtosis_picth_belt"      "kurtosis_yaw_belt"       
##  [15] "skewness_roll_belt"       "skewness_roll_belt.1"    
##  [17] "skewness_yaw_belt"        "max_roll_belt"           
##  [19] "max_picth_belt"           "max_yaw_belt"            
##  [21] "min_roll_belt"            "min_pitch_belt"          
##  [23] "min_yaw_belt"             "amplitude_roll_belt"     
##  [25] "amplitude_pitch_belt"     "amplitude_yaw_belt"      
##  [27] "var_total_accel_belt"     "avg_roll_belt"           
##  [29] "stddev_roll_belt"         "var_roll_belt"           
##  [31] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [33] "var_pitch_belt"           "avg_yaw_belt"            
##  [35] "stddev_yaw_belt"          "var_yaw_belt"            
##  [37] "gyros_belt_x"             "gyros_belt_y"            
##  [39] "gyros_belt_z"             "accel_belt_x"            
##  [41] "accel_belt_y"             "accel_belt_z"            
##  [43] "magnet_belt_x"            "magnet_belt_y"           
##  [45] "magnet_belt_z"            "roll_arm"                
##  [47] "pitch_arm"                "yaw_arm"                 
##  [49] "total_accel_arm"          "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "gyros_arm_x"             
##  [61] "gyros_arm_y"              "gyros_arm_z"             
##  [63] "accel_arm_x"              "accel_arm_y"             
##  [65] "accel_arm_z"              "magnet_arm_x"            
##  [67] "magnet_arm_y"             "magnet_arm_z"            
##  [69] "kurtosis_roll_arm"        "kurtosis_picth_arm"      
##  [71] "kurtosis_yaw_arm"         "skewness_roll_arm"       
##  [73] "skewness_pitch_arm"       "skewness_yaw_arm"        
##  [75] "max_roll_arm"             "max_picth_arm"           
##  [77] "max_yaw_arm"              "min_roll_arm"            
##  [79] "min_pitch_arm"            "min_yaw_arm"             
##  [81] "amplitude_roll_arm"       "amplitude_pitch_arm"     
##  [83] "amplitude_yaw_arm"        "roll_dumbbell"           
##  [85] "pitch_dumbbell"           "yaw_dumbbell"            
##  [87] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [89] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [91] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [93] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [95] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [97] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [99] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell"   "total_accel_dumbbell"    
## [103] "var_accel_dumbbell"       "avg_roll_dumbbell"       
## [105] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
## [107] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
## [109] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
## [111] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
## [113] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
## [115] "gyros_dumbbell_z"         "accel_dumbbell_x"        
## [117] "accel_dumbbell_y"         "accel_dumbbell_z"        
## [119] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
## [121] "magnet_dumbbell_z"        "roll_forearm"            
## [123] "pitch_forearm"            "yaw_forearm"             
## [125] "kurtosis_roll_forearm"    "kurtosis_picth_forearm"  
## [127] "kurtosis_yaw_forearm"     "skewness_roll_forearm"   
## [129] "skewness_pitch_forearm"   "skewness_yaw_forearm"    
## [131] "max_roll_forearm"         "max_picth_forearm"       
## [133] "max_yaw_forearm"          "min_roll_forearm"        
## [135] "min_pitch_forearm"        "min_yaw_forearm"         
## [137] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
## [139] "amplitude_yaw_forearm"    "total_accel_forearm"     
## [141] "var_accel_forearm"        "avg_roll_forearm"        
## [143] "stddev_roll_forearm"      "var_roll_forearm"        
## [145] "avg_pitch_forearm"        "stddev_pitch_forearm"    
## [147] "var_pitch_forearm"        "avg_yaw_forearm"         
## [149] "stddev_yaw_forearm"       "var_yaw_forearm"         
## [151] "gyros_forearm_x"          "gyros_forearm_y"         
## [153] "gyros_forearm_z"          "accel_forearm_x"         
## [155] "accel_forearm_y"          "accel_forearm_z"         
## [157] "magnet_forearm_x"         "magnet_forearm_y"        
## [159] "magnet_forearm_z"         "classe"

The outcome in the dataset is the “classe” variable, with five levels reflecting how the lift was performed: “A” corresponds to correct execution of the exercise, while “B” through “E” correspond to four common mistakes.

table(dattr$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

The primary goal of this exploratory analysis was to reduce, where possible, the number of predictor variables used in the model to predict the weight lifting class.

The vast majority of the given predictors contained NAs or blanks in most observations; these variables were not fed into the model (a quick missingness check is sketched below, after the subject table). Some of the other predictors, such as subject names, dates, and window information, were evaluated to discern any effect on the outcome variable. Shown below is a roughly even distribution of observations among the subjects.

table(dattr$user_name)
## 
##   adelmo carlitos  charles   eurico   jeremy    pedro 
##     3892     3112     3536     3070     3402     2610
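
As a quick check on the missingness claim above, the sketch below (assuming the dattr data frame loaded earlier) counts the columns that are mostly NA or blank:

# Fraction of NA or blank values in each column
na_frac <- sapply(dattr, function(col) mean(is.na(col) | col == ""))
sum(na_frac > 0.9) # number of columns that are mostly missing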

Similar explorations were conducted for the timestamp, window, and date variables to see whether they affected the weight lifting class variable.

table(dattr$user_name,dattr$classe)
table(dattr$raw_timestamp_part_1,dattr$classe)
table(dattr$raw_timestamp_part_2,dattr$classe)
table(dattr$num_window,dattr$classe)
# Use named columns so the plot call stays readable; classe as a factor
dattime <- data.frame(date = as.Date(dattr$cvtd_timestamp, "%d/%m/%Y"),
                      classe = factor(dattr$classe))
plot(dattime$date, dattime$classe)

Data Cleaning

Based on the data explorations performed above, none of the subject names, timestamps, windows, or dates appeared to correlate with the distribution of the “classe” variable, so these variables were excluded from the prediction model. The training dataset was then cleaned and modified for model building.

# Remove non-impact variables from the training dataset and clean up
dattr_rev <- subset(dattr, select = -c(1:7,12:36,50:59,69:83,87:101,103:112,125:139,141:150))
dattr_rev[dattr_rev == ""] <- NA # convert blanks to NA
dattr_rev <- subset(dattr_rev, complete.cases(dattr_rev)) # drop any rows with remaining NAs
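
The hard-coded column indices above are fragile if the column order ever changes. As a cross-check, the same subset can be reproduced without indexing the summary columns by hand; the sketch below (assuming dattr as loaded earlier) keeps only fully populated columns and then drops the seven identifier columns:

# Keep only columns with no NAs or blanks, then drop id/name/timestamp/window columns
keep <- colSums(is.na(dattr) | dattr == "") == 0
dattr_chk <- dattr[, keep][, -(1:7)]
dim(dattr_chk) # should match dim(dattr_rev): 52 predictors plus "classe"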

The abridged training dataset used for model building now contains 52 predictor variables, down from the original 159.

names(dattr_rev[-53])
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"

Model Training

The caret package in R was used to streamline the process of creating the predictive model. A Random Forest model was built using caret’s “ranger” method, which wraps the ranger package, a fast implementation of Random Forests that is particularly suited to high-dimensional data. Random Forest models are typically less interpretable than regression models but often more accurate; they are easy to tune, require little preprocessing, and capture threshold effects and variable interactions well.

The ideal model is one where both the variance and the squared bias are low. Cross-validation is a way to estimate the test error using only the training data. Repeated five-fold cross validation was set up to split the data into five folds, repeated five times (5 folds x 5 repeats = 25 resamples). The final accuracy estimate is the mean across all 25 resamples.

# Set up the parallel and doParallel packages for multi-core training
library(parallel)
library(doParallel)
# Use all but one core; convention is to leave one core for the OS
no_cores <- detectCores() - 1
# Initiate the cluster and register it as the parallel backend
cluster <- makeCluster(no_cores)
registerDoParallel(cluster)
# Use the x/y interface for training, because the formula interface
# performs poorly on data of this size
x <- dattr_rev[, -53]
y <- factor(dattr_rev[, 53]) # ensure the outcome is a factor, as classProbs requires


# Create train/test indexes
library(caret)
set.seed(42)

# Create Folds
# Leverage caret to create 25 total folds, ensuring that the class distribution
# in each fold matches the overall training data set. This is known as stratified
# cross validation and generally produces better results.
mymultiFolds <- createMultiFolds(dattr_rev$classe, k = 5, times = 5)

# Compare the class distribution in one of the 25 folds
i3 <- mymultiFolds$Fold3.Rep3
table(dattr_rev$classe[i3]) / length(i3)
## 
##         A         B         C         D         E 
## 0.2844037 0.1934888 0.1743756 0.1639271 0.1838048
# Summarize the target variable in dattr_rev
table(dattr_rev$classe) / nrow(dattr_rev)
## 
##         A         B         C         D         E 
## 0.2843747 0.1935073 0.1743961 0.1638977 0.1838243
# Use five-fold cross validation repeated five times for the model;
# the stratified folds passed via index take precedence over number/repeats
myControl <- trainControl(
        method = "repeatedcv", number = 5, repeats = 5,
        index = mymultiFolds,
        classProbs = TRUE,
        verboseIter = FALSE,
        savePredictions = TRUE,
        allowParallel = TRUE
)

## Random Forest on HAR data
set.seed(42)

# Train Random Forest model
model_rf <- train(
  x,y,
  metric = "Accuracy",
  method = "ranger",
  importance='impurity', # extract variable importance in ranger
  trControl = myControl
)


## Shut down the cluster and de-register the parallel backend
stopCluster(cluster)
registerDoSEQ()

Prediction Model Performance

Cross validation gives a nearly unbiased estimate of the Random Forest model’s performance on unseen data. The cross-validated accuracy was above 0.99, so the expected out of sample error is below 0.01.

model_rf
## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 15698, 15698, 15697, 15698, 15697, 15698, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9912954  0.9889883
##   27    0.9930078  0.9911550
##   52    0.9865560  0.9829931
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
# Plot Model
plot(model_rf)
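
For a closer look at the resampling results, the per-resample accuracies and the averaged cross-validated confusion matrix can be pulled directly from the train object (a minimal sketch, assuming the model_rf object fitted above):

# Accuracy across the 25 resamples for the selected mtry
summary(model_rf$resample$Accuracy)
# Cross-validated confusion matrix, averaged over resamples
# (available because savePredictions = TRUE in trainControl)
confusionMatrix(model_rf)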

The plot below shows the predictors with the highest impact on the “classe” variable as determined by this Random Forest model.

plot(varImp(model_rf))

Prediction on Testing Data and Conclusions

The generated Random Forest model was applied to the 20 test cases available in the given test data.

# Load the testing data:
url_csvts <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url_csvts, destfile= "./pml-testing.csv")
datts <- read.csv("./pml-testing.csv")

# Prediction on testing data
predict(model_rf, datts)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

A model’s accuracy ultimately depends on how the data are split into training and test sets. The model correctly predicted all 20 class specifications in the given testing data, which suggests that the cross validation setup struck a good balance between underfitting (high bias) and overfitting (high variance) on the training dataset.