Weight Training the Right Way: Common Mistakes Predictor

Synopsis

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this report, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants and build a model to quantify how well weight lifting exercises are done. Please see this webpage for information on this research.

After data exploration and data cleaning, cross-validation technique is used to build models using four different techniques. The best model shows an accuracy of 99.5% and that model is used to predict the outcome for the supplied test data.

Data Exploration

The training data file and the test data file are loaded. We look at the structure of the data, head of the data, tail of the data and the column names. It is clear there are many columns that contain NA, #DIV/0! and blank as values. These are read as NA values to help the analysis.

There is a total of 19622 observations of 160 variables in the training data set. Test data set contains 20 observations of 160 variables.

# load libraries, set seed for reproducibility
library(caret)
library(gbm)
library(corrplot)
set.seed(1)

# load data
trainDF <- read.csv("C:/Users/mgrav/Desktop/pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testDF <- read.csv("C:/Users/mgrav/Desktop/pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))

# explore data
str(trainDF)
head(trainDF)
tail(trainDF)
colnames(trainDF)

Data Cleaning

There are 100 columns that contain more than 50% NA values. In fact, the number is the same whether are looking for 20% NA values or 80% NA values. These columns are removed.

# remove "NA" columns
naStats <- colSums(is.na(trainDF))
naCols <- (naStats/nrow(trainDF) > 0.5)
sum(naCols == TRUE)
sum((naStats/nrow(trainDF) > 0.2) == TRUE)
sum((naStats/nrow(trainDF) > 0.8) == TRUE)
trainDF <- trainDF[,!naCols]
testDF <- testDF[,!naCols]

There is no documentation from the original research about column names. Examination of the titles of the remaining column names show the first seven columns are not relevant to our modeling. Please see below.

# remove obviously unneeded columns
colnames(trainDF)

##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
## [13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [58] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

trainDF <- trainDF[,-c(1:7)]
testDF <- testDF[,-c(1:7)]

Let us take a look at the summary data for the remaining columns. Please note the distribution of classe variable which is our predictor variable (last one in the list).

summary(trainDF)

##    roll_belt        pitch_belt          yaw_belt       total_accel_belt
##  Min.   :-28.90   Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00   
##  1st Qu.:  1.10   1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00   
##  Median :113.00   Median :  5.2800   Median : -13.00   Median :17.00   
##  Mean   : 64.41   Mean   :  0.3053   Mean   : -11.21   Mean   :11.31   
##  3rd Qu.:123.00   3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00   
##  Max.   :162.00   Max.   : 60.3000   Max.   : 179.00   Max.   :29.00   
##   gyros_belt_x        gyros_belt_y       gyros_belt_z    
##  Min.   :-1.040000   Min.   :-0.64000   Min.   :-1.4600  
##  1st Qu.:-0.030000   1st Qu.: 0.00000   1st Qu.:-0.2000  
##  Median : 0.030000   Median : 0.02000   Median :-0.1000  
##  Mean   :-0.005592   Mean   : 0.03959   Mean   :-0.1305  
##  3rd Qu.: 0.110000   3rd Qu.: 0.11000   3rd Qu.:-0.0200  
##  Max.   : 2.220000   Max.   : 0.64000   Max.   : 1.6200  
##   accel_belt_x       accel_belt_y     accel_belt_z     magnet_belt_x  
##  Min.   :-120.000   Min.   :-69.00   Min.   :-275.00   Min.   :-52.0  
##  1st Qu.: -21.000   1st Qu.:  3.00   1st Qu.:-162.00   1st Qu.:  9.0  
##  Median : -15.000   Median : 35.00   Median :-152.00   Median : 35.0  
##  Mean   :  -5.595   Mean   : 30.15   Mean   : -72.59   Mean   : 55.6  
##  3rd Qu.:  -5.000   3rd Qu.: 61.00   3rd Qu.:  27.00   3rd Qu.: 59.0  
##  Max.   :  85.000   Max.   :164.00   Max.   : 105.00   Max.   :485.0  
##  magnet_belt_y   magnet_belt_z       roll_arm         pitch_arm      
##  Min.   :354.0   Min.   :-623.0   Min.   :-180.00   Min.   :-88.800  
##  1st Qu.:581.0   1st Qu.:-375.0   1st Qu.: -31.77   1st Qu.:-25.900  
##  Median :601.0   Median :-320.0   Median :   0.00   Median :  0.000  
##  Mean   :593.7   Mean   :-345.5   Mean   :  17.83   Mean   : -4.612  
##  3rd Qu.:610.0   3rd Qu.:-306.0   3rd Qu.:  77.30   3rd Qu.: 11.200  
##  Max.   :673.0   Max.   : 293.0   Max.   : 180.00   Max.   : 88.500  
##     yaw_arm          total_accel_arm  gyros_arm_x        gyros_arm_y     
##  Min.   :-180.0000   Min.   : 1.00   Min.   :-6.37000   Min.   :-3.4400  
##  1st Qu.: -43.1000   1st Qu.:17.00   1st Qu.:-1.33000   1st Qu.:-0.8000  
##  Median :   0.0000   Median :27.00   Median : 0.08000   Median :-0.2400  
##  Mean   :  -0.6188   Mean   :25.51   Mean   : 0.04277   Mean   :-0.2571  
##  3rd Qu.:  45.8750   3rd Qu.:33.00   3rd Qu.: 1.57000   3rd Qu.: 0.1400  
##  Max.   : 180.0000   Max.   :66.00   Max.   : 4.87000   Max.   : 2.8400  
##   gyros_arm_z       accel_arm_x       accel_arm_y      accel_arm_z     
##  Min.   :-2.3300   Min.   :-404.00   Min.   :-318.0   Min.   :-636.00  
##  1st Qu.:-0.0700   1st Qu.:-242.00   1st Qu.: -54.0   1st Qu.:-143.00  
##  Median : 0.2300   Median : -44.00   Median :  14.0   Median : -47.00  
##  Mean   : 0.2695   Mean   : -60.24   Mean   :  32.6   Mean   : -71.25  
##  3rd Qu.: 0.7200   3rd Qu.:  84.00   3rd Qu.: 139.0   3rd Qu.:  23.00  
##  Max.   : 3.0200   Max.   : 437.00   Max.   : 308.0   Max.   : 292.00  
##   magnet_arm_x     magnet_arm_y     magnet_arm_z    roll_dumbbell    
##  Min.   :-584.0   Min.   :-392.0   Min.   :-597.0   Min.   :-153.71  
##  1st Qu.:-300.0   1st Qu.:  -9.0   1st Qu.: 131.2   1st Qu.: -18.49  
##  Median : 289.0   Median : 202.0   Median : 444.0   Median :  48.17  
##  Mean   : 191.7   Mean   : 156.6   Mean   : 306.5   Mean   :  23.84  
##  3rd Qu.: 637.0   3rd Qu.: 323.0   3rd Qu.: 545.0   3rd Qu.:  67.61  
##  Max.   : 782.0   Max.   : 583.0   Max.   : 694.0   Max.   : 153.55  
##  pitch_dumbbell     yaw_dumbbell      total_accel_dumbbell
##  Min.   :-149.59   Min.   :-150.871   Min.   : 0.00       
##  1st Qu.: -40.89   1st Qu.: -77.644   1st Qu.: 4.00       
##  Median : -20.96   Median :  -3.324   Median :10.00       
##  Mean   : -10.78   Mean   :   1.674   Mean   :13.72       
##  3rd Qu.:  17.50   3rd Qu.:  79.643   3rd Qu.:19.00       
##  Max.   : 149.40   Max.   : 154.952   Max.   :58.00       
##  gyros_dumbbell_x    gyros_dumbbell_y   gyros_dumbbell_z 
##  Min.   :-204.0000   Min.   :-2.10000   Min.   : -2.380  
##  1st Qu.:  -0.0300   1st Qu.:-0.14000   1st Qu.: -0.310  
##  Median :   0.1300   Median : 0.03000   Median : -0.130  
##  Mean   :   0.1611   Mean   : 0.04606   Mean   : -0.129  
##  3rd Qu.:   0.3500   3rd Qu.: 0.21000   3rd Qu.:  0.030  
##  Max.   :   2.2200   Max.   :52.00000   Max.   :317.000  
##  accel_dumbbell_x  accel_dumbbell_y  accel_dumbbell_z  magnet_dumbbell_x
##  Min.   :-419.00   Min.   :-189.00   Min.   :-334.00   Min.   :-643.0   
##  1st Qu.: -50.00   1st Qu.:  -8.00   1st Qu.:-142.00   1st Qu.:-535.0   
##  Median :  -8.00   Median :  41.50   Median :  -1.00   Median :-479.0   
##  Mean   : -28.62   Mean   :  52.63   Mean   : -38.32   Mean   :-328.5   
##  3rd Qu.:  11.00   3rd Qu.: 111.00   3rd Qu.:  38.00   3rd Qu.:-304.0   
##  Max.   : 235.00   Max.   : 315.00   Max.   : 318.00   Max.   : 592.0   
##  magnet_dumbbell_y magnet_dumbbell_z  roll_forearm       pitch_forearm   
##  Min.   :-3600     Min.   :-262.00   Min.   :-180.0000   Min.   :-72.50  
##  1st Qu.:  231     1st Qu.: -45.00   1st Qu.:  -0.7375   1st Qu.:  0.00  
##  Median :  311     Median :  13.00   Median :  21.7000   Median :  9.24  
##  Mean   :  221     Mean   :  46.05   Mean   :  33.8265   Mean   : 10.71  
##  3rd Qu.:  390     3rd Qu.:  95.00   3rd Qu.: 140.0000   3rd Qu.: 28.40  
##  Max.   :  633     Max.   : 452.00   Max.   : 180.0000   Max.   : 89.80  
##   yaw_forearm      total_accel_forearm gyros_forearm_x  
##  Min.   :-180.00   Min.   :  0.00      Min.   :-22.000  
##  1st Qu.: -68.60   1st Qu.: 29.00      1st Qu.: -0.220  
##  Median :   0.00   Median : 36.00      Median :  0.050  
##  Mean   :  19.21   Mean   : 34.72      Mean   :  0.158  
##  3rd Qu.: 110.00   3rd Qu.: 41.00      3rd Qu.:  0.560  
##  Max.   : 180.00   Max.   :108.00      Max.   :  3.970  
##  gyros_forearm_y     gyros_forearm_z    accel_forearm_x   accel_forearm_y 
##  Min.   : -7.02000   Min.   : -8.0900   Min.   :-498.00   Min.   :-632.0  
##  1st Qu.: -1.46000   1st Qu.: -0.1800   1st Qu.:-178.00   1st Qu.:  57.0  
##  Median :  0.03000   Median :  0.0800   Median : -57.00   Median : 201.0  
##  Mean   :  0.07517   Mean   :  0.1512   Mean   : -61.65   Mean   : 163.7  
##  3rd Qu.:  1.62000   3rd Qu.:  0.4900   3rd Qu.:  76.00   3rd Qu.: 312.0  
##  Max.   :311.00000   Max.   :231.0000   Max.   : 477.00   Max.   : 923.0  
##  accel_forearm_z   magnet_forearm_x  magnet_forearm_y magnet_forearm_z
##  Min.   :-446.00   Min.   :-1280.0   Min.   :-896.0   Min.   :-973.0  
##  1st Qu.:-182.00   1st Qu.: -616.0   1st Qu.:   2.0   1st Qu.: 191.0  
##  Median : -39.00   Median : -378.0   Median : 591.0   Median : 511.0  
##  Mean   : -55.29   Mean   : -312.6   Mean   : 380.1   Mean   : 393.6  
##  3rd Qu.:  26.00   3rd Qu.:  -73.0   3rd Qu.: 737.0   3rd Qu.: 653.0  
##  Max.   : 291.00   Max.   :  672.0   Max.   :1480.0   Max.   :1090.0  
##  classe  
##  A:5580  
##  B:3797  
##  C:3422  
##  D:3216  
##  E:3607  
##

Let us do a correlation plot!

corDF <- trainDF
corDF$classe <- as.numeric(corDF$classe)
trainCor <- cor(corDF)
corrplot(trainCor, method = "color", tl.col = "black", tl.cex = 0.6)

Cross Validation Data Sets

Test data set has only 20 rows whereas the training data set has 19622 rows. Recommended training to test ratio is about 70 to 30. Hence, we are going to divide the training data into two data sets: cross validation training set and cross validation test set. We are using a ratio of 75-25 for this.

The following plot shows the distribution of classe variable in the cross validation training data set.

# cross validation requires subsetting trainDF
inTrain = createDataPartition(trainDF$classe, p = 3/4)[[1]]
cvTrainDF = trainDF[ inTrain,]
cvTestDF = trainDF[-inTrain,]

# plot showing distribution of predictor in Cross Validation training set
barplot(table(cvTrainDF$classe), main = "Distribution of classe for Cross Validation Training Set",
    ylab = "Frequency", xlab = 'classe', col = "salmon")

Building Models

The following algorithms are used for building models and evaluating them:

Random Forest
Gradient Boosting Machine (GBM)
Linear Discriminant Analysis (LDA)
Recursive Paritioning for Classification (RPART)

mdlRF <- train(classe ~ ., data = cvTrainDF, method = "rf")

mdlGBM <- train(classe ~ ., data = cvTrainDF, method = "gbm")

mdlLDA <- train(classe ~ ., data = cvTrainDF, method = "lda")

mdlRPART <- train(classe ~ ., data = cvTrainDF, method = "rpart")

# for faster model building!
ctrl <- trainControl(allowParallel = T, method = "cv", number = 4)
mdlRF2 <- train(classe ~ ., data = cvTrainDF, method = "rf", ntree = 100, trControl = ctrl)

Evaluating Models

We use the cross validation test data set to perform predictions for all the above models and calculate the confusion matrix.

# predict
predictionRF <- predict(mdlRF, cvTestDF)
predictionGBM <- predict(mdlGBM, cvTestDF)
predictionLDA <- predict(mdlLDA, cvTestDF)
predictionRPART <- predict(mdlRPART, cvTestDF)
predictionRF2 <- predict(mdlRF2, cvTestDF)

# accuracy
cmRF <- confusionMatrix(cvTestDF$classe, predictionRF)  
cmGBM <- confusionMatrix(cvTestDF$classe, predictionGBM)
cmLDA <- confusionMatrix(cvTestDF$classe, predictionLDA)
cmRPART <- confusionMatrix(cvTestDF$classe, predictionRPART)
cmRF2 <- confusionMatrix(cvTestDF$classe, predictionRF2)

Please see below for the accuracy of various models.

accuracyStr <-paste0("ACCURACY OF MODELS", "\n",
    "Random Forest: ", format(round(cmRF$overall['Accuracy'], 3), nsmall = 3), "\n",
    "Random Forest2: ", format(round(cmRF2$overall['Accuracy'], 3), nsmall = 3), "\n",
    "GBM: ", format(round(cmGBM$overall['Accuracy'], 3), nsmall = 3), "\n",
    "LDA: ", format(round(cmLDA$overall['Accuracy'], 3), nsmall = 3), "\n",
    "RPART: ", format(round(cmRPART$overall['Accuracy'], 3), nsmall = 3))
cat(accuracyStr)

## ACCURACY OF MODELS
## Random Forest: 0.995
## Random Forest2: 0.995
## GBM: 0.963
## LDA: 0.709
## RPART: 0.501

Model choice and Error rate

Random Forest is the model of our choice as the accuracy is 0.995. Since Error Rate is (1 - Accuracy), the error rate is 0.005. Please see the confusion matrix for the Random Forest Model!

cmRF

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1391    1    2    0    1
##          B    5  943    1    0    0
##          C    0    4  851    0    0
##          D    0    0    9  794    1
##          E    0    0    0    2  899
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9922, 0.9965)
##     No Information Rate : 0.2847          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9933          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9947   0.9861   0.9975   0.9978
## Specificity            0.9989   0.9985   0.9990   0.9976   0.9995
## Pos Pred Value         0.9971   0.9937   0.9953   0.9876   0.9978
## Neg Pred Value         0.9986   0.9987   0.9970   0.9995   0.9995
## Prevalence             0.2847   0.1933   0.1760   0.1623   0.1837
## Detection Rate         0.2836   0.1923   0.1735   0.1619   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9976   0.9966   0.9926   0.9975   0.9986

cmRF2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    1    2    0    0
##          B    5  943    1    0    0
##          C    0    4  849    2    0
##          D    0    0    7  796    1
##          E    0    0    0    3  898
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9922, 0.9965)
##     No Information Rate : 0.2849          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9933          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9947   0.9884   0.9938   0.9989
## Specificity            0.9991   0.9985   0.9985   0.9981   0.9993
## Pos Pred Value         0.9978   0.9937   0.9930   0.9900   0.9967
## Neg Pred Value         0.9986   0.9987   0.9975   0.9988   0.9998
## Prevalence             0.2849   0.1933   0.1752   0.1633   0.1833
## Detection Rate         0.2838   0.1923   0.1731   0.1623   0.1831
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9978   0.9966   0.9934   0.9959   0.9991

Prediction for the Testing Set

Using the Random Forest Model, the following shows the prediction for the original testing data set we started with!

predictionFinalRF <- predict(mdlRF, testDF)
predictionFinalRF

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Reference

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013

THE END