Synopsis
This analysis uses data from the accelerometers on the belt, forearm, arm, and dumbbell to predict the manner in which participants performed a certain exercise.
Subjects were asked to perform barbell lifts correctly and incorrectly in 5 different ways:
- Class A - Exactly according to the specification (correct)
 - Class B - Throwing the elbows to the front (mistake)
 - Class C - Lifting the dumbbell only halfway (mistake)
 - Class D - Lowering the dumbbell only halfway (mistake)
 - Class E - Throwing the hips to the front (mistake)
 
Accelerometers were located on four locations:
- belt
 - forearm
 - arm
 - dumbbell
 
Two prediction techniques, Decision Trees and XGBoost, are used for prediction.
Further information about the dataset is available here.
Setup
Data Processing
First, we need to load the data into R. We’ll call the raw data trainData and testData, and later split them into the cleaned sets we’ll use for analysis - trainSet, validationSet, and testSet.
## Load data
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
dir <- "/Users/kevinroche22/RData/WeightLiftingMLPrediction"
setwd(dir)
download.file(trainUrl, "pml-training.csv")
download.file(testUrl, "pml-testing.csv")
trainData <- read.csv("pml-training.csv", header = TRUE, sep = ",")
testData <- read.csv("pml-testing.csv", header = TRUE, sep = ",")Let’s take a look at what we’re working with.
## [1] 19622   160
## [1]  20 160
So, the training data contains nearly 20k observations of 160 different variables, while the test data contains 20 observations of the same 160 variables. It looks like some of the variables have missing data, and there are blanks and division by 0’s that aren’t recorded as NA’s. Let’s convert all of the missing observations to NA and then see how many missing values each variable has.
## Convert missing values to NA
trainData[trainData==""] <- NA
trainData[trainData=="#DIV/0!"] <- NA
testData[testData==""] <- NA
testData[testData=="#DIV/0!"] <- NA
## View proportions
trainData %>% summarise_each(funs(100*mean(is.na(.))))##   X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 0         0                    0                    0              0
##   new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1          0          0         0          0        0                0
##   kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 1           97.98186            98.09398               100           97.97676
##   skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1             98.09398               100      97.93089       97.93089
##   max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt amplitude_roll_belt
## 1     97.98186      97.93089       97.93089     97.98186            97.93089
##   amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 1             97.93089           97.98186             97.93089      97.93089
##   stddev_roll_belt var_roll_belt avg_pitch_belt stddev_pitch_belt
## 1         97.93089      97.93089       97.93089          97.93089
##   var_pitch_belt avg_yaw_belt stddev_yaw_belt var_yaw_belt gyros_belt_x
## 1       97.93089     97.93089        97.93089     97.93089            0
##   gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## 1            0            0            0            0            0
##   magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm
## 1             0             0             0        0         0       0
##   total_accel_arm var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## 1               0      97.93089     97.93089        97.93089     97.93089
##   avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm stddev_yaw_arm
## 1      97.93089         97.93089      97.93089    97.93089       97.93089
##   var_yaw_arm gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
## 1    97.93089           0           0           0           0           0
##   accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 1           0            0            0            0          98.32841
##   kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 1            98.3386         97.98695          98.32331            98.3386
##   skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm min_roll_arm
## 1         97.98695     97.93089      97.93089    97.93089     97.93089
##   min_pitch_arm min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## 1      97.93089    97.93089           97.93089            97.93089
##   amplitude_yaw_arm roll_dumbbell pitch_dumbbell yaw_dumbbell
## 1          97.93089             0              0            0
##   kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 1               97.95638                97.94109                   100
##   skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 1               97.95128                97.93599                   100
##   max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 1          97.93089           97.93089         97.95638          97.93089
##   min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 1           97.93089         97.95638                97.93089
##   amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 1                 97.93089               97.95638                    0
##   var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1           97.93089          97.93089             97.93089          97.93089
##   avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## 1           97.93089              97.93089           97.93089         97.93089
##   stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1            97.93089         97.93089                0                0
##   gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1                0                0                0                0
##   magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1                 0                 0                 0            0
##   pitch_forearm yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 1             0           0              98.35898               98.36408
##   kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 1                  100              98.35389               98.36408
##   skewness_yaw_forearm max_roll_forearm max_picth_forearm max_yaw_forearm
## 1                  100         97.93089          97.93089        98.35898
##   min_roll_forearm min_pitch_forearm min_yaw_forearm amplitude_roll_forearm
## 1         97.93089          97.93089        98.35898               97.93089
##   amplitude_pitch_forearm amplitude_yaw_forearm total_accel_forearm
## 1                97.93089              98.35898                   0
##   var_accel_forearm avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 1          97.93089         97.93089            97.93089         97.93089
##   avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1          97.93089             97.93089          97.93089        97.93089
##   stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1           97.93089        97.93089               0               0
##   gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1               0               0               0               0
##   magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1                0                0                0      0
Quite a few of the variables are nearly entirely missing (>97% NA). Let’s remove them from the dataset, along with the metadata variables - which won’t be used for prediction.
## Remove metadata variables
trainData <- trainData %>% 
        select(-c(1:7))
testData <- testData %>% 
        select(-c(1:7))
## Remove variables that are missing atleast half of their observations
naVars <- trainData %>% 
        map(function(x) mean(is.na(x)) > 0.50) # logical vector detailing whether or not each variable is missing more than 50% of its data
trainData <- trainData[, naVars == FALSE] # remove variables that are missing more than 50% of their observations
testSet <- testData[, naVars == FALSE]
## Check to make sure only variables with no NA's are left. This leaves us with 53 variables.
trainData %>% 
        summarise_each(funs(sum(is.na(.))))##   roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## 1         0          0        0                0            0            0
##   gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## 1            0            0            0            0             0
##   magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## 1             0             0        0         0       0               0
##   gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 1           0           0           0           0           0           0
##   magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## 1            0            0            0             0              0
##   yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1            0                    0                0                0
##   gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1                0                0                0                0
##   magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1                 0                 0                 0            0
##   pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x gyros_forearm_y
## 1             0           0                   0               0               0
##   gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1               0               0               0               0
##   magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1                0                0                0      0
Let’s also take a look at which of the remaining variables have zero or near-zero variance.
nearZeroVar() diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that are have both of the following characteristics: they have very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
## Check if any of the remaining variables have zero or near-zero variance
nearZeroVar(trainData, saveMetrics = TRUE)##                      freqRatio percentUnique zeroVar   nzv
## roll_belt             1.101904     6.7781062   FALSE FALSE
## pitch_belt            1.036082     9.3772296   FALSE FALSE
## yaw_belt              1.058480     9.9734991   FALSE FALSE
## total_accel_belt      1.063160     0.1477933   FALSE FALSE
## gyros_belt_x          1.058651     0.7134849   FALSE FALSE
## gyros_belt_y          1.144000     0.3516461   FALSE FALSE
## gyros_belt_z          1.066214     0.8612782   FALSE FALSE
## accel_belt_x          1.055412     0.8357966   FALSE FALSE
## accel_belt_y          1.113725     0.7287738   FALSE FALSE
## accel_belt_z          1.078767     1.5237998   FALSE FALSE
## magnet_belt_x         1.090141     1.6664968   FALSE FALSE
## magnet_belt_y         1.099688     1.5187035   FALSE FALSE
## magnet_belt_z         1.006369     2.3290184   FALSE FALSE
## roll_arm             52.338462    13.5256345   FALSE FALSE
## pitch_arm            87.256410    15.7323412   FALSE FALSE
## yaw_arm              33.029126    14.6570176   FALSE FALSE
## total_accel_arm       1.024526     0.3363572   FALSE FALSE
## gyros_arm_x           1.015504     3.2769341   FALSE FALSE
## gyros_arm_y           1.454369     1.9162165   FALSE FALSE
## gyros_arm_z           1.110687     1.2638875   FALSE FALSE
## accel_arm_x           1.017341     3.9598410   FALSE FALSE
## accel_arm_y           1.140187     2.7367241   FALSE FALSE
## accel_arm_z           1.128000     4.0362858   FALSE FALSE
## magnet_arm_x          1.000000     6.8239731   FALSE FALSE
## magnet_arm_y          1.056818     4.4439914   FALSE FALSE
## magnet_arm_z          1.036364     6.4468454   FALSE FALSE
## roll_dumbbell         1.022388    84.2065029   FALSE FALSE
## pitch_dumbbell        2.277372    81.7449801   FALSE FALSE
## yaw_dumbbell          1.132231    83.4828254   FALSE FALSE
## total_accel_dumbbell  1.072634     0.2191418   FALSE FALSE
## gyros_dumbbell_x      1.003268     1.2282132   FALSE FALSE
## gyros_dumbbell_y      1.264957     1.4167771   FALSE FALSE
## gyros_dumbbell_z      1.060100     1.0498420   FALSE FALSE
## accel_dumbbell_x      1.018018     2.1659362   FALSE FALSE
## accel_dumbbell_y      1.053061     2.3748853   FALSE FALSE
## accel_dumbbell_z      1.133333     2.0894914   FALSE FALSE
## magnet_dumbbell_x     1.098266     5.7486495   FALSE FALSE
## magnet_dumbbell_y     1.197740     4.3012945   FALSE FALSE
## magnet_dumbbell_z     1.020833     3.4451126   FALSE FALSE
## roll_forearm         11.589286    11.0895933   FALSE FALSE
## pitch_forearm        65.983051    14.8557741   FALSE FALSE
## yaw_forearm          15.322835    10.1467740   FALSE FALSE
## total_accel_forearm   1.128928     0.3567424   FALSE FALSE
## gyros_forearm_x       1.059273     1.5187035   FALSE FALSE
## gyros_forearm_y       1.036554     3.7763735   FALSE FALSE
## gyros_forearm_z       1.122917     1.5645704   FALSE FALSE
## accel_forearm_x       1.126437     4.0464784   FALSE FALSE
## accel_forearm_y       1.059406     5.1116094   FALSE FALSE
## accel_forearm_z       1.006250     2.9558659   FALSE FALSE
## magnet_forearm_x      1.012346     7.7667924   FALSE FALSE
## magnet_forearm_y      1.246914     9.5403119   FALSE FALSE
## magnet_forearm_z      1.000000     8.5771073   FALSE FALSE
## classe                1.469581     0.0254816   FALSE FALSE
None of the remaining variables have zero or near-zero variance, so no changes are made to the dataset.
Lastly, we’ll partition the training data into a training set (70% of obs) and a validation set (30% of obs).
Modeling
Two techniques are used to predict the class of the participants in the test set.
- Decision Tree
 - Extreme Gradient Boosting
 
Decision Tree
Decision trees create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees typically lack the predictive power of some other techniques, but they make feature importance clear and relations easy to see, so the results are worth investigating.
##    user  system elapsed 
##   2.624   0.031   2.691
The way in which the decision tree is formed is depicted above. Now we’ll use the fitted model to predict the class’s of the participants in the validation set.
## Predict on validation set
dtValidationPrediction <- predict(dtFit, newdata = validationSet, type = "class")
## Evaluate 
confusionMatrix(dtValidationPrediction, validationSet$classe)## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1467  153   14   39   26
##          B   71  738  149   94  147
##          C   39   99  766  167  131
##          D   75  114   71  622   93
##          E   22   35   26   42  685
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7269          
##                  95% CI : (0.7154, 0.7383)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6545          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8763   0.6479   0.7466   0.6452   0.6331
## Specificity            0.9449   0.9029   0.9103   0.9283   0.9740
## Pos Pred Value         0.8634   0.6155   0.6373   0.6379   0.8457
## Neg Pred Value         0.9505   0.9144   0.9445   0.9303   0.9218
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2493   0.1254   0.1302   0.1057   0.1164
## Detection Prevalence   0.2887   0.2037   0.2042   0.1657   0.1376
## Balanced Accuracy      0.9106   0.7754   0.8284   0.7867   0.8035
We’re able to predict the class of the validation set with 72.69% accuracy. Accuracy is defined as (number of correct predictions)/(total number of predictions).
72.69% isn’t a great accuracy score in this context, so we’ll turn to XGBoost and see if we can build a better model.
XGBoost
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. At the time of writing, it is widely considered the gold standard for building predictive models.
XGBoost requires some data transformations to be made in order to run the model - ie. matrix format, outcome variable removed from training set, etc. Let’s make those transformations now.
Lastly, because we’re using cross-validation, we’ll use the entire set of training data without partitioning out part of it as a validation set.
## Separate label/outcome variable (classe)
classe <- trainData[, "classe"]
lengthClasse <- length(levels(classe)) 
levels(classe) <- 1:lengthClasse
## XGBoost labels need to start at 0
## ABCDE becomes 01234 in matrix form
classeMatrix <- as.matrix(as.integer(classe)-1)
# Remove outcome variables 
trainData$classe <- NULL
testSet$problem_id <- NULL
# Convert data to matrix
matrixTrainSet <- as.matrix(trainData)
matrixTestSet <- as.matrix(testSet)Next we’ll set the parameters for the model - it’s cleaner to do it separately.
merror is chosen as the evaluation metric because it is considered the best evaluation metric for multiclass classification problems. It is calculated as (number of wrong cases)/(total cases).
## Set parameters
nRoundsCV <- 250 # number of iterations in cross-validation
param <- list(objective = "multi:softprob", # multiclass classification
              num_class = lengthClasse,     # number of classes (5 here)
              eval_metric = "merror",       # evaluation metric 
              nthread = 8,                  # threads to be used 
              max_depth = 16,               # max depth of tree 
              eta = 0.3,                    # step size shrinkage 
              gamma = 0,                    # minimum loss reduction 
              subsample = 1,                # data instances to grow tree 
              colsample_bytree = 1,         # subsample ratio of columns
              min_child_weight = 12         # min sum of instance weight
              )Now, we’ll run XGBoost with 5-fold cross validation on the training data in an effort to determine the index with the best mean merror. 5-fold cross validation was selected because it strikes a nice balance between optimal model selection and runtime.
Then, we’ll predict the classes of the participants in the training data using the cross-validation model and compare it to the actual values.
## 5-fold CV
system.time(cvModel <- xgb.cv(param=param, 
                               data=matrixTrainSet, 
                               label=classeMatrix, 
                               nfold=5, 
                               nrounds=nRoundsCV, 
                               print.every.n = 10, 
                               early.stop.round = 20,
                               prediction=TRUE, 
                               verbose=FALSE,
                               maximize = FALSE))##    user  system elapsed 
## 379.618  19.047 119.284
## Predict classe using cross-validation model
cvPrediction <- matrix(cvModel$pred, 
                       nrow = length(cvModel$pred)/lengthClasse, 
                       ncol = lengthClasse) %>% 
        max.col("last")
## Confusion matrix
confusionMatrix(factor(classeMatrix+1), factor(cvPrediction))## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1 5574    4    0    0    2
##          2    9 3779    8    0    1
##          3    0    7 3408    7    0
##          4    0    0   16 3195    5
##          5    0    0    4    6 3597
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9965          
##                  95% CI : (0.9956, 0.9973)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9956          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9984   0.9971   0.9919   0.9959   0.9978
## Specificity            0.9996   0.9989   0.9991   0.9987   0.9994
## Pos Pred Value         0.9989   0.9953   0.9959   0.9935   0.9972
## Neg Pred Value         0.9994   0.9993   0.9983   0.9992   0.9995
## Prevalence             0.2845   0.1932   0.1751   0.1635   0.1837
## Detection Rate         0.2841   0.1926   0.1737   0.1628   0.1833
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9990   0.9980   0.9955   0.9973   0.9986
The cross-validation model is able to predict the class with 99.65% accuracy. This is substantially better than the decision tree model. Let’s follow through and use the XGBoost model to predict the class of the test participants.
We’ll use the number of rounds that returned the lowest test merror mean in the cross validation model and use it to make the model that we’re applying to the test data optimal.
## Determine iteration with lowest test merror mean
minErrorIndex <- which.min(cvModel$evaluation_log[, test_merror_mean])
print(minErrorIndex) # shows which index had the lowest test merror mean## [1] 95
## Train the real model with full data using the optimal test merror mean
system.time(bestModel <- xgboost::xgboost(param=param, 
                                 data=matrixTrainSet, 
                                 label=classeMatrix, 
                                 nrounds=minErrorIndex, 
                                 verbose=FALSE,
                                 maximize=FALSE, 
                                 predict=TRUE))## [00:32:46] WARNING: amalgamation/../src/learner.cc:573: 
## Parameters: { "predict" } might not be used.
## 
##   This may not be accurate due to some parameters are only used in language bindings but
##   passed down to XGBoost core.  Or some parameters are not used but slip through this
##   verification. Please open an issue if you find above cases.
##    user  system elapsed 
##  85.056   4.320  29.217
## Predict Test Data
bestModelPrediction <- predict(bestModel, matrixTestSet)
## Convert prediction from 01234 format back to ABCDE
finalPrediction <- matrix(bestModelPrediction, 
                          nrow=lengthClasse,
                          ncol=length(bestModelPrediction)/lengthClasse) %>% 
        t() %>%
        max.col("last") 
finalPrediction <- toupper(letters[finalPrediction])
print(finalPrediction)##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A" "B" "B"
## [20] "B"
The optimal model is used to predict the class of the 20 test participants, listed above.
Finally, let’s take a look at the importance of the different features in the model.
xgb.ggplot.importance performs 1-D clustering of the importance values, with bar colors corresponding to different clusters that have somewhat similar importance values.
## Feature importance
importance_matrix <- xgb.importance(names(matrixTrainSet),model=bestModel)
xgb.ggplot.importance(importance_matrix[1:20,]) According to the importance matrix, roll_belt is the most important feature.