Synopsis
This analysis uses data from the accelerometers on the belt, forearm, arm, and dumbbell to predict the manner in which participants performed a certain exercise.
Subjects were asked to perform barbell lifts correctly and incorrectly in 5 different ways:
- Class A - Exactly according to the specification (correct)
- Class B - Throwing the elbows to the front (mistake)
- Class C - Lifting the dumbbell only halfway (mistake)
- Class D - Lowering the dumbbell only halfway (mistake)
- Class E - Throwing the hips to the front (mistake)
Accelerometers were located on four locations:
- belt
- forearm
- arm
- dumbbell
Two prediction techniques, Decision Trees and XGBoost, are used for prediction.
Further information about the dataset is available here.
Setup
Data Processing
First, we need to load the data into R. We’ll call the raw data trainData and testData, and later split them into the cleaned sets we’ll use for analysis - trainSet, validationSet, and testSet.
## Load data
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
dir <- "/Users/kevinroche22/RData/WeightLiftingMLPrediction"
setwd(dir)
download.file(trainUrl, "pml-training.csv")
download.file(testUrl, "pml-testing.csv")
trainData <- read.csv("pml-training.csv", header = TRUE, sep = ",")
testData <- read.csv("pml-testing.csv", header = TRUE, sep = ",")Let’s take a look at what we’re working with.
## [1] 19622 160
## [1] 20 160
So, the training data contains nearly 20k observations of 160 different variables, while the test data contains 20 observations of the same 160 variables. It looks like some of the variables have missing data, and there are blanks and division by 0’s that aren’t recorded as NA’s. Let’s convert all of the missing observations to NA and then see how many missing values each variable has.
## Convert missing values to NA
trainData[trainData==""] <- NA
trainData[trainData=="#DIV/0!"] <- NA
testData[testData==""] <- NA
testData[testData=="#DIV/0!"] <- NA
## View proportions
trainData %>% summarise_each(funs(100*mean(is.na(.))))## X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 0 0 0 0 0
## new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1 0 0 0 0 0 0
## kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 1 97.98186 98.09398 100 97.97676
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1 98.09398 100 97.93089 97.93089
## max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt amplitude_roll_belt
## 1 97.98186 97.93089 97.93089 97.98186 97.93089
## amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 1 97.93089 97.98186 97.93089 97.93089
## stddev_roll_belt var_roll_belt avg_pitch_belt stddev_pitch_belt
## 1 97.93089 97.93089 97.93089 97.93089
## var_pitch_belt avg_yaw_belt stddev_yaw_belt var_yaw_belt gyros_belt_x
## 1 97.93089 97.93089 97.93089 97.93089 0
## gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## 1 0 0 0 0 0
## magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm
## 1 0 0 0 0 0 0
## total_accel_arm var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## 1 0 97.93089 97.93089 97.93089 97.93089
## avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm stddev_yaw_arm
## 1 97.93089 97.93089 97.93089 97.93089 97.93089
## var_yaw_arm gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
## 1 97.93089 0 0 0 0 0
## accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 1 0 0 0 0 98.32841
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 1 98.3386 97.98695 98.32331 98.3386
## skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm min_roll_arm
## 1 97.98695 97.93089 97.93089 97.93089 97.93089
## min_pitch_arm min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## 1 97.93089 97.93089 97.93089 97.93089
## amplitude_yaw_arm roll_dumbbell pitch_dumbbell yaw_dumbbell
## 1 97.93089 0 0 0
## kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 1 97.95638 97.94109 100
## skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 1 97.95128 97.93599 100
## max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 1 97.93089 97.93089 97.95638 97.93089
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 1 97.93089 97.95638 97.93089
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 1 97.93089 97.95638 0
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1 97.93089 97.93089 97.93089 97.93089
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## 1 97.93089 97.93089 97.93089 97.93089
## stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1 97.93089 97.93089 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1 0 0 0 0
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1 0 0 0 0
## pitch_forearm yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 1 0 0 98.35898 98.36408
## kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 1 100 98.35389 98.36408
## skewness_yaw_forearm max_roll_forearm max_picth_forearm max_yaw_forearm
## 1 100 97.93089 97.93089 98.35898
## min_roll_forearm min_pitch_forearm min_yaw_forearm amplitude_roll_forearm
## 1 97.93089 97.93089 98.35898 97.93089
## amplitude_pitch_forearm amplitude_yaw_forearm total_accel_forearm
## 1 97.93089 98.35898 0
## var_accel_forearm avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 1 97.93089 97.93089 97.93089 97.93089
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1 97.93089 97.93089 97.93089 97.93089
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1 97.93089 97.93089 0 0
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1 0 0 0 0
## magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1 0 0 0 0
Quite a few of the variables are nearly entirely missing (>97% NA). Let’s remove them from the dataset, along with the metadata variables - which won’t be used for prediction.
## Remove metadata variables
trainData <- trainData %>%
select(-c(1:7))
testData <- testData %>%
select(-c(1:7))
## Remove variables that are missing atleast half of their observations
naVars <- trainData %>%
map(function(x) mean(is.na(x)) > 0.50) # logical vector detailing whether or not each variable is missing more than 50% of its data
trainData <- trainData[, naVars == FALSE] # remove variables that are missing more than 50% of their observations
testSet <- testData[, naVars == FALSE]
## Check to make sure only variables with no NA's are left. This leaves us with 53 variables.
trainData %>%
summarise_each(funs(sum(is.na(.))))## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## 1 0 0 0 0 0 0
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## 1 0 0 0 0 0
## magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## 1 0 0 0 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 1 0 0 0 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## 1 0 0 0 0 0
## yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1 0 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1 0 0 0 0
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1 0 0 0 0
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x gyros_forearm_y
## 1 0 0 0 0 0
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1 0 0 0 0
## magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1 0 0 0 0
Let’s also take a look at which of the remaining variables have zero or near-zero variance.
nearZeroVar() diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that are have both of the following characteristics: they have very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
## Check if any of the remaining variables have zero or near-zero variance
nearZeroVar(trainData, saveMetrics = TRUE)## freqRatio percentUnique zeroVar nzv
## roll_belt 1.101904 6.7781062 FALSE FALSE
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## gyros_belt_x 1.058651 0.7134849 FALSE FALSE
## gyros_belt_y 1.144000 0.3516461 FALSE FALSE
## gyros_belt_z 1.066214 0.8612782 FALSE FALSE
## accel_belt_x 1.055412 0.8357966 FALSE FALSE
## accel_belt_y 1.113725 0.7287738 FALSE FALSE
## accel_belt_z 1.078767 1.5237998 FALSE FALSE
## magnet_belt_x 1.090141 1.6664968 FALSE FALSE
## magnet_belt_y 1.099688 1.5187035 FALSE FALSE
## magnet_belt_z 1.006369 2.3290184 FALSE FALSE
## roll_arm 52.338462 13.5256345 FALSE FALSE
## pitch_arm 87.256410 15.7323412 FALSE FALSE
## yaw_arm 33.029126 14.6570176 FALSE FALSE
## total_accel_arm 1.024526 0.3363572 FALSE FALSE
## gyros_arm_x 1.015504 3.2769341 FALSE FALSE
## gyros_arm_y 1.454369 1.9162165 FALSE FALSE
## gyros_arm_z 1.110687 1.2638875 FALSE FALSE
## accel_arm_x 1.017341 3.9598410 FALSE FALSE
## accel_arm_y 1.140187 2.7367241 FALSE FALSE
## accel_arm_z 1.128000 4.0362858 FALSE FALSE
## magnet_arm_x 1.000000 6.8239731 FALSE FALSE
## magnet_arm_y 1.056818 4.4439914 FALSE FALSE
## magnet_arm_z 1.036364 6.4468454 FALSE FALSE
## roll_dumbbell 1.022388 84.2065029 FALSE FALSE
## pitch_dumbbell 2.277372 81.7449801 FALSE FALSE
## yaw_dumbbell 1.132231 83.4828254 FALSE FALSE
## total_accel_dumbbell 1.072634 0.2191418 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.2282132 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.4167771 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.0498420 FALSE FALSE
## accel_dumbbell_x 1.018018 2.1659362 FALSE FALSE
## accel_dumbbell_y 1.053061 2.3748853 FALSE FALSE
## accel_dumbbell_z 1.133333 2.0894914 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.7486495 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.3012945 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.4451126 FALSE FALSE
## roll_forearm 11.589286 11.0895933 FALSE FALSE
## pitch_forearm 65.983051 14.8557741 FALSE FALSE
## yaw_forearm 15.322835 10.1467740 FALSE FALSE
## total_accel_forearm 1.128928 0.3567424 FALSE FALSE
## gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
## gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
## gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
## accel_forearm_x 1.126437 4.0464784 FALSE FALSE
## accel_forearm_y 1.059406 5.1116094 FALSE FALSE
## accel_forearm_z 1.006250 2.9558659 FALSE FALSE
## magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
## magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
## magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
## classe 1.469581 0.0254816 FALSE FALSE
None of the remaining variables have zero or near-zero variance, so no changes are made to the dataset.
Lastly, we’ll partition the training data into a training set (70% of obs) and a validation set (30% of obs).
Modeling
Two techniques are used to predict the class of the participants in the test set.
- Decision Tree
- Extreme Gradient Boosting
Decision Tree
Decision trees create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees typically lack the predictive power of some other techniques, but they make feature importance clear and relations easy to see, so the results are worth investigating.
## user system elapsed
## 2.624 0.031 2.691
The way in which the decision tree is formed is depicted above. Now we’ll use the fitted model to predict the class’s of the participants in the validation set.
## Predict on validation set
dtValidationPrediction <- predict(dtFit, newdata = validationSet, type = "class")
## Evaluate
confusionMatrix(dtValidationPrediction, validationSet$classe)## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1467 153 14 39 26
## B 71 738 149 94 147
## C 39 99 766 167 131
## D 75 114 71 622 93
## E 22 35 26 42 685
##
## Overall Statistics
##
## Accuracy : 0.7269
## 95% CI : (0.7154, 0.7383)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6545
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8763 0.6479 0.7466 0.6452 0.6331
## Specificity 0.9449 0.9029 0.9103 0.9283 0.9740
## Pos Pred Value 0.8634 0.6155 0.6373 0.6379 0.8457
## Neg Pred Value 0.9505 0.9144 0.9445 0.9303 0.9218
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2493 0.1254 0.1302 0.1057 0.1164
## Detection Prevalence 0.2887 0.2037 0.2042 0.1657 0.1376
## Balanced Accuracy 0.9106 0.7754 0.8284 0.7867 0.8035
We’re able to predict the class of the validation set with 72.69% accuracy. Accuracy is defined as (number of correct predictions)/(total number of predictions).
72.69% isn’t a great accuracy score in this context, so we’ll turn to XGBoost and see if we can build a better model.
XGBoost
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. At the time of writing, it is widely considered the gold standard for building predictive models.
XGBoost requires some data transformations to be made in order to run the model - ie. matrix format, outcome variable removed from training set, etc. Let’s make those transformations now.
Lastly, because we’re using cross-validation, we’ll use the entire set of training data without partitioning out part of it as a validation set.
## Separate label/outcome variable (classe)
classe <- trainData[, "classe"]
lengthClasse <- length(levels(classe))
levels(classe) <- 1:lengthClasse
## XGBoost labels need to start at 0
## ABCDE becomes 01234 in matrix form
classeMatrix <- as.matrix(as.integer(classe)-1)
# Remove outcome variables
trainData$classe <- NULL
testSet$problem_id <- NULL
# Convert data to matrix
matrixTrainSet <- as.matrix(trainData)
matrixTestSet <- as.matrix(testSet)Next we’ll set the parameters for the model - it’s cleaner to do it separately.
merror is chosen as the evaluation metric because it is considered the best evaluation metric for multiclass classification problems. It is calculated as (number of wrong cases)/(total cases).
## Set parameters
nRoundsCV <- 250 # number of iterations in cross-validation
param <- list(objective = "multi:softprob", # multiclass classification
num_class = lengthClasse, # number of classes (5 here)
eval_metric = "merror", # evaluation metric
nthread = 8, # threads to be used
max_depth = 16, # max depth of tree
eta = 0.3, # step size shrinkage
gamma = 0, # minimum loss reduction
subsample = 1, # data instances to grow tree
colsample_bytree = 1, # subsample ratio of columns
min_child_weight = 12 # min sum of instance weight
)Now, we’ll run XGBoost with 5-fold cross validation on the training data in an effort to determine the index with the best mean merror. 5-fold cross validation was selected because it strikes a nice balance between optimal model selection and runtime.
Then, we’ll predict the classes of the participants in the training data using the cross-validation model and compare it to the actual values.
## 5-fold CV
system.time(cvModel <- xgb.cv(param=param,
data=matrixTrainSet,
label=classeMatrix,
nfold=5,
nrounds=nRoundsCV,
print.every.n = 10,
early.stop.round = 20,
prediction=TRUE,
verbose=FALSE,
maximize = FALSE))## user system elapsed
## 379.618 19.047 119.284
## Predict classe using cross-validation model
cvPrediction <- matrix(cvModel$pred,
nrow = length(cvModel$pred)/lengthClasse,
ncol = lengthClasse) %>%
max.col("last")
## Confusion matrix
confusionMatrix(factor(classeMatrix+1), factor(cvPrediction))## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 5574 4 0 0 2
## 2 9 3779 8 0 1
## 3 0 7 3408 7 0
## 4 0 0 16 3195 5
## 5 0 0 4 6 3597
##
## Overall Statistics
##
## Accuracy : 0.9965
## 95% CI : (0.9956, 0.9973)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9956
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9984 0.9971 0.9919 0.9959 0.9978
## Specificity 0.9996 0.9989 0.9991 0.9987 0.9994
## Pos Pred Value 0.9989 0.9953 0.9959 0.9935 0.9972
## Neg Pred Value 0.9994 0.9993 0.9983 0.9992 0.9995
## Prevalence 0.2845 0.1932 0.1751 0.1635 0.1837
## Detection Rate 0.2841 0.1926 0.1737 0.1628 0.1833
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9990 0.9980 0.9955 0.9973 0.9986
The cross-validation model is able to predict the class with 99.65% accuracy. This is substantially better than the decision tree model. Let’s follow through and use the XGBoost model to predict the class of the test participants.
We’ll use the number of rounds that returned the lowest test merror mean in the cross validation model and use it to make the model that we’re applying to the test data optimal.
## Determine iteration with lowest test merror mean
minErrorIndex <- which.min(cvModel$evaluation_log[, test_merror_mean])
print(minErrorIndex) # shows which index had the lowest test merror mean## [1] 95
## Train the real model with full data using the optimal test merror mean
system.time(bestModel <- xgboost::xgboost(param=param,
data=matrixTrainSet,
label=classeMatrix,
nrounds=minErrorIndex,
verbose=FALSE,
maximize=FALSE,
predict=TRUE))## [00:32:46] WARNING: amalgamation/../src/learner.cc:573:
## Parameters: { "predict" } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
## user system elapsed
## 85.056 4.320 29.217
## Predict Test Data
bestModelPrediction <- predict(bestModel, matrixTestSet)
## Convert prediction from 01234 format back to ABCDE
finalPrediction <- matrix(bestModelPrediction,
nrow=lengthClasse,
ncol=length(bestModelPrediction)/lengthClasse) %>%
t() %>%
max.col("last")
finalPrediction <- toupper(letters[finalPrediction])
print(finalPrediction)## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A" "B" "B"
## [20] "B"
The optimal model is used to predict the class of the 20 test participants, listed above.
Finally, let’s take a look at the importance of the different features in the model.
xgb.ggplot.importance performs 1-D clustering of the importance values, with bar colors corresponding to different clusters that have somewhat similar importance values.
## Feature importance
importance_matrix <- xgb.importance(names(matrixTrainSet),model=bestModel)
xgb.ggplot.importance(importance_matrix[1:20,]) According to the importance matrix, roll_belt is the most important feature.