This project outlines a method for taking belt, forearm, arm, and dumbell acceleration and gyroscope measurements and predicting whether the intended exercise was done correctly or not. The data were collected by Groupware@LES.
The training data were first read in to create a data frame for the building data with exercise classifications marked A (correctly performed), as well as B - E (some type of incorrect performance of the exercise). The non-sensor data was removed from the data as well as measurement types with more than 10% missing values because model building and prediction are greatly impacted by missing values. Finally, several columns had numeric data in character format so those were converted for analysis.
projBuild <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
## Change outcome variable to class factor for analysis.
projBuild$classe <- as.factor(projBuild$classe)
## 1st 7 columns are non-sensor data.
projBuild <- projBuild[,-(1:7)]
## Convert all character columns to numeric.
projBuild[,names(projBuild)[sapply(projBuild, class) == "character"]] <-
sapply(projBuild[,names(projBuild)[
sapply(projBuild, class) == "character"]], function(x)
x <- as.numeric(x))
## Several columns had a lot of missing values.
sapply(projBuild, function(x) paste(round(100*sum(is.na(x))/length(x),1),"% NAs",sep=""))
## roll_belt pitch_belt yaw_belt
## "0% NAs" "0% NAs" "0% NAs"
## total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## "0% NAs" "98% NAs" "98.1% NAs"
## kurtosis_yaw_belt skewness_roll_belt skewness_roll_belt.1
## "100% NAs" "98% NAs" "98.1% NAs"
## skewness_yaw_belt max_roll_belt max_picth_belt
## "100% NAs" "97.9% NAs" "97.9% NAs"
## max_yaw_belt min_roll_belt min_pitch_belt
## "98% NAs" "97.9% NAs" "97.9% NAs"
## min_yaw_belt amplitude_roll_belt amplitude_pitch_belt
## "98% NAs" "97.9% NAs" "97.9% NAs"
## amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## "98% NAs" "97.9% NAs" "97.9% NAs"
## stddev_roll_belt var_roll_belt avg_pitch_belt
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## stddev_pitch_belt var_pitch_belt avg_yaw_belt
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## stddev_yaw_belt var_yaw_belt gyros_belt_x
## "97.9% NAs" "97.9% NAs" "0% NAs"
## gyros_belt_y gyros_belt_z accel_belt_x
## "0% NAs" "0% NAs" "0% NAs"
## accel_belt_y accel_belt_z magnet_belt_x
## "0% NAs" "0% NAs" "0% NAs"
## magnet_belt_y magnet_belt_z roll_arm
## "0% NAs" "0% NAs" "0% NAs"
## pitch_arm yaw_arm total_accel_arm
## "0% NAs" "0% NAs" "0% NAs"
## var_accel_arm avg_roll_arm stddev_roll_arm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## var_roll_arm avg_pitch_arm stddev_pitch_arm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## var_pitch_arm avg_yaw_arm stddev_yaw_arm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## var_yaw_arm gyros_arm_x gyros_arm_y
## "97.9% NAs" "0% NAs" "0% NAs"
## gyros_arm_z accel_arm_x accel_arm_y
## "0% NAs" "0% NAs" "0% NAs"
## accel_arm_z magnet_arm_x magnet_arm_y
## "0% NAs" "0% NAs" "0% NAs"
## magnet_arm_z kurtosis_roll_arm kurtosis_picth_arm
## "0% NAs" "98.3% NAs" "98.3% NAs"
## kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## "98% NAs" "98.3% NAs" "98.3% NAs"
## skewness_yaw_arm max_roll_arm max_picth_arm
## "98% NAs" "97.9% NAs" "97.9% NAs"
## max_yaw_arm min_roll_arm min_pitch_arm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## amplitude_yaw_arm roll_dumbbell pitch_dumbbell
## "97.9% NAs" "0% NAs" "0% NAs"
## yaw_dumbbell kurtosis_roll_dumbbell kurtosis_picth_dumbbell
## "0% NAs" "98% NAs" "97.9% NAs"
## kurtosis_yaw_dumbbell skewness_roll_dumbbell skewness_pitch_dumbbell
## "100% NAs" "98% NAs" "97.9% NAs"
## skewness_yaw_dumbbell max_roll_dumbbell max_picth_dumbbell
## "100% NAs" "97.9% NAs" "97.9% NAs"
## max_yaw_dumbbell min_roll_dumbbell min_pitch_dumbbell
## "98% NAs" "97.9% NAs" "97.9% NAs"
## min_yaw_dumbbell amplitude_roll_dumbbell amplitude_pitch_dumbbell
## "98% NAs" "97.9% NAs" "97.9% NAs"
## amplitude_yaw_dumbbell total_accel_dumbbell var_accel_dumbbell
## "98% NAs" "0% NAs" "97.9% NAs"
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## "0% NAs" "0% NAs" "0% NAs"
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## "0% NAs" "0% NAs" "0% NAs"
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## "0% NAs" "0% NAs" "0% NAs"
## roll_forearm pitch_forearm yaw_forearm
## "0% NAs" "0% NAs" "0% NAs"
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## "98.4% NAs" "98.4% NAs" "100% NAs"
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## "98.4% NAs" "98.4% NAs" "100% NAs"
## max_roll_forearm max_picth_forearm max_yaw_forearm
## "97.9% NAs" "97.9% NAs" "98.4% NAs"
## min_roll_forearm min_pitch_forearm min_yaw_forearm
## "97.9% NAs" "97.9% NAs" "98.4% NAs"
## amplitude_roll_forearm amplitude_pitch_forearm amplitude_yaw_forearm
## "97.9% NAs" "97.9% NAs" "98.4% NAs"
## total_accel_forearm var_accel_forearm avg_roll_forearm
## "0% NAs" "97.9% NAs" "97.9% NAs"
## stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## "97.9% NAs" "97.9% NAs" "97.9% NAs"
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x
## "97.9% NAs" "97.9% NAs" "0% NAs"
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## "0% NAs" "0% NAs" "0% NAs"
## accel_forearm_y accel_forearm_z magnet_forearm_x
## "0% NAs" "0% NAs" "0% NAs"
## magnet_forearm_y magnet_forearm_z classe
## "0% NAs" "0% NAs" "0% NAs"
## Only keep sensor measurement types with less than 10% NA values.
projBuild <- projBuild[,names(projBuild)[sapply(projBuild, function(x) 100*sum(is.na(x))/length(x) < 10)]]
## Print out a table of the outcomes and counts for each outcome.
uniqueOutcomesTable <- table(projBuild$classe)
## Display a plot of the counts for each outcome (A-E) for the building
## data set.
outcomesPlot <- ggplot(data.frame(uniqueOutcomesTable), aes(Var1, Freq)) +
geom_bar(stat = "Identity", fill = "red") +
labs(x="Unique Outcomes") +
labs(y="Counts") +
labs(title="Unique Outcomes Counts Plot") +
theme_minimal() +
theme(title = element_text(face = "bold", size=18),
plot.margin=unit(c(2,4,2,2),"lines"),
axis.title.x = element_text(margin = ggplot2::margin(25,0,0,0), size=16),
axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=16))
print(uniqueOutcomesTable)
##
## A B C D E
## 5580 3797 3422 3216 3607
print(outcomesPlot)
Row 5373 has 3 high-leverage outliers in the gyros_forearm_y, gyros_forearm_z and gyros_dumbbell_z columns. This row was therefore removed.
## Identify only gyroscope data.
gyro <- projBuild[,names(projBuild)[grep("gyro", names(projBuild))]]
## Return which gyro data rows have the outlier.
sapply(projBuild[,names(projBuild)[
grep("gyro", names(projBuild))]], function(x) match(x[abs(x)>200],x))
## $gyros_belt_x
## integer(0)
##
## $gyros_belt_y
## integer(0)
##
## $gyros_belt_z
## integer(0)
##
## $gyros_arm_x
## integer(0)
##
## $gyros_arm_y
## integer(0)
##
## $gyros_arm_z
## integer(0)
##
## $gyros_dumbbell_x
## [1] 5373
##
## $gyros_dumbbell_y
## integer(0)
##
## $gyros_dumbbell_z
## [1] 5373
##
## $gyros_forearm_x
## integer(0)
##
## $gyros_forearm_y
## [1] 5373
##
## $gyros_forearm_z
## [1] 5373
## Create plots of each pair of correlated variables.
plotList <- list()
plotList[[1]] <-
ggplot(projBuild, aes(1,gyros_forearm_y)) +
geom_boxplot(color = "red") +
labs(x=NULL) +
labs(y="gyros_forearem_y") +
labs(title=NULL) +
theme_minimal() +
theme(title = element_text(face = "bold", size=12),
## plot.margin=unit(c(2,4,2,2),"lines"),
axis.text.x = element_blank(),
axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=10))
plotList[[2]] <-
ggplot(projBuild, aes(1,gyros_forearm_z)) +
geom_boxplot(color = "red") +
labs(x=NULL) +
labs(y="gyros_forearem_z") +
labs(title=NULL) +
theme_minimal() +
theme(title = element_text(face = "bold", size=12),
## plot.margin=unit(c(2,4,2,2),"lines"),
axis.text.x = element_blank(),
axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=10))
plotList[[3]] <-
ggplot(projBuild, aes(1,gyros_dumbbell_z)) +
geom_boxplot(color = "red") +
labs(x=NULL) +
labs(y="gyros_dumbbell_z") +
labs(title=NULL) +
theme_minimal() +
theme(title = element_text(face = "bold", size=12),
## plot.margin=unit(c(2,4,2,2),"lines"),
axis.text.x = element_blank(),
axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=10))
## And print the plots.
grid.arrange(plotList[[1]], plotList[[2]], ncol=2)
## Remove the offending row.
projBuild <- projBuild[-5373,]
Initially, the building data was further subsetted into a training set, a testing set and a validation set. A tree model, a random forest model, and a boosted tree model were trained on the training set, and then tested on the test set. They were then combined in a general additive model which was trained on the testing set. Finally, all four models were applied to the validation set to directly compare their results.
## Use parallel processing to speed up calculations.
registerDoMC(cores=8)
inProjTrain <- createDataPartition(projBuild$classe, p = .7, list=FALSE)
projTrain <- projBuild[inProjTrain,]
projTestAndValidate <- projBuild[-inProjTrain,]
inProjTest <- createDataPartition(projTestAndValidate$classe, p = .5, list=FALSE)
projTest <- projTestAndValidate[inProjTest,]
projValidate <- projTestAndValidate[-inProjTest,]
## Perform parallel processing to speed up computations.
registerDoMC(cores=8)
## Fit a tree model to the training data.
rpartProjModel <- train(classe~., method="rpart", data=projTrain, preProcess = c("center", "scale"))
## Perform parallel processing to speed up computations.
registerDoMC(cores=8)
## Fit a random forest model to the training data.
rfProjModel <- train(classe~., method="rf", data=projTrain, preProcess = c("center", "scale"))
## Perform parallel processing to speed up computations.
registerDoMC(cores=8)
## Fit a gradient boosting machine to the training data.
gbmProjModel <- train(classe~., method="gbm", data=projTrain, preProcess = c("center", "scale"), verbose=FALSE)
## Make predictions on the test set.
rpartPredTest <- predict(rpartProjModel, projTest)
rfPredTest <- predict(rfProjModel, projTest)
gbmPredTest <- predict(gbmProjModel, projTest)
## Combine predictions from 3 models and train on the test set.
predDFtest <- data.frame(rpartPredTest, rfPredTest, gbmPredTest, classe = projTest$classe)
registerDoMC(cores=4)
comboModFit <- train(classe~., method="gam", data=predDFtest)
comboPredTest <- predict(comboModFit, predDFtest)
As can be seen below, the tree model performed the worst on the validation set with an accuracy of 50.77%, the boosted tree model did better at 96.12%, and the random forest performed the best at 99.29%. These 3 models were also combined combined but performed poorly with an accuracy of 65.76%. Since the random forest was the best-performing individual model, and the combination of models did not help, a random forest model was selected going forward.
## Make predictions from all 3 models on the validation set.
rpartPredValidate <- predict(rpartProjModel, projValidate)
rfPredValidate <- predict(rfProjModel, projValidate)
gbmPredValidate <- predict(gbmProjModel, projValidate)
## Combine predictions from 3 models and predict on the validation set.
predDFvalidate <- data.frame(rpartPredValidate, rfPredValidate, gbmPredValidate, classe=projValidate$classe)
comboModFit <- train(classe~., method="rpart", data=predDFvalidate)
comboPredValidate <- predict(comboModFit, predDFvalidate)
## Print Tree Model Results
print(confusionMatrix(rpartPredValidate, projValidate$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 528 102 12 19 15
## B 2 109 13 5 7
## C 164 94 326 150 109
## D 140 264 162 308 188
## E 2 0 0 0 222
##
## Overall Statistics
##
## Accuracy : 0.5077
## 95% CI : (0.4894, 0.5259)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3865
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6316 0.19156 0.6355 0.6390 0.41035
## Specificity 0.9297 0.98862 0.7871 0.6934 0.99917
## Pos Pred Value 0.7811 0.80147 0.3867 0.2900 0.99107
## Neg Pred Value 0.8640 0.83601 0.9109 0.9074 0.88259
## Prevalence 0.2843 0.19347 0.1744 0.1639 0.18395
## Detection Rate 0.1795 0.03706 0.1108 0.1047 0.07548
## Detection Prevalence 0.2299 0.04624 0.2866 0.3611 0.07616
## Balanced Accuracy 0.7806 0.59009 0.7113 0.6662 0.70476
## Print Random Forest Model Results
print(confusionMatrix(rfPredValidate, projValidate$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 836 6 0 0 0
## B 0 562 3 0 0
## C 0 1 506 1 2
## D 0 0 4 480 3
## E 0 0 0 1 536
##
## Overall Statistics
##
## Accuracy : 0.9929
## 95% CI : (0.9891, 0.9956)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.991
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9877 0.9864 0.9959 0.9908
## Specificity 0.9971 0.9987 0.9984 0.9972 0.9996
## Pos Pred Value 0.9929 0.9947 0.9922 0.9856 0.9981
## Neg Pred Value 1.0000 0.9971 0.9971 0.9992 0.9979
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1840
## Detection Rate 0.2843 0.1911 0.1721 0.1632 0.1823
## Detection Prevalence 0.2863 0.1921 0.1734 0.1656 0.1826
## Balanced Accuracy 0.9986 0.9932 0.9924 0.9965 0.9952
## Print Boosted Tree Model Results
print(confusionMatrix(gbmPredValidate, projValidate$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 828 21 0 0 1
## B 5 534 17 3 9
## C 0 12 487 11 4
## D 3 0 8 460 9
## E 0 2 1 8 518
##
## Overall Statistics
##
## Accuracy : 0.9612
## 95% CI : (0.9536, 0.9679)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9509
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9904 0.9385 0.9493 0.9544 0.9575
## Specificity 0.9895 0.9857 0.9889 0.9919 0.9954
## Pos Pred Value 0.9741 0.9401 0.9475 0.9583 0.9792
## Neg Pred Value 0.9962 0.9853 0.9893 0.9911 0.9905
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1840
## Detection Rate 0.2815 0.1816 0.1656 0.1564 0.1761
## Detection Prevalence 0.2890 0.1931 0.1748 0.1632 0.1799
## Balanced Accuracy 0.9900 0.9621 0.9691 0.9731 0.9765
## Print Combined Model Results
print(confusionMatrix(comboPredValidate, projValidate$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 836 7 510 481 5
## B 0 562 3 0 0
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 1 536
##
## Overall Statistics
##
## Accuracy : 0.6576
## 95% CI : (0.6401, 0.6748)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5444
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9877 0.0000 0.0000 0.9908
## Specificity 0.5235 0.9987 1.0000 1.0000 0.9996
## Pos Pred Value 0.4546 0.9947 NaN NaN 0.9981
## Neg Pred Value 1.0000 0.9971 0.8256 0.8361 0.9979
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1840
## Detection Rate 0.2843 0.1911 0.0000 0.0000 0.1823
## Detection Prevalence 0.6253 0.1921 0.0000 0.0000 0.1826
## Balanced Accuracy 0.7618 0.9932 0.5000 0.5000 0.9952
The trainControl function in the caret package allows model creation to be cross validated so that the out of sample error rate can be estimated and the best paramters selected. A new randomization seed was selected and a random forest model was trained on the building set which was subsetted by the train function into 25 different training-test set pairs using a 632 bootstrapping method for cross validation.
set.seed(12345)
## Create instructions for cross-validating train function results.
control <- trainControl(method="boot632", number = 25, returnResamp = "all")
## Perform parallel processing to speed up computations.
registerDoMC(cores=8)
## Fit a random forest model to the training data.
rfControlModel <- train(classe~., method="rf", data=projBuild, preProcess = c("center", "scale"), trControl = control)
##rfProjModel$r
print(rfControlModel)
## Random Forest
##
## 19621 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 19621, 19621, 19621, 19621, 19621, 19621, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9957192 0.9945826
## 27 0.9956004 0.9944324
## 52 0.9902780 0.9876954
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
confusionMatrix(predict(rfControlModel), projBuild$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5579 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
As can be seen above in the confusion matrix output, the best run of the random forest model predicted perfectly on the building set. The out of sample error rate, averaged over all runs by the 632 bootstrap method, was estimated to be 1 - 99.57% = .43%. The following code would print out the test predictions.
projTest <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
testSetPredictions <- predict(rfControlModel, projTest)
print(testSetPredictions)