As part of Course 7, Practical Machine Learning, in the Data Science Specialization on Coursera, the goal of this project is to predict the manner in which participants performed an exercise, recorded as the ‘classe’ variable in the dataset below. We will clean and explore the data to choose the variables to predict with. This report outlines how the models are built, estimates our out-of-sample error using a validation set, and explains how we choose the best model. Lastly, we will use this model to predict 20 test cases.
Human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time (as with the Daily Living Activities dataset). The approach proposed for the Weight Lifting Exercises dataset is instead to investigate “how (well)” an activity was performed by the wearer. The “how (well)” investigation has so far received little attention, even though it potentially provides useful information for a large variety of applications, such as sports training.
In this work (see the paper referenced below) the authors first define quality of execution and investigate three aspects that pertain to qualitative activity recognition: the problem of specifying correct execution, the automatic and robust detection of execution mistakes, and how to provide feedback on the quality of execution to the user. They tried out an on-body sensing approach (the dataset used here), but also an “ambient sensing approach” (using Microsoft Kinect; that dataset is still unavailable).
Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
# Load the required libraries
library(caret)
library(dplyr)
# Read the training and testing data directly from the course URLs
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_raw <- read.csv(url_train)
test_raw <- read.csv(url_test)
dim(train_raw); dim(test_raw)
## [1] 19622 160
## [1] 20 160
View(train_raw)  # opens an interactive viewer; not rendered in the knitted report
After viewing the data, it appears the first six columns (row index, user name, timestamps, and the new_window flag) are identifiers rather than sensor measurements, so we can remove them.
train_data <- train_raw[,-c(1:6)]
test_data <- test_raw[, -c(1:6)]
dim(train_data); dim(test_data)
## [1] 19622 154
## [1] 20 154
We will need to address any NAs in the data.
# Tabulate the proportion of NAs in each column
table(sapply(train_data, function(x) mean(is.na(x))))
## 
##                 0 0.979308938946081 
##                87                67
This is an interesting result: 87 features have no NAs, while the remaining 67 are at least 97% NA. Let’s exclude the latter features for now.
# TRUE for columns that contain no NAs (the ones we keep)
remove_NA <- sapply(train_data, function(x) mean(is.na(x))) == 0
table(remove_NA)
## remove_NA
## FALSE TRUE
## 67 87
train_data <- train_data[, remove_NA]
test_data <- test_data[, remove_NA]
dim(train_data); dim(test_data)
## [1] 19622 87
## [1] 20 87
Let’s also remove the near-zero-variance features before modeling.
near_zero <- nearZeroVar(train_data)
train_data <- train_data[, -near_zero]
test_data <- test_data[, -near_zero]
dim(train_data); dim(test_data)
## [1] 19622 54
## [1] 20 54
Lastly, let’s convert classe to a factor variable.
train_data$classe <- as.factor(train_data$classe)
Now, with the data cleaned, we can split our train_data set into training and validation sets to build our models against.
inTrain <- createDataPartition(train_data$classe, p = 3/4, list = FALSE)
train_set <- train_data[inTrain, ]
valid_set <- train_data[-inTrain, ]
dim(train_set); dim(valid_set)
## [1] 14718 54
## [1] 4904 54
This is a large dataset with many features. To optimize for speed, let’s perform a Principal Components Analysis to see how many components would be needed to capture most of the variability, which we could use to reduce dimensionality and improve model performance.
# Drop the outcome and coerce all predictors to numeric for PCA
train_set_num <- train_set %>%
  select(-classe) %>%
  mutate_all(as.numeric)
train_set_num[is.na(train_set_num)] <- 0  # guard against NAs introduced by coercion
pca <- prcomp(train_set_num)
# Plot each component's share of total standard deviation. (Strictly,
# "variance explained" would use pca$sdev^2; the sdev share used here is a
# more conservative screen.)
qplot(1:length(pca$sdev), pca$sdev / sum(pca$sdev), ylab = "% Explained",
      xlab = "Principal Component")
From this chart, it appears we reach 99% cumulative explanation somewhere between 30 and 40 components.
cumsum(pca$sdev / sum(pca$sdev))[30:40]
## [1] 0.9808430 0.9842486 0.9874429 0.9896621 0.9913573 0.9929876 0.9944913
## [8] 0.9958561 0.9967019 0.9974892 0.9982421
It looks like 35 components will get us to 99% of the cumulative total, so if we run into performance issues, we can use this as a constraint on the models.
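For reference, here is a minimal sketch of how that constraint could be applied within caret; the pca_control and model_gbm_pca names are illustrative, and this was not run for this report. Setting preProcess = "pca" with pcaComp = 35 would feed any model only the first 35 components.
# Not run: restrict a model to the first 35 principal components via caret
pca_control <- trainControl(method = "cv", number = 3,
                            preProcOptions = list(pcaComp = 35))
model_gbm_pca <- train(classe ~ ., data = train_set, method = "gbm",
                       preProcess = "pca", trControl = pca_control,
                       verbose = FALSE)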
Now it’s time to build and train our various models. With limited computing power on a single core, a Random Forest would be too slow to train, so we will not use one here. We will instead train two models, then tune the better-performing model to improve results.
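For completeness, here is a sketch of what that skipped Random Forest fit might look like on a multi-core machine, assuming the doParallel and randomForest packages are available (model_rf and the core count are illustrative; not run here):
# Not run: random forest with 3-fold CV, parallelized across 4 worker cores
library(doParallel)
cl <- makePSOCKcluster(4)  # adjust to the cores available
registerDoParallel(cl)
set.seed(4242)
model_rf <- train(classe ~ ., data = train_set, method = "rf",
                  trControl = trainControl(method = "cv", number = 3))
stopCluster(cl)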
We will utilize a confusion matrix to compare the results on the validation set.
set.seed(4242)
# Model 1: a CART decision tree with caret's default bootstrap resampling
model_rpart <- train(classe ~ ., data = train_set, method = "rpart")
set.seed(4242)
# Model 2: gradient boosting (gbm) with 3-fold cross-validation
train_control <- trainControl(method = "cv", number = 3)
model_gbm <- train(classe ~ ., data = train_set, method = "gbm",
                   trControl = train_control, verbose = FALSE)
model_gbm$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 9 150 3 0.1 10
With more compute power and time, we could set the tuning grid to a range of values for each parameter and let caret cross-validate every combination.
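For instance, a fuller grid might look like this (the ranges in gbm_grid_full are illustrative, not tuned, and this was not run here):
# Not run: an illustrative wider tuning grid (3 x 4 x 3 x 3 = 108 combinations)
gbm_grid_full <- expand.grid(interaction.depth = c(3, 5, 7),
                             n.trees = seq(100, 250, by = 50),
                             shrinkage = c(0.05, 0.075, 0.1),
                             n.minobsinnode = c(5, 10, 20))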
Here, to keep runtime manageable, I have illustrated a single user-defined combination.
set.seed(4242)
train_control <- trainControl(method = "cv", number = 3)
gbm_grid <- expand.grid(interaction.depth = 5,
n.trees = 175,
shrinkage = 0.075,
n.minobsinnode = 10)
model_gbm_user <- train(classe ~ ., data = train_set, method = "gbm",
trControl = train_control, tuneGrid = gbm_grid, verbose = FALSE)
Let’s validate the models against the validation data.
predict_rpart <- predict(model_rpart, newdata = valid_set)
predict_gbm <- predict(model_gbm, newdata = valid_set)
predict_gbm_user <- predict(model_gbm_user, newdata = valid_set)
CM_rpart <- confusionMatrix(predict_rpart, valid_set$classe)
CM_gbm <- confusionMatrix(predict_gbm, valid_set$classe)
CM_gbm_user <- confusionMatrix(predict_gbm_user, valid_set$classe)
CM_rpart$table
## Reference
## Prediction A B C D E
## A 1273 392 401 342 96
## B 15 326 34 148 60
## C 105 231 420 272 197
## D 0 0 0 0 0
## E 2 0 0 42 548
CM_gbm$table
## Reference
## Prediction A B C D E
## A 1388 8 0 0 1
## B 7 926 12 2 4
## C 0 12 842 8 3
## D 0 3 1 794 7
## E 0 0 0 0 886
CM_gbm_user$table
## Reference
## Prediction A B C D E
## A 1393 3 0 0 0
## B 2 942 2 0 1
## C 0 4 853 8 2
## D 0 0 0 796 2
## E 0 0 0 0 896
CM_rpart$overall[1]; CM_gbm$overall[1]; CM_gbm_user$overall[1]
## Accuracy
## 0.5234502
## Accuracy
## 0.9861338
## Accuracy
## 0.995106
Now it’s time to use our best model, model_gbm_user, to predict values for our test set. Given our success with the validation data, we expect the out-of-sample error to be small; it can be estimated as one minus the validation accuracy.
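As a quick check, using the confusion matrix computed above, the estimated out-of-sample error works out to roughly 0.49%:
# Estimated out-of-sample error = 1 - validation-set accuracy (~0.0049)
1 - CM_gbm_user$overall[1]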
predict_final <- predict(model_gbm_user, newdata = test_data)
predict_final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.