Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways (labeled A-E). More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The "foreach" and "doMC" packages were loaded to utilize 3 of 4 computing cores and speed up processing for this dataset.
The data, read from a CSV file, contained nearly 20,000 observations of 160 variables, though many variables consisted largely of NA values, missing or blank values, and division-by-zero error values ("#DIV/0!"). To begin cleaning the data, for both the training and test sets, all missing and error values were converted to NA's and the NA values were then summed within each column/variable. Any column with a majority of NA's was discarded. Lastly, all columns that were not relevant predictors (such as names and timestamps) were removed, leaving a total of 52 predictor variables.
Since the dataset is relatively large and computational time will be on the order of hours, a straight randomized 50/50 split of the "training" set will be used for cross-validation.
library(caret)
library(foreach)
library(doMC)
registerDoMC(cores = 3)  # run caret's resampling loops on 3 cores

# Read the raw data, converting blanks and division-by-zero errors to NA
trn <- read.csv("/Users/charlesbecker/Downloads/pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
tst <- read.csv("/Users/charlesbecker/Downloads/pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))

# Count NAs in each column and drop any column that is majority NA
sumNA <- function(x) sum(is.na(x))
NA_sum_trn <- apply(trn, 2, sumNA)
NA_sum_tst <- apply(tst, 2, sumNA)
NA_col_train <- which(NA_sum_trn > nrow(trn) * 0.5)
NA_col_test <- which(NA_sum_tst > nrow(tst) * 0.5)
training <- trn[, -NA_col_train]
testing <- tst[, -NA_col_test]

# Drop the first seven columns (row ids, names, timestamps), which are not predictors
training <- training[, -(1:7)]
testing <- testing[, -(1:7)]

# Randomized 50/50 split: one half to train on, one half held out for cross-validation
samp <- sample(nrow(training), nrow(training) / 2)
new_train <- training[samp, ]
cv_data <- training[-samp, ]
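As a quick sanity check (not part of the original write-up, but a minimal sketch using the objects above), the dimensions confirm the cleaning and the split:

dim(training)   # 53 columns expected: 52 predictors plus the classe outcome
dim(new_train)  # each half of the 50/50 split should hold 9,811 observations
dim(cv_data)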
A variety of models will be trained on the same training data and tested on the cross-validation set to compare error rates. The classification models included are: an rpart decision tree, Naive Bayes, Stochastic Gradient Boosting, a neural network, boosted logistic regression (LogitBoost), K-Nearest Neighbors, and Random Forest. All models will be run through the "caret" package wrapper.
# Fit each classifier through caret's unified train() interface
mod_rpart <- train(classe ~ ., method = "rpart", data = new_train)
mod_nb <- train(classe ~ ., method = "nb", data = new_train)
mod_gbm <- train(classe ~ ., method = "gbm", data = new_train, verbose = FALSE)
mod_nnet <- train(classe ~ ., method = "nnet", data = new_train)
mod_logit <- train(classe ~ ., method = "LogitBoost", data = new_train)
mod_knn <- train(classe ~ ., method = "knn", data = new_train)
mod_rf <- train(classe ~ ., method = "rf", data = new_train, prox = TRUE)

# Predict on the held-out half; these objects feed the confusion matrices below
p_rpart <- predict(mod_rpart, cv_data)
p_nb <- predict(mod_nb, cv_data)
p_nnet <- predict(mod_nnet, cv_data)
p_gbm <- predict(mod_gbm, cv_data)
p_logit <- predict(mod_logit, cv_data)
p_knn <- predict(mod_knn, cv_data)
p_rf <- predict(mod_rf, cv_data)
table(p_rpart, cv_data$classe)
##
## p_rpart A B C D E
## A 2511 795 764 734 267
## B 57 637 59 270 245
## C 212 445 910 619 473
## D 0 0 0 0 0
## E 11 0 0 0 802
confusionMatrix(p_rpart, cv_data$classe)$overall[1]
## Accuracy
## 0.4953623
table(p_nb, cv_data$classe)
##
## p_nb A B C D E
## A 1957 139 111 90 50
## B 113 1273 137 7 199
## C 351 290 1336 314 99
## D 355 168 146 1149 59
## E 15 7 3 63 1380
confusionMatrix(p_nb, cv_data$classe)$overall[1]
## Accuracy
## 0.7231679
table(p_nnet, cv_data$classe)
##
## p_nnet A B C D E
## A 1642 189 197 49 123
## B 50 292 153 72 218
## C 573 712 1209 627 750
## D 526 684 173 875 696
## E 0 0 1 0 0
confusionMatrix(p_nnet, cv_data$classe)$overall[1]
## Accuracy
## 0.4095403
table(p_gbm, cv_data$classe)
##
## p_gbm A B C D E
## A 2736 70 0 3 3
## B 37 1749 64 5 26
## C 11 51 1645 61 15
## D 4 3 23 1543 24
## E 3 4 1 11 1719
confusionMatrix(p_gbm, cv_data$classe)$overall[1]
## Accuracy
## 0.9572928
table(p_logit, cv_data$classe)
##
## p_logit A B C D E
## A 2371 175 33 28 12
## B 62 1339 117 18 53
## C 25 53 1146 51 35
## D 39 20 65 1275 40
## E 7 12 14 35 1417
confusionMatrix(p_logit, cv_data$classe)$overall[1]
## Accuracy
## 0.8941009
table(p_knn, cv_data$classe)
##
## p_knn A B C D E
## A 2628 135 32 34 39
## B 39 1552 62 18 117
## C 40 100 1535 170 47
## D 73 48 55 1377 73
## E 11 42 49 24 1511
confusionMatrix(p_knn, cv_data$classe)$overall[1]
## Accuracy
## 0.8768729
table(p_rf, cv_data$classe)
##
## p_rf A B C D E
## A 2785 19 0 0 0
## B 4 1854 17 0 2
## C 1 4 1707 15 2
## D 0 0 9 1607 3
## E 1 0 0 1 1780
confusionMatrix(p_rf, cv_data$classe)$overall[1]
## Accuracy
## 0.9920497
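Before discussing the results, it helps to see all of the cross-validation accuracies side by side. The following is a minimal sketch (not part of the original analysis) that collects them from the prediction objects above:

# Gather each model's held-out accuracy into one named vector
preds <- list(rpart = p_rpart, nb = p_nb, nnet = p_nnet, gbm = p_gbm,
              logit = p_logit, knn = p_knn, rf = p_rf)
sapply(preds, function(p) confusionMatrix(p, cv_data$classe)$overall[["Accuracy"]])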
Accuracy varies widely across the models, ranging from roughly 40% to 99%. The random forest was the most accurate, with 99.2% accuracy on the 9,811-observation random out-of-sample test, which equates to an error rate of roughly 0.8%. Since the random forest outperformed every other model on all 5 classes, an ensemble stacking approach was not used. However, the Gradient Boosting model can provide some insight into the influence of the variables, as sketched below.
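This is a minimal sketch, assuming the fitted mod_gbm object from above; the variable-influence output itself is not reproduced here, but caret's varImp() reports the relative influence of each predictor in a boosted model:

# Relative influence of each predictor in the gbm model
varImp(mod_gbm)

Finally, the random forest model, the strongest performer, is used to predict the 20 test cases: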
predict(mod_rf, tst)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E