Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
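For reproducibility, both files can be fetched directly from those URLs before loading (a minimal sketch; it assumes a writable working directory and the same filenames used by the read.csv calls below):
# Download the CSVs once if they are not already present
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainURL, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testURL, "pml-testing.csv")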
# Load the data, treating blank, "NA", and "#DIV/0!" strings as missing values;
# stringsAsFactors = TRUE keeps user_name and classe as factors (the read.csv default changed in R 4.0)
library(caret)  # train, createDataPartition, confusionMatrix
Test <- read.csv("pml-testing.csv", na.strings = c("", "NA", "#DIV/0!"), stringsAsFactors = TRUE)
Train <- read.csv("pml-training.csv", na.strings = c("", "NA", "#DIV/0!"), stringsAsFactors = TRUE)
Before attempting to tidy the data, I explored the dataset to get a better understanding of what is available.
# Dimensions of data
dim(Train)
head(Train)
summary(Train)
sapply(Train, class)
table(Train$classe)
duplicated(colnames(Train))
The training data has 19622 instances with 160 attributes, which is worth reducing. I experimented with removing columns that have a high percentage of NA values; thresholds of either 50% or 95% NAs both left 60 attributes.
CTrain <- Train[, -which(colMeans(is.na(Train)) > .5)]
dim(CTrain)
## [1] 19622 60
CTrain <- Train[, -which(colMeans(is.na(Train)) > .95)]
dim(CTrain)
## [1] 19622 60
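The two thresholds agree because the NA fractions are essentially bimodal: each column is either nearly complete or almost entirely NA. A quick check makes this visible (a sketch; naFrac is a name introduced here):
# Count columns by their fraction of missing values
naFrac <- colMeans(is.na(Train))
table(cut(naFrac, breaks = c(-0.01, 0.5, 0.95, 1)))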
I continued exploring the data, looking at how the observations are distributed across the users and across the classe outcome. I then removed the superfluous identifier columns to streamline the data and improve the modeling predictions.
levels(CTrain$user_name)
## [1] "adelmo" "carlitos" "charles" "eurico" "jeremy" "pedro"
Upercent <- prop.table(table(CTrain$user_name)) * 100
cbind(freq = table(CTrain$user_name), percentage = Upercent)
## freq percentage
## adelmo 3892 19.83488
## carlitos 3112 15.85975
## charles 3536 18.02059
## eurico 3070 15.64570
## jeremy 3402 17.33768
## pedro 2610 13.30140
plot(CTrain$user_name, main = "Observations per participant")
levels(CTrain$classe)
## [1] "A" "B" "C" "D" "E"
Cpercent <- prop.table(table(CTrain$classe)) * 100
cbind(freq = table(CTrain$classe), percentage = Cpercent)
## freq percentage
## A 5580 28.43747
## B 3797 19.35073
## C 3422 17.43961
## D 3216 16.38977
## E 3607 18.38243
plot(CTrain$classe, main = "Observations per classe")
### Remove likely unnecessary columns
TempTrain <- !names(CTrain) %in%
  c('X', 'user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2',
    'cvtd_timestamp', 'new_window')
CTrain <- CTrain[, TempTrain]
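Dropping the six identifier columns should leave 54 columns, i.e. 53 predictors plus the classe outcome, which matches the model summary further down (a quick sanity check):
# Expect 19622 rows and 54 columns (53 predictors + classe)
dim(CTrain)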
Basic partitioning of the training data into training, cross-validation, and hold-out sets, leaving the actual testing data untouched.
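Both the partitioning and the model fits are stochastic, so setting a seed first makes the run reproducible (the seed value itself is an arbitrary assumption; the original run did not record one):
# Arbitrary seed for reproducible partitions and fits
set.seed(32123)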
# 60% of the cleaned data for training
inTrain <- createDataPartition(CTrain$classe, p = 0.6)[[1]]
CrossV <- CTrain[-inTrain, ]
PTrain <- CTrain[inTrain, ]
# Split the remaining 40% into cross-validation (75%) and hold-out (25%) sets
inCV <- createDataPartition(CrossV$classe, p = 0.75)[[1]]
CrossVtest <- CrossV[-inCV, ]
CrossV <- CrossV[inCV, ]
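The resulting split is roughly 60% training, 30% cross-validation, and 10% hold-out, which can be confirmed directly (a sketch):
# Share of the cleaned data in each partition
round(c(train = nrow(PTrain), cv = nrow(CrossV), holdout = nrow(CrossVtest)) / nrow(CTrain), 2)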
I attempted three different models. The Random Forest had the highest accuracy and the lowest in-sample and out-of-sample error, so it was selected. The model also performed excellently in cross-validation tests.
# Fit the candidate models on the training partition only, so CrossV stays out of sample
mod1 <- train(classe ~ ., data = PTrain, method = "rf")
mod2 <- train(classe ~ ., data = PTrain, method = "gbm")
mod3 <- train(classe ~ ., data = PTrain, method = "lda")
pred1 <- predict(mod1, CrossV)
pred2 <- predict(mod2, CrossV)
pred3 <- predict(mod3, CrossV)
### Confusion Matrices
confusionMatrix(pred1, CrossV$classe)
confusionMatrix(pred2, CrossV$classe)
confusionMatrix(pred3, CrossV$classe)
### Create Combination Model
Cmodel <- data.frame(pred1, pred2, classe=CrossV$classe)
Cmodel2 <- data.frame(pred2, pred3, classe=CrossV$classe)
Cmodel3 <- data.frame(pred1, pred2, pred3, classe=CrossV$classe)
CmodelFit <- train(classe ~ ., method="rf", data=Cmodel)
CmodelFit2 <- train(classe ~ ., method="rf", data=Cmodel2)
CmodelFit3 <- train(classe ~ ., method="rf", data=Cmodel3)
#### in-sample error
CmodelFitIn <- predict(CmodelFit, Cmodel)
CmodelFitIn2 <- predict(CmodelFit2, Cmodel2)
CmodelFitIn3 <- predict(CmodelFit3, Cmodel3)
confusionMatrix(CmodelFitIn, Cmodel$classe)
confusionMatrix(CmodelFitIn2, Cmodel2$classe)
confusionMatrix(CmodelFitIn3, Cmodel3$classe)
#### out-of-sample error on the hold-out set
pred1test <- predict(mod1, CrossVtest)
pred3test <- predict(mod3, CrossVtest)
confusionMatrix(pred1test, CrossVtest$classe)
confusionMatrix(pred3test, CrossVtest$classe)
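To score the stacked model on the hold-out set as well, the base-model predictions on CrossVtest have to be assembled under the same column names used to build Cmodel3 (a sketch; testDF is a name introduced here):
# Base-model predictions on the hold-out set, named to match Cmodel3
testDF <- data.frame(pred1 = predict(mod1, CrossVtest),
                     pred2 = predict(mod2, CrossVtest),
                     pred3 = predict(mod3, CrossVtest))
confusionMatrix(predict(CmodelFit3, testDF), CrossVtest$classe)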
The Random Forest model with 5-fold cross-validation will be used as the predictor rather than the other methods or a combination of methods, for two reasons:
* RF handles large numbers of inputs well, including when the interactions between them are unknown.
* RF provides a built-in out-of-bag (OOB) estimate that approximates the out-of-sample error rate.
I then explored the model with various visualizations, including a ranking of the most important variables.
RFmodel <- train(classe ~ ., data = CTrain, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))
save(RFmodel, file = "RFmodel2.Rda")
print(RFmodel)
## Random Forest
##
## 19622 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15697, 15699, 15697, 15698, 15697
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9962287 0.9952296
## 27 0.9984711 0.9980662
## 53 0.9960759 0.9950359
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
plot(RFmodel)
varImp(RFmodel)
## rf variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## num_window 100.000
## roll_belt 61.803
## pitch_forearm 37.701
## yaw_belt 31.640
## magnet_dumbbell_z 28.210
## magnet_dumbbell_y 27.872
## pitch_belt 27.056
## roll_forearm 22.823
## accel_dumbbell_y 12.473
## roll_dumbbell 10.433
## accel_belt_z 10.374
## magnet_dumbbell_x 10.314
## accel_forearm_x 9.818
## total_accel_dumbbell 8.911
## accel_dumbbell_z 7.930
## magnet_forearm_z 7.043
## magnet_belt_z 6.868
## magnet_belt_y 6.436
## magnet_belt_x 5.569
## roll_arm 5.141
plot(varImp(RFmodel))
plot(varImp(RFmodel), main = "Importance of Top 30 Variables", top = 30)
plot(varImp(RFmodel), main = "Importance of Top 15 Variables", top = 15)
plot(varImp(RFmodel), main = "Importance of Top 10 Variables", top = 10)
RFmodel$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.13%
## Confusion matrix:
## A B C D E class.error
## A 5578 1 0 0 1 0.0003584229
## B 4 3790 2 1 0 0.0018435607
## C 0 5 3417 0 0 0.0014611338
## D 0 0 7 3207 2 0.0027985075
## E 0 0 0 2 3605 0.0005544774
Overall, this model has a low error rate, consistently under 0.16%.
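The OOB estimate can also be pulled programmatically from the underlying randomForest object (a sketch; err.rate is the per-tree error matrix that randomForest stores, with an "OOB" column):
# OOB error after the final tree, as a percentage
oob <- RFmodel$finalModel$err.rate[RFmodel$finalModel$ntree, "OOB"]
round(oob * 100, 2)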
I intersected the cleaned training data's column names with the testing data to ensure the same predictor set, then ran predictions against the testing data.
CTest <- Test[ , intersect(names(CTrain), names(Test))]
RFpredict <- predict(RFmodel, CTest)
# Test has no classe column, so inspect the cross-validated confusion matrix instead
confusionMatrix(RFmodel, norm = "none")
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are un-normalized aggregated counts)
##
## Reference
## Prediction A B C D E
## A 5578 6 0 0 0
## B 1 3788 6 0 0
## C 0 2 3416 10 0
## D 0 1 0 3205 2
## E 1 0 0 1 3605
##
## Accuracy (average) : 0.9985
The cross-validated accuracy remains high, consistently above 99.8%.
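Finally, the 20 test-case predictions can be listed against their identifiers (a sketch; problem_id is the case-id column that pml-testing.csv carries in place of classe):
# Predicted classe for each of the 20 test cases
data.frame(problem_id = Test$problem_id, prediction = RFpredict)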