INTRODUCTION
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The data set for this analysis is from:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
METHODS
This project uses data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. This data is used to predict whether participants did the exercise correctly or incorrectly.
This project uses two datsets: a training dataset with 19,622 observations of 53 variables, and a testing dataset of 20 observations of 53 variables. The data sets were cleaned up to removed variables that most of the observations missing.
setwd("C:/Users/Deborah Passey/Desktop")
pml_training <- read.csv("pml-training.csv", na.strings = c("","NA","#DIV/0!"))
pml_testing <- read.csv("pml-testing.csv", na.strings = c("","NA","#DIV/0!"))
training <- pml_training[lapply(pml_training, function(x) sum(is.na(x)) / length(x) ) < 0.05]
testing <- pml_testing[lapply(pml_testing, function(x) sum(is.na(x)) / length(x)) < 0.05]
training_data <- training[,-c(1:7)]
testing_data <- testing[,-c(1:7)]
training_data$classe <- unclass(training_data$classe)
testing_data$classe <- unclass(testing_data$classe)
inTrain <- createDataPartition(y=training_data$classe, p=0.7,list = FALSE)
train <- training_data[inTrain,]
test <- training_data[-inTrain,]
Data Summary for Participants
In order to cut down on processing time and predictors, the dataset was trimmed down to a select number of variables. The research from Ugulino et al (2012) was used a selection algorithm to identify 16 variables that are potentially the best at predicting whether participants did the exercise correctly: (1) Sensor on the belt - acceleration, pitch, yaw, and roll, (2) Sensor on the arm - acceleration, pitch, yaw, and roll, (3) Sensor on the forearm - acceleration, pitch, yaw, and roll, and (4) Sensor on the dumbbell - accleraton, pitch, yaw, and roll.
The data table below shows the average for each of the six participants. The table reports the average of the Euler angles: roll, pitch and yaw, and accelerometer (accel) data for the arm, dumbbell, forearm, and belt.
means <- training[,c("user_name", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")]
dataset <- means %>% group_by(user_name) %>% summarise_each(funs(mean))
summ <- kable(dataset) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px")
summ
| user_name | roll_belt | roll_arm | roll_dumbbell | roll_forearm | pitch_belt | pitch_arm | pitch_dumbbell | pitch_forearm | yaw_belt | yaw_arm | yaw_dumbbell | yaw_forearm | total_accel_arm | total_accel_belt | total_accel_dumbbell | total_accel_forearm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| adelmo | 128.2869990 | -15.70269 | 61.688812 | 0.000000 | -41.396737 | 5.204404 | -22.903124 | 0.000000 | 156.7731244 | -26.316871 | -87.25352 | 0.00000 | 22.23124 | 18.129240 | 18.411870 | 40.83505 |
| carlitos | 1.1677410 | 50.81837 | 28.207232 | 39.716886 | 6.582005 | -13.245909 | 3.884116 | 12.446048 | -93.0259961 | 4.276141 | 38.95589 | 15.42912 | 26.66388 | 2.997108 | 7.054627 | 35.08580 |
| charles | 122.0277149 | -45.72347 | -3.803530 | -11.705710 | 15.762226 | -12.492098 | -7.397196 | 24.483634 | 7.3655062 | -31.928391 | 31.64890 | -64.17740 | 22.21352 | 18.391120 | 5.337104 | 30.01103 |
| eurico | -0.0102345 | 90.25534 | 6.832875 | -1.639225 | 3.358449 | -6.694892 | 6.481118 | 4.401209 | -87.2585668 | 46.479805 | 33.62308 | 38.14837 | 28.44072 | 3.242020 | 9.288925 | 34.25896 |
| jeremy | 0.7217578 | 0.00000 | 61.315148 | 108.166149 | 4.103906 | 0.000000 | -47.904785 | 0.125050 | -87.5118460 | 0.000000 | -73.20845 | 87.32715 | 30.23633 | 4.422693 | 29.922399 | 33.36655 |
| pedro | 125.2708812 | 52.65374 | -29.186737 | 83.750153 | 25.523372 | -1.840647 | 13.321590 | 27.137759 | -0.3736169 | 18.077609 | 109.24753 | 54.25502 | 23.87088 | 19.946360 | 10.108812 | 33.81609 |
RESULTS
Three models were fit with the training data set: (1) gradient boosting model (GBM), (2) linear model, and (3) random forest. To increase repoducibility, the “set.seed(150)” function was used for each model. The GBM used a k-fold cross-validation, where the dataset is split into k-subsets. Each subset is held out while the model is trained on the other subsets. This process is completed to determine accuracy for each of the datasets, and an overall accuracy estimate is provided. The linear model and random forest models were fit with “classe” as the outcome and 16 variables as predictors.
## GBM Model
set.seed(200)
control <- trainControl(method = "cv", number = 5)
gbm_model <- train(classe ~ ., method = "gbm", data = train[,c("classe", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")], trControl= control, verbose=FALSE)
## Linear Model
set.seed(200)
lm_model <‐ train(classe ~.,data = train[,c("classe", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")],method="lm")
## Random Forest
set.seed(200)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
rf_model <- train(classe ~ ., data = train[,c("classe", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")], method = "rf", ntree = 10, trControl = control, verbose=FALSE)
trellis.par.set(caretTheme())
plot(gbm_model)
summary(gbm_model$finalModel)
## var rel.inf
## roll_belt roll_belt 33.752094
## pitch_forearm pitch_forearm 15.998537
## roll_forearm roll_forearm 14.699880
## yaw_belt yaw_belt 8.675100
## pitch_belt pitch_belt 7.445905
## total_accel_belt total_accel_belt 2.928083
## yaw_dumbbell yaw_dumbbell 2.622590
## yaw_arm yaw_arm 2.502491
## roll_arm roll_arm 2.363111
## roll_dumbbell roll_dumbbell 1.870073
## total_accel_arm total_accel_arm 1.497838
## pitch_dumbbell pitch_dumbbell 1.221761
## pitch_arm pitch_arm 1.187460
## total_accel_dumbbell total_accel_dumbbell 1.135138
## total_accel_forearm total_accel_forearm 1.069536
## yaw_forearm yaw_forearm 1.030404
gbm_predict <- predict(gbm_model, testing_data)
gbm_prediction <- chartr("12345", "ABCDE", round(gbm_predict, digits=0))
summary(lm_model$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7241 -0.9839 -0.0674 0.8922 3.7157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.601e-01 7.944e-02 4.533 5.87e-06 ***
## roll_belt -4.737e-03 1.216e-03 -3.896 9.82e-05 ***
## roll_arm 1.657e-03 2.010e-04 8.243 < 2e-16 ***
## roll_dumbbell 7.636e-04 2.172e-04 3.516 0.00044 ***
## roll_forearm 9.796e-04 1.145e-04 8.552 < 2e-16 ***
## pitch_belt -1.392e-02 1.482e-03 -9.393 < 2e-16 ***
## pitch_arm -7.721e-03 3.767e-04 -20.497 < 2e-16 ***
## pitch_dumbbell -5.788e-04 4.293e-04 -1.348 0.17767
## pitch_forearm 1.855e-02 4.633e-04 40.032 < 2e-16 ***
## yaw_belt -6.957e-03 5.272e-04 -13.195 < 2e-16 ***
## yaw_arm 7.767e-04 1.752e-04 4.433 9.36e-06 ***
## yaw_dumbbell -2.753e-03 2.288e-04 -12.031 < 2e-16 ***
## yaw_forearm -8.186e-04 1.291e-04 -6.343 2.33e-10 ***
## total_accel_arm -5.045e-03 1.131e-03 -4.461 8.21e-06 ***
## total_accel_belt 1.071e-01 8.423e-03 12.712 < 2e-16 ***
## total_accel_dumbbell 7.404e-05 1.537e-03 0.048 0.96158
## total_accel_forearm 3.604e-02 1.233e-03 29.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.299 on 13720 degrees of freedom
## Multiple R-squared: 0.2293, Adjusted R-squared: 0.2284
## F-statistic: 255.2 on 16 and 13720 DF, p-value: < 2.2e-16
lm_predict <‐ predict(lm_model,testing_data)
lm_prediction <- chartr("12345", "ABCDE", round(lm_predict, digits=0))
summary(rf_model$finalModel)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 13737 -none- numeric
## mse 10 -none- numeric
## rsq 10 -none- numeric
## oob.times 13737 -none- numeric
## importance 16 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 13737 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 16 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 2 -none- list
rf_predict <‐ predict(rf_model,testing_data)
rf_prediction <- chartr("12345", "ABCDE", round(rf_predict, digits=0))
## Comparisons
conf.matrix <- round(prop.table(table(lm_prediction, gbm_prediction, rf_prediction), 3), 3)
conf.matrix
## , , rf_prediction = A
##
## gbm_prediction
## lm_prediction A B C D E
## A 0.143 0.000 0.000 0.000 0.000
## B 0.286 0.286 0.000 0.000 0.000
## C 0.143 0.143 0.000 0.000 0.000
## D 0.000 0.000 0.000 0.000 0.000
##
## , , rf_prediction = B
##
## gbm_prediction
## lm_prediction A B C D E
## A 0.000 0.000 0.000 0.000 0.000
## B 0.000 0.250 0.000 0.000 0.000
## C 0.000 0.125 0.375 0.000 0.000
## D 0.000 0.000 0.250 0.000 0.000
##
## , , rf_prediction = C
##
## gbm_prediction
## lm_prediction A B C D E
## A 0.000 0.000 0.000 0.000 0.000
## B 0.000 0.000 0.000 0.000 0.000
## C 0.000 0.000 1.000 0.000 0.000
## D 0.000 0.000 0.000 0.000 0.000
##
## , , rf_prediction = D
##
## gbm_prediction
## lm_prediction A B C D E
## A 0.000 0.000 0.000 0.000 0.000
## B 0.000 0.000 0.000 0.000 0.000
## C 0.000 0.000 0.000 1.000 0.000
## D 0.000 0.000 0.000 0.000 0.000
##
## , , rf_prediction = E
##
## gbm_prediction
## lm_prediction A B C D E
## A 0.000 0.000 0.000 0.000 0.000
## B 0.000 0.000 0.333 0.000 0.000
## C 0.000 0.000 0.000 0.333 0.333
## D 0.000 0.000 0.000 0.000 0.000
CONCLUSIONS
The predicted cases for all three models is found below. The random forest model performed well and was used to predict the 20 cases for the quiz.
prediction_results <- cbind(gbm_prediction, lm_prediction, rf_prediction)
kable(prediction_results) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| gbm_prediction | lm_prediction | rf_prediction |
|---|---|---|
| C | D | B |
| A | C | A |
| B | B | B |
| B | B | A |
| B | B | A |
| E | C | E |
| D | C | D |
| C | D | B |
| A | A | A |
| B | C | A |
| C | C | B |
| C | C | C |
| C | C | B |
| A | B | A |
| D | C | E |
| C | B | E |
| A | B | A |
| C | C | B |
| B | C | B |
| B | B | B |