Assignment description

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who each performed a bicep curl in 5 different ways, and to build a model that predicts which of the 5 ways a curl was performed.
The analysis uses the caret and parsnip packages. Two pre-processing methods were created (normalization and PCA), followed by three models: tree, generalized linear (multinomial regression), and random forest. Each pre-processing method was combined with each model in a workflow. The metrics of each model were then compared and poorly performing models were eliminated. Finally, the best-fitting model was applied to the unseen testing data (called the validation data below).

1. The question: with the given dataset can we create a model that correctly classifies the 5 different bicep curl movements?

2. Input Data

This data is from: Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Read more: http://groupware.les.inf.puc-rio.br/har.

3. Features: what to use as predictors

Looking at the original data we see a lot of NA values, so columns with more than 90% NA values are removed. The participant might make a difference, so user_name is converted to a factor. I am not going to use time slices, so the time columns can be dropped.
Using the recipes package I created two different recipes to pre-process the data. The first recipe normalizes the data: it accounts for any variables that might have had a large variance by centering them on zero and scaling them to a standard deviation of one. The normalized recipe also looks for and eliminates any predictors that have a variance very close to zero. The second recipe uses Principal Component Analysis (pca in the rest of the code), keeping 4 components. Both recipes convert the user_name predictor into dummy variables.
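
The pre-processing steps described above can be sketched in base R. This is only an illustration under stated assumptions: `toy` is a made-up data frame rather than the accelerometer data, a plain variance threshold stands in for step_nzv() (which, as far as I know, works from frequency ratios and the share of unique values), and PCA is run on the normalized columns and keeps two components, which are choices made for the sketch.

```r
# Illustrative only: `toy` is a made-up data frame, not the assignment data.
set.seed(1)
toy <- data.frame(
  roll   = rnorm(100, mean = 50, sd = 20),
  pitch  = rnorm(100, mean = -5, sd = 2),
  yaw    = rnorm(100, mean = 0,  sd = 90),
  sparse = c(1, rep(NA, 99)),  # 99% missing, should be dropped
  const  = rep(3, 100)         # zero variance, should be dropped
)

# 1. Drop columns with more than 90% missing values
keep_na <- vapply(toy, function(x) mean(is.na(x)) <= 0.9, logical(1))
toy <- toy[, keep_na, drop = FALSE]

# 2. Drop predictors whose variance is (near) zero
#    (assumption: a simple threshold, not the rule step_nzv() applies)
keep_var <- vapply(toy, function(x) var(x) > 1e-8, logical(1))
toy <- toy[, keep_var, drop = FALSE]

# 3. Normalize: center on zero and scale to a standard deviation of one
toy_norm <- as.data.frame(scale(toy))

# 4. PCA on the normalized predictors, keeping the first two components
pca <- prcomp(toy_norm, center = FALSE, scale. = FALSE)
toy_pca <- as.data.frame(pca$x[, 1:2])

names(toy)                     # predictors that survived steps 1 and 2
round(colMeans(toy_norm), 10)  # means are approximately zero
apply(toy_norm, 2, sd)         # standard deviations are one
```

Normalizing before PCA keeps a predictor with large raw units from dominating the components, which is worth keeping in mind when comparing the two recipes.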

4. Algorithms

The training data was analyzed using three different methods: multinomial regression, a tree model, and a random forest model. Each model was run twice: once with the normalized recipe, and again with the PCA recipe.

5. Parameters - which predictors were used?

Multinomial regression important predictors:

Normalized
Variable Importance Sign
yaw_belt 24.695736 POS
roll_belt 18.355697 NEG
pitch_belt 13.349379 POS
magnet_dumbbell_z 6.870658 NEG
magnet_dumbbell_x 6.130485 POS
user_name_eurico 3.634274 POS
user_name_jeremy 3.621108 POS
magnet_belt_z 3.178922 NEG
accel_arm_z 3.130332 NEG
magnet_dumbbell_y 3.075882 POS
PCA
Variable Importance Sign
PC2 0.0012430 POS
PC3 0.0003528 NEG
PC4 0.0001386 POS
PC1 0.0000000 NEG

Tree method important predictors:

Normalized
Variable Importance
roll_belt 1636.8128
accel_belt_z 830.0889
pitch_forearm 817.8788
magnet_dumbbell_y 813.5692
accel_dumbbell_y 779.7746
total_accel_dumbbell 734.4937
total_accel_belt 704.5511
pitch_belt 699.4032
roll_dumbbell 434.4295
yaw_belt 426.0193
PCA
Variable Importance
PC2 679.60503
PC4 168.47264
PC1 33.17063
PC3 27.06690

Random Forest important predictors:

Normalized
Variable Importance
pitch_forearm 779.0456
roll_belt 762.5084
yaw_belt 749.8133
pitch_belt 544.6227
magnet_dumbbell_y 530.7919
magnet_dumbbell_z 489.0929
roll_forearm 454.2440
magnet_dumbbell_x 383.2810
accel_belt_z 341.9422
magnet_belt_z 321.8242
PCA
Variable Importance
PC3 2620.208
PC2 2568.502
PC4 2525.222
PC1 2331.543
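
As a rough illustration of where a tree importance score comes from, the sketch below fits a single rpart tree and reads the importance values stored on the fitted object. It assumes the built-in iris data as a stand-in for the accelerometer data, so the numbers are unrelated to the tables above. Importance scores are model-specific, so values from the multinomial, tree, and random forest tables should not be compared with one another.

```r
library(rpart)  # recommended package shipped with standard R installs

# Assumption: iris stands in for the accelerometer data.
fit <- rpart(Species ~ ., data = iris, method = "class")

# rpart stores one importance score per predictor; a larger score means the
# predictor did more to improve the splits (as a primary or surrogate variable).
importance <- sort(fit$variable.importance, decreasing = TRUE)
print(round(importance, 2))
```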

6. How good were the models?

Given here is the area under the Receiver Operating Characteristic curve (ROC AUC) for each of the 6 models, measured on the held-out testing split. The first row is the normalized pre-process and the second row is the principal component analysis (PCA) pre-process.

ROC AUC values
Pre-process  Multinomial  Tree   Random_Forest
Normalized   0.650        0.854  1.000
PCA          0.583        0.640  0.873

In this table we see that the normalized random forest model has the highest ROC AUC, which gives the most confidence that it will have the smallest out-of-sample error. A value of 1.000 is suspiciously high, however, and it is worth considering whether re-sampling, applied to the other methods as well, might provide a better prediction on the validation data.
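
To make the ROC AUC figure more concrete, here is a minimal base R sketch of a macro-averaged one-vs-rest AUC. The labels and probabilities are hypothetical, and the averaging scheme is an assumption made for illustration: yardstick may use a different multiclass estimator, so this is not expected to reproduce the values in the table.

```r
# AUC for one class against the rest, via the rank-sum (Mann-Whitney) identity
auc_one_vs_rest <- function(score, is_positive) {
  r <- rank(score)  # average ranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical class probabilities for six observations and three classes
truth <- factor(c("A", "A", "B", "B", "C", "C"))
probs <- cbind(
  A = c(0.8, 0.6, 0.3, 0.1, 0.2, 0.10),
  B = c(0.1, 0.3, 0.5, 0.6, 0.2, 0.65),
  C = c(0.1, 0.1, 0.2, 0.3, 0.6, 0.25)
)

# One AUC per class, treating that class as "positive" and the rest as "negative"
per_class <- vapply(levels(truth), function(cl) {
  auc_one_vs_rest(probs[, cl], truth == cl)
}, numeric(1))

macro_auc <- mean(per_class)
print(per_class)   # A = 1, B = 0.75, C = 0.875
print(macro_auc)   # 0.875
```

An AUC of 1 for a class means every observation of that class received a higher probability than every observation outside it, which is why a value of exactly 1.000 on held-out data deserves a second look.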

The next table provides the accuracy measures of the six models.

Accuracy
Pre-process  Multinomial  Tree   Random_Forest
Normalized   0.321        0.632  0.990
PCA          0.284        0.363  0.625

From this table we see that the normalized random forest model reached 99.0% accuracy on the held-out testing split, so its expected out-of-sample error on the validation data is roughly 1%.
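
The accuracy figure is the share of held-out observations that were classified correctly. A small base R sketch with hypothetical labels (not the project's predictions) shows the arithmetic:

```r
# Hypothetical true and predicted labels for ten held-out observations
classes   <- c("A", "B", "C", "D", "E")
truth     <- factor(c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
                    levels = classes)
predicted <- factor(c("A", "A", "B", "C", "C", "C", "D", "D", "E", "A"),
                    levels = classes)

# Confusion matrix: rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Truth = truth)

accuracy   <- sum(diag(cm)) / sum(cm)  # share of correct predictions
error_rate <- 1 - accuracy             # estimated out-of-sample error

print(cm)
print(accuracy)    # 0.8
print(error_rate)  # 0.2
```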

Test the predictors:

Given below is the normalized random forest model applied to the validation data set:

.pred_class...1 .pred_class...2 .pred_A .pred_B .pred_C .pred_D .pred_E
B B 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
A A 0.9400000 0.0000000 0.0600000 0.0000000 0.0000000
B B 0.0750000 0.8250000 0.1000000 0.0000000 0.0000000
A A 0.9875000 0.0000000 0.0125000 0.0000000 0.0000000
A A 0.8142857 0.0857143 0.1000000 0.0000000 0.0000000
E E 0.0000000 0.2000000 0.1750000 0.0555556 0.5694444
D D 0.0000000 0.0444444 0.1861111 0.7694444 0.0000000
B B 0.1416667 0.3666667 0.0983333 0.2933333 0.1000000
A A 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
A A 0.9000000 0.0000000 0.0000000 0.0666667 0.0333333
B B 0.0333333 0.8041667 0.1125000 0.0500000 0.0000000
C C 0.0000000 0.1000000 0.8033333 0.0000000 0.0966667
B B 0.0000000 0.9000000 0.1000000 0.0000000 0.0000000
A A 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
E E 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000
E E 0.1000000 0.1000000 0.0000000 0.0000000 0.8000000
A A 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
B B 0.1000000 0.8571429 0.0428571 0.0000000 0.0000000
B B 0.3000000 0.6571429 0.0000000 0.0428571 0.0000000
B B 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
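
In the table above, each predicted class matches the column with the largest probability. The sketch below checks this in base R for four rows copied from the table (rows 1, 2, 3, and 8); the object names are hypothetical.

```r
# Probability columns for rows 1, 2, 3, and 8 of the table above
probs <- rbind(
  c(0.0000000, 1.0000000, 0.0000000, 0.0000000, 0.0000000),
  c(0.9400000, 0.0000000, 0.0600000, 0.0000000, 0.0000000),
  c(0.0750000, 0.8250000, 0.1000000, 0.0000000, 0.0000000),
  c(0.1416667, 0.3666667, 0.0983333, 0.2933333, 0.1000000)
)
colnames(probs) <- c("A", "B", "C", "D", "E")

# Pick the class whose column holds the largest probability in each row
pred_class <- colnames(probs)[max.col(probs, ties.method = "first")]

# The winning probability shows how decisive each prediction was
confidence <- apply(probs, 1, max)

print(pred_class)  # "B" "A" "B" "B"
print(confidence)
```

The last row is a reminder that a correct class label can still rest on a weak majority: its winning probability is only about 0.37.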

Conclusion: The final results give a 100% correct prediction on the validation data. Even so, the extremely high ROC AUC of the normalized random forest model suggests the code should be re-run using different re-sampling methods to confirm the result.
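
One re-sampling method that could be tried is k-fold cross-validation. The sketch below is a minimal base R version under stated assumptions: the built-in iris data stands in for the training data, a single rpart tree stands in for the models above, and k = 5 is an arbitrary choice. In tidymodels, rsample::vfold_cv() and tune::fit_resamples() would be the natural tools for the same job.

```r
library(rpart)  # recommended package shipped with standard R installs

# Assumption: iris stands in for the training data; k = 5 is arbitrary.
set.seed(234)
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- vapply(seq_len(k), function(i) {
  analysis   <- iris[fold_id != i, ]  # rows used to fit the model
  assessment <- iris[fold_id == i, ]  # rows held out for scoring
  fit  <- rpart(Species ~ ., data = analysis, method = "class")
  pred <- predict(fit, newdata = assessment, type = "class")
  mean(pred == assessment$Species)
}, numeric(1))

cv_accuracy <- mean(fold_accuracy)  # average accuracy across folds
cv_sd       <- sd(fold_accuracy)    # spread between folds

print(round(fold_accuracy, 3))
print(round(c(mean = cv_accuracy, sd = cv_sd), 3))
```

Averaging over several held-out folds gives an estimate that depends less on one lucky or unlucky split, and the spread between folds hints at how stable the estimate is.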

Appendix: R code

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
## libraries
library(tidyverse)
library(caret)      # createDataPartition, confusionMatrix
library(tidymodels) # for the dials functions
library(parsnip)
library(recipes)
library(workflows)
library(yardstick)  # for roc_auc and accuracy
library(vip)
library(knitr)
# Data urls
if(!file.exists("./data")){dir.create("./data")}
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
#download.file(url1, destfile = "./data/pml-training.csv")
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#download.file(url2, destfile = "./data/pml-testing.csv")
## Data Source
website <- "http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har."
validation <- read_csv("./data/pml-testing.csv")
df <- read_csv("./data/pml-training.csv")
# making NA explicit
df <- df %>% complete(.)
# Remove the columns with more than 90% NA values
df <- df[, sapply(df, function(x) mean(is.na(x))) <= 0.9]
# convert user_name into factors
df$user_name <-
        parse_factor(df$user_name, 
                     levels = c("carlitos", "pedro","adelmo",
                                "charles","eurico","jeremy"),
                     ordered = FALSE)
# get rid of the index, time, and window cols
df <- 
        df %>% 
        select(-(...1 | raw_timestamp_part_1:num_window))
# Split the Data (seed set so that the partition is reproducible)
set.seed(234)
inTrain <- createDataPartition(df$classe, p = 0.8, list = FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
# Convert classe into factors
training$classe <- 
        parse_factor(training$classe, 
                     levels = c("A","B","C","D","E"), 
                     ordered = FALSE)
testing$classe <- 
        parse_factor(testing$classe, 
                     levels = c("A","B","C","D","E"), 
                     ordered = FALSE)
## Recipes 
norm_recipe <-
        recipe(classe ~ ., data = training) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_nzv(all_predictors()) %>%
        step_normalize(all_numeric_predictors())
# Principal Component Analysis Recipe
pca_recipe <-
        recipe(classe ~., data = training) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_pca(all_numeric_predictors(), num_comp = 4)
## Multinomial Regression  
mr_model <- 
        multinom_reg(penalty = 0.1) %>%
        set_engine("glmnet") %>%
        set_mode("classification")
# Normalized and Centered
mr_workflow_n <-
        workflow() %>%
        add_model(mr_model) %>%
        add_recipe(norm_recipe)
# Fit the mr n model
set.seed(234)
mr_fit_n <- 
        mr_workflow_n %>%
        fit(data = training)
# Multinomial pca
mr_workflow_pca <- 
        workflow() %>%
        add_model(mr_model) %>%
        add_recipe(pca_recipe)
## Fit the Multinomial pca model
set.seed(234)
mr_fit_pca <- 
        mr_workflow_pca %>%
        fit(data = training)
## Tree Model
tune_spec <- 
        decision_tree(cost_complexity = 0.02,
                      tree_depth = 10,
                      min_n = 1000) %>%
        set_engine("rpart") %>%
        set_mode("classification")
# Normalized Tree
tree_wf_n <- 
        workflow() %>%
        add_model(tune_spec) %>%
        add_recipe(norm_recipe)
# Fit the Normalized Tree
set.seed(234)
tree_fit_n <- 
        tree_wf_n %>%
        fit(data=training)
# Tree pca
tree_wf_pca <-
        workflow() %>%
        add_model(tune_spec) %>%
        add_recipe(pca_recipe)
# Fit Tree pca
set.seed(567)
tree_fit_pca <- 
        tree_wf_pca %>%
        fit(data=training)
## Random Forest Model
randf_model <- 
        rand_forest(mtry = 5, trees = 10) %>%
        set_engine("ranger", importance = "impurity") %>%
        set_mode("classification")
# Random Forest pca
randf_wf_pca <-
        workflow() %>%
        add_model(randf_model) %>%
        add_recipe(pca_recipe)
## Fit the Random Forest pca
set.seed(567)
randf_fit_pca <- 
        randf_wf_pca %>%
        fit(data=training)
# Normalized Random Forest
randf_wf_n <-
        workflow() %>%
        add_model(randf_model) %>%
        add_recipe(norm_recipe)
## Fit Normalized Random Forest
set.seed(234)
randf_fit_n <- 
        randf_wf_n %>%
        fit(data=training)
mr_n_vip <- mr_fit_n %>%
        extract_fit_parsnip() %>%
        vip()
mr_n_vip$data
mr_pca_vip <- mr_fit_pca %>%
        extract_fit_parsnip() %>%
        vip()
mr_pca_vip$data
tree_n_vip <- tree_fit_n %>%
        extract_fit_parsnip() %>%
        vip()
tree_pca_vip <- tree_fit_pca %>%
        extract_fit_parsnip() %>%
        vip()
randf_n_vip <- randf_fit_n %>%
        extract_fit_parsnip() %>%
        vip()
randf_pca_vip <- randf_fit_pca %>%
        extract_fit_parsnip() %>%
        vip()
mr_vip_n <- knitr::kable(mr_n_vip$data, caption = "Normalized")
mr_vip_pca <- knitr::kable(mr_pca_vip$data, caption = "PCA")
tree_vip_n <- knitr::kable(tree_n_vip$data, caption = "Normalized")
tree_vip_pca <- knitr::kable(tree_pca_vip$data, caption = "PCA")
randf_vip_n <- knitr::kable(randf_n_vip$data, caption = "Normalized")
randf_vip_pca <- knitr::kable(randf_pca_vip$data, caption = "PCA")
# Predict using the norm center model
mr_pred_n <- 
        predict(mr_fit_n, testing) %>%
        bind_cols(predict(mr_fit_n, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
# Check the quality of fit; truth is the classe column bound onto the predictions
mr_n_roc_auc <- 
        mr_pred_n %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
mr_n_roc_auc_est <- round(mr_n_roc_auc$.estimate,3)
mr_n_accuracy <-
        mr_pred_n %>%
        accuracy(truth = classe, estimate = .pred_class)
mr_n_acc_est <- round(mr_n_accuracy$.estimate, 3)
## predict with the mr pca model
mr_pred_pca <- 
        predict(mr_fit_pca, testing) %>%
        bind_cols(predict(mr_fit_pca, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check the quality of fit; truth is the classe column bound onto the predictions
mr_pca_roc_auc <-
        mr_pred_pca %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
mr_pca_accuracy <- 
        mr_pred_pca %>%
        accuracy(truth = classe, estimate = .pred_class)
mr_pca_roc_auc_est <- round(mr_pca_roc_auc$.estimate,3)
mr_pca_acc_est <- round(mr_pca_accuracy$.estimate ,3)
# Predict tree using the norm center model
tree_pred_n <- 
        predict(tree_fit_n, testing) %>%
        bind_cols(predict(tree_fit_n, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit; truth is the classe column bound onto the predictions
tree_n_roc_auc <-
        tree_pred_n %>%
        roc_auc(truth = classe, 
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
tree_n_roc_auc_est <- round(tree_n_roc_auc$.estimate, 3)
tree_n_accuracy <-
        tree_pred_n %>%
        accuracy(truth = classe, estimate = .pred_class)
tree_n_acc_est <- round(tree_n_accuracy$.estimate, 3)
# Predict tree using the pca recipe
tree_pred_pca <- 
        predict(tree_fit_pca, testing) %>%
        bind_cols(predict(tree_fit_pca, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit; truth is the classe column bound onto the predictions
tree_pca_roc_auc <-
        tree_pred_pca %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
tree_pca_roc_auc_est <- round(tree_pca_roc_auc$.estimate, 3)
tree_pca_accuracy <-
        tree_pred_pca %>%
        accuracy(truth = classe, estimate = .pred_class)
tree_pca_acc_est <- round(tree_pca_accuracy$.estimate, 3)
# Predict randf using the pca recipe
randf_predict_pca <- 
        predict(randf_fit_pca, testing) %>%
        bind_cols(predict(randf_fit_pca, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit
# Compare the predicted class (not the true classe column) with the truth
randf_pca_cm <- 
        confusionMatrix(randf_predict_pca$.pred_class, testing$classe)
randf_pca_roc_auc <-
        randf_predict_pca %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
randf_pca_roc_auc_est <- 
        round(randf_pca_roc_auc$.estimate, 3)
randf_pca_accuracy <-
        randf_predict_pca %>%
        accuracy(truth = classe, estimate = .pred_class)
randf_pca_acc_est <- 
        round(randf_pca_accuracy$.estimate, 3)
## Predict using the Normalized Random Forest----------------------------------
randf_pred_n <- 
        predict(randf_fit_n, testing) %>%
        bind_cols(predict(randf_fit_n, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit; truth is the classe column bound onto the predictions
randf_n_roc_auc <-
        randf_pred_n %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
randf_n_accuracy <-
        randf_pred_n %>%
        accuracy(truth = classe, estimate = .pred_class)
randf_n_est <-
        round(randf_n_roc_auc$.estimate, 3)
randf_n_acc_est <-
        round(randf_n_accuracy$.estimate, 3)
# Tables of estimate values
my_roc_auc_table <- knitr::kable(data.frame(ROC_AUC_estimate = c("Normalized", "PCA"),
                Multinomial = c(mr_n_roc_auc_est, mr_pca_roc_auc_est),
                Tree = c(tree_n_roc_auc_est, tree_pca_roc_auc_est),
                Random_Forest = c(randf_n_est,randf_pca_roc_auc_est)),
                caption = "ROC Values")
my_acc_table <- knitr::kable(data.frame(accuracy_estimate = c("Normalized", "PCA"),
                           Multinomial = c(mr_n_acc_est, mr_pca_acc_est),
                           Tree = c(tree_n_acc_est, tree_pca_acc_est),
                           Random_Forest = c(randf_n_acc_est,randf_pca_acc_est)),
                           caption = "Accuracy")
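## Test normalized Random Forest model
# Note: the class prediction is bound twice here, which is why the output
# table shows two identical columns (.pred_class...1 and .pred_class...2)
final_predict <- 
        predict(randf_fit_n, new_data = validation) %>%
        bind_cols(predict(randf_fit_n, validation),
                  predict(randf_fit_n, validation, type = "prob"))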

final_results <-
        knitr::kable(final_predict)