Assignment description

People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, each of whom performed bicep curls in 5 different ways. The goal is to build a model that predicts which of the 5 ways a curl was performed.
The analysis uses the caret and tidymodels (parsnip, recipes, workflows) packages. Two pre-processing methods were created (normalization and PCA), followed by three models: tree, generalized linear (multinomial regression), and random forest. Each pre-processing method was combined with each model in a workflow, the metrics of each fitted workflow were compared, and poorly performing models were eliminated. Finally, the best-fitting model was applied to the unseen validation data (pml-testing.csv).

1. The question: with the given dataset can we create a model that correctly classifies the 5 different bicep curl movements?

2. Input Data

This data is from: Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

3. Features: what to use as predictors

Looking at the original data, we see a lot of NA values, so columns that are more than 90% NA were removed. The participant might make a difference, so user_name was converted into a factor. Time slices are not used, so the index, time, and window columns were dropped.
Using the recipes package, I created two different recipes to pre-process the data. The first recipe normalizes the data: it accounts for variables that might have had a large variance by centering them around zero and scaling them to a standard deviation of one. The normalized recipe also looks for and eliminates any predictors that have a variance very close to zero. The second recipe uses Principal Component Analysis (pca in the rest of the code), keeping the first four components. Both recipes convert the user_name predictor into dummy variables.
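
As a rough illustration of what the two recipes do to numeric predictors, the base R sketch below standardizes a small made-up data set and then extracts principal components. The data, the column names, and the choice of two components are assumptions for illustration only; the project itself relies on step_normalize() and step_pca(). PCA is sensitive to scale, so the sketch standardizes before extracting components.

```r
# Made-up data: three numeric columns on very different scales (hypothetical)
set.seed(1)
toy <- data.frame(
  small = rnorm(100, mean = 0,  sd = 0.1),
  mid   = rnorm(100, mean = 50, sd = 10),
  large = runif(100, min = 0,   max = 1000)
)

# Normalize: subtract each column's mean and divide by its standard deviation
toy_norm <- scale(toy, center = TRUE, scale = TRUE)

# PCA on the standardized columns; keep the first two component scores
toy_pca    <- prcomp(toy_norm)
toy_scores <- toy_pca$x[, 1:2]

# Share of the total variance captured by each component
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
```

After scaling, every column has mean 0 and standard deviation 1, so no single sensor dominates just because of its units.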

4. Algorithms

The training data was analyzed using three different methods: multinomial regression, a tree model, and a random forest model. Each model was fit twice: once with the normalized recipe, and again with the PCA recipe.

5. Parameters - which predictors were used?

Multinomial regression important predictors:

Normalized
Variable	Importance	Sign
yaw_belt	24.695736	POS
roll_belt	18.355697	NEG
pitch_belt	13.349379	POS
magnet_dumbbell_z	6.870658	NEG
magnet_dumbbell_x	6.130485	POS
user_name_eurico	3.634274	POS
user_name_jeremy	3.621108	POS
magnet_belt_z	3.178922	NEG
accel_arm_z	3.130332	NEG
magnet_dumbbell_y	3.075882	POS

PCA
Variable	Importance	Sign
PC2	0.0012430	POS
PC3	0.0003528	NEG
PC4	0.0001386	POS
PC1	0.0000000	NEG

Tree method important predictors:

Normalized
Variable	Importance
roll_belt	1636.8128
accel_belt_z	830.0889
pitch_forearm	817.8788
magnet_dumbbell_y	813.5692
accel_dumbbell_y	779.7746
total_accel_dumbbell	734.4937
total_accel_belt	704.5511
pitch_belt	699.4032
roll_dumbbell	434.4295
yaw_belt	426.0193

PCA
Variable	Importance
PC2	679.60503
PC4	168.47264
PC1	33.17063
PC3	27.06690

Random Forest important predictors:

Normalized
Variable	Importance
pitch_forearm	779.0456
roll_belt	762.5084
yaw_belt	749.8133
pitch_belt	544.6227
magnet_dumbbell_y	530.7919
magnet_dumbbell_z	489.0929
roll_forearm	454.2440
magnet_dumbbell_x	383.2810
accel_belt_z	341.9422
magnet_belt_z	321.8242

PCA
Variable	Importance
PC3	2620.208
PC2	2568.502
PC4	2525.222
PC1	2331.543

6. How good were the models?

Given here is the area under the Receiver Operating Characteristic curve (ROC AUC) for each of the 6 models, computed on the held-out testing split (20% of the training file). The first row is the normalized pre-process and the second row is the principal component pre-process.

ROC Values
ROC_AUC_estimate	Multinomial	Tree	Random_Forest
Normalized	0.650	0.854	1.000
PCA	0.583	0.640	0.873

In this table we see that the normalized random forest model has the highest ROC AUC, which suggests it would give the smallest out-of-sample error. Such a high value (1.000 after rounding) should give pause, though, because it comes from a single train/test split. It is worth considering whether re-sampling (for example, cross-validation) would change the comparison, and whether one of the other methods might then predict the validation data better.
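
For readers unfamiliar with how a multiclass ROC AUC is obtained from class probabilities, here is a minimal sketch using the hpc_cv example data that ships with yardstick (not the project data). With more than two classes, roc_auc() defaults to the Hand-Till generalization, which averages pairwise AUC values. The object names fold1 and auc are assumptions for illustration.

```r
library(yardstick)

# Example predictions bundled with yardstick: true class in `obs`,
# predicted class in `pred`, class probabilities in columns VF, F, M, L
data("hpc_cv", package = "yardstick")

# Use a single resample so the rows behave like one held-out split
fold1 <- hpc_cv[hpc_cv$Resample == "Fold01", ]

# truth is given as a bare column name, followed by the probability columns
auc <- roc_auc(fold1, truth = obs, VF:L)
auc$.estimator  # "hand_till" for a multiclass outcome
auc$.estimate
```

A value of 0.5 corresponds to random guessing and 1 to perfect ranking of the classes, which is why an estimate that rounds to 1.000 deserves a second look.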

The next table provides the accuracy measures of the six models.

Accuracy
accuracy_estimate	Multinomial	Tree	Random_Forest
Normalized	0.321	0.632	0.990
PCA	0.284	0.363	0.625

From this table we see that the normalized random forest model classified 99.0% of the held-out testing split correctly, so a similar accuracy is expected on the validation data.
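
Accuracy is simply the share of rows where the predicted class equals the true class. The base R sketch below shows the calculation and the matching confusion matrix on a handful of made-up labels; the vectors truth and pred are hypothetical and are not taken from the project data.

```r
# Hypothetical true and predicted classes for eight curls
lv    <- c("A", "B", "C", "D", "E")
truth <- factor(c("A", "A", "B", "C", "C", "D", "E", "E"), levels = lv)
pred  <- factor(c("A", "B", "B", "C", "C", "D", "E", "A"), levels = lv)

# Accuracy: proportion of predictions that match the truth (6 of 8 here)
acc <- mean(pred == truth)

# Confusion matrix: correct predictions sit on the diagonal
conf_mat <- table(Predicted = pred, Actual = truth)
```

The confusion matrix shows which classes get mixed up, which a single accuracy number hides.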

Test the predictors:

Given below is the normalized random forest model applied to the validation data set:

.pred_class...1	.pred_class...2	.pred_A	.pred_B	.pred_C	.pred_D	.pred_E
B	B	0.0000000	1.0000000	0.0000000	0.0000000	0.0000000
A	A	0.9400000	0.0000000	0.0600000	0.0000000	0.0000000
B	B	0.0750000	0.8250000	0.1000000	0.0000000	0.0000000
A	A	0.9875000	0.0000000	0.0125000	0.0000000	0.0000000
A	A	0.8142857	0.0857143	0.1000000	0.0000000	0.0000000
E	E	0.0000000	0.2000000	0.1750000	0.0555556	0.5694444
D	D	0.0000000	0.0444444	0.1861111	0.7694444	0.0000000
B	B	0.1416667	0.3666667	0.0983333	0.2933333	0.1000000
A	A	1.0000000	0.0000000	0.0000000	0.0000000	0.0000000
A	A	0.9000000	0.0000000	0.0000000	0.0666667	0.0333333
B	B	0.0333333	0.8041667	0.1125000	0.0500000	0.0000000
C	C	0.0000000	0.1000000	0.8033333	0.0000000	0.0966667
B	B	0.0000000	0.9000000	0.1000000	0.0000000	0.0000000
A	A	1.0000000	0.0000000	0.0000000	0.0000000	0.0000000
E	E	0.0000000	0.0000000	0.0000000	0.0000000	1.0000000
E	E	0.1000000	0.1000000	0.0000000	0.0000000	0.8000000
A	A	1.0000000	0.0000000	0.0000000	0.0000000	0.0000000
B	B	0.1000000	0.8571429	0.0428571	0.0000000	0.0000000
B	B	0.3000000	0.6571429	0.0000000	0.0428571	0.0000000
B	B	0.0000000	1.0000000	0.0000000	0.0000000	0.0000000
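
The predicted class in each row is the class with the largest probability. The base R sketch below reproduces that step for three rows copied from the table above and also reports the gap between the top two probabilities, a rough indication of how confident each call is. The object names probs, pred_class, and margin are assumptions for illustration.

```r
# Three rows of class probabilities copied from the table above
# (rows 1, 8, and 6 of the validation predictions)
probs <- rbind(
  c(0.0000000, 1.0000000, 0.0000000, 0.0000000, 0.0000000),
  c(0.1416667, 0.3666667, 0.0983333, 0.2933333, 0.1000000),
  c(0.0000000, 0.2000000, 0.1750000, 0.0555556, 0.5694444)
)
colnames(probs) <- c("A", "B", "C", "D", "E")

# Predicted class: the column holding the largest probability in each row
pred_class <- colnames(probs)[apply(probs, 1, which.max)]

# Margin: largest probability minus the second largest
margin <- apply(probs, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  s[1] - s[2]
})
```

The second row is still classified as B, but with a margin of only about 0.07 over D, so it is the least certain of the three.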

Conclusion: Even though the final results give a 100% correct prediction on the validation data, the extremely high ROC AUC of the normalized random forest model suggests the code should be re-run using different re-sampling methods to confirm the result.
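
One possible way to do that re-sampling, sketched here on R's built-in iris data rather than the project data, is v-fold cross-validation with the tidymodels functions vfold_cv() and fit_resamples(). The choice of a decision tree, 5 folds, and the object names folds, cv_wf, and cv_metrics are assumptions for illustration and are not part of the original analysis.

```r
library(tidymodels)

# Split the data into 5 folds, keeping class proportions similar in each fold
set.seed(234)
folds <- vfold_cv(iris, v = 5, strata = Species)

# A simple classification tree (rpart engine), stood in for the project models
cv_spec <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

cv_wf <-
  workflow() %>%
  add_model(cv_spec) %>%
  add_formula(Species ~ .)

# Fit on 4 folds and assess on the 5th, once for each fold
cv_results <- fit_resamples(
  cv_wf,
  resamples = folds,
  metrics = metric_set(accuracy, roc_auc)
)

# Mean and standard error of each metric across the 5 folds
cv_metrics <- collect_metrics(cv_results)
```

Unlike a single split, this gives a standard error for each metric, which helps judge whether a near-perfect score is stable or a lucky draw.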

Appendix: R code

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
## libraries
library(tidyverse) #*
library(caret)#*
library(tidymodels) #* for the dials function
library(parsnip)#*
library(recipes)#*
library(workflows)#*
library(yardstick)#* for the roc_auc
library(vip)
library(knitr)
# Data urls
if(!file.exists("./data")){dir.create("./data")}
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
#download.file(url1, destfile = "./data/pml-training.csv")
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#download.file(url2, destfile = "./data/pml-testing.csv")
## Data Source
website <- "http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har."
validation <- read_csv("./data/pml-testing.csv")
df <- read_csv("./data/pml-training.csv")
# making NA explicit
df <- df %>% complete(.)
# Remove the columns with more than 90% NA values
# (colMeans(is.na(df)) is the share of NA values in each column)
df <- df[, colMeans(is.na(df)) <= 0.9]
# convert user_name into factors
df$user_name <-
        parse_factor(df$user_name, 
                     levels = c("carlitos", "pedro","adelmo",
                                "charles","eurico","jeremy"),
                     ordered = FALSE)
# get rid of the index, time, and window cols
df <- 
        df %>% 
        select(-(...1 | raw_timestamp_part_1:num_window))
# Split the Data
inTrain = createDataPartition(df$classe, p = 0.8, list = FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
# Convert classe into factors
training$classe <- 
        parse_factor(training$classe, 
                     levels = c("A","B","C","D","E"), 
                     ordered = FALSE)
testing$classe <- 
        parse_factor(testing$classe, 
                     levels = c("A","B","C","D","E"), 
                     ordered = FALSE)
## Recipes 
norm_recipe <-
        recipe(classe ~ ., data = training) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_nzv(all_predictors()) %>%
        step_normalize(all_numeric_predictors())
# Principal Component Analysis Recipe
pca_recipe <-
        recipe(classe ~., data = training) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_pca(all_numeric_predictors(), num_comp = 4)
## Multinomial Regression  
mr_model <- 
        multinom_reg(penalty = 0.1) %>%
        set_engine("glmnet") %>%
        set_mode("classification")
# Normalized and Centered
mr_workflow_n <-
        workflow() %>%
        add_model(mr_model) %>%
        add_recipe(norm_recipe)
# Fit the mr n model
set.seed(234)
mr_fit_n <- 
        mr_workflow_n %>%
        fit(data = training)
# Multinomial pca
mr_workflow_pca <- 
        workflow() %>%
        add_model(mr_model) %>%
        add_recipe(pca_recipe)
## Fit the Multinomial pca model
set.seed(234)
mr_fit_pca <- 
        mr_workflow_pca %>%
        fit(data = training)
## Tree Model
tune_spec <- 
        decision_tree(cost_complexity = 0.02,
                      tree_depth = 10,
                      min_n = 1000) %>%
        set_engine("rpart") %>%
        set_mode("classification")
# Normalized Tree
tree_wf_n <- 
        workflow() %>%
        add_model(tune_spec) %>%
        add_recipe(norm_recipe)
# Fit the Normalized Tree
set.seed(234)
tree_fit_n <- 
        tree_wf_n %>%
        fit(data=training)
# Tree pca
tree_wf_pca <-
        workflow() %>%
        add_model(tune_spec) %>%
        add_recipe(pca_recipe)
# Fit Tree pca
set.seed(567)
tree_fit_pca <- 
        tree_wf_pca %>%
        fit(data=training)
## Random Forest Model
randf_model <- 
        rand_forest(mtry = 5, trees = 10) %>%
        set_engine("ranger", importance = "impurity") %>%
        set_mode("classification")
# Random Forest pca
randf_wf_pca <-
        workflow() %>%
        add_model(randf_model) %>%
        add_recipe(pca_recipe)
## Fit the Random Forest pca
set.seed(567)
randf_fit_pca <- 
        randf_wf_pca %>%
        fit(data=training)
# Normalized Random Forest
randf_wf_n <-
        workflow() %>%
        add_model(randf_model) %>%
        add_recipe(norm_recipe)
## Fit Normalized Random Forest
set.seed(234)
randf_fit_n <- 
        randf_wf_n %>%
        fit(data=training)
mr_n_vip <- mr_fit_n %>%
        extract_fit_parsnip() %>%
        vip()
mr_n_vip$data
mr_pca_vip <- mr_fit_pca %>%
        extract_fit_parsnip() %>%
        vip()
mr_pca_vip$data
tree_n_vip <- tree_fit_n %>%
        extract_fit_parsnip() %>%
        vip()
tree_pca_vip <- tree_fit_pca %>%
        extract_fit_parsnip() %>%
        vip()
randf_n_vip <- randf_fit_n %>%
        extract_fit_parsnip() %>%
        vip()
randf_pca_vip <- randf_fit_pca %>%
        extract_fit_parsnip() %>%
        vip()
mr_vip_n <- knitr::kable(mr_n_vip$data, caption = "Normalized")
mr_vip_pca <- knitr::kable(mr_pca_vip$data, caption = "PCA")
tree_vip_n <- knitr::kable(tree_n_vip$data, caption = "Normalized")
tree_vip_pca <- knitr::kable(tree_pca_vip$data, caption = "PCA")
randf_vip_n <- knitr::kable(randf_n_vip$data, caption = "Normalized")
randf_vip_pca <- knitr::kable(randf_pca_vip$data, caption = "PCA")
# Predict using the norm center model
mr_pred_n <- 
        predict(mr_fit_n, testing) %>%
        bind_cols(predict(mr_fit_n, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
# Check the quality of fit
# truth refers to the classe column already bound onto the predictions
mr_n_roc_auc <- 
        mr_pred_n %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
mr_n_roc_auc_est <- round(mr_n_roc_auc$.estimate,3)
mr_n_accuracy <-
        mr_pred_n %>%
        accuracy(truth = classe, .pred_class)
mr_n_acc_est <- round(mr_n_accuracy$.estimate, 3)
## predict with the mr pca model
mr_pred_pca <- 
        predict(mr_fit_pca, testing) %>%
        bind_cols(predict(mr_fit_pca, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check the quality of fit
# truth refers to the classe column already bound onto the predictions
mr_pca_roc_auc <-
        mr_pred_pca %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
mr_pca_accuracy <- 
        mr_pred_pca %>%
        accuracy(truth = classe, .pred_class)
mr_pca_roc_auc_est <- round(mr_pca_roc_auc$.estimate,3)
mr_pca_acc_est <- round(mr_pca_accuracy$.estimate ,3)
# Predict tree using the norm center model
tree_pred_n <- 
        predict(tree_fit_n, testing) %>%
        bind_cols(predict(tree_fit_n, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit
# truth refers to the classe column already bound onto the predictions
tree_n_roc_auc <-
        tree_pred_n %>%
        roc_auc(truth = classe, 
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
tree_n_roc_auc_est <- round(tree_n_roc_auc$.estimate, 3)
tree_n_accuracy <-
        tree_pred_n %>%
        accuracy(truth = classe, .pred_class)
tree_n_acc_est <- round(tree_n_accuracy$.estimate, 3)
# Predict tree using the pca recipe
tree_pred_pca <- 
        predict(tree_fit_pca, testing) %>%
        bind_cols(predict(tree_fit_pca, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
# truth refers to the classe column already bound onto the predictions
tree_pca_roc_auc <-
        tree_pred_pca %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
tree_pca_roc_auc_est <- round(tree_pca_roc_auc$.estimate, 3)
tree_pca_accuracy <-
        tree_pred_pca %>%
        accuracy(truth = classe, .pred_class)
tree_pca_acc_est <- round(tree_pca_accuracy$.estimate, 3)
# Predict randf using the pca recipe
randf_predict_pca <- 
        predict(randf_fit_pca, testing) %>%
        bind_cols(predict(randf_fit_pca, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit
# Compare the predicted class (not the true class) against the reference
randf_pca_cm <- 
        confusionMatrix(randf_predict_pca$.pred_class, testing$classe)
randf_pca_roc_auc <-
        randf_predict_pca %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
randf_pca_roc_auc_est <- 
        round(randf_pca_roc_auc$.estimate, 3)
randf_pca_accuracy <-
        randf_predict_pca %>%
        accuracy(truth = classe, .pred_class)
randf_pca_acc_est <- 
        round(randf_pca_accuracy$.estimate, 3)
## Predict using the Normalized Random Forest----------------------------------
randf_pred_n <- 
        predict(randf_fit_n, testing) %>%
        bind_cols(predict(randf_fit_n, testing, type = "prob")) %>%
        bind_cols(testing %>% select(classe))
## Check quality of fit
# truth refers to the classe column already bound onto the predictions
randf_n_roc_auc <-
        randf_pred_n %>%
        roc_auc(truth = classe,
                .pred_A, .pred_B, .pred_C, .pred_D, .pred_E)
randf_n_accuracy <-
        randf_pred_n %>%
        accuracy(truth = classe, .pred_class)
randf_n_est <-
        round(randf_n_roc_auc$.estimate, 3)
randf_n_acc_est <-
        round(randf_n_accuracy$.estimate, 3)
# Tables of estimate values
my_roc_auc_table <- knitr::kable(data.frame(ROC_AUC_estimate = c("Normalized", "PCA"),
                Multinomial = c(mr_n_roc_auc_est, mr_pca_roc_auc_est),
                Tree = c(tree_n_roc_auc_est, tree_pca_roc_auc_est),
                Random_Forest = c(randf_n_est,randf_pca_roc_auc_est)),
                caption = "ROC Values")
my_acc_table <- knitr::kable(data.frame(accuracy_estimate = c("Normalized", "PCA"),
                           Multinomial = c(mr_n_acc_est, mr_pca_acc_est),
                           Tree = c(tree_n_acc_est, tree_pca_acc_est),
                           Random_Forest = c(randf_n_acc_est,randf_pca_acc_est)),
                           caption = "Accuracy")
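## Test normalized Random Forest model
# The class prediction is bound twice here, so bind_cols() repairs the
# duplicate names to .pred_class...1 and .pred_class...2 in the output table
final_predict <- 
        predict(randf_fit_n, new_data = validation) %>%
        bind_cols(predict(randf_fit_n, validation),
                  predict(randf_fit_n, validation, type = "prob"))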

final_results <-
        knitr::kable(final_predict)

Bicep_Curl_Pred

Nate Foulkes

11/9/2021