Introduction

For this course project I have been asked to examine the Weight Lifting Exercises dataset kindly provided by http://groupware.les.inf.puc-rio.br/har.

The goal is to predict the quality of execution of a movement, specifically the “Dumbbell Curl” exercise.

Six participants were ask to perform the exercise correctly and then incorrectly 5 different ways, with 10 repetitions of each. The data were collected through wearable devices such as Jawbone Up, Nike FuelBand, and Fitbit.

My goal is to correctly predict the way the exercise was conducted for the records in the test data set, which does not include the outcome variable. My results will then be entered into the prediction quiz to find out how accurately my model predicted.

download the data

train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

train_filename <- "pml-training.csv"
test_filename <- "pml-testing.csv"

curl::curl_download(train_url, train_filename)
curl::curl_download(test_url, test_filename)

load training data

library(tidyverse)
library(caret)
library(doParallel)

training <- read_csv(train_filename)

Data Cleaning / EDA

training %>% dim()
## [1] 19622   160

We have a large data set, over 19k rows and 160 variables, lets look at the distribution of the variable we are going to be predicting: classe

training %>%
        ggplot(aes(classe, fill = classe)) +
        geom_bar()

We have an even spread amount of each class, lets look at the first few variables.

training %>%
        select(1:8) %>%
        summary()
##        X1         user_name         raw_timestamp_part_1 raw_timestamp_part_2
##  Min.   :    1   Length:19622       Min.   :1.322e+09    Min.   :   294      
##  1st Qu.: 4906   Class :character   1st Qu.:1.323e+09    1st Qu.:252912      
##  Median : 9812   Mode  :character   Median :1.323e+09    Median :496380      
##  Mean   : 9812                      Mean   :1.323e+09    Mean   :500656      
##  3rd Qu.:14717                      3rd Qu.:1.323e+09    3rd Qu.:751891      
##  Max.   :19622                      Max.   :1.323e+09    Max.   :998801      
##  cvtd_timestamp      new_window          num_window      roll_belt     
##  Length:19622       Length:19622       Min.   :  1.0   Min.   :-28.90  
##  Class :character   Class :character   1st Qu.:222.0   1st Qu.:  1.10  
##  Mode  :character   Mode  :character   Median :424.0   Median :113.00  
##                                        Mean   :430.6   Mean   : 64.41  
##                                        3rd Qu.:644.0   3rd Qu.:123.00  
##                                        Max.   :864.0   Max.   :162.00

I’m not going to be using variables 1 to 7 for prediction, as they do not appear to have any information needed for prediction. So I will drop those. Also I’m converting classe to factor and all the other variables to numeric

training_clean <- training %>%
        select(-c(1:7)) %>%
        mutate(classe = as_factor(classe)) %>%
        mutate_if(sapply(.,is.character), as.numeric)

Lets find out what NAs we have.

training_clean %>%
        select(-classe) %>%
        pivot_longer(everything()) %>%
        group_by(name) %>%
        filter(value %>% is.na) %>%
        summarise(count = n()) %>%
        arrange(count)
## # A tibble: 100 x 2
##    name                     count
##    <chr>                    <int>
##  1 amplitude_pitch_arm      19216
##  2 amplitude_pitch_belt     19216
##  3 amplitude_pitch_dumbbell 19216
##  4 amplitude_pitch_forearm  19216
##  5 amplitude_roll_arm       19216
##  6 amplitude_roll_belt      19216
##  7 amplitude_roll_dumbbell  19216
##  8 amplitude_roll_forearm   19216
##  9 amplitude_yaw_arm        19216
## 10 avg_pitch_arm            19216
## # ... with 90 more rows

I cant use variables with then many NAs for prediction! So, I’ll drop those variables

na_cols <-  training_clean %>%
        select(-classe) %>%
        pivot_longer(everything()) %>%
        group_by(name) %>%
        filter(value %>% is.na) %>%
        summarise(count = n()) %>%
        pivot_wider(names_from = name, values_from = count)

training_clean <- training_clean %>% select(-names(na_cols))

Instead of fitting all rows, to speed things up, I’ll draw a random sample of about half.

Then I’ll set up pre-processing to normalize and apply it to the training sample.

#draw sample for quicker fitting
set.seed(123)
sample <- sample_n(training_clean, 8000)

preProcSample = preProcess(sample, 
                           method = c("center", "scale"))

#apply pre-processing to training set
sample_trans <- predict(preProcSample, sample)

To get a robust estimate for my out of sample error I’ll use 10 fold cross validation with 10 repeats.

trainControl <- trainControl(method = "repeatedcv",
                             number = 10,
                             repeats = 10,
                             allowParallel = TRUE)

I’m going to fit a random forest model, doParallel will speeds things up.

no_cores = detectCores() -1
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)

mod_tree <- train(classe ~ ., 
                  method = "rf", 
                  data = sample_trans,
                  trControl = trainControl)

stopCluster(cl)

#save the model
saveRDS(mod_tree, file = "final_mod_tree")

mod_tree
## Random Forest 
## 
## 8000 samples
##   52 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 7200, 7199, 7199, 7200, 7200, 7200, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9833249  0.9788694
##   27    0.9852627  0.9813281
##   52    0.9796131  0.9741697
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

I estimate my out of sample accuracy to be 98.5% with mtry = 27, I’m happy with that and I’ll go ahead and use this to predict on the test set for the quiz

#read data
testing <- read_csv(test_filename)

#preProcess data
testing_clean <-  testing %>%
        select(-c(1:7)) %>%
        mutate_if(sapply(.,is.character), as.numeric)

#remove NA cols
testing_clean <- testing_clean %>% select(-names(na_cols))

testing_processed <- predict(preProcSample, testing_clean)

predict(mod_tree, testing_processed) %>%
        as_tibble() %>%
        mutate(row_num = row_number()) %>%
        select(row_num, value)
## # A tibble: 20 x 2
##    row_num value
##      <int> <fct>
##  1       1 B    
##  2       2 A    
##  3       3 B    
##  4       4 A    
##  5       5 A    
##  6       6 E    
##  7       7 D    
##  8       8 B    
##  9       9 A    
## 10      10 A    
## 11      11 B    
## 12      12 C    
## 13      13 B    
## 14      14 A    
## 15      15 E    
## 16      16 E    
## 17      17 A    
## 18      18 B    
## 19      19 B    
## 20      20 B

The result of the Quiz submission was 100%