Using the Tidymodels Package

INTRODUCTION

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here:

http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
(see the section on the Weight Lifting Exercise Dataset).

OBJECTIVE

The goal of the project is to predict the manner in which the 6 participants performed the exercise. This corresponds to the “classe” variable in the training set.


Load the Training and Testing Datasets

training <- read.csv("pml-training.csv")   #main data for this project.
testing <- read.csv("pml-testing.csv")   #will be used to predict 20 observations.
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160

Both the training and testing datasets have 160 variables.
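
Before cleaning, it helps to gauge how widespread missing values are (a minimal sketch using base R; output omitted):

sum(colSums(is.na(training)) > 0)   # number of columns containing at least one NA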


Load the Machine Learning Package

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.1      ✔ recipes      1.0.3 
## ✔ dials        1.1.0      ✔ rsample      1.1.0 
## ✔ dplyr        1.0.10     ✔ tibble       3.1.8 
## ✔ ggplot2      3.4.0      ✔ tidyr        1.2.1 
## ✔ infer        1.0.3      ✔ tune         1.0.1 
## ✔ modeldata    1.0.1      ✔ workflows    1.1.0 
## ✔ parsnip      1.0.3      ✔ workflowsets 1.0.0 
## ✔ purrr        0.3.5      ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
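
As the startup log itself suggests, these messages can be silenced when knitting the report:

suppressPackageStartupMessages(library(tidymodels))   # load quietly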


Load the Tidyverse Package for Data Visualization

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ stringr 1.4.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()


Data Cleaning and Wrangling

training <- training[, colSums(is.na(training))==0]   # drop columns containing NA values
training <- training[, -c(1:7)]   # drop ID, user, and timestamp columns (non-predictive)
classe <- as.factor(training$classe)   # keep the outcome as a factor
training <- training[,sapply(training, is.numeric)]   # keep only numeric predictors
training <- data.frame(training, classe)   # recombine predictors with the outcome

Columns containing NA values, the first seven identifier columns, and non-numeric columns (other than the outcome) are removed. The resulting dataset contains 53 columns, down from 160.
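
A quick dimension check confirms the cleaned shape; since only columns were dropped, this should report 19622 rows and 53 columns:

dim(training)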


Response Variable Analysis

ggplot(data=training) +
        geom_bar(mapping=aes(classe, fill=classe))


The distribution of the 5 classes is shown in the plot above. Class A has more observations than the other four, but prediction can proceed because every class has a high number of observations.
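
For a numeric view of the same distribution (a short sketch using dplyr's count(); output omitted):

training %>%
        count(classe)   # observations per class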


Data Preprocessing
The training data is split into new training (70%) and testing (30%) sets. Preprocessing is done on all predictors, removing variables that have large absolute correlations with other variables. The same trained recipe is applied to both splits.

# Splitting the original training data into new training data and testing sets.
data_split <- initial_split(data=training, prop=0.7)

# Preprocessing
training_recipe <- training(data_split) %>%
        recipe(classe ~.) %>%
        step_corr(all_predictors()) %>%
        prep()
training_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         52
## 
## Training data contained 13735 data points and no missing data.
## 
## Operations:
## 
## Correlation filter on accel_belt_x, accel_belt_y, accel_belt_z,... [trained]
# Apply the trained recipe to the testing split
data_testing <- training_recipe %>%
        bake(testing(data_split))

# Extract the preprocessed training data from the prepped recipe
data_training <- juice(training_recipe)
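
To see exactly which predictors the correlation filter dropped (a sketch using the tidy() method from recipes; output omitted), the trained step can be inspected:

tidy(training_recipe, number=1)   # terms removed by step_corr()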


Model Training using Random Forest
The prediction algorithm of choice is a random forest classifier.

data_rf <- rand_forest(trees = 100, mode="classification") %>%
        set_engine("randomForest") %>%
        fit(classe ~., data=data_training)
data_rf
## parsnip model object
## 
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.73%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3864    6    0    0    2 0.002066116
## B   15 2688    8    0    2 0.009214891
## C    0   19 2350    1    0 0.008438819
## D    0    0   32 2210    5 0.016466400
## E    0    1    3    6 2523 0.003947888

As shown, 100 trees are used and 6 variables are tried at each split. The out-of-bag (OOB) error estimate is very small at 0.73%. The confusion matrix is shown above.
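
To see which predictors drive the model, variable importance can be plotted from the underlying engine object (a sketch; parsnip stores the fitted randomForest object in the $fit element):

randomForest::varImpPlot(data_rf$fit)   # plot variable importance scores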


Prediction
The predict() function is used for prediction; its arguments are the fitted model and the preprocessed testing split.

predict(data_rf, data_testing)
## # A tibble: 5,887 × 1
##    .pred_class
##    <fct>      
##  1 A          
##  2 A          
##  3 A          
##  4 A          
##  5 A          
##  6 A          
##  7 A          
##  8 A          
##  9 A          
## 10 A          
## # … with 5,877 more rows


Validation
Validation is done using the metrics() function, where the truth argument corresponds to the response variable classe and the estimate argument corresponds to the predicted values. This step outputs the accuracy and kappa metrics as shown.

data_rf %>%
        predict(data_testing) %>%
        bind_cols(data_testing) %>%
        metrics(truth=classe, estimate=.pred_class)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.995
## 2 kap      multiclass     0.994
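
Beyond the overall metrics, a per-class confusion matrix on the validation split can be produced with yardstick's conf_mat() (a short sketch; output omitted):

data_rf %>%
        predict(data_testing) %>%
        bind_cols(data_testing) %>%
        conf_mat(truth=classe, estimate=.pred_class)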


THE MODEL IS 99.5% ACCURATE ON THE HELD-OUT VALIDATION SET


Predicting the Testing Dataset
Using the original testing dataset, a prediction is made to produce the 20 required predicted values.

predict(data_rf, testing)
## # A tibble: 20 × 1
##    .pred_class
##    <fct>      
##  1 B          
##  2 A          
##  3 B          
##  4 A          
##  5 A          
##  6 E          
##  7 D          
##  8 B          
##  9 A          
## 10 A          
## 11 B          
## 12 C          
## 13 B          
## 14 A          
## 15 E          
## 16 E          
## 17 A          
## 18 B          
## 19 B          
## 20 B
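
Note that predicting on the raw testing set works here only because step_corr() removes columns rather than transforming them, so every predictor the model needs is still present under its original name. The more idiomatic tidymodels pattern (a sketch; it assumes the installed recipes version does not require the outcome column at bake time) is to pass the new data through the same trained recipe first:

baked_testing <- bake(training_recipe, new_data=testing)   # apply identical preprocessing
predict(data_rf, baked_testing)   # should yield the same 20 predictions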