Using the Tidymodels Package

INTRODUCTION

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here:

http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
(see the section on the Weight Lifting Exercise Dataset).

OBJECTIVE

The goal of the project is to predict the manner in which the 6 participants performed the exercise. This corresponds to the “classe” variable in the training set.


Load the Training and Testing Datasets

training <- read.csv("pml-training.csv")   #main data for this project.
testing <- read.csv("pml-testing.csv")   #will be used to predict 20 observations.
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160

Both the training and testing datasets have 160 variables.
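
Before cleaning, it helps to gauge how widespread missing values are (a minimal sketch using base R; output omitted):

sum(colSums(is.na(training)) > 0)   # number of columns containing at least one NA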


Load the Machine Learning Package

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.1      ✔ recipes      1.0.3 
## ✔ dials        1.1.0      ✔ rsample      1.1.0 
## ✔ dplyr        1.0.10     ✔ tibble       3.1.8 
## ✔ ggplot2      3.4.0      ✔ tidyr        1.2.1 
## ✔ infer        1.0.3      ✔ tune         1.0.1 
## ✔ modeldata    1.0.1      ✔ workflows    1.1.0 
## ✔ parsnip      1.0.3      ✔ workflowsets 1.0.0 
## ✔ purrr        0.3.5      ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
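
As the startup log itself suggests, these messages can be silenced when knitting the report:

suppressPackageStartupMessages(library(tidymodels))   # load quietly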


Load the Tidyverse Package for Data Visualization

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ stringr 1.4.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()


Data Cleaning and Wrangling

training <- training[, colSums(is.na(training))==0]   # drop columns containing NA values
training <- training[, -c(1:7)]   # drop ID, user, and timestamp columns (non-predictive)
classe <- as.factor(training$classe)   # keep the outcome as a factor
training <- training[,sapply(training, is.numeric)]   # keep only numeric predictors
training <- data.frame(training, classe)   # recombine predictors with the outcome

Columns containing NA values, the first seven identifier columns, and non-numeric columns (other than the outcome) are removed. The resulting dataset contains 53 columns, down from 160.
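
A quick dimension check confirms the cleaned shape; since only columns were dropped, this should report 19622 rows and 53 columns:

dim(training)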


Response Variable Analysis

ggplot(data=training) +
        geom_bar(mapping=aes(classe, fill=classe))


The distribution of the 5 classes is shown in the plot above. Class A has more observations than the other four, but prediction can proceed because every class has a high number of observations.
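
For a numeric view of the same distribution (a short sketch using dplyr's count(); output omitted):

training %>%
        count(classe)   # observations per class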


Data Preprocessing
The training data is split into new training (70%) and testing (30%) sets. Preprocessing is done on all predictors, removing variables that have large absolute correlations with other variables. The same trained recipe is applied to both splits.

# Splitting the original training data into new training data and testing sets.
data_split <- initial_split(data=training, prop=0.7)

# Preprocessing
training_recipe <- training(data_split) %>%
        recipe(classe ~.) %>%
        step_corr(all_predictors()) %>%
        prep()
training_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         52
## 
## Training data contained 13735 data points and no missing data.
## 
## Operations:
## 
## Correlation filter on accel_belt_x, accel_belt_y, accel_belt_z,... [trained]
# Apply the trained recipe to the testing split
data_testing <- training_recipe %>%
        bake(testing(data_split))

# Extract the preprocessed training data from the prepped recipe
data_training <- juice(training_recipe)
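
To see exactly which predictors the correlation filter dropped (a sketch using the tidy() method from recipes; output omitted), the trained step can be inspected:

tidy(training_recipe, number=1)   # terms removed by step_corr()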


Model Training using Random Forest
The prediction algorithm of choice is a random forest classifier.

data_rf <- rand_forest(trees = 100, mode="classification") %>%
        set_engine("randomForest") %>%
        fit(classe ~., data=data_training)
data_rf
## parsnip model object
## 
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.73%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3864    6    0    0    2 0.002066116
## B   15 2688    8    0    2 0.009214891
## C    0   19 2350    1    0 0.008438819
## D    0    0   32 2210    5 0.016466400
## E    0    1    3    6 2523 0.003947888

As shown, 100 trees are used and 6 variables are tried at each split. The out-of-bag (OOB) error estimate is very small at 0.73%. The confusion matrix is shown above.
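
To see which predictors drive the model, variable importance can be plotted from the underlying engine object (a sketch; parsnip stores the fitted randomForest object in the $fit element):

randomForest::varImpPlot(data_rf$fit)   # plot variable importance scores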


Prediction
The predict() function is used for prediction; its arguments are the fitted model and the preprocessed testing split.

predict(data_rf, data_testing)
## # A tibble: 5,887 × 1
##    .pred_class
##    <fct>      
##  1 A          
##  2 A          
##  3 A          
##  4 A          
##  5 A          
##  6 A          
##  7 A          
##  8 A          
##  9 A          
## 10 A          
## # … with 5,877 more rows


Validation
Validation is done using the metrics() function, where the truth argument corresponds to the response variable classe and the estimate argument corresponds to the predicted values. This step outputs the accuracy and kappa metrics as shown.

data_rf %>%
        predict(data_testing) %>%
        bind_cols(data_testing) %>%
        metrics(truth=classe, estimate=.pred_class)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.995
## 2 kap      multiclass     0.994
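
Beyond the overall metrics, a per-class confusion matrix on the validation split can be produced with yardstick's conf_mat() (a short sketch; output omitted):

data_rf %>%
        predict(data_testing) %>%
        bind_cols(data_testing) %>%
        conf_mat(truth=classe, estimate=.pred_class)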


THE MODEL IS 99.5% ACCURATE ON THE HELD-OUT VALIDATION SET


Predicting the Testing Dataset
Using the original testing dataset, a prediction is made to produce the 20 required predicted values.

predict(data_rf, testing)
## # A tibble: 20 × 1
##    .pred_class
##    <fct>      
##  1 B          
##  2 A          
##  3 B          
##  4 A          
##  5 A          
##  6 E          
##  7 D          
##  8 B          
##  9 A          
## 10 A          
## 11 B          
## 12 C          
## 13 B          
## 14 A          
## 15 E          
## 16 E          
## 17 A          
## 18 B          
## 19 B          
## 20 B
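
Note that predicting on the raw testing set works here only because step_corr() removes columns rather than transforming them, so every predictor the model needs is still present under its original name. The more idiomatic tidymodels pattern (a sketch; it assumes the installed recipes version does not require the outcome column at bake time) is to pass the new data through the same trained recipe first:

baked_testing <- bake(training_recipe, new_data=testing)   # apply identical preprocessing
predict(data_rf, baked_testing)   # should yield the same 20 predictions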