Using the Tidymodels Package
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here:
http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
(see the section on the Weight Lifting Exercise Dataset).
The goal of the project is to predict the manner in which the 6 participants did the exercise. This is the “classe” variable in the training set.
Load the Training and Testing Datasets
training <- read.csv("pml-training.csv") #main data for this project.
testing <- read.csv("pml-testing.csv") #will be used to predict 20 observations.
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Both the training and testing datasets have 160 variables.
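As a quick sanity check (a sketch, not part of the original output), the column names of the two files can be compared; they are expected to share the same predictors, with only the last column differing (classe in the training file versus problem_id in the testing file, which is an assumption about this dataset).
setdiff(names(training), names(testing)) # expected: "classe"
setdiff(names(testing), names(training)) # expected: "problem_id"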
Load the Machine Learning Package
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.1 ✔ recipes 1.0.3
## ✔ dials 1.1.0 ✔ rsample 1.1.0
## ✔ dplyr 1.0.10 ✔ tibble 3.1.8
## ✔ ggplot2 3.4.0 ✔ tidyr 1.2.1
## ✔ infer 1.0.3 ✔ tune 1.0.1
## ✔ modeldata 1.0.1 ✔ workflows 1.1.0
## ✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
## ✔ purrr 0.3.5 ✔ yardstick 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
Load the Tidyverse Package for Data Visualization
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ stringr 1.4.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
Data Cleaning and Wrangling
training <- training[, colSums(is.na(training)) == 0] # drop columns containing NA values
training <- training[, -c(1:7)]                       # drop identifier and timestamp columns
classe <- as.factor(training$classe)                  # keep the response as a factor
training <- training[, sapply(training, is.numeric)]  # keep only numeric predictors
training <- data.frame(training, classe)              # re-attach the response
Columns with NA values and unwanted (non-numeric) values are removed, along with the first seven identifier and timestamp columns. The resulting dataset is reduced from 160 to 53 columns.
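As a quick check (a sketch, not from the original output), the reduced dimensions and the absence of missing values can be confirmed:
dim(training)        # expected: 19622 rows and 53 columns
sum(is.na(training)) # expected: 0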
Response Variable Analysis
ggplot(data = training) +
  geom_bar(mapping = aes(classe, fill = classe))
The distribution of the 5 classes is shown in the plot above. Class A has more observations than the other 4 classes, but each class has enough observations for prediction to proceed.
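A numeric view of the class balance (a sketch, not part of the original output) can complement the plot:
training %>%
  count(classe) %>%
  mutate(prop = n / sum(n))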
Data Preprocessing
The training data is split into new training (70%) and testing (30%) sets. Preprocessing is then applied to all predictors, removing variables with large absolute correlations with other variables. The same preprocessing is applied to both splits.
# Split the original training data into new training and testing sets.
data_split <- initial_split(data = training, prop = 0.7)
# Preprocessing
training_recipe <- training(data_split) %>%
  recipe(classe ~ .) %>%
  step_corr(all_predictors()) %>%
  prep()
training_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 52
##
## Training data contained 13735 data points and no missing data.
##
## Operations:
##
## Correlation filter on accel_belt_x, accel_belt_y, accel_belt_z,... [trained]
# Apply the trained recipe to the testing split
data_testing <- training_recipe %>%
  bake(testing(data_split))
# Extract the preprocessed training data
data_training <- juice(training_recipe)
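An optional check (not in the original output) is to confirm that the correlation filter reduced the predictor count in both splits:
dim(data_training) # fewer than the original 52 predictors, plus classe
dim(data_testing)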
Model Training using Random Forest
The prediction algorithm of choice is the random forest for classification.
data_rf <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest") %>%
  fit(classe ~ ., data = data_training)
data_rf
## parsnip model object
##
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, ntree = ~100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 0.73%
## Confusion matrix:
## A B C D E class.error
## A 3864 6 0 0 2 0.002066116
## B 15 2688 8 0 2 0.009214891
## C 0 19 2350 1 0 0.008438819
## D 0 0 32 2210 5 0.016466400
## E 0 1 3 6 2523 0.003947888
As shown, 100 trees are used and 6 variables are tried at each split. The out-of-bag (OOB) estimate of the error rate is very small at 0.73%. A confusion matrix is shown above.
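For additional insight (not part of the original analysis), variable importance can be extracted from the underlying randomForest fit, which parsnip stores in data_rf$fit; this is a sketch assuming the default Gini-based importance measure.
# Top 10 predictors by mean decrease in Gini impurity (a sketch).
randomForest::importance(data_rf$fit) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseGini)) %>%
  head(10)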
Prediction
The predict() function is used for prediction, taking the fitted model and the held-out testing split as arguments.
predict(data_rf, data_testing)
## # A tibble: 5,887 × 1
## .pred_class
## <fct>
## 1 A
## 2 A
## 3 A
## 4 A
## 5 A
## 6 A
## 7 A
## 8 A
## 9 A
## 10 A
## # … with 5,877 more rows
Validation
Validation is done using the metrics() function, where the truth argument corresponds to the response variable classe and the estimate argument corresponds to the predicted values. This step outputs the accuracy and kappa metrics as shown.
data_rf %>%
  predict(data_testing) %>%
  bind_cols(data_testing) %>%
  metrics(truth = classe, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.995
## 2 kap multiclass 0.994
The model is about 99.5% accurate on the held-out testing split.
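A per-class breakdown (not in the original output) could also be examined with a confusion matrix on the held-out split, for example using yardstick's conf_mat():
data_rf %>%
  predict(data_testing) %>%
  bind_cols(data_testing) %>%
  conf_mat(truth = classe, estimate = .pred_class)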
Predicting the Testing Dataset
Using the original testing dataset, predictions are made for the 20 held-out observations.
predict(data_rf, testing)
## # A tibble: 20 × 1
## .pred_class
## <fct>
## 1 B
## 2 A
## 3 B
## 4 A
## 5 A
## 6 E
## 7 D
## 8 B
## 9 A
## 10 A
## 11 B
## 12 C
## 13 B
## 14 A
## 15 E
## 16 E
## 17 A
## 18 B
## 19 B
## 20 B
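For readability (a sketch, not part of the original output), each prediction can be paired with its identifier, assuming the original testing file contains a problem_id column:
bind_cols(problem_id = testing$problem_id,
          predict(data_rf, testing))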