For this course project I have been asked to examine the Weight Lifting Exercises dataset kindly provided by http://groupware.les.inf.puc-rio.br/har.
The goal is to predict the quality of execution of a movement, specifically the “Dumbbell Curl” exercise.
Six participants were asked to perform the exercise correctly and then incorrectly in four different ways, with 10 repetitions of each, giving five classes in total. The data were collected from wearable sensors mounted on the belt, forearm, arm, and dumbbell.
My goal is to correctly predict, for each record in the test data set (which does not include the outcome variable), the way the exercise was performed. My results will then be entered into the prediction quiz to find out how accurate the model’s predictions are.
Download the data
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_filename <- "pml-training.csv"
test_filename <- "pml-testing.csv"
curl::curl_download(train_url, train_filename)
curl::curl_download(test_url, test_filename)
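On repeated runs, a small guard (an optional refinement, reusing the filenames above) avoids re-downloading files that are already cached locally:
# only fetch a file if it isn't already on disk
if (!file.exists(train_filename)) curl::curl_download(train_url, train_filename)
if (!file.exists(test_filename)) curl::curl_download(test_url, test_filename)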
Load training data
library(tidyverse)
library(caret)
library(doParallel)
training <- read_csv(train_filename)
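One thing worth noting: some of the summary columns in the raw CSV contain spreadsheet artifacts such as “#DIV/0!”. An alternative to coercing those columns to numeric later (as done below) would be to declare the artifacts as missing at read time; training_alt is a hypothetical name for such an import:
# alternative import (not used here): treat spreadsheet error strings and blanks as NA
training_alt <- read_csv(train_filename, na = c("NA", "#DIV/0!", ""))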
Data Cleaning / EDA
training %>% dim()
## [1] 19622 160
We have a large data set: over 19k rows and 160 variables. Let’s look at the distribution of the variable we are going to predict: classe.
training %>%
  ggplot(aes(classe, fill = classe)) +
  geom_bar()
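To put numbers behind the chart, a quick tally of the outcome:
# repetition counts per classe, largest first
training %>% count(classe, sort = TRUE)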
The classes are reasonably balanced, with class A (the correct execution) the most frequent. Let’s look at the first few variables.
training %>%
  select(1:8) %>%
  summary()
## X1 user_name raw_timestamp_part_1 raw_timestamp_part_2
## Min. : 1 Length:19622 Min. :1.322e+09 Min. : 294
## 1st Qu.: 4906 Class :character 1st Qu.:1.323e+09 1st Qu.:252912
## Median : 9812 Mode :character Median :1.323e+09 Median :496380
## Mean : 9812 Mean :1.323e+09 Mean :500656
## 3rd Qu.:14717 3rd Qu.:1.323e+09 3rd Qu.:751891
## Max. :19622 Max. :1.323e+09 Max. :998801
## cvtd_timestamp new_window num_window roll_belt
## Length:19622 Length:19622 Min. : 1.0 Min. :-28.90
## Class :character Class :character 1st Qu.:222.0 1st Qu.: 1.10
## Mode :character Mode :character Median :424.0 Median :113.00
## Mean :430.6 Mean : 64.41
## 3rd Qu.:644.0 3rd Qu.:123.00
## Max. :864.0 Max. :162.00
I’m not going to use variables 1 to 7 (row index, user name, timestamps, and window markers) for prediction, as they carry no information about how the exercise was performed, so I’ll drop them. I’m also converting classe to a factor and all the remaining character columns to numeric.
training_clean <- training %>%
  select(-c(1:7)) %>%                     # drop id, timestamp, and window columns
  mutate(classe = as_factor(classe)) %>%  # outcome as a factor
  mutate_if(is.character, as.numeric)     # coerce text columns; non-numeric strings become NA
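As a quick sanity check (plain base R), the column types should now be one factor (classe) plus 152 numeric columns:
# tabulate column types: expect one factor and the rest numeric
table(sapply(training_clean, class))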
Let’s find out which variables contain NAs.
training_clean %>%
  select(-classe) %>%
  pivot_longer(everything()) %>%
  filter(is.na(value)) %>%
  group_by(name) %>%
  summarise(count = n()) %>%
  arrange(count)
## # A tibble: 100 x 2
## name count
## <chr> <int>
## 1 amplitude_pitch_arm 19216
## 2 amplitude_pitch_belt 19216
## 3 amplitude_pitch_dumbbell 19216
## 4 amplitude_pitch_forearm 19216
## 5 amplitude_roll_arm 19216
## 6 amplitude_roll_belt 19216
## 7 amplitude_roll_dumbbell 19216
## 8 amplitude_roll_forearm 19216
## 9 amplitude_yaw_arm 19216
## 10 avg_pitch_arm 19216
## # ... with 90 more rows
These variables are missing in 19,216 of 19,622 rows, so I can’t use them for prediction; I’ll drop them.
# one-row tibble whose column names are the variables containing NAs
na_cols <- training_clean %>%
  select(-classe) %>%
  pivot_longer(everything()) %>%
  filter(is.na(value)) %>%
  group_by(name) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = name, values_from = count)
# drop those variables
training_clean <- training_clean %>% select(-all_of(names(na_cols)))
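The same columns could have been found more directly with colSums(is.na(.)); either way, after the drop the cleaned data should contain no missing values at all, which is easy to verify:
# confirm no NAs remain after dropping the sparse columns
sum(is.na(training_clean))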
To speed things up, instead of fitting on all rows I’ll draw a random sample of 8,000 (a bit under half the data).
Then I’ll set up pre-processing to center and scale the variables and apply it to the training sample.
# draw a sample for quicker fitting
set.seed(123)
sample <- sample_n(training_clean, 8000)
# estimate centering and scaling parameters from the sample
preProcSample <- preProcess(sample, method = c("center", "scale"))
# apply the pre-processing to the sample
sample_trans <- predict(preProcSample, sample)
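To confirm the transform worked, any numeric column of the transformed sample should now have mean 0 and standard deviation 1; a spot check on roll_belt, one of the columns summarised earlier:
# spot-check the centering and scaling on a single column
sample_trans %>%
  summarise(mean_roll_belt = mean(roll_belt),
            sd_roll_belt   = sd(roll_belt))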
To get a robust estimate of my out-of-sample error I’ll use 10-fold cross-validation with 10 repeats.
train_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              allowParallel = TRUE)
I’m going to fit a random forest model; doParallel will speed things up.
no_cores <- detectCores() - 1
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)
mod_tree <- train(classe ~ .,
                  method = "rf",
                  data = sample_trans,
                  trControl = train_control)
stopCluster(cl)
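With the cluster stopped, re-registering the sequential backend keeps any later foreach-based code from looking for the defunct workers:
# fall back to sequential execution now that the cluster is gone
registerDoSEQ()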
# save the model
saveRDS(mod_tree, file = "final_mod_tree")
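Saving means the fitted model can be reloaded in a later session without re-training:
# reload the saved model when needed
mod_tree <- readRDS("final_mod_tree")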
mod_tree
## Random Forest
##
## 8000 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 7200, 7199, 7199, 7200, 7200, 7200, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9833249 0.9788694
## 27 0.9852627 0.9813281
## 52 0.9796131 0.9741697
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
I estimate my out-of-sample accuracy to be about 98.5% with mtry = 27. I’m happy with that, so I’ll go ahead and use this model to predict on the test set for the quiz.
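As a back-of-the-envelope check (using the resampled accuracy above, and assuming the 20 quiz cases are classified independently), the implied error rate is about 1.5% and the chance of a perfect quiz score is roughly 74%:
# expected out-of-sample error implied by the CV accuracy
1 - 0.9852627
# chance of a perfect 20/20 quiz, assuming independent predictions
0.9852627 ^ 20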
# read the test data
testing <- read_csv(test_filename)
# apply the same cleaning as the training set
testing_clean <- testing %>%
  select(-c(1:7)) %>%
  mutate_if(is.character, as.numeric)
# remove the NA columns identified earlier
testing_clean <- testing_clean %>% select(-all_of(names(na_cols)))
# apply the training-derived pre-processing
testing_processed <- predict(preProcSample, testing_clean)
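Before predicting, a quick sanity check (model_vars is a name introduced here for illustration) confirms the test set carries every predictor the model was trained on; the extra problem_id column is not among them and is left untouched:
# predictors the model expects, minus the outcome
model_vars <- setdiff(names(sample_trans), "classe")
# should return character(0): no predictors are missing from the test set
setdiff(model_vars, names(testing_processed))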
predict(mod_tree, testing_processed) %>%
  as_tibble() %>%
  mutate(row_num = row_number()) %>%
  select(row_num, value)
## # A tibble: 20 x 2
## row_num value
## <int> <fct>
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B
The result of the quiz submission was 100%: all 20 predictions were correct.