This is the week 4 project of the Practical Machine Learning MOOC on Coursera. The goal is to predict the type of weight-lifting exercise (coded by the variable classe, with values “A” to “E”). A training dataset (with the column classe) and a test dataset (without it) were provided as csv files.
More information on the weight-lifting experiment is available online at http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises.
Both files required basic cleaning. I removed from the training dataset the columns that are not usable in the test dataset (missing, empty, or without variability).
The test dataset contains 20 test cases of unknown class. The training dataset contains almost 20,000 data points in 58 variables. I will further partition it into a training set and a validation set against which I will develop a machine-learning model.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
setwd("~/Documents/Coursera/pract_machi_learn/")
# download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "training.csv")
# download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
test <- read.csv("test.csv", na.strings=c("NA","#DIV/0!", "")) %>%
select_if(function(x) !(all(is.na(x)) | all(x==""))) %>% # remove empty columns
select(-c("new_window", "problem_id", "X")) %>% # no variation in new_window; X and problem_id are just row numbers
mutate_if(is.character, as.factor) %>%
na.omit()
training <- read.csv("training.csv", na.strings=c("NA","#DIV/0!", "")) %>%
# don't train on columns that cannot be used at test time; keep the outcome classe, which is part of the assignment but absent from the test set
select(c(colnames(test), "classe")) %>%
mutate_if(is.character, as.factor) %>%
na.omit()
# classe <- training$classe
# training %<>% select(-c("classe"))
# summary(training$classe)
# summary(training)
# summary(test)
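A quick sanity check of the cleaned data (a minimal sketch; the expected values in the comments follow from the description above):
# Sanity checks on the cleaned data
dim(training)                               # expect almost 20,000 rows and 58 columns
setdiff(colnames(training), colnames(test)) # only "classe" should be absent from the test set
table(training$classe)                      # class balance of the outcome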
I used the caret package to train a Random Forest model (method "rf"), which grows 500 trees by default, and added an instruction to perform 10-fold cross-validation.
The main disadvantage is that training is slow. I therefore chose a small training partition (10% of the original training data, about 2,000 samples) instead of a more traditional split such as 75%, and used the rest of the original training data for validation. My experiments showed that even smaller training partitions may work well enough on the validation dataset (1% to 5%, which I used while developing this document; see the sketch after the training code below).
set.seed(1)
part = caret::createDataPartition(training$`classe`, p = 0.1)[[1]] # about 2000 samples. my laptop takes too long to train on more and I don't want to wait.
train_1 = training[ part,]
valid_1 = training[-part,]
set.seed(1)
model_1 <- caret::train(`classe` ~ ., data=train_1,
method="rf",
trControl=caret::trainControl(method = "CV", number = 10)) # 10-fold cross-validation
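To see how much training data is really needed, here is a minimal sketch of the kind of experiment mentioned above (the partition fractions and the 5-fold inner cross-validation are illustrative choices, and re-running this takes a few minutes):
# Sketch: validation accuracy for a few training-partition sizes (illustrative only)
set.seed(1)
sizes <- c(0.01, 0.05, 0.10)
acc_by_size <- sapply(sizes, function(p) {
  idx <- caret::createDataPartition(training$classe, p = p)[[1]]
  fit <- caret::train(classe ~ ., data = training[idx, ],
                      method = "rf",
                      trControl = caret::trainControl(method = "cv", number = 5))
  pred <- predict(fit, training[-idx, ])
  caret::postResample(pred, training[-idx, ]$classe)[["Accuracy"]]
})
names(acc_by_size) <- scales::percent(sizes)
acc_by_size # accuracy on the held-out part of the original training data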
# Summary of the model
model_1
## Random Forest
##
## 1964 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1768, 1767, 1767, 1768, 1767, 1768, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9531412 0.9406368
## 40 0.9816454 0.9767816
## 79 0.9826711 0.9780814
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 79.
This model reached an accuracy of almost 99% on the validation data, which suggests that it is likely to perform adequately on the test data.
predicted <- predict(model_1, valid_1)
cm <- caret::confusionMatrix(valid_1$classe, predicted)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5017 5 0 0 0
## B 45 3327 45 0 0
## C 0 35 3028 16 0
## D 0 0 16 2878 0
## E 0 12 0 38 3196
##
## Overall Statistics
##
## Accuracy : 0.988
## 95% CI : (0.9863, 0.9895)
## No Information Rate : 0.2867
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9848
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9911 0.9846 0.9803 0.9816 1.0000
## Specificity 0.9996 0.9937 0.9965 0.9989 0.9965
## Pos Pred Value 0.9990 0.9737 0.9834 0.9945 0.9846
## Neg Pred Value 0.9964 0.9963 0.9958 0.9963 1.0000
## Prevalence 0.2867 0.1914 0.1749 0.1660 0.1810
## Detection Rate 0.2841 0.1884 0.1715 0.1630 0.1810
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9954 0.9892 0.9884 0.9902 0.9983
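Since the validation set played no role in training, the validation accuracy directly estimates the out-of-sample performance; a short sketch to extract the corresponding out-of-sample error from the confusion-matrix object:
# Estimated out-of-sample error = 1 - validation accuracy (about 1.2% here)
oos_error <- 1 - cm$overall[["Accuracy"]]
scales::percent(oos_error, accuracy = 0.1)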
data.frame(cm$table) %>%
group_by(Reference) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = Reference, y = Prediction, fill = prop)) +
geom_tile(color="white", size=0.5) +
geom_text(aes(label = scales::percent(prop, 0.1))) +
scale_fill_viridis_c("", direction = -1, labels=scales::percent_format(), limits=c(0, 1), breaks=seq(0, 1, 0.25))
Figure: confusion matrix of the validation-set predictions, shown as proportions within each reference class.
# acc <- caret::postResample(predicted,valid_1$`classe`)
# acc
Here is a data frame with the predicted class for each of the 20 test cases, together with the estimated probability of each class.
final_prediction <- predict(model_1, test, type="prob")
final_prediction %>% bind_cols(`class` = apply(final_prediction, 1, which.max)) %>%
mutate(`class`=factor(`class`, levels = 1:5, labels=LETTERS[1:5])) %>%
mutate(`N` = 1:20)
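Equivalently, since caret's predict() returns the predicted labels when type is left at its default, the hard class predictions for the 20 test cases can be obtained in one call:
# Predicted classe label for each of the 20 test cases, in test-set order
predict(model_1, test)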