This is the week 4 project of the Practical Machine Learning MOOC on Coursera. The goal is to predict the type of weight-lifting exercise (coded by the variable classe, with values “A” to “E”). A training dataset (with the column classe) and a test dataset (without it) were provided as csv files.
More information on the weight-lifting experiment is available online at http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises.
Both files required basic cleaning. I removed from the training dataset the columns that are not usable in the test dataset (missing, empty, or without variability).
The test dataset contains 20 test cases of unknown class. The training dataset contains almost 20,000 data points in 58 variables. I will further partition it into a training set and a validation set against which I will develop a machine-learning model.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
setwd("~/Documents/Coursera/pract_machi_learn/")
# download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "training.csv")
# download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
test <- read.csv("test.csv", na.strings=c("NA","#DIV/0!", "")) %>%
select_if(function(x) !(all(is.na(x)) | all(x==""))) %>% # remove empty columns
select(-c("new_window", "problem_id", "X")) %>% # no variation in new_window; X and problem_id are just row numbers
mutate_if(is.character, as.factor) %>%
na.omit()
training <- read.csv("training.csv", na.strings=c("NA","#DIV/0!", "")) %>%
# don't train on columns that cannot be used at test time; keep the outcome classe, which is part of the assignment but absent from the test set
select(c(colnames(test), "classe")) %>%
mutate_if(is.character, as.factor) %>%
na.omit()
# classe <- training$classe
# training %<>% select(-c("classe"))
# summary(training$classe)
# summary(training)
# summary(test)
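A quick sanity check of the cleaned data (a minimal sketch; the expected values in the comments follow from the description above):
# Sanity checks on the cleaned data
dim(training)                               # expect almost 20,000 rows and 58 columns
setdiff(colnames(training), colnames(test)) # only "classe" should be absent from the test set
table(training$classe)                      # class balance of the outcome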
I used the caret package to train a Random Forest model (method "rf"), which grows 500 trees by default, and added an instruction to perform 10-fold cross-validation.
The main disadvantage is that training is slow. I therefore chose a small training partition (10% of the original training data, about 2,000 samples) instead of a more traditional split such as 75%, and used the rest of the original training data for validation. My experiments showed that even smaller training partitions may work well enough on the validation dataset (1% to 5%, which I used while developing this document; see the sketch after the training code below).
set.seed(1)
part = caret::createDataPartition(training$`classe`, p = 0.1)[[1]] # about 2000 samples. my laptop takes too long to train on more and I don't want to wait.
train_1 = training[ part,]
valid_1 = training[-part,]
set.seed(1)
model_1 <- caret::train(`classe` ~ ., data=train_1,
method="rf",
trControl=caret::trainControl(method = "CV", number = 10)) # 10-fold cross-validation
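To see how much training data is really needed, here is a minimal sketch of the kind of experiment mentioned above (the partition fractions and the 5-fold inner cross-validation are illustrative choices, and re-running this takes a few minutes):
# Sketch: validation accuracy for a few training-partition sizes (illustrative only)
set.seed(1)
sizes <- c(0.01, 0.05, 0.10)
acc_by_size <- sapply(sizes, function(p) {
  idx <- caret::createDataPartition(training$classe, p = p)[[1]]
  fit <- caret::train(classe ~ ., data = training[idx, ],
                      method = "rf",
                      trControl = caret::trainControl(method = "cv", number = 5))
  pred <- predict(fit, training[-idx, ])
  caret::postResample(pred, training[-idx, ]$classe)[["Accuracy"]]
})
names(acc_by_size) <- scales::percent(sizes)
acc_by_size # accuracy on the held-out part of the original training data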
# Summary of the model
model_1
## Random Forest
##
## 1964 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1768, 1767, 1767, 1768, 1767, 1768, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9531412 0.9406368
## 40 0.9816454 0.9767816
## 79 0.9826711 0.9780814
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 79.
This model reached an accuracy of almost 99% on the validation data, which suggests that it is likely to perform adequately on the test data.
predicted <- predict(model_1, valid_1)
cm <- caret::confusionMatrix(valid_1$classe, predicted)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5017 5 0 0 0
## B 45 3327 45 0 0
## C 0 35 3028 16 0
## D 0 0 16 2878 0
## E 0 12 0 38 3196
##
## Overall Statistics
##
## Accuracy : 0.988
## 95% CI : (0.9863, 0.9895)
## No Information Rate : 0.2867
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9848
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9911 0.9846 0.9803 0.9816 1.0000
## Specificity 0.9996 0.9937 0.9965 0.9989 0.9965
## Pos Pred Value 0.9990 0.9737 0.9834 0.9945 0.9846
## Neg Pred Value 0.9964 0.9963 0.9958 0.9963 1.0000
## Prevalence 0.2867 0.1914 0.1749 0.1660 0.1810
## Detection Rate 0.2841 0.1884 0.1715 0.1630 0.1810
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9954 0.9892 0.9884 0.9902 0.9983
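Since the validation set played no role in training, the validation accuracy directly estimates the out-of-sample performance; a short sketch to extract the corresponding out-of-sample error from the confusion-matrix object:
# Estimated out-of-sample error = 1 - validation accuracy (about 1.2% here)
oos_error <- 1 - cm$overall[["Accuracy"]]
scales::percent(oos_error, accuracy = 0.1)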
data.frame(cm$table) %>%
group_by(Reference) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = Reference, y = Prediction, fill = prop)) +
geom_tile(color="white", size=0.5) +
geom_text(aes(label = scales::percent(prop, 0.1))) +
scale_fill_viridis_c("", direction = -1, labels=scales::percent_format(), limits=c(0, 1), breaks=seq(0, 1, 0.25))
Figure: confusion matrix of the validation-set predictions, shown as proportions within each reference class.
# acc <- caret::postResample(predicted,valid_1$`classe`)
# acc
Here is a data frame with the predicted class for each of the 20 test cases, together with the estimated probability of each class.
final_prediction <- predict(model_1, test, type="prob")
final_prediction %>% bind_cols(`class` = apply(final_prediction, 1, which.max)) %>%
mutate(`class`=factor(`class`, levels = 1:5, labels=LETTERS[1:5])) %>%
mutate(`N` = 1:20)
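Equivalently, since caret's predict() returns the predicted labels when type is left at its default, the hard class predictions for the 20 test cases can be obtained in one call:
# Predicted classe label for each of the 20 test cases, in test-set order
predict(model_1, test)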