Introduction

People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. This is the “classe” variable in the training set.

Initialization code

knitr::opts_chunk$set(echo = TRUE)
library(data.table)
library(dplyr)
library(caret)
library(rpart)
library(knitr)
library(kableExtra)

library(randomForest)
library(foreach) # for the random forest learning, multiple core use
library(doParallel)

Loading data

URL_train = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URL_valid = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(URL_train,destfile = "./pml_training.csv",method = "curl")
download.file(URL_valid,destfile = "./pml_validation.csv",method = "curl")

library(data.table)
data <- fread("pml_training.csv")
data_valid <- fread("pml_validation.csv")

Cleaning and preparing data

First : discard variables with variances close to zero.

library(dplyr)
library(caret)
data_sub <- data %>% select(-nearZeroVar(data))

Second : eliminate variables containing too high a proportion of NA.

prop_NA <- colMeans(is.na(x = data_sub))
prop_NA <- prop_NA[prop_NA > 0] # Subset of columns with NA

There are 65 variables with NA values. The minimum proportion of NA across these 65 variables is 0.9793089, so all of these variables are deleted.
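
These figures can be checked directly from prop_NA (a quick sketch) :

length(prop_NA)  # number of variables containing at least one NA
min(prop_NA)     # smallest proportion of NA among these variables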

ind_NA <- which(names(data_sub) %in% names(prop_NA))
data_sub_2 <- data_sub %>% select(-all_of(ind_NA))

Third : delete the first six columns.

Since they interfere with the variable classe, the first six columns are deleted : columns like V1 (the row index) or the raw timestamp columns can give the class directly, without relying on any sensor measurement.

data_sub_3 <- data_sub_2[,-(1:6)]
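
To confirm which columns are dropped, their names can be listed (a quick check ; the exact set depends on the columns kept by nearZeroVar()) :

# Columns present before the drop but absent afterwards
setdiff(names(data_sub_2), names(data_sub_3))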

Fourth : splitting the cleaned data set data_sub_3 into training and testing sets.

set.seed(0)
inTrain <-
  as.vector(createDataPartition(y = data_sub_3$classe, p = .7, list = FALSE))
data_train <- data_sub_3[inTrain,]
data_test <- data_sub_3[-inTrain,]

Classification problem (output : discrete values) with supervised learning

The first approach tries to partition the training set into multiple sub-spaces so that each final subset is as homogeneous as possible. The outcome is the variable classe, which is categorical. Thus, we will apply a classification tree.

Generating the object model that contains the classification tree by using rpart() from the rpart package.

library(rpart)
library(rpart.plot)
model <-
  rpart(classe ~ .,
        data = data_train)
rpart.plot(x = model, yesno = 2, type = 0, extra = 104, tweak = 1.2)

Each node shows :

  • the predicted class (A to E) ;
  • the predicted probability of each class ;
  • the percentage of observations in the node.

This model is complex, hence there is a high chance of overfitting.

The maximum depth of this tree is given by the code below :

nodes <- as.numeric(rownames(model$frame))
max(rpart:::tree.depth(nodes))
## [1] 9

A node’s depth is the number of edges between it and the root ; the root (node 1, where the first split occurs) has depth 0.
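
rpart numbers its nodes so that the children of node n are 2n and 2n + 1, hence the depth of a node is floor(log2(n)) of its node number (essentially what rpart:::tree.depth computes). A quick illustration :

# Node 1 (the root) has depth 0, nodes 2-3 depth 1, nodes 4-7 depth 2, node 9 depth 3
floor(log2(c(1, 2, 3, 4, 7, 9)))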

Predictions using model and testing set.

pred_mod <- predict(object = model,newdata = data_test, type = "class")
confus_mod <- confusionMatrix(pred_mod,factor(data_test$classe))
kbl(round(confus_mod$byClass,4)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.8088 0.9449 0.8537 0.9256 0.8537 0.8088 0.8307 0.2845 0.2301 0.2695 0.8769
Class: B 0.5961 0.8881 0.5612 0.9016 0.5612 0.5961 0.5781 0.1935 0.1154 0.2056 0.7421
Class: C 0.6881 0.9107 0.6193 0.9326 0.6193 0.6881 0.6519 0.1743 0.1200 0.1937 0.7994
Class: D 0.7033 0.9266 0.6526 0.9410 0.6526 0.7033 0.6770 0.1638 0.1152 0.1766 0.8150
Class: E 0.6442 0.9557 0.7659 0.9226 0.7659 0.6442 0.6998 0.1839 0.1184 0.1546 0.7999
kbl(bind_cols(Feature=names(confus_mod$overall), model=round(confus_mod$overall,4))) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
Feature model
Accuracy 0.6991
Kappa 0.6202
AccuracyLower 0.6872
AccuracyUpper 0.7108
AccuracyNull 0.2845
AccuracyPValue 0.0000
McnemarPValue 0.0000

The confusion matrix returns an accuracy of 0.6991 over all the predictions, with a 95 % confidence interval of [0.6872, 0.7108]. The interval is quite tight.
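
This interval can be reproduced (up to rounding) as an exact binomial confidence interval on the proportion of correct predictions ; a sketch :

n_correct <- sum(pred_mod == data_test$classe)   # number of correct predictions
binom.test(n_correct, nrow(data_test))$conf.int  # should match AccuracyLower / AccuracyUpper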

Reducing tree size by modifying the complexity parameter.

A way to limit the maximum depth of a tree is to change the complexity parameter cp. This parameter penalizes splits that do not improve the fit enough : the higher the cp, the smaller the tree.
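
As an illustration, the first tree can be pruned at increasing cp values (a sketch ; prune() is from the rpart package and the node counts are only indicative) :

# Pruning at a larger cp removes the weaker splits, so the tree keeps fewer nodes
nrow(prune(model, cp = 0.01)$frame)  # number of nodes kept at cp = 0.01
nrow(prune(model, cp = 0.05)$frame)  # fewer nodes at cp = 0.05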

An optimal value for cp is searched by cross-validation. The selected cp value is the one that maximizes the cross-validation accuracy.

Two ways of coding the learning procedure

The classic way, with train() from the caret package.

set.seed(0)
model2 <- train(
  classe ~ .,
  data = data_train,
  method = "rpart",
  trControl = trainControl(
    method = "repeatedcv",
    number = 10, # 10-fold cross validation
    repeats = 10
  ),
  tuneLength = 1000 # number of possible cp values to evaluate.
)

Using multiple cores and train().

By default, R processes commands on a single CPU core. To make the most of the available computing power, it is possible to use the doParallel and foreach libraries. This is done below :

  • using the function foreach::foreach(), a range of cp values is scanned ;
  • each value is handled by a separate core ;
  • this saves time ;
  • from each fitted model, only the cp value and the accuracy results are kept ;
  • final results are stored in the object results_2BIS.

library(foreach)
library(doParallel)
my_cores <- detectCores()
my_cores <- my_cores - 2 #Keep two cores free
registerDoParallel(my_cores)
time_track_2BIS <- system.time({
  results_2BIS <-
    foreach(num_cp = seq(
      from = 0,
      to = .2,
      length.out = 500
    ),
    .combine = rbind) %dopar% {
      model2BIS <- train(
        classe ~ .,
        data = data_train,
        method = "rpart",
        trControl = trainControl(
          method = "repeatedcv",
          number = 10,
          # 10-fold cross validation
          repeats = 3
        ),
        tuneGrid = expand.grid(.cp = num_cp)
      )
      model2BIS$results
    }
})
print(time_track_2BIS)
str(results_2BIS)

A significant time saving can be seen : the elapsed time is almost 10 times smaller. Moreover, in the multi-core approach the cross-validation was repeated three times for each cp.

print(model2$bestTune)
##   cp
## 1  0
row_max <- which.max(results_2BIS[,"Accuracy"])
print(results_2BIS[row_max,1])
## [1] 0
plot(model2)

So, with 1000 candidate values, this plot shows that the best cp is close to zero and that the cross-validation accuracy decreases as the complexity parameter increases. Each cross-validation, for each cp, is repeated 10 times.

nodes2 <- as.numeric(rownames(model2$finalModel$frame))
max(rpart:::tree.depth(nodes2))
## [1] 21

The maximum depth (21) is greater than for model (9). Therefore, model2 is more complex than the first model. Indeed, the first model used rpart()'s default cp of 0.01, whereas here the selected cp is close to zero, which allows a deeper tree.
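
The default cp used by rpart() can be checked directly :

rpart.control()$cp  # default complexity parameter (0.01)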

Predictions using model2 and testing set.

pred_mod2 <- predict(object = model2$finalModel,newdata = data_test, type = "class")
confus_mod2 <- confusionMatrix(pred_mod2,factor(data_test$classe))
kbl(round(confus_mod2$byClass,4)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9432 0.9888 0.9711 0.9777 0.9711 0.9432 0.9570 0.2845 0.2683 0.2763 0.9660
Class: B 0.8903 0.9735 0.8895 0.9737 0.8895 0.8903 0.8899 0.1935 0.1723 0.1937 0.9319
Class: C 0.8850 0.9671 0.8502 0.9755 0.8502 0.8850 0.8672 0.1743 0.1543 0.1815 0.9260
Class: D 0.9191 0.9860 0.9277 0.9842 0.9277 0.9191 0.9234 0.1638 0.1506 0.1623 0.9525
Class: E 0.9381 0.9831 0.9261 0.9860 0.9261 0.9381 0.9320 0.1839 0.1725 0.1862 0.9606
kbl(bind_cols(Feature=matrix(names(confus_mod2$overall)),model=round(confus_mod$overall,4),model2=round(confus_mod2$overall,4))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
Feature model model2
Accuracy 0.6991 0.9179
Kappa 0.6202 0.8963
AccuracyLower 0.6872 0.9106
AccuracyUpper 0.7108 0.9248
AccuracyNull 0.2845 0.2845
AccuracyPValue 0.0000 0.0000
McnemarPValue 0.0000 0.0002

The accuracy is better with the best tune of model2. This reflects a smaller cp and therefore a deeper tree (depth 21). The graphical representation of that tree is unintelligible, which is why it is not shown.

Discussions

A classification tree is a very useful machine learning algorithm by virtue of its easy interpretability. Classification trees can outperform simple (or even multiple) linear regression when the relationships between predictors and outcome are highly non-linear.

Nevertheless, this algorithm has two big issues :

  • the resulting trees are unstable : small variations in the training set can produce very different trees ;
  • they tend to overfit the training set.

It is possible to address these issues by building multiple trees from the training set with another machine learning algorithm : random forest.

Random forest classifier with supervised learning

In the previous section the rpart method was used ; now the rf method is used.

Parameters to be determined

With the Random Forest algorithm, there are at least two parameters to be set to values maximizing accuracy :

  • mtry : the number of predictor variables sampled at each split ;
  • ntree : the number of trees grown. By default ntree = 500 (see the sketch below for the default mtry).
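
For reference, randomForest's default mtry for a classification problem is the square root of the number of predictors ; a sketch :

p <- ncol(data_train) - 1  # number of predictor columns (everything except classe)
floor(sqrt(p))             # randomForest's default mtry for classification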

Use multiple cores

Similar to the previous section, foreach() is used to exploit multiple CPU cores.

ntree from 10 to 130 by 20.

mtry_seq <- c(2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 49)
time_track <- system.time({
  results <-
    foreach(numb_tree = seq(from = 10, to = 130, by = 20),
            .combine = rbind) %dopar% {
              model3 <- train(
                form = classe ~ .,
                data = data_train,
                method = "rf",
                trControl = trainControl(
                  method = "repeatedcv",
                  number = 10,
                  repeats = 4
                ),
                # Train for all values of mtry stored in mtry_seq
                tuneGrid = expand.grid(.mtry = mtry_seq),
                ntree = numb_tree,
                importance = TRUE,
                metric = "Accuracy",
                verbose = FALSE
              )
              # The return is stored in `results`
              bind_cols(model3$results[, 1:2], ntree = rep(numb_tree, length(mtry_seq)))
            }
})

ntree from 150 to 230 by 20.

mtry_seq <- c(2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 49)
time_track_2 <- system.time({
  results_2 <-
    foreach(numb_tree = seq(from = 150, to = 230, by = 20),
            .combine = rbind) %dopar% {
              model4 <- train(
                form = classe ~ .,
                data = data_train,
                method = "rf",
                trControl = trainControl(
                  method = "repeatedcv",
                  number = 10,
                  repeats = 4
                ),
                # Train for all values of mtry stored in mtry_seq
                tuneGrid = expand.grid(.mtry = mtry_seq),
                ntree = numb_tree,
                importance = TRUE,
                metric = "Accuracy",
                verbose = FALSE
              )
              # The return is stored in `results_2`
              bind_cols(model4$results[, 1:2], ntree = rep(numb_tree, length(mtry_seq)))
            }
})

Time cost.

print(time_track)
##        user      system     elapsed 
##   44357.503    1653.262    5713.620
print(time_track_2)
##        user      system     elapsed 
##   106388.62     4489.74    12592.78

Merging the two resulting objects results and results_2 into one for visualization.

result_acc <-  bind_rows(results,results_2)
names(result_acc) <- c("mtry", "Accuracy", "ntree")
sapply(result_acc,which.max)
##     mtry Accuracy    ntree 
##      144       50       13
result_acc[50,]
##    mtry  Accuracy ntree
## 91    5 0.9911731   190

Visualization of the evolution of accuracy according to mtry. Curves are grouped by the value of ntree.

library(ggplot2)
mpt <- result_acc %>%
  ggplot(aes(x = mtry, y = Accuracy, group = ntree, colour = ntree)) +
  geom_point() +
  geom_line() +
  labs(x = "mtry", y = "Accuracy")
print(mpt)

Selection of best tunes.

It turns out :

  • the more trees, the better the accuracy, but this is very time consuming : about 3.5 h for values of ntree ranging from 150 to 230 ;
  • the best tunes are :
    • mtry = 5 ;
    • ntree = 190.

Building a Random Forest model named model5 with the best tunes

model5 <-
  randomForest(
    as.factor(classe) ~ .,
    data = data_train,
    mtry = 5,
    ntree = 190,
    importance = TRUE
  )

Error rate

plot(model5)
legend("right", colnames(model5$err.rate),cex=1.5,fill = 1:6,lty = 1:6, title="Classe and OOB", text.font=4, bg='#abd4b6')

This plot leads to the same conclusion as before : the higher the number of trees, the lower the error and, consequently, the higher the accuracy.
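
The final out-of-bag (OOB) error estimate can also be read directly from the model object ; a sketch :

# err.rate has one row per tree ; the last row gives the OOB and per-class
# error rates once all 190 trees have been grown
tail(model5$err.rate, 1)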

Predictions using model5 and testing set.

pred_mod5 <- predict(object = model5,newdata = data_test, type = "class")
confus_mod5 <- confusionMatrix(pred_mod5,factor(data_test$classe))
kbl(round(confus_mod5$byClass,4)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9988 0.9995 0.9988 0.9995 0.9988 0.9988 0.9988 0.2845 0.2841 0.2845 0.9992
Class: B 0.9947 0.9966 0.9861 0.9987 0.9861 0.9947 0.9904 0.1935 0.1925 0.1952 0.9957
Class: C 0.9766 0.9967 0.9843 0.9951 0.9843 0.9766 0.9804 0.1743 0.1703 0.1730 0.9867
Class: D 0.9844 0.9978 0.9885 0.9970 0.9885 0.9844 0.9865 0.1638 0.1613 0.1631 0.9911
Class: E 0.9991 0.9994 0.9972 0.9998 0.9972 0.9991 0.9982 0.1839 0.1837 0.1842 0.9992
kbl(bind_cols(Feature=names(confus_mod5$overall), model5=round(confus_mod5$overall,4))) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
Feature model5
Accuracy 0.9918
Kappa 0.9897
AccuracyLower 0.9892
AccuracyUpper 0.9940
AccuracyNull 0.2845
AccuracyPValue 0.0000
McnemarPValue NaN

The confusion matrix returns an accuracy of 0.9918 over all the predictions, with a 95 % confidence interval of [0.9892, 0.9940]. The interval is quite tight.

Discussion

With Random Forest learning, the accuracy on the testing set is much better. Indeed :

  • for accuracy : from 0.6991 we get 0.9918 ;
  • for the confidence interval : from [0.6872, 0.7108] we get [0.9892, 0.9940].

But all of this comes at a cost : time. From a few seconds (less than 2 min) for rpart() we go to several hours (more than 5 h) to find the best tunes.

Prediction on the validation set

Finally, we use the validation set stored in data_valid to make predictions, as was done with the testing set data_test.
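
The prediction code below assumes a cleaned validation set named data_valid_sub with the same predictor columns as the training data. A minimal sketch to build it (assuming every retained predictor column is present in data_valid, which holds problem_id instead of classe) :

# Keep, in the validation set, the same predictor columns as in data_train
keep_cols <- setdiff(names(data_train), "classe")
data_valid_sub <- data_valid %>% select(all_of(keep_cols))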

pred_val <-
  predict(object = model,
          newdata = data_valid_sub,
          type = "class")
pred_val2 <-
  predict(object = model2$finalModel,
          newdata = data_valid_sub,
          type = "class")
pred_val5 <-
  predict(object = model5,
          newdata = data_valid_sub,
          type = "class")
kbl(
  bind_rows(
    bind_cols("model (rpart)", t(as.character(pred_val))),
    bind_cols("model2 (rpart with cp control)", t(as.character(pred_val2))),
    bind_cols("model5 (random forest)", t(as.character(pred_val5)))
  ),
  col.names = c("Learning approach","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20")
) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    font_size = 11
  )
Learning approach 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
model (rpart) C A C C A E D B A A C C E A E A A D D E
model2 (rpart with cp control) B A B A A E D B A A B C B A E E A B B B
model5 (random forest) B A B A A E D B A A B C B A E E A B B B

We can notice that model2 (rpart tuned by cross-validation) and model5 (Random Forest) return exactly the same predictions on the validation set, whereas the first rpart model differs on several cases.
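
This agreement can be checked programmatically ; a sketch :

all(as.character(pred_val2) == as.character(pred_val5))   # TRUE if model2 and model5 agree on all 20 cases
mean(as.character(pred_val) == as.character(pred_val5))   # share of cases where the first rpart model agrees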




