Introduction

People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. This is the “classe” variable in the training set.

Initialization code

knitr::opts_chunk$set(echo = TRUE)
library(data.table)
library(dplyr)
library(caret)
library(rpart)
library(knitr)
library(kableExtra)

library(randomForest)
library(foreach) # for the random forest learning, multiple core use
library(doParallel)

Loading data

URL_train = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URL_valid = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(URL_train,destfile = "./pml_training.csv",method = "curl")
download.file(URL_valid,destfile = "./pml_validation.csv",method = "curl")

library(data.table)
data <- fread("pml_training.csv")
data_valid <- fread("pml_validation.csv")

Cleaning and preparing data

First : discard variables with variances close to zero.

library(dplyr)
library(caret)
data_sub <- data %>% select(-nearZeroVar(data))

Second : eliminate variables containing too high a proportion of NA.

prop_NA <- colMeans(is.na(x = data_sub))
prop_NA <- prop_NA[prop_NA > 0] # Subset of columns with NA

There are 65 variables with NA values. The minimum proportion of NA across these 65 variables is 0.9793089, so all of these variables are deleted.
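
These figures can be checked directly from prop_NA (a quick sketch) :

length(prop_NA)  # number of variables containing at least one NA
min(prop_NA)     # smallest proportion of NA among these variables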

ind_NA <- which(names(data_sub) %in% names(prop_NA))
data_sub_2 <- data_sub %>% select(-all_of(ind_NA))

Third : delete the first six columns.

Since they interfere with the variable classe, the first six columns are deleted : columns like V1 (the row index) or the raw timestamp columns can give the class directly, without relying on any sensor measurement.

data_sub_3 <- data_sub_2[,-(1:6)]
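
To confirm which columns are dropped, their names can be listed (a quick check ; the exact set depends on the columns kept by nearZeroVar()) :

# Columns present before the drop but absent afterwards
setdiff(names(data_sub_2), names(data_sub_3))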

Fourth : splitting the cleaned data set data_sub_3 into training and testing sets.

set.seed(0)
inTrain <-
  as.vector(createDataPartition(y = data_sub_3$classe, p = .7, list = FALSE))
data_train <- data_sub_3[inTrain,]
data_test <- data_sub_3[-inTrain,]

Classification problem (output : discrete values) with supervised learning

The first approach tries to partition the training set into multiple sub-spaces so that each final subset is as homogeneous as possible. The outcome is the variable classe, which is categorical. Thus, we will apply a classification tree.

Generating the object model that contains the classification tree by using rpart() from the rpart package.

library(rpart)
library(rpart.plot)
model <-
  rpart(classe ~ .,
        data = data_train)
rpart.plot(x = model, yesno = 2, type = 0, extra = 104, tweak = 1.2)

Each node shows :

  • the predicted class (A to E) ;
  • the predicted probability of each class ;
  • the percentage of observations in the node.

This model is complex, hence there is a high chance of overfitting.

The maximum depth of this tree is given by the code below :

nodes <- as.numeric(rownames(model$frame))
max(rpart:::tree.depth(nodes))
## [1] 9

A node’s depth is the number of edges between it and the root ; the root (node 1, where the first split occurs) has depth 0.
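
rpart numbers its nodes so that the children of node n are 2n and 2n + 1, hence the depth of a node is floor(log2(n)) of its node number (essentially what rpart:::tree.depth computes). A quick illustration :

# Node 1 (the root) has depth 0, nodes 2-3 depth 1, nodes 4-7 depth 2, node 9 depth 3
floor(log2(c(1, 2, 3, 4, 7, 9)))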

Predictions using model and testing set.

pred_mod <- predict(object = model,newdata = data_test, type = "class")
confus_mod <- confusionMatrix(pred_mod,factor(data_test$classe))
kbl(round(confus_mod$byClass,4)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.8088 0.9449 0.8537 0.9256 0.8537 0.8088 0.8307 0.2845 0.2301 0.2695 0.8769
Class: B 0.5961 0.8881 0.5612 0.9016 0.5612 0.5961 0.5781 0.1935 0.1154 0.2056 0.7421
Class: C 0.6881 0.9107 0.6193 0.9326 0.6193 0.6881 0.6519 0.1743 0.1200 0.1937 0.7994
Class: D 0.7033 0.9266 0.6526 0.9410 0.6526 0.7033 0.6770 0.1638 0.1152 0.1766 0.8150
Class: E 0.6442 0.9557 0.7659 0.9226 0.7659 0.6442 0.6998 0.1839 0.1184 0.1546 0.7999
kbl(bind_cols(Feature=names(confus_mod$overall), model=round(confus_mod$overall,4))) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
Feature model
Accuracy 0.6991
Kappa 0.6202
AccuracyLower 0.6872
AccuracyUpper 0.7108
AccuracyNull 0.2845
AccuracyPValue 0.0000
McnemarPValue 0.0000

The confusion matrix returns an accuracy of 0.6991 over all the predictions, with a 95 % confidence interval of [0.6872, 0.7108]. The interval is quite tight.
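
This interval can be reproduced (up to rounding) as an exact binomial confidence interval on the proportion of correct predictions ; a sketch :

n_correct <- sum(pred_mod == data_test$classe)   # number of correct predictions
binom.test(n_correct, nrow(data_test))$conf.int  # should match AccuracyLower / AccuracyUpper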

Reducing tree size by modifying the complexity parameter.

A way to limit the maximum depth of a tree is to change the complexity parameter cp. This parameter penalizes splits that do not improve the fit enough : the higher the cp, the smaller the tree.
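
As an illustration, the first tree can be pruned at increasing cp values (a sketch ; prune() is from the rpart package and the node counts are only indicative) :

# Pruning at a larger cp removes the weaker splits, so the tree keeps fewer nodes
nrow(prune(model, cp = 0.01)$frame)  # number of nodes kept at cp = 0.01
nrow(prune(model, cp = 0.05)$frame)  # fewer nodes at cp = 0.05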

An optimal value for cp is searched by cross-validation. The selected cp value is the one that maximizes the cross-validation accuracy.

Two ways of coding the learning procedure

The classic way, with train() from the caret package.

set.seed(0)
model2 <- train(
  classe ~ .,
  data = data_train,
  method = "rpart",
  trControl = trainControl(
    method = "repeatedcv",
    number = 10, # 10-fold cross validation
    repeats = 10
  ),
  tuneLength = 1000 # number of possible cp values to evaluate.
)

Using multiple cores and train().

By default, R processes commands on a single CPU core. To make the most of the available computing power, it is possible to use the doParallel and foreach libraries. This is done below :

  • using the function foreach::foreach(), a range of cp values is scanned ;
  • each value is handled by a separate core ;
  • this saves time ;
  • from each fitted model, only the cp value and the accuracy results are kept ;
  • final results are stored in the object results_2BIS.

library(foreach)
library(doParallel)
my_cores <- detectCores()
my_cores <- my_cores - 2 #Keep two cores free
registerDoParallel(my_cores)
time_track_2BIS <- system.time({
  results_2BIS <-
    foreach(num_cp = seq(
      from = 0,
      to = .2,
      length.out = 500
    ),
    .combine = rbind) %dopar% {
      model2BIS <- train(
        classe ~ .,
        data = data_train,
        method = "rpart",
        trControl = trainControl(
          method = "repeatedcv",
          number = 10,
          # 10-fold cross validation
          repeats = 3
        ),
        tuneGrid = expand.grid(.cp = num_cp)
      )
      model2BIS$results
    }
})
print(time_track_2BIS)
str(results_2BIS)

A significant time saving can be seen : the elapsed time is almost 10 times smaller. Moreover, in the multi-core approach the cross-validation was repeated three times for each cp.

print(model2$bestTune)
##   cp
## 1  0
row_max <- which.max(results_2BIS[,"Accuracy"])
print(results_2BIS[row_max,1])
## [1] 0
plot(model2)

So, with 1000 candidate values, this plot shows that the best cp is close to zero and that the cross-validation accuracy decreases as the complexity parameter increases. Each cross-validation, for each cp, is repeated 10 times.

nodes2 <- as.numeric(rownames(model2$finalModel$frame))
max(rpart:::tree.depth(nodes2))
## [1] 21

The maximum depth (21) is greater than for model (9). Therefore, model2 is more complex than the first model. Indeed, the first model used rpart()'s default cp of 0.01, whereas here the selected cp is close to zero, which allows a deeper tree.
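
The default cp used by rpart() can be checked directly :

rpart.control()$cp  # default complexity parameter (0.01)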

Predictions using model2 and testing set.

pred_mod2 <- predict(object = model2$finalModel,newdata = data_test, type = "class")
confus_mod2 <- confusionMatrix(pred_mod2,factor(data_test$classe))
kbl(round(confus_mod2$byClass,4)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9432 0.9888 0.9711 0.9777 0.9711 0.9432 0.9570 0.2845 0.2683 0.2763 0.9660
Class: B 0.8903 0.9735 0.8895 0.9737 0.8895 0.8903 0.8899 0.1935 0.1723 0.1937 0.9319
Class: C 0.8850 0.9671 0.8502 0.9755 0.8502 0.8850 0.8672 0.1743 0.1543 0.1815 0.9260
Class: D 0.9191 0.9860 0.9277 0.9842 0.9277 0.9191 0.9234 0.1638 0.1506 0.1623 0.9525
Class: E 0.9381 0.9831 0.9261 0.9860 0.9261 0.9381 0.9320 0.1839 0.1725 0.1862 0.9606
kbl(bind_cols(Feature=matrix(names(confus_mod2$overall)),model=round(confus_mod$overall,4),model2=round(confus_mod2$overall,4))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
Feature model model2
Accuracy 0.6991 0.9179
Kappa 0.6202 0.8963
AccuracyLower 0.6872 0.9106
AccuracyUpper 0.7108 0.9248
AccuracyNull 0.2845 0.2845
AccuracyPValue 0.0000 0.0000
McnemarPValue 0.0000 0.0002

The accuracy is better with the best tune of model2. This reflects a smaller cp and therefore a deeper tree (depth 21). The graphical representation of that tree is unintelligible, which is why it is not shown.

Discussions

A classification tree is a very useful machine learning algorithm by virtue of its easy interpretability. Classification trees can outperform simple (or even multiple) linear regression when the relationships between predictors and outcome are highly non-linear.

Nevertheless, this algorithm has two big issues :

  • the resulting trees are unstable : small variations in the training set can produce very different trees ;
  • they tend to overfit the training set.

It is possible to address these issues by building multiple trees from the training set with another machine learning algorithm : random forest.

Random forest classifier with supervised learning

In the previous section the rpart method was used ; now the rf method is used.

Parameters to be determined

With the Random Forest algorithm, there are at least two parameters to be set to values maximizing accuracy :

  • mtry : the number of predictor variables sampled at each split ;
  • ntree : the number of trees grown. By default ntree = 500 (see the sketch below for the default mtry).
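
For reference, randomForest's default mtry for a classification problem is the square root of the number of predictors ; a sketch :

p <- ncol(data_train) - 1  # number of predictor columns (everything except classe)
floor(sqrt(p))             # randomForest's default mtry for classification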

Use multiple cores

Similar to the previous section, foreach() is used to exploit multiple CPU cores.

ntree from 10 to 130 by 20.

mtry_seq <- c(2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 49)
time_track <- system.time({
  results <-
    foreach(numb_tree = seq(from = 10, to = 130, by = 20),
            .combine = rbind) %dopar% {
              model3 <- train(
                form = classe ~ .,
                data = data_train,
                method = "rf",
                trControl = trainControl(
                  method = "repeatedcv",
                  number = 10,
                  repeats = 4
                ),
                # Train for all values of mtry stored in mtry_seq
                tuneGrid = expand.grid(.mtry = mtry_seq),
                ntree = numb_tree,
                importance = TRUE,
                metric = "Accuracy",
                verbose = FALSE
              )
              # The return is stored in `results`
              bind_cols(model3$results[, 1:2], ntree = rep(numb_tree, length(mtry_seq)))
            }
})

ntree from 150 to 230 by 20.

mtry_seq <- c(2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 49)
time_track_2 <- system.time({
  results_2 <-
    foreach(numb_tree = seq(from = 150, to = 230, by = 20),
            .combine = rbind) %dopar% {
              model4 <- train(
                form = classe ~ .,
                data = data_train,
                method = "rf",
                trControl = trainControl(
                  method = "repeatedcv",
                  number = 10,
                  repeats = 4
                ),
                # Train for all values of mtry stored in mtry_seq
                tuneGrid = expand.grid(.mtry = mtry_seq),
                ntree = numb_tree,
                importance = TRUE,
                metric = "Accuracy",
                verbose = FALSE
              )
              # The return is stored in `results_2`
              bind_cols(model4$results[, 1:2], ntree = rep(numb_tree, length(mtry_seq)))
            }
})

Time cost.

print(time_track)
##        user      system     elapsed 
##   44357.503    1653.262    5713.620
print(time_track_2)
##        user      system     elapsed 
##   106388.62     4489.74    12592.78

Merging the two resulting objects results and results_2 into one for visualization.

result_acc <-  bind_rows(results,results_2)
names(result_acc) <- c("mtry", "Accuracy", "ntree")
sapply(result_acc,which.max)
##     mtry Accuracy    ntree 
##      144       50       13
result_acc[50,]
##    mtry  Accuracy ntree
## 91    5 0.9911731   190

Visualization of the evolution of accuracy according to mtry. Curves are grouped by the value of ntree.

library(ggplot2)
mpt <- result_acc %>%
  ggplot(aes(x = mtry, y = Accuracy, group = ntree, colour = ntree)) +
  geom_point() +
  geom_line() +
  labs(x = "mtry", y = "Accuracy")
print(mpt)

Selection of best tunes.

It turns out :

  • the more trees, the better the accuracy, but this is very time consuming : about 3.5 h for values of ntree ranging from 150 to 230 ;
  • the best tunes are :
    • mtry = 5 ;
    • ntree = 190.

Building a Random Forest model named model5 with the best tunes

model5 <-
  randomForest(
    as.factor(classe) ~ .,
    data = data_train,
    mtry = 5,
    ntree = 190,
    importance = TRUE
  )

Error rate

plot(model5)
legend("right", colnames(model5$err.rate),cex=1.5,fill = 1:6,lty = 1:6, title="Classe and OOB", text.font=4, bg='#abd4b6')

This plot leads to the same conclusion as before : the higher the number of trees, the lower the error and, consequently, the higher the accuracy.
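
The final out-of-bag (OOB) error estimate can also be read directly from the model object ; a sketch :

# err.rate has one row per tree ; the last row gives the OOB and per-class
# error rates once all 190 trees have been grown
tail(model5$err.rate, 1)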

Predictions using model5 and testing set.

pred_mod5 <- predict(object = model5,newdata = data_test, type = "class")
confus_mod5 <- confusionMatrix(pred_mod5,factor(data_test$classe))
kbl(round(confus_mod5$byClass,4)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9988 0.9995 0.9988 0.9995 0.9988 0.9988 0.9988 0.2845 0.2841 0.2845 0.9992
Class: B 0.9947 0.9966 0.9861 0.9987 0.9861 0.9947 0.9904 0.1935 0.1925 0.1952 0.9957
Class: C 0.9766 0.9967 0.9843 0.9951 0.9843 0.9766 0.9804 0.1743 0.1703 0.1730 0.9867
Class: D 0.9844 0.9978 0.9885 0.9970 0.9885 0.9844 0.9865 0.1638 0.1613 0.1631 0.9911
Class: E 0.9991 0.9994 0.9972 0.9998 0.9972 0.9991 0.9982 0.1839 0.1837 0.1842 0.9992
kbl(bind_cols(Feature=names(confus_mod5$overall), model5=round(confus_mod5$overall,4))) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
Feature model5
Accuracy 0.9918
Kappa 0.9897
AccuracyLower 0.9892
AccuracyUpper 0.9940
AccuracyNull 0.2845
AccuracyPValue 0.0000
McnemarPValue NaN

The confusion matrix returns an accuracy of 0.9918 over all the predictions, with a 95 % confidence interval of [0.9892, 0.9940]. The interval is quite tight.

Discussion

With Random Forest learning, the accuracy on the testing set is much better. Indeed :

  • for accuracy : from 0.6991 we get 0.9918 ;
  • for the confidence interval : from [0.6872, 0.7108] we get [0.9892, 0.9940].

But all of this comes at a cost : time. From a few seconds (less than 2 min) for rpart() we go to several hours (more than 5 h) to find the best tunes.

Prediction on the validation set

Finally, we use the validation set stored in data_valid to make predictions, as was done with the testing set data_test.
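
The prediction code below assumes a cleaned validation set named data_valid_sub with the same predictor columns as the training data. A minimal sketch to build it (assuming every retained predictor column is present in data_valid, which holds problem_id instead of classe) :

# Keep, in the validation set, the same predictor columns as in data_train
keep_cols <- setdiff(names(data_train), "classe")
data_valid_sub <- data_valid %>% select(all_of(keep_cols))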

pred_val <-
  predict(object = model,
          newdata = data_valid_sub,
          type = "class")
pred_val2 <-
  predict(object = model2$finalModel,
          newdata = data_valid_sub,
          type = "class")
pred_val5 <-
  predict(object = model5,
          newdata = data_valid_sub,
          type = "class")
kbl(
  bind_rows(
    bind_cols("model (rpart)", t(as.character(pred_val))),
    bind_cols("model2 (rpart with cp control)", t(as.character(pred_val2))),
    bind_cols("model5 (random forest)", t(as.character(pred_val5)))
  ),
  col.names = c("Learning approach","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20")
) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    font_size = 11
  )
Learning approach 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
model (rpart) C A C C A E D B A A C C E A E A A D D E
model2 (rpart with cp control) B A B A A E D B A A B C B A E E A B B B
model5 (random forest) B A B A A E D B A A B C B A E E A B B B

We can notice that model2 (rpart tuned by cross-validation) and model5 (Random Forest) return exactly the same predictions on the validation set, whereas the first rpart model differs on several cases.
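
This agreement can be checked programmatically ; a sketch :

all(as.character(pred_val2) == as.character(pred_val5))   # TRUE if model2 and model5 agree on all 20 cases
mean(as.character(pred_val) == as.character(pred_val5))   # share of cases where the first rpart model agrees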




