One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. This is the “classe” variable in the training set.
knitr::opts_chunk$set(echo = TRUE)
library(data.table)
library(dplyr)
library(caret)
library(rpart)
library(knitr)
library(kableExtra)
library(randomForest)
library(foreach) # parallel loops, for multi-core model tuning
library(doParallel)
URL_train = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URL_valid = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(URL_train,destfile = "./pml_training.csv",method = "curl")
download.file(URL_valid,destfile = "./pml_validation.csv",method = "curl")
library(data.table)
data <- fread("pml_training.csv")
data_valid <- fread("pml_validation.csv")
library(dplyr)
library(caret)
data_sub <- data %>% select(-nearZeroVar(data))
prop_NA <- colMeans(is.na(data_sub)) # proportion of NAs per column
prop_NA <- prop_NA[prop_NA > 0] # keep only the columns that contain NAs
There are 65 variables with NA values. The minimum proportion of NAs
across these 65 variables is 0.9793089.
All these variables are therefore deleted.
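Both figures can be read directly from prop_NA (a quick check, reusing the object computed above) :
# Count of columns containing NAs, and the smallest NA proportion among them
length(prop_NA) # 65
min(prop_NA)    # 0.9793089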
ind_NA <- which(names(data_sub) %in% names(prop_NA))
data_sub_2 <- data_sub %>% select(-all_of(ind_NA))
Since they interfere with the variable classe, the
first six columns are also deleted. Indeed,
columns like V1 or raw_time give away the class
directly, without any measurement from the sensors.
data_sub_3 <- data_sub_2[,-(1:6)]
We now split data_sub_3 into training and testing sets.
set.seed(0)
inTrain <-
as.vector(createDataPartition(y = data_sub_3$classe, p = .7, list = FALSE))
data_train <- data_sub_3[inTrain,]
data_test <- data_sub_3[-inTrain,]
The first approach tries to partition the training set into multiple
sub-spaces so that each final subset is as homogeneous as possible.
The outcome is the variable classe, which is categorical.
Thus, we apply a classification tree.
We fit the classification tree by using rpart() from the
rpart package.
library(rpart)
library(rpart.plot)
model <-
rpart(classe ~ .,
data = data_train)
rpart.plot(x = model, yesno = 2, type = 0, extra = 104, tweak = 1.2)
Each node shows the predicted class, the per-class probabilities, and the percentage of observations in the node (this is what extra = 104 displays).
This model is complex, hence there is a high chance of overfitting.
The maximum depth of this tree is given by the code below :
nodes <- as.numeric(rownames(model$frame))
max(rpart:::tree.depth(nodes))
## [1] 9
A node’s depth is the number of edges back up to the root. Here, the root is the first node (where the first split occurs) and has depth 0.
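Equivalently, since rpart numbers nodes so that node k has children 2k and 2k+1, a node’s depth is floor(log2(k)). A quick check (a sketch reusing nodes from above) :
# Depth from rpart's binary node numbering; the root (node 1) has depth 0
max(floor(log2(nodes))) # 9, matching rpart:::tree.depth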
Prediction with model on the testing set :
pred_mod <- predict(object = model, newdata = data_test, type = "class")
confus_mod <- confusionMatrix(pred_mod,factor(data_test$classe))
kbl(round(confus_mod$byClass,4)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
| | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: A | 0.8088 | 0.9449 | 0.8537 | 0.9256 | 0.8537 | 0.8088 | 0.8307 | 0.2845 | 0.2301 | 0.2695 | 0.8769 |
| Class: B | 0.5961 | 0.8881 | 0.5612 | 0.9016 | 0.5612 | 0.5961 | 0.5781 | 0.1935 | 0.1154 | 0.2056 | 0.7421 |
| Class: C | 0.6881 | 0.9107 | 0.6193 | 0.9326 | 0.6193 | 0.6881 | 0.6519 | 0.1743 | 0.1200 | 0.1937 | 0.7994 |
| Class: D | 0.7033 | 0.9266 | 0.6526 | 0.9410 | 0.6526 | 0.7033 | 0.6770 | 0.1638 | 0.1152 | 0.1766 | 0.8150 |
| Class: E | 0.6442 | 0.9557 | 0.7659 | 0.9226 | 0.7659 | 0.6442 | 0.6998 | 0.1839 | 0.1184 | 0.1546 | 0.7999 |
kbl(bind_cols(Feature=names(confus_mod$overall), model=round(confus_mod$overall,4))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
| Feature | model |
|---|---|
| Accuracy | 0.6991 |
| Kappa | 0.6202 |
| AccuracyLower | 0.6872 |
| AccuracyUpper | 0.7108 |
| AccuracyNull | 0.2845 |
| AccuracyPValue | 0.0000 |
| McnemarPValue | 0.0000 |
The confusion matrix returns an accuracy of 0.6991 over all predictions, with a 95 % confidence interval of [0.6872, 0.7108]. The interval is quite tight.
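As a quick sanity check (a one-line sketch), the same accuracy can be recomputed directly as the proportion of correct predictions :
# Accuracy is just the share of test observations predicted correctly
mean(pred_mod == data_test$classe) # 0.6991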
A way to limit the maximum depth of a tree is to
change the complexity parameter cp. This parameter imposes
a penalty on splits that do not improve the fit enough. The higher the cp, the
smaller the tree.
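As an illustration (a sketch pruning the earlier tree at an arbitrary, larger cp of 0.05), prune() from the rpart package collapses the weakest splits :
# A larger cp prunes more aggressively, giving a shallower tree
model_pruned <- prune(model, cp = 0.05)
max(rpart:::tree.depth(as.numeric(rownames(model_pruned$frame))))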
An optimal value for cp is searched by
cross-validation. The selected cp value is the one that maximises
the cross-validation accuracy.
We use train() from the caret package.
set.seed(0)
model2 <- train(
classe ~ .,
data = data_train,
method = "rpart",
trControl = trainControl(
method = "repeatedcv",
number = 10, # 10-fold cross validation
repeats = 10
),
tuneLength = 1000 # number of possible cp values to evaluate.
)
By default, R runs train() on a single CPU core. To make
the most of the available computing power, it is possible to use the
doParallel and foreach libraries. This is done
below :

- with foreach::foreach(), a range of cp values is scanned in parallel ;
- for each value, cp and the cross-validation accuracy are kept ;
- the results are combined into results_2BIS.

library(foreach)
library(doParallel)
my_cores <- detectCores()
my_cores <- my_cores - 2 #Keep two cores free
registerDoParallel(my_cores)
time_track_2BIS <- system.time({
results_2BIS <-
foreach(num_cp = seq(
from = 0,
to = .2,
length.out = 500
),
.combine = rbind) %dopar% {
model2BIS <- train(
classe ~ .,
data = data_train,
method = "rpart",
trControl = trainControl(
method = "repeatedcv",
number = 10,
# 10-fold cross validation
repeats = 3
),
tuneGrid = expand.grid(.cp = num_cp)
)
model2BIS$results
}
})
print(time_track_2BIS)
str(results_2BIS)
The parallel run shows a significant time saving, almost
a factor of 10. Moreover, in the multi-core approach, the cross-validation
was repeated three times for each cp.
print(model2$bestTune)
## cp
## 1 0
row_max <- which.max(results_2BIS[,"Accuracy"])
print(results_2BIS[row_max,1])
## [1] 0
plot(model2)
With 1000 candidate values, this plot shows that the best
cp is close to zero and that the
cross-validation accuracy decreases as the complexity parameter increases. Each
cross-validation, for each cp, is repeated 10 times.
nodes2 <- as.numeric(rownames(model2$finalModel$frame))
max(rpart:::tree.depth(nodes2))
## [1] 21
The best maximum depth is greater than for model.
Therefore, model2 brings more complexity than the first
model : the first fit used rpart's default
cp = 0.01, whereas the tuned cp is 0, hence a deeper tree.
Prediction with model2 on the testing set :
pred_mod2 <- predict(object = model2$finalModel, newdata = data_test, type = "class")
confus_mod2 <- confusionMatrix(pred_mod2,factor(data_test$classe))
kbl(round(confus_mod2$byClass,4)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
| | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: A | 0.9432 | 0.9888 | 0.9711 | 0.9777 | 0.9711 | 0.9432 | 0.9570 | 0.2845 | 0.2683 | 0.2763 | 0.9660 |
| Class: B | 0.8903 | 0.9735 | 0.8895 | 0.9737 | 0.8895 | 0.8903 | 0.8899 | 0.1935 | 0.1723 | 0.1937 | 0.9319 |
| Class: C | 0.8850 | 0.9671 | 0.8502 | 0.9755 | 0.8502 | 0.8850 | 0.8672 | 0.1743 | 0.1543 | 0.1815 | 0.9260 |
| Class: D | 0.9191 | 0.9860 | 0.9277 | 0.9842 | 0.9277 | 0.9191 | 0.9234 | 0.1638 | 0.1506 | 0.1623 | 0.9525 |
| Class: E | 0.9381 | 0.9831 | 0.9261 | 0.9860 | 0.9261 | 0.9381 | 0.9320 | 0.1839 | 0.1725 | 0.1862 | 0.9606 |
kbl(bind_cols(Feature=names(confus_mod2$overall), model=round(confus_mod$overall,4), model2=round(confus_mod2$overall,4))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
| Feature | model | model2 |
|---|---|---|
| Accuracy | 0.6991 | 0.9179 |
| Kappa | 0.6202 | 0.8963 |
| AccuracyLower | 0.6872 | 0.9106 |
| AccuracyUpper | 0.7108 | 0.9248 |
| AccuracyNull | 0.2845 | 0.2845 |
| AccuracyPValue | 0.0000 | 0.0000 |
| McnemarPValue | 0.0000 | 0.0002 |
The accuracy is better with the best tune of model2.
This reflects the smaller cp and therefore the deeper tree (depth
21). The graphical representation of that tree is unintelligible, which is why it is
not provided.
The classification tree is a very useful machine learning algorithm by virtue of its easy interpretability. Classification trees can also outperform simple (or even multiple) linear regression when the relationships between predictors and outcome are highly non-linear.
Nevertheless, this algorithm has two big issues : it tends to overfit, and it is unstable, since a small change in the training data can produce a very different tree.
Both can be mitigated by building multiple trees from the training set with another machine learning algorithm : the random forest.
In the previous section the rpart method was used; now
it is the rf method.
With the random forest algorithm, there are at least two parameters to
be tuned to values maximizing accuracy :

- mtry : the number of predictor variables tried at each split ;
- ntree : the number of trees grown. By default, ntree = 500.
Similar to the previous section, foreach() is used to
exploit multiple CPU cores.
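For context, the largest mtry tried below (49) should match the number of available predictors; at that value the random forest reduces to plain bagging, since every split considers all predictors. A quick check (a sketch) :
# Number of candidate predictors (all columns except the outcome classe);
# this should equal the upper end of mtry_seq below
ncol(data_train) - 1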
First run : ntree from 10 to 130, by steps of 20.
mtry_seq <- c(2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 49)
time_track <- system.time({
results <-
foreach(numb_tree = seq(from = 10, to = 130, by = 20),
.combine = rbind) %dopar% {
model3 <- train(
form = classe ~ .,
data = data_train,
method = "rf",
trControl = trainControl(
method = "repeatedcv",
number = 10,
repeats = 4
),
# Train for all values of mtry stored in mtry_seq
tuneGrid = expand.grid(.mtry = mtry_seq),
ntree = numb_tree,
importance = TRUE,
metric = "Accuracy",
verbose = FALSE
)
# The return is stored in `results`
bind_cols(model3$results[, 1:2], rep(numb_tree, length(mtry_seq)))
}
})
Second run : ntree from 150 to 230, by steps of 20.
mtry_seq <- c(2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 49)
time_track_2 <- system.time({
results_2 <-
foreach(numb_tree = seq(from = 150, to = 230, by = 20),
.combine = rbind) %dopar% {
model4 <- train(
form = classe ~ .,
data = data_train,
method = "rf",
trControl = trainControl(
method = "repeatedcv",
number = 10,
repeats = 4
),
# Train for all values of mtry stored in mtry_seq
tuneGrid = expand.grid(.mtry = mtry_seq),
ntree = numb_tree,
importance = TRUE,
metric = "Accuracy",
verbose = FALSE
)
# The return is stored in `results_2`
bind_cols(model4$results[, 1:2], rep(numb_tree, length(mtry_seq)))
}
})
print(time_track)
## user system elapsed
## 44357.503 1653.262 5713.620
print(time_track_2)
## user system elapsed
## 106388.62 4489.74 12592.78
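Once the last %dopar% loop has finished, the workers registered earlier with registerDoParallel() can be released (a small housekeeping step) :
# Free the implicit cluster created by registerDoParallel(my_cores)
stopImplicitCluster()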
We combine results and results_2 into a single data frame for visualization.
result_acc <- bind_rows(results,results_2)
names(result_acc) <- c("mtry", "Accuracy", "ntree")
sapply(result_acc,which.max)
## mtry Accuracy ntree
## 144 50 13
result_acc[50,]
## mtry Accuracy ntree
## 91 5 0.9911731 190
We plot the accuracy according to mtry, with curves grouped by the value of ntree.
library(ggplot2)
mpt <- result_acc %>%
ggplot(aes(x = mtry, y = Accuracy, group = ntree)) +
geom_point(aes(colour = ntree)) +
geom_line(aes(colour = ntree)) +
labs(x = "mtry", y = "Accuracy")
print(mpt)
It turns out that :

- accuracy is higher for ntree ranging from 150 to 230 ;
- the best value is mtry = 5 ;
- with ntree = 190.

We fit model5 with these best tunes :
model5 <-
best tunesmodel5 <-
randomForest(
as.factor(classe) ~ .,
data = data_train,
mtry = 5,
ntree = 190,
importance = TRUE
)
plot(model5)
legend("right", colnames(model5$err.rate),cex=1.5,fill = 1:6,lty = 1:6, title="Classe and OOB", text.font=4, bg='#abd4b6')
This plot leads to the same conclusion as before : the higher the number
of trees, the lower the error, and hence the higher the accuracy.
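Two quick follow-ups on the fitted model5 (sketches): the final out-of-bag error estimate, and the variable importance that importance = TRUE recorded.
# OOB error estimate at the last tree (ntree = 190)
tail(model5$err.rate[, "OOB"], 1)
# Plot the ten most important predictors according to the forest
varImpPlot(model5, n.var = 10, main = "model5 : variable importance")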
Prediction with model5 on the testing set :
pred_mod5 <- predict(object = model5, newdata = data_test, type = "class")
confus_mod5 <- confusionMatrix(pred_mod5,factor(data_test$classe))
kbl(round(confus_mod5$byClass,4)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),font_size = 11)
| | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: A | 0.9988 | 0.9995 | 0.9988 | 0.9995 | 0.9988 | 0.9988 | 0.9988 | 0.2845 | 0.2841 | 0.2845 | 0.9992 |
| Class: B | 0.9947 | 0.9966 | 0.9861 | 0.9987 | 0.9861 | 0.9947 | 0.9904 | 0.1935 | 0.1925 | 0.1952 | 0.9957 |
| Class: C | 0.9766 | 0.9967 | 0.9843 | 0.9951 | 0.9843 | 0.9766 | 0.9804 | 0.1743 | 0.1703 | 0.1730 | 0.9867 |
| Class: D | 0.9844 | 0.9978 | 0.9885 | 0.9970 | 0.9885 | 0.9844 | 0.9865 | 0.1638 | 0.1613 | 0.1631 | 0.9911 |
| Class: E | 0.9991 | 0.9994 | 0.9972 | 0.9998 | 0.9972 | 0.9991 | 0.9982 | 0.1839 | 0.1837 | 0.1842 | 0.9992 |
kbl(bind_cols(Feature=names(confus_mod5$overall), model5=round(confus_mod5$overall,4))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F,font_size = 11)
| Feature | model5 |
|---|---|
| Accuracy | 0.9918 |
| Kappa | 0.9897 |
| AccuracyLower | 0.9892 |
| AccuracyUpper | 0.9940 |
| AccuracyNull | 0.2845 |
| AccuracyPValue | 0.0000 |
| McnemarPValue | NaN |
The confusion matrix returns an accuracy of 0.9918 over all predictions, with a 95 % confidence interval of [0.9892, 0.9940]. The interval is quite tight.
With random forest learning, the accuracy on the testing set is much better : it rises from 0.6991 (model) and 0.9179 (model2) to 0.9918 (model5).
But all of this comes at a cost : time. From a few
seconds (less than 2 minutes) for rpart(), we go to several hours
(more than 5 h) to find the best tunes.
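For the record, the three test-set accuracies can be gathered from the confusion-matrix objects computed above :
# Side-by-side recap of the overall test-set accuracy of the three models
data.frame(
model = c("model (rpart)", "model2 (rpart, tuned cp)", "model5 (random forest)"),
Accuracy = c(confus_mod$overall["Accuracy"],
confus_mod2$overall["Accuracy"],
confus_mod5$overall["Accuracy"])
)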
Lastly, we use the validation set stored in
data_valid to make predictions, as we did with the testing set
data_test.
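The validation data must first get the same column treatment as the training data. A minimal sketch (keep_cols is a hypothetical helper; it defines the data_valid_sub object that the predictions below expect) :
# Keep in the validation set exactly the predictor columns used for training
keep_cols <- setdiff(names(data_sub_3), "classe") # hypothetical helper
data_valid_sub <- data_valid %>% select(all_of(keep_cols))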
pred_val <-
predict(object = model,
newdata = data_valid_sub,
type = "class")
pred_val2 <-
predict(object = model2$finalModel,
newdata = data_valid_sub,
type = "class")
pred_val5 <-
predict(object = model5,
newdata = data_valid_sub,
type = "class")
kbl(
bind_rows(
bind_cols("model (rpart)", t(as.character(pred_val))),
bind_cols("model2 (rpart with cp control)", t(as.character(pred_val2))),
bind_cols("model5 (random forest)", t(as.character(pred_val5)))
),
col.names = c("Learning approach","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
font_size = 11
)
| Learning approach | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| model (rpart) | C | A | C | C | A | E | D | B | A | A | C | C | E | A | E | A | A | D | D | E |
| model2 (rpart with cp control) | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| model5 (random forest) | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
model2 (rpart with cp control) and model5 (random forest) return the same predictions on the validation set.