This document was produced for the course project of the Practical Machine Learning course, part of Coursera’s Data Science Specialization by Johns Hopkins University.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The following environment was chosen to accomplish the objective: RStudio 2022.02.1 Build 461, R version 4.0.4 (2021-02-15), caret 6.0.92, rpart 4.1.15, rpart.plot 3.1.1, corrplot 0.92, randomForest 4.6.14.
The random seed was set to 88570 so that the reported statistics are reproducible.
The training and test data used in this project are available at <https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv> and <https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv> respectively, kindly provided by <http://groupware.les.inf.puc-rio.br/har>. The loaded dataframes have the following structure:
| Dataframe | Num. of Observations | Num. of Variables |
|---|---|---|
| training_raw | 19622 | 160 |
| testing_raw | 20 | 160 |
The target to predict is the variable classe.
First, by inspecting the test dataframe we can spot columns that are entirely empty, so it is safe to remove them from both sets. This reduces the number of variables to:
| Dataframe | Num. of Observations | Num. of Variables |
|---|---|---|
| training_clean | 19622 | 60 |
| testing_clean | 20 | 60 |
Now, since we want to learn only from the sensor data, let’s also remove the other non-relevant columns (row index, user name and temporal references) to reduce noise; the corresponding code is in the listing at the end of this document. This brings the number of variables to:
| Dataframe | Num. of Observations | Num. of Variables |
|---|---|---|
| training_clean | 19622 | 55 |
| testing_clean | 20 | 55 |
The target variable classe is originally of type character. Let’s convert it to a factor to ease the work of our model and cross-validation algorithms:
## A B C D E
## 5580 3797 3422 3216 3607
Are there any noticeable correlations with our target variable classe? What can the cor statistic tell us?
It is hard to find a strong one; the best is close to 0.30. Now let’s take a look at the overall correlation among the features:
It looks like some features are quite correlated with each other. We could reduce them by PCA, either if the final result does not meet the specifications, or if we have the time and resources to research further performance improvements. For now, we are restricted to 2000 words.
Since the test set is reserved for the final verification, we will extract the validation set from the training set, using the common 3/4 split. We therefore end up with the following sets:
| Dataframe | Num. of Observations |
|---|---|
| training | 14718 |
| validation | 4904 |
| test | 20 |
Since we are trying to classify the exercises, the obvious baseline algorithm is a Decision Tree, and here is the structure it found:
Now, the performance evaluation using the validation set gives us the following figures:
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1160 141 0 79 15
## B 60 739 52 37 61
## C 0 57 713 73 12
## D 8 62 35 638 61
## E 4 41 3 56 797
##
## Overall Statistics
##
## Accuracy : 0.8252
## 95% CI : (0.8143, 0.8358)
## No Information Rate : 0.2512
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7799
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9416 0.7106 0.8879 0.7225 0.8425
## Specificity 0.9360 0.9457 0.9654 0.9587 0.9737
## Pos Pred Value 0.8315 0.7787 0.8339 0.7935 0.8846
## Neg Pred Value 0.9795 0.9239 0.9778 0.9402 0.9628
## Prevalence 0.2512 0.2121 0.1637 0.1801 0.1929
## Detection Rate 0.2365 0.1507 0.1454 0.1301 0.1625
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9388 0.8281 0.9266 0.8406 0.9081
It looks like there is considerable room for improvement. Let’s record some figures for later benchmarking:
| Performance Gauge | Result |
|---|---|
| Accuracy (validation) | 82.52 % |
| Out-of-Sample Error (est.) | 17.48 % |
| Time to train | 6.32 sec. |
| Latency to predict | 0.144 sec. |
The Random Forest algorithm in R, using 5-fold cross-validation, is known to perform well thanks to its robustness to outliers and its automatic selection of variables, so we should get better results. It would be informative to benchmark tree counts around 50, just to see how the procedure would behave in dynamic industrial control environments, where process response time is a real concern. For brevity, let’s do it only for 10 and 100 trees.
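For reference, a minimal sketch of the caret calls that would produce these models (the code listing at the end of this document loads pre-trained models from .rds files instead of re-running the training):

```r
# 5-fold cross-validation, as described above
ctrl <- trainControl(method = "cv", number = 5)

# Random Forest with 10 and 100 trees; ntree is passed through
# to the underlying randomForest call
model_rf10  <- train(classe ~ ., data = training, method = "rf",
                     trControl = ctrl, ntree = 10)
model_rf100 <- train(classe ~ ., data = training, method = "rf",
                     trControl = ctrl, ntree = 100)
```

Here are the models: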
For 100 trees :
## Random Forest
##
## 14718 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11774, 11775, 11773, 11774, 11776
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9930698 0.9912334
## 28 0.9971463 0.9963904
## 54 0.9942923 0.9927801
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 1 948 0 0 0
## C 0 2 853 0 0
## D 0 0 2 802 0
## E 0 0 0 1 900
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9973, 0.9996)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9979 0.9977 0.9988 1.0000
## Specificity 1.0000 0.9997 0.9995 0.9995 0.9998
## Pos Pred Value 1.0000 0.9989 0.9977 0.9975 0.9989
## Neg Pred Value 0.9997 0.9995 0.9995 0.9998 1.0000
## Prevalence 0.2847 0.1937 0.1743 0.1637 0.1835
## Detection Rate 0.2845 0.1933 0.1739 0.1635 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9996 0.9988 0.9986 0.9991 0.9999
And for only 10 trees :
## Random Forest
##
## 14718 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11775, 11776, 11773, 11774, 11774
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9813842 0.9764488
## 28 0.9927301 0.9908044
## 54 0.9934776 0.9917495
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 54.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 1 944 4 0 0
## C 0 3 852 0 0
## D 1 5 8 790 0
## E 0 0 0 1 900
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9927, 0.9969)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9938
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9906 0.9861 0.9987 1.0000
## Specificity 0.9997 0.9987 0.9993 0.9966 0.9998
## Pos Pred Value 0.9993 0.9947 0.9965 0.9826 0.9989
## Neg Pred Value 0.9994 0.9977 0.9970 0.9998 1.0000
## Prevalence 0.2847 0.1943 0.1762 0.1613 0.1835
## Detection Rate 0.2843 0.1925 0.1737 0.1611 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9991 0.9946 0.9927 0.9977 0.9999
Here are some important values to shed light on the comparison:
| Performance Gauge | 10 trees | 100 trees |
|---|---|---|
| Accuracy (validation) | 99.51 % | 99.88 % |
| Out-of-Sample Error (est.) | 0.49 % | 0.12 % |
| Time to train | 37.18 sec. | 310.214 sec. |
| Latency to predict | 0.118 sec. | 0.211 sec. |
The Random Forest produced models that, at first glance, do a good job of predicting the exercises. There are no concerns about accuracy, and we would probably also get a good AUC (not shown; again, we are restricted to 2000 words).
However, the model with 100 trees is unsuitable for control loops: a training time of more than 300 seconds is too long, and its latency of about 210 ms leaves little margin in most systems. The model with 10 trees performs better on latency than the single decision tree and is somewhat acceptable at roughly 120 ms. Its training time could be further improved by what was suggested in the EDA phase (reducing the dimensionality of the correlated x, y, z accelerometer axes by PCA, and thus the number of features), but this should be based on knowledge of the sensor intrinsics.
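As a sketch of that idea (not run here; the 95 % variance threshold is an illustrative choice, and caret’s preProcess would apply the reduction to all numeric features, not only the accelerometer axes):

```r
# Illustrative sketch: compress the correlated sensor features with PCA,
# keeping enough principal components to explain 95% of the variance
classe_idx <- which(names(training) == "classe")
pca_prep <- preProcess(training[, -classe_idx], method = "pca", thresh = 0.95)

# Project the training and validation sets onto the retained components
training_pca <- predict(pca_prep, training[, -classe_idx])
training_pca$classe <- training$classe
validation_pca <- predict(pca_prep, validation[, -classe_idx])
validation_pca$classe <- validation$classe
```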
Here are our model’s predictions for the 20 provided test samples:
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The 20 files with the result for each question of the assignment are available at <https://github.com/msantrax/coursera_johns_hopkins/blob/main/docs/index.md>
# Load the needed libraries
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
library(randomForest)
# Set the random seed for reproducibility
set.seed(88570)
# Load the work data locally
training_raw <- read.csv("pml-training.csv")
testing_raw <- read.csv("pml-testing.csv")
# Remove features that are entirely empty (NA) in the test set
emptymap <- (colSums(is.na(testing_raw)) == 0)
training_clean <- training_raw[, emptymap]
testing_clean <- testing_raw[, emptymap]
rm(emptymap)
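# Remove the non-sensor columns: row index, user name and temporal
# references (the names below assume the standard PML dataset layout)
drop_cols <- c("X", "user_name", "raw_timestamp_part_1",
               "raw_timestamp_part_2", "cvtd_timestamp")
training_clean <- training_clean[, !(names(training_clean) %in% drop_cols)]
testing_clean <- testing_clean[, !(names(testing_clean) %in% drop_cols)]
rm(drop_cols)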
# Convert the target variable to a factor
training_clean$classe <- factor(training_clean$classe)
summary(training_clean$classe)
# Detect correlation with target
training_corr <- data.frame(data.matrix(training_clean))
classe_idx <- which(names(training_corr) == "classe")
correlations <- cor(training_corr[, -classe_idx], as.numeric(training_corr$classe))
best_correlations <- subset(as.data.frame(as.table(correlations)), abs(Freq)>0.25)
best_correlations
# Detect correlation among features
corrplot(
cor(training_corr[, -length(names(training_clean))]),
method = "square",
type = "lower",
order = "hclust",
tl.col = "black",
tl.cex = 0.4,
col = colorRampPalette(c("purple", "dark green"))(200)
)
# Create the validation set
inTrain <- createDataPartition(training_clean$classe, p = 3/4, list = FALSE)
validation <- training_clean[-inTrain, ]
training <- training_clean[inTrain, ]
# Model by Decision Tree
start <- proc.time()
model_dtree <- rpart(classe ~ ., data = training, method = "class")
prp(model_dtree)
model_dtree_time = proc.time() - start
model_dtree_time = model_dtree_time[3]
# Evaluate the decision tree
start <- proc.time()
predict_dtree <- predict(model_dtree, validation, type = "class")
confusionMatrix(validation$classe, predict_dtree)
predict_dtree_time = proc.time() - start
predict_dtree_time = predict_dtree_time[3]
# Get the accuracy and the estimated out-of-sample error
dtree_accuracy <- postResample(predict_dtree, validation$classe)
dtree_accuracy <- sprintf("%6.2f %%", dtree_accuracy[1]*100)
dtree_ose <- 1 - as.numeric(confusionMatrix(validation$classe, predict_dtree)$overall[1])
dtree_ose <- sprintf("%6.2f %%", dtree_ose*100)
## Edition stub - load a pre-trained model from an archived file
model_rf100 <- readRDS("rf_100.rds")
modeling_time100 = c(309.021, 0.670, 310.214, 0, 0)
# Generate the Random Forest model (uncomment to retrain from scratch)
# start <- proc.time()
# model_rf100 <- train(classe ~ ., data = training, method = "rf", trControl = trainControl(method = "cv", 5), ntree = 100)
modeling_time100 = modeling_time100[3]
model_rf100
start <- proc.time()
predict_rf100 <- predict(model_rf100, validation)
confusionMatrix(validation$classe, predict_rf100)
predicting_time100 = proc.time() - start
predicting_time100 = predicting_time100[3]
# Get the accuracy and the estimated out-of-sample error
rf100_accuracy <- postResample(predict_rf100, validation$classe)
rf100_accuracy <- sprintf("%6.2f %%", rf100_accuracy[1]*100)
rf100_ose <- 1 - as.numeric(confusionMatrix(validation$classe, predict_rf100)$overall[1])
rf100_ose <- sprintf("%6.2f %%", rf100_ose*100)
## Edition stub - load a pre-trained model from an archived file
model_rf10 <- readRDS("rf_10.rds")
modeling_time10 = c(36.869, 0.240, 37.180, 0, 0)
# Generate the Random Forest model (uncomment to retrain from scratch)
# start <- proc.time()
# model_rf10 <- train(classe ~ ., data = training, method = "rf", trControl = trainControl(method = "cv", 5), ntree = 10)
modeling_time10 = modeling_time10[3]
model_rf10
start <- proc.time()
predict_rf10 <- predict(model_rf10, validation)
confusionMatrix(validation$classe, predict_rf10)
predicting_time10 = proc.time() - start
predicting_time10 = predicting_time10[3]
# Get the accuracy and the estimated out-of-sample error
rf10_accuracy <- postResample(predict_rf10, validation$classe)
rf10_accuracy <- sprintf("%6.2f %%", rf10_accuracy[1]*100)
rf10_ose <- 1 - as.numeric(confusionMatrix(validation$classe, predict_rf10)$overall[1])
rf10_ose <- sprintf("%6.2f %%", rf10_ose*100)
# Predict over the test data and write one result file per question
pred_test = predict(model_rf100, testing_clean[, -length(names(testing_clean))])
n = length(pred_test)
for(i in 1:n){
filename = paste0("Assignment/test_",i,".txt")
write.table(pred_test[i],
file = filename,
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
}
pred_test