This is a simple test using the caret package and Human Activity Recognition (HAR) data, a recent growing research area because of the increasing prevalence of “context-aware systems”. These are refering to the many accelerometers and sensors present in portable devices and wearables that we use and carry with us. We will be examining this type of data to compare models in Random Forests and generalized boosted regression which have grown popular in Kaggle competitions.
The data is from Groupware@LES and the data set is calle WLE for Weight lifting exercises.
Activities are often quantified in terms of how many times they are performed but much rarely are they quantfied in terms of how well they were performed. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants and predict what manner the exercice was performed.
#Needed libraries
library(plyr)
library(tidyverse)
library(forcats)
library(Amelia)
library(caret)
library(corrplot)
library(scales)
library(plotly)
if(!file.exists("./data")) dir.create("./data")
#Download data and read the data
URLtrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLtest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(URLtrain, destfile = "./data/pml-training.csv")
download.file(URLtest, destfile = "./data/pml-testing.csv")
A first examination of the datasets revealed a lot of missing data and also the error: #DIV/0. We will clean those with a second read:
train <- read.csv("./data/pml-training.csv", header = TRUE,
stringsAsFactors = FALSE, na.strings = c("#DIV/0!", "NA", ""))
test <- read.csv("./data/pml-testing.csv", header = TRUE,
stringsAsFactors = FALSE, na.strings = c("#DIV/0!", "NA", ""))
dim(train)
## [1] 19622 160
dim(test)
## [1] 20 160
Let us have a look at what is missing.
par(mfrow=c(1,2))
missmap(train)
missmap(test)
There are a lot of missing data in the two sets. Imputing is probably necessary as machine learning is not meant to process missing data. Not imputing means removing the missing data which is usually ok because it doesn’t modify the shape of the data. However, often, data is not missing at random, meaning it is missing for a reason. Excluding missing values can introduce a bias.
We will assume this is missing data at random and go ahead and remove them. The large amount of missing data is not going to be useful for predicting anyway.
## Remove columns with more than 50% NA
train <- train[, -which(colMeans(is.na(train)) > 0.5)]
test <- test[, -which(colMeans(is.na(test)) > 0.5)]
dim(train)
## [1] 19622 60
dim(test)
## [1] 20 60
Columns with no data and those with less than 50% were dropped. Now we want to keep only the variables with measurements and drop the rest. The columns we don’t need all appear to be the seven first ones like X, name, etc, so we can remove them.
train <- train %>%
select(-(1:7))
test <- test %>%
select(-(1:7))
Before we build our prediction models, one last thing we will check for is variability. Some variables have none at all and these are not useful for a prediction model.
library(DT)
NZV <- nearZeroVar(train,saveMetrics=TRUE)
datatable(NZV, extensions = "Buttons")
We have a clear picture that we need not eliminate variables further as none are showing zero variability. We also will not need to do pre processing to standardize because we do not have much variability.
We will first split the training data into a sub-training set and a sub validation set in a ratio of 70% to 30%.
# create test/train data sets
inTrain <- createDataPartition(y=train$classe, p=0.7, list=FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]
We first explore the decision trees which we will compare to a random forest model after.
In building thw decision tree model, we will also apply a k-fold cross-validation. In our tests though, it did not increase the accuracy. The model can be built without cross-validation and still yield the same value.
set.seed(42)
#K-fold cross-validation with 5 folds
control5 <- trainControl(method = "cv", number = 5)
# fit classification tree as a model
modFit <- train(classe ~ ., trControl=control5, data = training, method="rpart")
#visualize tree
rpart.plot::rpart.plot(modFit$finalModel, type=4, extra=+100, branch.lty=3)
Predict on testting data and print the confusion matrix
pred <- predict(modFit, newdata=testing)
print(confusionMatrix(pred, testing$classe), digits=3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1534 473 490 435 155
## B 28 382 28 172 155
## C 109 284 508 357 306
## D 0 0 0 0 0
## E 3 0 0 0 466
##
## Overall Statistics
##
## Accuracy : 0.491
## 95% CI : (0.478, 0.504)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.334
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.916 0.3354 0.4951 0.000 0.4307
## Specificity 0.631 0.9193 0.7827 1.000 0.9994
## Pos Pred Value 0.497 0.4993 0.3248 NaN 0.9936
## Neg Pred Value 0.950 0.8521 0.8801 0.836 0.8863
## Prevalence 0.284 0.1935 0.1743 0.164 0.1839
## Detection Rate 0.261 0.0649 0.0863 0.000 0.0792
## Detection Prevalence 0.525 0.1300 0.2658 0.000 0.0797
## Balanced Accuracy 0.774 0.6273 0.6389 0.500 0.7150
This model is not good for predicting. The confusion matrix reveals that we only have 50% accuracy which is no better than chance or a flip of a coin.
We are invoking a correlation matrix to see how variables are linked to one another.
corrPlot <- cor(training[, -length(names(training))])
corrplot(corrPlot, method="circle", order="hclust", addrect=3, tl.cex = 0.5)
Positive correlations are displayed in blue, negative correlations are in red. The intensity of the color is proportional to the correlation coefficients. Three clusters are shown. Some variables are more important for the predictive model and having correlated variables makes for better predictions.
We train a new model with random forests and we use the k-fold repeated cross-validation ten times with three repeats. We are also using parallelism to take advantage of as many CPU cores as we have. Random Forests can take a long time on older computers expecially within caret.
library(parallel)
library(doParallel)
#We parallelize the next step to use available cpu cores.
parallelCluster <- makeCluster(detectCores())
registerDoParallel(parallelCluster)
Control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelFitRF <- train(classe ~ ., data=training, method="rf", trControl=Control)
# Shutdown cluster
stopCluster(parallelCluster)
registerDoSEQ()
#Predcit in sub-test data
predRf <- predict(modelFitRF, newdata=testing)
print(confusionMatrix(predRf, testing$classe), digits=3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 8 0 0 0
## B 0 1122 4 0 0
## C 0 9 1020 10 2
## D 0 0 2 954 9
## E 1 0 0 0 1071
##
## Overall Statistics
##
## Accuracy : 0.992
## 95% CI : (0.99, 0.994)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.99
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.985 0.994 0.990 0.990
## Specificity 0.998 0.999 0.996 0.998 1.000
## Pos Pred Value 0.995 0.996 0.980 0.989 0.999
## Neg Pred Value 1.000 0.996 0.999 0.998 0.998
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.191 0.173 0.162 0.182
## Detection Prevalence 0.286 0.191 0.177 0.164 0.182
## Balanced Accuracy 0.999 0.992 0.995 0.994 0.995
This time, we are getting 99% accuracy. This is much better.
Lets look at the importance of the variables in the model:
imp <- importance(modelFitRF$finalModel, class = NULL, scale = TRUE, type = 2)
imp <- data.frame(imp)
imp$exer <- rownames(imp)
gg <- ggplot(imp, aes(x=MeanDecreaseGini, y=reorder(exer, MeanDecreaseGini),
text = comma(MeanDecreaseGini)))
gg <- gg + geom_segment(aes(xend = 0, yend=exer), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.1,0), labels = comma)
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=10))
gg <- gg + theme(axis.text.y=element_text(size=10))
ggplotly(gg, tooltip = "text")
According to Breiman and Cutler Random Forests “Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.”
This is telling us that “roll-belt”“,”pitch_forearm" and “yaw_belt” are the three most important varibles to predict in this model. If they were to be removed, the predicton power of the model would be greatly diminished.
We will try one last model for comparison. Random forests and boosting have a very good track record and we can see what accuracy we get with a GBM model.
Control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelFitGBM <- train(classe ~ ., data=training, method="gbm", trControl=Control, verbose=FALSE)
#Predcit in sub-test data
predRfGBM <- predict(modelFitGBM, newdata=testing)
print(confusionMatrix(predRfGBM, testing$classe), digits=3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1641 44 0 0 1
## B 21 1060 29 6 13
## C 7 32 981 25 13
## D 3 1 15 925 33
## E 2 2 1 8 1022
##
## Overall Statistics
##
## Accuracy : 0.956
## 95% CI : (0.951, 0.962)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.945
## Mcnemar's Test P-Value : 7.63e-09
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.980 0.931 0.956 0.960 0.945
## Specificity 0.989 0.985 0.984 0.989 0.997
## Pos Pred Value 0.973 0.939 0.927 0.947 0.987
## Neg Pred Value 0.992 0.983 0.991 0.992 0.988
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.279 0.180 0.167 0.157 0.174
## Detection Prevalence 0.286 0.192 0.180 0.166 0.176
## Balanced Accuracy 0.985 0.958 0.970 0.974 0.971
We got 96% which is very good but Random Forests is a more powerful model with this data.
Now, we can go back to our original split of the data where we left our test dataset and use the better random forests model. The model has a 100% accuracy (0.995).
predRf <- predict(modelFitRF, newdata=test)
predRf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E