# Packages used for data handling, plotting, and the three model types
library(caret)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(randomForest)
library(knitr)
library(ggplot2)
library(gbm)
knitr::opts_chunk$set(echo = TRUE)
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available on the source website (see the section on the Weight Lifting Exercise Dataset).
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
The training data for this project are available at the URL assigned to trainUrl in the code below.
The test data are available at the URL assigned to testUrl in the code below.
The data come from the Weight Lifting Exercise Dataset. If you use the document you create for this class for any purpose, please cite the original authors, as they have been very generous in allowing their data to be used for this kind of assignment.
set.seed(54321)
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Read both CSVs, treating blank fields and "#DIV/0!" entries as NA
training <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
testing <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
In real life, simple models often beat complex ones because they generalize much better. We will do a random 60:40 split of our data set (60% for training the models, 40% for evaluating them).
set.seed(123) # For reproducibility; 123 has no particular meaning
inTrain <- createDataPartition(training$classe, p=0.6, list=FALSE)
myTraining <- training[inTrain, ]
myTesting <- training[-inTrain, ]
dim(myTraining); dim(myTesting)
[1] 11776   160
[1]  7846   160
Remove the NearZeroVariance variables
caret::nearZeroVar diagnoses predictors that have one unique value (i.e. are zero-variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
nzv <- nearZeroVar(myTraining, saveMetrics=TRUE)
myTraining <- myTraining[,nzv$nzv==FALSE]
nzv<- nearZeroVar(myTesting,saveMetrics=TRUE)
myTesting <- myTesting[,nzv$nzv==FALSE]
# Remove the first column (the row index) of the training data set
myTraining <- myTraining[c(-1)]
# Remove variables where at least 70% of the values are NA
trainingV3 <- myTraining   # work on a copy so columns can be dropped safely
for(i in 1:length(myTraining)) {
    if( sum( is.na( myTraining[, i] ) ) / nrow(myTraining) >= .7) {
        for(j in 1:length(trainingV3)) {
            if( length( grep(names(myTraining[i]), names(trainingV3)[j]) ) == 1) {
                trainingV3 <- trainingV3[ , -j]
            }
        }
    }
}
# Set back to the original variable name
myTraining <- trainingV3
rm(trainingV3)
In order to select the relevant features, the variables were examined graphically. To begin, they were plotted a few at a time as boxplots using caret's featurePlot() function, like so:
featurePlot(x = myTraining[,8:10], y = myTraining$classe, plot = "box")
At this point, the plots were visually inspected. Any variable whose boxes showed clear differences across the classes was further inspected via a stacked histogram, such as this one for the accel_belt_z variable:
ggplot(data = myTraining, aes(x = accel_belt_z, fill = classe)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Those that were determined to be potentially useful were then noted, and ultimately used in the final model.
Here is one of the golden rules of machine learning and modeling in general: models are built using training data, and evaluated on testing data. The reason is over-fitting: most models’ accuracy can be artificially increased to a point where they “learn” every single detail of the data used to build them; unfortunately, it usually means they lose the capability to generalize. That’s why we need unseen data (i.e., the testing set): if we over-fit the training data, the performance on the testing data will be poor.
clean1 <- colnames(myTraining)
clean2 <- colnames(myTraining[, -58]) # remove the classe column
myTesting <- myTesting[clean1] # allow only variables in myTesting that are also in myTraining
testing <- testing[clean2] # allow only variables in testing that are also in myTraining
dim(myTesting)
[1] 7846 58
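As a quick sanity check (a small sketch added here, not part of the original write-up), we can confirm that the validation set now carries exactly the training set's columns, and that the 20-case testing set only lacks classe:
# Should be TRUE: the validation split has exactly the training columns, in order
identical(colnames(myTesting), colnames(myTraining))
# Should return only "classe": the 20-case testing set has every other column
setdiff(colnames(myTraining), colnames(testing))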
# Coerce the testing columns to the same classes as the corresponding training columns
for (i in 1:length(testing) ) {
    for(j in 1:length(myTraining)) {
        if( length( grep(names(myTraining[i]), names(testing)[j]) ) == 1) {
            class(testing[j]) <- class(myTraining[i])
        }
    }
}
# Bind one training row onto testing so factor levels match, then drop it again
testing <- rbind(myTraining[2, -58] , testing)
testing <- testing[-1,]
A decision tree (known as a regression tree when the outcome variable is continuous) is a simple and popular machine learning algorithm, with a few interesting advantages over linear models: it makes no assumptions about the relation between the outcome and the predictors (i.e., it allows for both linear and non-linear relations), and its interpretability could hardly be higher, since at the end of the process a set of rules in natural language, relating the outcome to the explanatory variables, can easily be derived from the tree.
set.seed(123) #For reproducibility; 123 has no particular meaning
modFitA1 <- rpart(classe ~ ., data=myTraining, method="class")
fancyRpartPlot(modFitA1)
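Since interpretability is one of the tree's main selling points, the fitted tree can also be printed as plain-language rules. This is an optional sketch, assuming a recent version of rpart.plot that provides the rpart.rules() helper:
# Print one plain-language rule per leaf of the fitted tree
rpart.rules(modFitA1)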
The tree's predictions are categorical (the five classes A-E). A confusion matrix counts how many times each predicted class corresponds to each true class in the validation set.
predictionsA1 <- predict(modFitA1, myTesting, type = "class")
cmtree <- confusionMatrix(predictionsA1, myTesting$classe)
cmtree
Confusion Matrix and Statistics
           Reference
Prediction    A    B    C    D    E
         A 2107   74   12    0    0
         B  119 1313    8   40    0
         C    6  131 1231   78    0
         D    0    0   56  962   82
         E    0    0   61  206 1360
Overall Statistics
Accuracy : 0.8887
95% CI : (0.8816, 0.8956)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8593
Mcnemar’s Test P-Value : NA
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9440   0.8650   0.8999   0.7481   0.9431
Specificity            0.9847   0.9736   0.9668   0.9790   0.9583
Pos Pred Value         0.9608   0.8872   0.8513   0.8745   0.8359
Neg Pred Value         0.9779   0.9678   0.9786   0.9520   0.9868
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2685   0.1673   0.1569   0.1226   0.1733
Detection Prevalence   0.2795   0.1886   0.1843   0.1402   0.2074
Balanced Accuracy      0.9643   0.9193   0.9333   0.8635   0.9507
plot(cmtree$table, col = cmtree$byClass, main = paste("Decision Tree Confusion Matrix: Accuracy =", round(cmtree$overall['Accuracy'], 4)))
What if, instead of growing a single tree, we grow many (ranging from a few hundred to a few thousand) and introduce some sources of randomness, so that each tree is most likely different from the others? What we get is a random forest.
How many trees are needed to reach the minimum error estimate?
This is a simple problem; from the error plot shown further below, it appears that about 100 trees would be enough.
Some of the most interesting characteristics of random forests are:
They are highly resistant to over-fitting.
There is little need for a separate cross-validation step, since the out-of-bag (OOB) samples provide an internal error estimate.
We can grow as many trees as we want (the limit is the computational power).
Although we usually improve accuracy, it comes at a cost: interpretability.
set.seed(123) #For reproducibility; 123 has no particular meaning
modFitB1 <- randomForest(classe ~ ., data=myTraining, importance = TRUE, ntree=100)
predictionB1 <- predict(modFitB1, myTesting, type = "class")
cmrf <- confusionMatrix(predictionB1, myTesting$classe)
cmrf
Confusion Matrix and Statistics
           Reference
Prediction    A    B    C    D    E
         A 2231    1    0    0    0
         B    1 1517    7    0    0
         C    0    0 1359    3    0
         D    0    0    2 1282    0
         E    0    0    0    1 1442
Overall Statistics
Accuracy : 0.9981
95% CI : (0.9968, 0.9989)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9976
Mcnemar’s Test P-Value : NA
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9996   0.9993   0.9934   0.9969   1.0000
Specificity            0.9998   0.9987   0.9995   0.9997   0.9998
Pos Pred Value         0.9996   0.9948   0.9978   0.9984   0.9993
Neg Pred Value         0.9998   0.9998   0.9986   0.9994   1.0000
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2843   0.1933   0.1732   0.1634   0.1838
Detection Prevalence   0.2845   0.1944   0.1736   0.1637   0.1839
Balanced Accuracy      0.9997   0.9990   0.9965   0.9983   0.9999
plot(modFitB1)
plot(cmrf$table, col = cmrf$byClass, main = paste("Random Forest Confusion Matrix: Accuracy =", round(cmrf$overall['Accuracy'], 4)))
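The claim above that random forests need little separate cross-validation rests on the out-of-bag (OOB) error that randomForest tracks while the forest is grown. As an optional sketch (not part of the original analysis), it can be read directly from the fitted object:
# OOB error after the final tree, and the tree count at which OOB error is lowest
modFitB1$err.rate[modFitB1$ntree, "OOB"]
which.min(modFitB1$err.rate[, "OOB"])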
Gradient boosted machines (GBMs) are extremely popular machine learning algorithms that have proven successful across many domains. Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow, weak, successive trees, with each tree learning from and improving on the previous one. When combined, these many weak successive trees produce a powerful "committee" that is often hard to beat with other algorithms.
We will use 5-fold cross-validation to estimate accuracy. This splits the training set into 5 parts, trains on 4 of them and tests on the remaining 1, rotating through all train-test combinations; the whole procedure is run once (repeats = 1), matching the trainControl settings below, in an effort to get a reliable estimate of accuracy.
set.seed(123) #For reproducibility; 123 has no particular meaning
# Run algorithms using 5-fold cross validation repeated 1 times
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1)
# Using caret with the default grid to optimize tune parameters automatically
# GBM Tuning parameters:
# n.trees (# Boosting Iterations)
# interaction.depth (Max Tree Depth)
# shrinkage (Shrinkage)
# n.minobsinnode (Min. Terminal Node Size)
gbmFit1 <- train(classe ~ .
, data = myTraining
, method = "gbm"
, trControl = fitControl
, verbose = FALSE
)
gbmFinMod1 <- gbmFit1$finalModel
gbmPredTest <- predict(gbmFit1, newdata=myTesting)
gbmAccuracyTest <- confusionMatrix(gbmPredTest, myTesting$classe)
gbmAccuracyTest
Confusion Matrix and Statistics
           Reference
Prediction    A    B    C    D    E
         A 2232    0    0    0    0
         B    0 1514    2    0    0
         C    0    2 1360    3    0
         D    0    2    6 1282    3
         E    0    0    0    1 1439
Overall Statistics
Accuracy : 0.9976
95% CI : (0.9962, 0.9985)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9969
Mcnemar’s Test P-Value : NA
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   0.9974   0.9942   0.9969   0.9979
Specificity            1.0000   0.9997   0.9992   0.9983   0.9998
Pos Pred Value         1.0000   0.9987   0.9963   0.9915   0.9993
Neg Pred Value         1.0000   0.9994   0.9988   0.9994   0.9995
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2845   0.1930   0.1733   0.1634   0.1834
Detection Prevalence   0.2845   0.1932   0.1740   0.1648   0.1835
Balanced Accuracy      1.0000   0.9985   0.9967   0.9976   0.9989
plot(gbmFit1, ylim=c(0.9, 1))
plot(gbmFit1, metric = "Kappa")
plot(gbmFit1, plotType = "level")
resampleHist(gbmFit1)
Random Forests gave an accuracy on the validation set (myTesting) of 99.81%, which was higher than what I got from the Decision Tree (88.87%) or the GBM (99.76%).
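To make the comparison explicit, the three holdout accuracies can be gathered side by side, together with the expected out-of-sample error for the chosen model (a small summary sketch using the confusion-matrix objects computed above):
# Validation-set accuracy of each model, side by side
data.frame(Model = c("Decision Tree", "Random Forest", "GBM"),
           Accuracy = round(as.numeric(c(cmtree$overall["Accuracy"],
                                         cmrf$overall["Accuracy"],
                                         gbmAccuracyTest$overall["Accuracy"])), 4))
# Expected out-of-sample error for the chosen model (Random Forest) = 1 - accuracy
1 - cmrf$overall["Accuracy"]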
predictionB2 <- predict(modFitB1, testing, type = "class")
predictionB2
 1 21  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
Levels: A B C D E
# Write the results to a text file for submission (one file per prediction)
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
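The function above is only defined here; presumably it is then called on the 20 test-case predictions, for example:
# Write problem_id_1.txt ... problem_id_20.txt for submission
pml_write_files(predictionB2)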