Devices such as Jawbone Up, Nike FuelBand, and Fitbit now collect a large amount of data about personal activity. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict how each lift was performed (the ‘classe’ variable).
More information is available from the website: http://groupware.les.inf.puc-rio.br/har (section on the Weight Lifting Exercise Dataset).
#Training data set
pmlTrainDS <- read.csv("./pml-training.csv", na.strings = c("", "NA", "NULL"))
#Testing data set
pmlTestDS <- read.csv("./pml-testing.csv", na.strings = c("", "NA", "NULL"))
Loading all necessary packages for the project
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.2.5
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
dim(pmlTrainDS)
## [1] 19622 160
Exploring the training dataset, we observe that there are many candidate variables for predicting the dependent ‘classe’ variable, which has 5 levels [A, B, C, D, E]. To build an accurate prediction model, we first pre-process the data to identify and filter out unnecessary, empty, highly correlated, and near-zero-variance variables.
Removing columns that contain missing values from the dataset
filtered_pmT <- pmlTrainDS[ , colSums(is.na(pmlTrainDS)) == 0]
dim(filtered_pmT)
## [1] 19622 60
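The filter above keeps only the columns whose missing-value count is zero. As a minimal sketch of that selection, assuming a made-up data frame (`toy_df` and its columns are hypothetical, not part of the project data):

```r
# Hypothetical toy data: column b is mostly missing, columns a and c are complete.
toy_df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 7),
  c = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy_df))
print(na_counts)

# Keep only the columns that have no missing values at all.
complete_cols <- toy_df[, na_counts == 0, drop = FALSE]
print(names(complete_cols))
```

Note that this rule drops a column even if only a single value is missing; a threshold on the share of missing values would be a softer alternative.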
Removing near-zero variance columns, using nearZeroVar() from ‘caret’ package
nzv <- nearZeroVar(filtered_pmT)
filtered_pmT <- filtered_pmT[, -nzv]
dim(filtered_pmT)
## [1] 19622 59
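nearZeroVar() bases its decision on two statistics per column: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The sketch below recomputes both in base R for a made-up column; `nzv_stats` and `flag` are hypothetical names, and the helper is a simplification rather than caret's implementation.

```r
# Hypothetical helper: the two statistics that drive the near-zero-variance check.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the runner-up (Inf if there is only one value).
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Toy column resembling a flag that is almost always "no".
flag <- c(rep("no", 98), rep("yes", 2))
stats <- nzv_stats(flag)
print(stats)

# caret's documented defaults are freqCut = 95/5 and uniqueCut = 10.
is_nzv <- stats[["freq_ratio"]] > 95 / 5 && stats[["pct_unique"]] <= 10
print(is_nzv)
```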
Removing highly correlated variables, using 0.80 as cutoff point
#create correlation matrix
cor_pt <- cor(filtered_pmT[ , sapply(filtered_pmT, is.numeric)])
dim(cor_pt)
## [1] 56 56
#Plotting the correlation matrix, using 'corrplot' package
corrplot(cor_pt, order = "alphabet", tl.cex=0.7, tl.col ="steelblue")
#Display the correlation summary, prior to removal
summary(cor_pt[upper.tri(cor_pt)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992000 -0.102000 0.001729 0.001405 0.084720 0.980900
#using findCorrelation() from 'caret' package, flag the predictors
highlyCorVars <- findCorrelation(cor_pt, cutoff = 0.80)
filtered_pmT <- filtered_pmT[, -highlyCorVars]
#Display correlation summary, after removing predictors with absolute correlations above 0.80.
postCorRem <- cor(filtered_pmT[ , sapply(filtered_pmT, is.numeric)])
summary(postCorRem[upper.tri(postCorRem)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992000 -0.102200 0.002635 0.003229 0.088840 0.980900
dim(filtered_pmT)
## [1] 19622 46
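One point worth checking here: findCorrelation() returns column positions within the correlation matrix it is given. That matrix covers only the 56 numeric columns, while filtered_pmT has 59 columns, so positional indexing into the full data frame may drop different columns than intended. The unchanged extremes (-0.992 and 0.981) in the second summary are consistent with that. A possible alternative, sketched on made-up data (`toy_df` and its columns are hypothetical), is to translate the positions into column names first:

```r
library(caret)

# Hypothetical toy data: x2 is nearly a copy of x1; id is a non-numeric column.
set.seed(1)
x1 <- rnorm(100)
toy_df <- data.frame(
  id = rep(c("a", "b"), 50),
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.01),
  x3 = rnorm(100),
  stringsAsFactors = FALSE
)

# Correlation matrix over the numeric columns only (3 of the 4 columns).
num_cols <- sapply(toy_df, is.numeric)
toy_cor <- cor(toy_df[, num_cols])

# The returned positions refer to toy_cor, so map them to column names.
drop_idx <- findCorrelation(toy_cor, cutoff = 0.80)
drop_names <- colnames(toy_cor)[drop_idx]

# Drop by name so the non-numeric column cannot shift the positions.
toy_reduced <- toy_df[, !(names(toy_df) %in% drop_names), drop = FALSE]
print(names(toy_reduced))
```

Applying the same idea to filtered_pmT would change which columns are removed, so the model results reported later would not necessarily be reproduced exactly.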
Removing the first five columns [“X”, “user_name”, “raw_timestamp_part_1”, “raw_timestamp_part_2”, “num_window”], because they are row identifiers and bookkeeping fields rather than sensor measurements and are not useful for prediction
filtered_pmT <- filtered_pmT[, -c(1:5)]
dim(filtered_pmT)
## [1] 19622 41
Together, the steps above reduce the data set from 160 columns to 41: 40 predictors plus the ‘classe’ outcome.
set.seed(999)
trainIndex <- createDataPartition(filtered_pmT$classe, p = 0.70, list = FALSE)
training <- filtered_pmT[trainIndex,] #70%
dim(training)
## [1] 13737 41
validSet <- filtered_pmT[-trainIndex,] #30%
dim(validSet)
## [1] 5885 41
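For a factor outcome, createDataPartition() samples within each class so that the class proportions are preserved in both parts. The following base-R sketch approximates that idea on made-up labels; it is not caret's implementation, and `labels` and `train_idx` are hypothetical names.

```r
set.seed(999)

# Hypothetical labels with unequal class sizes.
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class (stratified sampling).
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.70 * length(idx)))
}))

# The class proportions should match between the full data and the training part.
print(prop.table(table(labels)))
print(prop.table(table(labels[train_idx])))
```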
Many machine learning algorithms can be used to build prediction models. For our classification problem, we chose the random forest method. Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees (ref. Wikipedia). The algorithm is known for its accuracy and handles large datasets with many variables efficiently. It also provides estimates of which variables are important in the classification.
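To make the "mode of the classes" step concrete, here is a minimal sketch of majority voting, assuming made-up predictions from five trees (`tree_votes` and `majority_vote` are hypothetical names, and real forests hold many more trees):

```r
# Hypothetical predictions from five trees (rows) for three observations (columns).
tree_votes <- matrix(
  c("A", "A", "B",
    "A", "C", "B",
    "B", "C", "B",
    "A", "C", "E",
    "A", "C", "B"),
  nrow = 5, byrow = TRUE
)

# The forest's prediction for an observation is the most frequent class (the mode).
# Ties are broken here by taking the first class in alphabetical order.
majority_vote <- function(votes) names(which.max(table(votes)))

forest_pred <- apply(tree_votes, 2, majority_vote)
print(forest_pred)
```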
First we will build the prediction model using only the training set. Then we explore importance and accuracy results.
set.seed(999)
Fitting the model using randomForest algorithm
rfModel <- randomForest(classe ~ ., type= "classification", data = training, ntree = 200,
importance = TRUE)
rfModel
##
## Call:
## randomForest(formula = classe ~ ., data = training, type = "classification", ntree = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 0.58%
## Confusion matrix:
## A B C D E class.error
## A 3905 1 0 0 0 0.0002560164
## B 14 2638 6 0 0 0.0075244545
## C 0 14 2381 1 0 0.0062604341
## D 0 0 30 2219 3 0.0146536412
## E 0 2 3 6 2514 0.0043564356
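The OOB error rate and the class.error column in the output above can be reproduced from the confusion matrix itself. The sketch below copies the counts from that output (rows are the actual classes, columns the predicted ones); `oob_conf` is a hypothetical name.

```r
# OOB confusion matrix copied from the output above (rows = actual, columns = predicted).
oob_conf <- matrix(
  c(3905,    1,    0,    0,    0,
      14, 2638,    6,    0,    0,
       0,   14, 2381,    1,    0,
       0,    0,   30, 2219,    3,
       0,    2,    3,    6, 2514),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: off-diagonal counts divided by all training rows.
oob_error <- 1 - sum(diag(oob_conf)) / sum(oob_conf)
print(round(100 * oob_error, 2))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)
print(round(class_error, 4))
```

The 80 misclassified rows out of 13737 give the reported 0.58%.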
Plotting the error rates of the randomForest object, we observe that the misclassification error rates decrease as the number of trees increases. The black line is the out-of-bag estimate and the other colors denote the error for each class.
layout(matrix(c(1,2),nrow = 1), widths = c(4,1))
par(mar=c(5,4,4,0)); plot(rfModel, main = "Error rates per class and OOB")
par(mar=c(5,0,4,2)); plot(c(0,1),type = "n", axes = FALSE, xlab = "", ylab = "")
legend("top", colnames(rfModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)
The plot below shows the predictors sorted in decreasing order of importance.
varImpPlot(rfModel, main = "Variable Importance Plot", cex = 0.6, col ="steelblue")
Partial dependence plots give a graphical depiction of the marginal effect of an individual variable on the class probability.
Displaying the plots for the top 10 variables
imp <- importance(rfModel)
impvar <- rownames(imp)[order(imp[, "MeanDecreaseAccuracy"], decreasing=TRUE)]
impvarTop10 <- impvar[1:10]
par(mfrow = c(2, 5), mar = c(1,1,1,1))
for (i in seq_along(impvarTop10)) {
par(mar = c(4,2,2,2))
partialPlot(rfModel, training, impvarTop10[i], xlab = impvarTop10[i], main = "")
}
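The idea behind a partial dependence curve can be sketched by hand: fix one variable at a grid value for every row, predict, and average. The example below does this for a simple linear model on made-up data (`toy`, `fit`, and `partial_dependence` are hypothetical). According to its documentation, partialPlot() applies the same averaging to the forest's output, on a logit-like scale for classification, so this is an illustration of the principle rather than a reimplementation.

```r
set.seed(1)

# Hypothetical toy data with a linear effect of x1 and a curved effect of x2.
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- 2 * toy$x1 + toy$x2^2 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1 + I(x2^2), data = toy)

# For each grid value, overwrite the variable in every row and average the predictions.
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v
    mean(predict(model, newdata = data))
  })
}

grid <- seq(0, 1, by = 0.25)
pd_x1 <- partial_dependence(fit, toy, "x1", grid)
print(round(pd_x1, 2))
```

Because the toy model is additive in x1, the resulting curve is a straight line whose slope equals the fitted x1 coefficient.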
Our random forest model had an out-of-bag (OOB) error estimate of 0.58% on the training data. We can now test the accuracy of the model on the validation set.
pred <- predict(rfModel, validSet)
print(confusionMatrix(pred, validSet$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 6 0 0 0
## B 1 1129 6 0 0
## C 0 4 1020 13 1
## D 1 0 0 950 1
## E 0 0 0 1 1080
##
## Overall Statistics
##
## Accuracy : 0.9942
## 95% CI : (0.9919, 0.996)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9927
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9912 0.9942 0.9855 0.9982
## Specificity 0.9986 0.9985 0.9963 0.9996 0.9998
## Pos Pred Value 0.9964 0.9938 0.9827 0.9979 0.9991
## Neg Pred Value 0.9995 0.9979 0.9988 0.9972 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1918 0.1733 0.1614 0.1835
## Detection Prevalence 0.2851 0.1930 0.1764 0.1618 0.1837
## Balanced Accuracy 0.9987 0.9949 0.9952 0.9925 0.9990
The model achieves an accuracy of 99.4% on the validation data, which corresponds to an estimated out-of-sample error of about 0.6%.
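The headline figures in the confusionMatrix() output can be checked directly from the table, where rows are predictions and columns are the reference classes. The sketch below copies the counts from that output; `val_conf` is a hypothetical name.

```r
# Validation confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
val_conf <- matrix(
  c(1672,    6,    0,    0,    0,
       1, 1129,    6,    0,    0,
       0,    4, 1020,   13,    1,
       1,    0,    0,  950,    1,
       0,    0,    0,    1, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

# Accuracy: correctly classified rows divided by all validation rows.
accuracy <- sum(diag(val_conf)) / sum(val_conf)
print(round(accuracy, 4))

# Sensitivity per class: correct predictions divided by the reference (column) totals.
sensitivity <- diag(val_conf) / colSums(val_conf)
print(round(sensitivity, 4))

# Positive predictive value per class: correct predictions divided by the row totals.
pos_pred_value <- diag(val_conf) / rowSums(val_conf)
print(round(pos_pred_value, 4))
```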
The margin of a data point is defined as the proportion of votes for the correct class minus the maximum proportion of votes among the other classes. Thus, under majority voting, a positive margin means a correct classification.
plot(margin(rfModel, validSet$classe), cex = 0.7, main = "Margin of Predictions")
In the plot, points with a positive margin correspond to correct classifications.
Finally, we apply the model to the original test data.
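The margin definition can be worked through by hand on a small vote matrix. This is a minimal sketch with made-up vote proportions (`votes`, `truth`, and `vote_margin` are hypothetical), not the output of the fitted model:

```r
# Hypothetical vote proportions for three observations (rows) across classes A to C.
votes <- matrix(
  c(0.90, 0.05, 0.05,
    0.40, 0.35, 0.25,
    0.20, 0.70, 0.10),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)
truth <- c("A", "B", "B")

# Margin = vote share of the true class minus the largest share among the other classes.
vote_margin <- sapply(seq_along(truth), function(i) {
  correct <- votes[i, truth[i]]
  best_other <- max(votes[i, colnames(votes) != truth[i]])
  unname(correct - best_other)
})
print(vote_margin)
```

The second observation gets a negative margin because class A collects more votes than its true class B, so the forest would misclassify it.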
result <- predict(rfModel, newdata = pmlTestDS)
result
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E