Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Quantified-self enthusiasts regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
Six males aged 20-28 were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Now we seek to predict whether the lifts were performed correctly using data from accelerometers on the belt, forearm, arm, and dumbbell of the study participants. Data for this project is courtesy of http://groupware.les.inf.puc-rio.br/har.
Source data for this project:
http://groupware.les.inf.puc-rio.br/har
Training data for this project:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Test data for this project:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The full reference is as follows:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of Weight Lifting Exercises.” Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
Load the required packages and set the seed for reproducibility.
library(caret)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(randomForest)
library(rattle)
library(rpart)
library(rpart.plot)
set.seed(777)
Load training and test datasets.
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
quizUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainData <- read.csv(url(trainUrl), strip.white = TRUE, na.strings = c("NA",""))
quizData <- read.csv(url(quizUrl), strip.white = TRUE, na.strings = c("NA",""))
dim(trainData)
## [1] 19622 160
dim(quizData)
## [1] 20 160
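As a quick sanity check, the two files should share the same columns except the last, where the training data carries the classe outcome and the quiz data carries an identifier instead (a minimal check, not part of the modeling pipeline):
# Columns 1-159 should be identical across the two files
all.equal(names(trainData)[1:159], names(quizData)[1:159])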
Partition the original training data into a training set (70%) and a testing set (30%).
inTrain <- createDataPartition(trainData$classe, p=0.70, list=FALSE)
training <- trainData[ inTrain, ]
testing <- trainData[-inTrain, ]
dim(training)
## [1] 13737 160
dim(testing)
## [1] 5885 160
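createDataPartition performs a stratified split, so the class proportions should be nearly identical in the two subsets; a quick check:
# Compare the classe distribution across the split
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)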
Inspection of the data reveals that columns 1 through 5 are identification variables (row index, user name, and timestamps) with no predictive value, so we’ll remove them.
training <- training[ , -(1:5)]
testing <- testing[ , -(1:5)]
dim(training)
## [1] 13737 155
dim(testing)
## [1] 5885 155
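For reference, the identification columns just dropped can be listed from the original data:
# The five ID columns removed above
names(trainData)[1:5]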
The training and test sets each contain a large number of NA values as well as near-zero-variance (NZV) variables. For dimensionality reduction, let’s remove the NZV variables first.
nzvVar <- nearZeroVar(training)
training <- training[ , -nzvVar]
testing <- testing[ , -nzvVar]
dim(training)
## [1] 13737 119
dim(testing)
## [1] 5885 119
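To see why columns were flagged, nearZeroVar() can return its diagnostics. A sketch, run on the pre-filtering data (using caret’s default frequency-ratio and percent-unique thresholds):
# Diagnostic view: freqRatio and percentUnique explain each NZV flag
nzvMetrics <- nearZeroVar(trainData[ , -(1:5)], saveMetrics = TRUE)
head(nzvMetrics[nzvMetrics$nzv, ])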
Remove variables that contain > 95% NA values.
# Flag columns whose values are over 95% missing, then drop them
naVar <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[ , !naVar]
testing  <- testing[ , !naVar]
dim(training)
## [1] 13737 54
dim(testing)
## [1] 5885 54
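A final check that the reduced sets contain no remaining missing values:
# Both counts should be zero after the NA filter
sum(is.na(training))
sum(is.na(testing))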
Now that we’ve reduced the data from the original 160 columns down to a more manageable 54, let’s explore relationships among the variables.
The final column, “classe”, contains our outcome of interest. Let’s see which of the remaining predictors correlate most closely with classe.
classeIndex <- which(names(training)=="classe")
partitionClasse <- createDataPartition(y=training$classe, p=0.75, list = FALSE)
subsetTrain <- training[partitionClasse, ]
subsetTest <- training[-partitionClasse, ]
Which predictors are most strongly correlated with classe? Coercing the classe factor to its integer codes is only a crude screen, but it serves for a first look.
corrClasse <- cor(subsetTrain[-classeIndex], as.numeric(subsetTrain$classe))
bestCorr <- subset(as.data.frame(as.table(corrClasse)), abs(Freq)>0.25)
bestCorr
## Var1 Var2 Freq
## 13 magnet_belt_y A -0.2848973
## 25 magnet_arm_x A 0.2931318
## 26 magnet_arm_y A -0.2549615
## 42 pitch_forearm A 0.3323660
None are highly correlated; the strongest, pitch_forearm, reaches only |r| ≈ 0.33. Let’s check visually.
p1 <- ggplot(subsetTrain, aes(classe, pitch_forearm)) +
geom_boxplot(aes(fill = classe))
p2 <- ggplot(subsetTrain, aes(classe, magnet_arm_x)) +
geom_boxplot(aes(fill = classe))
grid.arrange(p1, p2, nrow = 1)
The boxplots show no single predictor that cleanly separates the classes, so simple linear regression is unlikely to perform well. Let’s train some classification models to capture the relationship between the predictors and the classe outcome; first, a quick look at correlations among the predictors themselves, as sketched below.
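Since corrplot is already loaded, one way to visualize the predictor correlation matrix (a sketch; hclust ordering groups related sensor channels together):
# Pairwise correlations among the numeric predictors
corrMatrix <- cor(subsetTrain[ , -classeIndex])
corrplot(corrMatrix, method = "color", order = "hclust",
         tl.cex = 0.45, tl.col = "grey30")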
set.seed(777)
modFit <- train(classe ~ ., method = "rpart", data = training)
fancyRpartPlot(modFit$finalModel)
Predictions of the classification tree model on testing:
predModFit <- predict(modFit, newdata = testing)
confModFit <- confusionMatrix(predModFit, testing$classe)
confModFit
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1555 302 200 196 51
## B 28 389 33 180 90
## C 87 448 793 552 292
## D 0 0 0 0 0
## E 4 0 0 36 649
##
## Overall Statistics
##
## Accuracy : 0.5754
## 95% CI : (0.5626, 0.588)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.455
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9289 0.3415 0.7729 0.0000 0.5998
## Specificity 0.8221 0.9303 0.7162 1.0000 0.9917
## Pos Pred Value 0.6749 0.5403 0.3651 NaN 0.9419
## Neg Pred Value 0.9668 0.8548 0.9372 0.8362 0.9167
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2642 0.0661 0.1347 0.0000 0.1103
## Detection Prevalence 0.3915 0.1223 0.3691 0.0000 0.1171
## Balanced Accuracy 0.8755 0.6359 0.7446 0.5000 0.7957
The predictive accuracy of the classification tree model is relatively low at 57.5%.
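The figure can also be pulled directly from the confusionMatrix object rather than read off the printout:
# Overall accuracy of the classification tree on the hold-out set
round(confModFit$overall["Accuracy"], 4)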
Because we have many individually weak predictors, let’s try boosting. A gradient boosting machine (GBM) fits many shallow trees in sequence, each concentrating on the cases its predecessors got wrong, and combines their weighted contributions into a stronger predictor.
set.seed(777)
ctrlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fitGBM <- train(classe ~ ., data = training, method = "gbm",
trControl = ctrlGBM, verbose = FALSE)
fitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
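The hyperparameters caret selected during repeated cross-validation can be inspected via bestTune (values come from caret’s default gbm tuning grid):
# Tuning parameters chosen by repeated 5-fold CV
fitGBM$bestTune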
Predictions of the GBM on testing.
predictGBM <- predict(fitGBM, newdata = testing)
conf_matrix_GBM <- confusionMatrix(predictGBM, testing$classe)
conf_matrix_GBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 11 0 0 1
## B 0 1117 9 9 6
## C 0 10 1015 9 2
## D 0 1 2 942 8
## E 0 0 0 4 1065
##
## Overall Statistics
##
## Accuracy : 0.9878
## 95% CI : (0.9846, 0.9904)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9845
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9807 0.9893 0.9772 0.9843
## Specificity 0.9972 0.9949 0.9957 0.9978 0.9992
## Pos Pred Value 0.9929 0.9790 0.9797 0.9885 0.9963
## Neg Pred Value 1.0000 0.9954 0.9977 0.9955 0.9965
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1898 0.1725 0.1601 0.1810
## Detection Prevalence 0.2865 0.1939 0.1760 0.1619 0.1816
## Balanced Accuracy 0.9986 0.9878 0.9925 0.9875 0.9917
The predictive accuracy of our GBM model is reasonably good at 98.8%.
set.seed(777)
ctrlRF <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fitRF <- train(classe ~ ., data = training, method = "rf",
trControl = ctrlRF, verbose = FALSE)
fitRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.16%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 6 2649 3 0 0 0.0033860045
## C 0 3 2393 0 0 0.0012520868
## D 0 0 6 2246 0 0.0026642984
## E 0 1 0 2 2522 0.0011881188
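It is also worth seeing which predictors drive the forest; caret’s varImp() ranks them (exact ordering may vary with the seed):
# Rank predictors by importance in the fitted random forest
impRF <- varImp(fitRF)
plot(impRF, top = 15)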
Predictions of the Random Forest Model on testing.
predictRF <- predict(fitRF, newdata = testing)
conf_matrix_RF <- confusionMatrix(predictRF, testing$classe)
conf_matrix_RF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 2 0 0 0
## B 0 1136 2 0 0
## C 0 1 1024 3 0
## D 0 0 0 961 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9983
## 95% CI : (0.9969, 0.9992)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9979
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9981 0.9969 0.9982
## Specificity 0.9995 0.9996 0.9992 0.9996 1.0000
## Pos Pred Value 0.9988 0.9982 0.9961 0.9979 1.0000
## Neg Pred Value 1.0000 0.9994 0.9996 0.9994 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1930 0.1740 0.1633 0.1835
## Detection Prevalence 0.2848 0.1934 0.1747 0.1636 0.1835
## Balanced Accuracy 0.9998 0.9985 0.9986 0.9982 0.9991
The predictive accuracy of the Random Forest model is excellent at 99.8%.
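The expected out-of-sample error follows directly as one minus the hold-out accuracy, about 0.17% here:
# Estimated out-of-sample error rate for the random forest
round(1 - conf_matrix_RF$overall["Accuracy"], 4)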
In summary, the predictive accuracies of the three models evaluated are as follows:
Decision Tree: 57.5%
GBM: 98.8%
Random Forest: 99.8%
The Random Forest model is therefore chosen to make predictions on the 20 observations in the original test dataset, quizData.
predQuiz <- predict(fitRF, newdata = quizData)
predQuiz
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
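For readability, the quiz predictions can be paired with their identifiers (problem_id is the ID column in the quiz file):
# Match each quiz case to its predicted classe
data.frame(problem_id = quizData$problem_id, predicted = predQuiz)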