Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The outcome variable is classe, a factor with 5 levels. For this data set, participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions:
- exactly according to the specification (Class A)
- throwing the elbows to the front (Class B)
- lifting the dumbbell only halfway (Class C)
- lowering the dumbbell only halfway (Class D)
- throwing the hips to the front (Class E)
Two models will be tested using decision tree and random forest algorithms. The model with the higher accuracy will be chosen as the final model.
I first download the training and testing data sets from the given URLs and then clean the data for further analysis.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
trnLnk <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
tstLnk <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training_data <- read.csv(url(trnLnk),na.strings=c("NA","#DIV/0!",""),header=T)
testing_data <- read.csv(url(tstLnk),na.strings=c("NA","#DIV/0!",""),header=T)
#Check dimensions of train dataset
dim(training_data)
## [1] 19622 160
dim(testing_data)
## [1] 20 160
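As a quick sanity check on the outcome described above, the class distribution can be inspected (a minimal sketch, output omitted):
#Check distribution of the outcome variable classe
table(training_data$classe)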
#Create list of unwanted fields (ID, user name, timestamps, window)
trnRemCols <- grepl("^X|timestamp|window|user_name",names(training_data))
tstRemCols <- grepl("^X|timestamp|window|user_name",names(testing_data))
#Remove unwanted fields
trnRmUnwtdCols <- training_data[,!trnRemCols]
tstRmUnwtdCols <- testing_data[,!tstRemCols]
#Create list of near zero variance fields
NearZeroVar <- nearZeroVar(trnRmUnwtdCols,saveMetrics=T)
#Remove near zero variance fields
trnRmZVCols <- trnRmUnwtdCols[,!NearZeroVar$nzv]
tstRmZVCols <- tstRmUnwtdCols[,!NearZeroVar$nzv]
#Remove fields with NAs
trnNArmCondn <- (colSums(is.na(trnRmZVCols))!=0)
tstNArmCondn <- (colSums(is.na(tstRmZVCols))!=0)
trnRmNACols <- trnRmZVCols[,!trnNArmCondn]
tstRmNACols <- tstRmZVCols[,!tstNArmCondn]
#New Training and Testing Datasets after clean-up
trnDataNew <- trnRmNACols
tstDataNew <- tstRmNACols
dim(trnDataNew); dim(tstDataNew)
## [1] 19622 53
## [1] 20 53
The raw training dataset has 19622 observations and 160 variables; after removing identifier, timestamp, and window fields, near-zero-variance fields, and fields containing NAs, 53 variables remain. The testing set contains 20 observations with the same predictors as the training set.
To estimate the out-of-sample error, I split the training set into a training subset (70%) for model building and a validation subset (30%) on which the out-of-sample error is computed.
set.seed(1234)
#Train-model Validation Partition
intrain <- createDataPartition(y=trnDataNew$classe,p=0.7,list=F)
modTRNSample <- trnDataNew[intrain,] #To be used for model training
modTSTSample <- trnDataNew[-intrain,] #To be used for testing accuracy of models
dim(modTRNSample); dim(modTSTSample)
## [1] 13737 53
## [1] 5885 53
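Since createDataPartition samples within each level of classe, the 70/30 split should preserve the class proportions; a quick check (a minimal sketch, output omitted):
#Verify class proportions are preserved by the stratified split
round(prop.table(table(modTRNSample$classe)),3)
round(prop.table(table(modTSTSample$classe)),3)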
library(caret)
control <- trainControl(method = "cv", number = 5)
fit_rpart <- train(classe ~ ., data = modTRNSample, method = "rpart", trControl = control)
## Loading required package: rpart
print(fit_rpart, digits = 4)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10988, 10991, 10990, 10989
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03550 0.5214 0.38010
## 0.06093 0.4175 0.21094
## 0.11738 0.3333 0.07467
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0355.
library(rattle)
## R session is headless; GTK+ not initialized.
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(fit_rpart$finalModel)
library(caret)
predict_rpart <- predict(fit_rpart, modTSTSample)
conf_rpart <- confusionMatrix(modTSTSample$classe, predict_rpart)
conf_rpart
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1530 35 105 0 4
## B 486 379 274 0 0
## C 493 31 502 0 0
## D 452 164 348 0 0
## E 168 145 302 0 467
##
## Overall Statistics
##
## Accuracy : 0.489
## 95% CI : (0.4762, 0.5019)
## No Information Rate : 0.5317
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3311
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4890 0.5027 0.3279 NA 0.99151
## Specificity 0.9478 0.8519 0.8797 0.8362 0.88641
## Pos Pred Value 0.9140 0.3327 0.4893 NA 0.43161
## Neg Pred Value 0.6203 0.9210 0.7882 NA 0.99917
## Prevalence 0.5317 0.1281 0.2602 0.0000 0.08003
## Detection Rate 0.2600 0.0644 0.0853 0.0000 0.07935
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.18386
## Balanced Accuracy 0.7184 0.6773 0.6038 NA 0.93896
accuracy_rpart <- conf_rpart$overall[1]
accuracy_rpart
## Accuracy
## 0.4890399
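The estimated out-of-sample error follows directly from the stored accuracy (a minimal sketch, output omitted):
#Out-of-sample error estimate for the classification tree
oose_rpart <- 1 - accuracy_rpart
oose_rpart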
From the confusion matrix, the accuracy rate is about 0.49, so the estimated out-of-sample error rate is about 0.51. The classification tree does not predict the outcome classe very well. Next, I fit a decision tree directly with rpart and then a random forest.
library(rpart)
model1 <- rpart(classe~.,method="class",data=modTRNSample)
prediction1<-predict(model1,modTSTSample,type="class")
library(rpart.plot)
rpart.plot(model1,main="classification tree",extra=102,under=T,faclen=0)
confusionMatrix(prediction1,modTSTSample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1364 169 24 48 16
## B 60 581 46 79 74
## C 52 137 765 129 145
## D 183 194 125 650 159
## E 15 58 66 58 688
##
## Overall Statistics
##
## Accuracy : 0.6879
## 95% CI : (0.6758, 0.6997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6066
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8148 0.51010 0.7456 0.6743 0.6359
## Specificity 0.9390 0.94543 0.9047 0.8657 0.9590
## Pos Pred Value 0.8415 0.69167 0.6230 0.4958 0.7774
## Neg Pred Value 0.9273 0.88940 0.9440 0.9314 0.9212
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2318 0.09873 0.1300 0.1105 0.1169
## Detection Prevalence 0.2754 0.14274 0.2087 0.2228 0.1504
## Balanced Accuracy 0.8769 0.72776 0.8252 0.7700 0.7974
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
#Fit a random forest on the training subset
model2 <- randomForest(classe~.,data=modTRNSample)
prediction2<-predict(model2,modTSTSample,type="class")
confusionMatrix(prediction2,modTSTSample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 7 0 0 0
## B 0 1131 6 0 0
## C 0 1 1020 5 0
## D 0 0 0 958 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.9946, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9930 0.9942 0.9938 0.9991
## Specificity 0.9983 0.9987 0.9988 0.9998 0.9998
## Pos Pred Value 0.9958 0.9947 0.9942 0.9990 0.9991
## Neg Pred Value 1.0000 0.9983 0.9988 0.9988 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1922 0.1733 0.1628 0.1837
## Detection Prevalence 0.2856 0.1932 0.1743 0.1630 0.1839
## Balanced Accuracy 0.9992 0.9959 0.9965 0.9968 0.9994
Random forest, though a little more complex, was far more accurate: 0.9964 on the validation set, for an estimated out-of-sample error rate of about 0.4%. Hence, the random forest model was chosen as the final prediction algorithm.
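For a side-by-side view, the validation-set accuracies and error rates of all three models can be tabulated (a minimal sketch; the confusion matrices for model1 and model2 are recomputed here because only conf_rpart was stored above):
#Compare validation-set accuracy across the three models
acc_tree <- confusionMatrix(prediction1,modTSTSample$classe)$overall["Accuracy"]
acc_rf <- confusionMatrix(prediction2,modTSTSample$classe)$overall["Accuracy"]
data.frame(model=c("caret rpart","rpart","random forest"),
           accuracy=c(accuracy_rpart,acc_tree,acc_rf),
           oos_error=1-c(accuracy_rpart,acc_tree,acc_rf))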
#Predict on the cleaned testing set using the random forest model
predictfinal <- predict(model2, tstDataNew, type="class")
predictfinal
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
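If the predicted labels need to be saved, for example one file per test case for the course submission, a helper along these lines can be used (a hypothetical sketch: the problem_id_i.txt naming convention is an assumption, not part of the original analysis):
#Hypothetical helper: write one prediction per file (assumed submission format)
pml_write_files <- function(x){
  for(i in seq_along(x)){
    write.table(x[i],file=paste0("problem_id_",i,".txt"),
                quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(as.character(predictfinal))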