Author: Hannah Hon
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train <- download.file(train,"./data")
training <- read.csv("train")
test <- download.file(test, "./test")
testing <- read.csv("test")
## Remove columns that contain missing values
training <- training[,colSums(is.na(training)) == 0]
testing <- testing[,colSums(is.na(testing)) == 0]
dim(training)
## [1] 19622 93
dim(testing)
## [1] 20 60
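As an aside, the training file also stores many missing values as empty strings (and a few as "#DIV/0!"), which is why 93 columns survive this filter in the training set while the testing set drops to 60. A minimal alternative sketch, not used in the analysis below and assuming the file paths from above, would declare those strings as NA at read time so the same sparse columns drop out of both sets:
## Alternative (not applied below): treat empty strings and "#DIV/0!" as NA
trainingAlt <- read.csv("./pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
trainingAlt <- trainingAlt[, colSums(is.na(trainingAlt)) == 0]
dim(trainingAlt)
This should leave the training set with the same 60 complete columns as the testing set.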
## Remove the first 7 columns (identifiers and timestamps), which have little impact on classe
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
dim(training)
## [1] 19622 86
dim(testing)
## [1] 20 53
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainData <- training[inTrain, ]
testData <- training[-inTrain,]
NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
testData <- testData[, -NZV]
dim(trainData)
## [1] 13737 53
Three prediction models will be used: random forest, decision tree, and generalized boosted model (GBM).
#### 1. Random Forest
set.seed(12345)
controlrf <- trainControl(method="cv", number=3, verboseIter=FALSE)
rf <- train(classe ~ ., trainData, method="rf",trControl=controlrf)
rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.75%
## Confusion matrix:
## A B C D E class.error
## A 3901 4 1 0 0 0.001280082
## B 20 2628 9 1 0 0.011286682
## C 0 17 2368 11 0 0.011686144
## D 0 0 26 2224 2 0.012433393
## E 0 2 4 6 2513 0.004752475
modelrf <- predict(rf, testData)
conf <- confusionMatrix(modelrf,testData$classe)
conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 7 0 0 0
## B 0 1128 1 0 0
## C 0 3 1022 7 1
## D 0 1 3 957 4
## E 1 0 0 0 1077
##
## Overall Statistics
##
## Accuracy : 0.9952
## 95% CI : (0.9931, 0.9968)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.994
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9903 0.9961 0.9927 0.9954
## Specificity 0.9983 0.9998 0.9977 0.9984 0.9998
## Pos Pred Value 0.9958 0.9991 0.9894 0.9917 0.9991
## Neg Pred Value 0.9998 0.9977 0.9992 0.9986 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1917 0.1737 0.1626 0.1830
## Detection Prevalence 0.2855 0.1918 0.1755 0.1640 0.1832
## Balanced Accuracy 0.9989 0.9951 0.9969 0.9956 0.9976
The accuracy of the random forest model on the held-out validation set is very high, at 0.9952. However, such a high accuracy may be a sign of overfitting.
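As a quick sanity check, the expected out-of-sample error can be estimated from the held-out validation set; a minimal sketch using the confusionMatrix object conf computed above:
## Expected out-of-sample error = 1 - accuracy on the held-out validation set
oose_rf <- 1 - as.numeric(conf$overall["Accuracy"])
oose_rf
This comes to roughly 0.005, of the same order as the 0.75% OOB error reported by the random forest itself.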
## Plot the cross-validated accuracy of the random forest against mtry
plot(rf)
#### 2. Decision Tree
modelrp <- rpart(classe ~ ., data = trainData, method = "class")
fancyRpartPlot(modelrp)
predictrp <- predict(modelrp, testData, type = "class")
confrp <- confusionMatrix(predictrp, testData$classe)
confrp$overall["Accuracy"]
The accuracy of the decision tree model is about 0.75, which is not as high as the random forest or the generalized boosted model.
#### 3. Generalized Boosted Model
set.seed(12345)
controlgbm <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modelgbm <- train(classe ~ ., data=trainData, method = "gbm",
trControl = controlgbm, verbose = FALSE)
modelgbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 41 had non-zero influence.
predictgbm <- predict(modelgbm, newdata=testData)
confgbm <- confusionMatrix(predictgbm, testData$classe)
confgbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1653 36 0 0 4
## B 15 1062 30 0 8
## C 4 33 981 31 5
## D 2 5 11 928 16
## E 0 3 4 5 1049
##
## Overall Statistics
##
## Accuracy : 0.964
## 95% CI : (0.9589, 0.9686)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9544
## Mcnemar's Test P-Value : 9.355e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9875 0.9324 0.9561 0.9627 0.9695
## Specificity 0.9905 0.9888 0.9850 0.9931 0.9975
## Pos Pred Value 0.9764 0.9525 0.9307 0.9647 0.9887
## Neg Pred Value 0.9950 0.9839 0.9907 0.9927 0.9932
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2809 0.1805 0.1667 0.1577 0.1782
## Detection Prevalence 0.2877 0.1895 0.1791 0.1635 0.1803
## Balanced Accuracy 0.9890 0.9606 0.9706 0.9779 0.9835
The accuracy of the generalized boosted model is 0.964, which is also very high.
The held-out accuracies of the three classification models are:
Random Forest: 0.9952
Decision Tree: 0.7514
GBM: 0.964
Random forest gives the highest accuracy, so it is used for the final prediction on the 20 test cases.
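As a cross-check, the same numbers can be pulled directly from the three confusionMatrix objects; a small sketch assuming conf, confrp, and confgbm are still in the workspace:
## Collect the held-out accuracies of the three models into one table
data.frame(model = c("Random Forest", "Decision Tree", "GBM"),
           accuracy = as.numeric(c(conf$overall["Accuracy"],
                                   confrp$overall["Accuracy"],
                                   confgbm$overall["Accuracy"])))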
predictTest <- predict(rf,testing)
predictTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E