Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Based on these data we build a prediction model that predicts the manner in which the exercise was done (the classe variable) from the measured variables.
In this section we load the required packages, download the training and test files if they are not already present, and clean the training data:
library(caret)
library(ggplot2)
library(randomForest)
library(corrplot)
library(rpart)
set.seed(1000)
mainDir <- getwd()
subDir <- "outputDirectory"
if (!file.exists(subDir)) {
  dir.create(file.path(mainDir, subDir))
}
if (!file.exists("outputDirectory/training.csv")) {
  fileURL_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(fileURL_train, destfile = "outputDirectory/training.csv")
}
if (!file.exists("outputDirectory/test.csv")) {
  fileURL_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(fileURL_test, destfile = "outputDirectory/test.csv")
}
pmltr <- read.csv("outputDirectory/training.csv", stringsAsFactors = FALSE)
pmlts <- read.csv("outputDirectory/test.csv", stringsAsFactors = FALSE)
# drop the row index (column 1) and columns 3 to 7 (timestamp and window
# bookkeeping), which are not used for prediction
pmltr <- pmltr[, -c(1, 3:7)]
pmltr$user_name <- as.factor(pmltr$user_name)
pmltr$classe <- as.factor(pmltr$classe)
# convert the remaining character variables to numeric;
# values that cannot be parsed as numbers become NA
for (i in seq_len(ncol(pmltr))) {
  if (is.character(pmltr[, i])) {
    pmltr[, i] <- as.numeric(pmltr[, i])
  }
}
# keep only the variables that contain no NA values
num_na <- sapply(1:ncol(pmltr), function(x) sum(is.na(pmltr[, x])))
pmltr_new <- pmltr[, num_na == 0]
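The cleaning step above can be illustrated on a toy data frame. This is only a sketch in base R: the column kurtosis_demo and all values are invented, and the unparseable strings used here (an empty cell and a spreadsheet error marker) are assumptions about what such a column might hold.

```r
# Toy data frame standing in for the raw training data (all values invented).
toy <- data.frame(
  roll_belt     = c("1.41", "1.42", "1.48"),
  kurtosis_demo = c("", "#DIV/0!", "0.5"),  # hypothetical sparse column
  classe        = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
toy$classe <- as.factor(toy$classe)

# Coerce every character column to numeric; unparseable strings become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
for (i in seq_len(ncol(toy))) {
  if (is.character(toy[, i])) {
    toy[, i] <- suppressWarnings(as.numeric(toy[, i]))
  }
}

# Count the NA values per column and keep only the complete columns.
num_na  <- colSums(is.na(toy))
toy_new <- toy[, num_na == 0]

print(num_na)
print(names(toy_new))
```

Dropping every column that has any NA is a simple rule. It discards a column even if only a few values are missing, so it suits data where the affected columns are mostly empty.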
In this section we do some exploratory analysis on the training data: we count the observations by the classe and user_name variables and plot roll_belt for each user, coloured by classe.
table(pmltr_new$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
table(pmltr_new$user_name)
##
## adelmo carlitos charles eurico jeremy pedro
## 3892 3112 3536 3070 3402 2610
# plot roll_belt for each user, coloured by classe
qplot(user_name, roll_belt, colour=classe, data=pmltr_new, main = "Activity level of each user on belt", xlab = "user name", ylab = "activity level on belt")
In this section we split the training data set into two parts. The larger part (75%) is used for training the models and the smaller part (25%) is held out for testing them.
inTrain <- createDataPartition(y = pmltr_new$classe, p = 0.75, list = FALSE)
tempTrain <- pmltr_new[inTrain,]
tempTest <- pmltr_new[-inTrain,]
table(tempTrain$classe)
##
## A B C D E
## 4185 2848 2567 2412 2706
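The counts above are about 75% of each class (for example 4185 of the 5580 class A rows), which is what a split stratified on the outcome produces. A minimal base-R sketch of that idea follows. It uses an invented outcome vector and does not reproduce the internals of createDataPartition.

```r
set.seed(1000)

# Invented outcome vector with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE))

train_y <- classe[in_train]
test_y  <- classe[-in_train]

# The class proportions are the same in the full data and in the training part.
print(prop.table(table(classe)))
print(prop.table(table(train_y)))
```

Sampling within each class keeps rare classes from being under-represented in either part by chance.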
In this section we build a classification tree model on the training data with two-fold cross-validation. The result is shown as a confusion matrix, and a plot displays the importance of the variables used in prediction.
ct_model <- train(classe ~., data = tempTrain, method = "rpart", trControl=trainControl(method="cv",number=2))
confusionMatrix(predict(ct_model, newdata = tempTest), tempTest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 837 144 21 37 11
## B 252 586 161 347 252
## C 283 219 673 420 222
## D 0 0 0 0 0
## E 23 0 0 0 416
##
## Overall Statistics
##
## Accuracy : 0.5122
## 95% CI : (0.4981, 0.5263)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3865
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6000 0.6175 0.7871 0.0000 0.46171
## Specificity 0.9393 0.7441 0.7175 1.0000 0.99425
## Pos Pred Value 0.7971 0.3667 0.3704 NaN 0.94761
## Neg Pred Value 0.8552 0.8902 0.9410 0.8361 0.89138
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.18373
## Detection Rate 0.1707 0.1195 0.1372 0.0000 0.08483
## Detection Prevalence 0.2141 0.3259 0.3705 0.0000 0.08952
## Balanced Accuracy 0.7696 0.6808 0.7523 0.5000 0.72798
print(plot(varImp(ct_model, scale = FALSE)))
predict(ct_model, newdata = pmlts)
## [1] B C C C B C C C A A C C B A C B B C B B
## Levels: A B C D E
This model has the lowest accuracy among the three models: about 51% on the held-out data (95% confidence interval 0.498 to 0.526). It also never predicts class D.
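The accuracy and the per-class sensitivity can be recomputed by hand from the classification tree confusion matrix printed above. The sketch below copies those counts into a matrix. Using binom.test for the interval is an assumption about how the reported confidence interval is obtained, so treat the match as a consistency check only.

```r
# Counts copied from the classification tree confusion matrix
# (rows = predictions, columns = reference classes).
cm <- matrix(
  c(837, 144,  21,  37,  11,
    252, 586, 161, 347, 252,
    283, 219, 673, 420, 222,
      0,   0,   0,   0,   0,
     23,   0,   0,   0, 416),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy is the share of cases on the diagonal.
n_correct <- sum(diag(cm))
n_total   <- sum(cm)
accuracy  <- n_correct / n_total

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(n_correct, n_total, conf.level = 0.95)$conf.int

# Sensitivity per class: correct predictions divided by the reference count.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(ci, 4))
print(round(sensitivity, 4))
```

The zero row for class D gives it a sensitivity of 0, which the single accuracy figure hides.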
In this section we build a gradient boosting model on the training data with two-fold cross-validation. The result is shown as a confusion matrix, and a plot displays the importance of the variables used in prediction.
gbm_model <- train(classe ~., data = tempTrain, method = "gbm", verbose = FALSE, trControl=trainControl(method="cv",number=2))
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## Loading required package: plyr
confusionMatrix(predict(gbm_model, newdata = tempTest), tempTest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1374 26 0 0 2
## B 16 892 21 2 6
## C 1 31 821 21 10
## D 3 0 13 778 9
## E 1 0 0 3 874
##
## Overall Statistics
##
## Accuracy : 0.9664
## 95% CI : (0.9609, 0.9712)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9574
## Mcnemar's Test P-Value : 0.0004813
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9849 0.9399 0.9602 0.9677 0.9700
## Specificity 0.9920 0.9886 0.9844 0.9939 0.9990
## Pos Pred Value 0.9800 0.9520 0.9287 0.9689 0.9954
## Neg Pred Value 0.9940 0.9856 0.9915 0.9937 0.9933
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2802 0.1819 0.1674 0.1586 0.1782
## Detection Prevalence 0.2859 0.1911 0.1803 0.1637 0.1790
## Balanced Accuracy 0.9885 0.9643 0.9723 0.9808 0.9845
print(plot(varImp(gbm_model, scale = FALSE)))
predict(gbm_model, newdata = pmlts)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
This model has good accuracy: about 96.6% on the held-out data (95% confidence interval 0.961 to 0.971).
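Every model in this report is fitted with trainControl(method = "cv", number = 2), that is, two-fold cross-validation. The base-R sketch below shows the general idea on invented data. The threshold "model" is a hypothetical stand-in, and the fold assignment is a plain random split, which may differ from how caret builds its folds.

```r
set.seed(1000)

# Invented data: one numeric predictor and a two-level outcome.
n <- 100
x <- rnorm(n)
y <- factor(ifelse(x + rnorm(n, sd = 0.5) > 0, "up", "down"))

# Randomly assign each row to one of k = 2 folds of equal size.
k <- 2
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other fold and score on this one.
# The stand-in model is a cutoff at the midpoint of the two class means.
fold_accuracy <- sapply(seq_len(k), function(f) {
  train_rows <- fold != f
  cutoff <- mean(tapply(x[train_rows], y[train_rows], mean))
  pred <- ifelse(x[!train_rows] > cutoff, "up", "down")
  mean(pred == y[!train_rows])
})

# The cross-validated estimate is the average over the folds.
cv_accuracy <- mean(fold_accuracy)
print(fold_accuracy)
print(cv_accuracy)
```

With only two folds, each model is trained on half of the training data, which keeps the run time low at the cost of a noisier estimate than, say, ten folds would give.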
In this section we build a random forest model on the training data with two-fold cross-validation. The result is shown as a confusion matrix, and a plot displays the importance of the variables used in prediction.
rf_model <- train(classe ~., data = tempTrain, method = "rf", trControl=trainControl(method="cv",number=2))
confusionMatrix(predict(rf_model, newdata = tempTest), tempTest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 4 0 0 0
## B 3 944 6 0 0
## C 0 1 839 6 0
## D 0 0 10 795 6
## E 0 0 0 3 895
##
## Overall Statistics
##
## Accuracy : 0.992
## 95% CI : (0.9891, 0.9943)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9899
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9947 0.9813 0.9888 0.9933
## Specificity 0.9989 0.9977 0.9983 0.9961 0.9993
## Pos Pred Value 0.9971 0.9906 0.9917 0.9803 0.9967
## Neg Pred Value 0.9991 0.9987 0.9961 0.9978 0.9985
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2838 0.1925 0.1711 0.1621 0.1825
## Detection Prevalence 0.2847 0.1943 0.1725 0.1654 0.1831
## Balanced Accuracy 0.9984 0.9962 0.9898 0.9925 0.9963
print(plot(varImp(rf_model, scale = FALSE)))
predict(rf_model, newdata = pmlts)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
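This model has the best accuracy of the three: about 99.2% on the held-out data (95% confidence interval 0.989 to 0.994), which corresponds to an estimated out-of-sample error of about 0.8%.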
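A quick way to compare the three models on the 20 test cases is to count how often their predictions agree. The sketch below copies the predicted labels from the output above. The true labels of these cases are not part of this report, so it measures agreement between models, not accuracy.

```r
# Predicted classes for the 20 test cases, copied from the outputs above.
rpart_pred <- strsplit("B C C C B C C C A A C C B A C B B C B B", " ")[[1]]
gbm_pred   <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]
rf_pred    <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]

# Number of test cases on which each pair of models gives the same class.
agreement <- c(
  gbm_vs_rf   = sum(gbm_pred == rf_pred),
  rpart_vs_rf = sum(rpart_pred == rf_pred)
)
print(agreement)
```

The two more accurate models give identical predictions for all 20 cases, while the classification tree agrees with the random forest on only 8 of them.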