Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.
setwd("C:\\") # start with setting the working directory in C:
if(!file.exists("./c8_project")){dir.create("./c8_project")}
setwd("C:\\c8_project") # SETTING THE WORKING DIRECTORY
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url, destfile = "./training.csv")
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url, destfile = "./testing.csv")
# Read the above files and create data tables.
train.set = read.csv("training.csv")
test.set = read.csv("testing.csv")
dim(train.set)
## [1] 19622 160
dim(test.set)
## [1] 20 160
table(colSums(is.na(train.set))) # per-column NA counts: how many columns have 0 NAs vs. how many are mostly NA
##
##     0 19216
##    93    67
table(colSums(is.na(test.set)))
##
##   0  20
##  60 100
n = nrow(train.set)
p = n * 0.8 # threshold: drop any column that is more than 80% NA
train.nona = train.set[,which(colSums(is.na(train.set)) < p)]
table(colSums(is.na(train.nona)))
##
## 0
## 93
test.nona = test.set[,which(colSums(is.na(train.set)) < p)] # drop the same columns from the test set
table(colSums(is.na(test.nona)))
##
##  0 20
## 60 33
test.nona.2 = test.nona[,which(colSums(is.na(test.nona)) != 20)] # drop columns that are entirely NA in the 20-row test set
table(colSums(is.na(test.nona.2)))
##
## 0
## 60
train.nona.2 = train.nona[,which(colSums(is.na(test.nona)) != 20)] # apply the same column selection to the training set
table(colSums(is.na(train.nona.2)))
##
## 0
## 60
# remove the first six bookkeeping columns (row index, user name, timestamps, window flag) that are not useful for classifying "classe":
train.nona.2 = subset(train.nona.2, select=-c(1:6))
test.nona.2 = subset(test.nona.2, select=-c(1:6))
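To confirm what was dropped, a quick check of the pre-subset column names (a sketch, relying on this dataset's standard column order):
# the six columns removed above: row index, user name, timestamps, and window flag
names(train.nona)[1:6]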
# which column in train.set is not in test.set:
table(names(train.nona.2) == names(test.nona.2))
##
## FALSE  TRUE
##     1    53
names(train.nona.2)[which(names(train.nona.2) != names(test.nona.2))]
## [1] "classe"
# "classe" column is in train.set but not in test.set. this makes sense as this is the column we will be trying to predict for test.set.
# which column in test.set is not in train.set:
table(names(test.nona.2) == names(train.nona.2))
##
## FALSE  TRUE
##     1    53
names(test.nona.2)[which(names(test.nona.2) != names(train.nona.2))]
## [1] "problem_id"
# "problem id" is the column in test.set not in train.set. we should remove this column.
test = subset(test.nona.2, select=-c(problem_id))
#write.csv(train.nona.2, "train.nona.2.csv")
#write.csv(test.nona.2, "test.nona.2.csv")
The idea here is to use 70% of the provided training set with 10-fold cross-validation to estimate the test error (or accuracy) of the class prediction, and to hold out the remaining 30% as a validation set on which the trained model's validation error (or accuracy) is checked.
With 10-fold CV we estimate the test error; with the validation set we see how the selected model performs on data it has never seen.
library(caret) # for createDataPartition(), train(), and bag()
set.seed(1)
index = createDataPartition(y=train.nona.2$classe, p=0.7, list=FALSE)
train = train.nona.2[index,]
validation = train.nona.2[-index,]
dim(train)
## [1] 13737 54
dim(validation)
## [1] 5885 54
#write.csv(train, "train.csv")
We will create multiple classification models and evaluate each on its 10-fold CV test accuracy and its validation-set accuracy.
We will then select the model with the best estimated test accuracy and validation-set accuracy.
The models we will consider are: a classification tree (rpart), linear discriminant analysis (lda), a support vector machine with a radial kernel (svm), bagged conditional-inference trees (ctreeBag), stochastic gradient boosting (gbm), and a random forest (rf).
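Each model below follows the same pattern: fit on train with 10-fold CV, then score on validation. A hypothetical helper capturing that pattern (the explicit per-model chunks are kept below for readability):
# hypothetical helper: fit a caret model with 10-fold CV and report both
# the CV accuracy estimate and the validation-set accuracy
evaluate <- function(method, ...) {
  fit  <- train(classe ~ ., data = train, method = method,
                trControl = trainControl(method = "cv", number = 10), ...)
  pred <- predict(fit, validation)
  list(fit = fit,
       cv.accuracy  = max(fit$results$Accuracy),
       val.accuracy = mean(pred == validation$classe))
}
# e.g. evaluate("rpart") or evaluate("gbm", verbose = FALSE)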
set.seed(1)
# MODELING:
tree.model = train(classe ~ ., method="rpart", data=train,
trControl = trainControl(method = "cv", number=10))
## Loading required package: rpart
tree.model
## CART
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03931441 0.5299648 0.39652742
## 0.05947174 0.4173415 0.21089859
## 0.11789238 0.3152014 0.04728215
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03931441.
# PREDICTION:
tree.pred = predict(tree.model, validation)
table(tree.pred, validation$classe)
##
## tree.pred A B C D E
## A 1510 463 462 427 179
## B 27 389 36 163 141
## C 132 287 528 374 299
## D 0 0 0 0 0
## E 5 0 0 0 463
# VALIDATION-SET ACCURACY:
tree.accuracy = mean(tree.pred == validation$classe)
tree.accuracy
## [1] 0.491079
set.seed(1)
# MODELING:
lda.model = train(classe ~ ., method="lda", data=train,
trControl = trainControl(method = "cv", number=10))
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lda.model
## Linear Discriminant Analysis
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7122376 0.6357642
##
##
# PREDICTION:
lda.pred = predict(lda.model, validation)
table(lda.pred, validation$classe)
##
## lda.pred A B C D E
## A 1372 153 100 45 46
## B 52 743 98 44 176
## C 114 140 673 120 92
## D 129 50 119 718 98
## E 7 53 36 37 670
# VALIDATION-SET ACCURACY:
lda.accuracy = mean(lda.pred == validation$classe)
lda.accuracy
## [1] 0.7096007
library(e1071) # svm() comes from e1071; its `cross` argument runs 10-fold CV internally
set.seed(1)
# MODELING:
svm.model = svm(classe ~ ., data=train, cross=10)
summary(svm.model)
##
## Call:
## svm(formula = classe ~ ., data = train, cross = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.01886792
##
## Number of Support Vectors: 6281
##
## ( 1290 1435 1430 1165 961 )
##
##
## Number of Classes: 5
##
## Levels:
## A B C D E
##
## 10-fold cross-validation on training data:
##
## Total Accuracy: 93.98704
## Single Accuracies:
## 94.39184 94.25036 94.54148 94.31901 93.44978 93.377 94.31901 93.95924 94.54148 92.72198
# PREDICTION:
svm.pred = predict(svm.model, validation)
table(svm.pred, validation$classe)
##
## svm.pred A B C D E
## A 1659 70 4 2 0
## B 1 1036 28 1 3
## C 14 29 987 90 13
## D 0 2 6 871 19
## E 0 2 1 0 1047
# VALIDATION-SET ACCURACY:
svm.accuracy = mean(svm.pred == validation$classe)
svm.accuracy
## [1] 0.9515718
set.seed(1)
# MODELING: (note: caret's bag() reports out-of-bag statistics rather than 10-fold CV)
bag.model = bag(train[,1:53], train[,54], B=10,
bagControl = bagControl(fit = ctreeBag$fit,
predict = ctreeBag$pred,
aggregate = ctreeBag$aggregate),
trControl = trainControl(method = "cv", number=10))
summary(bag.model)
##
## Call:
## bag.default(x = train[, 1:53], y = train[, 54], B = 10,
##     bagControl = bagControl(fit = ctreeBag$fit, predict = ctreeBag$pred,
##     aggregate = ctreeBag$aggregate), trControl = trainControl(method
##     = "cv", number = 10))
##
## Out of bag statistics (B = 10):
##
## Accuracy Kappa
## 0.0% 0.8608 0.8238
## 2.5% 0.8616 0.8249
## 25.0% 0.8674 0.8323
## 50.0% 0.8720 0.8379
## 75.0% 0.8788 0.8464
## 97.5% 0.8905 0.8615
## 100.0% 0.8912 0.8623
# PREDICTION:
bag.pred = predict(bag.model, validation)
table(bag.pred, validation$classe)
##
## bag.pred A B C D E
## A 1646 22 5 4 0
## B 17 1082 27 5 12
## C 2 17 975 22 5
## D 6 9 18 923 11
## E 3 9 1 10 1054
# VALIDATION-SET ACCURACY:
bag.accuracy = mean(bag.pred == validation$classe)
bag.accuracy
## [1] 0.9651657
set.seed(1)
# MODELING:
boost.model = train(classe ~., data=train, method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv", number=10))
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
boost.model
## Stochastic Gradient Boosting
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7622458 0.6983534
## 1 100 0.8308971 0.7859187
## 1 150 0.8688964 0.8340455
## 2 50 0.8835304 0.8524731
## 2 100 0.9389997 0.9228070
## 2 150 0.9618581 0.9517388
## 3 50 0.9306287 0.9121743
## 3 100 0.9702285 0.9623277
## 3 150 0.9858784 0.9821365
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
# PREDICTION:
boost.pred = predict(boost.model, validation)
table(boost.pred, validation$classe)
##
## boost.pred A B C D E
## A 1667 7 0 2 0
## B 6 1118 14 6 5
## C 0 13 1010 8 2
## D 1 1 1 948 5
## E 0 0 1 0 1070
# VALIDATION-SET ACCURACY:
boost.accuracy = mean(boost.pred == validation$classe)
boost.accuracy
## [1] 0.9877655
set.seed(1)
# MODELING:
rf.model = train(classe ~., data=train, method = "rf",
trControl = trainControl(method = "cv", number=10))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:Hmisc':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf.model
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9941766 0.9926335
## 27 0.9975251 0.9968695
## 53 0.9947592 0.9933705
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# PREDICTION:
rf.pred = predict(rf.model, validation)
table(rf.pred, validation$classe)
##
## rf.pred A B C D E
## A 1674 1 0 0 0
## B 0 1137 0 0 2
## C 0 0 1026 0 0
## D 0 1 0 964 1
## E 0 0 0 0 1079
# VALIDATION-SET ACCURACY:
rf.accuracy = mean(rf.pred == validation$classe)
rf.accuracy
## [1] 0.9991504
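Before committing to a model, the accuracies computed above can be lined up side by side. A minimal sketch using only objects already created; note that e1071 reports its internal 10-fold CV result as tot.accuracy, in percent:
# validation-set accuracies, highest first
sort(c(tree = tree.accuracy, lda = lda.accuracy, svm = svm.accuracy,
       bag = bag.accuracy, boost = boost.accuracy, rf = rf.accuracy),
     decreasing = TRUE)
# 10-fold CV accuracy of the selected tuning for each caret-trained model
sapply(list(tree = tree.model, lda = lda.model,
            boost = boost.model, rf = rf.model),
       function(m) max(m$results$Accuracy))
svm.model$tot.accuracy / 100 # e1071's own 10-fold CV estimate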
Based on the highest 10-fold cross-validation estimated test accuracy (99.75%) and validation-set accuracy (99.92%), I select the random forest as the model of choice for the classification task of this project.
# FINAL PREDICTION ON TEST SET:
test.pred = predict(rf.model, test)
table(test.pred)
## test.pred
## A B C D E
## 7 8 1 1 3
# add the predicted class to the test set as a column.
test$classe = test.pred
dim(test)
## [1] 20 54
test$classe
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
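If the 20 predictions need to be submitted as one text file per test case, a minimal sketch along these lines would work; the helper name and file-naming convention here are assumptions, not part of the analysis above.
# hypothetical helper: writes problem_id_1.txt ... problem_id_20.txt,
# one predicted class per file (naming convention is an assumption)
pml_write_files = function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(test.pred))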