Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, and to predict the manner in which each lift was performed (the classe variable). More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.5.3
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
## Warning: package 'rpart' was built under R version 3.5.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3
library(rattle)
## Warning: package 'rattle' was built under R version 3.5.3
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Warning: package 'gbm' was built under R version 3.5.3
## Loaded gbm 2.1.5
train_in<- read.csv("C:/Users/ashishma/Downloads/pml-training.csv")
dim(train_in)
## [1] 19622 160
test_in<-read.csv("C:/Users/ashishma/Downloads/pml-testing.csv")
dim(test_in)
## [1] 20 160
So there are 19622 observations and 160 variables in the training set, and 20 observations and 160 variables in the testing set.
To clean the data we first deal with missing values. Several columns are NA for most observations, so we keep only the columns that contain no NA at all.
train_clean<-train_in[ ,colSums(is.na(train_in)) == 0]
dim(train_clean)
## [1] 19622 93
test_clean<- test_in[ ,colSums(is.na(test_in)) == 0]
dim(test_clean)
## [1] 20 60
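The two sets end up with different column counts here (93 versus 60) even though both start with 160 columns. One likely cause, stated as an assumption about the raw file rather than something verified in this report, is that some training columns hold blank strings or spreadsheet error codes such as "#DIV/0!", which read.csv does not treat as NA by default. A minimal sketch on a toy CSV (the data and the names csv_text, raw_default, and raw_strict are hypothetical) shows how the na.strings argument changes which columns survive the same filter:

```r
# Toy CSV standing in for the real file (hypothetical data).
# Column b holds only blanks and an error code, so it has no usable values.
csv_text <- "a,b,c\n1,,x\n2,#DIV/0!,y\n3,,z"

# Default parsing: blanks in a character column stay as "" and are not NA.
raw_default <- read.csv(text = csv_text)

# Stricter parsing: treat blanks and the error code as missing.
raw_strict <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Same filter as in the report: keep columns with no missing values.
keep_default <- raw_default[, colSums(is.na(raw_default)) == 0, drop = FALSE]
keep_strict  <- raw_strict[, colSums(is.na(raw_strict)) == 0, drop = FALSE]

names(keep_default)  # column b is kept
names(keep_strict)   # column b is dropped
```

If this assumption holds for pml-training.csv, passing na.strings when reading the file would remove the extra columns at this step instead of leaving them for the near-zero-variance filter.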
The first seven columns hold bookkeeping information (row index, user name, timestamps, and window markers) rather than sensor readings, so we remove them as well.
train_clean<- train_clean[ , -c(1:7)]
dim(train_clean)
## [1] 19622 86
test_clean<- test_clean[ , -c(1:7)]
dim(test_clean)
## [1] 20 53
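Dropping columns by position assumes that both data frames share the same column order. A hedged alternative is to drop by name and then check which predictors the two sets have in common. The sketch below uses toy data frames; the column names are made up for illustration, and classe and problem_id are assumed to be the outcome column of the training file and the identifier column of the testing file respectively.

```r
# Toy stand-ins for train_clean and test_clean (hypothetical columns).
toy_train <- data.frame(user_name = c("u1", "u2"), roll_belt = c(1.4, 1.5),
                        kurtosis_x = c("", ""), classe = c("A", "B"))
toy_test  <- data.frame(user_name = "u3", roll_belt = 1.2, problem_id = 1)

# Drop bookkeeping columns by name instead of by position.
drop_cols <- c("user_name")
toy_train <- toy_train[, !(names(toy_train) %in% drop_cols), drop = FALSE]
toy_test  <- toy_test[, !(names(toy_test) %in% drop_cols), drop = FALSE]

# Predictors usable for the final prediction must exist in both sets.
shared <- intersect(names(toy_train), names(toy_test))
only_in_train <- setdiff(names(toy_train), c(names(toy_test), "classe"))

shared         # predictors present in both sets
only_in_train  # predictors the test set cannot supply
```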
set.seed(12345)
inTrain<-createDataPartition(train_clean$classe, p = 0.7, list = FALSE)
trainData<-train_clean[inTrain , ]
testData<-train_clean[-inTrain , ]
dim(trainData)
## [1] 13737 86
dim(testData)
## [1] 5885 86
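createDataPartition samples within each level of classe, so both partitions keep roughly the same class mix. A minimal base-R sketch of that idea follows. The function name stratified_split and the toy labels are hypothetical, and rounding each class up with ceiling() is an assumption made for the sketch, not a statement about caret's internals.

```r
# Stratified split sketch: sample a fraction p of the row indices within each class.
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- ceiling(p * length(idx))
    # Index into idx so a class with a single row is handled correctly.
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(12345)
toy_classe <- factor(rep(c("A", "B", "C"), times = c(51, 33, 22)))
in_train <- stratified_split(toy_classe, p = 0.7)

table(toy_classe[in_train])   # class counts in the training part
table(toy_classe[-in_train])  # class counts in the held-out part
```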
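Next we remove the variables that have near-zero variance, since they carry almost no information for prediction. The filter is computed on trainData only and then applied to both partitions; it brings the data from 86 columns down to 53, the same number of columns as test_clean.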
NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
testData <- testData[, -NZV]
dim(trainData)
## [1] 13737 53
dim(testData)
## [1] 5885 53
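nearZeroVar flags predictors whose values are almost constant. The sketch below is a rough approximation of the two statistics behind it: the ratio between the frequencies of the most common and second most common values, and the percentage of distinct values. The cutoffs 95/5 and 10 are believed to mirror caret's defaults (freqCut and uniqueCut); the function nzv_sketch and the toy data are hypothetical, and this is a simplification rather than caret's implementation.

```r
# Simplified near-zero-variance check (not caret's actual code).
nzv_sketch <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(x) {
    counts <- sort(as.vector(table(x)), decreasing = TRUE)
    if (length(counts) < 2) return(TRUE)  # a constant column has zero variance
    freq_ratio <- counts[1] / counts[2]
    pct_unique <- 100 * length(counts) / length(x)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  names(df)[flagged]
}

toy_df <- data.frame(
  mostly_zero = c(rep(0, 99), 1),  # one value dominates
  varied      = 1:100,             # every value distinct
  constant    = rep(5, 100)        # a single value
)

nzv_sketch(toy_df)
```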
set.seed(12345)
model_tree<-rpart(classe ~ . , data = trainData, method = "class")
fancyRpartPlot(model_tree)
Now we evaluate this model on testData, the 30% of the training set that was held out.
predict_model_tree<- predict(model_tree, testData, type = "class")
cm_tree<- confusionMatrix(predict_model_tree, testData$classe )
cm_tree$table
## Reference
## Prediction A B C D E
## A 1498 196 69 106 25
## B 42 669 85 86 92
## C 43 136 739 129 131
## D 33 85 98 553 44
## E 58 53 35 90 790
cm_tree$overall[1]
## Accuracy
## 0.7220051
The decision tree gives an accuracy of 0.7220051, which corresponds to an out-of-sample error of about 0.28.
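The accuracy reported by confusionMatrix is the share of predictions that fall on the diagonal of the table. As a check, the arithmetic can be reproduced directly from the counts printed above (the variable names cm, accuracy, and error are introduced here for illustration):

```r
# Confusion matrix of the decision tree, copied from the output above.
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1498, 196,  69, 106,  25,
                 42, 669,  85,  86,  92,
                 43, 136, 739, 129, 131,
                 33,  85,  98, 553,  44,
                 58,  53,  35,  90, 790),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error <- 1 - accuracy                # estimated out-of-sample error

round(accuracy, 7)  # 0.7220051
round(error, 2)     # 0.28
```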
model_rf<- randomForest(classe ~ . , data = testData)
print(model_rf)
##
## Call:
## randomForest(formula = classe ~ ., data = testData)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 1.7%
## Confusion matrix:
## A B C D E class.error
## A 1669 5 0 0 0 0.002986858
## B 25 1100 13 1 0 0.034240562
## C 0 16 1008 2 0 0.017543860
## D 0 0 26 936 2 0.029045643
## E 0 1 3 6 1072 0.009242144
Now we predict with this model on testData, which is the same partition the model was fit on.
predict_model_rf<- predict(model_rf, testData, type = "class")
cm_rf<-confusionMatrix(predict_model_rf,testData$classe)
cm_rf$table
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1026 0 0
## D 0 0 0 964 0
## E 0 0 0 0 1082
cm_rf$overall[1]
## Accuracy
## 1
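The accuracy of 1 shown above is not an out-of-sample estimate, because model_rf was fit on testData and then evaluated on the same rows. The OOB estimate printed by the model is the more honest figure: an error rate of 1.7%, that is, an accuracy of 0.9830076 and an out-of-sample error of about 0.02. A proper hold-out estimate would require fitting the forest on trainData instead.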
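The figure 0.9830076 can be recovered from the OOB confusion matrix printed by print(model_rf), not from cm_rf. A short check follows, with rows taken as the true classes and columns as the OOB predictions; the names oob_cm, oob_accuracy, and class_error are introduced here for illustration.

```r
# OOB confusion matrix of the random forest, copied from the output above.
classes <- c("A", "B", "C", "D", "E")
oob_cm <- matrix(c(1669,    5,    0,   0,    0,
                     25, 1100,   13,   1,    0,
                      0,   16, 1008,   2,    0,
                      0,    0,   26, 936,    2,
                      0,    1,    3,   6, 1072),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(Actual = classes, Predicted = classes))

oob_accuracy <- sum(diag(oob_cm)) / sum(oob_cm)
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)  # per-class OOB error

round(oob_accuracy, 7)            # 0.9830076
round(100 * (1 - oob_accuracy), 1)  # 1.7 (percent)
class_error
```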
model_gbm <- train(classe ~ ., data = trainData, method = "gbm",
                   trControl = trainControl(method = "repeatedcv", number = 5, repeats = 1),
                   verbose = FALSE)
print(model_gbm)
## Stochastic Gradient Boosting
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 10989, 10990, 10990, 10989, 10990
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7572970 0.6921642
## 1 100 0.8255803 0.7792114
## 1 150 0.8566640 0.8185488
## 2 50 0.8565919 0.8183136
## 2 100 0.9050737 0.8798732
## 2 150 0.9315712 0.9134018
## 3 50 0.8971386 0.8697786
## 3 100 0.9431460 0.9280516
## 3 150 0.9615636 0.9513656
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
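The "Summary of sample sizes" line follows from splitting 13737 rows into 5 folds: each resample trains on the rows outside one fold. A minimal sketch of that bookkeeping is below. It assigns folds purely at random, whereas caret is understood to balance folds by class, so the order of the sizes can differ; the names fold_id, fold_sizes, and train_sizes are hypothetical.

```r
n <- 13737  # rows in trainData, from the output above
k <- 5      # number of folds

set.seed(12345)
# Give every row a fold label 1..k, as evenly as possible, in random order.
fold_id <- sample(rep_len(seq_len(k), n))

fold_sizes <- tabulate(fold_id, nbins = k)
train_sizes <- n - fold_sizes  # rows used for fitting in each resample

sort(train_sizes)
```

With 13737 rows, two folds hold 2748 rows and three hold 2747, which gives two training sets of 10989 rows and three of 10990, the same sizes that appear in the output.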
Now we evaluate this model on testData, the held-out 30% of the training set.
predict_model_gbm<- predict(model_gbm, testData)
cm_gbm<-confusionMatrix(predict_model_gbm,testData$classe)
cm_gbm$table
## Reference
## Prediction A B C D E
## A 1647 38 0 2 2
## B 16 1066 41 3 15
## C 6 32 963 36 7
## D 5 3 20 917 22
## E 0 0 2 6 1036
cm_gbm$overall[1]
## Accuracy
## 0.9564996
The boosted model gives an accuracy of 0.9564996, which corresponds to an out-of-sample error of about 0.04.
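Overall accuracy hides differences between classes. As a hedged illustration, the per-class recall (the share of each true class that was predicted correctly) can be computed from the gbm table above by dividing the diagonal by the column totals; the names gbm_cm and recall are introduced here for the sketch.

```r
# Confusion matrix of the gbm model, copied from the output above.
classes <- c("A", "B", "C", "D", "E")
gbm_cm <- matrix(c(1647,   38,   0,   2,    2,
                     16, 1066,  41,   3,   15,
                      6,   32, 963,  36,    7,
                      5,    3,  20, 917,   22,
                      0,    0,   2,   6, 1036),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(Prediction = classes, Reference = classes))

gbm_accuracy <- sum(diag(gbm_cm)) / sum(gbm_cm)
recall <- diag(gbm_cm) / colSums(gbm_cm)  # columns are the true classes

round(gbm_accuracy, 7)   # 0.9564996
round(recall, 3)
names(which.min(recall)) # class with the lowest recall
```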
Of the three models, the random forest has the highest estimated accuracy (0.9830076 by its OOB estimate, against 0.9564996 for gbm and 0.7220051 for the decision tree), so we use it to predict classe for the 20 cases in test_clean.
Result<- predict(model_rf, test_clean)
Result
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E