Prologue

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
Introduction

This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Each participant performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: “exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E)” (Velloso, Bulling, Gellersen, Ugulino, & Fuks, 2013). Read more: http://groupware.les.inf.puc-rio.br/har#ixzz6fttLBk3d
Let’s kick-start this project:
# Load the packages used for modeling, plotting, and data handling
library(caret)
library(rpart)
library(ggplot2)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
library(partykit)
# Download the training data and the 20 graded test cases
trainingdata<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header=TRUE)
testcases<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header=TRUE)
# Check the dimensions of the data files
dim(trainingdata)
## [1] 19622 160
dim(testcases)
## [1] 20 160
Looking at the dimensions, we see that there are a total of 19,622 observations in trainingdata and 20 in testcases. There are 160 variables, and the variable “classe” is going to be the outcome variable. Before modeling, I want to remove the variables that have near-zero variance (NZV) from both trainingdata and testcases.
# Identify the near-zero-variance predictors and drop them from both data sets
NZV<-nearZeroVar(trainingdata)
trainingdata<-trainingdata[, -NZV]
testcases<-testcases[, -NZV]
Wow!! We got rid of a total of 60 variables that had near-zero variance.
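As a quick sanity check, 160 - 60 = 100 columns should now remain in each data set:
# Confirm the column counts after the NZV removal
dim(trainingdata)  # expect 19622 rows, 100 columns
dim(testcases)     # expect 20 rows, 100 columns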
It also looks like variables 1 through 5 are identification variables (row index, user name, and timestamps). So, let’s get rid of them.
# Drop the five ID columns from both data sets
trainingdata<-trainingdata[, -(1:5)]
testcases<-testcases[, -(1:5)]
Having removed the ID variables, we now have a total of 95 variables. We can still get rid of the variables that are mostly NAs.
# Flag the columns that are more than 95% NA and drop them from both data sets
NAs<- sapply(trainingdata, function(x) mean(is.na(x)))>0.95
trainingdata<-trainingdata[, NAs==FALSE]
testcases<-testcases[, NAs==FALSE]
So far, we have been able to boil down the total number of variables from 160 to 54.
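A quick check confirms that the NA-heavy columns are gone and both data sets are down to 54 columns:
# No missing values should remain, and both sets should be 54 columns wide
sum(is.na(trainingdata))  # expect 0
dim(trainingdata)         # expect 19622 x 54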
That is still a lot of variables, i.e., 54, and, so far, I don’t know which ones I want to get rid of. So, let’s run a correlation analysis and see what stands out.
citation("corrplot")
##
## To cite corrplot in publications use:
##
## Taiyun Wei and Viliam Simko (2020). R package "corrplot":
## Visualization of a Correlation Matrix (Version 0.85). Available from
## https://github.com/taiyun/corrplot
##
## A BibTeX entry for LaTeX users is
##
## @Manual{corrplot2020,
## title = {R package "corrplot": Visualization of a Correlation Matrix},
## author = {Taiyun Wei and Viliam Simko},
## year = {2020},
## note = {(Version 0.85)},
## url = {https://github.com/taiyun/corrplot},
## }
# Correlation matrix over the 53 numeric predictors (column 54 is classe)
Matrixcor<-cor(trainingdata[,-54])
corrplot(Matrixcor, type="lower", order="FPC", method="color", tl.cex=0.5)
Setting the correlation cutoff to 0.75, I have been able to identify 21 highly correlated variables that are candidates for removal. (The rest of the analysis keeps all 54 columns, so this step is exploratory.)
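The cutoff computation itself is not shown above; a minimal sketch using caret’s findCorrelation, assuming the 0.75 cutoff mentioned (highcor is just an illustrative name), would be:
# Indices of predictors whose pairwise correlations exceed the 0.75 cutoff
highcor <- findCorrelation(Matrixcor, cutoff=0.75)
length(highcor)                      # about 21 flagged variables
names(trainingdata[, -54])[highcor]  # which predictors they are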
Now, let’s break the trainingdata into training and testing sets. We are going to split the observations 60% and 40%.
# 60/40 split of the training data, stratified on the outcome classe
TrainD<-createDataPartition(trainingdata$classe, p=0.6, list=FALSE)
trainingset<-trainingdata[TrainD, ]
testset<-trainingdata[-TrainD, ]
dim(trainingset)
## [1] 11776 54
dim(testset)
## [1] 7846 54
There are a total of 11,776 observations in trainingset and 7,846 observations in testset.
I am going to fit a classification tree (rpart) model on the training set.
# Fit a classification tree and visualize it
set.seed(123)
model1<-rpart(classe~., data=trainingset, method="class")
fancyRpartPlot(model1)
The plot shows that the tree involves a large number of splits and a considerable depth.
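If the tree looks overgrown, one way to check is to inspect its complexity-parameter (cp) table and prune back; here is a minimal sketch (bestcp and pruned are illustrative names):
# Cross-validated error at each value of the complexity parameter
printcp(model1)
# Prune back to the cp with the lowest cross-validated error (xerror)
bestcp <- model1$cptable[which.min(model1$cptable[, "xerror"]), "CP"]
pruned <- prune(model1, cp=bestcp)
fancyRpartPlot(pruned)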
Now, let’s generate predictions on the test set using the model fit on the training set, and examine the confusion matrix.
# Predict on the held-out test set and compute the confusion matrix
predmod<-predict(model1, testset, type="class")
ctree<-confusionMatrix(predmod, as.factor(testset$classe))
ctree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1958 291 13 133 82
## B 208 993 212 135 181
## C 23 113 1042 109 82
## D 39 76 81 812 145
## E 4 45 20 97 952
##
## Overall Statistics
##
## Accuracy : 0.7337
## 95% CI : (0.7238, 0.7435)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6616
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8772 0.6542 0.7617 0.6314 0.6602
## Specificity 0.9076 0.8837 0.9495 0.9480 0.9741
## Pos Pred Value 0.7905 0.5743 0.7611 0.7042 0.8515
## Neg Pred Value 0.9490 0.9142 0.9497 0.9292 0.9272
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2496 0.1266 0.1328 0.1035 0.1213
## Detection Prevalence 0.3157 0.2204 0.1745 0.1470 0.1425
## Balanced Accuracy 0.8924 0.7689 0.8556 0.7897 0.8171
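From ctree, the out-of-sample accuracy, and hence the expected out-of-sample error, can be pulled out directly:
# Held-out accuracy and the implied out-of-sample error estimate
acc <- ctree$overall["Accuracy"]
acc      # about 0.7337
1 - acc  # estimated out-of-sample error, about 0.2663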
Looking at the results, the algorithm works reasonably well, with an accuracy of about 73.4% on the held-out test set. Now, let’s pass the model over the validation set of 20 test cases.
# Apply the model to the 20 graded test cases
valid<- predict(model1, newdata=testcases)
valid
## A B C D E
## 1 0.04430380 0.658227848 0.00000000 0.00000000 0.29746835
## 2 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 3 0.02877698 0.208633094 0.74100719 0.00000000 0.02158273
## 4 0.84729064 0.001231527 0.00000000 0.05788177 0.09359606
## 5 0.84729064 0.001231527 0.00000000 0.05788177 0.09359606
## 6 0.02974359 0.083076923 0.69230769 0.09333333 0.10153846
## 7 0.07104796 0.067495560 0.05506217 0.69626998 0.11012433
## 8 0.19236641 0.341984733 0.18167939 0.16641221 0.11755725
## 9 0.98723404 0.012765957 0.00000000 0.00000000 0.00000000
## 10 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 11 0.01000000 0.486666667 0.20666667 0.13000000 0.16666667
## 12 0.01000000 0.486666667 0.20666667 0.13000000 0.16666667
## 13 0.01484230 0.128014842 0.07792208 0.21892393 0.56029685
## 14 0.98723404 0.012765957 0.00000000 0.00000000 0.00000000
## 15 0.01484230 0.128014842 0.07792208 0.21892393 0.56029685
## 16 0.02412869 0.021447721 0.10321716 0.73190349 0.11930295
## 17 0.84729064 0.001231527 0.00000000 0.05788177 0.09359606
## 18 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 19 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 20 0.01000000 0.486666667 0.20666667 0.13000000 0.16666667
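Because type="class" was not specified, predict() returned the class-probability matrix shown above rather than labels. To get the predicted label for each of the 20 test cases, either ask rpart for the classes directly or take the most probable column per row (validclass is just an illustrative name):
# Predicted class label for each test case
validclass <- predict(model1, newdata=testcases, type="class")
validclass
# Equivalently, pick the highest-probability class from the matrix above
colnames(valid)[max.col(valid)]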