Prologue

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
Introduction

This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Each participant performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: “exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E)” (Velloso, Bulling, Gellersen, Ugulino, & Fuks, 2013). Read more: http://groupware.les.inf.puc-rio.br/har#ixzz6fttLBk3d
Let’s kick-start this project:
# Load the packages used for modeling, plotting, and data handling
library(caret)
library(rpart)
library(ggplot2)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
library(partykit)
# Download the training data and the 20 graded test cases
trainingdata<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header=TRUE)
testcases<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header=TRUE)
# Check the dimensions of the data files
dim(trainingdata)
## [1] 19622 160
dim(testcases)
## [1] 20 160
Looking at the dimensions, we see that there are a total of 19,622 observations in trainingdata and 20 in testcases. There are 160 variables, and the variable “classe” is going to be the outcome variable. Before modeling, I want to remove the variables that have near-zero variance (NZV) from both trainingdata and testcases.
# Identify the near-zero-variance predictors and drop them from both data sets
NZV<-nearZeroVar(trainingdata)
trainingdata<-trainingdata[, -NZV]
testcases<-testcases[, -NZV]
Wow!! We got rid of a total of 60 variables that had near-zero variance.
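As a quick sanity check, 160 - 60 = 100 columns should now remain in each data set:
# Confirm the column counts after the NZV removal
dim(trainingdata)  # expect 19622 rows, 100 columns
dim(testcases)     # expect 20 rows, 100 columns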
It also looks like variables 1 through 5 are identification variables (row index, user name, and timestamps). So, let’s get rid of them.
# Drop the five ID columns from both data sets
trainingdata<-trainingdata[, -(1:5)]
testcases<-testcases[, -(1:5)]
Having removed the ID variables, we now have a total of 95 variables. We can still get rid of the variables that are mostly NAs.
# Flag the columns that are more than 95% NA and drop them from both data sets
NAs<- sapply(trainingdata, function(x) mean(is.na(x)))>0.95
trainingdata<-trainingdata[, NAs==FALSE]
testcases<-testcases[, NAs==FALSE]
So far, we have been able to boil down the total number of variables from 160 to 54.
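A quick check confirms that the NA-heavy columns are gone and both data sets are down to 54 columns:
# No missing values should remain, and both sets should be 54 columns wide
sum(is.na(trainingdata))  # expect 0
dim(trainingdata)         # expect 19622 x 54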
That is still a lot of variables, i.e., 54, and, so far, I don’t know which ones I want to get rid of. So, let’s run a correlation analysis and see what stands out.
citation("corrplot")
##
## To cite corrplot in publications use:
##
## Taiyun Wei and Viliam Simko (2020). R package "corrplot":
## Visualization of a Correlation Matrix (Version 0.85). Available from
## https://github.com/taiyun/corrplot
##
## A BibTeX entry for LaTeX users is
##
## @Manual{corrplot2020,
## title = {R package "corrplot": Visualization of a Correlation Matrix},
## author = {Taiyun Wei and Viliam Simko},
## year = {2020},
## note = {(Version 0.85)},
## url = {https://github.com/taiyun/corrplot},
## }
# Correlation matrix over the 53 numeric predictors (column 54 is classe)
Matrixcor<-cor(trainingdata[,-54])
corrplot(Matrixcor, type="lower", order="FPC", method="color", tl.cex=0.5)
Setting the correlation cutoff to 0.75, I have been able to identify 21 highly correlated variables that are candidates for removal. (The rest of the analysis keeps all 54 columns, so this step is exploratory.)
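The cutoff computation itself is not shown above; a minimal sketch using caret’s findCorrelation, assuming the 0.75 cutoff mentioned (highcor is just an illustrative name), would be:
# Indices of predictors whose pairwise correlations exceed the 0.75 cutoff
highcor <- findCorrelation(Matrixcor, cutoff=0.75)
length(highcor)                      # about 21 flagged variables
names(trainingdata[, -54])[highcor]  # which predictors they are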
Now, let’s break the trainingdata into training and testing sets. We are going to split the observations 60% and 40%.
# 60/40 split of the training data, stratified on the outcome classe
TrainD<-createDataPartition(trainingdata$classe, p=0.6, list=FALSE)
trainingset<-trainingdata[TrainD, ]
testset<-trainingdata[-TrainD, ]
dim(trainingset)
## [1] 11776 54
dim(testset)
## [1] 7846 54
There are a total of 11,776 observations in trainingset and 7,846 observations in testset.
I am going to fit a classification tree (rpart) model on the training set.
# Fit a classification tree and visualize it
set.seed(123)
model1<-rpart(classe~., data=trainingset, method="class")
fancyRpartPlot(model1)
The plot shows that the tree involves a large number of splits and a considerable depth.
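If the tree looks overgrown, one way to check is to inspect its complexity-parameter (cp) table and prune back; here is a minimal sketch (bestcp and pruned are illustrative names):
# Cross-validated error at each value of the complexity parameter
printcp(model1)
# Prune back to the cp with the lowest cross-validated error (xerror)
bestcp <- model1$cptable[which.min(model1$cptable[, "xerror"]), "CP"]
pruned <- prune(model1, cp=bestcp)
fancyRpartPlot(pruned)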
Now, let’s generate predictions on the test set using the model fit on the training set, and examine the confusion matrix.
# Predict on the held-out test set and compute the confusion matrix
predmod<-predict(model1, testset, type="class")
ctree<-confusionMatrix(predmod, as.factor(testset$classe))
ctree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1958 291 13 133 82
## B 208 993 212 135 181
## C 23 113 1042 109 82
## D 39 76 81 812 145
## E 4 45 20 97 952
##
## Overall Statistics
##
## Accuracy : 0.7337
## 95% CI : (0.7238, 0.7435)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6616
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8772 0.6542 0.7617 0.6314 0.6602
## Specificity 0.9076 0.8837 0.9495 0.9480 0.9741
## Pos Pred Value 0.7905 0.5743 0.7611 0.7042 0.8515
## Neg Pred Value 0.9490 0.9142 0.9497 0.9292 0.9272
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2496 0.1266 0.1328 0.1035 0.1213
## Detection Prevalence 0.3157 0.2204 0.1745 0.1470 0.1425
## Balanced Accuracy 0.8924 0.7689 0.8556 0.7897 0.8171
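From ctree, the out-of-sample accuracy, and hence the expected out-of-sample error, can be pulled out directly:
# Held-out accuracy and the implied out-of-sample error estimate
acc <- ctree$overall["Accuracy"]
acc      # about 0.7337
1 - acc  # estimated out-of-sample error, about 0.2663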
Looking at the results, the algorithm works reasonably well, with an accuracy of about 73.4% on the held-out test set. Now, let’s pass the model over the validation set of 20 test cases.
# Apply the model to the 20 graded test cases
valid<- predict(model1, newdata=testcases)
valid
## A B C D E
## 1 0.04430380 0.658227848 0.00000000 0.00000000 0.29746835
## 2 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 3 0.02877698 0.208633094 0.74100719 0.00000000 0.02158273
## 4 0.84729064 0.001231527 0.00000000 0.05788177 0.09359606
## 5 0.84729064 0.001231527 0.00000000 0.05788177 0.09359606
## 6 0.02974359 0.083076923 0.69230769 0.09333333 0.10153846
## 7 0.07104796 0.067495560 0.05506217 0.69626998 0.11012433
## 8 0.19236641 0.341984733 0.18167939 0.16641221 0.11755725
## 9 0.98723404 0.012765957 0.00000000 0.00000000 0.00000000
## 10 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 11 0.01000000 0.486666667 0.20666667 0.13000000 0.16666667
## 12 0.01000000 0.486666667 0.20666667 0.13000000 0.16666667
## 13 0.01484230 0.128014842 0.07792208 0.21892393 0.56029685
## 14 0.98723404 0.012765957 0.00000000 0.00000000 0.00000000
## 15 0.01484230 0.128014842 0.07792208 0.21892393 0.56029685
## 16 0.02412869 0.021447721 0.10321716 0.73190349 0.11930295
## 17 0.84729064 0.001231527 0.00000000 0.05788177 0.09359606
## 18 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 19 0.67335244 0.202865330 0.01489971 0.07793696 0.03094556
## 20 0.01000000 0.486666667 0.20666667 0.13000000 0.16666667
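Because type="class" was not specified, predict() returned the class-probability matrix shown above rather than labels. To get the predicted label for each of the 20 test cases, either ask rpart for the classes directly or take the most probable column per row (validclass is just an illustrative name):
# Predicted class label for each test case
validclass <- predict(model1, newdata=testcases, type="class")
validclass
# Equivalently, pick the highest-probability class from the matrix above
colnames(valid)[max.col(valid)]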