Practical Machine Learning Prediction Assignment

Project Introduction

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

Goal

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Getting and loading the data

# Loading Data
# Load the required packages, then read the training and test sets into the environment
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(mlbench)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(foreach)
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
set.seed(39)
trainUrl <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
ptrain <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
ptest <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))

Cleaning Data

Here we drop every variable that contains NA values. We then split the remaining data in two: 75% is used for training and the other 25% is held out for testing.

rmNaTrain <- ptrain[, apply(ptrain, 2, function(x) !any(is.na(x)))]
training <-rmNaTrain
dim(training)
## [1] 19622    60
# Drop the first 8 columns (identifier, user name, timestamp, and window fields sit at the start of the data set)
clVariables <- rmNaTrain[, -c(1:8)]
training <- clVariables
dim(training)
## [1] 19622    52
cleanpmTest <-ptest[,names(training[,-52])]
testing <- cleanpmTest
dim(testing)
## [1] 20 51
# Splitting Data

library(caret)
set.seed(39)
inTrain <- createDataPartition(y = training$classe, p = 0.75, list = FALSE)
ptrain1 <- training[inTrain, ]
ptrain2 <- training[-inTrain, ]

dim(ptrain1)
## [1] 14718    52
dim(ptrain2)
## [1] 4904   52
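The two cleaning steps above (dropping columns with missing values, then a 75/25 split that keeps the class proportions) can be sketched in base R on a toy data frame. The toy data, the column names, and the per-class sampling below are assumptions made for illustration. The sketch only mimics the idea behind createDataPartition and does not claim to reproduce its exact row selection.

```r
# Toy data standing in for the real training set (values are made up).
set.seed(39)
toy <- data.frame(
  classe = factor(rep(c("A", "B"), times = c(40, 20))),
  roll   = rnorm(60),
  kurt   = c(NA, rnorm(59)),   # a single missing value is enough to drop the column
  pitch  = rnorm(60)
)

# Keep only the columns that contain no NA at all.
complete_cols <- colSums(is.na(toy)) == 0
toy_clean <- toy[, complete_cols]

# Stratified 75/25 split: sample 75% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy_clean)), toy_clean$classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE)

toy_train <- toy_clean[in_train, ]
toy_valid <- toy_clean[-in_train, ]

names(toy_clean)           # the column with the NA is gone
table(toy_train$classe)    # class counts in the training part
table(toy_valid$classe)    # class counts in the held-out part
```

Sampling within each class keeps the A to B ratio the same in both parts, which matters when some classes are rarer than others.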

Summarize Dataset

Now we take a brief look at the training partition. Its dimensions were printed above, so here we only check the levels of the outcome variable.

# Levels of the outcome variable classe

levels(ptrain1$classe)
## [1] "A" "B" "C" "D" "E"

Model Selection

We fit a random forest and a gradient boosting model and compare which of the two algorithms describes the data better.

library(randomForest)
library(caret)


set.seed(13333)

# Fit the random forest directly with randomForest(). This call does not go through caret,
# so no trainControl object is involved; ptrain2 serves as the held-out check below.
rf.formula <- randomForest(classe ~ ., data = ptrain1)

Model Comparison

mpredict <- predict(rf.formula, newdata=ptrain2)
confusionMatrix(mpredict, ptrain2$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    2    0    0    0
##          B    0  947    7    0    0
##          C    0    0  847   16    1
##          D    0    0    1  788    2
##          E    0    0    0    0  898
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9941         
##                  95% CI : (0.9915, 0.996)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9925         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9979   0.9906   0.9801   0.9967
## Specificity            0.9994   0.9982   0.9958   0.9993   1.0000
## Pos Pred Value         0.9986   0.9927   0.9803   0.9962   1.0000
## Neg Pred Value         1.0000   0.9995   0.9980   0.9961   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1931   0.1727   0.1607   0.1831
## Detection Prevalence   0.2849   0.1945   0.1762   0.1613   0.1831
## Balanced Accuracy      0.9997   0.9981   0.9932   0.9897   0.9983
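The assignment asks for the expected out-of-sample error. As a sketch, it can be read off the confusion matrix above: accuracy is the share of held-out cases on the diagonal, and the error estimate is one minus that. The matrix below is typed in from the printed output, and the variable names are our own.

```r
# Confusion matrix for the random forest on ptrain2, copied from the output above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
rf_cm <- matrix(
  c(1395,   2,   0,   0,   0,
       0, 947,   7,   0,   0,
       0,   0, 847,  16,   1,
       0,   0,   1, 788,   2,
       0,   0,   0,   0, 898),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy    <- sum(diag(rf_cm)) / sum(rf_cm)   # correct predictions / all predictions
oos_error   <- 1 - accuracy                    # estimated out-of-sample error
sensitivity <- diag(rf_cm) / colSums(rf_cm)    # per-class share of true cases recovered

round(accuracy, 4)
round(oos_error, 4)
round(sensitivity, 4)
```

This gives 4875 correct out of 4904 cases, an accuracy of 0.9941 and an estimated out-of-sample error of about 0.0059 (0.59%), in line with the printed statistics.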

Validation

predSec <- predict(rf.formula,newdata=testing)
predSec
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Next we fit the gradient boosting model with 5-fold cross-validation and use the confusion matrix on ptrain2 to validate our selection.

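# 5-fold cross-validation; verboseIter prints the per-fold training log shown below
control <- trainControl(method = "cv", number = 5, allowParallel = TRUE, verboseIter = TRUE)
fit.gbm <- train(classe ~ ., data = ptrain1, method = "gbm", trControl = control, verbose = FALSE)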
## Loading required package: gbm
## Loading required package: splines
## Loaded gbm 2.1.3
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## + Fold1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## - Fold1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## - Fold1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## - Fold1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold2: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## - Fold2: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold2: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## - Fold2: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold2: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## - Fold2: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold3: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## - Fold3: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold3: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## - Fold3: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold3: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## - Fold3: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold4: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## - Fold4: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold4: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## - Fold4: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold4: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## - Fold4: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold5: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## - Fold5: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold5: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## - Fold5: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold5: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## - Fold5: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Aggregating results
## Selecting tuning parameters
## Fitting n.trees = 150, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10 on full training set
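The log above shows the candidate settings that were tried in each fold. As a sketch, the same search space can be written out explicitly with expand.grid. We assume here that it matches caret's default grid for gbm with three values per tuned parameter; the n.trees values 50 and 100 are part of that assumption, since the log only prints n.trees = 150.

```r
# Assumed search grid: 3 tree depths x 3 ensemble sizes,
# with shrinkage and minimum node size held fixed.
gbm_grid <- expand.grid(
  interaction.depth = 1:3,
  n.trees           = c(50, 100, 150),
  shrinkage         = 0.1,
  n.minobsinnode    = 10
)

nrow(gbm_grid)   # number of candidate parameter combinations

# The combination reported as selected in the log above.
subset(gbm_grid, interaction.depth == 3 & n.trees == 150)
```

A grid like this can be passed to train() through its tuneGrid argument when a wider or finer search is wanted, for example more trees or a smaller shrinkage.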
fit.gbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 51 predictors of which 41 had non-zero influence.
class(fit.gbm)
## [1] "train"         "train.formula"
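To make clear what the five folds in the training log do, here is a minimal base R sketch of 5-fold cross-validation. It uses the built-in iris data and a simple nearest-centroid classifier, both chosen only to keep the example free of extra packages; neither is part of this project, and predict_centroid is a hypothetical helper.

```r
# Minimal 5-fold cross-validation in base R on the built-in iris data.
set.seed(39)
k <- 5
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Randomly assign each row to one of k folds of equal size (150 / 5 = 30).
fold <- sample(rep(seq_len(k), length.out = nrow(x)))

# Hypothetical stand-in model: predict the class whose centroid is closest.
predict_centroid <- function(x_train, y_train, x_new) {
  # One centroid (vector of column means) per class: classes x features.
  centroids <- t(sapply(split(as.data.frame(x_train), y_train), colMeans))
  # Squared Euclidean distance from every new row to every centroid.
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(x_new, 2, centroids[j, ])^2)
  })
  nearest <- max.col(-d, ties.method = "first")
  factor(rownames(centroids)[nearest], levels = levels(y_train))
}

# Train on k - 1 folds, score on the fold that was left out, and repeat.
fold_accuracy <- sapply(seq_len(k), function(i) {
  held_out <- fold == i
  pred <- predict_centroid(x[!held_out, ], y[!held_out],
                           x[held_out, , drop = FALSE])
  mean(pred == y[held_out])
})

fold_accuracy         # one accuracy per held-out fold
mean(fold_accuracy)   # cross-validated accuracy estimate
```

Every row is used for scoring exactly once, so the averaged accuracy is an estimate of performance on unseen data that does not depend on a single lucky or unlucky split.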
gbmPred <- predict(fit.gbm, newdata=ptrain2)
confusionMatrix(gbmPred,ptrain2$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1365   28    0    1    0
##          B   22  889   30    3    4
##          C    6   27  812   28   11
##          D    2    3   13  761   10
##          E    0    2    0   11  876
## 
## Overall Statistics
##                                           
##                Accuracy : 0.959           
##                  95% CI : (0.9531, 0.9644)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9482          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9785   0.9368   0.9497   0.9465   0.9723
## Specificity            0.9917   0.9851   0.9822   0.9932   0.9968
## Pos Pred Value         0.9792   0.9378   0.9186   0.9645   0.9854
## Neg Pred Value         0.9915   0.9848   0.9893   0.9896   0.9938
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2783   0.1813   0.1656   0.1552   0.1786
## Detection Prevalence   0.2843   0.1933   0.1803   0.1609   0.1813
## Balanced Accuracy      0.9851   0.9609   0.9660   0.9698   0.9845
predtrain <-predict(fit.gbm,newdata=ptrain1)
confusionMatrix(predtrain, ptrain1$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4141   81    0    0    2
##          B   33 2711   52    9   14
##          C    7   52 2485   57   16
##          D    3    1   25 2324   26
##          E    1    3    5   22 2648
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9722          
##                  95% CI : (0.9694, 0.9748)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9648          
##  Mcnemar's Test P-Value : 1.141e-09       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9895   0.9519   0.9681   0.9635   0.9786
## Specificity            0.9921   0.9909   0.9891   0.9955   0.9974
## Pos Pred Value         0.9804   0.9617   0.9496   0.9769   0.9884
## Neg Pred Value         0.9958   0.9885   0.9932   0.9929   0.9952
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Rate         0.2814   0.1842   0.1688   0.1579   0.1799
## Detection Prevalence   0.2870   0.1915   0.1778   0.1616   0.1820
## Balanced Accuracy      0.9908   0.9714   0.9786   0.9795   0.9880
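To put the two models side by side, the sketch below recomputes accuracy and error rate from the diagonal sums and totals of the confusion matrices printed above. The counts are typed in from that output, and the data frame layout is our own.

```r
# Correct predictions (diagonal sums) and totals read from the confusion matrices above.
results <- data.frame(
  model   = c("random forest", "gbm", "gbm"),
  data    = c("ptrain2 (held out)", "ptrain2 (held out)", "ptrain1 (training)"),
  correct = c(4875, 4703, 14309),
  total   = c(4904, 4904, 14718)
)

results$accuracy   <- round(results$correct / results$total, 4)
results$error_rate <- round(1 - results$correct / results$total, 4)
results
```

On the held-out data the random forest misclassifies about 0.6% of the cases and gbm about 4.1%, which supports choosing the random forest. The gbm accuracy on its own training data (0.9722) is only slightly above its held-out accuracy (0.959).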

Model Testing

getwd()
## [1] "E:/DScience/Practical_Machine_Learning/Project"
pml_write_files = function(x){
  # write one file per test case, e.g. problem_id_1.txt
  for(i in seq_along(x)){
    filename=paste0("problem_id_",i,".txt")
    
    write.table(x[i],file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
    
  }
}
pml_write_files
## function(x){
##   # write one file per test case, e.g. problem_id_1.txt
##   for(i in seq_along(x)){
##     filename=paste0("problem_id_",i,".txt")
##     
##     write.table(x[i],file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
##     
##   }
## }
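The function above is only printed, not called. As a sketch of how the 20 predictions could be written out, the hypothetical variant below takes an output directory and uses a ".txt" extension. The function name, the out_dir argument, and the temporary directory are assumptions; the predictions are copied from the Validation output.

```r
# Hypothetical variant of the writer above: one text file per test case,
# written into a directory chosen by the caller.
write_prediction_files <- function(x, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(x)) {
    filename <- file.path(out_dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
  # Return the names of the files found in the output directory.
  invisible(list.files(out_dir, pattern = "^problem_id_[0-9]+\\.txt$"))
}

# Predictions for the 20 test cases, copied from the Validation output.
predictions <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
                 "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")

out_dir <- file.path(tempdir(), "pml_answers")
written <- write_prediction_files(predictions, out_dir)

length(written)                                      # number of files written
readLines(file.path(out_dir, "problem_id_1.txt"))    # content of the first file
```

Writing into a dedicated directory keeps the answer files separate from the project files in the working directory.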