The goal of this project is to predict the manner in which the participants performed the exercise, recorded in the “classe” variable of the training set; any of the other variables may be used as predictors. This report describes how the model was built, how cross-validation was used, what the expected out-of-sample error is, and why these choices were made. The final prediction model is also applied to 20 test cases.
First, the training and testing data sets are downloaded and loaded into R.
# Download the data files if they are not already present
if (!file.exists("./pml-training.csv")){
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",destfile="./pml-training.csv")
}
if (!file.exists("./pml-testing.csv")){
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",destfile="./pml-testing.csv")
}
#Load data
training<-read.csv("pml-training.csv")
testing<-read.csv("pml-testing.csv")
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
# Remove near-zero-variance columns
library(caret)
training1<-training[, -nearZeroVar(training)]
# Remove columns that are mostly NA (more than 50% missing)
training<-training1[, -which(colMeans(is.na(training1)) > 0.5)]
# Remove the first 7 columns (row index, user name, timestamps)
training<-training[,-c(1:7)]
dim(training)
## [1] 19622 52
# Remove any rows with missing values
training1<-training[complete.cases(training), ]
dim(training1)
## [1] 19622 52
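As an optional sanity check (output not shown in the original report), we can confirm that no missing values remain after the column filtering:
# Count the remaining NA values in the cleaned training data; expected to be 0,
# consistent with complete.cases() keeping all 19622 rows above
sum(is.na(training1))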
# Create a data partition: 70% for training, 30% for validation
dpart<-createDataPartition(training1$classe,p=0.7,list=FALSE)
trainSet<-training1[dpart,]
testSet<-training1[-dpart,]
dim(trainSet)
## [1] 13737 52
dim(testSet)
## [1] 5885 52
Note: training the three models below takes more than 30 minutes, so please be patient.
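One optional way to shorten this run time (not used in the original analysis) is to let caret evaluate the cross-validation folds in parallel. The sketch below assumes the parallel and doParallel packages are installed; fitControlPar would simply replace the fitControl object defined below.
# Register a parallel backend so train() can fit CV folds on multiple cores
library(parallel)
library(doParallel)
cluster <- makeCluster(max(1, detectCores() - 1))  # leave one core for the OS
registerDoParallel(cluster)
# allowParallel = TRUE lets train() use the registered backend
fitControlPar <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
# ... fit the models with trControl = fitControlPar ...
stopCluster(cluster)  # release the workers when training is done
registerDoSEQ()       # return foreach to sequential execution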
set.seed(1234)
# Use 10-fold cross-validation
fitControl <- trainControl(method = "cv",number = 10)
# Random forest
rf<-train(classe~.,method="rf",trControl = fitControl,data=trainSet,verbose = FALSE)
# Gradient boosting machine
gbm<-train(classe~.,method="gbm",trControl = fitControl,data=trainSet,verbose = FALSE)
# Linear discriminant analysis
lda<-train(classe~.,method="lda",trControl = fitControl,data=trainSet,verbose = FALSE)
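# Predict each model on the held-out validation set and evaluate its accuracy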
predrf<-predict(rf,testSet)
predgbm<-predict(gbm,testSet)
predlda<-predict(lda,testSet)
confusionMatrix(predrf,testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 14 0 0 0
## B 0 1121 7 0 0
## C 0 3 1017 6 4
## D 0 0 2 957 2
## E 0 1 0 1 1076
##
## Overall Statistics
##
## Accuracy : 0.9932
## 95% CI : (0.9908, 0.9951)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9914
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9842 0.9912 0.9927 0.9945
## Specificity 0.9967 0.9985 0.9973 0.9992 0.9996
## Pos Pred Value 0.9917 0.9938 0.9874 0.9958 0.9981
## Neg Pred Value 1.0000 0.9962 0.9981 0.9986 0.9988
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1905 0.1728 0.1626 0.1828
## Detection Prevalence 0.2868 0.1917 0.1750 0.1633 0.1832
## Balanced Accuracy 0.9983 0.9914 0.9943 0.9960 0.9970
confusionMatrix(predgbm,testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1653 41 0 0 3
## B 12 1073 37 5 8
## C 3 25 975 28 14
## D 6 0 12 923 18
## E 0 0 2 8 1039
##
## Overall Statistics
##
## Accuracy : 0.9623
## 95% CI : (0.9571, 0.967)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9523
## Mcnemar's Test P-Value : 1.25e-09
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9875 0.9421 0.9503 0.9575 0.9603
## Specificity 0.9896 0.9869 0.9856 0.9927 0.9979
## Pos Pred Value 0.9741 0.9454 0.9330 0.9625 0.9905
## Neg Pred Value 0.9950 0.9861 0.9895 0.9917 0.9911
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2809 0.1823 0.1657 0.1568 0.1766
## Detection Prevalence 0.2884 0.1929 0.1776 0.1630 0.1782
## Balanced Accuracy 0.9885 0.9645 0.9679 0.9751 0.9791
confusionMatrix(predlda,testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1372 184 112 46 43
## B 35 711 101 58 171
## C 126 139 642 105 129
## D 133 50 144 703 119
## E 8 55 27 52 620
##
## Overall Statistics
##
## Accuracy : 0.6879
## 95% CI : (0.6758, 0.6997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6049
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8196 0.6242 0.6257 0.7293 0.5730
## Specificity 0.9086 0.9231 0.8973 0.9094 0.9704
## Pos Pred Value 0.7809 0.6608 0.5627 0.6118 0.8136
## Neg Pred Value 0.9268 0.9110 0.9191 0.9449 0.9098
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2331 0.1208 0.1091 0.1195 0.1054
## Detection Prevalence 0.2986 0.1828 0.1939 0.1952 0.1295
## Balanced Accuracy 0.8641 0.7737 0.7615 0.8193 0.7717
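The random forest clearly performs best on the validation set (about 99.3% accuracy, versus 96.2% for GBM and 68.8% for LDA). As an optional cross-check (not part of the original output), caret's resamples() helper can summarize the cross-validation results of the three fitted models side by side; for a strict comparison the models should share identical fold indices (e.g., by passing the same index list to trainControl).
# Collect and summarize the 10-fold CV results of the three train() objects
cvResults <- resamples(list(RF = rf, GBM = gbm, LDA = lda))
summary(cvResults)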
set.seed(1234)
values<-predict(rf,testing)
values
The final predictions for the 20 test cases, produced by the random forest model, are shown below. Based on the held-out validation set, the estimated out-of-sample accuracy is about 99.3% (see the confusion matrix above), so the expected out-of-sample error is roughly 0.7%.
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
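Optionally, each of the 20 predictions can be written to its own text file for submission. The helper below (pml_write_files) is a hypothetical convenience function, not part of the original analysis.
# Write one text file per prediction (problem_id_1.txt, ..., problem_id_20.txt)
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(values)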