Import Libraries
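The chunk that loads the packages is not echoed; the set below is inferred from the startup messages that follow, so it is a minimal sketch and the exact library() calls are an assumption.

library(caret)          # also loads lattice and ggplot2
library(dplyr)
library(rattle)         # also loads tibble and bitops
library(randomForest)
library(rpart)          # assumed for the decision-tree model; loads silently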
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
Cleaning the Data
Inspecting the data shows that cleaning is required:
- Step 1: treat NA, "", and #DIV/0! entries as missing values when reading the data.
- Step 2: remove near-zero-variance variables and variables that are almost entirely NA.
- Step 3: remove non-numerical identifier variables such as timestamps.
# Step 1: read the data, treating "NA", "#DIV/0!" and empty strings as missing
train_data <- read.csv('training.csv', na.strings = c("NA", "#DIV/0!", ""))
test_data  <- read.csv('testing.csv',  na.strings = c("NA", "#DIV/0!", ""))

# Step 2a: drop near-zero-variance variables (identified on the training set)
nz <- nearZeroVar(train_data)
train_data <- train_data[, -nz]
test_data  <- test_data[, -nz]

# Step 2b: drop variables that are more than 95% NA
rm_na <- sapply(train_data, function(x) mean(is.na(x))) > 0.95
train_data <- train_data[, rm_na == FALSE]
test_data  <- test_data[, rm_na == FALSE]

# Step 3: drop the first seven identifier columns (non-numerical variables such as timestamps)
train_data <- train_data[, -c(1:7)]
test_data  <- test_data[, -c(1:7)]
Split the Training Dataset
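The partitioning code is not shown; a minimal sketch using caret's createDataPartition (the seed, split proportion, and object names are assumptions) would look like:

set.seed(12345)                                         # assumed seed
inTrain <- createDataPartition(train_data$classe, p = 0.75, list = FALSE)
training   <- train_data[inTrain, ]                     # used to fit the models
validation <- train_data[-inTrain, ]                    # used for the confusion matrices below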
Machine Learning Algorithms for Prediction
Two algorithms are trained on the training partition and evaluated on the validation set; a sketch of the fitting code precedes each confusion matrix below.
- Decision Tree
- Random Forest
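A sketch of the decision-tree fit and its evaluation on the validation set; the use of rpart and the object names are assumptions, since the original chunk is not shown.

fit_dt  <- rpart(classe ~ ., data = training, method = "class")   # classification tree
pred_dt <- predict(fit_dt, validation, type = "class")            # predicted classes
confusionMatrix(pred_dt, factor(validation$classe))               # evaluate against held-out classes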
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1202 24 96 0 5
## B 397 276 218 34 1
## C 371 35 395 0 0
## D 351 10 292 137 0
## E 184 149 221 36 262
##
## Overall Statistics
##
## Accuracy : 0.4838
## 95% CI : (0.4694, 0.4982)
## No Information Rate : 0.5334
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3264
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4798 0.55870 0.32324 0.66184 0.97761
## Specificity 0.9429 0.84531 0.88313 0.85453 0.86676
## Pos Pred Value 0.9058 0.29806 0.49313 0.17342 0.30751
## Neg Pred Value 0.6132 0.94218 0.78768 0.98208 0.99844
## Prevalence 0.5334 0.10520 0.26022 0.04408 0.05707
## Detection Rate 0.2560 0.05877 0.08411 0.02917 0.05579
## Detection Prevalence 0.2826 0.19719 0.17057 0.16823 0.18143
## Balanced Accuracy 0.7114 0.70201 0.60319 0.75818 0.92218
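A corresponding sketch for the random forest (the randomForest package is loaded above; ntree and the object names are assumptions):

fit_rf  <- randomForest(factor(classe) ~ ., data = training, ntree = 500)
pred_rf <- predict(fit_rf, validation)
confusionMatrix(pred_rf, factor(validation$classe))               # evaluate against held-out classes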
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1327 0 0 0 0
## B 0 926 0 0 0
## C 0 1 800 0 0
## D 0 0 1 789 0
## E 0 0 0 0 852
##
## Overall Statistics
##
## Accuracy : 0.9996
## 95% CI : (0.9985, 0.9999)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9995
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9989 0.9988 1.0000 1.0000
## Specificity 1.0000 1.0000 0.9997 0.9997 1.0000
## Pos Pred Value 1.0000 1.0000 0.9988 0.9987 1.0000
## Neg Pred Value 1.0000 0.9997 0.9997 1.0000 1.0000
## Prevalence 0.2826 0.1974 0.1706 0.1680 0.1814
## Detection Rate 0.2826 0.1972 0.1704 0.1680 0.1814
## Detection Prevalence 0.2826 0.1972 0.1706 0.1682 0.1814
## Balanced Accuracy 1.0000 0.9995 0.9992 0.9999 1.0000
Result
- The confusion matrices show that the random forest (accuracy 0.9996) clearly outperforms the decision tree (accuracy 0.4838), so the random forest model is used for the final prediction.
Conclusion
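The random-forest model is applied to the 20 test cases; a minimal sketch, using the object names assumed in the sketches above:

predict(fit_rf, test_data)    # predicted classe for each test case, printed below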
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E