Practical Machine Learning new

Introduction

Devices such as Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict how each lift was performed (the classe variable). More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

library(lubridate)

## Warning: package 'lubridate' was built under R version 3.5.3

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(caret)

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

## Loading required package: ggplot2

library(rpart)

## Warning: package 'rpart' was built under R version 3.5.3

library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.5.3

library(rattle)

## Warning: package 'rattle' was built under R version 3.5.3

## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.5.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(gbm)

## Warning: package 'gbm' was built under R version 3.5.3

## Loaded gbm 2.1.5

train_in<- read.csv("C:/Users/ashishma/Downloads/pml-training.csv")
dim(train_in)

## [1] 19622   160

test_in<-read.csv("C:/Users/ashishma/Downloads/pml-testing.csv")
dim(test_in)

## [1]  20 160

The training set has 19622 observations of 160 variables, and the testing set has 20 observations of 160 variables.

Clean data

To clean the data, we remove columns with missing values. Inspecting the data shows that several columns are NA for most observations, so we drop every column that contains any NA.

train_clean<-train_in[ ,colSums(is.na(train_in)) == 0]
dim(train_clean)

## [1] 19622    93

test_clean<- test_in[ ,colSums(is.na(test_in)) == 0]
dim(test_clean)

## [1] 20 60
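
The training set keeps 93 columns while the testing set keeps 60. One plausible explanation (an assumption, since the raw files are not shown here) is that some sparse training columns hold blank strings or error markers such as "#DIV/0!" rather than NA, so `is.na()` does not catch them. A minimal sketch on a made-up CSV shows how the `na.strings` argument of `read.csv` changes which columns survive the filter:

```r
# Toy CSV (hypothetical data): column b holds blanks and an error marker,
# column c holds literal "NA" values.
csv_text <- "a,b,c
1,,NA
2,,NA
3,#DIV/0!,7"

# Default read: b becomes a text column, and blanks in a text column
# are not treated as NA.
raw_default  <- read.csv(text = csv_text)
keep_default <- colSums(is.na(raw_default)) == 0

# Stricter read: blanks and "#DIV/0!" are also treated as missing.
raw_strict  <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))
keep_strict <- colSums(is.na(raw_strict)) == 0

print(keep_default)
print(keep_strict)
```

With the default settings column b survives the NA filter; with the stricter `na.strings` it is dropped together with column c.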

The first seven columns are also of little use for prediction, so we remove them.

train_clean<- train_clean[ , -c(1:7)]
dim(train_clean)

## [1] 19622    86

test_clean<- test_clean[ , -c(1:7)]
dim(test_clean)

## [1] 20 53

Preparing Data For Prediction

set.seed(12345)
inTrain<-createDataPartition(train_clean$classe, p = 0.7, list = FALSE)
trainData<-train_clean[inTrain , ]
testData<-train_clean[-inTrain , ]
dim(trainData)

## [1] 13737    86

dim(testData)

## [1] 5885   86
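
The split above samples within each level of classe, so both parts keep roughly the same class proportions. The idea can be sketched in base R on the built-in iris data, with Species standing in for classe. This is only an illustration of stratified sampling, not a reproduction of what `createDataPartition` does internally:

```r
set.seed(12345)

# Group the row indices by class, then sample 70% of the rows within each class.
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)

train_part <- iris[train_idx, ]
test_part  <- iris[-train_idx, ]

# Class counts in each part
print(table(train_part$Species))
print(table(test_part$Species))
```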

Next, we remove variables with near-zero variance. The columns are identified on trainData, and the same columns are dropped from testData.

NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
testData  <- testData[, -NZV]
dim(trainData)

## [1] 13737    53

dim(testData)

## [1] 5885   53
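
To make the near-zero-variance filter less of a black box, here is a rough base R sketch of the two quantities it relies on: the frequency ratio of the two most common values and the percentage of unique values. The helper `nzv_metrics` and the toy data are hypothetical. The thresholds mirror the documented defaults of `nearZeroVar` (`freqCut = 95/5`, `uniqueCut = 10`), but the sketch is not caret's implementation:

```r
# Frequency ratio: count of the most common value / count of the second most common.
# Percent unique: number of distinct values as a percentage of the number of rows.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # A single-valued column has no second value; treat its ratio as infinite here.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

toy <- data.frame(
  mostly_zero = c(rep(0, 98), 1, 2),           # dominated by a single value
  varied      = seq(0.1, 10, length.out = 100) # every value distinct
)

metrics <- t(sapply(toy, nzv_metrics))

# Flag a column when one value dominates AND there are few distinct values.
flagged <- metrics[, "freq_ratio"] > 95 / 5 & metrics[, "pct_unique"] <= 10
print(metrics)
print(flagged)
```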

Model Building

We will predict the outcome using three different techniques:

Classification Trees

Random Forest

Generalized Boosted Models

Prediction with Classification Trees

set.seed(12345)
model_tree<-rpart(classe ~ . , data = trainData, method = "class")
fancyRpartPlot(model_tree)

Next, we evaluate this model on the held-out data (testData).

predict_model_tree<- predict(model_tree, testData, type = "class")
cm_tree<- confusionMatrix(predict_model_tree, testData$classe )
cm_tree$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1498  196   69  106   25
##          B   42  669   85   86   92
##          C   43  136  739  129  131
##          D   33   85   98  553   44
##          E   58   53   35   90  790

cm_tree$overall[1]

##  Accuracy 
## 0.7220051

The classification tree reaches an accuracy of 0.7220051 on the held-out data, which gives an estimated out-of-sample error of about 0.28.
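
The accuracy reported by `confusionMatrix` is the share of predictions on the diagonal of the table. As a quick check, the figure can be recomputed by hand from the classification tree table printed above (the object names here are illustrative):

```r
# Confusion matrix of the classification tree, copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1498, 196,  69, 106,  25,
      42, 669,  85,  86,  92,
      43, 136, 739, 129, 131,
      33,  85,  98, 553,  44,
      58,  53,  35,  90, 790),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy   <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error_rate <- 1 - accuracy             # estimated out-of-sample error

print(round(accuracy, 7))
print(round(error_rate, 2))
```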

Prediction with Random Forest

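# Note: as the recorded output below shows, this model is fit on testData.
# Fitting on trainData instead would make the testData evaluation further down
# a true hold-out check.
model_rf<- randomForest(classe ~ . , data = testData)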
print(model_rf)

## 
## Call:
##  randomForest(formula = classe ~ ., data = testData) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 1.7%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1669    5    0   0    0 0.002986858
## B   25 1100   13   1    0 0.034240562
## C    0   16 1008   2    0 0.017543860
## D    0    0   26 936    2 0.029045643
## E    0    1    3   6 1072 0.009242144

Next, we check this model's predictions on testData. Because the model was fit on testData itself, this measures in-sample performance rather than hold-out performance.

predict_model_rf<- predict(model_rf, testData, type = "class")
cm_rf<-confusionMatrix(predict_model_rf,testData$classe)
cm_rf$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    0 1139    0    0    0
##          C    0    0 1026    0    0
##          D    0    0    0  964    0
##          E    0    0    0    0 1082

cm_rf$overall[1]

## Accuracy 
##        1

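The confusion matrix above shows an accuracy of 1, but this is in-sample accuracy, because model_rf was fit on testData. The OOB error estimate of 1.7% is the more honest figure: it corresponds to an accuracy of 0.9830076, or an estimated out-of-sample error of about 0.02.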
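
The 0.9830076 figure matches the out-of-bag (OOB) confusion matrix printed with the model rather than the prediction table, which is perfect. As a sketch, the OOB error rate and the per-class errors can be recomputed from the printed OOB matrix (the object names are illustrative):

```r
# OOB confusion matrix copied from print(model_rf) above
# (rows = actual class, columns = OOB-predicted class).
oob <- matrix(
  c(1669,    5,    0,   0,    0,
      25, 1100,   13,   1,    0,
       0,   16, 1008,   2,    0,
       0,    0,   26, 936,    2,
       0,    1,    3,   6, 1072),
  nrow = 5, byrow = TRUE,
  dimnames = list(Actual = LETTERS[1:5], Predicted = LETTERS[1:5])
)

class_error  <- 1 - diag(oob) / rowSums(oob)   # matches the class.error column
oob_error    <- 1 - sum(diag(oob)) / sum(oob)  # overall OOB error rate
oob_accuracy <- 1 - oob_error

print(round(class_error, 9))
print(round(100 * oob_error, 1))  # in percent
print(round(oob_accuracy, 7))
```

Each OOB prediction for a row comes only from trees that did not see that row, so this estimate remains meaningful even though the same rows were used for fitting.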

Prediction with Gradient Boosting Machine

model_gbm<- train(classe ~ . , data = trainData, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 1), verbose = FALSE)
print(model_gbm)

## Stochastic Gradient Boosting 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 10989, 10990, 10990, 10989, 10990 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7572970  0.6921642
##   1                  100      0.8255803  0.7792114
##   1                  150      0.8566640  0.8185488
##   2                   50      0.8565919  0.8183136
##   2                  100      0.9050737  0.8798732
##   2                  150      0.9315712  0.9134018
##   3                   50      0.8971386  0.8697786
##   3                  100      0.9431460  0.9280516
##   3                  150      0.9615636  0.9513656
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
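
The final model is simply the row of the resampling table with the largest cross-validated accuracy. A small sketch that rebuilds the table from the printed output and picks that row (the `results` data frame is typed in by hand here, not extracted from `model_gbm`):

```r
# Resampling results copied from print(model_gbm) above.
results <- data.frame(
  interaction.depth = rep(1:3, each = 3),
  n.trees           = rep(c(50, 100, 150), times = 3),
  Accuracy          = c(0.7572970, 0.8255803, 0.8566640,
                        0.8565919, 0.9050737, 0.9315712,
                        0.8971386, 0.9431460, 0.9615636)
)

# Pick the tuning combination with the highest cross-validated accuracy.
best <- results[which.max(results$Accuracy), ]
print(best)
```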

Next, we evaluate this model on the held-out data (testData).

predict_model_gbm<- predict(model_gbm, testData)
cm_gbm<-confusionMatrix(predict_model_gbm,testData$classe)
cm_gbm$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1647   38    0    2    2
##          B   16 1066   41    3   15
##          C    6   32  963   36    7
##          D    5    3   20  917   22
##          E    0    0    2    6 1036

cm_gbm$overall[1]

##  Accuracy 
## 0.9564996

The GBM model reaches an accuracy of 0.9564996 on the held-out data, which gives an estimated out-of-sample error of about 0.04.

Conclusion

Of the three models, Random Forest is the most accurate (about 0.983, against 0.956 for GBM and 0.722 for the classification tree), so we use it to predict classe for the 20 cases in test_clean.

Result<- predict(model_rf, test_clean)
Result

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

priya malhotra

May 8, 2019
