Machine Learning Project

Author: Hannah Hon

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Libraries

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
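
Note that caret's method = "rf" and method = "gbm" fits below also depend on the randomForest and gbm packages, which caret loads on demand. A one-time setup sketch, assuming installation from CRAN:

## one-time setup: packages used directly or loaded on demand by caret
install.packages(c("caret", "rpart", "rattle", "randomForest", "gbm"))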

Getting and Cleaning Data

train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train <- download.file(train,"./data")
training <- read.csv("train")
test <- download.file(test, "./test")
testing <- read.csv("test")
## remove the invalid columes
training <- training[,colSums(is.na(training)) == 0]
testing <- testing[,colSums(is.na(testing)) == 0]
dim(training)
## [1] 19622    93
dim(testing)
## [1] 20 60
## Remove the first 7 columns (identifiers and timestamps), as they have little impact on classe
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
dim(training)
## [1] 19622    86
dim(testing)
## [1] 20 53
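
Note the asymmetry above: training keeps 86 columns while testing keeps 53. Many of the sparse training columns store empty strings rather than NA, so they slip past the is.na() filter. A sketch of a stricter cleaning pass, kept in a new training2 object so the original pipeline is untouched and assuming the file paths used above:

## stricter pass: treat blank strings as NA so both filters agree
training2 <- read.csv("./pml-training.csv", na.strings = c("NA", ""))
training2 <- training2[, colSums(is.na(training2)) == 0]
training2 <- training2[, -c(1:7)]
dim(training2)  ## expected to line up with the 53 columns kept in testing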

Preparation for Prediction

set.seed(12345)  ## make the 70/30 split reproducible
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainData <- training[inTrain, ]
testData <- training[-inTrain, ]
NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
testData  <- testData[, -NZV]
dim(trainData)
## [1] 13737    53
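
As a quick sanity check, not part of the original analysis: createDataPartition stratifies on classe, so the class proportions in the two partitions should be nearly identical.

## class proportions should match closely across the stratified split
round(prop.table(table(trainData$classe)), 3)
round(prop.table(table(testData$classe)), 3)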

Prediction Model Building

Three prediction models will be used: random forest, decision tree, and generalized boosted model (GBM).

1. Random Forest

set.seed(12345)
controlrf <- trainControl(method="cv", number=3, verboseIter=FALSE)
rf <- train(classe ~ ., data = trainData, method = "rf", trControl = controlrf)
rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.75%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3901    4    1    0    0 0.001280082
## B   20 2628    9    1    0 0.011286682
## C    0   17 2368   11    0 0.011686144
## D    0    0   26 2224    2 0.012433393
## E    0    2    4    6 2513 0.004752475
predictrf <- predict(rf, testData)
conf <- confusionMatrix(predictrf, testData$classe)
conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    7    0    0    0
##          B    0 1128    1    0    0
##          C    0    3 1022    7    1
##          D    0    1    3  957    4
##          E    1    0    0    0 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9952          
##                  95% CI : (0.9931, 0.9968)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.994           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9903   0.9961   0.9927   0.9954
## Specificity            0.9983   0.9998   0.9977   0.9984   0.9998
## Pos Pred Value         0.9958   0.9991   0.9894   0.9917   0.9991
## Neg Pred Value         0.9998   0.9977   0.9992   0.9986   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1917   0.1737   0.1626   0.1830
## Detection Prevalence   0.2855   0.1918   0.1755   0.1640   0.1832
## Balanced Accuracy      0.9989   0.9951   0.9969   0.9956   0.9976

The accuracy of the random forest model is very high at 0.9952 (95% CI: 0.9931–0.9968). However, such a high accuracy could be a sign of overfitting.

plot(rf)  ## resampled accuracy across the mtry values tried by caret
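
As a supplementary check, not in the original write-up: caret's varImp() ranks the predictors the forest leans on most heavily, which helps judge whether the model keys on plausible sensor features rather than artifacts.

## top 20 predictors by random forest importance
plot(varImp(rf), top = 20)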

2. Decision Tree

modelrp <- rpart(classe ~ ., data = trainData, method = "class")
fancyRpartPlot(modelrp)

predictrp <- predict(modelrp, testData, type = "class")
confrp <- confusionMatrix(predictrp, testData$classe)

The accuracy of the decision tree is about 0.75, which is noticeably lower than the random forest above and the generalized boosted model below.
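
As an aside, one way the tree might be improved, sketched here rather than taken from the original analysis: rpart records cross-validated error for a grid of complexity parameters, and pruning at the cp value with the lowest xerror can reduce overfitting. The cp value below is read from the fitted model rather than hard-coded.

## cross-validated error for each complexity parameter tried
printcp(modelrp)
## prune at the cp minimizing cross-validated error
bestcp <- modelrp$cptable[which.min(modelrp$cptable[, "xerror"]), "CP"]
prunedrp <- prune(modelrp, cp = bestcp)
confusionMatrix(predict(prunedrp, testData, type = "class"), testData$classe)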

3. Generalized Boosted Model

set.seed(12345)
controlgbm <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modelgbm  <- train(classe ~ ., data=trainData, method = "gbm",
                    trControl = controlgbm, verbose = FALSE)
modelgbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 41 had non-zero influence.
predictgbm <- predict(modelgbm, newdata=testData)
confgbm <- confusionMatrix(predictgbm, testData$classe)
confgbm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1653   36    0    0    4
##          B   15 1062   30    0    8
##          C    4   33  981   31    5
##          D    2    5   11  928   16
##          E    0    3    4    5 1049
## 
## Overall Statistics
##                                           
##                Accuracy : 0.964           
##                  95% CI : (0.9589, 0.9686)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9544          
##  Mcnemar's Test P-Value : 9.355e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9875   0.9324   0.9561   0.9627   0.9695
## Specificity            0.9905   0.9888   0.9850   0.9931   0.9975
## Pos Pred Value         0.9764   0.9525   0.9307   0.9647   0.9887
## Neg Pred Value         0.9950   0.9839   0.9907   0.9927   0.9932
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2809   0.1805   0.1667   0.1577   0.1782
## Detection Prevalence   0.2877   0.1895   0.1791   0.1635   0.1803
## Balanced Accuracy      0.9890   0.9606   0.9706   0.9779   0.9835

The accuracy of the generalized boosted model is 0.964, which is also very high, though slightly below the random forest.
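
For reference, the tuning values caret settled on can be read off the fitted object, and plotting the train object shows accuracy across the boosting grid (both are standard caret accessors):

## winning tuning parameters from the 5-fold cross-validation grid
modelgbm$bestTune
## accuracy across boosting iterations and interaction depth
plot(modelgbm)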

Applying the Selected Model to the Testing Data

The accuracies of the three classification models on the validation set are:

Random Forest: 0.9952
Decision Tree: 0.7514
GBM: 0.964
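
These figures can be read straight off the confusionMatrix objects computed above instead of being transcribed by hand; a small sketch using the conf, confrp, and confgbm objects from the earlier sections:

## collect overall accuracy from each model's confusion matrix
round(c(RF   = unname(conf$overall["Accuracy"]),
        CART = unname(confrp$overall["Accuracy"]),
        GBM  = unname(confgbm$overall["Accuracy"])), 4)

Since the random forest achieves the highest accuracy, it is applied to the 20 cases in the testing set: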

predictTest <- predict(rf, testing)
predictTest
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
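
Finally, if the 20 predictions need to be archived or submitted, they can be written out; a minimal sketch, assuming the problem_id column of the original testing file survived the cleaning above (the file name answers.csv is arbitrary):

## save one row per test case with its predicted classe
write.csv(data.frame(problem_id = testing$problem_id,
                     predicted = predictTest),
          "answers.csv", row.names = FALSE)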