For this project, we are given data from accelerometers on the belt, forearm, arm, and dumbbell of 6 research study participants. Our training data consists of accelerometer measurements and a label (classe) identifying the quality of the activity the participant was doing. Our testing data consists of the same measurements without the label. Our goal is to predict the labels for the 20 test set observations.
Below is the code I used when creating the model, estimating the out-of-sample error, and making predictions. I also include a description of each step of the process.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
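For reproducibility, the two CSV files can be downloaded directly from the URLs above before reading them in. This is a minimal sketch; the object names trainUrl and testUrl are mine, and the destination file names match those used in read.csv below:
# Download the data files if they are not already present in the working directory
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testUrl,  destfile = "pml-testing.csv")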
library(caret)
## Warning: package 'caret' was built under R version 3.2.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.3
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.3
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(knitr)
## Warning: package 'knitr' was built under R version 3.2.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.2.3
ptrain <- read.csv("pml-training.csv")
ptest <- read.csv("pml-testing.csv")
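An optional refinement, assuming the raw CSVs encode missing values as blank fields or the spreadsheet artifact "#DIV/0!" (not verified here), is to map those strings to NA at read time so the mostly-NA filter below catches them:
# Alternative read that treats blank fields and "#DIV/0!" as NA (assumption about the raw files)
ptrain <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
ptest  <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))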
Because I want to be able to estimate the out-of-sample error, I randomly split the full training data (ptrain) into a smaller training set (ptrain1) and a validation set (ptrain2):
partition <- createDataPartition(y=ptrain$classe, p=0.7, list=F)
ptrain1 <- ptrain[partition, ]
ptrain2 <- ptrain[-partition, ]
Next I remove variables with near-zero variance, variables that are mostly NA, and variables that have no value for prediction (identifiers and timestamps):
# Remove variables with near-zero variance
nzv <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzv]
ptrain2 <- ptrain2[, -nzv]
# Remove variables that are mostly NA
mostlyNA <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, mostlyNA == FALSE]
ptrain2 <- ptrain2[, mostlyNA == FALSE]
# Remove variables with no predictive value (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp), which happen to be the first five variables
ptrain1 <- ptrain1[, -(1:5)]
ptrain2 <- ptrain2[, -(1:5)]
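As a quick sanity check, the cleaned training set should now contain the 53 predictors plus classe that the model summaries below report; exact row counts depend on the random partition:
# Expect roughly 13737 rows and 54 columns (53 predictors + classe) after cleaning
dim(ptrain1)
dim(ptrain2)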
I first try a classification tree (rpart), using caret's default bootstrap resampling:
modFit <- train(classe ~ ., data = ptrain1, method = "rpart")
print(modFit, digits=3)
## CART
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.0395 0.542 0.4093 0.0512 0.0794
## 0.0595 0.389 0.1606 0.0487 0.0813
## 0.1159 0.321 0.0545 0.0425 0.0628
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0395.
print(modFit$finalModel, digits=3)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9830 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130 12512 8650 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34.3 1069 2 A (1 0.0019 0 0 0) *
## 5) pitch_forearm>=-34.3 11443 8650 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 438 9663 6930 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122 5991 3540 A (0.41 0.18 0.19 0.16 0.06) *
## 21) roll_forearm>=122 3672 2480 C (0.078 0.18 0.32 0.23 0.18) *
## 11) magnet_dumbbell_y>=438 1780 868 B (0.034 0.51 0.042 0.23 0.18) *
## 3) roll_belt>=130 1225 43 E (0.035 0 0 0 0.96) *
fancyRpartPlot(modFit$finalModel)
Now I run the fitted tree against the validation set (ptrain2) and look at the confusion matrix:
# Run against ptrain2
predictions <- predict(modFit, ptrain2)
print(confusionMatrix(predictions, ptrain2$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1492 485 453 451 131
## B 27 381 34 158 151
## C 124 273 539 355 283
## D 0 0 0 0 0
## E 31 0 0 0 517
##
## Overall Statistics
##
## Accuracy : 0.4977
## 95% CI : (0.4849, 0.5106)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3442
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8913 0.33450 0.52534 0.0000 0.47782
## Specificity 0.6390 0.92204 0.78699 1.0000 0.99355
## Pos Pred Value 0.4954 0.50732 0.34244 NaN 0.94343
## Neg Pred Value 0.9367 0.85236 0.88703 0.8362 0.89414
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2535 0.06474 0.09159 0.0000 0.08785
## Detection Prevalence 0.5118 0.12761 0.26746 0.0000 0.09312
## Balanced Accuracy 0.7652 0.62827 0.65617 0.5000 0.73568
It was disappointing to see this low accuracy (0.4977 on the validation set), so I try a random forest instead.
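Fitting a random forest on roughly 14,000 rows can be slow. As an optional speed-up that was not used for the run shown here, caret can train the cross-validation folds in parallel once a backend is registered, for example with the doParallel package:
# Optional: register a parallel backend so train() can fit folds in parallel
library(doParallel)
cl <- makePSOCKcluster(4)   # 4 worker processes; adjust to your machine
registerDoParallel(cl)
# ... call train() as below ...
stopCluster(cl)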
# Fit a random forest on ptrain1, using 4-fold cross-validation to choose mtry
set.seed(666)
modFit <- train(classe ~ ., data = ptrain1, method = "rf", trControl = trainControl(method = "cv", number = 4))
print(modFit, digits=3)
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 10302, 10302, 10304, 10303
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.991 0.989 0.00252 0.00319
## 27 0.996 0.994 0.00104 0.00131
## 53 0.993 0.991 0.00139 0.00176
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
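Before validating, it can be informative to see which predictors the random forest relies on most; caret's varImp works directly on the fitted train object (shown as an optional check, not part of the original run):
# Variable importance for the random forest fit
varImp(modFit)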
Now I run the random forest model against the validation set (ptrain2):
predictions <- predict(modFit, newdata=ptrain2)
print(confusionMatrix(predictions, ptrain2$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1135 2 0 0
## C 0 0 1024 5 0
## D 0 0 0 959 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9978
## 95% CI : (0.9962, 0.9988)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9972
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9965 0.9981 0.9948 0.9982
## Specificity 0.9991 0.9996 0.9990 0.9996 1.0000
## Pos Pred Value 0.9976 0.9982 0.9951 0.9979 1.0000
## Neg Pred Value 1.0000 0.9992 0.9996 0.9990 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1929 0.1740 0.1630 0.1835
## Detection Prevalence 0.2851 0.1932 0.1749 0.1633 0.1835
## Balanced Accuracy 0.9995 0.9980 0.9985 0.9972 0.9991
accuracy <- postResample(predictions, ptrain2$classe)
accuracy
## Accuracy Kappa
## 0.9977910 0.9972057
# Out-of-sample error estimate from the validation set
oose <- 1 - as.numeric(confusionMatrix(predictions, ptrain2$classe)$overall[1])
oose
## [1] 0.002209006
Finally, I use the random forest model to predict the classe labels for the 20 observations in the test set (ptest):
predictions <- predict(modFit, newdata = ptest)
print(predictions)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
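To submit these 20 predictions, one approach is to write each one to its own text file. The helper below is a hypothetical sketch; pml_write_files is not defined anywhere else in this report's code:
# Hypothetical helper: write one file per test case prediction
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(predictions))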
#### 1. Correlation Matrix Visualization
# Correlation matrix of the predictors (the last column, classe, is excluded)
corrPlot <- cor(ptrain1[, -length(names(ptrain1))])
corrplot(corrPlot, method="color")
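As an optional follow-up to the correlation plot, caret's findCorrelation can list predictors with very high pairwise correlations; these could be dropped or pre-processed with PCA, although that was not done for the models above:
# Predictors with absolute pairwise correlation above 0.9 (the cutoff is an arbitrary choice)
highCorr <- findCorrelation(corrPlot, cutoff = 0.9)
names(ptrain1)[highCorr]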
#### 2. Decision Tree Visualization
# Grow a single classification tree on ptrain1 for the visualization
treeModel <- rpart(classe ~ ., data=ptrain1, method="class")
prp(treeModel)