In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. My goal is to predict the manner in which they did the exercise; this is the “classe” variable in the training set. More details: https://class.coursera.org/predmachlearn-012/human_grading/view/courses/973547/assessments/4/submissions.
My first step is to load the data from local files and remove near-zero-variance predictors, predictors that are mostly NA, and highly correlated predictors.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(2015)
dat <- read.csv('data/pml-training.csv', row.names = 1)
dim(dat)
## [1] 19622 159
# Remove predictors that are mostly NA, then zero- and near-zero-variance predictors
dat <- dat[colSums(is.na(dat)) < 0.5*nrow(dat)] # 93 variables remain
nzv <- nearZeroVar(dat)
dat <- dat[, -nzv] # 58 variables remain
dim(dat)
## [1] 19622 58
# Identify and remove highly correlated predictors
numericData <- dat[sapply(dat, is.numeric)]
descrCor <- cor(numericData)
summary(descrCor[upper.tri(descrCor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992000 -0.104100 0.001566 0.001313 0.086960 0.980900
highlyCorDescr <- findCorrelation(descrCor, cutoff = .8)
highlyCorCol <- colnames(numericData[,highlyCorDescr])
dat <- dat[, -which(colnames(dat) %in% highlyCorCol)]
dim(dat)
## [1] 19622 46
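As a quick sanity check (a sketch reusing the same summary call as above, not part of the original run), the surviving numeric predictors should now have no pairwise correlation above the 0.8 cutoff:
# Sanity check: no remaining pair should exceed the 0.8 cutoff
remainingCor <- cor(dat[sapply(dat, is.numeric)])
summary(remainingCor[upper.tri(remainingCor)])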
# Simple splitting: 60% training, 40% testing
inTraining <- createDataPartition(dat$classe, p = .6, list = FALSE)
training <- dat[ inTraining,]
testing <- dat[-inTraining,]
# Model training control: 10-fold cross-validation
fitControl <- trainControl(method = "cv", number = 10)
#Model List: http://topepo.github.io/caret/modelList.html
# Gradient Boosting Machine (gbm)
start <- proc.time()
gbmFit <- train(classe ~ ., data = training,
                method = "gbm",
                trControl = fitControl,
                verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
elapsed <- proc.time() - start
I tried three models: recursive partitioning (rpart), a gradient boosting machine (gbm), and a random forest (rf). The code for each is identical except for the method argument passed to train(). The rpart model could not be fitted on this data (see the note in the annex). My final choice is gbm, due to its high accuracy (0.9976 on the hold-out set); the rf model is slightly more accurate (0.9992), but it took roughly twice as long to train.
The modeling code and results are in the annex at the end; a quick way to compare the two successful fits is sketched below.
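caret offers resamples() to compare the cross-validation results of several fits side by side. A minimal sketch, assuming gbmFit and rfFit from the annex are both in memory:
# Compare the two resampled fits (sketch; gbmFit and rfFit are fitted in the annex)
results <- resamples(list(GBM = gbmFit, RF = rfFit))
summary(results) # accuracy and kappa distributions per model
bwplot(results)  # visual comparison of the resampling distributions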
Finally, I predicted the CSV test data with the gbm model and wrote the results to text files for submission.
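That write-out step is not shown in the annex; here is a minimal sketch, where the test-file path and the problem_id_*.txt naming scheme are my assumptions:
# Sketch of the submission step (assumed path and file names)
testdat <- read.csv('data/pml-testing.csv', row.names = 1)
answers <- predict(gbmFit, testdat)
for (i in seq_along(answers)) {
  write.table(answers[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}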
# gradient boosting machine (gbm) model
elapsed
## user system elapsed
## 678.79 0.72 680.02
gbmFit
## Stochastic Gradient Boosting
##
## 11776 samples
## 45 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10598, 10599, 10598, 10600, 10598, 10598, ...
##
## Resampling results across tuning parameters:
##
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD
##   1                   50      0.8296495  0.7839736  0.013695744  0.017415787
##   1                  100      0.8963129  0.8686992  0.007708690  0.009806397
##   1                  150      0.9201751  0.8988967  0.009220057  0.011686888
##   2                   50      0.9538873  0.9416148  0.005537182  0.007007403
##   2                  100      0.9850544  0.9810943  0.003493162  0.004419199
##   2                  150      0.9915081  0.9892585  0.002886121  0.003650359
##   3                   50      0.9817428  0.9769025  0.004531410  0.005730630
##   3                  100      0.9935462  0.9918367  0.002374829  0.003003883
##   3                  150      0.9962634  0.9952739  0.002375532  0.003004306
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
prediction_gbm <- predict(gbmFit, testing)
confusionMatrix(prediction_gbm, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 2 0 0 0
## B 0 1511 1 0 0
## C 0 3 1360 1 0
## D 0 2 7 1285 3
## E 0 0 0 0 1439
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.9962, 0.9985)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9969
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9954 0.9942 0.9992 0.9979
## Specificity 0.9996 0.9998 0.9994 0.9982 1.0000
## Pos Pred Value 0.9991 0.9993 0.9971 0.9907 1.0000
## Neg Pred Value 1.0000 0.9989 0.9988 0.9998 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1926 0.1733 0.1638 0.1834
## Detection Prevalence 0.2847 0.1927 0.1738 0.1653 0.1834
## Balanced Accuracy 0.9998 0.9976 0.9968 0.9987 0.9990
plot(gbmFit)
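The expected out-of-sample error is one minus the hold-out accuracy, i.e. roughly 1 - 0.9976 = 0.0024 here; a one-line sketch:
# Estimated out-of-sample error from the 40% hold-out set
1 - unname(confusionMatrix(prediction_gbm, testing$classe)$overall["Accuracy"])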
#Random Forest (RF)
start <- proc.time()
rfFit <- train(classe ~ ., data = training,
               method = "rf",
               trControl = fitControl,
               verbose = FALSE)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
elapsed <- proc.time() - start
elapsed
## user system elapsed
## 1637.55 5.76 1645.17
rfFit
## Random Forest
##
## 11776 samples
## 45 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10599, 10599, 10598, 10597, 10597, 10599, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9821664 0.9774326 0.0022347582 0.002828670
## 34 0.9993203 0.9991403 0.0008774815 0.001109833
## 67 0.9982162 0.9977439 0.0012315787 0.001557637
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 34.
prediction_rf <- predict(rfFit, testing)
confusionMatrix(prediction_rf, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 0 0 0 0
## B 0 1517 2 0 0
## C 0 1 1366 1 0
## D 0 0 0 1285 2
## E 0 0 0 0 1440
##
## Overall Statistics
##
## Accuracy : 0.9992
## 95% CI : (0.9983, 0.9997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.999
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9993 0.9985 0.9992 0.9986
## Specificity 1.0000 0.9997 0.9997 0.9997 1.0000
## Pos Pred Value 1.0000 0.9987 0.9985 0.9984 1.0000
## Neg Pred Value 1.0000 0.9998 0.9997 0.9998 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1933 0.1741 0.1638 0.1835
## Detection Prevalence 0.2845 0.1936 0.1744 0.1640 0.1835
## Balanced Accuracy 1.0000 0.9995 0.9991 0.9995 0.9993
plot(rfFit)
# Recursive Partitioning (rpart)
# rpartFit <- train(classe ~ ., data = training,
#                   method = "rpart",
#                   trControl = fitControl,
#                   verbose = FALSE)
# This model could not be fitted.
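A likely cause, though I have not verified it against the original run: train() forwards extra arguments such as verbose = FALSE to the underlying fitting function, and rpart::rpart has no verbose argument, so the call errors. Dropping that argument should let the model fit:
# Sketch: the same call without verbose, which rpart does not accept
rpartFit <- train(classe ~ ., data = training,
                  method = "rpart",
                  trControl = fitControl)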