Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Dataset
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
Data Preprocessing
Loading the R packages:
# packages
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## R session is headless; GTK+ not initialized.
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(repmis)
Loading and cleaning the data: the raw training and test CSVs are read in, columns containing missing values and the first seven bookkeeping columns (row index, user name, timestamps, and window indicators) are dropped, and the remaining variables are checked for near-zero variance.
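The preprocessing chunk itself did not render in this report. The following is a minimal sketch, consistent with the output below, of how the cleaning could be done; the file names, the na.strings values, and the use of read.csv (rather than repmis::source_data) are assumptions, since the original code is not shown.
# Sketch of the preprocessing step (assumed code; the CSVs are expected in
# the working directory).
trainRaw <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testRaw <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))

# Drop columns that contain missing values, then the first seven
# bookkeeping columns (row index, user name, timestamps, window info).
trainRaw <- trainRaw[, colSums(is.na(trainRaw)) == 0]
testRaw <- testRaw[, colSums(is.na(testRaw)) == 0]
cleanTrainData <- trainRaw[, -(1:7)]
cleanTestData <- testRaw[, -(1:7)]

# Confirm the 52 predictor names agree between the two cleaned sets.
all(names(cleanTrainData)[1:52] == names(cleanTestData)[1:52])

# Check the remaining variables for near-zero variance.
nearZeroVar(cleanTrainData, saveMetrics = TRUE)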
## [1] TRUE
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.101904 6.7781062 FALSE FALSE
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## gyros_belt_x 1.058651 0.7134849 FALSE FALSE
## gyros_belt_y 1.144000 0.3516461 FALSE FALSE
## gyros_belt_z 1.066214 0.8612782 FALSE FALSE
## accel_belt_x 1.055412 0.8357966 FALSE FALSE
## accel_belt_y 1.113725 0.7287738 FALSE FALSE
## accel_belt_z 1.078767 1.5237998 FALSE FALSE
## magnet_belt_x 1.090141 1.6664968 FALSE FALSE
## magnet_belt_y 1.099688 1.5187035 FALSE FALSE
## magnet_belt_z 1.006369 2.3290184 FALSE FALSE
## roll_arm 52.338462 13.5256345 FALSE FALSE
## pitch_arm 87.256410 15.7323412 FALSE FALSE
## yaw_arm 33.029126 14.6570176 FALSE FALSE
## total_accel_arm 1.024526 0.3363572 FALSE FALSE
## gyros_arm_x 1.015504 3.2769341 FALSE FALSE
## gyros_arm_y 1.454369 1.9162165 FALSE FALSE
## gyros_arm_z 1.110687 1.2638875 FALSE FALSE
## accel_arm_x 1.017341 3.9598410 FALSE FALSE
## accel_arm_y 1.140187 2.7367241 FALSE FALSE
## accel_arm_z 1.128000 4.0362858 FALSE FALSE
## magnet_arm_x 1.000000 6.8239731 FALSE FALSE
## magnet_arm_y 1.056818 4.4439914 FALSE FALSE
## magnet_arm_z 1.036364 6.4468454 FALSE FALSE
## roll_dumbbell 1.022388 84.2065029 FALSE FALSE
## pitch_dumbbell 2.277372 81.7449801 FALSE FALSE
## yaw_dumbbell 1.132231 83.4828254 FALSE FALSE
## total_accel_dumbbell 1.072634 0.2191418 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.2282132 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.4167771 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.0498420 FALSE FALSE
## accel_dumbbell_x 1.018018 2.1659362 FALSE FALSE
## accel_dumbbell_y 1.053061 2.3748853 FALSE FALSE
## accel_dumbbell_z 1.133333 2.0894914 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.7486495 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.3012945 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.4451126 FALSE FALSE
## roll_forearm 11.589286 11.0895933 FALSE FALSE
## pitch_forearm 65.983051 14.8557741 FALSE FALSE
## yaw_forearm 15.322835 10.1467740 FALSE FALSE
## total_accel_forearm 1.128928 0.3567424 FALSE FALSE
## gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
## gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
## gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
## accel_forearm_x 1.126437 4.0464784 FALSE FALSE
## accel_forearm_y 1.059406 5.1116094 FALSE FALSE
## accel_forearm_z 1.006250 2.9558659 FALSE FALSE
## magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
## magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
## magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
## classe 1.469581 0.0254816 FALSE FALSE
The cleaned data sets cleanTrainData and cleanTestData both have 53 columns: the same first 52 predictor variables, with the last column being classe in cleanTrainData and problem_id in cleanTestData. cleanTrainData has 19622 rows, while cleanTestData has 20 rows.
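As a quick sanity check (not part of the original output), the dimensions can be confirmed directly:
# Confirm the dimensions of the cleaned data sets
dim(cleanTrainData) # 19622 rows, 53 columns
dim(cleanTestData)  # 20 rows, 53 columns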
Data Splitting
I split the cleaned training set cleanTrainData into a training set (train, 70%) for building the models and a validation set (test, 30%) for estimating out-of-sample performance.
set.seed(7826)
inTrain <- createDataPartition(cleanTrainData$classe, p = 0.7, list = FALSE)
train <- cleanTrainData[inTrain, ]
test <- cleanTrainData[-inTrain, ]
Prediction Algorithms
Classification trees
First, we fit an "out of the box" classification tree to predict classe, using 5-fold cross-validation to choose the complexity parameter; random forests follow in the next section.
ctrl <- trainControl(method = "cv", number = 5)
modfit <- train(classe ~ ., data = train, method = "rpart",
trControl = ctrl)
print(modfit, digits = 4)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10989, 10989, 10990, 10989, 10991
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.03723 0.5241 0.38748 0.03851 0.06202
## 0.05954 0.4144 0.20668 0.06477 0.10984
## 0.11423 0.3482 0.09762 0.03575 0.05469
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03723.
fancyRpartPlot(modfit$finalModel)
# Predicting outcomes on the validation set
predictTree <- predict(modfit, test)
# Prediction results
(confidence <- confusionMatrix(test$classe, predictTree))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1544 21 107 0 2
## B 492 391 256 0 0
## C 474 38 514 0 0
## D 436 175 353 0 0
## E 155 138 293 0 496
##
## Overall Statistics
##
## Accuracy : 0.5004
## 95% CI : (0.4876, 0.5133)
## No Information Rate : 0.5269
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3464
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4979 0.51245 0.33749 NA 0.99598
## Specificity 0.9533 0.85396 0.88262 0.8362 0.89122
## Pos Pred Value 0.9223 0.34328 0.50097 NA 0.45841
## Neg Pred Value 0.6303 0.92162 0.79234 NA 0.99958
## Prevalence 0.5269 0.12965 0.25879 0.0000 0.08462
## Detection Rate 0.2624 0.06644 0.08734 0.0000 0.08428
## Detection Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Balanced Accuracy 0.7256 0.68321 0.61006 NA 0.94360
# Verifying the accuracy of the prediction
(accuracy <- confidence$overall[1])
## Accuracy
## 0.5004248
The accuracy rate of the classification tree is only about 0.50, which does not support good prediction. For this reason, we use random forests instead.
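For reference, the expected out-of-sample error implied by this model is simply one minus the accuracy computed above; the small check below is an addition, not part of the original output.
# Expected out-of-sample error rate for the classification tree
(1 - as.numeric(accuracy)) # about 0.50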
Random forests
random_forests <- train(classe ~ ., data = train, method = "rf", trControl = ctrl)
print(random_forests, digits = 4)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10990, 10990, 10988, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9905 0.9880 0.002697 0.003414
## 27 0.9908 0.9884 0.003523 0.004459
## 52 0.9851 0.9811 0.004618 0.005847
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# Confusion matrix
predictRF <- predict(random_forests, test)
(confRF <- confusionMatrix(test$classe, predictRF))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1668 3 0 1 2
## B 10 1125 3 1 0
## C 0 2 1017 7 0
## D 0 1 15 944 4
## E 2 2 1 4 1073
##
## Overall Statistics
##
## Accuracy : 0.9901
## 95% CI : (0.9873, 0.9925)
## No Information Rate : 0.2855
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9875
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9929 0.9929 0.9817 0.9864 0.9944
## Specificity 0.9986 0.9971 0.9981 0.9959 0.9981
## Pos Pred Value 0.9964 0.9877 0.9912 0.9793 0.9917
## Neg Pred Value 0.9972 0.9983 0.9961 0.9974 0.9988
## Prevalence 0.2855 0.1925 0.1760 0.1626 0.1833
## Detection Rate 0.2834 0.1912 0.1728 0.1604 0.1823
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9957 0.9950 0.9899 0.9912 0.9963
# Verifying accuracy
(accuracyRF <- confRF$overall[1])
## Accuracy
## 0.9901444
The random forest method has a much higher accuracy rate (about 0.990) than the classification tree, so the estimated out-of-sample error rate is about 0.010. However, the algorithm is harder to interpret and computationally expensive: it took around 15 minutes to compute the results.
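Both numbers follow directly from the confusion matrix above; the check below (not in the original output) makes the error-rate calculation explicit, and the commented lines sketch one common way, using the doParallel package (which this analysis does not use), to shorten the long train() run time.
# Estimated out-of-sample error rate for the random forest
(1 - as.numeric(accuracyRF)) # about 0.0099

# Optional speed-up (assumed, not part of the original analysis): register a
# parallel backend so caret's resampling runs on several cores.
# library(doParallel)
# cl <- makeCluster(parallel::detectCores() - 1)
# registerDoParallel(cl)
# random_forests <- train(classe ~ ., data = train, method = "rf", trControl = ctrl)
# stopCluster(cl)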
Prediction on Testing Set with Random Forests
# Apply the random forest model to the 20 test cases provided by Professor Leek.
(predict(random_forests, cleanTestData))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Appendix
Correlation Matrix
library(corrplot)
# Correlations among the 52 numeric predictors in the cleaned training set
# (the factor variable classe is excluded).
M <- cor(cleanTrainData[, names(cleanTrainData) != "classe"])
corrplot(M, method = "circle", order = "FPC")