We use accelerometer data from six subjects to build a highly predictive machine learning model. After cross validation, the best model, a random forest, achieves 99.2% accuracy and correctly predicts all 20 values in the ultimate test set.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
First we download the data for this project. The training data contains 19,622 observations of 160 variables; the ultimate test set contains 20 observations of the same 160 variables.
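For reproducibility, the two csv files can be fetched directly from R. The cloudfront URLs below are assumed to be the standard course links rather than taken from this report; adjust them if the files are hosted elsewhere.
# Sketch: download the two csv files if they are not already present.
# The URLs are assumed to be the standard course links for this dataset.
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testUrl, "pml-testing.csv")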
We will hold back the ultimate test data until the very end of the model-building procedure since it contains only 20 cases, and instead subdivide the training data into our own training and test sets.
Nomenclature: PML refers to the training csv as a whole (19,622 cases). Training refers to the subset of PML used for training, and testing refers to the subset of PML used for cross validation. The “ultimate test” set refers to the separate 20-case csv on which the final model is applied.
Load the data
# Load the (total) training set.
# Missing values appear in three forms in the raw csv; treat all of them as NA.
pml <- read.csv("pml-training.csv", na.strings=c("#DIV/0!","","NA"))
Load the caret package and divide PML into training and testing sets: training is 60% of PML, testing is 40%. Set the seed for reproducibility.
# load the caret package for training and partitioning
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1337)
# segment the original training set for cross validation
inTrain <- createDataPartition(y=pml$classe,
p=0.6, list=FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
Many columns consist almost entirely of missing values, so remove any column that contains NAs. The first 7 columns contain bookkeeping information that shouldn’t improve the model (row numbers, user name, timestamps, and window counters) and can also be removed.
# remove columns that contain missing values
training <- training[,colSums(is.na(training)) == 0]
testing <- testing[,colSums(is.na(testing)) == 0]
# remove the first 7 bookkeeping columns (not useful as predictors)
training <- training[,-(c(1:7))]
testing <- testing[,-c((1:7))]
The variable being predicted in both the training and testing sets is “classe”, which takes one of five values (A, B, C, D, E).
I expect a suitable model to have an out-of-sample error rate of less than 1 in 20, since the assignment ultimately expects us to correctly predict all 20 values in the ultimate test set.
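As a rough back-of-the-envelope check, if the 20 ultimate test predictions are treated as independent, the chance of a perfect score is the per-prediction accuracy raised to the 20th power:
# Rough guide: probability of 20/20 correct is roughly accuracy^20 (independence assumed)
0.95^20     # ~0.36: 95% accuracy gives only about a 1-in-3 chance of a perfect score
0.99^20     # ~0.82: 99% accuracy makes a perfect score likely
0.5^(1/20)  # ~0.966: per-prediction accuracy needed for even a 50% chance of 20/20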
Let’s run a quadratic discriminant analysis (QDA), a more flexible version of the linear discriminant classifier, and then validate the model by applying it to the testing data.
# Train modFit1 with QDA. Predict on the model, and
# cross-validate with the testing data.
modFit1 <- train(classe ~ .,method="qda",data=training)
## Loading required package: MASS
prediction1 <- predict(modFit1,newdata=testing)
confusionMatrix(prediction1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2061 80 1 2 0
## B 95 1279 66 5 58
## C 36 137 1286 176 54
## D 30 2 6 1086 39
## E 10 20 9 17 1291
##
## Overall Statistics
##
## Accuracy : 0.8926
## 95% CI : (0.8855, 0.8993)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8643
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9234 0.8426 0.9401 0.8445 0.8953
## Specificity 0.9852 0.9646 0.9378 0.9883 0.9913
## Pos Pred Value 0.9613 0.8510 0.7614 0.9338 0.9584
## Neg Pred Value 0.9700 0.9623 0.9867 0.9701 0.9768
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2627 0.1630 0.1639 0.1384 0.1645
## Detection Prevalence 0.2733 0.1916 0.2153 0.1482 0.1717
## Balanced Accuracy 0.9543 0.9036 0.9389 0.9164 0.9433
This model gives 89.3% accuracy, an error rate of roughly one in ten, which misses our target of less than one error in twenty. I think we can build a model that does better, especially since the submission stage expects 20/20 correct predictions, which strongly suggests that accuracy well above 89.3% is achievable.
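Under the same independence assumption as before, 89.3% per-prediction accuracy makes a perfect submission unlikely:
# Probability of a perfect 20/20 submission with the QDA model
0.8926^20  # ~0.10, roughly a 1-in-10 chance of getting all 20 right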
We will use the randomForest package directly because it runs much faster than training a random forest through caret. After fitting the model we will validate it on the testing set.
# Predict with random forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
modFit2 <- randomForest(classe ~ ., data=training)
prediction2 <- predict(modFit2,newdata=testing)
confusionMatrix(prediction2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 13 0 0 0
## B 0 1500 15 0 0
## C 0 5 1352 24 0
## D 0 0 1 1262 4
## E 0 0 0 0 1438
##
## Overall Statistics
##
## Accuracy : 0.9921
## 95% CI : (0.9899, 0.9939)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.99
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9881 0.9883 0.9813 0.9972
## Specificity 0.9977 0.9976 0.9955 0.9992 1.0000
## Pos Pred Value 0.9942 0.9901 0.9790 0.9961 1.0000
## Neg Pred Value 1.0000 0.9972 0.9975 0.9964 0.9994
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1912 0.1723 0.1608 0.1833
## Detection Prevalence 0.2861 0.1931 0.1760 0.1615 0.1833
## Balanced Accuracy 0.9988 0.9929 0.9919 0.9903 0.9986
This is much better, with a 99.2% accuracy rate! That’s approximately one error in every 125 predictions.
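A quick sanity check on those figures (again treating the 20 ultimate test predictions as independent):
# Sanity check on the random forest error rate and the chance of a perfect submission
1 / (1 - 0.9921)  # ~127, i.e. roughly one error per 125 predictions
0.9921^20         # ~0.85 probability of getting all 20 ultimate test cases right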
Let’s apply our random forest (modFit2) to the ultimate test set and output our predictions:
# try small test set
pmlultimatetest <- read.csv("pml-testing.csv", na.strings=c("#DIV/0!","","NA"))
# predict for small 20 case set
predict3 <- predict(modFit2, newdata=pmlultimatetest)
predict3
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Let’s write each prediction to its own file for submission.
pml_write_files = function(x) {
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(predict3)
I submitted these predictions to the second part of the assignment and all 20/20 were correct!
# QDA model
print(modFit1$finalModel)
## Call:
## qda(x, grouping = y)
##
## Prior probabilities of groups:
## A B C D E
## 0.2843071 0.1935292 0.1744226 0.1638927 0.1838485
##
## Group means:
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
## A 59.96214 0.2422073 -11.142936 10.75269 -0.005627240
## B 64.73887 -0.4847258 -13.468916 11.08425 -0.002395788
## C 64.48253 -1.2340604 -7.113057 11.13291 -0.015998053
## D 59.93643 1.5626166 -18.212503 11.14611 -0.013756477
## E 74.04343 0.5354919 -5.568674 12.70254 0.006013857
## gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## A 0.04120968 -0.1230585 -6.105137 29.17473 -63.62515
## B 0.04251426 -0.1328872 -4.464239 31.64195 -72.90478
## C 0.03897274 -0.1360419 -3.824732 30.82668 -70.54382
## D 0.03601036 -0.1359378 -8.141451 29.88912 -68.07461
## E 0.03978753 -0.1287667 -4.461894 28.92240 -91.00416
## magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm
## A 58.35663 602.2261 -338.0606 -0.9099074 3.444131
## B 50.26547 599.2264 -336.6073 31.7422203 -6.671071
## C 57.45764 599.7317 -337.6738 25.5064216 -1.449391
## D 48.90000 593.7606 -341.4803 23.8906114 -10.568031
## E 62.59076 568.4425 -377.2416 20.1005266 -13.276910
## yaw_arm total_accel_arm gyros_arm_x gyros_arm_y gyros_arm_z
## A -10.6519654 27.32706 0.01674731 -0.2184827 0.2693668
## B 7.3877271 26.54410 0.02355419 -0.2889601 0.2573892
## C 5.2980964 24.32619 0.06817916 -0.2392600 0.2631402
## D 5.2741140 23.45233 0.03256995 -0.2455389 0.2613316
## E 0.1092656 24.42910 0.07375058 -0.3094134 0.2894965
## accel_arm_x accel_arm_y accel_arm_z magnet_arm_x magnet_arm_y
## A -131.33333 46.80974 -73.75179 -16.68817 236.57139
## B -41.60026 24.69987 -97.39798 238.61694 126.48355
## C -79.53603 41.23856 -54.14460 146.95618 190.94985
## D 13.24922 26.24197 -49.02021 392.94404 96.80052
## E -15.06420 16.39353 -77.31732 333.72656 81.55658
## magnet_arm_z roll_dumbbell pitch_dumbbell yaw_dumbbell
## A 412.1219 21.40308 -18.729006 0.9157553
## B 190.6762 35.58885 2.845738 13.2715066
## C 362.8140 -13.15525 -24.922559 -16.2464268
## D 298.4461 50.72061 -1.656519 0.9171672
## E 212.8554 27.61226 -7.324180 5.3187394
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## A 14.68668 0.1062037 0.042458184 -0.05898148
## B 14.35630 0.1684642 0.014111452 -0.14433523
## C 12.88364 0.1949464 0.051645570 -0.15321811
## D 11.22021 0.2081658 0.008150259 -0.12967358
## E 14.55566 0.1381109 0.105838337 -0.13818938
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## A -50.3900836 52.79152 -57.14367 -385.9973
## B -0.9262835 69.15665 -16.84116 -253.4370
## C -40.2156767 30.03651 -52.76582 -370.3096
## D -21.5025907 52.73938 -32.54611 -319.8098
## E -19.3852194 56.72887 -25.00739 -291.2619
## magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## A 217.8874 10.28901 25.96955 -6.733557
## B 267.5024 50.31768 32.49744 14.560926
## C 153.9830 62.94985 58.52051 12.591198
## D 220.8720 60.73109 15.65495 27.931922
## E 239.6106 72.18476 38.92143 16.402503
## yaw_forearm total_accel_forearm gyros_forearm_x gyros_forearm_y
## A 25.36562 32.00000 0.1744504 0.154596774
## B 14.27789 35.55287 0.1571128 0.087595437
## C 39.89561 34.88462 0.2005161 0.106363194
## D 5.00142 35.99637 0.1120570 0.023683938
## E 12.95390 36.64527 0.1436351 -0.005404157
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## A 0.1841876 -0.9596774 170.3796 -58.98656
## B 0.1752611 -77.9324265 139.3629 -46.37473
## C 0.1411928 -48.8583252 213.3189 -61.01607
## D 0.1289534 -150.1839378 154.6187 -43.51865
## E 0.1385266 -71.8138568 146.9478 -61.36998
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## A -195.2264 477.5412 406.1263
## B -327.6235 279.7016 376.2211
## C -339.6480 500.5501 458.5336
## D -449.2969 319.0663 355.3938
## E -334.1404 280.5926 357.5358
# Random forest model. Can't list all the trees, but here is the forest, so to speak.
print(modFit2)
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.65%
## Confusion matrix:
## A B C D E class.error
## A 3345 2 0 0 1 0.0008960573
## B 14 2258 7 0 0 0.0092145678
## C 0 17 2035 2 0 0.0092502434
## D 0 0 24 1905 1 0.0129533679
## E 0 0 2 6 2157 0.0036951501
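If more detail than the summary above is wanted, the randomForest package can also report variable importance and extract individual trees; the following is only a sketch and its output is not shown here.
# Sketch: peek inside the fitted forest (output not shown).
# varImpPlot() ranks predictors by mean decrease in Gini impurity;
# getTree() extracts a single tree (here the first of the 500) as a data frame.
varImpPlot(modFit2, n.var = 10, main = "Top 10 predictors")
head(getTree(modFit2, k = 1, labelVar = TRUE))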