Practical Machine Learning Assignment

Marijana Vujkovic
Friday, 06/19/2015

Executive Summary

The following is the markdown report for the “Practical Machine Learning” assignment. The goal is to fit a classification model for a multi-class outcome in a dataset containing more than 19,000 records and 160 variables. My approach was to significantly reduce the number of predictors while maintaining predictive capacity, to train several machine learning algorithms on the reduced data set (random forests, gradient boosted machine, linear discriminant analysis, and support vector machines), and to evaluate which one has the best performance.

In summary, the set of potential predictors was reduced to 8, and a random forest classifier achieved 100% in-sample accuracy and about 95% out-of-sample accuracy.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this report, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

For this purpose, I mainly used the caret package, along with the rio package, which is described as a Swiss-army knife for data I/O and can read data from URLs and Excel files with its import command.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#library("devtools")
#install_github("leeper/rio")
library("rio")
library("adabag")
## Loading required package: rpart
## Loading required package: mlbench

1. Data Import

train <- rio::import("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("NA","#DIV/0!"))
test  <- rio::import("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",  na.strings=c("NA","#DIV/0!"))
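
The na.strings argument matters because some columns in the raw files contain the spreadsheet error token "#DIV/0!". The following is a minimal base-R sketch of the effect, using read.csv on a small inline CSV instead of rio::import; the three rows and their values are made up for illustration.

```r
# Hypothetical three-row excerpt; values are invented for illustration.
csv.text <- "roll_belt,kurtosis_roll_belt,classe
1.41,NA,A
1.42,#DIV/0!,A
1.48,-1.2,B"

# Without the extra NA token the column is read as text.
raw <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# With both tokens declared as missing, the column stays numeric.
clean <- read.csv(text = csv.text, na.strings = c("NA", "#DIV/0!"),
                  stringsAsFactors = FALSE)

class(raw$kurtosis_roll_belt)         # "character"
class(clean$kurtosis_roll_belt)       # "numeric"
sum(is.na(clean$kurtosis_roll_belt))  # 2
```

Reading such columns as numeric is what allows the missing-value filter in the next step to count them correctly.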

2. Variable Selection

After importing the data, the outcome is converted to a factor variable. I remove the first 7 columns, as they are not relevant for prediction. Variables with more than 90% missing values are removed, as well as variables with near-zero variance.

# convert outcome to factor
train$classe <- factor(train$classe)

# remove first 7 columns
.train <- train[ , -c(1:7)]

# remove variables with > 90% missing values
.train <- .train[ , colSums(is.na(.train)) < 0.9 * nrow(.train)]

# Removing near zero variance columns
.n0 <- nearZeroVar(.train, saveMetrics = TRUE)
.train <- .train[ , .n0$nzv == FALSE]

This step in variable selection reduces the initial dataset from 160 variables to 53 (52 predictors plus the outcome classe).

# final number of variables
dim(.train)
## [1] 19622    53
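
As an illustration of the missing-value filter used above, here is a small sketch on a toy data frame. The column names and values are hypothetical; only the colSums(is.na(...)) idiom is taken from the report.

```r
# Toy data frame with 20 rows (hypothetical values).
toy <- data.frame(
  a = 1:20,                      # no missing values
  b = c(rep(NA_real_, 19), 1),   # 95% missing
  c = c(NA, 2:20)                # 5% missing
)

# Fraction of missing values per column.
na.frac <- colSums(is.na(toy)) / nrow(toy)

# Keep columns with fewer than 90% missing values, as in the report.
kept <- toy[, na.frac < 0.9, drop = FALSE]
names(kept)  # "a" "c"
```

The drop = FALSE argument is a defensive choice: it keeps the result a data frame even if only one column survives the filter.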

3. Variable Reduction

The second round of variable selection reduces multicollinearity: for each pair of predictors with an absolute correlation above 0.75, one of the two is removed.

# Identifying Correlated Predictors
train.corr <- cor(.train[,-ncol(.train)])
summary(train.corr[upper.tri(train.corr)])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.992000 -0.110100  0.002092  0.001790  0.092550  0.980900
high.corr <- findCorrelation(train.corr, cutoff = .75)
.train <- .train[,-high.corr]
new.corr <- cor(.train[,-ncol(.train)])
summary(new.corr[upper.tri(new.corr)])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.607000 -0.089520  0.004661  0.008109  0.084190  0.736500
dim(.train)
## [1] 19622    33
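
To make the correlation filter concrete, the sketch below runs findCorrelation on three simulated predictors, two of which are nearly identical. The data, the seed, and the variable names are assumptions made for illustration. It also guards against one pitfall of negative indexing: if no column exceeds the cutoff, indexing with an empty negative vector would drop every column.

```r
library(caret)

set.seed(1)  # arbitrary seed so the simulated data are reproducible
x1 <- rnorm(100)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.1),  # almost a copy of x1
  x3 = rnorm(100)                  # unrelated noise
)

toy.corr <- cor(toy)
drop.idx <- findCorrelation(toy.corr, cutoff = 0.75)

# Only subset when something was flagged; toy[, -integer(0)] has no columns.
reduced <- if (length(drop.idx) > 0) toy[, -drop.idx, drop = FALSE] else toy
names(reduced)
```

Here one of x1 and x2 is flagged and removed, while x3 is kept.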

We have effectively reduced our data set from 52 predictors to 32 predictors. However, this would still take too long for the different machine learning algorithms to model, so we perform an additional round of feature selection, this time using an AdaBoost model, as it is fitted quickly on the entire dataset.

train.boost <- boosting(classe ~ ., data = .train, mfinal = 10)
importanceplot(train.boost)

It seems that the first 8 predictors have a high influence on the data set, and the ones thereafter have a more marginal effect. Based on a visual inspection of the importance plot, the cut-off for variable inclusion is set to an importance threshold of 5.

Y <- .train$classe # temporary storage
var.keep <- which(train.boost$importance > 5)
.train <- .train[ ,(names(.train) %in% names(var.keep))]
.train$classe <- Y
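
The selection above relies on which() preserving the names of a named vector. A minimal sketch with made-up importance scores (the numbers are hypothetical and are not the ones produced by the boosting model) shows the mechanics:

```r
# Hypothetical importance scores, named by predictor.
importance <- c(roll_belt = 21.4, yaw_belt = 12.0, pitch_forearm = 9.3,
                gyros_arm_x = 3.1, accel_arm_y = 0.8)

# which() returns the positions of the TRUE entries and keeps their names.
var.keep <- which(importance > 5)
names(var.keep)  # "roll_belt" "yaw_belt" "pitch_forearm"

# The names can then be matched against the columns of a data frame.
toy <- as.data.frame(as.list(importance))
toy.kept <- toy[, names(toy) %in% names(var.keep), drop = FALSE]
```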

4. Preprocessing

All the remaining variables are standardized.

pre_Obj = preProcess(.train[,-ncol(.train)], method = c('center', 'scale'))
.train = predict(pre_Obj, .train[,-ncol(.train)])
.train$classe <- Y

5. Test Set: Variable Selection and Preprocessing

The test dataset is reduced to the variables from the training dataset, and preprocessed according to the training preprocessing object.

.test <- test[ ,(names(test) %in% names(.train))]
.test = predict(pre_Obj, .test)
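
The key point of this step is that the test set is standardized with the means and standard deviations of the training set, not with its own. The following base-R sketch shows the idea on toy numbers; it is an approximation of what preProcess with 'center' and 'scale' does, and the data are invented.

```r
# Toy training and test predictors (hypothetical values).
train.x <- data.frame(x1 = c(1, 2, 3, 4, 5), x2 = c(10, 20, 30, 40, 50))
test.x  <- data.frame(x1 = c(3, 6),          x2 = c(30, 0))

# Parameters are estimated on the training data only.
mu    <- colMeans(train.x)
sigma <- apply(train.x, 2, sd)

# The same parameters are applied to both sets.
train.z <- scale(train.x, center = mu, scale = sigma)
test.z  <- scale(test.x,  center = mu, scale = sigma)

test.z[1, ]  # the first test row equals the training means, so it maps to 0
```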

6. Splitting the Training Set

The training dataset is split into a training set (75%) and a hold-out cross-validation set (25%) in order to assess the in-sample and out-of-sample performance of 4 different modeling approaches. Each approach is then given its own quarter of both sets.

# create a training and cross-validation set
inTrain = createDataPartition(.train$classe, p = 0.75, list = FALSE)
.t  <- .train[inTrain, ]
.cv <- .train[-inTrain, ]

# 4 separate training sets for different methods
tFolds <- createFolds(.t$classe, k = 4, list = TRUE, returnTrain = FALSE)
t.rf = .t[tFolds$Fold1, ]
t.gbm = .t[tFolds$Fold2, ]
t.lda = .t[tFolds$Fold3, ]
t.svm = .t[tFolds$Fold4, ]

# 4 separate cross-validation sets for different methods
cvFolds <- createFolds(.cv$classe, k = 4, list = TRUE, returnTrain = FALSE)
cv.rf = .cv[cvFolds$Fold1, ]
cv.gbm = .cv[cvFolds$Fold2, ]
cv.lda = .cv[cvFolds$Fold3, ]
cv.svm = .cv[cvFolds$Fold4, ]
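
Both createDataPartition and createFolds sample at random, so the exact split changes from run to run unless a seed is fixed. The sketch below illustrates this on the built-in iris data; the data set and the seed value 2015 are arbitrary choices for illustration and are not part of the original analysis.

```r
library(caret)
data(iris)

# Fixing the seed before each call makes the partition repeatable.
set.seed(2015)
idx1 <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
set.seed(2015)
idx2 <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
identical(idx1, idx2)  # TRUE

# Split the training rows into 4 disjoint folds, one per method.
train.rows <- as.vector(idx1)
folds <- createFolds(iris$Species[train.rows], k = 4,
                     list = TRUE, returnTrain = FALSE)
sapply(folds, length)  # each fold holds roughly a quarter of the rows
```

Because the folds are disjoint, each method sees only about a quarter of the training rows, which keeps fitting fast at the cost of less data per model.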

7. Model fitting

Methods include: random forests (rf), gradient boosted machine (gbm), linear discriminant analysis (lda, fitted here with caret's sparseLDA method), and support vector machines (svm, with a radial kernel). Each model is automatically tuned and evaluated using 5 repeats of 5-fold cross-validation.

# defining the cross-validation parameters
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
# training the models
rf.fit  = train(classe ~ ., method = "rf",  trControl = fitControl, data = t.rf)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
gbm.fit = train(classe ~ ., method = 'gbm', trControl = fitControl, data = t.gbm, verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
lda.fit = train(classe ~ ., method = 'sparseLDA', trControl = fitControl, data = t.lda)
## Loading required package: sparseLDA
## Loading required package: lars
## Loaded lars 1.2
## 
## Loading required package: elasticnet
## Loading required package: MASS
## Loading required package: mda
## Loading required package: class
## Loaded mda 0.4-7
svm.fit = train(classe ~ ., method = 'svmRadial', trControl = fitControl, data = t.svm)
## Loading required package: kernlab
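
Besides the hold-out comparison in the next section, each fitted object already carries the resampling estimates produced by trainControl. The sketch below shows where to find them, using a plain lda fit on the iris data as a stand-in; the data set, the model choice, and the seed are assumptions made to keep the example small.

```r
library(caret)
data(iris)

set.seed(2015)  # arbitrary seed for repeatable resampling
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
fit  <- train(Species ~ ., data = iris, method = "lda", trControl = ctrl)

# Accuracy and Kappa averaged over all resamples, per tuning setting.
fit$results

# One row per resample of the final model: 5 folds x 5 repeats = 25.
nrow(fit$resample)
mean(fit$resample$Accuracy)
```

The same fields are available on rf.fit, gbm.fit, lda.fit, and svm.fit, and give a resampled accuracy estimate that does not depend on the hold-out set.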

8. Assessing the train and cross-validation fit

8.1 random forests

# in-sample accuracy
.rf.train <- predict(rf.fit, t.rf)
confusionMatrix(.rf.train, t.rf$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1047    0    0    0    0
##          B    0  712    0    0    0
##          C    0    0  642    0    0
##          D    0    0    0  603    0
##          E    0    0    0    0  677
## 
## Overall Statistics
##                                     
##                Accuracy : 1         
##                  95% CI : (0.999, 1)
##     No Information Rate : 0.2844    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2844   0.1934   0.1744   0.1638   0.1839
## Detection Rate         0.2844   0.1934   0.1744   0.1638   0.1839
## Detection Prevalence   0.2844   0.1934   0.1744   0.1638   0.1839
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
# out-of-sample accuracy
.rf.cv <- predict(rf.fit, cv.rf)
confusionMatrix(.rf.cv, cv.rf$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 341  13   1   1   0
##          B   5 210   6   2   1
##          C   3  11 200   6   2
##          D   0   2   6 191   2
##          E   0   1   1   1 220
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9478          
##                  95% CI : (0.9338, 0.9596)
##     No Information Rate : 0.2847          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9339          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9771   0.8861   0.9346   0.9502   0.9778
## Specificity            0.9829   0.9858   0.9783   0.9902   0.9970
## Pos Pred Value         0.9579   0.9375   0.9009   0.9502   0.9865
## Neg Pred Value         0.9908   0.9731   0.9861   0.9902   0.9950
## Prevalence             0.2847   0.1933   0.1746   0.1639   0.1835
## Detection Rate         0.2781   0.1713   0.1631   0.1558   0.1794
## Detection Prevalence   0.2904   0.1827   0.1811   0.1639   0.1819
## Balanced Accuracy      0.9800   0.9360   0.9564   0.9702   0.9874
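
The summary statistics follow directly from the confusion matrix. As a check, the sketch below re-enters the out-of-sample random forest matrix printed above and recomputes the overall accuracy and the per-class sensitivity and positive predictive value in base R.

```r
# Out-of-sample confusion matrix for the random forest (rows = predictions).
cm <- matrix(c(341,  13,   1,   1,   0,
                 5, 210,   6,   2,   1,
                 3,  11, 200,   6,   2,
                 0,   2,   6, 191,   2,
                 0,   1,   1,   1, 220),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

# Accuracy: correctly classified cases over all cases (1162 / 1226).
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions per reference class (column sums).
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions per predicted class (row sums).
ppv <- diag(cm) / rowSums(cm)

round(accuracy, 4)          # 0.9478
round(sensitivity[["B"]], 4)  # 0.8861
round(ppv[["A"]], 4)          # 0.9579
```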

8.2 gradient boosted machine

# in-sample accuracy
.gbm.train <- predict(gbm.fit, t.gbm)
confusionMatrix(.gbm.train, t.gbm$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1026   18    3    2    0
##          B    9  653   30   13    8
##          C    9   28  582   29   12
##          D    0   10   24  553    6
##          E    2    3    3    6  651
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9416          
##                  95% CI : (0.9335, 0.9489)
##     No Information Rate : 0.2842          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9261          
##  Mcnemar's Test P-Value : 0.04558         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9809   0.9171   0.9065   0.9171   0.9616
## Specificity            0.9913   0.9798   0.9743   0.9870   0.9953
## Pos Pred Value         0.9781   0.9158   0.8818   0.9325   0.9789
## Neg Pred Value         0.9924   0.9801   0.9801   0.9838   0.9914
## Prevalence             0.2842   0.1935   0.1745   0.1639   0.1840
## Detection Rate         0.2788   0.1774   0.1582   0.1503   0.1769
## Detection Prevalence   0.2851   0.1938   0.1793   0.1611   0.1807
## Balanced Accuracy      0.9861   0.9485   0.9404   0.9520   0.9785
# out-of-sample accuracy
.gbm.cv <- predict(gbm.fit, cv.gbm)
confusionMatrix(.gbm.cv, cv.gbm$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 328  11   2   1   1
##          B   7 207  13   7   4
##          C   6   7 187  14   1
##          D   5   9  12 169   3
##          E   2   3   0  10 217
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9038          
##                  95% CI : (0.8859, 0.9197)
##     No Information Rate : 0.2838          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8783          
##  Mcnemar's Test P-Value : 0.2234          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9425   0.8734   0.8738   0.8408   0.9602
## Specificity            0.9829   0.9687   0.9723   0.9717   0.9850
## Pos Pred Value         0.9563   0.8697   0.8698   0.8535   0.9353
## Neg Pred Value         0.9773   0.9696   0.9733   0.9689   0.9909
## Prevalence             0.2838   0.1933   0.1746   0.1639   0.1843
## Detection Rate         0.2675   0.1688   0.1525   0.1378   0.1770
## Detection Prevalence   0.2798   0.1941   0.1754   0.1615   0.1892
## Balanced Accuracy      0.9627   0.9210   0.9231   0.9063   0.9726

8.3 linear discriminant analysis

# in-sample accuracy
.lda.train <- predict(lda.fit, t.lda)
confusionMatrix(.lda.train, t.lda$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 713 268 130  65  79
##          B  23  48  35  53  55
##          C 124 139 371  94  80
##          D 107 127  46 291 119
##          E  79 130  59 100 343
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4802          
##                  95% CI : (0.4639, 0.4964)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3396          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6816  0.06742   0.5788  0.48259  0.50740
## Specificity            0.7941  0.94403   0.8561  0.87024  0.87742
## Pos Pred Value         0.5681  0.22430   0.4592  0.42174  0.48242
## Neg Pred Value         0.8626  0.80831   0.9059  0.89558  0.88777
## Prevalence             0.2844  0.19358   0.1743  0.16395  0.18380
## Detection Rate         0.1939  0.01305   0.1009  0.07912  0.09326
## Detection Prevalence   0.3412  0.05818   0.2197  0.18760  0.19331
## Balanced Accuracy      0.7379  0.50572   0.7174  0.67642  0.69241
# out-of-sample accuracy
.lda.cv <- predict(lda.fit, cv.lda)
confusionMatrix(.lda.cv, cv.lda$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 237  79  45  34  29
##          B  10  23  12  20  24
##          C  43  53 117  38  37
##          D  26  36  18  77  34
##          E  33  47  22  32 101
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4523          
##                  95% CI : (0.4242, 0.4807)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3033          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6791  0.09664  0.54673  0.38308  0.44889
## Specificity            0.7870  0.93327  0.83119  0.88889  0.86627
## Pos Pred Value         0.5590  0.25843  0.40625  0.40314  0.42979
## Neg Pred Value         0.8605  0.81107  0.89670  0.88031  0.87500
## Prevalence             0.2844  0.19397  0.17441  0.16381  0.18337
## Detection Rate         0.1932  0.01874  0.09535  0.06275  0.08231
## Detection Prevalence   0.3456  0.07253  0.23472  0.15566  0.19152
## Balanced Accuracy      0.7330  0.51495  0.68896  0.63599  0.65758

8.4 support vector machines

# in-sample accuracy
.svm.train <- predict(svm.fit, t.svm)
confusionMatrix(.svm.train, t.svm$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 968  79  52  60  10
##          B  19 428  37  31  17
##          C  41 101 503  55  68
##          D  10  93  39 447  24
##          E   8  11  11  10 557
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7891          
##                  95% CI : (0.7755, 0.8022)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7324          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9254   0.6011   0.7835   0.7413   0.8240
## Specificity            0.9237   0.9649   0.9127   0.9460   0.9867
## Pos Pred Value         0.8281   0.8045   0.6549   0.7292   0.9330
## Neg Pred Value         0.9689   0.9098   0.9523   0.9491   0.9614
## Prevalence             0.2843   0.1935   0.1745   0.1639   0.1837
## Detection Rate         0.2631   0.1163   0.1367   0.1215   0.1514
## Detection Prevalence   0.3177   0.1446   0.2088   0.1666   0.1623
## Balanced Accuracy      0.9245   0.7830   0.8481   0.8437   0.9053
# out-of-sample accuracy
.svm.cv <- predict(svm.fit, cv.svm)
confusionMatrix(.svm.cv, cv.svm$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 327  27  19  19   7
##          B   8 145   8  15   1
##          C   9  34 169  21  21
##          D   1  26  13 142   8
##          E   4   5   4   4 188
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7927         
##                  95% CI : (0.7689, 0.815)
##     No Information Rate : 0.2849         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7365         
##  Mcnemar's Test P-Value : 1.409e-10      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9370   0.6118   0.7934   0.7065   0.8356
## Specificity            0.9178   0.9676   0.9160   0.9531   0.9830
## Pos Pred Value         0.8195   0.8192   0.6654   0.7474   0.9171
## Neg Pred Value         0.9734   0.9122   0.9547   0.9430   0.9637
## Prevalence             0.2849   0.1935   0.1739   0.1641   0.1837
## Detection Rate         0.2669   0.1184   0.1380   0.1159   0.1535
## Detection Prevalence   0.3257   0.1445   0.2073   0.1551   0.1673
## Balanced Accuracy      0.9274   0.7897   0.8547   0.8298   0.9093

From the observed results, I would go with the random forest model. It showed a perfect in-sample accuracy (namely 1) and an out-of-sample accuracy of 0.9478. The gradient boosted machine also performed very well, although not as accurately as the random forest, with an in-sample accuracy of 0.9416 and an out-of-sample accuracy of 0.9038. Third, the support vector machine showed an in-sample accuracy of 0.7891 and an out-of-sample accuracy of 0.7927. Even though the support vector machine did not perform as well as the random forest and the gradient boosted machine, it seems to be the least prone to overfitting on this data set, since its in-sample and out-of-sample error rates are almost equal. Finally, the linear discriminant analysis performed very poorly (in-sample accuracy = 0.4802, out-of-sample accuracy = 0.4523) and should not be considered for multi-class prediction in this scenario.
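
The comparison can be summarized in a small table. The sketch below collects the accuracies from the confusion-matrix output in section 8 and computes the gap between in-sample and out-of-sample accuracy as a rough indicator of overfitting; the data frame and its column names are introduced here for illustration only.

```r
# Accuracies copied from the confusionMatrix output in section 8.
acc <- data.frame(
  model      = c("rf", "gbm", "lda", "svm"),
  in.sample  = c(1.0000, 0.9416, 0.4802, 0.7891),
  out.sample = c(0.9478, 0.9038, 0.4523, 0.7927),
  stringsAsFactors = FALSE
)

# A large positive gap suggests the model fits the training rows too closely.
acc$gap <- acc$in.sample - acc$out.sample

# Sort by out-of-sample accuracy, best model first.
acc[order(-acc$out.sample), ]

best      <- acc$model[which.max(acc$out.sample)]  # "rf"
least.gap <- acc$model[which.min(abs(acc$gap))]    # "svm"
```

The random forest has the highest out-of-sample accuracy, while the support vector machine has the smallest gap between the two estimates.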

Ergo, the prediction on the test-set will be based on the random forest fit.

9. Predicting on the Test Set

rf.test = predict(rf.fit, .test)

10. Predicted Values

rf.test
##  [1] D A B C A E D B A A B C B A E E A B A B
## Levels: A B C D E