Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har.
The goal of our project is to predict the manner in which they did the exercise. In this study, we look for the classification method that provides the most accurate prediction of this manner (the classe variable). We will also use our prediction model to predict 20 different test cases.
In the following section we’ll load the necessary libraries, download and load the datasets, and do some exploration and cleaning. Please adjust the folder paths accordingly. We also set a seed for reproducibility.
library(caret);library(tidyverse);library(gbm)
## Loading required package: lattice
## Loading required package: ggplot2
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## Loaded gbm 2.1.8
set.seed(1234)
setwd("C:/R/Coursera - R Programming/PracticalMachineLearning/PredictionAssignment")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
"data/pml-training.csv","curl")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
"data/pml-testing.csv","curl")
# there are NAs and empty cells as well
FullTrain<-read.csv("data/pml-training.csv", header=T, na.strings=c("","NA"))
FullTest<-read.csv("data/pml-testing.csv", header=T, na.strings=c("","NA"))
# exclude variables where more than 95% of the values are missing
FullTrain <- FullTrain %>% select_if(colMeans(is.na(FullTrain)) < 0.95)
FullTest <- FullTest %>% select_if(colMeans(is.na(FullTest)) < 0.95)
# also exclude ID and timestamp variables, and convert the dependent variable to a factor (since it is categorical)
FullTrain <- FullTrain[,-c(1:7)]
FullTest <- FullTest[,-c(1:7)]
FullTrain$classe<-as.factor(FullTrain$classe)
We split the training data set into two sets: 70% for training (train.data) and 30% for testing (test.data). We also set the training control to k-fold cross-validation with 5 folds (since the sample size is large enough). The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. (Read more: http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/)
# Split the data into training and test set
training.samples <- FullTrain$classe %>%
createDataPartition(p = 0.7, list = FALSE)
train.data <- FullTrain[training.samples, ]
test.data <- FullTrain[-training.samples, ]
# Define training control
fitControl <- trainControl(method = "cv", number = 5)
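Once a model has been trained with this control object, the per-fold performance behind caret’s averaged estimate can be inspected. A brief sketch using standard caret accessors (shown for fitRF, which is fitted further below):
# Accuracy and Kappa for each of the 5 folds (for the selected tuning parameter)
fitRF$resample
# cross-validated performance averaged over the folds, per tuning parameter
fitRF$results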
We compare three models suited to this classification task and evaluate them based on their accuracy. These models are the Generalized Boosted Model (GBM), Random Forest (RF) and Decision Tree (DT). You can see the results below (especially the confusion matrices) and a comparison of their accuracy. We’ll use the best (most accurate) one for the quiz prediction task. Quick recap: accuracy refers to the ratio of correctly classified cases, while the out-of-sample error is estimated as one minus the accuracy obtained when we predict classes on the test subset of the training data. This means that the expected out-of-sample error is the expected ratio of misclassified cases on the original test set.
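To make this concrete, here is a minimal sketch of how the out-of-sample error estimate is derived from a caret confusion matrix (fit is a placeholder for any of the models fitted below):
# estimated out-of-sample error = 1 - accuracy on the held-out test split
cm <- confusionMatrix(predict(fit, newdata = test.data), test.data$classe)
oos_error <- 1 - as.numeric(cm$overall["Accuracy"])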
Let’s start with the simplest model and move towards the more complex ones.
Let’s imagine you are playing a game of Twenty Questions. Your opponent has secretly chosen a subject, and you must figure out what he/she chose. At each turn, you may ask a yes-or-no question, and your opponent must answer truthfully. How do you find out the secret in the fewest questions? It should be obvious that some questions are better than others. For example, asking “Can it fly?” as your first question is likely to be unfruitful, whereas asking “Is it alive?” is a bit more useful. Intuitively, you want each question to significantly narrow down the space of possible secrets, eventually leading to your answer. That is the basic idea behind decision trees. At each point, you consider a set of questions that can partition your data set. You choose the question that provides the best split and again find the best questions for the partitions. You stop once all the points you are considering are of the same class. Then the task of classification is easy. You can simply grab a point, and chuck it down the tree. The questions will guide it to its appropriate class. (see: https://www.datacamp.com/community/tutorials/decision-trees-R)
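To illustrate what “the best split” means, here is a small sketch (not part of the analysis) of Gini impurity, the criterion rpart uses by default to score candidate splits; lower values mean purer nodes:
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(classes) {
  p <- table(classes) / length(classes)
  1 - sum(p^2)
}
gini(c("A", "A", "A", "B")) # fairly pure node: 0.375
gini(c("A", "B", "C", "D")) # evenly mixed node: 0.75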
# Train the model
fitDT <- train(classe ~ ., data=train.data,
method="rpart",
trControl=fitControl)
# Summarize the results
fitDT$finalModel
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12585 8689 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.65 1126 9 A (0.99 0.008 0 0 0) *
## 5) pitch_forearm>=-33.65 11459 8680 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 9719 6993 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 121.5 5952 3518 A (0.41 0.18 0.18 0.16 0.061) *
## 21) roll_forearm>=121.5 3767 2534 C (0.078 0.18 0.33 0.23 0.18) *
## 11) magnet_dumbbell_y>=439.5 1740 855 B (0.03 0.51 0.036 0.23 0.19) *
## 3) roll_belt>=130.5 1152 10 E (0.0087 0 0 0 0.99) *
predictDT <- predict(fitDT, newdata=test.data)
confusionMatrix(predictDT, test.data$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1519 473 484 451 156
## B 28 401 45 167 148
## C 123 265 497 346 289
## D 0 0 0 0 0
## E 4 0 0 0 489
##
## Overall Statistics
##
## Accuracy : 0.4938
## 95% CI : (0.4809, 0.5067)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.338
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9074 0.35206 0.48441 0.0000 0.45194
## Specificity 0.6286 0.91825 0.78946 1.0000 0.99917
## Pos Pred Value 0.4927 0.50824 0.32697 NaN 0.99189
## Neg Pred Value 0.9447 0.85518 0.87881 0.8362 0.89002
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2581 0.06814 0.08445 0.0000 0.08309
## Detection Prevalence 0.5239 0.13407 0.25828 0.0000 0.08377
## Balanced Accuracy 0.7680 0.63516 0.63693 0.5000 0.72555
Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of the bagging method is that a combination of learning models increases the overall result. Put simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. (see: https://builtin.com/data-science/random-forest-algorithm)
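As a toy illustration of the bagging idea, here is a simplified sketch (not what caret’s “rf” method actually runs; a real random forest additionally samples a random subset of predictors at each split):
library(rpart)
# bagging sketch: train trees on bootstrap resamples, predict by majority vote
bag_predict <- function(data, newdata, n_trees = 25) {
  votes <- replicate(n_trees, {
    boot <- data[sample(nrow(data), replace = TRUE), ] # bootstrap resample
    as.character(predict(rpart(classe ~ ., data = boot), newdata, type = "class"))
  })
  # majority vote across the trees for each observation
  factor(apply(votes, 1, function(v) names(which.max(table(v)))))
}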
# Train the model
fitRF <- train(classe ~ ., data=train.data,
method="rf",
trControl=fitControl)
# Summarize the results
fitRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.68%
## Confusion matrix:
## A B C D E class.error
## A 3903 2 0 0 1 0.0007680492
## B 14 2638 6 0 0 0.0075244545
## C 0 19 2374 3 0 0.0091819699
## D 0 0 40 2210 2 0.0186500888
## E 0 0 2 4 2519 0.0023762376
predictRF <- predict(fitRF, newdata=test.data)
confusionMatrix(predictRF, test.data$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 3 0 0 0
## B 0 1132 14 0 0
## C 0 4 1012 13 0
## D 0 0 0 950 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9939
## 95% CI : (0.9915, 0.9957)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9923
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9939 0.9864 0.9855 0.9991
## Specificity 0.9993 0.9971 0.9965 0.9998 0.9998
## Pos Pred Value 0.9982 0.9878 0.9835 0.9989 0.9991
## Neg Pred Value 1.0000 0.9985 0.9971 0.9972 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1924 0.1720 0.1614 0.1837
## Detection Prevalence 0.2850 0.1947 0.1749 0.1616 0.1839
## Balanced Accuracy 0.9996 0.9955 0.9914 0.9926 0.9994
These models are a combination of two techniques: decision tree algorithms and boosting methods. Generalized Boosting Models repeatedly fit many decision trees to improve the accuracy of the model. For each new tree in the model, a random subset of all the data is selected using the boosting method. For each new tree in the model the input data are weighted in such a way that data that was poorly modelled by previous trees has a higher probability of being selected in the new tree. This means that after the first tree is fitted the model will take into account the error in the prediction of that tree to fit the next tree, and so on. By taking into account the fit of previous trees that are built, the model continuously tries to improve its accuracy. This sequential approach is unique to boosting. (see: https://support.bccvl.org.au/support/solutions/articles/6000083212-generalized-boosting-model)
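caret tunes the GBM hyperparameters (number of trees, tree depth, learning rate) over a default grid. If we wanted to control this search explicitly, a sketch could look like the following (the values are illustrative assumptions that mirror caret’s defaults, and would be passed to train() via its tuneGrid argument):
# explicit tuning grid for gbm (illustrative values)
gbmGrid <- expand.grid(n.trees = c(50, 100, 150), # boosting iterations
                       interaction.depth = 1:3,   # tree depth
                       shrinkage = 0.1,           # learning rate
                       n.minobsinnode = 10)       # minimum node size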
# Train the model
fitGBM <- train(classe ~ ., data = train.data,
method = "gbm",
trControl = fitControl,
verbose = FALSE)
# Summarize the results
fitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
predictGBM <- predict(fitGBM, newdata=test.data)
confusionMatrix(predictGBM, test.data$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1657 24 0 3 2
## B 9 1093 33 3 11
## C 7 21 983 23 7
## D 1 1 9 931 9
## E 0 0 1 4 1053
##
## Overall Statistics
##
## Accuracy : 0.9715
## 95% CI : (0.9669, 0.9756)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9639
##
## Mcnemar's Test P-Value : 3.247e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9898 0.9596 0.9581 0.9658 0.9732
## Specificity 0.9931 0.9882 0.9881 0.9959 0.9990
## Pos Pred Value 0.9828 0.9513 0.9443 0.9790 0.9953
## Neg Pred Value 0.9960 0.9903 0.9911 0.9933 0.9940
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2816 0.1857 0.1670 0.1582 0.1789
## Detection Prevalence 0.2865 0.1952 0.1769 0.1616 0.1798
## Balanced Accuracy 0.9915 0.9739 0.9731 0.9809 0.9861
# check importance of predictors (for potential runtime optimization)
#importance <- varImp(fitGBM, scale=FALSE)
#plot(importance)
#plot(fitGBM)
The accuracies of the three classifiers on the held-out test set are the following:
Decision Tree: 0.4938, Random Forest: 0.9939, Generalized Boosted Model: 0.9715
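These values can also be collected programmatically from the predictions above; a brief sketch:
# held-out accuracy for each model, side by side
sapply(list(DT = predictDT, RF = predictRF, GBM = predictGBM),
       function(p) confusionMatrix(p, test.data$classe)$overall["Accuracy"])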
Although we personally expected the GBM to be the most accurate model, it turned out that the Random Forest model is more accurate. We use this model to predict the 20 data points of the quiz:
predict(fitRF, newdata=FullTest)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E