Practical Machine Learning Course Project

Introduction

The goal of this project is to predict the manner in which participants performed a barbell lift exercise, that is, to distinguish proper from improper ways of doing the exercise. The data were gathered with devices such as Fitbit or Fuelband.

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

Loading required packages

library(caret)
library(data.table)
library(rattle)

Reading and preparing data

The training data is fairly large, so we will read it with the fread function from the data.table package. Next, we will clean the data and then split the training set into a training set and a validation set, which is held out for evaluating the models.
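One part of this cleaning, dropping columns that are mostly missing, can be sketched on a toy data frame. This is a minimal illustration in base R, not the project code: the data frames `train_toy` and `test_toy` and their columns are hypothetical. It selects columns by name rather than by position, so the same choice can be applied to a second table whose last column differs.

```r
# Toy stand-ins for the training and test tables (hypothetical data).
train_toy <- data.frame(
  roll   = c(1.2, 0.8, 1.1, 0.9, 1.0),
  skew   = c(NA, NA, NA, 0.3, NA),   # 80% missing
  pitch  = c(5, 6, 5, 7, 8),
  classe = c("A", "B", "A", "C", "B")
)
test_toy <- data.frame(
  roll       = c(1.0, 0.7),
  skew       = c(NA, NA),
  pitch      = c(6, 5),
  problem_id = 1:2
)

# Share of missing values per column, computed on the training table only.
na_share <- colMeans(is.na(train_toy))

# Keep the columns with less than 20% missing values.
keep <- names(na_share)[na_share < 0.2]

train_clean <- train_toy[, keep, drop = FALSE]
# Apply the same selection to the test table, for the columns it has.
test_clean <- test_toy[, intersect(keep, names(test_toy)), drop = FALSE]

names(train_clean)
names(test_clean)
```

Selecting by name also sidesteps a base R pitfall: indexing with a negated empty vector, as in `x[, -integer(0)]`, returns zero columns rather than all of them.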

data <- fread("pml-training.csv")
data <- as.data.frame(data)

testing <- fread("pml-testing.csv")
testing <- as.data.frame(testing)

#remove variables with near-zero variance (identified on the training data)
nzv <- nearZeroVar(data)
data.clean <- data[,-nzv]
testing <- testing[,-nzv]

#next, remove variables in which 20% or more of the values are NA
NA_amount <- sapply(data.clean, function(x) {sum(is.na(x))/length(x)})
data.clean <- data.clean[,NA_amount < 0.2]
testing <- testing[,NA_amount < 0.2]

#finally, remove variables that are unsuitable as predictors,
#such as timestamps and user name
data.clean <- data.clean[,-(1:6)]
testing <- testing[,-(1:6)]

#next, split the cleaned training data into a training set (70%) and a validation set (30%)
set.seed(2137)
inTrain <- createDataPartition(y = data.clean$classe,
                               p = 0.7, list = FALSE)

training <- data.clean[inTrain,]
validation <- data.clean[-inTrain,]
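For a categorical outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both subsets. The idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not caret's exact algorithm, and the label vector `classe_toy` is hypothetical.

```r
set.seed(2137)

# Hypothetical label vector standing in for data.clean$classe.
classe_toy <- rep(c("A", "B", "C", "D", "E"), times = c(50, 40, 30, 20, 10))

# Group the row indices by class.
idx_by_class <- split(seq_along(classe_toy), classe_toy)

# Sample 70% of the indices within each class, then pool them.
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_labels <- classe_toy[in_train]
valid_labels <- classe_toy[-in_train]

# Class counts in each subset keep the original 5:4:3:2:1 proportions.
table(train_labels)
table(valid_labels)
```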

Training Machine Learning Algorithms

We will train several algorithms on the training data, then compare them on the held-out validation set and choose the best model. We will use:

decision tree (rpart)
gradient boosting (gbm)
random forest (rf)

#decision tree
fit_tree <- train(classe ~ ., data = training, method = "rpart")
fancyRpartPlot(fit_tree$finalModel)

#gradient boosting
fit_gbm <- train(classe ~ ., data = training, method = "gbm")

fit_gbm

## Stochastic Gradient Boosting 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7495273  0.6823972
##   1                  100      0.8191430  0.7710538
##   1                  150      0.8508741  0.8112632
##   2                   50      0.8542863  0.8153573
##   2                  100      0.9024533  0.8765248
##   2                  150      0.9267391  0.9072774
##   3                   50      0.8929240  0.8644181
##   3                  100      0.9359425  0.9189245
##   3                  150      0.9547163  0.9426980
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

#random forest

#use parallel processing to shorten the fitting time of the random forest
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

fitControl <- trainControl(method = "cv",
                           number = 5,
                           allowParallel = TRUE)

fit_rf <- train(classe ~ ., data = training, method = "rf", trControl = fitControl)

stopCluster(cluster)
registerDoSEQ()

fit_rf

## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10987, 10990, 10990, 10991 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9903181  0.9877516
##   27    0.9895905  0.9868307
##   52    0.9801267  0.9748582
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
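The "Summary of sample sizes" line in the output above follows from how 5-fold cross-validation partitions the 13737 training rows: each model is fit on four folds and assessed on the fifth. A minimal base R sketch of that partitioning is shown here. It uses plain random fold assignment, which is an assumption of this sketch; caret stratifies its folds by the outcome, so its sizes differ by a few rows.

```r
set.seed(2137)

n <- 13737   # number of training rows reported in the output above
k <- 5       # number of folds, as in trainControl(number = 5)

# Assign each row to one of k folds of nearly equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out in each fold.
holdout_sizes <- tabulate(fold_id, nbins = k)

# Rows used for fitting when each fold in turn is held out.
fit_sizes <- n - holdout_sizes
fit_sizes
```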

Validation

Now we will use the held-out validation subset to estimate the out-of-sample accuracy of each model and choose the best one for prediction.

#validation-set accuracy of the decision tree
pred_tree <- predict(fit_tree, validation)
confusionMatrix(pred_tree, as.factor(validation$classe))$overall[1]

##  Accuracy 
## 0.4909091

#validation-set accuracy of gradient boosting
pred_gbm <- predict(fit_gbm, validation)
confusionMatrix(pred_gbm, as.factor(validation$classe))$overall[1]

##  Accuracy 
## 0.9649958

#validation-set accuracy of the random forest
pred_rf <- predict(fit_rf, validation)
confusionMatrix(pred_rf, as.factor(validation$classe))$overall[1]

##  Accuracy 
## 0.9940527

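Random forest is the most accurate of the three models on the validation set (accuracy 0.994, compared with 0.965 for gradient boosting and 0.491 for the decision tree), so we will use it to predict the classe variable.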
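The accuracy reported by confusionMatrix is the share of validation rows on the diagonal of the confusion table, and the estimated out-of-sample error is one minus that value (for the random forest, 1 - 0.9940527 = 0.0059473, or about 0.6%). The calculation can be sketched in base R on a small example. The vectors `truth` and `pred` are hypothetical and are not taken from the project data.

```r
# Hypothetical true labels and predictions (not from the project data).
lv    <- c("A", "B", "C")
truth <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = lv)
pred  <- factor(c("A", "B", "B", "B", "C", "C", "A", "A"), levels = lv)

# Rows are predictions, columns are true classes.
conf_tab <- table(Prediction = pred, Reference = truth)

# Correct predictions sit on the diagonal of the table.
accuracy   <- sum(diag(conf_tab)) / sum(conf_tab)
error_rate <- 1 - accuracy

conf_tab
accuracy
error_rate
```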

Prediction

Let’s use the best-performing model, the random forest, to predict the final 20 observations.

test_prediction <- predict(fit_rf, testing)
test_prediction

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
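As a rough sanity check, the validation accuracy of the random forest can be translated into what to expect on these 20 cases. The sketch below assumes that the test cases resemble the validation data and that prediction errors are independent. Both are assumptions of this illustration, not results from the analysis.

```r
acc_rf <- 0.9940527   # validation accuracy of the random forest reported above
n_test <- 20          # number of test observations

# Expected number of correct predictions among the 20 cases.
expected_correct <- n_test * acc_rf

# Probability that all 20 predictions are correct, assuming independent errors.
p_all_correct <- acc_rf^n_test

expected_correct
p_all_correct
```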