We are asked to use a specific set of activity measurements to predict the classe outcome variable. Two predictive models (recursive partitioning and random forest) were used to predict the outcome from the identified predictors. The random forest yielded the higher accuracy, approximately 98.8% (an estimated out-of-sample error of about 1.2%), and was used to predict the outcomes for the test data.
Devices such as the Jawbone Up, Nike FuelBand, and Fitbit now allow the inexpensive collection of a large amount of data on one's personal activity. The dataset provided quantifies how 6 participants performed an activity, using inputs from accelerometers on the belt, forearm, arm, and dumbbell. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Two datasets have been provided: training and test. The goal of this project is to predict the manner in which the exercise was conducted, as indicated by the classe variable. The test set will be used to validate the predictions derived from the training set. The model will be chosen based on its accuracy and will be used to predict 20 different test cases.
We first load the relevant packages used for the analysis.
library(caret)
library(dplyr)
library(rpart)
library(randomForest)
We read the CSV files that were downloaded: the training dataset and the test dataset.
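If the files are not yet available locally, they can first be fetched with a short script; the URLs below are assumed to be the standard course locations and may need to be adjusted.
# Assumed download locations; adjust if the files are hosted elsewhere
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(url_train, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(url_test, "pml-testing.csv")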
setwd("~/Documents/R Programming/Data Sources")
data_training <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA", ""))
data_test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA", ""))
We remove columns that are blank or contain NA values. We also remove the columns that are not predictor variables (such as user_name and the timestamps). This leaves 52 predictor variables and 1 outcome variable.
# Keep only columns with no missing values
data_training_short <- data_training[, colSums(is.na(data_training)) == 0]
data_test_short <- data_test[, colSums(is.na(data_test)) == 0]
# Drop the non-predictor bookkeeping columns at the start of each dataset
data_training_short2 <- select(data_training_short, roll_belt:classe)
data_test_short2 <- select(data_test_short, roll_belt:problem_id)
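As a quick sanity check, we can confirm the dimensions of the cleaned datasets:
# Expect 19622 rows and 53 columns (52 predictors + classe) for training,
# and 20 rows and 53 columns (52 predictors + problem_id) for test
dim(data_training_short2)
dim(data_test_short2)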
Our training set has 19,622 rows, which is quite large. We split the data into 2 subsets to reduce computation time.
set.seed(1007)
ids_train <- createDataPartition(y = data_training_short2$classe, p = 0.5, list = FALSE)
train_dataset1 <- data_training_short2[ids_train, ]
train_dataset2 <- data_training_short2[-ids_train, ]
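Since createDataPartition stratifies on the outcome, each subset should preserve the class proportions of the full training set, which can be verified as follows:
# Class proportions should be nearly identical across the full set and both subsets
round(prop.table(table(data_training_short2$classe)), 3)
round(prop.table(table(train_dataset1$classe)), 3)
round(prop.table(table(train_dataset2$classe)), 3)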
From each subset we split further into a training set using 80% of the data and a validation set using the remaining 20%.
set.seed(1007)
ids_train1 <- createDataPartition(y = train_dataset1$classe, p = 0.8, list = FALSE)
dataset1_training <- train_dataset1[ids_train1, ]
dataset1_validation <- train_dataset1[-ids_train1, ]
set.seed(1007)
ids_train2 <- createDataPartition(y = train_dataset2$classe, p = 0.8, list = FALSE)
dataset2_training <- train_dataset2[ids_train2, ]
dataset2_validation <- train_dataset2[-ids_train2, ]
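A quick check of the resulting row counts confirms the 80/20 split within each subset:
# Roughly 80% of each subset goes to training and 20% to validation
c(training = nrow(dataset1_training), validation = nrow(dataset1_validation))
c(training = nrow(dataset2_training), validation = nrow(dataset2_validation))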
We evaluate the outcome using two methods: recursive partitioning and random forest. Since the outcome is a discrete variable (i.e., A, B, C, D, or E), we presume that pre-processing the predictor variables will not have a significant impact, so we evaluate the data as is. Four accuracy values will be derived: two predictive models for each of the two subsets. The code used for subset 1 is shown.
set.seed(1007)
control <- trainControl(method = "cv", number = 5)
fit_rpart <- train(classe ~ ., data = dataset1_training, method = "rpart", trControl = control)
predict_rpart <- predict(fit_rpart, dataset1_validation)
conf_rpart <- confusionMatrix(dataset1_validation$classe, predict_rpart)
set.seed(1007)
fit_rf <- train(classe ~ ., data = dataset1_training, method = "rf", trControl = control)
predict_rf <- predict(fit_rf, dataset1_validation)
conf_rf <- confusionMatrix(dataset1_validation$classe, predict_rf)
conf_rpart$table
## Reference
## Prediction A B C D E
## A 507 13 36 0 2
## B 133 141 105 0 0
## C 155 13 174 0 0
## D 156 65 100 0 0
## E 47 57 96 0 160
conf_rpart$overall[1]
## Accuracy
## 0.5010204
conf_rf$table
## Reference
## Prediction A B C D E
## A 557 1 0 0 0
## B 1 374 4 0 0
## C 0 4 335 3 0
## D 0 1 7 313 0
## E 0 1 1 2 356
conf_rf$overall[1]
## Accuracy
## 0.9872449
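Beyond the overall accuracy, the fitted objects expose further detail, for example the mtry value selected by cross-validation and the per-class statistics of the validation confusion matrix (a brief sketch; output omitted):
fit_rf$bestTune  # mtry chosen by 5-fold cross-validation
conf_rf$byClass[, c("Sensitivity", "Specificity")]  # per-class performance on the validation set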
set.seed(1007)
control <- trainControl(method = "cv", number = 5)
fit_rpart2 <- train(classe ~ ., data = dataset2_training, method = "rpart", trControl = control)
predict_rpart2 <- predict(fit_rpart2, dataset2_validation)
conf_rpart2 <- confusionMatrix(dataset2_validation$classe, predict_rpart2)
set.seed(1007)
fit_rf2 <- train(classe ~ ., data = dataset2_training, method = "rf", trControl = control)
predict_rf2 <- predict(fit_rf2, dataset2_validation)
conf_rf2 <- confusionMatrix(dataset2_validation$classe, predict_rf2)
conf_rpart2$table
## Reference
## Prediction A B C D E
## A 495 13 43 0 7
## B 166 116 97 0 0
## C 167 14 161 0 0
## D 153 56 112 0 0
## E 50 56 88 0 166
conf_rf2$table
## Reference
## Prediction A B C D E
## A 558 0 0 0 0
## B 2 371 5 0 1
## C 0 3 338 1 0
## D 0 0 5 315 1
## E 0 0 0 4 356
Based on the results, we see that the random forest method yields the better accuracy, roughly 98.8% (an estimated out-of-sample error of about 1.2%), compared with roughly 50% for recursive partitioning.
conf_rpart$overall[1]
## Accuracy
## 0.5010204
conf_rpart2$overall[1]
## Accuracy
## 0.4785714
conf_rf$overall[1]
## Accuracy
## 0.9877551
conf_rf2$overall[1]
## Accuracy
## 0.9882653
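For a side-by-side comparison, the four accuracy values can be collected into a single table:
# Gather the validation accuracies of both models on both subsets
accuracy_summary <- data.frame(model = c("rpart (subset 1)", "rpart (subset 2)",
    "rf (subset 1)", "rf (subset 2)"), accuracy = c(conf_rpart$overall[1],
    conf_rpart2$overall[1], conf_rf$overall[1], conf_rf2$overall[1]))
accuracy_summary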
We therefore use the random forest model to predict the outcomes for the test dataset.
predictionTest <- predict(fit_rf, data_test_short2)
print(predictionTest)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The prediction gave 20/20 correct answers.
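As an additional consistency check, the second random forest model can be applied to the same test cases; the two models would be expected to agree:
# Predictions from the second random forest should match those above
predictionTest2 <- predict(fit_rf2, data_test_short2)
identical(as.character(predictionTest), as.character(predictionTest2))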
Additional information on the data can be found on the Human Activity Recognition project page from which the dataset was sourced.
For queries, you may contact me through my LinkedIn account.