We are asked to use a specific set of activity measurements to predict the classe outcome variable. Two predictive models (recursive partitioning and random forest) were used to predict the outcome from the identified predictors. The random forest yielded the higher accuracy, approximately 98.8% (an estimated out-of-sample error of about 1.2%), and was used to predict the outcomes for the test data.
Devices such as the Jawbone Up, Nike FuelBand, and Fitbit now allow the inexpensive collection of a large amount of data on one's personal activity. The dataset provided quantifies how 6 participants performed an activity, using inputs from accelerometers on the belt, forearm, arm, and dumbbell. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Two datasets have been provided: training and test. The goal of this project is to predict the manner in which the exercise was conducted, as indicated by the classe variable. The test set will be used to validate the predictions derived from the training set. The model will be chosen based on its accuracy and will be used to predict 20 different test cases.
We first load the relevant packages used for the analysis.
library(caret)
library(dplyr)
library(rpart)
library(randomForest)
We read the CSV files that were downloaded: the training dataset and the test dataset.
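If the files are not yet available locally, they can first be fetched with a short script; the URLs below are assumed to be the standard course locations and may need to be adjusted.
# Assumed download locations; adjust if the files are hosted elsewhere
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(url_train, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(url_test, "pml-testing.csv")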
setwd("~/Documents/R Programming/Data Sources")
data_training <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA", ""))
data_test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA", ""))
We remove columns that are blank or contain NA values. We also remove the columns that are not predictor variables (such as user_name and the timestamps). This leaves 52 predictor variables and 1 outcome variable.
# Keep only columns with no missing values
data_training_short <- data_training[, colSums(is.na(data_training)) == 0]
data_test_short <- data_test[, colSums(is.na(data_test)) == 0]
# Drop the non-predictor bookkeeping columns at the start of each dataset
data_training_short2 <- select(data_training_short, roll_belt:classe)
data_test_short2 <- select(data_test_short, roll_belt:problem_id)
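As a quick sanity check, we can confirm the dimensions of the cleaned datasets:
# Expect 19622 rows and 53 columns (52 predictors + classe) for training,
# and 20 rows and 53 columns (52 predictors + problem_id) for test
dim(data_training_short2)
dim(data_test_short2)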
Our training set has 19,622 rows, which is quite large. We split the data into 2 subsets to reduce computation time.
set.seed(1007)
ids_train <- createDataPartition(y = data_training_short2$classe, p = 0.5, list = FALSE)
train_dataset1 <- data_training_short2[ids_train, ]
train_dataset2 <- data_training_short2[-ids_train, ]
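Since createDataPartition stratifies on the outcome, each subset should preserve the class proportions of the full training set, which can be verified as follows:
# Class proportions should be nearly identical across the full set and both subsets
round(prop.table(table(data_training_short2$classe)), 3)
round(prop.table(table(train_dataset1$classe)), 3)
round(prop.table(table(train_dataset2$classe)), 3)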
From each subset we split further into a training set using 80% of the data and a validation set using the remaining 20%.
set.seed(1007)
ids_train1 <- createDataPartition(y = train_dataset1$classe, p = 0.8, list = FALSE)
dataset1_training <- train_dataset1[ids_train1, ]
dataset1_validation <- train_dataset1[-ids_train1, ]
set.seed(1007)
ids_train2 <- createDataPartition(y = train_dataset2$classe, p = 0.8, list = FALSE)
dataset2_training <- train_dataset2[ids_train2, ]
dataset2_validation <- train_dataset2[-ids_train2, ]
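A quick check of the resulting row counts confirms the 80/20 split within each subset:
# Roughly 80% of each subset goes to training and 20% to validation
c(training = nrow(dataset1_training), validation = nrow(dataset1_validation))
c(training = nrow(dataset2_training), validation = nrow(dataset2_validation))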
We evaluate the outcome using two methods: recursive partitioning and random forest. Since the outcome is a discrete variable (i.e., A, B, C, D, or E), we presume that pre-processing the predictor variables will not have a significant impact, so we evaluate the data as is. Four accuracy values will be derived: two predictive models for each of the two subsets. The code used for subset 1 is shown.
set.seed(1007)
control <- trainControl(method = "cv", number = 5)
fit_rpart <- train(classe ~ ., data = dataset1_training, method = "rpart", trControl = control)
predict_rpart <- predict(fit_rpart, dataset1_validation)
conf_rpart <- confusionMatrix(dataset1_validation$classe, predict_rpart)
set.seed(1007)
fit_rf <- train(classe ~ ., data = dataset1_training, method = "rf", trControl = control)
predict_rf <- predict(fit_rf, dataset1_validation)
conf_rf <- confusionMatrix(dataset1_validation$classe, predict_rf)
conf_rpart$table
## Reference
## Prediction A B C D E
## A 507 13 36 0 2
## B 133 141 105 0 0
## C 155 13 174 0 0
## D 156 65 100 0 0
## E 47 57 96 0 160
conf_rpart$overall[1]
## Accuracy
## 0.5010204
conf_rf$table
## Reference
## Prediction A B C D E
## A 557 1 0 0 0
## B 1 374 4 0 0
## C 0 4 335 3 0
## D 0 1 7 313 0
## E 0 1 1 2 356
conf_rf$overall[1]
## Accuracy
## 0.9872449
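Beyond the overall accuracy, the fitted objects expose further detail, for example the mtry value selected by cross-validation and the per-class statistics of the validation confusion matrix (a brief sketch; output omitted):
fit_rf$bestTune  # mtry chosen by 5-fold cross-validation
conf_rf$byClass[, c("Sensitivity", "Specificity")]  # per-class performance on the validation set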
set.seed(1007)
control <- trainControl(method = "cv", number = 5)
fit_rpart2 <- train(classe ~ ., data = dataset2_training, method = "rpart", trControl = control)
predict_rpart2 <- predict(fit_rpart2, dataset2_validation)
conf_rpart2 <- confusionMatrix(dataset2_validation$classe, predict_rpart2)
set.seed(1007)
fit_rf2 <- train(classe ~ ., data = dataset2_training, method = "rf", trControl = control)
predict_rf2 <- predict(fit_rf2, dataset2_validation)
conf_rf2 <- confusionMatrix(dataset2_validation$classe, predict_rf2)
conf_rpart2$table
## Reference
## Prediction A B C D E
## A 495 13 43 0 7
## B 166 116 97 0 0
## C 167 14 161 0 0
## D 153 56 112 0 0
## E 50 56 88 0 166
conf_rf2$table
## Reference
## Prediction A B C D E
## A 558 0 0 0 0
## B 2 371 5 0 1
## C 0 3 338 1 0
## D 0 0 5 315 1
## E 0 0 0 4 356
Based on the results, we see that the random forest method yields the better accuracy, roughly 98.8% (an estimated out-of-sample error of about 1.2%), compared with roughly 50% for recursive partitioning.
conf_rpart$overall[1]
## Accuracy
## 0.5010204
conf_rpart2$overall[1]
## Accuracy
## 0.4785714
conf_rf$overall[1]
## Accuracy
## 0.9877551
conf_rf2$overall[1]
## Accuracy
## 0.9882653
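For a side-by-side comparison, the four accuracy values can be collected into a single table:
# Gather the validation accuracies of both models on both subsets
accuracy_summary <- data.frame(model = c("rpart (subset 1)", "rpart (subset 2)",
    "rf (subset 1)", "rf (subset 2)"), accuracy = c(conf_rpart$overall[1],
    conf_rpart2$overall[1], conf_rf$overall[1], conf_rf2$overall[1]))
accuracy_summary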
We therefore use the random forest model to predict the outcomes for the test dataset.
predictionTest <- predict(fit_rf, data_test_short2)
print(predictionTest)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The prediction gave 20/20 correct answers.
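As an additional consistency check, the second random forest model can be applied to the same test cases; the two models would be expected to agree:
# Predictions from the second random forest should match those above
predictionTest2 <- predict(fit_rf2, data_test_short2)
identical(as.character(predictionTest), as.character(predictionTest2))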
Additional information on the data can be found on the Human Activity Recognition project page from which the dataset was sourced.
For queries, you may contact me through my LinkedIn account.