Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
Load the libraries and the data files.
library(tidyverse); library(caret); library(rpart); library(rpart.plot)
library(e1071); library(ranger)
training <- read.csv("pml-training.csv", na.strings = c("NA", ""))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", ""))
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
The training data has 19,622 rows and 160 variables; the testing data has 20 rows and 160 variables.
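One portability caveat, offered as an assumption about the reader's setup rather than something stated in this report: since R 4.0.0, `read.csv` no longer converts strings to factors by default, so `classe` may be read as a character column. caret's `confusionMatrix` expects factors, so an explicit conversion is a safe extra step. The toy data frame below is hypothetical and only illustrates the conversion.

```r
# Hypothetical stand-in for the outcome column as read by read.csv in R >= 4.0.0.
toy <- data.frame(classe = c("A", "B", "A", "E"), stringsAsFactors = FALSE)

# The column starts out as character, not factor.
was_factor <- is.factor(toy$classe)

# Convert explicitly so that downstream functions see a factor with fixed levels.
toy$classe <- factor(toy$classe)

print(was_factor)
print(levels(toy$classe))
```

In this report the equivalent step would be `training$classe <- factor(training$classe)` right after loading the data.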
Remove the columns that contain missing values.
Then, remove the first 7 columns, since they are not relevant features for modelling.
# Keep only the columns without any missing values, in both data sets.
training <- training[, colSums(is.na(training)) == 0]
testing <- testing[, colSums(is.na(testing)) == 0]
traindata <- training[, -c(1:7)]
testdata <- testing[, -c(1:7)]
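The column filter above keeps a column only when it has no missing values at all, so a column with even a single NA is dropped. A minimal sketch on a toy data frame shows the behavior; the column names and values are hypothetical and not taken from the data set.

```r
# Toy data frame (hypothetical values): one complete column, one mostly
# missing column, and one entirely missing column.
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  max_roll  = c(NA, NA, NA, 2.1),
  kurtosis  = c(NA, NA, NA, NA)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))

# Keep only the columns with zero missing values; drop = FALSE keeps the
# result a data frame even when a single column survives.
toy_clean <- toy[, na_counts == 0, drop = FALSE]

print(na_counts)
print(names(toy_clean))
```

Inspecting `na_counts` before filtering is a cheap way to confirm that the dropped columns are mostly empty rather than missing only a handful of values.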
Then, split the training data further into training and validation sets, using a split of roughly 70/30.
set.seed(12345)
inTrain <- createDataPartition(traindata$classe, p = 0.7, list = FALSE)
train_d <- traindata[inTrain,]
valid_d <- traindata[-inTrain,]
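`createDataPartition` samples within each class, so both parts of the split should keep roughly the same class proportions as the full data. The sketch below checks this on a hypothetical, unbalanced outcome vector; the class sizes are made up for illustration.

```r
library(caret)

set.seed(12345)

# Hypothetical outcome with unbalanced classes (60 / 30 / 10).
y <- factor(rep(c("A", "B", "C"), times = c(60, 30, 10)))

# Row indices for the training part, about 70% of each class.
idx <- createDataPartition(y, p = 0.7, list = FALSE)

# Class proportions in the full data and in each part of the split.
prop_all   <- prop.table(table(y))
prop_train <- prop.table(table(y[idx]))
prop_valid <- prop.table(table(y[-idx]))

print(round(prop_all, 2))
print(round(prop_train, 2))
print(round(prop_valid, 2))
```

The same check can be run on the real split with `prop.table(table(train_d$classe))` and `prop.table(table(valid_d$classe))`.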
We compare two learning algorithms: a classification tree (rpart) and a random forest (ranger). Random forests are widely regarded as among the strongest general-purpose classifiers, and ranger is a fast, modern implementation.
First, run the classification tree (rpart). Note: only 3 cross-validation folds are used to reduce running time, especially for the ranger model.
Around 10 folds is the usual choice, and higher numbers are not uncommon when the best possible estimate is wanted.
myControl <- trainControl(method = "cv", number = 3)
model_rpart <- train(classe ~ ., data = train_d, method = "rpart",
                     trControl = myControl)
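For readers with more computing time, the resampling setup can be changed without touching the `train` calls. The following is a sketch of two alternatives, not what this report ran; the object names `ctrl_cv10` and `ctrl_rep` are hypothetical.

```r
library(caret)

# Plain 10-fold cross-validation.
ctrl_cv10 <- trainControl(method = "cv", number = 10)

# 10-fold cross-validation repeated 3 times, for a more stable estimate
# at roughly three times the cost.
ctrl_rep <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

print(ctrl_cv10$number)
print(ctrl_rep$repeats)
```

Either object could be passed as `trControl` in place of `myControl`; the running time grows roughly in proportion to the total number of resamples.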
Then, run the random forest model (ranger).
model_ranger <- train(classe ~ ., data = train_d, method = "ranger",
                      trControl = myControl)
Predict on the validation data with both models, then check the accuracy of each using a confusion matrix.
result_rpart <- predict(model_rpart, valid_d)
result_ranger <- predict(model_ranger, valid_d)
confusionMatrix(result_rpart, valid_d$classe)$overall['Accuracy']
## Accuracy
## 0.4963466
confusionMatrix(result_ranger, valid_d$classe)$overall['Accuracy']
## Accuracy
## 0.9920136
As expected, the second model (ranger) is far more accurate on the validation data: about 0.992, compared with about 0.496 for rpart. We therefore choose model_ranger as the final model.
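The validation accuracies reported above translate directly into estimated out-of-sample error rates (one minus accuracy). The short calculation below uses only the two accuracy values printed earlier; the projection onto the 20 test cases is a rough illustration that assumes the test cases behave like the validation data.

```r
# Validation accuracies copied from the output above.
acc <- c(rpart = 0.4963466, ranger = 0.9920136)

# Estimated out-of-sample error rate for each model.
oos_error <- 1 - acc

# Error rate in percent: about 50.37 for rpart and 0.80 for ranger.
print(round(100 * oos_error, 2))

# Expected number of misclassified cases among 20 test cases:
# about 10.07 for rpart and 0.16 for ranger.
print(round(20 * oos_error, 2))
```

With an estimated error rate under 1%, the ranger model would be expected to misclassify fewer than one of the 20 test cases on average.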
results <- resamples(list(rpart = model_rpart, ranger = model_ranger))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: rpart, ranger
## Number of resamples: 3
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.5117904 0.5129944 0.5141983 0.5168525 0.5193835 0.5245687 0
## ranger 0.9877729 0.9884268 0.9890806 0.9888623 0.9894070 0.9897335 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.3628715 0.3630979 0.3633243 0.3713404 0.3755748 0.3878254 0
## ranger 0.9845318 0.9853590 0.9861861 0.9859099 0.9865990 0.9870119 0
bwplot(results)
dotplot(results)
The final model is now applied to the test data to produce the final predictions. These answers are also reported for the Quiz section of this report.
Predict <- predict(model_ranger, testdata)
Predict
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E