Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
In my report, I will describe:

- how I built the model,
- how I used cross-validation,
- what the expected out-of-sample error is,
- why I made the choices I did,
- how I used my prediction model to predict 20 different test cases.
# load packages
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
library("rpart")
library("rpart.plot")
library("RColorBrewer")
library("rattle")
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library("gbm")
## Loaded gbm 2.1.5
library("corrplot")
## corrplot 0.84 loaded
# setting seed
set.seed(999)
# download data
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
testing <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
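The read.csv(url(...)) calls above fetch the files over the network on every run. As an optional variant (my addition, not part of the original analysis), the files could be cached locally first:
# Optional (not in the original report): cache the CSVs on disk so that
# re-running the analysis does not re-download them each time
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testUrl, destfile = "pml-testing.csv")
training <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))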
I will remove the variables that contain missing values (in this dataset, any such variable is missing in well over 90% of observations), as well as the variables with near-zero variance.
# Remove columns containing NA values
na_counting <- colSums(is.na(training))
training <- training[, na_counting == 0]
testing <- testing[, na_counting == 0]
dim(training); dim(testing)
## [1] 19622 60
## [1] 20 60
# Remove near-zero-variance variables
NZV <- nearZeroVar(training)
training <- training[,-NZV]
testing <- testing[ ,-NZV]
dim(training); dim(testing)
## [1] 19622 59
## [1] 20 59
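As a quick look at the cleaned data (a sketch I am adding; the original report loads corrplot but shows no plot), the correlations among the remaining numeric predictors can be visualized:
# Optional sketch: visualize correlations among the numeric predictors
numericCols <- sapply(training, is.numeric)
corMatrix <- cor(training[, numericCols])
corrplot(corMatrix, method = "color", type = "lower",
         tl.cex = 0.45, tl.col = "black")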
I will apply cross-validation (repeated random subsampling) 10 times. Each time, 60% of the data is randomly assigned to a training set and the remaining 40% to a validation set. I calculate the out-of-sample error for each prediction, giving each misclassification a loss of one, so the out-of-sample error is the number of incorrect predictions. The expected out-of-sample error is then the average of the 10 out-of-sample errors.
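Written out (my notation for the procedure above): if $\hat{y}_j^{(i)}$ is the predicted class and $y_j^{(i)}$ the true class of observation $j$ in repetition $i$, then

$$\text{err}_i = \sum_{j=1}^{n_{\text{test}}} \mathbf{1}\left(\hat{y}_j^{(i)} \neq y_j^{(i)}\right), \qquad \widehat{\text{err}} = \frac{1}{10}\sum_{i=1}^{10} \text{err}_i$$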
# create a vector to store the out-of-sample errors of Model 1 (decision tree)
Model1_oos <- NULL
for (i in 1:10) {
  # split the training data: 60% for model fitting, 40% for validation
  inTrain <- createDataPartition(training$classe, p=0.6, list=FALSE)
  myTraining <- training[inTrain, ]
  myTesting <- training[-inTrain, ]
  # fit a decision tree (dropping the row-index column X) and predict
  Model1_decisionTree <- rpart(classe ~ ., data=myTraining[,-1], method="class")
  Model1_decisionTree_prediction <- predict(Model1_decisionTree, myTesting, type = "class")
  # count misclassifications on the validation set
  Model1_oos[i] <- sum(Model1_decisionTree_prediction != myTesting$classe)
}
print(paste("The Expected Out of Sample Error of Model 1 is", mean(Model1_oos)))
## [1] "The Expected Out of Sample Error of Model 1 is 1032.7"
# create a vector to store the out-of-sample errors of Model 2 (random forest)
Model2_oos <- NULL
for (i in 1:10) {
  # split the training data: 60% for model fitting, 40% for validation
  inTrain <- createDataPartition(training$classe, p=0.6, list=FALSE)
  myTraining <- training[inTrain, ]
  myTesting <- training[-inTrain, ]
  # fit a random forest (dropping the row-index column X) and predict
  Model2_randomForest <- randomForest(classe ~ ., data=myTraining[,-1])
  Model2_randomForest_prediction <- predict(Model2_randomForest, myTesting, type = "class")
  # count misclassifications on the validation set
  Model2_oos[i] <- sum(Model2_randomForest_prediction != myTesting$classe)
}
print(paste("The Expected Out of Sample Error of Model 2 is", mean(Model2_oos)))
## [1] "The Expected Out of Sample Error of Model 2 is 10.5"
# create a vector to store the out-of-sample errors of Model 3 (boosting)
Model3_oos <- NULL
for (i in 1:10) {
  # split the training data: 60% for model fitting, 40% for validation
  inTrain <- createDataPartition(training$classe, p=0.6, list=FALSE)
  myTraining <- training[inTrain, ]
  myTesting <- training[-inTrain, ]
  # fit a gradient boosting model, using 5-fold cross-validation for tuning
  fitControl <- trainControl(method = "repeatedcv",
                             number = 5,
                             repeats = 1)
  Model3_gbm <- train(classe ~ ., data=myTraining[,-1], method = "gbm",
                      trControl = fitControl,
                      verbose = FALSE)
  Model3_gbm_prediction <- predict(Model3_gbm, newdata=myTesting)
  # count misclassifications on the validation set
  Model3_oos[i] <- sum(Model3_gbm_prediction != myTesting$classe)
}
print(paste("The Expected Out of Sample Error of Model 3 is", mean(Model3_oos)))
## [1] "The Expected Out of Sample Error of Model 3 is 24.9"
Comparing the three algorithms above, the random forest achieves the lowest expected out-of-sample error on the myTesting validation sets, i.e., the best accuracy. I therefore apply the random forest method, fitted on the full training set, to predict the 20 test cases. The results are shown below.
# A small trick to force the variable types and factor levels of the test set
# to match the training set: rbind one training row onto it, then drop it
testing <- rbind(training[2,-59] , testing[,-59])
testing <- testing[-1,]
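# Sanity check (my addition, not in the original): the column classes of the
# rebuilt test set should now match those of the training predictors
stopifnot(all(sapply(testing, class) == sapply(training[, -59], class)))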
# Prediction
final_randomForest <- randomForest(classe ~ ., data=training[,-1])
prediction_20test <- predict(final_randomForest, testing, type = "class")
prediction_20test
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E