This project is a result of a project assignment for the JHU course on Machine Learning (part of the data science specialisation, offered via Coursera).
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
## Loading required package: lars
## Loaded lars 1.2
##
## Loading required package: elasticnet
## Loading required package: MASS
## Loading required package: mda
## Loading required package: class
## Loaded mda 0.4-7
The data is provided by :
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz3cSdc7Ewm
The raw data sets are loaded into two data frames:
setwd("d:/dev/app/gitrepo/PML_HAR")
if(!file.exists("./data")){
dir.create("./data")
}
setwd("./data")
## check if file exists first, then download and unzip
if(!file.exists("train.csv")){
file_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url = file_url, destfile = "train.csv")
}
if(!file.exists("test.csv")){
file_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url = file_url, destfile = "test.csv")
}
trainDS <- read.csv("train.csv", na.strings = c("NA", "#DIV/0!", ""))
testDS <- read.csv("test.csv", na.strings = c("NA", "#DIV/0!", ""))
table(trainDS$user_name, trainDS$classe)
##
## A B C D E
## adelmo 1165 776 750 515 686
## carlitos 834 690 493 486 609
## charles 899 745 539 642 711
## eurico 865 592 489 582 542
## jeremy 1177 489 652 522 562
## pedro 640 505 499 469 497
The 2 datasets have the same structure except for the last variable which is classe in the training set (the outcome) and problem_id in the test set (to be submitted). There is a fair distribution between outcomes by test person. The na.strings variable in read.csv is important to capture all NA's.
On first glance, a lot of the variables have NA's. The choice is to remove columns with more than 95% NA's. Pay attention to the na.strings variable in the initial read.csv instruction. (That was actually the first step in the cleaning.)
sum(colSums(is.na(trainDS)) > 0.95 * dim(trainDS)[1])
## [1] 100
removeColumns <- colSums(is.na(trainDS)) > 0.95 * dim(trainDS)[1]
trainNoNa <- trainDS[ , !removeColumns]
testNoNa <- testDS[ , !removeColumns]
The goal of this algorithm is to predict the manner in which the participants did the exercise, captured in the outcome variable classe. We will not complicate matters by considering time stamps. The variable new_window and num_window capture the time window and are removed. user_name has been removed to avoid the risk that 1 test person does a better job than another. We want to train from the sensor data.
trainNZV <- nearZeroVar(trainNoNa, saveMetrics=TRUE)
sum(trainNZV$nzv)
## [1] 1
trainNZV[trainNZV$nzv,]
## freqRatio percentUnique zeroVar nzv
## new_window 47.33005 0.01019264 FALSE TRUE
extraColumns <- names(trainNoNa) %in%
c("X",
"user_name",
"raw_timestamp_part_1",
"raw_timestamp_part_2",
"cvtd_timestamp",
"new_window",
"num_window"
)
trainClean <- trainNoNa[!extraColumns]
testClean <- testNoNa[!extraColumns]
dim(trainClean); dim(testClean)
## [1] 19622 53
## [1] 20 53
With so many columns left, this is not easy to see if some variables are more important than others. The goal of this paper is to provide an algorithm to predict the outcome, inference questions are subordinate but nevertheless interesting. In the prediction chapter we will take some sidesteps and try to get information on importance of variables.
The clean training data set is split in a training and testing set (50/50, to reduce excessive runtimes later)
set.seed(2015) # For reproducibile purpose
inTrain <- createDataPartition(trainClean$classe,p=0.5, list=F)
training <- trainClean[inTrain, ]
testing <- trainClean[-inTrain, ]
A couple of algorithms are trained (using caret package) and then compared.
Cross validation is used with a K-fold=5 parameter, or standard Caret bootstrap settings (QDA only).
fitControl <- trainControl(method = "cv",number = 5)
t.RF <- system.time(
{fit.RF <- train(classe ~ ., data=training,
method="rf",
trControl=fitControl)}
)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:Hmisc':
##
## combine
t.GBM <- system.time(
{fit.GBM <- train(classe ~ ., data=training,
method="gbm",
trControl = fitControl, verbose=FALSE)}
)
## Loading required package: plyr
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:lubridate':
##
## here
##
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
t.QDA <- system.time(
{fit.QDA <- train(classe ~ ., data=training,
method="qda", verbose=FALSE)}
)
t.SVM <- system.time(
{fit.SVM <- train(classe ~., data = training,
method="svmRadial",
trControl = fitControl)}
)
Use the testing data set to predict for the 4 algorithms and then report the training runtime, test accuracy and test error.
pred.RF <- predict(fit.RF, testing)
pred.GBM <- predict(fit.GBM, testing)
pred.QDA <- predict(fit.QDA, testing)
pred.SVM <- predict(fit.SVM, testing)
c.RF <- confusionMatrix(pred.RF, testing$classe)
c.GBM <- confusionMatrix(pred.GBM, testing$classe)
c.QDA <- confusionMatrix(pred.QDA, testing$classe)
c.SVM <- confusionMatrix(pred.SVM, testing$classe)
a.RF <- c.RF$overall[1]
a.GBM <- c.GBM$overall[1]
a.QDA <- c.QDA$overall[1]
a.SVM <- c.SVM$overall[1]
oos.RF <- 1 - a.RF
oos.GBM <- 1 - a.GBM
oos.QDA <- 1 - a.QDA
oos.SVM <- 1 - a.SVM
r1 <- c("RF", round(t.RF[3], 0), round(a.RF, 3), round(oos.RF, 3))
r2 <- c("GBM", round(t.GBM[3], 0), round(a.GBM, 3), round(oos.GBM, 3))
r3 <- c("QDA", round(t.QDA[3], 0), round(a.QDA,3), round(oos.QDA, 3))
r4 <- c("SVM", round(t.SVM[3], 0), round(a.SVM, 3), round(oos.SVM, 3))
resultSummary <- as.data.frame(rbind(r1,r2,r3,r4))
names(resultSummary) <- c("Method",
"training runtime",
"Test accuracy",
"Test OOS error")
rownames(resultSummary) <- NULL
resultSummary
## Method training runtime Test accuracy Test OOS error
## 1 RF 443 0.988 0.012
## 2 GBM 196 0.959 0.041
## 3 QDA 6 0.887 0.113
## 4 SVM 293 0.912 0.088
plot(varImp(fit.RF), top=10)
Although Random forests has the longest training runtime, it is selected because it has the lowest error on the test set (1.2%) and will be used to submit the results. The variance plot for RF sheds some light on which sensors are most important to predict. QDA gives the highest test error (but still reasonable) but is very fast to train compared to the other algorithms.
prediction <- as.character(predict(fit.RF, testClean ))
# write prediction files
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_", i, ".txt")
write.table(x[i], file = filename,
quote = FALSE, row.names = FALSE, col.names = FALSE)
}
}
pml_write_files(prediction)