In this document, a set of Human Activity Recognition (HAR) data released by Groupware@LSE is analyzed to determine the type of weight lifting activity, out of the five possible categories: exactly as specified (A), throwing elbows to the front (B), lifting the dumbell halfway (C), lowering the dumbbell halfway (D), and throwing the hips to the front (E).
Each record of the activity comes with data collected from accelerometers attached to the arm, forearm, belt, and dumbbell. A training data set of 19622 records with the class of activity and a testing data set of 20 records without the class of activity provided. The goal of this study is to successfully predict the correct activity type in the testing data set.
After the data is downloaded, it is noticeable that some columns of both the training and testing contain no information that will be helpful with the prediction. We can drop columns 1-7 immediately. Furthermore, there are columns that just hold NA values. They can be dropped as well.
trainingCsv <- "/home/wshao/school/coursera/predmachlearn-031/project/pml-training.csv"
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", trainingCsv)
rawTraining <- read.csv(trainingCsv)
testingCsv <- "/home/wshao/school/coursera/predmachlearn-031/project/pml-testing.csv"
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", testingCsv)
rawTesting <- read.csv(testingCsv)
# First 10 columns names
head(names(rawTraining), 10)
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt"
# Take out columns 1-7, where the data is not sensor data
training <- rawTraining[,8:160]
# Remove columns where there are all NAs
isnaTraining = as.data.frame(is.na(training))
sumIsnaTraining = apply(isnaTraining, 2, sum)
indIsnaTraining <- which(sumIsnaTraining == 0)
training <- training[,indIsnaTraining]
indNonfactorCol <- which((sapply(training, class) != 'factor'))
training <- training[,indNonfactorCol]
training$classe <- rawTraining$classe
testing <- rawTesting[,8:160]
testing <- testing[,indIsnaTraining]
testing <- testing[,indNonfactorCol]
testing$classe <- rawTesting$classe
The original training set ought to be divided into a pure training set (PTR) and cross validation set (CVL). In this study, the pure training set is 60% of the original training set while the cross validation set is 40% of the original training set.
library(caret)
set.seed(12345)
inA <- createDataPartition(rawTraining$X, p=0.6)
ptrSet <- training[inA[[1]],]
cvlSet <- training[-inA[[1]],]
First, a random forest model is tried out on the pure training data set, with 5-fold cross validation. The model is then tested on the 40% cross-validation (CVL) set split apart at the beginning, not to be confused with the k-fold cross-validation used in training process for this random forest model.
set.seed(23516)
ptm <- proc.time()
modelRf <- train(classe ~ ., data=ptrSet, method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE, allowParallel=TRUE)
timeRf <- proc.time() - ptm
plot(varImp(modelRf), main = "Variable Importance in Random Forest Model")
predRfCvl = predict(modelRf, newdata = cvlSet)
rfCvlAccuracy = sum(predRfCvl == cvlSet$classe) / nrow(cvlSet)
rfCvlError = 1 - rfCvlAccuracy
rfCvlError
## [1] 0.008154944
timeRf
## user system elapsed
## 1696.173 25.091 1747.029
The expected out-of-sample error is 0.8%. However, the time spent in training the model is 29 minutes on a Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz computer. The size of the model in memory is a costly 1.1 GB.
library(mlearning)
##
## Attaching package: 'mlearning'
##
## The following object is masked from 'package:caret':
##
## train
plot(confusion(predRfCvl, cvlSet$classe), sort=FALSE)
predRfTesting = predict(modelRf, newdata = testing)
predRfTesting
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The prediction on the testing is given above. We will next explore an alternative option, using boosting. The gradient boosting machine method is selected. We will also train this model with 5-fold cross validation.
set.seed(62351)
ptm <- proc.time()
modelGbm <- train(classe ~., data=ptrSet, method="gbm",
trControl=trainControl(method="cv", number=5),
verbose=FALSE)
timeGbm <- proc.time() - ptm
plot(varImp(modelGbm), main = "Variable Importance in Gradient Boost Machine Model")
predGbmCvl = predict(modelGbm, newdata = cvlSet)
gbmCvlAccuracy = sum(predGbmCvl == cvlSet$classe) / nrow(cvlSet)
gbmCvlError = 1 - gbmCvlAccuracy
gbmCvlError
## [1] 0.03695209
timeGbm
## user system elapsed
## 229.537 0.611 232.506
Here the expected out-of-sample error rate is 3.7%, not as low as what was obtained by the random forest model, but still very good. The training time is much faster, at under 4 minutes on the same computer, and the amount of memory used is much smaller, at 15.9 MB. The confusion matrix is shown below.
plot(confusion(predGbmCvl, cvlSet$classe), sort=FALSE)
predGbmTesting = predict(modelGbm, newdata = testing)
predGbmTesting
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
sum(predRfTesting == predGbmTesting)
## [1] 20
Most importantly, the result of the prediction on the 20 testing cases are the same as the result using the random forest model.
There are many capable model training techniques available. Often times, if multiple models give comparable prediction results, it may be wise to also consider time efficiency before settling on the model of choice.