Analyzing data (source: http://groupware.les.inf.puc-rio.br/har) from sensors worn on the body and placed within exercise equipment, we classify how well a certain exercise activity was performed by six human volunteers. The classifications are: “A” - exactly according to the specification, “B” - throwing the elbows to the front, “C” - lifting the dumbbell only halfway, “D” - lowering the dumbbell only halfway, and “E” - throwing the hips to the front. Class “A” is the only correct procedure, rest all four are faulty.
For our purpose, we compare two methods–Recursive Partitioning And Regression Trees (rpart) and Random Forest (rf)–for their prediction accuracy and choose the more accurate to apply to the validation data set.
Downloading data files and reading them in:
fileURL1 = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileURL1, "builddata.csv")
fileURL2 = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileURL2, "validatedata.csv")
buildset = read.csv("builddata.csv")
validateset = read.csv("validatedata.csv")
Explore data structure:
str(buildset, list.len = 20)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## [list output truncated]
Of 160 columns, many are with near zero variance, of little relevance to our classification purpose, or with too many missing values. Identifying such columns and removing from the datasets.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# near zero variance
NZV = nearZeroVar(buildset)
# irrelevant columns with serial number of observations, names of volunteers, and three types of timestamps
irrelev = 1:5
# columns with more than 90% missing values
missvalcol = numeric()
for(i in 1:ncol(buildset)){
ifelse((length(which(is.na(buildset[,i]))) / length(buildset[,i])) > 0.90, missvalcol[i] <- i, missvalcol[i] <-
0)
}
# removing columns identified to be dropped
buildset = buildset[,-c(NZV, irrelev, missvalcol)]
validateset = validateset[,-c(NZV, irrelev, missvalcol)]
For cross-validation, We choose random sampling without replacement to partition the build data set into a training set (70%) and a test set (30%).
set.seed(321321)
rowtrain = createDataPartition(buildset$classe, p = 0.70, list = FALSE)
trainset = buildset[rowtrain, ]
testset = buildset[-rowtrain, ]
It being a classification problem, we will compare the performance of Recursive Partitioning and Regression Trees (rpart) and Random Forest (rf).
set.seed(321321)
fit.rpart = train(data= trainset, classe~., method = "rpart")
fit.rf = train(data= trainset, classe~., method = "rf", ntree = 64)
NOTE: we have limited the number of trees in Random Forest to 64 because of computational resource constraint. Our decision to choose the number 64 is backed by research from Oshiro et al. (2012).
We will compare the accuracy of both models on goodness of fit against training data, as well as prediction against test data.
pred.rpart = predict(fit.rpart, newdata = testset)
pred.rf = predict(fit.rf, newdata = testset)
confmat.rpart = confusionMatrix(pred.rpart, testset$classe)
confmat.rf = confusionMatrix(pred.rf, testset$classe)
fit.rpart$results[fit.rpart$results[,"cp"]==fit.rpart$bestTune$cp,c("Accuracy", "Kappa")]
## Accuracy Kappa
## 1 0.5448143 0.4171385
fit.rf$results[fit.rf$results[,"mtry"]==fit.rf$bestTune$mtry,c("Accuracy", "Kappa")]
## Accuracy Kappa
## 2 0.9950994 0.9938015
confmat.rpart$overall[c("Accuracy", "Kappa")]
## Accuracy Kappa
## 0.4978760 0.3440489
confmat.rf$overall[c("Accuracy", "Kappa")]
## Accuracy Kappa
## 0.9991504 0.9989253
par(mfrow= c(1,2), bg = "white")
plot(confmat.rpart$table, main = paste("RPART ( Accuracy: ", round(confmat.rpart$overall["Accuracy"],3),")"))
plot(confmat.rf$table, main = paste("Random Forest ( Accuracy: ", round(confmat.rf$overall["Accuracy"],3),")"))
We see that Random Forest far outperforms RPART decision tree in accuracy on training and test sets and will therefore be our chosen model.
We expect out-of-sample error against the validation set to be similar to the out-of-bag (OOB) error rate of the final model: 0.25%.
fit.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 64, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 64
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.2%
## Confusion matrix:
## A B C D E class.error
## A 3904 1 0 0 1 0.0005120328
## B 6 2650 2 0 0 0.0030097818
## C 0 2 2394 0 0 0.0008347245
## D 0 0 7 2244 1 0.0035523979
## E 0 0 0 7 2518 0.0027722772
plot(x=1:fit.rf$finalModel$ntree, y=fit.rf$finalModel$err.rate[,1], type="l", xlab="Number of Trees", ylab="OOB Error", main="OOB Error by Number of Trees")
The near flattening out of OOB error beyond (approx.) the 40th tree shows we have not overly compromised accuracy by limiting the number of trees to 64 in view of computational constraints.
We predict against the validation set and present the prediction in the form required for the quiz:
finaloutput = data.frame(CaseID = 1:20, prediction = predict(fit.rf, validateset))
print(finaloutput)
## CaseID prediction
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B