c08w04cp

## Loading required package: lattice

## Loading required package: ggplot2

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

Synopsis

Our task is to predict how well several test subjects perform a set of physical exercises. We will build several models, select the best one, and assess its accuracy. Because the outcome is categorical, we measure accuracy as the fraction of correct predictions among all predictions. Because we compare multiple models, we split our data into three parts: one for training, one for model selection, and one held out for final validation.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Reproducibility

To be able to reproduce our findings and selection, we will set the seed and provide you with the full code.

Downloading data

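#
# load packages and download data
#
library(caret)         # nearZeroVar, createDataPartition, preProcess, train, confusionMatrix
library(randomForest)  # randomForest

url_training <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_testing <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

full_training <- read.csv(url(url_training))
full_testing <- read.csv(url(url_testing))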

Preprocessing

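We will remove unnecessary data in three steps. First, we remove columns that add little or no information, identified via a call to nearZeroVar. Second, we drop the row index and timestamp columns, which are bookkeeping rather than sensor measurements. Third, we remove columns in which more than half of the values are NA.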

#
# drop near-zero-variance columns (identified on the training set) from both sets
#
nvz <- nearZeroVar(full_training)
full_training <- full_training[,-nvz]
full_testing <- full_testing[,-nvz]

# drop row index and timestamp columns, which carry no sensor information
remove_cols <- c("X", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "problem_id")
keep_cols <- !names(full_training) %in% remove_cols

full_training <- full_training[,keep_cols]
full_testing <- full_testing[,keep_cols]

# drop columns in which more than half of the training values are NA
remcols <- sapply(names(full_training), FUN = function(x) { mean(is.na(full_training[x])) })
full_training <- full_training[,remcols <= 0.5]
full_testing <- full_testing[,remcols <= 0.5]
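
The NA filter above can be illustrated on a small example. This is only a sketch in base R: the data frame `toy` and its column names are made up for illustration, `colMeans(is.na(...))` is used as an equivalent of the `sapply` call, and the constant-column check is a crude stand-in for caret's `nearZeroVar`, which applies frequency-ratio and unique-value cutoffs instead.

```r
# Toy data frame (hypothetical): column b is mostly NA, column c is constant.
toy <- data.frame(
  a = c(1.2, 3.4, 2.2, 5.1),
  b = c(NA, NA, NA, 0.7),
  c = c(1, 1, 1, 1)
)

# Fraction of missing values per column.
na_frac <- colMeans(is.na(toy))

# Keep only columns with at most 50% missing values.
toy_kept <- toy[, na_frac <= 0.5, drop = FALSE]

# Crude zero-variance check: drop columns with a single unique value.
is_constant <- vapply(toy_kept, function(col) length(unique(col)) == 1, logical(1))
toy_clean <- toy_kept[, !is_constant, drop = FALSE]

print(names(toy_clean))
```

Using `drop = FALSE` keeps the result a data frame even when only one column survives the filter.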

Data splitting

We have about 19k observations, which is enough to split the data into three parts: 1) training data (about 60%), used to train the models; 2) testing data (about 20%), used to select the best model; 3) validation data (about 20%), held out to assess the selected model.

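#
# divide data into a training, test and validation set
#
set.seed(12345)
# 80% of all rows, stratified by classe; the remaining 20% become the validation set
inTrain1 <- createDataPartition(full_training$classe, p = 0.8, list = FALSE)
# 75% of those rows (60% overall); this call receives the index vector, so inTrain2
# holds positions within inTrain1 and the split is balanced over row position, not classe
inTrain2 <- createDataPartition(inTrain1, p = 0.75, list = FALSE)

data_validation <- full_training[-inTrain1,]
data_training <- full_training[inTrain1,]
data_testing <- data_training[-inTrain2,]
data_training <- data_training[inTrain2,]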
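
Taking 80% of the rows and then 75% of that subset yields 0.8 × 0.75 = 60% for training, with 20% each for testing and validation. The arithmetic can be checked with a minimal base R sketch. It uses a plain random permutation rather than caret's stratified `createDataPartition`, and the row count `n` is a made-up value.

```r
set.seed(12345)
n <- 1000                      # hypothetical number of rows
idx <- sample(n)               # random permutation of the row indices

n_train <- round(0.6 * n)
n_test  <- round(0.2 * n)

train_idx <- idx[seq_len(n_train)]
test_idx  <- idx[n_train + seq_len(n_test)]
valid_idx <- idx[(n_train + n_test + 1):n]

# Share of rows in each part.
shares <- c(train = length(train_idx),
            test = length(test_idx),
            validation = length(valid_idx)) / n
print(shares)
```

The three index sets are disjoint and together cover every row, which is the property the three-way split relies on.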

Training the models

We train four different models: Random Forest, Linear Discriminant Analysis, CART and Quadratic Discriminant Analysis. Gradient boosting models such as gbm or xgboost are left out of this comparison. Their simplest setup targets a binary outcome, whereas ours has five levels; both packages do offer multiclass modes, and it would also be possible to turn the five-level outcome into five binary outcomes, e.g. outcome 1: A / notA, outcome 2: B / notB, and so on. To keep things simple, I will not do that in this analysis.
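
The A / notA, B / notB recoding mentioned above could look roughly like this. It is only an illustration in base R on a made-up `classe` vector, and nothing in the analysis depends on it.

```r
# Hypothetical outcome vector with the five class labels.
classe <- factor(c("A", "B", "C", "D", "E", "A", "C"))

# Build one two-level factor per class: "A"/"notA", "B"/"notB", ...
one_vs_rest <- lapply(levels(classe), function(lv) {
  other <- paste0("not", lv)
  factor(ifelse(classe == lv, lv, other), levels = c(lv, other))
})
names(one_vs_rest) <- levels(classe)

print(one_vs_rest$A)
```

Each element of `one_vs_rest` could then serve as the outcome of a separate binary classifier, with the final class taken from whichever classifier is most confident.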

#
# train the models
#
# centering and scaling parameters are estimated on the training set only
preObj <- preProcess(data_training, method = c("center", "scale"))
p <- predict(preObj, data_training)

fit1 <- randomForest(classe ~ ., data = p)            # Random Forest
fit2 <- train(classe ~ ., data = p, method = "lda")   # Linear Discriminant Analysis
fit3 <- train(classe ~ ., data = p, method = "rpart") # CART
fit4 <- train(classe ~ ., data = p, method = "qda")   # Quadratic Discriminant Analysis
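
The centering and scaling step uses means and standard deviations learned from the training data and applies the same values to any later data set. A minimal base R sketch of that idea follows. The column names `roll` and `pitch`, the numbers, and the helper `standardize` are all made up for illustration, and this is not caret's internal implementation.

```r
# Hypothetical numeric predictors for training and for new data.
train_x <- data.frame(roll = c(1, 2, 3, 4), pitch = c(10, 20, 30, 40))
new_x   <- data.frame(roll = c(2.5, 5),     pitch = c(25, 0))

# Parameters are estimated on the training data only.
mu    <- vapply(train_x, mean, numeric(1))
sigma <- vapply(train_x, sd, numeric(1))

# Apply the training parameters to any data frame with the same columns.
standardize <- function(df, mu, sigma) {
  as.data.frame(scale(df, center = mu, scale = sigma))
}

train_z <- standardize(train_x, mu, sigma)
new_z   <- standardize(new_x, mu, sigma)

print(new_z)
```

Reusing the training parameters keeps the testing and validation sets from leaking information into the preprocessing step.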

Testing for the best model

We now compare the models with each other on the testing set. We also fit a combined Random Forest on the predictions of all four models to see whether combining them improves on the best single model.

#
# test the different models on the testing set
#
p2 <- predict(preObj, data_testing)
acc <- data.frame(names = c("RF", "LDA", "RPART", "QDA", "MIXED RF"), accuracy = numeric(5), stringsAsFactors = FALSE)
acc[1,2] <- confusionMatrix(predict(fit1, p2), data_testing$classe)$overall[1]
acc[2,2] <- confusionMatrix(predict(fit2, p2), data_testing$classe)$overall[1]
acc[3,2] <- confusionMatrix(predict(fit3, p2), data_testing$classe)$overall[1]
acc[4,2] <- confusionMatrix(predict(fit4, p2), data_testing$classe)$overall[1]
# stack the four sets of predictions; the stacked model is fitted and scored on the same testing rows
df <- data.frame(rf = predict(fit1, p2), lda = predict(fit2, p2), rpart = predict(fit3, p2), qda = predict(fit4, p2), classe = data_testing$classe)
pred <- randomForest(classe ~ ., data = df)
acc[5,2] <- confusionMatrix(predict(pred, df), df$classe)$overall[1]
print(acc)

##      names  accuracy
## 1       RF 0.9971967
## 2      LDA 0.7520387
## 3    RPART 0.4938838
## 4      QDA 0.9133537
## 5 MIXED RF 0.9971967
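
Each accuracy value above is the fraction of rows whose predicted class matches the true class, which is the sum of the diagonal of the confusion matrix divided by the total count. A small base R sketch with made-up labels shows the calculation without caret:

```r
# Hypothetical true and predicted labels for eight observations.
truth <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = c("A", "B", "C"))
pred  <- factor(c("A", "B", "B", "B", "C", "A", "C", "A"), levels = c("A", "B", "C"))

# Confusion matrix: rows are predictions, columns are the reference classes.
cm <- table(Prediction = pred, Reference = truth)

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```

Six of the eight toy predictions are correct, so the accuracy is 0.75; `mean(pred == truth)` gives the same value.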

Choosing a model

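After testing five different models, we see that Random Forest performs best by a clear margin. The combined predictor built from the four base models reaches exactly the same accuracy as Random Forest alone, which suggests that it only reproduces the Random Forest predictions. We therefore choose Random Forest as our final model and use the validation set to estimate its out-of-sample error rate.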

Because a Random Forest aggregates the votes of many individual trees, there is no single tree to plot, so we skip a visualization of the model.

p3 <- predict(preObj, data_validation)

confusionMatrix(predict(fit1, p3), data_validation$classe)$overall[1]

##  Accuracy 
## 0.9956666

Accuracy of our model

With a validation accuracy of 0.9957, we expect an out-of-sample accuracy of about 99%, that is, an out-of-sample error rate below 1%.

Predictions for Quiz 2

With our predicted accuracy of about 99%, we can expect to get 19 or 20 out of 20 right.

p4 <- predict(preObj, full_testing)
predict(fit1, p4)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
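
The expectation of 19 or 20 correct answers can be checked with a short binomial calculation. The sketch assumes that the 20 quiz cases are independent and that each one is classified correctly with probability equal to the validation accuracy; both are simplifying assumptions.

```r
p_correct <- 0.9956666   # validation accuracy reported above
n_cases   <- 20          # number of quiz cases

# Probability that all 20 predictions are correct.
p_all_20 <- dbinom(n_cases, size = n_cases, prob = p_correct)

# Probability of at least 19 correct predictions, i.e. P(X > 18).
p_at_least_19 <- pbinom(18, size = n_cases, prob = p_correct, lower.tail = FALSE)

# Expected number of correct predictions.
expected_hits <- n_cases * p_correct

print(c(all_20 = p_all_20, at_least_19 = p_at_least_19, expected = expected_hits))
```

Under these assumptions the chance of a perfect score is roughly 0.92, and the chance of at least 19 correct answers is above 0.99.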