The objective of this assignment is to predict the ‘class’ of barbell movement using data collected from accelerometers on the belt, forearm, arm, and dumbbell. Briefly, the data collection setup was as follows:
All empty data cells are assigned NA values. The structure of missing data is then examined visually and quantified by variable to see which variables are of limited value due to missingness.
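The loading step itself is not shown here; a minimal sketch, assuming the standard pml-training.csv and pml-testing.csv files and treating blank cells and #DIV/0! entries as NA (the file names and NA strings are assumptions, not taken from the original code):
# read the raw data, coding empty cells as NA at load time
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))  # validation set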
First, a plot of missingness: 5% of the observations are selected at random and plotted by row (barbell episode) and column (variable), with NA values coloured red.
library(VIM)  # provides matrixplot()
rows <- sample(nrow(training), floor(0.05*nrow(training)))  # random 5% of rows
matrixplot(training[rows,], cex.axis=0.5)  # NA cells are shown in red
Clearly some variables are mostly missing while others have nearly 100% of observations available. This suggests a rational cut-off on missingness for deciding which variables to include, as shown in the following plot:
library(ggplot2)
# count missing values per variable and convert to a proportion
missing <- data.frame(colSums(is.na(training)))
vars <- rownames(missing)
missing <- cbind(missing, vars)
n <- nrow(training)
names(missing) <- c("n_missing", "vars")
missing$vars <- as.character(missing$vars)
missing <- dplyr::mutate(missing, prop_missing=n_missing/n)
# empirical cumulative distribution of missingness across variables
p1 <- ggplot(missing, aes(x=prop_missing))
p1 + stat_ecdf(colour="darkgreen", size=1.5, alpha=0.7) +
  geom_abline(intercept = 0, slope = 1, colour="grey") + xlim(0,1) +
  xlab("Proportion of observations missing is less than") +
  ylab("Proportion of variables") +
  ggtitle("Cumulative frequency of missing observations by variable")
There are 59 variables with no missing observations, while the remaining variables are missing in roughly 97% of cases. We exclude those heavily missing variables from further analysis with the following code:
vars <- missing$vars[which(missing$prop_missing<0.5)]  # keep variables missing in fewer than 50% of rows
training <- training[,vars]
sum(complete.cases(training)) # n=19622, so every remaining row is complete
## [1] 19622
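These counts can be checked directly from the missing data frame built above; a quick sketch:
sum(missing$prop_missing == 0)                         # variables with no missing values (59, as reported above)
range(missing$prop_missing[missing$prop_missing > 0])  # missingness among the remaining variables (roughly 0.97 and above)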
Next we exclude any variables with minimal variability using the nearZeroVar function from the caret package.
library(caret)  # nearZeroVar(), createDataPartition(), preProcess(), train()
nzv <- nearZeroVar(training, saveMetrics=TRUE)
sum(nzv$nzv) # number of near-zero-variance variables to drop
## [1] 1
vars <- names(training)
keep.vars <- vars[which(nzv$nzv==FALSE)]
training <- training[,keep.vars]
rm(nzv)
Only one variable is dropped by this procedure.
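For reference, the name of the flagged variable can be inspected from the saved metrics before nzv is removed; a sketch (in this data set it is likely the new_window indicator, though that is an assumption rather than something shown above):
rownames(nzv)[nzv$nzv]  # name(s) of the near-zero-variance variable(s)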
The remaining variables, if numeric, are preprocessed by centering and scaling. The details of this preprocessing are stored for future use on the validation set.
preProcVars <- names(training)  # record the variable names in use at this point
preObj <- preProcess(training, method=c("center","scale"))  # affects numeric variables only
training <- predict(preObj, training)
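The preObj object carries the training-set means and standard deviations that will later be applied to the validation set; a quick look, assuming caret's usual $mean and $std components for the center/scale parameters:
head(preObj$mean)  # training-set means used for centering
head(preObj$std)   # training-set standard deviations used for scaling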
At this point, manual examination of the class and name of each remaining variable showed them all to be numeric except the user name and some time variables. Both are potentially problematic: they suggest dependence in the data structure that may not be capturable by a model applied to observations made at some future time point or on different users. The (dubious) approach we take is to remove the time variables but keep the user names, since the validation set contains the same users. This clearly has important implications for extending the model beyond those individuals.
time.vars <- grep("time", names(training))
training <- training[ ,-time.vars]
training$user_name <- as.factor(training$user_name)
training$classe <- as.factor(training$classe)
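A quick cross-tabulation of the outcome by user illustrates the dependence concern described above; a sketch:
table(training$user_name, training$classe)  # class counts per participant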
Next we split the full training set into a large training subset and a test subset reserved from the full training set. This is needed because we do not yet have a test set that can be used to compare models before validating out of sample.
inTrain <- createDataPartition(y=training$classe,
p=0.75, list=FALSE)
trainDF <- training[inTrain,]
testDF <- training[-inTrain,]
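A quick check of the resulting split; with 19622 complete rows, the 75/25 split gives roughly 14700 and 4900 observations (a sketch):
dim(trainDF)                       # approximately 75% of rows
dim(testDF)                        # approximately 25% of rows
prop.table(table(trainDF$classe))  # createDataPartition samples within each class, preserving proportions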
Five models were built using five different methods: decision tree (CART), random forest (RF), boosted trees (boost), linear discriminant analysis (lda), and support vector machine (svm).
Default settings within the caret package were used. The code is shown below, but the models are loaded from the working directory where they were stored previously, as they take hours to run (at least on my laptop!).
library(e1071)  # provides svm(); the other four models are fitted with caret::train()
CART <- train(classe~., data=trainDF, method="rpart")               # decision tree
RF <- train(classe~., data=trainDF, method="rf", importance=TRUE)   # random forest
boost <- train(classe~., data=trainDF, method="gbm", verbose=FALSE) # boosted trees
lda <- train(classe~., data=trainDF, method="lda")                  # linear discriminant analysis
svm <- svm(classe~., data=trainDF)                                  # support vector machine
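The store-and-load step mentioned above is not shown in the original code; a minimal sketch using saveRDS/readRDS (the file names are assumptions):
# after the first (slow) run, cache the fitted models in the working directory
saveRDS(RF, "RF.rds")
saveRDS(boost, "boost.rds")
# on later runs, load them instead of refitting
RF <- readRDS("RF.rds")
boost <- readRDS("boost.rds")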
Next we extract the accuracy of each model within the training subset it was built from.
CARTpred <- predict(CART, newdata=trainDF)
RFpred <- predict(RF, newdata=trainDF)
ldapred <- predict(lda, newdata=trainDF)
svmpred <- predict(svm, newdata=trainDF)
boostpred <- predict(boost, newdata=trainDF)
withinTrain <- data.frame(CARTpred, RFpred, ldapred, svmpred, boostpred,
classe=trainDF$classe)
CART.Accu <- confusionMatrix(CARTpred, withinTrain$classe)$overall['Accuracy']
RF.Accu <- confusionMatrix(RFpred, withinTrain$classe)$overall['Accuracy']
lda.Accu <- confusionMatrix(ldapred, withinTrain$classe)$overall['Accuracy']
svm.Accu <- confusionMatrix(svmpred, withinTrain$classe)$overall['Accuracy']
boost.Accu <- confusionMatrix(boostpred, withinTrain$classe)$overall['Accuracy']
accuracy <- c(CART.Accu, RF.Accu, lda.Accu, svm.Accu, boost.Accu)
model <- c("CART", "RF", "lda", "svm", "boost")
TRAINacc <- data.frame(model, accuracy)
print(TRAINacc)
## model accuracy
## 1 CART 0.4937492
## 2 RF 0.9995923
## 3 lda 0.7464329
## 4 svm 0.9484305
## 5 boost 0.9921185
The random forest model has near-perfect within-sample performance, with the boosted tree and SVM models also performing well. This is promising, as out-of-sample accuracy for these models should (in theory) also be good, although within-sample accuracy is an optimistic estimate.
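As an aside, the five confusionMatrix calls above could be collapsed into a single line; a sketch using the withinTrain data frame already built:
# accuracy for each prediction column against the true classe
sapply(withinTrain[, 1:5], function(pred)
  confusionMatrix(pred, withinTrain$classe)$overall["Accuracy"])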
Here we run the predictions in the subset of the full training set reserved for testing.
CARTpred <- predict(CART, newdata=testDF)
RFpred <- predict(RF, newdata=testDF)
ldapred <- predict(lda, newdata=testDF)
svmpred <- predict(svm, newdata=testDF)
boostpred <- predict(boost, newdata=testDF)
withinTEST <- data.frame(CARTpred, RFpred, ldapred, svmpred, boostpred,
classe=testDF$classe)
CART.Accu <- confusionMatrix(CARTpred, withinTEST$classe)$overall['Accuracy']
RF.Accu <- confusionMatrix(RFpred, withinTEST$classe)$overall['Accuracy']
lda.Accu <- confusionMatrix(ldapred, withinTEST$classe)$overall['Accuracy']
svm.Accu <- confusionMatrix(svmpred, withinTEST$classe)$overall['Accuracy']
boost.Accu <- confusionMatrix(boostpred, withinTEST$classe)$overall['Accuracy']
accuracy <- c(CART.Accu, RF.Accu, lda.Accu, svm.Accu, boost.Accu)
model <- c("CART", "RF", "lda", "svm", "boost")
TESTacc <- data.frame(model, accuracy)
print(TESTacc)
## model accuracy
## 1 CART 0.5004078
## 2 RF 0.9997961
## 3 lda 0.7463295
## 4 svm 0.9467781
## 5 boost 0.9916395
The random forest model is again the best performing and is selected for the validation sample.
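Beyond overall accuracy, the full confusion matrix for the random forest model on the held-out test subset gives per-class detail; a sketch using the test-set predictions computed above (output not reproduced here):
confusionMatrix(RFpred, testDF$classe)  # per-class sensitivity and specificity for the RF model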
First the validation set is preprocessed using the same variable selection and the same centering and scaling values as were used in the full training set.
problem_id <- testing$problem_id
# limit to variables used in models (at time before preprocessing)
preProcVars <- preProcVars[-58]
testing <- testing[ ,preProcVars]
# preprocess (center and scale) using the preProcess object created with the training set
testing <- predict(preObj, testing)
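One additional step worth noting: user_name was converted to a factor in the training set, so the same conversion (with matching levels) may be needed in the validation set before prediction; a sketch of that assumption:
# align factor levels with the training data (an assumption about what predict() requires here)
testing$user_name <- factor(testing$user_name, levels = levels(training$user_name))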
Now the validation set observations are run through the random forest model to generate the predictions.
predictions <- predict(RF, newdata=testing)
# store final predictions
validation_predictions <- data.frame(problem_id, predictions)
print(validation_predictions)
## problem_id predictions
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B