Based on a dataset provided by HAR (Human Activity Recognition), we will try to predict the activity that was performed, using the original data of 159 covariates and one outcome variable (classe).
The steps taken to address the problem are:

1. Process the data to make it easy to analyze.
2. Perform an exploratory data analysis to find relationships between the covariates.
3. Select the models that are going to be used.
4. Examine how well the models perform on the data.
5. Draw conclusions.
6. Predict the classe values for the validation dataset.
The required libraries are loaded.
# Loading the required libraries from the start
library(dplyr);library(caret); library(corrplot); library(Rmisc); library(pander)
The seed is set for reproducibility and the data is loaded.
# Getting the data
set.seed(124879)
training.raw <- read.csv("data/pml-training.csv")
validation <- read.csv("data/pml-testing.csv")
After that, the columns with a lot of missing values are removed; since there are plenty of covariates to fit from, the ones with most of their data missing can be discarded. Then, all the columns with timestamp in their name are removed, the justification being that these are mere time-tracking registers and not related to classe at all. The columns X, user_name and new_window are removed as well, since they are mostly id-related and not relevant covariate data.
# Remove the columns with a lot of missing values
naRemove <- which(colSums(is.na(training.raw) | training.raw == "")/
nrow(training.raw) > 0.15)
training.raw <- training.raw[, -naRemove]
validation <- validation[, -naRemove]
# Removing timestamps since they are not useful covariates
timeRemove <- grep("timestamp", names(training.raw))
training.raw <- training.raw[,-timeRemove]
validation <- validation[,-timeRemove]
# Convert classe to a factor, keeping the order of appearance as levels
classeLevels <- unique(training.raw$classe)
training.raw$classe <- factor(training.raw$classe, levels=classeLevels)
# Removing X, user_name, new_window columns
training.raw <- select(training.raw, -c("X", "user_name", "new_window"))
validation <- select(validation, -c("X", "user_name", "new_window"))
In case there are still columns with a few missing values, an imputer would be needed, so we check for that.
# Checking to see if an imputer is needed
sum(colSums(is.na(training.raw)))
## [1] 0
The answer zero means that there aren’t any missing values in our data.
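Had any missing values remained, an imputation step could have been added with caret's preProcess before modelling. A minimal sketch is shown below; it is not needed here, and note that the knnImpute method also centers and scales the numeric columns.
# Hypothetical imputation step, only needed if missing values had remained
imputer <- preProcess(training.raw, method = "knnImpute")
training.raw <- predict(imputer, training.raw)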
First, the data is split into a training and a testing dataset; together with the previously loaded validation set, this gives three datasets: training, testing and validation.
# Splitting the data
inTrain <- createDataPartition(y=training.raw$classe, p=0.7, list=FALSE)
training <- training.raw[inTrain,]
testing <- training.raw[-inTrain,]
outputIndex <- which(names(training) == "classe")
After that, correlations between the covariates and classe are computed and the four most correlated variables are extracted. New datasets called training.4, testing.4 and validation.4 are created; they are data frames containing just the aforementioned highly correlated predictors.
# Search correlations
correlations <- cor(training[,-outputIndex], as.numeric(training$classe))
# 4 Highest correlated variables to classe
highCorr <- subset(as.data.frame(correlations),abs(V1)>0.26)
corrVariables <- row.names(highCorr)
training.4 <- select(training, c("classe", all_of(corrVariables)))
outputIndex.4 <- which(names(training.4) == "classe")
testing.4 <- select(testing, c("classe", all_of(corrVariables)))
validation.4 <- select(validation, c("problem_id", all_of(corrVariables)))
pander(highCorr, caption="Highest correlated variables")
| Variable | Correlation with classe (V1) |
|---|---|
| magnet_belt_y | -0.2941 |
| magnet_arm_x | 0.3061 |
| magnet_arm_y | -0.262 |
| pitch_forearm | 0.3431 |
A plot of these covariates against the outcome classe is made using the multiplot function from Rmisc.
# Plotting the relationship between the variables of interest
p1 <- ggplot(training.4, aes(classe,pitch_forearm)) +
geom_boxplot(aes(fill=classe))
p2 <- ggplot(training.4, aes(classe,magnet_arm_y)) +
geom_boxplot(aes(fill=classe))
p3 <- ggplot(training.4, aes(classe,magnet_arm_x)) +
geom_boxplot(aes(fill=classe))
p4 <- ggplot(training.4, aes(classe,magnet_belt_y)) +
geom_boxplot(aes(fill=classe))
multiplot(p1,p2,p3,p4, cols=2)
Boxplots of the four selected covariates against classe
Plotting the complete correlation matrix.
# Plotting the corr matrix
correlations <- cor(training[, -outputIndex])
corrplot(correlations, method="color", type="lower", order="hclust",
tl.col="black", tl.srt = 45, diag = FALSE, tl.cex=0.60,
col=colorRampPalette(c("#80CED7","#63C7B2","#8E6C88","#263D42"))(100))
Plot of the correlation matrix
Given the high accuracy that the random forest and gradient boosting machine (GBM) algorithms typically achieve, these are going to be the two base models. Two setups are fitted for each algorithm: one using only the four highly correlated predictors and the other using all the covariates; this gives a total of four models. Also, for completeness, four new models are created after preprocessing the data with principal component analysis (PCA) retaining 80% of the variance. This adds up to a total of eight models to compare:
- modRF4: random forest using the four highly correlated predictors.
- modRFAll: random forest using all the possible predictors.
- modGBM4: GBM using the four highly correlated predictors.
- modGBMAll: GBM using all the possible predictors.
- modRF4.pca: random forest on the PCA-preprocessed data of the four highly correlated predictors.
- modRFAll.pca: random forest on the PCA-preprocessed data of all the possible predictors.
- modGBM4.pca: GBM on the PCA-preprocessed data of the four highly correlated predictors.
- modGBMAll.pca: GBM on the PCA-preprocessed data of all the possible predictors.
The models are fitted using the caret package and their accuracy is then evaluated on the test dataset. The number of trees in the random forests is set to 100, which gives good accuracy; the errors are not reduced by much after about 30 trees, as will be shown further down the document. In order not to retrain the models on every single run, they are saved into RDS files and then loaded when needed.
control <- trainControl(verboseIter = TRUE)
# Model using 4 best correlations
modRF4 <- train(classe ~ ., method="rf", data=training.4, trControl=control,
ntree=100, verbose=T, keep.forest=T)
modGBM4 <- train(classe ~ ., method="gbm", data=training.4, verbose=TRUE)
# Model with all covariates
modRFAll <- train(classe ~ .,method="rf", data=training, trControl=control,
ntree=100, verbose=T, keep.forest=T)
modGBMAll <- train(classe ~ ., method="gbm", data=training, verbose=TRUE)
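The chunk above does not show the saving step; since the fitted models are read back from RDS files below, something along the following lines is assumed, with the file names taken from the readRDS calls used later in the document.
# Save the fitted models so they can be reloaded without retraining
saveRDS(modRF4, "models/modRF4.rds")
saveRDS(modGBM4, "models/modGBM4.rds")
saveRDS(modRFAll, "models/modRFAll.rds")
saveRDS(modGBMAll, "models/modGBMAll.rds")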
These four models are then evaluated.
# Evaluating the non-PCA models
modRF4 <- readRDS("models/modRF4.rds")
modGBM4 <- readRDS("models/modGBM4.rds")
predRF4 <- predict(modRF4, testing)
confRF4 <- confusionMatrix(predRF4, testing.4$classe)
predGBM4 <- predict(modGBM4, testing)
confGBM4 <- confusionMatrix(predGBM4, testing.4$classe)
modRFAll <- readRDS("models/modRFAll.rds")
modGBMAll <- readRDS("models/modGBMAll.rds")
predRFAll <- predict(modRFAll, testing)
confRFAll <- confusionMatrix(predRFAll, testing$classe)
predGBMAll <- predict(modGBMAll, testing)
confGBMAll <- confusionMatrix(predGBMAll, testing$classe)
The next step is to apply the PCA preprocessing to the training, testing and validation data frames.
# Model using pca
pcaPreProcess <- preProcess(training[, -outputIndex], method = "pca",
thresh = 0.80)
pcaPreProcess.4 <- preProcess(training.4[, -outputIndex.4], method = "pca",
thresh = 0.80)
# PCA 4
training.4.pca <- predict(pcaPreProcess.4, training.4[, -outputIndex.4])
training.4.pca<- cbind(classe=training.4$classe, training.4.pca)
testing.4.pca <- predict(pcaPreProcess.4, testing.4[, -outputIndex.4])
testing.4.pca <- cbind(classe=testing$classe, testing.4.pca)
validation.4.pca <- predict(pcaPreProcess.4, validation.4[, -outputIndex.4])
validation.4.pca <- cbind(problem_id=validation.4$problem_id, validation.4.pca)
# PCA All
training.pca <- predict(pcaPreProcess, training[, -outputIndex])
training.pca <- cbind(training.pca, classe=training$classe)
testing.pca <- predict(pcaPreProcess, testing[, -outputIndex])
testing.pca <- cbind(testing.pca, classe=testing$classe)
validation.pca <- predict(pcaPreProcess, validation[, -outputIndex])
validation.pca <- cbind(validation.pca, problem_id=validation$problem_id)
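To see how much the dimensionality was actually reduced, the number of principal components retained for the 80% variance threshold can be inspected; this assumes the numComp element that caret's preProcess stores when method = "pca".
# Number of principal components kept to reach the 80% variance threshold
pcaPreProcess$numComp
pcaPreProcess.4$numComp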
The four PCA-models are fitted.
# Model using 4 best correlations after PCA
modRF4.pca <- train(classe ~ ., method="rf", data=training.4.pca,
trControl=control, ntree=100, verbose=T, keep.forest=T)
modGBM4.pca <- train(classe ~ ., method="gbm", data=training.4.pca,
verbose=TRUE)
# Model with all after PCA
modRFAll.pca <- train(classe ~ .,method="rf", data=training.pca,
trControl=control, ntree=100, verbose=T, keep.forest=T)
modGBMAll.pca <- train(classe ~ ., method="gbm", data=training.pca,
verbose=TRUE)
Evaluating the four PCA-models.
# Evaluating the PCA models
modRF4.pca <- readRDS("models/modRF4.pca.rds")
modGBM4.pca <- readRDS("models/modGBM4.pca.rds")
predRF4.pca <- predict(modRF4.pca, testing.4.pca)
confRF4.pca <- confusionMatrix(predRF4.pca, testing.4.pca$classe)
predGBM4.pca <- predict(modGBM4.pca, testing.4.pca)
confGBM4.pca <- confusionMatrix(predGBM4.pca, testing.4.pca$classe)
modRFAll.pca <- readRDS("models/modRFAll.pca.rds")
modGBMAll.pca <- readRDS("models/modGBMAll.pca.rds")
predRFAll.pca <- predict(modRFAll.pca, testing.pca)
confRFAll.pca <- confusionMatrix(predRFAll.pca, testing.pca$classe)
predGBMAll.pca <- predict(modGBMAll.pca, testing.pca)
confGBMAll.pca <- confusionMatrix(predGBMAll.pca, testing.pca$classe)
Each of the model accuracies is stored in a data frame to better compare them all.
# Showing the different accuracies
accuracies <- c(confRF4$overall[[1]], confRF4.pca$overall[[1]],
confRFAll$overall[[1]], confRFAll.pca$overall[[1]],
confGBM4$overall[[1]], confGBM4.pca$overall[[1]],
confGBMAll$overall[[1]], confGBMAll.pca$overall[[1]])
accuracies <- data.frame(accuracies)
row.names(accuracies) <- c("RF4", "RF4.pca", "RFAll", "RFAll.pca",
"GBM4", "GBM4.pca", "GBMAll", "GBMAll.pca")
pander(accuracies)
| Model | Accuracy |
|---|---|
| RF4 | 0.7196 |
| RF4.pca | 0.6085 |
| RFAll | 0.9986 |
| RFAll.pca | 0.9643 |
| GBM4 | 0.6309 |
| GBM4.pca | 0.5516 |
| GBMAll | 0.985 |
| GBMAll.pca | 0.7477 |
The random forest model with all the covariates has the highest accuracy, with a value of 0.9986406, and the lowest accuracy corresponds to the GBM4 model after PCA, with a value of 0.5515718. Although it might seem tempting to use the RFAll model on the validation data, such a high value might indicate overfitting, so it will not be used. Instead, the RFAll.pca model has good accuracy (0.9643161) and a much lower training time; for these reasons it will be used to submit the validation response.
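Since the testing set was held out from training, the expected out-of-sample error of the chosen model can be estimated directly from its test-set accuracy; a minimal sketch:
# Expected out-of-sample error of the chosen model (1 - test-set accuracy)
1 - confRFAll.pca$overall[["Accuracy"]]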
The PCA models don't seem to perform better than the other models in terms of sheer accuracy; however, their big advantage is that they require much less training time due to the lower amount of data. A well-chosen PCA model could obtain results similar to its non-PCA counterpart with adequate settings, and this is the advantage of using such preprocessing.
Another finding of this analysis is that when using only the covariates most correlated with the outcome, the error tends to converge faster with the number of trees than when using all the predictors.
par(mfrow=c(1,2))
plot(modRF4$finalModel, main="Correlated covariates model")
plot(modRFAll$finalModel, main="All covariates model")
Error as a function of the number of trees for the two random forest models
The final goal of the exercise was to predict the activity label for a set of 20 observations, named in this document the validation dataset. The model used for the final submitted response is modRFAll.pca. However, for comparison purposes, a data frame containing the predictions from all the models is shown.
# Validation RF
valRF4 <- predict(modRF4, validation.4)
valRF4.pca <- predict(modRF4.pca, validation.4.pca)
valRFAll <- predict(modRFAll, validation)
valRFAll.pca <- predict(modRFAll.pca, validation.pca)
# Validation GBM
valGBM4 <- predict(modGBM4, validation.4)
valGBM4.pca <- predict(modGBM4.pca, validation.4.pca)
valGBMAll <- predict(modGBMAll, validation)
valGBMAll.pca <- predict(modGBMAll.pca, validation.pca)
# Showing all the predictions from the different models
predictions <- data.frame(RF4=valRF4, RF4.pca=valRF4.pca, RFAll=valRFAll,
RFAll.pca=valRFAll.pca, GBM4=valGBM4, GBM4.pca=valGBM4.pca,
GBMAll=valGBMAll, GBMAll.pca=valGBMAll.pca)
pander(predictions)
| RF4 | RF4.pca | RFAll | RFAll.pca | GBM4 | GBM4.pca | GBMAll | GBMAll.pca |
|---|---|---|---|---|---|---|---|
| B | C | B | B | B | B | B | C |
| A | C | A | A | C | C | A | A |
| A | A | B | A | B | A | B | A |
| A | A | A | A | A | A | A | A |
| A | A | A | A | A | A | A | A |
| E | E | E | E | E | E | E | E |
| D | D | D | D | D | D | D | D |
| B | E | B | B | D | D | B | E |
| A | A | A | A | A | A | A | A |
| A | A | A | A | A | A | A | A |
| C | B | B | B | C | A | B | C |
| C | C | C | C | C | A | C | C |
| E | B | B | B | E | E | B | B |
| A | A | A | A | A | A | A | A |
| E | E | E | E | E | E | E | E |
| E | E | E | E | C | B | E | E |
| A | A | A | A | A | A | A | A |
| B | D | B | B | B | D | B | B |
| B | B | B | B | B | B | B | D |
| B | B | B | B | B | B | B | B |
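To make the submitted answers explicit, the predictions of the chosen modRFAll.pca model can be paired with the problem identifiers; a small sketch:
# Final submission: pair each problem_id with the chosen model's prediction
data.frame(problem_id = validation$problem_id, prediction = valRFAll.pca)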