Summary

To predict human activity, several EDA techniques were applied to gain insight into the problem. 100 of the predictors had more than 95% missing values, and physics knowledge was needed to create new features (vector magnitudes). After highly correlated variables were excluded, 15 predictors were selected to build the machine learning model. A benchmark was created as a starting point (naive solution) with only 4 predictors, reaching 66% accuracy. Random Forest and Boosting models were then trained on all 15 predictors, with Random Forest emerging as the best model at 96% accuracy. For the test set, the predicted activities were B, A, C, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, A, B. Due to hardware limitations, these results correspond to only 8,000 instances of the training set.

Set environment

This report requires the following libraries:

library(dplyr); library(ggplot2); library(caret); library(reshape2)

For reproducibility, the seed will be set to 42.

set.seed(42)

Let's load the training and testing data:

training <- read.csv('pml-training.csv')
testing <- read.csv('pml-testing.csv')

Exploratory Data Analysis

First, all columns with more than 95% missing data will be removed; applying imputation techniques to such sparse columns could bias the data set.

missingCount <- sapply(training, function(x) sum(is.na(x) | x == ''))
columnsToRemove <- missingCount > 0.95 * nrow(training)
numberOfColumnsToRemove <- sum(columnsToRemove)
meaningfulColumns <- colnames(training)[!columnsToRemove]

There are 100 columns that do not satisfy this minimum requirement. From the missingCount variable, it can be verified that the remaining columns have no missing values.
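As a quick sanity check (a sketch, not part of the original pipeline), the retained columns can be confirmed to be complete:

# Every retained column should report zero missing or empty values
sum(missingCount[meaningfulColumns] > 0)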

The columns X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window and num_window contain user and measurement metadata unrelated to predicting the activity, so they will be removed. classe will be excluded as well to simplify the EDA.

columnsToKeep <- meaningfulColumns[!grepl('X|user|time|window|classe', meaningfulColumns)]
processedTraining <- training[columnsToKeep]

Now, check whether any predictors have near-zero variance:

nzv <- nearZeroVar(processedTraining, saveMetrics = TRUE)
nearZeroPredictors <- sum(nzv$nzv)

No near-zero-variance predictors were found, so no columns will be discarded for this reason.

In physics, the magnitude of a vector quantity such as distance, velocity or acceleration carries the information of its coordinate components, so new features (vector magnitudes) will be created as follows. (Later it will be found that the total_accel_* columns are highly collinear with the new acceleration magnitudes, so the latter will be discarded; perhaps the totals were recorded in different units.)

onlyAccelColumns <- columnsToKeep[grepl('accel.+(_x$|_y$|_z$)', columnsToKeep)]
acceleration <- processedTraining[onlyAccelColumns] %>%
                  mutate(accel_belt = sqrt(accel_belt_x^2 + accel_belt_y^2 + accel_belt_z^2),
                    accel_arm = sqrt(accel_arm_x^2 + accel_arm_y^2 + accel_arm_z^2),
                    accel_dumbbell = sqrt(accel_dumbbell_x^2 + accel_dumbbell_y^2 + accel_dumbbell_z^2),
                    accel_forearm = sqrt(accel_forearm_x^2 + accel_forearm_y^2 + accel_forearm_z^2))
acceleration <- acceleration[c('accel_belt', 'accel_arm', 'accel_dumbbell', 'accel_forearm')]
onlyGyrosColumns <- columnsToKeep[grepl('gyros.+(_x$|_y$|_z$)', columnsToKeep)]
gyros <- processedTraining[onlyGyrosColumns] %>%
                  mutate(gyros_belt = sqrt(gyros_belt_x^2 + gyros_belt_y^2 + gyros_belt_z^2),
                    gyros_arm = sqrt(gyros_arm_x^2 + gyros_arm_y^2 + gyros_arm_z^2),
                    gyros_dumbbell = sqrt(gyros_dumbbell_x^2 + gyros_dumbbell_y^2 + gyros_dumbbell_z^2),
                    gyros_forearm = sqrt(gyros_forearm_x^2 + gyros_forearm_y^2 + gyros_forearm_z^2))
gyros <- gyros[c('gyros_belt', 'gyros_arm', 'gyros_dumbbell', 'gyros_forearm')]
onlyMagnetColumns <- columnsToKeep[grepl('magnet.+(_x$|_y$|_z$)', columnsToKeep)]
magnet <- processedTraining[onlyMagnetColumns] %>%
                  mutate(magnet_belt = sqrt(magnet_belt_x^2 + magnet_belt_y^2 + magnet_belt_z^2),
                    magnet_arm = sqrt(magnet_arm_x^2 + magnet_arm_y^2 + magnet_arm_z^2),
                    magnet_dumbbell = sqrt(magnet_dumbbell_x^2 + magnet_dumbbell_y^2 + magnet_dumbbell_z^2),
                    magnet_forearm = sqrt(magnet_forearm_x^2 + magnet_forearm_y^2 + magnet_forearm_z^2))
magnet <- magnet[c('magnet_belt', 'magnet_arm', 'magnet_dumbbell', 'magnet_forearm')]

Next, append the magnitudes and remove the corresponding coordinate columns. This will significantly reduce the number of features.

processedTraining <- cbind(processedTraining, acceleration, gyros, magnet) %>%
                        select(-c(all_of(onlyAccelColumns), all_of(onlyMagnetColumns), all_of(onlyGyrosColumns)))
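As a quick check of the reduction (a sketch, not part of the original pipeline), the 36 coordinate columns should now be replaced by the 12 magnitudes:

# Expect 28 columns: coordinate triples replaced by their magnitudes
ncol(processedTraining)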

At this stage there are 28 possible predictors. Let's plot the correlation heatmap to check for correlated variables.

cormat <- cor(processedTraining)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile() +
  ggtitle('Predictors correlation matrix') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

It can be noticed that many variables are highly correlated (the very dark or very light spots). Let's find those relations. The following function will help to extract the relationships in a straightforward way:

findHighCorrelatedPredictor <- function(data, threshold, excludeColumns=c()) {
  # Filter dataset
  trainingPredictors <- data %>% select(-all_of(excludeColumns))
  # Calculate correlation matrix
  correlationMatrix <- cor(trainingPredictors)
  # Prepare variables
  predictorsNames <- colnames(trainingPredictors)
  allCorrelations <- c()
  
  for (predictorName in predictorsNames) {
    # Check value over threshold
    predictorsOverThreshold <- abs(correlationMatrix[, predictorName]) > threshold
    # Get correlated predictors
    correlatedPredictors <- predictorsNames[predictorsOverThreshold]
    # Remove relationship with itself
    correlatedPredictors <- correlatedPredictors[correlatedPredictors != predictorName]  
    # If there are matches, add the relations to the summary character vector
    if (length(correlatedPredictors) > 0) {
      highRelations <- paste0(predictorName, ' -> ', correlatedPredictors)
      allCorrelations <- c(allCorrelations, highRelations) 
    }
  }
  allCorrelations
}

The first step is to exclude all features that are very highly correlated (over 0.9). Using the function above and iteratively selecting columns to remove, the exclude variable is built:

exclude <- c('roll_belt', 'accel_belt', 'accel_forearm', 'accel_dumbbell', 'accel_arm', 'gyros_dumbbell')
findHighCorrelatedPredictor(data=processedTraining, threshold=0.9, excludeColumns=exclude)
processedTraining <- processedTraining %>% select(-all_of(exclude))

The second step is to repeat the previous one, but with a threshold of 0.7 (highly correlated):

exclude <- c('magnet_dumbbell', 'magnet_belt', 'gyros_belt', 'yaw_dumbbell', 'pitch_forearm', 'yaw_belt', 'pitch_dumbbell')
findHighCorrelatedPredictor(data=processedTraining, threshold=0.7, excludeColumns=exclude)
processedTraining <- processedTraining %>% select(-all_of(exclude))
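As a cross-check (a sketch; findCorrelation() is caret's built-in greedy correlation filter, not part of the original selection), no pair above the 0.7 cutoff should remain:

# Expected to return an empty character vector after both exclusion steps
findCorrelation(cor(processedTraining), cutoff = 0.7, names = TRUE)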

The last step is to save all selected columns and to create a helper function that applies these changes to the data sets in the following sections:

columnsToKeep <- colnames(processedTraining)
customPreProcess <- function(data) {
  data %>%
    mutate(gyros_arm = sqrt(gyros_arm_x^2 + gyros_arm_y^2 + gyros_arm_z^2),
           gyros_forearm = sqrt(gyros_forearm_x^2 + gyros_forearm_y^2 + gyros_forearm_z^2),
           magnet_arm = sqrt(magnet_arm_x^2 + magnet_arm_y^2 + magnet_arm_z^2),
           magnet_forearm = sqrt(magnet_forearm_x^2 + magnet_forearm_y^2 + magnet_forearm_z^2)) %>%
    select(all_of(columnsToKeep))
}
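As a sanity check (a sketch, assuming the helper exactly mirrors the manual pipeline above), applying it to the raw training data should rebuild processedTraining:

# The helper should reproduce the manually processed frame
isTRUE(all.equal(customPreProcess(training), processedTraining))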

A helper function to calculate prediction accuracy will also be useful:

calcAccuracy <- function(y_real, y_pred) {
  mean(y_real == y_pred)
}

Special considerations for this project

Due to hardware limitations, the training phase can't be executed on the full data set, so a helper function to generate a random sub-sample of 8,000 instances will be useful.

generateSampling <- function(nsamples) {
  sample(nrow(training), nsamples)
}

randomSample <- generateSampling(8000)
trainSampled <- training[randomSample,]

Splitting data for training and validation

20% of the data will be separated for validation purposes.

inTrain <- createDataPartition(y=trainSampled$classe, p=0.8, list=FALSE)
trainSet <- trainSampled[inTrain,]
valSet <- trainSampled[-inTrain,]
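Since createDataPartition stratifies on classe, both subsets should preserve the class proportions; a quick check (sketch):

# Class proportions should be nearly identical across the split
round(prop.table(table(trainSet$classe)), 3)
round(prop.table(table(valSet$classe)), 3)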

Benchmark

A benchmark will be useful for comparing all model implementations. As a simple (naive) model, it will use only the four total_accel_* predictors, trained with the same Random Forest method as the full models.

Using 5-fold cross-validation with accuracy as the metric, the model is trained to estimate the expected out-of-sample error:

accel_cols <- c('total_accel_arm', 'total_accel_forearm', 'total_accel_belt', 'total_accel_dumbbell')
trainingBenchmark <- trainSet %>% select(all_of(accel_cols))
trainingBenchmark <- cbind(trainingBenchmark, data.frame(classe=trainSet$classe))

train_control <- trainControl(method="cv", number=5)

benchModel <- train(classe ~ .,
                    data=trainingBenchmark,
                    method='rf',
                    trControl = train_control,
                    prox=TRUE)
benchAccuracy <- max(benchModel$results$Accuracy)  # CV accuracy of the best tuning parameter

From the validation set, the out-of-sample error can be estimated:

benchPred <- predict(benchModel, newdata = valSet)
benchOOSAcc <- calcAccuracy(valSet$classe, benchPred)

Benchmark accuracy: about 66% on the validation set (benchOOSAcc).
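Beyond plain accuracy, caret's confusionMatrix() reports per-class sensitivity and specificity (a sketch; the factor() wrapping is a defensive assumption in case classe was read as character):

# Per-class performance of the benchmark on the validation set
confusionMatrix(factor(benchPred), factor(valSet$classe))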

Model prediction

For the final predictions, the full set of 15 predictors selected during the EDA will be used, with the same 5-fold cross-validation and with centered and scaled inputs.

Below, the training data set is prepared:

processedTraining <- customPreProcess(trainSet)
processedTraining <- cbind(processedTraining, data.frame(classe=trainSet$classe))

processedValidation <- customPreProcess(valSet)

Next, a Random Forest model will be trained:

rfModelFit <- train(classe ~ .,
                    data=processedTraining,
                    method='rf',
                    trControl = train_control,
                    preProcess = c('center', 'scale'),
                    prox=TRUE)
rfAccuracy <- max(rfModelFit$results$Accuracy)  # CV accuracy of the best mtry

For the validation set:

rfPred <- predict(rfModelFit, newdata = processedValidation)
rfOOSAcc <- calcAccuracy(valSet$classe, rfPred)

Random Forest accuracy: about 96% on the validation set (rfOOSAcc).
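To see which of the selected predictors drive the Random Forest, caret's varImp() can be inspected (a sketch, not part of the original analysis):

# Rank predictors by importance in the fitted model
varImp(rfModelFit)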

Now, the same process with a boosting model (gbm):

boostModelFit <- train(classe ~ .,
                       data=processedTraining,
                       method='gbm',
                       trControl = train_control,
                       preProcess = c('center', 'scale'),
                       verbose=FALSE)
boostAccuracy <- max(boostModelFit$results$Accuracy)  # CV accuracy of the best tuning combination

For the validation set:

boostPred <- predict(boostModelFit, newdata = processedValidation)
boostOOSAcc <- calcAccuracy(valSet$classe, boostPred)

Boosting accuracy: lower than the Random Forest on the validation set (boostOOSAcc), so Random Forest remains the best model.
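For a fuller comparison than the two point estimates, caret's resamples() summarizes the fold-level cross-validation accuracies of both models (a sketch; it only assumes both models were trained with the same number of CV folds, as they were here):

# Compare the CV accuracy distributions of the two models
cvResults <- resamples(list(RF = rfModelFit, GBM = boostModelFit))
summary(cvResults)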

Predicting on the testing set

Using the Random Forest (the best model), the testing classes will be predicted:

processedTesting <- customPreProcess(testing)
testPred <- predict(rfModelFit, processedTesting)

Conclusions