Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The data for this project come from [this source](https://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har).

The goal of this project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set.

Data Loading and Processing

We are going to use the following libraries:

Sys.setlocale("LC_TIME", "English")
library(readr)
library(ggplot2)
library(caret)
library(ranger)
library(tidyverse)

First we load the training and test data, transforming all “#DIV/0!” values to NA.

training <- read.csv(file = "data/pml-training.csv", na.strings=c("#DIV/0!"), row.names = "X")
testing <- read.csv(file = "data/pml-testing.csv", na.strings=c("#DIV/0!"), row.names = "X")

The training data has 159 columns. Several of them are user specific or contain mostly NA values; these are removed to reduce the size of the data.

The user-specific and bookkeeping columns (user name, timestamps, and window markers) are ‘user_name’, ‘raw_timestamp_part_1’, ‘raw_timestamp_part_2’, ‘cvtd_timestamp’, ‘new_window’, ‘num_window’.

training <- training %>% 
  subset(select = -c(user_name, raw_timestamp_part_1, raw_timestamp_part_2, 
             cvtd_timestamp, new_window, num_window))

Every column that contains NA values is at least 97% NA, so these columns carry almost no information and we can remove them from the data.

percNA <- colMeans(is.na(training))
percNA[percNA > 0]

##      kurtosis_roll_belt     kurtosis_picth_belt       kurtosis_yaw_belt 
##               0.9798186               0.9809398               1.0000000 
##      skewness_roll_belt    skewness_roll_belt.1       skewness_yaw_belt 
##               0.9797676               0.9809398               1.0000000 
##            max_yaw_belt            min_yaw_belt      amplitude_yaw_belt 
##               0.9798186               0.9798186               0.9798186 
##       kurtosis_roll_arm      kurtosis_picth_arm        kurtosis_yaw_arm 
##               0.9832841               0.9833860               0.9798695 
##       skewness_roll_arm      skewness_pitch_arm        skewness_yaw_arm 
##               0.9832331               0.9833860               0.9798695 
##  kurtosis_roll_dumbbell kurtosis_picth_dumbbell   kurtosis_yaw_dumbbell 
##               0.9795638               0.9794109               1.0000000 
##  skewness_roll_dumbbell skewness_pitch_dumbbell   skewness_yaw_dumbbell 
##               0.9795128               0.9793599               1.0000000 
##        max_yaw_dumbbell        min_yaw_dumbbell  amplitude_yaw_dumbbell 
##               0.9795638               0.9795638               0.9795638 
##   kurtosis_roll_forearm  kurtosis_picth_forearm    kurtosis_yaw_forearm 
##               0.9835898               0.9836408               1.0000000 
##   skewness_roll_forearm  skewness_pitch_forearm    skewness_yaw_forearm 
##               0.9835389               0.9836408               1.0000000 
##         max_yaw_forearm         min_yaw_forearm   amplitude_yaw_forearm 
##               0.9835898               0.9835898               0.9835898

training <- training[, percNA == 0]
dim(training)

## [1] 19622   120
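
The NA filter used above can be illustrated on a toy data frame. This is only a minimal sketch: the data frame `toy`, the names `perc_na` and `toy_clean`, and the 0.5 cutoff `na_cutoff` are made up for the example and are not part of the original analysis.

```r
# Toy data: column a is complete, b is entirely NA, c is 75% NA
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, NA),
  c = c(NA, NA, NA, 5)
)

# Share of missing values per column
perc_na <- colMeans(is.na(toy))

# Keep only the columns below an (assumed) missingness cutoff
na_cutoff <- 0.5
toy_clean <- toy[, perc_na < na_cutoff, drop = FALSE]

print(perc_na)
print(names(toy_clean))
```

Using `drop = FALSE` keeps the result a data frame even when only one column survives the filter.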

We still have 120 variables. To further reduce the number of predictors, we remove those that have near zero variance.

nzVar <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[, nzVar$nzv == FALSE]

dim(training)

## [1] 19622    53

The data is now reduced to 53 columns: 52 predictors plus the ‘classe’ outcome.
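
To make the near zero variance idea concrete, here is a rough base R sketch of the two statistics that caret's nearZeroVar() reports: the frequency ratio of the most common value to the second most common, and the percentage of unique values. The helpers `nzv_stats` and `is_nzv` are hypothetical names introduced for illustration, and the sketch does not reproduce caret's implementation in every edge case; the cutoffs are caret's documented defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper: frequency ratio and percent of unique values
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value to the runner-up (Inf if only one value)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  percent_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, percent_unique = percent_unique)
}

# Hypothetical helper: flag a column using caret's default cutoffs
is_nzv <- function(s) {
  s[["freq_ratio"]] > 95 / 5 && s[["percent_unique"]] <= 10
}

# 99 zeros and a single one: dominated by one value
almost_constant <- c(rep(0, 99), 1)
# 100 distinct values: plenty of variation
varied <- seq_len(100)

stats_const <- nzv_stats(almost_constant)
stats_varied <- nzv_stats(varied)

print(stats_const)
print(is_nzv(stats_const))
print(is_nzv(stats_varied))
```

A column is flagged only when both conditions hold, so a highly skewed column with many distinct values is kept.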

keep_cols <- names(training)

The model

The goal of this project is to predict the ‘classe’ category from the remaining predictors. First we convert this variable to a factor and inspect how the cleaned training data is distributed across its levels.

training$classe <- as.factor(training$classe)
table(training$classe)

## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

Since this is a classification problem and our data set is quite big, we are going to use the random forest method. A random forest already provides an out-of-bag error estimate, so it does not strictly require cross validation, but we hold out 25% of the training data as a validation set because the assignment requires it.

inTrain <- createDataPartition(y = training$classe, p=0.75, list=FALSE)
myTraining <- training[inTrain,]
crossVal <- training[-inTrain,]
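
For a factor outcome, createDataPartition() samples within each class so that the split keeps the class proportions roughly intact. The base R sketch below illustrates that idea on a toy outcome; it is not caret's implementation, and the seed, the vector `y`, and the other names are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Toy outcome with unbalanced classes
y <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class
idx_by_class <- split(seq_along(y), y)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE))

train_y <- y[in_train]
valid_y <- y[-in_train]

# Class proportions are the same in both parts
print(prop.table(table(train_y)))
print(prop.table(table(valid_y)))
```

Setting a seed before the real createDataPartition() call would likewise make the split, and therefore the reported accuracy, reproducible.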

We are using the ranger package, which is a much faster implementation of the random forest algorithm than the randomForest package that caret uses for its "rf" method.

rf_model <- ranger(
  classe ~ .,
  data = myTraining,
  num.trees = 500,
  importance = "permutation"
)
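
As a small illustration of what a fitted ranger object exposes, the sketch below trains the same kind of model on R's built-in iris data instead of the project data. The seed and the object names are assumptions for the example; `prediction.error` and `variable.importance` are fields of the object that ranger() returns.

```r
library(ranger)

set.seed(1)  # arbitrary seed for reproducibility of the sketch
rf_iris <- ranger(
  Species ~ .,
  data = iris,
  num.trees = 500,
  importance = "permutation"
)

# Out-of-bag misclassification rate, estimated during training
oob_error <- rf_iris$prediction.error

# Permutation importance, one value per predictor
imp <- sort(rf_iris$variable.importance, decreasing = TRUE)

print(oob_error)
print(imp)
```

The out-of-bag error comes for free with the fit, which is why a separate validation set is a cross-check rather than a necessity for a random forest.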

Cross Validation

Now let's use the model to predict the validation set and calculate the accuracy.

crossVal$classe <- as.factor(crossVal$classe)
pred_valid <- predict(rf_model, crossVal)
# confusionMatrix() expects the predictions first and the reference (true) classes second
confusionMatrix(pred_valid$predictions, crossVal$classe)$overall["Accuracy"]

## Accuracy 
## 0.995106

The estimated out-of-sample error is 1 - accuracy, which in our case is about 0.5%.
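
As a quick check of the arithmetic, using the accuracy value printed above (the variable names are chosen for this example only):

```r
accuracy <- 0.995106          # value reported by confusionMatrix() above
oos_error <- 1 - accuracy     # estimated out-of-sample error rate
oos_error_pct <- round(100 * oos_error, 2)
print(oos_error_pct)
```

This gives an error of roughly 0.49%, or about 5 misclassified cases per 1000.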

The 15 most important predictors in the model, ranked by permutation importance, are plotted below.

rf_vars <- data.frame(Variable = names(rf_model$variable.importance), 
                      Importance = rf_model$variable.importance)

rf_vars <- rf_vars[order(-rf_vars$Importance),]

ggplot(data = rf_vars[1:15,], aes(x = reorder(Variable, Importance), y = Importance)) + 
  geom_col() +
  coord_flip() + 
  labs(x = "Variable", title = "Variable Importance in the Random Forest Model") +
  theme_minimal()

Predicting the test data

Finally, we use the model to predict the class of the 20 cases in the test data.

final <- predict(rf_model, testing)
final$predictions

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
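
The code above passes the raw test data to predict() unchanged. One possible way to make the column requirements explicit, using a list of kept column names such as `keep_cols`, is sketched below on toy data. All data frames and values here are invented stand-ins, and this is not a step the original analysis performs.

```r
# Toy stand-ins for the cleaned training set and the raw test set
train_toy <- data.frame(
  roll_belt = c(1.2, 1.4),
  yaw_belt = c(-94, -93),
  classe = factor(c("A", "B"))
)
test_toy <- data.frame(
  user_name = "user1",
  roll_belt = 1.3,
  yaw_belt = -92,
  problem_id = 1
)

keep_cols_toy <- names(train_toy)

# Predictors are the kept columns minus the outcome
predictor_cols <- setdiff(keep_cols_toy, "classe")

# Fail early if the test set lacks any predictor the model was trained on
missing_cols <- setdiff(predictor_cols, names(test_toy))
stopifnot(length(missing_cols) == 0)

# Restrict the test set to exactly the training predictors
test_aligned <- test_toy[, predictor_cols, drop = FALSE]
print(names(test_aligned))
```

Checking for missing predictors before predicting turns a silent mismatch between the two files into an immediate, readable error.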

Practical Machine Learning Assignment
