Devices such as the Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The data for this project come from [this source](https://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har).
The goal of this project is to predict the manner in which the participants did the exercise. This is recorded in the “classe” variable in the training set.
We are going to use the following libraries:
Sys.setlocale("LC_TIME", "English")
library(readr)
library(ggplot2)
library(caret)
library(ranger)
library(tidyverse)
First we load the training and test data, transforming all “#DIV/0!” values to NA.
training <- read.csv(file = "data/pml-training.csv", na.strings=c("#DIV/0!"), row.names = "X")
testing <- read.csv(file = "data/pml-testing.csv", na.strings=c("#DIV/0!"), row.names = "X")
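To illustrate what the `na.strings` argument does here, the following is a minimal sketch on made-up CSV text (the values and the `csv_text`, `only_div`, and `wider` names are assumptions for illustration, not part of the original analysis). It contrasts declaring only "#DIV/0!" as missing with also declaring the literal string "NA" and empty fields as missing.

```r
# Toy CSV text standing in for the real files (values are made up).
csv_text <- "X,roll_belt,kurtosis_roll_belt
1,1.41,#DIV/0!
2,1.42,NA
3,1.48,0.5"

# Only "#DIV/0!" is declared as missing, as in the calls above.
only_div <- read.csv(text = csv_text, na.strings = c("#DIV/0!"), row.names = "X")

# Alternative: also treat the literal string "NA" and empty fields as missing.
wider <- read.csv(text = csv_text, na.strings = c("#DIV/0!", "NA", ""), row.names = "X")

# Compare how the second column is typed and how many NAs it holds.
sapply(only_div, class)
sapply(wider, class)
colSums(is.na(wider))
```

With the narrower setting, a column that contains the text "NA" stays a character column, so it is not counted as missing by `is.na()`. The wider setting is only an alternative to consider; switching to it would change the column counts reported in the rest of this report.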
The training data has 159 columns. Several of them are user-specific or contain mostly NA values; these are removed to reduce the size of the data.
The user-specific columns are ‘user_name’, ‘raw_timestamp_part_1’, ‘raw_timestamp_part_2’, ‘cvtd_timestamp’, ‘new_window’, and ‘num_window’.
training <- training %>%
subset(select = -c(user_name, raw_timestamp_part_1, raw_timestamp_part_2,
cvtd_timestamp, new_window, num_window))
Every column that contains NA values is mostly NA (at least 97% of its entries), so we can remove these columns from the data.
percNA <- colMeans(is.na(training))
percNA[percNA > 0]
## kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt
## 0.9798186 0.9809398 1.0000000
## skewness_roll_belt skewness_roll_belt.1 skewness_yaw_belt
## 0.9797676 0.9809398 1.0000000
## max_yaw_belt min_yaw_belt amplitude_yaw_belt
## 0.9798186 0.9798186 0.9798186
## kurtosis_roll_arm kurtosis_picth_arm kurtosis_yaw_arm
## 0.9832841 0.9833860 0.9798695
## skewness_roll_arm skewness_pitch_arm skewness_yaw_arm
## 0.9832331 0.9833860 0.9798695
## kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 0.9795638 0.9794109 1.0000000
## skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 0.9795128 0.9793599 1.0000000
## max_yaw_dumbbell min_yaw_dumbbell amplitude_yaw_dumbbell
## 0.9795638 0.9795638 0.9795638
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## 0.9835898 0.9836408 1.0000000
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## 0.9835389 0.9836408 1.0000000
## max_yaw_forearm min_yaw_forearm amplitude_yaw_forearm
## 0.9835898 0.9835898 0.9835898
training <- training[, percNA == 0]
dim(training)
## [1] 19622 120
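The NA-share calculation can be tried on a small made-up data frame. This is a sketch only: the `toy`, `na_share`, and `toy_clean` names and all values are assumptions for illustration. It also shows column selection with a logical vector, which keeps working when no column has missing values.

```r
# Made-up data: one complete predictor, one mostly-NA summary column, one label.
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  kurtosis_roll_belt = c(NA, NA, NA, 0.5),
  classe = c("A", "A", "B", "B")
)

# is.na() gives a logical matrix; colMeans() turns it into the share of NAs per column.
na_share <- colMeans(is.na(toy))
na_share

# Keep only the columns without any missing values.
# Logical indexing keeps all columns when nothing is missing, whereas
# negative indexing with an empty which() result would select no columns.
toy_clean <- toy[, na_share == 0, drop = FALSE]
names(toy_clean)
```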
We still have 120 variables. To further reduce the number of predictors, we remove those that have near-zero variance.
nzVar <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[, nzVar$nzv == FALSE]
dim(training)
## [1] 19622 53
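As a rough illustration of what a near-zero-variance check looks at, the sketch below computes two statistics by hand in base R on a made-up column: the ratio of the most common value to the second most common, and the percentage of distinct values. The thresholds are taken to be caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`); this is an approximation of the idea, not caret's actual code, and `x`, `freq_ratio`, `pct_unique`, and `is_nzv` are illustrative names.

```r
# Made-up column: almost always 0, with a few other values.
x <- c(rep(0, 97), 1, 2, 3)

# Frequency of each distinct value, most common first.
counts <- sort(table(x), decreasing = TRUE)

# Ratio of the most common value to the second most common value.
freq_ratio <- as.numeric(counts[1] / counts[2])

# Percentage of distinct values among all observations.
pct_unique <- 100 * length(unique(x)) / length(x)

# Flag the column when both conditions hold (assumed default thresholds).
is_nzv <- freq_ratio > 95 / 5 && pct_unique <= 10
is_nzv
```

A column dominated by one value carries little information for the model, which is why such columns are dropped before training.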
Now the data is successfully reduced to 53 columns: 52 predictors plus the ‘classe’ outcome.
keep_cols <- names(training)
The goal of this project is to successfully predict the ‘classe’ category from the remaining predictors. First we transform this variable to a factor variable and inspect how the now clean training data is split up along this column.
training$classe <- as.factor(training$classe)
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
Since this is a classification problem and our data set is quite large, we are going to use the random forest method. A random forest already provides an out-of-bag error estimate and so does not strictly require cross-validation, but we hold out 25% of the training data as a validation set, as the assignment requires.
inTrain <- createDataPartition(y = training$classe, p=0.75, list=FALSE)
myTraining <- training[inTrain,]
crossVal <- training[-inTrain,]
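The idea behind a stratified 75/25 split can be sketched in base R on made-up labels. This is an approximation of what a class-balanced partition does, not caret's implementation; the seed value, the `toy_classe` labels, and the `in_train` name are assumptions for illustration, and the original analysis does not set a seed.

```r
# Arbitrary seed so that the sketch gives the same split on every run.
set.seed(1234)

# Made-up labels with unequal class sizes.
toy_classe <- factor(rep(c("A", "B", "C"), times = c(60, 28, 12)))

# Sample 75% of the row indices within each class separately.
in_train <- unlist(lapply(split(seq_along(toy_classe), toy_classe), function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

# Class shares in the full data and in the training part should match closely.
prop.table(table(toy_classe))
prop.table(table(toy_classe[in_train]))
```

Sampling within each class keeps the class proportions of the training and validation parts close to those of the full data, which matters here because class A is noticeably larger than the others.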
We are using the ranger package, which is a much faster implementation of the random forest algorithm than the randomForest package that caret calls for its "rf" method.
rf_model <- ranger(
classe ~ .,
data = myTraining,
num.trees = 500,
importance = "permutation"
)
Now let's use the model to predict on the validation set and calculate the accuracy.
crossVal$classe <- as.factor(crossVal$classe)
pred_valid <- predict(rf_model, crossVal)
confusionMatrix(pred_valid$predictions, crossVal$classe)$overall["Accuracy"]
## Accuracy
## 0.995106
The estimated out-of-sample error is 1 - accuracy, which in our case is about 0.5%.
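The arithmetic behind the error figure can be checked directly from the accuracy printed above. The second half of the sketch computes accuracy by hand from a made-up confusion table (the `tab` values are assumptions for illustration, not results from this analysis).

```r
# Validation accuracy as printed above.
accuracy <- 0.995106

# Estimated out-of-sample error is the complement of the accuracy.
oos_error <- 1 - accuracy
round(100 * oos_error, 2)  # error expressed as a percentage

# Accuracy computed by hand from a made-up 2x2 confusion table:
# correct predictions sit on the diagonal.
tab <- matrix(c(48, 2, 1, 49), nrow = 2,
              dimnames = list(predicted = c("A", "B"), actual = c("A", "B")))
toy_accuracy <- sum(diag(tab)) / sum(tab)
toy_accuracy
```

The ranger model object also stores an out-of-bag error estimate in `rf_model$prediction.error`, which can serve as a second check on the hold-out figure.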
The most important predictors in the model are below.
rf_vars <- data.frame(Variable = names(rf_model$variable.importance),
Importance = rf_model$variable.importance)
rf_vars <- rf_vars[order(-rf_vars$Importance),]
ggplot(data = rf_vars[1:15,], aes(x = reorder(Variable, Importance), y = Importance)) +
geom_col() +
coord_flip() +
labs(x = "Variable", title = "Variable Importance in the Random Forest Model") +
theme_minimal()
Finally, we predict the classes of the 20 test cases.
final <- predict(rf_model, testing)
final$predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
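The test set above is passed to `predict()` with all of its original columns. One optional tidy-up is to restrict it to the predictors that survived cleaning, using a saved column list such as `keep_cols`. The sketch below shows the idea on a made-up one-row data frame; `testing_toy`, `predictors`, and `testing_small` are hypothetical names, and the shortened `keep_cols` here is a stand-in for the real one.

```r
# Stand-in for the saved column names of the cleaned training data.
keep_cols <- c("roll_belt", "pitch_belt", "classe")

# Made-up test row with extra columns that the model never saw.
testing_toy <- data.frame(
  roll_belt = 1.2,
  pitch_belt = 8.1,
  user_name = "pedro",
  problem_id = 1
)

# The outcome is not present in the test set, so drop it from the list.
predictors <- setdiff(keep_cols, "classe")

# Keep only the predictor columns, in the same order as in training.
testing_small <- testing_toy[, predictors, drop = FALSE]
names(testing_small)
```

Selecting by name also fails loudly if a predictor is missing from the test file, which is easier to diagnose than an error raised inside the prediction call.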