Nowadays, the use of wearable devices has become as common as wearing ordinary clothing. Wearable technology has a variety of applications, and that variety grows as the field itself expands. A prominent example is the activity tracker, which people use to quantify how much of a particular activity they do. A common limitation, however, is that activity trackers rarely quantify or identify how well people perform a certain activity.
Using data from sensors on the belt, forearm, arm and dumbbell of 6 participants who were asked to perform a weight lifting exercise (the Unilateral Dumbbell Biceps Curl), I was able to investigate “how well” the activity was performed by the wearer through the use of machine learning models.
By splitting the dataset into a training set and a validation set in order to test the performance of the resulting algorithm, the model was able to correctly classify 99.8% of the instances.
The dataset used in this project comes from an experiment performed by Velloso et al. (2013)1, in which six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl.
The correctness of the exercise is defined in five classes: Class A corresponds to performing the exercise exactly according to the specification, Class B to throwing the elbows to the front, Class C to lifting the dumbbell only halfway, Class D to lowering the dumbbell only halfway, and Class E to throwing the hips to the front.
In other words, Class A corresponds to the perfect execution of the exercise, while the other four classes correspond to common mistakes.
The dataset contains the Euler angles (roll, pitch and yaw) and the raw accelerometer, gyroscope and magnetometer readings for each of the four sensors (belt, forearm, arm and dumbbell). For the Euler angles of each sensor, eight further features were derived: mean, variance, standard deviation, max, min, amplitude, kurtosis and skewness, generating 96 derived features in total.
The dataset is split into a training set and a testing set.
The training set contains 19,622 rows and the testing set only 20; both contain 160 columns. The goal of this project is to train a model on the training set in order to predict “how well” each subject performed the activity.
To get a sense of the data, let’s read the datasets:
library(data.table)
library(dplyr)
library(ggplot2)
training <- fread("../data/pml-training.csv")
testing <- fread("../data/pml-testing.csv")
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Let’s inspect the distribution of the classes that we want to predict:
library(scales)
ggplot(training) + aes(x = classe, fill = classe) +
  geom_bar(aes(y = ..count.. / sum(..count..))) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "", y = "proportion", x = "") +
  theme_minimal()
Class A has the greatest proportion among the classes, although in general the classes are well distributed. This indicates that we might have a “regular” multiclass classification problem at hand.
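As a quick numeric check of the proportions shown in the plot (a small sketch using the dplyr verbs loaded above), we can tabulate the classes directly:
## Class counts and proportions (numeric counterpart of the bar chart above):
training %>%
  count(classe) %>%
  mutate(prop = n / sum(n))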
As the dataset contains columns with NA values, I will keep only the variables that have no NA values at all:
## Identify the columns that contain at least one NA value:
na_col <- data.table(names(training),
                     sapply(training, function(x) mean(is.na(x)))) %>%
  filter(V2 > 0) %>% select(V1)
## Drop the row-index column (V1) and the NA columns from both sets:
training <- select(training, -c(V1, na_col$V1))
testing <- select(testing, -c(V1, na_col$V1))
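A quick sanity check (just a sketch) confirms whether any NA columns remain after this step:
## Number of columns that still contain NA values (expected to be 0):
sum(sapply(training, function(x) any(is.na(x))))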
Now let’s investigate the variables that are most correlated with the classe variable in order to get an idea of which features may be important for the analysis. As classe is a categorical variable, I create a dummy variable that takes the value 1 if the movement was done correctly (Class A) and 0 otherwise (Classes B, C, D and E):
##----------------------------------------------------------------------------##
## Analysing numeric variables:
training$classe_num <- ifelse(training$classe == "A", 1, 0)
## Correlation with numeric variables:
cor_numVar <- cor(select_if(training, is.numeric))
## Sort on decreasing correlations with classe_num:
cor_sorted <- sort(cor_numVar[, "classe_num"], decreasing = T) %>% as.matrix()
## Select only correlations with classe_num greater than 0.2 in abs value:
CorHigh <- apply(cor_sorted, 1, function(x) abs(x) > 0.2) %>% which %>% names
cor_numVar <- cor_numVar[CorHigh, CorHigh] ## selecting from correlation matrix
library(corrplot)
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")
From the correlation analysis, we can see that 8 variables show a correlation with classe_num greater than 0.2 in absolute value. Let’s examine some summary statistics for these variables:
library(summarytools)
descr(select(training, CorHigh, -classe_num),
      stats = c("mean", "sd", "min", "med", "max"),
      transpose = TRUE,
      omit.headings = TRUE, style = "rmarkdown")
| | Mean | Std.Dev | Min | Median | Max |
|---|---|---|---|---|---|
| magnet_arm_y | 156.61 | 201.91 | -392.00 | 202.00 | 583.00 |
| accel_forearm_x | -61.65 | 180.59 | -498.00 | -57.00 | 477.00 |
| magnet_forearm_x | -312.58 | 346.96 | -1280.00 | -378.00 | 672.00 |
| magnet_arm_z | 306.49 | 326.62 | -597.00 | 444.00 | 694.00 |
| accel_dumbbell_x | -28.62 | 67.32 | -419.00 | -8.00 | 235.00 |
| accel_arm_x | -60.24 | 182.04 | -404.00 | -44.00 | 437.00 |
| magnet_arm_x | 191.72 | 443.64 | -584.00 | 289.00 | 782.00 |
| pitch_forearm | 10.71 | 28.15 | -72.50 | 9.24 | 89.80 |
From the table above, we see that the variables have quite different ranges. This is an important issue because an algorithm trained on the raw data may be unduly influenced by the variables with the largest ranges. Hence, this is evidence that we should normalize the variables to a common range before training an algorithm on this data.
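As a minimal illustration (taking pitch_forearm as an example), the range scaling applied later via caret’s preProcess(method = "range") simply rescales each variable to the [0, 1] interval:
## Min-max (range) scaling of a single variable, just to illustrate the idea:
rng <- range(training$pitch_forearm)
pitch_scaled <- (training$pitch_forearm - rng[1]) / (rng[2] - rng[1])
summary(pitch_scaled)  ## values now lie between 0 and 1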
The modelling process consists of using the Random Forest algorithm to predict the classes based on the characteristics of the exercise performed by each individual.
In order to do that, I will split the training data into two subsets: a train set (train_data) and a validation set (val_data).
With the train_data, I will first calculate the parameters used to normalize the variables in both datasets. Then, I will split the dataset into 5 folds to perform cross validation (CV). The importance of this procedure is twofold: first, it gives an estimate of the error rate the algorithm will display when faced with new data; and second, it is the standard procedure for choosing the hyperparameters of the algorithm.
Once the model is estimated, the next step is to validate it on the val_data. That is, I will generate predictions on the val_data and, since we know its classes, we can check whether the error rate on this dataset is close to the CV error rate in order to detect any sign of overfitting.
Finally, once all of this is done, I append the train_data and val_data and re-fit a final model with the hyperparameter values chosen by the CV strategy.
library(caret)
## Drop the helper dummy variable and the metadata columns in positions 2:4:
train <- select(training, -c(classe_num, 2:4))
## Create train_data and val_data
set.seed(123)
inTrain <- createDataPartition(train$classe, p = .8, list = F)
train_data = train[inTrain]
val_data = train[-inTrain]
## Normalizing the dataset:
pre_processing <- preProcess(train_data, method = c("range"))
train_data_sc <- predict(pre_processing, train_data)
val_data_sc <- predict(pre_processing, val_data)
## Create the CV folds (returnTrain = TRUE so each fold holds the training-row
## indices, which is what trainControl's index argument expects):
folds <- createFolds(train_data$classe, k = 5, returnTrain = TRUE)
## Training the Random Forest:
modelRF <- train(classe ~ .,
                 train_data_sc,
                 tuneLength = 10,
                 method = "ranger",
                 trControl = trainControl(method = "cv",
                                          number = 5,
                                          index = folds))
After estimating the model on the train_data, the CV error rate was around 2%; that is, the accuracy was about 98%. The standard deviation of the accuracy was 0.3%, which means that with 95% probability the accuracy on the val_data should lie between 97.4% and 98.6% (assuming the distribution of the accuracy is approximately Gaussian).
Moreover, the CV strategy selected the following Random Forest hyperparameters: 27 as the number of variables to possibly split at in each node (mtry), a minimal node size of 1, and the Gini index as the splitting rule.
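These figures can be read directly off the fitted caret object, for example:
## Cross-validated accuracy (and its standard deviation) per hyperparameter setting:
modelRF$results
## Hyperparameter combination selected by CV:
modelRF$bestTune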
Analyzing the performance of the model on the val_data, we have:
## performance on validation set:
cm <- confusionMatrix(predict(modelRF, val_data_sc), as.factor(val_data_sc$classe))
ggplotConfusionMatrix <- function(m){
  p1 = ggplot(data = as.data.table(m$table),
              aes(x = Reference, y = Prediction)) +
    geom_tile(aes(fill = log(N)), colour = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(x = Reference, y = Prediction, label = N)) +
    theme(plot.title = element_text(hjust = 0.5),
          legend.position = "none") +
    labs(title = paste("Accuracy", percent_format()(m$overall[1])))
  return(p1)
}
ggplotConfusionMatrix(cm)
As we can see, the accuracy on the validation data was almost perfect, at 99.8%.
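The exact point estimate and its 95% confidence interval can also be pulled from the confusionMatrix object:
## Validation accuracy with its 95% confidence interval:
cm$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]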
The final step is to re-run the model on all the training data, setting the hyperparameter values to those chosen by CV. After that, we can make our last prediction on the 20 instances of the test set:
##----------------------------------------------------------------------------##
## Final model:
test <- testing %>% select(colnames(select(train, -classe)))
## pre-processing:
pre_processing <- preProcess(train, method = c("range"))
train_sc <- predict(pre_processing, train)
test_sc <- predict(pre_processing, test)
## Estimation final model:
modelRF_final <- train(classe ~ .,
                       train_sc,
                       tuneGrid = data.table(mtry = modelRF$bestTune$mtry,
                                             splitrule = modelRF$bestTune$splitrule,
                                             min.node.size = modelRF$bestTune$min.node.size),
                       method = "ranger",
                       importance = "impurity")
## Predicting the final model:
predict(modelRF_final, test_sc)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
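Since the final model was fitted with importance = "impurity", we can also take a quick look at which predictors drive the classification (a sketch using caret’s varImp):
## Impurity-based variable importance of the final model:
varImp(modelRF_final)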
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.