6 September 2018

This machine learning project uses data1 from accelerometer devices worn on the belt, forearm, arm, and barbell of six participants, who were asked to perform barbell lifts correctly and incorrectly in five different ways. The goal is to predict the manner in which participants did the exercise, using the dependent variable classe in the training dataset. The classe variable is a factor with five classes: A, exercise performed exactly according to the specification; B, throwing the elbows to the front; C, lifting the dumbbell only halfway; D, lowering the dumbbell only halfway; and E, throwing the hips to the front. Class A corresponds to the correct execution of the exercise, while the other four classes correspond to common mistakes.

In the following presentation, R code that produced the analysis is shown alongside explanatory text.

# Data setup

library(readr)
library(tidyverse)
library(nortest)
library(Hmisc)
library(FNN)
library(heplots)
library(knitr)
library(pander)
library(kableExtra)

options(max.print=1000000)
options(scipen=999)

# Un-comment the following lines to download the raw data.
# download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", 
#               "pml-training.csv")
# download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", 
#               "pml-testing.csv")
pml_training <- read_csv("pml-training.csv", guess_max = 19622)
pml_testing <- read_csv("pml-testing.csv", guess_max = 20)
# Data Wrangling, Exploratory Analysis, and Model Building

# (Code used in the exploratory phase is commented out here to save processing time.)

# glimpse(pml_training, width = 90)               # A concise look at all the variables.
# sapply(pml_training, function(x) sum(is.na(x))) # Count the NAs in every column.
# sapply(pml_training, function(x) summary(x))    # Summarize each variable.

# Delete NA variables.
pml_training_no_NA <- pml_training[, -which(colMeans(is.na(pml_training)) > 0.5)] 
pml_testing_no_NA <- pml_testing[, -which(colMeans(is.na(pml_testing)) > 0.5)]

# Delete irrelevant variables.
pml_training_no_NA <- select(pml_training_no_NA, "roll_belt":"classe")            
pml_testing_no_NA <- select(pml_testing_no_NA, "roll_belt":"magnet_forearm_z")

# sapply(pml_training_no_NA, function(x) summary(x)) # Summarize each variable.
# sapply(pml_training_no_NA[, 1:52], function(x) {mean(x) - median(x)}) # Compute (mean-median).
# sapply(pml_training_no_NA[, 1:52], function(x) lillie.test(x)) # Normality tests. 
# pairs(pml_training_no_NA, upper.panel = NULL)      # Pairwise scatterplots of predictors.
# rcorr(as.matrix(pml_training_no_NA[-53]))          # Correlation matrix and significance tests.

# Convert character variable classe to a factor.
pml_training_no_NA$classe <- as.factor(pml_training_no_NA$classe) 

classe <- pml_training_no_NA$classe # Need classe in its own vector for FNN::knn.
pml_training_no_NA_no_classe <- pml_training_no_NA[-53] # Get rid of classe for FNN::knn.

# Test for homogeneity of covariance matrices.
theBoxM <- boxM(pml_training_no_NA_no_classe, pml_training_no_NA$classe) 

Data Wrangling, Exploratory Analysis, and Model Building

Of the 160 variables in the training set, there were many with high occurrence of NA values (missing data). The number of NA values was always 19216, close to the total number of observations in the training set (19622), possibly because of a device failure. Those variables were eliminated from the analysis, in addition to several other variables deemed irrelevant to predicting exercise performance.

This left 52 quantitative predictor variables with no missing values. Predictor variables represented various aspects of movement of the body, arm, and barbell as measured by accelerometers worn by the participants. The Lilliefors test for normality (package nortest), plus computing the difference between the mean and median, and viewing of pairwise scatterplots, showed that all 52 predictors were significantly non-normal in distribution. In addition, Box’s M-test (package heplots) indicated extreme heterogeneity of predictor covariance matrices across levels of the dependent variable classe (p < 0.00000000000000022). Violation of these two assumptions eliminated both linear and quadratic discriminant analysis as choices for classification model building. Logistic regression was also eliminated due to the high degree of multicollinearity in the predictor variables, as indicated by matrices of predictor correlations and p-values (function rcorr).

Given the data limitations, we chose k-Nearest Neighbor Classification (KNN, package FNN) to build our predictive engine. KNN is completely nonparametric and makes no assumptions about the distribution of the data, so that it is a “model-free” approach to classification.2 To predict a case from the test data, KNN finds the k training observations that are closest to the test case, using a Euclidean distance measure. The test case is then assigned to the class to which the majority of the k nearest neighbors belong. KNN is an appropriate choice for classification problems in which the data do not meet the assumptions of methods based on linear models. However, because KNN does not fit a model to the training data, it does not tell us which predictors have an important relationship to the dependent variable.3 This aspect of KNN was not considered to be critical as this project was focused on prediction, not interpretation.

# k-Nearest Neighbor Classification and Cross-Validation

# Leave-one-out cross-validation of the training set.
theKNN_CV <- knn.cv(pml_training_no_NA_no_classe, classe, k = 3, prob = TRUE)

tab1 <- table(classe, theKNN_CV)
tab1 %>%
  kable("html", caption = "Table 1. Confusion matrix for the k-Nearest 
  Neighbor Classification, using a leave-one-out cross-validation of the training data and
  k = 3 nearest neighbors.") %>%   
    kable_styling(bootstrap_options = "bordered", full_width = T, position="center") %>% 
    footnote(general = "Exercise categories: A, exercise performed exactly according to the specification; B, throwing the elbows to the front; C, lifting the dumbbell only halfway; D, lowering the dumbbell only halfway; and E, throwing the hips to the front. Class A corresponds to the correct execution of the exercise, while the other four classes correspond to common mistakes.", general_title = "")
Table 1. Confusion matrix for the k-Nearest Neighbor Classification, using a leave-one-out cross-validation of the training data and k = 3 nearest neighbors.
A B C D E
A 5491 32 18 32 7
B 87 3565 76 32 37
C 22 68 3268 47 17
D 17 9 101 3065 24
E 24 47 30 52 3454
Exercise categories: A, exercise performed exactly according to the specification; B, throwing the elbows to the front; C, lifting the dumbbell only halfway; D, lowering the dumbbell only halfway; and E, throwing the hips to the front. Class A corresponds to the correct execution of the exercise, while the other four classes correspond to common mistakes.

k-Nearest Neighbor Classification and Cross-Validation

Table 1 shows the confusion matrix for the k-Nearest Neighbor Classification, using a leave-one-out cross-validation of the training dataset with k = 3 nearest neighbors. Matrix elements on the diagonal show the number of cases that were correctly classified, while off-diagonal elements show misclassifications.

From Table 1 we can calculate the estimated out-of-sample error rate as:

1 - ((5491 + 3565 + 3268 + 3065 + 3454) / 19622) = 0.0397,

or approximately 4%.

Classification of the Test Data

# Classification of the Test Data

theKNN <- FNN::knn(pml_training_no_NA_no_classe, pml_testing_no_NA, classe, k = 3, prob=TRUE)

# tab2 <- t(theKNN[1:length(theKNN)])
# tab2 %>%
#   kable("html", caption = "Table 2. Classification of 20 test cases using k-Nearest Neighbor 
#       Classification with k = 3 nearest neighbors.", col.names = as.character(1:20)) %>%   
#     kable_styling(bootstrap_options = "bordered", full_width = T, position="center")

The same KNN algorithm was used to classify the test dataset, consisting of 20 different test cases. All test cases were classified correctly.


  1. See Human Activity Recognition.

  2. Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning. Second edition. Springer-Verlag New York Inc.

  3. James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. Springer-Verlag New York Inc.