1 Overview

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, and to predict the manner in which they did the exercise. More information is available at http://groupware.les.inf.puc-rio.br/har.

The data for this project come from an article written by W. Ugulino et al.2. In Sect. 2, we split the data into training and testing datasets. Then, we take a look at the output variable.

We do some preprocessing on our training set in Sect. 3. This includes removing some harmful variables, dealing with missing values, standardizing the data (toward a Gaussian distribution), excluding highly correlated variables, and applying principal component analysis (PCA) to the data.

Sect. 4 is dedicated to building some reasonable model(s), including linear discriminant analysis, classification trees, k-nearest neighbors, support vector machines with a non-linear kernel, and random forests. We use 10-fold cross-validation to estimate the accuracy of our models. The most accurate models in this case are random forests and k-nearest neighbors, the former being a little more accurate than the latter. We finally test our model on our testing dataset, and the result is promising.

2 Data

In the following, we load our data and split it into two sets, train.set and test.set. We will build our models on train.set and check on test.set whether they overfit.
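
The code in this report relies on a handful of R packages: caret for data splitting, preprocessing, and model training, dplyr for column selection, and corrplot for the correlation plot (ggplot2 and lattice, used for the figures, are loaded by caret). A minimal setup sketch, assuming these packages are installed, would be:

library(caret)     # createDataPartition, preProcess, train, confusionMatrix, ...
library(dplyr)     # select() and the %>% pipe
library(corrplot)  # correlation plot in Sect. 3
# caret additionally needs the rpart, kernlab, and randomForest packages
# installed for the CART, SVM, and random-forest models in Sect. 4.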

dataset <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"), header=T, na.strings=c("","NA"))
dim(dataset)
[1] 19622   160
set.seed(2)
inTrain <- createDataPartition(dataset$classe, p = 0.7, list = FALSE)
train.set <- dataset[inTrain, ]
test.set <- dataset[-inTrain, ]
dim(test.set)
[1] 5885  160
dim(train.set) 
[1] 13737   160

As we see, the dataset has 160 variables. The outcome we want to predict is the classe variable, a categorical variable with the levels A, B, C, D, and E. Now, let’s take a look at the distribution of classe in the training set:

percentage <- prop.table(table(train.set$classe)) * 100
cbind(freq=table(train.set$classe), percentage=round(percentage, 2))
  freq percentage
A 3906      28.43
B 2658      19.35
C 2396      17.44
D 2252      16.39
E 2525      18.38
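
The classes are fairly balanced, so accuracy is a reasonable evaluation metric. For a quick visual check of the class balance, one could also plot this distribution; a small sketch using ggplot2's qplot (figure not shown here):

qplot(train.set$classe, xlab = "classe", main = "Class distribution in the training set")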

3 Preprocessing

In this section, we do some preprocessing on train.set. This includes removing some harmful variables, dealing with missing values, standardizing the data (toward a Gaussian distribution), excluding highly correlated variables, and applying principal component analysis (PCA) to the data. We save the preprocessed data into a variable called train.set.proc.

Let’s first take a look at the first two variables of the dataset:

str(train.set[, 1:2])
'data.frame':   13737 obs. of  2 variables:
 $ X        : int  1 2 4 6 9 10 13 14 15 16 ...
 $ user_name: Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
length(unique(train.set[, 1])) == dim(train.set)[1]
[1] TRUE

We must remove the first variable, as it is just an ID for the observations. The second variable is the username; if we included it in our model, the model would only make predictions for this limited set of people. Therefore, we exclude these two variables from our training set.

train.set.proc <- train.set %>% select(-c(1, 2))

Now, we are going to remove the variables that have more than 50% missing values. Let us first see which variables have missing values. In the following script, we extract all variables that have NA values and save their indices in the ids_NA.1 variable.

whichNA <- function(dat){
        # return the indices of the columns of dat that contain at least one NA
        inds <- NULL
        for(i in 1:dim(dat)[2]){
                if(anyNA(dat[, i])){
                      inds <- c(inds, i)  
                }
        }
        inds
}
ids_NA.1 <- whichNA(train.set.proc)
m <- NULL
for(i in 1: length(ids_NA.1)){
        m <- c(m, mean(is.na(train.set.proc[, ids_NA.1[i]])))
}
unique(m)
[1] 0.9793259
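
The same information can be obtained without explicit loops. A vectorized sketch (using the hypothetical names ids_NA.alt and na_frac) that reproduces ids_NA.1 and the missingness fractions:

ids_NA.alt <- which(sapply(train.set.proc, anyNA))        # columns containing at least one NA
na_frac <- colMeans(is.na(train.set.proc[, ids_NA.alt]))  # fraction of missing values per such column
range(na_frac)                                            # all about 0.98, as found above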

So, 100 of the 158 remaining variables have missing values, and, as we see above, for each of these variables about 98% of its values are missing. Since our training set is large and contains many other variables, it does not make sense to impute these variables and keep them as predictors. Therefore, we simply discard them as follows:

train.set.proc <- train.set.proc[, -ids_NA.1]
dim(train.set.proc)
[1] 13737    58

Now, we normalize the numeric variables in the training set. We use the methods YeoJohnson, center, and scale.3

is.fact.2 <- sapply(train.set.proc, is.factor)
num_inds.2 <- which(!is.fact.2)
preObj_stn <- preProcess(train.set.proc[, num_inds.2], method = c("YeoJohnson", "center", "scale"))
stan_data <- predict(preObj_stn, train.set.proc[, num_inds.2])
train.set.proc[, num_inds.2] <- stan_data
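
After this step the numeric columns should have mean approximately 0 and standard deviation approximately 1. A quick sanity-check sketch (output omitted):

summary(colMeans(stan_data))       # means should all be essentially 0
summary(apply(stan_data, 2, sd))   # standard deviations should all be essentially 1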

In the following script, we extract the variables with near zero variability:

isN0 <- nearZeroVar(train.set.proc, saveMetrics = TRUE)$nzv
colnames(train.set.proc)[isN0]
[1] "new_window"

As we see above, there is only one variable, new_window, in the training set with near zero variability. We are going to exclude it from our dataset:

train.set.proc <- train.set.proc[, !isN0]
is.fact.3 <- sapply(train.set.proc, is.factor)
num_inds.3 <- which(!is.fact.3)

Now, let us take a look at the correlations between the numeric variables. Here, using the corrplot package, we plot the correlation matrix.

cors <- cor(train.set.proc[, num_inds.3])
diag(cors) <- 0
corrplot(cors, order = "hclust", title = "Correlation Plot", mar = c(0, 0, 2, 0))

We now exclude the variables with high correlations (with a cutoff of 0.75) from the dataset. To this end, we take advantage of the findCorrelation function, which searches through the correlation matrix and returns the indices of the corresponding variables.

highCorr <- findCorrelation(cors, cutoff=0.75)
num_cor <- length(highCorr)
train.set.proc <- train.set.proc[, -highCorr]
is.fact.4 <- sapply(train.set.proc, is.factor)
num_inds.4 <- which(!is.fact.4)
dim(train.set.proc)
[1] 13737    46

So, 11 variables were removed to reduce the pair-wise correlations.
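
As a check that the removal worked as intended, the largest remaining pairwise correlation should now be below the cutoff; a small verification sketch (output omitted):

cors.reduced <- cor(train.set.proc[, num_inds.4])
max(abs(cors.reduced[upper.tri(cors.reduced)]))   # expected to be below 0.75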

Now, we would like to preprocess the data with the PCA method. We are going to retain 95% of the variance of the data, which is the default thresh4 for preprocessing with PCA in the preProcess function. We save the result in a variable called train.proc.pca:

preObj_pca <- preProcess(train.set.proc[, -dim(train.set.proc)[2]], method = "pca")
train.proc.pca <- predict(preObj_pca, train.set.proc[, -dim(train.set.proc)[2]])
train.proc.pca$classe <- train.set$classe
dim(train.proc.pca)
[1] 13737    27
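
The preProcess object itself records how many components were needed to reach the 95% threshold, together with the loading matrix. An inspection sketch (output omitted; the element names are those used by caret's preProcess):

preObj_pca$numComp        # number of principal components retained (25 here)
dim(preObj_pca$rotation)  # loadings: numeric variables x retained components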

Now we have 25 principal components, the original variable cvtd_timestamp, and the output variable classe in our dataset. Let us take a look at the cvtd_timestamp variable (note that in the table below the frequencies are computed on the full dataset, while the percentages are computed on the training set):

percentage <- prop.table(table(train.proc.pca$cvtd_timestamp)) * 100
cbind(freq=table(dataset$cvtd_timestamp), percentage=round(percentage, 2))
                 freq percentage
02/12/2011 13:32  177       0.92
02/12/2011 13:33 1321       6.78
02/12/2011 13:34 1375       7.05
02/12/2011 13:35 1019       5.10
02/12/2011 14:56  235       1.14
02/12/2011 14:57 1380       7.02
02/12/2011 14:58 1364       6.99
02/12/2011 14:59  557       2.79
05/12/2011 11:23  190       0.96
05/12/2011 11:24 1497       7.64
05/12/2011 11:25 1425       7.23
05/12/2011 14:22  267       1.30
05/12/2011 14:23 1370       7.02
05/12/2011 14:24  973       5.02
28/11/2011 14:13  833       4.19
28/11/2011 14:14 1498       7.56
28/11/2011 14:15  739       3.83
30/11/2011 17:10  869       4.34
30/11/2011 17:11 1440       7.53
30/11/2011 17:12 1093       5.61

4 Building Models

Now, it is time to build our models. In this section, we try several algorithms in order to find a reasonable model.

We are going to use 10-fold cross-validation to estimate the accuracy of our models.5 Moreover, we define a variable metric and set it to Accuracy; this will be the metric used to evaluate the models.

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

We are going to evaluate five different algorithms: linear discriminant analysis (LDA), classification and regression trees (CART), k-nearest neighbors (kNN), support vector machines (SVM) with a radial kernel, and random forests (RF).

This is a good mixture of a simple linear method (LDA), nonlinear methods (CART, kNN), and more complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that every algorithm is evaluated using exactly the same data splits.

# LDA: linear
set.seed(123)
fit.lda <- train(classe~., data=train.proc.pca, method="lda", metric=metric, trControl=control)
# CART: nonlinear
set.seed(123)
fit.cart <- train(classe~., data=train.proc.pca, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(123)
fit.knn <- train(classe~., data=train.proc.pca, method="knn", metric=metric, trControl=control)
# SVM: advanced
set.seed(123)
fit.svm <- train(classe~., data=train.proc.pca, method="svmRadial", metric=metric, trControl=control)
# Random Forest: advanced
set.seed(123)
fit.rf <- train(classe~., data=train.proc.pca, method="rf", metric=metric, trControl=control)

Now, we need to compare the models and select the most accurate one. In the following we see the summary of the above models:

# summarize accuracy of models
results <- resamples(list(lda=fit.lda, 
                          cart=fit.cart, 
                          knn=fit.knn, 
                          svm=fit.svm, 
                          rf=fit.rf))
summary(results)

Call:
summary.resamples(object = results)

Models: lda, cart, knn, svm, rf 
Number of resamples: 10 

Accuracy 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
lda  0.7396364 0.7515947 0.7595485 0.7585409 0.7627893 0.7835277    0
cart 0.2838428 0.2843636 0.6091060 0.4849217 0.6179473 0.6302766    0
knn  0.9621267 0.9650719 0.9687040 0.9681888 0.9714080 0.9730321    0
svm  0.9075691 0.9266872 0.9282070 0.9277122 0.9332470 0.9367273    0
rf   0.9723435 0.9790679 0.9814408 0.9804915 0.9836101 0.9847162    0

Kappa 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
lda  0.6717086 0.6863637 0.6968543 0.6953249 0.7005878 0.7266274    0
cart 0.0000000 0.0000000 0.5103081 0.3134941 0.5212514 0.5381595    0
knn  0.9520726 0.9558216 0.9603965 0.9597543 0.9638339 0.9658680    0
svm  0.8829014 0.9071164 0.9090182 0.9084104 0.9154732 0.9198003    0
rf   0.9650058 0.9735138 0.9765216 0.9753188 0.9792623 0.9806640    0

The following plot presents the model evaluation results and compares the spread and the mean accuracy of each model. Since each algorithm was evaluated 10 times (10-fold cross-validation), there is a population of accuracy measures for each algorithm.

dotplot(results)

We can see that the most accurate models in this case were random forests and kNN, with the former a little more accurate than the latter. Let us take a look at the random forests model:

print(fit.rf)
Random Forest 

13737 samples
   26 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 12364, 12363, 12365, 12362, 12363, 12362, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9457673  0.9313401
  23    0.9804915  0.9753188
  44    0.9735033  0.9664747

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 23.
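
If we want to see which predictors the forest relies on most, caret's varImp function can be applied to the fitted model. A short sketch (output and figure omitted):

imp.rf <- varImp(fit.rf)   # scaled variable importance for the random forest
plot(imp.rf, top = 10)     # the ten most important principal components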

Although random forests give us the most accurate model, it may still overfit. To see whether this is the case, we predict on our testing set test.set. We first need to preprocess test.set using the same preprocessing objects we fitted on the training set:

test.set.proc <- test.set %>% select(-c(1, 2))
test.set.proc <- test.set.proc[, -ids_NA.1]
stan.data.test <- predict(preObj_stn, test.set.proc[, num_inds.2])
test.set.proc[, num_inds.2] <- stan.data.test
test.set.proc <- test.set.proc[, !isN0]
test.set.proc <- test.set.proc[, -highCorr]
test.set.proc$classe <- test.set$classe
test.proc.pca <- predict(preObj_pca, test.set.proc[, -dim(test.set.proc)[2]])
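
Note that every transformation above reuses the objects estimated on the training set (ids_NA.1, preObj_stn, isN0, highCorr, preObj_pca), so no information from test.set leaks into the preprocessing. As a quick sanity check (a sketch; expected to return TRUE), the resulting feature columns should match those the models were trained on:

identical(colnames(test.proc.pca), setdiff(colnames(train.proc.pca), "classe"))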

Now, let’s take a look at the confusion matrix:

pred.test <- predict(fit.rf, test.proc.pca)
confusionMatrix(pred.test, test.set$classe)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1663   15    0    0    0
         B    7 1117   11    2    0
         C    4    7 1011   31    4
         D    0    0    4  922    9
         E    0    0    0    9 1069

Overall Statistics
                                          
               Accuracy : 0.9825          
                 95% CI : (0.9788, 0.9857)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9779          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9934   0.9807   0.9854   0.9564   0.9880
Specificity            0.9964   0.9958   0.9905   0.9974   0.9981
Pos Pred Value         0.9911   0.9824   0.9565   0.9861   0.9917
Neg Pred Value         0.9974   0.9954   0.9969   0.9915   0.9973
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2826   0.1898   0.1718   0.1567   0.1816
Detection Prevalence   0.2851   0.1932   0.1796   0.1589   0.1832
Balanced Accuracy      0.9949   0.9882   0.9880   0.9769   0.9931

As we see, we got about 98% accuracy, which shows that our random forests model works pretty well. This is well represented in the following plot.

qplot(classe, pred.test, data = test.set, xlab = "Actual Value", ylab = "Predicted Value", main = "Predicted vs. Actual", geom = "jitter")
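
The corresponding estimate of the out-of-sample error rate is simply one minus the accuracy on the held-out test set; a small sketch:

oos.error <- 1 - confusionMatrix(pred.test, test.set$classe)$overall["Accuracy"]
round(as.numeric(oos.error), 4)   # about 0.0175, i.e. below 2%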


  1. a.a.safilian@gmail.com

  2. http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har - Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6

  3. The YeoJohnson method makes the data look more normal. We could have used the Box-Cox method, but Box-Cox does not work when the data contain zero or negative values.

  4. A cutoff for the cumulative percent of variance to be retained by PCA

  5. This splits our dataset into 10 parts; we train on 9 and test on 1, repeating for all combinations of train-test splits.