1. Executive Summary

This study examines data from Groupware’s Human Activity Recognition (HAR) project to predict which activity a subject is performing, using off-the-shelf machine learning algorithms in R. Subjects were fitted with body-mounted sensors, and the recorded measurements form a training database in which the activity (e.g., sitting, walking, standing) is coded as a multi-level outcome variable. A separate testing database of 20 entries was also provided.

After data cleanup, we conclude that the popular “random forest” and “generalized boosted regression models” can predict human activities with a high degree of accuracy: 100% on the 20-entry test set in this case.

2. Data Analysis and Cleanup

The training and testing data was provided by Groupware’s Human Activity Recognition project: http://groupware.les.inf.puc-rio.br/har

Specifically, the training set was downloaded from: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

And, the test set was downloaded from: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The training database is relatively large, with close to 20,000 samples.

training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
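
As the str() output below shows, the raw CSV encodes missing values in three different ways: literal "NA", empty strings, and spreadsheet "#DIV/0!" artifacts. A read that normalizes all three to NA would simplify the cleanup that follows; a minimal sketch (the analysis here used the plain read.csv() calls above):

# Alternative read that maps all three missing-value encodings to NA
na.tokens <- c("NA", "", "#DIV/0!")
training.alt <- read.csv("pml-training.csv", na.strings = na.tokens)
testing.alt <- read.csv("pml-testing.csv", na.strings = na.tokens)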

dim(training)
## [1] 19622   160
str(training)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt     : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt      : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt.1    : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_arm      : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_arm        : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_arm       : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_arm      : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_arm        : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_dumbbell  : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Several of the training variables are almost entirely NA. Also, the first 7 columns are irrelevant for our analysis: they hold the row serial number, subject name, timestamps, whether a row opened a new data-collection window, and the window number.
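
The extent of the missing-data problem can be quantified before dropping columns by hand; a quick check:

# Fraction of NA values per column; many columns are almost entirely NA
na.frac <- colMeans(is.na(training))
sum(na.frac > 0.9)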

Upon examination, the testing data had several empty and NA columns as well.

dim(testing)
## [1]  20 160
# str(testing); head(testing)  # output omitted for brevity

The variables in the training and testing data were reduced based on the observations above.

tr.clean <- training[, -c(1:7, 12:36, 50:59, 69:83, 87:101, 103:112, 125:139, 
    141:150)]
tst.clean <- testing[, -c(1:7, 12:36, 50:59, 69:83, 87:101, 103:112, 125:139, 
    141:150)]

The resulting data is more manageable, and excluding uninformative predictors should yield a more accurate model.

dim(tr.clean)
## [1] 19622    53
dim(tst.clean)
## [1] 20 53
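
The hard-coded column indices above are brittle if the source files ever change. An equivalent, programmatic reduction could combine caret's nearZeroVar() with an NA-fraction filter; a sketch, under the assumption that the near-zero-variance columns match the summary columns dropped above:

library(caret)
# Drop the 7 bookkeeping columns, near-zero-variance columns, and mostly-NA columns
nzv <- nearZeroVar(training)
mostly.na <- which(colMeans(is.na(training)) > 0.9)
tr.clean2 <- training[, -union(c(1:7, nzv), mostly.na)]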

3. Validation strategy

A validation set is needed to estimate out-of-sample accuracy and to aid in model selection. The training database can be split into training and validation subsets. Since the training data holds a large number of observations, a 90/10 partition should be sufficient.

library(caret)
set.seed(33833)  # fix the RNG seed so the partition is reproducible (any value works)
inTrain <- createDataPartition(tr.clean$classe, p = 0.9, list = FALSE)
tr.df <- tr.clean[inTrain, ]
val.df <- tr.clean[-inTrain, ]
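
createDataPartition() samples within each level of classe, so both partitions retain the original class distribution; this is easy to verify:

# Class proportions should be nearly identical across the two partitions
round(prop.table(table(tr.df$classe)), 3)
round(prop.table(table(val.df$classe)), 3)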

4. Applying Off-the-shelf Machine Learning algorithms

Multi-variable regression would be a simple algorithm to begin with, given its relatively short run-time. However, it cannot be used here: the outcome variable (classe) is a multi-level categorical variable rather than a continuous one, and attempting a linear fit generates numerous warnings.

We can try the usual go-to algorithms next and tune them further based on the results.

4.1 Random Forest

library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)  # leave 1 core for the OS
registerDoParallel(cluster)  # register the cluster so caret can use it
fitControl <- trainControl(allowParallel = TRUE)
# fit.rf <- train(classe ~ ., method = "rf", data = tr.df, trControl = fitControl)
load("fitdata")  # load a previously fitted model to keep knitr runtimes down
fit.rf
## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 19622, 19622, 19622, 19622, 19622, 19622, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9928111  0.9909049
##   27    0.9924619  0.9904632
##   52    0.9834419  0.9790505
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
predict.rf.val <- predict(fit.rf, newdata = val.df)
confusionMatrix(predict.rf.val, val.df$classe)$overall[1]
## Accuracy 
##        1
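
Note that the cached fit.rf reports 19622 samples, which suggests it was trained on the full cleaned training set rather than on tr.df; the validation rows would then overlap the training rows, making the perfect validation accuracy optimistic. A sketch of retraining on tr.df only, with 5-fold cross-validation in place of the default bootstrap (the cross-validation choice is an assumption, not what was originally run):

# Retrain on the 90% training partition only, with explicit 5-fold CV
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit.rf2 <- train(classe ~ ., method = "rf", data = tr.df, trControl = fitControl)
# stopCluster(cluster)  # release the worker processes once all parallel fits are done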

4.2 Generalized Boosting

fit.gbm <- train(classe ~ ., method = "gbm", data = tr.df, verbose = FALSE)
fit.gbm
## Stochastic Gradient Boosting 
## 
## 17662 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 17662, 17662, 17662, 17662, 17662, 17662, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7496641  0.6825584
##   1                  100      0.8176988  0.7691894
##   1                  150      0.8507482  0.8110328
##   2                   50      0.8518031  0.8121720
##   2                  100      0.9026135  0.8766983
##   2                  150      0.9263452  0.9067596
##   3                   50      0.8926002  0.8639674
##   3                  100      0.9371369  0.9204244
##   3                  150      0.9572725  0.9459244
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
predict.gbm.val <- predict(fit.gbm, newdata = val.df)
confusionMatrix(predict.gbm.val, val.df$classe)$overall[1]
##  Accuracy 
## 0.9642857
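
Overall accuracy can hide weak classes; confusionMatrix() also reports per-class metrics, which are worth a glance:

# Per-class sensitivity and specificity on the validation set
confusionMatrix(predict.gbm.val, val.df$classe)$byClass[, c("Sensitivity", "Specificity")]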

4.3 SVM (Support Vector Machine)

library(e1071)
fit.svm <- svm(classe ~ ., data = tr.df)
summary(fit.svm)
## 
## Call:
## svm(formula = classe ~ ., data = tr.df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.01923077 
## 
## Number of Support Vectors:  7515
## 
##  ( 1584 1617 1762 1354 1198 )
## 
## 
## Number of Classes:  5 
## 
## Levels: 
##  A B C D E
predict.svm.val <- predict(fit.svm, newdata = val.df)
confusionMatrix(predict.svm.val, val.df$classe)$overall[1]
##  Accuracy 
## 0.9607143
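
The svm() fit above uses the default cost and gamma. e1071's tune() can grid-search these by cross-validation and may improve on the 96% figure; a sketch with an illustrative, deliberately small grid (runtime grows with grid size):

# Grid-search cost and gamma via e1071's built-in cross-validated tune()
svm.tuned <- tune(svm, classe ~ ., data = tr.df,
                  ranges = list(cost = c(1, 10), gamma = c(0.01, 0.02)))
summary(svm.tuned)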

Since all three classifiers achieve more than 95% accuracy on the validation set, we can proceed to apply them to the test data and take a majority vote on the outcome. (Naive Bayes and some other classifiers were also tried, but did not come close to the accuracy of the above three.)

5. Results on the given test data

It is a simple exercise to use predict() on the test data with the three chosen models.

predict.rf.tst <- predict(fit.rf, newdata = tst.clean)
predict.rf.tst
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
predict.gbm.tst <- predict(fit.gbm, newdata = tst.clean)
predict.gbm.tst
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
predict.svm.tst <- predict(fit.svm, newdata = tst.clean)
predict.svm.tst
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  A  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
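
The three predictions agree on every entry except number 11, where the SVM says A while random forest and GBM say B; a majority vote therefore settles on B. A minimal sketch of the vote:

# Majority vote across the three classifiers, entry by entry
votes <- data.frame(rf = predict.rf.tst, gbm = predict.gbm.tst, svm = predict.svm.tst)
apply(votes, 1, function(v) names(which.max(table(v))))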

6. Further work: tuning and design

For the purpose of the course exercise, the above was sufficient. However, the classifiers can be fine-tuned through more elaborate cross-validation schemes, boosting parameters, and other tuning options. The strength of the caret package is that it lets us specify these parameters and reports the best combination it finds, as sketched below.
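
For example, trainControl() selects the resampling scheme and tuneGrid pins down the exact parameter grid; a sketch for GBM (the grid values are illustrative, not recommendations):

# Explicit 5-fold cross-validation and a custom GBM tuning grid
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
grid <- expand.grid(n.trees = c(150, 300), interaction.depth = c(3, 5),
                    shrinkage = 0.1, n.minobsinnode = 10)
fit.gbm2 <- train(classe ~ ., method = "gbm", data = tr.df,
                  trControl = ctrl, tuneGrid = grid, verbose = FALSE)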

In addition, caret supports dozens of algorithms that can easily be tried. It might be worthwhile to write a super-classifier that runs several of these algorithms on a cluster of machines and classifies based on majority votes or weighted accuracy; a sketch follows.
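
A minimal sketch of such a super-classifier, looping caret::train() over a vector of method names and majority-voting the test predictions (the method list is illustrative, and each method's backing package must be installed):

# Fit several caret methods, then majority-vote their test-set predictions
methods <- c("rf", "gbm", "svmRadial", "knn")
fits <- lapply(methods, function(m)
    train(classe ~ ., method = m, data = tr.df, trControl = ctrl))  # reuse ctrl from the sketch above
preds <- sapply(fits, function(f) as.character(predict(f, newdata = tst.clean)))
apply(preds, 1, function(v) names(which.max(table(v))))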