Introduction

In recent years, devices that monitor our movements have become increasingly popular. One problem in this area is classifying those movements from accelerometer data.

In this project, we take the Weight Lifting Exercise Dataset from Groupware@LES and attempt to build a model that correctly identifies the activity performed by each participant, based on data from accelerometers placed on the belt, forearm, arm and dumbbell.

For more information about this dataset, please go to http://groupware.les.inf.puc-rio.br/har

The Data

The dataset we used is pre-partitioned into two sets, one training set and one test set, and both files have already been downloaded into the current working directory.

train <- read.csv("pml-training.csv", sep=",", header=TRUE)
nrow(train)
## [1] 19622
test <- read.csv("pml-testing.csv", sep=",", header=TRUE)
nrow(test)
## [1] 20

Tidying Data and Feature Selection

The first step in our analysis is to remove features that are not useful for making predictions. Exploratory analysis showed that these are “X”, “user_name”, “raw_timestamp_part_1”, “raw_timestamp_part_2”, “cvtd_timestamp”, “new_window” and “num_window”, which are the first 7 features in the dataset.

wtrain <- train[,8:ncol(train)]
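Dropping columns by position relies on the column order never changing. As a minimal sketch, the same step could be written by name instead. The toy data frame and the shortened `drop_cols` vector below are made up for illustration and only mimic the layout of the real file.

```r
# Toy data frame standing in for the real training set (values are made up).
toy <- data.frame(
  X = 1:3,
  user_name = c("a", "b", "c"),
  num_window = c(11, 11, 12),
  roll_belt = c(1.41, 1.42, 1.48),
  classe = factor(c("A", "A", "B"))
)

# Bookkeeping columns to discard; for the real data this vector would
# list all seven names given in the text.
drop_cols <- c("X", "user_name", "num_window")

# Keep every column whose name is not in drop_cols.
kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(kept)
```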

Exploratory analysis also shows that many features were automatically read as factors although their values are numeric. Inspection of the dataset confirms that every feature read as a factor, other than the outcome “classe”, is in fact numeric, so we convert them to numeric.

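# Convert every predictor that was read as a factor to numeric.
# Since R 4.0.0 read.csv returns character columns by default, so those are handled too.
# The outcome column "classe" is excluded by name rather than by position.
for (v in setdiff(names(wtrain), "classe")) {
  if (is.factor(wtrain[,v]) || is.character(wtrain[,v])) {
    wtrain[,v] <- as.numeric(as.character(wtrain[,v]))
  }
}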
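The detour through `as.character()` matters: calling `as.numeric()` directly on a factor returns its internal level codes rather than the values shown in the labels. A small sketch with a made-up factor illustrates the difference.

```r
# A factor whose labels look like numbers, plus one spreadsheet error string.
f <- factor(c("10.5", "3.2", "#DIV/0!", "3.2"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Going through as.character() first parses the labels themselves.
# Labels that are not numbers become NA (the coercion warning is silenced here).
values <- suppressWarnings(as.numeric(as.character(f)))

codes
values
```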

Converting factors to numeric also turns every non-numeric value into NA, which lets us treat them uniformly. For example, the data contain numerous “#DIV/0!” values, and the conversion forces them to NA.
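An alternative, which is not the approach used in this report, is to declare such strings as missing when the file is read, using the `na.strings` argument of `read.csv`. The inline CSV text below is made up and only imitates a few columns of the real file.

```r
# Inline CSV text standing in for a few columns of the real file (values are made up).
csv_text <- 'roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,-1.9,B'

# na.strings lists the strings that read.csv should treat as missing.
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# The kurtosis column is now read as numeric, with NA for the blank
# field and for the "#DIV/0!" entry.
str(toy)
sum(is.na(toy$kurtosis_roll_belt))
```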

Part of our feature selection is to remove every feature that contains NAs. We chose this approach because detailed inspection of the dataset shows that the features with NAs are all calculated features, that is, values derived from other features that hold raw measurements. Intuitively, we expect them to be highly correlated with the raw-measurement features.

noNA <- c()
hasNA <- c()
for (v in names(wtrain)){
  if (length(wtrain[is.na(wtrain[,v]),v]) == 0) {
    noNA <- c(noNA, v)
  } else {
    hasNA <- c(hasNA, v)
  }
}
unique(gsub("_.*","", hasNA))
## [1] "kurtosis"  "skewness"  "max"       "min"       "amplitude" "var"      
## [7] "avg"       "stddev"
noNA
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"
wtrain <- wtrain[,noNA]
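The same filter can be sketched more compactly with `colSums(is.na(...))`, which also gives the number of missing values per column. The toy data frame below is made up for illustration.

```r
# Toy data frame: one raw feature, one calculated feature with NAs, and the outcome.
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48),
  avg_roll_belt = c(NA, NA, 1.5),
  classe = factor(c("A", "A", "B"))
)

# Count missing values per column, then keep the columns that have none.
na_count <- colSums(is.na(toy))
complete_cols <- names(na_count)[na_count == 0]

na_count
complete_cols
```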

Methodology

To build a model for predicting the activities, we use the random forest algorithm. A random forest grows many decision trees and combines their predictions, by majority vote in the case of classification, to reduce variance. We chose it because it works well for classification problems and because, in our tests, the in-sample and out-of-sample error rates obtained with 5-fold cross-validation are close to each other, as expected.
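For a classification task, the trees in a forest are combined by majority vote. The following toy illustration in base R uses made-up predictions from three hypothetical trees; it shows only the voting step, not how the trees are grown.

```r
# Hypothetical class predictions from three trees for four observations.
tree_votes <- data.frame(
  tree1 = c("A", "B", "C", "A"),
  tree2 = c("A", "B", "B", "E"),
  tree3 = c("B", "B", "C", "E"),
  stringsAsFactors = FALSE
)

# For one observation, return the class that received the most votes.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Apply the vote to each row (observation).
forest_prediction <- apply(tree_votes, 1, majority_vote)
forest_prediction
```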

Implementation

First, load the libraries and set the seed.

library(randomForest)
library(caret)
set.seed(123456)

Then, randomize the order of the training set.

wtrain <- wtrain[sample(nrow(wtrain), replace=FALSE),]

Run random forest with 5-fold cross validation.

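k <- 5
n <- nrow(wtrain)
size <- ceiling(n/k)
toi <- 0
insample <- c()      # in-sample accuracy per fold
outofsample <- c()   # out-of-sample accuracy per fold
for (i in 1:k){
  print(paste("iteration","=",i))
  # Rows fromi..toi are held out in this iteration; the block spans size+1 rows
  # (fewer in the last iteration, which is capped at n)
  fromi <- toi+1
  toi <- min(n, fromi+size)
  cvtest <- wtrain[fromi:toi,]
  cvtrain <- wtrain[-c(fromi:toi),]
  # Fit on the remaining rows, sampling 2 candidate variables at each split
  rf <- randomForest(classe ~ ., data=cvtrain, mtry=2)
  # Accuracy on the held-out rows (out-of-sample)
  ptest <- predict(rf, cvtest)
  cmtest <- confusionMatrix(ptest, cvtest$classe)
  # Accuracy on the rows used for fitting (in-sample)
  ptrain <- predict(rf, cvtrain)
  cmtrain <- confusionMatrix(ptrain, cvtrain$classe)
  insample <- c(insample, cmtrain$overall[1])
  outofsample <- c(outofsample, cmtest$overall[1])
}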
## [1] "iteration = 1"
## [1] "iteration = 2"
## [1] "iteration = 3"
## [1] "iteration = 4"
## [1] "iteration = 5"
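Because `toi` is set to `fromi + size`, each held-out block in the loop spans `size + 1` rows, so the folds are slightly uneven. The sketch below reproduces the fold sizes from the loop and compares them with a balanced assignment. The rotating `fold_id` labels are an assumption for illustration, not the method used in this report.

```r
n <- 19622  # rows in the training set, as reported by nrow(train)
k <- 5

# Fold boundaries computed exactly as in the cross-validation loop.
size <- ceiling(n / k)
toi <- 0
loop_sizes <- numeric(k)
for (i in 1:k) {
  fromi <- toi + 1
  toi <- min(n, fromi + size)
  loop_sizes[i] <- toi - fromi + 1
}

# Alternative (illustrative assumption): assign fold labels 1..k in rotation.
# Since the rows were shuffled beforehand, this yields random, balanced folds.
fold_id <- rep(seq_len(k), length.out = n)
balanced_sizes <- as.vector(table(fold_id))

loop_sizes
balanced_sizes
```

With the rotating labels, the held-out rows of fold `i` would be selected with `wtrain[fold_id == i, ]` and the fitting rows with `wtrain[fold_id != i, ]`.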

The accuracy of the predictions is very close for in-sample and out-of-sample data in every iteration of the k-fold cross-validation: in-sample accuracy is 1 throughout, and out-of-sample accuracy stays above 0.992.

cvacc <- data.frame(insample, outofsample)
cvacc
##   insample outofsample
## 1        1   0.9936322
## 2        1   0.9956699
## 3        1   0.9949058
## 4        1   0.9928681
## 5        1   0.9946401
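The values collected from `cmtest$overall[1]` and `cmtrain$overall[1]` are the “Accuracy” entries of caret's confusion matrix summary. As a minimal sketch of what that number means, accuracy can be computed by hand from a cross-tabulation; the labels below are made up.

```r
# Made-up true and predicted labels for eight observations.
lv <- c("A", "B", "C", "D", "E")
truth     <- factor(c("A", "A", "B", "B", "C", "C", "D", "E"), levels = lv)
predicted <- factor(c("A", "A", "B", "C", "C", "C", "D", "E"), levels = lv)

# Cross-tabulate predictions against the truth.
cm <- table(Prediction = predicted, Reference = truth)

# Accuracy is the share of observations on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```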

To get the error rates we subtract the accuracy from 1. Below is the comparison of the in-sample and out-of-sample error rates.

# Column names are set directly in the data.frame() call
cverror <- data.frame(insample=(1-insample), outofsample=(1-outofsample))
cverror
##   insample outofsample
## 1        0 0.006367804
## 2        0 0.004330107
## 3        0 0.005094244
## 4        0 0.007131941
## 5        0 0.005359877
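A single estimate of the out-of-sample error can be obtained by averaging over the folds. The sketch below copies the out-of-sample accuracies from the accuracy table above, so it runs without refitting the model; the variable names `fold_error`, `mean_error` and `sd_error` are introduced here for illustration.

```r
# Out-of-sample accuracies copied from the accuracy table.
outofsample <- c(0.9936322, 0.9956699, 0.9949058, 0.9928681, 0.9946401)

# Per-fold error rate, then its mean and standard deviation across folds.
fold_error <- 1 - outofsample
mean_error <- mean(fold_error)
sd_error <- sd(fold_error)

round(100 * mean_error, 2)  # estimated out-of-sample error, in percent
round(100 * sd_error, 2)    # spread across folds, in percent
```

The mean comes out at roughly 0.57 percent.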

Test Prediction

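# rf is the model fitted in the last cross-validation iteration (trained on four of the five folds)
answers <- predict(rf, test)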
answers
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Project Submission

pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(answers)
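As a quick sanity check, the written files can be read back and compared with the predictions. The sketch below uses a hypothetical variant of the writer, `write_answers`, that takes a target directory so that it can write made-up answers into a temporary folder instead of the working directory.

```r
# Hypothetical variant of the writer that takes a target directory.
write_answers <- function(x, dir) {
  for (i in seq_along(x)) {
    filename <- file.path(dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}

# Made-up answers, written to a temporary directory.
toy_answers <- factor(c("B", "A", "E"), levels = c("A", "B", "C", "D", "E"))
out_dir <- file.path(tempdir(), "answers_check")
dir.create(out_dir, showWarnings = FALSE)
write_answers(toy_answers, out_dir)

# Read each file back and compare with what was written.
files <- file.path(out_dir, paste0("problem_id_", seq_along(toy_answers), ".txt"))
read_back <- vapply(files, readLines, character(1), USE.NAMES = FALSE)
identical(read_back, as.character(toy_answers))
```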