In this paper, a Human Activity Recognition data set is investigated using the decision tree algorithm, with both a single tree and an ensemble of trees. A ten-round cross-validation based on repeated random splits is applied to the single-tree method to validate the model. As expected, the ensemble tree model shows improved accuracy.
Before building any models, it is wise to check and preprocess the data so that processing time is reduced and irrelevant variables are excluded. First, the test data set was inspected and all columns consisting solely of NAs were removed. The first five columns (“X”, “user_name” and the three timestamp variables) were removed as well, because they carry no information about the human activity measurements and including them in the model can be misleading. For example, the “X” variable is just an index of the measurements, but the data set is ordered in such a way that, if “X” were included in the model, “classe” would be perfectly explained by “X”, which does not make sense.
Data <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
test <- test[,colSums(is.na(test))!= nrow(test)] # remove columns that consist of solely NA's
names <- c(names(test)[-c(1:5, 60)], "classe") # keep the remaining predictor names (drop the five identifier columns and the last column) and append "classe"
test <- test[, -c(1:5)]
Data <- Data[, names]
After the cleaned data set is obtained, it is further partitioned into a training set and a test set for cross-validation purposes. The partition was set at 0.75, which means the training set to test set size ratio is 3.
library(caret); library(kernlab); library(rpart)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y = Data$classe,
p = 0.75,
list = F)
trainSet <- Data[inTrain,]
testSet <- Data[-inTrain,]
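Note that createDataPartition draws random row indices, so every run produces a different split. For a reproducible partition, the random number generator can be seeded beforehand; the seed value below is arbitrary.
set.seed(1234) # call before createDataPartition so the same split is drawn on every run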
After the partitioning of the data set, a decision tree was built with the rpart method, which generates a CART model. The method argument in rpart and the type argument in predict are both set to “class” so that it is easier to build the confusion matrix. After the confusion matrix was built, a function was written to compute the diagonal sum as a fraction of all the measurements, which gives the accuracy.
cartModel <- rpart(classe~., data = trainSet, method = "class") # build a cart model with the train data set
printcp(cartModel)
##
## Classification tree:
## rpart(formula = classe ~ ., data = trainSet, method = "class")
##
## Variables actually used in tree construction:
## [1] accel_dumbbell_y accel_forearm_x magnet_arm_y
## [4] magnet_dumbbell_y magnet_dumbbell_z magnet_forearm_z
## [7] num_window pitch_belt pitch_forearm
## [10] roll_belt roll_forearm total_accel_dumbbell
##
## Root node error: 10533/14718 = 0.71565
##
## n= 14718
##
## CP nsplit rel error xerror xstd
## 1 0.113643 0 1.00000 1.00000 0.0051957
## 2 0.060413 1 0.88636 0.88636 0.0055472
## 3 0.038735 4 0.70512 0.73787 0.0057499
## 4 0.036552 6 0.62765 0.61360 0.0057161
## 5 0.029526 7 0.59109 0.59271 0.0056923
## 6 0.023545 8 0.56157 0.54742 0.0056224
## 7 0.021836 10 0.51448 0.50043 0.0055223
## 8 0.021172 11 0.49264 0.48296 0.0054776
## 9 0.018988 12 0.47147 0.46634 0.0054312
## 10 0.017849 13 0.45248 0.45153 0.0053866
## 11 0.014431 14 0.43463 0.42789 0.0053089
## 12 0.013861 15 0.42020 0.41004 0.0052446
## 13 0.011915 16 0.40634 0.38925 0.0051634
## 14 0.010728 18 0.38251 0.37150 0.0050885
## 15 0.010159 19 0.37178 0.35156 0.0049980
## 16 0.010000 20 0.36163 0.34520 0.0049676
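The cptable printed above can also be used to prune the tree. The following is a minimal sketch, not part of the original analysis, that selects the complexity parameter with the smallest cross-validated error and prunes the tree with rpart’s prune function.
bestCp <- cartModel$cptable[which.min(cartModel$cptable[, "xerror"]), "CP"] # cp with lowest xerror
prunedModel <- prune(cartModel, cp = bestCp) # collapse splits that do not meet this cp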
cartPred <- predict(cartModel, newdata = testSet, type = "class") # use the cart model to predict the test data set
conMatrix <- table(cartPred, testSet$classe) # make the confusion matrix from the predicted "classe" variable and actual "classe" variable
acc <- function(conMatrix) {
sum = 0
for(i in 1:dim(conMatrix)[1]) {
sum = sum + conMatrix[i,i]
}
acc <- sum/sum(conMatrix)
return(acc)
} # function to calculate the accuracy from the confusion matrix
ACC <- acc(conMatrix)
print("The predicted accuracy is:")
print(ACC)
## [1] "The predicted accuracy is:"
## [1] 0.7347064
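For reference, the same quantity can be computed in one line with base R, since diag extracts the diagonal of a contingency table.
sum(diag(conMatrix))/sum(conMatrix) # identical to acc(conMatrix)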
In order to get a more reliable estimate of the accuracy, ten validation rounds are run as follows. Strictly speaking, this is repeated random sub-sampling rather than classical 10-fold cross-validation: the random generator in the createDataPartition function automatically makes the ten splits different, which is what makes averaging over them meaningful.
sum = 0
for(i in 1:10) {
inTrain <- createDataPartition(y = Data$classe,
p = 0.75,
list = F)
trainSet <- Data[inTrain,]
testSet <- Data[-inTrain,]
cartModel <- rpart(classe~., data = trainSet, method = "class")
cartPred <- predict(cartModel, newdata = testSet, type = "class")
conMatrix <- table(cartPred, testSet$classe)
ACC <- acc(conMatrix)
sum = sum + ACC
}
print("The average accuracy is:")
print(sum/10)
## [1] "The average accuracy is:"
## [1] 0.7515905
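An alternative to the explicit loop is caret’s built-in resampling. The following minimal sketch, assuming “classe” is a factor, runs a true 10-fold cross-validation of the CART model through the train function; note that train also tunes the cp parameter, so its reported accuracy need not match the loop above.
ctrl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
cvModel <- train(classe ~ ., data = Data, method = "rpart", trControl = ctrl)
# cvModel$results holds the cross-validated accuracy for each cp value tried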
Furthermore, a random forest algorithm was used to build an ensemble tree model, which is generally more powerful than a single tree, and the observed accuracy indeed bears this out.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
inTrain <- createDataPartition(y = Data$classe,
p = 0.75,
list = F)
trainSet <- Data[inTrain,]
testSet <- Data[-inTrain,]
rfModel <- randomForest(classe~., data=trainSet)
rfPred <- predict(rfModel, newdata = testSet)
conMatrix <- table(rfPred, testSet$classe)
conMatrix
##
## rfPred A B C D E
## A 1395 0 0 0 0
## B 0 949 3 0 0
## C 0 0 852 3 0
## D 0 0 0 800 2
## E 0 0 0 1 899
ACC <- acc(conMatrix)
print("The accuracy of the random forest model is: ")
print(ACC)
## [1] "The accuracy of the random forest model is: "
## [1] 0.9981648
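Beyond raw accuracy, the randomForest package can also report which predictors drive the model. A brief sketch using its importance and varImpPlot helpers:
head(importance(rfModel)) # mean decrease in Gini impurity per predictor
varImpPlot(rfModel) # dot chart of variable importance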
Finally, the random forest model was used to predict the 20 cases in the test data.
levels(test$new_window) <- levels(trainSet$new_window) # for the random forest predict function to work, a factor predictor in the new data must have the same levels as in the training data
testPred <- predict(rfModel, newdata = test)
testPred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
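As a side note, assigning levels() directly matches levels by position and can silently relabel values when the new data is missing some levels. A more defensive sketch rebuilds the factor with the training levels instead:
test$new_window <- factor(test$new_window, levels = levels(trainSet$new_window)) # safer level alignment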