In this paper, a Human Activity Recognition data set is investigated using the decision tree algorithm, with both a single tree and an ensemble of trees. A ten-round cross-validation based on repeated random splits is applied to the single-tree method to validate the model. As expected, the ensemble tree model shows improved accuracy.
Before building any models, it is wise to check and preprocess the data so that processing time is reduced and irrelevant variables are excluded. First, the test data set was inspected and all columns consisting solely of NAs were removed. The first five columns (“X”, “user_name” and the three timestamp variables) were removed as well, because they carry no information about the human activity measurements and including them in the model can be misleading. For example, the “X” variable is just an index of the measurements, but the data set is ordered in such a way that, if “X” were included in the model, “classe” would be perfectly explained by “X”, which does not make sense.
Data <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
test <- test[,colSums(is.na(test))!= nrow(test)] # remove columns that consist of solely NA's
names <- c(names(test)[-c(1:5, 60)], "classe") # keep the remaining predictor names (drop the five identifier columns and the last column) and append "classe"
test <- test[, -c(1:5)]
Data <- Data[, names]
After the cleaned data set is obtained, it is further partitioned into a training set and a test set for cross-validation purposes. The partition was set at 0.75, which means the training set to test set size ratio is 3.
library(caret); library(kernlab); library(rpart)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y = Data$classe,
p = 0.75,
list = F)
trainSet <- Data[inTrain,]
testSet <- Data[-inTrain,]
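Note that createDataPartition draws random row indices, so every run produces a different split. For a reproducible partition, the random number generator can be seeded beforehand; the seed value below is arbitrary.
set.seed(1234) # call before createDataPartition so the same split is drawn on every run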
After the partitioning of the data set, a decision tree was built with the rpart method, which generates a CART model. The method argument in rpart and the type argument in predict are both set to “class” so that it is easier to build the confusion matrix. After the confusion matrix was built, a function was written to compute the diagonal sum as a fraction of all the measurements, which gives the accuracy.
cartModel <- rpart(classe~., data = trainSet, method = "class") # build a cart model with the train data set
printcp(cartModel)
##
## Classification tree:
## rpart(formula = classe ~ ., data = trainSet, method = "class")
##
## Variables actually used in tree construction:
## [1] accel_dumbbell_y accel_forearm_x magnet_arm_y
## [4] magnet_dumbbell_y magnet_dumbbell_z magnet_forearm_z
## [7] num_window pitch_belt pitch_forearm
## [10] roll_belt roll_forearm total_accel_dumbbell
##
## Root node error: 10533/14718 = 0.71565
##
## n= 14718
##
## CP nsplit rel error xerror xstd
## 1 0.113643 0 1.00000 1.00000 0.0051957
## 2 0.060413 1 0.88636 0.88636 0.0055472
## 3 0.038735 4 0.70512 0.73787 0.0057499
## 4 0.036552 6 0.62765 0.61360 0.0057161
## 5 0.029526 7 0.59109 0.59271 0.0056923
## 6 0.023545 8 0.56157 0.54742 0.0056224
## 7 0.021836 10 0.51448 0.50043 0.0055223
## 8 0.021172 11 0.49264 0.48296 0.0054776
## 9 0.018988 12 0.47147 0.46634 0.0054312
## 10 0.017849 13 0.45248 0.45153 0.0053866
## 11 0.014431 14 0.43463 0.42789 0.0053089
## 12 0.013861 15 0.42020 0.41004 0.0052446
## 13 0.011915 16 0.40634 0.38925 0.0051634
## 14 0.010728 18 0.38251 0.37150 0.0050885
## 15 0.010159 19 0.37178 0.35156 0.0049980
## 16 0.010000 20 0.36163 0.34520 0.0049676
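The cptable printed above can also be used to prune the tree. The following is a minimal sketch, not part of the original analysis, that selects the complexity parameter with the smallest cross-validated error and prunes the tree with rpart’s prune function.
bestCp <- cartModel$cptable[which.min(cartModel$cptable[, "xerror"]), "CP"] # cp with lowest xerror
prunedModel <- prune(cartModel, cp = bestCp) # collapse splits that do not meet this cp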
cartPred <- predict(cartModel, newdata = testSet, type = "class") # use the cart model to predict the test data set
conMatrix <- table(cartPred, testSet$classe) # make the confusion matrix from the predicted "classe" variable and actual "classe" variable
acc <- function(conMatrix) {
sum = 0
for(i in 1:dim(conMatrix)[1]) {
sum = sum + conMatrix[i,i]
}
acc <- sum/sum(conMatrix)
return(acc)
} # function to calculate the accuracy from the confusion matrix
ACC <- acc(conMatrix)
print("The predicted accuracy is:")
print(ACC)
## [1] "The predicted accuracy is:"
## [1] 0.7347064
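For reference, the same quantity can be computed in one line with base R, since diag extracts the diagonal of a contingency table.
sum(diag(conMatrix))/sum(conMatrix) # identical to acc(conMatrix)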
In order to get a more reliable estimate of the accuracy, ten validation rounds are run as follows. Strictly speaking, this is repeated random sub-sampling rather than classical 10-fold cross-validation: the random generator in the createDataPartition function automatically makes the ten splits different, which is what makes averaging over them meaningful.
sum = 0
for(i in 1:10) {
inTrain <- createDataPartition(y = Data$classe,
p = 0.75,
list = F)
trainSet <- Data[inTrain,]
testSet <- Data[-inTrain,]
cartModel <- rpart(classe~., data = trainSet, method = "class")
cartPred <- predict(cartModel, newdata = testSet, type = "class")
conMatrix <- table(cartPred, testSet$classe)
ACC <- acc(conMatrix)
sum = sum + ACC
}
print("The average accuracy is:")
print(sum/10)
## [1] "The average accuracy is:"
## [1] 0.7515905
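An alternative to the explicit loop is caret’s built-in resampling. The following minimal sketch, assuming “classe” is a factor, runs a true 10-fold cross-validation of the CART model through the train function; note that train also tunes the cp parameter, so its reported accuracy need not match the loop above.
ctrl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
cvModel <- train(classe ~ ., data = Data, method = "rpart", trControl = ctrl)
# cvModel$results holds the cross-validated accuracy for each cp value tried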
Furthermore, a random forest algorithm was used to build an ensemble tree model, which is generally more powerful than a single tree, and the observed accuracy indeed bears this out.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
inTrain <- createDataPartition(y = Data$classe,
p = 0.75,
list = F)
trainSet <- Data[inTrain,]
testSet <- Data[-inTrain,]
rfModel <- randomForest(classe~., data=trainSet)
rfPred <- predict(rfModel, newdata = testSet)
conMatrix <- table(rfPred, testSet$classe)
conMatrix
##
## rfPred A B C D E
## A 1395 0 0 0 0
## B 0 949 3 0 0
## C 0 0 852 3 0
## D 0 0 0 800 2
## E 0 0 0 1 899
ACC <- acc(conMatrix)
print("The accuracy of the random forest model is: ")
print(ACC)
## [1] "The accuracy of the random forest model is: "
## [1] 0.9981648
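Beyond raw accuracy, the randomForest package can also report which predictors drive the model. A brief sketch using its importance and varImpPlot helpers:
head(importance(rfModel)) # mean decrease in Gini impurity per predictor
varImpPlot(rfModel) # dot chart of variable importance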
Finally, the random forest model was used to predict the 20 cases in the test data.
levels(test$new_window) <- levels(trainSet$new_window) # for the random forest predict function to work, a factor predictor in the new data must have the same levels as in the training data
testPred <- predict(rfModel, newdata = test)
testPred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
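As a side note, assigning levels() directly matches levels by position and can silently relabel values when the new data is missing some levels. A more defensive sketch rebuilds the factor with the training levels instead:
test$new_window <- factor(test$new_window, levels = levels(trainSet$new_window)) # safer level alignment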