library(caret)
library(tidyverse)
library(glmnet)
library(rpart.plot)
First, we load the Weight Lifting Exercises data (Velloso et al., 2013) into R and check the dimensions of the two datasets.
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
dim(train)
## [1] 19622 160
dim(test)
## [1] 20 160
The training data has 19622 observations and 160 variables, whereas the testing set has 20 observations and 160 variables.
for(i in 1:ncol(train)){
  if(colnames(train)[i] != colnames(test)[i]){
    print(colnames(train)[i])
    print(colnames(test)[i])
  }
}
## [1] "classe"
## [1] "problem_id"
Both checks show that the only column names that differ are classe, which appears only in the training set, and problem_id, which appears only in the testing set. The goal of this project is to predict classe for the testing observations, and the following code shows that there are five classes in total.
table(train$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
Before diving into the analysis, we first take a look at the dataset and find that it contains a large number of missing values and blank entries.
#Change all blanks to NAs
train[train == ""] <- NA
#Count the missing values in each column
number <- numeric(ncol(train))
for(i in 1:ncol(train)){
  number[i] <- sum(is.na(train[, i]))
}
number
## [1] 0 0 0 0 0 0 0 0 0 0 0
## [12] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [23] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [34] 19216 19216 19216 0 0 0 0 0 0 0 0
## [45] 0 0 0 0 0 19216 19216 19216 19216 19216 19216
## [56] 19216 19216 19216 19216 0 0 0 0 0 0 0
## [67] 0 0 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [78] 19216 19216 19216 19216 19216 19216 0 0 0 19216 19216
## [89] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [100] 19216 19216 0 19216 19216 19216 19216 19216 19216 19216 19216
## [111] 19216 19216 0 0 0 0 0 0 0 0 0
## [122] 0 0 0 19216 19216 19216 19216 19216 19216 19216 19216
## [133] 19216 19216 19216 19216 19216 19216 19216 0 19216 19216 19216
## [144] 19216 19216 19216 19216 19216 19216 19216 0 0 0 0
## [155] 0 0 0 0 0 0
The loop above counts the missing values in each column. Many columns have over 19,000 missing values out of only 19,622 total observations, so we delete those variables, as they provide very little information.
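As a quick sanity check, the missing fraction in the affected columns can be computed directly (a minimal sketch):
#Fraction of NAs in the columns that have any (19216/19622, about 98%)
round(unique(number[number > 0]) / nrow(train), 3)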
which(number != 0)
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [18] 29 30 31 32 33 34 35 36 50 51 52 53 54 55 56 57 58
## [35] 59 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 87
## [52] 88 89 90 91 92 93 94 95 96 97 98 99 100 101 103 104 105
## [69] 106 107 108 109 110 111 112 125 126 127 128 129 130 131 132 133 134
## [86] 135 136 137 138 139 141 142 143 144 145 146 147 148 149 150
colRemove <- which(number != 0)
trainNew <- train[, -c(1, 3:5, colRemove)]
testNew <- test[, -c(1, 3:5, colRemove)]
Since we removed those columns from the training set, we must remove the same ones from the testing set. Moreover, column X is simply the row index, so we drop it as well. Columns three to five hold timestamp information that is not relevant to the prediction task, so we remove them too.
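A quick check (a sketch) confirms that the cleaned training set contains no missing values and that both sets keep the same predictors, 56 columns each (160 minus the 104 removed):
#Sanity check: no NAs remain and the two sets have matching dimensions
anyNA(trainNew)
dim(trainNew)
dim(testNew)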
Before modeling, we set aside a local validation set to compare the performance of different statistical models. We split the training data into a new training set and a validation set in a 90/10 ratio.
set.seed(123)
split <- sample(1:nrow(trainNew), .9*nrow(trainNew))
NewTrain <- trainNew[split,]
Validation <- trainNew[-split,]
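Note that sample() draws rows uniformly at random; caret's createDataPartition would give a stratified split that preserves the class proportions (an alternative sketch, not used here):
#Stratified 90/10 split on classe (alternative to the uniform sample above)
inTrain <- createDataPartition(trainNew$classe, p = 0.9, list = FALSE)
#NewTrain <- trainNew[inTrain, ]; Validation <- trainNew[-inTrain, ]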
We begin with the simplest classification method: classification trees. To evaluate the accuracy of different models, we will use 10-fold cross-validation.
#CV with 10 folds
Control <- trainControl(method="cv", number=10)
#Classification Tree
set.seed(123)
CT.fit <- train(classe~., data=NewTrain, method="rpart", trControl=Control)
#Prediction
CT.pred <- predict(CT.fit, newdata=Validation[,-ncol(Validation)])
#Results
confusionMatrix(CT.pred, Validation$classe)$table
## Reference
## Prediction A B C D E
## A 491 150 158 146 43
## B 4 128 17 57 53
## C 28 87 195 138 103
## D 0 0 0 0 0
## E 0 0 0 0 165
confusionMatrix(CT.pred, Validation$classe)$overall[1]
## Accuracy
## 0.4987264
From the result above, the classification tree is only about 49.9% accurate, which is too poor to be a serious candidate. The fitted tree is plotted below.
rpart.plot(CT.fit$finalModel)
Next, we try k-nearest neighbors (KNN). The resulting accuracy is about 96.8%, which is quite good, and the plot below shows that k=5 (5 neighbors) yields the highest cross-validated accuracy.
#KNN
set.seed(123)
KNN.fit <- train(classe~., data=NewTrain, method="knn", trControl=Control, preProcess=c("center","scale"))
KNN.pred <- predict(KNN.fit, newdata=Validation[,-ncol(Validation)])
plot(KNN.fit)
confusionMatrix(KNN.pred, Validation$classe)$table
## Reference
## Prediction A B C D E
## A 516 6 1 0 2
## B 0 354 11 1 0
## C 5 4 352 16 3
## D 2 1 5 323 3
## E 0 0 1 1 356
confusionMatrix(KNN.pred, Validation$classe)$overall[1]
## Accuracy
## 0.9684157
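Rather than reading k off the plot, the value selected by cross-validation can be confirmed from the fit object, since caret stores it in bestTune:
#Tuning parameter chosen by 10-fold CV (k = 5 here)
KNN.fit$bestTune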
Quadratic discriminant analysis (QDA), shown below, gives an accuracy of 90.4%. The result is decent but not as good as KNN.
#QDA
set.seed(123)
QDA.fit <- train(classe~., data=NewTrain, method="qda", trControl=Control)
QDA.pred <- predict(QDA.fit, newdata=Validation[,-ncol(Validation)])
confusionMatrix(QDA.pred, Validation$classe)$table
## Reference
## Prediction A B C D E
## A 486 13 0 0 0
## B 24 318 23 2 8
## C 11 29 342 36 16
## D 2 2 3 300 11
## E 0 3 2 3 329
confusionMatrix(QDA.pred, Validation$classe)$overall[1]
## Accuracy
## 0.9042282
I also tried a random forest, but it did not finish in a reasonable time, so I could not obtain its result. Since KNN already predicts with 96.8% accuracy on the validation set, I use it as my final model.
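For completeness, here is one way the random forest might be made tractable, a sketch I did not run: it lowers ntree below the randomForest default of 500 and swaps 10-fold CV for the much cheaper out-of-bag error estimate.
#Untested sketch: faster random forest via OOB resampling and fewer trees
set.seed(123)
RF.fit <- train(classe~., data=NewTrain, method="rf",
                trControl=trainControl(method="oob"), ntree=100)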
Using the KNN model we just built, the predicted classes for the 20 test cases are shown below.
predict(KNN.fit, newdata=test[,-ncol(test)])
## [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E
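For the course submission, each prediction can be paired with its problem_id (a small sketch; KNN.final is just a name introduced here):
#Pair each prediction with its problem_id
KNN.final <- predict(KNN.fit, newdata=test[,-ncol(test)])
data.frame(problem_id = test$problem_id, prediction = KNN.final)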
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.