It is now possible to collect personal movement data relatively inexpensively using devices such as the Jawbone Up, Nike FuelBand, and Fitbit. Activity recognition research generally focuses on predicting which activity is performed at a specific point in time; how well an activity is performed is traditionally neglected. In this project, we use the Weight Lifting Exercises data set, collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants [1], to build a model that predicts the manner in which they did the exercise.
Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
library(mlbench)
library(caret)
library(parallel)
library(doParallel)
library(rpart)
library(knitr)
library(kableExtra)
if (!file.exists("pml-training.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile ="pml-testing.csv")
}
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
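As a side note, the blank-to-NA replacement performed below could also be handled at load time through read.csv's na.strings argument. A minimal sketch; the training_alt name is only illustrative and is not used anywhere else in this report.
# Alternative load (illustration only): treat empty strings as NA when reading the file
training_alt <- read.csv("pml-training.csv", na.strings = c("NA", ""))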
The first two columns are the sequence number and participant name, and the next five columns contain timestamps and feature-extraction window information. These values are not useful for predicting exercise quality, so the first seven columns are removed from both the training and test data sets.
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
Empty observations are replaced with NA, zero and near-zero variance predictors are removed, and predictors with more than 50% NA values are dropped. After these steps there are no missing values in either data set.
training[training == ""] <- NA
x = nearZeroVar(training)
training <- training[, -x]
training <- training[, -which(colMeans(is.na(training)) > 0.5)]
testing[testing == ""] <- NA
testing <- testing[, -nearZeroVar(testing)]
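As an optional sanity check (not part of the original analysis), the cleaned test set should contain the same predictor columns as the cleaned training set, differing only in classe versus problem_id.
# Columns present in one data set but not the other
setdiff(names(training), names(testing))   # expected: "classe"
setdiff(names(testing), names(training))   # expected: "problem_id"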
table(complete.cases(training))
##
## TRUE
## 19622
table(complete.cases(testing))
##
## TRUE
## 20
dim(training)
## [1] 19622 53
dim(testing)
## [1] 20 53
The training data set is divided into two parts: 75% is used for model training and 25% for validation.
inTraining <- createDataPartition(training$classe, p = .75, list=FALSE)
subtraining <- training[inTraining,]
validation <- training[-inTraining,]
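createDataPartition samples within each class, so the class proportions in the two subsets should be approximately the same; a quick check:
# Class proportions in the training and validation subsets
round(prop.table(table(subtraining$classe)), 3)
round(prop.table(table(validation$classe)), 3)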
Three models are fitted: a classification tree, a random forest, and gradient boosting. The final model is chosen based on the highest accuracy on the validation set. The three models are described below.
First, I fit a classification tree using the rpart function.
set.seed(2311)
fitrpart <- rpart(classe ~ ., data = subtraining, method = "class")
print("Make prediction on the validation data set")
## [1] "Make prediction on the validation data set"
predtree <- predict(fitrpart, validation, type = "class")
matrix1 <- confusionMatrix(predtree, validation$classe)
print("Confusion Matrix and other values")
## [1] "Confusion Matrix and other values"
matrix1$table
## Reference
## Prediction A B C D E
## A 1247 167 21 49 12
## B 35 550 86 81 82
## C 40 120 668 107 125
## D 50 65 55 515 48
## E 23 47 25 52 634
matrix1$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.369494e-01 6.665443e-01 7.243859e-01 7.492303e-01 2.844617e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.089358e-36
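A plot of the fitted tree can make the splits easier to read; a minimal sketch, assuming the rpart.plot package is installed (it is not loaded in the setup above).
library(rpart.plot)      # assumed to be installed; not part of the original setup
rpart.plot(fitrpart)     # visualise the fitted classification tree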
Next, I fit a random forest model using 5-fold cross-validation.
x <- subtraining[, -53]                       # predictors (all columns except classe)
y <- subtraining[, 53]                        # outcome (classe)
cluster <- makeCluster(detectCores() - 1)     # leave one core for the OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv",
                           number = 5,
                           allowParallel = TRUE)
set.seed(2314)
fitrf <- train(x, y, method = "rf",
               trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()
print("Make prediction on the validation data set")
## [1] "Make prediction on the validation data set"
predrf <- predict(fitrf, validation)
matrix2 <- confusionMatrix(predrf, validation$classe)
print("Confusion Matrix and other values")
## [1] "Confusion Matrix and other values"
matrix2$table
## Reference
## Prediction A B C D E
## A 1395 3 0 0 0
## B 0 945 4 0 0
## C 0 1 848 10 2
## D 0 0 2 794 1
## E 0 0 1 0 898
matrix2$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9951060 0.9938092 0.9927269 0.9968619 0.2844617
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
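To see which sensor measurements drive the random forest, caret's varImp can rank the predictors; this is an optional inspection and is not needed for the final prediction.
# Rank predictors by importance in the random forest model and plot the top 10
plot(varImp(fitrf), top = 10)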
Finally, I fit the third model using gradient boosting (GBM).
set.seed(2310)
fitcontrol1 <- trainControl(method = "cv", number = 5)
fitgbm <- train(classe ~ ., data = subtraining, method = "gbm",
                trControl = fitcontrol1, verbose = FALSE)
print("Make prediction on the validation data set")
## [1] "Make prediction on the validation data set"
predgbm <- predict(fitgbm, validation)
matrix3 <- confusionMatrix(predgbm, validation$classe)
print("Confusion Matrix and other values")
## [1] "Confusion Matrix and other values"
matrix3$table
## Reference
## Prediction A B C D E
## A 1379 33 0 1 0
## B 12 887 31 5 10
## C 2 24 812 21 6
## D 2 2 9 768 12
## E 0 3 3 9 873
matrix3$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9622757 0.9522607 0.9565588 0.9674334 0.2844617
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
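caret also stores the boosting parameters selected by cross-validation; a quick look at the chosen tuning values and the tuning profile:
# Tuning parameters (n.trees, interaction.depth, shrinkage, n.minobsinnode) selected by CV
fitgbm$bestTune
plot(fitgbm)   # accuracy across the tuning grid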
dt <- data.frame(Method = c("Classification tree", "Random Forest",
                            "Gradient Boosting"),
                 Accuracy = c(matrix1$overall["Accuracy"],
                              matrix2$overall["Accuracy"],
                              matrix3$overall["Accuracy"]))
dt$Out_of_sample_error <- 1 - dt$Accuracy   # out-of-sample error = 1 - accuracy
kable(dt, digits = 4)
Method | Accuracy | Out_of_sample_error |
---|---|---|
Classification tree | 0.7369 | 0.2631 |
Random Forest | 0.9951 | 0.0049 |
Gradient Boosting | 0.9623 | 0.0377 |
From the table in the previous section, we see that the random forest model has the highest accuracy, 99.51%, so its estimated out-of-sample error is 0.0049 (1 - 0.9951), or about 0.5%. Hence, the random forest model is used to predict the 20 test cases.
predfinal <- predict(fitrf, testing)
df <- data.frame(Problem_id = testing$problem_id,
                 Prediction = predfinal)
kable(df, align = "r")
Problem_id | Prediction |
---|---|
1 | B |
2 | A |
3 | B |
4 | A |
5 | A |
6 | E |
7 | D |
8 | B |
9 | A |
10 | A |
11 | B |
12 | C |
13 | B |
14 | A |
15 | E |
16 | E |
17 | A |
18 | B |
19 | B |
20 | B |
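If the fitted model needs to be reused later without retraining, it can be persisted to disk; a minimal sketch, with an arbitrary file name.
# Save the final random forest model for later reuse (file name is illustrative)
saveRDS(fitrf, "fitrf.rds")
# fitrf <- readRDS("fitrf.rds")   # reload it in a later session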