Practical Machine Learning Project (Human activity recognition of weight lifting exercises)

Introdution:

It is now possible to collect personal movement data relatively inexpensively by using devices such as Jawbone Up, Nike FuelBand and Fitbit. Activity recognition research is generally focused on predicting what kind of activity is performed at a specific point in time. On the contrary, how well an activity is performed, is traditionally neglected. In this project, we used weight lifting exercises data set from accelerometers on the belt, forearm, arm and dumbell of 6 participants [1] to build a model to predict the manner in which they did the exercise.

Getting and Cleaning data:

Data source:

Training data : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test data : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Required libraries

library(mlbench)
library(caret)
library(parallel)
library(doParallel)
library(rpart)
library(knitr)
library(kableExtra)

if (!file.exists("pml-training.csv")) {  
  download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
  download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile ="pml-testing.csv")
}
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")

Cleaning data

The first two columns are sequence number and participant name, and next five columns are time stamp and feature extraction window. These values are not important to predict exercise quality. So, I have removed the first seven columns from both training and test data sets.

training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]

The empty observations are replaced with NA. The zero and near zero variance predictors are removed. Finally, predictors with more than 50% NA are also removed from the data set. There is no missing value in the data sets.

training[training == ""] <- NA
x = nearZeroVar(training)
training <- training[, -x]
training <- training[, -which(colMeans(is.na(training)) > 0.5)]


testing[testing == ""] <- NA
testing <- testing[, -nearZeroVar(testing)]

table(complete.cases(training))

## 
##  TRUE 
## 19622

table(complete.cases(testing))

## 
## TRUE 
##   20

dim(training)

## [1] 19622    53

dim(testing)

## [1] 20 53

Training and Validation data set

Training data set is divided into two parts. 75% data is kept for training and 25% is kept for validation.

inTraining <- createDataPartition(training$classe, p = .75, list=FALSE)
subtraining <- training[inTraining,]
validation <- training[-inTraining,]

Training model

Two models are created using random forest and gradient boosting method. The final model is chosen based on the highest accuracy. The two models are described below.

Regression and Classification Tree

At first, I fit a model based on classification tree using rpart function.

set.seed(2311)
fitrpart <- rpart(classe ~ ., data = subtraining, method = "class")

print("Make prediction on the validation data set")

## [1] "Make prediction on the validation data set"

predtree <- predict(fitrpart, validation, type = "class")
matrix1 <- confusionMatrix(predtree, validation$classe)

print("Confusion Matrix and other values")

## [1] "Confusion Matrix and other values"

matrix1$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1247  167   21   49   12
##          B   35  550   86   81   82
##          C   40  120  668  107  125
##          D   50   65   55  515   48
##          E   23   47   25   52  634

matrix1$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   7.369494e-01   6.665443e-01   7.243859e-01   7.492303e-01   2.844617e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   1.089358e-36

Random Forest

Then, I tried random forest method using 5-fold cross validation

x <- subtraining[,-53]
y <- subtraining[,53]
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv",
                           number = 5,
                           allowParallel = TRUE)

set.seed(2314)
fitrf <- train(x, y, method = "rf", data = subtraining, 
               trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()

print("Make prediction on the validation data set")

## [1] "Make prediction on the validation data set"

predrf <- predict(fitrf, validation)
matrix2 <- confusionMatrix(predrf, validation$classe)

print("Confusion Matrix and other values")

## [1] "Confusion Matrix and other values"

matrix2$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1395    3    0    0    0
##          B    0  945    4    0    0
##          C    0    1  848   10    2
##          D    0    0    2  794    1
##          E    0    0    1    0  898

matrix2$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9951060      0.9938092      0.9927269      0.9968619      0.2844617 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

Gradient Boosting

Finally, I used Gradient boosting to fit the third model.

set.seed(2310)
fitcontrol1<-trainControl(method="cv", number = 5)
fitgbm<-train(classe~., data=subtraining, method="gbm", 
              trControl=fitcontrol1, verbose = FALSE)


print("Make prediction on the validation data set")

## [1] "Make prediction on the validation data set"

predgbm <- predict(fitgbm, validation)
matrix3 <- confusionMatrix(predgbm, validation$classe)

print("Confusion Matrix and other values")

## [1] "Confusion Matrix and other values"

matrix3$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1379   33    0    1    0
##          B   12  887   31    5   10
##          C    2   24  812   21    6
##          D    2    2    9  768   12
##          E    0    3    3    9  873

matrix3$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9622757      0.9522607      0.9565588      0.9674334      0.2844617 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

Comparison between three models

dt <- data.frame(Method = c("Classification tree", "Random Forest",
                            "Gradient Boosting"), 
                 Accuracy = c("0.7745", "0.9937", "0.9556"),
                 Out_sample_error = c("0.2255", "0.0063",
                                      "0.0444"))
kable(dt, digits = 4)

Method	Accuracy	Out_sample_error
Classification tree	0.7745	0.2255
Random Forest	0.9937	0.0063
Gradient Boosting	0.9556	0.0444

Model Selection

From the table in previous section, we see that accuracy is 99.37% in the model fitted using the random forest method. The out of sample error is 0.0063 (1-0.9937) or 0.6%. Hence, I used random forest model fit to predict the 20 test cases.

Prediction of test cases

predfinal <- predict(fitrf, testing)
df <- data.frame(Problem_id = testing$problem_id, 
                 Prediction = predfinal)

kable(df, align = "r")

Problem_id	Prediction
1	B
2	A
3	B
4	A
5	A
6	E
7	D
8	B
9	A
10	A
11	B
12	C
13	B
14	A
15	E
16	E
17	A
18	B
19	B
20	B

Reference

E. Velloso, A. Bulling, H. Gellersen, W. Ugulino and H. Fuks, Proc. of 4th Inter. Conf. in Cooperation with SIGCHI, Stuttgart, Germany, 2013

Online links

The github page can be found in the following link

https://github.com/mohammadullah/Practical-Machine-Learning