Overview

This is an assignment from the Practical Machine Learning course offered by JHU on Coursera. This report uses data from the Weight Lifting Exercise Dataset, which includes readings from accelerometers on the belt, forearm, arm, and dumbbell of six participants.

Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

Since there are many predictors and the outcome is categorical, I chose a decision tree, boosting, and a random forest to build the models.

The training dataset will be further separated into actual training data and validation data; at the end, the final model will be applied to the testing data for prediction.

Reference: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.

Getting and Cleaning Data

Let’s start with grabbing the data online and cleaning it!

url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# read the data, treating blanks and "#DIV/0!" entries as NA
training <- read.csv(url1, na.strings = c("", NA, "#DIV/0!"))
testing <- read.csv(url2, na.strings = c("", NA, "#DIV/0!"))
training$classe <- as.factor(training$classe)

library(caret)
# keep only the columns with no NA values
# (you can also use -nearZeroVar(training); see the sketch below)
training <- training[, colSums(is.na(training)) == 0]
# drop the first seven columns (row id, user name, timestamps, window info),
# which carry no sensor information
training <- subset(training, select = -c(1:7))

testing <- testing[, colSums(is.na(testing)) == 0]
testing <- testing[, -c(1:7)]
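
As the comment above mentions, nearZeroVar() is another way to drop uninformative columns. A minimal sketch of that route, applied to a fresh read of the data (not the path used for the results in this report; the name raw is just for illustration):

# alternative cleaning: flag near-zero-variance columns with caret::nearZeroVar()
raw <- read.csv(url1, na.strings = c("", NA, "#DIV/0!"))
nzv <- nearZeroVar(raw)                  # indices of near-zero-variance columns
if (length(nzv) > 0) raw <- raw[, -nzv]  # guard: -integer(0) would drop every column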

Separate the training data into an actual training set (trainsub) and a validation set (testsub).

set.seed(2023)
inTrain <- createDataPartition(training$classe, p = 0.75, list = F)
trainsub <- training[inTrain,]
testsub <- training[-inTrain,]
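
A quick optional sanity check: createDataPartition() samples within each level of classe, so both subsets should show nearly identical class proportions.

# class proportions in the training and validation subsets
round(prop.table(table(trainsub$classe)), 3)
round(prop.table(table(testsub$classe)), 3)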

We can see the correlation between the variables through corrplot().

library(ggplot2); library(corrplot)
## corrplot 0.92 loaded
# column 53 is the outcome classe, so drop it before computing correlations
corrplot(cor(trainsub[, -c(53)]), method = "circle",
         type = "lower", tl.cex = 0.6, tl.col = "black")
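
The plot hints at several strongly correlated sensor readings. caret's findCorrelation() lists them explicitly; a small sketch (the 0.9 cutoff is my own choice):

# predictors whose pairwise absolute correlation exceeds 0.9
highCor <- findCorrelation(cor(trainsub[, -53]), cutoff = 0.9)
names(trainsub)[highCor]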

Tree

Let’s start with a decision tree by setting the method to “rpart”. We can see that the variables roll_belt, pitch_forearm, magnet_dumbbell_y, and roll_forearm (with splits at roll_belt < 131, pitch_forearm < 0.34, magnet_dumbbell_y < 440, and roll_forearm < 124) are used to build the classification.

WE GOT 0.50 ACCURACY FROM THE DECISION TREE.

set.seed(2023)
tree.model <- train(classe ~ ., data = trainsub, method = "rpart")

# plot the decision tree
library(rattle)
fancyRpartPlot(tree.model$finalModel)

# prediction and accuracy
pred <- predict(tree.model, testsub)
confusionMatrix(pred, as.factor(testsub$classe))$overall["Accuracy"]
##  Accuracy 
## 0.5008157
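
The 0.50 overall accuracy hides large differences between classes; the per-class sensitivities from the same confusionMatrix() call make that visible (in many runs of this model the tree has no leaf for class D at all):

# sensitivity (recall) per classe; a very low value flags a class
# the tree rarely or never predicts
confusionMatrix(pred, testsub$classe)$byClass[, "Sensitivity"]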

Boosting

Next, we build the second model with gbm(). The predictions from gbm are class probabilities, so some reformatting is needed.

WE GOT 0.82 ACCURACY FROM BOOSTING, HIGHER THAN THE DECISION TREE!!

PS: it doesn’t work for me with caret’s method = “gbm”; does anyone know why? A sketch of the usual caret invocation follows.
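
For reference, here is how method = “gbm” is usually invoked through caret; a sketch I have not verified against this exact data, and it is slow on a dataset of this size (one common stumbling block is the flood of per-iteration output unless verbose = FALSE is forwarded to gbm):

# caret tunes n.trees / interaction.depth / shrinkage over a small grid itself;
# verbose = FALSE is passed through to gbm() to silence the iteration log
ctrl <- trainControl(method = "cv", number = 3)
gbm.caret <- train(classe ~ ., data = trainsub, method = "gbm",
                   trControl = ctrl, verbose = FALSE)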

set.seed(2023)
library(gbm)
gbm.model <- gbm(formula = classe ~ ., distribution = "multinomial",
                 data = trainsub, verbose = FALSE, n.trees = 100)

# set type = "response" to get class probabilities
gbm.pred <- predict(gbm.model, testsub, type = "response", n.trees = 100)

# gbm returns one probability per classe; take the class with the
# highest probability for each row
pred_class <- apply(gbm.pred, 1, which.max)
# map the column index back to the classe labels explicitly, so the labels
# cannot be misassigned if some class happens never to be predicted
pred_class <- factor(c("A", "B", "C", "D", "E")[pred_class],
                     levels = c("A", "B", "C", "D", "E"))

confusionMatrix(pred_class, testsub$classe)$overall["Accuracy"]
## Accuracy 
## 0.824429
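
The 0.82 is likely an underfit: gbm() was left at interaction.depth = 1 (decision stumps) and only 100 trees. A hedged sketch of a stronger configuration (these values are guesses, not tuned, and it fits noticeably slower):

# deeper trees and more of them; shrinkage stays at the gbm default of 0.1
gbm.model2 <- gbm(classe ~ ., distribution = "multinomial", data = trainsub,
                  n.trees = 300, interaction.depth = 4, verbose = FALSE)
gbm.pred2 <- predict(gbm.model2, testsub, type = "response", n.trees = 300)
# convert the probabilities to class labels the same way as above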

Random Forest

Finally, the last model: random forest, which usually delivers the highest accuracy, though on the downside it can easily overfit.

WE GOT 0.99 ACCURACY FROM RANDOM FOREST!! THE HIGHEST!!

set.seed(2023)
library(randomForest)
# grow 50 trees, trying mtry = 17 predictors at each split
rf.model <- randomForest(classe ~ ., data = trainsub,
                         ntree = 50, mtry = 17, do.trace = FALSE,
                         proximity = TRUE, importance = TRUE)
# set do.trace = TRUE to watch the error rate as each tree is added
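
A note on mtry = 17: that is close to p/3 (52/3 ≈ 17), which is randomForest’s default for regression; for classification the default would be floor(sqrt(52)) = 7. If you would rather let the data decide, randomForest::tuneRF() searches over mtry; a sketch (every argument besides the data is my own choice):

# look for the mtry with the lowest OOB error, doubling/halving from the default
set.seed(2023)
tuned <- tuneRF(x = trainsub[, -53], y = trainsub$classe,
                ntreeTry = 50, stepFactor = 2, improve = 0.01)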

pred <- predict(rf.model, testsub, type = "class")

confusionMatrix(pred, as.factor(testsub$classe))$overall[1]
##  Accuracy 
## 0.9936786
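
Since importance = TRUE was set when fitting, variable importance comes for free. A quick look (on this dataset, roll_belt and the other variables the decision tree picked tend to rank near the top):

# mean decrease in accuracy / Gini per predictor, from the fit above
varImpPlot(rf.model, n.var = 10, main = "Top 10 predictors")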

Let’s have a look at the error rate for each classe and the OOB (out-of-bag) error.

OOB error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample (from Wikipedia).

The error rate drops as more trees are grown; it seems to reach its lowest level around ntree = 40.

plot(rf.model)
legend("topright", colnames(rf.model$err.rate),
       col = c("black", "#DF536B", "#61D04F", "#2297E6", "#28E2E5", "#CD0BBC"),
       fill = c("black", "#DF536B", "#61D04F", "#2297E6", "#28E2E5", "#CD0BBC"),
       cex = 0.8)
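
The numbers behind the plot live in rf.model$err.rate, one row per tree with an "OOB" column plus one column per classe. The last row gives the final OOB estimate, which doubles as an expected out-of-sample error:

# OOB error after the last (50th) tree
rf.model$err.rate[rf.model$ntree, "OOB"]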

The Final Prediction

final.predict <- predict(rf.model, testing, type = "class")

print(final.predict)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E