We built prediction models for the classe variable in the 'Weight Lifting Exercise Dataset', which contains accelerometer data from sensors placed on several parts of the body. (More information about the dataset is available here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har) In this document we preprocess the data, split it for cross-validation, and build two models: a random forest and a generalized boosted model (GBM). Because the random forest showed signs of overfitting, we ultimately chose the GBM.
# create a "data" directory if it does not already exist
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileUrl2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl1,destfile="./data/pml-training.csv",method="curl")
download.file(fileUrl2,destfile="./data/pml-testing.csv",method="curl")
training <- read.csv("./data/pml-training.csv", header=T)
testing <- read.csv("./data/pml-testing.csv", header=T)
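As a side note, many of the summary columns in the raw CSVs contain blank entries and "#DIV/0!" strings in addition to literal NA values. A minimal sketch of an alternative read step that maps all of these to NA at load time (training.alt is just an illustrative name; the rest of this analysis uses the plain read.csv() above):
# optional: treat blanks and "#DIV/0!" as NA while reading the training file
training.alt <- read.csv("./data/pml-training.csv", header=T,
                         na.strings=c("NA", "", "#DIV/0!"))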
With the datasets loaded, let's check how many features they contain.
dim(training); dim(testing)
## [1] 19622 160
## [1] 20 160
There are 19622 observations of 160 features in the training dataset, and 20 observations of the same 160 features in the testing dataset.
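It can also help to look at how the outcome is distributed before any preprocessing; a quick sketch, assuming classe was read in as a factor:
# counts and proportions of the five activity classes
table(training$classe)
round(prop.table(table(training$classe)), 3)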
First of all, these datasets contain features that are of no use for building models (the row index, user name, timestamps, and window indicators); varImp() gives a sense of which features actually matter (see Appendix). We therefore created new datasets without them.
useless.train <- grep("^X|user|timestamp|window", colnames(training))
useless.test <- grep("^X|user|timestamp|window", colnames(testing))
newtrain <- training[,-useless.train]
newtest <- testing[,-useless.test]
The remaining data still contained columns full of NAs, so we removed those as well, using the training set's NA pattern for both sets so that the columns stay aligned.
midtrain <- newtrain[, colSums(is.na(newtrain))==0]
midtest <- newtest[, colSums(is.na(newtrain))==0]
Finally, there were some empty features in the datasets. Since they are all non-numeric, we kept only the numeric columns (re-attaching classe to the training set afterwards).
finalfilter1 <- sapply(midtrain, is.numeric)
finalfilter2 <- sapply(midtest, is.numeric)
finaltrain <- midtrain[,finalfilter1] ; finaltrain$classe <- training$classe
finaltest <- midtest[,finalfilter2]
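As a sanity check, the two refined sets should now share the same predictors; the raw testing file carries a problem_id column in place of classe, so that should be the only difference reported by a sketch like this:
# columns present in one refined set but not the other
setdiff(names(finaltrain), names(finaltest))
setdiff(names(finaltest), names(finaltrain))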
We split finaltrain into three sets, t.train, t.val, and t.test, to reduce overfitting and to estimate the out-of-sample error.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
index1 <- createDataPartition(y=finaltrain$classe, p=0.7, list=FALSE)
t.train <- finaltrain[index1,]; t.val <- finaltrain[-index1,]
index2 <- createDataPartition(y=t.train$classe, p=0.7, list=FALSE)
t.train <- t.train[index2,]; t.test <- t.train[-index2,]
# note: t.train is reassigned before t.test is subset, so t.test is drawn from the
# new t.train and therefore overlaps the training data; keep this in mind below
dim(t.train); dim(t.val); dim(t.test); dim(finaltest)
## [1] 9619 53
## [1] 5885 53
## [1] 2879 53
## [1] 20 53
The dimensions of the refined datasets are shown above. We use t.train for training, t.val for validation, and t.test as an additional test set (with the caveat noted in the code above that it overlaps t.train). Our goal is to predict the classe variable for the finaltest set.
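Two quick checks on the split are worth running: createDataPartition() stratifies on classe, so the class proportions should be similar in each piece, and the preserved row names make it easy to quantify how much t.test overlaps t.train. A sketch:
# class proportions should be roughly equal across the three pieces
round(rbind(train = prop.table(table(t.train$classe)),
            val   = prop.table(table(t.val$classe)),
            test  = prop.table(table(t.test$classe))), 3)
# fraction of t.test rows that also appear in t.train (1 would mean full overlap)
mean(rownames(t.test) %in% rownames(t.train))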
We fit a random forest model first, because it implicitly selects informative variables (a random subset is tried at each split of each bootstrapped tree) and is generally robust to correlated features and outliers.
set.seed(1020)
trctrl <- trainControl(method="cv", number=3)
model1 <- train(classe~., data=t.train, method="rf",
                trContrl=trctrl, ntree=300)
# note: the argument name above is misspelled (it should be trControl), so caret
# used its default bootstrap resampling rather than the 3-fold CV defined in trctrl
model1$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 300, mtry = param$mtry, trContrl = ..1)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.99%
## Confusion matrix:
## A B C D E class.error
## A 2732 2 1 0 0 0.001096892
## B 22 1823 15 1 0 0.020419130
## C 0 14 1657 7 0 0.012514899
## D 0 1 22 1554 0 0.014584654
## E 0 2 5 3 1758 0.005656109
The per-class errors in the confusion matrix are small, and the OOB error estimate is about 1%.
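That 0.99% figure can be reproduced directly from the counts stored in the fitted randomForest object, which is a useful cross-check; a sketch:
# overall OOB error = off-diagonal counts / total counts
cm <- model1$finalModel$confusion[, 1:5]   # drop the class.error column
1 - sum(diag(cm)) / sum(cm)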
We also fit a generalized boosted model (GBM) on the same data.
model2 <- train(classe~., data=t.train, method="gbm")
#model2$finalModel
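Note that method="gbm" prints a long training log by default. If a quieter fit that reuses the 3-fold CV control from above is preferred, a sketch would be (model2.quiet is just an illustrative name; the results below come from model2 as fitted above):
# optional: the same GBM fit with explicit resampling control and a silenced training log
model2.quiet <- train(classe~., data=t.train, method="gbm",
                      trControl=trctrl, verbose=FALSE)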
valpre1 <- predict(model1, t.val)    # the extra classe column is ignored by predict()
testpre1 <- predict(model1, t.test)
(accuracy1 <- postResample(valpre1, t.val$classe)); (accuracy2 <- postResample(testpre1, t.test$classe))
## Accuracy Kappa
## 0.9879354 0.9847362
## Accuracy Kappa
## 1 1
The accuracy and kappa on the t.val dataset are very close to 1: accuracy is 98.8% and kappa is 98.5%. On t.test both are exactly 1; since t.test overlaps t.train, the model is being scored on data it was trained on, so this should not be read as out-of-sample performance.
1-as.numeric(accuracy1[1]); 1-as.numeric(accuracy2[1])
## [1] 0.01206457
## [1] 0
The corresponding error estimates are about 1.2% on t.val and 0% on t.test. The random forest therefore fits this dataset very well, with an expected out-of-sample error of roughly 1% based on the validation set; the perfect t.test score is a symptom of the overlap noted above rather than genuine generalization.
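For a fuller picture than accuracy and kappa alone, caret's confusionMatrix() reports per-class sensitivity and specificity on the validation predictions; a sketch:
# per-class breakdown of the random forest's validation performance
confusionMatrix(valpre1, t.val$classe)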
(testpre1 <- predict(model1, finaltest))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the predicted classe values for the finaltest set from the random forest model.
valpre2 <- predict(model2, t.val)
testpre2 <- predict(model2, t.test)
(accuracy3 <- postResample(valpre2, t.val$classe)); (accuracy4 <- postResample(testpre2, t.test$classe))
## Accuracy Kappa
## 0.9576890 0.9464755
## Accuracy Kappa
## 0.9760333 0.9696992
The GBM's accuracy and kappa on t.val are also close to 1: 95.8% and 94.6%. On t.test they are 97.6% and 97.0%. Unlike the random forest, the GBM does not score a perfect 100% on the (training-overlapping) t.test set, which is the behaviour that led us to prefer it.
1-as.numeric(accuracy3[1]); 1-as.numeric(accuracy4[1])
## [1] 0.04231096
## [1] 0.02396666
The corresponding error estimates are about 4.2% on t.val and 2.4% on t.test. The GBM is therefore also appropriate for this dataset, with an expected out-of-sample error of roughly 4% based on the validation set, and it avoids the suspiciously perfect t.test score of the random forest.
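The four accuracy estimates can be collected into one table for easier side-by-side comparison; a small sketch:
# accuracy and kappa for both models on t.val and t.test
round(rbind(RF.val  = accuracy1, RF.test  = accuracy2,
            GBM.val = accuracy3, GBM.test = accuracy4), 4)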
(testpre2 <- predict(model2, finaltest))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the predicted classe values for the finaltest set from the GBM; they are exactly the same as the random forest predictions.
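The agreement between the two sets of predictions can be confirmed programmatically; a quick sketch:
# TRUE if the random forest and GBM agree on every finaltest case
all(as.character(testpre1) == as.character(testpre2))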
We fit two models on this dataset, a random forest and a GBM, and both are suitable. The random forest scores a perfect 100% on t.test (which overlaps the training data), whereas the GBM does not; for the same reason, the GBM's t.test error is even lower than its t.val error. The two models produce exactly the same classe predictions for the finaltest set, which is not surprising given that it contains only 20 cases; with a larger test set, some differences between the models would likely appear.
varImp(model1)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 60.38
## yaw_belt 56.50
## magnet_dumbbell_z 46.15
## pitch_belt 44.23
## roll_forearm 43.12
## magnet_dumbbell_y 42.79
## accel_dumbbell_y 19.30
## magnet_dumbbell_x 18.57
## accel_forearm_x 17.80
## roll_dumbbell 17.67
## magnet_belt_z 16.34
## accel_belt_z 14.95
## accel_dumbbell_z 14.02
## magnet_forearm_z 13.92
## total_accel_dumbbell 13.42
## magnet_belt_y 13.23
## yaw_arm 11.16
## gyros_belt_z 11.01
## magnet_belt_x 10.44