We built prediction models for the classe variable in the 'Weight Lifting Exercise Dataset', which contains accelerometer data from sensors placed on several parts of the body. (More information about the dataset is available here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har) In this document we preprocess the data, split it for cross-validation, and build two models: a random forest and a generalized boosted model (GBM). Because the random forest showed signs of overfitting, we ultimately chose the GBM.
# create a "data" directory if it does not already exist
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileUrl2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl1,destfile="./data/pml-training.csv",method="curl")
download.file(fileUrl2,destfile="./data/pml-testing.csv",method="curl")
training <- read.csv("./data/pml-training.csv", header=T)
testing <- read.csv("./data/pml-testing.csv", header=T)
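As a side note, many of the summary columns in the raw CSVs contain blank entries and "#DIV/0!" strings in addition to literal NA values. A minimal sketch of an alternative read step that maps all of these to NA at load time (training.alt is just an illustrative name; the rest of this analysis uses the plain read.csv() above):
# optional: treat blanks and "#DIV/0!" as NA while reading the training file
training.alt <- read.csv("./data/pml-training.csv", header=T,
                         na.strings=c("NA", "", "#DIV/0!"))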
With the datasets loaded, let's check how many features they contain.
dim(training); dim(testing)
## [1] 19622 160
## [1] 20 160
There are 19622 observations of 160 features in the training dataset, and 20 observations of the same 160 features in the testing dataset.
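It can also help to look at how the outcome is distributed before any preprocessing; a quick sketch, assuming classe was read in as a factor:
# counts and proportions of the five activity classes
table(training$classe)
round(prop.table(table(training$classe)), 3)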
First of all, these datasets contain features that are of no use for building models (the row index, user name, timestamps, and window indicators); varImp() gives a sense of which features actually matter (see Appendix). We therefore created new datasets without them.
useless.train <- grep("^X|user|timestamp|window", colnames(training))
useless.test <- grep("^X|user|timestamp|window", colnames(testing))
newtrain <- training[,-useless.train]
newtest <- testing[,-useless.test]
The remaining data still contained columns full of NAs, so we removed those as well, using the training set's NA pattern for both sets so that the columns stay aligned.
midtrain <- newtrain[, colSums(is.na(newtrain))==0]
midtest <- newtest[, colSums(is.na(newtrain))==0]
Finally, there were some empty features in the datasets. Since they are all non-numeric, we kept only the numeric columns (re-attaching classe to the training set afterwards).
finalfilter1 <- sapply(midtrain, is.numeric)
finalfilter2 <- sapply(midtest, is.numeric)
finaltrain <- midtrain[,finalfilter1] ; finaltrain$classe <- training$classe
finaltest <- midtest[,finalfilter2]
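As a sanity check, the two refined sets should now share the same predictors; the raw testing file carries a problem_id column in place of classe, so that should be the only difference reported by a sketch like this:
# columns present in one refined set but not the other
setdiff(names(finaltrain), names(finaltest))
setdiff(names(finaltest), names(finaltrain))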
We split finaltrain into three sets, t.train, t.val, and t.test, to reduce overfitting and to estimate the out-of-sample error.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
index1 <- createDataPartition(y=finaltrain$classe, p=0.7, list=FALSE)
t.train <- finaltrain[index1,]; t.val <- finaltrain[-index1,]
index2 <- createDataPartition(y=t.train$classe, p=0.7, list=FALSE)
t.train <- t.train[index2,]; t.test <- t.train[-index2,]
# note: t.train is reassigned before t.test is subset, so t.test is drawn from the
# new t.train and therefore overlaps the training data; keep this in mind below
dim(t.train); dim(t.val); dim(t.test); dim(finaltest)
## [1] 9619 53
## [1] 5885 53
## [1] 2879 53
## [1] 20 53
The dimensions of the refined datasets are shown above. We use t.train for training, t.val for validation, and t.test as an additional test set (with the caveat noted in the code above that it overlaps t.train). Our goal is to predict the classe variable for the finaltest set.
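Two quick checks on the split are worth running: createDataPartition() stratifies on classe, so the class proportions should be similar in each piece, and the preserved row names make it easy to quantify how much t.test overlaps t.train. A sketch:
# class proportions should be roughly equal across the three pieces
round(rbind(train = prop.table(table(t.train$classe)),
            val   = prop.table(table(t.val$classe)),
            test  = prop.table(table(t.test$classe))), 3)
# fraction of t.test rows that also appear in t.train (1 would mean full overlap)
mean(rownames(t.test) %in% rownames(t.train))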
We fit a random forest model first, because it implicitly selects informative variables (a random subset is tried at each split of each bootstrapped tree) and is generally robust to correlated features and outliers.
set.seed(1020)
trctrl <- trainControl(method="cv", number=3)
model1 <- train(classe~., data=t.train, method="rf",
                trContrl=trctrl, ntree=300)
# note: the argument name above is misspelled (it should be trControl), so caret
# used its default bootstrap resampling rather than the 3-fold CV defined in trctrl
model1$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 300, mtry = param$mtry, trContrl = ..1)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.99%
## Confusion matrix:
## A B C D E class.error
## A 2732 2 1 0 0 0.001096892
## B 22 1823 15 1 0 0.020419130
## C 0 14 1657 7 0 0.012514899
## D 0 1 22 1554 0 0.014584654
## E 0 2 5 3 1758 0.005656109
The per-class errors in the confusion matrix are small, and the OOB error estimate is about 1%.
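That 0.99% figure can be reproduced directly from the counts stored in the fitted randomForest object, which is a useful cross-check; a sketch:
# overall OOB error = off-diagonal counts / total counts
cm <- model1$finalModel$confusion[, 1:5]   # drop the class.error column
1 - sum(diag(cm)) / sum(cm)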
We also fit a generalized boosted model (GBM) on the same data.
model2 <- train(classe~., data=t.train, method="gbm")
#model2$finalModel
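Note that method="gbm" prints a long training log by default. If a quieter fit that reuses the 3-fold CV control from above is preferred, a sketch would be (model2.quiet is just an illustrative name; the results below come from model2 as fitted above):
# optional: the same GBM fit with explicit resampling control and a silenced training log
model2.quiet <- train(classe~., data=t.train, method="gbm",
                      trControl=trctrl, verbose=FALSE)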
valpre1 <- predict(model1, t.val)    # the extra classe column is ignored by predict()
testpre1 <- predict(model1, t.test)
(accuracy1 <- postResample(valpre1, t.val$classe)); (accuracy2 <- postResample(testpre1, t.test$classe))
## Accuracy Kappa
## 0.9879354 0.9847362
## Accuracy Kappa
## 1 1
The accuracy and kappa on the t.val dataset are very close to 1: accuracy is 98.8% and kappa is 98.5%. On t.test both are exactly 1; since t.test overlaps t.train, the model is being scored on data it was trained on, so this should not be read as out-of-sample performance.
1-as.numeric(accuracy1[1]); 1-as.numeric(accuracy2[1])
## [1] 0.01206457
## [1] 0
The corresponding error estimates are about 1.2% on t.val and 0% on t.test. The random forest therefore fits this dataset very well, with an expected out-of-sample error of roughly 1% based on the validation set; the perfect t.test score is a symptom of the overlap noted above rather than genuine generalization.
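For a fuller picture than accuracy and kappa alone, caret's confusionMatrix() reports per-class sensitivity and specificity on the validation predictions; a sketch:
# per-class breakdown of the random forest's validation performance
confusionMatrix(valpre1, t.val$classe)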
(testpre1 <- predict(model1, finaltest))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the predicted classe values for the finaltest set from the random forest model.
valpre2 <- predict(model2, t.val)
testpre2 <- predict(model2, t.test)
(accuracy3 <- postResample(valpre2, t.val$classe)); (accuracy4 <- postResample(testpre2, t.test$classe))
## Accuracy Kappa
## 0.9576890 0.9464755
## Accuracy Kappa
## 0.9760333 0.9696992
The GBM's accuracy and kappa on t.val are also close to 1: 95.8% and 94.6%. On t.test they are 97.6% and 97.0%. Unlike the random forest, the GBM does not score a perfect 100% on the (training-overlapping) t.test set, which is the behaviour that led us to prefer it.
1-as.numeric(accuracy3[1]); 1-as.numeric(accuracy4[1])
## [1] 0.04231096
## [1] 0.02396666
The corresponding error estimates are about 4.2% on t.val and 2.4% on t.test. The GBM is therefore also appropriate for this dataset, with an expected out-of-sample error of roughly 4% based on the validation set, and it avoids the suspiciously perfect t.test score of the random forest.
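The four accuracy estimates can be collected into one table for easier side-by-side comparison; a small sketch:
# accuracy and kappa for both models on t.val and t.test
round(rbind(RF.val  = accuracy1, RF.test  = accuracy2,
            GBM.val = accuracy3, GBM.test = accuracy4), 4)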
(testpre2 <- predict(model2, finaltest))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the predicted classe values for the finaltest set from the GBM; they are exactly the same as the random forest predictions.
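The agreement between the two sets of predictions can be confirmed programmatically; a quick sketch:
# TRUE if the random forest and GBM agree on every finaltest case
all(as.character(testpre1) == as.character(testpre2))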
We fit two models on this dataset, a random forest and a GBM, and both are suitable. The random forest scores a perfect 100% on t.test (which overlaps the training data), whereas the GBM does not; for the same reason, the GBM's t.test error is even lower than its t.val error. The two models produce exactly the same classe predictions for the finaltest set, which is not surprising given that it contains only 20 cases; with a larger test set, some differences between the models would likely appear.
varImp(model1)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 60.38
## yaw_belt 56.50
## magnet_dumbbell_z 46.15
## pitch_belt 44.23
## roll_forearm 43.12
## magnet_dumbbell_y 42.79
## accel_dumbbell_y 19.30
## magnet_dumbbell_x 18.57
## accel_forearm_x 17.80
## roll_dumbbell 17.67
## magnet_belt_z 16.34
## accel_belt_z 14.95
## accel_dumbbell_z 14.02
## magnet_forearm_z 13.92
## total_accel_dumbbell 13.42
## magnet_belt_y 13.23
## yaw_arm 11.16
## gyros_belt_z 11.01
## magnet_belt_x 10.44