Prediction Assignment Writeup

Load Data

First, download csv files to local machine and constracut training and testing data.

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',destfile = "/tmp/training.csv", method = "curl")
rawTraining <- read.csv("/tmp/training.csv", na.strings = c("NA",""))

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
',destfile = "/tmp/testing.csv", method = "curl")
rawTesting <- read.csv("/tmp/testing.csv", na.strings = c("NA",""))

Test Data Cleanup

Now, let’s exclude columns whose data are NA and unrelated columns from data set. As for unrelated columns, we can ignore followings.

X
user_name
raw_timestamp_part_1
raw_timestamp_part_2
cvtdtimestamp
new_window

# discard NA columns: For each column, summarize the number of NA.
nacheck <- apply(rawTraining, 2, function(x) {
  sum(is.na(x))
})

# select columns whose number of NA is zero
nonNATraining <- rawTraining[, which(nacheck == 0)]

# Exclude not related columns
tidyTraining <- nonNATraining %>% select(-(X:new_window))

Create Train and Validattion Data from pml-trainin Data Set

Since test data (pml-testing) do not contain “classe” data, we need to split train data (pml-training) into “train” and “validate” data so that we can check how well the model fits.

colnames(tidyTraining)

##  [1] "num_window"           "roll_belt"            "pitch_belt"          
##  [4] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
##  [7] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [10] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [13] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [16] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [19] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [22] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [25] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [28] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [31] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [34] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [37] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [40] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [43] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [46] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [49] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [52] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

Now split training data into “training” and “validatation”

trainIndx <- createDataPartition(tidyTraining$classe, p = 0.7, list = FALSE)
training <- tidyTraining[trainIndx,]
validationData <- tidyTraining[-trainIndx,]

Create and Train Model

By using with 5 fold cross validation, let’s create a Random Forest model whose outcome is “classe” and predicotrs are rest of the columns in the training data.

# train control : 5 fold cross validation.
trControl <- trainControl(method = "cv", number = 5)

# build model
modelFit <- train(training$classe ~ ., method = "rf", trControl = trControl, data = training)

summary(modelFit)

##                 Length Class      Mode     
## call                4  -none-     call     
## type                1  -none-     character
## predicted       13737  factor     numeric  
## err.rate         3000  -none-     numeric  
## confusion          30  -none-     numeric  
## votes           68685  matrix     numeric  
## oob.times       13737  -none-     numeric  
## classes             5  -none-     character
## importance         53  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             14  -none-     list     
## y               13737  factor     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             53  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           5  -none-     character

modelFit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.19%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    0    0    0    1 0.0002560164
## B    6 2649    2    1    0 0.0033860045
## C    0    3 2393    0    0 0.0012520868
## D    0    0    7 2244    1 0.0035523979
## E    0    0    0    5 2520 0.0019801980

Predict “classe” for Validation Data

Apply the model to Validation Data

Apply the random forest model to the Validation data set and get predictions.

predictions <- predict(modelFit, validationData)

Prediction Results

Compare the predicted “classe” and “classe” value in validation data set.

table(predictions, validationData$classe)

##            
## predictions    A    B    C    D    E
##           A 1672    7    0    0    0
##           B    1 1132    2    0    0
##           C    0    0 1024    2    0
##           D    0    0    0  962    2
##           E    1    0    0    0 1080

Out of Sample Error

From the model fit result displayed below, error rate is \((1 - 0.997) \cdot 100 = 0.3\%\).

print(modelFit)

## Random Forest 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 10990, 10990, 10989, 10990, 10989 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    0.993     0.991  0.00233      0.00295 
##   27    0.997     0.997  0.00133      0.00168 
##   53    0.995     0.994  0.00241      0.00305 
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

Error rate is caluted as follows:

\[ 1 - \frac{number\ of \ correct \ estimates}{number \ of \ predictions} \]

Number of correct estimates is calculated as follows:

sum(predictions == validationData$classe)

## [1] 5870

and number of predictions is

length(predictions)

## [1] 5885

So the Out of Sample Error rate is 0.25 %

Predict “classe” for test data (plm-testing)

Clean test data

Before apply the model to test data, let’s clean the test data like we did for train data.

# discard NA columns: For each column, summarize the number of NA.
nacheckTest <- apply(rawTesting, 2, function(x) {
  sum(is.na(x))
})

# select columns whose number of NA is zero
nonNATesting <- rawTesting[, which(nacheckTest == 0)]

# Exclude not related columns
testing <- nonNATesting %>% select(-(X:new_window))
dim(testing)

## [1] 20 54

Apply the model to test Data

Apply the random forest model to the Validation data set and get predictions.

testPredictions <- predict(modelFit, testing)
testPredictions

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

For Course Project Submission

To make it with pml_write_files function, convert testPredictions to a character vecotr.

testPredictionsVector <- c(as.character(testPredictions))

Then create a plm_write_files function to gengerate .txt file for each prediction

# write up
pml_write_files = function(x) {
    n = length(x)
    for (i in 1:n) {
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, 
            col.names = FALSE)
    }
}

Call the function with testPredicitons

pml_write_files(testPredictions)