First, download csv files to local machine and constracut training and testing data.
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',destfile = "/tmp/training.csv", method = "curl")
rawTraining <- read.csv("/tmp/training.csv", na.strings = c("NA",""))
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
',destfile = "/tmp/testing.csv", method = "curl")
rawTesting <- read.csv("/tmp/testing.csv", na.strings = c("NA",""))
Now, let’s exclude columns whose data are NA and unrelated columns from data set. As for unrelated columns, we can ignore followings.
# discard NA columns: For each column, summarize the number of NA.
nacheck <- apply(rawTraining, 2, function(x) {
sum(is.na(x))
})
# select columns whose number of NA is zero
nonNATraining <- rawTraining[, which(nacheck == 0)]
# Exclude not related columns
tidyTraining <- nonNATraining %>% select(-(X:new_window))
Since test data (pml-testing) do not contain “classe” data, we need to split train data (pml-training) into “train” and “validate” data so that we can check how well the model fits.
colnames(tidyTraining)
## [1] "num_window" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [7] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [10] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [13] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [16] "pitch_arm" "yaw_arm" "total_accel_arm"
## [19] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [22] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [25] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [28] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [31] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [34] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [37] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [40] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [43] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [46] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [49] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [52] "magnet_forearm_y" "magnet_forearm_z" "classe"
Now split training data into “training” and “validatation”
trainIndx <- createDataPartition(tidyTraining$classe, p = 0.7, list = FALSE)
training <- tidyTraining[trainIndx,]
validationData <- tidyTraining[-trainIndx,]
By using with 5 fold cross validation, let’s create a Random Forest model whose outcome is “classe” and predicotrs are rest of the columns in the training data.
# train control : 5 fold cross validation.
trControl <- trainControl(method = "cv", number = 5)
# build model
modelFit <- train(training$classe ~ ., method = "rf", trControl = trControl, data = training)
summary(modelFit)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 13737 factor numeric
## err.rate 3000 -none- numeric
## confusion 30 -none- numeric
## votes 68685 matrix numeric
## oob.times 13737 -none- numeric
## classes 5 -none- character
## importance 53 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 13737 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 53 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 5 -none- character
modelFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.19%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 6 2649 2 1 0 0.0033860045
## C 0 3 2393 0 0 0.0012520868
## D 0 0 7 2244 1 0.0035523979
## E 0 0 0 5 2520 0.0019801980
Apply the random forest model to the Validation data set and get predictions.
predictions <- predict(modelFit, validationData)
Compare the predicted “classe” and “classe” value in validation data set.
table(predictions, validationData$classe)
##
## predictions A B C D E
## A 1672 7 0 0 0
## B 1 1132 2 0 0
## C 0 0 1024 2 0
## D 0 0 0 962 2
## E 1 0 0 0 1080
From the model fit result displayed below, error rate is \((1 - 0.997) \cdot 100 = 0.3\%\).
print(modelFit)
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 10990, 10990, 10989, 10990, 10989
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.993 0.991 0.00233 0.00295
## 27 0.997 0.997 0.00133 0.00168
## 53 0.995 0.994 0.00241 0.00305
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
Error rate is caluted as follows:
\[ 1 - \frac{number\ of \ correct \ estimates}{number \ of \ predictions} \]
Number of correct estimates is calculated as follows:
sum(predictions == validationData$classe)
## [1] 5870
and number of predictions is
length(predictions)
## [1] 5885
So the Out of Sample Error rate is 0.25 %
Before apply the model to test data, let’s clean the test data like we did for train data.
# discard NA columns: For each column, summarize the number of NA.
nacheckTest <- apply(rawTesting, 2, function(x) {
sum(is.na(x))
})
# select columns whose number of NA is zero
nonNATesting <- rawTesting[, which(nacheckTest == 0)]
# Exclude not related columns
testing <- nonNATesting %>% select(-(X:new_window))
dim(testing)
## [1] 20 54
Apply the random forest model to the Validation data set and get predictions.
testPredictions <- predict(modelFit, testing)
testPredictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
To make it with pml_write_files function, convert testPredictions to a character vecotr.
testPredictionsVector <- c(as.character(testPredictions))
Then create a plm_write_files function to gengerate .txt file for each prediction
# write up
pml_write_files = function(x) {
n = length(x)
for (i in 1:n) {
filename = paste0("problem_id_", i, ".txt")
write.table(x[i], file = filename, quote = FALSE, row.names = FALSE,
col.names = FALSE)
}
}
Call the function with testPredicitons
pml_write_files(testPredictions)