Created with knitr in R x64 3.1.0 Patched on a Windows 7 64-bit OS
Github repo with RMarkdown source code: https://github.com/mgmarques/PML_PA
Online version: https://rpubs.com/mgmarques/PML_PA
This document presents the results of the Practical Machine Learning Peer Assessment in a report produced from a single R Markdown document that can be processed by knitr and transformed into an HTML file.
Since we have a data set with many columns and we need to make a class prediction, we decided to implement a random forests model, which requires no cross-validation or separate test set to obtain an unbiased estimate of the test set error. Before feeding the data set to our prediction model, we decided to remove all the columns having less than 60% of their data filled, instead of trying to fill them with some measure of center. Our model accuracy over the validation data set is equal to 99.9235%. This model produced excellent prediction results with our testing data set and generated the 20 answer files to submit for the Assignment, all of which were graded as correct!
The assignment instructions request to:

1. Predict the manner in which they did the exercise. This is the "classe" variable in the training set; any of the other variables may be used as predictors.
2. Show how I built my model, how I used cross-validation, what I think the expected out-of-sample error is, and why I made the choices I did.
3. Use the prediction model to predict 20 different test cases from the test data available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Throughout this report you can always find the code that I used to generate the output presented here. When writing code chunks in the R Markdown document, I always use echo = TRUE so that someone else will be able to read the code. This assignment will be evaluated via peer assessment, so it is essential that my peer evaluators be able to review my code and my analysis together.
First, we set echo equal to TRUE and results equal to 'hold' as global options for this document.
library(knitr)
opts_chunk$set(echo = TRUE, results = 'hold')
Load all the libraries used, and set the seed for reproducibility (results hidden, warnings and messages set to FALSE).
library(ElemStatLearn)
library(caret)
library(rpart)
library(randomForest)
set.seed(357)
This assignment makes use of data from personal activity monitoring devices (accelerometers on the belt, forearm, arm, and dumbbell of six participants). The data come from this source: http://groupware.les.inf.puc-rio.br/har.
First we define a function to check whether the file exists at the path defined by file_path. If it does not exist, the function stops execution of the current expression and executes an error action.
check_file_exist <- function(file_path)
{
    if (!file.exists(file_path))
        stop("The file ", file_path, " was not found!") else TRUE
}
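A quick usage sketch of this helper, with a hypothetical path:

check_file_exist("./Data")  # returns TRUE if the path exists, otherwise stops with an error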
Next, we define a function that uses data_dir and fileSource to build the file path, downloads the file from fileURL if it is not already present, and finally returns the loaded data set in the data variable.
load_data <- function(data_dir, fileURL, fileSource)
{
    # Check for the data set, download it if necessary, and load it
    source_path <- paste(data_dir, "/", fileSource, sep="")
    if (!file.exists(source_path)) {
        message(paste("Please Wait! Download...", fileURL, "..."));
        download.file(fileURL, destfile=source_path);
    }
    data <- read.csv(source_path, header=TRUE, na.strings=c("NA",""))
    data
}
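A hypothetical usage sketch of this function (the report itself performs the equivalent steps inline further below):

# Example call: downloads pml-training.csv into "./Data" if missing, then loads it
train_data <- load_data("./Data",
                        "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                        "pml-training.csv")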
You may need to change this data_dir variable to your own working directory (see getwd() in an R console), because the code below that asks you to inform where your data directory is found uses the readline function, which does not work in knitted markdown documents.
data_dir <- "C:/Users/Marcelo/Documents/PML_PA/Data";
Check whether the "./Data" directory exists; if it does not, ask the user for the path of their data directory. If the user informs an invalid directory path, execution of the current expression stops with an error.
if (!file.exists(data_dir)){
    # data_dir <- readline(prompt = "Please, inform your data directory path: ")
    data_dir <- "./PML_PA/Data" ## simulate a valid entry, since readline does not work in a knitted Rmd
    if (!file.exists(data_dir)){
        stop("You informed an invalid directory path")
    }
}
Here is the point at which all the data loading and preparation is invoked and run.
fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileSource <- "pml-training.csv"
txt_file <- paste(data_dir, "/", fileSource, sep="")
if (!file.exists(txt_file)) {
    message(paste("Please Wait! Download...", fileURL, "..."));
    download.file(fileURL, destfile=txt_file);
}
pml_CSV <- read.csv(txt_file, header=TRUE, sep=",", na.strings=c("NA",""))
pml_CSV <- pml_CSV[,-1] # Remove the first column, which is just a row ID
Create the training and validating data partitions.
inTrain = createDataPartition(pml_CSV$classe, p=0.60, list=FALSE)
training = pml_CSV[inTrain,]
validating = pml_CSV[-inTrain,]
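As a quick sanity check on the split (a sketch; exact counts depend on createDataPartition's per-class balancing over the 19622 rows):

dim(training)    # roughly 11776 rows (60%)
dim(validating)  # roughly 7846 rows (40%)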
Since we chose a random forest model and we have a data set with many columns, we first check whether there are many columns with mostly missing data. If that is the case, we remove all the columns having less than 60% of their data filled, instead of trying to fill them with some measure of center.
sum((colSums(!is.na(training[,-ncol(training)])) < 0.6*nrow(training)))
[1] 100
# Number of columns with less than 60% of data filled
So we apply our rule and remove the columns that are mostly empty before fitting the model.
Keep <- c((colSums(!is.na(training[,-ncol(training)])) >= 0.6*nrow(training)))
training <- training[,Keep]
validating <- validating[,Keep]
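A small check that the filter kept the same, reduced set of columns in both partitions:

ncol(training); ncol(validating)  # both should report the same, much smaller column count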
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error; it is estimated internally during execution. So we proceed with training the model (Random Forest) on the training data set.
model <- randomForest(classe~.,data=training)
print(model)
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.18%
## Confusion matrix:
## A B C D E class.error
## A 3347 1 0 0 0 0.0002987
## B 2 2276 1 0 0 0.0013164
## C 0 4 2047 3 0 0.0034080
## D 0 0 6 1923 1 0.0036269
## E 0 0 0 3 2162 0.0013857
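As a sanity check, the final OOB error can be pulled straight from the fitted object; a minimal sketch using the model above:

# OOB error after the last of the 500 trees, in percent (should match the ~0.18% reported above)
round(model$err.rate[model$ntree, "OOB"] * 100, 2)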
Next, we verify the variable importance measures as produced by randomForest:
importance(model)
## MeanDecreaseGini
## user_name 9.870e+01
## raw_timestamp_part_1 9.866e+02
## raw_timestamp_part_2 9.933e+00
## cvtd_timestamp 1.358e+03
## new_window 8.926e-02
## num_window 5.306e+02
## roll_belt 5.460e+02
## pitch_belt 2.988e+02
## yaw_belt 3.513e+02
## total_accel_belt 1.263e+02
## gyros_belt_x 4.052e+01
## gyros_belt_y 5.171e+01
## gyros_belt_z 1.278e+02
## accel_belt_x 5.780e+01
## accel_belt_y 6.444e+01
## accel_belt_z 1.858e+02
## magnet_belt_x 1.081e+02
## magnet_belt_y 2.101e+02
## magnet_belt_z 1.884e+02
## roll_arm 1.163e+02
## pitch_arm 5.445e+01
## yaw_arm 7.093e+01
## total_accel_arm 3.158e+01
## gyros_arm_x 4.143e+01
## gyros_arm_y 4.144e+01
## gyros_arm_z 1.862e+01
## accel_arm_x 9.662e+01
## accel_arm_y 4.815e+01
## accel_arm_z 3.981e+01
## magnet_arm_x 9.280e+01
## magnet_arm_y 7.552e+01
## magnet_arm_z 5.769e+01
## roll_dumbbell 2.014e+02
## pitch_dumbbell 9.083e+01
## yaw_dumbbell 1.157e+02
## total_accel_dumbbell 1.239e+02
## gyros_dumbbell_x 3.760e+01
## gyros_dumbbell_y 9.397e+01
## gyros_dumbbell_z 2.393e+01
## accel_dumbbell_x 1.152e+02
## accel_dumbbell_y 1.820e+02
## accel_dumbbell_z 1.378e+02
## magnet_dumbbell_x 2.233e+02
## magnet_dumbbell_y 3.229e+02
## magnet_dumbbell_z 2.941e+02
## roll_forearm 2.316e+02
## pitch_forearm 3.041e+02
## yaw_forearm 5.549e+01
## total_accel_forearm 3.126e+01
## gyros_forearm_x 2.414e+01
## gyros_forearm_y 3.601e+01
## gyros_forearm_z 2.404e+01
## accel_forearm_x 1.279e+02
## accel_forearm_y 4.278e+01
## accel_forearm_z 9.758e+01
## magnet_forearm_x 7.772e+01
## magnet_forearm_y 7.428e+01
## magnet_forearm_z 9.364e+01
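For a visual summary of the same information, randomForest provides varImpPlot; a quick sketch:

# Plot the top predictors ranked by MeanDecreaseGini
varImpPlot(model, n.var = 15, main = "Top 15 variables by importance")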
Now we evaluate our model's results through a confusion matrix on the validating set.
confusionMatrix(predict(model,newdata=validating[,-ncol(validating)]),validating$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 0 0 0 0
## B 0 1518 0 0 0
## C 0 0 1367 5 0
## D 0 0 1 1281 0
## E 0 0 0 0 1442
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.998, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.999
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 1.000 0.999 0.996 1.000
## Specificity 1.000 1.000 0.999 1.000 1.000
## Pos Pred Value 1.000 1.000 0.996 0.999 1.000
## Neg Pred Value 1.000 1.000 1.000 0.999 1.000
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.174 0.163 0.184
## Detection Prevalence 0.284 0.193 0.175 0.163 0.184
## Balanced Accuracy 1.000 1.000 0.999 0.998 1.000
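The overall accuracy can also be extracted programmatically instead of being read off the printout; a small sketch:

cm <- confusionMatrix(predict(model, newdata=validating[,-ncol(validating)]), validating$classe)
cm$overall["Accuracy"]  # about 0.999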
We also confirm the accuracy on the validating data set by computing it directly:
accuracy <- c(as.numeric(predict(model, newdata=validating[,-ncol(validating)]) == validating$classe))
accuracy <- sum(accuracy)*100/nrow(validating)
Model Accuracy as tested over Validation set = 99.9235%.
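The expected out-of-sample error follows as the complement of this accuracy; a one-line sketch using the accuracy value computed above:

# Expected out-of-sample error, in percent (about 0.08% here)
out_of_sample_error <- 100 - accuracy
out_of_sample_error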
Finally, we proceed with predicting the new values in the testing CSV file provided. First we apply the same data cleaning operations to it and coerce all columns of the testing data set to the same classes as the previous data set.
#### Getting the Testing Dataset
fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
fileSource <- "pml-testing.csv"
txt_file <- paste(data_dir, "/", fileSource, sep="")
if (!file.exists(txt_file)) {
    message(paste("Please Wait! Download...", fileURL, "..."));
    download.file(fileURL, destfile=txt_file);
}
pml_CSV <- read.csv(txt_file, header=TRUE, sep=",", na.strings=c("NA",""))
pml_CSV <- pml_CSV[,-1] # Remove the first column, which is just a row ID
pml_CSV <- pml_CSV[ , Keep] # Keep the same columns as in the training dataset
pml_CSV <- pml_CSV[,-ncol(pml_CSV)] # Remove the problem_id column
# class_check <- (sapply(pml_CSV, class) == sapply(training, class))
# pml_CSV[, !class_check] <- sapply(pml_CSV[, !class_check], as.numeric)
# Coerce the testing dataset to the same classes and structure as the training dataset
testing <- rbind(training[100, -59], pml_CSV)
# Name the dummy row taken from training "100" and the 20 test cases 1:20
row.names(testing) <- c(100, 1:20)
predictions <- predict(model,newdata=testing[-1,])
print(predictions)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
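As a small sanity sketch, we can verify that the rbind trick gave every testing column the same class as its training counterpart:

# TRUE when all column classes of testing match those of the training predictors
all(sapply(testing, class) == sapply(training[, -ncol(training)], class))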
The following function creates the answer files for the Prediction Assignment Submission:
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("./answers/problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
pml_write_files(predictions)
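Note that write.table stops with an error if the target folder is missing, so the answers directory must exist before the call above; a small guard sketch (file.exists also works for directories, avoiding dir.exists, which requires R >= 3.2):

# Create the answers output directory if it does not already exist
if (!file.exists("./answers")) dir.create("./answers")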
Given the high accuracy of our model, as expected, all 20 submitted answer files were correct!