Created with knitr in R x64 3.1.0 Patched on a Windows 7 64-bit OS
Github repo with RMarkdown source code: https://github.com/mgmarques/PML_PA
Online version: https://rpubs.com/mgmarques/PML_PA
This document presents the results of the Practical Machine Learning Peer Assessment in a report produced from a single R Markdown document that can be processed by knitr and transformed into an HTML file.
Since we have a data set with many columns and we need to make a class prediction, we decided to implement a random forests model, which requires no cross-validation or separate test set to obtain an unbiased estimate of the test set error. Before feeding the data set to our prediction model, we decided to remove all the columns having less than 60% of their data filled, instead of trying to fill them with some measure of center. Our model accuracy over the validation data set is equal to 99.9235%. This model produced excellent prediction results with our testing data set and generated the 20 answer files to submit for the Assignment, all of which were graded as correct!
The assignment instructions request to:

1. Predict the manner in which they did the exercise. This is the "classe" variable in the training set; any of the other variables may be used as predictors.
2. Show how I built my model, how I used cross-validation, what I think the expected out-of-sample error is, and why I made the choices I did.
3. Use the prediction model to predict 20 different test cases from the test data available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Throughout this report you can always find the code that I used to generate the output presented here. When writing code chunks in the R Markdown document, I always use echo = TRUE so that someone else will be able to read the code. This assignment will be evaluated via peer assessment, so it is essential that my peer evaluators be able to review my code and my analysis together.
First, we set echo equal to TRUE and results equal to 'hold' as global options for this document.
library(knitr)
opts_chunk$set(echo = TRUE, results = 'hold')
Load all the libraries used, and set the seed for reproducibility (results hidden, warnings and messages set to FALSE).
library(ElemStatLearn)
library(caret)
library(rpart)
library(randomForest)
set.seed(357)
This assignment makes use of data from personal activity monitoring devices (accelerometers on the belt, forearm, arm, and dumbbell of six participants). The data come from this source: http://groupware.les.inf.puc-rio.br/har.
First we define a function to check whether the file exists at the path defined by file_path. If it does not exist, the function stops execution of the current expression and executes an error action.
check_file_exist <- function(file_path)
{
    if (!file.exists(file_path))
        stop("The file ", file_path, " was not found!") else TRUE
}
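A quick usage sketch of this helper, with a hypothetical path:

check_file_exist("./Data")  # returns TRUE if the path exists, otherwise stops with an error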
Next, we define a function that uses data_dir and fileSource to build the file path, downloads the file from fileURL if it is not already present, and finally returns the loaded data set in the data variable.
load_data <- function(data_dir, fileURL, fileSource)
{
    # Check for the data set, download it if necessary, and load it
    source_path <- paste(data_dir, "/", fileSource, sep="")
    if (!file.exists(source_path)) {
        message(paste("Please Wait! Download...", fileURL, "..."));
        download.file(fileURL, destfile=source_path);
    }
    data <- read.csv(source_path, header=TRUE, na.strings=c("NA",""))
    data
}
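A hypothetical usage sketch of this function (the report itself performs the equivalent steps inline further below):

# Example call: downloads pml-training.csv into "./Data" if missing, then loads it
train_data <- load_data("./Data",
                        "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                        "pml-training.csv")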
You may need to change this data_dir variable to your own working directory (see getwd() in an R console), because the code below that asks you to inform where your data directory is found uses the readline function, which does not work in knitted markdown documents.
data_dir <- "C:/Users/Marcelo/Documents/PML_PA/Data";
Check whether the "./Data" directory exists; if it does not, ask the user for the path of their data directory. If the user informs an invalid directory path, execution of the current expression stops with an error.
if (!file.exists(data_dir)){
    # data_dir <- readline(prompt = "Please, inform your data directory path: ")
    data_dir <- "./PML_PA/Data" ## simulate a valid entry, since readline does not work in a knitted Rmd
    if (!file.exists(data_dir)){
        stop("You informed an invalid directory path")
    }
}
Here is the point at which all the data loading and preparation is invoked and run.
fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileSource <- "pml-training.csv"
txt_file <- paste(data_dir, "/", fileSource, sep="")
if (!file.exists(txt_file)) {
    message(paste("Please Wait! Download...", fileURL, "..."));
    download.file(fileURL, destfile=txt_file);
}
pml_CSV <- read.csv(txt_file, header=TRUE, sep=",", na.strings=c("NA",""))
pml_CSV <- pml_CSV[,-1] # Remove the first column, which is just a row ID
Create the training and validating data partitions.
inTrain = createDataPartition(pml_CSV$classe, p=0.60, list=FALSE)
training = pml_CSV[inTrain,]
validating = pml_CSV[-inTrain,]
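As a quick sanity check on the split (a sketch; exact counts depend on createDataPartition's per-class balancing over the 19622 rows):

dim(training)    # roughly 11776 rows (60%)
dim(validating)  # roughly 7846 rows (40%)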
Since we chose a random forest model and we have a data set with many columns, we first check whether there are many columns with mostly missing data. If that is the case, we remove all the columns having less than 60% of their data filled, instead of trying to fill them with some measure of center.
sum((colSums(!is.na(training[,-ncol(training)])) < 0.6*nrow(training)))
[1] 100
# Number of columns with less than 60% of data filled
So we apply our rule and remove the columns that are mostly empty before fitting the model.
Keep <- c((colSums(!is.na(training[,-ncol(training)])) >= 0.6*nrow(training)))
training <- training[,Keep]
validating <- validating[,Keep]
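A small check that the filter kept the same, reduced set of columns in both partitions:

ncol(training); ncol(validating)  # both should report the same, much smaller column count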
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error; it is estimated internally during execution. So we proceed with training the model (Random Forest) on the training data set.
model <- randomForest(classe~.,data=training)
print(model)
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.18%
## Confusion matrix:
## A B C D E class.error
## A 3347 1 0 0 0 0.0002987
## B 2 2276 1 0 0 0.0013164
## C 0 4 2047 3 0 0.0034080
## D 0 0 6 1923 1 0.0036269
## E 0 0 0 3 2162 0.0013857
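As a sanity check, the final OOB error can be pulled straight from the fitted object; a minimal sketch using the model above:

# OOB error after the last of the 500 trees, in percent (should match the ~0.18% reported above)
round(model$err.rate[model$ntree, "OOB"] * 100, 2)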
Next, we verify the variable importance measures as produced by randomForest:
importance(model)
## MeanDecreaseGini
## user_name 9.870e+01
## raw_timestamp_part_1 9.866e+02
## raw_timestamp_part_2 9.933e+00
## cvtd_timestamp 1.358e+03
## new_window 8.926e-02
## num_window 5.306e+02
## roll_belt 5.460e+02
## pitch_belt 2.988e+02
## yaw_belt 3.513e+02
## total_accel_belt 1.263e+02
## gyros_belt_x 4.052e+01
## gyros_belt_y 5.171e+01
## gyros_belt_z 1.278e+02
## accel_belt_x 5.780e+01
## accel_belt_y 6.444e+01
## accel_belt_z 1.858e+02
## magnet_belt_x 1.081e+02
## magnet_belt_y 2.101e+02
## magnet_belt_z 1.884e+02
## roll_arm 1.163e+02
## pitch_arm 5.445e+01
## yaw_arm 7.093e+01
## total_accel_arm 3.158e+01
## gyros_arm_x 4.143e+01
## gyros_arm_y 4.144e+01
## gyros_arm_z 1.862e+01
## accel_arm_x 9.662e+01
## accel_arm_y 4.815e+01
## accel_arm_z 3.981e+01
## magnet_arm_x 9.280e+01
## magnet_arm_y 7.552e+01
## magnet_arm_z 5.769e+01
## roll_dumbbell 2.014e+02
## pitch_dumbbell 9.083e+01
## yaw_dumbbell 1.157e+02
## total_accel_dumbbell 1.239e+02
## gyros_dumbbell_x 3.760e+01
## gyros_dumbbell_y 9.397e+01
## gyros_dumbbell_z 2.393e+01
## accel_dumbbell_x 1.152e+02
## accel_dumbbell_y 1.820e+02
## accel_dumbbell_z 1.378e+02
## magnet_dumbbell_x 2.233e+02
## magnet_dumbbell_y 3.229e+02
## magnet_dumbbell_z 2.941e+02
## roll_forearm 2.316e+02
## pitch_forearm 3.041e+02
## yaw_forearm 5.549e+01
## total_accel_forearm 3.126e+01
## gyros_forearm_x 2.414e+01
## gyros_forearm_y 3.601e+01
## gyros_forearm_z 2.404e+01
## accel_forearm_x 1.279e+02
## accel_forearm_y 4.278e+01
## accel_forearm_z 9.758e+01
## magnet_forearm_x 7.772e+01
## magnet_forearm_y 7.428e+01
## magnet_forearm_z 9.364e+01
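For a visual summary of the same information, randomForest provides varImpPlot; a quick sketch:

# Plot the top predictors ranked by MeanDecreaseGini
varImpPlot(model, n.var = 15, main = "Top 15 variables by importance")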
Now we evaluate our model's results through a confusion matrix on the validating set.
confusionMatrix(predict(model,newdata=validating[,-ncol(validating)]),validating$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 0 0 0 0
## B 0 1518 0 0 0
## C 0 0 1367 5 0
## D 0 0 1 1281 0
## E 0 0 0 0 1442
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.998, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.999
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 1.000 0.999 0.996 1.000
## Specificity 1.000 1.000 0.999 1.000 1.000
## Pos Pred Value 1.000 1.000 0.996 0.999 1.000
## Neg Pred Value 1.000 1.000 1.000 0.999 1.000
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.174 0.163 0.184
## Detection Prevalence 0.284 0.193 0.175 0.163 0.184
## Balanced Accuracy 1.000 1.000 0.999 0.998 1.000
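The overall accuracy can also be extracted programmatically instead of being read off the printout; a small sketch:

cm <- confusionMatrix(predict(model, newdata=validating[,-ncol(validating)]), validating$classe)
cm$overall["Accuracy"]  # about 0.999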
We also confirm the accuracy on the validating data set by computing it directly:
accuracy <- c(as.numeric(predict(model, newdata=validating[,-ncol(validating)]) == validating$classe))
accuracy <- sum(accuracy)*100/nrow(validating)
Model Accuracy as tested over Validation set = 99.9235%.
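The expected out-of-sample error follows as the complement of this accuracy; a one-line sketch using the accuracy value computed above:

# Expected out-of-sample error, in percent (about 0.08% here)
out_of_sample_error <- 100 - accuracy
out_of_sample_error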
Finally, we proceed with predicting the new values in the testing CSV file provided. First we apply the same data cleaning operations to it and coerce all columns of the testing data set to the same classes as the previous data set.
#### Getting the Testing Dataset
fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
fileSource <- "pml-testing.csv"
txt_file <- paste(data_dir, "/", fileSource, sep="")
if (!file.exists(txt_file)) {
    message(paste("Please Wait! Download...", fileURL, "..."));
    download.file(fileURL, destfile=txt_file);
}
pml_CSV <- read.csv(txt_file, header=TRUE, sep=",", na.strings=c("NA",""))
pml_CSV <- pml_CSV[,-1] # Remove the first column, which is just a row ID
pml_CSV <- pml_CSV[ , Keep] # Keep the same columns as in the training dataset
pml_CSV <- pml_CSV[,-ncol(pml_CSV)] # Remove the problem_id column
# class_check <- (sapply(pml_CSV, class) == sapply(training, class))
# pml_CSV[, !class_check] <- sapply(pml_CSV[, !class_check], as.numeric)
# Coerce the testing dataset to the same classes and structure as the training dataset
testing <- rbind(training[100, -59], pml_CSV)
# Name the dummy row taken from training "100" and the 20 test cases 1:20
row.names(testing) <- c(100, 1:20)
predictions <- predict(model,newdata=testing[-1,])
print(predictions)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
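As a small sanity sketch, we can verify that the rbind trick gave every testing column the same class as its training counterpart:

# TRUE when all column classes of testing match those of the training predictors
all(sapply(testing, class) == sapply(training[, -ncol(training)], class))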
The following function creates the answer files for the Prediction Assignment Submission:
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("./answers/problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
pml_write_files(predictions)
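Note that write.table stops with an error if the target folder is missing, so the answers directory must exist before the call above; a small guard sketch (file.exists also works for directories, avoiding dir.exists, which requires R >= 3.2):

# Create the answers output directory if it does not already exist
if (!file.exists("./answers")) dir.create("./answers")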
Given the high accuracy of our model, as expected, all 20 submitted answer files were correct!