Executive Summary

This document presents the results of the Practical Machine Learning peer assessment as a single R Markdown document that can be processed by knitr and transformed into an HTML report.

To predict the class of observations in a data set with many columns, a random forest is fitted (random forests do not require a separate cross-validation or test set to estimate the error). First, the columns with less than 60% of their entries filled are removed; then validation and testing sets are built in order to answer the following questions:

1. How to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set; all other variables can be used as predictors.
2. How to build the model and use cross-validation.

Data sources:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Set Directory

 setwd("C:/Users/FARZAD/Desktop/Data Science/Course 8/Project")
 getwd()
[1] "C:/Users/FARZAD/Desktop/Data Science/Course 8/Project"

Install necessary packages

install.packages("ElemStatLearn")
install.packages("caret") 
library(ElemStatLearn) 
library(caret) 
install.packages("rpart") 
library(rpart) 
install.packages("randomForest")
library(randomForest)
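
As a hedged alternative to the calls above, each package can be installed only when it is missing, so the document knits without re-installing:

pkgs <- c("ElemStatLearn", "caret", "rpart", "randomForest")
for (p in pkgs) {
  # Install the package only if it is not already available, then attach it
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}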

Loading Data

URLTRN <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLTST <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
fileTRN <- "pml-training.csv"
fileTST <- "pml-testing.csv"


    download.file(url=URLTRN, destfile=fileTRN)
    trying URL 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
    Content type 'text/csv' length 12202745 bytes (11.6 MB)
    downloaded 11.6 MB

   download.file(url=URLTST, destfile=fileTST)
  trying URL 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
  Content type 'text/csv' length 15113 bytes (14 KB)
  downloaded 14 KB
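
To avoid re-downloading on every run, the calls above can be guarded with a file-existence check (an optional sketch):

    # Download each file only if it is not already present locally
    if (!file.exists(fileTRN)) download.file(url = URLTRN, destfile = fileTRN)
    if (!file.exists(fileTST)) download.file(url = URLTST, destfile = fileTST)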
  
  

Read Data

  # Note: the two files use different NA encodings here (empty strings for the
  # training file, "NA" for the testing file)
  TRN <- read.csv("pml-training.csv", row.names = 1, na.strings = "")
  TST <- read.csv("pml-testing.csv", row.names = 1, na.strings = "NA")
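
If all of the missing-value encodings that appear in these files (literal "NA", empty strings, and "#DIV/0!" entries) should be treated uniformly as NA, a hedged alternative read would be (`TRN_alt` and `TST_alt` are illustrative names):

  # Treat "NA", empty strings, and "#DIV/0!" all as missing values
  TRN_alt <- read.csv("pml-training.csv", row.names = 1,
                      na.strings = c("NA", "", "#DIV/0!"))
  TST_alt <- read.csv("pml-testing.csv", row.names = 1,
                      na.strings = c("NA", "", "#DIV/0!"))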
  

Exploratory Data Analysis

Remove columns with missing values in the training and testing files

  # Keep only the columns that contain no NA values
  TRN_REM_na <- TRN[, colSums(is.na(TRN)) == 0]
  TST_REM_na <- TST[, colSums(is.na(TST)) == 0]
  

Remove unnecessary columns from the training and testing files

  colRm_TRN <- c("user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window")
  colRm_TST <- c("user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window","problem_id")
  
  TRN_colRm <- TRN_REM_na[,!(names(TRN_REM_na) %in% colRm_TRN)]
  TST_colRm <- TST_REM_na[,!(names(TST_REM_na) %in% colRm_TST)]
  

Check dimensions

  dim(TRN_colRm)
[1] 19622   121
  dim(TST_colRm)
[1] 20 53

Evaluation of Training & Validation Set

Remove the first column (ID)

# Drop the first remaining column and tabulate the outcome classes
TRNRaw <- TRN[, -1]
TSTRaw <- TST[, -1]
table(TRN$classe)

 A    B    C    D    E 
5580 3797 3422 3216 3607

# 70/30 split of the cleaned training data (a seed, e.g. set.seed(12345),
# would make the partition reproducible)
inTrain <- createDataPartition(y = TRN$classe, p = 0.7, list = FALSE)
TRN_clean <- TRN_colRm[inTrain, ]
validation_clean <- TRN_colRm[-inTrain, ]
TST_clean <- TRN[-inTrain, ]   # note: drawn from the raw TRN, not TRN_colRm

The raw datasets have 160 variables, many of them mostly NA; those are removed by the cleaning procedures below, together with the near-zero-variance (NZV) variables and the identification variables.

# Identify and drop the near-zero-variance (NZV) columns in each data set
NZV <- nearZeroVar(TRN_clean)
TRN_clean <- TRN_clean[, -NZV]
TST_clean <- TST_clean[, -nearZeroVar(TST_clean)]

dim(TRN_clean)
[1] 13737    53

dim(TST_clean)
[1] 5885   91

Remove the identification-only variables (columns 1 to 5)

TRN_clean <- TRN_clean[, -(1:5)]
TST_clean<- TST_clean[, -(1:5)]

dim(TRN_clean)
[1] 13737    48
dim(TST_clean)
[1] 5885   86

Modeling

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test-set error; it is estimated internally during execution as the out-of-bag (OOB) error. The random forest model is therefore trained on the training data set:

   model <- randomForest(classe ~ ., data = TRN_clean)
   model

  Call:
  randomForest(formula = classe ~ ., data = TRN_clean) 
               Type of random forest: classification
                        Number of trees: 500
  No. of variables tried at each split: 6

         OOB estimate of  error rate: 1.06%
  Confusion matrix:
       A    B    C    D    E class.error
  A 3902    4    0    0    0 0.001024066
  B   31 2618    9    0    0 0.015048909
  C    0   29 2366    1    0 0.012520868
  D    0    0   59 2189    4 0.027975133
  E    0    0    1    7 2517 0.003168317
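
The OOB estimate printed above can also be read directly from the fitted object; err.rate and ntree are standard components of a randomForest fit:

   # Per-tree OOB error is stored in err.rate; the last row gives the
   # estimate after all 500 trees (about 0.0106, i.e. the 1.06% above)
   model$err.rate[model$ntree, "OOB"]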

Model Evaluation

The variable importance measures produced by randomForest are inspected as follows:

        importance(model)
                       MeanDecreaseGini
  gyros_belt_y                109.77704
  gyros_belt_z                308.33595
  accel_belt_x                116.92622
  accel_belt_y                146.33422
  accel_belt_z                448.91511
  magnet_belt_x               223.34792
  magnet_belt_y               389.96987
  magnet_belt_z               382.13972
  roll_arm                    293.73030
  pitch_arm                   174.55514
  yaw_arm                     210.58732
  total_accel_arm              96.90070
  gyros_arm_x                 134.23053
  gyros_arm_y                 131.50277
  gyros_arm_z                  63.82912
  accel_arm_x                 204.96531
  accel_arm_y                 152.93891
  accel_arm_z                 131.03189
  magnet_arm_x                198.97953
  magnet_arm_y                205.41150
  magnet_arm_z                173.46416
  roll_dumbbell               325.97376
  pitch_dumbbell              163.25065
  yaw_dumbbell                223.81261
  total_accel_dumbbell        223.89752
  gyros_dumbbell_x            128.18776
  gyros_dumbbell_y            224.76562
  gyros_dumbbell_z             87.64901
  accel_dumbbell_x            205.38602
  accel_dumbbell_y            341.14479
  accel_dumbbell_z            274.04832
  magnet_dumbbell_x           394.16568
  magnet_dumbbell_y           507.41389
  magnet_dumbbell_z           617.73962
  roll_forearm                486.88577
  pitch_forearm               572.49705
  yaw_forearm                 147.14687
  total_accel_forearm         109.03825
  gyros_forearm_x              84.99886
  gyros_forearm_y             127.73489
  gyros_forearm_z              85.22947
  accel_forearm_x             251.93194
  accel_forearm_y             135.61767
  accel_forearm_z             216.90374
  magnet_forearm_x            191.08455
  magnet_forearm_y            188.56146
  magnet_forearm_z            247.21371
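
The same information can be viewed graphically with varImpPlot from the randomForest package:

  # Dot chart of variable importance (mean decrease in Gini)
  varImpPlot(model, main = "Random forest variable importance")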

Next, the model is evaluated on the validation set using a confusion matrix.

  confusionMatrix(predict(model, newdata = validation_clean[, -ncol(validation_clean)]),
                  validation_clean$classe)
 Confusion Matrix and Statistics

           Reference
  Prediction    A    B    C    D    E
           A 1673   10    0    0    0
           B    0 1123   13    0    0
           C    0    5 1011   20    1
           D    0    1    2  941    2
           E    1    0    0    3 1079

  Overall Statistics
                                      
                Accuracy : 0.9901          
                   95% CI : (0.9873, 0.9925)
     No Information Rate : 0.2845          
     P-Value [Acc > NIR] : < 2.2e-16       
                                      
                  Kappa : 0.9875          
  Mcnemar's Test P-Value : NA              

  Statistics by Class:

                      Class: A Class: B Class: C Class: D Class: E
  Sensitivity            0.9994   0.9860   0.9854   0.9761   0.9972
  Specificity            0.9976   0.9973   0.9946   0.9990   0.9992
  Pos Pred Value         0.9941   0.9886   0.9749   0.9947   0.9963
  Neg Pred Value         0.9998   0.9966   0.9969   0.9953   0.9994
  Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
  Detection Rate         0.2843   0.1908   0.1718   0.1599   0.1833
  Detection Prevalence   0.2860   0.1930   0.1762   0.1607   0.1840
  Balanced Accuracy      0.9985   0.9916   0.9900   0.9876   0.9982
  

The accuracy on the validation set is calculated as follows:

  Accur <- as.numeric(predict(model, newdata = validation_clean[, -ncol(validation_clean)])
                      == validation_clean$classe)
  Accur <- sum(Accur) * 100 / nrow(validation_clean)
  Accur
  [1] 99.03144

Model accuracy on the validation set is 99.03%, so the estimated out-of-sample error is 100% - 99.03% = 0.97%, which is quite low.
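
For completeness, this error can be computed directly from the accuracy figure above (`OOS_err` is an illustrative name):

  OOS_err <- 100 - Accur   # 100 - 99.03144 = 0.96856, i.e. about 0.97%
  OOS_err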

Prediction with the Testing Dataset

# Note: TST[-1, ] drops the first row (test case 1) rather than the first
# column, which is why only 19 predictions appear below
predictions <- predict(model, newdata = TST[-1,])
predictions
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
Levels: A B C D E

In the new training and validation sets just created, there are 52 predictors and one response. Checking the correlations between the predictors and the outcome variable in the new training set, none of the predictors appears strongly correlated with the outcome, so a linear regression model may not be a good option; a random forest model may be more robust for these data.

corr <- abs(sapply(colnames(TRN_clean[, -ncol(TRN_clean)]),
                   function(x) cor(as.numeric(TRN_clean[, x]),
                                   as.numeric(TRN_clean$classe),
                                   method = "spearman")))
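
To make the claim concrete, the strongest correlations can be inspected (a quick illustrative check on the `corr` vector just computed):

# Largest absolute Spearman correlations between a predictor and the outcome
head(sort(corr, decreasing = TRUE))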

Create Random Forest Model

We fit a random forest model with caret (using 4-fold cross-validation) and check its performance on the validation set.

RanForFit <- train(classe ~ ., method = "rf", data = TRN_clean, importance = TRUE,
                   trControl = trainControl(method = "cv", number = 4))
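
The validation-set check mentioned above could then look like this (a sketch; RanForFit and validation_clean are as defined earlier):

# Evaluate the cross-validated forest on the held-out validation set
confusionMatrix(predict(RanForFit, newdata = validation_clean),
                validation_clean$classe)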

Data Cleaning

Since a random forest model is chosen, the data set must first be checked for columns that are largely empty. The decision is made to remove all columns with less than 60% of their entries filled.

# Count the columns (excluding the outcome) with less than 60% of entries filled
sum(colSums(!is.na(TRN[, -ncol(TRN)])) < 0.6 * nrow(TRN))
[1] 33
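
The removal step described above could be carried out as follows (a minimal sketch on the raw TRN; `keep` and `TRN_filtered` are illustrative names):

# Keep only the columns in which at least 60% of the rows are non-missing
keep <- colSums(!is.na(TRN)) >= 0.6 * nrow(TRN)
TRN_filtered <- TRN[, keep]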