This document presents the results of the Practical Machine Learning Peer Assessments in a report using a single R markdown document that can be processed by knitr and be transformed into an HTML file.
To provide class prediction of data with multiple columns it requires to implement a random forests without cross -validation and test set ,therefore first of all it is necessary to remove the columns with less than 60% of data and then evaluate data validation and testing for answering to following questions:
predict the manner in which they did the exercise. This is the “classe” variable in the training set. All other variables can be use as predictor. 2.How to build the model and use cross validation.
Data source:https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Set Direcory
setwd("C:/Users/FARZAD/Desktop/Data Science/Course 8/Project")
getwd()
[1] "C:/Users/FARZAD/Desktop/Data Science/Course 8/Project"
Install necessary packages
install.packages("ElemStatLearn")
install.packages("caret")
library(ElemStatLearn)
library(caret)
install.packages("rpart")
library(rpart)
install.packages("randomForest")
library(randomForest)
URLTRN <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLTST <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
fileTRN <- "pml-training.csv"
fileTST <- "pml-testing.csv"
download.file(url=URLTRN, destfile=fileTRN)
trying URL 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
Content type 'text/csv' length 12202745 bytes (11.6 MB)
downloaded 11.6 MB
download.file(url=URLTST, destfile=fileTST)
trying URL 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
Content type 'text/csv' length 15113 bytes (14 KB)
downloaded 14 KB
Read Data
TRN <- read.csv("pml-training.csv",row.names=1,na.strings = "")
TST <- read.csv("pml-testing.csv",row.names=1,na.strings = "NA")
Remove Missing value in Training and Testing Files
TRN_REM_na <- TRN[,(colSums(is.na(TRN)) == 0)]
TST_REM_na <- TST[,(colSums(is.na(TST)) == 0)]
Remove Unnecessry columns in Training and Testing files
colRm_TRN <- c("user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window")
colRm_TST <- c("user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window","problem_id")
TRN_colRm <- TRN_REM_na[,!(names(TRN_REM_na) %in% colRm_TRN)]
TST_colRm <- TST_REM_na[,!(names(TST_REM_na) %in% colRm_TST)]
dim
dim(TRN_colRm)
[1] 19622 121
dim(TST_colRm)
[1] 20 53
Evaluation of Training & Validation Set
Remove 1st column of ID
TRNRaw <- TRN[,-1]
TSTRaw <- TST[,-1]
table(TRN$classe)
A B C D E
5580 3797 3422 3216 3607
inTrain <- createDataPartition(y=TRN$classe, p=0.7, list=FALSE)
TRN_clean <- TRN_colRm[inTrain,]
validation_clean <- TRN_colRm[-inTrain,]
TST_clean<-TRN[-inTrain, ]
Both created datasets have 160 variables. Those variables have plenty of NA, that can be removed with the cleaning procedures below. The Near Zero variance (NZV) variables are also removed and the ID variables as well.
TRN_clean <- TRN_clean[, -NZV]
TST_clean <- TST_clean[, -NZV]
dim(TRN_clean)
[1] 13737 53
dim(TST_clean)
[1] 5885 91
remove identification only variables (columns 1 to 5)
TRN_clean <- TRN_clean[, -(1:5)]
TST_clean<- TST_clean[, -(1:5)]
dim(TRN_clean)
[1] 13737 48
dim(TST_clean)
[1] 5885 86
Modeling In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the execution. Therefore, the training of the model (Random Forest) is proceeded using the training data set
model <- randomForest(classe~.,data=TRN_clean)
model
Call:
randomForest(formula = classe ~ ., data = TRN_clean)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6
OOB estimate of error rate: 1.06%
Confusion matrix:
A B C D E class.error
A 3902 4 0 0 0 0.001024066
B 31 2618 9 0 0 0.015048909
C 0 29 2366 1 0 0.012520868
D 0 0 59 2189 4 0.027975133
E 0 0 1 7 2517 0.003168317
Model Evaluation Verification of the variable importance measures as produced by random Forest is as follows:
importance(model)
MeanDecreaseGini
gyros_belt_y 109.77704
gyros_belt_z 308.33595
accel_belt_x 116.92622
accel_belt_y 146.33422
accel_belt_z 448.91511
magnet_belt_x 223.34792
magnet_belt_y 389.96987
magnet_belt_z 382.13972
roll_arm 293.73030
pitch_arm 174.55514
yaw_arm 210.58732
total_accel_arm 96.90070
gyros_arm_x 134.23053
gyros_arm_y 131.50277
gyros_arm_z 63.82912
accel_arm_x 204.96531
accel_arm_y 152.93891
accel_arm_z 131.03189
magnet_arm_x 198.97953
magnet_arm_y 205.41150
magnet_arm_z 173.46416
roll_dumbbell 325.97376
pitch_dumbbell 163.25065
yaw_dumbbell 223.81261
total_accel_dumbbell 223.89752
gyros_dumbbell_x 128.18776
gyros_dumbbell_y 224.76562
gyros_dumbbell_z 87.64901
accel_dumbbell_x 205.38602
accel_dumbbell_y 341.14479
accel_dumbbell_z 274.04832
magnet_dumbbell_x 394.16568
magnet_dumbbell_y 507.41389
magnet_dumbbell_z 617.73962
roll_forearm 486.88577
pitch_forearm 572.49705
yaw_forearm 147.14687
total_accel_forearm 109.03825
gyros_forearm_x 84.99886
gyros_forearm_y 127.73489
gyros_forearm_z 85.22947
accel_forearm_x 251.93194
accel_forearm_y 135.61767
accel_forearm_z 216.90374
magnet_forearm_x 191.08455
magnet_forearm_y 188.56146
magnet_forearm_z 247.21371
Next, the model results is evaluated through the confusion Matrix.
confusionMatrix(predict(model,newdata=validation_clean[,-ncol(validation_clean)]),validation_clean$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1673 10 0 0 0
B 0 1123 13 0 0
C 0 5 1011 20 1
D 0 1 2 941 2
E 1 0 0 3 1079
Overall Statistics
Accuracy : 0.9901
95% CI : (0.9873, 0.9925)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9875
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9994 0.9860 0.9854 0.9761 0.9972
Specificity 0.9976 0.9973 0.9946 0.9990 0.9992
Pos Pred Value 0.9941 0.9886 0.9749 0.9947 0.9963
Neg Pred Value 0.9998 0.9966 0.9969 0.9953 0.9994
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2843 0.1908 0.1718 0.1599 0.1833
Detection Prevalence 0.2860 0.1930 0.1762 0.1607 0.1840
Balanced Accuracy 0.9985 0.9916 0.9900 0.9876 0.9982
The accurancy for the validating data set is calculated with the following formula:
Accur<-c(as.numeric(predict(model,newdata=validation_clean[,-ncol(validation_clean)])==validation_clean$classe))
Accur<-sum(Accur)*100/nrow(validation_clean)
Accur
[1] 99.03144
Model Accuracy as tested over Validation set = 99.03144% The out-of-sample error is 0.13%, which is pretty low.
Prediction with the Testing Dataset
predictions <- predict(model,newdata=TST[-1,])
predictions
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
In the new training set and validation set we just created, there are 52 predictors and 1 response. Check the correlations between the predictors and the outcome variable in the new training set. There doesn’t seem to be any predictors strongly correlated with the outcome variable, so linear regression model may not be a good option. Random forest model may be more robust for this data.
cor <- abs(sapply(colnames(TRN_clean[, -ncol(TRN)]), function(x) cor(as.numeric(TRN_clean[, x]), as.numeric(TRN_clean$classe), method = "spearman")))
Create Random Forest Model We try to fit a random forest model and check the model performance on the validation set.
RanForFit <- train(classe ~ ., method = "rf", data = TRN_clean, importance = T, trControl = trainControl(method = "cv", number = 4))
Data Cleaning Since a random forest model is chosen and the data set must first be checked on possibility of columns without data.
The decision is made whereby all the columns that having less than 60% of data filled are removed.
sum((colSums(!is.na(TRN[,-ncol(TRN)])) < 0.6*nrow(TRN)))
[1] 33