PROJECT QUESTION

The goal of the project is to predict the manner in which the participants did the exercise. Furthermore, data is split into two sets for training and testing purposes. After the initial data overview and preparation the machine learning algorithm is applied to obtain satisfying level of performance first on the training set and further on the test set.

DATA OVERVIEW

The data for this project has been obtained from this source: [http://groupware.les.inf.puc-rio.br/har]. "Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions1: Class A - exactly according to the specification Class B - throwing the elbows to the front Class C - lifting the dumbbell only halfway Class D - lowering the dumbbell only halfway Class E - throwing the hips to the front The training set was downloaded from Training The testing set was download from Testing

URL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(URL,destfile="TrainData.csv", method="curl")
training<-read.csv("TrainData.csv")
URL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(URL,destfile="TestData.csv", method="curl")
testing<-read.csv("TestData.csv")

First seven columns are not the valued data for the project as it includes user names, time stamps, etc. Therefore dataframe is adjusted to the useful variables.

training<-training[,-c(1:7)]
testing<-testing[,-c(1:7)]

#Extract the names of the columns which are different in testing and training sets.
k<-colnames(training)!=colnames(testing)

names(training[k]); names(testing[k]);
## [1] "classe"
## [1] "problem_id"
#removing the problem_id column from the test set and setting outcome column in train set as factors.
testing<-testing[,-length(testing)]
training$classe<-as.factor(training$classe)

DATA PREPARATION

Null values inspection

val<-apply(X=training, MARGIN = 2, FUN=function(x) sum(is.na(x)))
unique(val[val>0]); round((unique(val[val>0]))/dim(training)[1],2); length(val[val>0])
## [1] 19216
## [1] 0.98
## [1] 67

Therefore, there are 67 columns with the NA values with 98% of the data missing in those columns.

head(training[,c(1,9:12)], 3)
##   roll_belt skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1      1.41                                                   NA             NA
## 2      1.41                                                   NA             NA
## 3      1.42                                                   NA             NA

However, through observation of the dataframe, it was also noted that some columns have empty entries instead of NA entries. To account for that the further empty to NA conversion step is used.

training<-mutate_all(training, list(~na_if(.,"")))
val<-apply(X=training, MARGIN = 2, FUN=function(x) sum(is.na(x)))
unique(val[val>0]); round((unique(val[val>0]))/dim(training)[1],2); length(val[val>0])
## [1] 19216
## [1] 0.98
## [1] 100

This allow us to see that overall there are 100 columns in the data set that have 98% of data missing. For the purpose of this project it will be beneficial to readjust the dataset and eliminate variables that contain only 2% of the valuable information.The useful dataset will consists of 53 variables for further application. It is also not practical to impute any missing values as 2% is a significantly little number to use for the missing values.

training<-(training[val==0])
dim(training)
## [1] 19622    53

BUILDING MODEL

As there are large number of cases given with the large number of variables it might be beneficial to split training set even further to training and validation sets. For CARET package is going to be used for further model building.

First, find highly correlated variables to further simplify the model. For the visualization plot see the Appendix (1).

library(caret)
correl<-findCorrelation(cor(training[,-53], use="pairwise.complete.obs"), cutoff = 0.75, verbose = FALSE )
names(training[correl])
##  [1] "accel_belt_z"      "roll_belt"         "accel_belt_y"     
##  [4] "accel_arm_y"       "total_accel_belt"  "accel_dumbbell_z" 
##  [7] "accel_belt_x"      "pitch_belt"        "magnet_dumbbell_x"
## [10] "accel_dumbbell_y"  "magnet_dumbbell_y" "accel_arm_x"      
## [13] "accel_dumbbell_x"  "accel_arm_z"       "magnet_arm_y"     
## [16] "magnet_belt_z"     "accel_forearm_y"   "gyros_forearm_y"  
## [19] "gyros_dumbbell_x"  "gyros_dumbbell_z"  "gyros_arm_x"

For further variables reduction the cutoff is going to be set to 90% and the fitted variables will be removed from the model building process.

highcorrel<-findCorrelation(cor(training[,-53], use="pairwise.complete.obs"), cutoff = 0.90, verbose = FALSE )
names(training[highcorrel])
## [1] "accel_belt_z"     "roll_belt"        "accel_belt_y"     "accel_belt_x"    
## [5] "gyros_dumbbell_x" "gyros_dumbbell_z" "gyros_arm_x"
training<-training[,-highcorrel]

Splitting data into training set and test-validation set

set.seed(373)
inTrain<-createDataPartition(y=training$classe, p=0.75, list=FALSE)
train<-training[inTrain,]
valid<-training[-inTrain,]
dim(train); dim(valid)
## [1] 14718    46
## [1] 4904   46

Developing the Decision Tree (rpart), Random Forest (rf) and Stochastic Gradient Boosting Models on the train part of the training data; followed by application on the validation set.

set.seed(375)
modrpart<-train(classe~., method='rpart', data=train)
predrpart<-predict(modrpart, newdata=valid)

library(randomForest)
modrf<-randomForest(classe~., data=train, importance = TRUE, ntrees = 5, type='responce')
predrf<-predict(modrpart, valid)

modgbm<-train(classe~., method='gbm', data=train)
predgbm<- predict(modgbm, valid)

Confusion Matrices and Accuracy Levels

Decision Tree

rpart<-confusionMatrix(valid$classe, predrpart)
rpart
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1283   22   72   18    0
##          B  392  319  199   39    0
##          C  401   34  323   97    0
##          D  315  142   81  205   61
##          E  204  164  189   36  308
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4971          
##                  95% CI : (0.4831, 0.5112)
##     No Information Rate : 0.5292          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3425          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4944  0.46843  0.37384  0.51899  0.83469
## Specificity            0.9515  0.85082  0.86832  0.86715  0.86924
## Pos Pred Value         0.9197  0.33614  0.37778  0.25498  0.34184
## Neg Pred Value         0.6261  0.90847  0.86639  0.95366  0.98476
## Prevalence             0.5292  0.13887  0.17618  0.08055  0.07524
## Detection Rate         0.2616  0.06505  0.06586  0.04180  0.06281
## Detection Prevalence   0.2845  0.19352  0.17435  0.16395  0.18373
## Balanced Accuracy      0.7230  0.65962  0.62108  0.69307  0.85196

GBM

gbm<-confusionMatrix(valid$classe, predgbm)
gbm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1367   18    7    2    1
##          B   38  881   27    1    2
##          C    0   33  813    9    0
##          D    0    5   17  773    9
##          E    2   11   11   12  865
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9582          
##                  95% CI : (0.9522, 0.9636)
##     No Information Rate : 0.2869          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9471          
##                                           
##  Mcnemar's Test P-Value : 1.791e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9716   0.9293   0.9291   0.9699   0.9863
## Specificity            0.9920   0.9828   0.9896   0.9925   0.9911
## Pos Pred Value         0.9799   0.9283   0.9509   0.9614   0.9600
## Neg Pred Value         0.9886   0.9831   0.9847   0.9941   0.9970
## Prevalence             0.2869   0.1933   0.1784   0.1625   0.1788
## Detection Rate         0.2788   0.1796   0.1658   0.1576   0.1764
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9818   0.9561   0.9594   0.9812   0.9887

Random Forest

rf<-confusionMatrix(valid$classe, predrf)
rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1283   22   72   18    0
##          B  392  319  199   39    0
##          C  401   34  323   97    0
##          D  315  142   81  205   61
##          E  204  164  189   36  308
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4971          
##                  95% CI : (0.4831, 0.5112)
##     No Information Rate : 0.5292          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3425          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4944  0.46843  0.37384  0.51899  0.83469
## Specificity            0.9515  0.85082  0.86832  0.86715  0.86924
## Pos Pred Value         0.9197  0.33614  0.37778  0.25498  0.34184
## Neg Pred Value         0.6261  0.90847  0.86639  0.95366  0.98476
## Prevalence             0.5292  0.13887  0.17618  0.08055  0.07524
## Detection Rate         0.2616  0.06505  0.06586  0.04180  0.06281
## Detection Prevalence   0.2845  0.19352  0.17435  0.16395  0.18373
## Balanced Accuracy      0.7230  0.65962  0.62108  0.69307  0.85196

Therefore, the overall accuracy summary is as follows, with the GBM method showing the highest levels.

cbind(c('RPART', 'GBM', 'RF'), round(c(rpart$overall['Accuracy'], gbm$overall['Accuracy'], rf$overall['Accuracy']),3))
##          [,1]    [,2]   
## Accuracy "RPART" "0.497"
## Accuracy "GBM"   "0.958"
## Accuracy "RF"    "0.497"

Use GBM method on testing data set

predtest <- predict(modgbm, testing)
testing$classe <- predtest
predtest
##  [1] B A B A A C D B A A B C B A E E A B B B
## Levels: A B C D E

APPENDIX

  1. Plotting correlated predictors
library(corrplot)
corMat <- cor(train[, -46])
corrplot(corMat, order = "FPC", method = "color", type = "upper", 
         tl.cex = 0.5, tl.col = rgb(0, 0, 0))

  1. The random tree model
library(rattle)
fancyRpartPlot(modrpart$finalModel)