The goal of the project is to predict the manner in which the participants did the exercise. Furthermore, data is split into two sets for training and testing purposes. After the initial data overview and preparation the machine learning algorithm is applied to obtain satisfying level of performance first on the training set and further on the test set.
The data for this project has been obtained from this source: [http://groupware.les.inf.puc-rio.br/har]. "Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions1: Class A - exactly according to the specification Class B - throwing the elbows to the front Class C - lifting the dumbbell only halfway Class D - lowering the dumbbell only halfway Class E - throwing the hips to the front The training set was downloaded from Training The testing set was download from Testing
URL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(URL,destfile="TrainData.csv", method="curl")
training<-read.csv("TrainData.csv")
URL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(URL,destfile="TestData.csv", method="curl")
testing<-read.csv("TestData.csv")
First seven columns are not the valued data for the project as it includes user names, time stamps, etc. Therefore dataframe is adjusted to the useful variables.
training<-training[,-c(1:7)]
testing<-testing[,-c(1:7)]
#Extract the names of the columns which are different in testing and training sets.
k<-colnames(training)!=colnames(testing)
names(training[k]); names(testing[k]);
## [1] "classe"
## [1] "problem_id"
#removing the problem_id column from the test set and setting outcome column in train set as factors.
testing<-testing[,-length(testing)]
training$classe<-as.factor(training$classe)
Null values inspection
val<-apply(X=training, MARGIN = 2, FUN=function(x) sum(is.na(x)))
unique(val[val>0]); round((unique(val[val>0]))/dim(training)[1],2); length(val[val>0])
## [1] 19216
## [1] 0.98
## [1] 67
Therefore, there are 67 columns with the NA values with 98% of the data missing in those columns.
head(training[,c(1,9:12)], 3)
## roll_belt skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1 1.41 NA NA
## 2 1.41 NA NA
## 3 1.42 NA NA
However, through observation of the dataframe, it was also noted that some columns have empty entries instead of NA entries. To account for that the further empty to NA conversion step is used.
training<-mutate_all(training, list(~na_if(.,"")))
val<-apply(X=training, MARGIN = 2, FUN=function(x) sum(is.na(x)))
unique(val[val>0]); round((unique(val[val>0]))/dim(training)[1],2); length(val[val>0])
## [1] 19216
## [1] 0.98
## [1] 100
This allow us to see that overall there are 100 columns in the data set that have 98% of data missing. For the purpose of this project it will be beneficial to readjust the dataset and eliminate variables that contain only 2% of the valuable information.The useful dataset will consists of 53 variables for further application. It is also not practical to impute any missing values as 2% is a significantly little number to use for the missing values.
training<-(training[val==0])
dim(training)
## [1] 19622 53
As there are large number of cases given with the large number of variables it might be beneficial to split training set even further to training and validation sets. For CARET package is going to be used for further model building.
First, find highly correlated variables to further simplify the model. For the visualization plot see the Appendix (1).
library(caret)
correl<-findCorrelation(cor(training[,-53], use="pairwise.complete.obs"), cutoff = 0.75, verbose = FALSE )
names(training[correl])
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "accel_arm_y" "total_accel_belt" "accel_dumbbell_z"
## [7] "accel_belt_x" "pitch_belt" "magnet_dumbbell_x"
## [10] "accel_dumbbell_y" "magnet_dumbbell_y" "accel_arm_x"
## [13] "accel_dumbbell_x" "accel_arm_z" "magnet_arm_y"
## [16] "magnet_belt_z" "accel_forearm_y" "gyros_forearm_y"
## [19] "gyros_dumbbell_x" "gyros_dumbbell_z" "gyros_arm_x"
For further variables reduction the cutoff is going to be set to 90% and the fitted variables will be removed from the model building process.
highcorrel<-findCorrelation(cor(training[,-53], use="pairwise.complete.obs"), cutoff = 0.90, verbose = FALSE )
names(training[highcorrel])
## [1] "accel_belt_z" "roll_belt" "accel_belt_y" "accel_belt_x"
## [5] "gyros_dumbbell_x" "gyros_dumbbell_z" "gyros_arm_x"
training<-training[,-highcorrel]
set.seed(373)
inTrain<-createDataPartition(y=training$classe, p=0.75, list=FALSE)
train<-training[inTrain,]
valid<-training[-inTrain,]
dim(train); dim(valid)
## [1] 14718 46
## [1] 4904 46
Developing the Decision Tree (rpart), Random Forest (rf) and Stochastic Gradient Boosting Models on the train part of the training data; followed by application on the validation set.
set.seed(375)
modrpart<-train(classe~., method='rpart', data=train)
predrpart<-predict(modrpart, newdata=valid)
library(randomForest)
modrf<-randomForest(classe~., data=train, importance = TRUE, ntrees = 5, type='responce')
predrf<-predict(modrpart, valid)
modgbm<-train(classe~., method='gbm', data=train)
predgbm<- predict(modgbm, valid)
Decision Tree
rpart<-confusionMatrix(valid$classe, predrpart)
rpart
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1283 22 72 18 0
## B 392 319 199 39 0
## C 401 34 323 97 0
## D 315 142 81 205 61
## E 204 164 189 36 308
##
## Overall Statistics
##
## Accuracy : 0.4971
## 95% CI : (0.4831, 0.5112)
## No Information Rate : 0.5292
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3425
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4944 0.46843 0.37384 0.51899 0.83469
## Specificity 0.9515 0.85082 0.86832 0.86715 0.86924
## Pos Pred Value 0.9197 0.33614 0.37778 0.25498 0.34184
## Neg Pred Value 0.6261 0.90847 0.86639 0.95366 0.98476
## Prevalence 0.5292 0.13887 0.17618 0.08055 0.07524
## Detection Rate 0.2616 0.06505 0.06586 0.04180 0.06281
## Detection Prevalence 0.2845 0.19352 0.17435 0.16395 0.18373
## Balanced Accuracy 0.7230 0.65962 0.62108 0.69307 0.85196
GBM
gbm<-confusionMatrix(valid$classe, predgbm)
gbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1367 18 7 2 1
## B 38 881 27 1 2
## C 0 33 813 9 0
## D 0 5 17 773 9
## E 2 11 11 12 865
##
## Overall Statistics
##
## Accuracy : 0.9582
## 95% CI : (0.9522, 0.9636)
## No Information Rate : 0.2869
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9471
##
## Mcnemar's Test P-Value : 1.791e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9716 0.9293 0.9291 0.9699 0.9863
## Specificity 0.9920 0.9828 0.9896 0.9925 0.9911
## Pos Pred Value 0.9799 0.9283 0.9509 0.9614 0.9600
## Neg Pred Value 0.9886 0.9831 0.9847 0.9941 0.9970
## Prevalence 0.2869 0.1933 0.1784 0.1625 0.1788
## Detection Rate 0.2788 0.1796 0.1658 0.1576 0.1764
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9818 0.9561 0.9594 0.9812 0.9887
Random Forest
rf<-confusionMatrix(valid$classe, predrf)
rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1283 22 72 18 0
## B 392 319 199 39 0
## C 401 34 323 97 0
## D 315 142 81 205 61
## E 204 164 189 36 308
##
## Overall Statistics
##
## Accuracy : 0.4971
## 95% CI : (0.4831, 0.5112)
## No Information Rate : 0.5292
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3425
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4944 0.46843 0.37384 0.51899 0.83469
## Specificity 0.9515 0.85082 0.86832 0.86715 0.86924
## Pos Pred Value 0.9197 0.33614 0.37778 0.25498 0.34184
## Neg Pred Value 0.6261 0.90847 0.86639 0.95366 0.98476
## Prevalence 0.5292 0.13887 0.17618 0.08055 0.07524
## Detection Rate 0.2616 0.06505 0.06586 0.04180 0.06281
## Detection Prevalence 0.2845 0.19352 0.17435 0.16395 0.18373
## Balanced Accuracy 0.7230 0.65962 0.62108 0.69307 0.85196
Therefore, the overall accuracy summary is as follows, with the GBM method showing the highest levels.
cbind(c('RPART', 'GBM', 'RF'), round(c(rpart$overall['Accuracy'], gbm$overall['Accuracy'], rf$overall['Accuracy']),3))
## [,1] [,2]
## Accuracy "RPART" "0.497"
## Accuracy "GBM" "0.958"
## Accuracy "RF" "0.497"
predtest <- predict(modgbm, testing)
testing$classe <- predtest
predtest
## [1] B A B A A C D B A A B C B A E E A B B B
## Levels: A B C D E
library(corrplot)
corMat <- cor(train[, -46])
corrplot(corMat, order = "FPC", method = "color", type = "upper",
tl.cex = 0.5, tl.col = rgb(0, 0, 0))
library(rattle)
fancyRpartPlot(modrpart$finalModel)