This project predicts the manner in which six healthy young participants, aged 20 to 28 and with little weight-lifting experience, performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). The fashion in which each repetition was performed is recorded in the variable “classe”. The provided pml-training data set is partitioned into training and validation data sets, and the pml-testing data set, with 20 observations, serves as the test data set. Links to the data are given in the readme file.
Necessary packages
library(knitr)
library(caret)
library(MASS)
library(klaR)
library(rattle)
library(readr)
library(ggplot2)
Data downloading and reading
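fileName1 <- "pml-training.csv"
fileName2 <- "pml-testing.csv"
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download each file only if it is not already in the working directory
if (!file.exists(fileName1)) download.file(url1, destfile = fileName1, method = "curl")
if (!file.exists(fileName2)) download.file(url2, destfile = fileName2, method = "curl")
# stringsAsFactors = TRUE keeps "classe" a factor on R >= 4.0, where the default is FALSE
trainingRD <- read.csv(fileName1, stringsAsFactors = TRUE)
testingRD <- read.csv(fileName2, stringsAsFactors = TRUE)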
Cleaning the Data
View(trainingRD); View(testingRD)
As the viewer shows, both data sets contain columns filled with NA values and variables that are unnecessary for the analysis. They should be removed before the analysis starts.
nearZero<- nearZeroVar(trainingRD)
trainingRD <- trainingRD[ ,-nearZero]
trainingRD<- trainingRD[ ,which(colSums(is.na(trainingRD))== 0)]
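# drop the first 7 columns, intended as the identifier, timestamp and window variables that have no relationship with "classe"
# caution: trainingRD was already filtered above, so its first 7 columns may differ from those of testingRD;
# compare names(trainingRD)[1:7] with names(testingRD)[1:7] to confirm that no sensor variable is dropped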
trainingSet<- trainingRD[ ,-c(1:7)]
testing<- testingRD[ ,-c(1:7)]
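The cleaning above selects columns by position, which is easy to get wrong once an earlier filter has changed the column order. As a minimal sketch of an alternative, and not the approach used in this report, the columns could be selected by name and the same predictor set then applied to the test data. The data frames `train_raw` and `test_raw`, their values, and the single identifier column in `id_cols` are invented stand-ins for illustration only.

```r
# Toy stand-ins for trainingRD and testingRD (hypothetical values).
train_raw <- data.frame(
  X = 1:6,
  roll_belt = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.1),
  kurtosis_roll_belt = c(NA, NA, NA, NA, NA, 0.2),
  pitch_belt = c(8.1, 8.0, 8.2, 7.9, 8.3, 8.4),
  classe = factor(c("A", "A", "B", "B", "C", "C"))
)
test_raw <- data.frame(
  X = 1:2,
  roll_belt = c(1.3, 1.4),
  kurtosis_roll_belt = c(NA, NA),
  pitch_belt = c(8.0, 8.1),
  problem_id = 1:2
)

# Drop identifier columns by name rather than by position.
id_cols <- c("X")
keep <- setdiff(names(train_raw), id_cols)

# Keep only the columns that contain no missing values.
keep <- keep[colSums(is.na(train_raw[, keep, drop = FALSE])) == 0]
train_clean <- train_raw[, keep, drop = FALSE]

# Apply exactly the same predictor columns to the test set.
predictors <- setdiff(keep, "classe")
test_clean <- test_raw[, predictors, drop = FALSE]

names(train_clean)
names(test_clean)
```

Selecting by name keeps the training and test sets aligned even if the column order changes between cleaning steps.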
The following table shows the number of observations for each class category after the data is cleaned.
table(trainingSet$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
set.seed(12345)
inTrain<- createDataPartition(y = trainingSet$classe, p = 0.7, list = FALSE)
training<- trainingSet[inTrain, ]
validation<- trainingSet[-inTrain, ]
dim(training)
## [1] 13737 52
dim(validation)
## [1] 5885 52
dim(testing)
## [1] 20 153
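The 70/30 split above is stratified by class. The sketch below illustrates that idea in base R, using only the class counts printed by `table(trainingSet$classe)`. It is not caret's implementation of `createDataPartition`, and the rounding rule (`ceiling`) is an assumption. The names `counts`, `idx_by_class`, `train_y` and `valid_y` are illustrative.

```r
# Class counts taken from the table(trainingSet$classe) output above.
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)
classe <- factor(rep(names(counts), times = counts))

set.seed(12345)

# Sample 70% of the row indices within each class (rounded up),
# so that every class keeps roughly the same share in both parts.
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = ceiling(0.7 * length(idx)))
}), use.names = FALSE)

train_y <- classe[in_train]
valid_y <- classe[-in_train]

length(train_y)
length(valid_y)

# Class proportions are nearly identical in the two parts.
round(prop.table(table(train_y)), 4)
round(prop.table(table(valid_y)), 4)
```

Under this rounding assumption the part sizes come out to 13737 and 5885 rows, the same as the `dim()` output above.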
The following box plots show, for four representative variables (total_accel_belt, total_accel_arm, total_accel_dumbbell and total_accel_forearm), how the values are distributed within each class.
par(mfrow = c(2, 2))
plot(training$classe, training$total_accel_belt, xlab = "Class", ylab = "total_accel_belt", main = "Class vs Total acceleration on belt")
plot(training$classe, training$total_accel_arm, xlab = "Class", ylab = "total_accel_arm", main = "Class vs Total acceleration on arm")
plot(training$classe, training$total_accel_dumbbell, xlab = "Class", ylab = "total_accel_dumbbell", main = "Class vs Total acceleration on dumbbell")
plot(training$classe, training$total_accel_forearm, xlab = "Class", ylab = "total_accel_forearm", main = "Class vs Total acceleration on forearm")
Three models (lda, rpart and rf) are fitted, and the one with the highest accuracy on the validation set is selected as the best fit.
1. Linear discriminant analysis (“lda”)
mod_lda<- train(classe ~., data = training, method = "lda")
plda <- predict(mod_lda, validation)
confusionMatrix(plda, validation$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1387  174  107   56   50
##          B   40  744  115   52  202
##          C  116  119  645  124   90
##          D  127   50  138  686  112
##          E    4   52   21   46  628
##
## Overall Statistics
##
##                Accuracy : 0.695
##                  95% CI : (0.683, 0.7067)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6137
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8286   0.6532   0.6287   0.7116   0.5804
## Specificity            0.9081   0.9138   0.9076   0.9132   0.9744
## Pos Pred Value         0.7818   0.6453   0.5896   0.6164   0.8362
## Neg Pred Value         0.9302   0.9165   0.9205   0.9417   0.9116
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2357   0.1264   0.1096   0.1166   0.1067
## Detection Prevalence   0.3014   0.1959   0.1859   0.1891   0.1276
## Balanced Accuracy      0.8683   0.7835   0.7681   0.8124   0.7774
2. Recursive partitioning (“rpart”) and tree plot
mod_rpart<- train(classe ~., data = training, method = "rpart")
prpart<- predict(mod_rpart, validation)
confusionMatrix(prpart, validation$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1527  482  498  423  243
##          B   31  387   38  188  228
##          C   77  124  423  126  150
##          D   38  146   67  227  145
##          E    1    0    0    0  316
##
## Overall Statistics
##
##                Accuracy : 0.4894
##                  95% CI : (0.4765, 0.5022)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.3317
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9122  0.33977  0.41228  0.23548  0.29205
## Specificity            0.6091  0.89781  0.90183  0.91953  0.99979
## Pos Pred Value         0.4812  0.44381  0.47000  0.36437  0.99685
## Neg Pred Value         0.9458  0.84999  0.87904  0.85994  0.86243
## Prevalence             0.2845  0.19354  0.17434  0.16381  0.18386
## Detection Rate         0.2595  0.06576  0.07188  0.03857  0.05370
## Detection Prevalence   0.5392  0.14817  0.15293  0.10586  0.05387
## Balanced Accuracy      0.7607  0.61879  0.65706  0.57750  0.64592
fancyRpartPlot(mod_rpart$finalModel)
3. Random forest analysis (“rf”)
mod_rf <- train(classe ~ ., data = training, method = "rf", importance = TRUE,
                trControl = trainControl(method = "cv", number = 3, classProbs = TRUE,
                                         savePredictions = TRUE, allowParallel = TRUE))
prf<- predict(mod_rf, validation)
confusionMatrix(prf, validation$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    1 1129    7    0    0
##          C    0    4 1016    8    0
##          D    0    0    3  955    2
##          E    0    0    0    1 1080
##
## Overall Statistics
##
##                Accuracy : 0.9946
##                  95% CI : (0.9923, 0.9963)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9931
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9912   0.9903   0.9907   0.9982
## Specificity            0.9986   0.9983   0.9975   0.9990   0.9998
## Pos Pred Value         0.9964   0.9930   0.9883   0.9948   0.9991
## Neg Pred Value         0.9998   0.9979   0.9979   0.9982   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1918   0.1726   0.1623   0.1835
## Detection Prevalence   0.2853   0.1932   0.1747   0.1631   0.1837
## Balanced Accuracy      0.9990   0.9948   0.9939   0.9948   0.9990
The results above show that the random forest model has the highest accuracy on the validation set (0.9946, compared with 0.695 for lda and 0.4894 for rpart), which corresponds to an estimated out-of-sample error of about 0.5%. The random forest (rf) model is therefore selected to predict the test samples.
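The summary statistics reported by `confusionMatrix` can be checked by hand from the printed table. The sketch below copies the random forest confusion matrix from the output above and recomputes accuracy, the out-of-sample error estimate, Cohen's kappa and the per-class sensitivity in base R. The variable names (`cm`, `expected`, `cohen_kappa`) are illustrative.

```r
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1673,    6,    0,   0,    0,
       1, 1129,    7,   0,    0,
       0,    4, 1016,   8,    0,
       0,    0,    3, 955,    2,
       0,    0,    0,   1, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of validation rows on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity: correct predictions divided by the reference total per class.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)      # 0.9946, as reported above
round(1 - accuracy, 4)  # 0.0054, the estimated out-of-sample error
round(cohen_kappa, 4)   # 0.9931, as reported above
round(sensitivity, 4)
```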
testing_pre<- predict(mod_rf, newdata = testing)
testing_pre
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The pml-training data set is split into training and validation data sets to build a predictive model and evaluate its accuracy. To select the best-fitting model, the lda, rpart and rf models are applied. The rf model fits best and is used to predict the test data.
Reference
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises