This is a submission for the final project in Coursera’s Practical Machine Learning by Johns Hopkins University, third course in the Data Science: Statistics and Machine Learning Specialization.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
In this report, we train three models, a random forest, a decision tree, and a support vector machine (SVM), using k-fold cross-validation to tune each model and to estimate how well it generalizes. We split the pml-training data set into training and validation sets. The pml-testing data set was held back for the final quiz predictions.
Of the three models, the random forest had the highest validation accuracy, about 99.3%, and the smallest out-of-sample error, about 0.7%. We therefore use this model for the final prediction.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source:
http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
library(caret);library(ggplot2);library(dplyr)
library(skimr)
library(naniar)
library(kernlab)
library(randomForest)
library(rattle)
trainingDF <- read.csv("data/pml-training.csv")
pmlTesting <- read.csv("data/pml-testing.csv")
dim(trainingDF);dim(pmlTesting)
## [1] 19622 160
## [1] 20 160
view(head(trainingDF[complete.cases(trainingDF),],10))
Check for the missing values
First remove the columns that carry no information about the outcome variable.
These are the first seven columns of the data (row index, user name, timestamps, and window markers).
pml_training <- trainingDF %>% select(-c(1:7))
pml_training %>% miss_var_summary()
## # A tibble: 153 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 max_roll_belt 19216 97.9
## 2 max_picth_belt 19216 97.9
## 3 min_roll_belt 19216 97.9
## 4 min_pitch_belt 19216 97.9
## 5 amplitude_roll_belt 19216 97.9
## 6 amplitude_pitch_belt 19216 97.9
## 7 var_total_accel_belt 19216 97.9
## 8 avg_roll_belt 19216 97.9
## 9 stddev_roll_belt 19216 97.9
## 10 var_roll_belt 19216 97.9
## # ... with 143 more rows
About 67 variables are missing in roughly 98% of the rows. We can eliminate these variables by keeping only the columns with less than 90% missing values.
pml_training.no.na <- pml_training %>% select(which(colMeans(is.na(.))<0.9))
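The filter above can be checked on a small example. The sketch below uses a made-up data frame (the values and the `toy` name are hypothetical, not taken from the project data) and base R only, to show how the per-column share of missing values decides which columns survive the 90% cutoff.

```r
# Toy data frame with hypothetical values: one column is almost entirely missing
toy <- data.frame(
  roll_belt     = seq(1.2, 1.58, length.out = 20),
  max_roll_belt = c(rep(NA, 19), 9.1),                   # 95% missing
  yaw_belt      = c(NA, NA, seq(-95, -92, length.out = 18))  # 10% missing
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))

# Keep only the columns with less than 90% missing values
toy_kept <- toy[, na_share < 0.9, drop = FALSE]
print(names(toy_kept))
```

Only the sparse column is dropped; a column with a handful of missing values is kept and would need separate handling if the model cannot cope with `NA`.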
Removing zero and near zero variance predictors
nzvVars <- nearZeroVar(pml_training.no.na)
pmlDf <- pml_training.no.na[,-nzvVars]
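To make the near-zero-variance check less of a black box, here is a rough base R sketch of the two quantities it relies on: the ratio of the most common value to the second most common, and the percentage of distinct values. The thresholds mirror what we understand to be caret's defaults (`freqCut = 95/5`, `uniqueCut = 10`); the helper `nzv_metrics` and the toy columns are hypothetical, and caret's own implementation handles more edge cases.

```r
# Hand-rolled version of the two metrics behind a near-zero-variance check.
# This is an illustration, not caret's implementation.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the second most frequent one
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of rows
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Hypothetical columns: one almost constant, one that varies on every row
toy <- data.frame(
  mostly_constant = c(rep("no", 98), rep("yes", 2)),
  varied          = seq(1, 100)
)

metrics <- sapply(toy, nzv_metrics)
print(metrics)

# Assumed thresholds: frequency ratio above 95/5 and under 10% distinct values
flagged <- metrics["freq_ratio", ] > 95 / 5 & metrics["pct_unique", ] < 10
print(flagged)
```

A column is flagged only when both conditions hold, so a predictor with few distinct values but a balanced distribution is left alone.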
Check for correlated data
numDat <- select_if(pmlDf,is.numeric)
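# Return column names rather than positions: positions refer to numDat and
# would only match pmlDf while every non-numeric column comes last
highCor <- findCorrelation(cor(numDat), cutoff = 0.9, names = TRUE)
filterPmlDf <- pmlDf %>% select(-all_of(highCor))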
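As a simplified illustration of the idea, the base R sketch below finds pairs of predictors whose absolute correlation exceeds 0.9 in simulated data and drops the second member of each pair. This is not the rule `findCorrelation` applies (as documented, it removes the variable with the larger mean absolute correlation); the data and variable names are made up.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(n)                  # unrelated to the other two
)

# Absolute correlations, keeping each pair once and ignoring the diagonal
cor_mat <- abs(cor(toy))
cor_mat[!upper.tri(cor_mat)] <- 0

# Locate the pairs above the cutoff
high <- which(cor_mat > 0.9, arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]]
)
print(pairs)

# Simplified rule (an assumption for this sketch): drop the second variable
to_drop  <- unique(pairs$var2)
filtered <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
print(names(filtered))
```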
After cleaning and preprocessing, we can split the data into training and validation sets. The test set (“pmlTesting”) is left untouched for the final prediction.
inTrain <- createDataPartition(y=filterPmlDf$classe, p=0.75, list=FALSE)
training <- filterPmlDf[inTrain,]
validation <- filterPmlDf[-inTrain,]
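For a factor outcome, `createDataPartition` samples within each class so that both parts keep roughly the same class mix. A minimal base R sketch of that idea, on made-up labels (the class counts below are hypothetical), might look like this:

```r
set.seed(42)

# Hypothetical outcome with unequal class sizes
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices separately within each class
in_train <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

train_y <- classe[in_train]
valid_y <- classe[-in_train]

# Class proportions are preserved in both parts
print(prop.table(table(train_y)))
print(prop.table(table(valid_y)))
```

Sampling within classes matters most when some classes are rare; a plain random split could leave the validation set with too few examples of them.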
We fit three classification models, a random forest, a decision tree, and an SVM, to see which algorithm fits the data best.
To tune each model on patterns that generalize rather than on noise in a single split, we use 5-fold cross-validation.
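The mechanics of k-fold cross-validation can be sketched in base R. The example below uses the built-in `iris` data and a nearest-centroid classifier as a stand-in model; both are assumptions made only to keep the sketch self-contained, and the helper names are hypothetical. caret's `train` does the same kind of resampling internally and additionally uses it to pick tuning parameters.

```r
set.seed(4578)
k <- 5

# Assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in classifier: per-class means of the predictors
fit_centroids <- function(df) {
  aggregate(. ~ Species, data = df, FUN = mean)
}

# Predict the class whose centroid is closest in squared Euclidean distance
predict_centroids <- function(centroids, df) {
  x <- as.matrix(df[, names(df) != "Species"])
  m <- as.matrix(centroids[, names(centroids) != "Species"])
  d <- sapply(seq_len(nrow(m)), function(j) rowSums(sweep(x, 2, m[j, ])^2))
  as.character(centroids$Species)[max.col(-d, ties.method = "first")]
}

# Train on k - 1 folds, score on the held-out fold, repeat for every fold
fold_acc <- sapply(seq_len(k), function(i) {
  train_part <- iris[folds != i, ]
  held_out   <- iris[folds == i, ]
  pred <- predict_centroids(fit_centroids(train_part), held_out)
  mean(pred == as.character(held_out$Species))
})

print(round(fold_acc, 3))
print(mean(fold_acc))
```

Each row is used for scoring exactly once, so the averaged accuracy is an estimate of performance on data the model did not see during fitting.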
train_control <- trainControl(method="cv", number=5)
set.seed(4578)
rfMod <- train(classe~., data=training, method="rf", trControl = train_control, tuneLength = 5)
rfPred <- predict(rfMod, validation)
cmRF <- confusionMatrix(rfPred, factor(validation$classe))
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 6 0 0 0
## B 1 939 7 0 0
## C 0 4 845 12 0
## D 0 0 3 791 1
## E 1 0 0 1 900
##
## Overall Statistics
##
## Accuracy : 0.9927
## 95% CI : (0.9899, 0.9949)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9907
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9895 0.9883 0.9838 0.9989
## Specificity 0.9983 0.9980 0.9960 0.9990 0.9995
## Pos Pred Value 0.9957 0.9916 0.9814 0.9950 0.9978
## Neg Pred Value 0.9994 0.9975 0.9975 0.9968 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2841 0.1915 0.1723 0.1613 0.1835
## Detection Prevalence 0.2853 0.1931 0.1756 0.1621 0.1839
## Balanced Accuracy 0.9984 0.9937 0.9922 0.9914 0.9992
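The per-class statistics can be reproduced by hand from the confusion matrix, which helps when reading them. The sketch below re-enters the random forest counts printed above (predictions in rows, reference classes in columns) and recomputes sensitivity and positive predictive value; the object names are ours.

```r
classes <- c("A", "B", "C", "D", "E")

# Random forest confusion matrix as printed above: rows = prediction, cols = reference
cm_rf <- matrix(
  c(1393,   6,   0,   0,   0,
       1, 939,   7,   0,   0,
       0,   4, 845,  12,   0,
       0,   0,   3, 791,   1,
       1,   0,   0,   1, 900),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Sensitivity: share of each true class that was predicted correctly
sensitivity <- diag(cm_rf) / colSums(cm_rf)

# Positive predictive value: share of each predicted class that was correct
pos_pred_value <- diag(cm_rf) / rowSums(cm_rf)

print(round(rbind(sensitivity, pos_pred_value), 4))
```

The rounded values agree with the Sensitivity and Pos Pred Value rows in the output above.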
treeMod <- train(classe~., data=training, method="rpart", trControl = train_control, tuneLength = 5)
# Plot the tree
fancyRpartPlot(treeMod$finalModel)
Prediction:
predTrees <- predict(treeMod, validation)
cmTrees <- confusionMatrix(predTrees, factor(validation$classe))
cmTrees
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1247 262 122 129 97
## B 22 381 34 17 133
## C 87 215 596 132 262
## D 39 91 103 465 117
## E 0 0 0 61 292
##
## Overall Statistics
##
## Accuracy : 0.6079
## 95% CI : (0.594, 0.6216)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.499
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8939 0.40148 0.6971 0.57836 0.32408
## Specificity 0.8262 0.94791 0.8281 0.91463 0.98476
## Pos Pred Value 0.6715 0.64906 0.4613 0.57055 0.82720
## Neg Pred Value 0.9514 0.86843 0.9283 0.91709 0.86618
## Prevalence 0.2845 0.19352 0.1743 0.16395 0.18373
## Detection Rate 0.2543 0.07769 0.1215 0.09482 0.05954
## Detection Prevalence 0.3787 0.11970 0.2635 0.16619 0.07198
## Balanced Accuracy 0.8600 0.67469 0.7626 0.74650 0.65442
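Kappa is worth unpacking for the decision tree, because it shows how much of the 0.6079 accuracy goes beyond chance agreement. The sketch below re-enters the counts printed above and computes Cohen's kappa from the observed agreement and the agreement expected from the row and column totals; the object names are ours.

```r
classes <- c("A", "B", "C", "D", "E")

# Decision tree confusion matrix as printed above: rows = prediction, cols = reference
cm_tree <- matrix(
  c(1247, 262, 122, 129,  97,
      22, 381,  34,  17, 133,
      87, 215, 596, 132, 262,
      39,  91, 103, 465, 117,
       0,   0,   0,  61, 292),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm_tree)

# Observed agreement is the overall accuracy
observed <- sum(diag(cm_tree)) / n

# Agreement expected by chance, from the marginal totals
expected <- sum(rowSums(cm_tree) * colSums(cm_tree)) / n^2

kappa_tree <- (observed - expected) / (1 - expected)
print(round(c(accuracy = observed, chance = expected, kappa = kappa_tree), 4))
```

The result rounds to the 0.499 reported above: about a fifth of the predictions would agree with the reference by chance alone.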
set.seed(1234)
svmMod <- train(classe~., data=training, method="svmRadial", trControl = train_control, tuneLength = 5, verbose = FALSE)
# Prediction
predSvm <- predict(svmMod, validation)
#Confusion matrix
cmSvm <- confusionMatrix(predSvm, factor(validation$classe))
cmSvm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1390 44 3 1 0
## B 0 895 6 0 0
## C 4 8 845 63 5
## D 0 0 1 739 14
## E 1 2 0 1 882
##
## Overall Statistics
##
## Accuracy : 0.9688
## 95% CI : (0.9635, 0.9735)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9605
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9431 0.9883 0.9192 0.9789
## Specificity 0.9863 0.9985 0.9802 0.9963 0.9990
## Pos Pred Value 0.9666 0.9933 0.9135 0.9801 0.9955
## Neg Pred Value 0.9986 0.9865 0.9975 0.9843 0.9953
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2834 0.1825 0.1723 0.1507 0.1799
## Detection Prevalence 0.2932 0.1837 0.1886 0.1538 0.1807
## Balanced Accuracy 0.9914 0.9708 0.9843 0.9577 0.9890
Accuracy and Out of Sample Error
DTree <- c(cmTrees$overall["Accuracy"],1-c(cmTrees$overall["Accuracy"]))
RF <- c(cmRF$overall["Accuracy"],1-c(cmRF$overall["Accuracy"]))
SVM <- c(cmSvm$overall["Accuracy"],1-c(cmSvm$overall["Accuracy"]))
Output <- rbind(DTree,RF,SVM)
colnames(Output) <- c("Accuracy","oo_S_Err")
Output <- Output %>% apply(.,2, round,3)
Output[order(-Output[,1]),]
## Accuracy oo_S_Err
## RF 0.993 0.007
## SVM 0.969 0.031
## DTree 0.608 0.392
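The table can be cross-checked directly from the confusion matrices. In the sketch below, the counts of correct predictions are the diagonal sums of the three matrices printed earlier, and 4904 is the size of the validation set; the object names are ours.

```r
# Validation set size and correct predictions (diagonal sums of the matrices above)
n_validation <- 4904
correct <- c(RF = 4868, SVM = 4751, DTree = 2981)

# Accuracy on held-out data, and its complement as the out-of-sample error estimate
accuracy  <- correct / n_validation
oos_error <- 1 - accuracy

print(round(cbind(accuracy, oos_error), 3))
```

Because the validation rows were not used to fit or tune any of the models, one minus the validation accuracy serves as the out-of-sample error estimate.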
The best model is the random forest, with a validation accuracy of 0.9926591 and an out-of-sample error rate of 0.0073409. That is accurate enough to use on the test set.
We use the random forest model to predict the 20 test cases, since it has the highest accuracy.
testPredRF <- predict(rfMod, pmlTesting)
print(testPredRF)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Correlation matrix of the variables in the training set
library(psych)
cor.plot(numDat,xlas = 2)
Plotting the models
plot(rfMod)
plot(treeMod)
plot(svmMod, plotType = "line")