For this project, we are given data from accelerometers on the belt, forearm, arm, and dumbbell of 6 research study participants. Our training data consists of accelerometer measurements and a label (classe) identifying the quality of the activity the participant was doing. Our testing data consists of the same measurements without the label. Our goal is to predict the labels for the 20 test set observations.
Below is the code I used when creating the model, estimating the out-of-sample error, and making predictions. I also include a description of each step of the process.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
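For reproducibility, the two CSV files can be downloaded directly from the URLs above before reading them in. This is a minimal sketch; the object names trainUrl and testUrl are mine, and the destination file names match those used in read.csv below:
# Download the data files if they are not already present in the working directory
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testUrl,  destfile = "pml-testing.csv")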
library(caret)
## Warning: package 'caret' was built under R version 3.2.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.3
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.3
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(knitr)
## Warning: package 'knitr' was built under R version 3.2.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.2.3
ptrain <- read.csv("pml-training.csv")
ptest <- read.csv("pml-testing.csv")
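An optional refinement, assuming the raw CSVs encode missing values as blank fields or the spreadsheet artifact "#DIV/0!" (not verified here), is to map those strings to NA at read time so the mostly-NA filter below catches them:
# Alternative read that treats blank fields and "#DIV/0!" as NA (assumption about the raw files)
ptrain <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
ptest  <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))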
Because I want to be able to estimate the out-of-sample error, I randomly split the full training data (ptrain) into a smaller training set (ptrain1) and a validation set (ptrain2):
partition <- createDataPartition(y=ptrain$classe, p=0.7, list=F)
ptrain1 <- ptrain[partition, ]
ptrain2 <- ptrain[-partition, ]
Next I remove variables with near-zero variance, variables that are mostly NA, and variables that have no value for prediction (identifiers and timestamps):
# Remove variables with near-zero variance
nzv <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzv]
ptrain2 <- ptrain2[, -nzv]
# Remove variables that are mostly NA
mostlyNA <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, mostlyNA == FALSE]
ptrain2 <- ptrain2[, mostlyNA == FALSE]
# Remove variables with no predictive value (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp), which happen to be the first five variables
ptrain1 <- ptrain1[, -(1:5)]
ptrain2 <- ptrain2[, -(1:5)]
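As a quick sanity check, the cleaned training set should now contain the 53 predictors plus classe that the model summaries below report; exact row counts depend on the random partition:
# Expect roughly 13737 rows and 54 columns (53 predictors + classe) after cleaning
dim(ptrain1)
dim(ptrain2)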
I first try a classification tree (rpart), using caret's default bootstrap resampling:
modFit <- train(classe ~ ., data = ptrain1, method = "rpart")
print(modFit, digits=3)
## CART
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.0395 0.542 0.4093 0.0512 0.0794
## 0.0595 0.389 0.1606 0.0487 0.0813
## 0.1159 0.321 0.0545 0.0425 0.0628
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0395.
print(modFit$finalModel, digits=3)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9830 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130 12512 8650 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34.3 1069 2 A (1 0.0019 0 0 0) *
## 5) pitch_forearm>=-34.3 11443 8650 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 438 9663 6930 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122 5991 3540 A (0.41 0.18 0.19 0.16 0.06) *
## 21) roll_forearm>=122 3672 2480 C (0.078 0.18 0.32 0.23 0.18) *
## 11) magnet_dumbbell_y>=438 1780 868 B (0.034 0.51 0.042 0.23 0.18) *
## 3) roll_belt>=130 1225 43 E (0.035 0 0 0 0.96) *
fancyRpartPlot(modFit$finalModel)
Now I run the fitted tree against the validation set (ptrain2) and look at the confusion matrix:
# Run against ptrain2
predictions <- predict(modFit, ptrain2)
print(confusionMatrix(predictions, ptrain2$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1492 485 453 451 131
## B 27 381 34 158 151
## C 124 273 539 355 283
## D 0 0 0 0 0
## E 31 0 0 0 517
##
## Overall Statistics
##
## Accuracy : 0.4977
## 95% CI : (0.4849, 0.5106)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3442
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8913 0.33450 0.52534 0.0000 0.47782
## Specificity 0.6390 0.92204 0.78699 1.0000 0.99355
## Pos Pred Value 0.4954 0.50732 0.34244 NaN 0.94343
## Neg Pred Value 0.9367 0.85236 0.88703 0.8362 0.89414
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2535 0.06474 0.09159 0.0000 0.08785
## Detection Prevalence 0.5118 0.12761 0.26746 0.0000 0.09312
## Balanced Accuracy 0.7652 0.62827 0.65617 0.5000 0.73568
It was disappointing to see this low accuracy (0.4977 on the validation set), so I try a random forest instead.
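Fitting a random forest on roughly 14,000 rows can be slow. As an optional speed-up that was not used for the run shown here, caret can train the cross-validation folds in parallel once a backend is registered, for example with the doParallel package:
# Optional: register a parallel backend so train() can fit folds in parallel
library(doParallel)
cl <- makePSOCKcluster(4)   # 4 worker processes; adjust to your machine
registerDoParallel(cl)
# ... call train() as below ...
stopCluster(cl)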
# Fit a random forest on ptrain1, using 4-fold cross-validation to choose mtry
set.seed(666)
modFit <- train(classe ~ ., data = ptrain1, method = "rf", trControl = trainControl(method = "cv", number = 4))
print(modFit, digits=3)
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 10302, 10302, 10304, 10303
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.991 0.989 0.00252 0.00319
## 27 0.996 0.994 0.00104 0.00131
## 53 0.993 0.991 0.00139 0.00176
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
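Before validating, it can be informative to see which predictors the random forest relies on most; caret's varImp works directly on the fitted train object (shown as an optional check, not part of the original run):
# Variable importance for the random forest fit
varImp(modFit)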
Now I run the random forest model against the validation set (ptrain2):
predictions <- predict(modFit, newdata=ptrain2)
print(confusionMatrix(predictions, ptrain2$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1135 2 0 0
## C 0 0 1024 5 0
## D 0 0 0 959 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9978
## 95% CI : (0.9962, 0.9988)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9972
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9965 0.9981 0.9948 0.9982
## Specificity 0.9991 0.9996 0.9990 0.9996 1.0000
## Pos Pred Value 0.9976 0.9982 0.9951 0.9979 1.0000
## Neg Pred Value 1.0000 0.9992 0.9996 0.9990 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1929 0.1740 0.1630 0.1835
## Detection Prevalence 0.2851 0.1932 0.1749 0.1633 0.1835
## Balanced Accuracy 0.9995 0.9980 0.9985 0.9972 0.9991
accuracy <- postResample(predictions, ptrain2$classe)
accuracy
## Accuracy Kappa
## 0.9977910 0.9972057
# Out-of-sample error estimate from the validation set
oose <- 1 - as.numeric(confusionMatrix(predictions, ptrain2$classe)$overall[1])
oose
## [1] 0.002209006
Finally, I use the random forest model to predict the classe labels for the 20 observations in the test set (ptest):
predictions <- predict(modFit, newdata = ptest)
print(predictions)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
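To submit these 20 predictions, one approach is to write each one to its own text file. The helper below is a hypothetical sketch; pml_write_files is not defined anywhere else in this report's code:
# Hypothetical helper: write one file per test case prediction
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(predictions))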
#### 1. Correlation Matrix Visualization
# Correlation matrix of the predictors (the last column, classe, is excluded)
corrPlot <- cor(ptrain1[, -length(names(ptrain1))])
corrplot(corrPlot, method="color")
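As an optional follow-up to the correlation plot, caret's findCorrelation can list predictors with very high pairwise correlations; these could be dropped or pre-processed with PCA, although that was not done for the models above:
# Predictors with absolute pairwise correlation above 0.9 (the cutoff is an arbitrary choice)
highCorr <- findCorrelation(corrPlot, cutoff = 0.9)
names(ptrain1)[highCorr]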
#### 2. Decision Tree Visualization
# Grow a single classification tree on ptrain1 for the visualization
treeModel <- rpart(classe ~ ., data=ptrain1, method="class")
prp(treeModel)