In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. My goal is to predict the manner in which they did the exercise; this is the “classe” variable in the training set. More details: https://class.coursera.org/predmachlearn-012/human_grading/view/courses/973547/assessments/4/submissions.
My first step is to load the data from local files and remove near-zero-variance predictors, predictors that are mostly NA, and highly correlated predictors.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(2015)
dat <- read.csv('data/pml-training.csv', row.names = 1)
dim(dat)
## [1] 19622 159
# Remove predictors that are mostly NA, then zero- and near-zero-variance predictors
dat <- dat[colSums(is.na(dat)) < 0.5*nrow(dat)] # 93 variables remain
nzv <- nearZeroVar(dat)
dat <- dat[, -nzv] # 58 variables remain
dim(dat)
## [1] 19622 58
# Identify and remove highly correlated predictors
numericData <- dat[sapply(dat, is.numeric)]
descrCor <- cor(numericData)
summary(descrCor[upper.tri(descrCor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992000 -0.104100 0.001566 0.001313 0.086960 0.980900
highlyCorDescr <- findCorrelation(descrCor, cutoff = .8)
highlyCorCol <- colnames(numericData[,highlyCorDescr])
dat <- dat[, -which(colnames(dat) %in% highlyCorCol)]
dim(dat)
## [1] 19622 46
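As a quick sanity check (a sketch reusing the same summary call as above, not part of the original run), the surviving numeric predictors should now have no pairwise correlation above the 0.8 cutoff:
# Sanity check: no remaining pair should exceed the 0.8 cutoff
remainingCor <- cor(dat[sapply(dat, is.numeric)])
summary(remainingCor[upper.tri(remainingCor)])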
# Simple splitting: 60% training, 40% testing
inTraining <- createDataPartition(dat$classe, p = .6, list = FALSE)
training <- dat[ inTraining,]
testing <- dat[-inTraining,]
# Model training control: 10-fold cross-validation
fitControl <- trainControl(method = "cv", number = 10)
#Model List: http://topepo.github.io/caret/modelList.html
# Gradient Boosting Machine (gbm)
start <- proc.time()
gbmFit <- train(classe ~ ., data = training,
                method = "gbm",
                trControl = fitControl,
                verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
elapsed <- proc.time() - start
I tried three models: recursive partitioning (rpart), a gradient boosting machine (gbm), and a random forest (rf). The code for each is identical except for the method argument passed to train(). The rpart model could not be fitted on this data (see the note in the annex). My final choice is gbm, due to its high accuracy (0.9976 on the hold-out set); the rf model is slightly more accurate (0.9992), but it took roughly twice as long to train.
The modeling code and results are in the annex at the end; a quick way to compare the two successful fits is sketched below.
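caret offers resamples() to compare the cross-validation results of several fits side by side. A minimal sketch, assuming gbmFit and rfFit from the annex are both in memory:
# Compare the two resampled fits (sketch; gbmFit and rfFit are fitted in the annex)
results <- resamples(list(GBM = gbmFit, RF = rfFit))
summary(results) # accuracy and kappa distributions per model
bwplot(results)  # visual comparison of the resampling distributions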
Finally, I predicted the CSV test data with the gbm model and wrote the results to text files for submission.
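That write-out step is not shown in the annex; here is a minimal sketch, where the test-file path and the problem_id_*.txt naming scheme are my assumptions:
# Sketch of the submission step (assumed path and file names)
testdat <- read.csv('data/pml-testing.csv', row.names = 1)
answers <- predict(gbmFit, testdat)
for (i in seq_along(answers)) {
  write.table(answers[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}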
# gradient boosting machine (gbm) model
elapsed
## user system elapsed
## 678.79 0.72 680.02
gbmFit
## Stochastic Gradient Boosting
##
## 11776 samples
## 45 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10598, 10599, 10598, 10600, 10598, 10598, ...
##
## Resampling results across tuning parameters:
##
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD
##   1                   50      0.8296495  0.7839736  0.013695744  0.017415787
##   1                  100      0.8963129  0.8686992  0.007708690  0.009806397
##   1                  150      0.9201751  0.8988967  0.009220057  0.011686888
##   2                   50      0.9538873  0.9416148  0.005537182  0.007007403
##   2                  100      0.9850544  0.9810943  0.003493162  0.004419199
##   2                  150      0.9915081  0.9892585  0.002886121  0.003650359
##   3                   50      0.9817428  0.9769025  0.004531410  0.005730630
##   3                  100      0.9935462  0.9918367  0.002374829  0.003003883
##   3                  150      0.9962634  0.9952739  0.002375532  0.003004306
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
prediction_gbm <- predict(gbmFit, testing)
confusionMatrix(prediction_gbm, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 2 0 0 0
## B 0 1511 1 0 0
## C 0 3 1360 1 0
## D 0 2 7 1285 3
## E 0 0 0 0 1439
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.9962, 0.9985)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9969
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9954 0.9942 0.9992 0.9979
## Specificity 0.9996 0.9998 0.9994 0.9982 1.0000
## Pos Pred Value 0.9991 0.9993 0.9971 0.9907 1.0000
## Neg Pred Value 1.0000 0.9989 0.9988 0.9998 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1926 0.1733 0.1638 0.1834
## Detection Prevalence 0.2847 0.1927 0.1738 0.1653 0.1834
## Balanced Accuracy 0.9998 0.9976 0.9968 0.9987 0.9990
plot(gbmFit)
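The expected out-of-sample error is one minus the hold-out accuracy, i.e. roughly 1 - 0.9976 = 0.0024 here; a one-line sketch:
# Estimated out-of-sample error from the 40% hold-out set
1 - unname(confusionMatrix(prediction_gbm, testing$classe)$overall["Accuracy"])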
#Random Forest (RF)
start <- proc.time()
rfFit <- train(classe ~ ., data = training,
               method = "rf",
               trControl = fitControl,
               verbose = FALSE)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
elapsed <- proc.time() - start
elapsed
## user system elapsed
## 1637.55 5.76 1645.17
rfFit
## Random Forest
##
## 11776 samples
## 45 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10599, 10599, 10598, 10597, 10597, 10599, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9821664 0.9774326 0.0022347582 0.002828670
## 34 0.9993203 0.9991403 0.0008774815 0.001109833
## 67 0.9982162 0.9977439 0.0012315787 0.001557637
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 34.
prediction_rf <- predict(rfFit, testing)
confusionMatrix(prediction_rf, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 0 0 0 0
## B 0 1517 2 0 0
## C 0 1 1366 1 0
## D 0 0 0 1285 2
## E 0 0 0 0 1440
##
## Overall Statistics
##
## Accuracy : 0.9992
## 95% CI : (0.9983, 0.9997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.999
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9993 0.9985 0.9992 0.9986
## Specificity 1.0000 0.9997 0.9997 0.9997 1.0000
## Pos Pred Value 1.0000 0.9987 0.9985 0.9984 1.0000
## Neg Pred Value 1.0000 0.9998 0.9997 0.9998 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1933 0.1741 0.1638 0.1835
## Detection Prevalence 0.2845 0.1936 0.1744 0.1640 0.1835
## Balanced Accuracy 1.0000 0.9995 0.9991 0.9995 0.9993
plot(rfFit)
# Recursive Partitioning (rpart)
# rpartFit <- train(classe ~ ., data = training,
#                   method = "rpart",
#                   trControl = fitControl,
#                   verbose = FALSE)
# This model could not be fitted.
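A likely cause, though I have not verified it against the original run: train() forwards extra arguments such as verbose = FALSE to the underlying fitting function, and rpart::rpart has no verbose argument, so the call errors. Dropping that argument should let the model fit:
# Sketch: the same call without verbose, which rpart does not accept
rpartFit <- train(classe ~ ., data = training,
                  method = "rpart",
                  trControl = fitControl)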