For this project, I determine the manner in which participants performed an exercise from data collected by wearable devices that track exercise. In the past, such devices have tracked how much exercise was done, but not how it was done. The goal of this project is to find out whether the manner of exercise can be predicted from data gathered by these devices. The data come from sensors worn by participants as they exercised; sometimes the participants performed the exercise correctly and sometimes they did not. Finally, I predict the manner of exercise for previously unseen data. Below are the dimensions of the dataset. As you can see, there are many features, and many of them consist mostly of NAs, so it was necessary to remove features that carry very little information. In addition, several features describe the subject who performed the exercise, leading to highly correlated attributes; these are unnecessary because they reduce to the single variable identifying the subject of the study, so I drop them. I then split the data into a training and a test set to fit models and assess the results.
library(caret)
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.4.4
## Loading required package: ggplot2
library(rattle)
## Warning: package 'rattle' was built under R version 3.4.4
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(plyr)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(nnet)
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.4.4
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
library(ElemStatLearn)
## Warning: package 'ElemStatLearn' was built under R version 3.4.4
##
## Attaching package: 'ElemStatLearn'
## The following object is masked from 'package:plyr':
##
## ozone
df<-read.csv(file="pml-training.csv", header=TRUE)
cat("Dimensions of entire training dataset: ",dim(df))
## Dimensions of entire training dataset: 19622 160
# Identify columns that are more than 90% NA or blank, then drop them
train.dat <- which(colSums(is.na(df) |df=="")>0.9*dim(df)[1])
train.data <- df[,-train.dat]
# Drop the first seven columns (row index, subject, timestamps, window indicators)
train.data<-train.data[,-c(1:7)]
cat("Dimensions of cleaned training dataset: ",dim(train.data))
## Dimensions of cleaned training dataset: 19622 53
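As a quick supplementary check (not part of the original analysis), I can confirm that the surviving columns contain no remaining missing values:
# Count any NAs left after dropping the mostly-empty columns; expected to be zero here
sum(is.na(train.data))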
set.seed(123)
inTrain = createDataPartition(train.data$classe, p = 3/4)[[1]]
training = train.data[ inTrain,]
dim(training)
## [1] 14718 53
testing=train.data[-inTrain,]
dim(testing)
## [1] 4904 53
x_train <- training[, -ncol(training)]
x_test <- testing[, -ncol(testing)]
y_train <- training$classe
y_test<-testing$classe
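As another quick check (supplementary to the original analysis), the class balance of the training split can be inspected to confirm that all five classes are well represented:
# Proportion of each exercise class (A-E) in the training split
round(prop.table(table(y_train)), 3)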
Because the resulting feature matrix is sparse, I chose lasso regression, which applies a penalty that shrinks features with little effect on the predicted outcome. As you can see below, many of the resulting coefficients are zero. I set up the model to search for the best parameters using the tuneGrid and trControl arguments, then display the variable importance from the lasso regression. I used the model with the best lambda value (the size of the penalty, which controls how many features enter the model) and five-fold cross-validation on the training set. Lasso regression is usually applied to numerical outcomes, but I chose it here to see how a penalized linear model would perform on a classification problem.
control<-trainControl(method="cv", number=5, classProbs = TRUE)
tuneGrid=expand.grid(.alpha=1,.lambda=10^seq(10, -2, length = 100))
lasso.mod <- train(classe~.,data=training, method="glmnet",trControl=control, metric="Accuracy", tuneGrid=tuneGrid, family="multinomial")
# caret stores the tuned penalty in bestTune; predict() on the train object uses it automatically
bestlam <- lasso.mod$bestTune$lambda
lasso.pred <- predict(lasso.mod, newdata=testing, type="raw")
lasso.con<-confusionMatrix( lasso.pred, testing$classe)
head(varImp(lasso.mod), 53)
## $importance
## A B C D
## roll_belt 0.000000e+00 0.00000000 0.00000000 0.37493523
## pitch_belt 0.000000e+00 0.00000000 0.00000000 0.00000000
## yaw_belt 0.000000e+00 0.00000000 0.00000000 0.00000000
## total_accel_belt 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_belt_x 1.000000e+02 0.00000000 0.00000000 0.00000000
## gyros_belt_y 5.531670e+01 0.00000000 0.00000000 0.00000000
## gyros_belt_z 0.000000e+00 0.00000000 18.11713037 0.00000000
## accel_belt_x 0.000000e+00 0.00000000 0.00000000 0.00000000
## accel_belt_y 0.000000e+00 0.71465368 0.00000000 0.00000000
## accel_belt_z 4.869178e-01 0.00000000 0.00000000 0.00000000
## magnet_belt_x 1.095818e+00 0.00000000 0.00000000 0.00000000
## magnet_belt_y 0.000000e+00 0.00000000 0.00000000 1.21849220
## magnet_belt_z 5.092630e-01 0.00000000 0.00000000 0.00000000
## roll_arm 0.000000e+00 0.20640733 0.00000000 0.00000000
## pitch_arm 6.004254e-01 0.17778836 0.00000000 0.00000000
## yaw_arm 1.372930e-01 0.12450372 0.00000000 0.01880261
## total_accel_arm 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_arm_x 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_arm_y 4.329295e+00 0.00000000 0.00000000 0.00000000
## gyros_arm_z 0.000000e+00 0.00000000 16.60737140 0.00000000
## accel_arm_x 0.000000e+00 0.15985478 0.00000000 0.45821549
## accel_arm_y 0.000000e+00 0.00000000 0.00000000 0.00000000
## accel_arm_z 6.173637e-01 0.00000000 0.00000000 0.00000000
## magnet_arm_x 3.943701e-03 0.00000000 0.00000000 0.00000000
## magnet_arm_y 0.000000e+00 0.24704908 0.00000000 0.00000000
## magnet_arm_z 1.553366e-01 0.44046435 0.00000000 0.02970192
## roll_dumbbell 0.000000e+00 0.00000000 1.13946927 1.17853108
## pitch_dumbbell 0.000000e+00 0.00000000 0.31278419 0.00000000
## yaw_dumbbell 6.225701e-01 0.00000000 0.54257746 0.33655658
## total_accel_dumbbell 0.000000e+00 4.11087984 0.00000000 2.75798910
## gyros_dumbbell_x 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_dumbbell_y 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_dumbbell_z 0.000000e+00 0.00000000 0.00000000 0.00000000
## accel_dumbbell_x 0.000000e+00 1.77708447 0.00000000 0.00000000
## accel_dumbbell_y 0.000000e+00 0.29804218 0.47744523 0.00000000
## accel_dumbbell_z 4.679891e-02 0.01633127 0.00000000 0.00000000
## magnet_dumbbell_x 0.000000e+00 0.00000000 0.00000000 0.00000000
## magnet_dumbbell_y 0.000000e+00 0.11971844 0.00000000 0.00000000
## magnet_dumbbell_z 1.433707e+00 0.57070194 0.17463770 0.00000000
## roll_forearm 2.200678e-01 0.00000000 0.03584440 0.00000000
## pitch_forearm 4.554249e+00 0.00000000 0.00000000 2.43038110
## yaw_forearm 0.000000e+00 0.00000000 0.00000000 0.00000000
## total_accel_forearm 5.618260e+00 0.00000000 0.00000000 0.00000000
## gyros_forearm_x 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_forearm_y 0.000000e+00 0.00000000 0.00000000 0.00000000
## gyros_forearm_z 0.000000e+00 0.00000000 0.00000000 0.00000000
## accel_forearm_x 0.000000e+00 0.00000000 0.00000000 0.47654328
## accel_forearm_y 0.000000e+00 0.05140739 0.00000000 0.07034039
## accel_forearm_z 4.861418e-01 0.04319445 0.00000000 0.00000000
## magnet_forearm_x 0.000000e+00 0.00000000 0.00000000 0.00000000
## magnet_forearm_y 0.000000e+00 0.00000000 0.04550607 0.00000000
## magnet_forearm_z 2.837598e-02 0.00000000 0.16621399 0.00000000
## E
## roll_belt 0.99442837
## pitch_belt 1.68097635
## yaw_belt 0.00000000
## total_accel_belt 5.81365375
## gyros_belt_x 33.89016841
## gyros_belt_y 0.00000000
## gyros_belt_z 0.00000000
## accel_belt_x 0.00000000
## accel_belt_y 1.73329879
## accel_belt_z 0.00000000
## magnet_belt_x 0.00000000
## magnet_belt_y 4.30866576
## magnet_belt_z 0.00000000
## roll_arm 0.00000000
## pitch_arm 0.34008175
## yaw_arm 0.00000000
## total_accel_arm 0.00000000
## gyros_arm_x 4.86463719
## gyros_arm_y 0.00000000
## gyros_arm_z 0.00000000
## accel_arm_x 0.00000000
## accel_arm_y 0.00000000
## accel_arm_z 0.00000000
## magnet_arm_x 0.00000000
## magnet_arm_y 0.40023753
## magnet_arm_z 0.00000000
## roll_dumbbell 0.00000000
## pitch_dumbbell 0.00000000
## yaw_dumbbell 0.00000000
## total_accel_dumbbell 7.87607563
## gyros_dumbbell_x 0.00000000
## gyros_dumbbell_y 0.00000000
## gyros_dumbbell_z 0.00000000
## accel_dumbbell_x 0.17351241
## accel_dumbbell_y 0.00000000
## accel_dumbbell_z 0.16804920
## magnet_dumbbell_x 0.00000000
## magnet_dumbbell_y 0.20283209
## magnet_dumbbell_z 0.00000000
## roll_forearm 0.07666691
## pitch_forearm 0.00000000
## yaw_forearm 0.00000000
## total_accel_forearm 2.18606816
## gyros_forearm_x 0.00000000
## gyros_forearm_y 0.00000000
## gyros_forearm_z 0.00000000
## accel_forearm_x 0.00000000
## accel_forearm_y 0.00000000
## accel_forearm_z 0.54511833
## magnet_forearm_x 0.00000000
## magnet_forearm_y 0.05092659
## magnet_forearm_z 0.00000000
##
## $model
## [1] "glmnet"
##
## $calledFrom
## [1] "varImp"
Because there are so many features, only a small subset of them should be used for predictions; using all of the features risks overfitting. Below is the investigation into principal components. When the pca preprocessing option is used to fit the machine learning models below, I chose the number of principal components to capture a little less than 98% of the total variance. This turns out to be about 30 principal components instead of the 52 features contained in the dataset, as you can see from the analysis below.
pr_comp<-prcomp(x_train, scale=TRUE)
st_dev<-pr_comp$sdev
p_var<-st_dev^2
prop_ex<-p_var/sum(p_var)
summary(pr_comp)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.8940 2.839 2.15826 2.0712 1.92040 1.73879 1.49195
## Proportion of Variance 0.1611 0.155 0.08958 0.0825 0.07092 0.05814 0.04281
## Cumulative Proportion 0.1611 0.316 0.40560 0.4881 0.55902 0.61716 0.65997
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.44674 1.31154 1.22948 1.17376 1.06498 0.99335
## Proportion of Variance 0.04025 0.03308 0.02907 0.02649 0.02181 0.01898
## Cumulative Proportion 0.70022 0.73330 0.76237 0.78886 0.81067 0.82965
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.9429 0.90748 0.88704 0.82819 0.75526 0.72243
## Proportion of Variance 0.0171 0.01584 0.01513 0.01319 0.01097 0.01004
## Cumulative Proportion 0.8468 0.86258 0.87771 0.89090 0.90187 0.91191
## PC20 PC21 PC22 PC23 PC24 PC25
## Standard deviation 0.68825 0.64289 0.62711 0.60947 0.57735 0.54723
## Proportion of Variance 0.00911 0.00795 0.00756 0.00714 0.00641 0.00576
## Cumulative Proportion 0.92102 0.92897 0.93653 0.94368 0.95009 0.95584
## PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 0.53362 0.50213 0.48136 0.44849 0.41347 0.39012
## Proportion of Variance 0.00548 0.00485 0.00446 0.00387 0.00329 0.00293
## Cumulative Proportion 0.96132 0.96617 0.97062 0.97449 0.97778 0.98071
## PC32 PC33 PC34 PC35 PC36 PC37
## Standard deviation 0.36248 0.34510 0.33401 0.30285 0.28111 0.25327
## Proportion of Variance 0.00253 0.00229 0.00215 0.00176 0.00152 0.00123
## Cumulative Proportion 0.98323 0.98552 0.98767 0.98943 0.99095 0.99219
## PC38 PC39 PC40 PC41 PC42 PC43
## Standard deviation 0.23593 0.23317 0.20002 0.19187 0.18478 0.1772
## Proportion of Variance 0.00107 0.00105 0.00077 0.00071 0.00066 0.0006
## Cumulative Proportion 0.99326 0.99430 0.99507 0.99578 0.99644 0.9970
## PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.16770 0.16539 0.16313 0.14621 0.14204 0.11230
## Proportion of Variance 0.00054 0.00053 0.00051 0.00041 0.00039 0.00024
## Cumulative Proportion 0.99758 0.99811 0.99862 0.99903 0.99942 0.99966
## PC50 PC51 PC52
## Standard deviation 0.09834 0.07666 0.04577
## Proportion of Variance 0.00019 0.00011 0.00004
## Cumulative Proportion 0.99985 0.99996 1.00000
barplot(100*prop_ex[1:30], las=2, xlab='Components 1-30', ylab='% Variance Explained')
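The component count can also be read directly from the cumulative variance; keeping the components whose cumulative proportion stays just under the 98% threshold gives roughly 30 components, matching the choice above:
# Number of leading principal components with cumulative variance below 98%
sum(cumsum(prop_ex) < 0.98)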
I trained several models, using tuning parameters to get the best results. I used five-fold cross-validation, principal components accounting for about 98% of the variance, and standard parameter choices for the neural networks. I chose a variety of algorithms for this task based on their applicability to the dataset, including the lasso regression discussed above and a stacking algorithm that combines the other algorithms into an ensemble. These include support vector machines, which work well with complex classification boundaries; neural networks, which learn their own internal representation of the features; and random forests, which build many shallower trees rather than one fully grown tree.
trcontrol<-trainControl(method="repeatedcv", number=5, repeats=3, classProbs=TRUE, summaryFunction=multiClassSummary, preProcOptions=list(pcaComp=30))
svm.fit<-train(classe~., data=training, method="svmRadial", metric="Accuracy", preProcess="pca", trControl=trcontrol)
svm.pred<-predict(svm.fit, testing)
svm.con<-confusionMatrix(svm.pred, testing$classe)
trcontrol<-trainControl(method="repeatedcv", number=5, repeats=3, classProbs=TRUE, summaryFunction=multiClassSummary, preProcOptions=list(pcaComp=30))
grid.tune<-expand.grid(.mtry=c(1:6))
rf_mod <- train(classe~.,data=training, method = "rf", metric="Accuracy", tuneGrid=grid.tune, preProcess="pca", trControl=trcontrol)
rf_pred<-predict(rf_mod, testing)
rf.con<-confusionMatrix(rf_pred, testing$classe)
trcontrol<-trainControl(method="repeatedcv", number=5, repeats=3, classProbs=TRUE, summaryFunction=multiClassSummary, preProcOptions=list(pcaComp=30))
grid <- expand.grid(.decay = c(.05, .01), .size = c(13,15,17))
nn.mod<-train(classe~., data=training, method="nnet", preProcess="pca", maxit=1000, tuneGrid=grid, trControl=trcontrol, trace=FALSE)
nn.pred<-predict(nn.mod, testing, type="raw")
nn.con<-confusionMatrix(as.factor(nn.pred), testing$classe)
trcontrol<-trainControl(method="repeatedcv", number=5, repeats=3, classProbs=TRUE, summaryFunction=multiClassSummary)
predDF <- data.frame(svm.pred, rf_pred, nn.pred, classe=testing$classe)
modelStack <- train(classe ~ ., data = predDF, method = "gbm", trControl=trcontrol, verbose=FALSE)
combPred <- predict(modelStack, predDF)
stack.con<-confusionMatrix(combPred, predDF$classe)
Below are plots of some of the models above. They show that the neural network performs best with a weight decay of .05 and 17 hidden units.
plot(nn.mod, main="Neural Networks")
The support vector machine performs best when the decision boundary (margin) is wider.
plot(svm.fit, main="Radial Support Vector Machines")
The random forest performs best with fewer than 6 randomly sampled predictors per split (the mtry tuning parameter).
plot(rf_mod, main="Random Forests")
The boosting algorithm used in stacking works best with fewer weak learners and smaller trees.
plot(modelStack, main="Boosting Stacked Model")
Below is the accuracy of each of the models. Notice that the neural network, support vector machine, random forest, and stacked model all did well. In fact, the random forest and the stacked model each achieved over 97% accuracy, with essentially identical results. Lasso performed the worst, most likely because it is a linear method usually applied to numerical outcomes rather than classification. I decided to exclude lasso regression from the stacked model because of its poor performance on the data.
cat("Lasso Regression Accuracy: ", lasso.con$overall[1])
## Lasso Regression Accuracy: 0.5921697
cat("Suppor Vector Machine Accuracy: ", svm.con$overall[1])
## Suppor Vector Machine Accuracy: 0.9068108
cat("Random Forest Accuracy: ", rf.con$overall[1])
## Random Forest Accuracy: 0.9743067
cat("Neural Network Accuracy: ", nn.con$overall[1])
## Neural Network Accuracy: 0.8762235
cat("Stached Algorithm Accuracy", stack.con$overall[1])
## Stached Algorithm Accuracy 0.9743067
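As a supplementary comparison (assuming the caret models were fit with compatible resampling settings), resamples() can summarize the cross-validated performance of the fitted models side by side:
# Compare resampled performance across the caret-trained models
cv.results <- resamples(list(SVM = svm.fit, RF = rf_mod, NNet = nn.mod))
summary(cv.results)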
To make the final predictions, I first load the held-out test set into R and apply the same cleaning I performed on the training set.
dat<-read.csv(file="pml-testing.csv", header=TRUE)
test.dats <- which(colSums(is.na(dat) |dat=="")>0.9*dim(dat)[1])
test.datas <- dat[,-test.dats]
test.datas<-test.datas[,-c(1:7)]
cat("Dimensions of Cleaned Test Set: ",dim(test.datas))
## Dimensions of Cleaned Test Set: 20 53
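Before predicting, a quick supplementary check confirms that the cleaning kept the same predictor columns as the cleaned training set (the outcome column classe is not expected in the test file):
# Predictor columns should match between the cleaned training and test sets
setdiff(names(train.data), names(test.datas))
setdiff(names(test.datas), names(train.data))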
As you can see, there is some variation in the predictions from the different models. The random forest and the stacking algorithm produce identical predictions, and because they achieved the highest accuracy, I use them for the final predictions. Given how long it took to fit the models, I would suggest carefully choosing a few promising algorithms that are likely to perform well on the given data; the time required to train and evaluate all of the models was prohibitive. Random forests are ensemble methods that combine many smaller trees to classify the data, and here the random forest performed as well as ensembling all of the algorithms together, so the boosted stacking step was not necessary: stacking did not significantly improve the accuracy.
svm.test<-predict(svm.fit, newdata=test.datas)
rf.test<-predict(rf_mod, newdata=test.datas)
nn.test<-predict(nn.mod, newdata=test.datas)
pred.stack <- data.frame(svm.pred=svm.test, rf_pred=rf.test, nn.pred=nn.test)
com.predict <- predict(modelStack, pred.stack)
final.data<-cbind(pred.stack, com.predict)
names(final.data)<-c("svm", "RandomForest", "NeuralNetworks", "StackedAlgorithm")
final.data
## svm RandomForest NeuralNetworks StackedAlgorithm
## 1 B B B B
## 2 A A A A
## 3 C C C C
## 4 A A A A
## 5 A A A A
## 6 E E E E
## 7 D D D D
## 8 B B B B
## 9 A A A A
## 10 A A A A
## 11 A A B A
## 12 C C C C
## 13 B B E B
## 14 A A A A
## 15 E E E E
## 16 E E B E
## 17 A A A A
## 18 B B B B
## 19 B B B B
## 20 B B B B