Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of this project is to predict the manner in which the participants did the exercise, which is recorded in the classe variable of the training set; any of the other variables may be used as predictors. This report describes how the model was built, how cross-validation was used, what the expected out-of-sample error is, and why the various choices were made. The final prediction model is also used to predict 20 different test cases.
rm(list = ls())
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Read in the raw training and testing data
training_raw = read.csv("/Users/woodzsan/Desktop/Data Science with R/Practical Machine Learning/Course Project/pml-training.csv")
testing = read.csv("/Users/woodzsan/Desktop/Data Science with R/Practical Machine Learning/Course Project/pml-testing.csv")
Some statistics on the raw training dataset, training_raw:
dim(training_raw)
## [1] 19622 160
table(training_raw$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
We see that a number of columns in training_raw are missing most of their values: many contain either NA or a blank ('') in roughly 19,216 of the 19,622 rows, so data is present in only 406/19622 ≈ 2% of the rows. We will make the decision to eliminate these columns.
# Keep only columns with fewer than 19,200 NA values
training <- training_raw[ , colSums(is.na(training_raw)) < 19200]
# Keep only columns with more than 19,200 non-blank values (drops the mostly-empty columns)
training <- training[ , colSums(training != "") > 19200]
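As a quick sanity check (not part of the original output), we can confirm how much of the data survives the cleanup; for this dataset the two filters should leave 60 of the original 160 columns, with no missing values remaining:
# Sanity check on the cleanup: expect 60 of the original 160 columns to remain,
# with no NA or blank entries left in this dataset
dim(training)
sum(is.na(training))
sum(training == "")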
We need to establish separate training and cross-validation datasets so that we can estimate the accuracy of our prediction model.
inTrain <- createDataPartition(y=training$classe,p=0.8,list=FALSE)
# Hold out the cross-validation rows before overwriting `training` with the 80% subset
cross_val <- training[-inTrain,]
training <- training[inTrain,]
y_training <- training$classe
y_cross_val <- cross_val$classe
# Drop the identifier/categorical columns (1:6) and the `classe` outcome (column 60) from the 'training' and 'cross_val' datasets
X_train <- training[-c(1:6,60)]
X_cv <- cross_val[-c(1:6,60)]
Now, let’s look at the correlation values among the predictors…
M <- abs(cor(X_train))
diag(M) <- 0
which(M>0.8,arr.ind = TRUE)
## row col
## yaw_belt 4 2
## total_accel_belt 5 2
## accel_belt_y 10 2
## accel_belt_z 11 2
## accel_belt_x 9 3
## magnet_belt_x 12 3
## roll_belt 2 4
## roll_belt 2 5
## accel_belt_y 10 5
## accel_belt_z 11 5
## pitch_belt 3 9
## magnet_belt_x 12 9
## roll_belt 2 10
## total_accel_belt 5 10
## accel_belt_z 11 10
## roll_belt 2 11
## total_accel_belt 5 11
## accel_belt_y 10 11
## pitch_belt 3 12
## accel_belt_x 9 12
## gyros_arm_y 20 19
## gyros_arm_x 19 20
## magnet_arm_x 25 22
## accel_arm_x 22 25
## magnet_arm_z 27 26
## magnet_arm_y 26 27
## accel_dumbbell_x 35 29
## accel_dumbbell_z 37 30
## gyros_dumbbell_z 34 32
## gyros_forearm_z 47 32
## gyros_dumbbell_x 32 34
## gyros_forearm_z 47 34
## pitch_dumbbell 29 35
## yaw_dumbbell 30 37
## gyros_forearm_z 47 46
## gyros_dumbbell_x 32 47
## gyros_dumbbell_z 34 47
## gyros_forearm_y 46 47
We see that many of the columns are highly correlated with each other. Let’s train a PCA model with 2 principal components:
preProc <- preProcess(X_train,method="pca",pcaComp=2)
pred_training <- predict(preProc,X_train)
plot(pred_training[,1],pred_training[,2])
We see from the plot of the two principal components that there are 5 relatively distinct groupings that correspond to the different values of the classe variable.
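As a quick visual check of that claim (not part of the original analysis), the same scatterplot can be colored by classe:
# Color the principal-component scores by the classe label to see how the groupings align
plot(pred_training[,1], pred_training[,2], col = as.factor(y_training),
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(as.factor(y_training)), col = 1:5, pch = 1)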
# Add `classe` vector to the principal component predictors...
training_pca <- cbind(pred_training,y_training)
# Rename the appended `y_training` column to `classe`
names(training_pca)[names(training_pca) == 'y_training'] <- 'classe'
Let’s train a Random Forest classifier using the principal components as our predictors.
modelFit <- train(classe~.,method="rf",data=training_pca)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
First, let's compute the principal-component scores for the cross-validation dataset using the pre-processing parameters fitted on the training dataset.
pred_cross_val <- predict(preProc,X_cv)
plot(pred_cross_val[,1],pred_cross_val[,2])
Let’s generate a confusion matrix for the cross-validation predictions for classe vs. the actual values.
confusionMatrix(y_cross_val,predict(modelFit,pred_cross_val))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 886 0 0 0 0
## B 0 633 0 0 0
## C 0 0 532 0 0
## D 0 0 0 512 0
## E 0 0 0 0 575
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9988, 1)
## No Information Rate : 0.2823
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2823 0.2017 0.1695 0.1632 0.1832
## Detection Rate 0.2823 0.2017 0.1695 0.1632 0.1832
## Detection Prevalence 0.2823 0.2017 0.1695 0.1632 0.1832
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
Wow, we have a cross-validation accuracy of 1! A result this perfect is suspicious and suggests that the Random Forest model is likely overfitting the data.
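For reference, the out-of-sample error implied by this confusion matrix can be pulled out directly (a minimal sketch using the same objects as above):
# Estimated out-of-sample error = 1 - cross-validation accuracy
cm <- confusionMatrix(y_cross_val, predict(modelFit, pred_cross_val))
1 - as.numeric(cm$overall["Accuracy"])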
Anyway, let's go ahead and test this RF model on the test data. We need to keep the same columns in the test dataset that we kept in the training dataset.
# Remove columns with >=1 'NA' value
testing_modified <- testing[ , colSums(is.na(testing)) < 1]
# Remove the identifier/categorical columns (1:6) and the `problem_id` column (column 60)
testing_modified <- testing_modified[-c(1:6,60)]
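The two filters above rely on the test set having NA in exactly the columns we dropped earlier, which is easy to get wrong. A more defensive sketch is to select the test columns by the predictor names retained for training (this assumes every training predictor name also appears in the test set, which holds for this dataset):
# Alternative (a sketch): subset the test set by the predictor names kept for training,
# so the test columns are guaranteed to match the model's inputs
testing_modified <- testing[ , names(X_train)]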
Generate the principal components for the test set using the preProc transform fitted on the training set.
testing_pca <- predict(preProc,testing_modified)
plot(testing_pca[,1],testing_pca[,2])
Make predictions on classe based on modelFit:
testing_classe_pred <- predict(modelFit,testing_pca)
testing_classe_pred
## [1] A C A A D D D B A A A C D A E E C E D B
## Levels: A B C D E
This set of predictions gives us an accuracy of 0.55 (11 of 20 correct) on the Prediction Quiz. Therefore, we conclude that this model does not generalize well and is overfitting.
Define the training control for K-fold cross-validation using K = 3 folds.
set.seed(123)
train.control <- trainControl(method = "cv", number = 3, verboseIter=FALSE)
# Drop the identifier/categorical columns so that only the 53 numeric predictors and `classe` remain
training <- training[-c(1:6)]
Train the model…
modelFit2 <- train(classe ~., data = training, method = "rf",
trControl = train.control)
For this model, the cross-validated accuracy is consistently above 0.99 across the three folds and for every value of the tuning parameter mtry.
modelFit2
## Random Forest
##
## 15699 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 10468, 10465, 10465
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9945219 0.9930701
## 27 0.9970064 0.9962133
## 53 0.9956687 0.9945211
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
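Since the cross-validated accuracy for the selected mtry = 27 is about 0.997, the expected out-of-sample error is roughly 1 - 0.997 ≈ 0.3%. The per-fold accuracies behind that average can also be inspected directly (a small check that was not part of the original output):
# Per-fold Accuracy and Kappa for the selected value of mtry
modelFit2$resample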
Let's look at the confusion matrix for the cross-validation dataset.
confusionMatrix(y_cross_val,predict(modelFit2,X_cv))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 886 0 0 0 0
## B 0 633 0 0 0
## C 0 0 532 0 0
## D 0 0 0 512 0
## E 0 0 0 0 575
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9988, 1)
## No Information Rate : 0.2823
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2823 0.2017 0.1695 0.1632 0.1832
## Detection Rate 0.2823 0.2017 0.1695 0.1632 0.1832
## Detection Prevalence 0.2823 0.2017 0.1695 0.1632 0.1832
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
Now, we generate predictions on classe based on modelFit2:
testing_classe_pred <- predict(modelFit2,testing_modified)
testing_classe_pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
We find that these predictions give us a 100% score on the Prediction Quiz.