Question: How well do they do it?
Goal: use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants in the training data set, and predict the manner in which the participants did the exercise (the "classe" variable) from the other variables. Also use the model to predict 20 different test cases.
Steps: read the data, clean it, partition it into training and testing parts, fit and compare prediction models, and apply the best model to the 20 test cases.
Results: fitting a random forest to the data gave better estimates for the prediction of the outcome than a classification tree did. Data processing and analysis followed the instructions in the References section of this report.
Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
Read more: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises
Read the data
file_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train <- read.csv(file_train, na.strings = c("", "NA"))
dim(train)
## [1] 19622 160
file_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test <- read.csv(file_test, na.strings = c("", "NA"))
dim(test)
## [1] 20 160
Both sets have 160 variables. The train set has 19622 observations and the test set has 20 observations.
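Both files should share the same columns apart from the last one, which is classe in the training file and problem_id in the test file. A quick sanity check (a sketch, assuming the files keep the same column order):
# confirm the first 159 column names match between the two sets
identical(names(train)[-160], names(test)[-160])
names(train)[160]; names(test)[160]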
# Load required libraries
library(tidyverse)
library(caret)
library(rpart)
library(randomForest)
library(rattle)
library(rpart.plot)
Clean the data
# percentage of missing data for each column in the data set
training <- train %>% select(which(colMeans(is.na(.)) < 0.5)) # get only the columns with less than 50% of missing values in the train set
testing <- test %>% select(which(colMeans(is.na(.)) < 0.5)) # get only the columns with less than 50% of missing values in the test set
dim(training); dim(testing) #dimensions of each data set
## [1] 19622 60
## [1] 20 60
After removing the mostly-missing columns, both the training and testing data sets have 60 variables. Training has 19622 observations and testing has 20 observations.
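In this data set each column is either almost entirely missing or fully observed, so the 50% filter should leave no NAs behind; a quick check (a sketch) to confirm:
# verify that no missing values remain after the column filter
anyNA(training); anyNA(testing)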
str(training) #understand the structure of the data and which variables can be excluded for the data modelling purpose
According to the documentation available for this data set, only the variables that measure how the participants performed the exercise are useful predictors. Thus we exclude the first seven variables in each data set.
xnames <- colnames(training)[1:7] #select first 7 variables from the training set
featurePlot(x = training[, xnames], y = training$classe, plot = "pairs") # the feature plot confirms these variables have little predictive power for the classe outcome
mytrain <- training[!names(training) %in% xnames] # remove first variables from training data
mytest <- testing[!names(testing) %in% xnames] # remove first variables from testing data
dim(mytrain)
## [1] 19622 53
dim(mytest)
## [1] 20 53
The training data is now mytrain, with 52 predictors plus the outcome classe (53 columns), and the testing data is now mytest, also with 53 columns. Both data sets keep their original number of observations.
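For reference, the dropped variables are bookkeeping fields rather than sensor measurements; listing them documents the choice:
# the seven excluded columns: the row index X, user_name, the two raw
# timestamps, cvtd_timestamp, new_window and num_window
xnames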
# Understand the outcome variable distribution
Count <- table(mytrain$classe)
Frequency <- paste(round((Count) / length(mytrain$classe) * 100), "%", sep ="")
pander::pander(rbind(Count, Frequency))
|           |  A   |  B   |  C   |  D   |  E   |
|-----------|------|------|------|------|------|
| Count     | 5580 | 3797 | 3422 | 3216 | 3607 |
| Frequency | 28%  | 19%  | 17%  | 16%  | 18%  |
library(scales)
index <- seq_len(nrow(mytrain))
ggplot(data = mytrain, aes(x = index, colour = classe)) +
  geom_density(aes(y = ..count.. / sum(..count..))) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Class distribution on the training data set",
       y = "Frequency", x = "data set index") +
  theme_bw()
As expected, class A has the highest count, while the other classes occur with roughly balanced frequencies.
Data partition
Next, mytrain is split 70%/30% into training and testing parts, so we can estimate the out-of-sample error of each model before predicting on the 20 test cases.
# set the seed for reproducibility of the analysis
set.seed(12345)
inTrain <- createDataPartition(mytrain$classe, p = 0.7, list = FALSE)
toTrain <- mytrain[inTrain, ]
toTest <- mytrain[-inTrain, ] # the remaining 30% of mytrain, held out for testing
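createDataPartition performs a stratified split on classe, so the class proportions should be nearly identical in both parts; a short check (a sketch) of this:
# compare class proportions in the 70% and 30% parts
round(rbind(toTrain = prop.table(table(toTrain$classe)),
            toTest  = prop.table(table(toTest$classe))), 3)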
Train prediction algorithms on the data
Since logistic regression applies to two-class outcomes, we instead fit a classification tree, a linear discriminant analysis (LDA), and a random forest, relating the factor variable classe to the remaining variables.
First we fit a classification tree with cross validation and pre-processing.
# fit rpart model
control <- trainControl(method = "cv", number = 5) #apply cross validation
rpart_model <- train(classe ~ ., data = toTrain, method = "rpart", trControl = control)
rpart_model
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10989, 10990, 10990, 10989, 10990
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03753433 0.5161240 0.37284078
## 0.05987862 0.4166831 0.20949558
## 0.11585800 0.3323882 0.07355105
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03753433.
rpart_model2 <- train(classe ~ ., data = toTrain, method = "rpart", preProcess = c("center", "scale")) #apply preprocessing to analyse differences in the accuracy of the model
rpart_model2
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03753433 0.5135957 0.36893849
## 0.05987862 0.3812312 0.15135221
## 0.11585800 0.3226947 0.05976988
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03753433.
# Plot classification tree
fancyRpartPlot(rpart_model$finalModel, palettes=c("Blues", "Greens"))
Accuracy is essentially the same with and without the centring and scaling pre-processing (about 0.51 in both cases), so pre-processing does not help the tree here.
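To compare the two fits directly, we can pull the best resampled accuracy out of each caret object (a sketch using the results table that train() returns):
# best resampled accuracy across the tuning grid of each rpart fit
max(rpart_model$results$Accuracy)   # 5-fold CV, no pre-processing
max(rpart_model2$results$Accuracy)  # bootstrap, centred and scaled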
# Predict the outcome using the test set
rpart_pred <- predict(rpart_model, toTest)
# Show prediction result
rpart_cm <- confusionMatrix(toTest$classe, rpart_pred)
rpart_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1494 21 128 0 31
## B 470 380 289 0 0
## C 467 29 530 0 0
## D 438 184 342 0 0
## E 141 147 277 0 517
##
## Overall Statistics
##
## Accuracy : 0.4963
## 95% CI : (0.4835, 0.5092)
## No Information Rate : 0.5115
## P-Value [Acc > NIR] : 0.9902
##
## Kappa : 0.3425
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4963 0.49934 0.33844 NA 0.94343
## Specificity 0.9374 0.85187 0.88516 0.8362 0.89414
## Pos Pred Value 0.8925 0.33363 0.51657 NA 0.47782
## Neg Pred Value 0.6400 0.91972 0.78679 NA 0.99355
## Prevalence 0.5115 0.12931 0.26610 0.0000 0.09312
## Detection Rate 0.2539 0.06457 0.09006 0.0000 0.08785
## Detection Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Balanced Accuracy 0.7169 0.67561 0.61180 NA 0.91878
When tested on the held-out part of the data, the classification tree's accuracy (about 0.50) is too low to predict the outcome classe reliably. Notice that class D is never predicted. Thus we apply another method.
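The held-out accuracy quoted above can be read straight off the confusion matrix object (a sketch):
# overall accuracy of the tree on the held-out 30%
rpart_cm$overall["Accuracy"]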
Secondly, we fit a linear discriminant analysis.
# linear discriminant analysis
lda_model <- train(classe ~ ., data = toTrain, method = "lda", preProcess = c("center", "scale"))
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lda_pred <- predict(lda_model, toTest) #predict in the testing part of the data set
lda_cm <- confusionMatrix(toTest$classe, lda_pred) #get the estimates for the analysis
lda_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1380 36 133 116 9
## B 191 709 142 43 54
## C 91 104 658 144 29
## D 48 49 125 698 44
## E 38 193 99 100 652
##
## Overall Statistics
##
## Accuracy : 0.6962
## 95% CI : (0.6842, 0.7079)
## No Information Rate : 0.297
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6155
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7895 0.6499 0.5687 0.6340 0.8274
## Specificity 0.9289 0.9103 0.9222 0.9444 0.9156
## Pos Pred Value 0.8244 0.6225 0.6413 0.7241 0.6026
## Neg Pred Value 0.9126 0.9195 0.8973 0.9181 0.9717
## Prevalence 0.2970 0.1854 0.1966 0.1871 0.1339
## Detection Rate 0.2345 0.1205 0.1118 0.1186 0.1108
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.8592 0.7801 0.7454 0.7892 0.8715
The sensitivity and the accuracy improved with this model, but it is still worth trying a random forest.
# fit a Random forest
set.seed(12345)
rf_model <- randomForest(classe ~ . , data = toTrain)
# test the Random forest model in the test set originated from data partition
rf_pred <- predict(rf_model, newdata = toTest)
table(rf_pred) # outcome predictions with random forest model
## rf_pred
## A B C D E
## 1683 1137 1032 956 1077
table(toTest$classe) # outcome in testing part of the data set
##
## A B C D E
## 1674 1139 1026 964 1082
rf_cm <- confusionMatrix(rf_pred, toTest$classe)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 10 0 0 0
## B 1 1126 10 0 0
## C 0 3 1015 14 0
## D 0 0 1 950 5
## E 0 0 0 0 1077
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.99, 0.9946)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9886 0.9893 0.9855 0.9954
## Specificity 0.9976 0.9977 0.9965 0.9988 1.0000
## Pos Pred Value 0.9941 0.9903 0.9835 0.9937 1.0000
## Neg Pred Value 0.9998 0.9973 0.9977 0.9972 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1913 0.1725 0.1614 0.1830
## Detection Prevalence 0.2860 0.1932 0.1754 0.1624 0.1830
## Balanced Accuracy 0.9985 0.9931 0.9929 0.9921 0.9977
plot(rf_model)
Random forests fit a better model for predicting the outcome (activity), with better accuracy and sensitivity. The expected out-of-sample error is the remaining 1 - 0.9925 = 0.0075, i.e. about 0.75%.
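The same error estimate can be computed directly from the held-out predictions (a sketch; it should agree with 1 minus the confusion-matrix accuracy):
# misclassification rate on the 30% held-out set
oos_error <- mean(rf_pred != toTest$classe)
round(oos_error * 100, 2) # expected to be well under 1%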
Apply the machine learning algorithm to the 20 test cases available in the test data and submit the predictions to the Course Project Prediction Quiz for automated grading
We can predict performance (the outcome classe) on the 20 cases available on the test data set.
predict(rf_model, mytest)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
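To keep the quiz answers together with their case numbers, the predictions can be paired with the problem_id column that pml-testing.csv carries in place of classe (a sketch):
# pair each predicted classe with its test-case id
answers <- data.frame(problem_id = mytest$problem_id,
                      prediction = predict(rf_model, mytest))
answers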
The data for this project came from this source: http://groupware.les.inf.puc-rio.br/har.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har#wle_paper_section
# count the words by adding a new addin to RStudio
devtools::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)