Executive Summary

Question: How well do the participants perform the exercise?

Goal: use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants in the training data set to predict the manner in which the participants did the exercise, encoded in the “classe” variable, using the remaining variables as predictors. Also use the model to predict 20 different test cases.

Steps: read and clean the data, explore the outcome variable, partition the training data, fit and compare prediction models, and predict the 20 test cases.

Results: Fitting a random forest to the data gave better estimates for the prediction of the outcome than a classification tree or a linear discriminant analysis. Data processing and analysis followed the sources listed in the References section of this report.

Background

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Read more: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises

Data Processing

Read the data

# read the training data, treating empty strings as missing values
file_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train <- read.csv(file_train, na.strings = c("", "NA"))
dim(train)
## [1] 19622   160
# read the testing data with the same missing-value convention
file_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test <- read.csv(file_test, na.strings = c("", "NA"))
dim(test)
## [1]  20 160

Both sets have 160 variables. The train set has 19622 observations and the test set has 20 observations.

# Load required libraries
library(tidyverse)    # data manipulation and plotting
library(caret)        # model training and evaluation
library(rpart)        # classification trees
library(randomForest) # random forests
library(rattle)       # fancyRpartPlot for tree visualisation
library(rpart.plot)   # tree plotting support

Clean the data

# keep only the columns with less than 50% missing values in each set
training <- train %>% select(which(colMeans(is.na(.)) < 0.5))
testing <- test %>% select(which(colMeans(is.na(.)) < 0.5))

dim(training); dim(testing) #dimensions of each data set
## [1] 19622    60
## [1] 20 60

After removing the columns that are mostly missing, both the training and testing data sets have 60 variables. Training has 19622 observations and Testing has 20 observations.
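
Because the two sets were filtered independently, it is worth confirming that they kept the same columns. As a minimal sanity check, only the last column should differ between them (classe in training, problem_id in testing):

# confirm both sets kept the same columns apart from the last one
setdiff(names(training), names(testing)) # expected: "classe"
setdiff(names(testing), names(training)) # expected: "problem_id"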

str(training) # inspect the structure of the data and identify variables that can be excluded for modelling purposes

According to the documentation available for this data set, only the sensor measurements are relevant for predicting how the participants performed the exercise. Thus we exclude the first seven variables in each data set (row index, user name, timestamps, and window bookkeeping), which identify rather than measure.

xnames <- colnames(training)[1:7] # select the first 7 variables from the training set
# the feature plot confirms these variables have little predictive power for the classe outcome
featurePlot(x = training[, xnames], y = training$classe, plot = "pairs")

mytrain <- training[!names(training) %in% xnames] # remove the first seven variables from the training data
mytest <- testing[!names(testing) %in% xnames] # remove the first seven variables from the testing data

dim(mytrain)
## [1] 19622    53
dim(mytest)
## [1] 20 53

The training data is now mytrain, with 52 predictors plus the outcome classe (53 columns), and the testing data is now mytest, also with 53 columns (the outcome column is replaced by problem_id). Both data sets keep their number of observations.

Exploratory data analysis

# Understand the outcome variable distribution
Count <- table(mytrain$classe)
Frequency <- paste(round((Count) / length(mytrain$classe) * 100), "%", sep ="")
pander::pander(rbind(Count, Frequency))
             A     B     C     D     E
Count      5580  3797  3422  3216  3607
Frequency   28%   19%   17%   16%   18%
library(scales)
index <- seq_len(nrow(mytrain))
# plot the relative frequency of each class; percent labels handle the scaling to 100
ggplot(data = mytrain, aes(x = index, colour = classe)) +
    geom_density(aes(y = ..count.. / sum(..count..))) +
    scale_y_continuous(labels = scales::percent) +
    labs(title = "Classes distribution on the training data set",
         y = "Frequency", x = "data set index") +
    theme_bw()

As expected, class A has the most observations, while the other classes occur with roughly balanced frequencies.

Machine Learning

Data partition

Next, mytrain is split 70/30 into a training set and a testing set, so that out-of-sample performance can be estimated before predicting the 20 test cases.

# set the seed for reproducibility of the analysis
set.seed(12345)
inTrain <- createDataPartition(mytrain$classe, p = 0.7, 
                               list = FALSE)
toTrain <- mytrain[inTrain, ]
toTest <- mytrain[-inTrain, ] # the remaining 30% of the training data, held out for testing
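
Since createDataPartition samples within each level of classe, the class proportions should be preserved in both partitions. A quick check:

# verify that the class proportions are preserved after partitioning
round(prop.table(table(toTrain$classe)), 2)
round(prop.table(table(toTest$classe)), 2)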

Train the data with prediction algorithms

Since logistic regression only applies to two-class outcomes, we instead fit a classification tree, a linear discriminant analysis (LDA), and a random forest, each relating the factor variable classe to the remaining variables.

First we fit a classification tree with cross-validation, and then a second one with centering and scaling pre-processing.

# fit rpart model 
control <- trainControl(method = "cv", number = 5) #apply cross validation
rpart_model <- train(classe ~ ., data = toTrain, method = "rpart", trControl = control)
rpart_model
## CART 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10989, 10990, 10990, 10989, 10990 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.03753433  0.5161240  0.37284078
##   0.05987862  0.4166831  0.20949558
##   0.11585800  0.3323882  0.07355105
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03753433.
rpart_model2 <- train(classe ~ ., data = toTrain, method = "rpart", preProcess = c("center", "scale")) # pre-processing with caret's default bootstrap resampling, to compare accuracy
rpart_model2
## CART 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.03753433  0.5135957  0.36893849
##   0.05987862  0.3812312  0.15135221
##   0.11585800  0.3226947  0.05976988
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03753433.
# Plot classification tree
fancyRpartPlot(rpart_model$finalModel, palettes=c("Blues", "Greens"))

Accuracy is essentially the same (about 0.51) whether we use 5-fold cross-validation or pre-processing with bootstrap resampling.
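
To see this directly, we can compare the best resampled accuracy of the two tree models:

# best resampled accuracy for each rpart model
max(rpart_model$results$Accuracy)  # 5-fold cross-validation
max(rpart_model2$results$Accuracy) # bootstrap with centering and scaling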

# Predict the outcome using the test set
rpart_pred <- predict(rpart_model, toTest)
# Show prediction result
rpart_cm <- confusionMatrix(toTest$classe, rpart_pred)
rpart_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1494   21  128    0   31
##          B  470  380  289    0    0
##          C  467   29  530    0    0
##          D  438  184  342    0    0
##          E  141  147  277    0  517
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4963          
##                  95% CI : (0.4835, 0.5092)
##     No Information Rate : 0.5115          
##     P-Value [Acc > NIR] : 0.9902          
##                                           
##                   Kappa : 0.3425          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4963  0.49934  0.33844       NA  0.94343
## Specificity            0.9374  0.85187  0.88516   0.8362  0.89414
## Pos Pred Value         0.8925  0.33363  0.51657       NA  0.47782
## Neg Pred Value         0.6400  0.91972  0.78679       NA  0.99355
## Prevalence             0.5115  0.12931  0.26610   0.0000  0.09312
## Detection Rate         0.2539  0.06457  0.09006   0.0000  0.08785
## Detection Prevalence   0.2845  0.19354  0.17434   0.1638  0.18386
## Balanced Accuracy      0.7169  0.67561  0.61180       NA  0.91878

When the model is tested on the held-out testing part of the data, the classification tree's accuracy (about 0.50) is too low to predict the outcome classe reliably. Notice that class D is never predicted, as the prediction counts below confirm. Thus we apply another method.
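
The missing class is visible directly in the prediction counts, since column D of the confusion matrix sums to zero:

table(rpart_pred) # the tree never predicts class D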

Second, we fit a linear discriminant analysis (LDA) model.

# linear discriminent analysis
lda_model <- train(classe ~ ., data = toTrain, method = "lda", preProcess = c("center", "scale"))
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
lda_pred <- predict(lda_model, toTest) # predict on the testing part of the data set
lda_cm <- confusionMatrix(toTest$classe, lda_pred) # compute the confusion matrix and accuracy estimates
lda_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1380   36  133  116    9
##          B  191  709  142   43   54
##          C   91  104  658  144   29
##          D   48   49  125  698   44
##          E   38  193   99  100  652
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6962          
##                  95% CI : (0.6842, 0.7079)
##     No Information Rate : 0.297           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6155          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7895   0.6499   0.5687   0.6340   0.8274
## Specificity            0.9289   0.9103   0.9222   0.9444   0.9156
## Pos Pred Value         0.8244   0.6225   0.6413   0.7241   0.6026
## Neg Pred Value         0.9126   0.9195   0.8973   0.9181   0.9717
## Prevalence             0.2970   0.1854   0.1966   0.1871   0.1339
## Detection Rate         0.2345   0.1205   0.1118   0.1186   0.1108
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.8592   0.7801   0.7454   0.7892   0.8715

The sensitivity and the accuracy improved with this model (accuracy of about 0.70), but it is still worth trying a random forest.

# fit a Random forest 
set.seed(12345)
rf_model <- randomForest(classe ~ . , data = toTrain)
# test the Random forest model in the test set originated from data partition 
rf_pred <- predict(rf_model, newdata = toTest)
table(rf_pred) # outcome predictions with random forest model
## rf_pred
##    A    B    C    D    E 
## 1683 1137 1032  956 1077
table(toTest$classe) # outcome in testing part of the data set
## 
##    A    B    C    D    E 
## 1674 1139 1026  964 1082
rf_cm <- confusionMatrix(rf_pred, toTest$classe)
rf_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673   10    0    0    0
##          B    1 1126   10    0    0
##          C    0    3 1015   14    0
##          D    0    0    1  950    5
##          E    0    0    0    0 1077
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9925        
##                  95% CI : (0.99, 0.9946)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9905        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9886   0.9893   0.9855   0.9954
## Specificity            0.9976   0.9977   0.9965   0.9988   1.0000
## Pos Pred Value         0.9941   0.9903   0.9835   0.9937   1.0000
## Neg Pred Value         0.9998   0.9973   0.9977   0.9972   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1913   0.1725   0.1614   0.1830
## Detection Prevalence   0.2860   0.1932   0.1754   0.1624   0.1830
## Balanced Accuracy      0.9985   0.9931   0.9929   0.9921   0.9977
plot(rf_model)
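
As an additional check, a minimal sketch using randomForest's built-in Gini importance (the number of variables shown and the plot title are our choices) highlights which predictors drive the model:

# show the ten most important predictors by mean decrease in Gini
varImpPlot(rf_model, n.var = 10, main = "Top 10 predictors")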

The random forest fits a better model to predict the outcome (activity class), with better accuracy and sensitivity. The expected out-of-sample error is the remaining 100% - 99.25% = 0.75%.
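
Summarising the held-out accuracy of the three models, and the implied out-of-sample error of the random forest:

# compare held-out accuracy across the three models
acc <- c(rpart = unname(rpart_cm$overall["Accuracy"]),
         lda = unname(lda_cm$overall["Accuracy"]),
         rf = unname(rf_cm$overall["Accuracy"]))
round(acc, 4)
round(1 - acc["rf"], 4) # estimated out-of-sample error for the random forest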

===============================================================================

Course Project Prediction Quiz Portion

Apply the machine learning algorithm to the 20 test cases available in the test data, and submit the predictions to the Course Project Prediction Quiz for automated grading.

Prediction on 20 test cases

We can predict performance (the outcome classe) on the 20 cases available on the test data set.

predict(rf_model, mytest)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
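
For submission, the predictions can be written to one text file per test case. This is a minimal sketch under the assumption that the grader accepts plain-text files; the helper pml_write_files and the file naming are ours, not part of any grading API:

# hypothetical helper: write one text file per test-case prediction
pml_write_files <- function(x) {
    for (i in seq_along(x)) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE,
                    row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(as.character(predict(rf_model, mytest)))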

References

The data for this project came from this source: http://groupware.les.inf.puc-rio.br/har.

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Read more: http://groupware.les.inf.puc-rio.br/har#wle_paper_section
