The goal of this project (a classification problem) is to build a training model that predicts how different subjects performed a fitness exercise (the quality of execution). As source we have a large collection of activity-monitor input data, along with other variables and the corresponding outcome: the way the subject performed the exercise among five possible options (the correct one and four common mistakes). The training data can be found at Input Data and the test data at Test Data. For the test data we do not know how the subjects performed the exercise, and we need to figure it out using a conveniently trained machine learning algorithm of our choice.
The initial training data set will be split into two subsets: a proper training set (75% of the data) and a validation set (25% of the data). The model will be built using only the first subset (training). The out-of-sample error will be estimated using only the validation subset, by comparing the outcomes predicted by the model with the real ones (which are present in the data).
The random forest method has been chosen (caret package) because it incorporates the cross-validation procedure as an option and provides multiple tuning parameters for the train function.
An initial data analysis phase has been carried out to examine the distribution of outcomes, the NA values, the empty variables and other data characteristics. As a conclusion, we confirm that we have enough data for each possible classe outcome. We have also identified variables not directly related to our task. As a result, the number of possible predictors has been considerably reduced:
After this process only 52 variables out of 159 have been retained as predictors.
library(caret)
library(doParallel)
library(foreach)
# Use parallelism for speed
registerDoParallel(detectCores())
Another approach to identify covariates that could be discarded would be to use the nearZeroVar() function with the original data frame dfTraining as its argument (this data frame includes the outcome and the other 159 variables). Running this function we obtain 60 variables to eliminate. I have opted for the previously described method because it yields fewer predictors, all of which would also be kept by nearZeroVar(): applied to the already reduced set of 52 variables, nearZeroVar(df) flags nothing (it returns zero columns to remove).
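A minimal sketch of that alternative check (it assumes dfTraining and df have been created as in the chunk below):
# Candidate columns to drop according to caret's nearZeroVar()
nzv <- nearZeroVar(dfTraining)
length(nzv)              # 60 variables flagged for elimination
# Sanity check on the reduced data frame: nothing left to remove
length(nearZeroVar(df))  # 0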
Note: for performance reasons, parallelism is used in case the computer has multiple cores available.
nfileT <- "pml-training.csv"
nfileV <- "pml-testing.csv"
dfTraining <- read.csv(nfileT, header=TRUE)
dfTest <- read.csv(nfileV, header=TRUE)
# Remove all columns that contain any NA or empty values, plus the first 7 variables
df <- dfTraining[ , ! apply( dfTraining , 2 , function(x) any(is.na(x) | x=="") ) ]
df <- df[, -(1:7)]
# Count the number of cases of each outcome (variable classe)
table(df$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
For theoretical reasons (mainly model accuracy) and given the nature of the problem, a random forest algorithm has been selected.
Fitting this model is computationally very demanding for the data we have (19622 rows and 160 columns), implying long waiting times (several hours) with standard parameters. Hence, in the initial phase of analysis only a reduced random subset of the data (1000 cases) was used to evaluate the different parameters. The parameters tested were the number of trees, the number of k-folds and the number of repetitions. Different percentages for the training/validation data split were also tested.
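A sketch of that exploratory step (the seed and the object name dfSmall are illustrative assumptions, not the exact values used in the analysis):
# Draw a random subset of 1000 rows for quick parameter exploration
set.seed(321)
dfSmall <- df[sample(nrow(df), 1000), ]
# The train() call shown later can be pointed at dfSmall instead of dfT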
# Parameters
training_p <- 0.75
number_trees <- 50
number_folds <- 10
number_reps <- 10
After testing different combinations on the reduced 1000-row sample, the complete dataset has been analyzed with only the best combinations of parameters. The chosen model is finally a random forest training model with 50 trees, using 10-fold cross-validation with 10 repetitions. It provides a quick response, considering that it also yields an out-of-sample error similar to that of models with many more trees and repetitions.
# From the training set, split the data into two groups: model training and validation
set.seed(123)
inTraining <- createDataPartition(df$classe, p=training_p, list=FALSE)
dfT <- df[ inTraining,]
dfV <- df[-inTraining,]
# Train the model
train_control <- trainControl(method="repeatedcv", number=number_folds, repeats=number_reps, allowParallel=TRUE)
comp_time <- system.time(
model <- train(classe~., data=dfT, method="rf", trControl=train_control, ntree=number_trees))
The model proved to be relatively quick to fit (436.63 seconds with these parameters). Just to see whether accuracy could be improved (even at the expense of computation time), the model was also evaluated with more trees and repetitions, obtaining a very similar estimated out-of-sample error (and the same predictions for the test cases), with of course a substantial increase in computation time. The conclusion is that the chosen parameters are a good compromise, even if other sets could be used.
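For reference, such a comparison run would look like the following (a sketch only; model500 is an assumed name, and this call is roughly ten times slower than the chosen configuration):
# Same training call but with the randomForest default of 500 trees
model500 <- train(classe~., data=dfT, method="rf",
                  trControl=train_control, ntree=500)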
In the next part the basic model characteristics are summarized as computed by the algorithm. We can see that accuracy reaches 99%. It is also interesting to note that not all 52 variables are needed: a more in-depth analysis could reduce them to fewer than 30 with similar accuracy results.
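A hypothetical follow-up along those lines (the cutoff of 30 and the object names are assumptions; this refit was not run for this report):
# Keep only the top 30 predictors by importance and retrain
imp <- varImp(model)$importance
topVars <- rownames(imp)[order(imp$Overall, decreasing=TRUE)][1:30]
modelTop <- train(classe~., data=dfT[, c(topVars, "classe")],
                  method="rf", trControl=train_control, ntree=number_trees)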
We also see that the decision to use only 50 trees (10 times fewer than the default of 500) seems reasonable: above a certain number of trees the benefit of increasing it further is marginal and can lead to overfitting.
model
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
##
## Summary of sample sizes: 13247, 13246, 13247, 13246, 13245, 13246, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9915071 0.9892560 0.002414565 0.003055423
## 27 0.9915207 0.9892733 0.002100914 0.002658605
## 52 0.9858476 0.9820969 0.003302797 0.004178484
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
model$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = ..1, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.99%
## Confusion matrix:
## A B C D E class.error
## A 4174 8 1 0 2 0.002628435
## B 27 2803 17 0 1 0.015800562
## C 0 21 2533 12 1 0.013245033
## D 1 2 30 2375 4 0.015339967
## E 0 6 4 9 2687 0.007021434
plot(model)
plot(model$finalModel)
legend("topright", legend=unique(dfV$classe), col=unique(as.numeric(dfV$classe)), pch=19)
The most important variables used by the model follow:
varImp(model)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 65.07
## yaw_belt 61.52
## roll_forearm 48.64
## pitch_belt 47.35
## magnet_dumbbell_y 45.30
## magnet_dumbbell_z 44.33
## accel_dumbbell_y 24.22
## roll_dumbbell 21.25
## accel_forearm_x 18.98
## accel_belt_z 18.19
## total_accel_dumbbell 17.30
## magnet_belt_z 17.25
## magnet_forearm_z 16.84
## accel_dumbbell_z 16.00
## magnet_dumbbell_x 15.67
## magnet_belt_y 15.09
## magnet_belt_x 11.96
## gyros_belt_z 11.68
## yaw_arm 11.68
Applied back to the training set, the model classifies all cases perfectly:
table(predict(model, newdata=dfT), dfT$classe)
##
## A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
prediction <- predict(model, newdata=dfV)
table(prediction, dfV$classe)
##
## prediction A B C D E
## A 1394 5 0 0 0
## B 1 942 7 0 0
## C 0 2 845 9 0
## D 0 0 3 793 1
## E 0 0 0 2 900
In this case we can see that only a few cases are misclassified (as expected from the model's performance measures reported above). Accuracy on the validation data is very high: 0.9938825.
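That figure can be reproduced directly from the validation predictions; caret's confusionMatrix(prediction, dfV$classe) reports the same accuracy along with other statistics.
# Proportion of correctly classified validation cases
mean(prediction == dfV$classe)
## [1] 0.9938825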
res <- predict(model, newdata=dfTest)
res
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E