Abstract

The goal of this project (a classification problem) is to build a training model that predicts how different subjects performed a fitness exercise (the quality of execution). The source is a large collection of activity-monitor readings, along with other variables and the corresponding outcome: the way the subject performed the exercise, among five possible options (the correct one and four common mistakes). The training data can be found at Input Data and the test data at Test Data. For the test data we do not know how the subjects performed the exercise, and we need to predict it using a suitably trained machine learning algorithm of our choice.

Methodology

Data split

The initial training data set will be split into two subsets: a proper training set (75% of the data) and a validation set (25% of the data). The model will be built using only the training subset. The out of sample error will be estimated using only the validation subset, by comparing the outcomes predicted by the model with the real ones (which are present in the data).

Model Selection

The random forest method (caret package) has been chosen because it incorporates cross-validation as an option and provides multiple tuning parameters for the train function.

Exploratory Data Analysis

An initial data analysis phase was carried out to examine the distribution of outcomes, the NA values, the empty variables and other data characteristics. As a conclusion, we confirm that we have enough data for each possible classe outcome. We also identified variables not directly related to our task. As a result, the number of possible predictors has been considerably reduced:

  • all the variables with some rows containing NA or empty values have been removed,
  • the initial seven variables have been removed, as they are not related to the sensors (note that the underlying idea is to predict using data provided by or derived from the sensors).

After this process only 52 variables out of 159 have been retained as predictors.

library(caret)
library(doParallel)
library(foreach)

# Use parallelism for speed
registerDoParallel(detectCores()) 

Another approach to identify covariates that could be discarded would be to use the nearZeroVar() function with the original data frame dfTraining as argument (this data frame includes the outcome and the other 159 variables). Running this function yields 60 variables to eliminate. I opted for the previously described method because it leaves fewer predictors, all of which are also retained by nearZeroVar(): applied to the already reduced set of 52 variables, the function flags nothing.
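As a minimal sketch of that check (assuming dfTraining and the reduced df as built in the next section):

# nearZeroVar() returns the indices of near-zero-variance columns
nearZeroVar(dfTraining)  # 60 columns flagged on the original data frame
nearZeroVar(df)          # integer(0): nothing flagged on the reduced 52-variable set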

Note. For performance reasons, parallelism is used when the computer has multiple cores available.
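When the run finishes, the parallel backend can be released; doParallel provides stopImplicitCluster() for workers it created implicitly (shown here as an optional cleanup step, not part of the original analysis):

stopImplicitCluster()  # release workers registered by registerDoParallel()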

Reading Training and Testing Files

nfileT <- "pml-training.csv"
nfileV <- "pml-testing.csv"
dfTraining  <- read.csv(nfileT, header=TRUE)
dfTest      <- read.csv(nfileV, header=TRUE)

# Remove all columns containing any NA or empty values, and the first 7 variables
df    <- dfTraining[ , ! apply( dfTraining , 2 , function(x) any(is.na(x) | x=="") ) ]
df    <- df[ , -(1:7)]
# Count the number of cases of each outcome (variable classe)
table(df$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

Fitting the model

Model selection: random forest and cross-validation.

For theoretical reasons (mainly model accuracy) and given the nature of the problem, a random forest algorithm has been selected.

Tuning the model

Fitting this model is computationally very demanding for the data we have (19622 rows and 160 columns), meaning long waiting times (several hours) with standard parameters. Hence, in the initial phase of the analysis, only a small random subset of the data (1000 cases) was used to evaluate the different parameters, as sketched below. The parameters tested were the number of trees, the number of k-folds and the number of repetitions. Different percentages for the training/validation split were also tested.
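A minimal sketch of that subsampling step (dfSmall is a hypothetical name; the exploration simply reran the training code below on it):

set.seed(42)                             # any fixed seed, for reproducibility
dfSmall <- df[sample(nrow(df), 1000), ]  # 1000-row sample for parameter exploration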

# Parameters
training_p   <- 0.75
number_trees <- 50
number_folds <- 10
number_reps  <- 10

After testing different combinations on the 1000-row sample, the complete dataset was analyzed with only the best combinations of parameters. The chosen model is a random forest with 50 trees, using cross-validation with 10 k-folds and 10 repetitions. It trains quickly while providing an estimated out of sample error similar to that of models with many more trees and repetitions.

# From the training set, split data in two groups: model training and validation
set.seed(123)
inTraining <- createDataPartition(df$classe, p=training_p, list=FALSE)
dfT  <- df[ inTraining,]
dfV  <- df[-inTraining,]

# Train the model
train_control <- trainControl(method="repeatedcv", number=number_folds, repeats=number_reps, allowParallel=TRUE)
comp_time <- system.time(
        model <- train(classe~., data=dfT, method="rf", trControl=train_control, ntree=number_trees))
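The training time quoted in the next section can be read from the comp_time object returned by system.time():

comp_time["elapsed"]   # wall-clock seconds spent training the model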

Model Analysis

The model proved to be relatively quick to train (457.61 seconds with these parameters). Just to see whether accuracy could be improved (even at the expense of computation time), the model was also evaluated with more trees and repetitions, obtaining a very similar estimated out of sample error (and the same predictions for the test cases) at a substantial increase in computation time. The conclusion is that the chosen parameters are a good compromise, even if other sets could be used.

In the next part the basic model characteristics are summarized as computed by the algorithm. We can see that accuracy reaches 99%. It is also interesting to see that not all 52 variables are needed: a more in-depth analysis could reduce them to fewer than 30 with similar accuracy, as sketched below.
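As a hedged sketch of how that reduction could be explored (not run for this report; imp, topVars and modelReduced are hypothetical names), one could keep only the most important variables and refit:

# Rank predictors by caret's importance measure and keep the top 30
imp     <- varImp(model)$importance
topVars <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:30]
# Refit on the reduced predictor set and compare accuracy
modelReduced <- train(classe ~ ., data = dfT[, c(topVars, "classe")],
                      method = "rf", trControl = train_control,
                      ntree = number_trees)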

We also see that the decision to use only 50 trees (10 times fewer than the default of 500) is reasonable: above a certain number of trees the benefit of adding more is marginal and can lead to overfitting.

model
## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## 
## Summary of sample sizes: 13247, 13246, 13247, 13246, 13245, 13246, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9915818  0.9893503  0.002440657  0.003088390
##   27    0.9914868  0.9892308  0.002349827  0.002972927
##   52    0.9861598  0.9824920  0.003425812  0.004334092
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = ..1, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.2%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4170    7    2    5    1 0.003584229
## B   33 2801   13    0    1 0.016502809
## C    4   26 2522   15    0 0.017530191
## D    2    0   48 2359    3 0.021973466
## E    1    3    3    9 2690 0.005912786
# Cross-validated accuracy against the tuning parameter mtry
plot(model)

# Error rates of the final model as the number of trees grows (OOB and per class)
plot(model$finalModel)
legend("topright", legend=unique(dfV$classe), col=unique(as.numeric(dfV$classe)), pch=19)

The most important variables used by the model follow:

varImp(model)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                   Overall
## roll_belt          100.00
## yaw_belt            73.30
## magnet_dumbbell_z   63.66
## magnet_dumbbell_y   59.96
## pitch_forearm       51.43
## roll_forearm        51.25
## magnet_dumbbell_x   50.88
## pitch_belt          48.05
## magnet_belt_z       43.84
## magnet_belt_y       41.36
## accel_belt_z        41.31
## roll_dumbbell       38.87
## accel_forearm_x     38.30
## accel_dumbbell_z    38.06
## accel_dumbbell_y    37.44
## roll_arm            35.12
## accel_dumbbell_x    30.17
## gyros_belt_z        28.60
## magnet_arm_x        27.21
## magnet_arm_y        25.23

Error Estimate and results

Contingency Table for In Sample Error

All training cases are perfectly classified. This is expected for a random forest evaluated on its own training data and says little about generalization; the validation set below provides the real error estimate.

table(predict(model, newdata=dfT), dfT$classe)
##    
##        A    B    C    D    E
##   A 4185    0    0    0    0
##   B    0 2848    0    0    0
##   C    0    0 2567    0    0
##   D    0    0    0 2412    0
##   E    0    0    0    0 2706

Contingency Table for Out of Sample Error. Accuracy.

prediction <- predict(model, newdata=dfV)
table(prediction, dfV$classe)
##           
## prediction    A    B    C    D    E
##          A 1393    1    0    1    2
##          B    2  945   11    0    0
##          C    0    3  839   21    1
##          D    0    0    5  780    2
##          E    0    0    0    2  896

Only a few cases are misclassified, as expected from the resampling measures reported above. Accuracy on the validation data is very high: 0.9896003, which corresponds to an estimated out of sample error of about 1%.
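For reference, a minimal way to compute that figure from the objects above:

accuracy <- mean(prediction == dfV$classe)  # proportion of correct predictions
accuracy        # 0.9896003
1 - accuracy    # estimated out of sample error (~1%)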

Prediction for the testing data set

res <- predict(model, newdata=dfTest)
res
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E