Summary

The goal of this exercise is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants in order to assess the quality of the activities they perform. For that, we will use machine learning techniques in R, specifically the random forest algorithm. The data has been provided by Groupware.

Data Processing

The data sets consist of two CSV files. The test file contains 20 observations and 160 variables and is used to validate the model; we will call it the validation set.
The train file contains 19622 observations and 160 variables. Since it is a fairly large data set, it will be split into two subsets: one used for creating the model (70%) and the other for testing it (30%) and assessing the model's performance on new inputs.
The variable to be predicted is a factor variable called classe with 5 possible values: A, B, C, D or E, representing, as noted above, the quality of the activities each person performs.
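For completeness, here is a minimal sketch of how the two files could be loaded; the file names and the na.strings choice are assumptions on our part, not taken from the original analysis:

# File names assumed; treating "#DIV/0!" and empty strings as missing is one common choice
train <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
validation <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))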

library(caret)

# Split the train file: 70% for building the model, 30% for testing it
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)
training <- train[inTrain, ]
testing <- train[-inTrain, ]
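Note that the partition is random, so exact counts may vary slightly between runs. For a reproducible split, a seed can be fixed before calling createDataPartition (the value below is arbitrary and not from the original analysis):

set.seed(1234)  # arbitrary illustrative seed; place before createDataPartition()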

Data Set Reduction

Applying the summary function to the data sets reveals many variables with more than 97% of their values being NA or #DIV/0!. In order to reduce the complexity of the model (and decrease computing time), and potentially get a more accurate prediction, we will remove all variables with these features before creating the model.
An easy way to do this is to look at the validation set (originally the test file): every variable consisting only of NA or #DIV/0! values is imported there with class "logical" and all values set to NA. The remaining relevant columns will be kept and used to subset the training data set.
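As a quick check (an aside, not in the original write-up), the column classes of the validation set can be tabulated:

# Count validation-set columns by class; the "logical" ones are the all-NA columns
table(sapply(validation, class))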
In addition to this, there are 5 columns that do not add any value to the model; they are:

names(train[,c(1,3:6)])
## [1] "X"                    "raw_timestamp_part_1" "raw_timestamp_part_2"
## [4] "cvtd_timestamp"       "new_window"

The next code block reduces the data set by eliminating the unneeded variables:

# Columns imported as "logical" in the validation set contain only NA values
relevant_columns <- as.vector(sapply(validation, class) != "logical")
# Also drop the five uninformative columns listed above (X, timestamps, new_window)
relevant_columns[c(1, 3:6)] <- FALSE

training_red <- training[, relevant_columns]
testing_red <- testing[, relevant_columns]

n_var <- dim(training_red)[2] - 1 # We subtract the "classe" column

We now have a reduced version of the original data set.

The number of predictors in the reduced data set is 54. Now we can build the model, using the variable classe as the outcome and the remaining variables as predictors.

Results

We will use the randomForest function from the randomForest package in R.
To reduce computing time, we will set the number of trees to 200.

library(randomForest)

# Fit a random forest on the reduced training set (200 trees)
fit_rf_red <- randomForest(classe ~ ., data = training_red, importance = TRUE, ntree = 200)

# Predict on the held-out test set
sol_rf_red <- predict(fit_rf_red, newdata = testing_red)

# Accuracy: correctly predicted observations over all test observations
accuracy <- paste(round(sum(testing$classe == sol_rf_red) / dim(testing)[1] * 100, 2), "%", sep = " ")

confusionMatrix <- table(testing$classe, sol_rf_red)
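As an aside (not part of the original analysis), caret also provides a confusionMatrix function that reports overall accuracy together with per-class statistics; assuming the factor levels of the predictions and the reference match, it can be called directly:

# caret alternative: overall accuracy plus per-class sensitivity/specificity
caret::confusionMatrix(sol_rf_red, testing$classe)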

Recall that the random forest algorithm builds each tree on a bootstrap sample containing roughly two thirds of the data, and uses the remaining out-of-bag (OOB) samples to estimate the error we can expect when using the model to predict the test and/or validation data sets.

The OOB error estimate is 0.33%, as displayed in the appendix. The expected accuracy of the model is therefore 100 - 0.33 = 99.67%.
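To check that 200 trees are enough, one could plot the OOB error against the number of trees (a suggestion, not in the original analysis); the error curve typically flattens well before the last tree:

# OOB error rate as a function of the number of trees grown
plot(fit_rf_red, main = "OOB error vs. number of trees")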

Once tested against the test set, we obtain the following confusion matrix:

print(confusionMatrix)
##    sol_rf_red
##        A    B    C    D    E
##   A 1674    0    0    0    0
##   B    6 1133    0    0    0
##   C    0    3 1023    0    0
##   D    0    0    2  961    1
##   E    0    0    0    2 1080

The accuracy of the model on the test set, calculated as the number of correctly predicted observations divided by the total number of predictions, is 99.76 %, which means it is a very good prediction model.
Compared with the expected accuracy, both values are very close, so we had a very good estimate of the accuracy of the model:
Expected: 99.67%
Actual: 99.76 %
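The same figure can be recovered directly from the confusion matrix, since the correctly classified observations sit on its diagonal:

# Accuracy = diagonal (correct predictions) over all test observations
sum(diag(confusionMatrix)) / sum(confusionMatrix)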

Conclusions

We can conclude that the random forest algorithm performed very well at predicting activity quality, reaching an accuracy of 99.76 % on the test set.
The model was also run against the validation set, resulting in 100% accuracy.
Reducing the number of predictors from 159 to 54 worked very well, decreasing the computing time while still yielding a highly accurate prediction model.
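The validation predictions themselves are not shown in this write-up; a minimal sketch of how the 20 cases would be scored, assuming the column types in the validation set match those of the training set:

# Score the 20 validation cases; columns not used in training are ignored by predict()
sol_validation <- predict(fit_rf_red, newdata = validation)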

Appendix

Random forest model information (model created with the training data set):

print(fit_rf_red)
## 
## Call:
##  randomForest(formula = classe ~ ., data = training_red, importance = T,      ntree = 200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.33%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B    5 2649    4    0    0 0.0033860045
## C    0   11 2382    3    0 0.0058430718
## D    0    0   17 2233    2 0.0084369449
## E    0    0    0    3 2522 0.0011881188