Practical Machine Learning Project

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project I use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants, . They were asked to perform barbell lifts correctly and incorrectly in 5 different ways all stored in dependant variable called “classe”. The main objective is to predict the manner in which they did the exercise based on any of the other variables generated in the experiment.

All the data used in this project was kindly provided by: “Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements”

###GETTIN DATA AND LOADING THE NECESSARY PACKAGES:###

library(tidyverse)
library(caret)
library(corrplot)
library(RColorBrewer)

training<-read.csv(file="pml-training.csv" )
testing<-read.csv(file="pml-testing.csv" )

EXPLORATORY DATA ANALYSIS AND CLEANING DATA

In order to check the structure of the data , a basic exploratory analysis was carried out. Please note that some of the outputs are omitted in terms of space optimization of the present report.

#data structure
str(training)
#Removing unncessary columns
training<-training[,-c(2:7)]
#outcome as factor variable
training$classe<-as.factor(training$classe)
# number of missing values per variable:
map_dbl(training,.f= function(x){sum(is.na(x))})
# number of blank spaces
training %>% map_lgl(.f= function(x){any(!is.na(x) & x=='')})
# blank space=NA
training[training==""]<-NA

The next graph represent the number of NA values per variable in the training data set. Unfortunately, the ‘x’ labels are overlapped due to the big amount of predictors. Even though, it’s possible to identify missing data (orange stripes) that will be deleted in the next steps.

training_long<-training %>% gather(key="variable", value='value',-X) %>% 
        mutate(Missing_value = is.na(value))
ggplot(data = training_long, aes(x = variable, y = X, fill = Missing_value)) +
        geom_raster() +
        scale_fill_manual(values = c("gray60", "orangered2")) +
        theme_bw() +
        labs(title = "Number of NA observations per variable") +
        theme(legend.position = "bottom")

Columns with NA values are not representative for modelling. In most of the machine learning algorithms missing values represent a problem that could promote a lost in accuracy. Thus removing these values is totally necessary.

#selecting columns from training with no NAs:
Nacol<-colnames(training)[colSums(is.na(training)) > 0]
training<- training  %>% select(-Nacol)
sum(is.na(training))

## [1] 0

PRE-PROCESSING PREDICTORS:

Predictors with near zero variance :

There’s no predictor with variance equal or approximate equal to zero than could lead into bias during bootsraping or cross-validation of these variables.

predictors<-training %>% select(-c(X,classe))
ZeroVar<-predictors %>% nearZeroVar(saveMetrics = TRUE)
print(paste0("Total number of variables with zero variance: ", sum(ZeroVar$zeroVar)))

## [1] "Total number of variables with zero variance: 0"

Finding correlation between predictors:

In the next graph we see in some cases highly correlated predictors; however, if we inspect the pairs of variables that are correlated, we can note that they correspond to measurements from the same sensor but on different directions. This makes sense in the context of the experiment thus it won´t be advisable to remove them .

training_cor<-cor(predictors)
corrplot(training_cor, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))

TRAINING A MODEL:

Model selected: Random Forest

The reason why I selected Random Forest algorithm is due that the outcome variable is indeed a classifier that is treated as a factor in response of the given predictors.Random Forest is type of bagging that improves the results by uncorrelating the decision trees generated in the process and avoiding overfitting, this algorithm has the power to handle a large data set with higher dimensionality as this case.

TrainControl selected: cross-validation: Cross-validation was used in order to get a reasonable estimation from our out of bag error. I used 5 folds,as is general suggested, to balance the trade between bias and variance in the model.

training<-training %>% select(-X) 
set.seed(123)
model1<-train(classe~.,data=training,method='rf',
              trControl=trainControl(method = 'cv',number=5))
model1$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.41%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 5578    2    0    0    0 0.0003584229
## B   10 3783    4    0    0 0.0036871214
## C    0   17 3404    1    0 0.0052600818
## D    0    0   41 3174    1 0.0130597015
## E    0    0    0    5 3602 0.0013861935

The number of random predictors selected for each tree that maximizes the accuracy is 2.

plot(model1,uniform=TRUE, main="Accuracy vs random number of predictors")

The out of bag error estimate is 0.41% which means the acurracy over the training data set is around 99.59% , this means our model is really accurate regarding the training data. By the cross-validation carried out and the machine learning algorithm used, at this point,I bet the out of sample error will be slightly smaller than 99.59%.

PREPARING TESTING DATA SET:

In order to use my model to predict over the testing data set, the same process of data preparation, as in training data set, must be carry out.

#Removing unncessary columns
testing<-testing[,-c(2:7)]

# number of missing values per variable:
map_dbl(testing,.f= function(x){sum(is.na(x))})
# number of blank spaces
testing %>% map_lgl(.f= function(x){any(!is.na(x) & x=='')})
# blank space=NA
testing[testing==""]<-NA

Again, we see the same columns, as in the training data set, that have NA values. These values will be removed.

testing_long<-testing %>% gather(key="variable", value='value',-X) %>% 
        mutate(Missing_value = is.na(value))
ggplot(data = testing_long, aes(x = variable, y = X, fill = Missing_value)) +
        geom_raster() +
        scale_fill_manual(values = c("gray60", "orangered2")) +
        theme_bw() +
        labs(title = "Number of NA observations per variable") +
        theme(legend.position = "bottom")

#selecting columns from training with no NAs:
Nacoltest<-colnames(testing)[colSums(is.na(testing)) > 0]
testing<- testing  %>% select(-Nacoltest)
sum(is.na(testing))

## [1] 0

VALIDATING RESULTS:

testing<- testing %>% select(-c(X,problem_id))
testing$classepred<-predict(model1,testing)
testing$classreal<-as.factor(c('B','A','B','A','A','E','D','B','A','A','B','C','B','A','E','E','A','B','B','B'))
confusionMatrix(testing$classreal,testing$classepred)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction A B C D E
##          A 7 0 0 0 0
##          B 1 7 0 0 0
##          C 0 0 1 0 0
##          D 0 0 0 1 0
##          E 0 0 0 0 3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.7513, 0.9987)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 3.408e-07       
##                                           
##                   Kappa : 0.9278          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8750   1.0000     1.00     1.00     1.00
## Specificity            1.0000   0.9231     1.00     1.00     1.00
## Pos Pred Value         1.0000   0.8750     1.00     1.00     1.00
## Neg Pred Value         0.9231   1.0000     1.00     1.00     1.00
## Prevalence             0.4000   0.3500     0.05     0.05     0.15
## Detection Rate         0.3500   0.3500     0.05     0.05     0.15
## Detection Prevalence   0.3500   0.4000     0.05     0.05     0.15
## Balanced Accuracy      0.9375   0.9615     1.00     1.00     1.00

Conclussion: The poposed model is able to predict the classifiers over the test data set with a 100% of accuracy,thus we can say is a very precise model.