Victor D. Saldaña C.

PhD(c) in Geoinformatics Engineering
Technical University of Madrid (Spain)
November, 2016

Summary.

This is the report for the Practical Machine Learning peer assignment. It is divided into six sections. The goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to build a classifier (model) that predicts the manner in which they did the exercise (the “classe” variable). The report starts with a basic summary, followed by a data cleaning and exploratory analysis section. Finally, several classifiers (models) are presented, and the prediction is made with the Random Forest, which has the highest accuracy after using a cross-validation approach to select the model.

1. Load packages.

Let’s load the packages needed. If you haven’t installed them yet, do that first.

library("knitr")
library("datasets")
library("lattice")
library("ggplot2")
library("caret")
library("reshape2")
library("nnet")
library("rattle")
library("e1071")
library("rpart")
library("RGtk2")
library("rpart.plot")
library("rpart")
library("randomForest")

2. Getting the data.

The data used in this assignment come from the “Human Activity Recognition” project of the Groupware@LES research group at the Pontifical Catholic University of Rio de Janeiro (more information at http://groupware.les.inf.puc-rio.br/har). The data are licensed under a Creative Commons license (CC BY-SA).

There are two .csv files with 160 variables each. One is the training data set (19,622 observations) and the other is the test data set (20 observations). The observations correspond to six young healthy men aged 20-28 years. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different ways, recorded in the “classe” variable. Class “A” corresponds to the correct execution of the exercise, and classes B, C, D and E correspond to common mistakes. Specifically, the mistakes are: throwing the elbows to the front (class B); lifting the dumbbell only halfway (class C); lowering the dumbbell only halfway (class D); and throwing the hips to the front (class E).
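
The two files are read from a local working directory below. If they are not yet on disk, they could be downloaded first; the following sketch is optional, and the URLs are an assumption taken from the course assignment page rather than part of the original analysis.

#Optional download step (sketch only). The URLs below are assumed from the assignment page.
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "pml-training.csv")
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "pml-testing.csv")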

setwd("C:/Victor/Estudios/6_Data_Science/8_Practical_Machine_Learning/Peer_Assignment_1")
#"NA", "#DIV/0!" and blank field will be treated as Not Available / Missing values.
Train_DS<-read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!"),header=TRUE)
Test_DS<-read.csv("pml-testing.csv", na.string=c("NA", "#DIV/0!"),header=TRUE)

3. Cleaning the data.

Now it is necessary to clean the data. As said before, there are 160 variables, and the model I want to fit to predict the response variable “classe” has to be based on the quantitative measurements from the four sensors. Therefore, the first task is to remove those variables (predictors) that are not needed to fit the model. In this regard, the first seven variables (columns) of the data frames are related to time stamps and participants, so they are dropped. Consequently, the new data frames have only 153 variables: 38 for each sensor plus the response variable “classe”. The first 152 variables are potential predictors for “classe”.

Train_DS_2<-Train_DS[,-c(1:7)]
Test_DS_2<-Test_DS[,-c(1:7)]

Now we are going to evaluate which variables (columns) have complete data, i.e., no Not Available / Missing values at all. Those that are incomplete will not be considered further. Let’s calculate the proportion of missing values in each of the 153 variables.

Proportion<-vector()
for (i in colnames(Train_DS_2)) {
  Proportion[i]=sum(is.na(Train_DS_2[,i]))/ length(Train_DS_2[,i])
}
Proportion<-round(Proportion*100,1)
table(Proportion)
## Proportion
##    0 97.9   98 98.1 98.3 98.4  100 
##   53   69   12    2    4    7    6

As seen in the table, there are clearly two groups of columns. The first one includes 100 variables with at least 97.9% missing values. The second one includes 53 variables with 0% missing values, i.e., they are complete. To build the predictive model, only these 53 complete columns (52 predictors plus the response “classe”) are considered. Therefore, I will now subset the training data set to keep only these 53 variables.

#Keep only the columns without missing values (the 52 complete predictors plus "classe").
Train_DS_3<-Train_DS_2[,colSums(is.na(Train_DS_2))==0]

4. Exploring the data.

Now there is a data set (Train_DS_3) with 53 columns: 52 candidate predictors and one response variable. Let’s start by exploring the response variable “classe” and seeing which class is the most frequent. Then, let’s analyse the correlation among predictors.

4.1. Response variable frequency.

table(Train_DS_3$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
Bar_response<-barplot(table(Train_DS_3$classe),main="Class frequency",cex.names=.75,
                      cex.axis=.75,col="blue",ylim=c(0,7000))
text(x = Bar_response, y = table(Train_DS_3$classe), label = table(Train_DS_3$classe), pos = 3, cex = .75, col = "black")

As seen in the bar plot, class “A” is the most frequent. The rest of the classes have similar frequencies.

4.2. Correlation among predictors.

Now, let’s calculate the correlations between pairs of predictors. When two predictors are highly correlated, one can be linearly predicted from the other. In regression models this is called “multicollinearity”. This phenomenon may invalidate a regression model by skewing the coefficients, and it also reduces the precision of the coefficient estimates. The main way to avoid multicollinearity is to drop the predictors that cause it, using methods such as setting a correlation threshold between pairs of predictors or computing Variance Inflation Factors (VIF). Multicollinearity is less of a concern when fitting a nonlinear model such as a Decision Tree or a Random Forest, but it is always better not to use redundant predictors; for instance, fewer predictors save a lot of computational time when fitting a Random Forest.
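
As a complementary and purely illustrative check, the VIF approach mentioned above could be sketched as follows. This block assumes the car package is installed (it is not among the packages loaded in section 1) and is not required for the rest of the analysis; since VIF depends only on the predictors, any numeric response works in the auxiliary linear model.

#Optional VIF sketch (assumes the 'car' package; not part of the main analysis).
#The response of the auxiliary model is irrelevant for VIF, so the "classe" factor is coded as integers as a placeholder.
#Note: car::vif stops with an error if any predictors are perfectly collinear.
aux_model<-lm(as.numeric(classe) ~ ., data=Train_DS_3)
head(sort(car::vif(aux_model), decreasing=TRUE), 10)  #the ten largest VIF values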

par(mfrow=c(2,2))

#Correlation matrix.
cor_matrix<-cor(Train_DS_3[,-53])

#Let's melt cor_matrix to get a data frame where each row holds the correlation of one pair of predictors (Predictors1 & Predictors2), instead of a matrix.
cor_matrix_melt<-melt(cor_matrix)
colnames(cor_matrix_melt)<-c("Predictors1","Predictors2","Correlation")

#Now, let's plot the correlations between pairs of predictors to get an overall picture.
qplot(x=Predictors1, y=Predictors2, data=cor_matrix_melt)+geom_tile(aes(fill=Correlation))+
  theme(axis.text.x = element_text(face="bold",angle=90,colour="black",hjust=1),
        axis.text.y = element_text(face="bold",angle=0,colour="black",vjust=1))+
  labs(x="Predictors", y="Predictors")+
  scale_fill_gradient2(low = "white",mid="gray",high = "black")+
  ggtitle("Correlation of predictors")

#Histogram of the correlations.
hist(cor_matrix_melt$Correlation,main="Histogram of correlations",xlab="Correlation",ylim = c(0,800),breaks=15,right=TRUE,col="blue",labels=TRUE)
qqnorm(cor_matrix_melt$Correlation,ylab="Data Quantiles")
qqline(cor_matrix_melt$Correlation)

#Contingency table with correlation counts for four intervals. 
Cont_table_cor<-table(findInterval(round(cor_matrix_melt$Correlation,3),seq(-1,1,by=0.5),left.open=TRUE))

#Bar plot of the contingency table.
Bar_Cont_table_cor<-barplot(Cont_table_cor,col="green",names.arg=c("(-1,-0.5]","(-0.5,0]","(0,0.5]","(0.5,1]"), main="Correlation counts by interval",xlab="Intervals",ylab="Counts",ylim=c(0,1500))

#Let's add text of the counts at top of bars.
text(x = Bar_Cont_table_cor, y = Cont_table_cor, label = Cont_table_cor, pos = 3, cex = 1, col = "black")

As seen in the figures, the correlations are approximately normally distributed. Apart from the correlation of each predictor with itself, which is 1, there are some values that may be considered high. As a rule of thumb, an absolute correlation of 0.7 or higher can be considered high, so the predictors responsible for those values are now dropped.

#Let's get the indices of the columns with high correlation. 
pre_dropped<-findCorrelation(cor_matrix, cutoff = .70, names = FALSE)

#Let's get the final data set to train the models. 
Train_DS_4<-Train_DS_3[,-pre_dropped]

Finally, there is a training data set with only 30 predictors (plus the response) that are not highly correlated. With this data set I am going to fit some predictive models to predict the “classe” variable of the observations in the original test data set.
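
As a quick sanity check (output omitted here), the dimensions of the reduced data set can be inspected; 19,622 rows and 31 columns (the 30 remaining predictors plus “classe”) are expected.

#Quick check of the reduced training set: 19622 rows and 31 columns (30 predictors plus "classe") are expected.
dim(Train_DS_4)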

5. Model Selection and fitting.

In this section three classifiers are fitted to predict the “classe” response in the test data set. The first one is a Multinomial Logistic Regression, the second one a Decision Tree and the third one a Random Forest. The performance of each one is then evaluated. There are different approaches to estimate the performance of competing models, such as a single data split, the bootstrap or cross validation (k-fold, repeated k-fold, leave one out, etc.). In this case, I am going to use a data split for the first classifier and k-fold cross validation to choose between the Decision Tree and the Random Forest. The classifier with the highest accuracy (lowest estimated misclassification error) is used to predict the “classe” variable of the observations in the original test data set.
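
For reference, here is a minimal sketch of how each of these resampling approaches would be specified with caret's trainControl function. These objects are purely illustrative and are not used below; the 5-fold control that is actually used is created in section 5.2.1.

#Illustrative only: resampling schemes available through caret's trainControl.
ctrl_boot<-trainControl(method="boot", number=25)                      #bootstrap
ctrl_kfold<-trainControl(method="cv", number=5)                        #k-fold cross validation
ctrl_repkfold<-trainControl(method="repeatedcv", number=5, repeats=3)  #repeated k-fold
ctrl_loocv<-trainControl(method="LOOCV")                               #leave one out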

5.1. Multinomial Logistic Regression classifier.

The Multinomial Logistic Regression classifier is an extension of the binary Logistic Regression classifier to the case of a categorical dependent variable with more than two levels.

#Splitting the training data set (70% training and 30% validation).
split <- createDataPartition(Train_DS_4$classe, p = 0.7, list = FALSE)
Train_DS_4_TR <- Train_DS_4[split,]
Train_DS_4_VA <- Train_DS_4[-split,]

#Let's set class "A" as the reference level.
Train_DS_4_TR$classe<-relevel(Train_DS_4_TR$classe,ref="A")

#Let's fit the classifier.
classifier_1<-multinom(classe ~ .,data=Train_DS_4_TR)
## # weights:  160 (124 variable)
## initial  value 22108.848603 
## iter  10 value 18991.531643
## iter  20 value 18161.633068
## iter  30 value 17608.428353
## iter  40 value 17213.772766
## iter  50 value 17076.680283
## iter  60 value 17040.011410
## iter  70 value 17025.293808
## iter  80 value 16939.595968
## iter  90 value 16798.299602
## iter 100 value 16268.114452
## final  value 16268.114452 
## stopped after 100 iterations
#Let's use the classifier to predict the classes of the validation data set.
pred_classifier_1<-predict(classifier_1,Train_DS_4_VA)

#Let's get a contingency table (rows are predicted classes, columns are observed classes). 
table_classifier_1<-table(pred_classifier_1,Train_DS_4_VA$classe)
table_classifier_1
##                  
## pred_classifier_1    A    B    C    D    E
##                 A 1126  259  195  111  131
##                 B  181  423   54   46  141
##                 C  130  111  483   94  110
##                 D  148  148  149  599  209
##                 E   89  198  145  114  491
#Let's calculate the accuracy (proportion of correctly classified observations). 
accuracy_classifier_1<-sum(diag(table_classifier_1))/sum(table_classifier_1)
paste("Accuracy is =",round(accuracy_classifier_1,4)*100,"%")
## [1] "Accuracy is = 53.05 %"

5.2. Decision Tree and Random Forest classifiers.

The Decision Tree is one of the most common non-parametric methods used in Machine Learning for classification and regression. The main idea is to split the observations of the training data set into branches (of the tree) that are as pure as possible (i.e., dominated by a single class) based on the features. The Random Forest, on the other hand, is a very powerful method for classification, regression and other tasks that consists of building a multitude of decision trees (sampling observations and features) and combining their results through a voting system.

5.2.1. Cross Validation.

In this case there are two models (Decision Tree and Random Forest) and I want to evaluate which one performs better based on accuracy (lowest misclassification). Therefore, I need an approach to assess the performance of these two classifiers. Any model is evaluated by measuring how well it predicts, which requires a training set and a test set with known values of the response variable: the classifier is fitted on the first and evaluated on the second. In this assignment there are only two data sets, the training and the test set. However, the test set contains the observations to be predicted and has no values of the response variable “classe”, so it cannot be used to evaluate the classifiers. The only option is to use the training data set. A first choice might be to split the training set randomly into two new sub-sets, one for training and one for validation. If this is done only once, the training subset may, by chance, lack information needed to fit the model, or the validation subset may be unrepresentative. A solution is cross validation: the training data set is repeatedly split into training and validation subsets to fit the model and evaluate it.

There are different cross-validation approaches. Here I will use k-fold cross validation. This consists of splitting the original training set into k equal-sized folds (sub-samples) and fitting the model on k-1 of them. The fitted model is then tested on the remaining fold, which was not used for fitting. This is repeated k times, each time using a different fold as the validation set, so every observation of the original training set is used k-1 times for training and once for validation. The errors are then averaged, and the model with the lowest average error (highest average accuracy) is used to predict.
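
To make the procedure concrete, the following is a minimal manual sketch of 5-fold cross validation using createFolds and a single rpart tree. It is only illustrative, since caret's train function (used below through trainControl) performs the equivalent resampling internally.

#Illustrative only: a manual 5-fold cross validation loop (caret::train does this internally).
set.seed(555)
manual_folds<-createFolds(Train_DS_4$classe, k=5, list=TRUE, returnTrain=FALSE)
fold_accuracy<-sapply(manual_folds, function(idx){
  fit<-rpart(classe ~ ., data=Train_DS_4[-idx,])                #fit on the other four folds
  pred<-predict(fit, newdata=Train_DS_4[idx,], type="class")    #predict the held-out fold
  mean(pred==Train_DS_4[idx,"classe"])                          #accuracy on the held-out fold
})
mean(fold_accuracy)                                             #average cross-validated accuracy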

#Setting the seed for reproducibility. 
set.seed(555)

#Setting up k-fold cross validation. In this case I am going to set k=5. 
folds<-trainControl(method="cv", number=5)

5.2.2. Decision Tree classifier.

#Fitting the Decision Tree classifier.
classifier_2<-train(classe ~ .,trControl=folds,method="rpart",data=Train_DS_4)

#Let's print details. 
print(classifier_2)
## CART 
## 
## 19622 samples
##    30 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15697, 15698, 15698, 15698, 15697 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.02047429  0.5262448  0.3935517
##   0.02079476  0.5231357  0.3897276
##   0.04226606  0.4375170  0.2530959
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.02047429.
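
Since the rattle and rpart.plot packages are already loaded, the tree finally selected by caret can optionally be visualized (a sketch; the plot is not reproduced in this report).

#Optional visualization of the selected Decision Tree (plot not shown here).
fancyRpartPlot(classifier_2$finalModel, main="Decision Tree classifier")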

5.2.3 Random Forest classifier.

#Let's fit the Random Forest classifier. The number of trees has been set to 20 to limit computational time. 
classifier_3<-train(classe ~ .,trControl=folds,method="rf",data=Train_DS_4,ntree=20)

#Let's print details.
print(classifier_3)
## Random Forest 
## 
## 19622 samples
##    30 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15697, 15696, 15698, 15699, 15698 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9845581  0.9804626
##   16    0.9835900  0.9792406
##   30    0.9754863  0.9689862
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
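
Before classifying the test observations, an optional look at which predictors contribute most to the Random Forest can be obtained with caret's varImp (output omitted here).

#Optional: variable importance of the Random Forest classifier (output not shown).
rf_importance<-varImp(classifier_3)
plot(rf_importance, top=10)   #the ten most influential predictors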

6. Classification.

As seen in the details above, the Random Forest clearly has the highest cross-validated accuracy (about 98.5%, versus roughly 53% for both the Multinomial Logistic Regression and the Decision Tree). Therefore, this classifier is going to be used.

#Let's keep only the 30 predictors of the final training data set.  
col_names_Train_DS_4<-colnames(Train_DS_4)[1:30]
Test_DS_3<-subset(Test_DS_2, select=col_names_Train_DS_4)

#Let's use the Random Forest classifier.
prediction<-predict(classifier_3,newdata=Test_DS_3)
prediction
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Given the cross-validated accuracy of about 98.5%, with this classifier I expect a low out-of-sample error (on the order of 1.5%, well below 10%).