The main goal of this practice is to apply Partial Least Squares (PLS) and compare the results with the Multivariate Regression, Principal Component Regression and Inter-Batteries Analysis obtained in the previous Multivariate Modeling and Inter-Batteries assignments.
For this purpose we will again use the zip dataset, which contains normalized handwritten digits automatically scanned from envelopes by the U.S. Postal Service, stored as 16 x 16 grayscale images (values from -1 to 1). Each line consists of the digit id (0-9) followed by the 256 grayscale values. We have a training set of 7291 digits and a test set of 2007 digits. In order to compare the results with the previous assignments, we will again use only \(5\)% of the training dataset, the objective being to keep the number of observations (n) closer to the number of variables (p).
We will use the ZIP datasets provided: zip_train.dat and zip_test.dat. From the training dataset, \(5\)% of the observations will be randomly sampled (without replacement) as our training data. For the testing data, we will use the complete test file.
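For completeness, a minimal sketch of how this split could be obtained is shown below; the file names come from the assignment, while the object names train_full and train_split and the seed are our own choices, introduced so the code further down can be read in context.

# Read the provided files (first column = digit id, remaining 256 columns = pixel values)
train_full <- read.table("zip_train.dat")
test <- read.table("zip_test.dat")
# Randomly sample 5% of the training observations without replacement
set.seed(1234)  # arbitrary seed, used here only for reproducibility
train_split <- train_full[sample(nrow(train_full), size = round(0.05 * nrow(train_full))), ]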
The predictors will be centered; this means the testing data must also be centered using the training means.
# Centering Dataset
training = scale(train_split[,-1],scale = F)
# Centering Testing set (according to train set)
testing = scale(test[,-1],center = colMeans(train_split[,-1]),scale = F)
# Generating Labels
training_y = as.matrix(EncodeLabels(train_split[,1]))
testing_y = as.matrix(EncodeLabels(test[,1]))
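The helper EncodeLabels is not listed in this section; below is a minimal sketch of what it presumably does, assuming it one-hot encodes the digit labels into a 10-column indicator matrix (one column per digit 0-9), which is what the multivariate responses used later suggest.

# Presumed definition of the label-encoding helper:
# converts a vector of digits (0-9) into a 0/1 indicator matrix, one column per digit
EncodeLabels <- function(labels) {
  out <- sapply(0:9, function(d) as.numeric(labels == d))
  colnames(out) <- 0:9
  out
}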
We will compute PLS with cross-validation. LOOCV would be good practice, but to avoid a long computation time we set the validation argument of plsr (from the pls package) to “CV”.
# Partial Least Squares
library(pls)
pls2 <- plsr(training_y ~ training, validation = "CV")
# Cross-validated R-squared for each response (rows) and number of components (columns)
r2.cv <- R2(pls2)$val[1,,]
# Average R-squared over the 10 responses for each number of components
component_R2 = apply(r2.cv, MARGIN = 2, mean)
plot(component_R2, main = 'R-squared vs number of PLS2 components',
xlab = 'Number of components', ylab = 'R-squared', pch=1,ylim=c(0,1))
nd =unname(which.max(component_R2))
abline(v=nd, col="red")
As seen in the plot, the best number of components to explain our model is 15. It is interesting to note that the mean R-squared decreases after reaching its maximum; this is because some of the individual R-squared values become negative. We consider these a sign of poor fit and continue the analysis with the chosen number of components.
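For reference, and using standard notation not shown in the original, the cross-validated R-squared of response \(j\) with \(k\) components is

\[ R^2_{CV}(j,k) = 1 - \frac{\mathrm{PRESS}_{j,k}}{\sum_i \left(y_{ij} - \bar{y}_j\right)^2}, \]

which becomes negative whenever the cross-validated prediction error (PRESS) exceeds the variability of the response around its mean, i.e. when the model predicts worse than the mean alone.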
With the chosen components, a linear regression will be computed, and with this model we will predict our testing set.
# Linear Model
lm.pls2 <- lm(training_y ~ pls2$scores[,1:nd])
# Project the test data
X_pls_testing <- as.matrix(testing) %*% pls2$projection[,1:nd]
# Predictions
Y_pred <- as.matrix(cbind(rep(1, nrow(X_pls_testing)), X_pls_testing)) %*%
lm.pls2$coefficients
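In matrix notation (ours, not the assignment's), the block above computes

\[ T_{test} = X_{test}\, R_{[\,,1:nd]}, \qquad \hat{Y} = \big[\,\mathbf{1} \;\; T_{test}\,\big]\,\hat{B}, \]

where \(R\) is the projection matrix returned by plsr (so that the training scores satisfy \(T = XR\)) and \(\hat{B}\) stacks the intercepts and coefficients of the linear model fitted on the training scores.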
We will compute the R-squared on our testing set in order to evaluate the model.
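Concretely, for each response (digit) \(j\) we compute

\[ R^2_j = 1 - \frac{\mathrm{SSE}_j}{\mathrm{SST}_j}, \qquad \mathrm{SSE}_j = \sum_i \left(\hat{y}_{ij} - y_{ij}\right)^2, \qquad \mathrm{SST}_j = \sum_i \left(y_{ij} - \bar{y}^{\,train}_j\right)^2, \]

where \(\bar{y}^{\,train}_j\) is the training mean of response \(j\), matching the code below.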
# Sum Square Error
SSE <- sapply(1:ncol(Y_pred), function(x) {
sum((Y_pred[,x] - testing_y[,x])^2)})
# Total Sum Squares
SST <- sapply(1:ncol(Y_pred), function(x) {
sum((testing_y[,x]-mean(training_y[,x]))^2)})
# R squared for predicting each single number
(R.sqr <- 1 - SSE/SST)
[1] 0.6991890 0.7930076 0.5087351 0.3850309 0.4681096 0.3684934 0.5290857
[8] 0.4505706 0.3614745 0.3917376
These are the R-squared values for predicting each individual digit of the ZIP data. Averaging them, the overall R-squared of the model is:
mean(R.sqr)
[1] 0.4955434
Overall, the model does not seem particularly good. In any case, we can check how well it predicts the test set by assigning each observation to the digit with the highest predicted value.
# Predict Test Set
pred_test_values <- unname(apply(Y_pred,1,function(x) which.max(x)-1))
# Confusion Matrix Testing (rows/columns 1-10 correspond to digits 0-9)
cm = table(pred_test_values,test[,1])
rownames(cm) <- c(1:10)
colnames(cm) <- c(1:10)
cm
pred_test_values 1 2 3 4 5 6 7 8 9 10
1 343 0 13 8 1 12 3 4 10 1
2 0 256 4 0 8 0 0 4 2 7
3 0 0 146 1 10 0 6 2 8 1
4 1 2 9 129 0 25 0 1 12 0
5 3 3 7 1 162 4 4 4 5 6
6 0 0 2 16 0 97 3 0 2 2
7 6 2 2 0 6 3 152 0 2 1
8 0 0 4 0 3 3 0 130 4 9
9 4 0 10 4 0 10 1 0 115 1
10 2 1 1 7 10 6 1 2 6 149
accuracy = sum(diag(cm))/sum(cm)
IndividualAccuracy=diag(cm)/colSums(cm)
Accuracy = round(c(IndividualAccuracy,"Model" = accuracy)*100,0)
library(knitr)
kable(t(data.frame(Accuracy)), format = "markdown", align = 'c')
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 96 | 97 | 74 | 78 | 81 | 61 | 89 | 88 | 69 | 84 | 84 |
The confusion matrix of the testing set shows the correctly predicted values against the mistakes. As can be observed, columns 1 and 2 (digits 0 and 1) are predicted with high accuracy, while other digits are harder to predict, such as columns 6 and 9 (digits 5 and 8).
Now we can compare our results with the previous assignments. We will use the results of MVR, PCR and IBA for the same dataset and the same amount of training data.
| | Accuracy |
|---|---|
| MVR | 0.61 |
| PCR | 0.88 |
| IBA | 0.81 |
| PLS2 | 0.84 |
Comparing the results, we can see that PCR shows the best results of the four models, with an accuracy close to \(90\)% on the testing set while training with only \(5\)% of the whole dataset. Partial Least Squares does not seem to be the best model in this situation, but further tests would have to be done to see how the model improves when trained with more data.
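As a possible follow-up (not run in this report), one could repeat the pipeline for increasing training fractions and track the test accuracy. The sketch below reuses the presumed objects train_full, test and EncodeLabels from the earlier sketches, uses predict() directly instead of a separate lm fit (which should give essentially the same predictions), and the fractions are arbitrary examples.

# Hypothetical experiment: test accuracy as a function of the training fraction
library(pls)
fractions <- c(0.05, 0.10, 0.25, 0.50)
accuracy_by_fraction <- sapply(fractions, function(frac) {
  idx <- sample(nrow(train_full), size = round(frac * nrow(train_full)))
  dtr <- data.frame(Y = I(as.matrix(EncodeLabels(train_full[idx, 1]))),
                    X = I(as.matrix(train_full[idx, -1])))
  fit <- plsr(Y ~ X, data = dtr, validation = "CV")
  # number of components maximizing the mean cross-validated R-squared over the 10 responses
  r2 <- apply(R2(fit, estimate = "CV", intercept = FALSE)$val[1,,], 2, mean)
  nc <- unname(which.max(r2))
  # predict.mvr re-applies the training centering before projecting the new observations
  Yhat <- predict(fit, newdata = data.frame(X = I(as.matrix(test[, -1]))), ncomp = nc)[, , 1]
  mean((apply(Yhat, 1, which.max) - 1) == test[, 1])
})
names(accuracy_by_fraction) <- paste0(fractions * 100, "%")
accuracy_by_fraction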