The main goal of this practice is to apply Partial Least Squares (PLS) and compare the results with the Multivariate Regression, Principal Component Regression and Inter-Batteries Analysis obtained in the previous Multivariate Modeling and Inter-Batteries assignments.
For this purpose we will again use the zip dataset, which contains normalized handwritten digits automatically scanned from envelopes by the U.S. Postal Service, stored as 16 x 16 grayscale images (values from -1 to 1). Each line consists of the digit id (0-9) followed by the 256 grayscale values. We have a training set of 7291 digits and a test set of 2007 digits. In order to compare the results with the previous assignments, we will again use only \(5\)% of the training dataset, the objective being to keep the number of observations (n) closer to the number of variables (p).
We will use the ZIP datasets provided: zip_train.dat and zip_test.dat. From the training dataset, \(5\)% of the observations will be randomly sampled (without replacement) as our training data. For the testing data, we will use the complete test file.
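For completeness, a minimal sketch of how this split could be obtained is shown below; the file names come from the assignment, while the object names train_full and train_split and the seed are our own choices, introduced so the code further down can be read in context.

# Read the provided files (first column = digit id, remaining 256 columns = pixel values)
train_full <- read.table("zip_train.dat")
test <- read.table("zip_test.dat")
# Randomly sample 5% of the training observations without replacement
set.seed(1234)  # arbitrary seed, used here only for reproducibility
train_split <- train_full[sample(nrow(train_full), size = round(0.05 * nrow(train_full))), ]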
The predictors will be centered; this means the testing data must also be centered using the training means.
# Centering Dataset
training = scale(train_split[,-1],scale = F)
# Centering Testing set (according to train set)
testing = scale(test[,-1],center = colMeans(train_split[,-1]),scale = F)
# Generating Labels
training_y = as.matrix(EncodeLabels(train_split[,1]))
testing_y = as.matrix(EncodeLabels(test[,1]))
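The helper EncodeLabels is not listed in this section; below is a minimal sketch of what it presumably does, assuming it one-hot encodes the digit labels into a 10-column indicator matrix (one column per digit 0-9), which is what the multivariate responses used later suggest.

# Presumed definition of the label-encoding helper:
# converts a vector of digits (0-9) into a 0/1 indicator matrix, one column per digit
EncodeLabels <- function(labels) {
  out <- sapply(0:9, function(d) as.numeric(labels == d))
  colnames(out) <- 0:9
  out
}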
We will compute PLS with cross-validation. LOOCV would be good practice, but to avoid a long computation time we set the validation argument of plsr (from the pls package) to “CV”.
# Partial Least Squares
library(pls)
pls2 <- plsr(training_y ~ training, validation = "CV")
# Cross-validated R-squared for each response (rows) and number of components (columns)
r2.cv <- R2(pls2)$val[1,,]
# Average R-squared over the 10 responses for each number of components
component_R2 = apply(r2.cv, MARGIN = 2, mean)
plot(component_R2, main = 'R-squared vs number of PLS2 components',
xlab = 'Number of components', ylab = 'R-squared', pch=1,ylim=c(0,1))
nd =unname(which.max(component_R2))
abline(v=nd, col="red")
As seen in the plot, the best number of components to explain our model is 15. It is interesting to note that the mean R-squared decreases after reaching its maximum; this is because some of the individual R-squared values become negative. We consider these a sign of poor fit and continue the analysis with the chosen number of components.
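For reference, and using standard notation not shown in the original, the cross-validated R-squared of response \(j\) with \(k\) components is

\[ R^2_{CV}(j,k) = 1 - \frac{\mathrm{PRESS}_{j,k}}{\sum_i \left(y_{ij} - \bar{y}_j\right)^2}, \]

which becomes negative whenever the cross-validated prediction error (PRESS) exceeds the variability of the response around its mean, i.e. when the model predicts worse than the mean alone.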
With the chosen components, a linear regression will be computed, and with this model we will predict our testing set.
# Linear Model
lm.pls2 <- lm(training_y ~ pls2$scores[,1:nd])
# Project the test data
X_pls_testing <- as.matrix(testing) %*% pls2$projection[,1:nd]
# Predictions
Y_pred <- as.matrix(cbind(rep(1, nrow(X_pls_testing)), X_pls_testing)) %*%
lm.pls2$coefficients
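In matrix notation (ours, not the assignment's), the block above computes

\[ T_{test} = X_{test}\, R_{[\,,1:nd]}, \qquad \hat{Y} = \big[\,\mathbf{1} \;\; T_{test}\,\big]\,\hat{B}, \]

where \(R\) is the projection matrix returned by plsr (so that the training scores satisfy \(T = XR\)) and \(\hat{B}\) stacks the intercepts and coefficients of the linear model fitted on the training scores.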
We will compute the R-squared on our testing set in order to evaluate the model.
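Concretely, for each response (digit) \(j\) we compute

\[ R^2_j = 1 - \frac{\mathrm{SSE}_j}{\mathrm{SST}_j}, \qquad \mathrm{SSE}_j = \sum_i \left(\hat{y}_{ij} - y_{ij}\right)^2, \qquad \mathrm{SST}_j = \sum_i \left(y_{ij} - \bar{y}^{\,train}_j\right)^2, \]

where \(\bar{y}^{\,train}_j\) is the training mean of response \(j\), matching the code below.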
# Sum Square Error
SSE <- sapply(1:ncol(Y_pred), function(x) {
sum((Y_pred[,x] - testing_y[,x])^2)})
# Total Sum Squares
SST <- sapply(1:ncol(Y_pred), function(x) {
sum((testing_y[,x]-mean(training_y[,x]))^2)})
# R squared for predicting each single number
(R.sqr <- 1 - SSE/SST)
[1] 0.6991890 0.7930076 0.5087351 0.3850309 0.4681096 0.3684934 0.5290857
[8] 0.4505706 0.3614745 0.3917376
These are the R-squared values for predicting each individual digit of the ZIP data. Averaging them, the overall R-squared of the model is:
mean(R.sqr)
[1] 0.4955434
Overall, the model does not seem particularly good. In any case, we can check how well it predicts the test set by assigning each observation to the digit with the highest predicted value.
# Predict Test Set
pred_test_values <- unname(apply(Y_pred,1,function(x) which.max(x)-1))
# Confusion Matrix Testing (rows/columns 1-10 correspond to digits 0-9)
cm = table(pred_test_values,test[,1])
rownames(cm) <- c(1:10)
colnames(cm) <- c(1:10)
cm
pred_test_values 1 2 3 4 5 6 7 8 9 10
1 343 0 13 8 1 12 3 4 10 1
2 0 256 4 0 8 0 0 4 2 7
3 0 0 146 1 10 0 6 2 8 1
4 1 2 9 129 0 25 0 1 12 0
5 3 3 7 1 162 4 4 4 5 6
6 0 0 2 16 0 97 3 0 2 2
7 6 2 2 0 6 3 152 0 2 1
8 0 0 4 0 3 3 0 130 4 9
9 4 0 10 4 0 10 1 0 115 1
10 2 1 1 7 10 6 1 2 6 149
accuracy = sum(diag(cm))/sum(cm)
IndividualAccuracy=diag(cm)/colSums(cm)
Accuracy = round(c(IndividualAccuracy,"Model" = accuracy)*100,0)
library(knitr)
kable(t(data.frame(Accuracy)), format = "markdown", align = 'c')
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 96 | 97 | 74 | 78 | 81 | 61 | 89 | 88 | 69 | 84 | 84 |
The confusion matrix of the testing set shows the correctly predicted values against the mistakes. As can be observed, columns 1 and 2 (digits 0 and 1) are predicted with high accuracy, while other digits are harder to predict, such as columns 6 and 9 (digits 5 and 8).
Now we can compare our results with the previous assignments. We will use the results of MVR, PCR and IBA for the same dataset and the same amount of training data.
| | Accuracy |
|---|---|
| MVR | 0.61 |
| PCR | 0.88 |
| IBA | 0.81 |
| PLS2 | 0.84 |
Comparing the results, we can see that PCR shows the best results of the four models, with an accuracy close to \(90\)% on the testing set while training with only \(5\)% of the whole dataset. Partial Least Squares does not seem to be the best model in this situation, but further tests would have to be done to see how the model improves when trained with more data.
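As a possible follow-up (not run in this report), one could repeat the pipeline for increasing training fractions and track the test accuracy. The sketch below reuses the presumed objects train_full, test and EncodeLabels from the earlier sketches, uses predict() directly instead of a separate lm fit (which should give essentially the same predictions), and the fractions are arbitrary examples.

# Hypothetical experiment: test accuracy as a function of the training fraction
library(pls)
fractions <- c(0.05, 0.10, 0.25, 0.50)
accuracy_by_fraction <- sapply(fractions, function(frac) {
  idx <- sample(nrow(train_full), size = round(frac * nrow(train_full)))
  dtr <- data.frame(Y = I(as.matrix(EncodeLabels(train_full[idx, 1]))),
                    X = I(as.matrix(train_full[idx, -1])))
  fit <- plsr(Y ~ X, data = dtr, validation = "CV")
  # number of components maximizing the mean cross-validated R-squared over the 10 responses
  r2 <- apply(R2(fit, estimate = "CV", intercept = FALSE)$val[1,,], 2, mean)
  nc <- unname(which.max(r2))
  # predict.mvr re-applies the training centering before projecting the new observations
  Yhat <- predict(fit, newdata = data.frame(X = I(as.matrix(test[, -1]))), ncomp = nc)[, , 1]
  mean((apply(Yhat, 1, which.max) - 1) == test[, 1])
})
names(accuracy_by_fraction) <- paste0(fractions * 100, "%")
accuracy_by_fraction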