The main goal of this practice, is to compute an Inter Batteries Analysis and compare the results with the Multivariate Regression and Principal Component Regression from previous Multivariate Modeling Assignment.
For this purpose we will again use the zip dataset, which contains normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service in 16 x 15 grayscale images (from -1 to 1). Each line consists of the id (0-9) followed by the 256 grayscale values. We dispose of a training set of 7291 digits and a test set of 2007 digits. In order to compare the results with the previous assignment, we will again use only a \(5\)% of the training dataset, which objective was to have a value of n (observations) closer to p (variables).
It is important to mention that the training set is centered, and so is the testing set, according to the training set mean.
To implement our own Intra Batteries Analysis model, first we calculate the variance of our X vector and our Y from our training dataset. Then we compute the eigenvalues and eigen vectors by from our multiplication of variance. It is important to notice that our last eigenvalue is close to 0. So when computing our components T, this could be removed, which leaves us with 9 components.
### Inter Bateries Analysis
Varxy = var(training_x,training_y)
AIB <- eigen(t(Varxy)%*%Varxy)
A = Varxy %*% AIB$vectors %*% diag(AIB$values^(-0.5))
B = AIB$vectors
# Computing T and U.
T = as.matrix(training_x) %*% as.matrix(A)
U = as.matrix(training_y) %*% as.matrix(B)
# Final components
T <- T[,-10]
After having our components, now we can compute a Cross Validation, in order to look for the \(R^2\), and see which components to use, in order to predict.
## Components
lmb = AIB$values
p = ncol(training_x)
a <- sum(lmb > 0.0000001)
com.xt <- matrix(NA,p,a)
for (i in 1:p) {for ( j in 1:a) {com.xt[i,j] <- summary(lm(training_x[,i]~T[,1:j]))$r.squared}}
# R squared by component
R_squared <- apply(com.xt,2,mean)
Components <- data.frame(cbind("Components" = 1:9, R_squared))
ggplot(Components,aes(x=Components,y=R_squared)) +
geom_point(shape=21) + geom_line() + theme_bw() + scale_x_continuous(breaks = c(1:9)) +
labs(y = expression(R^{2})) + ggtitle(expression(paste("Number of Components against ", R^{2})))
Since the results for Inter Batteries Analysis does not give us a good proportion explained by the components, we can decide to use all of them for our prediction.
First we will calculate a linear model in order to be able to classify which is the number we are reading. We will compute some statistics for the training dataset, and then go on with predicting our training set.
# Model
set.seed(1)
fit <- lm(as.matrix(training_y) ~ T - 1)
summary_fit <- summary(fit)
# R-squared
l = length(summary_fit)
Average_Rsquared = mean(sapply(1:l, function(x) summary_fit[[x]]$r.squared))
Adjusted_Rsquared = mean(sapply(1:l, function(x) summary_fit[[x]]$adj.r.squared))
# R-squared by LOOCV
n <- nrow(training_y)
PRESS <- colSums((fit$residuals/(1-ls.diag(fit)$hat))^2)
RsquaredCV <- 1 - PRESS/(diag(var(training_y))*(n-1))
RsquaredLOOCV <- mean(RsquaredCV)
### Predicting Training Set ###
pred <- T %*% fit$coefficients
pred_values <- unname(apply(pred,1,function(x) which.max(x)-1))
#Confusion Matrix Training
t_cm = table(pred_values,train_split[,1])
rownames(t_cm) <- c(1:10)
colnames(t_cm) <- c(1:10)
t_accuracy = sum(diag(t_cm))/sum(t_cm)
t_cm
pred_values 1 2 3 4 5 6 7 8 9 10
1 54 0 0 0 0 0 2 0 1 0
2 0 50 1 2 1 0 1 0 1 3
3 1 0 34 0 1 0 1 0 0 0
4 0 0 1 29 0 2 0 0 2 0
5 0 0 1 1 28 2 0 0 1 3
6 0 0 0 0 0 24 2 1 1 0
7 5 0 0 0 0 0 26 0 0 0
8 0 0 0 0 1 0 0 29 0 0
9 0 0 0 0 0 0 1 0 21 1
10 0 0 0 1 2 0 0 2 0 25
| Average Rsquared | Adjusted Rsquared | LOOCV Rsquared | Training Accuracy | |
|---|---|---|---|---|
| results | 0.5123866 | 0.5000593 | 0.4329066 | 0.8767123 |
As we can see in our confusion matrix, the most misspredicted numbers are 1, 8 and 10.
We get a quite decent accuracy close to 90%, even though or model is not a very good one, since our \(R^2\) portrays even less than 50%.
In any case, we will predict our testing set and see how the model performs. Is important to mention that we will only use the first nine vectors of our matrix A, since we are using only the first 9 components.
### Predicting Test Set ###
pred_test <- testing_x %*% A[,1:9] %*% fit$coefficients
pred_test_values <- unname(apply(pred_test,1,function(x) which.max(x)-1))
#Confusion Matrix Testing
cm = table(pred_test_values,test[,1])
rownames(cm) <- c(1:10)
colnames(cm) <- c(1:10)
cm
##
## pred_test_values 1 2 3 4 5 6 7 8 9 10
## 1 312 0 11 4 1 12 20 1 5 0
## 2 0 256 2 1 7 1 0 5 3 8
## 3 2 0 143 3 7 0 9 4 4 0
## 4 2 0 6 125 0 10 0 0 9 0
## 5 3 5 9 2 157 4 2 4 5 14
## 6 0 0 2 20 0 109 2 0 5 1
## 7 39 1 3 0 7 3 136 1 1 1
## 8 0 0 9 0 5 1 0 123 5 11
## 9 0 2 13 8 1 13 1 1 125 5
## 10 1 0 0 3 15 7 0 8 4 137
accuracy = sum(diag(cm))/sum(cm)
IndividualAccuracy=diag(cm)/colSums(cm)
Accuracy = round(c(IndividualAccuracy,"Model" = accuracy)*100,0)
kable(t(data.frame(Accuracy)),format="markdown", align = 'c')
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 87 | 97 | 72 | 75 | 78 | 68 | 80 | 84 | 75 | 77 | 81 |
The confusion matrix of the testing set after applying the model, shows the well predicted values, against the mistakes. As it can be observed, predicting number 2 has a great accuracy, but there are other numbers which represent a higher struggle, like 6 and 3
We can even compute the *\(R^2\) for the testing data:
# R squared for Testing data
TSS = apply(testing_y,2,function(x){sum((x-mean(x))^2)})
RSS = colSums((testing_y- pred_test)^2)
(R_sq_test = mean(1 - (RSS/TSS)))
## [1] 0.3496209
As we can see, the problem does not seem to be explained by our model.
Now we can compare our results with the previous assignment. We will use the results of MVR and PCR for the same dataset and same amount of data.
| Accuracy | Test R Squared | |
|---|---|---|
| MVR | 0.61 | 0.82 |
| PCR | 0.88 | 0.91 |
| IBA | 0.81 | 0.35 |
Comparing the results, we can see that the PCR shows the best results of the three different models with an accuracy close to \(90\)% for the testing set, while training with \(5\)% of the whole dataset. The Inter Batteries Analysis, does not seem to be the best model in this situation but test would have to be done in order to see how the model improves by training it with more data.