The objective of this practical work is to build a classification function after computing a Multivariate Regression and a Principal Component Regression and compare the results and understand the limitations of each.
The data set explored are normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service in 16 x 15 grayscale images (from -1 to 1). Each line consists of the id (0-9) followed by the 256 grayscale values. We dispose of a training set of 7291 digits and a test set of 2007 digits.
It is interesting to compare the efficiency of both models in order to classify unknown data. For this we will make a comparison using different sizes of training data in order to see the performance of each depending on the amount of data used for training. In addition, for the Principal Component Regression, it will be important to see how many components should be taken into account in order to build our model.
The intuiton of this problem is the following: We have a vector, which are basically our different written numbers from 0-9. In order to be able to classify them, this factor variable will be binarized into 10 dummy variables containing 0 or 1, each representing one of the numbers from 0-9. The Multivariate Regression (MVR) and Principal Component Regression (PCR) come into play when the problem becomes predicting 10 different responses (Y) from \(0-1\). This way, when we have a value for each different number, our prediction will become the maximum value out of the 10 predicted ones.
For the Multivariate Regression (MVR), the models are computed using data sampling [5%, 10% 30% and 50%] from the training data.
| Train Data | accuracy | error | R-squared | Adj.R-squared | LOOCV R-squared | Test R-squared | TrainSize |
|---|---|---|---|---|---|---|---|
| 0.05 | 0.61 | 0.39 | 0.84 | 0.46 | -40.00 | 0.82 | 365 |
| 0.10 | 0.78 | 0.22 | 0.72 | 0.57 | -0.15 | 0.69 | 729 |
| 0.30 | 0.85 | 0.15 | 0.63 | 0.58 | 0.44 | 0.59 | 2187 |
| 0.50 | 0.86 | 0.14 | 0.61 | 0.58 | 0.48 | 0.57 | 3646 |
It is interesting to observe how the accuracy improves while using more data to train our model, this will be shown in Figure 1. Another observation is how the Average R-squared of our MVR model seems to be getting worse, but at least our Adj.R squared and our LOOCV R-squared are improving while improving the training data size, which are a better indication of the variance explained by our MVR model.
For Principal Component Regression (PCR), the model is computed only using a sample of 5% from the training data, this being the same observations as the 5% from the (MVR). The reason for this choice is to see how the model works by having a number of observations \(n\) close to the number of features \(p\). And given that a LOOCV is computed, it becomes more complicated to do this with a big dataset.
The basic idea of the PCR is to calculate principal components, and later use some of this components as predictors in a linear regression model fitted using the typical least squares procedure.
Since one of our objectives is to reduce dimensionality, it is important to know how many components should be used in order to predict. Our model is built by using \(256\) predictors, which result in a computation of these total number of components to explain 100% of the variance. For this we plot the components and the variance explained in order to know for which Components to proceed with our model.
In this case it is chosen to explore the components (40 components) which explain up to 80% of our variance and plot the results.
By computing the accuracy and error of our testing dataset, we can observe that using any number from 15-18 components might be ideal.
It is interesting to see that by reducing dimensions with a PCR we are able to get a very good result even when we are using only a small portion of the data in order to train the model. This PCR model using 5% of the training data is much better than the MVR model using the same training data, and it is quite comparable to the MVR model which uses 50% of the data.
###################################
############ LIBRARIES ############
library(caTools)
library(pls)
library(ggplot2)
library(gridExtra)
############ LIBRARIES ############
###################################
###################################
############ FUNCTIONS ############
# Prepare Labels
# 01 Function to transform categories into dummy variables
EncodeLabels <- function(vector){
Y = data.frame(model.matrix(~factor(vector)-1))
names(Y) = c("Y0","Y1","Y2","Y3","Y4","Y5","Y6","Y7","Y8","Y9")
return(Y)
}
# Compute accuracy and error for a given model
acc_error = function(model,data){
x = as.data.frame(scale(data[,-1],scale = F))
y = data[,1]
yhat = predict(model, x)
prediction = unname(apply(yhat,1,function(x) which.max(x)-1))
cm = table(y,prediction)
acc = sum(diag(cm))/sum(cm)
error = 1-acc
return(c("accuracy" = acc,"error" = error))
}
# MultivariateRegression Model
multivariateRegression<- function(train,test,partition){
results = c()
for(i in partition){
set.seed(1)
split = sample.split(train$V1, SplitRatio = i)
train_split <- subset(train,split==TRUE)
X = as.data.frame(scale(train_split[,-1],scale = F))
Y = EncodeLabels(train_split[,1])
df = cbind(X,Y)
#Regression
formula= cbind(Y0,Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8,Y9) ~ . - 1
mreg1 <- lm(formula, data = df)
# R-squared
models_mreg1 = summary(mreg1)
l = length(models_mreg1)
Average_Rsquared = mean(sapply(1:l, function(x) models_mreg1[[x]]$r.squared))
Adjusted_Rsquared = mean(sapply(1:l, function(x) models_mreg1[[x]]$adj.r.squared))
# R-squared by LOOCV
n <- nrow(Y)
PRESS <- colSums((mreg1$residuals/(1-ls.diag(mreg1)$hat))^2)
RsquaredCV <- 1 - PRESS/(diag(var(Y))*(n-1))
Avg_RsquaredCV <- mean(RsquaredCV)
# R-squared Test data
TSS = apply(Y,2,function(x){sum((x-mean(x))^2)})
RSS = colSums((Y-predict(mreg1,X))^2)
R_sq_test = mean(1 - (RSS/TSS))
#Accuracy and Error for test set
results = rbind(results,c("Train Data" = i, acc_error(mreg1,test)[1], acc_error(mreg1,test)[2],
"R-squared" = Average_Rsquared,"Adj.R-squared" = Adjusted_Rsquared,
"LOOCV R-squared" = Avg_RsquaredCV,"Test R-squared" = R_sq_test))
}
return(results)
}
############ FUNCTIONS ############
###################################
###################################
############ MAIN ############
# Load Data
train = read.delim('zip_train.dat', header = F, sep=" ")
test = read.delim('zip_test.dat', header = F, sep=" ")
# Explore Datasets
#dim(train) ; str(train) ; summary(train)
#dim(test)
# Look for NA's
NonNA = apply(train,2,function(x) sum(is.na(x))==0)
# Remove Columns of NA's
train = train[,NonNA]
test = test[,NonNA]
### Multi-Linear Regression
results = as.data.frame(multivariateRegression(train,test,c(.05,.1,.3,.5)))
results$TrainSize = round(nrow(train)*results[,1],0)
#Accuracy Plot
ggplot(results, aes(x=TrainSize,y=accuracy)) +
geom_line(col="black") + geom_point(aes(col=factor(TrainSize))) +
theme_bw() + ggtitle("Accuracy for Test Data")
###### Principal Component Regression #####
# Split Dataset
set.seed(1)
split = sample.split(train$V1, SplitRatio = 0.05)
train_split <- subset(train,split==TRUE)
test_split <- subset(train,split==FALSE)
X = as.data.frame(scale(train_split[,-1],scale = F))
Y = EncodeLabels(train_split[,1])
df = cbind(X,Y)
formula= cbind(Y0,Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8,Y9) ~ . - 1
#Model Building
pcr1 <- pcr(formula, data = df, validation = "LOO") #validation = "LOO"
variance_explained = cumsum(explvar(pcr1))
#Variance Explained Plot
var_exp = data.frame("Components" = 1:length(variance_explained), "Variance" = variance_explained)
ggplot(var_exp, aes(x=Components,y=Variance)) + geom_point() +
theme_bw() + ggtitle("PCR \nVariance Explained") +geom_hline(yintercept=80, col="red")
# The variance explained by the components should be around 80%,
X_test <- as.data.frame(scale(test_split[,-1], scale=F))
Y_test <- test_split[, 1]
yhat = predict(pcr1,X_test)
acc=c()
for(i in 1:40){
prediction <- unname(apply(yhat[,,i],1,function(x) which.max(x)-1))
cm=table(Y_test,prediction)
acc[i] = sum(diag(cm))/sum(cm)
}
#Plotting the accuracies
accuracy_df=as.data.frame(cbind("Components"=1:40,"Accuracy"=acc, "Error" = 1-acc))
plot1 <- ggplot(accuracy_df,aes(x=Components,y=Accuracy)) +
geom_line(col="blue") + geom_point(col="black") +
theme_bw() + ggtitle("Test Data \nAccuracy")
plot2 <- ggplot(accuracy_df,aes(x=Components,y=Error)) +
geom_line(col="red") + geom_point(col="black") + theme_bw() +
ggtitle("Test Data \nError")
grid.arrange(plot1,plot2,nrow=1,ncol=2)
# It seems that a good value to choose for our number of components would be 15.
############ MAIN ############
###################################