Objective

The objective of this practical work is to build a classification function after computing a Multivariate Regression and a Principal Component Regression, to compare their results, and to understand the limitations of each.

Introduction

The data set explored consists of normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service as 16 x 16 grayscale images (values from -1 to 1). Each line consists of the digit label (0-9) followed by the 256 grayscale values. We have a training set of 7291 digits and a test set of 2007 digits.

Process

It is interesting to compare how efficiently both models classify unseen data. To do so, we compare them using different training-set sizes, in order to see how each performs depending on the amount of data used for training. In addition, for the Principal Component Regression, it is important to determine how many components should be taken into account when building the model.

The intuition behind this problem is the following: the response is a factor variable whose levels are the written digits 0-9. In order to classify them, this factor variable is binarized into 10 dummy variables containing 0 or 1, each representing one of the digits 0-9. The Multivariate Regression (MVR) and Principal Component Regression (PCR) come into play when the problem becomes predicting these 10 dummy responses \(Y_0,\dots,Y_9\). Once the model gives a value for each digit, the predicted class is the digit whose response attains the maximum value, as sketched below.
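A minimal sketch of this encoding and decoding step (it mirrors the EncodeLabels helper and the argmax decoding used in the Code appendix; the example labels and the noise standing in for model output are purely illustrative):

labels <- c(3, 0, 7)                                           # example digit labels
Y <- data.frame(model.matrix(~ factor(labels, levels = 0:9) - 1))
names(Y) <- paste0("Y", 0:9)                                   # 10 dummy columns Y0..Y9
# A fitted model returns an n x 10 matrix of predictions; the predicted digit
# is the position of the largest value in each row (columns correspond to 0-9).
Yhat <- as.matrix(Y) + rnorm(length(as.matrix(Y)), sd = 0.1)   # stand-in for model output
predicted_digit <- apply(Yhat, 1, which.max) - 1               # recovers 3, 0, 7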

Multivariate Regression

For the Multivariate Regression (MVR), the models are computed using samples of 5%, 10%, 30% and 50% of the training data.

Train fraction  Accuracy  Error  Avg. R-squared  Adj. R-squared  LOOCV R-squared  Test R-squared  Train size
0.05            0.61      0.39   0.84            0.46            -40.00           0.82            365
0.10            0.78      0.22   0.72            0.57            -0.15            0.69            729
0.30            0.85      0.15   0.63            0.58            0.44             0.59            2187
0.50            0.86      0.14   0.61            0.58            0.48             0.57            3646

It is interesting to observe how the accuracy improves as more data is used to train the model; this is shown in Figure 1. Another observation is that the average R-squared of the MVR model appears to get worse, while the adjusted R-squared and the LOOCV R-squared improve as the training size grows; these are better indicators of the variance actually explained by the MVR model.
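The LOOCV R-squared in the table is not obtained by refitting the model \(n\) times; it is computed analytically from the residuals and leverages through the PRESS statistic (this is the formula implemented in the Code appendix). For each response \(j\),

\[
\mathrm{PRESS}_j = \sum_{i=1}^{n} \left( \frac{e_{ij}}{1 - h_{ii}} \right)^2,
\qquad
R^2_{\mathrm{CV},j} = 1 - \frac{\mathrm{PRESS}_j}{\sum_{i=1}^{n} \left( y_{ij} - \bar{y}_j \right)^2},
\]

where \(e_{ij}\) is the residual of observation \(i\) for response \(j\) and \(h_{ii}\) is its leverage (hat value); the value reported is the average of \(R^2_{\mathrm{CV},j}\) over the 10 responses.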

Principal Component Regression

For Principal Component Regression (PCR), the model is computed using only a 5% sample of the training data, the same observations as the 5% sample used for the MVR. The reason for this choice is to see how the model behaves when the number of observations \(n\) is close to the number of features \(p\). Moreover, since a LOOCV is computed, doing so on a large dataset becomes computationally expensive.

The basic idea of PCR is to compute the principal components of the predictors and then use some of these components as predictors in a linear regression model fitted with the usual least squares procedure.
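To make this concrete, here is a minimal sketch of PCR done by hand; the Code appendix uses pls::pcr instead, which follows the same idea. It assumes X is the centered matrix of 256 pixel predictors and Y the data frame of 10 dummy responses built there, and the number of components k is purely illustrative.

# Principal components of the (already centered) predictors
pca <- prcomp(X, center = FALSE)
k <- 15                                  # illustrative number of components to keep
scores_k <- pca$x[, 1:k]                 # scores on the first k components
# Ordinary least squares of the 10 dummy responses on the k component scores
pcr_by_hand <- lm(as.matrix(Y) ~ scores_k - 1)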

Since one of our objectives is to reduce dimensionality, it is important to know how many components should be used for prediction. Our model is built from \(256\) predictors, so all \(256\) components are needed to explain 100% of the variance. For this reason we plot the number of components against the cumulative variance explained, in order to decide how many components to keep in our model.
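Using the pca object from the sketch above (or, equivalently, the cumsum(explvar(pcr1)) call in the Code appendix), the cumulative percentage of variance explained by the first components can be read off directly; the 80% threshold below is simply the cut-off used in this report.

cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2) * 100   # cumulative % of variance explained
k80 <- min(which(cum_var >= 80))                        # smallest number of components reaching ~80%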

In this case we choose to explore the first 40 components, which explain up to roughly 80% of the variance, and plot the results.

By computing the accuracy and error on the held-out observations (the 95% of the training set not used to fit the PCR model), we observe that using anywhere from 15 to 18 components might be ideal.
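Given the vector of per-component accuracies on the held-out observations (the acc vector built in the Code appendix), the best-scoring number of components can be read off directly:

best_k <- which.max(acc)   # number of components with the highest held-out accuracy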

Conclusions

It is interesting to see that by reducing dimensionality with PCR we obtain a very good result even when only a small portion of the data is used to train the model. The PCR model using 5% of the training data is much better than the MVR model trained on the same data, and is quite comparable to the MVR model that uses 50% of the data.

Code

###################################
############ LIBRARIES ############
library(caTools)
library(pls)
library(ggplot2)
library(gridExtra)
############ LIBRARIES ############
###################################

###################################
############ FUNCTIONS ############

# Prepare Labels
# 01 Function to transform categories into dummy variables
EncodeLabels <- function(vector){
  Y =  data.frame(model.matrix(~factor(vector)-1))
  names(Y) = c("Y0","Y1","Y2","Y3","Y4","Y5","Y6","Y7","Y8","Y9")
  return(Y)
}

# Compute accuracy and error for a given model
acc_error = function(model,data){
  x = as.data.frame(scale(data[,-1],scale = F))
  y = data[,1]
  yhat = predict(model, x)
  prediction = unname(apply(yhat,1,function(x) which.max(x)-1))
  
  cm = table(y,prediction)
  acc = sum(diag(cm))/sum(cm)
  error = 1-acc
  return(c("accuracy" = acc,"error" = error))
}

# MultivariateRegression Model
multivariateRegression<- function(train,test,partition){
  
  results = c()
  for(i in partition){
    set.seed(1)
    split =  sample.split(train$V1, SplitRatio = i)
    train_split <- subset(train,split==TRUE)
    X = as.data.frame(scale(train_split[,-1],scale = F))
    Y = EncodeLabels(train_split[,1])
    df = cbind(X,Y)
    
    #Regression
    formula= cbind(Y0,Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8,Y9) ~ . - 1
    mreg1 <- lm(formula, data = df) 
    
    # R-squared
    models_mreg1 = summary(mreg1)
    l = length(models_mreg1)
    Average_Rsquared = mean(sapply(1:l, function(x) models_mreg1[[x]]$r.squared))
    Adjusted_Rsquared = mean(sapply(1:l, function(x) models_mreg1[[x]]$adj.r.squared))
    
    # R-squared by LOOCV
    n <- nrow(Y)
    PRESS <- colSums((mreg1$residuals/(1-ls.diag(mreg1)$hat))^2)
    RsquaredCV <- 1 - PRESS/(diag(var(Y))*(n-1))
    Avg_RsquaredCV <- mean(RsquaredCV)
    
    # R-squared Test data
    TSS = apply(Y,2,function(x){sum((x-mean(x))^2)})
    RSS = colSums((Y-predict(mreg1,X))^2)
    R_sq_test = mean(1 - (RSS/TSS))
    
    #Accuracy and Error for test set
    results = rbind(results,c("Train Data" = i, acc_error(mreg1,test)[1], acc_error(mreg1,test)[2],
                              "R-squared" = Average_Rsquared,"Adj.R-squared" = Adjusted_Rsquared, 
                              "LOOCV R-squared" = Avg_RsquaredCV,"Test R-squared" = R_sq_test))
  }
  return(results)
}


############ FUNCTIONS ############
###################################

###################################
############   MAIN    ############

# Load Data
train = read.delim('zip_train.dat', header = F, sep=" ") 
test = read.delim('zip_test.dat', header = F, sep=" ") 

# Explore Datasets
#dim(train) ; str(train) ; summary(train)
#dim(test)

# Look for NA's
NonNA = apply(train,2,function(x) sum(is.na(x))==0)

# Remove Columns of NA's
train = train[,NonNA]
test = test[,NonNA]


### Multi-Linear Regression
results = as.data.frame(multivariateRegression(train,test,c(.05,.1,.3,.5)))
results$TrainSize = round(nrow(train)*results[,1],0)

#Accuracy Plot
ggplot(results, aes(x=TrainSize,y=accuracy)) + 
  geom_line(col="black") + geom_point(aes(col=factor(TrainSize))) + 
  theme_bw() + ggtitle("Accuracy for Test Data")


###### Principal Component Regression #####

# Split Dataset
set.seed(1)
split =  sample.split(train$V1, SplitRatio = 0.05)
train_split <- subset(train,split==TRUE)
test_split <- subset(train,split==FALSE)
X = as.data.frame(scale(train_split[,-1],scale = F))
Y = EncodeLabels(train_split[,1])
df = cbind(X,Y)
formula= cbind(Y0,Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8,Y9) ~ . - 1

#Model Building
pcr1 <- pcr(formula, data = df, validation = "LOO") # leave-one-out cross-validation
variance_explained = cumsum(explvar(pcr1))

#Variance Explained Plot
var_exp = data.frame("Components" = 1:length(variance_explained), "Variance" = variance_explained)
ggplot(var_exp, aes(x=Components,y=Variance)) + geom_point() + 
  theme_bw() + ggtitle("PCR \nVariance Explained") +geom_hline(yintercept=80, col="red")
# Around the first 40 components are needed to explain roughly 80% of the variance

X_test <- as.data.frame(scale(test_split[,-1], scale=F)) 
Y_test <- test_split[, 1]
yhat = predict(pcr1,X_test)
acc=c()
for(i in 1:40){
  prediction <- unname(apply(yhat[,,i],1,function(x) which.max(x)-1))
  cm=table(Y_test,prediction)
  acc[i] = sum(diag(cm))/sum(cm)
}

#Plotting the accuracies
accuracy_df=as.data.frame(cbind("Components"=1:40,"Accuracy"=acc, "Error" = 1-acc))
plot1 <- ggplot(accuracy_df,aes(x=Components,y=Accuracy)) + 
  geom_line(col="blue") + geom_point(col="black") + 
  theme_bw() + ggtitle("Test Data \nAccuracy")
plot2 <- ggplot(accuracy_df,aes(x=Components,y=Error)) + 
  geom_line(col="red") + geom_point(col="black") + theme_bw() + 
  ggtitle("Test Data \nError")

grid.arrange(plot1,plot2,nrow=1,ncol=2)
# It seems that a good value to choose for our number of components would be 15.


############   MAIN    ############
###################################