Problem 1a

We will compare several models (ordinary least squares, ridge regression, lasso regression, elastic net, principal component regression, and partial least squares regression) at predicting the exclusivity of a college from predictor variables such as rankings, private vs. public status, numbers of undergraduate and graduate students, PhD admissions, etc. See the data set ‘college.csv’ for the full predictor space.
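The code that follows assumes the data have already been read in and standardized into a data frame called z.data whose response column is exclus. The exact construction of z.data (and of exclus itself) is not shown here, so the sketch below is only an illustration of that preprocessing step, under the assumption that exclus is already a column of the file:

# Illustrative preprocessing sketch (assumption: `exclus` is already a column
# of college.csv; only the numeric columns are standardized)
college <- read.csv("college.csv")
num.cols <- sapply(college, is.numeric)
z.data <- college
z.data[num.cols] <- scale(college[num.cols])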

We will assess the performance of these models by computing their MSE (mean squared error) on a held-out subset of the data, which we will call the test data. The models will learn their parameters only on a disjoint subset of the data, which we will call the training data.
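Concretely, for the $n$ held-out test observations the test MSE is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$

where $\hat{y}_i$ is a model's prediction for test observation $i$.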

By the end of Problem 1 we will present a table with the test MSE of each model. The model with the minimum test MSE has performed the best by our objective measure of fit.

We first split our data into training and test sets.

#Split training and test data
set.seed(124)
index = sample(1:(dim(z.data)[1]),round(dim(z.data)[1]*.8))
train.data  <- z.data[index, ]
test.data <- z.data[-index, ]
x.train <- model.matrix(exclus~., train.data)[,-1]
y.train <- train.data$exclus
x.test = model.matrix(exclus~., test.data)[,-1]
y.test = test.data$exclus

Model 1: Least Squares Linear Regression

# Fit OLS on the training data and compute its test MSE
model.lm = lm(exclus ~ ., data=train.data)
predicted.lm = predict(model.lm, newdata=test.data)
residuals.lm = mean((y.test - predicted.lm)^2)

Model 2: Ridge Regression (alpha = 0)

library(glmnet)
# Choose lambda by cross-validation (cv.glmnet defaults to 10-fold)
cv <- cv.glmnet(x.train, y.train, alpha = 0)
model.ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
predicted.ridge = predict.glmnet(model.ridge, newx=x.test)
residuals.ridge = mean((predicted.ridge - y.test)^2)

Model 3: Lasso Regression (alpha = 1)

#model 3 Lasso 
cv <- cv.glmnet(x.train, y.train, alpha = 1)
model.lasso <- glmnet(x.train, y.train, alpha = 1, lambda = cv$lambda.min)
predicted.lasso = predict.glmnet(model.lasso, newx=x.test)
residuals.lasso = mean((predicted.lasso - y.test)^2)

Model 4: Elastic Net (alpha = 0.5, so the L1 and L2 penalties are weighted equally)
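For reference, for a Gaussian response glmnet minimizes the criterion

$$\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],$$

so alpha = 0 gives ridge, alpha = 1 gives the lasso, and alpha = 0.5 weights the two penalties equally.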

cv = cv.glmnet(x.train, y.train, alpha = 0.5)
model.elastic = glmnet(x.train, y.train, alpha=0.5, lambda = cv$lambda.min)
predicted.elastic = predict.glmnet(model.elastic, newx=x.test)
residuals.elastic = mean((predicted.elastic - y.test)^2)

Model 5: Principal Component Regression

library(pls)
model.pcr = pcr(exclus~., validation = "CV", data = train.data)
validationplot(model.pcr, val.type='R2', cex.axis=0.7)

# Refit with up to 15 components; keep the predictions that use 12 components
best.model.pcr = pcr(exclus~., data=train.data, ncomp=15)
predicted.pcr.all = predict(best.model.pcr, newdata=x.test)
predicted.pcr = predicted.pcr.all[1:nrow(test.data), 1, 12]
residuals.pcr = mean((predicted.pcr - y.test)^2)

From the validation plot above, it appears that the model has captured a sufficient amount of the information in the predictors by 12 components. Therefore, we will assess the test MSE for the model with 12 components.
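Rather than reading the number of components off the plot alone, the cross-validated R2 values can also be inspected numerically with the R2 helper from the pls package (a quick check on the choice of 12 components):

# Cross-validated R^2 by number of components; the values should level off
# around the 12 components chosen from the validation plot above
R2(model.pcr, estimate = "CV")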

Model 6: Partial Least Squares Regression

model.plsr = plsr(exclus~., validation = "CV", data = train.data)
validationplot(model.plsr, val.type='R2', cex.axis=0.7)

# The validation plot suggests 4 components
best.model.plsr = plsr(exclus~., data=train.data, ncomp = 4)
predicted.plsr.all = predict(best.model.plsr, newdata=x.test)
predicted.plsr = predicted.plsr.all[1:nrow(test.data), 1, 4]
residuals.plsr = mean((predicted.plsr - y.test)^2)

Per the validation plot above, it appears that 4 components capture a sufficient amount of the information in our predictor variables. Therefore, we will assess the test MSE for the model with 4 components.

Problem 1b: Results of Models

##           Model       MSE
## 1 Least Squares 0.8093545
## 2     Ridge Reg 0.8015277
## 3     Lasso Reg 0.8025582
## 4   Elastic Net 0.7966592
## 5           PCR 0.8123587
## 6          PLSR 0.8123587

It appears that the models perform almost identically; no model is substantially more competitive than the others. This is likely because, with 15 predictors and 777 observations, there is little risk of overfitting, so the OLS solution faces no significant obstacles and the regularized models offer little improvement (and, if anything, would most likely do slightly worse).

To verify that this result is not just a function of the particular training/test partition, we will replicate this experiment 100 times and take the average MSE across all 100 simulations.

NUMSIM = 100
lm.mse = rep(NA, NUMSIM)
ridge.mse = rep(NA, NUMSIM)
lasso.mse = rep(NA, NUMSIM)
elastic.mse = rep(NA, NUMSIM)
pcr.mse = rep(NA, NUMSIM)
plsr.mse = rep(NA, NUMSIM)

for(i in 1:NUMSIM) {
  index = sample(1:(dim(z.data)[1]),round(dim(z.data)[1]*.8))
  train.data  <- z.data[index, ]
  test.data <- z.data[-index, ]
  x.train <- model.matrix(exclus~., train.data)[,-1]
  y.train <- train.data$exclus
  x.test = model.matrix(exclus~., test.data)[,-1]
  y.test = test.data$exclus
  
  #lm model
  model.lm = lm(exclus ~ ., data=train.data)
  predicted.lm = predict(model.lm, newdata=test.data)
  lm.mse[i] = mean((y.test - predicted.lm)^2)
  
  #ridge model
  cv <- cv.glmnet(x.train, y.train, alpha = 0)
  model.ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
  predicted.ridge = predict.glmnet(model.ridge, newx=x.test)
  ridge.mse[i] = mean((predicted.ridge - y.test)^2)
  
  #lasso model
  cv <- cv.glmnet(x.train, y.train, alpha = 1)
  model.lasso <- glmnet(x.train, y.train, alpha = 1, lambda = cv$lambda.min)
  predicted.lasso = predict.glmnet(model.lasso, newx=x.test)
  lasso.mse[i] = mean((predicted.lasso - y.test)^2)
  
  #elastic model
  cv = cv.glmnet(x.train, y.train, alpha = 0.5)
  model.elastic = glmnet(x.train, y.train, alpha=0.5, lambda = cv$lambda.min)
  predicted.elastic = predict.glmnet(model.elastic, newx=x.test)
  elastic.mse[i] = mean((predicted.elastic - y.test)^2)
  
  #pcr model
  best.model.pcr = pcr(exclus~., data=train.data, ncomp=15)
  predicted.pcr.all = predict(best.model.pcr, newdata=x.test)
  predicted.pcr = predicted.pcr.all[1:nrow(test.data), 1, 12]
  pcr.mse[i] = mean((predicted.pcr - y.test)^2)
  
  #plsr model
  best.model.plsr = plsr(exclus~., data=train.data, ncomp = 4)
  predicted.plsr.all = predict(best.model.plsr, newdata=x.test)
  predicted.plsr = predicted.plsr.all[1:nrow(test.data), 1, 4]
  plsr.mse[i] = mean((predicted.plsr - y.test)^2)
}

models <- c("Least Squares", "Ridge Reg", "Lasso Reg", "Elastic Net", "PCR", "PLSR")


avg.mse <- c(mean(lm.mse), mean(ridge.mse), mean(lasso.mse), mean(elastic.mse), mean(pcr.mse), mean(plsr.mse))

results.df = data.frame(models, avg.mse)
colnames(results.df) = c("Model", "MSE")
results.df
##           Model       MSE
## 1 Least Squares 0.7890140
## 2     Ridge Reg 0.7733811
## 3     Lasso Reg 0.7609350
## 4   Elastic Net 0.7759370
## 5           PCR 0.7982557
## 6          PLSR 0.7982557

The averaged results are again similar across all of these models, which suggests that regularization is not necessary for this data set.

Problem 2

Problem 2a: PCA

We will perform Principal Component Analysis (PCA) to determine how many principal components are needed to capture the majority of the variance/information in the original variables.

As we can see, 4 principal components explain about 97% of the total variance/information. Therefore, we will settle on 4 principal components as we try to predict variable 4 (looking right, looking front, looking left).
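The PCA call itself is not shown above; the sketch below illustrates how these quantities can be computed, assuming the landmark coordinates live in a data frame called face.data (the object name, and the use of scaling before PCA, are assumptions):

# Hypothetical data frame `face.data` holding the 14 landmark columns xF ... yLM
landmark.cols <- c("xF","yF","wF","hF","xRE","yRE","xLE","yLE",
                   "xN","yN","xRM","yRM","xLM","yLM")
pca <- prcomp(face.data[, landmark.cols], scale. = TRUE)
summary(pca)$importance["Cumulative Proportion", 1:4]  # roughly 0.97 by PC4
pca$rotation[, 1:4]                                    # loadings, as in the table below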

We will now attempt to explain the loading coefficients for our two main principal components.

As we can see in the figures above (and the table below), all of the predictors that are x-positions are positively correlated with principal component 1: as PC1 increases, the x-positions increase. Conversely, the predictors that are y-positions are negatively correlated with principal component 1: as PC1 increases, the y-positions decrease.

##            PC1         PC2          PC3          PC4
## xF   0.2942435 -0.26070636  0.078744995 -0.188899657
## yF  -0.2794330 -0.29479659  0.046130612  0.032777586
## wF  -0.1206052  0.08913905 -0.766422460  0.568148900
## hF   0.1585780 -0.10946051 -0.628050184 -0.709415013
## xRE  0.2738470 -0.29522750  0.022563377  0.144893240
## yRE -0.2831645 -0.29171738  0.001006468 -0.008940596
## xLE  0.2881683 -0.28016465 -0.034533147  0.078343843
## yLE -0.2705769 -0.30511584  0.016749591  0.011283653
## xN   0.2832853 -0.27525736  0.020665047  0.246382359
## yN  -0.2768979 -0.29728337 -0.025268870 -0.046098948
## xRM  0.2788090 -0.28061881 -0.002109732  0.192831770
## yRM -0.2722651 -0.30239743 -0.037238484 -0.043792537
## xLM  0.3280473 -0.21838995 -0.069718427  0.082286981
## yLM -0.2604178 -0.31515751 -0.022807691 -0.033737931

Two predictor variables to note are wF and hF, the width and height of the person’s head. Examining the table and figures above, both are only weakly correlated with principal components 1 and 2; in other words, not much of their information was loaded onto these PCs.

Looking at PC3 and PC4, it is clear that wF and hF loaded primarily onto those two PCs (see rows wF and hF in the table).

Finally, we will examine how the looking-left, looking-right, and looking-front labels cluster in the space of our first two principal components.

Looking at the figure above, we see that looking front (f - red) has small absolute values of PC1 and PC2. This is consistent with our earlier statement: recall that a larger PC1 value corresponds to larger x-positions and smaller y-positions, so “small” PC1 values correspond to x- and y-positions close to the center. It therefore makes sense that looking front sits in the middle of our two-dimensional PC plot.

Furthermore, we see that looking right (lr - blue) has a large positive PC1 value and a large negative PC2 value. Recall that a large positive PC1 value corresponds to an increase in the x-positions and a decrease in the y-positions. Thus, looking right (lr - blue) corresponds to looking in the positive x-direction and slightly in the negative y-direction.
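For reference, a plot like the one described above can be produced from the PCA scores roughly as follows (the label column name look and the color for looking left are assumptions; the text above only fixes red for front and blue for looking right):

# Scores on the first two PCs, colored by the looking-direction label
scores <- as.data.frame(pca$x[, 1:2])
scores$look <- face.data$look  # hypothetical label column with levels f, ll, lr
cols <- c(f = "red", ll = "green", lr = "blue")
plot(scores$PC1, scores$PC2, col = cols[as.character(scores$look)], pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = names(cols), col = cols, pch = 19)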

Problem 2b: Factor Analysis

As we can see in the plots above, our factor analysis confirms what we learned from PCA: namely, that our predictor space can be summarized by 4 factors. See the table below for the loadings of the original variables on the 4 factors.
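The factor analysis call is also not shown above; a sketch of how loadings like those below can be obtained, using the same hypothetical face.data and landmark.cols as in the PCA sketch (factanal fits by maximum likelihood):

# Maximum-likelihood factor analysis with 4 factors on the 14 landmark columns
fa <- factanal(face.data[, landmark.cols], factors = 4)
print(fa$loadings, cutoff = 0.1)  # loadings below 0.1 in absolute value are blanked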

## 
## Loadings:
##     Factor1 Factor2 Factor3 Factor4
## xF           0.900   0.260  -0.338 
## yF   0.994                         
## wF          -0.257   0.106   0.842 
## hF           0.345   0.842   0.117 
## xRE          0.989                 
## yRE  0.996                         
## xLE          0.975   0.157         
## yLE  0.997                         
## xN           0.996                 
## yN   0.996                         
## xRM          0.980                 
## yRM  0.997                         
## xLM -0.190   0.949   0.188         
## yLM  0.998                         
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      6.004   5.787   0.868   0.865
## Proportion Var   0.429   0.413   0.062   0.062
## Cumulative Var   0.429   0.842   0.904   0.966

We can see that the cumulative variance explained by the four factors is 0.966, that all of the x- and y-position variables load most of their information onto Factor 1 and Factor 2, and that hF and wF load most of theirs onto Factor 3 and Factor 4.

We cannot use the factors to predict variable 4 the way we did with PCA, because factor analysis does not give us observation-level scores directly: the factors are latent variables rather than explicit linear combinations of the original predictors, so we do not have the kind of mapping between the predictor space and the components that PCA provides. Thus, we cannot make claims like we did before, such as “when PC1 increases, the x-location variables increase.”