Introduction

In this project, I am going to evaluate College data set using Regularization methods and make a prediction on new data.

Data:

College data set contains statistics for a large number of US Colleges from the 1995 issue of US News and World Report. The data frame with 777 observations on the following 18 variables:

Private: A factor with levels No and Yes indicating private or public university

Apps: Number of applications received

Accept: Number of applications accepted

Enroll: Number of new students enrolled

Top10perc: Pct. new students from top 10% of H.S. class

Top25perc: Pct. new students from top 25% of H.S. class

F.Undergrad: Number of fulltime undergraduates

P.Undergrad: Number of parttime undergraduates

Outstate: Out-of-state tuition

Room.Board: Room and board costs

Books: Estimated book costs

Personal: Estimated personal spending

PhD: Pct. of faculty with Ph.D.’s

Terminal: Pct. of faculty with terminal degree

S.F.Ratio: Student/faculty ratio

perc.alumni: Pct. alumni who donate

Expend: Instructional expenditure per student

Grad.Rate: Graduation rate

Objective:

  1. Analyze the data

  2. Fit a linear model using least squares on training set

  3. Report the test error obtained

  4. Perform the best subset selection methods including Ridge and Lasso regression models

  5. Fit PCR and PLS Models on training set with M chosen by cross validation

  6. Find the Most Important Features and Create the Final Model

  7. Compare the model RMSE and R2 Scores



Loading Libraries

#Loading necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v readr   2.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)  
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-2
library(formattable)
library(pls) #PCR & PLS REgression 
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(ISLR)
library(ggplot2)
1. Analyze the data

Load Data Set

head(College,2)
##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
#Structure of the data
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
#check missing variables
sum(is.na(College))
## [1] 0

There is no missing data in the dataset.

# Smallest schools
head(arrange(College, Enroll),2)
##                 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad
## Capitol College     Yes  100     90     35        10        52         282
## Wilson College      Yes  167    130     46        16        50         199
##                 P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## Capitol College         331     8400       2812   300     2134  10       50
## Wilson College          676    11428       5084   450      475  67       76
##                 S.F.Ratio perc.alumni Expend Grad.Rate
## Capitol College      12.1          24   7976        52
## Wilson College        8.3          43  10291        67
# Largest schools
head(arrange(College, desc(Enroll)),2)
##                                    Private  Apps Accept Enroll Top10perc
## Texas A&M Univ. at College Station      No 14474  10519   6392        49
## Michigan State University               No 18114  15096   6180        23
##                                    Top25perc F.Undergrad P.Undergrad Outstate
## Texas A&M Univ. at College Station        85       31643        2798     5130
## Michigan State University                 57       26640        4120    10658
##                                    Room.Board Books Personal PhD Terminal
## Texas A&M Univ. at College Station       3412   600     2144  89       91
## Michigan State University                3734   504      600  93       95
##                                    S.F.Ratio perc.alumni Expend Grad.Rate
## Texas A&M Univ. at College Station      23.1          29   8471        69
## Michigan State University               14.0           9  10520        71
# Highest Pct. of faculty with Ph.D.'s
head(arrange(College, desc(PhD)))
##                                   Private  Apps Accept Enroll Top10perc
## Texas A&M University at Galveston      No   529    481    243        22
## Bryn Mawr College                     Yes  1465    810    313        71
## Harvey Mudd College                   Yes  1377    572    178        95
## Pitzer College                        Yes  1133    630    220        37
## Brown University                      Yes 12586   3239   1462        87
## Claremont McKenna College             Yes  1860    767    227        71
##                                   Top25perc F.Undergrad P.Undergrad Outstate
## Texas A&M University at Galveston        47        1206         134     4860
## Bryn Mawr College                        95        1088          16    18165
## Harvey Mudd College                     100         654           5    17230
## Pitzer College                           73         750          30    17688
## Brown University                         95        5643         349    19528
## Claremont McKenna College                93         887           1    17000
##                                   Room.Board Books Personal PhD Terminal
## Texas A&M University at Galveston       3122   600      650 103       88
## Bryn Mawr College                       6750   500     1200 100      100
## Harvey Mudd College                     6690   700      900 100      100
## Pitzer College                          5900   650      850 100      100
## Brown University                        5926   720     1100  99      100
## Claremont McKenna College               6010   500      850  99       99
##                                   S.F.Ratio perc.alumni Expend Grad.Rate
## Texas A&M University at Galveston      17.4          16   6415        43
## Bryn Mawr College                      12.3          49  17449        89
## Harvey Mudd College                     8.2          46  21569       100
## Pitzer College                         10.4          11  14820        73
## Brown University                        7.6          39  20440        97
## Claremont McKenna College               9.6          52  18443        87
#College Application vs Enrollment by Type of the College
ggplot(College, aes(Apps, Enroll, colour = Private)) +
  geom_point()

#Private Vs Public Schools by Enrollment
ggplot(College, aes(Private, Enroll)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 19)  

#Private Vs Public Schools by Expenses
ggplot(College, aes(Private, Expend)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 19)  

# Enrollment vs Expenses by Type of College with Linear regression
ggplot(College, aes(Enroll, Expend, colour = Private)) +
  geom_point() +
   stat_smooth(method = "lm",fill = NA)
## `geom_smooth()` using formula 'y ~ x'

# Elite school with Pct. of faculty with Ph.D.'s
Elite <- rep("No", nrow(College))
Elite[College$Top10perc>50] <- "Yes"
Elite<- as.factor(Elite)
college <- data.frame(College, Elite)

ggplot(college, aes(Elite, PhD)) +
  geom_boxplot(outlier.colour = "blue", outlier.shape = 19) 

# Elite Schools with Top 10 Enrollments

newdata<-college[college$Elite=="Yes",]
newdata<-arrange(newdata, desc(Enroll))
newdata<-head(newdata,10)
University<-row.names(newdata)

ggplot(data=newdata, aes(x=University, y=Enroll, fill=University)) +
  ggtitle("Elite Schools with Top 10 Enrollments in 2005")+
  geom_bar(stat="identity",width=0.6,  fill="steelblue")+
  theme_minimal() +
  geom_text(aes(label=Enroll), vjust= 1, hjust=1.2, color="white", size=3.8)+
  coord_flip()

2. Fit a linear model using least squares on training set

Data Partition

ind <- sample(2, nrow(College), replace=TRUE, prob=c(0.8,0.2))

train = College[ind==1,]
test = College[ind==2,]

Scaling

scaled_data <- preProcess(train, method = c('center', 'scale'))

training <- predict(scaled_data, train)
testing <- predict(scaled_data, test)

#Actual Values
y_train <- training$Apps
y_test <- testing$Apps

#one hot encoding
one_hot_encoding <- dummyVars(Apps ~ ., data = training)
x_train <- predict(one_hot_encoding, training)
x_test <- predict(one_hot_encoding, testing)

Fitting a linear model and reporting a test error

lm.fit = lm(Apps~., data=training)
lm.pred = predict(lm.fit, testing)


model_info <- postResample(lm.pred, testing$Apps)
3. Report the test error obtained
#Root Mean Squared Error is calculated using sqrt(mean((pred - obs)^2
model_info[1]
##      RMSE 
## 0.3788853
#R^2 is calculated with using as the square of the correlation between the observed and predicted outcomes:
model_info[2]
##  Rsquared 
## 0.9407619
#Mean absolute error is calculated using mean(abs(pred-obs))
model_info[3]
##       MAE 
## 0.1847903
4. Perform Ridge and Lasso Regression

Ridge Regression(L2 regularization)

Ridge and Lasso regression are some of the simple techniques to reduce model complexity and prevent over-fitting which may result from simple linear regression.

In ridge regression, the cost function is altered by adding a penalty term (which is the last term of the equation below) which is equivalent to the square of the magnitude of the coefficient.

Cost function with Ridge regression penalty term ==> \(\sum_{i = 1}^{n} (y_i - \sum_{j = 1}^{p}(x_{ij}β_j)^2+λ \sum_{j = 1}^{p}(β_j)^2)\)

The penalty term (lambda) regularizes the coefficients such that if the coefficients take large values the optimization function is penalized. So, ridge regression shrinks the coefficients and it helps to reduce the model complexity and multi-collinearity but does not reduce the number of the variables since it never leads to a coefficient to zero. thus, this model is not good for feature reduction.

The glmnet() function has an alpha argument that determines what type of model is fit. λ =0, then a ridge regression model is fit, and λ=1 then lasso model is fit.

ridge_fit <- train(x = x_train, y = y_train,
                   method = 'glmnet', 
                   trControl = trainControl(method = 'cv', number = 10),
                   tuneGrid = expand.grid(alpha = 0,
                                          lambda = seq(0, 10e2, length.out = 20)))


ridge_info <- postResample(predict(ridge_fit, x_test), y_test)
#Root Mean Squared Error 
ridge_info[1]
##      RMSE 
## 0.5301493
#R^2  outcomes:
ridge_info[2]
##  Rsquared 
## 0.8934729
#Mean absolute error 
ridge_info[3]
##       MAE 
## 0.2092366
#Plot
plot(ridge_fit)

#Print Feature Importance
plot(varImp(ridge_fit))

Lasso Regression (L1 Regularization)

Lasso regression stands for Least Absolute Shrinkage and Selection Operator.

It adds penalty term to the cost function. This term is the absolute sum of the coefficients which is the last term of the equation below.

Cost function with Lasso regression penalty term ==> \(\sum_{i = 1}^{n} (y_i - \sum_{j = 1}^{p}(x_{ij}β_j)^2+λ \sum_{j = 1}^{p}|β_j|)\)

As the value of coefficients increases from 0 this term penalizes, cause model, to decrease the value of coefficients in order to reduce loss. The difference between ridge and lasso regression is that it tends to make coefficients to absolute zero as compared to Ridge which never sets the value of coefficient to absolute zero. Limitation of Lasso Regression: Lasso sometimes struggles with some types of data. If the number of predictors (p) is greater than the number of observations (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant (or may be used in the test set). If there are two or more highly collinear variables then LASSO regression select one of them randomly which is not good for the interpretation of data

The glmnet() function has an alpha argument that determines what type of model is fit. λ =0, then a ridge regression model is fit, and λ=1 then lasso model is fit.

lasso_fit <- train(x = x_train, y = y_train, 
                   method = 'glmnet',
                   trControl = trainControl(method = 'cv', number = 10),
                   tuneGrid = expand.grid(alpha = 1,
                                          lambda = seq(0.0001, 1, length.out = 50)))

lasso_info <- postResample(predict(lasso_fit, x_test), y_test)
#Root Mean Squared Error 
lasso_info[1]
##      RMSE 
## 0.3816709
#R^2  outcomes:
lasso_info[2]
##  Rsquared 
## 0.9406489
#Mean absolute error 
lasso_info[3]
##       MAE 
## 0.1833978
#Plot
plot(lasso_fit)

#Print Feature Importance
plot(varImp(lasso_fit))

5. Fit PCR and PLS Models on training set

Fit a PCR model on the training set, with M chosen by cross validation

pcr_fit <- train(x = x_train, y = y_train,
                   method = 'pcr',
                   trControl = trainControl(method = 'cv', number = 10),
                   tuneGrid = expand.grid(ncomp = 1:10))

pcr_info <- postResample(predict(pcr_fit, x_test), y_test)
#Root Mean Squared Error 
pcr_info[1]
##      RMSE 
## 0.6160806
#R^2  outcomes:
pcr_info[2]
## Rsquared 
## 0.831912
#Mean absolute error 
pcr_info[3]
##       MAE 
## 0.2352727
#Plot
plot(pcr_fit)

#Print Feature Importance
plot(varImp(pcr_fit))

Fit a PLS model on the training set, with M chosen by cross validation

pls_fit <- train(x = x_train, y = y_train,
                   method = 'pls',
                   trControl = trainControl(method = 'cv', number = 10),
                   tuneGrid = expand.grid(ncomp = 1:10))
pls_info <- postResample(predict(pls_fit, x_test), y_test)
#Root Mean Squared Error 
pls_info[1]
##      RMSE 
## 0.3774951
#R^2  outcomes:
pls_info[2]
##  Rsquared 
## 0.9423937
#Mean absolute error 
pls_info[3]
##       MAE 
## 0.1830492
#Plot
plot(pls_fit)

#Print Feature Importance
plot(varImp(pls_fit))

6. Find the Most Important Features and Create the Final Model

par(mfrow=c(1,4))
plot(varImp(ridge_fit), top=5)
plot(varImp(lasso_fit), top=5)
plot(varImp(pcr_fit), top=5)
plot(varImp(pls_fit), top=5)

If we combine the top 5 important variables; in our final model we should have Accept, Enroll, F.Undergrad, Top10perc, P.Undergrad, Expend, Outstate, Private, PhD. These are the variables that comes out from the models.

Final Model

final.fit = lm(Apps~Accept+Enroll+F.Undergrad+Top10perc+P.Undergrad+Expend+Outstate+Private+PhD,   
               data=training)
final.pred = predict(final.fit, testing)


final_info <- postResample(final.pred, testing$Apps)
#Root Mean Squared Error
final_info[1]
##      RMSE 
## 0.3753511
#R^2 :
final_info[2]
##  Rsquared 
## 0.9419864
#Mean absolute error 
final_info[3]
##      MAE 
## 0.183696
7. Compare RMSE and R2 Scores of the models
df<-data_frame(
  id = 1:6,
  Model = c('Linear', 'Ridge', 'Lasso', 'PCR', 'PLS', "Final"),
  RMSE =  c(model_info[1],ridge_info[1],lasso_info[1],pcr_info[1], pls_info[1],final_info[1]), 
  R.Squ = c(model_info[2],ridge_info[2],lasso_info[2],pcr_info[2], pls_info[2],final_info[2]),
  MAE =   c(model_info[3],ridge_info[3],lasso_info[3],pcr_info[3], pls_info[3],final_info[3]))

formattable(df, list(
                Model = color_tile("white", "orange"),
                area(col = c(RMSE, MAE)) ~ normalize_bar("pink", 0.2),
                area(col = c(R.Squ)) ~ normalize_bar("green", 0.2)
              
                
    ))
id Model RMSE R.Squ MAE
1 Linear 0.3788853 0.9407619 0.1847903
2 Ridge 0.5301493 0.8934729 0.2092366
3 Lasso 0.3816709 0.9406489 0.1833978
4 PCR 0.6160806 0.8319120 0.2352727
5 PLS 0.3774951 0.9423937 0.1830492
6 Final 0.3753511 0.9419864 0.1836960

The scores are very close with the exception of PCR model. My final model returns the lowest RMSE and MAE score, it also returns one of the highest R square value with limited variables.





*************************