In this project, I evaluate the College data set using regularization methods and make predictions on held-out test data.
Data:
The College data set contains statistics for a large number of US colleges from the 1995 issue of US News and World Report. The data frame has 777 observations on the following 18 variables:
Private: A factor with levels No and Yes indicating private or public university
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Pct. new students from top 10% of H.S. class
Top25perc: Pct. new students from top 25% of H.S. class
F.Undergrad: Number of fulltime undergraduates
P.Undergrad: Number of parttime undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Pct. of faculty with Ph.D.’s
Terminal: Pct. of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Pct. alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
Objective:
Analyze the data
Fit a linear model using least squares on the training set
Report the test error obtained
Fit ridge and lasso regression models (shrinkage methods) on the training set
Fit PCR and PLS models on the training set, with M chosen by cross-validation
Find the Most Important Features and Create the Final Model
Compare the models' RMSE and R2 scores
Loading Libraries
#Loading necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v stringr 1.4.0
## v tidyr 1.1.3 v forcats 0.5.1
## v readr 2.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-2
library(formattable)
library(pls) #PCR & PLS regression
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(ISLR)
library(ggplot2)
Load Data Set
head(College,2)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
#Structure of the data
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
#check missing variables
sum(is.na(College))
## [1] 0
There is no missing data in the dataset.
# Smallest schools
head(arrange(College, Enroll),2)
## Private Apps Accept Enroll Top10perc Top25perc F.Undergrad
## Capitol College Yes 100 90 35 10 52 282
## Wilson College Yes 167 130 46 16 50 199
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## Capitol College 331 8400 2812 300 2134 10 50
## Wilson College 676 11428 5084 450 475 67 76
## S.F.Ratio perc.alumni Expend Grad.Rate
## Capitol College 12.1 24 7976 52
## Wilson College 8.3 43 10291 67
# Largest schools
head(arrange(College, desc(Enroll)),2)
## Private Apps Accept Enroll Top10perc
## Texas A&M Univ. at College Station No 14474 10519 6392 49
## Michigan State University No 18114 15096 6180 23
## Top25perc F.Undergrad P.Undergrad Outstate
## Texas A&M Univ. at College Station 85 31643 2798 5130
## Michigan State University 57 26640 4120 10658
## Room.Board Books Personal PhD Terminal
## Texas A&M Univ. at College Station 3412 600 2144 89 91
## Michigan State University 3734 504 600 93 95
## S.F.Ratio perc.alumni Expend Grad.Rate
## Texas A&M Univ. at College Station 23.1 29 8471 69
## Michigan State University 14.0 9 10520 71
# Highest Pct. of faculty with Ph.D.'s
head(arrange(College, desc(PhD)))
## Private Apps Accept Enroll Top10perc
## Texas A&M University at Galveston No 529 481 243 22
## Bryn Mawr College Yes 1465 810 313 71
## Harvey Mudd College Yes 1377 572 178 95
## Pitzer College Yes 1133 630 220 37
## Brown University Yes 12586 3239 1462 87
## Claremont McKenna College Yes 1860 767 227 71
## Top25perc F.Undergrad P.Undergrad Outstate
## Texas A&M University at Galveston 47 1206 134 4860
## Bryn Mawr College 95 1088 16 18165
## Harvey Mudd College 100 654 5 17230
## Pitzer College 73 750 30 17688
## Brown University 95 5643 349 19528
## Claremont McKenna College 93 887 1 17000
## Room.Board Books Personal PhD Terminal
## Texas A&M University at Galveston 3122 600 650 103 88
## Bryn Mawr College 6750 500 1200 100 100
## Harvey Mudd College 6690 700 900 100 100
## Pitzer College 5900 650 850 100 100
## Brown University 5926 720 1100 99 100
## Claremont McKenna College 6010 500 850 99 99
## S.F.Ratio perc.alumni Expend Grad.Rate
## Texas A&M University at Galveston 17.4 16 6415 43
## Bryn Mawr College 12.3 49 17449 89
## Harvey Mudd College 8.2 46 21569 100
## Pitzer College 10.4 11 14820 73
## Brown University 7.6 39 20440 97
## Claremont McKenna College 9.6 52 18443 87
#College applications vs. enrollment by type of college
ggplot(College, aes(Apps, Enroll, colour = Private)) +
geom_point()
#Private Vs Public Schools by Enrollment
ggplot(College, aes(Private, Enroll)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 19)
#Private Vs Public Schools by Expenses
ggplot(College, aes(Private, Expend)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 19)
# Enrollment vs Expenses by Type of College with Linear regression
ggplot(College, aes(Enroll, Expend, colour = Private)) +
geom_point() +
stat_smooth(method = "lm",fill = NA)
## `geom_smooth()` using formula 'y ~ x'
# Elite school with Pct. of faculty with Ph.D.'s
Elite <- rep("No", nrow(College))
Elite[College$Top10perc>50] <- "Yes"
Elite<- as.factor(Elite)
college <- data.frame(College, Elite)
ggplot(college, aes(Elite, PhD)) +
geom_boxplot(outlier.colour = "blue", outlier.shape = 19)
# Elite Schools with Top 10 Enrollments
newdata<-college[college$Elite=="Yes",]
newdata<-arrange(newdata, desc(Enroll))
newdata<-head(newdata,10)
University<-row.names(newdata)
ggplot(data=newdata, aes(x=University, y=Enroll)) +
ggtitle("Elite Schools with Top 10 Enrollments in 2005")+
geom_bar(stat="identity",width=0.6, fill="steelblue")+
theme_minimal() +
geom_text(aes(label=Enroll), vjust= 1, hjust=1.2, color="white", size=3.8)+
coord_flip()
Data Partition
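# Note: calling set.seed() before sample() would make this 80/20 split (and all results below) reproducible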
ind <- sample(2, nrow(College), replace=TRUE, prob=c(0.8,0.2))
train = College[ind==1,]
test = College[ind==2,]
Scaling
The centering and scaling parameters are estimated on the training set only and then applied to both sets, so no information leaks from the test set.
scaled_data <- preProcess(train, method = c('center', 'scale'))
training <- predict(scaled_data, train)
testing <- predict(scaled_data, test)
#Actual Values
y_train <- training$Apps
y_test <- testing$Apps
#one hot encoding
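# dummyVars() builds a predictor matrix: the Private factor is expanded into numeric indicator columns and the response Apps is excluded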
one_hot_encoding <- dummyVars(Apps ~ ., data = training)
x_train <- predict(one_hot_encoding, training)
x_test <- predict(one_hot_encoding, testing)
Fitting a linear model and reporting the test error
lm.fit = lm(Apps~., data=training)
lm.pred = predict(lm.fit, testing)
model_info <- postResample(lm.pred, testing$Apps)
#Root Mean Squared Error is calculated using sqrt(mean((pred - obs)^2))
model_info[1]
## RMSE
## 0.3788853
#R^2 is calculated as the square of the correlation between the observed and predicted outcomes:
model_info[2]
## Rsquared
## 0.9407619
#Mean absolute error is calculated using mean(abs(pred-obs))
model_info[3]
## MAE
## 0.1847903
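As a sanity check, these three numbers can be reproduced by hand from the predictions; a minimal sketch using the lm.pred and testing objects defined above (caret's Rsquared is the squared correlation between predictions and observations):
rmse_manual <- sqrt(mean((lm.pred - testing$Apps)^2)) # root mean squared error
r2_manual   <- cor(lm.pred, testing$Apps)^2           # squared correlation, as in postResample()
mae_manual  <- mean(abs(lm.pred - testing$Apps))      # mean absolute error
c(RMSE = rmse_manual, Rsquared = r2_manual, MAE = mae_manual)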
Ridge Regression (L2 Regularization)
Ridge and lasso regression are simple techniques for reducing model complexity and preventing the over-fitting that can arise with ordinary least squares.
In ridge regression, the cost function is altered by adding a penalty term, equal to the sum of the squared coefficients, which is the last term of the equation below.
Cost function with the ridge regression penalty term: \(\sum_{i = 1}^{n} \left(y_i - \sum_{j = 1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j = 1}^{p} \beta_j^2\)
The tuning parameter \(\lambda\) regularizes the coefficients: when the coefficients take large values, the optimization objective is penalized. Ridge regression therefore shrinks the coefficients, which helps reduce model complexity and multicollinearity, but it does not reduce the number of variables, because it never shrinks a coefficient exactly to zero. This makes it unsuitable for feature selection.
The glmnet() function has an alpha argument that determines what type of model is fit: alpha = 0 fits a ridge regression model, and alpha = 1 fits a lasso model.
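Before tuning with caret below, the shrinkage can be seen directly with glmnet itself; a minimal sketch using the x_train and y_train objects built above (the s value is an arbitrary illustrative choice):
ridge_path <- glmnet(x_train, y_train, alpha = 0) # alpha = 0 selects the ridge penalty
plot(ridge_path, xvar = "lambda")                 # coefficient paths shrink toward zero as lambda grows
coef(ridge_path, s = 1)                           # shrunken coefficients, yet none is exactly zero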
ridge_fit <- train(x = x_train, y = y_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 10e2, length.out = 20)))
ridge_info <- postResample(predict(ridge_fit, x_test), y_test)
#Root Mean Squared Error
ridge_info[1]
## RMSE
## 0.5301493
#R^2 outcomes:
ridge_info[2]
## Rsquared
## 0.8934729
#Mean absolute error
ridge_info[3]
## MAE
## 0.2092366
#Plot
plot(ridge_fit)
#Print Feature Importance
plot(varImp(ridge_fit))
Lasso Regression (L1 Regularization)
Lasso regression stands for Least Absolute Shrinkage and Selection Operator.
It adds a penalty term to the cost function: the sum of the absolute values of the coefficients, which is the last term of the equation below.
Cost function with the lasso regression penalty term: \(\sum_{i = 1}^{n} \left(y_i - \sum_{j = 1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j = 1}^{p} |\beta_j|\)
As the coefficients grow away from 0, this term penalizes the model, causing it to shrink the coefficients in order to reduce the loss. The difference from ridge regression is that lasso tends to drive coefficients exactly to zero, whereas ridge never sets a coefficient to exactly zero.
Limitations of lasso regression: lasso sometimes struggles with certain types of data. If the number of predictors (p) is greater than the number of observations (n), lasso will pick at most n predictors as non-zero, even if all predictors are relevant. If there are two or more highly collinear variables, lasso selects one of them essentially at random, which is not good for interpreting the data.
As before, the alpha argument of glmnet() selects the penalty: alpha = 1 fits a lasso model.
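A minimal sketch of this sparsity, again using glmnet directly with the matrices built above (the value s = 0.1 is an arbitrary illustrative choice):
lasso_path <- glmnet(x_train, y_train, alpha = 1) # alpha = 1 selects the lasso penalty
coef(lasso_path, s = 0.1)                         # at this lambda, several coefficients are exactly zero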
lasso_fit <- train(x = x_train, y = y_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0.0001, 1, length.out = 50)))
lasso_info <- postResample(predict(lasso_fit, x_test), y_test)
#Root Mean Squared Error
lasso_info[1]
## RMSE
## 0.3816709
#R^2 outcomes:
lasso_info[2]
## Rsquared
## 0.9406489
#Mean absolute error
lasso_info[3]
## MAE
## 0.1833978
#Plot
plot(lasso_fit)
#Print Feature Importance
plot(varImp(lasso_fit))
Fit a PCR model on the training set, with M chosen by cross-validation
Principal components regression (PCR) first summarizes the predictors into M principal components and then regresses the response on those components; here M is tuned over 1 to 10 with 10-fold cross-validation.
pcr_fit <- train(x = x_train, y = y_train,
method = 'pcr',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(ncomp = 1:10))
pcr_info <- postResample(predict(pcr_fit, x_test), y_test)
#Root Mean Squared Error
pcr_info[1]
## RMSE
## 0.6160806
#R^2 outcomes:
pcr_info[2]
## Rsquared
## 0.831912
#Mean absolute error
pcr_info[3]
## MAE
## 0.2352727
#Plot
plot(pcr_fit)
#Print Feature Importance
plot(varImp(pcr_fit))
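For reference, the same search over M can be run directly with the pls package loaded earlier; a minimal sketch, assuming the scaled training data frame from above (the analogous function for PLS is plsr()):
pcr_direct <- pcr(Apps ~ ., data = training, validation = "CV") # 10-segment cross-validation by default
validationplot(pcr_direct, val.type = "RMSEP")                  # CV error against number of components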
Fit a PLS model on the training set, with M chosen by cross-validation
Partial least squares (PLS) is a supervised alternative to PCR: its components are constructed to explain both the predictors and the response, not the predictors alone.
pls_fit <- train(x = x_train, y = y_train,
method = 'pls',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(ncomp = 1:10))
pls_info <- postResample(predict(pls_fit, x_test), y_test)
#Root Mean Squared Error
pls_info[1]
## RMSE
## 0.3774951
#R^2 outcomes:
pls_info[2]
## Rsquared
## 0.9423937
#Mean absolute error
pls_info[3]
## MAE
## 0.1830492
#Plot
plot(pls_fit)
#Print Feature Importance
plot(varImp(pls_fit))
Find the Most Important Features and Create the Final Model
par(mfrow=c(1,4))
plot(varImp(ridge_fit), top=5)
plot(varImp(lasso_fit), top=5)
plot(varImp(pcr_fit), top=5)
plot(varImp(pls_fit), top=5)
Combining the top five most important variables across the models, the final model should include Accept, Enroll, F.Undergrad, Top10perc, P.Undergrad, Expend, Outstate, Private, and PhD. These are the variables that come out of the models above.
Final Model
final.fit = lm(Apps~Accept+Enroll+F.Undergrad+Top10perc+P.Undergrad+Expend+Outstate+Private+PhD,
data=training)
final.pred = predict(final.fit, testing)
final_info <- postResample(final.pred, testing$Apps)
#Root Mean Squared Error
final_info[1]
## RMSE
## 0.3753511
#R^2 :
final_info[2]
## Rsquared
## 0.9419864
#Mean absolute error
final_info[3]
## MAE
## 0.183696
df <- tibble(
id = 1:6,
Model = c('Linear', 'Ridge', 'Lasso', 'PCR', 'PLS', "Final"),
RMSE = c(model_info[1],ridge_info[1],lasso_info[1],pcr_info[1], pls_info[1],final_info[1]),
R.Squ = c(model_info[2],ridge_info[2],lasso_info[2],pcr_info[2], pls_info[2],final_info[2]),
MAE = c(model_info[3],ridge_info[3],lasso_info[3],pcr_info[3], pls_info[3],final_info[3]))
formattable(df, list(
Model = color_tile("white", "orange"),
area(col = c(RMSE, MAE)) ~ normalize_bar("pink", 0.2),
area(col = c(R.Squ)) ~ normalize_bar("green", 0.2)
))
| id | Model | RMSE | R.Squ | MAE |
|---|---|---|---|---|
| 1 | Linear | 0.3788853 | 0.9407619 | 0.1847903 |
| 2 | Ridge | 0.5301493 | 0.8934729 | 0.2092366 |
| 3 | Lasso | 0.3816709 | 0.9406489 | 0.1833978 |
| 4 | PCR | 0.6160806 | 0.8319120 | 0.2352727 |
| 5 | PLS | 0.3774951 | 0.9423937 | 0.1830492 |
| 6 | Final | 0.3753511 | 0.9419864 | 0.1836960 |
The scores are very close, with the exception of the PCR model (and, to a lesser extent, ridge, whose coarse lambda grid of 20 values spanning 0 to 1000 likely hurt its tuning). My final model returns the lowest RMSE, and its R2 and MAE are on par with the best of the other models while using fewer variables.