Introduction

A simple modeling with college admissions data obtained from https://www.kaggle.com/mohansacharya/graduate-admissions

Objective :

  • Predict graduate admissions with linear regression
  • Analyze relationships between target and predictor variables

Data Preparation

Loading the required packages for this simple modeling

library(dplyr)
library(MLmetrics)
library(readr)
library(assertive)
library(bigassertr)
library(ggplot2)
library(car)
library(lmtest)
library(ggcorrplot)
library(tidyr)
library(ggpubr)

Loading the dataset

df_admission<-read_csv("Admission_Predict_Ver1.1.csv")
df_admission<-df_admission %>% select(-`Serial No.`)
head(df_admission)
str(df_admission)
## tibble [500 x 8] (S3: tbl_df/tbl/data.frame)
##  $ GRE Score        : num [1:500] 337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL Score      : num [1:500] 118 107 104 110 103 115 109 101 102 108 ...
##  $ University Rating: num [1:500] 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num [1:500] 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num [1:500] 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num [1:500] 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : num [1:500] 1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance of Admit  : num [1:500] 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

The dataset contains several parameters which are considered important during the application for Masters Programs.The parameters included are :

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • SOP - Statement of Purpose Strength ( out of 5 )
  • LOR - Letter of Recommendation Strength ( out of 5 )
  • CGPA - Undergraduate GPA ( out of 10 )
  • Research - Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

Checking if there is any missing data

colSums(is.na(df_admission))
##         GRE Score       TOEFL Score University Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance of Admit 
##                 0                 0                 0                 0

Checking if all values are in range

assert_all_are_in_closed_range(df_admission$`GRE Score`,lower=0,upper=340)
assert_all_are_in_closed_range(df_admission$`TOEFL Score`,lower=0,upper=120)
assert_all_are_in_closed_range(df_admission$`University Rating`,lower=0,upper=5)
assert_all_are_in_closed_range(df_admission$SOP,lower=0,upper=5)
assert_all_are_in_closed_range(df_admission$CGPA,lower=0,upper=10)
assert_all_are_in_closed_range(df_admission$LOR,lower=0,upper=5)
assert_all_are_in_closed_range(df_admission$`Chance of Admit`,lower=0,upper=1)

No errors found! Now it’s time to check the binary scaled variable

assert_01(df_admission$Research)

No errors found as well, now it’s time to do some exploratory data analysis.

Exploratory Data Analysis

Correlation Matrix

Correlation<-round(cor(df_admission),1)
ggcorrplot(Correlation,type="lower",lab=TRUE) +
scale_fill_gradient2(breaks=c(0, 1), limit=c(0, 1),low="white",mid="blue",high="red")+labs(title="Correlation Between Variables",fill="Correlation")

Histogram

df_admission %>% 
  gather(key=Variable,value = Value) %>% 
  ggplot(aes(x=Value))+geom_histogram()+facet_wrap(~Variable,nrow = 2,ncol=4,scales="free")+theme_minimal()

Relationship Between Target and Predictor Variable

a<-df_admission %>% 
  ggplot(aes(x=`GRE Score`,y=`Chance of Admit`))+geom_point()+geom_smooth()
b<-df_admission %>% 
  ggplot(aes(x=`TOEFL Score`,y=`Chance of Admit`))+geom_point()+geom_smooth()
c<-df_admission %>% 
  ggplot(aes(x=`University Rating`,y=`Chance of Admit`))+geom_point()+geom_smooth()
d<-df_admission %>% 
  ggplot(aes(x=SOP,y=`Chance of Admit`))+geom_point()+geom_smooth()
e<-df_admission %>% 
  ggplot(aes(x=LOR,y=`Chance of Admit`))+geom_point()+geom_smooth()
f<-df_admission %>% 
  ggplot(aes(x=CGPA,y=`Chance of Admit`))+geom_point()+geom_smooth()
ggarrange(a,b,c,d,e,f + rremove("x.text"), 
          ncol = 2, nrow = 3)

Modelling

Spliting Into Train and Test Set

To decrease the probability model overfit and bias, we will first split the dataset into training and testing data. We are going to split the data into 70% train set and 30% test set.

n <- nrow(df_admission)


n_data<- round(n * 0.7,0)

index <- sample(seq_len(nrow(df_admission)), size = n_data)
set.seed(111)
df_admission_train <- df_admission[index, ]
df_admission_test <- df_admission[-index, ]

Linear Regression

Analyzing Variables

Now we want to see which of the predicted variables impacts the full model the most. So, we will start off by scaling the whole dataset

df_scaled<-df_admission%>% mutate_if(is.numeric,scale)
model_scaled<-lm(`Chance of Admit`~.,df_scaled)

Based on the scaled model, we will the construct a bar plot to compare values between coefficients

plot_coeffs <- function(mlr_model) {
      coeffs <- coefficients(mlr_model)
      mp <- barplot(coeffs[order(coeffs,decreasing = T)], col="#3F97D0", xaxt='n', main="Regression Coefficients")
      lablist <- names(coeffs)
      text(mp, par("usr")[3], labels = lablist, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=0.6)
}
plot_coeffs(model_scaled)

The plot above shows the coefficient from largest to smallest of the model.

Multiple Linear Regression

From the plot above, we see that the most impacful variables to the graduate admissions in terms of coefficients are GRE Score and TOEFL Score, in which we will try to do some modeling with those variables.

model_1<-lm(`Chance of Admit`~`GRE Score`+`TOEFL Score`,df_admission_train)
summary(model_1)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score`, 
##     data = df_admission_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32337 -0.04167  0.01267  0.04695  0.16062 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.252122   0.117375 -19.187  < 2e-16 ***
## `GRE Score`    0.006779   0.000634  10.691  < 2e-16 ***
## `TOEFL Score`  0.007740   0.001190   6.504 2.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07516 on 347 degrees of freedom
## Multiple R-squared:  0.7221, Adjusted R-squared:  0.7205 
## F-statistic: 450.8 on 2 and 347 DF,  p-value: < 2.2e-16

Now, we will try to examine the possibility of an interaction between two variables

model_2<-lm(`Chance of Admit`~`GRE Score`+`TOEFL Score`+ `GRE Score`:`TOEFL Score`,df_admission_train)
summary(model_2)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     `GRE Score`:`TOEFL Score`, data = df_admission_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32252 -0.04183  0.01083  0.04663  0.15937 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -1.702e+00  1.808e+00  -0.941    0.347
## `GRE Score`                5.047e-03  5.711e-03   0.884    0.377
## `TOEFL Score`              2.578e-03  1.697e-02   0.152    0.879
## `GRE Score`:`TOEFL Score`  1.621e-05  5.315e-05   0.305    0.761
## 
## Residual standard error: 0.07525 on 346 degrees of freedom
## Multiple R-squared:  0.7222, Adjusted R-squared:  0.7198 
## F-statistic: 299.8 on 3 and 346 DF,  p-value: < 2.2e-16

We see from the results above that a interactive relationship does not help either, the variables becomes insignificant and the adjusted R-Squared does not improve at all, so we stick to the first model which is :

\[ Model\ 1\\ Chance\ of\ Admit=-2.201218+ 0.006339\ GRE\ Score + 0.008537\ TOEFL\ Score \] And, we will call this Model 1

Stepwise Regression

Now we are going to start modeling with creating a null and full model to create a stepwise regression

model_null<-lm(`Chance of Admit`~1,df_admission_train)
model_full<-lm(`Chance of Admit`~.,df_admission_train)
model_stepwise<-step(model_full,scope=list(lower=model_null,upper=model_full),direction = "both")
## Start:  AIC=-1965.47
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + 
##     SOP + LOR + CGPA + Research
## 
##                       Df Sum of Sq    RSS     AIC
## - SOP                  1   0.00294 1.2202 -1966.6
## - `University Rating`  1   0.00357 1.2208 -1966.4
## <none>                             1.2172 -1965.5
## - `TOEFL Score`        1   0.01918 1.2364 -1962.0
## - Research             1   0.02208 1.2393 -1961.2
## - `GRE Score`          1   0.04314 1.2604 -1955.3
## - LOR                  1   0.05684 1.2741 -1951.5
## - CGPA                 1   0.31742 1.5347 -1886.4
## 
## Step:  AIC=-1966.63
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + 
##     LOR + CGPA + Research
## 
##                       Df Sum of Sq    RSS     AIC
## - `University Rating`  1   0.00678 1.2270 -1966.7
## <none>                             1.2202 -1966.6
## + SOP                  1   0.00294 1.2172 -1965.5
## - `TOEFL Score`        1   0.02087 1.2411 -1962.7
## - Research             1   0.02253 1.2427 -1962.2
## - `GRE Score`          1   0.04247 1.2627 -1956.7
## - LOR                  1   0.07259 1.2928 -1948.4
## - CGPA                 1   0.34771 1.5679 -1880.9
## 
## Step:  AIC=-1966.69
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + LOR + CGPA + 
##     Research
## 
##                       Df Sum of Sq    RSS     AIC
## <none>                             1.2270 -1966.7
## + `University Rating`  1   0.00678 1.2202 -1966.6
## + SOP                  1   0.00615 1.2208 -1966.4
## - Research             1   0.02390 1.2509 -1961.9
## - `TOEFL Score`        1   0.02668 1.2536 -1961.2
## - `GRE Score`          1   0.04189 1.2689 -1956.9
## - LOR                  1   0.09495 1.3219 -1942.6
## - CGPA                 1   0.38651 1.6135 -1872.8

Now we examine the model

summary(model_stepwise)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     LOR + CGPA + Research, data = df_admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.266334 -0.023151  0.008303  0.035222  0.164953 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.3710322  0.1195284 -11.470  < 2e-16 ***
## `GRE Score`    0.0021409  0.0006247   3.427 0.000684 ***
## `TOEFL Score`  0.0027758  0.0010149   2.735 0.006561 ** 
## LOR            0.0237213  0.0045975   5.160 4.19e-07 ***
## CGPA           0.1191731  0.0114481  10.410  < 2e-16 ***
## Research       0.0201557  0.0077858   2.589 0.010040 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05972 on 344 degrees of freedom
## Multiple R-squared:  0.826,  Adjusted R-squared:  0.8235 
## F-statistic: 326.7 on 5 and 344 DF,  p-value: < 2.2e-16

Comparing and Evaluating Errors

Now, we want to compare model 1 and model stepwise. First we generate predictions from the model

predicted1<-predict(model_1,df_admission_test)
predicted2<-predict(model_stepwise,df_admission_test)

Now, to make sure that the predictions are still in bound of the variable restriction, which is from 0 to 1

predicted1<-ifelse(predicted1>1,1,predicted1)
predicted2<-ifelse(predicted1>1,1,predicted2)
errors <- matrix(c(rmse_1,mae_1,mape_1,rmse_2,mae_2,mape_2),ncol=2,byrow=FALSE)
colnames(errors) <- c("Model 1","model stepwise")
rownames(errors) <- c("RMSE","MAE","MAPE")
error_table <- as.table(errors)
error_table
##         Model 1 model stepwise
## RMSE 0.08144377     0.06130972
## MAE  0.05947997     0.04339546
## MAPE 0.09383281     0.06757797

Final Model

By the stepwise regression, it is found that only one variable which is the university ranking is omitted from the model, resulting in a multiple R-squared of 80.32%. We will Call this model stepwise :

\[ Chance\ of\ Admit = -1.371\ + 0.002\ GRE \ Score + 0.003\ TOEFL \ Score + 0.015\ LOR + 0.122\ CGPA + 0.0272 \ Research \]

We can see that in terms of the error performance, model stepwise outperforms model 1 in every aspect of the error metrics, and thus we can conclude that the stepwise regression has been successful in finding a better model than model 1.

Examining Overfit/Underfit

Now that we have that settled, we are going to check if our model is in any way shape or form overfitted or underfitted.

First, we are going to create a prediction using the train data

predicted3<-predict(model_stepwise,df_admission_train)

Now, we are going to calculate the error measures

Creating a table to compare the errors

errors2 <- matrix(c(rmse_3,mae_3,mape_3,rmse_2,mae_2,mape_2),ncol=2,byrow=FALSE)
colnames(errors2) <- c("Train Set ","Test Set")
rownames(errors2) <- c("RMSE","MAE","MAPE")
error_table2 <- as.table(errors2)
error_table2
##      Train Set    Test Set
## RMSE 0.05920826 0.06130972
## MAE  0.04306291 0.04339546
## MAPE 0.06937679 0.06757797

Now, we see that the errors don’t differ as much, so we can conclude that our model is not over or under fitted.

Evaluating Assumptions

After we have come into terms that model stepwise, the model resulted from the stepwise regression is the final model, it is time to evaluate and check if the underlying assumptions for a multiple linear regression model is fulfilled by model stepwise. In this part, we are going to use an alpha of 5%, which indicates a 5% significance.

Normality

First, we are going to check if our residuals in fact fulfills the normality assumption of the residuals of the model. We are going to use the shapiro-wilk test to check the normality of the residuals.

shapiro.test(model_stepwise$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_stepwise$residuals
## W = 0.94215, p-value = 1.888e-10

From the results above, we see that our p-value is less than of 5% which, indicates that null hypothesis is rejected. Thus, our model’s residuals is not normally distributed

Now, we will try plotting the residuals

hist(model_stepwise$residuals)

From the plot above, we can see that the residuals of the model is in fact skewed to the left and not normally distributed.

Autocorrelation

Next, we are going to check if any auto-correlations exist between the predictor variables, we are using the variance inflation factor, in which a value greater than 10 indicates that auto-correlation in fact exists.

vif(model_stepwise)
##   `GRE Score` `TOEFL Score`           LOR          CGPA      Research 
##      5.082656      3.807112      1.733881      4.666194      1.469552

No autocorrelation exist! Now, we move on to homoscedasticity.

Homoscedasticity

homoscedasticity is a test used to check the homogeneity of variance within the predictors. In this step, we are using the breusch-pagan test.

bptest(model_stepwise)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_stepwise
## BP = 25.882, df = 5, p-value = 9.408e-05

Since, the p-value is less than 5% and thus we reject null hypothesis. So, we can conclude that heteroscedasticity is in fact present in out model.

Linearity
plot(model_stepwise,1)

The plot shows that there is a fairly linear relationship between the fitted line and the residual value

In conclusion, we can conclude that while the stepwise regression managed to generate fairly low errors and good R-squared, the linear model assumptions which is normality and heteroscedasticity fails to adhere, and thus might be an indicator that our data is not really suitable for linear regression.

Model Improvement

Box Cox Transformation

Now, we are going to try if Box-Cox transformation will work on our model.

lambda_try <- boxCox(model_stepwise, lambda = seq(-10, 10,by=0.1))

lambda <- lambda_try$x[ which.max(lambda_try$y) ]

Now, we transform the variable

df_admission_new <- df_admission %>% 
  mutate(`Chance of Admit` = (`Chance of Admit`^lambda - 1) / lambda)

Then, we split into training and testing set

n <- nrow(df_admission_new)
n_data<- round(n * 0.7,0)
index <- sample(seq_len(nrow(df_admission_new)), size = n_data)
set.seed(111)
df_admission_new_train <- df_admission_new[index, ]
df_admission_new_test <- df_admission_new[-index, ]

Now, we start the modeling process

model_4 <- lm(`Chance of Admit` ~ ., df_admission_new_train)
model_transformed <- step(model_4, trace = 0)

Then, we examine the model

summary(model_transformed)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     `University Rating` + LOR + CGPA + Research, data = df_admission_new_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.158650 -0.019352  0.006652  0.024205  0.094424 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.5705279  0.0789858 -19.884  < 2e-16 ***
## `GRE Score`          0.0013436  0.0003889   3.455 0.000619 ***
## `TOEFL Score`        0.0018498  0.0006681   2.769 0.005930 ** 
## `University Rating`  0.0084454  0.0027896   3.027 0.002653 ** 
## LOR                  0.0082005  0.0029445   2.785 0.005649 ** 
## CGPA                 0.0766884  0.0070841  10.825  < 2e-16 ***
## Research             0.0162931  0.0049671   3.280 0.001144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03845 on 343 degrees of freedom
## Multiple R-squared:  0.8368, Adjusted R-squared:  0.834 
## F-statistic: 293.2 on 6 and 343 DF,  p-value: < 2.2e-16

With the model, we are able to get a higher Adjusted R-squared than model stepwise by about 3%

Errors

Now, we examine the errors

errors4 <- matrix(c(rmse_4,mae_4,mape_4),ncol=1,
                  byrow=FALSE)
colnames(errors4) <- c("Model Transformed")
rownames(errors4) <- c("RMSE","MAE","MAPE")
error_table4 <- as.table(errors4)
error_table4
##      Model Transformed
## RMSE        0.03496419
## MAE         0.02610167
## MAPE        0.15422195

Assumptions

Normality

# Normality
shapiro.test(model_transformed$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_transformed$residuals
## W = 0.94716, p-value = 7.297e-10

From the p-value obtained, we come to the conclusion of rejecting null hypothesis, which means that the residuals of the model, is not normally distributed even our box-cox transformation.

hist(model_transformed$residuals)

From the plot of histogram using residuals, we are able to see that the histogram is skewed to the left, thus indicating a non-normal distribution of the residuals.

Autocorrelation

Next, we are going to check if any auto-correlations exist between the predictor variables, we are using the variance inflation factor, in which a value greater than 10 indicates that auto-correlation in fact exists.

vif(model_transformed)
##         `GRE Score`       `TOEFL Score` `University Rating`                 LOR 
##            4.572280            3.892519            2.317527            1.744561 
##                CGPA            Research 
##            4.447588            1.437115

From inspecting the variance inflation factor, we are able to conclude that our model does not contain any autocorrelation as none of the variance inflation factor exceeds 10.

Homoscedasticity

bptest(model_transformed)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_transformed
## BP = 8.6446, df = 6, p-value = 0.1946

From the p-value, it exceeds 5% which we can conclude that we are going to accept null hypothesis of the breusch-pagan test, which means that our model is homoscedastic.

Linearity

plot(model_transformed,1)

From the plot above, we are able to see that the fitted values and residuals has a fairly linear relationship.

While transforming the model was able to result the fulfillment of the homoscedastic assumption, it is not able to get the model to adhere to the normality assumption.

predicted4<-predict(model_transformed,df_admission_new_test)

To make sure the predicted chance of admittance is still in bound.

predicted4<-ifelse(predicted1>1,1,predicted4)

Examining Overfit/Underfit

predicted5<-predict(model_transformed,df_admission_new_test)

To make sure the predicted chance of admittance is still in bound.

predicted5<-ifelse(predicted1>1,1,predicted5)
errors5 <- matrix(c(rmse_4,mae_4,mape_4,rmse_5,mae_5,mape_5),ncol=2,byrow=FALSE)
colnames(errors5) <- c("Train Set ","Test Set")
rownames(errors5) <- c("RMSE","MAE","MAPE")
error_table5 <- as.table(errors5)
error_table5
##      Train Set    Test Set
## RMSE 0.03482717 0.03482717
## MAE  0.02585899 0.02585899
## MAPE 0.15220554 0.15220554

As seen from the matrix above, the model is in fact not underfitted/overfitted as the values of the error metrics don’t differ as much.

General Overview

Overall, there are three resulting models that we have created and improved :

  • Model stepwise which is the model created from the stepwise regression
  • Model transformed is model stepwise in which the response variable is transformed using box-cox transformation

Now, we will compare the errors

errors3 <- matrix(c(rmse_2,mae_2,mape_2,rmse_4,mae_4,mape_4),ncol=2,
                  byrow=FALSE)
colnames(errors3) <- c("Model Stepwise","Model Transformed")
rownames(errors3) <- c("RMSE","MAE","MAPE")
error_table3 <- as.table(errors3)
error_table3
##      Model Stepwise Model Transformed
## RMSE     0.06130972        0.03482717
## MAE      0.04339546        0.02585899
## MAPE     0.06757797        0.15220554

Now, we will compare the r-squared

r<- matrix(c(summary(model_stepwise)$adj.r.squared,summary(model_transformed)$adj.r.squared),ncol=2,
                  byrow=FALSE)
colnames(r) <- c("Model Stepwise","Model Transformed")
rownames(r) <- c("Rsquared")
rsq_table <- as.table(r)
rsq_table
##          Model Stepwise Model Transformed
## Rsquared      0.8235030         0.8339871

Over all, the box-cox transformation definitely helped in boosting the model’s performance in terms of RMSE, MAE and also adjusted r-squared.

However, there is a trade-off in which the MAPE of the model is increased. According to Lewis, C. D in the book Industrial and Business Forecasting Methods, a MAPE that is less than 10% goes into the category of highly accurate forecasting, while 10-20% goes into the category of good forecasting. Which means that while there is an increase of performance in some areas, there is a slight decay in the model’s forecast performance in terms of MAPE.

Furthermore, even with the Box-Cox transformation the residual of the model is not normally distributed, which violates the linear regression assumption. In conclusion, more methods to predict the chance of graduate admittance needs to be explored.

References

  • Lewis, C. D. (1982). Industrial and Business Forecasting Methods. Butterworth-Heinemann.
  • Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019