This project analyzes which variables significantly increase the chance of graduate admission. The dataset was created by Mohan S Acharya and was inspired by the University of California, Los Angeles (UCLA) graduate dataset. I hope this report gives you some new insight.

1 Input Data and Cleansing Data

library(tidyverse)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(dplyr)
library(ggplot2)
library(corrplot)
library(leaps)

data <- read.csv("Admission_Predict_Ver1.1.csv")

glimpse(data)

># Observations: 500
># Variables: 9
># $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
># $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
># $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
># $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
># $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
># $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
># $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
># $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
># $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

Check whether there are any missing values; if there are, we will remove them.

sum(is.na(data))

># [1] 0

Since we don’t need “Serial.No” variable, we will drop it.

data <- data %>% select(GRE.Score, TOEFL.Score, University.Rating, SOP, LOR, CGPA, Research, Chance.of.Admit)
glimpse(data)

># Observations: 500
># Variables: 8
># $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
># $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
># $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
># $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
># $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
># $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
># $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
># $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

Variable descriptions :
- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose (out of 5)
- Letter of Recommendation Strength (out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)

2 Exploratory Data and Modelling

I want to see how the variables are related to one another.

correlation <- cor(data)
corrplot.mixed(correlation, tl.col = "blue", tl.pos = "lt")

The result shows that every variable is correlated with every other variable to some degree, but we cannot conclude that these relationships are cause and effect.

Because some variables are categorical, we convert them to factors so that R creates dummy variables for them in the regression.

data <- data %>%
  mutate(University.Rating = as.factor(University.Rating),
         Research = as.factor(Research),
         SOP = as.factor(SOP), 
         LOR = as.factor(LOR))
glimpse(data)

># Observations: 500
># Variables: 8
># $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
># $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
># $ University.Rating <fct> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
># $ SOP               <fct> 4.5, 4, 3, 3.5, 2, 4.5, 3, 3, 2, 3.5, 3.5, 4, 4, ...
># $ LOR               <fct> 4.5, 4.5, 3.5, 2.5, 3, 3, 4, 4, 1.5, 3, 4, 4.5, 4...
># $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
># $ Research          <fct> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
># $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

Run a stepwise search to find the best OLS model by Akaike Information Criterion (AIC), and use regsubsets to compare candidate subsets by the highest adjusted R-squared.

modelnone <- lm(Chance.of.Admit~1, data)
modelfull <- lm(Chance.of.Admit~., data)
modelaic <- step(modelnone, scope = list(lower = modelnone, upper = modelfull), direction = "both")

># Start:  AIC=-1957
># Chance.of.Admit ~ 1
># 
>#                     Df Sum of Sq    RSS     AIC
># + CGPA               1    7.7401 2.2003 -2709.0
># + GRE.Score          1    6.5275 3.4129 -2489.5
># + TOEFL.Score        1    6.2388 3.7016 -2448.9
># + University.Rating  4    4.7665 5.1738 -2275.5
># + SOP                8    4.7414 5.1989 -2265.1
># + LOR                8    4.1807 5.7596 -2213.9
># + Research           1    2.9620 6.9784 -2131.9
># <none>                           9.9404 -1957.0
># 
># Step:  AIC=-2709.01
># Chance.of.Admit ~ CGPA
># 
>#                     Df Sum of Sq    RSS     AIC
># + GRE.Score          1    0.2081 1.9922 -2756.7
># + TOEFL.Score        1    0.1717 2.0286 -2747.6
># + Research           1    0.1422 2.0580 -2740.4
># + University.Rating  4    0.1085 2.0918 -2726.3
># + LOR                8    0.1302 2.0701 -2723.5
># + SOP                8    0.0786 2.1217 -2711.2
># <none>                           2.2003 -2709.0
># - CGPA               1    7.7401 9.9404 -1957.0
># 
># Step:  AIC=-2756.69
># Chance.of.Admit ~ CGPA + GRE.Score
># 
>#                     Df Sum of Sq    RSS     AIC
># + LOR                8   0.12730 1.8649 -2773.7
># + Research           1   0.06223 1.9299 -2770.6
># + TOEFL.Score        1   0.04998 1.9422 -2767.4
># + University.Rating  4   0.07214 1.9200 -2767.1
># + SOP                8   0.06360 1.9286 -2756.9
># <none>                           1.9922 -2756.7
># - GRE.Score          1   0.20812 2.2003 -2709.0
># - CGPA               1   1.42068 3.4129 -2489.5
># 
># Step:  AIC=-2773.71
># Chance.of.Admit ~ CGPA + GRE.Score + LOR
># 
>#                     Df Sum of Sq    RSS     AIC
># + Research           1   0.05020 1.8147 -2785.3
># + TOEFL.Score        1   0.04011 1.8248 -2782.6
># + University.Rating  4   0.03600 1.8289 -2775.4
># <none>                           1.8649 -2773.7
># + SOP                8   0.02594 1.8389 -2764.7
># - LOR                8   0.12730 1.9922 -2756.7
># - GRE.Score          1   0.20521 2.0701 -2723.5
># - CGPA               1   0.87171 2.7366 -2583.9
># 
># Step:  AIC=-2785.35
># Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research
># 
>#                     Df Sum of Sq    RSS     AIC
># + TOEFL.Score        1   0.04353 1.7711 -2795.5
># + University.Rating  4   0.03098 1.7837 -2786.0
># <none>                           1.8147 -2785.3
># + SOP                8   0.02629 1.7884 -2776.7
># - Research           1   0.05020 1.8649 -2773.7
># - LOR                8   0.11527 1.9299 -2770.6
># - GRE.Score          1   0.13041 1.9451 -2752.7
># - CGPA               1   0.85697 2.6716 -2594.0
># 
># Step:  AIC=-2795.49
># Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + TOEFL.Score
># 
>#                     Df Sum of Sq    RSS     AIC
># <none>                           1.7711 -2795.5
># + University.Rating  4   0.02542 1.7457 -2794.7
># + SOP                8   0.02368 1.7475 -2786.2
># - TOEFL.Score        1   0.04353 1.8147 -2785.3
># - GRE.Score          1   0.04909 1.8202 -2783.8
># - LOR                8   0.10515 1.8763 -2782.7
># - Research           1   0.05363 1.8248 -2782.6
># - CGPA               1   0.63132 2.4025 -2645.1

modeluji <- regsubsets(Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + TOEFL.Score + University.Rating + SOP, data = data, nbest = 2)
plot(modeluji, scale = "adjr2")
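The AIC values in the step() trace can be reproduced by hand. A minimal sketch, assuming the formula that step() applies to linear models, AIC = n log(RSS/n) + 2k, where k is the number of estimated coefficients; the RSS values are copied from the trace, and the helper name `aic_from_rss` is only illustrative:

```r
# Reproduce the AIC values printed by step() from the reported RSS.
# Assumption: AIC = n * log(RSS / n) + 2 * k for a linear model,
# where k counts the estimated coefficients (including the intercept).
aic_from_rss <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 500                                               # observations in the data
aic_null <- aic_from_rss(rss = 9.9404, n = n, k = 1)   # Chance.of.Admit ~ 1
aic_cgpa <- aic_from_rss(rss = 2.2003, n = n, k = 2)   # Chance.of.Admit ~ CGPA

print(round(c(null = aic_null, cgpa = aic_cgpa), 1))
```

A factor such as LOR adds 8 coefficients at once, which is why its AIC penalty is larger than that of a single numeric variable.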

Conclusions :
\[\hat{Chance Of Admit}={\beta_{0}}+{\beta_{1}}\hat{CGPA} + {\beta_{2}}\hat{GREscore} + {\beta_{3}}\hat{LOR_{i}} + {\beta_{4}}\hat{Research_{i}} + {\beta_{5}}\hat{TOEFLscore} + {\upsilon_{i}}\]
Vector of LOR (the baseline level LOR = 1 has all eight dummy variables equal to 0) :

\[\hat{LOR_{1}}|_{0=otherwise}^{1=1.5};\hat{LOR_{2}}|_{0=otherwise}^{1=2}; \hat{LOR_{3}}|_{0=otherwise}^{1=2.5}; \hat{LOR_{4}}|_{0=otherwise}^{1=3}\] \[\hat{LOR_{5}}|_{0=otherwise}^{1=3.5};\hat{LOR_{6}}|_{0=otherwise}^{1=4}; \hat{LOR_{7}}|_{0=otherwise}^{1=4.5};\hat{LOR_{8}}|_{0=otherwise}^{1=5}\]

Vector of Research (the baseline level is Research = 0) : \[\hat{Research_{1}}|_{0=otherwise}^{1=1}\]
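To make the dummy coding concrete, here is a small sketch on a toy factor (the four values are made up and are not taken from the dataset). It shows how R expands a factor into 0/1 columns and treats the first level as the baseline:

```r
# Toy example: how R expands a factor into 0/1 dummy columns.
lor <- factor(c(1, 1.5, 2, 5))      # four illustrative LOR levels
dummies <- model.matrix(~ lor)      # default treatment coding: first level is the baseline
print(dummies)
```

The row for the baseline level has zeros in every dummy column, so its effect is carried by the intercept.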

3 Regression with OLS Method

summary(modelaic)

># 
># Call:
># lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + 
>#     TOEFL.Score, data = data)
># 
># Residuals:
>#       Min        1Q    Median        3Q       Max 
># -0.260596 -0.023715  0.008437  0.033895  0.165406 
># 
># Coefficients:
>#               Estimate Std. Error t value             Pr(>|t|)    
># (Intercept) -1.3274285  0.1172721 -11.319 < 0.0000000000000002 ***
># CGPA         0.1238585  0.0094007  13.175 < 0.0000000000000002 ***
># GRE.Score    0.0018594  0.0005061   3.674             0.000265 ***
># LOR1.5       0.0083960  0.0631531   0.133             0.894291    
># LOR2         0.0384635  0.0611336   0.629             0.529532    
># LOR2.5       0.0524143  0.0611927   0.857             0.392118    
># LOR3         0.0478582  0.0610141   0.784             0.433200    
># LOR3.5       0.0621250  0.0611980   1.015             0.310540    
># LOR4         0.0722710  0.0612815   1.179             0.238844    
># LOR4.5       0.0793080  0.0617473   1.284             0.199614    
># LOR5         0.0947838  0.0620020   1.529             0.126983    
># Research1    0.0255375  0.0066503   3.840             0.000139 ***
># TOEFL.Score  0.0030036  0.0008681   3.460             0.000588 ***
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
># 
># Residual standard error: 0.06031 on 487 degrees of freedom
># Multiple R-squared:  0.8218, Adjusted R-squared:  0.8174 
># F-statistic: 187.2 on 12 and 487 DF,  p-value: < 0.00000000000000022

\[\hat{Chance Of Admit}=-1.33+0.12\hat{CGPA} + 0.002\hat{GREscore} + {\beta_{3}}\hat{LOR_{i}} + {\beta_{4}}\hat{Research_{i}} + 0.003\hat{TOEFLscore} + {\upsilon_{i}}\]
Notes :
\(LOR = 1; \beta_{3} = 0\) (baseline level, absorbed in the intercept), \(LOR = 1.5; \beta_{3} = 0.008\),
\(LOR = 2; \beta_{3} = 0.038\), \(LOR = 2.5; \beta_{3} = 0.052\),
\(LOR = 3; \beta_{3} = 0.048\), \(LOR = 3.5; \beta_{3} = 0.062\),
\(LOR = 4; \beta_{3} = 0.072\), \(LOR = 4.5; \beta_{3} = 0.079\),
\(LOR = 5; \beta_{3} = 0.095\),
\(Research = 1; \beta_{4} = 0.026\), \(Research = 0; \beta_{4} = 0\) (baseline level, absorbed in the intercept)

Analysis from the Model:
1. Overall, the F-statistic has a p-value < 0.05 (the level of significance), so the independent variables are jointly significant and the model fits better than an intercept-only model.
2. Because this is a multiple regression, we look at the adjusted R-squared, which is 0.8174. This means that about 81.74% of the variation in the dependent variable is explained by the independent variables, after adjusting for the number of predictors. We have a good model.
3. Looking at each independent variable with the t-test, all variables have a significant p-value < 0.05 except the dummy variables of Letter of Recommendation Strength (LOR). This is interesting because the AIC search kept LOR in the model. Note that each t-test only compares one LOR level with the baseline level LOR = 1, while the AIC step evaluates the LOR factor as a whole. There seems to be a problem in our model, which we discuss in more depth below.
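The adjusted R-squared in the summary can be recomputed from the multiple R-squared. A short sketch using the standard formula, with the rounded values printed in the summary (n = 500 observations, p = 12 estimated slopes):

```r
# Adjusted R-squared from the reported multiple R-squared.
r2 <- 0.8218   # multiple R-squared from summary(modelaic)
n  <- 500      # number of observations
p  <- 12       # number of slope coefficients (487 residual degrees of freedom)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The adjustment is small here because 12 coefficients are few relative to 500 observations.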

Interpretation from the Model:
1. Of all the independent variables, CGPA has the largest per-unit effect on the chance of admission. Each 1-point increase in CGPA is associated with an increase of about 0.12 in the chance of admission, Ceteris Paribus.
2. The intercept (-1.33) is the fitted value for a candidate at the baseline levels (no Research Experience and LOR = 1) whose CGPA, GRE Score and TOEFL Score are all zero. Such a candidate lies far outside the data, so the intercept should not be read as a penalty of 1.33 for lacking Research Experience or a strong LOR. Relative to the baseline, having Research Experience raises the chance of admission by about 0.026, Ceteris Paribus.
3. GRE Score and TOEFL Score are also important variables that explain the chance of admission.
- Each 1-point increase in GRE Score is associated with an increase of about 0.002 in the chance of admission, Ceteris Paribus.
- Each 1-point increase in TOEFL Score is associated with an increase of about 0.003 in the chance of admission, Ceteris Paribus.
4. In general, the higher the LOR, the higher the estimated chance of admission relative to the baseline LOR = 1.
- With an LOR strength of 1.5, the chance of admission increases by 0.008, Ceteris Paribus.
- With an LOR strength of 5, the chance of admission increases by 0.095, Ceteris Paribus.
Notes : Ceteris Paribus = “all other things being unchanged or constant”
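As a worked example of how the coefficients combine, the sketch below computes the fitted value for the first applicant shown in the glimpse() output (CGPA 9.65, GRE 337, TOEFL 118, LOR 4.5, Research 1). The coefficients are copied from the OLS summary; the vector names are only illustrative:

```r
# Fitted chance of admission for the first applicant, using the OLS estimates.
coefs <- c(intercept = -1.3274285, cgpa = 0.1238585, gre = 0.0018594,
           lor4.5 = 0.0793080, research1 = 0.0255375, toefl = 0.0030036)
# Regressor values: 1 for the intercept and for the dummies that apply.
applicant <- c(intercept = 1, cgpa = 9.65, gre = 337,
               lor4.5 = 1, research1 = 1, toefl = 118)
predicted <- sum(coefs * applicant)
print(round(predicted, 3))
```

The fitted value is about 0.954, compared with the observed Chance.of.Admit of 0.92 for that applicant.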

3.1 Problems

In this section we examine the assumptions that must be met by the OLS model. First we look at the residuals of the model.

plot(modelaic)

From the plots, the residuals do not appear to have a constant variance \(\sigma^{2}\) (homoscedasticity) and do not appear to be normally distributed. Instead, the variance seems to change across observations, \(\sigma_{i}^2\), and the mean of the errors may not be zero, \(E(U_{i}|X_{i}) \neq 0\). If the non-constant variance is confirmed, the OLS estimator is no longer the best (smallest variance) among linear unbiased estimators, so it is no longer BLUE.
To check this, we conduct several tests.

3.1.1 Test the Assumptions

3.1.1.1 Assumption of Normality

This assumption means that the residuals of the linear regression model should be normally distributed around zero. To test it, we use the Shapiro-Wilk test on our residuals :

shapiro.test(modelaic$residuals)

># 
>#  Shapiro-Wilk normality test
># 
># data:  modelaic$residuals
># W = 0.92802, p-value = 0.00000000000000923

\(H_{0}\) = Residuals are normally distributed.
\(H_{1}\) = Residuals are not normally distributed.

The test has a p-value below the significance level of 0.05. Therefore we reject the null hypothesis. We can conclude that the residuals are not normally distributed.
Is a violation of the normality assumption a problem for the model?
Quoting Gujarati (2003),
“Violation of the normality assumption doesn’t contribute to bias or inefficiency in regression models. It is only important for the calculation of p-values for significance testing, but this is only a consideration when sample size is very small. When the sample size is sufficiently large (>200), the normality assumption is not needed at all as the Central Limit Theorem (CLT) ensures that the distribution of disturbance term will approximate normality”
Conclusions :
Since our sample is sufficiently large (500 observations), the normality assumption is not needed because the Central Limit Theorem applies.

3.1.1.2 Assumption of Multicollinearity

Multicollinearity is a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. To check for it, we use the Variance Inflation Factor (VIF).

vif(modelaic)

>#                 GVIF Df GVIF^(1/(2*Df))
># CGPA        4.435488  1        2.106060
># GRE.Score   4.483496  1        2.117427
># LOR         1.765073  8        1.036150
># Research    1.498185  1        1.224004
># TOEFL.Score 3.825015  1        1.955765

A common rule of thumb is that a VIF greater than 10 may indicate high collinearity and is worth further inspection. None of the (generalized) VIF values in our model is greater than 10, so the model does not show high collinearity.
Conclusions : this assumption is fulfilled
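The third column of the vif() output can be derived from the first two. A small sketch with the values copied from the table: for a term with Df dummy columns, the adjusted value is GVIF^(1/(2*Df)), and squaring it gives a number that is, as far as I know, the usual way to compare a multi-column factor such as LOR against the ordinary VIF rule of thumb:

```r
# Recompute GVIF^(1/(2*Df)) from the reported GVIF and Df columns.
gvif <- c(CGPA = 4.435488, GRE.Score = 4.483496, LOR = 1.765073,
          Research = 1.498185, TOEFL.Score = 3.825015)
df   <- c(1, 1, 8, 1, 1)            # degrees of freedom per term
adj  <- gvif^(1 / (2 * df))         # third column of the vif() output
print(round(adj, 6))
print(round(adj^2, 3))              # squared, on the scale of an ordinary VIF
```

For terms with one degree of freedom the squared value is simply the GVIF itself.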

3.1.1.3 Assumption of Homoscedasticity

A common statistical test for homoscedasticity versus heteroscedasticity is the Breusch-Pagan test. Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors is unequal across the range of the target variable, heteroscedasticity is present, which is consistent with the non-random pattern in the residuals noted earlier.

bptest(modelaic)

># 
>#  studentized Breusch-Pagan test
># 
># data:  modelaic
># BP = 31.938, df = 12, p-value = 0.001414

\(H_{0}\) = Homoscedasticity.
\(H_{1}\) = Heteroscedasticity.
The test has a p-value < 0.05, therefore we reject the null hypothesis.
Conclusions : The residuals do not have a constant variance (heteroscedasticity).
To address this problem, I will use the Generalized Least Squares method in the next section.
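The mechanics of the test can be sketched by hand on simulated data. This assumes the studentized (Koenker) form that bptest() uses by default, where the statistic is n times the R-squared of an auxiliary regression of the squared residuals on the regressors. The simulated data and all variable names are made up for illustration:

```r
# Manual studentized Breusch-Pagan test on simulated heteroscedastic data.
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)    # error sd grows with x

fit <- lm(y ~ x)
aux <- lm(resid(fit)^2 ~ x)                   # auxiliary regression
bp_stat <- n * summary(aux)$r.squared         # test statistic
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(BP = bp_stat, p.value = bp_pval))

# The p-value reported above follows from BP = 31.938 with 12 degrees of freedom.
reported_p <- pchisq(31.938, df = 12, lower.tail = FALSE)
print(signif(reported_p, 4))
```

The degrees of freedom equal the number of regressors in the auxiliary regression, which is 12 for the model in this report.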

4 Regression with GLS Method

4.1 Result

For more explanation of the GLS method, see the Appendix (Section 4.2).
First, we take the residuals of the OLS model and fit a variance function to them.

data$resi <- modelaic$residuals
varfunc.ols <- lm(log(resi^2) ~ log(CGPA) + log(GRE.Score) + LOR + Research + 
    log(TOEFL.Score), data = data)
summary(varfunc.ols)

># 
># Call:
># lm(formula = log(resi^2) ~ log(CGPA) + log(GRE.Score) + LOR + 
>#     Research + log(TOEFL.Score), data = data)
># 
># Residuals:
>#      Min       1Q   Median       3Q      Max 
># -14.8623  -0.9910   0.4604   1.4586   4.8703 
># 
># Coefficients:
>#                  Estimate Std. Error t value            Pr(>|t|)    
># (Intercept)         6.633     26.565   0.250              0.8029    
># log(CGPA)          -3.770      2.957  -1.275              0.2029    
># log(GRE.Score)    -15.347      5.948  -2.580              0.0102 *  
># LOR1.5             62.022      2.377  26.094 <0.0000000000000002 ***
># LOR2               61.694      2.301  26.814 <0.0000000000000002 ***
># LOR2.5             62.231      2.304  27.012 <0.0000000000000002 ***
># LOR3               62.294      2.297  27.113 <0.0000000000000002 ***
># LOR3.5             61.801      2.305  26.814 <0.0000000000000002 ***
># LOR4               61.749      2.308  26.753 <0.0000000000000002 ***
># LOR4.5             61.378      2.326  26.392 <0.0000000000000002 ***
># LOR5               61.023      2.335  26.133 <0.0000000000000002 ***
># Research1           0.243      0.250   0.972              0.3315    
># log(TOEFL.Score)    4.402      3.462   1.271              0.2042    
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
># 
># Residual standard error: 2.268 on 487 degrees of freedom
># Multiple R-squared:  0.6155, Adjusted R-squared:  0.606 
># F-statistic: 64.97 on 12 and 487 DF,  p-value: < 0.00000000000000022

The least squares estimate for our variance function is : \[\ln(\hat{\sigma_{i}^{2}})= 6.633-3.770z_{1}-15.347z_{2}+...+4.402z_{12}\]
Notes : the terms in (…) are omitted because the function is lengthy; the full set of coefficients can be seen in the summary above.
The next step is to transform the observations in such a way that the transformed model has a constant error variance. To do so, we obtain variance estimates from : \[\hat{\sigma_{i}^{2}} = \exp[{\alpha_{1}}+{\alpha_{2}z_{i2}}+...+{\alpha_{n}z_{in}}]\] and then divide both sides of the regression model \(\hat{Chance Of Admit}={\beta_{0}}+{\beta_{1}}\hat{CGPA} + {\beta_{2}}\hat{GREscore} + {\beta_{3}}\hat{LOR_{i}} + {\beta_{4}}\hat{Research_{i}} + {\beta_{5}}\hat{TOEFLscore} + {\upsilon_{i}}\) by \(\hat{\sigma_{i}}\). Doing so yields the following equation :

\[(\frac{\hat{Chance Of Admit}}{\sigma_{i}})={\beta_{0}}(\frac{1}{\sigma_{i}})+{\beta_{1}}(\frac{\hat{CGPA}}{\sigma_{i}}) + {\beta_{2}}(\frac{\hat{GREscore}}{\sigma_{i}}) + {\beta_{3}}(\frac{\hat{LOR_{i}}}{\sigma_{i}}) + {\beta_{4}}(\frac{\hat{Research_{i}}}{\sigma_{i}}) + {\beta_{5}}(\frac{\hat{TOEFLscore}}{\sigma_{i}}) + (\frac{\upsilon_{i}}{\sigma_{i}})\]
The transformed error is homoscedastic because \[var(\frac{\upsilon_{i}}{\sigma_{i}})=(\frac{1}{\sigma_{i}^{2}})var(\upsilon_{i}) = (\frac{1}{\sigma_{i}^{2}})\sigma_{i}^{2} = 1\]
We use the estimates of our variance function \(\hat{\sigma_{i}^{2}}\) in place of \(\sigma_{i}^{2}\) to obtain the Generalized Least Squares (GLS) estimators of the coefficients. We define the transformed variables as : \[y_{i}^{*}=(\frac{y_{i}}{\hat{\sigma_{i}}}); x_{i1}^{*}=(\frac{1}{\hat{\sigma_{i}}}); x_{in}^{*}=(\frac{x_{in}}{\hat{\sigma_{i}}})\] and apply least squares to the transformed equation, which is equivalent to Weighted Least Squares on the original variables : \[y_{i}^{*}=\beta_{1}x_{i1}^{*}+...+\beta_{n}x_{in}^{*}+\upsilon_{i}^{*}\]
Now we apply GLS to our model.

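# Estimated variance of each observation: exp() of the fitted log-variance
data$varfunc <- exp(varfunc.ols$fitted.values)
# Note: lm() minimizes sum(weights * residual^2), so dividing the model by
# sigma_i as derived above corresponds to weights = 1/varfunc. The results
# reported below were obtained with weights = 1/sqrt(varfunc), as written here.
modelgls <- lm(Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + 
    TOEFL.Score, weights = 1/sqrt(varfunc), data = data)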
summary(modelgls)

># 
># Call:
># lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + 
>#     TOEFL.Score, data = data, weights = 1/sqrt(varfunc))
># 
># Weighted Residuals:
>#      Min       1Q   Median       3Q      Max 
># -1.60771 -0.16138  0.05272  0.22327  1.02204 
># 
># Coefficients:
>#               Estimate Std. Error t value             Pr(>|t|)    
># (Intercept) -1.2787222  0.0951114 -13.444 < 0.0000000000000002 ***
># CGPA         0.1204355  0.0089794  13.412 < 0.0000000000000002 ***
># GRE.Score    0.0017248  0.0004774   3.613             0.000334 ***
># LOR1.5       0.0170834  0.0202099   0.845             0.398359    
># LOR2         0.0375297  0.0101178   3.709             0.000232 ***
># LOR2.5       0.0563550  0.0108039   5.216   0.0000002706578206 ***
># LOR3         0.0492183  0.0093029   5.291   0.0000001845676990 ***
># LOR3.5       0.0659084  0.0098112   6.718   0.0000000000516575 ***
># LOR4         0.0749338  0.0100990   7.420   0.0000000000005257 ***
># LOR4.5       0.0837127  0.0117136   7.147   0.0000000000032693 ***
># LOR5         0.0976367  0.0122835   7.949   0.0000000000000132 ***
># Research1    0.0286207  0.0063540   4.504   0.0000083439344606 ***
># TOEFL.Score  0.0031810  0.0007784   4.086   0.0000512564184059 ***
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
># 
># Residual standard error: 0.3512 on 487 degrees of freedom
># Multiple R-squared:  0.9791, Adjusted R-squared:  0.9785 
># F-statistic:  1898 on 12 and 487 DF,  p-value: < 0.00000000000000022

\[\hat{Chance Of Admit}=-1.28+0.12\hat{CGPA} + 0.002\hat{GREscore} + {\beta_{3}}\hat{LOR_{i}} + {\beta_{4}}\hat{Research_{i}} + 0.003\hat{TOEFLscore} + {\upsilon_{i}}\]
Notes :
\(LOR = 1; \beta_{3} = 0\) (baseline level, absorbed in the intercept), \(LOR = 1.5; \beta_{3} = 0.017\),
\(LOR = 2; \beta_{3} = 0.038\), \(LOR = 2.5; \beta_{3} = 0.056\),
\(LOR = 3; \beta_{3} = 0.049\), \(LOR = 3.5; \beta_{3} = 0.066\),
\(LOR = 4; \beta_{3} = 0.075\), \(LOR = 4.5; \beta_{3} = 0.084\),
\(LOR = 5; \beta_{3} = 0.098\),
\(Research = 1; \beta_{4} = 0.029\), \(Research = 0; \beta_{4} = 0\) (baseline level, absorbed in the intercept)
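The link between dividing the model by \(\sigma_{i}\) and weighted least squares can be checked on simulated data. A minimal sketch, assuming a known error standard deviation `sigma` (all names and numbers here are made up for illustration): lm() minimizes the weighted sum of squared residuals, so weights equal to the inverse variance, `1 / sigma^2`, give the same coefficients as OLS on the variables divided by `sigma`.

```r
# Weighted least squares versus OLS on transformed variables (simulated data).
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
sigma <- 0.5 * x                        # assumed known error sd per observation
y <- 1 + 2 * x + rnorm(n, sd = sigma)

# Weighted least squares: weights are inverse variances.
fit_wls <- lm(y ~ x, weights = 1 / sigma^2)

# OLS after dividing the response, the intercept column and x by sigma.
y_star  <- y / sigma
x1_star <- 1 / sigma
x2_star <- x / sigma
fit_trans <- lm(y_star ~ 0 + x1_star + x2_star)

print(cbind(wls = coef(fit_wls), transformed = coef(fit_trans)))
```

Both fits return the same intercept and slope, which is the equivalence the derivation relies on.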

Analysis from the Model:
1. Overall, the F-statistic has a p-value < 0.05 (the level of significance), so the independent variables are jointly significant and the model fits better than an intercept-only model.
2. Because this is a multiple regression, we look at the adjusted R-squared, which is 0.9785. This means that about 97.85% of the variation in the weighted dependent variable is explained by the independent variables. Because this value is computed on weighted data, it is not directly comparable with the R-squared of the OLS model.
3. Looking at each independent variable with the t-test, all variables have a significant p-value < 0.05 except the dummy variable for Letter of Recommendation Strength (LOR) 1.5. This is much better than the OLS model.

Interpretation from the Model:
1. Of all the independent variables, CGPA has the largest per-unit effect on the chance of admission. Each 1-point increase in CGPA is associated with an increase of about 0.12 in the chance of admission, Ceteris Paribus.
2. The intercept (-1.28) is the fitted value for a candidate at the baseline levels (no Research Experience and LOR = 1) whose CGPA, GRE Score and TOEFL Score are all zero. Such a candidate lies far outside the data, so the intercept should not be read as a penalty of 1.28 for lacking Research Experience or a strong LOR. Relative to the baseline, having Research Experience raises the chance of admission by about 0.029, Ceteris Paribus.
3. GRE Score and TOEFL Score are also important variables that explain the chance of admission.
- Each 1-point increase in GRE Score is associated with an increase of about 0.002 in the chance of admission, Ceteris Paribus.
- Each 1-point increase in TOEFL Score is associated with an increase of about 0.003 in the chance of admission, Ceteris Paribus.
4. In general, the higher the LOR, the higher the estimated chance of admission relative to the baseline LOR = 1.
- With an LOR strength of 1.5, the chance of admission increases by 0.017, Ceteris Paribus.
- With an LOR strength of 5, the chance of admission increases by 0.098, Ceteris Paribus.
Notes : Ceteris Paribus = “all other things being unchanged or constant”

4.2 Appendix (Math Proof)

When heteroscedasticity is present, the Best Linear Unbiased Estimator (BLUE) depends on the unknown \({\sigma_{i}^{2}}\). This estimator is referred to as the Generalized Least Squares (GLS) estimator. When the ordinary least squares estimator is no longer BLUE, we can solve this problem by transforming the model into one with homoscedastic errors.

To begin, we start from a general specification of the variance function, which can be written as : \[var(\upsilon_{i})={\sigma_{i}^{2}}={\sigma^{2}x_{i}^{\gamma}}\] where \(\gamma\) is an unknown parameter that must be estimated before we proceed with the transformation. Notice that the variance function depends on a constant term \({\sigma^{2}}\) and, for \(\gamma > 0\), increases as \(x_{i}\) increases. It is more convenient to consider a framework more general than the above equation. To introduce this framework, we take the natural log of both sides of the above equation, so that we get :
\[\ln(\sigma_{i}^{2})=\ln(\sigma^{2})+{\gamma}\ln(x_{i})\] Taking the exponential of both sides : \[\sigma_{i}^{2}=\exp[\ln(\sigma^{2})+{\gamma}\ln(x_{i})] = \exp({\alpha_{1}}+{\alpha_{2}z_{i}}) \] where \({\alpha_{1}}=\ln({\sigma^{2}})\), \({\alpha_{2}}=\gamma\), and \(z_{i}=\ln(x_{i})\).
Writing the variance function in this form is convenient because it shows how the variance can be related to any explanatory variable \(z_{i}\). Also, if we believe the variance is likely to depend on more than one explanatory variable, say \(z_{i2}, z_{i3}, ..., z_{in}\), we can extend the equation to \[\sigma_{i}^{2} = \exp[{\alpha_{1}}+{\alpha_{2}z_{i2}}+...+{\alpha_{n}z_{in}}]\] The exponential function is convenient because it ensures that we get positive values for the variances \(\sigma_{i}^{2}\) for all possible values of the parameters \(\alpha_{1}, \alpha_{2}, ..., \alpha_{n}\). Returning to the single-variable equation \(\sigma_{i}^{2} = \exp({\alpha_{1}}+{\alpha_{2}z_{i}})\), we can rewrite it as :

\[\ln(\sigma_{i}^{2})={\alpha_{1}}+{\alpha_{2}z_{i}} \] We now have an equation in which we can estimate the unknown parameters \(\alpha_{1}\) and \(\alpha_{2}\). We can do this the same way we obtain estimates for the parameters \(\beta_{1}\) and \(\beta_{2}\) in a simple regression model \(y_{i} = \beta_{1} + \beta_{2}x_{i} + e_{i}\) using ordinary least squares, with the squares of our least squares residuals \(\hat{\upsilon_{i}}^{2}\) as our observations. Then, we can write the above equation as : \[\ln(\hat{\upsilon_{i}}^2)= \ln(\sigma_{i}^{2})+v_{i} ={\alpha_{1}}+{\alpha_{2}z_{i}}+v_{i}\] where \(v_{i}\) is the error term of this auxiliary regression. We can now apply least squares to get our parameter estimates.
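The estimation step can be sketched on simulated data where the true \(\gamma\) is known. This is only an illustration under assumed values (`gamma_true = 2`, a single regressor `x`, made-up coefficients), not the model fitted in this report:

```r
# Estimate the variance function ln(sigma_i^2) = alpha_1 + alpha_2 * ln(x_i)
# from the log of squared OLS residuals (simulated data).
set.seed(123)
n <- 1000
x <- runif(n, 1, 10)
gamma_true <- 2                                  # var(error) = sigma^2 * x^gamma
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x^(gamma_true / 2))

fit_ols <- lm(y ~ x)                             # step 1: OLS residuals
varfit  <- lm(log(resid(fit_ols)^2) ~ log(x))    # step 2: auxiliary regression
gamma_hat  <- unname(coef(varfit)["log(x)"])     # estimate of alpha_2 = gamma
sigma2_hat <- exp(fitted(varfit))                # step 3: positive variance estimates
print(round(gamma_hat, 2))
```

Taking exp() of the fitted values guarantees positive variance estimates, which is the reason given above for the exponential form.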

5 Conclusions

Comparing the OLS and GLS models, the GLS model is the better choice because it addresses the heteroscedasticity found in the OLS residuals.
In terms of adjusted R-squared, the GLS model reaches 0.9785 against 0.8174 for the OLS model, although the GLS value is computed on weighted data, so the two are not directly comparable.
In terms of significance tests, almost all independent variables in the GLS model are significant (the exception is LOR 1.5), whereas in the OLS model none of the LOR dummy variables is significant.

6 Citations

Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019
Gujarati, D.N. (2004) Basic Econometrics. 4th Edition, McGraw-Hill Companies.

What is Needed for Graduate Admissions?

Aji Putera Tanumihardja

2/22/2020