Introduction

This notebook attempts to predict graduate admission chances using a linear regression model. The data is retrieved from Kaggle.com and is inspired by the UCLA Graduate Dataset, so it reflects real admission cases from a university in the United States. The main goal of this analysis is to predict the chance of a student being admitted to this university based on several variables.

The analysis consists of data preparation, exploratory data analysis, model building, model evaluation, assumption checking, and a conclusion.

Data Preparation

library(tidyverse)
library(ggplot2)
library(data.table)
library(GGally)
library(car)
library(caret)
library(scales)
library(lmtest)
library(MLmetrics)
library(dplyr)

options(scipen = 100, max.print = 1e+06)
adm <- read.csv("Admission_Predict_Ver1.1.csv")
head(adm)
str(adm)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

The data consists of 500 rows and 9 columns:

  1. Serial.No = ID number of each application
  2. GRE.Score = GRE Scores ( out of 340 )
  3. TOEFL.Score = TOEFL Scores ( out of 120 )
  4. University.Rating = Bachelor’s University Rating ( out of 5 )
  5. SOP= Statement of Purpose Strength ( out of 5 )
  6. LOR = Letter of Recommendation Strength ( out of 5 )
  7. CGPA = Undergraduate GPA ( out of 10 )
  8. Research = Research Experience ( either 0 or 1 )
  9. Chance.of.Admit = Chance of Admit ( ranging from 0 to 1 )

Of the 9 columns, GRE.Score, TOEFL.Score, and CGPA are continuous variables, while the others can be treated as categorical. Although SOP, LOR, Research, and University.Rating are stored as numbers, they are ratings or indicators rather than true continuous measurements. Chance.of.Admit is the target variable. Serial.No is only an identifier and will be omitted.

# Change Research to factor
adm2 <- adm %>%
  select(-Serial.No.) %>% 
  mutate(Research = as.factor(Research))
# Check NA Value
anyNA(adm2)
## [1] FALSE
# Check Summary 
summary(adm2)
##    GRE.Score      TOEFL.Score    University.Rating      SOP       
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.500  
##  Median :317.0   Median :107.0   Median :3.000     Median :3.500  
##  Mean   :316.5   Mean   :107.2   Mean   :3.114     Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.000  
##       LOR             CGPA       Research Chance.of.Admit 
##  Min.   :1.000   Min.   :6.800   0:220    Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.127   1:280    1st Qu.:0.6300  
##  Median :3.500   Median :8.560            Median :0.7200  
##  Mean   :3.484   Mean   :8.576            Mean   :0.7217  
##  3rd Qu.:4.000   3rd Qu.:9.040            3rd Qu.:0.8200  
##  Max.   :5.000   Max.   :9.920            Max.   :0.9700

Exploratory Data Analysis

Data Distribution

adm_long <- adm %>% 
  select(-Research) %>% 
  pivot_longer(-Serial.No.)

ggplot(data = adm_long, aes(x = name, y = value, fill = name)) +
  geom_boxplot() +
  facet_wrap(facets = ~name, scales = 'free')

The facet boxplot above shows the distribution of each variable. Only 2 variables have outliers: LOR and Chance.of.Admit. The majority of the data lies between Q1 and Q3 (the interquartile range).
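To verify which points the boxplot flags, the 1.5 × IQR rule can be applied directly; a minimal sketch using base R's boxplot.stats():

# Values flagged as outliers by the 1.5 * IQR rule
boxplot.stats(adm2$LOR)$out             # outlying LOR ratings
boxplot.stats(adm2$Chance.of.Admit)$out # outlying admission chances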

Correlation

Before building the model, exploratory data analysis is needed to explain the relationships between the predictors and the target. Here, the predictors are all variables except Chance.of.Admit, which is the target. The target takes the form of a likelihood: values closer to 1 mean a high chance of being accepted, and vice versa.

ggcorr(data = adm2, label = T) +
  labs(title = "Correlation Matrix")

All variables appear to be positively correlated with the target Chance.of.Admit. The predictor most strongly correlated with the target is the bachelor's CGPA; the other predictors have correlation scores of roughly 0.6 to 0.8. Research is excluded from this correlation matrix because its data type is categorical (a factor in R). The predictors are also strongly correlated with each other, for example CGPA with TOEFL.Score and CGPA with GRE.Score. This is natural: a student who graduates with a high CGPA is likely to also score well on the TOEFL and GRE.
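To read the exact coefficients rather than the rounded labels in the plot, the correlation matrix of the numeric columns can be printed directly; a small sketch:

# Exact Pearson correlations between the numeric columns
adm2 %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)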

The chart below shows the correlation coefficients, data distributions, and scatter plots.

library(psych)
pairs.panels(adm2,
             method = "spearman",
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE
             )

This chart further confirms that every predictor has a reasonably strong correlation with the target variable.

Cross Validation

# Split the data with 80:20 proportion
RNGkind(sample.kind = "Rounding")
set.seed(99)
adm_split <- sample(nrow(adm2), nrow(adm2)*0.8)
adm_train <- adm2[adm_split,]
adm_test <- adm2[-adm_split,]

# Dimension of Train and Test data
dim(adm_train)
## [1] 400   8
dim(adm_test)
## [1] 100   8

Linear Regression Model

Before building a more sophisticated model, I first make a simple model with a single predictor, so that the simple and complex models can be compared at the end.

Single Predictor CGPA

# Simple Linear Regression
model_simple <- lm(formula = Chance.of.Admit ~ CGPA, data = adm_train)
summary(model_simple)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = adm_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.280464 -0.030270  0.005316  0.039231  0.169758 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -0.992720   0.044236  -22.44 <0.0000000000000002 ***
## CGPA         0.200370   0.005145   38.94 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06257 on 398 degrees of freedom
## Multiple R-squared:  0.7921, Adjusted R-squared:  0.7916 
## F-statistic:  1517 on 1 and 398 DF,  p-value: < 0.00000000000000022

Interpretation

From the summary above, the estimated intercept is -0.99 and the estimated CGPA coefficient is 0.20. The intercept means that when the CGPA is 0, the predicted chance of admission is -0.99; since a CGPA of 0 is far below the observed minimum of 6.8, the intercept has no practical interpretation on its own. The adjusted R-squared of 0.79 means that this model explains approximately 79% of the variance in the target. The fitted equation can therefore be written as:

\[ \hat{Y}_i = -0.99 + 0.20 \, CGPA_i \]
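As a sanity check, the equation can be evaluated by hand and compared with predict(); a small sketch using an illustrative CGPA of 9.0 (a hypothetical value, not from the data above):

# Manual prediction vs predict() for a hypothetical CGPA of 9.0
-0.99 + 0.20 * 9.0                                       # ~0.81 by hand
predict(model_simple, newdata = data.frame(CGPA = 9.0))  # ~0.81 from the model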

Multiple Predictors with Stepwise Backward Regression

To keep only the useful predictors, the multiple linear regression model is refined with stepwise backward elimination.

# Stepwise Backward Regression Model
model_adm <- lm(formula = Chance.of.Admit ~ ., data = adm_train)
model_adm_back <- step(object = model_adm, direction = "backward", trace = F)
summary(model_adm_back)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + 
##     LOR + CGPA + Research, data = adm_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.271283 -0.022847  0.007868  0.032371  0.149924 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.2750160  0.1032684 -12.347 < 0.0000000000000002 ***
## GRE.Score    0.0019288  0.0005086   3.793             0.000172 ***
## TOEFL.Score  0.0029130  0.0009129   3.191             0.001532 ** 
## SOP          0.0065736  0.0043817   1.500             0.134354    
## LOR          0.0149887  0.0041781   3.587             0.000376 ***
## CGPA         0.1152002  0.0096328  11.959 < 0.0000000000000002 ***
## Research1    0.0268645  0.0066635   4.032            0.0000666 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05537 on 393 degrees of freedom
## Multiple R-squared:  0.8392, Adjusted R-squared:  0.8367 
## F-statistic: 341.8 on 6 and 393 DF,  p-value: < 0.00000000000000022

Interpretation

Of the 6 predictors retained, 5 are significant; only SOP is not. The model's Multiple R-squared and Adjusted R-squared are both 0.84 (2 decimals). The fitted equation is:

\[ \hat{Y}_i = -1.275 + 0.0019 \, GRE.Score_i + 0.0029 \, TOEFL.Score_i + 0.0066 \, SOP_i + 0.0150 \, LOR_i + 0.1152 \, CGPA_i + 0.0269 \, Research_i \]

Model Comparison

Comparing the two models, the multiple-predictor model has a higher R-squared (adjusted R-squared of 0.84 vs. 0.79) than the simple one, as the sketch below shows. Therefore, for the rest of the analysis and prediction, I will use the multiple-predictor model, model_adm_back.
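A minimal sketch reproducing the comparison from the two model summaries:

# Side-by-side adjusted R-squared of the two models
data.frame(
  model = c("model_simple", "model_adm_back"),
  adj_r_squared = c(summary(model_simple)$adj.r.squared,
                    summary(model_adm_back)$adj.r.squared)
)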

Model Evaluation

# Predicting testing data to the model
predicted_mv <- predict(object = model_adm_back, newdata = adm_test)

# Using MAE to calculate the error
MAE(y_pred = predicted_mv, y_true = adm_test$Chance.of.Admit)
## [1] 0.05310792

For model evaluation, I use MAE (Mean Absolute Error) because it is easy to interpret on the target's 0-1 scale and is less affected by outliers. The MAE of this model is 0.053, meaning the predictions are off by about 5 percentage points on average, which indicates good predictive performance.
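MAE is simply the mean absolute difference between the actual and predicted values; computing it by hand confirms the MLmetrics result:

# MAE computed manually: mean of |actual - predicted|
mean(abs(adm_test$Chance.of.Admit - predicted_mv))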

Assumptions in Linear Regression

Linearity

The linearity assumption expects a linear relationship between the predictors and the target. If it holds, the residuals should scatter randomly around zero with no visible pattern against the fitted values.

linearity <- data.frame(residual = model_adm_back$residuals, fitted = model_adm_back$fitted.values)
linearity %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The graph shows a pattern. Most points sit near zero residual for fitted values between 0.5 and 0.8, but the residuals become increasingly negative as the fitted values grow before converging back toward the x-axis, hinting at mild non-linearity.

Normality Error/Residual

hist(model_adm_back$residuals)

The histogram shows that the residuals are skewed to the left (a long tail of negative residuals), a strong signal of a non-normal residual distribution. To test this formally, I use the Shapiro-Wilk test.

shapiro.test(model_adm_back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_adm_back$residuals
## W = 0.93864, p-value = 0.000000000008698

The p-value given by the Shapiro-Wilk normality test is less than the alpha of 0.05. Therefore, the model's residuals are not normally distributed.
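Besides the histogram and the formal test, a normal Q-Q plot is a common complementary check; points bending away from the reference line in the tails confirm the skew:

# Normal Q-Q plot of the residuals as a visual normality check
qqnorm(model_adm_back$residuals)
qqline(model_adm_back$residuals, col = "red")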

Homoscedasticity

Homoscedasticity means homogeneity of variance: the residuals should have equal (or similar) variance across the range of fitted values. When the residual variance is not constant, this is called heteroscedasticity.

bptest(model_adm_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_adm_back
## BP = 26.373, df = 6, p-value = 0.0001897

The p-value given by the Breusch-Pagan test is less than alpha (0.05). Therefore the model shows indications of heteroscedasticity.

No Multicollinearity

The no-multicollinearity assumption expects that the predictor variables are not strongly correlated with each other, so that each predictor contributes information not already captured by the others. The Variance Inflation Factor (VIF) is used to check this.

vif(model_adm_back)
##   GRE.Score TOEFL.Score         SOP         LOR        CGPA    Research 
##    4.273734    3.989381    2.428683    1.932297    4.474493    1.423618

Based on the reference, a VIF score above 5 indicates that multicollinearity may be present, and a VIF above 10 indicates almost certain multicollinearity among the variables. All VIF scores here are below 5, so this assumption is satisfied.
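The thresholds can also be checked programmatically; a small sketch flagging any predictor whose VIF exceeds 5:

# Flag predictors whose VIF exceeds the usual threshold of 5
vif_scores <- vif(model_adm_back)
vif_scores[vif_scores > 5]  # empty: no predictor exceeds 5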

Data Transformation

Of the 4 assumptions, only 2 are fulfilled: homoscedasticity and normality are violated. If this model is used to make predictions, the results may be misleading.

Based on the referenced article, an alternative way to fix the model is to transform the dependent and/or independent variables with a logarithm. In the next model, I transform the independent variables and use them to predict the target. For context, the article notes: “If a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows (or decays) exponentially as a function of the independent variables.”
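For comparison, the dependent-variable transformation described in that quote would look like the sketch below (this is not the approach taken in the next section, which transforms the predictors instead); predictions must be back-transformed with 10^:

# Sketch: log-transforming the dependent variable instead (not used below)
model_logy <- lm(log10(Chance.of.Admit) ~ ., data = adm_train)
pred_logy  <- 10^predict(model_logy, newdata = adm_test)  # back-transform
MAE(y_pred = pred_logy, y_true = adm_test$Chance.of.Admit)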

Data Transformation and Cross Validation

# Log10 transformation
adm3 <- adm %>%
  select(-Serial.No.) %>% # unselect Serial.No
  mutate(GRE.Score = log10(GRE.Score),
         TOEFL.Score = log10(TOEFL.Score),
         University.Rating = log10(University.Rating),
         SOP = log10(SOP),
         LOR = log10(LOR),
         CGPA = log10(CGPA))
head(adm3) # Research is not transformed because it is a binary indicator
# Cross Validation
set.seed(99)
adm_train3 <- adm3[adm_split,]
adm_test3 <- adm3[-adm_split,]

Modelling and Evaluation

To create the model, the lm() function is called with the target variable, the predictor variables, and the training data. This second model uses the training set whose predictor variables have been transformed to logarithmic form.

# Modelling
model_logx <- lm(formula = Chance.of.Admit ~ ., data = adm_train3)
summary(model_logx)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = adm_train3)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.277000 -0.022683  0.008397  0.031723  0.153650 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -6.513787   0.714058  -9.122 < 0.0000000000000002 ***
## GRE.Score          1.407398   0.367032   3.835             0.000147 ***
## TOEFL.Score        0.740896   0.224715   3.297             0.001066 ** 
## University.Rating  0.009342   0.023853   0.392             0.695544    
## SOP                0.029557   0.030955   0.955             0.340250    
## LOR                0.105349   0.029780   3.538             0.000452 ***
## CGPA               2.281419   0.186149  12.256 < 0.0000000000000002 ***
## Research           0.027478   0.006682   4.112            0.0000478 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05548 on 392 degrees of freedom
## Multiple R-squared:  0.839,  Adjusted R-squared:  0.8361 
## F-statistic: 291.9 on 7 and 392 DF,  p-value: < 0.00000000000000022

Next, make predictions on the test data and evaluate them.

# Predicting
predict_logx <- predict(object = model_logx, newdata = adm_test3)

# Model evaluation
MAE(y_pred = predict_logx, y_true = adm_test3$Chance.of.Admit)
## [1] 0.05273135

Assumption Checking

Since only two assumptions were violated, this section focuses solely on the normality and homoscedasticity assumptions to test the new model.

Normality

hist(model_logx$residuals)

The histogram again shows a residual distribution skewed to the left.

shapiro.test(model_logx$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_logx$residuals
## W = 0.93142, p-value = 0.000000000001362

The p-value below 0.05 (alpha) indicates that this model also violates the normality assumption; the test confirms the histogram.

Homoscedasticity

bptest(model_logx)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_logx
## BP = 22.926, df = 7, p-value = 0.001756

Using the Breusch-Pagan test, the p-value is again less than 0.05 (alpha), indicating that this model also violates the homoscedasticity assumption.

Conclusion

Of the predictor variables, GRE score, TOEFL score, letter-of-recommendation strength, CGPA, and research experience are significant factors in predicting a person's chance of university admission. Academic measures are commonly used to screen applications.

The correlation values show that all variables have a strong correlation with the target variable. The independent variables are also highly correlated with each other.

The linear model that has been built is able to produce predictions. However, in linear regression, prediction alone is not sufficient: the model must also satisfy the assumptions. If these assumptions are not met, the resulting predictions are likely to be misleading.

In the first model, the normality and homoscedasticity assumptions are not fulfilled. One way to address this is to log-transform some of the variables (independent or dependent) so that the variable relationships become more linear.

A second model is then created by log-transforming the independent variables. After re-testing the two previously violated assumptions, it turns out that the second model still does not fulfill them.

Alternative

It can be concluded that a plain linear model is not a good match for predicting the probability of acceptance of university applications. Possible alternatives are to use a non-linear model, to apply PCA as a pre-processing step, or to use a regression method that addresses the problems of the OLS model, as sketched below.
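As one example of the last option, a robust regression such as MASS::rlm down-weights outlying observations and is less sensitive to the violated normality assumption; a minimal sketch, assuming the same train/test split as above:

# Sketch: robust regression (M-estimation) as an alternative to OLS
# MASS:: is called directly so MASS::select does not mask dplyr::select
model_robust <- MASS::rlm(Chance.of.Admit ~ ., data = adm_train)
pred_robust  <- predict(model_robust, newdata = adm_test)
MAE(y_pred = pred_robust, y_true = adm_test$Chance.of.Admit)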

References

Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.

https://www.statisticssolutions.com/assumptions-of-linear-regression/