Introduction

This dataset was created by Mohan S Acharya to estimate the chance of graduate admission from an Indian perspective. It was built to help students shortlist universities based on their profiles; the predicted output gives them a fair idea of their chances at a particular university.

We will use a linear regression model on the Graduate Admission dataset to understand the relationships among the variables, especially between Chance of Admit and the other variables, and to predict the chance of admission from historical data. You can download the data here.

Load Dataset

Load the required packages.

library(dplyr)
library(GGally)
library(performance)
library(MLmetrics)
library(lmtest)
library(car)

Load the dataset.

graduate <- read.csv('data_input/Admission_Predict.csv')

graduate

Exploratory Data Analysis

a. Data Structure

glimpse(graduate)
#> Rows: 400
#> Columns: 9
#> $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
#> $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
#> $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
#> $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
#> $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
#> $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
#> $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
#> $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
#> $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~
summary(graduate)
#>    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
#>  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
#>  1st Qu.:100.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
#>  Median :200.5   Median :317.0   Median :107.0   Median :3.000    
#>  Mean   :200.5   Mean   :316.8   Mean   :107.4   Mean   :3.087    
#>  3rd Qu.:300.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
#>  Max.   :400.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
#>       SOP           LOR             CGPA          Research     
#>  Min.   :1.0   Min.   :1.000   Min.   :6.800   Min.   :0.0000  
#>  1st Qu.:2.5   1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.0000  
#>  Median :3.5   Median :3.500   Median :8.610   Median :1.0000  
#>  Mean   :3.4   Mean   :3.453   Mean   :8.599   Mean   :0.5475  
#>  3rd Qu.:4.0   3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:1.0000  
#>  Max.   :5.0   Max.   :5.000   Max.   :9.920   Max.   :1.0000  
#>  Chance.of.Admit 
#>  Min.   :0.3400  
#>  1st Qu.:0.6400  
#>  Median :0.7300  
#>  Mean   :0.7244  
#>  3rd Qu.:0.8300  
#>  Max.   :0.9700

b. Missing Value Check

anyNA(graduate)
#> [1] FALSE

There are no missing values in the dataset.
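
If anyNA() had returned TRUE, a per-column count would show where the gaps are; a minimal sketch on the same graduate data frame:

# number of missing values in each column (all should be 0 here)
colSums(is.na(graduate))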

c. Data Overview

Glossary of the graduate data:

  1. Serial.No. : Index number (1-400)
  2. GRE.Score : Graduate Record Examination scores (out of 340)
  3. TOEFL.Score : TOEFL Scores (out of 120)
  4. University.Rating: University Rating (out of 5)
  5. SOP : Statement of Purpose strength (out of 5)
  6. LOR : Letter of Recommendation strength (out of 5)
  7. CGPA : Undergraduate GPA (out of 10)
  8. Research : Research Experience (either 0 or 1)
  9. Chance.of.Admit : Chance of Admit (ranging from 0 to 1)

Data summary:

  - The data has 400 rows and 9 columns.
  - Serial.No. is only an index number, so we can drop it.
  - Our target variable is Chance.of.Admit.
  - Research and University.Rating can be converted into factors.

d. Data Cleaning

graduate_clean <- graduate %>% 
  select(-Serial.No.) %>% 
  mutate_at(c('Research', 'University.Rating'), as.factor)

glimpse(graduate_clean)
#> Rows: 400
#> Columns: 8
#> $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
#> $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
#> $ University.Rating <fct> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
#> $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
#> $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
#> $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
#> $ Research          <fct> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
#> $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~

e. Data Correlation

ggcorr(graduate_clean, hjust = 0.64, label = T, label_size = 3, cex = 3)

Based on the correlation visualization above, it is known that:

  1. CGPA is the predictor with the highest correlation to the target variable.
  2. LOR is the predictor with the lowest correlation to the target variable.
  3. All numeric predictors have a correlation above 0.5 with the target variable, so they can be considered strongly correlated with it (ggcorr only uses numeric columns, so the factors Research and University.Rating are not shown; a quick numeric check is sketched below).
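
The values in the plot can also be read off directly with cor(); a minimal sketch on the numeric columns of graduate_clean (Research and University.Rating are factors after cleaning, so they are excluded):

# correlation of each numeric column with the target variable
num_cols <- sapply(graduate_clean, is.numeric)
cor(graduate_clean[, num_cols])[, "Chance.of.Admit"]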

Modeling

Train-Test Split

Before we build the model, we need to split the data into a training set and a test set. We will use the training set to fit the linear regression model. The test set serves as a comparison, letting us check whether the model overfits and fails to predict new data it has not seen during training. We will use 70% of the data for training and the rest for testing.

RNGkind(sample.kind = "Rounding")
set.seed(123)
intrain <- sample(nrow(graduate_clean), nrow(graduate_clean)*0.70)

# train-test splitting
graduate_train <- graduate_clean[intrain,]
graduate_test <- graduate_clean[-intrain,]
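
As a quick sanity check on the split, 70% of the 400 rows should end up in the training set:

# expected sizes: 400 * 0.70 = 280 rows for training, the remaining 120 for testing
nrow(graduate_train)
nrow(graduate_test)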

Linear Regression

Now we will fit linear regression models with Chance.of.Admit as the target variable.

a. Linear Regression Model with all predictor variables

model_all <- lm(Chance.of.Admit ~ ., data = graduate_train)

summary(model_all)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ ., data = graduate_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.265365 -0.021831  0.008806  0.036244  0.154209 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        -1.2389804  0.1588765  -7.798 1.38e-13 ***
#> GRE.Score           0.0016445  0.0007635   2.154  0.03214 *  
#> TOEFL.Score         0.0034531  0.0013086   2.639  0.00881 ** 
#> University.Rating2 -0.0031457  0.0183101  -0.172  0.86372    
#> University.Rating3  0.0058106  0.0197358   0.294  0.76866    
#> University.Rating4  0.0072630  0.0241904   0.300  0.76422    
#> University.Rating5  0.0232278  0.0265290   0.876  0.38205    
#> SOP                -0.0012208  0.0068045  -0.179  0.85775    
#> LOR                 0.0181388  0.0069112   2.625  0.00917 ** 
#> CGPA                0.1156133  0.0144201   8.018 3.32e-14 ***
#> Research1           0.0270413  0.0095338   2.836  0.00491 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06526 on 269 degrees of freedom
#> Multiple R-squared:  0.7997, Adjusted R-squared:  0.7923 
#> F-statistic: 107.4 on 10 and 269 DF,  p-value: < 2.2e-16

b. Linear Regression Model with selected predictor variables using the backward elimination method

model_back <- step(object = model_all, direction = "backward", trace = 0)
summary(model_back)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
#>     CGPA + Research, data = graduate_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.264661 -0.022468  0.009422  0.038006  0.151586 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -1.3053253  0.1440060  -9.064  < 2e-16 ***
#> GRE.Score    0.0017322  0.0007487   2.314 0.021423 *  
#> TOEFL.Score  0.0035195  0.0012830   2.743 0.006486 ** 
#> LOR          0.0201934  0.0057446   3.515 0.000514 ***
#> CGPA         0.1185797  0.0139077   8.526 1.04e-15 ***
#> Research1    0.0278794  0.0093994   2.966 0.003282 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06494 on 274 degrees of freedom
#> Multiple R-squared:  0.798,  Adjusted R-squared:  0.7943 
#> F-statistic: 216.5 on 5 and 274 DF,  p-value: < 2.2e-16

Evaluation

Model Performance

The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error (RMSE). RMSE is often preferred over MAE (mean absolute error) because it squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is commonly used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package.
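
Under the hood, RMSE is simply the square root of the mean squared difference between the predicted and actual values, so it can also be computed by hand; a minimal sketch:

# manual RMSE: square the errors, average them, then take the square root
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_pred - y_true)^2))
}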

Below is the performance of the first model (with all predictor variables).

# Train data RMSE
RMSE(y_pred = model_all$fitted.values, y_true = graduate_train$Chance.of.Admit)
#> [1] 0.06396333
# Test data RMSE
lm_pred <- predict(object = model_all, newdata = graduate_test %>% select(-Chance.of.Admit))
RMSE(y_pred = lm_pred, y_true = graduate_test$Chance.of.Admit)
#> [1] 0.06087141

Below is the performance of the second model (with variables removed by the backward elimination method).

# Train data RMSE
RMSE(y_pred = model_back$fitted.values, y_true = graduate_train$Chance.of.Admit)
#> [1] 0.06423743
# Test data RMSE
lm_pred2 <- predict(object = model_back, newdata = graduate_test %>% select(-Chance.of.Admit))
RMSE(y_pred = lm_pred2, y_true = graduate_test$Chance.of.Admit)
#> [1] 0.06120476

Performance Comparison

compare_performance(model_all, model_back, rank = T)

The two models perform almost identically on the test set (RMSE 0.0612 vs 0.0609). Since the second model needs fewer predictors for essentially the same performance and has a slightly higher adjusted R-squared, we will continue with model_back.

Assumptions

Linear regression is a parametric model, meaning that in order to produce a valid model equation it must satisfy several classical assumptions. A linear regression that violates these assumptions may be misleading or simply have biased estimators. In this section, we only check the second model (the one with variables removed), model_back.
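
As a complementary visual check (not a replacement for the individual tests below), the performance package we loaded earlier provides check_model(), which draws several diagnostic plots at once:

# one-panel overview of diagnostics such as linearity, normality of residuals, and collinearity
# note: plotting the check_model() output requires the 'see' package to be installed
check_model(model_back)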

1. Linearity

resact <- data.frame(residual = model_back$residuals, fitted = model_back$fitted.values)

resact %>% 
  ggplot(aes(fitted, residual)) + 
  geom_point() + geom_smooth() + 
  geom_hline(aes(yintercept = 0)) + 
  theme(panel.grid = element_blank(), 
        panel.background = element_blank())

The plot shows no visible pattern in the residuals, which indicates that the linearity assumption is satisfied.

2. Normality of Residual

plot(density(model_back$residuals))

shapiro.test(model_back$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_back$residuals
#> W = 0.91432, p-value = 1.445e-11

The null hypothesis of the Shapiro-Wilk test is that the residuals follow a normal distribution. With a p-value < 0.05, the null hypothesis is rejected, so our residuals do not follow a normal distribution.
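
A Q-Q plot gives a complementary visual check of the same assumption; a minimal sketch using base R:

# points deviating from the straight reference line indicate departures from normality
qqnorm(model_back$residuals)
qqline(model_back$residuals)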

3. Homoscedasticity

bptest(model_back)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_back
#> BP = 17.366, df = 5, p-value = 0.003856
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

From the scatter plot we can observe that at lower fitted values the residuals are spread over a wide range, while at higher fitted values they concentrate around 0, creating a fan-shaped pattern.

A second way to detect heteroscedasticity is the Breusch-Pagan test, whose null hypothesis is that there is no heteroscedasticity (the residuals have constant variance). With a p-value < 0.05, we conclude that heteroscedasticity is present in our model.

4. Multicollinearity

Multicollinearity means that there is strong correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(model_back)
#>   GRE.Score TOEFL.Score         LOR        CGPA    Research 
#>    4.835205    4.017402    1.781769    4.628770    1.451949

In the model, all VIF values are under 5, so multicollinearity among the predictors is not a problem.

Conclusion

The model that we can use to predict a student's chance of being admitted is model_back. Through the feature selection process, we obtained five predictor variables that are considered important in predicting the acceptance chance of a graduate school applicant (a usage sketch follows the list below). The five predictor variables are:

  1. GRE score
  2. TOEFL Score
  3. LOR (Letter of Recommendation Strength)
  4. CGPA (Undergraduate GPA)
  5. Experience in research.
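
To illustrate how model_back would be used in practice, below is a sketch that predicts the admission chance of a single hypothetical applicant; the profile values are made up, and Research must be supplied as a factor with the same levels as in the training data:

# hypothetical applicant profile (illustrative values only)
new_applicant <- data.frame(
  GRE.Score   = 320,
  TOEFL.Score = 110,
  LOR         = 4.0,
  CGPA        = 8.9,
  Research    = factor(1, levels = c(0, 1))
)
predict(model_back, newdata = new_applicant)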

In terms of model performance, model_back passed only two of the four assumption checks: linearity and no multicollinearity. A linear regression model should pass all four assumption checks to qualify as a “good” model, and model_back has failed to fulfill that requirement. Therefore, a linear regression model may not be the best choice for predicting graduate admissions in future use; it is not sufficient to reliably predict the acceptance chance of grad school applicants.

Trying other models besides linear regression would hopefully produce a better result than the one we just built.