library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
library(funModeling)
library(MASS)
library(dplyr)

1 Introduction

Regression analysis is a supervised machine learning method used to predict the value of a numerical variable from one or more predictor variables. The linear regression model is among the most theoretically well-understood models in statistics and often yields highly interpretable results. In this article, we will use a linear regression model to predict students' chances of admission based on their profiles, helping them shortlist universities. The predicted output gives them a fair idea of their chances at a particular university.

2 Source of Dataset

The source of the dataset can be found at https://www.kaggle.com/mohansacharya/graduate-admissions

3 Data Preprocessing

3.1 Data Wrangling

Before building the model, we need to preprocess the data and check the raw data for any undesirable formats.

admission <- read.csv("Admission_Predict_Ver1.1.csv")

rmarkdown::paged_table(admission)
names(admission)
#> [1] "Serial.No."        "GRE.Score"         "TOEFL.Score"      
#> [4] "University.Rating" "SOP"               "LOR"              
#> [7] "CGPA"              "Research"          "Chance.of.Admit"

The dataset includes the following information (a quick look at the column types follows the list):

  • Serial.No : Student Number
  • GRE.Score : GRE Scores ( out of 340 )
  • TOEFL.Score : TOEFL Scores ( out of 120 )
  • University.Rating : University Rating ( out of 5 )
  • SOP : Statement of Purpose ( out of 5 )
  • LOR : Letter of Recommendation Strength ( out of 5 )
  • CGPA : Undergraduate GPA ( out of 10 )
  • Research : Research Experience ( either 0 or 1 )
  • Chance.of.Admit: Chance of Admit ( ranging from 0 to 1 )
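To confirm each column's type before modeling, we can take a quick structural look at the data (a supplementary check; this output is not part of the original report):

# Overview of column types and example values for every variable
dplyr::glimpse(admission)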
colSums(is.na(admission))
#>        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
#>                 0                 0                 0                 0 
#>               SOP               LOR              CGPA          Research 
#>                 0                 0                 0                 0 
#>   Chance.of.Admit 
#>                 0
anyNA(admission)
#> [1] FALSE

We see that the data are clean and contain no NA values. We do not need the Serial.No. variable since it is only a row identifier and carries no information about the response, so we remove it.

admission_new <- admission %>% 
  dplyr::select(-Serial.No.)
rmarkdown::paged_table(admission_new)

3.2 Checking Correlation

ggcorr(admission_new, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Each predictor variable has a strong positive correlation with the response variable (Chance.of.Admit). CGPA has the highest correlation with the chance of admission among all the predictors.

boxplot(admission_new$Chance.of.Admit)

boxplot(admission_new$TOEFL.Score)

boxplot(admission_new$GRE.Score)

boxplot(admission_new %>% 
          dplyr::select(-c(Chance.of.Admit, GRE.Score, TOEFL.Score)))

plot_num(admission_new)

4 Model Fitting and Evaluation

Before we build the model, we split the data into training and testing datasets. We will use the training dataset to train the linear regression model. The testing dataset will be used as a comparison, to check whether the model overfits and cannot predict new data it has not seen during the training phase. We will use 80% of the data as training data and the rest as testing data.

set.seed(11)
index <- sample(nrow(admission_new), nrow(admission_new)*0.8)
admission.train <- admission_new[index, ]
admission.val <- admission_new[-index, ]
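
As a quick sanity check on the 80/20 split (supplementary output, not shown in the original report):

nrow(admission.train)  # expected: 400 rows (80% of the 500-row dataset)
nrow(admission.val)    # expected: 100 rows (the remaining 20%)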

First, we build a model with all predictor variables.

4.1 Building The Model

model_admission <- lm(formula = Chance.of.Admit ~ ., data = admission.train)
summary(model_admission)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ ., data = admission.train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.268039 -0.022574  0.007516  0.033044  0.162165 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       -1.3696887  0.1145370 -11.958  < 2e-16 ***
#> GRE.Score          0.0021181  0.0005539   3.824 0.000153 ***
#> TOEFL.Score        0.0028105  0.0009724   2.890 0.004063 ** 
#> University.Rating  0.0040452  0.0040932   0.988 0.323628    
#> SOP               -0.0009614  0.0049034  -0.196 0.844664    
#> LOR                0.0191943  0.0046186   4.156 3.98e-05 ***
#> CGPA               0.1201964  0.0110339  10.893  < 2e-16 ***
#> Research           0.0225662  0.0074723   3.020 0.002693 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.05877 on 392 degrees of freedom
#> Multiple R-squared:  0.8359, Adjusted R-squared:  0.833 
#> F-statistic: 285.4 on 7 and 392 DF,  p-value: < 2.2e-16

From this model, we obtain an adjusted R-squared of 83.3%. Next, we try to build a model automatically with stepwise regression using backward elimination.

MASS::stepAIC(model_admission, direction = "backward")
#> Start:  AIC=-2259.41
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     SOP + LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - SOP                1   0.00013 1.3540 -2261.4
#> - University.Rating  1   0.00337 1.3572 -2260.4
#> <none>                           1.3538 -2259.4
#> - TOEFL.Score        1   0.02885 1.3827 -2253.0
#> - Research           1   0.03150 1.3853 -2252.2
#> - GRE.Score          1   0.05050 1.4043 -2246.8
#> - LOR                1   0.05965 1.4135 -2244.2
#> - CGPA               1   0.40983 1.7637 -2155.6
#> 
#> Step:  AIC=-2261.37
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - University.Rating  1   0.00330 1.3573 -2262.4
#> <none>                           1.3540 -2261.4
#> - TOEFL.Score        1   0.02872 1.3827 -2255.0
#> - Research           1   0.03154 1.3855 -2254.2
#> - GRE.Score          1   0.05047 1.4044 -2248.7
#> - LOR                1   0.06420 1.4182 -2244.8
#> - CGPA               1   0.42270 1.7767 -2154.7
#> 
#> Step:  AIC=-2262.4
#> Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
#> 
#>               Df Sum of Sq    RSS     AIC
#> <none>                     1.3573 -2262.4
#> - TOEFL.Score  1   0.03189 1.3892 -2255.1
#> - Research     1   0.03401 1.3913 -2254.5
#> - GRE.Score    1   0.05057 1.4078 -2249.8
#> - LOR          1   0.07682 1.4341 -2242.4
#> - CGPA         1   0.47328 1.8305 -2144.7
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
#>     CGPA + Research, data = admission.train)
#> 
#> Coefficients:
#> (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA     Research  
#>   -1.395850     0.002120     0.002923     0.020003     0.122494     0.023326

With stepwise regression, we find that the model contains the variables GRE.Score, TOEFL.Score, LOR, CGPA, and Research. We build a new model with this set of predictors.

new_model_admission <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research, data = admission.train)
summary(new_model_admission)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
#>     CGPA + Research, data = admission.train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.267891 -0.023946  0.007039  0.032910  0.162565 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -1.3958502  0.1087265 -12.838  < 2e-16 ***
#> GRE.Score    0.0021196  0.0005532   3.831 0.000148 ***
#> TOEFL.Score  0.0029226  0.0009606   3.042 0.002503 ** 
#> LOR          0.0200029  0.0042358   4.722 3.25e-06 ***
#> CGPA         0.1224944  0.0104507  11.721  < 2e-16 ***
#> Research     0.0233259  0.0074233   3.142 0.001803 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.05869 on 394 degrees of freedom
#> Multiple R-squared:  0.8355, Adjusted R-squared:  0.8334 
#> F-statistic: 400.3 on 5 and 394 DF,  p-value: < 2.2e-16

We find that the adjusted R-squared is slightly better at 83.34%. Thus, we will use this model to test its performance on our testing dataset.
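
For a direct side-by-side check of the two adjusted R-squared values (supplementary; the numbers match the summaries above):

summary(model_admission)$adj.r.squared      # full model
summary(new_model_admission)$adj.r.squared  # stepwise-selected model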

4.2 Evaluating Model

For evaluating our model, we will use RMSE (Root Mean Square Error) with the following formula:

\(RMSE = \sqrt{\frac{\sum_{i=1}^{n}(Predicted_i-Actual_i)^2}{n}}\)
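
For clarity, the formula can also be written as a small helper function (a minimal sketch; the analysis below uses RMSE() from caret instead):

# Manual RMSE, mirroring the formula above (illustrative only)
rmse_manual <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}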

lm_pred <- predict(new_model_admission, newdata = admission.val %>% dplyr::select(-Chance.of.Admit))

# RMSE of train dataset
RMSE(admission.train$Chance.of.Admit, new_model_admission$fitted.values)
#> [1] 0.05825091
# RMSE of test dataset
RMSE(admission.val$Chance.of.Admit, lm_pred)
#> [1] 0.06536872

The error on the testing dataset is higher than on the training dataset, so there is a chance that the model overfits slightly.
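
To quantify how much worse the model does on unseen data, the two errors can be compared directly (a supplementary sketch; the exact values depend on the split):

# Proportional increase in error from training to validation data (illustrative)
rmse_train <- RMSE(admission.train$Chance.of.Admit, new_model_admission$fitted.values)
rmse_test  <- RMSE(admission.val$Chance.of.Admit, lm_pred)
(rmse_test - rmse_train) / rmse_train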

4.3 Checking Assumptions

Linear regression is a parametric model, so it comes with several assumptions. A linear regression model that violates these assumptions may be misleading or may have biased estimators.

  1. Linearity
linear <- data.frame(residual = new_model_admission$residuals, fitted = new_model_admission$fitted.values)

linear %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

Ideally, the residual plot should show no fitted pattern; that is, the blue line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model. In our model, there is no clear pattern in the residual plot, which suggests that we can assume a linear relationship between the predictors and the outcome variable.
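
A formal check of the functional form can supplement the residual plot, for example with the RESET test from the already-loaded lmtest package (a supplementary sketch, not part of the original analysis):

# Ramsey RESET test: H0 = the linear functional form is adequate
resettest(new_model_admission)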

  2. Normality

\(H_0\): the normality of residual is fulfilled

\(H_1\): the normality of residual is not fulfilled

shapiro.test(new_model_admission$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  new_model_admission$residuals
#> W = 0.93312, p-value = 2.085e-12

The null hypothesis is that the residuals follow a normal distribution. With a p-value < 0.05, we reject the null hypothesis: our residuals do not follow a normal distribution.
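
To visualize the departure from normality, a normal QQ plot of the residuals can be drawn (illustrative only, not part of the original report):

# Normal QQ plot; points should lie close to the line if the residuals are normal
qqnorm(new_model_admission$residuals)
qqline(new_model_admission$residuals)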

  3. Autocorrelation

\(H_0\): residuals have no-autocorrelation

\(H_1\): residuals have autocorrelation

Box.test(new_model_admission$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  new_model_admission$residuals
#> X-squared = 0.30395, df = 1, p-value = 0.5814

The p-value is higher than 0.05, so we fail to reject the null hypothesis: the residuals show no autocorrelation.
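
As a cross-check, the Durbin-Watson test from the already-loaded lmtest package can also be applied (a supplementary sketch, not part of the original analysis):

# Durbin-Watson test: H0 = no first-order autocorrelation in the residuals
dwtest(new_model_admission)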

  4. Heteroscedasticity

\(H_0\): Variation of residual is constant (Homoscedasticity)

\(H_1\): Variation of residual is not constant (Heteroscedasticity)

bptest(new_model_admission)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  new_model_admission
#> BP = 23.14, df = 5, p-value = 0.0003174

With a p-value < 0.05, we reject the null hypothesis. We can conclude that our model has a heteroscedasticity problem.
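
One common remedy (not applied in the original analysis) is to report heteroscedasticity-robust standard errors, for example with the sandwich package, assuming it is installed:

library(sandwich)  # assumed available; provides heteroscedasticity-consistent covariance estimators
coeftest(new_model_admission, vcov. = vcovHC(new_model_admission, type = "HC1"))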

  5. Multicollinearity

Multicollinearity means that there is correlation among the independent variables (predictors). To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(new_model_admission)
#>   GRE.Score TOEFL.Score         LOR        CGPA    Research 
#>    4.690604    3.965071    1.764345    4.751085    1.563673

All VIF values are below 5, so multicollinearity is not a problem for this model. However, our model still violates the normality and homoscedasticity assumptions. Several methods may be applied to fix these problems:

  1. Transform some variables, both the target and the predictors (a sketch follows this list).
  2. Remove predictors that are highly correlated with one another.
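
As a sketch of the first suggestion, a Box-Cox-style power transformation of the target could be tried (illustrative only, under the assumption that the transformed model would then be re-checked; it is not evaluated here):

# Box-Cox search for a variance-stabilizing power of the target (illustrative)
bc <- MASS::boxcox(new_model_admission, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
model_bc <- lm(((Chance.of.Admit^lambda - 1) / lambda) ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
               data = admission.train)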