Introduction

This dataset was created for predicting Graduate Admissions from an Indian perspective and contains several parameters that are considered important when applying for Masters programs.

The business case for this analysis is to predict a person's chance of admission into a university with a linear regression model.

Libraries

The libraries used in this analysis:

library(tidyverse)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

Data Preparation

Read the data

adm <- read.csv("Admission.csv")

Check data structure

glimpse(adm)
## Rows: 500
## Columns: 9
## $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

We need to change the Research data type into a factor, recoding 0 as “no” and 1 as “yes”. After that, we can simply drop the Serial.No. column, since it is only a row identifier.

adm <- adm %>% 
  select(-c(Serial.No.)) %>% 
  mutate(Research = ifelse(Research == 0, "no", "yes"),
         Research = as.factor(Research))
glimpse(adm)
## Rows: 500
## Columns: 8
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research          <fct> yes, yes, yes, yes, no, yes, yes, no, no, no, yes...
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

Okay, now the data is in the form we want, but we also need to check whether there are any NA values in it.

colSums(is.na(adm))
##         GRE.Score       TOEFL.Score University.Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance.of.Admit 
##                 0                 0                 0                 0

Now we know that the data has no NAs, and we can move on to exploring it.

Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for any patterns that may indicate correlation between them.

Find the Pearson correlation between features.

ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'Research' are not numeric and were ignored

Based on the result, all of the numeric predictors have a high correlation with the target. However, this check only covers numerical predictors, so let's look at the factor predictor separately.

adm %>% 
  group_by(Research) %>% 
  summarise(min_Chance.of.Admit = min(Chance.of.Admit), # Minimum
            max_Chance.of.Admit = max(Chance.of.Admit), # Maximum
            mean_Chance.of.Admit = mean(Chance.of.Admit), # Mean
            )

Based on the result, the summary statistics suggest only a weak relationship between Research and Chance.of.Admit.
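
To put a number on this, a quick sketch: the point-biserial correlation is just Pearson's correlation computed on a 0/1 recoding of the factor.

# Rough numeric check (sketch): point-biserial correlation between the
# binary Research flag and the target, via a 0/1 recoding
cor(as.numeric(adm$Research == "yes"), adm$Chance.of.Admit)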

Before we build the model, we need to split the data into a train set and a test set. We will use the train set to fit the linear regression model, while the test set will serve as a comparison to check whether the model overfits and cannot predict new data it has not seen during the training phase. We will use 75% of the data for training and the rest for testing.

set.seed(123)
row_data <- nrow(adm)
index <- sample(row_data, row_data*0.75)

data_train <- adm[ index, ]
data_test <- adm[ -index, ]

Now we will fit a linear regression with Chance.of.Admit as the target variable, then apply stepwise selection (step() with direction = "both"), which drops or re-adds predictors to minimize the AIC.

set.seed(123)
model_adm <- lm(Chance.of.Admit ~ ., data_train)
model_step <- step(model_adm, direction = "both")
## Start:  AIC=-2109.52
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00041 1.2959 -2111.4
## - University.Rating  1   0.00431 1.2998 -2110.3
## <none>                           1.2955 -2109.5
## - TOEFL.Score        1   0.02333 1.3188 -2104.8
## - Research           1   0.02872 1.3242 -2103.3
## - LOR                1   0.03485 1.3303 -2101.6
## - GRE.Score          1   0.05470 1.3502 -2096.0
## - CGPA               1   0.42853 1.7240 -2004.3
## 
## Step:  AIC=-2111.4
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  1   0.00622 1.3021 -2111.6
## <none>                           1.2959 -2111.4
## + SOP                1   0.00041 1.2955 -2109.5
## - TOEFL.Score        1   0.02408 1.3200 -2106.5
## - Research           1   0.02908 1.3250 -2105.1
## - LOR                1   0.04152 1.3374 -2101.6
## - GRE.Score          1   0.05430 1.3502 -2098.0
## - CGPA               1   0.45896 1.7548 -1999.7
## 
## Step:  AIC=-2111.61
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.3021 -2111.6
## + University.Rating  1   0.00622 1.2959 -2111.4
## + SOP                1   0.00232 1.2998 -2110.3
## - TOEFL.Score        1   0.02975 1.3318 -2105.1
## - Research           1   0.03092 1.3330 -2104.8
## - LOR                1   0.05455 1.3566 -2098.2
## - GRE.Score          1   0.05548 1.3576 -2098.0
## - CGPA               1   0.51286 1.8150 -1989.1
summary(model_step)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270620 -0.022970  0.008901  0.035976  0.160757 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.4210184  0.1151556 -12.340  < 2e-16 ***
## GRE.Score    0.0022661  0.0005715   3.965 8.81e-05 ***
## TOEFL.Score  0.0027951  0.0009626   2.904 0.003908 ** 
## LOR          0.0175733  0.0044695   3.932 0.000101 ***
## CGPA         0.1226855  0.0101766  12.056  < 2e-16 ***
## Researchyes  0.0224954  0.0075994   2.960 0.003273 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0594 on 369 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.8156 
## F-statistic: 331.8 on 5 and 369 DF,  p-value: < 2.2e-16

The model summary shows a lot of information, but we mainly need to look at Pr(>|t|). This column shows the significance of each variable in the model, and here every remaining variable has a significant effect. The model also explains about 81.5% of the variance in the target (adjusted R-squared).
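
As a quick usage sketch, we can plug a hypothetical applicant into the selected model (every value below is invented for illustration):

# Hypothetical applicant; all values are made up for illustration only
new_applicant <- data.frame(GRE.Score = 320L,
                            TOEFL.Score = 110L,
                            LOR = 4.0,
                            CGPA = 9.0,
                            Research = factor("yes", levels = c("no", "yes")))
predict(model_step, new_applicant)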

Evaluation

The performance of our model can be measured using MAPE and RMSE.
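
For reference, here is a minimal sketch of what these two metrics compute (hand-rolled equivalents of MLmetrics::MAPE and MLmetrics::RMSE; the helper names are ours):

# Mean absolute percentage error: average relative deviation from the actual values
mape_manual <- function(pred, actual) mean(abs((actual - pred) / actual))
# Root mean squared error: typical deviation in the target's own units
rmse_manual <- function(pred, actual) sqrt(mean((actual - pred)^2))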

pred_data <- predict(model_step, data_test)
MAPE(pred_data, data_test$Chance.of.Admit)
## [1] 0.07290229
RMSE(pred_data, data_test$Chance.of.Admit)
## [1] 0.06222074

Based on the result, the MAPE shows that the predictions deviate around 7.3% from the actual data, while the RMSE shows that they deviate around 0.06 points. From this, we know that the model is not bad.

Assumption

Linear regression is a parametric model, meaning that its equation relies on several classical assumptions. A linear regression that violates these assumptions may be misleading, or may simply have biased estimators.

Normality Test

The first assumption in linear regression is that the residuals follow a normal distribution. We can easily check this with the Shapiro-Wilk normality test.

shapiro.test(model_step$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_step$residuals
## W = 0.91908, p-value = 2.495e-13

From the result, the p-value is < 0.05, which means our residuals do not follow a normal distribution.
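
As an optional visual check (a quick base R sketch), a histogram of the residuals shows the shape of the departure from normality:

# Sketch: visualize the residual distribution
hist(model_step$residuals, breaks = 30,
     main = "Residuals of model_step", xlab = "Residual")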

Homoscedasticity

Homoscedasticity means that the variance of the residuals is constant and does not form any particular pattern. If the errors form a pattern, such as a linear or cone shape, we call it heteroscedasticity, and it biases the standard errors of the coefficient estimates (making them too narrow or too wide). For this analysis, I use the Breusch-Pagan test.

bptest(model_step)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_step
## BP = 15.824, df = 5, p-value = 0.007364

Based on the result, the p-value is < 0.05, so the residual errors form a pattern and the homoscedasticity assumption is violated.
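
A residuals-versus-fitted plot (a quick base R sketch) can reveal the kind of pattern the test is detecting:

# Sketch: residuals against fitted values;
# a visible funnel or trend suggests heteroscedasticity
plot(model_step$fitted.values, model_step$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)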

Multicollinearity

Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(model_step)
##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.224167    3.521604    1.739356    3.830131    1.501305

Based on the result, all VIF values are below 5, so there is no problematic correlation between the predictors.

Model Improvement

Tuning

We have already seen that our model fails the normality and homoscedasticity tests. Now we will try to fix that. To make the model more linear and simpler, we can transform some of the variables, both the target and the predictors. First, we can drop predictors whose pairwise correlation coefficient is above 0.6. Then we can transform the target with the Box-Cox method, which replaces the target y with (y^lambda - 1) / lambda for the lambda that best normalizes the residuals.

ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'Research' are not numeric and were ignored

From the result, we know that TOEFL.Score and GRE.Score, as well as SOP and University.Rating, are highly correlated with each other. So we can drop one predictor from each pair. But to decide which one, we need to know which predictor has the larger effect, and for a fair comparison of coefficients we first scale the numeric variables.

scale_adm <- adm %>% 
  mutate_if(is.numeric, scale)

model_scale <- lm(Chance.of.Admit ~ ., scale_adm)

summary(model_scale)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = scale_adm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88930 -0.16527  0.06512  0.23887  1.11108 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.09644    0.03238  -2.979 0.003036 ** 
## GRE.Score          0.14873    0.04020   3.700 0.000240 ***
## TOEFL.Score        0.11971    0.03759   3.184 0.001544 ** 
## University.Rating  0.04814    0.03080   1.563 0.118753    
## SOP                0.01114    0.03204   0.348 0.728263    
## LOR                0.11054    0.02713   4.074 5.38e-05 ***
## CGPA               0.50730    0.04159  12.198  < 2e-16 ***
## Researchyes        0.17222    0.04680   3.680 0.000259 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.425 on 492 degrees of freedom
## Multiple R-squared:  0.8219, Adjusted R-squared:  0.8194 
## F-statistic: 324.4 on 7 and 492 DF,  p-value: < 2.2e-16

Based on the result, we know that GRE.Score and University.Rating have larger standardized effects on Chance.of.Admit than their counterparts, so we drop TOEFL.Score and SOP as predictors. After that, we apply the Box-Cox method to the target.

# Drop the weaker predictor of each highly correlated pair
adm2 <- adm %>% 
  select(-c(TOEFL.Score, SOP))

# Search for the Box-Cox lambda that maximizes the profile log-likelihood
box_try <- boxCox(model_step, lambda = seq(-10, 10, by = 0.1))
lambda <- box_try$x[which.max(box_try$y)]

# Transform the target with the chosen lambda
adm2 <- adm2 %>% 
  mutate(Chance.of.Admit = (Chance.of.Admit^lambda - 1) / lambda)
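
Note that Chance.of.Admit in adm2 is now on the Box-Cox scale. To read predictions back on the original 0-1 scale, we can invert the transform (a small sketch; the inv_boxcox helper name is ours, and it is valid only for lambda != 0):

# Sketch: inverse of the Box-Cox transform y_bc = (y^lambda - 1) / lambda
inv_boxcox <- function(y_bc, lambda) (lambda * y_bc + 1)^(1 / lambda)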

Performance

set.seed(123)
data_train2 <- adm2[index, ]
data_test2 <- adm2[-index, ]
set.seed(123)
model_adm2 <- lm(Chance.of.Admit ~ ., data_train2)
model_step2 <- step(model_adm2, direction = "both")
## Start:  AIC=-2452.45
## Chance.of.Admit ~ GRE.Score + University.Rating + LOR + CGPA + 
##     Research
## 
##                     Df Sum of Sq     RSS     AIC
## <none>                           0.52469 -2452.4
## - University.Rating  1  0.013684 0.53838 -2444.8
## - Research           1  0.014340 0.53903 -2444.3
## - LOR                1  0.015264 0.53996 -2443.7
## - GRE.Score          1  0.057330 0.58202 -2415.6
## - CGPA               1  0.249727 0.77442 -2308.5
summary(model_step2)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + University.Rating + 
##     LOR + CGPA + Research, data = data_train2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.160765 -0.019596  0.006957  0.023719  0.104475 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.6642363  0.0768589 -21.653  < 2e-16 ***
## GRE.Score          0.0020298  0.0003197   6.350 6.34e-10 ***
## University.Rating  0.0078394  0.0025271   3.102  0.00207 ** 
## LOR                0.0096503  0.0029454   3.276  0.00115 ** 
## CGPA               0.0851337  0.0064241  13.252  < 2e-16 ***
## Researchyes        0.0153469  0.0048327   3.176  0.00162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03771 on 369 degrees of freedom
## Multiple R-squared:  0.8425, Adjusted R-squared:  0.8404 
## F-statistic: 394.9 on 5 and 369 DF,  p-value: < 2.2e-16

Based on the result, we can see that this model explains about 84% of the variance in the transformed target, roughly 2.5 percentage points higher than the previous model.

pred_data2 <- predict(model_step2, data_test2)
MAPE(pred_data2, data_test2$Chance.of.Admit)
## [1] 0.1517087
RMSE(pred_data2, data_test2$Chance.of.Admit)
## [1] 0.03833579

Based on the result, the MAPE shows that predictions deviate around 15% from the actual values, while the RMSE shows that they deviate around 0.04 points. The MAPE of this model is about twice that of the previous model, but the RMSE is almost half of it. Keep in mind that these numbers are computed on the Box-Cox-transformed target, so they are not on the same scale as the first model's metrics.
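
Because of this scale difference, one way to compare the two models fairly (a sketch, using the inv_boxcox helper defined above) is to back-transform the predictions and score them against the original test target:

# Sketch: score the tuned model on the original 0-1 scale for a fair comparison
pred_orig <- inv_boxcox(pred_data2, lambda)
MAPE(pred_orig, data_test$Chance.of.Admit)
RMSE(pred_orig, data_test$Chance.of.Admit)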

Assumption

Let's test normality and homoscedasticity again.

shapiro.test(model_step2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_step2$residuals
## W = 0.94623, p-value = 1.974e-10
bptest(model_step2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_step2
## BP = 4.3907, df = 5, p-value = 0.4946

Based on the result, the residuals still do not pass the normality test, but the model now passes the homoscedasticity test. This means the estimators might still be biased, so let's check the residual Q-Q plot.

qqPlot(model_step2$residuals)

##  10  66 
## 194 243

Based on the plot, some of the residuals still fall outside the line.

Conclusion

Even after we tweaked the model, it still does not pass the normality test, which means the estimators might be biased. Overall, though, the model is good: it explains about 84% of the variance, with a MAPE of about 0.15 and an RMSE of about 0.04 on the transformed target. It performs better than the previous model on R-squared and RMSE, but unfortunately with a higher MAPE.