library(tidyverse)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

Introduction

Studying can be boring if we don’t enjoy it, but as students it is something that is demanded of us. This LBB will predict a student’s chance of admission from several predictors. First, let’s read the data.

grade <- read.csv("Admission_Predict.csv")
head(grade)
glimpse(grade)
## Rows: 400
## Columns: 9
## $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

There are 400 students in our data. After taking a peek at it, we can see that the Serial.No. column should be removed because it merely indexes the participants/students.

EDA on the Data

After taking a peek at the data set, we know that we should remove the Serial.No. column. Next, we will check the Research column to decide whether it should have a factor data type.

unique(grade$Research)
## [1] 1 0

Since Research only takes the values 0 and 1, we will remove Serial.No. and convert Research to a factor.

# EDA on Grade Admission
grade <- grade %>% 
  select(-Serial.No.) %>% 
  mutate(Research = as.factor(Research))

glimpse(grade)
## Rows: 400
## Columns: 8
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research          <fct> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...

Column Explanation (a quick range check follows the list):

  • GRE Scores (out of 340)
  • TOEFL Scores (out of 120)
  • University Rating (out of 5)
  • Statement of Purpose (out of 5)
  • Letter of Recommendation Strength (out of 5)
  • Undergraduate GPA (out of 10)
  • Research Experience (either 0 or 1)
  • Chance of Admit (ranging from 0 to 1)
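
To double-check these documented ranges against the actual data, a quick summary() call works (a minimal sketch; its output is omitted here):

# sanity check (sketch): each column should stay within its documented range
summary(grade)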

After doing some EDA on the data set, we will check whether there are any missing values.

colSums(is.na(grade))
##         GRE.Score       TOEFL.Score University.Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance.of.Admit 
##                 0                 0                 0                 0

Then, we will look at the correlation between Chance.of.Admit as the target and the other columns as predictors, using the ggcorr() function from the GGally package.

ggcorr(grade, label = TRUE, label_size = 4, hjust = 1, layout.exp = 2)
## Warning in ggcorr(grade, label = TRUE, label_size = 4, hjust = 1, layout.exp =
## 2): data in column(s) 'Research' are not numeric and were ignored

The correlation graph above shows that nearly every predictor has a positive correlation with the target.
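
For the exact numbers behind the graph, the plain correlation matrix of the numeric columns can also be printed (a small sketch; Research is skipped because it is a factor):

# correlation matrix of the numeric columns (sketch)
grade %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  round(2)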

Because Research is a factor and was therefore left out above, we will create a bar plot to examine its relationship with Chance.of.Admit.

# Visualise the mean chance of admit by research experience
grade %>% 
  group_by(Research) %>% 
  summarise(mean_chance = mean(Chance.of.Admit)) %>% 
  ggplot(aes(Research, 
             mean_chance)) +
  geom_col() +
  labs(title = "Correlation between Student who has Research Experience and Chance of Admit",
       y = NULL)+
  theme_minimal() 

From the bar plot above we can conclude that students with research experience have, on average, a higher chance of admission than students without it.
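
If we want a formal check behind this visual impression, a two-sample t-test on the admission chance by research group is one option (a hedged sketch, not part of the modeling workflow):

# does the mean Chance.of.Admit differ between Research = 0 and Research = 1?
t.test(Chance.of.Admit ~ Research, data = grade)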

Linear Regression Model

In this section, we will start building linear regression models. The first model uses all predictors.

1. Linear Regression Model: All Predictors

lm1 <- lm(Chance.of.Admit ~ ., grade)
summary(lm1)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2594325  0.1247307 -10.097  < 2e-16 ***
## GRE.Score          0.0017374  0.0005979   2.906  0.00387 ** 
## TOEFL.Score        0.0029196  0.0010895   2.680  0.00768 ** 
## University.Rating  0.0057167  0.0047704   1.198  0.23150    
## SOP               -0.0033052  0.0055616  -0.594  0.55267    
## LOR                0.0223531  0.0055415   4.034  6.6e-05 ***
## CGPA               0.1189395  0.0122194   9.734  < 2e-16 ***
## Research1          0.0245251  0.0079598   3.081  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 2.2e-16

From the result above we have:

  • Adj. R-squared = 0.8, meaning the model with all predictors can explain 80% of the variance in the target (Chance.of.Admit).
  • Some predictors (University.Rating and SOP) have high p-values, so they contribute little to explaining Chance.of.Admit.
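
As a quick aside, the adjusted R-squared reported by summary() can be reproduced from the multiple R-squared with the usual formula (a minimal sketch):

# adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- summary(lm1)$r.squared
n  <- nrow(grade)            # 400 observations
p  <- length(coef(lm1)) - 1  # 7 estimated predictor terms
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.80, matching the summary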

Then, we will create another linear regression model with only two predictors: TOEFL.Score and Research.

2. Regression Model with Two Predictors

lm2 <- lm(Chance.of.Admit ~ TOEFL.Score + Research, grade)
summary(lm2)
## 
## Call:
## lm(formula = Chance.of.Admit ~ TOEFL.Score + Research, data = grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33422 -0.04267  0.01510  0.05416  0.20510 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.0384127  0.0820539  -12.65  < 2e-16 ***
## TOEFL.Score  0.0160940  0.0007857   20.48  < 2e-16 ***
## Research1    0.0622860  0.0095685    6.51 2.28e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08304 on 397 degrees of freedom
## Multiple R-squared:  0.6626, Adjusted R-squared:  0.6609 
## F-statistic: 389.9 on 2 and 397 DF,  p-value: < 2.2e-16

From the result above, we have:

  • Adj. R-squared = 0.6609, meaning this model can explain only 66.09% of the variance in Chance.of.Admit.

Then, we will try to use Stepwise Regression to create another regression model.

3. Stepwise Regression - Backward

lm3 <- step(object = lm1, direction = "backward")
## Start:  AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00144 1.5962 -2195.5
## - University.Rating  1   0.00584 1.6006 -2194.4
## <none>                           1.5948 -2193.9
## - TOEFL.Score        1   0.02921 1.6240 -2188.6
## - GRE.Score          1   0.03435 1.6291 -2187.4
## - Research           1   0.03862 1.6334 -2186.3
## - LOR                1   0.06620 1.6609 -2179.6
## - CGPA               1   0.38544 1.9802 -2109.3
## 
## Step:  AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  1   0.00464 1.6008 -2196.4
## <none>                           1.5962 -2195.5
## - TOEFL.Score        1   0.02806 1.6242 -2190.6
## - GRE.Score          1   0.03565 1.6318 -2188.7
## - Research           1   0.03769 1.6339 -2188.2
## - LOR                1   0.06983 1.6660 -2180.4
## - CGPA               1   0.38660 1.9828 -2110.8
## 
## Step:  AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     1.6008 -2196.4
## - TOEFL.Score  1   0.03292 1.6338 -2190.2
## - GRE.Score    1   0.03638 1.6372 -2189.4
## - Research     1   0.03912 1.6400 -2188.7
## - LOR          1   0.09133 1.6922 -2176.2
## - CGPA         1   0.43201 2.0328 -2102.8
summary(lm3)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = grade)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.2984636  0.1172905 -11.070  < 2e-16 ***
## GRE.Score    0.0017820  0.0005955   2.992  0.00294 ** 
## TOEFL.Score  0.0030320  0.0010651   2.847  0.00465 ** 
## LOR          0.0227762  0.0048039   4.741 2.97e-06 ***
## CGPA         0.1210042  0.0117349  10.312  < 2e-16 ***
## Research1    0.0245769  0.0079203   3.103  0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 2.2e-16

From the result above, we have:

  • Using backward stepwise regression, we keep GRE.Score, TOEFL.Score, LOR, CGPA, and Research as predictors.
  • Adj. R-squared = 0.8002, very slightly higher than the model that uses all predictors.
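
As a side note, step() also supports "forward" and "both" directions; a forward search starts from an intercept-only model. Here is a hedged sketch (lm_null and lm_fwd are names introduced purely for illustration):

# sketch: forward stepwise search from an intercept-only model,
# bounded above by the full formula of lm1
lm_null <- lm(Chance.of.Admit ~ 1, data = grade)
lm_fwd  <- step(lm_null, scope = formula(lm1), direction = "forward", trace = 0)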

After creating several models, let’s generate predictions from each model and compare their errors.

Forecast The Data & Error

Predict the Data

In this section, we will generate predictions from each model and store each predicted value in a new column.

# create new columns, each filled with one model's predictions
grade$predict1 <- predict(lm1, grade)
grade$predict2 <- predict(lm2, grade)
grade$predict3 <- predict(lm3, grade)

# quick check
head(grade)

Compare the Errors

After generating predictions from each model, we will compare each model’s error using the RMSE() function from the MLmetrics package.

# Compare the error of each model
RMSE(y_pred = grade$predict1, y_true = grade$Chance.of.Admit)
## [1] 0.06314185
RMSE(y_pred = grade$predict2, y_true = grade$Chance.of.Admit)
## [1] 0.08272896
RMSE(y_pred = grade$predict3, y_true = grade$Chance.of.Admit)
## [1] 0.06326207

The results above show:

  • The RMSE of lm1 (all predictors) and lm3 (backward stepwise) differ only slightly.
  • It is recommended to use model lm3 because, besides its RMSE being only slightly different from lm1’s, its Adj. R-squared is slightly higher and it uses fewer predictors.
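
As a cross-check, RMSE() is simply the square root of the mean squared error, so the same number can be reproduced by hand (a minimal sketch for lm3):

# RMSE by hand: square root of the mean squared difference
sqrt(mean((grade$predict3 - grade$Chance.of.Admit)^2))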

Having compared each model’s error and Adj. R-squared, we will now evaluate the chosen model.

Evaluate The Model

In this section we will evaluate model lm3 with four assumption checks: linearity, normality of residuals, homoscedasticity, and no multicollinearity.

1. Linearity Check

There are two ways to check the linearity of the model: visualize it, or use the cor.test() function. We will do both so each method cross-checks the other.

1.a. Using cor.test() function:

cor.test(grade$Chance.of.Admit, grade$GRE.Score)
## 
##  Pearson's product-moment correlation
## 
## data:  grade$Chance.of.Admit and grade$GRE.Score
## t = 26.843, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7647419 0.8349536
## sample estimates:
##       cor 
## 0.8026105
cor.test(grade$Chance.of.Admit, grade$TOEFL.Score)
## 
##  Pearson's product-moment correlation
## 
## data:  grade$Chance.of.Admit and grade$TOEFL.Score
## t = 25.845, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7519028 0.8255675
## sample estimates:
##      cor 
## 0.791594
cor.test(grade$Chance.of.Admit, grade$LOR)
## 
##  Pearson's product-moment correlation
## 
## data:  grade$Chance.of.Admit and grade$LOR
## t = 18, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6120380 0.7206083
## sample estimates:
##       cor 
## 0.6698888
cor.test(grade$Chance.of.Admit, grade$CGPA)
## 
##  Pearson's product-moment correlation
## 
## data:  grade$Chance.of.Admit and grade$CGPA
## t = 35.759, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8478354 0.8947275
## sample estimates:
##       cor 
## 0.8732891

The results above show that every p-value is < 0.05, which means each numeric predictor in model lm3 has a significant linear correlation with the target.
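
Rather than calling cor.test() four times, the same p-values can be collected in one pass (a small sketch; num_preds is a helper name introduced here):

# collect the linearity-test p-value for each numeric predictor of lm3
num_preds <- c("GRE.Score", "TOEFL.Score", "LOR", "CGPA")
sapply(num_preds, function(v) cor.test(grade$Chance.of.Admit, grade[[v]])$p.value)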

1.b. Visualize it!

simpred <- data.frame(residuals = lm3$residuals, fitted = lm3$fitted.values)

simpred %>% 
  ggplot(aes(fitted, residuals)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0), col = "red") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The plot above shows the residuals scattered around zero without a strong systematic curve, which supports the linearity assumption.

2. Normality of Residuals

After checking the linearity assumption, we will check whether the residuals follow a normal distribution, using the shapiro.test() function.

shapiro.test(lm3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lm3$residuals
## W = 0.92193, p-value = 1.443e-13

The result above shows that the p-value for model lm3 is < 0.05, meaning this model’s residuals do not follow a normal distribution.
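
To see how the residuals deviate from normality, a quick visual check with base graphics can complement the test (a minimal sketch):

# visual check (sketch): compare the residuals against a normal distribution
hist(lm3$residuals, breaks = 30, main = "Residuals of lm3", xlab = "residual")
qqnorm(lm3$residuals)
qqline(lm3$residuals, col = "red")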

3. Homoscedasticity

The third assumption check is homoscedasticity, which tells us whether the residuals show a pattern (heteroscedasticity) or not (homoscedasticity).

The hypotheses of the test are:

H0: homoscedasticity (no pattern) => p-value > 0.05
H1: heteroscedasticity (pattern present) => p-value < 0.05

To check this assumption, we will use the bptest() function from the lmtest package.

bptest(lm3)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm3
## BP = 22.428, df = 5, p-value = 0.0004341

The result above shows that the p-value of the test on model lm3 is < 0.05, meaning the residuals show a pattern (heteroscedasticity). We can visualize this:

plot(grade$Chance.of.Admit, lm3$residuals)
abline(h = 0, col = "red")

4. No Multicollinearity

The last assumption check, no multicollinearity, verifies that there is no strong linear correlation among the independent variables/predictors. To check it, we simply use the vif() function from the car package; a common rule of thumb is to drop any predictor whose VIF exceeds 5 (or, more leniently, 10).

vif(lm3)
##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.585053    4.104255    1.829491    4.808767    1.530007

The result above shows that every predictor in model lm3 has a VIF below 5, so the predictors show no serious multicollinearity.

Conclusion

  • We created three linear regression models:
    • lm1, a model with all predictors.
    • lm2, a model with only two predictors (TOEFL.Score and Research).
    • lm3, a backward stepwise regression model.
  • Comparing the Adj. R-squared of the three models, lm3 has the highest, although lm1 and lm3 differ only slightly; the same goes for their RMSE. It is still recommended to use lm3 for prediction, since it achieves comparable performance with fewer predictors.
  • After the assumption checks on model lm3, we know that all predictors have a linear correlation with the target and there is no multicollinearity among them, but the residuals are not normally distributed and show a pattern (the normality-of-residuals and homoscedasticity checks are not fulfilled).