Intro

We will build a linear regression model using a graduate admission dataset. We want to understand the relationships among the variables, especially between the chance of admission and the other variables. You can download the data here.

Library and Setup

library(tidyverse)
library(caret)
library(plotly)
library(car)
library(scales)
library(lmtest)
library(GGally)

Data Preparation

Load the dataset

admission <- read.csv("Admission_Predict.csv")

Check the structure of the dataset

str(admission)
## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Brief explanation of the variables:

  1. GRE Score: Graduate Record Examination score (out of 340)
  2. TOEFL Score: Test of English as a Foreign Language score (out of 120)
  3. University Rating: Rating of the university; the higher the better (out of 5)
  4. Statement of Purpose Strength: Strength of the essay or written statement the applicant submits when applying to a college, university, or graduate school; the higher the better (out of 5)
  5. Letter of Recommendation Strength: Strength of the reference letter that vouches for the applicant based on their characteristics and qualifications; the higher the better (out of 5)
  6. Undergraduate GPA: GPA based on all courses completed for the bachelor's degree (out of 10)
  7. Research Experience: Whether the candidate has research experience (either 0 or 1)
  8. Chance of Admit: The candidate's chance of being accepted into a Masters program (ranging from 0 to 1)

Data Cleaning

We won’t be using the Serial.No. variable for our linear regression model, so we will take it out of the data frame.

admission_clean <- admission %>% 
                  select(-Serial.No.)

Data Exploration

Check the number of rows and columns of the dataset.

dim(admission_clean)
## [1] 400   8

The dataset consists of 400 rows and 8 columns.

Check for missing values

anyNA(admission_clean)
## [1] FALSE

Great! Our data has no missing values.
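
Had any values been missing, a per-column count would help locate them; a quick sketch:

# number of NA values in each column (all zeros here)
colSums(is.na(admission_clean))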

Check the correlation between columns

ggcorr(admission_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

From the visualization above, we can conclude that almost all of the variables have a strong correlation with Chance.of.Admit, with CGPA having the highest positive correlation.
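
To see the exact numbers behind the plot, a quick sketch using base R's cor() (all columns are numeric at this point):

# correlations of every variable with Chance.of.Admit, strongest first
sort(cor(admission_clean)[, "Chance.of.Admit"], decreasing = TRUE)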

Build Model

We will build a linear regression model with CGPA as the predictor variable, since it has the highest positive correlation with Chance.of.Admit. We will name this model model_cgpa.

model_cgpa <- lm(formula = Chance.of.Admit ~ CGPA, data = admission_clean)
summary(model_cgpa)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274575 -0.030084  0.009443  0.041954  0.180734 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.07151    0.05034  -21.29   <2e-16 ***
## CGPA         0.20885    0.00584   35.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.762 
## F-statistic:  1279 on 1 and 398 DF,  p-value: < 2.2e-16
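
From the summary output, the fitted regression line (rounded) is

\[ \widehat{Chance.of.Admit} = -1.0715 + 0.2089 \times CGPA \]

As a quick sanity check, an applicant with a CGPA of 9 gets a predicted chance of admission of about \(-1.0715 + 0.2089 \times 9 \approx 0.81\).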

We will build a linear regression model using all predictor variables. We will name this model model_all.

model_all <- lm(formula = Chance.of.Admit ~ ., data = admission_clean)
summary(model_all)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2594325  0.1247307 -10.097  < 2e-16 ***
## GRE.Score          0.0017374  0.0005979   2.906  0.00387 ** 
## TOEFL.Score        0.0029196  0.0010895   2.680  0.00768 ** 
## University.Rating  0.0057167  0.0047704   1.198  0.23150    
## SOP               -0.0033052  0.0055616  -0.594  0.55267    
## LOR                0.0223531  0.0055415   4.034  6.6e-05 ***
## CGPA               0.1189395  0.0122194   9.734  < 2e-16 ***
## Research           0.0245251  0.0079598   3.081  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 2.2e-16

We will use backward stepwise selection to determine which combination of predictor variables gives the lowest AIC (Akaike Information Criterion).
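
For reference, AIC is defined as

\[ AIC = 2k - 2\ln(\hat{L}) \]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the model; lower AIC indicates a better trade-off between goodness of fit and model complexity.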

step(model_all, direction = "backward")
## Start:  AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00144 1.5962 -2195.5
## - University.Rating  1   0.00584 1.6006 -2194.4
## <none>                           1.5948 -2193.9
## - TOEFL.Score        1   0.02921 1.6240 -2188.6
## - GRE.Score          1   0.03435 1.6291 -2187.4
## - Research           1   0.03862 1.6334 -2186.3
## - LOR                1   0.06620 1.6609 -2179.6
## - CGPA               1   0.38544 1.9802 -2109.3
## 
## Step:  AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  1   0.00464 1.6008 -2196.4
## <none>                           1.5962 -2195.5
## - TOEFL.Score        1   0.02806 1.6242 -2190.6
## - GRE.Score          1   0.03565 1.6318 -2188.7
## - Research           1   0.03769 1.6339 -2188.2
## - LOR                1   0.06983 1.6660 -2180.4
## - CGPA               1   0.38660 1.9828 -2110.8
## 
## Step:  AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     1.6008 -2196.4
## - TOEFL.Score  1   0.03292 1.6338 -2190.2
## - GRE.Score    1   0.03638 1.6372 -2189.4
## - Research     1   0.03912 1.6400 -2188.7
## - LOR          1   0.09133 1.6922 -2176.2
## - CGPA         1   0.43201 2.0328 -2102.8
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission_clean)
## 
## Coefficients:
## (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA     Research  
##   -1.298464     0.001782     0.003032     0.022776     0.121004     0.024577

From the results above, we can conclude that the combination of GRE.Score, TOEFL.Score, LOR, CGPA, and Research gives the lowest AIC. We will build a model from this result and name it model_backward.

model_backward <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
    CGPA + Research, data = admission_clean)
summary(model_backward)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.2984636  0.1172905 -11.070  < 2e-16 ***
## GRE.Score    0.0017820  0.0005955   2.992  0.00294 ** 
## TOEFL.Score  0.0030320  0.0010651   2.847  0.00465 ** 
## LOR          0.0227762  0.0048039   4.741 2.97e-06 ***
## CGPA         0.1210042  0.0117349  10.312  < 2e-16 ***
## Research     0.0245769  0.0079203   3.103  0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 2.2e-16
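
Written out, the fitted model (rounded) is

\[ \widehat{Chance.of.Admit} = -1.2985 + 0.0018 \times GRE.Score + 0.0030 \times TOEFL.Score \\ + 0.0228 \times LOR + 0.1210 \times CGPA + 0.0246 \times Research \]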

Prediction

We will create a new column called prediction containing the predicted Chance.of.Admit from model_cgpa.

admission_clean$prediction <- predict(model_cgpa, admission_clean)

We will create a new column called prediction2 containing the predicted Chance.of.Admit from model_all.

admission_clean$prediction2 <- predict(model_all, admission_clean)

We will create a new column called prediction3 containing the predicted Chance.of.Admit from model_backward.

admission_clean$prediction3 <- predict(model_backward, admission_clean)
head(admission_clean)
##   GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research Chance.of.Admit
## 1       337         118                 4 4.5 4.5 9.65        1            0.92
## 2       324         107                 4 4.0 4.5 8.87        1            0.76
## 3       316         104                 3 3.0 3.5 8.00        1            0.72
## 4       322         110                 3 3.5 2.5 8.67        1            0.80
## 5       314         103                 2 2.0 3.0 8.21        0            0.65
## 6       330         115                 5 4.5 3.0 9.34        1            0.90
##   prediction prediction2 prediction3
## 1  0.9438641   0.9514586   0.9546053
## 2  0.7809633   0.8056367   0.8037043
## 3  0.5992662   0.6547367   0.6523025
## 4  0.7391938   0.7383624   0.7394830
## 5  0.6431241   0.6352064   0.6351524
## 6  0.8791215   0.8658537   0.8613597
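
As a quick sketch, model_backward can also score a brand new applicant. The values below are made up purely for illustration and are not taken from the dataset:

# a hypothetical applicant (illustrative values)
new_applicant <- data.frame(GRE.Score = 320, TOEFL.Score = 110,
                            LOR = 4, CGPA = 9, Research = 1)
# plugging these into the fitted coefficients gives roughly 0.81
predict(model_backward, new_applicant)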

Root mean squared error (RMSE) is the square root of the mean of the squared errors. RMSE is very commonly used and is considered an excellent general-purpose error metric for numerical predictions.
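
In formula form, where \(y_i\) is the observed value, \(\hat{y}_i\) the prediction, and \(n\) the number of observations:

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} \]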

RMSE(pred = admission_clean$prediction, obs = admission_clean$Chance.of.Admit)
## [1] 0.0693927
RMSE(pred = admission_clean$prediction2, obs = admission_clean$Chance.of.Admit)
## [1] 0.06314185
RMSE(pred = admission_clean$prediction3, obs = admission_clean$Chance.of.Admit)
## [1] 0.06326207

From the RMSE results above, model_all has the lowest RMSE; however, it uses all of the predictor variables, while model_backward achieves nearly the same RMSE with fewer predictors. We will therefore use model_backward.

Assumptions Checking

Linear regression has several assumptions that need to be fulfilled so that the interpretation obtained is not biased. These assumptions only need to be fulfilled if the purpose of building the linear regression model is interpretation, i.e., to see the effect of each predictor on the target variable. If you only want to use linear regression to make predictions, then the model assumptions are not required to be met.

Linearity

Linearity means that the target variable and its predictors have a linear (straight-line) relationship, and that the effects of the predictors are additive. If linearity is not met, the coefficient values we obtain are invalid, because the model assumes the underlying pattern is linear.

We can use a residual plot to check linearity. If there is a pattern in the residual plot, the model does not meet the linearity assumption.

linearity <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)

linearity %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

There is a pattern in the residuals, which suggests that our model does not fully satisfy the linearity assumption.

Normality of Residual

Another assumption of the linear regression model is that the residuals follow a normal distribution, meaning that most of the residuals are gathered around 0. We can check the normality of the residuals using the Shapiro-Wilk normality test. The hypotheses are:

\[ H_0: the\ residuals\ follow\ a\ normal\ distribution \\ H_1: the\ residuals\ do\ not\ follow\ a\ normal\ distribution \]

shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.92193, p-value = 1.443e-13

With p-value < 0.05, we reject the null hypothesis: the residuals do not follow a normal distribution.
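
As a complementary visual check (a sketch, not a formal test), a histogram of the residuals should look roughly bell-shaped and centered at 0:

# histogram of residuals; skewness or heavy tails signal non-normality
hist(model_backward$residuals, breaks = 30,
     main = "Distribution of residuals", xlab = "Residual")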

Heteroscedasticity

Heteroscedasticity refers to situations where the variance of the residuals is unequal over the range of measured values. In a regression analysis, heteroscedasticity shows up as an unequal scatter of the residuals (also known as the error term). We can check for heteroscedasticity using the Breusch-Pagan test. The hypotheses are:

\[ H_0: The\ residuals\ are\ distributed\ with\ equal\ variance\ (Homoscedasticity\ is\ present)\\ H_1: The\ residuals\ are\ not\ distributed\ with\ equal\ variance\ (Heteroscedasticity\ is\ present) \]

bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 22.428, df = 5, p-value = 0.0004341

With p-value < 0.05, we reject the null hypothesis: the residuals are not distributed with equal variance, meaning heteroscedasticity is present.

Multicollinearity

Multicollinearity means that there are strong correlations between the predictor variables. We can check whether multicollinearity is present by measuring the Variance Inflation Factor (VIF) of each predictor. If a VIF value is greater than 10, multicollinearity is present and we need to remove one of the variables with VIF > 10.
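
For predictor \(j\), the VIF is computed from the \(R_j^2\) obtained by regressing that predictor on all the other predictors:

\[ VIF_j = \frac{1}{1 - R_j^2} \]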

vif(model_backward)
##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.585053    4.104255    1.829491    4.808767    1.530007

Based on the results above, all VIF values are well below 10, so we can conclude that there is no multicollinearity in our data.

Conclusion

We can conclude that GRE.Score, TOEFL.Score, LOR, CGPA, and Research are the variables that best explain the variance in Chance.of.Admit, with RMSE = 0.063 and R-squared = 80.2%. Note, however, that the linearity, normality, and homoscedasticity checks all failed, so the coefficient estimates should be interpreted with caution, even though the model can still be used for prediction.