Graduate Admission

Introduction
- Business Objectives
Data Preparation
- Load the Required Package
- Load the Dataset
Exploratory Data Analysis
- Explore Data Variables
- Check Data Correlation
Modeling
Evaluation
Check Assumption
Using Data Test
Conclusion

Introduction

We will learn to use a linear regression model with the Graduate Admission dataset, which was created from an Indian perspective. We want to understand the relationships among the variables, especially between the Chance of Admit and the other variables. We also want to predict the Chance of Admit of new applicants based on the historical data. You can download the data here.

Business Objectives

This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.

Data Preparation

Load the Required Package

library(DT)           #datatables
library(dplyr)        #preprocess data
library(ggplot2)      #residual plots
library(GGally)       #ggcorr
library(MLmetrics)    #MSE
library(lmtest)       #homoscedasticity check
library(car)          #multicollinearity check

Load the Dataset

admission_data <- read.csv("data/Admission_Predict_Ver1.1.csv")

datatable(
  admission_data,
  options = list(scrollY = "400px")
)

Check data type

glimpse(admission_data)

## Observations: 500
## Variables: 9
## $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302,...
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3,...
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0,...
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5,...
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7....
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,...
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0....

The data has 500 rows and 9 columns. Serial.No. is a unique identifier, so we can ignore it. Our target variable is Chance.of.Admit, and we will use the other variables as predictors.

Before we go further, we need to make sure that our data is clean and useful, so we will remove the unused variable:

admission_data <- admission_data %>% 
  select(-Serial.No.)
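Dropping the identifier is only one part of cleaning. As a minimal sketch, assuming a small hypothetical data frame (toy_data, which borrows a few values from the rows shown above and adds one NA on purpose), we could also count missing values per column before modeling:

```r
# Hypothetical toy data standing in for admission_data.
# The NA is introduced deliberately for illustration.
toy_data <- data.frame(
  GRE.Score       = c(337, 324, NA, 322),
  CGPA            = c(9.65, 8.87, 8.00, 8.67),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_data))
print(na_counts)

# Keep only complete rows (one possible way to handle missing values)
toy_clean <- toy_data[complete.cases(toy_data), ]
print(nrow(toy_clean))
```

The same two calls can be run on admission_data directly; dropping rows is only one option, and whether it is appropriate depends on how many values are missing.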

Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.

Explore Data Variables

The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are :

GRE Scores (out of 340)

summary(admission_data$GRE.Score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   290.0   308.0   317.0   316.5   325.0   340.0

TOEFL Scores (out of 120)

summary(admission_data$TOEFL.Score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    92.0   103.0   107.0   107.2   112.0   120.0

University Rating (out of 5)

summary(admission_data$University.Rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.114   4.000   5.000

Statement of Purpose and Letter of Recommendation Strength (out of 5)

summary(admission_data$SOP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.500   3.500   3.374   4.000   5.000

summary(admission_data$LOR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.500   3.484   4.000   5.000

Undergraduate GPA (out of 10)

summary(admission_data$CGPA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.800   8.127   8.560   8.576   9.040   9.920

Research Experience (either 0 or 1)

summary(admission_data$Research)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    0.56    1.00    1.00

Check Data Correlation

Find the Pearson correlation between variables :

ggcorr(admission_data, label = TRUE, hjust = 0.9, cex = 3)

The plot shows that all variables have a strong correlation with the Chance.of.Admit variable.
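To read the same information as numbers rather than colors, one option is to call cor() directly. The sketch below uses only the first seven rows visible in the glimpse() output above (three predictors plus the target), so the values it prints are illustrative and will differ from the full-data correlations:

```r
# First seven rows as printed by glimpse() above (subset of columns)
first_rows <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321),
  TOEFL.Score     = c(118, 107, 104, 110, 103, 115, 109),
  CGPA            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75)
)

# Pearson correlation matrix (the default method of cor())
cor_matrix <- cor(first_rows, method = "pearson")

# Correlation of every predictor with the target, strongest first
target_cor <- cor_matrix[, "Chance.of.Admit"]
target_cor <- sort(target_cor[names(target_cor) != "Chance.of.Admit"],
                   decreasing = TRUE)
print(round(target_cor, 3))
```

Replacing first_rows with admission_data gives the numbers behind the ggcorr() plot.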

Modeling

Cross Validation

Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to fit the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and fails to predict new data that it has not seen during the training phase. We will use 80% of the data as the training data and the rest as the testing data.

set.seed(100)
idx <- sample(nrow(admission_data), nrow(admission_data)*0.8)


data_train <- admission_data[idx,]
data_test <- admission_data[-idx,]
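The single 80/20 split above is a hold-out validation. For comparison, k-fold cross-validation repeats the split so that every row is used for testing exactly once. The following is a sketch on simulated data (sim_data and its columns x and y are hypothetical and unrelated to the admission data):

```r
set.seed(100)

# Simulated data: y depends linearly on x plus noise (hypothetical example)
sim_data <- data.frame(x = runif(200, 0, 10))
sim_data$y <- 1 + 0.5 * sim_data$x + rnorm(200, sd = 1)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(1:k, length.out = nrow(sim_data)))

fold_mse <- sapply(1:k, function(i) {
  train <- sim_data[fold_id != i, ]
  test  <- sim_data[fold_id == i, ]
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  mean((test$y - pred)^2)          # MSE on the held-out fold
})

print(round(fold_mse, 3))
print(round(mean(fold_mse), 3))    # cross-validated MSE estimate
```

Averaging over folds makes the error estimate less dependent on one particular random split, at the cost of fitting the model k times.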

Modeling

Based on the Pearson correlations, all variables have a strong correlation with the Chance.of.Admit variable, so we will first build a model with all variables from the training data.

lm_admission_all <- lm(data = data_train, formula = Chance.of.Admit~.)

summary(lm_admission_all)

## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.243304 -0.023909  0.007254  0.032844  0.153875 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.4067512  0.1080347 -13.021 < 0.0000000000000002 ***
## GRE.Score          0.0023085  0.0005272   4.379            0.0000153 ***
## TOEFL.Score        0.0030805  0.0009126   3.376              0.00081 ***
## University.Rating  0.0069311  0.0039922   1.736              0.08332 .  
## SOP               -0.0007643  0.0048477  -0.158              0.87480    
## LOR                0.0116529  0.0043049   2.707              0.00709 ** 
## CGPA               0.1164724  0.0102799  11.330 < 0.0000000000000002 ***
## Research           0.0199933  0.0069428   2.880              0.00420 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05654 on 392 degrees of freedom
## Multiple R-squared:  0.8336, Adjusted R-squared:  0.8306 
## F-statistic: 280.5 on 7 and 392 DF,  p-value: < 0.00000000000000022
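Each estimate is the expected change in Chance.of.Admit for a one-unit increase in that predictor while the others are held fixed. As a worked example, we can plug the first applicant from the glimpse() output into the printed coefficients by hand. The vectors coefs and applicant below are typed in from the outputs above; with the fitted model available, predict(lm_admission_all, newdata = ...) would be the usual route.

```r
# Coefficients copied from the summary(lm_admission_all) output above
coefs <- c(
  Intercept         = -1.4067512,
  GRE.Score         =  0.0023085,
  TOEFL.Score       =  0.0030805,
  University.Rating =  0.0069311,
  SOP               = -0.0007643,
  LOR               =  0.0116529,
  CGPA              =  0.1164724,
  Research          =  0.0199933
)

# First applicant shown by glimpse(), in the same order as coefs.
# The leading 1 multiplies the intercept.
applicant <- c(1, 337, 118, 4, 4.5, 4.5, 9.65, 1)

# Linear prediction: intercept plus the sum of coefficient * value
predicted <- sum(coefs * applicant)
print(round(predicted, 3))
```

This gives a predicted chance of about 0.955, compared with the observed 0.92 for that applicant.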

Feature Selection using Stepwise Regression

Now we will try to eliminate variables to get a better model using stepwise regression.

Backward method

lm_admission_back <- step(lm_admission_all, direction = "backward")

## Start:  AIC=-2290.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00008 1.2530 -2292.4
## <none>                           1.2530 -2290.4
## - University.Rating  1   0.00963 1.2626 -2289.3
## - LOR                1   0.02342 1.2764 -2285.0
## - Research           1   0.02651 1.2795 -2284.0
## - TOEFL.Score        1   0.03642 1.2894 -2280.9
## - GRE.Score          1   0.06130 1.3142 -2273.3
## - CGPA               1   0.41031 1.6633 -2179.1
## 
## Step:  AIC=-2292.36
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.2530 -2292.4
## - University.Rating  1   0.01057 1.2636 -2291.0
## - LOR                1   0.02475 1.2778 -2286.5
## - Research           1   0.02655 1.2796 -2286.0
## - TOEFL.Score        1   0.03642 1.2894 -2282.9
## - GRE.Score          1   0.06175 1.3148 -2275.1
## - CGPA               1   0.42549 1.6785 -2177.4

Forward method

lm_admission_none <- lm(data = data_train, formula = Chance.of.Admit~1)

lm_admission_forward <- step(lm_admission_none, scope = list(lower = lm_admission_none, upper = lm_admission_all), direction = "forward")

## Start:  AIC=-1587.12
## Chance.of.Admit ~ 1
## 
##                     Df Sum of Sq    RSS     AIC
## + CGPA               1    5.9358 1.5926 -2206.4
## + GRE.Score          1    5.1681 2.3602 -2049.1
## + TOEFL.Score        1    4.8172 2.7111 -1993.6
## + University.Rating  1    3.4607 4.0677 -1831.4
## + SOP                1    3.2633 4.2650 -1812.4
## + LOR                1    2.7773 4.7511 -1769.2
## + Research           1    2.2163 5.3120 -1724.6
## <none>                           7.5283 -1587.1
## 
## Step:  AIC=-2206.45
## Chance.of.Admit ~ CGPA
## 
##                     Df Sum of Sq    RSS     AIC
## + GRE.Score          1  0.214801 1.3778 -2262.4
## + TOEFL.Score        1  0.161008 1.4316 -2247.1
## + Research           1  0.092582 1.5000 -2228.4
## + University.Rating  1  0.061012 1.5315 -2220.1
## + LOR                1  0.049521 1.5430 -2217.1
## + SOP                1  0.023098 1.5695 -2210.3
## <none>                           1.5926 -2206.4
## 
## Step:  AIC=-2262.4
## Chance.of.Admit ~ CGPA + GRE.Score
## 
##                     Df Sum of Sq    RSS     AIC
## + LOR                1  0.045959 1.3318 -2274.0
## + TOEFL.Score        1  0.042554 1.3352 -2272.9
## + University.Rating  1  0.036724 1.3410 -2271.2
## + Research           1  0.031664 1.3461 -2269.7
## + SOP                1  0.018674 1.3591 -2265.9
## <none>                           1.3778 -2262.4
## 
## Step:  AIC=-2273.97
## Chance.of.Admit ~ CGPA + GRE.Score + LOR
## 
##                     Df Sum of Sq    RSS     AIC
## + TOEFL.Score        1  0.038574 1.2932 -2283.7
## + Research           1  0.026134 1.3057 -2279.9
## + University.Rating  1  0.019409 1.3124 -2277.8
## <none>                           1.3318 -2274.0
## + SOP                1  0.003728 1.3281 -2273.1
## 
## Step:  AIC=-2283.73
## Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score
## 
##                     Df Sum of Sq    RSS     AIC
## + Research           1 0.0296239 1.2636 -2291.0
## + University.Rating  1 0.0136412 1.2796 -2286.0
## <none>                           1.2932 -2283.7
## + SOP                1 0.0012437 1.2920 -2282.1
## 
## Step:  AIC=-2291
## Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research
## 
##                     Df Sum of Sq    RSS     AIC
## + University.Rating  1 0.0105685 1.2530 -2292.4
## <none>                           1.2636 -2291.0
## + SOP                1 0.0010136 1.2626 -2289.3
## 
## Step:  AIC=-2292.36
## Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research + 
##     University.Rating
## 
##        Df   Sum of Sq   RSS     AIC
## <none>                1.253 -2292.4
## + SOP   1 0.000079451 1.253 -2290.4

Both the backward and forward methods arrive at the same model (all predictors except SOP), so we will use one of them, lm_admission_back.
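The AIC values printed by step() can be reproduced by hand. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * p, where n is the number of training rows and p is the number of estimated coefficients, including the intercept. A small check using the RSS values reported above (step_aic is a helper name introduced here for illustration):

```r
# AIC as used by step() for lm objects: n * log(RSS / n) + 2 * p
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n_train <- 400   # 80% of the 500 rows

# RSS values are copied (rounded) from the step() output above
aic_null <- step_aic(7.5283, n_train, p = 1)  # intercept only
aic_full <- step_aic(1.2530, n_train, p = 8)  # 7 predictors + intercept
aic_back <- step_aic(1.2530, n_train, p = 7)  # SOP removed

print(round(c(null = aic_null, full = aic_full, back = aic_back), 2))
```

Removing SOP barely changes the RSS but saves one coefficient, which is why the AIC drops by about 2 and the smaller model is preferred. Small differences in the last digit come from the rounded RSS values.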

Evaluation

We now have three models (lm_admission_all, lm_admission_none and lm_admission_back), and we will check how well each one predicts the target variable using MSE and adjusted R-squared.

# MSE of train dataset
MSE(lm_admission_all$fitted.values, data_train$Chance.of.Admit) %>% round(4)

## [1] 0.0031

MSE(lm_admission_none$fitted.values, data_train$Chance.of.Admit) %>% round(4)

## [1] 0.0188

MSE(lm_admission_back$fitted.values, data_train$Chance.of.Admit) %>% round(4)

## [1] 0.0031

summary(lm_admission_all)$adj.r.squared %>% round(4)

## [1] 0.8306

summary(lm_admission_none)$adj.r.squared %>% round(4)

## [1] 0

summary(lm_admission_back)$adj.r.squared %>% round(4)

## [1] 0.831

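From the results above, the best model is lm_admission_back: it has the highest adjusted R-squared (0.831 versus 0.8306 for the full model) and the same rounded MSE (Mean Squared Error) as the full model, 0.0031, while using one predictor fewer.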
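Both metrics can be recomputed from numbers already shown above, which makes the comparison easier to follow. MSE is the residual sum of squares divided by the number of rows, and adjusted R-squared penalizes R-squared for the number of predictors p. The helper names below are introduced for illustration only:

```r
# MSE from the residual sum of squares
mse_from_rss <- function(rss, n) rss / n

# Adjusted R-squared from R-squared, sample size n and p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_train <- 400

# RSS values copied from the step() output; R-squared from summary()
mse_full   <- round(mse_from_rss(1.2530, n_train), 4)
mse_null   <- round(mse_from_rss(7.5283, n_train), 4)
adj_r2_all <- round(adj_r2(0.8336, n_train, p = 7), 4)

print(c(mse_full = mse_full, mse_null = mse_null, adj_r2_all = adj_r2_all))
```

These reproduce the 0.0031, 0.0188 and 0.8306 reported by MSE() and summary().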

Check Assumption

Linearity

resact <- data.frame(residual = lm_admission_back$residuals, fitted = lm_admission_back$fitted.values)

resact %>% 
  ggplot(aes(fitted, residual)) + 
  geom_point() + 
  geom_hline(aes(yintercept = 0)) + 
  geom_smooth() + 
  theme(panel.grid = element_blank(), panel.background = element_blank())

There is little to no discernible pattern in our residual plot, so we can conclude that the relationship is approximately linear.

Normality of Residual

hist(lm_admission_back$residuals)

plot(density(lm_admission_back$residuals))

shapiro.test(lm_admission_back$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  lm_admission_back$residuals
## W = 0.93567, p-value = 0.000000000003987

With p-value < 0.05, we reject the null hypothesis of normality and conclude that our residuals are not normally distributed.
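The Shapiro-Wilk test takes normality as its null hypothesis, so a small p-value is evidence against normal residuals. To see how the test reacts, the sketch below applies it to a simulated, clearly right-skewed sample (skewed_sample is hypothetical and not related to our residuals):

```r
set.seed(100)

# Hypothetical right-skewed sample of the same size as our training data
skewed_sample <- rexp(400, rate = 1)

# Shapiro-Wilk test: H0 = the sample comes from a normal distribution
sw_result <- shapiro.test(skewed_sample)
print(sw_result$statistic)
print(sw_result$p.value)

# Decision at the 5% level
reject_normality <- sw_result$p.value < 0.05
print(reject_normality)
```

A Q-Q plot with qqnorm() and qqline() is a common visual complement to the test, since with several hundred observations even mild departures from normality can produce a small p-value.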

Homoscedasticity

bptest(lm_admission_back)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_admission_back
## BP = 22.58, df = 6, p-value = 0.0009502

resact %>% 
  ggplot(aes(fitted, residual)) + 
  geom_point() + 
  geom_hline(aes(yintercept = 0)) + 
  theme(panel.grid = element_blank(), panel.background = element_blank())

With p-value < 0.05, we reject the null hypothesis of constant variance and conclude that heteroscedasticity is present.
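The studentized Breusch-Pagan statistic can be understood as n times the R-squared of an auxiliary regression of the squared residuals on the predictors, compared against a chi-squared distribution with as many degrees of freedom as there are predictors. The sketch below reproduces that idea in base R on simulated data whose noise grows with x (all object names are hypothetical):

```r
set.seed(100)

n <- 400
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error spread increases with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
sq_res <- residuals(fit)^2
aux    <- lm(sq_res ~ x)

# Test statistic and p-value (one predictor, so df = 1)
bp_stat   <- n * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(round(bp_stat, 2))
print(bp_pvalue)
```

When the error variance depends on the predictors, the squared residuals become predictable from them, which pushes the statistic up and the p-value down.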

Little to no multicollinearity

vif(lm_admission_back)

##         GRE.Score       TOEFL.Score University.Rating               LOR 
##          4.400360          3.698528          2.133388          1.727872 
##              CGPA          Research 
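##          4.483945          1.484216

All VIF values are below 5, well under the commonly used threshold of 10, so there is no serious multicollinearity among the predictors.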
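The VIF of a predictor is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the other predictors. A base R sketch on simulated data, where x2 is built to be correlated with x1 (the data and the helper manual_vif are hypothetical):

```r
set.seed(100)

n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # correlated with x1
x3 <- rnorm(n)                        # independent of the others
predictors <- data.frame(x1, x2, x3)

# VIF for each column: regress it on the remaining columns
manual_vif <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    aux    <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(predictors)
print(round(vifs, 3))
```

The correlated pair x1 and x2 receives clearly larger values than x3, whose VIF stays close to 1.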

Using Data Test

lm_admission_test <- lm(data = data_test, formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    SOP + LOR + CGPA + Research)

MSE(lm_admission_test$fitted.values, data_test$Chance.of.Admit) %>% round(4)

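## [1] 0.0045

Note that this code fits a new model on the test rows (with all seven predictors, including SOP) and reports its in-sample error, so 0.0045 does not measure how lm_admission_back, trained on data_train, performs on unseen data. An out-of-sample MSE would compare predict(lm_admission_back, newdata = data_test) with data_test$Chance.of.Admit.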
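The usual pattern for a test-set evaluation is to keep the model trained on the training rows and only call predict() on the test rows. A sketch on simulated data (sim_data, x and y are hypothetical, so the numbers say nothing about the admission model):

```r
set.seed(100)

# Simulated data with a known linear relationship (hypothetical example)
sim_data <- data.frame(x = runif(500, 0, 10))
sim_data$y <- 1 + 0.5 * sim_data$x + rnorm(500, sd = 1)

# Same 80/20 split logic as used earlier in this analysis
idx       <- sample(nrow(sim_data), nrow(sim_data) * 0.8)
sim_train <- sim_data[idx, ]
sim_test  <- sim_data[-idx, ]

# Fit on the training rows only
fit <- lm(y ~ x, data = sim_train)

# In-sample error versus error on rows the model has never seen
train_mse <- mean((sim_train$y - fitted(fit))^2)
test_pred <- predict(fit, newdata = sim_test)
test_mse  <- mean((sim_test$y - test_pred)^2)

print(round(c(train = train_mse, test = test_mse), 4))
```

A test MSE that is much larger than the training MSE would suggest overfitting; similar values suggest the model generalizes to new rows.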

Conclusion

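The variables that are useful for describing the variance in Chance of Admit are GRE Score, TOEFL Score, University Rating, Letter of Recommendation Strength, Undergraduate GPA and Research Experience. The final model meets the linearity assumption and shows little multicollinearity (all VIF values are below 5), but the Shapiro-Wilk and Breusch-Pagan tests indicate that the residuals are not normally distributed and that heteroscedasticity is present. The classical assumptions are therefore only partly satisfied, and the standard errors and p-values should be read with caution. The adjusted R-squared is high: the model explains about 83.1% of the variance in the Chance of Admit. The error of the model is measured with MSE: 0.0031 on the training data and 0.0045 on the testing data.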

We have now learned how to build a linear regression model and what to watch out for when building it.