Linear Regression on Graduate Admission Prediction

Introduction

About the data

The data contains several parameters which are considered important during the application for Masters Programs. The parameters included are :

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

Business goal

This dataset was created to predict the chance of graduate admission. It was built to help students shortlist universities in accordance with their profiles. The predicted output gives them a fair idea of their chances of getting admitted into a particular university.

What we will do

We will build a linear regression model using the Graduate Admission data from Kaggle. We want to understand the relationships among the variables, especially between Admission_Chance and the other variables. We also want to predict the chance of someone getting into a university using historical data. You can download the data here: https://www.kaggle.com/mohansacharya/graduate-admissions

Import Library

library(dplyr)
library(ggplot2)
library(GGally)
library(performance)
library(MLmetrics)
library(rmdformats)
library(lmtest)
library(car)

Data Preparation

Read data

admission <- read.csv("data_input/Admission_Predict.csv") %>% 
              select(-Serial.No.)
rmarkdown::paged_table(admission)

Renaming columns

names(admission) <- c("GRE", "TOEFL", "University_Rating", "SOP_Strength", "LOR_Strength", "CGPA", "Research", "Admission_Chance")
head(admission)
##   GRE TOEFL University_Rating SOP_Strength LOR_Strength CGPA Research
## 1 337   118                 4          4.5          4.5 9.65        1
## 2 324   107                 4          4.0          4.5 8.87        1
## 3 316   104                 3          3.0          3.5 8.00        1
## 4 322   110                 3          3.5          2.5 8.67        1
## 5 314   103                 2          2.0          3.0 8.21        0
## 6 330   115                 5          4.5          3.0 9.34        1
##   Admission_Chance
## 1             0.92
## 2             0.76
## 3             0.72
## 4             0.80
## 5             0.65
## 6             0.90

Data Wrangling/Preprocessing

Check for missing values

admission %>% 
  is.na() %>% 
  colSums()/nrow(admission)
##               GRE             TOEFL University_Rating      SOP_Strength 
##                 0                 0                 0                 0 
##      LOR_Strength              CGPA          Research  Admission_Chance 
##                 0                 0                 0                 0

There are no missing values, so no imputation or row removal is needed.

Check if there are any mismatched data type and change them if necessary

Changing them to the right data type will ease the data analytics and machine learning process.

summary(admission)
##       GRE            TOEFL       University_Rating  SOP_Strength
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.0  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.5  
##  Median :317.0   Median :107.0   Median :3.000     Median :3.5  
##  Mean   :316.8   Mean   :107.4   Mean   :3.087     Mean   :3.4  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.0  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.0  
##   LOR_Strength        CGPA          Research      Admission_Chance
##  Min.   :1.000   Min.   :6.800   Min.   :0.0000   Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.0000   1st Qu.:0.6400  
##  Median :3.500   Median :8.610   Median :1.0000   Median :0.7300  
##  Mean   :3.453   Mean   :8.599   Mean   :0.5475   Mean   :0.7244  
##  3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:1.0000   3rd Qu.:0.8300  
##  Max.   :5.000   Max.   :9.920   Max.   :1.0000   Max.   :0.9700

Looks like we can change University_Rating and Research to factors, since both take only a small set of discrete values.

Change the respective columns to the right data types

admission <- 
  admission %>% 
  mutate_at(vars(University_Rating, Research), as.factor)

str(admission)
## 'data.frame':    400 obs. of  8 variables:
##  $ GRE              : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL            : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University_Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP_Strength     : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR_Strength     : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Admission_Chance : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
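Converting these columns to factors matters for the model: lm() expands a factor into indicator (dummy) columns, one per level except the baseline. A minimal sketch on a toy factor (the rating vector below is made up for illustration and is not taken from the admission data):

```r
# Toy factor with the same five levels as University_Rating (illustrative values only)
rating <- factor(c(1, 2, 3, 4, 5, 3, 2), levels = 1:5)

# model.matrix() shows the design matrix lm() would build from this factor
design <- model.matrix(~ rating)
dummy_names <- colnames(design)
print(dummy_names)
```

Level 1 acts as the baseline, so each dummy coefficient in a fitted model is measured relative to it.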

Exploratory Data Analysis

Check correlations among the numeric columns to the target column

ggcorr(admission, label = T, hjust = 0.7)
## Warning in ggcorr(admission, label = T, hjust = 0.7): data in column(s)
## 'University_Rating', 'Research' are not numeric and were ignored

Insights:

  • All the numeric columns are positively correlated with the target column
  • CGPA has the strongest correlation (about 0.9) with Admission_Chance
  • TOEFL, SOP_Strength and LOR_Strength show similar levels of correlation with Admission_Chance

Plot a scatter plot of Admission_Chance against CGPA

This is done to dig deeper and confirm the strong correlation between the two.

plot(admission$CGPA, 
     admission$Admission_Chance,
     main = "Plot of CGPA against the Chance of Getting into College",
     xlab = "Cumulative GPA", 
     ylab = "Admission Chance")

The plot above shows a strong positive correlation between Admission_Chance and CGPA: the higher the CGPA, the higher the chance of getting into a college.

Plot box plots for CGPA, TOEFL and Admission_Chance

Try to find any possible outliers

boxplot(admission$CGPA,
        main = "CGPA Data Distribution")

boxplot(admission$TOEFL,
        main = "TOEFL Score Data Distribution")

boxplot(admission$Admission_Chance,
        main = "Admission Chance Data Distribution")

Cross Validation

This step sets aside some “unseen” data so that we can measure the model's accuracy and performance on observations it was not trained on. We will use a 75:25 proportion for this data set.

Splitting the data into train and test sets

set.seed(123)
index <- sample(nrow(admission), nrow(admission)*0.75)

data_train <- admission[index,]
data_test <- admission[-index,]
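The split logic can be sanity-checked in isolation. The sketch below uses a hypothetical stand-in data frame (toy, with an id column) that only shares the row count of the admission data, and confirms that the two parts have the expected sizes and share no rows:

```r
# Hypothetical stand-in with the same number of rows as the admission data
toy <- data.frame(id = 1:400)

set.seed(123)
idx <- sample(nrow(toy), nrow(toy) * 0.75)  # row numbers drawn without replacement

toy_train <- toy[idx, , drop = FALSE]
toy_test  <- toy[-idx, , drop = FALSE]

split_sizes <- c(train = nrow(toy_train), test = nrow(toy_test))
overlap <- length(intersect(toy_train$id, toy_test$id))
print(split_sizes)
print(overlap)
```

Because sample() draws without replacement by default, negative indexing with the same idx returns exactly the rows that were not selected for training.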

Model Building

Build a model that uses all predictors

model_admission_all <- lm(formula = Admission_Chance ~ ., data = data_train)
summary(model_admission_all)
## 
## Call:
## lm(formula = Admission_Chance ~ ., data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263574 -0.020604  0.009283  0.032224  0.165258 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.2891833  0.1466345  -8.792  < 2e-16 ***
## GRE                 0.0020508  0.0006663   3.078 0.002284 ** 
## TOEFL               0.0029860  0.0011909   2.507 0.012715 *  
## University_Rating2 -0.0217269  0.0162474  -1.337 0.182193    
## University_Rating3 -0.0148768  0.0177127  -0.840 0.401663    
## University_Rating4 -0.0178331  0.0216163  -0.825 0.410063    
## University_Rating5  0.0020886  0.0237871   0.088 0.930093    
## SOP_Strength       -0.0022572  0.0066178  -0.341 0.733294    
## LOR_Strength        0.0220683  0.0065183   3.386 0.000809 ***
## CGPA                0.1133068  0.0138506   8.181 9.05e-15 ***
## Research1           0.0253893  0.0091847   2.764 0.006071 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06201 on 289 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8027 
## F-statistic: 122.6 on 10 and 289 DF,  p-value: < 2.2e-16

Model Evaluation

Calculate the RMSE of the training data

RMSE(y_pred = model_admission_all$fitted.values, y_true = data_train$Admission_Chance)
## [1] 0.06086444

The RMSE is about 0.061 on a target that ranges from 0.34 to 0.97, so the model fits the training data well.
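RMSE is the square root of the mean squared difference between predictions and observed values, so it is expressed in the same units as Admission_Chance. A minimal hand-rolled sketch (rmse_manual and the toy vectors are illustrative names and values, not part of MLmetrics or the admission data):

```r
# Hand-rolled RMSE; argument names mirror the MLmetrics call used in this document
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_true - y_pred)^2))
}

# Toy values for illustration only
y_true <- c(0.90, 0.70, 0.50)
y_pred <- c(0.80, 0.70, 0.60)
toy_rmse <- rmse_manual(y_pred, y_true)
print(toy_rmse)
```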

Make a prediction on data_test

model_admission_all_pred <- predict(model_admission_all,
                                    newdata = data_test %>% select(-Admission_Chance))

RMSE(y_pred = model_admission_all_pred, y_true = data_test$Admission_Chance)
## [1] 0.06885452

The test RMSE (0.0689) is only slightly larger than the training RMSE (0.0609), so there is little sign of overfitting.

Model fine-tuning

To fine-tune the model, we can use stepwise regression to select the most useful features and obtain a simpler model.

Feature selection

Using Step-Wise regression with “backwards” direction

model_admission_back <- step(model_admission_all, 
                             direction = "backward",
                             trace = F)
summary(model_admission_back)
## 
## Call:
## lm(formula = Admission_Chance ~ GRE + TOEFL + LOR_Strength + 
##     CGPA + Research, data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26640 -0.02093  0.01044  0.03477  0.16085 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.3423417  0.1343084  -9.994  < 2e-16 ***
## GRE           0.0021750  0.0006581   3.305 0.001068 ** 
## TOEFL         0.0028850  0.0011610   2.485 0.013513 *  
## LOR_Strength  0.0219284  0.0056717   3.866 0.000136 ***
## CGPA          0.1138209  0.0132256   8.606 4.64e-16 ***
## Research1     0.0246875  0.0091284   2.704 0.007240 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06203 on 294 degrees of freedom
## Multiple R-squared:  0.8059, Adjusted R-squared:  0.8026 
## F-statistic: 244.1 on 5 and 294 DF,  p-value: < 2.2e-16
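To see how the fitted coefficients turn into a prediction, the sketch below applies the estimates printed above to one hypothetical applicant. The applicant's values are invented for illustration; only the coefficients come from the summary.

```r
# Coefficients copied from the summary of model_admission_back above
coefs <- c(intercept = -1.3423417, GRE = 0.0021750, TOEFL = 0.0028850,
           LOR_Strength = 0.0219284, CGPA = 0.1138209, Research1 = 0.0246875)

# Hypothetical applicant (values invented for illustration)
applicant <- c(intercept = 1, GRE = 320, TOEFL = 110,
               LOR_Strength = 4, CGPA = 9, Research1 = 1)

# Linear prediction: intercept plus the sum of coefficient * value
predicted_chance <- sum(coefs * applicant[names(coefs)])
print(round(predicted_chance, 4))
```

With the fitted model object available, predict() on a one-row data frame should give the same number up to the rounding of the printed coefficients.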

Try predicting using the new model

model_admission_back_pred <- predict(model_admission_back,
                                     newdata = data_test %>% select(-Admission_Chance))
# Score the predictions of the backward-selected model, not those of the full model
RMSE(y_pred = model_admission_back_pred, y_true = data_test$Admission_Chance)

The call above scores the backward-selected model on the test set; for reference, the full model's test RMSE was 0.06885452. We can now move on to comparing the performance of both models.

Comparing the two models' performance

compare_performance(model_admission_all, model_admission_back)
## # Comparison of Model Performance Indices
## 
## Name                 | Model |      AIC | AIC weights |      BIC | BIC weights |    R2 | R2 (adj.) |  RMSE | Sigma
## ------------------------------------------------------------------------------------------------------------------
## model_admission_all  |    lm | -804.101 |       0.088 | -759.655 |     < 0.001 | 0.809 |     0.803 | 0.061 | 0.062
## model_admission_back |    lm | -808.771 |       0.912 | -782.844 |       1.000 | 0.806 |     0.803 | 0.061 | 0.062

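Both models perform similarly: the adjusted R2 (0.803), RMSE and sigma are essentially identical. However, model_admission_back has a lower AIC (-808.771 vs -804.101) and a lower BIC (-782.844 vs -759.655) while using fewer predictors. Thus we prefer model_admission_back for the next steps.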
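The "AIC weights" column in the table can be reproduced by hand from the two AIC values, which helps in reading it: each weight is the model's relative likilihood share among the candidates. A minimal sketch, assuming the standard Akaike-weight formula and using only the AIC values shown in the table:

```r
# AIC values copied from the compare_performance() table
aic <- c(model_admission_all = -804.101, model_admission_back = -808.771)

delta <- aic - min(aic)                 # difference from the best (lowest) AIC
rel_lik <- exp(-0.5 * delta)            # relative likelihood of each model
aic_weights <- rel_lik / sum(rel_lik)   # normalise so the weights sum to 1
print(round(aic_weights, 3))
```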

Checking for assumptions

Linearity

Why test linearity? Because a linear regression model can only learn linear patterns well.

Let’s take one of the strongly correlated predictors, CGPA, and test it for linearity against the target variable:

cor.test(x = admission$CGPA, y = admission$Admission_Chance)
## 
##  Pearson's product-moment correlation
## 
## data:  admission$CGPA and admission$Admission_Chance
## t = 35.759, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8478354 0.8947275
## sample estimates:
##       cor 
## 0.8732891

Since the p-value is smaller than 0.05, we can say that CGPA and Admission_Chance are significantly linearly correlated.

Normality

The normality check tests whether the model residuals are normally distributed. We use the Shapiro-Wilk test here:

shapiro.test(model_admission_back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_admission_back$residuals
## W = 0.90741, p-value = 1.283e-12

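The null hypothesis of the Shapiro-Wilk test is that the residuals are normally distributed. Since the p-value is smaller than 0.05, we reject it: the residuals of the model are not normally distributed, so this assumption is not satisfied.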
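The direction of the Shapiro-Wilk test is easy to mix up: its null hypothesis is normality, so a small p-value is evidence against a normal distribution. A small sketch on simulated, deliberately skewed values (these are not residuals from the admission model):

```r
set.seed(42)
# Simulated, clearly skewed "residuals" (not from the admission model)
skewed <- rexp(300) - 1

sw <- shapiro.test(skewed)
reject_normality <- sw$p.value < 0.05   # TRUE means normality is rejected
print(sw$p.value)
print(reject_normality)
```

A Q-Q plot of the residuals, for example with qqnorm() and qqline(), is a useful visual companion to the test.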

Homoscedasticity

Here we plot the model's fitted values against its residuals

plot(model_admission_back$fitted.values, model_admission_back$residuals)
abline(h = 0, col = "red")

We will use Breusch-Pagan test to check for the Homoscedasticity:

bptest(model_admission_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_admission_back
## BP = 14.046, df = 5, p-value = 0.01532

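The null hypothesis of the Breusch-Pagan test is that the residual variance is constant. Since the p-value (0.01532) is smaller than 0.05, we reject it: heteroscedasticity is present in the model, so this assumption is not satisfied.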
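To make the test less of a black box, the sketch below hand-computes a studentized (Koenker-style) Breusch-Pagan statistic in base R: regress the squared residuals on the predictors and multiply the resulting R-squared by the sample size. The data are simulated with noise that grows with x; this is an illustrative reimplementation under that assumption, not the code of lmtest::bptest().

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
# Simulated data whose noise grows with x, i.e. heteroscedastic by construction
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictors
sq_res <- residuals(fit)^2
aux <- lm(sq_res ~ x)

bp_stat <- n * summary(aux)$r.squared                     # n * R^2 of the auxiliary fit
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)    # df = number of predictors
print(c(statistic = bp_stat, p_value = bp_pval))
```

A small p-value here means the squared residuals depend on the predictors, which is exactly what non-constant variance looks like.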

Multicollinearity

We check multicollinearity to make sure that there is no strong dependency among the predictors used in the model.

We use VIF to check for multicollinearity:

vif(model_admission_back)
##          GRE        TOEFL LOR_Strength         CGPA     Research 
##     4.380017     3.788371     2.013739     4.635938     1.607922

None of the VIF values exceeds 10, so multicollinearity is not a concern.
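For a single numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand on simulated predictors (x1, x2, x3 and vif_manual are illustrative names; the data are not the admission data):

```r
set.seed(7)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly related to x1 by construction
x3 <- rnorm(n)                  # unrelated to the others
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Predictors that can be well explained by the others get a large VIF, while an unrelated predictor stays close to 1.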

Conclusion

The predictors that are useful for describing the variance in the chance of being admitted to the university are GRE, TOEFL, LOR_Strength, CGPA and Research. Of the four classical assumptions, our final model satisfies linearity and the absence of multicollinearity, but the Shapiro-Wilk and Breusch-Pagan tests indicate that the residuals are neither normally distributed nor homoscedastic, so the coefficient standard errors and p-values should be read with caution. The adjusted R-squared of the model is 0.8026, meaning the predictors explain about 80.3% of the variance in the chance of being admitted; feature selection did not raise it. We may try other models to get higher performance.

The accuracy of the model in predicting the chance of admission is measured with RMSE. The full model has an RMSE of 0.06086444 on the training data and 0.06885452 on the testing data, and the final model has a training RMSE of about 0.061. The small gap between training and testing error suggests that the model does not overfit the training data.

We have learned how to build a linear regression model and what to watch for when building it.

Reference

Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019