1 Introduction

1.1 Data Explanation

The dataset was collected from kaggle and containing 500 observations with 7 variables. The dataset contains several parameters which are considered important during the application for Masters Programs. The target variable is Chance.of.Admit, which determines how big is the chance of one to successfully be admitted to a master program, while the possible predictors are GRE Scores (out of 340), TOEFL Scores (out of 120), University Rating (out of 5), Statement of Purpose’s Strength (out of 5), Letter of Recommendation’s Strength (out of 5), Undergraduate GPA (out of 10), and Research Experience (binary, either 1 or 0).

1.2 Main Case

We have two different conditions towards our regression modelling and prediction. First, a student who has already done all tests and want to know their chances of being accepted in a master program. Second, a student who is about to take the TOEFL and GRE test, and want to know the minimum score to have a high chance of being admitted to a master program.

1.3 Load Dataset

We load the csv file and store the data to an object named admission

admission <- read.csv("Admission_Predict_Ver1.1.csv")
head(admission)

In order to test the model’s ability to predict, we’ll have to divide the data into train dataset and test dataset, having 400 and 100 observations respectively. I’ll use random sampling to choose observations.

set.seed(16)
rand_sample <- sample(nrow(admission),400)
admission_train <- admission[rand_sample,]
admission_test <- admission[-rand_sample,]

admission_train

admission_test

2 Exploratory Data Analysis

First we check the structure of the data

str(admission_train)

## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  225 127 315 189 271 427 264 276 478 430 ...
##  $ GRE.Score        : int  305 323 305 331 306 312 324 322 309 340 ...
##  $ TOEFL.Score      : int  105 113 105 115 105 106 111 110 105 115 ...
##  $ University.Rating: int  2 3 2 5 2 3 3 3 4 5 ...
##  $ SOP              : num  3 4 3 4.5 2.5 3 2.5 3.5 3.5 5 ...
##  $ LOR              : num  2 3 4 3.5 3 5 1.5 3 2 4.5 ...
##  $ CGPA             : num  8.23 9.32 8.13 9.36 8.22 8.57 8.79 8.96 8.18 9.06 ...
##  $ Research         : int  0 1 0 1 1 0 1 1 0 1 ...
##  $ Chance.of.Admit  : num  0.67 0.85 0.66 0.93 0.72 0.71 0.7 0.78 0.65 0.95 ...

Here are some explanation for variables above :
1. GRE.Score : The Graduate Record Examinations, is a standardized test that is being an admissions requirement for many graduate schools
2. TOEFL.Score : Test of English as a Foreign Language, one of the most famous English language skill test
3. University.Rating : Rate of a University, range from 1 to 5
4. SOP : Statement of Purpose, is an essay that tells the admission committee about who you are, why you’re applying, why you’re a good candidate, and what you want to do in the future.
5.LOR : Letter of Recommendation, is a document designed to add some merit value to college application. They are usually written by a supervisor, colleague, teacher, or friend.

We obviously know that serial number (Serial.No) has nothing to do with the chance to be admitted in a master program, so we can remove this variable

admission_train <- admission_train[,-c(1)]

Then we observe the missing values

colSums(is.na(admission_train))

##         GRE.Score       TOEFL.Score University.Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance.of.Admit 
##                 0                 0                 0                 0

Finally, we check the outliers of the data

# GRE and TOEFL score has a different range than other variables
boxplot(admission_train$GRE.Score,admission_train$TOEFL.Score,names = c("GRE Score","TOEFL Score"))

# Boxplot of other variables
boxplot(admission_train[,-c(1,2)])

With no missing values and inconsiderable outliers, we can directly go to regression modelling

3 Linear Regression Modelling

3.1 Correlation and Main Model

First we check the correlation between variables

GGally::ggcorr(admission_train, label = T, hjust = 0.9, layout.exp = 3, label_size = 3, name = "Correlation Value")

From the chart above we find out that CGPA is a must-included variable, while from business viewpoint, University Rating is one of the most important attributes when applying to a master program. We’ll set these two as minimum variables.

model.admission.main <- lm(Chance.of.Admit~ University.Rating + CGPA,admission_train)
summary(model.admission.main)

## 
## Call:
## lm(formula = Chance.of.Admit ~ University.Rating + CGPA, data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274275 -0.023677  0.008582  0.039056  0.165477 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -0.866414   0.059210 -14.633 < 0.0000000000000002 ***
## University.Rating  0.018323   0.004160   4.405            0.0000136 ***
## CGPA               0.178572   0.007876  22.672 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06682 on 397 degrees of freedom
## Multiple R-squared:  0.7758, Adjusted R-squared:  0.7747 
## F-statistic:   687 on 2 and 397 DF,  p-value: < 0.00000000000000022

As we can see, the two variables have a very significant impact towards target variable

3.2 Feature Selection with Stepwise Regression

In order to make a good model, we need to combine both our business standpoint and statistical methods. We’ll use stepwise regression to acquire some suggestions from statistical point of view

First we need 2 scope models, one with main variables and one with all variables. We already have the main models, so let’s make the model with All Variables

model.admission.all <- lm(Chance.of.Admit~.,admission_train)
summary(model.admission.all)

## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.264996 -0.023110  0.009709  0.034993  0.155816 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2584116  0.1213883 -10.367 < 0.0000000000000002 ***
## GRE.Score          0.0018755  0.0005899   3.179             0.001594 ** 
## TOEFL.Score        0.0030544  0.0010115   3.020             0.002696 ** 
## University.Rating  0.0081625  0.0044621   1.829             0.068114 .  
## SOP               -0.0002268  0.0052454  -0.043             0.965540    
## LOR                0.0158052  0.0047446   3.331             0.000947 ***
## CGPA               0.1124005  0.0111793  10.054 < 0.0000000000000002 ***
## Research           0.0275598  0.0075177   3.666             0.000280 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06163 on 392 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8084 
## F-statistic: 241.4 on 7 and 392 DF,  p-value: < 0.00000000000000022

Now we create 3 different models using all 3 stepwise regression methods

model.admission.forward <- step(model.admission.main,scope = list(lower = model.admission.main, upper = model.admission.all),trace = 0,direction = "forward")
model.admission.backward <- step(model.admission.all,trace = 0,direction = "backward")
model.admission.both <- step(model.admission.main,scope = list(lower = model.admission.main, upper = model.admission.all),trace = 0,direction = "both")

summary(model.admission.forward)[1]

## $call
## lm(formula = Chance.of.Admit ~ University.Rating + CGPA + GRE.Score + 
##     LOR + Research + TOEFL.Score, data = admission_train)

summary(model.admission.backward)[1]

## $call
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research, data = admission_train)

summary(model.admission.both)[1]

## $call
## lm(formula = Chance.of.Admit ~ University.Rating + CGPA + GRE.Score + 
##     LOR + Research + TOEFL.Score, data = admission_train)

From the results above we can find out that all three stepwise methods produce models with the same predictors, so we can store these models to one object called model.admission

model.admission <- model.admission.both
summary(model.admission)

## 
## Call:
## lm(formula = Chance.of.Admit ~ University.Rating + CGPA + GRE.Score + 
##     LOR + Research + TOEFL.Score, data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.265067 -0.023077  0.009819  0.034848  0.155793 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2581051  0.1210271 -10.395 < 0.0000000000000002 ***
## University.Rating  0.0080847  0.0040786   1.982             0.048149 *  
## CGPA               0.1123190  0.0110055  10.206 < 0.0000000000000002 ***
## GRE.Score          0.0018775  0.0005874   3.196             0.001505 ** 
## LOR                0.0157465  0.0045408   3.468             0.000583 ***
## Research           0.0275544  0.0075070   3.670             0.000276 ***
## TOEFL.Score        0.0030492  0.0010032   3.040             0.002527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06155 on 393 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8089 
## F-statistic: 282.4 on 6 and 393 DF,  p-value: < 0.00000000000000022

4 Model Comparison

Finally we have 3 different models, with :
- Main Variables
- All Variables
- Stepwise Method

4.1 Adjusted R-Squared

Let’s compare their Adjusted R-Squared value

paste("Model with Main Variables :",round(summary(model.admission.main)$adj.r.squared,5))

## [1] "Model with Main Variables : 0.7747"

paste("Model with All Variables :",round(summary(model.admission.all)$adj.r.squared,5))

## [1] "Model with All Variables : 0.80837"

paste("Model with Stepwise Method :",round(summary(model.admission)$adj.r.squared,5))

## [1] "Model with Stepwise Method : 0.80886"

The model with main variables has the lowest Adjusted R-Squared value with quite a difference, so we can eliminate this model.

Now we have two models to compare, model with stepwise method (model.admission) and model with all variables (model.admission.all)

4.2 Assumptions

Linearity

summary(model.admission)

## 
## Call:
## lm(formula = Chance.of.Admit ~ University.Rating + CGPA + GRE.Score + 
##     LOR + Research + TOEFL.Score, data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.265067 -0.023077  0.009819  0.034848  0.155793 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2581051  0.1210271 -10.395 < 0.0000000000000002 ***
## University.Rating  0.0080847  0.0040786   1.982             0.048149 *  
## CGPA               0.1123190  0.0110055  10.206 < 0.0000000000000002 ***
## GRE.Score          0.0018775  0.0005874   3.196             0.001505 ** 
## LOR                0.0157465  0.0045408   3.468             0.000583 ***
## Research           0.0275544  0.0075070   3.670             0.000276 ***
## TOEFL.Score        0.0030492  0.0010032   3.040             0.002527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06155 on 393 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8089 
## F-statistic: 282.4 on 6 and 393 DF,  p-value: < 0.00000000000000022

summary(model.admission.all)

## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.264996 -0.023110  0.009709  0.034993  0.155816 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2584116  0.1213883 -10.367 < 0.0000000000000002 ***
## GRE.Score          0.0018755  0.0005899   3.179             0.001594 ** 
## TOEFL.Score        0.0030544  0.0010115   3.020             0.002696 ** 
## University.Rating  0.0081625  0.0044621   1.829             0.068114 .  
## SOP               -0.0002268  0.0052454  -0.043             0.965540    
## LOR                0.0158052  0.0047446   3.331             0.000947 ***
## CGPA               0.1124005  0.0111793  10.054 < 0.0000000000000002 ***
## Research           0.0275598  0.0075177   3.666             0.000280 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06163 on 392 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8084 
## F-statistic: 241.4 on 7 and 392 DF,  p-value: < 0.00000000000000022

Model with all variables (model.admission.all) has two insignificant variables, while model with stepwise method (model.admission) has all variables proven significant/linear.

Error’s Normality

Let’s visualize the residuals

hist(model.admission$residuals,xlab = "Residuals",main = "Residuals of model.admission")

hist(model.admission.all$residuals,xlab = "Residuals",main = "Residuals of model.admission.all")

Both model visually have the same histogram chart, and also it’s kinda subjective to define a data normality based on chart, so let’s try and compare the Shapiro-Wilk value with Shapiro Test.

H0: error/residuals normally distributed (if p-value > 0.05)

H1: error/residuals are not normally distributed (if p-value < 0.05)

shapiro.test(model.admission$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model.admission$residuals
## W = 0.92251, p-value = 0.0000000000001648

shapiro.test(model.admission.all$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model.admission.all$residuals
## W = 0.92263, p-value = 0.0000000000001693

Both model has p-value lower than 0.05, so we can reject the H0. The residuals from both model are not normally distributed.

Heteroscedasticity
Determines whether the residuals have a pattern. Let’s visualize the residuals distribution

plot(model.admission$fitted.values,model.admission$residuals,main = "Error Distribution of model.admission")
abline(h = 0, col = "red")

plot(model.admission.all$fitted.values,model.admission.all$residuals, main = "Error Distribution of model.admission.all")
abline(h = 0, col = "red")

Based on scatter plot above, there are no visible pattern in the residuals. However, it is better to check from the statistical point of view using Breusch-Pagan test

H0: error variance spreads constantly/has no pattern (Homoscedasticity) (if p-value > 0.05)
H1: errors generate a pattern (Heteroscedasticity) (if p-value < 0.05)

lmtest::bptest(model.admission)

## 
##  studentized Breusch-Pagan test
## 
## data:  model.admission
## BP = 20.731, df = 6, p-value = 0.00205

lmtest::bptest(model.admission.all)

## 
##  studentized Breusch-Pagan test
## 
## data:  model.admission.all
## BP = 23.236, df = 7, p-value = 0.001551

Again, both model has a p-value below 0.05, so we can reject the H0. The model’s residuals is heteroscedastic

Multicolinearity
Defines whether the variables related to each other or not. We use the Variance Inflation Factor (VIF) value, which if any of its value is above 10, then the model is multicolinear.

car::vif(model.admission)

## University.Rating              CGPA         GRE.Score               LOR 
##          2.303959          4.678755          4.516008          1.843592 
##          Research       TOEFL.Score 
##          1.456273          3.926298

car::vif(model.admission.all)

##         GRE.Score       TOEFL.Score University.Rating               SOP 
##          4.543281          3.981715          2.750559          2.799827 
##               LOR              CGPA          Research 
##          2.007702          4.815469          1.456683

None of the variables in the model are strongly correlated to each other. Thus, there are no multicolinearity

4.3 Error Values

Finally we need to compare both model by the MSE (Mean Squared Error) value they produce.

As a comparison, here is the range of the target variable (Chance.of.Admit)

range(admission_train$Chance.of.Admit)

## [1] 0.34 0.97

Now, let’s get back to the test data I provided above, we’ll see how accurate the models can predict the unseen data.

admission_predict <- predict(model.admission,admission_test)
admission_predict <- data.frame(admission_predict, admission_test$Chance.of.Admit)
names(admission_predict) <- c("Model_Prediction","Actual_Values")

admission_predict_all <- predict(model.admission.all,admission_test)
admission_predict_all <- data.frame(admission_predict_all, admission_test$Chance.of.Admit)
names(admission_predict_all) <- c("Model_Prediction","Actual_Values")

Here are the Mean Squared Error of each model’s prediction

MSE : Stepwise (model.admission)

# Train Data 
MLmetrics::MSE(model.admission$fitted.values, admission_train$Chance.of.Admit)

## [1] 0.003722198

# Test Data
MLmetrics::MSE(admission_predict$Model_Prediction,admission_predict$Actual_Values)

## [1] 0.002858706

MSE : All Variables (model.admission.all)

# Train Data 
MLmetrics::MSE(model.admission.all$fitted.values, admission_train$Chance.of.Admit)

## [1] 0.00372218

# Test Data
MLmetrics::MSE(admission_predict_all$Model_Prediction,admission_predict_all$Actual_Values)

## [1] 0.00286023

Both models have a small error values, and the train-test MSE comparison has not much of a difference. Safe to say that the models aren’t overfitting nor underfitting.

4.4 Final Model

Now we’ll accumulate all the comparisons above :

We can decide that the model from stepwise method (model.admission) is the best model to predict the chance for master program admission.

summary(model.admission)

## 
## Call:
## lm(formula = Chance.of.Admit ~ University.Rating + CGPA + GRE.Score + 
##     LOR + Research + TOEFL.Score, data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.265067 -0.023077  0.009819  0.034848  0.155793 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -1.2581051  0.1210271 -10.395 < 0.0000000000000002 ***
## University.Rating  0.0080847  0.0040786   1.982             0.048149 *  
## CGPA               0.1123190  0.0110055  10.206 < 0.0000000000000002 ***
## GRE.Score          0.0018775  0.0005874   3.196             0.001505 ** 
## LOR                0.0157465  0.0045408   3.468             0.000583 ***
## Research           0.0275544  0.0075070   3.670             0.000276 ***
## TOEFL.Score        0.0030492  0.0010032   3.040             0.002527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06155 on 393 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8089 
## F-statistic: 282.4 on 6 and 393 DF,  p-value: < 0.00000000000000022

5 Conclusion

The significant variables to predict the chance of master program admission are University Rating, CGPA, GRE Score, Letter of Recommendation, Research Experience and TOEFL Score, while two insignificant variables are Serial Number and Statement of Purpose. Serial number clearly doesn’t matter in applying a master program since it is just used for identification, and Statement of Purpose doesn’t really matter because it cannot really define someone’s academic attributes, it just shows some thoughts and reasons of why he/she should be accepted in a master program.

The model we have created can predict the chance of master admission with a decent accuracy, proven by 80% of Adjusted R-Squared value, and a low Mean Squared Error value. However, linear model probably isn’t the best model to predict this Masters Program Admission chance since none of the model passed the Normality and Heteroscedasticity test, which means it is still possible that the model produced a wrong prediction.

Master Program Admission

Asido Rogate Panjaitan

2020-02-17