BACKGROUND

Introduction

In this exercise, I want to predict an applicant’s chance of admission to a university of their choice from the other variables in the dataset, using linear regression.

I will use MSE (mean squared error) and RMSE (root mean squared error) to measure my model’s accuracy. For those who are not familiar, both measure prediction error, and lower is better. Their values only make sense relative to the scale of the target. RMSE is in the same units as the target, so an RMSE of 1,000 is very low if our data is in the millions but very high if our data is in the hundreds. MSE is in squared units, so it has to be compared against the squared scale of the target.
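As a minimal sketch of the two measures, the helper functions `mse()` and `rmse()` below are names I am assuming for illustration, and the three value pairs are made up rather than taken from the dataset.

```r
# Mean squared error: average of the squared differences
# between actual and predicted values (squared units).
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Root mean squared error: square root of the MSE,
# expressed in the same units as the target.
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Hypothetical values for illustration only (not from the dataset).
actual    <- c(0.90, 0.70, 0.50)
predicted <- c(0.80, 0.70, 0.60)

mse_value  <- mse(actual, predicted)
rmse_value <- rmse(actual, predicted)
print(c(mse = mse_value, rmse = rmse_value))
```

Because RMSE is on the target's own scale, it is usually the easier of the two to compare against the range of the data.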


Data

The data is kindly provided by Mohan S Acharya, Asfia Armaan and Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019.

Checking the data

Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
         1        337          118                  4  4.5  4.5  9.65         1             0.92
         2        324          107                  4  4.0  4.5  8.87         1             0.76
         3        316          104                  3  3.0  3.5  8.00         1             0.72
         4        322          110                  3  3.5  2.5  8.67         1             0.80
         5        314          103                  2  2.0  3.0  8.21         0             0.65
         6        330          115                  5  4.5  3.0  9.34         1             0.90
         7        321          109                  3  3.0  4.0  8.20         1             0.75
         8        308          101                  2  3.0  4.0  7.90         0             0.68
         9        302          102                  1  2.0  1.5  8.00         0             0.50
        10        323          108                  3  3.5  3.0  8.60         0             0.45


The columns in the data are:

1. GRE Scores ( out of 340 )
2. TOEFL Scores ( out of 120 )
3. University Rating ( out of 5 )
4. Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
5. Undergraduate GPA, CGPA ( out of 10 )
6. Research Experience ( either 0 or 1 )
7. Chance of Admit ( ranging from 0 to 1 )


I’m changing the column names to something more code-friendly and converting our research column from numeric 0/1 to logical FALSE/TRUE.
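A minimal sketch of that renaming step, in base R. The first five rows are copied from the table above in place of the full CSV, which is not reproduced here. Most of the new names match those that appear in the model summary later on; `serial_no` and `sop` are my own assumptions, since the original code is not shown.

```r
# First five rows as displayed above, with the original column headers.
admission <- data.frame(
  "Serial No."        = 1:5,
  "GRE Score"         = c(337, 324, 316, 322, 314),
  "TOEFL Score"       = c(118, 107, 104, 110, 103),
  "University Rating" = c(4, 4, 3, 3, 2),
  "SOP"               = c(4.5, 4.0, 3.0, 3.5, 2.0),
  "LOR"               = c(4.5, 4.5, 3.5, 2.5, 3.0),
  "CGPA"              = c(9.65, 8.87, 8.00, 8.67, 8.21),
  "Research"          = c(1, 1, 1, 1, 0),
  "Chance of Admit"   = c(0.92, 0.76, 0.72, 0.80, 0.65),
  check.names = FALSE  # keep the headers with spaces as they are
)

# Code-friendly names (serial_no and sop are assumed names).
names(admission) <- c(
  "serial_no", "gre_score", "toefl_score", "university_rating",
  "sop", "lor", "cgpa", "research", "chance_of_admit"
)

# Convert the 0/1 indicator to logical FALSE/TRUE.
admission$research <- as.logical(admission$research)

str(admission)
```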

Separating Data to Train and Test

In order to test the model at a later stage, I’ll separate the dataset into two parts, training and testing data, with a ratio of 8:2.
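One way such a split could be done in base R is sketched below. The seed and the stand-in data frame are assumptions. The row count of 500 is inferred, not stated: the model summary later in this document reports 393 residual degrees of freedom with 7 coefficients, which suggests 400 training rows.

```r
set.seed(123)  # assumed seed, only to make the sketch reproducible

# Stand-in for the full dataset: just a row identifier.
n <- 500
admission <- data.frame(serial_no = seq_len(n))

# Randomly pick 80% of the row indices for training.
train_index <- sample(seq_len(n), size = floor(0.8 * n))

admission_train <- admission[train_index, , drop = FALSE]
admission_test  <- admission[-train_index, , drop = FALSE]

print(c(train = nrow(admission_train), test = nrow(admission_test)))
```

Sampling indices without replacement keeps the two sets disjoint, so the test rows are never seen during fitting.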


DATA ANALYSIS

Checking whether the variables are linearly related to each other.

Seeing that all variables correlate well with each other except the serial number, I think we can move on to modelling and exclude the serial number from our model.
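The original correlation plot is not reproduced here. As a rough sketch, a correlation matrix can be computed with base R's `cor()`. The data frame below holds only the ten rows displayed earlier and a subset of the columns, so its values will differ from those of the full dataset. The name `serial_no` is an assumption.

```r
# The ten rows shown in "Checking the data" (subset of columns).
admission_head <- data.frame(
  serial_no       = 1:10,
  gre_score       = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  toefl_score     = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  cgpa            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  chance_of_admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

# Pearson correlation between every pair of columns.
correlations <- cor(admission_head)

# Correlation of each column with the target.
print(round(correlations[, "chance_of_admit"], 2))
```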

MODELLING

I’ll make a linear model named model_admission.

## 
## Call:
## lm(formula = chance_of_admit ~ gre_score + toefl_score + university_rating + 
##     lor + cgpa + research, data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.243033 -0.024065  0.007685  0.032871  0.153770 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.4051368  0.1074149 -13.081  < 2e-16 ***
## gre_score          0.0023133  0.0005256   4.401 1.39e-05 ***
## toefl_score        0.0030671  0.0009075   3.380 0.000798 ***
## university_rating  0.0066859  0.0036723   1.821 0.069424 .  
## lor                0.0114540  0.0041109   2.786 0.005590 ** 
## cgpa               0.1161438  0.0100539  11.552  < 2e-16 ***
## researchTRUE       0.0200083  0.0069335   2.886 0.004120 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05647 on 393 degrees of freedom
## Multiple R-squared:  0.8336, Adjusted R-squared:  0.831 
## F-statistic:   328 on 6 and 393 DF,  p-value: < 2.2e-16

I’m pretty happy with the resulting R squared (0.83) and with the t values. University rating has a lower t value (1.82, p = 0.069), which is not significant at the 5% level, but I personally think that it’s an important variable, so I’ll keep it inside the linear model.
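The call printed in the summary can be sketched as follows. The formula is copied from the `Call:` lines above (note that it leaves out the SOP column as well as the serial number). Since the real `admission_train` is not available here, the sketch fits the same formula on the ten rows displayed earlier, under the assumed name `admission_head`. With so few rows its estimates are not meaningful and will not match the summary above.

```r
# The ten rows shown in "Checking the data", with research as logical.
admission_head <- data.frame(
  gre_score         = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  toefl_score       = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  university_rating = c(4, 4, 3, 3, 2, 5, 3, 2, 1, 3),
  lor               = c(4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0),
  cgpa              = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  research          = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE),
  chance_of_admit   = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

# Same formula as in the summary; only the data argument differs.
model_head <- lm(
  chance_of_admit ~ gre_score + toefl_score + university_rating +
    lor + cgpa + research,
  data = admission_head
)

# Intercept plus six slopes, one of them named researchTRUE.
print(names(coef(model_head)))
```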


Checking Our Assumptions

Normality

## 
##  Shapiro-Wilk normality test
## 
## data:  model_admission$residuals
## W = 0.93567, p-value = 3.987e-12
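The p-value reported above (3.987e-12) is far below 0.05, so the Shapiro-Wilk test rejects the hypothesis that the residuals are normally distributed. The sketch below shows how such a test can be run with `shapiro.test()` from base R. It uses a small simulated model (`model_sim`, an assumed name with made-up data) in place of `model_admission`.

```r
set.seed(1)  # assumed seed for reproducibility

# Simulated stand-in data: a linear relation with normal noise.
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.1)
model_sim <- lm(y ~ x)

# Null hypothesis: the residuals come from a normal distribution.
normality <- shapiro.test(residuals(model_sim))

print(normality$statistic)  # W, close to 1 for normal-looking residuals
print(normality$p.value)    # a small p-value rejects normality
```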

Homoscedasticity

## 
##  studentized Breusch-Pagan test
## 
## data:  model_admission
## BP = 22.58, df = 6, p-value = 0.0009502

Based on our Breusch-Pagan test (p = 0.00095), which rejects constant variance at the 5% level, there seems to be a bit of a pattern in the residuals, but looking at the graph, I think it’s still acceptable.
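The output header above appears to match what `bptest()` from the lmtest package prints, though the original call is not shown. To show what the studentized statistic measures, here is a base-R sketch that computes it by hand as n times the R squared of a regression of the squared residuals on the predictors. The data is simulated and all names are assumptions.

```r
set.seed(2)  # assumed seed for reproducibility

# Simulated data whose noise grows with x1 (heteroscedastic on purpose).
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.2 + x1)
model_sim <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors.
aux <- lm(residuals(model_sim)^2 ~ x1 + x2)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the
# auxiliary regression, chi-squared with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- 2
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value means the residual variance changes with the predictors, which is the situation reported for `model_admission` above.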

Multicollinearity

##         gre_score       toefl_score university_rating               lor 
##          4.400360          3.698528          2.133388          1.727872 
##              cgpa          research 
##          4.483945          1.484216

With every VIF value below 5, a common rule-of-thumb threshold, it seems that our predictor variables do not correlate too strongly with each other.
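The values above look like the output of a function such as `vif()` from the car package, though the original call is not shown. As a sketch of what a variance inflation factor is, the base-R code below computes VIF by hand as 1 / (1 - R squared), where R squared comes from regressing each predictor on the others. The three predictors are simulated and their names are hypothetical.

```r
set.seed(3)  # assumed seed for reproducibility

# Simulated predictors: x2 is built to correlate with x1, x3 is independent.
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# For each predictor, regress it on all the others and
# turn the resulting R^2 into a variance inflation factor.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 2))
```

A VIF of 1 means a predictor is uncorrelated with the rest, and larger values mean more of its variance is shared with the other predictors.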



PREDICTION

Using the model I’ve built, I will test it with the admission_test dataset that we split off earlier.
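A minimal sketch of this step: fit on the training rows, predict on the held-out rows with `predict()`, then compute MSE and RMSE. Everything here is simulated (the data, the coefficients used to generate it, and the names `sim_train`, `sim_test` and `model_sim`), so the numbers it prints are unrelated to the results reported in this document.

```r
set.seed(4)  # assumed seed for reproducibility

# Simulated stand-in data with a single predictor.
n   <- 100
sim <- data.frame(cgpa = runif(n, 6.8, 9.9))
sim$chance_of_admit <- -0.4 + 0.13 * sim$cgpa + rnorm(n, sd = 0.05)

# 8:2 split into training and testing rows.
train_index <- sample(seq_len(n), size = 80)
sim_train   <- sim[train_index, ]
sim_test    <- sim[-train_index, ]

# Fit on training data only, then predict the unseen test rows.
model_sim <- lm(chance_of_admit ~ cgpa, data = sim_train)
predicted <- predict(model_sim, newdata = sim_test)

# Error measures on the test set.
test_mse  <- mean((sim_test$chance_of_admit - predicted)^2)
test_rmse <- sqrt(test_mse)

print(c(mse = test_mse, rmse = test_rmse))
```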

CONCLUSION

The data on its own has shown, even at the early linearity check, a very strong linear relationship between its target variable and its predictor variables. Very little feature engineering needs to be done before getting a good linear regression model.

Model Performance

My linear regression model achieved an adjusted R squared of 0.831 on the training data (see the model summary above). This means that the predictor variables chosen for the model explain about 83% of the variance in the target variable, in this case the chance of admission.
The model’s MSE and RMSE on our test data are 0.005341 and 0.07308215 respectively.


Considering that our data ranges between 0.34 and 0.97, I think my model has achieved a pretty good prediction result, with an average error of 0.042.

Significance of Predictor Variable

The predictor variables that affect our linear model most significantly, with p-values between 0 and 0.001, are GRE Score, TOEFL Score and Undergraduate GPA. The predictor variables with p-values between 0.001 and 0.01 are Letter of Recommendation and Research Experience. University Rating (p = 0.069) is not significant at the 0.05 level.


The variable that impacts our chance of admission the most is CGPA, with a coefficient of 0.116 and the largest t value (11.55). This means that, with the other predictors held constant, an increase of one unit in CGPA raises the predicted chance of admission by 0.116.
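To make that interpretation concrete, the sketch below plugs the coefficients printed in the model summary into the regression equation by hand. The applicant is row 3 of the table shown earlier (CGPA 8.00, actual chance 0.72). The vector names are my own. This is plain arithmetic on the published estimates, not a call to the fitted model.

```r
# Coefficient estimates copied from the model summary.
coefficients_printed <- c(
  intercept         = -1.4051368,
  gre_score         =  0.0023133,
  toefl_score       =  0.0030671,
  university_rating =  0.0066859,
  lor               =  0.0114540,
  cgpa              =  0.1161438,
  research_true     =  0.0200083
)

# Row 3 of the displayed data (intercept term is 1, research TRUE is 1).
applicant <- c(
  intercept = 1, gre_score = 316, toefl_score = 104,
  university_rating = 3, lor = 3.5, cgpa = 8.00, research_true = 1
)

# Predicted chance of admission: sum of coefficient * value.
predicted <- sum(coefficients_printed * applicant)

# Same applicant with CGPA raised by one unit, all else unchanged.
applicant_plus <- applicant
applicant_plus["cgpa"] <- applicant["cgpa"] + 1
difference <- sum(coefficients_printed * applicant_plus) - predicted

print(round(c(predicted = predicted, difference = difference), 4))
# predicted is about 0.654, difference is the CGPA coefficient (about 0.116)
```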