BACKGROUND

Introduction

In this exercise, I want to predict a student’s chance of admission to the university of their choice from other variables, using linear regression.

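I will use MSE (mean squared error) and RMSE (root mean squared error) to measure my model’s accuracy. For those who are not familiar, both summarise the prediction error, and their size only means something relative to the scale of the target. RMSE is in the same units as the target, so an RMSE of 1,000 is a very low error when the target values are in the millions, but a very high error when they are in the hundreds. MSE is in squared units, so it is better suited to comparing models than to comparing against the target range directly.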
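To make the scale dependence concrete, here is a minimal sketch in base R. The helper names `mse_manual` and `rmse_manual` and the toy vectors are made up for illustration; they are not part of the analysis, which uses `MSE()` and `RMSE()` from MLmetrics.

```r
# Hypothetical helpers that mirror the usual definitions of MSE and RMSE.
mse_manual  <- function(y_pred, y_true) mean((y_true - y_pred)^2)
rmse_manual <- function(y_pred, y_true) sqrt(mse_manual(y_pred, y_true))

# Toy values (invented for illustration) on a 0-1 scale.
y_true <- c(0.50, 0.70, 0.90)
y_pred <- c(0.55, 0.65, 0.80)

mse_small  <- mse_manual(y_pred, y_true)
rmse_small <- rmse_manual(y_pred, y_true)

# The same values rescaled by 1000: RMSE grows by 1000, MSE by 1000^2.
mse_large  <- mse_manual(y_pred * 1000, y_true * 1000)
rmse_large <- rmse_manual(y_pred * 1000, y_true * 1000)

# RMSE divided by the target range does not change with the rescaling.
rel_small <- rmse_small / diff(range(y_true))
rel_large <- rmse_large / diff(range(y_true * 1000))
```

Dividing RMSE by the range of the target is only one of several ways to put the error on a comparable scale.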

Library

Library packages that I’m using

library(tidyverse)
library(GGally)
library(MLmetrics)
library(car)
library(lmtest)
library(stringr)

Data

The data is kindly provided by Mohan S Acharya, Asfia Armaan and Aneeta S Antony: "A Comparison of Regression Models for Prediction of Graduate Admissions", IEEE International Conference on Computational Intelligence in Data Science, 2019.

admission <- read_csv("data_input/Admission_Predict_Ver1.1.csv")

Checking the data

knitr::kable(head(admission, 10))

Serial No.	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
1	337	118	4	4.5	4.5	9.65	1	0.92
2	324	107	4	4.0	4.5	8.87	1	0.76
3	316	104	3	3.0	3.5	8.00	1	0.72
4	322	110	3	3.5	2.5	8.67	1	0.80
5	314	103	2	2.0	3.0	8.21	0	0.65
6	330	115	5	4.5	3.0	9.34	1	0.90
7	321	109	3	3.0	4.0	8.20	1	0.75
8	308	101	2	3.0	4.0	7.90	0	0.68
9	302	102	1	2.0	1.5	8.00	0	0.50
10	323	108	3	3.5	3.0	8.60	0	0.45

The data contains:

1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
5. Undergraduate GPA (out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1)

I’m changing the column names to something more code-friendly and converting the research column from numeric to logical (TRUE/FALSE).

names(admission) <- str_replace_all(str_to_lower(names(admission)), " ", "_")
admission$research <- as.logical(admission$research)
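As a quick check of what the renaming produces, the sketch below applies a base R equivalent (`tolower()` plus `gsub()`, swapped in here for the stringr calls) to the column names shown in the data preview. The vector `raw_names` is typed in by hand from that preview, so any stray whitespace in the real file’s header would not show up here.

```r
# Column names copied by hand from the data preview above.
raw_names <- c("Serial No.", "GRE Score", "TOEFL Score", "University Rating",
               "SOP", "LOR", "CGPA", "Research", "Chance of Admit")

# Base R equivalent of str_replace_all(str_to_lower(x), " ", "_").
clean_names <- gsub(" ", "_", tolower(raw_names), fixed = TRUE)

print(clean_names)
```

Note that only spaces are replaced, so the trailing period in "Serial No." is kept.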

Separating Data to Train and Test

To test the model at a later stage, I’ll split the dataset into training and testing data with an 8:2 ratio.

set.seed(100)
intrain <- sample(nrow(admission), nrow(admission) *.8)

admission_train <- admission[intrain,]
admission_test <- admission[-intrain,]
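The same splitting pattern can be sanity-checked on a stand-in data frame. The sketch below assumes a 500-row table (`toy` is a made-up placeholder, not the admission data) and confirms that the two parts have the expected sizes and share no rows.

```r
set.seed(100)

# Stand-in data frame; only the row count matters for this check.
toy <- data.frame(id = seq_len(500))

# Sample 80% of the row indices without replacement.
n_train <- floor(0.8 * nrow(toy))
idx     <- sample(nrow(toy), n_train)

toy_train <- toy[idx, , drop = FALSE]
toy_test  <- toy[-idx, , drop = FALSE]

# Sizes of the two parts and the number of rows they have in common.
c(train = nrow(toy_train), test = nrow(toy_test),
  overlap = length(intersect(toy_train$id, toy_test$id)))
```

Negative indexing with `-idx` is what keeps the test rows disjoint from the training rows.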

DATA ANALYSIS

Checking whether the variables are linearly related to each other.

ggcorr(admission_train, label = TRUE, hjust = .7, layout.exp = 1, label_size = 4, cex = 3)

Seeing that all variables except the serial number are well correlated with each other, I think we can move on to modelling and exclude the serial number from our model.

MODELLING

I’ll fit a linear model named model_admission, using GRE score, TOEFL score, university rating, LOR, CGPA and research as predictors.

model_admission <- lm(formula = chance_of_admit ~ gre_score + toefl_score + 
    university_rating + lor + cgpa + research, data = admission_train)

summary(model_admission)

## 
## Call:
## lm(formula = chance_of_admit ~ gre_score + toefl_score + university_rating + 
##     lor + cgpa + research, data = admission_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.243033 -0.024065  0.007685  0.032871  0.153770 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.4051368  0.1074149 -13.081  < 2e-16 ***
## gre_score          0.0023133  0.0005256   4.401 1.39e-05 ***
## toefl_score        0.0030671  0.0009075   3.380 0.000798 ***
## university_rating  0.0066859  0.0036723   1.821 0.069424 .  
## lor                0.0114540  0.0041109   2.786 0.005590 ** 
## cgpa               0.1161438  0.0100539  11.552  < 2e-16 ***
## researchTRUE       0.0200083  0.0069335   2.886 0.004120 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05647 on 393 degrees of freedom
## Multiple R-squared:  0.8336, Adjusted R-squared:  0.831 
## F-statistic:   328 on 6 and 393 DF,  p-value: < 2.2e-16

I’m pretty happy with the resulting R-squared and with the significance of most predictors. University rating has a lower t value (p ≈ 0.069, so it is not significant at the 5% level), but I personally think it’s an important variable, so I’ll keep it in the linear model.

Checking Our Assumptions

Normality

hist(model_admission$residuals)

shapiro.test(model_admission$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_admission$residuals
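## W = 0.93567, p-value = 3.987e-12

The p-value is far below 0.05, so the Shapiro-Wilk test rejects the hypothesis that the residuals are normally distributed.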
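A Q-Q plot is a common visual companion to the histogram and the Shapiro-Wilk test. The sketch below uses simulated, left-skewed values (`sim_resid` is a made-up stand-in, not the model’s residuals) to show what a departure from normality looks like; with the real model one would pass `model_admission$residuals` instead.

```r
set.seed(3)

# Simulated stand-in for residuals: left-skewed and centred near zero.
sim_resid <- 0.056 - rexp(400, rate = 1 / 0.056)

# Points bending away from the reference line indicate non-normality.
qqnorm(sim_resid)
qqline(sim_resid, col = "red")

# The formal test on the same simulated values.
p_value <- shapiro.test(sim_resid)$p.value
```

The plot shows where the distribution departs from normal (here, the long lower tail), which the test statistic alone does not reveal.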

Homoscedasticity

plot(model_admission$fitted.values, model_admission$residuals)
abline(h = 0, col = "red")

bptest(model_admission)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_admission
## BP = 22.58, df = 6, p-value = 0.0009502

The Breusch-Pagan test rejects constant variance at the 5% level (p ≈ 0.00095), so there is a bit of a pattern in the residuals, but looking at the graph, I think it’s still acceptable.
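When the Breusch-Pagan test flags non-constant variance, one common follow-up (not something this analysis does) is to compare the usual standard errors with heteroscedasticity-robust ones. The sketch below computes White’s HC0 estimator by hand in base R on simulated data; all object names and the data are made up for illustration, and packages such as sandwich offer the same estimator ready-made.

```r
set.seed(2)

# Simulated data whose error spread grows with x (heteroscedastic by design).
n <- 300
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2 + 0.2 * x)
fit <- lm(y ~ x)

# HC0 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X     <- model.matrix(fit)
e     <- residuals(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

# Classical versus robust standard errors, side by side.
se_classic <- sqrt(diag(vcov(fit)))
se_robust  <- sqrt(diag(vcov_hc0))
cbind(se_classic, se_robust)
```

If the two columns are close, the heteroscedasticity has little practical effect on the inference; if they differ a lot, the t values in the model summary deserve less trust.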

Multicollinearity

vif(model_admission)

##         gre_score       toefl_score university_rating               lor 
##          4.400360          3.698528          2.133388          1.727872 
##              cgpa          research 
##          4.483945          1.484216

All VIF values are below 5, so our predictor variables do not seem to be strongly collinear with each other.
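For intuition about what `vif()` reports, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below reproduces this by hand on simulated data; `vif_manual` and the variables `x1` to `x3` are hypothetical names used only for this illustration.

```r
set.seed(1)

# Simulated predictors: x2 is deliberately correlated with x1, x3 is not.
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each predictor on the others, then 1 / (1 - R^2).
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim, c("x1", "x2", "x3"))
print(vifs)
```

A VIF of 1 means a predictor is uncorrelated with the others; the correlated pair here ends up with noticeably higher values than `x3`.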

Initial MSE and RMSE

MSE and RMSE of the model’s fitted values against the training data.

MSE(y_pred = model_admission$fitted.values, y_true = admission_train$chance_of_admit)

## [1] 0.003132578

RMSE(y_pred = model_admission$fitted.values, y_true = admission_train$chance_of_admit)

## [1] 0.05596944

PREDICTION

Using the model I’ve built, I will test it on the admission_test dataset that we split off earlier.

admission_test$chance_admit_predict <- round(predict(object = model_admission, newdata = admission_test),2)

MSE and RMSE value with test data

MSE(y_pred = admission_test$chance_admit_predict, y_true = admission_test$chance_of_admit)

## [1] 0.005341

RMSE(y_pred = admission_test$chance_admit_predict, y_true = admission_test$chance_of_admit)

## [1] 0.07308215

CONCLUSION

Even at the early correlation check, the data showed a very strong linear relationship between the target variable and its predictor variables. Very little feature engineering was needed to get a good linear regression model.

Model Performance

My linear regression model achieved an adjusted R-squared of 0.831. This means that the chosen predictor variables explain about 83% of the variance in the target variable, in this case the chance of admission.
The model’s MSE and RMSE on the test data are 0.005341 and 0.07308215 respectively.

Considering that the target ranges from 0.34 to 0.97, I think my model has achieved a pretty good prediction result, with an average error of 0.042.
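To put the reported errors in context, the short calculation below compares the training and test RMSE and expresses the test RMSE as a fraction of the stated target range. All numbers are copied from the outputs and text above; nothing is recomputed from the data.

```r
# MSE values copied from the training and test outputs above.
mse_train <- 0.003132578
mse_test  <- 0.005341

# RMSE is the square root of MSE.
rmse_train <- sqrt(mse_train)
rmse_test  <- sqrt(mse_test)

# How much larger the test error is than the training error.
rmse_ratio <- rmse_test / rmse_train

# Test RMSE relative to the stated target range (0.34 to 0.97).
rmse_rel_range <- rmse_test / (0.97 - 0.34)

c(rmse_train = rmse_train, rmse_test = rmse_test,
  ratio = rmse_ratio, relative_to_range = rmse_rel_range)
```

A test error somewhat above the training error is expected, since the model was fitted to the training rows.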

Significance of Predictor Variable

The predictor variables that are most significant in our linear model, with p-values between 0 and 0.001, are GRE Score, TOEFL Score and Undergraduate GPA. The predictor variables with p-values between 0.001 and 0.01 are Letter of Recommendation and Research Experience.

The variable with the largest coefficient is CGPA, at 0.116. This means that a one-unit increase in CGPA raises the predicted chance of admission by 0.116, holding the other predictors constant.
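The interpretation of a coefficient can be checked with a small worked example. The sketch below copies the coefficient estimates from the model summary, applies them to the first applicant in the data preview, and then raises CGPA by one unit. The vector names are chosen for this illustration, and the applicant may or may not be among the training rows.

```r
# Coefficient estimates copied from the summary(model_admission) output above.
coefs <- c(intercept = -1.4051368, gre_score = 0.0023133,
           toefl_score = 0.0030671, university_rating = 0.0066859,
           lor = 0.0114540, cgpa = 0.1161438, research = 0.0200083)

# First applicant from the data preview (research = 1 stands for TRUE).
applicant <- c(intercept = 1, gre_score = 337, toefl_score = 118,
               university_rating = 4, lor = 4.5, cgpa = 9.65, research = 1)

# Linear prediction: intercept plus the sum of coefficient * value.
pred <- sum(coefs * applicant[names(coefs)])

# Same applicant with CGPA one unit higher, everything else unchanged.
bumped <- applicant
bumped["cgpa"] <- bumped["cgpa"] + 1
pred_bumped <- sum(coefs * bumped[names(coefs)])

# The difference equals the cgpa coefficient.
effect <- pred_bumped - pred
```

The raised prediction also exceeds 1 for this applicant, a reminder that a plain linear model does not keep its predictions inside the 0 to 1 range.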

College Admission Prediction

Deo Ivan Mareza

14 February 2020