We will use a linear regression model to predict a student's chance of being admitted to this particular university's master's program. Hopefully, the predictions will give students a fair idea of how likely they are to make it into the program.
library(scales)
library(dplyr)
library(GGally)
library(performance)
library(car)
library(MLmetrics)
library(lmtest)
options(scipen = 9999)
admission <- read.csv('AdmissionTrain.csv')
head(admission)
summary(admission)
#> Serial.No. GRE.Score TOEFL.Score University.Rating
#> Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
#> 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
#> Median :200.5 Median :317.0 Median :107.0 Median :3.000
#> Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
#> 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
#> Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
#> SOP LOR CGPA Research
#> Min. :1.0 Min. :1.000 Min. :6.800 Min. :0.0000
#> 1st Qu.:2.5 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000
#> Median :3.5 Median :3.500 Median :8.610 Median :1.0000
#> Mean :3.4 Mean :3.453 Mean :8.599 Mean :0.5475
#> 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000
#> Max. :5.0 Max. :5.000 Max. :9.920 Max. :1.0000
#> Chance.of.Admit
#> Min. :0.3400
#> 1st Qu.:0.6400
#> Median :0.7300
#> Mean :0.7244
#> 3rd Qu.:0.8300
#> Max. :0.9700
ggcorr(admission, label = TRUE, hjust = 0.9)
The correlation plot shows that the predictors are strongly correlated with the target variable (Chance.of.Admit). Several predictors are also strongly correlated with each other; we will examine this later in the multicollinearity assumption test.
anyNA(admission)
#> [1] FALSE
No missing values were found in our data, so we can move on to modelling.
We can start by building a model that uses every available predictor.
init_model_all <- lm(Chance.of.Admit ~ ., data = admission)
Then we use the step() function for backward elimination: at each step it drops the predictor whose removal lowers the AIC the most, and it stops once no removal lowers the AIC any further.
backstep_model <- step(init_model_all, direction = 'backward')
#> Start: AIC=-2224.4
#> Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score + University.Rating +
#> SOP + LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> - SOP 1 0.000001 1.4703 -2226.4
#> <none> 1.4703 -2224.4
#> - University.Rating 1 0.013613 1.4839 -2222.7
#> - GRE.Score 1 0.036819 1.5071 -2216.5
#> - Research 1 0.038157 1.5084 -2216.2
#> - TOEFL.Score 1 0.045734 1.5160 -2214.2
#> - LOR 1 0.061407 1.5317 -2210.0
#> - Serial.No. 1 0.124472 1.5948 -2193.9
#> - CGPA 1 0.290273 1.7606 -2154.3
#>
#> Step: AIC=-2226.4
#> Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score + University.Rating +
#> LOR + CGPA + Research
#>
#> Df Sum of Sq RSS AIC
#> <none> 1.4703 -2226.4
#> - University.Rating 1 0.014924 1.4852 -2224.4
#> - GRE.Score 1 0.037011 1.5073 -2218.5
#> - Research 1 0.038427 1.5087 -2218.1
#> - TOEFL.Score 1 0.046338 1.5166 -2216.0
#> - LOR 1 0.073463 1.5437 -2208.9
#> - Serial.No. 1 0.125908 1.5962 -2195.5
#> - CGPA 1 0.298134 1.7684 -2154.6
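The AIC values printed by step() can be reproduced by hand from the RSS column. The following is a minimal sketch, assuming step() uses the extractAIC() form for linear models, n * log(RSS / n) + 2 * k, where k counts the estimated coefficients including the intercept. The helper name lm_aic is ours and is not part of any package.

```r
# Hypothetical helper: AIC in the form used by extractAIC() for lm fits.
lm_aic <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

n <- 400  # training rows (Serial.No. runs from 1 to 400 in the summary above)

# Full model: 8 predictors + intercept, RSS = 1.4703 from the <none> row
aic_full <- lm_aic(rss = 1.4703, n = n, k = 9)

# After dropping SOP: RSS is practically unchanged, one coefficient fewer
aic_no_sop <- lm_aic(rss = 1.4703, n = n, k = 8)

round(c(full = aic_full, without_SOP = aic_no_sop), 1)
```

The two values should match the -2224.4 and -2226.4 reported above: removing SOP barely changes the RSS, so the model saves the penalty of 2 for one coefficient, which is why step() drops it.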
summary(backstep_model)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score +
#> University.Rating + LOR + CGPA + Research, data = admission)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.233547 -0.026629 0.006155 0.038221 0.140281
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.29378909 0.11967201 -10.811 < 0.0000000000000002 ***
#> Serial.No. 0.00015926 0.00002749 5.794 0.0000000141 ***
#> GRE.Score 0.00179818 0.00057244 3.141 0.001810 **
#> TOEFL.Score 0.00368430 0.00104820 3.515 0.000491 ***
#> University.Rating 0.00880952 0.00441633 1.995 0.046761 *
#> LOR 0.02157648 0.00487533 4.426 0.0000124796 ***
#> CGPA 0.10533353 0.01181460 8.916 < 0.0000000000000002 ***
#> Research 0.02438836 0.00761939 3.201 0.001482 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06124 on 392 degrees of freedom
#> Multiple R-squared: 0.8188, Adjusted R-squared: 0.8156
#> F-statistic: 253.1 on 7 and 392 DF, p-value: < 0.00000000000000022
The model ends up using Serial.No. as a predictor. Serial.No. is only a row identifier and says nothing about the applicant, so we should remove it.
admission_model <- lm(Chance.of.Admit ~ GRE.Score + University.Rating + TOEFL.Score + LOR + CGPA + Research, admission)
summary(admission_model)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + University.Rating +
#> TOEFL.Score + LOR + CGPA + Research, data = admission)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.26391 -0.02197 0.01008 0.03621 0.15851
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.2543094 0.1243300 -10.089 < 0.0000000000000002 ***
#> GRE.Score 0.0017646 0.0005957 2.962 0.00324 **
#> University.Rating 0.0048540 0.0045404 1.069 0.28570
#> TOEFL.Score 0.0028389 0.0010801 2.628 0.00892 **
#> LOR 0.0210327 0.0050724 4.147 0.0000414 ***
#> CGPA 0.1179066 0.0120852 9.756 < 0.0000000000000002 ***
#> Research 0.0241542 0.0079287 3.046 0.00247 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06373 on 393 degrees of freedom
#> Multiple R-squared: 0.8033, Adjusted R-squared: 0.8003
#> F-statistic: 267.5 on 6 and 393 DF, p-value: < 0.00000000000000022
Our model now has a slightly lower adjusted R-squared (0.8003, down from 0.8156), but it is still a good fit. University.Rating is no longer a significant predictor (p-value = 0.286), so it is safe to remove it.
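The adjusted R-squared values being compared here can be checked from the Multiple R-squared figures in the two summaries. This is a small sketch using the standard formula 1 - (1 - R²)(n - 1) / (n - p - 1); the helper name adj_r2 is ours, and n = 400 is inferred from the residual degrees of freedom plus the number of coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 400  # training rows (393 residual df + 7 coefficients)

# Multiple R-squared values copied from the two summaries above
with_serial    <- adj_r2(r2 = 0.8188, n = n, p = 7)
without_serial <- adj_r2(r2 = 0.8033, n = n, p = 6)

round(c(with_serial = with_serial, without_serial = without_serial), 4)
```

Because the penalty term depends on p, adjusted R-squared is the fairer number to compare when the two models use a different number of predictors.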
Regarding SOP: after several attempts, I decided not to include the SOP variable in the model. Stepwise selection in both directions dropped SOP, its role is similar to that of LOR, and adding SOP to the model also reduced model performance.
Our regression model is: Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
admission_model <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research, admission)
summary(admission_model)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
#> CGPA + Research, data = admission)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.263542 -0.023297 0.009879 0.038078 0.159897
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.2984636 0.1172905 -11.070 < 0.0000000000000002 ***
#> GRE.Score 0.0017820 0.0005955 2.992 0.00294 **
#> TOEFL.Score 0.0030320 0.0010651 2.847 0.00465 **
#> LOR 0.0227762 0.0048039 4.741 0.00000297 ***
#> CGPA 0.1210042 0.0117349 10.312 < 0.0000000000000002 ***
#> Research 0.0245769 0.0079203 3.103 0.00205 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06374 on 394 degrees of freedom
#> Multiple R-squared: 0.8027, Adjusted R-squared: 0.8002
#> F-statistic: 320.6 on 5 and 394 DF, p-value: < 0.00000000000000022
Interpretation (each effect assumes the other predictors are held constant):
- For every 1-point increase in GRE.Score, the predicted chance of admission increases by 0.001782.
- For every 1-point increase in TOEFL.Score, the predicted chance of admission increases by 0.003032.
- For every 1-point increase in LOR, the predicted chance of admission increases by 0.0227762.
- For every 1-point increase in CGPA, the predicted chance of admission increases by 0.1210042.
- Having research experience (Research = 1 instead of 0) increases the predicted chance of admission by 0.0245769.
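To make the coefficients concrete, the prediction for a single applicant can be computed by hand. This is a sketch only: the coefficients are copied from the final model summary above, while the applicant's scores are invented for illustration.

```r
# Coefficients copied from the final model summary above
coefs <- c(
  intercept   = -1.2984636,
  GRE.Score   = 0.0017820,
  TOEFL.Score = 0.0030320,
  LOR         = 0.0227762,
  CGPA        = 0.1210042,
  Research    = 0.0245769
)

# Hypothetical applicant (values invented for illustration)
applicant <- c(GRE.Score = 320, TOEFL.Score = 110, LOR = 4, CGPA = 9, Research = 1)

# Linear predictor: intercept + sum of coefficient * value
predicted_chance <- unname(coefs["intercept"] + sum(coefs[names(applicant)] * applicant))
round(predicted_chance, 2)
```

For this hypothetical applicant the predicted chance is about 0.81. With the fitted model in memory, predict(admission_model, newdata = ...) on a one-row data frame with the same columns should give the same result, up to the rounding of the printed coefficients.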
Our linear regression model needs to satisfy four assumptions before we actually use it, so that its results are not biased.
cor.test(admission$Chance.of.Admit, admission$GRE.Score)
#>
#> Pearson's product-moment correlation
#>
#> data: admission$Chance.of.Admit and admission$GRE.Score
#> t = 26.843, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.7647419 0.8349536
#> sample estimates:
#> cor
#> 0.8026105
cor.test(admission$Chance.of.Admit, admission$TOEFL.Score)
#>
#> Pearson's product-moment correlation
#>
#> data: admission$Chance.of.Admit and admission$TOEFL.Score
#> t = 25.845, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.7519028 0.8255675
#> sample estimates:
#> cor
#> 0.791594
cor.test(admission$Chance.of.Admit, admission$LOR)
#>
#> Pearson's product-moment correlation
#>
#> data: admission$Chance.of.Admit and admission$LOR
#> t = 18, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.6120380 0.7206083
#> sample estimates:
#> cor
#> 0.6698888
cor.test(admission$Chance.of.Admit, admission$CGPA)
#>
#> Pearson's product-moment correlation
#>
#> data: admission$Chance.of.Admit and admission$CGPA
#> t = 35.759, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.8478354 0.8947275
#> sample estimates:
#> cor
#> 0.8732891
cor.test(admission$Chance.of.Admit, admission$Research)
#>
#> Pearson's product-moment correlation
#>
#> data: admission$Chance.of.Admit and admission$Research
#> t = 13.248, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4812548 0.6177458
#> sample estimates:
#> cor
#> 0.5532021
Based on the correlation tests above, all five predictors we used show a significant correlation with the target (p-value < 0.05).
hist(admission_model$residuals)
shapiro.test(admission_model$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: admission_model$residuals
#> W = 0.92193, p-value = 0.0000000000001443
H0 = errors are normally distributed
H1 = errors are not normally distributed
With p-value < 0.05 we reject H0, so, as expected, the model did not pass the statistical test. However, the histogram shows a bell curve with a longer tail on the left side, which means we have outliers in our data. I decided to leave them as they are because there are only a few of them, which is why we still see the bell curve. Based on the visual, I treat the normality assumption as met.
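A normal Q-Q plot is another common visual check that makes a long tail easier to see than a histogram does. The sketch below uses simulated, left-skewed residuals (not the admission data) so that it runs on its own; for the real model one would pass admission_model$residuals instead.

```r
set.seed(123)

# Simulated residuals with a long left tail (NOT the admission data)
sim_resid <- -rexp(400, rate = 1)
sim_resid <- sim_resid - mean(sim_resid)

# Points that bend away from the reference line reveal the skewed tail
qqnorm(sim_resid, main = "Normal Q-Q plot of simulated residuals")
qqline(sim_resid, col = "red")

# The formal test, for comparison with the visual impression
sw <- shapiro.test(sim_resid)
sw$p.value
```

With a few hundred observations the Shapiro-Wilk test is sensitive to even modest departures from normality, which is one reason to look at a plot alongside the p-value.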
plot(admission_model$fitted.values, admission_model$residuals)
abline(h = 0, col = 'red')
bptest(admission_model)
#>
#> studentized Breusch-Pagan test
#>
#> data: admission_model
#> BP = 22.428, df = 5, p-value = 0.0004341
With p-value < 0.05, we can conclude that heteroscedasticity is present. Our model did not pass the homoscedasticity test, likely because of the outliers. However, just as with the normality test, the residual plot does not show a visible pattern, so visually it passes the test.
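For intuition about what the studentized Breusch-Pagan test measures, its statistic can be computed by hand: regress the squared residuals on the predictors and take n times the R-squared of that auxiliary regression. The sketch below uses simulated data (not the admission data) in which the error spread grows with x, and assumes the studentized (Koenker) form of the test.

```r
set.seed(1)

# Simulated data whose error spread grows with x (NOT the admission data)
n <- 400
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ x)

# Studentized statistic: n * R^2, chi-squared with df = number of predictors
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(BP = bp_stat, p_value = bp_pval)
```

A small p-value means the squared residuals can be partly predicted from the predictors, so the error variance is not constant.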
vif(admission_model)
#> GRE.Score TOEFL.Score LOR CGPA Research
#> 4.585053 4.104255 1.829491 4.808767 1.530007
All VIF values are below the commonly used threshold of 10, so no multicollinearity is found.
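The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A VIF of about 4.8 for CGPA therefore means that roughly 79% of its variance is explained by the other four predictors. The sketch below computes VIF by hand on simulated data (not the admission data); the variable names are placeholders.

```r
set.seed(42)

# Simulated predictors: x2 is built to be strongly correlated with x1
n  <- 400
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.45)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# For each predictor: regress it on the others, then VIF = 1 / (1 - R^2)
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
round(manual_vif, 2)
```

In this simulation x1 and x2 get clearly inflated values while the independent x3 stays close to 1, which is the pattern to look for in the vif() output.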
Assumption Test Result :
- Linearity : all predictors are significantly correlated with the target (p-value < 0.05).
- Normality : the residuals fail the Shapiro-Wilk test, but the histogram is roughly bell-shaped; transforming the target variable would be a possible remedy. (Because our data contains several outliers, I judged it based on the visual.)
- Homoscedasticity : the Breusch-Pagan test detects heteroscedasticity, but the errors look randomly scattered. (Because our data contains several outliers, I judged it based on the visual.)
- Multicollinearity : all VIF values are below 10, so none of our predictors is problematically collinear with the others.
The model fails the normality and homoscedasticity tests because the data contains outliers, but since there are only a few of them, I accept both assumptions based on the visuals.
hist(scale(admission$Chance.of.Admit))
test <- read.csv('AdmissionTest.csv')
test$prediction <- predict(admission_model, newdata = test)
# MAPE TRAIN
MAPE(y_pred = admission_model$fitted.values, y_true = admission$Chance.of.Admit)
#> [1] 0.07317295
# MAPE TEST
MAPE(y_pred = test$prediction, y_true = test$Chance.of.Admit)
#> [1] 0.05130283
The MAPE on the test set (5%) is smaller than on the train set (7%).
# RMSE TRAIN
RMSE(y_pred = admission_model$fitted.values, y_true = admission$Chance.of.Admit)
#> [1] 0.06326207
# RMSE TEST
RMSE(y_pred = test$prediction, y_true = test$Chance.of.Admit)
#> [1] 0.04304102
The train and test error values are very close, and the test error is even lower (RMSE of 0.043 versus 0.063), so we can assume that the predictions are close to the actual values.
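For readers who want to see what the two metrics compute, here are minimal base-R versions. The helper names mape and rmse are ours, and the numbers are invented for illustration; they are not taken from the admission data.

```r
# Base-R versions of the two error metrics (helper names are ours)
mape <- function(y_true, y_pred) mean(abs((y_true - y_pred) / y_true))
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Hypothetical values, invented for illustration
y_true <- c(0.50, 0.80, 1.00)
y_pred <- c(0.45, 0.88, 0.90)

mape_value <- mape(y_true, y_pred)  # every prediction is off by 10%, so MAPE is 0.1
rmse_value <- rmse(y_true, y_pred)  # same unit as the target
c(MAPE = mape_value, RMSE = rmse_value)
```

MAPE is a relative error, so it reads as a percentage, while RMSE is expressed in the unit of the target. Here that unit is the 0 to 1 chance of admission.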
Conclusion :
- Our model is a good fit (adjusted R-squared = 0.8).
- The train and test error values are small and close to each other.
Recommendation : Students who are planning to continue their studies should work hard to get the highest CGPA possible (on a 0-10 scale), do research, and try to acquire a strong letter of recommendation, followed by strong TOEFL and GRE scores, to boost their chance of admission.