The purpose of this R Markdown document is to find out which parameters are most significant in predicting the result of graduate admissions from an Indian perspective. It can also help predict which applicants are most likely to be accepted to graduate school and which are not. This can be achieved with linear regression.
The dataset used here was obtained from Kaggle and was created by Mohan S Acharya, Asfia Armaan and Aneeta S Antony: "A Comparison of Regression Models for Prediction of Graduate Admissions", IEEE International Conference on Computational Intelligence in Data Science 2019.
library(tidyverse)
library(GGally) # to visualize the correlation among variables
library(performance) # to compare performances from different regression models
library(MLmetrics) # to check error rate such as RMSE
library(lmtest) # Breusch-Pagan assumption check
library(car) # will be used in assumption check using VIF
admission <- read_csv("Graduate Admission 2/Admission_Predict_Ver1.1.csv")
admission
Columns description:
1. Serial No.: Applicant identifier
2. GRE Score: GRE score (out of 340)
3. TOEFL Score: TOEFL score (out of 120)
4. University Rating: University rating (out of 5)
5. SOP: Statement of Purpose (out of 5)
6. LOR: Letter of Recommendation strength (out of 5)
7. CGPA: Undergraduate GPA (out of 10)
8. Research: Research experience (either 0 or 1)
9. Chance of Admit: Chance of admission (ranging from 0 to 1)
admission <- admission %>%
  mutate(`University Rating` = as.factor(`University Rating`)) %>%
  select(-`Serial No.`)
admission
anyNA(admission)
## [1] FALSE
All columns are now stored in appropriate data types and the data has no missing values. We are ready to build our linear model from the admission data frame.
Linear regression formula:
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \]
With a single predictor this reduces to y = b0 + b1x, where:
y : target variable / prediction result
b0 : intercept
b1 : slope
\(x_1, \dots, x_n\) : the predictor variables that we use
Choosing our target & predictor variables:
Target variable: Chance of Admit
Predictor variables: all other variables
Check the correlation between the target variable and the predictor variables with the ggcorr function from the GGally package.
# from 'GGally' package
ggcorr(admission, label = T)
## Warning in ggcorr(admission, label = T): data in column(s) 'University Rating'
## are not numeric and were ignored
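Among the predictor variables, CGPA has the strongest positive correlation with our target variable, followed closely by TOEFL Score and GRE Score. University Rating is not shown because it was converted to a factor and ggcorr only handles numeric columns.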
Split the data into 80% training data and 20% test data.
RNGkind(sample.kind = "Rounding")
set.seed(205)
index <- sample(x = nrow(admission), size = nrow(admission)*0.8)
train <- admission[index,]
test <- admission[-index,]
nrow(train)
## [1] 400
nrow(test)
## [1] 100
adm_model <- lm(`Chance of Admit`~., train)
summary(adm_model)
##
## Call:
## lm(formula = `Chance of Admit` ~ ., data = train)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.265434 -0.023774  0.006304  0.031860  0.159554
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -1.2199756  0.1180922 -10.331  < 2e-16 ***
## `GRE Score`           0.0016277  0.0005598   2.908 0.003849 **
## `TOEFL Score`         0.0029342  0.0009698   3.026 0.002646 **
## `University Rating`2 -0.0178475  0.0138356  -1.290 0.197827
## `University Rating`3 -0.0071599  0.0147563  -0.485 0.627803
## `University Rating`4 -0.0033960  0.0172852  -0.196 0.844344
## `University Rating`5  0.0130942  0.0197289   0.664 0.507270
## SOP                   0.0016861  0.0051836   0.325 0.745145
## LOR                   0.0166625  0.0045547   3.658 0.000289 ***
## CGPA                  0.1214008  0.0108234  11.217  < 2e-16 ***
## Research              0.0219704  0.0074641   2.943 0.003440 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05975 on 389 degrees of freedom
## Multiple R-squared:  0.83, Adjusted R-squared:  0.8256
## F-statistic: 189.9 on 10 and 389 DF,  p-value: < 2.2e-16
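Linear regression is a highly interpretable model: each coefficient can be read as the expected change in the target for a one-unit change in its predictor, holding the other predictors fixed. We can then insert the intercept and the slopes from the summary into the linear regression formula.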
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \]
y = b0 + b1x1 + b2x2 + …
Chance of Admit = -1.2199756 + 0.0016277 * GRE Score + …
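To make the formula concrete, here is a minimal sketch that plugs the coefficients from the summary of adm_model into it for a single applicant. The applicant's values are hypothetical and chosen only for illustration; the short coefficient names are assumptions made for readability. University Rating is a factor, so only the dummy for the applicant's own rating (3 here) contributes, and rating 1 is the baseline.

```r
# Coefficients copied from summary(adm_model).
coefs <- c(
  intercept = -1.2199756,
  gre       =  0.0016277,
  toefl     =  0.0029342,
  rating3   = -0.0071599,  # dummy for University Rating = 3 (rating 1 is the baseline)
  sop       =  0.0016861,
  lor       =  0.0166625,
  cgpa      =  0.1214008,
  research  =  0.0219704
)

# Hypothetical applicant (illustrative values, not a row from the dataset).
applicant <- c(
  intercept = 1,
  gre = 320, toefl = 110, rating3 = 1,
  sop = 4, lor = 4, cgpa = 8.5, research = 1
)

# y_hat = b0 + b1*x1 + ... + bn*xn
y_hat <- sum(coefs * applicant[names(coefs)])
round(y_hat, 3)
```

For these inputs the sum comes to about 0.744, and most of it is driven by the CGPA term (8.5 * 0.1214008).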
Model interpretation:
Next we do feature selection: deciding which of the predictor columns to keep by looking at the AIC (an estimate of information loss). This can be done using step-wise regression. Here we use backward elimination, which starts from the full model and drops a predictor whenever doing so lowers the AIC.
model_backward <- step(object = adm_model, direction = "backward")
## Start:  AIC=-2243.27
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` +
##     SOP + LOR + CGPA + Research
##
##                       Df Sum of Sq    RSS     AIC
## - SOP                  1   0.00038 1.3890 -2245.2
## - `University Rating`  4   0.02431 1.4129 -2244.3
## <none>                             1.3886 -2243.3
## - `GRE Score`          1   0.03018 1.4188 -2236.7
## - Research             1   0.03093 1.4195 -2236.5
## - `TOEFL Score`        1   0.03268 1.4213 -2236.0
## - LOR                  1   0.04777 1.4364 -2231.7
## - CGPA                 1   0.44909 1.8377 -2133.2
##
## Step:  AIC=-2245.16
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` +
##     LOR + CGPA + Research
##
##                       Df Sum of Sq    RSS     AIC
## - `University Rating`  4   0.02630 1.4152 -2245.7
## <none>                             1.3890 -2245.2
## - `GRE Score`          1   0.03003 1.4190 -2238.6
## - Research             1   0.03102 1.4200 -2238.3
## - `TOEFL Score`        1   0.03383 1.4228 -2237.5
## - LOR                  1   0.05429 1.4432 -2231.8
## - CGPA                 1   0.47220 1.8612 -2130.1
##
## Step:  AIC=-2245.66
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + LOR + CGPA +
##     Research
##
##                 Df Sum of Sq    RSS     AIC
## <none>                       1.4152 -2245.7
## - `GRE Score`    1   0.03433 1.4496 -2238.1
## - Research       1   0.03463 1.4499 -2238.0
## - `TOEFL Score`  1   0.03948 1.4547 -2236.7
## - LOR            1   0.07277 1.4880 -2227.6
## - CGPA           1   0.53332 1.9486 -2119.8
Through the process of step-wise regression, two predictors are removed from our model: SOP in the first step and University Rating in the second. Let’s now compare the performance of both models.
# from 'performance' package
compare_performance(adm_model, model_backward)
It turns out that ‘adm_model’ has a slightly higher adjusted R-squared than ‘model_backward’, but its AIC (information loss) is also a little higher. The Akaike information criterion (AIC) is a metric used to compare the fit of different regression models while penalizing the number of parameters; lower is better. An AIC can be negative, so what matters is the difference between the models rather than the sign.
The available options are to go with model_backward, which has the lower AIC, or with adm_model, which has the higher adjusted R-squared. At this stage, we will try model_backward first.
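The AIC values printed by step() can be reproduced by hand, which makes the trade-off between fit and model size easier to see. The sketch below assumes the form that step() reports for linear models, n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The helper name step_aic is hypothetical, and the RSS values are the rounded ones from the output above, so the results only match to about one decimal place.

```r
# AIC as reported by step() for a linear model: n * log(RSS / n) + 2 * k,
# where k counts the estimated coefficients (including the intercept).
step_aic <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

n_train <- 400  # rows in train

# Full model: RSS 1.3886, 11 coefficients (intercept + 10 terms)
aic_full <- step_aic(rss = 1.3886, n = n_train, k = 11)

# Final model: RSS 1.4152, 6 coefficients (intercept + 5 terms)
aic_backward <- step_aic(rss = 1.4152, n = n_train, k = 6)

round(c(full = aic_full, backward = aic_backward), 1)
```

The final model has a slightly larger RSS, but dropping five coefficients lowers the penalty term by 10, which is why its AIC ends up lower.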
model_backward$call
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` +
##     LOR + CGPA + Research, data = train)
# check our prediction
model_backward$fitted.values[1:5]
##         1         2         3         4         5
## 0.6696896 0.6254814 0.9090086 0.8247680 0.8561895
Error/residual is the difference between our prediction and the actual values.
Error check (RMSE)
# from 'MLmetrics' package
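# the fitted values come from the training data, so compare them with the training target
RMSE(y_pred = model_backward$fitted.values, y_true = train$`Chance of Admit`)
The fitted values have one entry per training row (400), so they have to be compared with train and not with the full 500-row admission data; mixing the two lengths makes R recycle the shorter vector and produces a meaningless error value. Using the final RSS of 1.4152 from the step-wise output, the training RMSE is sqrt(1.4152 / 400), or about 0.059.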
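RMSE is the square root of the mean squared residual. As a minimal base R sketch of the calculation, the hypothetical helper rmse() below also checks that both vectors have the same length, so a mismatch raises an error instead of a recycling warning. The numbers are toy values for illustration, not taken from the admissions data.

```r
# RMSE = sqrt(mean((y_true - y_pred)^2)), with an explicit length check so that
# mismatched vectors raise an error instead of being silently recycled.
rmse <- function(y_true, y_pred) {
  stopifnot(length(y_true) == length(y_pred))
  sqrt(mean((y_true - y_pred)^2))
}

# Toy values for illustration only
actual    <- c(0.92, 0.76, 0.72, 0.80)
predicted <- c(0.90, 0.80, 0.70, 0.78)

rmse(actual, predicted)
```

Because RMSE is expressed in the same unit as the target, it can be judged directly against the range of Chance of Admit.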
range(admission$`Chance of Admit`)
## [1] 0.34 0.97
hist(admission$`Chance of Admit`)
Assumption checks (normality of residuals, linearity, homoscedasticity via Breusch-Pagan, multicollinearity)
hist(model_backward$residuals)
Most of the residuals are concentrated around 0, giving the histogram a roughly bell-shaped form that is consistent with normally distributed residuals.
# from 'lmtest' package
bptest(model_backward)
##
##  studentized Breusch-Pagan test
##
## data:  model_backward
## BP = 22.189, df = 5, p-value = 0.000482
With Breusch-Pagan, we got a p-value of 0.000482, which is far lower than alpha (0.05). P-value < alpha means we reject H0. In the Breusch-Pagan test, H1 states that the error variance is not constant. Therefore, the model falls into the category of heteroscedasticity. Check again with a scatter plot.
plot(x = model_backward$fitted.values, # prediction results
y = model_backward$residuals) # residuals
abline(h = 0, col = "red", lty = 2)
Our plot leans towards heteroscedasticity, in particular a fan-shaped pattern. We don’t want the residuals to show any pattern; we want them scattered randomly around zero. This pattern indicates that the error variance in model_backward is not constant.
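To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R on simulated data. The data, the seed and all variable names are assumptions made for illustration, not the admissions data. The statistic is n times the R-squared of an auxiliary regression of the squared residuals on the predictors, compared against a chi-squared distribution with one degree of freedom per predictor.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Simulated data whose error spread grows with x (heteroscedastic by design)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) statistic: n * R^2 of the auxiliary regression,
# compared against a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = p_value)
```

If the squared residuals can be explained by the predictors, the auxiliary R-squared is large and the p-value small, which is the situation the fan-shaped residual plot describes.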
# from `car` package
vif(model_backward)
##   `GRE Score` `TOEFL Score`           LOR          CGPA      Research
##      4.569588      3.892302      1.733001      4.455922      1.512236
None of our predictor variables has a VIF of more than 10. This is good, because a VIF above 10 would mean that a predictor is strongly correlated with the other predictors (multicollinearity), and we don’t want that.
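A VIF can be translated back into an R-squared, which makes the threshold of 10 easier to interpret. The sketch below uses the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors, and applies it to the values printed above. The variable names are illustrative.

```r
# VIF values copied from the vif(model_backward) output
vif_values <- c(
  "GRE Score"   = 4.569588,
  "TOEFL Score" = 3.892302,
  "LOR"         = 1.733001,
  "CGPA"        = 4.455922,
  "Research"    = 1.512236
)

# VIF_j = 1 / (1 - R_j^2)  =>  R_j^2 = 1 - 1 / VIF_j
implied_r2 <- 1 - 1 / vif_values
round(implied_r2, 3)

# The usual threshold VIF = 10 corresponds to R_j^2 = 0.9
threshold_r2 <- 1 - 1 / 10
```

Read this way, about 78% of the variance in GRE Score is explained by the other four predictors, which is noticeable but still below the 90% implied by a VIF of 10.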
By using the check_model function, we can check all of these assumptions for our model at once.
# from 'performance' package
check_model(model_backward)
# predict on the test set with model_backward, the model selected above
adm_test <- predict(model_backward, test)
# add the predictions to the test data set as a new column
test$`Predicted Values` <- adm_test
Comparing the result between the prediction and the actual values.
test %>%
  select(c(`Predicted Values`, `Chance of Admit`)) %>%
  head()
Plot observed and predicted values from the test dataset.
test %>%
  ggplot(aes(x = `Chance of Admit`, y = `Predicted Values`)) +
  geom_point(alpha = 0.8) +
  geom_abline(intercept = 0,
              slope = 1,
              color = "blue",
              size = 1.5) +
  theme_minimal()
At this stage, the model that we use to predict graduate admissions is model_backward. Through the process of feature selection, we get five variables that are considered important in predicting the chance of admission of a graduate school applicant. The five contributing factors are: 1) GRE score 2) TOEFL score 3) LOR (Letter of Recommendation strength) 4) CGPA (undergraduate GPA) and 5) research experience. SOP and University Rating were dropped by the step-wise selection.
In terms of model performance, it managed to pass two out of four assumption checks: normality of residuals and multicollinearity. It did not pass the others; in particular, the Breusch-Pagan test and the residual plot both point to heteroscedasticity. A linear regression model should pass all four assumption checks to qualify as a “good” model, and model_backward does not meet that requirement. Therefore, we can say that linear regression is not the best choice for predicting graduate admissions in future use. It is not sufficient to reliably predict the chance of admission of a graduate school applicant.
Trying regression models other than linear regression may produce a better model than the one we just made.