Linear Regression on Graduate Admission Prediction
Introduction
About the data
The data contains several parameters which are considered important during the application for Masters Programs. The parameters included are :
- GRE Scores ( out of 340 )
- TOEFL Scores ( out of 120 )
- University Rating ( out of 5 )
- Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
- Undergraduate GPA ( out of 10 )
- Research Experience ( either 0 or 1 )
- Chance of Admit ( ranging from 0 to 1 )
Business goal
This dataset was created to predict the chance of graduate admission. It was built to help students shortlist universities in accordance with their profiles. The predicted output gives them a fair idea of their chances of getting admitted to a particular university.
What we will do
We will build a linear regression model using the Graduate Admission data from Kaggle. We want to understand the relationships among the variables, especially between Admission_Chance and the other variables. We also want to predict the chance of someone getting into a university using historical data. You can download the data here: https://www.kaggle.com/mohansacharya/graduate-admissions
Import Library
library(dplyr)
library(ggplot2)
library(GGally)
library(performance)
library(MLmetrics)
library(rmdformats)
library(lmtest)
library(car)
Data Preparation
Read data
admission <- read.csv("data_input/Admission_Predict.csv") %>%
  select(-Serial.No.)
rmarkdown::paged_table(admission)
Renaming columns
names(admission) <- c("GRE", "TOEFL", "University_Rating", "SOP_Strength", "LOR_Strength", "CGPA", "Research", "Admission_Chance")
head(admission)
##   GRE TOEFL University_Rating SOP_Strength LOR_Strength CGPA Research
## 1 337   118                 4          4.5          4.5 9.65        1
## 2 324   107                 4          4.0          4.5 8.87        1
## 3 316   104                 3          3.0          3.5 8.00        1
## 4 322   110                 3          3.5          2.5 8.67        1
## 5 314   103                 2          2.0          3.0 8.21        0
## 6 330   115                 5          4.5          3.0 9.34        1
##   Admission_Chance
## 1             0.92
## 2             0.76
## 3             0.72
## 4             0.80
## 5             0.65
## 6             0.90
Data Wrangling/Preprocessing
Check for missing values
admission %>%
  is.na() %>%
  colSums() / nrow(admission)
##               GRE             TOEFL University_Rating      SOP_Strength
##                 0                 0                 0                 0
##      LOR_Strength              CGPA          Research  Admission_Chance
##                 0                 0                 0                 0
There are no missing values, so no imputation or row removal is needed.
Check for mismatched data types and change them if necessary
Converting columns to the right data type makes the analysis and modelling steps easier.
summary(admission)
## GRE TOEFL University_Rating SOP_Strength
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.0
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.5
## Median :317.0 Median :107.0 Median :3.000 Median :3.5
## Mean :316.8 Mean :107.4 Mean :3.087 Mean :3.4
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.0
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.0
## LOR_Strength CGPA Research Admission_Chance
## Min. :1.000 Min. :6.800 Min. :0.0000 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000 1st Qu.:0.6400
## Median :3.500 Median :8.610 Median :1.0000 Median :0.7300
## Mean :3.453 Mean :8.599 Mean :0.5475 Mean :0.7244
## 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000 3rd Qu.:0.8300
## Max. :5.000 Max. :9.920 Max. :1.0000 Max. :0.9700
University_Rating and Research take only a few discrete values, so we can convert them to factors.
Change the respective columns to the right data types
admission <-
  admission %>%
  mutate_at(vars(University_Rating, Research), as.factor)
str(admission)
## 'data.frame': 400 obs. of 8 variables:
## $ GRE : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University_Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP_Strength : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR_Strength : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
## $ Admission_Chance : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
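Converting a column to a factor changes how lm() treats it: each level other than the first gets its own dummy (0/1) column. The sketch below shows this on a tiny made-up data frame, not on the admission data; the names toy, rating and score are hypothetical and exist only for this illustration.

```r
# Toy data (values are invented for illustration only)
toy <- data.frame(
  rating = factor(c(1, 2, 3, 2, 1, 3)),
  score  = c(0.50, 0.62, 0.81, 0.66, 0.48, 0.79)
)

# model.matrix() exposes the design matrix that lm() builds internally:
# one intercept column plus one dummy column per non-reference level
design <- model.matrix(score ~ rating, data = toy)
print(colnames(design))

# Each dummy coefficient is the difference in mean score between
# that level and the reference level ("1")
fit <- lm(score ~ rating, data = toy)
print(coef(fit))
```

This is why a factor predictor with five levels, such as University_Rating, appears as four separate coefficients in a model summary.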
Exploratory Data Analysis
Check correlations among the numeric columns to the target column
ggcorr(admission, label = T, hjust = 0.7)
## Warning in ggcorr(admission, label = T, hjust = 0.7): data in column(s)
## 'University_Rating', 'Research' are not numeric and were ignored
Insights:
- All the numeric columns are positively correlated with the target column
- CGPA has the strongest correlation (0.9) with the admission chance
- TOEFL, SOP_Strength and LOR_Strength show similar correlations with Admission_Chance
Plot a scatter plot of Admission_Chance against CGPA
This is done to dig deeper and confirm the strong correlation between the two.
plot(admission$CGPA,
     admission$Admission_Chance,
     main = "Plot of CGPA against the Chance of Getting into College",
     xlab = "Cumulative GPA",
     ylab = "Admission Chance")
The plot above illustrates a strong positive correlation between Admission_Chance and CGPA: the higher the CGPA, the higher the chance of getting into a college.
Plot box plots for CGPA, TOEFL and Admission_Chance
We look for possible outliers.
boxplot(admission$CGPA,
        main = "CGPA Data Distribution")
boxplot(admission$TOEFL,
        main = "TOEFL Score Data Distribution")
boxplot(admission$Admission_Chance,
        main = "Admission Chance Data Distribution")
Cross Validation
This step sets aside some “unseen” data so that we can measure how well the model performs on observations it was not trained on. We will use a 75:25 train/test split for this data set.
Splitting the data into train and test sets
set.seed(123)
index <- sample(nrow(admission), nrow(admission) * 0.75)
data_train <- admission[index, ]
data_test <- admission[-index, ]
Model Building
Build a model that uses all predictors
model_admission_all <- lm(formula = Admission_Chance ~ ., data = data_train)
summary(model_admission_all)
##
## Call:
## lm(formula = Admission_Chance ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.263574 -0.020604 0.009283 0.032224 0.165258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2891833 0.1466345 -8.792 < 2e-16 ***
## GRE 0.0020508 0.0006663 3.078 0.002284 **
## TOEFL 0.0029860 0.0011909 2.507 0.012715 *
## University_Rating2 -0.0217269 0.0162474 -1.337 0.182193
## University_Rating3 -0.0148768 0.0177127 -0.840 0.401663
## University_Rating4 -0.0178331 0.0216163 -0.825 0.410063
## University_Rating5 0.0020886 0.0237871 0.088 0.930093
## SOP_Strength -0.0022572 0.0066178 -0.341 0.733294
## LOR_Strength 0.0220683 0.0065183 3.386 0.000809 ***
## CGPA 0.1133068 0.0138506 8.181 9.05e-15 ***
## Research1 0.0253893 0.0091847 2.764 0.006071 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06201 on 289 degrees of freedom
## Multiple R-squared: 0.8093, Adjusted R-squared: 0.8027
## F-statistic: 122.6 on 10 and 289 DF, p-value: < 2.2e-16
Model Evaluation
Calculate the RMSE of the training data
RMSE(y_pred = model_admission_all$fitted.values, y_true = data_train$Admission_Chance)
## [1] 0.06086444
The training RMSE is about 0.061 on a target that ranges from 0.34 to 0.97, so the model fits the training data well.
Make a prediction on data_test
model_admission_all_pred <- predict(model_admission_all,
                                    newdata = data_test %>% select(-Admission_Chance))
RMSE(y_pred = model_admission_all_pred, y_true = data_test$Admission_Chance)
## [1] 0.06885452
The test RMSE (0.0689) is only slightly higher than the training RMSE (0.0609), so there is no strong sign of overfitting.
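For readers who want to see what the RMSE figures measure, here is a minimal sketch that computes RMSE by hand on simulated data with the same 75:25 split idea. The helper rmse(), the data frame sim and its coefficients are assumptions made up for this example; it does not reproduce the admission results.

```r
# Hand-written RMSE: square root of the mean squared prediction error
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

set.seed(123)
n <- 400
x <- runif(n, 6.8, 9.9)                    # simulated predictor
y <- 0.2 * x - 1 + rnorm(n, sd = 0.06)     # simulated target with noise
sim <- data.frame(x = x, y = y)

# 75:25 hold-out split
index <- sample(nrow(sim), nrow(sim) * 0.75)
train <- sim[index, ]
test  <- sim[-index, ]

fit <- lm(y ~ x, data = train)
rmse_train <- rmse(train$y, fitted(fit))
rmse_test  <- rmse(test$y, predict(fit, newdata = test))

# A test RMSE far above the training RMSE would hint at overfitting
print(c(train = rmse_train, test = rmse_test))
```

Because RMSE is expressed in the units of the target, it is easiest to judge against the target's range rather than in isolation.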
Model fine-tuning
To fine-tune the model, we can use stepwise regression to select a subset of features, which may give a simpler model with comparable performance.
Feature selection
Using Step-Wise regression with “backwards” direction
model_admission_back <- step(model_admission_all,
                             direction = "backward",
                             trace = F)
summary(model_admission_back)
##
## Call:
## lm(formula = Admission_Chance ~ GRE + TOEFL + LOR_Strength +
## CGPA + Research, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26640 -0.02093 0.01044 0.03477 0.16085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3423417 0.1343084 -9.994 < 2e-16 ***
## GRE 0.0021750 0.0006581 3.305 0.001068 **
## TOEFL 0.0028850 0.0011610 2.485 0.013513 *
## LOR_Strength 0.0219284 0.0056717 3.866 0.000136 ***
## CGPA 0.1138209 0.0132256 8.606 4.64e-16 ***
## Research1 0.0246875 0.0091284 2.704 0.007240 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06203 on 294 degrees of freedom
## Multiple R-squared: 0.8059, Adjusted R-squared: 0.8026
## F-statistic: 244.1 on 5 and 294 DF, p-value: < 2.2e-16
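To make the coefficients above concrete, the sketch below plugs the first applicant shown by head(admission) into the reduced model by hand. The coefficient values and applicant values are copied from the outputs above; the variable names coefs and applicant are made up for this example.

```r
# Coefficients of the reduced model, copied from summary(model_admission_back)
coefs <- c(intercept    = -1.3423417,
           GRE          =  0.0021750,
           TOEFL        =  0.0028850,
           LOR_Strength =  0.0219284,
           CGPA         =  0.1138209,
           Research1    =  0.0246875)

# First applicant from head(admission); intercept term is 1, Research = 1
applicant <- c(intercept    = 1,
               GRE          = 337,
               TOEFL        = 118,
               LOR_Strength = 4.5,
               CGPA         = 9.65,
               Research1    = 1)

# A linear model's prediction is the sum of coefficient * value
predicted <- sum(coefs * applicant[names(coefs)])
print(round(predicted, 3))  # about 0.953; the recorded value is 0.92
```

Each coefficient is read as the expected change in Admission_Chance for a one-unit change in that predictor with the others held fixed, so one extra CGPA point is associated with roughly 0.11 more admission chance.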
Try predicting using the new model
model_admission_back_pred <- predict(model_admission_back,
                                     newdata = data_test %>% select(-Admission_Chance))
# Score the reduced model's own predictions (not model_admission_all_pred)
RMSE(y_pred = model_admission_back_pred, y_true = data_test$Admission_Chance)
This test RMSE can be compared with the 0.06885452 obtained for the full model. Next, we compare the overall performance of the two models.
Comparing the two models’ performance
compare_performance(model_admission_all, model_admission_back)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## ------------------------------------------------------------------------------------------------------------------
## model_admission_all | lm | -804.101 | 0.088 | -759.655 | < 0.001 | 0.809 | 0.803 | 0.061 | 0.062
## model_admission_back | lm | -808.771 | 0.912 | -782.844 | 1.000 | 0.806 | 0.803 | 0.061 | 0.062
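Both models perform similarly: the adjusted R2 (0.803), RMSE (0.061) and sigma (0.062) are the same to three decimals. However, model_admission_back has a lower AIC (-808.771 vs -804.101) and a lower BIC (-782.844 vs -759.655) while using fewer predictors. We therefore use model_admission_back for the next steps.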
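The same kind of comparison can be sketched with base R only, using AIC(), BIC() and the adjusted R-squared from summary(). The example below uses the built-in mtcars data rather than the admission data, so the numbers are unrelated to the table above; the names full, reduced and comparison are hypothetical.

```r
# Full model and a backward-stepwise reduction on a built-in data set
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)

# Lower AIC/BIC is better; both penalise extra coefficients
comparison <- data.frame(
  model  = c("full", "reduced"),
  n_coef = c(length(coef(full)), length(coef(reduced))),
  AIC    = c(AIC(full), AIC(reduced)),
  BIC    = c(BIC(full), BIC(reduced)),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
)
print(comparison)
```

Since step() selects by AIC, the reduced model can never have a higher AIC than the model it started from; the interesting question is how much fit (R-squared, RMSE) is given up for the simpler model.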
Checking for assumptions
Linearity
Why do we test linearity? Because a linear regression model can only capture linear patterns well.
Let’s take one of the strongly correlated predictors, CGPA, and test its linear association with the target variable:
cor.test(x = admission$CGPA, y = admission$Admission_Chance)
##
## Pearson's product-moment correlation
##
## data: admission$CGPA and admission$Admission_Chance
## t = 35.759, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8478354 0.8947275
## sample estimates:
## cor
## 0.8732891
Since the p-value is smaller than 0.05, we can say that CGPA and Admission_Chance are significantly linearly correlated (r = 0.87).
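One caveat worth keeping in mind: a high Pearson correlation shows a strong linear association, but it does not by itself prove that the relationship is linear. The sketch below uses a made-up quadratic relationship (the names x, y_curved and fit are hypothetical) to show a correlation above 0.95 together with residuals that still follow a clear curve.

```r
# A purely quadratic relationship, with no noise at all
x <- 1:100
y_curved <- x^2

# Pearson correlation is still very high
r <- cor(x, y_curved)
print(r)

# But a straight-line fit leaves a systematic pattern in the residuals:
# positive at both ends, negative in the middle
fit <- lm(y_curved ~ x)
print(residuals(fit)[c(1, 50, 100)])
```

For this reason a residuals-versus-fitted plot is a useful companion to cor.test() when checking linearity.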
Normality
The normality assumption concerns whether the model residuals are normally distributed. We use the Shapiro-Wilk test, whose null hypothesis is that the residuals are normal:
shapiro.test(model_admission_back$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_admission_back$residuals
## W = 0.90741, p-value = 1.283e-12
Since the p-value is smaller than 0.05, we reject the null hypothesis of normality: the residuals are not normally distributed, so this assumption is not satisfied.
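To illustrate how the Shapiro-Wilk p-value is read, the sketch below runs the test on two simulated samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The samples are invented for this example and are not the model residuals.

```r
set.seed(42)

# Sample that really is normal: the test usually does not reject
normal_sample <- rnorm(300)
print(shapiro.test(normal_sample)$p.value)

# Strongly right-skewed sample: the test rejects normality
skewed_sample <- rexp(300)
res_skewed <- shapiro.test(skewed_sample)
print(res_skewed$p.value)

# Decision rule: p < 0.05 means "reject normality"
is_normal_rejected <- res_skewed$p.value < 0.05
print(is_normal_rejected)
```

A small p-value is therefore evidence against normality, not for it.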
Homoscedasticity
Here we plot the model's fitted values against the residuals:
plot(model_admission_back$fitted.values, model_admission_back$residuals)
abline(h = 0, col = "red")
We will use the Breusch-Pagan test, whose null hypothesis is homoscedasticity (constant residual variance):
bptest(model_admission_back)
##
## studentized Breusch-Pagan test
##
## data: model_admission_back
## BP = 14.046, df = 5, p-value = 0.01532
Since the p-value (0.01532) is smaller than 0.05, we reject the null hypothesis of homoscedasticity: there is evidence of heteroscedasticity in the model, so this assumption is not satisfied.
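The idea behind the Breusch-Pagan test can be sketched in base R: regress the squared residuals on the predictors and check whether the predictors explain them. The example below uses simulated data whose noise grows with x; the statistic is computed as n times the R-squared of that auxiliary regression, which, as far as I understand, corresponds to the studentized form reported by bptest(). All names here (fit, aux, bp_stat, bp_p) are hypothetical.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2 * x)  # noise sd grows with x

fit <- lm(y ~ x)

# Auxiliary regression: do the predictors explain the squared residuals?
aux <- lm(residuals(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared        # test statistic
bp_df   <- length(coef(aux)) - 1             # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Small p-value: reject constant variance (heteroscedasticity present)
print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

With constant-variance noise instead (for example sd = 0.5), the same code would usually give a p-value above 0.05.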
Multicollinearity
We check for multicollinearity to make sure that there is no strong dependency among the predictors used in the model.
We use the variance inflation factor (VIF) to check for multicollinearity:
vif(model_admission_back)
##          GRE        TOEFL LOR_Strength         CGPA     Research
##     4.380017     3.788371     2.013739     4.635938     1.607922
None of the VIF values exceeds 10, so multicollinearity is not a concern.
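For numeric predictors, the VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on three columns of the built-in mtcars data; the names predictors and vif_manual are made up for this example, and the values are unrelated to the admission model.

```r
# Three numeric predictors from a built-in data set
predictors <- mtcars[, c("wt", "hp", "disp")]

# For each predictor: regress it on the others, then apply 1 / (1 - R^2)
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 3))
```

A VIF of 1 means the predictor is uncorrelated with the others; the larger the value, the more of its variation is already explained by the remaining predictors.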
Conclusion
The predictors that are useful for explaining the variance in the chance of being admitted to the university are GRE, TOEFL, LOR_Strength, CGPA and Research. The final model does not satisfy all of the classical assumptions: no predictor has a VIF above 10, but the Shapiro-Wilk and Breusch-Pagan tests both return p-values below 0.05, which points to non-normal residuals and heteroscedasticity. After feature selection, the model has an adjusted R-squared of 0.8026, meaning the predictors explain about 80% of the variance in the chance of being admitted, which still leaves room for improvement. Other models may achieve higher performance.
The accuracy of the model in predicting the admission chance is measured with RMSE. The full model has an RMSE of 0.06086444 on the training data and 0.06885452 on the testing data, and the selected model has a training RMSE of 0.061. The small gap between training and testing error suggests that the model does not overfit the training data.
We have learned how to build a linear regression model and what to watch for when building it.
Reference
Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019