This notebook attempts to predict the outcome of a graduate admission application based on several factors, using a linear regression model. The data is retrieved from Kaggle.com and inspired by the UCLA Graduate Dataset; it therefore reflects real cases from a university in the United States. The main goal of this analysis is to predict a student's chance of being admitted to this university based on several variables.
The analysis consists of data preparation, exploratory data analysis, model building, model evaluation, assumption checking, and a conclusion.
library(tidyverse)
library(ggplot2)
library(data.table)
library(GGally)
library(car)
library(caret)
library(scales)
library(lmtest)
library(MLmetrics)
library(dplyr)
options(scipen = 100, max.print = 1e+06)
adm <- read.csv("Admission_Predict_Ver1.1.csv")
head(adm)
str(adm)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The data consists of 500 rows and 9 columns:

- Serial.No. = ID number of each application
- GRE.Score = GRE score (out of 340)
- TOEFL.Score = TOEFL score (out of 120)
- University.Rating = bachelor's university rating (out of 5)
- SOP = Statement of Purpose strength (out of 5)
- LOR = Letter of Recommendation strength (out of 5)
- CGPA = undergraduate GPA (out of 10)
- Research = research experience (either 0 or 1)
- Chance.of.Admit = chance of admission (ranging from 0 to 1)

Of these 9 columns, GRE.Score, TOEFL.Score, and CGPA are continuous variables; the other variables are considered categorical. Although SOP, LOR, Research, and University.Rating are in numeric form, they are still categorical. Chance.of.Admit is the target variable. Serial.No. will be omitted and not used.
# Drop the ID column and change Research to factor
adm2 <- adm %>%
  select(-Serial.No.) %>%
  mutate(Research = as.factor(Research))
# Check NA Value
anyNA(adm2)
## [1] FALSE
# Check Summary
summary(adm2)
## GRE.Score TOEFL.Score University.Rating SOP
## Min. :290.0 Min. : 92.0 Min. :1.000 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000 1st Qu.:2.500
## Median :317.0 Median :107.0 Median :3.000 Median :3.500
## Mean :316.5 Mean :107.2 Mean :3.114 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000 Max. :5.000
## LOR CGPA Research Chance.of.Admit
## Min. :1.000 Min. :6.800 0:220 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 1:280 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :0.9700
adm_long <- adm %>%
  select(-Research) %>%      # drop the binary variable from the boxplot data
  pivot_longer(-Serial.No.)  # reshape to long format for faceting
ggplot(data = adm_long, aes(x = name, y = value, fill = name)) +
geom_boxplot() +
facet_wrap(facets = ~name, scales = 'free')
The faceted boxplots above show the distribution of each variable. Only two variables have outliers: LOR and Chance.of.Admit. The majority of each variable's data lies between Q1 and Q3.
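As a numeric check of that visual claim, the snippet below counts outliers per variable with the usual 1.5 × IQR rule (a small sketch reusing the adm_long data frame defined above):

# Count outliers per variable using the 1.5 * IQR rule
adm_long %>%
  group_by(name) %>%
  summarise(outliers = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
                             value > quantile(value, 0.75) + 1.5 * IQR(value)))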
Before building the model, exploratory data analysis is needed to explain the relations between the predictors and the target. Here the predictors are all variables except Chance.of.Admit, which is the target. The target takes the form of a likelihood: values closer to 1 mean a higher chance of acceptance, and vice versa.
ggcorr(data = adm2, label = T) +
labs(title = "Correlation Matrix")
All variables appear to be positively correlated with the target Chance.of.Admit. The predictor most strongly correlated with the target is the undergraduate CGPA; the other predictors have correlation scores between 0.6 and 0.8. Research is excluded from this correlation matrix because its data type is categorical (a factor in R). The predictor variables also correlate strongly with each other, for example CGPA with TOEFL.Score and CGPA with GRE.Score. It is natural that a student who graduates with a high CGPA might also have excellent TOEFL and GRE scores.
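As a rough workaround for the excluded factor, a point-biserial correlation between the binary Research flag and the target can be computed from the raw data, where Research is still numeric (a side check, not part of the main analysis):

# Point-biserial correlation: Research (0/1) vs. the target
cor(adm$Research, adm$Chance.of.Admit)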
The chart below shows the correlation coefficients, data distributions, and scatter plots.
library(psych)
pairs.panels(adm2,
             method = "spearman",  # Spearman rank correlation
             hist.col = "#00AFBB",
             density = TRUE,       # show density plots
             ellipses = TRUE)      # show correlation ellipses
This chart further emphasizes that every predictor has a reasonably strong correlation with the target variable.
# Split the data with 80:20 proportion
RNGkind(sample.kind = "Rounding")
set.seed(99)
adm_split <- sample(nrow(adm2), nrow(adm)*0.8)
adm_train <- adm2[adm_split,]
adm_test <- adm2[-adm_split,]
# Dimension of Train and Test data
dim(adm_train)
## [1] 400 8
dim(adm_test)
## [1] 100 8
Before moving to a more sophisticated model, I want to build a simple model with one predictor, so that at the end the simple and complex models are comparable.
# Simple Linear Regression
model_simple <- lm(formula = Chance.of.Admit ~ CGPA, data = adm_train)
summary(model_simple)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = adm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.280464 -0.030270 0.005316 0.039231 0.169758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.992720 0.044236 -22.44 <0.0000000000000002 ***
## CGPA 0.200370 0.005145 38.94 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06257 on 398 degrees of freedom
## Multiple R-squared: 0.7921, Adjusted R-squared: 0.7916
## F-statistic: 1517 on 1 and 398 DF, p-value: < 0.00000000000000022
From the summary above, the estimated intercept is -0.99 and the estimated CGPA coefficient is 0.2. The intercept means that when CGPA is 0, the predicted chance of admission is -0.99, i.e., effectively no chance of acceptance. The Adjusted R-squared of 0.79 means that this model explains approximately 79% of the variance in the target. The equation can therefore be written as:
\[ Y_i = -0.99 + 0.2 \, CGPA_i \]
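As a quick sanity check, the fitted equation can be applied by hand. The CGPA of 9.0 below is a hypothetical value chosen only for illustration:

# Predicted chance for a hypothetical CGPA of 9.0 (rounded coefficients)
-0.99 + 0.2 * 9.0
## [1] 0.81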
To keep only significant predictors, the multiple linear regression model will be refined with stepwise backward elimination.
# Stepwise Backward Regression Model
model_adm <- lm(formula = Chance.of.Admit ~ ., data = adm_train)
model_adm_back <- step(object = model_adm, direction = "backward", trace = F)
summary(model_adm_back)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP +
## LOR + CGPA + Research, data = adm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.271283 -0.022847 0.007868 0.032371 0.149924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2750160 0.1032684 -12.347 < 0.0000000000000002 ***
## GRE.Score 0.0019288 0.0005086 3.793 0.000172 ***
## TOEFL.Score 0.0029130 0.0009129 3.191 0.001532 **
## SOP 0.0065736 0.0043817 1.500 0.134354
## LOR 0.0149887 0.0041781 3.587 0.000376 ***
## CGPA 0.1152002 0.0096328 11.959 < 0.0000000000000002 ***
## Research1 0.0268645 0.0066635 4.032 0.0000666 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05537 on 393 degrees of freedom
## Multiple R-squared: 0.8392, Adjusted R-squared: 0.8367
## F-statistic: 341.8 on 6 and 393 DF, p-value: < 0.00000000000000022
Backward elimination dropped University.Rating, leaving 6 predictors. Of these, 5 are significant; only SOP is not. The model's Multiple R-squared and Adjusted R-squared are both approximately 0.84 (to two decimals). The equation is below:
\[ Y_i = -1.2750 + 0.0019 \, GRE.Score_i + 0.0029 \, TOEFL.Score_i + 0.0066 \, SOP_i + 0.0150 \, LOR_i + 0.1152 \, CGPA_i + 0.0269 \, Research_i \]
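The fitted model can also score a new applicant through predict(). The applicant profile below is entirely hypothetical, chosen only to illustrate the call:

# Hypothetical applicant (all values assumed, for illustration only)
new_app <- data.frame(GRE.Score = 320, TOEFL.Score = 110, SOP = 4, LOR = 4,
                      CGPA = 9.0, Research = factor(1, levels = c(0, 1)))
predict(object = model_adm_back, newdata = new_app)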
Comparing the two models, the multi-predictor model has a better Adjusted R-squared (0.84) than the simple one (0.79); a small table confirming this is sketched below. Therefore, for further analysis and prediction, I will use the multi-predictor model, model_adm_back.
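The comparison table can be assembled directly from the two model objects (the values simply restate the summaries shown above):

# Side-by-side Adjusted R-squared of both models
data.frame(model = c("model_simple", "model_adm_back"),
           adj_r_squared = c(summary(model_simple)$adj.r.squared,
                             summary(model_adm_back)$adj.r.squared))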
# Predicting testing data to the model
predicted_mv <- predict(object = model_adm_back, newdata = adm_test)
# Using MAE to calculate the error
MAE(y_pred = predicted_mv, y_true = adm_test$Chance.of.Admit)
## [1] 0.05310792
For model evaluation, I use MAE (Mean Absolute Error) because it is easy to interpret on the target's own scale and is less affected by outliers. The MAE of this model is 0.053, which, for a target ranging from 0.34 to 0.97, indicates good predictive performance.
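For reference, MAE() from MLmetrics is just the mean absolute difference between predictions and actual values, which can be verified by hand:

# MAE computed manually
mean(abs(predicted_mv - adm_test$Chance.of.Admit))
## [1] 0.05310792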
The linearity assumption expects that every predictor variable has a linear relationship with the target variable. It can be checked visually with a residuals-versus-fitted plot.
linearity <- data.frame(residual = model_adm_back$residuals,
                        fitted = model_adm_back$fitted.values)
linearity %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_hline(aes(yintercept = 0)) +
  geom_smooth() +
  theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The graph shows a pattern: points cluster around zero residual for fitted values of roughly 0.5 to 0.8. Residuals become more negative as the fitted values grow, before converging back toward the x-axis.
hist(model_adm_back$residuals)
The histogram shows that the residual distribution is skewed; the residual quantiles above (minimum -0.27 against maximum 0.15) point to a long left tail. This is a strong signal of non-normal residuals. To examine this formally, I use the Shapiro-Wilk test.
shapiro.test(model_adm_back$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_adm_back$residuals
## W = 0.93864, p-value = 0.000000000008698
The p-value given by the Shapiro-Wilk normality test is less than the alpha of 0.05. Therefore, the model's residuals are not normally distributed.
Homoscedasticity means homogeneity of variances: the residuals are assumed to have equal or similar variance across the range of fitted values. When the variance is not constant, it is called heteroscedasticity.
bptest(model_adm_back)
##
## studentized Breusch-Pagan test
##
## data: model_adm_back
## BP = 26.373, df = 6, p-value = 0.0001897
The Breusch-Pagan test gives a p-value less than alpha (0.05); therefore, the model shows signs of heteroscedasticity.
The no-multicollinearity assumption expects that the predictor variables do not strongly correlate with, or linearly determine, each other. It is commonly checked with the Variance Inflation Factor (VIF).
vif(model_adm_back)
## GRE.Score TOEFL.Score SOP LOR CGPA Research
## 4.273734 3.989381 2.428683 1.932297 4.474493 1.423618
Based on the reference, a VIF score above 5 indicates that multicollinearity may be present, and a VIF above 10 indicates that multicollinearity is almost certainly present among the variables. All VIF scores here are below 5, so the no-multicollinearity assumption is fulfilled.
Of the four assumptions, only two are fulfilled: the normality and homoscedasticity assumptions are violated. If this model is used to make predictions, the results may be misleading.
Based on this article, one alternative for fixing the model is to transform the dependent and/or independent variables with a logarithm. For the next model, I try transforming the independent variables and using them to predict the target. For context, the article notes: "If a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows (or decays) exponentially as a function of the independent variables."
# Log10 transformation of the numeric predictors
adm3 <- adm %>%
  select(-Serial.No.) %>% # drop the ID column
  mutate(GRE.Score = log10(GRE.Score),
         TOEFL.Score = log10(TOEFL.Score),
         University.Rating = log10(University.Rating),
         SOP = log10(SOP),
         LOR = log10(LOR),
         CGPA = log10(CGPA))
head(adm3) # Research (a 0/1 indicator) is left untransformed
# Reuse the earlier 80:20 train/test split indices
set.seed(99)
adm_train3 <- adm3[adm_split,]
adm_test3 <- adm3[-adm_split,]
To create the model, the lm() function is used with the target variable, the predictor variables, and the training data. This second model uses a training set whose predictor variables have been converted into logarithmic form.
# Modelling
model_logx <- lm(formula = Chance.of.Admit ~ ., data = adm_train3)
summary(model_logx)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = adm_train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.277000 -0.022683 0.008397 0.031723 0.153650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.513787 0.714058 -9.122 < 0.0000000000000002 ***
## GRE.Score 1.407398 0.367032 3.835 0.000147 ***
## TOEFL.Score 0.740896 0.224715 3.297 0.001066 **
## University.Rating 0.009342 0.023853 0.392 0.695544
## SOP 0.029557 0.030955 0.955 0.340250
## LOR 0.105349 0.029780 3.538 0.000452 ***
## CGPA 2.281419 0.186149 12.256 < 0.0000000000000002 ***
## Research 0.027478 0.006682 4.112 0.0000478 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05548 on 392 degrees of freedom
## Multiple R-squared: 0.839, Adjusted R-squared: 0.8361
## F-statistic: 291.9 on 7 and 392 DF, p-value: < 0.00000000000000022
Next, perform model predictions on the test data and evaluate the error with MAE.
# Predicting
predict_logx <- predict(object = model_logx, newdata = adm_test3)
# Model evaluation
MAE(y_pred = predict_logx, y_true = adm_test3$Chance.of.Admit)
## [1] 0.05273135
Since only two assumptions were violated, this section focuses only on the normality and homoscedasticity assumptions to test the new model.
hist(model_logx$residuals)
The histogram again shows a skewed residual distribution, with a long left tail (minimum -0.28 against maximum 0.15).
shapiro.test(model_logx$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_logx$residuals
## W = 0.93142, p-value = 0.000000000001362
The p-value of less than 0.05 (alpha) indicates that this model also violates the normality assumption, confirming the histogram visualization.
bptest(model_logx)
##
## studentized Breusch-Pagan test
##
## data: model_logx
## BP = 22.926, df = 7, p-value = 0.001756
Using the Breusch-Pagan test, the p-value of less than 0.05 (alpha) indicates that this model also violates the homoscedasticity assumption.
Of the several predictor variables, GRE score, TOEFL score, LOR, CGPA, and research experience are significant factors in predicting a person's chance of admission. Academic measures are commonly used to screen applications.
The correlation values show that all variables have a strong correlation with the target variable. The independent variables are also highly correlated with each other.
The linear models built here are able to produce predictions. However, in linear regression, prediction alone is not sufficient: the model must also satisfy its assumptions. If these assumptions are not met, the resulting predictions are likely to be misleading.
In the first model, the normality and homoscedasticity assumptions were not fulfilled. One way to address this is to log-transform some of the variables (independent or dependent) so that the variable relationships become more linear.
A second model was then created by log-transforming the independent variables. After re-testing the two previously violated assumptions, the second model still fails to fulfill them.
It can be concluded that this linear model is not a good match for predicting the probability of acceptance of university applications. Possible alternatives are to use a non-linear model, to perform PCA as a pre-processing step, or to use a regression method that addresses the problems in the OLS model.
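As one example of that last option, heteroscedasticity-consistent (robust) standard errors can be computed for the existing OLS fit. This is a minimal sketch, assuming the sandwich package is installed; coeftest() comes from the lmtest package loaded earlier:

library(sandwich)
# Re-test the coefficients with heteroscedasticity-consistent (HC1) errors
coeftest(model_adm_back, vcov = vcovHC(model_adm_back, type = "HC1"))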
Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.
https://www.statisticssolutions.com/assumptions-of-linear-regression/