This dataset was created by Mohan S Acharya to estimate the chance of graduate admission from an Indian perspective. It was built to help students shortlist universities based on their profiles; the predicted output gives them a fair idea of their chances at a particular university.
We will learn to use a linear regression model on the Graduate Admission dataset to understand the relationships among the variables, especially between Chance of Admit and the other variables. We also want to predict the chance of admission based on historical data. You can download the data here.
Load the required packages.
library(dplyr)
library(GGally)
library(performance)
library(MLmetrics)
library(lmtest)
library(car)
Load the dataset.
graduate <- read.csv('data_input/Admission_Predict.csv')
graduate
a. Data Structure
glimpse(graduate)
#> Rows: 400
#> Columns: 9
#> $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
#> $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
#> $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
#> $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
#> $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
#> $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
#> $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
#> $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
#> $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~
summary(graduate)
#> Serial.No. GRE.Score TOEFL.Score University.Rating
#> Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
#> 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
#> Median :200.5 Median :317.0 Median :107.0 Median :3.000
#> Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
#> 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
#> Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
#> SOP LOR CGPA Research
#> Min. :1.0 Min. :1.000 Min. :6.800 Min. :0.0000
#> 1st Qu.:2.5 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000
#> Median :3.5 Median :3.500 Median :8.610 Median :1.0000
#> Mean :3.4 Mean :3.453 Mean :8.599 Mean :0.5475
#> 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000
#> Max. :5.0 Max. :5.000 Max. :9.920 Max. :1.0000
#> Chance.of.Admit
#> Min. :0.3400
#> 1st Qu.:0.6400
#> Median :0.7300
#> Mean :0.7244
#> 3rd Qu.:0.8300
#> Max. :0.9700
b. Missing Value Check
anyNA(graduate)
#> [1] FALSE
*no missing values
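Had anyNA() returned TRUE, a per-column count would show where the gaps are. A minimal sketch, using base R only:
# Count missing values per column (all zeros for this dataset)
colSums(is.na(graduate))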
c. Data Overview
Glossary of the graduate data:
- Serial.No.: Index number (1-400)
- GRE.Score: Graduate Record Examination score (out of 340)
- TOEFL.Score: TOEFL score (out of 120)
- University.Rating: University rating (out of 5)
- SOP: Statement of Purpose strength (out of 5)
- LOR: Letter of Recommendation strength (out of 5)
- CGPA: Undergraduate GPA (out of 10)
- Research: Research experience (either 0 or 1)
- Chance.of.Admit: Chance of admission (ranging from 0 to 1)
Data summary:
- The data has 400 rows and 9 columns.
- Serial.No. is an index number, so we can ignore it.
- Our target variable is Chance.of.Admit.
- The Research and University.Rating data types can be changed into factors.
d. Data Cleaning
graduate_clean <- graduate %>%
  select(-Serial.No.) %>%
  mutate_at(c('Research', 'University.Rating'), as.factor)
glimpse(graduate_clean)
#> Rows: 400
#> Columns: 8
#> $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
#> $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
#> $ University.Rating <fct> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
#> $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
#> $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
#> $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
#> $ Research <fct> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
#> $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~
e. Data Correlation
ggcorr(graduate_clean, hjust = 0.64, label = T, label_size = 3, cex = 3)
Based on the correlation visualization above, it is known that:
- CGPA is the predictor variable with the highest correlation to the target.
- LOR is the predictor variable with the lowest correlation to the target.
(Note that ggcorr() only computes correlations between numeric columns, so the two factor variables, Research and University.Rating, are excluded from the plot.)
Before we make the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and cannot predict new data that it hasn't seen during the training phase. We will use 70% of the data as the training data and the rest as the testing data.
RNGkind(sample.kind = "Rounding")
set.seed(123)
intrain <- sample(nrow(graduate_clean), nrow(graduate_clean)*0.70)

# train-test splitting
graduate_train <- graduate_clean[intrain, ]
graduate_test <- graduate_clean[-intrain, ]
Now we will fit a linear regression model using Chance.of.Admit as the target variable.
a. Linear Regression Model with all predictor variables
model_all <- lm(Chance.of.Admit ~ ., data = graduate_train)
summary(model_all)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ ., data = graduate_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.265365 -0.021831 0.008806 0.036244 0.154209
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.2389804 0.1588765 -7.798 1.38e-13 ***
#> GRE.Score 0.0016445 0.0007635 2.154 0.03214 *
#> TOEFL.Score 0.0034531 0.0013086 2.639 0.00881 **
#> University.Rating2 -0.0031457 0.0183101 -0.172 0.86372
#> University.Rating3 0.0058106 0.0197358 0.294 0.76866
#> University.Rating4 0.0072630 0.0241904 0.300 0.76422
#> University.Rating5 0.0232278 0.0265290 0.876 0.38205
#> SOP -0.0012208 0.0068045 -0.179 0.85775
#> LOR 0.0181388 0.0069112 2.625 0.00917 **
#> CGPA 0.1156133 0.0144201 8.018 3.32e-14 ***
#> Research1 0.0270413 0.0095338 2.836 0.00491 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06526 on 269 degrees of freedom
#> Multiple R-squared: 0.7997, Adjusted R-squared: 0.7923
#> F-statistic: 107.4 on 10 and 269 DF, p-value: < 2.2e-16
b. Linear Regression Model with selected predictor variables using the backward elimination method
model_back <- step(object = model_all, direction = "backward", trace = 0)
summary(model_back)
#>
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
#> CGPA + Research, data = graduate_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.264661 -0.022468 0.009422 0.038006 0.151586
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3053253 0.1440060 -9.064 < 2e-16 ***
#> GRE.Score 0.0017322 0.0007487 2.314 0.021423 *
#> TOEFL.Score 0.0035195 0.0012830 2.743 0.006486 **
#> LOR 0.0201934 0.0057446 3.515 0.000514 ***
#> CGPA 0.1185797 0.0139077 8.526 1.04e-15 ***
#> Research1 0.0278794 0.0093994 2.966 0.003282 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06494 on 274 degrees of freedom
#> Multiple R-squared: 0.798, Adjusted R-squared: 0.7943
#> F-statistic: 216.5 on 5 and 274 DF, p-value: < 2.2e-16
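As a usage note: the call above set trace = 0 to suppress the elimination log. Re-running step() with its default trace = 1 prints the AIC of every candidate removal at each step, which shows why University.Rating and SOP were dropped:
# Verbose backward elimination: prints each step and candidate AICs
step(object = model_all, direction = "backward", trace = 1)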
The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error. RMSE is preferable to MAE (mean absolute error) here because RMSE squares the differences between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package.
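To make the metric concrete, below is a minimal sketch of the computation behind RMSE(). The helper rmse_manual() is hypothetical, written here for illustration only; it should return the same value as the package function.
# RMSE by hand: square the errors, average them, take the square root
rmse_manual <- function(y_true, y_pred) {
  sqrt(mean((y_true - y_pred)^2))
}
rmse_manual(graduate_train$Chance.of.Admit, model_all$fitted.values)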
Below is the first model (with all predictor variables) performance.
# Data train RMSE
RMSE(y_pred = model_all$fitted.values, y_true = graduate_train$Chance.of.Admit)
#> [1] 0.06396333
# Data test RMSE
lm_pred <- predict(object = model_all, newdata = graduate_test %>% select(-Chance.of.Admit))
RMSE(y_pred = lm_pred, y_true = graduate_test$Chance.of.Admit)
#> [1] 0.06087141
Below is the performance of the second model (with variables removed by the backward elimination method).
# Data train RMSE
RMSE(y_pred = model_back$fitted.values, y_true = graduate_train$Chance.of.Admit)
#> [1] 0.06423743
# Data test RMSE
lm_pred2 <- predict(object = model_back, newdata = graduate_test %>% select(-Chance.of.Admit))
RMSE(y_pred = lm_pred2, y_true = graduate_test$Chance.of.Admit)
#> [1] 0.06120476
Performance Comparison
compare_performance(model_all, model_back, rank = T)
The two models perform almost identically on the testing dataset; the comparison above ranks the second model higher overall because it achieves comparable performance with fewer predictors, even if only by a small margin.
Linear regression is a parametric model: for the model equation to be reliable, the model must satisfy several classical assumptions. A linear regression that violates these assumptions may be misleading or simply have biased estimators. In this section, we only check the second model (the one with variables removed).
1. Linearity
resact <- data.frame(residual = model_back$residuals, fitted = model_back$fitted.values)

resact %>%
  ggplot(aes(fitted, residual)) +
  geom_point() + geom_smooth() +
  geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(),
        panel.background = element_blank())
As can be seen from the plot, there is no visible pattern in the residuals, which indicates that the linearity assumption is satisfied.
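The same diagnostic is also available from base R, which additionally labels the most extreme residuals. A quick sketch:
# Residuals vs. fitted values; base R version of the ggplot above
plot(model_back, which = 1)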
2. Normality of Residual
plot(density(model_back$residuals))
shapiro.test(model_back$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_back$residuals
#> W = 0.91432, p-value = 1.445e-11
The null hypothesis is that the residuals follow a normal distribution. With a p-value < 0.05, the null hypothesis is rejected: our residuals do not follow a normal distribution.
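As an additional visual check, a normal Q-Q plot of the residuals (base R) shows the same departure from normality:
# Points that stray from the reference line indicate non-normal residuals
qqnorm(model_back$residuals)
qqline(model_back$residuals, col = "red")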
3. Homoscedasticity
bptest(model_back)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_back
#> BP = 17.366, df = 5, p-value = 0.003856
resact %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  theme_light() +
  geom_hline(aes(yintercept = 0))
We can observe that at lower fitted values the residuals are spread over a wide range; as the fitted values increase, the residuals concentrate around 0, creating a fan-shaped pattern.
A second way to detect heteroscedasticity is the Breusch-Pagan test, whose null hypothesis is that there is no heteroscedasticity. With a p-value < 0.05, we can conclude that heteroscedasticity is present in our model.
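One common mitigation, sketched below under the assumption that the sandwich package is installed (it is not loaded at the top of this notebook), is to keep the coefficient estimates but recompute their standard errors with a heteroscedasticity-consistent (HC) covariance matrix:
library(sandwich)
# Robust (HC1) standard errors for model_back; coeftest() comes from lmtest
coeftest(model_back, vcov = vcovHC(model_back, type = "HC1"))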
4. Multicollinearity
Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(model_back)
#> GRE.Score TOEFL.Score LOR CGPA Research
#> 4.835205 4.017402 1.781769 4.628770 1.451949
In the model, all VIF values are under 5, so the correlation between predictors is weak and multicollinearity is not a concern.
The model that we can use to predict a student's chance of entering a university is model_back. Through the process of feature selection, we kept five predictor variables that are considered important in predicting the acceptance rate of a graduate school applicant: GRE.Score, TOEFL.Score, LOR, CGPA, and Research.
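As a usage sketch, model_back can score a new applicant with predict(); the profile values below are made up for illustration:
# Hypothetical applicant; Research must use the same factor levels as training
new_applicant <- data.frame(
  GRE.Score   = 320L,
  TOEFL.Score = 110L,
  LOR         = 4.0,
  CGPA        = 9.0,
  Research    = factor("1", levels = c("0", "1"))
)
predict(model_back, newdata = new_applicant)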
In terms of the classical assumptions, the model passed only two of the four checks: linearity and (absence of) multicollinearity. However, a linear regression model should pass all four checks to qualify as a "good" model, and model_back fails to fulfill that requirement. Therefore, a linear regression model would not be the best choice for predicting graduate admissions in future use; it is not sufficient to correctly predict the acceptance rate of grad school applicants.
Hopefully, predicting with models other than linear regression will produce a better model than the one we just made.