Graduate Admissions
Introduction
In this article, we build a linear regression model on the graduate admission dataset. The main goals are to understand the relationships between the variables and to predict the chance of admission.
The graduate admission dataset contains several parameters that are considered important when applying for Masters programs. The parameters included are:
GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose and Letter of Recommendation Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
Library
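This section presumably loads the packages used throughout the analysis. The chunk below is a reconstruction from the functions that appear later, not the original code:

library(dplyr)    # data wrangling and the %>% pipe
library(ggplot2)  # residual plots
library(GGally)   # ggcorr() correlation plot
library(plotly)   # interactive 3D scatter plot
library(caret)    # confusionMatrix()
library(lmtest)   # bptest() Breusch-Pagan test
library(car)      # durbinWatsonTest() and vif()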
Load Dataset
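The data is read from a CSV file; the file name below follows the public Kaggle Graduate Admission dataset and is an assumption, so adjust the path to match your copy.

a <- read.csv("Admission_Predict_Ver1.1.csv")  # file name is an assumption
a <- a %>% select(-Serial.No.)                 # drop the unique identifier
str(a)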
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The data has 500 observations (rows) with 8 variables (columns). The Serial.No. variable is dropped as it is a unique identifier. The target variable is Chance of Admit.
Exploratory Data Analysis
Before exploring the data, it is important to inspect the classes of the variables and change them where needed. In this case, the Research and University.Rating variables need to be converted into the character class so they are treated as categorical rather than numeric.
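A minimal sketch of the conversion, using the dplyr pipeline style used elsewhere in this analysis:

a <- a %>%
  mutate(University.Rating = as.character(University.Rating),
         Research = as.character(Research))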
Correlation
To see the relationships between the variables, we can compute the Pearson correlations between the numeric features.
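The correlation matrix is plotted with ggcorr() from the GGally package; the exact call can be recovered from the warning message below.

ggcorr(a, label = T, label_size = 2.9)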
## Warning in ggcorr(a, label = T, label_size = 2.9): data in column(s)
## 'University.Rating', 'Research' are not numeric and were ignored
The plot indicates that the Chance of Admit variable correlates strongly with each of the numeric predictors.
Data Visualization
plot_ly(a, x = ~GRE.Score, y = ~TOEFL.Score, z = ~CGPA, color = ~Chance.of.Admit,
        type = "scatter3d", mode = "markers") %>%
  layout(scene = list(xaxis = list(title = "GRE Score"),
                      yaxis = list(title = "TOEFL Score"),
                      zaxis = list(title = "CGPA")))

From the 3D scatter plot, we can see that the chance of admission is higher when the CGPA, TOEFL, and GRE scores are higher.
Modeling
The data needs to be separated into two datasets: a train dataset and a test dataset. The train dataset is used to fit the linear regression model, while the test dataset is held out for evaluation. The train dataset contains 70% of the data.
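A minimal sketch of the split, assuming a simple random sample; the seed value is an assumption.

set.seed(123)                                 # seed value is an assumption
idx <- sample(nrow(a), size = 0.7 * nrow(a))  # 70% of 500 = 350 rows
a.train <- a[idx, ]
a.test  <- a[-idx, ]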
We fit a linear regression model using Chance of Admit as the target variable and all remaining variables as predictors.
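The full model m regresses Chance.of.Admit on every other column; the call matches the Call shown in the summary output below.

m <- lm(Chance.of.Admit ~ ., data = a.train)
summary(m)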
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = a.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.226543 -0.023759 0.007652 0.034031 0.164322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3233948 0.1213051 -10.910 < 0.0000000000000002 ***
## GRE.Score 0.0024465 0.0005654 4.327 0.0000199 ***
## TOEFL.Score 0.0022961 0.0010320 2.225 0.026749 *
## University.Rating2 -0.0129424 0.0138680 -0.933 0.351354
## University.Rating3 -0.0104983 0.0147933 -0.710 0.478400
## University.Rating4 -0.0134790 0.0175686 -0.767 0.443485
## University.Rating5 0.0037659 0.0202704 0.186 0.852726
## SOP 0.0065106 0.0055111 1.181 0.238290
## LOR 0.0182460 0.0047745 3.822 0.000158 ***
## CGPA 0.1088601 0.0112626 9.666 < 0.0000000000000002 ***
## Research1 0.0269591 0.0079592 3.387 0.000789 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05836 on 339 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8289
## F-statistic: 170.1 on 10 and 339 DF, p-value: < 0.00000000000000022
The summary of model m shows an adjusted R-squared of 0.8289. Two predictors, University.Rating and SOP, have Pr(>|t|) > 0.05, which indicates that they have no significant effect in the model.
For comparison, we apply stepwise regression with the backward elimination method.
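Backward elimination starts from the full model m and, at each step, removes the predictor whose deletion most improves the AIC; the call below reproduces the trace that follows.

step(m, direction = "backward")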
## Start: AIC=-1977.99
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 4 0.01186 1.1664 -1982.4
## - SOP 1 0.00475 1.1593 -1978.5
## <none> 1.1545 -1978.0
## - TOEFL.Score 1 0.01686 1.1714 -1974.9
## - Research 1 0.03907 1.1936 -1968.3
## - LOR 1 0.04974 1.2042 -1965.2
## - GRE.Score 1 0.06376 1.2183 -1961.2
## - CGPA 1 0.31817 1.4727 -1894.8
##
## Step: AIC=-1982.41
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + LOR + CGPA +
## Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00568 1.1721 -1982.7
## <none> 1.1664 -1982.4
## - TOEFL.Score 1 0.01703 1.1834 -1979.3
## - Research 1 0.04164 1.2080 -1972.1
## - LOR 1 0.05207 1.2184 -1969.1
## - GRE.Score 1 0.06607 1.2324 -1965.1
## - CGPA 1 0.32988 1.4962 -1897.2
##
## Step: AIC=-1982.71
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.1721 -1982.7
## - TOEFL.Score 1 0.02192 1.1940 -1978.2
## - Research 1 0.04385 1.2159 -1971.9
## - GRE.Score 1 0.06508 1.2371 -1965.8
## - LOR 1 0.07173 1.2438 -1963.9
## - CGPA 1 0.37421 1.5462 -1887.7
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = a.train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score LOR CGPA Research1
## -1.389682 0.002461 0.002552 0.020523 0.113357 0.028009
m1 <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
           CGPA + Research, data = a.train)
summary(m1)

##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = a.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.231469 -0.024395 0.007353 0.035280 0.163887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3896819 0.1125191 -12.351 < 0.0000000000000002 ***
## GRE.Score 0.0024611 0.0005631 4.370 0.00001645 ***
## TOEFL.Score 0.0025517 0.0010059 2.537 0.011631 *
## LOR 0.0205232 0.0044728 4.588 0.00000627 ***
## CGPA 0.1133571 0.0108165 10.480 < 0.0000000000000002 ***
## Research1 0.0280088 0.0078072 3.588 0.000382 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05837 on 344 degrees of freedom
## Multiple R-squared: 0.8313, Adjusted R-squared: 0.8288
## F-statistic: 339 on 5 and 344 DF, p-value: < 0.00000000000000022
The stepwise regression drops the two variables that have no significant effect on the model. To see whether this affects the fit, we compare the adjusted R-squared values of both models: the first model shows 0.8289, while the second shows a nearly identical 0.8288. This shows that it is safe to remove the non-significant variables, so we use the second model, m1, as the main candidate model.
Evaluation
Model Performance
To see how well the model predicts the target variable, we use the root mean squared error (RMSE).
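A minimal sketch of the computation using a hand-rolled helper (the original may have used a package function instead):

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmse(predict(m1, a.train), a.train$Chance.of.Admit)  # train RMSE
rmse(predict(m1, a.test), a.test$Chance.of.Admit)    # test RMSE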
## [1] 0.05786801
## [1] 0.06423879
As the RMSE values for the train (0.0579) and test (0.0642) datasets are similar, we can assume that the model is not overfitting.
Assumptions
1. Linearity
lin <- data.frame(residual = m1$residuals, fitted = m1$fitted.values)
lin %>% ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot shows no visible pattern in the residuals, which indicates that the linearity assumption holds.
2. Normality Test
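Normality of the residuals is checked with the Shapiro-Wilk test, as the output below confirms.

shapiro.test(m1$residuals)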
##
## Shapiro-Wilk normality test
##
## data: m1$residuals
## W = 0.94155, p-value = 0.0000000001612
With a p-value < 0.05, the null hypothesis is rejected, which means the residuals do not follow a normal distribution.
3. Heteroscedasticity
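The Breusch-Pagan test is available as bptest() in the lmtest package.

bptest(m1)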
##
## studentized Breusch-Pagan test
##
## data: m1
## BP = 17.041, df = 5, p-value = 0.004423
Using the Breusch-Pagan test, the model shows a p-value below 0.05, so it can be concluded that heteroscedasticity is present in our model.
4. Autocorrelation
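The output format below matches durbinWatsonTest() from the car package.

durbinWatsonTest(m1)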
## lag Autocorrelation D-W Statistic p-value
## 1 0.05308529 1.893537 0.332
## Alternative hypothesis: rho != 0
Autocorrelation can be detected using the Durbin-Watson test, whose null hypothesis is that there is no autocorrelation. The result (p-value = 0.332) shows that the null hypothesis is not rejected, meaning our residuals have no autocorrelation.
5. Multicollinearity
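The variance inflation factors are computed with vif() from the car package.

vif(m1)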
## GRE.Score TOEFL.Score LOR CGPA Research
## 4.248963 3.861199 1.815835 4.411289 1.540605
Multicollinearity indicates correlation between the independent variables (predictors). A VIF value that exceeds 5 or 10 signals a problematic amount of collinearity. In this model, all VIF values are under 5, so the correlation between predictors is acceptably weak.
Model Improvement
Model Tuning
library(dplyr)
a2 <- a %>%
  mutate(chance = ifelse(Chance.of.Admit > 0.5, 1, 0)) %>%
  select(-Chance.of.Admit)
a2$chance <- as.factor(a2$chance)

As the linear regression model does not satisfy several of the classical assumptions, we try a different approach: a logistic regression model. We create a new target variable, chance, by applying a cutoff of 0.5 to the previous target, Chance.of.Admit, and drop the original column.
Modeling
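The glm call is confirmed by the summary below, but the split code is an assumption: the null deviance has 399 degrees of freedom (400 training rows) and the test confusion matrix has 100 rows, which implies an 80/20 split here rather than the earlier 70/30 split.

set.seed(123)                                    # seed value is an assumption
idx2 <- sample(nrow(a2), size = 0.8 * nrow(a2))  # 80% of 500 = 400 rows
a2.train <- a2[idx2, ]
a2.test  <- a2[-idx2, ]

m2 <- glm(chance ~ ., family = "binomial", data = a2.train)
summary(m2)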
##
## Call:
## glm(formula = chance ~ ., family = "binomial", data = a2.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.62630 0.00406 0.05386 0.21851 1.64962
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -57.01801 11.83404 -4.818 0.00000145 ***
## GRE.Score 0.04581 0.03908 1.172 0.2410
## TOEFL.Score 0.13289 0.08483 1.567 0.1172
## University.Rating2 -1.02121 0.74887 -1.364 0.1727
## University.Rating3 -0.44938 0.91825 -0.489 0.6246
## University.Rating4 0.20034 1.34131 0.149 0.8813
## University.Rating5 11.26151 1169.46461 0.010 0.9923
## SOP -0.70378 0.38625 -1.822 0.0684 .
## LOR 0.95513 0.39320 2.429 0.0151 *
## CGPA 3.92543 0.99701 3.937 0.00008243 ***
## Research1 -0.30313 0.63527 -0.477 0.6332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 232.65 on 399 degrees of freedom
## Residual deviance: 112.82 on 389 degrees of freedom
## AIC: 134.82
##
## Number of Fisher Scoring iterations: 18
Performance
library(caret)
a2.train$pred.chance <- predict(m2, a2.train, type = "response")
a2.train$pred.label <- ifelse(a2.train$pred.chance < 0.5, "0", "1") %>% as.factor()
confusionMatrix(a2.train$pred.label, a2.train$chance, positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 7
## 1 18 359
##
## Accuracy : 0.9375
## 95% CI : (0.9091, 0.9591)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : 0.0591
##
## Kappa : 0.5291
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9809
## Specificity : 0.4706
## Pos Pred Value : 0.9523
## Neg Pred Value : 0.6957
## Prevalence : 0.9150
## Detection Rate : 0.8975
## Detection Prevalence : 0.9425
## Balanced Accuracy : 0.7257
##
## 'Positive' Class : 1
##
library(caret)
a2.test$pred.chance <- predict(m2, a2.test, type = "response")
a2.test$pred.label <- ifelse(a2.test$pred.chance < 0.5, "0", "1") %>% as.factor()
confusionMatrix(a2.test$pred.label, a2.test$chance, positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 4 94
##
## Accuracy : 0.95
## 95% CI : (0.8872, 0.9836)
## No Information Rate : 0.95
## P-Value [Acc > NIR] : 0.6160
##
## Kappa : 0.2647
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9895
## Specificity : 0.2000
## Pos Pred Value : 0.9592
## Neg Pred Value : 0.5000
## Prevalence : 0.9500
## Detection Rate : 0.9400
## Detection Prevalence : 0.9800
## Balanced Accuracy : 0.5947
##
## 'Positive' Class : 1
##
The model has an accuracy of 93.75% on the train dataset and 95% on the test dataset. Note, however, that the classes are heavily imbalanced (the no-information rate is 0.915 on train and 0.95 on test), so the accuracy is driven largely by the majority class and the specificity for the rejected class is low.
Conclusion
To predict a student's chance of admission to the university, the useful variables are GRE Score, TOEFL Score, CGPA, Research, and Letter of Recommendation Strength. The adjusted R-squared of the model is fairly high at 82.88%. The RMSE is 0.058 on the training data and 0.064 on the test data, which shows that the model fits without overfitting. Unfortunately, the model does not satisfy all of the classical assumptions: the residuals are not normally distributed and heteroscedasticity is present. When we tried a logistic regression model with a cutoff of 0.5, the training data showed an accuracy of 93.75% and the test data an accuracy of 95%. The logistic regression model is therefore the better model for the graduate admission data.