Graduate Admission
Introduction
We will learn to use a linear regression model with the Graduate Admission dataset, which was collected from an Indian perspective. We want to understand the relationships among the variables, especially between the Chance of Admit and the other variables. We also want to predict the Chance of Admit of new applicants based on the historical data. You can download the data here.
Business Objectives
This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.
Data Preparation
Load the Required Package
Load the Dataset
Check data type
## Observations: 500
## Variables: 9
## $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302,...
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3,...
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0,...
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5,...
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7....
## $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,...
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0....
The data has 500 rows and 9 columns. Serial.No. is a unique identifier, so we can ignore it. Our target variable is Chance.of.Admit, and we will use the other variables as predictors.
Before going further, we need to make sure the data is clean and useful, so we will remove the unused variable:
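The chunk that drops the identifier is not shown in the rendered output, so the following is only a minimal base R sketch of that step. The object name `admission` and the toy values are assumptions for illustration; only the column names come from the output above.

```r
# Toy data frame reusing a few column names from the output above
# (hypothetical object name; values copied from the first three rows).
admission <- data.frame(
  Serial.No. = 1:3,
  GRE.Score = c(337, 324, 316),
  CGPA = c(9.65, 8.87, 8.00),
  Chance.of.Admit = c(0.92, 0.76, 0.72)
)

# Drop the identifier column: it carries no predictive information.
admission <- admission[, setdiff(names(admission), "Serial.No."), drop = FALSE]

# Count missing values per column as a basic cleanliness check.
n_missing <- colSums(is.na(admission))
print(names(admission))
print(n_missing)
```

Removing the identifier before modeling matters here because the later formula `Chance.of.Admit ~ .` would otherwise treat it as a predictor.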
Exploratory Data Analysis
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.
Explore Data Variables
The dataset contains several parameters that are considered important when applying to Master's programs. The parameters included are:
- GRE Scores (out of 340)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290.0 308.0 317.0 316.5 325.0 340.0
- TOEFL Scores (out of 120)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 92.0 103.0 107.0 107.2 112.0 120.0
- University Rating (out of 5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.114 4.000 5.000
- Statement of Purpose and Letter of Recommendation Strength (out of 5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.500 3.500 3.374 4.000 5.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.500 3.484 4.000 5.000
- Undergraduate GPA (out of 10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.800 8.127 8.560 8.576 9.040 9.920
- Research Experience (either 0 or 1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 0.56 1.00 1.00
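The six-number tables above have the shape of base R's `summary()` output. As a small sketch, assuming that is how they were produced, here is the same call applied to the first nine GRE scores shown in the data preview (so the numbers differ from the full-data table):

```r
# First nine GRE scores from the data preview above (a subset, not the full column).
gre_preview <- c(337, 324, 316, 322, 314, 330, 321, 308, 302)

# summary() returns Min., 1st Qu., Median, Mean, 3rd Qu. and Max.
gre_summary <- summary(gre_preview)
print(gre_summary)
```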
Check Data Correlation
Find the Pearson correlation between the variables:
The graphic shows that every variable has a strong correlation with the Chance.of.Admit variable.
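The plotting code is not shown, so the sketch below only illustrates how Pearson correlations with the target could be computed in base R. The data are synthetic and the coefficients used to generate them are made up; only the column names are borrowed from the dataset.

```r
set.seed(1)
n <- 100

# Synthetic stand-ins for three of the columns (values are invented).
cgpa <- rnorm(n, mean = 8.5, sd = 0.6)
gre <- 316 + 15 * (cgpa - 8.5) + rnorm(n, sd = 6)
chance <- 0.72 + 0.2 * (cgpa - 8.5) + rnorm(n, sd = 0.05)
toy <- data.frame(GRE.Score = gre, CGPA = cgpa, Chance.of.Admit = chance)

# Pearson correlation matrix of all numeric columns.
cor_matrix <- cor(toy, method = "pearson")

# Correlation of every column with the target, strongest first.
cor_target <- sort(cor_matrix[, "Chance.of.Admit"], decreasing = TRUE)
print(round(cor_target, 2))
```

Reading the target's column of the matrix directly gives the same information as the graphic, in a form that can be sorted or filtered.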
Modeling
Cross Validation
Before building the model, we need to split the data into a training dataset and a test dataset. We will use the training dataset to fit the linear regression model. The test dataset serves as a comparison to check whether the model overfits and fails to predict new data that it has not seen during the training phase. We will use 80% of the data as the training data and the remaining 20% as the testing data.
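A single 80/20 hold-out split like the one described can be sketched in base R as follows. The seed, the object name `admission`, and the placeholder columns are assumptions; the names `data_train` and `data_test` match those used later in the document.

```r
set.seed(123)  # hypothetical seed, used only to make the split reproducible

# Placeholder data with 500 rows, the same size as the real dataset.
admission <- data.frame(id = 1:500, x = rnorm(500))

# Sample 80% of the row indices for training; the rest become the test set.
train_idx <- sample(nrow(admission), size = round(0.8 * nrow(admission)))
data_train <- admission[train_idx, ]
data_test <- admission[-train_idx, ]

print(c(train = nrow(data_train), test = nrow(data_test)))
```

With 500 rows this gives 400 training rows, which is consistent with the 392 residual degrees of freedom reported for the eight-coefficient model below.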
Modeling
Based on the Pearson correlation, every variable has a strong correlation with the Chance.of.Admit variable, so we will build a model with all variables from the training data.
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.243304 -0.023909 0.007254 0.032844 0.153875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4067512 0.1080347 -13.021 < 0.0000000000000002 ***
## GRE.Score 0.0023085 0.0005272 4.379 0.0000153 ***
## TOEFL.Score 0.0030805 0.0009126 3.376 0.00081 ***
## University.Rating 0.0069311 0.0039922 1.736 0.08332 .
## SOP -0.0007643 0.0048477 -0.158 0.87480
## LOR 0.0116529 0.0043049 2.707 0.00709 **
## CGPA 0.1164724 0.0102799 11.330 < 0.0000000000000002 ***
## Research 0.0199933 0.0069428 2.880 0.00420 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05654 on 392 degrees of freedom
## Multiple R-squared: 0.8336, Adjusted R-squared: 0.8306
## F-statistic: 280.5 on 7 and 392 DF, p-value: < 0.00000000000000022
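To make the coefficient table concrete, the sketch below plugs the first applicant from the data preview into the fitted equation by hand. The coefficients are copied from the summary above; whether that applicant ended up in the training or test split is unknown, so treat this purely as an illustration of how the estimates combine.

```r
# Coefficient estimates copied from the model summary above.
coefs <- c(
  Intercept = -1.4067512,
  GRE.Score = 0.0023085,
  TOEFL.Score = 0.0030805,
  University.Rating = 0.0069311,
  SOP = -0.0007643,
  LOR = 0.0116529,
  CGPA = 0.1164724,
  Research = 0.0199933
)

# First applicant from the data preview (Intercept = 1 multiplies the constant term).
applicant <- c(
  Intercept = 1,
  GRE.Score = 337,
  TOEFL.Score = 118,
  University.Rating = 4,
  SOP = 4.5,
  LOR = 4.5,
  CGPA = 9.65,
  Research = 1
)

# A linear model prediction is the sum of coefficient * value over all terms.
predicted_chance <- sum(coefs * applicant[names(coefs)])
print(round(predicted_chance, 3))
```

The result is roughly 0.955, compared with a recorded Chance.of.Admit of 0.92 for that applicant. Most of the prediction comes from the CGPA term, which matches CGPA having the largest t value in the table.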
Feature Selection using Stepwise Regression
Now we will try to eliminate variables to get a better model using stepwise regression.
- Backward method
## Start: AIC=-2290.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00008 1.2530 -2292.4
## <none> 1.2530 -2290.4
## - University.Rating 1 0.00963 1.2626 -2289.3
## - LOR 1 0.02342 1.2764 -2285.0
## - Research 1 0.02651 1.2795 -2284.0
## - TOEFL.Score 1 0.03642 1.2894 -2280.9
## - GRE.Score 1 0.06130 1.3142 -2273.3
## - CGPA 1 0.41031 1.6633 -2179.1
##
## Step: AIC=-2292.36
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.2530 -2292.4
## - University.Rating 1 0.01057 1.2636 -2291.0
## - LOR 1 0.02475 1.2778 -2286.5
## - Research 1 0.02655 1.2796 -2286.0
## - TOEFL.Score 1 0.03642 1.2894 -2282.9
## - GRE.Score 1 0.06175 1.3148 -2275.1
## - CGPA 1 0.42549 1.6785 -2177.4
- Forward method
lm_admission_none <- lm(data = data_train, formula = Chance.of.Admit ~ 1)
lm_admission_forward <- step(lm_admission_none, scope = list(lower = lm_admission_none, upper = lm_admission_all), direction = "forward")
## Start: AIC=-1587.12
## Chance.of.Admit ~ 1
##
## Df Sum of Sq RSS AIC
## + CGPA 1 5.9358 1.5926 -2206.4
## + GRE.Score 1 5.1681 2.3602 -2049.1
## + TOEFL.Score 1 4.8172 2.7111 -1993.6
## + University.Rating 1 3.4607 4.0677 -1831.4
## + SOP 1 3.2633 4.2650 -1812.4
## + LOR 1 2.7773 4.7511 -1769.2
## + Research 1 2.2163 5.3120 -1724.6
## <none> 7.5283 -1587.1
##
## Step: AIC=-2206.45
## Chance.of.Admit ~ CGPA
##
## Df Sum of Sq RSS AIC
## + GRE.Score 1 0.214801 1.3778 -2262.4
## + TOEFL.Score 1 0.161008 1.4316 -2247.1
## + Research 1 0.092582 1.5000 -2228.4
## + University.Rating 1 0.061012 1.5315 -2220.1
## + LOR 1 0.049521 1.5430 -2217.1
## + SOP 1 0.023098 1.5695 -2210.3
## <none> 1.5926 -2206.4
##
## Step: AIC=-2262.4
## Chance.of.Admit ~ CGPA + GRE.Score
##
## Df Sum of Sq RSS AIC
## + LOR 1 0.045959 1.3318 -2274.0
## + TOEFL.Score 1 0.042554 1.3352 -2272.9
## + University.Rating 1 0.036724 1.3410 -2271.2
## + Research 1 0.031664 1.3461 -2269.7
## + SOP 1 0.018674 1.3591 -2265.9
## <none> 1.3778 -2262.4
##
## Step: AIC=-2273.97
## Chance.of.Admit ~ CGPA + GRE.Score + LOR
##
## Df Sum of Sq RSS AIC
## + TOEFL.Score 1 0.038574 1.2932 -2283.7
## + Research 1 0.026134 1.3057 -2279.9
## + University.Rating 1 0.019409 1.3124 -2277.8
## <none> 1.3318 -2274.0
## + SOP 1 0.003728 1.3281 -2273.1
##
## Step: AIC=-2283.73
## Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score
##
## Df Sum of Sq RSS AIC
## + Research 1 0.0296239 1.2636 -2291.0
## + University.Rating 1 0.0136412 1.2796 -2286.0
## <none> 1.2932 -2283.7
## + SOP 1 0.0012437 1.2920 -2282.1
##
## Step: AIC=-2291
## Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research
##
## Df Sum of Sq RSS AIC
## + University.Rating 1 0.0105685 1.2530 -2292.4
## <none> 1.2636 -2291.0
## + SOP 1 0.0010136 1.2626 -2289.3
##
## Step: AIC=-2292.36
## Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research +
## University.Rating
##
## Df Sum of Sq RSS AIC
## <none> 1.253 -2292.4
## + SOP 1 0.000079451 1.253 -2290.4
Both the backward and forward methods arrive at the same model, so we will use just one of them (lm_admission_back).
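The backward call is not shown in the output, so here is a minimal sketch of both directions on synthetic data, assuming the backward model was built with `step(..., direction = "backward")`. The data-generating coefficients are invented, and SOP is pure noise by construction; the model object names follow the ones used in this document.

```r
set.seed(7)
n <- 300

# Synthetic training data: two informative predictors and one noise column.
data_train <- data.frame(
  CGPA = rnorm(n, mean = 8.5, sd = 0.6),
  GRE.Score = rnorm(n, mean = 316, sd = 11),
  SOP = rnorm(n, mean = 3.4, sd = 1)  # unrelated to the response here
)
data_train$Chance.of.Admit <- -1.2 + 0.12 * data_train$CGPA +
  0.003 * data_train$GRE.Score + rnorm(n, sd = 0.05)

lm_admission_all <- lm(Chance.of.Admit ~ ., data = data_train)
lm_admission_none <- lm(Chance.of.Admit ~ 1, data = data_train)

# Backward: start from the full model and drop terms while AIC improves.
lm_admission_back <- step(lm_admission_all, direction = "backward", trace = 0)

# Forward: start from the intercept-only model and add terms while AIC improves.
lm_admission_forward <- step(
  lm_admission_none,
  scope = list(lower = formula(lm_admission_none), upper = formula(lm_admission_all)),
  direction = "forward",
  trace = 0
)

# Compare the selected predictors, ignoring the order in which they were added.
terms_back <- attr(terms(lm_admission_back), "term.labels")
terms_forward <- attr(terms(lm_admission_forward), "term.labels")
print(setequal(terms_back, terms_forward))
```

Setting `trace = 0` suppresses the step-by-step tables; leaving it at the default prints output like the listings above.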
Evaluation
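We now have three models: the full model (lm_admission_all), the intercept-only model (lm_admission_none), and the stepwise model (lm_admission_back). We will check how well each one predicts the target variable using MSE (the first three values below) and adjusted R-squared (the last three values).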
## [1] 0.0031
## [1] 0.0188
## [1] 0.0031
## [1] 0.8306
## [1] 0
## [1] 0.831
From the results above, the best model is lm_admission_back: it has the highest adjusted R-squared (0.831) and, to four decimal places, the same MSE (Mean Squared Error) of 0.0031 as the full model while using one fewer predictor.
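The evaluation code is not shown, but the reported figures can be reproduced from numbers already printed in this document. The sketch below assumes MSE is the residual sum of squares divided by the number of training rows, and uses the standard adjusted R-squared formula; the RSS values are the rounded ones from the stepwise listings.

```r
# Residual sums of squares read from the stepwise output above (rounded).
rss <- c(lm_admission_all = 1.2530, lm_admission_none = 7.5283, lm_admission_back = 1.2530)

# 392 residual degrees of freedom + 8 estimated coefficients = 400 training rows.
n_train <- 400

# Training MSE = RSS / n.
mse <- rss / n_train
print(round(mse, 4))

# Adjusted R-squared of the full model from its multiple R-squared:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = 7 predictors.
r2_all <- 0.8336
p_all <- 7
adj_r2_all <- 1 - (1 - r2_all) * (n_train - 1) / (n_train - p_all - 1)
print(round(adj_r2_all, 4))
```

Because adjusted R-squared penalizes each extra predictor, dropping the uninformative SOP term raises it slightly even though the RSS is essentially unchanged.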
Check Assumption
- Linearity
resact <- data.frame(residual = lm_admission_back$residuals, fitted = lm_admission_back$fitted.values)
resact %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_hline(aes(yintercept = 0)) +
  geom_smooth() +
  theme(panel.grid = element_blank(), panel.background = element_blank())
There is little to no discernible pattern in the residual plot, so we can conclude that the linearity assumption holds.
- Normality of Residual
##
## Shapiro-Wilk normality test
##
## data: lm_admission_back$residuals
## W = 0.93567, p-value = 0.000000000003987
With p-value < 0.05, we can conclude that our residuals are not normally distributed.
- Homoscedasticity
##
## studentized Breusch-Pagan test
##
## data: lm_admission_back
## BP = 22.58, df = 6, p-value = 0.0009502
resact %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())
With p-value < 0.05, we can conclude that heteroscedasticity is present.
- Little to no multicollinearity
## GRE.Score TOEFL.Score University.Rating LOR
## 4.400360 3.698528 2.133388 1.727872
## CGPA Research
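## 4.483945 1.484216
All VIF values are below 5, a common rule-of-thumb cutoff, so there is little to no multicollinearity among the predictors.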
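The calls behind the normality, homoscedasticity, and multicollinearity checks are not shown. The sketch below reproduces the three diagnostics in base R on a synthetic model, under stated assumptions: `shapiro.test()` for normality, the studentized (Koenker) Breusch-Pagan statistic computed by hand as n times the R-squared of an auxiliary regression, and VIF computed from its definition. The variables `x1`, `x2`, and `y` are hypothetical, and the noise is deliberately made to grow with `x1`.

```r
set.seed(3)
n <- 400

# Hypothetical data: x1 and x2 are correlated, and the noise grows with x1.
x1 <- runif(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.2)
y <- 0.5 + 0.3 * x1 + 0.1 * x2 + rnorm(n, sd = 0.02 + 0.1 * x1)
toy <- data.frame(y = y, x1 = x1, x2 = x2)
fit <- lm(y ~ x1 + x2, data = toy)

# Normality of residuals: Shapiro-Wilk test (null hypothesis: residuals are normal).
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Homoscedasticity: regress the squared residuals on the predictors;
# n * R^2 is compared with a chi-squared distribution (df = number of predictors).
toy$res2 <- residuals(fit)^2
aux <- lm(res2 ~ x1 + x2, data = toy)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Multicollinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the remaining predictors.
predictors <- c("x1", "x2")
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

print(c(shapiro_p = shapiro_p, bp_stat = bp_stat, bp_p = bp_p))
print(round(vif_manual, 2))
```

In practice, packaged versions of the last two checks are typically used (for example `bptest()` in lmtest and `vif()` in car); the hand-rolled versions are shown only to make clear what each statistic measures.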
Using Data Test
lm_admission_test <- lm(data = data_test, formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
    SOP + LOR + CGPA + Research)
## [1] 0.0045
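The code that produced the test MSE is not shown. One common way to measure out-of-sample error, sketched here on synthetic data, is to keep the model fitted on the training rows and call `predict()` on the test rows, rather than fitting a new model on the test set. The data, seed, and single-predictor model are assumptions for illustration.

```r
set.seed(99)
n <- 500

# Synthetic data with one predictor (invented coefficients).
admission <- data.frame(CGPA = rnorm(n, mean = 8.5, sd = 0.6))
admission$Chance.of.Admit <- -0.3 + 0.12 * admission$CGPA + rnorm(n, sd = 0.05)

# 80/20 hold-out split.
train_idx <- sample(n, size = 400)
data_train <- admission[train_idx, ]
data_test <- admission[-train_idx, ]

# Fit on the training rows only.
model <- lm(Chance.of.Admit ~ CGPA, data = data_train)

# Predict the unseen test rows with the training-fitted model.
pred_test <- predict(model, newdata = data_test)

mse_train <- mean(residuals(model)^2)
mse_test <- mean((data_test$Chance.of.Admit - pred_test)^2)
print(round(c(train = mse_train, test = mse_test), 4))
```

A test MSE close to the training MSE suggests the model is not overfitting; a much larger test MSE would suggest it is.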
Conclusion
The variables that are useful for explaining the variance in Chance of Admit are GRE Score, TOEFL Score, University Rating, Letter of Recommendation Strength, Undergraduate GPA, and Research Experience. The final model passes the linearity and multicollinearity checks, but the Shapiro-Wilk and Breusch-Pagan tests show that its residuals are not normally distributed and are heteroscedastic, so it does not satisfy all of the classical assumptions. The adjusted R-squared is high: the model explains about 83.1% of the variance in the Chance of Admit. The accuracy of the model in predicting the Chance of Admit is measured with MSE, which is 0.0031 on the training data and 0.0045 on the testing data.
We have now learned how to build a linear regression model and what to watch for when building it.