We will build a linear regression model using a graduate admission dataset. We want to understand the relationships among the variables, especially between the chance of admission and the other variables. You can download the data here
library(tidyverse)
library(caret)
library(plotly)
library(car)
library(scales)
library(lmtest)
library(GGally)
admission <- read.csv("Admission_Predict.csv")
str(admission)
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
- GRE Score: Graduate Record Examination score (out of 340)
- TOEFL Score: Test of English as a Foreign Language score (out of 120)
- University Rating: The rating of the university; the higher the better (out of 5)
- SOP: Statement of Purpose strength; an essay or other written statement by an applicant, often a prospective student applying to a college, university, or graduate school; the higher the better (out of 5)
- LOR: Letter of Recommendation strength; a letter of reference that vouches for a specific person based on their characteristics and qualifications; the higher the better (out of 5)
- CGPA: Undergraduate GPA based on all courses completed for the bachelor's degree (out of 10)
- Research: Indicates whether the candidate has research experience (either 0 or 1)
- Chance of Admit: The candidate's chance of being accepted into a Masters program (ranging from 0 to 1)

We won't be using the Serial.No. variable for our linear regression model, so we will remove it from the dataframe.
admission_clean <- admission %>%
  select(-Serial.No.)

dim(admission_clean)
## [1] 400 8
The dataset consists of 400 rows and 8 columns.
anyNA(admission_clean)
## [1] FALSE
Great! Our data has no missing values.
ggcorr(admission_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
From the correlation plot above, we can see that almost all of the variables have a strong correlation with Chance.of.Admit, with CGPA having the highest positive correlation.
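As a quick numeric cross-check of the plot, we can also compute the correlation of each variable with the target directly (all columns are numeric at this point, so base R's cor() works on the whole dataframe):
# Correlation of every variable with Chance.of.Admit
cor(admission_clean)[, "Chance.of.Admit"]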
We will build a linear regression model with CGPA as the predictor variable, since it has the highest positive correlation with Chance.of.Admit. We will name this model model_cgpa.
model_cgpa <- lm(formula = Chance.of.Admit~CGPA, data = admission_clean)
summary(model_cgpa)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.274575 -0.030084 0.009443 0.041954 0.180734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.07151 0.05034 -21.29 <2e-16 ***
## CGPA 0.20885 0.00584 35.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.762
## F-statistic: 1279 on 1 and 398 DF, p-value: < 2.2e-16
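From the coefficients above, the fitted line is Chance.of.Admit = -1.07151 + 0.20885 * CGPA, so each additional CGPA point raises the predicted chance by about 0.21. As a hypothetical worked example (the CGPA of 9.0 is made up for illustration):
# Predicted chance for a hypothetical CGPA of 9.0:
# -1.07151 + 0.20885 * 9.0 is about 0.808
predict(model_cgpa, data.frame(CGPA = 9))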
Next, we will build a linear regression model using all predictor variables. We will name this model model_all.
model_all <- lm(formula = Chance.of.Admit~., data = admission_clean)
summary(model_all)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26259 -0.02103 0.01005 0.03628 0.15928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2594325 0.1247307 -10.097 < 2e-16 ***
## GRE.Score 0.0017374 0.0005979 2.906 0.00387 **
## TOEFL.Score 0.0029196 0.0010895 2.680 0.00768 **
## University.Rating 0.0057167 0.0047704 1.198 0.23150
## SOP -0.0033052 0.0055616 -0.594 0.55267
## LOR 0.0223531 0.0055415 4.034 6.6e-05 ***
## CGPA 0.1189395 0.0122194 9.734 < 2e-16 ***
## Research 0.0245251 0.0079598 3.081 0.00221 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.8
## F-statistic: 228.9 on 7 and 392 DF, p-value: < 2.2e-16
We will use stepwise selection in the backward direction to determine which set of predictor variables gives the lowest AIC (Akaike Information Criterion).
step(model_all, direction = "backward")
## Start: AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00144 1.5962 -2195.5
## - University.Rating 1 0.00584 1.6006 -2194.4
## <none> 1.5948 -2193.9
## - TOEFL.Score 1 0.02921 1.6240 -2188.6
## - GRE.Score 1 0.03435 1.6291 -2187.4
## - Research 1 0.03862 1.6334 -2186.3
## - LOR 1 0.06620 1.6609 -2179.6
## - CGPA 1 0.38544 1.9802 -2109.3
##
## Step: AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 1 0.00464 1.6008 -2196.4
## <none> 1.5962 -2195.5
## - TOEFL.Score 1 0.02806 1.6242 -2190.6
## - GRE.Score 1 0.03565 1.6318 -2188.7
## - Research 1 0.03769 1.6339 -2188.2
## - LOR 1 0.06983 1.6660 -2180.4
## - CGPA 1 0.38660 1.9828 -2110.8
##
## Step: AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.6008 -2196.4
## - TOEFL.Score 1 0.03292 1.6338 -2190.2
## - GRE.Score 1 0.03638 1.6372 -2189.4
## - Research 1 0.03912 1.6400 -2188.7
## - LOR 1 0.09133 1.6922 -2176.2
## - CGPA 1 0.43201 2.0328 -2102.8
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = admission_clean)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score LOR CGPA Research
## -1.298464 0.001782 0.003032 0.022776 0.121004 0.024577
From the results above, we can conclude that the model using GRE.Score, TOEFL.Score, LOR, CGPA, and Research as predictors gives the lowest AIC. We will build a model from this result and name it model_backward.
model_backward <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
CGPA + Research, data = admission_clean)
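As a side note, step() returns the final fitted model, so we could also have captured model_backward directly instead of retyping the formula (trace = 0 suppresses the step-by-step log):
# Equivalent shortcut: capture the model chosen by backward elimination
model_backward <- step(model_all, direction = "backward", trace = 0)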
summary(model_backward)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = admission_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.263542 -0.023297 0.009879 0.038078 0.159897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2984636 0.1172905 -11.070 < 2e-16 ***
## GRE.Score 0.0017820 0.0005955 2.992 0.00294 **
## TOEFL.Score 0.0030320 0.0010651 2.847 0.00465 **
## LOR 0.0227762 0.0048039 4.741 2.97e-06 ***
## CGPA 0.1210042 0.0117349 10.312 < 2e-16 ***
## Research 0.0245769 0.0079203 3.103 0.00205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared: 0.8027, Adjusted R-squared: 0.8002
## F-statistic: 320.6 on 5 and 394 DF, p-value: < 2.2e-16
We will create a new column called prediction to predict Chance.of.Admit based on model_cgpa.
admission_clean$prediction <- predict(model_cgpa, admission_clean)
We will create a new column called prediction2 to predict Chance.of.Admit based on model_all.
admission_clean$prediction2 <- predict(model_all, admission_clean)
We will create a new column called prediction3 to predict Chance.of.Admit based on model_backward.
admission_clean$prediction3 <- predict(model_backward, admission_clean)
head(admission_clean)
## GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research Chance.of.Admit
## 1 337 118 4 4.5 4.5 9.65 1 0.92
## 2 324 107 4 4.0 4.5 8.87 1 0.76
## 3 316 104 3 3.0 3.5 8.00 1 0.72
## 4 322 110 3 3.5 2.5 8.67 1 0.80
## 5 314 103 2 2.0 3.0 8.21 0 0.65
## 6 330 115 5 4.5 3.0 9.34 1 0.90
## prediction prediction2 prediction3
## 1 0.9438641 0.9514586 0.9546053
## 2 0.7809633 0.8056367 0.8037043
## 3 0.5992662 0.6547367 0.6523025
## 4 0.7391938 0.7383624 0.7394830
## 5 0.6431241 0.6352064 0.6351524
## 6 0.8791215 0.8658537 0.8613597
Root mean squared error (RMSE) is the square root of the mean of the squared errors. RMSE is very commonly used and is considered an excellent general-purpose error metric for numerical predictions. We will compute it for each model with the RMSE() function from caret.
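Formally, for observed values \(y_i\) and predicted values \(\hat{y}_i\):
\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \]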
RMSE(pred = admission_clean$prediction, obs = admission_clean$Chance.of.Admit)
## [1] 0.0693927
RMSE(pred = admission_clean$prediction2, obs = admission_clean$Chance.of.Admit)
## [1] 0.06314185
RMSE(pred = admission_clean$prediction3, obs = admission_clean$Chance.of.Admit)
## [1] 0.06326207
From the RMSE results above, model_all has the lowest RMSE; however, it uses all of the predictor variables, while model_backward achieves a nearly identical RMSE with fewer predictors. We will therefore use model_backward.
Linear regression has several assumptions that need to be fulfilled so that the interpretation obtained is not biased. These assumptions only need to be fulfilled if the purpose of the linear regression model is interpretation, i.e., to see the effect of each predictor on the target variable. If you only want to use linear regression to make predictions, the model assumptions do not have to be met.
Linearity means that the target variable has a linear, straight-line relationship with its predictors, and that the effects of the predictors (the coefficients) are additive. If linearity is not met, all of the coefficient values we obtain are invalid, because the model assumes the underlying pattern is linear.
We can use a residual plot to check linearity. If there is a pattern in the residual plot, the model does not meet the linearity assumption.
linearity <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)
linearity %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())
There is a pattern in the residuals, which means that our model may not be linear enough.
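The same diagnostic is also available through base R's plot method for lm objects, where which = 1 selects the residuals-vs-fitted panel:
# Residuals vs. fitted values, the first of lm's built-in diagnostic plots
plot(model_backward, which = 1)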
Another assumption of the linear regression model is that the residuals follow a normal distribution, meaning most of the residuals are gathered around 0. We can check the normality of the residuals using the Shapiro-Wilk normality test. The hypotheses for the normality of the residuals are:
\[ H_0: the\ residuals\ follow\ a\ normal\ distribution \\ H_1: the\ residuals\ do\ not\ follow\ a\ normal\ distribution \]
shapiro.test(model_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.92193, p-value = 1.443e-13
With a p-value < 0.05, the null hypothesis is rejected: the residuals do not follow a normal distribution.
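Besides the formal test, we can inspect the residual distribution visually, for example with a histogram and a normal Q-Q plot (a quick base R sketch):
# Histogram of residuals: ideally bell-shaped and centered at 0
hist(model_backward$residuals, breaks = 30)
# Q-Q plot: under normality the points should hug the reference line
qqnorm(model_backward$residuals)
qqline(model_backward$residuals)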
Heteroscedasticity refers to situations where the variance of the residuals is unequal over the range of measured values; in a regression analysis it shows up as an unequal scatter of the residuals (also known as the error term). We can check for heteroscedasticity in the model using the Breusch-Pagan test. The hypotheses for the heteroscedasticity test are:
\[ H_0: The\ residuals\ are\ distributed\ with\ equal\ variance\ (Homoscedasticity\ is\ present)\\ H_1: The\ residuals\ are\ not\ distributed\ with\ equal\ variance\ (Heteroscedasticity\ is\ present) \]
bptest(model_backward)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 22.428, df = 5, p-value = 0.0004341
With a p-value < 0.05, the null hypothesis is rejected: the residuals are not distributed with equal variance, meaning that heteroscedasticity is present.
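As a visual counterpart to the Breusch-Pagan test, the scale-location plot (panel 3 of base R's diagnostic plots for lm objects) should show a roughly horizontal band of points when the residual variance is constant:
# Scale-location plot: an upward or downward trend suggests heteroscedasticity
plot(model_backward, which = 3)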
Multicollinearity means that there are strong correlations between the predictor variables. We can check whether multicollinearity is present by measuring the Variance Inflation Factor (VIF) of each predictor. If a VIF value is greater than 10, multicollinearity is present and we should remove one of the variables with VIF > 10.
vif(model_backward)
## GRE.Score TOEFL.Score LOR CGPA Research
## 4.585053 4.104255 1.829491 4.808767 1.530007
Based on the results above, we can conclude that there is no multicollinearity in our data.
We can conclude that GRE.Score, TOEFL.Score, LOR, CGPA, and Research are the variables that best describe the variance in Chance.of.Admit, with RMSE = 0.063 and R-squared = 80.2%. Note, however, that the linearity, normality, and homoscedasticity assumptions were not satisfied, so the coefficient interpretations should be treated with caution.
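Finally, model_backward can be used to score a new applicant with predict(). The applicant values below are hypothetical, chosen only to illustrate the call:
# Hypothetical applicant profile (all values are made up for illustration)
new_applicant <- data.frame(
  GRE.Score = 320,
  TOEFL.Score = 110,
  LOR = 4,
  CGPA = 9,
  Research = 1
)
predict(model_backward, newdata = new_applicant)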