This dataset is created for prediction of Graduate Admissions from an Indian perspective and contains several parameters which are considered important during the application for Masters Programs.
The business case for this analysis is to predict whether a person has a chance to admit into a university with linear regression model.
Library used in this analysis.
library(tidyverse)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)Read the data
adm <- read.csv("Admission.csv")Check data structure
glimpse(adm)## Rows: 500
## Columns: 9
## $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...
We need change Research data types into factor but we need to change 0 to “no” and 1 to “yes”. After that we can just delete Serial.No. column.
adm <- adm %>%
select(-c(Serial.No.)) %>%
mutate(Research = ifelse(Research == 0, "no", "yes"),
Research = as.factor(Research))glimpse(adm)## Rows: 500
## Columns: 8
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research <fct> yes, yes, yes, yes, no, yes, yes, no, no, no, yes...
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...
Okay, now the data already changed into what we want. But we also need to check if there is any NA in the data.
colSums(is.na(adm))## GRE.Score TOEFL.Score University.Rating SOP
## 0 0 0 0
## LOR CGPA Research Chance.of.Admit
## 0 0 0 0
Now, we know that the data don’t have NA and we can move to explore the data.
Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.
Find the Pearson correlation between features.
ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)## Warning in ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'Research' are not numeric and were ignored
Based on the result, all of the predictors have high correlation with the target. But this test only for numerical predictors, lets check the factor predictor.
adm %>%
group_by(Research) %>%
summarise(min_Chance.of.Admit = min(Chance.of.Admit), # Minimum
max_Chance.of.Admit = max(Chance.of.Admit), # Maximum
mean_Chance.of.Admit = mean(Chance.of.Admit), # Mean
)Based on the result, we know that if there is weak correlation between Research and Chance.of.Admit.
Before we make the model, we need to split the data into train dataset and test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparation and see if the model get overfit and can not predict new data that hasn’t been seen during training phase. We will 75% of the data as the training data and the rest of it as the testing data.
set.seed(123)
row_data <- nrow(adm)
index <- sample(row_data, row_data*0.75)
data_train <- adm[ index, ]
data_test <- adm[ -index, ]Now we will try to model the linear regression using Chance.of.Admit as the target variable.
set.seed(123)
model_adm <- lm(Chance.of.Admit ~ ., data_train)
model_step <- step(model_adm, direction = "both")## Start: AIC=-2109.52
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00041 1.2959 -2111.4
## - University.Rating 1 0.00431 1.2998 -2110.3
## <none> 1.2955 -2109.5
## - TOEFL.Score 1 0.02333 1.3188 -2104.8
## - Research 1 0.02872 1.3242 -2103.3
## - LOR 1 0.03485 1.3303 -2101.6
## - GRE.Score 1 0.05470 1.3502 -2096.0
## - CGPA 1 0.42853 1.7240 -2004.3
##
## Step: AIC=-2111.4
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 1 0.00622 1.3021 -2111.6
## <none> 1.2959 -2111.4
## + SOP 1 0.00041 1.2955 -2109.5
## - TOEFL.Score 1 0.02408 1.3200 -2106.5
## - Research 1 0.02908 1.3250 -2105.1
## - LOR 1 0.04152 1.3374 -2101.6
## - GRE.Score 1 0.05430 1.3502 -2098.0
## - CGPA 1 0.45896 1.7548 -1999.7
##
## Step: AIC=-2111.61
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.3021 -2111.6
## + University.Rating 1 0.00622 1.2959 -2111.4
## + SOP 1 0.00232 1.2998 -2110.3
## - TOEFL.Score 1 0.02975 1.3318 -2105.1
## - Research 1 0.03092 1.3330 -2104.8
## - LOR 1 0.05455 1.3566 -2098.2
## - GRE.Score 1 0.05548 1.3576 -2098.0
## - CGPA 1 0.51286 1.8150 -1989.1
summary(model_step)##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.270620 -0.022970 0.008901 0.035976 0.160757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4210184 0.1151556 -12.340 < 2e-16 ***
## GRE.Score 0.0022661 0.0005715 3.965 8.81e-05 ***
## TOEFL.Score 0.0027951 0.0009626 2.904 0.003908 **
## LOR 0.0175733 0.0044695 3.932 0.000101 ***
## CGPA 0.1226855 0.0101766 12.056 < 2e-16 ***
## Researchyes 0.0224954 0.0075994 2.960 0.003273 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0594 on 369 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.8156
## F-statistic: 331.8 on 5 and 369 DF, p-value: < 2.2e-16
The summary of model shows a lot of information. But we just need to see Pr(>|t|). This column shows significance level of the variable towards model. It shows that all variables have significant effect toward the model. Also the model can explain 81,5% of the data.
The performance of our model can be calculated using MAPE and RMSE.
pred_data <- predict(model_step, data_test)
MAPE(pred_data, data_test$Chance.of.Admit)## [1] 0.07290229
RMSE(pred_data, data_test$Chance.of.Admit)## [1] 0.06222074
Based on the result, MAPE shows that the model deviate around 7.2% from the actual data. Meanwhile RMSE shows that prediction deviate around 0.06 points from the actual data. From this, we know that the model is not bad.
Linear regression is a parametric model, meaning that in order to create a model equation, the model follows some classical assumption. Linear regression that doesn’t follow the assumption may be misleading, or just simply has biased estimator.
The first assumption in linear regression is that the residuals follow normal distribution. We can easily check this by using the Shaphiro-Wilk normality test.
shapiro.test(model_step$residuals)##
## Shapiro-Wilk normality test
##
## data: model_step$residuals
## W = 0.91908, p-value = 2.495e-13
From the result, p-value is <0.05 so it means our residuals are not following the normal distribution.
Homocesdasticity shows that the residual or error is constant or does not form a certain pattern. If the error forms a certain pattern such as a linear or conical line, then we call it ‘Heterocesdasticity’ and it will affect the standard error value in the biased estimate / predictor coefficient (too narrow or too wide). For this analysis, I use Breusch-Pagan test.
bptest(model_step)##
## studentized Breusch-Pagan test
##
## data: model_step
## BP = 15.824, df = 5, p-value = 0.007364
Based on the result, p-value is <0.05 so the residuals error are form certain pattern.
Multicollinearity mean that there is a correlation between the independent variables/predictors. To check the multicollinearity, we can measure the varianec inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(model_step)## GRE.Score TOEFL.Score LOR CGPA Research
## 4.224167 3.521604 1.739356 3.830131 1.501305
Based on the result, there are no correlation between variables.
We have already seen that our model doesn’t satisfy normality, and heterocesdasticity test. Now we will try to fix them. To made the model more linear and simple, we can transform some of the variables, both the target and the predictors. First, we can shun off the variables that have correlation coefficient above 0.6. Then we can change the target with box-cox method.
ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)## Warning in ggcorr(adm, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'Research' are not numeric and were ignored
From the result, we know that TOEFL.Score and GRE.Score, also SOP and University.Rating have high correlation. So we can deselect one of them. But to do that, we have to know what predictor have higher effect. So, we have to scale it.
scale_adm <- adm %>%
mutate_if(is.numeric, scale)
model_scale <- lm(Chance.of.Admit ~ ., scale_adm)
summary(model_scale)##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = scale_adm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88930 -0.16527 0.06512 0.23887 1.11108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09644 0.03238 -2.979 0.003036 **
## GRE.Score 0.14873 0.04020 3.700 0.000240 ***
## TOEFL.Score 0.11971 0.03759 3.184 0.001544 **
## University.Rating 0.04814 0.03080 1.563 0.118753
## SOP 0.01114 0.03204 0.348 0.728263
## LOR 0.11054 0.02713 4.074 5.38e-05 ***
## CGPA 0.50730 0.04159 12.198 < 2e-16 ***
## Researchyes 0.17222 0.04680 3.680 0.000259 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.425 on 492 degrees of freedom
## Multiple R-squared: 0.8219, Adjusted R-squared: 0.8194
## F-statistic: 324.4 on 7 and 492 DF, p-value: < 2.2e-16
Based on the result, we know that GRE.Score and University.Rating have higher effect on Chance.of.Admit. So we can deselect TOEFL.Score and SOP as predictor. After that, we can use box-cox method to the target.
adm2 <- adm %>%
select(-c(TOEFL.Score, SOP))box_try <- boxCox(model_step, lambda = seq(-10, 10,by=0.1))lambda <- box_try$x[ which.max(box_try$y) ]
adm2<- adm2 %>%
mutate(Chance.of.Admit = (Chance.of.Admit^lambda - 1) / lambda)set.seed(123)
data_train2 <- adm2[index, ]
data_test2 <- adm2[-index, ]set.seed(123)
model_adm2 <- lm(Chance.of.Admit ~ ., data_train2)
model_step2 <- step(model_adm2, direction = "both")## Start: AIC=-2452.45
## Chance.of.Admit ~ GRE.Score + University.Rating + LOR + CGPA +
## Research
##
## Df Sum of Sq RSS AIC
## <none> 0.52469 -2452.4
## - University.Rating 1 0.013684 0.53838 -2444.8
## - Research 1 0.014340 0.53903 -2444.3
## - LOR 1 0.015264 0.53996 -2443.7
## - GRE.Score 1 0.057330 0.58202 -2415.6
## - CGPA 1 0.249727 0.77442 -2308.5
summary(model_step2)##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + University.Rating +
## LOR + CGPA + Research, data = data_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.160765 -0.019596 0.006957 0.023719 0.104475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.6642363 0.0768589 -21.653 < 2e-16 ***
## GRE.Score 0.0020298 0.0003197 6.350 6.34e-10 ***
## University.Rating 0.0078394 0.0025271 3.102 0.00207 **
## LOR 0.0096503 0.0029454 3.276 0.00115 **
## CGPA 0.0851337 0.0064241 13.252 < 2e-16 ***
## Researchyes 0.0153469 0.0048327 3.176 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03771 on 369 degrees of freedom
## Multiple R-squared: 0.8425, Adjusted R-squared: 0.8404
## F-statistic: 394.9 on 5 and 369 DF, p-value: < 2.2e-16
Based on the result, we can see that this model can explain 84% of the data. It is higher around 2.5% than the previous model.
pred_data2 <- predict(model_step2, data_test2)
MAPE(pred_data2, data_test2$Chance.of.Admit)## [1] 0.1517087
RMSE(pred_data2, data_test2$Chance.of.Admit)## [1] 0.03833579
Based on the result, MAPE shows that the model deviate around 15% from the actual data. Meanwhile RMSE shows that prediction deviate around 0.03 points from the actual data. From this, we know that the model is not bad. MAPE from this model is 2 times higher than previous model but RMSE is lower almost half than previous model.
Lets test normality and Homocesdasticity again.
shapiro.test(model_step2$residuals)##
## Shapiro-Wilk normality test
##
## data: model_step2$residuals
## W = 0.94623, p-value = 1.974e-10
bptest(model_step2)##
## studentized Breusch-Pagan test
##
## data: model_step2
## BP = 4.3907, df = 5, p-value = 0.4946
Based on the result, the data still don’t pass normality test but pass Homocesdasticity test. It mean the predictor might be still biased. So lets check the residual plot.
qqPlot(model_step2$residuals)## 10 66
## 194 243
Based on the result, some of the residuals still outside the line.
Even after we tweak the model, it still have not pass the normality test. It means the predictor might be biased. Overall the model is good, the model can explain 84% of the data with MAPE is 0.015 and RMSE is 0.03. It is better than the previous model but unfortunately with higher MAPE.