library(tidyverse)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

Studying can be boring when we don't enjoy it, but as students it is something we have to do. This LBB will predict a student's chance of admission using several predictors. First, let's read the data.
grade <- read.csv("Admission_Predict.csv")
head(grade)

glimpse(grade)

## Rows: 400
## Columns: 9
## $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...
There are 400 students in the data. After taking a peek at it, we can see that the Serial.No. column should be removed, since it only identifies each participant/student. Next, we will check whether the Research column should be converted to a factor.
unique(grade$Research)

## [1] 1 0
Then, we will remove Serial.No. and change the data type of Research to factor.
# EDA on Grade Admission
grade <- grade %>%
  select(-Serial.No.) %>%
  mutate(Research = as.factor(Research))
glimpse(grade)

## Rows: 400
## Columns: 8
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
## $ Research <fct> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...
Column explanation:

- GRE.Score: GRE test score
- TOEFL.Score: TOEFL test score
- University.Rating: rating of the university (1-5)
- SOP: strength of the Statement of Purpose (1-5)
- LOR: strength of the Letter of Recommendation (1-5)
- CGPA: undergraduate cumulative GPA
- Research: research experience (1 = yes, 0 = no)
- Chance.of.Admit: chance of admission (the target)
After this initial EDA, let's check whether there are any missing values in the data set.
colSums(is.na(grade))

## GRE.Score TOEFL.Score University.Rating SOP
## 0 0 0 0
## LOR CGPA Research Chance.of.Admit
## 0 0 0 0
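As a one-line sanity check (a small aside, not in the original notebook), base R's anyNA() collapses the same question into a single logical:

# TRUE if any value anywhere in grade is NA; FALSE confirms a complete data set
anyNA(grade)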
Then, we will look at the correlation between Chance.of.Admit (the target) and the other variables (the predictors) using the ggcorr() function from the GGally package.
ggcorr(grade, label = TRUE, label_size = 4, hjust = 1, layout.exp = 2)

## Warning in ggcorr(grade, label = TRUE, label_size = 4, hjust = 1, layout.exp =
## 2): data in column(s) 'Research' are not numeric and were ignored
The correlation plot above shows that almost every predictor has a positive correlation with the target.
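To read the exact coefficients behind the plot, a minimal sketch (assuming dplyr >= 1.0 for where()) is:

# Correlation of every numeric column with the target; Research is a factor and is excluded
cor_num <- grade %>%
  select(where(is.numeric)) %>%
  cor()
round(cor_num[, "Chance.of.Admit"], 2)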
Because Research is a factor, we will create a bar plot instead to see the relationship between Research and Chance.of.Admit.
# Mean chance of admission for students with vs. without research experience
grade %>%
  group_by(Research) %>%
  summarise(mean_chance = mean(Chance.of.Admit)) %>%
  ggplot(aes(Research, mean_chance)) +
  geom_col() +
  labs(title = "Mean Chance of Admit by Research Experience",
       y = NULL) +
  theme_minimal()

The bar plot above shows that students with research experience have a higher average chance of admission than students without it.
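To back the visual impression with a formal test (an extra step, not part of the original analysis), a two-sample t-test could compare the group means:

# Does the mean Chance.of.Admit differ between Research = 0 and Research = 1?
t.test(Chance.of.Admit ~ Research, data = grade)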
In this section, we will start building linear regression models. First, a model with all predictors.

1. Linear Regression Model: All Predictors
lm1 <- lm(Chance.of.Admit ~ ., grade)
summary(lm1)

##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26259 -0.02103 0.01005 0.03628 0.15928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2594325 0.1247307 -10.097 < 2e-16 ***
## GRE.Score 0.0017374 0.0005979 2.906 0.00387 **
## TOEFL.Score 0.0029196 0.0010895 2.680 0.00768 **
## University.Rating 0.0057167 0.0047704 1.198 0.23150
## SOP -0.0033052 0.0055616 -0.594 0.55267
## LOR 0.0223531 0.0055415 4.034 6.6e-05 ***
## CGPA 0.1189395 0.0122194 9.734 < 2e-16 ***
## Research1 0.0245251 0.0079598 3.081 0.00221 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.8
## F-statistic: 228.9 on 7 and 392 DF, p-value: < 2.2e-16
From the result above:

- Adj. R-squared = 0.8, meaning the model with all predictors explains 80% of the variance in the target (Chance.of.Admit).
- Two predictors, University.Rating and SOP, are not statistically significant (p-value > 0.05).
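If we want to pull those insignificant predictors out programmatically, a small sketch (not in the original) filters the coefficient table by p-value:

# Rows of the coefficient table whose p-value exceeds 0.05
coefs <- summary(lm1)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]

For lm1 this flags University.Rating and SOP, matching the summary output above.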
Then, we will create another linear regression model with only two predictors: TOEFL.Score and Research.

2. Regression Model with Two Predictors
lm2 <- lm(Chance.of.Admit ~ TOEFL.Score + Research, grade)
summary(lm2)

##
## Call:
## lm(formula = Chance.of.Admit ~ TOEFL.Score + Research, data = grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33422 -0.04267 0.01510 0.05416 0.20510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0384127 0.0820539 -12.65 < 2e-16 ***
## TOEFL.Score 0.0160940 0.0007857 20.48 < 2e-16 ***
## Research1 0.0622860 0.0095685 6.51 2.28e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08304 on 397 degrees of freedom
## Multiple R-squared: 0.6626, Adjusted R-squared: 0.6609
## F-statistic: 389.9 on 2 and 397 DF, p-value: < 2.2e-16
From the result above:

- Adj. R-squared = 0.6609, meaning this model explains only 66.09% of the variance in Chance.of.Admit.
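As a quick illustrative comparison (not in the original code), the adjusted R-squared of the two models so far can be placed side by side:

# Adjusted R-squared of the full model vs. the two-predictor model
c(lm1 = summary(lm1)$adj.r.squared,
  lm2 = summary(lm2)$adj.r.squared)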
Then, we will use stepwise regression to create another model.
3. Stepwise Regression - Backward
lm3 <- step(object = lm1, direction = "backward")

## Start: AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00144 1.5962 -2195.5
## - University.Rating 1 0.00584 1.6006 -2194.4
## <none> 1.5948 -2193.9
## - TOEFL.Score 1 0.02921 1.6240 -2188.6
## - GRE.Score 1 0.03435 1.6291 -2187.4
## - Research 1 0.03862 1.6334 -2186.3
## - LOR 1 0.06620 1.6609 -2179.6
## - CGPA 1 0.38544 1.9802 -2109.3
##
## Step: AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 1 0.00464 1.6008 -2196.4
## <none> 1.5962 -2195.5
## - TOEFL.Score 1 0.02806 1.6242 -2190.6
## - GRE.Score 1 0.03565 1.6318 -2188.7
## - Research 1 0.03769 1.6339 -2188.2
## - LOR 1 0.06983 1.6660 -2180.4
## - CGPA 1 0.38660 1.9828 -2110.8
##
## Step: AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.6008 -2196.4
## - TOEFL.Score 1 0.03292 1.6338 -2190.2
## - GRE.Score 1 0.03638 1.6372 -2189.4
## - Research 1 0.03912 1.6400 -2188.7
## - LOR 1 0.09133 1.6922 -2176.2
## - CGPA 1 0.43201 2.0328 -2102.8
summary(lm3)

##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.263542 -0.023297 0.009879 0.038078 0.159897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2984636 0.1172905 -11.070 < 2e-16 ***
## GRE.Score 0.0017820 0.0005955 2.992 0.00294 **
## TOEFL.Score 0.0030320 0.0010651 2.847 0.00465 **
## LOR 0.0227762 0.0048039 4.741 2.97e-06 ***
## CGPA 0.1210042 0.0117349 10.312 < 2e-16 ***
## Research1 0.0245769 0.0079203 3.103 0.00205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared: 0.8027, Adjusted R-squared: 0.8002
## F-statistic: 320.6 on 5 and 394 DF, p-value: < 2.2e-16
From the result above:

- Backward stepwise regression keeps GRE.Score, TOEFL.Score, LOR, CGPA, and Research as predictors.
- Adj. R-squared = 0.8002, very slightly higher than the model with all predictors.
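We can confirm which terms survived the elimination directly from the fitted object (a small sketch):

# The formula retained by backward elimination
formula(lm3)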
After creating several models, let's generate predictions from each of them and compare their errors.

In this section, we will predict with each model and store the predicted values in new columns.
# Create new columns filled with each model's predictions
grade$predict1 <- predict(lm1, grade)
grade$predict2 <- predict(lm2, grade)
grade$predict3 <- predict(lm3, grade)
# quick check
head(grade)

After predicting with each model, we will compare their errors using the RMSE() function from the MLmetrics package.
# Comparison of each model's error
RMSE(y_pred = grade$predict1, y_true = grade$Chance.of.Admit)

## [1] 0.06314185

RMSE(y_pred = grade$predict2, y_true = grade$Chance.of.Admit)

## [1] 0.08272896

RMSE(y_pred = grade$predict3, y_true = grade$Chance.of.Admit)

## [1] 0.06326207
The results above show:

- The RMSE of lm1 (all predictors) and lm3 (backward stepwise) differ only slightly.
- lm3 is the recommended model: its RMSE is only slightly different from lm1's, while its Adj. R-squared is slightly higher.
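Putting error and fit side by side makes the trade-off explicit; the table below is an illustrative summary, not part of the original code:

# Adjusted R-squared and in-sample RMSE for all three models
data.frame(
  model = c("lm1", "lm2", "lm3"),
  adj_r_squared = c(summary(lm1)$adj.r.squared,
                    summary(lm2)$adj.r.squared,
                    summary(lm3)$adj.r.squared),
  rmse = c(RMSE(grade$predict1, grade$Chance.of.Admit),
           RMSE(grade$predict2, grade$Chance.of.Admit),
           RMSE(grade$predict3, grade$Chance.of.Admit))
)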
Having compared each model's error and Adj. R-squared, we will now evaluate the chosen model.

In this section we will evaluate lm3 with four assumption checks: linearity, normality of residuals, homoscedasticity, and multicollinearity.

There are two ways to check the linearity of the model: visualizing it, or using the cor.test() function. We will do both as a cross-check.
1.a. Using cor.test() function:
cor.test(grade$Chance.of.Admit, grade$GRE.Score)

##
## Pearson's product-moment correlation
##
## data: grade$Chance.of.Admit and grade$GRE.Score
## t = 26.843, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7647419 0.8349536
## sample estimates:
## cor
## 0.8026105
cor.test(grade$Chance.of.Admit, grade$TOEFL.Score)

##
## Pearson's product-moment correlation
##
## data: grade$Chance.of.Admit and grade$TOEFL.Score
## t = 25.845, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7519028 0.8255675
## sample estimates:
## cor
## 0.791594
cor.test(grade$Chance.of.Admit, grade$LOR)

##
## Pearson's product-moment correlation
##
## data: grade$Chance.of.Admit and grade$LOR
## t = 18, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6120380 0.7206083
## sample estimates:
## cor
## 0.6698888
cor.test(grade$Chance.of.Admit, grade$CGPA)

##
## Pearson's product-moment correlation
##
## data: grade$Chance.of.Admit and grade$CGPA
## t = 35.759, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8478354 0.8947275
## sample estimates:
## cor
## 0.8732891
The results above show that every p-value is < 0.05, meaning each numeric predictor in lm3 has a significant linear correlation with the target.
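The four calls above can also be condensed into one loop; this sketch (not in the original) collects just the p-values:

# p-value of each numeric predictor's correlation with the target
sapply(c("GRE.Score", "TOEFL.Score", "LOR", "CGPA"),
       function(col) cor.test(grade$Chance.of.Admit, grade[[col]])$p.value)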
1.b. Visualize it!
simpred <- data.frame(residuals = lm3$residuals, fitted = lm3$fitted.values)
simpred %>%
  ggplot(aes(fitted, residuals)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0), col = "red") +
  theme_minimal()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The residuals-versus-fitted plot above scatters fairly evenly around zero, which supports the linearity assumption.

After checking linearity, we will check whether the residuals follow a normal distribution, using the shapiro.test() function.
shapiro.test(lm3$residuals)

##
## Shapiro-Wilk normality test
##
## data: lm3$residuals
## W = 0.92193, p-value = 1.443e-13
The result above shows a p-value < 0.05, meaning the residuals of lm3 do not follow a normal distribution.
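A QQ plot makes that departure visible; this is an illustrative base-graphics check, not part of the original:

# Points bending away from the red reference line indicate non-normal residuals
qqnorm(lm3$residuals)
qqline(lm3$residuals, col = "red")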
The third assumption check is homoscedasticity, which tells us whether the residuals show a pattern (heteroscedasticity) or not (homoscedasticity).

The hypotheses are:

- H0: homoscedasticity (no pattern) => p-value > 0.05
- H1: heteroscedasticity (a pattern) => p-value < 0.05
To check this assumption, we will use the bptest() function from the lmtest package.
bptest(lm3)

##
## studentized Breusch-Pagan test
##
## data: lm3
## BP = 22.428, df = 5, p-value = 0.0004341
The result above shows that the p-value of lm3 is < 0.05, meaning the residuals do show a pattern (heteroscedasticity). Visualized:
plot(grade$Chance.of.Admit, lm3$residuals)
abline(h = 0, col = "red")

The last assumption check, multicollinearity, tests whether the predictors are linearly correlated with each other. To check it, we simply use the vif() function from the car package, and remove any predictor whose VIF exceeds 5 (or, by a looser rule, 10).
vif(lm3)

## GRE.Score TOEFL.Score LOR CGPA Research
## 4.585053 4.104255 1.829491 4.808767 1.530007
The result above shows that every predictor in lm3 has a VIF below 5, so the model has no multicollinearity.
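The removal rule described earlier can be written as a filter (a hypothetical sketch; here it returns character(0), so nothing needs removing):

# Flag predictors whose VIF exceeds 5
v <- vif(lm3)
names(v)[v > 5]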
To summarize, we built three models:

- lm1, the model with all predictors.
- lm2, the model with only two predictors (TOEFL.Score and Research).
- lm3, the backward stepwise regression model.

Comparing the Adj. R-squared of the three models, lm3 has the highest, although the difference from lm1 is slight. The same goes for RMSE: lm1 and lm3 differ only slightly, and lm3 is the recommended model for prediction. From the assumption checks on lm3, all predictors have a linear correlation with the target and the model has no multicollinearity, but the residuals are not normally distributed and show a pattern; the model fails the normality-of-residuals and homoscedasticity checks.