In this project, we will use the Graduate Admission dataset from Kaggle. Our objectives are to predict the chance of admission and to analyze the relationships between variables using a linear regression model.
First, we load the required libraries.
library(lubridate)
library(dplyr)
library(GGally)
library(ggplot2)
library(plotly)
library(glue)
library(scales)
library(MLmetrics)
library(lmtest)
library(car)
library(performance)
Read the data and store it in the admission object.
admission <- read.csv("Admission_Predict_Ver1.1.csv")
Take a first look at the data:
head(admission)
Check the number of rows and columns.
dim(admission)
## [1] 500 9
The data contains 500 rows and 9 columns.
View all columns and the data types.
glimpse(admission)
## Rows: 500
## Columns: 9
## $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
## $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~
The dataset contains the following variables:
GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose (out of 5)
Letter of Recommendation Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
We will convert Research, LOR, SOP, and University.Rating to factors, then drop the Serial.No. column.
admission <- admission %>%
select(-Serial.No.) %>%
mutate_at(vars(Research, LOR, SOP, University.Rating), as.factor)
Next, check for missing values.
colSums(is.na(admission))
## GRE.Score TOEFL.Score University.Rating SOP
## 0 0 0 0
## LOR CGPA Research Chance.of.Admit
## 0 0 0 0
No missing values found.
Let’s see the summary of all columns.
summary(admission)
## GRE.Score TOEFL.Score University.Rating SOP LOR
## Min. :290.0 Min. : 92.0 1: 34 4 :89 3 :99
## 1st Qu.:308.0 1st Qu.:103.0 2:126 3.5 :88 4 :94
## Median :317.0 Median :107.0 3:162 3 :80 3.5 :86
## Mean :316.5 Mean :107.2 4:105 2.5 :64 4.5 :63
## 3rd Qu.:325.0 3rd Qu.:112.0 5: 73 4.5 :63 2.5 :50
## Max. :340.0 Max. :120.0 2 :43 5 :50
## (Other):73 (Other):58
## CGPA Research Chance.of.Admit
## Min. :6.800 0:220 Min. :0.3400
## 1st Qu.:8.127 1:280 1st Qu.:0.6300
## Median :8.560 Median :0.7200
## Mean :8.576 Mean :0.7217
## 3rd Qu.:9.040 3rd Qu.:0.8200
## Max. :9.920 Max. :0.9700
##
We will use Chance of Admit as the target variable. Let's look at its distribution.
ggplot(admission, aes(x = Chance.of.Admit, fill = ..count..)) +
geom_histogram() +
ggtitle("Chance of Admit Histogram") +
ylab("Frequency") +
xlab("Chance of Admit") +
theme(plot.title = element_text(hjust = 0.5)) +
theme_minimal()
ggplot(admission, aes(y = Chance.of.Admit)) +
geom_boxplot(colour="dark blue", outlier.colour="red") +
labs(title = "Chance of Admit Boxplot",
x = "",
y = "Chance of Admit") +
theme_minimal()
Relationship between Chance of Admit and the other variables:
ggplot(admission, aes(x=Chance.of.Admit, y=GRE.Score)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)
ggplot(admission, aes(x=Chance.of.Admit, y=TOEFL.Score)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)
ggplot(admission, aes(x=Chance.of.Admit, y=CGPA)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)
ggcorr(admission, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2, low = "black", high = "blue")
From the plots above, GRE Score, TOEFL Score, and CGPA show a linear relationship and a strong correlation with Chance of Admit.
Before we create the model, we need to split the data into a training dataset and a testing dataset. We will use the training dataset to build the linear regression model, and the testing dataset as a comparison to check whether the model overfits or fails to predict new data. We will use 80% of the data for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(admission), nrow(admission) *0.8)
data_train <- admission[index, ]
data_test <- admission[-index, ]
For the first model, we will use the variables that have a strong correlation with Chance of Admit. We name it model_corr.
model_corr <- lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, data_train)
summary(model_corr)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.249482 -0.022160 0.004915 0.034195 0.138405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5381820 0.0959634 -16.029 < 0.0000000000000002 ***
## CGPA 0.1536305 0.0095068 16.160 < 0.0000000000000002 ***
## TOEFL.Score 0.0032507 0.0009295 3.497 0.000523 ***
## GRE.Score 0.0018777 0.0005341 3.515 0.000490 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0575 on 396 degrees of freedom
## Multiple R-squared: 0.8377, Adjusted R-squared: 0.8365
## F-statistic: 681.5 on 3 and 396 DF, p-value: < 0.00000000000000022
Next, we create a model that includes all variables, named model_all.
model_all <- lm(formula = Chance.of.Admit ~ ., data_train)
summary(model_all)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.232413 -0.024831 0.005917 0.030587 0.191986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2902487 0.1307190 -9.870 < 0.0000000000000002 ***
## GRE.Score 0.0014724 0.0005603 2.628 0.00894 **
## TOEFL.Score 0.0028758 0.0009394 3.061 0.00236 **
## University.Rating2 -0.0214283 0.0137622 -1.557 0.12030
## University.Rating3 -0.0118871 0.0145173 -0.819 0.41341
## University.Rating4 -0.0124343 0.0167728 -0.741 0.45895
## University.Rating5 0.0008585 0.0187909 0.046 0.96359
## SOP1.5 -0.0064189 0.0287011 -0.224 0.82315
## SOP2 -0.0138282 0.0280494 -0.493 0.62231
## SOP2.5 0.0266290 0.0283697 0.939 0.34852
## SOP3 0.0060646 0.0286415 0.212 0.83242
## SOP3.5 0.0030335 0.0290199 0.105 0.91680
## SOP4 0.0083375 0.0296008 0.282 0.77835
## SOP4.5 0.0190720 0.0306618 0.622 0.53431
## SOP5 0.0189152 0.0314845 0.601 0.54835
## LOR1.5 0.0136770 0.0641854 0.213 0.83138
## LOR2 0.0458054 0.0623261 0.735 0.46284
## LOR2.5 0.0658993 0.0624474 1.055 0.29198
## LOR3 0.0580758 0.0628132 0.925 0.35578
## LOR3.5 0.0658383 0.0629702 1.046 0.29644
## LOR4 0.0667821 0.0629994 1.060 0.28981
## LOR4.5 0.0692877 0.0634818 1.091 0.27577
## LOR5 0.0816670 0.0636553 1.283 0.20030
## CGPA 0.1361953 0.0106278 12.815 < 0.0000000000000002 ***
## Research1 0.0165966 0.0072530 2.288 0.02268 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05627 on 375 degrees of freedom
## Multiple R-squared: 0.8529, Adjusted R-squared: 0.8435
## F-statistic: 90.57 on 24 and 375 DF, p-value: < 0.00000000000000022
We can use stepwise regression to find the combination of predictors that produces the best model based on the AIC value. There are three directions for stepwise regression: forward, backward, and both. We will use the "both" direction and name the result model_stepwise.
model_none <- lm(formula = Chance.of.Admit ~ 1, data = data_train)
model_stepwise <- step(
object = model_none,
direction = "both",
scope = list(upper = model_all),
trace = FALSE)
summary(model_stepwise)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score +
## Research + SOP, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.243639 -0.024933 0.004193 0.032739 0.138656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3023257 0.1178324 -11.052 < 0.0000000000000002 ***
## CGPA 0.1425699 0.0101725 14.015 < 0.0000000000000002 ***
## GRE.Score 0.0014524 0.0005561 2.612 0.00936 **
## TOEFL.Score 0.0029205 0.0009287 3.145 0.00179 **
## Research1 0.0176687 0.0071646 2.466 0.01409 *
## SOP1.5 -0.0009313 0.0261934 -0.036 0.97166
## SOP2 -0.0083335 0.0251307 -0.332 0.74037
## SOP2.5 0.0283377 0.0245313 1.155 0.24874
## SOP3 0.0129597 0.0246596 0.526 0.59951
## SOP3.5 0.0126892 0.0248057 0.512 0.60926
## SOP4 0.0221039 0.0254353 0.869 0.38537
## SOP4.5 0.0351652 0.0263658 1.334 0.18307
## SOP5 0.0417065 0.0270883 1.540 0.12446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05651 on 387 degrees of freedom
## Multiple R-squared: 0.8468, Adjusted R-squared: 0.8421
## F-statistic: 178.3 on 12 and 387 DF, p-value: < 0.00000000000000022
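For reference, the forward-only and backward-only variants can be run in much the same way. A minimal sketch, assuming the same model_none and model_all objects defined above (the model_forward and model_backward names are only illustrative):
# Forward selection: start from the intercept-only model and add predictors.
model_forward <- step(object = model_none, direction = "forward", scope = list(upper = model_all), trace = FALSE)
# Backward elimination: start from the full model and drop predictors.
model_backward <- step(object = model_all, direction = "backward", trace = FALSE)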
performance <- compare_performance(model_corr, model_all, model_stepwise)
as.data.frame(performance)
Based on the adjusted R-squared, the results of the three models do not differ much. The stepwise model can be chosen because it does not use all variables, only the best selected ones, and its AIC value is the lowest of the three.
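If we only want the AIC values, base R's AIC() reports them directly for several fitted models at once (a quick sketch using the same model objects):
# Returns a data frame with degrees of freedom and AIC per model.
AIC(model_corr, model_all, model_stepwise)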
As mentioned before, all numeric variables have a strong correlation with Chance of Admit, so the linearity assumption is fulfilled. To make sure, we can run a formal test with cor.test.
cor.test(admission$Chance.of.Admit, admission$GRE.Score)
##
## Pearson's product-moment correlation
##
## data: admission$Chance.of.Admit and admission$GRE.Score
## t = 30.862, df = 498, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7779406 0.8384601
## sample estimates:
## cor
## 0.8103506
cor.test(admission$Chance.of.Admit, admission$CGPA)
##
## Pearson's product-moment correlation
##
## data: admission$Chance.of.Admit and admission$CGPA
## t = 41.855, df = 498, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8613745 0.9004286
## sample estimates:
## cor
## 0.8824126
cor.test(admission$Chance.of.Admit, admission$TOEFL.Score)
##
## Pearson's product-moment correlation
##
## data: admission$Chance.of.Admit and admission$TOEFL.Score
## t = 28.972, df = 498, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7571359 0.8227603
## sample estimates:
## cor
## 0.7922276
Linearity hypothesis test:
All variables have p-value < alpha (0.05), so we reject H0: the correlations are significant.
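The three tests above can also be run in one pass. A small sketch, assuming the admission data frame above (numeric_vars is just an illustrative name):
# Loop cor.test over the numeric predictors and collect estimates and p-values.
numeric_vars <- c("GRE.Score", "TOEFL.Score", "CGPA")
sapply(numeric_vars, function(v) {
  test <- cor.test(admission$Chance.of.Admit, admission[[v]])
  c(estimate = unname(test$estimate), p.value = test$p.value)
})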
hist(model_stepwise$residuals)
shapiro.test(model_all$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_all$residuals
## W = 0.938, p-value = 0.000000000007347
Shapiro-Wilk hypothesis test:
As we can see, p-value < alpha (0.05), so the residuals are not normally distributed.
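A visual check tells a similar story. A quick sketch using base R's normal Q-Q plot on the stepwise model's residuals:
# Points deviating from the reference line suggest non-normal residuals.
qqnorm(model_stepwise$residuals)
qqline(model_stepwise$residuals, col = "red")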
bptest(model_stepwise)
##
## studentized Breusch-Pagan test
##
## data: model_stepwise
## BP = 36.625, df = 12, p-value = 0.000257
Breusch-Pagan hypothesis test:
The result shows p-value < alpha (0.05), so the residuals are not distributed with equal variance.
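One common remedy for heteroscedasticity (not pursued further here) is to report heteroscedasticity-consistent standard errors. A minimal sketch, assuming the sandwich package is installed alongside lmtest:
library(sandwich)
# Coefficient tests for the stepwise model with HC-robust standard errors.
coeftest(model_stepwise, vcov. = vcovHC(model_stepwise, type = "HC1"))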
vif(model_stepwise)
## GVIF Df GVIF^(1/(2*Df))
## CGPA 4.889091 1 2.211129
## GRE.Score 4.978436 1 2.231241
## TOEFL.Score 3.991265 1 1.997815
## Research 1.589583 1 1.260786
## SOP 2.365755 8 1.055293
VIF (Variance Inflation Factor) test:
All variables have VIF values below 10, so there are no multicollinear predictors in the model.
The normality and homoscedasticity assumptions are not fulfilled. We will try removing outliers and transforming the data. Based on the earlier exploratory analysis, Chance of Admit has outliers; we can delete them and rebuild the model.
quartiles <- quantile(admission$Chance.of.Admit, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(admission$Chance.of.Admit)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
admission_no_outlier <- subset(admission, admission$Chance.of.Admit > Lower & admission$Chance.of.Admit < Upper)
dim(admission_no_outlier)
## [1] 498 8
RNGkind(sample.kind = "Rounding")
set.seed(100)
index2 <- sample(nrow(admission_no_outlier), nrow(admission_no_outlier) *0.8)
data_train2 <- admission_no_outlier[index2, ]
data_test2 <- admission_no_outlier[-index2, ]
model_no_outlier <- lm(formula = Chance.of.Admit ~ ., data_train2)
summary(model_no_outlier)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.197975 -0.023937 0.005178 0.029699 0.137988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2685265 0.1192678 -10.636 < 0.0000000000000002 ***
## GRE.Score 0.0016517 0.0005214 3.168 0.001661 **
## TOEFL.Score 0.0024960 0.0009125 2.735 0.006527 **
## University.Rating2 -0.0213306 0.0144212 -1.479 0.139952
## University.Rating3 -0.0111655 0.0153660 -0.727 0.467903
## University.Rating4 -0.0078303 0.0169242 -0.463 0.643870
## University.Rating5 0.0059444 0.0189718 0.313 0.754208
## SOP1.5 -0.0180075 0.0283133 -0.636 0.525161
## SOP2 -0.0216088 0.0280433 -0.771 0.441458
## SOP2.5 0.0155228 0.0282648 0.549 0.583202
## SOP3 -0.0002045 0.0282709 -0.007 0.994233
## SOP3.5 -0.0071351 0.0289588 -0.246 0.805518
## SOP4 -0.0029007 0.0294819 -0.098 0.921675
## SOP4.5 0.0010225 0.0303911 0.034 0.973180
## SOP5 0.0036080 0.0311331 0.116 0.907802
## LOR2 0.0719751 0.0226082 3.184 0.001577 **
## LOR2.5 0.0873862 0.0224399 3.894 0.000117 ***
## LOR3 0.0852572 0.0218074 3.910 0.000110 ***
## LOR3.5 0.1001953 0.0221567 4.522 0.00000823 ***
## LOR4 0.0989308 0.0223310 4.430 0.00001238 ***
## LOR4.5 0.1114681 0.0233991 4.764 0.00000272 ***
## LOR5 0.1159061 0.0239959 4.830 0.00000199 ***
## CGPA 0.1291371 0.0104132 12.401 < 0.0000000000000002 ***
## Research1 0.0197347 0.0070021 2.818 0.005083 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05518 on 374 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8466
## F-statistic: 96.27 on 23 and 374 DF, p-value: < 0.00000000000000022
The linearity assumption is fulfilled; we need to check the other assumptions.
hist(model_no_outlier$residuals)
shapiro.test(model_no_outlier$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_no_outlier$residuals
## W = 0.94808, p-value = 0.0000000001337
Result: p-value < alpha (0.05), the residuals are not normally distributed.
bptest(model_no_outlier)
##
## studentized Breusch-Pagan test
##
## data: model_no_outlier
## BP = 60.367, df = 23, p-value = 0.0000338
Result: p-value < alpha (0.05), the residuals are not distributed with equal variance.
vif(model_no_outlier)
## GVIF Df GVIF^(1/(2*Df))
## GRE.Score 4.535110 1 2.129580
## TOEFL.Score 3.938226 1 1.984496
## University.Rating 5.337756 4 1.232877
## SOP 7.292455 8 1.132217
## LOR 3.648910 7 1.096868
## CGPA 5.183138 1 2.276651
## Research 1.580847 1 1.257317
Result: all VIF values are below 10, so there are no multicollinear predictors in the model.
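Next, since Chance of Admit is a proportion bounded between 0 and 1, we transform the target with the arcsine square root transformation (asin(sqrt(y))), a common variance-stabilizing transformation for proportions. A quick sketch of the transform and its inverse (the values in p are only illustrative):
p <- c(0.34, 0.72, 0.97)   # example proportions
z <- asin(sqrt(p))         # arcsine square root transform
sin(z)^2                   # back-transform recovers the original values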
admission_transform_y <- admission %>%
select(Chance.of.Admit, CGPA, GRE.Score, LOR, TOEFL.Score, Research) %>%
mutate(Chance.of.Admit = asin(sqrt(Chance.of.Admit)))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index3 <- sample(nrow(admission_transform_y), nrow(admission_transform_y) *0.8)
data_train3 <- admission_transform_y[index3, ]
data_test3 <- admission_transform_y[-index3, ]
model_transform_y <- lm(formula = Chance.of.Admit ~ ., data_train3)
summary(model_transform_y)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.264200 -0.032878 0.005827 0.040560 0.158164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4778156 0.1370978 -10.779 < 0.0000000000000002 ***
## CGPA 0.1658955 0.0112438 14.754 < 0.0000000000000002 ***
## GRE.Score 0.0019023 0.0006178 3.079 0.00222 **
## LOR1.5 -0.0226055 0.0669269 -0.338 0.73572
## LOR2 0.0114515 0.0638982 0.179 0.85786
## LOR2.5 0.0128182 0.0639578 0.200 0.84126
## LOR3 0.0082470 0.0637610 0.129 0.89715
## LOR3.5 0.0177819 0.0640222 0.278 0.78136
## LOR4 0.0196596 0.0641363 0.307 0.75937
## LOR4.5 0.0352651 0.0647730 0.544 0.58645
## LOR5 0.0574063 0.0649549 0.884 0.37736
## TOEFL.Score 0.0042171 0.0010210 4.130 0.0000444 ***
## Research1 0.0203053 0.0079378 2.558 0.01091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06283 on 387 degrees of freedom
## Multiple R-squared: 0.8665, Adjusted R-squared: 0.8624
## F-statistic: 209.3 on 12 and 387 DF, p-value: < 0.00000000000000022
hist(model_transform_y$residuals)
shapiro.test(model_transform_y$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_transform_y$residuals
## W = 0.961, p-value = 0.00000000812
Result: p-value < alpha (0.05), the residuals are not normally distributed.
bptest(model_transform_y)
##
## studentized Breusch-Pagan test
##
## data: model_transform_y
## BP = 19.375, df = 12, p-value = 0.07987
Result: p-value > alpha (0.05), the residuals are distributed with equal variance.
vif(model_transform_y)
## GVIF Df GVIF^(1/(2*Df))
## CGPA 4.832159 1 2.198217
## GRE.Score 4.969697 1 2.229282
## LOR 1.787768 8 1.036978
## TOEFL.Score 3.902233 1 1.975407
## Research 1.578478 1 1.256375
Result: all VIF values are below 10, so there are no multicollinear predictors in the model.
Next, we also transform the predictors: model_transform_x applies a log10 transformation to CGPA, GRE.Score, and TOEFL.Score while keeping the arcsine square root transformation on the target.
admission_transform_x <- admission %>%
select(Chance.of.Admit, CGPA, GRE.Score, LOR, TOEFL.Score, Research) %>%
mutate_at(vars(CGPA, GRE.Score, TOEFL.Score), ~log10(.)) %>%
mutate(Chance.of.Admit = asin(sqrt(Chance.of.Admit)))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index3 <- sample(nrow(admission_transform_x), nrow(admission_transform_x) *0.8)
data_train3 <- admission_transform_x[index3, ]
data_test3 <- admission_transform_x[-index3, ]
model_transform_x <- lm(formula = Chance.of.Admit ~ ., data_train3)
summary(model_transform_x)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.267603 -0.032860 0.005186 0.042480 0.155471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.837340 0.882288 -8.883 < 0.0000000000000002 ***
## CGPA 3.155180 0.219525 14.373 < 0.0000000000000002 ***
## GRE.Score 1.476373 0.449916 3.281 0.00113 **
## LOR1.5 -0.035447 0.067993 -0.521 0.60243
## LOR2 0.000532 0.064903 0.008 0.99346
## LOR2.5 -0.002727 0.064998 -0.042 0.96656
## LOR3 -0.007060 0.064814 -0.109 0.91331
## LOR3.5 0.002681 0.065093 0.041 0.96717
## LOR4 0.005237 0.065213 0.080 0.93604
## LOR4.5 0.023704 0.065858 0.360 0.71909
## LOR5 0.046654 0.066042 0.706 0.48034
## TOEFL.Score 1.092656 0.252819 4.322 0.0000197 ***
## Research1 0.021121 0.008055 2.622 0.00909 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06378 on 387 degrees of freedom
## Multiple R-squared: 0.8625, Adjusted R-squared: 0.8582
## F-statistic: 202.2 on 12 and 387 DF, p-value: < 0.00000000000000022
hist(model_transform_x$residuals)
shapiro.test(model_transform_x$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_transform_x$residuals
## W = 0.96352, p-value = 0.00000002021
Result: p-value < alpha (0.05), the residuals are not normally distributed.
bptest(model_transform_x)
##
## studentized Breusch-Pagan test
##
## data: model_transform_x
## BP = 18.307, df = 12, p-value = 0.1067
Result: p-value > alpha (0.05), the residuals are distributed with equal variance.
vif(model_transform_x)
## GVIF Df GVIF^(1/(2*Df))
## CGPA 4.639310 1 2.153906
## GRE.Score 4.840780 1 2.200177
## LOR 1.790516 8 1.037077
## TOEFL.Score 3.816483 1 1.953582
## Research 1.577807 1 1.256108
Result: all VIF values are below 10, so there are no multicollinear predictors in the model.
As a second predictor transformation, model_transform_x2 applies a square root transformation to CGPA, GRE.Score, and TOEFL.Score instead.
admission_transform_x2 <- admission %>%
select(Chance.of.Admit, CGPA, GRE.Score, LOR, TOEFL.Score, Research) %>%
mutate_at(vars(CGPA, GRE.Score, TOEFL.Score), ~sqrt(.)) %>%
mutate(Chance.of.Admit = asin(sqrt(Chance.of.Admit)))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index4 <- sample(nrow(admission_transform_x2), nrow(admission_transform_x2) *0.8)
data_train4 <- admission_transform_x2[index4, ]
data_test4 <- admission_transform_x2[-index4, ]
model_transform_x2 <- lm(formula = Chance.of.Admit ~ ., data_train4)
summary(model_transform_x2)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.265930 -0.033087 0.004758 0.042121 0.155499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.9567095 0.2561513 -15.447 < 0.0000000000000002 ***
## CGPA 0.9554399 0.0655197 14.582 < 0.0000000000000002 ***
## GRE.Score 0.0696975 0.0219679 3.173 0.00163 **
## LOR1.5 -0.0290496 0.0674169 -0.431 0.66678
## LOR2 0.0059692 0.0643597 0.093 0.92615
## LOR2.5 0.0049894 0.0644364 0.077 0.93832
## LOR3 0.0005006 0.0642461 0.008 0.99379
## LOR3.5 0.0101154 0.0645159 0.157 0.87549
## LOR4 0.0123081 0.0646329 0.190 0.84907
## LOR4.5 0.0293128 0.0652740 0.449 0.65363
## LOR5 0.0518538 0.0654568 0.792 0.42874
## TOEFL.Score 0.0895337 0.0211750 4.228 0.0000294 ***
## Research1 0.0206827 0.0079919 2.588 0.01002 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06326 on 387 degrees of freedom
## Multiple R-squared: 0.8647, Adjusted R-squared: 0.8605
## F-statistic: 206 on 12 and 387 DF, p-value: < 0.00000000000000022
hist(model_transform_x2$residuals)
shapiro.test(model_transform_x2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_transform_x2$residuals
## W = 0.9621, p-value = 0.00000001205
Result: p-value < alpha (0.05), the residuals are not normally distributed.
bptest(model_transform_x2)
##
## studentized Breusch-Pagan test
##
## data: model_transform_x2
## BP = 18.865, df = 12, p-value = 0.09185
Result: p-value > alpha (0.05), the residuals are distributed with equal variance.
vif(model_transform_x2)
## GVIF Df GVIF^(1/(2*Df))
## CGPA 4.738605 1 2.176834
## GRE.Score 4.906441 1 2.215049
## LOR 1.789409 8 1.037037
## TOEFL.Score 3.859976 1 1.964682
## Research 1.578227 1 1.256275
Result: all VIF values are below 10, so there are no multicollinear predictors in the model.
performance_model <- compare_performance(model_no_outlier, model_transform_y, model_transform_x, model_transform_x2)
as.data.frame(performance_model)
We created model_no_outlier, model_transform_y, model_transform_x, and model_transform_x2. model_no_outlier still fails the normality and homoscedasticity tests. The other models fulfill the homoscedasticity assumption but, unfortunately, not the normality test. For now, we use the best model we have based on the adjusted R-squared above: model_transform_y will be used for prediction.
model_pred <- predict(model_transform_y, newdata = data_test3 %>% select(-Chance.of.Admit))
# RMSE of train dataset
RMSE(y_pred = model_transform_y$fitted.values, y_true = sin(data_train3$Chance.of.Admit)^2)
## [1] 0.3149209
# RMSE of test dataset
RMSE(y_pred = model_pred, y_true = sin(data_test3$Chance.of.Admit)^2)
## [1] 1.995927
The test data produces a much larger RMSE than the training data, so we conclude that the model overfits. The final model is model_transform_y, with an adjusted R-squared of 86.24%.
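Note that model_transform_y predicts on the arcsine square root scale, so before comparing against the original 0-1 Chance of Admit values one would typically back-transform the predictions with sin()^2 as well. A minimal sketch under that assumption (the test split is rebuilt here because data_test3 was re-created for the log10-transformed predictors above; test_y, pred_y, rmse_train, and rmse_test are illustrative names):
# Rebuild the test split that matches admission_transform_y (index3 comes from the same seed).
test_y <- admission_transform_y[-index3, ]
pred_y <- predict(model_transform_y, newdata = test_y)
# Back-transform both predictions and actuals to the original 0-1 scale before scoring.
rmse_train <- RMSE(y_pred = sin(model_transform_y$fitted.values)^2,
                   y_true = sin(data_train3$Chance.of.Admit)^2)
rmse_test <- RMSE(y_pred = sin(pred_y)^2,
                  y_true = sin(test_y$Chance.of.Admit)^2)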