Introduction

In this project, we will use the Graduate Admission dataset from Kaggle. Our objectives are to predict the Chance of Admit and to analyze the relationships between variables using a linear regression model.

First, we load the required libraries.

library(lubridate)
library(dplyr)
library(GGally)
library(ggplot2)
library(plotly)
library(glue)
library(scales)
library(MLmetrics)
library(lmtest)
library(car)
library(performance)

Data Preparation

Input Data

Read the data and store it in the admission object.

admission <- read.csv("Admission_Predict_Ver1.1.csv")

Preview the data:

head(admission)

Data Structure

Check the number of rows and columns.

dim(admission)
## [1] 500   9

The data contains 500 rows and 9 columns.

View all columns and their data types.

glimpse(admission)
## Rows: 500
## Columns: 9
## $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~

The dataset contains the following variables (a quick range check follows the list):

  1. GRE Scores (out of 340)

  2. TOEFL Scores (out of 120)

  3. University Rating (out of 5)

  4. Statement of Purpose (out of 5)

  5. Letter of Recommendation Strength (out of 5)

  6. Undergraduate GPA (out of 10)

  7. Research Experience (either 0 or 1)

  8. Chance of Admit (ranging from 0 to 1)
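
As an optional sanity check, we can confirm that the scored variables actually fall within these stated ranges; a minimal sketch:

# Each column of the result holds the observed min and max of one variable
sapply(admission[, c("GRE.Score", "TOEFL.Score", "CGPA", "Chance.of.Admit")], range)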

Pre-processing Data

We will convert Research, LOR, SOP, and University.Rating to factors, then drop the Serial.No. column.

admission <- admission %>% 
  select(-Serial.No.) %>% 
  mutate_at(vars(Research, LOR, SOP, University.Rating), as.factor)
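
For reference, mutate_at() is superseded in newer dplyr; on dplyr 1.0 or later the same factor conversion can be written with across(). An equivalent sketch, not an extra pipeline step:

# Same conversion with the newer across() syntax (dplyr >= 1.0)
admission %>%
  mutate(across(c(Research, LOR, SOP, University.Rating), as.factor))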

Next, check for missing values.

colSums(is.na(admission))
##         GRE.Score       TOEFL.Score University.Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance.of.Admit 
##                 0                 0                 0                 0

No missing values found.

Exploratory Data Analysis

Let’s see the summary of all columns.

summary(admission)
##    GRE.Score      TOEFL.Score    University.Rating      SOP          LOR    
##  Min.   :290.0   Min.   : 92.0   1: 34             4      :89   3      :99  
##  1st Qu.:308.0   1st Qu.:103.0   2:126             3.5    :88   4      :94  
##  Median :317.0   Median :107.0   3:162             3      :80   3.5    :86  
##  Mean   :316.5   Mean   :107.2   4:105             2.5    :64   4.5    :63  
##  3rd Qu.:325.0   3rd Qu.:112.0   5: 73             4.5    :63   2.5    :50  
##  Max.   :340.0   Max.   :120.0                     2      :43   5      :50  
##                                                    (Other):73   (Other):58  
##       CGPA       Research Chance.of.Admit 
##  Min.   :6.800   0:220    Min.   :0.3400  
##  1st Qu.:8.127   1:280    1st Qu.:0.6300  
##  Median :8.560            Median :0.7200  
##  Mean   :8.576            Mean   :0.7217  
##  3rd Qu.:9.040            3rd Qu.:0.8200  
##  Max.   :9.920            Max.   :0.9700  
## 

We will use Chance of Admit as the target variable, so we need to look at its distribution.

ggplot(admission, aes(x = Chance.of.Admit, fill = ..count..)) +
  geom_histogram() +
  ggtitle("Chance of Admit Histogram") +
  ylab("Frequency") +
  xlab("Chance of Admit") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) # after theme_minimal() so the centred title is not overridden

ggplot(admission, aes(y = Chance.of.Admit)) +
  geom_boxplot(colour = "darkblue", outlier.colour = "red") +
  labs(title = "Chance of Admit Boxplot",
       x = "",
       y = "Chance of Admit") +
  theme_minimal()

Now let's look at the relationships between Chance of Admit and the other numeric variables.

ggplot(admission, aes(x=Chance.of.Admit, y=GRE.Score)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

ggplot(admission, aes(x=Chance.of.Admit, y=TOEFL.Score)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

ggplot(admission, aes(x=Chance.of.Admit, y=CGPA)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

ggcorr(admission, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2, low = "black", high = "blue")

From the results above, GRE Score, TOEFL Score, and CGPA each show a linear relationship and a strong correlation with Chance of Admit.
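
For the exact numbers behind the heatmap, a quick sketch (using where() from dplyr 1.0+):

# Correlation of each numeric variable with the target
cor(admission %>% select(where(is.numeric)))[, "Chance.of.Admit"]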

Create Model

Before we create the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to fit the linear regression models, and the test dataset will be used as a comparison to see whether the model overfits or cannot predict new data. We will use 80% of the data as training data and the rest as testing data.

RNGkind(sample.kind = "Rounding") # reproduce the pre-R-3.6.0 sampling algorithm
set.seed(100)
index <- sample(nrow(admission), nrow(admission) * 0.8)
data_train <- admission[index, ]
data_test <- admission[-index, ]
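
As a quick check, 80% of the 500 rows should give 400 training rows and 100 testing rows:

nrow(data_train) # 400
nrow(data_test)  # 100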

Model - Strong Correlation

For the first model, we will use the variables that have a strong correlation with Chance of Admit; we name it model_corr.

model_corr <- lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, data_train)
summary(model_corr)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, 
##     data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.249482 -0.022160  0.004915  0.034195  0.138405 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.5381820  0.0959634 -16.029 < 0.0000000000000002 ***
## CGPA         0.1536305  0.0095068  16.160 < 0.0000000000000002 ***
## TOEFL.Score  0.0032507  0.0009295   3.497             0.000523 ***
## GRE.Score    0.0018777  0.0005341   3.515             0.000490 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0575 on 396 degrees of freedom
## Multiple R-squared:  0.8377, Adjusted R-squared:  0.8365 
## F-statistic: 681.5 on 3 and 396 DF,  p-value: < 0.00000000000000022
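
To illustrate how the fitted equation is used, here is a sketch predicting the chance for a hypothetical applicant; the input values below are made up for illustration:

# Hypothetical applicant: CGPA 9.0, TOEFL 110, GRE 320
# equivalently -1.53818 + 0.15363*9.0 + 0.00325*110 + 0.00188*320, roughly 0.80
predict(model_corr, newdata = data.frame(CGPA = 9.0, TOEFL.Score = 110, GRE.Score = 320))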

Model - All variables

Next, create a model that includes all variables, model_all.

model_all <- lm(formula = Chance.of.Admit ~ ., data_train)
summary(model_all)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.232413 -0.024831  0.005917  0.030587  0.191986 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)        -1.2902487  0.1307190  -9.870 < 0.0000000000000002 ***
## GRE.Score           0.0014724  0.0005603   2.628              0.00894 ** 
## TOEFL.Score         0.0028758  0.0009394   3.061              0.00236 ** 
## University.Rating2 -0.0214283  0.0137622  -1.557              0.12030    
## University.Rating3 -0.0118871  0.0145173  -0.819              0.41341    
## University.Rating4 -0.0124343  0.0167728  -0.741              0.45895    
## University.Rating5  0.0008585  0.0187909   0.046              0.96359    
## SOP1.5             -0.0064189  0.0287011  -0.224              0.82315    
## SOP2               -0.0138282  0.0280494  -0.493              0.62231    
## SOP2.5              0.0266290  0.0283697   0.939              0.34852    
## SOP3                0.0060646  0.0286415   0.212              0.83242    
## SOP3.5              0.0030335  0.0290199   0.105              0.91680    
## SOP4                0.0083375  0.0296008   0.282              0.77835    
## SOP4.5              0.0190720  0.0306618   0.622              0.53431    
## SOP5                0.0189152  0.0314845   0.601              0.54835    
## LOR1.5              0.0136770  0.0641854   0.213              0.83138    
## LOR2                0.0458054  0.0623261   0.735              0.46284    
## LOR2.5              0.0658993  0.0624474   1.055              0.29198    
## LOR3                0.0580758  0.0628132   0.925              0.35578    
## LOR3.5              0.0658383  0.0629702   1.046              0.29644    
## LOR4                0.0667821  0.0629994   1.060              0.28981    
## LOR4.5              0.0692877  0.0634818   1.091              0.27577    
## LOR5                0.0816670  0.0636553   1.283              0.20030    
## CGPA                0.1361953  0.0106278  12.815 < 0.0000000000000002 ***
## Research1           0.0165966  0.0072530   2.288              0.02268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05627 on 375 degrees of freedom
## Multiple R-squared:  0.8529, Adjusted R-squared:  0.8435 
## F-statistic: 90.57 on 24 and 375 DF,  p-value: < 0.00000000000000022

Model - Feature Selection

We can use stepwise regression to find a combination of predictors that produces the best model based on the AIC value. There are three directions for stepwise regression: forward, backward, and both. We will use direction both and name the result model_stepwise.

model_none <- lm(formula = Chance.of.Admit ~ 1, data = data_train)
model_stepwise <- step(
  object = model_none,
  direction = "both",
  scope = list(upper = model_all),
  trace = FALSE)
summary(model_stepwise)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score + 
##     Research + SOP, data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.243639 -0.024933  0.004193  0.032739  0.138656 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.3023257  0.1178324 -11.052 < 0.0000000000000002 ***
## CGPA         0.1425699  0.0101725  14.015 < 0.0000000000000002 ***
## GRE.Score    0.0014524  0.0005561   2.612              0.00936 ** 
## TOEFL.Score  0.0029205  0.0009287   3.145              0.00179 ** 
## Research1    0.0176687  0.0071646   2.466              0.01409 *  
## SOP1.5      -0.0009313  0.0261934  -0.036              0.97166    
## SOP2        -0.0083335  0.0251307  -0.332              0.74037    
## SOP2.5       0.0283377  0.0245313   1.155              0.24874    
## SOP3         0.0129597  0.0246596   0.526              0.59951    
## SOP3.5       0.0126892  0.0248057   0.512              0.60926    
## SOP4         0.0221039  0.0254353   0.869              0.38537    
## SOP4.5       0.0351652  0.0263658   1.334              0.18307    
## SOP5         0.0417065  0.0270883   1.540              0.12446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05651 on 387 degrees of freedom
## Multiple R-squared:  0.8468, Adjusted R-squared:  0.8421 
## F-statistic: 178.3 on 12 and 387 DF,  p-value: < 0.00000000000000022

Model Evaluation

Model Performance

performance <- compare_performance(model_corr, model_all, model_stepwise)
as.data.frame(performance)

Based on the adjusted R-squared, the results of the models are not much different. The stepwise model can be chosen because it uses only a selected subset of the variables rather than all of them, and its AIC value is the lowest of the three.
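
To see the AIC values directly (lower is better), a quick sketch:

AIC(model_corr, model_all, model_stepwise)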

Assumptions

Linearity

As mentioned before, all numeric variables have a strong correlation with Chance of Admit, so the linearity assumption is fulfilled. To make sure, we can run a statistical test with cor.test.

cor.test(admission$Chance.of.Admit, admission$GRE.Score)
## 
##  Pearson's product-moment correlation
## 
## data:  admission$Chance.of.Admit and admission$GRE.Score
## t = 30.862, df = 498, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7779406 0.8384601
## sample estimates:
##       cor 
## 0.8103506
cor.test(admission$Chance.of.Admit, admission$CGPA)
## 
##  Pearson's product-moment correlation
## 
## data:  admission$Chance.of.Admit and admission$CGPA
## t = 41.855, df = 498, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8613745 0.9004286
## sample estimates:
##       cor 
## 0.8824126
cor.test(admission$Chance.of.Admit, admission$TOEFL.Score)
## 
##  Pearson's product-moment correlation
## 
## data:  admission$Chance.of.Admit and admission$TOEFL.Score
## t = 28.972, df = 498, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7571359 0.8227603
## sample estimates:
##       cor 
## 0.7922276

Linearity hypothesis test:

  • H0: the correlation is not significant (cor = 0)
  • H1: the correlation is significant (cor != 0)

All variables have p-value < alpha (0.05), so we reject H0: the correlations are significant.

Normality of Residuals

hist(model_stepwise$residuals)

shapiro.test(model_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_all$residuals
## W = 0.938, p-value = 0.000000000007347

Shapiro-Wilk hypothesis test:

  • H0: Variable is normally distributed
  • H1: Variable is not normally distributed

As we can see, p-value < alpha (0.05), so the residuals are not normally distributed.
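
For a visual check alongside the test, a quick sketch of a normal Q-Q plot of the residuals:

# Points far from the reference line indicate departure from normality
qqnorm(model_stepwise$residuals)
qqline(model_stepwise$residuals, col = "red")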

Homoscedasticity of Residuals

bptest(model_stepwise)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_stepwise
## BP = 36.625, df = 12, p-value = 0.000257

Breusch-Pagan hypothesis test:

  • H0: Homoscedasticity is present (the residuals are distributed with equal variance)
  • H1: Heteroscedasticity is present (the residuals are not distributed with equal variance)

The result shows p-value < alpha (0.05), so the residuals are not distributed with equal variance.
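
A residuals-versus-fitted plot gives a visual counterpart to the test; a minimal sketch:

# A funnel or fan shape in this plot suggests heteroscedasticity
plot(model_stepwise$fitted.values, model_stepwise$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)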

No Multicollinearity

vif(model_stepwise)
##                 GVIF Df GVIF^(1/(2*Df))
## CGPA        4.889091  1        2.211129
## GRE.Score   4.978436  1        2.231241
## TOEFL.Score 3.991265  1        1.997815
## Research    1.589583  1        1.260786
## SOP         2.365755  8        1.055293

VIF (Variance Inflation Factor) test:

  • VIF value > 10: there are multicollinear predictors in the model
  • VIF value < 10: there are no multicollinear predictors in the model

All VIF values are less than 10, so there are no multicollinear predictors in the model.

Model Improvement

Remove Outlier

The assumptions of normality and homoscedasticity are not fulfilled. We will try removing outliers and transforming the data. Based on the earlier exploratory analysis, Chance of Admit has outliers; we can delete them and re-create the model.

quartiles <- quantile(admission$Chance.of.Admit, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(admission$Chance.of.Admit)
 
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 
 
admission_no_outlier <- subset(admission, admission$Chance.of.Admit > Lower & admission$Chance.of.Admit < Upper)
 
dim(admission_no_outlier)
## [1] 498   8
RNGkind(sample.kind = "Rounding")
set.seed(100)
index2 <- sample(nrow(admission_no_outlier), nrow(admission_no_outlier) *0.8)
data_train2 <- admission_no_outlier[index2, ]
data_test2 <- admission_no_outlier[-index2, ]
model_no_outlier <- lm(formula = Chance.of.Admit ~ ., data_train2)
summary(model_no_outlier)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.197975 -0.023937  0.005178  0.029699  0.137988 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)        -1.2685265  0.1192678 -10.636 < 0.0000000000000002 ***
## GRE.Score           0.0016517  0.0005214   3.168             0.001661 ** 
## TOEFL.Score         0.0024960  0.0009125   2.735             0.006527 ** 
## University.Rating2 -0.0213306  0.0144212  -1.479             0.139952    
## University.Rating3 -0.0111655  0.0153660  -0.727             0.467903    
## University.Rating4 -0.0078303  0.0169242  -0.463             0.643870    
## University.Rating5  0.0059444  0.0189718   0.313             0.754208    
## SOP1.5             -0.0180075  0.0283133  -0.636             0.525161    
## SOP2               -0.0216088  0.0280433  -0.771             0.441458    
## SOP2.5              0.0155228  0.0282648   0.549             0.583202    
## SOP3               -0.0002045  0.0282709  -0.007             0.994233    
## SOP3.5             -0.0071351  0.0289588  -0.246             0.805518    
## SOP4               -0.0029007  0.0294819  -0.098             0.921675    
## SOP4.5              0.0010225  0.0303911   0.034             0.973180    
## SOP5                0.0036080  0.0311331   0.116             0.907802    
## LOR2                0.0719751  0.0226082   3.184             0.001577 ** 
## LOR2.5              0.0873862  0.0224399   3.894             0.000117 ***
## LOR3                0.0852572  0.0218074   3.910             0.000110 ***
## LOR3.5              0.1001953  0.0221567   4.522           0.00000823 ***
## LOR4                0.0989308  0.0223310   4.430           0.00001238 ***
## LOR4.5              0.1114681  0.0233991   4.764           0.00000272 ***
## LOR5                0.1159061  0.0239959   4.830           0.00000199 ***
## CGPA                0.1291371  0.0104132  12.401 < 0.0000000000000002 ***
## Research1           0.0197347  0.0070021   2.818             0.005083 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05518 on 374 degrees of freedom
## Multiple R-squared:  0.8555, Adjusted R-squared:  0.8466 
## F-statistic: 96.27 on 23 and 374 DF,  p-value: < 0.00000000000000022

Check Assumption

The linearity assumption is fulfilled; we still need to check the other assumptions.

Normality of Residuals
hist(model_no_outlier$residuals)

shapiro.test(model_no_outlier$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_no_outlier$residuals
## W = 0.94808, p-value = 0.0000000001337

Result: p-value < alpha (0.05), the residuals are not normally distributed.

Homoscedasticity of Residuals
bptest(model_no_outlier)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_no_outlier
## BP = 60.367, df = 23, p-value = 0.0000338

Result: p-value < alpha (0.05), the residuals are not distributed with equal variance.

No Multicollinearity
vif(model_no_outlier)
##                       GVIF Df GVIF^(1/(2*Df))
## GRE.Score         4.535110  1        2.129580
## TOEFL.Score       3.938226  1        1.984496
## University.Rating 5.337756  4        1.232877
## SOP               7.292455  8        1.132217
## LOR               3.648910  7        1.096868
## CGPA              5.183138  1        2.276651
## Research          1.580847  1        1.257317

Result: all VIF values < 10, there are no multicollinear predictors in the model.

Transform Target-Var (Arcsin)
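
The arcsin-square-root transform asin(sqrt(p)) maps a proportion p in [0, 1] onto [0, pi/2] and is a classic variance-stabilizing transform for proportion-like targets; the inverse is sin(y)^2. A minimal sketch:

p <- c(0.34, 0.72, 0.97) # example proportions within the target's observed range
y <- asin(sqrt(p))       # forward transform
sin(y)^2                 # back-transform recovers p exactly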

Subset Data and Create Model

admission_transform_y <- admission %>%
  select(Chance.of.Admit, CGPA, GRE.Score, LOR, TOEFL.Score, Research) %>%
  mutate(Chance.of.Admit = asin(sqrt(Chance.of.Admit)))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index3 <- sample(nrow(admission_transform_y), nrow(admission_transform_y) *0.8)
data_train3 <- admission_transform_y[index3, ]
data_test3 <- admission_transform_y[-index3, ]
model_transform_y <- lm(formula = Chance.of.Admit ~ ., data_train3)
summary(model_transform_y)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train3)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.264200 -0.032878  0.005827  0.040560  0.158164 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.4778156  0.1370978 -10.779 < 0.0000000000000002 ***
## CGPA         0.1658955  0.0112438  14.754 < 0.0000000000000002 ***
## GRE.Score    0.0019023  0.0006178   3.079              0.00222 ** 
## LOR1.5      -0.0226055  0.0669269  -0.338              0.73572    
## LOR2         0.0114515  0.0638982   0.179              0.85786    
## LOR2.5       0.0128182  0.0639578   0.200              0.84126    
## LOR3         0.0082470  0.0637610   0.129              0.89715    
## LOR3.5       0.0177819  0.0640222   0.278              0.78136    
## LOR4         0.0196596  0.0641363   0.307              0.75937    
## LOR4.5       0.0352651  0.0647730   0.544              0.58645    
## LOR5         0.0574063  0.0649549   0.884              0.37736    
## TOEFL.Score  0.0042171  0.0010210   4.130            0.0000444 ***
## Research1    0.0203053  0.0079378   2.558              0.01091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06283 on 387 degrees of freedom
## Multiple R-squared:  0.8665, Adjusted R-squared:  0.8624 
## F-statistic: 209.3 on 12 and 387 DF,  p-value: < 0.00000000000000022

Check Assumption

Normality of Residuals
hist(model_transform_y$residuals)

shapiro.test(model_transform_y$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_transform_y$residuals
## W = 0.961, p-value = 0.00000000812

Result: p-value < alpha (0.05), the residuals are not normally distributed.

Homoscedasticity of Residuals
bptest(model_transform_y)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_transform_y
## BP = 19.375, df = 12, p-value = 0.07987

Result: p-value > alpha (0.05), the residuals are distributed with equal variance.

No Multicollinearity
vif(model_transform_y)
##                 GVIF Df GVIF^(1/(2*Df))
## CGPA        4.832159  1        2.198217
## GRE.Score   4.969697  1        2.229282
## LOR         1.787768  8        1.036978
## TOEFL.Score 3.902233  1        1.975407
## Research    1.578478  1        1.256375

Result: all VIF values < 10, there are no multicollinear predictors in the model.

Transform Predictor-Var (log10)

Subset Data and Create Model

admission_transform_x <- admission %>%
  select(Chance.of.Admit, CGPA, GRE.Score, LOR, TOEFL.Score, Research) %>%
  mutate_at(vars(CGPA, GRE.Score, TOEFL.Score), ~log10(.)) %>%
  mutate(Chance.of.Admit = asin(sqrt(Chance.of.Admit)))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index3 <- sample(nrow(admission_transform_x), nrow(admission_transform_x) *0.8)
data_train3 <- admission_transform_x[index3, ]
data_test3 <- admission_transform_x[-index3, ]
model_transform_x <- lm(formula = Chance.of.Admit ~ ., data_train3)
summary(model_transform_x)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train3)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.267603 -0.032860  0.005186  0.042480  0.155471 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -7.837340   0.882288  -8.883 < 0.0000000000000002 ***
## CGPA         3.155180   0.219525  14.373 < 0.0000000000000002 ***
## GRE.Score    1.476373   0.449916   3.281              0.00113 ** 
## LOR1.5      -0.035447   0.067993  -0.521              0.60243    
## LOR2         0.000532   0.064903   0.008              0.99346    
## LOR2.5      -0.002727   0.064998  -0.042              0.96656    
## LOR3        -0.007060   0.064814  -0.109              0.91331    
## LOR3.5       0.002681   0.065093   0.041              0.96717    
## LOR4         0.005237   0.065213   0.080              0.93604    
## LOR4.5       0.023704   0.065858   0.360              0.71909    
## LOR5         0.046654   0.066042   0.706              0.48034    
## TOEFL.Score  1.092656   0.252819   4.322            0.0000197 ***
## Research1    0.021121   0.008055   2.622              0.00909 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 387 degrees of freedom
## Multiple R-squared:  0.8625, Adjusted R-squared:  0.8582 
## F-statistic: 202.2 on 12 and 387 DF,  p-value: < 0.00000000000000022

Check Assumption

Normality of Residuals
hist(model_transform_x$residuals)

shapiro.test(model_transform_x$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_transform_x$residuals
## W = 0.96352, p-value = 0.00000002021

Result: p-value < alpha (0.05), the residuals are not normally distributed.

Homoscedasticity of Residuals
bptest(model_transform_x)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_transform_x
## BP = 18.307, df = 12, p-value = 0.1067

Result: p-value > alpha (0.05), the residuals are distributed with equal variance.

No Multicollinearity
vif(model_transform_x)
##                 GVIF Df GVIF^(1/(2*Df))
## CGPA        4.639310  1        2.153906
## GRE.Score   4.840780  1        2.200177
## LOR         1.790516  8        1.037077
## TOEFL.Score 3.816483  1        1.953582
## Research    1.577807  1        1.256108

Result: all VIF values < 10, there are no multicollinear predictors in the model.

Transform Predictor-Var (sqrt)

Subset Data and Create Model

admission_transform_x2 <- admission%>%
  select(Chance.of.Admit, CGPA, GRE.Score, LOR, TOEFL.Score, Research) %>%
  mutate_at(vars(CGPA, GRE.Score, TOEFL.Score), ~sqrt(.)) %>%
  mutate(Chance.of.Admit = asin(sqrt(Chance.of.Admit)))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index4 <- sample(nrow(admission_transform_x2), nrow(admission_transform_x2) *0.8)
data_train4 <- admission_transform_x2[index4, ]
data_test4 <- admission_transform_x2[-index4, ]
model_transform_x2 <- lm(formula = Chance.of.Admit ~ ., data_train4)
summary(model_transform_x2)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = data_train4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.265930 -0.033087  0.004758  0.042121  0.155499 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -3.9567095  0.2561513 -15.447 < 0.0000000000000002 ***
## CGPA         0.9554399  0.0655197  14.582 < 0.0000000000000002 ***
## GRE.Score    0.0696975  0.0219679   3.173              0.00163 ** 
## LOR1.5      -0.0290496  0.0674169  -0.431              0.66678    
## LOR2         0.0059692  0.0643597   0.093              0.92615    
## LOR2.5       0.0049894  0.0644364   0.077              0.93832    
## LOR3         0.0005006  0.0642461   0.008              0.99379    
## LOR3.5       0.0101154  0.0645159   0.157              0.87549    
## LOR4         0.0123081  0.0646329   0.190              0.84907    
## LOR4.5       0.0293128  0.0652740   0.449              0.65363    
## LOR5         0.0518538  0.0654568   0.792              0.42874    
## TOEFL.Score  0.0895337  0.0211750   4.228            0.0000294 ***
## Research1    0.0206827  0.0079919   2.588              0.01002 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06326 on 387 degrees of freedom
## Multiple R-squared:  0.8647, Adjusted R-squared:  0.8605 
## F-statistic:   206 on 12 and 387 DF,  p-value: < 0.00000000000000022

Check Assumption

Normality of Residuals
hist(model_transform_x2$residuals)

shapiro.test(model_transform_x2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_transform_x2$residuals
## W = 0.9621, p-value = 0.00000001205

Result: p-value < alpha (0.05), the residuals are not normally distributed.

Homoscedasticity of Residuals
bptest(model_transform_x2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_transform_x2
## BP = 18.865, df = 12, p-value = 0.09185

Result: p-value > alpha (0.05), the residuals are distributed with equal variance.

No Multicollinearity
vif(model_transform_x2)
##                 GVIF Df GVIF^(1/(2*Df))
## CGPA        4.738605  1        2.176834
## GRE.Score   4.906441  1        2.215049
## LOR         1.789409  8        1.037037
## TOEFL.Score 3.859976  1        1.964682
## Research    1.578227  1        1.256275

Result: all VIF values < 10, there are no multicollinear predictors in the model.

Performance and Prediction

performance_model <- compare_performance(model_no_outlier, model_transform_y, model_transform_x, model_transform_x2)
as.data.frame(performance_model)

We created model_no_outlier, model_transform_y, model_transform_x, and model_transform_x2. model_no_outlier still fails both the normality and the homoscedasticity tests. The other models fulfil homoscedasticity but unfortunately not normality. Meanwhile, we can pick the best model we have based on the adjusted R-squared above: we will use model_transform_y for prediction.

model_pred <- predict(model_transform_y, newdata = data_test3 %>% select(-Chance.of.Admit))

# RMSE of train dataset
RMSE(y_pred = (model_transform_y$fitted.values), y_true = sin(data_train3$Chance.of.Admit)^2)
## [1] 0.3149209
# RMSE of test dataset
RMSE(y_pred = (model_pred), y_true = sin(data_test3$Chance.of.Admit)^2)
## [1] 1.995927

It turns out that the test dataset produces a much larger RMSE than the train dataset, so it can be concluded that the model is overfit.
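
One caveat: model_transform_y$fitted.values and model_pred are still on the arcsin-square-root scale, while the y_true values above were back-transformed to the original 0-1 scale, and data_test3 was reassigned in the log10 section. A hedged sketch of an apples-to-apples check on the original scale, using the hypothetical names data_test_y and pred_test (index3 is unchanged because every split reuses set.seed(100) on 500 rows):

# Rebuild the transform-y test split, then compare both vectors on the 0-1 scale
data_test_y <- admission_transform_y[-index3, ]
pred_test <- predict(model_transform_y, newdata = data_test_y)
RMSE(y_pred = sin(pred_test)^2, y_true = sin(data_test_y$Chance.of.Admit)^2)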

Conclusion

  1. The best model we created is model_transform_y, with an adjusted R-squared of 86.24%.
  2. The variables significant to Chance of Admit are GRE.Score, TOEFL.Score, CGPA, and Research. We can conclude that a student needs a good GRE.Score, TOEFL.Score, and CGPA, as well as research experience, in order to have a higher Chance of Admit.
  3. From the results, we know that the normality assumption is still not fulfilled even though we transformed the data and removed the outliers. This may be due to the small size of the dataset.
  4. For future projects, we can try adjusting the train-test proportion, increasing the size of the dataset, using other transformation techniques, or analyzing the data with other methods.