1 Intro

This dataset was created to predict graduate admissions from an Indian perspective. It was built to help students shortlist universities based on their profiles. The predicted output gives them a fair idea of their chances at a particular university.

1.1 Load the Packages

## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
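
The library() calls themselves are not echoed above. Judging from the functions used later (read_csv(), pivot_longer(), ggcorr(), bptest(), vif(), compare_performance(), MAPE()), the setup chunk presumably looked roughly like this sketch; the exact package list is an assumption.

# Packages assumed from the functions used below (setup chunk not echoed)
library(tidyverse)    # readr, tibble, dplyr, tidyr, ggplot2
library(GGally)       # ggcorr()
library(lmtest)       # bptest()
library(car)          # vif()
library(performance)  # compare_performance()
library(MLmetrics)    # MAPE()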

1.2 Load the Dataset

adm <- read_csv("Admission_Predict.csv")
## Rows: 400 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (9): Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR, CG...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
adm
## # A tibble: 400 x 9
##    `Serial No.` `GRE Score` `TOEFL Score` `University Rating`   SOP   LOR  CGPA
##           <dbl>       <dbl>         <dbl>               <dbl> <dbl> <dbl> <dbl>
##  1            1         337           118                   4   4.5   4.5  9.65
##  2            2         324           107                   4   4     4.5  8.87
##  3            3         316           104                   3   3     3.5  8   
##  4            4         322           110                   3   3.5   2.5  8.67
##  5            5         314           103                   2   2     3    8.21
##  6            6         330           115                   5   4.5   3    9.34
##  7            7         321           109                   3   3     4    8.2 
##  8            8         308           101                   2   3     4    7.9 
##  9            9         302           102                   1   2     1.5  8   
## 10           10         323           108                   3   3.5   3    8.6 
## # ... with 390 more rows, and 2 more variables: Research <dbl>,
## #   Chance of Admit <dbl>
str(adm)
## spec_tbl_df [400 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Serial No.       : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE Score        : num [1:400] 337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL Score      : num [1:400] 118 107 104 110 103 115 109 101 102 108 ...
##  $ University Rating: num [1:400] 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num [1:400] 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num [1:400] 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num [1:400] 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : num [1:400] 1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance of Admit  : num [1:400] 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Serial No.` = col_double(),
##   ..   `GRE Score` = col_double(),
##   ..   `TOEFL Score` = col_double(),
##   ..   `University Rating` = col_double(),
##   ..   SOP = col_double(),
##   ..   LOR = col_double(),
##   ..   CGPA = col_double(),
##   ..   Research = col_double(),
##   ..   `Chance of Admit` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

1.2.1 Content

The dataset contains several parameters that are considered important when applying for Master's programs. The parameters included are:

  1. GRE Score (out of 340)
  2. TOEFL Score (out of 120)
  3. University Rating (out of 5) - the rating of the university the student applied to
  4. Statement of Purpose and Letter of Recommendation strength (each out of 5)
  5. Undergraduate GPA (out of 10)
  6. Research Experience (either 0 or 1)
  7. Chance of Admit (ranging from 0 to 1)

The data has 400 rows and 9 columns. Serial No. is a unique identifier for each student's application. Our target variable is Chance of Admit, the probability that a student is admitted to the university.

2 Exploratory Data Analysis

2.1 Summary (Boxplot)

summary(adm[,-c(1,8)])
##    GRE Score      TOEFL Score    University Rating      SOP     
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.0  
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.5  
##  Median :317.0   Median :107.0   Median :3.000     Median :3.5  
##  Mean   :316.8   Mean   :107.4   Mean   :3.087     Mean   :3.4  
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.0  
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.0  
##       LOR             CGPA       Chance of Admit 
##  Min.   :1.000   Min.   :6.800   Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.6400  
##  Median :3.500   Median :8.610   Median :0.7300  
##  Mean   :3.453   Mean   :8.599   Mean   :0.7244  
##  3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:0.8300  
##  Max.   :5.000   Max.   :9.920   Max.   :0.9700

The summary above shows no obvious anomalies in the distributions. For every variable the median and the mean are close to each other, which suggests the scores contain no significant errors. We can visualize this better with the boxplots below.
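
As a quick numeric check of that claim, the gap between the mean and the median of each variable can be computed directly; this is a minimal sketch reusing the tidyverse verbs already loaded above.

# Rough skewness indicator: relative gap between the mean and the median
adm %>%
  select(-c("Serial No.", "Research")) %>%
  pivot_longer(everything(), names_to = "Parameter", values_to = "Score") %>%
  group_by(Parameter) %>%
  summarise(mean = mean(Score),
            median = median(Score),
            rel_gap = (mean - median) / median)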

adm %>% 
  select(-c("Serial No.", "Research")) %>% 
  pivot_longer(c("GRE Score","TOEFL Score","University Rating","SOP","LOR","CGPA","Chance of Admit"), 
               names_to = "Parameter", values_to = "Score") %>% 
  ggplot(aes(x = Parameter, y = Score)) + 
  geom_boxplot(fill = "maroon", color = "black") +
  facet_wrap(~Parameter, scales = "free" ) +
  theme_minimal() 

2.2 Data Distributions

Next we check whether the data are normally distributed. Normally distributed scores suggest that the sample represents the population, since well-constructed test scores tend to follow a normal distribution. First of all, we visualize the variables we want to test.

hist(adm$`GRE Score`, breaks = 1 + 3.322*log10(400))  # Sturges' rule: 1 + 3.322*log10(n) bins

hist(adm$`TOEFL Score`, breaks = 1 + 3.322*log10(400))

hist(adm$CGPA, breaks = 1 + 3.322*log10(400))

hist(adm$`Chance of Admit`, breaks = 1 + 3.322*log10(400))

From the histograms above, some of the variables appear normally distributed. To be sure, we will use the Kolmogorov-Smirnov test, which gives a definite value for whether or not the data are normally distributed.

In R we will use ks.test() on the values of each variable. Each variable is scaled first so that the standardized values are compared against the standard normal distribution referred to by "pnorm". Without scaling, the comparison would be distorted: for CGPA, for example, the scores range from roughly 6.5 to 10.0, and ks.test() would compare them against a normal distribution centered at 0, leading to a wrong interpretation.
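
Equivalently, instead of calling scale(), the sample mean and standard deviation can be passed to the reference distribution directly; a minimal sketch for CGPA, mirroring the tests below.

# Same test without scale(): compare CGPA against a normal distribution
# parameterised by its own sample mean and standard deviation
ks.test(adm$CGPA, "pnorm", mean = mean(adm$CGPA), sd = sd(adm$CGPA))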

ks.test(scale(adm$`GRE Score`), "pnorm")
## Warning in ks.test(scale(adm$`GRE Score`), "pnorm"): ties should not be present
## for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$`GRE Score`)
## D = 0.050095, p-value = 0.268
## alternative hypothesis: two-sided
ks.test(scale(adm$`TOEFL Score`), "pnorm")
## Warning in ks.test(scale(adm$`TOEFL Score`), "pnorm"): ties should not be
## present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$`TOEFL Score`)
## D = 0.057709, p-value = 0.1392
## alternative hypothesis: two-sided
ks.test(scale(adm$`University Rating`), "pnorm")
## Warning in ks.test(scale(adm$`University Rating`), "pnorm"): ties should not be
## present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$`University Rating`)
## D = 0.19549, p-value = 0.0000000000001055
## alternative hypothesis: two-sided
ks.test(scale(adm$`SOP`), "pnorm")
## Warning in ks.test(scale(adm$SOP), "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$SOP)
## D = 0.12438, p-value = 0.000008432
## alternative hypothesis: two-sided
ks.test(scale(adm$`LOR`), "pnorm")
## Warning in ks.test(scale(adm$LOR), "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$LOR)
## D = 0.12136, p-value = 0.00001528
## alternative hypothesis: two-sided
ks.test(scale(adm$`CGPA`), "pnorm")
## Warning in ks.test(scale(adm$CGPA), "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$CGPA)
## D = 0.044395, p-value = 0.4097
## alternative hypothesis: two-sided
ks.test(scale(adm$`Chance of Admit`), "pnorm")
## Warning in ks.test(scale(adm$`Chance of Admit`), "pnorm"): ties should not be
## present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  scale(adm$`Chance of Admit`)
## D = 0.049712, p-value = 0.2763
## alternative hypothesis: two-sided

From the statistical tests above, University Rating, SOP, and LOR are not normally distributed. This is acceptable, because these ratings are ordinal scores that are not naturally normally distributed. From here, we can conclude that all the variables can be used.

3 Correlation

Below are the correlations between the variables in the dataset.

adm %>% 
  select(-c("Serial No.")) %>% 
  ggcorr(label = T, label_size = 2.5, hjust = .8, layout.exp = 2)
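
For the exact coefficients behind the correlation heat map, the correlation matrix can also be printed numerically; a minimal sketch using base cor().

# Numeric correlation matrix, rounded to two decimals
adm %>% 
  select(-c("Serial No.")) %>% 
  cor() %>% 
  round(2)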

4 Regression Analysis

4.1 Stepwise Modelling

We will do the modelling with stepwise selection, because it searches for the combination of predictors that gives the best result.

adm <- adm %>% 
  select(-c("Serial No.")) 
model_all <- lm(formula = `Chance of Admit` ~ ., data = adm)
summary(model_all)
## 
## Call:
## lm(formula = `Chance of Admit` ~ ., data = adm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         -1.2594325  0.1247307 -10.097 < 0.0000000000000002 ***
## `GRE Score`          0.0017374  0.0005979   2.906              0.00387 ** 
## `TOEFL Score`        0.0029196  0.0010895   2.680              0.00768 ** 
## `University Rating`  0.0057167  0.0047704   1.198              0.23150    
## SOP                 -0.0033052  0.0055616  -0.594              0.55267    
## LOR                  0.0223531  0.0055415   4.034             0.000066 ***
## CGPA                 0.1189395  0.0122194   9.734 < 0.0000000000000002 ***
## Research             0.0245251  0.0079598   3.081              0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 0.00000000000000022

Below is the stepwise selection process, from which the best model will be obtained. The stepwise calculation uses the Akaike Information Criterion (AIC).

step(object = model_all, direction = "backward", trace = T)
## Start:  AIC=-2193.9
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + 
##     SOP + LOR + CGPA + Research
## 
##                       Df Sum of Sq    RSS     AIC
## - SOP                  1   0.00144 1.5962 -2195.5
## - `University Rating`  1   0.00584 1.6006 -2194.4
## <none>                             1.5948 -2193.9
## - `TOEFL Score`        1   0.02921 1.6240 -2188.6
## - `GRE Score`          1   0.03435 1.6291 -2187.4
## - Research             1   0.03862 1.6334 -2186.3
## - LOR                  1   0.06620 1.6609 -2179.6
## - CGPA                 1   0.38544 1.9802 -2109.3
## 
## Step:  AIC=-2195.54
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + 
##     LOR + CGPA + Research
## 
##                       Df Sum of Sq    RSS     AIC
## - `University Rating`  1   0.00464 1.6008 -2196.4
## <none>                             1.5962 -2195.5
## - `TOEFL Score`        1   0.02806 1.6242 -2190.6
## - `GRE Score`          1   0.03565 1.6318 -2188.7
## - Research             1   0.03769 1.6339 -2188.2
## - LOR                  1   0.06983 1.6660 -2180.4
## - CGPA                 1   0.38660 1.9828 -2110.8
## 
## Step:  AIC=-2196.38
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + LOR + CGPA + 
##     Research
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       1.6008 -2196.4
## - `TOEFL Score`  1   0.03292 1.6338 -2190.2
## - `GRE Score`    1   0.03638 1.6372 -2189.4
## - Research       1   0.03912 1.6400 -2188.7
## - LOR            1   0.09133 1.6922 -2176.2
## - CGPA           1   0.43201 2.0328 -2102.8
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     LOR + CGPA + Research, data = adm)
## 
## Coefficients:
##   (Intercept)    `GRE Score`  `TOEFL Score`            LOR           CGPA  
##     -1.298464       0.001782       0.003032       0.022776       0.121004  
##      Research  
##      0.024577

Below is the model obtained after the stepwise process.

model_stepwise <- lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
    LOR + CGPA + Research, data = adm)
summary(model_stepwise)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     LOR + CGPA + Research, data = adm)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   -1.2984636  0.1172905 -11.070 < 0.0000000000000002 ***
## `GRE Score`    0.0017820  0.0005955   2.992              0.00294 ** 
## `TOEFL Score`  0.0030320  0.0010651   2.847              0.00465 ** 
## LOR            0.0227762  0.0048039   4.741           0.00000297 ***
## CGPA           0.1210042  0.0117349  10.312 < 0.0000000000000002 ***
## Research       0.0245769  0.0079203   3.103              0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 0.00000000000000022

4.2 Statistical Tests for the Stepwise Model

4.2.1 Shapiro-Wilk residual normality test

shapiro.test(model_stepwise$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_stepwise$residuals
## W = 0.92193, p-value = 0.0000000000001443

4.2.2 Breusch-Pagan heteroscedasticity test

bptest(model_stepwise)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_stepwise
## BP = 22.428, df = 5, p-value = 0.0004341

4.2.3 Multicollinearity test

vif(model_stepwise)
##   `GRE Score` `TOEFL Score`           LOR          CGPA      Research 
##      4.585053      4.104255      1.829491      4.808767      1.530007

Here the residuals of the stepwise model are not normally distributed, and the Breusch-Pagan test (p < 0.05) also points to heteroscedasticity; the VIF values, however, are all below 10, so there is no multicollinearity. This calls for a closer review of the models that can be built, so we will next build models manually and compare them.
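
These assumption checks can also be inspected visually with standard base-R diagnostic plots; a minimal sketch, not part of the original analysis.

# Q-Q plot of the residuals (normality) and residuals vs fitted values
# (heteroscedasticity) for the stepwise model
qqnorm(model_stepwise$residuals)
qqline(model_stepwise$residuals)
plot(model_stepwise$fitted.values, model_stepwise$residuals,
     xlab = "Fitted values", ylab = "Residuals")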

4.3 Manual Modelling

In manual modelling, the choice of predictors has to take multicollinearity into account, meaning variables that are potentially highly correlated with one another should not be combined in the same model. For that we first look at the correlations between the variables.

adm %>% 
  ggcorr(label = T, label_size = 2.5, hjust = .8, layout.exp = 2)

To find a good model we therefore need predictor sets whose variables are unrelated to each other, or only weakly correlated. Several candidate combinations are:

  1. University Rating + Research
  2. SOP + Research
  3. LOR + Research
  4. CGPA
model1 <- lm(formula = `Chance of Admit` ~ `University Rating` + Research, data = adm)
summary(model1)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `University Rating` + Research, 
##     data = adm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32432 -0.04203  0.01938  0.05938  0.25256 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         0.455150   0.013446  33.850 < 0.0000000000000002 ***
## `University Rating` 0.072293   0.004564  15.840 < 0.0000000000000002 ***
## Research            0.084011   0.010474   8.021   0.0000000000000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09323 on 397 degrees of freedom
## Multiple R-squared:  0.5748, Adjusted R-squared:  0.5726 
## F-statistic: 268.3 on 2 and 397 DF,  p-value: < 0.00000000000000022
model2 <- lm(formula = `Chance of Admit` ~ SOP + Research, data = adm)
summary(model2)
## 
## Call:
## lm(formula = `Chance of Admit` ~ SOP + Research, data = adm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41635 -0.04078  0.01429  0.06385  0.24128 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 0.416965   0.017298  24.105 < 0.0000000000000002 ***
## SOP         0.075877   0.005402  14.047 < 0.0000000000000002 ***
## Research    0.090233   0.010913   8.268  0.00000000000000207 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09734 on 397 degrees of freedom
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.5341 
## F-statistic: 229.7 on 2 and 397 DF,  p-value: < 0.00000000000000022
model3 <- lm(formula = `Chance of Admit` ~ LOR + Research, data = adm)
summary(model3)
## 
## Call:
## lm(formula = `Chance of Admit` ~ LOR + Research, data = adm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29496 -0.05314  0.01368  0.06262  0.30474 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 0.377995   0.019262  19.624 <0.0000000000000002 ***
## LOR         0.084843   0.005843  14.521 <0.0000000000000002 ***
## Research    0.097599   0.010534   9.265 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09625 on 397 degrees of freedom
## Multiple R-squared:  0.5468, Adjusted R-squared:  0.5445 
## F-statistic: 239.5 on 2 and 397 DF,  p-value: < 0.00000000000000022
model4 <- lm(formula = `Chance of Admit` ~ `CGPA`, data = adm)
summary(model4)
## 
## Call:
## lm(formula = `Chance of Admit` ~ CGPA, data = adm)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274575 -0.030084  0.009443  0.041954  0.180734 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -1.07151    0.05034  -21.29 <0.0000000000000002 ***
## CGPA         0.20885    0.00584   35.76 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.762 
## F-statistic:  1279 on 1 and 398 DF,  p-value: < 0.00000000000000022

4.4 Statistical Tests for the Manual Models

4.4.1 Residual Normality

shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.94826, p-value = 0.0000000001304
shapiro.test(model2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.93461, p-value = 0.000000000003037
shapiro.test(model3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model3$residuals
## W = 0.97316, p-value = 0.0000009746
shapiro.test(model4$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model4$residuals
## W = 0.94782, p-value = 0.0000000001143

We find that the residuals of the manual models are not normally distributed either. The models will still be used and we will look for the best one, keeping in mind that the residuals are not normally distributed; this is discussed further under model interpretation.

4.4.2 Heteroscedasticity & Multicollinearity

bptest(model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 8.2593, df = 2, p-value = 0.01609
bptest(model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2
## BP = 1.0373, df = 2, p-value = 0.5953
bptest(model3)
## 
##  studentized Breusch-Pagan test
## 
## data:  model3
## BP = 15.637, df = 2, p-value = 0.0004023
bptest(model4)
## 
##  studentized Breusch-Pagan test
## 
## data:  model4
## BP = 19.562, df = 1, p-value = 0.000009737

The distribution of the errors/residuals of model2 shows no heteroscedasticity (Breusch-Pagan p = 0.5953): the errors spread constantly without forming a pattern, whereas models 1, 3, and 4 fail this test. To confirm, the multicollinearity check shows that the VIF values of model2 are below 10, meaning there is no strong correlation between its predictors, so model2 is the best of the manual models for describing the chance of admission of graduate-programme applicants.

vif(model2)
##      SOP Research 
## 1.245581 1.245581

5 Model Interpretation

5.1 Performance Comparison

The residual normality tests reflect variability in the outcome that the fitted models do not capture, and all the models we generated have residuals that deviate from normality. However, violating the residual normality assumption only causes real problems when the sample size is small, i.e. fewer than 30 observations; for large samples the assumption becomes far less relevant because of the central limit theorem.

Next, we compare the performance of the five models using compare_performance().

compare_performance(model_stepwise, model1, model2, model3, model4)
## # Comparison of Model Performance Indices
## 
## Name           | Model |       AIC |       BIC |    R2 | R2 (adj.) |  RMSE | Sigma
## ----------------------------------------------------------------------------------
## model_stepwise |    lm | -1059.225 | -1031.284 | 0.803 |     0.800 | 0.063 | 0.064
## model1         |    lm |  -758.032 |  -742.066 | 0.575 |     0.573 | 0.093 | 0.093
## model2         |    lm |  -723.496 |  -707.531 | 0.536 |     0.534 | 0.097 | 0.097
## model3         |    lm |  -732.498 |  -716.532 | 0.547 |     0.544 | 0.096 | 0.096
## model4         |    lm |  -993.228 |  -981.254 | 0.763 |     0.762 | 0.069 | 0.070

From the performance table, the stepwise model has the smallest AIC, meaning it loses the least information compared to the other models; it also has the highest adjusted R² and the smallest RMSE.
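
Note that the AIC values in this table differ from the ones printed during step(): step() reports extractAIC(), which drops additive constants from the log-likelihood, while compare_performance() reports the full AIC(). Both scales rank the models in the same order. A minimal sketch to retrieve the two values for the stepwise model:

AIC(model_stepwise)          # full AIC, comparable to the table above
extractAIC(model_stepwise)   # c(edf, AIC) on the scale printed by step()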

5.2 Prediction

To assess how good the model is, we make predictions on test (unseen) data. The test data here come from Admission_Predict_Ver1.1.csv, which contains 500 rows: the first 400 are the training data we modelled above, and the last 100 are unseen, so those are the rows we will use.

adm_test <- read.csv("Admission_Predict_Ver1.1.csv")
# read.csv() turns spaces in the headers into dots, so restore the original
# column names expected by the model formula
colnames(adm_test) <- c("Serial No.", "GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA", "Research", "Chance of Admit")
adm_test <- adm_test %>% tail(100)   # keep only the last 100 (unseen) rows
adm_pred <- predict(object = model_stepwise, newdata = adm_test)

The prediction uses the stepwise model because it is the best model. The predictions are then compared with the actual admission values. Using MAPE we obtain the prediction error, which is small at about 5.1% (MAPE = 0.0513), indicating that the prediction model we built performs well.

MAPE(y_pred = adm_pred, y_true = adm_test$`Chance of Admit`)
## [1] 0.05130283
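
For reference, MAPE() from MLmetrics is the mean absolute error relative to the true value, so the value above corresponds to an average error of roughly 5.1%; a minimal sketch of the same computation by hand:

# MAPE by hand: mean absolute proportional error of the predictions
mean(abs((adm_pred - adm_test$`Chance of Admit`) / adm_test$`Chance of Admit`))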
