This is an analysis of the admission data. The business problem I want to address is which factors influence TOEFL.Score. I first analyze TOEFL.Score against CGPA to build a simple linear regression. I then use multiple linear regression to build a model that predicts TOEFL.Score from the factors that affect it.

1 Import the required libraries

# import libs
library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)
library(performance)

2 Data preparation + EDA

2.1 Read the data

admission <- read.csv("dataset/Admission_Predict.csv")
admission
glimpse(admission)
#> Rows: 400
#> Columns: 9
#> $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
#> $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
#> $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
#> $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
#> $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
#> $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
#> $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
#> $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
#> $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...
admission_new <- admission %>% 
  select(-Serial.No.)

admission_new

2.2 Check missing data

colSums(is.na(admission_new))
#>         GRE.Score       TOEFL.Score University.Rating               SOP 
#>                 0                 0                 0                 0 
#>               LOR              CGPA          Research   Chance.of.Admit 
#>                 0                 0                 0                 0

2.3 Check data correlation

ggcorr(admission_new, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

3 Building the linear regression

chance_of_admit <- lm(formula = TOEFL.Score ~ CGPA, data = admission_new)
chance_of_admit
#> 
#> Call:
#> lm(formula = TOEFL.Score ~ CGPA, data = admission_new)
#> 
#> Coefficients:
#> (Intercept)         CGPA  
#>      34.905        8.432
plot(admission_new$CGPA, admission_new$TOEFL.Score)
abline(chance_of_admit$coefficients[1], chance_of_admit$coefficients[2], col = "red")
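
To make the fitted line concrete, the sketch below plugs two CGPA values into the printed coefficients. The helper name predict_toefl and the CGPA values 8.0 and 9.0 are illustrative assumptions, and the coefficients are the rounded values from the output above, so results may differ slightly from predict() on the real model.

```r
# Coefficients copied from the lm() output above (rounded as printed)
intercept  <- 34.905
slope_cgpa <- 8.432

# Hypothetical helper: predicted TOEFL.Score for a given CGPA
predict_toefl <- function(cgpa) intercept + slope_cgpa * cgpa

# Illustrative CGPA values (assumed, not taken from specific rows)
cgpa_values <- c(8.0, 9.0)
predicted   <- predict_toefl(cgpa_values)

print(data.frame(CGPA = cgpa_values, predicted_TOEFL = predicted))

# The difference between the two predictions equals the slope:
# each extra CGPA point adds about 8.4 TOEFL points under this model
print(diff(predicted))
```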

4 Model Tuning

chance_of_admit_none <- lm(formula = TOEFL.Score ~ 1, data = admission_new)
chance_of_admit_all <- lm(formula = TOEFL.Score ~ . , data = admission_new)
chance_of_admit_backward <- step(object = chance_of_admit_all, direction = "backward")
#> Start:  AIC=867.94
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA + 
#>     Research + Chance.of.Admit
#> 
#>                     Df Sum of Sq    RSS    AIC
#> - Research           1     16.14 3381.6 867.86
#> <none>                           3365.5 867.94
#> - LOR                1     28.61 3394.1 869.33
#> - University.Rating  1     41.76 3407.2 870.88
#> - SOP                1     55.61 3421.1 872.50
#> - Chance.of.Admit    1     61.65 3427.1 873.20
#> - CGPA               1    158.48 3523.9 884.35
#> - GRE.Score          1    742.27 4107.7 945.67
#> 
#> Step:  AIC=867.86
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA + 
#>     Chance.of.Admit
#> 
#>                     Df Sum of Sq    RSS    AIC
#> <none>                           3381.6 867.86
#> - LOR                1     28.52 3410.1 869.22
#> - University.Rating  1     41.43 3423.0 870.73
#> - SOP                1     51.54 3433.1 871.91
#> - Chance.of.Admit    1     53.92 3435.5 872.18
#> - CGPA               1    165.45 3547.1 884.96
#> - GRE.Score          1    736.09 4117.7 944.63
chance_of_admit_forward <- step(object = chance_of_admit_none, direction = "forward", 
                      scope = list(lower=chance_of_admit_none, upper=chance_of_admit_all))
#> Start:  AIC=1443.62
#> TOEFL.Score ~ 1
#> 
#>                     Df Sum of Sq     RSS     AIC
#> + GRE.Score          1   10272.3  4426.4  965.55
#> + CGPA               1   10087.4  4611.4  981.93
#> + Chance.of.Admit    1    9210.6  5488.2 1051.56
#> + University.Rating  1    7111.9  7586.8 1181.08
#> + SOP                1    6363.7  8335.1 1218.71
#> + LOR                1    4737.5  9961.2 1290.00
#> + Research           1    3527.1 11171.6 1335.87
#> <none>                           14698.8 1443.62
#> 
#> Step:  AIC=965.55
#> TOEFL.Score ~ GRE.Score
#> 
#>                     Df Sum of Sq    RSS    AIC
#> + CGPA               1    836.91 3589.5 883.72
#> + Chance.of.Admit    1    601.13 3825.3 909.17
#> + SOP                1    499.48 3926.9 919.66
#> + University.Rating  1    494.57 3931.9 920.16
#> + LOR                1    220.25 4206.2 947.14
#> <none>                           4426.4 965.55
#> + Research           1      0.48 4425.9 967.51
#> 
#> Step:  AIC=883.72
#> TOEFL.Score ~ GRE.Score + CGPA
#> 
#>                     Df Sum of Sq    RSS    AIC
#> + University.Rating  1   101.123 3488.4 874.29
#> + SOP                1    95.701 3493.8 874.91
#> + Chance.of.Admit    1    70.250 3519.3 877.82
#> <none>                           3589.5 883.72
#> + LOR                1     4.423 3585.1 885.23
#> + Research           1     3.105 3586.4 885.38
#> 
#> Step:  AIC=874.29
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating
#> 
#>                   Df Sum of Sq    RSS    AIC
#> + Chance.of.Admit  1    47.724 3440.7 870.78
#> + SOP              1    36.794 3451.6 872.05
#> <none>                         3488.4 874.29
#> + Research         1     5.965 3482.4 875.61
#> + LOR              1     1.561 3486.8 876.11
#> 
#> Step:  AIC=870.78
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit
#> 
#>            Df Sum of Sq    RSS    AIC
#> + SOP       1   30.5341 3410.1 869.22
#> <none>                  3440.7 870.78
#> + Research  1   12.6319 3428.0 871.31
#> + LOR       1    7.5147 3433.1 871.91
#> 
#> Step:  AIC=869.22
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit + 
#>     SOP
#> 
#>            Df Sum of Sq    RSS    AIC
#> + LOR       1    28.524 3381.6 867.86
#> <none>                  3410.1 869.22
#> + Research  1    16.056 3394.1 869.33
#> 
#> Step:  AIC=867.86
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit + 
#>     SOP + LOR
#> 
#>            Df Sum of Sq    RSS    AIC
#> <none>                  3381.6 867.86
#> + Research  1    16.139 3365.5 867.94
chance_of_admit_both1 <- step(object = chance_of_admit_all, direction = "both", 
                    scope = list(lower=chance_of_admit_none, upper = chance_of_admit_all))
#> Start:  AIC=867.94
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA + 
#>     Research + Chance.of.Admit
#> 
#>                     Df Sum of Sq    RSS    AIC
#> - Research           1     16.14 3381.6 867.86
#> <none>                           3365.5 867.94
#> - LOR                1     28.61 3394.1 869.33
#> - University.Rating  1     41.76 3407.2 870.88
#> - SOP                1     55.61 3421.1 872.50
#> - Chance.of.Admit    1     61.65 3427.1 873.20
#> - CGPA               1    158.48 3523.9 884.35
#> - GRE.Score          1    742.27 4107.7 945.67
#> 
#> Step:  AIC=867.86
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA + 
#>     Chance.of.Admit
#> 
#>                     Df Sum of Sq    RSS    AIC
#> <none>                           3381.6 867.86
#> + Research           1     16.14 3365.5 867.94
#> - LOR                1     28.52 3410.1 869.22
#> - University.Rating  1     41.43 3423.0 870.73
#> - SOP                1     51.54 3433.1 871.91
#> - Chance.of.Admit    1     53.92 3435.5 872.18
#> - CGPA               1    165.45 3547.1 884.96
#> - GRE.Score          1    736.09 4117.7 944.63
chance_of_admit_both2 <- step(object = chance_of_admit_none, direction = "both",
                    scope = list(lower=chance_of_admit_none, upper =chance_of_admit_all))
#> Start:  AIC=1443.62
#> TOEFL.Score ~ 1
#> 
#>                     Df Sum of Sq     RSS     AIC
#> + GRE.Score          1   10272.3  4426.4  965.55
#> + CGPA               1   10087.4  4611.4  981.93
#> + Chance.of.Admit    1    9210.6  5488.2 1051.56
#> + University.Rating  1    7111.9  7586.8 1181.08
#> + SOP                1    6363.7  8335.1 1218.71
#> + LOR                1    4737.5  9961.2 1290.00
#> + Research           1    3527.1 11171.6 1335.87
#> <none>                           14698.8 1443.62
#> 
#> Step:  AIC=965.55
#> TOEFL.Score ~ GRE.Score
#> 
#>                     Df Sum of Sq     RSS     AIC
#> + CGPA               1     836.9  3589.5  883.72
#> + Chance.of.Admit    1     601.1  3825.3  909.17
#> + SOP                1     499.5  3926.9  919.66
#> + University.Rating  1     494.6  3931.9  920.16
#> + LOR                1     220.3  4206.2  947.14
#> <none>                            4426.4  965.55
#> + Research           1       0.5  4425.9  967.51
#> - GRE.Score          1   10272.3 14698.8 1443.62
#> 
#> Step:  AIC=883.72
#> TOEFL.Score ~ GRE.Score + CGPA
#> 
#>                     Df Sum of Sq    RSS    AIC
#> + University.Rating  1    101.12 3488.4 874.29
#> + SOP                1     95.70 3493.8 874.91
#> + Chance.of.Admit    1     70.25 3519.3 877.82
#> <none>                           3589.5 883.72
#> + LOR                1      4.42 3585.1 885.23
#> + Research           1      3.10 3586.4 885.38
#> - CGPA               1    836.91 4426.4 965.55
#> - GRE.Score          1   1021.85 4611.4 981.93
#> 
#> Step:  AIC=874.29
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating
#> 
#>                     Df Sum of Sq    RSS    AIC
#> + Chance.of.Admit    1     47.72 3440.7 870.78
#> + SOP                1     36.79 3451.6 872.05
#> <none>                           3488.4 874.29
#> + Research           1      5.96 3482.4 875.61
#> + LOR                1      1.56 3486.8 876.11
#> - University.Rating  1    101.12 3589.5 883.72
#> - CGPA               1    443.47 3931.9 920.16
#> - GRE.Score          1    925.15 4413.5 966.39
#> 
#> Step:  AIC=870.78
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit
#> 
#>                     Df Sum of Sq    RSS    AIC
#> + SOP                1     30.53 3410.1 869.22
#> <none>                           3440.7 870.78
#> + Research           1     12.63 3428.0 871.31
#> + LOR                1      7.51 3433.1 871.91
#> - Chance.of.Admit    1     47.72 3488.4 874.29
#> - University.Rating  1     78.60 3519.3 877.82
#> - CGPA               1    196.87 3637.5 891.04
#> - GRE.Score          1    758.84 4199.5 948.50
#> 
#> Step:  AIC=869.22
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit + 
#>     SOP
#> 
#>                     Df Sum of Sq    RSS    AIC
#> + LOR                1     28.52 3381.6 867.86
#> <none>                           3410.1 869.22
#> + Research           1     16.06 3394.1 869.33
#> - SOP                1     30.53 3440.7 870.78
#> - University.Rating  1     33.32 3443.4 871.11
#> - Chance.of.Admit    1     41.46 3451.6 872.05
#> - CGPA               1    156.53 3566.7 885.17
#> - GRE.Score          1    769.74 4179.9 948.63
#> 
#> Step:  AIC=867.86
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit + 
#>     SOP + LOR
#> 
#>                     Df Sum of Sq    RSS    AIC
#> <none>                           3381.6 867.86
#> + Research           1     16.14 3365.5 867.94
#> - LOR                1     28.52 3410.1 869.22
#> - University.Rating  1     41.43 3423.0 870.73
#> - SOP                1     51.54 3433.1 871.91
#> - Chance.of.Admit    1     53.92 3435.5 872.18
#> - CGPA               1    165.45 3547.1 884.96
#> - GRE.Score          1    736.09 4117.7 944.63
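
The AIC values in the traces above can be reproduced from the RSS column. For lm objects, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The sketch below is a check under that assumption, using the rounded RSS values printed above; the helper name aic_from_rss is hypothetical.

```r
# Values copied from the step() traces above
n        <- 400       # rows in admission_new
rss_null <- 14698.8   # TOEFL.Score ~ 1
rss_full <- 3365.5    # all seven predictors
rss_step <- 3381.6    # six predictors (Research dropped)

# Hypothetical helper mirroring the extractAIC() formula for lm models
aic_from_rss <- function(rss, n, edf) n * log(rss / n) + 2 * edf

aic_null <- aic_from_rss(rss_null, n, edf = 1)  # intercept only
aic_full <- aic_from_rss(rss_full, n, edf = 8)  # 7 slopes + intercept
aic_step <- aic_from_rss(rss_step, n, edf = 7)  # 6 slopes + intercept

print(round(c(null = aic_null, full = aic_full, step = aic_step), 2))
```

Dropping Research raises the RSS slightly but saves one parameter, which is why the six-predictor model ends up with the lower AIC.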

Goodness of fit

summary(chance_of_admit_all)$adj.r.squared
#> [1] 0.7669487
summary(chance_of_admit_backward)$adj.r.squared
#> [1] 0.7664269
summary(chance_of_admit_forward)$adj.r.squared
#> [1] 0.7664269
summary(chance_of_admit_both1)$adj.r.squared
#> [1] 0.7664269
summary(chance_of_admit_both2)$adj.r.squared
#> [1] 0.7664269

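The adjusted R-squared values are all about 0.77, so each model explains roughly 77% of the variance in the target, TOEFL.Score. The full model scores marginally higher (0.7669487) than the four stepwise models (0.7664269), which all converge on the same set of six predictors.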
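
As a cross-check, the adjusted R-squared values can be recomputed from the RSS figures in the step() traces using 1 - (RSS / (n - p - 1)) / (TSS / (n - 1)), where p is the number of predictors and TSS is the RSS of the intercept-only model. This is a sketch with the rounded printed values, so the last digits differ slightly; the helper name adj_r2 is hypothetical.

```r
n   <- 400       # rows in admission_new
tss <- 14698.8   # RSS of TOEFL.Score ~ 1, i.e. the total sum of squares

# Hypothetical helper: adjusted R-squared from RSS and predictor count p
adj_r2 <- function(rss, p) 1 - (rss / (n - p - 1)) / (tss / (n - 1))

adj_full <- adj_r2(3365.5, p = 7)  # all predictors
adj_step <- adj_r2(3381.6, p = 6)  # stepwise result (Research dropped)

print(round(c(full = adj_full, stepwise = adj_step), 4))
```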

4.1 Model Evaluation

#predict data
all_pred_chance <- predict(chance_of_admit_all, newdata = admission_new)
backward_chance <- predict(chance_of_admit_backward, newdata = admission_new)
forward_chance <- predict(chance_of_admit_forward, newdata = admission_new)
both_chance <- predict(chance_of_admit_both1, newdata = admission_new)
both_2_chance <- predict(chance_of_admit_both2, newdata = admission_new)
data.frame(RMSE = c(RMSE(all_pred_chance, admission_new$TOEFL.Score),
                    RMSE(backward_chance, admission_new$TOEFL.Score),
                    RMSE(forward_chance, admission_new$TOEFL.Score),
                    RMSE(both_chance, admission_new$TOEFL.Score),
                    RMSE(both_2_chance, admission_new$TOEFL.Score)),
           model = c("All", "Backward", "Forward", "Both1", "Both2"))
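
Because the predictions are made on the same data used for fitting, each RMSE equals sqrt(RSS / n). The sketch below derives approximate values from the rounded RSS figures in the step() traces; it is a consistency check, not the output of the chunk above.

```r
n <- 400  # rows in admission_new

# RSS per model, copied from the step() traces (rounded as printed)
rss <- c(All = 3365.5, Backward = 3381.6, Forward = 3381.6,
         Both1 = 3381.6, Both2 = 3381.6)

# In-sample RMSE: square root of the mean squared residual
rmse <- sqrt(rss / n)
print(round(rmse, 3))
```

All five values sit close to 2.9 TOEFL points, so removing Research costs almost nothing in in-sample accuracy.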

4.2 Assumption tests

Normality

  1. Visualize the resulting errors (histogram)
hist(chance_of_admit_backward$residuals, breaks = 10)

  2. Run a statistical test using the shapiro.test() function

Shapiro-Wilk hypothesis:

  • H0: the errors/residuals are normally distributed
  • H1: the errors/residuals are not normally distributed
shapiro.test(chance_of_admit_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  chance_of_admit_backward$residuals
#> W = 0.99516, p-value = 0.2468

The test gives a p-value of 0.2468, which is greater than 0.05, so I fail to reject H0: there is no evidence that the errors/residuals depart from a normal distribution. The normality assumption is therefore satisfied. Heteroscedasticity is a separate assumption and is checked in the next subsection.
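
The decision rule can also be applied programmatically. The sketch below uses simulated data as a stand-in for admission_new (an assumption, since the CSV is not bundled here); the variable names and simulation parameters are illustrative only.

```r
# Simulated stand-in data (assumption: not the real admission data)
set.seed(123)
n <- 400
x <- rnorm(n, mean = 8.6, sd = 0.6)
y <- 35 + 8.4 * x + rnorm(n, sd = 3)   # errors are normal by construction
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals, then the usual 5% decision rule
sw       <- shapiro.test(residuals(fit))
alpha    <- 0.05
decision <- if (sw$p.value > alpha) "fail to reject H0" else "reject H0"

cat("W =", round(sw$statistic, 4),
    " p-value =", round(sw$p.value, 4), "->", decision, "\n")
```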

4.2.1 Homoscedasticity

  1. Visualize the predicted values against the errors using a scatterplot
plot(chance_of_admit_backward$fitted.values, chance_of_admit_backward$residuals)
abline(h=0, col = "red")

In the plot, the residuals show no visible pattern, which is consistent with homoscedasticity.

  2. Run the Breusch-Pagan test using the bptest() function

Breusch-Pagan hypothesis:

  • H0: Homoscedasticity
  • H1: Heteroscedasticity
bptest(chance_of_admit_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  chance_of_admit_backward
#> BP = 11.521, df = 6, p-value = 0.07355

The Breusch-Pagan test from bptest() gives a p-value of 0.07355, which is greater than 0.05, so I fail to reject H0: the residuals are consistent with homoscedasticity, although the margin is fairly narrow.
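
To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R: regress the squared residuals on the same predictors and take n times the R-squared of that auxiliary regression, compared against a chi-squared distribution with one degree of freedom per predictor. The data are simulated (an assumption, not the admission data), and all variable names are illustrative.

```r
# Simulated stand-in data with constant error variance (assumption)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x1 + x2)

# Studentized (Koenker) BP statistic and its chi-squared p-value
bp_stat <- n * summary(aux)$r.squared
bp_df   <- 2   # number of predictors in the auxiliary regression
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp_stat, 3), " df =", bp_df,
    " p-value =", round(bp_p, 4), "\n")
```

A large statistic means the predictors explain the size of the residuals, which is exactly what heteroscedasticity looks like.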

4.2.2 Variance Inflation Factor (Multicollinearity)

vif(chance_of_admit_backward)
#>         GRE.Score University.Rating               SOP               LOR 
#>          3.603161          2.894273          3.011641          2.510825 
#>              CGPA   Chance.of.Admit 
#>          6.151457          4.890025

None of the Variance Inflation Factor values reaches 10, so no severe multicollinearity is detected among the predictors. CGPA (6.15) and Chance.of.Admit (4.89) have the highest values, which means the predictors are moderately correlated rather than fully independent.
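
For a model whose terms are all single numeric predictors, each VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on the others. Inverting this gives the share of each predictor's variance explained by the rest. The sketch below applies that inversion to the values printed above; the variable names are illustrative.

```r
# VIF values copied from the vif() output above
vif_values <- c(GRE.Score = 3.603161, University.Rating = 2.894273,
                SOP = 3.011641, LOR = 2.510825,
                CGPA = 6.151457, Chance.of.Admit = 4.890025)

# Invert VIF = 1 / (1 - R^2) to get the R^2 of each predictor on the others
implied_r2 <- 1 - 1 / vif_values
print(round(implied_r2, 3))

# The common VIF < 10 rule of thumb corresponds to R^2 < 0.9
print(all(implied_r2 < 0.9))
```

About 84% of the variance in CGPA is explained by the other predictors, which is high but still below the 0.9 level implied by the VIF threshold of 10.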

5 Kesimpulan

  1. The model satisfies the assumption tests performed: normality, homoscedasticity, and no severe multicollinearity (VIF).
  2. In the admission data, TOEFL.Score and CGPA are strongly correlated, at about 0.8, so the two variables are closely linked.
  3. The highest adjusted R-squared among the models analyzed is 0.7669487 (the full model); the stepwise models all reach 0.7664269.
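
The 0.8 correlation in point 2 can be cross-checked against the forward step() trace: for a single predictor, R² = 1 - RSS / TSS, and the correlation is its square root, carrying the sign of the slope. This is a sketch based on the rounded RSS values printed earlier.

```r
tss      <- 14698.8   # RSS of TOEFL.Score ~ 1
rss_cgpa <- 4611.4    # RSS after adding CGPA alone (forward trace)

# R-squared of the simple regression TOEFL.Score ~ CGPA
r2_cgpa <- 1 - rss_cgpa / tss

# The slope (8.432) is positive, so the correlation is the positive root
r_cgpa <- sqrt(r2_cgpa)

print(round(c(r_squared = r2_cgpa, correlation = r_cgpa), 3))
```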