This is an analysis of the admission data. The business problem I want to address is which factors influence TOEFL.Score. I start with a simple linear regression of TOEFL.Score on CGPA. I then use multiple linear regression to build a model that predicts TOEFL.Score and identifies the factors that influence it.
# import libs
library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)
library(performance)
admission <- read.csv("dataset/Admission_Predict.csv")
admission
glimpse(admission)
#> Rows: 400
#> Columns: 9
#> $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
#> $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323,...
#> $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108,...
#> $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3...
#> $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5,...
#> $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0,...
#> $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8...
#> $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0...
#> $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0...
admission_new <- admission %>%
select(-Serial.No.)
admission_new
colSums(is.na(admission_new))
#> GRE.Score TOEFL.Score University.Rating SOP
#> 0 0 0 0
#> LOR CGPA Research Chance.of.Admit
#> 0 0 0 0
ggcorr(admission_new, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
chance_of_admit <- lm(formula = TOEFL.Score ~ CGPA, data = admission_new)
chance_of_admit
#>
#> Call:
#> lm(formula = TOEFL.Score ~ CGPA, data = admission_new)
#>
#> Coefficients:
#> (Intercept) CGPA
#> 34.905 8.432
plot(admission_new$CGPA, admission_new$TOEFL.Score)
abline(chance_of_admit$coefficients[1], chance_of_admit$coefficients[2], col = "red")
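To make the fitted line concrete, the sketch below plugs the printed coefficients into the regression equation by hand. The CGPA values and the helper name `predict_toefl` are illustrative assumptions and are not part of the original analysis.

```r
# Coefficients copied from the printed lm() output above (rounded values).
intercept <- 34.905
slope_cgpa <- 8.432

# Hypothetical helper: predicted TOEFL.Score for a given CGPA.
predict_toefl <- function(cgpa) intercept + slope_cgpa * cgpa

# Illustrative input, not taken from the dataset.
pred_9 <- predict_toefl(9.0)
print(pred_9)

# The slope is the expected change in TOEFL.Score per one-point change in CGPA.
diff_one_point <- predict_toefl(9.0) - predict_toefl(8.0)
print(diff_one_point)
```

With the fitted object available, `predict(chance_of_admit, newdata = data.frame(CGPA = 9))` should give nearly the same value, up to the rounding of the printed coefficients.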
chance_of_admit_none <- lm(formula = TOEFL.Score ~ 1, data = admission_new)
chance_of_admit_all <- lm(formula = TOEFL.Score ~ . , data = admission_new)
chance_of_admit_backward <- step(object = chance_of_admit_all, direction = "backward")
#> Start: AIC=867.94
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA +
#> Research + Chance.of.Admit
#>
#> Df Sum of Sq RSS AIC
#> - Research 1 16.14 3381.6 867.86
#> <none> 3365.5 867.94
#> - LOR 1 28.61 3394.1 869.33
#> - University.Rating 1 41.76 3407.2 870.88
#> - SOP 1 55.61 3421.1 872.50
#> - Chance.of.Admit 1 61.65 3427.1 873.20
#> - CGPA 1 158.48 3523.9 884.35
#> - GRE.Score 1 742.27 4107.7 945.67
#>
#> Step: AIC=867.86
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA +
#> Chance.of.Admit
#>
#> Df Sum of Sq RSS AIC
#> <none> 3381.6 867.86
#> - LOR 1 28.52 3410.1 869.22
#> - University.Rating 1 41.43 3423.0 870.73
#> - SOP 1 51.54 3433.1 871.91
#> - Chance.of.Admit 1 53.92 3435.5 872.18
#> - CGPA 1 165.45 3547.1 884.96
#> - GRE.Score 1 736.09 4117.7 944.63
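The AIC column that `step()` prints for an `lm` fit can, as far as I understand it, be reproduced as n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients including the intercept. The sketch below checks this against the RSS values copied from the output above; the helper name `step_aic` is hypothetical.

```r
n <- 400  # number of rows, from glimpse()

# AIC on the scale used by step() for lm fits: n * log(RSS / n) + 2 * k
step_aic <- function(rss, k, n) n * log(rss / n) + 2 * k

# Full model: 7 predictors + intercept, RSS = 3365.5
aic_full <- step_aic(rss = 3365.5, k = 8, n = n)

# Model without Research: 6 predictors + intercept, RSS = 3381.6
aic_no_research <- step_aic(rss = 3381.6, k = 7, n = n)

print(round(c(full = aic_full, no_research = aic_no_research), 2))
```

Both values land close to the 867.94 and 867.86 shown in the output. The gap between them is small, which is why dropping Research is only a marginal improvement.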
chance_of_admit_forward <- step(object = chance_of_admit_none, direction = "forward",
scope = list(lower=chance_of_admit_none, upper=chance_of_admit_all))
#> Start: AIC=1443.62
#> TOEFL.Score ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + GRE.Score 1 10272.3 4426.4 965.55
#> + CGPA 1 10087.4 4611.4 981.93
#> + Chance.of.Admit 1 9210.6 5488.2 1051.56
#> + University.Rating 1 7111.9 7586.8 1181.08
#> + SOP 1 6363.7 8335.1 1218.71
#> + LOR 1 4737.5 9961.2 1290.00
#> + Research 1 3527.1 11171.6 1335.87
#> <none> 14698.8 1443.62
#>
#> Step: AIC=965.55
#> TOEFL.Score ~ GRE.Score
#>
#> Df Sum of Sq RSS AIC
#> + CGPA 1 836.91 3589.5 883.72
#> + Chance.of.Admit 1 601.13 3825.3 909.17
#> + SOP 1 499.48 3926.9 919.66
#> + University.Rating 1 494.57 3931.9 920.16
#> + LOR 1 220.25 4206.2 947.14
#> <none> 4426.4 965.55
#> + Research 1 0.48 4425.9 967.51
#>
#> Step: AIC=883.72
#> TOEFL.Score ~ GRE.Score + CGPA
#>
#> Df Sum of Sq RSS AIC
#> + University.Rating 1 101.123 3488.4 874.29
#> + SOP 1 95.701 3493.8 874.91
#> + Chance.of.Admit 1 70.250 3519.3 877.82
#> <none> 3589.5 883.72
#> + LOR 1 4.423 3585.1 885.23
#> + Research 1 3.105 3586.4 885.38
#>
#> Step: AIC=874.29
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating
#>
#> Df Sum of Sq RSS AIC
#> + Chance.of.Admit 1 47.724 3440.7 870.78
#> + SOP 1 36.794 3451.6 872.05
#> <none> 3488.4 874.29
#> + Research 1 5.965 3482.4 875.61
#> + LOR 1 1.561 3486.8 876.11
#>
#> Step: AIC=870.78
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit
#>
#> Df Sum of Sq RSS AIC
#> + SOP 1 30.5341 3410.1 869.22
#> <none> 3440.7 870.78
#> + Research 1 12.6319 3428.0 871.31
#> + LOR 1 7.5147 3433.1 871.91
#>
#> Step: AIC=869.22
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit +
#> SOP
#>
#> Df Sum of Sq RSS AIC
#> + LOR 1 28.524 3381.6 867.86
#> <none> 3410.1 869.22
#> + Research 1 16.056 3394.1 869.33
#>
#> Step: AIC=867.86
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit +
#> SOP + LOR
#>
#> Df Sum of Sq RSS AIC
#> <none> 3381.6 867.86
#> + Research 1 16.139 3365.5 867.94
chance_of_admit_both1 <- step(object = chance_of_admit_all, direction = "both",
scope = list(lower=chance_of_admit_none, upper = chance_of_admit_all))
#> Start: AIC=867.94
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA +
#> Research + Chance.of.Admit
#>
#> Df Sum of Sq RSS AIC
#> - Research 1 16.14 3381.6 867.86
#> <none> 3365.5 867.94
#> - LOR 1 28.61 3394.1 869.33
#> - University.Rating 1 41.76 3407.2 870.88
#> - SOP 1 55.61 3421.1 872.50
#> - Chance.of.Admit 1 61.65 3427.1 873.20
#> - CGPA 1 158.48 3523.9 884.35
#> - GRE.Score 1 742.27 4107.7 945.67
#>
#> Step: AIC=867.86
#> TOEFL.Score ~ GRE.Score + University.Rating + SOP + LOR + CGPA +
#> Chance.of.Admit
#>
#> Df Sum of Sq RSS AIC
#> <none> 3381.6 867.86
#> + Research 1 16.14 3365.5 867.94
#> - LOR 1 28.52 3410.1 869.22
#> - University.Rating 1 41.43 3423.0 870.73
#> - SOP 1 51.54 3433.1 871.91
#> - Chance.of.Admit 1 53.92 3435.5 872.18
#> - CGPA 1 165.45 3547.1 884.96
#> - GRE.Score 1 736.09 4117.7 944.63
chance_of_admit_both2 <- step(object = chance_of_admit_none, direction = "both",
scope = list(lower=chance_of_admit_none, upper =chance_of_admit_all))
#> Start: AIC=1443.62
#> TOEFL.Score ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + GRE.Score 1 10272.3 4426.4 965.55
#> + CGPA 1 10087.4 4611.4 981.93
#> + Chance.of.Admit 1 9210.6 5488.2 1051.56
#> + University.Rating 1 7111.9 7586.8 1181.08
#> + SOP 1 6363.7 8335.1 1218.71
#> + LOR 1 4737.5 9961.2 1290.00
#> + Research 1 3527.1 11171.6 1335.87
#> <none> 14698.8 1443.62
#>
#> Step: AIC=965.55
#> TOEFL.Score ~ GRE.Score
#>
#> Df Sum of Sq RSS AIC
#> + CGPA 1 836.9 3589.5 883.72
#> + Chance.of.Admit 1 601.1 3825.3 909.17
#> + SOP 1 499.5 3926.9 919.66
#> + University.Rating 1 494.6 3931.9 920.16
#> + LOR 1 220.3 4206.2 947.14
#> <none> 4426.4 965.55
#> + Research 1 0.5 4425.9 967.51
#> - GRE.Score 1 10272.3 14698.8 1443.62
#>
#> Step: AIC=883.72
#> TOEFL.Score ~ GRE.Score + CGPA
#>
#> Df Sum of Sq RSS AIC
#> + University.Rating 1 101.12 3488.4 874.29
#> + SOP 1 95.70 3493.8 874.91
#> + Chance.of.Admit 1 70.25 3519.3 877.82
#> <none> 3589.5 883.72
#> + LOR 1 4.42 3585.1 885.23
#> + Research 1 3.10 3586.4 885.38
#> - CGPA 1 836.91 4426.4 965.55
#> - GRE.Score 1 1021.85 4611.4 981.93
#>
#> Step: AIC=874.29
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating
#>
#> Df Sum of Sq RSS AIC
#> + Chance.of.Admit 1 47.72 3440.7 870.78
#> + SOP 1 36.79 3451.6 872.05
#> <none> 3488.4 874.29
#> + Research 1 5.96 3482.4 875.61
#> + LOR 1 1.56 3486.8 876.11
#> - University.Rating 1 101.12 3589.5 883.72
#> - CGPA 1 443.47 3931.9 920.16
#> - GRE.Score 1 925.15 4413.5 966.39
#>
#> Step: AIC=870.78
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit
#>
#> Df Sum of Sq RSS AIC
#> + SOP 1 30.53 3410.1 869.22
#> <none> 3440.7 870.78
#> + Research 1 12.63 3428.0 871.31
#> + LOR 1 7.51 3433.1 871.91
#> - Chance.of.Admit 1 47.72 3488.4 874.29
#> - University.Rating 1 78.60 3519.3 877.82
#> - CGPA 1 196.87 3637.5 891.04
#> - GRE.Score 1 758.84 4199.5 948.50
#>
#> Step: AIC=869.22
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit +
#> SOP
#>
#> Df Sum of Sq RSS AIC
#> + LOR 1 28.52 3381.6 867.86
#> <none> 3410.1 869.22
#> + Research 1 16.06 3394.1 869.33
#> - SOP 1 30.53 3440.7 870.78
#> - University.Rating 1 33.32 3443.4 871.11
#> - Chance.of.Admit 1 41.46 3451.6 872.05
#> - CGPA 1 156.53 3566.7 885.17
#> - GRE.Score 1 769.74 4179.9 948.63
#>
#> Step: AIC=867.86
#> TOEFL.Score ~ GRE.Score + CGPA + University.Rating + Chance.of.Admit +
#> SOP + LOR
#>
#> Df Sum of Sq RSS AIC
#> <none> 3381.6 867.86
#> + Research 1 16.14 3365.5 867.94
#> - LOR 1 28.52 3410.1 869.22
#> - University.Rating 1 41.43 3423.0 870.73
#> - SOP 1 51.54 3433.1 871.91
#> - Chance.of.Admit 1 53.92 3435.5 872.18
#> - CGPA 1 165.45 3547.1 884.96
#> - GRE.Score 1 736.09 4117.7 944.63
Goodness of fit
summary(chance_of_admit_all)$adj.r.squared
#> [1] 0.7669487
summary(chance_of_admit_backward)$adj.r.squared
#> [1] 0.7664269
summary(chance_of_admit_forward)$adj.r.squared
#> [1] 0.7664269
summary(chance_of_admit_both1)$adj.r.squared
#> [1] 0.7664269
summary(chance_of_admit_both2)$adj.r.squared
#> [1] 0.7664269
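Based on adjusted R-squared, the models describe the target well: each one explains roughly 77% of the variance in TOEFL.Score. The four stepwise models all reach the same value (0.7664), which is almost identical to the full model (0.7669) while using one predictor fewer.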
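The adjusted R-squared values above can be reproduced from numbers already printed by `step()`. This is a minimal sketch that assumes the RSS of the intercept-only model (14698.8) is the total sum of squares; the helper name `adj_r2` is hypothetical.

```r
n <- 400          # number of rows, from glimpse()
tss <- 14698.8    # RSS of TOEFL.Score ~ 1, i.e. the total sum of squares

# Adjusted R-squared for a model with p predictors (intercept not counted).
adj_r2 <- function(rss, p, n, tss) {
  r2 <- 1 - rss / tss
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_full <- adj_r2(rss = 3365.5, p = 7, n = n, tss = tss)  # all predictors
adj_step <- adj_r2(rss = 3381.6, p = 6, n = n, tss = tss)  # Research dropped

print(round(c(full = adj_full, stepwise = adj_step), 4))
```

The penalty term `(n - 1) / (n - p - 1)` is what lets the smaller model score almost the same as the full one despite its slightly larger RSS.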
# Predict on the same data used for fitting (in-sample predictions)
all_pred_chance <- predict(chance_of_admit_all, newdata = admission_new)
backward_chance <- predict(chance_of_admit_backward, newdata = admission_new)
forward_chance <- predict(chance_of_admit_forward, newdata = admission_new)
both_chance <- predict(chance_of_admit_both1, newdata = admission_new)
both_2_chance <- predict(chance_of_admit_both2, newdata = admission_new)
data.frame(RMSE = c(RMSE(all_pred_chance, admission_new$TOEFL.Score),
RMSE(backward_chance, admission_new$TOEFL.Score),
RMSE(forward_chance, admission_new$TOEFL.Score),
RMSE(both_chance, admission_new$TOEFL.Score),
RMSE(both_2_chance, admission_new$TOEFL.Score)),
model = c("All", "Backward", "Forward", "Both1", "Both2"))
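Because the predictions above are made on the data used for fitting, the in-sample RMSE of each model should equal sqrt(RSS / n). The sketch below derives it from the RSS values printed by `step()`; the helper name `rmse_from_rss` and the row labels are assumptions for illustration.

```r
n <- 400  # number of rows, from glimpse()

# In-sample RMSE from the residual sum of squares.
rmse_from_rss <- function(rss, n) sqrt(rss / n)

rmse_tbl <- data.frame(
  model = c("All", "Stepwise (Research dropped)"),
  RMSE = c(rmse_from_rss(3365.5, n), rmse_from_rss(3381.6, n))
)
print(rmse_tbl)
```

The full model always has the lowest in-sample RMSE because it has the most predictors, so this comparison alone does not show which model generalizes better.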
### Normality
hist(chance_of_admit_backward$residuals, breaks = 10)
Shapiro-Wilk hypotheses for shapiro.test():
H0: the residuals are normally distributed
H1: the residuals are not normally distributed
shapiro.test(chance_of_admit_backward$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: chance_of_admit_backward$residuals
#> W = 0.99516, p-value = 0.2468
From the statistical test I obtained a p-value of 0.2468, which is above 0.05, so I fail to reject H0: there is no evidence that the residuals depart from a normal distribution, and the normality assumption is treated as satisfied. This test addresses normality only; heteroscedasticity is checked separately with the residual plot and the Breusch-Pagan test.
plot(chance_of_admit_backward$fitted.values, chance_of_admit_backward$residuals)
abline(h=0, col = "red")
In this plot the residuals scatter around zero with no visible pattern, which is consistent with homoscedasticity.
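Breusch-Pagan hypotheses for bptest():
H0: the residual variance is constant (homoscedasticity)
H1: the residual variance is not constant (heteroscedasticity)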
bptest(chance_of_admit_backward)
#>
#> studentized Breusch-Pagan test
#>
#> data: chance_of_admit_backward
#> BP = 11.521, df = 6, p-value = 0.07355
The Breusch-Pagan test from bptest() gives a p-value of 0.07355, which is above 0.05. I therefore fail to reject H0, meaning there is no significant evidence of heteroscedasticity and the residuals are treated as homoscedastic.
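To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R. It runs on simulated data, not the admission dataset, and assumes the studentized (Koenker) form: n times the R-squared of a regression of the squared residuals on the same predictors. The df = 6 in the output above matches the six predictors of the backward model.

```r
set.seed(42)  # simulated data for illustration only
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)  # constant error variance by construction

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors.
e2 <- resid(fit)^2
aux <- lm(e2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared        # studentized BP statistic
bp_df <- length(coef(aux)) - 1               # number of predictors
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```

If `lmtest` is loaded, `bptest(fit)` on the same simulated fit should report the same statistic and degrees of freedom.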
vif(chance_of_admit_backward)
#> GRE.Score University.Rating SOP LOR
#> 3.603161 2.894273 3.011641 2.510825
#> CGPA Chance.of.Admit
#> 6.151457 4.890025
From the Variance Inflation Factor (VIF) check for multicollinearity, no predictor has a VIF of 10 or more, so I find no severe multicollinearity among the predictors. The largest value is CGPA at 6.15, so the predictors are correlated with one another to some degree rather than fully independent.
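Since VIF is defined as 1 / (1 - R²), where R² comes from regressing one predictor on all the others, the printed VIF values can be converted back into that R². The sketch below does this with the values copied from the output above; the threshold of 10 is the one used in this analysis.

```r
# VIF values copied from the vif() output above.
vif_values <- c(
  GRE.Score = 3.603161,
  University.Rating = 2.894273,
  SOP = 3.011641,
  LOR = 2.510825,
  CGPA = 6.151457,
  Chance.of.Admit = 4.890025
)

# Share of each predictor's variance explained by the other predictors.
r2_from_vif <- 1 - 1 / vif_values
print(round(r2_from_vif, 3))

# Predictors at or above the VIF threshold of 10 (none expected here).
flagged <- names(vif_values)[vif_values >= 10]
print(flagged)
```

For CGPA this gives an R² of about 0.84, which means most of its variance is shared with the other predictors even though it stays under the threshold.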