In this analysis we use the admission dataset from Kaggle. The goal is to see how strongly the predictor variables influence the target variable.
# Load the required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
admission <- read.csv("admission.csv")
glimpse(admission)
## Rows: 400
## Columns: 9
## $ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
## $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~
Description of each column:
Serial.No.: serial number
GRE.Score: standardized test (GRE) score
TOEFL.Score: English proficiency test (TOEFL) score
University.Rating: university rating
SOP: score of the personal statement (statement of purpose)
LOR: score of the letter of recommendation
CGPA: cumulative grade point average
Research: research experience
Chance.of.Admit: chance of admission
As the data structure shows, no column needs its data type adjusted, because almost all columns are numeric and will be used in calculations.
2. Removing columns that are not needed
admission <-
admission %>%
select(-Serial.No.)
anyNA(admission)
## [1] FALSE
There are no missing values, so no treatment is needed.
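anyNA() only reports whether at least one value is missing. Had it returned TRUE, a per-column count would show where the gaps are. A minimal sketch on a made-up data frame (the `toy` object and its values are illustrative assumptions, not the admission data):

```r
# Hypothetical toy data, not the Kaggle file
toy <- data.frame(
  GRE.Score = c(337, NA, 316, 322),
  CGPA      = c(9.65, 8.87, NA, NA)
)

anyNA(toy)                            # is there any NA at all?
na_per_column <- colSums(is.na(toy))  # number of NAs in each column
na_per_column

# One possible treatment: keep only rows without any NA
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Dropping incomplete rows is only one option; whether it is appropriate depends on how many rows would be lost.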
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(admission, label = T, label_size = 2.9, hjust = 1, layout.exp = 2)
As the plot shows, all variables have a strong positive correlation.
admission_m <- lm(Chance.of.Admit ~ CGPA , admission)
admission_m
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission)
##
## Coefficients:
## (Intercept) CGPA
## -1.0715 0.2088
We build a linear regression model with CGPA as the only predictor, because it has the highest positive correlation with the target variable Chance.of.Admit.
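Choosing the predictor with the highest correlation makes sense for a one-predictor model, because in simple linear regression R-squared equals the squared Pearson correlation and the slope is cov(x, y) / var(x). The sketch below checks both identities on simulated data; `x` and `y` are stand-ins generated here, not the real CGPA and Chance.of.Admit columns.

```r
set.seed(42)
n <- 200
x <- rnorm(n, mean = 8.5, sd = 0.6)       # simulated stand-in for CGPA
y <- -1 + 0.2 * x + rnorm(n, sd = 0.07)   # simulated stand-in for the target

fit <- lm(y ~ x)

# Closed-form least-squares estimates for one predictor
slope_by_hand     <- cov(x, y) / var(x)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)

# Both comparisons should report TRUE
all.equal(unname(coef(fit)), c(intercept_by_hand, slope_by_hand))
all.equal(summary(fit)$r.squared, cor(x, y)^2)
```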
summary(admission_m)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.274575 -0.030084 0.009443 0.041954 0.180734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.07151 0.05034 -21.29 <2e-16 ***
## CGPA 0.20885 0.00584 35.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.762
## F-statistic: 1279 on 1 and 398 DF, p-value: < 2.2e-16
The Adjusted R-squared is 0.762, so CGPA alone explains about 76% of the variance in Chance.of.Admit.
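The adjusted value can be reproduced from the Multiple R-squared in the summary with the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors. A small sketch; the helper name `adj_r2` is made up for this example:

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from the summary above: R2 = 0.7626, 400 rows, 1 predictor
adj_first <- adj_r2(r2 = 0.7626, n = 400, p = 1)
round(adj_first, 3)
```

With a single predictor and 400 rows the penalty is tiny, which is why the adjusted and unadjusted values are almost identical here.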
plot(admission$CGPA, admission$Chance.of.Admit)
abline(admission_m)  # draw the fitted regression line
Next, we select the predictor variables automatically using backward stepwise selection, hoping to obtain a better model.
model <- lm(Chance.of.Admit ~ ., admission)
step(model, direction = "backward")
## Start: AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00144 1.5962 -2195.5
## - University.Rating 1 0.00584 1.6006 -2194.4
## <none> 1.5948 -2193.9
## - TOEFL.Score 1 0.02921 1.6240 -2188.6
## - GRE.Score 1 0.03435 1.6291 -2187.4
## - Research 1 0.03862 1.6334 -2186.3
## - LOR 1 0.06620 1.6609 -2179.6
## - CGPA 1 0.38544 1.9802 -2109.3
##
## Step: AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 1 0.00464 1.6008 -2196.4
## <none> 1.5962 -2195.5
## - TOEFL.Score 1 0.02806 1.6242 -2190.6
## - GRE.Score 1 0.03565 1.6318 -2188.7
## - Research 1 0.03769 1.6339 -2188.2
## - LOR 1 0.06983 1.6660 -2180.4
## - CGPA 1 0.38660 1.9828 -2110.8
##
## Step: AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.6008 -2196.4
## - TOEFL.Score 1 0.03292 1.6338 -2190.2
## - GRE.Score 1 0.03638 1.6372 -2189.4
## - Research 1 0.03912 1.6400 -2188.7
## - LOR 1 0.09133 1.6922 -2176.2
## - CGPA 1 0.43201 2.0328 -2102.8
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = admission)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score LOR CGPA Research
## -1.298464 0.001782 0.003032 0.022776 0.121004 0.024577
model_back <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
CGPA + Research, data = admission)
summary(model_back)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.263542 -0.023297 0.009879 0.038078 0.159897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2984636 0.1172905 -11.070 < 2e-16 ***
## GRE.Score 0.0017820 0.0005955 2.992 0.00294 **
## TOEFL.Score 0.0030320 0.0010651 2.847 0.00465 **
## LOR 0.0227762 0.0048039 4.741 2.97e-06 ***
## CGPA 0.1210042 0.0117349 10.312 < 2e-16 ***
## Research 0.0245769 0.0079203 3.103 0.00205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared: 0.8027, Adjusted R-squared: 0.8002
## F-statistic: 320.6 on 5 and 394 DF, p-value: < 2.2e-16
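step() returns the selected model as an lm object, so the final formula does not have to be retyped by hand. The sketch below shows that pattern on simulated data; the data frame `sim` and its columns are assumptions made up for illustration, and `trace = 0` only suppresses the step-by-step printout.

```r
set.seed(1)
n <- 300
sim <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)   # unrelated to y, so it may be dropped
)
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

full    <- lm(y ~ ., data = sim)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                  # formula chosen by AIC
summary(reduced)$adj.r.squared    # same information as summary(model_back)
```

Assigning the result, for example `model_back <- step(model, direction = "backward")`, would avoid copy errors when the formula is long.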
library(performance)
## Warning: package 'performance' was built under R version 4.1.2
compare_performance(admission_m, model_back)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## -----------------------------------------------------------------------------------------------------------
## admission_m | lm | -993.228 | < 0.001 | -981.254 | < 0.001 | 0.763 | 0.762 | 0.069 | 0.070
## model_back | lm | -1059.225 | 1.000 | -1031.284 | 1.000 | 0.803 | 0.800 | 0.063 | 0.064
The Adjusted R-squared increased from 0.762 for the first model to 0.800 for the stepwise model.
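The AIC reported for the stepwise model differs between the step() trace (about -2196.4) and compare_performance() (about -1059.2). As far as I can tell this is a difference in convention, not in the model: step() relies on extractAIC(), which for lm drops the constant part of the Gaussian log-likelihood and counts only the regression coefficients, while AIC() keeps the constant and also counts the error variance. The sketch below reproduces both numbers from the RSS printed in the trace (rounded, so the match is approximate).

```r
n   <- 400      # rows in admission
rss <- 1.6008   # RSS of the final model in the step() trace
k   <- 6        # intercept + 5 slopes

# Convention used by extractAIC() / step() for lm
aic_step <- n * log(rss / n) + 2 * k

# Full Gaussian log-likelihood, with sigma counted as a parameter
aic_full <- n * (log(2 * pi) + 1 + log(rss / n)) + 2 * (k + 1)

c(step = aic_step, full = aic_full)
```

Because both versions differ only by terms that depend on n, either one ranks models fitted to the same data in the same way; values from the two conventions should not be compared with each other.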
# prediction with the first model
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.1.2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
pred1 <- predict(admission_m, admission)
RMSE(y_pred = pred1, y_true = admission$Chance.of.Admit)
## [1] 0.0693927
# prediction with the stepwise model
pred2 <- predict(model_back, admission)
RMSE(y_pred = pred2, y_true = admission$Chance.of.Admit)
## [1] 0.06326207
range(admission$Chance.of.Admit)
## [1] 0.34 0.97
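Compared with the range of the target (0.34 to 0.97), the RMSE of both models is fairly small: about 0.069 for the first model and 0.063 for the stepwise model, roughly 11% and 10% of that range. Both values are computed on the same data used to fit the models.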
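For reference, RMSE is the square root of the mean squared residual, and on the training data it is tied to the residual standard error by RMSE = sigma * sqrt(df / n). A minimal sketch on simulated data (`x`, `y`, and `fit` are made up here, not the admission models):

```r
set.seed(7)
n <- 100
x <- runif(n)
y <- 0.3 + 0.5 * x + rnorm(n, sd = 0.05)

fit  <- lm(y ~ x)
pred <- predict(fit)   # fitted values on the training data

rmse_by_hand    <- sqrt(mean((y - pred)^2))
rmse_from_sigma <- sigma(fit) * sqrt(df.residual(fit) / n)

all.equal(rmse_by_hand, rmse_from_sigma)   # the two routes agree

# RMSE relative to the range of the target, as a rough scale check
rmse_by_hand / diff(range(y))
```

The same relation explains why the RMSE values above (0.069 and 0.063) sit just below the Sigma column of compare_performance().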
plot(density(model_back$residuals))
library(stats)
shapiro.test(admission_m$residuals)
##
## Shapiro-Wilk normality test
##
## data: admission_m$residuals
## W = 0.94782, p-value = 1.143e-10
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.1.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.1.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(admission_m)
##
## studentized Breusch-Pagan test
##
## data: admission_m
## BP = 19.562, df = 1, p-value = 9.737e-06
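The Shapiro-Wilk p-value (1.143e-10) is smaller than alpha (0.05), so H0 is rejected: the residuals of admission_m are not normally distributed. The Breusch-Pagan p-value (9.737e-06) is also smaller than alpha, so H0 of constant variance is rejected as well and the residuals are heteroscedastic. Note that both tests were run on admission_m, not on model_back.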
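The decision rule for both tests is the same: reject H0 when the p-value is below alpha. For Shapiro-Wilk, H0 states that the residuals are normally distributed; for Breusch-Pagan, H0 states that the residual variance is constant. A short sketch using the p-values printed above (the vector names are chosen for this example):

```r
alpha <- 0.05

# p-values copied from the test output above
p_values <- c(shapiro_wilk = 1.143e-10, breusch_pagan = 9.737e-06)

# TRUE means H0 is rejected for that test
reject_h0 <- p_values < alpha
reject_h0
```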
plot(admission$Chance.of.Admit, model_back$residuals)
abline(h = 0, col = "red")
library(car)
## Warning: package 'car' was built under R version 4.1.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model_back)
## GRE.Score TOEFL.Score LOR CGPA Research
## 4.585053 4.104255 1.829491 4.808767 1.530007
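No VIF value is equal to or greater than 10, so by this threshold no problematic multicollinearity is found among the predictors. The predictors are still correlated with one another, as the VIF values close to 5 for GRE.Score, TOEFL.Score, and CGPA indicate.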
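The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the others. A VIF of 4.81 for CGPA therefore corresponds to an R2 of about 0.79 against the other predictors. The sketch below computes VIF by hand on simulated data and checks it against the diagonal of the inverse correlation matrix; the variables `gre`, `toefl`, and `lor` are simulated stand-ins, not the real columns.

```r
set.seed(3)
n <- 500
gre   <- rnorm(n)
toefl <- 0.8 * gre + rnorm(n, sd = 0.6)   # deliberately correlated with gre
lor   <- rnorm(n)                          # unrelated to the other two
X <- data.frame(gre, toefl, lor)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j on the rest
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent route: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(X)))

vif_by_hand
all.equal(unname(vif_by_hand), unname(vif_from_cor))
```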
The model_back model has a good Adjusted R-squared of 0.80.
All of its coefficients are positive, so when GRE.Score, TOEFL.Score, CGPA, LOR, or Research increases while the others are held constant, the predicted Chance.of.Admit increases as well. In other words, an applicant who wants a higher chance of admission should raise these values. The predictors are also related to one another, so in practice they tend to move together.
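To make the interpretation concrete, the coefficients from summary(model_back) can be applied by hand to a single applicant. The applicant profile below is hypothetical and chosen only for illustration; the coefficients are copied from the output above.

```r
# Coefficients copied from summary(model_back)
coefs <- c(
  `(Intercept)` = -1.2984636,
  GRE.Score     =  0.0017820,
  TOEFL.Score   =  0.0030320,
  LOR           =  0.0227762,
  CGPA          =  0.1210042,
  Research      =  0.0245769
)

# Hypothetical applicant, in the same order as the coefficients
applicant <- c(
  `(Intercept)` = 1,
  GRE.Score     = 320,
  TOEFL.Score   = 110,
  LOR           = 4,
  CGPA          = 9,
  Research      = 1
)

predicted_chance <- sum(coefs * applicant)
predicted_chance

# Expected change when CGPA rises by 0.5 and everything else stays fixed
cgpa_gain <- unname(coefs["CGPA"] * 0.5)
cgpa_gain
```

For this profile the predicted chance is about 0.81, and an extra 0.5 points of CGPA adds about 0.06. With the fitted object available, predict(model_back, newdata = ...) would give the same result without copying coefficients.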