In this analysis we use the admission dataset from Kaggle. The goal is to see how strongly the predictor variables influence the target variable.

# Load the required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1. Load Data

admission <- read.csv("admission.csv")

2. Exploratory Data Analysis

  1. Inspect the data structure
glimpse(admission)
## Rows: 400
## Columns: 9
## $ Serial.No.        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 32~
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 10~
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3~
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.~
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.~
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00~
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1~
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50~

Description of each column:

As the structure shows, no column needs its data type adjusted, because almost all of the columns are used as numeric values in the calculations.

  2. Remove columns that are not needed

admission <-
  admission %>% 
  select(-Serial.No.)
  3. Check for missing values
anyNA(admission)
## [1] FALSE

There are no missing values, so no treatment is needed.
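
`anyNA()` only answers yes or no for the whole data frame. When it returns `TRUE`, a per-column count shows where the gaps are. The sketch below uses a small hypothetical data frame (`toy`, with made-up missing values) rather than the admission data:

```r
# Hypothetical toy data frame standing in for `admission`;
# the NA values are invented for illustration only
toy <- data.frame(
  GRE.Score = c(337, NA, 316, 322),
  CGPA      = c(9.65, 8.87, NA, NA)
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Same overall check as used in the analysis
print(anyNA(toy))
```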

  4. Inspect the correlations
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(admission, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

As the plot shows, all variables are strongly and positively correlated.
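
The correlation plot can be complemented with the numbers themselves, sorted by their correlation with the target. This is a minimal sketch on simulated data: the data frame `toy` and the coefficients used to generate it are assumptions for illustration, not values from the admission dataset.

```r
set.seed(42)
n <- 100

# Simulated stand-ins for two predictors and the target (illustrative only)
cgpa <- rnorm(n, mean = 8.5, sd = 0.6)
toy <- data.frame(
  CGPA            = cgpa,
  GRE.Score       = 316 + 15 * (cgpa - 8.5) + rnorm(n, sd = 6),
  Chance.of.Admit = 0.72 + 0.2 * (cgpa - 8.5) + rnorm(n, sd = 0.05)
)

# Pearson correlation matrix, then the target column sorted from high to low
cor_matrix <- cor(toy)
target_cor <- sort(cor_matrix[, "Chance.of.Admit"], decreasing = TRUE)
print(round(target_cor, 3))
```

On the real data, the same two lines (`cor()` followed by `sort()` on the `Chance.of.Admit` column) would list the predictors in order of their correlation with the target.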

3. Building the Linear Regression Model

admission_m <- lm(Chance.of.Admit ~ CGPA , admission)
admission_m
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission)
## 
## Coefficients:
## (Intercept)         CGPA  
##     -1.0715       0.2088

The linear regression model uses CGPA as its predictor because that variable has the highest positive correlation with the target variable Chance.of.Admit.

summary(admission_m)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274575 -0.030084  0.009443  0.041954  0.180734 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.07151    0.05034  -21.29   <2e-16 ***
## CGPA         0.20885    0.00584   35.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.762 
## F-statistic:  1279 on 1 and 398 DF,  p-value: < 2.2e-16

The Adjusted R-squared is 0.762, meaning CGPA alone explains about 76% of the variance in Chance.of.Admit.
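
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the summary output above; the CGPA of 9.0 is a hypothetical applicant chosen only for illustration.

```r
# Coefficients copied from the summary output of admission_m
intercept  <- -1.07151
slope_cgpa <- 0.20885

# Hypothetical applicant with a CGPA of 9.0 (illustrative value)
cgpa_new <- 9.0

# Simple linear regression: prediction = intercept + slope * predictor
predicted_chance <- intercept + slope_cgpa * cgpa_new
print(predicted_chance)  # about 0.81
```

With the fitted object available, `predict(admission_m, data.frame(CGPA = 9.0))` should give the same value up to rounding of the printed coefficients. Each additional CGPA point raises the predicted chance by about 0.21.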

plot(admission$CGPA, admission$Chance.of.Admit)
abline(admission_m)

Next, the predictor variables are selected automatically using backward stepwise selection, in the hope of producing a better model.

model <- lm(Chance.of.Admit ~ ., admission)
step(model, direction = "backward")
## Start:  AIC=-2193.9
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - SOP                1   0.00144 1.5962 -2195.5
## - University.Rating  1   0.00584 1.6006 -2194.4
## <none>                           1.5948 -2193.9
## - TOEFL.Score        1   0.02921 1.6240 -2188.6
## - GRE.Score          1   0.03435 1.6291 -2187.4
## - Research           1   0.03862 1.6334 -2186.3
## - LOR                1   0.06620 1.6609 -2179.6
## - CGPA               1   0.38544 1.9802 -2109.3
## 
## Step:  AIC=-2195.54
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  1   0.00464 1.6008 -2196.4
## <none>                           1.5962 -2195.5
## - TOEFL.Score        1   0.02806 1.6242 -2190.6
## - GRE.Score          1   0.03565 1.6318 -2188.7
## - Research           1   0.03769 1.6339 -2188.2
## - LOR                1   0.06983 1.6660 -2180.4
## - CGPA               1   0.38660 1.9828 -2110.8
## 
## Step:  AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     1.6008 -2196.4
## - TOEFL.Score  1   0.03292 1.6338 -2190.2
## - GRE.Score    1   0.03638 1.6372 -2189.4
## - Research     1   0.03912 1.6400 -2188.7
## - LOR          1   0.09133 1.6922 -2176.2
## - CGPA         1   0.43201 2.0328 -2102.8
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission)
## 
## Coefficients:
## (Intercept)    GRE.Score  TOEFL.Score          LOR         CGPA     Research  
##   -1.298464     0.001782     0.003032     0.022776     0.121004     0.024577
model_back <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
    CGPA + Research, data = admission)
summary(model_back)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admission)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.2984636  0.1172905 -11.070  < 2e-16 ***
## GRE.Score    0.0017820  0.0005955   2.992  0.00294 ** 
## TOEFL.Score  0.0030320  0.0010651   2.847  0.00465 ** 
## LOR          0.0227762  0.0048039   4.741 2.97e-06 ***
## CGPA         0.1210042  0.0117349  10.312  < 2e-16 ***
## Research     0.0245769  0.0079203   3.103  0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 2.2e-16
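
`step()` returns the selected model as an `lm` object, so the result can be assigned directly instead of retyping the final formula. A minimal sketch, using the built-in `mtcars` data as a stand-in for the admission data (the predictors chosen here are arbitrary):

```r
# mtcars ships with base R and stands in for the admission data
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# trace = 0 suppresses the step-by-step log;
# the return value is the lm object of the selected model
selected_model <- step(full_model, direction = "backward", trace = 0)

print(formula(selected_model))
print(summary(selected_model)$adj.r.squared)
```

Assigning the result avoids copy errors when the selected formula is long, and the object can be passed straight to `summary()` or `predict()`.
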
library(performance)
## Warning: package 'performance' was built under R version 4.1.2
compare_performance(admission_m, model_back)
## # Comparison of Model Performance Indices
## 
## Name        | Model |       AIC | AIC weights |       BIC | BIC weights |    R2 | R2 (adj.) |  RMSE | Sigma
## -----------------------------------------------------------------------------------------------------------
## admission_m |    lm |  -993.228 |     < 0.001 |  -981.254 |     < 0.001 | 0.763 |     0.762 | 0.069 | 0.070
## model_back  |    lm | -1059.225 |       1.000 | -1031.284 |       1.000 | 0.803 |     0.800 | 0.063 | 0.064

The Adjusted R-squared increased from 0.762 to 0.800.
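
The AIC printed by `step()` (about -2196) and the AIC in the comparison table (about -1059) differ for the same model because they are on different scales. `step()` relies on `extractAIC()`, which drops constant terms of the log-likelihood, while the table values appear to match the scale of `AIC()`. For a linear model with n observations the two differ by n * (log(2 * pi) + 1) + 2. The sketch below checks this on `mtcars`, used here only as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

aic_step <- extractAIC(fit)[2]  # scale used by step()
aic_full <- AIC(fit)            # full log-likelihood scale

# Constant offset between the two scales for a linear model
offset <- n * (log(2 * pi) + 1) + 2

print(c(step_scale = aic_step, full_scale = aic_full, offset = offset))
print(all.equal(aic_full - aic_step, offset))
```

With n = 400 the offset is about 1137.15, which is the gap between -2196.38 and -1059.225. Because the offset is the same for every model fitted to the same data, model rankings are unaffected.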

4. Model Prediction

# predict with the first model
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.1.2
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
pred1 <- predict(admission_m, admission)

RMSE(y_pred = pred1, y_true = admission$Chance.of.Admit)
## [1] 0.0693927
# predict with the stepwise model
pred2 <- predict(model_back, admission)

RMSE(y_pred = pred2, y_true = admission$Chance.of.Admit)
## [1] 0.06326207
range(admission$Chance.of.Admit)
## [1] 0.34 0.97

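Compared with the range of the target (0.34 to 0.97), the RMSE of both models is fairly small, so both models still perform reasonably well. Note that these values are computed on the same data used to fit the models.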
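
RMSE is the square root of the mean squared difference between actual and predicted values, and dividing it by the span of the target gives a rough sense of scale. For the stepwise model above, 0.063 / (0.97 - 0.34) is roughly 0.10. The sketch below computes both quantities by hand on `mtcars`, which stands in for the admission data:

```r
fit    <- lm(mpg ~ wt, data = mtcars)
pred   <- predict(fit, mtcars)
actual <- mtcars$mpg

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((actual - pred)^2))

# Express the RMSE as a fraction of the target's span (max - min)
target_span   <- diff(range(actual))
relative_rmse <- rmse / target_span

print(c(rmse = rmse, span = target_span, relative = relative_rmse))
```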

plot(density(model_back$residuals))

library(stats)
shapiro.test(admission_m$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  admission_m$residuals
## W = 0.94782, p-value = 1.143e-10
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.1.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.1.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(admission_m)
## 
##  studentized Breusch-Pagan test
## 
## data:  admission_m
## BP = 19.562, df = 1, p-value = 9.737e-06

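Because the Shapiro-Wilk p-value (1.143e-10) is below alpha (0.05), H0 is rejected, which means the residuals of admission_m are not normally distributed. The Breusch-Pagan p-value (9.737e-06) is also below 0.05, which indicates heteroscedasticity (non-constant residual variance).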
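
The decision rule for the Shapiro-Wilk test compares the p-value with alpha: a p-value below alpha rejects H0 (normality), otherwise H0 is not rejected. A minimal sketch on simulated residuals, where the helper `decide()` and both samples are hypothetical and not part of the original analysis:

```r
set.seed(123)
alpha <- 0.05

normal_resid <- rnorm(400)     # drawn from a normal distribution
skewed_resid <- rexp(400) - 1  # strongly right-skewed, centred near zero

# Hypothetical helper: turn a Shapiro-Wilk p-value into a decision
decide <- function(x, alpha) {
  p <- shapiro.test(x)$p.value
  if (p < alpha) "reject H0: not normal" else "fail to reject H0"
}

print(decide(normal_resid, alpha))
print(decide(skewed_resid, alpha))
```

Scientific notation matters when reading the output: a p-value such as 1.143e-10 is 0.0000000001143, far below 0.05.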

plot(admission$Chance.of.Admit, model_back$residuals)
abline(h = 0, col = "red")

library(car)
## Warning: package 'car' was built under R version 4.1.2
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model_back)
##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    4.585053    4.104255    1.829491    4.808767    1.530007

No VIF value is equal to or greater than 10, so no severe multicollinearity is found among the predictor variables (the predictors are not strongly linearly dependent on one another).
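
For a model with only numeric predictors, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on three `mtcars` columns, chosen arbitrarily as a stand-in for the admission predictors:

```r
predictors <- mtcars[, c("wt", "hp", "disp")]

# For each predictor, regress it on the remaining ones and apply 1 / (1 - R^2)
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others. Values around 4 to 5, as seen for GRE.Score, TOEFL.Score and CGPA, show that those predictors share a fair amount of information even though they stay below the threshold of 10.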

Conclusion

The model_back model has a good Adjusted R-squared of 0.8.

This means that as GRE.Score, TOEFL.Score, CGPA, LOR, and Research increase, Chance.of.Admit also increases. In other words, an applicant who wants a high chance of admission should raise these values. The predictors also influence one another.