STA1382 Analisis Regresi

Cahya Alkahfi | Rizki Ananda

Regresi Linier

Packages

# memuat library yang dibutuhkan
library(ggplot2)
library(cowplot)
library(corrplot)
library(lattice)
library(car)
library(lmtest)

Data

Untuk studi kasus praktikum pemodelan regresi linier, akan menggunakan data Indeks Pembangunan Manusia (IPM) Tahun 2018 dari 119 Kabupaten/Kota di Pulau Jawa.\

Terdapat 16 kolom (tanpa kolom ID) dimana Nilai IPM akan menjadi peubah respon dan 15 lainnya sebagai peubah penjelas (merupakan indikator hasil agregasi data potensi Desa 2018). Keterangan masing-masing kolom adalah sebagai berikut:

Peubah	Deskripsi
ID	Kode Kab/Kota (4 digit Kode BPS)
IPM	Nilai IPM
PR_NO_LIS	Proporsi keluarga yang tidak menggunakan listrik baik PLN maupun non-PLN
PR_SAMPAH	Proporsi desa/kelurahan pada kabupaten/kota dimana sebagian besar warganya membuang sampah di sungai/saluran irigasi/danau/laut serta got/selokan dan lainnya
PR_TINJA	Proporsi desa/kelurahan pada kabupaten/kota dimana tempat pembuangan akhir tinja sebagian besar warganya adalah sawah/kolam/sungai/danau/laut atau pantai/tanah lapang/kebun, lubang tanah dan lainnya
PR_MKM_SUNGAI	Proporsi desa/kelurahan di kabupaten/kota yang memiliki pemukiman di bantaran sungai
PR_SUNGAI_LMBH	Proporsi desa/kelurahan di kabupaten/kota yang memiliki sungai yang tercemar limbah
PR_KUMUH	Proporsi desa/kelurahan di kabupaten/kota yang memiliki pemukiman kumuh
PRA_1000	Jumlah PAUD dan TK per 1000 penduduk
SD_1000	Jumlah SD/MI per 1000 penduduk
SM_1000	Jumlah sekolah menengah (SMP/MTs, SMA/MA, SMK) per 1000 penduduk
RS_PKS_PDK_1000	Jumlah rumah sakit, puskesmas, puskesmas pembantu, poliklinik, praktek dokter per 1000 penduduk
LIN_BID_POS_1000	Jumlah rumah bersalin, praktek bidan, posyandu, polindes per 1000 penduduk
APT_OBT_1000	Jumlah apotek dan toko obat per 1000 penduduk
DOK_DRG_1000	Jumlah dokter dan dokter gigi per 1000 penduduk
BID_1000	Jumlah bidan per 1000 penduduk
GZ_BURUK_1000	Jumlah kejadian gizi buruk per 1000 penduduk

Eksplorasi Data

Dataset terdiri dari 119 observasi dengan 17 kolom. Pada contoh ini kolom ID akan dibuang karena tidak digunakan untuk pemodelan regresi linier.

Seluruh kolom bertipe numerik. Kolom ipm_2018 memiliki rentang nilai dari 61 - 86.11 yang menunjukkan besaran IPM dari setiap Kabupaten/Kota di Pulau Jawa. Kolom-kolom lainnya memiliki nilai yang bervariasi, namun secara umum memiliki rentang nilai angtara 0-1 atau 0-2.

# Membaca data
data <- read.csv("jawa_2018_ipm.csv")

# Melihat Struktur data
str(data)

## 'data.frame':    119 obs. of  17 variables:
##  $ id                : int  3101 3171 3172 3173 3174 3175 3201 3202 3203 3204 ...
##  $ ipm_2018          : num  70.9 84.4 82.1 81 80.9 ...
##  $ PR_NO_LIS         : num  0 0 0 0 0 ...
##  $ PR_SAMPAH         : num  0 0 0 0 0 ...
##  $ PR_TINJA          : num  0 0.0154 0.0308 0 0 ...
##  $ PR_MKM_SUNGAI     : num  0 0.615 0.446 0.25 0.304 ...
##  $ PR_SUNGAI_LMBH    : num  0 0.169 0.369 0.432 0.679 ...
##  $ PR_KUMUH          : num  0.167 0.369 0.554 0.841 0.679 ...
##  $ PRA_1000          : num  0.787 0.199 0.228 0.255 0.134 ...
##  $ SD_1000           : num  0.829 0.311 0.279 0.398 0.234 ...
##  $ SM_1000           : num  0.622 0.219 0.233 0.287 0.172 ...
##  $ RS_PKS_PDK_1000   : num  0.331 0.337 0.298 0.393 0.269 ...
##  $ LIN_BID_POS_P_1000: num  0.29 0.1046 0.1626 0.0552 0.1242 ...
##  $ APT_OBT_1000      : num  0.0414 0.1349 0.1488 0.2293 0.2583 ...
##  $ DOK_DRG_1000      : num  0.373 0.212 0.448 0.313 0.262 ...
##  $ BID_1000          : num  1.906 0.0868 0.2123 0.04 0.1309 ...
##  $ GZ_BURUK_1000     : num  0.16574 0.00757 0.00343 0.01622 0 ...

# Kolom ID tidak diperlukan
data$id = NULL

# Melihat summary data
summary(data)

##     ipm_2018       PR_NO_LIS           PR_SAMPAH          PR_TINJA     
##  Min.   :61.00   Min.   :0.0000000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:67.91   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.1522  
##  Median :71.23   Median :0.0001314   Median :0.04982   Median :0.3169  
##  Mean   :71.98   Mean   :0.0015136   Mean   :0.07444   Mean   :0.3485  
##  3rd Qu.:74.92   3rd Qu.:0.0010234   3rd Qu.:0.10515   3rd Qu.:0.4942  
##  Max.   :86.11   Max.   :0.0246482   Max.   :0.35890   Max.   :0.9471  
##  PR_MKM_SUNGAI     PR_SUNGAI_LMBH      PR_KUMUH          PRA_1000     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.1344  
##  1st Qu.:0.07685   1st Qu.:0.1428   1st Qu.:0.02484   1st Qu.:0.3792  
##  Median :0.12648   Median :0.2174   Median :0.07576   Median :0.4748  
##  Mean   :0.17818   Mean   :0.2682   Mean   :0.15779   Mean   :0.5107  
##  3rd Qu.:0.23558   3rd Qu.:0.3617   3rd Qu.:0.19812   3rd Qu.:0.6020  
##  Max.   :0.76471   Max.   :0.9649   Max.   :0.84091   Max.   :1.2772  
##     SD_1000          SM_1000       RS_PKS_PDK_1000  LIN_BID_POS_P_1000
##  Min.   :0.2093   Min.   :0.1627   Min.   :0.1487   Min.   :0.05515   
##  1st Qu.:0.4782   1st Qu.:0.2237   1st Qu.:0.2198   1st Qu.:0.26374   
##  Median :0.6318   Median :0.2760   Median :0.2641   Median :0.43536   
##  Mean   :0.6159   Mean   :0.3019   Mean   :0.3080   Mean   :0.43437   
##  3rd Qu.:0.7508   3rd Qu.:0.3433   3rd Qu.:0.3486   3rd Qu.:0.58097   
##  Max.   :1.1616   Max.   :0.7648   Max.   :0.8780   Max.   :0.92269   
##   APT_OBT_1000      DOK_DRG_1000        BID_1000       GZ_BURUK_1000    
##  Min.   :0.04144   Min.   :0.06402   Min.   :0.04001   Min.   :0.00000  
##  1st Qu.:0.13320   1st Qu.:0.13407   1st Qu.:0.31515   1st Qu.:0.03831  
##  Median :0.18011   Median :0.17146   Median :0.42429   Median :0.07618  
##  Mean   :0.21075   Mean   :0.23894   Mean   :0.42723   Mean   :0.12594  
##  3rd Qu.:0.24919   3rd Qu.:0.28077   3rd Qu.:0.52864   3rd Qu.:0.15679  
##  Max.   :1.09135   Max.   :1.07837   Max.   :1.90602   Max.   :0.93573

Pengecekan NA

Tidak terdapat nilai NA pada data sehingga tidak diperlukan penanganan data

colSums(is.na(data))

##           ipm_2018          PR_NO_LIS          PR_SAMPAH           PR_TINJA 
##                  0                  0                  0                  0 
##      PR_MKM_SUNGAI     PR_SUNGAI_LMBH           PR_KUMUH           PRA_1000 
##                  0                  0                  0                  0 
##            SD_1000            SM_1000    RS_PKS_PDK_1000 LIN_BID_POS_P_1000 
##                  0                  0                  0                  0 
##       APT_OBT_1000       DOK_DRG_1000           BID_1000      GZ_BURUK_1000 
##                  0                  0                  0                  0

Sebaran Data

plots <- lapply(names(data), function(var_x){
  p <- 
    ggplot(data) +
    aes_string(var_x) +
    geom_density(lwd=1, color="darkorange")
})

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation ideoms with `aes()`

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

plot_grid(plotlist = plots)

Korelasi

Dari plot korelasi berikut, terlihat beberapa kolom memiliki nilai korelasi yang tinggi, sehingga mungkin akan terdapat kondisi multikolinieritas.

Adapun untuk korelasi antara peubah penjelas dengan peubah respon, terlihat beberapa peubah memiliki korelasi yang sangat tinggi seperti DOK_DRG_1000, SD_1000, PR_TINJA dan sebagainya.

Namun demikian kondisi data sebenarnya cukup menarik karena korelasi beberapa data tampak agak “aneh”. Misal, korelasi SD_1000 atau SM_1000 dengan nilai IPM memiliki korelasi yang negatif. Semakin banyak SD_1000 maka nilai IPM di Kabupaten/Kota tersebut semakin rendah.

Sebaliknya Korelasi PR_KUMUH serta PR_SUNGAI_LMBH terhadap nilai IPM bernilai positif. DImana artinya, semakin tinggi proporsi pemukiman kumuh, atau proporsi sungai yang tercemar limbah maka nilai IPM di kabupaten tersebut cenderung lebih tinggi juga.

# menghitung korelasi antar kolom
corr_matrix <- round(cor(data), 2)

# membuat plot korelasi
corrplot(corr_matrix, 
         type="lower",
         method = "color", 
         tl.cex = 0.5, 
         tl.col = "black",
         addCoef.col = "#2F2F2F",
         addCoefasPercent = FALSE,
         number.cex = 0.5,
         diag = FALSE)

Pemodelan

Model dibuat dengan seluruh peubah.

Berdasarkan hasil model regresi linier, dari 15 peubah penjelas hanya 5 peubah saja yang signifikan yaitu DOK_DRG_1000, SD_1000, PR_SAMPAH, APT_OBT_1000, dan PR_TINJA. Hal ini mungkin karena adanya masalah multikolinieritas yang mana akan memperbesar varians dari penduga parameter sehingga banyak peubah yang menjadi tidak signifikan.

Menarik untuk membahas peubah SD_1000 seperti yang sempat disinggung sebelumnya, peubah ini memiliki koefisien bernilai negatif dan menunjukkan bahwa semakin tinggi proporsi SD per 1000 penduduk maka terdapat kecenderungan untuk menurunkan prediksi nilai IPM.

Secara keseluruhan, melihat dari besaran nilai \(R^2\), dapat dikatakan bahwa model (dengan 15 peubah penjelas) mampu menjelaskan sekitar 84,74 persen keragaman pada nilai IPM Kabupaten/Kota di Pulau Jawa.

# membuat model regresi linier
model <- lm(ipm_2018 ~ ., data)

# melihat summary dari model
summary(model)

## 
## Call:
## lm(formula = ipm_2018 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9279 -1.3472 -0.0228  1.2981  6.6353 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         77.69339    1.61562  48.089  < 2e-16 ***
## PR_NO_LIS          -54.16621   72.84019  -0.744 0.458793    
## PR_SAMPAH          -12.32380    3.34207  -3.687 0.000364 ***
## PR_TINJA            -3.85873    1.63650  -2.358 0.020266 *  
## PR_MKM_SUNGAI       -0.30645    2.23861  -0.137 0.891383    
## PR_SUNGAI_LMBH      -0.06651    1.70957  -0.039 0.969043    
## PR_KUMUH             1.14777    1.87576   0.612 0.541953    
## PRA_1000            -1.47333    1.35941  -1.084 0.280981    
## SD_1000             -8.47723    2.23944  -3.785 0.000258 ***
## SM_1000             -3.55704    2.85238  -1.247 0.215211    
## RS_PKS_PDK_1000      5.06631    3.63456   1.394 0.166340    
## LIN_BID_POS_P_1000   2.95048    2.03778   1.448 0.150685    
## APT_OBT_1000        -7.72740    2.88662  -2.677 0.008646 ** 
## DOK_DRG_1000        13.50010    2.48140   5.441 3.61e-07 ***
## BID_1000            -2.22060    1.52796  -1.453 0.149178    
## GZ_BURUK_1000        0.63440    1.56650   0.405 0.686335    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.282 on 103 degrees of freedom
## Multiple R-squared:  0.8474, Adjusted R-squared:  0.8252 
## F-statistic: 38.14 on 15 and 103 DF,  p-value: < 2.2e-16

Pengecekan Multikolinieritas

Nilai VIF yang diperoleh untuk masing-masing peubah ternyata cukup kecil, bahkan tidak ada yang lebih besar dari 10. Jika melihat hasil ini tampaknya tidak terdapat masalah multikolinieritas antar peubah penjelasnya.

(Mohon dicek kembali barangkali terdapat kekeliruan)

# pengecekan multicollinearity
vif(model)

##          PR_NO_LIS          PR_SAMPAH           PR_TINJA      PR_MKM_SUNGAI 
##           1.674004           1.752889           3.883222           2.519156 
##     PR_SUNGAI_LMBH           PR_KUMUH           PRA_1000            SD_1000 
##           2.248597           2.945363           1.840095           4.638975 
##            SM_1000    RS_PKS_PDK_1000 LIN_BID_POS_P_1000       APT_OBT_1000 
##           2.323137           5.283640           3.867567           3.014481 
##       DOK_DRG_1000           BID_1000      GZ_BURUK_1000 
##           4.545848           2.377227           1.272896

Menghitung VIF manual:

X <- data[-1]

rsquared <- function(df, var) {
  y <- df[, var]
  x <- df[, setdiff(names(df), var)]
  formula <- as.formula(paste(var, "~", paste(names(x), collapse = " + ")))
  fit <- lm(formula, data=df)
  return(summary(fit)$r.squared)
}

vifs <- sapply(names(X), 
               function(var) {
                  r2 <- rsquared(X, var)
                  vif <- 1 / (1 - r2)
                  return(vif)
               })

vifs

##          PR_NO_LIS          PR_SAMPAH           PR_TINJA      PR_MKM_SUNGAI 
##           1.674004           1.752889           3.883222           2.519156 
##     PR_SUNGAI_LMBH           PR_KUMUH           PRA_1000            SD_1000 
##           2.248597           2.945363           1.840095           4.638975 
##            SM_1000    RS_PKS_PDK_1000 LIN_BID_POS_P_1000       APT_OBT_1000 
##           2.323137           5.283640           3.867567           3.014481 
##       DOK_DRG_1000           BID_1000      GZ_BURUK_1000 
##           4.545848           2.377227           1.272896

Pengecekan Asumsi (Plot)

Homoskedastisitas : varian dari error (residual) tidak bergantung pada nilai prediksi. Dalam kata lain, varian dari error harus sama untuk setiap nilai prediksi. Hal ini penting karena jika varian dari error tidak sama, maka interval kepercayaan dari prediksi model tidak akan valid.

Untuk memeriksa homoskedastisitas, dapat dilakukan dengan membuat scatter plot antara residual dan nilai prediksi. Jika tidak terdapat pola atau varian dari residual tidak sama untuk setiap nilai prediksi, maka model tidak homoskedastik dan perlu ditangani terlebih dahulu.

Pada kasus ini, terlihat plot residual dan nilai prediksinya mecenyabr secara acak dengan rata-rata 0, sehingga dapat kita katakan bahwa kondisi homoskedastisitas terpenuhi.

Normalitas : residual menyebar normal dengan nilai tengah 0 dan varians \(\sigma^2\).

Untuk mengecek normalitas dapat melalui density plot dari residual atau qunatile-to-quantile (QQ plot). Dari densitly plot dapt terlihat jelas bahwa residual menyebar normal dengan nilai tengah di sekitar nilai 0. Adapun melalui QQ plot, terlihat residual berada di sekitar garis normal)

Independensi : residual harus independen satu sama lain. Ini berarti bahwa tidak ada hubungan antar residual dan tidak ada pengaruh dari residual sebelumnya pada residual selanjutnya. Pengecekan dapat menggunakan grafik ACF Plot.

Penurut plot ACF tidak terdapat autokorelasi atau residual saling bebas.

# plot homoscedasticity
ggplot(model, aes(x = fitted(model), y = resid(model))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Scatter Plot Homoscedasticity")

# plot normalitas
densityplot(resid(model), main = "Plot Normalitas")

qqnorm(resid(model), main = "Q-Q Plot Normalitas")
qqline(resid(model))

# plot independence
acf(resid(model), main = "Plot Autokorelasi Residual")

Pengecekan Asumsi (Uji Formal)

Jika output menggunakan plot masih meragukan, dapat pula digunakan uji formal.

Untuk Normalitas terdapat beberapa uji yang dapat digunakan, salah satunya uji shapiro-wick. p-value \(>\) 0.05 berarti terima H0 dalam hal ini adalah residual menyebar normal.

Untuk asumsi homoskedastisitas juga terdapat beberapa uji yang dapat digunakan, salah satunya Uji Breusch-Pagan. p-value \(>\) 0.05 berarti terima H0 atau residual memenuhi asumsi homoskedastisitas.

Untuk asumsi independen juga terdapat beberapa uji yang dapat digunakan, salah satunya Uji Ljung-Box. p-value \(>\) 0.05 berarti terima H0 atau residual memenuhi asumsi independen (tidak terdapat autokorelasi).

# memeriksa asumsi normalitas
shapiro.test(resid(model))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 0.99134, p-value = 0.6644

# memeriksa asumsi homoscedasticity
bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 15.26, df = 15, p-value = 0.4329

# Uji Ljung-Box
stats::Box.test(residuals(model), lag = 20, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(model)
## X-squared = 18.942, df = 20, p-value = 0.5256

Evaluasi Model

Evaluasi model (seharusnya) menggunakan data yang belum pernah dilihat oleh model. Sehingga dalam pembentukan model sebaiknya membagi data terlebih dahulu mejadi data training dan testing. Atau dapat pula menggunakan Validasi Silang (misal K-Fold CV).

Pada contoh ini kita akan menggunakan keseluruhan data yang digunakan dalam pembentukan model.

prediksi <- predict(model, data[-1])

Root Mean Squared Error (RMSE)

# Manual 
sqrt(mean((data$ipm_2018-prediksi)^2))

## [1] 2.123366

# Package RMSE
Metrics::rmse(data$ipm_2018, prediksi)

## [1] 2.123366

Mean Absolute Error (MAE)

# Manual 
mean(abs(data$ipm_2018-prediksi))

## [1] 1.636279

# Package RMSE
Metrics::mae(data$ipm_2018, prediksi)

## [1] 1.636279

Mean Absolute Percentage Error (MAPE)

# Manual 
 mean(abs((data$ipm_2018 - prediksi) / data$ipm_2018)) * 100

## [1] 2.275987

# Package RMSE
Metrics::mape(data$ipm_2018, prediksi)*100

## [1] 2.275987

Regresi Logistik

Packages

library(caret)
library(skimr)
library(ggplot2)
library(cowplot)
library(dplyr)
library(gridExtra)
library(glmnet)

Data

Pada latihan kali ini data yang digunakan adalah data German Credit. Data tersebut merupakan 1000 data pemohon pinjaman yang berisi 20 peubah penjelas sebagai profil dari pemohon dan 1 peubah respon sebagai risiko kredit baik atau buruk. Peubah penjelas terdiri dari 13 peubah kategorik dan 7 peubah numerik. Pada data tersebut terdapat 700 sampel yang tergolong sebagai peminjam yang baik dan 300 sampel sebagai peminjam yang buruk.

germancredit <- read.csv("credit-g.csv", stringsAsFactors = TRUE)
head(germancredit)

str(germancredit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_status       : Factor w/ 4 levels "<0",">=200","0<=X<200",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ duration              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history        : Factor w/ 5 levels "all paid","critical/other existing credit",..: 2 4 2 4 3 4 4 4 4 2 ...
##  $ purpose               : Factor w/ 10 levels "business","domestic appliance",..: 7 7 3 4 5 3 4 10 7 5 ...
##  $ credit_amount         : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_status        : Factor w/ 5 levels "<100",">=1000",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment            : Factor w/ 5 levels "<1",">=7","1<=X<4",..: 2 3 4 4 3 3 2 3 4 5 ...
##  $ installment_commitment: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status       : Factor w/ 4 levels "female div/dep/mar",..: 4 1 4 4 4 4 4 4 2 3 ...
##  $ other_parties         : Factor w/ 3 levels "co applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ residence_since       : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property_magnitude    : Factor w/ 4 levels "car","life insurance",..: 4 4 4 2 3 3 2 1 4 1 ...
##  $ age                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_payment_plans   : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing               : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_credits      : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                   : Factor w/ 4 levels "high qualif/self emp/mgmt",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ num_dependents        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ own_telephone         : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker        : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ class                 : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...

Peubah	Tipe Peubah	Keterangan
checking_status	Kategorik	Status Kredit
duration	Numerik	Durasi Kredit
credit_history	Kategorik	Riwayat Kredit
purpose	Kategorik	Tujuan Kredit
credit_amount	Numerik	Jumlah Kredit
savings_status	Kategorik	Tabungan
employment	Kategorik	Lama Bekerja
installment_commitment	Numerik	Tingkat angsuran yang dapat disisihkan dari pendapatan
personal_status	Kategorik	Status Jenis Kelamin dan Perkawinan
other_parties	Kategorik	Debitur Lain
residence_since	Numerik	Lama tinggal di alamat sekarang
property_magnitude	Kategorik	Properti atau aset paling berharga
age	Numerik	Umur
other_payment_plans	Kategorik	Cicilan lainnya
housing	Kategorik	Jenis tempat tinggal
existing_credits	Numerik	Jumlah kredit yang ada di bank ini
job	Kategorik	Pekerjaan
num_dependents	Numerik	Jumlah tanggungan
own_telephone	Kategorik	Langganan telepon
foreign_worker	Kategorik	Status pekerja asing
class	Kategorik	Risiko Kredit (Bad and Good)

Eksplorasi Data

Berikut adalah ringkasan dari data German Credit. Tidak terdapat missing values pada dataset tersebut sehingga tidak perlu dilakukan penanganan.

skim(germancredit)

Data summary
Name	germancredit
Number of rows	1000
Number of columns	21
_______________________
Column type frequency:
factor	14
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
checking_status	1	FALSE	4	no : 394, <0: 274, 0<=: 269, >=2: 63
credit_history	1	FALSE	5	exi: 530, cri: 293, del: 88, all: 49
purpose	1	FALSE	10	rad: 280, new: 234, fur: 181, use: 103
savings_status	1	FALSE	5	<10: 603, no : 183, 100: 103, 500: 63
employment	1	FALSE	5	1<=: 339, >=7: 253, 4<=: 174, <1: 172
personal_status	1	FALSE	4	mal: 548, fem: 310, mal: 92, mal: 50
other_parties	1	FALSE	3	non: 907, gua: 52, co : 41
property_magnitude	1	FALSE	4	car: 332, rea: 282, lif: 232, no : 154
other_payment_plans	1	FALSE	3	non: 814, ban: 139, sto: 47
housing	1	FALSE	3	own: 713, ren: 179, for: 108
job	1	FALSE	4	ski: 630, uns: 200, hig: 148, une: 22
own_telephone	1	FALSE	2	non: 596, yes: 404
foreign_worker	1	FALSE	2	yes: 963, no: 37
class	1	FALSE	2	goo: 700, bad: 300

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
duration	1	20.90	12.06	4	12.0	18.0	24.00	72	▇▇▂▁▁
credit_amount	1	3271.26	2822.74	250	1365.5	2319.5	3972.25	18424	▇▂▁▁▁
installment_commitment	1	2.97	1.12	1	2.0	3.0	4.00	4	▂▃▁▂▇
residence_since	1	2.85	1.10	1	2.0	3.0	4.00	4	▂▆▁▃▇
age	1	35.55	11.38	19	27.0	33.0	42.00	75	▇▆▃▁▁
existing_credits	1	1.41	0.58	1	1.0	1.0	2.00	4	▇▅▁▁▁
num_dependents	1	1.16	0.36	1	1.0	1.0	1.00	2	▇▁▁▁▂

summary(germancredit)

##     checking_status    duration                           credit_history
##  <0         :274    Min.   : 4.0   all paid                      : 49   
##  >=200      : 63    1st Qu.:12.0   critical/other existing credit:293   
##  0<=X<200   :269    Median :18.0   delayed previously            : 88   
##  no checking:394    Mean   :20.9   existing paid                 :530   
##                     3rd Qu.:24.0   no credits/all paid           : 40   
##                     Max.   :72.0                                        
##                                                                         
##                 purpose    credit_amount            savings_status
##  radio/tv           :280   Min.   :  250   <100            :603   
##  new car            :234   1st Qu.: 1366   >=1000          : 48   
##  furniture/equipment:181   Median : 2320   100<=X<500      :103   
##  used car           :103   Mean   : 3271   500<=X<1000     : 63   
##  business           : 97   3rd Qu.: 3972   no known savings:183   
##  education          : 50   Max.   :18424                          
##  (Other)            : 55                                          
##       employment  installment_commitment           personal_status
##  <1        :172   Min.   :1.000          female div/dep/mar:310   
##  >=7       :253   1st Qu.:2.000          male div/sep      : 50   
##  1<=X<4    :339   Median :3.000          male mar/wid      : 92   
##  4<=X<7    :174   Mean   :2.973          male single       :548   
##  unemployed: 62   3rd Qu.:4.000                                   
##                   Max.   :4.000                                   
##                                                                   
##       other_parties residence_since         property_magnitude      age       
##  co applicant: 41   Min.   :1.000   car              :332      Min.   :19.00  
##  guarantor   : 52   1st Qu.:2.000   life insurance   :232      1st Qu.:27.00  
##  none        :907   Median :3.000   no known property:154      Median :33.00  
##                     Mean   :2.845   real estate      :282      Mean   :35.55  
##                     3rd Qu.:4.000                              3rd Qu.:42.00  
##                     Max.   :4.000                              Max.   :75.00  
##                                                                               
##  other_payment_plans     housing    existing_credits
##  bank  :139          for free:108   Min.   :1.000   
##  none  :814          own     :713   1st Qu.:1.000   
##  stores: 47          rent    :179   Median :1.000   
##                                     Mean   :1.407   
##                                     3rd Qu.:2.000   
##                                     Max.   :4.000   
##                                                     
##                         job      num_dependents  own_telephone foreign_worker
##  high qualif/self emp/mgmt:148   Min.   :1.000   none:596      no : 37       
##  skilled                  :630   1st Qu.:1.000   yes :404      yes:963       
##  unemp/unskilled non res  : 22   Median :1.000                               
##  unskilled resident       :200   Mean   :1.155                               
##                                  3rd Qu.:1.000                               
##                                  Max.   :2.000                               
##                                                                              
##   class    
##  bad :300  
##  good:700  
##            
##            
##            
##            
##

prop.table(table(germancredit$class))

## 
##  bad good 
##  0.3  0.7

Dapat dilihat bahwa persentase peminjam yang masuk dalam kategori bad sebesar 30% dan kategori good sebesar 70%. Data ini termasuk data tidak seimbang sehingga jika dilakukan penanganan data tidak seimbang akan lebih baik namun pada latihan kali ini belum dilakukan penanganan data tidak seimbang.

Sebaran Peubah Numerik

p1 <- ggplot(germancredit, aes(x=class, y=duration, fill=class)) + 
  geom_boxplot()

p2 <- ggplot(germancredit, aes(x=class, y=credit_amount, fill=class)) + 
  geom_boxplot()

p3 <- ggplot(germancredit, aes(x=class, y=installment_commitment, fill=class)) + 
  geom_boxplot()

p4 <- ggplot(germancredit, aes(x=class, y=residence_since, fill=class)) + 
  geom_boxplot()

p5 <- ggplot(germancredit, aes(x=class, y=age, fill=class)) + 
  geom_boxplot()

p6 <- ggplot(germancredit, aes(x=class, y=existing_credits, fill=class)) + 
  geom_boxplot()

p7 <- ggplot(germancredit, aes(x=class, y=num_dependents, fill=class)) + 
  geom_boxplot()

grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

grid.arrange(p5, p6, p7, nrow = 2, ncol = 2)

Sebaran Peubah Kategorik

k1 <- ggplot(germancredit, aes(checking_status, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k2 <- ggplot(germancredit, aes(credit_history, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k3<- ggplot(germancredit, aes(purpose, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k4 <- ggplot(germancredit, aes(savings_status, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k5 <- ggplot(germancredit, aes(employment, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k6 <-  ggplot(germancredit, aes(personal_status, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k7 <- ggplot(germancredit, aes(other_parties, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k8 <- ggplot(germancredit, aes(property_magnitude, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k9 <- ggplot(germancredit, aes(other_payment_plans, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k10 <- ggplot(germancredit, aes(housing, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k11 <- ggplot(germancredit, aes(job, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k12 <- ggplot(germancredit, aes(own_telephone, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

k13 <- ggplot(germancredit, aes(foreign_worker, after_stat(count))) + 
  geom_bar(aes(fill = class), position = "dodge")

grid.arrange(k1, k2, nrow = 2, ncol = 1)

grid.arrange(k3, k4, nrow = 2, ncol = 1)

grid.arrange(k5, k6, nrow = 2, ncol = 1)

grid.arrange(k7, k8, nrow = 2, ncol = 1)

grid.arrange(k9, k10, nrow = 2, ncol = 1)

grid.arrange(k11, k12, nrow = 2, ncol = 1)

grid.arrange(k13, nrow = 1, ncol = 1)

Pemodelan

Spliting Data

Split Train/Testing adalah salah satu metode yang dapat digunakan untuk mengevaluasi performa model machine learning. Metode ini membagi dataset menjadi dua bagian, yaitu data training atau data latih dan data testing atau data uji. Data training digunakan untuk membuat model, sementara data testing digunakan untuk evaluasi kebaikan model. Pada latihan ini data dibagi menjadi 80% data training dan 20% data testing.

set.seed(12420246)
in.train <- createDataPartition(as.factor(germancredit$class), p=0.8, list=FALSE)
german.credit.train <- germancredit[in.train,]
german.credit.test <- germancredit[-in.train,]

Seleksi Peubah

Dari 20 peubah penjelas dilakukan seleksi peubah dengan metode Stepwise Variable Selection menggunakan AIC. Peubah yang signifikan terpilih adalah:

checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
other_parties
other_payment_plans
housing
foreign_worker

credit.glm0 <- glm(class ~ ., family = binomial, german.credit.train)
credit.glm.step <- step(credit.glm0)

## Start:  AIC=790.74
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + residence_since + property_magnitude + 
##     age + other_payment_plans + housing + existing_credits + 
##     job + num_dependents + own_telephone + foreign_worker
## 
##                          Df Deviance    AIC
## - property_magnitude      3   694.23 786.23
## - job                     3   694.85 786.85
## - residence_since         1   693.36 789.36
## - age                     1   693.65 789.65
## - own_telephone           1   693.79 789.79
## - existing_credits        1   693.84 789.84
## - num_dependents          1   694.05 790.05
## - housing                 2   696.21 790.21
## <none>                        692.74 790.74
## - personal_status         3   699.59 791.59
## - other_parties           2   697.71 791.71
## - credit_amount           1   696.80 792.80
## - employment              4   704.66 794.66
## - other_payment_plans     2   700.68 794.68
## - foreign_worker          1   699.72 795.72
## - installment_commitment  1   700.92 796.92
## - duration                1   703.41 799.41
## - savings_status          4   714.35 804.35
## - credit_history          4   715.30 805.30
## - purpose                 9   726.69 806.69
## - checking_status         3   743.28 835.28
## 
## Step:  AIC=786.23
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + residence_since + age + 
##     other_payment_plans + housing + existing_credits + job + 
##     num_dependents + own_telephone + foreign_worker
## 
##                          Df Deviance    AIC
## - job                     3   696.37 782.37
## - residence_since         1   694.88 784.88
## - age                     1   695.16 785.16
## - own_telephone           1   695.24 785.24
## - existing_credits        1   695.33 785.33
## - num_dependents          1   695.43 785.43
## - housing                 2   697.58 785.58
## <none>                        694.23 786.23
## - personal_status         3   700.89 786.89
## - other_parties           2   699.07 787.07
## - credit_amount           1   698.39 788.39
## - other_payment_plans     2   702.17 790.17
## - employment              4   706.55 790.55
## - foreign_worker          1   701.05 791.05
## - installment_commitment  1   702.78 792.78
## - duration                1   705.13 795.13
## - savings_status          4   715.42 799.42
## - credit_history          4   716.63 800.63
## - purpose                 9   729.95 803.95
## - checking_status         3   746.39 832.39
## 
## Step:  AIC=782.37
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + residence_since + age + 
##     other_payment_plans + housing + existing_credits + num_dependents + 
##     own_telephone + foreign_worker
## 
##                          Df Deviance    AIC
## - own_telephone           1   696.85 780.85
## - residence_since         1   696.93 780.93
## - existing_credits        1   697.19 781.19
## - age                     1   697.26 781.26
## - num_dependents          1   697.46 781.46
## - housing                 2   699.78 781.78
## <none>                        696.37 782.37
## - personal_status         3   702.94 782.94
## - other_parties           2   701.25 783.25
## - employment              4   708.10 786.10
## - credit_amount           1   702.27 786.27
## - other_payment_plans     2   704.87 786.87
## - foreign_worker          1   703.04 787.04
## - duration                1   706.12 790.12
## - installment_commitment  1   706.23 790.23
## - credit_history          4   718.22 796.22
## - savings_status          4   718.50 796.50
## - purpose                 9   731.15 799.15
## - checking_status         3   748.21 828.21
## 
## Step:  AIC=780.85
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + residence_since + age + 
##     other_payment_plans + housing + existing_credits + num_dependents + 
##     foreign_worker
## 
##                          Df Deviance    AIC
## - residence_since         1   697.39 779.39
## - existing_credits        1   697.61 779.61
## - num_dependents          1   697.98 779.98
## - age                     1   697.99 779.99
## - housing                 2   700.26 780.26
## <none>                        696.85 780.85
## - personal_status         3   703.43 781.43
## - other_parties           2   701.79 781.79
## - credit_amount           1   702.29 784.29
## - employment              4   708.83 784.83
## - foreign_worker          1   703.35 785.35
## - other_payment_plans     2   705.44 785.44
## - installment_commitment  1   706.50 788.50
## - duration                1   707.02 789.02
## - credit_history          4   718.80 794.80
## - savings_status          4   719.37 795.37
## - purpose                 9   732.00 798.00
## - checking_status         3   749.89 827.89
## 
## Step:  AIC=779.39
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + age + other_payment_plans + 
##     housing + existing_credits + num_dependents + foreign_worker
## 
##                          Df Deviance    AIC
## - existing_credits        1   698.22 778.22
## - age                     1   698.35 778.35
## - num_dependents          1   698.55 778.55
## <none>                        697.39 779.39
## - housing                 2   701.85 779.85
## - personal_status         3   704.29 780.29
## - other_parties           2   702.33 780.33
## - credit_amount           1   702.76 782.76
## - employment              4   708.89 782.89
## - other_payment_plans     2   705.91 783.91
## - foreign_worker          1   704.13 784.13
## - installment_commitment  1   707.08 787.08
## - duration                1   707.64 787.64
## - credit_history          4   719.12 793.12
## - savings_status          4   719.62 793.62
## - purpose                 9   732.34 796.34
## - checking_status         3   751.34 827.34
## 
## Step:  AIC=778.22
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + age + other_payment_plans + 
##     housing + num_dependents + foreign_worker
## 
##                          Df Deviance    AIC
## - age                     1   699.15 777.15
## - num_dependents          1   699.68 777.68
## <none>                        698.22 778.22
## - housing                 2   702.72 778.72
## - personal_status         3   705.12 779.12
## - other_parties           2   703.23 779.23
## - employment              4   709.43 781.43
## - credit_amount           1   703.94 781.94
## - other_payment_plans     2   706.95 782.95
## - foreign_worker          1   705.15 783.15
## - installment_commitment  1   707.98 785.98
## - duration                1   708.09 786.09
## - credit_history          4   719.67 791.67
## - savings_status          4   720.22 792.22
## - purpose                 9   733.25 795.25
## - checking_status         3   751.93 825.93
## 
## Step:  AIC=777.15
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + other_payment_plans + housing + 
##     num_dependents + foreign_worker
## 
##                          Df Deviance    AIC
## - num_dependents          1   700.47 776.47
## <none>                        699.15 777.15
## - personal_status         3   706.08 778.08
## - other_parties           2   704.21 778.21
## - housing                 2   704.49 778.49
## - credit_amount           1   704.61 780.61
## - employment              4   711.02 781.02
## - other_payment_plans     2   707.69 781.69
## - foreign_worker          1   706.26 782.26
## - installment_commitment  1   708.58 784.58
## - duration                1   709.46 785.46
## - credit_history          4   721.13 791.13
## - savings_status          4   721.67 791.67
## - purpose                 9   733.94 793.94
## - checking_status         3   753.11 825.11
## 
## Step:  AIC=776.47
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     personal_status + other_parties + other_payment_plans + housing + 
##     foreign_worker
## 
##                          Df Deviance    AIC
## - personal_status         3   706.37 776.37
## <none>                        700.47 776.47
## - other_parties           2   705.54 777.54
## - housing                 2   705.80 777.80
## - credit_amount           1   705.65 779.65
## - employment              4   712.20 780.20
## - other_payment_plans     2   709.25 781.25
## - foreign_worker          1   707.35 781.35
## - installment_commitment  1   709.56 783.56
## - duration                1   710.34 784.34
## - credit_history          4   722.73 790.73
## - savings_status          4   723.00 791.00
## - purpose                 9   735.25 793.25
## - checking_status         3   754.53 824.53
## 
## Step:  AIC=776.37
## class ~ checking_status + duration + credit_history + purpose + 
##     credit_amount + savings_status + employment + installment_commitment + 
##     other_parties + other_payment_plans + housing + foreign_worker
## 
##                          Df Deviance    AIC
## <none>                        706.37 776.37
## - other_parties           2   711.20 777.20
## - credit_amount           1   710.40 778.40
## - housing                 2   713.07 779.07
## - other_payment_plans     2   713.89 779.89
## - installment_commitment  1   713.42 781.42
## - foreign_worker          1   713.88 781.88
## - employment              4   721.30 783.30
## - duration                1   716.74 784.74
## - savings_status          4   728.10 790.10
## - credit_history          4   729.70 791.70
## - purpose                 9   741.48 793.48
## - checking_status         3   760.63 824.63

summary(credit.glm.step)

## 
## Call:
## glm(formula = class ~ checking_status + duration + credit_history + 
##     purpose + credit_amount + savings_status + employment + installment_commitment + 
##     other_parties + other_payment_plans + housing + foreign_worker, 
##     family = binomial, data = german.credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6954  -0.6619   0.3654   0.6793   2.3281  
## 
## Coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                   1.501e+00  1.152e+00   1.303
## checking_status>=200                          7.324e-01  4.079e-01   1.796
## checking_status0<=X<200                       2.981e-01  2.475e-01   1.204
## checking_statusno checking                    1.707e+00  2.586e-01   6.599
## duration                                     -3.420e-02  1.067e-02  -3.203
## credit_historycritical/other existing credit  1.452e+00  4.583e-01   3.168
## credit_historydelayed previously              1.238e+00  5.263e-01   2.352
## credit_historyexisting paid                   6.459e-01  4.283e-01   1.508
## credit_historyno credits/all paid            -2.175e-01  6.010e-01  -0.362
## purposedomestic appliance                    -2.455e-01  8.001e-01  -0.307
## purposeeducation                             -1.128e+00  5.308e-01  -2.126
## purposefurniture/equipment                   -3.124e-01  3.831e-01  -0.816
## purposenew car                               -9.982e-01  3.734e-01  -2.673
## purposeother                                  3.574e-01  9.267e-01   0.386
## purposeradio/tv                               1.207e-01  3.641e-01   0.332
## purposerepairs                               -2.513e-01  6.545e-01  -0.384
## purposeretraining                             9.639e-01  1.232e+00   0.782
## purposeused car                               8.487e-01  4.823e-01   1.760
## credit_amount                                -9.511e-05  4.774e-05  -1.992
## savings_status>=1000                          1.807e+00  6.228e-01   2.901
## savings_status100<=X<500                      4.432e-01  3.280e-01   1.351
## savings_status500<=X<1000                    -1.129e-02  3.958e-01  -0.029
## savings_statusno known savings                9.832e-01  2.917e-01   3.370
## employment>=7                                 6.816e-01  3.053e-01   2.232
## employment1<=X<4                              3.789e-01  2.692e-01   1.407
## employment4<=X<7                              1.235e+00  3.417e-01   3.615
## employmentunemployed                          5.382e-01  4.383e-01   1.228
## installment_commitment                       -2.508e-01  9.548e-02  -2.627
## other_partiesguarantor                        1.282e+00  6.045e-01   2.121
## other_partiesnone                             5.317e-01  4.486e-01   1.185
## other_payment_plansnone                       6.973e-01  2.729e-01   2.555
## other_payment_plansstores                     9.242e-02  4.933e-01   0.187
## housingown                                   -4.471e-02  3.300e-01  -0.135
## housingrent                                  -6.611e-01  3.770e-01  -1.754
## foreign_workeryes                            -1.859e+00  8.032e-01  -2.314
##                                              Pr(>|z|)    
## (Intercept)                                   0.19251    
## checking_status>=200                          0.07257 .  
## checking_status0<=X<200                       0.22843    
## checking_statusno checking                   4.14e-11 ***
## duration                                      0.00136 ** 
## credit_historycritical/other existing credit  0.00154 ** 
## credit_historydelayed previously              0.01867 *  
## credit_historyexisting paid                   0.13151    
## credit_historyno credits/all paid             0.71735    
## purposedomestic appliance                     0.75897    
## purposeeducation                              0.03352 *  
## purposefurniture/equipment                    0.41474    
## purposenew car                                0.00751 ** 
## purposeother                                  0.69972    
## purposeradio/tv                               0.74018    
## purposerepairs                                0.70103    
## purposeretraining                             0.43406    
## purposeused car                               0.07846 .  
## credit_amount                                 0.04633 *  
## savings_status>=1000                          0.00372 ** 
## savings_status100<=X<500                      0.17661    
## savings_status500<=X<1000                     0.97723    
## savings_statusno known savings                0.00075 ***
## employment>=7                                 0.02560 *  
## employment1<=X<4                              0.15939    
## employment4<=X<7                              0.00030 ***
## employmentunemployed                          0.21942    
## installment_commitment                        0.00862 ** 
## other_partiesguarantor                        0.03395 *  
## other_partiesnone                             0.23583    
## other_payment_plansnone                       0.01062 *  
## other_payment_plansstores                     0.85140    
## housingown                                    0.89223    
## housingrent                                   0.07949 .  
## foreign_workeryes                             0.02064 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 977.38  on 799  degrees of freedom
## Residual deviance: 706.37  on 765  degrees of freedom
## AIC: 776.37
## 
## Number of Fisher Scoring iterations: 5

Final Model

credit.glm.final <- glm(formula = class ~ checking_status + duration + credit_history + purpose + credit_amount + savings_status + employment + installment_commitment + 
    other_parties + other_payment_plans + housing + foreign_worker, family = binomial, data = german.credit.train)

summary(credit.glm.final)

## 
## Call:
## glm(formula = class ~ checking_status + duration + credit_history + 
##     purpose + credit_amount + savings_status + employment + installment_commitment + 
##     other_parties + other_payment_plans + housing + foreign_worker, 
##     family = binomial, data = german.credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6954  -0.6619   0.3654   0.6793   2.3281  
## 
## Coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                   1.501e+00  1.152e+00   1.303
## checking_status>=200                          7.324e-01  4.079e-01   1.796
## checking_status0<=X<200                       2.981e-01  2.475e-01   1.204
## checking_statusno checking                    1.707e+00  2.586e-01   6.599
## duration                                     -3.420e-02  1.067e-02  -3.203
## credit_historycritical/other existing credit  1.452e+00  4.583e-01   3.168
## credit_historydelayed previously              1.238e+00  5.263e-01   2.352
## credit_historyexisting paid                   6.459e-01  4.283e-01   1.508
## credit_historyno credits/all paid            -2.175e-01  6.010e-01  -0.362
## purposedomestic appliance                    -2.455e-01  8.001e-01  -0.307
## purposeeducation                             -1.128e+00  5.308e-01  -2.126
## purposefurniture/equipment                   -3.124e-01  3.831e-01  -0.816
## purposenew car                               -9.982e-01  3.734e-01  -2.673
## purposeother                                  3.574e-01  9.267e-01   0.386
## purposeradio/tv                               1.207e-01  3.641e-01   0.332
## purposerepairs                               -2.513e-01  6.545e-01  -0.384
## purposeretraining                             9.639e-01  1.232e+00   0.782
## purposeused car                               8.487e-01  4.823e-01   1.760
## credit_amount                                -9.511e-05  4.774e-05  -1.992
## savings_status>=1000                          1.807e+00  6.228e-01   2.901
## savings_status100<=X<500                      4.432e-01  3.280e-01   1.351
## savings_status500<=X<1000                    -1.129e-02  3.958e-01  -0.029
## savings_statusno known savings                9.832e-01  2.917e-01   3.370
## employment>=7                                 6.816e-01  3.053e-01   2.232
## employment1<=X<4                              3.789e-01  2.692e-01   1.407
## employment4<=X<7                              1.235e+00  3.417e-01   3.615
## employmentunemployed                          5.382e-01  4.383e-01   1.228
## installment_commitment                       -2.508e-01  9.548e-02  -2.627
## other_partiesguarantor                        1.282e+00  6.045e-01   2.121
## other_partiesnone                             5.317e-01  4.486e-01   1.185
## other_payment_plansnone                       6.973e-01  2.729e-01   2.555
## other_payment_plansstores                     9.242e-02  4.933e-01   0.187
## housingown                                   -4.471e-02  3.300e-01  -0.135
## housingrent                                  -6.611e-01  3.770e-01  -1.754
## foreign_workeryes                            -1.859e+00  8.032e-01  -2.314
##                                              Pr(>|z|)    
## (Intercept)                                   0.19251    
## checking_status>=200                          0.07257 .  
## checking_status0<=X<200                       0.22843    
## checking_statusno checking                   4.14e-11 ***
## duration                                      0.00136 ** 
## credit_historycritical/other existing credit  0.00154 ** 
## credit_historydelayed previously              0.01867 *  
## credit_historyexisting paid                   0.13151    
## credit_historyno credits/all paid             0.71735    
## purposedomestic appliance                     0.75897    
## purposeeducation                              0.03352 *  
## purposefurniture/equipment                    0.41474    
## purposenew car                                0.00751 ** 
## purposeother                                  0.69972    
## purposeradio/tv                               0.74018    
## purposerepairs                                0.70103    
## purposeretraining                             0.43406    
## purposeused car                               0.07846 .  
## credit_amount                                 0.04633 *  
## savings_status>=1000                          0.00372 ** 
## savings_status100<=X<500                      0.17661    
## savings_status500<=X<1000                     0.97723    
## savings_statusno known savings                0.00075 ***
## employment>=7                                 0.02560 *  
## employment1<=X<4                              0.15939    
## employment4<=X<7                              0.00030 ***
## employmentunemployed                          0.21942    
## installment_commitment                        0.00862 ** 
## other_partiesguarantor                        0.03395 *  
## other_partiesnone                             0.23583    
## other_payment_plansnone                       0.01062 *  
## other_payment_plansstores                     0.85140    
## housingown                                    0.89223    
## housingrent                                   0.07949 .  
## foreign_workeryes                             0.02064 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 977.38  on 799  degrees of freedom
## Residual deviance: 706.37  on 765  degrees of freedom
## AIC: 776.37
## 
## Number of Fisher Scoring iterations: 5

Evaluasi Model

Confusion Matrix

Evaluasi terhadap model klasifikasi dilakukan dengan membuat tabel klasifikasi antara kelas sebenarnya dengan kelas hasil prediksi dari model klasifikasi. Tabel klasifikasi ini biasa disebut dengan Confusion Matrix.

Confusion matrix dapat diartikan sebagai suatu alat yang memiliki fungsi untuk melakukan analisis apakah classifier tersebut baik dalam mengenali tuple dari kelas yang berbeda. Nilai dari True-Positive (TP) dan True-Negative (TN) memberikan informasi ketika classifier dalam melakukan klasifikasi data bernilai benar, sedangkan False-Positive (FP) dan False-Negative (FN) memberikan informasi ketika classifier salah dalam melakukan klasifikasi data (Han dan Kamber, 2011).

Pada performa klasifikasi akan dihitung akurasi, spesifisitas, sensitivitas dan AUC. Akurasi merupakan persentase classifier benar melakukan prediksi. Sensitivitas merupakan Persentase data positif yang diprediksi sebagai positif. Dan Spesifisitas merupakan Persentase data negatif diprediksi sebagai negatif (Faisal dan Nugrahadi, 2019).Sedangkan AUC adalah kinerja dari model klasifikasi.

\(Akurasi = \frac{TN+TP}{TN+TP+FN+FP}100\%\)

\(Sensitivitas = \frac{TP}{TP+FN}100\%\)

\(Spesifisitas = \frac{TN}{TN+FP}100\%\)

\(AUC = \frac{(100+Sensitivitas)-(100-Spesifisitas)}{2}\)

AUC	Meaning
0.90 - 1.00	excellent classification
0.80 - 0.90	good classification
0.70 - 0.80	fair classification
0.60 - 0.70	poor classification
< 0.60	failure

fit.final <- fitted.values(credit.glm.final)
pred.final <- ifelse(fit.final>=0.5,"GOOD","BAD")
tab <- table(german.credit.train$class,pred.final, dnn = c("Truth", "Predicted"))
tab

##       Predicted
## Truth  BAD GOOD
##   bad  129  111
##   good  55  505

acc <- sum(diag(tab))/sum(tab)
acc

## [1] 0.7925

pdata=predict(credit.glm.final,newdata=german.credit.train,type="response")
y_prediksi<-ifelse(pdata<0.5,"bad","good")
y_aktual<-german.credit.train$class
klf=table(y_aktual,y_prediksi)
accuracy=(klf[1,1]+klf[2,2])/sum(klf)*100
sensitivity= klf[2,2]/sum(klf[2,])*100
specificity= klf[1,1]/sum(klf[1,])*100 
fprate=klf[2,1]/(klf[2,1]+klf[1,1])*100
AUC=(100+sensitivity-fprate)/2 
performa=data.frame(accuracy,sensitivity,specificity,AUC)

klf

##         y_prediksi
## y_aktual bad good
##     bad  129  111
##     good  55  505

Berdasarkan output confusion matrix diatas dapat dilihat bahwa kemampuan prediksi dari model yang didapatkan untuk memprediksi kelas dengan pembayaran kredit yang baik dapat diprediksi tepat sebanyak 505 dari total 560 sedangkan untuk kelas pembayaran kredit yang macet atau kurang baik diprediksi tepat sebanyak 129 dari total 240.

performa

ROC Plot

library(ROCR)
pred<-
prediction(predict.glm(credit.glm.final,german.credit.test),german.credit.test$class)
perf <- performance(pred,"tpr","fpr")
plot(perf)