CarInsurance Campaign: Prediction Using Logistic Regression
Loading the R Libraries
library(tidyverse) # for the dfcheck inspection pipeline (rownames_to_column comes from tibble, which dplyr alone can't handle)
library(rsample)   # initial_split() for train/test splitting
library(caret)     # confusionMatrix() for evaluation
Input Data
insurance <- read.csv("carInsurance_train.csv")
insurance
This dataset is a modified version of the Bank Marketing dataset from the UCI Machine Learning Repository, reworked into a car insurance (CarInsurance) campaign dataset. Download Metadata
WORTH NOTING
The “DaysPassed” column encodes two things at once:
(-1) the prospect was never contacted;
(1-500+) the number of waiting days until a prospect followed up on the call and subscribed to CarInsurance (a quick sketch that makes this sentinel explicit follows below).
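A minimal, hedged sketch (not part of the original workflow; it assumes `insurance` has already been read in above): the -1 sentinel can be turned into an explicit flag, which makes the two meanings of DaysPassed easier to reason about.
# derive a logical "contacted before" flag from the -1 sentinel and count both groups
insurance %>%
  mutate(ContactedBefore = DaysPassed != -1) %>%
  count(ContactedBefore)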
Checking for Missing Values (1)
dfcheck <- insurance %>%
is.na() %>%
colSums() %>%
as.data.frame() %>%
rownames_to_column(var = "var") %>%
rename(total = 2) %>%
filter(total !=0) %>%
arrange(desc(total)) %>%
mutate(missing = total/nrow(insurance)) # proportion of all rows that are NA
dfcheck
The Outcome, Communication, Education, and Job columns contain missing values; the NA totals above are sorted from largest to smallest.
insurance_clean <- insurance %>%
# mutate_at(vars(CallStart, CallEnd),hms) %>%
# mutate(CallDuration = CallEnd-CallStart) %>%
# decided not to engineer the ymd/timestamp data; better to drop those columns
select(
-Id,
-CallStart,
-CallEnd,
-Outcome,
-LastContactDay,
-LastContactMonth,
-Communication,
-Job,
-Marital,
-Education
) %>%
mutate(
Default = as.factor(Default),
HHInsurance = as.factor(HHInsurance),
CarLoan = as.factor(CarLoan),
CarInsurance = as.factor(CarInsurance)
)
head(insurance_clean)
Re-checking for Missing Values
any(is.na(insurance_clean))
#> [1] FALSE
summary(insurance_clean)
#> Age Default Balance HHInsurance CarLoan
#> Min. :18.00 0:3942 Min. :-3058.0 0:2029 0:3468
#> 1st Qu.:32.00 1: 58 1st Qu.: 111.0 1:1971 1: 532
#> Median :39.00 Median : 551.5
#> Mean :41.21 Mean : 1532.9
#> 3rd Qu.:49.00 3rd Qu.: 1619.0
#> Max. :95.00 Max. :98417.0
#> NoOfContacts DaysPassed PrevAttempts CarInsurance
#> Min. : 1.000 Min. : -1.00 Min. : 0.0000 0:2396
#> 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 1:1604
#> Median : 2.000 Median : -1.00 Median : 0.0000
#> Mean : 2.607 Mean : 48.71 Mean : 0.7175
#> 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
#> Max. :43.000 Max. :854.00 Max. :58.0000
The DaysPassed column is kept so that, later on, we can see whether the engineering applied to it has a meaningful effect on the model.
At a glance, DaysPassed has a very unbalanced distribution: the values from Q1 through Q3 are all -1, so roughly 76 percent of the rows carry that value. The code below confirms there are 3042 values of -1 out of 4000 rows. Still, since logistic regression can also take factors as predictors, I am inclined to bin DaysPassed for further experimentation.
insurance_clean$DaysPassed %>%
as.factor() %>%
table()
#> .
#> -1 1 2 4 5 8 13 15 20 21 24 27 28 32 34 35
#> 3042 3 5 1 2 1 1 1 1 1 1 1 1 1 1 1
#> 36 37 38 40 42 43 44 48 49 50 53 55 56 57 61 65
#> 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1
#> 66 67 70 71 73 74 76 78 79 80 81 83 84 85 86 87
#> 2 1 1 1 1 1 1 1 2 5 2 3 3 4 2 6
#> 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
#> 6 5 11 24 38 16 14 16 4 13 8 3 3 2 5 4
#> 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
#> 8 9 3 2 1 2 4 3 4 3 2 1 1 2 1 5
#> 120 121 122 123 126 127 130 133 134 136 137 138 139 141 144 146
#> 5 1 3 2 3 1 2 1 2 1 1 2 1 1 1 2
#> 147 148 149 150 151 152 153 154 155 157 158 160 161 163 164 165
#> 2 3 2 3 3 2 2 4 1 1 1 3 2 1 5 5
#> 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
#> 1 2 6 8 3 4 2 2 5 6 6 2 10 6 3 13
#> 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
#> 33 24 12 8 5 7 8 12 3 3 3 2 3 7 9 3
#> 199 200 202 203 204 205 208 209 211 212 213 214 216 217 218 221
#> 1 3 5 2 4 2 2 2 2 1 1 1 2 1 1 1
#> 223 224 225 226 227 230 231 232 237 238 239 241 245 246 247 248
#> 1 1 1 1 3 3 2 1 1 2 1 2 1 3 2 2
#> 250 251 252 253 254 255 257 258 259 260 261 262 263 264 265 266
#> 1 2 2 2 2 1 1 1 2 1 1 3 3 5 2 3
#> 267 268 269 270 271 272 273 276 279 280 281 282 285 286 287 289
#> 2 2 1 3 1 4 2 2 3 3 2 1 2 3 3 1
#> 290 291 292 293 294 295 296 297 298 300 301 302 303 304 305 306
#> 2 2 1 2 5 6 4 1 3 2 2 1 1 1 1 2
#> 307 308 310 311 316 317 318 320 321 322 323 324 325 326 327 329
#> 4 1 1 1 2 3 3 1 1 4 1 1 1 2 1 2
#> 330 331 332 333 334 336 337 339 340 341 342 343 344 345 346 347
#> 2 2 1 1 3 1 3 2 1 1 3 6 1 6 5 4
#> 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363
#> 1 1 8 3 6 4 4 3 2 3 2 3 3 1 2 2
#> 364 365 366 367 368 369 370 371 372 375 376 378 379 384 385 388
#> 3 3 2 4 5 1 10 5 1 1 1 1 1 2 2 1
#> 389 390 397 409 412 415 417 421 422 424 426 430 433 439 440 444
#> 1 1 1 1 3 2 1 1 1 1 2 1 1 1 1 1
#> 445 450 455 457 460 462 472 474 476 495 503 515 532 544 555 558
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 579 683 690 728 769 775 779 828 842 854
#> 1 1 1 1 1 1 1 1 1 1
3042/nrow(insurance_clean)
#> [1] 0.7605
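The same proportion can be computed directly, without hard-coding the count of -1 values (a small aside on the same data):
mean(insurance_clean$DaysPassed == -1)
#> [1] 0.7605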
Data Insights
The data contains many dummy variables that encode ‘Yes’/‘No’ as ‘1’ and ‘0’.
To repeat the earlier point, the ‘DaysPassed’ column carries more than one kind of information: the value ‘-1’ marks prospects who were not contacted during the campaign, while every other value is the number of days until someone bought the insurance after being contacted by the campaign team.
For this first exploration of logistic regression with glm(), the data is left as-is.
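As an optional, hedged sanity check (not in the original workflow), we can confirm that the 0/1 dummies converted earlier are now factors with the expected levels.
# base-R check of the factor columns in insurance_clean
sapply(Filter(is.factor, insurance_clean), levels)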
Exploratory Data Analysis
GGally::ggcorr(insurance_clean, label = TRUE)
Checking the Target Proportion
insurance_clean$CarInsurance %>%
table() %>%
prop.table()
#> .
#>     0     1 
#> 0.599 0.401
prop.table(table(insurance_clean$CarLoan))
#> 
#>     0     1 
#> 0.867 0.133
prop.table(table(insurance_clean$HHInsurance))
#> 
#>       0       1 
#> 0.50725 0.49275
Splitting the Data into Train and Test Sets
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(insurance_clean, prop = 0.8 , strata = "CarInsurance")
insur_train <- training(index)
insur_test <- testing(index)
# recheck the class balance
prop.table(table(insur_train$CarInsurance))
#> 
#>         0         1 
#> 0.5989372 0.4010628
prop.table(table(insur_test$CarInsurance))
#> 
#>         0         1 
#> 0.5992509 0.4007491
Building the Model
For this first experiment, the model is fitted directly, without any feature engineering.
options(scipen = 100)
model_insurance <-
glm(
CarInsurance ~ Balance + Age + CarLoan + NoOfContacts + DaysPassed + PrevAttempts,
data = insur_train,
family = "binomial"
)
summary(model_insurance)
#> 
#> Call:
#> glm(formula = CarInsurance ~ Balance + Age + CarLoan + NoOfContacts +
#> DaysPassed + PrevAttempts, family = "binomial", data = insur_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5559 -0.9925 -0.8203 1.2670 2.4912
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.73917287 0.14462349 -5.111 0.0000003204 ***
#> Balance 0.00001805 0.00001011 1.785 0.074265 .
#> Age 0.01065286 0.00320519 3.324 0.000889 ***
#> CarLoan1 -0.54183099 0.11620346 -4.663 0.0000031197 ***
#> NoOfContacts -0.09126891 0.01668037 -5.472 0.0000000446 ***
#> DaysPassed 0.00127284 0.00041132 3.095 0.001971 **
#> PrevAttempts 0.14268016 0.02726160 5.234 0.0000001661 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4308.7 on 3198 degrees of freedom
#> Residual deviance: 4125.9 on 3192 degrees of freedom
#> AIC: 4139.9
#>
#> Number of Fisher Scoring iterations: 4
Generating Probabilities and Class Labels
insur_test$pred.goals <- predict(object = model_insurance,
                                 newdata = insur_test,
                                 type = "response") # type = "response" returns the predicted probability of the positive class
head(insur_test$pred.goals)
#> [1] 0.3025822 0.4462432 0.4554195 0.4224123 0.4261464 0.8765132
# ifelse(condition, value_if_true, value_if_false)
insur_test$pred.label <- ifelse(insur_test$pred.goals > 0.55, 1, 0) # classify with a 0.55 threshold
# make sure the target classes (actual and predicted) are factors
insur_test$pred.label <- as.factor(insur_test$pred.label)
insur_test %>%
  select(CarInsurance, pred.goals, pred.label)
Evaluation: Confusion Matrix
library(caret)
confusionMatrix(data = insur_test$pred.label,
                reference = insur_test$CarInsurance,
                positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 445 270
#> 1 35 51
#>
#> Accuracy : 0.6192
#> 95% CI : (0.5846, 0.653)
#> No Information Rate : 0.5993
#> P-Value [Acc > NIR] : 0.1317
#>
#> Kappa : 0.0978
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.15888
#> Specificity : 0.92708
#> Pos Pred Value : 0.59302
#> Neg Pred Value : 0.62238
#> Prevalence : 0.40075
#> Detection Rate : 0.06367
#> Detection Prevalence : 0.10737
#> Balanced Accuracy : 0.54298
#>
#> 'Positive' Class : 1
#>
From this experimental workflow we can see that the model performs poorly on this dataset: its ability to identify the positive class is weak (TP = 51 predictions vs. FP = 35 predictions). If the business requirement is to optimise Precision (Pos Pred Value), the model only reaches about 59%. This is a strong signal that we need to rebuild the model and rework some of the predictors.
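Before rebuilding, one optional, hedged check is whether the 0.55 cut-off itself is limiting Precision and Recall. The sketch below sweeps a few candidate thresholds over the existing predictions using caret's precision() and recall() helpers; it does not change anything in the workflow above.
# evaluate precision/recall of model_insurance at several probability thresholds
thresholds <- seq(0.30, 0.70, by = 0.05)
sapply(thresholds, function(t) {
  pred <- factor(ifelse(insur_test$pred.goals > t, 1, 0), levels = c(0, 1))
  c(threshold = t,
    precision = caret::precision(data = pred, reference = insur_test$CarInsurance, relevant = "1"),
    recall    = caret::recall(data = pred, reference = insur_test$CarInsurance, relevant = "1"))
})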
Re-Modeling
To improve the model, I rebuild it with some predictor engineering in mind: applying a log transform to the Balance predictor, and binning a few of the other predictors (DaysPassed, NoOfContacts).
Predictor Selection:
dropping a few redundant predictors
insurance_recleans <- insurance %>%
# mutate_at(vars(CallStart, CallEnd),hms) %>%
#mutate(CallDuration = CallEnd-CallStart) %>%
select(
-Id,
-CallStart,
-CallEnd,
-Outcome,
-LastContactDay,
-LastContactMonth,
-Communication
) %>%
mutate(
Default = as.factor(Default),
HHInsurance = as.factor(HHInsurance),
CarLoan = as.factor(CarLoan),
CarInsurance = as.factor(CarInsurance)
)
head(insurance_recleans)
insurance_recleans %>%
  mutate_if(is.character, as.factor) # display only; the conversion is applied for good later, when building `ready`
insurance_recleans$Balance <- log(insurance_recleans$Balance)
This throws a warning: after applying log(), some values turn into NaN and -Inf. On inspection, negative Balance values produce NaN under log(), and values of 0 produce -Inf, so these missing values need to be filled in appropriately.
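As a hedged alternative (not used in this workflow), a signed log1p transform would avoid the NaN/-Inf problem altogether, because it is defined for zero and negative balances; signed_log below is a hypothetical helper shown for comparison only.
# hypothetical helper: log-like compression that keeps the sign and handles 0 / negative balances
signed_log <- function(x) sign(x) * log1p(abs(x))
summary(signed_log(insurance$Balance)) # applied to the raw, untransformed Balance column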
Binning the ‘DaysPassed’ Predictor
Recoding the integer values into a handful of factor classes.
binned <- insurance_recleans %>%
  mutate(
    DaysPassed = case_when(
      DaysPassed == -1 ~ "Not contacted",
      DaysPassed >= 91 & DaysPassed < 181 ~ "91-180 Days",
      DaysPassed > 180 & DaysPassed < 271 ~ "181-270 Days",
      DaysPassed > 270 & DaysPassed < 361 ~ "271-360 Days",
      TRUE ~ "more than a year" # note: values of 1-90 days also fall through to this catch-all
    )
  )
binned
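For comparison, here is a hedged sketch of the same binning done with cut(), which keeps the break points explicit; binned_alt is a hypothetical object, and note that with these breaks the 1-90 day values would get their own label instead of falling into the catch-all.
binned_alt <- insurance_recleans %>%
  mutate(DaysPassed = cut(DaysPassed,
                          breaks = c(-Inf, 0, 90, 180, 270, 360, Inf),
                          labels = c("Not contacted", "1-90 Days", "91-180 Days",
                                     "181-270 Days", "271-360 Days", "more than a year")))
table(binned_alt$DaysPassed)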
Handling Missing Values
The values broken by the log transform are patched as follows: NaN values (from negative balances) are set back to -1, and -Inf values (from zero balances) are set back to 0.
binned2 <- binned %>%
mutate_at(vars(Balance), ~replace(., is.nan(.), -1)) %>%
mutate_at(vars(Balance), ~replace(., is.infinite(.), 0))
binned2
Re-checking for missing values
dfcheck1 <- binned2 %>%
is.na() %>%
colSums() %>%
as.data.frame() %>%
rownames_to_column(var = "var") %>%
rename(total = 2) %>%
filter(total !=0) %>%
arrange(desc(total)) %>%
mutate(percent = total/nrow(binned2)) # proportion of all rows that are NA
dfcheck1
binned2$Job <- binned2$Job %>%
  replace(is.na(.), "unknown")
binned2$Education <- binned2$Education %>%
  replace(is.na(.), "unknown")
binned2 %>%
  is.na() %>%
  colSums()
#> Age Job Marital Education Default Balance
#> 0 0 0 0 0 0
#> HHInsurance CarLoan NoOfContacts DaysPassed PrevAttempts CarInsurance
#> 0 0 0 0 0 0
ready <- binned2 %>%
  mutate_if(is.character, as.factor)
Checking Correlations Between Predictors
GGally::ggcorr(ready, label = T)
Splitting the Data into Train and Test Sets
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(ready, prop = 0.8 , strata = "CarInsurance")
train2 <- training(index)
test2 <- testing(index)
# recheck the class balance
prop.table(table(train2$CarInsurance))
#> 
#>         0         1 
#> 0.5989372 0.4010628
prop.table(table(test2$CarInsurance))
#> 
#>         0         1 
#> 0.5992509 0.4007491
Fitting the Model on All Predictors
model_2 <- glm(CarInsurance ~ ., data = train2,
family = "binomial")
summary(model_2)
Fitting with the step() Option
step(object = model_2, direction = "backward", trace = T)
#> Start: AIC=3918.41
#> CarInsurance ~ Age + Job + Marital + Education + Default + Balance +
#> HHInsurance + CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#>
#> Df Deviance AIC
#> - Default 1 3862.4 3916.4
#> <none> 3862.4 3918.4
#> - Job 11 3884.8 3918.8
#> - Age 1 3867.8 3921.8
#> - CarLoan 1 3868.4 3922.4
#> - Education 3 3872.6 3922.6
#> - PrevAttempts 1 3869.2 3923.2
#> - Marital 2 3879.9 3931.9
#> - Balance 1 3881.6 3935.6
#> - NoOfContacts 1 3889.1 3943.1
#> - DaysPassed 4 3909.7 3957.7
#> - HHInsurance 1 3944.0 3998.0
#>
#> Step: AIC=3916.42
#> CarInsurance ~ Age + Job + Marital + Education + Balance + HHInsurance +
#> CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#>
#> Df Deviance AIC
#> <none> 3862.4 3916.4
#> - Job 11 3884.8 3916.8
#> - Age 1 3867.8 3919.8
#> - CarLoan 1 3868.5 3920.5
#> - Education 3 3872.7 3920.7
#> - PrevAttempts 1 3869.2 3921.2
#> - Marital 2 3879.9 3929.9
#> - Balance 1 3882.2 3934.2
#> - NoOfContacts 1 3889.1 3941.1
#> - DaysPassed 4 3909.7 3955.7
#> - HHInsurance 1 3944.1 3996.1
#>
#> Call: glm(formula = CarInsurance ~ Age + Job + Marital + Education +
#> Balance + HHInsurance + CarLoan + NoOfContacts + DaysPassed +
#> PrevAttempts, family = "binomial", data = train2)
#>
#> Coefficients:
#> (Intercept) Age
#> -0.17476 0.01058
#> Jobblue-collar Jobentrepreneur
#> -0.15608 -0.30343
#> Jobhousemaid Jobmanagement
#> -0.12139 -0.16118
#> Jobretired Jobself-employed
#> 0.22121 -0.15087
#> Jobservices Jobstudent
#> -0.08355 0.83586
#> Jobtechnician Jobunemployed
#> -0.03707 0.21535
#> Jobunknown Maritalmarried
#> -0.11072 -0.32469
#> Maritalsingle Educationsecondary
#> 0.04015 0.03602
#> Educationtertiary Educationunknown
#> 0.38593 0.16972
#> Balance HHInsurance1
#> 0.06296 -0.74567
#> CarLoan1 NoOfContacts
#> -0.29550 -0.07803
#> DaysPassed271-360 Days DaysPassed91-180 Days
#> -0.31929 0.22824
#> DaysPassedmore than a year DaysPassedNot contacted
#> 0.57310 -0.58463
#> PrevAttempts
#> 0.06163
#>
#> Degrees of Freedom: 3198 Total (i.e. Null); 3172 Residual
#> Null Deviance: 4309
#> Residual Deviance: 3862 AIC: 3916
step(object = model_2, direction = "both", trace = T)
#> Start: AIC=3918.41
#> CarInsurance ~ Age + Job + Marital + Education + Default + Balance +
#> HHInsurance + CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#>
#> Df Deviance AIC
#> - Default 1 3862.4 3916.4
#> <none> 3862.4 3918.4
#> - Job 11 3884.8 3918.8
#> - Age 1 3867.8 3921.8
#> - CarLoan 1 3868.4 3922.4
#> - Education 3 3872.6 3922.6
#> - PrevAttempts 1 3869.2 3923.2
#> - Marital 2 3879.9 3931.9
#> - Balance 1 3881.6 3935.6
#> - NoOfContacts 1 3889.1 3943.1
#> - DaysPassed 4 3909.7 3957.7
#> - HHInsurance 1 3944.0 3998.0
#>
#> Step: AIC=3916.42
#> CarInsurance ~ Age + Job + Marital + Education + Balance + HHInsurance +
#> CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#>
#> Df Deviance AIC
#> <none> 3862.4 3916.4
#> - Job 11 3884.8 3916.8
#> + Default 1 3862.4 3918.4
#> - Age 1 3867.8 3919.8
#> - CarLoan 1 3868.5 3920.5
#> - Education 3 3872.7 3920.7
#> - PrevAttempts 1 3869.2 3921.2
#> - Marital 2 3879.9 3929.9
#> - Balance 1 3882.2 3934.2
#> - NoOfContacts 1 3889.1 3941.1
#> - DaysPassed 4 3909.7 3955.7
#> - HHInsurance 1 3944.1 3996.1
#>
#> Call: glm(formula = CarInsurance ~ Age + Job + Marital + Education +
#> Balance + HHInsurance + CarLoan + NoOfContacts + DaysPassed +
#> PrevAttempts, family = "binomial", data = train2)
#>
#> Coefficients:
#> (Intercept) Age
#> -0.17476 0.01058
#> Jobblue-collar Jobentrepreneur
#> -0.15608 -0.30343
#> Jobhousemaid Jobmanagement
#> -0.12139 -0.16118
#> Jobretired Jobself-employed
#> 0.22121 -0.15087
#> Jobservices Jobstudent
#> -0.08355 0.83586
#> Jobtechnician Jobunemployed
#> -0.03707 0.21535
#> Jobunknown Maritalmarried
#> -0.11072 -0.32469
#> Maritalsingle Educationsecondary
#> 0.04015 0.03602
#> Educationtertiary Educationunknown
#> 0.38593 0.16972
#> Balance HHInsurance1
#> 0.06296 -0.74567
#> CarLoan1 NoOfContacts
#> -0.29550 -0.07803
#> DaysPassed271-360 Days DaysPassed91-180 Days
#> -0.31929 0.22824
#> DaysPassedmore than a year DaysPassedNot contacted
#> 0.57310 -0.58463
#> PrevAttempts
#> 0.06163
#>
#> Degrees of Freedom: 3198 Total (i.e. Null); 3172 Residual
#> Null Deviance: 4309
#> Residual Deviance: 3862 AIC: 3916
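As an aside, the selected model does not have to be retyped by hand: the object returned by step() can be captured directly. model_backward below is a hypothetical name; its formula matches the model_step refit in the next section.
model_backward <- step(object = model_2, direction = "backward", trace = FALSE)
formula(model_backward)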
Choosing a Model from the step() Output
I also tried tuning by adding and removing predictors, but this remained the best result:
model_step <- glm(formula = CarInsurance ~ Age + Job + Marital + Education + DaysPassed +
Balance + HHInsurance + CarLoan + NoOfContacts +
PrevAttempts, family = "binomial", data = train2)
summary(model_step)
#> 
#> Call:
#> glm(formula = CarInsurance ~ Age + Job + Marital + Education +
#> DaysPassed + Balance + HHInsurance + CarLoan + NoOfContacts +
#> PrevAttempts, family = "binomial", data = train2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.9962 -0.9522 -0.6693 1.1067 2.6428
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.174756 0.353309 -0.495 0.62086
#> Age 0.010577 0.004551 2.324 0.02011
#> Jobblue-collar -0.156077 0.154391 -1.011 0.31205
#> Jobentrepreneur -0.303432 0.265828 -1.141 0.25368
#> Jobhousemaid -0.121393 0.268076 -0.453 0.65067
#> Jobmanagement -0.161177 0.161232 -1.000 0.31748
#> Jobretired 0.221208 0.216533 1.022 0.30698
#> Jobself-employed -0.150872 0.237009 -0.637 0.52441
#> Jobservices -0.083553 0.176207 -0.474 0.63537
#> Jobstudent 0.835862 0.272922 3.063 0.00219
#> Jobtechnician -0.037069 0.148746 -0.249 0.80320
#> Jobunemployed 0.215346 0.244510 0.881 0.37847
#> Jobunknown -0.110718 0.537006 -0.206 0.83665
#> Maritalmarried -0.324690 0.122254 -2.656 0.00791
#> Maritalsingle 0.040150 0.140709 0.285 0.77538
#> Educationsecondary 0.036022 0.132280 0.272 0.78538
#> Educationtertiary 0.385931 0.158402 2.436 0.01483
#> Educationunknown 0.169720 0.223148 0.761 0.44691
#> DaysPassed271-360 Days -0.319291 0.239505 -1.333 0.18249
#> DaysPassed91-180 Days 0.228238 0.203842 1.120 0.26285
#> DaysPassedmore than a year 0.573097 0.237275 2.415 0.01572
#> DaysPassedNot contacted -0.584626 0.178597 -3.273 0.00106
#> Balance 0.062962 0.014304 4.402 0.00001074
#> HHInsurance1 -0.745673 0.083159 -8.967 < 0.0000000000000002
#> CarLoan1 -0.295498 0.120925 -2.444 0.01454
#> NoOfContacts -0.078026 0.016760 -4.655 0.00000323
#> PrevAttempts 0.061628 0.026447 2.330 0.01980
#>
#> (Intercept)
#> Age *
#> Jobblue-collar
#> Jobentrepreneur
#> Jobhousemaid
#> Jobmanagement
#> Jobretired
#> Jobself-employed
#> Jobservices
#> Jobstudent **
#> Jobtechnician
#> Jobunemployed
#> Jobunknown
#> Maritalmarried **
#> Maritalsingle
#> Educationsecondary
#> Educationtertiary *
#> Educationunknown
#> DaysPassed271-360 Days
#> DaysPassed91-180 Days
#> DaysPassedmore than a year *
#> DaysPassedNot contacted **
#> Balance ***
#> HHInsurance1 ***
#> CarLoan1 *
#> NoOfContacts ***
#> PrevAttempts *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4308.7 on 3198 degrees of freedom
#> Residual deviance: 3862.4 on 3172 degrees of freedom
#> AIC: 3916.4
#>
#> Number of Fisher Scoring iterations: 4
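For interpretation, the coefficients can also be put on the odds scale (a brief aside, not part of the original workflow): exponentiated coefficients above 1 raise the odds of taking CarInsurance, and those below 1 lower them.
# ten largest odds ratios from the selected model
exp(coef(model_step)) %>%
  sort(decreasing = TRUE) %>%
  head(10)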
test2$preds <- predict(object = model_step,
                       newdata = test2,
                       type = "response") # type = "response" returns the predicted probability of the positive class
head(test2$preds)
#> [1] 0.1689022 0.5669911 0.7738292 0.2800508 0.3747582 0.8773409
test2$label <- ifelse(test2$preds > 0.55, 1, 0) # classify with the same 0.55 threshold
# make sure the target classes (actual and predicted) are factors
test2$label <- as.factor(test2$label)
Model Evaluation
confusionMatrix(data = test2$label,
                reference = test2$CarInsurance,
                positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 433 195
#> 1 47 126
#>
#> Accuracy : 0.6979
#> 95% CI : (0.6648, 0.7295)
#> No Information Rate : 0.5993
#> P-Value [Acc > NIR] : 0.000000004283
#>
#> Kappa : 0.319
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.3925
#> Specificity : 0.9021
#> Pos Pred Value : 0.7283
#> Neg Pred Value : 0.6895
#> Prevalence : 0.4007
#> Detection Rate : 0.1573
#> Detection Prevalence : 0.2160
#> Balanced Accuracy : 0.6473
#>
#> 'Positive' Class : 1
#>