CarInsurance Campaign: Prediction Using Logistic Regression

Loading the R Libraries

library(tidyverse) # data wrangling and df inspection (dfcheck); rownames_to_column() comes from tibble, which dplyr alone does not provide
library(rsample)   # train/test splitting with initial_split()
library(caret)     # model evaluation with confusionMatrix()

Input Data

insurance <- read.csv("carInsurance_train.csv")
insurance

This dataset is a modified version of the Bank Marketing data from the UCI Machine Learning Repository, reworked into a dataset about car insurance (carInsurance). Download Metadata

POINTS OF INTEREST
The “DaysPassed” column encodes:

  • (-1) the customer was not contacted;

  • (1-500+) the number of waiting days until a person followed up on the call and subscribed to CarInsurance

Checking for Missing Values (1)

dfcheck <- insurance %>% 
  is.na() %>% 
  colSums() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "var") %>% 
  rename(total = 2) %>% 
  filter(total !=0) %>%
  arrange(desc(total)) %>% 
  mutate(missing = total / nrow(insurance)) # proportion out of all rows in the data, not of this summary table

dfcheck

We can see that the Outcome, Communication, Education, and Job columns contain missing values; the NA totals are sorted from largest to smallest.
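
As a quick cross-check, the same per-column NA counts can be obtained with a base-R one-liner; a minimal sketch:

# per-column NA counts, directly
colSums(is.na(insurance))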

insurance_clean <- insurance %>%
  # mutate_at(vars(CallStart, CallEnd), hms) %>%
  # mutate(CallDuration = CallEnd - CallStart) %>%
  # decided not to engineer the call date/time features after all; better to drop them
  select(
    -Id,
    -CallStart,
    -CallEnd,
    -Outcome,
    -LastContactDay,
    -LastContactMonth,
    -Communication,
    -Job,
    -Marital,
    -Education
  ) %>%
  mutate(
    Default = as.factor(Default),
    HHInsurance = as.factor(HHInsurance),
    CarLoan = as.factor(CarLoan),
    CarInsurance = as.factor(CarInsurance)
  )

head(insurance_clean)

Re-checking for Missing Values

any(is.na(insurance_clean))
#> [1] FALSE
summary(insurance_clean)
#>       Age        Default     Balance        HHInsurance CarLoan 
#>  Min.   :18.00   0:3942   Min.   :-3058.0   0:2029      0:3468  
#>  1st Qu.:32.00   1:  58   1st Qu.:  111.0   1:1971      1: 532  
#>  Median :39.00            Median :  551.5                       
#>  Mean   :41.21            Mean   : 1532.9                       
#>  3rd Qu.:49.00            3rd Qu.: 1619.0                       
#>  Max.   :95.00            Max.   :98417.0                       
#>   NoOfContacts      DaysPassed      PrevAttempts     CarInsurance
#>  Min.   : 1.000   Min.   : -1.00   Min.   : 0.0000   0:2396      
#>  1st Qu.: 1.000   1st Qu.: -1.00   1st Qu.: 0.0000   1:1604      
#>  Median : 2.000   Median : -1.00   Median : 0.0000               
#>  Mean   : 2.607   Mean   : 48.71   Mean   : 0.7175               
#>  3rd Qu.: 3.000   3rd Qu.: -1.00   3rd Qu.: 0.0000               
#>  Max.   :43.000   Max.   :854.00   Max.   :58.0000

The DaysPassed column will still be used, to see how the feature engineering done later can have a meaningful impact on the model.

  • At a glance, DaysPassed is heavily unbalanced: its Q1, median, and Q3 are all -1, so the bulk of the column is -1. The code below confirms that 3,042 of the 4,000 rows are -1 (about 76 percent).

  • However, since logistic regression can also take factors as predictors, I am inclined to bin the DaysPassed column as a further experiment.

insurance_clean$DaysPassed %>% 
  as.factor() %>% 
  table()
#> .
#>   -1    1    2    4    5    8   13   15   20   21   24   27   28   32   34   35 
#> 3042    3    5    1    2    1    1    1    1    1    1    1    1    1    1    1 
#>   36   37   38   40   42   43   44   48   49   50   53   55   56   57   61   65 
#>    1    1    1    1    1    1    1    2    1    1    1    2    1    1    1    1 
#>   66   67   70   71   73   74   76   78   79   80   81   83   84   85   86   87 
#>    2    1    1    1    1    1    1    1    2    5    2    3    3    4    2    6 
#>   88   89   90   91   92   93   94   95   96   97   98   99  100  101  102  103 
#>    6    5   11   24   38   16   14   16    4   13    8    3    3    2    5    4 
#>  104  105  106  107  108  109  110  111  112  113  114  115  116  117  118  119 
#>    8    9    3    2    1    2    4    3    4    3    2    1    1    2    1    5 
#>  120  121  122  123  126  127  130  133  134  136  137  138  139  141  144  146 
#>    5    1    3    2    3    1    2    1    2    1    1    2    1    1    1    2 
#>  147  148  149  150  151  152  153  154  155  157  158  160  161  163  164  165 
#>    2    3    2    3    3    2    2    4    1    1    1    3    2    1    5    5 
#>  166  167  168  169  170  171  172  173  174  175  176  177  178  179  180  181 
#>    1    2    6    8    3    4    2    2    5    6    6    2   10    6    3   13 
#>  182  183  184  185  186  187  188  189  190  191  192  193  194  195  196  197 
#>   33   24   12    8    5    7    8   12    3    3    3    2    3    7    9    3 
#>  199  200  202  203  204  205  208  209  211  212  213  214  216  217  218  221 
#>    1    3    5    2    4    2    2    2    2    1    1    1    2    1    1    1 
#>  223  224  225  226  227  230  231  232  237  238  239  241  245  246  247  248 
#>    1    1    1    1    3    3    2    1    1    2    1    2    1    3    2    2 
#>  250  251  252  253  254  255  257  258  259  260  261  262  263  264  265  266 
#>    1    2    2    2    2    1    1    1    2    1    1    3    3    5    2    3 
#>  267  268  269  270  271  272  273  276  279  280  281  282  285  286  287  289 
#>    2    2    1    3    1    4    2    2    3    3    2    1    2    3    3    1 
#>  290  291  292  293  294  295  296  297  298  300  301  302  303  304  305  306 
#>    2    2    1    2    5    6    4    1    3    2    2    1    1    1    1    2 
#>  307  308  310  311  316  317  318  320  321  322  323  324  325  326  327  329 
#>    4    1    1    1    2    3    3    1    1    4    1    1    1    2    1    2 
#>  330  331  332  333  334  336  337  339  340  341  342  343  344  345  346  347 
#>    2    2    1    1    3    1    3    2    1    1    3    6    1    6    5    4 
#>  348  349  350  351  352  353  354  355  356  357  358  359  360  361  362  363 
#>    1    1    8    3    6    4    4    3    2    3    2    3    3    1    2    2 
#>  364  365  366  367  368  369  370  371  372  375  376  378  379  384  385  388 
#>    3    3    2    4    5    1   10    5    1    1    1    1    1    2    2    1 
#>  389  390  397  409  412  415  417  421  422  424  426  430  433  439  440  444 
#>    1    1    1    1    3    2    1    1    1    1    2    1    1    1    1    1 
#>  445  450  455  457  460  462  472  474  476  495  503  515  532  544  555  558 
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
#>  579  683  690  728  769  775  779  828  842  854 
#>    1    1    1    1    1    1    1    1    1    1
3042/nrow(insurance_clean)
#> [1] 0.7605
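
Equivalently, the share of “not contacted” rows can be computed directly from a logical comparison; a quick sketch:

# share of customers never contacted in a previous campaign (DaysPassed == -1)
mean(insurance_clean$DaysPassed == -1)
#> [1] 0.7605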

Data Insights

The data contains many dummy variables representing ‘Yes’ or ‘No’ (encoded as ‘1’ and ‘0’).

To reiterate: the ‘DaysPassed’ column carries more than one kind of information. The value ‘-1’ marks prospective customers who were not contacted (during the campaign), while any value other than ‘-1’ represents the number of days until someone bought the insurance product after being contacted by the campaign team.

For the exploratory logistic regression with glm(), the data is left as is for now.
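
As a side note, glm() handles factor predictors by expanding them into dummy columns (one per level, minus the reference level). A minimal sketch of that design matrix for two of the binary factors, purely for illustration:

# model.matrix() shows the dummy (0/1) columns glm() builds from factor predictors
head(model.matrix(~ CarLoan + HHInsurance, data = insurance_clean))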

Exploratory Data Analysis

GGally::ggcorr(insurance_clean, label = TRUE)

Checking the Target Proportion

insurance_clean$CarInsurance %>%
  table() %>% 
  prop.table()
#> .
#>     0     1 
#> 0.599 0.401
prop.table(table(insurance_clean$CarLoan))
#> 
#>     0     1 
#> 0.867 0.133
prop.table(table(insurance_clean$HHInsurance))
#> 
#>       0       1 
#> 0.50725 0.49275

Splitting the Data into Train and Test Sets

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(insurance_clean, prop = 0.8 , strata = "CarInsurance")

insur_train <- training(index)
insur_test <- testing(index)
# recheck class balance
prop.table(table(insur_train$CarInsurance))
#> 
#>         0         1 
#> 0.5989372 0.4010628
prop.table(table(insur_test$CarInsurance))
#> 
#>         0         1 
#> 0.5992509 0.4007491
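
As a side note, a comparable stratified 80/20 split could also be done with caret's createDataPartition(); this is a sketch for illustration only (idx, alt_train, and alt_test are hypothetical names; the rsample split above is what the rest of the analysis uses):

# an equivalent stratified 80/20 split with caret
set.seed(100)
idx <- createDataPartition(insurance_clean$CarInsurance, p = 0.8, list = FALSE)
alt_train <- insurance_clean[idx, ]
alt_test  <- insurance_clean[-idx, ]
prop.table(table(alt_train$CarInsurance)) # class balance should stay close to 60/40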

Building the Model

For experimentation purposes, the model is first built directly, without paying attention to feature engineering yet.

options(scipen = 100) # avoid scientific notation in the printed summaries
model_insurance <-
  glm(
    CarInsurance ~ Balance + Age + CarLoan + NoOfContacts + DaysPassed + PrevAttempts,
    data = insur_train,
    family = "binomial"
  )
summary(model_insurance)
#> 
#> Call:
#> glm(formula = CarInsurance ~ Balance + Age + CarLoan + NoOfContacts + 
#>     DaysPassed + PrevAttempts, family = "binomial", data = insur_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.5559  -0.9925  -0.8203   1.2670   2.4912  
#> 
#> Coefficients:
#>                 Estimate  Std. Error z value     Pr(>|z|)    
#> (Intercept)  -0.73917287  0.14462349  -5.111 0.0000003204 ***
#> Balance       0.00001805  0.00001011   1.785     0.074265 .  
#> Age           0.01065286  0.00320519   3.324     0.000889 ***
#> CarLoan1     -0.54183099  0.11620346  -4.663 0.0000031197 ***
#> NoOfContacts -0.09126891  0.01668037  -5.472 0.0000000446 ***
#> DaysPassed    0.00127284  0.00041132   3.095     0.001971 ** 
#> PrevAttempts  0.14268016  0.02726160   5.234 0.0000001661 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 4308.7  on 3198  degrees of freedom
#> Residual deviance: 4125.9  on 3192  degrees of freedom
#> AIC: 4139.9
#> 
#> Number of Fisher Scoring iterations: 4
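
Since logistic regression works on the log-odds scale, the coefficients above are easier to read after exponentiating them into odds ratios. A minimal interpretation sketch for model_insurance (not part of the original workflow):

# odds ratios: values below 1 lower the odds of CarInsurance = 1, values above 1 raise them
# e.g. exp(coef) for CarLoan1 is below 1, so having a car loan lowers the odds of subscribing
exp(coef(model_insurance))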

Creating Labels from the Predicted Probabilities

insur_test$pred.goals <- predict(object = model_insurance,
                                 newdata = insur_test,
                                 type = "response") # type = "response" returns predicted probabilities for the positive class
head(insur_test$pred.goals)
#> [1] 0.3025822 0.4462432 0.4554195 0.4224123 0.4261464 0.8765132
# ifelse(condition, value_if_true, value_if_false)

insur_test$pred.label <- ifelse(insur_test$pred.goals > 0.55, 1, 0) # convert probabilities into class labels using a 0.55 cutoff

# make sure the target classes (actual and predicted) are of type factor
insur_test$pred.label <- as.factor(insur_test$pred.label)
insur_test %>% 
  select(CarInsurance, pred.goals, pred.label) 

Evaluation: Confusion Matrix

library(caret)

confusionMatrix(data = insur_test$pred.label, 
                reference = insur_test$CarInsurance,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 445 270
#>          1  35  51
#>                                              
#>                Accuracy : 0.6192             
#>                  95% CI : (0.5846, 0.653)    
#>     No Information Rate : 0.5993             
#>     P-Value [Acc > NIR] : 0.1317             
#>                                              
#>                   Kappa : 0.0978             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.15888            
#>             Specificity : 0.92708            
#>          Pos Pred Value : 0.59302            
#>          Neg Pred Value : 0.62238            
#>              Prevalence : 0.40075            
#>          Detection Rate : 0.06367            
#>    Detection Prevalence : 0.10737            
#>       Balanced Accuracy : 0.54298            
#>                                              
#>        'Positive' Class : 1                  
#> 

This experimental workflow shows that the model's performance on this dataset is poor, with a weak ability to identify the positive class (TP = 51 predictions vs FP = 35 predictions). If the business requirement asks us to optimize the Precision metric (Pos Pred Value), its value is only 59%. This is a strong indicator that we need to rebuild the model and improve some of the existing predictors.
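
Before re-modeling, it can also help to see how the 0.55 cutoff trades precision against recall. A small sketch that sweeps a few arbitrary thresholds over the current test-set probabilities (pred, cm, and th are just local helper names):

# precision and recall of model_insurance on insur_test at a few cutoffs
sapply(c(0.40, 0.50, 0.55, 0.60), function(th) {
  pred <- factor(ifelse(insur_test$pred.goals > th, 1, 0), levels = c(0, 1))
  cm   <- table(pred, actual = insur_test$CarInsurance)
  c(threshold = th,
    precision = cm["1", "1"] / sum(cm["1", ]),
    recall    = cm["1", "1"] / sum(cm[, "1"]))
})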

Re-Modeling

To improve the model, I need to rebuild it while thinking about what predictor engineering can be done:

  • apply a log transform to the Balance predictor;

  • bin some of the other predictors (DaysPassed, NoOfContacts).

Predictor Selection:

Removing several redundant predictors

insurance_recleans <- insurance %>%
  # mutate_at(vars(CallStart, CallEnd), hms) %>%
  # mutate(CallDuration = CallEnd - CallStart) %>%
  select(
    -Id,
    -CallStart,
    -CallEnd,
    -Outcome,
    -LastContactDay,
    -LastContactMonth,
    -Communication
  ) %>%
  mutate(
    Default = as.factor(Default),
    HHInsurance = as.factor(HHInsurance),
    CarLoan = as.factor(CarLoan),
    CarInsurance = as.factor(CarInsurance)
  )

head(insurance_recleans)
insurance_recleans %>% 
  mutate_if(is.character, as.factor) # preview only; the real character-to-factor conversion happens later, after the NA handling
insurance_recleans$Balance <- log(insurance_recleans$Balance) # log-transform Balance; negative and zero values become NaN / -Inf

A warning appears: after the log transform, some values end up as NaN and -Inf:

  • negative Balance values become NaN after log();

  • a Balance of 0 becomes -Inf after log();

  • these missing values need to be filled in appropriately.
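
To gauge how many rows are affected, a quick count on the original (untransformed) Balance column is enough; a small sketch:

# rows that will turn into NaN (negative balance) or -Inf (zero balance) after log()
sum(insurance$Balance < 0)
sum(insurance$Balance == 0)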

Binning the ‘DaysPassed’ Predictor

Replacing the integer values with a handful of factor classes

binned <- insurance_recleans %>%
  mutate(
    DaysPassed = case_when(
      DaysPassed == -1 ~ "Not contacted",
      DaysPassed >= 91 &
        DaysPassed < 181  ~ "91-180 Days",
      DaysPassed > 180 &
        DaysPassed < 271  ~ "181-270 Days",
      DaysPassed > 270 &
        DaysPassed < 361 ~ "271-360 Days",
      TRUE ~ "more than a year" # catch-all: values above 360 days, plus the few rows with 1-90 days
    )
  )

binned
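
Before moving on, it is worth checking how the rows are distributed across the new bins, and noting that the catch-all branch also absorbs the few rows with 1-90 days. A quick sanity-check sketch:

# distribution of the engineered DaysPassed bins
table(binned$DaysPassed)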

Handling Missing Values

Engineer the missing values by:

  • putting a negative value (-1) back in place of the NaN entries;

  • putting 0 back in place of the -Inf entries.

binned2 <- binned %>% 
  mutate_at(vars(Balance), ~replace(., is.nan(.), -1)) %>%  # NaN (negative balance) -> -1
  mutate_at(vars(Balance), ~replace(., is.infinite(.), 0))  # -Inf (zero balance)    -> 0
binned2

Re-checking missing values

dfcheck1 <- binned2 %>% 
  is.na() %>% 
  colSums() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "var") %>% 
  rename(total = 2) %>% 
  filter(total !=0) %>% 
  arrange(desc(total)) %>% 
  mutate(percent = total / nrow(binned2)) # proportion out of all rows in the data

dfcheck1
binned2$Job <- binned2$Job %>% 
  replace(is.na(.), "unknown")
binned2$Education <- binned2$Education %>% 
  replace(is.na(.), "unknown")
binned2 %>% 
  is.na() %>% 
  colSums()
#>          Age          Job      Marital    Education      Default      Balance 
#>            0            0            0            0            0            0 
#>  HHInsurance      CarLoan NoOfContacts   DaysPassed PrevAttempts CarInsurance 
#>            0            0            0            0            0            0
ready <- binned2 %>% 
  mutate_if(is.character, as.factor)

Checking the Correlation Between Predictors

GGally::ggcorr(ready, label = T)

Splitting the train and test data

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(ready, prop = 0.8 , strata = "CarInsurance")

train2 <- training(index)
test2 <- testing(index)
# recheck class balance
prop.table(table(train2$CarInsurance))
#> 
#>         0         1 
#> 0.5989372 0.4010628
prop.table(table(test2$CarInsurance))
#> 
#>         0         1 
#> 0.5992509 0.4007491

Fitting a Model with All Predictors

model_2 <- glm(CarInsurance ~ ., data = train2,
                       family = "binomial")
summary(model_2)

Fitting with the step() option

step(object = model_2, direction = "backward", trace = T)
#> Start:  AIC=3918.41
#> CarInsurance ~ Age + Job + Marital + Education + Default + Balance + 
#>     HHInsurance + CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#> 
#>                Df Deviance    AIC
#> - Default       1   3862.4 3916.4
#> <none>              3862.4 3918.4
#> - Job          11   3884.8 3918.8
#> - Age           1   3867.8 3921.8
#> - CarLoan       1   3868.4 3922.4
#> - Education     3   3872.6 3922.6
#> - PrevAttempts  1   3869.2 3923.2
#> - Marital       2   3879.9 3931.9
#> - Balance       1   3881.6 3935.6
#> - NoOfContacts  1   3889.1 3943.1
#> - DaysPassed    4   3909.7 3957.7
#> - HHInsurance   1   3944.0 3998.0
#> 
#> Step:  AIC=3916.42
#> CarInsurance ~ Age + Job + Marital + Education + Balance + HHInsurance + 
#>     CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#> 
#>                Df Deviance    AIC
#> <none>              3862.4 3916.4
#> - Job          11   3884.8 3916.8
#> - Age           1   3867.8 3919.8
#> - CarLoan       1   3868.5 3920.5
#> - Education     3   3872.7 3920.7
#> - PrevAttempts  1   3869.2 3921.2
#> - Marital       2   3879.9 3929.9
#> - Balance       1   3882.2 3934.2
#> - NoOfContacts  1   3889.1 3941.1
#> - DaysPassed    4   3909.7 3955.7
#> - HHInsurance   1   3944.1 3996.1
#> 
#> Call:  glm(formula = CarInsurance ~ Age + Job + Marital + Education + 
#>     Balance + HHInsurance + CarLoan + NoOfContacts + DaysPassed + 
#>     PrevAttempts, family = "binomial", data = train2)
#> 
#> Coefficients:
#>                (Intercept)                         Age  
#>                   -0.17476                     0.01058  
#>             Jobblue-collar             Jobentrepreneur  
#>                   -0.15608                    -0.30343  
#>               Jobhousemaid               Jobmanagement  
#>                   -0.12139                    -0.16118  
#>                 Jobretired            Jobself-employed  
#>                    0.22121                    -0.15087  
#>                Jobservices                  Jobstudent  
#>                   -0.08355                     0.83586  
#>              Jobtechnician               Jobunemployed  
#>                   -0.03707                     0.21535  
#>                 Jobunknown              Maritalmarried  
#>                   -0.11072                    -0.32469  
#>              Maritalsingle          Educationsecondary  
#>                    0.04015                     0.03602  
#>          Educationtertiary            Educationunknown  
#>                    0.38593                     0.16972  
#>                    Balance                HHInsurance1  
#>                    0.06296                    -0.74567  
#>                   CarLoan1                NoOfContacts  
#>                   -0.29550                    -0.07803  
#>     DaysPassed271-360 Days       DaysPassed91-180 Days  
#>                   -0.31929                     0.22824  
#> DaysPassedmore than a year     DaysPassedNot contacted  
#>                    0.57310                    -0.58463  
#>               PrevAttempts  
#>                    0.06163  
#> 
#> Degrees of Freedom: 3198 Total (i.e. Null);  3172 Residual
#> Null Deviance:       4309 
#> Residual Deviance: 3862  AIC: 3916
step(object = model_2, direction = "both", trace = T)
#> Start:  AIC=3918.41
#> CarInsurance ~ Age + Job + Marital + Education + Default + Balance + 
#>     HHInsurance + CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#> 
#>                Df Deviance    AIC
#> - Default       1   3862.4 3916.4
#> <none>              3862.4 3918.4
#> - Job          11   3884.8 3918.8
#> - Age           1   3867.8 3921.8
#> - CarLoan       1   3868.4 3922.4
#> - Education     3   3872.6 3922.6
#> - PrevAttempts  1   3869.2 3923.2
#> - Marital       2   3879.9 3931.9
#> - Balance       1   3881.6 3935.6
#> - NoOfContacts  1   3889.1 3943.1
#> - DaysPassed    4   3909.7 3957.7
#> - HHInsurance   1   3944.0 3998.0
#> 
#> Step:  AIC=3916.42
#> CarInsurance ~ Age + Job + Marital + Education + Balance + HHInsurance + 
#>     CarLoan + NoOfContacts + DaysPassed + PrevAttempts
#> 
#>                Df Deviance    AIC
#> <none>              3862.4 3916.4
#> - Job          11   3884.8 3916.8
#> + Default       1   3862.4 3918.4
#> - Age           1   3867.8 3919.8
#> - CarLoan       1   3868.5 3920.5
#> - Education     3   3872.7 3920.7
#> - PrevAttempts  1   3869.2 3921.2
#> - Marital       2   3879.9 3929.9
#> - Balance       1   3882.2 3934.2
#> - NoOfContacts  1   3889.1 3941.1
#> - DaysPassed    4   3909.7 3955.7
#> - HHInsurance   1   3944.1 3996.1
#> 
#> Call:  glm(formula = CarInsurance ~ Age + Job + Marital + Education + 
#>     Balance + HHInsurance + CarLoan + NoOfContacts + DaysPassed + 
#>     PrevAttempts, family = "binomial", data = train2)
#> 
#> Coefficients:
#>                (Intercept)                         Age  
#>                   -0.17476                     0.01058  
#>             Jobblue-collar             Jobentrepreneur  
#>                   -0.15608                    -0.30343  
#>               Jobhousemaid               Jobmanagement  
#>                   -0.12139                    -0.16118  
#>                 Jobretired            Jobself-employed  
#>                    0.22121                    -0.15087  
#>                Jobservices                  Jobstudent  
#>                   -0.08355                     0.83586  
#>              Jobtechnician               Jobunemployed  
#>                   -0.03707                     0.21535  
#>                 Jobunknown              Maritalmarried  
#>                   -0.11072                    -0.32469  
#>              Maritalsingle          Educationsecondary  
#>                    0.04015                     0.03602  
#>          Educationtertiary            Educationunknown  
#>                    0.38593                     0.16972  
#>                    Balance                HHInsurance1  
#>                    0.06296                    -0.74567  
#>                   CarLoan1                NoOfContacts  
#>                   -0.29550                    -0.07803  
#>     DaysPassed271-360 Days       DaysPassed91-180 Days  
#>                   -0.31929                     0.22824  
#> DaysPassedmore than a year     DaysPassedNot contacted  
#>                    0.57310                    -0.58463  
#>               PrevAttempts  
#>                    0.06163  
#> 
#> Degrees of Freedom: 3198 Total (i.e. Null);  3172 Residual
#> Null Deviance:       4309 
#> Residual Deviance: 3862  AIC: 3916
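
Note that step() returns the selected glm object, so rather than retyping the winning formula it can be captured directly; a quick sketch (model_backward is an illustrative name):

# capture the backward-selected model directly instead of retyping its formula
model_backward <- step(model_2, direction = "backward", trace = 0)
formula(model_backward)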

Choosing a model from the step() results

Tuning was attempted by adding and removing predictors, but this remained the best result:

model_step <- glm(formula = CarInsurance ~ Age + Job + Marital + Education + DaysPassed + 
    Balance + HHInsurance + CarLoan + NoOfContacts + 
    PrevAttempts, family = "binomial", data = train2)
summary(model_step)
#> 
#> Call:
#> glm(formula = CarInsurance ~ Age + Job + Marital + Education + 
#>     DaysPassed + Balance + HHInsurance + CarLoan + NoOfContacts + 
#>     PrevAttempts, family = "binomial", data = train2)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.9962  -0.9522  -0.6693   1.1067   2.6428  
#> 
#> Coefficients:
#>                             Estimate Std. Error z value             Pr(>|z|)
#> (Intercept)                -0.174756   0.353309  -0.495              0.62086
#> Age                         0.010577   0.004551   2.324              0.02011
#> Jobblue-collar             -0.156077   0.154391  -1.011              0.31205
#> Jobentrepreneur            -0.303432   0.265828  -1.141              0.25368
#> Jobhousemaid               -0.121393   0.268076  -0.453              0.65067
#> Jobmanagement              -0.161177   0.161232  -1.000              0.31748
#> Jobretired                  0.221208   0.216533   1.022              0.30698
#> Jobself-employed           -0.150872   0.237009  -0.637              0.52441
#> Jobservices                -0.083553   0.176207  -0.474              0.63537
#> Jobstudent                  0.835862   0.272922   3.063              0.00219
#> Jobtechnician              -0.037069   0.148746  -0.249              0.80320
#> Jobunemployed               0.215346   0.244510   0.881              0.37847
#> Jobunknown                 -0.110718   0.537006  -0.206              0.83665
#> Maritalmarried             -0.324690   0.122254  -2.656              0.00791
#> Maritalsingle               0.040150   0.140709   0.285              0.77538
#> Educationsecondary          0.036022   0.132280   0.272              0.78538
#> Educationtertiary           0.385931   0.158402   2.436              0.01483
#> Educationunknown            0.169720   0.223148   0.761              0.44691
#> DaysPassed271-360 Days     -0.319291   0.239505  -1.333              0.18249
#> DaysPassed91-180 Days       0.228238   0.203842   1.120              0.26285
#> DaysPassedmore than a year  0.573097   0.237275   2.415              0.01572
#> DaysPassedNot contacted    -0.584626   0.178597  -3.273              0.00106
#> Balance                     0.062962   0.014304   4.402           0.00001074
#> HHInsurance1               -0.745673   0.083159  -8.967 < 0.0000000000000002
#> CarLoan1                   -0.295498   0.120925  -2.444              0.01454
#> NoOfContacts               -0.078026   0.016760  -4.655           0.00000323
#> PrevAttempts                0.061628   0.026447   2.330              0.01980
#>                               
#> (Intercept)                   
#> Age                        *  
#> Jobblue-collar                
#> Jobentrepreneur               
#> Jobhousemaid                  
#> Jobmanagement                 
#> Jobretired                    
#> Jobself-employed              
#> Jobservices                   
#> Jobstudent                 ** 
#> Jobtechnician                 
#> Jobunemployed                 
#> Jobunknown                    
#> Maritalmarried             ** 
#> Maritalsingle                 
#> Educationsecondary            
#> Educationtertiary          *  
#> Educationunknown              
#> DaysPassed271-360 Days        
#> DaysPassed91-180 Days         
#> DaysPassedmore than a year *  
#> DaysPassedNot contacted    ** 
#> Balance                    ***
#> HHInsurance1               ***
#> CarLoan1                   *  
#> NoOfContacts               ***
#> PrevAttempts               *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 4308.7  on 3198  degrees of freedom
#> Residual deviance: 3862.4  on 3172  degrees of freedom
#> AIC: 3916.4
#> 
#> Number of Fisher Scoring iterations: 4
test2$preds <- predict(object = model_step, 
                       newdata = test2, 
                       type = "response") # type = "response" returns predicted probabilities for the positive class
head(test2$preds)
#> [1] 0.1689022 0.5669911 0.7738292 0.2800508 0.3747582 0.8773409
test2$label <- ifelse(test2$preds > 0.55, 1, 0) # convert probabilities into class labels using a 0.55 cutoff

# make sure the target classes (actual and predicted) are of type factor
test2$label <- as.factor(test2$label)

Model Evaluation

confusionMatrix(data = test2$label, 
                reference = test2$CarInsurance,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 433 195
#>          1  47 126
#>                                                
#>                Accuracy : 0.6979               
#>                  95% CI : (0.6648, 0.7295)     
#>     No Information Rate : 0.5993               
#>     P-Value [Acc > NIR] : 0.000000004283       
#>                                                
#>                   Kappa : 0.319                
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.3925               
#>             Specificity : 0.9021               
#>          Pos Pred Value : 0.7283               
#>          Neg Pred Value : 0.6895               
#>              Prevalence : 0.4007               
#>          Detection Rate : 0.1573               
#>    Detection Prevalence : 0.2160               
#>       Balanced Accuracy : 0.6473               
#>                                                
#>        'Positive' Class : 1                    
#>