Support Vector Machine (SVM) was introduced by Vapnik in 1992 as an efficient classification technique for nonlinear problems. SVM differs from the classification techniques of the 1980s, such as decision trees and artificial neural networks (ANN), which were conceptually less well defined and frequently got stuck in local optima. SVM rests on a much more mature and mathematically better-founded concept than the classification techniques that preceded the 1990s (Suyanto, 2018).

Reading the data

bank <- read.csv("UniversalBank.csv")

Structure of the data

str(bank)
## 'data.frame':    5000 obs. of  14 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...

Descriptive statistics of the data

library(psych)
describe(bank)
##                    vars    n     mean      sd  median  trimmed     mad
## ID                    1 5000  2500.50 1443.52  2500.5  2500.50 1853.25
## Age                   2 5000    45.34   11.46    45.0    45.38   14.83
## Experience            3 5000    20.10   11.47    20.0    20.13   14.83
## Income                4 5000    73.77   46.03    64.0    68.83   43.00
## ZIP.Code              5 5000 93152.50 2121.85 93437.0 93236.13 1968.89
## Family                6 5000     2.40    1.15     2.0     2.37    1.48
## CCAvg                 7 5000     1.94    1.75     1.5     1.65    1.33
## Education             8 5000     1.88    0.84     2.0     1.85    1.48
## Mortgage              9 5000    56.50  101.71     0.0    32.98    0.00
## Personal.Loan        10 5000     0.10    0.29     0.0     0.00    0.00
## Securities.Account   11 5000     0.10    0.31     0.0     0.01    0.00
## CD.Account           12 5000     0.06    0.24     0.0     0.00    0.00
## Online               13 5000     0.60    0.49     1.0     0.62    0.00
## CreditCard           14 5000     0.29    0.46     0.0     0.24    0.00
##                     min   max range   skew kurtosis    se
## ID                    1  5000  4999   0.00    -1.20 20.41
## Age                  23    67    44  -0.03    -1.15  0.16
## Experience           -3    43    46  -0.03    -1.12  0.16
## Income                8   224   216   0.84    -0.05  0.65
## ZIP.Code           9307 96651 87344 -12.49   485.52 30.01
## Family                1     4     3   0.16    -1.40  0.02
## CCAvg                 0    10    10   1.60     2.64  0.02
## Education             1     3     2   0.23    -1.55  0.01
## Mortgage              0   635   635   2.10     4.75  1.44
## Personal.Loan         0     1     1   2.74     5.52  0.00
## Securities.Account    0     1     1   2.59     4.69  0.00
## CD.Account            0     1     1   3.69    11.61  0.00
## Online                0     1     1  -0.39    -1.84  0.01
## CreditCard            0     1     1   0.90    -1.18  0.01
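
The summary shows a minimum Experience of -3, which is implausible for a number of working years. A quick count of the affected rows (a minimal check using only base R):

# Rows with a negative (implausible) work experience
sum(bank$Experience < 0)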

Missing-data analysis

library(Amelia)
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.5.3
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(bank, main="Missings Map", col=c("yellow", "black"), legend=FALSE)

# The data contains no missing values (NA)
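
As a numeric cross-check of the missingness map above (a small sketch using only base R):

# Number of missing values per column; all zeros confirms there are no NAs
colSums(is.na(bank))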

Tidying the data in preparation for the training and testing sets

bank <- bank[, -c(1, 5)]                       # drop ID and ZIP.Code
bank$CreditCard <- as.factor(bank$CreditCard)  # target variable as a factor
Under <- ifelse(bank$Education == 1, 1, 0)     # dummy: undergraduate education
Grad  <- ifelse(bank$Education == 2, 1, 0)     # dummy: graduate education
bank <- cbind(bank, Under, Grad)
bank_2 <- bank[, -6]                           # drop the original Education column
bank_2[-11] <- scale(bank_2[-11])              # standardize every predictor except CreditCard (column 11)
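
To verify that only the target was left unscaled (a small check; the column positions follow from the steps above):

# CreditCard should remain a two-level factor in column 11,
# while the scaled predictors should have means of roughly zero
str(bank_2$CreditCard)
round(colMeans(bank_2[-11]), 3)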

Creating the training and testing data

library(caTools)
set.seed(1111)
spl_1 <- sample.split(bank_2$CreditCard, SplitRatio = 0.7)  # stratified 70/30 split on CreditCard

Training <- subset(bank_2, spl_1 == TRUE)   # 70% of the rows for training
Testing  <- subset(bank_2, spl_1 == FALSE)  # remaining 30% for testing
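
Since sample.split stratifies on the label, the class proportions of CreditCard should be nearly identical in the two subsets; a quick check:

# Class balance of CreditCard in each subset
prop.table(table(Training$CreditCard))
prop.table(table(Testing$CreditCard))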

Building the SVM model

library(e1071)
# Linear-kernel SVM classifier for CreditCard, fitted on the training data
models <- svm(CreditCard ~ ., data = Training, scale = T, kernel = "linear", type = "C-classification")
summary(models)
## 
## Call:
## svm(formula = CreditCard ~ ., data = Training, kernel = "linear", 
##     type = "C-classification", scale = T)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.08333333 
## 
## Number of Support Vectors:  2016
## 
##  ( 1067 949 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
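
The model above keeps the default cost of 1. A cross-validated grid search with e1071's tune() could be used to check whether another cost value performs better; the grid below is only illustrative:

# 10-fold cross-validation (tune's default) over a small, illustrative grid of cost values
set.seed(1111)
tuned <- tune(svm, CreditCard ~ ., data = Training,
              kernel = "linear", type = "C-classification",
              ranges = list(cost = c(0.1, 1, 10)))
summary(tuned)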

Evaluating the model on the testing data

pred <- predict(models, Testing[-11])  # predict on the test predictors (column 11 is CreditCard)
tabel <- table(as.numeric(pred), as.numeric(Testing$CreditCard))  # as.numeric() recodes the 0/1 factor levels as 1/2
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
confusionMatrix(tabel)
## Confusion Matrix and Statistics
## 
##    
##        1    2
##   1 2425  863
##   2   46  166
##                                           
##                Accuracy : 0.7403          
##                  95% CI : (0.7254, 0.7547)
##     No Information Rate : 0.706           
##     P-Value [Acc > NIR] : 3.651e-06       
##                                           
##                   Kappa : 0.1857          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9814          
##             Specificity : 0.1613          
##          Pos Pred Value : 0.7375          
##          Neg Pred Value : 0.7830          
##              Prevalence : 0.7060          
##          Detection Rate : 0.6929          
##    Detection Prevalence : 0.9394          
##       Balanced Accuracy : 0.5714          
##                                           
##        'Positive' Class : 1               
##
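
Because the predictions and labels were converted with as.numeric(), the classes appear as 1 and 2 in the table above (factor level 0 becomes 1, level 1 becomes 2). Passing the factors to confusionMatrix() directly keeps the original 0/1 labels; a small variant:

# Same confusion matrix, but with the original 0/1 factor labels
confusionMatrix(pred, Testing$CreditCard)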