Prediction and Visualization of Motor Vehicle Insurance Premium

Background
Read Data
Data Preprocess
Premium (Production Year)
Premium (Underwriting Year)
Claim Fraud Detection

Background

Premium recording in motor vehicle insurance can be grouped as follows:

By Production Year, where premium production is recorded up to the year-end cutoff period. This data is used as the premium reserve for each year.
By Underwriting Year, where premium production is recorded according to the year in which the underwriter assessed the risk. This data is also used to measure the marketing team's achievement.

A visualization is therefore needed that conveys information on both premium-recording categories, as a basis for the insurance company to improve its performance.

Motor vehicle insurance premiums are also affected by time, because they are closely tied to car purchases, which themselves rise at certain times of the year. For example, car purchases increase ahead of Eid al-Fitr. A suitable algorithm is therefore needed to predict premium recording for both categories (time series prediction).

Motor vehicle insurance is also closely tied to claims, so a machine learning algorithm is used to detect fraud in claim submissions, based on historical data of fraudulent motor vehicle insurance claims.

Read Data

The data used is motor vehicle premium production data, covering several segments (business channels):

Direct Segment
Agency Segment
Leasing Segment
Dealer Segment

Data Preprocess

In this stage we group the features that will be used to visualize premium recording under the two categories above, and later to build the time series models that predict premium recording for both categories.

Select Columns

Column descriptions:

NO_MONTH : Number of the month in which the premium was recorded in the system

MONTH : Name of the month in which the premium was recorded in the system

SEGMENT : Segment (business source) from which the insurance business was obtained

POLICYNO : Policy number or master policy number

TRANSACTION_TYPE : Type of insurance transaction: new policy, endorsement (change), or cancellation

INCEPTION : Start of the policy period

EXPIRY : End of the policy period

BOOKING_DATE : Date the premium was booked as production in the system

USER_APPROVE_DATE : Date the premium was invoiced by the underwriter

TOC : Type of Coverage

TOC_DESCRIPTION : Description of the type of coverage

TOC_GROUP : Group of the type of coverage

TOC_GROUP_DESCRIPTION : Description of the coverage group

TSI : Total Sum Insured (the insured value)

PREMIUM_GROSS : Gross insurance premium

DISCOUNT : Premium discount given to the customer

COMMISSION : Insurance commission for the agent, broker, or intermediary

VAT : Value-added tax (PPN, 10%) on the gross premium

TAX : Income tax (PPh, depending on the category of the taxed company)

POLICY_FEE : Policy administration fee

STAMP_DUTY : Stamp duty

VEHICLE_CATEGORY : Category of the insured vehicle

VEHICLE_TYPE : Type of the insured vehicle

GROUP_MV : Group of the insured vehicle: two-wheeled vehicle, four-wheeled vehicle, or tank truck

Data Skimming

## Observations: 42,537
## Variables: 24
## $ NO_MONTH              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ MONTH                 <fct> January, January, January, January, Janu...
## $ SEGMENT               <fct> Direct, Agent, Agent, Agent, Agent, Agen...
## $ POLICYNO              <fct> 11002011900022, 11502011900001, 11502011...
## $ TRANSACTION_TYPE      <fct> NEW, NEW, NEW, NEW, NEW, NEW, NEW, NEW, ...
## $ INCEPTION             <fct> 12/14/2018, 12/04/2018, 12/04/2018, 12/2...
## $ EXPIRY                <fct> 12/14/2019, 12/27/2021, 12/04/2020, 12/2...
## $ BOOKING_DATE          <fct> 01/09/2019, 01/02/2019, 01/02/2019, 01/0...
## $ USER_APPROVE_DATE     <fct> 01/09/2019, 01/02/2019, 01/02/2019, 01/0...
## $ TOC                   <dbl> 201, 201, 201, 201, 201, 201, 201, 201, ...
## $ TOC_DESCRIPTION       <fct> PSAKBI, PSAKBI, PSAKBI, PSAKBI, PSAKBI, ...
## $ TOC_GROUP             <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ TOC_GROUP_DESCRIPTION <fct> Motor Vehicle, Motor Vehicle, Motor Vehi...
## $ TSI                   <dbl> 165750000, 0, 37000000, 65000000, 780000...
## $ PREMIUM_GROSS         <dbl> 5347933, 0, 207200, 637000, 764400, 5824...
## $ DISCOUNT              <dbl> -1336983, 0, 0, 0, 0, 0, -3757500, -1958...
## $ COMMISSION            <dbl> 0, 0, -51800, -159250, -191100, -145600,...
## $ VAT                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ TAX                   <dbl> 0, 0, 1295, 3981, 4778, 3640, 0, 0, 0, 0...
## $ POLICY_FEE            <dbl> 44000, 120000, 0, 0, 0, 0, 50000, 50000,...
## $ STAMP_DUTY            <dbl> 6000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ VEHICLE_CATEGORY      <fct> Non Bus / Non Truck, #N/A, Non Bus / Non...
## $ VEHICLE_TYPE          <fct> Sedan, #N/A, Minibus, Light Truck, Pick ...
## $ GROUP_MV              <fct> MV 4, MV 4, MV 4, MV 4, MV 4, MV 4, MV 4...
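The overview above shows the date columns stored as factors in month/day/year form. A minimal sketch of converting them to `Date` before any time-based grouping, using a hypothetical three-row sample copied from the first values shown (the `POLICY_YEARS` column is an illustrative addition, not part of the original data):

```r
# Hypothetical three-row sample mirroring the date columns shown above.
premium <- data.frame(
  INCEPTION    = factor(c("12/14/2018", "12/04/2018", "12/04/2018")),
  EXPIRY       = factor(c("12/14/2019", "12/27/2021", "12/04/2020")),
  BOOKING_DATE = factor(c("01/09/2019", "01/02/2019", "01/02/2019"))
)

date_cols <- c("INCEPTION", "EXPIRY", "BOOKING_DATE")

# Convert factor -> character -> Date; the source format is month/day/year.
premium[date_cols] <- lapply(
  premium[date_cols],
  function(x) as.Date(as.character(x), format = "%m/%d/%Y")
)

# Approximate policy length in years (illustrative helper column),
# useful for spotting multiyear policies.
premium$POLICY_YEARS <- as.numeric(premium$EXPIRY - premium$INCEPTION) / 365.25

str(premium)
```

Subtracting two `Date` vectors yields a difference in days, which is why the result is divided by 365.25.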

Segments with the Largest Contribution

Largest Number of Policies

The largest premium production comes from the leasing segment, where almost all policies have a multiyear insurance period. A visualization is therefore needed of the premium reserved each year and of the premium recorded as the marketing team's achievement.

Premium (Production Year)

Read and Preprocessing Data

Data is expressed in millions of rupiah.

Cross Validation

Split data train and data test
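A time series must be split by time rather than at random, so that the test set lies strictly after the training set. A minimal sketch with base R `window()`, assuming a hypothetical monthly series `premium_ts` (the original series and its frequency are not shown here):

```r
set.seed(1)

# Hypothetical monthly premium series (millions of rupiah), Jan 2015 to Dec 2019.
premium_ts <- ts(100 + 20 * sin(2 * pi * (1:60) / 12) + rnorm(60, sd = 5),
                 start = c(2015, 1), frequency = 12)

# Training set: everything up to December 2018.
train <- window(premium_ts, end = c(2018, 12))

# Test set: the final 12 months, used only to evaluate forecasts.
test <- window(premium_ts, start = c(2019, 1))

length(train)
length(test)
```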

Data train

Data test

Build Model

Holt Winters

beta is set to FALSE because the decomposition shows that the data no longer has a trend, although it is still affected by seasonality.
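A minimal sketch of a Holt Winters fit without a trend component, using base R `HoltWinters()` on a hypothetical seasonal series (the data and its monthly frequency are assumptions for illustration):

```r
set.seed(42)

# Hypothetical seasonal series without trend: 4 years of monthly data.
season <- rep(c(80, 85, 95, 100, 120, 140, 130, 110, 100, 95, 90, 85), times = 4)
y <- ts(season + rnorm(48, sd = 3), start = c(2015, 1), frequency = 12)

# beta = FALSE drops the trend term; gamma is left unset, so the
# (additive) seasonal component is still estimated.
fit <- HoltWinters(y, beta = FALSE)

# Forecast the next 12 periods from the fitted level and seasonal terms.
fc <- predict(fit, n.ahead = 12)

# Coefficients hold the level (a) and the seasonal terms (s1..s12), no slope (b).
names(fit$coefficients)
```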

S-ARIMA

Because the data has a seasonal effect, an S-ARIMA model is used.

Auto S-ARIMA

## Series: production_year_train 
## ARIMA(1,0,1) with zero mean 
## 
## Coefficients:
##          ar1      ma1
##       0.9876  -0.6662
## s.e.  0.0115   0.0743
## 
## sigma^2 estimated as 10643:  log likelihood=-629.81
## AIC=1265.62   AICc=1265.86   BIC=1273.56

Manual S-ARIMA

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.3655, Lag order = 4, p-value = 0.8397
## alternative hypothesis: stationary

The p-value is > 0.05, so the series is not stationary and differencing is required.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -5.4259, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

After differencing, the p-value is < 0.05, so the series is stationary.
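The differencing step can be sketched in base R with `diff()`. The series below is a hypothetical random walk, and the seasonal lag of 12 is an assumption. The ADF test itself comes from the `tseries` package, so it is run only when that package is installed:

```r
set.seed(7)

# Hypothetical non-stationary series: a random walk of 100 points.
y <- ts(cumsum(rnorm(100)), frequency = 12)

# Regular differencing (d = 1) removes the stochastic trend.
dy <- diff(y, differences = 1)

# Seasonal differencing (D = 1) at an assumed seasonal lag of 12.
dsy <- diff(dy, lag = 12)

# adf.test() is provided by the tseries package; skip if unavailable.
if (requireNamespace("tseries", quietly = TRUE)) {
  print(suppressWarnings(tseries::adf.test(y))$p.value)   # before differencing
  print(suppressWarnings(tseries::adf.test(dy))$p.value)  # after differencing
}
```

Each differencing pass shortens the series: one observation is lost for d = 1, and a further 12 for the seasonal difference.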

With differencing, (p,d,q)(P,D,Q)(F) becomes (,1,)(,1,)().

S-ARIMA 1

The series is differenced once. Based on the ACF and PACF plots, the process is pure MA, because the ACF cuts off and the PACF dies down, so the seasonal MA order is set to 1.

Add the multi-seasonal component.

(0,1,1) and (0,1,1), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 2

(0,1,1) and (0,1,3), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 3

(0,1,3) and (0,1,3), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 4

(0,1,3) and (0,1,5), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 5

(0,1,5) and (0,1,5), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 6

(0,1,7) and (0,1,5), seasonal.periods = c(4,50.62,52.17)

AIC comparison of all S-ARIMA models

##      Model      nilai_AIC         
## [1,] "S-ARIMA1" "623.071240225206"
## [2,] "S-ARIMA2" "627.071240228832"
## [3,] "S-ARIMA3" "629.488378913748"
## [4,] "S-ARIMA4" "633.466320979104"

Model S-ARIMA1 has the smallest AIC, with order and seasonal = (0,1,1) and (0,1,1), seasonal.periods = c(4,50.62,52.17).

Preliminary conclusion: S-ARIMA1 is the best model, because it has the smallest AIC among the models compared.
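Comparing candidate orders by AIC can be sketched with base R `arima()`. This is an illustration under stated assumptions, not the original modelling code: it uses the built-in `USAccDeaths` series as stand-in data, a single seasonal period of 12 instead of the multi-seasonal periods above, and hypothetical model names.

```r
# Stand-in data: monthly series shipped with base R (not the premium data).
y <- USAccDeaths

# Hypothetical candidates; d = D = 1 and p = P = 0, as in the text above.
specs <- list(
  "candidate 1" = list(order = c(0, 1, 1), seasonal = c(0, 1, 1)),
  "candidate 2" = list(order = c(0, 1, 2), seasonal = c(0, 1, 1)),
  "candidate 3" = list(order = c(0, 1, 0), seasonal = c(0, 1, 1))
)

# Fit every candidate with an assumed seasonal period of 12.
fits <- lapply(specs, function(s) {
  arima(y, order = s$order, seasonal = list(order = s$seasonal, period = 12))
})

# Collect AIC values and sort so that the smallest AIC comes first.
aic_table <- data.frame(model = names(fits),
                        aic = sapply(fits, AIC),
                        row.names = NULL)
aic_table <- aic_table[order(aic_table$aic), ]
best <- aic_table$model[1]

print(aic_table)
```

AIC only ranks models fitted to the same data with the same differencing, which is why all candidates share d = 1 and D = 1.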

Forecasting

1. Forecasting Holt Winters

Evaluation of the Holt Winters model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.984112098177416"
## [2,] "RSquared prediksi" "0.661750747970163"
## [3,] "RMSE model"        "21.7150206859214" 
## [4,] "RMSE prediksi"     "40.8102400874216" 
## [5,] "MAE model"         "15.0334661201796" 
## [6,] "MAE prediksi"      "35.3434835265517" 
## [7,] "MAPE model"        "0.096020360214706"
## [8,] "MAPE prediksi"     "0.110840387217148"
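The metrics reported in these evaluation tables can be computed with a few base R helpers. The sketch below uses hypothetical values. How R-squared was computed in the original analysis is not shown, so the squared correlation used here is an assumption:

```r
# Root mean squared error: penalises large errors more heavily.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Mean absolute error: average size of the error, in the data's units.
mae <- function(actual, pred) mean(abs(actual - pred))

# Mean absolute percentage error as a fraction (0.10 means 10%);
# undefined when any actual value is zero.
mape <- function(actual, pred) mean(abs((actual - pred) / actual))

# R-squared taken as the squared correlation (an assumption).
r_squared <- function(actual, pred) cor(actual, pred)^2

actual <- c(100, 120, 140, 160)  # hypothetical test values
pred   <- c(110, 115, 150, 155)  # hypothetical forecasts

c(RSquared = r_squared(actual, pred),
  RMSE = rmse(actual, pred),
  MAE = mae(actual, pred),
  MAPE = mape(actual, pred))
```

Applying the same helpers to the fitted values on the training set and to the forecasts on the test set gives the "model" and "prediksi" rows respectively.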

This model is considered fitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction do not differ much.

2. Forecasting Auto S-ARIMA

The forecast plot shows predictions that are very far from the actual data, so this model is not evaluated and is not used for prediction.

3. Forecasting SARIMA1

SARIMA1 is specified as order(0,1,1) seasonal(0,1,1) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA1 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.672183273502831"
## [2,] "RSquared prediksi" "0.418900238504829"
## [3,] "RMSE Model"        "69.7473209131274" 
## [4,] "RMSE prediksi"     "53.490428368665"  
## [5,] "MAE Model"         "26.312974058106"  
## [6,] "MAE Prediksi"      "36.7563425206813" 
## [7,] "MAPE Model"        "0.213472928284028"
## [8,] "MAPE Prediksi"     "0.126183035781386"

This model is considered fitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction do not differ much.

4. Forecasting SARIMA2

SARIMA2 is specified as order(0,1,1) seasonal(0,1,3) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA2 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.672183351889067"
## [2,] "RSquared prediksi" "0.418898920733051"
## [3,] "RMSE Model"        "69.7473125742756" 
## [4,] "RMSE prediksi"     "53.4904890192981" 
## [5,] "MAE Model"         "26.3129753870405" 
## [6,] "MAE Prediksi"      "36.756403786463"  
## [7,] "MAPE Model"        "0.213473016843235"
## [8,] "MAPE Prediksi"     "0.126183262205178"

This model is considered fitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction do not differ much.

5. Forecasting SARIMA3

SARIMA3 is specified as order(0,1,3) seasonal(0,1,3) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA3 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.902048911907225"
## [2,] "RSquared prediksi" "0.179122587597292"
## [3,] "RMSE Model"        "38.1255930737074" 
## [4,] "RMSE prediksi"     "63.5755140015888" 
## [5,] "MAE Model"         "15.0353425312786" 
## [6,] "MAE Prediksi"      "35.06600057872"   
## [7,] "MAPE Model"        "0.111475893343911"
## [8,] "MAPE Prediksi"     "0.104187234953977"

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

6. Forecasting SARIMA4

SARIMA4 is specified as order(0,1,3) seasonal(0,1,5) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA4 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error         
## [1,] "RSquared model"    "0.954098691603804" 
## [2,] "RSquared prediksi" "0.141893529215824" 
## [3,] "RMSE Model"        "26.0990385205821"  
## [4,] "RMSE prediksi"     "65.0011912325919"  
## [5,] "MAE Model"         "10.3167524117525"  
## [6,] "MAE Prediksi"      "35.6527799268892"  
## [7,] "MAPE Model"        "0.0765459218174017"
## [8,] "MAPE Prediksi"     "0.104756733774311"

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

7. Forecasting SARIMA5

SARIMA5 is specified as order(0,1,5) seasonal(0,1,5) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA5 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

8. Forecasting SARIMA6

SARIMA6 is specified as order(0,1,7) seasonal(0,1,5) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA6 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

The forecasting results show that the Holt Winters, SARIMA1 and SARIMA2 models tend to be fitted for prediction, and SARIMA1 and SARIMA2 also have small AIC values.

Assumption Tests

1. Normality

## 
##  Shapiro-Wilk normality test
## 
## data:  production_year_train_sarima1$residuals
## W = 0.53154, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  production_year_train_sarima2$residuals
## W = 0.53154, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  production_year_train_sarima3$residuals
## W = 0.55927, p-value = 3.712e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  production_year_train_sarima4$residuals
## W = 0.56021, p-value = 3.857e-16

All Shapiro-Wilk p-values are far below 0.05, so the hypothesis of normality is rejected: the residuals of the SARIMA models are not normally distributed, which is consistent with the presence of several outliers.
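The decision rule for the Shapiro-Wilk test can be sketched in base R. The residuals below are hypothetical (mostly small errors plus a few large outliers), chosen to mimic the situation described above:

```r
set.seed(10)

# Hypothetical residuals: 95 small errors plus 5 large outliers.
res <- c(rnorm(95, sd = 1), 25, -30, 40, -35, 28)

sw <- shapiro.test(res)

# H0: the data are normally distributed.
# A p-value below 0.05 rejects H0, i.e. the residuals are NOT normal.
is_normal <- sw$p.value >= 0.05

print(sw)
print(is_normal)
```

A W statistic well below 1, as in the outputs above, already signals a strong departure from normality.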

2. Autocorrelation

## 
##  Box-Ljung test
## 
## data:  production_year_train_sarima1$residuals
## X-squared = 1.1881, df = 1, p-value = 0.2757

## 
##  Box-Ljung test
## 
## data:  production_year_train_sarima2$residuals
## X-squared = 1.1881, df = 1, p-value = 0.2757

## 
##  Box-Ljung test
## 
## data:  production_year_train_sarima3$residuals
## X-squared = 0.69453, df = 1, p-value = 0.4046

## 
##  Box-Ljung test
## 
## data:  production_year_train_sarima4$residuals
## X-squared = 0.6872, df = 1, p-value = 0.4071

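At the 0.05 significance level, none of the four models shown (SARIMA1 to SARIMA4) rejects the hypothesis of no autocorrelation in the residuals. SARIMA5 and SARIMA6 also have p-values above 0.05, so they show no residual autocorrelation either, but based on the RMSE, MAE and MAPE evaluation both tend to be underfitted.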
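The Ljung-Box check can be sketched with base R `Box.test()`. The two residual series below are hypothetical: one is white noise, the other a strongly autocorrelated AR(1) process:

```r
set.seed(3)

# Hypothetical residuals: white noise versus an AR(1) process.
white <- rnorm(200)
ar1 <- as.numeric(arima.sim(list(ar = 0.8), n = 200))

# H0: no autocorrelation up to the given lag.
# lag = 1 corresponds to df = 1 in the outputs above.
p_white <- Box.test(white, lag = 1, type = "Ljung-Box")$p.value
p_ar1   <- Box.test(ar1,   lag = 1, type = "Ljung-Box")$p.value

# A p-value below 0.05 means autocorrelation remains in the residuals;
# white noise usually (though not always) gives a p-value above 0.05.
c(white = p_white, ar1 = p_ar1)
```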

Premium (Underwriting Year)

Read and Preprocessing Data

Data is expressed in millions of rupiah.

Cross Validation

Split data train and data test

Data train

Data test

Build Model

Holt Winters

beta is set to TRUE because the data has a trend (multiplicative data), and gamma is set to TRUE because the data is affected by seasonality.

S-ARIMA

Because the data has a seasonal effect, an S-ARIMA model is used.

Auto S-ARIMA

## Series: uw_year_train 
## ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.7214
## s.e.   0.0622
## 
## sigma^2 estimated as 507474:  log likelihood=-822.58
## AIC=1649.16   AICc=1649.28   BIC=1654.43

Manual S-ARIMA

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.9641, Lag order = 4, p-value = 0.5916
## alternative hypothesis: stationary

The p-value is > 0.05, so the series is not stationary and differencing is required.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -5.4774, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

After differencing, the p-value is < 0.05, so the series is stationary.

With differencing, (p,d,q)(P,D,Q)(F) becomes (,1,)(,1,)().

S-ARIMA 1

The series is differenced once. Based on the ACF and PACF plots, the process is pure MA, because the ACF cuts off and the PACF dies down, so the seasonal MA order is set to 1.

Add the multi-seasonal component.

(0,1,1) and (0,1,1), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 2

(0,1,1) and (0,1,3), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 3

(0,1,3) and (0,1,3), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 4

(0,1,3) and (0,1,5), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 5

(0,1,5) and (0,1,5), seasonal.periods = c(4,50.62,52.17)

S-ARIMA 6

(0,1,7) and (0,1,5), seasonal.periods = c(4,50.62,52.17)

AIC comparison of all S-ARIMA models

##      Model      nilai_AIC         
## [1,] "S-ARIMA1" "846.332594951965"
## [2,] "S-ARIMA2" "850.332594952412"
## [3,] "S-ARIMA3" "840.504065602817"
## [4,] "S-ARIMA4" "844.465838749176"

Model S-ARIMA3 has the smallest AIC, with order and seasonal = (0,1,3) and (0,1,3), seasonal.periods = c(4,50.62,52.17).

Preliminary conclusion: S-ARIMA3 is the best model, because it has the smallest AIC among the models compared.

Forecasting

1. Forecasting Holt Winters

Evaluation of the Holt Winters model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.6648406081387"  
## [2,] "RSquared prediksi" "0.38029298476568" 
## [3,] "RMSE model"        "763.910566290509" 
## [4,] "RMSE prediksi"     "295.472745114177" 
## [5,] "MAE model"         "446.061594083477" 
## [6,] "MAE prediksi"      "269.339433813702" 
## [7,] "MAPE model"        "0.305770532414891"
## [8,] "MAPE prediksi"     "0.26287521854548"

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

2. Forecasting Auto S-ARIMA

The forecast plot shows predictions that are very far from the actual data, so this model is not evaluated and is not used for prediction.

3. Forecasting SARIMA1

SARIMA1 is specified as order(0,1,1) seasonal(0,1,1) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA1 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.550268673119596"
## [2,] "RSquared prediksi" "0.516715360576173"
## [3,] "RMSE Model"        "625.717631289556" 
## [4,] "RMSE prediksi"     "260.9310228599"   
## [5,] "MAE Model"         "281.54530709245"  
## [6,] "MAE Prediksi"      "236.852756049351" 
## [7,] "MAPE Model"        "0.319779343664395"
## [8,] "MAPE Prediksi"     "0.168127393499762"

This model tends to be overfitted, because the RMSE of the model and of the prediction differ greatly, even though the R-squared, MAE and MAPE values of the model and of the prediction do not differ much.

4. Forecasting SARIMA2

SARIMA2 is specified as order(0,1,1) seasonal(0,1,3) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA2 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.550268673583459"
## [2,] "RSquared prediksi" "0.516715335712043"
## [3,] "RMSE Model"        "625.717630966867" 
## [4,] "RMSE prediksi"     "260.931029572117" 
## [5,] "MAE Model"         "281.54530705641"  
## [6,] "MAE Prediksi"      "236.852760330296" 
## [7,] "MAPE Model"        "0.31977934313984" 
## [8,] "MAPE Prediksi"     "0.168127395235184"

This model tends to be overfitted, because the RMSE of the model and of the prediction differ greatly, even though the R-squared, MAE and MAPE values of the model and of the prediction do not differ much.

5. Forecasting SARIMA3

SARIMA3 is specified as order(0,1,3) seasonal(0,1,3) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA3 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error        
## [1,] "RSquared model"    "0.923186871107274"
## [2,] "RSquared prediksi" "0.905103272381356"
## [3,] "RMSE Model"        "258.594658818531" 
## [4,] "RMSE prediksi"     "115.624492385178" 
## [5,] "MAE Model"         "118.904185266878" 
## [6,] "MAE Prediksi"      "111.58232847372"  
## [7,] "MAPE Model"        "0.129481126033259"
## [8,] "MAPE Prediksi"     "0.092555135818933"

This model is considered fitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction do not differ much.

6. Forecasting SARIMA4

SARIMA4 is specified as order(0,1,3) seasonal(0,1,5) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA4 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

##      evaluasi            nilai_error         
## [1,] "RSquared model"    "0.957457435530386" 
## [2,] "RSquared prediksi" "0.90494920126867"  
## [3,] "RMSE Model"        "192.448112205576"  
## [4,] "RMSE prediksi"     "115.718316323023"  
## [5,] "MAE Model"         "88.5799762924094"  
## [6,] "MAE Prediksi"      "110.976297143027"  
## [7,] "MAPE Model"        "0.0963805373813602"
## [8,] "MAPE Prediksi"     "0.0917762246062302"

This model is considered fitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction do not differ much.

7. Forecasting SARIMA5

SARIMA5 is specified as order(0,1,5) seasonal(0,1,5) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA5 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

8. Forecasting SARIMA6

SARIMA6 is specified as order(0,1,7) seasonal(0,1,5) seasonal.periods = c(4,50.62,52.17)

Evaluation of the SARIMA6 model

The model is evaluated using R-squared, RMSE, MAE and MAPE.

This model is considered underfitted, because the R-squared, RMSE, MAE and MAPE values of the model and of the prediction differ greatly.

The forecasting results show that SARIMA-3 and SARIMA-4 are the fitted models for predicting premium by Underwriting Year.

Assumption Tests

1. Normality

## 
##  Shapiro-Wilk normality test
## 
## data:  uw_year_train_sarima1$residuals
## W = 0.68177, p-value = 1.073e-13

## 
##  Shapiro-Wilk normality test
## 
## data:  uw_year_train_sarima2$residuals
## W = 0.68177, p-value = 1.073e-13

## 
##  Shapiro-Wilk normality test
## 
## data:  uw_year_train_sarima3$residuals
## W = 0.69851, p-value = 2.614e-13

## 
##  Shapiro-Wilk normality test
## 
## data:  uw_year_train_sarima4$residuals
## W = 0.69892, p-value = 2.673e-13

All Shapiro-Wilk p-values are far below 0.05, so the hypothesis of normality is rejected: the residuals of the SARIMA models are not normally distributed, which is consistent with the presence of several outliers.

2. Autocorrelation

## 
##  Box-Ljung test
## 
## data:  uw_year_train_sarima1$residuals
## X-squared = 15.724, df = 1, p-value = 7.33e-05

## 
##  Box-Ljung test
## 
## data:  uw_year_train_sarima2$residuals
## X-squared = 15.724, df = 1, p-value = 7.33e-05

## 
##  Box-Ljung test
## 
## data:  uw_year_train_sarima3$residuals
## X-squared = 0.12112, df = 1, p-value = 0.7278

## 
##  Box-Ljung test
## 
## data:  uw_year_train_sarima4$residuals
## X-squared = 0.10992, df = 1, p-value = 0.7402

SARIMA3 and SARIMA4 have p-values > 0.05, so neither model has residual autocorrelation, and both are fitted models. SARIMA1 and SARIMA2 have p-values far below 0.05, so autocorrelation remains in their residuals. SARIMA6 also has a p-value > 0.05, but that model is underfitted.

Conclusion:

Premium by production year is better predicted with the Holt Winters, SARIMA-1 and SARIMA-2 models, while premium by underwriting year is better predicted with the SARIMA-3 and SARIMA-4 models.

Claim Fraud Detection

Read Data and Data Preprocess

##          months_as_customer                         age 
##                           0                           0 
##               policy_number            policy_bind_date 
##                           0                           0 
##                policy_state                  policy_csl 
##                           0                           0 
##           policy_deductable       policy_annual_premium 
##                           0                           0 
##              umbrella_limit                 insured_zip 
##                           0                           0 
##                 insured_sex     insured_education_level 
##                           0                           0 
##          insured_occupation             insured_hobbies 
##                           0                           0 
##        insured_relationship               capital.gains 
##                           0                           0 
##                capital.loss               incident_date 
##                           0                           0 
##               incident_type              collision_type 
##                           0                           0 
##           incident_severity       authorities_contacted 
##                           0                           0 
##              incident_state               incident_city 
##                           0                           0 
##           incident_location    incident_hour_of_the_day 
##                           0                           0 
## number_of_vehicles_involved             property_damage 
##                           0                           0 
##             bodily_injuries                   witnesses 
##                           0                           0 
##     police_report_available          total_claim_amount 
##                           0                           0 
##                injury_claim              property_claim 
##                           0                           0 
##               vehicle_claim                   auto_make 
##                           0                           0 
##                  auto_model                   auto_year 
##                           0                           0 
##              fraud_reported 
##                           0

## Observations: 1,000
## Variables: 39
## $ months_as_customer          <int> 328, 228, 134, 256, 228, 256, 137,...
## $ age                         <int> 48, 42, 29, 41, 44, 39, 34, 37, 33...
## $ policy_number               <int> 521585, 342868, 687698, 227811, 36...
## $ policy_bind_date            <fct> 10/17/2014, 06/27/2006, 09/06/2000...
## $ policy_state                <fct> OH, IN, OH, IL, IL, OH, IN, IL, IL...
## $ policy_csl                  <fct> 250/500, 250/500, 100/300, 250/500...
## $ policy_deductable           <int> 1000, 2000, 2000, 2000, 1000, 1000...
## $ policy_annual_premium       <dbl> 1406.91, 1197.22, 1413.14, 1415.74...
## $ umbrella_limit              <int> 0, 5000000, 5000000, 6000000, 6000...
## $ insured_zip                 <int> 466132, 468176, 430632, 608117, 61...
## $ insured_sex                 <fct> MALE, MALE, FEMALE, FEMALE, MALE, ...
## $ insured_education_level     <fct> MD, MD, PhD, PhD, Associate, PhD, ...
## $ insured_occupation          <fct> craft-repair, machine-op-inspct, s...
## $ insured_hobbies             <fct> sleeping, reading, board-games, bo...
## $ insured_relationship        <fct> husband, other-relative, own-child...
## $ capital.gains               <int> 53300, 0, 35100, 48900, 66000, 0, ...
## $ capital.loss                <int> 0, 0, 0, -62400, -46000, 0, -77000...
## $ incident_date               <fct> 01/25/2015, 01/21/2015, 02/22/2015...
## $ incident_type               <fct> Single Vehicle Collision, Vehicle ...
## $ collision_type              <fct> Side Collision, ?, Rear Collision,...
## $ incident_severity           <fct> Major Damage, Minor Damage, Minor ...
## $ authorities_contacted       <fct> Police, Police, Police, Police, No...
## $ incident_state              <fct> SC, VA, NY, OH, NY, SC, NY, VA, WV...
## $ incident_city               <fct> Columbus, Riverwood, Columbus, Arl...
## $ incident_location           <fct> 9935 4th Drive, 6608 MLK Hwy, 7121...
## $ incident_hour_of_the_day    <int> 5, 8, 7, 5, 20, 19, 0, 23, 21, 14,...
## $ number_of_vehicles_involved <int> 1, 1, 3, 1, 1, 3, 3, 3, 1, 1, 1, 3...
## $ property_damage             <fct> YES, ?, NO, ?, NO, NO, ?, ?, NO, N...
## $ bodily_injuries             <int> 1, 0, 2, 1, 0, 0, 0, 2, 1, 2, 2, 1...
## $ witnesses                   <int> 2, 0, 3, 2, 1, 2, 0, 2, 1, 1, 2, 2...
## $ police_report_available     <fct> YES, ?, NO, NO, NO, NO, ?, YES, YE...
## $ total_claim_amount          <int> 71610, 5070, 34650, 63400, 6500, 6...
## $ injury_claim                <int> 6510, 780, 7700, 6340, 1300, 6410,...
## $ property_claim              <int> 13020, 780, 3850, 6340, 650, 6410,...
## $ vehicle_claim               <int> 52080, 3510, 23100, 50720, 4550, 5...
## $ auto_make                   <fct> Saab, Mercedes, Dodge, Chevrolet, ...
## $ auto_model                  <fct> 92x, E400, RAM, Tahoe, RSX, 95, Pa...
## $ auto_year                   <int> 2004, 2007, 2007, 2014, 2009, 2003...
## $ fraud_reported              <fct> Y, Y, N, Y, N, Y, N, N, N, N, N, N...

The data overview above shows several numeric variables that need to be binned and converted to factors.
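Binning a numeric variable into a factor can be sketched with base R `cut()`. The rows below are the first values from the overview above. All break points are assumptions chosen for illustration, as are the labels `young`, `night`, `afternoon` and `evening`; the original cut-offs are not shown:

```r
# First four rows of three numeric columns from the overview above.
claims <- data.frame(
  age = c(48, 42, 29, 41),
  incident_hour_of_the_day = c(5, 8, 7, 5),
  auto_year = c(2004, 2007, 2007, 2014)
)

# Assumed age bands: up to 25, 26 to 40, above 40.
claims$age_bin <- cut(claims$age,
                      breaks = c(-Inf, 25, 40, Inf),
                      labels = c("young", "mature", "old_people"))

# Assumed time-of-day bands: 0-4, 5-11, 12-17, 18-23.
claims$incident_hour_of_the_day_bin <- cut(
  claims$incident_hour_of_the_day,
  breaks = c(-Inf, 4, 11, 17, Inf),
  labels = c("night", "morning", "afternoon", "evening"))

# Assumed split: cars built up to 2010 count as old.
claims$auto_year_bin <- cut(claims$auto_year,
                            breaks = c(-Inf, 2010, Inf),
                            labels = c("old_car", "new_car"))

str(claims)
```

`cut()` uses right-closed intervals by default, so a value equal to a break point falls into the lower bin.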

Delete unused columns

## Observations: 1,000
## Variables: 30
## $ policy_annual_premium        <dbl> 1406.91, 1197.22, 1413.14, 1415.7...
## $ umbrella_limit               <int> 0, 5000000, 5000000, 6000000, 600...
## $ insured_sex                  <fct> MALE, MALE, FEMALE, FEMALE, MALE,...
## $ insured_education_level      <fct> MD, MD, PhD, PhD, Associate, PhD,...
## $ insured_occupation           <fct> craft-repair, machine-op-inspct, ...
## $ insured_hobbies              <fct> sleeping, reading, board-games, b...
## $ insured_relationship         <fct> husband, other-relative, own-chil...
## $ capital.gains                <int> 53300, 0, 35100, 48900, 66000, 0,...
## $ capital.loss                 <int> 0, 0, 0, -62400, -46000, 0, -7700...
## $ incident_type                <fct> Single Vehicle Collision, Vehicle...
## $ collision_type               <fct> Side Collision, ?, Rear Collision...
## $ incident_severity            <fct> Major Damage, Minor Damage, Minor...
## $ authorities_contacted        <fct> Police, Police, Police, Police, N...
## $ incident_city                <fct> Columbus, Riverwood, Columbus, Ar...
## $ number_of_vehicles_involved  <int> 1, 1, 3, 1, 1, 3, 3, 3, 1, 1, 1, ...
## $ property_damage              <fct> YES, ?, NO, ?, NO, NO, ?, ?, NO, ...
## $ bodily_injuries              <int> 1, 0, 2, 1, 0, 0, 0, 2, 1, 2, 2, ...
## $ witnesses                    <int> 2, 0, 3, 2, 1, 2, 0, 2, 1, 1, 2, ...
## $ police_report_available      <fct> YES, ?, NO, NO, NO, NO, ?, YES, Y...
## $ total_claim_amount           <int> 71610, 5070, 34650, 63400, 6500, ...
## $ injury_claim                 <int> 6510, 780, 7700, 6340, 1300, 6410...
## $ property_claim               <int> 13020, 780, 3850, 6340, 650, 6410...
## $ vehicle_claim                <int> 52080, 3510, 23100, 50720, 4550, ...
## $ auto_make                    <fct> Saab, Mercedes, Dodge, Chevrolet,...
## $ auto_model                   <fct> 92x, E400, RAM, Tahoe, RSX, 95, P...
## $ fraud_reported               <fct> Y, Y, N, Y, N, Y, N, N, N, N, N, ...
## $ auto_year_bin                <fct> old_car, old_car, old_car, new_ca...
## $ months_as_customer_bin       <fct> good_customer, good_customer, goo...
## $ age_bin                      <fct> old_people, old_people, mature, o...
## $ incident_hour_of_the_day_bin <fct> morning, morning, morning, mornin...

Cross Validation

Split data into train, validation, and test

Check the target class proportions

## 
##       N       Y 
## 0.74375 0.25625

## 
##         N         Y 
## 0.7694444 0.2305556

Downsample the training data

## 
##   N   Y 
## 0.5 0.5
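Downsampling to a 50/50 class balance can be sketched in base R by drawing the same number of rows from each class. The data frame below is hypothetical. Its class counts (476 "N" and 164 "Y") are chosen to be consistent with the proportions shown above, and the predictor `x` is a placeholder:

```r
set.seed(100)

# Hypothetical imbalanced training set with a placeholder predictor.
train <- data.frame(
  x = rnorm(640),
  fraud_reported = factor(rep(c("N", "Y"), times = c(476, 164)))
)

# Size of the smallest class; every class is reduced to this size.
n_min <- min(table(train$fraud_reported))

# Sample n_min row indices within each class, then combine them.
idx <- unlist(lapply(split(seq_len(nrow(train)), train$fraud_reported),
                     function(rows) sample(rows, n_min)))
train_down <- train[idx, ]

prop.table(table(train_down$fraud_reported))
```

Only the training set is downsampled, so that the validation and test sets keep the original class proportions.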

Build Model

Random Forest Model 1

Fit a random forest using all variables to predict fraud.

## Random Forest 
## 
## 328 samples
##  29 predictor
##   2 classes: 'N', 'Y' 
## 
## Pre-processing: scaled (137) 
## Resampling: Cross-Validated (10 fold, repeated 4 times) 
## Summary of sample sizes: 296, 294, 295, 294, 296, 296, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.6521557  0.3041104
##    69   0.7955186  0.5912308
##   137   0.8038798  0.6079762
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 137.

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 137
## 
##         OOB estimate of  error rate: 18.6%
## Confusion matrix:
##     N   Y class.error
## N 134  30   0.1829268
## Y  31 133   0.1890244

The random forest model above selects mtry = 137 as the best value, because it gives the highest accuracy among the values tried.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 227  14
##          Y  50  69
##                                           
##                Accuracy : 0.8222          
##                  95% CI : (0.7787, 0.8603)
##     No Information Rate : 0.7694          
##     P-Value [Acc > NIR] : 0.008876        
##                                           
##                   Kappa : 0.565           
##                                           
##  Mcnemar's Test P-Value : 1.214e-05       
##                                           
##             Sensitivity : 0.8313          
##             Specificity : 0.8195          
##          Pos Pred Value : 0.5798          
##          Neg Pred Value : 0.9419          
##              Prevalence : 0.2306          
##          Detection Rate : 0.1917          
##    Detection Prevalence : 0.3306          
##       Balanced Accuracy : 0.8254          
##                                           
##        'Positive' Class : Y               
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 123   4
##          Y  23  50
##                                           
##                Accuracy : 0.865           
##                  95% CI : (0.8097, 0.9091)
##     No Information Rate : 0.73            
##     P-Value [Acc > NIR] : 3.325e-06       
##                                           
##                   Kappa : 0.6917          
##                                           
##  Mcnemar's Test P-Value : 0.000532        
##                                           
##             Sensitivity : 0.9259          
##             Specificity : 0.8425          
##          Pos Pred Value : 0.6849          
##          Neg Pred Value : 0.9685          
##              Prevalence : 0.2700          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.3650          
##       Balanced Accuracy : 0.8842          
##                                           
##        'Positive' Class : Y               
##

The first random forest model reaches an accuracy of 87% on the test data (second confusion matrix) with a recall / sensitivity of 93%. These test results are noticeably higher than on the validation data (first confusion matrix: accuracy 82%, sensitivity 83%), so performance is not consistent across data sets, which suggests that RF model 1 tends to overfit.
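
The figures quoted above can be re-derived from the four cells of each confusion matrix. The sketch below is only a worked check of the printed numbers, with "Y" (fraud) as the positive class. The helper name `class_metrics()` is an assumption for this example and is not part of caret.

```r
# Metrics for a 2x2 confusion matrix with "Y" (fraud) as the positive class.
# tp/fp/fn/tn are counts read off the printed tables (rows = prediction).
class_metrics <- function(tp, fp, fn, tn) {
  n  <- tp + fp + fn + tn
  po <- (tp + tn) / n                               # observed agreement
  pe <- ((tp + fp) * (tp + fn) +
         (fn + tn) * (fp + tn)) / n^2               # agreement expected by chance
  round(c(
    accuracy    = po,
    sensitivity = tp / (tp + fn),                   # recall on fraud cases
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp),                   # "Pos Pred Value"
    kappa       = (po - pe) / (1 - pe)
  ), 4)
}

validation <- class_metrics(tp = 69, fp = 50, fn = 14, tn = 227)
test       <- class_metrics(tp = 50, fp = 23, fn = 4,  tn = 123)
rbind(validation, test)
```

Precision stays well below recall on both sets (about 0.58 and 0.68), so a sizeable share of the claims flagged as fraud are false alarms.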

Variable Importance RF 1

## rf variable importance
## 
##   only 20 most important variables shown (out of 137)
## 
##                                   Overall
## incident_severityMinor Damage     100.000
## insured_hobbieschess               82.059
## insured_hobbiescross-fit           67.779
## incident_severityTotal Loss        67.282
## vehicle_claim                      56.051
## property_claim                     40.856
## total_claim_amount                 36.101
## policy_annual_premium              33.995
## capital.loss                       29.651
## injury_claim                       20.203
## auto_modelCivic                    19.807
## capital.gains                      14.155
## witnesses                          11.584
## umbrella_limit                      9.808
## insured_occupationpriv-house-serv   8.143
## auto_modelMDX                       8.033
## auto_model93                        7.680
## incident_cityRiverwood              7.568
## auto_makeDodge                      7.288
## auto_modelNeon                      7.187

Based on the variable importance above, we keep only a subset of the variables to build the second random forest model (model tuning): incident_severity + insured_hobbies + vehicle_claim + property_claim + total_claim_amount + policy_annual_premium + capital.loss + injury_claim + capital.gains + witnesses + umbrella_limit
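
The importance table ranks dummy columns (for example `insured_hobbieschess`), while a model formula refers to the underlying factor (`insured_hobbies`). A small sketch of how the reduced formula could be assembled in base R is shown below. The object names `selected` and `rf2_formula` are assumptions for illustration.

```r
# Predictors picked from the variable-importance ranking above
selected <- c("incident_severity", "insured_hobbies", "vehicle_claim",
              "property_claim", "total_claim_amount", "policy_annual_premium",
              "capital.loss", "injury_claim", "capital.gains", "witnesses",
              "umbrella_limit")

# reformulate() assembles `response ~ term1 + term2 + ...`
rf2_formula <- reformulate(selected, response = "fraud_reported")
rf2_formula

length(all.vars(rf2_formula)) - 1  # 11 predictors
```

These 11 predictors expand to the 31 dummy columns reported in the pre-processing line of the second model.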

Random Forest Model 2

Build the second model using only the selected variables.

## Random Forest 
## 
## 328 samples
##  11 predictor
##   2 classes: 'N', 'Y' 
## 
## Pre-processing: scaled (31) 
## Resampling: Cross-Validated (10 fold, repeated 4 times) 
## Summary of sample sizes: 296, 294, 295, 294, 296, 296, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7390458  0.4783115
##   16    0.7916611  0.5834694
##   31    0.7902852  0.5807517
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 16
## 
##         OOB estimate of  error rate: 21.34%
## Confusion matrix:
##     N   Y class.error
## N 128  36   0.2195122
## Y  34 130   0.2073171

The random forest output above selects mtry = 16 as the best value because it gives the highest cross-validated Accuracy among the candidates.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 216  14
##          Y  61  69
##                                          
##                Accuracy : 0.7917         
##                  95% CI : (0.746, 0.8325)
##     No Information Rate : 0.7694         
##     P-Value [Acc > NIR] : 0.1743         
##                                          
##                   Kappa : 0.51           
##                                          
##  Mcnemar's Test P-Value : 1.087e-07      
##                                          
##             Sensitivity : 0.8313         
##             Specificity : 0.7798         
##          Pos Pred Value : 0.5308         
##          Neg Pred Value : 0.9391         
##              Prevalence : 0.2306         
##          Detection Rate : 0.1917         
##    Detection Prevalence : 0.3611         
##       Balanced Accuracy : 0.8056         
##                                          
##        'Positive' Class : Y              
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 118   3
##          Y  28  51
##                                           
##                Accuracy : 0.845           
##                  95% CI : (0.7873, 0.8922)
##     No Information Rate : 0.73            
##     P-Value [Acc > NIR] : 8.071e-05       
##                                           
##                   Kappa : 0.6569          
##                                           
##  Mcnemar's Test P-Value : 1.629e-05       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.8082          
##          Pos Pred Value : 0.6456          
##          Neg Pred Value : 0.9752          
##              Prevalence : 0.2700          
##          Detection Rate : 0.2550          
##    Detection Prevalence : 0.3950          
##       Balanced Accuracy : 0.8763          
##                                           
##        'Positive' Class : Y               
##

The second random forest model reaches an accuracy of 85% on the test data (second confusion matrix) with a recall / sensitivity of 94%. These test results are again noticeably higher than on the validation data (first confusion matrix: accuracy 79%, sensitivity 83%), which suggests that RF model 2 also tends to overfit.

Logistic Regression 1

Next we try logistic regression for the fraud prediction model. The variables are selected again with stepwise regression using backward elimination.

## Start:  AIC=246
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_education_level + insured_occupation + insured_hobbies + 
##     insured_relationship + capital.gains + capital.loss + incident_type + 
##     collision_type + incident_severity + authorities_contacted + 
##     incident_city + number_of_vehicles_involved + property_damage + 
##     bodily_injuries + witnesses + police_report_available + total_claim_amount + 
##     injury_claim + property_claim + vehicle_claim + auto_make + 
##     auto_model + auto_year_bin + months_as_customer_bin + age_bin + 
##     incident_hour_of_the_day_bin
## 
## 
## Step:  AIC=246
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_education_level + insured_occupation + insured_hobbies + 
##     insured_relationship + capital.gains + capital.loss + incident_type + 
##     collision_type + incident_severity + authorities_contacted + 
##     incident_city + number_of_vehicles_involved + property_damage + 
##     bodily_injuries + witnesses + police_report_available + total_claim_amount + 
##     injury_claim + property_claim + vehicle_claim + auto_model + 
##     auto_year_bin + months_as_customer_bin + age_bin + incident_hour_of_the_day_bin
## 
## 
## Step:  AIC=246
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_education_level + insured_occupation + insured_hobbies + 
##     insured_relationship + capital.gains + capital.loss + incident_type + 
##     collision_type + incident_severity + authorities_contacted + 
##     incident_city + number_of_vehicles_involved + property_damage + 
##     bodily_injuries + witnesses + police_report_available + total_claim_amount + 
##     injury_claim + property_claim + auto_model + auto_year_bin + 
##     months_as_customer_bin + age_bin + incident_hour_of_the_day_bin
## 
##                                Df Deviance     AIC
## - incident_city                 6     0.00  234.00
## - insured_education_level       6     0.00  234.00
## - insured_relationship          5     0.00  236.00
## - authorities_contacted         4     0.00  238.00
## - incident_hour_of_the_day_bin  3     0.00  240.00
## - incident_type                 2     0.00  242.00
## - police_report_available       2     0.00  242.00
## - age_bin                       2     0.00  242.00
## - months_as_customer_bin        2     0.00  242.00
## - property_damage               2     0.00  242.00
## - collision_type                2     0.00  242.00
## - policy_annual_premium         1     0.00  244.00
## - property_claim                1     0.00  244.00
## - total_claim_amount            1     0.00  244.00
## - injury_claim                  1     0.00  244.00
## - capital.gains                 1     0.00  244.00
## - bodily_injuries               1     0.00  244.00
## - capital.loss                  1     0.00  244.00
## - number_of_vehicles_involved   1     0.00  244.00
## - auto_year_bin                 1     0.00  244.00
## - umbrella_limit                1     0.00  244.00
## - insured_sex                   1     0.00  244.00
## - witnesses                     1     0.00  244.00
## <none>                                0.00  246.00
## - auto_model                   38   131.01  301.01
## - incident_severity             3   186.81  426.81
## - insured_hobbies              19   255.06  463.06
## - insured_occupation           13  2811.40 3031.40
## 
## Step:  AIC=234
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_education_level + insured_occupation + insured_hobbies + 
##     insured_relationship + capital.gains + capital.loss + incident_type + 
##     collision_type + incident_severity + authorities_contacted + 
##     number_of_vehicles_involved + property_damage + bodily_injuries + 
##     witnesses + police_report_available + total_claim_amount + 
##     injury_claim + property_claim + auto_model + auto_year_bin + 
##     months_as_customer_bin + age_bin + incident_hour_of_the_day_bin
## 
##                                Df Deviance    AIC
## - insured_education_level       6     0.00 222.00
## - insured_relationship          5     0.00 224.00
## - authorities_contacted         4     0.00 226.00
## - incident_hour_of_the_day_bin  3     0.00 228.00
## - police_report_available       2     0.00 230.00
## - age_bin                       2     0.00 230.00
## - months_as_customer_bin        2     0.00 230.00
## - incident_type                 2     0.00 230.00
## - property_damage               2     0.00 230.00
## - collision_type                2     0.00 230.00
## - capital.gains                 1     0.00 232.00
## - property_claim                1     0.00 232.00
## - injury_claim                  1     0.00 232.00
## - bodily_injuries               1     0.00 232.00
## - policy_annual_premium         1     0.00 232.00
## - total_claim_amount            1     0.00 232.00
## - number_of_vehicles_involved   1     0.00 232.00
## - capital.loss                  1     0.00 232.00
## - insured_sex                   1     0.00 232.00
## - auto_year_bin                 1     0.00 232.00
## - umbrella_limit                1     0.00 232.00
## - witnesses                     1     0.00 232.00
## <none>                                0.00 234.00
## - auto_model                   38   139.45 297.45
## - insured_occupation           13   104.44 312.44
## - incident_severity             3   192.99 420.99
## - insured_hobbies              19   259.66 455.66
## 
## Step:  AIC=222
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_occupation + insured_hobbies + insured_relationship + 
##     capital.gains + capital.loss + incident_type + collision_type + 
##     incident_severity + authorities_contacted + number_of_vehicles_involved + 
##     property_damage + bodily_injuries + witnesses + police_report_available + 
##     total_claim_amount + injury_claim + property_claim + auto_model + 
##     auto_year_bin + months_as_customer_bin + age_bin + incident_hour_of_the_day_bin
## 
##                                Df Deviance     AIC
## - insured_relationship          5     0.00  212.00
## - authorities_contacted         4     0.00  214.00
## - incident_hour_of_the_day_bin  3     0.00  216.00
## - incident_type                 2     0.00  218.00
## - police_report_available       2     0.00  218.00
## - age_bin                       2     0.00  218.00
## - property_damage               2     0.00  218.00
## - injury_claim                  1     0.00  220.00
## - bodily_injuries               1     0.00  220.00
## - total_claim_amount            1     0.00  220.00
## - number_of_vehicles_involved   1     0.00  220.00
## - property_claim                1     0.00  220.00
## - capital.gains                 1     0.00  220.00
## - policy_annual_premium         1     0.00  220.00
## - insured_sex                   1     0.00  220.00
## <none>                                0.00  222.00
## - auto_model                   38   149.28  295.28
## - insured_occupation           13   135.93  331.93
## - incident_severity             3   205.20  421.20
## - insured_hobbies              19   266.16  450.16
## - months_as_customer_bin        2  1946.36 2164.36
## - umbrella_limit                1  2234.71 2454.71
## - auto_year_bin                 1  2450.97 2670.97
## - capital.loss                  1  2523.06 2743.06
## - witnesses                     1  2523.06 2743.06
## - collision_type                2  2667.23 2885.23
## 
## Step:  AIC=212
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_occupation + insured_hobbies + capital.gains + capital.loss + 
##     incident_type + collision_type + incident_severity + authorities_contacted + 
##     number_of_vehicles_involved + property_damage + bodily_injuries + 
##     witnesses + police_report_available + total_claim_amount + 
##     injury_claim + property_claim + auto_model + auto_year_bin + 
##     months_as_customer_bin + age_bin + incident_hour_of_the_day_bin
## 
##                                Df Deviance     AIC
## - authorities_contacted         4     0.00  204.00
## - incident_hour_of_the_day_bin  3     0.00  206.00
## - incident_type                 2     0.00  208.00
## - police_report_available       2     0.00  208.00
## - age_bin                       2     0.00  208.00
## - months_as_customer_bin        2     0.00  208.00
## - bodily_injuries               1     0.00  210.00
## - total_claim_amount            1     0.00  210.00
## - injury_claim                  1     0.00  210.00
## - policy_annual_premium         1     0.00  210.00
## - number_of_vehicles_involved   1     0.00  210.00
## <none>                                0.00  212.00
## - auto_year_bin                 1    77.26  287.26
## - auto_model                   38   159.59  295.59
## - collision_type                2    96.13  304.13
## - witnesses                     1    98.48  308.48
## - insured_occupation           13   140.00  326.00
## - incident_severity             3   218.68  424.68
## - insured_hobbies              19   275.22  449.22
## - capital.loss                  1  1946.36 2156.36
## - property_claim                1  2162.62 2372.62
## - insured_sex                   1  2378.88 2588.88
## - property_damage               2  2450.97 2658.97
## - capital.gains                 1  2450.97 2660.97
## - umbrella_limit                1  2883.49 3093.49
## 
## Step:  AIC=204
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_occupation + insured_hobbies + capital.gains + capital.loss + 
##     incident_type + collision_type + incident_severity + number_of_vehicles_involved + 
##     property_damage + bodily_injuries + witnesses + police_report_available + 
##     total_claim_amount + injury_claim + property_claim + auto_model + 
##     auto_year_bin + months_as_customer_bin + age_bin + incident_hour_of_the_day_bin
## 
##                                Df Deviance     AIC
## - incident_hour_of_the_day_bin  3     0.00  198.00
## - police_report_available       2     0.00  200.00
## - age_bin                       2     0.00  200.00
## - months_as_customer_bin        2     0.00  200.00
## - incident_type                 2     0.00  200.00
## - bodily_injuries               1     0.00  202.00
## - number_of_vehicles_involved   1     0.00  202.00
## - capital.gains                 1     0.00  202.00
## - policy_annual_premium         1     0.00  202.00
## - property_claim                1     0.00  202.00
## - insured_sex                   1     0.00  202.00
## <none>                                0.00  204.00
## - auto_model                   38   168.58  296.58
## - property_damage               2    99.30  299.30
## - witnesses                     1   110.56  312.56
## - umbrella_limit                1   113.76  315.76
## - auto_year_bin                 1   114.95  316.95
## - collision_type                2   130.29  330.29
## - insured_occupation           13   156.07  334.07
## - incident_severity             3   235.91  433.91
## - insured_hobbies              19   279.41  445.41
## - capital.loss                  1  2378.88 2580.88
## - injury_claim                  1  2450.97 2652.97
## - total_claim_amount            1  2955.58 3157.58
## 
## Step:  AIC=198
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_occupation + insured_hobbies + capital.gains + capital.loss + 
##     incident_type + collision_type + incident_severity + number_of_vehicles_involved + 
##     property_damage + bodily_injuries + witnesses + police_report_available + 
##     total_claim_amount + injury_claim + property_claim + auto_model + 
##     auto_year_bin + months_as_customer_bin + age_bin
## 
##                               Df Deviance     AIC
## - number_of_vehicles_involved  1      0.0   196.0
## - bodily_injuries              1      0.0   196.0
## - injury_claim                 1      0.0   196.0
## <none>                                0.0   198.0
## - capital.loss                 1     90.6   286.6
## - auto_model                  38    170.6   292.6
## - property_damage              2    100.5   294.5
## - insured_sex                  1    102.7   298.7
## - witnesses                    1    112.1   308.1
## - umbrella_limit               1    115.0   311.0
## - auto_year_bin                1    117.6   313.6
## - collision_type               2    130.9   324.9
## - insured_occupation          13    158.7   330.7
## - incident_severity            3    245.7   437.7
## - insured_hobbies             19    280.2   440.2
## - months_as_customer_bin       2   2451.0  2645.0
## - total_claim_amount           1   2451.0  2647.0
## - incident_type                2   2523.1  2717.1
## - police_report_available      2   2523.1  2717.1
## - age_bin                      2   3027.7  3221.7
## - property_claim               1   3027.7  3223.7
## - policy_annual_premium        1   3243.9  3439.9
## - capital.gains                1  20977.4 21173.4
## 
## Step:  AIC=196
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_occupation + insured_hobbies + capital.gains + capital.loss + 
##     incident_type + collision_type + incident_severity + property_damage + 
##     bodily_injuries + witnesses + police_report_available + total_claim_amount + 
##     injury_claim + property_claim + auto_model + auto_year_bin + 
##     months_as_customer_bin + age_bin
## 
##                           Df Deviance     AIC
## - police_report_available  2     0.00  192.00
## - bodily_injuries          1     0.00  194.00
## - injury_claim             1     0.00  194.00
## - capital.gains            1     0.00  194.00
## <none>                           0.00  196.00
## - incident_type            2    90.86  282.86
## - capital.loss             1    90.83  284.83
## - auto_model              38   170.58  290.58
## - property_damage          2   100.65  292.65
## - insured_sex              1   104.25  298.25
## - witnesses                1   113.60  307.60
## - umbrella_limit           1   115.05  309.05
## - auto_year_bin            1   118.79  312.79
## - collision_type           2   132.25  324.25
## - insured_occupation      13   158.75  328.75
## - incident_severity        3   245.69  435.69
## - insured_hobbies         19   280.98  438.98
## - policy_annual_premium    1  1874.27 2068.27
## - months_as_customer_bin   2  2090.53 2282.53
## - total_claim_amount       1  2090.53 2284.53
## - age_bin                  2  2378.88 2570.88
## - property_claim           1  2523.06 2717.06
## 
## Step:  AIC=192
## fraud_reported ~ policy_annual_premium + umbrella_limit + insured_sex + 
##     insured_occupation + insured_hobbies + capital.gains + capital.loss + 
##     incident_type + collision_type + incident_severity + property_damage + 
##     bodily_injuries + witnesses + total_claim_amount + injury_claim + 
##     property_claim + auto_model + auto_year_bin + months_as_customer_bin + 
##     age_bin
## 
##                          Df Deviance    AIC
## <none>                          0.00  192.0
## - months_as_customer_bin  2    72.36  260.4
## - incident_type           2    92.20  280.2
## - capital.loss            1    91.08  281.1
## - auto_model             38   171.06  287.1
## - property_damage         2   101.45  289.4
## - insured_sex             1   104.26  294.3
## - witnesses               1   114.84  304.8
## - umbrella_limit          1   116.95  306.9
## - auto_year_bin           1   121.45  311.5
## - collision_type          2   133.46  321.5
## - insured_occupation     13   158.93  324.9
## - insured_hobbies        19   284.76  438.8
## - incident_severity       3   252.82  438.8
## - age_bin                 2  2090.53 2278.5
## - total_claim_amount      1  2162.62 2352.6
## - policy_annual_premium   1  2306.79 2496.8
## - capital.gains           1  2450.97 2641.0
## - bodily_injuries         1  2523.06 2713.1
## - injury_claim            1  2739.32 2929.3
## - property_claim          1  3027.67 3217.7
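
The trace above is the kind of log that `step()` prints during backward elimination. A minimal sketch on synthetic data shows the mechanics. The data frame `toy`, its columns, and the seed are assumptions for this example and have nothing to do with the claim data.

```r
set.seed(42)
n <- 300

# Synthetic data: only x1 and x2 drive the outcome; noise1/noise2 do not
toy <- data.frame(
  x1     = rnorm(n),
  x2     = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
p_fraud   <- plogis(1.5 * toy$x1 - 1.0 * toy$x2)
toy$fraud <- factor(ifelse(runif(n) < p_fraud, "Y", "N"), levels = c("N", "Y"))

full_model <- glm(fraud ~ ., data = toy, family = "binomial")

# Backward elimination: starting from the full model, repeatedly drop the
# term whose removal lowers AIC the most, and stop when no removal helps
back_model <- step(full_model, direction = "backward", trace = 0)

formula(back_model)
c(full = AIC(full_model), backward = AIC(back_model))
```

Setting `trace = 1` instead would print one table per step, with one row per candidate term to drop, in the same layout as the log above.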

## 
## Call:  glm(formula = fraud_reported ~ policy_annual_premium + umbrella_limit + 
##     insured_sex + insured_occupation + insured_hobbies + capital.gains + 
##     capital.loss + incident_type + collision_type + incident_severity + 
##     property_damage + bodily_injuries + witnesses + total_claim_amount + 
##     injury_claim + property_claim + auto_model + auto_year_bin + 
##     months_as_customer_bin + age_bin, family = "binomial", data = claim.train)
## 
## Coefficients:
##                           (Intercept)  
##                            -3.545e+03  
##                 policy_annual_premium  
##                             8.513e-01  
##                        umbrella_limit  
##                             3.102e-04  
##                       insured_sexMALE  
##                            -1.215e+03  
##        insured_occupationarmed-forces  
##                            -4.544e+02  
##        insured_occupationcraft-repair  
##                             1.039e+03  
##     insured_occupationexec-managerial  
##                             3.263e+03  
##     insured_occupationfarming-fishing  
##                            -1.019e+03  
##   insured_occupationhandlers-cleaners  
##                            -3.145e+03  
##   insured_occupationmachine-op-inspct  
##                            -1.470e+03  
##       insured_occupationother-service  
##                            -1.203e+03  
##     insured_occupationpriv-house-serv  
##                            -2.154e+03  
##      insured_occupationprof-specialty  
##                            -1.027e+03  
##     insured_occupationprotective-serv  
##                            -1.507e+03  
##               insured_occupationsales  
##                            -1.460e+03  
##        insured_occupationtech-support  
##                            -1.229e+03  
##    insured_occupationtransport-moving  
##                            -1.148e+03  
##             insured_hobbiesbasketball  
##                            -1.135e+03  
##            insured_hobbiesboard-games  
##                             1.646e+03  
##         insured_hobbiesbungie-jumping  
##                            -1.809e+03  
##                insured_hobbiescamping  
##                            -3.500e+02  
##                  insured_hobbieschess  
##                             6.751e+03  
##              insured_hobbiescross-fit  
##                             6.475e+03  
##                insured_hobbiesdancing  
##                            -2.480e+03  
##               insured_hobbiesexercise  
##                            -8.326e+02  
##                   insured_hobbiesgolf  
##                            -5.297e+02  
##                 insured_hobbieshiking  
##                             2.054e+03  
##               insured_hobbieskayaking  
##                            -2.697e+03  
##                 insured_hobbiesmovies  
##                             1.580e+03  
##              insured_hobbiespaintball  
##                             1.039e+03  
##                   insured_hobbiespolo  
##                            -2.691e+02  
##                insured_hobbiesreading  
##                             2.373e+03  
##              insured_hobbiesskydiving  
##                            -3.091e+03  
##               insured_hobbiessleeping  
##                            -1.835e+03  
##            insured_hobbiesvideo-games  
##                             1.813e+03  
##               insured_hobbiesyachting  
##                            -5.212e+00  
##                         capital.gains  
##                             2.509e-03  
##                          capital.loss  
##                             1.293e-02  
##               incident_typeParked Car  
##                             1.635e+03  
## incident_typeSingle Vehicle Collision  
##                             6.926e+02  
##            incident_typeVehicle Theft  
##                             2.184e+03  
##         collision_typeFront Collision  
##                             1.284e+03  
##          collision_typeRear Collision  
##                             2.529e+03  
##          collision_typeSide Collision  
##                                    NA  
##         incident_severityMinor Damage  
##                            -4.299e+03  
##           incident_severityTotal Loss  
##                            -3.883e+03  
##       incident_severityTrivial Damage  
##                            -9.940e+03  
##                     property_damageNO  
##                            -7.894e+02  
##                    property_damageYES  
##                             6.407e+02  
##                       bodily_injuries  
##                             7.668e+01  
##                             witnesses  
##                             7.268e+02  
##                    total_claim_amount  
##                            -5.099e-03  
##                          injury_claim  
##                            -2.451e-02  
##                        property_claim  
##                             6.792e-02  
##                         auto_model92x  
##                            -1.100e+02  
##                          auto_model93  
##                            -4.014e+01  
##                          auto_model95  
##                             4.162e+02  
##                          auto_modelA3  
##                             3.904e+03  
##                          auto_modelA5  
##                             9.744e+02  
##                      auto_modelAccord  
##                             9.046e+02  
##                        auto_modelC300  
##                            -3.210e+03  
##                       auto_modelCamry  
##                            -5.680e+02  
##                       auto_modelCivic  
##                             2.063e+03  
##                     auto_modelCorolla  
##                            -5.758e+02  
##                         auto_modelCRV  
##                            -2.243e+03  
##                        auto_modelE400  
##                             2.069e+03  
##                      auto_modelEscape  
##                            -1.682e+03  
##                        auto_modelF150  
##                             5.173e+03  
##                   auto_modelForrestor  
##                             9.296e+02  
##                      auto_modelFusion  
##                             3.374e+03  
##              auto_modelGrand Cherokee  
##                            -8.424e+02  
##                  auto_modelHighlander  
##                             9.666e+02  
##                     auto_modelImpreza  
##                             1.081e+03  
##                       auto_modelJetta  
##                             2.488e+02  
##                      auto_modelLegacy  
##                            -9.914e+01  
##                          auto_modelM5  
##                             3.988e+03  
##                      auto_modelMalibu  
##                             2.949e+03  
##                      auto_modelMaxima  
##                             2.170e+03  
##                         auto_modelMDX  
##                             1.536e+03  
##                       auto_modelML350  
##                             1.123e+03  
##                        auto_modelNeon  
##                            -6.094e+01  
##                      auto_modelPassat  
##                             3.198e+03  
##                  auto_modelPathfinder  
##                             3.207e+03  
##                         auto_modelRAM  
##                            -2.583e+02  
##                         auto_modelRSX  
##                             1.953e+03  
##                   auto_modelSilverado  
##                             8.726e+02  
##                       auto_modelTahoe  
##                            -8.151e+02  
##                          auto_modelTL  
##                             1.436e+03  
##                      auto_modelUltima  
##                             1.804e+03  
##                    auto_modelWrangler  
##                            -1.697e+03  
##                          auto_modelX5  
##                            -1.518e+03  
##                          auto_modelX6  
##                             2.816e+03  
##                  auto_year_binold_car  
##                             2.717e+03  
##  months_as_customer_binloyal_customer  
##                            -5.498e+02  
##    months_as_customer_binnew_customer  
##                             4.168e+02  
##                     age_binold_people  
##                             1.970e+02  
##                   age_binyoung_people  
##                            -1.160e+03  
## 
## Degrees of Freedom: 327 Total (i.e. Null);  232 Residual
## Null Deviance:       454.7 
## Residual Deviance: 9.791e-06     AIC: 192

## Generalized Linear Model 
## 
## 328 samples
##  20 predictor
##   2 classes: 'N', 'Y' 
## 
## Pre-processing: scaled (96) 
## Resampling: Cross-Validated (10 fold, repeated 4 times) 
## Summary of sample sizes: 296, 294, 295, 294, 296, 296, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7091564  0.4185863

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 197  19
##          Y  80  64
##                                           
##                Accuracy : 0.725           
##                  95% CI : (0.6758, 0.7705)
##     No Information Rate : 0.7694          
##     P-Value [Acc > NIR] : 0.9789          
##                                           
##                   Kappa : 0.3836          
##                                           
##  Mcnemar's Test P-Value : 1.637e-09       
##                                           
##             Sensitivity : 0.7711          
##             Specificity : 0.7112          
##          Pos Pred Value : 0.4444          
##          Neg Pred Value : 0.9120          
##              Prevalence : 0.2306          
##          Detection Rate : 0.1778          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7411          
##                                           
##        'Positive' Class : Y               
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 116   3
##          Y  30  51
##                                           
##                Accuracy : 0.835           
##                  95% CI : (0.7762, 0.8836)
##     No Information Rate : 0.73            
##     P-Value [Acc > NIR] : 0.0003184       
##                                           
##                   Kappa : 0.6384          
##                                           
##  Mcnemar's Test P-Value : 6.011e-06       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.7945          
##          Pos Pred Value : 0.6296          
##          Neg Pred Value : 0.9748          
##              Prevalence : 0.2700          
##          Detection Rate : 0.2550          
##    Detection Prevalence : 0.4050          
##       Balanced Accuracy : 0.8695          
##                                           
##        'Positive' Class : Y               
##

The first logistic regression model reaches 83.5% accuracy on the test data with a recall (sensitivity) of 94%. These test results are much higher than those on the validation data (72.5% accuracy, 77% sensitivity), and the near-zero residual deviance (9.791e-06) shows that the model fits the training data almost perfectly, so the first logistic regression model tends to overfit.
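The statistics reported for the first logistic regression model can be recomputed by hand from the four cell counts of each confusion matrix. The sketch below is a minimal illustration in base R. The helper name `cm_metrics` is made up for this example, and it takes the first matrix (360 rows) as the validation data and the second (200 rows) as the test data, as the accuracy figures quoted above imply.

```r
# Counts copied from the two confusion matrices of logistic regression 1.
# Rows are predictions, columns are the reference; "Y" (fraud) is the positive class.
# `cm_metrics` is a hypothetical helper, not part of caret.
cm_metrics <- function(tn, fn, fp, tp) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),  # recall: share of actual fraud that is caught
    specificity = tn / (tn + fp),  # share of non-fraud correctly left alone
    ppv         = tp / (tp + fp)   # precision: share of fraud flags that are correct
  )
}

validation <- cm_metrics(tn = 197, fn = 19, fp = 80, tp = 64)
test       <- cm_metrics(tn = 116, fn = 3,  fp = 30, tp = 51)

# The `gap` row shows how far the test results drift from the validation results.
round(rbind(validation, test, gap = test - validation), 4)
```

A large `gap` row, as seen here for sensitivity, is the numeric form of the instability described above.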

## glm variable importance
## 
##   only 20 most important variables shown (out of 95)
## 
##                                       Overall
## `incident_severityMinor Damage`        100.00
## `incident_severityTotal Loss`           98.74
## `insured_occupationhandlers-cleaners`   97.50
## `collision_typeRear Collision`          92.76
## `insured_occupationtransport-moving`    92.57
## insured_sexMALE                         92.07
## auto_year_binold_car                    91.82
## witnesses                               90.40
## umbrella_limit                          89.16
## `insured_occupationexec-managerial`     86.52
## `insured_occupationcraft-repair`        80.37
## `insured_occupationtech-support`        77.15
## `insured_occupationfarming-fishing`     72.28
## `insured_occupationprof-specialty`      71.87
## `collision_typeFront Collision`         71.69
## months_as_customer_binnew_customer      71.50
## capital.loss                            70.00
## `incident_severityTrivial Damage`       66.05
## `incident_typeParked Car`               65.78
## property_damageYES                      64.47

Logistic Regression 2

The variables from random forest model 2 are used for the second logistic regression model.
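The model summaries in this section report scaling as pre-processing and 10-fold cross-validation repeated 4 times. A setup along those lines might look like the sketch below. The original training code is not shown, so everything here is an assumption: `train_df` is simulated, `fraud_reported` is an assumed name for the response column, and only two of the predictors are used to keep the example short.

```r
# Sketch only: `train_df`, its simulated values and the response name
# `fraud_reported` are placeholders, not objects from the original analysis.
set.seed(100)
n <- 328
train_df <- data.frame(
  witnesses      = sample(0:3, n, replace = TRUE),
  property_claim = runif(n, 0, 20000)
)

# Simulate a binary outcome loosely related to one predictor.
p <- plogis(-1 + 0.4 * train_df$witnesses)
train_df$fraud_reported <- factor(ifelse(runif(n) < p, "Y", "N"),
                                  levels = c("N", "Y"))

# Only run the caret part when the package is installed.
if (requireNamespace("caret", quietly = TRUE)) {
  # 10-fold cross-validation, repeated 4 times, as in the printed summary.
  ctrl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 4)

  fit <- caret::train(
    fraud_reported ~ witnesses + property_claim,
    data       = train_df,
    method     = "glm",      # logistic regression for a two-level factor outcome
    preProcess = "scale",    # matches "Pre-processing: scaled"
    trControl  = ctrl
  )
  print(fit)
}
```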

## Generalized Linear Model 
## 
## 328 samples
##  11 predictor
##   2 classes: 'N', 'Y' 
## 
## Pre-processing: scaled (31) 
## Resampling: Cross-Validated (10 fold, repeated 4 times) 
## Summary of sample sizes: 296, 294, 295, 294, 296, 296, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8142825  0.6286741

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 234  10
##          Y  43  73
##                                           
##                Accuracy : 0.8528          
##                  95% CI : (0.8119, 0.8877)
##     No Information Rate : 0.7694          
##     P-Value [Acc > NIR] : 5.530e-05       
##                                           
##                   Kappa : 0.6358          
##                                           
##  Mcnemar's Test P-Value : 1.105e-05       
##                                           
##             Sensitivity : 0.8795          
##             Specificity : 0.8448          
##          Pos Pred Value : 0.6293          
##          Neg Pred Value : 0.9590          
##              Prevalence : 0.2306          
##          Detection Rate : 0.2028          
##    Detection Prevalence : 0.3222          
##       Balanced Accuracy : 0.8621          
##                                           
##        'Positive' Class : Y               
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 124   6
##          Y  22  48
##                                           
##                Accuracy : 0.86            
##                  95% CI : (0.8041, 0.9049)
##     No Information Rate : 0.73            
##     P-Value [Acc > NIR] : 7.825e-06       
##                                           
##                   Kappa : 0.6752          
##                                           
##  Mcnemar's Test P-Value : 0.004586        
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.8493          
##          Pos Pred Value : 0.6857          
##          Neg Pred Value : 0.9538          
##              Prevalence : 0.2700          
##          Detection Rate : 0.2400          
##    Detection Prevalence : 0.3500          
##       Balanced Accuracy : 0.8691          
##                                           
##        'Positive' Class : Y               
##

The second logistic regression model reaches 86% accuracy on the test data with a recall (sensitivity) of 88.9%. These test results are close to those on the validation data (85.3% accuracy, 88% sensitivity), so the second logistic regression model is a good fit.

## glm variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                                   Overall
## `incident_severityMinor Damage`   100.000
## `incident_severityTotal Loss`      85.944
## `incident_severityTrivial Damage`  55.323
## insured_hobbieschess               38.565
## insured_hobbieskayaking            30.657
## insured_hobbiesskydiving           30.103
## `insured_hobbiesbungie-jumping`    27.453
## insured_hobbiescamping             26.955
## witnesses                          26.355
## insured_hobbiessleeping            25.763
## insured_hobbiesexercise            25.075
## insured_hobbiesbasketball          23.258
## property_claim                     20.316
## insured_hobbiesmovies              18.447
## insured_hobbiesdancing             17.369
## total_claim_amount                 17.066
## vehicle_claim                      15.109
## insured_hobbiesgolf                14.815
## `insured_hobbiesvideo-games`        9.980
## insured_hobbieshiking               9.789
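For readers wondering where the 0 to 100 scores in the importance tables come from: for linear models, caret's documented default is to rank terms by the absolute value of their test statistic and then rescale the result. The sketch below reproduces that idea in base R. It uses the built-in `infert` data as a stand-in for the claim data, so the numbers have nothing to do with the fraud model.

```r
# Toy data: `infert` ships with base R and stands in for the claim data.
fit <- glm(case ~ age + parity + induced + spontaneous,
           data = infert, family = binomial)

# The "z value" column holds estimate / standard error for each term.
z <- abs(coef(summary(fit))[-1, "z value"])  # [-1] drops the intercept row

# Rescale so the weakest term scores 0 and the strongest scores 100.
importance <- (z - min(z)) / (max(z) - min(z)) * 100
sort(importance, decreasing = TRUE)
```

Because each dummy-coded factor level is a separate term, a factor with many levels (such as `insured_hobbies`) occupies many rows of such a table.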

Logistic Regression 3 (using probability)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 197  19
##          Y  80  64
##                                           
##                Accuracy : 0.725           
##                  95% CI : (0.6758, 0.7705)
##     No Information Rate : 0.7694          
##     P-Value [Acc > NIR] : 0.9789          
##                                           
##                   Kappa : 0.3836          
##                                           
##  Mcnemar's Test P-Value : 1.637e-09       
##                                           
##             Sensitivity : 0.7711          
##             Specificity : 0.7112          
##          Pos Pred Value : 0.4444          
##          Neg Pred Value : 0.9120          
##              Prevalence : 0.2306          
##          Detection Rate : 0.1778          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7411          
##                                           
##        'Positive' Class : Y               
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 116   3
##          Y  30  51
##                                           
##                Accuracy : 0.835           
##                  95% CI : (0.7762, 0.8836)
##     No Information Rate : 0.73            
##     P-Value [Acc > NIR] : 0.0003184       
##                                           
##                   Kappa : 0.6384          
##                                           
##  Mcnemar's Test P-Value : 6.011e-06       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.7945          
##          Pos Pred Value : 0.6296          
##          Neg Pred Value : 0.9748          
##              Prevalence : 0.2700          
##          Detection Rate : 0.2550          
##    Detection Prevalence : 0.4050          
##       Balanced Accuracy : 0.8695          
##                                           
##        'Positive' Class : Y               
##

The results above are identical to those of the first logistic regression model, so in terms of the bias-variance trade-off this model also tends to overfit.
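Classifying "using probability" usually means predicting a fraud probability for each claim and then cutting it at a threshold. The cutoff used here is not stated. That the confusion matrices match those of the first logistic regression model exactly is what one would expect from a cutoff of 0.5, but that is an inference. The sketch below shows the mechanics on the built-in `infert` data, with a made-up helper `classify`.

```r
# Toy data: `infert` ships with base R and stands in for the claim data.
fit <- glm(case ~ age + parity + induced + spontaneous,
           data = infert, family = binomial)

# type = "response" returns the predicted probability of the positive class.
prob <- predict(fit, newdata = infert, type = "response")

# `classify` is a hypothetical helper: label "Y" at or above the cutoff.
classify <- function(prob, cutoff = 0.5) {
  factor(ifelse(prob >= cutoff, "Y", "N"), levels = c("N", "Y"))
}

actual <- factor(ifelse(infert$case == 1, "Y", "N"), levels = c("N", "Y"))

# Lowering the cutoff flags more rows as "Y": recall can only rise,
# while precision typically falls.
table(Prediction = classify(prob, 0.5), Reference = actual)
table(Prediction = classify(prob, 0.3), Reference = actual)
```

In a fraud setting, a cutoff below 0.5 trades more false alarms for fewer missed fraud cases, so the choice depends on the relative cost of the two errors.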

Logistic Regression 4 (using probability)

This model uses the variables from random forest model 2 and logistic regression 2.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 238  14
##          Y  39  69
##                                           
##                Accuracy : 0.8528          
##                  95% CI : (0.8119, 0.8877)
##     No Information Rate : 0.7694          
##     P-Value [Acc > NIR] : 5.53e-05        
##                                           
##                   Kappa : 0.6246          
##                                           
##  Mcnemar's Test P-Value : 0.0009784       
##                                           
##             Sensitivity : 0.8313          
##             Specificity : 0.8592          
##          Pos Pred Value : 0.6389          
##          Neg Pred Value : 0.9444          
##              Prevalence : 0.2306          
##          Detection Rate : 0.1917          
##    Detection Prevalence : 0.3000          
##       Balanced Accuracy : 0.8453          
##                                           
##        'Positive' Class : Y               
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 126   8
##          Y  20  46
##                                           
##                Accuracy : 0.86            
##                  95% CI : (0.8041, 0.9049)
##     No Information Rate : 0.73            
##     P-Value [Acc > NIR] : 7.825e-06       
##                                           
##                   Kappa : 0.6681          
##                                           
##  Mcnemar's Test P-Value : 0.03764         
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.8630          
##          Pos Pred Value : 0.6970          
##          Neg Pred Value : 0.9403          
##              Prevalence : 0.2700          
##          Detection Rate : 0.2300          
##    Detection Prevalence : 0.3300          
##       Balanced Accuracy : 0.8574          
##                                           
##        'Positive' Class : Y               
##

The fourth logistic regression model reaches 86% accuracy on the test data with a recall (sensitivity) of 85%. These test results are close to those on the validation data (85.3% accuracy, 83% sensitivity), so the fourth logistic regression model is a good fit.

The variables best suited to classifying customers with indications of fraud are:

incident_severity
insured_hobbies
vehicle_claim
property_claim
total_claim_amount
policy_annual_premium
capital.loss
injury_claim
capital.gains
witnesses
umbrella_limit
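The eleven variables above can be turned into a model formula programmatically, which avoids retyping them for each model. This is a small base R sketch. The response name `fraud_reported` is an assumption, since the response column is not named in this section.

```r
# The selected predictors, copied from the list above.
selected <- c(
  "incident_severity", "insured_hobbies", "vehicle_claim", "property_claim",
  "total_claim_amount", "policy_annual_premium", "capital.loss", "injury_claim",
  "capital.gains", "witnesses", "umbrella_limit"
)

# `fraud_reported` is an assumed name for the response column.
fraud_formula <- reformulate(selected, response = "fraud_reported")
fraud_formula
```

The resulting formula object can be passed to `glm()` or to a caret `train()` call in place of a hand-written formula.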