Usted trabaja en el área de estadística de una aseguradora y se le ha solicitado su ayuda para analizar los montos reclamados y evaluar si el valor de las primas es el correcto para los incidentes mayores y de pérdida total. Dada su experiencia en el ramo usted sabe que puede echar mano de los siguientes análisis:
Características del asegurado asociadas al monto de reclamación
Características del incidente y su relación con el monto reclamado
Segmentación de asegurados
Suficiencia de prima
Primero realizamos una limpieza de la base
Quitamos las variables que no son relevantes para el análisis como el número de póliza y la fecha
La variable collision_type tiene como factor el signo “?” también es necesario retirarlo como factor
Finalmente retiramos los NA de la base
library(dplyr)
library(MASS)
datos <- read.csv(file = "C:/Users/Rafae/Desktop/CIENCIA_DE_DATOS/insurance_claims.csv")
str(datos)
## 'data.frame': 1000 obs. of 22 variables:
## $ months_as_customer : int 328 228 134 256 228 256 137 165 27 212 ...
## $ age : int 48 42 29 41 44 39 34 37 33 42 ...
## $ policy_number : int 521585 342868 687698 227811 367455 104594 413978 429027 485665 636550 ...
## $ policy_state : Factor w/ 3 levels "IL","IN","OH": 3 2 3 1 1 3 2 1 1 1 ...
## $ policy_deductable : int 1000 2000 2000 2000 1000 1000 1000 1000 500 500 ...
## $ policy_annual_premium : num 1407 1197 1413 1416 1584 ...
## $ insured_sex : Factor w/ 2 levels "FEMALE","MALE": 2 2 1 1 2 1 2 2 1 2 ...
## $ insured_education_level : Factor w/ 7 levels "Associate","College",..: 6 6 7 7 1 7 7 1 7 7 ...
## $ insured_occupation : Factor w/ 14 levels "adm-clerical",..: 3 7 12 2 12 13 10 13 8 9 ...
## $ insured_hobbies : Factor w/ 20 levels "base-jumping",..: 18 16 3 3 3 4 3 1 10 5 ...
## $ insured_relationship : Factor w/ 6 levels "husband","not-in-family",..: 1 3 4 5 5 5 1 5 4 6 ...
## $ incident_date : Factor w/ 60 levels "01/01/2015","01/02/2015",..: 50 42 45 20 35 4 26 55 59 10 ...
## $ incident_type : Factor w/ 4 levels "Multi-vehicle Collision",..: 3 4 1 3 4 1 1 1 3 3 ...
## $ collision_type : Factor w/ 4 levels "?","Front Collision",..: 4 1 3 2 1 3 2 2 2 3 ...
## $ incident_severity : Factor w/ 4 levels "Major Damage",..: 1 2 2 1 2 1 2 3 3 3 ...
## $ authorities_contacted : Factor w/ 5 levels "Ambulance","Fire",..: 5 5 5 5 3 2 5 5 5 4 ...
## $ incident_hour_of_the_day : int 5 8 7 5 20 19 0 23 21 14 ...
## $ number_of_vehicles_involved: int 1 1 3 1 1 3 3 3 1 1 ...
## $ auto_make : Factor w/ 14 levels "Accura","Audi",..: 11 9 5 4 1 11 10 2 13 11 ...
## $ auto_model : Factor w/ 39 levels "3 Series","92x",..: 2 13 31 34 32 4 30 6 9 2 ...
## $ auto_year : int 2004 2007 2007 2014 2009 2003 2012 2015 2012 1996 ...
## $ total_claim_amount : int 71610 5070 34650 63400 6500 64100 78650 51590 27700 42300 ...
datos_2 <- datos %>% dplyr::select(-policy_number,-incident_date)
datos_2[datos_2 == "?"] <- NA
datos_2$collision_type <- as.integer(datos_2$collision_type)
datos_2$collision_type <- as.factor(datos_2$collision_type)
nrow(datos_2[is.na(datos_2$collision_type),])
## [1] 178
datos_2 <- datos_2[!(is.na(datos_2$collision_type)),]
Veamos que en esta versión de la tabla hemos quitado los renglones que tienen NA, por lo que solo nos quedan 822 observaciones, y 20 variables a analizar
str(datos_2)
## 'data.frame': 822 obs. of 20 variables:
## $ months_as_customer : int 328 134 256 256 137 165 27 212 235 447 ...
## $ age : int 48 29 41 39 34 37 33 42 42 61 ...
## $ policy_state : Factor w/ 3 levels "IL","IN","OH": 3 3 1 3 2 1 1 1 3 3 ...
## $ policy_deductable : int 1000 2000 2000 1000 1000 1000 500 500 500 2000 ...
## $ policy_annual_premium : num 1407 1413 1416 1351 1333 ...
## $ insured_sex : Factor w/ 2 levels "FEMALE","MALE": 2 1 1 1 2 2 1 2 1 1 ...
## $ insured_education_level : Factor w/ 7 levels "Associate","College",..: 6 7 7 7 7 1 7 7 5 3 ...
## $ insured_occupation : Factor w/ 14 levels "adm-clerical",..: 3 12 2 13 10 13 8 9 4 4 ...
## $ insured_hobbies : Factor w/ 20 levels "base-jumping",..: 18 3 3 4 3 1 10 5 8 17 ...
## $ insured_relationship : Factor w/ 6 levels "husband","not-in-family",..: 1 4 5 5 1 5 4 6 3 3 ...
## $ incident_type : Factor w/ 4 levels "Multi-vehicle Collision",..: 3 1 3 1 1 1 3 3 3 1 ...
## $ collision_type : Factor w/ 3 levels "2","3","4": 3 2 1 2 1 1 1 2 1 1 ...
## $ incident_severity : Factor w/ 4 levels "Major Damage",..: 1 2 1 1 2 3 3 3 3 1 ...
## $ authorities_contacted : Factor w/ 5 levels "Ambulance","Fire",..: 5 5 5 2 5 5 5 4 5 2 ...
## $ incident_hour_of_the_day : int 5 7 5 19 0 23 21 14 22 21 ...
## $ number_of_vehicles_involved: int 1 3 1 3 3 3 1 1 1 3 ...
## $ auto_make : Factor w/ 14 levels "Accura","Audi",..: 11 5 4 11 10 2 13 11 6 2 ...
## $ auto_model : Factor w/ 39 levels "3 Series","92x",..: 2 31 34 4 30 6 9 2 15 5 ...
## $ auto_year : int 2004 2007 2014 2003 2012 2015 2012 1996 2002 2006 ...
## $ total_claim_amount : int 71610 34650 63400 64100 78650 51590 27700 42300 87010 114920 ...
Primero vamos a buscar un modelo usando todas las variables de la base
datos_ajuste <- glm(total_claim_amount~., data = datos_2)
stepAIC(datos_ajuste, direction = "both", trace = F)
##
## Call: glm(formula = total_claim_amount ~ age + number_of_vehicles_involved,
## data = datos_2)
##
## Coefficients:
## (Intercept) age
## 61719.8 105.1
## number_of_vehicles_involved
## -1396.1
##
## Degrees of Freedom: 821 Total (i.e. Null); 819 Residual
## Null Deviance: 2.105e+11
## Residual Deviance: 2.08e+11 AIC: 18250
summary(datos_ajuste)
##
## Call:
## glm(formula = total_claim_amount ~ ., data = datos_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -41701 -9962 -582 9261 46850
##
## Coefficients: (13 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 2.057e+05 1.997e+05 1.030
## months_as_customer -8.303e+00 1.322e+01 -0.628
## age 1.665e+02 1.654e+02 1.007
## policy_stateIN 2.236e+03 1.491e+03 1.499
## policy_stateOH 1.254e+02 1.423e+03 0.088
## policy_deductable 1.651e+00 9.658e-01 1.709
## policy_annual_premium 4.981e-01 2.423e+00 0.206
## insured_sexMALE -9.918e+02 1.171e+03 -0.847
## insured_education_levelCollege -2.159e+03 2.376e+03 -0.909
## insured_education_levelHigh School 2.737e+03 2.229e+03 1.228
## insured_education_levelJD -6.816e+02 2.206e+03 -0.309
## insured_education_levelMasters 3.201e+03 2.251e+03 1.422
## insured_education_levelMD 1.189e+03 2.231e+03 0.533
## insured_education_levelPhD 1.951e+03 2.312e+03 0.844
## insured_occupationarmed-forces 1.095e+03 3.345e+03 0.327
## insured_occupationcraft-repair 1.624e+03 3.256e+03 0.499
## insured_occupationexec-managerial 3.247e+03 3.231e+03 1.005
## insured_occupationfarming-fishing -4.511e+03 3.524e+03 -1.280
## insured_occupationhandlers-cleaners 4.193e+03 3.432e+03 1.222
## insured_occupationmachine-op-inspct -2.769e+02 3.059e+03 -0.091
## insured_occupationother-service 2.732e+03 3.365e+03 0.812
## insured_occupationpriv-house-serv 2.300e+03 3.280e+03 0.701
## insured_occupationprof-specialty 3.127e+03 3.147e+03 0.994
## insured_occupationprotective-serv 1.946e+03 3.339e+03 0.583
## insured_occupationsales -2.165e+03 3.267e+03 -0.663
## insured_occupationtech-support 2.952e+03 3.232e+03 0.913
## insured_occupationtransport-moving 2.185e+03 3.225e+03 0.677
## insured_hobbiesbasketball 5.233e+03 4.118e+03 1.271
## insured_hobbiesboard-games 6.170e+03 3.755e+03 1.643
## insured_hobbiesbungie-jumping 4.512e+02 3.655e+03 0.123
## insured_hobbiescamping 5.119e+03 3.722e+03 1.375
## insured_hobbieschess 6.392e+03 3.896e+03 1.641
## insured_hobbiescross-fit 9.804e+03 4.012e+03 2.444
## insured_hobbiesdancing 4.184e+03 3.913e+03 1.069
## insured_hobbiesexercise 4.573e+03 3.692e+03 1.238
## insured_hobbiesgolf 3.856e+03 3.750e+03 1.028
## insured_hobbieshiking 7.157e+03 3.713e+03 1.928
## insured_hobbieskayaking 5.562e+03 3.769e+03 1.476
## insured_hobbiesmovies 6.119e+03 3.735e+03 1.638
## insured_hobbiespaintball 8.509e+03 3.767e+03 2.259
## insured_hobbiespolo 7.394e+03 3.946e+03 1.873
## insured_hobbiesreading 2.820e+03 3.619e+03 0.779
## insured_hobbiesskydiving 8.031e+03 3.726e+03 2.155
## insured_hobbiessleeping 2.653e+03 3.928e+03 0.675
## insured_hobbiesvideo-games 9.640e+03 3.872e+03 2.489
## insured_hobbiesyachting 4.797e+03 3.707e+03 1.294
## insured_relationshipnot-in-family 1.263e+03 2.018e+03 0.626
## insured_relationshipother-relative 1.624e+03 2.019e+03 0.804
## insured_relationshipown-child -2.099e+02 1.990e+03 -0.105
## insured_relationshipunmarried 1.899e+03 2.146e+03 0.885
## insured_relationshipwife 2.677e+03 2.092e+03 1.280
## incident_typeSingle Vehicle Collision 3.222e+03 4.578e+03 0.704
## collision_type3 -2.450e+03 1.491e+03 -1.643
## collision_type4 -1.512e+03 1.511e+03 -1.000
## incident_severityMinor Damage -9.914e+02 1.470e+03 -0.674
## incident_severityTotal Loss -2.940e+03 1.451e+03 -2.026
## authorities_contactedFire 5.190e+02 1.671e+03 0.311
## authorities_contactedOther 4.461e+03 1.699e+03 2.626
## authorities_contactedPolice 2.449e+03 1.675e+03 1.462
## incident_hour_of_the_day 7.002e+01 8.433e+01 0.830
## number_of_vehicles_involved 2.040e+02 2.182e+03 0.093
## auto_makeAudi 1.568e+01 5.174e+03 0.003
## auto_makeBMW 4.354e+03 6.053e+03 0.719
## auto_makeChevrolet 4.069e+03 5.479e+03 0.743
## auto_makeDodge 1.795e+03 4.847e+03 0.370
## auto_makeFord -7.383e+02 5.511e+03 -0.134
## auto_makeHonda 2.817e+03 5.657e+03 0.498
## auto_makeJeep 1.185e+03 4.889e+03 0.242
## auto_makeMercedes 7.265e+03 5.697e+03 1.275
## auto_makeNissan 6.235e+01 5.578e+03 0.011
## auto_makeSaab -1.499e+03 5.345e+03 -0.281
## auto_makeSuburu -9.495e+02 5.081e+03 -0.187
## auto_makeToyota 8.590e+02 5.588e+03 0.154
## auto_makeVolkswagen 2.604e+02 5.171e+03 0.050
## auto_model92x 3.451e+03 5.002e+03 0.690
## auto_model93 -1.683e+02 5.079e+03 -0.033
## auto_model95 NA NA NA
## auto_modelA3 4.133e+03 4.401e+03 0.939
## auto_modelA5 NA NA NA
## auto_modelAccord -7.858e+03 6.435e+03 -1.221
## auto_modelC300 -6.551e+03 5.693e+03 -1.151
## auto_modelCamry -4.170e+03 5.494e+03 -0.759
## auto_modelCivic -4.452e+03 5.709e+03 -0.780
## auto_modelCorolla -4.074e+03 5.812e+03 -0.701
## auto_modelCRV NA NA NA
## auto_modelE400 -8.240e+03 5.689e+03 -1.448
## auto_modelEscape 9.747e+03 5.195e+03 1.876
## auto_modelF150 3.979e+03 5.341e+03 0.745
## auto_modelForrestor 2.497e+03 4.573e+03 0.546
## auto_modelFusion NA NA NA
## auto_modelGrand Cherokee -2.481e+03 4.747e+03 -0.523
## auto_modelHighlander NA NA NA
## auto_modelImpreza -3.980e+02 5.219e+03 -0.076
## auto_modelJetta -2.132e+03 4.524e+03 -0.471
## auto_modelLegacy NA NA NA
## auto_modelM5 -9.364e+03 6.536e+03 -1.433
## auto_modelMalibu -3.523e+02 5.189e+03 -0.068
## auto_modelMaxima -3.526e+03 5.315e+03 -0.663
## auto_modelMDX -5.744e+03 5.106e+03 -1.125
## auto_modelML350 NA NA NA
## auto_modelNeon 1.116e+03 4.092e+03 0.273
## auto_modelPassat NA NA NA
## auto_modelPathfinder -1.761e+03 5.069e+03 -0.347
## auto_modelRAM NA NA NA
## auto_modelRSX 9.052e+03 6.560e+03 1.380
## auto_modelSilverado -6.178e+03 5.390e+03 -1.146
## auto_modelTahoe NA NA NA
## auto_modelTL NA NA NA
## auto_modelUltima NA NA NA
## auto_modelWrangler NA NA NA
## auto_modelX5 -1.967e+03 5.930e+03 -0.332
## auto_modelX6 -6.163e+02 6.133e+03 -0.100
## auto_year -8.063e+01 9.938e+01 -0.811
## Pr(>|t|)
## (Intercept) 0.30345
## months_as_customer 0.53015
## age 0.31418
## policy_stateIN 0.13420
## policy_stateOH 0.92980
## policy_deductable 0.08784 .
## policy_annual_premium 0.83720
## insured_sexMALE 0.39747
## insured_education_levelCollege 0.36390
## insured_education_levelHigh School 0.21985
## insured_education_levelJD 0.75747
## insured_education_levelMasters 0.15547
## insured_education_levelMD 0.59420
## insured_education_levelPhD 0.39909
## insured_occupationarmed-forces 0.74358
## insured_occupationcraft-repair 0.61803
## insured_occupationexec-managerial 0.31522
## insured_occupationfarming-fishing 0.20087
## insured_occupationhandlers-cleaners 0.22217
## insured_occupationmachine-op-inspct 0.92790
## insured_occupationother-service 0.41712
## insured_occupationpriv-house-serv 0.48336
## insured_occupationprof-specialty 0.32068
## insured_occupationprotective-serv 0.56015
## insured_occupationsales 0.50784
## insured_occupationtech-support 0.36129
## insured_occupationtransport-moving 0.49831
## insured_hobbiesbasketball 0.20416
## insured_hobbiesboard-games 0.10074
## insured_hobbiesbungie-jumping 0.90180
## insured_hobbiescamping 0.16947
## insured_hobbieschess 0.10128
## insured_hobbiescross-fit 0.01477 *
## insured_hobbiesdancing 0.28530
## insured_hobbiesexercise 0.21594
## insured_hobbiesgolf 0.30415
## insured_hobbieshiking 0.05428 .
## insured_hobbieskayaking 0.14047
## insured_hobbiesmovies 0.10182
## insured_hobbiespaintball 0.02418 *
## insured_hobbiespolo 0.06141 .
## insured_hobbiesreading 0.43607
## insured_hobbiesskydiving 0.03146 *
## insured_hobbiessleeping 0.49958
## insured_hobbiesvideo-games 0.01302 *
## insured_hobbiesyachting 0.19604
## insured_relationshipnot-in-family 0.53156
## insured_relationshipother-relative 0.42143
## insured_relationshipown-child 0.91603
## insured_relationshipunmarried 0.37647
## insured_relationshipwife 0.20103
## incident_typeSingle Vehicle Collision 0.48177
## collision_type3 0.10080
## collision_type4 0.31750
## incident_severityMinor Damage 0.50025
## incident_severityTotal Loss 0.04316 *
## authorities_contactedFire 0.75621
## authorities_contactedOther 0.00882 **
## authorities_contactedPolice 0.14408
## incident_hour_of_the_day 0.40669
## number_of_vehicles_involved 0.92553
## auto_makeAudi 0.99758
## auto_makeBMW 0.47224
## auto_makeChevrolet 0.45789
## auto_makeDodge 0.71131
## auto_makeFord 0.89346
## auto_makeHonda 0.61872
## auto_makeJeep 0.80854
## auto_makeMercedes 0.20263
## auto_makeNissan 0.99108
## auto_makeSaab 0.77916
## auto_makeSuburu 0.85181
## auto_makeToyota 0.87788
## auto_makeVolkswagen 0.95985
## auto_model92x 0.49053
## auto_model93 0.97358
## auto_model95 NA
## auto_modelA3 0.34801
## auto_modelA5 NA
## auto_modelAccord 0.22243
## auto_modelC300 0.25023
## auto_modelCamry 0.44812
## auto_modelCivic 0.43579
## auto_modelCorolla 0.48363
## auto_modelCRV NA
## auto_modelE400 0.14797
## auto_modelEscape 0.06102 .
## auto_modelF150 0.45643
## auto_modelForrestor 0.58514
## auto_modelFusion NA
## auto_modelGrand Cherokee 0.60145
## auto_modelHighlander NA
## auto_modelImpreza 0.93923
## auto_modelJetta 0.63757
## auto_modelLegacy NA
## auto_modelM5 0.15238
## auto_modelMalibu 0.94589
## auto_modelMaxima 0.50734
## auto_modelMDX 0.26093
## auto_modelML350 NA
## auto_modelNeon 0.78508
## auto_modelPassat NA
## auto_modelPathfinder 0.72834
## auto_modelRAM NA
## auto_modelRSX 0.16800
## auto_modelSilverado 0.25208
## auto_modelTahoe NA
## auto_modelTL NA
## auto_modelUltima NA
## auto_modelWrangler NA
## auto_modelX5 0.74026
## auto_modelX6 0.91999
## auto_year 0.41744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 255963678)
##
## Null deviance: 2.1046e+11 on 821 degrees of freedom
## Residual deviance: 1.8481e+11 on 722 degrees of freedom
## AIC: 18342
##
## Number of Fisher Scoring iterations: 2
De acuerdo al ajuste anterior se sugieren que las variables más significativas son: la edad y el número de autos involucrados
Si probamos las variables antes sugeridas obtenemos lo siguiente:
fit <- glm(formula = total_claim_amount ~ age + number_of_vehicles_involved,
family = Gamma, data = datos_2)
summary(fit)
##
## Call:
## glm(formula = total_claim_amount ~ age + number_of_vehicles_involved,
## family = Gamma, data = datos_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0168 -0.1758 -0.0137 0.1612 0.6507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.620e-05 6.608e-07 24.511 < 2e-16 ***
## age -2.625e-08 1.494e-08 -1.757 0.07921 .
## number_of_vehicles_involved 3.529e-07 1.354e-07 2.606 0.00933 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.06388584)
##
## Null deviance: 56.761 on 821 degrees of freedom
## Residual deviance: 56.132 on 819 degrees of freedom
## AIC: 18255
##
## Number of Fisher Scoring iterations: 4
dev <- fit$deviance
nullDev <- fit$null.deviance
modelChi <- nullDev - dev
modelChi
## [1] 0.6286884
chidf <- fit$df.null - fit$df.residual
chisq.prob <- 1 - pchisq(modelChi, chidf)
chisq.prob
## [1] 0.7302676
Como podemos notar el modelo no es bueno pues no es estadisticamente significativo, por lo que se intentarán ajustar otros modelos al monto de la suma asegurada.