Usted trabaja en el área de estadística de una aseguradora y se le ha solicitado su ayuda para analizar los montos reclamados y evaluar si el valor de las primas es el correcto para los incidentes mayores y de pérdida total. Dada su experiencia en el ramo usted sabe que puede echar mano de los siguientes análisis:

Primero realizamos una limpieza de la base

Quitamos las variables que no son relevantes para el análisis como el número de póliza y la fecha

La variable collision_type tiene como factor el signo “?” también es necesario retirarlo como factor

Finalmente retiramos los NA de la base

library(dplyr)
library(MASS)

datos <- read.csv(file = "C:/Users/Rafae/Desktop/CIENCIA_DE_DATOS/insurance_claims.csv") 
str(datos)
## 'data.frame':    1000 obs. of  22 variables:
##  $ months_as_customer         : int  328 228 134 256 228 256 137 165 27 212 ...
##  $ age                        : int  48 42 29 41 44 39 34 37 33 42 ...
##  $ policy_number              : int  521585 342868 687698 227811 367455 104594 413978 429027 485665 636550 ...
##  $ policy_state               : Factor w/ 3 levels "IL","IN","OH": 3 2 3 1 1 3 2 1 1 1 ...
##  $ policy_deductable          : int  1000 2000 2000 2000 1000 1000 1000 1000 500 500 ...
##  $ policy_annual_premium      : num  1407 1197 1413 1416 1584 ...
##  $ insured_sex                : Factor w/ 2 levels "FEMALE","MALE": 2 2 1 1 2 1 2 2 1 2 ...
##  $ insured_education_level    : Factor w/ 7 levels "Associate","College",..: 6 6 7 7 1 7 7 1 7 7 ...
##  $ insured_occupation         : Factor w/ 14 levels "adm-clerical",..: 3 7 12 2 12 13 10 13 8 9 ...
##  $ insured_hobbies            : Factor w/ 20 levels "base-jumping",..: 18 16 3 3 3 4 3 1 10 5 ...
##  $ insured_relationship       : Factor w/ 6 levels "husband","not-in-family",..: 1 3 4 5 5 5 1 5 4 6 ...
##  $ incident_date              : Factor w/ 60 levels "01/01/2015","01/02/2015",..: 50 42 45 20 35 4 26 55 59 10 ...
##  $ incident_type              : Factor w/ 4 levels "Multi-vehicle Collision",..: 3 4 1 3 4 1 1 1 3 3 ...
##  $ collision_type             : Factor w/ 4 levels "?","Front Collision",..: 4 1 3 2 1 3 2 2 2 3 ...
##  $ incident_severity          : Factor w/ 4 levels "Major Damage",..: 1 2 2 1 2 1 2 3 3 3 ...
##  $ authorities_contacted      : Factor w/ 5 levels "Ambulance","Fire",..: 5 5 5 5 3 2 5 5 5 4 ...
##  $ incident_hour_of_the_day   : int  5 8 7 5 20 19 0 23 21 14 ...
##  $ number_of_vehicles_involved: int  1 1 3 1 1 3 3 3 1 1 ...
##  $ auto_make                  : Factor w/ 14 levels "Accura","Audi",..: 11 9 5 4 1 11 10 2 13 11 ...
##  $ auto_model                 : Factor w/ 39 levels "3 Series","92x",..: 2 13 31 34 32 4 30 6 9 2 ...
##  $ auto_year                  : int  2004 2007 2007 2014 2009 2003 2012 2015 2012 1996 ...
##  $ total_claim_amount         : int  71610 5070 34650 63400 6500 64100 78650 51590 27700 42300 ...
datos_2 <- datos %>% dplyr::select(-policy_number,-incident_date)

datos_2[datos_2 == "?"] <- NA

datos_2$collision_type <- as.integer(datos_2$collision_type)
datos_2$collision_type <- as.factor(datos_2$collision_type)

nrow(datos_2[is.na(datos_2$collision_type),])
## [1] 178
datos_2 <- datos_2[!(is.na(datos_2$collision_type)),]

Veamos que en esta versión de la tabla hemos quitado los renglones que tienen NA, por lo que solo nos quedan 822 observaciones, y 20 variables a analizar

str(datos_2)
## 'data.frame':    822 obs. of  20 variables:
##  $ months_as_customer         : int  328 134 256 256 137 165 27 212 235 447 ...
##  $ age                        : int  48 29 41 39 34 37 33 42 42 61 ...
##  $ policy_state               : Factor w/ 3 levels "IL","IN","OH": 3 3 1 3 2 1 1 1 3 3 ...
##  $ policy_deductable          : int  1000 2000 2000 1000 1000 1000 500 500 500 2000 ...
##  $ policy_annual_premium      : num  1407 1413 1416 1351 1333 ...
##  $ insured_sex                : Factor w/ 2 levels "FEMALE","MALE": 2 1 1 1 2 2 1 2 1 1 ...
##  $ insured_education_level    : Factor w/ 7 levels "Associate","College",..: 6 7 7 7 7 1 7 7 5 3 ...
##  $ insured_occupation         : Factor w/ 14 levels "adm-clerical",..: 3 12 2 13 10 13 8 9 4 4 ...
##  $ insured_hobbies            : Factor w/ 20 levels "base-jumping",..: 18 3 3 4 3 1 10 5 8 17 ...
##  $ insured_relationship       : Factor w/ 6 levels "husband","not-in-family",..: 1 4 5 5 1 5 4 6 3 3 ...
##  $ incident_type              : Factor w/ 4 levels "Multi-vehicle Collision",..: 3 1 3 1 1 1 3 3 3 1 ...
##  $ collision_type             : Factor w/ 3 levels "2","3","4": 3 2 1 2 1 1 1 2 1 1 ...
##  $ incident_severity          : Factor w/ 4 levels "Major Damage",..: 1 2 1 1 2 3 3 3 3 1 ...
##  $ authorities_contacted      : Factor w/ 5 levels "Ambulance","Fire",..: 5 5 5 2 5 5 5 4 5 2 ...
##  $ incident_hour_of_the_day   : int  5 7 5 19 0 23 21 14 22 21 ...
##  $ number_of_vehicles_involved: int  1 3 1 3 3 3 1 1 1 3 ...
##  $ auto_make                  : Factor w/ 14 levels "Accura","Audi",..: 11 5 4 11 10 2 13 11 6 2 ...
##  $ auto_model                 : Factor w/ 39 levels "3 Series","92x",..: 2 31 34 4 30 6 9 2 15 5 ...
##  $ auto_year                  : int  2004 2007 2014 2003 2012 2015 2012 1996 2002 2006 ...
##  $ total_claim_amount         : int  71610 34650 63400 64100 78650 51590 27700 42300 87010 114920 ...

Ajuste del modelo

Primero vamos a buscar un modelo usando todas las variables de la base

datos_ajuste <- glm(total_claim_amount~., data = datos_2)
stepAIC(datos_ajuste, direction = "both", trace = F)
## 
## Call:  glm(formula = total_claim_amount ~ age + number_of_vehicles_involved, 
##     data = datos_2)
## 
## Coefficients:
##                 (Intercept)                          age  
##                     61719.8                        105.1  
## number_of_vehicles_involved  
##                     -1396.1  
## 
## Degrees of Freedom: 821 Total (i.e. Null);  819 Residual
## Null Deviance:       2.105e+11 
## Residual Deviance: 2.08e+11  AIC: 18250
summary(datos_ajuste)
## 
## Call:
## glm(formula = total_claim_amount ~ ., data = datos_2)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -41701   -9962    -582    9261   46850  
## 
## Coefficients: (13 not defined because of singularities)
##                                         Estimate Std. Error t value
## (Intercept)                            2.057e+05  1.997e+05   1.030
## months_as_customer                    -8.303e+00  1.322e+01  -0.628
## age                                    1.665e+02  1.654e+02   1.007
## policy_stateIN                         2.236e+03  1.491e+03   1.499
## policy_stateOH                         1.254e+02  1.423e+03   0.088
## policy_deductable                      1.651e+00  9.658e-01   1.709
## policy_annual_premium                  4.981e-01  2.423e+00   0.206
## insured_sexMALE                       -9.918e+02  1.171e+03  -0.847
## insured_education_levelCollege        -2.159e+03  2.376e+03  -0.909
## insured_education_levelHigh School     2.737e+03  2.229e+03   1.228
## insured_education_levelJD             -6.816e+02  2.206e+03  -0.309
## insured_education_levelMasters         3.201e+03  2.251e+03   1.422
## insured_education_levelMD              1.189e+03  2.231e+03   0.533
## insured_education_levelPhD             1.951e+03  2.312e+03   0.844
## insured_occupationarmed-forces         1.095e+03  3.345e+03   0.327
## insured_occupationcraft-repair         1.624e+03  3.256e+03   0.499
## insured_occupationexec-managerial      3.247e+03  3.231e+03   1.005
## insured_occupationfarming-fishing     -4.511e+03  3.524e+03  -1.280
## insured_occupationhandlers-cleaners    4.193e+03  3.432e+03   1.222
## insured_occupationmachine-op-inspct   -2.769e+02  3.059e+03  -0.091
## insured_occupationother-service        2.732e+03  3.365e+03   0.812
## insured_occupationpriv-house-serv      2.300e+03  3.280e+03   0.701
## insured_occupationprof-specialty       3.127e+03  3.147e+03   0.994
## insured_occupationprotective-serv      1.946e+03  3.339e+03   0.583
## insured_occupationsales               -2.165e+03  3.267e+03  -0.663
## insured_occupationtech-support         2.952e+03  3.232e+03   0.913
## insured_occupationtransport-moving     2.185e+03  3.225e+03   0.677
## insured_hobbiesbasketball              5.233e+03  4.118e+03   1.271
## insured_hobbiesboard-games             6.170e+03  3.755e+03   1.643
## insured_hobbiesbungie-jumping          4.512e+02  3.655e+03   0.123
## insured_hobbiescamping                 5.119e+03  3.722e+03   1.375
## insured_hobbieschess                   6.392e+03  3.896e+03   1.641
## insured_hobbiescross-fit               9.804e+03  4.012e+03   2.444
## insured_hobbiesdancing                 4.184e+03  3.913e+03   1.069
## insured_hobbiesexercise                4.573e+03  3.692e+03   1.238
## insured_hobbiesgolf                    3.856e+03  3.750e+03   1.028
## insured_hobbieshiking                  7.157e+03  3.713e+03   1.928
## insured_hobbieskayaking                5.562e+03  3.769e+03   1.476
## insured_hobbiesmovies                  6.119e+03  3.735e+03   1.638
## insured_hobbiespaintball               8.509e+03  3.767e+03   2.259
## insured_hobbiespolo                    7.394e+03  3.946e+03   1.873
## insured_hobbiesreading                 2.820e+03  3.619e+03   0.779
## insured_hobbiesskydiving               8.031e+03  3.726e+03   2.155
## insured_hobbiessleeping                2.653e+03  3.928e+03   0.675
## insured_hobbiesvideo-games             9.640e+03  3.872e+03   2.489
## insured_hobbiesyachting                4.797e+03  3.707e+03   1.294
## insured_relationshipnot-in-family      1.263e+03  2.018e+03   0.626
## insured_relationshipother-relative     1.624e+03  2.019e+03   0.804
## insured_relationshipown-child         -2.099e+02  1.990e+03  -0.105
## insured_relationshipunmarried          1.899e+03  2.146e+03   0.885
## insured_relationshipwife               2.677e+03  2.092e+03   1.280
## incident_typeSingle Vehicle Collision  3.222e+03  4.578e+03   0.704
## collision_type3                       -2.450e+03  1.491e+03  -1.643
## collision_type4                       -1.512e+03  1.511e+03  -1.000
## incident_severityMinor Damage         -9.914e+02  1.470e+03  -0.674
## incident_severityTotal Loss           -2.940e+03  1.451e+03  -2.026
## authorities_contactedFire              5.190e+02  1.671e+03   0.311
## authorities_contactedOther             4.461e+03  1.699e+03   2.626
## authorities_contactedPolice            2.449e+03  1.675e+03   1.462
## incident_hour_of_the_day               7.002e+01  8.433e+01   0.830
## number_of_vehicles_involved            2.040e+02  2.182e+03   0.093
## auto_makeAudi                          1.568e+01  5.174e+03   0.003
## auto_makeBMW                           4.354e+03  6.053e+03   0.719
## auto_makeChevrolet                     4.069e+03  5.479e+03   0.743
## auto_makeDodge                         1.795e+03  4.847e+03   0.370
## auto_makeFord                         -7.383e+02  5.511e+03  -0.134
## auto_makeHonda                         2.817e+03  5.657e+03   0.498
## auto_makeJeep                          1.185e+03  4.889e+03   0.242
## auto_makeMercedes                      7.265e+03  5.697e+03   1.275
## auto_makeNissan                        6.235e+01  5.578e+03   0.011
## auto_makeSaab                         -1.499e+03  5.345e+03  -0.281
## auto_makeSuburu                       -9.495e+02  5.081e+03  -0.187
## auto_makeToyota                        8.590e+02  5.588e+03   0.154
## auto_makeVolkswagen                    2.604e+02  5.171e+03   0.050
## auto_model92x                          3.451e+03  5.002e+03   0.690
## auto_model93                          -1.683e+02  5.079e+03  -0.033
## auto_model95                                  NA         NA      NA
## auto_modelA3                           4.133e+03  4.401e+03   0.939
## auto_modelA5                                  NA         NA      NA
## auto_modelAccord                      -7.858e+03  6.435e+03  -1.221
## auto_modelC300                        -6.551e+03  5.693e+03  -1.151
## auto_modelCamry                       -4.170e+03  5.494e+03  -0.759
## auto_modelCivic                       -4.452e+03  5.709e+03  -0.780
## auto_modelCorolla                     -4.074e+03  5.812e+03  -0.701
## auto_modelCRV                                 NA         NA      NA
## auto_modelE400                        -8.240e+03  5.689e+03  -1.448
## auto_modelEscape                       9.747e+03  5.195e+03   1.876
## auto_modelF150                         3.979e+03  5.341e+03   0.745
## auto_modelForrestor                    2.497e+03  4.573e+03   0.546
## auto_modelFusion                              NA         NA      NA
## auto_modelGrand Cherokee              -2.481e+03  4.747e+03  -0.523
## auto_modelHighlander                          NA         NA      NA
## auto_modelImpreza                     -3.980e+02  5.219e+03  -0.076
## auto_modelJetta                       -2.132e+03  4.524e+03  -0.471
## auto_modelLegacy                              NA         NA      NA
## auto_modelM5                          -9.364e+03  6.536e+03  -1.433
## auto_modelMalibu                      -3.523e+02  5.189e+03  -0.068
## auto_modelMaxima                      -3.526e+03  5.315e+03  -0.663
## auto_modelMDX                         -5.744e+03  5.106e+03  -1.125
## auto_modelML350                               NA         NA      NA
## auto_modelNeon                         1.116e+03  4.092e+03   0.273
## auto_modelPassat                              NA         NA      NA
## auto_modelPathfinder                  -1.761e+03  5.069e+03  -0.347
## auto_modelRAM                                 NA         NA      NA
## auto_modelRSX                          9.052e+03  6.560e+03   1.380
## auto_modelSilverado                   -6.178e+03  5.390e+03  -1.146
## auto_modelTahoe                               NA         NA      NA
## auto_modelTL                                  NA         NA      NA
## auto_modelUltima                              NA         NA      NA
## auto_modelWrangler                            NA         NA      NA
## auto_modelX5                          -1.967e+03  5.930e+03  -0.332
## auto_modelX6                          -6.163e+02  6.133e+03  -0.100
## auto_year                             -8.063e+01  9.938e+01  -0.811
##                                       Pr(>|t|)   
## (Intercept)                            0.30345   
## months_as_customer                     0.53015   
## age                                    0.31418   
## policy_stateIN                         0.13420   
## policy_stateOH                         0.92980   
## policy_deductable                      0.08784 . 
## policy_annual_premium                  0.83720   
## insured_sexMALE                        0.39747   
## insured_education_levelCollege         0.36390   
## insured_education_levelHigh School     0.21985   
## insured_education_levelJD              0.75747   
## insured_education_levelMasters         0.15547   
## insured_education_levelMD              0.59420   
## insured_education_levelPhD             0.39909   
## insured_occupationarmed-forces         0.74358   
## insured_occupationcraft-repair         0.61803   
## insured_occupationexec-managerial      0.31522   
## insured_occupationfarming-fishing      0.20087   
## insured_occupationhandlers-cleaners    0.22217   
## insured_occupationmachine-op-inspct    0.92790   
## insured_occupationother-service        0.41712   
## insured_occupationpriv-house-serv      0.48336   
## insured_occupationprof-specialty       0.32068   
## insured_occupationprotective-serv      0.56015   
## insured_occupationsales                0.50784   
## insured_occupationtech-support         0.36129   
## insured_occupationtransport-moving     0.49831   
## insured_hobbiesbasketball              0.20416   
## insured_hobbiesboard-games             0.10074   
## insured_hobbiesbungie-jumping          0.90180   
## insured_hobbiescamping                 0.16947   
## insured_hobbieschess                   0.10128   
## insured_hobbiescross-fit               0.01477 * 
## insured_hobbiesdancing                 0.28530   
## insured_hobbiesexercise                0.21594   
## insured_hobbiesgolf                    0.30415   
## insured_hobbieshiking                  0.05428 . 
## insured_hobbieskayaking                0.14047   
## insured_hobbiesmovies                  0.10182   
## insured_hobbiespaintball               0.02418 * 
## insured_hobbiespolo                    0.06141 . 
## insured_hobbiesreading                 0.43607   
## insured_hobbiesskydiving               0.03146 * 
## insured_hobbiessleeping                0.49958   
## insured_hobbiesvideo-games             0.01302 * 
## insured_hobbiesyachting                0.19604   
## insured_relationshipnot-in-family      0.53156   
## insured_relationshipother-relative     0.42143   
## insured_relationshipown-child          0.91603   
## insured_relationshipunmarried          0.37647   
## insured_relationshipwife               0.20103   
## incident_typeSingle Vehicle Collision  0.48177   
## collision_type3                        0.10080   
## collision_type4                        0.31750   
## incident_severityMinor Damage          0.50025   
## incident_severityTotal Loss            0.04316 * 
## authorities_contactedFire              0.75621   
## authorities_contactedOther             0.00882 **
## authorities_contactedPolice            0.14408   
## incident_hour_of_the_day               0.40669   
## number_of_vehicles_involved            0.92553   
## auto_makeAudi                          0.99758   
## auto_makeBMW                           0.47224   
## auto_makeChevrolet                     0.45789   
## auto_makeDodge                         0.71131   
## auto_makeFord                          0.89346   
## auto_makeHonda                         0.61872   
## auto_makeJeep                          0.80854   
## auto_makeMercedes                      0.20263   
## auto_makeNissan                        0.99108   
## auto_makeSaab                          0.77916   
## auto_makeSuburu                        0.85181   
## auto_makeToyota                        0.87788   
## auto_makeVolkswagen                    0.95985   
## auto_model92x                          0.49053   
## auto_model93                           0.97358   
## auto_model95                                NA   
## auto_modelA3                           0.34801   
## auto_modelA5                                NA   
## auto_modelAccord                       0.22243   
## auto_modelC300                         0.25023   
## auto_modelCamry                        0.44812   
## auto_modelCivic                        0.43579   
## auto_modelCorolla                      0.48363   
## auto_modelCRV                               NA   
## auto_modelE400                         0.14797   
## auto_modelEscape                       0.06102 . 
## auto_modelF150                         0.45643   
## auto_modelForrestor                    0.58514   
## auto_modelFusion                            NA   
## auto_modelGrand Cherokee               0.60145   
## auto_modelHighlander                        NA   
## auto_modelImpreza                      0.93923   
## auto_modelJetta                        0.63757   
## auto_modelLegacy                            NA   
## auto_modelM5                           0.15238   
## auto_modelMalibu                       0.94589   
## auto_modelMaxima                       0.50734   
## auto_modelMDX                          0.26093   
## auto_modelML350                             NA   
## auto_modelNeon                         0.78508   
## auto_modelPassat                            NA   
## auto_modelPathfinder                   0.72834   
## auto_modelRAM                               NA   
## auto_modelRSX                          0.16800   
## auto_modelSilverado                    0.25208   
## auto_modelTahoe                             NA   
## auto_modelTL                                NA   
## auto_modelUltima                            NA   
## auto_modelWrangler                          NA   
## auto_modelX5                           0.74026   
## auto_modelX6                           0.91999   
## auto_year                              0.41744   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 255963678)
## 
##     Null deviance: 2.1046e+11  on 821  degrees of freedom
## Residual deviance: 1.8481e+11  on 722  degrees of freedom
## AIC: 18342
## 
## Number of Fisher Scoring iterations: 2

De acuerdo al ajuste anterior se sugieren que las variables más significativas son: la edad y el número de autos involucrados

Si probamos las variables antes sugeridas obtenemos lo siguiente:

fit <- glm(formula = total_claim_amount ~ age + number_of_vehicles_involved, 
           family = Gamma, data = datos_2)

summary(fit)
## 
## Call:
## glm(formula = total_claim_amount ~ age + number_of_vehicles_involved, 
##     family = Gamma, data = datos_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0168  -0.1758  -0.0137   0.1612   0.6507  
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.620e-05  6.608e-07  24.511  < 2e-16 ***
## age                         -2.625e-08  1.494e-08  -1.757  0.07921 .  
## number_of_vehicles_involved  3.529e-07  1.354e-07   2.606  0.00933 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.06388584)
## 
##     Null deviance: 56.761  on 821  degrees of freedom
## Residual deviance: 56.132  on 819  degrees of freedom
## AIC: 18255
## 
## Number of Fisher Scoring iterations: 4

Bondad de ajuste

dev <- fit$deviance
nullDev <- fit$null.deviance
modelChi <- nullDev - dev 
modelChi 
## [1] 0.6286884
chidf <- fit$df.null - fit$df.residual

chisq.prob <- 1 - pchisq(modelChi, chidf)
chisq.prob
## [1] 0.7302676

Como podemos notar el modelo no es bueno pues no es estadisticamente significativo, por lo que se intentarán ajustar otros modelos al monto de la suma asegurada.