title: “LAT4082_S14” author: “Mauricio Rodriguez Abreu” date: “3/13/2022” output: html_document

In this session we will continue with the use of GLM’s for claim frequency and will introduce the use of GLM’s for claim severity.

We are going to use a new dataset, the dataset is called InsuranceClaims

setwd("~/Dropbox/UDLAP/Cursos/2022 Primavera/Tema Selecto/Presentaciones")

data<-read.csv("InsuranceClaims.csv")
names(data)
##  [1] "Insurance.ID"               "Customer.ID"               
##  [3] "Driving.Children"           "DOB"                       
##  [5] "Age"                        "Children"                  
##  [7] "Job.Duration"               "Income"                    
##  [9] "Single.Parent."             "Home.Value"                
## [11] "Marital.Status"             "Gender"                    
## [13] "Education"                  "Occupation"                
## [15] "Commute.Time..Min."         "Vehicle.Use"               
## [17] "Vehicle.Value"              "Time.In.Force..Years."     
## [19] "Vehicle.Type"               "Red.Car."                  
## [21] "Total.Claims..5.Years."     "X..Claims..5.Years."       
## [23] "License.Revoked...7.Years." "Points"                    
## [25] "Claim.Amount"               "Car.Age"                   
## [27] "Car.Crash"                  "City.Area"

The meaning of the variables names are:

***** We can start with a basic analysis of the dataset:

table(data$Car.Crash)
## 
##   No  Yes 
## 5592 2058
prop.table(table(data$Car.Crash))
## 
##        No       Yes 
## 0.7309804 0.2690196

And alwasya good idea to present the values in plots:

library(ggplot2)
## Registered S3 methods overwritten by 'tibble':
##   method     from  
##   format.tbl pillar
##   print.tbl  pillar
library(scales)
ggplot(data, aes(x=Car.Crash))+geom_bar()

ggplot(data, aes(x=Car.Crash))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
  labs(x = "Claim",
       y= "Percent")

We are not going to fit a logistic, model, though it could be interesting … we will jump to analyze claim frequency

We are going to use the 5-year frame to model this outcome variable:

table(data$X..Claims..5.Years.)
## 
##    0    1    2    3    4    5 
## 4714  939 1082  732  168   15
ggplot(data, aes(x=X..Claims..5.Years.))+geom_bar()+
labs(x = "Claims",
     y= "Counts",
     title= "Claims per policy in 5 years")

Now, we are ready to fit a model:

data$OFreq<-data$X..Claims..5.Years.
data$Age2<-data$Age*data$Age

model<-glm(OFreq~Age+Age2+Children+Gender+Red.Car.,
           family="poisson",data=data)
summary(model)
## 
## Call:
## glm(formula = OFreq ~ Age + Age2 + Children + Gender + Red.Car., 
##     family = "poisson", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8894  -1.2435  -1.1946   0.9239   3.3191  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.3958035  0.2237632   6.238 4.44e-10 ***
## Age         -0.0727252  0.0099032  -7.344 2.08e-13 ***
## Age2         0.0007607  0.0001094   6.953 3.57e-12 ***
## Children     0.0293042  0.0124199   2.359   0.0183 *  
## GenderM     -0.0225028  0.0356311  -0.632   0.5277    
## Red.Car.Yes  0.0870623  0.0388959   2.238   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12689  on 7644  degrees of freedom
## AIC: 20198
## 
## Number of Fisher Scoring iterations: 6

How can we tell if this is a good model? Of course we have the deviance and the AIC information, but …. what about the data we have?

One way to analyze this is by comparing the predicted values and the observed ones to see how far appart they are.

Since we have a count model, we will predict the mean expected number of claims, rather than the actual claims. We can compare this to the summary from the original dataset.

We can start with the expected value for men and women:

ds1<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
                Gender="M",Red.Car.="Yes",
                Children=mean(data$Children))
ds2<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
                Gender="F",Red.Car.="Yes",
                Children=mean(data$Children))
predict(model,ds1,type="response",se.fit=TRUE)
## $fit
##         1 
## 0.8257072 
## 
## $se.fit
##          1 
## 0.01978401 
## 
## $residual.scale
## [1] 1
predict(model,ds2,type="response",se.fit=TRUE)
## $fit
##         1 
## 0.8444986 
## 
## $se.fit
##          1 
## 0.03567502 
## 
## $residual.scale
## [1] 1

Now, we see that we have an expected number of claims of 0.8257, with an standard error of 0.0197. For women we have a preicted value of 0.84449 and se of 0.03567. What was the original mean value for Men and Women?

library(doBy)
summaryBy(OFreq~Gender, FUN=c("mean"),data=data)
##   Gender OFreq.c("mean")
## 1      F       0.7843275
## 2      M       0.7979259

We know that Gender was not a good predictor, lets explore the color of the car:

ds1<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
                Gender="M",Red.Car.="Yes",
                Children=mean(data$Children))
ds2<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
                Gender="M",Red.Car.="No",
                Children=mean(data$Children))
predict(model,ds1,type="response",se.fit=TRUE)
## $fit
##         1 
## 0.8257072 
## 
## $se.fit
##          1 
## 0.01978401 
## 
## $residual.scale
## [1] 1
predict(model,ds2,type="response",se.fit=TRUE)
## $fit
##         1 
## 0.7568597 
## 
## $se.fit
##          1 
## 0.02375463 
## 
## $residual.scale
## [1] 1
summaryBy(OFreq~Gender+Red.Car., FUN=c("mean"),data=data)
##   Gender Red.Car. OFreq.c("mean")
## 1      F       No       0.7800614
## 2      F      Yes       1.2142857
## 3      M       No       0.7546296
## 4      M      Yes       0.8249158
data$phat<-predict(model,type="response")
data<-data[with(data,order(Red.Car.,Age)),]

ggplot(data,aes(x=Age, y=phat,colour= Red.Car.)) +
    geom_point() +
    geom_point(aes(x=Age,y=OFreq),alpha=1/5,shape=21)+ 
    labs(x="Age",y="Expected number of claims")

We can get a better visualization of the average

df1<-as.data.frame(summaryBy(OFreq~Age+Red.Car.,Fun=c("mean"),data=data))
data<-merge(data,df1,by=c("Age","Red.Car."))
ggplot(data,aes(x=Age, y=phat,colour= Red.Car.)) +
  geom_point(shape=21) +
  geom_point(aes(x=Age,y=OFreq.mean,colour= Red.Car.),
             ,alpha=1/5,shape=8)+ 
  labs(x="Age",y="Expected number of claims")

Now, before exploring Negative Binomial, we know we have lots of 0’s, we can try a zero-inflated model:

library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
model2<- zeroinfl(OFreq~Age+Children+Gender+
                    Red.Car.| Children,
                 data=data)
summary(model2)
## 
## Call:
## zeroinfl(formula = OFreq ~ Age + Children + Gender + Red.Car. | Children, 
##     data = data)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7927 -0.6377 -0.6212  0.8717  3.6436 
## 
## Count model coefficients (poisson with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.5387245  0.0867397   6.211 5.27e-10 ***
## Age         -0.0008949  0.0017826  -0.502   0.6157    
## Children    -0.0141337  0.0159455  -0.886   0.3754    
## GenderM     -0.0100650  0.0416108  -0.242   0.8089    
## Red.Car.Yes  0.1062618  0.0451905   2.351   0.0187 *  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.20585    0.03577   5.755 8.69e-09 ***
## Children    -0.13213    0.02824  -4.679 2.89e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -8935 on 7 Df

As you can see, by using the vertical line, we are telling R that only the variable Children should be taking into account for the logit model.

Now, we can compare the Poisson and the zero inflated Poisson Before, lets fit the Poisson again, but using only the same variables we have in model 2.

model1<- glm(OFreq~Age+Children+Gender+
               Red.Car.,family="poisson",
             data=data)
summary(model1)
## 
## Call:
## glm(formula = OFreq ~ Age + Children + Gender + Red.Car., family = "poisson", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4530  -1.2610  -1.2069   0.9863   3.3178  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.066417   0.079978  -0.830  0.40629   
## Age         -0.004821   0.001655  -2.913  0.00358 **
## Children     0.038349   0.012368   3.101  0.00193 **
## GenderM     -0.033348   0.035818  -0.931  0.35183   
## Red.Car.Yes  0.106568   0.039000   2.733  0.00628 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12734  on 7645  degrees of freedom
## AIC: 20241
## 
## Number of Fisher Scoring iterations: 6
vuong(model1,model2)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -28.82065 model2 > model1 < 2.22e-16
## AIC-corrected         -28.77184 model2 > model1 < 2.22e-16
## BIC-corrected         -28.60241 model2 > model1 < 2.22e-16

The Vuong test compares the Poisson (model1) with the zero-inflated Poisson (model2). The test results in a significant improvement, indicating that the zero-inflated model is better than the regular Poisson.

**** Now we can move to model severity:

In the same data set we have the information on the claim loss. The variable is “Claim.Amount”.

Before, we have to change some elements in our variables, since they were saved with symbols “$” and commas for quantities.

levels(data$Claim.Amount) <- gsub("$", "", levels(data$Claim.Amount), fixed=TRUE)
levels(data$Claim.Amount) <- gsub(",", "", levels(data$Claim.Amount), fixed=TRUE)

levels(data$Vehicle.Value) <- gsub("$", "", levels(data$Vehicle.Value), fixed=TRUE)
levels(data$Vehicle.Value) <- gsub(",", "", levels(data$Vehicle.Value), fixed=TRUE)

data$Claim.Amount<-as.numeric(as.character(data$Claim.Amount))
data$Vehicle.Value<-as.numeric(as.character(data$Vehicle.Value))

Now, we can run our models, since we knoe that severity takes possitive values all the time, we have to use lognormal or gamma regression. We can fit both:

models1<- lm(log(Claim.Amount)~Car.Age+Vehicle.Value,
             data=data[data$Claim.Amount>0,])
summary(models1)
## 
## Call:
## lm(formula = log(Claim.Amount) ~ Car.Age + Vehicle.Value, data = data[data$Claim.Amount > 
##     0, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7884 -0.3863  0.0542  0.3940  2.9817 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.128e+00  3.955e-02 205.538  < 2e-16 ***
## Car.Age       -2.304e-03  3.324e-03  -0.693    0.488    
## Vehicle.Value  1.089e-05  2.231e-06   4.881 1.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7849 on 2055 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.0105 
## F-statistic: 11.91 on 2 and 2055 DF,  p-value: 7.178e-06
models2<- glm(Claim.Amount~Car.Age+Vehicle.Value,
              family=Gamma(link="log"),
              data=data[data$Claim.Amount>0,])
summary(models2)
## 
## Call:
## glm(formula = Claim.Amount ~ Car.Age + Vehicle.Value, family = Gamma(link = "log"), 
##     data = data[data$Claim.Amount > 0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8494  -0.6500  -0.2751   0.0652   4.3640  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.373e+00  6.061e-02 138.153  < 2e-16 ***
## Car.Age       -9.006e-03  5.095e-03  -1.768   0.0772 .  
## Vehicle.Value  2.103e-05  3.419e-06   6.151 9.24e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.447016)
## 
##     Null deviance: 1446.3  on 2057  degrees of freedom
## Residual deviance: 1391.4  on 2055  degrees of freedom
## AIC: 39279
## 
## Number of Fisher Scoring iterations: 6

Which one fits better? These models are easier to compare, we just have to estimate the mean predicted Claim Amount and we will see the best fit to our data:

mean(exp(predict(models1,type="response")))
## [1] 3888.896
mean(predict(models2,type="response"))
## [1] 5511.057
summaryBy(Claim.Amount~1,FUN=c("mean"),data=data[data$Claim.Amount>0,])
##   Claim.Amount.c("mean")
## 1               5506.125

Clearly, the second model fits better, or at least, this is the model that is resulting in a close estimated mean.

***** Combining the two models…

In these new models, we will use as many variables as we can to obtain a better estimation. We have two posibilities, backward selection or forward selection. The first refers to the process where we will start with all possible variables, we we will remove those that don’t have a significant effect. The second approach means that we will start with fewer variables and we will add new variables, base on our assumptions regarding the expected behavior.

Since age is a complicated variable, we can create age groups:

data$ageg<-0
data$ageg<-ifelse(data$Age>=16 & data$Age<=25,1,data$ageg)
data$ageg<-ifelse(data$Age>=26 & data$Age<=35,2,data$ageg)
data$ageg<-ifelse(data$Age>=36 & data$Age<=45,3,data$ageg)
data$ageg<-ifelse(data$Age>=46 & data$Age<=55,4,data$ageg)
data$ageg<-ifelse(data$Age>=56,5,data$ageg)

Now, we can estimate a first model for claim frequency:

model1f<- glm(OFreq~factor(ageg)+Gender+Marital.Status+Education+Commute.Time..Min.+Vehicle.Use+
               Red.Car.,family="poisson",data=data)
summary(model1f)
## 
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status + 
##     Education + Commute.Time..Min. + Vehicle.Use + Red.Car., 
##     family = "poisson", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7490  -1.2557  -1.1323   0.8585   3.4556  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.3130431  0.1033751   3.028  0.00246 ** 
## factor(ageg)2        -0.1887025  0.0997191  -1.892  0.05845 .  
## factor(ageg)3        -0.3362252  0.0967387  -3.476  0.00051 ***
## factor(ageg)4        -0.3972654  0.0974207  -4.078 4.55e-05 ***
## factor(ageg)5        -0.1678348  0.1025030  -1.637  0.10155    
## GenderM              -0.0731850  0.0363275  -2.015  0.04395 *  
## Marital.StatusYes    -0.2124673  0.0260228  -8.165 3.22e-16 ***
## EducationHigh School  0.0161123  0.0301054   0.535  0.59251    
## EducationMasters     -0.0811863  0.0421235  -1.927  0.05394 .  
## EducationPhD         -0.0739082  0.0573080  -1.290  0.19717    
## Commute.Time..Min.    0.0006739  0.0008076   0.834  0.40403    
## Vehicle.UsePrivate   -0.1844038  0.0285067  -6.469 9.88e-11 ***
## Red.Car.Yes           0.0850374  0.0391830   2.170  0.02999 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12575  on 7637  degrees of freedom
## AIC: 20098
## 
## Number of Fisher Scoring iterations: 6
exp(model1f$coeff)
##          (Intercept)        factor(ageg)2        factor(ageg)3 
##            1.3675804            0.8280328            0.7144622 
##        factor(ageg)4        factor(ageg)5              GenderM 
##            0.6721556            0.8454935            0.9294288 
##    Marital.StatusYes EducationHigh School     EducationMasters 
##            0.8085867            1.0162428            0.9220219 
##         EducationPhD   Commute.Time..Min.   Vehicle.UsePrivate 
##            0.9287569            1.0006741            0.8316000 
##          Red.Car.Yes 
##            1.0887578

We could keep the variables that have a significant effect:

model2f<- glm(OFreq~factor(ageg)+Gender+Marital.Status+Vehicle.Use+
               Red.Car.,family="poisson",data=data)
summary(model2f)
## 
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status + 
##     Vehicle.Use + Red.Car., family = "poisson", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7636  -1.2470  -1.1174   0.8614   3.4837  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.35622    0.09740   3.657 0.000255 ***
## factor(ageg)2      -0.19332    0.09969  -1.939 0.052484 .  
## factor(ageg)3      -0.35022    0.09661  -3.625 0.000289 ***
## factor(ageg)4      -0.42209    0.09700  -4.351 1.35e-05 ***
## factor(ageg)5      -0.19871    0.10186  -1.951 0.051085 .  
## GenderM            -0.07594    0.03631  -2.091 0.036495 *  
## Marital.StatusYes  -0.20889    0.02597  -8.045 8.63e-16 ***
## Vehicle.UsePrivate -0.20570    0.02742  -7.501 6.31e-14 ***
## Red.Car.Yes         0.08535    0.03917   2.179 0.029345 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12583  on 7641  degrees of freedom
## AIC: 20098
## 
## Number of Fisher Scoring iterations: 6
exp(model2f$coeff)
##        (Intercept)      factor(ageg)2      factor(ageg)3      factor(ageg)4 
##          1.4279245          0.8242184          0.7045365          0.6556777 
##      factor(ageg)5            GenderM  Marital.StatusYes Vehicle.UsePrivate 
##          0.8197846          0.9268738          0.8114822          0.8140743 
##        Red.Car.Yes 
##          1.0890941

Now we can move to estimate a model for severity. We use the same variables:

model1s<- glm(Claim.Amount~factor(ageg)+Gender+Marital.Status+Education+Commute.Time..Min.+
                Vehicle.Use+Red.Car.,family=Gamma(link="log"),
              data=data[data$Claim.Amount>0,])
summary(model1s)
## 
## Call:
## glm(formula = Claim.Amount ~ factor(ageg) + Gender + Marital.Status + 
##     Education + Commute.Time..Min. + Vehicle.Use + Red.Car., 
##     family = Gamma(link = "log"), data = data[data$Claim.Amount > 
##         0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8954  -0.6628  -0.2814   0.0487   4.8859  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.8961478  0.1790443  49.687   <2e-16 ***
## factor(ageg)2        -0.1741810  0.1669213  -1.043   0.2968    
## factor(ageg)3        -0.0869086  0.1601108  -0.543   0.5873    
## factor(ageg)4        -0.1090085  0.1638780  -0.665   0.5060    
## factor(ageg)5        -0.0429290  0.1754673  -0.245   0.8067    
## GenderM               0.0279007  0.0773276   0.361   0.7183    
## Marital.StatusYes    -0.1356558  0.0561470  -2.416   0.0158 *  
## EducationHigh School -0.1021501  0.0659101  -1.550   0.1213    
## EducationMasters     -0.1148340  0.1046305  -1.098   0.2725    
## EducationPhD          0.0354619  0.1495255   0.237   0.8126    
## Commute.Time..Min.   -0.0003522  0.0018295  -0.193   0.8474    
## Vehicle.UsePrivate   -0.0864486  0.0610875  -1.415   0.1572    
## Red.Car.Yes          -0.0031505  0.0849080  -0.037   0.9704    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.578212)
## 
##     Null deviance: 1446.3  on 2057  degrees of freedom
## Residual deviance: 1420.8  on 2045  degrees of freedom
## AIC: 39347
## 
## Number of Fisher Scoring iterations: 6
exp(model1s$coeff)
##          (Intercept)        factor(ageg)2        factor(ageg)3 
##         7303.7835598            0.8401448            0.9167609 
##        factor(ageg)4        factor(ageg)5              GenderM 
##            0.8967228            0.9579794            1.0282935 
##    Marital.StatusYes EducationHigh School     EducationMasters 
##            0.8731431            0.9028940            0.8915141 
##         EducationPhD   Commute.Time..Min.   Vehicle.UsePrivate 
##            1.0360982            0.9996478            0.9171827 
##          Red.Car.Yes 
##            0.9968544

What we see is that few variables have a significant impact on severity (only Marital Status). Something we could do is first, explore the variables that have an impact on claim severity and then build the model for frequency…

Another option is to get independent models… Then, for severity, the model we obtanin is:

model2s<- glm(Claim.Amount~Marital.Status+Vehicle.Value
              ,family=Gamma(link="log"),data=data[data$Claim.Amount>0,])
summary(model2s)
## 
## Call:
## glm(formula = Claim.Amount ~ Marital.Status + Vehicle.Value, 
##     family = Gamma(link = "log"), data = data[data$Claim.Amount > 
##         0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8334  -0.6467  -0.2757   0.0721   4.5771  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.389e+00  5.959e-02 140.773  < 2e-16 ***
## Marital.StatusYes -1.258e-01  5.316e-02  -2.367    0.018 *  
## Vehicle.Value      1.989e-05  3.399e-06   5.853  5.6e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.453672)
## 
##     Null deviance: 1446.3  on 2057  degrees of freedom
## Residual deviance: 1387.8  on 2055  degrees of freedom
## AIC: 39273
## 
## Number of Fisher Scoring iterations: 6
exp(model2s$coeff)
##       (Intercept) Marital.StatusYes     Vehicle.Value 
##      4399.4776459         0.8817506         1.0000199

Now, how does this translates into premiums?

We can start with a basic premium calculation. Let’s estimate the premium using the fact that

\[ Pure Premium=E(N)\times E(X) \]

mean(data$OFreq)*mean(data$Claim.Amount)
## [1] 1170.676

But wait! the estimation is not that easy, remember the formula for siniestralidad?

\[ sin=\frac{Claims}{Exposure} \times \frac{Costs}{Claims} \]

So we have to refine our estimation as:

mean(data$OFreq)*mean(data$Claim.Amount[data$Claim.Amount>0])
## [1] 4351.638

Review of our models…

data$Car.Age<-as.numeric(as.character(data$Car.Age))
data$CarG<-0
data$CarG<-ifelse(data$Car.Age<=1,1,data$CarG)
data$CarG<-ifelse(data$Car.Age>=2 & data$Car.Age<=5,2,data$CarG)
data$CarG<-ifelse(data$Car.Age>=6 & data$Car.Age<=10,3,data$CarG)
data$CarG<-ifelse(data$Car.Age>=10,4,data$CarG)

data$CarV<-0
data$CarV<-ifelse(data$Vehicle.Value<5000,1,0)
data$CarV<-ifelse(data$Vehicle.Value>=5000 & data$Vehicle.Value<10000,2,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=10000 & data$Vehicle.Value<15000,3,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=15000 & data$Vehicle.Value<20000,4,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=20000 & data$Vehicle.Value<25000,5,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=25000 ,6,data$CarV)

model3s<- glm(Claim.Amount~Marital.Status+factor(CarG)+factor(CarV)+Gender
              ,family=Gamma(link="log"),data=data[data$Claim.Amount>0,])
summary(model3s)
## 
## Call:
## glm(formula = Claim.Amount ~ Marital.Status + factor(CarG) + 
##     factor(CarV) + Gender, family = Gamma(link = "log"), data = data[data$Claim.Amount > 
##     0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8580  -0.6576  -0.2683   0.0764   4.4060  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.25743    0.10445  79.055  < 2e-16 ***
## Marital.StatusYes -0.11647    0.05243  -2.222   0.0264 *  
## factor(CarG)2     -0.04345    0.10885  -0.399   0.6898    
## factor(CarG)3     -0.02655    0.06570  -0.404   0.6862    
## factor(CarG)4     -0.09977    0.06669  -1.496   0.1348    
## factor(CarV)2      0.33487    0.10247   3.268   0.0011 ** 
## factor(CarV)3      0.41241    0.10563   3.904 9.75e-05 ***
## factor(CarV)4      0.53175    0.11011   4.829 1.47e-06 ***
## factor(CarV)5      0.68418    0.11951   5.725 1.19e-08 ***
## factor(CarV)6      0.60438    0.12663   4.773 1.94e-06 ***
## GenderM            0.02982    0.05317   0.561   0.5750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.403837)
## 
##     Null deviance: 1446.3  on 2057  degrees of freedom
## Residual deviance: 1374.0  on 2047  degrees of freedom
## AIC: 39266
## 
## Number of Fisher Scoring iterations: 6
model3s<- glm(Claim.Amount~Marital.Status+factor(CarV),
              family=Gamma(link="log"),data=data[data$Claim.Amount>0,])
summary(model3s)
## 
## Call:
## glm(formula = Claim.Amount ~ Marital.Status + factor(CarV), family = Gamma(link = "log"), 
##     data = data[data$Claim.Amount > 0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8496  -0.6539  -0.2718   0.0753   4.4976  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.23508    0.09505  86.637  < 2e-16 ***
## Marital.StatusYes -0.11389    0.05278  -2.158 0.031069 *  
## factor(CarV)2      0.33381    0.10291   3.244 0.001198 ** 
## factor(CarV)3      0.40443    0.10605   3.813 0.000141 ***
## factor(CarV)4      0.52221    0.11046   4.727 2.43e-06 ***
## factor(CarV)5      0.68042    0.11983   5.678 1.56e-08 ***
## factor(CarV)6      0.59311    0.12683   4.677 3.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.425698)
## 
##     Null deviance: 1446.3  on 2057  degrees of freedom
## Residual deviance: 1378.0  on 2051  degrees of freedom
## AIC: 39265
## 
## Number of Fisher Scoring iterations: 6
exp(model3s$coeff)
##       (Intercept) Marital.StatusYes     factor(CarV)2     factor(CarV)3 
##      3770.9547868         0.8923553         1.3962846         1.4984494 
##     factor(CarV)4     factor(CarV)5     factor(CarV)6 
##         1.6857554         1.9747046         1.8096116

We have, then the following information:

Variable Category Frequency Severity
Age Group Age1(16-25) 1.00
Age Group Age2(25-35) 0.8242
Age Group Age3(36-45) 0.7045
Gender Female 1.00
Gender Male 0.9269
Marital Status Not married 1.00 1.00
Marital Status Married 0.8115 0.8924
Vehicle Use Commercial 1.00
Vehicle Use Private 0.8141
Color Not red 1.00
Color Red 1.089
Vehicle value <$5,000 1.00
Vehicle value $5,000-$10,000 1.3963
Vehicle value $10,000-$15,000 1.4984
Vehicle value $15,000-$20,000 1.6858
Vehicle value $20,000-$25,000 1.9747
Vehicle value >=$25,000 1.8096

So, we can start estimating the premium. However, lets see what is the meaning of the intercepts.

I1<-exp(model2f$coeff[1])
I2<-exp(model3s$coeff[1])

I1*I2
## (Intercept) 
##    5384.639

So, this is the net premium for a Women, ages 16 to 25, single, driving a commercial vehicle, not red (color) and the vehicle price is less than 5000.

Now, we want the premium for a married man, age 47, driving a red non-commercial vehicle worth $22,500. What is the premium?

Now, there are some things we should consider when using models… we can go back to our model for frequency:

summary(model1f)
## 
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status + 
##     Education + Commute.Time..Min. + Vehicle.Use + Red.Car., 
##     family = "poisson", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7490  -1.2557  -1.1323   0.8585   3.4556  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.3130431  0.1033751   3.028  0.00246 ** 
## factor(ageg)2        -0.1887025  0.0997191  -1.892  0.05845 .  
## factor(ageg)3        -0.3362252  0.0967387  -3.476  0.00051 ***
## factor(ageg)4        -0.3972654  0.0974207  -4.078 4.55e-05 ***
## factor(ageg)5        -0.1678348  0.1025030  -1.637  0.10155    
## GenderM              -0.0731850  0.0363275  -2.015  0.04395 *  
## Marital.StatusYes    -0.2124673  0.0260228  -8.165 3.22e-16 ***
## EducationHigh School  0.0161123  0.0301054   0.535  0.59251    
## EducationMasters     -0.0811863  0.0421235  -1.927  0.05394 .  
## EducationPhD         -0.0739082  0.0573080  -1.290  0.19717    
## Commute.Time..Min.    0.0006739  0.0008076   0.834  0.40403    
## Vehicle.UsePrivate   -0.1844038  0.0285067  -6.469 9.88e-11 ***
## Red.Car.Yes           0.0850374  0.0391830   2.170  0.02999 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12575  on 7637  degrees of freedom
## AIC: 20098
## 
## Number of Fisher Scoring iterations: 6
model3f<- glm(OFreq~factor(ageg)+Gender+Marital.Status+Education+Commute.Time..Min.+Vehicle.Use+
                Red.Car.+Occupation,family="poisson",data=data)
summary(model3f)
## 
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status + 
##     Education + Commute.Time..Min. + Vehicle.Use + Red.Car. + 
##     Occupation, family = "poisson", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7447  -1.2574  -1.1340   0.8135   3.4141  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             0.2594035  0.1059726   2.448 0.014372 *  
## factor(ageg)2          -0.1809846  0.0998346  -1.813 0.069856 .  
## factor(ageg)3          -0.3268344  0.0970296  -3.368 0.000756 ***
## factor(ageg)4          -0.3891166  0.0978599  -3.976 7.00e-05 ***
## factor(ageg)5          -0.1647328  0.1029556  -1.600 0.109590    
## GenderM                -0.0784293  0.0368824  -2.126 0.033464 *  
## Marital.StatusYes      -0.2114631  0.0260389  -8.121 4.62e-16 ***
## EducationHigh School    0.0372290  0.0338705   1.099 0.271700    
## EducationMasters       -0.1557438  0.0574379  -2.712 0.006698 ** 
## EducationPhD           -0.1773087  0.0811376  -2.185 0.028868 *  
## Commute.Time..Min.      0.0007031  0.0008101   0.868 0.385487    
## Vehicle.UsePrivate     -0.2259312  0.0339261  -6.660 2.75e-11 ***
## Red.Car.Yes             0.0823661  0.0391968   2.101 0.035611 *  
## OccupationClerical      0.0497387  0.0453401   1.097 0.272636    
## OccupationDoctor        0.2586733  0.1153466   2.243 0.024924 *  
## OccupationHome Maker    0.0885794  0.0583779   1.517 0.129179    
## OccupationLawyer        0.2063450  0.0792815   2.603 0.009250 ** 
## OccupationManager       0.0846570  0.0553258   1.530 0.125979    
## OccupationProfessional  0.1176954  0.0494253   2.381 0.017253 *  
## OccupationStudent       0.1037001  0.0486225   2.133 0.032945 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12562  on 7630  degrees of freedom
## AIC: 20099
## 
## Number of Fisher Scoring iterations: 6

What happended here?

We have to be careful when fitting our models, the simpliest model is the better… but not only can we find mediating effects of variables, such as the previous example, we could have some interesting interactions.

Lets create new dummy variables and run a new model:

data$red<-ifelse(data$Red.Car.=="Yes",1,0)
data$mujer<- ifelse(data$Gender=="F",1,0)
data$hs<-ifelse(data$Education=="High School",1,0)
data$bach<-ifelse(data$Education=="Bachelors",1,0)
data$grad<-ifelse(data$Education=="Masters" | data$Education=="PhD",1,0)
data$ctime<-data$Commute.Time..Min.
data$married<-ifelse(data$Marital.Status=="Yes",1,0)


model4f<- glm(OFreq~factor(ageg)+mujer+married+bach+grad+ctime+
                red+mujer*bach+mujer*grad,family="poisson",data=data)
summary(model4f)
## 
## Call:
## glm(formula = OFreq ~ factor(ageg) + mujer + married + bach + 
##     grad + ctime + red + mujer * bach + mujer * grad, family = "poisson", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6963  -1.2611  -1.1528   0.8521   3.3931  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.2096756  0.1072502   1.955 0.050581 .  
## factor(ageg)2 -0.1846494  0.0997327  -1.851 0.064106 .  
## factor(ageg)3 -0.3345657  0.0967511  -3.458 0.000544 ***
## factor(ageg)4 -0.3954696  0.0974303  -4.059 4.93e-05 ***
## factor(ageg)5 -0.1687801  0.1024836  -1.647 0.099579 .  
## mujer         -0.0397816  0.0441696  -0.901 0.367771    
## married       -0.2139615  0.0260052  -8.228  < 2e-16 ***
## bach          -0.1024962  0.0448138  -2.287 0.022187 *  
## grad          -0.2161228  0.0516570  -4.184 2.87e-05 ***
## ctime          0.0006775  0.0008080   0.838 0.401771    
## red            0.0886436  0.0392229   2.260 0.023821 *  
## mujer:bach     0.1623547  0.0602054   2.697 0.007003 ** 
## mujer:grad     0.1157729  0.0673901   1.718 0.085805 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12775  on 7649  degrees of freedom
## Residual deviance: 12608  on 7637  degrees of freedom
## AIC: 20131
## 
## Number of Fisher Scoring iterations: 6
rm(list = ls())