title: “LAT4082_S14” author: “Mauricio Rodriguez Abreu” date: “3/13/2022” output: html_document
In this session we will continue with the use of GLM’s for claim frequency and will introduce the use of GLM’s for claim severity.
We are going to use a new dataset, the dataset is called InsuranceClaims
setwd("~/Dropbox/UDLAP/Cursos/2022 Primavera/Tema Selecto/Presentaciones")
data<-read.csv("InsuranceClaims.csv")
names(data)
## [1] "Insurance.ID" "Customer.ID"
## [3] "Driving.Children" "DOB"
## [5] "Age" "Children"
## [7] "Job.Duration" "Income"
## [9] "Single.Parent." "Home.Value"
## [11] "Marital.Status" "Gender"
## [13] "Education" "Occupation"
## [15] "Commute.Time..Min." "Vehicle.Use"
## [17] "Vehicle.Value" "Time.In.Force..Years."
## [19] "Vehicle.Type" "Red.Car."
## [21] "Total.Claims..5.Years." "X..Claims..5.Years."
## [23] "License.Revoked...7.Years." "Points"
## [25] "Claim.Amount" "Car.Age"
## [27] "Car.Crash" "City.Area"
The meaning of the variables names are:
***** We can start with a basic analysis of the dataset:
table(data$Car.Crash)
##
## No Yes
## 5592 2058
prop.table(table(data$Car.Crash))
##
## No Yes
## 0.7309804 0.2690196
And alwasya good idea to present the values in plots:
library(ggplot2)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
library(scales)
ggplot(data, aes(x=Car.Crash))+geom_bar()
ggplot(data, aes(x=Car.Crash))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
labs(x = "Claim",
y= "Percent")
We are not going to fit a logistic, model, though it could be interesting … we will jump to analyze claim frequency
We are going to use the 5-year frame to model this outcome variable:
table(data$X..Claims..5.Years.)
##
## 0 1 2 3 4 5
## 4714 939 1082 732 168 15
ggplot(data, aes(x=X..Claims..5.Years.))+geom_bar()+
labs(x = "Claims",
y= "Counts",
title= "Claims per policy in 5 years")
Now, we are ready to fit a model:
data$OFreq<-data$X..Claims..5.Years.
data$Age2<-data$Age*data$Age
model<-glm(OFreq~Age+Age2+Children+Gender+Red.Car.,
family="poisson",data=data)
summary(model)
##
## Call:
## glm(formula = OFreq ~ Age + Age2 + Children + Gender + Red.Car.,
## family = "poisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8894 -1.2435 -1.1946 0.9239 3.3191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3958035 0.2237632 6.238 4.44e-10 ***
## Age -0.0727252 0.0099032 -7.344 2.08e-13 ***
## Age2 0.0007607 0.0001094 6.953 3.57e-12 ***
## Children 0.0293042 0.0124199 2.359 0.0183 *
## GenderM -0.0225028 0.0356311 -0.632 0.5277
## Red.Car.Yes 0.0870623 0.0388959 2.238 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12689 on 7644 degrees of freedom
## AIC: 20198
##
## Number of Fisher Scoring iterations: 6
How can we tell if this is a good model? Of course we have the deviance and the AIC information, but …. what about the data we have?
One way to analyze this is by comparing the predicted values and the observed ones to see how far appart they are.
Since we have a count model, we will predict the mean expected number of claims, rather than the actual claims. We can compare this to the summary from the original dataset.
We can start with the expected value for men and women:
ds1<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
Gender="M",Red.Car.="Yes",
Children=mean(data$Children))
ds2<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
Gender="F",Red.Car.="Yes",
Children=mean(data$Children))
predict(model,ds1,type="response",se.fit=TRUE)
## $fit
## 1
## 0.8257072
##
## $se.fit
## 1
## 0.01978401
##
## $residual.scale
## [1] 1
predict(model,ds2,type="response",se.fit=TRUE)
## $fit
## 1
## 0.8444986
##
## $se.fit
## 1
## 0.03567502
##
## $residual.scale
## [1] 1
Now, we see that we have an expected number of claims of 0.8257, with an standard error of 0.0197. For women we have a preicted value of 0.84449 and se of 0.03567. What was the original mean value for Men and Women?
library(doBy)
summaryBy(OFreq~Gender, FUN=c("mean"),data=data)
## Gender OFreq.c("mean")
## 1 F 0.7843275
## 2 M 0.7979259
We know that Gender was not a good predictor, lets explore the color of the car:
ds1<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
Gender="M",Red.Car.="Yes",
Children=mean(data$Children))
ds2<-data.frame(Age=mean(data$Age),Age2=mean(data$Age2),
Gender="M",Red.Car.="No",
Children=mean(data$Children))
predict(model,ds1,type="response",se.fit=TRUE)
## $fit
## 1
## 0.8257072
##
## $se.fit
## 1
## 0.01978401
##
## $residual.scale
## [1] 1
predict(model,ds2,type="response",se.fit=TRUE)
## $fit
## 1
## 0.7568597
##
## $se.fit
## 1
## 0.02375463
##
## $residual.scale
## [1] 1
summaryBy(OFreq~Gender+Red.Car., FUN=c("mean"),data=data)
## Gender Red.Car. OFreq.c("mean")
## 1 F No 0.7800614
## 2 F Yes 1.2142857
## 3 M No 0.7546296
## 4 M Yes 0.8249158
data$phat<-predict(model,type="response")
data<-data[with(data,order(Red.Car.,Age)),]
ggplot(data,aes(x=Age, y=phat,colour= Red.Car.)) +
geom_point() +
geom_point(aes(x=Age,y=OFreq),alpha=1/5,shape=21)+
labs(x="Age",y="Expected number of claims")
We can get a better visualization of the average
df1<-as.data.frame(summaryBy(OFreq~Age+Red.Car.,Fun=c("mean"),data=data))
data<-merge(data,df1,by=c("Age","Red.Car."))
ggplot(data,aes(x=Age, y=phat,colour= Red.Car.)) +
geom_point(shape=21) +
geom_point(aes(x=Age,y=OFreq.mean,colour= Red.Car.),
,alpha=1/5,shape=8)+
labs(x="Age",y="Expected number of claims")
Now, before exploring Negative Binomial, we know we have lots of 0’s, we can try a zero-inflated model:
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
model2<- zeroinfl(OFreq~Age+Children+Gender+
Red.Car.| Children,
data=data)
summary(model2)
##
## Call:
## zeroinfl(formula = OFreq ~ Age + Children + Gender + Red.Car. | Children,
## data = data)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.7927 -0.6377 -0.6212 0.8717 3.6436
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5387245 0.0867397 6.211 5.27e-10 ***
## Age -0.0008949 0.0017826 -0.502 0.6157
## Children -0.0141337 0.0159455 -0.886 0.3754
## GenderM -0.0100650 0.0416108 -0.242 0.8089
## Red.Car.Yes 0.1062618 0.0451905 2.351 0.0187 *
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20585 0.03577 5.755 8.69e-09 ***
## Children -0.13213 0.02824 -4.679 2.89e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 12
## Log-likelihood: -8935 on 7 Df
As you can see, by using the vertical line, we are telling R that only the variable Children should be taking into account for the logit model.
Now, we can compare the Poisson and the zero inflated Poisson Before, lets fit the Poisson again, but using only the same variables we have in model 2.
model1<- glm(OFreq~Age+Children+Gender+
Red.Car.,family="poisson",
data=data)
summary(model1)
##
## Call:
## glm(formula = OFreq ~ Age + Children + Gender + Red.Car., family = "poisson",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4530 -1.2610 -1.2069 0.9863 3.3178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.066417 0.079978 -0.830 0.40629
## Age -0.004821 0.001655 -2.913 0.00358 **
## Children 0.038349 0.012368 3.101 0.00193 **
## GenderM -0.033348 0.035818 -0.931 0.35183
## Red.Car.Yes 0.106568 0.039000 2.733 0.00628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12734 on 7645 degrees of freedom
## AIC: 20241
##
## Number of Fisher Scoring iterations: 6
vuong(model1,model2)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -28.82065 model2 > model1 < 2.22e-16
## AIC-corrected -28.77184 model2 > model1 < 2.22e-16
## BIC-corrected -28.60241 model2 > model1 < 2.22e-16
The Vuong test compares the Poisson (model1) with the zero-inflated Poisson (model2). The test results in a significant improvement, indicating that the zero-inflated model is better than the regular Poisson.
**** Now we can move to model severity:
In the same data set we have the information on the claim loss. The variable is “Claim.Amount”.
Before, we have to change some elements in our variables, since they were saved with symbols “$” and commas for quantities.
levels(data$Claim.Amount) <- gsub("$", "", levels(data$Claim.Amount), fixed=TRUE)
levels(data$Claim.Amount) <- gsub(",", "", levels(data$Claim.Amount), fixed=TRUE)
levels(data$Vehicle.Value) <- gsub("$", "", levels(data$Vehicle.Value), fixed=TRUE)
levels(data$Vehicle.Value) <- gsub(",", "", levels(data$Vehicle.Value), fixed=TRUE)
data$Claim.Amount<-as.numeric(as.character(data$Claim.Amount))
data$Vehicle.Value<-as.numeric(as.character(data$Vehicle.Value))
Now, we can run our models, since we knoe that severity takes possitive values all the time, we have to use lognormal or gamma regression. We can fit both:
models1<- lm(log(Claim.Amount)~Car.Age+Vehicle.Value,
data=data[data$Claim.Amount>0,])
summary(models1)
##
## Call:
## lm(formula = log(Claim.Amount) ~ Car.Age + Vehicle.Value, data = data[data$Claim.Amount >
## 0, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7884 -0.3863 0.0542 0.3940 2.9817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.128e+00 3.955e-02 205.538 < 2e-16 ***
## Car.Age -2.304e-03 3.324e-03 -0.693 0.488
## Vehicle.Value 1.089e-05 2.231e-06 4.881 1.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7849 on 2055 degrees of freedom
## Multiple R-squared: 0.01146, Adjusted R-squared: 0.0105
## F-statistic: 11.91 on 2 and 2055 DF, p-value: 7.178e-06
models2<- glm(Claim.Amount~Car.Age+Vehicle.Value,
family=Gamma(link="log"),
data=data[data$Claim.Amount>0,])
summary(models2)
##
## Call:
## glm(formula = Claim.Amount ~ Car.Age + Vehicle.Value, family = Gamma(link = "log"),
## data = data[data$Claim.Amount > 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8494 -0.6500 -0.2751 0.0652 4.3640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.373e+00 6.061e-02 138.153 < 2e-16 ***
## Car.Age -9.006e-03 5.095e-03 -1.768 0.0772 .
## Vehicle.Value 2.103e-05 3.419e-06 6.151 9.24e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.447016)
##
## Null deviance: 1446.3 on 2057 degrees of freedom
## Residual deviance: 1391.4 on 2055 degrees of freedom
## AIC: 39279
##
## Number of Fisher Scoring iterations: 6
Which one fits better? These models are easier to compare, we just have to estimate the mean predicted Claim Amount and we will see the best fit to our data:
mean(exp(predict(models1,type="response")))
## [1] 3888.896
mean(predict(models2,type="response"))
## [1] 5511.057
summaryBy(Claim.Amount~1,FUN=c("mean"),data=data[data$Claim.Amount>0,])
## Claim.Amount.c("mean")
## 1 5506.125
Clearly, the second model fits better, or at least, this is the model that is resulting in a close estimated mean.
***** Combining the two models…
For purposes of the present example,the steps we will follow are:
Estimate the severity model
In these new models, we will use as many variables as we can to obtain a better estimation. We have two posibilities, backward selection or forward selection. The first refers to the process where we will start with all possible variables, we we will remove those that don’t have a significant effect. The second approach means that we will start with fewer variables and we will add new variables, base on our assumptions regarding the expected behavior.
Since age is a complicated variable, we can create age groups:
data$ageg<-0
data$ageg<-ifelse(data$Age>=16 & data$Age<=25,1,data$ageg)
data$ageg<-ifelse(data$Age>=26 & data$Age<=35,2,data$ageg)
data$ageg<-ifelse(data$Age>=36 & data$Age<=45,3,data$ageg)
data$ageg<-ifelse(data$Age>=46 & data$Age<=55,4,data$ageg)
data$ageg<-ifelse(data$Age>=56,5,data$ageg)
Now, we can estimate a first model for claim frequency:
model1f<- glm(OFreq~factor(ageg)+Gender+Marital.Status+Education+Commute.Time..Min.+Vehicle.Use+
Red.Car.,family="poisson",data=data)
summary(model1f)
##
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status +
## Education + Commute.Time..Min. + Vehicle.Use + Red.Car.,
## family = "poisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7490 -1.2557 -1.1323 0.8585 3.4556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3130431 0.1033751 3.028 0.00246 **
## factor(ageg)2 -0.1887025 0.0997191 -1.892 0.05845 .
## factor(ageg)3 -0.3362252 0.0967387 -3.476 0.00051 ***
## factor(ageg)4 -0.3972654 0.0974207 -4.078 4.55e-05 ***
## factor(ageg)5 -0.1678348 0.1025030 -1.637 0.10155
## GenderM -0.0731850 0.0363275 -2.015 0.04395 *
## Marital.StatusYes -0.2124673 0.0260228 -8.165 3.22e-16 ***
## EducationHigh School 0.0161123 0.0301054 0.535 0.59251
## EducationMasters -0.0811863 0.0421235 -1.927 0.05394 .
## EducationPhD -0.0739082 0.0573080 -1.290 0.19717
## Commute.Time..Min. 0.0006739 0.0008076 0.834 0.40403
## Vehicle.UsePrivate -0.1844038 0.0285067 -6.469 9.88e-11 ***
## Red.Car.Yes 0.0850374 0.0391830 2.170 0.02999 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12575 on 7637 degrees of freedom
## AIC: 20098
##
## Number of Fisher Scoring iterations: 6
exp(model1f$coeff)
## (Intercept) factor(ageg)2 factor(ageg)3
## 1.3675804 0.8280328 0.7144622
## factor(ageg)4 factor(ageg)5 GenderM
## 0.6721556 0.8454935 0.9294288
## Marital.StatusYes EducationHigh School EducationMasters
## 0.8085867 1.0162428 0.9220219
## EducationPhD Commute.Time..Min. Vehicle.UsePrivate
## 0.9287569 1.0006741 0.8316000
## Red.Car.Yes
## 1.0887578
We could keep the variables that have a significant effect:
model2f<- glm(OFreq~factor(ageg)+Gender+Marital.Status+Vehicle.Use+
Red.Car.,family="poisson",data=data)
summary(model2f)
##
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status +
## Vehicle.Use + Red.Car., family = "poisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7636 -1.2470 -1.1174 0.8614 3.4837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35622 0.09740 3.657 0.000255 ***
## factor(ageg)2 -0.19332 0.09969 -1.939 0.052484 .
## factor(ageg)3 -0.35022 0.09661 -3.625 0.000289 ***
## factor(ageg)4 -0.42209 0.09700 -4.351 1.35e-05 ***
## factor(ageg)5 -0.19871 0.10186 -1.951 0.051085 .
## GenderM -0.07594 0.03631 -2.091 0.036495 *
## Marital.StatusYes -0.20889 0.02597 -8.045 8.63e-16 ***
## Vehicle.UsePrivate -0.20570 0.02742 -7.501 6.31e-14 ***
## Red.Car.Yes 0.08535 0.03917 2.179 0.029345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12583 on 7641 degrees of freedom
## AIC: 20098
##
## Number of Fisher Scoring iterations: 6
exp(model2f$coeff)
## (Intercept) factor(ageg)2 factor(ageg)3 factor(ageg)4
## 1.4279245 0.8242184 0.7045365 0.6556777
## factor(ageg)5 GenderM Marital.StatusYes Vehicle.UsePrivate
## 0.8197846 0.9268738 0.8114822 0.8140743
## Red.Car.Yes
## 1.0890941
Now we can move to estimate a model for severity. We use the same variables:
model1s<- glm(Claim.Amount~factor(ageg)+Gender+Marital.Status+Education+Commute.Time..Min.+
Vehicle.Use+Red.Car.,family=Gamma(link="log"),
data=data[data$Claim.Amount>0,])
summary(model1s)
##
## Call:
## glm(formula = Claim.Amount ~ factor(ageg) + Gender + Marital.Status +
## Education + Commute.Time..Min. + Vehicle.Use + Red.Car.,
## family = Gamma(link = "log"), data = data[data$Claim.Amount >
## 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8954 -0.6628 -0.2814 0.0487 4.8859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8961478 0.1790443 49.687 <2e-16 ***
## factor(ageg)2 -0.1741810 0.1669213 -1.043 0.2968
## factor(ageg)3 -0.0869086 0.1601108 -0.543 0.5873
## factor(ageg)4 -0.1090085 0.1638780 -0.665 0.5060
## factor(ageg)5 -0.0429290 0.1754673 -0.245 0.8067
## GenderM 0.0279007 0.0773276 0.361 0.7183
## Marital.StatusYes -0.1356558 0.0561470 -2.416 0.0158 *
## EducationHigh School -0.1021501 0.0659101 -1.550 0.1213
## EducationMasters -0.1148340 0.1046305 -1.098 0.2725
## EducationPhD 0.0354619 0.1495255 0.237 0.8126
## Commute.Time..Min. -0.0003522 0.0018295 -0.193 0.8474
## Vehicle.UsePrivate -0.0864486 0.0610875 -1.415 0.1572
## Red.Car.Yes -0.0031505 0.0849080 -0.037 0.9704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.578212)
##
## Null deviance: 1446.3 on 2057 degrees of freedom
## Residual deviance: 1420.8 on 2045 degrees of freedom
## AIC: 39347
##
## Number of Fisher Scoring iterations: 6
exp(model1s$coeff)
## (Intercept) factor(ageg)2 factor(ageg)3
## 7303.7835598 0.8401448 0.9167609
## factor(ageg)4 factor(ageg)5 GenderM
## 0.8967228 0.9579794 1.0282935
## Marital.StatusYes EducationHigh School EducationMasters
## 0.8731431 0.9028940 0.8915141
## EducationPhD Commute.Time..Min. Vehicle.UsePrivate
## 1.0360982 0.9996478 0.9171827
## Red.Car.Yes
## 0.9968544
What we see is that few variables have a significant impact on severity (only Marital Status). Something we could do is first, explore the variables that have an impact on claim severity and then build the model for frequency…
Another option is to get independent models… Then, for severity, the model we obtanin is:
model2s<- glm(Claim.Amount~Marital.Status+Vehicle.Value
,family=Gamma(link="log"),data=data[data$Claim.Amount>0,])
summary(model2s)
##
## Call:
## glm(formula = Claim.Amount ~ Marital.Status + Vehicle.Value,
## family = Gamma(link = "log"), data = data[data$Claim.Amount >
## 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8334 -0.6467 -0.2757 0.0721 4.5771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.389e+00 5.959e-02 140.773 < 2e-16 ***
## Marital.StatusYes -1.258e-01 5.316e-02 -2.367 0.018 *
## Vehicle.Value 1.989e-05 3.399e-06 5.853 5.6e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.453672)
##
## Null deviance: 1446.3 on 2057 degrees of freedom
## Residual deviance: 1387.8 on 2055 degrees of freedom
## AIC: 39273
##
## Number of Fisher Scoring iterations: 6
exp(model2s$coeff)
## (Intercept) Marital.StatusYes Vehicle.Value
## 4399.4776459 0.8817506 1.0000199
Now, how does this translates into premiums?
We can start with a basic premium calculation. Let’s estimate the premium using the fact that
\[ Pure Premium=E(N)\times E(X) \]
mean(data$OFreq)*mean(data$Claim.Amount)
## [1] 1170.676
But wait! the estimation is not that easy, remember the formula for siniestralidad?
\[ sin=\frac{Claims}{Exposure} \times \frac{Costs}{Claims} \]
So we have to refine our estimation as:
mean(data$OFreq)*mean(data$Claim.Amount[data$Claim.Amount>0])
## [1] 4351.638
Review of our models…
data$Car.Age<-as.numeric(as.character(data$Car.Age))
data$CarG<-0
data$CarG<-ifelse(data$Car.Age<=1,1,data$CarG)
data$CarG<-ifelse(data$Car.Age>=2 & data$Car.Age<=5,2,data$CarG)
data$CarG<-ifelse(data$Car.Age>=6 & data$Car.Age<=10,3,data$CarG)
data$CarG<-ifelse(data$Car.Age>=10,4,data$CarG)
data$CarV<-0
data$CarV<-ifelse(data$Vehicle.Value<5000,1,0)
data$CarV<-ifelse(data$Vehicle.Value>=5000 & data$Vehicle.Value<10000,2,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=10000 & data$Vehicle.Value<15000,3,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=15000 & data$Vehicle.Value<20000,4,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=20000 & data$Vehicle.Value<25000,5,data$CarV)
data$CarV<-ifelse(data$Vehicle.Value>=25000 ,6,data$CarV)
model3s<- glm(Claim.Amount~Marital.Status+factor(CarG)+factor(CarV)+Gender
,family=Gamma(link="log"),data=data[data$Claim.Amount>0,])
summary(model3s)
##
## Call:
## glm(formula = Claim.Amount ~ Marital.Status + factor(CarG) +
## factor(CarV) + Gender, family = Gamma(link = "log"), data = data[data$Claim.Amount >
## 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8580 -0.6576 -0.2683 0.0764 4.4060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.25743 0.10445 79.055 < 2e-16 ***
## Marital.StatusYes -0.11647 0.05243 -2.222 0.0264 *
## factor(CarG)2 -0.04345 0.10885 -0.399 0.6898
## factor(CarG)3 -0.02655 0.06570 -0.404 0.6862
## factor(CarG)4 -0.09977 0.06669 -1.496 0.1348
## factor(CarV)2 0.33487 0.10247 3.268 0.0011 **
## factor(CarV)3 0.41241 0.10563 3.904 9.75e-05 ***
## factor(CarV)4 0.53175 0.11011 4.829 1.47e-06 ***
## factor(CarV)5 0.68418 0.11951 5.725 1.19e-08 ***
## factor(CarV)6 0.60438 0.12663 4.773 1.94e-06 ***
## GenderM 0.02982 0.05317 0.561 0.5750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.403837)
##
## Null deviance: 1446.3 on 2057 degrees of freedom
## Residual deviance: 1374.0 on 2047 degrees of freedom
## AIC: 39266
##
## Number of Fisher Scoring iterations: 6
model3s<- glm(Claim.Amount~Marital.Status+factor(CarV),
family=Gamma(link="log"),data=data[data$Claim.Amount>0,])
summary(model3s)
##
## Call:
## glm(formula = Claim.Amount ~ Marital.Status + factor(CarV), family = Gamma(link = "log"),
## data = data[data$Claim.Amount > 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8496 -0.6539 -0.2718 0.0753 4.4976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.23508 0.09505 86.637 < 2e-16 ***
## Marital.StatusYes -0.11389 0.05278 -2.158 0.031069 *
## factor(CarV)2 0.33381 0.10291 3.244 0.001198 **
## factor(CarV)3 0.40443 0.10605 3.813 0.000141 ***
## factor(CarV)4 0.52221 0.11046 4.727 2.43e-06 ***
## factor(CarV)5 0.68042 0.11983 5.678 1.56e-08 ***
## factor(CarV)6 0.59311 0.12683 4.677 3.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.425698)
##
## Null deviance: 1446.3 on 2057 degrees of freedom
## Residual deviance: 1378.0 on 2051 degrees of freedom
## AIC: 39265
##
## Number of Fisher Scoring iterations: 6
exp(model3s$coeff)
## (Intercept) Marital.StatusYes factor(CarV)2 factor(CarV)3
## 3770.9547868 0.8923553 1.3962846 1.4984494
## factor(CarV)4 factor(CarV)5 factor(CarV)6
## 1.6857554 1.9747046 1.8096116
We have, then the following information:
| Variable | Category | Frequency | Severity |
|---|---|---|---|
| Age Group | Age1(16-25) | 1.00 | – |
| Age Group | Age2(25-35) | 0.8242 | – |
| Age Group | Age3(36-45) | 0.7045 | – |
| Gender | Female | 1.00 | – |
| Gender | Male | 0.9269 | – |
| Marital Status | Not married | 1.00 | 1.00 |
| Marital Status | Married | 0.8115 | 0.8924 |
| Vehicle Use | Commercial | 1.00 | – |
| Vehicle Use | Private | 0.8141 | – |
| Color | Not red | 1.00 | – |
| Color | Red | 1.089 | – |
| Vehicle value | <$5,000 | – | 1.00 |
| Vehicle value | $5,000-$10,000 | – | 1.3963 |
| Vehicle value | $10,000-$15,000 | – | 1.4984 |
| Vehicle value | $15,000-$20,000 | – | 1.6858 |
| Vehicle value | $20,000-$25,000 | – | 1.9747 |
| Vehicle value | >=$25,000 | – | 1.8096 |
So, we can start estimating the premium. However, lets see what is the meaning of the intercepts.
I1<-exp(model2f$coeff[1])
I2<-exp(model3s$coeff[1])
I1*I2
## (Intercept)
## 5384.639
So, this is the net premium for a Women, ages 16 to 25, single, driving a commercial vehicle, not red (color) and the vehicle price is less than 5000.
Now, we want the premium for a married man, age 47, driving a red non-commercial vehicle worth $22,500. What is the premium?
Now, there are some things we should consider when using models… we can go back to our model for frequency:
summary(model1f)
##
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status +
## Education + Commute.Time..Min. + Vehicle.Use + Red.Car.,
## family = "poisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7490 -1.2557 -1.1323 0.8585 3.4556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3130431 0.1033751 3.028 0.00246 **
## factor(ageg)2 -0.1887025 0.0997191 -1.892 0.05845 .
## factor(ageg)3 -0.3362252 0.0967387 -3.476 0.00051 ***
## factor(ageg)4 -0.3972654 0.0974207 -4.078 4.55e-05 ***
## factor(ageg)5 -0.1678348 0.1025030 -1.637 0.10155
## GenderM -0.0731850 0.0363275 -2.015 0.04395 *
## Marital.StatusYes -0.2124673 0.0260228 -8.165 3.22e-16 ***
## EducationHigh School 0.0161123 0.0301054 0.535 0.59251
## EducationMasters -0.0811863 0.0421235 -1.927 0.05394 .
## EducationPhD -0.0739082 0.0573080 -1.290 0.19717
## Commute.Time..Min. 0.0006739 0.0008076 0.834 0.40403
## Vehicle.UsePrivate -0.1844038 0.0285067 -6.469 9.88e-11 ***
## Red.Car.Yes 0.0850374 0.0391830 2.170 0.02999 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12575 on 7637 degrees of freedom
## AIC: 20098
##
## Number of Fisher Scoring iterations: 6
model3f<- glm(OFreq~factor(ageg)+Gender+Marital.Status+Education+Commute.Time..Min.+Vehicle.Use+
Red.Car.+Occupation,family="poisson",data=data)
summary(model3f)
##
## Call:
## glm(formula = OFreq ~ factor(ageg) + Gender + Marital.Status +
## Education + Commute.Time..Min. + Vehicle.Use + Red.Car. +
## Occupation, family = "poisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7447 -1.2574 -1.1340 0.8135 3.4141
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2594035 0.1059726 2.448 0.014372 *
## factor(ageg)2 -0.1809846 0.0998346 -1.813 0.069856 .
## factor(ageg)3 -0.3268344 0.0970296 -3.368 0.000756 ***
## factor(ageg)4 -0.3891166 0.0978599 -3.976 7.00e-05 ***
## factor(ageg)5 -0.1647328 0.1029556 -1.600 0.109590
## GenderM -0.0784293 0.0368824 -2.126 0.033464 *
## Marital.StatusYes -0.2114631 0.0260389 -8.121 4.62e-16 ***
## EducationHigh School 0.0372290 0.0338705 1.099 0.271700
## EducationMasters -0.1557438 0.0574379 -2.712 0.006698 **
## EducationPhD -0.1773087 0.0811376 -2.185 0.028868 *
## Commute.Time..Min. 0.0007031 0.0008101 0.868 0.385487
## Vehicle.UsePrivate -0.2259312 0.0339261 -6.660 2.75e-11 ***
## Red.Car.Yes 0.0823661 0.0391968 2.101 0.035611 *
## OccupationClerical 0.0497387 0.0453401 1.097 0.272636
## OccupationDoctor 0.2586733 0.1153466 2.243 0.024924 *
## OccupationHome Maker 0.0885794 0.0583779 1.517 0.129179
## OccupationLawyer 0.2063450 0.0792815 2.603 0.009250 **
## OccupationManager 0.0846570 0.0553258 1.530 0.125979
## OccupationProfessional 0.1176954 0.0494253 2.381 0.017253 *
## OccupationStudent 0.1037001 0.0486225 2.133 0.032945 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12562 on 7630 degrees of freedom
## AIC: 20099
##
## Number of Fisher Scoring iterations: 6
What happended here?
We have to be careful when fitting our models, the simpliest model is the better… but not only can we find mediating effects of variables, such as the previous example, we could have some interesting interactions.
Lets create new dummy variables and run a new model:
data$red<-ifelse(data$Red.Car.=="Yes",1,0)
data$mujer<- ifelse(data$Gender=="F",1,0)
data$hs<-ifelse(data$Education=="High School",1,0)
data$bach<-ifelse(data$Education=="Bachelors",1,0)
data$grad<-ifelse(data$Education=="Masters" | data$Education=="PhD",1,0)
data$ctime<-data$Commute.Time..Min.
data$married<-ifelse(data$Marital.Status=="Yes",1,0)
model4f<- glm(OFreq~factor(ageg)+mujer+married+bach+grad+ctime+
red+mujer*bach+mujer*grad,family="poisson",data=data)
summary(model4f)
##
## Call:
## glm(formula = OFreq ~ factor(ageg) + mujer + married + bach +
## grad + ctime + red + mujer * bach + mujer * grad, family = "poisson",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6963 -1.2611 -1.1528 0.8521 3.3931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2096756 0.1072502 1.955 0.050581 .
## factor(ageg)2 -0.1846494 0.0997327 -1.851 0.064106 .
## factor(ageg)3 -0.3345657 0.0967511 -3.458 0.000544 ***
## factor(ageg)4 -0.3954696 0.0974303 -4.059 4.93e-05 ***
## factor(ageg)5 -0.1687801 0.1024836 -1.647 0.099579 .
## mujer -0.0397816 0.0441696 -0.901 0.367771
## married -0.2139615 0.0260052 -8.228 < 2e-16 ***
## bach -0.1024962 0.0448138 -2.287 0.022187 *
## grad -0.2161228 0.0516570 -4.184 2.87e-05 ***
## ctime 0.0006775 0.0008080 0.838 0.401771
## red 0.0886436 0.0392229 2.260 0.023821 *
## mujer:bach 0.1623547 0.0602054 2.697 0.007003 **
## mujer:grad 0.1157729 0.0673901 1.718 0.085805 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12775 on 7649 degrees of freedom
## Residual deviance: 12608 on 7637 degrees of freedom
## AIC: 20131
##
## Number of Fisher Scoring iterations: 6
rm(list = ls())