Why GLM, and not LM?

A linear model (LM) is a special case of a generalized linear model (GLM). It is called generalized because it is more flexible: the relationship between the independent variables (predictors) and the target variable need not be strictly linear. Instead, the expected value of the target is connected to a linear combination of the predictors through a link function (e.g. identity, log).
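
To make the "special case" concrete, here is a minimal sketch using R's built-in cars dataset (an assumption for illustration; it is not part of this post's data). A GLM with the gaussian family and the identity link should reproduce the coefficients of lm().

```r
# Built-in dataset used purely for illustration (not the post's data)
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)

# The two fits give the same coefficients (up to numerical tolerance)
coef_match <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(coef_match)
```

Swapping the family and link, e.g. poisson(link = "log"), is what turns the same kind of call into the Poisson regression used later in this post.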

library(insuranceData)
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

Use AutoCollision Data

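Age and Vehicle_Use are factor variables, i.e. they have several levels. Keep this in mind when running a regression model: R dummy-codes each factor and drops one level as the reference (here Age A and Vehicle_Use Business), so that level does not appear in the coefficient table. Its effect is absorbed into the intercept, which is the fitted value when all other factor levels are zero, and every other coefficient is measured relative to it.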
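
The reference-level behaviour can be seen on a toy factor. This is a sketch with made-up data (the variable `use` and its values are hypothetical); model.matrix() shows the dummy columns R builds, and relevel() changes which level acts as the baseline.

```r
# Hypothetical toy factor, not taken from AutoCollision
use <- factor(c("Business", "DriveLong", "Pleasure", "Business"))

# The first level becomes the reference and gets no column of its own
dummy_cols <- colnames(model.matrix(~ use))
print(dummy_cols)

# relevel() picks a different baseline; "Business" now gets its own column
use2 <- relevel(use, ref = "Pleasure")
dummy_cols2 <- colnames(model.matrix(~ use2))
print(dummy_cols2)
```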

data(AutoCollision) # load the dataset from insuranceData
str(AutoCollision)
## 'data.frame':    32 obs. of  4 variables:
##  $ Age        : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Vehicle_Use: Factor w/ 4 levels "Business","DriveLong",..: 4 3 2 1 4 3 2 1 4 3 ...
##  $ Severity   : num  250 275 245 798 214 ...
##  $ Claim_Count: int  21 40 23 5 63 171 92 44 140 343 ...
summary(AutoCollision)
##       Age        Vehicle_Use    Severity      Claim_Count   
##  A      :4   Business  :8    Min.   :153.6   Min.   :  5.0  
##  B      :4   DriveLong :8    1st Qu.:212.4   1st Qu.:116.2  
##  C      :4   DriveShort:8    Median :250.5   Median :208.0  
##  D      :4   Pleasure  :8    Mean   :276.4   Mean   :279.4  
##  E      :4                   3rd Qu.:298.2   3rd Qu.:366.0  
##  F      :4                   Max.   :797.8   Max.   :970.0  
##  (Other):8

Initial Data Understanding

Use a density plot to see the distribution of the data. Claim_Count is right-skewed (positively skewed), which is why its mean (279.4) is larger than its median (208.0) in the summary above.

ggplot(AutoCollision, aes(Claim_Count)) + geom_density(fill = "#FA8072") # set the colour outside aes() so the hex code is used as-is

It appears that short drives (DriveShort) account for more claims than long drives (DriveLong).

ggplot(AutoCollision, aes(x=Vehicle_Use, y=Claim_Count, fill=Vehicle_Use))+geom_col()

Age group F has the highest claim count.

ggplot(AutoCollision, aes(x=Age, y=Claim_Count, fill=Age))+geom_col()

There is one outlier: a Business-use observation (age group A) with an unusually high severity of about 798.

ggplot(AutoCollision,aes(x=Vehicle_Use, y=Severity, col=Age, group=Age))+geom_jitter()

Poisson Regression

Poisson regression is suitable in this case because our target variable is a count: a non-negative integer rather than a continuous value (for a continuous target, ordinary linear regression would be the usual starting point). However, remember the caveat of using the Poisson distribution: it assumes that the mean is equal to the variance.
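
One common way to check the mean-equals-variance caveat is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom, which should be close to 1 for a well-behaved Poisson fit. The sketch below uses simulated data, not the AutoCollision data, and every name and parameter value in it is an assumption for illustration.

```r
set.seed(42)
n  <- 500
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)

# Helper (hypothetical name): Pearson chi-square divided by residual df
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

# Counts that really are Poisson: variance equals the mean
y_pois    <- rpois(n, lambda = mu)
disp_pois <- pearson_dispersion(glm(y_pois ~ x, family = poisson(link = "log")))

# Overdispersed counts (negative binomial): variance exceeds the mean
y_over    <- rnbinom(n, mu = mu, size = 1)
disp_over <- pearson_dispersion(glm(y_over ~ x, family = poisson(link = "log")))

print(round(c(poisson = disp_pois, overdispersed = disp_over), 2))
```

When the statistic is well above 1, a quasipoisson or negative binomial model is often considered instead.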

Looking at the model results, all coefficients are significant at the 5% level except Vehicle_Use (Pleasure). AIC is only useful when we have another comparable Poisson model, fitted to the same data, to compare against. The Akaike Information Criterion (AIC) estimates the relative amount of information lost by a model, so the less information lost (the lower the AIC), the better the model. Usually in this case we would remove the Vehicle_Use feature, refit, and see whether AIC improves (i.e. gives a lower value). However, since we are dealing with a small sample and few variables, and the other two Vehicle_Use levels are highly significant, I doubt it would improve, so we skip that step here.
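
The refit-and-compare step described above could look like the following sketch. It uses simulated data so that it runs on its own (the data frame `sim` and its columns are hypothetical); update() drops one term from an existing fit and AIC() puts the candidate models side by side.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x     = rnorm(n)
)
# Simulated counts that depend on both the factor and the numeric predictor
sim$y <- rpois(n, lambda = exp(1 + 0.8 * (sim$group == "b") +
                               1.2 * (sim$group == "c") + 0.3 * sim$x))

full    <- glm(y ~ group + x, family = poisson(link = "log"), data = sim)
reduced <- update(full, . ~ . - group)   # same model without the factor

aic_table <- AIC(full, reduced)          # lower AIC is preferred
print(aic_table)
```

Applied to this post's model, once it has been fitted below, the same pattern would be update(claimPoisson_log, . ~ . - Vehicle_Use) followed by AIC() on both fits.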

claimPoisson_log <- glm(Claim_Count~., family=poisson(link="log"), data=AutoCollision)
summary(claimPoisson_log) #use summary to view the model results 
## 
## Call:
## glm(formula = Claim_Count ~ ., family = poisson(link = "log"), 
##     data = AutoCollision)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4241  -1.8265   0.3862   1.6265   5.7016  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            3.2015756  0.2639339  12.130  < 2e-16 ***
## AgeB                   1.3942637  0.1180954  11.806  < 2e-16 ***
## AgeC                   2.2786347  0.1116319  20.412  < 2e-16 ***
## AgeD                   2.4271341  0.1116924  21.731  < 2e-16 ***
## AgeE                   2.3850965  0.1211942  19.680  < 2e-16 ***
## AgeF                   3.0679957  0.1147927  26.726  < 2e-16 ***
## AgeG                   2.8529720  0.1144980  24.917  < 2e-16 ***
## AgeH                   2.4841704  0.1161438  21.389  < 2e-16 ***
## Vehicle_UseDriveLong   0.7597295  0.0603762  12.583  < 2e-16 ***
## Vehicle_UseDriveShort  1.0285515  0.0834527  12.325  < 2e-16 ***
## Vehicle_UsePleasure   -0.1076671  0.0910142  -1.183 0.236821    
## Severity              -0.0020475  0.0006133  -3.339 0.000842 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6064.97  on 31  degrees of freedom
## Residual deviance:  172.58  on 20  degrees of freedom
## AIC: 420.67
## 
## Number of Fisher Scoring iterations: 4
exp(coef(claimPoisson_log))
##           (Intercept)                  AgeB                  AgeC 
##            24.5712152             4.0320047             9.7633417 
##                  AgeD                  AgeE                  AgeF 
##            11.3263749            10.8601101            21.4987687 
##                  AgeG                  AgeH  Vehicle_UseDriveLong 
##            17.3392378            11.9911683             2.1376979 
## Vehicle_UseDriveShort   Vehicle_UsePleasure              Severity 
##             2.7970115             0.8979264             0.9979546
exp(confint(claimPoisson_log))
## Waiting for profiling to be done...
##                            2.5 %     97.5 %
## (Intercept)           14.6515890 41.2462622
## AgeB                   3.2155068  5.1114959
## AgeC                   7.8939615 12.2338469
## AgeD                   9.1576598 14.1954896
## AgeE                   8.6165064 13.8631904
## AgeF                  17.2803057 27.1131770
## AgeG                  13.9445032 21.8540507
## AgeH                   9.6108991 15.1599108
## Vehicle_UseDriveLong   1.8980290  2.4047450
## Vehicle_UseDriveShort  2.3707770  3.2877716
## Vehicle_UsePleasure    0.7497538  1.0710268
## Severity               0.9967304  0.9991272
plot(claimPoisson_log)
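
Two quick readings of the output above, as a worked sketch using numbers copied from it. Because the model uses a log link, exponentiating a coefficient gives a multiplicative effect on the expected claim count relative to the reference level. The residual deviance divided by its degrees of freedom gives a rough check on the mean-equals-variance assumption, where values well above 1 hint at overdispersion.

```r
# Coefficients copied from the summary output above (log scale)
beta <- c(DriveLong = 0.7597295, DriveShort = 1.0285515, Pleasure = -0.1076671)

# Multiplicative effect on expected claims versus the Business reference level
rate_ratio <- exp(beta)
print(round(rate_ratio, 3))

# Residual deviance and its degrees of freedom, copied from the same output
dispersion_ratio <- 172.58 / 20
print(dispersion_ratio)
```

Read this way, DriveLong is associated with roughly 2.1 times the expected claim count of Business use, holding the other variables fixed, while a ratio of about 8.6 suggests the Poisson variance assumption is strained for this fit.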

All in all, if we want to predict claim counts, we will need to split the sample into training, validation and test sets to find out how accurately the model predicts on data it has not seen.
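
A hold-out evaluation along those lines might be sketched as below. It uses simulated data and a simple train/test split without a separate validation set (the data frame `sim`, the 70/30 ratio and the choice of mean absolute error are all assumptions for illustration). For a count target, an error measure such as MAE is used here rather than a classification-style accuracy.

```r
set.seed(123)
n <- 300
sim <- data.frame(x = rnorm(n))
sim$y <- rpois(n, lambda = exp(0.5 + 0.7 * sim$x))

# Random 70/30 split into training and test rows
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- glm(y ~ x, family = poisson(link = "log"), data = train)

# type = "response" returns predicted counts rather than log counts
pred <- predict(fit, newdata = test, type = "response")
mae  <- mean(abs(test$y - pred))

# Naive baseline: always predict the mean count of the training set
mae_baseline <- mean(abs(test$y - mean(train$y)))
print(c(model = mae, baseline = mae_baseline))
```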

But then again, we would need more features, a larger sample size and better proxy features than this!