Generalized linear models (GLMs) extend linear regression to response variables whose error distribution is not normal. The expected value of the response is related to a linear combination of the predictors through a link function, and the chosen distribution family lets the variance of each observation depend on its predicted value. A common use case is a Boolean (Bernoulli) response, where the quantity being modeled is a probability and must stay between 0 and 1; such data are handled with the binomial family. Count data are another good candidate, usually through a log-linear model with a log link. For a binary response, the log-odds model uses the logit function as the link, which gives logistic regression.
From Wikipedia, the GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
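To make the link function concrete, here is a small sketch in base R of the logit link and its inverse. The probabilities are made-up illustrative values, not taken from the dataset. `qlogis()` and `plogis()` are the base R logit and inverse-logit functions, and `binomial()` returns the family object whose default link is the logit.

```r
# Illustrative probabilities (made up for this sketch)
p <- c(0.01, 0.25, 0.50, 0.75, 0.99)

# Logit link: maps probabilities in (0, 1) to the whole real line
eta <- log(p / (1 - p))

# Base R provides the same transformation as qlogis()
all.equal(eta, qlogis(p))

# The binomial family object carries its default link (logit) and inverse
fam <- binomial()
all.equal(eta, fam$linkfun(p))

# Inverse link: maps any linear predictor back into (0, 1)
p_back <- fam$linkinv(eta)
all.equal(p_back, plogis(eta))

# Binomial variance function: the variance depends on the mean, mu * (1 - mu)
fam$variance(p)
```

The last line shows the second ingredient of a GLM: the variance is largest at a probability of 0.5 and shrinks toward 0 and 1, rather than being constant as in ordinary linear regression.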
The data come from https://www.kaggle.com/mhdzahier/travel-insurance. The dependent variable, Claim, records whether a travel insurance claim was filed (yes or no). I work as a software engineer for a large auto insurance company based in San Antonio, Texas, so I found this dataset intriguing as an exercise in predicting insurance claims. The independent variables used in the implementation below are Product, Duration and Age.
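# rcompanion provides compareGLM(), used below to compare the fitted models
library(rcompanion)
travel_data <- read.csv("travelinsurance.csv")
# Recode the response as 0/1 so it can be modeled with the binomial family
travel_data$Claim <- ifelse(travel_data$Claim == 'Yes', 1, 0)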
summary(travel_data)
## Agency Agency_Type Distribution_Channel
## EPX :35119 Airlines :17457 Offline: 1107
## CWT : 8580 Travel Agency:45869 Online :62219
## C2B : 8267
## JZI : 6329
## SSI : 1056
## JWT : 749
## (Other): 3226
## Product Claim Duration
## Cancellation Plan :18630 Min. :0.00000 Min. : -2.00
## 2 way Comprehensive Plan :13158 1st Qu.:0.00000 1st Qu.: 9.00
## Rental Vehicle Excess Insurance: 8580 Median :0.00000 Median : 22.00
## Basic Plan : 5469 Mean :0.01464 Mean : 49.32
## Bronze Plan : 4049 3rd Qu.:0.00000 3rd Qu.: 53.00
## 1 way Comprehensive Plan : 3331 Max. :1.00000 Max. :4881.00
## (Other) :10109
## Destination Net_Sales Commision Gender
## SINGAPORE:13255 Min. :-389.00 Min. : 0.00 :45107
## MALAYSIA : 5930 1st Qu.: 18.00 1st Qu.: 0.00 F: 8872
## THAILAND : 5894 Median : 26.53 Median : 0.00 M: 9347
## CHINA : 4796 Mean : 40.70 Mean : 9.81
## AUSTRALIA: 3694 3rd Qu.: 48.00 3rd Qu.: 11.55
## INDONESIA: 3452 Max. : 810.00 Max. :283.50
## (Other) :26305
## Age
## Min. : 0.00
## 1st Qu.: 35.00
## Median : 36.00
## Mean : 39.97
## 3rd Qu.: 43.00
## Max. :118.00
##
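The summary above shows a minimum Duration of -2, a maximum Duration of 4881 and a maximum Age of 118, which may be data-entry artifacts. The analysis below keeps every row. As a minimal sketch only, here is how such rows could be flagged before modeling; the toy data frame and both upper limits are assumptions made for illustration, not values from the source.

```r
# Toy data frame standing in for travel_data (values are made up)
toy <- data.frame(
  Duration = c(-2, 9, 22, 53, 4881),
  Age      = c(36, 35, 118, 43, 0)
)

# Hypothetical plausibility limits; adjust to whatever rules apply
max_duration <- 730   # assumed upper bound, in days
max_age      <- 100   # assumed upper bound, in years

# Flag rows that violate either limit
toy$suspect <- toy$Duration < 0 | toy$Duration > max_duration |
  toy$Age > max_age

# Count flagged rows and keep the rest for modeling
sum(toy$suspect)
clean <- toy[!toy$suspect, ]
nrow(clean)
```

Whether to drop, cap or keep such rows is a judgment call that depends on what the fields actually mean in the source system.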
For the sake of comparison, I build three GLMs, adding one predictor at a time: Product alone, then Product and Duration, then Product, Duration and Age.
glmod1 <- glm(Claim ~ Product, data=travel_data, family=binomial())
summary(glmod1)
##
## Call:
## glm(formula = Claim ~ Product, family = binomial(), data = travel_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4835 -0.1473 -0.1185 -0.0688 3.4780
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -5.9111 0.3338 -17.709
## Product2 way Comprehensive Plan 1.3930 0.3443 4.046
## Product24 Protect -10.6550 152.6797 -0.070
## ProductAnnual Gold Plan 3.8023 0.4060 9.366
## ProductAnnual Silver Plan 3.8237 0.3443 11.104
## ProductAnnual Travel Protect Gold 3.7139 0.4717 7.873
## ProductAnnual Travel Protect Platinum 3.0977 0.6817 4.544
## ProductAnnual Travel Protect Silver 2.8907 0.6112 4.729
## ProductBasic Plan 0.4440 0.3938 1.127
## ProductBronze Plan 3.0052 0.3412 8.807
## ProductCancellation Plan -0.1349 0.3663 -0.368
## ProductChild Comprehensive Plan -10.6550 799.8483 -0.013
## ProductComprehensive Plan 1.6372 0.5605 2.921
## ProductGold Plan 2.9903 0.4123 7.253
## ProductIndividual Comprehensive Plan 2.7470 0.6774 4.055
## ProductPremier Plan 1.7574 0.6708 2.620
## ProductRental Vehicle Excess Insurance 1.3183 0.3509 3.757
## ProductSilver Plan 3.2064 0.3449 9.296
## ProductSingle Trip Travel Protect Gold 2.9458 0.4654 6.330
## ProductSingle Trip Travel Protect Platinum 3.3010 0.5711 5.780
## ProductSingle Trip Travel Protect Silver 2.1675 0.6061 3.576
## ProductSpouse or Parents Comprehensive Plan 3.2720 1.0876 3.009
## ProductTicket Protector 0.9014 0.5052 1.784
## ProductTravel Cruise Protect 0.3408 0.7831 0.435
## ProductTravel Cruise Protect Family -10.6550 2399.5447 -0.004
## ProductValue Plan 0.9560 0.4055 2.358
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Product2 way Comprehensive Plan 5.21e-05 ***
## Product24 Protect 0.944364
## ProductAnnual Gold Plan < 2e-16 ***
## ProductAnnual Silver Plan < 2e-16 ***
## ProductAnnual Travel Protect Gold 3.46e-15 ***
## ProductAnnual Travel Protect Platinum 5.52e-06 ***
## ProductAnnual Travel Protect Silver 2.25e-06 ***
## ProductBasic Plan 0.259583
## ProductBronze Plan < 2e-16 ***
## ProductCancellation Plan 0.712732
## ProductChild Comprehensive Plan 0.989372
## ProductComprehensive Plan 0.003491 **
## ProductGold Plan 4.06e-13 ***
## ProductIndividual Comprehensive Plan 5.00e-05 ***
## ProductPremier Plan 0.008796 **
## ProductRental Vehicle Excess Insurance 0.000172 ***
## ProductSilver Plan < 2e-16 ***
## ProductSingle Trip Travel Protect Gold 2.45e-10 ***
## ProductSingle Trip Travel Protect Platinum 7.45e-09 ***
## ProductSingle Trip Travel Protect Silver 0.000349 ***
## ProductSpouse or Parents Comprehensive Plan 0.002625 **
## ProductTicket Protector 0.074377 .
## ProductTravel Cruise Protect 0.663397
## ProductTravel Cruise Protect Family 0.996457
## ProductValue Plan 0.018387 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9671.8 on 63325 degrees of freedom
## Residual deviance: 8270.5 on 63300 degrees of freedom
## AIC: 8322.5
##
## Number of Fisher Scoring iterations: 15
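The coefficients in the summary above are on the log-odds scale, which is hard to read directly. The sketch below converts a few of them to odds ratios and probabilities. The three estimates are copied by hand from the output above. The baseline is the product level that has no coefficient of its own, which appears to be "1 way Comprehensive Plan".

```r
# Estimates copied from the summary(glmod1) output above
intercept   <- -5.9111   # baseline product level
two_way     <-  1.3930   # 2 way Comprehensive Plan
annual_silv <-  3.8237   # Annual Silver Plan

# Baseline claim probability: inverse logit of the intercept
p_base <- plogis(intercept)

# Odds ratios relative to the baseline product
or_two_way <- exp(two_way)
or_silver  <- exp(annual_silv)

# Predicted claim probability for each product
p_two_way <- plogis(intercept + two_way)
p_silver  <- plogis(intercept + annual_silv)

round(c(baseline = p_base, two_way = p_two_way, annual_silver = p_silver), 4)
round(c(two_way = or_two_way, annual_silver = or_silver), 2)
```

Read this way, the baseline product has a claim probability of roughly 0.3%, the 2 way Comprehensive Plan roughly 1.1%, and the Annual Silver Plan roughly 11%.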
glmod2 <- glm(Claim ~ Product + Duration, data=travel_data, family=binomial())
summary(glmod2)
##
## Call:
## glm(formula = Claim ~ Product + Duration, family = binomial(),
## data = travel_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5196 -0.1468 -0.1183 -0.0690 3.4825
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -5.921e+00 3.338e-01 -17.735
## Product2 way Comprehensive Plan 1.384e+00 3.443e-01 4.020
## Product24 Protect -1.066e+01 1.527e+02 -0.070
## ProductAnnual Gold Plan 3.655e+00 4.175e-01 8.755
## ProductAnnual Silver Plan 3.675e+00 3.580e-01 10.265
## ProductAnnual Travel Protect Gold 3.563e+00 4.821e-01 7.391
## ProductAnnual Travel Protect Platinum 2.948e+00 6.888e-01 4.280
## ProductAnnual Travel Protect Silver 2.735e+00 6.199e-01 4.412
## ProductBasic Plan 4.395e-01 3.938e-01 1.116
## ProductBronze Plan 3.004e+00 3.412e-01 8.803
## ProductCancellation Plan -1.421e-01 3.664e-01 -0.388
## ProductChild Comprehensive Plan -1.080e+01 7.998e+02 -0.014
## ProductComprehensive Plan 1.615e+00 5.607e-01 2.881
## ProductGold Plan 2.987e+00 4.123e-01 7.244
## ProductIndividual Comprehensive Plan 2.601e+00 6.842e-01 3.802
## ProductPremier Plan 1.746e+00 6.708e-01 2.603
## ProductRental Vehicle Excess Insurance 1.310e+00 3.510e-01 3.733
## ProductSilver Plan 3.204e+00 3.449e-01 9.289
## ProductSingle Trip Travel Protect Gold 2.945e+00 4.654e-01 6.329
## ProductSingle Trip Travel Protect Platinum 3.299e+00 5.711e-01 5.777
## ProductSingle Trip Travel Protect Silver 2.166e+00 6.061e-01 3.573
## ProductSpouse or Parents Comprehensive Plan 3.127e+00 1.092e+00 2.864
## ProductTicket Protector 8.087e-01 5.194e-01 1.557
## ProductTravel Cruise Protect 3.333e-01 7.832e-01 0.426
## ProductTravel Cruise Protect Family -1.066e+01 2.400e+03 -0.004
## ProductValue Plan 9.516e-01 4.055e-01 2.347
## Duration 4.207e-04 2.778e-04 1.514
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Product2 way Comprehensive Plan 5.81e-05 ***
## Product24 Protect 0.944345
## ProductAnnual Gold Plan < 2e-16 ***
## ProductAnnual Silver Plan < 2e-16 ***
## ProductAnnual Travel Protect Gold 1.46e-13 ***
## ProductAnnual Travel Protect Platinum 1.87e-05 ***
## ProductAnnual Travel Protect Silver 1.03e-05 ***
## ProductBasic Plan 0.264385
## ProductBronze Plan < 2e-16 ***
## ProductCancellation Plan 0.698196
## ProductChild Comprehensive Plan 0.989227
## ProductComprehensive Plan 0.003964 **
## ProductGold Plan 4.34e-13 ***
## ProductIndividual Comprehensive Plan 0.000144 ***
## ProductPremier Plan 0.009243 **
## ProductRental Vehicle Excess Insurance 0.000189 ***
## ProductSilver Plan < 2e-16 ***
## ProductSingle Trip Travel Protect Gold 2.47e-10 ***
## ProductSingle Trip Travel Protect Platinum 7.61e-09 ***
## ProductSingle Trip Travel Protect Silver 0.000352 ***
## ProductSpouse or Parents Comprehensive Plan 0.004184 **
## ProductTicket Protector 0.119463
## ProductTravel Cruise Protect 0.670407
## ProductTravel Cruise Protect Family 0.996456
## ProductValue Plan 0.018933 *
## Duration 0.129932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9671.8 on 63325 degrees of freedom
## Residual deviance: 8269.0 on 63299 degrees of freedom
## AIC: 8323
##
## Number of Fisher Scoring iterations: 15
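Three product levels in the summaries (24 Protect, Child Comprehensive Plan and Travel Cruise Protect Family) have estimates near -10.7 with standard errors in the hundreds or thousands. That pattern typically appears when a factor level contains no events at all, although the output alone does not confirm that this is the cause here. A minimal sketch on simulated data (every name and number below is made up) reproduces the symptom:

```r
set.seed(42)

# Group A has claims; group B is constructed to have none
toy <- data.frame(
  group = factor(rep(c("A", "B"), times = c(200, 50))),
  claim = c(rbinom(200, size = 1, prob = 0.3), rep(0, 50))
)

# Confirm that group B contains zero events
table(toy$group, toy$claim)

fit <- glm(claim ~ group, data = toy, family = binomial())

# The groupB estimate drifts toward -Inf and its standard error blows up
coefs <- summary(fit)$coefficients
coefs["groupB", c("Estimate", "Std. Error")]
```

In the real data, a cross-tabulation such as `table(travel_data$Product, travel_data$Claim)` would show whether those three products have zero recorded claims. If so, their large p-values reflect an estimate that cannot be pinned down, not evidence that the product has no effect.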
glmod3 <- glm(Claim ~ Product + Duration + Age, data=travel_data, family=binomial())
summary(glmod3)
##
## Call:
## glm(formula = Claim ~ Product + Duration + Age, family = binomial(),
## data = travel_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5239 -0.1472 -0.1226 -0.0694 3.5430
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -5.759e+00 3.479e-01 -16.551
## Product2 way Comprehensive Plan 1.388e+00 3.443e-01 4.031
## Product24 Protect -1.061e+01 1.527e+02 -0.069
## ProductAnnual Gold Plan 3.688e+00 4.180e-01 8.823
## ProductAnnual Silver Plan 3.692e+00 3.582e-01 10.308
## ProductAnnual Travel Protect Gold 3.574e+00 4.822e-01 7.412
## ProductAnnual Travel Protect Platinum 2.978e+00 6.891e-01 4.321
## ProductAnnual Travel Protect Silver 2.751e+00 6.200e-01 4.437
## ProductBasic Plan 4.728e-01 3.943e-01 1.199
## ProductBronze Plan 3.002e+00 3.412e-01 8.797
## ProductCancellation Plan -1.419e-01 3.664e-01 -0.387
## ProductChild Comprehensive Plan -1.090e+01 7.998e+02 -0.014
## ProductComprehensive Plan 1.724e+00 5.642e-01 3.056
## ProductGold Plan 3.001e+00 4.124e-01 7.278
## ProductIndividual Comprehensive Plan 2.630e+00 6.845e-01 3.842
## ProductPremier Plan 1.775e+00 6.711e-01 2.646
## ProductRental Vehicle Excess Insurance 1.326e+00 3.511e-01 3.777
## ProductSilver Plan 3.208e+00 3.449e-01 9.299
## ProductSingle Trip Travel Protect Gold 2.959e+00 4.654e-01 6.356
## ProductSingle Trip Travel Protect Platinum 3.319e+00 5.712e-01 5.810
## ProductSingle Trip Travel Protect Silver 2.176e+00 6.061e-01 3.589
## ProductSpouse or Parents Comprehensive Plan 3.164e+00 1.092e+00 2.897
## ProductTicket Protector 8.647e-01 5.204e-01 1.662
## ProductTravel Cruise Protect 3.921e-01 7.839e-01 0.500
## ProductTravel Cruise Protect Family -1.068e+01 2.400e+03 -0.004
## ProductValue Plan 1.071e+00 4.109e-01 2.608
## Duration 4.186e-04 2.783e-04 1.504
## Age -4.433e-03 2.695e-03 -1.645
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Product2 way Comprehensive Plan 5.56e-05 ***
## Product24 Protect 0.944620
## ProductAnnual Gold Plan < 2e-16 ***
## ProductAnnual Silver Plan < 2e-16 ***
## ProductAnnual Travel Protect Gold 1.24e-13 ***
## ProductAnnual Travel Protect Platinum 1.55e-05 ***
## ProductAnnual Travel Protect Silver 9.12e-06 ***
## ProductBasic Plan 0.230485
## ProductBronze Plan < 2e-16 ***
## ProductCancellation Plan 0.698613
## ProductChild Comprehensive Plan 0.989123
## ProductComprehensive Plan 0.002245 **
## ProductGold Plan 3.39e-13 ***
## ProductIndividual Comprehensive Plan 0.000122 ***
## ProductPremier Plan 0.008157 **
## ProductRental Vehicle Excess Insurance 0.000159 ***
## ProductSilver Plan < 2e-16 ***
## ProductSingle Trip Travel Protect Gold 2.07e-10 ***
## ProductSingle Trip Travel Protect Platinum 6.26e-09 ***
## ProductSingle Trip Travel Protect Silver 0.000332 ***
## ProductSpouse or Parents Comprehensive Plan 0.003769 **
## ProductTicket Protector 0.096603 .
## ProductTravel Cruise Protect 0.616910
## ProductTravel Cruise Protect Family 0.996449
## ProductValue Plan 0.009117 **
## Duration 0.132521
## Age 0.100071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9671.8 on 63325 degrees of freedom
## Residual deviance: 8266.2 on 63298 degrees of freedom
## AIC: 8322.2
##
## Number of Fisher Scoring iterations: 15
compareGLM(glmod1, glmod2, glmod3)
## $Models
## Formula
## 1 "Claim ~ Product"
## 2 "Claim ~ Product + Duration"
## 3 "Claim ~ Product + Duration + Age"
##
## $Fit.criteria
## Rank Df.res AIC AICc BIC McFadden Cox.and.Snell Nagelkerke p.value
## 1 26 63300 8325 8325 8569 0.1449 0.02189 0.1545 9.844e-281
## 2 27 63300 8325 8325 8579 0.1450 0.02191 0.1547 3.547e-280
## 3 28 63300 8324 8324 8587 0.1453 0.02195 0.1550 6.763e-280
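Comparing the three GLMs shows nearly identical AIC values (8325, 8325 and 8324 in the compareGLM output), and the pseudo R-squared measures barely move (McFadden goes from 0.1449 to 0.1453). The third model has the lowest AIC, but only by about one unit, which is too small a margin to call a real improvement. BIC, which penalizes extra parameters more heavily, rises from 8569 to 8587 and favors the first model. Consistent with this, neither Duration nor Age is significant at the 0.05 level in the model summaries.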
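Because the three models are nested, the comparison can also be framed as a likelihood ratio test on the drop in residual deviance. The sketch below works only from the residual deviances and degrees of freedom printed in the three summaries above, so it does not need the fitted objects. The step labels are my own.

```r
# Residual deviances and residual degrees of freedom from the summaries above
dev    <- c(glmod1 = 8270.5, glmod2 = 8269.0, glmod3 = 8266.2)
df_res <- c(glmod1 = 63300,  glmod2 = 63299,  glmod3 = 63298)

# Drop in deviance and number of parameters added at each step
dev_drop <- -diff(dev)
df_added <- -diff(df_res)

# Likelihood ratio test: compare each drop to a chi-squared distribution
p_value <- pchisq(dev_drop, df = df_added, lower.tail = FALSE)

data.frame(step = c("add Duration", "add Age"), dev_drop, df_added,
           p_value = round(p_value, 3))
```

Neither step reaches the 0.05 level, which agrees with the Wald p-values in the summaries. With the fitted objects available, `anova(glmod1, glmod2, glmod3, test = "Chisq")` performs the same test directly.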
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following object is masked from 'package:base':
##
## kappa
lmod <- lm(Claim ~ Product + Duration + Age, data=travel_data)
summary(lmod)
##
## Call:
## lm(formula = Claim ~ Product + Duration + Age, data = travel_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11303 -0.01082 -0.00811 -0.00260 1.00258
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.915e-03 2.476e-03 1.985
## Product2 way Comprehensive Plan 8.008e-03 2.292e-03 3.494
## Product24 Protect -1.982e-03 7.795e-03 -0.254
## ProductAnnual Gold Plan 1.037e-01 8.946e-03 11.593
## ProductAnnual Silver Plan 1.055e-01 4.247e-03 24.852
## ProductAnnual Travel Protect Gold 9.508e-02 1.215e-02 7.826
## ProductAnnual Travel Protect Platinum 5.197e-02 1.646e-02 3.156
## ProductAnnual Travel Protect Silver 4.159e-02 1.306e-02 3.185
## ProductBasic Plan 1.939e-03 2.611e-03 0.743
## ProductBronze Plan 4.912e-02 2.760e-03 17.796
## ProductCancellation Plan -4.499e-04 2.222e-03 -0.202
## ProductChild Comprehensive Plan -6.501e-03 3.944e-02 -0.165
## ProductComprehensive Plan 1.238e-02 6.596e-03 1.877
## ProductGold Plan 4.861e-02 6.615e-03 7.349
## ProductIndividual Comprehensive Plan 3.596e-02 1.401e-02 2.566
## ProductPremier Plan 1.304e-02 8.720e-03 1.495
## ProductRental Vehicle Excess Insurance 7.434e-03 2.415e-03 3.078
## ProductSilver Plan 6.002e-02 3.221e-03 18.636
## ProductSingle Trip Travel Protect Gold 4.652e-02 8.511e-03 5.465
## ProductSingle Trip Travel Protect Platinum 6.606e-02 1.396e-02 4.732
## ProductSingle Trip Travel Protect Silver 2.055e-02 9.202e-03 2.234
## ProductSpouse or Parents Comprehensive Plan 6.221e-02 3.060e-02 2.033
## ProductTicket Protector 4.075e-03 4.234e-03 0.963
## ProductTravel Cruise Protect 1.870e-03 5.558e-03 0.336
## ProductTravel Cruise Protect Family -3.049e-03 1.180e-01 -0.026
## ProductValue Plan 6.139e-03 3.252e-03 1.888
## Duration 6.624e-06 5.690e-06 1.164
## Age -6.453e-05 3.790e-05 -1.702
## Pr(>|t|)
## (Intercept) 0.047125 *
## Product2 way Comprehensive Plan 0.000476 ***
## Product24 Protect 0.799298
## ProductAnnual Gold Plan < 2e-16 ***
## ProductAnnual Silver Plan < 2e-16 ***
## ProductAnnual Travel Protect Gold 5.10e-15 ***
## ProductAnnual Travel Protect Platinum 0.001598 **
## ProductAnnual Travel Protect Silver 0.001450 **
## ProductBasic Plan 0.457684
## ProductBronze Plan < 2e-16 ***
## ProductCancellation Plan 0.839543
## ProductChild Comprehensive Plan 0.869088
## ProductComprehensive Plan 0.060492 .
## ProductGold Plan 2.03e-13 ***
## ProductIndividual Comprehensive Plan 0.010278 *
## ProductPremier Plan 0.134842
## ProductRental Vehicle Excess Insurance 0.002086 **
## ProductSilver Plan < 2e-16 ***
## ProductSingle Trip Travel Protect Gold 4.65e-08 ***
## ProductSingle Trip Travel Protect Platinum 2.23e-06 ***
## ProductSingle Trip Travel Protect Silver 0.025517 *
## ProductSpouse or Parents Comprehensive Plan 0.042044 *
## ProductTicket Protector 0.335788
## ProductTravel Cruise Protect 0.736503
## ProductTravel Cruise Protect Family 0.979391
## ProductValue Plan 0.059043 .
## Duration 0.244401
## Age 0.088676 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.118 on 63298 degrees of freedom
## Multiple R-squared: 0.03517, Adjusted R-squared: 0.03476
## F-statistic: 85.46 on 27 and 63298 DF, p-value: < 2.2e-16
AIC(lmod)
## [1] -90927.14
#mae(lmod)
DMwR::regr.eval(travel_data$Claim, lmod$fitted.values)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## mae mse rmse mape
## 0.02783661 0.01391691 0.11796994 Inf
The naive linear regression model above was created with the same independent variables as the third GLM, for comparison purposes.
AIC(glmod3)
## [1] 8322.245
mae(glmod3)
## [1] 0.02782946
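Interestingly, the AIC of the linear regression model is negative (-90927.14), while the AIC of the GLM is positive (8322.245). The sign difference comes from the likelihoods involved. The linear model assumes a Gaussian density, which can exceed 1 when the residual standard error is small (0.118 here) and therefore yields a positive log-likelihood. The binomial likelihood is a probability, so its log is never positive. The two AIC values are computed on different scales and should not be compared directly.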
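The sign of the AIC can be explored on simulated data. The sketch below fits both model types to a made-up rare binary outcome (not the travel data) and inspects the log-likelihood behind each AIC. The simulation settings are arbitrary assumptions chosen to give an event rate of a few percent.

```r
set.seed(1)

# Simulated rare binary outcome (about 2% events); not the travel data
n <- 5000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-4 + 0.5 * x))

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = binomial())

# Gaussian log-likelihood is positive when the residual spread is small,
# because the normal density exceeds 1 near its mean
logLik(fit_lm)
AIC(fit_lm)     # negative

# Binomial log-likelihood is a log-probability, so it is never positive
logLik(fit_glm)
AIC(fit_glm)    # positive

# AIC is -2 * logLik + 2 * (number of estimated parameters)
k_lm <- attr(logLik(fit_lm), "df")   # 2 coefficients + residual variance
all.equal(AIC(fit_lm), -2 * as.numeric(logLik(fit_lm)) + 2 * k_lm)
```

The same data produce AIC values of opposite sign simply because one likelihood is a density and the other is a probability mass, so a lower AIC for the linear model says nothing about which model fits better.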
As for the mean absolute error (MAE), the GLM slightly outperformed the linear model (0.02782946 versus 0.02783661), a difference of roughly 0.000007.
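To put the MAE values in context, it helps to know what a model with no predictors would score. The sketch below computes that baseline from the mean of Claim reported by `summary(travel_data)` (0.01464). The helper `mae_manual` is a hypothetical name introduced here, not a function from ModelMetrics or DMwR.

```r
# Helper: mean absolute error between observed and predicted values
mae_manual <- function(actual, predicted) mean(abs(actual - predicted))

# Claim rate reported by summary(travel_data) above
p_claim <- 0.01464

# MAE of an intercept-only prediction (always predicting p_claim):
# a share p_claim of rows errs by (1 - p_claim), the rest err by p_claim
baseline_mae <- 2 * p_claim * (1 - p_claim)
baseline_mae

# Check the formula on a small made-up 0/1 vector
y_toy <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
p_toy <- mean(y_toy)
all.equal(mae_manual(y_toy, rep(p_toy, length(y_toy))), 2 * p_toy * (1 - p_toy))
```

The baseline comes out at about 0.0289, against roughly 0.0278 for both fitted models. On a response this rare, MAE is dominated by the base rate, so it separates the two models far less than their nearly identical scores might suggest.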
Overall, generalized linear models allow linear modeling to be applied to a wider array of datasets. As my blog entries have attempted to focus on real-world, observational datasets, GLMs can be a powerful tool for evaluating such data. The binary response in the implementation above is a good use case for the binomial family: the outcome takes only the values 0 and 1, so its errors cannot be normally distributed, and a straight-line fit can predict probabilities outside the range 0 to 1. The GLM offers a viable approach to building models that meaningfully evaluate non-normal, non-continuous responses.