For today’s discussion, we will review some potential pitfalls of using a GLM to model mortality. In particular, we will discuss the bias/variance tradeoff. Per Wikipedia:
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data, but also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.
A common way for data scientists to manage the bias/variance tradeoff is to split the data into a training set and a testing set. Candidate models are developed on the training data and validated on the testing data. This can be done with a simple data split (70/30 train/test is a common split). A more sophisticated approach is k-fold cross-validation.
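A minimal sketch of both approaches follows. The data are simulated purely for illustration, and every name (`df`, `ae_test`, `ae_by_fold`) as well as the true mortality curve are assumptions, not part of the study data. The hold-out metric used is the actual-to-expected (A/E) ratio.

```r
set.seed(42)  # reproducible simulated data (purely illustrative)

# Hypothetical data: deaths simulated from a known Poisson rate by attained age.
n  <- 2000
df <- data.frame(
  attained_age     = sample(30:90, n, replace = TRUE),
  policies_exposed = runif(n, 50, 500)
)
df$number_of_deaths <- rpois(n, df$policies_exposed * exp(-9 + 0.08 * df$attained_age))

# Simple 70/30 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

fit <- glm(number_of_deaths ~ attained_age + offset(log(policies_exposed)),
           family = poisson(link = "log"), data = train)

# A/E ratio on data the model never saw during fitting.
expected_test <- predict(fit, newdata = test, type = "response")
ae_test <- sum(test$number_of_deaths) / sum(expected_test)

# k-fold cross-validation: every row is held out exactly once.
k     <- 5
folds <- sample(rep(seq_len(k), length.out = n))
ae_by_fold <- sapply(seq_len(k), function(i) {
  fit_i    <- glm(number_of_deaths ~ attained_age + offset(log(policies_exposed)),
                  family = poisson(link = "log"), data = df[folds != i, ])
  held_out <- df[folds == i, ]
  sum(held_out$number_of_deaths) / sum(predict(fit_i, newdata = held_out, type = "response"))
})

ae_test
ae_by_fold
```

A hold-out A/E that drifts far from 1 while the training A/E stays near 1 is one sign of overfitting.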
Generally speaking, actuarial experience studies don’t use the train/test methods described above, so care must be taken to avoid both high bias and high variance. Actuarial best practices describe a good assumption as having 3 qualities:
By focusing on all 3 of these qualities, you can strike the right balance of bias and variance. To demonstrate these points, we will walk through some sample models.
For this section we will look at 2 very simple models. These models are obviously not candidates for a final model. However, these models very clearly demonstrate some key concepts that aren’t so easy to recognize in more complicated models.
Model 1: Deaths ~ Gender
##
## Call:
## glm(formula = number_of_deaths ~ gender + offset(log(policies_exposed)),
## family = poisson(link = "log"), data = df.full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.0671 -0.3504 -0.1733 -0.1020 20.1610
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.848198 0.002589 -1872.5 <2e-16 ***
## genderMale -0.057791 0.004661 -12.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 874656 on 1537853 degrees of freedom
## Residual deviance: 874501 on 1537852 degrees of freedom
## AIC: 1088822
##
## Number of Fisher Scoring iterations: 6
The summary indicates 2 factors: the intercept (-4.848198) and genderMale (-0.057791).
Because the model uses a log link, exponentiating these coefficients converts them to rates. This means the model predicts a mortality rate of .0078 for females (the base level). For males, the rate is 94.38% of the female rate. As any actuary would immediately tell you, something is wrong here: male mortality is higher than female mortality. Is there something wrong with the code? That would be my first guess, but that is not the case.
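The two figures quoted above can be reproduced from the printed coefficients. This is only a check of the arithmetic, with the estimates copied by hand from the Model 1 summary. The variable names are illustrative.

```r
# Coefficients copied from the Model 1 summary output.
intercept   <- -4.848198
gender_male <- -0.057791

female_rate <- exp(intercept)                # base-level (female) mortality rate
male_ratio  <- exp(gender_male)              # male rate as a multiple of the female rate
male_rate   <- exp(intercept + gender_male)  # implied male mortality rate

round(female_rate, 4)       # 0.0078
round(100 * male_ratio, 2)  # 94.38
```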
This is an example of Simpson’s paradox. A relatively well-known example (and my first introduction to it) compares the batting averages of Derek Jeter and David Justice across seasons.
For our example:
* Substitute attained age for season.
* Substitute mortality rate for batting average.
* Substitute male for David Justice.
* Substitute female for Derek Jeter.
Adding attained age to our model (or season to the batting average discussion) should address this misleading result. This was an obvious example, but if you are looking to a GLM for insights on a lesser-known variable, the same effect could mislead you with no obvious warning sign. Always verify new insights.
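The reversal can be reproduced on a tiny made-up data set. The numbers below are hypothetical and chosen only so that males have higher mortality within each age band but most male exposure sits at young ages. They are not taken from the study data.

```r
# Hypothetical toy data: male mortality is higher in BOTH age bands
# (0.002 vs 0.001 when young, 0.06 vs 0.05 when old), but the exposure mix differs.
toy <- data.frame(
  gender           = factor(c("Male", "Female", "Male", "Female")),
  age_band         = factor(c("Young", "Young", "Old", "Old"), levels = c("Young", "Old")),
  policies_exposed = c(9000, 1000, 1000, 9000),
  number_of_deaths = c(18, 1, 60, 450)
)

# Gender only: mirrors the structure of Model 1.
fit_gender <- glm(number_of_deaths ~ gender + offset(log(policies_exposed)),
                  family = poisson(link = "log"), data = toy)

# Gender plus age band: controls for the exposure mix.
fit_gender_age <- glm(number_of_deaths ~ gender + age_band + offset(log(policies_exposed)),
                      family = poisson(link = "log"), data = toy)

exp(coef(fit_gender)[["genderMale"]])      # below 1: males look lighter than females
exp(coef(fit_gender_age)[["genderMale"]])  # above 1: the direction reverses
```

Once age enters the model, the male multiplier moves to the expected side of 1, which is the same correction described above for the mortality data.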
Model 2: Deaths ~ Attained Age
##
## Call:
## glm(formula = number_of_deaths ~ attained_age.factor + offset(log(policies_exposed)),
## family = poisson(link = "log"), data = df.full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -15.4526 -0.2340 -0.1021 -0.0426 6.0033
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.424556 0.176777 -42.000 < 2e-16 ***
## attained_age.factor1 -1.272317 0.320435 -3.971 7.17e-05 ***
## attained_age.factor2 -1.682695 0.349513 -4.814 1.48e-06 ***
## attained_age.factor3 -1.690766 0.338502 -4.995 5.89e-07 ***
## attained_age.factor4 -1.982939 0.362284 -5.473 4.41e-08 ***
## attained_age.factor5 -2.541303 0.444853 -5.713 1.11e-08 ***
## attained_age.factor6 -1.616573 0.306186 -5.280 1.29e-07 ***
## attained_age.factor7 -2.518701 0.417208 -6.037 1.57e-09 ***
## attained_age.factor8 -2.351410 0.377292 -6.232 4.60e-10 ***
## attained_age.factor9 -1.985365 0.320434 -6.196 5.80e-10 ***
## attained_age.factor10 -2.867602 0.444500 -6.451 1.11e-10 ***
## attained_age.factor11 -2.052714 0.320433 -6.406 1.49e-10 ***
## attained_age.factor12 -1.894775 0.300122 -6.313 2.73e-10 ***
## attained_age.factor13 -1.726330 0.280836 -6.147 7.89e-10 ***
## attained_age.factor14 -1.164089 0.238518 -4.881 1.06e-06 ***
## attained_age.factor15 -0.635549 0.214373 -2.965 0.003030 **
## attained_age.factor16 -0.729981 0.215950 -3.380 0.000724 ***
## attained_age.factor17 -0.474676 0.206431 -2.299 0.021480 *
## attained_age.factor18 -0.496312 0.206431 -2.404 0.016206 *
## attained_age.factor19 -0.277240 0.200055 -1.386 0.165801
## attained_age.factor20 -0.232704 0.198612 -1.172 0.241337
## attained_age.factor21 0.015346 0.194169 0.079 0.937005
## attained_age.factor22 -0.209128 0.197798 -1.057 0.290383
## attained_age.factor23 -0.238749 0.197956 -1.206 0.227789
## attained_age.factor24 -0.266148 0.198444 -1.341 0.179865
## attained_age.factor25 -0.085397 0.194953 -0.438 0.661357
## attained_age.factor26 0.100232 0.191689 0.523 0.601049
## attained_age.factor27 -0.008743 0.193255 -0.045 0.963915
## attained_age.factor28 -0.003721 0.193065 -0.019 0.984622
## attained_age.factor29 0.011234 0.192788 0.058 0.953532
## attained_age.factor30 -0.025868 0.193549 -0.134 0.893679
## attained_age.factor31 0.114774 0.191012 0.601 0.547923
## attained_age.factor32 -0.066480 0.193649 -0.343 0.731372
## attained_age.factor33 0.142345 0.190394 0.748 0.454683
## attained_age.factor34 0.023589 0.191380 0.123 0.901902
## attained_age.factor35 -0.034931 0.191768 -0.182 0.855462
## attained_age.factor36 0.238515 0.187340 1.273 0.202959
## attained_age.factor37 0.291845 0.185705 1.572 0.116054
## attained_age.factor38 0.241025 0.185301 1.301 0.193354
## attained_age.factor39 0.315599 0.184425 1.711 0.087033 .
## attained_age.factor40 0.362291 0.184069 1.968 0.049042 *
## attained_age.factor41 0.415555 0.183546 2.264 0.023571 *
## attained_age.factor42 0.478443 0.182809 2.617 0.008866 **
## attained_age.factor43 0.634376 0.181722 3.491 0.000481 ***
## attained_age.factor44 0.626831 0.181384 3.456 0.000549 ***
## attained_age.factor45 0.715530 0.180755 3.959 7.54e-05 ***
## attained_age.factor46 0.821219 0.180317 4.554 5.26e-06 ***
## attained_age.factor47 0.925540 0.179920 5.144 2.69e-07 ***
## attained_age.factor48 1.048033 0.179515 5.838 5.28e-09 ***
## attained_age.factor49 1.103500 0.179350 6.153 7.61e-10 ***
## attained_age.factor50 1.208823 0.179071 6.751 1.47e-11 ***
## attained_age.factor51 1.315613 0.178840 7.356 1.89e-13 ***
## attained_age.factor52 1.386779 0.178703 7.760 8.48e-15 ***
## attained_age.factor53 1.400105 0.178711 7.834 4.71e-15 ***
## attained_age.factor54 1.521736 0.178535 8.523 < 2e-16 ***
## attained_age.factor55 1.634045 0.178387 9.160 < 2e-16 ***
## attained_age.factor56 1.652770 0.178404 9.264 < 2e-16 ***
## attained_age.factor57 1.795188 0.178233 10.072 < 2e-16 ***
## attained_age.factor58 1.892617 0.178156 10.623 < 2e-16 ***
## attained_age.factor59 1.950798 0.178111 10.953 < 2e-16 ***
## attained_age.factor60 2.057730 0.177982 11.561 < 2e-16 ***
## attained_age.factor61 2.154229 0.177839 12.113 < 2e-16 ***
## attained_age.factor62 2.308018 0.177780 12.982 < 2e-16 ***
## attained_age.factor63 2.402556 0.177893 13.506 < 2e-16 ***
## attained_age.factor64 2.505956 0.177823 14.092 < 2e-16 ***
## attained_age.factor65 2.618320 0.177710 14.734 < 2e-16 ***
## attained_age.factor66 2.711878 0.177649 15.265 < 2e-16 ***
## attained_age.factor67 2.803818 0.177686 15.780 < 2e-16 ***
## attained_age.factor68 2.893517 0.177689 16.284 < 2e-16 ***
## attained_age.factor69 3.030181 0.177621 17.060 < 2e-16 ***
## attained_age.factor70 3.158381 0.177554 17.788 < 2e-16 ***
## attained_age.factor71 3.217536 0.177543 18.123 < 2e-16 ***
## attained_age.factor72 3.299196 0.177523 18.585 < 2e-16 ***
## attained_age.factor73 3.427010 0.177451 19.312 < 2e-16 ***
## attained_age.factor74 3.554901 0.177399 20.039 < 2e-16 ***
## attained_age.factor75 3.662357 0.177371 20.648 < 2e-16 ***
## attained_age.factor76 3.760658 0.177327 21.207 < 2e-16 ***
## attained_age.factor77 3.875935 0.177268 21.865 < 2e-16 ***
## attained_age.factor78 3.961667 0.177232 22.353 < 2e-16 ***
## attained_age.factor79 4.090971 0.177194 23.088 < 2e-16 ***
## attained_age.factor80 4.171688 0.177175 23.546 < 2e-16 ***
## attained_age.factor81 4.321556 0.177128 24.398 < 2e-16 ***
## attained_age.factor82 4.446037 0.177110 25.103 < 2e-16 ***
## attained_age.factor83 4.514958 0.177116 25.492 < 2e-16 ***
## attained_age.factor84 4.634414 0.177100 26.168 < 2e-16 ***
## attained_age.factor85 4.769156 0.177098 26.929 < 2e-16 ***
## attained_age.factor86 4.874967 0.177108 27.525 < 2e-16 ***
## attained_age.factor87 4.952527 0.177128 27.960 < 2e-16 ***
## attained_age.factor88 5.090304 0.177140 28.736 < 2e-16 ***
## attained_age.factor89 5.180855 0.177198 29.238 < 2e-16 ***
## attained_age.factor90 5.290172 0.177236 29.848 < 2e-16 ***
## attained_age.factor91 5.368749 0.177302 30.280 < 2e-16 ***
## attained_age.factor92 5.442986 0.177406 30.681 < 2e-16 ***
## attained_age.factor93 5.518767 0.177505 31.091 < 2e-16 ***
## attained_age.factor94 5.609723 0.177689 31.570 < 2e-16 ***
## attained_age.factor95 5.649933 0.178023 31.737 < 2e-16 ***
## attained_age.factor96 5.791280 0.178669 32.414 < 2e-16 ***
## attained_age.factor97 5.798873 0.179494 32.307 < 2e-16 ***
## attained_age.factor98 5.916055 0.180265 32.819 < 2e-16 ***
## attained_age.factor99 5.981756 0.182186 32.833 < 2e-16 ***
## attained_age.factor100 6.089337 0.185224 32.875 < 2e-16 ***
## attained_age.factor101 5.953352 0.191768 31.045 < 2e-16 ***
## attained_age.factor102 5.793288 0.201269 28.784 < 2e-16 ***
## attained_age.factor103 5.591851 0.221601 25.234 < 2e-16 ***
## attained_age.factor104 5.817647 0.239929 24.247 < 2e-16 ***
## attained_age.factor105 5.693339 0.276956 20.557 < 2e-16 ***
## attained_age.factor106 5.312168 0.338502 15.693 < 2e-16 ***
## attained_age.factor107 5.704780 0.377308 15.120 < 2e-16 ***
## attained_age.factor108 4.264482 0.728869 5.851 4.89e-09 ***
## attained_age.factor109 5.988855 0.417261 14.353 < 2e-16 ***
## attained_age.factor110 5.468575 0.728869 7.503 6.25e-14 ***
## attained_age.factor111 4.819556 1.015505 4.746 2.08e-06 ***
## attained_age.factor112 4.914431 1.015505 4.839 1.30e-06 ***
## attained_age.factor113 -4.069436 85.108400 -0.048 0.961864
## attained_age.factor114 -3.761492 99.455335 -0.038 0.969830
## attained_age.factor115 -3.519823 122.724743 -0.029 0.977119
## attained_age.factor116 7.313475 0.728869 10.034 < 2e-16 ***
## attained_age.factor117 7.424556 1.015505 7.311 2.65e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 874656 on 1537853 degrees of freedom
## Residual deviance: 325807 on 1537736 degrees of freedom
## AIC: 540360
##
## Number of Fisher Scoring iterations: 10
There are 2 things to notice about the output:
There is no factor for Age 0. This means age 0 is our base level, which was selected for us (since we didn’t designate one). Every other age coefficient is on the log scale, so its exponential expresses that age’s mortality as a multiple of the base level (age 0 mortality).
Ages 19-38 appear to have no predictive power. What is really being said is that these ages are not statistically different from the base level (age 0). A quick look at the estimates confirms that age 0 mortality is similar to mortality at ages 19-38.
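To see how the choice of base level changes what the coefficients and their p-values mean, the sketch below re-bases a factor with `relevel()`. The three-row data set is hypothetical, and age 40 is an arbitrary choice of reference made only for illustration.

```r
# Hypothetical toy data: three attained ages with equal exposure.
toy <- data.frame(
  attained_age.factor = factor(c(0, 40, 80)),
  policies_exposed    = c(10000, 10000, 10000),
  number_of_deaths    = c(6, 9, 390)
)

# Default: the first factor level (age 0) becomes the base level.
fit_default <- glm(number_of_deaths ~ attained_age.factor + offset(log(policies_exposed)),
                   family = poisson(link = "log"), data = toy)

# Re-base on age 40; the other coefficients are now relative to age 40.
toy$attained_age.rebased <- relevel(toy$attained_age.factor, ref = "40")
fit_rebased <- glm(number_of_deaths ~ attained_age.rebased + offset(log(policies_exposed)),
                   family = poisson(link = "log"), data = toy)

exp(coef(fit_default))  # intercept = age 0 rate; others are multiples of age 0
exp(coef(fit_rebased))  # intercept = age 40 rate; others are multiples of age 40

# The fitted deaths are the same either way; only the parameterization changes.
all.equal(unname(fitted(fit_default)), unname(fitted(fit_rebased)), tolerance = 1e-6)
```

The significance stars in a summary therefore answer "is this level different from the base level?", not "does this level matter?".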
Model 3: Deaths ~ Age + Gender + Banded Face Amount + Insurance Plan + Duration
# MODEL3:
model3 <- glm(number_of_deaths ~ offset(log(policies_exposed)) + age.gender +
                face_amount_band.factor + insurance_plan + duration.factor,
              family = poisson(link = "log"), data = df.model)
Model 4: Deaths ~ Age + Gender + Banded Face Amount + Insurance Plan + Duration
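Same as model 3 but with the following changes, as seen in the code: face_amount_band.factor is replaced by face_amount_band.capped.factor, and duration.factor is replaced by duration.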
# MODEL4:
model4 <- glm(number_of_deaths ~ offset(log(policies_exposed)) + age.gender +
                face_amount_band.capped.factor + insurance_plan + duration,
              family = poisson(link = "log"), data = df.model)
We will compare the output from models 3 and 4 using an R Shiny dashboard. When comparing these models, keep in mind the 3 qualities of a good assumption, and see how they help with the bias/variance tradeoff. In particular, the first two qualities:
It is easy to focus purely on Accuracy (is the A/E 100%?). Chasing a 100% A/E in every cell of the study data is the same as overfitting. LINK to Shiny Dashboard.
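The point about A/E can be illustrated with a small simulation. With a Poisson GLM and log link, any factor in the model reproduces actual deaths exactly at each of its levels, so an A/E of 100% by level says nothing about how well the model generalizes. The data below are simulated, and treating `duration` as a numeric variable is an assumption about how model 4 differs from model 3.

```r
set.seed(1)  # simulated, purely illustrative data

# Hypothetical exposure data: one row per policy duration 1..15.
df <- data.frame(duration = 1:15, policies_exposed = 20000)
df$duration.factor  <- factor(df$duration)
df$number_of_deaths <- rpois(nrow(df), df$policies_exposed * exp(-6 + 0.05 * df$duration))

# One parameter per duration (15 in total).
fit_factor <- glm(number_of_deaths ~ duration.factor + offset(log(policies_exposed)),
                  family = poisson(link = "log"), data = df)

# A single slope for duration (2 parameters in total).
fit_numeric <- glm(number_of_deaths ~ duration + offset(log(policies_exposed)),
                   family = poisson(link = "log"), data = df)

# A/E by duration under each model.
ae_factor  <- df$number_of_deaths / fitted(fit_factor)
ae_numeric <- df$number_of_deaths / fitted(fit_numeric)

round(ae_factor, 3)   # every duration matches exactly, including the noise
round(ae_numeric, 3)  # scatters around 1; only the overall A/E is forced to 1
sum(df$number_of_deaths) / sum(fitted(fit_numeric))
```

The factor version matches the observed deaths perfectly because it has one parameter per duration, which is the high-variance end of the tradeoff. The numeric version gives up some accuracy by duration in exchange for a smooth, more stable pattern.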