This document summarises the results of modelling lapses to understand the key drivers and the relationship between policy year and age.
We upload the aggregated data and perform a simple 2 way analysis, filtering year (2017 to 2022), products WLM15/17 and channel IFA/Bank.
| PY | Exposure | LAP | Rate |
|---|---|---|---|
| 1 | 882846 | 31623 | 3.6 |
| 2 | 838828 | 37782 | 4.5 |
| 3 | 705662 | 40986 | 5.8 |
| 4 | 493492 | 31424 | 6.4 |
| 5 | 255845 | 15126 | 5.9 |
| 6 | 82204 | 4885 | 5.9 |
| 7 | 36039 | 1962 | 5.4 |
| 8 | 4003 | 246 | 6.1 |
in order to perform the modelling step we require to split variables from type character to type numeric.
## Classes 'data.table' and 'data.frame': 135156 obs. of 14 variables:
## $ CY : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ PY_C : chr "6" "6" "6" "6" ...
## $ PY_N : num 6 6 6 6 6 6 6 6 6 6 ...
## $ PRO : chr "WLM17" "WLM17" "WLM17" "WLM17" ...
## $ CHN_CDE : chr "IFA" "IFA" "B" "IFA" ...
## $ CHN_CDE_N: num 1 1 2 1 1 1 1 1 1 1 ...
## $ AGE_C : chr "<30" "<30" "<30" "<30" ...
## $ AGE_N : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : chr "M" "F" "M" "M" ...
## $ SEX_N : num 1 2 1 1 1 2 2 1 2 2 ...
## $ SMK : chr "C" "C" "C" "C" ...
## $ SMK_N : num 1 1 1 1 1 1 1 1 1 1 ...
## $ LAP : int 1 1 0 1 0 0 0 0 0 1 ...
## $ X_LAP : num 0.1205 0.0904 0.1671 0.1671 0.337 ...
## - attr(*, ".internal.selfref")=<externalptr>
We consider the correlation matrix in order to understand if there are any strong interactions across the key variables. There is a correlation across age and policy year and between channel and duration, however, it this is fairly weak based on this data.
# ------------------------------------------------------------------------------
# 1.1 Correlation matrix
# ------------------------------------------------------------------------------
Mat <- A[ , c("PY_N", "AGE_N", "SEX_N", "CHN_CDE_N")]
kable(cor(Mat), format = "html", caption = "Correlation matrix")
| PY_N | AGE_N | SEX_N | CHN_CDE_N | |
|---|---|---|---|---|
| PY_N | 1.0000000 | 0.0498905 | 0.0081036 | -0.0791301 |
| AGE_N | 0.0498905 | 1.0000000 | -0.0059267 | -0.0008930 |
| SEX_N | 0.0081036 | -0.0059267 | 1.0000000 | 0.0025521 |
| CHN_CDE_N | -0.0791301 | -0.0008930 | 0.0025521 | 1.0000000 |
In order to test the current approach to modelling against a GAM approach we split the data into training and test sets. We have split the data 75% training and 25% testing. The training dataset is used to train the model and then we test the model on the out of sample dataset.
We then aggregate the training and test data sets across the key variable we are using to model lapses.
# ------------------------------------------------------------------------------
# Split data into train, test and predict
# ------------------------------------------------------------------------------
# Split data into train, test and predict
set.seed(123) # For reproducibility
# Generate a random permutation of row indices
indices <- sample(1:nrow(A))
# Determine the split point
split_point <- floor(0.75 * nrow(A))
# Create training and testing sets based on the split point
A_train <- A[indices[1:split_point], ]
A_test <- A[indices[(split_point + 1):nrow(A)], ]
# ----------------------------------------------------------------------------
# Aggregating data
# ----------------------------------------------------------------------------
# Aggregating lapses, lapse amounts and lapse exposures:
A_train <- A_train[, .(
LAP = sum(LAP, na.rm = TRUE),
X_LAP = sum(X_LAP, na.rm = TRUE)
),
by = .(
CY, PY_N, AGE_N, CHN_CDE_N, SMK_N, SEX_N
)]
A_test <- A_test[, .(
LAP = sum(LAP, na.rm = TRUE),
X_LAP = sum(X_LAP, na.rm = TRUE)
),
by = .(
CY, PY_N, AGE_N, CHN_CDE_N, SMK_N, SEX_N
)]
# ----------------------------------------------------------------------------
# Preparing variables for regression
# ----------------------------------------------------------------------------
# # Specifying categorical (factor) variables:
# cols <- c(
# "PY_N", "SEX_N", "SMK_N", "AGE_N", "CHN_CDE_N"
# )
cols <- c(
"SEX_N", "SMK_N", "CHN_CDE_N"
)
# Changing format to factors:
A_train[, (cols) := lapply(.SD, function(x) x <- as.factor(x)), .SDcols = cols]
A_test[, (cols) := lapply(.SD, function(x) x <- as.factor(x)), .SDcols = cols]
# ------------------------------------------------------------------------------
# Fitting the regression model
# ------------------------------------------------------------------------------
# Equation that replicates the current Japan approach using PY_N as a factor:
eq0 <- formula(LAP ~ as.factor(PY_N) + offset(log(X_LAP)))
# Equation for a GAM based on variable PY_N and AGE_N:
eq1 <- formula(LAP ~ 1 + s(PY_N, k = 8) + s(AGE_N, k = 10) + offset(log(X_LAP)))
# fitting the GLM and GAM models:
glm0_out <- glm(eq0, data = A_train, family = poisson(link = "log"))
gam1_out <- mgcv::gam(eq1, data = A_train, family = poisson(link = "log"), method = "REML")
# fitting the model to the data to calcuate predicted number of lapses:
A_train[ ,Predict_Lap_GLM0 := predict(glm0_out, type = "response")]
A_train[ ,Predict_Lap_GAM1 := predict(gam1_out, type = "response")]
A_train[ ,sq_residuals_0 := (LAP - Predict_Lap_GLM0)^2]
A_train[ ,sq_residuals_1 := (LAP - Predict_Lap_GAM1)^2]
# Calculate the rmse for each model to compare model fits:
A_train %>%
summarise(rmse_0 = sqrt(sum(sq_residuals_0)),
rmse_1 = sqrt(sum(sq_residuals_1)))
# summary of fitted models
summary(glm0_out)
##
## Call:
## glm(formula = eq0, family = poisson(link = "log"), data = A_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.194 -1.872 -0.417 1.915 16.669
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.334462 0.006478 -514.741 < 0.0000000000000002 ***
## as.factor(PY_N)2 0.214831 0.008802 24.408 < 0.0000000000000002 ***
## as.factor(PY_N)3 0.491928 0.008630 57.001 < 0.0000000000000002 ***
## as.factor(PY_N)4 0.599154 0.009144 65.525 < 0.0000000000000002 ***
## as.factor(PY_N)5 0.443849 0.011507 38.572 < 0.0000000000000002 ***
## as.factor(PY_N)6 0.568055 0.017486 32.487 < 0.0000000000000002 ***
## as.factor(PY_N)7 0.403041 0.027066 14.891 < 0.0000000000000002 ***
## as.factor(PY_N)8 0.467940 0.076521 6.115 0.000000000965 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 20839 on 1303 degrees of freedom
## Residual deviance: 14840 on 1296 degrees of freedom
## AIC: 20821
##
## Number of Fisher Scoring iterations: 4
summary(gam1_out)
##
## Family: poisson
## Link function: log
##
## Formula:
## LAP ~ 1 + s(PY_N, k = 8) + s(AGE_N, k = 10) + offset(log(X_LAP))
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.950645 0.004233 -697 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(PY_N) 6.805 6.977 6305 <0.0000000000000002 ***
## s(AGE_N) 7.404 8.368 2165 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.923 Deviance explained = 38.9%
## -REML = 9397.6 Scale est. = 1 n = 1304
#calculate RMSE
sqrt(mean((A_train$LAP - A_train$Predict_Lap_GLM0)^2))
## [1] 51.75507
sqrt(mean((A_train$LAP - A_train$Predict_Lap_GAM1)^2))
## [1] 45.73293
# metrics to compare how well model fits
AIC(glm0_out)
## [1] 20820.57
AIC(gam1_out)
## [1] 18732.99
# plots of the GAM model and residuals:
plot(gam1_out, pages = 1, all.terms = TRUE, shade = TRUE, shade.col = "lightblue")
plot(gam1_out, residuals = TRUE)
plot(gam1_out, residuals = TRUE, pch = 1, cex = 1)
# checks performed on GAM:
gam.check(gam1_out)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [0.000001499652,0.00234476]
## (score 9397.556 & scale 1).
## Hessian positive definite, eigenvalue range [2.586339,3.426502].
## Model rank = 17 / 17
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(PY_N) 7.0 6.8 1.01 0.69
## s(AGE_N) 9.0 7.4 0.99 0.25
concurvity(gam1_out, full = TRUE)
## para s(PY_N) s(AGE_N)
## worst 0.0000000000000000000000000081796 0.00105025966 0.00105025966
## observed 0.0000000000000000000000000081796 0.00009503156 0.00002147322
## estimate 0.0000000000000000000000000081796 0.00042340357 0.00055356602
concurvity(gam1_out, full = FALSE)
## $worst
## para s(PY_N)
## para 1.000000000000000000000000000000000 0.00000000000000000000000000331432
## s(PY_N) 0.000000000000000000000000002501084 1.00000000000000000000000000000000
## s(AGE_N) 0.000000000000000000000000004217329 0.00105025965514571719208736233497
## s(AGE_N)
## para 0.000000000000000000000000004355257
## s(PY_N) 0.001050259655145703747980423514718
## s(AGE_N) 1.000000000000000000000000000000000
##
## $observed
## para
## para 1.000000000000000000000000000000000
## s(PY_N) 0.000000000000000000000000002501084
## s(AGE_N) 0.000000000000000000000000004217329
## s(PY_N)
## para 0.0000000000000000000000000001246299
## s(PY_N) 1.0000000000000000000000000000000000
## s(AGE_N) 0.0000950315644555528875239008845810
## s(AGE_N)
## para 0.000000000000000000000000000001672127
## s(PY_N) 0.000021473217385917243023004188917646
## s(AGE_N) 1.000000000000000000000000000000000000
##
## $estimate
## para
## para 1.000000000000000000000000000000000
## s(PY_N) 0.000000000000000000000000002501084
## s(AGE_N) 0.000000000000000000000000004217329
## s(PY_N)
## para 0.00000000000000000000000000000005806932
## s(PY_N) 1.00000000000000000000000000000000000000
## s(AGE_N) 0.00042340356716716418417761902404095053
## s(AGE_N)
## para 0.0000000000000000000000000000005930063
## s(PY_N) 0.0005535660186173290515584954896155523
## s(AGE_N) 1.0000000000000000000000000000000000000
## [1] 37.41075
## [1] 34.65429