Overview

Purpose

This document summarises the results of modelling lapses to understand the key drivers and the relationship between policy year and age.

data

We upload the aggregated data and perform a simple 2 way analysis, filtering year (2017 to 2022), products WLM15/17 and channel IFA/Bank.

Lapse rate, WLM15/17, Channel IFA/BNK
PY Exposure LAP Rate
1 882846 31623 3.6
2 838828 37782 4.5
3 705662 40986 5.8
4 493492 31424 6.4
5 255845 15126 5.9
6 82204 4885 5.9
7 36039 1962 5.4
8 4003 246 6.1

in order to perform the modelling step we require to split variables from type character to type numeric.

## Classes 'data.table' and 'data.frame':   135156 obs. of  14 variables:
##  $ CY       : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ PY_C     : chr  "6" "6" "6" "6" ...
##  $ PY_N     : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ PRO      : chr  "WLM17" "WLM17" "WLM17" "WLM17" ...
##  $ CHN_CDE  : chr  "IFA" "IFA" "B" "IFA" ...
##  $ CHN_CDE_N: num  1 1 2 1 1 1 1 1 1 1 ...
##  $ AGE_C    : chr  "<30" "<30" "<30" "<30" ...
##  $ AGE_N    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEX      : chr  "M" "F" "M" "M" ...
##  $ SEX_N    : num  1 2 1 1 1 2 2 1 2 2 ...
##  $ SMK      : chr  "C" "C" "C" "C" ...
##  $ SMK_N    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ LAP      : int  1 1 0 1 0 0 0 0 0 1 ...
##  $ X_LAP    : num  0.1205 0.0904 0.1671 0.1671 0.337 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Correlation matrix

We consider the correlation matrix in order to understand if there are any strong interactions across the key variables. There is a correlation across age and policy year and between channel and duration, however, it this is fairly weak based on this data.

# ------------------------------------------------------------------------------
# 1.1 Correlation matrix
# ------------------------------------------------------------------------------


Mat <- A[ , c("PY_N", "AGE_N", "SEX_N", "CHN_CDE_N")]

kable(cor(Mat), format = "html", caption = "Correlation matrix")
Correlation matrix
PY_N AGE_N SEX_N CHN_CDE_N
PY_N 1.0000000 0.0498905 0.0081036 -0.0791301
AGE_N 0.0498905 1.0000000 -0.0059267 -0.0008930
SEX_N 0.0081036 -0.0059267 1.0000000 0.0025521
CHN_CDE_N -0.0791301 -0.0008930 0.0025521 1.0000000

split data into a training and test dataset

In order to test the current approach to modelling against a GAM approach we split the data into training and test sets. We have split the data 75% training and 25% testing. The training dataset is used to train the model and then we test the model on the out of sample dataset.

We then aggregate the training and test data sets across the key variable we are using to model lapses.

# ------------------------------------------------------------------------------
# Split data into train, test and predict
# ------------------------------------------------------------------------------

# Split data into train, test and predict

set.seed(123)  # For reproducibility
# Generate a random permutation of row indices
indices <- sample(1:nrow(A))
# Determine the split point
split_point <- floor(0.75 * nrow(A))

# Create training and testing sets based on the split point
A_train <- A[indices[1:split_point], ]
A_test <- A[indices[(split_point + 1):nrow(A)], ]


# ----------------------------------------------------------------------------
# Aggregating data
# ----------------------------------------------------------------------------


# Aggregating lapses, lapse amounts and lapse exposures:
A_train <- A_train[, .(
  LAP = sum(LAP, na.rm = TRUE),
  X_LAP = sum(X_LAP, na.rm = TRUE)
  ),
by = .(
  CY, PY_N, AGE_N, CHN_CDE_N, SMK_N, SEX_N
)]

A_test <- A_test[, .(
  LAP = sum(LAP, na.rm = TRUE),
  X_LAP = sum(X_LAP, na.rm = TRUE)
 ),
by = .(
  CY, PY_N, AGE_N, CHN_CDE_N, SMK_N, SEX_N
)]

Fitting regression model to training set

# ----------------------------------------------------------------------------
# Preparing variables for regression
# ----------------------------------------------------------------------------

# # Specifying categorical (factor) variables:
# cols <- c(
#   "PY_N", "SEX_N", "SMK_N", "AGE_N", "CHN_CDE_N"
# )

cols <- c(
  "SEX_N", "SMK_N", "CHN_CDE_N"
)

# Changing format to factors:
A_train[, (cols) := lapply(.SD, function(x) x <- as.factor(x)), .SDcols = cols]

A_test[, (cols) := lapply(.SD, function(x) x <- as.factor(x)), .SDcols = cols]

# ------------------------------------------------------------------------------
# Fitting the regression model
# ------------------------------------------------------------------------------

# Equation that replicates the current Japan approach using PY_N as a factor:
eq0 <- formula(LAP ~ as.factor(PY_N) + offset(log(X_LAP)))

# Equation for a GAM based on variable PY_N and AGE_N:
eq1 <- formula(LAP ~ 1 + s(PY_N, k = 8) + s(AGE_N, k = 10) + offset(log(X_LAP)))

# fitting the GLM and GAM models:
glm0_out <- glm(eq0, data = A_train, family = poisson(link = "log"))
gam1_out <- mgcv::gam(eq1, data = A_train, family = poisson(link = "log"), method = "REML")


# fitting the model to the data to calcuate predicted number of lapses:
A_train[ ,Predict_Lap_GLM0 := predict(glm0_out, type = "response")]
A_train[ ,Predict_Lap_GAM1 := predict(gam1_out, type = "response")]
A_train[ ,sq_residuals_0 := (LAP - Predict_Lap_GLM0)^2]
A_train[ ,sq_residuals_1 := (LAP - Predict_Lap_GAM1)^2]

# Calculate the rmse for each model to compare model fits:
A_train %>%
  summarise(rmse_0 = sqrt(sum(sq_residuals_0)),
            rmse_1 = sqrt(sum(sq_residuals_1)))
# summary of fitted models
summary(glm0_out)
## 
## Call:
## glm(formula = eq0, family = poisson(link = "log"), data = A_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -13.194   -1.872   -0.417    1.915   16.669  
## 
## Coefficients:
##                   Estimate Std. Error  z value             Pr(>|z|)    
## (Intercept)      -3.334462   0.006478 -514.741 < 0.0000000000000002 ***
## as.factor(PY_N)2  0.214831   0.008802   24.408 < 0.0000000000000002 ***
## as.factor(PY_N)3  0.491928   0.008630   57.001 < 0.0000000000000002 ***
## as.factor(PY_N)4  0.599154   0.009144   65.525 < 0.0000000000000002 ***
## as.factor(PY_N)5  0.443849   0.011507   38.572 < 0.0000000000000002 ***
## as.factor(PY_N)6  0.568055   0.017486   32.487 < 0.0000000000000002 ***
## as.factor(PY_N)7  0.403041   0.027066   14.891 < 0.0000000000000002 ***
## as.factor(PY_N)8  0.467940   0.076521    6.115       0.000000000965 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 20839  on 1303  degrees of freedom
## Residual deviance: 14840  on 1296  degrees of freedom
## AIC: 20821
## 
## Number of Fisher Scoring iterations: 4
summary(gam1_out)
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## LAP ~ 1 + s(PY_N, k = 8) + s(AGE_N, k = 10) + offset(log(X_LAP))
## 
## Parametric coefficients:
##              Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -2.950645   0.004233    -697 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq             p-value    
## s(PY_N)  6.805  6.977   6305 <0.0000000000000002 ***
## s(AGE_N) 7.404  8.368   2165 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.923   Deviance explained = 38.9%
## -REML = 9397.6  Scale est. = 1         n = 1304
#calculate RMSE
sqrt(mean((A_train$LAP - A_train$Predict_Lap_GLM0)^2))
## [1] 51.75507
sqrt(mean((A_train$LAP - A_train$Predict_Lap_GAM1)^2))
## [1] 45.73293
# metrics to compare how well model fits
AIC(glm0_out)
## [1] 20820.57
AIC(gam1_out)
## [1] 18732.99

GAM checks

# plots of the GAM model and residuals:
plot(gam1_out, pages = 1, all.terms = TRUE, shade = TRUE, shade.col = "lightblue")

plot(gam1_out, residuals = TRUE)

plot(gam1_out, residuals = TRUE, pch = 1, cex = 1)

# checks performed on GAM: 
gam.check(gam1_out)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [0.000001499652,0.00234476]
## (score 9397.556 & scale 1).
## Hessian positive definite, eigenvalue range [2.586339,3.426502].
## Model rank =  17 / 17 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k' edf k-index p-value
## s(PY_N)  7.0 6.8    1.01    0.69
## s(AGE_N) 9.0 7.4    0.99    0.25
concurvity(gam1_out, full = TRUE)
##                                       para       s(PY_N)      s(AGE_N)
## worst    0.0000000000000000000000000081796 0.00105025966 0.00105025966
## observed 0.0000000000000000000000000081796 0.00009503156 0.00002147322
## estimate 0.0000000000000000000000000081796 0.00042340357 0.00055356602
concurvity(gam1_out, full = FALSE)
## $worst
##                                         para                            s(PY_N)
## para     1.000000000000000000000000000000000 0.00000000000000000000000000331432
## s(PY_N)  0.000000000000000000000000002501084 1.00000000000000000000000000000000
## s(AGE_N) 0.000000000000000000000000004217329 0.00105025965514571719208736233497
##                                     s(AGE_N)
## para     0.000000000000000000000000004355257
## s(PY_N)  0.001050259655145703747980423514718
## s(AGE_N) 1.000000000000000000000000000000000
## 
## $observed
##                                         para
## para     1.000000000000000000000000000000000
## s(PY_N)  0.000000000000000000000000002501084
## s(AGE_N) 0.000000000000000000000000004217329
##                                       s(PY_N)
## para     0.0000000000000000000000000001246299
## s(PY_N)  1.0000000000000000000000000000000000
## s(AGE_N) 0.0000950315644555528875239008845810
##                                        s(AGE_N)
## para     0.000000000000000000000000000001672127
## s(PY_N)  0.000021473217385917243023004188917646
## s(AGE_N) 1.000000000000000000000000000000000000
## 
## $estimate
##                                         para
## para     1.000000000000000000000000000000000
## s(PY_N)  0.000000000000000000000000002501084
## s(AGE_N) 0.000000000000000000000000004217329
##                                           s(PY_N)
## para     0.00000000000000000000000000000005806932
## s(PY_N)  1.00000000000000000000000000000000000000
## s(AGE_N) 0.00042340356716716418417761902404095053
##                                         s(AGE_N)
## para     0.0000000000000000000000000000005930063
## s(PY_N)  0.0005535660186173290515584954896155523
## s(AGE_N) 1.0000000000000000000000000000000000000

Plot of lapse rates to compare fit of model against the training set

Plot of lapse rates to compare fit of models against the test set

## [1] 37.41075
## [1] 34.65429

Conclusion