Introduction and Business Problem

VahanBima is one of the leading insurance companies in India. It provides motor vehicle insurance at the best prices with 24/7 claim settlement. It offers different types of policies for both personal and commercial vehicles. It has established its brand across different regions in India.

Around 90% of businesses today use personalized services. The company wants to launch different personalized experience programs for customers of VahanBima. The personalized experience can be dedicated resources for claim settlement, different kinds of services at the doorstep, etc. In order to do so, they would like to segment the customers into different tiers based on their customer lifetime value (CLTV).

Business Question

In order to do it, they would like to predict the customer lifetime value based on the activity and interaction of the customer with the platform. So, as a part of this challenge, your task at hand is to build a high-performance and interpretable machine learning model to predict the CLTV based on user and policy data.

Data Wrangling

#Import library
library(dplyr)
library(ggplot2)
library(lubridate)

Read data

#read data
clvt <- read.csv("train_BRCpofr.csv")
head(clvt)

##   id gender  area qualification        income marital_status vintage
## 1  1   Male Urban      Bachelor        5L-10L              1       5
## 2  2   Male Rural   High School        5L-10L              0       8
## 3  3   Male Urban      Bachelor        5L-10L              1       8
## 4  4 Female Rural   High School        5L-10L              0       7
## 5  5   Male Urban   High School More than 10L              1       6
## 6  6   Male Rural   High School More than 10L              1       1
##   claim_amount num_policies policy type_of_policy   cltv
## 1         5790  More than 1      A       Platinum  64308
## 2         5080  More than 1      A       Platinum 515400
## 3         2599  More than 1      A       Platinum  64212
## 4            0  More than 1      A       Platinum  97920
## 5         3508  More than 1      A           Gold  59736
## 6            0            1      C           Gold 348768

About dataset:

id-Unique: identifier of a customer
gender : Gender of the customer
area : Area of the customer
qualification : Highest Qualification of the customer
income : Income earned in a year (in rupees)
marital_status : Marital Status of the customer {0:Single, 1: Married}
vintage : No. of years since the first policy date
claim_amount : Total Amount Claimed by the customer (in rupees)
num_policies : Total no. of policies issued by the customer
policy : Active policy of the customer
type_of_policy : Type of active policy
cltv : Customer lifetime value (Target Variable)

In this case, cltv will be the target variable, while the others will serve as predictor variables to predict cltv.

#check data type
glimpse(clvt)

## Rows: 89,392
## Columns: 12
## $ id             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ gender         <chr> "Male", "Male", "Male", "Female", "Male", "Male", "Fema…
## $ area           <chr> "Urban", "Rural", "Urban", "Rural", "Urban", "Rural", "…
## $ qualification  <chr> "Bachelor", "High School", "Bachelor", "High School", "…
## $ income         <chr> "5L-10L", "5L-10L", "5L-10L", "5L-10L", "More than 10L"…
## $ marital_status <int> 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1…
## $ vintage        <int> 5, 8, 8, 7, 6, 1, 6, 0, 3, 6, 4, 6, 7, 1, 8, 1, 5, 2, 7…
## $ claim_amount   <int> 5790, 5080, 2599, 0, 3508, 0, 0, 5473, 6105, 0, 4329, 6…
## $ num_policies   <chr> "More than 1", "More than 1", "More than 1", "More than…
## $ policy         <chr> "A", "A", "A", "A", "A", "C", "A", "A", "A", "A", "A", …
## $ type_of_policy <chr> "Platinum", "Platinum", "Platinum", "Platinum", "Gold",…
## $ cltv           <int> 64308, 515400, 64212, 97920, 59736, 348768, 238920, 641…

#check total unique value
unique(clvt$income)

## [1] "5L-10L"        "More than 10L" "2L-5L"         "<=2L"

unique(clvt$num_policies)

## [1] "More than 1" "1"

Insight :

Based on the types of columns, there are several column types that are not appropriate and will be converted to a fixed column type. The columns to be converted are as follows:

gender, area, qualification, income, marital_status, num_policies, policy, and type_of_policy -> as.factor

clvt <- clvt %>% 
  mutate_at(vars(gender, area, qualification, income, marital_status, num_policies, policy, type_of_policy), as.factor)
glimpse(clvt)

## Rows: 89,392
## Columns: 12
## $ id             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ gender         <fct> Male, Male, Male, Female, Male, Male, Female, Female, F…
## $ area           <fct> Urban, Rural, Urban, Rural, Urban, Rural, Rural, Urban,…
## $ qualification  <fct> Bachelor, High School, Bachelor, High School, High Scho…
## $ income         <fct> 5L-10L, 5L-10L, 5L-10L, 5L-10L, More than 10L, More tha…
## $ marital_status <fct> 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1…
## $ vintage        <int> 5, 8, 8, 7, 6, 1, 6, 0, 3, 6, 4, 6, 7, 1, 8, 1, 5, 2, 7…
## $ claim_amount   <int> 5790, 5080, 2599, 0, 3508, 0, 0, 5473, 6105, 0, 4329, 6…
## $ num_policies   <fct> More than 1, More than 1, More than 1, More than 1, Mor…
## $ policy         <fct> A, A, A, A, A, C, A, A, A, A, A, C, A, C, A, A, A, A, C…
## $ type_of_policy <fct> Platinum, Platinum, Platinum, Platinum, Gold, Gold, Gol…
## $ cltv           <int> 64308, 515400, 64212, 97920, 59736, 348768, 238920, 641…

Exploratory Data Analysis (EDA)

#checking data distribution for numerical data
hist(clvt$vintage)

options(scipen = 999)
hist(clvt$cltv)

hist(log(clvt$cltv))

#boxplot
boxplot(clvt$cltv)

Insight:

The data in the cltv column appears to have many outliers and does not follow a normal distribution. However, if we modify the data using the log() function, the distribution will become much more normal. Therefore, the cltv variable, which serves as the target variable, will be transformed into its logarithmic form.

clvt$log_cltv <-log(clvt$cltv)

#Total cltv based on education level
clvt %>%
  group_by(qualification) %>%
  summarise(total_clvt = sum(log_cltv)) %>% 
ggplot(mapping = aes(x= qualification, y= total_clvt))+
  geom_col(aes(fill= qualification))

#Total cltv based on marital status
clvt %>% 
   group_by(marital_status) %>%
  summarise(total_clvt = sum(log_cltv)) %>% 
ggplot(mapping = aes(x= marital_status, y= total_clvt))+
  geom_col(aes(fill= marital_status))

#Total cltv based on income
clvt %>% 
   group_by(income) %>%
  summarise(total_clvt = sum(log_cltv)) %>% 
ggplot(mapping = aes(x= income, y= total_clvt))+
  geom_col(aes(fill= income))

#Total cltv based on gender
clvt %>% 
   group_by(gender) %>%
  summarise(total_clvt = sum(log_cltv)) %>% 
ggplot(mapping = aes(x= gender, y= total_clvt))+
  geom_col(aes(fill= gender))

Insight:

Total clvt paling banyak dimiliki oleh pria

#Total cltv based on area
clvt %>% 
   group_by(area) %>%
  summarise(total_clvt = sum(log_cltv)) %>% 
ggplot(mapping = aes(x= area, y= total_clvt))+
  geom_col(aes(fill= area))

#Check the correlation between cltv and other predictor variables
library(GGally)

## Warning: package 'GGally' was built under R version 4.2.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(clvt, label = TRUE)

## Warning in ggcorr(clvt, label = TRUE): data in column(s) 'gender', 'area',
## 'qualification', 'income', 'marital_status', 'num_policies', 'policy',
## 'type_of_policy' are not numeric and were ignored

Insight:

The numeric data correlation shows that vintage has no correlation with cltv, while claim_amount has a positive correlation, but it is very weak.

Model Fitting and Feature Selection

# Remove the id and cltv data (target variable using alghoritmic form of cltv).
clvt_clean <- clvt %>% 
  select(-id, -cltv)
head(clvt_clean)

##   gender  area qualification        income marital_status vintage claim_amount
## 1   Male Urban      Bachelor        5L-10L              1       5         5790
## 2   Male Rural   High School        5L-10L              0       8         5080
## 3   Male Urban      Bachelor        5L-10L              1       8         2599
## 4 Female Rural   High School        5L-10L              0       7            0
## 5   Male Urban   High School More than 10L              1       6         3508
## 6   Male Rural   High School More than 10L              1       1            0
##   num_policies policy type_of_policy log_cltv
## 1  More than 1      A       Platinum 11.07144
## 2  More than 1      A       Platinum 13.15270
## 3  More than 1      A       Platinum 11.06995
## 4  More than 1      A       Platinum 11.49191
## 5  More than 1      A           Gold 10.99769
## 6            1      C           Gold 12.76216

Model with one predictor

#Model with one target value
clvt_claim <- lm(formula = log_cltv ~ claim_amount,
                 data = clvt_clean)
summary(clvt_claim)

## 
## Call:
## lm(formula = log_cltv ~ claim_amount, data = clvt_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9203 -0.3913 -0.1057  0.3075  2.3714 
## 
## Coefficients:
##                   Estimate    Std. Error t value            Pr(>|t|)    
## (Intercept)  11.0601850035  0.0035521502 3113.66 <0.0000000000000002 ***
## claim_amount  0.0000407097  0.0000006531   62.33 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6371 on 89390 degrees of freedom
## Multiple R-squared:  0.04165,    Adjusted R-squared:  0.04164 
## F-statistic:  3885 on 1 and 89390 DF,  p-value: < 0.00000000000000022

Model with all predictors

#Model with all predictors
clvt_all <- lm(formula = log_cltv ~.,
               data = clvt_clean)
summary(clvt_all)

## 
## Call:
## lm(formula = log_cltv ~ ., data = clvt_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6701 -0.3748 -0.1396  0.2429  2.5309 
## 
## Coefficients:
##                               Estimate    Std. Error t value
## (Intercept)              10.6717818555  0.0149645998 713.135
## genderMale                0.0040510994  0.0036613033   1.106
## areaUrban                 0.0319831287  0.0045300451   7.060
## qualificationHigh School -0.0083175538  0.0037140278  -2.239
## qualificationOthers      -0.0780491645  0.0092309590  -8.455
## income2L-5L               0.0290280027  0.0130731224   2.220
## income5L-10L             -0.0038246056  0.0128156924  -0.298
## incomeMore than 10L      -0.0162865803  0.0134905355  -1.207
## marital_status1          -0.0482944710  0.0036923094 -13.080
## vintage                   0.0031659244  0.0007904369   4.005
## claim_amount              0.0000196347  0.0000006328  31.029
## num_policiesMore than 1   0.7227126883  0.0039360237 183.615
## policyB                  -0.0454572146  0.0043020299 -10.566
## policyC                   0.0595949930  0.0064140225   9.291
## type_of_policyPlatinum   -0.0033273639  0.0044979841  -0.740
## type_of_policySilver     -0.0176372973  0.0053205849  -3.315
##                                      Pr(>|t|)    
## (Intercept)              < 0.0000000000000002 ***
## genderMale                           0.268529    
## areaUrban                    0.00000000000167 ***
## qualificationHigh School             0.025126 *  
## qualificationOthers      < 0.0000000000000002 ***
## income2L-5L                          0.026392 *  
## income5L-10L                         0.765375    
## incomeMore than 10L                  0.227335    
## marital_status1          < 0.0000000000000002 ***
## vintage                      0.00006199304410 ***
## claim_amount             < 0.0000000000000002 ***
## num_policiesMore than 1  < 0.0000000000000002 ***
## policyB                  < 0.0000000000000002 ***
## policyC                  < 0.0000000000000002 ***
## type_of_policyPlatinum               0.459456    
## type_of_policySilver                 0.000917 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5379 on 89376 degrees of freedom
## Multiple R-squared:  0.3168, Adjusted R-squared:  0.3167 
## F-statistic:  2763 on 15 and 89376 DF,  p-value: < 0.00000000000000022

Model Step

#model backward
clvt_step <- step(object = clvt_all, direction = "backward")

## Start:  AIC=-110836.8
## log_cltv ~ gender + area + qualification + income + marital_status + 
##     vintage + claim_amount + num_policies + policy + type_of_policy
## 
##                  Df Sum of Sq   RSS     AIC
## - gender          1       0.4 25862 -110838
## <none>                        25862 -110837
## - type_of_policy  2       3.8 25866 -110828
## - vintage         1       4.6 25867 -110823
## - area            1      14.4 25876 -110789
## - income          3      20.0 25882 -110774
## - qualification   2      20.7 25883 -110769
## - marital_status  1      49.5 25912 -110668
## - policy          2      71.4 25933 -110594
## - claim_amount    1     278.6 26141 -109881
## - num_policies    1    9755.7 35618  -82227
## 
## Step:  AIC=-110837.6
## log_cltv ~ area + qualification + income + marital_status + vintage + 
##     claim_amount + num_policies + policy + type_of_policy
## 
##                  Df Sum of Sq   RSS     AIC
## <none>                        25862 -110838
## - type_of_policy  2       3.9 25866 -110828
## - vintage         1       4.7 25867 -110823
## - area            1      14.6 25877 -110789
## - income          3      19.9 25882 -110775
## - qualification   2      20.8 25883 -110770
## - marital_status  1      49.2 25912 -110670
## - policy          2      71.3 25934 -110595
## - claim_amount    1     280.5 26143 -109875
## - num_policies    1    9783.2 35646  -82159

summary(clvt_step)

## 
## Call:
## lm(formula = log_cltv ~ area + qualification + income + marital_status + 
##     vintage + claim_amount + num_policies + policy + type_of_policy, 
##     data = clvt_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6692 -0.3745 -0.1401  0.2430  2.5326 
## 
## Coefficients:
##                               Estimate    Std. Error t value
## (Intercept)              10.6732276478  0.0149074611 715.965
## areaUrban                 0.0321589771  0.0045272620   7.103
## qualificationHigh School -0.0084809521  0.0037110953  -2.285
## qualificationOthers      -0.0781853389  0.0092301501  -8.471
## income2L-5L               0.0294965467  0.0130662787   2.257
## income5L-10L             -0.0031997655  0.0128032604  -0.250
## incomeMore than 10L      -0.0157224900  0.0134809159  -1.166
## marital_status1          -0.0479976831  0.0036825582 -13.034
## vintage                   0.0031720743  0.0007904184   4.013
## claim_amount              0.0000196731  0.0000006318  31.137
## num_policiesMore than 1   0.7224539403  0.0039290755 183.874
## policyB                  -0.0453373567  0.0043006713 -10.542
## policyC                   0.0596848974  0.0064135158   9.306
## type_of_policyPlatinum   -0.0032758753  0.0044977491  -0.728
## type_of_policySilver     -0.0177050472  0.0053202392  -3.328
##                                      Pr(>|t|)    
## (Intercept)              < 0.0000000000000002 ***
## areaUrban                    0.00000000000123 ***
## qualificationHigh School             0.022298 *  
## qualificationOthers      < 0.0000000000000002 ***
## income2L-5L                          0.023982 *  
## income5L-10L                         0.802651    
## incomeMore than 10L                  0.243505    
## marital_status1          < 0.0000000000000002 ***
## vintage                      0.00005996039575 ***
## claim_amount             < 0.0000000000000002 ***
## num_policiesMore than 1  < 0.0000000000000002 ***
## policyB                  < 0.0000000000000002 ***
## policyC                  < 0.0000000000000002 ***
## type_of_policyPlatinum               0.466409    
## type_of_policySilver                 0.000875 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5379 on 89377 degrees of freedom
## Multiple R-squared:  0.3168, Adjusted R-squared:  0.3167 
## F-statistic:  2960 on 14 and 89377 DF,  p-value: < 0.00000000000000022

Model Selection

cltv_selection <- lm(formula = log_cltv ~ claim_amount+vintage+gender+area+qualification+income+marital_status+policy+type_of_policy,
                 data = clvt_clean)
summary(cltv_selection)

## 
## Call:
## lm(formula = log_cltv ~ claim_amount + vintage + gender + area + 
##     qualification + income + marital_status + policy + type_of_policy, 
##     data = clvt_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75907 -0.39560 -0.09793  0.30022  2.43048 
## 
## Coefficients:
##                               Estimate    Std. Error t value
## (Intercept)              11.1304458166  0.0173152156 642.813
## claim_amount              0.0000321002  0.0000007383  43.478
## vintage                   0.0097307466  0.0009266629  10.501
## genderMale               -0.0358905291  0.0042891094  -8.368
## areaUrban                 0.0773051255  0.0053083084  14.563
## qualificationHigh School -0.0229776807  0.0043575668  -5.273
## qualificationOthers      -0.1929284317  0.0108080231 -17.850
## income2L-5L               0.0171251281  0.0153416921   1.116
## income5L-10L             -0.0506771429  0.0150367940  -3.370
## incomeMore than 10L      -0.0519732007  0.0158300902  -3.283
## marital_status1          -0.0869717441  0.0043260295 -20.104
## policyB                  -0.0195587704  0.0050459060  -3.876
## policyC                   0.0499341055  0.0075268826   6.634
## type_of_policyPlatinum    0.0115067266  0.0052777296   2.180
## type_of_policySilver     -0.0542317212  0.0062395570  -8.692
##                                      Pr(>|t|)    
## (Intercept)              < 0.0000000000000002 ***
## claim_amount             < 0.0000000000000002 ***
## vintage                  < 0.0000000000000002 ***
## genderMale               < 0.0000000000000002 ***
## areaUrban                < 0.0000000000000002 ***
## qualificationHigh School      0.0000001344838 ***
## qualificationOthers      < 0.0000000000000002 ***
## income2L-5L                          0.264319    
## income5L-10L                         0.000751 ***
## incomeMore than 10L                  0.001027 ** 
## marital_status1          < 0.0000000000000002 ***
## policyB                              0.000106 ***
## policyC                       0.0000000000328 ***
## type_of_policyPlatinum               0.029242 *  
## type_of_policySilver     < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6313 on 89377 degrees of freedom
## Multiple R-squared:  0.05911,    Adjusted R-squared:  0.05897 
## F-statistic: 401.1 on 14 and 89377 DF,  p-value: < 0.00000000000000022

Model Comparison

Adjusted r-squared

summary(clvt_claim)$r.squared

## [1] 0.04165086

summary(clvt_all)$adj.r.squared

## [1] 0.3167083

summary(clvt_step)$adj.r.squared

## [1] 0.3167066

summary(cltv_selection)$adj.r.squared

## [1] 0.05896739

Conclusion:

Based on the R-Squared value, it can be determined that the best models are the “all” model and the “step” model.

RMSE

clvt_clean$pred_claim <- predict(object = clvt_claim, newdata = clvt_clean)
clvt_clean$pred_all <- predict(object = clvt_all, newdata = clvt_clean)
clvt_clean$pred_step <- predict(object = clvt_step, newdata = clvt_clean)
clvt_clean$pred_selection <- predict(object = cltv_selection, newdata = clvt_clean)
head(clvt_clean)

##   gender  area qualification        income marital_status vintage claim_amount
## 1   Male Urban      Bachelor        5L-10L              1       5         5790
## 2   Male Rural   High School        5L-10L              0       8         5080
## 3   Male Urban      Bachelor        5L-10L              1       8         2599
## 4 Female Rural   High School        5L-10L              0       7            0
## 5   Male Urban   High School More than 10L              1       6         3508
## 6   Male Rural   High School More than 10L              1       1            0
##   num_policies policy type_of_policy log_cltv pred_claim pred_all pred_step
## 1  More than 1      A       Platinum 11.07144   11.29589 11.50460  11.50313
## 2  More than 1      A       Platinum 13.15270   11.26699 11.50815  11.50604
## 3  More than 1      A       Platinum 11.06995   11.16599 11.45144  11.44987
## 4  More than 1      A       Platinum 11.49191   11.06019 11.40119  11.40293
## 5  More than 1      A           Gold 10.99769   11.20299 11.44550  11.44369
## 6            1      C           Gold 12.76216   11.06019 10.66570  10.66388
##   pred_selection
## 1       11.28023
## 2       11.27332
## 3       11.20699
## 4       11.13641
## 5       11.18093
## 6       10.99230

library(performance)
comparison <- compare_performance(clvt_claim, clvt_all, clvt_step, cltv_selection)
as.data.frame(comparison)

##             Name Model      AIC    AIC_wt     AICc   AICc_wt      BIC
## 1     clvt_claim    lm 173076.2 0.0000000 173076.2 0.0000000 173104.4
## 2       clvt_all    lm 142848.7 0.4042558 142848.7 0.4041642 143008.5
## 3      clvt_step    lm 142848.0 0.5957442 142848.0 0.5958358 142998.4
## 4 cltv_selection    lm 171458.2 0.0000000 171458.2 0.0000000 171608.6
##        BIC_wt         R2 R2_adjusted      RMSE     Sigma
## 1 0.000000000 0.04165086  0.04164014 0.6370559 0.6370630
## 2 0.006131553 0.31682300  0.31670834 0.5378761 0.5379243
## 3 0.993868447 0.31681364  0.31670662 0.5378798 0.5379250
## 4 0.000000000 0.05911477  0.05896739 0.6312247 0.6312777

range(clvt_clean$log_cltv)

## [1] 10.11973 13.49264

#Check normal error
10.11973 + 0.5378761

## [1] 10.65761

13.49264 - 0.5378761

## [1] 12.95476

Insight:

Based on RMSE result above, it can be determined that the error is still normal.

Prediction interval

pred_interval <- predict(object = clvt_all, newdata= clvt_clean, interval= "prediction", level = 0.95)
head(pred_interval)

##        fit       lwr      upr
## 1 11.50460 10.450232 12.55896
## 2 11.50815 10.453745 12.56255
## 3 11.45144 10.397057 12.50582
## 4 11.40119 10.346788 12.45558
## 5 11.44550 10.391089 12.49992
## 6 10.66570  9.611209 11.72018

Kesimpulan:

Based on the comparison between the three models above, it turns out that the model with all predictors (cltv_all) has a smaller RMSE value compared to the model with some predictors. Therefore, in this case, the model with all predictors will be selected as the best model.

Assumption

As one of the statistical models, linear regression has strict assumptions. Here are some assumptions that need to be checked to ensure whether the model we built can be considered as the Best Linear Unbiased Estimator (BLUE) model, which means a model that can consistently predict new data.

Assumptions of linear regression model:

Linearity
Normality of Residuals
Homoscedasticity of Residuals
No Multicollinearity

Note: In testing these assumptions, the model to be used is clvt_all as the best model based on its lower error rate compared to cltv_claim.

##Linearity

plot(clvt_all, 
     which = 1)

Conclusion:

The residuals are randomly scattered over a wide range, indicating that this model does not meet the assumption of linearity.

Normality of Residuals

hist(clvt_all$residuals)

plot(clvt_all, which = 2)

library(nortest)
lillie.test(clvt_all$residuals)

## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  clvt_all$residuals
## D = 0.10692, p-value < 0.00000000000000022

Insight:

The shapiro.test() cannot be performed because the sample size is below 3. Therefore, an alternative approach is to use the Lilliefors test (lillie.test), which yields a p-value of 2.2e-16, indicating <0.05. Additionally, from the visualization above, the data points show a pattern that deviates from the strip line. Hence, we can conclude that the residual data is not normally distributed.

Homoscedasticity of Residuals

#Visualization
plot(x= clvt_all$fitted.values,
     y = clvt_all$residuals)
abline(h = 0, col = "red")

library(lmtest)
bptest(clvt_all)

## 
##  studentized Breusch-Pagan test
## 
## data:  clvt_all
## BP = 3408.4, df = 15, p-value < 0.00000000000000022

Conclusion:

With a p-value of < 0.00000000000000022 < 0.05, it indicates that the error spread is NOT constant or there is heteroscedasticity present. Therefore, this assumption is rejected.

No Multicollinearity

library(car)
vif(clvt_all)

##                    GVIF Df GVIF^(1/(2*Df))
## gender         1.017864  1        1.008893
## area           1.334690  1        1.155288
## qualification  1.022430  2        1.005561
## income         1.116882  3        1.018594
## marital_status 1.028913  1        1.014354
## vintage        1.012572  1        1.006266
## claim_amount   1.316496  1        1.147387
## num_policies   1.051358  1        1.025358
## policy         1.106047  2        1.025518
## type_of_policy 1.029260  2        1.007236

Conclusion:

All predictor variables meet the assumption of no multicollinearity.

Model Improvement

Karena beberapa asumsi linear regression diatas ditolak, maka akan coba dilakukan beberapa transformasi pada data serta menghapus beberapa variabel dengan tujuan memberi kemungkinan asumsi-asumsi linear regression dapat diterima.

1. Mengubah bentuk cltv dan claim amount ke bentuk sqrt: Hal ini bertujuan untuk membuat hubungan antar variabel menjadi linear.

clvt$sqrt_cltv<-sqrt(clvt$cltv)
clvt$sqrt_claim_amount<-sqrt(clvt$claim_amount)

2. Menghapus beberapa variable kategorik: Di dalam data, terdapat beberapa variabel kategorik, dimana ketika dilakukan proses EDA jumlah dari data kategorik tersebut tidak terlalu banyak dan menunjukkan nilai yang tidak signifikan ketika dilakukan pemodelan. Oleh karena itu, akan dilakukan pengapusan terhadap beberapa variabel kategorik tersebut.

summary(clvt_all)

## 
## Call:
## lm(formula = log_cltv ~ ., data = clvt_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6701 -0.3748 -0.1396  0.2429  2.5309 
## 
## Coefficients:
##                               Estimate    Std. Error t value
## (Intercept)              10.6717818555  0.0149645998 713.135
## genderMale                0.0040510994  0.0036613033   1.106
## areaUrban                 0.0319831287  0.0045300451   7.060
## qualificationHigh School -0.0083175538  0.0037140278  -2.239
## qualificationOthers      -0.0780491645  0.0092309590  -8.455
## income2L-5L               0.0290280027  0.0130731224   2.220
## income5L-10L             -0.0038246056  0.0128156924  -0.298
## incomeMore than 10L      -0.0162865803  0.0134905355  -1.207
## marital_status1          -0.0482944710  0.0036923094 -13.080
## vintage                   0.0031659244  0.0007904369   4.005
## claim_amount              0.0000196347  0.0000006328  31.029
## num_policiesMore than 1   0.7227126883  0.0039360237 183.615
## policyB                  -0.0454572146  0.0043020299 -10.566
## policyC                   0.0595949930  0.0064140225   9.291
## type_of_policyPlatinum   -0.0033273639  0.0044979841  -0.740
## type_of_policySilver     -0.0176372973  0.0053205849  -3.315
##                                      Pr(>|t|)    
## (Intercept)              < 0.0000000000000002 ***
## genderMale                           0.268529    
## areaUrban                    0.00000000000167 ***
## qualificationHigh School             0.025126 *  
## qualificationOthers      < 0.0000000000000002 ***
## income2L-5L                          0.026392 *  
## income5L-10L                         0.765375    
## incomeMore than 10L                  0.227335    
## marital_status1          < 0.0000000000000002 ***
## vintage                      0.00006199304410 ***
## claim_amount             < 0.0000000000000002 ***
## num_policiesMore than 1  < 0.0000000000000002 ***
## policyB                  < 0.0000000000000002 ***
## policyC                  < 0.0000000000000002 ***
## type_of_policyPlatinum               0.459456    
## type_of_policySilver                 0.000917 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5379 on 89376 degrees of freedom
## Multiple R-squared:  0.3168, Adjusted R-squared:  0.3167 
## F-statistic:  2763 on 15 and 89376 DF,  p-value: < 0.00000000000000022

# Menghapus kolom yang tidak digunakan
clvt2 <- clvt %>% select(-id, -claim_amount, -cltv, -log_cltv)

# Menghapus data "5L-10L" dan "more than 10L" dari kolom income
clvt2 <- clvt2[!(clvt2$income == "5L-10L" | clvt2$income == "More than 10L"), ]

# Menghapus data dengan jenis kebijakan "Platinum" dari kolom type_of_policy
clvt2 <- clvt2[!(clvt2$type_of_policy == "Platinum"), ]

ggcorr(clvt2, label = T)

## Warning in ggcorr(clvt2, label = T): data in column(s) 'gender', 'area',
## 'qualification', 'income', 'marital_status', 'num_policies', 'policy',
## 'type_of_policy' are not numeric and were ignored

new_model <- lm(formula = sqrt_cltv ~ .,
                data = clvt2)
summary(new_model)

## 
## Call:
## lm(formula = sqrt_cltv ~ ., data = clvt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -205.57  -70.19  -30.04   32.49  511.37 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              191.70622    6.30318  30.414 < 0.0000000000000002 ***
## genderMale                 5.26526    2.06121   2.554             0.010649 *  
## areaUrban                  2.93769    3.30779   0.888             0.374500    
## qualificationHigh School   0.22416    2.11935   0.106             0.915770    
## qualificationOthers       -7.02675    4.92245  -1.427             0.153467    
## income2L-5L                4.12034    3.96130   1.040             0.298294    
## marital_status1           -9.83580    2.05990  -4.775          0.000001821 ***
## vintage                    0.58230    0.45355   1.284             0.199213    
## num_policiesMore than 1  113.09566    2.26250  49.987 < 0.0000000000000002 ***
## policyB                  -13.06886    2.34883  -5.564          0.000000027 ***
## policyC                   10.88665    3.27633   3.323             0.000894 ***
## type_of_policySilver      -2.91066    2.07524  -1.403             0.160775    
## sqrt_claim_amount          0.44080    0.04208  10.476 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.9 on 11049 degrees of freedom
## Multiple R-squared:  0.2137, Adjusted R-squared:  0.2129 
## F-statistic: 250.3 on 12 and 11049 DF,  p-value: < 0.00000000000000022

model_selection <- lm(formula = sqrt_cltv ~ gender + marital_status + num_policies + policy + sqrt_claim_amount,
                      data = clvt2)
summary(model_selection)

## 
## Call:
## lm(formula = sqrt_cltv ~ gender + marital_status + num_policies + 
##     policy + sqrt_claim_amount, data = clvt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -206.03  -70.20  -30.03   32.84  510.55 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)             197.57707    3.75565  52.608 < 0.0000000000000002 ***
## genderMale                5.49427    2.05359   2.675             0.007474 ** 
## marital_status1          -9.69566    2.05920  -4.708         0.0000025264 ***
## num_policiesMore than 1 113.54147    2.25466  50.359 < 0.0000000000000002 ***
## policyB                 -13.37047    2.26004  -5.916         0.0000000034 ***
## policyC                  10.83520    3.27411   3.309             0.000938 ***
## sqrt_claim_amount         0.45485    0.03917  11.612 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.9 on 11055 degrees of freedom
## Multiple R-squared:  0.2132, Adjusted R-squared:  0.2127 
## F-statistic: 499.2 on 6 and 11055 DF,  p-value: < 0.00000000000000022

model_comparison <- compare_performance(new_model, model_selection)
as.data.frame(model_comparison)

##              Name Model      AIC    AIC_wt     AICc   AICc_wt      BIC
## 1       new_model    lm 134763.5 0.1143503 134763.5 0.1130909 134865.9
## 2 model_selection    lm 134759.4 0.8856497 134759.4 0.8869091 134817.9
##                BIC_wt        R2 R2_adjusted     RMSE    Sigma
## 1 0.00000000003848054 0.2137300   0.2128760 106.8127 106.8755
## 2 0.99999999996151945 0.2131678   0.2127408 106.8509 106.8847

plot(model_selection, 
     which = 1)

lillie.test(model_selection$residuals)

## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  model_selection$residuals
## D = 0.14255, p-value < 0.00000000000000022

plot(model_selection, which = 2)

Conclusion

Based on the testing of the four assumptions above, three assumptions of linear regression (Linearity, Normality of Residuals, and Homoscedasticity of Residuals) are rejected, and only the assumption of no multicollinearity is accepted. Therefore, we can conclude that the model we have does not meet the assumptions of linear regression. As a result, to make predictions for the customer lifetime value (CLTV) for VahanBima company, a more complex model is needed.

VahanBima’s Customer Lifetime Value (CLTV) Prediction Using Linear Regression

Miftahul Labiib Syam

2023-06-15

Introduction and Business Problem

Business Question

Data Wrangling

Read data

Exploratory Data Analysis (EDA)

Model Fitting and Feature Selection

Model with one predictor

Model with all predictors

Model Step

Model Selection

Model Comparison

Adjusted r-squared

RMSE

Prediction interval

Assumption

Normality of Residuals

Homoscedasticity of Residuals

No Multicollinearity

Model Improvement

Conclusion