VahanBima is one of the leading insurance companies in India. It provides motor vehicle insurance at the best prices with 24/7 claim settlement. It offers different types of policies for both personal and commercial vehicles. It has established its brand across different regions in India.
Around 90% of businesses today use personalized services. The company wants to launch different personalized experience programs for customers of VahanBima. The personalized experience can be dedicated resources for claim settlement, different kinds of services at the doorstep, etc. In order to do so, they would like to segment the customers into different tiers based on their customer lifetime value (CLTV).
In order to do it, they would like to predict the customer lifetime value based on the activity and interaction of the customer with the platform. So, as a part of this challenge, your task at hand is to build a high-performance and interpretable machine learning model to predict the CLTV based on user and policy data.
#Import library
library(dplyr)
library(ggplot2)
library(lubridate)
#read data
clvt <- read.csv("train_BRCpofr.csv")
head(clvt)
## id gender area qualification income marital_status vintage
## 1 1 Male Urban Bachelor 5L-10L 1 5
## 2 2 Male Rural High School 5L-10L 0 8
## 3 3 Male Urban Bachelor 5L-10L 1 8
## 4 4 Female Rural High School 5L-10L 0 7
## 5 5 Male Urban High School More than 10L 1 6
## 6 6 Male Rural High School More than 10L 1 1
## claim_amount num_policies policy type_of_policy cltv
## 1 5790 More than 1 A Platinum 64308
## 2 5080 More than 1 A Platinum 515400
## 3 2599 More than 1 A Platinum 64212
## 4 0 More than 1 A Platinum 97920
## 5 3508 More than 1 A Gold 59736
## 6 0 1 C Gold 348768
About dataset:
id-Unique: identifier of a customer
gender : Gender of the customer
area : Area of the customer
qualification : Highest Qualification of the
customer
income : Income earned in a year (in
rupees)
marital_status : Marital Status of the customer
{0:Single, 1: Married}
vintage : No. of years since the first policy
date
claim_amount : Total Amount Claimed by the customer
(in rupees)
num_policies : Total no. of policies issued by the
customer
policy : Active policy of the customer
type_of_policy : Type of active policy
cltv : Customer lifetime value (Target
Variable)
In this case, cltv will be the target variable, while the others will serve as predictor variables to predict cltv.
#check data type
glimpse(clvt)
## Rows: 89,392
## Columns: 12
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ gender <chr> "Male", "Male", "Male", "Female", "Male", "Male", "Fema…
## $ area <chr> "Urban", "Rural", "Urban", "Rural", "Urban", "Rural", "…
## $ qualification <chr> "Bachelor", "High School", "Bachelor", "High School", "…
## $ income <chr> "5L-10L", "5L-10L", "5L-10L", "5L-10L", "More than 10L"…
## $ marital_status <int> 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1…
## $ vintage <int> 5, 8, 8, 7, 6, 1, 6, 0, 3, 6, 4, 6, 7, 1, 8, 1, 5, 2, 7…
## $ claim_amount <int> 5790, 5080, 2599, 0, 3508, 0, 0, 5473, 6105, 0, 4329, 6…
## $ num_policies <chr> "More than 1", "More than 1", "More than 1", "More than…
## $ policy <chr> "A", "A", "A", "A", "A", "C", "A", "A", "A", "A", "A", …
## $ type_of_policy <chr> "Platinum", "Platinum", "Platinum", "Platinum", "Gold",…
## $ cltv <int> 64308, 515400, 64212, 97920, 59736, 348768, 238920, 641…
#check total unique value
unique(clvt$income)
## [1] "5L-10L" "More than 10L" "2L-5L" "<=2L"
unique(clvt$num_policies)
## [1] "More than 1" "1"
Insight :
Based on the types of columns, there are several column types that are not appropriate and will be converted to a fixed column type. The columns to be converted are as follows:
clvt <- clvt %>%
mutate_at(vars(gender, area, qualification, income, marital_status, num_policies, policy, type_of_policy), as.factor)
glimpse(clvt)
## Rows: 89,392
## Columns: 12
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ gender <fct> Male, Male, Male, Female, Male, Male, Female, Female, F…
## $ area <fct> Urban, Rural, Urban, Rural, Urban, Rural, Rural, Urban,…
## $ qualification <fct> Bachelor, High School, Bachelor, High School, High Scho…
## $ income <fct> 5L-10L, 5L-10L, 5L-10L, 5L-10L, More than 10L, More tha…
## $ marital_status <fct> 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1…
## $ vintage <int> 5, 8, 8, 7, 6, 1, 6, 0, 3, 6, 4, 6, 7, 1, 8, 1, 5, 2, 7…
## $ claim_amount <int> 5790, 5080, 2599, 0, 3508, 0, 0, 5473, 6105, 0, 4329, 6…
## $ num_policies <fct> More than 1, More than 1, More than 1, More than 1, Mor…
## $ policy <fct> A, A, A, A, A, C, A, A, A, A, A, C, A, C, A, A, A, A, C…
## $ type_of_policy <fct> Platinum, Platinum, Platinum, Platinum, Gold, Gold, Gol…
## $ cltv <int> 64308, 515400, 64212, 97920, 59736, 348768, 238920, 641…
#checking data distribution for numerical data
hist(clvt$vintage)
options(scipen = 999)
hist(clvt$cltv)
hist(log(clvt$cltv))
#boxplot
boxplot(clvt$cltv)
Insight:
The data in the cltv column appears to have many outliers and does
not follow a normal distribution. However, if we modify the data using
the log() function, the distribution will become much more
normal. Therefore, the cltv variable, which serves as the target
variable, will be transformed into its logarithmic form.
clvt$log_cltv <-log(clvt$cltv)
#Total cltv based on education level
clvt %>%
group_by(qualification) %>%
summarise(total_clvt = sum(log_cltv)) %>%
ggplot(mapping = aes(x= qualification, y= total_clvt))+
geom_col(aes(fill= qualification))
#Total cltv based on marital status
clvt %>%
group_by(marital_status) %>%
summarise(total_clvt = sum(log_cltv)) %>%
ggplot(mapping = aes(x= marital_status, y= total_clvt))+
geom_col(aes(fill= marital_status))
#Total cltv based on income
clvt %>%
group_by(income) %>%
summarise(total_clvt = sum(log_cltv)) %>%
ggplot(mapping = aes(x= income, y= total_clvt))+
geom_col(aes(fill= income))
#Total cltv based on gender
clvt %>%
group_by(gender) %>%
summarise(total_clvt = sum(log_cltv)) %>%
ggplot(mapping = aes(x= gender, y= total_clvt))+
geom_col(aes(fill= gender))
Insight:
Total clvt paling banyak dimiliki oleh pria
#Total cltv based on area
clvt %>%
group_by(area) %>%
summarise(total_clvt = sum(log_cltv)) %>%
ggplot(mapping = aes(x= area, y= total_clvt))+
geom_col(aes(fill= area))
#Check the correlation between cltv and other predictor variables
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(clvt, label = TRUE)
## Warning in ggcorr(clvt, label = TRUE): data in column(s) 'gender', 'area',
## 'qualification', 'income', 'marital_status', 'num_policies', 'policy',
## 'type_of_policy' are not numeric and were ignored
Insight:
The numeric data correlation shows that vintage has no correlation with cltv, while claim_amount has a positive correlation, but it is very weak.
# Remove the id and cltv data (target variable using alghoritmic form of cltv).
clvt_clean <- clvt %>%
select(-id, -cltv)
head(clvt_clean)
## gender area qualification income marital_status vintage claim_amount
## 1 Male Urban Bachelor 5L-10L 1 5 5790
## 2 Male Rural High School 5L-10L 0 8 5080
## 3 Male Urban Bachelor 5L-10L 1 8 2599
## 4 Female Rural High School 5L-10L 0 7 0
## 5 Male Urban High School More than 10L 1 6 3508
## 6 Male Rural High School More than 10L 1 1 0
## num_policies policy type_of_policy log_cltv
## 1 More than 1 A Platinum 11.07144
## 2 More than 1 A Platinum 13.15270
## 3 More than 1 A Platinum 11.06995
## 4 More than 1 A Platinum 11.49191
## 5 More than 1 A Gold 10.99769
## 6 1 C Gold 12.76216
#Model with one target value
clvt_claim <- lm(formula = log_cltv ~ claim_amount,
data = clvt_clean)
summary(clvt_claim)
##
## Call:
## lm(formula = log_cltv ~ claim_amount, data = clvt_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9203 -0.3913 -0.1057 0.3075 2.3714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0601850035 0.0035521502 3113.66 <0.0000000000000002 ***
## claim_amount 0.0000407097 0.0000006531 62.33 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6371 on 89390 degrees of freedom
## Multiple R-squared: 0.04165, Adjusted R-squared: 0.04164
## F-statistic: 3885 on 1 and 89390 DF, p-value: < 0.00000000000000022
#Model with all predictors
clvt_all <- lm(formula = log_cltv ~.,
data = clvt_clean)
summary(clvt_all)
##
## Call:
## lm(formula = log_cltv ~ ., data = clvt_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6701 -0.3748 -0.1396 0.2429 2.5309
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 10.6717818555 0.0149645998 713.135
## genderMale 0.0040510994 0.0036613033 1.106
## areaUrban 0.0319831287 0.0045300451 7.060
## qualificationHigh School -0.0083175538 0.0037140278 -2.239
## qualificationOthers -0.0780491645 0.0092309590 -8.455
## income2L-5L 0.0290280027 0.0130731224 2.220
## income5L-10L -0.0038246056 0.0128156924 -0.298
## incomeMore than 10L -0.0162865803 0.0134905355 -1.207
## marital_status1 -0.0482944710 0.0036923094 -13.080
## vintage 0.0031659244 0.0007904369 4.005
## claim_amount 0.0000196347 0.0000006328 31.029
## num_policiesMore than 1 0.7227126883 0.0039360237 183.615
## policyB -0.0454572146 0.0043020299 -10.566
## policyC 0.0595949930 0.0064140225 9.291
## type_of_policyPlatinum -0.0033273639 0.0044979841 -0.740
## type_of_policySilver -0.0176372973 0.0053205849 -3.315
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## genderMale 0.268529
## areaUrban 0.00000000000167 ***
## qualificationHigh School 0.025126 *
## qualificationOthers < 0.0000000000000002 ***
## income2L-5L 0.026392 *
## income5L-10L 0.765375
## incomeMore than 10L 0.227335
## marital_status1 < 0.0000000000000002 ***
## vintage 0.00006199304410 ***
## claim_amount < 0.0000000000000002 ***
## num_policiesMore than 1 < 0.0000000000000002 ***
## policyB < 0.0000000000000002 ***
## policyC < 0.0000000000000002 ***
## type_of_policyPlatinum 0.459456
## type_of_policySilver 0.000917 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5379 on 89376 degrees of freedom
## Multiple R-squared: 0.3168, Adjusted R-squared: 0.3167
## F-statistic: 2763 on 15 and 89376 DF, p-value: < 0.00000000000000022
#model backward
clvt_step <- step(object = clvt_all, direction = "backward")
## Start: AIC=-110836.8
## log_cltv ~ gender + area + qualification + income + marital_status +
## vintage + claim_amount + num_policies + policy + type_of_policy
##
## Df Sum of Sq RSS AIC
## - gender 1 0.4 25862 -110838
## <none> 25862 -110837
## - type_of_policy 2 3.8 25866 -110828
## - vintage 1 4.6 25867 -110823
## - area 1 14.4 25876 -110789
## - income 3 20.0 25882 -110774
## - qualification 2 20.7 25883 -110769
## - marital_status 1 49.5 25912 -110668
## - policy 2 71.4 25933 -110594
## - claim_amount 1 278.6 26141 -109881
## - num_policies 1 9755.7 35618 -82227
##
## Step: AIC=-110837.6
## log_cltv ~ area + qualification + income + marital_status + vintage +
## claim_amount + num_policies + policy + type_of_policy
##
## Df Sum of Sq RSS AIC
## <none> 25862 -110838
## - type_of_policy 2 3.9 25866 -110828
## - vintage 1 4.7 25867 -110823
## - area 1 14.6 25877 -110789
## - income 3 19.9 25882 -110775
## - qualification 2 20.8 25883 -110770
## - marital_status 1 49.2 25912 -110670
## - policy 2 71.3 25934 -110595
## - claim_amount 1 280.5 26143 -109875
## - num_policies 1 9783.2 35646 -82159
summary(clvt_step)
##
## Call:
## lm(formula = log_cltv ~ area + qualification + income + marital_status +
## vintage + claim_amount + num_policies + policy + type_of_policy,
## data = clvt_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6692 -0.3745 -0.1401 0.2430 2.5326
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 10.6732276478 0.0149074611 715.965
## areaUrban 0.0321589771 0.0045272620 7.103
## qualificationHigh School -0.0084809521 0.0037110953 -2.285
## qualificationOthers -0.0781853389 0.0092301501 -8.471
## income2L-5L 0.0294965467 0.0130662787 2.257
## income5L-10L -0.0031997655 0.0128032604 -0.250
## incomeMore than 10L -0.0157224900 0.0134809159 -1.166
## marital_status1 -0.0479976831 0.0036825582 -13.034
## vintage 0.0031720743 0.0007904184 4.013
## claim_amount 0.0000196731 0.0000006318 31.137
## num_policiesMore than 1 0.7224539403 0.0039290755 183.874
## policyB -0.0453373567 0.0043006713 -10.542
## policyC 0.0596848974 0.0064135158 9.306
## type_of_policyPlatinum -0.0032758753 0.0044977491 -0.728
## type_of_policySilver -0.0177050472 0.0053202392 -3.328
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## areaUrban 0.00000000000123 ***
## qualificationHigh School 0.022298 *
## qualificationOthers < 0.0000000000000002 ***
## income2L-5L 0.023982 *
## income5L-10L 0.802651
## incomeMore than 10L 0.243505
## marital_status1 < 0.0000000000000002 ***
## vintage 0.00005996039575 ***
## claim_amount < 0.0000000000000002 ***
## num_policiesMore than 1 < 0.0000000000000002 ***
## policyB < 0.0000000000000002 ***
## policyC < 0.0000000000000002 ***
## type_of_policyPlatinum 0.466409
## type_of_policySilver 0.000875 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5379 on 89377 degrees of freedom
## Multiple R-squared: 0.3168, Adjusted R-squared: 0.3167
## F-statistic: 2960 on 14 and 89377 DF, p-value: < 0.00000000000000022
cltv_selection <- lm(formula = log_cltv ~ claim_amount+vintage+gender+area+qualification+income+marital_status+policy+type_of_policy,
data = clvt_clean)
summary(cltv_selection)
##
## Call:
## lm(formula = log_cltv ~ claim_amount + vintage + gender + area +
## qualification + income + marital_status + policy + type_of_policy,
## data = clvt_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75907 -0.39560 -0.09793 0.30022 2.43048
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 11.1304458166 0.0173152156 642.813
## claim_amount 0.0000321002 0.0000007383 43.478
## vintage 0.0097307466 0.0009266629 10.501
## genderMale -0.0358905291 0.0042891094 -8.368
## areaUrban 0.0773051255 0.0053083084 14.563
## qualificationHigh School -0.0229776807 0.0043575668 -5.273
## qualificationOthers -0.1929284317 0.0108080231 -17.850
## income2L-5L 0.0171251281 0.0153416921 1.116
## income5L-10L -0.0506771429 0.0150367940 -3.370
## incomeMore than 10L -0.0519732007 0.0158300902 -3.283
## marital_status1 -0.0869717441 0.0043260295 -20.104
## policyB -0.0195587704 0.0050459060 -3.876
## policyC 0.0499341055 0.0075268826 6.634
## type_of_policyPlatinum 0.0115067266 0.0052777296 2.180
## type_of_policySilver -0.0542317212 0.0062395570 -8.692
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## claim_amount < 0.0000000000000002 ***
## vintage < 0.0000000000000002 ***
## genderMale < 0.0000000000000002 ***
## areaUrban < 0.0000000000000002 ***
## qualificationHigh School 0.0000001344838 ***
## qualificationOthers < 0.0000000000000002 ***
## income2L-5L 0.264319
## income5L-10L 0.000751 ***
## incomeMore than 10L 0.001027 **
## marital_status1 < 0.0000000000000002 ***
## policyB 0.000106 ***
## policyC 0.0000000000328 ***
## type_of_policyPlatinum 0.029242 *
## type_of_policySilver < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6313 on 89377 degrees of freedom
## Multiple R-squared: 0.05911, Adjusted R-squared: 0.05897
## F-statistic: 401.1 on 14 and 89377 DF, p-value: < 0.00000000000000022
summary(clvt_claim)$r.squared
## [1] 0.04165086
summary(clvt_all)$adj.r.squared
## [1] 0.3167083
summary(clvt_step)$adj.r.squared
## [1] 0.3167066
summary(cltv_selection)$adj.r.squared
## [1] 0.05896739
Conclusion:
Based on the R-Squared value, it can be determined that the best models are the “all” model and the “step” model.
clvt_clean$pred_claim <- predict(object = clvt_claim, newdata = clvt_clean)
clvt_clean$pred_all <- predict(object = clvt_all, newdata = clvt_clean)
clvt_clean$pred_step <- predict(object = clvt_step, newdata = clvt_clean)
clvt_clean$pred_selection <- predict(object = cltv_selection, newdata = clvt_clean)
head(clvt_clean)
## gender area qualification income marital_status vintage claim_amount
## 1 Male Urban Bachelor 5L-10L 1 5 5790
## 2 Male Rural High School 5L-10L 0 8 5080
## 3 Male Urban Bachelor 5L-10L 1 8 2599
## 4 Female Rural High School 5L-10L 0 7 0
## 5 Male Urban High School More than 10L 1 6 3508
## 6 Male Rural High School More than 10L 1 1 0
## num_policies policy type_of_policy log_cltv pred_claim pred_all pred_step
## 1 More than 1 A Platinum 11.07144 11.29589 11.50460 11.50313
## 2 More than 1 A Platinum 13.15270 11.26699 11.50815 11.50604
## 3 More than 1 A Platinum 11.06995 11.16599 11.45144 11.44987
## 4 More than 1 A Platinum 11.49191 11.06019 11.40119 11.40293
## 5 More than 1 A Gold 10.99769 11.20299 11.44550 11.44369
## 6 1 C Gold 12.76216 11.06019 10.66570 10.66388
## pred_selection
## 1 11.28023
## 2 11.27332
## 3 11.20699
## 4 11.13641
## 5 11.18093
## 6 10.99230
library(performance)
comparison <- compare_performance(clvt_claim, clvt_all, clvt_step, cltv_selection)
as.data.frame(comparison)
## Name Model AIC AIC_wt AICc AICc_wt BIC
## 1 clvt_claim lm 173076.2 0.0000000 173076.2 0.0000000 173104.4
## 2 clvt_all lm 142848.7 0.4042558 142848.7 0.4041642 143008.5
## 3 clvt_step lm 142848.0 0.5957442 142848.0 0.5958358 142998.4
## 4 cltv_selection lm 171458.2 0.0000000 171458.2 0.0000000 171608.6
## BIC_wt R2 R2_adjusted RMSE Sigma
## 1 0.000000000 0.04165086 0.04164014 0.6370559 0.6370630
## 2 0.006131553 0.31682300 0.31670834 0.5378761 0.5379243
## 3 0.993868447 0.31681364 0.31670662 0.5378798 0.5379250
## 4 0.000000000 0.05911477 0.05896739 0.6312247 0.6312777
range(clvt_clean$log_cltv)
## [1] 10.11973 13.49264
#Check normal error
10.11973 + 0.5378761
## [1] 10.65761
13.49264 - 0.5378761
## [1] 12.95476
Insight:
Based on RMSE result above, it can be determined that the error is still normal.
pred_interval <- predict(object = clvt_all, newdata= clvt_clean, interval= "prediction", level = 0.95)
head(pred_interval)
## fit lwr upr
## 1 11.50460 10.450232 12.55896
## 2 11.50815 10.453745 12.56255
## 3 11.45144 10.397057 12.50582
## 4 11.40119 10.346788 12.45558
## 5 11.44550 10.391089 12.49992
## 6 10.66570 9.611209 11.72018
Kesimpulan:
Based on the comparison between the three models above, it turns out that the model with all predictors (cltv_all) has a smaller RMSE value compared to the model with some predictors. Therefore, in this case, the model with all predictors will be selected as the best model.
As one of the statistical models, linear regression has strict assumptions. Here are some assumptions that need to be checked to ensure whether the model we built can be considered as the Best Linear Unbiased Estimator (BLUE) model, which means a model that can consistently predict new data.
Assumptions of linear regression model:
Note: In testing these assumptions, the model to be
used is clvt_all as the best model based on its lower error
rate compared to cltv_claim.
##Linearity
plot(clvt_all,
which = 1)
Conclusion:
The residuals are randomly scattered over a wide range, indicating that this model does not meet the assumption of linearity.
hist(clvt_all$residuals)
plot(clvt_all, which = 2)
library(nortest)
lillie.test(clvt_all$residuals)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: clvt_all$residuals
## D = 0.10692, p-value < 0.00000000000000022
Insight:
The shapiro.test() cannot be performed because the
sample size is below 3. Therefore, an alternative approach is to use the
Lilliefors test (lillie.test), which yields a p-value of
2.2e-16, indicating <0.05. Additionally, from the visualization
above, the data points show a pattern that deviates from the strip line.
Hence, we can conclude that the residual data is not normally
distributed.
#Visualization
plot(x= clvt_all$fitted.values,
y = clvt_all$residuals)
abline(h = 0, col = "red")
library(lmtest)
bptest(clvt_all)
##
## studentized Breusch-Pagan test
##
## data: clvt_all
## BP = 3408.4, df = 15, p-value < 0.00000000000000022
Conclusion:
With a p-value of < 0.00000000000000022 < 0.05, it indicates that the error spread is NOT constant or there is heteroscedasticity present. Therefore, this assumption is rejected.
library(car)
vif(clvt_all)
## GVIF Df GVIF^(1/(2*Df))
## gender 1.017864 1 1.008893
## area 1.334690 1 1.155288
## qualification 1.022430 2 1.005561
## income 1.116882 3 1.018594
## marital_status 1.028913 1 1.014354
## vintage 1.012572 1 1.006266
## claim_amount 1.316496 1 1.147387
## num_policies 1.051358 1 1.025358
## policy 1.106047 2 1.025518
## type_of_policy 1.029260 2 1.007236
Conclusion:
All predictor variables meet the assumption of no multicollinearity.
Karena beberapa asumsi linear regression diatas ditolak, maka akan coba dilakukan beberapa transformasi pada data serta menghapus beberapa variabel dengan tujuan memberi kemungkinan asumsi-asumsi linear regression dapat diterima.
1. Mengubah bentuk cltv dan claim amount ke bentuk
sqrt: Hal ini bertujuan untuk membuat hubungan
antar variabel menjadi linear.
clvt$sqrt_cltv<-sqrt(clvt$cltv)
clvt$sqrt_claim_amount<-sqrt(clvt$claim_amount)
2. Menghapus beberapa variable kategorik: Di dalam data, terdapat beberapa variabel kategorik, dimana ketika dilakukan proses EDA jumlah dari data kategorik tersebut tidak terlalu banyak dan menunjukkan nilai yang tidak signifikan ketika dilakukan pemodelan. Oleh karena itu, akan dilakukan pengapusan terhadap beberapa variabel kategorik tersebut.
summary(clvt_all)
##
## Call:
## lm(formula = log_cltv ~ ., data = clvt_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6701 -0.3748 -0.1396 0.2429 2.5309
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 10.6717818555 0.0149645998 713.135
## genderMale 0.0040510994 0.0036613033 1.106
## areaUrban 0.0319831287 0.0045300451 7.060
## qualificationHigh School -0.0083175538 0.0037140278 -2.239
## qualificationOthers -0.0780491645 0.0092309590 -8.455
## income2L-5L 0.0290280027 0.0130731224 2.220
## income5L-10L -0.0038246056 0.0128156924 -0.298
## incomeMore than 10L -0.0162865803 0.0134905355 -1.207
## marital_status1 -0.0482944710 0.0036923094 -13.080
## vintage 0.0031659244 0.0007904369 4.005
## claim_amount 0.0000196347 0.0000006328 31.029
## num_policiesMore than 1 0.7227126883 0.0039360237 183.615
## policyB -0.0454572146 0.0043020299 -10.566
## policyC 0.0595949930 0.0064140225 9.291
## type_of_policyPlatinum -0.0033273639 0.0044979841 -0.740
## type_of_policySilver -0.0176372973 0.0053205849 -3.315
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## genderMale 0.268529
## areaUrban 0.00000000000167 ***
## qualificationHigh School 0.025126 *
## qualificationOthers < 0.0000000000000002 ***
## income2L-5L 0.026392 *
## income5L-10L 0.765375
## incomeMore than 10L 0.227335
## marital_status1 < 0.0000000000000002 ***
## vintage 0.00006199304410 ***
## claim_amount < 0.0000000000000002 ***
## num_policiesMore than 1 < 0.0000000000000002 ***
## policyB < 0.0000000000000002 ***
## policyC < 0.0000000000000002 ***
## type_of_policyPlatinum 0.459456
## type_of_policySilver 0.000917 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5379 on 89376 degrees of freedom
## Multiple R-squared: 0.3168, Adjusted R-squared: 0.3167
## F-statistic: 2763 on 15 and 89376 DF, p-value: < 0.00000000000000022
# Menghapus kolom yang tidak digunakan
clvt2 <- clvt %>% select(-id, -claim_amount, -cltv, -log_cltv)
# Menghapus data "5L-10L" dan "more than 10L" dari kolom income
clvt2 <- clvt2[!(clvt2$income == "5L-10L" | clvt2$income == "More than 10L"), ]
# Menghapus data dengan jenis kebijakan "Platinum" dari kolom type_of_policy
clvt2 <- clvt2[!(clvt2$type_of_policy == "Platinum"), ]
ggcorr(clvt2, label = T)
## Warning in ggcorr(clvt2, label = T): data in column(s) 'gender', 'area',
## 'qualification', 'income', 'marital_status', 'num_policies', 'policy',
## 'type_of_policy' are not numeric and were ignored
new_model <- lm(formula = sqrt_cltv ~ .,
data = clvt2)
summary(new_model)
##
## Call:
## lm(formula = sqrt_cltv ~ ., data = clvt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.57 -70.19 -30.04 32.49 511.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 191.70622 6.30318 30.414 < 0.0000000000000002 ***
## genderMale 5.26526 2.06121 2.554 0.010649 *
## areaUrban 2.93769 3.30779 0.888 0.374500
## qualificationHigh School 0.22416 2.11935 0.106 0.915770
## qualificationOthers -7.02675 4.92245 -1.427 0.153467
## income2L-5L 4.12034 3.96130 1.040 0.298294
## marital_status1 -9.83580 2.05990 -4.775 0.000001821 ***
## vintage 0.58230 0.45355 1.284 0.199213
## num_policiesMore than 1 113.09566 2.26250 49.987 < 0.0000000000000002 ***
## policyB -13.06886 2.34883 -5.564 0.000000027 ***
## policyC 10.88665 3.27633 3.323 0.000894 ***
## type_of_policySilver -2.91066 2.07524 -1.403 0.160775
## sqrt_claim_amount 0.44080 0.04208 10.476 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.9 on 11049 degrees of freedom
## Multiple R-squared: 0.2137, Adjusted R-squared: 0.2129
## F-statistic: 250.3 on 12 and 11049 DF, p-value: < 0.00000000000000022
model_selection <- lm(formula = sqrt_cltv ~ gender + marital_status + num_policies + policy + sqrt_claim_amount,
data = clvt2)
summary(model_selection)
##
## Call:
## lm(formula = sqrt_cltv ~ gender + marital_status + num_policies +
## policy + sqrt_claim_amount, data = clvt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -206.03 -70.20 -30.03 32.84 510.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 197.57707 3.75565 52.608 < 0.0000000000000002 ***
## genderMale 5.49427 2.05359 2.675 0.007474 **
## marital_status1 -9.69566 2.05920 -4.708 0.0000025264 ***
## num_policiesMore than 1 113.54147 2.25466 50.359 < 0.0000000000000002 ***
## policyB -13.37047 2.26004 -5.916 0.0000000034 ***
## policyC 10.83520 3.27411 3.309 0.000938 ***
## sqrt_claim_amount 0.45485 0.03917 11.612 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.9 on 11055 degrees of freedom
## Multiple R-squared: 0.2132, Adjusted R-squared: 0.2127
## F-statistic: 499.2 on 6 and 11055 DF, p-value: < 0.00000000000000022
model_comparison <- compare_performance(new_model, model_selection)
as.data.frame(model_comparison)
## Name Model AIC AIC_wt AICc AICc_wt BIC
## 1 new_model lm 134763.5 0.1143503 134763.5 0.1130909 134865.9
## 2 model_selection lm 134759.4 0.8856497 134759.4 0.8869091 134817.9
## BIC_wt R2 R2_adjusted RMSE Sigma
## 1 0.00000000003848054 0.2137300 0.2128760 106.8127 106.8755
## 2 0.99999999996151945 0.2131678 0.2127408 106.8509 106.8847
plot(model_selection,
which = 1)
lillie.test(model_selection$residuals)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: model_selection$residuals
## D = 0.14255, p-value < 0.00000000000000022
plot(model_selection, which = 2)
Based on the testing of the four assumptions above, three assumptions of linear regression (Linearity, Normality of Residuals, and Homoscedasticity of Residuals) are rejected, and only the assumption of no multicollinearity is accepted. Therefore, we can conclude that the model we have does not meet the assumptions of linear regression. As a result, to make predictions for the customer lifetime value (CLTV) for VahanBima company, a more complex model is needed.