First we’ll load the data into R. I have renamed my data set for easy loading
df <- read.csv("//au.qbe.pri/Home/NSW_Home/200583/Profile/Documents/DataCademy/ADS.csv")
Next let’s load libraries that will be useful in our analysis:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Then inspect the data by using str, summary and head functions:
head(df)
## Customer State Customer.Lifetime.Value Response Coverage Education
## 1 BU79786 Washington 2763.519 No Basic Bachelor
## 2 QZ44356 Arizona 6979.536 No Extended Bachelor
## 3 AI49188 Nevada 12887.432 No Premium Bachelor
## 4 WW63253 California 7645.862 No Basic Bachelor
## 5 HB64268 Washington 2813.693 No Basic Bachelor
## 6 OC83172 Oregon 8256.298 Yes Basic Bachelor
## Effective.To.Date EmploymentStatus Gender Income Location.Code
## 1 2/24/11 Employed F 56274 Suburban
## 2 1/31/11 Unemployed F 0 Suburban
## 3 2/19/11 Employed F 48767 Suburban
## 4 1/20/11 Unemployed M 0 Suburban
## 5 2/3/11 Employed M 43836 Rural
## 6 1/25/11 Employed F 62902 Rural
## Marital.Status Monthly.Premium.Auto Months.Since.Last.Claim
## 1 Married 69 32
## 2 Single 94 13
## 3 Married 108 18
## 4 Married 106 18
## 5 Single 73 12
## 6 Married 69 14
## Months.Since.Policy.Inception Number.of.Open.Complaints
## 1 5 0
## 2 42 0
## 3 38 0
## 4 65 0
## 5 44 0
## 6 94 0
## Number.of.Policies Policy.Type Policy Renew.Offer.Type
## 1 1 Corporate Auto Corporate L3 Offer1
## 2 8 Personal Auto Personal L3 Offer3
## 3 2 Personal Auto Personal L3 Offer1
## 4 7 Corporate Auto Corporate L2 Offer1
## 5 1 Personal Auto Personal L1 Offer1
## 6 2 Personal Auto Personal L3 Offer2
## Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
## 1 Agent 384.8111 Two-Door Car Medsize
## 2 Agent 1131.4649 Four-Door Car Medsize
## 3 Agent 566.4722 Two-Door Car Medsize
## 4 Call Center 529.8813 SUV Medsize
## 5 Agent 138.1309 Four-Door Car Medsize
## 6 Web 159.3830 Two-Door Car Medsize
str(df)
## 'data.frame': 9134 obs. of 24 variables:
## $ Customer : Factor w/ 9134 levels "AA10041","AA11235",..: 601 5947 97 8017 2489 4948 8434 756 1352 548 ...
## $ State : Factor w/ 5 levels "Arizona","California",..: 5 1 3 2 5 4 4 1 4 4 ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## $ Response : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ Coverage : Factor w/ 3 levels "Basic","Extended",..: 1 2 3 1 1 1 1 3 1 2 ...
## $ Education : Factor w/ 5 levels "Bachelor","College",..: 1 1 1 1 1 1 2 5 1 2 ...
## $ Effective.To.Date : Factor w/ 59 levels "1/1/11","1/10/11",..: 48 25 42 13 53 18 48 10 19 40 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 2 5 2 5 2 2 2 5 3 2 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : Factor w/ 3 levels "Rural","Suburban",..: 2 2 2 2 1 1 2 3 2 3 ...
## $ Marital.Status : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number.of.Policies : int 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : Factor w/ 3 levels "Corporate Auto",..: 1 2 2 1 2 2 1 1 1 3 ...
## $ Policy : Factor w/ 9 levels "Corporate L1",..: 3 6 6 2 4 6 3 3 3 8 ...
## $ Renew.Offer.Type : Factor w/ 4 levels "Offer1","Offer2",..: 1 3 1 1 1 2 1 1 1 2 ...
## $ Sales.Channel : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
## $ Total.Claim.Amount : num 385 1131 566 530 138 ...
## $ Vehicle.Class : Factor w/ 6 levels "Four-Door Car",..: 6 1 6 5 1 6 1 1 1 1 ...
## $ Vehicle.Size : Factor w/ 3 levels "Large","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(df)
## Customer State Customer.Lifetime.Value Response
## AA10041: 1 Arizona :1703 Min. : 1898 No :7826
## AA11235: 1 California:3150 1st Qu.: 3994 Yes:1308
## AA16582: 1 Nevada : 882 Median : 5780
## AA30683: 1 Oregon :2601 Mean : 8005
## AA34092: 1 Washington: 798 3rd Qu.: 8962
## AA35519: 1 Max. :83325
## (Other):9128
## Coverage Education Effective.To.Date
## Basic :5568 Bachelor :2748 1/10/11: 195
## Extended:2742 College :2681 1/27/11: 194
## Premium : 824 Doctor : 342 2/14/11: 186
## High School or Below:2622 1/26/11: 181
## Master : 741 1/17/11: 180
## 1/19/11: 179
## (Other):8019
## EmploymentStatus Gender Income Location.Code
## Disabled : 405 F:4658 Min. : 0 Rural :1773
## Employed :5698 M:4476 1st Qu.: 0 Suburban:5779
## Medical Leave: 432 Median :33890 Urban :1582
## Retired : 282 Mean :37657
## Unemployed :2317 3rd Qu.:62320
## Max. :99981
##
## Marital.Status Monthly.Premium.Auto Months.Since.Last.Claim
## Divorced:1369 Min. : 61.00 Min. : 0.0
## Married :5298 1st Qu.: 68.00 1st Qu.: 6.0
## Single :2467 Median : 83.00 Median :14.0
## Mean : 93.22 Mean :15.1
## 3rd Qu.:109.00 3rd Qu.:23.0
## Max. :298.00 Max. :35.0
##
## Months.Since.Policy.Inception Number.of.Open.Complaints
## Min. : 0.00 Min. :0.0000
## 1st Qu.:24.00 1st Qu.:0.0000
## Median :48.00 Median :0.0000
## Mean :48.06 Mean :0.3844
## 3rd Qu.:71.00 3rd Qu.:0.0000
## Max. :99.00 Max. :5.0000
##
## Number.of.Policies Policy.Type Policy
## Min. :1.000 Corporate Auto:1968 Personal L3 :3426
## 1st Qu.:1.000 Personal Auto :6788 Personal L2 :2122
## Median :2.000 Special Auto : 378 Personal L1 :1240
## Mean :2.966 Corporate L3:1014
## 3rd Qu.:4.000 Corporate L2: 595
## Max. :9.000 Corporate L1: 359
## (Other) : 378
## Renew.Offer.Type Sales.Channel Total.Claim.Amount
## Offer1:3752 Agent :3477 Min. : 0.099
## Offer2:2926 Branch :2567 1st Qu.: 272.258
## Offer3:1432 Call Center:1765 Median : 383.945
## Offer4:1024 Web :1325 Mean : 434.089
## 3rd Qu.: 547.515
## Max. :2893.240
##
## Vehicle.Class Vehicle.Size
## Four-Door Car:4621 Large : 946
## Luxury Car : 163 Medsize:6424
## Luxury SUV : 184 Small :1764
## Sports Car : 484
## SUV :1796
## Two-Door Car :1886
##
Let’s change the data type of Customer as it is a encoded as a factor variable
df$Customer <- as.character(df$Customer)
Also notice that Effective.to.date column is coded as a factor where in fact it should be formatted as Date. Let’s reformat it:
df$Effective.To.Date <- as.Date(df$Effective.To.Date, "%m/%d/%y")
typeof(df$Effective.To.Date)
## [1] "double"
Test if there is any NA values over all df
sum(is.na(df))
## [1] 0
Check per column, count number of True in the df returned by is.na(df) and format as Data Frame for more readability:
data.frame(colSums(is.na(df)))
## colSums.is.na.df..
## Customer 0
## State 0
## Customer.Lifetime.Value 0
## Response 0
## Coverage 0
## Education 0
## Effective.To.Date 0
## EmploymentStatus 0
## Gender 0
## Income 0
## Location.Code 0
## Marital.Status 0
## Monthly.Premium.Auto 0
## Months.Since.Last.Claim 0
## Months.Since.Policy.Inception 0
## Number.of.Open.Complaints 0
## Number.of.Policies 0
## Policy.Type 0
## Policy 0
## Renew.Offer.Type 0
## Sales.Channel 0
## Total.Claim.Amount 0
## Vehicle.Class 0
## Vehicle.Size 0
Check if there is any duplicates
anyDuplicated(df)
## [1] 0
let’s look at the five number summary of the variables we will be hugely interested with:
For Customer Lifetime Value
library(moments)
summary(df$Customer.Lifetime.Value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1898 3994 5780 8005 8962 83325
skewness(df$Customer.Lifetime.Value)
## [1] 3.031782
kurtosis(df$Customer.Lifetime.Value)
## [1] 16.81531
For Monthly Premium
summary(df$Monthly.Premium.Auto)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.00 68.00 83.00 93.22 109.00 298.00
skewness(df$Monthly.Premium.Auto)
## [1] 2.123198
kurtosis(df$Monthly.Premium.Auto)
## [1] 9.189558
For Claim Amount
summary(df$Total.Claim.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.099 272.258 383.945 434.089 547.515 2893.240
skewness(df$Total.Claim.Amount)
## [1] 1.714684
kurtosis(df$Total.Claim.Amount)
## [1] 8.975472
Now we’ll look into their distributions using histograms. First we import these libraries to aid us in our analysis:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that these Metrics are all Skewed to the Left, with varying Skewness!
Rural customers are LESS valuable than Urban customers
df %>%
group_by(Location.Code) %>%
summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value), Std_clv = sd(Customer.Lifetime.Value)) %>%
arrange(ave_clv)
## # A tibble: 3 x 4
## Location.Code cnt ave_clv Std_clv
## <fctr> <int> <dbl> <dbl>
## 1 Rural 1773 7953.699 6595.473
## 2 Suburban 5779 8004.457 6963.661
## 3 Urban 1582 8064.133 6836.296
We will test the difference between two means: The Null Hypothesis is that the mean CLV for RURAL is just equal to URBAN the ALternative Hypothesis is that RURAL is LESS than URBAN
First, let’s try to Visualize the distribution of CLV:
Rural <- df%>% filter(Location.Code == 'Rural')
Urban <- df%>% filter(Location.Code == 'Urban')
ggplot(Rural, aes(x = Customer.Lifetime.Value))+geom_histogram()+ggtitle('Rural CLV')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Urban, aes(x = Customer.Lifetime.Value))+geom_histogram()+ggtitle('Urban CLV')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
then we chec if the difference is statistically significant using t test
urban <- df$Customer.Lifetime.Value[df$Location.Code == 'Urban']
rural <- df$Customer.Lifetime.Value[df$Location.Code == 'Rural']
stat = mean(urban) - mean(rural)
t.test(rural,urban, alternative = "less", conf.level = .95)
##
## Welch Two Sample t-test
##
## data: rural and urban
## t = -0.47489, df = 3279.4, p-value = 0.3174
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 272.1745
## sample estimates:
## mean of x mean of y
## 7953.699 8064.133
the p-value reported by the t.test is 0.3174, which is higher than the critical value of 0.5 thus we fail to reject the Null Hypothesis i.e. there is insufficent evidence that Rural customers are less valuable than Urban
Educated customers (with a bachelors or equivalent degree) are MORE valuable than others
what are the levels of education in our dataset?:
levels(df$Education)
## [1] "Bachelor" "College" "Doctor"
## [4] "High School or Below" "Master"
df %>%
group_by(Education) %>%
summarise(ave_clv = mean(Customer.Lifetime.Value)) %>%
arrange(ave_clv)
## # A tibble: 5 x 2
## Education ave_clv
## <fctr> <dbl>
## 1 Doctor 7520.345
## 2 College 7851.065
## 3 Bachelor 7872.660
## 4 Master 8243.485
## 5 High School or Below 8296.709
Similar to the previous item, but this time we will have alternative Hypothesis as Educated Customers are More educated than others
Educated <- df$Customer.Lifetime.Value[df$Education != 'High School or Below']
Non.Educ <- df$Customer.Lifetime.Value[df$Education == 'High School or Below']
t.test(Educated, Non.Educ, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: Educated and Non.Educ
## t = -2.4988, df = 4546.6, p-value = 0.9938
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -678.6923 Inf
## sample estimates:
## mean of x mean of y
## 7887.462 8296.709
at p-value 0.99 We fail to reject the Null Hypothesis i.e. there is insufficient evidence that Educated customers have higher CLV than others
Marital status has no role to play in determining the value of a customer
df %>%
group_by(Marital.Status) %>%
summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value), Std_clv = sd(Customer.Lifetime.Value)) %>%
arrange(ave_clv)
## # A tibble: 3 x 4
## Marital.Status cnt ave_clv Std_clv
## <fctr> <int> <dbl> <dbl>
## 1 Single 2467 7714.837 6731.479
## 2 Married 5298 8078.967 6902.504
## 3 Divorced 1369 8241.239 6984.894
There are three Levels, now we test difference of CLV means among each of them Use One way Anova or Regression: or Kruskal-Wallis test:
kruskal.test(Customer.Lifetime.Value ~ Marital.Status, data = df)
##
## Kruskal-Wallis rank sum test
##
## data: Customer.Lifetime.Value by Marital.Status
## Kruskal-Wallis chi-squared = 20.896, df = 2, p-value = 2.901e-05
By the Kruskal-Wallis Test, there is significant difference between the Marital Status on determining the CLV.
Let’s try linear Regression:
Using linear regression we are testing the Hypothesis that Marital Status has a statistically significant reletionship with CLV. Our Null Hypothesis would be there is no relationship i.e the intercept is 0.
CLV.lm = lm(Customer.Lifetime.Value ~ Marital.Status, data = df)
summary(CLV.lm)
##
## Call:
## lm(formula = Customer.Lifetime.Value ~ Marital.Status, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6237 -3990 -2219 956 75246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8241.2 185.7 44.390 <2e-16 ***
## Marital.StatusMarried -162.3 208.3 -0.779 0.436
## Marital.StatusSingle -526.4 231.5 -2.274 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9131 degrees of freedom
## Multiple R-squared: 0.0007262, Adjusted R-squared: 0.0005073
## F-statistic: 3.318 on 2 and 9131 DF, p-value: 0.03628
anova(CLV.lm)
## Analysis of Variance Table
##
## Response: Customer.Lifetime.Value
## Df Sum Sq Mean Sq F value Pr(>F)
## Marital.Status 2 3.1310e+08 156548158 3.3177 0.03628 *
## Residuals 9131 4.3086e+11 47186247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result shows that the intercept(mean response) is statistically significant. WeReject he NUll Hypothesis
West Coast agents are performing poorly - They are signing low value customers
levels(df$State)
## [1] "Arizona" "California" "Nevada" "Oregon" "Washington"
levels(df$Sales.Channel)
## [1] "Agent" "Branch" "Call Center" "Web"
first, we filter the data to get policies which have “Agent” as Sales.Channel:
Agentdf <- df %>% filter(Sales.Channel == "Agent")
Agentdf %>%
group_by(State) %>%
summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value),Std_clv = sd(Customer.Lifetime.Value))
## # A tibble: 5 x 4
## State cnt ave_clv Std_clv
## <fctr> <int> <dbl> <dbl>
## 1 Arizona 643 7841.994 6886.165
## 2 California 1201 7878.525 6059.478
## 3 Nevada 330 8729.443 8125.603
## 4 Oregon 988 7941.217 6414.209
## 5 Washington 315 7739.067 7077.971
Let the Null Hypothesis be that Average California CLV from agents are just equal to the rest of the Agents;
Alternative; California CLV less than Rest
we set up the data needed
California <- Agentdf$Customer.Lifetime.Value[df$State == 'California']
Others <- Agentdf$Customer.Lifetime.Value[df$State != 'California']
Then we test the Hypothesis:
t.test(California, Others, alternative = "less")
##
## Welch Two Sample t-test
##
## data: California and Others
## t = 0.44252, df = 1886.1, p-value = 0.6709
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 517.6109
## sample estimates:
## mean of x mean of y
## 8035.284 7925.594
The T.test result shows that there is insufficient evidence that california Agents have lower CLV than the rest of the Agents
Call Center is not performing well comparerd to other channels throughout the country (in terms of high value customers)
Who are the High Value Customers? Let’s assign High Value customers as those with CLV above Overall Average CLV.
mean_CLV <- mean(df$Customer.Lifetime.Value)
High_Value <- df %>% filter(Customer.Lifetime.Value > mean_CLV)
hist(High_Value$Customer.Lifetime.Value)
Let’s construct our dataset:
High_Value %>%
group_by(Sales.Channel) %>%
summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value),Std_clv = sd(Customer.Lifetime.Value))
## # A tibble: 4 x 4
## Sales.Channel cnt ave_clv Std_clv
## <fctr> <int> <dbl> <dbl>
## 1 Agent 1151 14471.77 7991.968
## 2 Branch 874 14849.87 8546.938
## 3 Call Center 590 14867.56 8767.698
## 4 Web 424 14543.84 8353.562
HV_CC <- High_Value$Customer.Lifetime.Value[High_Value$Sales.Channel == 'Call Center']
HV_NonCC <- High_Value$Customer.Lifetime.Value[High_Value$Sales.Channel != 'Call Center']
Then we perform t.test:
t.test(HV_CC, HV_NonCC, alternative = "less")
##
## Welch Two Sample t-test
##
## data: HV_CC and HV_NonCC
## t = 0.62463, df = 858.02, p-value = 0.7338
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 903.1439
## sample estimates:
## mean of x mean of y
## 14867.56 14619.18
There is no sufficient Evidence that Call Center Channel performs lesser than the others
What does a policy that is about to expire mean? DOes effective.To.date represent expiry date?
as Month Since Policy Inception becomes larger, the older the policy is. And it is more likely to end The argument is that as the policy is about to expire the more claims is happening. Using these as a proxy, We would see that Months.Since.Policy.Inception is inversely proportional to Months.Since.Last.Claim
Let’s visualize that:
colnames(df)
## [1] "Customer" "State"
## [3] "Customer.Lifetime.Value" "Response"
## [5] "Coverage" "Education"
## [7] "Effective.To.Date" "EmploymentStatus"
## [9] "Gender" "Income"
## [11] "Location.Code" "Marital.Status"
## [13] "Monthly.Premium.Auto" "Months.Since.Last.Claim"
## [15] "Months.Since.Policy.Inception" "Number.of.Open.Complaints"
## [17] "Number.of.Policies" "Policy.Type"
## [19] "Policy" "Renew.Offer.Type"
## [21] "Sales.Channel" "Total.Claim.Amount"
## [23] "Vehicle.Class" "Vehicle.Size"
plot(df$Months.Since.Policy.Inception, df$Total.Claim.Amount)
test <- df %>%
group_by(Months.Since.Last.Claim, Months.Since.Policy.Inception) %>%
mutate(cnt = n())
barplot(test$cnt)
Let’s see the National Proportion osf sales (Total Premium - Total Claims Amount:
First we calculate Total Premiums Paid and from it subtract the total Claim Amount then we’ll get the profit.
df$Total.Paid <- df$Monthly.Premium.Auto*df$Months.Since.Policy.Inception
df$Profit <- df$Total.Paid - df$Total.Claim.Amount
National_Sales <- df %>%
group_by(Policy.Type) %>%
summarise(Policy_cnt = n(), Total.Paid = sum(Total.Paid)) %>%
mutate(prop = Total.Paid/sum(Total.Paid))
and then, Let’s filter the table to reflect california only:
Cal_Sales <- df %>%
group_by(Policy.Type) %>%
filter(State == 'California')%>%
summarise(Policy_cnt = n(), Total.Paid = sum(Total.Paid)) %>%
mutate(prop = Total.Paid/sum(Total.Paid))
National_Sales
## # A tibble: 3 x 4
## Policy.Type Policy_cnt Total.Paid prop
## <fctr> <int> <int> <dbl>
## 1 Corporate Auto 1968 8859245 0.21553789
## 2 Personal Auto 6788 30615238 0.74484269
## 3 Special Auto 378 1628475 0.03961941
Cal_Sales
## # A tibble: 3 x 4
## Policy.Type Policy_cnt Total.Paid prop
## <fctr> <int> <int> <dbl>
## 1 Corporate Auto 739 3331408 0.23768913
## 2 Personal Auto 2298 10221194 0.72926122
## 3 Special Auto 113 463218 0.03304965
There a very small difference of West coast to national Average, Test if this increase is significant enough:
cbind(PolicyType = National_Sales$Policy.Type,National = National_Sales$Total.Paid, California = Cal_Sales$Total.Paid)
## PolicyType National California
## [1,] 1 8859245 3331408
## [2,] 2 30615238 10221194
## [3,] 3 1628475 463218
table_df <- National_Sales %>% select(Policy.Type,National = Total.Paid)%>%
mutate(California = Cal_Sales$Total.Paid)
table_df
## # A tibble: 3 x 3
## Policy.Type National California
## <fctr> <int> <int>
## 1 Corporate Auto 8859245 3331408
## 2 Personal Auto 30615238 10221194
## 3 Special Auto 1628475 463218
Do the prop test later
4.Who are my most valuable customers? By this i mean in the West Coast. Are they more or less valuable than the National Average?
What is the national Average for Customer Lifetime Value?
mean(df$Customer.Lifetime.Value)
## [1] 8004.94
cal <- df %>% filter(State == 'California')%>%
arrange(desc(Customer.Lifetime.Value))
ggplot(cal, aes(Customer.Lifetime.Value))+geom_histogram(binwidth = 1000)
boxplot(cal$Customer.Lifetime.Value)
plot(cal$Customer.Lifetime.Value)
the Mean CLV for California is the same as the Average Let’s Describe which of the California Policies are Higher Value
First, filter out those Policies that have CLV less than the 75th percentile
cal2 <- cal %>% filter(Customer.Lifetime.Value >= quantile(Customer.Lifetime.Value, 0.75))
dim(cal2)
## [1] 788 26
cal2 %>%
group_by(Policy.Type) %>%
summarise(Policy_cnt = n(), TCA = sum(Total.Claim.Amount), MPA = sum(Total.Paid),ave_CLV = mean(Customer.Lifetime.Value)) %>%
mutate(prop = Policy_cnt/sum(Policy_cnt))
## # A tibble: 3 x 6
## Policy.Type Policy_cnt TCA MPA ave_CLV prop
## <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Corporate Auto 190 108330.86 1062094 16193.12 0.2411168
## 2 Personal Auto 567 315106.64 3097510 16707.64 0.7195431
## 3 Special Auto 31 16556.67 134154 16437.82 0.0393401
Given the number of Personal Auto Policies in California with a relatively high ave_CLV, these customers are High value!
5.For my region, how does the comparison of channel performance look like? How does this compare with trends for the entire country?
Get SUmmary for the entire country:
df %>%
group_by(Sales.Channel) %>%
summarise(Policy_cnt = n(), TCA = sum(Total.Claim.Amount), MPA = sum(Total.Paid),ave_CLV = mean(Customer.Lifetime.Value)) %>%
mutate(prop = Policy_cnt/sum(Policy_cnt))%>%
arrange(desc(ave_CLV))
## # A tibble: 4 x 6
## Sales.Channel Policy_cnt TCA MPA ave_CLV prop
## <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Branch 2567 1111169.1 11690107 8119.712 0.2810379
## 2 Call Center 1765 755640.0 7757750 8100.086 0.1932341
## 3 Agent 3477 1524437.4 15638171 7957.709 0.3806656
## 4 Web 1325 573720.7 6016930 7779.788 0.1450624
On the Average, Policies thru the Branch Channel have the highest Customer Lifetime Value. Policies thru the Web Channel have the least proportion and CLV. In terms of penetration, Agent Channel is the most frequent. Though Call Center Channel has less Policy counts than Branch, it makes a good CLV at slightly lower than that of Branch
Let’s see how this compares per region:
q5 <- df %>%
group_by(Sales.Channel, State) %>%
summarise(Policy_cnt = n(), TCA = sum(Total.Claim.Amount), MPA = sum(Total.Paid),ave_CLV = mean(Customer.Lifetime.Value)) %>%
mutate(prop = Policy_cnt/sum(Policy_cnt))%>%
arrange(desc(ave_CLV))
q5
## # A tibble: 20 x 7
## # Groups: Sales.Channel [4]
## Sales.Channel State Policy_cnt TCA MPA ave_CLV
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Branch Washington 232 106478.14 1153179 8795.950
## 2 Agent Nevada 330 149073.73 1575473 8729.443
## 3 Call Center Nevada 163 76727.51 721756 8655.268
## 4 Web Arizona 238 103625.31 1131286 8615.442
## 5 Branch Oregon 719 306942.39 3394463 8410.491
## 6 Call Center Oregon 497 211509.97 2229183 8261.157
## 7 Branch California 900 402141.64 4054585 8204.506
## 8 Call Center California 605 256429.43 2637512 8076.694
## 9 Agent Oregon 988 430809.13 4464649 7941.217
## 10 Call Center Washington 147 63337.61 670643 7934.857
## 11 Agent California 1201 532881.73 5316623 7878.525
## 12 Agent Arizona 643 279406.22 2897992 7841.994
## 13 Web California 444 187677.89 2007100 7835.420
## 14 Agent Washington 315 132266.55 1383434 7739.067
## 15 Call Center Arizona 353 147635.44 1498656 7725.846
## 16 Branch Arizona 469 193723.58 1976771 7607.172
## 17 Web Oregon 397 177004.14 1778929 7586.298
## 18 Branch Nevada 247 101883.33 1111109 7302.339
## 19 Web Washington 104 46119.89 464491 7271.581
## 20 Web Nevada 142 59293.43 635124 7118.399
## # ... with 1 more variables: prop <dbl>