MCV <- read.csv("C:/Users/112266/Desktop/DATAcademyMarketingCustomerValue_MMG_170703_v00-00.csv", stringsAsFactors = FALSE)
str(MCV)
## 'data.frame': 9134 obs. of 24 variables:
## $ Customer : chr "BU79786" "QZ44356" "AI49188" "WW63253" ...
## $ State : chr "Washington" "Arizona" "Nevada" "California" ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## $ Response : chr "No" "No" "No" "No" ...
## $ Coverage : chr "Basic" "Extended" "Premium" "Basic" ...
## $ Education : chr "Bachelor" "Bachelor" "Bachelor" "Bachelor" ...
## $ Effective.To.Date : chr "2/24/11" "1/31/11" "2/19/11" "1/20/11" ...
## $ EmploymentStatus : chr "Employed" "Unemployed" "Employed" "Unemployed" ...
## $ Gender : chr "F" "F" "F" "M" ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : chr "Suburban" "Suburban" "Suburban" "Suburban" ...
## $ Marital.Status : chr "Married" "Single" "Married" "Married" ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number.of.Policies : int 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : chr "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
## $ Policy : chr "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
## $ Renew.Offer.Type : chr "Offer1" "Offer3" "Offer1" "Offer1" ...
## $ Sales.Channel : chr "Agent" "Agent" "Agent" "Call Center" ...
## $ Total.Claim.Amount : num 385 1131 566 530 138 ...
## $ Vehicle.Class : chr "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
## $ Vehicle.Size : chr "Medsize" "Medsize" "Medsize" "Medsize" ...
Most of the variables are either in CHR ,NUM or INT. We identify the columns that needs to be converted.
MCV$Customer.Lifetime.Value <- as.numeric(MCV$Customer.Lifetime.Value)
MCV$Total.Claim.Amount <- as.integer(MCV$Total.Claim.Amount)
MCV$Education <- as.factor(MCV$Education)
levels(MCV$Education)=c("High School or Below","Bachelor","College", "Master", "Doctor")
MCV$Vehicle.Size <- as.factor(MCV$Vehicle.Size)
levels(MCV$Vehicle.Size)=c("Small", "Medsize", "Large")
MCV$Effective.To.Date <- as.Date(MCV$Effective.To.Date, "%m/%d/%y")
MCV$Marital.Status <- as.factor(MCV$Marital.Status)
MCV$Sales.Channel <- as.factor(MCV$Sales.Channel)
str(MCV)
## 'data.frame': 9134 obs. of 24 variables:
## $ Customer : chr "BU79786" "QZ44356" "AI49188" "WW63253" ...
## $ State : chr "Washington" "Arizona" "Nevada" "California" ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## $ Response : chr "No" "No" "No" "No" ...
## $ Coverage : chr "Basic" "Extended" "Premium" "Basic" ...
## $ Education : Factor w/ 5 levels "High School or Below",..: 1 1 1 1 1 1 2 5 1 2 ...
## $ Effective.To.Date : Date, format: "2011-02-24" "2011-01-31" ...
## $ EmploymentStatus : chr "Employed" "Unemployed" "Employed" "Unemployed" ...
## $ Gender : chr "F" "F" "F" "M" ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : chr "Suburban" "Suburban" "Suburban" "Suburban" ...
## $ Marital.Status : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number.of.Policies : int 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : chr "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
## $ Policy : chr "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
## $ Renew.Offer.Type : chr "Offer1" "Offer3" "Offer1" "Offer1" ...
## $ Sales.Channel : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
## $ Total.Claim.Amount : int 384 1131 566 529 138 159 321 363 511 425 ...
## $ Vehicle.Class : chr "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
## $ Vehicle.Size : Factor w/ 3 levels "Small","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
Now we have all our variables in order
apply(MCV, 2, function(x) any(is.null(x)))
## Customer State
## FALSE FALSE
## Customer.Lifetime.Value Response
## FALSE FALSE
## Coverage Education
## FALSE FALSE
## Effective.To.Date EmploymentStatus
## FALSE FALSE
## Gender Income
## FALSE FALSE
## Location.Code Marital.Status
## FALSE FALSE
## Monthly.Premium.Auto Months.Since.Last.Claim
## FALSE FALSE
## Months.Since.Policy.Inception Number.of.Open.Complaints
## FALSE FALSE
## Number.of.Policies Policy.Type
## FALSE FALSE
## Policy Renew.Offer.Type
## FALSE FALSE
## Sales.Channel Total.Claim.Amount
## FALSE FALSE
## Vehicle.Class Vehicle.Size
## FALSE FALSE
The function returns FALSE for all columns means that there is no null values in any of the columns. Hence, there are no null values for the data MCV.
length(apply(MCV, 1, function(x) unique(x)))
## [1] 9134
The function returns 9,134 unique rows which is equal to the number of rows in the data MCV, as seen in the data pane on the right. Therefore, there are no duplicate rows in MCV.
hist(MCV$Customer.Lifetime.Value, main="Histogram for Customer Lifetime Value", xlab = "Customer Lifetime Value")
mean(MCV$Customer.Lifetime.Value)
## [1] 8004.94
median(MCV$Customer.Lifetime.Value)
## [1] 5780.182
skewness(MCV$Customer.Lifetime.Value,3)
## [1] 3.031284
kurtosis(MCV$Customer.Lifetime.Value,3)
## [1] 13.81163
By comparing the mean and median of the Customer Lifetime Value data (mean>>median), you can see that the data is positively skewed.
hist(MCV$Monthly.Premium.Auto, main="Histogram for Monthly Premium Auto", xlab = "Monthly Premium Auto")
mean(MCV$Monthly.Premium.Auto)
## [1] 93.21929
median(MCV$Monthly.Premium.Auto)
## [1] 83
skewness(MCV$Monthly.Premium.Auto,3)
## [1] 2.122849
kurtosis(MCV$Monthly.Premium.Auto,3)
## [1] 6.187546
Similary, the Monthly Premium Auto data is skewed to the left.
hist(MCV$Total.Claim.Amount, main="Histogram for Total Claim Amount", xlab = "Total Claim Auto")
mean(MCV$Total.Claim.Amount)
## [1] 433.6227
median(MCV$Total.Claim.Amount)
## [1] 383.5
skewness(MCV$Total.Claim.Amount,3)
## [1] 1.714039
kurtosis(MCV$Total.Claim.Amount,3)
## [1] 5.971797
Lastly, the Total Claim Amount data follows the similar shape as the previous two.
In general, the above attributes does not follow a normal distribution.
We set the independent variable as the Customer Lifetime Value (CLV) since it is the most logical element to consider. However, in some statements, we may extend to Total Claim Amount and Monthly Premium Auto also.
loc_1 <- MCV[,c("Location.Code", "Customer.Lifetime.Value")] %>% subset(Location.Code == "Urban" | Location.Code == "Rural")
wilcox.test(Customer.Lifetime.Value~Location.Code, data=loc_1, alternative="less")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Customer.Lifetime.Value by Location.Code
## W = 1395900, p-value = 0.4081
## alternative hypothesis: true location shift is less than 0
Since we know from previous item that Customer Lifetime Value is not normal, we use a non parametric test for significance. We use the Wilcoxon 2-sample Rank sum test which is the equivalent test for 2-sample t test. Based from the test result’s p-value, the difference in distributions for rural and urban in terms of Customer Value Lifetime is not significant.
loc_1 <- MCV[,c("Location.Code", "Total.Claim.Amount")] %>% subset(Location.Code == "Urban" | Location.Code == "Rural")
wilcox.test(Total.Claim.Amount~Location.Code, data=loc_1, alternative="less")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Total.Claim.Amount by Location.Code
## W = 106560, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0
For Total Claim amount, the difference between the distributions of rural and urban customers are significant. Therefore, rural customers are less valuable in terms of claim amount.
loc_1 <- MCV[,c("Location.Code", "Monthly.Premium.Auto")] %>% subset(Location.Code == "Urban" | Location.Code == "Rural")
wilcox.test(Monthly.Premium.Auto~Location.Code, data=loc_1, alternative="less")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Monthly.Premium.Auto by Location.Code
## W = 1402100, p-value = 0.495
## alternative hypothesis: true location shift is less than 0
In the same way, the difference between the distributions of rural and urban customers are not significant in terms of monthly premium auto.
Thus, the statement is only true when we talk about Total claims amount.
educ_1 <- MCV[,c("Education", "Customer.Lifetime.Value")]
kruskal.test(Customer.Lifetime.Value~Education, data=educ_1)
##
## Kruskal-Wallis rank sum test
##
## data: Customer.Lifetime.Value by Education
## Kruskal-Wallis chi-squared = 12.234, df = 4, p-value = 0.01569
Since the two variables are independent of each other and are not related, then we use the Kruskal-Wallis test to determine significance of education to CLV. By the test result with p-value<0.05, we reject the null hypothesis. Hence, the statement is true.
mari_1 <- MCV[,c("Marital.Status", "Customer.Lifetime.Value")]
kruskal.test(Customer.Lifetime.Value~Marital.Status, data=mari_1)
##
## Kruskal-Wallis rank sum test
##
## data: Customer.Lifetime.Value by Marital.Status
## Kruskal-Wallis chi-squared = 20.896, df = 2, p-value = 2.901e-05
Similar to b, we use the Kruskal-Wallis test to determine significance of marital status to CLV. By the results of the test, the two distrbutions are nonidentical. Therefore, our null hypothesis should be that Marital status has a role to play in CLV. If we use the latter hypothesis, we then reject the null hypothesis that Marital Status has no role to play in the value of a customer.Then the new hypothesis is true.
sales_1 <- MCV[,c("Sales.Channel", "Customer.Lifetime.Value")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Agent",1,0)
wilcox.test(Customer.Lifetime.Value~ Agent_Tag, data=sales_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Customer.Lifetime.Value by Agent_Tag
## W = 9733300, p-value = 0.4073
## alternative hypothesis: true location shift is not equal to 0
By the Wilcoxon test, we can say that agents have no role in determining CLV. However, if we try using the MOnthly Premium Auto and Total Claim Amount as independent variable, respectively, we have
sales_1 <- MCV[,c("Sales.Channel", "Monthly.Premium.Auto")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Agent",1,0)
wilcox.test(Monthly.Premium.Auto~ Agent_Tag, data=sales_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Monthly.Premium.Auto by Agent_Tag
## W = 9799300, p-value = 0.7724
## alternative hypothesis: true location shift is not equal to 0
sales_1 <- MCV[,c("Sales.Channel", "Total.Claim.Amount")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Agent",1,0)
wilcox.test(Total.Claim.Amount~ Agent_Tag, data=sales_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Total.Claim.Amount by Agent_Tag
## W = 9761000, p-value = 0.5472
## alternative hypothesis: true location shift is not equal to 0
Still in both scenarios,agents are not significant. Hence, we can say that agent channel has no role to play in CLV distribution.
sales_1 <- MCV[,c("Sales.Channel", "Customer.Lifetime.Value")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Call Center",1,0)
wilcox.test(Customer.Lifetime.Value~ Agent_Tag, data=sales_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Customer.Lifetime.Value by Agent_Tag
## W = 6462000, p-value = 0.6793
## alternative hypothesis: true location shift is not equal to 0
sales_1 <- MCV[,c("Sales.Channel", "Monthly.Premium.Auto")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Call Center",1,0)
wilcox.test(Monthly.Premium.Auto~ Agent_Tag, data=sales_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Monthly.Premium.Auto by Agent_Tag
## W = 6530800, p-value = 0.781
## alternative hypothesis: true location shift is not equal to 0
sales_1 <- MCV[,c("Sales.Channel", "Total.Claim.Amount")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Call Center",1,0)
wilcox.test(Total.Claim.Amount~ Agent_Tag, data=sales_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Total.Claim.Amount by Agent_Tag
## W = 6564700, p-value = 0.5359
## alternative hypothesis: true location shift is not equal to 0
By the results of the three tests above, we can say that call center channel has no role to play in CLV, Premiums and Claims.
In this scenario, we let our hypothesis be that customers make more claims in terms of amount than the average during the closing stages of the policy. We only consider valid claim months and take the difference of the two. We assume that the policy is claiming in the closing stages of its cover as the difference grow larger.
MCV$Valid.Last.Claim.Month <- ifelse(MCV$Months.Since.Last.Claim<=MCV$Months.Since.Policy.Inception,MCV$Months.Since.Last.Claim,0)
MCV$Difference <- MCV$Months.Since.Policy.Inception-MCV$Valid.Last.Claim.Month
MCV_CLM <- MCV[which(MCV$Valid.Last.Claim.Month>0),]
hist(MCV_CLM$Difference, main="Histogram of the difference between claim months and inception", xlab="Difference in Inception and last claim")
From the histogram, you will note that there are lesser claims in the 80-100 month difference bracket. Meaning, we can hypothesize that claims amount is driven by the difference in months. Now, we compare the claim amount and the claim proximity to the expiry date.
MCV_CLM <- MCV[,c("Total.Claim.Amount","Difference")]
wilcox.test(MCV_CLM$Total.Claim.Amount,MCV_CLM$Difference, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: MCV_CLM$Total.Claim.Amount and MCV_CLM$Difference
## V = 41540000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The result tells us that Total Claim Amount and Difference in Claim month and inception month have nonidentical distributions. Hence, we reject the null hypothesis. The previously stated hypothesis is true.
Now, to check the correlation of the two variables,
cor(MCV_CLM$Total.Claim.Amount, MCV_CLM$Difference)
## [1] -0.003049366
The negative correlation means that as one of the variables get larger, the other one gets smaller. Either way, we can not say that customers always claim teh most when their policy is expiring, in general.
Consider the whole data set as your population and let west coast, as in California, to be your sample. We first look at the significance of POlicy Type to Monthly Premiums.The increase in ratio means that there is more premium for corporate than in personal. Hence, the average corporate monthly premium is higher than personal monthly premium. Our null hypothesis would be both policy types have the identical distributions.
auto_1 <- MCV[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto") %>% group_by(Policy.Type) %>% summarise(MPA=mean(Monthly.Premium.Auto))
auto_1
## # A tibble: 2 x 2
## Policy.Type MPA
## <chr> <dbl>
## 1 Corporate Auto 93.23831
## 2 Personal Auto 93.23085
auto_1 <- MCV[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto")
wilcox.test(Monthly.Premium.Auto~Policy.Type, data=auto_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Monthly.Premium.Auto by Policy.Type
## W = 6662900, p-value = 0.8673
## alternative hypothesis: true location shift is not equal to 0
We can see that the difference in distribution of both policy types by Monthly premiums is not significant. Then we can’t reject the null hypothesis.
Now, if we look into California,
MCV_1 <- MCV[which(MCV$State=="California"),]
auto_1 <- MCV_1[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto") %>% group_by(Policy.Type) %>% summarise(MPA=mean(Monthly.Premium.Auto))
auto_1
## # A tibble: 2 x 2
## Policy.Type MPA
## <chr> <dbl>
## 1 Corporate Auto 92.18268
## 2 Personal Auto 93.97607
auto_1 <- MCV_1[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto")
wilcox.test(Monthly.Premium.Auto~Policy.Type, data=auto_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Monthly.Premium.Auto by Policy.Type
## W = 820940, p-value = 0.1742
## alternative hypothesis: true location shift is not equal to 0
we see that although there is no impact in the over-all distribution of monthly premiums, the significance of policy type is relatively “stronger” than in the parent population.
Let us assume that west coast is california and the parent population is the whole data set. Then we take the top 10 customers for each of the two space mentioned and compare.
MCV_1 <- MCV %>% subset(State=="California") %>% group_by(Customer) %>% summarise(CLV = mean(Customer.Lifetime.Value))
tail(arrange(MCV_1, desc(CLV)),n=10)
## # A tibble: 10 x 2
## Customer CLV
## <chr> <dbl>
## 1 FS86064 2117.989
## 2 RN61682 2114.738
## 3 UF44021 2108.494
## 4 WL65572 2064.459
## 5 AU46604 2030.784
## 6 EP72155 2004.351
## 7 JX50334 2004.351
## 8 EH70612 1918.120
## 9 XT54610 1904.001
## 10 UM17151 1898.008
MCV_2 <- MCV %>% group_by(Customer) %>% summarise(CLV = mean(Customer.Lifetime.Value))
tail(arrange(MCV_2, desc(CLV)),n=10)
## # A tibble: 10 x 2
## Customer CLV
## <chr> <dbl>
## 1 JX50334 2004.351
## 2 SQ54906 2004.351
## 3 WX92260 2004.351
## 4 XR52852 2004.351
## 5 MN19541 1994.775
## 6 XI84293 1940.981
## 7 EH70612 1918.120
## 8 XT54610 1904.001
## 9 CI68869 1898.684
## 10 UM17151 1898.008
From the list above, you can see that not all top customers in West Coast based on Client Lifetime Value are in the national list (MCV_2). Hence, we can day that there may be more customers that are valuable outside of West Coast. Now, we hypothesize that being in West Coast makes the customer more valuable. Similar to 1.d, we use the Mann-Whitney-Wilcoxon test the siginificance of being in West coast to vCLV.
val_1 <- MCV[,c("State", "Customer.Lifetime.Value")]
val_1$Cali_Tag <- ifelse(val_1$State=="California",1,0)
wilcox.test(Customer.Lifetime.Value~Cali_Tag, data=val_1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Customer.Lifetime.Value by Cali_Tag
## W = 9312500, p-value = 0.3486
## alternative hypothesis: true location shift is not equal to 0
By the test result, we say that being in California does not affect CLV. Hence, we can not say for certain if customers in West Coast are more or less valuable than any other state in the population.
Let us assume that the sample population is customers coming from California, the parent population is the dataset and our basis of performance is the Monthly premium. By the Kruskal-Wallis test,
MCV_Cali <- MCV[which(MCV$State=="California"),]
reg_1 <- MCV_Cali[,c("Sales.Channel", "Monthly.Premium.Auto")]
kruskal.test(Monthly.Premium.Auto~Sales.Channel, data=reg_1)
##
## Kruskal-Wallis rank sum test
##
## data: Monthly.Premium.Auto by Sales.Channel
## Kruskal-Wallis chi-squared = 3.7333, df = 3, p-value = 0.2917
we can say that being in Sales Channel in California does not affect the Monthly Premiums distribution. Now, we have seen in 1e that for the whole population, sales channel is not significant to Monthly Premiums. Hence, the trends in the local Region is also manifested in the parent population.