Problem Set 1

Preliminaries

MCV <- read.csv("C:/Users/112266/Desktop/DATAcademyMarketingCustomerValue_MMG_170703_v00-00.csv", stringsAsFactors = FALSE)
str(MCV)
## 'data.frame':    9134 obs. of  24 variables:
##  $ Customer                     : chr  "BU79786" "QZ44356" "AI49188" "WW63253" ...
##  $ State                        : chr  "Washington" "Arizona" "Nevada" "California" ...
##  $ Customer.Lifetime.Value      : num  2764 6980 12887 7646 2814 ...
##  $ Response                     : chr  "No" "No" "No" "No" ...
##  $ Coverage                     : chr  "Basic" "Extended" "Premium" "Basic" ...
##  $ Education                    : chr  "Bachelor" "Bachelor" "Bachelor" "Bachelor" ...
##  $ Effective.To.Date            : chr  "2/24/11" "1/31/11" "2/19/11" "1/20/11" ...
##  $ EmploymentStatus             : chr  "Employed" "Unemployed" "Employed" "Unemployed" ...
##  $ Gender                       : chr  "F" "F" "F" "M" ...
##  $ Income                       : int  56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
##  $ Location.Code                : chr  "Suburban" "Suburban" "Suburban" "Suburban" ...
##  $ Marital.Status               : chr  "Married" "Single" "Married" "Married" ...
##  $ Monthly.Premium.Auto         : int  69 94 108 106 73 69 67 101 71 93 ...
##  $ Months.Since.Last.Claim      : int  32 13 18 18 12 14 0 0 13 17 ...
##  $ Months.Since.Policy.Inception: int  5 42 38 65 44 94 13 68 3 7 ...
##  $ Number.of.Open.Complaints    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Number.of.Policies           : int  1 8 2 7 1 2 9 4 2 8 ...
##  $ Policy.Type                  : chr  "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
##  $ Policy                       : chr  "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
##  $ Renew.Offer.Type             : chr  "Offer1" "Offer3" "Offer1" "Offer1" ...
##  $ Sales.Channel                : chr  "Agent" "Agent" "Agent" "Call Center" ...
##  $ Total.Claim.Amount           : num  385 1131 566 530 138 ...
##  $ Vehicle.Class                : chr  "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
##  $ Vehicle.Size                 : chr  "Medsize" "Medsize" "Medsize" "Medsize" ...

Most of the variables are either in CHR ,NUM or INT. We identify the columns that needs to be converted.

MCV$Customer.Lifetime.Value <- as.numeric(MCV$Customer.Lifetime.Value)
MCV$Total.Claim.Amount <- as.integer(MCV$Total.Claim.Amount)
MCV$Education <- as.factor(MCV$Education)
levels(MCV$Education)=c("High School or Below","Bachelor","College", "Master", "Doctor")
MCV$Vehicle.Size <- as.factor(MCV$Vehicle.Size)
levels(MCV$Vehicle.Size)=c("Small", "Medsize", "Large")
MCV$Effective.To.Date <- as.Date(MCV$Effective.To.Date, "%m/%d/%y")
MCV$Marital.Status <- as.factor(MCV$Marital.Status)
MCV$Sales.Channel <- as.factor(MCV$Sales.Channel)
str(MCV)
## 'data.frame':    9134 obs. of  24 variables:
##  $ Customer                     : chr  "BU79786" "QZ44356" "AI49188" "WW63253" ...
##  $ State                        : chr  "Washington" "Arizona" "Nevada" "California" ...
##  $ Customer.Lifetime.Value      : num  2764 6980 12887 7646 2814 ...
##  $ Response                     : chr  "No" "No" "No" "No" ...
##  $ Coverage                     : chr  "Basic" "Extended" "Premium" "Basic" ...
##  $ Education                    : Factor w/ 5 levels "High School or Below",..: 1 1 1 1 1 1 2 5 1 2 ...
##  $ Effective.To.Date            : Date, format: "2011-02-24" "2011-01-31" ...
##  $ EmploymentStatus             : chr  "Employed" "Unemployed" "Employed" "Unemployed" ...
##  $ Gender                       : chr  "F" "F" "F" "M" ...
##  $ Income                       : int  56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
##  $ Location.Code                : chr  "Suburban" "Suburban" "Suburban" "Suburban" ...
##  $ Marital.Status               : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
##  $ Monthly.Premium.Auto         : int  69 94 108 106 73 69 67 101 71 93 ...
##  $ Months.Since.Last.Claim      : int  32 13 18 18 12 14 0 0 13 17 ...
##  $ Months.Since.Policy.Inception: int  5 42 38 65 44 94 13 68 3 7 ...
##  $ Number.of.Open.Complaints    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Number.of.Policies           : int  1 8 2 7 1 2 9 4 2 8 ...
##  $ Policy.Type                  : chr  "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
##  $ Policy                       : chr  "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
##  $ Renew.Offer.Type             : chr  "Offer1" "Offer3" "Offer1" "Offer1" ...
##  $ Sales.Channel                : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
##  $ Total.Claim.Amount           : int  384 1131 566 529 138 159 321 363 511 425 ...
##  $ Vehicle.Class                : chr  "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
##  $ Vehicle.Size                 : Factor w/ 3 levels "Small","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...

Now we have all our variables in order

Data Check

1. Are there null values in the data?

apply(MCV, 2, function(x) any(is.null(x)))
##                      Customer                         State 
##                         FALSE                         FALSE 
##       Customer.Lifetime.Value                      Response 
##                         FALSE                         FALSE 
##                      Coverage                     Education 
##                         FALSE                         FALSE 
##             Effective.To.Date              EmploymentStatus 
##                         FALSE                         FALSE 
##                        Gender                        Income 
##                         FALSE                         FALSE 
##                 Location.Code                Marital.Status 
##                         FALSE                         FALSE 
##          Monthly.Premium.Auto       Months.Since.Last.Claim 
##                         FALSE                         FALSE 
## Months.Since.Policy.Inception     Number.of.Open.Complaints 
##                         FALSE                         FALSE 
##            Number.of.Policies                   Policy.Type 
##                         FALSE                         FALSE 
##                        Policy              Renew.Offer.Type 
##                         FALSE                         FALSE 
##                 Sales.Channel            Total.Claim.Amount 
##                         FALSE                         FALSE 
##                 Vehicle.Class                  Vehicle.Size 
##                         FALSE                         FALSE

The function returns FALSE for all columns means that there is no null values in any of the columns. Hence, there are no null values for the data MCV.

2.Are there duplicates in the data?

length(apply(MCV, 1, function(x) unique(x)))
## [1] 9134

The function returns 9,134 unique rows which is equal to the number of rows in the data MCV, as seen in the data pane on the right. Therefore, there are no duplicate rows in MCV.

Descriptive Statistics

hist(MCV$Customer.Lifetime.Value, main="Histogram for Customer Lifetime Value", xlab = "Customer Lifetime Value")

mean(MCV$Customer.Lifetime.Value)
## [1] 8004.94
median(MCV$Customer.Lifetime.Value)
## [1] 5780.182
skewness(MCV$Customer.Lifetime.Value,3)
## [1] 3.031284
kurtosis(MCV$Customer.Lifetime.Value,3)
## [1] 13.81163

By comparing the mean and median of the Customer Lifetime Value data (mean>>median), you can see that the data is positively skewed.

hist(MCV$Monthly.Premium.Auto, main="Histogram for Monthly Premium Auto", xlab = "Monthly Premium Auto")

mean(MCV$Monthly.Premium.Auto)
## [1] 93.21929
median(MCV$Monthly.Premium.Auto)
## [1] 83
skewness(MCV$Monthly.Premium.Auto,3)
## [1] 2.122849
kurtosis(MCV$Monthly.Premium.Auto,3)
## [1] 6.187546

Similary, the Monthly Premium Auto data is skewed to the left.

hist(MCV$Total.Claim.Amount, main="Histogram for Total Claim Amount", xlab = "Total Claim Auto")

mean(MCV$Total.Claim.Amount)
## [1] 433.6227
median(MCV$Total.Claim.Amount)
## [1] 383.5
skewness(MCV$Total.Claim.Amount,3)
## [1] 1.714039
kurtosis(MCV$Total.Claim.Amount,3)
## [1] 5.971797

Lastly, the Total Claim Amount data follows the similar shape as the previous two.

In general, the above attributes does not follow a normal distribution.

Inferential Statistics

1. Which of the following hypotheses are true or false?

We set the independent variable as the Customer Lifetime Value (CLV) since it is the most logical element to consider. However, in some statements, we may extend to Total Claim Amount and Monthly Premium Auto also.

  1. Rural customers are LESS valuable than Urban customers
loc_1 <- MCV[,c("Location.Code", "Customer.Lifetime.Value")] %>% subset(Location.Code == "Urban" | Location.Code == "Rural")
wilcox.test(Customer.Lifetime.Value~Location.Code, data=loc_1, alternative="less")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Customer.Lifetime.Value by Location.Code
## W = 1395900, p-value = 0.4081
## alternative hypothesis: true location shift is less than 0

Since we know from previous item that Customer Lifetime Value is not normal, we use a non parametric test for significance. We use the Wilcoxon 2-sample Rank sum test which is the equivalent test for 2-sample t test. Based from the test result’s p-value, the difference in distributions for rural and urban in terms of Customer Value Lifetime is not significant.

loc_1 <- MCV[,c("Location.Code", "Total.Claim.Amount")] %>% subset(Location.Code == "Urban" | Location.Code == "Rural")
wilcox.test(Total.Claim.Amount~Location.Code, data=loc_1, alternative="less")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Total.Claim.Amount by Location.Code
## W = 106560, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0

For Total Claim amount, the difference between the distributions of rural and urban customers are significant. Therefore, rural customers are less valuable in terms of claim amount.

loc_1 <- MCV[,c("Location.Code", "Monthly.Premium.Auto")] %>% subset(Location.Code == "Urban" | Location.Code == "Rural")
wilcox.test(Monthly.Premium.Auto~Location.Code, data=loc_1, alternative="less")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Monthly.Premium.Auto by Location.Code
## W = 1402100, p-value = 0.495
## alternative hypothesis: true location shift is less than 0

In the same way, the difference between the distributions of rural and urban customers are not significant in terms of monthly premium auto.

Thus, the statement is only true when we talk about Total claims amount.

  1. Educated customers (with a bachelors or equivalent degree) are more valuable than others
educ_1 <- MCV[,c("Education", "Customer.Lifetime.Value")]
kruskal.test(Customer.Lifetime.Value~Education, data=educ_1)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Customer.Lifetime.Value by Education
## Kruskal-Wallis chi-squared = 12.234, df = 4, p-value = 0.01569

Since the two variables are independent of each other and are not related, then we use the Kruskal-Wallis test to determine significance of education to CLV. By the test result with p-value<0.05, we reject the null hypothesis. Hence, the statement is true.

  1. Marital status has no role to play in determining the value of a customer
mari_1 <- MCV[,c("Marital.Status", "Customer.Lifetime.Value")]
kruskal.test(Customer.Lifetime.Value~Marital.Status, data=mari_1)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Customer.Lifetime.Value by Marital.Status
## Kruskal-Wallis chi-squared = 20.896, df = 2, p-value = 2.901e-05

Similar to b, we use the Kruskal-Wallis test to determine significance of marital status to CLV. By the results of the test, the two distrbutions are nonidentical. Therefore, our null hypothesis should be that Marital status has a role to play in CLV. If we use the latter hypothesis, we then reject the null hypothesis that Marital Status has no role to play in the value of a customer.Then the new hypothesis is true.

  1. West Coast agents are performing poorly - They are signing low value customers In this scenario, we use the Mann-Whitney-Wilcoxon Test since the samples are coming from 2 distinct populations. Note that, we need to convert the sales agent column to binary to indicate 0=not an agent, 1=agent.
sales_1 <- MCV[,c("Sales.Channel", "Customer.Lifetime.Value")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Agent",1,0)
wilcox.test(Customer.Lifetime.Value~ Agent_Tag, data=sales_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Customer.Lifetime.Value by Agent_Tag
## W = 9733300, p-value = 0.4073
## alternative hypothesis: true location shift is not equal to 0

By the Wilcoxon test, we can say that agents have no role in determining CLV. However, if we try using the MOnthly Premium Auto and Total Claim Amount as independent variable, respectively, we have

sales_1 <- MCV[,c("Sales.Channel", "Monthly.Premium.Auto")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Agent",1,0)
wilcox.test(Monthly.Premium.Auto~ Agent_Tag, data=sales_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Monthly.Premium.Auto by Agent_Tag
## W = 9799300, p-value = 0.7724
## alternative hypothesis: true location shift is not equal to 0
sales_1 <- MCV[,c("Sales.Channel", "Total.Claim.Amount")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Agent",1,0)
wilcox.test(Total.Claim.Amount~ Agent_Tag, data=sales_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Total.Claim.Amount by Agent_Tag
## W = 9761000, p-value = 0.5472
## alternative hypothesis: true location shift is not equal to 0

Still in both scenarios,agents are not significant. Hence, we can say that agent channel has no role to play in CLV distribution.

  1. Call Center is not performing well compared to other channels throughout the country (in terms of high value customers) We test Call center significance against CLV, Premium and Claim in a similar way as in d.
sales_1 <- MCV[,c("Sales.Channel", "Customer.Lifetime.Value")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Call Center",1,0)
wilcox.test(Customer.Lifetime.Value~ Agent_Tag, data=sales_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Customer.Lifetime.Value by Agent_Tag
## W = 6462000, p-value = 0.6793
## alternative hypothesis: true location shift is not equal to 0
sales_1 <- MCV[,c("Sales.Channel", "Monthly.Premium.Auto")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Call Center",1,0)
wilcox.test(Monthly.Premium.Auto~ Agent_Tag, data=sales_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Monthly.Premium.Auto by Agent_Tag
## W = 6530800, p-value = 0.781
## alternative hypothesis: true location shift is not equal to 0
sales_1 <- MCV[,c("Sales.Channel", "Total.Claim.Amount")]
sales_1$Agent_Tag <- ifelse(sales_1$Sales.Channel=="Call Center",1,0)
wilcox.test(Total.Claim.Amount~ Agent_Tag, data=sales_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Total.Claim.Amount by Agent_Tag
## W = 6564700, p-value = 0.5359
## alternative hypothesis: true location shift is not equal to 0

By the results of the three tests above, we can say that call center channel has no role to play in CLV, Premiums and Claims.

2.I have noticed recently that when a policy is about to expire, customers take advantage of the limited time available to them by making more claims than average during the closing stages of their policy. I want to know if this is true nationwide.

In this scenario, we let our hypothesis be that customers make more claims in terms of amount than the average during the closing stages of the policy. We only consider valid claim months and take the difference of the two. We assume that the policy is claiming in the closing stages of its cover as the difference grow larger.

MCV$Valid.Last.Claim.Month <- ifelse(MCV$Months.Since.Last.Claim<=MCV$Months.Since.Policy.Inception,MCV$Months.Since.Last.Claim,0)
MCV$Difference <- MCV$Months.Since.Policy.Inception-MCV$Valid.Last.Claim.Month
MCV_CLM <- MCV[which(MCV$Valid.Last.Claim.Month>0),]
hist(MCV_CLM$Difference, main="Histogram of the difference between claim months and inception", xlab="Difference in Inception and last claim")

From the histogram, you will note that there are lesser claims in the 80-100 month difference bracket. Meaning, we can hypothesize that claims amount is driven by the difference in months. Now, we compare the claim amount and the claim proximity to the expiry date.

MCV_CLM <- MCV[,c("Total.Claim.Amount","Difference")]
wilcox.test(MCV_CLM$Total.Claim.Amount,MCV_CLM$Difference, paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  MCV_CLM$Total.Claim.Amount and MCV_CLM$Difference
## V = 41540000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The result tells us that Total Claim Amount and Difference in Claim month and inception month have nonidentical distributions. Hence, we reject the null hypothesis. The previously stated hypothesis is true.

Now, to check the correlation of the two variables,

cor(MCV_CLM$Total.Claim.Amount, MCV_CLM$Difference)
## [1] -0.003049366

The negative correlation means that as one of the variables get larger, the other one gets smaller. Either way, we can not say that customers always claim teh most when their policy is expiring, in general.

3.West Coast USA is predominantly Tech industry. And they are nowadays taking out Corporate Auto insurance policies for their emplpoyees. So i have noticed an increase on the ratio of sales of corporate and personal Auto insurance. How does the national ratio compare to that of West Coast?

Consider the whole data set as your population and let west coast, as in California, to be your sample. We first look at the significance of POlicy Type to Monthly Premiums.The increase in ratio means that there is more premium for corporate than in personal. Hence, the average corporate monthly premium is higher than personal monthly premium. Our null hypothesis would be both policy types have the identical distributions.

auto_1 <- MCV[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto") %>% group_by(Policy.Type) %>% summarise(MPA=mean(Monthly.Premium.Auto))
auto_1
## # A tibble: 2 x 2
##      Policy.Type      MPA
##            <chr>    <dbl>
## 1 Corporate Auto 93.23831
## 2  Personal Auto 93.23085
auto_1 <- MCV[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto")
wilcox.test(Monthly.Premium.Auto~Policy.Type, data=auto_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Monthly.Premium.Auto by Policy.Type
## W = 6662900, p-value = 0.8673
## alternative hypothesis: true location shift is not equal to 0

We can see that the difference in distribution of both policy types by Monthly premiums is not significant. Then we can’t reject the null hypothesis.

Now, if we look into California,

MCV_1 <- MCV[which(MCV$State=="California"),]
auto_1 <- MCV_1[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto") %>% group_by(Policy.Type) %>% summarise(MPA=mean(Monthly.Premium.Auto))
auto_1
## # A tibble: 2 x 2
##      Policy.Type      MPA
##            <chr>    <dbl>
## 1 Corporate Auto 92.18268
## 2  Personal Auto 93.97607
auto_1 <- MCV_1[,c("Policy.Type", "Monthly.Premium.Auto")] %>% subset(Policy.Type == "Corporate Auto" | Policy.Type == "Personal Auto")
wilcox.test(Monthly.Premium.Auto~Policy.Type, data=auto_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Monthly.Premium.Auto by Policy.Type
## W = 820940, p-value = 0.1742
## alternative hypothesis: true location shift is not equal to 0

we see that although there is no impact in the over-all distribution of monthly premiums, the significance of policy type is relatively “stronger” than in the parent population.

4.Who are my most valuable customers? By this i mean in the West Coast. Are they more or less valuable than the National Average?

Let us assume that west coast is california and the parent population is the whole data set. Then we take the top 10 customers for each of the two space mentioned and compare.

MCV_1 <- MCV %>% subset(State=="California") %>% group_by(Customer) %>% summarise(CLV = mean(Customer.Lifetime.Value))
tail(arrange(MCV_1, desc(CLV)),n=10)
## # A tibble: 10 x 2
##    Customer      CLV
##       <chr>    <dbl>
##  1  FS86064 2117.989
##  2  RN61682 2114.738
##  3  UF44021 2108.494
##  4  WL65572 2064.459
##  5  AU46604 2030.784
##  6  EP72155 2004.351
##  7  JX50334 2004.351
##  8  EH70612 1918.120
##  9  XT54610 1904.001
## 10  UM17151 1898.008
MCV_2 <- MCV %>% group_by(Customer) %>% summarise(CLV = mean(Customer.Lifetime.Value))
tail(arrange(MCV_2, desc(CLV)),n=10)
## # A tibble: 10 x 2
##    Customer      CLV
##       <chr>    <dbl>
##  1  JX50334 2004.351
##  2  SQ54906 2004.351
##  3  WX92260 2004.351
##  4  XR52852 2004.351
##  5  MN19541 1994.775
##  6  XI84293 1940.981
##  7  EH70612 1918.120
##  8  XT54610 1904.001
##  9  CI68869 1898.684
## 10  UM17151 1898.008

From the list above, you can see that not all top customers in West Coast based on Client Lifetime Value are in the national list (MCV_2). Hence, we can day that there may be more customers that are valuable outside of West Coast. Now, we hypothesize that being in West Coast makes the customer more valuable. Similar to 1.d, we use the Mann-Whitney-Wilcoxon test the siginificance of being in West coast to vCLV.

val_1 <- MCV[,c("State", "Customer.Lifetime.Value")]
val_1$Cali_Tag <- ifelse(val_1$State=="California",1,0)
wilcox.test(Customer.Lifetime.Value~Cali_Tag, data=val_1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Customer.Lifetime.Value by Cali_Tag
## W = 9312500, p-value = 0.3486
## alternative hypothesis: true location shift is not equal to 0

By the test result, we say that being in California does not affect CLV. Hence, we can not say for certain if customers in West Coast are more or less valuable than any other state in the population.