Problem Set 1

Data Sanity Checks

First we’ll load the data into R. I have renamed my data set for easy loading:

df <- read.csv("//au.qbe.pri/Home/NSW_Home/200583/Profile/Documents/DataCademy/ADS.csv")

Next let’s load libraries that will be useful in our analysis:

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Then inspect the data using the head, str, and summary functions:

head(df)
##   Customer      State Customer.Lifetime.Value Response Coverage Education
## 1  BU79786 Washington                2763.519       No    Basic  Bachelor
## 2  QZ44356    Arizona                6979.536       No Extended  Bachelor
## 3  AI49188     Nevada               12887.432       No  Premium  Bachelor
## 4  WW63253 California                7645.862       No    Basic  Bachelor
## 5  HB64268 Washington                2813.693       No    Basic  Bachelor
## 6  OC83172     Oregon                8256.298      Yes    Basic  Bachelor
##   Effective.To.Date EmploymentStatus Gender Income Location.Code
## 1           2/24/11         Employed      F  56274      Suburban
## 2           1/31/11       Unemployed      F      0      Suburban
## 3           2/19/11         Employed      F  48767      Suburban
## 4           1/20/11       Unemployed      M      0      Suburban
## 5            2/3/11         Employed      M  43836         Rural
## 6           1/25/11         Employed      F  62902         Rural
##   Marital.Status Monthly.Premium.Auto Months.Since.Last.Claim
## 1        Married                   69                      32
## 2         Single                   94                      13
## 3        Married                  108                      18
## 4        Married                  106                      18
## 5         Single                   73                      12
## 6        Married                   69                      14
##   Months.Since.Policy.Inception Number.of.Open.Complaints
## 1                             5                         0
## 2                            42                         0
## 3                            38                         0
## 4                            65                         0
## 5                            44                         0
## 6                            94                         0
##   Number.of.Policies    Policy.Type       Policy Renew.Offer.Type
## 1                  1 Corporate Auto Corporate L3           Offer1
## 2                  8  Personal Auto  Personal L3           Offer3
## 3                  2  Personal Auto  Personal L3           Offer1
## 4                  7 Corporate Auto Corporate L2           Offer1
## 5                  1  Personal Auto  Personal L1           Offer1
## 6                  2  Personal Auto  Personal L3           Offer2
##   Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
## 1         Agent           384.8111  Two-Door Car      Medsize
## 2         Agent          1131.4649 Four-Door Car      Medsize
## 3         Agent           566.4722  Two-Door Car      Medsize
## 4   Call Center           529.8813           SUV      Medsize
## 5         Agent           138.1309 Four-Door Car      Medsize
## 6           Web           159.3830  Two-Door Car      Medsize
str(df)
## 'data.frame':    9134 obs. of  24 variables:
##  $ Customer                     : Factor w/ 9134 levels "AA10041","AA11235",..: 601 5947 97 8017 2489 4948 8434 756 1352 548 ...
##  $ State                        : Factor w/ 5 levels "Arizona","California",..: 5 1 3 2 5 4 4 1 4 4 ...
##  $ Customer.Lifetime.Value      : num  2764 6980 12887 7646 2814 ...
##  $ Response                     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ Coverage                     : Factor w/ 3 levels "Basic","Extended",..: 1 2 3 1 1 1 1 3 1 2 ...
##  $ Education                    : Factor w/ 5 levels "Bachelor","College",..: 1 1 1 1 1 1 2 5 1 2 ...
##  $ Effective.To.Date            : Factor w/ 59 levels "1/1/11","1/10/11",..: 48 25 42 13 53 18 48 10 19 40 ...
##  $ EmploymentStatus             : Factor w/ 5 levels "Disabled","Employed",..: 2 5 2 5 2 2 2 5 3 2 ...
##  $ Gender                       : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
##  $ Income                       : int  56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
##  $ Location.Code                : Factor w/ 3 levels "Rural","Suburban",..: 2 2 2 2 1 1 2 3 2 3 ...
##  $ Marital.Status               : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
##  $ Monthly.Premium.Auto         : int  69 94 108 106 73 69 67 101 71 93 ...
##  $ Months.Since.Last.Claim      : int  32 13 18 18 12 14 0 0 13 17 ...
##  $ Months.Since.Policy.Inception: int  5 42 38 65 44 94 13 68 3 7 ...
##  $ Number.of.Open.Complaints    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Number.of.Policies           : int  1 8 2 7 1 2 9 4 2 8 ...
##  $ Policy.Type                  : Factor w/ 3 levels "Corporate Auto",..: 1 2 2 1 2 2 1 1 1 3 ...
##  $ Policy                       : Factor w/ 9 levels "Corporate L1",..: 3 6 6 2 4 6 3 3 3 8 ...
##  $ Renew.Offer.Type             : Factor w/ 4 levels "Offer1","Offer2",..: 1 3 1 1 1 2 1 1 1 2 ...
##  $ Sales.Channel                : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
##  $ Total.Claim.Amount           : num  385 1131 566 530 138 ...
##  $ Vehicle.Class                : Factor w/ 6 levels "Four-Door Car",..: 6 1 6 5 1 6 1 1 1 1 ...
##  $ Vehicle.Size                 : Factor w/ 3 levels "Large","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(df)
##     Customer           State      Customer.Lifetime.Value Response  
##  AA10041:   1   Arizona   :1703   Min.   : 1898           No :7826  
##  AA11235:   1   California:3150   1st Qu.: 3994           Yes:1308  
##  AA16582:   1   Nevada    : 882   Median : 5780                     
##  AA30683:   1   Oregon    :2601   Mean   : 8005                     
##  AA34092:   1   Washington: 798   3rd Qu.: 8962                     
##  AA35519:   1                     Max.   :83325                     
##  (Other):9128                                                       
##      Coverage                   Education    Effective.To.Date
##  Basic   :5568   Bachelor            :2748   1/10/11: 195     
##  Extended:2742   College             :2681   1/27/11: 194     
##  Premium : 824   Doctor              : 342   2/14/11: 186     
##                  High School or Below:2622   1/26/11: 181     
##                  Master              : 741   1/17/11: 180     
##                                              1/19/11: 179     
##                                              (Other):8019     
##       EmploymentStatus Gender       Income       Location.Code 
##  Disabled     : 405    F:4658   Min.   :    0   Rural   :1773  
##  Employed     :5698    M:4476   1st Qu.:    0   Suburban:5779  
##  Medical Leave: 432             Median :33890   Urban   :1582  
##  Retired      : 282             Mean   :37657                  
##  Unemployed   :2317             3rd Qu.:62320                  
##                                 Max.   :99981                  
##                                                                
##   Marital.Status Monthly.Premium.Auto Months.Since.Last.Claim
##  Divorced:1369   Min.   : 61.00       Min.   : 0.0           
##  Married :5298   1st Qu.: 68.00       1st Qu.: 6.0           
##  Single  :2467   Median : 83.00       Median :14.0           
##                  Mean   : 93.22       Mean   :15.1           
##                  3rd Qu.:109.00       3rd Qu.:23.0           
##                  Max.   :298.00       Max.   :35.0           
##                                                              
##  Months.Since.Policy.Inception Number.of.Open.Complaints
##  Min.   : 0.00                 Min.   :0.0000           
##  1st Qu.:24.00                 1st Qu.:0.0000           
##  Median :48.00                 Median :0.0000           
##  Mean   :48.06                 Mean   :0.3844           
##  3rd Qu.:71.00                 3rd Qu.:0.0000           
##  Max.   :99.00                 Max.   :5.0000           
##                                                         
##  Number.of.Policies         Policy.Type            Policy    
##  Min.   :1.000      Corporate Auto:1968   Personal L3 :3426  
##  1st Qu.:1.000      Personal Auto :6788   Personal L2 :2122  
##  Median :2.000      Special Auto  : 378   Personal L1 :1240  
##  Mean   :2.966                            Corporate L3:1014  
##  3rd Qu.:4.000                            Corporate L2: 595  
##  Max.   :9.000                            Corporate L1: 359  
##                                           (Other)     : 378  
##  Renew.Offer.Type     Sales.Channel  Total.Claim.Amount
##  Offer1:3752      Agent      :3477   Min.   :   0.099  
##  Offer2:2926      Branch     :2567   1st Qu.: 272.258  
##  Offer3:1432      Call Center:1765   Median : 383.945  
##  Offer4:1024      Web        :1325   Mean   : 434.089  
##                                      3rd Qu.: 547.515  
##                                      Max.   :2893.240  
##                                                        
##        Vehicle.Class   Vehicle.Size 
##  Four-Door Car:4621   Large  : 946  
##  Luxury Car   : 163   Medsize:6424  
##  Luxury SUV   : 184   Small  :1764  
##  Sports Car   : 484                 
##  SUV          :1796                 
##  Two-Door Car :1886                 
## 

Let’s change the data type of Customer, as it is encoded as a factor variable:

df$Customer <- as.character(df$Customer)

Also notice that the Effective.To.Date column is coded as a factor when in fact it should be a Date. Let’s reformat it:

df$Effective.To.Date <- as.Date(df$Effective.To.Date, "%m/%d/%y")
typeof(df$Effective.To.Date)
## [1] "double"

Test whether there are any NA values across the whole data frame:

sum(is.na(df))
## [1] 0

Check per column: count the number of TRUE values in the logical matrix returned by is.na(df), and format the result as a data frame for readability:

data.frame(colSums(is.na(df)))
##                               colSums.is.na.df..
## Customer                                       0
## State                                          0
## Customer.Lifetime.Value                        0
## Response                                       0
## Coverage                                       0
## Education                                      0
## Effective.To.Date                              0
## EmploymentStatus                               0
## Gender                                         0
## Income                                         0
## Location.Code                                  0
## Marital.Status                                 0
## Monthly.Premium.Auto                           0
## Months.Since.Last.Claim                        0
## Months.Since.Policy.Inception                  0
## Number.of.Open.Complaints                      0
## Number.of.Policies                             0
## Policy.Type                                    0
## Policy                                         0
## Renew.Offer.Type                               0
## Sales.Channel                                  0
## Total.Claim.Amount                             0
## Vehicle.Class                                  0
## Vehicle.Size                                   0

Check if there are any duplicate rows:

anyDuplicated(df)
## [1] 0
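anyDuplicated(df) checks whole rows; since Customer looks like a unique identifier, an extra sketch (not in the original analysis) is to count duplicated IDs directly:

# expect 0 if every row is a distinct customer
sum(duplicated(df$Customer))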

Descriptive Statistics

Let’s look at the five-number summary of the variables we will be most interested in:

For Customer Lifetime Value

library(moments)  # provides skewness() and kurtosis()
summary(df$Customer.Lifetime.Value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1898    3994    5780    8005    8962   83325
skewness(df$Customer.Lifetime.Value)
## [1] 3.031782
kurtosis(df$Customer.Lifetime.Value)
## [1] 16.81531

For Monthly Premium

summary(df$Monthly.Premium.Auto)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   61.00   68.00   83.00   93.22  109.00  298.00
skewness(df$Monthly.Premium.Auto)
## [1] 2.123198
kurtosis(df$Monthly.Premium.Auto)
## [1] 9.189558

For Claim Amount

summary(df$Total.Claim.Amount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.099  272.258  383.945  434.089  547.515 2893.240
skewness(df$Total.Claim.Amount)
## [1] 1.714684
kurtosis(df$Total.Claim.Amount)
## [1] 8.975472
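As a compact alternative (a sketch using the moments functions already loaded), the three metrics can be summarised in one call:

# skewness and kurtosis for all three metrics at once
metrics <- c("Customer.Lifetime.Value", "Monthly.Premium.Auto", "Total.Claim.Amount")
sapply(df[metrics], function(x) c(skewness = skewness(x), kurtosis = kurtosis(x)))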

Now we’ll look into their distributions using histograms.
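The plotting code was not echoed in the output; a minimal sketch that would produce histograms of the three metrics with ggplot2’s defaults (hence the bin-count messages below) is:

ggplot(df, aes(x = Customer.Lifetime.Value))+geom_histogram()+ggtitle('CLV Distribution')
ggplot(df, aes(x = Monthly.Premium.Auto))+geom_histogram()+ggtitle('Monthly Premium Distribution')
ggplot(df, aes(x = Total.Claim.Amount))+geom_histogram()+ggtitle('Total Claim Amount Distribution')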

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that these metrics are all skewed to the right (all three skewness values are positive), to varying degrees.

Inferential Statistics

Item 1: Are these hypotheses true or false?

Rural customers are LESS valuable than Urban customers

df %>%
  group_by(Location.Code) %>%
  summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value), Std_clv = sd(Customer.Lifetime.Value)) %>%
  arrange(ave_clv)
## # A tibble: 3 x 4
##   Location.Code   cnt  ave_clv  Std_clv
##          <fctr> <int>    <dbl>    <dbl>
## 1         Rural  1773 7953.699 6595.473
## 2      Suburban  5779 8004.457 6963.661
## 3         Urban  1582 8064.133 6836.296

We will test the difference between two means. The null hypothesis is that the mean CLV for Rural equals that of Urban; the alternative hypothesis is that Rural is less than Urban.

First, let’s visualize the distribution of CLV:

Rural <- df %>% filter(Location.Code == 'Rural')
Urban <- df %>% filter(Location.Code == 'Urban')

ggplot(Rural, aes(x = Customer.Lifetime.Value))+geom_histogram()+ggtitle('Rural CLV')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Urban, aes(x = Customer.Lifetime.Value))+geom_histogram()+ggtitle('Urban CLV')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then we check whether the difference is statistically significant using a t-test:

urban <- df$Customer.Lifetime.Value[df$Location.Code == 'Urban']
rural <- df$Customer.Lifetime.Value[df$Location.Code == 'Rural']

# observed difference in means (reused in the permutation sketch below)
stat <- mean(urban) - mean(rural)

t.test(rural, urban, alternative = "less", conf.level = .95)
## 
##  Welch Two Sample t-test
## 
## data:  rural and urban
## t = -0.47489, df = 3279.4, p-value = 0.3174
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 272.1745
## sample estimates:
## mean of x mean of y 
##  7953.699  8064.133

The p-value reported by t.test is 0.3174, which is higher than the significance level of 0.05; thus we fail to reject the null hypothesis, i.e. there is insufficient evidence that Rural customers are less valuable than Urban customers.
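As a cross-check on the t-test, the observed difference stat computed above can feed a permutation test; a minimal sketch (a resampling alternative, not part of the original analysis; the seed is an arbitrary choice):

set.seed(42)  # assumed seed, for reproducibility only
perm_diffs <- replicate(10000, {
  shuffled <- sample(c(urban, rural))  # break any link between location and CLV
  mean(shuffled[seq_along(urban)]) - mean(shuffled[-seq_along(urban)])
})
# one-sided p-value: how often a random split beats the observed gap
mean(perm_diffs >= stat)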

Educated customers (with a bachelor’s or equivalent degree) are MORE valuable than others

What are the levels of Education in our dataset?

levels(df$Education)
## [1] "Bachelor"             "College"              "Doctor"              
## [4] "High School or Below" "Master"
df %>%
  group_by(Education) %>%
  summarise(ave_clv = mean(Customer.Lifetime.Value)) %>%
  arrange(ave_clv)
## # A tibble: 5 x 2
##              Education  ave_clv
##                 <fctr>    <dbl>
## 1               Doctor 7520.345
## 2              College 7851.065
## 3             Bachelor 7872.660
## 4               Master 8243.485
## 5 High School or Below 8296.709

Similar to the previous item, but this time the alternative hypothesis is that educated customers are more valuable (have higher CLV) than the others:

Educated <- df$Customer.Lifetime.Value[df$Education != 'High School or Below']
Non.Educ <- df$Customer.Lifetime.Value[df$Education == 'High School or Below']

t.test(Educated, Non.Educ, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  Educated and Non.Educ
## t = -2.4988, df = 4546.6, p-value = 0.9938
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -678.6923       Inf
## sample estimates:
## mean of x mean of y 
##  7887.462  8296.709

At a p-value of 0.99, we fail to reject the null hypothesis, i.e. there is insufficient evidence that educated customers have a higher CLV than others.
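Since CLV is strongly right-skewed, a rank-based test is a reasonable robustness check here; a minimal sketch using the Wilcoxon rank-sum test:

# nonparametric analogue of the one-sided t-test above
wilcox.test(Educated, Non.Educ, alternative = "greater")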

Marital status has no role to play in determining the value of a customer

df %>%
  group_by(Marital.Status) %>%
  summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value), Std_clv = sd(Customer.Lifetime.Value)) %>%
  arrange(ave_clv)
## # A tibble: 3 x 4
##   Marital.Status   cnt  ave_clv  Std_clv
##           <fctr> <int>    <dbl>    <dbl>
## 1         Single  2467 7714.837 6731.479
## 2        Married  5298 8078.967 6902.504
## 3       Divorced  1369 8241.239 6984.894

There are three levels. Now we test for a difference in mean CLV among them, using one-way ANOVA, regression, or the Kruskal-Wallis test:

kruskal.test(Customer.Lifetime.Value ~ Marital.Status, data = df)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Customer.Lifetime.Value by Marital.Status
## Kruskal-Wallis chi-squared = 20.896, df = 2, p-value = 2.901e-05

By the Kruskal-Wallis test, there is a significant difference in CLV across marital statuses.
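The Kruskal-Wallis test only tells us that at least one group differs. To see which pairs differ, a short follow-up sketch with Holm-adjusted pairwise rank-sum tests:

# pairwise comparisons across the three marital statuses
pairwise.wilcox.test(df$Customer.Lifetime.Value, df$Marital.Status, p.adjust.method = "holm")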

Let’s try linear regression:

Using linear regression we are testing the hypothesis that Marital.Status has a statistically significant relationship with CLV. The null hypothesis is that there is no relationship, i.e. the coefficients on the Marital.Status levels are all 0.

CLV.lm = lm(Customer.Lifetime.Value ~ Marital.Status, data = df)
summary(CLV.lm)
## 
## Call:
## lm(formula = Customer.Lifetime.Value ~ Marital.Status, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6237  -3990  -2219    956  75246 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8241.2      185.7  44.390   <2e-16 ***
## Marital.StatusMarried   -162.3      208.3  -0.779    0.436    
## Marital.StatusSingle    -526.4      231.5  -2.274    0.023 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6869 on 9131 degrees of freedom
## Multiple R-squared:  0.0007262,  Adjusted R-squared:  0.0005073 
## F-statistic: 3.318 on 2 and 9131 DF,  p-value: 0.03628
anova(CLV.lm)
## Analysis of Variance Table
## 
## Response: Customer.Lifetime.Value
##                  Df     Sum Sq   Mean Sq F value  Pr(>F)  
## Marital.Status    2 3.1310e+08 156548158  3.3177 0.03628 *
## Residuals      9131 4.3086e+11  47186247                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The overall F-test (p = 0.036) and the coefficient for Single (p = 0.023) are statistically significant, so we reject the null hypothesis: marital status does play a role, although the near-zero R-squared shows its effect on CLV is very small.
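As a quick sketch to connect the regression output with the group means computed earlier: the intercept is the mean CLV of the reference level (Divorced), and each coefficient is that level’s offset from it.

# intercept = mean CLV of the reference level, Divorced (~8241)
coef(CLV.lm)["(Intercept)"]
# Married mean = intercept + its offset (~8079, matching the group_by table above)
coef(CLV.lm)["(Intercept)"] + coef(CLV.lm)["Marital.StatusMarried"]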

West Coast agents are performing poorly - they are signing low-value customers

levels(df$State)
## [1] "Arizona"    "California" "Nevada"     "Oregon"     "Washington"
levels(df$Sales.Channel)
## [1] "Agent"       "Branch"      "Call Center" "Web"

First, we filter the data to get policies that have “Agent” as the Sales.Channel:

Agentdf <- df %>% filter(Sales.Channel == "Agent")

Agentdf %>%
  group_by(State) %>%
  summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value),Std_clv = sd(Customer.Lifetime.Value))
## # A tibble: 5 x 4
##        State   cnt  ave_clv  Std_clv
##       <fctr> <int>    <dbl>    <dbl>
## 1    Arizona   643 7841.994 6886.165
## 2 California  1201 7878.525 6059.478
## 3     Nevada   330 8729.443 8125.603
## 4     Oregon   988 7941.217 6414.209
## 5 Washington   315 7739.067 7077.971

Let the null hypothesis be that the average CLV of agent-sold policies in California equals that of agents in the other states; the alternative is that California’s average CLV is less than the rest.

We set up the data needed:

# index within Agentdf (not df) so the subset lines up with the Agent filter
California <- Agentdf$Customer.Lifetime.Value[Agentdf$State == 'California']
Others <- Agentdf$Customer.Lifetime.Value[Agentdf$State != 'California']

Then we test the hypothesis:

t.test(California, Others, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  California and Others
## t = 0.44252, df = 1886.1, p-value = 0.6709
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 517.6109
## sample estimates:
## mean of x mean of y 
##  8035.284  7925.594

The t-test result shows that there is insufficient evidence that California agents sign lower-CLV customers than the rest of the agents.

Call Center is not performing well compared to other channels throughout the country (in terms of high-value customers)

Who are the high-value customers? Let’s define high-value customers as those with CLV above the overall average CLV.

mean_CLV <- mean(df$Customer.Lifetime.Value)
High_Value <- df %>% filter(Customer.Lifetime.Value  > mean_CLV)
hist(High_Value$Customer.Lifetime.Value)

Let’s construct our dataset:

High_Value %>%
  group_by(Sales.Channel) %>%
  summarise(cnt = n(), ave_clv = mean(Customer.Lifetime.Value),Std_clv = sd(Customer.Lifetime.Value))
## # A tibble: 4 x 4
##   Sales.Channel   cnt  ave_clv  Std_clv
##          <fctr> <int>    <dbl>    <dbl>
## 1         Agent  1151 14471.77 7991.968
## 2        Branch   874 14849.87 8546.938
## 3   Call Center   590 14867.56 8767.698
## 4           Web   424 14543.84 8353.562
HV_CC <- High_Value$Customer.Lifetime.Value[High_Value$Sales.Channel == 'Call Center']
HV_NonCC <- High_Value$Customer.Lifetime.Value[High_Value$Sales.Channel != 'Call Center']

Then we perform the t-test:

t.test(HV_CC, HV_NonCC, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  HV_CC and HV_NonCC
## t = 0.62463, df = 858.02, p-value = 0.7338
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 903.1439
## sample estimates:
## mean of x mean of y 
##  14867.56  14619.18

There is insufficient evidence that the Call Center channel performs worse than the other channels.
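An alternative framing (a sketch, not in the original analysis): test whether the share of high-value customers differs across channels at all, using a chi-squared test on the channel-by-flag table:

# flag each customer as high value or not, then cross-tabulate by channel
high_flag <- df$Customer.Lifetime.Value > mean_CLV
chisq.test(table(df$Sales.Channel, high_flag))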

2. I have noticed recently that when a policy is about to expire, customers take advantage of the limited time available to them by making more claims than average during the closing stages of their policy. I want to know if this is true nationwide.

What does “a policy that is about to expire” mean? Does Effective.To.Date represent the expiry date?

The larger Months.Since.Policy.Inception is, the older the policy and the more likely it is to end soon. The argument is that more claims happen as the policy approaches expiry. Using these fields as proxies, we would expect Months.Since.Policy.Inception to be inversely related to Months.Since.Last.Claim.

Let’s visualize that:

colnames(df)
##  [1] "Customer"                      "State"                        
##  [3] "Customer.Lifetime.Value"       "Response"                     
##  [5] "Coverage"                      "Education"                    
##  [7] "Effective.To.Date"             "EmploymentStatus"             
##  [9] "Gender"                        "Income"                       
## [11] "Location.Code"                 "Marital.Status"               
## [13] "Monthly.Premium.Auto"          "Months.Since.Last.Claim"      
## [15] "Months.Since.Policy.Inception" "Number.of.Open.Complaints"    
## [17] "Number.of.Policies"            "Policy.Type"                  
## [19] "Policy"                        "Renew.Offer.Type"             
## [21] "Sales.Channel"                 "Total.Claim.Amount"           
## [23] "Vehicle.Class"                 "Vehicle.Size"
plot(df$Months.Since.Policy.Inception, df$Total.Claim.Amount)

# count rows at each (Months.Since.Last.Claim, Months.Since.Policy.Inception) pair
test <- df %>%
  group_by(Months.Since.Last.Claim, Months.Since.Policy.Inception) %>%
  mutate(cnt = n())

barplot(test$cnt)
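Rather than eyeballing the plots, the claimed inverse relationship can be quantified with a rank correlation; a minimal sketch (ties in these integer columns will force an approximate p-value):

# Spearman correlation between policy age and time since last claim
cor.test(df$Months.Since.Policy.Inception, df$Months.Since.Last.Claim,
         method = "spearman")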

3. West Coast USA is predominantly tech industry, and those companies are nowadays taking out Corporate Auto insurance policies for their employees. So I have noticed an increase in the ratio of Corporate to Personal Auto insurance sales. How does the national ratio compare to that of the West Coast?

Let’s see the national proportion of sales (Total Premium minus Total Claim Amount):

First we calculate the total premiums paid; subtracting the total claim amount from it gives the profit.

df$Total.Paid <- df$Monthly.Premium.Auto*df$Months.Since.Policy.Inception
df$Profit <- df$Total.Paid - df$Total.Claim.Amount
National_Sales <- df %>%
  group_by(Policy.Type) %>%
  summarise(Policy_cnt = n(), Total.Paid = sum(Total.Paid)) %>%
  mutate(prop = Total.Paid/sum(Total.Paid))

Then let’s filter the table to reflect California only:

Cal_Sales <- df %>%
  group_by(Policy.Type) %>%
  filter(State == 'California')%>%
  summarise(Policy_cnt = n(), Total.Paid = sum(Total.Paid)) %>%
  mutate(prop = Total.Paid/sum(Total.Paid))


National_Sales
## # A tibble: 3 x 4
##      Policy.Type Policy_cnt Total.Paid       prop
##           <fctr>      <int>      <int>      <dbl>
## 1 Corporate Auto       1968    8859245 0.21553789
## 2  Personal Auto       6788   30615238 0.74484269
## 3   Special Auto        378    1628475 0.03961941
Cal_Sales
## # A tibble: 3 x 4
##      Policy.Type Policy_cnt Total.Paid       prop
##           <fctr>      <int>      <int>      <dbl>
## 1 Corporate Auto        739    3331408 0.23768913
## 2  Personal Auto       2298   10221194 0.72926122
## 3   Special Auto        113     463218 0.03304965

There is only a very small difference between the West Coast and the national average. Let’s test whether this increase is significant:

# cbind() coerces the factor to its integer codes, hence PolicyType shows 1/2/3
cbind(PolicyType = National_Sales$Policy.Type, National = National_Sales$Total.Paid, California = Cal_Sales$Total.Paid)
##      PolicyType National California
## [1,]          1  8859245    3331408
## [2,]          2 30615238   10221194
## [3,]          3  1628475     463218
table_df <- National_Sales %>% select(Policy.Type,National = Total.Paid)%>%
  mutate(California = Cal_Sales$Total.Paid)

table_df
## # A tibble: 3 x 3
##      Policy.Type National California
##           <fctr>    <int>      <int>
## 1 Corporate Auto  8859245    3331408
## 2  Personal Auto 30615238   10221194
## 3   Special Auto  1628475     463218

The prop test is left for later; a sketch of how it might look follows.
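A minimal sketch of that deferred test, assuming we compare policy counts rather than premium dollars: does California write a greater proportion of Corporate Auto policies than the rest of the country?

# two-sample test of proportions on Corporate Auto policy counts
cal_corp   <- sum(df$State == "California" & df$Policy.Type == "Corporate Auto")
cal_total  <- sum(df$State == "California")
rest_corp  <- sum(df$State != "California" & df$Policy.Type == "Corporate Auto")
rest_total <- sum(df$State != "California")
prop.test(c(cal_corp, rest_corp), c(cal_total, rest_total), alternative = "greater")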

4. Who are my most valuable customers? By this I mean in the West Coast. Are they more or less valuable than the national average?

What is the national average Customer Lifetime Value?

mean(df$Customer.Lifetime.Value)
## [1] 8004.94
cal <- df %>% filter(State == 'California')%>%
  arrange(desc(Customer.Lifetime.Value))

ggplot(cal, aes(Customer.Lifetime.Value))+geom_histogram(binwidth = 1000)

boxplot(cal$Customer.Lifetime.Value)

plot(cal$Customer.Lifetime.Value)

The mean CLV for California is about the same as the national average. Let’s describe which of the California policies are higher value.

First, keep only the policies with CLV at or above the 75th percentile:

cal2 <- cal %>% filter(Customer.Lifetime.Value >= quantile(Customer.Lifetime.Value, 0.75))
dim(cal2)
## [1] 788  26
cal2 %>%
  group_by(Policy.Type) %>%
  summarise(Policy_cnt = n(), TCA = sum(Total.Claim.Amount), MPA = sum(Total.Paid),ave_CLV = mean(Customer.Lifetime.Value)) %>%
  mutate(prop = Policy_cnt/sum(Policy_cnt))
## # A tibble: 3 x 6
##      Policy.Type Policy_cnt       TCA     MPA  ave_CLV      prop
##           <fctr>      <int>     <dbl>   <int>    <dbl>     <dbl>
## 1 Corporate Auto        190 108330.86 1062094 16193.12 0.2411168
## 2  Personal Auto        567 315106.64 3097510 16707.64 0.7195431
## 3   Special Auto         31  16556.67  134154 16437.82 0.0393401

Given the number of Personal Auto policies in California with a relatively high average CLV, these customers are high value.

5. For my region, what does the comparison of channel performance look like? How does this compare with trends for the entire country?

Get a summary for the entire country:

df %>%
  group_by(Sales.Channel) %>%
  summarise(Policy_cnt = n(), TCA = sum(Total.Claim.Amount), MPA = sum(Total.Paid),ave_CLV = mean(Customer.Lifetime.Value)) %>%
  mutate(prop = Policy_cnt/sum(Policy_cnt))%>%
  arrange(desc(ave_CLV))
## # A tibble: 4 x 6
##   Sales.Channel Policy_cnt       TCA      MPA  ave_CLV      prop
##          <fctr>      <int>     <dbl>    <int>    <dbl>     <dbl>
## 1        Branch       2567 1111169.1 11690107 8119.712 0.2810379
## 2   Call Center       1765  755640.0  7757750 8100.086 0.1932341
## 3         Agent       3477 1524437.4 15638171 7957.709 0.3806656
## 4           Web       1325  573720.7  6016930 7779.788 0.1450624

On average, policies sold through the Branch channel have the highest Customer Lifetime Value, while policies sold through the Web channel have both the smallest share and the lowest CLV. In terms of penetration, the Agent channel is the most frequent. Although the Call Center channel has a lower policy count than Branch, its average CLV is only slightly below Branch’s.

Let’s see how this compares per region:

q5 <- df %>%
  group_by(Sales.Channel, State) %>%
  summarise(Policy_cnt = n(), TCA = sum(Total.Claim.Amount), MPA = sum(Total.Paid),ave_CLV = mean(Customer.Lifetime.Value)) %>%
  mutate(prop = Policy_cnt/sum(Policy_cnt))%>%
  arrange(desc(ave_CLV))

q5
## # A tibble: 20 x 7
## # Groups:   Sales.Channel [4]
##    Sales.Channel      State Policy_cnt       TCA     MPA  ave_CLV
##           <fctr>     <fctr>      <int>     <dbl>   <int>    <dbl>
##  1        Branch Washington        232 106478.14 1153179 8795.950
##  2         Agent     Nevada        330 149073.73 1575473 8729.443
##  3   Call Center     Nevada        163  76727.51  721756 8655.268
##  4           Web    Arizona        238 103625.31 1131286 8615.442
##  5        Branch     Oregon        719 306942.39 3394463 8410.491
##  6   Call Center     Oregon        497 211509.97 2229183 8261.157
##  7        Branch California        900 402141.64 4054585 8204.506
##  8   Call Center California        605 256429.43 2637512 8076.694
##  9         Agent     Oregon        988 430809.13 4464649 7941.217
## 10   Call Center Washington        147  63337.61  670643 7934.857
## 11         Agent California       1201 532881.73 5316623 7878.525
## 12         Agent    Arizona        643 279406.22 2897992 7841.994
## 13           Web California        444 187677.89 2007100 7835.420
## 14         Agent Washington        315 132266.55 1383434 7739.067
## 15   Call Center    Arizona        353 147635.44 1498656 7725.846
## 16        Branch    Arizona        469 193723.58 1976771 7607.172
## 17           Web     Oregon        397 177004.14 1778929 7586.298
## 18        Branch     Nevada        247 101883.33 1111109 7302.339
## 19           Web Washington        104  46119.89  464491 7271.581
## 20           Web     Nevada        142  59293.43  635124 7118.399
## # ... with 1 more variables: prop <dbl>