ProblemSet

Please solve Questions 1 to 5 based on the dataset `WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv`

Hello team! I am Srivatsa the Regional Manager for ABC a larger Insurance company that operates throughout the United States. I am responsible for selling and monitoring insurance policies in the West Coast of United states. I sell various different kind of insurance policies and use different channels to sell my products (Call center, Agents etc). My boss, William D’Souza is the Group Vice President of Sales for the United States and he sent me over an analysis conducted nationally. The attached dataset contains the overall Customer Value (present value of future spends with ABC) of some of ABC’s customers thoughout USA. This document was discussed in the Board meetings of ABC and William wanted to make sure that some of the points discussed there were infact valid

Data Sanity Checks

How many null values are there per column?
Are there any abnormal values?
- Numeric values in textual fields or vice versa
- Are there any duplicates (Especially Customer unique ids)?

Reading in the data file csv Checking the dimesions of the data and the datatypes

cva_1 <- read.csv("WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv", header = TRUE, stringsAsFactors = FALSE)

library("moments", lib.loc="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")

dim(cva_1)

## [1] 9134   24

str(cva_1)

## 'data.frame':    9134 obs. of  24 variables:
##  $ Customer                     : chr  "BU79786" "QZ44356" "AI49188" "WW63253" ...
##  $ State                        : chr  "Washington" "Arizona" "Nevada" "California" ...
##  $ Customer.Lifetime.Value      : num  2764 6980 12887 7646 2814 ...
##  $ Response                     : chr  "No" "No" "No" "No" ...
##  $ Coverage                     : chr  "Basic" "Extended" "Premium" "Basic" ...
##  $ Education                    : chr  "Bachelor" "Bachelor" "Bachelor" "Bachelor" ...
##  $ Effective.To.Date            : chr  "2/24/11" "1/31/11" "2/19/11" "1/20/11" ...
##  $ EmploymentStatus             : chr  "Employed" "Unemployed" "Employed" "Unemployed" ...
##  $ Gender                       : chr  "F" "F" "F" "M" ...
##  $ Income                       : int  56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
##  $ Location.Code                : chr  "Suburban" "Suburban" "Suburban" "Suburban" ...
##  $ Marital.Status               : chr  "Married" "Single" "Married" "Married" ...
##  $ Monthly.Premium.Auto         : int  69 94 108 106 73 69 67 101 71 93 ...
##  $ Months.Since.Last.Claim      : int  32 13 18 18 12 14 0 0 13 17 ...
##  $ Months.Since.Policy.Inception: int  5 42 38 65 44 94 13 68 3 7 ...
##  $ Number.of.Open.Complaints    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Number.of.Policies           : int  1 8 2 7 1 2 9 4 2 8 ...
##  $ Policy.Type                  : chr  "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
##  $ Policy                       : chr  "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
##  $ Renew.Offer.Type             : chr  "Offer1" "Offer3" "Offer1" "Offer1" ...
##  $ Sales.Channel                : chr  "Agent" "Agent" "Agent" "Call Center" ...
##  $ Total.Claim.Amount           : num  385 1131 566 530 138 ...
##  $ Vehicle.Class                : chr  "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
##  $ Vehicle.Size                 : chr  "Medsize" "Medsize" "Medsize" "Medsize" ...

Why did i use stringsAsFactors = FALSE?

Checking null values in each column and storing the value in a data frame na_counts

na_counts <- sapply(cva_1, function(y) sum(is.na(y)))
na_counts <- data.frame(na_counts)
na_counts

##                               na_counts
## Customer                              0
## State                                 0
## Customer.Lifetime.Value               0
## Response                              0
## Coverage                              0
## Education                             0
## Effective.To.Date                     0
## EmploymentStatus                      0
## Gender                                0
## Income                                0
## Location.Code                         0
## Marital.Status                        0
## Monthly.Premium.Auto                  0
## Months.Since.Last.Claim               0
## Months.Since.Policy.Inception         0
## Number.of.Open.Complaints             0
## Number.of.Policies                    0
## Policy.Type                           0
## Policy                                0
## Renew.Offer.Type                      0
## Sales.Channel                         0
## Total.Claim.Amount                    0
## Vehicle.Class                         0
## Vehicle.Size                          0

There are no null values
Date columns are stored as factors and integers. Needs to be cast right

Checking if the date colums are in the correct format

typeof(cva_1$Effective.To.Date)

## [1] "character"

Casting Date columns - Changing the format of Effective.To.Date column and checking its head

cva_1$Effective.To.Date <- as.Date(cva_1$Effective.To.Date, "%m/%d/%y")

head(cva_1$Effective.To.Date)

## [1] "2011-02-24" "2011-01-31" "2011-02-19" "2011-01-20" "2011-02-03"
## [6] "2011-01-25"

typeof(cva_1$Effective.To.Date)

## [1] "double"

I want to quickly perform Year and Month analyses later…So i will create separate year and month columns

cva_1$Year <- format(cva_1$Effective.To.Date, "%Y")
cva_1$Month <- format(cva_1$Effective.To.Date, "%B")

Check how many unique year and month values are there

unique(cva_1$Year)

## [1] "2011"

unique(cva_1$Month)

## [1] "February" "January"

I want to cast some of the columns as factors. Especially columns that have very few levels. This will help with plotting and has other benefits

# The columns to convert to factors are Response, Coverage, Education (Ordered), EmploymentStatus, Gender, Location.Code, Marital.Status, Policy.Type, Policy, Renew.Offer.Type, Sales.Channel, Vehicle.Class, Vehicle.Size, Month (No Year as there is only 1)

cva_1$Response <- factor(cva_1$Response)
cva_1$Coverage <- factor(cva_1$Coverage)
cva_1$Education <- factor(cva_1$Education, levels = c("High School or Below", "College", "Bachelor", "Master", "Doctor"), ordered = TRUE)
cva_1$EmploymentStatus <- factor(cva_1$EmploymentStatus)
cva_1$Gender <- factor(cva_1$Gender)
cva_1$Location.Code <- factor(cva_1$Location.Code)
cva_1$Marital.Status <- factor(cva_1$Marital.Status)
cva_1$Policy.Type <- factor(cva_1$Policy.Type)
cva_1$Policy <- factor(cva_1$Policy)
cva_1$Renew.Offer.Type <- factor(cva_1$Renew.Offer.Type)
cva_1$Sales.Channel <- factor(cva_1$Sales.Channel)
cva_1$Vehicle.Class <- factor(cva_1$Vehicle.Class)
cva_1$Vehicle.Size <- factor(cva_1$Vehicle.Size)
cva_1$Month <- factor(cva_1$Month)

Quick check to see if the dataframe looks good now

str(cva_1)

## 'data.frame':    9134 obs. of  26 variables:
##  $ Customer                     : chr  "BU79786" "QZ44356" "AI49188" "WW63253" ...
##  $ State                        : chr  "Washington" "Arizona" "Nevada" "California" ...
##  $ Customer.Lifetime.Value      : num  2764 6980 12887 7646 2814 ...
##  $ Response                     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ Coverage                     : Factor w/ 3 levels "Basic","Extended",..: 1 2 3 1 1 1 1 3 1 2 ...
##  $ Education                    : Ord.factor w/ 5 levels "High School or Below"<..: 3 3 3 3 3 3 2 4 3 2 ...
##  $ Effective.To.Date            : Date, format: "2011-02-24" "2011-01-31" ...
##  $ EmploymentStatus             : Factor w/ 5 levels "Disabled","Employed",..: 2 5 2 5 2 2 2 5 3 2 ...
##  $ Gender                       : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
##  $ Income                       : int  56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
##  $ Location.Code                : Factor w/ 3 levels "Rural","Suburban",..: 2 2 2 2 1 1 2 3 2 3 ...
##  $ Marital.Status               : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
##  $ Monthly.Premium.Auto         : int  69 94 108 106 73 69 67 101 71 93 ...
##  $ Months.Since.Last.Claim      : int  32 13 18 18 12 14 0 0 13 17 ...
##  $ Months.Since.Policy.Inception: int  5 42 38 65 44 94 13 68 3 7 ...
##  $ Number.of.Open.Complaints    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Number.of.Policies           : int  1 8 2 7 1 2 9 4 2 8 ...
##  $ Policy.Type                  : Factor w/ 3 levels "Corporate Auto",..: 1 2 2 1 2 2 1 1 1 3 ...
##  $ Policy                       : Factor w/ 9 levels "Corporate L1",..: 3 6 6 2 4 6 3 3 3 8 ...
##  $ Renew.Offer.Type             : Factor w/ 4 levels "Offer1","Offer2",..: 1 3 1 1 1 2 1 1 1 2 ...
##  $ Sales.Channel                : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
##  $ Total.Claim.Amount           : num  385 1131 566 530 138 ...
##  $ Vehicle.Class                : Factor w/ 6 levels "Four-Door Car",..: 6 1 6 5 1 6 1 1 1 1 ...
##  $ Vehicle.Size                 : Factor w/ 3 levels "Large","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year                         : chr  "2011" "2011" "2011" "2011" ...
##  $ Month                        : Factor w/ 2 levels "February","January": 1 2 1 2 1 2 1 2 2 1 ...

Everything looks great and we are now ready for our analysis

Descriptive Statistics

Can you give the descriptive statistics of Customer.Lifetime.Value, Monthly.Premium.Auto and Total.Claim.Amount columns?
- Histograms
- Skewness and Kurtosis

Customer.Lifetime.Value

Maximum CLV is $8.332510^{4} and the minimum CLV is $1898

First Moment Mean of CLV is $8005 and the Median is $5780

Second Moment The Variance in CLV is 4.721019610^{7} and the Standard Deviation is $6870.9676084

Third Moment Skewness is 3.0317823

Fourth Moment Kurtosis is 16.8153107

This means that the distribution of CLV is positively skewed (as expected) and is heavily Leptokurtic. These results indicate a distribution that is heavily skewed with a very large tail. There are a LOT of Customers with low CLV. Very few customers with high CLV. This can be visually understood using the Histogram

hist(cva_1$Customer.Lifetime.Value, col = "red")

hist(cva_1$Customer.Lifetime.Value, breaks = (max(cva_1$Customer.Lifetime.Value) - min(cva_1$Customer.Lifetime.Value))/100, freq = FALSE, main = "CLV Histogram", border = "red")

Monthly.Premium.Auto(MPA)

Maximum MPA is $298 and the minimum MPA is $61

First Moment Mean of MPA is $93 and the Median is $83

Second Moment The Variance in MPA is 1183.9082187 and the Standard Deviation is $34.4079674

Third Moment Skewness is 2.1231977

Fourth Moment Kurtosis is 9.1895583

Monthly premiums follow a trend similar to CLV although the distribution is NOT as skewed or as long tailed as CLV. This can be visually seen in the Histogram

hist(cva_1$Monthly.Premium.Auto, col = "red")

Total.Claim.Amount(TCA)

Maximum TCA is $2893 and the minimum TCA is $0

First Moment Mean of TCA is $434 and the Median is $384

Second Moment The Variance in MPA is 8.439030310^{4} and the Standard Deviation is $290.5000918

Third Moment Skewness is 1.7146842

Fourth Moment Kurtosis is 8.9754716

Total Claim amounts also follow a trend similar to CLV and MPA although the distribution is NOT as skewed or as long tailed as MPA. This can be visually seen in the Histogram. This means that variation in data is CLV > MPA > TCA

hist(cva_1$Total.Claim.Amount, col = "red")

We can conclude from this that we are very far away from normal distributions and maybe if we want to model these variable we should think of a variable transformation (?)

Inferential Statistics

The most obvious candidate for Independent variable is CLV. This also makes sense from a Business Perspective as we want to understand what contribues to making a high value customer (Descriptive analysis) and maybe later on predict who is going to be high value customer (Predictive analysis)

Can you guys please go through the dataset and answer if the following hypotheses are true or false?
- Rural customers are LESS valuable than Urban customers
- Educated customers (with a bachelors or equivalent degree) are more valuable than others
- Marital status has no role to play in determining the value of a customer
- West Coast agents are performing poorly - They are signing low value customers
- Call Center is not performing well comparerd to other channels throughout the country (in terms of high value customers)

# Hypothesis 1 - Rural customers are less valuable than Urban customers

# What we want to do here is group the dataframe by Rural and Urban Customers and Aggregate the CLV column using the 'mean' function. This way we can compare the mean CLV for rual and urban customers
# Loading dplyr to do some data manipulation tasks. It is easier to group data using dplyr

library("dplyr", lib.loc="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

hyp1_test1 <- cva_1 %>% group_by(Location.Code) %>% summarise(CLV = mean(Customer.Lifetime.Value), MPA = mean(Monthly.Premium.Auto), TCA = mean(Total.Claim.Amount))

library("ggplot2", lib.loc="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")

hyp1_test1

## # A tibble: 3 x 4
##   Location.Code      CLV      MPA      TCA
##          <fctr>    <dbl>    <dbl>    <dbl>
## 1         Rural 7953.699 89.80598 109.9051
## 2      Suburban 8004.457 95.22703 562.1599
## 3         Urban 8064.133 89.71049 329.5723

#plot(hyp1_test1$Location.Code, hyp1_test1$CLV, barplot(hyp1_test1$CLV))

#hyp1_test1_plot1 <- ggplot(data = hyp1_test1, mapping = aes(hyp1_test1$Location.Code, hyp1_test1$CLV))

# Another way of doing the same thing is to do

I have noticed recently that when a policy is about to expire … customers take advantage of the limited time available to them by making more claims than average during the closing stages of their policy. I want to know if this is true nation wide
West Coast USA is predominantly Tech industry. And they are nowadays taking out Corporate Auto insurance policies for their emplpoyees. So i have noticed an increase on the ratio of sales of corporate and personal Auto insurance. How does the national ratio compare to that of West Coast?
Who are my most valuable customers? By this i mean in the West Coast. Are they more or less valuable than the National Average?
For my region, how does the comparison of channel performance look like? How does this compare with trends for the entire country?

P:S - By West Coast i mean California :)

PP:S - Please submit your answers as Markdown files that show the analysis and results in one place

ProblemSet_1

Srivatsa

6/22/2017

Please solve Questions 1 to 5 based on the dataset `WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv`

Data Sanity Checks

Descriptive Statistics

Customer.Lifetime.Value

Monthly.Premium.Auto(MPA)

Total.Claim.Amount(TCA)

Inferential Statistics

ProblemSet_1

Srivatsa

6/22/2017

Please solve Questions 1 to 5 based on the dataset WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv

Data Sanity Checks

Descriptive Statistics

Customer.Lifetime.Value

Monthly.Premium.Auto(MPA)

Total.Claim.Amount(TCA)

Inferential Statistics

Please solve Questions 1 to 5 based on the dataset `WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv`