WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv
Hello team! I am Srivatsa the Regional Manager for ABC a larger Insurance company that operates throughout the United States. I am responsible for selling and monitoring insurance policies in the West Coast of United states. I sell various different kind of insurance policies and use different channels to sell my products (Call center, Agents etc). My boss, William D’Souza is the Group Vice President of Sales for the United States and he sent me over an analysis conducted nationally. The attached dataset contains the overall Customer Value (present value of future spends with ABC) of some of ABC’s customers thoughout USA. This document was discussed in the Board meetings of ABC and William wanted to make sure that some of the points discussed there were infact valid
Reading in the data file csv Checking the dimesions of the data and the datatypes
cva_1 <- read.csv("WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv", header = TRUE, stringsAsFactors = FALSE)
library("moments", lib.loc="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")
dim(cva_1)
## [1] 9134 24
str(cva_1)
## 'data.frame': 9134 obs. of 24 variables:
## $ Customer : chr "BU79786" "QZ44356" "AI49188" "WW63253" ...
## $ State : chr "Washington" "Arizona" "Nevada" "California" ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## $ Response : chr "No" "No" "No" "No" ...
## $ Coverage : chr "Basic" "Extended" "Premium" "Basic" ...
## $ Education : chr "Bachelor" "Bachelor" "Bachelor" "Bachelor" ...
## $ Effective.To.Date : chr "2/24/11" "1/31/11" "2/19/11" "1/20/11" ...
## $ EmploymentStatus : chr "Employed" "Unemployed" "Employed" "Unemployed" ...
## $ Gender : chr "F" "F" "F" "M" ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : chr "Suburban" "Suburban" "Suburban" "Suburban" ...
## $ Marital.Status : chr "Married" "Single" "Married" "Married" ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number.of.Policies : int 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : chr "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
## $ Policy : chr "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
## $ Renew.Offer.Type : chr "Offer1" "Offer3" "Offer1" "Offer1" ...
## $ Sales.Channel : chr "Agent" "Agent" "Agent" "Call Center" ...
## $ Total.Claim.Amount : num 385 1131 566 530 138 ...
## $ Vehicle.Class : chr "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
## $ Vehicle.Size : chr "Medsize" "Medsize" "Medsize" "Medsize" ...
Why did i use stringsAsFactors = FALSE
?
Checking null values in each column and storing the value in a data frame na_counts
na_counts <- sapply(cva_1, function(y) sum(is.na(y)))
na_counts <- data.frame(na_counts)
na_counts
## na_counts
## Customer 0
## State 0
## Customer.Lifetime.Value 0
## Response 0
## Coverage 0
## Education 0
## Effective.To.Date 0
## EmploymentStatus 0
## Gender 0
## Income 0
## Location.Code 0
## Marital.Status 0
## Monthly.Premium.Auto 0
## Months.Since.Last.Claim 0
## Months.Since.Policy.Inception 0
## Number.of.Open.Complaints 0
## Number.of.Policies 0
## Policy.Type 0
## Policy 0
## Renew.Offer.Type 0
## Sales.Channel 0
## Total.Claim.Amount 0
## Vehicle.Class 0
## Vehicle.Size 0
Checking if the date colums are in the correct format
typeof(cva_1$Effective.To.Date)
## [1] "character"
Casting Date columns - Changing the format of Effective.To.Date column and checking its head
cva_1$Effective.To.Date <- as.Date(cva_1$Effective.To.Date, "%m/%d/%y")
head(cva_1$Effective.To.Date)
## [1] "2011-02-24" "2011-01-31" "2011-02-19" "2011-01-20" "2011-02-03"
## [6] "2011-01-25"
typeof(cva_1$Effective.To.Date)
## [1] "double"
I want to quickly perform Year and Month analyses later…So i will create separate year and month columns
cva_1$Year <- format(cva_1$Effective.To.Date, "%Y")
cva_1$Month <- format(cva_1$Effective.To.Date, "%B")
Check how many unique year and month values are there
unique(cva_1$Year)
## [1] "2011"
unique(cva_1$Month)
## [1] "February" "January"
I want to cast some of the columns as factors. Especially columns that have very few levels. This will help with plotting and has other benefits
# The columns to convert to factors are Response, Coverage, Education (Ordered), EmploymentStatus, Gender, Location.Code, Marital.Status, Policy.Type, Policy, Renew.Offer.Type, Sales.Channel, Vehicle.Class, Vehicle.Size, Month (No Year as there is only 1)
cva_1$Response <- factor(cva_1$Response)
cva_1$Coverage <- factor(cva_1$Coverage)
cva_1$Education <- factor(cva_1$Education, levels = c("High School or Below", "College", "Bachelor", "Master", "Doctor"), ordered = TRUE)
cva_1$EmploymentStatus <- factor(cva_1$EmploymentStatus)
cva_1$Gender <- factor(cva_1$Gender)
cva_1$Location.Code <- factor(cva_1$Location.Code)
cva_1$Marital.Status <- factor(cva_1$Marital.Status)
cva_1$Policy.Type <- factor(cva_1$Policy.Type)
cva_1$Policy <- factor(cva_1$Policy)
cva_1$Renew.Offer.Type <- factor(cva_1$Renew.Offer.Type)
cva_1$Sales.Channel <- factor(cva_1$Sales.Channel)
cva_1$Vehicle.Class <- factor(cva_1$Vehicle.Class)
cva_1$Vehicle.Size <- factor(cva_1$Vehicle.Size)
cva_1$Month <- factor(cva_1$Month)
Quick check to see if the dataframe looks good now
str(cva_1)
## 'data.frame': 9134 obs. of 26 variables:
## $ Customer : chr "BU79786" "QZ44356" "AI49188" "WW63253" ...
## $ State : chr "Washington" "Arizona" "Nevada" "California" ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## $ Response : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ Coverage : Factor w/ 3 levels "Basic","Extended",..: 1 2 3 1 1 1 1 3 1 2 ...
## $ Education : Ord.factor w/ 5 levels "High School or Below"<..: 3 3 3 3 3 3 2 4 3 2 ...
## $ Effective.To.Date : Date, format: "2011-02-24" "2011-01-31" ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 2 5 2 5 2 2 2 5 3 2 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : Factor w/ 3 levels "Rural","Suburban",..: 2 2 2 2 1 1 2 3 2 3 ...
## $ Marital.Status : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number.of.Policies : int 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : Factor w/ 3 levels "Corporate Auto",..: 1 2 2 1 2 2 1 1 1 3 ...
## $ Policy : Factor w/ 9 levels "Corporate L1",..: 3 6 6 2 4 6 3 3 3 8 ...
## $ Renew.Offer.Type : Factor w/ 4 levels "Offer1","Offer2",..: 1 3 1 1 1 2 1 1 1 2 ...
## $ Sales.Channel : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
## $ Total.Claim.Amount : num 385 1131 566 530 138 ...
## $ Vehicle.Class : Factor w/ 6 levels "Four-Door Car",..: 6 1 6 5 1 6 1 1 1 1 ...
## $ Vehicle.Size : Factor w/ 3 levels "Large","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : chr "2011" "2011" "2011" "2011" ...
## $ Month : Factor w/ 2 levels "February","January": 1 2 1 2 1 2 1 2 2 1 ...
Everything looks great and we are now ready for our analysis
Maximum CLV is $8.332510^{4} and the minimum CLV is $1898
First Moment Mean of CLV is $8005 and the Median is $5780
Second Moment The Variance in CLV is 4.721019610^{7} and the Standard Deviation is $6870.9676084
Third Moment Skewness is 3.0317823
Fourth Moment Kurtosis is 16.8153107
This means that the distribution of CLV is positively skewed (as expected) and is heavily Leptokurtic. These results indicate a distribution that is heavily skewed with a very large tail. There are a LOT of Customers with low CLV. Very few customers with high CLV. This can be visually understood using the Histogram
hist(cva_1$Customer.Lifetime.Value, col = "red")
hist(cva_1$Customer.Lifetime.Value, breaks = (max(cva_1$Customer.Lifetime.Value) - min(cva_1$Customer.Lifetime.Value))/100, freq = FALSE, main = "CLV Histogram", border = "red")
Maximum TCA is $2893 and the minimum TCA is $0
First Moment Mean of TCA is $434 and the Median is $384
Second Moment The Variance in MPA is 8.439030310^{4} and the Standard Deviation is $290.5000918
Third Moment Skewness is 1.7146842
Fourth Moment Kurtosis is 8.9754716
Total Claim amounts also follow a trend similar to CLV and MPA although the distribution is NOT as skewed or as long tailed as MPA. This can be visually seen in the Histogram. This means that variation in data is CLV > MPA > TCA
hist(cva_1$Total.Claim.Amount, col = "red")
We can conclude from this that we are very far away from normal distributions and maybe if we want to model these variable we should think of a variable transformation (?)
The most obvious candidate for Independent variable is CLV. This also makes sense from a Business Perspective as we want to understand what contribues to making a high value customer (Descriptive analysis) and maybe later on predict who is going to be high value customer (Predictive analysis)
# Hypothesis 1 - Rural customers are less valuable than Urban customers
# What we want to do here is group the dataframe by Rural and Urban Customers and Aggregate the CLV column using the 'mean' function. This way we can compare the mean CLV for rual and urban customers
# Loading dplyr to do some data manipulation tasks. It is easier to group data using dplyr
library("dplyr", lib.loc="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hyp1_test1 <- cva_1 %>% group_by(Location.Code) %>% summarise(CLV = mean(Customer.Lifetime.Value), MPA = mean(Monthly.Premium.Auto), TCA = mean(Total.Claim.Amount))
library("ggplot2", lib.loc="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")
hyp1_test1
## # A tibble: 3 x 4
## Location.Code CLV MPA TCA
## <fctr> <dbl> <dbl> <dbl>
## 1 Rural 7953.699 89.80598 109.9051
## 2 Suburban 8004.457 95.22703 562.1599
## 3 Urban 8064.133 89.71049 329.5723
#plot(hyp1_test1$Location.Code, hyp1_test1$CLV, barplot(hyp1_test1$CLV))
#hyp1_test1_plot1 <- ggplot(data = hyp1_test1, mapping = aes(hyp1_test1$Location.Code, hyp1_test1$CLV))
# Another way of doing the same thing is to do
I have noticed recently that when a policy is about to expire … customers take advantage of the limited time available to them by making more claims than average during the closing stages of their policy. I want to know if this is true nation wide
West Coast USA is predominantly Tech industry. And they are nowadays taking out Corporate Auto insurance policies for their emplpoyees. So i have noticed an increase on the ratio of sales of corporate and personal Auto insurance. How does the national ratio compare to that of West Coast?
Who are my most valuable customers? By this i mean in the West Coast. Are they more or less valuable than the National Average?
For my region, how does the comparison of channel performance look like? How does this compare with trends for the entire country?
P:S - By West Coast i mean California :)
PP:S - Please submit your answers as Markdown files that show the analysis and results in one place