Customer Personality Analysis

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.

Here we have a data of 2240 customers about their educational level, marital status, year of birth, amount spend on our products, purchase modes in last 2 years. Here, we will be discovering about the factors that affects amount spend by customers.

About the data

Variable Description
ID Customer’s unique identifier
Year_Birth Customer’s birth year
Education Customer’s education level
Marital_Status Customer’s marital status
Income Customer’s yearly household income
Kidhome Number of children in household
Teenhome Number of teenagers in household
Dt_Customer Date of customer’s enrollment with company
Recency Number of days since customer’s last purchase
Complain 1: customer complained in last 2 yrs, else: 0
Variable Description
MntWines Amount spent on wine in last 2 years
MntFruits Amount spent on fruits in last 2 years
MntMeatProducts Amount spent on meat in last 2 years
MntFishProducts Amount spent on fish in last 2 years
MntSweetProducts Amount spent on sweets in last 2 years
MntGoldProds Amount spent on gold in last 2 years

Promotions

  • AcceptedCmp1 1 (1 to 5) if customer accepted the offer in the 1st campaign, 0 otherwise
  • Response 1 if customer accepted the offer in the last campaign, 0 otherwise
Variable Description
NumDealsPurchases Number of purchases made with a discount
NumWebPurchases Number of purchases made through the company’s web site
NumCatalogPurchases Number of purchases made using a catalogue
NumStorePurchases Number of purchases made directly in stores
NumWebVisitsMonth Number of visits to company’s web site in the last month

Load the data

# install.packages('skimr')
library(skimr)
# install.packages('DataExplorer')
library(DataExplorer)
library(tidyverse)
# install.packages('car')
library(car)
# install.packages('rstatix')
library(rstatix)
library(paletteer)

main_df = read.table("marketing_campaign.csv", sep="\t", header=T)
head(main_df)
##     ID Year_Birth  Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524       1957 Graduation         Single  58138       0        0  04-09-2012
## 2 2174       1954 Graduation         Single  46344       1        1  08-03-2014
## 3 4141       1965 Graduation       Together  71613       0        0  21-08-2013
## 4 6182       1984 Graduation       Together  26646       1        0  10-02-2014
## 5 5324       1981        PhD        Married  58293       1        0  19-01-2014
## 6 7446       1967     Master       Together  62513       0        1  09-09-2013
##   Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      58      635        88             546             172               88
## 2      38       11         1               6               2                1
## 3      26      426        49             127             111               21
## 4      26       11         4              20              10                3
## 5      94      173        43             118              46               27
## 6      16      520        42              98               0               42
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1           88                 3               8                  10
## 2            6                 2               1                   1
## 3           42                 1               8                   2
## 4            5                 2               2                   0
## 5           15                 5               5                   3
## 6           14                 2               6                   4
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1                 4                 7            0            0            0
## 2                 2                 5            0            0            0
## 3                10                 4            0            0            0
## 4                 4                 6            0            0            0
## 5                 6                 5            0            0            0
## 6                10                 6            0            0            0
##   AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1            0            0        0             3        11        1
## 2            0            0        0             3        11        0
## 3            0            0        0             3        11        0
## 4            0            0        0             3        11        0
## 5            0            0        0             3        11        0
## 6            0            0        0             3        11        0

Inspect the data

Using {DataExplorer} library function to plot an introduction to the dataset

DataExplorer::plot_intro(data = main_df, title ="DataFrame Intro", ggtheme = NULL )

The plot shows that missing observations make up 0.037%, and 98.9% are complete rows. Most of the data we have is continuous.

We can use {skimr} library to get quick summary statistics of our data.

skimr::skim(main_df)
Data summary
Name main_df
Number of rows 2240
Number of columns 29
_______________________
Column type frequency:
character 3
numeric 26
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Education 0 1 3 10 0 5 0
Marital_Status 0 1 4 8 0 8 0
Dt_Customer 0 1 10 10 0 663 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 5592.16 3246.66 0 2828.25 5458.5 8427.75 11191 ▇▇▇▇▇
Year_Birth 0 1.00 1968.81 11.98 1893 1959.00 1970.0 1977.00 1996 ▁▁▂▇▅
Income 24 0.99 52247.25 25173.08 1730 35303.00 51381.5 68522.00 666666 ▇▁▁▁▁
Kidhome 0 1.00 0.44 0.54 0 0.00 0.0 1.00 2 ▇▁▆▁▁
Teenhome 0 1.00 0.51 0.54 0 0.00 0.0 1.00 2 ▇▁▇▁▁
Recency 0 1.00 49.11 28.96 0 24.00 49.0 74.00 99 ▇▇▇▇▇
MntWines 0 1.00 303.94 336.60 0 23.75 173.5 504.25 1493 ▇▂▂▁▁
MntFruits 0 1.00 26.30 39.77 0 1.00 8.0 33.00 199 ▇▁▁▁▁
MntMeatProducts 0 1.00 166.95 225.72 0 16.00 67.0 232.00 1725 ▇▁▁▁▁
MntFishProducts 0 1.00 37.53 54.63 0 3.00 12.0 50.00 259 ▇▁▁▁▁
MntSweetProducts 0 1.00 27.06 41.28 0 1.00 8.0 33.00 263 ▇▁▁▁▁
MntGoldProds 0 1.00 44.02 52.17 0 9.00 24.0 56.00 362 ▇▁▁▁▁
NumDealsPurchases 0 1.00 2.33 1.93 0 1.00 2.0 3.00 15 ▇▂▁▁▁
NumWebPurchases 0 1.00 4.08 2.78 0 2.00 4.0 6.00 27 ▇▃▁▁▁
NumCatalogPurchases 0 1.00 2.66 2.92 0 0.00 2.0 4.00 28 ▇▂▁▁▁
NumStorePurchases 0 1.00 5.79 3.25 0 3.00 5.0 8.00 13 ▂▇▂▃▂
NumWebVisitsMonth 0 1.00 5.32 2.43 0 3.00 6.0 7.00 20 ▅▇▁▁▁
AcceptedCmp3 0 1.00 0.07 0.26 0 0.00 0.0 0.00 1 ▇▁▁▁▁
AcceptedCmp4 0 1.00 0.07 0.26 0 0.00 0.0 0.00 1 ▇▁▁▁▁
AcceptedCmp5 0 1.00 0.07 0.26 0 0.00 0.0 0.00 1 ▇▁▁▁▁
AcceptedCmp1 0 1.00 0.06 0.25 0 0.00 0.0 0.00 1 ▇▁▁▁▁
AcceptedCmp2 0 1.00 0.01 0.11 0 0.00 0.0 0.00 1 ▇▁▁▁▁
Complain 0 1.00 0.01 0.10 0 0.00 0.0 0.00 1 ▇▁▁▁▁
Z_CostContact 0 1.00 3.00 0.00 3 3.00 3.0 3.00 3 ▁▁▇▁▁
Z_Revenue 0 1.00 11.00 0.00 11 11.00 11.0 11.00 11 ▁▁▇▁▁
Response 0 1.00 0.15 0.36 0 0.00 0.0 0.00 1 ▇▁▁▁▂
DataExplorer::plot_bar(data = main_df, ncol = 2, ggtheme = NULL)

Clean the data

The columns Z_CostContact and Z_Revenue are single value columns and should be removed.

main_df = main_df %>% 
  subset(select = -c(Z_CostContact, Z_Revenue))

Let’s look at the Income variable and its distribution

ggplot(main_df, aes(x= Income)) +
  geom_density(color='green', fill='grey30') +
  scale_x_continuous(n.breaks = 10, labels = scales::dollar_format()) +
  
  ggdark::dark_mode() +
  labs(
    title = "Income Distribution ",
    x="Income"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.x = element_text(size = 11)
  )
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
## Warning: Removed 24 rows containing non-finite outside the scale range
## (`stat_density()`).

Clearly the income distribution is largely less than $100,000 for this dataset.

We can fill in the missing value with a median value

main_df$Income[is.na(main_df$Income)] = median(main_df$Income, na.rm = TRUE)

ggplot(main_df, aes(x= Income)) +
  geom_density(color='green', fill='grey30') +
  scale_x_continuous(n.breaks = 10, labels = scales::dollar_format()) +
  
  ggdark::dark_mode() +
  labs(
    title = "Income Distribution ",
    x="Income"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.x = element_text(size = 11)
  )

A slight change in the the density of the income distribution.

Education values

Next is to change the category for Master’s degree as labeled 2nd cycle.

main_df = main_df %>% 
  mutate(Education = replace(Education, Education =="2n Cycle", "Master"))

Marital Status values

Group the marital status into a binary of ‘single’ and ‘in relationship’.

main_df = main_df %>% 
  mutate(Marital_Status = replace(Marital_Status, Marital_Status == 'Alone' | Marital_Status == 'Absurd' | Marital_Status =='YOLO' | Marital_Status =='Widow' | Marital_Status == 'Divorced','Single')) %>% 
  mutate(Marital_Status = replace(Marital_Status, Marital_Status =='Married' | Marital_Status =='Together','In Relationship'))

head(main_df$Marital_Status)
## [1] "Single"          "Single"          "In Relationship" "In Relationship"
## [5] "In Relationship" "In Relationship"

Year_Birth

Need to calculate the current year, and add the column Age to replace Year_Birth

current_date = Sys.Date()
current_year = format(current_date, format="%Y")
current_year = as.integer(current_year)

main_df = main_df %>% 
  mutate(Age = 2024 - Year_Birth) %>% 
  select( - Year_Birth)

Adding Age range

Needing to use dplyr::if_else() statement to capture the various age ranges to know what the customers age cohorts will be.

main_df = main_df %>% 
  mutate(
    Age_Range = if_else(Age <= 16, "Child",
                        if_else(17 <= Age & Age <= 18,"Youth",
                                if_else(18 < Age & Age <= 30,"Young Adult",
                                        if_else(30 < Age & Age <=65,"Adult", "Senior"))))
  )

Calculate Total Spenditure

main_df = main_df %>%
  mutate(Total_Spent = MntWines + MntFruits + MntMeatProducts 
         + MntFishProducts + MntSweetProducts + MntGoldProds)

Calculate Total Purchases by Customers

main_df = main_df %>%
  mutate(Total_Purchases = NumWebPurchases + 
           NumCatalogPurchases + NumStorePurchases + NumDealsPurchases)

Calculate the Enrollment Year of Customers

main_df$Dt_Customer[1]
## [1] "04-09-2012"
# dates are messy, in this data there are formats with month-day-year and some with day-month-year
main_df = main_df %>%
  mutate(
    Enrollment = if_else(
      lubridate::mdy(Dt_Customer, quiet = TRUE) %in% lubridate::ymd(),
      lubridate::mdy(Dt_Customer),
      lubridate::dmy(Dt_Customer)
    )
  ) %>% 
  mutate(Enrollment_Yr  = year(Enrollment),
         Seniority = year(current_date) - Enrollment_Yr)

Calculate the Total Accepted Offers

main_df = main_df %>%
  mutate(Total_Offers = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 )
head( main_df, 2)
##     ID  Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency
## 1 5524 Graduation         Single  58138       0        0  04-09-2012      58
## 2 2174 Graduation         Single  46344       1        1  08-03-2014      38
##   MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      635        88             546             172               88
## 2       11         1               6               2                1
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1           88                 3               8                  10
## 2            6                 2               1                   1
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1                 4                 7            0            0            0
## 2                 2                 5            0            0            0
##   AcceptedCmp1 AcceptedCmp2 Complain Response Age Age_Range Total_Spent
## 1            0            0        0        1  67    Senior        1617
## 2            0            0        0        0  70    Senior          27
##   Total_Purchases Enrollment Enrollment_Yr Seniority Total_Offers
## 1              25 2012-09-04          2012        12            0
## 2               6 2014-03-08          2014        10            0

Exploratory Data Analysis

Age Group Percentiles

age_range = main_df %>% 
  select(Age_Range) %>% 
  group_by(Age_Range) %>% 
  summarise( num = n() ) %>% 
  mutate( Age_Range_Percent = round(num*100 / sum(num), 2) ) %>% 
  mutate(Age_Range = as_factor(Age_Range))

ggplot(age_range, 
       aes(x=age_range$Age_Range , 
           y=num,
           fill=Age_Range
           )
       )+ 
  geom_bar(stat='identity', show.legend = FALSE) + 
  scale_y_continuous(n.breaks = 10) +
  labs(
    title= "Age Range Percentages",
    x='Age Group', 
    y='Count'
    )+
  geom_text(aes(label=paste0(num ,' (',age_range$Age_Range_Percent,'%',')')),
            vjust= -0.5, color='grey90', 
            show.legend = FALSE,
            size = 4
            ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    # plot.margin = margin(t = 2,r = 0,b = 0,l = 0,unit = 'cm')
    
  )

Our demographics for this dataset that was categorized:

  • if person is ages between 18 and 30, they were labelled “Young Adult”
  • if person was between the ages of 30 and 65, they were labelled “Adult”
  • if person was over age 65 is labelled “Senior”

76% of data is adults who have purchasing power, along with under a quarter (23%) of the data population being seniors, while those under 30 but are adults is less than 1 percent (0.45%). The data did not find any person under the age of 18 to be labelled “child”.

Probability Density Function

Probability Density Function shows the likelihood of a continuous variable to take on a specific value within a range. Here we want to see the probability of customer spending amounts within the density range.

main_df %>% 
  ggplot(
    aes(x= Total_Spent)
  ) +
  geom_histogram(bins = 10, aes(y= stat(width*density)), color='black' , fill='gold', alpha= 0.5 ) +
  scale_x_continuous(breaks =  seq(0,2500, by= 250), labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent_format(accuracy=1), n.breaks = 10) +
  labs(
    title = "PDF: Total Spent Probability",
    x="Total Spent",
    y="Probability"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
    axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
    axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
  )

The Probability Density Function plot shows:

  • Over 35% of the data population spends less than $250
  • 15% of the data population is spending $250
  • around 10% for spending amounts between 500 to 1,000 dollars
  • customers spending more than $2,250 is around 0%

ECDF on Spending

Empirical cumulative distribution function (ECDF) is graphical representation of the cumulative probability distribution of a dataset. An ECDF is a step function that shows the proportion of data points in a sample that are less than or equal to a given value. It provides a way to visualize the distribution of a dataset without making assumptions about its underlying distribution.

main_df %>% 
  ggplot(
    aes(x= Total_Spent)
  )+
  stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
  labs(
    title = "ECDF: Total Spend",
    y="Probability\n"
  )+
  scale_y_continuous(breaks =  seq(0, 1.0, by= 0.1)) +
  scale_x_continuous(breaks = seq(0,2600, by= 200) , labels = scales::dollar_format() ) +
  geom_hline( aes(yintercept = 0.55, color='gold'), show.legend = FALSE) +
  geom_vline( aes(xintercept = 502, color='gold'), show.legend = FALSE )+
  geom_text(aes(x=410,  y=0.53, 
                label = "55%", 
                vjust = -1, 
                colour = 'gold', 
                nudge=1,
                size = 11
                ), 
            show.legend = FALSE
            ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.y = element_text(size = 11, colour = 'grey70', face = 'bold'),
    axis.text.x = element_text(size = 11, angle =45, hjust = 1)
  )

The ECDF plot shows that 55% of the data population spend less than 500 dollars.

PDF on Income

main_df %>% 
  ggplot(
    aes(x= Income)
  ) +
  geom_histogram(bins = 6, aes(y= stat(width*density)), color='grey0' , fill='gold', alpha= 0.5 ) +
  labs(title = "PDF on Income", y="Probability" ) +
  scale_x_continuous(limits=c(0,200000), breaks=seq(0,200000,by=20000), labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent_format(accuracy=1) )+
  theme(
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
    axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
    axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
  )

The probability density function plot on income shows that 60% of the customer population in the data has income ranging from 20,000 to 60,000 dollars annually. Second group of customers are those who make 60,000 to 100k dollars annually, while customers who have income greater than 100k are marginal.

ECDF on Income

main_df %>% 
  ggplot(
    aes(x= Income)
  )+
  stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
  labs(
    title = "ECDF: Income",
    y="Probability\n"
  )+
  scale_y_continuous(breaks =  seq(0, 1.0, by= 0.1)) +
  scale_x_continuous(limits=c(0,200000),breaks=seq(0,200000,by=20000), labels = scales::dollar_format() )+
  geom_hline( aes(yintercept = 0.95, color='gold'), show.legend = FALSE) +
  geom_vline( aes(xintercept = 83e3, color='gold'), show.legend = FALSE )+
  geom_text(aes(x=65e3,  y=0.83,
                label = "95%",
                vjust = -1,
                colour = 'gold',
                nudge=1,
                size = 11
                ),
            show.legend = FALSE
            ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.y = element_text(size = 11, face = 'bold'),
    axis.text.x = element_text(size = 11, angle =45, hjust = 1),
    # axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
    
  )

The ECDF plot shows that 95% of the customer population makes less than $83,000 annually. While 50% of the population makes $50,000 annually.

Correlation: Income and Expense Variable

A correlation test between income and total spent variable

# 3 variables are required for this library function
main_df %>% rstatix::cor_mat(Income, Total_Spent, Age)
## # A tibble: 3 × 4
##   rowname     Income Total_Spent   Age
## * <chr>        <dbl>       <dbl> <dbl>
## 1 Income        1           0.66  0.16
## 2 Total_Spent   0.66        1     0.11
## 3 Age           0.16        0.11  1
cor.test(main_df$Income, main_df$Total_Spent)
## 
##  Pearson's product-moment correlation
## 
## data:  main_df$Income and main_df$Total_Spent
## t = 42.098, df = 2238, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6410074 0.6872687
## sample estimates:
##       cor 
## 0.6647749

Since the p-value is less than 0.05, we have sufficient evidence to state that there is correlation between these two variables Income and Total Spent.

Correlation Plot

main_df %>% 
  ggplot(
    aes(x= Income, y= Total_Spent)
  )+
  geom_point(shape= 21, color='green', fill='black', size=3 ) +
  scale_x_continuous(limits=c(0,2e5),breaks=seq(0,2e5,by=20000), labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format() , limits =  c(0,3500), n.breaks = 10)+
  geom_smooth(method=lm, se=FALSE, color='gold',linetype='dashed') +
  labs(title = "Correlation Plot: Income and Total Spent", 
       y="Total Spent") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14),
    axis.text.y = element_text(size = 12, face = 'bold', color = 'gold'),
    axis.text.x = element_text(size = 12, angle =45, hjust = 1, color = 'green'),
  )

The correlation coefficient is 0.664 for income and total spent variables. The plot shows that there is a moderate positive correlation, meaning that as income rises, so does the total spent.

Average amount spent by Education

main_df %>% 
  ggplot( aes(x= Education, y= Total_Spent )) +  
  geom_boxplot(fill = "#004d80",alpha = 0.5, color = '#0099ff',outlier.colour = 'yellow') +
  scale_y_continuous(n.breaks = 5, labels = scales::dollar_format() )+
  # facet_wrap(~Education, scale= "free") +
  labs(title = "Amount Spent based on Education Levels", y="Total Spent") +
  theme(
    plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
    axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
    axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
  )

Customers with basic education have much lower spending amounts compared to other education levels, which looks to be around $500.

Average Spending by Marital Status

ggplot(main_df, aes(x=Marital_Status,y=Total_Spent))+
  geom_boxplot(fill = "#004d80",alpha = 0.5, color = '#0099ff') +
  scale_y_continuous(n.breaks = 5, labels = scales::dollar_format() )+
  labs(title = "Amount Spent based on Marital Status", y="Total Spent") +
  theme(
    plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
    axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
    axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
  )

Customer population have similar spending based on their marital status.

Average Spending by Age Range

ggplot(main_df, aes(x=Age_Range, y=Total_Spent ))+
  geom_boxplot(fill = "#004d80",alpha = 0.5, color = '#0099ff') +
  scale_y_continuous(n.breaks = 5, labels = scales::dollar_format() )+
  labs(title = "Amount Spent based on Age Range", y="Total Spent", x="Age Range") +
  theme(
    plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
    axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
    axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
  )

Interesting that young adults are a small percentage of the customer population, but spend more than the other adults.

Average Spend by Products

products_df = main_df %>% select(starts_with("Mnt"))
Product_Name = data.frame(Product_Name=rep(c('Wine','Fruit','Meat','Fish','Sweet','Gold'),each=2240))
Total_Spent = data.frame(Total_Spent = unlist(products_df),row.names=NULL)
products_df = data.frame(Product_Name,Total_Spent)
head(products_df,3)
##   Product_Name Total_Spent
## 1         Wine         635
## 2         Wine          11
## 3         Wine         426
ggplot(products_df, 
       aes(x=Product_Name, 
           y=Total_Spent 
           )
       )+
  geom_boxplot(fill = "#004d80",
               alpha = 0.5, 
               color = '#0099ff', 
               outlier.colour = 'yellow', 
               # outlier.size = 1.5,
               # outlier.alpha = 0.5,
               # outlier.shape = 2
               ) +
  scale_y_continuous(n.breaks = 7, labels = scales::dollar_format() )+
  labs(title = "Amounts Spent by Product", y="Total Spent", x="Products") +
  theme(
    plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
    axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
    axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
  )

The boxplot shows that the meat and wine are where customers are spending the most amount of money.

PDF on Purchases in last 2 years

main_df %>% 
  ggplot(
    aes(x= Total_Purchases)
  ) +
  geom_histogram(bins = 6, aes(y= stat(width*density)), color='grey0' , fill='gold', alpha= 0.5 ) +
  labs(
    title = "PDF on Purchases", 
    y="Probability" ,
    x= "Number of Purchases"
    ) +
  
  # scale_x_continuous(limits=c(0,200000), breaks=seq(0,200000,by=20000), labels = scales::dollar_format()) +
  # scale_y_continuous(labels = scales::percent_format(accuracy=1) )+
  # 
  scale_y_continuous(labels = scales::percent_format(accuracy=1)) +
  scale_x_continuous(limits=c(0,50),breaks=seq(0,50,by=5)) +
  
  theme(
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
    axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
    axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
  )

The PDF plot shows that 40% of customers purchase between 15 and 25 times, followed by customers who purchase items between 5 and 15 times (38%). The customers who purchase items between 25 and 35 times make up 9%.

ECDF on Total Purchases

main_df %>% 
  ggplot(
    aes(x= Total_Purchases)
  )+
  stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
  labs(
    title = "ECDF: Total Purchases",
    y="Probability\n",
    x="Total Purchases"
  ) +
  scale_y_continuous(breaks=seq(0,1.0,by=0.1))+
  scale_x_continuous(n.breaks = 10)+
  
  # geom_hline( aes(yintercept = 0.50, color='gold'), show.legend = FALSE) +
  geom_vline( aes(xintercept = 15, color='gold'), show.legend = FALSE )+
  geom_text(aes(x=12,  y=0.48,
                label = "50%",
                vjust = -1,
                colour = 'gold',
                nudge=1,
                size = 11
                ),
            show.legend = FALSE
            ) +
  
  geom_hline( aes(yintercept = 0.88, color='gold'), show.legend = FALSE) +
  geom_vline( aes(xintercept = 25, color='gold'), show.legend = FALSE )+
  geom_text(aes(x=20,  y=0.78,
                label = "88%",
                vjust = -1,
                colour = 'gold',
                nudge=1,
                size = 11
                ),
            show.legend = FALSE
            ) +
  
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.y = element_text(size = 11, face = 'bold'),
    axis.text.x = element_text(size = 11, angle =45, hjust = 1),
    axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
    
  )

The ECDF plot shows that 50% of customers make 15 purchases or less. Customers who make 25 purchases or less make up 88% of the customer population.

Preferred Purchase by Customers

purchase_df = main_df %>% select(starts_with("Num")) %>% select(ends_with('Purchases'))
Purchase_Name = data.frame(Purchase_Name=rep(c('Deal','Web','Catalog','Store'),each=2240))
Total_Purchases = data.frame(Total_Purchases = unlist(purchase_df),row.names=NULL)
purchase_df = data.frame(Purchase_Name,Total_Purchases)
head(purchase_df,3)
##   Purchase_Name Total_Purchases
## 1          Deal               3
## 2          Deal               2
## 3          Deal               1
purchase_df %>% 
  ggplot( 
       aes(x=Purchase_Name, 
           y=Total_Purchases
           )
       )+
  geom_boxplot(fill = "#004d80",
               alpha = 0.5, 
               color = '#0099ff', 
               outlier.colour = 'yellow', 
               # outlier.size = 1.5,
               # outlier.alpha = 0.5,
               # outlier.shape = 2
               ) +
  scale_y_continuous(n.breaks = 7, labels = scales::dollar_format() )+
  labs(title = "Total Purchases by Products", y="Total Purchases", x="Purchase Type") +
  theme(
    plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
    axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
    axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
  )

The catalog and the web have the outliers with purchases over $25, while median is below 5 dollars. For store purchase types median is $5.

purchase_df %>% 
  group_by( Purchase_Name) %>% 
  summarise( num= sum(Total_Purchases)) %>% 
  mutate(Percent = round(num*100/sum(num) ,2)) %>% 
  arrange( desc(Percent))
## # A tibble: 4 × 3
##   Purchase_Name   num Percent
##   <chr>         <int>   <dbl>
## 1 Store         12970    39.0
## 2 Web            9150    27.5
## 3 Catalog        5963    17.9
## 4 Deal           5208    15.6

The table shows that 38.96% for store purchase product type is the preferred method for customers, followed by the website at 27.48%.

Average Web Visits

main_df %>% 
  summarise(
    Avg_Web_Visits = mean(NumWebVisitsMonth),
    Avg_Store_Purchases = mean(NumStorePurchases),
    Avg_Catalog_Purchase = mean(NumCatalogPurchases)
  )
##   Avg_Web_Visits Avg_Store_Purchases Avg_Catalog_Purchase
## 1       5.316518            5.790179             2.662054

PDF on Web Visits

main_df %>% 
  ggplot(
    aes(x= NumWebVisitsMonth)
  ) +
  geom_histogram(bins = 6,
                 aes(y= stat(width*density)), 
                 color='black' , 
                 fill='gold', 
                 alpha= 0.5 ) +
  scale_x_continuous(limits=c(0,20), breaks=seq(0,20,by=2)) +
  scale_y_continuous(labels = scales::percent_format(accuracy=1), n.breaks = 15) +
  labs(
    title = "PDF: Number of Web Visits",
    x="Web Traffic",
    y="Probability"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
    axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
    axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
  )

Web traffic:

  • 46% of customers visit the website between 2 to 6 times
  • 36% of customer visit the website between 6 and 10 times
  • customers visiting the website more than 10 times is marginal

ECDF on Web Traffic

main_df %>% 
  ggplot(
    aes(x= NumWebVisitsMonth)
  )+
  stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
  labs(
    title = "ECDF: Number of Website Visits",
    y="Probability\n",
    x="Website Visits"
  ) +
  scale_y_continuous(breaks=seq(0,1.0,by=0.1))+
  scale_x_continuous(limits=c(0,20),breaks=seq(0,20,by=1)) +
  
  geom_hline( aes(yintercept =0.95, color='gold'), show.legend = FALSE) +
  geom_vline( aes(xintercept = 9, color='gold'), show.legend = FALSE )+
  geom_text(aes(x=8.7,  y=0.85,
                label = "95%",
                vjust = -1,
                colour = 'gold',
                nudge=1,
                size = 11
                ),
            show.legend = FALSE
            ) +
  
  theme(
    plot.title = element_text(hjust = 0.5, size = 13),
    axis.text.y = element_text(size = 11, face = 'bold'),
    axis.text.x = element_text(size = 11, angle =45, hjust = 1),
    axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
    
  )

The ECDF plot shows that 95% of customer visits are less than 9 times.

Percentage of Enrollment by Age

seniority = main_df %>% 
  select(Seniority) %>% 
  rename(Total_Years = Seniority) %>% 
  group_by(Total_Years) %>% 
  summarise(num = n() ) %>% 
  mutate(Seniority_Percentage = round(num*100/ sum(num), 2))

seniority['Total_Years'] = as_factor(seniority$Total_Years)

seniority %>% 
  ggplot( aes(x= Total_Years, y= num, fill = Total_Years)) +
  geom_col(stat = "identity", show.legend = FALSE) +
  scale_y_continuous(breaks = seq(0,1300, by=200)) +
  geom_text(aes(label=paste0(num ,' (',Seniority_Percentage,'%',')')),
            color='grey70',
            vjust=-.5,
            show.legend = FALSE
            )+
  labs(title = "Enrollment by Seniority", y= "Count", x="Number of Years")

Customers who have been enrolled for 11 years make up 53% of the customer population.

main_df %>% 
  group_by(Enrollment_Yr) %>% 
  summarise( Total_Spent = sum(Total_Spent)) %>% 
  mutate(Spent_Percent = round(Total_Spent*100/ sum(Total_Spent), 2 ))
## # A tibble: 3 × 3
##   Enrollment_Yr Total_Spent Spent_Percent
##           <dbl>       <int>         <dbl>
## 1          2012      368269          27.1
## 2          2013      715425          52.7
## 3          2014      273294          20.1

Out Of the customers who have seniority, those who have been around since 2013 make up 52% of the total amount spent by customers.

Correlation between Income and Purchases

cor.test(main_df$Income, main_df$Total_Purchases)
## 
##  Pearson's product-moment correlation
## 
## data:  main_df$Income and main_df$Total_Purchases
## t = 32.265, df = 2238, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5345074 0.5910728
## sample estimates:
##       cor 
## 0.5634501

With a p-value of 0.05, we can sufficiently state that there is a correlation between these two variables that is statistically significant.

Correlation Plot

main_df %>% 
  ggplot(
    aes(x= Income, y= Total_Purchases)
  )+
  geom_point(shape= 21, color='green', fill='black', size=3 ) +
  scale_x_continuous(limits=c(0,1.7e5),breaks=seq(0,1.7e5, by=10e3), labels = scales::dollar_format()) +
  scale_y_continuous( limits =  c(0,50), n.breaks = 10)+
  geom_smooth(method=lm, se=FALSE, color='gold',linetype='dashed') +
  labs(title = "Correlation Plot: Income and Total Purchases", 
       y="Total Purchases") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14),
    axis.text.y = element_text(size = 12, face = 'bold', color = 'gold'),
    axis.text.x = element_text(size = 12, angle =45, hjust = 1, color = 'green'),
  )

There is a moderate positive correlation between the variables income and total purchases, with a coefficient of 0.56.

Percentage of Accepted Offers

Promotions

offers = main_df %>% 
  select(Total_Offers) %>% 
  group_by(Total_Offers) %>% 
  summarise(num = n() ) %>% 
  mutate(Offers_Percentage = round(num*100/sum(num), 2)) %>% 
  mutate(Total_Offers = as_factor(Total_Offers))


offers %>% 
  ggplot( aes(x= Total_Offers , y= num, fill= Total_Offers )) +
  geom_col( show.legend = FALSE) +
  labs(
    title = "Count of Total Offers",
    x='Accepted number of Offers ',y='Count')+
  geom_text(aes(label=paste0(num ,' (',offers$Offers_Percentage,'%',')')),
            vjust=-.5,
            show.legend = FALSE,
            color='grey70'
            ) +
  scale_y_continuous(n.breaks = 10)

79% of customers do not accept the promotion offer on the first attempt, with just 14.5% who do on the second attempt.

main_df %>% 
  group_by(Total_Offers) %>% 
  summarise(Median_Spend_on_Offers = median(Income)) %>% 
  arrange( desc( Median_Spend_on_Offers)) %>% 
  ggplot( aes(x= Total_Offers, y= Median_Spend_on_Offers, fill = Median_Spend_on_Offers)) +
  geom_col(show.legend = FALSE) + 
  scale_fill_paletteer_c(`"ggthemes::Green"`) +
  labs(
    title = "Median Spend on Offer Campaign",
    y="Median Spend",
    x="Offer Type"
  ) +
  scale_y_continuous(labels = scales::dollar_format(), breaks = seq(0,100e3,by=10e3)) +
  theme(
    plot.title = element_text(size=12, hjust = 0.5, face = 'bold'),
    axis.text.x = element_text(size = 11),
    axis.text.y = element_text(size = 11, colour = 'green')
  )

Despite the offer types of 2, 3 and 4, 138 customers are the ones spend $70,000 or more, but are marginal to overall customer population since they do not accept the first 2 campaign offers.

Statistical Testing

Query:

Is there any relationship among Education level, Age range, and Marital Status of customers when considering the amount of spend on products?

Hypothesis Testing

  • A hypothesis test is used to test whether or not some hypothesis about a population parameter is correct/ true
  • \(H_0\) Null Hypothesis is the claim that there is not a statistically significant difference
  • \(\alpha\) alpha is the value of probability of rejecting the null hypothesis (assuming it is true). This is also known as Type 1 Error (false-positive).
  • p-value is the “evidence” against the null hypothesis. The smaller p-value means the stronger evidence can be established to reject the null hypothesis
if p_value < 0.05:
  then reject the null_hypothesis
if p_value is >= 0.05:
  then fail to reject the null hypothesis 

Chi-Square Test of Independence

The chi-square test in R is used to determine whether there is a significant association between two categorical variables. It is a statistical test of independence, commonly used to analyze the relationship between two nominal variables.

This test assumes the following:

  • both variables are categorical
  • all observations are independent
  • individuals in the dataset are mutually exclusive
  • expected value of cells should be 5 or greater in at least 80% of cells
Null_Hypothesis:
  the amount of spend by customers is independent of their education and marital status

Alternative Hypothesis:
  there is an association between educational level and marital status of customers
single = main_df %>% filter(Marital_Status == 'Single') %>% group_by(Education) %>% 
summarize(single = mean(Total_Spent))
in_relationship = main_df %>% filter(Marital_Status == 'In Relationship') %>% group_by(Education) %>% 
summarize(in_relationship = mean(Total_Spent))

joined_df1 = data.frame(single, in_relationship)
rownames(joined_df1) = c('Basic','Graduation','Master','PhD')
joined_df1 = joined_df1 %>% select(single, in_relationship)
joined_df1
##              single in_relationship
## Basic       58.5000         95.5000
## Graduation 617.9608        620.9986
## Master     665.0209        523.9136
## PhD        640.4124        690.7379

Chi-Square Test

chisq_test(joined_df1)
## # A tibble: 1 × 6
##       n statistic         p    df method          p.signif
## * <dbl>     <dbl>     <dbl> <int> <chr>           <chr>   
## 1 3913.      26.9 0.0000062     3 Chi-square test ****

The p-value is 0.0000062 which is very below the alpha of 0.05, we can reject the null hypothesis at the 5% significance level.

There is strong evidence to support the claim that there is an association between educational levels and marital status of customer when considering their expenditures.

Cramer’s V

Cramer’s V is a non-parametric measure of association between two categorical variables. It ranges from 0 to 1, with 0 indicating no association and 1 indicating a perfect association.

While the chi-square test indicates whether there is a statistically significant association, Cramer’s V provides a measure of the strength of that association. Cramer’s V is independent of the sample size, making it a more comparable metric across different datasets.

  • small = 0.1 - 0.3
  • medium = 0.3-0.5
  • large= 0.5-1.0
statistic = 26.8933
df_rows = 2240
tablerow = 4
tablecolumns = 2
# Cramer's V formula
sqrt(statistic / (df_rows * (min(tablerow, tablecolumns) - 1)))
## [1] 0.1095716
# install.packages('lsr')
library(lsr)
lsr::cramersV(joined_df1)
## [1] 0.08290193

Both calculations indicate that there is a weak or small association between the variables.

Hypothesis Test

  • Null Hypothesis: the amount spend by customers is independent of their education and age range
  • Alternative Hypothesis: there is an association between educational levels and age range when it comes to customer spending amounts
single = main_df %>% filter(Marital_Status == 'Single') %>% group_by(Age_Range) %>% 
summarize(single = mean(Total_Spent)) 
in_relationship = main_df %>% filter(Marital_Status == 'In Relationship') %>% group_by(Age_Range) %>% 
summarize(in_relationship = mean(Total_Spent))

joined_df2 = data.frame(single, in_relationship)
rownames(joined_df2) = c('Middle Aged','Old','Youth')
joined_df2 = joined_df2 %>% select(single, in_relationship)
joined_df2
##               single in_relationship
## Middle Aged 586.7159        557.9612
## Old         716.3351        731.1752
## Youth       966.0000        630.0000

Chi-Square Test

chisq_test(joined_df2)
## # A tibble: 1 × 6
##       n statistic        p    df method          p.signif
## * <dbl>     <dbl>    <dbl> <int> <chr>           <chr>   
## 1 4188.      42.7 5.41e-10     2 Chi-square test ****

The p-value is below alpha, therefore we can reject the null hypothesis. There is sufficient evidence to support the claim that there is an association between age range and marital status of customers while considering the expenses.

Cramer’s V

lsr::cramersV(joined_df2)
## [1] 0.1009419
statistic = 42.67454
df_rows = 2240
tablerow = 4
tablecolumns = 2
# Cramer's V formula
sqrt(statistic / (df_rows * (min(tablerow, tablecolumns) - 1)))
## [1] 0.1380258

Both values indicate that there is weak relationship between marital status and age range of customers.

ANOVA Hypothesis Test

Query: Are there any differences between the amount spent when considering various education levels of customers?

A one-way ANOVA (“analysis of variance”) compares the means of three or more independent groups to determine if there is a statistically significant difference between the corresponding population means. This test simply indicates if the means are statistically significant different from each other, but not where. After conducting ANOVA test, Tukey Test is used to interpret the results and determine where the statistical significance is coming from. A more robust test is the Welch’s analysis of variance because it can handle unequal variances, and is used with Games-Howell comparison method.

ANOVA assumptions:

Normality - By Central Limit Theorem, sample means of large samples are often well-approximated by a normal distribution even if the data are not normally distributed. It is therefore not required to test the normality assumption when the number of observations in each group/sample is large (usually n >= 30)

Independence – As we can see that, each group is independent of each other

Equal Variance – Now we need to check whether the variances for each group are equal. To test equality of variance we can use Levene’s Test.

Levene’s Test

levene_test(data = main_df,formula =  Total_Spent ~ Education)
## # A tibble: 1 × 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1     3  2236      23.6 5.07e-15

The p-value is below the alpha, we can reject the null hypothesis and claim that spending among customers is not equal. Welch’s one-way test can be used since variances are not equal across educational levels.

Welch’s Test

oneway.test(data = main_df, formula = Total_Spent ~ Education, var.equal = FALSE)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Total_Spent and Education
## F = 216.09, num df = 3, denom df = 471, p-value < 2.2e-16

The p-value is less than alpha, so reject the null hypothesis and claim that the amount customers spend varies based upon their education level. Education level alone differentiates customers spending.

Games-Howell Post Hoc Test

rstatix::games_howell_test(data = main_df, formula = Total_Spent ~ Education, conf.level = 0.95)
## # A tibble: 6 × 8
##   .y.         group1    group2 estimate conf.low conf.high    p.adj p.adj.signif
## * <chr>       <chr>     <chr>     <dbl>    <dbl>     <dbl>    <dbl> <chr>       
## 1 Total_Spent Basic     Gradu…    538.    475.       602.  7.88e-15 ****        
## 2 Total_Spent Basic     Master    489.    412.       567.  0        ****        
## 3 Total_Spent Basic     PhD       591.    507.       675.  0        ****        
## 4 Total_Spent Graduati… Master    -48.9  -128.        30.1 3.82e- 1 ns          
## 5 Total_Spent Graduati… PhD        52.5   -32.9      138.  3.89e- 1 ns          
## 6 Total_Spent Master    PhD       101.      5.00     198.  3.5 e- 2 *

Statistically significant differences between education levels and spending, when comparing Basic education level with graduation, masters and PhD levels.

ANOVA Hypothesis Test

Query: Are there any differences between the amount spent when considering marital status of customers?

Two Sample t-test

A two-sample t-test is a statistical test used to determine if there is a significant difference between the means of two independent groups. This test tests the null hypothesis that the two population means are equal. This test assumes that the two groups are independent of each other.

For valid results, the assumptions are required to me satisfied:

Normality - By Central Limit Theorem, sample means of large samples are often well-approximated by a normal distribution even if the data are not normally distributed. It is therefore not required to test the normality assumption when the number of observations in each group/sample is large (usually n >= 30)

Independence – As we can see that, each group is independent of each other

Equal Variance – Now we need to check whether the variances for each group are equal. To test equality of variance we can use Levene’s Test.

Levene’s Test

print( levene_test(data = main_df, formula =  Total_Spent ~ Marital_Status) )
## # A tibble: 1 × 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     1  2238      1.18 0.278
tibble( car::leveneTest(data= main_df, Total_Spent ~ Marital_Status) )
## # A tibble: 2 × 3
##      Df `F value` `Pr(>F)`
##   <int>     <dbl>    <dbl>
## 1     1      1.18    0.278
## 2  2238     NA      NA

The p-value is greater than the alpha (0.05) so we do not reject the null hypothesis, meaning that All customers spend equally irrespective of their marital status. Variances are equal across marital status of customers, so two sample t-test can be used.

2 Sample t-test

rstatix::t_test(data = main_df, formula = Total_Spent ~ Marital_Status, var.equal = TRUE)
## # A tibble: 1 × 8
##   .y.         group1          group2    n1    n2 statistic    df     p
## * <chr>       <chr>           <chr>  <int> <int>     <dbl> <dbl> <dbl>
## 1 Total_Spent In Relationship Single  1444   796    -0.840  2238 0.401
t.test( formula = Total_Spent ~ Marital_Status, data= main_df, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Total_Spent by Marital_Status
## t = -0.83958, df = 2238, p-value = 0.4012
## alternative hypothesis: true difference in means between group In Relationship and group Single is not equal to 0
## 95 percent confidence interval:
##  -74.46289  29.81730
## sample estimates:
## mean in group In Relationship          mean in group Single 
##                      597.8657                      620.1884

The p-value is greater than alpha, fail to reject the null hypothesis, meaning that all customers spend equally irrespective of their marital status. Marital status alone does not contribute to spending.

Hypothesis Test

Query: Are there any differences between the amount spent when considering age group of customers?

Normality - Since the sample size for each group is large enough (n >= 30), there is no need to check the Normality of each group

Independence – As we can see that, each group is independent of each other

Equal Variance – Now we need to check whether the variances for each group are equal. To test equality of variance we can use Levene’s Test.

Levene’s Test

print( levene_test(data = main_df, formula =  Total_Spent ~ Age_Range) )
## # A tibble: 1 × 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     2  2237     0.733 0.480
tibble( car::leveneTest(data= main_df, Total_Spent ~ Age_Range) )
## # A tibble: 2 × 3
##      Df `F value` `Pr(>F)`
##   <int>     <dbl>    <dbl>
## 1     2     0.733    0.480
## 2  2237    NA       NA

The p-value is greater than alpha, so we fail to reject the null hypothesis. All customers spend equally regardless of their age group. Age range is not a factor in spending alone, variance is equal across the age group of customers.

ANOVA 1-way

anova_model = aov(formula = Total_Spent ~ Age_Range, data = main_df )
summary.aov( anova_model)
##               Df    Sum Sq Mean Sq F value   Pr(>F)    
## Age_Range      2  10418034 5209017   14.54 5.35e-07 ***
## Residuals   2237 801676681  358371                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(data = main_df, formula = Total_Spent ~ Age_Range, type = 1)
## ANOVA Table (type I tests)
## 
##      Effect DFn  DFd      F        p p<.05   ges
## 1 Age_Range   2 2237 14.535 5.35e-07     * 0.013

The p-value is less than alpha, so we reject the null hypothesis and claim that some customers have a high spending compared to others based on age group. Age is a factor alone on spending.

Tukey Post-Hoc test

Tukey’s Honestly Significant Difference (HSD) test is a statistical test used to compare all possible pairs of means after an ANOVA test. It’s often used as a post-hoc analysis to determine which specific groups differ significantly from each other

The HSD value is a critical value that determines the minimum difference between two means required to be considered statistically significant. It’s calculated based on the sample size, number of groups, and the desired significance level (usually 0.05).

rstatix::tukey_hsd(anova_model)
## # A tibble: 3 × 9
##   term      group1 group2      null.value estimate conf.low conf.high      p.adj
## * <chr>     <chr>  <chr>            <dbl>    <dbl>    <dbl>     <dbl>      <dbl>
## 1 Age_Range Adult  Senior               0     158.     87.4      228.    4.79e-7
## 2 Age_Range Adult  Young Adult          0     264.   -182.       709.    3.47e-1
## 3 Age_Range Senior Young Adult          0     106.   -342.       554.    8.45e-1
## # ℹ 1 more variable: p.adj.signif <chr>
TukeyHSD(x = anova_model, ordered = TRUE, conf.level = 0.95 )
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = Total_Spent ~ Age_Range, data = main_df)
## 
## $Age_Range
##                        diff        lwr      upr     p adj
## Senior-Adult       157.7213   87.36512 228.0775 0.0000005
## Young Adult-Adult  263.5217 -181.74883 708.7922 0.3473526
## Young Adult-Senior 105.8004 -342.43141 554.0322 0.8446522

The p-value is below the alpha, we reject the null hypothesis and claim that some customers have a high spending compared to others based on age group. The senior-adult age ranges have strongest statistical significance.

Hypothesis Test

Query: Are there any differences between the amount spent when considering the products sold?

A repeated measures ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more groups in which the same subjects show up in each group. Since the amount spend for all products is listed out for each customer, it is better to go with repeated measures anova to test the differences between each product statistically. However, before we perform a repeated measures ANOVA we must make sure the following assumptions are met:

Sphericity assumes that the variances of the differences between all pairs of conditions are equal. If this assumption is violated, it can inflate the Type I error rate, leading to false positive results.

Mauchly’s test is used to assess whether the sphericity assumption is met. It tests the null hypothesis that the variances of the differences between all pairs of conditions are equal.

products_df = products_df %>% 
  mutate(Customer_ID = rep(1:2240, times= 6))

head( products_df,3)
##   Product_Name Total_Spent Customer_ID
## 1         Wine         635           1
## 2         Wine          11           2
## 3         Wine         426           3

ANOVA Repeated Measures

A Repeated Measures ANOVA is a statistical technique used to compare the means of a dependent variable across different conditions or time points within the same group of subjects. It’s particularly useful when the same participants are measured multiple times under different conditions.

  • dependent variable is numerical
  • wid is variable name specifying the sample identifier within grouping variable
rep_measures_anova = rstatix::anova_test(data = products_df, dv = Total_Spent, wid = Customer_ID, within = Product_Name)

rep_measures_anova$ANOVA
##         Effect DFn   DFd        F p p<.05   ges
## 1 Product_Name   5 11195 1268.182 0     * 0.269
rep_measures_anova$`Mauchly's Test for Sphericity`
##         Effect        W p p<.05
## 1 Product_Name 0.000463 0     *
rep_measures_anova$`Sphericity Corrections`
##         Effect   GGe        DF[GG] p[GG] p[GG]<.05   HFe        DF[HF] p[HF]
## 1 Product_Name 0.332 1.66, 3718.85     0         * 0.332 1.66, 3721.27     0
##   p[HF]<.05
## 1         *

Since Mauchy’s Test p-value is less than alpha, we reject the null hypothesis, suggesting that the sphericity assumption is violated.

Friedman Test

The Friedman Test is a non-parametric alternative to the Repeated Measures ANOVA. It is used to determine whether or not there is a statistically significant difference between the means of three or more groups in which the same subjects show up in each group.

For a Friedman Test, the appropriate post-hoc test is the pairwise Wilcoxon rank sum test with a bonferroni correction. The Bonferroni correction adjusts probability (p) values because of the increased risk of a type I error when making multiple statistical tests.

The assumptions for the Friedman Test include:

  • Continuous : The variable that you care about (and want to see if it is different across the 3+ groups) must be continuous. Continuous means that the variable can take on any reasonable value

  • Random Sample : The data points for each group in your analysis must have come from a simple random sample

  • Enough Data : The sample size also depends on the expected size of the difference across groups.

friedman.test(y = products_df$Total_Spent, groups = products_df$Product_Name, blocks =  products_df$Customer_ID)
## 
##  Friedman rank sum test
## 
## data:  products_df$Total_Spent, products_df$Product_Name and products_df$Customer_ID
## Friedman chi-squared = 5967.7, df = 5, p-value < 2.2e-16

The p-value is less than alpha, so we reject the null hypothesis and claim that some products sell more than others, there is variance in spending among customers. Products is a factor alone in customer spending.

Pairwise Wilcox test

rstatix::pairwise_wilcox_test(data = products_df, formula = Total_Spent ~ Product_Name, p.adjust.method = 'bonf')
## # A tibble: 15 × 9
##    .y.      group1 group2    n1    n2 statistic         p     p.adj p.adj.signif
##  * <chr>    <chr>  <chr>  <int> <int>     <dbl>     <dbl>     <dbl> <chr>       
##  1 Total_S… Fish   Fruit   2240  2240  2763626  3.53e-  9 5.3 e-  8 ****        
##  2 Total_S… Fish   Gold    2240  2240  1997634  3.18e- 32 4.77e- 31 ****        
##  3 Total_S… Fish   Meat    2240  2240  1277322. 3.39e-178 5.08e-177 ****        
##  4 Total_S… Fish   Sweet   2240  2240  2761441  4.76e-  9 7.14e-  8 ****        
##  5 Total_S… Fish   Wine    2240  2240   995501  5.06e-268 7.59e-267 ****        
##  6 Total_S… Fruit  Gold    2240  2240  1698338. 2.41e- 78 3.61e- 77 ****        
##  7 Total_S… Fruit  Meat    2240  2240  1052869  2.92e-248 4.38e-247 ****        
##  8 Total_S… Fruit  Sweet   2240  2240  2509044  9.95e-  1 1   e+  0 ns          
##  9 Total_S… Fruit  Wine    2240  2240   835352. 0         0         ****        
## 10 Total_S… Gold   Meat    2240  2240  1624746  9.96e- 93 1.49e- 91 ****        
## 11 Total_S… Gold   Sweet   2240  2240  3308424. 2.55e- 76 3.82e- 75 ****        
## 12 Total_S… Gold   Wine    2240  2240  1251724. 1.88e-185 2.82e-184 ****        
## 13 Total_S… Meat   Sweet   2240  2240  3953848  1.29e-244 1.93e-243 ****        
## 14 Total_S… Meat   Wine    2240  2240  1982321  4.88e- 34 7.32e- 33 ****        
## 15 Total_S… Sweet  Wine    2240  2240   841643  0         0         ****
pairwise.wilcox.test(x = products_df$Total_Spent, g = products_df$Product_Name, p.adjust.method = 'bonf')
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  products_df$Total_Spent and products_df$Product_Name 
## 
##       Fish    Fruit   Gold    Meat    Sweet  
## Fruit 5.3e-08 -       -       -       -      
## Gold  < 2e-16 < 2e-16 -       -       -      
## Meat  < 2e-16 < 2e-16 < 2e-16 -       -      
## Sweet 7.1e-08 1       < 2e-16 < 2e-16 -      
## Wine  < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16
## 
## P value adjustment method: bonferroni

Based on the results, we can conclude that there are significant differences in the Total_Spent between various product categories. The specific differences can be inferred by looking at the p-values in the table.

The strong statistical significant differences in products are wine, meat, gold then fish. Sweet potatoes are not as strong in statistical significance.

Summary

Besides, 60% of customers earn between 20k & 60k dollars yearly. Thus, by making the campaigns that have offers that are feasible to accept offers for customers who earn below 50k dollars would increase the rate of accepting the offers.