Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.
Here we have a data of 2240 customers about their educational level, marital status, year of birth, amount spend on our products, purchase modes in last 2 years. Here, we will be discovering about the factors that affects amount spend by customers.
| Variable | Description |
|---|---|
| ID | Customer’s unique identifier |
| Year_Birth | Customer’s birth year |
| Education | Customer’s education level |
| Marital_Status | Customer’s marital status |
| Income | Customer’s yearly household income |
| Kidhome | Number of children in household |
| Teenhome | Number of teenagers in household |
| Dt_Customer | Date of customer’s enrollment with company |
| Recency | Number of days since customer’s last purchase |
| Complain | 1: customer complained in last 2 yrs, else: 0 |
| Variable | Description |
|---|---|
| MntWines | Amount spent on wine in last 2 years |
| MntFruits | Amount spent on fruits in last 2 years |
| MntMeatProducts | Amount spent on meat in last 2 years |
| MntFishProducts | Amount spent on fish in last 2 years |
| MntSweetProducts | Amount spent on sweets in last 2 years |
| MntGoldProds | Amount spent on gold in last 2 years |
Promotions
AcceptedCmp1 1 (1 to 5) if customer accepted the offer
in the 1st campaign, 0 otherwiseResponse 1 if customer accepted the offer in the last
campaign, 0 otherwise| Variable | Description |
|---|---|
| NumDealsPurchases | Number of purchases made with a discount |
| NumWebPurchases | Number of purchases made through the company’s web site |
| NumCatalogPurchases | Number of purchases made using a catalogue |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebVisitsMonth | Number of visits to company’s web site in the last month |
# install.packages('skimr')
library(skimr)
# install.packages('DataExplorer')
library(DataExplorer)
library(tidyverse)
# install.packages('car')
library(car)
# install.packages('rstatix')
library(rstatix)
library(paletteer)
main_df = read.table("marketing_campaign.csv", sep="\t", header=T)
head(main_df)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524 1957 Graduation Single 58138 0 0 04-09-2012
## 2 2174 1954 Graduation Single 46344 1 1 08-03-2014
## 3 4141 1965 Graduation Together 71613 0 0 21-08-2013
## 4 6182 1984 Graduation Together 26646 1 0 10-02-2014
## 5 5324 1981 PhD Married 58293 1 0 19-01-2014
## 6 7446 1967 Master Together 62513 0 1 09-09-2013
## Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 58 635 88 546 172 88
## 2 38 11 1 6 2 1
## 3 26 426 49 127 111 21
## 4 26 11 4 20 10 3
## 5 94 173 43 118 46 27
## 6 16 520 42 98 0 42
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 88 3 8 10
## 2 6 2 1 1
## 3 42 1 8 2
## 4 5 2 2 0
## 5 15 5 5 3
## 6 14 2 6 4
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 4 7 0 0 0
## 2 2 5 0 0 0
## 3 10 4 0 0 0
## 4 4 6 0 0 0
## 5 6 5 0 0 0
## 6 10 6 0 0 0
## AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1 0 0 0 3 11 1
## 2 0 0 0 3 11 0
## 3 0 0 0 3 11 0
## 4 0 0 0 3 11 0
## 5 0 0 0 3 11 0
## 6 0 0 0 3 11 0
Using {DataExplorer} library function to plot an
introduction to the dataset
DataExplorer::plot_intro(data = main_df, title ="DataFrame Intro", ggtheme = NULL )
The plot shows that missing observations make up 0.037%, and 98.9% are complete rows. Most of the data we have is continuous.
We can use {skimr} library to get quick summary
statistics of our data.
skimr::skim(main_df)
| Name | main_df |
| Number of rows | 2240 |
| Number of columns | 29 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 26 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Education | 0 | 1 | 3 | 10 | 0 | 5 | 0 |
| Marital_Status | 0 | 1 | 4 | 8 | 0 | 8 | 0 |
| Dt_Customer | 0 | 1 | 10 | 10 | 0 | 663 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 5592.16 | 3246.66 | 0 | 2828.25 | 5458.5 | 8427.75 | 11191 | ▇▇▇▇▇ |
| Year_Birth | 0 | 1.00 | 1968.81 | 11.98 | 1893 | 1959.00 | 1970.0 | 1977.00 | 1996 | ▁▁▂▇▅ |
| Income | 24 | 0.99 | 52247.25 | 25173.08 | 1730 | 35303.00 | 51381.5 | 68522.00 | 666666 | ▇▁▁▁▁ |
| Kidhome | 0 | 1.00 | 0.44 | 0.54 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▆▁▁ |
| Teenhome | 0 | 1.00 | 0.51 | 0.54 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▇▁▁ |
| Recency | 0 | 1.00 | 49.11 | 28.96 | 0 | 24.00 | 49.0 | 74.00 | 99 | ▇▇▇▇▇ |
| MntWines | 0 | 1.00 | 303.94 | 336.60 | 0 | 23.75 | 173.5 | 504.25 | 1493 | ▇▂▂▁▁ |
| MntFruits | 0 | 1.00 | 26.30 | 39.77 | 0 | 1.00 | 8.0 | 33.00 | 199 | ▇▁▁▁▁ |
| MntMeatProducts | 0 | 1.00 | 166.95 | 225.72 | 0 | 16.00 | 67.0 | 232.00 | 1725 | ▇▁▁▁▁ |
| MntFishProducts | 0 | 1.00 | 37.53 | 54.63 | 0 | 3.00 | 12.0 | 50.00 | 259 | ▇▁▁▁▁ |
| MntSweetProducts | 0 | 1.00 | 27.06 | 41.28 | 0 | 1.00 | 8.0 | 33.00 | 263 | ▇▁▁▁▁ |
| MntGoldProds | 0 | 1.00 | 44.02 | 52.17 | 0 | 9.00 | 24.0 | 56.00 | 362 | ▇▁▁▁▁ |
| NumDealsPurchases | 0 | 1.00 | 2.33 | 1.93 | 0 | 1.00 | 2.0 | 3.00 | 15 | ▇▂▁▁▁ |
| NumWebPurchases | 0 | 1.00 | 4.08 | 2.78 | 0 | 2.00 | 4.0 | 6.00 | 27 | ▇▃▁▁▁ |
| NumCatalogPurchases | 0 | 1.00 | 2.66 | 2.92 | 0 | 0.00 | 2.0 | 4.00 | 28 | ▇▂▁▁▁ |
| NumStorePurchases | 0 | 1.00 | 5.79 | 3.25 | 0 | 3.00 | 5.0 | 8.00 | 13 | ▂▇▂▃▂ |
| NumWebVisitsMonth | 0 | 1.00 | 5.32 | 2.43 | 0 | 3.00 | 6.0 | 7.00 | 20 | ▅▇▁▁▁ |
| AcceptedCmp3 | 0 | 1.00 | 0.07 | 0.26 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| AcceptedCmp4 | 0 | 1.00 | 0.07 | 0.26 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| AcceptedCmp5 | 0 | 1.00 | 0.07 | 0.26 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| AcceptedCmp1 | 0 | 1.00 | 0.06 | 0.25 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| AcceptedCmp2 | 0 | 1.00 | 0.01 | 0.11 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| Complain | 0 | 1.00 | 0.01 | 0.10 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| Z_CostContact | 0 | 1.00 | 3.00 | 0.00 | 3 | 3.00 | 3.0 | 3.00 | 3 | ▁▁▇▁▁ |
| Z_Revenue | 0 | 1.00 | 11.00 | 0.00 | 11 | 11.00 | 11.0 | 11.00 | 11 | ▁▁▇▁▁ |
| Response | 0 | 1.00 | 0.15 | 0.36 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
DataExplorer::plot_bar(data = main_df, ncol = 2, ggtheme = NULL)
The columns Z_CostContact and Z_Revenue are
single value columns and should be removed.
main_df = main_df %>%
subset(select = -c(Z_CostContact, Z_Revenue))
Let’s look at the Income variable and its distribution
ggplot(main_df, aes(x= Income)) +
geom_density(color='green', fill='grey30') +
scale_x_continuous(n.breaks = 10, labels = scales::dollar_format()) +
ggdark::dark_mode() +
labs(
title = "Income Distribution ",
x="Income"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.x = element_text(size = 11)
)
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
## Warning: Removed 24 rows containing non-finite outside the scale range
## (`stat_density()`).
Clearly the income distribution is largely less than $100,000 for this dataset.
We can fill in the missing value with a median value
main_df$Income[is.na(main_df$Income)] = median(main_df$Income, na.rm = TRUE)
ggplot(main_df, aes(x= Income)) +
geom_density(color='green', fill='grey30') +
scale_x_continuous(n.breaks = 10, labels = scales::dollar_format()) +
ggdark::dark_mode() +
labs(
title = "Income Distribution ",
x="Income"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.x = element_text(size = 11)
)
A slight change in the the density of the income distribution.
Next is to change the category for Master’s degree as labeled
2nd cycle.
main_df = main_df %>%
mutate(Education = replace(Education, Education =="2n Cycle", "Master"))
Group the marital status into a binary of ‘single’ and ‘in relationship’.
main_df = main_df %>%
mutate(Marital_Status = replace(Marital_Status, Marital_Status == 'Alone' | Marital_Status == 'Absurd' | Marital_Status =='YOLO' | Marital_Status =='Widow' | Marital_Status == 'Divorced','Single')) %>%
mutate(Marital_Status = replace(Marital_Status, Marital_Status =='Married' | Marital_Status =='Together','In Relationship'))
head(main_df$Marital_Status)
## [1] "Single" "Single" "In Relationship" "In Relationship"
## [5] "In Relationship" "In Relationship"
Need to calculate the current year, and add the column Age to replace Year_Birth
current_date = Sys.Date()
current_year = format(current_date, format="%Y")
current_year = as.integer(current_year)
main_df = main_df %>%
mutate(Age = 2024 - Year_Birth) %>%
select( - Year_Birth)
Needing to use dplyr::if_else() statement to capture the
various age ranges to know what the customers age cohorts will be.
main_df = main_df %>%
mutate(
Age_Range = if_else(Age <= 16, "Child",
if_else(17 <= Age & Age <= 18,"Youth",
if_else(18 < Age & Age <= 30,"Young Adult",
if_else(30 < Age & Age <=65,"Adult", "Senior"))))
)
main_df = main_df %>%
mutate(Total_Spent = MntWines + MntFruits + MntMeatProducts
+ MntFishProducts + MntSweetProducts + MntGoldProds)
main_df = main_df %>%
mutate(Total_Purchases = NumWebPurchases +
NumCatalogPurchases + NumStorePurchases + NumDealsPurchases)
main_df$Dt_Customer[1]
## [1] "04-09-2012"
# dates are messy, in this data there are formats with month-day-year and some with day-month-year
main_df = main_df %>%
mutate(
Enrollment = if_else(
lubridate::mdy(Dt_Customer, quiet = TRUE) %in% lubridate::ymd(),
lubridate::mdy(Dt_Customer),
lubridate::dmy(Dt_Customer)
)
) %>%
mutate(Enrollment_Yr = year(Enrollment),
Seniority = year(current_date) - Enrollment_Yr)
main_df = main_df %>%
mutate(Total_Offers = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 )
head( main_df, 2)
## ID Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency
## 1 5524 Graduation Single 58138 0 0 04-09-2012 58
## 2 2174 Graduation Single 46344 1 1 08-03-2014 38
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 635 88 546 172 88
## 2 11 1 6 2 1
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 88 3 8 10
## 2 6 2 1 1
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 4 7 0 0 0
## 2 2 5 0 0 0
## AcceptedCmp1 AcceptedCmp2 Complain Response Age Age_Range Total_Spent
## 1 0 0 0 1 67 Senior 1617
## 2 0 0 0 0 70 Senior 27
## Total_Purchases Enrollment Enrollment_Yr Seniority Total_Offers
## 1 25 2012-09-04 2012 12 0
## 2 6 2014-03-08 2014 10 0
age_range = main_df %>%
select(Age_Range) %>%
group_by(Age_Range) %>%
summarise( num = n() ) %>%
mutate( Age_Range_Percent = round(num*100 / sum(num), 2) ) %>%
mutate(Age_Range = as_factor(Age_Range))
ggplot(age_range,
aes(x=age_range$Age_Range ,
y=num,
fill=Age_Range
)
)+
geom_bar(stat='identity', show.legend = FALSE) +
scale_y_continuous(n.breaks = 10) +
labs(
title= "Age Range Percentages",
x='Age Group',
y='Count'
)+
geom_text(aes(label=paste0(num ,' (',age_range$Age_Range_Percent,'%',')')),
vjust= -0.5, color='grey90',
show.legend = FALSE,
size = 4
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
# plot.margin = margin(t = 2,r = 0,b = 0,l = 0,unit = 'cm')
)
Our demographics for this dataset that was categorized:
76% of data is adults who have purchasing power, along with under a quarter (23%) of the data population being seniors, while those under 30 but are adults is less than 1 percent (0.45%). The data did not find any person under the age of 18 to be labelled “child”.
Probability Density Function shows the likelihood of a continuous variable to take on a specific value within a range. Here we want to see the probability of customer spending amounts within the density range.
main_df %>%
ggplot(
aes(x= Total_Spent)
) +
geom_histogram(bins = 10, aes(y= stat(width*density)), color='black' , fill='gold', alpha= 0.5 ) +
scale_x_continuous(breaks = seq(0,2500, by= 250), labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::percent_format(accuracy=1), n.breaks = 10) +
labs(
title = "PDF: Total Spent Probability",
x="Total Spent",
y="Probability"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 12),
axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
The Probability Density Function plot shows:
Empirical cumulative distribution function (ECDF) is graphical representation of the cumulative probability distribution of a dataset. An ECDF is a step function that shows the proportion of data points in a sample that are less than or equal to a given value. It provides a way to visualize the distribution of a dataset without making assumptions about its underlying distribution.
main_df %>%
ggplot(
aes(x= Total_Spent)
)+
stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
labs(
title = "ECDF: Total Spend",
y="Probability\n"
)+
scale_y_continuous(breaks = seq(0, 1.0, by= 0.1)) +
scale_x_continuous(breaks = seq(0,2600, by= 200) , labels = scales::dollar_format() ) +
geom_hline( aes(yintercept = 0.55, color='gold'), show.legend = FALSE) +
geom_vline( aes(xintercept = 502, color='gold'), show.legend = FALSE )+
geom_text(aes(x=410, y=0.53,
label = "55%",
vjust = -1,
colour = 'gold',
nudge=1,
size = 11
),
show.legend = FALSE
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.y = element_text(size = 11, colour = 'grey70', face = 'bold'),
axis.text.x = element_text(size = 11, angle =45, hjust = 1)
)
The ECDF plot shows that 55% of the data population spend less than 500 dollars.
main_df %>%
ggplot(
aes(x= Income)
) +
geom_histogram(bins = 6, aes(y= stat(width*density)), color='grey0' , fill='gold', alpha= 0.5 ) +
labs(title = "PDF on Income", y="Probability" ) +
scale_x_continuous(limits=c(0,200000), breaks=seq(0,200000,by=20000), labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::percent_format(accuracy=1) )+
theme(
plot.title = element_text(hjust = 0.5, size = 12),
axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
The probability density function plot on income shows that 60% of the customer population in the data has income ranging from 20,000 to 60,000 dollars annually. Second group of customers are those who make 60,000 to 100k dollars annually, while customers who have income greater than 100k are marginal.
main_df %>%
ggplot(
aes(x= Income)
)+
stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
labs(
title = "ECDF: Income",
y="Probability\n"
)+
scale_y_continuous(breaks = seq(0, 1.0, by= 0.1)) +
scale_x_continuous(limits=c(0,200000),breaks=seq(0,200000,by=20000), labels = scales::dollar_format() )+
geom_hline( aes(yintercept = 0.95, color='gold'), show.legend = FALSE) +
geom_vline( aes(xintercept = 83e3, color='gold'), show.legend = FALSE )+
geom_text(aes(x=65e3, y=0.83,
label = "95%",
vjust = -1,
colour = 'gold',
nudge=1,
size = 11
),
show.legend = FALSE
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.y = element_text(size = 11, face = 'bold'),
axis.text.x = element_text(size = 11, angle =45, hjust = 1),
# axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
The ECDF plot shows that 95% of the customer population makes less than $83,000 annually. While 50% of the population makes $50,000 annually.
A correlation test between income and total spent variable
# 3 variables are required for this library function
main_df %>% rstatix::cor_mat(Income, Total_Spent, Age)
## # A tibble: 3 × 4
## rowname Income Total_Spent Age
## * <chr> <dbl> <dbl> <dbl>
## 1 Income 1 0.66 0.16
## 2 Total_Spent 0.66 1 0.11
## 3 Age 0.16 0.11 1
cor.test(main_df$Income, main_df$Total_Spent)
##
## Pearson's product-moment correlation
##
## data: main_df$Income and main_df$Total_Spent
## t = 42.098, df = 2238, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6410074 0.6872687
## sample estimates:
## cor
## 0.6647749
Since the p-value is
less than 0.05, we have sufficient evidence to state that there
is correlation between these two variables Income and Total Spent.
main_df %>%
ggplot(
aes(x= Income, y= Total_Spent)
)+
geom_point(shape= 21, color='green', fill='black', size=3 ) +
scale_x_continuous(limits=c(0,2e5),breaks=seq(0,2e5,by=20000), labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format() , limits = c(0,3500), n.breaks = 10)+
geom_smooth(method=lm, se=FALSE, color='gold',linetype='dashed') +
labs(title = "Correlation Plot: Income and Total Spent",
y="Total Spent") +
theme(
plot.title = element_text(hjust = 0.5, size = 14),
axis.text.y = element_text(size = 12, face = 'bold', color = 'gold'),
axis.text.x = element_text(size = 12, angle =45, hjust = 1, color = 'green'),
)
The correlation coefficient is 0.664 for income and total
spent variables. The plot shows that there is a moderate positive
correlation, meaning that as income rises, so does the total spent.
main_df %>%
ggplot( aes(x= Education, y= Total_Spent )) +
geom_boxplot(fill = "#004d80",alpha = 0.5, color = '#0099ff',outlier.colour = 'yellow') +
scale_y_continuous(n.breaks = 5, labels = scales::dollar_format() )+
# facet_wrap(~Education, scale= "free") +
labs(title = "Amount Spent based on Education Levels", y="Total Spent") +
theme(
plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
)
Customers with basic education have much lower spending amounts compared to other education levels, which looks to be around $500.
ggplot(main_df, aes(x=Marital_Status,y=Total_Spent))+
geom_boxplot(fill = "#004d80",alpha = 0.5, color = '#0099ff') +
scale_y_continuous(n.breaks = 5, labels = scales::dollar_format() )+
labs(title = "Amount Spent based on Marital Status", y="Total Spent") +
theme(
plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
)
Customer population have similar spending based on their marital status.
ggplot(main_df, aes(x=Age_Range, y=Total_Spent ))+
geom_boxplot(fill = "#004d80",alpha = 0.5, color = '#0099ff') +
scale_y_continuous(n.breaks = 5, labels = scales::dollar_format() )+
labs(title = "Amount Spent based on Age Range", y="Total Spent", x="Age Range") +
theme(
plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
)
Interesting that young adults are a small percentage of the customer population, but spend more than the other adults.
products_df = main_df %>% select(starts_with("Mnt"))
Product_Name = data.frame(Product_Name=rep(c('Wine','Fruit','Meat','Fish','Sweet','Gold'),each=2240))
Total_Spent = data.frame(Total_Spent = unlist(products_df),row.names=NULL)
products_df = data.frame(Product_Name,Total_Spent)
head(products_df,3)
## Product_Name Total_Spent
## 1 Wine 635
## 2 Wine 11
## 3 Wine 426
ggplot(products_df,
aes(x=Product_Name,
y=Total_Spent
)
)+
geom_boxplot(fill = "#004d80",
alpha = 0.5,
color = '#0099ff',
outlier.colour = 'yellow',
# outlier.size = 1.5,
# outlier.alpha = 0.5,
# outlier.shape = 2
) +
scale_y_continuous(n.breaks = 7, labels = scales::dollar_format() )+
labs(title = "Amounts Spent by Product", y="Total Spent", x="Products") +
theme(
plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
)
The boxplot shows that the meat and wine are where customers are spending the most amount of money.
main_df %>%
ggplot(
aes(x= Total_Purchases)
) +
geom_histogram(bins = 6, aes(y= stat(width*density)), color='grey0' , fill='gold', alpha= 0.5 ) +
labs(
title = "PDF on Purchases",
y="Probability" ,
x= "Number of Purchases"
) +
# scale_x_continuous(limits=c(0,200000), breaks=seq(0,200000,by=20000), labels = scales::dollar_format()) +
# scale_y_continuous(labels = scales::percent_format(accuracy=1) )+
#
scale_y_continuous(labels = scales::percent_format(accuracy=1)) +
scale_x_continuous(limits=c(0,50),breaks=seq(0,50,by=5)) +
theme(
plot.title = element_text(hjust = 0.5, size = 12),
axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
The PDF plot shows that 40% of customers purchase between 15 and 25 times, followed by customers who purchase items between 5 and 15 times (38%). The customers who purchase items between 25 and 35 times make up 9%.
main_df %>%
ggplot(
aes(x= Total_Purchases)
)+
stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
labs(
title = "ECDF: Total Purchases",
y="Probability\n",
x="Total Purchases"
) +
scale_y_continuous(breaks=seq(0,1.0,by=0.1))+
scale_x_continuous(n.breaks = 10)+
# geom_hline( aes(yintercept = 0.50, color='gold'), show.legend = FALSE) +
geom_vline( aes(xintercept = 15, color='gold'), show.legend = FALSE )+
geom_text(aes(x=12, y=0.48,
label = "50%",
vjust = -1,
colour = 'gold',
nudge=1,
size = 11
),
show.legend = FALSE
) +
geom_hline( aes(yintercept = 0.88, color='gold'), show.legend = FALSE) +
geom_vline( aes(xintercept = 25, color='gold'), show.legend = FALSE )+
geom_text(aes(x=20, y=0.78,
label = "88%",
vjust = -1,
colour = 'gold',
nudge=1,
size = 11
),
show.legend = FALSE
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.y = element_text(size = 11, face = 'bold'),
axis.text.x = element_text(size = 11, angle =45, hjust = 1),
axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
The ECDF plot shows that 50% of customers make 15 purchases or less. Customers who make 25 purchases or less make up 88% of the customer population.
purchase_df = main_df %>% select(starts_with("Num")) %>% select(ends_with('Purchases'))
Purchase_Name = data.frame(Purchase_Name=rep(c('Deal','Web','Catalog','Store'),each=2240))
Total_Purchases = data.frame(Total_Purchases = unlist(purchase_df),row.names=NULL)
purchase_df = data.frame(Purchase_Name,Total_Purchases)
head(purchase_df,3)
## Purchase_Name Total_Purchases
## 1 Deal 3
## 2 Deal 2
## 3 Deal 1
purchase_df %>%
ggplot(
aes(x=Purchase_Name,
y=Total_Purchases
)
)+
geom_boxplot(fill = "#004d80",
alpha = 0.5,
color = '#0099ff',
outlier.colour = 'yellow',
# outlier.size = 1.5,
# outlier.alpha = 0.5,
# outlier.shape = 2
) +
scale_y_continuous(n.breaks = 7, labels = scales::dollar_format() )+
labs(title = "Total Purchases by Products", y="Total Purchases", x="Purchase Type") +
theme(
plot.title = element_text(hjust = 0.5, size=13, face = 'bold'),
axis.text.x = element_text(size = 11, colour = '#0099ff', face = 'bold'),
axis.text.y = element_text(size = 11, colour = 'grey80', face = 'bold')
)
The catalog and the web have the outliers with purchases over $25, while median is below 5 dollars. For store purchase types median is $5.
purchase_df %>%
group_by( Purchase_Name) %>%
summarise( num= sum(Total_Purchases)) %>%
mutate(Percent = round(num*100/sum(num) ,2)) %>%
arrange( desc(Percent))
## # A tibble: 4 × 3
## Purchase_Name num Percent
## <chr> <int> <dbl>
## 1 Store 12970 39.0
## 2 Web 9150 27.5
## 3 Catalog 5963 17.9
## 4 Deal 5208 15.6
The table shows that 38.96% for store purchase product type is the preferred method for customers, followed by the website at 27.48%.
main_df %>%
summarise(
Avg_Web_Visits = mean(NumWebVisitsMonth),
Avg_Store_Purchases = mean(NumStorePurchases),
Avg_Catalog_Purchase = mean(NumCatalogPurchases)
)
## Avg_Web_Visits Avg_Store_Purchases Avg_Catalog_Purchase
## 1 5.316518 5.790179 2.662054
main_df %>%
ggplot(
aes(x= NumWebVisitsMonth)
) +
geom_histogram(bins = 6,
aes(y= stat(width*density)),
color='black' ,
fill='gold',
alpha= 0.5 ) +
scale_x_continuous(limits=c(0,20), breaks=seq(0,20,by=2)) +
scale_y_continuous(labels = scales::percent_format(accuracy=1), n.breaks = 15) +
labs(
title = "PDF: Number of Web Visits",
x="Web Traffic",
y="Probability"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 12),
axis.text.x = element_text(size = 11, face = 'bold', angle = 45, hjust = 1 ),
axis.text.y = element_text(size = 12, face = 'bold', colour = 'gold'),
axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
Web traffic:
main_df %>%
ggplot(
aes(x= NumWebVisitsMonth)
)+
stat_ecdf(geom = "step", size=1, color='green', linetype='dotted') +
labs(
title = "ECDF: Number of Website Visits",
y="Probability\n",
x="Website Visits"
) +
scale_y_continuous(breaks=seq(0,1.0,by=0.1))+
scale_x_continuous(limits=c(0,20),breaks=seq(0,20,by=1)) +
geom_hline( aes(yintercept =0.95, color='gold'), show.legend = FALSE) +
geom_vline( aes(xintercept = 9, color='gold'), show.legend = FALSE )+
geom_text(aes(x=8.7, y=0.85,
label = "95%",
vjust = -1,
colour = 'gold',
nudge=1,
size = 11
),
show.legend = FALSE
) +
theme(
plot.title = element_text(hjust = 0.5, size = 13),
axis.text.y = element_text(size = 11, face = 'bold'),
axis.text.x = element_text(size = 11, angle =45, hjust = 1),
axis.title.y = element_text(size = 12, face = 'italic', colour = 'gold'),
)
The ECDF plot shows that 95% of customer visits are less than 9 times.
seniority = main_df %>%
select(Seniority) %>%
rename(Total_Years = Seniority) %>%
group_by(Total_Years) %>%
summarise(num = n() ) %>%
mutate(Seniority_Percentage = round(num*100/ sum(num), 2))
seniority['Total_Years'] = as_factor(seniority$Total_Years)
seniority %>%
ggplot( aes(x= Total_Years, y= num, fill = Total_Years)) +
geom_col(stat = "identity", show.legend = FALSE) +
scale_y_continuous(breaks = seq(0,1300, by=200)) +
geom_text(aes(label=paste0(num ,' (',Seniority_Percentage,'%',')')),
color='grey70',
vjust=-.5,
show.legend = FALSE
)+
labs(title = "Enrollment by Seniority", y= "Count", x="Number of Years")
Customers who have been enrolled for 11 years make up 53% of the customer population.
main_df %>%
group_by(Enrollment_Yr) %>%
summarise( Total_Spent = sum(Total_Spent)) %>%
mutate(Spent_Percent = round(Total_Spent*100/ sum(Total_Spent), 2 ))
## # A tibble: 3 × 3
## Enrollment_Yr Total_Spent Spent_Percent
## <dbl> <int> <dbl>
## 1 2012 368269 27.1
## 2 2013 715425 52.7
## 3 2014 273294 20.1
Out Of the customers who have seniority, those who have been around since 2013 make up 52% of the total amount spent by customers.
cor.test(main_df$Income, main_df$Total_Purchases)
##
## Pearson's product-moment correlation
##
## data: main_df$Income and main_df$Total_Purchases
## t = 32.265, df = 2238, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5345074 0.5910728
## sample estimates:
## cor
## 0.5634501
With a p-value of
0.05, we can sufficiently
state that there is a correlation between these two variables that is
statistically significant.
main_df %>%
ggplot(
aes(x= Income, y= Total_Purchases)
)+
geom_point(shape= 21, color='green', fill='black', size=3 ) +
scale_x_continuous(limits=c(0,1.7e5),breaks=seq(0,1.7e5, by=10e3), labels = scales::dollar_format()) +
scale_y_continuous( limits = c(0,50), n.breaks = 10)+
geom_smooth(method=lm, se=FALSE, color='gold',linetype='dashed') +
labs(title = "Correlation Plot: Income and Total Purchases",
y="Total Purchases") +
theme(
plot.title = element_text(hjust = 0.5, size = 14),
axis.text.y = element_text(size = 12, face = 'bold', color = 'gold'),
axis.text.x = element_text(size = 12, angle =45, hjust = 1, color = 'green'),
)
There is a moderate positive correlation between the variables income and total purchases, with a
coefficient of 0.56.
Promotions
AcceptedCmp1 1 (1 to 5) if customer accepted the offer
in the 1st campaign, 0 otherwiseResponse 1 if customer accepted the offer in the last
campaign, 0 otherwiseoffers = main_df %>%
select(Total_Offers) %>%
group_by(Total_Offers) %>%
summarise(num = n() ) %>%
mutate(Offers_Percentage = round(num*100/sum(num), 2)) %>%
mutate(Total_Offers = as_factor(Total_Offers))
offers %>%
ggplot( aes(x= Total_Offers , y= num, fill= Total_Offers )) +
geom_col( show.legend = FALSE) +
labs(
title = "Count of Total Offers",
x='Accepted number of Offers ',y='Count')+
geom_text(aes(label=paste0(num ,' (',offers$Offers_Percentage,'%',')')),
vjust=-.5,
show.legend = FALSE,
color='grey70'
) +
scale_y_continuous(n.breaks = 10)
79% of customers do not accept the promotion offer on the first attempt, with just 14.5% who do on the second attempt.
main_df %>%
group_by(Total_Offers) %>%
summarise(Median_Spend_on_Offers = median(Income)) %>%
arrange( desc( Median_Spend_on_Offers)) %>%
ggplot( aes(x= Total_Offers, y= Median_Spend_on_Offers, fill = Median_Spend_on_Offers)) +
geom_col(show.legend = FALSE) +
scale_fill_paletteer_c(`"ggthemes::Green"`) +
labs(
title = "Median Spend on Offer Campaign",
y="Median Spend",
x="Offer Type"
) +
scale_y_continuous(labels = scales::dollar_format(), breaks = seq(0,100e3,by=10e3)) +
theme(
plot.title = element_text(size=12, hjust = 0.5, face = 'bold'),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11, colour = 'green')
)
Despite the offer types of 2, 3 and 4, 138 customers are the ones spend $70,000 or more, but are marginal to overall customer population since they do not accept the first 2 campaign offers.
Query:
Is there any relationship among Education level, Age range, and Marital Status of customers when considering the amount of spend on products?
if p_value < 0.05:
then reject the null_hypothesis
if p_value is >= 0.05:
then fail to reject the null hypothesis
The chi-square test in R is used to determine whether there is a significant association between two categorical variables. It is a statistical test of independence, commonly used to analyze the relationship between two nominal variables.
This test assumes the following:
Null_Hypothesis:
the amount of spend by customers is independent of their education and marital status
Alternative Hypothesis:
there is an association between educational level and marital status of customers
single = main_df %>% filter(Marital_Status == 'Single') %>% group_by(Education) %>%
summarize(single = mean(Total_Spent))
in_relationship = main_df %>% filter(Marital_Status == 'In Relationship') %>% group_by(Education) %>%
summarize(in_relationship = mean(Total_Spent))
joined_df1 = data.frame(single, in_relationship)
rownames(joined_df1) = c('Basic','Graduation','Master','PhD')
joined_df1 = joined_df1 %>% select(single, in_relationship)
joined_df1
## single in_relationship
## Basic 58.5000 95.5000
## Graduation 617.9608 620.9986
## Master 665.0209 523.9136
## PhD 640.4124 690.7379
Chi-Square Test
chisq_test(joined_df1)
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 3913. 26.9 0.0000062 3 Chi-square test ****
The p-value is 0.0000062 which is very below the alpha of 0.05, we can reject the null hypothesis at the 5% significance level.
There is strong evidence to support the claim that there is an association between educational levels and marital status of customer when considering their expenditures.
Cramer’s V is a non-parametric measure of association between two categorical variables. It ranges from 0 to 1, with 0 indicating no association and 1 indicating a perfect association.
While the chi-square test indicates whether there is a statistically significant association, Cramer’s V provides a measure of the strength of that association. Cramer’s V is independent of the sample size, making it a more comparable metric across different datasets.
statistic = 26.8933
df_rows = 2240
tablerow = 4
tablecolumns = 2
# Cramer's V formula
sqrt(statistic / (df_rows * (min(tablerow, tablecolumns) - 1)))
## [1] 0.1095716
# install.packages('lsr')
library(lsr)
lsr::cramersV(joined_df1)
## [1] 0.08290193
Both calculations indicate that there is a weak or small association between the variables.
single = main_df %>% filter(Marital_Status == 'Single') %>% group_by(Age_Range) %>%
summarize(single = mean(Total_Spent))
in_relationship = main_df %>% filter(Marital_Status == 'In Relationship') %>% group_by(Age_Range) %>%
summarize(in_relationship = mean(Total_Spent))
joined_df2 = data.frame(single, in_relationship)
rownames(joined_df2) = c('Middle Aged','Old','Youth')
joined_df2 = joined_df2 %>% select(single, in_relationship)
joined_df2
## single in_relationship
## Middle Aged 586.7159 557.9612
## Old 716.3351 731.1752
## Youth 966.0000 630.0000
chisq_test(joined_df2)
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 4188. 42.7 5.41e-10 2 Chi-square test ****
The p-value is below alpha, therefore we can reject the null hypothesis. There is sufficient evidence to support the claim that there is an association between age range and marital status of customers while considering the expenses.
lsr::cramersV(joined_df2)
## [1] 0.1009419
statistic = 42.67454
df_rows = 2240
tablerow = 4
tablecolumns = 2
# Cramer's V formula
sqrt(statistic / (df_rows * (min(tablerow, tablecolumns) - 1)))
## [1] 0.1380258
Both values indicate that there is weak relationship between marital status and age range of customers.
Query: Are there any differences between the amount spent when considering various education levels of customers?
A one-way ANOVA (“analysis of variance”) compares the means of three or more independent groups to determine if there is a statistically significant difference between the corresponding population means. This test simply indicates if the means are statistically significant different from each other, but not where. After conducting ANOVA test, Tukey Test is used to interpret the results and determine where the statistical significance is coming from. A more robust test is the Welch’s analysis of variance because it can handle unequal variances, and is used with Games-Howell comparison method.
ANOVA assumptions:
Normality - By Central Limit Theorem, sample means of large samples are often well-approximated by a normal distribution even if the data are not normally distributed. It is therefore not required to test the normality assumption when the number of observations in each group/sample is large (usually n >= 30)
Independence – As we can see that, each group is independent of each other
Equal Variance – Now we need to check whether the variances for each group are equal. To test equality of variance we can use Levene’s Test.
levene_test(data = main_df,formula = Total_Spent ~ Education)
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 3 2236 23.6 5.07e-15
The p-value is below the alpha, we can reject the null hypothesis and claim that spending among customers is not equal. Welch’s one-way test can be used since variances are not equal across educational levels.
oneway.test(data = main_df, formula = Total_Spent ~ Education, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: Total_Spent and Education
## F = 216.09, num df = 3, denom df = 471, p-value < 2.2e-16
The p-value is less than alpha, so reject the null hypothesis and claim that the amount customers spend varies based upon their education level. Education level alone differentiates customers spending.
rstatix::games_howell_test(data = main_df, formula = Total_Spent ~ Education, conf.level = 0.95)
## # A tibble: 6 × 8
## .y. group1 group2 estimate conf.low conf.high p.adj p.adj.signif
## * <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Total_Spent Basic Gradu… 538. 475. 602. 7.88e-15 ****
## 2 Total_Spent Basic Master 489. 412. 567. 0 ****
## 3 Total_Spent Basic PhD 591. 507. 675. 0 ****
## 4 Total_Spent Graduati… Master -48.9 -128. 30.1 3.82e- 1 ns
## 5 Total_Spent Graduati… PhD 52.5 -32.9 138. 3.89e- 1 ns
## 6 Total_Spent Master PhD 101. 5.00 198. 3.5 e- 2 *
Statistically significant differences between education levels and spending, when comparing Basic education level with graduation, masters and PhD levels.
Query: Are there any differences between the amount spent when considering marital status of customers?
A two-sample t-test is a statistical test used to determine if there is a significant difference between the means of two independent groups. This test tests the null hypothesis that the two population means are equal. This test assumes that the two groups are independent of each other.
For valid results, the assumptions are required to me satisfied:
Normality - By Central Limit Theorem, sample means of large samples are often well-approximated by a normal distribution even if the data are not normally distributed. It is therefore not required to test the normality assumption when the number of observations in each group/sample is large (usually n >= 30)
Independence – As we can see that, each group is independent of each other
Equal Variance – Now we need to check whether the variances for each group are equal. To test equality of variance we can use Levene’s Test.
print( levene_test(data = main_df, formula = Total_Spent ~ Marital_Status) )
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 1 2238 1.18 0.278
tibble( car::leveneTest(data= main_df, Total_Spent ~ Marital_Status) )
## # A tibble: 2 × 3
## Df `F value` `Pr(>F)`
## <int> <dbl> <dbl>
## 1 1 1.18 0.278
## 2 2238 NA NA
The p-value is greater than the alpha (0.05) so we do not reject the null hypothesis, meaning that All customers spend equally irrespective of their marital status. Variances are equal across marital status of customers, so two sample t-test can be used.
rstatix::t_test(data = main_df, formula = Total_Spent ~ Marital_Status, var.equal = TRUE)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Total_Spent In Relationship Single 1444 796 -0.840 2238 0.401
t.test( formula = Total_Spent ~ Marital_Status, data= main_df, var.equal=TRUE)
##
## Two Sample t-test
##
## data: Total_Spent by Marital_Status
## t = -0.83958, df = 2238, p-value = 0.4012
## alternative hypothesis: true difference in means between group In Relationship and group Single is not equal to 0
## 95 percent confidence interval:
## -74.46289 29.81730
## sample estimates:
## mean in group In Relationship mean in group Single
## 597.8657 620.1884
The p-value is greater than alpha, fail to reject the null hypothesis, meaning that all customers spend equally irrespective of their marital status. Marital status alone does not contribute to spending.
Query: Are there any differences between the amount spent when considering age group of customers?
Normality - Since the sample size for each group is large enough (n >= 30), there is no need to check the Normality of each group
Independence – As we can see that, each group is independent of each other
Equal Variance – Now we need to check whether the variances for each group are equal. To test equality of variance we can use Levene’s Test.
print( levene_test(data = main_df, formula = Total_Spent ~ Age_Range) )
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 2 2237 0.733 0.480
tibble( car::leveneTest(data= main_df, Total_Spent ~ Age_Range) )
## # A tibble: 2 × 3
## Df `F value` `Pr(>F)`
## <int> <dbl> <dbl>
## 1 2 0.733 0.480
## 2 2237 NA NA
The p-value is greater than alpha, so we fail to reject the null hypothesis. All customers spend equally regardless of their age group. Age range is not a factor in spending alone, variance is equal across the age group of customers.
anova_model = aov(formula = Total_Spent ~ Age_Range, data = main_df )
summary.aov( anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Age_Range 2 10418034 5209017 14.54 5.35e-07 ***
## Residuals 2237 801676681 358371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(data = main_df, formula = Total_Spent ~ Age_Range, type = 1)
## ANOVA Table (type I tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Age_Range 2 2237 14.535 5.35e-07 * 0.013
The p-value is less than alpha, so we reject the null hypothesis and claim that some customers have a high spending compared to others based on age group. Age is a factor alone on spending.
Tukey’s Honestly Significant Difference (HSD) test is a statistical test used to compare all possible pairs of means after an ANOVA test. It’s often used as a post-hoc analysis to determine which specific groups differ significantly from each other
The HSD value is a critical value that determines the minimum difference between two means required to be considered statistically significant. It’s calculated based on the sample size, number of groups, and the desired significance level (usually 0.05).
rstatix::tukey_hsd(anova_model)
## # A tibble: 3 × 9
## term group1 group2 null.value estimate conf.low conf.high p.adj
## * <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Age_Range Adult Senior 0 158. 87.4 228. 4.79e-7
## 2 Age_Range Adult Young Adult 0 264. -182. 709. 3.47e-1
## 3 Age_Range Senior Young Adult 0 106. -342. 554. 8.45e-1
## # ℹ 1 more variable: p.adj.signif <chr>
TukeyHSD(x = anova_model, ordered = TRUE, conf.level = 0.95 )
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = Total_Spent ~ Age_Range, data = main_df)
##
## $Age_Range
## diff lwr upr p adj
## Senior-Adult 157.7213 87.36512 228.0775 0.0000005
## Young Adult-Adult 263.5217 -181.74883 708.7922 0.3473526
## Young Adult-Senior 105.8004 -342.43141 554.0322 0.8446522
The p-value is below the alpha, we reject the null hypothesis and claim that some customers have a high spending compared to others based on age group. The senior-adult age ranges have strongest statistical significance.
Query: Are there any differences between the amount spent when considering the products sold?
A repeated measures ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more groups in which the same subjects show up in each group. Since the amount spend for all products is listed out for each customer, it is better to go with repeated measures anova to test the differences between each product statistically. However, before we perform a repeated measures ANOVA we must make sure the following assumptions are met:
Independence : Each of the observations should be independent.
Normality : The distribution of the response variable is normally distributed.
Sphericity Corrections: The variances of the differences between all combinations of related groups must be equal.
Sphericity assumes that the variances of the differences between all pairs of conditions are equal. If this assumption is violated, it can inflate the Type I error rate, leading to false positive results.
Mauchly’s test is used to assess whether the sphericity assumption is met. It tests the null hypothesis that the variances of the differences between all pairs of conditions are equal.
products_df = products_df %>%
mutate(Customer_ID = rep(1:2240, times= 6))
head( products_df,3)
## Product_Name Total_Spent Customer_ID
## 1 Wine 635 1
## 2 Wine 11 2
## 3 Wine 426 3
A Repeated Measures ANOVA is a statistical technique used to compare the means of a dependent variable across different conditions or time points within the same group of subjects. It’s particularly useful when the same participants are measured multiple times under different conditions.
wid is variable name specifying the sample identifier
within grouping variablerep_measures_anova = rstatix::anova_test(data = products_df, dv = Total_Spent, wid = Customer_ID, within = Product_Name)
rep_measures_anova$ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Product_Name 5 11195 1268.182 0 * 0.269
rep_measures_anova$`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Product_Name 0.000463 0 *
rep_measures_anova$`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 Product_Name 0.332 1.66, 3718.85 0 * 0.332 1.66, 3721.27 0
## p[HF]<.05
## 1 *
Since Mauchy’s Test p-value is less than alpha, we reject the null hypothesis, suggesting that the sphericity assumption is violated.
The Friedman Test is a non-parametric alternative to the Repeated Measures ANOVA. It is used to determine whether or not there is a statistically significant difference between the means of three or more groups in which the same subjects show up in each group.
For a Friedman Test, the appropriate post-hoc test is the pairwise Wilcoxon rank sum test with a bonferroni correction. The Bonferroni correction adjusts probability (p) values because of the increased risk of a type I error when making multiple statistical tests.
The assumptions for the Friedman Test include:
Continuous : The variable that you care about (and want to see if it is different across the 3+ groups) must be continuous. Continuous means that the variable can take on any reasonable value
Random Sample : The data points for each group in your analysis must have come from a simple random sample
Enough Data : The sample size also depends on the expected size of the difference across groups.
friedman.test(y = products_df$Total_Spent, groups = products_df$Product_Name, blocks = products_df$Customer_ID)
##
## Friedman rank sum test
##
## data: products_df$Total_Spent, products_df$Product_Name and products_df$Customer_ID
## Friedman chi-squared = 5967.7, df = 5, p-value < 2.2e-16
The p-value is less than alpha, so we reject the null hypothesis and claim that some products sell more than others, there is variance in spending among customers. Products is a factor alone in customer spending.
rstatix::pairwise_wilcox_test(data = products_df, formula = Total_Spent ~ Product_Name, p.adjust.method = 'bonf')
## # A tibble: 15 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Total_S… Fish Fruit 2240 2240 2763626 3.53e- 9 5.3 e- 8 ****
## 2 Total_S… Fish Gold 2240 2240 1997634 3.18e- 32 4.77e- 31 ****
## 3 Total_S… Fish Meat 2240 2240 1277322. 3.39e-178 5.08e-177 ****
## 4 Total_S… Fish Sweet 2240 2240 2761441 4.76e- 9 7.14e- 8 ****
## 5 Total_S… Fish Wine 2240 2240 995501 5.06e-268 7.59e-267 ****
## 6 Total_S… Fruit Gold 2240 2240 1698338. 2.41e- 78 3.61e- 77 ****
## 7 Total_S… Fruit Meat 2240 2240 1052869 2.92e-248 4.38e-247 ****
## 8 Total_S… Fruit Sweet 2240 2240 2509044 9.95e- 1 1 e+ 0 ns
## 9 Total_S… Fruit Wine 2240 2240 835352. 0 0 ****
## 10 Total_S… Gold Meat 2240 2240 1624746 9.96e- 93 1.49e- 91 ****
## 11 Total_S… Gold Sweet 2240 2240 3308424. 2.55e- 76 3.82e- 75 ****
## 12 Total_S… Gold Wine 2240 2240 1251724. 1.88e-185 2.82e-184 ****
## 13 Total_S… Meat Sweet 2240 2240 3953848 1.29e-244 1.93e-243 ****
## 14 Total_S… Meat Wine 2240 2240 1982321 4.88e- 34 7.32e- 33 ****
## 15 Total_S… Sweet Wine 2240 2240 841643 0 0 ****
pairwise.wilcox.test(x = products_df$Total_Spent, g = products_df$Product_Name, p.adjust.method = 'bonf')
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: products_df$Total_Spent and products_df$Product_Name
##
## Fish Fruit Gold Meat Sweet
## Fruit 5.3e-08 - - - -
## Gold < 2e-16 < 2e-16 - - -
## Meat < 2e-16 < 2e-16 < 2e-16 - -
## Sweet 7.1e-08 1 < 2e-16 < 2e-16 -
## Wine < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
Based on the results, we can conclude that there are significant differences in the Total_Spent between various product categories. The specific differences can be inferred by looking at the p-values in the table.
The strong statistical significant differences in products are wine, meat, gold then fish. Sweet potatoes are not as strong in statistical significance.
Besides, 60% of customers earn between 20k & 60k dollars yearly. Thus, by making the campaigns that have offers that are feasible to accept offers for customers who earn below 50k dollars would increase the rate of accepting the offers.
Moderate positive correlation between the income level and the amount spend by customers. Similarly, there is a moderate positive correlation between the income level and the total number of purchases made by customers.
Customers with University education spend more than the customers with primary education
Marital Status, on its own, does not constitute any difference in the amount spend by customers
The products that most likely customers buy are: Wine, Meat, Fish, Gold. Customers spend equally on fruits and sweets
52% of total amount spend by customers who have enrolled since the year 2013
Customers prefer to purchase in stores and on the website, rather than the other purchase modes. 39% of customers prefer Stores and 27.5% of customers prefer online purchasing.
In last 2 years, 95% of customers have visited the company’s website less than 9 times which shows the need to improve the website