The purpose of this project is to answer a series of general business questions using R programming language. This project has a dataset that provides many marketing data include general customer information, products sold by the company, the company’s sale channels, and the performance of a series of marketing campaigns that the company has launched.
The business questions that we are tasked to answer are separated into two main bodies, one is for exploring data analysis and the other one is for statistical analysis.
Business questions for exploring data analysis are below.
Business questions for statistical analysis are below.
if(!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, skimr, kableExtra, DT, lubridate, zoo,
gridExtra, qqplotr, ggrepel, car, caret, DescTools,
multcomp, olsrr, gghalves, ggsci, corrplot,
RColorBrewer)
theme = theme_bw() +
theme(plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(size = 10),
axis.title = element_text(size = 10),
legend.position = "none")
We set up the environment with packages and themes.
The dataset has customer information, sales channels, products sold by the company, and the acceptance data of marketing campaigns. It is a comprehensive dataset and contains classifiable information.
The dataset is very complete. All variables have a 100% complete_rate and 0% n_missing. Except for the variable income, its complete_rate is 98.9%. Generally, this is already high enough to accept and will just remove these NA rows. However, we will fill up by looking for mean (if normality is observed) or median (if outlier is observed) from relevant information. There are no white spaces in the five character variables, which have to clean for sure, by examining the whitespace column.
df.1 = df.0
df.1 = df.1 %>%
mutate(Education = as.factor(Education),
Marital_Status = as.factor(Marital_Status),
Country = as.factor(Country)) %>%
mutate(Dt_Customer = mdy(Dt_Customer)) %>%
mutate(Income = gsub(",", "", Income),
Income = gsub("\\$", "", Income),
Income = str_sub(Income, end = -4),
Income = as.numeric(Income))
Some variables have to transfer their types to the right one and to be more efficient and accurate for data analysis and computational learning. Following are a few conversions that we perform.
## $Education
## [1] "2n Cycle" "Basic" "Graduation" "Master" "PhD"
##
## $Marital_Status
## [1] "Absurd" "Alone" "Divorced" "Married" "Single" "Together" "Widow"
## [8] "YOLO"
##
## $Country
## [1] "AUS" "CA" "GER" "IND" "ME" "SA" "SP" "US"
Sometimes, a missing value in a character column is filled up manually as “null” or “missing value”. Then, it will not be detected by R functions since it will be recognized as a part of the data in character type. So, we examine the presence of these words in factor variables. Based on the result, we do not have the tricker situation to deal with.
df.1[is.na(df.1$Income), ] %>%
group_by(Country, Education) %>%
summarise(Count = n()) %>%
datatable()
Income normally is related to other variables, such as education, age, marital status, and country. Due to the restriction of the two dimensional plot, we will rely on education and country to fill up the missing value for income. From the table, we know how many missing values by country and education are.
df.cp = df.1[complete.cases(df.1), ]
ggplot(data = df.cp,
aes(x = Education,
y = Income,
fill = Education)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F) +
geom_half_boxplot(side = "r",
alpha = 0.5,
outlier.shape = NA) +
facet_wrap(~Country,
scale = "free") +
scale_fill_d3(palette = c("category20"),
alpha = 0.5) +
scale_y_continuous(labels = function(x) paste0("$",
{x/1000}, "k")) +
coord_flip() +
theme +
theme(strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 12)) +
labs(title =
"Distribution of Customer's Income by Education by Country",
caption = "Figure.1")
We perform different methods for filling missing values in each group.
df.2 = df.1 %>%
group_by(Country, Education) %>%
mutate(Income = ifelse(Country == "AUS" & is.na(Income),
mean(Income, na.rm = T),
Income),
Income = ifelse(Country == "CA" & is.na(Income),
median(Income, na.rm = T),
Income),
Income = ifelse(Country == "GER" & is.na(Income),
mean(Income, na.rm = T),
Income),
Income = ifelse(Country == "IND" & is.na(Income),
mean(Income, na.rm = T),
Income),
Income = ifelse(Country == "SP" & is.na(Income),
median(Income, na.rm = T),
Income),
Income = ifelse(Country == "US" & is.na(Income),
mean(Income, na.rm = T),
Income),) %>%
ungroup()
df.2[!complete.cases(df.2), ]
## # A tibble: 0 x 28
## # ... with 28 variables: ID <dbl>, Year_Birth <dbl>, Education <fct>,
## # Marital_Status <fct>, Income <dbl>, Kidhome <dbl>, Teenhome <dbl>,
## # Dt_Customer <date>, Recency <dbl>, MntWines <dbl>, MntFruits <dbl>,
## # MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>,
## # MntGoldProds <dbl>, NumDealsPurchases <dbl>, NumWebPurchases <dbl>,
## # NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## # NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>,
## # AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>, AcceptedCmp2 <dbl>, Response <dbl>,
## # Complain <dbl>, Country <fct>
We check and there is no more NA in the dataset.
df.2 = df.2 %>%
mutate(age = year(Dt_Customer) - Year_Birth + 1,
Age_Group = case_when(
age < 1 ~ "baby",
age > 1 & age < 14 ~ "youth",
age > 15 & age < 24 ~ "young.adult",
age > 25 & age < 44 ~ "middle.adult",
age > 45 & age < 65 ~ "older.adult",
TRUE ~ "senior")) %>%
rename("Age_atRegister" = "age") %>%
mutate(Age_Group = factor(Age_Group,
levels = c("young.adult",
"middle.adult",
"older.adult",
"senior")))
We create age and age group variables in case these are required in later usage.
In summary of the data processing, we convert the variables into their right types. We extract income column and clean up dollar and comma signs and convert it into numerical type. Also, we fill up missing values with a consideration of respective country and education level. Finally, we create new columns for age and age groups.
We will explore customer information of the company by age, country, education, marital status, family size, and income.
temp.1 = df.2 %>%
group_by(Country) %>%
mutate(x.lab = paste0(Country,
"\n",
"(n=",
n(),
")"))
ggplot(data = temp.1,
aes(x = x.lab,
y = Age_atRegister,
fill = Country)) +
geom_boxplot(outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(aes(fill = Country),
shape = 21,
alpha = 0.8,
width = 0.05) +
geom_hline(yintercept = 30,
linetype = 2,
color = "#D43F3AFF") +
geom_hline(yintercept = 55,
linetype = 2,
color = "#D43F3AFF") +
scale_y_continuous(limits = c(0, 90),
breaks = seq(0, 90, 10)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.5) +
theme +
labs(title = "Distribution of Customer's Age by Country",
caption = "Figure.2",
x = "Country",
y = "Age at Registration")
The company has customers from eight different countries, and most of them are aged from 30 to 55 due to the body of boxplots representing 50% of the data. There is no difference in age among countries. The country “SP” has the most customers, and the country “ME” has the least customers.
temp.2 = df.2 %>%
group_by(Country) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 2))
ggplot(data = temp.2,
aes(x = "",
y = count,
fill = Country)) +
geom_col(position = "fill") +
coord_polar("y",
start = 10) +
geom_text(aes(label = paste0(per, "%")),
position = position_fill(vjust = 0.5),
size = 4) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = "right") +
labs(title = "Distribution of Customer by Country",
caption = "Figure.3",
x = NULL,
y = NULL)
The pie chart shows that nearly half of the customers are from country “SP” which is Spain. The next one is from “SA” which is South Africa. Then, the third pool of customer is “CA” which is Canada.
temp.3.1 = df.2 %>%
group_by(Education) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 0))
p.1 = ggplot(data = temp.3.1,
aes(x = reorder(Education, -count),
y = count,
fill = Education)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(per, "%")),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 1200),
breaks = seq(0, 1200, 100)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_text(angle = 90,
vjust = 0.3,
hjust = 1)) +
labs(title = "Education",
x = NULL,
y = NULL)
temp.3.2 = df.2 %>%
group_by(Country, Education) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 0))
p.2 = ggplot(data = temp.3.2,
aes(x = "",
y = count,
fill = Education)) +
geom_col(position = "fill") +
coord_polar("y",
start = 80) +
geom_text(aes(label = paste0(per, "%")),
position = position_fill(vjust = 0.5),
size = 2.5) +
facet_wrap(~Country) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = "right",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 12)) +
labs(title = "Education by Country",
caption = "Figure.4",
x = NULL,
y = NULL)
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,1,2,2,2)))
Overall, half of customers have the graduation education level. Then, the Ph.D. and the master follow up, so does the rank sequence in each country.
temp.4.1 = df.2 %>%
group_by(Marital_Status) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 0))
p.1 = ggplot(data = temp.4.1,
aes(x = reorder(Marital_Status, -count),
y = count,
fill = Marital_Status)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(per, "%")),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 900),
breaks = seq(0, 900, 100)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_text(angle = 90,
vjust = 0.3,
hjust = 1)) +
labs(title = "Marital Status",
x = NULL,
y = NULL)
temp.4.2 = df.2 %>%
group_by(Country, Marital_Status) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 0))
p.2 = ggplot(data = temp.4.2,
aes(x = "",
y = count,
fill = Marital_Status)) +
geom_col(position = "fill") +
coord_polar("y",
start = 80) +
geom_text(aes(label = paste0(per, "%")),
position = position_fill(vjust = 0.5),
size = 2.5) +
facet_wrap(~Country) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = "right",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 12)) +
labs(title = "Marital Status by Country",
caption = "Figure.5",
x = NULL,
y = NULL,
fill = "Marital Status")
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,1,2,2,2)))
Nearly 40% of customers are married, staying together is the second, and then being single is the third.
temp.5.1 = df.2 %>%
mutate(Family_Size = paste0(Kidhome,
" Kid ",
Teenhome,
" Teen")) %>%
group_by(Family_Size) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 0))
p.1 = ggplot(data = temp.5.1,
aes(x = reorder(Family_Size, -count),
y = count,
fill = Family_Size)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(per, "%")),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 700),
breaks = seq(0, 700, 100)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_text(angle = 90,
vjust = 0.3,
hjust = 1)) +
labs(title = "Family Size",
x = NULL,
y = NULL)
temp.5.2 = df.2 %>%
mutate(Family_Size = paste0(Kidhome,
" Kid ",
Teenhome,
" Teen")) %>%
group_by(Country, Family_Size) %>%
summarise(count = n()) %>%
mutate(per = round(count/sum(count)*100, 0))
p.2 = ggplot(data = temp.5.2,
aes(x = "",
y = count,
fill = Family_Size)) +
geom_col(position = "fill") +
coord_polar("y",
start = 20) +
geom_text(aes(label = paste0(per, "%")),
position = position_fill(vjust = 0.5),
size = 2.5) +
facet_wrap(~Country) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = "right",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 12)) +
labs(title = "Family Size by Country",
caption = "Figure.6",
x = NULL,
y = NULL,
fill = "Family Size")
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,1,2,2,2)))
28% of customers have no children, and 72% of customers have at least one kid or one teen in the family, and most of them have only one child as a majority.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1730 35539 51550 52250 68290 666666
The mean income of customer is $52,250 in annual; the median is $51,550. Because the mean and the median is not far away from each other, we assume a normal distribution for the income of customer.
##
## One Sample t-test
##
## data: df.2$Income
## t = 98.759, df = 2239, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 51212.46 53287.48
## sample estimates:
## mean of x
## 52249.97
We conduct a one sample t-test and conclude the true mean is away from zero and falling between $51,212 and $53,287 in a 95% confidence interval.
p.1 = ggplot(data = df.2,
aes(y = Income)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F,
fill = "#D62728FF") +
geom_half_boxplot(side = "r",
alpha = 0.5,
fill = "#FF7F0EFF") +
scale_y_continuous(limits = c(0, 180000),
breaks = seq(0, 180000, 20000),
labels = function(x) paste0("$",
{x/1000}, "k")) +
theme +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(title = "Income",
y = NULL)
p.2 = ggplot(data = df.2,
aes(x = Age_Group,
y = Income,
fill = Education)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F,
fill = "#D62728FF") +
geom_half_boxplot(side = "r",
alpha = 0.5,
fill = "#FF7F0EFF") +
scale_y_continuous(limits = c(0, 180000),
breaks = seq(0, 180000, 20000),
labels = function(x) paste0("$",
{x/1000}, "k")) +
facet_wrap(~Education, nrow = 1) +
theme +
theme(axis.text.x = element_text(angle = 90,
vjust = 0.3,
hjust = 1),
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 12)) +
labs(title = "Income by Education by Age Group",
caption = "Figure.7",
x = NULL,
y = NULL)
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,2,2,2,2)))
An outlier at $666,666 has been removed. 50% of customers have the income to be between $35k to $70k annually. Income among education levels is not obviously different from each other except the customers from basic, which have the lowest income.
We take the columns of mnt_products, which are the amount of dollars spent on product in the last two years. Take mntwines as an example, this records the amount spent on wine in the last two years.
temp.6 = df.2 %>%
dplyr::select(MntWines,
MntFruits,
MntMeatProducts,
MntFishProducts,
MntSweetProducts,
MntGoldProds) %>%
rename("Wine" = "MntWines",
"Fruit" = "MntFruits",
"Meat" = "MntMeatProducts",
"Fish" = "MntFishProducts",
"Sweet" = "MntSweetProducts",
"Gold" = "MntGoldProds") %>%
pivot_longer(c(Wine, Fruit, Meat, Fish, Sweet, Gold),
names_to = "products",
values_to = "amount_spent") %>%
group_by(products) %>%
summarise(total_revenue = sum(amount_spent)) %>%
mutate(per = round(total_revenue/sum(total_revenue)*100, 2))
ggplot(data = temp.6,
aes(x = reorder(products, -total_revenue),
y = total_revenue,
fill = products)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0("$",
prettyNum(total_revenue,
big.mark = ","),
"\n",
"(",
per,
"%",
")")),
vjust = -0.3) +
scale_y_continuous(limits = c(0, 800000),
labels = function(x) paste0("$",
{x/1000}, "k")) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
labs(title = "Performance of Each Product Category",
caption = "Figure.8",
x = "Product Category",
y = "Total Revenue")
Products of the company include wine, meat, gold, fish, sweet, and fruit. Wine is the best selling product which contributes 50.15% to the total revenue.
We use acceptedcmp and response columns. As for acceptedcmp1, 1 if customer accepted the offer in the 1st campaign, 0 otherwise. As for response, 1 if customer accepted the offer in the last campaign, 0 otherwise.
temp.7 = df.2 %>%
dplyr::select(ID,
AcceptedCmp1,
AcceptedCmp2,
AcceptedCmp3,
AcceptedCmp4,
AcceptedCmp5,
Response) %>%
pivot_longer(c(AcceptedCmp1,
AcceptedCmp2,
AcceptedCmp3,
AcceptedCmp4,
AcceptedCmp5,
Response),
names_to = "Campaign",
values_to = "Reaction") %>%
group_by(Campaign) %>%
summarise(Rate = round(sum(Reaction)/n()*100, 2)) %>%
mutate(Campaign = fct_recode(Campaign,
"First" = "AcceptedCmp1",
"Second" = "AcceptedCmp2",
"Third" = "AcceptedCmp3",
"Fourth" = "AcceptedCmp4",
"Fifth" = "AcceptedCmp5",
"Last" = "Response"))
ggplot(data = temp.7,
aes(x = Campaign,
y = Rate,
fill = Campaign)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(Rate, "%")),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 20),
breaks = seq(0, 20, 5)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
labs(title = "Accentance Rate of Each Marketing Campaign",
caption = "Figure.9",
x = "Campaign",
y = "Acceptance Rate (%)")
There are six marketing campaigns, including five campaigns and the last campaign in the column of response. The last marketing campaign is the most successful with an acceptance rate of 14.91%. It is a value that is nearly doubling the acceptance rate of most of the other marketing campaigns launched.
We take the numwebpurchases, numcatalogpurchases, and numstorepurchases, which are the number of purchases made through the company’s website, catalog, and in-store. As for numdealspurchases and numwebvisitsmonth, those are not considered as sale channels to discuss in this section.
temp.8 = df.2 %>%
dplyr::select(NumWebPurchases,
NumCatalogPurchases,
NumStorePurchases) %>%
rename("Web" = "NumWebPurchases",
"Catalog" = "NumCatalogPurchases",
"In_Store" = "NumStorePurchases") %>%
pivot_longer(c(Web, Catalog, In_Store),
names_to = "channels",
values_to = "no_of_purchases") %>%
group_by(channels) %>%
summarise(total = sum(no_of_purchases)) %>%
mutate(per = round(total/sum(total)*100, 2))
p.1 = ggplot(data = temp.8,
aes(x = reorder(channels, -total),
y = total,
fill = channels)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(prettyNum(total,
big.mark = ","))),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 14000),
breaks = seq(0, 14000, 2000)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
labs(title = "Comparison of Sale Channels in Barchart & Piechart",
caption = NULL,
x = "Sale Channels",
y = "Total Number of Purchases")
p.2 = ggplot(data = temp.8,
aes(x = "",
y = total,
fill = channels)) +
geom_col(position = "fill") +
coord_polar("y",
start = 20) +
geom_text(aes(label = paste0(per, "%")),
position = position_fill(vjust = 0.5),
size = 4) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = "right",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 12)) +
labs(title = NULL,
caption = "Figure.10",
x = NULL,
y = NULL,
fill = "Sale Channels")
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,2)))
The catalog is the most underperforming sale channel and then followed by purchases made through the company’s website.
Based on different questions (comparison/factorization), we can use different statistical methods to test. Before the slash is the method we choose to use in the test. But, there is still another option which is the after the slash.
Comparison: ANOVA/Linear Regression
df.3 = df.2 %>%
dplyr::select(Country,
NumWebPurchases,
NumCatalogPurchases,
NumStorePurchases) %>%
mutate(total_purchases =
NumWebPurchases +
NumCatalogPurchases +
NumStorePurchases) %>%
dplyr::select(-NumWebPurchases,
-NumCatalogPurchases,
-NumStorePurchases)
mod.2 = aov(data = df.3,
total_purchases ~ Country)
We synthesize total_purchases by using numwebpurchases, numcatalogpurchases, and numstorepurchases. We are comparing total purchases per customer in different countries. So, we will use the ANOVA test to compare multiple (>= 2) means differences. Here, we use the one way ANOVA. There are two assumptions to be satisfied.
a = c("One way ANOVA test",
"Kruskal Wallis test",
"Mann Whitney test")
b = c("v",
"x",
"x")
c = c("v",
"v",
"x")
data.frame(a, b, c) %>%
rename("ANOVA" = "a",
"Normally distributed?" = "b",
"Equal variance?" = "c") %>%
kbl(align = "c",
caption = "v: Yes, x: No") %>%
kable_classic("hover")
ANOVA | Normally distributed? | Equal variance? |
---|---|---|
One way ANOVA test | v | v |
Kruskal Wallis test | x | v |
Mann Whitney test | x | x |
Normality Test with QQ Plot
mod.2$residuals %>%
as.data.frame() %>%
ggplot(aes(sample = .)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
theme +
labs(title = "QQ Plot",
subtitle =
"Shaded region = 95% confidence interval for Normality",
caption = "Figure.12",
x = "Theoretical Value",
y = "Sample Value")
The assumption of residuals normality is violated based on the QQ plot.
Homogeneity of Variance Test with Levene’s Test
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 7 1.1192 0.3479
## 2232
The result shows that the variance of total purchases in each country is similar due to p-value < 0.05. Based on our dataset is non-normally distributed and equal variance, we will use Kruskal Wallis test instead of the original ANOVA test.
##
## Kruskal-Wallis rank sum test
##
## data: df.3$total_purchases and df.3$Country
## Kruskal-Wallis chi-squared = 7.2796, df = 7, p-value = 0.4004
The Kruskal Wallis test instead is selected as an omnibus test. The result shows that there is no significant difference among countries in terms of total purchases per customer. So, we have not enough significant evidence to say that customers in a country can purchase more items than another country in this dataset.
temp.9.1 = df.3 %>%
group_by(Country) %>%
summarise(total = sum(total_purchases)) %>%
arrange(desc(total))
p.1 = ggplot(data = temp.9.1,
aes(x = reorder(Country, -total),
y = total,
fill = Country)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = paste0(prettyNum(total,
big.mark = ","))),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 16000),
breaks = seq(0, 16000, 5000),
labels = function(x) paste0({x/1000})) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x =
element_text(color = ifelse(temp.9.1$Country == "US",
"#D43F3AFF",
"grey30"))) +
labs(title = "Comparison of Total Purchases in Each Country",
caption = "",
x = "",
y = "Total Purchase (Count in k)")
temp.9.2 = df.3 %>%
group_by(Country) %>%
summarise(count = n(),
total = sum(total_purchases),
avg = round(total/count, 2)) %>%
mutate(x.label = paste0(Country,
"\n",
"(n=",
count,
")")) %>%
arrange(desc(total))
p.2 = ggplot(data = temp.9.2,
aes(x = reorder(x.label, -total),
y = avg,
fill = Country)) +
geom_col(position = "dodge",
width = 0.7) +
geom_text(aes(label = avg),
vjust = -0.5) +
scale_y_continuous(limits = c(0, 20),
breaks = seq(0, 20, 5)) +
coord_cartesian(ylim = c(10, 20)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(axis.text.x =
element_text(color = ifelse(temp.9.2$Country == "US",
"#D43F3AFF",
"grey30"))) +
labs(title = "Comparison of Purchases per Customer in Each Country",
caption = "Figure.13",
x = "Country",
y = "Average Purchase per Customer")
grid.arrange(p.1, p.2,
layout_matrix = cbind(c(1,2)))
Customers located in a country that has fewer customers from the company, such as “US”, are able to purchase similar or more items than a country that has many customers from the company, such as “SP”. In detail, US customers have better purchasing power, but not with statistical significance.
Comparison: ANOVA/T Test
The people who buy gold are more conservative and might have more in-store purchases. This is the hypothesis that we will justify or refute with an appropriate statistical test in this section.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 24.00 44.02 56.00 362.00
The average amount spent on gold is 44.02. Therefore, any customer who spends above this amount of money on gold will have more in-store purchases.
temp.10.1 = df.4 %>%
mutate(group =
case_when(
MntGoldProds > 44.02 ~ "Above_Avg",
TRUE ~ "Below_Avg"),
group = factor(group)) %>%
group_by(group) %>%
mutate(count = n(),
x.label = paste0(group,
"\n",
"(n=",
count,
")"))
p.1 = ggplot(data = temp.10.1,
aes(x = x.label,
y = NumStorePurchases)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F,
fill = "#D62728FF") +
geom_half_boxplot(side = "r",
alpha = 0.5,
fill = "#FF7F0EFF") +
stat_summary(fun = "mean",
shape = 4,
color = "black",
size = 0.5,
stroke = 2) +
theme +
labs(title =
"Number of Store Purchase\nby Categories of Gold Buyer",
caption = "",
x = "Gold Buyer",
y = "Number of Store Purchases (Count)")
set.seed(123)
temp.10.2 = temp.10.1 %>%
group_by(group) %>%
sample_n(500) %>%
mutate(count = n(),
x.label = paste0(group,
"\n",
"(n=",
count,
")"))
p.2 = ggplot(data = temp.10.2,
aes(x = x.label,
y = NumStorePurchases)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F,
fill = "#D62728FF") +
geom_half_boxplot(side = "r",
alpha = 0.5,
fill = "#FF7F0EFF") +
stat_summary(fun = "mean",
shape = 4,
color = "black",
size = 0.5,
stroke = 2) +
theme +
labs(title =
"Number of Store Purchase\nby Categories of Gold Buyer*",
caption = "Figure.14",
x = "Gold Buyer",
y = "Number of Store Purchases (Count)")
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,2)))
The plot shows that customers who spend above average amounts on gold purchase more items in the store than those who spend below average amounts on gold. We will proceed with this statement with a statistical test. Due to unequal count to cost an unfair comparison, we randomly select 500 samples for each group and plot the figure again. The interpretation of the graph remains the same as above.
We are comparing in-store purchase numbers in two categories of gold buyer. We will use the ANOVA test here.
Normality Test with Shapiro Wilk Test
## temp.10.2$group: Above_Avg
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.95891, p-value = 1.368e-10
##
## ------------------------------------------------------------
## temp.10.2$group: Below_Avg
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.84495, p-value < 2.2e-16
The test result shows that the residuals in the dataset are not normally distributed based on the p-value < 0.05. As long as one variable is not normally distributed, the result of this text is considered not normally distributed.
Homogeneity of Variance Test with Levene’s Test
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 10.537 0.001209 **
## 998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variance between the two gold buyer groups is significantly different based on Levene’s test. The variance between both groups is not equal. Since the dataset is non-normally distributed and non-equal variance, we will use Mann Whitney test instead of the original ANOVA test.
##
## Wilcoxon rank sum test with continuity correction
##
## data: temp.10.2$NumStorePurchases by temp.10.2$group
## W = 195535, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The result supports the statement that people who spend an above average amount on gold in the last two years would have more in-store purchases with a p-value < 0.05.
Comparison: ANOVA/T Test
Fish has omega-3 fatty acids which are good for the brain. So, do married Ph.D. candidates have a significant relation with the amount spent on fish?
df.5.1 = df.2 %>%
dplyr::select(Education,
Marital_Status,
MntFishProducts) %>%
mutate(group.1 = paste0(Education, Marital_Status)) %>%
mutate(group.2 =
case_when(
group.1 == "PhDMarried" ~ "Married_PhD_Candidates",
TRUE ~ "Others"
),
group.2 = as.factor(group.2))
df.5.1 %>%
group_by(group.2) %>%
summarise(count = n()) %>%
rename("Group" = "group.2",
"Count" = "count") %>%
kbl(align = "c",
caption = "Group Sample Size") %>%
kable_classic("hover")
Group | Count |
---|---|
Married_PhD_Candidates | 192 |
Others | 2048 |
set.seed(123)
df.5.2 = df.5.1 %>%
group_by(group.2) %>%
sample_n(192) %>%
dplyr::select(group.2, MntFishProducts)
In order to make the comparison fairer, we randomly select 192 samples from the 2048 samples of the “Others” group to make two groups with the same sample size.
temp.11 = df.5.2 %>%
group_by(group.2) %>%
mutate(count = n(),
x.label = paste0(group.2,
"\n",
"(n=",
count,
")"))
ggplot(data = temp.11,
aes(x = x.label,
y = MntFishProducts)) +
geom_half_violin(side = "l",
alpha = 0.5,
trim = F,
fill = "#D62728FF") +
geom_half_boxplot(side = "r",
alpha = 0.5,
fill = "#FF7F0EFF") +
stat_summary(fun = "mean",
shape = 4,
color = "black",
size = 0.5,
stroke = 2) +
theme +
labs(title = "Amount of Fish Purchases by Married Ph.D. Candidates",
caption = "Figure.15",
x = NULL,
y = "Amount Spent on Fish")
The plot shows that the mean and median of the group “Others” are slightly higher than the group of married Ph.D. candidates.
We are comparing the amount of fish buying in two categories of buyers. We will continually use the ANOVA test.
Normality Test with Shapiro Wilk Test
## temp.11$group.2: Married_PhD_Candidates
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.69269, p-value < 2.2e-16
##
## ------------------------------------------------------------
## temp.11$group.2: Others
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.70236, p-value < 2.2e-16
Shapiro Wilk test suggests that data between the two groups are not normally distributed.
Homogeneity of Variance Test with Levene’s Test
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.0238 0.08286 .
## 382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s test concludes that the spread of variance between the two groups of data are equal based on the p-value of greater than 0.05. Since the dataset is non-normally distributed and equal variance, we will use Kruskal Wallis test instead of the original ANOVA test.
##
## Kruskal-Wallis rank sum test
##
## data: temp.11$MntFishProducts and temp.11$group.2
## Kruskal-Wallis chi-squared = 11.215, df = 1, p-value = 0.0008113
The p-value of 0.0008113 concludes that married Ph.D. candidates spend significantly less than other customers on fish.
Comparison: Logistic Regression/ANOVA
df.7.1 = df.2 %>%
dplyr::select(Country,
AcceptedCmp1,
AcceptedCmp2,
AcceptedCmp3,
AcceptedCmp4,
AcceptedCmp5,
Response)
df.7.2 = df.7.1 %>%
rename("Camp1" = AcceptedCmp1,
"Camp2" = AcceptedCmp2,
"Camp3" = AcceptedCmp3,
"Camp4" = AcceptedCmp4,
"Camp5" = AcceptedCmp5,
"LastCamp" = Response)
temp.12 = df.7.2 %>%
pivot_longer(c(Camp1,
Camp2,
Camp3,
Camp4,
Camp5,
LastCamp),
names_to = "Cmp",
values_to = "Result") %>%
mutate(Cmp = as.factor(Cmp)) %>%
group_by(Country, Cmp) %>%
summarise(count = n(),
rate = round(sum(Result)/count*100, 2)) %>%
mutate(x.label = paste0(Country,
"\n",
"(n=",
count,
")"))
p.1 = ggplot(data = temp.12 %>% filter(Country == "ME"),
aes(x = x.label,
y = rate,
fill = Cmp)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = function(x) {paste0(x, "%")},
limits = c(0, 80)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
labs(title = "Anomaly",
caption = "",
x = NULL,
y = "Acceptance Rate")
p.2 = ggplot(data = temp.12 %>% filter(Country != "ME"),
aes(x = x.label,
y = rate,
fill = Cmp)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = function(x) {paste0(x, "%")},
limits = c(0, 20)) +
scale_fill_d3(palette = c("category20"),
alpha = 0.7) +
theme +
theme(legend.position = "right") +
labs(title = "Acceptance Rate of Marketing Campaigns by Country",
caption = "Figure.16",
x = NULL,
y = NULL,
fill = "Campaign")
grid.arrange(p.1, p.2,
layout_matrix = rbind(c(1,2,2,2,2)))
Most campaigns across countries have a low acceptance rate and mostly less than 15%. Except for the country “ME”, other countries have a minimum of 100 customers.
We will use logistic regression which is performed using GLM binomial family on the effect of countries to six respective marketing campaigns to test the statement in this section.
df.7.3 = df.7.2 %>%
mutate_if(is.double, as.factor)
mod.6.1 = glm(data = df.7.3,
Camp1 ~ Country,
family = "binomial")
mod.6.1.sum = summary(mod.6.1)
mod.6.1.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
rename("P-value" = `Pr(>|z|)`) %>%
filter(`P-value` < 0.05) %>%
rownames_to_column(var = "Variable") %>%
filter(Variable != "(Intercept)") %>%
kbl(align = "c",
caption = "Compare Campaign 1 Across Countries") %>%
kable_classic("hover")
Variable | Estimate | P-value |
---|---|---|
There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.
mod.6.2 = glm(data = df.7.3,
Camp2 ~ Country,
family = "binomial")
mod.6.2.sum = summary(mod.6.2)
mod.6.2.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
rename("P-value" = `Pr(>|z|)`) %>%
filter(`P-value` < 0.05) %>%
rownames_to_column(var = "Variable") %>%
filter(Variable != "(Intercept)") %>%
kbl(align = "c",
caption = "Compare Campaign 2 Across Countries") %>%
kable_classic("hover")
Variable | Estimate | P-value |
---|---|---|
There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.
mod.6.3 = glm(data = df.7.3,
Camp3 ~ Country,
family = "binomial")
mod.6.3.sum = summary(mod.6.3)
mod.6.3.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
rename("P-value" = `Pr(>|z|)`) %>%
filter(`P-value` < 0.05) %>%
rownames_to_column(var = "Variable") %>%
filter(Variable != "(Intercept)") %>%
kbl(align = "c",
caption = "Compare Campaign 3 Across Countries") %>%
kable_classic("hover")
Variable | Estimate | P-value |
---|---|---|
There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.
mod.6.4 = glm(data = df.7.3,
Camp4 ~ Country,
family = "binomial")
mod.6.4.sum = summary(mod.6.4)
mod.6.4.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
rename("P-value" = `Pr(>|z|)`) %>%
filter(`P-value` < 0.05) %>%
rownames_to_column(var = "Variable") %>%
filter(Variable != "(Intercept)") %>%
kbl(align = "c",
caption = "Compare Campaign 4 Across Countries") %>%
kable_classic("hover")
Variable | Estimate | P-value |
---|---|---|
CountryCA | 0.9260787 | 0.0477882 |
Most countries have no relation to the success of marketing campaigns. Only Canada (CA) has a week significant p-value of 0.048 which is close to 0.05 to reject the null hypothesis. Therefore, we still conclude there is no effect of the country on campaign success.
mod.6.5 = glm(data = df.7.3,
Camp5 ~ Country,
family = "binomial")
mod.6.5.sum = summary(mod.6.5)
mod.6.5.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
rename("P-value" = `Pr(>|z|)`) %>%
filter(`P-value` < 0.05) %>%
rownames_to_column(var = "Variable") %>%
filter(Variable != "(Intercept)") %>%
kbl(align = "c",
caption = "Compare Campaign 5 Across Countries") %>%
kable_classic("hover")
Variable | Estimate | P-value |
---|---|---|
There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.
mod.6.6 = glm(data = df.7.3,
LastCamp ~ Country,
family = "binomial")
mod.6.6.sum = summary(mod.6.6)
mod.6.6.sum$coefficients %>%
as.data.frame() %>%
dplyr::select(Estimate, `Pr(>|z|)`) %>%
rename("P-value" = `Pr(>|z|)`) %>%
filter(`P-value` < 0.05) %>%
rownames_to_column(var = "Variable") %>%
filter(Variable != "(Intercept)") %>%
kbl(align = "c",
caption = "Compare Last Campaign Across Countries") %>%
kable_classic("hover")
Variable | Estimate | P-value |
---|---|---|
CountryME | 2.477634 | 0.0466367 |
Most countries have no relation to the success of marketing campaigns. Only Montenegro (ME) has a week significant p-value of 0.047 which is close to 0.05 to reject the null hypothesis. Also, there are only three samples in this country. The sample size is too small to generate an accurate result. Therefore, we still conclude there is no effect of the country on campaign success.
There are four tasks for visual exploration and six tasks for statistical analysis.
Results from exploring data analysis reveal that most customers are in mature age between 30 to 55, half of the customers are from Spain, 72% of the customers have a least one children home, half of the customers have income between $35k to $70k annually, and wine is the best selling product that contributing over 50% to the total revenue. All marketing campaigns are not doing very well with acceptance rates of lower than 15%. In-store sales are the most performing channel, and catalog sales are the most underperforming channel.
Results from statistical analysis reveal in-store purchases are significantly affected by the number of deals in a positive way and by the number of kids at home in a negative the most. The purchasing power of customers from the US do not significantly different from other countries. Gold buyers have more in-store purchases. Married Ph.D. candidates spend significantly less than other customer groups on fish. Statistical analysis also concludes that there is no significant relationship between geographical regions and the success of a campaign.