1 Objective

The purpose of this project is to answer a series of general business questions using R programming language. This project has a dataset that provides many marketing data include general customer information, products sold by the company, the company’s sale channels, and the performance of a series of marketing campaigns that the company has launched.

The business questions that we are tasked to answer are separated into two main bodies, one is for exploring data analysis and the other one is for statistical analysis.

Business questions for exploring data analysis are below.

  1. What does the average customer look like for this company?
  2. Which products are performing the best?
  3. Which marketing campaign is the most successful?
  4. Which channels are underperforming?

Business questions for statistical analysis are below.

  1. What are the factors that are significantly related to in-store purchases?
  2. Does US purchasing power significantly better than the rest of the world in terms of total purchases?
  3. Are gold lovers prefer to shop in-store?
  4. Do married Ph.D. candidates have a significant relation with the amount spend on fish?
  5. What other factors are significantly related to the amount spent on fish?
  6. Is there a significant relationship between geographical regions and the success of a campaign?

2 Preparation

Environment

if(!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, skimr, kableExtra, DT, lubridate, zoo,
               gridExtra, qqplotr, ggrepel, car, caret, DescTools,
               multcomp, olsrr, gghalves, ggsci, corrplot,
               RColorBrewer)

theme = theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 15),
        plot.subtitle = element_text(size = 10),
        axis.title = element_text(size = 10), 
        legend.position = "none")

We set up the environment with packages and themes.

Dataset

The dataset has customer information, sales channels, products sold by the company, and the acceptance data of marketing campaigns. It is a comprehensive dataset and contains classifiable information.

  • Customer information: id, age, education, marital status, income, number of kids and teens at home, date of customer’s enrollment with the company, recency complain made in the last two years, country of the customer, the number of purchases made during a discount, and customers’ frequency of visiting company’s website
  • Products of the company: wine, fruits, meat, fish, sweet, and gold
  • Sale channels: number of purchases made by the company website, catalog, and in-store
  • Customers’ responses to the company’s marketing campaigns. There are a total of six campaigns launched
df.0 = read_csv("DATA.csv")
# skim_without_charts(df.0)

The dataset is very complete. All variables have a 100% complete_rate and 0% n_missing. Except for the variable income, its complete_rate is 98.9%. Generally, this is already high enough to accept and will just remove these NA rows. However, we will fill up by looking for mean (if normality is observed) or median (if outlier is observed) from relevant information. There are no white spaces in the five character variables, which have to clean for sure, by examining the whitespace column.

Clean & Manipulation

df.1 = df.0
df.1 = df.1 %>%
  mutate(Education = as.factor(Education),
         Marital_Status = as.factor(Marital_Status),
         Country = as.factor(Country)) %>% 
  mutate(Dt_Customer = mdy(Dt_Customer)) %>% 
  mutate(Income = gsub(",", "", Income),
         Income = gsub("\\$", "", Income),
         Income = str_sub(Income, end = -4),
         Income = as.numeric(Income))

Some variables have to transfer their types to the right one and to be more efficient and accurate for data analysis and computational learning. Following are a few conversions that we perform.

  • Factor (fct) type: education, marital_status, country
  • Numerical (dbl) type: income
  • Date (date) type: dt_customer
check = df.1 %>% dplyr::select(is.factor, -ID)
sapply(check, levels)
## $Education
## [1] "2n Cycle"   "Basic"      "Graduation" "Master"     "PhD"       
## 
## $Marital_Status
## [1] "Absurd"   "Alone"    "Divorced" "Married"  "Single"   "Together" "Widow"   
## [8] "YOLO"    
## 
## $Country
## [1] "AUS" "CA"  "GER" "IND" "ME"  "SA"  "SP"  "US"

Sometimes, a missing value in a character column is filled up manually as “null” or “missing value”. Then, it will not be detected by R functions since it will be recognized as a part of the data in character type. So, we examine the presence of these words in factor variables. Based on the result, we do not have the tricker situation to deal with.

df.1[is.na(df.1$Income), ] %>%
  group_by(Country, Education) %>% 
  summarise(Count = n()) %>% 
  datatable()

Income normally is related to other variables, such as education, age, marital status, and country. Due to the restriction of the two dimensional plot, we will rely on education and country to fill up the missing value for income. From the table, we know how many missing values by country and education are.

df.cp = df.1[complete.cases(df.1), ]

ggplot(data = df.cp,
       aes(x = Education,
           y = Income,
           fill = Education)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F) +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    outlier.shape = NA) +
  facet_wrap(~Country,
             scale = "free") +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.5) +  
  scale_y_continuous(labels = function(x) paste0("$", 
                                                 {x/1000}, "k")) +
  coord_flip() +  
  theme +
  theme(strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12)) +
  labs(title = 
         "Distribution of Customer's Income by Education by Country",
       caption = "Figure.1")

We perform different methods for filling missing values in each group.

  • For AUS group, using mean due to distribution of normality
  • For CA group, using meidan due to presence of extreme outlier
  • For GER group, using mean due to distribution of normality
  • For IND group, using mean due to distribution of normality
  • For SP group, using meidan due to presence of extreme outlier
  • For US group, using mean due to distribution of normality
df.2 = df.1 %>%
  group_by(Country, Education) %>% 
  mutate(Income = ifelse(Country == "AUS" & is.na(Income),
                         mean(Income, na.rm = T),
                         Income),
         Income = ifelse(Country == "CA" & is.na(Income),
                         median(Income, na.rm = T),
                         Income),
         Income = ifelse(Country == "GER" & is.na(Income),
                         mean(Income, na.rm = T),
                         Income),
         Income = ifelse(Country == "IND" & is.na(Income),
                         mean(Income, na.rm = T),
                         Income),
         Income = ifelse(Country == "SP" & is.na(Income),
                         median(Income, na.rm = T),
                         Income),
         Income = ifelse(Country == "US" & is.na(Income),
                         mean(Income, na.rm = T),
                         Income),) %>% 
  ungroup()

df.2[!complete.cases(df.2), ]
## # A tibble: 0 x 28
## # ... with 28 variables: ID <dbl>, Year_Birth <dbl>, Education <fct>,
## #   Marital_Status <fct>, Income <dbl>, Kidhome <dbl>, Teenhome <dbl>,
## #   Dt_Customer <date>, Recency <dbl>, MntWines <dbl>, MntFruits <dbl>,
## #   MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>,
## #   MntGoldProds <dbl>, NumDealsPurchases <dbl>, NumWebPurchases <dbl>,
## #   NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## #   NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>,
## #   AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>, AcceptedCmp2 <dbl>, Response <dbl>,
## #   Complain <dbl>, Country <fct>

We check and there is no more NA in the dataset.

df.2 = df.2 %>%
  mutate(age = year(Dt_Customer) - Year_Birth + 1,
         Age_Group = case_when(
           age < 1 ~ "baby",
           age > 1 & age < 14 ~ "youth",
           age > 15 & age < 24 ~ "young.adult",
           age > 25 & age < 44 ~ "middle.adult",
           age > 45 & age < 65 ~ "older.adult",
           TRUE ~ "senior")) %>% 
  rename("Age_atRegister" = "age") %>% 
  mutate(Age_Group = factor(Age_Group,
                            levels = c("young.adult",
                                       "middle.adult",
                                       "older.adult",
                                       "senior")))

We create age and age group variables in case these are required in later usage.

In summary of the data processing, we convert the variables into their right types. We extract income column and clean up dollar and comma signs and convert it into numerical type. Also, we fill up missing values with a consideration of respective country and education level. Finally, we create new columns for age and age groups.

3 Exploring Data Analysis

1) What does the average customer look like for this company?

We will explore customer information of the company by age, country, education, marital status, family size, and income.

Age

temp.1 = df.2 %>%
  group_by(Country) %>% 
  mutate(x.lab = paste0(Country, 
                        "\n",
                        "(n=",
                        n(),
                        ")"))

ggplot(data = temp.1,
       aes(x = x.lab,
           y = Age_atRegister,
           fill = Country)) +
  geom_boxplot(outlier.shape = NA) +
  stat_boxplot(geom = "errorbar") +
  geom_jitter(aes(fill = Country),
              shape = 21,
              alpha = 0.8,
              width = 0.05) +
  geom_hline(yintercept = 30,
             linetype = 2,
             color = "#D43F3AFF") +
  geom_hline(yintercept = 55,
             linetype = 2,
             color = "#D43F3AFF") +  
  scale_y_continuous(limits = c(0, 90),
                     breaks = seq(0, 90, 10)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.5) +  
  theme +
  labs(title = "Distribution of Customer's Age by Country",
       caption = "Figure.2",
       x = "Country",
       y = "Age at Registration")

The company has customers from eight different countries, and most of them are aged from 30 to 55 due to the body of boxplots representing 50% of the data. There is no difference in age among countries. The country “SP” has the most customers, and the country “ME” has the least customers.

Country

temp.2 = df.2 %>%
  group_by(Country) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 2))

ggplot(data = temp.2,
       aes(x = "",
           y = count,
           fill = Country)) +
  geom_col(position = "fill") +
  coord_polar("y",
              start = 10) +
  geom_text(aes(label = paste0(per, "%")),
            position = position_fill(vjust = 0.5),
            size = 4) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "right") +
  labs(title = "Distribution of Customer by Country",
       caption = "Figure.3",
       x = NULL,
       y = NULL)

The pie chart shows that nearly half of the customers are from country “SP” which is Spain. The next one is from “SA” which is South Africa. Then, the third pool of customer is “CA” which is Canada.

Education

temp.3.1 = df.2 %>%
  group_by(Education) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 0))

p.1 = ggplot(data = temp.3.1,
       aes(x = reorder(Education, -count),
           y = count,
           fill = Education)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0(per, "%")),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 1200),
                     breaks = seq(0, 1200, 100)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.3,
                                   hjust = 1)) +  
  labs(title = "Education",
       x = NULL,
       y = NULL)

temp.3.2 = df.2 %>%
  group_by(Country, Education) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 0))

p.2 = ggplot(data = temp.3.2,
       aes(x = "",
           y = count,
           fill = Education)) +
  geom_col(position = "fill") +
  coord_polar("y",
              start = 80) +  
  geom_text(aes(label = paste0(per, "%")),
            position = position_fill(vjust = 0.5),
            size = 2.5) +
  facet_wrap(~Country) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "right",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12)) +
  labs(title = "Education by Country",
       caption = "Figure.4",
       x = NULL,
       y = NULL)  

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,1,2,2,2)))

Overall, half of customers have the graduation education level. Then, the Ph.D. and the master follow up, so does the rank sequence in each country.

Marital Status

temp.4.1 = df.2 %>%
  group_by(Marital_Status) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 0))

p.1 = ggplot(data = temp.4.1,
       aes(x = reorder(Marital_Status, -count),
           y = count,
           fill = Marital_Status)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0(per, "%")),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 900),
                     breaks = seq(0, 900, 100)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.3,
                                   hjust = 1)) +
  labs(title = "Marital Status",
       x = NULL,
       y = NULL)

temp.4.2 = df.2 %>%
  group_by(Country, Marital_Status) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 0))

p.2 = ggplot(data = temp.4.2,
       aes(x = "",
           y = count,
           fill = Marital_Status)) +
  geom_col(position = "fill") +
  coord_polar("y",
              start = 80) +  
  geom_text(aes(label = paste0(per, "%")),
            position = position_fill(vjust = 0.5),
            size = 2.5) +
  facet_wrap(~Country) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "right",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12)) +
  labs(title = "Marital Status by Country",
       caption = "Figure.5",
       x = NULL,
       y = NULL,
       fill = "Marital Status")  

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,1,2,2,2)))

Nearly 40% of customers are married, staying together is the second, and then being single is the third.

Family Size

temp.5.1 = df.2 %>%
  mutate(Family_Size = paste0(Kidhome, 
                              " Kid ",
                              Teenhome,
                              " Teen")) %>% 
  group_by(Family_Size) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 0))

p.1 = ggplot(data = temp.5.1,
       aes(x = reorder(Family_Size, -count),
           y = count,
           fill = Family_Size)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0(per, "%")),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 700),
                     breaks = seq(0, 700, 100)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.3,
                                   hjust = 1)) +
  labs(title = "Family Size",
       x = NULL,
       y = NULL)

temp.5.2 = df.2 %>%
  mutate(Family_Size = paste0(Kidhome, 
                              " Kid ",
                              Teenhome,
                              " Teen")) %>% 
  group_by(Country, Family_Size) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 0))

p.2 = ggplot(data = temp.5.2,
       aes(x = "",
           y = count,
           fill = Family_Size)) +
  geom_col(position = "fill") +
  coord_polar("y",
              start = 20) +  
  geom_text(aes(label = paste0(per, "%")),
            position = position_fill(vjust = 0.5),
            size = 2.5) +
  facet_wrap(~Country) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "right",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12)) +
  labs(title = "Family Size by Country",
       caption = "Figure.6",
       x = NULL,
       y = NULL,
       fill = "Family Size")  

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,1,2,2,2)))

28% of customers have no children, and 72% of customers have at least one kid or one teen in the family, and most of them have only one child as a majority.

Income

summary(df.2$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1730   35539   51550   52250   68290  666666

The mean income of customer is $52,250 in annual; the median is $51,550. Because the mean and the median is not far away from each other, we assume a normal distribution for the income of customer.

t.test(df.2$Income)
## 
##  One Sample t-test
## 
## data:  df.2$Income
## t = 98.759, df = 2239, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  51212.46 53287.48
## sample estimates:
## mean of x 
##  52249.97

We conduct a one sample t-test and conclude the true mean is away from zero and falling between $51,212 and $53,287 in a 95% confidence interval.

p.1 = ggplot(data = df.2,
       aes(y = Income)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F,
                   fill = "#D62728FF") +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    fill = "#FF7F0EFF") +
  scale_y_continuous(limits = c(0, 180000),
                     breaks = seq(0, 180000, 20000),
                     labels = function(x) paste0("$", 
                                                 {x/1000}, "k")) +
  theme +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(title = "Income",
       y = NULL)

p.2 = ggplot(data = df.2,
       aes(x = Age_Group,
           y = Income,
           fill = Education)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F,
                   fill = "#D62728FF") +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    fill = "#FF7F0EFF") +
  scale_y_continuous(limits = c(0, 180000),
                     breaks = seq(0, 180000, 20000),
                     labels = function(x) paste0("$", 
                                                 {x/1000}, "k")) +
  facet_wrap(~Education, nrow = 1) +
  theme +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.3,
                                   hjust = 1),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12)) +  
  labs(title = "Income by Education by Age Group",
       caption = "Figure.7",
       x = NULL,
       y = NULL)

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,2,2,2,2)))

An outlier at $666,666 has been removed. 50% of customers have the income to be between $35k to $70k annually. Income among education levels is not obviously different from each other except the customers from basic, which have the lowest income.

2) Which products are performing the best?

We take the columns of mnt_products, which are the amount of dollars spent on product in the last two years. Take mntwines as an example, this records the amount spent on wine in the last two years.

temp.6 = df.2 %>%
  dplyr::select(MntWines, 
                MntFruits, 
                MntMeatProducts, 
                MntFishProducts, 
                MntSweetProducts, 
                MntGoldProds) %>% 
  rename("Wine" = "MntWines",
         "Fruit" = "MntFruits",
         "Meat" = "MntMeatProducts",
         "Fish" = "MntFishProducts",
         "Sweet" = "MntSweetProducts",
         "Gold" = "MntGoldProds") %>% 
  pivot_longer(c(Wine, Fruit, Meat, Fish, Sweet, Gold),
               names_to = "products",
               values_to = "amount_spent") %>% 
  group_by(products) %>% 
  summarise(total_revenue = sum(amount_spent)) %>% 
  mutate(per = round(total_revenue/sum(total_revenue)*100, 2))

ggplot(data = temp.6,
       aes(x = reorder(products, -total_revenue),
           y = total_revenue,
           fill = products)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0("$",
                               prettyNum(total_revenue,
                                         big.mark = ","),
                               "\n",
                               "(",
                               per,
                               "%",
                               ")")),
            vjust = -0.3) +
  scale_y_continuous(limits = c(0, 800000),
                     labels = function(x) paste0("$", 
                                                 {x/1000}, "k")) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  labs(title = "Performance of Each Product Category",
       caption = "Figure.8",
       x = "Product Category",
       y = "Total Revenue")

Products of the company include wine, meat, gold, fish, sweet, and fruit. Wine is the best selling product which contributes 50.15% to the total revenue.

3) Which marketing campaign is the most successful?

We use acceptedcmp and response columns. As for acceptedcmp1, 1 if customer accepted the offer in the 1st campaign, 0 otherwise. As for response, 1 if customer accepted the offer in the last campaign, 0 otherwise.

temp.7 = df.2 %>%
  dplyr::select(ID, 
                AcceptedCmp1, 
                AcceptedCmp2,
                AcceptedCmp3,
                AcceptedCmp4,
                AcceptedCmp5, 
                Response) %>% 
  pivot_longer(c(AcceptedCmp1,
                 AcceptedCmp2,
                 AcceptedCmp3,
                 AcceptedCmp4,
                 AcceptedCmp5,
                 Response),
               names_to = "Campaign",
               values_to = "Reaction") %>% 
  group_by(Campaign) %>% 
  summarise(Rate = round(sum(Reaction)/n()*100, 2)) %>% 
  mutate(Campaign = fct_recode(Campaign,
                               "First" = "AcceptedCmp1",
                               "Second" = "AcceptedCmp2",
                               "Third" = "AcceptedCmp3",
                               "Fourth" = "AcceptedCmp4",
                               "Fifth" = "AcceptedCmp5",
                               "Last" = "Response"))

ggplot(data = temp.7,
       aes(x = Campaign,
           y = Rate,
           fill = Campaign)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0(Rate, "%")),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 20),
                     breaks = seq(0, 20, 5)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  labs(title = "Accentance Rate of Each Marketing Campaign",
       caption = "Figure.9",
       x = "Campaign",
       y = "Acceptance Rate (%)")

There are six marketing campaigns, including five campaigns and the last campaign in the column of response. The last marketing campaign is the most successful with an acceptance rate of 14.91%. It is a value that is nearly doubling the acceptance rate of most of the other marketing campaigns launched.

4) Which channels are underperforming?

We take the numwebpurchases, numcatalogpurchases, and numstorepurchases, which are the number of purchases made through the company’s website, catalog, and in-store. As for numdealspurchases and numwebvisitsmonth, those are not considered as sale channels to discuss in this section.

temp.8 = df.2 %>%
  dplyr::select(NumWebPurchases, 
                NumCatalogPurchases,
                NumStorePurchases) %>% 
  rename("Web" = "NumWebPurchases",
         "Catalog" = "NumCatalogPurchases",
         "In_Store" = "NumStorePurchases") %>% 
  pivot_longer(c(Web, Catalog, In_Store),
               names_to = "channels",
               values_to = "no_of_purchases") %>% 
  group_by(channels) %>% 
  summarise(total = sum(no_of_purchases)) %>% 
  mutate(per = round(total/sum(total)*100, 2))

p.1 = ggplot(data = temp.8,
       aes(x = reorder(channels, -total),
           y = total,
           fill = channels)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0(prettyNum(total,
                                         big.mark = ","))),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 14000),
                     breaks = seq(0, 14000, 2000)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  labs(title = "Comparison of Sale Channels in Barchart & Piechart",
       caption = NULL,
       x = "Sale Channels",
       y = "Total Number of Purchases")

p.2 = ggplot(data = temp.8,
       aes(x = "",
           y = total,
           fill = channels)) +
  geom_col(position = "fill") +
  coord_polar("y",
              start = 20) +  
  geom_text(aes(label = paste0(per, "%")),
            position = position_fill(vjust = 0.5),
            size = 4) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "right",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12)) +
  labs(title = NULL,
       caption = "Figure.10",
       x = NULL,
       y = NULL,
       fill = "Sale Channels")  

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,2)))

The catalog is the most underperforming sale channel and then followed by purchases made through the company’s website.

4 Statistical Analysis

Based on different questions (comparison/factorization), we can use different statistical methods to test. Before the slash is the method we choose to use in the test. But, there is still another option which is the after the slash.

2) Does US purchasing power significantly better than the rest of the world in terms of total purchases?

Comparison: ANOVA/Linear Regression

df.3 = df.2 %>%
  dplyr::select(Country, 
                NumWebPurchases,
                NumCatalogPurchases,
                NumStorePurchases) %>% 
  mutate(total_purchases = 
           NumWebPurchases + 
           NumCatalogPurchases + 
           NumStorePurchases) %>% 
  dplyr::select(-NumWebPurchases, 
                -NumCatalogPurchases,
                -NumStorePurchases)

mod.2 = aov(data = df.3,
            total_purchases ~ Country)

We synthesize total_purchases by using numwebpurchases, numcatalogpurchases, and numstorepurchases. We are comparing total purchases per customer in different countries. So, we will use the ANOVA test to compare multiple (>= 2) means differences. Here, we use the one way ANOVA. There are two assumptions to be satisfied.

  • Normality:
    • QQ plot
    • Shapiro Wilk test
  • Homogeneity of variance:
    • Normally distributed > Bartlett’s test
    • Non-normally distributed > Levene’s test
a = c("One way ANOVA test",
      "Kruskal Wallis test",
      "Mann Whitney test")

b = c("v",
      "x",
      "x")

c = c("v",
      "v",
      "x")

data.frame(a, b, c) %>% 
  rename("ANOVA" = "a",
         "Normally distributed?" = "b",
         "Equal variance?" = "c") %>% 
  kbl(align = "c",
      caption = "v: Yes, x: No") %>% 
  kable_classic("hover")
v: Yes, x: No
ANOVA Normally distributed? Equal variance?
One way ANOVA test v v
Kruskal Wallis test x v
Mann Whitney test x x

Normality Test with QQ Plot

mod.2$residuals %>% 
  as.data.frame() %>% 
  ggplot(aes(sample = .)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  theme +
  labs(title = "QQ Plot",
       subtitle = 
         "Shaded region = 95% confidence interval for Normality",
       caption = "Figure.12",
       x = "Theoretical Value",
       y = "Sample Value")  

The assumption of residuals normality is violated based on the QQ plot.

Homogeneity of Variance Test with Levene’s Test

leveneTest(df.3$total_purchases, 
           df.3$Country)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    7  1.1192 0.3479
##       2232

The result shows that the variance of total purchases in each country is similar due to p-value < 0.05. Based on our dataset is non-normally distributed and equal variance, we will use Kruskal Wallis test instead of the original ANOVA test.

kruskal.test(df.3$total_purchases, 
             df.3$Country)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df.3$total_purchases and df.3$Country
## Kruskal-Wallis chi-squared = 7.2796, df = 7, p-value = 0.4004

The Kruskal Wallis test instead is selected as an omnibus test. The result shows that there is no significant difference among countries in terms of total purchases per customer. So, we have not enough significant evidence to say that customers in a country can purchase more items than another country in this dataset.

temp.9.1 = df.3 %>% 
  group_by(Country) %>% 
  summarise(total = sum(total_purchases)) %>% 
  arrange(desc(total))

p.1 = ggplot(data = temp.9.1,
       aes(x = reorder(Country, -total),
           y = total,
           fill = Country)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = paste0(prettyNum(total,
                                         big.mark = ","))),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 16000),
                     breaks = seq(0, 16000, 5000),
                     labels = function(x) paste0({x/1000})) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = 
          element_text(color = ifelse(temp.9.1$Country == "US",
                                      "#D43F3AFF",
                                      "grey30"))) +
  labs(title = "Comparison of Total Purchases in Each Country",
       caption = "",
       x = "",
       y = "Total Purchase (Count in k)")

temp.9.2 = df.3 %>%
  group_by(Country) %>% 
  summarise(count = n(),
            total = sum(total_purchases),
            avg = round(total/count, 2)) %>% 
  mutate(x.label = paste0(Country, 
                          "\n",
                          "(n=",
                          count,
                          ")")) %>% 
  arrange(desc(total))

p.2 = ggplot(data = temp.9.2,
       aes(x = reorder(x.label, -total),
           y = avg,
           fill = Country)) +
  geom_col(position = "dodge",
           width = 0.7) +
  geom_text(aes(label = avg),
            vjust = -0.5) +
  scale_y_continuous(limits = c(0, 20),
                     breaks = seq(0, 20, 5)) +
  coord_cartesian(ylim = c(10, 20)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(axis.text.x = 
          element_text(color = ifelse(temp.9.2$Country == "US",
                                      "#D43F3AFF",
                                      "grey30"))) +
  labs(title = "Comparison of Purchases per Customer in Each Country",
       caption = "Figure.13",
       x = "Country",
       y = "Average Purchase per Customer")

grid.arrange(p.1, p.2,
             layout_matrix = cbind(c(1,2)))

Customers located in a country that has fewer customers from the company, such as “US”, are able to purchase similar or more items than a country that has many customers from the company, such as “SP”. In detail, US customers have better purchasing power, but not with statistical significance.

3) Are gold lovers prefer to shop in-store?

Comparison: ANOVA/T Test

The people who buy gold are more conservative and might have more in-store purchases. This is the hypothesis that we will justify or refute with an appropriate statistical test in this section.

df.4 = df.2 %>%
  dplyr::select(MntGoldProds, NumStorePurchases)

summary(df.4$MntGoldProds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   24.00   44.02   56.00  362.00

The average amount spent on gold is 44.02. Therefore, any customer who spends above this amount of money on gold will have more in-store purchases.

temp.10.1 = df.4 %>%
  mutate(group = 
           case_when(
             MntGoldProds > 44.02 ~ "Above_Avg",
             TRUE ~ "Below_Avg"),
         group = factor(group)) %>% 
  group_by(group) %>% 
  mutate(count = n(),
         x.label = paste0(group, 
                          "\n",
                          "(n=",
                          count,
                          ")"))

p.1 = ggplot(data = temp.10.1,
       aes(x = x.label,
           y = NumStorePurchases)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F,
                   fill = "#D62728FF") +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    fill = "#FF7F0EFF") +
  stat_summary(fun = "mean", 
               shape = 4, 
               color = "black",
               size = 0.5,
               stroke = 2) +
  theme +
  labs(title = 
         "Number of Store Purchase\nby Categories of Gold Buyer",
       caption = "",
       x = "Gold Buyer",
       y = "Number of Store Purchases (Count)")

set.seed(123)
temp.10.2 = temp.10.1 %>%
  group_by(group) %>% 
  sample_n(500) %>% 
  mutate(count = n(),
         x.label = paste0(group, 
                          "\n",
                          "(n=",
                          count,
                          ")"))

p.2 = ggplot(data = temp.10.2,
       aes(x = x.label,
           y = NumStorePurchases)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F,
                   fill = "#D62728FF") +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    fill = "#FF7F0EFF") +
  stat_summary(fun = "mean", 
               shape = 4, 
               color = "black",
               size = 0.5,
               stroke = 2) +
  theme +
  labs(title = 
         "Number of Store Purchase\nby Categories of Gold Buyer*",
       caption = "Figure.14",
       x = "Gold Buyer",
       y = "Number of Store Purchases (Count)")

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,2)))

The plot shows that customers who spend above average amounts on gold purchase more items in the store than those who spend below average amounts on gold. We will proceed with this statement with a statistical test. Due to unequal count to cost an unfair comparison, we randomly select 500 samples for each group and plot the figure again. The interpretation of the graph remains the same as above.

mod.3 = aov(data = temp.10.2,
            NumStorePurchases ~ group)

We are comparing in-store purchase numbers in two categories of gold buyer. We will use the ANOVA test here.

Normality Test with Shapiro Wilk Test

by(temp.10.2$NumStorePurchases,
   temp.10.2$group,
   shapiro.test)
## temp.10.2$group: Above_Avg
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.95891, p-value = 1.368e-10
## 
## ------------------------------------------------------------ 
## temp.10.2$group: Below_Avg
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.84495, p-value < 2.2e-16

The test result shows that the residuals in the dataset are not normally distributed based on the p-value < 0.05. As long as one variable is not normally distributed, the result of this text is considered not normally distributed.

Homogeneity of Variance Test with Levene’s Test

leveneTest(temp.10.2$NumStorePurchases, 
           temp.10.2$group)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   1  10.537 0.001209 **
##       998                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The variance between the two gold buyer groups is significantly different based on Levene’s test. The variance between both groups is not equal. Since the dataset is non-normally distributed and non-equal variance, we will use Mann Whitney test instead of the original ANOVA test.

wilcox.test(temp.10.2$NumStorePurchases ~ temp.10.2$group)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  temp.10.2$NumStorePurchases by temp.10.2$group
## W = 195535, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The result supports the statement that people who spend an above average amount on gold in the last two years would have more in-store purchases with a p-value < 0.05.

4) Do married Ph.D. candidates have a significant relation with the amount spend on fish?

Comparison: ANOVA/T Test

Fish has omega-3 fatty acids which are good for the brain. So, do married Ph.D. candidates have a significant relation with the amount spent on fish?

df.5.1 = df.2 %>%
  dplyr::select(Education,
                Marital_Status,
                MntFishProducts) %>% 
  mutate(group.1 = paste0(Education, Marital_Status)) %>% 
  mutate(group.2 = 
           case_when(
             group.1 == "PhDMarried" ~ "Married_PhD_Candidates",
             TRUE ~ "Others"
           ),
         group.2 = as.factor(group.2))

df.5.1 %>% 
  group_by(group.2) %>% 
  summarise(count = n()) %>% 
  rename("Group" = "group.2",
         "Count" = "count") %>%   
  kbl(align = "c",
      caption = "Group Sample Size") %>%
  kable_classic("hover")
Group Sample Size
Group Count
Married_PhD_Candidates 192
Others 2048
set.seed(123)
df.5.2 = df.5.1 %>%
  group_by(group.2) %>% 
  sample_n(192) %>% 
  dplyr::select(group.2, MntFishProducts)

In order to make the comparison fairer, we randomly select 192 samples from the 2048 samples of the “Others” group to make two groups with the same sample size.

temp.11 = df.5.2 %>%
  group_by(group.2) %>% 
  mutate(count = n(),
         x.label = paste0(group.2, 
                          "\n",
                          "(n=",
                          count,
                          ")"))

ggplot(data = temp.11,
       aes(x = x.label,
           y = MntFishProducts)) +
  geom_half_violin(side = "l",
                   alpha = 0.5,
                   trim = F,
                   fill = "#D62728FF") +
  geom_half_boxplot(side = "r",
                    alpha = 0.5,
                    fill = "#FF7F0EFF") +
  stat_summary(fun = "mean", 
               shape = 4, 
               color = "black",
               size = 0.5,
               stroke = 2) +
  theme +
  labs(title = "Amount of Fish Purchases by Married Ph.D. Candidates",
       caption = "Figure.15",
       x = NULL,
       y = "Amount Spent on Fish")

The plot shows that the mean and median of the group “Others” are slightly higher than the group of married Ph.D. candidates.

mod.4 = aov(data = temp.11,
            MntFishProducts ~ group.2)

We are comparing the amount of fish buying in two categories of buyers. We will continually use the ANOVA test.

Normality Test with Shapiro Wilk Test

by(temp.11$MntFishProducts,
   temp.11$group.2,
   shapiro.test)
## temp.11$group.2: Married_PhD_Candidates
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.69269, p-value < 2.2e-16
## 
## ------------------------------------------------------------ 
## temp.11$group.2: Others
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.70236, p-value < 2.2e-16

Shapiro Wilk test suggests that data between the two groups are not normally distributed.

Homogeneity of Variance Test with Levene’s Test

leveneTest(temp.11$MntFishProducts, 
           temp.11$group.2)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  3.0238 0.08286 .
##       382                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene’s test concludes that the spread of variance between the two groups of data are equal based on the p-value of greater than 0.05. Since the dataset is non-normally distributed and equal variance, we will use Kruskal Wallis test instead of the original ANOVA test.

kruskal.test(temp.11$MntFishProducts, 
             temp.11$group.2)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  temp.11$MntFishProducts and temp.11$group.2
## Kruskal-Wallis chi-squared = 11.215, df = 1, p-value = 0.0008113

The p-value of 0.0008113 concludes that married Ph.D. candidates spend significantly less than other customers on fish.

6) Is there a significant relationship between geographical regions and the success of a campaign?

Comparison: Logistic Regression/ANOVA

df.7.1 = df.2 %>%
  dplyr::select(Country, 
                AcceptedCmp1, 
                AcceptedCmp2, 
                AcceptedCmp3, 
                AcceptedCmp4, 
                AcceptedCmp5, 
                Response)

df.7.2 = df.7.1 %>%
  rename("Camp1" = AcceptedCmp1,
         "Camp2" = AcceptedCmp2,
         "Camp3" = AcceptedCmp3,
         "Camp4" = AcceptedCmp4,
         "Camp5" = AcceptedCmp5,
         "LastCamp" = Response)

temp.12 = df.7.2 %>%
  pivot_longer(c(Camp1,
                 Camp2,
                 Camp3,
                 Camp4,
                 Camp5,
                 LastCamp),
               names_to = "Cmp",
               values_to = "Result") %>% 
  mutate(Cmp = as.factor(Cmp)) %>% 
  group_by(Country, Cmp) %>% 
  summarise(count = n(),
            rate = round(sum(Result)/count*100, 2)) %>% 
  mutate(x.label = paste0(Country, 
                          "\n",
                          "(n=",
                          count,
                          ")"))

p.1 = ggplot(data = temp.12 %>% filter(Country == "ME"),
       aes(x = x.label,
           y = rate,
           fill = Cmp)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = function(x) {paste0(x, "%")},
                     limits = c(0, 80)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  labs(title = "Anomaly",
       caption = "",
       x = NULL,
       y = "Acceptance Rate")  

p.2 = ggplot(data = temp.12 %>% filter(Country != "ME"),
       aes(x = x.label,
           y = rate,
           fill = Cmp)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = function(x) {paste0(x, "%")},
                     limits = c(0, 20)) +
  scale_fill_d3(palette = c("category20"),
                alpha = 0.7) +  
  theme +
  theme(legend.position = "right") +
  labs(title = "Acceptance Rate of Marketing Campaigns by Country",
       caption = "Figure.16",
       x = NULL,
       y = NULL,
       fill = "Campaign")  

grid.arrange(p.1, p.2,
             layout_matrix = rbind(c(1,2,2,2,2)))

Most campaigns across countries have a low acceptance rate and mostly less than 15%. Except for the country “ME”, other countries have a minimum of 100 customers.

We will use logistic regression which is performed using GLM binomial family on the effect of countries to six respective marketing campaigns to test the statement in this section.

df.7.3 = df.7.2 %>% 
  mutate_if(is.double, as.factor)

mod.6.1 = glm(data = df.7.3,
              Camp1 ~ Country,
              family = "binomial")

mod.6.1.sum = summary(mod.6.1)

mod.6.1.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  rename("P-value" = `Pr(>|z|)`) %>% 
  filter(`P-value` < 0.05) %>% 
  rownames_to_column(var = "Variable") %>% 
  filter(Variable != "(Intercept)") %>%
  kbl(align = "c",
      caption = "Compare Campaign 1 Across Countries") %>%
  kable_classic("hover")
Compare Campaign 1 Across Countries
Variable Estimate P-value

There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.

mod.6.2 = glm(data = df.7.3,
              Camp2 ~ Country,
              family = "binomial")

mod.6.2.sum = summary(mod.6.2)

mod.6.2.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  rename("P-value" = `Pr(>|z|)`) %>% 
  filter(`P-value` < 0.05) %>% 
  rownames_to_column(var = "Variable") %>% 
  filter(Variable != "(Intercept)") %>%
  kbl(align = "c",
      caption = "Compare Campaign 2 Across Countries") %>%
  kable_classic("hover")
Compare Campaign 2 Across Countries
Variable Estimate P-value

There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.

mod.6.3 = glm(data = df.7.3,
              Camp3 ~ Country,
              family = "binomial")

mod.6.3.sum = summary(mod.6.3)

mod.6.3.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  rename("P-value" = `Pr(>|z|)`) %>% 
  filter(`P-value` < 0.05) %>% 
  rownames_to_column(var = "Variable") %>% 
  filter(Variable != "(Intercept)") %>%
  kbl(align = "c",
      caption = "Compare Campaign 3 Across Countries") %>%
  kable_classic("hover")
Compare Campaign 3 Across Countries
Variable Estimate P-value

There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.

mod.6.4 = glm(data = df.7.3,
              Camp4 ~ Country,
              family = "binomial")

mod.6.4.sum = summary(mod.6.4)

mod.6.4.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  rename("P-value" = `Pr(>|z|)`) %>% 
  filter(`P-value` < 0.05) %>% 
  rownames_to_column(var = "Variable") %>% 
  filter(Variable != "(Intercept)") %>%
  kbl(align = "c",
      caption = "Compare Campaign 4 Across Countries") %>%
  kable_classic("hover")
Compare Campaign 4 Across Countries
Variable Estimate P-value
CountryCA 0.9260787 0.0477882

Most countries have no relation to the success of marketing campaigns. Only Canada (CA) has a week significant p-value of 0.048 which is close to 0.05 to reject the null hypothesis. Therefore, we still conclude there is no effect of the country on campaign success.

mod.6.5 = glm(data = df.7.3,
              Camp5 ~ Country,
              family = "binomial")

mod.6.5.sum = summary(mod.6.5)

mod.6.5.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  rename("P-value" = `Pr(>|z|)`) %>% 
  filter(`P-value` < 0.05) %>% 
  rownames_to_column(var = "Variable") %>% 
  filter(Variable != "(Intercept)") %>%
  kbl(align = "c",
      caption = "Compare Campaign 5 Across Countries") %>%
  kable_classic("hover")
Compare Campaign 5 Across Countries
Variable Estimate P-value

There is no effect of the country on campaign success. The p-value of all countries is higher than 0.05.

mod.6.6 = glm(data = df.7.3,
              LastCamp ~ Country,
              family = "binomial")

mod.6.6.sum = summary(mod.6.6)

mod.6.6.sum$coefficients %>%
  as.data.frame() %>% 
  dplyr::select(Estimate, `Pr(>|z|)`) %>% 
  rename("P-value" = `Pr(>|z|)`) %>% 
  filter(`P-value` < 0.05) %>% 
  rownames_to_column(var = "Variable") %>% 
  filter(Variable != "(Intercept)") %>%
  kbl(align = "c",
      caption = "Compare Last Campaign Across Countries") %>%
  kable_classic("hover")
Compare Last Campaign Across Countries
Variable Estimate P-value
CountryME 2.477634 0.0466367

Most countries have no relation to the success of marketing campaigns. Only Montenegro (ME) has a week significant p-value of 0.047 which is close to 0.05 to reject the null hypothesis. Also, there are only three samples in this country. The sample size is too small to generate an accurate result. Therefore, we still conclude there is no effect of the country on campaign success.

5 Conclusion

There are four tasks for visual exploration and six tasks for statistical analysis.

Results from exploring data analysis reveal that most customers are in mature age between 30 to 55, half of the customers are from Spain, 72% of the customers have a least one children home, half of the customers have income between $35k to $70k annually, and wine is the best selling product that contributing over 50% to the total revenue. All marketing campaigns are not doing very well with acceptance rates of lower than 15%. In-store sales are the most performing channel, and catalog sales are the most underperforming channel.

Results from statistical analysis reveal in-store purchases are significantly affected by the number of deals in a positive way and by the number of kids at home in a negative the most. The purchasing power of customers from the US do not significantly different from other countries. Gold buyers have more in-store purchases. Married Ph.D. candidates spend significantly less than other customer groups on fish. Statistical analysis also concludes that there is no significant relationship between geographical regions and the success of a campaign.

6 Reference