Reading time: 35 minutes

1 SUMMARY

This is a personal project uses a public dataset from Kaggle that designed for students in MSc. in Business Analytics as their final project. The dataset comprised of customer information, sales channels, products sold by the company, and the acceptance data of marketing campaigns. There are 4 tasks for visual exploration and 5 tasks for statistical analysis.

All 9 business tasks have been solved by 18 graphs and 10 statistical methods. Statistical methods used in this project include:

  • Exploratory data analysis (EDA)
  • Multicollinearity test using correlation & VIF
  • Data partitioning to create train & test for RMSE & R2
  • Stepwise model selection
  • Multiple linear regression (MLR)
  • Generalised Linear Models (GLM)’s binomial for logistic regression
  • Normality test using Q-Q plot and Shapiro-Wilk test
  • Variance test with Levene’s test
  • Omnibus test with Kruskal-Wallis Test
  • Two-sample non-parametric test with Mann-Whitney test

Results from EDA reveal that most customer are in mature age between 30 to 55, 51% of customers are from Spain, half of the customers have education level of “graduate” level, 71.5% of customers have a least 1 children home, half of the customers have income between $35k to $70k and wine is the best selling product that contributing 50.15% to the total revenue. All marketing campaigns weren’t doing very well with acceptance rates of lower than 15%. In store sales is the most performing channels.

Results from statistical analysis reveal in-store purchases is significantly affected by products, deals, purchases made through web, income, number of kids at home, sales through catalog and customer’s visit frequency to company’s website. Purchasing power of customers from the US do not significantly different from other countries (P-value 0.4004). Gold buyers have more store purchases (P-value < 0.05). Married PhD candidates spent significantly less than other customer groups on fish, with a p-value of less than 0.05. Statistical analysis also concluded that there is no significant relationship between geographical regions and success of a campaign.


Highlights



2 R PACKAGES

R packages loaded in this project include tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, kableExtra, DT, lubridate, zoo, gridExtra, ggplotr, ggrepel, car, caret, DescTools, multcomp, and olsrr.

library(tidyverse)
library(skimr)
library(kableExtra)
library(DT)
library(lubridate)
library(zoo) 
library(gridExtra)
library(qqplotr)
library(ggrepel)
library(car)     # For VIF
library(caret)
library(DescTools)
library(multcomp)
library(olsrr)

3 INTRODUCTION

The purpose of this project is to answer a series of general business questions using R programming language. This project has a dataset that provides many marketing data include general customer information, products sold by the company, the company’s sale channels, and the performance of a series of marketing campaigns that the company has launched.

The business questions that I am tasked to answer are separated into 2 main bodies, one is for data exploration analysis, and the other one is for statistical analysis.

Business questions for data exploration analysis (EDA) are:

  • What does the average customer look like for this company?
  • Which products are performing best?
  • Which marketing campaign is most successful?
  • Which channels are underperforming?

Business questions for statistical analysis are:

  • What are the factors that significantly related to in-store purchases?
  • Does US’s purchasing power significantly better than the rest of hte world in terms of total purchase?
  • Are gold lovers prefer to shop in store?
  • Do “Married PhD candidates” have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish?
  • Is there a significant relationship between geographical regional and success of a campaign?

This project uses a public dataset from Kaggle. Kaggle is a professional website for data science community. The dataset was supplied by Dr. Omar Romero-Hernandez for his students for their final project in order to test their statistical analysis skills as part of a MSc. in Business Analytics. Please click this Link to the website. All of the above business questions are actually questions for the marketing students to answer, visit this Link to the relevant website.

4 DATA PREPARATION

4.1 Data import

Following codes import the dataset (click the code button). Table below is an indication of successful data import.


marketing_data <- read_csv("marketing_data.csv")
datatable(marketing_data)

4.2 Data description

Following table describes all variables in the dataset, extracted from the Kaggle website.

Variable <- c("ID", 
              "Year_Birth", 
              "Education",
              "Marital_Status",
              "Income",
              "Kidhome",
              "Teenhome",
              "Dt_Customer",  
              "Recency",
              "MntWines",
              "MntFruits",
              "MntMeatProducts",
              "MntFishProducts",
              "MntSweetProducts",
              "MntGoldProds",
              "NumDealsPurchases",
              "NumWebPurchases",
              "NumCatalogPurchases",
              "NumStorePurchases",
              "NumWebVisitsMonth",
              "AcceptedCmp3",
              "AcceptedCmp4",
              "AcceptedCmp5",
              "AcceptedCmp1",
              "AcceptedCmp2",
              "Response",
              "Complain",
              "Country")

Description <- c("Customer's unique identifier",
                 "Customer's birth year",
                 "Customer's education level",
                 "Customer's marital status",
                 "Customer's yearly household income",
                 "Number of children in customer's household",
                 "Number of teenagers in customer's household",
                 "Date of customer's enrollment with the company",
                 "Number of days since customer's last purchase",
                 "Amount spent on wine in the last 2 years",
                 "Amount spent on fruits in the last 2 years",
                 "Amount spent on meat in the last 2 years",
                 "Amount spent on fish in the last 2 years",
                 "Amount spent on sweets in the last 2 years",
                 "Amount spent on gold in the last 2 years",
                 "Number of purchases made with a discount",
                 "Number of purchases made through the company's web site",
                 "Number of purchases made using a catalogue",
                 "Number of purchases made directly in stores",
                 "Number of visits to company's web site in the last month",
                 "1 if customer accepted the offer in the 3rd campaign, 0 otherwise",
                 "1 if customer accepted the offer in the 4th campaign, 0 otherwise",
                 "1 if customer accepted the offer in the 5th campaign, 0 otherwise",
                 "1 if customer accepted the offer in the 1st campaign, 0 otherwise",
                 "1 if customer accepted the offer in the 2nd campaign, 0 otherwise",
                 "1 if customer accepted the offer in the last campaign, 0 otherwise",
                 "1 if customer complained in the last 2 years, 0 otherwise",
                 "Customer's location")

data.frame(No = c(1:28), Variable, Description) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = c("bordered", "striped", "hover"))
No Variable Description
1 ID Customer’s unique identifier
2 Year_Birth Customer’s birth year
3 Education Customer’s education level
4 Marital_Status Customer’s marital status
5 Income Customer’s yearly household income
6 Kidhome Number of children in customer’s household
7 Teenhome Number of teenagers in customer’s household
8 Dt_Customer Date of customer’s enrollment with the company
9 Recency Number of days since customer’s last purchase
10 MntWines Amount spent on wine in the last 2 years
11 MntFruits Amount spent on fruits in the last 2 years
12 MntMeatProducts Amount spent on meat in the last 2 years
13 MntFishProducts Amount spent on fish in the last 2 years
14 MntSweetProducts Amount spent on sweets in the last 2 years
15 MntGoldProds Amount spent on gold in the last 2 years
16 NumDealsPurchases Number of purchases made with a discount
17 NumWebPurchases Number of purchases made through the company’s web site
18 NumCatalogPurchases Number of purchases made using a catalogue
19 NumStorePurchases Number of purchases made directly in stores
20 NumWebVisitsMonth Number of visits to company’s web site in the last month
21 AcceptedCmp3 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
22 AcceptedCmp4 1 if customer accepted the offer in the 4th campaign, 0 otherwise
23 AcceptedCmp5 1 if customer accepted the offer in the 5th campaign, 0 otherwise
24 AcceptedCmp1 1 if customer accepted the offer in the 1st campaign, 0 otherwise
25 AcceptedCmp2 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
26 Response 1 if customer accepted the offer in the last campaign, 0 otherwise
27 Complain 1 if customer complained in the last 2 years, 0 otherwise
28 Country Customer’s location

It is a comprehensive dataset and it contains classifiable information.

    1. Customer information: ID, age, education, marital status, income, number of kids and teens at home, date of customer’s enrollment with the company, recency, complain(s) made in the last 2 years, country of the customer currently at (I assume), the number of purchases made during a discount, and customers’ frequency of visiting company’s website.
    1. Products of the company: Wines, fruits, meat, fish, sweet, and gold.
    1. Sale Channels: Number of purchases made via company website, catalog, and in the store.
    1. Customers responses on company’s marketing campaigns. There are a total of 6 campaigns launched.

4.3 Data exploration

This dataset has 2,240 rows of observations and 28 columns of variables. Among the variables, there are 5 character variables and 23 numerical variables.


skim_without_charts(marketing_data)
Data summary
Name marketing_data
Number of rows 2240
Number of columns 28
_______________________
Column type frequency:
character 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Education 0 1.00 3 10 0 5 0
Marital_Status 0 1.00 4 8 0 8 0
Income 24 0.99 9 11 0 1974 0
Dt_Customer 0 1.00 6 8 0 663 0
Country 0 1.00 2 3 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
ID 0 1 5592.16 3246.66 0 2828.25 5458.5 8427.75 11191
Year_Birth 0 1 1968.81 11.98 1893 1959.00 1970.0 1977.00 1996
Kidhome 0 1 0.44 0.54 0 0.00 0.0 1.00 2
Teenhome 0 1 0.51 0.54 0 0.00 0.0 1.00 2
Recency 0 1 49.11 28.96 0 24.00 49.0 74.00 99
MntWines 0 1 303.94 336.60 0 23.75 173.5 504.25 1493
MntFruits 0 1 26.30 39.77 0 1.00 8.0 33.00 199
MntMeatProducts 0 1 166.95 225.72 0 16.00 67.0 232.00 1725
MntFishProducts 0 1 37.53 54.63 0 3.00 12.0 50.00 259
MntSweetProducts 0 1 27.06 41.28 0 1.00 8.0 33.00 263
MntGoldProds 0 1 44.02 52.17 0 9.00 24.0 56.00 362
NumDealsPurchases 0 1 2.33 1.93 0 1.00 2.0 3.00 15
NumWebPurchases 0 1 4.08 2.78 0 2.00 4.0 6.00 27
NumCatalogPurchases 0 1 2.66 2.92 0 0.00 2.0 4.00 28
NumStorePurchases 0 1 5.79 3.25 0 3.00 5.0 8.00 13
NumWebVisitsMonth 0 1 5.32 2.43 0 3.00 6.0 7.00 20
AcceptedCmp3 0 1 0.07 0.26 0 0.00 0.0 0.00 1
AcceptedCmp4 0 1 0.07 0.26 0 0.00 0.0 0.00 1
AcceptedCmp5 0 1 0.07 0.26 0 0.00 0.0 0.00 1
AcceptedCmp1 0 1 0.06 0.25 0 0.00 0.0 0.00 1
AcceptedCmp2 0 1 0.01 0.11 0 0.00 0.0 0.00 1
Response 0 1 0.15 0.36 0 0.00 0.0 0.00 1
Complain 0 1 0.01 0.10 0 0.00 0.0 0.00 1

Insights:

  • This dataset is very complete. All variables have a complete_rate of 1, which stands for 100%, indicating there is no any missing value their columns. It associated with the adjacent column n_missing, which is almost 0 for all variables.

  • Except for the variable “Income”, it has a complete_rate of 0.989 (98.9%). Generally, an analyst will remove these NA rows because 98.9% is a level that is already high enough to accept. However, I will fill up these blank cells by looking for mean (if normality is observed) or median (if outlier is observed) from relevant information. This will be carried out in data manipulation section.

  • There are no white spaces among the 5 character variables that needed to clean as well, by examining the column whitespace.

Following summary provides a quick way to examine are the data types assigned to each column rightfully.

glimpse(marketing_data)
## Rows: 2,240
## Columns: 28
## $ ID                  <dbl> 1826, 1, 10476, 1386, 5371, 7348, 4073, 1991, 4047~
## $ Year_Birth          <dbl> 1970, 1961, 1958, 1967, 1989, 1958, 1954, 1967, 19~
## $ Education           <chr> "Graduation", "Graduation", "Graduation", "Graduat~
## $ Marital_Status      <chr> "Divorced", "Single", "Married", "Together", "Sing~
## $ Income              <chr> "$84,835.00", "$57,091.00", "$67,267.00", "$32,474~
## $ Kidhome             <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,~
## $ Teenhome            <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,~
## $ Dt_Customer         <chr> "6/16/14", "6/15/14", "5/13/14", "5/11/14", "4/8/1~
## $ Recency             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ MntWines            <dbl> 189, 464, 134, 10, 6, 336, 769, 78, 384, 384, 450,~
## $ MntFruits           <dbl> 104, 5, 11, 0, 16, 130, 80, 0, 0, 0, 26, 4, 82, 10~
## $ MntMeatProducts     <dbl> 379, 64, 59, 1, 24, 411, 252, 11, 102, 102, 535, 6~
## $ MntFishProducts     <dbl> 111, 7, 15, 0, 11, 240, 15, 0, 21, 21, 73, 0, 80, ~
## $ MntSweetProducts    <dbl> 189, 0, 2, 0, 0, 32, 34, 0, 32, 32, 98, 13, 20, 16~
## $ MntGoldProds        <dbl> 218, 37, 30, 0, 34, 43, 65, 7, 5, 5, 26, 4, 102, 3~
## $ NumDealsPurchases   <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 3, 3, 1, 2, 1, 1, 0, 4, 4,~
## $ NumWebPurchases     <dbl> 4, 7, 3, 1, 3, 4, 10, 2, 6, 6, 5, 3, 3, 1, 25, 2, ~
## $ NumCatalogPurchases <dbl> 4, 3, 2, 0, 1, 7, 10, 1, 2, 2, 6, 1, 6, 1, 0, 1, 1~
## $ NumStorePurchases   <dbl> 6, 7, 5, 2, 2, 5, 7, 3, 9, 9, 10, 6, 6, 2, 0, 5, 5~
## $ NumWebVisitsMonth   <dbl> 1, 5, 2, 7, 7, 2, 6, 5, 4, 4, 1, 4, 1, 6, 1, 4, 4,~
## $ AcceptedCmp3        <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AcceptedCmp4        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AcceptedCmp5        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AcceptedCmp1        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AcceptedCmp2        <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Response            <dbl> 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ Complain            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Country             <chr> "SP", "CA", "US", "AUS", "SP", "SP", "GER", "SP", ~

I will convert some of the variables into factor type. “Factor” is a variable type that can be assigned to either character or numerical variable when these variables are found useful for grouping the data. For example, variables “Country”, “Marital_Status”, and “Year_Birth” can be useful to group data. This feature is important during data analysis processing for cleaning, computational and machine learning techniques.

Following are a few conversions that I will perform in the next data cleaning and manipulation section.

Variables that need to be converted into factor type (fct) are

  • Education
  • Marital_Status
  • Country

Variables that need to be converted into numerical (dbl) type:

  • Income

Variable that need to be in date (date) data type:

  • Dt_Customer

5 DATA CLEANING AND MANIPULATION

5.1 Data type conversion

This section continues with the work mentioned in the last section to convert some variables into their rightfully type so that the data analysis can be more efficient and accurate. Following codes complete the task (click the right button).

# I will retain the original dataset as marketing_data and give a new object named "md".

md <- marketing_data

# Start mutating the dataset to obtain the correct data type.

md <- md %>% 
  mutate(Education = as.factor(Education),
         Marital_Status = as.factor(Marital_Status),
         Country = as.factor(Country),
         Dt_Customer = mdy(Dt_Customer))

md <- md %>% 
  mutate(Income = gsub(",", "", Income),
         Income = gsub("\\$", "", Income),
         Income = str_sub(Income, end = -4),
         Income = as.numeric(Income))

Checking the type of these variables using following code (click the right botton). Following results show that the conversions have been successful

check <- md %>% dplyr::select(Education, Marital_Status, Country, Dt_Customer, Income)
glimpse(check)
## Rows: 2,240
## Columns: 5
## $ Education      <fct> Graduation, Graduation, Graduation, Graduation, Graduat~
## $ Marital_Status <fct> Divorced, Single, Married, Together, Single, Single, Ma~
## $ Country        <fct> SP, CA, US, AUS, SP, SP, GER, SP, US, IND, US, SP, IND,~
## $ Dt_Customer    <date> 2014-06-16, 2014-06-15, 2014-05-13, 2014-05-11, 2014-0~
## $ Income         <dbl> 84835, 57091, 67267, 32474, 21474, 71691, 63564, 44931,~

5.2 Examing tricky missing value

Though the dataset was found very complete without any missing value, however, if a missing value is filled up manually with character string data such as “null”, or “missing value” into the cells by the dataset provider, then it will not be detected by the exploration method I used because R will recognise them as a part of the data in character type.

This section will examine the presence of these words (or relevant) in the character and factor variables of the dataset.

check2 <- md %>% 
  dplyr::select(is.factor, -ID)

sapply(check2, levels)
## $Education
## [1] "2n Cycle"   "Basic"      "Graduation" "Master"     "PhD"       
## 
## $Marital_Status
## [1] "Absurd"   "Alone"    "Divorced" "Married"  "Single"   "Together" "Widow"   
## [8] "YOLO"    
## 
## $Country
## [1] "AUS" "CA"  "GER" "IND" "ME"  "SA"  "SP"  "US"

There are no any written form of missing value in all variables (such as “Null”, “Missing_value”, “Blank”, or whatever relevant). The column “ID” has its original data type as numeric at the absence of any alphabetical string data, hence it does not contain any written form of missing value.

Still, I am able to check is there any character data (such as “Null”, “Missing_value”, “Blank”, or whatever relevant) falling into the ID column as well, by running following code (click the right button). The result is 0, which indicates that there is no character data in the column of ID.

which(str_detect(md$ID, "[[:alpha:]]"))
## integer(0)

5.3 Fill up the blank salary cells

In the section of data preparation, I found that the column income has 24 missing value.


skim_without_charts(md$Income)
Data summary
Name md$Income
Number of rows 2240
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
data 24 0.99 52247.25 25173.08 1730 35303 51381.5 68522 666666

Though the proportion of missing data in the Income column is only 1.1%, I will fill up these missing values with either mean or median to make the dataset complete. It will rely on the normality of the data or the presence of outliers. The computation will be affected by variables education level, age, and country of the customer. These are the only information I can rely on from the dataset.

Following table shows how many many missing values by country and education.

# Examine the characteristics of individual missing salary. 

md_na <- md[is.na(md$Income),] %>% 
  dplyr::select(ID, Education, Country) %>% 
  group_by(Country, Education) %>% 
  count()

md_na
## # A tibble: 14 x 3
## # Groups:   Country, Education [14]
##    Country Education      n
##    <fct>   <fct>      <int>
##  1 AUS     2n Cycle       1
##  2 AUS     Graduation     6
##  3 AUS     Master         3
##  4 AUS     PhD            3
##  5 CA      Graduation     1
##  6 CA      PhD            1
##  7 GER     2n Cycle       1
##  8 GER     Graduation     1
##  9 GER     Master         1
## 10 GER     PhD            1
## 11 IND     2n Cycle       1
## 12 SP      Graduation     2
## 13 US      Graduation     1
## 14 US      Master         1

I have 2 options when filling up these missing values, either to use the median or mean. I will use boxplot to help making the decision.

md_complete <- md[complete.cases(md),]

ggplot(md_complete, aes(x = Education, y = Income)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(colour = "blue", alpha = 0.5, width = 0.2) +
  facet_wrap(~Country, scale = "free") +
  stat_boxplot(geom = "errorbar")

Insights:

  • For AUS group, I will use mean for the missing value in the 2n Cycle, Graduation, Master, and PhD education levels.

  • For CA, it is better to use median to fill up the missing value in the PhD dataset because of the presence of extreme outlier.

  • For GER, I will use mean to fill up missing values in teh 2n Cycle, Graduation, Master, and PhD education levels.

  • For IND, I will use mean to fill up the missing value in the 2n Cycle education level.

  • For SP, I will use median to fill up the missing value in the Graduation education level.

  • For US, I will use mean to fill up the missing value in the Graduation and Master education levels.

Following codes complete the compuations.

md2 <- md %>% 
  group_by(Country, Education) %>% 
  mutate(Income = ifelse(Country == "AUS" & is.na(Income), mean(Income, na.rm = T), Income),
         Income = ifelse(Country == "CA" & is.na(Income), median(Income, na.rm = T), Income),
         Income = ifelse(Country == "GER" & is.na(Income), mean(Income, na.rm = T), Income),
         Income = ifelse(Country == "IND" & is.na(Income), mean(Income, na.rm = T), Income),
         Income = ifelse(Country == "SP" & is.na(Income), median(Income, na.rm = T), Income),
         Income = ifelse(Country == "US" & is.na(Income), mean(Income, na.rm =T), Income)) %>% 
  ungroup()

Checking is there any more NA in the md2 dataset. The result shown 0 rows, meanning the conversion is successful.

# Checking is there any more NA in the md2 dataset
md2[!complete.cases(md2), ]  
## # A tibble: 0 x 28
## # ... with 28 variables: ID <dbl>, Year_Birth <dbl>, Education <fct>,
## #   Marital_Status <fct>, Income <dbl>, Kidhome <dbl>, Teenhome <dbl>,
## #   Dt_Customer <date>, Recency <dbl>, MntWines <dbl>, MntFruits <dbl>,
## #   MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>,
## #   MntGoldProds <dbl>, NumDealsPurchases <dbl>, NumWebPurchases <dbl>,
## #   NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## #   NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>,
## #   AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>, AcceptedCmp2 <dbl>, Response <dbl>,
## #   Complain <dbl>, Country <fct>

Checking the 24 rows that had no income data in the beginning. They have now been filed up with newly computed income data based on mean or median, based on country and education level. Following result shows that the computation has been successful.

# Getting the 24 rows that had missing values

NA_locate <- which(!complete.cases(md))

# Checking newly computed Income. 

md2_check <- md2 %>% dplyr::select(Country, Education, Income)
md2_check[NA_locate,]
## # A tibble: 24 x 3
##    Country Education  Income
##    <fct>   <fct>       <dbl>
##  1 GER     PhD        57845.
##  2 US      Graduation 52576.
##  3 AUS     PhD        52770.
##  4 AUS     Graduation 53555.
##  5 CA      PhD        59187 
##  6 GER     2n Cycle   43183 
##  7 US      Master     54154.
##  8 GER     Graduation 55840.
##  9 AUS     2n Cycle   45892.
## 10 AUS     Master     51995.
## # ... with 14 more rows

5.4 Create Age Column

Since there is year of birth recorded in the dataset, an “age” column can be pre-prepared just in case if this information is required in the later stages. I compute the age by using the enrollment rate minus the year of birth and synthese a column named “Age_atRegister”. A column “age_group” has also been synthesised base on the “Age_atRegister” column.

Following code complete the creation (click the right button).

# set up the data frame

md2 <- md2 %>% 
  mutate(age = year(Dt_Customer) - Year_Birth + 1,
         age_group = case_when(
           age < 1 ~ "baby",
           age > 1 & age < 14 ~ "youth",
           age > 15 & age < 24 ~ "young.adult",
           age > 25 & age < 44 ~ "middle.adult",
           age > 45 & age < 65 ~ "older.adult",
           TRUE ~ "senior"
         )) %>% 
  relocate(age, .after = Year_Birth) %>% 
  relocate(age_group, .after = age) %>% 
  rename("Age_atRegister" = "age")

# factorise the "age group" column and rearrange the sequence of levels

md2 <- md2 %>% mutate(age_group = factor(age_group, 
                                         levels = c("young.adult", "middle.adult", 
                                                    "older.adult", "senior")))

5.5 Data processing summary

  • I have converted the variables Education, Marital_Status, and Country into factor.

  • I have converted the Dt_Customer from character into date type.

  • I have removed the dollar and comma signs in the Income column and converted it into numerical type.

  • There are 24 out of 2240 rows of missing values in the Income column, I have filled them up with mean and median, with consideration of respective country and education level.

  • I created two new columns Age_atRegister and age_group.

The dataset is now analysis-ready.

6 EXPLORATORY DATA ANALYSIS (EDA)

6.1 What does the average customer look like for this company?

In section 6.1, I will explore customer information of the company by age, country, education, marital status, family size, and income.

6.1.1 Age

The company has customers from 8 different countries, and most of their customers are aged between 30 to 55. The body of boxplots represent 50% of the data. Boxplots of each country overlap each other indicating that there is no difference in age among countries.

  • The country “SP” has the most customers followed by “SA” and “CA”. The country “ME” has the least customers with only data from 3 customers.

  • All countries have similar spread in customer age.

# set up df

df611 <- md2 %>% 
  group_by(Country) %>% 
  mutate(x_lab = paste0(Country, "\n", "(n= ", n(), ")"))
  
# plot

ggplot(df611, aes(x = x_lab, y = Age_atRegister, fill = Country)) +
  geom_hline(yintercept = 30, linetype = 2, color = "blue") +
  geom_hline(yintercept = 55, linetype = 2, color = "blue") +
  geom_boxplot(outlier.shape = NA, alpha = 0.3) + 
  stat_boxplot(geom = "errorbar") +
  geom_jitter(aes(fill = Country) , shape = 21, alpha = 0.4, width = 0.2) +
  scale_y_continuous(lim = c(0, 80), breaks = seq( 0, 80, 10)) +
  theme(legend.position = "NA",
        axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
        plot.title = element_text(face = "bold")) +
  labs(title = "Figure 1. Distribution of Customer's Age from Different Country",
       y = "Age at Registration") 

6.1.2 Country

Following pi-chart shows that nearly half of the customers are from “SP”, which is Spain. The next largest pool of customer is from SA (South Africa) at 15%, then followed by the third CA (Canada) at 12%.

# produce dataframe

df_coun <- md2 %>% 
  group_by(Country) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count / sum(count)*100, 1),
         per = paste0(per, "%"))

# plot

ggplot(df_coun, aes(x = "", y = count, fill = Country)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "y",
              start = 50) +  
  geom_text(aes(label = paste0(Country, " (", per, ")")), 
            position = position_stack(vjust = 0.5),
            size = 4) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        plot.title = element_text(face = "bold")) +
  labs(title = "Figure 2. Distribution of Customers from Different Countries")

6.1.3 Education

  • Overall, most customers have a “Graduation” education level (50.3%).

  • “Graduation”, “PHD” and “Master” customers generally ranked the first 3 most important customer group in all country.

# Set up data frame

df613.1 <- md2 %>%
  group_by(Education) %>% 
  summarise(Count = n()) %>% 
  mutate(per = round(Count/sum(Count)*100, 1),
         per = paste(per,"%") )

# plot

plot_df613.1 <- ggplot(df613.1, aes(x = reorder(Education, -Count), y = Count, fill = Education)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = per, vjust = -1.2)) +
  scale_y_continuous(lim = c(0, 1200), breaks = seq(0, 1200, 200)) +
  theme(legend.position = "none",
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 50, vjust = 0.75)) +
  labs(x = "Education",
       title = "Figure 3. Education") 


# set up faceted data frame

df613.2 <- md2 %>%
  group_by(Country, Education) %>% 
  summarise(Count = n()) %>% 
  mutate(per = round(Count/sum(Count)*100, 1),
         per = paste0(per, "%"))

# plot

plot_df613.2 <- ggplot(df613.2, aes(x = reorder(Education, -Count), y = Count, fill = Education)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = per, vjust = -1.2), size = 3) +
  scale_y_continuous(lim = c(0, 1200), breaks = seq(0, 1200, 200)) +
  theme(legend.position = "none",
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.text.x = element_text(angle = 50, vjust = 0.75),
        plot.title = element_text(face = "bold")) +
  labs(x = "Education",
       title = "Figure 4. Education By Countries") +
  facet_wrap(~ Country)
      

# Combine

grid.arrange(plot_df613.1, plot_df613.2,
                         layout_matrix = rbind(c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2)))

6.1.4 Marital Status

Nearly 40% of the customers are married, 25.6% staying together, and 21.3% are single.

# set up dataframe

df614 <- md2 %>% 
  group_by(Marital_Status) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 1),
         per = paste0(per, "%"))


# plot

plot_df614.1 <- ggplot(df614, aes(x = reorder(Marital_Status, -count), y = count, fill = Marital_Status)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = per, vjust = -1.2)) +
  scale_y_continuous(lim = c(0, 1200), breaks = seq(0, 1200, 200)) +
  theme(legend.position = "none",
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 50, vjust = 0.75)) +
  labs(x = "Marital_Status",
       title = "Figure 5. Marital_Status")


# set up dataframe2

df614.2 <- md2 %>% 
  group_by(Country, Marital_Status) %>% 
  summarise(count = n()) %>% 
  mutate(per = round(count/sum(count)*100, 1),
         per = paste0(per, "%"))


# plot 

plot_df614.2 <- ggplot(df614.2, aes(x = reorder(Marital_Status, -count), y = count, fill = Marital_Status)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = per, vjust = -1.2), size = 3) +
  scale_y_continuous(lim = c(0, 1200), breaks = seq(0, 1200, 200)) +
  theme(legend.position = "none",
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 50, vjust = 0.75)) +
  labs(x = "Marital_Status",
       title = "Figure 6. Marital_Status by countries") +
  facet_wrap(~ Country)


# Combine

grid.arrange(plot_df614.1, plot_df614.2,
                         layout_matrix = rbind(c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2),
                                   c(1,1,1,2,2,2,2,2)))

6.1.5 Customer’s Family size

  • Only 28.5% of customers have no children (either at least a kid or teen in the family).

  • 71.5% of customers have at least 1 kid or 1 teen in the family.

# set up df

df615_0 <- md2 %>% 
  mutate(Familysize = paste0(Kidhome, " Kid ", Teenhome, " Teen")) %>% 
  dplyr::select(Familysize)

df615_1 <- df615_0 %>% 
  group_by(Familysize) %>% 
  summarise(Count = n()) %>% 
  mutate(per = round(Count/sum(Count)*100, 1),
         per = paste(per, "%"),
         Familysize = factor(Familysize))

# plot

ggplot(df615_1, aes(x = reorder(Familysize, -Count), y = Count, fill = Familysize)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = per), size = 4, vjust = -0.8) +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 25, vjust = 0.5),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
  scale_y_continuous(lim = c(0, 800)) +
  labs(title = "Figure 7. Bar Chart of Family Size among Customers",
       x = "Family Size")

6.1.6 Income

The average and median income of customer is $52,250 and $51,550. The mean and median income is not very far away from each other, I would assume a normal distribution around the mean and median. The maximum and minimum incomes might be outliers.

summary(md2$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1730   35539   51550   52250   68290  666666
  • I am 95% confidence that the true mean is falling between $51,212.46 and $53,287.48.
t.test(md2$Income)
## 
##  One Sample t-test
## 
## data:  md2$Income
## t = 98.759, df = 2239, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  51212.46 53287.48
## sample estimates:
## mean of x 
##  52249.97
  • 50% of customers have income falls approximately between $35k to $70k.
ggplot(md2, aes(x = Income)) +
  geom_boxplot(fill = "purple", alpha = 0.5, size = 1, width = 0.1)  +
  scale_x_continuous(lim = c(0, 200000),
                     breaks = seq(0, 200000, 20000),
                     labels = function(x)paste0("$", {x/1000}, "k")) +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "Figure 8. Boxplot to show income distribution",
       subtitle = "A outlier point at $666,666 has been removed")

* Income among education levels are not obviously different from each other except the customers from basic education level, they have the lowest income.

ggplot(md2, aes(y = Income, x = age_group, fill = Education)) +
  geom_boxplot(alpha = 0.5, size = 1)  +
  scale_y_continuous(lim = c(0, 200000),
                     breaks = seq(0, 200000, 20000),
                     labels = function(x)paste0("$", {x/1000}, "k")) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 90, vjust = -0.15),
        axis.title.x = element_text(vjust = -2),
        legend.position = "none") +
  labs(title = "Figure 8. Boxplot to show income distribution",
       subtitle = "A outlier point at $666,666 has been removed",
       x = "Age Group") +
  facet_wrap(~ Education, nrow = 1) 

6.2 Which products are performing the best?

Products of the company include wines, fruits, meat, fish, sweet, and gold. Wine is the best selling products of the company which contributing 50.15% to the total revenue, followed by 27.56% of meat and then the 7.27% of gold.

# set up df

df6.2 <- md2 %>% 
  dplyr::select(MntWines, MntFruits, MntMeatProducts, 
         MntFishProducts, MntSweetProducts, MntGoldProds) %>% 
  rename("Wine" = "MntWines",
         "Fruit" = "MntFruits",
         "Meat" = "MntMeatProducts",
         "Fish" = "MntFishProducts",
         "Sweet" = "MntSweetProducts",
         "Gold" = "MntGoldProds") %>% 
  pivot_longer(c(Wine, Fruit, Meat, Fish, Sweet, Gold),
               names_to = "products",
               values_to = "amount_spent") %>% 
  group_by(products) %>% 
  summarise(total_revenue = sum(amount_spent)) %>% 
  mutate(per = round(total_revenue/sum(total_revenue)*100, 2),
         per = paste0(per, "%"),
         lbl = paste0("$", prettyNum(total_revenue,big.mark = ","), "\n", "(", per, ")"))

  
# plot

ggplot(df6.2, aes(x = reorder(products, -total_revenue), 
                  y = total_revenue, fill = products, colour = products)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(aes(label = lbl), colour = "black", vjust = -0.3) +
  theme_classic() +
  theme(legend.position = "none") +
  scale_y_continuous(lim = c(0, 800000),  
                     labels = function(x)paste0("$", {x/1000}, "k")) +
  labs(title = "Figure 9. Performance of Each Product Category",
       x = "Product Category",
       y = "Total Revenue")

6.3 Which marketing campaign is most successful?

There are actually 6 marketing campaigns in the dataset, including the 5 campaigns that have already been launched and the last marketing campaign is actually the column “Response”. Refer to section 4.2.

The last marketing campaign was the most successful campaign with acceptance rate at 14.91%. It is a value that is nearly doubling the acceptance rate of most of the other marketing campaigns launched.

# Set up data frame 

df6.3 <- md2 %>%
  dplyr::select(ID, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response) %>% 
  pivot_longer(c(AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response),
               names_to = "Campaign",
               values_to = "Reaction")  # Reaction 1 = accepted the offer, 0 means otherwise

df6.3.2 <- df6.3 %>% 
  group_by(Campaign) %>% 
  summarise(Count = n(),
            Acceptance_rate = sum(Reaction)/Count*100,
            Acceptance_rate2 = paste0(round(sum(Reaction)/Count*100, 2), "%")) %>% 
  mutate(Campaign = fct_recode(Campaign,
         "First" = "AcceptedCmp1",
         "Second" = "AcceptedCmp2",
         "Third" = "AcceptedCmp3",
         "Fourth" = "AcceptedCmp4",
         "Fifth" = "AcceptedCmp5",
         "Last" = "Response"))
  

# plot

ggplot(df6.3.2, aes(x = Campaign, y = Acceptance_rate, group = 1)) +
  geom_point(stat = "identity", size = 3.5, colour = "purple") +
  geom_path(linetype = 2, size = 1, colour = "purple") +
  geom_label_repel(aes(label = Acceptance_rate2), nudge_y = 2) +
  theme_classic() +  
  scale_y_continuous(lim = c(0, 20), breaks = seq(0, 20, 4)) +
  labs(title = "Figure 10. Accentance Rates of Each Marketing Campaign",
       y = "Acceptance Rate (%)") +
  theme(axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
        plot.title = element_text(face = "bold"))

6.4 Which channels are underperforming?

There are 3 sales channels recorded in the given dataset, the company website, catalog, and in-store purchase. Catalog was the most underperforming sale channel, followed by purchases made through company web. Most purchases were made in store.

# set up dataframe for this section

df6.4 <- md2 %>% 
  dplyr::select(NumWebPurchases, NumCatalogPurchases, NumStorePurchases) %>% 
  rename("Web" = "NumWebPurchases",
         "Catalog" = "NumCatalogPurchases",
         "In_store" = "NumStorePurchases") %>% 
  pivot_longer(c(Web, Catalog, In_store),
               names_to = "channels",
               values_to = "No_of_purchases") 

df6.4 <- df6.4 %>% 
  group_by(channels) %>% 
  summarise(total_no_of_purchases = sum(No_of_purchases))

# plot the graph

ggplot(df6.4, aes(x = reorder(channels, -total_no_of_purchases), y = total_no_of_purchases)) +
  geom_bar(stat = "identity", fill = "white", aes(colour = channels), size = 1) +
  geom_text(aes(label = channels), 
            vjust = -2.5, fontface = "bold") +
   geom_text(aes(label = paste0("Count: ", prettyNum(total_no_of_purchases, big.mark = ","))), 
            vjust = -1) +
  theme_minimal() +
  labs(title = "Figure 11. Comparison of Sale Channels",
       x = "Sale Channels",
       y = "Total No. of Purchases") +
  theme(plot.margin = unit(c(1,1,1,1), "cm"),
        plot.title = element_text(vjust = 2, hjust = 0.5, face = "bold"),
        legend.position = "none") +
  scale_y_continuous(lim = c(0, 15000),
                     breaks = seq(0, 15000, 2000))

7 STATISTICAL ANALYSIS

7.1 Testing for multicollinearity

It is a preliminary step before statistical analysis starts at section 7.2. This section is to check correlations between numerical independent variables within the dataset to avoid significant pairs being used in the same model, as it would make coefficient estimates unstable and creating invalid statistical results. It is known as multicollinearity, detail.

7.1.1 Correlation assessment

The rule of thumb is that if correlation is greater than 0.8 between two independent variables, then multicollinearity would exist. Following table successfully present the correlation between each numerical variables in the dataset.

One can just look for correlation value that is 0.8 but not equal to 1, then refer to the variable pair that associated with this value.

# Filter non-numerical and binary variable, to select only numerical variables (predictors) to test relationships among themselve. 

df7.1 <- md2 %>%  dplyr::select(Year_Birth, Age_atRegister, Income, Kidhome, Teenhome, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth, Response) %>% 
  ungroup()

# Correlation test of each pair

datatable(round(cor(df7.1),2))

Though inspecting the table visually is an option, but I would use code to done the job for me. Following codes (click the right “code” button) help me to check the presence of correlation values that are higher than 0.8, and not 1.

  • There is no multicollinearity in the dataset.
# Checking.

cor.table <- data.frame(round(cor(df7.1),2))

cor.table <- cor.table %>% 
  rownames_to_column(var = "x1") %>% 
  pivot_longer(c(2:18),
               names_to = "x2",
               values_to = "correlation")

# filter to get correlation that is higher than 0.8 but is not 1

cor.table %>% filter(correlation > 0.8 & correlation < 1)
## # A tibble: 0 x 3
## # ... with 3 variables: x1 <chr>, x2 <chr>, correlation <dbl>

7.1.2 Variance Inflation factor (VIF) assessment

As a rule of thumb, VIF should be less than 4 to have no multicollinearity issue (the Pennsylvania State University 2018). The result should be the same as previous correlation analysis. Apart from the first two variables “Year_Birth” and “Age_atRegister”, there is no multicollinearity exists in all variables.

model_vif <- lm(NumStorePurchases ~. , data = df7.1)
vif(model_vif)
##          Year_Birth      Age_atRegister              Income             Kidhome 
##          378.833154          379.086706            2.159072            1.818572 
##            Teenhome            MntWines           MntFruits     MntMeatProducts 
##            1.604499            2.385003            1.908478            2.909310 
##     MntFishProducts    MntSweetProducts        MntGoldProds   NumDealsPurchases 
##            2.068151            1.893030            1.491667            1.591688 
##     NumWebPurchases NumCatalogPurchases   NumWebVisitsMonth            Response 
##            1.871632            2.960579            2.403747            1.147800

The first two variables are essentially the same thing, I created “age_atRegister” from the “Year_Birth”. I will only use one of them in any model because they represent the same information.

7.3 Does US’s purchasing power significantly better than the rest of the world in terms of total purchases ?

7.3.1 Comparing total purchases PER CUSTOMER in different country**

A column of new variable “total_purchases” will be synthesied using “NumWebPurchases”, “NumCatalogPurchases”, and “NumStorePurchases”. Following codes create the new variable (click the right “code” button).

# set up data frame

df7.3 <- md2 %>% 
  dplyr::select(Country, NumWebPurchases, NumCatalogPurchases, NumStorePurchases) %>% 
  mutate(total_purchases = NumWebPurchases + NumCatalogPurchases + NumStorePurchases) %>% 
  dplyr::select(-NumWebPurchases, -NumCatalogPurchases, -NumStorePurchases)

Building the model

Following codes build the model.

model7.3 <- aov(total_purchases ~ Country, data = df7.3)

7.3.2 Assumption tests

Normality test with Q-Q plot

The assumption of residuals normality is violated based on following Q-Q plot.

df7.3$residuals <-  model7.3$residuals

ggplot(df7.3, aes(sample = residuals)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  labs(title = "Figure 12. QQ-plot",
       subtitle = "Shaded region = 95% confidence interval for Normality") +
  theme(plot.title = element_text(face = "bold"))

Variance test with Levene’s test

Levene’s test shows that the variance of total purchase in each country are the similar, without significant differences with a p-value of larger than 0.05.

leveneTest(df7.3$total_purchases, df7.3$Country)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    7  1.1192 0.3479
##       2232

7.3.4 Kruskal Wallis statistical test

A kruskal wallies test is selected as the omnibus test. The result shows that there is no significant difference (P-value = 0.4004) among countries in term of total purchases per customer.

Therefore, we have not enough significant evidence to say that customer in a country can purchase more items than another country in this dataset. Customers located in a country that has less customers from the company is able to purchase similar or more items than a country that has many customers from the company.

kruskal.test(df7.3$total_purchases, df7.3$Country)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df7.3$total_purchases and df7.3$Country
## Kruskal-Wallis chi-squared = 7.2796, df = 7, p-value = 0.4004

7.3.5 Visualisation

In term of total purchase by count, US ranked the second last country.

df7.3.2 <- df7.3 %>% 
  group_by(Country) %>% 
  summarise(total_purchases_country = sum(total_purchases))


ggplot(df7.3.2, aes(x = reorder(Country, -total_purchases_country), y = total_purchases_country, fill = Country)) +
  geom_bar(stat = "identity") +  
  scale_fill_manual(values = c("darkgreen", "darkgreen", "darkgreen", "darkgreen",
                               "darkgreen", "darkgreen", "darkgreen", "orange")) +      # the location of US level is 7th
  geom_text(aes(label = total_purchases_country), vjust = -1) +
  scale_y_continuous(lim = c(0, 15000)) +
  labs(x = "Country",
       y = "Total Purchases (Count)",
       title = "Figure 13. Comparison of Total purchases in each Country") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

However, US ranked the second in term of average purchase per customer (figure 14). It suggests that US customers have slightly better purchasing power than the rest of the World in terms of total purchases, but not statistical significance.

df7.3.3 <- df7.3 %>% 
  group_by(Country) %>% 
  summarise(count = n(),
            total_purchase_country = sum(total_purchases),
            average_purchase_customer = round(total_purchase_country/count, 2)) %>% 
  mutate(x_label = paste0(Country, "\n", "(n = ", count, ")"))

ggplot(df7.3.3, aes(x = x_label, y = average_purchase_customer, colour = Country)) +
  geom_point(shape = 21, size = 5) +
  geom_text_repel(aes(label = average_purchase_customer), fontface = "bold") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", vjust = 2),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
  scale_colour_manual(values = c("darkgreen", "darkgreen", "darkgreen", "darkgreen",
                               "darkgreen", "darkgreen", "darkgreen", "orange")) +
  labs(title = "Figure 14. US Customer Buy More in Average than Most of Country",
       x = "Country",
       y = "Average purchase per Customer")

7.4 Are gold lovers prefer to shop in store?

One of the supervisors associated with the dataset insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. I will justify or refute this statement using an appropriate statistical test in this section.

The average (mean) amount spent on gold is 44.02. Therefore, the supervisor is claiming that any customer who spent above this amount of money on gold will have more in store purchases.

# set up data frame

df7.4 <- md2 %>% 
  dplyr::select(MntGoldProds, NumStorePurchases)

# average 

summary(df7.4$MntGoldProds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   24.00   44.02   56.00  362.00

7.4.1 Visualisation

Following graph shows that many customers who have spent above average amount on gold "“Above_average”" purchased more items in the store than who spent less than the average on gold. The mean and median of the group “Above_average” are higher than the group “Below_average”. However, is it statistically significant?

# set up dataframe

df7.4 <- df7.4 %>% 
  mutate(
    Group = case_when(
    MntGoldProds > 44.02 ~ "Above_average", 
    TRUE ~ "Below_average"),
    Group = factor(Group)
    ) %>% 
  group_by(Group) %>% 
  mutate(count = n(),
         xlab = paste0(Group, "\n", "(n = ", count, ")"))


# plot

ggplot(df7.4, aes(x = xlab, y = NumStorePurchases, colour = xlab)) +
  geom_jitter(alpha = 0.5) +
  geom_boxplot(alpha = 0.5, colour = "black", size = 1.5, outlier.shape = NA) +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", vjust = 2),
        plot.subtitle = element_text(vjust = 2),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
  labs(title = "Figure 15. Number of Store Purchase vurses 2 Categories of Gold Buyer",
       subtitle = "Gold buyers categorised into 1) Who spend above, (2) and who spend below average amount on gold",
       y = "Number of Store Purchases (Count)",
       x = "Gold Buyer") +
  stat_boxplot(geom = "errorbar", size = 1.5, colour = "black") +
  stat_summary(fun = "mean", shape = 4, colour = "black", size = 3, stroke = 2)

It can be argued that “Below_average” has a lot more data points than “Above_average”, the comparison might be unfair. To make it become a fairer comparison, I randomly select 694 samples from the 1546 samples in the group of “Below_average”, and creating following figure.

# set up data frame 

df7.4.2 <- df7.4 %>%
  dplyr::select(Group, NumStorePurchases) %>% 
  group_by(Group) %>% 
  sample_n(694) %>% 
  mutate(count = n(),
         xlab = paste0(Group, "\n", "(n = ", count, ")"))

# plot

ggplot(df7.4.2, aes(x = xlab, y = NumStorePurchases, colour = xlab)) +
  geom_jitter(alpha = 0.5) +
  geom_boxplot(alpha = 0.5, colour = "black", size = 1.5, outlier.shape = NA) +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", vjust = 2),
        plot.subtitle = element_text(vjust = 2),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
  labs(title = "Figure 16. Number of Store Purchase vurses 2 Categories of Gold Buyer",
       subtitle = "Gold buyers categorised into 1) Who spend above, (2) and who spend below average amount on gold",
       y = "Number of Store Purchases (Count)",
       x = "Gold Buyer") +
  stat_boxplot(geom = "errorbar", size = 1.5, colour = "black") +
  stat_summary(fun = "mean", shape = 4, colour = "black", size = 3, stroke = 2)

Now, two groups have equal sample size. Reduction of sample size from 1546 to 694 in the group of “Below_average” leads to a slight reduction of mean by 0.105. Still, the interpretation of the graph remains the same as figure 15.

df7.4 %>% group_by(Group) %>% summarise(ave = mean(NumStorePurchases))
## # A tibble: 2 x 2
##   Group           ave
##   <fct>         <dbl>
## 1 Above_average  7.77
## 2 Below_average  4.90
df7.4.2 %>% group_by(Group) %>% summarise(ave = mean(NumStorePurchases))
## # A tibble: 2 x 2
##   Group           ave
##   <fct>         <dbl>
## 1 Above_average  7.77
## 2 Below_average  4.87
4.903622 - 4.798271
## [1] 0.105351

7.4.2 Building a model

model742 <- aov(NumStorePurchases ~ Group, data = df7.4.2)
summary(model742)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Group          1   2902  2902.1   332.3 <2e-16 ***
## Residuals   1386  12104     8.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.4.3 Assumption tests

Normality test with Q-Q plot and Shapiro-Wilk test

Large amount of data points are out of the shaded region of following Q-Q plot, and do not form and match the straight line. It suggests that the data residuals are not normality distributed.

df7.4.2$residuals <-  model742$residuals

ggplot(df7.4.2, aes(sample = residuals)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  labs(title = "Figure 16. QQ-plot",
       subtitle = "Shaded region = 95% confidence interval for Normality") +
  theme(plot.title = element_text(face = "bold"))

A shapiro-wilk test is also carried out and show the same result that the residuals in the dataset are not normally distributed.

by(df7.4.2$NumStorePurchases, df7.4.2$Group, shapiro.test)
## df7.4.2$Group: Above_average
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.96161, p-value = 1.674e-12
## 
## ------------------------------------------------------------ 
## df7.4.2$Group: Below_average
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.84965, p-value < 2.2e-16

Variance test with Levene’s test

The variance between two gold buyer groups is also significant different based on following Levene’s test (P-value < 0.05). The variance between both groups are not equal.

leveneTest(df7.4.2$NumStorePurchases, df7.4.2$Group)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1  20.596 6.161e-06 ***
##       1386                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.4.4 Mann-Whitney test

For testing groups that do not have their data points normality distributed and do not have their variances equal to each other, Mann-Whitney test is selected.

wilcox.test(df7.4.2$NumStorePurchases ~ df7.4.2$Group)          
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df7.4.2$NumStorePurchases by df7.4.2$Group
## W = 371797, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# default of wilcox.test is for unpaired test, non-normally distributed, unequal variances pair.

Statistical result from the test supports the supervisor’s claim that people who spent an above average amount on gold in the last 2 years would have more in store purchases, with a p-value of less than 0.05.

7.5 Consumption of married PhD candidates on Fish

Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do “Married PhD candidates” have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish?

7.5.1 Married PhD candidates vs others in amount spent on fish

Following summary shows that the sample size between the married PhD group and the other customer groups have a substantial different in total sample count. To make the comparison fairer, I will randomly select 192 samples from the 2048 samples of the “Others” group.

# set up data frame

df751 <- md2 %>% 
  dplyr::select(Education, Marital_Status, MntFishProducts) %>%
  mutate(group = paste0(Education, Marital_Status)) %>% 
  mutate(group2 = case_when(
    group == "PhDMarried" ~ "Married_PhD_candidates",
    TRUE ~ "Others"
  ),
  group2 = as.factor(group2))

# summarise sample count

df751 %>% 
  group_by(group2) %>% 
  summarise(count = n()) 
## # A tibble: 2 x 2
##   group2                 count
##   <fct>                  <int>
## 1 Married_PhD_candidates   192
## 2 Others                  2048

Random Sampling

Following codes complete the random sampling of 192 samples from the “Others” group.

df751_192 <- df751 %>% 
  dplyr::select(group2, MntFishProducts) %>% 
  group_by(group2) %>% 
  sample_n(192) 

Visualisation

Figure belows show that the mean and median of the group “Others” are slightly higher than the group of married PhD.

# set up x label

df751_192 <- df751_192 %>% 
  group_by(group2) %>% 
  mutate(count = n(),
         label_x = paste0(group2, "\n", "(n =", count, ")"))

# plot

ggplot(df751_192, aes(x = label_x, y = MntFishProducts, fill = group2)) +
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.9, size = 1) +
  stat_summary(fun = "mean", shape = 4, size = 2, colour = "blue", stroke = 2) +
  labs("Figure 17. Married PhD Candidates have lesser mean and median") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        axis.title.x = element_blank(),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
  labs(y = "Amount spent on fish",
       title = "Figure 17. Married PhD Candidates spent less on Fish",
       subtitle = "In term of mean (x sign) and median (the thicker line in boxplots)")

Model building

Codes below build the model for this section (click the right button).

model_192 <- aov(MntFishProducts ~ group2, data = df751_192)

Normality testing with Shapiro-Wilk test

Shapiro-Wilk test suggests that data between the two groups are not normally distributed.

by(df751_192$MntFishProducts, df751_192$group2, shapiro.test)
## df751_192$group2: Married_PhD_candidates
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.69269, p-value < 2.2e-16
## 
## ------------------------------------------------------------ 
## df751_192$group2: Others
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.75365, p-value < 2.2e-16

Variance test with Levene’s test

Levene’s test concludes that the spread of variance between the two groups of data are not equal, with a P-value of lower than 0.05, and rejects the null hypothesis that the variances between the groups are equal.

leveneTest(df751_192$MntFishProducts, df751_192$group2)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   1  10.167 0.001548 **
##       382                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Mann-Whitney test

Base on above assumption tests, Mann-Whitney test is selected as the two-sample test, it is written as wilcox.test in R as below.

  • The P-value of 0.0003503 concludes that married PhD candidates spent significantly less than other customers on fish.
wilcox.test(df751_192$MntFishProducts ~ df751_192$group2)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df751_192$MntFishProducts by df751_192$group2
## W = 14774, p-value = 0.0007012
## alternative hypothesis: true location shift is not equal to 0

7.6 Is there a significant relationship between geographical regional and success of a campaign?

Graph below shows that most campaigns across countries have low acceptance rate and are mostly less than 15%. The country “ME” has only 3 customers, the sample size is too small to represent the acceptance rate of the country. Other countries that have minimum of 100 customers.

Visualisation

# set up dataframe

df76 <- md2 %>% 
  dplyr::select(Country, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response)

df76_viz <- df76 %>%
  rename("Camp1" = AcceptedCmp1,
         "Camp2" = AcceptedCmp2,
         "Camp3" = AcceptedCmp3,
         "Camp4" = AcceptedCmp4,
         "Camp5" = AcceptedCmp5,
         "LastCamp" = Response) %>% 
  pivot_longer(c(Camp1, Camp2, Camp3, Camp4, Camp5, LastCamp),
               names_to = "Cmp", 
               values_to = "result") %>% 
  mutate(Cmp = as.factor(Cmp)) %>% 
  group_by(Country, Cmp) %>% 
  summarise(count = n(), 
            rate = round(sum(result)/count*100, 1)) %>% 
  mutate(xlabel = paste0(Country, "\n", "(n =", count, ")"),
         barlabel = paste0(rate, "%"))
  
# plot

ggplot(df76_viz, aes(x = xlabel, y = rate, fill = Cmp)) +
  geom_hline(yintercept = 15, linetype = 2, colour = "grey") +
  geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
  theme_bw() +
  theme(legend.position = c(0.1, 0.75),
        legend.background = element_rect(colour = "black"),
        plot.title = element_text(face = "bold", vjust = 2),
        axis.title.y = element_text(margin = margin(0, 8, 0, 0))) +
  scale_y_continuous(labels = function(x){paste0(x, "%")},
                     lim = c(0, 70)) +
  labs(title = "Figure 18. Acceptance Rate of Marketing Campaigns Across Countries",
       fill = "Campaign",
       y = "Acceptance Rate",
       x = "Country") 

Generalised Linear Models (GLM)

Logistic regression is performed using GLM’s binomial family on the effect of countries to 6 respective marketing campaign.

The results show that there is no significant relationship between geographical regions and the success of a campaign, with a P-value of all countries in all marketing campagins being higher than 0.05. It indicates insufficient evidence to reject the null hypothesis that geographical regionas has no relation to the success of marketing campaign.

  1. Compare campaign 1 across countries.
  • Result: There is not enough evidence to reject the null hypothesis and I conclude that in campaign 1, there is no effect of country on campaign success. P value of all countries are higher than 0.05.
# Set up data frame

df76_2 <- df76 %>% 
  rename("Camp1" = AcceptedCmp1,
         "Camp2" = AcceptedCmp2,
         "Camp3" = AcceptedCmp3,
         "Camp4" = AcceptedCmp4,
         "Camp5" = AcceptedCmp5,
         "LastCamp" = Response) %>%
  mutate_if(is.double, as.factor)

# Use the binomial family of GLM to make the glm model carry out logistic regression 

model1 <- glm(Camp1 ~ Country, data = df76_2, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = Camp1 ~ Country, family = "binomial", data = df76_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3844  -0.3844  -0.3729  -0.3498   2.5017  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.08453    0.38651  -7.980 1.46e-15 ***
## CountryCA     0.45344    0.45711   0.992    0.321    
## CountryGER    0.30305    0.54873   0.552    0.581    
## CountryIND    0.08168    0.54712   0.149    0.881    
## CountryME   -11.48154  509.65227  -0.023    0.982    
## CountrySA     0.32136    0.45005   0.714    0.475    
## CountrySP     0.51662    0.40398   1.279    0.201    
## CountryUS     0.40547    0.54959   0.738    0.461    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1068.9  on 2239  degrees of freedom
## Residual deviance: 1065.4  on 2232  degrees of freedom
## AIC: 1081.4
## 
## Number of Fisher Scoring iterations: 13
  1. Compare campaign 2 across countries.
  • Result: There is not enough evidence to reject the null hypothesis and I conclude that in campaign 2, there is no effect of country on campaign success. P value of all countries are higher than 0.05.
model2 <- glm(Camp2 ~ Country, data = df76_2, family = "binomial") 
summary(model2)
## 
## Call:
## glm(formula = Camp2 ~ Country, family = "binomial", data = df76_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2128  -0.1716  -0.1716  -0.1545   2.9779  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.957e+01  8.502e+02  -0.023    0.982
## CountryCA    1.579e+01  8.502e+02   0.019    0.985
## CountryGER   1.549e+01  8.502e+02   0.018    0.985
## CountryIND   1.528e+01  8.502e+02   0.018    0.986
## CountryME   -3.570e-09  6.267e+03   0.000    1.000
## CountrySA    1.514e+01  8.502e+02   0.018    0.986
## CountrySP    1.535e+01  8.502e+02   0.018    0.986
## CountryUS   -3.406e-09  1.336e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 318.38  on 2239  degrees of freedom
## Residual deviance: 309.41  on 2232  degrees of freedom
## AIC: 325.41
## 
## Number of Fisher Scoring iterations: 18
  1. Compare campaign 3 across countries.
  • Result: There is not enough evidence to reject the null hypothesis and I conclude that in campaign 3, there is no effect of country on campaign success. P value of all countries are higher than 0.05.
model3 <- glm(Camp3 ~ Country, data = df76_2, family = "binomial") 
summary(model3)
## 
## Call:
## glm(formula = Camp3 ~ Country, family = "binomial", data = df76_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9005  -0.3971  -0.3971  -0.3587   2.3992  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.8201     0.3431  -8.219   <2e-16 ***
## CountryCA     0.1890     0.4211   0.449   0.6536    
## CountryGER    0.4222     0.4763   0.886   0.3754    
## CountryIND    0.4797     0.4495   1.067   0.2859    
## CountryME     2.1269     1.2719   1.672   0.0945 .  
## CountrySA     0.1088     0.4105   0.265   0.7909    
## CountrySP     0.3192     0.3616   0.883   0.3774    
## CountryUS     0.2844     0.5026   0.566   0.5715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1168.1  on 2239  degrees of freedom
## Residual deviance: 1164.2  on 2232  degrees of freedom
## AIC: 1180.2
## 
## Number of Fisher Scoring iterations: 5
  1. Compare campaign 4 across countries.
  • Result: Apart from country “Ca”, there is not enough evidence to reject the null hypothesis and I conclude that in campaign 4, there is no effect of country on campaign success. P value of all countries are higher than 0.05.

  • Most countries have no relation to the success of marketing campaign. Only Canada has a week significant P value of 0.0478 that very close to 0.05 to reject the null hypothesis.

model4 <- glm(Camp4 ~ Country, data = df76_2, family = "binomial") 
summary(model4)
## 
## Call:
## glm(formula = Camp4 ~ Country, family = "binomial", data = df76_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4385  -0.4118  -0.4118  -0.3498   2.5626  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.2452     0.4161  -7.799 6.26e-15 ***
## CountryCA     0.9261     0.4679   1.979   0.0478 *  
## CountryGER    0.9517     0.5227   1.821   0.0686 .  
## CountryIND    0.7231     0.5209   1.388   0.1651    
## CountryME   -11.3209   509.6523  -0.022   0.9823    
## CountrySA     0.4820     0.4757   1.013   0.3109    
## CountrySP     0.8201     0.4306   1.905   0.0568 .  
## CountryUS     0.4022     0.5912   0.680   0.4963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1188.4  on 2239  degrees of freedom
## Residual deviance: 1180.2  on 2232  degrees of freedom
## AIC: 1196.2
## 
## Number of Fisher Scoring iterations: 13
  1. Compare campaign 5 across countries.
  • Result: There is not enough evidence to reject the null hypothesis and I conclude that in campaign 5, there is no effect of country on campaign success. P value of all countries are higher than 0.05.
model5 <- glm(Camp5 ~ Country, data = df76_2, family = "binomial") 
summary(model5)
## 
## Call:
## glm(formula = Camp5 ~ Country, family = "binomial", data = df76_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4118  -0.4118  -0.4117  -0.3587   2.5320  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.425e+00  2.894e-01  -8.382   <2e-16 ***
## CountryCA   -3.938e-02  3.680e-01  -0.107    0.915    
## CountryGER  -2.136e-01  4.665e-01  -0.458    0.647    
## CountryIND  -7.386e-01  5.074e-01  -1.456    0.145    
## CountryME   -1.214e+01  5.097e+02  -0.024    0.981    
## CountrySA   -2.857e-01  3.668e-01  -0.779    0.436    
## CountrySP    3.822e-04  3.098e-01   0.001    0.999    
## CountryUS   -6.095e-01  5.416e-01  -1.125    0.260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1168.1  on 2239  degrees of freedom
## Residual deviance: 1161.6  on 2232  degrees of freedom
## AIC: 1177.6
## 
## Number of Fisher Scoring iterations: 13
  1. Compare the last marketing campaign across countries.
  • In the last Campaign, only the country “ME” is significantly different from other countries in term of relation to the acceptance rate, however the there are only 3 samples in this country to generate the statistical outcome, the sample size is too small, and the result can be inaccurate.

  • Result: In most country, there is not enough evidence to reject the null hypothesis and I conclude that in the last campaign, there is no effect of country on campaign success. P value of all countries are higher than 0.05.

modelLast <- glm(LastCamp ~ Country, data = df76_2, family = "binomial") 
summary(modelLast)
## 
## Call:
## glm(formula = LastCamp ~ Country, family = "binomial", data = df76_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4823  -0.5920  -0.5789  -0.5040   2.2056  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.78449    0.22534  -7.919 2.39e-15 ***
## CountryCA   -0.01601    0.28538  -0.056   0.9553    
## CountryGER  -0.01703    0.34541  -0.049   0.9607    
## CountryIND  -0.55584    0.36751  -1.512   0.1304    
## CountryME    2.47763    1.24530   1.990   0.0466 *  
## CountrySA    0.08324    0.27114   0.307   0.7588    
## CountrySP    0.13168    0.23989   0.549   0.5830    
## CountryUS   -0.21491    0.37164  -0.578   0.5631    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1886.8  on 2239  degrees of freedom
## Residual deviance: 1875.4  on 2232  degrees of freedom
## AIC: 1891.4
## 
## Number of Fisher Scoring iterations: 4

8 CONCLUSION

Conclusion for exploratory data analysis (EDA):

  • Most customers are in mature age between 30 to 55.

  • 51% of the customers are from Spain, which is the largest pool of customer.

  • 50.3% of customers have education level of “graduation” level.

  • 71.5% of customers have at least 1 children at home (being either 1 kid or 1 teen in the family).

  • 50% of customers have income falls approximately between $35k to $70k.

  • Wine is the best selling products of the company which contributing 50.15% to the total revenue, followed by 27.56% of meat and then the 7.27% of gold.

  • All 6 marketing campagins are not doing very well with acceptance rates of lower than 15%. The last marketing campaign was the most successful with acceptance rate 14.91%. The rest of all other campaigns have acceptance rate close to or below 7%.

  • In store sales is the most performing channels, followed by web, and then the last sale channel, catalog.

Conclusion for statistical analysis:

  • Number of in-store purchase has a significant positive relationship with the amount of wine, fruit, fish, and sweet sold, as well as deals, purchases made through web and the income of customers. Number of in-store purchase has a significant negative relationship with number of visits to company’s website by the customer, number of kid at home, sales through catelog and interestingly, the amount of meat sold.

  • Purchasing power of customers from the US do not significantly different from the rest of the world in terms of total purchase per customer, at a P-value of 0.4004. In term of total purchase, US ranked the second last country. However, purchasing power of US customers ranked the second in term of average purchase per customer (13.51).

  • People who spent an above average amount on gold in the last 2 years have more in store purchases, with a p-value of less than 0.05.

  • Married PhD candidates spent significantly less than other customer groups on fish, with a p-value of less than 0.05. Other factors that significantly related to amount spent on fish in a positive way are the amount spent on fruits, sweet, amd meat, numbers of purchases made through catalog and store, marital status of absurd, country (ME), and Education of 2n_Cycle. Other factors that significantly related to amount spents on fish in a negative way are the number of PhD customers, number of teens at home, and number of web visits by the customers.

  • Logistic regression was carried out for 6 marketing campaigns and result shows that there is no significant relationship between geographical regional and success of a campaign.

9 LEGALITY

This is a personal project created and designed for skill demonstration and non-commercial use only. All photos in this project, such as those that applied as the thumbnail are only for demonstration, they are not related to the dataset, the location where the data were collected, and the results of this analysis.

10 REFERENCE

Australia Bureau of Statistics 2014, Age Standard, viewed 26 July 2021, https://www.abs.gov.au/statistics/standards/age-standard/latest-release

Daoud J 2021, EDA & Statistics Analysis with Marketing Data, viewed 26 July 2021, https://www.kaggle.com/jackdaoud/eda-statistics-analysis-with-marketing-data.

Crockett J 2021, Marketing Analytics EDA task [Final], viewed 28 July 2021, https://www.kaggle.com/jennifercrockett/marketing-analytics-eda-task-final

Minitab Blog Editor 2013, Enough Is Enough! Handling Multicollinearity in Regression Analysis, viewed 26 July 2021, https://blog.minitab.com/en/understanding-statistics/handling-multicollinearity-in-regression-analysis

The Pennsylvania State University 2018, 10.7 - Detecting Multicollinearity Using Variance Inflation Factors, viewed 27 July 2021, https://online.stat.psu.edu/stat462/node/180/

11 ACKNOWLEDGEMENT

  • Thank you Dr. Omar Romero-Hernandez for providing this dataset.