1. Background of the Case Study

The dataset originates from the CRM database of a anonymous e-commerce company, presenting all its transactional records for the year 2021. This dataset is available on public resource for research purpose only, and certain details have been anonymized and pruned. Customer data is represented by neutral code to ensure confidentiality.

We tend to perform the CRM analytics on this given dataset to gain insights into the operation and business of this e-commerce company. By analyzing purchase, customer patterns, and other salient aspects, we aim to better understand its business and customers, planning to tweaking and developing the new pertaining marketing strategy.


A Quick Glimpse at the dataset

The first 10 rows of the dataset

customer_id

age

gender

revenue_total

n_purchases

purchase_date

purchase_value

pay_method

time_spent

browser

newsletter

voucher

504308

21

0

32.02

7

22.06.21

24.91

1

831

0

1

1

504309

33

1

56.96

3

10.12.21

2.90

2

1,431

0

1

1

504310

18

1

29.24

2

14.03.21

10.60

0

1,165

0

0

0

504311

22

0

75.94

8

25.10.21

43.28

1

821

0

0

0

504312

22

1

37.69

2

14.09.21

56.90

1

1,377

0

1

1

504313

20

0

40.16

8

14.05.21

12.47

1

1,153

1

0

1

504314

33

1

43.13

1

09.01.21

2.46

0

1,083

0

1

1

504315

32

1

12.43

4

28.03.21

6.56

1

1,483

0

1

1

504316

24

0

49.99

7

04.08.21

11.88

0

776

3

1

0

504317

28

1

14.30

4

06.10.21

11.90

3

1,101

0

1

1


Overview of the Variables.

There are 12 variables in the dataset, which are described below:

  • customer ID Each customer has a unique ID number.

  • Age The age of the customers in the year that the data were collected.

  • Gender 0: stands for male, 1: stands for female.

  • Revenue in total The values of transaction in sum that the customers bought within the year.

  • No. of Purchase The number of orders in sum that the customers placed within the year.

  • Purchase Date The date that customers placed the last order in the year.

  • Purchase Value The transaction value in Euro of customer’s last order in the year.

  • Purchase Method 0: Digital Wallets, 1: any kind of Bank Card, 2: PayPal, and 3: any other payment methods.

  • Time Spent Time spent (in sec) on website.

  • Browser 0: Chrome, 1: Safari, 2: Edge, and 3: any other browsers.

  • Newsletter 0: stands for not subscribed, 1: stands for subscribed.

  • Voucher 0: states not any voucher used for the order in the e-shop, 1: denotes that voucher was at least used one time for the order in the year.

There are some other variables that are not provided due to data protection reasons, which don’t fundamentally hinder the next analytics.

The dimension of the dataset:

## [1] 65796    12

There are 65,796 rows, which means 65,796 distinct customers who placed at least one order in the e-shop in the year. The number of columns is 12, which are the variables as previously listed.


2. Verify and clean the dataset

2.1 Working out the cleaned dataset

Before starting analytic work, the data in the table must be first verified and cleaned up for further process.

Number of missing value in the dataset

The statistical description of the dataset is presented below:

##                vars     n      mean       sd    median   trimmed      mad
## customer_id       1 65796 537205.50 18993.81 537205.50 537205.50 24387.29
## age               2 65796     36.96    11.04     37.00     36.79    13.34
## gender            3 65796      0.67     0.47      1.00      0.71     0.00
## revenue_total     4 65796     55.73    22.53     56.37     55.36    23.65
## n_purchases       5 65796      4.36     2.61      4.00      4.10     2.97
## purchase_date*    6 65796    183.89   105.52    184.00    183.87   134.92
## purchase_value    7 65796     15.97    13.23     12.64     14.48    13.43
## pay_method        8 65796      1.11     0.93      1.00      1.03     1.48
## time_spent        9 65796    775.00   285.67    735.00    755.96   266.87
## browser          10 65796      0.63     1.00      0.00      0.41     0.00
## newsletter       11 65796      0.57     0.50      1.00      0.58     0.00
## voucher          12 65796      0.57     0.49      1.00      0.59     0.00
##                      min       max    range  skew kurtosis    se
## customer_id    504308.00 570103.00 65795.00  0.00    -1.20 74.05
## age                15.00     79.00    64.00  0.11    -0.76  0.04
## gender              0.00      1.00     1.00 -0.73    -1.47  0.00
## revenue_total       5.47    122.39   116.92  0.12    -0.48  0.09
## n_purchases         1.00     12.00    11.00  0.86     0.14  0.01
## purchase_date*      1.00    366.00   365.00  0.00    -1.19  0.41
## purchase_value      0.00     59.90    59.90  0.88     0.09  0.05
## pay_method          0.00      3.00     3.00  0.46    -0.66  0.00
## time_spent        124.00   1500.00  1376.00  0.54    -0.19  1.11
## browser             0.00      3.00     3.00  1.45     0.79  0.00
## newsletter          0.00      1.00     1.00 -0.27    -1.93  0.00
## voucher             0.00      1.00     1.00 -0.29    -1.92  0.00

From the statistical description and above plot, all the data in the cleaned dataset lie in the reasonable ranges, without nonsensical numbers. We can assume that the post-cleaned dataset can be utilized for further analysis.


3. The questions from shareholders

I have collected and organized the questions from several marketing experts in real life in reference to this case, and present here top questions raised, which will be answered in the next section.

  • Q1: What is Average Transaction Values (ATC)?

  • Q2: Which age group made most purchases?

  • Q3: What is the relationship between gender and revenue? Who spends more, men or women?

  • Q4: What are the most popular payment methods of customers? And how do payment methods differ by gender, or age group?

  • Q5: What is correlation between age, gender, spending, number of purchases, browser brands, payment method, time spent on the site, newsletter description, and voucher usage?

  • Q6: What is distribution the last month in which customer made their last purchase in the year?

  • Q7: Does newsletter subscription play a role in revenues and purchasing frequency?

  • Q8: Is there an association between the sum of revenue of purchases, purchase frequency and the use of vouchers?

  • Q9: What is the profile of high revenue customers, in terms of provided variables? What is the Persona of a high spending customer?


4. The Answers to the Questions

Q1: What is Average Transaction Values (ATC)?

print(avg.transaction.value)
## [1] 12.77

This means the average transaction value for a single purchase lies at 12.77 Euro, regardless of age, gender or other features.

print(avg.revenue)
## [1] 55.73

This amount (55.73 Euro) stands for the total spendings on average of a customer in a year, regardless of age, gender or other features.


Q2: Which age group made most purchases?

From the plot, the age seemingly follows a normal distribution pattern. The age of the customers lie between 15 and 79, with average value of 37.

The most purchase patterns occur at the ages of around 30 and around 40. It can not, however, identify the profile of customers by such a rough observation.

The plot indicates a positive correlation between customer age and revenue, suggesting that customer spending tends to increase with age. However, we cannot hastily assume that older customers will necessarily spend more.

Moreover, a distinct square-shaped cluster is visible at the top of the plot, exhibiting revenue ranging from 80 to 120 Euros, far over the average spending of 55.73 Euro as previously calculated. This cluster of customers exhibits significantly higher spending patterns compared to the average and fall within the age range of 35 to 55.


Q3: What is the relationship between gender and revenue? Who spends more, men or women?

2/3 of the customers are female (44,172 out of 65,796) The web shop may provide the products or/and service more targeting at female customers.

The given visualization reveals that female customers not only outnumber male customers, but also exhibit slightly higher average spendings. Additionally, a drastic decline in male customer spending is observed around the 80 Euro, potentially contributing to their overall lower spending patterns. This trend can request the further investigation.


Q5: What is correlation between age, gender, spending, number of purchases, browser brands, payment method, time spent on the site, newsletter description, and voucher usage?

The highest positive correlation, with the value of .43 , is observed between the pair of revenue total and number of purchases. On the other hand, the highest negative correlation occurs between the browser types chrome and safari.


Q6: What is distribution the last month in which customer made their last purchase in the year?

From the plot, we can deduce that there is a relatively even distribution of customer purchases across the entire year up to their final order. This raises an important question: why do so many customers stop purchasing after their first order in the early months and not become repeat customers?

Or, the counterintuitive visual could be attributed to errors in the dataset.


Q7: Does newsletter subscription play role in revenues and purchase frequency?

There is no significantly difference in Overall Spendings between Newsletter susbcribers and non-subscriers.

But if we split it by gender, we will see somewhat different implications.

There is a outstanding pattern at the top of plot in the male group, regardless of subscription situation. This pattern exhibits a sharp fall in revenue beyond 80 Euro.

However, such pattern is not observed in female customers, which means newsletters may be more effective in boosting spending for females compared to males. Further verification is certainly under requst.

On average, newsletter subscribers purchase slightly more frequently than non-subscribers by about 1 time per year. There is a strong positive correlation between purchase frequency and newsletter subscription status. In other words, the majority of high-frequency customers (those who purchase more than 6 times per year) are newsletter subscribers.


Q8: Is there an association between the sum of revenue of purchases, purchase frequency and the use of vouchers?

Holistic Views on the correlation between Gender, No. of Purchase, Newsletter, Voucher and Revenue in total.

A different correlations can be observed by gender. In male group, total spending showed a positive correlation with all the listed variables, expect for Newsletter.

The correlations between Newsletter and Voucher was fairly weak in both groups.

In contrast, total spending in female group showed a negative correlation with Time Spent on the website and Voucher, which exhibits a divergent response based on gender.

Can we conclude that Newsletter doesn’t have so much impact on increasing total spending?


Q9: What is the profile of high revenue customers, in terms of provided variables? What is the Persona of a high spending customer?

Firstly, I conducted the classic but powerful Multiple Linear Regression model by setting Revenue in total as dependent variables and the others as independent ones to see how much the variance of Revenue in total can be explained by the others in this model.

## 
## Call:
## lm(formula = revenue_total ~ ., data = lm.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.155 -13.712   0.926  13.125  70.765 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.2013665  0.4268668   96.52   <2e-16 ***
## age          0.3395189  0.0071909   47.22   <2e-16 ***
## gender      11.1875768  0.1682164   66.51   <2e-16 ***
## n_purchases  2.8597150  0.0295866   96.66   <2e-16 ***
## time_spent  -0.0219568  0.0002913  -75.38   <2e-16 ***
## newsletter  -1.7406803  0.1533017  -11.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.69 on 65790 degrees of freedom
## Multiple R-squared:  0.3123, Adjusted R-squared:  0.3122 
## F-statistic:  5974 on 5 and 65790 DF,  p-value: < 2.2e-16

The formula of the Multiple Linear Regression is

\(Revenuve \ in \ total = Intercept+0.34·age+11.19·gender +2.86·no. \ of \ purchases-0.02·time \ spent - 1.74·newsletter+error\)

Based on Multiple Linear Regression, only Age, Gender, No. of Purchase, Time Spent and Newsletter have a statistically significant impact on the dependent variable Revenue in total. However, among them, Time Spent and Newsletter have negative coefficient values, suggesting a decline in dependent variable as either of these two independent variables increases.

In general only 31.2% of the variance of Revenue in total can be accounted for by the variables included in the model. This states the Multiple Linear Regression only has limited explanatory power, and the majority of variance is supposed to be explained by other variables, which are not given in the anonymized dataset.


Profile of higher spender

A bit more advanced Machine Learning Model is leveraged.

From the advanced classification plot, we can identify two relatively high-spending groups, which account for a significant portion of the total customers.

The first group, comprising around 18,000 customers (27% of the total customers), generates an average revenue of 72 Euro (much higher than the overall average of 55.73 Euro). The profile is:

  • Female customers who have made more than five purchases on the website in the past year, contributing an average of 72 Euros in revenue.

The second groug, representing approximately 12,000 customers (13% of the total customers), brings an average revenue of 68 Euro per customer. This profile is:

  • Female customers who have made fewer than five purchases on the website but have spent less than 800 seconds on the website.


Plot Importance of Variables

The plot conveys not only the variables merely with important impact on the outcome of the classification model, but them in sequence of importance value as well.


Profile of male high spenders

Two highest-spending subgroups emerge from the male customer segment, comprising 9% and 12% of the total male customer base.

The first group, accounting for 9% of male customers, comprises individuals between the ages of 35 and 51. The second subgroup, representing 12% of male customers, consists of men younger than 35 who didn’t subscribe newsletter.

Aside from newsletter subscription, male customers younger than 35, contributes the average revenue surpassing the average expenditure of both males and total customers.

The observed pattern aligns with the previous plot shared in response to the Q4.


Profile of Female high spenders

What can be concluded from the above plot?


5. Conclusions

5.1 Comparison between the diverse models

One of the outstanding advantages of classification regression tree models compared with classic linear regression models is that they don’t just depict a simple up or down movement between variables. Instead, they define the range in which certain variables have a significant impact, either positive or negative, and the range in which they may play no role at all.

Additionally, classification regression tree models take into account the combination of the variables, which classic linear regression models do not.


5.2 Customer segement and Variables identification

By employing a combination of statistically descriptive visualizations, multiple linear regression and more sophisticated machine learning classification models, we identified and characterized the high-spending customer segments, as well as the relative importance of the given variables.

The analysis provides the company with a valuable understanding of their business and customers, enabling them to develop an effective marketing strategy to retain high-spending customers, understand their profiling, find more customers like them, tweak the newsletter content, develop other promotion tools, nurture new leads and so on.


5.3 What are not answered?

  • Due to the lack of relevant data, it is difficult to determine which customers are newly acquired and which are existing. So that some critical metrics, such as retention rate, churn rate and so on, cannot be calculated.

  • What is the family of products that customers buy?

  • The lack of information regarding geographic location, education, family situation, and other demographic characteristics hinders the creation of a clear demographic profile of customers.

  • The effectiveness of the promotion on the website, such as how likely the purchase is a result of a “recommendation window”, cannot be answered.

  • It is difficult to evaluate the effectiveness of allocating budget to other tools for customer acquisition and revenue growth.


6. Contact of the Author

Ned lin MBA @ EMST Berlin

Email:

My LinkedIn