The dataset originates from the CRM database of a anonymous e-commerce company, presenting all its transactional records for the year 2021. This dataset is available on public resource for research purpose only, and certain details have been anonymized and pruned. Customer data is represented by neutral code to ensure confidentiality.
We tend to perform the CRM analytics on this given dataset to gain insights into the operation and business of this e-commerce company. By analyzing purchase, customer patterns, and other salient aspects, we aim to better understand its business and customers, planning to tweaking and developing the new pertaining marketing strategy.
The first 10 rows of the dataset | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | age | gender | revenue_total | n_purchases | purchase_date | purchase_value | pay_method | time_spent | browser | newsletter | voucher |
504308 | 21 | 0 | 32.02 | 7 | 22.06.21 | 24.91 | 1 | 831 | 0 | 1 | 1 |
504309 | 33 | 1 | 56.96 | 3 | 10.12.21 | 2.90 | 2 | 1,431 | 0 | 1 | 1 |
504310 | 18 | 1 | 29.24 | 2 | 14.03.21 | 10.60 | 0 | 1,165 | 0 | 0 | 0 |
504311 | 22 | 0 | 75.94 | 8 | 25.10.21 | 43.28 | 1 | 821 | 0 | 0 | 0 |
504312 | 22 | 1 | 37.69 | 2 | 14.09.21 | 56.90 | 1 | 1,377 | 0 | 1 | 1 |
504313 | 20 | 0 | 40.16 | 8 | 14.05.21 | 12.47 | 1 | 1,153 | 1 | 0 | 1 |
504314 | 33 | 1 | 43.13 | 1 | 09.01.21 | 2.46 | 0 | 1,083 | 0 | 1 | 1 |
504315 | 32 | 1 | 12.43 | 4 | 28.03.21 | 6.56 | 1 | 1,483 | 0 | 1 | 1 |
504316 | 24 | 0 | 49.99 | 7 | 04.08.21 | 11.88 | 0 | 776 | 3 | 1 | 0 |
504317 | 28 | 1 | 14.30 | 4 | 06.10.21 | 11.90 | 3 | 1,101 | 0 | 1 | 1 |
There are 12 variables in the dataset, which are described below:
customer ID Each customer has a unique ID
number.
Age The age of the customers in the year that the
data were collected.
Gender 0: stands for male,
1: stands for female.
Revenue in total The values of transaction in sum
that the customers bought within the year.
No. of Purchase The number of orders in sum that the
customers placed within the year.
Purchase Date The date that customers placed the
last order in the year.
Purchase Value The transaction value in Euro of
customer’s last order in the year.
Purchase Method 0: Digital Wallets,
1: any kind of Bank Card, 2: PayPal,
and 3: any other payment methods.
Time Spent Time spent (in sec) on website.
Browser 0: Chrome,
1: Safari, 2: Edge, and
3: any other browsers.
Newsletter 0: stands for not
subscribed, 1: stands for subscribed.
Voucher 0: states not any voucher
used for the order in the e-shop, 1: denotes that
voucher was at least used one time for the order in the year.
There are some other variables that are not provided due to data protection reasons, which don’t fundamentally hinder the next analytics.
The dimension of the dataset:
## [1] 65796 12
There are 65,796 rows, which means 65,796 distinct customers who placed at least one order in the e-shop in the year. The number of columns is 12, which are the variables as previously listed.
Before starting analytic work, the data in the table must be first verified and cleaned up for further process.
Number of missing value in the dataset
The statistical description of the dataset is presented below:
## vars n mean sd median trimmed mad
## customer_id 1 65796 537205.50 18993.81 537205.50 537205.50 24387.29
## age 2 65796 36.96 11.04 37.00 36.79 13.34
## gender 3 65796 0.67 0.47 1.00 0.71 0.00
## revenue_total 4 65796 55.73 22.53 56.37 55.36 23.65
## n_purchases 5 65796 4.36 2.61 4.00 4.10 2.97
## purchase_date* 6 65796 183.89 105.52 184.00 183.87 134.92
## purchase_value 7 65796 15.97 13.23 12.64 14.48 13.43
## pay_method 8 65796 1.11 0.93 1.00 1.03 1.48
## time_spent 9 65796 775.00 285.67 735.00 755.96 266.87
## browser 10 65796 0.63 1.00 0.00 0.41 0.00
## newsletter 11 65796 0.57 0.50 1.00 0.58 0.00
## voucher 12 65796 0.57 0.49 1.00 0.59 0.00
## min max range skew kurtosis se
## customer_id 504308.00 570103.00 65795.00 0.00 -1.20 74.05
## age 15.00 79.00 64.00 0.11 -0.76 0.04
## gender 0.00 1.00 1.00 -0.73 -1.47 0.00
## revenue_total 5.47 122.39 116.92 0.12 -0.48 0.09
## n_purchases 1.00 12.00 11.00 0.86 0.14 0.01
## purchase_date* 1.00 366.00 365.00 0.00 -1.19 0.41
## purchase_value 0.00 59.90 59.90 0.88 0.09 0.05
## pay_method 0.00 3.00 3.00 0.46 -0.66 0.00
## time_spent 124.00 1500.00 1376.00 0.54 -0.19 1.11
## browser 0.00 3.00 3.00 1.45 0.79 0.00
## newsletter 0.00 1.00 1.00 -0.27 -1.93 0.00
## voucher 0.00 1.00 1.00 -0.29 -1.92 0.00
From the statistical description and above plot, all the data in the cleaned dataset lie in the reasonable ranges, without nonsensical numbers. We can assume that the post-cleaned dataset can be utilized for further analysis.
print(avg.transaction.value)
## [1] 12.77
This means the average transaction value for a single purchase lies at 12.77 Euro, regardless of age, gender or other features.
print(avg.revenue)
## [1] 55.73
This amount (55.73 Euro) stands for the total spendings on average of a customer in a year, regardless of age, gender or other features.
From the plot, the age seemingly follows a normal distribution pattern. The age of the customers lie between 15 and 79, with average value of 37.
The most purchase patterns occur at the ages of around 30 and around 40. It can not, however, identify the profile of customers by such a rough observation.
The plot indicates a positive correlation between customer age and revenue, suggesting that customer spending tends to increase with age. However, we cannot hastily assume that older customers will necessarily spend more.
Moreover, a distinct square-shaped cluster is visible at the top of the plot, exhibiting revenue ranging from 80 to 120 Euros, far over the average spending of 55.73 Euro as previously calculated. This cluster of customers exhibits significantly higher spending patterns compared to the average and fall within the age range of 35 to 55.
2/3 of the customers are female (44,172 out of 65,796) The web shop may provide the products or/and service more targeting at female customers.
The given visualization reveals that female customers not only outnumber male customers, but also exhibit slightly higher average spendings. Additionally, a drastic decline in male customer spending is observed around the 80 Euro, potentially contributing to their overall lower spending patterns. This trend can request the further investigation.
The pie chart depicts the proportion of payment methods used by all 65,796 customers. The adjacent bar chart illustrates the number of customers utilizing distinct payment methods.
Upon examining payment methods by gender, there is a ostensible distinction, with females, which make of the most prevalent customer group, having the strongest inclination to use Bank Card for online transaction on this web shop.
This could call for a investigation in a further study.
Based on the above plot, we can infer the following insights, which cannot be easily obtained without this visualization exercise.
There are two distinct age groups among male customers, with the first group ranging from approximately 20 to 30 years old, and the second group spanning from around 35 to 50 years old. However, it is not clear whether these groups also exhibit high spending patterns.
The largest group of female customers are those between about 30 and 40 years old, which favor much bank card for payment, although we don’t know whether this group also falls into the category of high spenders.
From the plot, we can deduce that there is a relatively even distribution of customer purchases across the entire year up to their final order. This raises an important question: why do so many customers stop purchasing after their first order in the early months and not become repeat customers?
Or, the counterintuitive visual could be attributed to errors in the dataset.
Holistic Views on the correlation between Gender,
No. of Purchase, Newsletter,
Voucher and Revenue in total.
A different correlations can be observed by gender. In male group,
total spending showed a positive correlation with all the listed
variables, expect for Newsletter.
The correlations between Newsletter and
Voucher was fairly weak in both groups.
In contrast, total spending in female group showed a negative
correlation with Time Spent on the website and
Voucher, which exhibits a divergent response based on
gender.
Can we conclude that Newsletter doesn’t have so much
impact on increasing total spending?
Firstly, I conducted the classic but powerful
Multiple Linear Regression
model by setting Revenue in total as dependent variables
and the others as independent ones to see how much the variance of
Revenue in total can be explained by the others in this
model.
##
## Call:
## lm(formula = revenue_total ~ ., data = lm.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.155 -13.712 0.926 13.125 70.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.2013665 0.4268668 96.52 <2e-16 ***
## age 0.3395189 0.0071909 47.22 <2e-16 ***
## gender 11.1875768 0.1682164 66.51 <2e-16 ***
## n_purchases 2.8597150 0.0295866 96.66 <2e-16 ***
## time_spent -0.0219568 0.0002913 -75.38 <2e-16 ***
## newsletter -1.7406803 0.1533017 -11.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.69 on 65790 degrees of freedom
## Multiple R-squared: 0.3123, Adjusted R-squared: 0.3122
## F-statistic: 5974 on 5 and 65790 DF, p-value: < 2.2e-16
The formula of the Multiple Linear Regression is
\(Revenuve \ in \ total = Intercept+0.34·age+11.19·gender +2.86·no. \ of \ purchases-0.02·time \ spent - 1.74·newsletter+error\)
Based on Multiple Linear
Regression, only Age, Gender,
No. of Purchase, Time Spent and
Newsletter have a statistically significant impact on the
dependent variable Revenue in total. However, among them,
Time Spent and Newsletter have negative
coefficient values, suggesting a decline in dependent variable as either
of these two independent variables increases.
In general only 31.2% of the variance of
Revenue in total can be accounted for by the variables
included in the model. This states the
Multiple Linear Regression
only has limited explanatory power, and the majority of variance is
supposed to be explained by other variables, which are not given in the
anonymized dataset.
Profile of higher spender
A bit more advanced Machine Learning Model is leveraged.
From the advanced classification plot, we can identify two relatively high-spending groups, which account for a significant portion of the total customers.
The first group, comprising around 18,000 customers (27% of the total customers), generates an average revenue of 72 Euro (much higher than the overall average of 55.73 Euro). The profile is:
The second groug, representing approximately 12,000 customers (13% of the total customers), brings an average revenue of 68 Euro per customer. This profile is:
Plot Importance of Variables
The plot conveys not only the variables merely with important impact on the outcome of the classification model, but them in sequence of importance value as well.
Profile of male high spenders
Two highest-spending subgroups emerge from the male customer segment, comprising 9% and 12% of the total male customer base.
The first group, accounting for 9% of male customers, comprises individuals between the ages of 35 and 51. The second subgroup, representing 12% of male customers, consists of men younger than 35 who didn’t subscribe newsletter.
Aside from newsletter subscription, male customers younger than 35, contributes the average revenue surpassing the average expenditure of both males and total customers.
The observed pattern aligns with the previous plot shared in response to the Q4.
Profile of Female high spenders
What can be concluded from the above plot?
One of the outstanding advantages of classification regression tree models compared with classic linear regression models is that they don’t just depict a simple up or down movement between variables. Instead, they define the range in which certain variables have a significant impact, either positive or negative, and the range in which they may play no role at all.
Additionally, classification regression tree models take into account the combination of the variables, which classic linear regression models do not.
By employing a combination of statistically descriptive visualizations, multiple linear regression and more sophisticated machine learning classification models, we identified and characterized the high-spending customer segments, as well as the relative importance of the given variables.
The analysis provides the company with a valuable understanding of their business and customers, enabling them to develop an effective marketing strategy to retain high-spending customers, understand their profiling, find more customers like them, tweak the newsletter content, develop other promotion tools, nurture new leads and so on.
Due to the lack of relevant data, it is difficult to determine which customers are newly acquired and which are existing. So that some critical metrics, such as retention rate, churn rate and so on, cannot be calculated.
What is the family of products that customers buy?
The lack of information regarding geographic location, education, family situation, and other demographic characteristics hinders the creation of a clear demographic profile of customers.
The effectiveness of the promotion on the website, such as how likely the purchase is a result of a “recommendation window”, cannot be answered.
It is difficult to evaluate the effectiveness of allocating budget to other tools for customer acquisition and revenue growth.