| Name | Matric Number |
|---|---|
| Tan Siao Shuen | 23090576 |
| Tey Huimin | 23101808 |
| Areej Abdurahman Mohammad | 23119264 |
| Loh Ee Sam | 17196177 |
| Audrey Law Jia Ling | 24065586 |
In today’s competitive market, understanding customer spending behavior is crucial for driving revenue growth and enhancing customer satisfaction. By analyzing spending patterns, businesses can identify trends, segment customers effectively, and implement targeted marketing strategies. These insights not only help improve customer retention but also optimize product offerings and revenue forecasting for informed decision-making.
This project focuses on analyzing customer spending behavior using a detailed dataset. The dataset includes transactional data capturing customer demographics (age, gender, and country), product categories, purchase quantities, and revenue generated from sales. By leveraging this data, we aim to uncover actionable insights that address critical business challenges, such as improving customer segmentation, identifying high-revenue customers, and optimizing marketing efforts.
Through this analysis, we aim to bridge the gap between data and strategy, offering businesses a comprehensive understanding of their customers while supporting better decision-making in a dynamic business landscape.
The goals of this project are to:
Classify customers into revenue classes (Low, Medium, High) based on their spending behavior using advanced machine learning techniques, including Random Forest and XGBoost.
Forecast profit in order to enable businesses to predict future revenue, adjust strategies, and improve profitability by understanding the factors that most significantly influence profit
By achieving these objectives, this project provides valuable insights into customer behavior and equips businesses with tools to enhance their segmentation strategies, forecast profit, and improve overall profitability.
This analysis will provide insights that can assist businesses in personalizing marketing strategies and optimizing customer engagement based on spending behavior.
The dataset used in this analysis is from Kaggle and contains information about customers, their demographics, and transaction history. The dataset includes various features such as:
Age: Customer’s age.
Gender: Customer’s gender.
Country: Customer’s country.
Purchase Category: The type of products or services purchased by the customer.
Revenue: The total revenue generated from the sale.
The dataset consists of 34,867 rows and 16 columns and was collected in 2016 for the purpose of analyzing consumer spending habits.
In this section, we will perform the necessary data cleaning steps to ensure that our dataset is ready for analysis. We will import the required libraries, load the dataset, and inspect it for any issues that need to be addressed, such as missing values, irrelevant columns, or improper data types.
# Importing Dataset
df=read.csv('Analyzing Customer Spending Behaviour.csv')
# Understanding data
# basic structure of the dataset
glimpse(df)
## Rows: 34,867
## Columns: 16
## $ index <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ Date <chr> "02/19/16", "02/20/16", "02/27/16", "03/12/16", "03/1…
## $ Year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
## $ Month <chr> "February", "February", "February", "March", "March",…
## $ Customer.Age <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 2…
## $ Customer.Gender <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ Country <chr> "United States", "United States", "United States", "U…
## $ State <chr> "Washington", "Washington", "Washington", "Washington…
## $ Product.Category <chr> "Accessories", "Clothing", "Accessories", "Accessorie…
## $ Sub.Category <chr> "Tires and Tubes", "Gloves", "Tires and Tubes", "Tire…
## $ Quantity <dbl> 1, 2, 3, 2, 3, 1, 2, 1, 2, 2, 2, 3, 2, 1, 3, 3, 1, 2,…
## $ Unit.Cost <dbl> 80.00, 24.50, 3.67, 87.50, 35.00, 66.00, 52.00, 60.00…
## $ Unit.Price <dbl> 109.00000, 28.50000, 5.00000, 116.50000, 41.66667, 78…
## $ Cost <dbl> 80, 49, 11, 175, 105, 66, 104, 60, 16, 5, 35, 120, 92…
## $ Revenue <dbl> 109, 57, 15, 233, 125, 78, 120, 68, 20, 6, 42, 146, 1…
## $ Column1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# summary of the dataset
summary(df)
## index Date Year Month
## Min. : 0 Length:34867 Min. :2015 Length:34867
## 1st Qu.: 8716 Class :character 1st Qu.:2015 Class :character
## Median :17433 Mode :character Median :2016 Mode :character
## Mean :17433 Mean :2016
## 3rd Qu.:26150 3rd Qu.:2016
## Max. :34866 Max. :2016
## NA's :1
## Customer.Age Customer.Gender Country State
## Min. :17.00 Length:34867 Length:34867 Length:34867
## 1st Qu.:28.00 Class :character Class :character Class :character
## Median :35.00 Mode :character Mode :character Mode :character
## Mean :36.38
## 3rd Qu.:44.00
## Max. :87.00
## NA's :1
## Product.Category Sub.Category Quantity Unit.Cost
## Length:34867 Length:34867 Min. :1.000 Min. : 0.67
## Class :character Class :character 1st Qu.:1.000 1st Qu.: 45.00
## Mode :character Mode :character Median :2.000 Median : 150.00
## Mean :2.003 Mean : 349.88
## 3rd Qu.:3.000 3rd Qu.: 455.00
## Max. :3.000 Max. :3240.00
## NA's :1 NA's :1
## Unit.Price Cost Revenue Column1
## Min. : 0.667 Min. : 2 Min. : 2.0 Min. : 2.0
## 1st Qu.: 53.667 1st Qu.: 85 1st Qu.: 102.0 1st Qu.: 104.2
## Median : 179.000 Median : 261 Median : 319.0 Median : 390.5
## Mean : 389.232 Mean : 576 Mean : 640.9 Mean : 688.0
## 3rd Qu.: 521.000 3rd Qu.: 769 3rd Qu.: 902.0 3rd Qu.: 975.8
## Max. :5082.000 Max. :3600 Max. :5082.0 Max. :3681.0
## NA's :1 NA's :1 NA's :32293
# number of null values by column
sapply(df,function(x)sum(is.na(x)))
## index Date Year Month
## 0 0 1 0
## Customer.Age Customer.Gender Country State
## 1 0 0 0
## Product.Category Sub.Category Quantity Unit.Cost
## 0 0 1 1
## Unit.Price Cost Revenue Column1
## 1 1 0 32293
# remove 'Column1' column as it is meaningless (according to the data source Kaggle)
df<-select(df,-Column1)
# remove row with null value
df<-df[complete.cases(df),]
# Data in 'Date' column is written as character/string
df$Date<-mdy(df$Date)
# Remove Index Columns
cust_spend_data <- select(df, -index)
# Rename column name with spaces
cust_spend_data <- cust_spend_data %>%
rename("Customer_Age" = "Customer.Age",
"Customer_Gender" = "Customer.Gender",
"Product_Category" = "Product.Category",
"Sub_Category" = "Sub.Category",
"Unit_Cost" = "Unit.Cost",
"Unit_Price" = "Unit.Price")
# Convert necessary columns to appropriate data types
cust_spend_data_2 <- cust_spend_data %>%
mutate(
Date = as.Date(Date, format = "%m/%d/%y"),
Customer_Gender = as.factor(Customer_Gender),
Country = as.factor(Country),
State = as.factor(State),
Product_Category = as.factor(Product_Category),
Sub_Category = as.factor(Sub_Category)
)
# Handle missing values (if any)
cust_spend_data_2 <- cust_spend_data_2 %>% drop_na()
# Create a feature for average unit profit
cust_spend_data_2 <- cust_spend_data_2 %>% mutate(Unit_Profit = Unit_Price - Unit_Cost)
# Creating a copy for regression modeling
cust_spend_data_reg <- cust_spend_data_2 %>% mutate(Margin = Revenue - Cost)
cust_spend_data_cls <- cust_spend_data_2
data_head <- head(cust_spend_data_2, n=5)
datatable(data_head, options=list(scrollX=TRUE)) # interactive display of df_head in 'viewer' tab
Exploratory Data Analysis (EDA) is an essential step in the data analysis process where we summarize and visualize the main characteristics of the dataset. Through EDA, we aim to uncover patterns, spot anomalies, and test initial hypotheses using both graphical and quantitative techniques. This step provides a foundation for further analysis and model building.
In this section, we focus on understanding individual variables within the dataset to uncover patterns and trends. Univariate analysis provides a deeper insight into the distribution and characteristics of key features, such as customer demographics, purchase behavior, and product categories. By analyzing each variable separately, we can identify potential drivers of customer spending and set the stage for more complex analyses in subsequent sections.
This chart shows the age distribution of customers, with the highest count in the 30-39 age range, followed by the 40-49 and 20-29 age ranges. The 10-19 and 70-79 age ranges have the lowest counts.
Peak Customer Age Group (30-39):
This age group may be the most significant customers demographic due to characteristics such as purchasing power, product/service preferences, and marketing strategy alignment.
Second-Largest Groups (20-29 and 40-49):
Customers in their 20s may be young and starting their career, while those in their 40s might belong to mature workers or families with stable incomes.
Low Representation in 10-19 and 70-79 Age Ranges:
The 10-19 age group may have fewer customers due to financial constraints or lack of interest in the products.
Similarly, the 70-79 age range may have the lowest count due to decreased product relevance, mobility, or technological usage among older demographics.
This bar chart shows the count of customers across various states, California having the highest customer count, followed by Washington and England. Other states and regions have significantly lower customer counts.
The large number of customers in California and Washington makes these two states outliers compared to others.
The numbers for both genders are close, with a slight male majority. It indicates a well-balanced gender distribution among customers. Because the difference is minimal, marketing and promotional activities techniques could be targeted to both genders without considerable bias.
The pie chart visually divides the categories proportionally, with Accessories making the greatest portion, followed by Bikes and Clothing.
Accessories are the largest category (64.2%), indicating that they possibly include essential or frequently purchased items that appeal to a wide customer base. Because of the strong demand for accessories, marketing and stocking strategies may emphasize them.
Bikes, with 20.5% of the total, are still contributing significantly to the business. This category may include higher-value items, making them significant even if less common than Accessories.
Clothing constitutes a smaller percentage (15.3%), implying that it may appeal to a particular audience or be less essential than the other categories.
The chart shows that Clothing has the highest mean unit profit, followed by Accessories, and then Bikes.
Profit Ranking:
Clothing (blue): This category has the highest mean unit profit, suggesting that on average, each unit sold in this category generates the most profit when compared to other categories. Despite having the smallest distribution (as shown in the “Product Category Distribution” figure), clothing has the greatest mean unit profit. This indicates that it is in the higher-value category.
Accessories (green): Accessories not only dominate in volume (64.2% of total distribution), but they also rank second in mean unit profit. This combination makes Accessories a significant contributor to revenue and profitability.
Bikes (orange): Bikes have the lowest mean unit profit, shows that they are lower-margin items, which could be due to high production costs, competitive pricing, or other reasons.
Resource Allocation suggestion: Allocate more resources (e.g., marketing budget, inventory) to more profitable categories, such as Clothing and Accessories, while exploring enhancements for Bikes.
The bar chart illustrates the total profit values of each country, with the United States’ bar being the highest and the United Kingdom as the lowest.
The United States and Germany should be emphasized for resource allocation, marketing efforts, and expansion strategies due to their significant profitability.
Bivariate analysis examines the relationship between two variables to uncover patterns and correlations. In this section, we analyze the interactions between pairs of variables, such as customer demographics and spending behavior, or product categories and revenue. By exploring these relationships, we can identify trends and potential causal factors that influence customer decisions, which are crucial for targeted marketing strategies, product recommendations, and business forecasting.
## `geom_smooth()` using formula = 'y ~ x'
The black trend line indicates a positive correlation between unit cost and price. This implies that as the unit cost increases, the unit price tends to increase as well. This pattern shows that more-cost products are typically priced higher.
Accessories (green): These data points are likely to cluster in the lower range of both unit cost (around 0-500) and unit price (near 0-700), implying that Accessories are typically low-cost, low-priced items. This could reflect the items’ status as supplementary or add-on purchases.
Bikes (orange): These points are likely scattered across higher ranges of unit cost (500-3000) and unit price (1000-4000), reflecting Bikes’ higher-cost, higher-priced products. This highlights the fact that they are as premium or core products in the business.
Clothing (blue): These points might be in-between between Accessories and Bikes in terms of unit cost and price, indicating a balance between volume sales and profitability.
The scatter plot matrix visualizing the correlation matrix and provides a detailed view of the relationships between multiple variables in the dataset. The color scale (ranging from -1 to 1) helps distinguish negative (red) and positive (blue) correlations.
High Correlation:
Unit Cost and Unit Price: There is a strong positive correlation between Unit Cost and Unit Price. This means that as the cost of a unit increases, the price at which it is sold also tends to increase.
Revenue and Total Spending: These two variables also show a high positive correlation, indicating that higher revenue is associated with higher total spending. It possibly reflecting customers who buy more expensive products or larger quantities.
Low Correlation:
Quantity and Year: There is a low correlation between Quantity and Year, suggesting that the quantity of items sold does not significantly change over the years. Market stability in terms of sales volume suggests that growth initiatives may require external variables such as product expansion or new marketing.
Customer Age and Unit Profit: The correlation between Customer Age and Unit Profit is low, indicating that the age of customers does not have a strong impact on the profit per unit.
The data points are spread around the plot, showing that there is no discernible trend or pattern, indicating that unit profit does not consistently rise or decrease with customer age.
Because no age group is particularly profitable, marketing efforts can target all age groups equally, focusing on product attractiveness, quality, and value.
The dots span the whole range of Customer Age (10-70) and Revenue (0-5000), suggesting that revenue creation is not highly connected with customer age.
This shows that customers of all ages contribute to revenue and age has no major impact on spending habits.
The chart provides useful information about the relationship between unit pricing and unit profit for numerous subcategories of the Accessories, Bikes, and Clothing categories.
Bikes (such as Touring Bikes, Mountain Bikes, and Road Bikes) have high unit prices, but Bike Racks (from Accessories) generate the largest unit profit.
This shows that, despite high-priced products like Touring Bikes may high in revenue, certain low-cost, high-margin products can contribute more significantly to overall profitability. Thus, the businesses might want to concentrate on products like Bike Racks (despite their lower price) because of the greater margins.
In this section, our objective of this analysis was to classify customers into three revenue classes (Low, Medium, and High) based on their spending behavior. The classification was conducted using two machine learning models, Random Forest and XGBoost, and the results were evaluated using metrics such as sensitivity, specificity, and feature importance.
The target variable, Revenue_Class, was created by segmenting the Revenue column into three classes using equal-width binning:
Features Used: Key features used in modeling include: Customer Age, Customer Gender, Country, Product Category, Quantity, Cost.
The table below summarizes the range of revenues for each class:
Model Training
Two classification models were trained:
###### 2: Split the dataset into training and test sets ######
cust_spend_data_cls <- cust_spend_data_cls %>% select(-Revenue) # drop multicollinearity
set.seed(123)
trainIndex <- createDataPartition(cust_spend_data_cls$Revenue_Class, p = 0.8, list = FALSE)
trainData <- cust_spend_data_cls[trainIndex, ]
testData <- cust_spend_data_cls[-trainIndex, ]
###### 3: Train a classification model (e.g., Random Forest) ######
rf_model <- randomForest(Revenue_Class ~ Customer_Age + Customer_Gender + Country +
Product_Category + Quantity + Cost,
data = trainData, ntree = 100)
# Evaluate the model
predictions_rf <- predict(rf_model, testData)
confusion_rf <- confusionMatrix(predictions_rf, testData$Revenue_Class)
stats_rf <- as.data.frame(confusion_rf$byClass)
stats_rf <- stats_rf %>%
mutate(Metric = rownames(stats_rf), Model = "Random Forest") %>%
select(Model, Metric, everything())
# Convert data to a matrix for XGBoost
train_matrix <- model.matrix(Revenue_Class ~ . -1, data = trainData)
train_label <- as.numeric(trainData$Revenue_Class) - 1
# Similarly, for the test data
test_matrix <- model.matrix(Revenue_Class ~ . -1, data = testData)
test_label <- as.numeric(testData$Revenue_Class) - 1
# Train the XGBoost model
xgb_model <- xgboost(data = train_matrix, label = train_label, max.depth = 6,
eta = 0.3, nrounds = 100, objective = "multi:softmax",
num_class = length(unique(train_label)))
Evaluation Metrics by Class
The evaluation metrics for both models are presented below, aggregated across revenue classes:
Both models performed exceptionally well for the Low and Medium revenue classes, with sensitivity, specificity, and positive predictive values close to 1.
For the High revenue class: XGBoost outperformed Random Forest with a higher sensitivity (0.6667 vs. 0.1667) and balanced accuracy (0.8333 vs. 0.5833).
Both models showed perfect specificity and positive predictive values for the High revenue class, although the prevalence of this class was very low.
The feature importance analysis of our random forest model highlights the following insights:
Feature importance plots from both models align in identifying Cost and Quantity as the primary drivers of revenue classification. These findings suggest prioritizing Cost and Product Category for driving insights into customer revenue class tiers.
Overall Performance:
Random Forest
Xgboost
For the High (Class: 2) class: Sensitivity improves significantly (66.67%) compared to Random Forest.
Balanced Accuracy: 83.33%, a marked improvement over Random Forest.
Observations
XGBoost outperforms Random Forest in every metric for the High class and matches or slightly exceeds performance for the Low and Medium classes.Random Forest performs strongly for dominant classes but struggles to classify rare classes due to lower sensitivity and balanced accuracy.
The improvement in sensitivity for the High class indicates that XGBoost better handles imbalanced datasets. XGBoost delivers better performance overall, especially for the High (Class: 2) class, handling the class imbalance more effectively.
Conclusion
XGBoost demonstrates superior performance across all metrics and classes, making it the more robust model for this classification task. Its ability to better handle imbalanced datasets provides a significant advantage over Random Forest.
This section focuses on predicting margin (revenue - cost) based on various independent variables, such as customer age, quantity and unit_profit (unit_price - unit_cost). We begin by checking the correlation between features to identify potential multicollinearity issues and drop the redundant features if neccessary. After splitting the data into training and validation sets, we apply one-hot encoding to handle categorical variables. A linear regression model is then built, evaluated for performance using metrics like RMSE, MAE, and R-squared, and visualized by comparing predicted versus actual profit values.
We first examined the correlation matrix between the features. The matrix revealed there’s no strong correlation between those numerical features.
Pre-processing steps like converting date time into numerical form has been performed. Subsequently, the dataset was partitioned into an 80% training set and a 20% validation set to ensure robust model evaluation and avoid overfitting. To facilitate regression modeling, one-hot encoding was performed on categorical variables, converting them into numerical form.
## Year Month Customer_Age Customer_Gender
## 0 0 0 0
## Country State Product_Category Sub_Category
## 0 0 0 0
## Quantity Unit_Profit Margin Day
## 0 0 0 0
## RMSE: 62.76195
## MAE: 37.81645
## R-squared: 0.8329754
The scatter plot for Actual vs Predicted graph indicate the actual-predicted relationship for the regression model output. The scatterplot depicts a strong alignment between actual and predicted values, clustering near the red diagonal line (representing perfect prediction). This indicates that the model generally performs well in predicting profit values. There is some deviation from the ideal line at the extremes, suggesting potential underperformance in capturing very high or very low profit values.
Model Performance Metrics:
Root Mean Squared Error (RMSE): 62.76195 This value indicates the model’s average prediction error magnitude in the same units as the target variable (profit). The relatively low RMSE confirms a good overall model fit.
Mean Absolute Error (MAE): 37.81645 The MAE shows that, on average, the model’s predictions deviate from actual values by approximately 37.82 units. This highlights consistent predictive performance with low error rates.
R-squared (R²): 0.83298 The model explains approximately 83.3% of the variance in the profit variable. This strong R² value indicates that the independent variables used in the regression effectively predict the dependent variable.
Observations
The model demonstrates strong predictive power, as reflected by the high R² value and low error metrics (RMSE and MAE). However, slight discrepancies at the extremes suggest that model performance could be further improved by:
● Investigating non-linear relationships or higher-order terms.
● Addressing potential outliers or heteroscedasticity.
● Exploring additional features or interaction terms.
Recommendations for future model enhancements:
Refine the model by incorporating techniques like regularization (e.g., LASSO, Ridge) or transforming input features to better capture extreme values.
Evaluate residual plots to identify systematic errors or patterns in prediction errors.
Test the model’s robustness by performing cross-validation or out-of-sample predictions.
Conclusion
The regression model performs well in predicting profit, with strong metrics and visual confirmation of fit. While some improvement opportunities exist, the current model provides a reliable basis for practical applications in forecasting profit.
Classification
In conclusion, the classification models, including Random Forest and XGBoost, demonstrate strong performance in segmenting customers into revenue classes (Low, Medium, High), particularly with the XGBoost model, which significantly outperforms Random Forest for the High revenue class. XGBoost achieves high sensitivity (99.98% for Low and 99.88% for Medium), specificity (99.88% for Low and 99.95% for Medium), and a balanced accuracy of 83.33% for the High class, a notable improvement over Random Forest’s performance.
While the High revenue class remains a small group, XGBoost’s ability to better handle this class’s imbalances is evident in its superior sensitivity (66.67% vs. 16.67%) and balanced accuracy. This highlights the importance of using advanced models like XGBoost to classify customers into distinct revenue segments effectively. The insights gained can help tailor marketing strategies that cater to high-value customers in the High revenue class, with targeted campaigns such as personalized offers, exclusive promotions, and early access to new luxury products. Additionally, these customers could be offered VIP loyalty programs, concierge services, or personalized shopping experiences to further increase their engagement and brand loyalty.
For the Low and Medium revenue classes, businesses can employ promotional strategies aimed at increasing engagement and boosting spending. For the Medium class, volume-based discounts, bundled product offers, and rewards programs can encourage more frequent purchases. Low-tier customers can be re-engaged with welcome offers, referral programs, and flash sales that create a sense of urgency. By targeting these segments with the right offers, businesses can drive customers up the loyalty ladder, ultimately increasing their lifetime value.
Regression
The regression model also provides valuable insights, with strong performance metrics: RMSE of 62.76, MAE of 37.82, and R² of 0.8330. The scatterplot of actual versus predicted values shows a solid alignment, confirming that the model can reliably forecast profit values. However, slight deviations at the extremes suggest that incorporating non-linear relationships or addressing outliers could further improve the model.
By integrating these insights from both the classification and regression models, businesses can enhance their customer segmentation, tailor marketing efforts, and optimize their strategies to increase customer loyalty, engagement, and revenue growth.