Group Members
1. Benjamin Law Seng Yuan 23086626
2. Ang Eng Hooi 23086408
3. Clement Lim Kee Min 23121442
4. Meng Hui Dan 23104917
5. Shi Yan 24057850
Customer Analysis involves a thorough examination of a company’s optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.
By conducting a Customer Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.
This analysis explores customer behavior patterns in a retail context, focusing on understanding customer segmentation and channel preferences. The dataset contains customer purchase behavior across different channels (Store, Web, Catalog) along with demographic information and product preferences. Understanding these patterns is crucial for developing targeted marketing strategies and optimizing channel operations.
# Install required packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")
if (!requireNamespace("lattice", quietly = TRUE)) install.packages("lattice")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")
if (!requireNamespace("car", quietly = TRUE)) install.packages("car")
if (!requireNamespace("rpart", quietly = TRUE)) install.packages("rpart")
if (!requireNamespace("rpart.plot", quietly = TRUE)) install.packages("rpart.plot")
if (!requireNamespace("tikzDevice", quietly = TRUE)) install.packages("tikzDevice")
if (!requireNamespace("cluster", quietly = TRUE)) install.packages("cluster")
if (!requireNamespace("factoextra", quietly = TRUE)) install.packages("factoextra")
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")
if (!requireNamespace("reshape2", quietly = TRUE))install.packages("reshape2")
if (!requireNamespace("data.table", quietly = TRUE))install.packages("data.table")
# Load libraries
library(tidyverse) #A collection of R packages for data manipulation, visualization, and analysis
library(randomForest) #For building random forest models for classification and regression
library(corrplot) #For visualizing correlation matrices
library(lattice) #For creating high-level data visualizations
library(ggplot2) #Part of tidyverse, used for advanced data visualization with a layered grammar of graphics
library(caret) #For machine learning workflows, including data preprocessing, model training, and evaluation
library(car) #Tools for regression analysis, diagnostics, and multicollinearity checks
library(rpart) #For building classification and regression trees
library(rpart.plot) #For visualizing rpart decision trees
library(tikzDevice) #Exports R plots as TikZ/LaTeX figures for integration into LaTeX documents
library(cluster) #Tools for cluster analysis
library(factoextra) #Simplifies visualization and interpretation of clustering and PCA results
library(naniar) #For handling, visualizing, and analyzing missing data
library(reshape2) #For reshaping data between wide and long formats
library(data.table) #High-performance data manipulation and aggregation.
# Read the dataset
customer_data <- read.csv("customer_segmentation.csv")
# Check initial structure and summary
cat("Dataset Dimensions: ", dim(customer_data), "\n")
## Dataset Dimensions: 2240 29
str(customer_data)
## 'data.frame': 2240 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
summary(customer_data)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 52247 Mean :0.4442 Mean :0.5062
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Min. :0.000000 Min. :3 Min. :11
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11
## Median :0.00000 Median :0.000000 Median :3 Median :11
## Mean :0.01339 Mean :0.009375 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :1.000000 Max. :3 Max. :11
##
## Response
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1491
## 3rd Qu.:0.0000
## Max. :1.0000
##
# Count missing values in each column
missing_summary <- colSums(is.na(customer_data))
cat("Missing Values Summary:\n")
## Missing Values Summary:
print(missing_summary)
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 24 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## 0 0 0 0
## Response
## 0
# Remove rows with missing values
df <- na.omit(customer_data)
# Verify no missing values remain
cat("Missing Values After Cleaning:\n")
## Missing Values After Cleaning:
colSums(is.na(df))
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 0 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## 0 0 0 0
## Response
## 0
# Visualize missing values
gg_miss_var(customer_data) +
theme_minimal() +
labs(title = "Missing Values by Variable",
x = "Variables",
y = "Number of Missing Values")
# Check for duplicate rows
duplicates <- df[duplicated(df), ]
cat("Number of Duplicate Rows: ", nrow(duplicates), "\n")
## Number of Duplicate Rows: 0
# Remove duplicates
df <- df[!duplicated(df), ]
cat("Dataset Dimensions After Removing Duplicates: ", dim(df), "\n")
## Dataset Dimensions After Removing Duplicates: 2216 29
# Identify outliers
numerical_cols <- names(df)[sapply(df, is.numeric)]
categorical_cols <- names(df)[sapply(df, is.character)]
outlier_summary <- data.frame(Variable = character(), Lower_Bound = numeric(), Upper_Bound = numeric(),
Outliers_Count = integer(), stringsAsFactors = FALSE)
for (col in numerical_cols) {
if (sum(!is.na(df[[col]])) > 0) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outlier_count <- sum(df[[col]] < lower_bound | df[[col]] > upper_bound, na.rm = TRUE)
outlier_summary <- rbind(outlier_summary, data.frame(
Variable = col,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Outliers_Count = outlier_count
))
}
}
cat("Outlier Summary:\n")
## Outlier Summary:
print(outlier_summary)
## Variable Lower_Bound Upper_Bound Outliers_Count
## 25% ID -5595.750 16832.250 0
## 25%1 Year_Birth 1932.000 2004.000 3
## 25%2 Income -14525.500 118350.500 8
## 25%3 Kidhome -1.500 2.500 0
## 25%4 Teenhome -1.500 2.500 0
## 25%5 Recency -51.000 149.000 0
## 25%6 MntWines -697.500 1226.500 35
## 25%7 MntFruits -44.500 79.500 246
## 25%8 MntMeatProducts -308.375 556.625 174
## 25%9 MntFishProducts -67.500 120.500 222
## 25%10 MntSweetProducts -47.000 81.000 246
## 25%11 MntGoldProds -61.500 126.500 205
## 25%12 NumDealsPurchases -2.000 6.000 84
## 25%13 NumWebPurchases -4.000 12.000 3
## 25%14 NumCatalogPurchases -6.000 10.000 23
## 25%15 NumStorePurchases -4.500 15.500 0
## 25%16 NumWebVisitsMonth -3.000 13.000 8
## 25%17 AcceptedCmp3 0.000 0.000 163
## 25%18 AcceptedCmp4 0.000 0.000 164
## 25%19 AcceptedCmp5 0.000 0.000 162
## 25%20 AcceptedCmp1 0.000 0.000 142
## 25%21 AcceptedCmp2 0.000 0.000 30
## 25%22 Complain 0.000 0.000 21
## 25%23 Z_CostContact 3.000 3.000 0
## 25%24 Z_Revenue 11.000 11.000 0
## 25%25 Response 0.000 0.000 333
# Cap and floor outliers
cap_floor_outliers <- function(df, cols) {
for (col in cols) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
df[[col]] <- ifelse(df[[col]] < lower_bound, lower_bound, df[[col]])
df[[col]] <- ifelse(df[[col]] > upper_bound, upper_bound, df[[col]])
}
return(df)
}
# Apply the function to the dataset
df <- cap_floor_outliers(df, numerical_cols)
Check whether any outliers remain after applying the cap-and-floor method:
# Outlier Summary after cap floor method
outlier_summary <- data.frame(Variable = character(), Lower_Bound = numeric(), Upper_Bound = numeric(),
Outliers_Count = integer(), stringsAsFactors = FALSE)
for (col in numerical_cols) {
if (sum(!is.na(df[[col]])) > 0) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outlier_count <- sum(df[[col]] < lower_bound | df[[col]] > upper_bound, na.rm = TRUE)
outlier_summary <- rbind(outlier_summary, data.frame(
Variable = col,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Outliers_Count = outlier_count
))
}
}
cat("Outlier Summary:\n")
## Outlier Summary:
print(outlier_summary)
## Variable Lower_Bound Upper_Bound Outliers_Count
## 25% ID -5595.750 16832.250 0
## 25%1 Year_Birth 1932.000 2004.000 0
## 25%2 Income -14525.500 118350.500 0
## 25%3 Kidhome -1.500 2.500 0
## 25%4 Teenhome -1.500 2.500 0
## 25%5 Recency -51.000 149.000 0
## 25%6 MntWines -697.500 1226.500 0
## 25%7 MntFruits -44.500 79.500 0
## 25%8 MntMeatProducts -308.375 556.625 0
## 25%9 MntFishProducts -67.500 120.500 0
## 25%10 MntSweetProducts -47.000 81.000 0
## 25%11 MntGoldProds -61.500 126.500 0
## 25%12 NumDealsPurchases -2.000 6.000 0
## 25%13 NumWebPurchases -4.000 12.000 0
## 25%14 NumCatalogPurchases -6.000 10.000 0
## 25%15 NumStorePurchases -4.500 15.500 0
## 25%16 NumWebVisitsMonth -3.000 13.000 0
## 25%17 AcceptedCmp3 0.000 0.000 0
## 25%18 AcceptedCmp4 0.000 0.000 0
## 25%19 AcceptedCmp5 0.000 0.000 0
## 25%20 AcceptedCmp1 0.000 0.000 0
## 25%21 AcceptedCmp2 0.000 0.000 0
## 25%22 Complain 0.000 0.000 0
## 25%23 Z_CostContact 3.000 3.000 0
## 25%24 Z_Revenue 11.000 11.000 0
## 25%25 Response 0.000 0.000 0
# Summary of the cleaned dataset
summary(df)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1932 Length:2216 Length:2216
## 1st Qu.: 2815 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5588 Mean :1969
## 3rd Qu.: 8422 3rd Qu.:1977
## Max. :11191 Max. :1996
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2216
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 51874 Mean :0.4418 Mean :0.5054
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :118351 Max. :2.0000 Max. :2.0000
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0
## Median :49.00 Median : 174.5 Median : 8.00 Median : 68.0
## Mean :49.01 Mean : 303.3 Mean :21.57 Mean :151.3
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.:33.00 3rd Qu.:232.2
## Max. :99.00 Max. :1226.5 Max. :79.50 Max. :556.6
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. :0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.0 1st Qu.:1.000
## Median : 12.00 Median : 8.00 Median : 24.5 Median :2.000
## Mean : 32.19 Mean :21.95 Mean : 39.5 Mean :2.216
## 3rd Qu.: 50.00 3rd Qu.:33.00 3rd Qu.: 56.0 3rd Qu.:3.000
## Max. :120.50 Max. :81.00 Max. :126.5 Max. :6.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.068 Mean : 2.633 Mean : 5.801 Mean : 5.301
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :12.000 Max. :10.000 Max. :13.000 Max. :13.000
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
## Complain Z_CostContact Z_Revenue Response
## Min. :0 Min. :3 Min. :11 Min. :0
## 1st Qu.:0 1st Qu.:3 1st Qu.:11 1st Qu.:0
## Median :0 Median :3 Median :11 Median :0
## Mean :0 Mean :3 Mean :11 Mean :0
## 3rd Qu.:0 3rd Qu.:3 3rd Qu.:11 3rd Qu.:0
## Max. :0 Max. :3 Max. :11 Max. :0
cat("Final Dataset Dimensions: ", dim(df), "\n")
## Final Dataset Dimensions: 2216 29
# Compare dataset dimensions before and after cleaning
before_cleaning <- nrow(customer_data)
after_cleaning <- nrow(df)
numeric_data <- customer_data %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
cleaning_summary <- data.frame(
Step = c("Before Cleaning", "After Cleaning"),
Rows = c(before_cleaning, after_cleaning)
)
ggplot(cleaning_summary, aes(x = Step, y = Rows, fill = Step)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Dataset Size Before and After Cleaning",
x = "Step",
y = "Number of Rows")
# Visualize numeric variable distributions
ggplot(numeric_data, aes(x = Value)) +
geom_histogram(fill = "steelblue", bins = 30) +
facet_wrap(~Variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Variables",
x = "Value",
y = "Frequency")
# Product category analysis
purchase_data <- df %>%
select(MntWines, MntFruits, MntMeatProducts,
MntFishProducts, MntSweetProducts, MntGoldProds) %>%
gather(key = "Category", value = "Amount")
# Spending by category
p1 <- ggplot(purchase_data, aes(x = reorder(Category, -Amount), y = Amount)) +
geom_boxplot(fill = "lightblue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Purchase Amounts by Category")
# Channel usage patterns
channel_summary <- df %>%
summarise(
Store = sum(NumStorePurchases),
Web = sum(NumWebPurchases),
Catalog = sum(NumCatalogPurchases)
) %>%
gather(key = "Channel", value = "Total_Purchases")
p2 <- ggplot(channel_summary, aes(x = reorder(Channel, -Total_Purchases),
y = Total_Purchases)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Total Purchases by Channel")
gridExtra::grid.arrange(p1, p2, ncol = 2)
# Convert categorical variables to factors
df$Education <- as.factor(df$Education)
df$Marital_Status <- as.factor(df$Marital_Status)
# Frequency of categories
education_freq <- table(df$Education)
maristatus_freq <- table(df$Marital_Status)
# Calculate an appropriate y-axis limit
y_max <- max(education_freq) + 100
y1_max <- max(maristatus_freq) + 200
# Visualize using bar plot with adjusted y-axis
barplot(education_freq, main = "Education Distribution", col = "lightblue", ylim = c(0, y_max))
barplot(table(df$Marital_Status),main= "Marital Status",col= "lightblue", ylim= c(0, y1_max))
# Check the structure of the modified dataset
str(df)
## 'data.frame': 2216 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : num 1957 1954 1965 1984 1981 ...
## $ Education : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 5 4 3 5 5 5 ...
## $ Marital_Status : Factor w/ 8 levels "Absurd","Alone",..: 5 5 6 6 4 6 3 4 6 6 ...
## $ Income : num 58138 46344 71613 26646 58293 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : num 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : num 79.5 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : num 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : num 120 2 111 10 46 ...
## $ MntSweetProducts : num 81 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : num 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : num 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : num 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: num 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : num 7 5 4 6 5 6 6 8 9 13 ...
## $ AcceptedCmp3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:24] 11 28 44 49 59 72 91 92 93 129 ...
## ..- attr(*, "names")= chr [1:24] "11" "28" "44" "49" ...
# Create new features for target variable of Decision Tree
df$Preferred_Channel <- apply(df[, c("NumStorePurchases", "NumWebPurchases", "NumCatalogPurchases")], 1,
function(x) c("Store", "Web", "Catalog")[which.max(x)])
df$Preferred_Channel <- as.factor(df$Preferred_Channel)
df$Preferred_Channel <- as.numeric(as.factor(df$Preferred_Channel))
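Because as.factor() orders levels alphabetically, the numeric coding here is 1 = Catalog, 2 = Store and 3 = Web. A minimal sketch to double-check this mapping and the class balance, recomputing the text labels for comparison:
# Recompute the channel labels and cross-tabulate them against the numeric codes
channel_labels <- apply(df[, c("NumStorePurchases", "NumWebPurchases", "NumCatalogPurchases")], 1,
                        function(x) c("Store", "Web", "Catalog")[which.max(x)])
table(Label = channel_labels, Code = df$Preferred_Channel)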
#Finding correlation using Pearson Coefficient
# Select numerical columns
numerical_columns <- df[, sapply(df, is.numeric)]
# Calculate Pearson correlation for each numerical variable with 'Preferred Channel'
correlations <- sapply(numerical_columns, function(x) cor(x, df$Preferred_Channel, use = "complete.obs", method = "pearson"))
# Calculate the correlation matrix
correlation_matrix <- cor(numerical_columns, use = "complete.obs", method = "pearson")
# Melt the correlation matrix into a long format for ggplot
correlation_melted <- melt(correlation_matrix)
# Create the heatmap using ggplot2
heatmap <- ggplot(data = correlation_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
limit = c(-1, 1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "Correlation Heatmap", x = "Variables", y = "Variables")
# Print the heatmap
print(heatmap)
The heatmap groups the variables into spending patterns (MntWines, MntMeatProducts, etc.), purchasing channels, and other demographics. Income is positively correlated with the spending variables, suggesting that higher-income customers tend to spend more. Note: the grey boxes are binary features for which no correlation is shown.
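The correlations vector computed above (each numeric variable against the encoded Preferred_Channel) can also be inspected directly; a minimal sketch that sorts it by absolute strength, dropping the NA entries produced by constant (capped) binary columns:
# Rank variables by the absolute strength of their correlation with Preferred_Channel
sort(abs(correlations[!is.na(correlations)]), decreasing = TRUE)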
#Data Partition
#Split the dataset into 70% training and 30% testing sets
set.seed(123)
sample <- sample(1:nrow(df), 0.7 * nrow(df))
train_data <- df[sample, ]
test_data <- df[-sample, ]
# Consumer behavior target variable
df$Preferred_Channel <- as.factor(df$Preferred_Channel)
index_channel <- createDataPartition(df$Preferred_Channel, p = 0.7, list = FALSE)
train_channel <- df[index_channel, ]
test_channel <- df[-index_channel, ]
#Random Forest Model - Consumer Behavior Analysis
rf_channel <- randomForest(Preferred_Channel ~ Income + Kidhome + Teenhome +
MntFruits + MntMeatProducts + MntFishProducts +
MntSweetProducts + MntGoldProds +
NumDealsPurchases + NumWebPurchases +
NumCatalogPurchases + NumStorePurchases,
data = train_channel, importance = TRUE, ntree = 100)
# Feature Importance
importance_channel <- importance(rf_channel)
varImpPlot(rf_channel)
From the feature importance plot, NumStorePurchases, NumWebPurchases and NumCatalogPurchases are highly important: the model's overall accuracy would decrease if these features were removed during training (Mean Decrease Accuracy), and the Mean Decrease Gini shows that these three features also contribute most to the homogeneity of the decision trees in the random forest.
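To rank the predictors numerically rather than reading them off the plot, the importance matrix stored in importance_channel can be ordered by Mean Decrease Gini; a minimal sketch:
# Sort the random forest importance table by MeanDecreaseGini (largest first)
importance_df <- as.data.frame(importance_channel)
importance_df[order(-importance_df$MeanDecreaseGini), c("MeanDecreaseAccuracy", "MeanDecreaseGini")]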
#Decision Tree
tree_channel <- rpart(Preferred_Channel ~ NumStorePurchases + NumWebPurchases + NumCatalogPurchases ,
data = train_data, method = "class")
# Decision Tree Visualization
rpart.plot(tree_channel,
main = "Channel Preference Analysis Decision Tree",
type = 2,
extra = 102,
tweak = 2
)
The fitted tree splits on thresholds such as NumCatalogPurchases >= 8 and NumWebPurchases >= 8 when assigning the preferred channel.
# Predict channel preference on the test set and tabulate against the actual labels
pred_channel <- predict(tree_channel, test_data, type = "class")
conf_matrix_channel <- table(Predicted = pred_channel, Actual = test_data$Preferred_Channel)
print(conf_matrix_channel)
## Actual
## Predicted 1 2 3
## 1 31 0 0
## 2 5 515 3
## 3 1 3 107
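As a quick check on the tree, the overall accuracy can be computed from the confusion matrix above (roughly 0.98 given the counts shown); a minimal sketch:
# Overall accuracy: correctly classified test cases divided by all test cases
accuracy_channel <- sum(diag(conf_matrix_channel)) / sum(conf_matrix_channel)
cat("Decision Tree Accuracy:", round(accuracy_channel, 4), "\n")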
# Select and scale features
features <- df %>%
select(Income, Recency, MntWines, MntFruits, MntMeatProducts,
MntFishProducts, MntSweetProducts, MntGoldProds,
NumDealsPurchases, NumWebPurchases, NumCatalogPurchases,
NumStorePurchases, NumWebVisitsMonth)
scaled_features <- scale(features)
# Correlation analysis
corrplot(cor(features),
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
diag = FALSE)
Feature Scaling before K-means Clustering
Before fitting the model, one further step is needed. K-means is sensitive to variables measured in incomparable units, which can lead to misleading results. To ensure all variables are treated equally, they need to be transformed onto a similar scale so that they can be compared correctly by the distance metric.
There are different ways to do this. The most common is standardization, which consists of subtracting the mean from each feature value and then dividing by its standard deviation. This yields features with a mean of 0 and a standard deviation of 1. The variables can be scaled with the scale() function; since it returns a matrix, the code is cleaner in a base-R style than a tidyverse style.
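As a small sanity check, standardizing a single column by hand should reproduce the corresponding column of scaled_features; a minimal sketch using Income:
# Manual standardization of Income: subtract the mean, divide by the standard deviation
income_z <- (features$Income - mean(features$Income)) / sd(features$Income)
all.equal(as.numeric(scaled_features[, "Income"]), income_z)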
# Calculate metrics for different k values
set.seed(333)
# Calculate WSS for k=1 to k=10
set.seed(333)
k_values <- 1:10
wss_values <- sapply(k_values, function(k) {
kmeans(scaled_features, centers = k, nstart = 25)$tot.withinss
})
# Function to calculate elbow point using maximum curvature
find_elbow_point <- function(x, y) {
# Calculate distances from point to line
nPoints <- length(x)
allCoords <- cbind(x, y)
firstPoint <- allCoords[1,]
lineVec <- allCoords[nPoints,] - firstPoint
lineVecN <- lineVec / sqrt(sum(lineVec * lineVec))
# Vector from point to first point
vecFromFirst <- t(apply(allCoords, 1, function(p) p - firstPoint))
# Distance from point to line
scalarProduct <- vecFromFirst %*% lineVecN
vecFromFirstParallel <- outer(as.vector(scalarProduct), lineVecN)
vecToLine <- vecFromFirst - vecFromFirstParallel
distToLine <- sqrt(rowSums(vecToLine * vecToLine))
# Return the elbow point
return(which.max(distToLine))
}
# Find elbow point
elbow_k <- find_elbow_point(k_values, wss_values)
# Create data frame for plotting
plot_data <- data.frame(
k = k_values,
wss = wss_values
)
# Calculate coordinates for annotation
x_pos <- elbow_k + 0.2
y_pos <- wss_values[elbow_k]
# Create elbow plot with automatic indicator
ggplot(plot_data, aes(x = k, y = wss)) +
geom_line(size = 1) +
geom_point(size = 3) +
geom_vline(xintercept = elbow_k, linetype = "dashed", color = "red", size = 1) +
geom_point(aes(x = elbow_k, y = wss_values[elbow_k]),
color = "red", size = 4) +
annotate("text", x = x_pos, y = y_pos, label = paste("Elbow Point k =", elbow_k), hjust = -0.3, color = "red") +
labs(title = "Elbow Method for Optimal Number of Clusters (k)",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares (WSS)") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
)
# Print elbow point details
print(paste("Optimal Number of Clusters (k) =", elbow_k))
## [1] "Optimal Number of Clusters (k) = 3"
print(paste("Within-Cluster Sum of Squares (WSS) at Elbow Point =", round(wss_values[elbow_k], 2)))
## [1] "Within-Cluster Sum of Squares (WSS) at Elbow Point = 14524.29"
The Elbow Method is used to determine the optimal number of clusters for the k-means clustering algorithm.
Within-Cluster Sum of Squares (WSS) Calculation
Process
The code begins by setting a random seed (333) for reproducibility. It then calculates the Within-Cluster Sum of Squares (WSS) for k values ranging from 1 to 10. The sapply() call iterates through each k value, running k-means clustering with 25 different random starts (nstart = 25) on the scaled feature set. For each k, it stores the total within-cluster sum of squares (tot.withinss), which measures how compact the clusters are; lower WSS values indicate more compact clusters.
Elbow Point Calculation Function
The find_elbow_point() function implements a geometric approach to finding the elbow. It calculates the perpendicular distance from each point of the WSS curve to the straight line drawn from the first point to the last point; the point with the maximum distance from this line is identified as the elbow point. This effectively finds the point of maximum curvature in the WSS curve, which represents the optimal k value.
Elbow Plot Creation
The elbow plot is created with ggplot2 and has several key elements:
- a line plot showing the WSS values for each k
- points marking each k value
- a vertical red dashed line at the calculated elbow point
- a red point highlighting the exact elbow location
- an annotation indicating the optimal k value
Output Details
The chunk concludes by printing two key pieces of information:
1. the optimal number of clusters (k) identified by the algorithm
2. the corresponding Within-Cluster Sum of Squares (WSS) at this elbow point, rounded to two decimal places
These values provide quantitative support for the visual result and are used in the subsequent clustering step.
# Perform clustering
set.seed(333)
km_result <- kmeans(scaled_features, centers = 3, nstart = 25)
# Visualize clusters using PCA
fviz_cluster(km_result, data = scaled_features,
main = "Customer Segments",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())
K-means clustering groups the data into clusters of similar observations.
The algorithm is as follows (a minimal sketch of the assignment and update steps follows this list):
1. Choose the number of clusters, K.
2. Select K points at random as the initial centroids.
3. Assign each data point to its closest centroid, forming K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to its new closest cluster, repeating until assignments no longer change.
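The assignment and update steps can be written in a few lines of base R; this is an illustrative, single-iteration sketch on the scaled features, not the fitted model used below:
# One illustrative k-means iteration with k = 3
set.seed(1)
k <- 3
centroids <- scaled_features[sample(nrow(scaled_features), k), ]            # step 2: random initial centroids
dists <- as.matrix(dist(rbind(centroids, scaled_features)))[-(1:k), 1:k]    # Euclidean distance of every point to each centroid
assignment <- apply(dists, 1, which.min)                                    # step 3: assign each point to its closest centroid
new_centroids <- apply(scaled_features, 2,
                       function(col) tapply(col, assignment, mean))         # step 4: recompute each cluster's centroid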
Using the kmeans() function to perform the clustering
The dataset containing the scaled features is passed as the first argument, and the centers parameter is set to the optimal number of clusters, 3, as suggested by the elbow plot, so the algorithm will look for 3 clusters. The nstart parameter is set to 25, meaning the algorithm runs 25 times with different initial starting points and keeps the best clustering solution.
# Add clusters to original data
df$Cluster <- as.factor(km_result$cluster)
# Calculate cluster profiles
cluster_profile <- df %>%
group_by(Cluster) %>%
summarise(
Size = n(),
Avg_Income = mean(Income),
Avg_Total_Spent = mean(MntWines + MntFruits + MntMeatProducts +
MntFishProducts + MntSweetProducts + MntGoldProds),
Avg_Web_Visits = mean(NumWebVisitsMonth),
Avg_Purchases = mean(NumWebPurchases + NumCatalogPurchases + NumStorePurchases),
Avg_Recency = mean(Recency)
)
# Visualize cluster sizes
ggplot(cluster_profile, aes(x = Cluster, y = Size)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = Size), vjust = -0.5, size = 4) +
theme_minimal() +
labs(title = "Cluster Sizes", x = "Cluster", y = "Number of Customers")
The three clusters and their respective sizes are shown in the bar plot above.
# Visualize key metrics by cluster
cluster_metrics <- cluster_profile %>%
select(-Size) %>%
gather(key = "Metric", value = "Value", -Cluster)
ggplot(cluster_metrics, aes(x = Cluster, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = sprintf("%.1f", Value)), position = position_dodge(width = 0.9),
vjust = -0.5, size = 2.5) +
facet_wrap(~Metric, scales = "free_y") +
theme_minimal() +
labs(title = "Cluster Characteristics",
x = "Cluster",
y = "Value")
Cluster 1: High-Value customers
Cluster 2: Regular Shoppers
Cluster 3: Budget-Conscious Customers
The K-means clustering analysis revealed three distinct customer
segments:
1. High-Value Customers (Cluster 1): This group demonstrates the highest
spending across all product categories, including wines, meat products,
and gold products. They are likely to use multiple channels for
purchases, including online and in-store options. These customers visit
frequently, showing high engagement with the brand, and make frequent
purchases with a high average transaction value.
2. Regular Shoppers (Cluster 2): These customers exhibit moderate
spending across most product categories, with a balanced distribution.
They prefer a mix of online and catalog purchases. Their visit frequency
is regular but not as high as Cluster 1. Their purchasing behavior is
consistent, with a moderate number of transactions.
3. Budget-Conscious Customers (Cluster 3): This segment shows the lowest
spending across all product categories, focusing primarily on essential
items. They prefer in-store purchases and tend to avoid premium
channels. Their visit frequency is lower compared to the other clusters,
and they are characterized as occasional buyers with a low average
transaction value.
# Calculate silhouette score
sil <- silhouette(km_result$cluster, dist(scaled_features))
avg_sil <- mean(sil[,3])
print("Average Silhouette Score:")
## [1] "Average Silhouette Score:"
print(avg_sil)
## [1] 0.2917457
# Plot silhouette
fviz_silhouette(sil) +
theme_minimal() +
ggtitle("Silhouette Plot")
## cluster size ave.sil.width
## 1 1 558 0.10
## 2 2 673 0.20
## 3 3 985 0.47
Observation
Cluster 3 has the highest average silhouette width (0.47), suggesting
it has the best clustering quality among the three clusters.
Cluster 1 has the lowest average silhouette width (0.10), implying that
it may have weaker clustering quality compared to others.
Cluster 2 has a moderate silhouette width (0.20), which indicates some
boundary overlap or less distinct clustering.
# Total Sum of Squares
total_ss <- sum(scale(scaled_features)^2)
# Between Sum of Squares (BSS)
bss <- km_result$betweenss
# Within Sum of Squares (WSS)
wss <- km_result$tot.withinss
# Calculate Explained variance ratio
explained_variance_ratio <- bss / (bss + wss)
# Print metrics
print("K-means Clustering Metrics:")
## [1] "K-means Clustering Metrics:"
print(paste("Total Sum of Squares:", round(total_ss, 2)))
## [1] "Total Sum of Squares: 28795"
print(paste("Between Sum of Squares:", round(bss, 2)))
## [1] "Between Sum of Squares: 14270.71"
print(paste("Within Sum of Squares:", round(wss, 2)))
## [1] "Within Sum of Squares: 14524.29"
print(paste("Explained Variance Ratio:", round(explained_variance_ratio, 4)))
## [1] "Explained Variance Ratio: 0.4956"
Interpretation of Results
Explained Variance Ratio
A value of 0.4956 means that the clusters explain 49.56% of the total variance in the dataset. This is a moderate result, as higher explained variance ratios (closer to 1) indicate better clustering.
High Within-Cluster Variance (WSS):
The WSS (14524.29) is slightly high, suggesting that data points within
clusters are not tightly grouped.
Between-Cluster Variance (BSS):
The BSS (14270.71) is comparable to WSS, indicating that clusters are
moderately well-separated. Ideally, BSS should dominate (e.g., > 70%
of TSS).
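As a quick arithmetic check, the explained variance ratio reported above can be reproduced from the printed BSS and WSS values:
# Explained variance ratio recomputed from the printed sums of squares
bss_printed <- 14270.71
wss_printed <- 14524.29
round(bss_printed / (bss_printed + wss_printed), 4)  # approximately 0.4956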
# Calculate average distance to centroid for each cluster
distances_to_centroid <- numeric(length(km_result$cluster))
for(i in 1:nrow(scaled_features)) {
cluster_id <- km_result$cluster[i]
centroid <- km_result$centers[cluster_id,]
distances_to_centroid[i] <- sqrt(sum((scaled_features[i,] - centroid)^2))
}
avg_dist_by_cluster <- tapply(distances_to_centroid, km_result$cluster, mean)
print("Average Distance to Centroid by Cluster:")
## [1] "Average Distance to Centroid by Cluster:"
print(round(avg_dist_by_cluster, 4))
## 1 2 3
## 2.7564 3.1401 1.6826
# Calculate and display quality metrics
quality_metrics <- data.frame(
Metric = c("Average Silhouette Score", "Between SS / Total SS"),
Value = c(avg_sil, km_result$betweenss / km_result$totss)
)
print(quality_metrics)
## Metric Value
## 1 Average Silhouette Score 0.2917457
## 2 Between SS / Total SS 0.4955969
Silhouette Score Interpretation
Scale:
-1.0 to -0.5: poor clustering, possible misassignment
-0.5 to 0.0: overlapping clusters
0.0 to 0.25: weak structure
0.25 to 0.50: fair structure
0.50 to 0.75: good structure
0.75 to 1.00: strong structure
An average silhouette score of 0.2917 indicates:
- Fair clustering quality: the clusters have some meaningful structure.
- Moderate separation: objects are somewhat well matched to their clusters.
- Some overlap: there may be some overlap between clusters.
- Acceptable for real data: many real-world datasets achieve scores in this range due to natural noise and complexity.
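For a single observation i the silhouette width is s(i) = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. A minimal sketch computing this from first principles for the first observation, which should agree with sil[1, 3]:
# Silhouette width of observation 1, computed directly from the distance matrix
d <- as.matrix(dist(scaled_features))
own <- km_result$cluster == km_result$cluster[1]
a <- mean(d[1, own & seq_len(nrow(d)) != 1])                  # mean distance to its own cluster (excluding itself)
b <- min(tapply(d[1, !own], km_result$cluster[!own], mean))   # smallest mean distance to another cluster
(b - a) / max(a, b)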
The between-cluster sum of squares accounts for 49.56% (0.4956) of the total sum of squares, matching the explained variance ratio reported earlier.