Introduction

The following analysis was conducted using data retrieved from Kaggle. Here is a brief description of the dataset, along with the exploratory questions my analysis attempts to answer: "The Mall Customers Dataset provides data on 200 individuals who visit a mall, including demographic information, annual income, and spending habits. This dataset is useful for exploratory data analysis, customer segmentation, and clustering tasks (e.g., K-means clustering)…

What is the relationship between annual income and spending score? Does gender or age influence spending behavior? Which customers have high spending scores but low incomes, or vice versa?"

library(h2o)
library(ggplot2)
library(reshape2)
library(pheatmap)
setwd("/Users/robertvargas/Documents/Projects/R/Mall Data")
# read.csv() already returns a data frame, so no extra data.frame() wrapper is needed
mall_df <- read.csv("Mall_Customers.csv")
colnames(mall_df) <- c("ID", "Gender", "Age", "AnnualIncome", "SpendingScore")

Exploratory Data Analysis

I began with a high-level look at the variables that could affect a customer’s Spending Score. The histogram of Spending Scores indicates that most scores are concentrated between 40 and 60, where the distribution peaks, with fewer individuals exhibiting very low or very high scores. The distribution appears roughly symmetric, indicating a balanced spread of values around the center.

When examining Spending Scores in relation to Annual Income, Gender, and Age, the only noticeable correlation is with Age. The scatter plot reveals that customers aged between 20 and 40 generally have higher Spending Scores compared to those over 40. To further quantify these relationships, I have constructed a correlation matrix.

# Visualizing the data through histograms, barplots, and scatterplots.
hist(mall_df$SpendingScore, col = 'purple', xlab = 'Customer Spending Score', ylab = 'Frequency', main = 'Histogram of Customer Spending Score')

plot(x = mall_df$AnnualIncome, y = mall_df$SpendingScore, xlab = 'Annual Income (Thousands)', ylab = 'Spending Score', main = 'Spending Score vs Annual Income', col = 'purple', pch = 16)

ggplot(data = mall_df, aes(x = Gender, y = SpendingScore)) + stat_summary(fun = 'mean', geom = 'bar', aes(fill = Gender))+
  labs(title = "Average Spending Score by Gender", x = "Gender", y = "Spending Score")

plot(x = mall_df$Age , y = mall_df$SpendingScore, xlab = 'Age', ylab = 'Spending Score', main = 'Spending Score vs Age', col = 'purple', pch = 16)

# Correlation Heatmap
cor_df = subset(mall_df,select = -c(Gender, ID ))
cor_matrix = cor(cor_df)
pheatmap(cor_matrix, display_numbers = TRUE, main = 'Correlation Heatmap')
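
The heatmap shows how strong each correlation is, but not whether it could plausibly be noise. As a hedged sketch, base R's `cor.test()` reports a Pearson correlation together with a p-value and a confidence interval. The `toy_df` data frame below is simulated only so that the snippet runs on its own; it is not the Kaggle data, and its slope and noise level are arbitrary assumptions. With the real data, one would pass `mall_df$Age` and `mall_df$SpendingScore` instead.

```r
# Simulated stand-in for mall_df (hypothetical values, NOT the Kaggle data)
set.seed(42)
toy_df <- data.frame(Age = sample(18:70, 200, replace = TRUE))
toy_df$SpendingScore <- 75 - 0.6 * toy_df$Age + rnorm(200, sd = 20)

# Pearson correlation with a significance test and a 95% confidence interval
age_test <- cor.test(toy_df$Age, toy_df$SpendingScore, method = "pearson")

age_test$estimate   # correlation coefficient
age_test$p.value    # p-value for the null hypothesis of zero correlation
age_test$conf.int   # 95% confidence interval for the correlation
```

A confidence interval that excludes zero supports reading the scatter plot pattern as more than chance variation.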

Estimating a Model

I used a Generalized Linear Model (GLM) to assess the relationship between demographic features (Age, Gender, Annual Income) and spending score. This model allows us to estimate the effect of each predictor while controlling for the others.
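
A Gaussian GLM with the identity link and no regularization is the same model as ordinary least squares, so the specification can be cross-checked in base R without h2o. The sketch below uses a simulated `toy_df` (an assumption made so the snippet is self-contained, not the Kaggle data) and compares `glm()` with `lm()` on the same formula.

```r
# Simulated stand-in for mall_df (hypothetical values, NOT the Kaggle data)
set.seed(1)
n <- 200
toy_df <- data.frame(
  Gender       = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Age          = sample(18:70, n, replace = TRUE),
  AnnualIncome = sample(15:137, n, replace = TRUE)
)
toy_df$SpendingScore <- 75 - 0.6 * toy_df$Age + rnorm(n, sd = 20)

# Gaussian GLM; identity is the default link for the gaussian family
glm_fit <- glm(SpendingScore ~ Gender + Age + AnnualIncome,
               data = toy_df, family = gaussian())

# Ordinary least squares on the same formula
ols_fit <- lm(SpendingScore ~ Gender + Age + AnnualIncome, data = toy_df)

# The two sets of coefficient estimates agree up to numerical tolerance
all.equal(coef(glm_fit), coef(ols_fit))

# Estimates, standard errors, test statistics, and p-values
summary(ols_fit)$coefficients
```

Note that `lm()` reports t statistics rather than z statistics, so p-values from this cross-check can differ slightly from a table based on z-values even when the estimates match.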

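# creating the model
# NOTE: `df` is not defined in the code shown; this assumes it is the H2O copy of mall_df
h2o.init()
mall_df$Gender <- as.factor(mall_df$Gender)  # treat Gender as categorical
df <- as.h2o(mall_df)

# lambda = 0 disables regularization, which compute_p_values = TRUE requires
model = h2o.glm(x = c("Gender", "Age", "AnnualIncome"),
                y = "SpendingScore",
                training_frame = df,
                family = "gaussian",
                lambda = 0,
                compute_p_values = TRUE )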
coefficients <- h2o.coef(model)
coef_df <- as.data.frame(coefficients)
coef_df$Variable <- rownames(coef_df)
colnames(coef_df) <- c("Coefficient", "Variable")

# Plotting the coefficient values
ggplot(coef_df, aes(x = reorder(Variable, Coefficient), y = Coefficient)) +
  geom_bar(stat = "identity", fill = 'purple') +
  coord_flip() + 
  theme_minimal() +
  labs(title = "Model Coefficients", x = "Predictor", y = "Coefficient Value")

model@model$coefficients_table
## Coefficients: glm coefficients
##          names coefficients std_error   z_value  p_value standardized_coefficients
## 1    Intercept    73.930034  6.642253 11.130265 0.000000                 51.085823
## 2  Gender.Male    -2.013234  3.511825 -0.573273 0.567117                 -2.013234
## 3          Age    -0.600371  0.124916 -4.806205 0.000003                 -8.386587
## 4 AnnualIncome     0.007929  0.066420  0.119383 0.905094                  0.208263

Upon reviewing the model’s coefficient summary, I found that only the intercept and Age are statistically significant at the 0.05 level. For each additional year of age, the predicted Spending Score decreases by approximately 0.60 points, holding Gender and Annual Income constant. The estimate for males is about 2 points lower and the estimate for Annual Income is slightly positive, but with p-values of roughly 0.57 and 0.91 neither relationship is statistically significant, so no firm conclusions can be drawn about them.
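
To make the interpretation concrete, the fitted equation can be evaluated by hand. The sketch below copies the coefficients from the table above and applies them to two hypothetical customers; the ages and incomes are illustrative assumptions, and `predict_score` is a helper name introduced here, not part of h2o.

```r
# Coefficients copied from the coefficient table reported above
b <- c(Intercept = 73.930034, Gender.Male = -2.013234,
       Age = -0.600371, AnnualIncome = 0.007929)

# Linear predictor for one customer; income is in thousands, as in the plots
predict_score <- function(age, income, male = FALSE) {
  unname(b["Intercept"] + b["Gender.Male"] * as.numeric(male) +
           b["Age"] * age + b["AnnualIncome"] * income)
}

# Hypothetical customers chosen only for illustration
predict_score(age = 30, income = 60)   # reference (non-male) group, age 30
predict_score(age = 40, income = 60)   # same profile, ten years older

# Ten extra years of age shift the prediction by 10 times the Age coefficient
predict_score(40, 60) - predict_score(30, 60)

# The z-value in the table is the estimate divided by its standard error
unname(b["Age"]) / 0.124916
```

The difference between the two predictions is about 6 points, which is the Age coefficient scaled by ten years.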

Segmentation

To further understand our customer demographics, I used K-means clustering and, through trial and error, settled on 3 distinct clusters. Visually, the clusters appear to be well-defined, though there is some overlap between them.
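
Trial and error is hard to reproduce, so one common alternative for choosing the number of clusters is an elbow plot of the total within-cluster sum of squares. The sketch below uses base R's `kmeans()` on a simulated `toy_df`; both are assumptions made for illustration, and this is not necessarily the routine or the feature set that produced the clusters in this report.

```r
# Simulated stand-in for mall_df (hypothetical values, NOT the Kaggle data)
set.seed(7)
toy_df <- data.frame(
  AnnualIncome  = c(rnorm(60, 30, 8), rnorm(80, 60, 8), rnorm(60, 90, 8)),
  SpendingScore = c(rnorm(60, 40, 10), rnorm(80, 70, 10), rnorm(60, 25, 10))
)

# Standardize so neither variable dominates the Euclidean distance
features <- scale(toy_df)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow": the k after which the curve flattens
plot(1:6, wss, type = "b", pch = 16, col = "purple",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")

# Fit the chosen k and attach the labels as a factor column
fit <- kmeans(features, centers = 3, nstart = 25)
toy_df$Cluster <- as.factor(fit$cluster)
```

Base R numbers its clusters from 1, whereas the clusters in this report are numbered from 0, so labels from the two approaches will not line up directly.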

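# Creating clusters
# `clusters` is assumed to hold the K-means assignments (a `predict` column)
# from a model fit that is not shown here
mall_df$Cluster <- as.factor(clusters$predict)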

ggplot(mall_df, aes(x = AnnualIncome, y = SpendingScore, color = Cluster)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "K-means Clustering: Spending Score vs Annual Income",
       x = "Annual Income (Thousands)",
       y = "Spending Score",
       color = "Cluster")

# Summary statistics for each cluster
cluster_summary <- aggregate(cbind(AnnualIncome, SpendingScore, Age) ~ Cluster, data = mall_df, FUN = mean)
print(cluster_summary)
##   Cluster AnnualIncome SpendingScore      Age
## 1       0     48.00000      37.59722 52.15278
## 2       1     89.24324      26.40541 38.78378
## 3       2     58.83516      69.84615 28.35165
melted_summary <- melt(cluster_summary, id.vars = "Cluster")

# Plotting the summary
ggplot(melted_summary, aes(x = Cluster, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Cluster Characteristics", x = "Cluster", y = "Mean Value") +
  theme_minimal()

For now, the clusters are simply labeled 0, 1, and 2. Descriptive names based on each cluster’s average age, income, and Spending Score would make them easier to discuss.
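
One way to attach names is to relabel the factor levels. The sketch below rebuilds the cluster means from the summary printed above and maps each cluster to a label; the label wording is only a suggestion based on those means, not something taken from the dataset.

```r
# Cluster means copied from the summary table printed above
cluster_summary <- data.frame(
  Cluster       = factor(c(0, 1, 2)),
  AnnualIncome  = c(48.00000, 89.24324, 58.83516),
  SpendingScore = c(37.59722, 26.40541, 69.84615),
  Age           = c(52.15278, 38.78378, 28.35165)
)

# Hypothetical descriptive names; the wording is a suggestion only
cluster_names <- c("0" = "Older, moderate spenders",
                   "1" = "High income, low spenders",
                   "2" = "Younger, high spenders")

# Relabel the factor levels while keeping the row order intact
cluster_summary$Segment <- factor(cluster_summary$Cluster,
                                  levels = names(cluster_names),
                                  labels = unname(cluster_names))

cluster_summary[, c("Cluster", "Segment", "SpendingScore")]
```

The same `factor(..., levels, labels)` call could be applied to `mall_df$Cluster` so that plot legends show the names instead of the numbers.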

Final Thoughts

From the exploratory analysis and the model, I concluded that, holding Annual Income and Gender constant, Age has a negative association with Spending Score. However, I believe there are other factors, not captured in the dataset, that could influence a customer’s Spending Score. It would be valuable to include data on race, family size, and occupation in a future analysis.

Due to these limitations, I believe the most meaningful insights were captured in the segmentation analysis. We already know that Age is a significant indicator of Spending Score, but to better understand the types of customers this mall attracts, we can look at our clusters. Cluster 2, for example, had the highest average Spending Score, making it crucial not only to attract these customers but also to maintain their loyalty. Additionally, we observe that some shoppers, despite having a stable income, don’t shop as much. For instance, Cluster 1 has an average annual income exceeding $80k but the lowest average Spending Score, which presents an opportunity for growth if the mall can better engage these customers.

In addition to capturing more customer-related data, it would be helpful to gain deeper insights into customer behavior. Information about the number of visits, average purchase value, etc., could further help us understand the mall’s customer base. Furthermore, data on promotions, timing of purchases, and store preferences could inform strategies to drive sales growth.