Introduction

The customer is king. By knowing your customers, you can serve them better, and satisfied customers tend to be more loyal to the business in the long run. It's a win-win for both parties. So why not get to know your customers?

In this project, we'll be looking at the Mall Customer Dataset available on Kaggle. We'll try to group these customers into clusters that we can identify and profile using the K-means clustering technique.

Data Preparation

Before starting, let's import the libraries that we'll be using.

library(tidyverse) # Data wrangling & visualization
library(factoextra) # Clustering visualization
library(kableExtra) # Pretty printing tables

Let's now import the data and get to know it.

data <- read_csv('mall_customers.csv')
head(data) %>% 
  kable() %>% 
  kable_styling()
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
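
A quick check of the dimensions confirms the size of the dataset:

dim(data) # 200 rows, 5 columns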

The data consists of 200 observations of 5 columns:

  • CustomerID : customer ID
  • Gender : gender of the customer
  • Age : age of the customer
  • Annual Income (k$) : annual income, in thousands of dollars
  • Spending Score (1-100) : a self-assigned spending score from 1 to 100

Before we go deeper, let's rename the columns for easier access.

data <- data %>% 
  rename(
    customer_id = CustomerID,
    gender = Gender,
    age = Age,
    income = `Annual Income (k$)`,
    score = `Spending Score (1-100)` 
  )

customer_id is just an identifier and carries no information for clustering. Let's drop the column.

data <- data %>% 
  select(-customer_id)

Now let's check for missing values.

colSums(is.na(data))
## gender    age income  score 
##      0      0      0      0

We don't have any missing data.

Data Exploration & Wrangling

Feature selection & engineering

We won't be able to use non-numeric values in the clustering, but it will be interesting to see the effect of gender on the clusters. We'll transform gender into a numeric value, using 1 to signify 'Female' and 0 for 'Male'.

data$gender <- ifelse(data$gender == 'Female', 1, 0)
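
As a quick sanity check on the encoding, the counts of 0 (male) and 1 (female) should add up to 200:

table(data$gender) # counts of 0 = male, 1 = female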

Distribution

Let's look at the distribution of our data.

outlier <- boxplot(data, main = 'Distribution of parameters', xlab = 'Parameters', ylab = 'Value')

K-means is sensitive to outliers, so it's best to remove them before clustering.

data <- data %>% 
  filter(income < min(outlier$out)) # keep rows below the smallest flagged income outlier
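
Let's check how many observations remain after the filter:

nrow(data) # observations left after dropping the income outliers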

Our variables are roughly on comparable scales, except for gender, which we encoded as 0 and 1. Because our clustering method is distance-based, let's scale the variables so that they contribute to the distances in a balanced way.

data.scaled <- scale(data)
boxplot(data.scaled, main = 'Distribution of parameters, outliers removed', xlab = 'Parameters', ylab = 'Value')
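
As a sanity check, each scaled column should now have a mean of 0 and a standard deviation of 1:

round(colMeans(data.scaled), 10) # means should all be 0
apply(data.scaled, 2, sd)        # standard deviations should all be 1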

K-Means Clustering

Let's start with the clustering. The technique we are using is K-means clustering. This technique places K points (centroids) at random. Each data point then calculates its distance to each centroid and assigns itself to the nearest one. Each centroid then moves to the center of the points assigned to it, the points recalculate their nearest centroid, and the cycle continues until the assignments stop changing.
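
To make the mechanics concrete, here is a minimal toy version of that loop, assuming k = 3 and a fixed number of iterations. It is an illustration only; in practice we rely on kmeans(), which also handles convergence and edge cases such as empty clusters.

# Toy K-means loop; k and the iteration count are arbitrary illustrative choices
X <- as.matrix(data.scaled)
k <- 3
set.seed(100)
centroids <- X[sample(nrow(X), k), ] # 1. place k centroids at random data points
for (i in 1:10) {
  # 2. assign each point to its nearest centroid (squared Euclidean distance)
  d <- sapply(1:k, function(j) colSums((t(X) - centroids[j, ])^2))
  assignment <- max.col(-d) # index of the smallest distance per point
  # 3. move each centroid to the center of the points assigned to it
  centroids <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
}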

Elbow plot

The number of centroids is configurable at the beginning, and choosing it is not a trivial matter, as it ultimately decides how many clusters there are. One method for finding the optimum number of centroids is an elbow plot. An elbow plot tries out varying numbers of centroids and plots the resulting total within-cluster sum of squares, i.e. the total distance from the centroid centers to the data points. From the plot, we can pick the number of centroids after which the decrease in distance levels off.

fviz_nbclust(data.scaled, kmeans, method = 'wss')
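
For intuition, the 'wss' curve drawn by fviz_nbclust() is just the total within-cluster sum of squares that kmeans() reports for each candidate k. A rough equivalent by hand:

# Roughly what the 'wss' method computes: tot.withinss for k = 1..10
# (nstart = 25 is an illustrative choice to stabilize the random starts)
wss <- sapply(1:10, function(k) kmeans(data.scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = 'b',
     xlab = 'Number of clusters k', ylab = 'Total within-cluster sum of squares')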

From the elbow plot, we can see the point where the gradient of the decrease flattens out. Adding centroids beyond this point buys little additional reduction in within-cluster distance, so we'll use k = 6 for the clustering below.

Clustering

Let's cluster our data and visualize it.

# Clustering
RNGkind(sample.kind = "Rounding") # reproduce the sampling behaviour of R versions before 3.6
set.seed(100)
cluster <- kmeans(data.scaled, centers = 6)

fviz_cluster(cluster, data.scaled)
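
Before profiling, we can glance at the cluster sizes and their (scaled) centers:

cluster$size    # number of customers assigned to each cluster
cluster$centers # each cluster's centroid, in scaled units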

Interpreting

Let's see what we can learn from our clusters.

# Assigning cluster back to data
data.cluster <- data %>% 
  mutate(cluster = factor(cluster$cluster))

# Making a comparison graph
data.cluster %>% 
  pivot_longer(-cluster, names_to = 'params') %>% 
  group_by(cluster, params) %>% 
  summarize(value = mean(value), .groups = 'drop') %>% 
  ggplot(aes(x = cluster, y = value, fill = cluster)) +
  geom_col() +
  facet_wrap(~params, scales = 'free_y') +
  labs(title = 'Customer cluster profile', x = 'Cluster', y = 'Average value')

From the graph, we can see that clusters 6, 2, and 1 have the top three average scores. The table below summarizes the statistics of each cluster.

data.table <- data.cluster %>% 
  group_by(cluster) %>% 
  summarize(
    avg_score = round(mean(score), 2),
    age_range = paste(min(age), '-', max(age)),
    avg_age = round(mean(age), 2),
    percent_female = round(mean(gender) * 100, 2),
    percent_male = 100 - percent_female,
    avg_income = round(mean(income) * 1000, 2), # income is in k$, convert to $
    percent_of_customer = round(n() / nrow(data.cluster) * 100, 2)
  ) %>% 
  arrange(-avg_score)
  
data.table %>% 
  kable() %>% 
  kable_styling()
cluster avg_score age_range avg_age percent_female percent_male avg_income percent_of_customer
6 82.11 27 - 40 32.76 55.26 44.74 85210.53 19.19
2 64.17 18 - 35 24.27 56.25 43.75 40625.00 24.24
1 41.54 35 - 70 56.57 0.00 100.00 47392.86 14.14
5 39.73 23 - 68 47.33 100.00 0.00 67591.84 24.75
4 22.88 20 - 58 39.81 93.75 6.25 27812.50 8.08
3 13.84 19 - 59 39.89 0.00 100.00 82421.05 9.60

From the table, we can see that the top-spending cluster, cluster 6, is made up of customers with an average age of 32.76 years. It is 55.26% female and 44.74% male, with an average income of $85,210.53. Cluster 6 makes up 19.19% of the mall's customer base. Looking at the data, cluster 6 seems to fit the profile of professional executives or families with a combined income.

The second cluster of spenders, cluster 2, makes up 24.24% of all customers. Its average age (24.27) and income ($40,625) are lower; the profile seems to fit early professionals who are still single.

The remaining 4 clusters are less distinct than the first two. They make up the remaining 56.57% of the mall's customers, and they all have lower average scores (< 50) than the first two groups.

Conclusion

Clustering is a very useful technique in machine learning. It falls under the realm of unsupervised learning and is geared more toward discovering relationships within the data, compared to supervised machine learning methods, which are geared toward prediction. In this project we implemented K-means clustering to understand a mall's customers more deeply. This information is valuable for the mall and the stores' management, and can be used to tailor marketing and sales strategies to achieve better sales.