The customer is king. By knowing your customers, you can serve them better, and satisfied customers tend to be more loyal to the business in the long term. It's a win-win for both parties. So why not get to know your customers?
In this project, we'll be looking at the Mall Customer Dataset available on Kaggle. We'll try to group these customers into clusters that we can identify and profile using the K-means clustering technique.
Before starting, let's import the libraries that we'll be using.
library(tidyverse) # Data wrangling & visualization
library(factoextra) # Clustering visualization
library(kableExtra) # Pretty printing tables
Let's now import the data and get to know it.
data <- read_csv('mall_customers.csv')
head(data) %>%
kable() %>%
kable_styling()
| CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| 1 | Male | 19 | 15 | 39 |
| 2 | Male | 21 | 15 | 81 |
| 3 | Female | 20 | 16 | 6 |
| 4 | Female | 23 | 16 | 77 |
| 5 | Female | 31 | 17 | 40 |
| 6 | Female | 22 | 17 | 76 |
The data consists of 200 observations containing 5 columns:
- CustomerID: a unique identifier for each customer
- Gender: the customer's gender
- Age: the customer's age in years
- Annual Income (k$): annual income in thousands of dollars
- Spending Score (1-100): a score assigned by the mall based on customer behavior and spending
Before we go deeper, let's rename the columns for easier access.
data <- data %>%
rename(
customer_id = CustomerID,
gender = Gender,
age = Age,
income = `Annual Income (k$)`,
score = `Spending Score (1-100)`
)
customer_id is only an identifier column and we don't need it. Let's drop the column.
data <- data %>%
select(-customer_id)
Now let's check for missing values.
colSums(is.na(data))
## gender age income score
## 0 0 0 0
We don't have any missing data.
We won't be able to use non-numeric values during clustering, but it will be interesting to see the effect of gender on the clusters. We'll transform gender into a numeric value, using 1 to signify 'female' and 0 for 'male'.
data$gender <- ifelse(data$gender == 'Female', 1, 0)
Let's see the distribution of our data.
outlier <- boxplot(data, main = 'Distribution of parameters', xlab = 'Parameters', ylab = 'Value')
K-means is sensitive to outliers, so it's best to remove them.
data <- data %>%
filter(income < min(outlier$out)) # min() keeps the comparison valid when more than one outlier value is flagged
Our data is already roughly on the same scale, except for gender, which we encoded as 0 and 1. Because our clustering method is based on distance, let's scale the variables so they are balanced.
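For reference, scale() performs z-score standardization: each column has its mean subtracted and is divided by its standard deviation. A minimal equivalent sketch (the manual_scale helper is purely illustrative, not part of the analysis):
# Z-score standardization by hand: (x - mean(x)) / sd(x), column by column
manual_scale <- function(df) {
  as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))
}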
data.scaled <- scale(data)
boxplot(data.scaled, main = 'Distribution of parameters, outliers removed', xlab = 'Parameters', ylab = 'Value')
Let's start with the clustering. The technique we are using is K-means clustering. This technique places K points (centroids) at random. Each data point then calculates its distance to every centroid and assigns itself to the nearest one. Each centroid then moves to the center of the points assigned to it, the points recalculate their nearest centroid, and the cycle continues until the assignments stop changing.
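To make the algorithm concrete, here is a minimal sketch of a single assignment-and-update step on a numeric matrix; kmeans_step and its arguments are illustrative names, and the actual clustering below relies on R's built-in kmeans():
# One K-means iteration, for illustration only
kmeans_step <- function(X, centroids) {
  # Assignment step: squared Euclidean distance from every point to every centroid
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums((X - matrix(centroids[j, ], nrow(X), ncol(X), byrow = TRUE))^2)
  })
  assignments <- max.col(-d) # index of the nearest centroid for each point
  # Update step: move each centroid to the mean of its assigned points
  # (an empty cluster would yield NaN in this simple sketch)
  new_centroids <- t(sapply(seq_len(nrow(centroids)), function(j) {
    colMeans(X[assignments == j, , drop = FALSE])
  }))
  list(assignments = assignments, centroids = new_centroids)
}
Iterating this step until the assignments stop changing reproduces the basic (Lloyd's) algorithm; kmeans() adds multiple random starts and other refinements on top.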
The number of centroids is configurable at the start. Choosing how many centroids to use is not a trivial matter, as it ultimately decides how many clusters there are. One method for finding the optimal number of centroids is an elbow plot. An elbow plot tries out varying numbers of centroids and plots the resulting total distance from the centroids to their assigned data points (the total within-cluster sum of squares). From the plot, we pick the number of centroids beyond which the decrease in distance flattens out.
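fviz_nbclust() below automates this, but the same curve can be computed directly from the tot.withinss field of a fitted kmeans object; a minimal sketch, assuming the data.scaled matrix created above:
# Total within-cluster sum of squares for k = 1..10
set.seed(100)
wss <- sapply(1:10, function(k) kmeans(data.scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = 'b', xlab = 'Number of clusters k', ylab = 'Total within-cluster SS')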
fviz_nbclust(data.scaled, kmeans, method = 'wss')
From the elbow plot, we look for the "inflection point" where the gradient of the decrease flattens out. Based on the plot, we'll use 6 centroids for our clustering.
Let's cluster our data and visualize it.
# Clustering
RNGkind(sample.kind = "Rounding")
set.seed(100)
cluster <- kmeans(data.scaled, centers = 6)
fviz_cluster(cluster, data.scaled)
Let's see what we can learn from our clusters.
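Before profiling, it's worth a quick look at the fitted kmeans object itself; cluster sizes and (scaled) centroid coordinates are stored in its size and centers fields:
# Quick sanity checks on the fit
cluster$size # number of customers per cluster
cluster$centers # centroid coordinates, in scaled units
cluster$betweenss / cluster$totss # share of total variance captured by the clustering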
# Assigning cluster back to data
data.cluster <- data %>%
mutate(cluster = factor(cluster$cluster))
# Making comparison graph
data.cluster %>%
pivot_longer(-cluster, names_to = 'params') %>%
group_by(cluster, params) %>%
summarize(value = mean(value)) %>%
ggplot(aes( x = cluster, y = value, fill = cluster)) +
geom_col() +
facet_wrap(~params, scales = 'free_y') +
labs(title = 'Customer cluster profile', x = 'Parameters', y = 'Value')
From the graph, we can see that clusters 6, 2, and 1 have the top 3 average scores. The table below summarizes the statistics of each cluster.
data.table <- data.cluster %>%
group_by(cluster) %>%
summarize(
avg_score = round(mean(score), 2),
age_range = paste(min(age),'-',max(age)),
avg_age = round(mean(age), 2),
percent_female = round(mean(gender) * 100, 2),
percent_male = 100-percent_female,
avg_income = round(mean(income) * 1000, 2),
percent_of_customer = round(n() / nrow(data.cluster) * 100, 2)
) %>%
arrange(-avg_score)
data.table %>%
kable() %>%
kable_styling()
| cluster | avg_score | age_range | avg_age | percent_female | percent_male | avg_income | percent_of_customer |
|---|---|---|---|---|---|---|---|
| 6 | 82.11 | 27 - 40 | 32.76 | 55.26 | 44.74 | 85210.53 | 19.19 |
| 2 | 64.17 | 18 - 35 | 24.27 | 56.25 | 43.75 | 40625.00 | 24.24 |
| 1 | 41.54 | 35 - 70 | 56.57 | 0.00 | 100.00 | 47392.86 | 14.14 |
| 5 | 39.73 | 23 - 68 | 47.33 | 100.00 | 0.00 | 67591.84 | 24.75 |
| 4 | 22.88 | 20 - 58 | 39.81 | 93.75 | 6.25 | 27812.50 | 8.08 |
| 3 | 13.84 | 19 - 59 | 39.89 | 0.00 | 100.00 | 82421.05 | 9.60 |
From the table, we can see that the top-spending cluster, cluster 6, is made up of customers with an average age of 32.76 years. It is 55.26% female and 44.74% male, with an average income of $85,210.53. Cluster 6 makes up 19.19% of the mall's customer base and seems to fit the profile of professional executives or families with a combined income.
The second-highest spending cluster, cluster 2, makes up 24.24% of all customers. Its average age (24.27) and income ($40,625) are lower than cluster 6's. The profile seems to fit early-career professionals who are still single.
The remaining 4 clusters are less distinct than the first 2. They make up the remaining 56.57% of the mall's customers. However, they all have lower average scores (< 50) than the first 2 groups.
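As a quick arithmetic check, this combined share can be read straight off the summary table built above:
# Combined share of the four lower-scoring clusters
data.table %>%
  filter(!cluster %in% c('6', '2')) %>%
  summarize(share = sum(percent_of_customer))
## 14.14 + 24.75 + 8.08 + 9.60 = 56.57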
Clustering is a very useful technique in machine learning. It falls under the realm of unsupervised learning and is geared more towards discovering relations within the data, whereas supervised machine learning methods are geared towards prediction. In this project we have implemented K-means clustering to understand a mall's customers more deeply. This information is valuable for the mall's and the stores' management, and can be used to tailor marketing and sales strategies to achieve better sales.