Customer segmentation is the activity of grouping your customers by several characteristics, such as their personal information, spending behavior, or demographics. The purpose of customer segmentation is to understand each segment so that you can market and promote your brand effectively.

To understand your customer personas, you sometimes need a technique that helps you reach your goals efficiently. One way to do customer segmentation is to use machine learning algorithms to do the job. This article focuses on the difference between the K-Means and KNN algorithms in a customer segmentation case.

Mall Customer Segmentation

In this customer segmentation analysis, we use the Mall Customer Segmentation dataset, downloaded from Kaggle. The data itself comes from customer memberships at a mall. Here, we will group the customers based on their personal information and shopping behavior. Let us load the libraries and the dataset first.

Library setup
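The original setup chunk is not shown here. A minimal sketch of what it might look like, assuming the packages used later in the article (dplyr, ggplot2, class, caret) and the usual Kaggle file name, with the income and spending columns renamed to match the output below:

```r
library(dplyr)      # data wrangling and glimpse()
library(ggplot2)    # visualisation
library(class)      # knn()
library(caret)      # confusionMatrix()

# assumed file name from Kaggle; the last two columns are renamed
# to match the variable names shown in the glimpse output below
customer <- read.csv("Mall_Customers.csv", stringsAsFactors = TRUE) %>%
  rename(Annual.Income = 4, Spending.Score = 5)
```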

Then, take a glimpse at how the dataset looks:
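Using glimpse() from dplyr (assuming the data frame is named customer as in the setup sketch above):

```r
glimpse(customer)
```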

## Observations: 200
## Variables: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Gender         <fct> Male, Male, Female, Female, Female, Female, Female, ...
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, ...
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, ...
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77,...

The dataset contains 200 observations and 5 variables. Below is the description of each variable:

  • CustomerID = Unique ID assigned to the customer
  • Gender = Gender of the customer
  • Age = Age of the customer
  • Annual Income = (k$) Annual Income of the customer
  • Spending Score = (1-100) Score assigned by the mall based on customer behavior and spending nature

Customer Segmentation using K-Means

K-Means is a centroid-based clustering algorithm that follows a simple procedure of partitioning a given dataset into a pre-determined number of clusters, denoted as “k”. We will discuss one use case that can be handled with the K-Means algorithm.

Before we jump into clustering the data, we should scale the variables that will be used in the clustering analysis. Here, let us explore the Annual Income and Spending Score variables first.
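A possible version of that preprocessing step (the choice of these two variables follows the description above; the object name customer_x is an assumption):

```r
# keep only the two variables used for clustering and standardise them
customer_x <- customer %>%
  select(Annual.Income, Spending.Score) %>%
  scale()
```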

In K-Means, we must determine K (the number of clusters) to be created. K can be determined based on the business case, or we can use the elbow method as a consideration. Below is the function for building the elbow plot.
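The wss function referenced in the next step is not shown; a common formulation of this elbow helper, which computes the total within-group sum of squares for an increasing number of clusters, looks roughly like this:

```r
# total within-cluster sum of squares for k = 1..maxCluster
wss <- function(data, maxCluster = 10) {
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1: total sum of squares
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
```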

Now, let us apply the wss function created earlier to the desired data.
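For example, on the scaled data from the preprocessing sketch (the seed value is an assumption, used only to make the elbow plot reproducible):

```r
set.seed(100)
wss(customer_x)
```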

The elbow plot above suggests that with six clusters we can explain most of the variance in the data, since the y-axis (within-groups sum of squares) flattens out after six clusters.

Next, we will build six clusters from our data. The K-Means algorithm can be called using the kmeans() function, and do not forget to set the random state (set.seed()) before calling kmeans().
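A minimal sketch of that call, assuming the scaled data from the previous step and a seed of 100, followed by a simple per-cluster profile used to interpret the segments listed below:

```r
set.seed(100)                                   # fix the random initial centroids
customer_km <- kmeans(customer_x, centers = 6)  # six clusters, following the elbow plot

# attach the cluster label back to the original data and profile each segment
customer$cluster <- as.factor(customer_km$cluster)
customer %>%
  group_by(cluster) %>%
  summarise(mean_age      = mean(Age),
            mean_income   = mean(Annual.Income),
            mean_spending = mean(Spending.Score),
            n             = n())
```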

Result:
- Cluster 1 : Medium annual income, medium spending (young age target customer)
- Cluster 2 : High annual income, high spending (young age wealthy customer)
- Cluster 3 : Medium annual income, medium spending (old age target customer)
- Cluster 4 : Low annual income, high spending (young age spendthrift)
- Cluster 5 : Low annual income, low spending (pennywise)
- Cluster 6 : High annual income, low spending (miser)

Customer Segmentation using K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm is a supervised algorithm that can solve both classification and regression problems. So, how does KNN work? As the name suggests, KNN looks at the most similar observations, the nearest neighbors of each data point, and uses them to make its prediction.

Then we can continue to the next step, which is to separate the predictors and the target in the train and test datasets.
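A sketch of that step, with several assumptions: an 80/20 train-test split (the square-root output below implies the original training set had 163 rows, so the exact proportion may differ), the K-Means cluster label used as the target, and the same two predictors as before, scaled with the training-set parameters:

```r
set.seed(100)
idx   <- sample(nrow(customer), size = 0.8 * nrow(customer))
train <- customer[idx, ]
test  <- customer[-idx, ]

# predictors (scaled) and target labels for knn()
train_x <- train %>% select(Annual.Income, Spending.Score) %>% scale()
test_x  <- test  %>% select(Annual.Income, Spending.Score) %>%
  scale(center = attr(train_x, "scaled:center"),
        scale  = attr(train_x, "scaled:scale"))
train_y <- train$cluster
test_y  <- test$cluster
```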

In the K-Nearest Neighbors algorithm, K is the number of closest neighbors used in the majority voting for the predicted class. K can be set as desired, or we can start from the square root of the number of rows in our training dataset.
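For example, assuming train is the training set from the split above:

```r
sqrt(nrow(train))
```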

## [1] 12.76715

The square root is about 12.77, so should we use k = 12 or k = 13? Since our target variable has an even number of categories (six), we should avoid an even k, because it can easily produce tied votes among the neighbors. Therefore, we use k = 13.
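A sketch of fitting the model with knn() from the class package, using the object names from the split sketch above:

```r
customer_knn <- knn(train = train_x,   # scaled training predictors
                    test  = test_x,    # scaled test predictors
                    cl    = train_y,   # training labels (cluster from K-Means)
                    k     = 13)
```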

One of the differences between KNN and K-Means is that with KNN we can check the model performance, using accuracy if the case is a classification problem, or error if the case is a regression problem. Now we want to check the accuracy of the model using the confusionMatrix() function from the caret package.
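For example (the predictions and reference labels are the assumed objects from the previous steps):

```r
confusionMatrix(data = customer_knn, reference = test_y)
```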

From the KNN model, we get quite good accuracy, around 78.3%. However, if you are still not satisfied with the accuracy, you can tune the number K or go back to the data preprocessing.

Now, let us visualize the result of the K-Nearest Neighbors prediction for customer segmentation.
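One way to draw it with ggplot2 (a sketch; the plot object name plot_knn is an assumption, kept so the two plots can be arranged side by side later):

```r
# test customers coloured by their predicted segment
plot_knn <- test %>%
  mutate(pred = customer_knn) %>%
  ggplot(aes(x = Annual.Income, y = Spending.Score, color = pred)) +
  geom_point(size = 2) +
  labs(title = "KNN Prediction", color = "Segment")

plot_knn
```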

Since we want to compare the K-Means and K-Nearest Neighbors results in customer segmentation, we create a plot visualizing how the K-Means algorithm clusters the potential customers.
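A corresponding sketch for the K-Means result (plot_km is again an assumed object name):

```r
# all customers coloured by their K-Means cluster
plot_km <- ggplot(customer, aes(x = Annual.Income, y = Spending.Score, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "K-Means Clustering", color = "Cluster")

plot_km
```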

To make the comparison between the K-Means and KNN results clearer, we arrange the two plots in one frame, as shown below.
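One way to do this, assuming the two ggplot objects from the sketches above and the gridExtra package (patchwork would work just as well):

```r
library(gridExtra)
grid.arrange(plot_km, plot_knn, ncol = 2)
```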

Conclusion

The K-Means and KNN algorithms both rely on the distance between data points. The difference is that K-Means is an unsupervised learning algorithm that aims to cluster unlabelled data, whereas K-Nearest Neighbors is a supervised learning algorithm that aims to predict labels for new data. It learns from historical labelled data to generate its predictions.

In the customer segmentation case above, we can infer that when we have an unlabelled dataset, we can cluster the customers using the K-Means algorithm and generate a label for each customer from the cluster result. Then, when new customer data arrives, we can predict which segment those customers belong to using the K-Nearest Neighbors (KNN) algorithm.

References