Customer segmentation is the activity of grouping your customers by several characteristics, such as their personal information, spending behavior, or demographics. The purpose of customer segmentation is to understand each segment so that you can market and promote your brand effectively.

To understand your customer personas, you sometimes need a technique that helps you reach your goals efficiently. One way to do customer segmentation is to use machine learning algorithms to do the job. This article focuses on the difference between the K-Means and KNN algorithms in a customer segmentation case.

Mall Customer Segmentation

In this customer segmentation analysis, we use the Mall Customer Segmentation dataset, downloaded from Kaggle. The data itself comes from customer memberships at a mall. Here, we will group the customers based on their personal information and shopping behavior. Let us load the libraries and the dataset first.

Library setup
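The original setup chunk is not shown here. A minimal sketch of what it might look like, assuming the packages used later in the article (dplyr, ggplot2, class, caret) and the usual Kaggle file name, with the income and spending columns renamed to match the output below:

```r
library(dplyr)      # data wrangling and glimpse()
library(ggplot2)    # visualisation
library(class)      # knn()
library(caret)      # confusionMatrix()

# assumed file name from Kaggle; the last two columns are renamed
# to match the variable names shown in the glimpse output below
customer <- read.csv("Mall_Customers.csv", stringsAsFactors = TRUE) %>%
  rename(Annual.Income = 4, Spending.Score = 5)
```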

Then, take a glimpse at how the dataset looks:
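Using glimpse() from dplyr (assuming the data frame is named customer as in the setup sketch above):

```r
glimpse(customer)
```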

## Observations: 200
## Variables: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Gender         <fct> Male, Male, Female, Female, Female, Female, Female, ...
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, ...
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, ...
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77,...

The dataset contains 200 observations and 5 variables. Below is the description of each variable:

  • CustomerID = Unique ID assigned to the customer
  • Gender = Gender of the customer
  • Age = Age of the customer
  • Annual Income = (k$) Annual Income of the customer
  • Spending Score = (1-100) Score assigned by the mall based on customer behavior and spending nature

Customer Segmentation using K-Means

K-Means is a centroid-based clustering algorithm that follows a simple procedure of partitioning a given dataset into a pre-determined number of clusters, denoted as “k”. We will discuss one use case that can be handled with the K-Means algorithm.

Before we jump into clustering the data, we should scale the variables that will be used in the clustering analysis. Here, let us explore the Annual Income and Spending Score variables first.
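A possible version of that preprocessing step (the choice of these two variables follows the description above; the object name customer_x is an assumption):

```r
# keep only the two variables used for clustering and standardise them
customer_x <- customer %>%
  select(Annual.Income, Spending.Score) %>%
  scale()
```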

In K-Means, we must determine K (the number of clusters) to be created. K can be determined based on the business case, or we can use the elbow method as a consideration. Below is the function for building the elbow plot.
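The wss function referenced in the next step is not shown; a common formulation of this elbow helper, which computes the total within-group sum of squares for an increasing number of clusters, looks roughly like this:

```r
# total within-cluster sum of squares for k = 1..maxCluster
wss <- function(data, maxCluster = 10) {
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1: total sum of squares
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
```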

Now, let us apply the wss function created earlier to the desired data.
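For example, on the scaled data from the preprocessing sketch (the seed value is an assumption, used only to make the elbow plot reproducible):

```r
set.seed(100)
wss(customer_x)
```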

The elbow plot above suggests that with six clusters we can explain most of the variance in the data, since the y-axis (within-groups sum of squares) flattens out after six clusters.

Next, we will build six clusters from our data. The K-Means algorithm can be called using the kmeans() function, and do not forget to set the random state (set.seed()) before calling kmeans().
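A minimal sketch of that call, assuming the scaled data from the previous step and a seed of 100, followed by a simple per-cluster profile used to interpret the segments listed below:

```r
set.seed(100)                                   # fix the random initial centroids
customer_km <- kmeans(customer_x, centers = 6)  # six clusters, following the elbow plot

# attach the cluster label back to the original data and profile each segment
customer$cluster <- as.factor(customer_km$cluster)
customer %>%
  group_by(cluster) %>%
  summarise(mean_age      = mean(Age),
            mean_income   = mean(Annual.Income),
            mean_spending = mean(Spending.Score),
            n             = n())
```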

Result:
- Cluster 1 : Medium annual income, medium spending (young age target customer)
- Cluster 2 : High annual income, high spending (young age wealthy customer)
- Cluster 3 : Medium annual income, medium spending (old age target customer)
- Cluster 4 : Low annual income, high spending (young age spendthrift)
- Cluster 5 : Low annual income, low spending (pennywise)
- Cluster 6 : High annual income, low spending (miser)

Customer Segmentation using K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm is a supervised algorithm that can solve both classification and regression problems. So, how does KNN work? As the name suggests, KNN looks at the most similar observations, the nearest neighbors of each data point, and uses them to make its prediction.

Then we can continue to the next step, which is to separate the predictors and the target in the train and test datasets.
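A sketch of that step, with several assumptions: an 80/20 train-test split (the square-root output below implies the original training set had 163 rows, so the exact proportion may differ), the K-Means cluster label used as the target, and the same two predictors as before, scaled with the training-set parameters:

```r
set.seed(100)
idx   <- sample(nrow(customer), size = 0.8 * nrow(customer))
train <- customer[idx, ]
test  <- customer[-idx, ]

# predictors (scaled) and target labels for knn()
train_x <- train %>% select(Annual.Income, Spending.Score) %>% scale()
test_x  <- test  %>% select(Annual.Income, Spending.Score) %>%
  scale(center = attr(train_x, "scaled:center"),
        scale  = attr(train_x, "scaled:scale"))
train_y <- train$cluster
test_y  <- test$cluster
```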

In the K-Nearest Neighbors algorithm, K is the number of closest neighbors used in the majority voting for the predicted class. K can be set as desired, or we can start from the square root of the number of rows in our training dataset.
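For example, assuming train is the training set from the split above:

```r
sqrt(nrow(train))
```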

## [1] 12.76715

The square root is about 12.77, so should we use k = 12 or k = 13? Since our target variable has an even number of categories (six), we should avoid an even k, because it can easily produce tied votes among the neighbors. Therefore, we use k = 13.
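A sketch of fitting the model with knn() from the class package, using the object names from the split sketch above:

```r
customer_knn <- knn(train = train_x,   # scaled training predictors
                    test  = test_x,    # scaled test predictors
                    cl    = train_y,   # training labels (cluster from K-Means)
                    k     = 13)
```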

One of the differences between KNN and K-Means is that with KNN we can check the model performance, using accuracy if the case is a classification problem, or error if the case is a regression problem. Now we want to check the accuracy of the model using the confusionMatrix() function from the caret package.
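For example (the predictions and reference labels are the assumed objects from the previous steps):

```r
confusionMatrix(data = customer_knn, reference = test_y)
```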

From the KNN model, we get quite good accuracy, around 78.3%. However, if you are still not satisfied with the accuracy, you can tune the number K or go back to the data preprocessing.

Now, let us visualize the result of the K-Nearest Neighbors prediction for customer segmentation.
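One way to draw it with ggplot2 (a sketch; the plot object name plot_knn is an assumption, kept so the two plots can be arranged side by side later):

```r
# test customers coloured by their predicted segment
plot_knn <- test %>%
  mutate(pred = customer_knn) %>%
  ggplot(aes(x = Annual.Income, y = Spending.Score, color = pred)) +
  geom_point(size = 2) +
  labs(title = "KNN Prediction", color = "Segment")

plot_knn
```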

Since we want to compare the K-Means and K-Nearest Neighbors results in customer segmentation, we create a plot visualizing how the K-Means algorithm clusters the potential customers.
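A corresponding sketch for the K-Means result (plot_km is again an assumed object name):

```r
# all customers coloured by their K-Means cluster
plot_km <- ggplot(customer, aes(x = Annual.Income, y = Spending.Score, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "K-Means Clustering", color = "Cluster")

plot_km
```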

To make the comparison between the K-Means and KNN results clearer, we arrange the two plots in one frame, as shown below.
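One way to do this, assuming the two ggplot objects from the sketches above and the gridExtra package (patchwork would work just as well):

```r
library(gridExtra)
grid.arrange(plot_km, plot_knn, ncol = 2)
```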

Conclusion

The K-Means and KNN algorithms both rely on the distance between data points. The difference is that K-Means is an unsupervised learning algorithm that aims to cluster unlabelled data, whereas K-Nearest Neighbors is a supervised learning algorithm that aims to predict labels for new data. It learns from historical labelled data to generate its predictions.

In the customer segmentation case above, we can infer that when we have an unlabelled dataset, we can cluster the customers using the K-Means algorithm and generate a label for each customer from the cluster result. Then, when new customer data arrives, we can predict which segment those customers belong to using the K-Nearest Neighbors (KNN) algorithm.

References