1. Intro

Hello, my name is Nunu. Today we’ll practice K-Means clustering using R. The dataset used in this RPubs post was uploaded to Kaggle by Shwetabh123.

Clustering is a technique for categorizing data by dividing it into groups according to certain desired characteristics. The process of grouping a set of objects into classes of similar objects is called clustering (Han et al. 2001). Clustering methods include hierarchical clustering, non-hierarchical clustering, and others. K-means is a non-hierarchical clustering method that seeks to partition existing data into one or more clusters (split or merge). It is a distance-based method that divides the data into a specified number of clusters, but the algorithm only works on numeric attributes (Khairani and Sutoyo, 2020).
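To make the idea concrete, here’s a minimal sketch using base R’s kmeans() on made-up toy data. The toy data frame and its values are assumptions for illustration only; they are not part of the mall dataset.

```r
# Toy data: two numeric columns and two obvious groups (made-up values).
set.seed(42)
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)

# kmeans() needs numeric input and a cluster count chosen up front.
# nstart = 10 tries ten random starts and keeps the best result.
fit <- kmeans(toy, centers = 2, nstart = 10)

fit$centers          # one row of coordinates per cluster
table(fit$cluster)   # number of observations assigned to each cluster
```

Each observation is assigned to the cluster whose centre is nearest, which is why the method is called distance-based.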

2. What are We Working with?

Customer segmentation can help us in many ways. It’s very important for optimizing our marketing strategies, because it lets us focus on the customers who will bring the highest profit to our business. Here are the variables:

CustomerID : ID of customer

Genre : Sex

Age : Age of customer

Annual.Income : Annual Income in $k (1000)

Spending.Score : range (1-100)

We’re going to segment the mall’s customers into different groups by their age, spending, and income. The greater the value of Spending.Score, the more loyal the customer is to the mall, meaning the customer spends more money.

3. Data Preparation

Load the required packages

library(ggplot2)
library(factoextra)
library(dplyr)
library(ClusterR)

Load the dataset, named mall_cust

mall_cust <- read.csv("Mall_Customers.csv")
rmarkdown::paged_table(mall_cust)
str(mall_cust %>%
      mutate(Genre = as.factor(Genre)) %>%
      select(-CustomerID))
#> 'data.frame':    200 obs. of  4 variables:
#>  $ Genre         : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 1 2 1 ...
#>  $ Age           : int  19 21 20 23 31 22 35 23 64 30 ...
#>  $ Annual.Income : int  15 15 16 16 17 17 18 18 19 19 ...
#>  $ Spending.Score: int  39 81 6 77 40 76 6 94 3 72 ...
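Since k-means only works on numeric attributes, a quick sanity check before clustering can help. The sketch below uses a small stand-in data frame, toy_cust (a hypothetical name), built from the first four rows visible in the str() output above, so it runs without the CSV file.

```r
# Stand-in with the same columns as mall_cust (first four rows of the str() output).
toy_cust <- data.frame(
  CustomerID     = 1:4,
  Genre          = c("Male", "Male", "Female", "Female"),
  Age            = c(19L, 21L, 20L, 23L),
  Annual.Income  = c(15L, 15L, 16L, 16L),
  Spending.Score = c(39L, 81L, 6L, 77L)
)

# Missing values per column; kmeans() cannot handle NA values.
colSums(is.na(toy_cust))

# Keep only numeric columns, dropping the ID, which is just a row label.
num_cols <- sapply(toy_cust, is.numeric)
features <- toy_cust[, num_cols & names(toy_cust) != "CustomerID"]
names(features)
```

The same two checks can be run on mall_cust itself once the file is loaded.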

4. Data Exploration

hist(mall_cust$Age,
     col = '#CBAACB',
     main = 'Customers Age',
     xlab = 'Years old')

The distribution of the Age variable is skewed to the right: there are more customers on the left side (adults under 40 years old) than on the right side (middle-aged and older adults).
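If we want a number to back up what the histogram suggests, one option is a moment-based skewness. The skewness() helper below is a hypothetical function written for this sketch (base R has no built-in one), and toy_age holds only the first ten ages from the str() output, so it illustrates the calculation rather than the result for the full dataset.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# First ten ages shown in the str() output above.
toy_age <- c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30)

skewness(toy_age)                 # positive for this vector
mean(toy_age) > median(toy_age)   # TRUE: the mean is pulled toward the right tail
```

Calling skewness(mall_cust$Age) would apply the same check to the whole column.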

hist(mall_cust$Annual.Income,
     col = '#C7CEEA',
     main = 'Annual Income of Customers',
     xlab = 'Annual Income')

As in the previous histogram, the distribution of Annual Income is skewed to the right: the left side of the histogram holds more customers than the right side. Many customers have an annual income of around 60-80k USD, while few customers earn more than that.

hist(mall_cust$Spending.Score,
     col = '#B5EAD7',
     main = 'Spending Score of Customers',
     xlab = 'Spending Score')

The distribution of Spending Score tends toward normal: the scores are concentrated around the mean. The high and low scores are quite spread out at both ends, which means quite a lot of customers spend either very little or very much in our mall.

5. K-Means Clustering

Annual Income and Spending Score are the variables that influence consumer behaviour the most, so they drive the clustering; the code in this section also passes Age, which the interpretation uses to describe each group. Spending Score is the value the mall assigns to customers based on their behaviour. The number of clusters can be determined using several statistical criteria, for example the silhouette coefficient and WCSS (Within-Cluster Sum of Squares, also written WSS). The silhouette coefficient is calculated from the distances between observations: it measures how close an observation is to the other observations in its own cluster compared with the observations in the nearest other cluster. A large coefficient indicates that the clusters formed are appropriate. WCSS measures the variation within the clusters formed; the smaller the variation within the clusters, the more appropriate the clusters are.
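Both criteria can be computed by hand for a single k. Here’s a minimal sketch on made-up data, assuming the cluster package is installed (factoextra already depends on it); the toy matrix and its three groups are invented for illustration.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Made-up data: three well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)

fit <- kmeans(toy, centers = 3, nstart = 10)

# Silhouette width per observation, then the average over all observations.
sil <- silhouette(fit$cluster, dist(toy))
mean(sil[, "sil_width"])

# WCSS: sum of squared distances from each point to its cluster centre.
fit$tot.withinss
```

An average silhouette width close to 1 points to compact, well-separated clusters, while a smaller WCSS points to tighter clusters.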

Silhouette

Optimal_Clusters_KMeans(mall_cust %>%
                          select(Annual.Income, Spending.Score, Age),
                        max_clusters = 10, plot_clusters = TRUE,
                        criterion = 'silhouette')

#>  [1] 0.0000000 0.2949337 0.3887438 0.4546643 0.4719404 0.4590234 0.3960786
#>  [8] 0.3802983 0.3252582 0.3109356
#> attr(,"class")
#> [1] "k-means clustering"

With the silhouette criterion, we choose the number of clusters with the highest coefficient. From the graph above, the highest average silhouette value is 0.47, at k = 5. How about WCSSE?
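The same choice can be read straight from the numbers instead of the graph. This small sketch copies the silhouette values printed above into a vector (sil_values is a name made up for the example) and picks the position of the largest one.

```r
# Average silhouette values for k = 1..10, copied from the output above.
sil_values <- c(0.0000000, 0.2949337, 0.3887438, 0.4546643, 0.4719404,
                0.4590234, 0.3960786, 0.3802983, 0.3252582, 0.3109356)

which.max(sil_values)        # 5: the k with the highest average silhouette
round(max(sil_values), 2)    # 0.47
```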

WCSSE

Optimal_Clusters_KMeans(mall_cust %>%
                          select(Annual.Income, Spending.Score, Age),
                        max_clusters = 10, plot_clusters = TRUE,
                        criterion = 'WCSSE')

#>  [1] 308812.78 212840.17 158518.39 104366.15  75378.76  58300.44  54818.94
#>  [8]  47785.78  48880.68  45714.63
#> attr(,"class")
#> [1] "k-means clustering"

With WCSS, by contrast, we choose the number of clusters at the point where the line bends like an elbow. The elbow forms at the fifth cluster, so for this dataset we can choose 5 as the number of clusters.
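The elbow curve can also be built with base R alone. The sketch below is an illustration on made-up data, not a rerun of the analysis above: it standardizes the columns first, which is what stand = TRUE does in the eclust() call, whereas the Optimal_Clusters_KMeans() calls above received the raw columns. The toy data frame and its column names are assumptions.

```r
set.seed(1)
# Made-up data with three numeric columns on different scales.
toy <- data.frame(
  income = c(rnorm(30, 30, 5), rnorm(30, 60, 5), rnorm(30, 90, 5)),
  score  = c(rnorm(30, 80, 8), rnorm(30, 50, 8), rnorm(30, 20, 8)),
  age    = c(rnorm(30, 25, 3), rnorm(30, 40, 3), rnorm(30, 55, 3))
)
toy_scaled <- scale(toy)  # mean 0 and standard deviation 1 per column

# Total within-cluster sum of squares for k = 1..10.
wcss <- sapply(1:10, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# The elbow is where the curve stops dropping sharply.
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "WCSS")
```

Standardizing matters because k-means is distance-based, so a column with a wider range would otherwise dominate the result.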

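# Fit k-means with k = 5 on the standardized (stand = TRUE) variables
cust_kmean <- eclust(mall_cust %>%
                       select(Annual.Income, Spending.Score, Age),
                     stand = TRUE, FUNcluster = "kmeans", k = 5, graph = FALSE)
# Mean of each variable per cluster, in original units ("gerombol" is the cluster label)
aggregate(mall_cust %>%
            select(Annual.Income, Spending.Score, Age),
          by = list(gerombol = cust_kmean$cluster), FUN = mean)
# Plot the clusters
fviz_cluster(cust_kmean)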
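Because the model is fitted on standardized values, its cluster centres are in standard-deviation units. Here’s a minimal sketch, on made-up data with base R’s kmeans(), of how the centres can be converted back to the original units, and why averaging the raw columns per cluster gives the same numbers. It assumes the object returned by eclust() carries the usual kmeans fields such as centers and size; the toy data frame is invented for illustration.

```r
set.seed(1)
# Made-up data: two groups with opposite income and spending patterns.
toy <- data.frame(
  income = c(rnorm(30, 30, 5), rnorm(30, 90, 5)),
  score  = c(rnorm(30, 80, 8), rnorm(30, 20, 8))
)
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

fit$size  # observations per cluster

# Undo the standardization: multiply by each column's sd, then add its mean.
centers_orig <- sweep(fit$centers, 2, attr(toy_scaled, "scaled:scale"), `*`)
centers_orig <- sweep(centers_orig, 2, attr(toy_scaled, "scaled:center"), `+`)
centers_orig

# Same numbers as averaging the raw columns within each cluster.
aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
```

Checking the cluster sizes alongside the means helps avoid over-interpreting a group that contains only a handful of customers.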

6. Interpretation

  • Groups 1 and 2 have roughly the same annual income and spending scores (medium values). However, group 1 consists of young adults, while group 2 consists of older adults.

  • Group 3 shows unique behaviour: a small annual income but a large spending score. Looking at age, group 3 consists of very young adults.

  • Group 4 is like the opposite of group 3: a high annual income but very little spending.

  • Group 5 has the highest annual income and spending scores. This group consists of middle-aged adults.

As an analyst, I’d suggest that the mall manager give extra-special treatment to the customers in groups 3 and 5, so that they remain loyal to our mall. The spending score of group 4 is very small, so that group should be offered discounts in order to increase its spending score.

7. References

  • Han, Jiawei, and Kamber, Micheline. 2001. Data Mining: Concepts and Techniques, Second Edition. San Francisco: Morgan Kaufmann.

  • Khairani, N. A., and Sutoyo, E. 2020. Application of K-Means Clustering Algorithm for Determination of Fire-Prone Areas Utilizing Hotspots in West Kalimantan Province. International Journal of Advances in Data and Information Systems 1(1): 9-16.