Introduction

Unsupervised learning is a machine learning technique in which models are not supervised with a labeled training dataset; instead, the models find hidden patterns and insights in the data. Unsupervised learning problems are generally categorized into two types: clustering and association.

In this project, we use a clustering method for customer segmentation. The dataset is from Kaggle. The objectives are to group the customers, profile each group, and suggest a marketing strategy for them.

First, load the required libraries:

library(lubridate)
library(dplyr)
library(GGally)
library(ggplot2)
library(factoextra)
library(ggiraphExtra)
library(reshape2)

Data Preparation

Input Data

Read the data and store it in the customers object.

customers <- read.csv("Mall_Customers.csv")

Take a quick look at the data:

head(customers)
##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Check the number of columns and rows.

dim(customers)
## [1] 200   5

The data contains 200 rows and 5 columns.

View all columns and their data types.

glimpse(customers)
## Rows: 200
## Columns: 5
## $ CustomerID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ~
## $ Gender                 <chr> "Male", "Male", "Female", "Female", "Female", "~
## $ Age                    <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,~
## $ Annual.Income..k..     <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,~
## $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15~

The dataset contains the following variables: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100).

Data Wrangling

We will convert CustomerID and Gender to factors and rename the Annual Income and Spending Score columns.

customers <- customers %>% 
  mutate_at(c("CustomerID", "Gender"), as.factor) %>% 
  rename(Annual_Income = Annual.Income..k..,
         Spending_Score = Spending.Score..1.100.)

Next, check for missing values.

colSums(is.na(customers))
##     CustomerID         Gender            Age  Annual_Income Spending_Score 
##              0              0              0              0              0

No missing values found.

Exploratory Data Analysis

Let’s see the summary of all columns.

summary(customers)
##    CustomerID     Gender         Age        Annual_Income    Spending_Score 
##  1      :  1   Female:112   Min.   :18.00   Min.   : 15.00   Min.   : 1.00  
##  2      :  1   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50   1st Qu.:34.75  
##  3      :  1                Median :36.00   Median : 61.50   Median :50.00  
##  4      :  1                Mean   :38.85   Mean   : 60.56   Mean   :50.20  
##  5      :  1                3rd Qu.:49.00   3rd Qu.: 78.00   3rd Qu.:73.00  
##  6      :  1                Max.   :70.00   Max.   :137.00   Max.   :99.00  
##  (Other):194

Distribution of Age Based on Gender

ggplot(customers, aes(Age, fill = Gender)) +
  geom_histogram(bins = 20) +
  ggtitle("Distribution of Age per Gender") +
  theme_minimal()

ggplot(customers, aes(x = Gender, y = Age, fill = Gender)) +
  geom_boxplot() +
  ggtitle("Boxplot of Age per Gender") +
  theme_minimal()

Distribution of Annual Income Based on Gender

ggplot(customers, aes(Annual_Income, fill = Gender)) +
  geom_histogram(bins = 20) +
  ggtitle("Distribution of Annual Income per Gender") +
  theme_minimal()

ggplot(customers, aes(x = Gender, y = Annual_Income, fill = Gender)) +
  geom_boxplot() +
  ggtitle("Boxplot of Annual Income per Gender") +
  theme_minimal()

Distribution of Spending Score Based on Gender

ggplot(customers, aes(Spending_Score, fill = Gender)) +
  geom_histogram(bins = 20) +
  ggtitle("Distribution of Spending Score per Gender") +
  theme_minimal()

ggplot(customers, aes(x = Gender, y = Spending_Score, fill = Gender)) +
  geom_boxplot() +
  ggtitle("Boxplot of Spending Score per Gender") +
  theme_minimal()

Correlation between numeric variables.

ggcorr(customers, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Based on the data exploration above, there are more female customers than male customers. Ages range from 18 to 70 years, with most customers between 20 and 40 years old. Annual income ranges from 15k$ to 137k$. The numeric variables show no notable outliers and no strong correlations with one another.

K-Means Clustering

Clustering groups data based on its characteristics. The objective is to produce clusters in which observations within the same cluster are as similar as possible, while observations in different clusters are as different as possible.

We use K-means clustering for this project. K-means is a centroid-based clustering algorithm, meaning that each cluster is represented by a single centroid, its center point. K-means is an iterative process of:

  1. Create k random cluster centers (centroids).
  2. Calculate the distance from each observation to each centroid.
  3. Assign each observation to its nearest cluster.
  4. Shift each centroid to the center point (mean) of its cluster.
  5. Return to step 2 until the cluster assignments no longer change.

K-means is usually applied to variables with a numeric data type.
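
To make the five steps above concrete, here is a minimal, illustrative sketch of the K-means loop written by hand (the function name simple_kmeans is ours; the rest of this project uses R's built-in kmeans()):

# Illustrative sketch of the K-means loop (assumes every cluster stays non-empty)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k random observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Step 2: distance from every observation to every centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    # Step 3: assign each observation to its nearest centroid
    new_assignment <- max.col(-d)
    # Step 5: stop when the assignments no longer change
    if (identical(new_assignment, assignment)) break
    assignment <- new_assignment
    # Step 4: move each centroid to the mean of the observations in its cluster
    centroids <- apply(x, 2, function(col) tapply(col, assignment, mean))
  }
  list(cluster = assignment, centers = centroids)
}

Since K-means works on numeric variables, we keep only Age, Annual_Income, and Spending_Score: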

customers_clean <- customers %>% 
  select(-c(CustomerID, Gender))

head(customers_clean)
##   Age Annual_Income Spending_Score
## 1  19            15             39
## 2  21            15             81
## 3  20            16              6
## 4  23            16             77
## 5  31            17             40
## 6  22            17             76

Scaling

Feature scaling is essential for machine learning algorithms that compute distances between observations; without it, variables with larger ranges would dominate the distance calculation.

customers_scale <- scale(customers_clean)
summary(customers_scale)
##       Age          Annual_Income      Spending_Score     
##  Min.   :-1.4926   Min.   :-1.73465   Min.   :-1.905240  
##  1st Qu.:-0.7230   1st Qu.:-0.72569   1st Qu.:-0.598292  
##  Median :-0.2040   Median : 0.03579   Median :-0.007745  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.7266   3rd Qu.: 0.66401   3rd Qu.: 0.882916  
##  Max.   : 2.2299   Max.   : 2.91037   Max.   : 1.889750
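
Note that every column now has a mean of 0. scale() performs z-score standardization, i.e. it subtracts each column's mean and divides by its standard deviation; a minimal by-hand equivalent for illustration (the object name manual_scale is ours):

# Equivalent to scale(customers_clean): center each column, then divide by its standard deviation
manual_scale <- as.matrix(customers_clean)
manual_scale <- sweep(manual_scale, 2, colMeans(manual_scale), "-")
manual_scale <- sweep(manual_scale, 2, apply(manual_scale, 2, sd), "/")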

Choose Optimum K

To choose the optimum \(k\):

  1. Business perspective (how many groups are needed?)
  2. Statistical method: the Elbow method, visualized with fviz_nbclust()

fviz_nbclust(x = customers_scale,
             FUNcluster = kmeans,
             method = "wss")
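
For intuition, the elbow plot above simply computes the total within-cluster sum of squares for each candidate k; a minimal by-hand sketch of the same idea (the object name wss is ours):

# Total within sum of squares for k = 1 to 10: the quantity plotted by the elbow method
set.seed(100)
wss <- sapply(1:10, function(k) kmeans(customers_scale, centers = k)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within sum of squares")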

From the elbow plot, the optimum k is 5: beyond k = 5, increasing the number of clusters no longer yields a significant decrease in the total within sum of squares.

Clustering

RNGkind(sample.kind = "Rounding")
set.seed(100)

cust_kmeans <- kmeans(x = customers_scale, centers = 5)
cust_kmeans
## K-means clustering with 5 clusters of sizes 47, 40, 58, 22, 33
## 
## Cluster means:
##          Age Annual_Income Spending_Score
## 1 -0.8117514    -0.3980098     -0.2549218
## 2 -0.4277326     0.9724070      1.2130414
## 3  1.1956271    -0.4598275     -0.3262196
## 4 -0.9719569    -1.3262173      1.1293439
## 5  0.2211606     1.0805138     -1.2868231
## 
## Clustering vector:
##   [1] 1 4 1 4 1 4 1 4 3 4 3 4 3 4 1 4 1 4 3 4 1 4 3 4 3 4 3 4 1 4 3 4 3 4 3 4 3
##  [38] 4 1 4 3 4 3 1 3 4 3 1 1 1 3 1 1 3 3 3 3 3 1 3 3 1 3 3 3 1 3 3 1 1 3 3 3 3
##  [75] 3 1 3 1 1 3 3 1 3 3 1 3 3 1 1 3 3 1 3 1 1 1 3 1 3 1 1 3 3 1 3 1 3 3 3 3 3
## [112] 1 1 1 1 1 3 3 3 3 1 1 2 2 1 2 5 2 5 2 5 2 1 2 5 2 5 2 1 2 5 2 1 2 5 2 5 2
## [149] 5 2 5 2 5 2 5 2 5 2 5 2 3 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5
## [186] 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2
## 
## Within cluster sum of squares by cluster:
## [1] 44.302173 23.915440 56.931671  8.191823 34.516303
##  (between_SS / total_SS =  71.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Goodness of Fit

  • Within Sum of Squares ($withinss): the sum of the squared distances from each observation to the centroid of its cluster.
  • Between Sum of Squares ($betweenss): the sum of the squared distances from each centroid to the global average, weighted by the number of observations in the cluster.
  • Total Sum of Squares ($totss): the sum of the squared distances from each observation to the global average.

# Within Sum of Squares
cust_kmeans$tot.withinss
## [1] 167.8574
# Between Sum of Squares
cust_kmeans$betweenss
## [1] 429.1426
# ratio BSS/TSS
cust_kmeans$betweenss/cust_kmeans$totss
## [1] 0.7188318
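
As a quick consistency check, the within and between components should add up to the total sum of squares, so the expression below should return TRUE:

# Sum-of-squares decomposition: totss = tot.withinss + betweenss
all.equal(cust_kmeans$totss, cust_kmeans$tot.withinss + cust_kmeans$betweenss)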

A good clustering method will produce high-quality clusters in which:

  • the WSS is low: observations within the same cluster are close to their centroid;
  • the intra-class similarity is high;
  • the inter-class similarity is low;
  • the BSS/TSS ratio is close to 1.

Cluster Profiling

customers_clean$cluster <- cust_kmeans$cluster
ggRadar(data = customers_clean, aes(colour = cluster), interactive = TRUE)

Perform cluster profiling by calculating the average value of each variable for each cluster.

customers_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
## # A tibble: 5 x 4
##   cluster   Age Annual_Income Spending_Score
##     <int> <dbl>         <dbl>          <dbl>
## 1       1  27.5          50.1           43.6
## 2       2  32.9          86.1           81.5
## 3       3  55.6          48.5           41.8
## 4       4  25.3          25.7           79.4
## 5       5  41.9          88.9           17.0

Profiling:

  • Cluster 1: This group consists of young adults with a medium annual income and a medium spending score.
  • Cluster 2: This group consists of middle-aged adults with a high annual income and a high spending score.
  • Cluster 3: This group consists of older adults with a medium annual income and a medium spending score.
  • Cluster 4: This group consists of young adults with a low annual income and a high spending score.
  • Cluster 5: This group consists of middle-aged adults with a high annual income and a low spending score.

fviz_cluster(object = cust_kmeans, 
             data = customers_scale,
             ggtheme = theme_minimal(),
             shape = 19,
             show.clust.cent = TRUE,
             geom = "point")

Based on the results above, we can build a marketing strategy as follows:

  • We know from the data exploration that there are more female customers than male customers, so we could create a marketing campaign targeting male customers.
  • Clusters 1 and 3 have medium annual incomes but only medium spending scores, so we could offer promotions to encourage them to purchase more.
  • We could give special treatment or a loyalty program to the customers in clusters 2 and 4, who have high spending scores.
  • Cluster 5 has a high annual income but a low spending score and consists of middle-aged adults. We could conduct marketing research to identify their needs, wants, and demands, so that they increase their purchases.

Conclusion

Customer segmentation can be performed with an unsupervised machine learning algorithm. The clustering method used in this project is K-means clustering. The mall customers were grouped into 5 clusters with different characteristics according to age, annual income, and spending score. From these results, we can create the right marketing strategy for each cluster.