Customer Segmentation

Customer segmentation is an important process needed in a business to get to know customers better. So, business processes in marketing and Customer Relationship Management (CMR) can be done more sharply. For example, marketing messages can be more personalized for each segment at a more optimal cost. With sharper processes, business performance may be better too. To find a good segmentation, it is necessary to process data analysis from customer profiles that are quite numerous and routine. This can be helped by computer algorithms.

Data preparation is the first step that we do before using any algorithm to perform data analysis. This is because each algorithm implementation demands a different structure and data type. And for the case of the K-Means algorithm that we will use for clustering automation, then the data structure is data.frame or matrix which contains all numbers. Nothing can be of type character. But in the real case, this is certainly not possible. For example, the contents of professional data such as “Professional”, “Housewife” is text. And this needs to be converted to numeric first, but if needed it can still retrieve text data.

The dataset that we use is customer data with fields “Customer_ID”, “Nama Pelanggan”, “Jenis Kelamin”, “Umur”, “Profesi”, “Tipe Residen” and “NilaiBelanjaSetahun” with the following display. The full dataset can be viewed at https://storage.googleapis.com/dqlab-dataset/customer_segments.txt

The data has seven columns with the following explanation:

Customer_ID is customer codes in format CUST-text followed by a number
Nama Pelanggan is names of the customers in text format
Jenis Kelamin is genders of the customers, there are only two data categories, Pria (male) and Wanita (female)
Umur is ages of the customers in numeric format
Profesi is the professions of the customers, also in the type of category text consisting of Wiraswasta (entrepreneur), Pelajar (student), Professional, Ibu Rumah Tangga (housewife), and Pelajar (student).
Tipe Residen is resident types of the customers, there are only two categories, Cluster and Sector.
NilaiBelanjaSetahun is annual shopping values of the customers

There are columns that contain only numbers, namely Umur and NilaiBelanjaSetahun. The rest is filled with category data for the “Jenis Kelamin”, “Profesi” and “Tipe Residen” columns. While “Customer ID” and “Nama Pelanggan” we consider to have unique values for each row of data and represent each individual. Because of this, they will not be used as variables determining our segmentation, but the rest of the other columns will be used.

Data Preparation

The first step we need to do is to read the dataset from a text file into a data.frame in R with the read.csv function.

pelanggan <- read.csv("https://storage.googleapis.com/dqlab-dataset/customer_segments.txt", sep="\t")
head(pelanggan)

##   Customer_ID      Nama.Pelanggan Jenis.Kelamin Umur      Profesi Tipe.Residen
## 1    CUST-001        Budi Anggara          Pria   58   Wiraswasta       Sector
## 2    CUST-002    Shirley Ratuwati        Wanita   14      Pelajar      Cluster
## 3    CUST-003        Agus Cahyono          Pria   48 Professional      Cluster
## 4    CUST-004    Antonius Winarta          Pria   53 Professional      Cluster
## 5    CUST-005 Ibu Sri Wahyuni, IR        Wanita   41   Wiraswasta      Cluster
## 6    CUST-006     Rosalina Kurnia        Wanita   24 Professional      Cluster
##   NilaiBelanjaSetahun
## 1             9497927
## 2             2722700
## 3             5286429
## 4             5204498
## 5            10615206
## 6             5215541

Look that if the original column name contains a space character, it will be changed to a full stop after reading with read.csv. For example, “Nama Pelanggan” is changed to “Nama.Pelanggan”.

As previously explained, the data contents of the three customer columns, namely “Jenis.Kelamin”, “Profesi” and “Tipe.Residen” are categorical data in the form of text. For the k-means function, these three columns are unusable unless they are converted to numerical data. One way is to use data.matrix() function.

pelanggan_matrix <- data.matrix(pelanggan[c("Jenis.Kelamin", "Profesi", "Tipe.Residen")])
head(pelanggan_matrix)

##      Jenis.Kelamin Profesi Tipe.Residen
## [1,]             1       5            2
## [2,]             2       3            1
## [3,]             1       4            1
## [4,]             1       4            1
## [5,]             2       5            1
## [6,]             2       4            1

After we converted the columns to numerical data, we need to combine the columns back into our original variable. It is especially useful in advanced practice, which is when we will identify new customer segmentation. To combine the data from the data.matrix conversion to the customer, we use data.frame() function.

pelanggan <- data.frame(pelanggan, pelanggan_matrix)
head(pelanggan)

##   Customer_ID      Nama.Pelanggan Jenis.Kelamin Umur      Profesi Tipe.Residen
## 1    CUST-001        Budi Anggara          Pria   58   Wiraswasta       Sector
## 2    CUST-002    Shirley Ratuwati        Wanita   14      Pelajar      Cluster
## 3    CUST-003        Agus Cahyono          Pria   48 Professional      Cluster
## 4    CUST-004    Antonius Winarta          Pria   53 Professional      Cluster
## 5    CUST-005 Ibu Sri Wahyuni, IR        Wanita   41   Wiraswasta      Cluster
## 6    CUST-006     Rosalina Kurnia        Wanita   24 Professional      Cluster
##   NilaiBelanjaSetahun Jenis.Kelamin.1 Profesi.1 Tipe.Residen.1
## 1             9497927               1         5              2
## 2             2722700               2         3              1
## 3             5286429               1         4              1
## 4             5204498               1         4              1
## 5            10615206               2         5              1
## 6             5215541               2         4              1

This time we look at the column “NilaiBelanjaSetahun” values are in millions. When this column is used for clustering, the calculation of the sum of squared errors (in the kmeans algorithm) will be very large. We will normalize the values to make the calculation simpler and easier, but not reduce accuracy. It is especially useful if the amount of data is very large, for example it has 200 thousand data. Normalization can be done in many ways. In our case, it is enough to divide the millions into tens.

pelanggan <- data.frame(pelanggan, pelanggan_matrix)
pelanggan$NilaiBelanjaSetahun <- pelanggan$NilaiBelanjaSetahun/1000000

After merging the data, we know exactly how many numeric numbers the category text is converted to. Now, we need to create master data.

head(pelanggan[c("Profesi","Profesi.1")], 20)

##             Profesi Profesi.1
## 1        Wiraswasta         5
## 2           Pelajar         3
## 3      Professional         4
## 4      Professional         4
## 5        Wiraswasta         5
## 6      Professional         4
## 7        Wiraswasta         5
## 8      Professional         4
## 9      Professional         4
## 10     Professional         4
## 11     Professional         4
## 12     Professional         4
## 13       Wiraswasta         5
## 14       Wiraswasta         5
## 15       Wiraswasta         5
## 16     Professional         4
## 17 Ibu Rumah Tangga         1
## 18 Ibu Rumah Tangga         1
## 19       Wiraswasta         5
## 20          Pelajar         3

It seems that Wiraswasta is converted to 5, Pelajar is 3, Profesi is 4, Ibu Rumah Tangga is 1, and Mahasiswa is converted to 2. The list of categorical data and their conversion results is very important to be used as a reference so that later when there is new data, we can “map” it into numerical data that is ready to be used for the clustering algorithm. Well, the problem is that the data above is too long, when in fact we only need 5 rows of data, right? In R, we can summarize it with a unique function.

Profesi <- unique(pelanggan[c("Profesi", "Profesi.1")])
Jenis.Kelamin <- unique(pelanggan[c("Jenis.Kelamin", "Jenis.Kelamin.1")])
Tipe.Residen <- unique(pelanggan[c("Tipe.Residen", "Tipe.Residen.1")])

Profesi

##             Profesi Profesi.1
## 1        Wiraswasta         5
## 2           Pelajar         3
## 3      Professional         4
## 17 Ibu Rumah Tangga         1
## 31        Mahasiswa         2

The data has been summarized with category text and its numeric pairs. Then pay attention to the numbers 1, 2, 3, 17 and 31 on the far left. This indicates the line position where the text is found. This concise and unique data is hereinafter referred to as reference data or master data.

We need to create a vertor variable to save fields used so that it can be used repeatedly in R scripts.

field_yang_digunakan = c("Jenis.Kelamin.1", "Umur", "Profesi.1", "Tipe.Residen.1","NilaiBelanjaSetahun")
head(pelanggan[field_yang_digunakan])

##   Jenis.Kelamin.1 Umur Profesi.1 Tipe.Residen.1 NilaiBelanjaSetahun
## 1               1   58         5              2            9.497927
## 2               2   14         3              1            2.722700
## 3               1   48         4              1            5.286429
## 4               1   53         4              1            5.204498
## 5               2   41         5              1           10.615206
## 6               2   24         4              1            5.215541

Clustering with K-Means

Clustering is the process of dividing objects into several groups (clusters) based on the degree of similarity between one object and another. Many algorithms have been developed to perform clustering automatically, one of which is very popular is K-Means. K-means is an algorithm that divides data into a number of partitions in a simple way – finding the proximity of each point in a cluster to a number of average or mean values. There are two key concepts that are also the origin of the name k-means:

the number of required partitions are represented by the letter k.
finding the “closest distance” of each point to a number of observed cluster mean values, represented by means.

The k-means algorithm is already in the basic R package in the form of a function called kmeans.

The kmeans function requires a minimum of 2 parameters, i.e.:

x data used, where all data contents must be numeric
centers the needed number of clusters

And the kmeans function is usually accompanied by a call to the seet.seed function. It is useful to “equate” the list of the same random values from kmeans so that we get the same output.

set.seed(100)
#create 5 clusters and 25 random scenarios
segmentasi <- kmeans(x = pelanggan[field_yang_digunakan], centers = 5, nstart = 25)
segmentasi

## K-means clustering with 5 clusters of sizes 5, 12, 14, 9, 10
## 
## Cluster means:
##   Jenis.Kelamin.1     Umur Profesi.1 Tipe.Residen.1 NilaiBelanjaSetahun
## 1            1.40 61.80000  4.200000       1.400000            8.696132
## 2            1.75 31.58333  3.916667       1.250000            7.330958
## 3            2.00 20.07143  3.571429       1.357143            5.901089
## 4            2.00 42.33333  4.000000       1.555556            8.804791
## 5            1.70 52.50000  3.800000       1.300000            6.018321
## 
## Clustering vector:
##  [1] 1 3 5 5 4 3 1 5 2 2 5 5 1 1 3 2 2 1 2 3 4 5 2 4 2 5 2 4 5 4 3 4 3 3 4 2 3 4
## [39] 3 3 3 2 2 3 3 3 5 4 2 5
## 
## Within cluster sum of squares by cluster:
## [1]  58.21123 174.85164 316.73367 171.67372 108.49735
##  (between_SS / total_SS =  92.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Sometimes, centers (number of segments) is not enough as a parameter. It is necessary to use another parameter, namely nstart, which is the number of random combinations generated internally by R. Based on the number given, the algorithm will select combinations whose distance from each point to mean of its own cluster is smaller than to mean of other clusters. It should be remembered that mean or average value here is often referred to as centroid in various data science literature.

The output can be divided into five parts, with an explanation according to the serial number in the figure as follows:

K-means clustering with 5 clusters of sizes 5, 12, 14, 9, 10 is size / number of data points in each cluster
Cluster means is the average value (centroid) of each cluster
Cluster vector is cluster division of each data element based on its position
Within cluster sum of squares by cluster is the sum of the squared distances from each point to the centroid. The concept of sum of squares (SS) is the sum of the “square distances” of the difference between each data point and its mean or centroid. This SS can be the mean or centroid for each cluster or the whole data. Sum of squares in other data science literature is often referred to as Sum of Squared Errors (SSE). The larger the SS value, the wider the difference between each data point in the cluster.
Available components is other information components contained in this kmeans object.

Clustering vector is a series of vectors containing a cluster number. From our output, the vector contains numbers 1 to 5, maximum according to the number of clusters we set. This vector starts from number 1, which means the first data from our dataset will be allocated to cluster number 1. The output can be accessed with the cluster component of the results object as follows:

segmentasi$cluster

##  [1] 1 3 5 5 4 3 1 5 2 2 5 5 1 1 3 2 2 1 2 3 4 5 2 4 2 5 2 4 5 4 3 4 3 3 4 2 3 4
## [39] 3 3 3 2 2 3 3 3 5 4 2 5

Now, we need to add the result of the segmentation to the original data. The method is quite easy, namely by creating a new column (we call it cluster) in the customer variable whose contents are segmentation$cluster. Then display the structure of the customer data with str() function.

pelanggan$cluster <- segmentasi$cluster
str(pelanggan)

## 'data.frame':    50 obs. of  14 variables:
##  $ Customer_ID        : chr  "CUST-001" "CUST-002" "CUST-003" "CUST-004" ...
##  $ Nama.Pelanggan     : chr  "Budi Anggara" "Shirley Ratuwati" "Agus Cahyono" "Antonius Winarta" ...
##  $ Jenis.Kelamin      : chr  "Pria" "Wanita" "Pria" "Pria" ...
##  $ Umur               : int  58 14 48 53 41 24 64 52 29 33 ...
##  $ Profesi            : chr  "Wiraswasta" "Pelajar" "Professional" "Professional" ...
##  $ Tipe.Residen       : chr  "Sector" "Cluster" "Cluster" "Cluster" ...
##  $ NilaiBelanjaSetahun: num  9.5 2.72 5.29 5.2 10.62 ...
##  $ Jenis.Kelamin.1    : int  1 2 1 1 2 2 1 1 2 1 ...
##  $ Profesi.1          : int  5 3 4 4 5 4 5 4 4 4 ...
##  $ Tipe.Residen.1     : int  2 1 1 1 1 1 2 1 2 1 ...
##  $ Jenis.Kelamin.2    : int  1 2 1 1 2 2 1 1 2 1 ...
##  $ Profesi.2          : int  5 3 4 4 5 4 5 4 4 4 ...
##  $ Tipe.Residen.2     : int  2 1 1 1 1 1 2 1 2 1 ...
##  $ cluster            : int  1 3 5 5 4 3 1 5 2 2 ...

K-means clustering with 5 clusters of sizes 5, 12, 14, 9, 10 means that by k-means we have divided the customer dataset by 5 clusters, where the 1st cluster has 5 data, the 2nd cluster has 12 data, cluster The 3rd cluster has 14 data, the 4th cluster has 9 data, and the 5th cluster has 10 data. With a total of 50 data, which is also the total number of customer data. Let’s verify this by starting from cluster 1. Take customer data whose cluster column is 1 by using which() function.

which(pelanggan$cluster == 1)

## [1]  1  7 13 14 18

To see the clustering results we simply use the syntax as below

pelanggan[which(pelanggan$cluster == 1),]

##    Customer_ID Nama.Pelanggan Jenis.Kelamin Umur          Profesi Tipe.Residen
## 1     CUST-001   Budi Anggara          Pria   58       Wiraswasta       Sector
## 7     CUST-007  Cahyono, Agus          Pria   64       Wiraswasta       Sector
## 13    CUST-013   Cahaya Putri        Wanita   64       Wiraswasta      Cluster
## 14    CUST-014 Mario Setiawan          Pria   60       Wiraswasta      Cluster
## 18    CUST-018    Nelly Halim        Wanita   63 Ibu Rumah Tangga      Cluster
##    NilaiBelanjaSetahun Jenis.Kelamin.1 Profesi.1 Tipe.Residen.1 Jenis.Kelamin.2
## 1             9.497927               1         5              2               1
## 7             9.837260               1         5              2               1
## 13            9.333168               2         5              1               2
## 14            9.471615               1         5              1               1
## 18            5.340690               2         1              1               2
##    Profesi.2 Tipe.Residen.2 cluster
## 1          5              2       1
## 7          5              2       1
## 13         5              1       1
## 14         5              1       1
## 18         1              1       1

What does Cluster means result mean?

The first column containing the numbers 1 through 5 is the cluster number.
Column Gender.1 shows the mean value of the sex data that has been converted to numeric, with 1 representing Male and 2 representing female.

The cluster 1 of column Jenis.Kelamin.1 is 1.40, means that cluster 1 is mixed but tends to male (1). Now, for values of the cluster 3 and 4 are 2.00, mean that the data only contains data with a female (2) profile.

segmentasi$centers

##   Jenis.Kelamin.1     Umur Profesi.1 Tipe.Residen.1 NilaiBelanjaSetahun
## 1            1.40 61.80000  4.200000       1.400000            8.696132
## 2            1.75 31.58333  3.916667       1.250000            7.330958
## 3            2.00 20.07143  3.571429       1.357143            5.901089
## 4            2.00 42.33333  4.000000       1.555556            8.804791
## 5            1.70 52.50000  3.800000       1.300000            6.018321

Based on the SS concept, the following is an explanation for the results of the kmeans output above:

The value 58.21123 is the SS for the 1st cluster, 174.85164 is the SS for the 2nd cluster, and so on. The smaller the value, the better the potential.
total_SS is the SS for all points against the global average, not per cluster. This value is always fixed and is not affected by the number of clusters.
between_SS is the total_SS minus the sum of the SS values of the entire cluster.
(between_SS / total_SS) is the ratio between between_SS divided by total_SS. The bigger the percentage, generally the better.

Available Components are nine object components that we can use to see the details of the k-means object. The following is a brief description of the nine components.

cluster is a vector of clusters for each data point.
centers is information on the centroid point of each cluster, as in section.
totss is total Sum of Squares (SS) for all centroids.
withinss is total Sum of Squares per cluster.
tot.withinss is total sum of each SS from withinss.
betweenss is value difference between totss and tot.withinss.
size is number of data points in each cluster.
iter is number of outer iterations used by kmeans.
ifault is an integer value that indicates a problem indicator in the algorithm.

All of these components can be accessed using $ accessor.

segmentasi$tot.withinss

## [1] 829.9676

Determine Optimal Cluster for Kmeans

From the information generated by the kmeans function, the Sum of Squares (SS) metric or often called the Sum of Squared Errors (SSE) is very important to be used as the basis for determining the optimal number of clusters. Theoretically, here are some things we can observe with SS:

The smaller the number of clusters produced, the greater the SS value, vice versa, the more the number of clusters produced, the smaller the SS value.
Because it is quadratic, if there is a significant difference between each cluster combination, the difference in the SS value will be even greater.
And as the number of clusters increases, the difference between each SS will be smaller.

The decision-making process based on elbow plotting is commonly called the Elbow Effect or the Elbow Method.

The elbow method metric used as the basis for justification is the Sum of Squares (SS), or more precisely the tot.withinss component of the kmeans object. This metric will look for the progressive value of tot.withinss for each combination of the number of clusters, and stored in vector form in R. For this purpose, we will use sapply(). The function that will be used to call the kmeans function for a range of the number of clusters. The range we will use is 1 to 10.

set.seed(100)
sse <- sapply(1:10, function(param_k)
              {kmeans(pelanggan[field_yang_digunakan], param_k, nstart=25)$tot.withinss}
             )
sse

##  [1] 10990.9740  3016.5612  1550.8725  1064.4187   829.9676   625.1462
##  [7]   508.1568   431.6977   374.1095   317.9424

We will visualize a Sum of Squares (SS) or Sum of Squared Errors (SSE) vector. We will use ggplot for visualization, the dataset is a combination of data frames from SSE and a value range of 1:10.

library(ggplot2)

jumlah_cluster_max <- 10
ssdata = data.frame(cluster=c(1:jumlah_cluster_max),sse)
ggplot(ssdata, aes(x=cluster,y=sse)) +
                geom_line(color="red") + geom_point() +
                ylab("Within Cluster Sum of Squares") + xlab("Jumlah Cluster") +
                geom_text(aes(label=format(round(sse, 2), nsmall = 2)),hjust=-0.2, vjust=-0.5) +
  scale_x_discrete(limits=c(1:jumlah_cluster_max))

## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?

By utilizing the value of Sum of Squares (SS) or Sum of Squared Errors (SSE) we can make a decision on the optimal number of segmentation that we use. This is done by simulating the iteration of the number of clusters from 1 to the maximum number we want. In this example, we use iteration numbers 1 to 10. After getting the SS value for each number of clusters, we can plot it to a line graph and use the elbow method to determine the optimal number of clusters.

Package Kmeans Model

We will name the segments according to their characteristics. To help, the following figure shows the mean values for each column used by each cluster as well as the column values before conversion.

Let’s try to name clusters 1 to 5 as follows:

Cluster 1 is Diamond Senior Member
Cluster 2 is Gold Young Professional
Cluster 3 is Silver Young Gals
Cluster 4 is Diamond Professional
Cluster 5 is Silver Mid Professional

Segmen.Pelanggan <- data.frame(cluster = c(1, 2, 3, 4, 5), Nama.Segmen = c("Diamond Senior Member", "Gold Young Professional", "Silver Youth Gals", "Diamond Professional", "Silver Mid Professional"))

Merge variables we create as reference. This will be our model that can be saved to a file and used when needed.

Identitas.Cluster <- list(Profesi=Profesi, Jenis.Kelamin=Jenis.Kelamin, Tipe.Residen=Tipe.Residen, Segmentasi=segmentasi, Segmen.Pelanggan=Segmen.Pelanggan, field_yang_digunakan=field_yang_digunakan)

The merged object already has all the assets needed to allocate new data to the appropriate segment. To save this object into a file we use saveRDS() function. This file can then be reopened as an object in the future.

saveRDS(Identitas.Cluster, "cluster.rds")

Operationalize Kmeans Model

The object from the processing of the K-Means algorithm and the related variables that we generated earlier must be applied to the real case so that a complete cycle occurs. The real case for our clustering is quite simple: how new data can automatically help marketing and CRM teams to quickly identify which customer segment they are in. With the speed of identification, the organization or business can quickly move with effective marketing messages and win the competition.

New customer data must be quickly mapped to segments. Assuming each new customer data is inputted into the system, then the processing is per record.

databaru <- data.frame(Customer_ID="CUST-100", Nama.Pelanggan="Rudi Wilamar",Umur=20,Jenis.Kelamin="Wanita",Profesi="Pelajar",Tipe.Residen="Cluster",NilaiBelanjaSetahun=3.5)
databaru

##   Customer_ID Nama.Pelanggan Umur Jenis.Kelamin Profesi Tipe.Residen
## 1    CUST-100   Rudi Wilamar   20        Wanita Pelajar      Cluster
##   NilaiBelanjaSetahun
## 1                 3.5

Open the file that we saved earlier with the command and recognized in R as the object that we will use to process the new data. To open the file, we use readRDS() function.

Identitas.Cluster <- readRDS(file="cluster.rds")
Identitas.Cluster

## $Profesi
##             Profesi Profesi.1
## 1        Wiraswasta         5
## 2           Pelajar         3
## 3      Professional         4
## 17 Ibu Rumah Tangga         1
## 31        Mahasiswa         2
## 
## $Jenis.Kelamin
##   Jenis.Kelamin Jenis.Kelamin.1
## 1          Pria               1
## 2        Wanita               2
## 
## $Tipe.Residen
##   Tipe.Residen Tipe.Residen.1
## 1       Sector              2
## 2      Cluster              1
## 
## $Segmentasi
## K-means clustering with 5 clusters of sizes 5, 12, 14, 9, 10
## 
## Cluster means:
##   Jenis.Kelamin.1     Umur Profesi.1 Tipe.Residen.1 NilaiBelanjaSetahun
## 1            1.40 61.80000  4.200000       1.400000            8.696132
## 2            1.75 31.58333  3.916667       1.250000            7.330958
## 3            2.00 20.07143  3.571429       1.357143            5.901089
## 4            2.00 42.33333  4.000000       1.555556            8.804791
## 5            1.70 52.50000  3.800000       1.300000            6.018321
## 
## Clustering vector:
##  [1] 1 3 5 5 4 3 1 5 2 2 5 5 1 1 3 2 2 1 2 3 4 5 2 4 2 5 2 4 5 4 3 4 3 3 4 2 3 4
## [39] 3 3 3 2 2 3 3 3 5 4 2 5
## 
## Within cluster sum of squares by cluster:
## [1]  58.21123 174.85164 316.73367 171.67372 108.49735
##  (between_SS / total_SS =  92.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"      
## 
## $Segmen.Pelanggan
##   cluster             Nama.Segmen
## 1       1   Diamond Senior Member
## 2       2 Gold Young Professional
## 3       3       Silver Youth Gals
## 4       4    Diamond Professional
## 5       5 Silver Mid Professional
## 
## $field_yang_digunakan
## [1] "Jenis.Kelamin.1"     "Umur"                "Profesi.1"          
## [4] "Tipe.Residen.1"      "NilaiBelanjaSetahun"

With the new data and the object containing the reference data read again, we can combine this new data to get the numeric conversion of the Jenis.Kelamin, Profesi and Tipe.Residen fields. The goal is that we will be able to find customer segments with the combined numerical data. The way to combine them is to use merge() function, where the two data will be combined by looking for the equation of the column names and their contents.

databaru <- merge(databaru, Identitas.Cluster$Profesi)
databaru <- merge(databaru, Identitas.Cluster$Jenis.Kelamin)
databaru <- merge(databaru, Identitas.Cluster$Tipe.Residen)
databaru

##   Tipe.Residen Jenis.Kelamin Profesi Customer_ID Nama.Pelanggan Umur
## 1      Cluster        Wanita Pelajar    CUST-100   Rudi Wilamar   20
##   NilaiBelanjaSetahun Profesi.1 Jenis.Kelamin.1 Tipe.Residen.1
## 1                 3.5         3               2              1

After merging reference to new data, which segment does this new data fall into? Use this which.min(sapply( 1:5, function( x ) sum( ( data[kolom] - objekkmeans$centers[x,])^2 ) )) to find it.

Identitas.Cluster$Segmen.Pelanggan[which.min(sapply( 1:5, function( x ) sum( ( databaru[Identitas.Cluster$field_yang_digunakan] - Identitas.Cluster$Segmentasi$centers[x,])^2 ) )),]

##   cluster       Nama.Segmen
## 3       3 Silver Youth Gals

Closing

So far, the technical topic for customer segmentation that we have studied is the K-Means algorithm implemented in R. Keep in mind that K-Means is not the only clustering algorithm, there are many other algorithms such as Hierarchical Clustering, Parallelized Hierarchical Clustering, and others.

And each algorithm also has its own advantages and disadvantages. But basic knowledge by starting from a popular algorithm and solving it will certainly be a valuable provision.

Data Science in Marketing: Customer Segmentation

Joseph Armando Carvallo