I would like to share what I’ve learned at DQLab about customer segmentation. Customer segmentation is the process of clustering customers based on their characteristics. It matters because it lets us target advertising more precisely and spend the marketing budget more efficiently. In this case, I am using R with a dataset from Kaggle, which you can access here. Customer segmentation takes three steps: preparing the data, determining the number of clusters, and clustering and analyzing the result.
The first step is loading the data with the read.csv function. The dataset is saved in the Mall_Customers variable.
Mall_Customers<-read.csv("C:/Users/Home/Downloads/Mall_Customers.csv")
str(Mall_Customers)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income..k.. : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
The dataset consists of 5 columns:
1. CustomerID: Unique ID for each customer
2. Gender: Customer’s gender, male or female
3. Age: Customer’s age
4. Annual.Income..k..: Customer’s income per year, in thousands of dollars
5. Spending.Score..1.100.: Customer’s spending score, from 1 to 100
The kmeans function takes 2 main parameters:
1. x: The data, which must all be numerical
2. centers: The desired number of clusters, which will be determined in the second step
Gender is the only non-numerical column, so we have to convert it to a numerical type, stored here in customer_matrix. After the conversion, we append the numeric gender column to the original data, where data.frame names it Gender.1, and save the names of the columns used for clustering in the customer_field variable. The data will look like this:
customer_matrix <- data.matrix(Mall_Customers[c("Gender")])
Mall_Customers <- data.frame(Mall_Customers, customer_matrix)
customer_field<-c("Gender.1", "Age", "Annual.Income..k..", "Spending.Score..1.100.")
Mall_Customers
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100. Gender.1
## 1 1 Male 19 15 39 2
## 2 2 Male 21 15 81 2
## 3 3 Female 20 16 6 1
## 4 4 Female 23 16 77 1
## 5 5 Female 31 17 40 1
## 6 6 Female 22 17 76 1
## 7 7 Female 35 18 6 1
## 8 8 Female 23 18 94 1
## 9 9 Male 64 19 3 2
## 10 10 Female 30 19 72 1
## 11 11 Male 67 19 14 2
## 12 12 Female 35 19 99 1
## 13 13 Female 58 20 15 1
## 14 14 Female 24 20 77 1
## 15 15 Male 37 20 13 2
## 16 16 Male 22 20 79 2
## 17 17 Female 35 21 35 1
## 18 18 Male 20 21 66 2
## 19 19 Male 52 23 29 2
## 20 20 Female 35 23 98 1
## 21 21 Male 35 24 35 2
## 22 22 Male 25 24 73 2
## 23 23 Female 46 25 5 1
## 24 24 Male 31 25 73 2
## 25 25 Female 54 28 14 1
## 26 26 Male 29 28 82 2
## 27 27 Female 45 28 32 1
## 28 28 Male 35 28 61 2
## 29 29 Female 40 29 31 1
## 30 30 Female 23 29 87 1
## 31 31 Male 60 30 4 2
## 32 32 Female 21 30 73 1
## 33 33 Male 53 33 4 2
## 34 34 Male 18 33 92 2
## 35 35 Female 49 33 14 1
## 36 36 Female 21 33 81 1
## 37 37 Female 42 34 17 1
## 38 38 Female 30 34 73 1
## 39 39 Female 36 37 26 1
## 40 40 Female 20 37 75 1
## 41 41 Female 65 38 35 1
## 42 42 Male 24 38 92 2
## 43 43 Male 48 39 36 2
## 44 44 Female 31 39 61 1
## 45 45 Female 49 39 28 1
## 46 46 Female 24 39 65 1
## 47 47 Female 50 40 55 1
## 48 48 Female 27 40 47 1
## 49 49 Female 29 40 42 1
## 50 50 Female 31 40 42 1
## 51 51 Female 49 42 52 1
## 52 52 Male 33 42 60 2
## 53 53 Female 31 43 54 1
## 54 54 Male 59 43 60 2
## 55 55 Female 50 43 45 1
## 56 56 Male 47 43 41 2
## 57 57 Female 51 44 50 1
## 58 58 Male 69 44 46 2
## 59 59 Female 27 46 51 1
## 60 60 Male 53 46 46 2
## 61 61 Male 70 46 56 2
## 62 62 Male 19 46 55 2
## 63 63 Female 67 47 52 1
## 64 64 Female 54 47 59 1
## 65 65 Male 63 48 51 2
## 66 66 Male 18 48 59 2
## 67 67 Female 43 48 50 1
## 68 68 Female 68 48 48 1
## 69 69 Male 19 48 59 2
## 70 70 Female 32 48 47 1
## 71 71 Male 70 49 55 2
## 72 72 Female 47 49 42 1
## 73 73 Female 60 50 49 1
## 74 74 Female 60 50 56 1
## 75 75 Male 59 54 47 2
## 76 76 Male 26 54 54 2
## 77 77 Female 45 54 53 1
## 78 78 Male 40 54 48 2
## 79 79 Female 23 54 52 1
## 80 80 Female 49 54 42 1
## 81 81 Male 57 54 51 2
## 82 82 Male 38 54 55 2
## 83 83 Male 67 54 41 2
## 84 84 Female 46 54 44 1
## 85 85 Female 21 54 57 1
## 86 86 Male 48 54 46 2
## 87 87 Female 55 57 58 1
## 88 88 Female 22 57 55 1
## 89 89 Female 34 58 60 1
## 90 90 Female 50 58 46 1
## 91 91 Female 68 59 55 1
## 92 92 Male 18 59 41 2
## 93 93 Male 48 60 49 2
## 94 94 Female 40 60 40 1
## 95 95 Female 32 60 42 1
## 96 96 Male 24 60 52 2
## 97 97 Female 47 60 47 1
## 98 98 Female 27 60 50 1
## 99 99 Male 48 61 42 2
## 100 100 Male 20 61 49 2
## 101 101 Female 23 62 41 1
## 102 102 Female 49 62 48 1
## 103 103 Male 67 62 59 2
## 104 104 Male 26 62 55 2
## 105 105 Male 49 62 56 2
## 106 106 Female 21 62 42 1
## 107 107 Female 66 63 50 1
## 108 108 Male 54 63 46 2
## 109 109 Male 68 63 43 2
## 110 110 Male 66 63 48 2
## 111 111 Male 65 63 52 2
## 112 112 Female 19 63 54 1
## 113 113 Female 38 64 42 1
## 114 114 Male 19 64 46 2
## 115 115 Female 18 65 48 1
## 116 116 Female 19 65 50 1
## 117 117 Female 63 65 43 1
## 118 118 Female 49 65 59 1
## 119 119 Female 51 67 43 1
## 120 120 Female 50 67 57 1
## 121 121 Male 27 67 56 2
## 122 122 Female 38 67 40 1
## 123 123 Female 40 69 58 1
## 124 124 Male 39 69 91 2
## 125 125 Female 23 70 29 1
## 126 126 Female 31 70 77 1
## 127 127 Male 43 71 35 2
## 128 128 Male 40 71 95 2
## 129 129 Male 59 71 11 2
## 130 130 Male 38 71 75 2
## 131 131 Male 47 71 9 2
## 132 132 Male 39 71 75 2
## 133 133 Female 25 72 34 1
## 134 134 Female 31 72 71 1
## 135 135 Male 20 73 5 2
## 136 136 Female 29 73 88 1
## 137 137 Female 44 73 7 1
## 138 138 Male 32 73 73 2
## 139 139 Male 19 74 10 2
## 140 140 Female 35 74 72 1
## 141 141 Female 57 75 5 1
## 142 142 Male 32 75 93 2
## 143 143 Female 28 76 40 1
## 144 144 Female 32 76 87 1
## 145 145 Male 25 77 12 2
## 146 146 Male 28 77 97 2
## 147 147 Male 48 77 36 2
## 148 148 Female 32 77 74 1
## 149 149 Female 34 78 22 1
## 150 150 Male 34 78 90 2
## 151 151 Male 43 78 17 2
## 152 152 Male 39 78 88 2
## 153 153 Female 44 78 20 1
## 154 154 Female 38 78 76 1
## 155 155 Female 47 78 16 1
## 156 156 Female 27 78 89 1
## 157 157 Male 37 78 1 2
## 158 158 Female 30 78 78 1
## 159 159 Male 34 78 1 2
## 160 160 Female 30 78 73 1
## 161 161 Female 56 79 35 1
## 162 162 Female 29 79 83 1
## 163 163 Male 19 81 5 2
## 164 164 Female 31 81 93 1
## 165 165 Male 50 85 26 2
## 166 166 Female 36 85 75 1
## 167 167 Male 42 86 20 2
## 168 168 Female 33 86 95 1
## 169 169 Female 36 87 27 1
## 170 170 Male 32 87 63 2
## 171 171 Male 40 87 13 2
## 172 172 Male 28 87 75 2
## 173 173 Male 36 87 10 2
## 174 174 Male 36 87 92 2
## 175 175 Female 52 88 13 1
## 176 176 Female 30 88 86 1
## 177 177 Male 58 88 15 2
## 178 178 Male 27 88 69 2
## 179 179 Male 59 93 14 2
## 180 180 Male 35 93 90 2
## 181 181 Female 37 97 32 1
## 182 182 Female 32 97 86 1
## 183 183 Male 46 98 15 2
## 184 184 Female 29 98 88 1
## 185 185 Female 41 99 39 1
## 186 186 Male 30 99 97 2
## 187 187 Female 54 101 24 1
## 188 188 Male 28 101 68 2
## 189 189 Female 41 103 17 1
## 190 190 Female 36 103 85 1
## 191 191 Female 34 103 23 1
## 192 192 Female 32 103 69 1
## 193 193 Male 33 113 8 2
## 194 194 Female 38 113 91 1
## 195 195 Female 47 120 16 1
## 196 196 Female 35 120 79 1
## 197 197 Female 45 126 28 1
## 198 198 Male 32 126 74 2
## 199 199 Male 32 137 18 2
## 200 200 Male 30 137 83 2
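To make the gender encoding explicit, here is a minimal sketch on a toy vector (the values are illustrative, not rows from the dataset). It assumes the same level order that produced Gender.1 above, where Female becomes 1 and Male becomes 2, and it sets that order by hand instead of relying on defaults.

```r
# Toy gender values (illustrative only, not taken from the dataset)
gender <- c("Male", "Male", "Female", "Female")

# Fix the level order explicitly so the codes do not depend on defaults
gender_factor <- factor(gender, levels = c("Female", "Male"))

# Integer codes follow the level order: Female = 1, Male = 2
gender_code <- as.integer(gender_factor)
print(gender_code)
```

Setting the levels by hand keeps the mapping stable even if a category is missing from a particular subset of the data.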
Now that the first parameter is ready, we move on to the second one, centers. We will use the Elbow Method, which compares the SSE (Sum of Squared Errors, the total within-cluster sum of squares) across different numbers of clusters. SSE generally decreases as clusters are added, so we look for a combination that gives a low SSE and a high between_SS/total_SS percentage with as few clusters as possible. On the curve, this is the point where the line bends most sharply, the “elbow”, after which adding more clusters reduces the SSE only slightly. To draw the Elbow Curve, we need the SSE for each number of clusters, so we compute them. The results are as follows:
set.seed(100)
sse<-sapply(1:10,function(param_k){kmeans(Mall_Customers[customer_field],param_k,nstart=25)$tot.withinss})
sse
## [1] 308862.06 212889.44 143391.59 104414.68 75399.62 58348.64 51130.69
## [8] 44355.31 40615.15 37061.44
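The quantities used here are related in a simple way: the total sum of squares splits into a within-cluster part (the SSE) and a between-cluster part. The sketch below checks this on a small synthetic data frame; the `toy` data and its column names are assumptions made for illustration, not part of the mall dataset.

```r
set.seed(100)

# Toy data: two well-separated groups of 30 points (illustrative only)
toy <- data.frame(x = c(rnorm(30, mean = 0), rnorm(30, mean = 5)),
                  y = c(rnorm(30, mean = 0), rnorm(30, mean = 5)))

fit <- kmeans(toy, centers = 2, nstart = 25)

# tot.withinss (the SSE) is the sum of the per-cluster withinss values
sse_total <- sum(fit$withinss)

# The percentage printed by kmeans is betweenss divided by totss
ratio <- fit$betweenss / fit$totss

print(c(sse = sse_total, between = fit$betweenss, total = fit$totss))
print(round(100 * ratio, 1))
```

Because total = within + between, a lower SSE for the same data always means a higher between_SS/total_SS percentage.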
Now that we have the SSE for each number of clusters, we can draw the curve. We need the ggplot2 package for this. The result will be:
library(ggplot2)
cluster_max <- 10
# One row per candidate number of clusters, paired with its SSE
ssdata <- data.frame(cluster = 1:cluster_max, sse)
ggplot(ssdata, aes(x = cluster, y = sse)) +
  geom_line(color = "red") + geom_point() +
  ylab("Within Cluster Sum of Squares") + xlab("Total Cluster") +
  geom_text(aes(label = format(round(sse, 2), nsmall = 2)), hjust = -0.2, vjust = -0.5) +
  # cluster is numeric, so use a continuous scale with one break per value
  scale_x_continuous(breaks = 1:cluster_max)
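As a rough numeric companion to the elbow curve, the sketch below computes the percentage drop in SSE for each added cluster. The SSE values are copied from the output above; the `pct_drop` name is introduced here only for illustration, and this is one simple heuristic, not a formal rule for choosing the number of clusters.

```r
# SSE values copied from the sapply output above (k = 1..10)
sse <- c(308862.06, 212889.44, 143391.59, 104414.68, 75399.62,
         58348.64, 51130.69, 44355.31, 40615.15, 37061.44)

# Percentage drop in SSE when going from k to k + 1 clusters
pct_drop <- -diff(sse) / head(sse, -1) * 100
names(pct_drop) <- paste0(1:9, "->", 2:10)

print(round(pct_drop, 1))
```

Each added cluster removes more than 20% of the SSE up to the step from 5 to 6 clusters, while the step from 6 to 7 removes only about 12%, which is where the curve flattens.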
Based on this curve, the optimal number of clusters is 5 or 6. To choose between them, we compare the distribution of data across clusters, the SSE values, and the between_SS/total_SS value. This is the result when we use 5 clusters:
set.seed(100)
kmeans(x=Mall_Customers[c(customer_field)],centers=5,nstart=25)
## K-means clustering with 5 clusters of sizes 79, 39, 36, 23, 23
##
## Cluster means:
## Gender.1 Age Annual.Income..k.. Spending.Score..1.100.
## 1 1.417722 43.08861 55.29114 49.56962
## 2 1.461538 32.69231 86.53846 82.12821
## 3 1.527778 40.66667 87.75000 17.58333
## 4 1.391304 25.52174 26.30435 78.56522
## 5 1.391304 45.21739 26.30435 20.91304
##
## Clustering vector:
## [1] 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5
## [38] 4 5 4 5 4 5 4 5 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 1 2 3 2 3 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
##
## Within cluster sum of squares by cluster:
## [1] 30157.266 13982.051 17678.472 4627.739 8954.087
## (between_SS / total_SS = 75.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
With 5 clusters, the data isn’t evenly distributed. Cluster 1 holds 79 of the 200 customers, about twice as many as the next largest cluster (39). The within-cluster SSE values are also relatively large, resulting in a lower between_SS/total_SS percentage (75.6%). Now compare this with 6 clusters. The result is as follows:
set.seed(100)
kmeans(x=Mall_Customers[c(customer_field)],centers=6,nstart=25)
## K-means clustering with 6 clusters of sizes 45, 39, 35, 22, 21, 38
##
## Cluster means:
## Gender.1 Age Annual.Income..k.. Spending.Score..1.100.
## 1 1.444444 56.15556 53.37778 49.08889
## 2 1.461538 32.69231 86.53846 82.12821
## 3 1.571429 41.68571 88.22857 17.28571
## 4 1.409091 25.27273 25.72727 79.36364
## 5 1.380952 44.14286 25.14286 19.52381
## 6 1.342105 27.00000 56.65789 49.13158
##
## Clustering vector:
## [1] 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5
## [38] 4 5 4 1 4 1 6 5 4 1 6 6 6 1 6 6 1 1 1 1 1 6 1 1 6 1 1 1 6 1 1 6 6 1 1 1 1
## [75] 1 6 1 6 6 1 1 6 1 1 6 1 1 6 6 1 1 6 1 6 6 6 1 6 1 6 6 1 1 6 1 6 1 1 1 1 1
## [112] 6 6 6 6 6 1 1 1 1 6 6 6 2 6 2 3 2 3 2 3 2 6 2 3 2 3 2 3 2 3 2 6 2 3 2 3 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
##
## Within cluster sum of squares by cluster:
## [1] 8073.244 13982.051 16699.429 4105.136 7737.333 7751.447
## (between_SS / total_SS = 81.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
With 6 clusters, the data is more evenly distributed: cluster sizes range from 21 to 45, a much smaller gap than with 5 clusters. The SSE values are also relatively smaller, resulting in a higher between_SS/total_SS percentage (81.1%). So, based on this comparison, we decided to use 6 clusters.
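The same comparison can be collected into one small table instead of reading two long printouts. The sketch below is a minimal version under stated assumptions: `toy` is a synthetic stand-in for Mall_Customers[customer_field], and `summarise_k` is a hypothetical helper written for this example.

```r
set.seed(100)

# Synthetic stand-in for Mall_Customers[customer_field] (illustrative only)
toy <- data.frame(age    = runif(200, 18, 70),
                  income = runif(200, 15, 137),
                  score  = runif(200, 1, 100))

# Hypothetical helper: fit k-means for one k and keep the numbers we compare
summarise_k <- function(k, data) {
  fit <- kmeans(data, centers = k, nstart = 25)
  data.frame(k           = k,
             smallest    = min(fit$size),   # size of the smallest cluster
             largest     = max(fit$size),   # size of the largest cluster
             sse         = fit$tot.withinss,
             between_pct = 100 * fit$betweenss / fit$totss)
}

# One row per candidate number of clusters
comparison <- do.call(rbind, lapply(5:6, summarise_k, data = toy))
print(comparison)
```

On the real data, replacing `toy` with Mall_Customers[customer_field] should give the sizes and percentages shown in the two printouts above.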
Now that we have our desired number of clusters, we can run the clustering and analyze the result, which is saved in the segmentation variable. We will analyze the characteristics of each cluster using its center, that is, the mean of each variable within the cluster.
set.seed(100)
segmentation <-kmeans(x=Mall_Customers[c(customer_field)],centers=6,nstart=25)
segmentation$centers
## Gender.1 Age Annual.Income..k.. Spending.Score..1.100.
## 1 1.444444 56.15556 53.37778 49.08889
## 2 1.461538 32.69231 86.53846 82.12821
## 3 1.571429 41.68571 88.22857 17.28571
## 4 1.409091 25.27273 25.72727 79.36364
## 5 1.380952 44.14286 25.14286 19.52381
## 6 1.342105 27.00000 56.65789 49.13158
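The centers are simply the per-cluster means of the input columns. As a minimal sketch of that relationship, the example below attaches the cluster labels to a toy data frame and recomputes the means with aggregate; the `toy` data and its column names are assumptions for illustration only.

```r
set.seed(100)

# Toy data: a young, high-spending group and an older, low-spending group
toy <- data.frame(age   = c(rnorm(20, 25, 2), rnorm(20, 55, 2)),
                  score = c(rnorm(20, 80, 5), rnorm(20, 20, 5)))

fit <- kmeans(toy, centers = 2, nstart = 25)

# Attach each row's cluster label to the data
toy$cluster <- fit$cluster

# Recompute the per-cluster means from the labelled rows
cluster_means <- aggregate(cbind(age, score) ~ cluster, data = toy, FUN = mean)

print(cluster_means)
print(fit$centers)
```

Keeping the label as a column also makes it easy to count or list the customers in each segment later.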
Each cluster has different characteristics, which we break down below. Since Gender is in numerical form, please note that 1 is for female and 2 is for male, so an average below 1.5 means the cluster is majority female.
Cluster 1: Majority female, the average age is 56 years old, the average annual income is $53k, and the average spending score is 49 of 100.
Cluster 2: Majority female, the average age is 32 years old, the average annual income is $86k, and the average spending score is 82 of 100.
Cluster 3: Majority male, the average age is 41 years old, the average annual income is $88k, and the average spending score is 17 of 100.
Cluster 4: Majority female, the average age is 25 years old, the average annual income is $25k, and the average spending score is 79 of 100.
Cluster 5: Majority female, the average age is 44 years old, the average annual income is $25k, and the average spending score is 19 of 100.
Cluster 6: Majority female, the average age is 27 years old, the average annual income is $56k, and the average spending score is 49 of 100.
Based on the clusters’ characteristics, we can see that the clusters with an average age in the 40s (clusters 3 and 5) tend to spend less than the others, as shown by their lower spending scores. Most of the clusters that are majority female have moderate to high spending scores.
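The gender reading above can be turned into an explicit share. With Female = 1 and Male = 2, a cluster's Gender.1 mean equals 1 × p + 2 × (1 − p), where p is the share of female customers, so p = 2 − mean. The sketch below applies this to the Gender.1 values copied from segmentation$centers; the `female_share` name is introduced here for illustration.

```r
# Gender.1 means copied from segmentation$centers above
gender_mean <- c(1.444444, 1.461538, 1.571429, 1.409091, 1.380952, 1.342105)

# With Female = 1 and Male = 2: mean = 1 * p + 2 * (1 - p), so p = 2 - mean
female_share <- 2 - gender_mean
names(female_share) <- paste("Cluster", 1:6)

# Share of female customers per cluster, in percent
print(round(100 * female_share, 1))
```

Only cluster 3 falls below 50% female, and the majorities in the other clusters are modest, ranging from roughly 54% to 66%.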