1 Introduction

This analysis uses mall customer data containing basic attributes: customer ID, age, gender, annual income, and spending score. The goal is to identify customer segments using K-Means clustering, so that the marketing team can see which segments to target when planning its marketing strategies.

2 Loading Packages & Data Preparation

2.1 Loading Packages

library(dplyr)
library(FactoMineR)
library(ggplot2)
library(funModeling)

2.2 Import Data

cust <- read.csv("Mall_Customers.csv")
names(cust) <- c("CustomerID","Gender","Age","Annual_Income", "Spending_Score")
cust

Data description:

CustomerID: Unique ID assigned to the customer.
Gender: Gender of the customer.
Age: Age of the customer.
Annual_Income (k$): Annual Income of the customer.
Spending_Score (1-100): Score assigned by the mall based on customer behavior and spending nature.

2.3 Inspect Data

Check Missing Values

colSums(is.na(cust))

##     CustomerID         Gender            Age  Annual_Income Spending_Score 
##              0              0              0              0              0

Exploratory Data Analysis (EDA)

Customer Distribution by Gender

freq(cust)

Customer Distribution by Age

hist(cust$Age,
    col="orange",
    main="Histogram to Show Count of Age Class",
    xlab="Age Class",
    ylab="Frequency",
    labels=TRUE)

3 K-Means Clustering

3.1 Scaling Data

We cluster on Annual Income and Spending Score, and scale both columns first so that each contributes equally to the distance calculation.

cust.sc <- scale(cust[, c(4, 5)])
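
To illustrate what scale() does, here is a minimal sketch on a small made-up data frame. The `toy` values are assumptions for illustration and are not taken from Mall_Customers.csv.

```r
# Toy data standing in for the income and spending columns (values are made up)
toy <- data.frame(
  Annual_Income  = c(15, 40, 60, 85, 120),
  Spending_Score = c(80, 45, 50, 20, 90)
)

# Subtract each column's mean, then divide by its standard deviation
toy.sc <- scale(toy)

# Each scaled column now has mean 0 and standard deviation 1
round(colMeans(toy.sc), 10)
apply(toy.sc, 2, sd)

# scale() stores the original column means and standard deviations as attributes
attr(toy.sc, "scaled:center")
attr(toy.sc, "scaled:scale")
```

The stored attributes are useful later if cluster centers need to be converted back to the original units.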

3.2 Elbow Method

Find the best K for K-Means using the elbow method.

wss <- function(data, maxCluster = 10) {
    # For k = 1 the within-cluster sum of squares equals the total sum of squares
    SSw <- numeric(maxCluster)
    SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
    # For k >= 2, sum the within-cluster sums of squares of a k-means fit
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}

set.seed(100)
wss(cust.sc)

In the elbow plot, the curve bends around k = 5, so we use 5 clusters in this analysis.

cust.KM <- kmeans(cust.sc, 5)
cust.KM

## K-means clustering with 5 clusters of sizes 35, 81, 23, 22, 39
## 
## Cluster means:
##   Annual_Income Spending_Score
## 1     1.0523622    -1.28122394
## 2    -0.2004097    -0.02638995
## 3    -1.3042458    -1.13411939
## 4    -1.3262173     1.12934389
## 5     0.9891010     1.23640011
## 
## Clustering vector:
##   [1] 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3
##  [36] 4 3 4 3 4 3 4 3 2 3 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 1 5 2 5 1 5 1 5 2 5 1 5 1 5 1 5
## [141] 1 5 2 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1
## [176] 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5
## 
## Within cluster sum of squares by cluster:
## [1] 18.304646 14.485632  7.577407  5.217630 19.655252
##  (between_SS / total_SS =  83.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
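
The between_SS / total_SS percentage in the printout can be recomputed from the components of the fitted object. The following is a minimal sketch on made-up data; `toy` and `fit` are illustrative names and do not refer to the mall data.

```r
set.seed(1)
# Two made-up, well-separated groups of points (illustrative only)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Share of the total variation that lies between clusters rather than within them
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)

# The total sum of squares splits into within- and between-cluster parts
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
```

A ratio closer to 100% means the points sit closer to their own cluster centers relative to the overall spread.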

# Adding 'Cluster' column 
cust$Cluster <- cust.KM$cluster
cust

c_Clust <- cust[, c(4, 5)]
ggplot(c_Clust, aes(x = Annual_Income, y = Spending_Score)) +
    geom_point(stat = "identity", aes(color = as.factor(cust.KM$cluster))) +
    scale_color_discrete(name = " ",
                         breaks = c("1", "2", "3", "4", "5"),
                         labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5")) +
    ggtitle("Customer Cluster") +
    xlab("Annual Income") + ylab("Spending Score")

Interpretation for the customer cluster/segment:

Cluster 1. Customers with high annual income but low spending score.
Cluster 2. Customers with medium annual income and medium spending score.
Cluster 3. Customers with low annual income and low spending score.
Cluster 4. Customers with low annual income but high spending score.
Cluster 5. Customers with high annual income and high spending score.
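
The cluster means printed by kmeans() are in scaled units, so labels such as "high" or "low" are relative to the sample average. One way to read the centers in the original units is to undo the scaling. The sketch below assumes made-up data; `toy`, `fit`, and `centers.orig` are hypothetical names used only for illustration.

```r
set.seed(1)
# Made-up income (k$) and spending-score values, for illustration only
toy <- data.frame(
  Annual_Income  = c(rnorm(20, 25, 3), rnorm(20, 90, 5)),
  Spending_Score = c(rnorm(20, 20, 4), rnorm(20, 80, 4))
)
toy.sc <- scale(toy)
fit <- kmeans(toy.sc, centers = 2, nstart = 10)

# Undo the scaling: multiply by the column sd, then add the column mean
centers.orig <- sweep(fit$centers, 2, attr(toy.sc, "scaled:scale"), "*")
centers.orig <- sweep(centers.orig, 2, attr(toy.sc, "scaled:center"), "+")
centers.orig

# Cross-check: averaging the raw columns within each cluster gives the same values
agg <- aggregate(toy, by = list(Cluster = fit$cluster), FUN = mean)
agg
```

Applied to this analysis, the same two sweep() calls on cust.KM$centers and the attributes of cust.sc would give the cluster centers in k$ and spending-score points.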

cust %>% group_by(Cluster,Gender) %>% 
  summarise(med_age=median(Age),med_income = median(Annual_Income), med_spend = median(Spending_Score))

4 Summary

Based on the analysis above, we can draw the following conclusions for marketing planning:

The EDA shows that the share of female customers (56%) is slightly higher than that of male customers (44%). With this information, we could direct more marketing campaigns or promotions at male customers than at female customers, even though the difference is not large. Male target customers can also be chosen more carefully by combining their age and cluster.
We can run marketing campaigns or loyalty programs for customers in clusters 4 and 5, who have high spending scores, especially customers in their 20s and 30s, to retain them and raise the likelihood of sales.
Clusters 1 and 3 generally have low spending scores regardless of income level, and their customers are generally over 40. Given this, we could research and add brands that are popular with customers in that age range, and run campaigns that target them with the right products.
Cluster 2 has a medium spending score. To increase sales in this cluster, we could offer promotions that encourage these customers to buy more products.

Customer Segmentation

Regita Anggriani

August 29, 2019