Customer Segmentation using K-Means Clustering in R

Market segmentation is a process that is used in market research to divide customers into different groups or segments according to a certain set of characteristics.

In this project, I will divide the customers of a fictional shopping mall (let’s say, Shopping Mall X) into clusters using K-means clustering. The dataset was obtained from a Coursera Guided Project.

First, I will import the packages needed for this project.

#Importing required packages

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(cluster)

Exploring the data

Now, I will import the data using the read.csv function and then explore it to get an understanding of its structure.

#Importing the "Mall_Customers.csv" data 
customer <- read.csv("A:/RStudio/R_Projects/projects/cust_seg/Mall_Customers.csv")

#Check the names of columns and structure of the dataset
names(customer)

## [1] "CustomerID"             "Gender"                 "Age"                   
## [4] "Annual.Income..k.."     "Spending.Score..1.100."

str(customer)

## 'data.frame':    200 obs. of  5 variables:
##  $ CustomerID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                : chr  "Male" "Male" "Female" "Female" ...
##  $ Age                   : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual.Income..k..    : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending.Score..1.100.: int  39 81 6 77 40 76 6 94 3 72 ...
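The dotted column names above are most likely the result of read.csv cleaning up the original headers, since it passes them through make.names() by default (check.names = TRUE). The sketch below illustrates this; the raw header strings are an assumption, because the CSV header row is not shown in this write-up.

```r
# Hypothetical raw headers: the actual header row of the CSV is not shown here
raw_headers <- c("CustomerID", "Gender", "Age",
                 "Annual Income (k$)", "Spending Score (1-100)")

# make.names() replaces characters that are invalid in R names with dots
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Passing check.names = FALSE to read.csv would keep the headers unchanged, at the cost of having to quote them with backticks in later code.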

I will rename some of the variables to make them easier to use later. After that, I will summarise the data to get an overview of each variable.

#Rename some column names
customer <- rename(customer, annual_income=Annual.Income..k..,
       spending_score=Spending.Score..1.100.)

#Summarise the data
summary(customer)

##    CustomerID        Gender               Age        annual_income   
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00  
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50  
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50  
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56  
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00  
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00  
##  spending_score 
##  Min.   : 1.00  
##  1st Qu.:34.75  
##  Median :50.00  
##  Mean   :50.20  
##  3rd Qu.:73.00  
##  Max.   :99.00

Descriptive analyses

Now that I’ve gotten a glimpse of the data, I will create some plots to visualise how it is distributed.

Firstly, I’d like to look at the demographics of the customers. This way, I’ll be able to easily see who exactly the customers of Shopping Mall X are before conducting more advanced statistical analyses.

# Creating a histogram to show dispersion of mall customers based on age
ggplot(customer,aes(x=Age)) +
  geom_histogram() +
  labs(title="Histogram showing distribution of Age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows that the customers range in age from 18 to 70. One way to simplify this view is to group the customers into 5-year age bins, which gives a neater visualisation of the data.

# Creating a histogram to show dispersion of mall customers based on age groups
ggplot(customer, aes(x = Age)) +
  geom_vline(aes(xintercept = mean(Age)), color = "blue",  #adding an intercept to indicate mean age
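             linetype = "dashed", linewidth = 1.0) +  #linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_histogram(binwidth = 5, aes(y = after_stat(density)),  #after_stat() replaces the deprecated ..density..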
                 color = "black", fill = "white") +
  geom_density(alpha = 0.4, fill = "red") +  #adding density plot
  labs(title = "Histogram to Show Density of Age Groups")
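The 5-year binning that the histogram performs can also be done explicitly with cut(), which is handy if the age groups are needed as a variable later on. This is a minimal sketch using only the first ten ages printed by str(customer); the break points are my own choice and not part of the original analysis.

```r
# First ten ages as printed by str(customer) above
ages <- c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30)

# 5-year bins that are closed on the left, e.g. [20,25) holds ages 20 to 24
age_group <- cut(ages, breaks = seq(15, 75, by = 5), right = FALSE)

# Count customers per bin and show only the non-empty bins
group_counts <- table(age_group)
print(group_counts[group_counts > 0])
```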

Now, let’s look at the gender distribution of the mall’s customer base.

# Creating a barplot to assess gender distribution of my sample of customers.
ggplot(customer,aes(x= Gender)) +
  geom_bar(stat="count",width=0.5,fill="steelblue") +
  theme_minimal()+
  labs(title="Barplot to display Gender Comparison", x="Gender")

Now that I know the age groups of the customers, and general distribution of their genders, I’d like to see how those two variables (gender and age) are dispersed throughout the mall customers.

# Creating a histogram of Age split by Gender
ggplot(customer,aes(x=Age, fill=Gender, color=Gender))+
  geom_histogram(bins = 10, position = "identity",alpha=0.5) +
  labs(title="Histogram showing distribution of Age by Gender")

Now that I know more about the characteristics of the sample, it’s time to get an understanding of their spending behaviour as customers of the mall.

# Creating density plot to show customer's annual income
ggplot(customer, aes(x = annual_income)) +
  geom_density(alpha=0.4, fill="blue") +
  scale_x_continuous(breaks = seq(0, 200, by = 10)) +
  labs(title="Density Plot for Annual Income")

The density plot shows the distribution of the mall customers’ annual income (in thousands). It seems that a majority of the customer base earns between 50,000 and around 80,000. Not many customers earn more than 100,000 annually.
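A reading taken from a density plot can be checked numerically by computing the share of customers inside an income band. The sketch below uses a made-up income vector purely for illustration, and share_between is a hypothetical helper name; with the real data, the same function could be applied to customer$annual_income.

```r
# Made-up incomes in thousands, for illustration only (not the mall data)
income_k <- c(20, 35, 48, 54, 60, 62, 70, 78, 88, 120)

# Hypothetical helper: proportion of values inside [lower, upper]
share_between <- function(x, lower, upper) {
  mean(x >= lower & x <= upper)
}

share_50_80 <- share_between(income_k, 50, 80)
print(share_50_80)
```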

# Create a boxplot to understand customers' spending score by gender.
ggplot(customer, aes(x = spending_score, y= Gender)) +
  geom_boxplot() +
  labs(title = "Boxplot showing customers' Spending Score by Gender")

From the boxplot, we can see that the median spending score is the same for males and females. We can also see that more women have a spending score above the median (50), whereas men tend to have a spending score below it.
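The two quantities read off the boxplot, the median per gender and the share of each gender above a score of 50, can also be computed directly. This is a small sketch on made-up data, so the numbers it prints say nothing about the mall customers; with the real data, the toy data frame would be replaced by customer.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 5),
  spending_score = c(20, 45, 50, 60, 90,
                     10, 30, 50, 55, 70)
)

# Median spending score per gender
median_by_gender <- tapply(toy$spending_score, toy$Gender, median)

# Share of each gender with a spending score above 50
share_above_50 <- tapply(toy$spending_score > 50, toy$Gender, mean)

print(median_by_gender)
print(share_above_50)
```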

Conducting the cluster analysis

The general steps for conducting a cluster analysis using the k-means algorithm are as follows:

1. Choose the number of clusters “K”
2. Select K random points to be the initial centroids of the clusters
3. Assign each data point to its nearest centroid, which creates “K” clusters
4. Calculate a new centroid for each cluster
5. Reassign each data point to the new closest centroid
6. Go to step 4 and repeat until the assignments no longer change
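To make the steps above concrete, here is a minimal hand-written sketch of the loop in base R. It is for illustration only and is not what the analysis below uses (that relies on the built-in kmeans function); the function name simple_kmeans and the toy data are my own.

```r
# Illustrative k-means loop; use stats::kmeans() for real analyses
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k random rows as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Steps 3 and 5: squared distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")
    # Step 6: stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 4: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}

# Toy data: two well-separated groups of 20 points each
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

Because the starting centroids are random, a single run can end in a poor local solution on harder data, which is why kmeans is usually called with several random starts (the nstart argument).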

Now, I will use the gap statistic to determine the optimal number of clusters for segmenting the mall customers. A more detailed explanation of the gap statistic in k-means clustering can be found in this well-written article by Tim Löhr.

# Setting seed to 125 for reproducibility
set.seed(125)

#Using the gap statistic to get the optimal number of clusters
stat_gap<-clusGap(customer[,3:5], FUNcluster=kmeans, nstart=25, K.max = 10, B=50)

#Plot the optimal number of clusters based on the gap statistic
plot(stat_gap)

The plot above shows that, based on the gap statistic, 6 is the optimal number of clusters to segment the mall customers into.
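Instead of reading the number of clusters off the plot, it can be extracted from the clusGap result with maxSE, which is also part of the cluster package. The sketch below runs on made-up toy data rather than the mall data, and the "firstSEmax" rule is only one of several selection rules that maxSE offers, so treat the choice as an assumption.

```r
library(cluster)

# Made-up toy data: three loose groups of 30 points each
set.seed(125)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2),
  matrix(rnorm(60, mean = 16), ncol = 2)
)

toy_gap <- clusGap(toy, FUNcluster = kmeans, nstart = 25,
                   K.max = 6, B = 20, verbose = FALSE)

# Tab holds one row per k, with the gap value and its simulation standard error
best_k <- maxSE(toy_gap$Tab[, "gap"], toy_gap$Tab[, "SE.sim"],
                method = "firstSEmax")
print(best_k)
```

Different rules (for example "globalmax" or "Tibs2001SEmax") can suggest different values of k on the same data, so it is worth stating which rule was used.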

Now, it is time to create the k-means clustering model for the data.

#Creating the customer clusters with KMeans
k6<-kmeans(customer[,3:5], 6, iter.max = 100, nstart=50,
           algorithm = "Lloyd")

#Printing the result
k6

## K-means clustering with 6 clusters of sizes 35, 22, 38, 44, 22, 39
## 
## Cluster means:
##        Age annual_income spending_score
## 1 41.68571      88.22857       17.28571
## 2 44.31818      25.77273       20.27273
## 3 27.00000      56.65789       49.13158
## 4 56.34091      53.70455       49.38636
## 5 25.27273      25.72727       79.36364
## 6 32.69231      86.53846       82.12821
## 
## Clustering vector:
##   [1] 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2
##  [38] 5 2 5 4 5 2 3 2 5 4 3 3 3 4 3 3 4 4 4 4 4 3 4 4 3 4 4 4 3 4 4 3 3 4 4 4 4
##  [75] 4 3 4 3 3 4 4 3 4 4 3 4 4 3 3 4 4 3 4 3 3 3 4 3 4 3 3 4 4 3 4 3 4 4 4 4 4
## [112] 3 3 3 3 3 4 4 4 4 3 3 3 6 3 6 1 6 1 6 1 6 3 6 1 6 1 6 1 6 1 6 3 6 1 6 1 6
## [149] 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1
## [186] 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6
## 
## Within cluster sum of squares by cluster:
## [1] 16690.857  8189.000  7742.895  7607.477  4099.818 13972.359
##  (between_SS / total_SS =  81.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

#Showing the six KMeans clusters
clusplot(customer, k6$cluster, color=TRUE, shade=TRUE, labels=0, lines=0)

The cluster plot reports that its two principal components explain about 66% of the variability in the data. The printed model output also shows more details of each cluster, including the mean age, annual income, and spending score of its customers.
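The between_SS / total_SS figure in the printed model is the share of the total variation that is explained by cluster membership, and it can be recomputed from the components of any kmeans result. A minimal sketch on made-up toy data (not the mall data):

```r
# Made-up toy data: two groups of 25 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(50, mean = 0), ncol = 2),
  matrix(rnorm(50, mean = 6), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Total sum of squares splits into a between-cluster and a within-cluster part
explained <- fit$betweenss / fit$totss
print(round(100 * explained, 1))
```

A higher percentage means tighter, better separated clusters, but it always increases as more clusters are added, so it is not a criterion for choosing K by itself.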

Next, I will perform a Principal Component Analysis (PCA) to reduce the dimensionality of the data and capture its 2 most significant components. For more information on PCA, this is a well-written write-up to refer to.

#Perform Principal Component Analysis
pcclust<-prcomp(customer[, 3:5], scale. = FALSE)

#Checking the summary of the PCA model
summary(pcclust)

## Importance of components:
##                            PC1     PC2     PC3
## Standard deviation     26.4625 26.1597 12.9317
## Proportion of Variance  0.4512  0.4410  0.1078
## Cumulative Proportion   0.4512  0.8922  1.0000

# Inspecting the loadings of the first two principal components
pcclust$rotation[, 1:2]

##                       PC1        PC2
## Age             0.1889742 -0.1309652
## annual_income  -0.5886410 -0.8083757
## spending_score -0.7859965  0.5739136

Results from the PCA show that the first two components (PC1 and PC2) together account for about 89% of the variance in the data. The large loadings of spending score on PC1 (-0.786) and of annual income on PC2 (-0.808) suggest that annual income and spending score are the 2 variables that drive most of the variation.
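The "Proportion of Variance" row in the PCA summary can be reproduced by hand: each component's variance is its squared standard deviation, divided by the total. The sketch below uses the rounded standard deviations printed above, so the results agree with the summary only up to rounding.

```r
# Standard deviations copied from the summary(pcclust) output above
sdev <- c(PC1 = 26.4625, PC2 = 26.1597, PC3 = 12.9317)

variance <- sdev^2                     # variance carried by each component
prop_var <- variance / sum(variance)   # proportion of total variance
cum_var <- cumsum(prop_var)            # cumulative proportion

print(round(rbind(prop_var, cum_var), 3))
```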

Finally, I will plot the customer segments based on results from the cluster analysis and PCA.

# Set seed to 1
set.seed(1)

#Create a plot of the customers segments
ggplot(customer, aes(x = annual_income , y = spending_score)) + 
  geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name = " ", 
                       breaks=c("1", "2", "3", "4", "5","6"),
                       labels=c("Cluster 1", "Cluster 2", "Cluster 3", 
                                "Cluster 4", "Cluster 5","Cluster 6")) +
  ggtitle("Segments of Mall Customers", 
          subtitle = "Using K-means Clustering")

I want to make this plot more readable and presentable for external stakeholders. Based on the plot, we can easily classify each cluster by annual income and spending score.

#Create a plot of the customers segments
ggplot(customer, aes(x = annual_income , y = spending_score)) + 
  geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name = " ", 
                       breaks=c("1", "2", "3", "4", "5","6"),
                       labels=c("High Income, Low Spending", "Low Income, Low Spending", "Medium Income, Medium Spending", 
                                "Medium Income, Medium Spending", "Low Income, High Spending","High Income, High Spending")) +
  labs(x="Annual Income", y="Spending Score") +
  ggtitle("Segments of Mall X Customers", 
          subtitle = "Using K-means Clustering")

Now I have a final plot that can be easily understood. The results show 6 distinct clusters of customers of Mall X. However, we can see some overlap between the 3rd and 4th clusters (Medium Income, Medium Spending). Further analysis is needed to figure out why these two segments were separated into different clusters, although it may indicate that a third variable, such as age, is driving the split.
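One quick way to start that further analysis is to compare the cluster means printed by the model earlier. The sketch below copies those means into a matrix and looks at how far apart clusters 3 and 4 are on each variable; it is only a first check and does not replace a proper follow-up analysis.

```r
# Cluster means copied from the printed k6 output above
centers <- matrix(
  c(41.68571, 88.22857, 17.28571,
    44.31818, 25.77273, 20.27273,
    27.00000, 56.65789, 49.13158,
    56.34091, 53.70455, 49.38636,
    25.27273, 25.72727, 79.36364,
    32.69231, 86.53846, 82.12821),
  ncol = 3, byrow = TRUE,
  dimnames = list(as.character(1:6),
                  c("Age", "annual_income", "spending_score"))
)

# Absolute difference between the means of clusters 3 and 4
gap_3_vs_4 <- abs(centers["3", ] - centers["4", ])
print(round(gap_3_vs_4, 2))
```

On these printed means, the two clusters are close in annual income and spending score but far apart in mean age, which is consistent with age being the variable that separates them.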

Fadzlina Aziz