The data set I chose is from Kaggle and contains basic customer data: customer ID, age, gender, annual income, and spending score. As the store manager, I want to understand the customers’ preferences so that our marketing team can plan its strategy accordingly. I use clustering to segment the customers and identify the target groups with which to start the marketing strategy.
library(dplyr)  # provides glimpse(), select(), and %>% used below
data2 <- read.csv("https://raw.githubusercontent.com/JennierJ/CUNY_DATA_622/main/Mall_Customers.csv", header = TRUE)
glimpse(data2)
Rows: 200
Columns: 5
$ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ~
$ Gender <chr> "Male", "Male", "Female", "Female", "Female", "~
$ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,~
$ Annual.Income..k.. <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,~
$ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15~
head(data2)
  CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
1          1   Male  19                 15                     39
2          2   Male  21                 15                     81
3          3 Female  20                 16                      6
4          4 Female  23                 16                     77
5          5 Female  31                 17                     40
6          6 Female  22                 17                     76
The data has 200 rows, one per customer. Let’s take a look at how each variable is distributed.
Annual Income Distribution
hist(data2$Annual.Income..k..,
main = "Annual income distribution",
xlab = "Annual Income",
col = "darkmagenta")
Most annual incomes fall between roughly 40k and 80k.
Age Distribution
hist(data2$Age,
main = "Age distribution",
xlab = "Age",
col = "blue")
The customers span a wide range of ages.
Spending Score Distribution
hist(data2$Spending.Score..1.100.,
main = "Spending Score Distribution",
xlab = "Spending Score",
col = "red")
Spending scores are most concentrated between 40 and 60.
Gender Analysis
counts <- table(data2$Gender)
barplot(counts, main="Gender Distribution",
xlab="Gender", col = "darkblue")
There are more female than male customers in the data.
I use the numeric variables for the k-means clustering analysis.
library(ggplot2)  # provides ggplot(), aes(), geom_point()
ggplot(data2, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = Gender)) +
  geom_point(size = 3)
Both genders appear scattered across the same range of annual income and spending score, with no obvious separation between them. I want to limit the segmentation to two features: spending score and annual income. Let me take a look at the summary statistics for these two features:
data2 %>%
select(Annual.Income..k.., Spending.Score..1.100.) %>%
summary()
Annual.Income..k.. Spending.Score..1.100.
Min. : 15.00 Min. : 1.00
1st Qu.: 41.50 1st Qu.:34.75
Median : 61.50 Median :50.00
Mean : 60.56 Mean :50.20
3rd Qu.: 78.00 3rd Qu.:73.00
Max. :137.00 Max. :99.00
The two features have different ranges: annual income runs from 15 to 137, while spending score runs from 1 to 99. Because k-means relies on distances, I standardize both features before building the model.
data2_scaled <- data2 %>%
select(Annual.Income..k.., Spending.Score..1.100.) %>%
scale()
The new data set now holds the standardized values of the two features I intend to use for segmentation. Each has a mean of 0.
summary(data2_scaled)
Annual.Income..k.. Spending.Score..1.100.
Min. :-1.73465 Min. :-1.905240
1st Qu.:-0.72569 1st Qu.:-0.598292
Median : 0.03579 Median :-0.007745
Mean : 0.00000 Mean : 0.000000
3rd Qu.: 0.66401 3rd Qu.: 0.882916
Max. : 2.91037 Max. : 1.889750
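As a check on what scale() does by default, the following minimal sketch standardizes a small vector by hand and compares the result. The six values are only the first incomes shown by glimpse() earlier, used here for illustration; the names income, z_manual, and z_scale are mine.

```r
# Illustrative values only: the first six incomes (in k) listed by glimpse().
income <- c(15, 15, 16, 16, 17, 17)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (income - mean(income)) / sd(income)

# scale() returns a one-column matrix with attributes; as.numeric() drops them.
z_scale <- as.numeric(scale(income))

# The two versions should agree.
all.equal(z_manual, z_scale)

# After standardizing, the mean is (numerically) 0 and the sd is 1.
mean(z_manual)
sd(z_manual)
```

Because both features end up with a standard deviation of 1, neither dominates the Euclidean distances that k-means uses.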
I am ready to cluster the data. I start with k = 3 and 25 random initial configurations (nstart = 25); kmeans() keeps the best of the 25 runs.
set.seed(1234)
k_3 <- kmeans(data2_scaled, centers = 3, nstart = 25)
k_3$size
[1] 123 39 38
These numbers are the cluster sizes: the three clusters contain 123, 39, and 38 observations.
k_3$centers
Annual.Income..k.. Spending.Score..1.100.
1 -0.6246222 -0.01435636
2 0.9891010 1.23640011
3 1.0066735 -1.22246770
These are the centers of the three clusters, expressed in standardized units.
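Centers in standardized units are hard to read as incomes and scores. A rough sketch of converting them back to the original units follows. It assumes the means from summary(data2) and the rounded standard deviations that str(pca) prints later in this report, so the results are approximate; the names centers_z, mu, sigma, and centers_orig are mine.

```r
# Cluster centers in standardized units, copied from k_3$centers.
centers_z <- matrix(
  c(-0.6246222, -0.01435636,
     0.9891010,  1.23640011,
     1.0066735, -1.22246770),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("income_k", "spending_score"))
)

# Means from summary(data2); standard deviations are the ROUNDED values
# shown by str(pca), so the output is only approximate.
mu    <- c(income_k = 60.56, spending_score = 50.20)
sigma <- c(income_k = 26.3,  spending_score = 25.8)

# Undo the z-score column by column: x = z * sd + mean.
centers_orig <- sweep(sweep(centers_z, 2, sigma, `*`), 2, mu, `+`)
round(centers_orig, 1)
```

Under these assumptions, cluster 1 sits at roughly 44k income with a spending score near 50, cluster 2 at roughly 87k with a score near 82, and cluster 3 at roughly 87k with a score near 19. In a live session, the exact values could be taken from attr(data2_scaled, "scaled:center") and attr(data2_scaled, "scaled:scale") instead of the rounded numbers.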
library(factoextra)  # provides fviz_cluster() and fviz_nbclust()
fviz_cluster(k_3, data = data2_scaled, repel = TRUE)
Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
The visualization shows three clusters. One group has lower annual income and a wide range of spending scores. One group has higher annual income but lower spending scores. The last group has both high annual income and high spending scores.
Let’s choose the right number of clusters.
fviz_nbclust(data2_scaled, kmeans, method = "wss")
fviz_nbclust(data2_scaled, kmeans, method = "silhouette")
fviz_nbclust(data2_scaled, kmeans, method = "gap_stat")
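The idea behind the elbow ("wss") method can be sketched in base R: run k-means for a range of k, record the total within-cluster sum of squares, and look for the point where the curve flattens. This is an illustration on made-up toy data, not the mall data; toy, k_values, and wss are my own names.

```r
set.seed(1234)

# Toy data (an assumption for illustration): three separated 2-D blobs.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.3), rnorm(50, mean = 3, sd = 0.3))
)

# Total within-cluster sum of squares for k = 1..8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding more clusters stops reducing WSS by much.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On the real data, replacing toy with data2_scaled would give the numbers behind the first of the three plots.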
Across the elbow method, the average silhouette method, and the gap statistic, the suggested number of clusters is 1 or 6. A single cluster would give no segmentation at all, so I use k = 6 and re-create and visualize the clusters.
k_6 <- kmeans(data2_scaled, centers = 6, nstart = 25)
fviz_cluster(
k_6,
data = data2_scaled,
main = "Clustering in Mall Customer Segmentation",
repel = TRUE)
Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
This plot shows the mall customers segmented into 6 clusters based on spending score and annual income. I repeat the clustering with a fixed seed so that the result is reproducible:
set.seed(1234)
k_clust <- kmeans(data2_scaled, centers = 6, nstart = 25)
fviz_cluster(
k_clust,
data = data2_scaled,
main = "Mall Customers Segmented by Income and Spending Score",
repel = TRUE)
Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
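To turn cluster labels into something a marketing team can act on, one option is to attach the labels to the unscaled rows and summarize each group. The sketch below uses made-up toy data and base R only; toy, fit, and profile are hypothetical names, and on the real data the label vector would be k_clust$cluster.

```r
set.seed(1234)

# Toy stand-in for the mall data (an assumption): two numeric features.
toy <- data.frame(
  income   = c(rnorm(30, 30, 5), rnorm(30, 90, 5)),
  spending = c(rnorm(30, 70, 8), rnorm(30, 20, 8))
)

# Cluster on standardized features, as in the analysis above.
fit <- kmeans(scale(toy), centers = 2, nstart = 25)

# Attach the label to the unscaled rows, then average within each cluster.
toy$cluster <- factor(fit$cluster)
profile <- aggregate(cbind(income, spending) ~ cluster, data = toy, FUN = mean)

# Add the number of customers in each cluster.
profile$n <- as.vector(table(toy$cluster))
profile
```

Each row of profile describes one segment in the original units, along with its size.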
Another algorithm I want to use is principal component analysis (PCA). PCA is handy when working with large data sets, and here I include age, annual income, and spending score as numeric variables. I would like to identify groups of similar samples and work out which variables make one group different from another.
pca <- prcomp(data2[c(3, 4, 5)], center = TRUE, scale. = TRUE)  # columns 3-5: age, income, spending score
pca
Standard deviations (1, .., p=3):
[1] 1.1523823 0.9996256 0.8202217
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
Age 0.70638235 -0.03014116 0.707188441
Annual.Income..k.. -0.04802398 -0.99883160 0.005397916
Spending.Score..1.100. -0.70619946 0.03777499 0.707004506
summary(pca)
Importance of components:
PC1 PC2 PC3
Standard deviation 1.1524 0.9996 0.8202
Proportion of Variance 0.4427 0.3331 0.2243
Cumulative Proportion 0.4427 0.7758 1.0000
I obtain 3 principal components, each of which explains a percentage of the total variation in the data set. PC1 explains 44% of the total variance, which means that nearly half of the information in the data set can be captured by that one principal component. PC2 explains 33% of the variance. So by knowing the position of a sample in relation to just PC1 and PC2, I can get a good view of where it stands in relation to other samples, since PC1 and PC2 together explain 78% of the variance.
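The proportions in the summary table follow directly from the component standard deviations: each component's variance is its squared standard deviation, divided by the total. A short check using the values printed above (sdev, prop_var, and cum_var are my own names):

```r
# Standard deviations of the components, copied from the prcomp output.
sdev <- c(PC1 = 1.1523823, PC2 = 0.9996256, PC3 = 0.8202217)

# Each component's variance is its squared standard deviation.
variance <- sdev^2

# Share of total variance, and the running total.
prop_var <- variance / sum(variance)
cum_var  <- cumsum(prop_var)

# Whole percentages, for comparison with the text.
round(100 * rbind(prop_var, cum_var))
```

The results agree with the summary(pca) table up to rounding. The variances also sum to about 3, as expected for three standardized variables.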
Let me take a look at the structure of the PCA object.
str(pca)
List of 5
$ sdev : num [1:3] 1.15 1 0.82
$ rotation: num [1:3, 1:3] 0.7064 -0.048 -0.7062 -0.0301 -0.9988 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "Age" "Annual.Income..k.." "Spending.Score..1.100."
.. ..$ : chr [1:3] "PC1" "PC2" "PC3"
$ center : Named num [1:3] 38.9 60.6 50.2
..- attr(*, "names")= chr [1:3] "Age" "Annual.Income..k.." "Spending.Score..1.100."
$ scale : Named num [1:3] 14 26.3 25.8
..- attr(*, "names")= chr [1:3] "Age" "Annual.Income..k.." "Spending.Score..1.100."
$ x : num [1:200, 1:3] -0.6142 -1.6616 0.337 -1.4529 -0.0384 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "PC1" "PC2" "PC3"
- attr(*, "class")= chr "prcomp"
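The components of this list fit together in a simple way: the scores in x are the data standardized with center and scale, then multiplied by rotation. The following sketch verifies this on made-up toy data (toy, fit, z, and scores_manual are hypothetical names; the same check should work with pca and data2[c(3, 4, 5)]).

```r
set.seed(1234)

# Toy data (an assumption for illustration): 50 rows, 3 numeric columns.
toy <- data.frame(a = rnorm(50, 40, 14),
                  b = rnorm(50, 60, 26),
                  c = rnorm(50, 50, 25))

fit <- prcomp(toy, center = TRUE, scale. = TRUE)

# Standardize with the stored center and scale, then rotate.
z <- scale(toy, center = fit$center, scale = fit$scale)
scores_manual <- z %*% fit$rotation

# The manual scores should match the ones stored in fit$x.
all.equal(unname(scores_manual), unname(fit$x))
```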
I will make a biplot that shows the position of each sample in terms of PC1 and PC2 and also maps the original variables onto the same axes.
library(ggbiplot)  # provides ggbiplot()
ggbiplot(pca)
ggbiplot(pca, labels = rownames(data2))
I can see that Age contributes strongly to PC1 with a positive loading (0.71), so higher ages move samples to the right on this plot, while spending score loads on PC1 with the opposite sign (-0.71). Next, I group the samples by gender and draw an ellipse around each group.
ggbiplot(pca, ellipse = TRUE, groups = data2$Gender)
I don’t see distinct clusters by gender in the plot, which suggests that gender doesn’t play an important role here. In this homework, I used clustering and PCA to understand the customer data. As a store manager, I would plan my marketing strategy to target the 6 groups individually.