Data 622 - HW4

Zhenni Xie

2022-05-21

Homework 4 Clustering and PCA on Mall Customer Data

The data set I chose is from Kaggle and contains basic customer data: customer ID, age, gender, annual income, and spending score. As the store manager, I want to understand customers' preferences so that our marketing team can plan its strategy accordingly. I use clustering to segment the customers and identify the target groups with which to start a marketing strategy.

Install Packages and Load Libraries
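
The package-loading chunk is not shown in the rendered output. The following is a minimal sketch of what it might look like. The package list is an assumption inferred from the calls used later: dplyr for glimpse(), select(), and %>%; ggplot2 for ggplot(); factoextra for fviz_cluster() and fviz_nbclust(); and ggbiplot for ggbiplot(). Depending on your setup, ggbiplot may need to be installed from GitHub rather than CRAN.

```r
# Packages inferred from the function calls in this report (an assumption,
# not the original chunk).
pkgs <- c("dplyr", "ggplot2", "factoextra", "ggbiplot")

# Check which packages are available, without installing anything.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```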

Load Data

data2 <- read.csv("https://raw.githubusercontent.com/JennierJ/CUNY_DATA_622/main/Mall_Customers.csv", header = TRUE)
glimpse(data2)
Rows: 200
Columns: 5
$ CustomerID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ~
$ Gender                 <chr> "Male", "Male", "Female", "Female", "Female", "~
$ Age                    <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,~
$ Annual.Income..k..     <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,~
$ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15~
head(data2)
  CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
1          1   Male  19                 15                     39
2          2   Male  21                 15                     81
3          3 Female  20                 16                      6
4          4 Female  23                 16                     77
5          5 Female  31                 17                     40
6          6 Female  22                 17                     76

The data set has 200 rows, one for each customer.

Data Distribution

Let's look at how each variable is distributed.

Annual Income Distribution

hist(data2$Annual.Income..k..,
     main = "Annual income distribution",
     xlab = "Annual Income",
     col = "darkmagenta")


Most annual incomes fall between about 40k and 80k.

Age Distribution

hist(data2$Age,
     main = "Age distribution",
     xlab = "Age",
     col = "blue")


Customer ages span a wide range.

Spending Score Distribution

hist(data2$Spending.Score..1.100.,
     main = "Spending Score Distribution",
     xlab = "Spending Score",
     col = "red")


Spending scores are most frequent between 40 and 60.

Gender Analysis

counts <- table(data2$Gender)
barplot(counts, main="Gender Distribution",
        xlab="Gender", col = "darkblue")


There are more female than male customers in the data.

Cluster Analysis

I use the numeric variables for the k-means clustering analysis.

Scatterplot

ggplot(data2, aes(x= Annual.Income..k.., y= Spending.Score..1.100., color=Gender)) + 
  geom_point(size=3)


Both genders appear scattered across the full range of spending score and annual income. I want to limit the segmentation to two features: spending score and annual income. Here are the summary statistics for these two features:

data2 %>%
  select(Annual.Income..k.., Spending.Score..1.100.) %>%
  summary()
 Annual.Income..k.. Spending.Score..1.100.
 Min.   : 15.00     Min.   : 1.00         
 1st Qu.: 41.50     1st Qu.:34.75         
 Median : 61.50     Median :50.00         
 Mean   : 60.56     Mean   :50.20         
 3rd Qu.: 78.00     3rd Qu.:73.00         
 Max.   :137.00     Max.   :99.00         


We can see that the two features have different ranges, so I need to standardize the values before building a model.

data2_scaled <- data2 %>%
  select(Annual.Income..k.., Spending.Score..1.100.) %>%
  scale()


The new data set now holds the standardized values for the two features I intend to use for segmentation.

summary(data2_scaled)
 Annual.Income..k.. Spending.Score..1.100.
 Min.   :-1.73465   Min.   :-1.905240     
 1st Qu.:-0.72569   1st Qu.:-0.598292     
 Median : 0.03579   Median :-0.007745     
 Mean   : 0.00000   Mean   : 0.000000     
 3rd Qu.: 0.66401   3rd Qu.: 0.882916     
 Max.   : 2.91037   Max.   : 1.889750     
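
To make the standardization step concrete, here is a small sketch of what scale() does with its default arguments: it subtracts the column mean and divides by the sample standard deviation. The income values are made up for illustration and are not taken from the data set.

```r
# Toy income values in thousands (made up for illustration).
income <- c(15, 40, 60, 80, 137)

# scale() with its defaults centers on the mean and divides by the sample sd.
z_scale  <- as.numeric(scale(income))
z_manual <- (income - mean(income)) / sd(income)

all.equal(z_scale, z_manual)   # both routes give the same z-scores
round(mean(z_manual), 10)      # the mean is 0 after centering
sd(z_manual)                   # the standard deviation is 1 after scaling
```

This is why the scaled summary above shows a mean of 0 for both features.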


I am ready to cluster the data. I start with k = 3 and 25 random initial configurations (nstart = 25).

set.seed(1234)
k_3 <- kmeans(data2_scaled, centers = 3, nstart = 25)
k_3$size
[1] 123  39  38


The three clusters contain 123, 39, and 38 observations, respectively.

k_3$centers
  Annual.Income..k.. Spending.Score..1.100.
1         -0.6246222            -0.01435636
2          0.9891010             1.23640011
3          1.0066735            -1.22246770


These are the centers of the three clusters, expressed in standardized units.
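
Standardized centers are hard to read in business terms. The sketch below shows one way to convert centers back to the original units, using the attributes that scale() stores on its result. The toy data frame and its column names are hypothetical stand-ins, not the mall data.

```r
set.seed(1234)
# Toy stand-in for the two features (values are made up).
toy <- data.frame(
  income = c(rnorm(30, 40, 8), rnorm(30, 90, 8)),
  score  = c(rnorm(30, 50, 10), rnorm(30, 80, 10))
)
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

# scale() stores the column means and standard deviations as attributes.
mu    <- attr(toy_scaled, "scaled:center")
sigma <- attr(toy_scaled, "scaled:scale")

# Undo the standardization column by column: original = z * sd + mean.
centers_original <- sweep(sweep(fit$centers, 2, sigma, `*`), 2, mu, `+`)
centers_original
```

The same two attribute lookups should work on data2_scaled, since it was also produced by scale().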

Visualize the Clusters

fviz_cluster(k_3, data = data2_scaled, repel = TRUE)
Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
increasing max.overlaps


The visualization shows three clusters. One group has lower annual income and a wide range of spending scores. One group has higher annual income but lower spending scores. The last group has both high annual income and high spending scores.

Next, I check which number of clusters fits the data best.

fviz_nbclust(data2_scaled, kmeans, method = "wss")

fviz_nbclust(data2_scaled, kmeans, method = "silhouette")

fviz_nbclust(data2_scaled, kmeans, method = "gap_stat")
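
As a sketch of what the "wss" (elbow) method measures, the total within-cluster sum of squares can be computed by hand for a range of k values. The toy matrix below is made up and has three well-separated groups; it only illustrates the calculation and says nothing about the mall data.

```r
set.seed(1234)
# Toy data with three well-separated groups (made up for illustration).
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 5))
))

# Total within-cluster sum of squares for k = 1..8.
ks  <- 1:8
wss <- vapply(
  ks,
  function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss,
  numeric(1)
)

# The "elbow" is where adding clusters stops reducing WSS by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```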


Across the elbow method, the average silhouette method, and the gap statistic, the suggested number of clusters is either 1 or 6. A single cluster would give no segmentation at all, so I use k = 6 to re-create and visualize the clusters.

k_6 <- kmeans(data2_scaled, centers = 6, nstart = 25)
fviz_cluster(
  k_6,
  data = data2_scaled,
  main = "Clustering in Mall Customer Segmentation",
  repel = TRUE)
Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
increasing max.overlaps


The plot shows the mall customers segmented into 6 clusters based on spending score and annual income. Refitting with a fixed seed makes the cluster assignment reproducible:

set.seed(1234)
k_clust <- kmeans(data2_scaled, centers = 6, nstart = 25)
fviz_cluster(
  k_clust,
  data = data2_scaled,
  main = "Mall Customers Segmented by Income and Spending Score",
  repel = TRUE)
Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
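
To turn clusters into marketing segments, it helps to profile each one in the original units. The following is a minimal sketch on made-up data with hypothetical column names; the same pattern could be applied to data2 with the cluster labels stored in k_clust$cluster.

```r
set.seed(1234)
# Toy customers in three groups (values are made up for illustration).
toy <- data.frame(
  income = c(rnorm(20, 30, 5), rnorm(20, 60, 5), rnorm(20, 90, 5)),
  score  = c(rnorm(20, 20, 5), rnorm(20, 50, 5), rnorm(20, 80, 5))
)
fit <- kmeans(scale(toy), centers = 3, nstart = 25)

# Attach each row's cluster label, then average the raw features per cluster.
toy$cluster <- factor(fit$cluster)
profile <- aggregate(cbind(income, score) ~ cluster, data = toy, FUN = mean)

# Add the number of customers in each cluster.
profile$n <- as.vector(table(toy$cluster))
profile
```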

Principal Components Analysis


Another technique I want to use is principal component analysis (PCA). PCA is handy when working with large data sets, and here I include age, annual income, and spending score as numeric variables. I would like to identify groups of similar samples and work out which variables make one group different from another.

pca <- prcomp(data2[c(3,4,5)], scale = TRUE, center = TRUE)
pca
Standard deviations (1, .., p=3):
[1] 1.1523823 0.9996256 0.8202217

Rotation (n x k) = (3 x 3):
                               PC1         PC2         PC3
Age                     0.70638235 -0.03014116 0.707188441
Annual.Income..k..     -0.04802398 -0.99883160 0.005397916
Spending.Score..1.100. -0.70619946  0.03777499 0.707004506
summary(pca)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.1524 0.9996 0.8202
Proportion of Variance 0.4427 0.3331 0.2243
Cumulative Proportion  0.4427 0.7758 1.0000


I obtain 3 principal components, each of which explains a percentage of the total variation in the data set. PC1 explains 44% of the total variance, so nearly half of the information in the data set is captured by that one component. PC2 explains 33% of the variance. Together, PC1 and PC2 explain 78% of the variance, so knowing a sample's position on just these two components gives a good view of where it stands relative to other samples.
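
The percentages follow directly from the standard deviations printed by prcomp(): each component's variance is its squared standard deviation, and its share is that variance divided by the total. A short sketch using the three values reported above:

```r
# Standard deviations reported by prcomp() above.
sdev <- c(1.1523823, 0.9996256, 0.8202217)

# Each component's variance is its squared standard deviation.
variance   <- sdev^2
proportion <- variance / sum(variance)

round(proportion, 4)          # compare with the "Proportion of Variance" row
round(cumsum(proportion), 4)  # cumulative share; agrees with the summary up to rounding
```

Because the three variables were standardized, the variances sum to approximately 3, the number of variables.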
Let me take a look at the structure of the PCA object.

str(pca)
List of 5
 $ sdev    : num [1:3] 1.15 1 0.82
 $ rotation: num [1:3, 1:3] 0.7064 -0.048 -0.7062 -0.0301 -0.9988 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "Age" "Annual.Income..k.." "Spending.Score..1.100."
  .. ..$ : chr [1:3] "PC1" "PC2" "PC3"
 $ center  : Named num [1:3] 38.9 60.6 50.2
  ..- attr(*, "names")= chr [1:3] "Age" "Annual.Income..k.." "Spending.Score..1.100."
 $ scale   : Named num [1:3] 14 26.3 25.8
  ..- attr(*, "names")= chr [1:3] "Age" "Annual.Income..k.." "Spending.Score..1.100."
 $ x       : num [1:200, 1:3] -0.6142 -1.6616 0.337 -1.4529 -0.0384 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "PC1" "PC2" "PC3"
 - attr(*, "class")= chr "prcomp"

Plot PCA


I will make a biplot that shows both the position of each sample in terms of PC1 and PC2 and how the original variables map onto these components.

ggbiplot(pca)

ggbiplot(pca, labels = rownames(data2))


I can see that Age contributes to PC1, with higher values of this variable moving the samples to the right on the plot. Next, I group the samples by gender and draw an ellipse around each group.

ggbiplot(pca, ellipse = TRUE, groups = data2$Gender)


I don't see distinct clusters by gender in the plot, which suggests that gender does not play an important role here. In this homework, I used clustering and PCA to understand the customer data. As a store manager, I would plan my marketing strategy to target the 6 groups individually.