Machine Learning - K-means Clustering

PART I: Collect the Data
PART II: Explore and Prepare the Data
- 1. Impute missing Gender variables
- 2. Convert Income from string to numeric value
PART III: Segment Customers
PART IV: Interpret the Results

PART I: Collect the Data

library(stats)
library(mice) 
library(tidyverse) 
library(factoextra)

data <- read_csv("https://s3.amazonaws.com/notredame.analytics.data/mallcustomers.csv")

PART II: Explore and Prepare the Data

1. Impute missing Gender variables

# Checking missing values for Gender variable
data <- data %>% 
  mutate(Gender = as.factor(Gender))
summary(data$Gender)

Female   Male   NA's 
    90     76     34

# 34 Gender values are missing and need to be imputed

# Create a single imputed data set (m = 1) using logistic regression
imputed_gender <- mice(data, m = 1, maxit = 5, method = "logreg", seed = 1234)


 iter imp variable
  1   1  Gender
  2   1  Gender
  3   1  Gender
  4   1  Gender
  5   1  Gender

imputed_gender$imp$Gender[1:10,]

 [1] Female Female Female Female Female Female Female Male   Female Female
Levels: Female Male

data2 <- mice::complete(imputed_gender)

2. Convert Income from string to numeric value

# Strip the " USD" suffix and the thousands separators, then convert to numeric
data3 <- data2 %>% 
  mutate(Income = as.numeric(str_replace_all(Income, " USD|,", "")))
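
To see what the pattern " USD|," does, here is a minimal base-R sketch on a few made-up strings. The sample values, and the assumption that the raw column looks like "15,000 USD", are illustrative only; they are inferred from the pattern and not taken from the data file. `gsub()` is used here as the base-R counterpart of `str_replace_all()`.

```r
# Hypothetical income strings; the real column's exact format is assumed.
raw_income <- c("15,000 USD", "137,000 USD", "54,000 USD")

# Remove the " USD" suffix and any thousands separators, then convert.
income_num <- as.numeric(gsub(" USD|,", "", raw_income))

income_num

# A value that still contained non-numeric characters would become NA
# (with a warning), so counting NAs afterwards is a cheap sanity check.
sum(is.na(income_num))
```

Running the same NA count on data3$Income after the conversion would show whether any value failed to parse.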

PART III: Segment Customers

1. Segment customers into 3 clusters

# Convert Gender into dummy variable

data3 <- data3 %>% 
  mutate(Gender = ifelse(Gender=="Female", 0, 1)) 
# 0 indicates Female, and 1 indicates Male

# Normalize the variables using z-score normalization

data3_no_id <- data3 %>% 
  select(-CustomerID)

data3_z <- scale(data3_no_id)
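
As a quick illustration of what scale() does, the sketch below reproduces the z-score for one column by hand. The toy data frame is made up for this example and merely stands in for the customer table.

```r
# Toy data frame (made-up values) standing in for the customer table.
toy <- data.frame(
  Age    = c(19, 35, 52, 64),
  Income = c(15000, 40000, 78000, 99000)
)

# scale() centres each column on its mean and divides by its sample sd.
toy_z <- scale(toy)

# The same transformation written out by hand for one column.
age_manual <- (toy$Age - mean(toy$Age)) / sd(toy$Age)

all.equal(as.numeric(toy_z[, "Age"]), age_manual)  # should be TRUE
round(colMeans(toy_z), 10)  # each column now has mean 0
apply(toy_z, 2, sd)         # and standard deviation 1
```

Putting every variable on the same scale matters here because k-means relies on Euclidean distance, so an unscaled Income column would otherwise dominate Age and SpendingScore.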

set.seed(1234)
k3 <- kmeans(data3_z, 3, nstart=25)
k3

K-means clustering with 3 clusters of sizes 68, 68, 64

Cluster means:
      Gender        Age     Income SpendingScore
1 -0.8931925 -0.6482839 -0.1137065    0.52072909
2 -0.1552608  1.1171791 -0.2900782   -0.57380427
3  1.1139816 -0.4982011  0.4290213    0.05639239

Clustering vector:
  [1] 3 1 1 1 1 1 2 1 2 1 2 1 2 1 2 3 1 1 2 1 3 1 2 1 2 3 2 3 2 1 2 1 2 3 2
 [36] 1 2 1 2 1 2 3 2 1 2 1 2 1 1 3 2 3 1 2 2 2 2 2 1 2 2 3 2 2 2 3 1 2 3 1
 [71] 2 2 2 2 2 3 1 3 1 2 2 3 2 2 1 3 2 1 1 2 2 3 3 1 1 3 2 1 3 3 3 2 2 3 2
[106] 1 2 2 2 2 2 1 1 3 3 1 2 2 2 2 3 1 1 1 1 1 3 3 2 3 2 3 1 1 3 1 2 3 3 1
[141] 2 3 1 1 3 1 3 1 1 3 3 3 3 1 2 1 3 1 3 1 2 3 3 3 3 1 3 1 3 3 3 3 3 3 2
[176] 1 2 3 2 3 1 3 3 1 1 3 2 3 2 1 1 1 3 1 2 1 3 3 3 3

Within cluster sum of squares by cluster:
[1] 135.6445 169.6290 166.6704
 (between_SS / total_SS =  40.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"

k3$size

[1] 68 68 64

k3$centers

      Gender        Age     Income SpendingScore
1 -0.8931925 -0.6482839 -0.1137065    0.52072909
2 -0.1552608  1.1171791 -0.2900782   -0.57380427
3  1.1139816 -0.4982011  0.4290213    0.05639239

fviz_cluster(k3, geom = "point", data = data3_z) + ggtitle("k = 3")
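
The between_SS / total_SS figure in the printed output (40.7 % for k = 3) is the share of total variation accounted for by the cluster assignment. For the z-scored data here, the total sum of squares is (200 - 1) * 4 = 796 and the three within-cluster values add up to about 471.94, so (796 - 471.94) / 796 is roughly 40.7 %. The sketch below shows the same decomposition on synthetic data; the two-blob data set is an assumption made for illustration and is not the mall data.

```r
set.seed(1234)
# Synthetic, well-separated data (not the mall data): two blobs in 2-D.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)
toy_z <- scale(toy)

km <- kmeans(toy_z, centers = 2, nstart = 25)

# Total sum of squares splits into within-cluster and between-cluster parts.
all.equal(km$totss, km$tot.withinss + km$betweenss)

# Share of total variation explained by the cluster assignment,
# the same quantity that print() reports as between_SS / total_SS.
explained <- km$betweenss / km$totss
round(100 * explained, 1)
```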

3. Using the Elbow Method to determine new K value

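# Total within-cluster sum of squares (WCSS) for k = 1 to 20.
# Each fit uses a single random start (no nstart), so the curve is not
# guaranteed to decrease at every step.
n <- 20
wcss <- numeric(n)
set.seed(1234)
for (k in 1:n) {
  wcss[k] <- sum(kmeans(data3_z, k)$withinss)
}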

wcss

 [1] 796.00000 585.63139 472.10785 379.41309 351.08214 317.41366 234.09948
 [8] 208.22430 168.40494 163.57368 137.32837 147.70344 134.74454 120.69258
[15] 109.38226 107.81616  92.64071  82.83343  91.45414  78.39783

tibble(value = wcss) %>%
  ggplot(mapping=aes(x=seq(1,length(wcss)), y=value)) +
  geom_point()+
  geom_line() +
  labs(title = "The Elbow Method", y = "WCSS", x = "Number of Clusters (k)" ) +
  theme_minimal()

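Beyond k = 4, the largest single drop in WCSS occurs between k = 6 (317.4) and k = 7 (234.1); after that the decreases are smaller and irregular. We therefore take k = 7 as the elbow.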
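
The WCSS values printed above are not strictly decreasing (for example, 137.3 at k = 11 but 147.7 at k = 12), which is what one would expect when each kmeans() call uses a single random start. The following is a sketch, on synthetic data rather than the mall data, of how the same curve could be computed with nstart = 25 so that each k keeps the best of several starts. The three-blob data set and the choice of iter.max = 50 are assumptions made for illustration.

```r
set.seed(1234)
# Synthetic data (not the mall data): three blobs of 50 points in 2-D.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
toy_z <- scale(toy)

# tot.withinss is the sum of the per-cluster withinss values;
# nstart = 25 keeps the best of 25 random starts for each k.
ks <- 1:10
wcss <- sapply(ks, function(k) {
  kmeans(toy_z, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# With one cluster, WCSS equals the total sum of squares, which for
# z-scored data is (n - 1) * p.
all.equal(wcss[1], (nrow(toy_z) - 1) * ncol(toy_z))

# Table of k against WCSS; this could be passed to ggplot() as above.
data.frame(k = ks, wcss = round(wcss, 1))
```

The same identity explains the first value in the output above: with 200 customers and 4 scaled variables, WCSS at k = 1 is (200 - 1) * 4 = 796.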

4. Create a new visualization for the clustering result with K = 7

set.seed(1234)
k7 <- kmeans(data3_z, 7, nstart=25)
k7

K-means clustering with 7 clusters of sizes 19, 47, 24, 37, 25, 25, 23

Cluster means:
      Gender        Age     Income SpendingScore
1  1.1139816 -0.4300874  1.0587747     1.2395880
2 -0.8931925  0.8256119 -0.3080906    -0.5037466
3 -0.8931925 -0.4038703  0.9257665     0.9764999
4 -0.8931925 -0.9295675 -0.8610049     0.4088039
5  0.8731207  0.0193285  1.1193722    -1.3228250
6  1.1139816  1.4138442 -0.4705932    -0.4399090
7  1.1139816 -0.9728057 -0.5311804     0.2448055

Clustering vector:
  [1] 7 4 4 4 4 4 2 4 6 4 6 4 2 4 2 7 4 4 6 4 7 4 2 4 2 7 2 7 2 4 6 4 6 7 2
 [36] 4 2 4 2 4 2 7 6 4 2 4 2 4 4 7 2 7 4 6 2 6 2 6 4 6 6 7 2 2 2 7 2 2 7 4
 [71] 6 6 2 2 6 7 2 7 4 2 6 7 6 2 4 6 2 4 4 2 2 7 6 2 4 7 2 4 6 7 7 2 6 7 2
[106] 4 2 6 6 6 6 4 2 7 7 4 2 2 2 2 7 2 3 3 4 3 5 1 6 1 5 1 4 3 5 3 2 1 5 3
[141] 2 1 3 3 5 3 5 3 2 1 5 1 5 3 2 3 5 3 5 3 2 1 5 1 5 3 5 3 5 1 5 1 5 1 2
[176] 3 5 1 5 1 3 1 5 3 3 1 2 1 5 3 5 3 5 3 5 3 5 1 5 1

Within cluster sum of squares by cluster:
[1] 12.83533 62.54858 19.60042 44.30267 39.09549 26.97428 21.82940
 (between_SS / total_SS =  71.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"

k7$centers

      Gender        Age     Income SpendingScore
1  1.1139816 -0.4300874  1.0587747     1.2395880
2 -0.8931925  0.8256119 -0.3080906    -0.5037466
3 -0.8931925 -0.4038703  0.9257665     0.9764999
4 -0.8931925 -0.9295675 -0.8610049     0.4088039
5  0.8731207  0.0193285  1.1193722    -1.3228250
6  1.1139816  1.4138442 -0.4705932    -0.4399090
7  1.1139816 -0.9728057 -0.5311804     0.2448055

fviz_cluster(k7, geom = "point", data = data3_z) + ggtitle("k = 7")

PART IV: Interpret the Results

1. Assign a label to each cluster

# Add cluster variable to the complete data frame
data3$Cluster <- k7$cluster

final_data <- data3
  
agg_mean <- aggregate(final_data,by=list(final_data$Cluster),FUN=mean, na.rm=TRUE)
agg_mean

  Group.1 CustomerID Gender      Age   Income SpendingScore Cluster
1       1  164.52632   1.00 32.84211 88368.42      82.21053       1
2       2   81.87234   0.00 50.38298 52468.09      37.19149       2
3       3  159.33333   0.00 33.20833 84875.00      75.41667       3
4       4   49.62162   0.00 25.86486 37945.95      60.75676       4
5       5  166.20000   0.88 39.12000 89960.00      16.04000       5
6       6   70.60000   1.00 58.60000 48200.00      38.84000       6
7       7   67.21739   1.00 25.26087 46608.70      56.52174       7
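
The cluster centers reported by kmeans() are in z-score units, while the aggregate() table above is in the original units. The two are linked by x = z * sd + mean. The sketch below checks this on synthetic data; the toy data frame and its values are made up for illustration, and the attribute names "scaled:center" and "scaled:scale" are the ones scale() attaches to its result.

```r
set.seed(1234)
# Synthetic stand-in for the customer table (made-up values).
toy <- data.frame(
  Age    = c(rnorm(50, 25, 3), rnorm(50, 55, 3)),
  Income = c(rnorm(50, 40000, 5000), rnorm(50, 90000, 5000))
)
toy_z <- scale(toy)
km <- kmeans(toy_z, centers = 2, nstart = 25)

# scale() stores the column means and sds it used as attributes.
mu    <- attr(toy_z, "scaled:center")
sigma <- attr(toy_z, "scaled:scale")

# Undo the z-score for every centre: x = z * sd + mean.
centers_orig <- sweep(sweep(km$centers, 2, sigma, "*"), 2, mu, "+")

# Per-cluster means computed directly on the unscaled data,
# in the same way as the aggregate() call above.
agg <- aggregate(toy, by = list(Cluster = km$cluster), FUN = mean)

centers_orig
agg
```

Both tables should agree up to floating-point error, which is why interpreting the clusters from the aggregate() output is equivalent to interpreting the unscaled centers.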

- Cluster 1: Younger men. Make good money. Spend a lot.
- Cluster 2: Middle-aged women. Make average money. Spend little.
- Cluster 3: Younger women. Make good money. Spend a lot.
- Cluster 4: Young women. Make little money. Spend average amount.
- Cluster 5: Middle-aged (mostly men). Make good money. Spend VERY little.
- Cluster 6: Older men. Make average money. Spend little.
- Cluster 7: Young men. Make average money. Spend average amount.

2. How do the average age and gender distribution of each cluster compare to those of the overall data set?

mean(final_data$Age)

[1] 38.85

mean(final_data$Gender)

[1] 0.445

agg_mean

  Group.1 CustomerID Gender      Age   Income SpendingScore Cluster
1       1  164.52632   1.00 32.84211 88368.42      82.21053       1
2       2   81.87234   0.00 50.38298 52468.09      37.19149       2
3       3  159.33333   0.00 33.20833 84875.00      75.41667       3
4       4   49.62162   0.00 25.86486 37945.95      60.75676       4
5       5  166.20000   0.88 39.12000 89960.00      16.04000       5
6       6   70.60000   1.00 58.60000 48200.00      38.84000       6
7       7   67.21739   1.00 25.26087 46608.70      56.52174       7

Overall, the mean age is 38.85. Because Gender is coded 0 = Female and 1 = Male, the overall Gender mean of 0.445 means that 44.5% of customers are men.

- Cluster 1: Below average in age (32.8). All men.
- Cluster 2: Above average in age (50.4). All women.
- Cluster 3: Below average in age (33.2). All women.
- Cluster 4: Far below average in age (25.9). All women.
- Cluster 5: About average in age (39.1). Mostly men (88%).
- Cluster 6: Far above average in age (58.6). All men.
- Cluster 7: Far below average in age (25.3). All men.
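
The comparison can also be made numerically. The sketch below copies the cluster sizes from the k7 output and the Age and Gender means from the agg_mean table, then computes each cluster's gap from the overall figures. The column names Size, ShareMale, AgeGap, and MaleGap are introduced here for illustration only.

```r
# Values copied from the printed k7 sizes and the agg_mean table.
cluster_summary <- data.frame(
  Cluster   = 1:7,
  Size      = c(19, 47, 24, 37, 25, 25, 23),
  Age       = c(32.84211, 50.38298, 33.20833, 25.86486,
                39.12000, 58.60000, 25.26087),
  ShareMale = c(1.00, 0.00, 0.00, 0.00, 0.88, 1.00, 1.00)
)

# Size-weighted averages of the cluster means recover the overall means.
overall_age  <- weighted.mean(cluster_summary$Age, cluster_summary$Size)
overall_male <- weighted.mean(cluster_summary$ShareMale, cluster_summary$Size)

# Signed gaps: positive means older / more male than the data set overall.
cluster_summary$AgeGap  <- round(cluster_summary$Age - overall_age, 2)
cluster_summary$MaleGap <- round(cluster_summary$ShareMale - overall_male, 3)

round(overall_age, 2)   # 38.85, matching mean(final_data$Age)
round(overall_male, 3)  # 0.445, matching mean(final_data$Gender)
cluster_summary
```

The weighted averages double as a consistency check: if they did not reproduce 38.85 and 0.445, the cluster sizes or means would have been copied incorrectly.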

3. Based on the results of your work, what recommendations would you make to Acme Holdings?

Based purely on spending scores, Acme Holdings can take one of two marketing approaches: it can cater to customers who already spend heavily, or it can try to raise the spending of low spenders. We suggest the former, which would mean marketing to clusters 1, 3, 4, and 7, the four clusters with the highest mean spending scores (roughly 57 to 82). These clusters consist of men and women with mean ages between about 25 and 33. Assuming that people in this age range are buying homes and starting families, Acme Holdings should consider campaigns built around home improvement and products for new families.