Machine Learning - K-means Clustering
PART I: Collect the Data
PART II: Explore and Prepare the Data
1. Impute missing Gender variables
# Checking missing values for Gender variable
data <- data %>%
  mutate(Gender = as.factor(Gender))
summary(data$Gender)
Female   Male   NA's
    90     76     34
# We have 34 missing values that need to be imputed
imputed_gender <- mice(data, m = 1, maxit = 5, method = 'logreg', seed = 1234)
iter imp variable
1 1 Gender
2 1 Gender
3 1 Gender
4 1 Gender
5 1 Gender
[1] Female Female Female Female Female Female Female Male Female Female
Levels: Female Male
2. Convert Income from string to numeric value
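The conversion code for this step is not shown in the report. As a minimal sketch only, assuming Income is stored as text with currency symbols, thousands separators, or stray spaces (the values and the names income_raw and income_num below are hypothetical), base R can strip the non-numeric characters before converting:

```r
# Hypothetical raw values; the actual Income format in the data set is not shown
income_raw <- c("$88,368", "52,468", " 37946 ", NA)

# Remove everything except digits, decimal points and minus signs, then convert
income_num <- as.numeric(gsub("[^0-9.-]", "", income_raw))

income_num
```

Missing values stay NA after the conversion, so they can still be counted or imputed afterwards.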
PART III: Segment Customers
1. Segment customers into 3 clusters
# Convert Gender into dummy variable
data3 <- data3 %>%
  mutate(Gender = ifelse(Gender == "Female", 0, 1))
# 0 indicates Female, and 1 indicates Male
# Normalize the variables using z-score normalization
data3_no_id <- data3 %>%
  select(-CustomerID)
data3_z <- scale(data3_no_id)
set.seed(1234)
k3 <- kmeans(data3_z, 3, nstart=25)
k3
K-means clustering with 3 clusters of sizes 68, 68, 64
Cluster means:
Gender Age Income SpendingScore
1 -0.8931925 -0.6482839 -0.1137065 0.52072909
2 -0.1552608 1.1171791 -0.2900782 -0.57380427
3 1.1139816 -0.4982011 0.4290213 0.05639239
Clustering vector:
[1] 3 1 1 1 1 1 2 1 2 1 2 1 2 1 2 3 1 1 2 1 3 1 2 1 2 3 2 3 2 1 2 1 2 3 2
[36] 1 2 1 2 1 2 3 2 1 2 1 2 1 1 3 2 3 1 2 2 2 2 2 1 2 2 3 2 2 2 3 1 2 3 1
[71] 2 2 2 2 2 3 1 3 1 2 2 3 2 2 1 3 2 1 1 2 2 3 3 1 1 3 2 1 3 3 3 2 2 3 2
[106] 1 2 2 2 2 2 1 1 3 3 1 2 2 2 2 3 1 1 1 1 1 3 3 2 3 2 3 1 1 3 1 2 3 3 1
[141] 2 3 1 1 3 1 3 1 1 3 3 3 3 1 2 1 3 1 3 1 2 3 3 3 3 1 3 1 3 3 3 3 3 3 2
[176] 1 2 3 2 3 1 3 3 1 1 3 2 3 2 1 1 1 3 1 2 1 3 3 3 3
Within cluster sum of squares by cluster:
[1] 135.6445 169.6290 166.6704
(between_SS / total_SS = 40.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
[1] 68 68 64
Gender Age Income SpendingScore
1 -0.8931925 -0.6482839 -0.1137065 0.52072909
2 -0.1552608 1.1171791 -0.2900782 -0.57380427
3 1.1139816 -0.4982011 0.4290213 0.05639239
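The cluster centers above are in z-score units, which makes them hard to read as ages or incomes. The sketch below, on made-up toy data (toy, centers_z and the other names are illustrative assumptions, not objects from this report), shows what scale() computes and how centers in z-score units could be mapped back to the original units using the attributes that scale() stores:

```r
# Toy data standing in for data3_no_id (values are made up for illustration)
toy <- data.frame(Age = c(20, 30, 40, 50),
                  Income = c(30000, 50000, 70000, 90000))

# scale() subtracts each column mean and divides by the column standard deviation
toy_z <- scale(toy)
manual_z <- sapply(toy, function(x) (x - mean(x)) / sd(x))
all.equal(toy_z, manual_z, check.attributes = FALSE)

# Hypothetical cluster centers in z-score units (one row per cluster)
centers_z <- rbind(c(-1, -1), c(1, 1))

# Undo the scaling: multiply by the column sd, then add the column mean
mu <- attr(toy_z, "scaled:center")
sigma <- attr(toy_z, "scaled:scale")
centers_orig <- sweep(sweep(centers_z, 2, sigma, `*`), 2, mu, `+`)
centers_orig
```

Applied to a real kmeans() result, the same two sweep() calls would take the centers component in place of centers_z.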
3. Use the Elbow Method to determine a new K value
wcss <- vector()
n <- 20
set.seed(1234)
for (k in 1:n) {
  wcss[k] <- sum(kmeans(data3_z, k)$withinss)
}
wcss
 [1] 796.00000 585.63139 472.10785 379.41309 351.08214 317.41366 234.09948
[8] 208.22430 168.40494 163.57368 137.32837 147.70344 134.74454 120.69258
[15] 109.38226 107.81616 92.64071 82.83343 91.45414 78.39783
tibble(value = wcss) %>%
  ggplot(mapping = aes(x = seq_along(wcss), y = value)) +
  geom_point() +
  geom_line() +
  labs(title = "The Elbow Method", y = "WCSS", x = "Number of Clusters (k)") +
  theme_minimal()
Based on the elbow plot, we take K = 7 as the new value: WCSS falls from 317.41 at k = 6 to 234.10 at k = 7, the largest single drop after k = 4.
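The WCSS values above are not strictly decreasing (for example, 137.33 at k = 11 is followed by 147.70 at k = 12), which can happen when kmeans() runs from a single random start for each k. A possible variation, sketched here on the built-in iris measurements as a stand-in for data3_z (an assumption for illustration only), is to pass nstart so that the best of several random starts is kept for each k:

```r
# Built-in iris measurements stand in for data3_z (an assumption for illustration)
x <- scale(iris[, 1:4])

set.seed(1234)
max_k <- 10

# tot.withinss is the total WCSS; nstart = 25 keeps the best of 25 random starts
wcss <- sapply(1:max_k, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

round(wcss, 2)
```

For z-scored data the value at k = 1 equals (rows - 1) x (columns), which matches the 796 reported above for 200 customers and 4 variables. Multiple starts usually give a smoother curve, but they do not guarantee a clearer elbow.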
4. Create a new visualization for the clustering result with K = 7
K-means clustering with 7 clusters of sizes 19, 47, 24, 37, 25, 25, 23
Cluster means:
Gender Age Income SpendingScore
1 1.1139816 -0.4300874 1.0587747 1.2395880
2 -0.8931925 0.8256119 -0.3080906 -0.5037466
3 -0.8931925 -0.4038703 0.9257665 0.9764999
4 -0.8931925 -0.9295675 -0.8610049 0.4088039
5 0.8731207 0.0193285 1.1193722 -1.3228250
6 1.1139816 1.4138442 -0.4705932 -0.4399090
7 1.1139816 -0.9728057 -0.5311804 0.2448055
Clustering vector:
[1] 7 4 4 4 4 4 2 4 6 4 6 4 2 4 2 7 4 4 6 4 7 4 2 4 2 7 2 7 2 4 6 4 6 7 2
[36] 4 2 4 2 4 2 7 6 4 2 4 2 4 4 7 2 7 4 6 2 6 2 6 4 6 6 7 2 2 2 7 2 2 7 4
[71] 6 6 2 2 6 7 2 7 4 2 6 7 6 2 4 6 2 4 4 2 2 7 6 2 4 7 2 4 6 7 7 2 6 7 2
[106] 4 2 6 6 6 6 4 2 7 7 4 2 2 2 2 7 2 3 3 4 3 5 1 6 1 5 1 4 3 5 3 2 1 5 3
[141] 2 1 3 3 5 3 5 3 2 1 5 1 5 3 2 3 5 3 5 3 2 1 5 1 5 3 5 3 5 1 5 1 5 1 2
[176] 3 5 1 5 1 3 1 5 3 3 1 2 1 5 3 5 3 5 3 5 3 5 1 5 1
Within cluster sum of squares by cluster:
[1] 12.83533 62.54858 19.60042 44.30267 39.09549 26.97428 21.82940
(between_SS / total_SS = 71.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
Gender Age Income SpendingScore
1 1.1139816 -0.4300874 1.0587747 1.2395880
2 -0.8931925 0.8256119 -0.3080906 -0.5037466
3 -0.8931925 -0.4038703 0.9257665 0.9764999
4 -0.8931925 -0.9295675 -0.8610049 0.4088039
5 0.8731207 0.0193285 1.1193722 -1.3228250
6 1.1139816 1.4138442 -0.4705932 -0.4399090
7 1.1139816 -0.9728057 -0.5311804 0.2448055
PART IV: Interpret the Results
1. Assign label to each cluster
# Add cluster variable to the complete data frame
data3$Cluster <- k7$cluster
final_data <- data3
agg_mean <- aggregate(final_data, by = list(final_data$Cluster), FUN = mean, na.rm = TRUE)
agg_mean
  Group.1 CustomerID Gender Age Income SpendingScore Cluster
1 1 164.52632 1.00 32.84211 88368.42 82.21053 1
2 2 81.87234 0.00 50.38298 52468.09 37.19149 2
3 3 159.33333 0.00 33.20833 84875.00 75.41667 3
4 4 49.62162 0.00 25.86486 37945.95 60.75676 4
5 5 166.20000 0.88 39.12000 89960.00 16.04000 5
6 6 70.60000 1.00 58.60000 48200.00 38.84000 6
7 7 67.21739 1.00 25.26087 46608.70 56.52174 7
Cluster 1: Younger men. Make good money. Spend a lot.
Cluster 2: Middle-aged women. Make average money. Spend little.
Cluster 3: Younger women. Make good money. Spend a lot.
Cluster 4: Young women. Make little money. Spend average amount.
Cluster 5: Middle-aged (mostly men). Make good money. Spend VERY little.
Cluster 6: Older men. Make average money. Spend little.
Cluster 7: Young men. Make average money. Spend average amount.
2. How do the average age and gender distribution of each cluster compare to those of the overall data set?
[1] 38.85
[1] 0.445
Group.1 CustomerID Gender Age Income SpendingScore Cluster
1 1 164.52632 1.00 32.84211 88368.42 82.21053 1
2 2 81.87234 0.00 50.38298 52468.09 37.19149 2
3 3 159.33333 0.00 33.20833 84875.00 75.41667 3
4 4 49.62162 0.00 25.86486 37945.95 60.75676 4
5 5 166.20000 0.88 39.12000 89960.00 16.04000 5
6 6 70.60000 1.00 58.60000 48200.00 38.84000 6
7 7 67.21739 1.00 25.26087 46608.70 56.52174 7
Overall, the mean age is 38.85 and the mean of Gender is 0.445, meaning 44.5% of customers are men (Gender is coded 0 = Female, 1 = Male).
Cluster 1: Below average in age. Share of men above the overall 44.5% (solely men).
Cluster 2: Above average in age. Share of men below the overall 44.5% (solely women).
Cluster 3: Below average in age. Share of men below the overall 44.5% (solely women).
Cluster 4: Far below average in age. Share of men below the overall 44.5% (solely women).
Cluster 5: About average in age. Share of men above the overall 44.5% (88% men).
Cluster 6: Far above average in age. Share of men above the overall 44.5% (solely men).
Cluster 7: Far below average in age. Share of men above the overall 44.5% (solely men).
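The comparison above can also be expressed as numeric differences between each cluster's means and the overall means. The following is a minimal sketch on made-up toy data (toy, cluster_means, overall and diffs are illustrative names, not objects from this report), using only base R:

```r
# Toy data standing in for final_data (values are made up for illustration)
toy <- data.frame(
  Cluster = c(1, 1, 2, 2),
  Gender  = c(1, 1, 0, 0),   # 0 = Female, 1 = Male, as in the report
  Age     = c(30, 34, 48, 52)
)

# Per-cluster means of the variables of interest only
cluster_means <- aggregate(cbind(Gender, Age) ~ Cluster, data = toy, FUN = mean)

# Overall means, then each cluster's difference from them
overall <- colMeans(toy[, c("Gender", "Age")])
diffs <- sweep(as.matrix(cluster_means[, c("Gender", "Age")]), 2, overall, `-`)
cbind(Cluster = cluster_means$Cluster, diffs)
```

A positive Gender difference means a higher share of men than in the full data set, and a positive Age difference means an older cluster. Using the formula interface also avoids averaging identifier columns such as CustomerID.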
3. Based on the results of your work, what recommendations would you make to Acme Holdings?
Based purely on spending scores, Acme Holdings can take one of two marketing approaches: focus on high spenders, or work on raising the spending of low spenders. We suggest the former, which means marketing to clusters 1, 3, 4, and 7, the four clusters with the highest average spending scores. These clusters include both men and women, with average ages between roughly 25 and 33. Assuming that people in this age range are buying homes and starting families, Acme Holdings should consider marketing campaigns built around home improvement and starting a family.