Cluster Analysis: K-Prototypes

In this post I introduce K-Prototypes, an extension of the K-Means algorithm. The previous post dealt with customer segmentation, where we also need to consider the "Gender" variable. This is a categorical variable, and K-Means only handles numeric attributes. To segment customers more effectively we need K-Prototypes, which works on mixed data sets containing both numeric and categorical attributes. K-Prototypes represents each cluster by a prototype object rather than by a centroid as K-Means does. Each data object is assigned to the cluster whose prototype it is most similar to.

Initially, k prototypes are chosen at random or based on domain expertise. Next, each data object is assigned in turn to the cluster whose prototype it is most similar to, and the prototype of that cluster is updated after each assignment. Once all objects have been assigned, we check each object again: any object that is not in its best-fitting cluster is moved to the appropriate one, and the prototypes of the two clusters involved are updated. This check is repeated until every object sits in its correct cluster.

A prototype has the same form as a data object, that is, it is represented as a vector, and it is determined as follows:

Each numeric attribute value is the arithmetic mean of the corresponding numeric attribute over the objects in the cluster.
Each categorical attribute value is the most frequent value (the mode) of the corresponding categorical attribute over the objects in the cluster.

For example, consider a cluster whose data objects are customer records with the attributes age (years), spending score (1…100), and gender (Male or Female):

X1 = (30, 77, Male)
X2 = (35, 80, Female)
X3 = (27, 55, Male)
X4 = (41, 85, Male)

The prototype of this cluster is then: Prototype = (133/4, 297/4, Male) = (33.25, 74.25, Male)
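
The prototype in the example above can be reproduced with a few lines of base R. This is only a minimal sketch: the column names and the helper `mode_value()` are made up for illustration and are not part of any package.

```r
# Toy cluster from the example above (column names are illustrative).
cluster <- data.frame(
  age    = c(30, 35, 27, 41),
  score  = c(77, 80, 55, 85),
  gender = factor(c("Male", "Female", "Male", "Male"))
)

# Hypothetical helper: most frequent value of a categorical vector
# (on ties, the first value in table order wins).
mode_value <- function(x) names(which.max(table(x)))

# Prototype: mean for numeric columns, mode for categorical columns.
prototype <- lapply(cluster, function(col) {
  if (is.numeric(col)) mean(col) else mode_value(col)
})

str(prototype)
```

The numeric entries are 133/4 and 297/4, and the categorical entry is "Male" because it occurs in three of the four objects.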

K-Prototypes is a partitional clustering algorithm that uses a criterion function E and represents each cluster by a prototype.

Input: the initial data set X and the number of clusters k.
Output: k prototypes for which the criterion function is minimized (in practice, a local minimum).
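
The post does not spell out the criterion function E. A common formulation, following Huang's original description of k-prototypes, is sketched below. The symbols are assumptions introduced here: y_{il} is 1 if object x_i belongs to cluster l and 0 otherwise, q_l is the prototype of cluster l, \delta(a, b) is 0 if a = b and 1 otherwise, and \lambda weights the categorical part against the numeric part.

```latex
E = \sum_{l=1}^{k} \sum_{i=1}^{n} y_{il} \, d(x_i, q_l),
\qquad
d(x_i, q_l) = \sum_{j \in \mathrm{num}} (x_{ij} - q_{lj})^2
            + \lambda \sum_{j \in \mathrm{cat}} \delta(x_{ij}, q_{lj})
```

With \lambda = 0 the categorical attributes are ignored and the criterion reduces to the K-Means within-cluster sum of squares.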

Steps:

Step 1: Initialize k prototypes for X; each one acts as the representative center of one cluster.
Step 2: Assign each object in X to the cluster whose prototype it is closest to, updating that cluster's prototype after each assignment.
Step 3: Once all objects have been assigned, re-check the similarity of each object to all prototypes. If an object is most similar to a prototype other than that of its current cluster, move it to the cluster of the closest prototype and update the prototypes of both clusters.
Step 4: Repeat Step 3 until a full pass over all objects changes no assignment.
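
The steps above can be sketched in base R. This is a simplified illustration, not the implementation used by clustMixType: the function names (`mixed_dist`, `mode_value`, `kproto_sketch`) are hypothetical, the dissimilarity is assumed to be squared Euclidean distance plus lambda times the number of categorical mismatches, and prototypes are updated once per pass rather than after every single assignment.

```r
# Assumed mixed dissimilarity: squared Euclidean distance on numeric
# attributes plus lambda times the number of categorical mismatches.
mixed_dist <- function(x_num, x_cat, p_num, p_cat, lambda) {
  sum((x_num - p_num)^2) + lambda * sum(x_cat != p_cat)
}

# Most frequent value of a categorical vector.
mode_value <- function(x) names(which.max(table(x)))

# X_num: numeric matrix, X_cat: character matrix, one row per object.
kproto_sketch <- function(X_num, X_cat, k, lambda, iter_max = 100) {
  n <- nrow(X_num)
  # Step 1: use k randomly chosen objects as initial prototypes.
  init  <- sample(n, k)
  p_num <- X_num[init, , drop = FALSE]
  p_cat <- X_cat[init, , drop = FALSE]
  cluster <- rep(NA_integer_, n)

  for (iter in seq_len(iter_max)) {
    # Steps 2-3: (re)assign every object to its closest prototype.
    new_cluster <- vapply(seq_len(n), function(i) {
      d <- vapply(seq_len(k), function(l) {
        mixed_dist(X_num[i, ], X_cat[i, ], p_num[l, ], p_cat[l, ], lambda)
      }, numeric(1))
      which.min(d)
    }, integer(1))

    # Step 4: stop when a full pass changes no assignment.
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster

    # Batch update: means for numeric, modes for categorical attributes.
    for (l in seq_len(k)) {
      members <- cluster == l
      if (!any(members)) next  # keep the old prototype of an empty cluster
      p_num[l, ] <- colMeans(X_num[members, , drop = FALSE])
      p_cat[l, ] <- apply(X_cat[members, , drop = FALSE], 2, mode_value)
    }
  }
  list(cluster = cluster, num = p_num, cat = p_cat)
}

# Toy data: two well-separated groups of three customers each.
set.seed(1)
X_num <- cbind(age   = c(20, 22, 21, 60, 62, 61),
               score = c(80, 85, 82, 20, 25, 22))
X_cat <- cbind(gender = c("Male", "Male", "Female", "Female", "Female", "Male"))
fit <- kproto_sketch(X_num, X_cat, k = 2, lambda = 1)
fit$cluster
```

On this toy data the first three and the last three customers end up in separate clusters. The value lambda = 1 is arbitrary; choosing it is discussed by the package's own heuristic.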

library(clustMixType)
library(tidyverse)
library(broom)
library(dplyr)


df <- read.csv('https://raw.githubusercontent.com/SteffiPeTaffy/machineLearningAZ/master/Machine%20Learning%20A-Z%20Template%20Folder/Part%204%20-%20Clustering/Section%2025%20-%20Hierarchical%20Clustering/Mall_Customers.csv')

head(df)

##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Rename the variables to keep the code tidy:

df <- rename(df, 
             Id = CustomerID, 
             Gender = Genre, 
             Income = Annual.Income..k.., 
             SpendingScore = Spending.Score..1.100.)
head(df)

##   Id Gender Age Income SpendingScore
## 1  1   Male  19     15            39
## 2  2   Male  21     15            81
## 3  3 Female  20     16             6
## 4  4 Female  23     16            77
## 5  5 Female  31     17            40
## 6  6 Female  22     17            76

df$Gender <- as.factor(df$Gender)
str(df)

## 'data.frame':    200 obs. of  5 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender       : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 1 2 1 ...
##  $ Age          : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Income       : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ SpendingScore: int  39 81 6 77 40 76 6 94 3 72 ...

# Scatter plot of Age vs. SpendingScore with a smoother, one panel per Gender
ggplot(df, aes(x = Age, 
               y = SpendingScore)) + 
  geom_point(color = "blue") + 
  geom_smooth() +
  labs(title = "Age ~ SpendingScore", 
       x = "Age", 
       y = "Spending Score") + 
  facet_grid(~Gender)

## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

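# Fit k-prototypes with k = 3 and 25 random starts.
# Note: every column of df, including Id, is used as a feature here.
df_kproto <- kproto(df, 
                    k = 3, 
                    iter.max = 100, 
                    nstart = 25)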

## # NAs in variables:
##            Id        Gender           Age        Income SpendingScore 
##             0             0             0             0             0 
## 0 observation(s) with NAs.
## 
## Estimated lambda: 2486.72 
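## 
## The estimated lambda above weights categorical mismatches against
## numeric distances. As far as I understand the clustMixType default, it is
## the mean variance of the numeric columns divided by the mean
## concentration 1 - sum(p^2) of the categorical columns. The base R sketch
## below illustrates that heuristic; `estimate_lambda()` and the toy data
## frame are hypothetical names introduced for this example.

```r
# Assumed heuristic: mean numeric variance / mean categorical concentration.
estimate_lambda <- function(data) {
  is_num <- vapply(data, is.numeric, logical(1))
  # Variance of each numeric column.
  v_num <- vapply(data[is_num], var, numeric(1))
  # Concentration 1 - sum(p^2) of each categorical column.
  v_cat <- vapply(data[!is_num], function(x) {
    p <- table(x) / length(x)
    1 - sum(p^2)
  }, numeric(1))
  mean(v_num) / mean(v_cat)
}

# Toy data: one numeric and one balanced two-level categorical column.
toy <- data.frame(x = c(1, 2, 3, 4),
                  g = factor(c("a", "a", "b", "b")))
estimate_lambda(toy)
```

## Under this heuristic a high-variance numeric column, such as an
## identifier left in the data, raises the average variance and with it
## lambda.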

df_kproto$centers

##          Id Gender      Age   Income SpendingScore
## 1  37.07246 Female 37.97101 33.14493      49.81159
## 2 167.30645 Female 36.50000 89.70968      49.95161
## 3 103.89855   Male 41.84058 61.78261      50.81159

df_kproto$size

## clusters
##  1  2  3 
## 69 62 69

# Cluster labels as a factor (k = 3, so the levels are 1 to 3)
fit <- factor(df_kproto$cluster, 
              levels = 1:3)

fit_df <- data.frame(df, fit)

result <- data.frame(df_kproto$size, df_kproto$centers)

result

##   clusters Freq        Id Gender      Age   Income SpendingScore
## 1        1   69  37.07246 Female 37.97101 33.14493      49.81159
## 2        2   62 167.30645 Female 36.50000 89.70968      49.95161
## 3        3   69 103.89855   Male 41.84058 61.78261      50.81159

ggplot(fit_df, 
       aes(SpendingScore, 
           Income, color = fit)) + 
  geom_point() + 
  labs(title = "Income ~ SpendingScore",
       x = "SpendingScore", 
       y = "Income") + 
  facet_grid(~Gender) +
  guides(color = guide_legend(title = "Cluster"))

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

fig <- plot_ly(fit_df,
               x = ~Income, 
               y = ~Age, 
               z = ~SpendingScore, 
               color = ~fit, 
               colors = c('#BF382A', '#0C4B8E', "#BBBBBB", "#00FF00")) %>%
  add_markers() %>% 
  layout(scene = list(
    xaxis = list(title = "Annual Income"), 
    yaxis = list(title = "Age"), 
    zaxis = list(title = "Spending Score")))
fig

summary(df_kproto)

## Id 
##   Min. 1st Qu. Median      Mean 3rd Qu. Max.
## 1    1   18.00   35.0  37.07246   52.00   85
## 2  122  154.25  169.5 167.30645  184.75  200
## 3   54   87.00  104.0 103.89855  121.00  152
## 
## -----------------------------------------------------------------
## Gender 
##        
## cluster Female  Male
##       1  0.725 0.275
##       2  0.661 0.339
##       3  0.304 0.696
## 
## -----------------------------------------------------------------
## Age 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1   18   24.00   35.0 37.97101   49.00   68
## 2   19   30.25   34.5 36.50000   40.75   59
## 3   18   26.00   40.0 41.84058   54.00   70
## 
## -----------------------------------------------------------------
## Income 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1   15      21     33 33.14493   42.00   54
## 2   67      78     87 89.70968   98.75  137
## 3   43      57     62 61.78261   67.00   78
## 
## -----------------------------------------------------------------
## SpendingScore 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1    3    35.0     50 49.81159   73.00   99
## 2    1    18.5     49 49.95161   78.75   97
## 3    5    43.0     50 50.81159   56.00   97
## 
## -----------------------------------------------------------------

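# Compare k = 3 and k = 4 by silhouette index.
# Note: fit_df still contains Id and the cluster label fit, so both enter as predictors.
val <- validation_kproto(method = "silhouette", data = fit_df, k = 3:4, nstart = 5)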

## # NAs in variables:
##            Id        Gender           Age        Income SpendingScore 
##             0             0             0             0             0 
##           fit 
##             0 
## 0 observation(s) with NAs.
## 
## Estimated lambda: 2115.317 

val

## $k_opt
## [1] 3
## 
## $index_opt
##         3 
## 0.5745268 
## 
## $indices
##         3         4 
## 0.5745268 0.5559994 
## 
## $kp_obj
## Numeric predictors: 4 
## Categorical predictors: 2 
## Lambda: 2115.317 
## 
## Number of Clusters: 3 
## Cluster sizes: 69 69 62 
## Within cluster error: 146208.7 139785.5 157706 
## 
## Cluster prototypes:
##          Id Gender      Age   Income SpendingScore fit
## 1  37.07246 Female 37.97101 33.14493      49.81159   1
## 2 103.89855   Male 41.84058 61.78261      50.81159   3
## 3 167.30645 Female 36.50000 89.70968      49.95161   2