Contexto

La segmentación o clusters es un conjutno de técnicas cuyo propósito es formar grupos a partir de un conjunto de elementos.

Más información: R for Data Science (2ed).

Paso 1. Instalar paquetes y llamar librerías

#install.packages('clusters')
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.1
#install.packages('ggplot2')
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.1
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Paso 2. Obtener los datos

df <- read.csv('/Users/kikepablos/Documents/Development/escuela/concentracion_ai/modulo_6/data_sources/wine.csv')

Paso 3. Exploración inicial de datos

summary(df)
##     Alcohol        Malic_Acid         Ash         Ash_Alcanity  
##  Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##  Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##  3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##  Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##    Magnesium      Total_Phenols     Flavanoids    Nonflavanoid_Phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color_Intensity       Hue             OD280      
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

Paso 4. Procesamiento de datos

Suponiendo que se seleccionan las variables de interés

#df <- wine_data[, c("Alcohol", "Malic_Acid", "Ash", "Color_Intensity", "Ash_Alcanity", "Magnesium", "Total_Phenols", "Flavanoids", "Nonflavanoid_Phenols", "Proanthocyanins", "Hue", "OD280", "Proline")]

Paso 5. Cantidad de grupos

df <- scale(df)
grupos <- 3

Paso 6. Generar los segmentos

segmentos <- kmeans(df, grupos, nstart = 1)
segmentos
## K-means clustering with 3 clusters of sizes 51, 62, 65
## 
## Cluster means:
##      Alcohol Malic_Acid        Ash Ash_Alcanity   Magnesium Total_Phenols
## 1  0.1644436  0.8690954  0.1863726    0.5228924 -0.07526047   -0.97657548
## 2  0.8328826 -0.3029551  0.3636801   -0.6084749  0.57596208    0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257    0.1701220 -0.49032869   -0.07576891
##    Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity        Hue
## 1 -1.21182921           0.72402116     -0.77751312       0.9388902 -1.1615122
## 2  0.97506900          -0.56050853      0.57865427       0.1705823  0.4726504
## 3  0.02075402          -0.03343924      0.05810161      -0.8993770  0.4605046
##        OD280    Proline
## 1 -1.2887761 -0.4059428
## 2  0.7770551  1.1220202
## 3  0.2700025 -0.7517257
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
##  [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
##  (between_SS / total_SS =  44.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Paso 7. Asignar el grupo al que pertenece cada observación

asignacion <- cbind(df, cluster = segmentos$cluster)

Paso 8. Graficar los clusters

fviz_cluster(segmentos, data = df)

Paso 9. Optimizar la cantidad de grupos

La cantidad óptima de grupos corresponde al punto más alto de la siguiente gráfica

set.seed(123)
optimizacion <- clusGap(df, FUN=kmeans, nstart = 1, K.max = 10)
plot(optimizacion, xlab='Número de clusters k')

Paso 11. Comparación entre segmentos

promedio <- aggregate(asignacion, by=list(segmentos$cluster), FUN=mean)
promedio
##   Group.1    Alcohol Malic_Acid        Ash Ash_Alcanity   Magnesium
## 1       1  0.1644436  0.8690954  0.1863726    0.5228924 -0.07526047
## 2       2  0.8328826 -0.3029551  0.3636801   -0.6084749  0.57596208
## 3       3 -0.9234669 -0.3929331 -0.4931257    0.1701220 -0.49032869
##   Total_Phenols  Flavanoids Nonflavanoid_Phenols Proanthocyanins
## 1   -0.97657548 -1.21182921           0.72402116     -0.77751312
## 2    0.88274724  0.97506900          -0.56050853      0.57865427
## 3   -0.07576891  0.02075402          -0.03343924      0.05810161
##   Color_Intensity        Hue      OD280    Proline cluster
## 1       0.9388902 -1.1615122 -1.2887761 -0.4059428       1
## 2       0.1705823  0.4726504  0.7770551  1.1220202       2
## 3      -0.8993770  0.4605046  0.2700025 -0.7517257       3

Conclusión

La segmentación o clusters es un algoritmo útil para las empresas que desean clasificar sus clientes y dirigir campañas de MKT más enfocadas y especializdas.

LS0tCnRpdGxlOiAiV2luZSAoQ2x1c3RlcikgU2VnbWVudGFjacOzbiIKYXV0aG9yOiAiRW5yaXF1ZSBQYWJsb3MgQTAwODM1MDM3IgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDoKICBodG1sX2RvY3VtZW50OgogICAgdG9jOiB0cnVlCiAgICB0b2NfZmxvYXQ6IHRydWUKICAgIGNvZGVfZG93bmxvYWQ6IHRydWUKICAgIHRoZW1lOiAic3BhY2VsYWIiCiAgICBoaWdobGlndGg6ICJrYXRlIgotLS0KCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUpCmBgYAoKIyBDb250ZXh0bwpMYSBzZWdtZW50YWNpw7NuIG8gY2x1c3RlcnMgZXMgdW4gY29uanV0bm8gZGUgdMOpY25pY2FzIGN1eW8gcHJvcMOzc2l0byBlcyBmb3JtYXIgZ3J1cG9zIGEgcGFydGlyIGRlIHVuIGNvbmp1bnRvIGRlIGVsZW1lbnRvcy4KCjxwPk3DoXMgaW5mb3JtYWNpw7NuOiA8YSBocmVmPSJodHRwczovL3I0ZHMuaGFkLmNvLm56L2luZGV4Lmh0bWwiPlIgZm9yIERhdGEgU2NpZW5jZSAoMmVkKTwvYT4uPC9wPiAgCgojIFBhc28gMS4gSW5zdGFsYXIgcGFxdWV0ZXMgeSBsbGFtYXIgbGlicmVyw61hcwpgYGB7cn0KI2luc3RhbGwucGFja2FnZXMoJ2NsdXN0ZXJzJykKbGlicmFyeShjbHVzdGVyKQojaW5zdGFsbC5wYWNrYWdlcygnZ2dwbG90MicpCmxpYnJhcnkoZ2dwbG90MikKbGlicmFyeShkYXRhLnRhYmxlKQpsaWJyYXJ5KGZhY3RvZXh0cmEpCmBgYAoKIyBQYXNvIDIuIE9idGVuZXIgbG9zIGRhdG9zCmBgYHtyfQpkZiA8LSByZWFkLmNzdignL1VzZXJzL2tpa2VwYWJsb3MvRG9jdW1lbnRzL0RldmVsb3BtZW50L2VzY3VlbGEvY29uY2VudHJhY2lvbl9haS9tb2R1bG9fNi9kYXRhX3NvdXJjZXMvd2luZS5jc3YnKQpgYGAKCiMgUGFzbyAzLiBFeHBsb3JhY2nDs24gaW5pY2lhbCBkZSBkYXRvcwpgYGB7cn0Kc3VtbWFyeShkZikKYGBgCiMgUGFzbyA0LiBQcm9jZXNhbWllbnRvIGRlIGRhdG9zCiMgU3Vwb25pZW5kbyBxdWUgc2Ugc2VsZWNjaW9uYW4gbGFzIHZhcmlhYmxlcyBkZSBpbnRlcsOpcwpgYGB7cn0KI2RmIDwtIHdpbmVfZGF0YVssIGMoIkFsY29ob2wiLCAiTWFsaWNfQWNpZCIsICJBc2giLCAiQ29sb3JfSW50ZW5zaXR5IiwgIkFzaF9BbGNhbml0eSIsICJNYWduZXNpdW0iLCAiVG90YWxfUGhlbm9scyIsICJGbGF2YW5vaWRzIiwgIk5vbmZsYXZhbm9pZF9QaGVub2xzIiwgIlByb2FudGhvY3lhbmlucyIsICJIdWUiLCAiT0QyODAiLCAiUHJvbGluZSIpXQpgYGAKCiMgUGFzbyA1LiBDYW50aWRhZCBkZSBncnVwb3MKYGBge3J9CmRmIDwtIHNjYWxlKGRmKQpncnVwb3MgPC0gMwpgYGAKCiMgUGFzbyA2LiBHZW5lcmFyIGxvcyBzZWdtZW50b3MKYGBge3J9CnNlZ21lbnRvcyA8LSBrbWVhbnMoZGYsIGdydXBvcywgbnN0YXJ0ID0gMSkKc2VnbWVudG9zCmBgYAoKCiMgUGFzbyA3LiBBc2lnbmFyIGVsIGdydXBvIGFsIHF1ZSBwZXJ0ZW5lY2UgY2FkYSBvYnNlcnZhY2nDs24KYGBge3J9CmFzaWduYWNpb24gPC0gY2JpbmQoZGYsIGNsdXN0ZXIgPSBzZWdtZW50b3MkY2x1c3RlcikKCmBgYAoKCgoKIyBQYXNvIDguIEdyYWZpY2FyIGxvcyBjbHVzdGVycwpgYGB7cn0KZnZpel9jbHVzdGVyKHNlZ21lbnRvcywgZGF0YSA9IGRmKQpgYGAKCiMgUGFzbyA5LiBPcHRpbWl6YXIgbGEgY2FudGlkYWQgZGUgZ3J1cG9zCkxhIGNhbnRpZGFkIMOzcHRpbWEgZGUgZ3J1cG9zIGNvcnJlc3BvbmRlIGFsIHB1bnRvIG3DoXMgYWx0byBkZSBsYSBzaWd1aWVudGUgZ3LDoWZpY2EKYGBge3J9CnNldC5zZWVkKDEyMykKb3B0aW1pemFjaW9uIDwtIGNsdXNHYXAoZGYsIEZVTj1rbWVhbnMsIG5zdGFydCA9IDEsIEsubWF4ID0gMTApCnBsb3Qob3B0aW1pemFjaW9uLCB4bGFiPSdOw7ptZXJvIGRlIGNsdXN0ZXJzIGsnKQpgYGAKCiMgUGFzbyAxMS4gQ29tcGFyYWNpw7NuIGVudHJlIHNlZ21lbnRvcwpgYGB7cn0KcHJvbWVkaW8gPC0gYWdncmVnYXRlKGFzaWduYWNpb24sIGJ5PWxpc3Qoc2VnbWVudG9zJGNsdXN0ZXIpLCBGVU49bWVhbikKcHJvbWVkaW8KYGBgCgoKIyBDb25jbHVzacOzbgpMYSBzZWdtZW50YWNpw7NuIG8gY2x1c3RlcnMgZXMgdW4gYWxnb3JpdG1vIMO6dGlsIHBhcmEgbGFzIGVtcHJlc2FzIHF1ZSBkZXNlYW4gY2xhc2lmaWNhciBzdXMgY2xpZW50ZXMgeSBkaXJpZ2lyIGNhbXBhw7FhcyBkZSBNS1QgbcOhcyBlbmZvY2FkYXMgeSBlc3BlY2lhbGl6ZGFzLgo=