La segmentación o clusters es un conjutno de técnicas cuyo propósito es formar grupos a partir de un conjunto de elementos.
Más información: R for Data Science (2ed).
#install.packages('clusters')
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.1
#install.packages('ggplot2')
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.1
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
df <- read.csv('/Users/kikepablos/Documents/Development/escuela/concentracion_ai/modulo_6/data_sources/wine.csv')
summary(df)
## Alcohol Malic_Acid Ash Ash_Alcanity
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_Intensity Hue OD280
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
#df <- wine_data[, c("Alcohol", "Malic_Acid", "Ash", "Color_Intensity", "Ash_Alcanity", "Magnesium", "Total_Phenols", "Flavanoids", "Nonflavanoid_Phenols", "Proanthocyanins", "Hue", "OD280", "Proline")]
df <- scale(df)
grupos <- 3
segmentos <- kmeans(df, grupos, nstart = 1)
segmentos
## K-means clustering with 3 clusters of sizes 51, 62, 65
##
## Cluster means:
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue
## 1 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
## 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
## 3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
## OD280 Proline
## 1 -1.2887761 -0.4059428
## 2 0.7770551 1.1220202
## 3 0.2700025 -0.7517257
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
## [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
asignacion <- cbind(df, cluster = segmentos$cluster)
fviz_cluster(segmentos, data = df)
La cantidad óptima de grupos corresponde al punto más alto de la siguiente gráfica
set.seed(123)
optimizacion <- clusGap(df, FUN=kmeans, nstart = 1, K.max = 10)
plot(optimizacion, xlab='Número de clusters k')
promedio <- aggregate(asignacion, by=list(segmentos$cluster), FUN=mean)
promedio
## Group.1 Alcohol Malic_Acid Ash Ash_Alcanity Magnesium
## 1 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047
## 2 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208
## 3 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869
## Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
## 1 -0.97657548 -1.21182921 0.72402116 -0.77751312
## 2 0.88274724 0.97506900 -0.56050853 0.57865427
## 3 -0.07576891 0.02075402 -0.03343924 0.05810161
## Color_Intensity Hue OD280 Proline cluster
## 1 0.9388902 -1.1615122 -1.2887761 -0.4059428 1
## 2 0.1705823 0.4726504 0.7770551 1.1220202 2
## 3 -0.8993770 0.4605046 0.2700025 -0.7517257 3
La segmentación o clusters es un algoritmo útil para las empresas que desean clasificar sus clientes y dirigir campañas de MKT más enfocadas y especializdas.