Agrupamiento o clustering es una técnica de aprendizaje automático no supervisado que agrupa datos en función de su similitud.
Algunos usos típicos de esta técnica son:
#install.packages("cluster") # Análisis de Agrupamiento
library(cluster)
#install.packages("ggplot2") # Graficar
library(ggplot2)
#install.packages("data.table") # Manejo de muchos datos
library(data.table)
#install.packages("factoextra") # Gráfica optimización de número de clusters
library(factoextra)
library(readxl)
library(dplyr)
library(janitor)
library(scales)
df_raw <- read_excel("/Users/edu_sssedu/Downloads/supermarket.xlsx", sheet = "supermercado")
## Warning: Expecting numeric in A522063 / R522063C1: got 'A563185'
## Warning: Expecting numeric in A522064 / R522064C1: got 'A563186'
## Warning: Expecting numeric in A522065 / R522065C1: got 'A563187'
# Limpieza de nombres de columnas (minúsculas y guiones bajos)
df_raw <- clean_names(df_raw)
# Vista inicial
head(df_raw)
## # A tibble: 6 × 8
## bill_no itemname quantity date time price
## <dbl> <chr> <dbl> <dttm> <dttm> <dbl>
## 1 536365 WHITE HANGING … 6 2010-12-01 00:00:00 1899-12-31 08:26:00 2.55
## 2 536365 WHITE METAL LA… 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 3 536365 CREAM CUPID HE… 8 2010-12-01 00:00:00 1899-12-31 08:26:00 2.75
## 4 536365 KNITTED UNION … 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 5 536365 RED WOOLLY HOT… 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 6 536365 SET 7 BABUSHKA… 2 2010-12-01 00:00:00 1899-12-31 08:26:00 7.65
## # ℹ 2 more variables: customer_id <dbl>, country <chr>
str(df_raw)
## tibble [522,064 × 8] (S3: tbl_df/tbl/data.frame)
## $ bill_no : num [1:522064] 536365 536365 536365 536365 536365 ...
## $ itemname : chr [1:522064] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ quantity : num [1:522064] 6 6 8 6 6 2 6 6 6 32 ...
## $ date : POSIXct[1:522064], format: "2010-12-01" "2010-12-01" ...
## $ time : POSIXct[1:522064], format: "1899-12-31 08:26:00" "1899-12-31 08:26:00" ...
## $ price : num [1:522064] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ customer_id: num [1:522064] 17850 17850 17850 17850 17850 ...
## $ country : chr [1:522064] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
# Ver nombres reales de columnas
names(df_raw)
## [1] "bill_no" "itemname" "quantity" "date" "time"
## [6] "price" "customer_id" "country"
summary(df_raw)
## bill_no itemname quantity
## Min. :536365 Length:522064 Min. :-9600.00
## 1st Qu.:547892 Class :character 1st Qu.: 1.00
## Median :560603 Mode :character Median : 3.00
## Mean :559951 Mean : 10.09
## 3rd Qu.:571892 3rd Qu.: 10.00
## Max. :581587 Max. :80995.00
## NA's :3
## date time
## Min. :2010-12-01 00:00:00 Min. :1899-12-31 06:20:00
## 1st Qu.:2011-03-28 00:00:00 1st Qu.:1899-12-31 11:48:00
## Median :2011-07-20 00:00:00 Median :1899-12-31 13:37:00
## Mean :2011-07-03 23:15:13 Mean :1899-12-31 13:36:07
## 3rd Qu.:2011-10-19 00:00:00 3rd Qu.:1899-12-31 15:30:00
## Max. :2011-12-09 00:00:00 Max. :1899-12-31 20:18:00
##
## price customer_id country
## Min. :-11062.060 Min. :12346 Length:522064
## 1st Qu.: 1.250 1st Qu.:13950 Class :character
## Median : 2.080 Median :15265 Mode :character
## Mean : 3.827 Mean :15317
## 3rd Qu.: 4.130 3rd Qu.:16837
## Max. : 13541.330 Max. :18287
## NA's :134041
num_cols <- df_raw %>% select(where(is.numeric))
names(num_cols)
## [1] "bill_no" "quantity" "price" "customer_id"
summary(num_cols)
## bill_no quantity price customer_id
## Min. :536365 Min. :-9600.00 Min. :-11062.060 Min. :12346
## 1st Qu.:547892 1st Qu.: 1.00 1st Qu.: 1.250 1st Qu.:13950
## Median :560603 Median : 3.00 Median : 2.080 Median :15265
## Mean :559951 Mean : 10.09 Mean : 3.827 Mean :15317
## 3rd Qu.:571892 3rd Qu.: 10.00 3rd Qu.: 4.130 3rd Qu.:16837
## Max. :581587 Max. :80995.00 Max. : 13541.330 Max. :18287
## NA's :3 NA's :134041
desc_stats <- num_cols %>%
summarise(across(
everything(),
list(
n = ~sum(!is.na(.)),
media = ~mean(., na.rm = TRUE),
mediana = ~median(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
min = ~min(., na.rm = TRUE),
max = ~max(., na.rm = TRUE)
),
.names = "{.col}_{.fn}"
))
desc_stats
## # A tibble: 1 × 24
## bill_no_n bill_no_media bill_no_mediana bill_no_sd bill_no_min bill_no_max
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 522061 559951. 560603 13453. 536365 581587
## # ℹ 18 more variables: quantity_n <int>, quantity_media <dbl>,
## # quantity_mediana <dbl>, quantity_sd <dbl>, quantity_min <dbl>,
## # quantity_max <dbl>, price_n <int>, price_media <dbl>, price_mediana <dbl>,
## # price_sd <dbl>, price_min <dbl>, price_max <dbl>, customer_id_n <int>,
## # customer_id_media <dbl>, customer_id_mediana <dbl>, customer_id_sd <dbl>,
## # customer_id_min <dbl>, customer_id_max <dbl>
for (col in names(num_cols)) {
print(
ggplot(df_raw, aes(x = .data[[col]])) +
geom_histogram(bins = 30) +
labs(title = paste("Histograma de", col), x = col, y = "Frecuencia")
)
}
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 134041 rows containing non-finite outside the scale range
## (`stat_bin()`).
na_por_columna <- sapply(df_raw, function(x) sum(is.na(x)))
na_por_columna
## bill_no itemname quantity date time price
## 3 1455 0 0 0 0
## customer_id country
## 134041 0
n_antes <- nrow(df_raw)
df_raw <- df_raw %>% distinct()
n_despues <- nrow(df_raw)
cat("Filas antes de eliminar duplicados:", n_antes, "\n")
## Filas antes de eliminar duplicados: 522064
cat("Filas después de eliminar duplicados:", n_despues, "\n")
## Filas después de eliminar duplicados: 516718
cat("Duplicados eliminados:", n_antes - n_despues, "\n")
## Duplicados eliminados: 5346
# Mostrar nombres:
names(df_raw)
## [1] "bill_no" "itemname" "quantity" "date" "time"
## [6] "price" "customer_id" "country"
# Dataset limpio para clustering (sin NA en columnas clave y con valores válidos)
df_clean <- df_raw %>%
filter(!is.na(customer_id), !is.na(quantity), !is.na(price)) %>%
filter(quantity > 0, price > 0)
cat("Filas originales:", nrow(df_raw), "\n")
## Filas originales: 516718
cat("Filas limpias para análisis:", nrow(df_clean), "\n")
## Filas limpias para análisis: 382737
head(df_clean)
## # A tibble: 6 × 8
## bill_no itemname quantity date time price
## <dbl> <chr> <dbl> <dttm> <dttm> <dbl>
## 1 536365 WHITE HANGING … 6 2010-12-01 00:00:00 1899-12-31 08:26:00 2.55
## 2 536365 WHITE METAL LA… 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 3 536365 CREAM CUPID HE… 8 2010-12-01 00:00:00 1899-12-31 08:26:00 2.75
## 4 536365 KNITTED UNION … 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 5 536365 RED WOOLLY HOT… 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 6 536365 SET 7 BABUSHKA… 2 2010-12-01 00:00:00 1899-12-31 08:26:00 7.65
## # ℹ 2 more variables: customer_id <dbl>, country <chr>
df_clientes <- df_clean %>%
group_by(customer_id) %>%
summarise(
frecuencia_compras = n(),
cantidad_total = sum(quantity, na.rm = TRUE),
gasto_total = sum(quantity * price, na.rm = TRUE),
ticket_promedio = mean(quantity * price, na.rm = TRUE),
.groups = "drop"
)
vars_cluster <- df_clientes %>%
select(frecuencia_compras, cantidad_total, gasto_total, ticket_promedio)
vars_cluster_scaled <- scale(vars_cluster)
head(vars_cluster_scaled)
## frecuencia_compras cantidad_total gasto_total ticket_promedio
## [1,] -0.42180306 15.0785957 8.75820212 52.27981658
## [2,] 0.44486820 0.2684172 0.27045289 -0.03036610
## [3,] -0.07704985 -0.1086638 -0.02683660 -0.03009851
## [4,] -0.34519124 -0.1982386 -0.19259403 -0.03308515
## [5,] -0.01959099 -0.1282712 0.06034168 -0.02643300
## [6,] -0.40743835 -0.2347702 -0.22117631 -0.03133645
fviz_nbclust(vars_cluster_scaled, kmeans, method = "wss") +
labs(title = "Método del codo (WSS)")
fviz_nbclust(vars_cluster_scaled, kmeans, method = "silhouette") +
labs(title = "Método Silhouette")
set.seed(123)
gap_stat <- clusGap(vars_cluster_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 214800)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
fviz_gap_stat(gap_stat) +
labs(title = "Gap Statistic")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
set.seed(123)
k_final <- 4
modelo_kmeans <- kmeans(vars_cluster_scaled, centers = k_final, nstart = 50)
modelo_kmeans
## K-means clustering with 4 clusters of sizes 2, 14, 323, 3957
##
## Cluster means:
## frecuencia_compras cantidad_total gasto_total ticket_promedio
## 1 -0.4170148 15.7784761 14.0745121 45.15254668
## 2 9.3602272 11.8282224 12.0179838 0.06936930
## 3 1.4909428 0.8066301 0.7915682 0.01650780
## 4 -0.1546080 -0.1156668 -0.1142475 -0.02441453
##
## Clustering vector:
## [1] 1 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [38] 4 4 4 3 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4
## [75] 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 3 4 3 3 4 4 4 3 4 3 4 4 3 4 4 4 4 4 4 4 4 4
## [112] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 3 4 4 4 4 4 4
## [149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4
## [186] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4
## [223] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 3 3 4 4 4 4 4
## [260] 4 4 4 4 4 4 3 4 4 4 3 4 4 3 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4
## [297] 4 4 4 4 4 4 3 4 2 4 4 4 3 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [334] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 3 4
## [371] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [408] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [445] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4
## [482] 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4
## [519] 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 3 4 4 4 2 4 4 4 4 4 4 3 3 4 4 3 4 4 4 4
## [556] 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4
## [593] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4
## [630] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [667] 4 3 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [704] 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [741] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [778] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 3 4
## [815] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [852] 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [889] 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [926] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [963] 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1000] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 3
## [1037] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1074] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4
## [1111] 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1148] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3
## [1185] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4
## [1222] 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 3 3 4 3 3 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4
## [1259] 3 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1296] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 3 4 3
## [1333] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1370] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [1407] 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1444] 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4
## [1481] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1518] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1555] 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4
## [1592] 3 3 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4
## [1629] 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4
## [1666] 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 3 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 3 3
## [1703] 4 3 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1740] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4
## [1777] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4
## [1814] 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4
## [1851] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1888] 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 3 3
## [1925] 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3
## [1962] 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1999] 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4
## [2036] 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2073] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2110] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 2 4 4 4 4 4 4 4
## [2147] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4
## [2184] 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2221] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2258] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [2295] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4
## [2332] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [2369] 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2406] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4
## [2443] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 3
## [2480] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4
## [2517] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4
## [2554] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2591] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2628] 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 2
## [2665] 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2702] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2739] 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2776] 3 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4
## [2813] 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4
## [2850] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4
## [2887] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2924] 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4
## [2961] 4 4 4 4 3 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4
## [2998] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4
## [3035] 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3072] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3109] 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 2 4 3 4 4 4 4 4
## [3146] 4 4 4 3 3 3 4 4 4 3 4 3 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4
## [3183] 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [3220] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3257] 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4
## [3294] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4
## [3331] 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [3368] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3405] 4 4 4 4 3 3 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3442] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4
## [3479] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3516] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4
## [3553] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3
## [3590] 4 4 4 4 4 4 4 4 4 4 4 3 3 4 3 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4
## [3627] 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [3664] 4 3 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4
## [3701] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 2 4 4 4 4 4 4
## [3738] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3775] 4 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4 3 4 3 4 4 4 4 4 4
## [3812] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3849] 4 4 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4
## [3886] 4 3 4 3 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4
## [3923] 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 3 4 4 4 4 4
## [3960] 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [3997] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4
## [4034] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4
## [4071] 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [4108] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4
## [4145] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 3 4 4 4 4 3 4 3 4 4 4 3 4 4 4 4 4
## [4182] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4
## [4219] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 3 4 4
## [4256] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [4293] 4 4 3 4
##
## Within cluster sum of squares by cluster:
## [1] 159.1020 3702.1457 1323.1806 484.0423
## (between_SS / total_SS = 67.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(
modelo_kmeans,
data = vars_cluster_scaled,
geom = "point",
ellipse.type = "convex",
main = "Clusters de clientes (k = 4)"
)
df_clientes_cluster <- df_clientes %>%
mutate(cluster_num = modelo_kmeans$cluster)
head(df_clientes_cluster)
## # A tibble: 6 × 6
## customer_id frecuencia_compras cantidad_total gasto_total ticket_promedio
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 12346 1 74215 77184. 77184.
## 2 12347 182 2458 4310. 23.7
## 3 12349 73 631 1758. 24.1
## 4 12350 17 197 334. 19.7
## 5 12352 85 536 2506. 29.5
## 6 12353 4 20 89 22.2
## # ℹ 1 more variable: cluster_num <int>
centroides <- df_clientes_cluster %>%
group_by(cluster_num) %>%
summarise(
frecuencia_compras = mean(frecuencia_compras),
cantidad_total = mean(cantidad_total),
gasto_total = mean(gasto_total),
ticket_promedio = mean(ticket_promedio),
n_clientes = n(),
.groups = "drop"
) %>%
arrange(desc(gasto_total))
centroides
## # A tibble: 4 × 6
## cluster_num frecuencia_compras cantidad_total gasto_total ticket_promedio
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 77606 122828. 66671.
## 2 2 2044. 58467. 105171. 171.
## 3 3 400. 5066. 8784. 92.8
## 4 4 56.8 597. 1007. 32.5
## # ℹ 1 more variable: n_clientes <int>
# Se asignan nombres con base en el orden de gasto promedio (de mayor a menor)
orden_clusters <- centroides$cluster_num
df_clientes_cluster <- df_clientes_cluster %>%
mutate(
nombre_cluster = case_when(
cluster_num == orden_clusters[1] ~ "Premium / Alto valor",
cluster_num == orden_clusters[2] ~ "Frecuentes / Valor medio",
cluster_num == orden_clusters[3] ~ "Ocasionales / Bajo gasto",
TRUE ~ "Compradores de ticket específico"
)
)
head(df_clientes_cluster)
## # A tibble: 6 × 7
## customer_id frecuencia_compras cantidad_total gasto_total ticket_promedio
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 12346 1 74215 77184. 77184.
## 2 12347 182 2458 4310. 23.7
## 3 12349 73 631 1758. 24.1
## 4 12350 17 197 334. 19.7
## 5 12352 85 536 2506. 29.5
## 6 12353 4 20 89 22.2
## # ℹ 2 more variables: cluster_num <int>, nombre_cluster <chr>
df_final <- df_raw %>%
left_join(
df_clientes_cluster %>% select(customer_id, cluster_num, nombre_cluster),
by = "customer_id"
)
head(df_final)
## # A tibble: 6 × 10
## bill_no itemname quantity date time price
## <dbl> <chr> <dbl> <dttm> <dttm> <dbl>
## 1 536365 WHITE HANGING … 6 2010-12-01 00:00:00 1899-12-31 08:26:00 2.55
## 2 536365 WHITE METAL LA… 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 3 536365 CREAM CUPID HE… 8 2010-12-01 00:00:00 1899-12-31 08:26:00 2.75
## 4 536365 KNITTED UNION … 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 5 536365 RED WOOLLY HOT… 6 2010-12-01 00:00:00 1899-12-31 08:26:00 3.39
## 6 536365 SET 7 BABUSHKA… 2 2010-12-01 00:00:00 1899-12-31 08:26:00 7.65
## # ℹ 4 more variables: customer_id <dbl>, country <chr>, cluster_num <int>,
## # nombre_cluster <chr>
perfil_clusters <- df_clientes_cluster %>%
group_by(nombre_cluster) %>%
summarise(
clientes = n(),
frecuencia_promedio = mean(frecuencia_compras),
cantidad_promedio = mean(cantidad_total),
gasto_promedio = mean(gasto_total),
ticket_promedio = mean(ticket_promedio),
.groups = "drop"
) %>%
mutate(
porcentaje_clientes = clientes / sum(clientes)
) %>%
arrange(desc(gasto_promedio))
perfil_clusters
## # A tibble: 4 × 7
## nombre_cluster clientes frecuencia_promedio cantidad_promedio gasto_promedio
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Premium / Alto … 2 2 77606 122828.
## 2 Frecuentes / Va… 14 2044. 58467. 105171.
## 3 Ocasionales / B… 323 400. 5066. 8784.
## 4 Compradores de … 3957 56.8 597. 1007.
## # ℹ 2 more variables: ticket_promedio <dbl>, porcentaje_clientes <dbl>
ggplot(perfil_clusters, aes(x = reorder(nombre_cluster, gasto_promedio), y = gasto_promedio)) +
geom_col() +
coord_flip() +
labs(
title = "Gasto promedio por cluster",
x = "Cluster",
y = "Gasto promedio"
)
recomendaciones <- data.frame(
Cluster = c(
"Premium / Alto valor",
"Frecuentes / Valor medio",
"Ocasionales / Bajo gasto",
"Compradores de ticket específico"
),
Caracteristicas = c(
"Clientes con mayor gasto total y/o mayor ticket promedio. Alta contribución al ingreso.",
"Clientes con recurrencia constante y valor intermedio. Buena base para crecimiento.",
"Clientes con baja frecuencia y bajo gasto. Menor participación actual.",
"Clientes con patrón mixto: compran poco pero con tickets altos, o compran seguido con tickets bajos."
),
Recomendaciones = c(
"Implementar programa VIP, ofertas exclusivas, recompensas por lealtad y atención personalizada.",
"Aplicar estrategias de upselling/cross-selling, cupones por recurrencia y combos.",
"Lanzar campañas de reactivación, descuentos de recompra y promociones de entrada.",
"Segmentar promociones según patrón de compra y analizar categorías para personalización."
),
stringsAsFactors = FALSE
)
recomendaciones
## Cluster
## 1 Premium / Alto valor
## 2 Frecuentes / Valor medio
## 3 Ocasionales / Bajo gasto
## 4 Compradores de ticket específico
## Caracteristicas
## 1 Clientes con mayor gasto total y/o mayor ticket promedio. Alta contribución al ingreso.
## 2 Clientes con recurrencia constante y valor intermedio. Buena base para crecimiento.
## 3 Clientes con baja frecuencia y bajo gasto. Menor participación actual.
## 4 Clientes con patrón mixto: compran poco pero con tickets altos, o compran seguido con tickets bajos.
## Recomendaciones
## 1 Implementar programa VIP, ofertas exclusivas, recompensas por lealtad y atención personalizada.
## 2 Aplicar estrategias de upselling/cross-selling, cupones por recurrencia y combos.
## 3 Lanzar campañas de reactivación, descuentos de recompra y promociones de entrada.
## 4 Segmentar promociones según patrón de compra y analizar categorías para personalización.
write.csv(df_final, "supermarket_con_clusters.csv", row.names = FALSE)
write.csv(perfil_clusters, "perfil_clusters.csv", row.names = FALSE)
write.csv(recomendaciones, "recomendaciones_clusters.csv", row.names = FALSE)
Se realizó un análisis de clustering sobre clientes de supermercado a partir de un archivo Excel. Primero, se desarrolló una estadística descriptiva para identificar la distribución de variables numéricas y el comportamiento general de los datos. Posteriormente, se efectuó una limpieza de la base, eliminando valores faltantes en variables clave, registros inválidos y duplicados.
Para la segmentación, se construyó una tabla por cliente con variables de comportamiento de compra (frecuencia, cantidad total, gasto total y ticket promedio). Estas variables fueron escaladas para evitar sesgos por diferencias de magnitud. Aunque se exploraron métodos de optimización, se implementó un modelo final de k-means con 4 clusters en cumplimiento de la rúbrica.
Finalmente, se asignó a cada cliente un número y nombre de cluster, y esta información se integró a la tabla original. Con base en las características promedio de cada segmento, se elaboraron recomendaciones orientadas a fidelización, crecimiento de valor y reactivación de clientes.