Agrupamiento - clientes

Agrupamiento o clustering es una técnica de aprendizaje automático no supervisado que agrupa datos en función de su similitud.

Algunos usos típicos de esta técnica son:

Segmentación de clientes
Detección de anormalidades
Categorización de documentos

#install.packages("cluster") # Análisis de Agrupamiento
library(cluster)
#install.packages("ggplot2") # Graficar
library(ggplot2)
#install.packages("data.table") # Manejo de muchos datos
library(data.table)
#install.packages("factoextra") # Gráfica optimización de número de clusters
library(factoextra)
library(readxl)
library(dplyr)
library(janitor)
library(scales)

df_raw <- read_excel("/Users/edu_sssedu/Downloads/supermarket.xlsx", sheet = "supermercado")

## Warning: Expecting numeric in A522063 / R522063C1: got 'A563185'

## Warning: Expecting numeric in A522064 / R522064C1: got 'A563186'

## Warning: Expecting numeric in A522065 / R522065C1: got 'A563187'

# Limpieza de nombres de columnas (minúsculas y guiones bajos)
df_raw <- clean_names(df_raw)

# Vista inicial
head(df_raw)

## # A tibble: 6 × 8
##   bill_no itemname        quantity date                time                price
##     <dbl> <chr>              <dbl> <dttm>              <dttm>              <dbl>
## 1  536365 WHITE HANGING …        6 2010-12-01 00:00:00 1899-12-31 08:26:00  2.55
## 2  536365 WHITE METAL LA…        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 3  536365 CREAM CUPID HE…        8 2010-12-01 00:00:00 1899-12-31 08:26:00  2.75
## 4  536365 KNITTED UNION …        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 5  536365 RED WOOLLY HOT…        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 6  536365 SET 7 BABUSHKA…        2 2010-12-01 00:00:00 1899-12-31 08:26:00  7.65
## # ℹ 2 more variables: customer_id <dbl>, country <chr>

str(df_raw)

## tibble [522,064 × 8] (S3: tbl_df/tbl/data.frame)
##  $ bill_no    : num [1:522064] 536365 536365 536365 536365 536365 ...
##  $ itemname   : chr [1:522064] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ quantity   : num [1:522064] 6 6 8 6 6 2 6 6 6 32 ...
##  $ date       : POSIXct[1:522064], format: "2010-12-01" "2010-12-01" ...
##  $ time       : POSIXct[1:522064], format: "1899-12-31 08:26:00" "1899-12-31 08:26:00" ...
##  $ price      : num [1:522064] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ customer_id: num [1:522064] 17850 17850 17850 17850 17850 ...
##  $ country    : chr [1:522064] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

# Ver nombres reales de columnas 
names(df_raw)

## [1] "bill_no"     "itemname"    "quantity"    "date"        "time"       
## [6] "price"       "customer_id" "country"

summary(df_raw)

##     bill_no         itemname            quantity       
##  Min.   :536365   Length:522064      Min.   :-9600.00  
##  1st Qu.:547892   Class :character   1st Qu.:    1.00  
##  Median :560603   Mode  :character   Median :    3.00  
##  Mean   :559951                      Mean   :   10.09  
##  3rd Qu.:571892                      3rd Qu.:   10.00  
##  Max.   :581587                      Max.   :80995.00  
##  NA's   :3                                             
##       date                          time                    
##  Min.   :2010-12-01 00:00:00   Min.   :1899-12-31 06:20:00  
##  1st Qu.:2011-03-28 00:00:00   1st Qu.:1899-12-31 11:48:00  
##  Median :2011-07-20 00:00:00   Median :1899-12-31 13:37:00  
##  Mean   :2011-07-03 23:15:13   Mean   :1899-12-31 13:36:07  
##  3rd Qu.:2011-10-19 00:00:00   3rd Qu.:1899-12-31 15:30:00  
##  Max.   :2011-12-09 00:00:00   Max.   :1899-12-31 20:18:00  
##                                                             
##      price             customer_id       country         
##  Min.   :-11062.060   Min.   :12346    Length:522064     
##  1st Qu.:     1.250   1st Qu.:13950    Class :character  
##  Median :     2.080   Median :15265    Mode  :character  
##  Mean   :     3.827   Mean   :15317                      
##  3rd Qu.:     4.130   3rd Qu.:16837                      
##  Max.   : 13541.330   Max.   :18287                      
##                       NA's   :134041

num_cols <- df_raw %>% select(where(is.numeric))
names(num_cols)

## [1] "bill_no"     "quantity"    "price"       "customer_id"

summary(num_cols)

##     bill_no          quantity            price             customer_id    
##  Min.   :536365   Min.   :-9600.00   Min.   :-11062.060   Min.   :12346   
##  1st Qu.:547892   1st Qu.:    1.00   1st Qu.:     1.250   1st Qu.:13950   
##  Median :560603   Median :    3.00   Median :     2.080   Median :15265   
##  Mean   :559951   Mean   :   10.09   Mean   :     3.827   Mean   :15317   
##  3rd Qu.:571892   3rd Qu.:   10.00   3rd Qu.:     4.130   3rd Qu.:16837   
##  Max.   :581587   Max.   :80995.00   Max.   : 13541.330   Max.   :18287   
##  NA's   :3                                                NA's   :134041

desc_stats <- num_cols %>%
  summarise(across(
    everything(),
    list(
      n = ~sum(!is.na(.)),
      media = ~mean(., na.rm = TRUE),
      mediana = ~median(., na.rm = TRUE),
      sd = ~sd(., na.rm = TRUE),
      min = ~min(., na.rm = TRUE),
      max = ~max(., na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  ))

desc_stats

## # A tibble: 1 × 24
##   bill_no_n bill_no_media bill_no_mediana bill_no_sd bill_no_min bill_no_max
##       <int>         <dbl>           <dbl>      <dbl>       <dbl>       <dbl>
## 1    522061       559951.          560603     13453.      536365      581587
## # ℹ 18 more variables: quantity_n <int>, quantity_media <dbl>,
## #   quantity_mediana <dbl>, quantity_sd <dbl>, quantity_min <dbl>,
## #   quantity_max <dbl>, price_n <int>, price_media <dbl>, price_mediana <dbl>,
## #   price_sd <dbl>, price_min <dbl>, price_max <dbl>, customer_id_n <int>,
## #   customer_id_media <dbl>, customer_id_mediana <dbl>, customer_id_sd <dbl>,
## #   customer_id_min <dbl>, customer_id_max <dbl>

for (col in names(num_cols)) {
  print(
    ggplot(df_raw, aes(x = .data[[col]])) +
      geom_histogram(bins = 30) +
      labs(title = paste("Histograma de", col), x = col, y = "Frecuencia")
  )
}

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 134041 rows containing non-finite outside the scale range
## (`stat_bin()`).

na_por_columna <- sapply(df_raw, function(x) sum(is.na(x)))
na_por_columna

##     bill_no    itemname    quantity        date        time       price 
##           3        1455           0           0           0           0 
## customer_id     country 
##      134041           0

n_antes <- nrow(df_raw)
df_raw <- df_raw %>% distinct()
n_despues <- nrow(df_raw)

cat("Filas antes de eliminar duplicados:", n_antes, "\n")

## Filas antes de eliminar duplicados: 522064

cat("Filas después de eliminar duplicados:", n_despues, "\n")

## Filas después de eliminar duplicados: 516718

cat("Duplicados eliminados:", n_antes - n_despues, "\n")

## Duplicados eliminados: 5346

# Mostrar nombres:
names(df_raw)

## [1] "bill_no"     "itemname"    "quantity"    "date"        "time"       
## [6] "price"       "customer_id" "country"

# Dataset limpio para clustering (sin NA en columnas clave y con valores válidos)
df_clean <- df_raw %>%
  filter(!is.na(customer_id), !is.na(quantity), !is.na(price)) %>%
  filter(quantity > 0, price > 0)

cat("Filas originales:", nrow(df_raw), "\n")

## Filas originales: 516718

cat("Filas limpias para análisis:", nrow(df_clean), "\n")

## Filas limpias para análisis: 382737

head(df_clean)

## # A tibble: 6 × 8
##   bill_no itemname        quantity date                time                price
##     <dbl> <chr>              <dbl> <dttm>              <dttm>              <dbl>
## 1  536365 WHITE HANGING …        6 2010-12-01 00:00:00 1899-12-31 08:26:00  2.55
## 2  536365 WHITE METAL LA…        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 3  536365 CREAM CUPID HE…        8 2010-12-01 00:00:00 1899-12-31 08:26:00  2.75
## 4  536365 KNITTED UNION …        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 5  536365 RED WOOLLY HOT…        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 6  536365 SET 7 BABUSHKA…        2 2010-12-01 00:00:00 1899-12-31 08:26:00  7.65
## # ℹ 2 more variables: customer_id <dbl>, country <chr>

df_clientes <- df_clean %>%
  group_by(customer_id) %>%
  summarise(
    frecuencia_compras = n(),
    cantidad_total = sum(quantity, na.rm = TRUE),
    gasto_total = sum(quantity * price, na.rm = TRUE),
    ticket_promedio = mean(quantity * price, na.rm = TRUE),
    .groups = "drop"
  )

vars_cluster <- df_clientes %>%
  select(frecuencia_compras, cantidad_total, gasto_total, ticket_promedio)

vars_cluster_scaled <- scale(vars_cluster)

head(vars_cluster_scaled)

##      frecuencia_compras cantidad_total gasto_total ticket_promedio
## [1,]        -0.42180306     15.0785957  8.75820212     52.27981658
## [2,]         0.44486820      0.2684172  0.27045289     -0.03036610
## [3,]        -0.07704985     -0.1086638 -0.02683660     -0.03009851
## [4,]        -0.34519124     -0.1982386 -0.19259403     -0.03308515
## [5,]        -0.01959099     -0.1282712  0.06034168     -0.02643300
## [6,]        -0.40743835     -0.2347702 -0.22117631     -0.03133645

fviz_nbclust(vars_cluster_scaled, kmeans, method = "wss") +
  labs(title = "Método del codo (WSS)")

fviz_nbclust(vars_cluster_scaled, kmeans, method = "silhouette") +
  labs(title = "Método Silhouette")

set.seed(123)
gap_stat <- clusGap(vars_cluster_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

## Warning: did not converge in 10 iterations

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 214800)

## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations

fviz_gap_stat(gap_stat) +
  labs(title = "Gap Statistic")

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

set.seed(123)
k_final <- 4

modelo_kmeans <- kmeans(vars_cluster_scaled, centers = k_final, nstart = 50)

modelo_kmeans

## K-means clustering with 4 clusters of sizes 2, 14, 323, 3957
## 
## Cluster means:
##   frecuencia_compras cantidad_total gasto_total ticket_promedio
## 1         -0.4170148     15.7784761  14.0745121     45.15254668
## 2          9.3602272     11.8282224  12.0179838      0.06936930
## 3          1.4909428      0.8066301   0.7915682      0.01650780
## 4         -0.1546080     -0.1156668  -0.1142475     -0.02441453
## 
## Clustering vector:
##    [1] 1 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##   [38] 4 4 4 3 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4
##   [75] 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 3 4 3 3 4 4 4 3 4 3 4 4 3 4 4 4 4 4 4 4 4 4
##  [112] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 3 4 4 4 4 4 4
##  [149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4
##  [186] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4
##  [223] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 3 3 4 4 4 4 4
##  [260] 4 4 4 4 4 4 3 4 4 4 3 4 4 3 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4
##  [297] 4 4 4 4 4 4 3 4 2 4 4 4 3 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [334] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 3 4
##  [371] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [408] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [445] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4
##  [482] 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4
##  [519] 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 3 4 4 4 2 4 4 4 4 4 4 3 3 4 4 3 4 4 4 4
##  [556] 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4
##  [593] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4
##  [630] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [667] 4 3 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [704] 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [741] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
##  [778] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 3 4
##  [815] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [852] 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [889] 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [926] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
##  [963] 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1000] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 3
## [1037] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1074] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4
## [1111] 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1148] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3
## [1185] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4
## [1222] 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 3 3 4 3 3 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4
## [1259] 3 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1296] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 3 4 3
## [1333] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1370] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [1407] 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1444] 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4
## [1481] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1518] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1555] 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4
## [1592] 3 3 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4
## [1629] 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4
## [1666] 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 3 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 3 3
## [1703] 4 3 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1740] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4
## [1777] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4
## [1814] 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4
## [1851] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1888] 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 3 3
## [1925] 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3
## [1962] 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1999] 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4
## [2036] 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2073] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2110] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 2 4 4 4 4 4 4 4
## [2147] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4
## [2184] 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2221] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2258] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [2295] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4
## [2332] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [2369] 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2406] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4
## [2443] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 3
## [2480] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4
## [2517] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4
## [2554] 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2591] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2628] 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 2
## [2665] 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2702] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2739] 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2776] 3 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4
## [2813] 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4
## [2850] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4
## [2887] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [2924] 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4
## [2961] 4 4 4 4 3 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4
## [2998] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4
## [3035] 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3072] 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3109] 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 2 4 3 4 4 4 4 4
## [3146] 4 4 4 3 3 3 4 4 4 3 4 3 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4
## [3183] 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [3220] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3257] 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4
## [3294] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4
## [3331] 4 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
## [3368] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3405] 4 4 4 4 3 3 4 4 4 4 4 3 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3442] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4
## [3479] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3516] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 4
## [3553] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3
## [3590] 4 4 4 4 4 4 4 4 4 4 4 3 3 4 3 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4
## [3627] 4 4 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [3664] 4 3 4 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4
## [3701] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 2 4 4 4 4 4 4
## [3738] 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3775] 4 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4 3 4 3 4 4 4 4 4 4
## [3812] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [3849] 4 4 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4
## [3886] 4 3 4 3 4 4 4 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4
## [3923] 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 3 4 3 4 4 4 4 4
## [3960] 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4
## [3997] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4
## [4034] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4
## [4071] 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [4108] 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4
## [4145] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 3 4 4 4 4 3 4 3 4 4 4 3 4 4 4 4 4
## [4182] 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4
## [4219] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 3 3 4 4
## [4256] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [4293] 4 4 3 4
## 
## Within cluster sum of squares by cluster:
## [1]  159.1020 3702.1457 1323.1806  484.0423
##  (between_SS / total_SS =  67.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(
  modelo_kmeans, 
  data = vars_cluster_scaled,
  geom = "point",
  ellipse.type = "convex",
  main = "Clusters de clientes (k = 4)"
)

df_clientes_cluster <- df_clientes %>%
  mutate(cluster_num = modelo_kmeans$cluster)

head(df_clientes_cluster)

## # A tibble: 6 × 6
##   customer_id frecuencia_compras cantidad_total gasto_total ticket_promedio
##         <dbl>              <int>          <dbl>       <dbl>           <dbl>
## 1       12346                  1          74215      77184.         77184. 
## 2       12347                182           2458       4310.            23.7
## 3       12349                 73            631       1758.            24.1
## 4       12350                 17            197        334.            19.7
## 5       12352                 85            536       2506.            29.5
## 6       12353                  4             20         89             22.2
## # ℹ 1 more variable: cluster_num <int>

centroides <- df_clientes_cluster %>%
  group_by(cluster_num) %>%
  summarise(
    frecuencia_compras = mean(frecuencia_compras),
    cantidad_total = mean(cantidad_total),
    gasto_total = mean(gasto_total),
    ticket_promedio = mean(ticket_promedio),
    n_clientes = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(gasto_total))

centroides

## # A tibble: 4 × 6
##   cluster_num frecuencia_compras cantidad_total gasto_total ticket_promedio
##         <int>              <dbl>          <dbl>       <dbl>           <dbl>
## 1           1                2           77606      122828.         66671. 
## 2           2             2044.          58467.     105171.           171. 
## 3           3              400.           5066.       8784.            92.8
## 4           4               56.8           597.       1007.            32.5
## # ℹ 1 more variable: n_clientes <int>

# Se asignan nombres con base en el orden de gasto promedio (de mayor a menor)

orden_clusters <- centroides$cluster_num

df_clientes_cluster <- df_clientes_cluster %>%
  mutate(
    nombre_cluster = case_when(
      cluster_num == orden_clusters[1] ~ "Premium / Alto valor",
      cluster_num == orden_clusters[2] ~ "Frecuentes / Valor medio",
      cluster_num == orden_clusters[3] ~ "Ocasionales / Bajo gasto",
      TRUE ~ "Compradores de ticket específico"
    )
  )

head(df_clientes_cluster)

## # A tibble: 6 × 7
##   customer_id frecuencia_compras cantidad_total gasto_total ticket_promedio
##         <dbl>              <int>          <dbl>       <dbl>           <dbl>
## 1       12346                  1          74215      77184.         77184. 
## 2       12347                182           2458       4310.            23.7
## 3       12349                 73            631       1758.            24.1
## 4       12350                 17            197        334.            19.7
## 5       12352                 85            536       2506.            29.5
## 6       12353                  4             20         89             22.2
## # ℹ 2 more variables: cluster_num <int>, nombre_cluster <chr>

df_final <- df_raw %>%
  left_join(
    df_clientes_cluster %>% select(customer_id, cluster_num, nombre_cluster),
    by = "customer_id"
  )

head(df_final)

## # A tibble: 6 × 10
##   bill_no itemname        quantity date                time                price
##     <dbl> <chr>              <dbl> <dttm>              <dttm>              <dbl>
## 1  536365 WHITE HANGING …        6 2010-12-01 00:00:00 1899-12-31 08:26:00  2.55
## 2  536365 WHITE METAL LA…        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 3  536365 CREAM CUPID HE…        8 2010-12-01 00:00:00 1899-12-31 08:26:00  2.75
## 4  536365 KNITTED UNION …        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 5  536365 RED WOOLLY HOT…        6 2010-12-01 00:00:00 1899-12-31 08:26:00  3.39
## 6  536365 SET 7 BABUSHKA…        2 2010-12-01 00:00:00 1899-12-31 08:26:00  7.65
## # ℹ 4 more variables: customer_id <dbl>, country <chr>, cluster_num <int>,
## #   nombre_cluster <chr>

perfil_clusters <- df_clientes_cluster %>%
  group_by(nombre_cluster) %>%
  summarise(
    clientes = n(),
    frecuencia_promedio = mean(frecuencia_compras),
    cantidad_promedio = mean(cantidad_total),
    gasto_promedio = mean(gasto_total),
    ticket_promedio = mean(ticket_promedio),
    .groups = "drop"
  ) %>%
  mutate(
    porcentaje_clientes = clientes / sum(clientes)
  ) %>%
  arrange(desc(gasto_promedio))

perfil_clusters

## # A tibble: 4 × 7
##   nombre_cluster   clientes frecuencia_promedio cantidad_promedio gasto_promedio
##   <chr>               <int>               <dbl>             <dbl>          <dbl>
## 1 Premium / Alto …        2                 2              77606         122828.
## 2 Frecuentes / Va…       14              2044.             58467.        105171.
## 3 Ocasionales / B…      323               400.              5066.          8784.
## 4 Compradores de …     3957                56.8              597.          1007.
## # ℹ 2 more variables: ticket_promedio <dbl>, porcentaje_clientes <dbl>

ggplot(perfil_clusters, aes(x = reorder(nombre_cluster, gasto_promedio), y = gasto_promedio)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Gasto promedio por cluster",
    x = "Cluster",
    y = "Gasto promedio"
  )

recomendaciones <- data.frame(
  Cluster = c(
    "Premium / Alto valor",
    "Frecuentes / Valor medio",
    "Ocasionales / Bajo gasto",
    "Compradores de ticket específico"
  ),
  Caracteristicas = c(
    "Clientes con mayor gasto total y/o mayor ticket promedio. Alta contribución al ingreso.",
    "Clientes con recurrencia constante y valor intermedio. Buena base para crecimiento.",
    "Clientes con baja frecuencia y bajo gasto. Menor participación actual.",
    "Clientes con patrón mixto: compran poco pero con tickets altos, o compran seguido con tickets bajos."
  ),
  Recomendaciones = c(
    "Implementar programa VIP, ofertas exclusivas, recompensas por lealtad y atención personalizada.",
    "Aplicar estrategias de upselling/cross-selling, cupones por recurrencia y combos.",
    "Lanzar campañas de reactivación, descuentos de recompra y promociones de entrada.",
    "Segmentar promociones según patrón de compra y analizar categorías para personalización."
  ),
  stringsAsFactors = FALSE
)

recomendaciones

##                            Cluster
## 1             Premium / Alto valor
## 2         Frecuentes / Valor medio
## 3         Ocasionales / Bajo gasto
## 4 Compradores de ticket específico
##                                                                                        Caracteristicas
## 1              Clientes con mayor gasto total y/o mayor ticket promedio. Alta contribución al ingreso.
## 2                  Clientes con recurrencia constante y valor intermedio. Buena base para crecimiento.
## 3                               Clientes con baja frecuencia y bajo gasto. Menor participación actual.
## 4 Clientes con patrón mixto: compran poco pero con tickets altos, o compran seguido con tickets bajos.
##                                                                                   Recomendaciones
## 1 Implementar programa VIP, ofertas exclusivas, recompensas por lealtad y atención personalizada.
## 2               Aplicar estrategias de upselling/cross-selling, cupones por recurrencia y combos.
## 3               Lanzar campañas de reactivación, descuentos de recompra y promociones de entrada.
## 4        Segmentar promociones según patrón de compra y analizar categorías para personalización.

write.csv(df_final, "supermarket_con_clusters.csv", row.names = FALSE)

write.csv(perfil_clusters, "perfil_clusters.csv", row.names = FALSE)

write.csv(recomendaciones, "recomendaciones_clusters.csv", row.names = FALSE)

Se realizó un análisis de clustering sobre clientes de supermercado a partir de un archivo Excel. Primero, se desarrolló una estadística descriptiva para identificar la distribución de variables numéricas y el comportamiento general de los datos. Posteriormente, se efectuó una limpieza de la base, eliminando valores faltantes en variables clave, registros inválidos y duplicados.

Para la segmentación, se construyó una tabla por cliente con variables de comportamiento de compra (frecuencia, cantidad total, gasto total y ticket promedio). Estas variables fueron escaladas para evitar sesgos por diferencias de magnitud. Aunque se exploraron métodos de optimización, se implementó un modelo final de k-means con 4 clusters en cumplimiento de la rúbrica.

Finalmente, se asignó a cada cliente un número y nombre de cluster, y esta información se integró a la tabla original. Con base en las características promedio de cada segmento, se elaboraron recomendaciones orientadas a fidelización, crecimiento de valor y reactivación de clientes.

Agrupamiento - clientes

EQUIPO 5

2026-02-23