The Apriori algorithm finds frequent itemsets in a dataset for association rule mining. It is called Apriori because it uses prior knowledge of the properties of frequent itemsets. It follows an iterative, level-wise search in which the frequent k-itemsets are used to find the frequent (k+1)-itemsets. To make this level-wise generation efficient, it relies on the Apriori property, which shrinks the search space: every subset of a frequent itemset must itself be frequent, so any candidate with an infrequent subset can be discarded without counting it. The algorithm is straightforward to apply in R.
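The level-wise search described above can be sketched in a few lines of base R. This is a minimal illustration on a hypothetical five-basket dataset, not the implementation used by any package; the names `baskets`, `support_count` and `apriori_sketch` are made up for this example.

```r
# Hypothetical toy baskets (not the groceries data)
baskets <- list(
  c("bread", "milk"),
  c("beer", "bread", "diapers", "eggs"),
  c("beer", "cola", "diapers", "milk"),
  c("beer", "bread", "diapers", "milk"),
  c("bread", "cola", "diapers", "milk")
)

# Number of baskets that contain every item of `itemset`
support_count <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

apriori_sketch <- function(baskets, min_count) {
  items <- sort(unique(unlist(baskets)))
  # Level 1: frequent single items
  current <- Filter(function(s) support_count(s, baskets) >= min_count,
                    as.list(items))
  frequent <- list()
  k <- 1
  while (length(current) > 0) {
    frequent[[k]] <- current
    keys <- vapply(current, paste, character(1), collapse = "|")
    pool <- sort(unique(unlist(current)))
    # Join step: extend each sorted k-itemset with an item that sorts later
    candidates <- list()
    for (s in current) {
      for (item in pool[pool > s[length(s)]]) {
        candidates[[length(candidates) + 1]] <- c(s, item)
      }
    }
    # Prune step (Apriori property): every k-subset must already be frequent
    all_subsets_frequent <- function(cand) {
      subsets <- combn(cand, k, simplify = FALSE)
      all(vapply(subsets, paste, character(1), collapse = "|") %in% keys)
    }
    candidates <- Filter(all_subsets_frequent, candidates)
    # Count step: keep candidates that reach the minimum support count
    current <- Filter(function(s) support_count(s, baskets) >= min_count,
                      candidates)
    k <- k + 1
  }
  frequent
}

result <- apriori_sketch(baskets, min_count = 3)
lengths(result)  # number of frequent itemsets found at each level
```

With a minimum count of 3, this toy data yields four frequent single items and four frequent pairs. The candidate {beer, diapers, milk} is pruned without being counted because {beer, milk} is not frequent, which is the saving the Apriori property provides.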
library(readr)
library(arules)
library(magrittr)
library(tidyverse)
datos <- read_csv(file = "./datos_groceries.csv", col_names = TRUE, show_col_types = FALSE)
head(datos, 5)
transacciones <- read.transactions(file = "./datos_groceries.csv",
                                   format = "single",
                                   sep = ",",
                                   header = TRUE,
                                   cols = c("id_compra", "item"),
                                   rm.duplicates = TRUE)
transacciones
transactions in sparse format with
9835 transactions (rows) and
169 items (columns)
colnames(transacciones)[1:5]
[1] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" "bags"
rownames(transacciones)[1:5]
[1] "1" "10" "100" "1000" "1001"
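The "single" format expects one row per (transaction id, item) pair, and the row names above appear in string order ("1", "10", "100") rather than numeric order. A small base R sketch of the same grouping, using a hypothetical data frame `compras` with the column names `id_compra` and `item` taken from the call above:

```r
# Hypothetical long-format purchases: one row per (id_compra, item) pair
compras <- data.frame(
  id_compra = c("1", "1", "2", "10", "10", "10"),
  item = c("whole milk", "yogurt", "soda", "whole milk", "soda", "yogurt"),
  stringsAsFactors = FALSE
)

# Group the items of each purchase into one basket per id
cestas <- split(compras$item, compras$id_compra)

names(cestas)    # ids are ordered as strings, so "10" comes before "2"
lengths(cestas)  # number of items in each basket
```

If numeric ordering of the ids matters downstream, they would have to be reordered explicitly; the mining results themselves do not depend on the order of the transactions.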
CONVERTING A MATRIX TO A TRANSACTIONS OBJECT
datos_matriz <- datos %>%
  as.data.frame() %>%
  mutate(valor = 1) %>%
  spread(key = item, value = valor, fill = 0) %>%
  column_to_rownames(var = "id_compra") %>%
  as.matrix()
transacciones <- as(datos_matriz, Class = "transactions")
transacciones
transactions in sparse format with
9835 transactions (rows) and
169 items (columns)
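The matrix used for this conversion is an incidence matrix: one row per transaction, one column per item, and a nonzero entry wherever the item was bought. As a rough sketch of the same idea without the tidyverse, assuming a hypothetical long-format data frame `compras`, base R's `table()` can build it:

```r
# Hypothetical long-format purchases (not the groceries data)
compras <- data.frame(
  id_compra = c(1, 1, 2, 3, 3, 3),
  item = c("milk", "bread", "milk", "bread", "soda", "milk")
)

# Cross-tabulate ids against items, then reduce counts to presence/absence
incidence <- unclass(table(compras$id_compra, compras$item)) > 0

incidence
rowSums(incidence)  # items per transaction
colSums(incidence)  # transactions containing each item
```

The row sums and column sums of this matrix are the same quantities the walkthrough obtains with `size()` and `itemFrequency(type = "absolute")` on the transactions object.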
inspect(transacciones[1:5])
df_transacciones <- as(transacciones, Class = "data.frame")
# Convert the data frame to a tibble so the table prints more compactly
as_tibble(df_transacciones) %>% head()
tamanyos <- size(transacciones)
summary(tamanyos)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 4.409 6.000 32.000
data.frame(tamanyos) %>%
  ggplot(aes(x = tamanyos)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of transaction sizes",
       x = "Size") +
  theme_bw()
frecuencia_items <- itemFrequency(x = transacciones, type = "relative")
frecuencia_items %>% sort(decreasing = TRUE) %>% head(5)
whole milk other vegetables rolls/buns soda yogurt
0.2555160 0.1934926 0.1839349 0.1743772 0.1395018
frecuencia_items <- itemFrequency(x = transacciones, type = "absolute")
frecuencia_items %>% sort(decreasing = TRUE) %>% head(5)
whole milk other vegetables rolls/buns soda yogurt
2513 1903 1809 1715 1372
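The two outputs above are related by a single division: the relative frequency of an item is its absolute count divided by the number of transactions. A short check using the figures reported above (the variable names are illustrative only):

```r
# Values copied from the outputs above
n_transactions <- 9835
absolute <- c("whole milk" = 2513, "other vegetables" = 1903,
              "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Relative frequency (the support of each single item)
relative <- absolute / n_transactions
round(relative, 7)
```

Rounded to seven decimals, the result reproduces the relative frequencies printed by `itemFrequency(type = "relative")`.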
soporte <- 30 / dim(transacciones)[1]
itemsets <- apriori(data = transacciones,
                    parameter = list(support = soporte,
                                     minlen = 1,
                                     maxlen = 20,
                                     target = "frequent itemset"))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 30
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [136 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
sorting transactions ... done [0.00s].
writing ... [2226 set(s)] done [0.00s].
creating S4 object ... done [0.00s].
summary(itemsets)
set of 2226 itemsets
most frequent items:
whole milk other vegetables yogurt root vegetables rolls/buns (Other)
556 468 316 251 241 3536
element (itemset/transaction) length distribution:sizes
1 2 3 4 5
136 1140 850 98 2
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.412 3.000 5.000
summary of quality measures:
support count
Min. :0.003050 Min. : 30.00
1st Qu.:0.003660 1st Qu.: 36.00
Median :0.004779 Median : 47.00
Mean :0.007879 Mean : 77.49
3rd Qu.:0.007219 3rd Qu.: 71.00
Max. :0.255516 Max. :2513.00
includes transaction ID lists: FALSE
mining info:
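The `support` and `count` columns of this summary carry the same information on two scales, since support is the count divided by the 9835 transactions. The following sketch converts the reported support quantiles back to counts; the numbers are copied from the summary above and the variable names are illustrative.

```r
n_transactions <- 9835

# Minimum support used for mining: at least 30 transactions
min_support <- 30 / n_transactions
round(min_support, 6)

# Support quantiles from the summary (min, 1st quartile, median, 3rd quartile, max)
support_quantiles <- c(0.003050, 0.003660, 0.004779, 0.007219, 0.255516)
counts <- round(support_quantiles * n_transactions)
counts
```

The recovered counts match the `count` column, and the maximum corresponds to the most frequent single item, whole milk.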
# Show the 20 itemsets with the highest support, in decreasing order
top_20_itemsets <- sort(itemsets, by = "support", decreasing = TRUE)[1:20]
inspect(top_20_itemsets)
# Association rules
soporte <- 30 / dim(transacciones)[1]
reglas <- apriori(data = transacciones,
                  parameter = list(support = soporte,
                                   confidence = 0.70,
                                   # Mine rules rather than frequent itemsets
                                   target = "rules"))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 30
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [136 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [19 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(sort(x = reglas, decreasing = TRUE, by = "confidence"))
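The rules are ranked here by confidence. As a reminder of what the quality measures mean, the sketch below computes support, confidence and lift by hand for one rule on a hypothetical five-basket dataset; `baskets`, `support_of` and the rule itself are made up for illustration.

```r
# Hypothetical toy baskets (not the groceries data)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter")
)

# Fraction of baskets that contain every item in `items`
support_of <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under study: {bread, butter} => {milk}
lhs <- c("bread", "butter")
rhs <- "milk"

rule_support <- support_of(c(lhs, rhs))         # share of baskets with lhs and rhs together
confidence <- rule_support / support_of(lhs)    # share of lhs baskets that also contain rhs
lift <- confidence / support_of(rhs)            # confidence relative to the baseline share of rhs

c(support = rule_support, confidence = confidence, lift = lift)
```

A lift above 1 means the right-hand side appears more often in baskets containing the left-hand side than it does overall.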
For a minimum confidence of 0.80
soporte <- 30 / dim(transacciones)[1]
reglas <- apriori(data = transacciones,
                  parameter = list(support = soporte,
                                   confidence = 0.80,
                                   # Mine rules rather than frequent itemsets
                                   target = "rules"))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 30
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [136 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(sort(x = reglas, decreasing = TRUE, by = "confidence"))
For a minimum confidence of 0.30
soporte <- 30 / dim(transacciones)[1]
reglas <- apriori(data = transacciones,
                  parameter = list(support = soporte,
                                   confidence = 0.30,
                                   # Mine rules rather than frequent itemsets
                                   target = "rules"))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 30
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [136 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.01s].
writing ... [1361 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(sort(x = reglas, decreasing = FALSE, by = "confidence"))
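Because the `confidence` parameter is a lower bound, the rules found at 0.70 and 0.80 are subsets of those found at 0.30, so one mining run at the lowest threshold followed by filtering should be enough. The sketch below assumes the `Groceries` dataset bundled with arules as a stand-in for `datos_groceries.csv`; the two may not be identical, so the counts it prints are not guaranteed to match the runs above.

```r
suppressPackageStartupMessages(library(arules))

# Assumption: the bundled Groceries data stands in for the CSV used above
data("Groceries")
soporte <- 30 / length(Groceries)

# Mine once at the lowest confidence threshold of interest
reglas <- apriori(data = Groceries,
                  parameter = list(support = soporte,
                                   confidence = 0.30,
                                   target = "rules"),
                  control = list(verbose = FALSE))

# Tighten the threshold by filtering instead of mining again
reglas_070 <- subset(reglas, subset = confidence >= 0.70)
reglas_080 <- subset(reglas, subset = confidence >= 0.80)

c(conf_030 = length(reglas),
  conf_070 = length(reglas_070),
  conf_080 = length(reglas_080))
```

Filtering keeps the three rule sets consistent with each other and avoids repeating the support counting, which is the expensive part of the search.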