En este práctico se aplicará el algoritmo apriori de reglas de asociación para encontrar relaciones entre las películas vistas por un conjunto de usuarios.

setwd('C:/DiploDatos/Practico_ANS')

movies <- read.csv("ml-20m/ml-20m/movies.csv", 
                  colClasses = c("integer","character","character"),
                  sep = ",",
                  stringsAsFactors = FALSE)

ratings <- read.csv("ml-20m/ml-20m/ratings.csv",
                   colClasses = c("integer","integer","numeric","character"),
                   sep=",",
                   stringsAsFactors = FALSE)

head(movies)

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

head(ratings)

##   userId movieId rating  timestamp
## 1      1       2    3.5 1112486027
## 2      1      29    3.5 1112484676
## 3      1      32    3.5 1112484819
## 4      1      47    3.5 1112484727
## 5      1      50    3.5 1112484580
## 6      1     112    3.5 1094785740

Las bases cuentan con 138493 usuarios únicos y 26744 películas únicas identificadas por Id.

Definimos:
- Transacción: Todas las peliculas vistas por cada usuario
- Item: cada película
- Soporte del item A: número de transacciones que contienen A dividido entre el total de transacciones.

A los fines de realizar un primer análisis exploratorio de las bases, calculamos el soporte por Item (k=1).

usuarios <- length(unique(ratings$userId))

soporte <- ratings %>%
  group_by(movieId) %>%
  summarise(q_movie = n_distinct(userId),
    soporte = q_movie/usuarios)

soporte <- merge(soporte, movies, by = "movieId", all.x = T)

Las 10 películas más vistas:

top_n(soporte, 10, q_movie) %>% 
  arrange(desc(q_movie))

##    movieId q_movie   soporte                                     title
## 1      296   67310 0.4860173                       Pulp Fiction (1994)
## 2      356   66172 0.4778003                       Forrest Gump (1994)
## 3      318   63366 0.4575394          Shawshank Redemption, The (1994)
## 4      593   63299 0.4570556          Silence of the Lambs, The (1991)
## 5      480   59715 0.4311770                      Jurassic Park (1993)
## 6      260   54502 0.3935361 Star Wars: Episode IV - A New Hope (1977)
## 7      110   53769 0.3882434                         Braveheart (1995)
## 8      589   52244 0.3772321         Terminator 2: Judgment Day (1991)
## 9     2571   51334 0.3706613                        Matrix, The (1999)
## 10     527   50054 0.3614190                   Schindler's List (1993)
##                              genres
## 1       Comedy|Crime|Drama|Thriller
## 2          Comedy|Drama|Romance|War
## 3                       Crime|Drama
## 4             Crime|Horror|Thriller
## 5  Action|Adventure|Sci-Fi|Thriller
## 6           Action|Adventure|Sci-Fi
## 7                  Action|Drama|War
## 8                     Action|Sci-Fi
## 9            Action|Sci-Fi|Thriller
## 10                        Drama|War

La película Pulp Fiction fue la más vista con un total de 67.310 usuarios distintos sobre un total de 138.493, su soporte es de 0.48. En segundo lugar se encuentra Forrest Gump con 66.172 visualizaciones.

Algoritmo apriori

A continuación se aplicará el algoritmo apriori del paquete arules y sus principales métricas.

user_item_matrix <- as(split(ratings[,"movieId"], ratings[,"userId"]), 
                       "transactions")

user_item_matrix

## transactions in sparse format with
##  138493 transactions (rows) and
##  26744 items (columns)

trans <- size(user_item_matrix)
summary(trans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    35.0    68.0   144.4   155.0  9254.0

En promedio cada usuario vió 144 peliculas, con un minimo de 20 y máximo de 9254.

quantile(trans, probs = seq(0,1,0.1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##   20   24   30   39   51   68   93  127  193  334 9254

A continuación se presenta un gráfico con la distribución del tamaño de las transaccuones. Se encuentra concentrada en valores bajos, la mitad de los usuarios han evaluado hasta 68 películas.

data.frame(trans) %>%
  ggplot(aes(x = trans)) +
  geom_histogram() +
  labs(title = "Distribución del tamaño de las transacciones",
       x = "Tamaño (Cantidad de películas)") +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Parámetros del modelo

support: soporte mínimo que debe tener un itemset para ser considerado frecuente.
minlen: número mínimo de items que debe tener un itemset para ser incluido en los resultados.
maxlen: número máximo de items que puede tener un itemset para ser incluido en los resultados.
target: tipo de resultados que debe de generar el algoritmo, “frequent itemsets”, “maximally frequent itemsets”, “closed frequent itemsets”, “rules” o “hyperedgesets”.
confidence: confianza mínima que debe de tener una regla para ser incluida en los resultados.
maxtime: tiempo máximo que puede estar el algoritmo buscando subsets.

Para establecer el soporte definimos transacciones con por lo menos 68 items (Mediana).

parametros = list(
  supp = 68 / dim(user_item_matrix)[1],
  conf = 0.7,
  maxlen = 2)

reglas <- apriori(user_item_matrix, parameter = parametros)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime      support
##         0.7    0.1    1 none FALSE            TRUE       5 0.0004909995
##  minlen maxlen target   ext
##       1      2  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 68 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[26744 item(s), 138493 transaction(s)] done [9.19s].
## sorting and recoding items ... [9647 item(s)] done [0.67s].
## creating transaction tree ... done [0.19s].
## checking subsets of size 1 2

## Warning in apriori(user_item_matrix, parameter = parametros): Mining
## stopped (maxlen reached). Only patterns up to a length of 2 returned!

##  done [10.64s].
## writing ... [298319 rule(s)] done [0.32s].
## creating S4 object  ... done [0.13s].

Resumen del modelo

summary(reglas)

## set of 298319 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2 
## 298319 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support            confidence          lift              count        
##  Min.   :0.0004910   Min.   :0.7000   Min.   :   1.440   Min.   :   68.0  
##  1st Qu.:0.0007798   1st Qu.:0.7250   1st Qu.:   2.343   1st Qu.:  108.0  
##  Median :0.0014152   Median :0.7557   Median :   3.257   Median :  196.0  
##  Mean   :0.0044903   Mean   :0.7667   Mean   :   4.399   Mean   :  621.9  
##  3rd Qu.:0.0034009   3rd Qu.:0.7985   3rd Qu.:   4.681   3rd Qu.:  471.0  
##  Max.   :0.3445156   Max.   :0.9890   Max.   :1210.495   Max.   :47713.0  
## 
## mining info:
##              data ntransactions      support confidence
##  user_item_matrix        138493 0.0004909995        0.7

A continuación se presenta la relación entre el soporte y la confianza de cada regla.

plot(reglas, method = "two-key plot")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Relación entre el soporte y el lift.

plot(reglas, measure = c("support", "lift"), shading = "confidence")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Damos formato a la base para ver el nombre y género de las películas.

reglas <- as(reglas,"data.frame")

rules = sapply(reglas$rules,function(x){
  x = gsub("[\\{\\}]", "", regmatches(x, gregexpr("\\{.*\\}", x))[[1]])
  x = gsub("=>",",",x)
  x = str_replace_all(x," ","")
  return( x )
})

rules <- as.character(rules)
rules <- str_split(rules,",")

reglas$lhs_movie <- sapply(rules, "[[", 1)
reglas$rhs_movie <- sapply(rules , "[[", 2)

reglas$lhs_movie = as.numeric(reglas$lhs_movie)
reglas$rhs_movie = as.numeric(reglas$rhs_movie)

all_genres <- unique(unlist(str_split(movies$genres,"\\|")))

for(genre in all_genres){
  movies[genre] = 
    ifelse((str_detect(movies$genres,genre) | 
               str_detect(movies$genres,"no genres")) , 1 , 0)
}

movies$genres <- NULL

reglas <- reglas %>% 
  left_join(movies[, c(1,2)], by=c("lhs_movie" = "movieId"))

reglas <- reglas %>% 
  left_join(movies[, c(1,2)], by=c("rhs_movie" = "movieId"))

head(movies)

##   movieId                              title Adventure Animation Children
## 1       1                   Toy Story (1995)         1         1        1
## 2       2                     Jumanji (1995)         1         0        1
## 3       3            Grumpier Old Men (1995)         0         0        0
## 4       4           Waiting to Exhale (1995)         0         0        0
## 5       5 Father of the Bride Part II (1995)         0         0        0
## 6       6                        Heat (1995)         0         0        0
##   Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi
## 1      1       1       0     0      0     0        0      0       0      0
## 2      0       1       0     0      0     0        0      0       0      0
## 3      1       0       1     0      0     0        0      0       0      0
## 4      1       0       1     1      0     0        0      0       0      0
## 5      1       0       0     0      0     0        0      0       0      0
## 6      0       0       0     0      1     1        1      0       0      0
##   IMAX Documentary War Musical Western Film-Noir (no genres listed)
## 1    0           0   0       0       0         0                  0
## 2    0           0   0       0       0         0                  0
## 3    0           0   0       0       0         0                  0
## 4    0           0   0       0       0         0                  0
## 5    0           0   0       0       0         0                  0
## 6    0           0   0       0       0         0                  0

Observamos las reglas que tengan un mayor lift.

reglas %>% 
  arrange(desc(lift)) %>% 
  select(title.x, title.y, support, confidence, lift) %>% 
  head()

##                                                                                                         title.x
## 1 Samurai III: Duel on Ganryu Island (a.k.a. Bushido) (Miyamoto Musashi kanketsuhen: kettÃ´ GanryÃ»jima) (1956)
## 2                       Samurai II: Duel at Ichijoji Temple (Zoku Miyamoto Musashi: IchijÃ´ji no kettÃ´) (1955)
## 3                                                                                       Red Riding: 1983 (2009)
## 4                                                                                       Red Riding: 1980 (2009)
## 5                       Samurai II: Duel at Ichijoji Temple (Zoku Miyamoto Musashi: IchijÃ´ji no kettÃ´) (1955)
## 6                                                         Babylon 5: The Lost Tales - Voices in the Dark (2007)
##                                                                                                         title.y
## 1                       Samurai II: Duel at Ichijoji Temple (Zoku Miyamoto Musashi: IchijÃ´ji no kettÃ´) (1955)
## 2 Samurai III: Duel on Ganryu Island (a.k.a. Bushido) (Miyamoto Musashi kanketsuhen: kettÃ´ GanryÃ»jima) (1956)
## 3                                                                                       Red Riding: 1980 (2009)
## 4                                                                                       Red Riding: 1983 (2009)
## 5                                                         Samurai I: Musashi Miyamoto (Miyamoto Musashi) (1954)
## 6                                     Babylon 5: The Legend of the Rangers: To Live and Die in Starlight (2002)
##        support confidence      lift
## 1 0.0005632054  0.8041237 1210.4946
## 2 0.0005632054  0.8478261 1210.4946
## 3 0.0007076170  0.8750000  918.0407
## 4 0.0007076170  0.7424242  918.0407
## 5 0.0006209700  0.9347826  863.0723
## 6 0.0005271025  0.7087379  838.9336

Las primeras 50 reglas nos muestran secuelas, sagas o series. Para descartar estos casos, eliminamos las reglas en que la primer palabra del titulo x es igual a la primer palabra del titulo y. Observamos las diez primeras reglas de acuerdo al lift.

reglas_2 <- reglas %>% 
  filter(substr(title.x, 1, str_locate(title.x, ' ')-1) != 
           substr(title.y, 1, str_locate(title.y, ' ')-1)) %>%
  filter(substr(title.x, 1, str_locate(title.x, "[[:punct:]]")-1) != 
           substr(title.y, 1, str_locate(title.y, " ")-1)) %>%
  filter(substr(title.x, 1, str_locate(title.x, " ")-1) != 
           substr(title.y, 1, str_locate(title.y, "[[:punct:]]")-1)) %>%
  arrange(desc(lift))

head(reglas_2 %>% 
       select(title.x, title.y, support, confidence, lift),10)

##                              title.x                        title.y
## 1        Song of the Thin Man (1947) Thin Man Goes Home, The (1945)
## 2        Song of the Thin Man (1947)  Shadow of the Thin Man (1941)
## 3     Thin Man Goes Home, The (1945)  Shadow of the Thin Man (1941)
## 4  Ghost of Frankenstein, The (1942)   House of Frankenstein (1944)
## 5        Song of the Thin Man (1947)        Another Thin Man (1939)
## 6     Thin Man Goes Home, The (1945)        Another Thin Man (1939)
## 7                       56 Up (2012)                   35 Up (1991)
## 8      Shadow of the Thin Man (1941)        Another Thin Man (1939)
## 9                7 Plus Seven (1970)               Seven Up! (1964)
## 10 Ghost of Frankenstein, The (1942)     Son of Frankenstein (1939)
##         support confidence     lift
## 1  0.0009458962  0.7751479 497.0026
## 2  0.0010253226  0.8402367 418.5860
## 3  0.0012563812  0.8055556 401.3087
## 4  0.0009025727  0.7485030 382.5182
## 5  0.0009603373  0.7869822 324.3796
## 6  0.0011769548  0.7546296 311.0444
## 7  0.0006137494  0.7522124 310.9736
## 8  0.0014946604  0.7446043 306.9122
## 9  0.0013502487  0.7663934 295.6549
## 10 0.0009097933  0.7544910 288.6512

A pesar de la limpieza anterior, se verifica la presencia de la saga “Thin Man”, la serie “Up” y “Frankenstein”. Se eliminan estos casos y se muestran las 10 primeras reglas.

reglas_3 <- reglas_2 %>% 
  filter(!(grepl("Thin Man", title.x, perl = TRUE) & grepl("Thin Man", title.y, perl = TRUE))) %>%
  filter(!(grepl("[0-9] Up", title.x, perl = TRUE) & grepl("[0-9] Up", title.y, perl = TRUE))) %>%
  filter(!(grepl("Seven", title.x, perl = TRUE) & grepl("Seven", title.y, perl = TRUE))) %>%
  filter(!(grepl("Frankenstein", title.x, perl = TRUE) & grepl("Frankenstein", title.y, perl = TRUE))) %>%
  arrange(desc(lift))
  
head(reglas_3 %>% 
       select(title.x, title.y, support, confidence, lift),10)

##                               title.x
## 1                 7 Plus Seven (1970)
## 2  Superman/Batman: Apocalypse (2010)
## 3         Insidious: Chapter 2 (2013)
## 4                In Like Flint (1967)
## 5                  G. I. Blues (1960)
## 6          Tomb of Ligeia, The (1965)
## 7              PokÃ©mon Heroes (2003)
## 8                    September (1987)
## 9               Nothing Sacred (1937)
## 10           Twentieth Century (1934)
##                                                 title.y      support
## 1                                          21 Up (1977) 0.0014007928
## 2                     Batman: Under the Red Hood (2010) 0.0005054407
## 3                                 Conjuring, The (2013) 0.0006642935
## 4                                  Our Man Flint (1965) 0.0018845718
## 5                                    Blue Hawaii (1961) 0.0009170139
## 6                   Masque of the Red Death, The (1964) 0.0005848671
## 7  Pokemon 4 Ever (a.k.a. PokÃ©mon 4: The Movie) (2002) 0.0017184984
## 8                                      Interiors (1978) 0.0006209700
## 9                               Awful Truth, The (1937) 0.0006065288
## 10                              Awful Truth, The (1937) 0.0007942640
##    confidence     lift
## 1   0.7950820 270.5486
## 2   0.7000000 255.1187
## 3   0.7360000 226.0107
## 4   0.7331461 225.1344
## 5   0.7055556 222.0784
## 6   0.7168142 215.3443
## 7   0.7323077 214.8718
## 8   0.7350427 186.7858
## 9   0.7433628 174.4925
## 10  0.7382550 173.2935

Se verifica que la regla con mayor lift, relaciona la películas “7 Plus Seven” con “21 Up” con un soporte de 0.14%, una confianza de 79.5% y un lift de 270, en segundo lugar se encuentra la relación “Superman/Batman: Apocalypse” y “Batman: Under the Red Hood”. El siguiente gráfico muestra las principales 30 reglas de acuerdo al valor del lift.

reglas_lift <- reglas_3[1:30, ]

reglas_lift$title.x <- substr(reglas_lift$title.x, 
                           1, str_locate(reglas_lift$title.x,"\\(")-1)

reglas_lift$title.y <- substr(reglas_lift$title.y, 
                           1, str_locate(reglas_lift$title.y,"\\(")-1)

reglas_lift$title.x <- trimws(reglas_lift$title.x, "both")
reglas_lift$title.y <- trimws(reglas_lift$title.y, "both")

ggplot(reglas_lift, aes(x = title.x, y = title.y, fill = lift)) +
  geom_tile() + 
  labs(title = "Reglas de asociación", subtitle = "(Primeras 30 de acuerdo al lift)")+
  theme(axis.text.x=element_text(angle = -45, hjust = 0))

Principales 30 reglas de acuerdo al valor de la confianza.

reglas_conf <- reglas_3 %>%
  arrange(desc(confidence))
  
reglas_conf <- reglas_conf[1:30, ]

reglas_conf$title.x <- substr(reglas_conf$title.x, 
                           1, str_locate(reglas_conf$title.x,"\\(")-1)

reglas_conf$title.y <- substr(reglas_conf$title.y, 
                           1, str_locate(reglas_conf$title.y,"\\(")-1)

reglas_conf$title.x <- trimws(reglas_conf$title.x, "both")
reglas_conf$title.y <- trimws(reglas_conf$title.y, "both")

head(reglas_conf %>% 
       select(title.x, title.y, support, confidence, lift),10)

##                         title.x                            title.y
## 1                     Project A                        Matrix, The
## 2                     AutÃ³mata                        Matrix, The
## 3                  Killers, The                       Pulp Fiction
## 4           Thunderbirds Are GO                        Matrix, The
## 5                     Hell Ride                         Fight Club
## 6                     Robot Jox                        Matrix, The
## 7       Morons From Outer Space Star Wars: Episode IV - A New Hope
## 8             Zero Theorem, The                        Matrix, The
## 9                       Star 80                              Fargo
## 10 Gamers, The: Dorkness Rising                        Matrix, The
##         support confidence     lift
## 1  0.0006498523  0.9890110 2.668234
## 2  0.0006281906  0.9886364 2.667223
## 3  0.0004982201  0.9857143 2.028146
## 4  0.0004909995  0.9855072 2.658781
## 5  0.0007292787  0.9805825 3.386122
## 6  0.0007003964  0.9797980 2.643378
## 7  0.0006859552  0.9793814 2.488670
## 8  0.0006209700  0.9772727 2.636565
## 9  0.0006065288  0.9767442 3.126092
## 10 0.0006065288  0.9767442 2.635139

ggplot(reglas_conf, aes(x = title.x, y = title.y, fill = lift)) +
  geom_tile() + 
  labs(title = "Reglas de asociación", subtitle = "(Primeras 30 de acuerdo a la confianza)")+
  theme(axis.text.x=element_text(angle = -45, hjust = 0))

Aprendizaje No Supervisado - Reglas de Asociación

Angelina Garnica

11/9/2019

Algoritmo apriori

Parámetros del modelo

Resumen del modelo