En este prĆ”ctico se aplicarĆ” el algoritmo apriori de reglas de asociación para encontrar relaciones entre las pelĆculas vistas por un conjunto de usuarios.
setwd('C:/DiploDatos/Practico_ANS')
movies <- read.csv("ml-20m/ml-20m/movies.csv",
colClasses = c("integer","character","character"),
sep = ",",
stringsAsFactors = FALSE)
ratings <- read.csv("ml-20m/ml-20m/ratings.csv",
colClasses = c("integer","integer","numeric","character"),
sep=",",
stringsAsFactors = FALSE)
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
head(ratings)
## userId movieId rating timestamp
## 1 1 2 3.5 1112486027
## 2 1 29 3.5 1112484676
## 3 1 32 3.5 1112484819
## 4 1 47 3.5 1112484727
## 5 1 50 3.5 1112484580
## 6 1 112 3.5 1094785740
Las bases cuentan con 138493 usuarios Ćŗnicos y 26744 pelĆculas Ćŗnicas identificadas por Id.
Definimos:
- Transacción: Todas las peliculas vistas por cada usuario
- Item: cada pelĆcula
- Soporte del item A: nĆŗmero de transacciones que contienen A dividido entre el total de transacciones.
A los fines de realizar un primer anƔlisis exploratorio de las bases, calculamos el soporte por Item (k=1).
usuarios <- length(unique(ratings$userId))
soporte <- ratings %>%
group_by(movieId) %>%
summarise(q_movie = n_distinct(userId),
soporte = q_movie/usuarios)
soporte <- merge(soporte, movies, by = "movieId", all.x = T)
Las 10 pelĆculas mĆ”s vistas:
top_n(soporte, 10, q_movie) %>%
arrange(desc(q_movie))
## movieId q_movie soporte title
## 1 296 67310 0.4860173 Pulp Fiction (1994)
## 2 356 66172 0.4778003 Forrest Gump (1994)
## 3 318 63366 0.4575394 Shawshank Redemption, The (1994)
## 4 593 63299 0.4570556 Silence of the Lambs, The (1991)
## 5 480 59715 0.4311770 Jurassic Park (1993)
## 6 260 54502 0.3935361 Star Wars: Episode IV - A New Hope (1977)
## 7 110 53769 0.3882434 Braveheart (1995)
## 8 589 52244 0.3772321 Terminator 2: Judgment Day (1991)
## 9 2571 51334 0.3706613 Matrix, The (1999)
## 10 527 50054 0.3614190 Schindler's List (1993)
## genres
## 1 Comedy|Crime|Drama|Thriller
## 2 Comedy|Drama|Romance|War
## 3 Crime|Drama
## 4 Crime|Horror|Thriller
## 5 Action|Adventure|Sci-Fi|Thriller
## 6 Action|Adventure|Sci-Fi
## 7 Action|Drama|War
## 8 Action|Sci-Fi
## 9 Action|Sci-Fi|Thriller
## 10 Drama|War
La pelĆcula Pulp Fiction fue la mĆ”s vista con un total de 67.310 usuarios distintos sobre un total de 138.493, su soporte es de 0.48. En segundo lugar se encuentra Forrest Gump con 66.172 visualizaciones.
A continuación se aplicarÔ el algoritmo apriori del paquete arules y sus principales métricas.
user_item_matrix <- as(split(ratings[,"movieId"], ratings[,"userId"]),
"transactions")
user_item_matrix
## transactions in sparse format with
## 138493 transactions (rows) and
## 26744 items (columns)
trans <- size(user_item_matrix)
summary(trans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 35.0 68.0 144.4 155.0 9254.0
En promedio cada usuario vió 144 peliculas, con un minimo de 20 y mÔximo de 9254.
quantile(trans, probs = seq(0,1,0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 20 24 30 39 51 68 93 127 193 334 9254
A continuación se presenta un grĆ”fico con la distribución del tamaƱo de las transaccuones. Se encuentra concentrada en valores bajos, la mitad de los usuarios han evaluado hasta 68 pelĆculas.
data.frame(trans) %>%
ggplot(aes(x = trans)) +
geom_histogram() +
labs(title = "Distribución del tamaño de las transacciones",
x = "TamaƱo (Cantidad de pelĆculas)") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Para establecer el soporte definimos transacciones con por lo menos 68 items (Mediana).
parametros = list(
supp = 68 / dim(user_item_matrix)[1],
conf = 0.7,
maxlen = 2)
reglas <- apriori(user_item_matrix, parameter = parametros)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support
## 0.7 0.1 1 none FALSE TRUE 5 0.0004909995
## minlen maxlen target ext
## 1 2 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 68
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[26744 item(s), 138493 transaction(s)] done [9.19s].
## sorting and recoding items ... [9647 item(s)] done [0.67s].
## creating transaction tree ... done [0.19s].
## checking subsets of size 1 2
## Warning in apriori(user_item_matrix, parameter = parametros): Mining
## stopped (maxlen reached). Only patterns up to a length of 2 returned!
## done [10.64s].
## writing ... [298319 rule(s)] done [0.32s].
## creating S4 object ... done [0.13s].
summary(reglas)
## set of 298319 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 298319
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift count
## Min. :0.0004910 Min. :0.7000 Min. : 1.440 Min. : 68.0
## 1st Qu.:0.0007798 1st Qu.:0.7250 1st Qu.: 2.343 1st Qu.: 108.0
## Median :0.0014152 Median :0.7557 Median : 3.257 Median : 196.0
## Mean :0.0044903 Mean :0.7667 Mean : 4.399 Mean : 621.9
## 3rd Qu.:0.0034009 3rd Qu.:0.7985 3rd Qu.: 4.681 3rd Qu.: 471.0
## Max. :0.3445156 Max. :0.9890 Max. :1210.495 Max. :47713.0
##
## mining info:
## data ntransactions support confidence
## user_item_matrix 138493 0.0004909995 0.7
A continuación se presenta la relación entre el soporte y la confianza de cada regla.
plot(reglas, method = "two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Relación entre el soporte y el lift.
plot(reglas, measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Damos formato a la base para ver el nombre y gĆ©nero de las pelĆculas.
reglas <- as(reglas,"data.frame")
rules = sapply(reglas$rules,function(x){
x = gsub("[\\{\\}]", "", regmatches(x, gregexpr("\\{.*\\}", x))[[1]])
x = gsub("=>",",",x)
x = str_replace_all(x," ","")
return( x )
})
rules <- as.character(rules)
rules <- str_split(rules,",")
reglas$lhs_movie <- sapply(rules, "[[", 1)
reglas$rhs_movie <- sapply(rules , "[[", 2)
reglas$lhs_movie = as.numeric(reglas$lhs_movie)
reglas$rhs_movie = as.numeric(reglas$rhs_movie)
all_genres <- unique(unlist(str_split(movies$genres,"\\|")))
for(genre in all_genres){
movies[genre] =
ifelse((str_detect(movies$genres,genre) |
str_detect(movies$genres,"no genres")) , 1 , 0)
}
movies$genres <- NULL
reglas <- reglas %>%
left_join(movies[, c(1,2)], by=c("lhs_movie" = "movieId"))
reglas <- reglas %>%
left_join(movies[, c(1,2)], by=c("rhs_movie" = "movieId"))
head(movies)
## movieId title Adventure Animation Children
## 1 1 Toy Story (1995) 1 1 1
## 2 2 Jumanji (1995) 1 0 1
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 0 0 0
## Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi
## 1 1 1 0 0 0 0 0 0 0 0
## 2 0 1 0 0 0 0 0 0 0 0
## 3 1 0 1 0 0 0 0 0 0 0
## 4 1 0 1 1 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 1 1 1 0 0 0
## IMAX Documentary War Musical Western Film-Noir (no genres listed)
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
Observamos las reglas que tengan un mayor lift.
reglas %>%
arrange(desc(lift)) %>%
select(title.x, title.y, support, confidence, lift) %>%
head()
## title.x
## 1 Samurai III: Duel on Ganryu Island (a.k.a. Bushido) (Miyamoto Musashi kanketsuhen: kettĆĀ“ GanryĆĀ»jima) (1956)
## 2 Samurai II: Duel at Ichijoji Temple (Zoku Miyamoto Musashi: IchijĆĀ“ji no kettĆĀ“) (1955)
## 3 Red Riding: 1983 (2009)
## 4 Red Riding: 1980 (2009)
## 5 Samurai II: Duel at Ichijoji Temple (Zoku Miyamoto Musashi: IchijĆĀ“ji no kettĆĀ“) (1955)
## 6 Babylon 5: The Lost Tales - Voices in the Dark (2007)
## title.y
## 1 Samurai II: Duel at Ichijoji Temple (Zoku Miyamoto Musashi: IchijĆĀ“ji no kettĆĀ“) (1955)
## 2 Samurai III: Duel on Ganryu Island (a.k.a. Bushido) (Miyamoto Musashi kanketsuhen: kettĆĀ“ GanryĆĀ»jima) (1956)
## 3 Red Riding: 1980 (2009)
## 4 Red Riding: 1983 (2009)
## 5 Samurai I: Musashi Miyamoto (Miyamoto Musashi) (1954)
## 6 Babylon 5: The Legend of the Rangers: To Live and Die in Starlight (2002)
## support confidence lift
## 1 0.0005632054 0.8041237 1210.4946
## 2 0.0005632054 0.8478261 1210.4946
## 3 0.0007076170 0.8750000 918.0407
## 4 0.0007076170 0.7424242 918.0407
## 5 0.0006209700 0.9347826 863.0723
## 6 0.0005271025 0.7087379 838.9336
Las primeras 50 reglas nos muestran secuelas, sagas o series. Para descartar estos casos, eliminamos las reglas en que la primer palabra del titulo x es igual a la primer palabra del titulo y. Observamos las diez primeras reglas de acuerdo al lift.
reglas_2 <- reglas %>%
filter(substr(title.x, 1, str_locate(title.x, ' ')-1) !=
substr(title.y, 1, str_locate(title.y, ' ')-1)) %>%
filter(substr(title.x, 1, str_locate(title.x, "[[:punct:]]")-1) !=
substr(title.y, 1, str_locate(title.y, " ")-1)) %>%
filter(substr(title.x, 1, str_locate(title.x, " ")-1) !=
substr(title.y, 1, str_locate(title.y, "[[:punct:]]")-1)) %>%
arrange(desc(lift))
head(reglas_2 %>%
select(title.x, title.y, support, confidence, lift),10)
## title.x title.y
## 1 Song of the Thin Man (1947) Thin Man Goes Home, The (1945)
## 2 Song of the Thin Man (1947) Shadow of the Thin Man (1941)
## 3 Thin Man Goes Home, The (1945) Shadow of the Thin Man (1941)
## 4 Ghost of Frankenstein, The (1942) House of Frankenstein (1944)
## 5 Song of the Thin Man (1947) Another Thin Man (1939)
## 6 Thin Man Goes Home, The (1945) Another Thin Man (1939)
## 7 56 Up (2012) 35 Up (1991)
## 8 Shadow of the Thin Man (1941) Another Thin Man (1939)
## 9 7 Plus Seven (1970) Seven Up! (1964)
## 10 Ghost of Frankenstein, The (1942) Son of Frankenstein (1939)
## support confidence lift
## 1 0.0009458962 0.7751479 497.0026
## 2 0.0010253226 0.8402367 418.5860
## 3 0.0012563812 0.8055556 401.3087
## 4 0.0009025727 0.7485030 382.5182
## 5 0.0009603373 0.7869822 324.3796
## 6 0.0011769548 0.7546296 311.0444
## 7 0.0006137494 0.7522124 310.9736
## 8 0.0014946604 0.7446043 306.9122
## 9 0.0013502487 0.7663934 295.6549
## 10 0.0009097933 0.7544910 288.6512
A pesar de la limpieza anterior, se verifica la presencia de la saga āThin Manā, la serie āUpā y āFrankensteinā. Se eliminan estos casos y se muestran las 10 primeras reglas.
reglas_3 <- reglas_2 %>%
filter(!(grepl("Thin Man", title.x, perl = TRUE) & grepl("Thin Man", title.y, perl = TRUE))) %>%
filter(!(grepl("[0-9] Up", title.x, perl = TRUE) & grepl("[0-9] Up", title.y, perl = TRUE))) %>%
filter(!(grepl("Seven", title.x, perl = TRUE) & grepl("Seven", title.y, perl = TRUE))) %>%
filter(!(grepl("Frankenstein", title.x, perl = TRUE) & grepl("Frankenstein", title.y, perl = TRUE))) %>%
arrange(desc(lift))
head(reglas_3 %>%
select(title.x, title.y, support, confidence, lift),10)
## title.x
## 1 7 Plus Seven (1970)
## 2 Superman/Batman: Apocalypse (2010)
## 3 Insidious: Chapter 2 (2013)
## 4 In Like Flint (1967)
## 5 G. I. Blues (1960)
## 6 Tomb of Ligeia, The (1965)
## 7 PokĆĀ©mon Heroes (2003)
## 8 September (1987)
## 9 Nothing Sacred (1937)
## 10 Twentieth Century (1934)
## title.y support
## 1 21 Up (1977) 0.0014007928
## 2 Batman: Under the Red Hood (2010) 0.0005054407
## 3 Conjuring, The (2013) 0.0006642935
## 4 Our Man Flint (1965) 0.0018845718
## 5 Blue Hawaii (1961) 0.0009170139
## 6 Masque of the Red Death, The (1964) 0.0005848671
## 7 Pokemon 4 Ever (a.k.a. PokĆĀ©mon 4: The Movie) (2002) 0.0017184984
## 8 Interiors (1978) 0.0006209700
## 9 Awful Truth, The (1937) 0.0006065288
## 10 Awful Truth, The (1937) 0.0007942640
## confidence lift
## 1 0.7950820 270.5486
## 2 0.7000000 255.1187
## 3 0.7360000 226.0107
## 4 0.7331461 225.1344
## 5 0.7055556 222.0784
## 6 0.7168142 215.3443
## 7 0.7323077 214.8718
## 8 0.7350427 186.7858
## 9 0.7433628 174.4925
## 10 0.7382550 173.2935
Se verifica que la regla con mayor lift, relaciona la pelĆculas ā7 Plus Sevenā con ā21 Upā con un soporte de 0.14%, una confianza de 79.5% y un lift de 270, en segundo lugar se encuentra la relación āSuperman/Batman: Apocalypseā y āBatman: Under the Red Hoodā. El siguiente grĆ”fico muestra las principales 30 reglas de acuerdo al valor del lift.
reglas_lift <- reglas_3[1:30, ]
reglas_lift$title.x <- substr(reglas_lift$title.x,
1, str_locate(reglas_lift$title.x,"\\(")-1)
reglas_lift$title.y <- substr(reglas_lift$title.y,
1, str_locate(reglas_lift$title.y,"\\(")-1)
reglas_lift$title.x <- trimws(reglas_lift$title.x, "both")
reglas_lift$title.y <- trimws(reglas_lift$title.y, "both")
ggplot(reglas_lift, aes(x = title.x, y = title.y, fill = lift)) +
geom_tile() +
labs(title = "Reglas de asociación", subtitle = "(Primeras 30 de acuerdo al lift)")+
theme(axis.text.x=element_text(angle = -45, hjust = 0))
Principales 30 reglas de acuerdo al valor de la confianza.
reglas_conf <- reglas_3 %>%
arrange(desc(confidence))
reglas_conf <- reglas_conf[1:30, ]
reglas_conf$title.x <- substr(reglas_conf$title.x,
1, str_locate(reglas_conf$title.x,"\\(")-1)
reglas_conf$title.y <- substr(reglas_conf$title.y,
1, str_locate(reglas_conf$title.y,"\\(")-1)
reglas_conf$title.x <- trimws(reglas_conf$title.x, "both")
reglas_conf$title.y <- trimws(reglas_conf$title.y, "both")
head(reglas_conf %>%
select(title.x, title.y, support, confidence, lift),10)
## title.x title.y
## 1 Project A Matrix, The
## 2 Autómata Matrix, The
## 3 Killers, The Pulp Fiction
## 4 Thunderbirds Are GO Matrix, The
## 5 Hell Ride Fight Club
## 6 Robot Jox Matrix, The
## 7 Morons From Outer Space Star Wars: Episode IV - A New Hope
## 8 Zero Theorem, The Matrix, The
## 9 Star 80 Fargo
## 10 Gamers, The: Dorkness Rising Matrix, The
## support confidence lift
## 1 0.0006498523 0.9890110 2.668234
## 2 0.0006281906 0.9886364 2.667223
## 3 0.0004982201 0.9857143 2.028146
## 4 0.0004909995 0.9855072 2.658781
## 5 0.0007292787 0.9805825 3.386122
## 6 0.0007003964 0.9797980 2.643378
## 7 0.0006859552 0.9793814 2.488670
## 8 0.0006209700 0.9772727 2.636565
## 9 0.0006065288 0.9767442 3.126092
## 10 0.0006065288 0.9767442 2.635139
ggplot(reglas_conf, aes(x = title.x, y = title.y, fill = lift)) +
geom_tile() +
labs(title = "Reglas de asociación", subtitle = "(Primeras 30 de acuerdo a la confianza)")+
theme(axis.text.x=element_text(angle = -45, hjust = 0))