I ran the apriori algorithm on a movie-ratings dataset from Kaggle. The data records the rating each user gave to each movie.
My idea is to treat the movies a user liked (rated 4 or higher out of 5) as that user’s ‘shopping basket’.
The goal of this analysis is to find rules that describe users’ preferences, of the form: if a user liked A and B, they are likely to like C. I imagine Spotify and Netflix do something similar.
library(arules)
library(arulesViz)
library(tidyverse)
library(networkD3)
library(visNetwork)
library(kableExtra)
library(DT)
First, I converted the data into basket (‘transaction’) format: I shortened the movie titles and collected, for each user, the movies rated 4 or higher.
names <- read_csv("movies.csv")
data <- read_csv("ratings.csv")
data <- data %>% filter(rating>=4)
data <- data %>% select(userId, rating, movieId)
str(data)
str(names)
## tibble [48,580 × 3] (S3: tbl_df/tbl/data.frame)
## $ userId : num [1:48580] 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num [1:48580] 4 4 4 5 5 5 4 5 5 5 ...
## $ movieId: num [1:48580] 1 3 6 47 50 101 110 151 157 163 ...
## spc_tbl_ [9,742 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ movieId: num [1:9742] 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr [1:9742] "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
## $ genres : chr [1:9742] "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
## - attr(*, "spec")=
## .. cols(
## .. movieId = col_double(),
## .. title = col_character(),
## .. genres = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
movies.f <- read_csv("movies.csv")
movies <- movies.f %>% select(movieId, title)
movies$title <- gsub("\\s*\\(\\d{4}\\)","",movies$title)
movies$title <- gsub(" ","", movies$title)
movies$title <- gsub(",.*", "", movies$title)
movies$title <- gsub("\\(.*", "", movies$title)
movies$title<- gsub("\"", "", movies$title)
movies$title<- gsub("\'", "", movies$title)
movies
## # A tibble: 9,742 × 2
## movieId title
## <dbl> <chr>
## 1 1 ToyStory
## 2 2 Jumanji
## 3 3 GrumpierOldMen
## 4 4 WaitingtoExhale
## 5 5 FatheroftheBridePartII
## 6 6 Heat
## 7 7 Sabrina
## 8 8 TomandHuck
## 9 9 SuddenDeath
## 10 10 GoldenEye
## # ℹ 9,732 more rows
data.t <- data %>%
left_join(movies, by = "movieId") %>%
select(userId, title)
basket.n <- data %>%
group_by(userId) %>%
summarise(movies = paste(movieId, collapse=",")) %>%
ungroup()
writeLines(basket.n$movies, "basket.n.csv")
basket.t <- data.t %>%
group_by(userId) %>%
summarise(movies = paste(title, collapse=",")) %>%
ungroup()
writeLines(basket.t$movies, "basket.t.csv")
With the data prepared, I could read it in the desired format. I prepared two files: one with movie IDs and one with titles. The number of items differs between them (6293 IDs versus 6121 titles). This is most likely due to the title cleaning: movies that share a title but differ in release year collapse into a single item. I continue the analysis regardless.
movies.t = read.transactions(
"basket.t.csv",
format = "basket",
sep = ",",
skip = 0,
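# the file written by writeLines() has no header row; header = TRUE
# (used for the output shown below) skips the first user's basket
header = FALSE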
)
movies.t
## transactions in sparse format with
## 608 transactions (rows) and
## 6121 items (columns)
movies.n = read.transactions(
"basket.n.csv",
format = "basket",
sep = ",",
skip = 0,
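# the file written by writeLines() has no header row; header = TRUE
# (used for the output shown below) skips the first user's basket
header = FALSE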
)
movies.n
## transactions in sparse format with
## 608 transactions (rows) and
## 6293 items (columns)
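The gap between the two item counts is consistent with several movieId values collapsing onto one cleaned title. The following is a minimal sketch of how such collisions could be detected. It assumes a toy data frame with illustrative rows in the shape of movies.csv; `clean_title()` is a hypothetical helper that repeats the first three gsub() steps used above.

```r
# Illustrative rows in the shape of movies.csv (not read from the real file)
toy <- data.frame(
  movieId = c(7, 915, 2),
  title   = c("Sabrina (1995)", "Sabrina (1954)", "Jumanji (1995)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the cleaning steps applied to movies$title
clean_title <- function(x) {
  x <- gsub("\\s*\\(\\d{4}\\)", "", x)  # drop the "(year)" suffix
  x <- gsub(" ", "", x)                 # remove spaces
  x <- gsub(",.*", "", x)               # cut everything after a comma
  x
}

toy$clean <- clean_title(toy$title)

# Rows whose cleaned title is shared with at least one other movieId
is_dup     <- duplicated(toy$clean) | duplicated(toy$clean, fromLast = TRUE)
collisions <- toy[is_dup, ]
print(collisions)

# Distinct items before and after cleaning
n_ids    <- length(unique(toy$movieId))
n_titles <- length(unique(toy$clean))
```

Run on the full `movies.f` table, the same check would list which titles were merged, which helps decide whether to keep the year in the item label.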
The most frequently liked movies are The Shawshank Redemption, Forrest Gump and Pulp Fiction. These are also among the most popular movies on rating lists such as IMDb’s.
itemFrequencyPlot(
movies.t,
topN = 20,
type = "absolute",
main = "Movie frequency",
cex.names = 0.85
)
I used the apriori algorithm to find rules.
The key parameters of this function are:
supp (support) = 0.15: a value of 0.15 means that a given combination of movies must appear in at least 15% of all transactions (users) to be considered for rule generation. Support measures how frequently an itemset appears in the dataset. A higher support threshold yields fewer rules, each backed by more users.
conf (confidence) = 0.8: a value of 0.8 means that the rule must hold in at least 80% of cases: among users who like all the movies on the left-hand side (LHS) of the rule, at least 80% must also like the movie on the right-hand side (RHS). Confidence measures how reliable a rule is, that is, how often the antecedent (LHS) leads to the consequent (RHS).
These parameters control the number and quality of association rules. Lowering supp yields more rules, but they rest on fewer users and may be less meaningful, while raising conf gives more reliable but fewer rules. I experimented with the parameters and settled on values that yield 19 rules.
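To make the two definitions concrete, here is a small base-R sketch that computes support and confidence by hand. The five baskets are made up, and `support()` and `confidence()` are hypothetical helper names, not functions from arules.

```r
# Five made-up baskets (one per user); item names are placeholders
baskets <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# confidence(lhs => rhs) = support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_ab   <- support(c("A", "B"), baskets)          # 3 of 5 baskets
conf_ab_c <- confidence(c("A", "B"), "C", baskets)  # 2 of those 3 also hold C

print(c(support = supp_ab, confidence = conf_ab_c))
```

With thresholds of supp = 0.15 and conf = 0.8, the toy rule {A,B} => {C} would pass the support filter but fail the confidence filter. For scale, the apriori log in this analysis reports an absolute minimum support count of 91 for supp = 0.15 on 608 transactions.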
rules.t = apriori(movies.t, parameter = list(supp = 0.15, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 91
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6121 item(s), 608 transaction(s)] done [0.01s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
I did the same for the data with numeric IDs, trying different parameters in order to visualize different rule networks later. With the same parameters (supp = 0.15, conf = 0.8), the numeric data gives the same number of rules (19) as the title data. That is reassuring, given that the two datasets differ in their number of items.
rules.n = apriori(movies.n, parameter = list(supp = 0.1, conf = 0.8))
rules.n2 = apriori(movies.n, parameter = list(supp = 0.15, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 60
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6293 item(s), 608 transaction(s)] done [0.01s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [303 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 91
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6293 item(s), 608 transaction(s)] done [0.01s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
I sorted the rules by increasing lift. Lift measures how much more likely the consequent (RHS) is to occur when the antecedent (LHS) is present than it is overall; a lift above 1 indicates a positive association. Normally it makes more sense to sort from highest to lowest lift, because higher lift means a stronger association. In my data, however, many of the high-lift rules link movies from the same series or by the same director. To find more interesting associations I could change the apriori parameters or look at rules with lower lift.
One interesting rule is ‘{Seven} => {PulpFiction}’. Both are classics, but unlike most of the other rules, which describe obvious connections, this one does not link films from a single series.
rules.t<-sort(rules.t, by="lift", decreasing=FALSE)
table <- as(rules.t, "data.frame")
kable(table, format = "html", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%", height = "300px")
id | rules | support | confidence | coverage | lift | count |
---|---|---|---|---|---|---|
13 | {ShawshankRedemption,UsualSuspects} => {PulpFiction} | 0.16 | 0.81 | 0.19 | 2.01 | 95 |
7 | {Seven} => {PulpFiction} | 0.18 | 0.81 | 0.23 | 2.01 | 112 |
8 | {StarWars:EpisodeVI-ReturnoftheJedi} => {StarWars:EpisodeIV-ANewHope} | 0.21 | 0.84 | 0.25 | 2.55 | 125 |
9 | {StarWars:EpisodeV-TheEmpireStrikesBack} => {StarWars:EpisodeIV-ANewHope} | 0.24 | 0.86 | 0.27 | 2.60 | 143 |
18 | {Matrix,StarWars:EpisodeV-TheEmpireStrikesBack} => {StarWars:EpisodeIV-ANewHope} | 0.16 | 0.86 | 0.19 | 2.62 | 99 |
14 | {StarWars:EpisodeV-TheEmpireStrikesBack,StarWars:EpisodeVI-ReturnoftheJedi} => {StarWars:EpisodeIV-ANewHope} | 0.17 | 0.92 | 0.19 | 2.80 | 106 |
16 | {RaidersoftheLostArk,StarWars:EpisodeV-TheEmpireStrikesBack} => {StarWars:EpisodeIV-ANewHope} | 0.16 | 0.93 | 0.17 | 2.84 | 98 |
19 | {Matrix,StarWars:EpisodeIV-ANewHope} => {StarWars:EpisodeV-TheEmpireStrikesBack} | 0.16 | 0.85 | 0.19 | 3.08 | 99 |
15 | {StarWars:EpisodeIV-ANewHope,StarWars:EpisodeVI-ReturnoftheJedi} => {StarWars:EpisodeV-TheEmpireStrikesBack} | 0.17 | 0.85 | 0.21 | 3.09 | 106 |
17 | {RaidersoftheLostArk,StarWars:EpisodeIV-ANewHope} => {StarWars:EpisodeV-TheEmpireStrikesBack} | 0.16 | 0.86 | 0.19 | 3.13 | 98 |
4 | {LordoftheRings:TheTwoTowers} => {LordoftheRings:TheFellowshipoftheRing} | 0.18 | 0.85 | 0.22 | 3.56 | 112 |
5 | {LordoftheRings:TheReturnoftheKing} => {LordoftheRings:TheFellowshipoftheRing} | 0.20 | 0.86 | 0.23 | 3.60 | 121 |
6 | {LordoftheRings:TheFellowshipoftheRing} => {LordoftheRings:TheReturnoftheKing} | 0.20 | 0.83 | 0.24 | 3.60 | 121 |
1 | {Godfather:PartII} => {Godfather} | 0.17 | 0.95 | 0.18 | 3.67 | 102 |
10 | {LordoftheRings:TheReturnoftheKing,LordoftheRings:TheTwoTowers} => {LordoftheRings:TheFellowshipoftheRing} | 0.17 | 0.91 | 0.19 | 3.77 | 106 |
2 | {LordoftheRings:TheTwoTowers} => {LordoftheRings:TheReturnoftheKing} | 0.19 | 0.89 | 0.22 | 3.88 | 117 |
3 | {LordoftheRings:TheReturnoftheKing} => {LordoftheRings:TheTwoTowers} | 0.19 | 0.84 | 0.23 | 3.88 | 117 |
12 | {LordoftheRings:TheFellowshipoftheRing,LordoftheRings:TheReturnoftheKing} => {LordoftheRings:TheTwoTowers} | 0.17 | 0.88 | 0.20 | 4.07 | 106 |
11 | {LordoftheRings:TheFellowshipoftheRing,LordoftheRings:TheTwoTowers} => {LordoftheRings:TheReturnoftheKing} | 0.17 | 0.95 | 0.18 | 4.11 | 106 |
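The lift column can be reproduced by hand from supports. The sketch below uses made-up baskets with placeholder item names; `support()` and `lift()` are hypothetical helpers, not arules functions. The last line only reuses the rounded values printed for {Seven} => {PulpFiction}.

```r
# Made-up baskets; item names are placeholders, not rows from the data
baskets <- list(
  c("X", "Y"), c("X", "Y"), c("X"), c("Y"),
  c("Z"), c("Z"), c("Z"), c("Z")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# lift(lhs => rhs) = support(lhs and rhs) / (support(lhs) * support(rhs))
lift <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) /
    (support(lhs, baskets) * support(rhs, baskets))
}

lift_xy <- lift("X", "Y", baskets)  # (2/8) / ((3/8) * (3/8)) = 16/9
lift_xz <- lift("X", "Z", baskets)  # 0: X and Z never appear together

# Since lift = confidence / support(rhs), the RHS support can be read
# back from a table row: rounded values for {Seven} => {PulpFiction}
implied_rhs_support <- 0.81 / 2.01

print(c(lift_xy = lift_xy, lift_xz = lift_xz, implied = implied_rhs_support))
```

Read this way, the {Seven} => {PulpFiction} row suggests that roughly 40% of users liked Pulp Fiction overall, and that liking Seven about doubles that share.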
I plotted network graphs for the rules from the numeric data, since the same plots with titles would be hard to read. The two plots show the rules found with supp = 0.15 and supp = 0.1 (both with conf = 0.8), respectively.
You can look up a title in the table below by filtering on movieId.
plot(rules.n2, method="graph", measure="support", shading="lift")
plot(rules.n, method="graph", measure="support", shading="lift")
datatable(movies.f,
options = list(pageLength = 10, searchHighlight = TRUE),
rownames = FALSE)
With the arulesViz library, the connections can be visualized in other ways as well.
plot(rules.t, method="paracoord", control=list(reorder=TRUE))
It is also possible to restrict the search to rules that involve a given product (movie). I chose Spirited Away as the right-hand side and visualized the resulting network using visNetwork. Most of the rules I found are obvious connections related to Studio Ghibli: someone who likes one Ghibli movie tends to like another.
rules.spirited<-apriori(movies.t, parameter=list(supp=0.025,conf = 0.8),
appearance=list(default="lhs", rhs="SpiritedAway"), control=list(verbose=F))
rules.spirited<-sort(rules.spirited, by="lift", decreasing=F)
table.sp <- as(rules.spirited, "data.frame")
kable(table.sp, format = "html", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%", height = "300px")
plot(rules.spirited, method="graph", engine="visNetwork")
id | rules | support | confidence | coverage | lift | count |
---|---|---|---|---|---|---|
27 | {AmericanBeauty,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.80 | 0.03 | 7.05 | 16 |
31 | {PrincessMononoke,StarWars:EpisodeV-TheEmpireStrikesBack} => {SpiritedAway} | 0.03 | 0.80 | 0.03 | 7.05 | 16 |
34 | {PrincessMononoke,ShawshankRedemption} => {SpiritedAway} | 0.03 | 0.80 | 0.03 | 7.05 | 16 |
57 | {Matrix,PrincessMononoke,StarWars:EpisodeIV-ANewHope} => {SpiritedAway} | 0.03 | 0.80 | 0.03 | 7.05 | 16 |
62 | {Alien,KillBill:Vol.1,Matrix,PulpFiction,Shining,SilenceoftheLambs} => {SpiritedAway} | 0.03 | 0.80 | 0.03 | 7.05 | 16 |
8 | {FightClub,HowlsMovingCastle} => {SpiritedAway} | 0.03 | 0.81 | 0.03 | 7.13 | 17 |
33 | {PrincessMononoke,StarWars:EpisodeIV-ANewHope} => {SpiritedAway} | 0.03 | 0.81 | 0.03 | 7.13 | 17 |
37 | {PrincessMononoke,PulpFiction} => {SpiritedAway} | 0.03 | 0.81 | 0.03 | 7.13 | 17 |
36 | {Matrix,PrincessMononoke} => {SpiritedAway} | 0.04 | 0.81 | 0.04 | 7.18 | 22 |
4 | {PrincessMononoke} => {SpiritedAway} | 0.04 | 0.82 | 0.05 | 7.21 | 27 |
2 | {HowlsMovingCastle} => {SpiritedAway} | 0.04 | 0.82 | 0.05 | 7.24 | 23 |
30 | {FightClub,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.83 | 0.04 | 7.28 | 19 |
24 | {Memento,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.83 | 0.04 | 7.34 | 20 |
19 | {PrincessBride,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
22 | {Alien,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
32 | {ForrestGump,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
47 | {Matrix,MontyPythonandtheHolyGrail,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
53 | {PrincessMononoke,StarWars:EpisodeV-TheEmpireStrikesBack,StarWars:EpisodeVI-ReturnoftheJedi} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
58 | {PrincessMononoke,PulpFiction,SilenceoftheLambs} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
61 | {Matrix,PrincessMononoke,StarWars:EpisodeIV-ANewHope,StarWars:EpisodeVI-ReturnoftheJedi} => {SpiritedAway} | 0.03 | 0.84 | 0.03 | 7.42 | 16 |
35 | {PrincessMononoke,SilenceoftheLambs} => {SpiritedAway} | 0.03 | 0.85 | 0.03 | 7.49 | 17 |
51 | {Matrix,Memento,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.85 | 0.03 | 7.49 | 17 |
54 | {PrincessMononoke,StarWars:EpisodeIV-ANewHope,StarWars:EpisodeVI-ReturnoftheJedi} => {SpiritedAway} | 0.03 | 0.85 | 0.03 | 7.49 | 17 |
55 | {Matrix,PrincessMononoke,StarWars:EpisodeVI-ReturnoftheJedi} => {SpiritedAway} | 0.03 | 0.85 | 0.03 | 7.49 | 17 |
56 | {FightClub,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.85 | 0.03 | 7.49 | 17 |
29 | {PrincessMononoke,RaidersoftheLostArk} => {SpiritedAway} | 0.03 | 0.86 | 0.03 | 7.55 | 18 |
28 | {PrincessMononoke,StarWars:EpisodeVI-ReturnoftheJedi} => {SpiritedAway} | 0.03 | 0.86 | 0.04 | 7.61 | 19 |
20 | {MontyPythonandtheHolyGrail,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.87 | 0.04 | 7.66 | 20 |
3 | {MyNeighborTotoro} => {SpiritedAway} | 0.04 | 0.88 | 0.04 | 7.79 | 23 |
9 | {HowlsMovingCastle,StarWars:EpisodeV-TheEmpireStrikesBack} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.83 | 16 |
52 | {LordoftheRings:TheFellowshipoftheRing,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.83 | 16 |
10 | {HowlsMovingCastle,StarWars:EpisodeIV-ANewHope} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.88 | 17 |
11 | {HowlsMovingCastle,Matrix} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.88 | 17 |
15 | {KillBill:Vol.2,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.88 | 17 |
17 | {KillBill:Vol.1,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.88 | 17 |
42 | {KillBill:Vol.1,KillBill:Vol.2,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.89 | 0.03 | 7.88 | 17 |
18 | {EternalSunshineoftheSpotlessMind,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.90 | 0.03 | 7.93 | 18 |
26 | {LordoftheRings:TheFellowshipoftheRing,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.90 | 0.03 | 7.93 | 18 |
16 | {Amelie,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.91 | 0.04 | 8.01 | 20 |
14 | {LordoftheRings:TheFellowshipoftheRing,MyNeighborTotoro} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
25 | {PrincessMononoke,SchindlersList} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
43 | {KillBill:Vol.2,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
44 | {Amelie,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
45 | {KillBill:Vol.1,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
46 | {Memento,MontyPythonandtheHolyGrail,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
49 | {LordoftheRings:TheFellowshipoftheRing,LordoftheRings:TheReturnoftheKing,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
50 | {LordoftheRings:TheReturnoftheKing,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
60 | {KillBill:Vol.1,KillBill:Vol.2,Matrix,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.29 | 16 |
1 | {Laputa:CastleintheSky} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.32 | 17 |
13 | {Memento,MyNeighborTotoro} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.32 | 17 |
23 | {LordoftheRings:TheReturnoftheKing,PrincessMononoke} => {SpiritedAway} | 0.03 | 0.94 | 0.03 | 8.32 | 17 |
6 | {HowlsMovingCastle,LordoftheRings:TheReturnoftheKing} => {SpiritedAway} | 0.03 | 0.95 | 0.03 | 8.35 | 18 |
40 | {HowlsMovingCastle,LordoftheRings:TheFellowshipoftheRing,LordoftheRings:TheReturnoftheKing} => {SpiritedAway} | 0.03 | 0.95 | 0.03 | 8.35 | 18 |
7 | {HowlsMovingCastle,LordoftheRings:TheFellowshipoftheRing} => {SpiritedAway} | 0.03 | 0.95 | 0.03 | 8.37 | 19 |
5 | {HowlsMovingCastle,LordoftheRings:TheTwoTowers} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
12 | {Amelie,MyNeighborTotoro} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 17 |
21 | {LordoftheRings:TheTwoTowers,PrincessMononoke} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
38 | {HowlsMovingCastle,LordoftheRings:TheReturnoftheKing,LordoftheRings:TheTwoTowers} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
39 | {HowlsMovingCastle,LordoftheRings:TheFellowshipoftheRing,LordoftheRings:TheTwoTowers} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
41 | {HowlsMovingCastle,LordoftheRings:TheFellowshipoftheRing,StarWars:EpisodeIV-ANewHope} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
48 | {LordoftheRings:TheReturnoftheKing,LordoftheRings:TheTwoTowers,PrincessMononoke} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
59 | {HowlsMovingCastle,LordoftheRings:TheFellowshipoftheRing,LordoftheRings:TheReturnoftheKing,LordoftheRings:TheTwoTowers} => {SpiritedAway} | 0.03 | 1.00 | 0.03 | 8.81 | 16 |
Inspired by a paper by Honorata Bogusz, I decided to create a Sankey diagram for Spirited Away. It is a very convenient way to visualize connections.
rules_spirited <- as(rules.spirited, "data.frame")
str(rules_spirited)
# split each "{lhs} => {rhs}" string; str_trim() drops the space left before "=>"
rules_spirited <- rules_spirited %>%
mutate(lhs = str_trim(str_extract(rules, "^[^=]+")),
rhs = str_extract(rules, "(?<=\\=> ).*")) %>%
select(lhs,rhs,support)
rules_spirited
nodes <- data.frame(name = unique(c(rules_spirited$lhs, rules_spirited$rhs)))
links <- rules_spirited %>%
select(lhs, rhs, support) %>%
rename(source = lhs, target = rhs, value = support) %>%
mutate(source = match(source, nodes$name) - 1,
target = match(target, nodes$name) - 1)
head(nodes)
head(links)
sankeyNetwork(
Links = links,
Nodes = nodes,
Source = "source",
Target = "target",
Value = "value",
NodeID = "name",
fontSize = 8,
nodeWidth = 30,
width = "100%",
height = "600px"
)
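The `- 1` in the link construction matters because sankeyNetwork() expects zero-based node indices, while match() returns one-based positions. A minimal base-R sketch of that indexing step follows; `toy_rules` is made-up data in the same lhs / rhs / support layout as `rules_spirited`.

```r
# Made-up rules in the same layout as rules_spirited
toy_rules <- data.frame(
  lhs     = c("{A}", "{B}", "{A,B}"),
  rhs     = c("{T}", "{T}", "{T}"),
  support = c(0.04, 0.04, 0.03),
  stringsAsFactors = FALSE
)

# Node table: every distinct lhs or rhs label, in order of first appearance
nodes <- data.frame(
  name = unique(c(toy_rules$lhs, toy_rules$rhs)),
  stringsAsFactors = FALSE
)

# match() gives 1-based row positions in `nodes`; subtracting 1 makes them 0-based
links <- data.frame(
  source = match(toy_rules$lhs, nodes$name) - 1,
  target = match(toy_rules$rhs, nodes$name) - 1,
  value  = toy_rules$support
)

print(nodes)
print(links)
```

A quick sanity check before plotting is that every index in `links` lies between 0 and `nrow(nodes) - 1`.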
The analysis showed that the most frequently liked movies were The Shawshank Redemption, Forrest Gump, and Pulp Fiction, in line with popular IMDb rankings. Sorting rules by lift highlighted strong associations within franchises like Lord of the Rings and Star Wars, but also more interesting connections such as Seven leading to Pulp Fiction. Filtering for Spirited Away confirmed that users who enjoy one Ghibli film are highly likely to enjoy others. Visualizations, including network graphs and Sankey diagrams, helped illustrate these relationships. Some information was lost in the title cleaning, which merged movies that share a title, and further tuning of parameters or genre-based filtering could reveal deeper insights.
Overall, the results show that association rule mining produces sensible, expected output for movie recommendations. I did not discover anything new, but the analysis showed that the apriori method can indeed find patterns in movie preferences. Further work could include a more thorough EDA and a search for less obvious connections.