The aim of this project is to help viewers find films worth watching based on the movies they have already seen. To this end, user movie ratings are analyzed to uncover association rules among films. Association rule mining is a data mining technique that identifies relationships between items in large datasets, typically of the form “if one item is present, another is likely to be present as well.” Because the rules are derived from patterns in user preferences, they are a natural basis for personalized movie recommendations.
The data consist of two tables: “movie”, which contains information about 27278 movies, and “rating”, which contains movie ratings given between January 09, 1995 and March 31, 2015 by 138493 users, each of whom rated at least 20 movies. The data come primarily from the MovieLens site and are hosted on Kaggle as several files. Direct link to the dataset: https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset
Loading of the data
Information about our datasets
## [1] 27278 3
## [1] 20000263 4
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
## userId movieId rating timestamp
## 1 1 2 3.5 2005-04-02 23:53:47
## 2 1 29 3.5 2005-04-02 23:31:16
## 3 1 32 3.5 2005-04-02 23:33:39
## 4 1 47 3.5 2005-04-02 23:32:07
## 5 1 50 3.5 2005-04-02 23:29:40
## 6 1 112 3.5 2004-09-10 03:09:00
To keep only movies worth watching, only ratings of 4 and above are considered.
To reduce the computational load, the number of observations was cut from over 20 million to 2 million.
ratings <- ratings %>%
  left_join(movies %>% select(movieId, title), by = "movieId", relationship = "many-to-many")
head(ratings)
## userId movieId rating timestamp
## 1 1 151 4 2004-09-10 03:08:54
## 2 1 223 4 2005-04-02 23:46:13
## 3 1 253 4 2005-04-02 23:35:40
## 4 1 260 4 2005-04-02 23:33:46
## 5 1 293 4 2005-04-02 23:31:43
## 6 1 296 4 2005-04-02 23:32:47
## title
## 1 Rob Roy (1995)
## 2 Clerks (1994)
## 3 Interview with the Vampire: The Vampire Chronicles (1994)
## 4 Star Wars: Episode IV - A New Hope (1977)
## 5 Léon: The Professional (a.k.a. The Professional) (Léon) (1994)
## 6 Pulp Fiction (1994)
Limiting the data to the columns we need and checking whether it is complete.
ratings <- select(ratings, userId, title)
missing_in_cols <- sapply(ratings, function(x) sum(is.na(x)) / nrow(ratings))
percent(missing_in_cols)
## userId title
## "0%" "0%"
## userId title
## 1 1 Rob Roy (1995)
## 2 1 Clerks (1994)
## 3 1 Interview with the Vampire: The Vampire Chronicles (1994)
## 4 1 Star Wars: Episode IV - A New Hope (1977)
## 5 1 Léon: The Professional (a.k.a. The Professional) (Léon) (1994)
## 6 1 Pulp Fiction (1994)
Creating a sparse matrix suitable for analysing our data.
## Warning in asMethod(object): removing duplicated items in transactions
## transactions in sparse format with
## 27380 transactions (rows) and
## 14960 items (columns)
The 27380 rows correspond to the users who rated movies.
The 14960 columns correspond to the 14960 different movies.
Each cell in the matrix is 1 if the movie was rated by the corresponding user, and 0 otherwise.
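The code that builds the transactions is not shown. As an illustration only, the base-R sketch below builds the same kind of user-by-movie 0/1 matrix from a small made-up ratings table. The toy data and all variable names are assumptions, not the author's code. With the arules package, the usual route is to coerce a list of titles per user to the `transactions` class, which is also where the "removing duplicated items" warning comes from.

```r
# Toy long-format ratings (hypothetical data); user 2 rated "C" twice.
ratings_toy <- data.frame(
  userId = c(1, 1, 2, 2, 2, 3),
  title  = c("A", "B", "A", "C", "C", "B"),
  stringsAsFactors = FALSE
)

# Duplicated (user, title) pairs carry no extra information in a 0/1 matrix.
pairs <- unique(ratings_toy)

users  <- sort(unique(pairs$userId))
titles <- sort(unique(pairs$title))

# One row per user, one column per movie; 1 means "this user rated this movie".
incidence <- matrix(
  0L,
  nrow = length(users), ncol = length(titles),
  dimnames = list(as.character(users), titles)
)
incidence[cbind(match(pairs$userId, users), match(pairs$title, titles))] <- 1L

incidence

# Share of non-zero cells, i.e. the "density" of the matrix.
density_toy <- sum(incidence) / length(incidence)
density_toy
```

A real transactions object stores only the non-zero cells, which is why a 27380 x 14960 matrix remains manageable until it is coerced to a dense form.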
It would be very difficult to cater to every user, so to reduce the workload and keep the final results readable and broadly useful, I decided to group the films by genre and then select the 100 most-watched ones. To give any viewer a good chance of having seen at least one of these 100 films, films were selected from every genre, with the number taken from each genre proportional to that genre's share of all films.
Checking which movies were rated, how many times, and what their genres are.
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size
## 1.5 GiB
rated_movies <- rated_movies %>% arrange(desc(count))
rated_movies <- rated_movies %>%
  left_join(movies %>% select(title, genres), by = "title")
head(rated_movies)
## title count genres
## 1 Shawshank Redemption, The (1994) 11125 Crime|Drama
## 2 Pulp Fiction (1994) 10363 Comedy|Crime|Drama|Thriller
## 3 Silence of the Lambs, The (1991) 9898 Crime|Horror|Thriller
## 4 Forrest Gump (1994) 9454 Comedy|Drama|Romance|War
## 5 Star Wars: Episode IV - A New Hope (1977) 8523 Action|Adventure|Sci-Fi
## 6 Schindler's List (1993) 8346 Drama|War
First, I check how many distinct genre labels the movies carry.
## n
## 1 1082
There are too many genre combinations, so this has to be simplified.
The “genres” column of the “rated_movies” dataset is therefore reduced to the primary (first-listed) genre of each movie.
## title count genres
## 1 Shawshank Redemption, The (1994) 11125 Crime
## 2 Pulp Fiction (1994) 10363 Comedy
## 3 Silence of the Lambs, The (1991) 9898 Crime
## 4 Forrest Gump (1994) 9454 Comedy
## 5 Star Wars: Episode IV - A New Hope (1977) 8523 Action
## 6 Schindler's List (1993) 8346 Drama
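The code for this step is not shown. One simple way to keep only the first-listed genre, assuming the genres are pipe-separated as in the output above, is a regular-expression substitution. This is a sketch, not necessarily the author's method, and `primary_genre` is an illustrative name.

```r
# Genre strings as they appear in the outputs above.
genres <- c("Crime|Drama", "Comedy|Crime|Drama|Thriller",
            "Action|Adventure|Sci-Fi", "Comedy")

# Drop everything from the first "|" onward; strings without "|" are unchanged.
primary_genre <- sub("\\|.*$", "", genres)
primary_genre
```

In the rows shown, the genres of each movie appear in alphabetical order, so the first-listed genre is not necessarily the most characteristic one. This is worth keeping in mind when reading the genre counts.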
Counting films by genre.
genre_counts <- rated_movies %>%
  count(genres, sort = TRUE) %>%
  mutate(
    percent = n / sum(n) * 100,
    cum_percent = cumsum(percent)
  )
ggplot(genre_counts, aes(x = reorder(genres, -n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = n), vjust = -0.5, size = 4, color = "black") +
  geom_text(aes(label = paste0(round(percent, 1), "%")),
            vjust = 1.5, size = 3, color = "white") +
  labs(
    title = "Percentage of movie genres",
    x = "Genre",
    y = "Number of film appearances"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Selecting and defining the number of films for each genre
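genre_counts <- c(
  Drama = 30, Comedy = 27, Action = 15,
  Documentary = 7, Crime = 6, Adventure = 6,
  Horror = 5, Animation = 2, Children = 3)
selected_movies <- c()
for (genre in names(genre_counts)) {
  num_movies <- genre_counts[genre]
  genre_movies <- rated_movies %>%
    # NOTE: rated_movies has a `genres` column but no `genre` column, so
    # `genre == genre` compares the loop variable with itself and is always TRUE;
    # each pass therefore takes the next most-rated titles regardless of genre.
    filter(genre == genre & !(title %in% selected_movies)) %>%
    head(num_movies) %>%
    pull(title)
  selected_movies <- c(selected_movies, genre_movies)
}
List of chosen movies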
## [1] "Shawshank Redemption, The (1994)"
## [2] "Pulp Fiction (1994)"
## [3] "Silence of the Lambs, The (1991)"
## [4] "Forrest Gump (1994)"
## [5] "Star Wars: Episode IV - A New Hope (1977)"
## [6] "Schindler's List (1993)"
## [7] "Matrix, The (1999)"
## [8] "Usual Suspects, The (1995)"
## [9] "Braveheart (1995)"
## [10] "Terminator 2: Judgment Day (1991)"
## [11] "Star Wars: Episode V - The Empire Strikes Back (1980)"
## [12] "Fugitive, The (1993)"
## [13] "Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)"
## [14] "Godfather, The (1972)"
## [15] "American Beauty (1999)"
## [16] "Star Wars: Episode VI - Return of the Jedi (1983)"
## [17] "Toy Story (1995)"
## [18] "Fargo (1996)"
## [19] "Jurassic Park (1993)"
## [20] "Seven (a.k.a. Se7en) (1995)"
## [21] "Fight Club (1999)"
## [22] "Apollo 13 (1995)"
## [23] "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)"
## [24] "Lord of the Rings: The Fellowship of the Ring, The (2001)"
## [25] "Sixth Sense, The (1999)"
## [26] "Back to the Future (1985)"
## [27] "Saving Private Ryan (1998)"
## [28] "Monty Python and the Holy Grail (1975)"
## [29] "Dances with Wolves (1990)"
## [30] "Princess Bride, The (1987)"
## [31] "Lord of the Rings: The Two Towers, The (2002)"
## [32] "One Flew Over the Cuckoo's Nest (1975)"
## [33] "Memento (2000)"
## [34] "Lord of the Rings: The Return of the King, The (2003)"
## [35] "Lion King, The (1994)"
## [36] "Blade Runner (1982)"
## [37] "Aladdin (1992)"
## [38] "Alien (1979)"
## [39] "Terminator, The (1984)"
## [40] "Gladiator (2000)"
## [41] "Indiana Jones and the Last Crusade (1989)"
## [42] "Godfather: Part II, The (1974)"
## [43] "Goodfellas (1990)"
## [44] "Groundhog Day (1993)"
## [45] "Die Hard (1988)"
## [46] "Reservoir Dogs (1992)"
## [47] "Good Will Hunting (1997)"
## [48] "L.A. Confidential (1997)"
## [49] "Shrek (2001)"
## [50] "True Lies (1994)"
## [51] "Independence Day (a.k.a. ID4) (1996)"
## [52] "Aliens (1986)"
## [53] "Speed (1994)"
## [54] "Casablanca (1942)"
## [55] "E.T. the Extra-Terrestrial (1982)"
## [56] "Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)"
## [57] "Léon: The Professional (a.k.a. The Professional) (Léon) (1994)"
## [58] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
## [59] "Beauty and the Beast (1991)"
## [60] "Being John Malkovich (1999)"
## [61] "Taxi Driver (1976)"
## [62] "Batman (1989)"
## [63] "American History X (1998)"
## [64] "Babe (1995)"
## [65] "Men in Black (a.k.a. MIB) (1997)"
## [66] "Clockwork Orange, A (1971)"
## [67] "Apocalypse Now (1979)"
## [68] "Ghostbusters (a.k.a. Ghost Busters) (1984)"
## [69] "2001: A Space Odyssey (1968)"
## [70] "Trainspotting (1996)"
## [71] "Rock, The (1996)"
## [72] "Eternal Sunshine of the Spotless Mind (2004)"
## [73] "Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)"
## [74] "Shining, The (1980)"
## [75] "Dark Knight, The (2008)"
## [76] "Rain Man (1988)"
## [77] "Fifth Element, The (1997)"
## [78] "Wizard of Oz, The (1939)"
## [79] "Full Metal Jacket (1987)"
## [80] "Pirates of the Caribbean: The Curse of the Black Pearl (2003)"
## [81] "Willy Wonka & the Chocolate Factory (1971)"
## [82] "Four Weddings and a Funeral (1994)"
## [83] "Clear and Present Danger (1994)"
## [84] "Die Hard: With a Vengeance (1995)"
## [85] "Ferris Bueller's Day Off (1986)"
## [86] "Shakespeare in Love (1998)"
## [87] "Clerks (1994)"
## [88] "Heat (1995)"
## [89] "Monsters, Inc. (2001)"
## [90] "Finding Nemo (2003)"
## [91] "Amadeus (1984)"
## [92] "Stand by Me (1986)"
## [93] "Truman Show, The (1998)"
## [94] "Mission: Impossible (1996)"
## [95] "Green Mile, The (1999)"
## [96] "Kill Bill: Vol. 1 (2003)"
## [97] "Rear Window (1954)"
## [98] "Sense and Sensibility (1995)"
## [99] "Psycho (1960)"
## [100] "Monty Python's Life of Brian (1979)"
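In the selection loop, `rated_movies` has a column named `genres` but none named `genre`, so inside `filter()` the expression `genre == genre` refers to the loop variable on both sides and is always TRUE. That would explain why the list above simply follows the overall ranking by number of ratings. A minimal sketch of a genre-aware version on made-up data could look like this (the toy table and the names `quota`, `g` and `picked` are hypothetical):

```r
library(dplyr)

# Hypothetical miniature of rated_movies, already sorted by count.
toy_movies <- data.frame(
  title  = c("A", "B", "C", "D"),
  count  = c(40, 30, 20, 10),
  genres = c("Drama", "Comedy", "Drama", "Comedy"),
  stringsAsFactors = FALSE
)
quota  <- c(Comedy = 1, Drama = 1)   # films to take per genre
picked <- character(0)

for (g in names(quota)) {
  top <- toy_movies %>%
    # `genres` is the data frame column; .env$g is the loop variable.
    filter(genres == .env$g, !(title %in% picked)) %>%
    arrange(desc(count)) %>%
    head(quota[[g]]) %>%
    pull(title)
  picked <- c(picked, top)
}
picked
```

Here the first pick is the top comedy ("B") rather than the most-rated film overall ("A"), which shows that the genre filter is actually applied.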
## transactions as itemMatrix in sparse format with
## 27380 rows (elements/itemsets/transactions) and
## 14960 columns (items) and a density of 0.004882721
##
## most frequent items:
## Shawshank Redemption, The (1994)
## 11125
## Pulp Fiction (1994)
## 10363
## Silence of the Lambs, The (1991)
## 9898
## Forrest Gump (1994)
## 9454
## Star Wars: Episode IV - A New Hope (1977)
## 8523
## (Other)
## 1950623
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 50 59 94 123 173 238 291 326 397 427 479 543 567 581 607 575
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 604 579 515 508 455 483 444 437 399 366 401 352 328 307 315 275
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 295 264 277 261 259 232 241 236 212 224 227 208 205 185 193 183
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 187 192 160 143 154 170 150 174 147 150 161 119 132 130 123 99
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 97 123 115 125 125 119 100 116 107 115 95 107 97 117 103 93
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 88 95 94 106 88 86 73 79 62 74 83 65 59 84 56 67
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 79 67 63 68 60 53 63 59 52 56 60 58 57 57 50 53
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 49 50 42 54 52 54 57 57 55 61 51 47 29 48 47 49
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 33 50 50 40 34 42 26 43 32 44 37 44 38 30 38 43
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 42 38 40 35 30 31 43 27 28 31 28 36 28 37 25 32
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 27 31 33 30 27 24 31 20 20 22 22 37 31 18 31 22
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 31 28 28 22 20 32 23 22 23 21 19 27 17 25 19 20
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 21 15 17 24 22 20 19 24 18 14 18 16 16 14 21 23
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 12 12 17 24 21 11 15 21 20 13 20 11 16 9 11 9
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 15 21 8 17 19 14 11 13 14 14 14 7 14 9 14 9
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 7 16 11 14 9 9 7 14 14 8 11 16 10 7 4 10
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 7 13 6 7 10 12 7 5 11 12 14 5 9 8 7 5
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 8 8 5 8 13 10 9 10 8 6 9 11 12 7 7 9
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 9 9 8 11 9 6 8 7 8 9 6 5 6 2 12 4
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 10 5 6 5 7 9 9 5 7 6 5 1 6 10 6 5
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 3 15 7 4 4 8 8 10 3 3 8 7 11 4 5 4
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 6 6 1 2 5 5 6 3 6 3 4 8 3 3 6 4
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 5 4 5 8 4 4 5 3 4 4 3 2 4 3 4 8
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 7 1 7 5 7 4 3 5 4 3 3 2 3 7 5 3
## 385 386 387 389 390 391 392 393 394 395 396 397 398 399 400 401
## 1 5 3 5 3 2 5 6 2 7 5 6 2 6 1 5
## 402 403 405 406 407 408 409 410 411 412 413 414 415 416 417 418
## 4 4 2 4 3 6 1 1 1 5 1 2 4 4 2 4
## 419 420 421 422 423 424 425 426 427 428 429 430 431 433 435 436
## 8 3 2 2 2 2 4 1 2 2 4 1 2 2 1 2
## 437 438 439 440 441 442 443 445 446 447 448 450 451 453 454 455
## 5 4 3 4 3 1 2 2 2 1 2 1 2 2 1 1
## 456 458 459 460 461 462 463 464 465 466 467 469 472 473 475 477
## 3 3 1 1 1 3 2 1 2 3 1 2 2 1 1 2
## 478 479 480 481 483 484 485 486 487 489 490 491 492 493 494 495
## 1 1 2 1 2 2 1 3 1 2 3 4 4 1 3 3
## 496 497 498 501 502 504 505 506 507 508 510 512 514 515 516 517
## 1 5 1 2 2 2 3 2 1 3 2 2 3 1 1 3
## 520 521 522 523 525 526 527 528 529 530 531 532 533 534 535 536
## 1 1 1 5 2 1 1 1 1 4 2 2 2 2 2 2
## 537 538 543 544 545 547 550 551 553 554 555 556 559 560 561 562
## 1 2 2 3 2 1 1 1 1 2 3 2 2 1 1 1
## 563 567 568 571 576 578 579 581 584 588 589 592 593 594 595 597
## 1 1 2 1 1 2 2 1 1 1 1 2 3 1 2 1
## 599 600 601 602 603 604 605 608 609 610 612 613 614 615 617 620
## 2 2 1 2 1 2 2 1 1 1 1 3 1 1 1 1
## 621 623 624 626 628 629 633 637 638 640 641 643 644 647 650 652
## 2 1 1 1 1 2 1 1 1 2 1 2 2 1 2 1
## 653 655 659 660 661 668 672 675 678 681 687 688 689 693 699 705
## 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2
## 708 712 713 716 721 722 723 726 731 736 741 750 755 757 762 773
## 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 1
## 778 781 784 787 788 789 791 793 795 797 798 800 803 804 809 822
## 1 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1
## 824 831 841 846 851 853 859 861 862 869 872 878 884 887 890 899
## 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1
## 921 944 951 952 954 972 993 1001 1003 1004 1011 1016 1038 1048 1049 1058
## 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1
## 1065 1090 1125 1137 1158 1171 1190 1203 1204 1211 1246 1247 1253 1298 1337 1342
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1575 1751 1935 1984 2502
## 1 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 38.00 73.05 85.00 2502.00
##
## includes extended item information - examples:
## labels
## 1 ¡Three Amigos! (1986)
## 2 ...And God Spoke (1993)
## 3 ...And Justice for All (1979)
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
The density value of 0.004882721 (about 0.5%) is the proportion of non-zero matrix cells.
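As a quick consistency check, the density can be turned back into a number of ratings and into the per-user and per-movie averages. The figures below are copied from the outputs in this section; the variable names are illustrative.

```r
n_users  <- 27380        # rows (transactions)
n_movies <- 14960        # columns (items)
density  <- 0.004882721  # share of non-zero cells, from the summary

n_ratings <- round(n_users * n_movies * density)  # number of non-zero cells
per_user  <- n_ratings / n_users                  # mean ratings per user
per_movie <- n_ratings / n_movies                 # mean ratings per movie

c(n_ratings = n_ratings,
  per_user  = round(per_user, 2),
  per_movie = round(per_movie, 1))
```

The result, roughly 2 million ratings with about 73 per user and about 134 per movie, agrees with the means reported in the summaries in this section.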
Simple statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000365 0.0000730 0.0003652 0.0048827 0.0019357 0.4063185
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.0 10.0 133.7 53.0 11125.0
On average, a user rated 73 movies. There is also one user who rated 2502 movies, over 16% of all movies!
The most-watched film turned out to be “The Shawshank Redemption”, with 11125 ratings (about 40% of users).
Investigating associations for movies from selected genres and the rest of the movies through the application of the Apriori algorithm.
After reviewing the data statistics, I set the following thresholds for the algorithm:
support (the proportion of users who rated all movies in a rule) at 1% (the movies in a rule had to be rated by at least 274 of the same users)
confidence (how often the rule holds when its antecedent occurs) at 30%
rule length equal to 2
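A support threshold translates into a minimum number of users by multiplying it by the number of transactions. The arithmetic below uses the 27380 transactions reported above and shows both 1% (the `supp = 0.01` passed to `apriori()` in the call that follows) and 0.5% (the `supp = 0.005` used later for `solo_rules`). The variable names are illustrative.

```r
n_users <- 27380   # transactions in the sparse matrix

# Approximate minimum number of users who must have rated both films of a rule.
min_count_1pct  <- ceiling(0.01  * n_users)
min_count_05pct <- ceiling(0.005 * n_users)

c(supp_0.01 = min_count_1pct, supp_0.005 = min_count_05pct)
```

This gives 274 users at 1% and 137 users at 0.5%.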
rules <- apriori(
  data = reviews,
  parameter = list(supp = 0.01, conf = 0.3, minlen = 2, maxlen = 2),
  appearance = list(lhs = selected_movies, default = "rhs"),
  control = list(verbose = FALSE)
)
rules
## set of 360 rules
## set of 360 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 360
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.03170 Min. :0.3000 Min. :0.1050 Min. :2.866
## 1st Qu.:0.03724 1st Qu.:0.3115 1st Qu.:0.1098 1st Qu.:3.303
## Median :0.04125 Median :0.3281 Median :0.1188 Median :3.591
## Mean :0.04294 Mean :0.3380 Mean :0.1278 Mean :3.716
## 3rd Qu.:0.04645 3rd Qu.:0.3523 3rd Qu.:0.1391 3rd Qu.:4.016
## Max. :0.07663 Max. :0.7033 Max. :0.2390 Max. :7.958
## count
## Min. : 868
## 1st Qu.:1020
## Median :1130
## Mean :1176
## 3rd Qu.:1272
## Max. :2098
##
## mining info:
## data ntransactions support confidence
## reviews 27380 0.01 0.3
## call
## apriori(data = reviews, parameter = list(supp = 0.01, conf = 0.3, minlen = 2, maxlen = 2), appearance = list(lhs = selected_movies, default = "rhs"), control = list(verbose = FALSE))
Reordering the rules to select the most meaningful ones. The three tables below show the top 10 rules sorted by lift, by confidence, and by support, respectively.
## lhs rhs support confidence coverage lift count
## [1] {Kill Bill: Vol. 1 (2003)} => {Kill Bill: Vol. 2 (2004)} 0.07542001 0.7033379 0.1072316 7.957600 2065
## [2] {Rear Window (1954)} => {Vertigo (1958)} 0.05288532 0.4943667 0.1069759 6.226200 1448
## [3] {Dark Knight, The (2008)} => {Iron Man (2008)} 0.04408327 0.3652042 0.1207085 6.203035 1207
## [4] {Kill Bill: Vol. 1 (2003)} => {Sin City (2005)} 0.04612856 0.4301771 0.1072316 5.790683 1263
## [5] {Rear Window (1954)} => {North by Northwest (1959)} 0.05525931 0.5165586 0.1069759 5.559502 1513
## [6] {Dark Knight, The (2008)} => {Inception (2010)} 0.05171658 0.4284418 0.1207085 5.522945 1416
## [7] {Dark Knight, The (2008)} => {WALL·E (2008)} 0.04258583 0.3527988 0.1207085 5.354563 1166
## [8] {Dark Knight, The (2008)} => {Prestige, The (2006)} 0.03772827 0.3125567 0.1207085 5.269583 1033
## [9] {Finding Nemo (2003)} => {Incredibles, The (2004)} 0.05460190 0.4975042 0.1097516 5.245154 1495
## [10] {Psycho (1960)} => {Vertigo (1958)} 0.04349890 0.4132547 0.1052593 5.204652 1191
## lhs rhs support confidence coverage lift count
## [1] {Kill Bill: Vol. 1 (2003)} => {Kill Bill: Vol. 2 (2004)} 0.07542001 0.7033379 0.1072316 7.957600 2065
## [2] {Rear Window (1954)} => {North by Northwest (1959)} 0.05525931 0.5165586 0.1069759 5.559502 1513
## [3] {Finding Nemo (2003)} => {Incredibles, The (2004)} 0.05460190 0.4975042 0.1097516 5.245154 1495
## [4] {Rear Window (1954)} => {Vertigo (1958)} 0.05288532 0.4943667 0.1069759 6.226200 1448
## [5] {Monsters, Inc. (2001)} => {Incredibles, The (2004)} 0.05230095 0.4743292 0.1102630 5.000822 1432
## [6] {Clear and Present Danger (1994)} => {Crimson Tide (1995)} 0.05193572 0.4598965 0.1129291 4.505176 1422
## [7] {Dark Knight, The (2008)} => {Batman Begins (2005)} 0.05405405 0.4478064 0.1207085 4.703083 1480
## [8] {Ferris Bueller's Day Off (1986)} => {Breakfast Club, The (1985)} 0.04974434 0.4423514 0.1124543 4.620977 1362
## [9] {Pirates of the Caribbean: The Curse of the Black Pearl (2003)} => {Ocean's Eleven (2001)} 0.05021914 0.4338908 0.1157414 4.181602 1375
## [10] {Eternal Sunshine of the Spotless Mind (2004)} => {Donnie Darko (2001)} 0.05354273 0.4334713 0.1235208 4.568301 1466
## lhs rhs support confidence coverage lift count
## [1] {Toy Story (1995)} => {Toy Story 2 (1999)} 0.07662527 0.3205990 0.2390066 3.060670 2098
## [2] {Kill Bill: Vol. 1 (2003)} => {Kill Bill: Vol. 2 (2004)} 0.07542001 0.7033379 0.1072316 7.957600 2065
## [3] {Fight Club (1999)} => {Snatch (2000)} 0.07081812 0.3065613 0.2310080 3.308493 1939
## [4] {Fight Club (1999)} => {Donnie Darko (2001)} 0.07074507 0.3062451 0.2310080 3.227479 1937
## [5] {Apollo 13 (1995)} => {Crimson Tide (1995)} 0.06680058 0.3026142 0.2207451 2.964428 1829
## [6] {Lord of the Rings: The Fellowship of the Ring, The (2001)} => {Ocean's Eleven (2001)} 0.06486486 0.3115789 0.2081812 3.002827 1776
## [7] {Lord of the Rings: The Fellowship of the Ring, The (2001)} => {Batman Begins (2005)} 0.06406136 0.3077193 0.2081812 3.231820 1754
## [8] {Lord of the Rings: The Fellowship of the Ring, The (2001)} => {Beautiful Mind, A (2001)} 0.06391527 0.3070175 0.2081812 2.945389 1750
## [9] {Indiana Jones and the Last Crusade (1989)} => {Indiana Jones and the Temple of Doom (1984)} 0.06336742 0.3891007 0.1628561 4.558655 1735
## [10] {Lord of the Rings: The Return of the King, The (2003)} => {Batman Begins (2005)} 0.06292915 0.3633488 0.1731921 3.816068 1723
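The quality measures in these tables are tied together by simple ratios: support is the count divided by the number of transactions, confidence is support divided by coverage (the support of the antecedent), and lift is confidence divided by the support of the consequent. The sketch below recomputes them for the Kill Bill rule from the figures printed above; the variable names are illustrative.

```r
n_users  <- 27380       # number of transactions
count_ab <- 2065        # users who rated both Kill Bill volumes
coverage <- 0.1072316   # support of the antecedent, Kill Bill: Vol. 1
lift     <- 7.957600    # lift reported for the rule

support     <- count_ab / n_users   # P(A and B)
confidence  <- support / coverage   # P(B | A)
support_rhs <- confidence / lift    # P(B), implied by lift = confidence / P(B)

round(c(support = support, confidence = confidence, support_rhs = support_rhs), 4)
```

A lift of about 8 therefore means that users who rated Vol. 1 were roughly eight times more likely to have rated Vol. 2 than a user picked at random.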
Many of the resulting rules stem from the fact that some films were released in several parts. The presence of such pairs is nevertheless a useful sanity check that the algorithm works. Because the titles include the release year, we can also see that films tend to be associated with others from the same period: older films have their strongest associations with other older films, and newer films with other newer films. “The Dark Knight” appears several times among the rules with the highest lift, so it is worth a closer look.
rules.knight <- apriori(data = reviews, parameter = list(supp = 0.05, conf = 0.05),
  appearance = list(default = "lhs", rhs = "Dark Knight, The (2008)"), control = list(verbose = FALSE))
rules.knight <- sort(rules.knight, by = "confidence", decreasing = TRUE)
inspect(rules.knight[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Inception (2010)} => {Dark Knight, The (2008)} 0.05171658 0.6666667 0.07757487 5.522945 1416
## [2] {Batman Begins (2005)} => {Dark Knight, The (2008)} 0.05405405 0.5677023 0.09521549 4.703083 1480
## [3] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Dark Knight, The (2008)} 0.05200877 0.3854900 0.13491600 3.193560 1424
## [4] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Dark Knight, The (2008)} 0.05715851 0.3830152 0.14923302 3.173058 1565
## [5] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Dark Knight, The (2008)} 0.05562454 0.3829520 0.14525201 3.172534 1523
## [6] {Fight Club (1999),
## Shawshank Redemption, The (1994)} => {Dark Knight, The (2008)} 0.05270270 0.3720031 0.14167275 3.081829 1443
## [7] {Lord of the Rings: The Return of the King, The (2003)} => {Dark Knight, The (2008)} 0.06387874 0.3688317 0.17319211 3.055556 1749
## [8] {Fight Club (1999),
## Matrix, The (1999)} => {Dark Knight, The (2008)} 0.05463842 0.3646113 0.14985391 3.020592 1496
## [9] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Matrix, The (1999)} => {Dark Knight, The (2008)} 0.05058437 0.3644737 0.13878744 3.019452 1385
## [10] {Fight Club (1999),
## Pulp Fiction (1994)} => {Dark Knight, The (2008)} 0.05208181 0.3495098 0.14901388 2.895485 1426
Interestingly, “Inception” ranks first, ahead of “Batman Begins,” the preceding installment in the same series. All three parts of the Lord of the Rings series appear among the 10 rules with the highest confidence. Most of the films involved are from the 21st century, but there are also older productions such as “The Shawshank Redemption” and “Pulp Fiction.”
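Confidence is directional, which helps explain why “Inception” leads this ranking. Both directions of the Inception / Dark Knight rule share the same support, but fewer users rated “Inception”, so the same overlap is a larger share of its audience. The figures are taken from the tables above; the variable names are illustrative.

```r
support_both    <- 0.05171658  # share of users who rated both films
cov_inception   <- 0.07757487  # share of users who rated Inception
cov_dark_knight <- 0.12070850  # share of users who rated The Dark Knight

conf_inception_to_dk <- support_both / cov_inception    # {Inception} => {Dark Knight}
conf_dk_to_inception <- support_both / cov_dark_knight  # {Dark Knight} => {Inception}

round(c(inception_to_dk = conf_inception_to_dk,
        dk_to_inception = conf_dk_to_inception), 4)
```

The two confidences come out at about 0.67 and 0.43, while lift, being symmetric, is 5.52 in both directions.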
List of films with something for everyone (one association for each of the 100 selected films)
solo_rules <- apriori(
  data = reviews,
  parameter = list(supp = 0.005, conf = 0.25, minlen = 2, maxlen = 2),
  appearance = list(lhs = selected_movies, default = "rhs"),
  control = list(verbose = FALSE))
Selecting the best association for each film in the LHS
solo_rules <- sort(solo_rules, by = "confidence", decreasing = TRUE)
# NOTE: the duplicate mask below is computed on `rules`, not on `solo_rules`,
# so it does not line up with the sorted rules it is applied to;
# `lhs(solo_rules)` is presumably what was intended.
best_rules <- solo_rules[!duplicated(lhs(rules))]
best_rules <- best_rules[1:100]
List of movies
## lhs rhs support confidence coverage lift count
## [1] {Kill Bill: Vol. 1 (2003)} => {Kill Bill: Vol. 2 (2004)} 0.07542001 0.7033379 0.1072316 7.957600 2065
## [2] {Rear Window (1954)} => {North by Northwest (1959)} 0.05525931 0.5165586 0.1069759 5.559502 1513
## [3] {Finding Nemo (2003)} => {Incredibles, The (2004)} 0.05460190 0.4975042 0.1097516 5.245154 1495
## [4] {Monsters, Inc. (2001)} => {Incredibles, The (2004)} 0.05230095 0.4743292 0.1102630 5.000822 1432
## [5] {Rear Window (1954)} => {Citizen Kane (1941)} 0.04598247 0.4298395 0.1069759 4.444489 1259
## [6] {Dark Knight, The (2008)} => {Inception (2010)} 0.05171658 0.4284418 0.1207085 5.522945 1416
## [7] {Rear Window (1954)} => {Chinatown (1974)} 0.04488678 0.4195971 0.1069759 4.718098 1229
## [8] {Pirates of the Caribbean: The Curse of the Black Pearl (2003)} => {Incredibles, The (2004)} 0.04663988 0.4029662 0.1157414 4.248446 1277
## [9] {Monsters, Inc. (2001)} => {Toy Story 2 (1999)} 0.04441198 0.4027824 0.1102630 3.845252 1216
## [10] {Kill Bill: Vol. 1 (2003)} => {Batman Begins (2005)} 0.04236669 0.3950954 0.1072316 4.149486 1160
## [11] {Kill Bill: Vol. 1 (2003)} => {Donnie Darko (2001)} 0.04203798 0.3920300 0.1072316 4.131555 1151
## [12] {Psycho (1960)} => {Graduate, The (1967)} 0.04123448 0.3917418 0.1052593 4.232791 1129
## [13] {Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)} => {Citizen Kane (1941)} 0.05328707 0.3907338 0.1363769 4.040140 1459
## [14] {Indiana Jones and the Last Crusade (1989)} => {Indiana Jones and the Temple of Doom (1984)} 0.06336742 0.3891007 0.1628561 4.558655 1735
## [15] {Kill Bill: Vol. 1 (2003)} => {Snatch (2000)} 0.04156318 0.3876022 0.1072316 4.183109 1138
## [16] {Rear Window (1954)} => {Graduate, The (1967)} 0.04035793 0.3772619 0.1069759 4.076334 1105
## [17] {Reservoir Dogs (1992)} => {Big Lebowski, The (1998)} 0.05668371 0.3741562 0.1514974 3.570721 1552
## [18] {Being John Malkovich (1999)} => {Big Lebowski, The (1998)} 0.05080351 0.3739247 0.1358656 3.568511 1391
## [19] {Ferris Bueller's Day Off (1986)} => {Office Space (1999)} 0.04200146 0.3734979 0.1124543 3.656193 1150
## [20] {Dark Knight, The (2008)} => {Departed, The (2006)} 0.04506939 0.3733737 0.1207085 4.768177 1234
## [21] {Kill Bill: Vol. 1 (2003)} => {Ocean's Eleven (2001)} 0.04002922 0.3732970 0.1072316 3.597632 1096
## [22] {Finding Nemo (2003)} => {Batman Begins (2005)} 0.04039445 0.3680532 0.1097516 3.865477 1106
## [23] {Four Weddings and a Funeral (1994)} => {Sleepless in Seattle (1993)} 0.04185537 0.3678973 0.1137692 3.723855 1146
## [24] {Kill Bill: Vol. 1 (2003)} => {Bourne Identity, The (2002)} 0.03937180 0.3671662 0.1072316 3.730245 1078
## [25] {Kill Bill: Vol. 1 (2003)} => {Incredibles, The (2004)} 0.03929876 0.3664850 0.1072316 3.863827 1076
## [26] {Shrek (2001)} => {Ocean's Eleven (2001)} 0.05412710 0.3653846 0.1481373 3.521377 1482
## [27] {Dark Knight, The (2008)} => {Iron Man (2008)} 0.04408327 0.3652042 0.1207085 6.203035 1207
## [28] {Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)} => {O Brother, Where Art Thou? (2000)} 0.04437546 0.3642086 0.1218408 3.813397 1215
## [29] {Full Metal Jacket (1987)} => {Big Lebowski, The (1998)} 0.04229364 0.3641509 0.1161432 3.475236 1158
## [30] {Rain Man (1988)} => {Dead Poets Society (1989)} 0.04306063 0.3624347 0.1188093 3.725023 1179
## [31] {Rear Window (1954)} => {To Kill a Mockingbird (1962)} 0.03871439 0.3618983 0.1069759 4.163351 1060
## [32] {Wizard of Oz, The (1939)} => {Graduate, The (1967)} 0.04229364 0.3615361 0.1169832 3.906416 1158
## [33] {Finding Nemo (2003)} => {Toy Story 2 (1999)} 0.03929876 0.3580699 0.1097516 3.418394 1076
## [34] {Pirates of the Caribbean: The Curse of the Black Pearl (2003)} => {X-Men (2000)} 0.04141709 0.3578416 0.1157414 3.586275 1134
## [35] {Rear Window (1954)} => {Sting, The (1973)} 0.03823959 0.3574599 0.1069759 4.004604 1047
## [36] {Pirates of the Caribbean: The Curse of the Black Pearl (2003)} => {Spider-Man (2002)} 0.04108839 0.3550016 0.1157414 4.442387 1125
## [37] {Kill Bill: Vol. 1 (2003)} => {Big Lebowski, The (1998)} 0.03802045 0.3545640 0.1072316 3.383745 1041
## [38] {Trainspotting (1996)} => {Big Lebowski, The (1998)} 0.04452155 0.3544635 0.1256026 3.382785 1219
## [39] {Truman Show, The (1998)} => {Edward Scissorhands (1990)} 0.03838568 0.3532773 0.1086560 3.853678 1051
## [40] {Stand by Me (1986)} => {Jaws (1975)} 0.03845873 0.3527638 0.1090212 3.525063 1053
## [41] {Apocalypse Now (1979)} => {Platoon (1986)} 0.04477721 0.3527043 0.1269540 4.502118 1226
## [42] {Casablanca (1942)} => {Chinatown (1974)} 0.04937911 0.3507134 0.1407962 3.943545 1352
## [43] {Monsters, Inc. (2001)} => {Minority Report (2002)} 0.03853178 0.3494535 0.1102630 3.858079 1055
## [44] {Eternal Sunshine of the Spotless Mind (2004)} => {Requiem for a Dream (2000)} 0.04276844 0.3462448 0.1235208 4.423791 1171
## [45] {Clockwork Orange, A (1971)} => {Big Lebowski, The (1998)} 0.04422936 0.3460989 0.1277940 3.302958 1211
## [46] {Apocalypse Now (1979)} => {Graduate, The (1967)} 0.04379109 0.3449367 0.1269540 3.727059 1199
## [47] {Shrek (2001)} => {Beautiful Mind, A (2001)} 0.05058437 0.3414694 0.1481373 3.275905 1385
## [48] {Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)} => {North by Northwest (1959)} 0.04638422 0.3401178 0.1363769 3.660545 1270
## [49] {Amadeus (1984)} => {North by Northwest (1959)} 0.03725347 0.3398867 0.1096056 3.658057 1020
## [50] {Shrek (2001)} => {X-Men (2000)} 0.04915997 0.3318540 0.1481373 3.325829 1346
## [51] {Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)} => {Graduate, The (1967)} 0.04474069 0.3280664 0.1363769 3.544774 1225
## [52] {Men in Black (a.k.a. MIB) (1997)} => {X-Men (2000)} 0.04225712 0.3272984 0.1291088 3.280173 1157
## [53] {Rain Man (1988)} => {When Harry Met Sally... (1989)} 0.03875091 0.3261605 0.1188093 3.263989 1061
## [54] {Pirates of the Caribbean: The Curse of the Black Pearl (2003)} => {Kill Bill: Vol. 2 (2004)} 0.03772827 0.3259703 0.1157414 3.688045 1033
## [55] {True Lies (1994)} => {Crimson Tide (1995)} 0.04696859 0.3257345 0.1441928 3.190917 1286
## [56] {Casablanca (1942)} => {Vertigo (1958)} 0.04579985 0.3252918 0.1407962 4.096822 1254
## [57] {Gladiator (2000)} => {Bourne Identity, The (2002)} 0.05292184 0.3231490 0.1637692 3.283050 1449
## [58] {Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)} => {Traffic (2000)} 0.03904310 0.3204436 0.1218408 4.284056 1069
## [59] {Kill Bill: Vol. 1 (2003)} => {V for Vendetta (2006)} 0.03400292 0.3170981 0.1072316 4.615707 931
## [60] {Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)} => {X-Men (2000)} 0.03813002 0.3129496 0.1218408 3.136369 1044
## [61] {Eternal Sunshine of the Spotless Mind (2004)} => {O Brother, Where Art Thou? (2000)} 0.03794741 0.3072147 0.1235208 3.216649 1039
## [62] {Stand by Me (1986)} => {Platoon (1986)} 0.03341855 0.3065327 0.1090212 3.912757 915
## [63] {Truman Show, The (1998)} => {Catch Me If You Can (2002)} 0.03305332 0.3042017 0.1086560 3.748444 905
## [64] {Monty Python's Life of Brian (1979)} => {Brazil (1985)} 0.03173850 0.3023660 0.1049671 4.040401 869
## [65] {Groundhog Day (1993)} => {Fish Called Wanda, A (1988)} 0.04653031 0.3020389 0.1540541 3.217831 1274
## [66] {Aliens (1986)} => {Total Recall (1990)} 0.04306063 0.3000000 0.1435354 3.909567 1179
## [67] {Fifth Element, The (1997)} => {Total Recall (1990)} 0.03553689 0.2998459 0.1185172 3.907559 973
## [68] {Ferris Bueller's Day Off (1986)} => {Christmas Story, A (1983)} 0.03371074 0.2997727 0.1124543 4.475341 923
## [69] {Apocalypse Now (1979)} => {North by Northwest (1959)} 0.03805698 0.2997699 0.1269540 3.226297 1042
## [70] {Shakespeare in Love (1998)} => {Graduate, The (1967)} 0.03363769 0.2996096 0.1122717 3.237297 921
## [71] {Goodfellas (1990)} => {Jaws (1975)} 0.04605551 0.2987444 0.1541636 2.985263 1261
## [72] {Clockwork Orange, A (1971)} => {Jaws (1975)} 0.03816654 0.2986568 0.1277940 2.984388 1045
## [73] {Ghostbusters (a.k.a. Ghost Busters) (1984)} => {Big (1988)} 0.03761870 0.2986373 0.1259679 4.039866 1030
## [74] {Wizard of Oz, The (1939)} => {Sound of Music, The (1965)} 0.03484295 0.2978458 0.1169832 4.713883 954
## [75] {Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)} => {Jaws (1975)} 0.04061359 0.2978040 0.1363769 2.975866 1112
## [76] {Shrek (2001)} => {Shrek 2 (2004)} 0.04408327 0.2975838 0.1481373 5.626965 1207
## [77] {Finding Nemo (2003)} => {Shrek 2 (2004)} 0.03265157 0.2975042 0.1097516 5.625458 894
## [78] {Apocalypse Now (1979)} => {This Is Spinal Tap (1984)} 0.03776479 0.2974684 0.1269540 3.521264 1034
## [79] {Trainspotting (1996)} => {Donnie Darko (2001)} 0.03732652 0.2971794 0.1256026 3.131937 1022
## [80] {Stand by Me (1986)} => {Sting, The (1973)} 0.03239591 0.2971524 0.1090212 3.328983 887
## [81] {Good Will Hunting (1997)} => {Bourne Identity, The (2002)} 0.04470416 0.2970153 0.1505113 3.017543 1224
## [82] {Ghostbusters (a.k.a. Ghost Busters) (1984)} => {Fish Called Wanda, A (1988)} 0.03725347 0.2957379 0.1259679 3.150702 1020
## [83] {Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)} => {Big Lebowski, The (1998)} 0.04101534 0.2953709 0.1388605 2.818841 1123
## [84] {Wizard of Oz, The (1939)} => {Fish Called Wanda, A (1988)} 0.03455077 0.2953481 0.1169832 3.146549 946
## [85] {Amadeus (1984)} => {Raising Arizona (1987)} 0.03235939 0.2952349 0.1096056 3.970301 886
## [86] {Monsters, Inc. (2001)} => {Kill Bill: Vol. 2 (2004)} 0.03254200 0.2951308 0.1102630 3.339125 891
## [87] {Lord of the Rings: The Fellowship of the Ring, The (2001)} => {Incredibles, The (2004)} 0.06143170 0.2950877 0.2081812 3.111090 1682
## [88] {Stand by Me (1986)} => {This Is Spinal Tap (1984)} 0.03210373 0.2944724 0.1090212 3.485799 879
## [89] {Being John Malkovich (1999)} => {Office Space (1999)} 0.03995617 0.2940860 0.1358656 2.878826 1094
## [90] {Rear Window (1954)} => {This Is Spinal Tap (1984)} 0.03144631 0.2939570 0.1069759 3.479698 861
## [91] {Memento (2000)} => {Kill Bill: Vol. 2 (2004)} 0.05120526 0.2939203 0.1742148 3.325429 1402
## [92] {Green Mile, The (1999)} => {Dead Poets Society (1989)} 0.03151936 0.2938372 0.1072681 3.019994 863
## [93] {Clockwork Orange, A (1971)} => {Graduate, The (1967)} 0.03754565 0.2937982 0.1277940 3.174505 1028
## [94] {Truman Show, The (1998)} => {Office Space (1999)} 0.03192111 0.2937815 0.1086560 2.875845 874
## [95] {Groundhog Day (1993)} => {Big Lebowski, The (1998)} 0.04525201 0.2937411 0.1540541 2.803287 1239
## [96] {American History X (1998)} => {Departed, The (2006)} 0.03882396 0.2933223 0.1323594 3.745879 1063
## [97] {Shakespeare in Love (1998)} => {Toy Story 2 (1999)} 0.03290723 0.2931034 0.1122717 2.798177 901
## [98] {Full Metal Jacket (1987)} => {Donnie Darko (2001)} 0.03403944 0.2930818 0.1161432 3.088752 932
## [99] {Monsters, Inc. (2001)} => {Office Space (1999)} 0.03228634 0.2928122 0.1102630 2.866356 884
## [100] {Stand by Me (1986)} => {Airplane! (1980)} 0.03192111 0.2927973 0.1090212 3.502311 874
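The listing above contains several rules with the same antecedent (for example, “Rear Window” appears more than once), whereas the stated goal is one association per film. Keeping only the highest-confidence rule per antecedent can be sketched on a plain data frame as follows. The rows are copied from the first six lines of the listing; `rules_df` and `best_per_film` are illustrative names, not objects from the analysis.

```r
# First six rows of the listing above.
rules_df <- data.frame(
  lhs = c("Kill Bill: Vol. 1 (2003)", "Rear Window (1954)", "Finding Nemo (2003)",
          "Monsters, Inc. (2001)", "Rear Window (1954)", "Dark Knight, The (2008)"),
  rhs = c("Kill Bill: Vol. 2 (2004)", "North by Northwest (1959)", "Incredibles, The (2004)",
          "Incredibles, The (2004)", "Citizen Kane (1941)", "Inception (2010)"),
  confidence = c(0.7033379, 0.5165586, 0.4975042, 0.4743292, 0.4298395, 0.4284418),
  stringsAsFactors = FALSE
)

# Sort by confidence, then keep the first (best) rule for each antecedent.
rules_df      <- rules_df[order(rules_df$confidence, decreasing = TRUE), ]
best_per_film <- rules_df[!duplicated(rules_df$lhs), ]
best_per_film
```

The key point is that the duplicate mask is computed on the same, already sorted, object that it is used to subset.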
In this project, I used association rule learning to identify and describe the key dependencies between films. The analysis, limited to 100 selected films, is intended to help film enthusiasts choose the next film to watch.