library(data.table)
library(dplyr)
library(arules)
library(arulesViz)
knitr::opts_chunk$set(echo = TRUE)
In this study, we apply association rule mining to anime preference data. Each user is treated as a transaction (basket), and highly rated anime titles are treated as items. The goal is to uncover structured preference patterns among the Top 200 most popular anime titles.
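To make the transaction framing concrete, the following minimal sketch builds baskets from a tiny, entirely hypothetical ratings table (the user IDs, titles `A` to `C`, and scores are invented for illustration and are not taken from the dataset). It uses base R only and mirrors the "rating of 8 or higher" threshold applied later in this study.

```r
# Hypothetical toy ratings; all values are made up for illustration.
toy <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3),
  name    = c("A", "B", "C", "A", "C", "B"),
  rating  = c(9, 8, 5, 10, 8, 7),
  stringsAsFactors = FALSE
)

# Keep only "liked" titles (rating >= 8).
toy_liked <- toy[toy$rating >= 8, ]

# One basket per user: the set of titles that user liked.
toy_baskets <- split(toy_liked$name, toy_liked$user_id)
print(toy_baskets)
```

User 3 has no rating of 8 or higher and therefore contributes no basket at all, which is the same mechanism that removes users from the real data.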
# Reading and Cleaning Data
anime <- fread("anime.csv")
rating <- fread("rating.csv")
# Decode HTML entities left in the raw titles (apostrophe and ampersand)
anime[, name := gsub("&#039;", "'", name, fixed = TRUE)]
anime[, name := gsub("&amp;", "&", name, fixed = TRUE)]
top200 <- anime[order(-members)][1:200, .(anime_id, name, members)]
liked <- rating[rating >= 8]
liked <- merge(liked, top200[, .(anime_id, name)], by = "anime_id")
liked <- unique(liked, by = c("user_id", "name"))
The transaction summary shows that 62,033 users and 198 of the Top 200 anime titles remain in the final analysis. The average basket size of 23.57 indicates that users tend to rate several titles highly rather than isolated ones, which creates enough overlap for meaningful association rule discovery. Basket sizes are also strongly right-skewed: they range from 2 to 179 with a median of 16, so most users like a modest number of titles while a small group likes far more.
baskets <- liked[, .(items = list(unique(name))), by = user_id]
baskets <- baskets[lengths(items) >= 2]
trans <- as(baskets$items, "transactions")
summary(trans)
## transactions as itemMatrix in sparse format with
## 62033 rows (elements/itemsets/transactions) and
## 198 columns (items) and a density of 0.1190334
##
## most frequent items:
##                         Death Note                 Shingeki no Kyojin
##                              29203                              21428
##    Code Geass: Hangyaku no Lelouch   Fullmetal Alchemist: Brotherhood
##                              21390                              20416
## Code Geass: Hangyaku no Lelouch R2                            (Other)
##                              18955                            1350640
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 3374 2993 2679 2485 2311 2215 1986 1902 1858 1797 1699 1563 1453 1404 1322 1319
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 1185 1208 1118 1078 1025 1035 981 925 874 829 844 738 732 747 668 649
## 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 641 596 580 547 515 475 501 481 479 430 416 392 398 363 363 377
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
## 367 264 278 309 266 260 244 236 216 221 191 212 196 198 192 177
## 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
## 147 173 150 137 105 146 125 151 105 110 103 97 93 91 94 86
## 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## 89 63 72 76 73 64 65 63 57 43 63 49 48 45 40 50
## 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## 38 39 41 35 37 29 48 35 32 29 27 23 23 16 22 30
## 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
## 17 10 16 18 23 10 13 9 10 12 9 6 13 9 11 5
## 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
## 8 3 6 7 6 3 5 6 4 3 5 3 1 2 4 2
## 146 147 149 151 153 157 158 160 162 163 167 170 176 179
## 3 1 1 2 2 1 1 1 2 2 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 7.00 16.00 23.57 33.00 179.00
##
## includes extended item information - examples:
## labels
## 1 Accel World
## 2 Akame ga Kill!
## 3 Akatsuki no Yona
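As a quick sanity check, the figures in this summary can be cross-checked against each other. The sketch below uses only numbers copied from the output above and assumes the usual reading of density as the share of filled cells in the user-by-title matrix; the totals should agree up to the rounding of the printed density.

```r
n_transactions <- 62033
n_items        <- 198
density        <- 0.1190334

# Total number of (user, title) "liked" pairs implied by the density.
total_likes <- n_transactions * n_items * density

# Mean basket size = total likes / number of transactions = density * number of items.
mean_basket <- total_likes / n_transactions

# The six counts under "most frequent items", including "(Other)",
# should add up to the same total.
item_counts <- c(29203, 21428, 21390, 20416, 18955, 1350640)

print(round(total_likes))
print(sum(item_counts))
print(round(mean_basket, 2))
```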
The figure displays the 20 most frequently liked anime titles. Highly popular series dominate the distribution: Death Note alone appears in 29,203 of the 62,033 baskets (about 47%). Because lift divides a rule's confidence by the support of its consequent, rules that point to such mainstream titles tend to receive lower lift values even when their confidence is high.
itemFrequencyPlot(trans, topN = 20, type = "absolute")
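We apply the Apriori algorithm with a minimum support of 0.006, a minimum confidence of 0.60, and a rule length between 2 and 5 items. As the log below shows, mining hit the default time limit (maxtime = 5 seconds), so only rules with up to 4 items were actually returned.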
rules <- apriori(
  trans,
  parameter = list(
    supp = 0.006,
    conf = 0.60,
    minlen = 2,
    maxlen = 5,
    target = "rules"
  )
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.006 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 372
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[198 item(s), 62033 transaction(s)] done [0.06s].
## sorting and recoding items ... [198 item(s)] done [0.02s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4
## Warning in apriori(trans, parameter = list(supp = 0.006, conf = 0.6, minlen =
## 2, : Mining stopped (time limit reached). Only patterns up to a length of 4
## returned!
## done [12.35s].
## writing ... [10664371 rule(s)] done [0.76s].
## creating S4 object ... done [1.46s].
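The "Absolute minimum support count" in the log follows directly from the relative threshold. The sketch below reproduces it under the assumption that the product is truncated to a whole number of transactions, which is consistent with the value reported above.

```r
n_transactions <- 62033
min_support    <- 0.006

# Smallest number of baskets an itemset must appear in to count as frequent.
# Truncation is an assumption that matches the reported count.
min_count <- floor(min_support * n_transactions)
print(min_count)
```

In other words, an itemset only qualifies if several hundred users liked all of its titles, which is a low bar for the Top 200 and helps explain the very large number of rules generated.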
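# Keep only rules with a clearly positive association
rules <- subset(rules, lift > 1.2)
# Drop rules for which a more general rule has the same or higher confidence
rules <- rules[!is.redundant(rules)]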
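The highest-lift rules mainly connect different seasons of the same series. Lift measures how much more often two titles are liked together than would be expected if they were liked independently, so the top value of about 9.4 means that the two Zero no Tsukaima sequels co-occur roughly nine times more often than chance would predict. Such strong dependence is consistent with sequential viewing behavior and franchise loyalty.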
rules_1lhs <- subset(rules, size(lhs(rules)) == 1)
top_lift <- sort(rules_1lhs, by = "lift", decreasing = TRUE)
inspect(head(top_lift, 10))
## lhs rhs support confidence coverage lift count
## [1] {Zero no Tsukaima: Futatsuki no Kishi} => {Zero no Tsukaima: Princesses no Rondo} 0.07420244 0.7473616 0.09928586 9.384834 4603
## [2] {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima: Futatsuki no Kishi} 0.07420244 0.9317814 0.07963503 9.384834 4603
## [3] {Magi: The Kingdom of Magic} => {Magi: The Labyrinth of Magic} 0.07080103 0.8638867 0.08195638 8.309735 4392
## [4] {Magi: The Labyrinth of Magic} => {Magi: The Kingdom of Magic} 0.07080103 0.6810358 0.10396080 8.309735 4392
## [5] {Boku wa Tomodachi ga Sukunai} => {Boku wa Tomodachi ga Sukunai Next} 0.06736737 0.6588365 0.10225203 8.271525 4179
## [6] {Boku wa Tomodachi ga Sukunai Next} => {Boku wa Tomodachi ga Sukunai} 0.06736737 0.8457802 0.07965115 8.271525 4179
## [7] {Log Horizon 2nd Season} => {Log Horizon} 0.04228395 0.9347826 0.04523399 7.852047 2623
## [8] {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima} 0.07376719 0.9263158 0.07963503 7.567779 4576
## [9] {Zero no Tsukaima} => {Zero no Tsukaima: Princesses no Rondo} 0.07376719 0.6026603 0.12240259 7.567779 4576
## [10] {High School DxD New} => {High School DxD} 0.07141360 0.8772277 0.08140828 7.553730 4430
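The lift and confidence columns can be recomputed by hand from support and coverage. The sketch below does this for rows [1] and [2], using only values copied from the table above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from rows [1] and [2] of the table above.
supp_both  <- 0.07420244  # both Zero no Tsukaima sequels liked
supp_kishi <- 0.09928586  # coverage of rule [1]: Futatsuki no Kishi liked
supp_rondo <- 0.07963503  # coverage of rule [2]: Princesses no Rondo liked

# Lift compares the observed joint support with what independence would predict.
lift <- supp_both / (supp_kishi * supp_rondo)

# Confidence is directional: joint support divided by the antecedent's support.
conf_kishi_to_rondo <- supp_both / supp_kishi
conf_rondo_to_kishi <- supp_both / supp_rondo

print(c(lift = lift,
        conf_1 = conf_kishi_to_rondo,
        conf_2 = conf_rondo_to_kishi))
```

Lift is symmetric, which is why each pair of mirrored rules in the table shares the same lift while the two confidence values differ.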
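The rules ranked by confidence show very high conditional probabilities: all ten rules listed below exceed 0.92. This means that users who highly rate a sequel season almost always rate the original season highly as well. However, confidence alone is inflated when the consequent is popular in its own right. For example, Code Geass: Hangyaku no Lelouch R2 => Code Geass: Hangyaku no Lelouch has a confidence of about 0.95 but a lift of only 2.75, because roughly a third of all users already like the first season. Lift therefore remains important to confirm that these relationships reflect true dependence rather than general popularity effects.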
top_conf <- sort(rules_1lhs, by = "confidence", decreasing = TRUE)
inspect(head(top_conf, 10))
## lhs rhs support confidence coverage lift count
## [1] {Psycho-Pass 2} => {Psycho-Pass} 0.05089227 0.9526252 0.05342318 5.682681 3157
## [2] {Code Geass: Hangyaku no Lelouch R2} => {Code Geass: Hangyaku no Lelouch} 0.29013590 0.9495120 0.30556317 2.753674 17998
## [3] {Nisemonogatari} => {Bakemonogatari} 0.09831864 0.9381634 0.10479906 5.157944 6099
## [4] {Log Horizon 2nd Season} => {Log Horizon} 0.04228395 0.9347826 0.04523399 7.852047 2623
## [5] {Fate/Zero 2nd Season} => {Fate/Zero} 0.13865201 0.9343835 0.14838876 5.540829 8601
## [6] {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima: Futatsuki no Kishi} 0.07420244 0.9317814 0.07963503 9.384834 4603
## [7] {Kuroko no Basket 2nd Season} => {Kuroko no Basket} 0.08829172 0.9298812 0.09494946 7.139024 5477
## [8] {Kuroshitsuji II} => {Kuroshitsuji} 0.07030129 0.9280698 0.07575000 6.246171 4361
## [9] {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima} 0.07376719 0.9263158 0.07963503 7.567779 4576
## [10] {Darker than Black: Ryuusei no Gemini} => {Darker than Black: Kuro no Keiyakusha} 0.07536311 0.9257426 0.08140828 5.386099 4675
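The gap between confidence and lift can be traced back to the popularity of the consequent. The following sketch, which uses only figures printed in the outputs above, recomputes the lift of the Code Geass rule from the first season's item count and, for comparison, backs out the consequent support implied by the Log Horizon rule via support(rhs) = confidence / lift.

```r
n_transactions <- 62033

# Rule [2]: {Code Geass R2} => {Code Geass}.
conf_geass <- 0.9495120
supp_geass <- 21390 / n_transactions  # first-season count from the item summary
lift_geass <- conf_geass / supp_geass

# Rule [4]: {Log Horizon 2nd Season} => {Log Horizon}.
# Consequent support implied by the reported confidence and lift.
conf_log <- 0.9347826
lift_log <- 7.852047
supp_log <- conf_log / lift_log

print(c(lift_geass = lift_geass, supp_geass = supp_geass, supp_log = supp_log))
```

Both rules have nearly the same confidence, but the Code Geass consequent is liked by a much larger share of users, so the same confidence translates into a far smaller lift.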
The graph visualization illustrates association rules as a network of connected anime titles. Stronger associations form visible clusters, particularly among sequel seasons, reinforcing the sequential nature of viewing behavior.
plot(head(top_lift, 15), method="graph", engine="htmlwidget")
The results reveal strong sequential and franchise-based consumption behavior. Many high-lift rules occur between sequel seasons of the same anime series. This suggests that anime consumption is structured and continuity-driven rather than independent. Lift proved more informative than confidence in identifying meaningful associations.
This study demonstrates that association rule mining can effectively uncover structured user preference patterns in entertainment datasets. By treating users as transactions and highly rated anime titles as items, we identified strong co-occurrence relationships within the Top 200 anime.
The findings confirm the presence of sequential viewing behavior, a clear franchise loyalty effect, and strong item co-dependence between related titles. In particular, sequel seasons frequently appeared together with high lift and confidence values, indicating that anime consumption is not random but highly structured.
These results highlight the practical relevance of association rule mining for recommendation systems and digital content platforms. Understanding such structured preference patterns can improve personalized suggestions and user engagement strategies.
Future research may extend this analysis by examining genre-based associations or by comparing different support and confidence thresholds to evaluate the stability of the discovered rules.