library(data.table)
library(dplyr)
library(arules)
library(arulesViz)
knitr::opts_chunk$set(echo = TRUE)

Introduction

In this study, we apply association rule mining to anime preference data. Each user is treated as a transaction (basket), and the anime titles that user rated 8 or higher are treated as items. The goal is to uncover structured preference patterns among the 200 most popular anime titles, ranked by member count.

Dataset

# Read the anime metadata and the user ratings
anime  <- fread("anime.csv")
rating <- fread("rating.csv")

# Decode HTML entities left in the titles
anime[, name := gsub("&#039;", "'", name, fixed = TRUE)]
anime[, name := gsub("&amp;", "&", name, fixed = TRUE)]

Data Preparation

Select Top 200 Anime

top200 <- anime[order(-members)][1:200, .(anime_id, name, members)]

Liked Anime (Rating ≥ 8)

liked <- rating[rating >= 8]
liked <- merge(liked, top200[, .(anime_id, name)], by = "anime_id")
liked <- unique(liked, by = c("user_id", "name"))

Create User Baskets

The transaction summary shows that the final analysis includes 62,033 users (those with at least two liked titles) and 198 anime titles. The mean basket size of 23.57 (median 16) indicates that users tend to highly rate multiple titles rather than isolated ones, creating sufficient overlap for meaningful association rule discovery. The basket size distribution is strongly right-skewed, which points to heterogeneous user behavior: a quarter of users liked 7 or fewer titles, while the largest basket contains 179.

baskets <- liked[, .(items = list(unique(name))), by = user_id]
baskets <- baskets[lengths(items) >= 2]

trans <- as(baskets$items, "transactions")
summary(trans)
## transactions as itemMatrix in sparse format with
##  62033 rows (elements/itemsets/transactions) and
##  198 columns (items) and a density of 0.1190334 
## 
## most frequent items:
##                         Death Note                 Shingeki no Kyojin 
##                              29203                              21428 
##    Code Geass: Hangyaku no Lelouch   Fullmetal Alchemist: Brotherhood 
##                              21390                              20416 
## Code Geass: Hangyaku no Lelouch R2                            (Other) 
##                              18955                            1350640 
## 
## element (itemset/transaction) length distribution:
## sizes
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
## 3374 2993 2679 2485 2311 2215 1986 1902 1858 1797 1699 1563 1453 1404 1322 1319 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
## 1185 1208 1118 1078 1025 1035  981  925  874  829  844  738  732  747  668  649 
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
##  641  596  580  547  515  475  501  481  479  430  416  392  398  363  363  377 
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##  367  264  278  309  266  260  244  236  216  221  191  212  196  198  192  177 
##   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81 
##  147  173  150  137  105  146  125  151  105  110  103   97   93   91   94   86 
##   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96   97 
##   89   63   72   76   73   64   65   63   57   43   63   49   48   45   40   50 
##   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112  113 
##   38   39   41   35   37   29   48   35   32   29   27   23   23   16   22   30 
##  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128  129 
##   17   10   16   18   23   10   13    9   10   12    9    6   13    9   11    5 
##  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144  145 
##    8    3    6    7    6    3    5    6    4    3    5    3    1    2    4    2 
##  146  147  149  151  153  157  158  160  162  163  167  170  176  179 
##    3    1    1    2    2    1    1    1    2    2    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    7.00   16.00   23.57   33.00  179.00 
## 
## includes extended item information - examples:
##             labels
## 1      Accel World
## 2   Akame ga Kill!
## 3 Akatsuki no Yona
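
As a quick consistency check on the summary above, the reported density should equal the mean basket size divided by the number of items. The sketch below reproduces that relationship using only the printed figures; the variable names are illustrative and not part of the original analysis.

```r
# Figures copied from the summary(trans) output above
n_items <- 198        # columns (items)
density <- 0.1190334  # share of non-zero cells in the user x title matrix

# Density is the average fraction of titles a user liked,
# so density * number of items gives the mean basket size.
mean_basket <- density * n_items
print(round(mean_basket, 2))
```

The result rounds to 23.57, which matches the mean reported in the length distribution.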

Exploratory Analysis

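The figure displays the 20 most frequently liked anime titles. Highly popular series dominate the distribution: Death Note alone appears in 29,203 of the 62,033 baskets. This suggests that user preferences are strongly centered on mainstream titles. Because lift divides a rule's confidence by the support of its consequent, rules that point to these very popular titles tend to show lower lift even when their confidence is high.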

itemFrequencyPlot(trans, topN = 20, type = "absolute")
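
To put the absolute counts in the plot into perspective, they can be converted to relative support (the share of baskets containing a title). The sketch below does this for the five most frequent titles, using counts copied from the `summary(trans)` output; the variable names are illustrative. With the `trans` object available, `itemFrequencyPlot(trans, topN = 20, type = "relative")` should show the same quantity directly.

```r
# Absolute counts copied from the "most frequent items" part of summary(trans)
n_users <- 62033
top_counts <- c(
  "Death Note"                         = 29203,
  "Shingeki no Kyojin"                 = 21428,
  "Code Geass: Hangyaku no Lelouch"    = 21390,
  "Fullmetal Alchemist: Brotherhood"   = 20416,
  "Code Geass: Hangyaku no Lelouch R2" = 18955
)

# Relative support = share of baskets that contain the title
rel_support <- top_counts / n_users
print(round(rel_support, 3))
```

Each of these five titles is liked by more than 30% of the users in the analysis, and Death Note by roughly 47%.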

Association Rule Mining

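We apply the Apriori algorithm with a minimum support of 0.006, a minimum confidence of 0.60, and a rule length between 2 and 5 items. Note that the log below shows the run hitting the default time limit (maxtime = 5 seconds), so only rules with up to 4 items were actually returned.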

rules <- apriori(
  trans,
  parameter = list(
    supp = 0.006,
    conf = 0.60,
    minlen = 2,
    maxlen = 5,
    target = "rules"
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 372 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[198 item(s), 62033 transaction(s)] done [0.06s].
## sorting and recoding items ... [198 item(s)] done [0.02s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4
## Warning in apriori(trans, parameter = list(supp = 0.006, conf = 0.6, minlen =
## 2, : Mining stopped (time limit reached). Only patterns up to a length of 4
## returned!
##  done [12.35s].
## writing ... [10664371 rule(s)] done [0.76s].
## creating S4 object  ... done [1.46s].
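
The relative support threshold is easier to interpret as a number of users. The short sketch below converts it using the transaction count from the log; it is only an arithmetic check, not a description of how `apriori()` rounds internally.

```r
# Values copied from the apriori() call and its log
n_transactions <- 62033
min_support    <- 0.006

# Number of users that must like all titles of an itemset for it to qualify
min_count <- min_support * n_transactions
print(min_count)
```

The product is about 372.2, in line with the "Absolute minimum support count: 372" line in the log. In other words, a rule must be backed by at least a few hundred users to be reported.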

Filtering Rules

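# Keep only rules whose lift exceeds 1.2
rules <- subset(rules, lift > 1.2)
# Drop rules that add nothing over a more general rule with the same consequent
rules <- rules[!is.redundant(rules)]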
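
The two filtering steps can be illustrated on a toy rule table in base R. This is a minimal sketch of the idea only: it assumes the default behavior of `is.redundant()`, where a rule counts as redundant if a more general rule (same consequent, smaller antecedent) has equal or higher confidence. The rules and numbers below are hypothetical and not taken from the analysis.

```r
# Toy rule table (hypothetical values)
toy <- data.frame(
  lhs        = c("A", "A,B", "A,C"),
  rhs        = c("X", "X", "X"),
  confidence = c(0.80, 0.78, 0.90),
  lift       = c(2.0, 1.9, 2.3),
  stringsAsFactors = FALSE
)

# Step 1: lift filter, mirroring subset(rules, lift > 1.2)
toy <- toy[toy$lift > 1.2, ]

# Step 2: flag a rule as redundant when another rule with the same rhs and
# a strictly smaller lhs has confidence at least as high
lhs_sets <- strsplit(toy$lhs, ",", fixed = TRUE)
is_redundant <- vapply(seq_len(nrow(toy)), function(i) {
  any(vapply(seq_len(nrow(toy)), function(j) {
    j != i &&
      toy$rhs[j] == toy$rhs[i] &&
      length(lhs_sets[[j]]) < length(lhs_sets[[i]]) &&
      all(lhs_sets[[j]] %in% lhs_sets[[i]]) &&
      toy$confidence[j] >= toy$confidence[i]
  }, logical(1)))
}, logical(1))

kept <- toy[!is_redundant, ]
print(kept$lhs)
```

Here {A,B} => {X} is dropped because {A} => {X} already reaches a higher confidence with fewer conditions, while {A,C} => {X} is kept because adding C raises the confidence.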

Results

Top Rules by Lift

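The ten highest-lift rules all connect different seasons of the same series. Lift values between roughly 7.5 and 9.4 mean that these pairs are liked together about 7 to 9 times more often than would be expected if the two titles were liked independently. This strong dependence is consistent with sequential viewing behavior and franchise loyalty.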

rules_1lhs <- subset(rules, size(lhs(rules)) == 1)
top_lift <- sort(rules_1lhs, by = "lift", decreasing = TRUE)
inspect(head(top_lift, 10))
##      lhs                                        rhs                                        support confidence   coverage     lift count
## [1]  {Zero no Tsukaima: Futatsuki no Kishi}  => {Zero no Tsukaima: Princesses no Rondo} 0.07420244  0.7473616 0.09928586 9.384834  4603
## [2]  {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima: Futatsuki no Kishi}  0.07420244  0.9317814 0.07963503 9.384834  4603
## [3]  {Magi: The Kingdom of Magic}            => {Magi: The Labyrinth of Magic}          0.07080103  0.8638867 0.08195638 8.309735  4392
## [4]  {Magi: The Labyrinth of Magic}          => {Magi: The Kingdom of Magic}            0.07080103  0.6810358 0.10396080 8.309735  4392
## [5]  {Boku wa Tomodachi ga Sukunai}          => {Boku wa Tomodachi ga Sukunai Next}     0.06736737  0.6588365 0.10225203 8.271525  4179
## [6]  {Boku wa Tomodachi ga Sukunai Next}     => {Boku wa Tomodachi ga Sukunai}          0.06736737  0.8457802 0.07965115 8.271525  4179
## [7]  {Log Horizon 2nd Season}                => {Log Horizon}                           0.04228395  0.9347826 0.04523399 7.852047  2623
## [8]  {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima}                      0.07376719  0.9263158 0.07963503 7.567779  4576
## [9]  {Zero no Tsukaima}                      => {Zero no Tsukaima: Princesses no Rondo} 0.07376719  0.6026603 0.12240259 7.567779  4576
## [10] {High School DxD New}                   => {High School DxD}                       0.07141360  0.8772277 0.08140828 7.553730  4430
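
The measures in the table are tied together by simple ratios: confidence is the joint support divided by the support of the antecedent, and lift is the joint support divided by the product of the two individual supports. The sketch below recomputes them for rules [1] and [2], using only values printed in the table; the variable names are illustrative.

```r
# Values copied from rules [1] and [2] in the table above
support_both      <- 0.07420244  # share of users who liked both seasons
support_futatsuki <- 0.09928586  # coverage of rule [1] = support of its lhs
support_rondo     <- 0.07963503  # coverage of rule [2] = support of its lhs

# Lift is symmetric, so both directions of the rule share the same value
lift   <- support_both / (support_futatsuki * support_rondo)
# Confidence depends on the direction of the rule
conf_1 <- support_both / support_futatsuki  # Futatsuki no Kishi => Princesses no Rondo
conf_2 <- support_both / support_rondo      # Princesses no Rondo => Futatsuki no Kishi

print(round(c(lift = lift, conf_1 = conf_1, conf_2 = conf_2), 3))
```

This explains why each pair of mirrored rules in the table has identical support and lift but different confidence and coverage.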

Top Rules by Confidence

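The top ten rules by confidence all exceed 0.92. This means that users who highly rate a later season almost always rate an earlier season of the same series highly as well. However, confidence alone is inflated when the consequent is popular in general. For example, {Code Geass: Hangyaku no Lelouch R2} => {Code Geass: Hangyaku no Lelouch} has the second-highest confidence (0.95) but a lift of only 2.75, because the first season is already one of the most frequently liked titles. Lift therefore remains important to confirm that these relationships reflect true dependence rather than general popularity effects.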

top_conf <- sort(rules_1lhs, by = "confidence", decreasing = TRUE)
inspect(head(top_conf, 10))
##      lhs                                        rhs                                        support confidence   coverage     lift count
## [1]  {Psycho-Pass 2}                         => {Psycho-Pass}                           0.05089227  0.9526252 0.05342318 5.682681  3157
## [2]  {Code Geass: Hangyaku no Lelouch R2}    => {Code Geass: Hangyaku no Lelouch}       0.29013590  0.9495120 0.30556317 2.753674 17998
## [3]  {Nisemonogatari}                        => {Bakemonogatari}                        0.09831864  0.9381634 0.10479906 5.157944  6099
## [4]  {Log Horizon 2nd Season}                => {Log Horizon}                           0.04228395  0.9347826 0.04523399 7.852047  2623
## [5]  {Fate/Zero 2nd Season}                  => {Fate/Zero}                             0.13865201  0.9343835 0.14838876 5.540829  8601
## [6]  {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima: Futatsuki no Kishi}  0.07420244  0.9317814 0.07963503 9.384834  4603
## [7]  {Kuroko no Basket 2nd Season}           => {Kuroko no Basket}                      0.08829172  0.9298812 0.09494946 7.139024  5477
## [8]  {Kuroshitsuji II}                       => {Kuroshitsuji}                          0.07030129  0.9280698 0.07575000 6.246171  4361
## [9]  {Zero no Tsukaima: Princesses no Rondo} => {Zero no Tsukaima}                      0.07376719  0.9263158 0.07963503 7.567779  4576
## [10] {Darker than Black: Ryuusei no Gemini}  => {Darker than Black: Kuro no Keiyakusha} 0.07536311  0.9257426 0.08140828 5.386099  4675
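
The gap between confidence and lift for the Code Geass rule (row [2]) can be reproduced from raw counts. The sketch below uses the item counts from `summary(trans)` and the rule count from the table above; the variable names are illustrative.

```r
# Counts copied from summary(trans) and from rule [2] above
n_users    <- 62033
count_r2   <- 18955  # users who liked Code Geass: Hangyaku no Lelouch R2 (lhs)
count_s1   <- 21390  # users who liked Code Geass: Hangyaku no Lelouch (rhs)
count_both <- 17998  # users who liked both (rule count)

# Confidence: share of R2 fans who also liked the first season
confidence <- count_both / count_r2
# Lift: confidence relative to the first season's overall popularity
lift <- confidence / (count_s1 / n_users)

print(round(c(confidence = confidence, lift = lift), 3))
```

Because about a third of all users like the first season anyway, a confidence of 0.95 corresponds to a lift of only about 2.75, whereas the Psycho-Pass rule in row [1] has a similar confidence but a lift of 5.68.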

Visualization

The graph visualization shows the 15 highest-lift single-antecedent rules as a network of connected anime titles. Strongly associated titles form visible clusters, particularly among seasons of the same franchise, which is consistent with the sequential viewing pattern seen in the rule tables.

plot(head(top_lift, 15), method="graph", engine="htmlwidget")

Discussion

The results reveal strong sequential and franchise-based consumption behavior. Many high-lift rules occur between sequel seasons of the same anime series. This suggests that anime consumption is structured and continuity-driven rather than independent. Lift proved more informative than confidence in identifying meaningful associations.

Conclusion

This study demonstrates that association rule mining can effectively uncover structured user preference patterns in entertainment datasets. By treating users as transactions and highly rated anime titles as items, we identified strong co-occurrence relationships within the Top 200 anime.

The findings confirm the presence of sequential viewing behavior, a clear franchise loyalty effect, and strong item co-dependence between related titles. In particular, sequel seasons frequently appeared together with high lift and confidence values, indicating that anime consumption is not random but highly structured.

These results highlight the practical relevance of association rule mining for recommendation systems and digital content platforms. Understanding such structured preference patterns can improve personalized suggestions and user engagement strategies.

Future research may extend this analysis by examining genre-based associations or by comparing different support and confidence thresholds to evaluate the stability of the discovered rules.