This study aims to explore the methodological potential of association rule mining in the context of humanities research and to examine its interpretive value when applied to unstructured or weakly structured cultural data. Using movie genres as the object of analysis—a form of cultural data characterized by high semantic heterogeneity and hierarchical structure—this study seeks to move beyond the traditional application of association rule analysis in market basket data and introduce it into the research framework of digital humanities and cultural data analysis.
Specifically, the study conducts a systematic analysis of genre combinations based on frequency, support, confidence, and lift, in order to uncover functional differentiation and structural roles among genres within cinematic narratives, such as core–supplementary and carrying–directional relationships. Rather than focusing on prediction or commercial recommendation, the emphasis of this research lies in demonstrating that even low-frequency genre combinations may exhibit stable and structurally meaningful associations, which are of particular importance in interpretation-oriented humanities research.
Through this study, the following objectives are pursued: (1) to evaluate the applicability of association rule mining to unstructured, semantically dense cultural data; (2) to illustrate the explanatory potential of quantitative methods in revealing cultural structures and latent semantic relationships; (3) to provide an empirically grounded and operational case for the broader adoption of association rule analysis in humanities research and interdisciplinary applications.
## Source
The data used in this study are drawn from the MovieLens dataset, a publicly available dataset released by GroupLens Research, a research group at the University of Minnesota, and widely used in recommender systems research.
The dataset contains approximately two million tag applications associated with 87,585 movies. MovieLens is a well-established benchmark dataset for studying recommender systems and user behavior, and it includes information such as movie titles, release years, genres, and user ratings. In this study, the analysis focuses specifically on the movie genre attribute.
As the original dataset also contains additional attributes such as release year and user ratings, the data were preprocessed by removing irrelevant columns and retaining only the variables required for association rule mining. Due to the large size of the original dataset, a random sample of 150 observations was selected to facilitate analysis.
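The preprocessing described above might look like the following minimal sketch. The toy `movies` data frame, the seed, and the sample size of 3 are illustrative assumptions standing in for the MovieLens `movies.csv` file and the study's sample of 150; the original preprocessing script is not shown in this report.

```r
library(arules)

# Toy stand-in for MovieLens movies.csv (assumption: the real file has a
# pipe-delimited `genres` column; these five rows are invented examples).
movies <- data.frame(
  movieId = 1:5,
  title   = c("Film A", "Film B", "Film C", "Film D", "Film E"),
  genres  = c("Comedy|Drama", "Drama", "Action|Thriller",
              "Drama|Romance", "Animation|Children|Comedy"),
  stringsAsFactors = FALSE
)

set.seed(123)                            # assumed seed, for reproducibility
keep <- sample(nrow(movies), size = 3)   # the study samples 150 movies

# Keep only the genre attribute and split each string into a genre vector.
genre_list <- strsplit(movies$genres[keep], "|", fixed = TRUE)

# Coerce the list of genre vectors to an arules transactions object.
genre_tx <- as(genre_list, "transactions")

length(genre_tx)   # number of transactions equals the sample size
size(genre_tx)     # number of genre labels per sampled movie
```

Coercing an in-memory list, rather than re-reading an exported file, also avoids any header line being counted as a transaction.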
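Movies with one or two genre labels dominate the sample: 105 of the 151 transactions (about 70%) contain one or two items. The transaction object holds 151 entries rather than 150 because the first entry, {x}, appears to be a header line read as a transaction; one further movie carries the placeholder label (no genres listed).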
This pattern indicates that, in practice, movie genres are not combined in an unrestricted manner but are typically organized around one or two core genres. In this context, genre labels primarily function as a structure of a dominant genre accompanied by one or more supplementary genres, rather than as fully equivalent multiple classifications.
This distributional characteristic implies that association rule mining on movie genres will inevitably yield both high-frequency rules with limited association strength and low-frequency but structurally strong rules with relatively high lift values.
inspect(genre)
## items
## [1] {x}
## [2] {Action,
## Crime,
## Drama,
## Thriller}
## [3] {Comedy,
## Drama,
## Fantasy,
## Romance}
## [4] {Animation,
## Children,
## Comedy}
## [5] {Drama}
## [6] {Horror,
## Sci-Fi}
## [7] {Comedy}
## [8] {Drama}
## [9] {Comedy,
## Drama}
## [10] {Action,
## Drama}
## [11] {Comedy,
## Horror}
## [12] {Drama,
## Romance}
## [13] {Crime}
## [14] {Children,
## Drama}
## [15] {Comedy}
## [16] {Drama}
## [17] {Action,
## Drama}
## [18] {Comedy,
## Horror}
## [19] {Children,
## Comedy}
## [20] {Comedy,
## Drama}
## [21] {Comedy,
## Drama,
## War}
## [22] {Comedy,
## Drama,
## Romance}
## [23] {Action,
## Adventure,
## Animation,
## Drama,
## Fantasy,
## Sci-Fi}
## [24] {Drama}
## [25] {Comedy,
## Crime,
## Fantasy}
## [26] {Drama}
## [27] {Comedy,
## Romance}
## [28] {Drama}
## [29] {Crime,
## Drama,
## Thriller}
## [30] {Drama,
## Romance}
## [31] {Drama}
## [32] {Action,
## Adventure,
## Animation,
## Children,
## Comedy,
## Fantasy}
## [33] {Comedy,
## Romance}
## [34] {Documentary}
## [35] {Action,
## Adventure,
## Drama}
## [36] {Drama}
## [37] {Animation,
## Comedy,
## Sci-Fi}
## [38] {Adventure,
## Fantasy}
## [39] {Adventure,
## Children,
## Drama}
## [40] {Drama,
## Musical}
## [41] {Adventure,
## Romance}
## [42] {Crime,
## Thriller}
## [43] {Mystery,
## Thriller}
## [44] {Comedy,
## Drama}
## [45] {Action,
## Adventure,
## Fantasy}
## [46] {Comedy,
## Drama}
## [47] {Comedy,
## Romance}
## [48] {Action,
## Drama,
## Thriller,
## War}
## [49] {Comedy,
## Drama}
## [50] {Adventure,
## Children}
## [51] {Documentary}
## [52] {Crime,
## Drama,
## Thriller}
## [53] {Animation,
## Children,
## Comedy}
## [54] {Action,
## Comedy,
## Horror}
## [55] {Drama,
## War}
## [56] {Crime,
## Documentary}
## [57] {Action,
## Thriller}
## [58] {Adventure,
## Comedy,
## Crime}
## [59] {Action,
## Adventure,
## Thriller}
## [60] {Action,
## Comedy}
## [61] {Drama}
## [62] {Documentary}
## [63] {Action,
## Comedy,
## Drama}
## [64] {Adventure,
## Drama,
## Fantasy,
## Romance}
## [65] {Comedy}
## [66] {Action,
## Thriller}
## [67] {Action,
## Thriller}
## [68] {Comedy,
## Romance}
## [69] {Drama,
## Thriller}
## [70] {Crime,
## Drama}
## [71] {Documentary,
## War}
## [72] {Comedy,
## Drama}
## [73] {Documentary}
## [74] {Action,
## Horror,
## Sci-Fi}
## [75] {Comedy}
## [76] {Comedy}
## [77] {Drama,
## Romance}
## [78] {Action,
## Drama,
## Horror,
## Thriller}
## [79] {Drama,
## Romance}
## [80] {Comedy,
## Drama}
## [81] {Drama,
## Mystery,
## Romance,
## Thriller}
## [82] {Drama,
## Romance}
## [83] {Crime,
## Drama}
## [84] {Crime,
## Drama}
## [85] {Action,
## Adventure,
## Sci-Fi}
## [86] {Action,
## Drama,
## Thriller,
## War}
## [87] {Horror}
## [88] {Drama,
## Romance,
## Thriller}
## [89] {Comedy,
## Drama}
## [90] {Drama,
## Romance,
## War}
## [91] {Comedy,
## Drama}
## [92] {Drama,
## Romance}
## [93] {Drama}
## [94] {Drama,
## Horror,
## Sci-Fi}
## [95] {(no genres listed)}
## [96] {Comedy,
## Romance}
## [97] {Action,
## Comedy,
## Crime}
## [98] {Adventure,
## Comedy}
## [99] {Comedy}
## [100] {Action,
## Crime}
## [101] {Animation,
## Children,
## Comedy}
## [102] {Action,
## Drama,
## Romance,
## War}
## [103] {Horror,
## Sci-Fi,
## Thriller}
## [104] {Comedy}
## [105] {Action,
## Animation,
## Film-Noir,
## Sci-Fi,
## Thriller}
## [106] {Drama,
## Romance}
## [107] {Adventure,
## Drama}
## [108] {Horror,
## Thriller}
## [109] {Action,
## Comedy}
## [110] {Drama,
## Romance}
## [111] {Action,
## Crime,
## Drama}
## [112] {Drama}
## [113] {Children}
## [114] {Action,
## Animation,
## Comedy,
## Horror}
## [115] {Comedy,
## Drama,
## Romance}
## [116] {Comedy,
## Romance}
## [117] {Comedy}
## [118] {Drama}
## [119] {Crime,
## Sci-Fi}
## [120] {Documentary}
## [121] {Comedy}
## [122] {Comedy}
## [123] {Drama}
## [124] {Drama,
## Thriller}
## [125] {Comedy}
## [126] {Comedy,
## Horror}
## [127] {Action,
## Adventure,
## IMAX,
## Sci-Fi}
## [128] {Film-Noir,
## Romance,
## Thriller}
## [129] {Documentary}
## [130] {Drama}
## [131] {Drama}
## [132] {Adventure,
## Children,
## Fantasy}
## [133] {Drama}
## [134] {Comedy,
## Drama,
## Romance}
## [135] {Animation,
## Children}
## [136] {Comedy}
## [137] {Drama,
## Mystery,
## Thriller}
## [138] {Drama,
## Thriller}
## [139] {Documentary}
## [140] {Drama}
## [141] {Drama}
## [142] {Film-Noir,
## Thriller}
## [143] {Comedy,
## Crime,
## Drama,
## Sci-Fi,
## Thriller}
## [144] {Action,
## Horror,
## Thriller}
## [145] {Comedy}
## [146] {Comedy}
## [147] {Drama}
## [148] {Action,
## Thriller}
## [149] {Crime,
## Drama,
## Mystery,
## Romance,
## Thriller}
## [150] {Crime,
## Drama}
## [151] {Animation,
## Comedy,
## Fantasy,
## Musical}
size(genre)
## [1] 1 4 4 3 1 2 1 1 2 2 2 2 1 2 1 1 2 2 2 2 3 3 6 1 3 1 2 1 3 2 1 6 2 1 3 1 3
## [38] 2 3 2 2 2 2 2 3 2 2 4 2 2 1 3 3 3 2 2 2 3 3 2 1 1 3 4 1 2 2 2 2 2 2 2 1 3
## [75] 1 1 2 4 2 2 4 2 2 2 3 4 1 3 2 3 2 2 1 3 1 2 3 2 1 2 3 4 3 1 5 2 2 2 2 2 3
## [112] 1 1 4 3 2 1 1 2 1 1 1 1 2 1 2 4 3 1 1 1 3 1 3 2 1 3 2 1 1 1 2 5 3 1 1 1 2
## [149] 5 2 4
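The size distribution can be summarized directly from the output above. The sketch below copies the 151 printed sizes into a plain vector (so it runs without the `genre` object) and tabulates them; with the original object, `table(size(genre))` should give the same table.

```r
# Transaction sizes copied from the size(genre) output (151 values).
sizes <- c(
  1,4,4,3,1,2,1,1,2,2,2,2,1,2,1,1,2,2,2,2,3,3,6,1,3,1,2,1,3,2,1,6,2,1,3,1,3,
  2,3,2,2,2,2,2,3,2,2,4,2,2,1,3,3,3,2,2,2,3,3,2,1,1,3,4,1,2,2,2,2,2,2,2,1,3,
  1,1,2,4,2,2,4,2,2,2,3,4,1,3,2,3,2,2,1,3,1,2,3,2,1,2,3,4,3,1,5,2,2,2,2,2,3,
  1,1,4,3,2,1,1,2,1,1,1,1,2,1,2,4,3,1,1,1,3,1,3,2,1,3,2,1,1,1,2,5,3,1,1,1,2,
  5,2,4
)

size_table       <- table(sizes)      # number of transactions per size
share_one_or_two <- mean(sizes <= 2)  # share with one or two genre labels

size_table
round(share_one_or_two, 2)
```

The total of the sizes (326) equals the sum of the absolute item frequencies reported later, which is a quick consistency check on the copied values.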
The absolute frequency of each genre was computed, and the top ten genres were visualized. The results indicate that Drama is the most frequently occurring genre in the sample of 150 movies, appearing in nearly half of the observations, followed by Comedy. The graphical representation further corroborates this observation.
round(itemFrequency(genre),3)
## (no genres listed) Action Adventure Animation
## 0.007 0.192 0.106 0.066
## Children Comedy Crime Documentary
## 0.073 0.358 0.119 0.060
## Drama Fantasy Film-Noir Horror
## 0.490 0.060 0.020 0.086
## IMAX Musical Mystery Romance
## 0.007 0.013 0.026 0.172
## Sci-Fi Thriller War x
## 0.073 0.179 0.046 0.007
itemFrequency(genre, type = "absolute")
## (no genres listed) Action Adventure Animation
## 1 29 16 10
## Children Comedy Crime Documentary
## 11 54 18 9
## Drama Fantasy Film-Noir Horror
## 74 9 3 13
## IMAX Musical Mystery Romance
## 1 2 4 26
## Sci-Fi Thriller War x
## 11 27 7 1
itemFrequencyPlot(genre, topN=10, type="absolute", main="Item Frequency")
image(genre)
colnames(genre[,9])
## [1] "Drama"
colnames(genre[,6])
## [1] "Comedy"
The relative frequency of each genre was also calculated and visualized. The resulting distribution has the same shape as that of the absolute frequencies, as expected, since each relative frequency is the absolute count divided by the 151 transactions.
itemFrequency(genre, type = "relative")
## (no genres listed) Action Adventure Animation
## 0.006622517 0.192052980 0.105960265 0.066225166
## Children Comedy Crime Documentary
## 0.072847682 0.357615894 0.119205298 0.059602649
## Drama Fantasy Film-Noir Horror
## 0.490066225 0.059602649 0.019867550 0.086092715
## IMAX Musical Mystery Romance
## 0.006622517 0.013245033 0.026490066 0.172185430
## Sci-Fi Thriller War x
## 0.072847682 0.178807947 0.046357616 0.006622517
itemFrequencyPlot(genre, topN=10, type="relative", main="Item Frequency")
From the perspective of absolute frequencies, Drama and Comedy emerge as the most frequently occurring genres. However, the average value of the co-occurrence table is low (about 2.2 movies per cell, diagonal included), indicating that many genres occupy peripheral positions in the genre co-occurrence network. Such genres tend to connect primarily through broader, more general genres and have few links to other specific genres, which results in low observed co-occurrence counts.
This pattern suggests that, in multi-genre movies, Drama often functions as a high-frequency foundational genre, while other genres more commonly appear as supplementary elements. The findings point to the possible existence of a core–periphery structure in movie genre combinations, although this interpretation requires further validation using larger samples or longitudinal data.
Based on this observation, a preliminary hypothesis can be proposed: for a given movie—for example, a drama–science fiction film—Drama serves as a conservative baseline genre that ensures broad audience accessibility, whereas Science Fiction represents a secondary or more specialized genre reflecting the film’s distinctive thematic orientation.
ctab<-crossTable(genre, measure="count", sort=TRUE)
ctab
## Drama Comedy Action Thriller Romance Crime Adventure Horror
## Drama 74 16 11 14 18 10 5 2
## Comedy 16 54 7 1 10 4 3 5
## Action 11 7 29 11 1 4 7 5
## Thriller 14 1 11 27 4 6 1 4
## Romance 18 10 1 4 26 1 2 0
## Crime 10 4 4 6 1 18 1 0
## Adventure 5 3 7 1 2 1 16 0
## Horror 2 5 5 4 0 0 0 13
## Children 2 5 1 0 0 0 4 0
## Sci-Fi 3 2 5 3 0 2 3 4
## Animation 1 7 4 1 0 0 2 1
## Documentary 0 0 0 0 0 1 0 0
## Fantasy 3 4 3 0 2 1 6 0
## War 6 1 3 2 2 0 0 0
## Mystery 3 0 0 4 2 1 0 0
## Film-Noir 0 0 1 3 1 0 0 0
## Musical 1 1 0 0 0 0 0 0
## (no genres listed) 0 0 0 0 0 0 0 0
## IMAX 0 0 1 0 0 0 1 0
## x 0 0 0 0 0 0 0 0
## Children Sci-Fi Animation Documentary Fantasy War Mystery
## Drama 2 3 1 0 3 6 3
## Comedy 5 2 7 0 4 1 0
## Action 1 5 4 0 3 3 0
## Thriller 0 3 1 0 0 2 4
## Romance 0 0 0 0 2 2 2
## Crime 0 2 0 1 1 0 1
## Adventure 4 3 2 0 6 0 0
## Horror 0 4 1 0 0 0 0
## Children 11 0 5 0 2 0 0
## Sci-Fi 0 11 3 0 1 0 0
## Animation 5 3 10 0 3 0 0
## Documentary 0 0 0 9 0 1 0
## Fantasy 2 1 3 0 9 0 0
## War 0 0 0 1 0 7 0
## Mystery 0 0 0 0 0 0 4
## Film-Noir 0 1 1 0 0 0 0
## Musical 0 0 1 0 1 0 0
## (no genres listed) 0 0 0 0 0 0 0
## IMAX 0 1 0 0 0 0 0
## x 0 0 0 0 0 0 0
## Film-Noir Musical (no genres listed) IMAX x
## Drama 0 1 0 0 0
## Comedy 0 1 0 0 0
## Action 1 0 0 1 0
## Thriller 3 0 0 0 0
## Romance 1 0 0 0 0
## Crime 0 0 0 0 0
## Adventure 0 0 0 1 0
## Horror 0 0 0 0 0
## Children 0 0 0 0 0
## Sci-Fi 1 0 0 1 0
## Animation 1 1 0 0 0
## Documentary 0 0 0 0 0
## Fantasy 0 1 0 0 0
## War 0 0 0 0 0
## Mystery 0 0 0 0 0
## Film-Noir 3 0 0 0 0
## Musical 0 2 0 0 0
## (no genres listed) 0 0 1 0 0
## IMAX 0 0 0 1 0
## x 0 0 0 0 1
mean(ctab)
## [1] 2.195
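Note that `mean(ctab)` averages every cell of the table, including the diagonal, which holds each genre's own frequency. The sketch below uses only the four most frequent genres, with counts copied from the table above, to show how the diagonal inflates the mean compared with an average over genre pairs alone. Restricting to four genres is an illustrative simplification.

```r
genres <- c("Drama", "Comedy", "Action", "Thriller")

# Co-occurrence counts copied from the first four rows/columns of ctab.
m <- matrix(
  c(74, 16, 11, 14,
    16, 54,  7,  1,
    11,  7, 29, 11,
    14,  1, 11, 27),
  nrow = 4, byrow = TRUE, dimnames = list(genres, genres)
)

mean_all   <- mean(m)                 # diagonal (single-genre counts) included
mean_pairs <- mean(m[upper.tri(m)])   # each genre pair counted once

c(all_cells = mean_all, pairs_only = mean_pairs)
```

On the full table the same idea applies: `mean(ctab[upper.tri(ctab)])` would average over genre pairs only.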
From the perspective of support, the central hub role of Drama becomes particularly evident. It co-occurs with most other genres and has the highest pairwise support values, up to 0.119 with Romance. The average support across the table is nevertheless low (about 0.015, diagonal included), so most individual pairings are rare and genre labels are spread across many different combinations, reflecting the difficulty of assigning films to strictly defined categories. This finding highlights the strong integrative capacity of cinema and the importance of genre hybridity as a fundamental driver of its ongoing development.
stab<-crossTable(genre, measure="support", sort=TRUE)
round(stab, 3)
## Drama Comedy Action Thriller Romance Crime Adventure Horror
## Drama 0.490 0.106 0.073 0.093 0.119 0.066 0.033 0.013
## Comedy 0.106 0.358 0.046 0.007 0.066 0.026 0.020 0.033
## Action 0.073 0.046 0.192 0.073 0.007 0.026 0.046 0.033
## Thriller 0.093 0.007 0.073 0.179 0.026 0.040 0.007 0.026
## Romance 0.119 0.066 0.007 0.026 0.172 0.007 0.013 0.000
## Crime 0.066 0.026 0.026 0.040 0.007 0.119 0.007 0.000
## Adventure 0.033 0.020 0.046 0.007 0.013 0.007 0.106 0.000
## Horror 0.013 0.033 0.033 0.026 0.000 0.000 0.000 0.086
## Children 0.013 0.033 0.007 0.000 0.000 0.000 0.026 0.000
## Sci-Fi 0.020 0.013 0.033 0.020 0.000 0.013 0.020 0.026
## Animation 0.007 0.046 0.026 0.007 0.000 0.000 0.013 0.007
## Documentary 0.000 0.000 0.000 0.000 0.000 0.007 0.000 0.000
## Fantasy 0.020 0.026 0.020 0.000 0.013 0.007 0.040 0.000
## War 0.040 0.007 0.020 0.013 0.013 0.000 0.000 0.000
## Mystery 0.020 0.000 0.000 0.026 0.013 0.007 0.000 0.000
## Film-Noir 0.000 0.000 0.007 0.020 0.007 0.000 0.000 0.000
## Musical 0.007 0.007 0.000 0.000 0.000 0.000 0.000 0.000
## (no genres listed) 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## IMAX 0.000 0.000 0.007 0.000 0.000 0.000 0.007 0.000
## x 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Children Sci-Fi Animation Documentary Fantasy War Mystery
## Drama 0.013 0.020 0.007 0.000 0.020 0.040 0.020
## Comedy 0.033 0.013 0.046 0.000 0.026 0.007 0.000
## Action 0.007 0.033 0.026 0.000 0.020 0.020 0.000
## Thriller 0.000 0.020 0.007 0.000 0.000 0.013 0.026
## Romance 0.000 0.000 0.000 0.000 0.013 0.013 0.013
## Crime 0.000 0.013 0.000 0.007 0.007 0.000 0.007
## Adventure 0.026 0.020 0.013 0.000 0.040 0.000 0.000
## Horror 0.000 0.026 0.007 0.000 0.000 0.000 0.000
## Children 0.073 0.000 0.033 0.000 0.013 0.000 0.000
## Sci-Fi 0.000 0.073 0.020 0.000 0.007 0.000 0.000
## Animation 0.033 0.020 0.066 0.000 0.020 0.000 0.000
## Documentary 0.000 0.000 0.000 0.060 0.000 0.007 0.000
## Fantasy 0.013 0.007 0.020 0.000 0.060 0.000 0.000
## War 0.000 0.000 0.000 0.007 0.000 0.046 0.000
## Mystery 0.000 0.000 0.000 0.000 0.000 0.000 0.026
## Film-Noir 0.000 0.007 0.007 0.000 0.000 0.000 0.000
## Musical 0.000 0.000 0.007 0.000 0.007 0.000 0.000
## (no genres listed) 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## IMAX 0.000 0.007 0.000 0.000 0.000 0.000 0.000
## x 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Film-Noir Musical (no genres listed) IMAX x
## Drama 0.000 0.007 0.000 0.000 0.000
## Comedy 0.000 0.007 0.000 0.000 0.000
## Action 0.007 0.000 0.000 0.007 0.000
## Thriller 0.020 0.000 0.000 0.000 0.000
## Romance 0.007 0.000 0.000 0.000 0.000
## Crime 0.000 0.000 0.000 0.000 0.000
## Adventure 0.000 0.000 0.000 0.007 0.000
## Horror 0.000 0.000 0.000 0.000 0.000
## Children 0.000 0.000 0.000 0.000 0.000
## Sci-Fi 0.007 0.000 0.000 0.007 0.000
## Animation 0.007 0.007 0.000 0.000 0.000
## Documentary 0.000 0.000 0.000 0.000 0.000
## Fantasy 0.000 0.007 0.000 0.000 0.000
## War 0.000 0.000 0.000 0.000 0.000
## Mystery 0.000 0.000 0.000 0.000 0.000
## Film-Noir 0.020 0.000 0.000 0.000 0.000
## Musical 0.000 0.013 0.000 0.000 0.000
## (no genres listed) 0.000 0.000 0.007 0.000 0.000
## IMAX 0.000 0.000 0.000 0.007 0.000
## x 0.000 0.000 0.000 0.000 0.007
mean(stab)
## [1] 0.01453642
Lift is used to measure the extent to which the probability of the consequent increases under a given antecedent, relative to its unconditional probability. When the lift value is close to 1, the corresponding rule approximates random co-occurrence; when lift is significantly greater than 1, it indicates the presence of a strong structural association.
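As a worked example of this definition, the sketch below computes support, confidence, and lift for the rule {Romance} => {Drama} from the counts reported in this document (74 Drama movies, 26 Romance movies, 18 movies with both, 151 transactions). The variable names are illustrative.

```r
n         <- 151   # transactions in the genre object
n_drama   <- 74    # transactions containing Drama
n_romance <- 26    # transactions containing Romance
n_both    <- 18    # transactions containing both

support    <- n_both / n                  # P(Romance and Drama)
confidence <- n_both / n_romance          # P(Drama | Romance)
lift       <- confidence / (n_drama / n)  # P(Drama | Romance) / P(Drama)

round(c(support = support, confidence = confidence, lift = lift), 3)
```

The lift of about 1.41 means that Drama is roughly 41% more likely among Romance movies than in the sample as a whole.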
From the perspective of lift, many genre pairs either never co-occur (lift 0) or co-occur in only one or two movies, so their lift values are unstable and such combinations are difficult to replicate or scale in a commercial context. In contrast, Drama and Comedy, which function as broadly compatible or “general-purpose” genres, show lift values close to or below 1 with most partners (for example, 0.60 for Drama and Comedy, and 1.41 for Drama and Romance): because their base rates are high, they co-occur with many genres often, but rarely much more often than chance would predict.
The average lift value is about 0.91, suggesting that most rules represent weak associations or are close to random co-occurrence, while only a limited number of rules constitute genuinely strong structural associations. This highlights the importance of further identifying truly strong associations, as these may indicate directions for scalable commercial production in the film industry.
It can also be observed that Drama and Comedy function as “safe” genres within the film industry, frequently co-occurring with a wide range of other genres. This widespread compatibility helps explain the increasing differentiation of movie genres, such as action–comedy or romantic–comedy, where hybrid genres give rise to a broader variety of films.
ltab<-crossTable(genre, measure="lift", sort=TRUE)
round(ltab,2)
## Drama Comedy Action Thriller Romance Crime Adventure Horror
## Drama NA 0.60 0.77 1.06 1.41 1.13 0.64 0.31
## Comedy 0.60 NA 0.67 0.10 1.08 0.62 0.52 1.08
## Action 0.77 0.67 NA 2.12 0.20 1.16 2.28 2.00
## Thriller 1.06 0.10 2.12 NA 0.86 1.86 0.35 1.72
## Romance 1.41 1.08 0.20 0.86 NA 0.32 0.73 0.00
## Crime 1.13 0.62 1.16 1.86 0.32 NA 0.52 0.00
## Adventure 0.64 0.52 2.28 0.35 0.73 0.52 NA 0.00
## Horror 0.31 1.08 2.00 1.72 0.00 0.00 0.00 NA
## Children 0.37 1.27 0.47 0.00 0.00 0.00 3.43 0.00
## Sci-Fi 0.56 0.51 2.37 1.53 0.00 1.53 2.57 4.22
## Animation 0.20 1.96 2.08 0.56 0.00 0.00 1.89 1.16
## Documentary 0.00 0.00 0.00 0.00 0.00 0.93 0.00 0.00
## Fantasy 0.68 1.24 1.74 0.00 1.29 0.93 6.29 0.00
## War 1.75 0.40 2.23 1.60 1.66 0.00 0.00 0.00
## Mystery 1.53 0.00 0.00 5.59 2.90 2.10 0.00 0.00
## Film-Noir 0.00 0.00 1.74 5.59 1.94 0.00 0.00 0.00
## Musical 1.02 1.40 0.00 0.00 0.00 0.00 0.00 0.00
## (no genres listed) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## IMAX 0.00 0.00 5.21 0.00 0.00 0.00 9.44 0.00
## x 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Children Sci-Fi Animation Documentary Fantasy War Mystery
## Drama 0.37 0.56 0.20 0.00 0.68 1.75 1.53
## Comedy 1.27 0.51 1.96 0.00 1.24 0.40 0.00
## Action 0.47 2.37 2.08 0.00 1.74 2.23 0.00
## Thriller 0.00 1.53 0.56 0.00 0.00 1.60 5.59
## Romance 0.00 0.00 0.00 0.00 1.29 1.66 2.90
## Crime 0.00 1.53 0.00 0.93 0.93 0.00 2.10
## Adventure 3.43 2.57 1.89 0.00 6.29 0.00 0.00
## Horror 0.00 4.22 1.16 0.00 0.00 0.00 0.00
## Children NA 0.00 6.86 0.00 3.05 0.00 0.00
## Sci-Fi 0.00 NA 4.12 0.00 1.53 0.00 0.00
## Animation 6.86 4.12 NA 0.00 5.03 0.00 0.00
## Documentary 0.00 0.00 0.00 NA 0.00 2.40 0.00
## Fantasy 3.05 1.53 5.03 0.00 NA 0.00 0.00
## War 0.00 0.00 0.00 2.40 0.00 NA 0.00
## Mystery 0.00 0.00 0.00 0.00 0.00 0.00 NA
## Film-Noir 0.00 4.58 5.03 0.00 0.00 0.00 0.00
## Musical 0.00 0.00 7.55 0.00 8.39 0.00 0.00
## (no genres listed) 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## IMAX 0.00 13.73 0.00 0.00 0.00 0.00 0.00
## x 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Film-Noir Musical (no genres listed) IMAX x
## Drama 0.00 1.02 0 0.00 0
## Comedy 0.00 1.40 0 0.00 0
## Action 1.74 0.00 0 5.21 0
## Thriller 5.59 0.00 0 0.00 0
## Romance 1.94 0.00 0 0.00 0
## Crime 0.00 0.00 0 0.00 0
## Adventure 0.00 0.00 0 9.44 0
## Horror 0.00 0.00 0 0.00 0
## Children 0.00 0.00 0 0.00 0
## Sci-Fi 4.58 0.00 0 13.73 0
## Animation 5.03 7.55 0 0.00 0
## Documentary 0.00 0.00 0 0.00 0
## Fantasy 0.00 8.39 0 0.00 0
## War 0.00 0.00 0 0.00 0
## Mystery 0.00 0.00 0 0.00 0
## Film-Noir NA 0.00 0 0.00 0
## Musical 0.00 NA 0 0.00 0
## (no genres listed) 0.00 0.00 NA 0.00 0
## IMAX 0.00 0.00 0 NA 0
## x 0.00 0.00 0 0.00 NA
mean(ltab,na.rm=TRUE)
## [1] 0.9066996
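Large lift values in the table often rest on very few movies. The sketch below recomputes lift for four pairs from the counts reported above and lists them next to their joint counts, to illustrate that the highest values come from pairs observed only once. The selection of pairs is illustrative.

```r
n <- 151   # transactions in the genre object

# Counts copied from the absolute frequency and co-occurrence tables.
pairs <- data.frame(
  pair = c("Drama & Romance", "Action & Thriller",
           "Fantasy & Musical", "IMAX & Sci-Fi"),
  n_x  = c(74, 29, 9, 1),    # transactions containing the first genre
  n_y  = c(26, 27, 2, 11),   # transactions containing the second genre
  n_xy = c(18, 11, 1, 1)     # transactions containing both
)

# lift = P(x and y) / (P(x) * P(y))
pairs$lift <- round((pairs$n_xy / n) / ((pairs$n_x / n) * (pairs$n_y / n)), 2)

pairs[order(pairs$lift, decreasing = TRUE), ]
```

Reading lift together with the joint count helps separate stable associations from artifacts of rare labels.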
Frequent itemsets were mined from the movie genre data to identify genres or genre combinations that appear in at least 5% of the transactions, with support measuring their prevalence within the dataset. Of the 20 frequent itemsets, 13 are single genres and 7 are two-genre combinations. The support of a single-genre itemset such as {Drama} counts every movie that carries the genre, including multi-genre movies, so these itemsets show which genres are prevalent rather than how many movies are single-genre.
Because the sample includes many twentieth-century films, and cross-genre movies were already emerging in that period, which may help explain the prevalence of hybrid genres today, greater attention is paid to frequent two-genre combinations (2-itemsets). These high-frequency combinations are primarily centered around Drama, which appears in five of the seven.
Drama is not only the most frequently occurring single genre, but also a core component of many mainstream genre combinations. This result suggests that, in genre design, films often adopt drama as a foundational narrative framework, upon which more direction-specific genre elements—such as romance, thriller, or action—are layered, thereby achieving a balance between broad audience appeal and genre innovation.
freq.items <- eclat(genre, parameter = list(supp = 0.05, maxlen = 15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 7
##
## create itemset ...
## set transactions ...[20 item(s), 151 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating bit matrix ... [13 row(s), 151 column(s)] done [0.00s].
## writing ... [20 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq.items)
## items support count
## [1] {Crime, Drama} 0.06622517 10
## [2] {Drama, Romance} 0.11920530 18
## [3] {Comedy, Romance} 0.06622517 10
## [4] {Drama, Thriller} 0.09271523 14
## [5] {Action, Thriller} 0.07284768 11
## [6] {Action, Drama} 0.07284768 11
## [7] {Comedy, Drama} 0.10596026 16
## [8] {Drama} 0.49006623 74
## [9] {Comedy} 0.35761589 54
## [10] {Action} 0.19205298 29
## [11] {Thriller} 0.17880795 27
## [12] {Romance} 0.17218543 26
## [13] {Adventure} 0.10596026 16
## [14] {Crime} 0.11920530 18
## [15] {Animation} 0.06622517 10
## [16] {Sci-Fi} 0.07284768 11
## [17] {Fantasy} 0.05960265 9
## [18] {Horror} 0.08609272 13
## [19] {Children} 0.07284768 11
## [20] {Documentary} 0.05960265 9
round(support(items(freq.items), genre) , 2)
## [1] 0.07 0.12 0.07 0.09 0.07 0.07 0.11 0.49 0.36 0.19 0.18 0.17 0.11 0.12 0.07
## [16] 0.07 0.06 0.09 0.07 0.06
The results indicate that genres such as Crime, Romance, and Thriller tend to co-occur with Drama when they appear. Among these, Romance shows the highest confidence in predicting Drama, with a confidence level close to 70%, while Crime and Thriller also exhibit confidence values exceeding 50% when pointing to Drama.
Although the lift values of these rules are relatively low (1.06 to 1.41), their support levels are comparatively high (0.07 to 0.12). This suggests that Drama occupies a stable foundational narrative role within the movie genre system, functioning as a conservative anchor for multiple genres by providing a broadly acceptable narrative framework.
freq.rules<-ruleInduction(freq.items, genre, confidence=0.5)
freq.rules
## set of 3 rules
inspect(freq.rules)
## lhs rhs support confidence lift itemset
## [1] {Crime} => {Drama} 0.06622517 0.5555556 1.133634 1
## [2] {Romance} => {Drama} 0.11920530 0.6923077 1.412682 2
## [3] {Thriller} => {Drama} 0.09271523 0.5185185 1.058058 4
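The reason only three rules survive the 0.5 confidence threshold can be checked by hand. The sketch below computes the confidence of five candidate rules pointing to Drama from the counts reported above; the data frame is an illustrative reconstruction, not output of `ruleInduction`.

```r
# Counts copied from the frequent itemsets: n_lhs is the antecedent count,
# n_both is the count of the antecedent together with Drama.
candidates <- data.frame(
  lhs    = c("Crime", "Romance", "Thriller", "Comedy", "Action"),
  n_lhs  = c(18, 26, 27, 54, 29),
  n_both = c(10, 18, 14, 16, 11)
)

# confidence(lhs => Drama) = count(lhs and Drama) / count(lhs)
candidates$confidence <- candidates$n_both / candidates$n_lhs
candidates$kept       <- candidates$confidence >= 0.5

candidates
```

Comedy and Action co-occur with Drama often in absolute terms, but less than half of their movies are also dramas, so those rules fall below the threshold.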
The Apriori rules sorted by confidence reveal several genre relationships that are deterministic within this sample. Both Film-Noir and Mystery point to Thriller with 100% confidence, although these rules rest on only 3 and 4 movies respectively, so they indicate a strong genre dependency in the sample rather than a general law. In addition, the combination of Mystery and Romance (2 movies) always co-occurs with both Thriller and Drama, suggesting that Thriller and Drama may serve dual core functions within the genre system, aggregating emotional tension and providing narrative stability respectively.
Such high-confidence rules may reflect a hierarchical semantic structure among genres rather than simple co-occurrence relationships.
rules.genre<-apriori(genre, parameter=list(supp=0.01, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[20 item(s), 151 transaction(s)] done [0.00s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [65 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.by.conf<-sort(rules.genre, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift
## [1] {Film-Noir} => {Thriller} 0.01986755 1 0.01986755 5.592593
## [2] {Mystery} => {Thriller} 0.02649007 1 0.02649007 5.592593
## [3] {Mystery, Romance} => {Thriller} 0.01324503 1 0.01324503 5.592593
## [4] {Mystery, Romance} => {Drama} 0.01324503 1 0.01324503 2.040541
## [5] {Drama, Mystery} => {Thriller} 0.01986755 1 0.01986755 5.592593
## [6] {Romance, War} => {Drama} 0.01324503 1 0.01324503 2.040541
## count
## [1] 3
## [2] 4
## [3] 2
## [4] 2
## [5] 3
## [6] 2
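Because a confidence of 1 can be reached with very few movies, it can help to read these rules alongside their counts. The sketch below rebuilds the six rules above as a plain data frame and keeps those supported by at least three movies. The threshold of 3 is an assumption chosen only for illustration.

```r
# The six highest-confidence rules, copied from the output above.
top_conf <- data.frame(
  rule = c("{Film-Noir} => {Thriller}", "{Mystery} => {Thriller}",
           "{Mystery,Romance} => {Thriller}", "{Mystery,Romance} => {Drama}",
           "{Drama,Mystery} => {Thriller}", "{Romance,War} => {Drama}"),
  confidence = rep(1, 6),
  lift       = c(5.592593, 5.592593, 5.592593, 2.040541, 5.592593, 2.040541),
  count      = c(3, 4, 2, 2, 3, 2)
)

min_count <- 3   # assumed illustrative threshold, not used in the study
robust <- subset(top_conf, count >= min_count)

robust[, c("rule", "count")]
```

The same filter can be applied to the full rule set after converting it to a data frame, as is done later in this report.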
Association rules sorted by lift further reveal a set of genre structures that, while infrequent, occur far more often than random co-occurrence would predict. Among these, the combination of Drama, Romance, and Thriller raises the likelihood that a movie is classified as Mystery by a factor of about 25 (lift 25.17), although the rule is supported by only 2 movies. This suggests that the mystery genre may emerge as a resultant category derived from more fundamental genres.
At the same time, the combination of Adventure and Animation points to Fantasy in both sampled movies that carry it, highlighting the central role of fantasy world-building within this genre subsystem. The strong dependence of Children and Comedy on the animated format (4 of 5 movies), as well as the tendency of Action, Drama, and Thriller to point toward War, further suggest that movie genres are not combined randomly but instead follow an internally coherent mechanism of genre composition.
However, such strong associations constitute only a small proportion of the overall rules, while the majority exhibit lift values only slightly above random levels. When considering lift alone, the strongly associated rules identified in this analysis are limited to the following cases.
rules.by.lift<-sort(rules.genre, by="lift", decreasing=TRUE) # sorting by lift
inspect(head(rules.by.lift))
## lhs rhs support confidence
## [1] {Drama, Romance, Thriller} => {Mystery} 0.01324503 0.6666667
## [2] {Romance, Thriller} => {Mystery} 0.01324503 0.5000000
## [3] {Adventure, Animation} => {Fantasy} 0.01324503 1.0000000
## [4] {Action, Adventure, Animation} => {Fantasy} 0.01324503 1.0000000
## [5] {Children, Comedy} => {Animation} 0.02649007 0.8000000
## [6] {Action, Drama, Thriller} => {War} 0.01324503 0.5000000
## coverage lift count
## [1] 0.01986755 25.16667 2
## [2] 0.02649007 18.87500 2
## [3] 0.01324503 16.77778 2
## [4] 0.01324503 16.77778 2
## [5] 0.03311258 12.08000 4
## [6] 0.02649007 10.78571 2
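With counts this small, lift is sensitive to the labelling of a single movie. The sketch below reproduces the lift of {Drama, Romance, Thriller} => {Mystery} from the reported counts (3 movies match the antecedent, 2 of them are Mystery, and 4 movies are Mystery overall), then recomputes it under a purely hypothetical scenario in which one of the two supporting movies had not been labelled Mystery.

```r
n <- 151   # transactions in the genre object

# lift(lhs => rhs) = P(rhs | lhs) / P(rhs)
lift_for <- function(n_lhs, n_both, n_rhs, n) {
  (n_both / n_lhs) / (n_rhs / n)
}

# Observed counts, derived from the coverage, count, and item frequencies.
observed <- lift_for(n_lhs = 3, n_both = 2, n_rhs = 4, n = n)

# Hypothetical scenario: one supporting movie loses its Mystery label,
# so both the joint count and the Mystery total drop by one.
one_fewer <- lift_for(n_lhs = 3, n_both = 1, n_rhs = 3, n = n)

round(c(observed = observed, one_fewer = one_fewer), 2)
```

A change in one label moves the lift by about a third of its value, which is why these rules are best read as structural hints rather than stable estimates.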
This figure presents the overall structure of the 65 movie genre association rules mined using the Apriori algorithm. The rules are visualized in a matrix format, with lift used as the color-mapping measure to indicate the strength of association relative to random co-occurrence.
The visualization further confirms the preceding findings, showing that rules with high lift values—and thus strong associations—constitute only a small subset of all rules. At the same time, a number of rules with moderate association strength are also observed, indicating the presence of meaningful but less pronounced structural relationships among movie genres.
plot(rules.genre, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{Drama,Romance,Thriller}" "{Action,Adventure,Animation}"
## [3] "{Children,Comedy}" "{Adventure,Animation}"
## [5] "{Action,Drama,Thriller}" "{Romance,Thriller}"
## [7] "{Action,Adventure,Fantasy}" "{Action,Fantasy}"
## [9] "{Children,Fantasy}" "{Action,Animation,Fantasy}"
## [11] "{Adventure,Children}" "{Animation,Comedy}"
## [13] "{Comedy,Fantasy}" "{Fantasy}"
## [15] "{Action,Sci-Fi}" "{Film-Noir}"
## [17] "{Drama,Mystery,Romance}" "{Action,Animation}"
## [19] "{Adventure,Sci-Fi}" "{Drama,Thriller,War}"
## [21] "{Adventure,Animation,Fantasy}" "{Drama,Fantasy}"
## [23] "{Drama,Mystery}" "{Animation}"
## [25] "{Animation,Fantasy}" "{Drama,Mystery,Thriller}"
## [27] "{Mystery,Romance}" "{Action,Drama,War}"
## [29] "{Thriller,War}" "{Animation,Sci-Fi}"
## [31] "{Mystery}" "{Action,War}"
## [33] "{Crime,Drama}" "{Drama,War}"
## [35] "{Horror,Thriller}" "{Adventure,Fantasy}"
## [37] "{Animation,Children}" "{Mystery,Thriller}"
## [39] "{Romance,War}" "{Fantasy,Romance}"
## [41] "{Mystery,Romance,Thriller}" "{Action,Thriller,War}"
## [43] "{War}" "{Crime,Thriller}"
## [45] "{Romance}" "{Crime}"
## [47] "{Thriller}" "{Action,Crime}"
## Itemsets in Consequent (RHS)
## [1] "{Drama}" "{Comedy}" "{Romance}" "{Action}" "{Thriller}"
## [6] "{Sci-Fi}" "{Adventure}" "{Children}" "{Animation}" "{War}"
## [11] "{Fantasy}" "{Mystery}"
plot(rules.genre)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
rules.df <- as(rules.genre, "data.frame")
The upper-left quadrant is regarded as the most important region, characterized by low support and high confidence. Although these rules occur relatively infrequently, their outcomes are highly deterministic when they do appear and are significantly stronger than random co-occurrence, making them the most structurally informative rules—referred to here as deterministic rules.
Strong association rules are identified here as those with support below 0.05, confidence of at least 0.8, and lift of at least 5.
rules_left_up <- subset(
  rules.df,
  support < 0.05 & confidence >= 0.8 & lift >= 5
)
rules_left_up_sorted <- rules_left_up[
  order(rules_left_up$lift, decreasing = TRUE),
]
rules_left_up_sorted
## rules support confidence coverage
## 32 {Adventure,Animation} => {Fantasy} 0.01324503 1.0 0.01324503
## 65 {Action,Adventure,Animation} => {Fantasy} 0.01324503 1.0 0.01324503
## 28 {Children,Comedy} => {Animation} 0.02649007 0.8 0.03311258
## 25 {Children,Fantasy} => {Adventure} 0.01324503 1.0 0.01324503
## 39 {Action,Fantasy} => {Adventure} 0.01986755 1.0 0.01986755
## 63 {Action,Animation,Fantasy} => {Adventure} 0.01324503 1.0 0.01324503
## 1 {Film-Noir} => {Thriller} 0.01986755 1.0 0.01986755
## 3 {Mystery} => {Thriller} 0.02649007 1.0 0.02649007
## 12 {Mystery,Romance} => {Thriller} 0.01324503 1.0 0.01324503
## 18 {Drama,Mystery} => {Thriller} 0.01986755 1.0 0.01986755
## 55 {Drama,Mystery,Romance} => {Thriller} 0.01324503 1.0 0.01324503
## 20 {Thriller,War} => {Action} 0.01324503 1.0 0.01324503
## 45 {Adventure,Animation} => {Action} 0.01324503 1.0 0.01324503
## 48 {Adventure,Sci-Fi} => {Action} 0.01986755 1.0 0.01986755
## 59 {Drama,Thriller,War} => {Action} 0.01324503 1.0 0.01324503
## 62 {Adventure,Animation,Fantasy} => {Action} 0.01324503 1.0 0.01324503
## lift count
## 32 16.777778 2
## 65 16.777778 2
## 28 12.080000 4
## 25 9.437500 2
## 39 9.437500 3
## 63 9.437500 2
## 1 5.592593 3
## 3 5.592593 4
## 12 5.592593 2
## 18 5.592593 3
## 55 5.592593 2
## 20 5.206897 2
## 45 5.206897 2
## 48 5.206897 3
## 59 5.206897 2
## 62 5.206897 2
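The lift values in this table can be checked by hand, since lift is the confidence of a rule divided by the support of its consequent. The following is a minimal sketch of that check in base R. It assumes the data contain 151 transactions, a figure back-calculated from the printed output (count 4 at support 0.02649007) and not stated in the text. The genre totals (27 Thriller films, 10 Animation films) are likewise inferred from the printed lift and coverage values, and `rule_metrics` is a hypothetical helper name.

```r
# Assumption: n = 151 transactions, inferred from count / support in the
# printed output (4 / 0.02649007); the text does not state this number.
n <- 151

# Hypothetical helper: recompute support, confidence and lift from raw counts.
rule_metrics <- function(n_lhs_rhs, n_lhs, n_rhs, n) {
  support    <- n_lhs_rhs / n          # share of films containing LHS and RHS
  confidence <- n_lhs_rhs / n_lhs      # share of LHS films that also contain RHS
  lift       <- confidence / (n_rhs / n)  # confidence relative to RHS base rate
  c(support = support, confidence = confidence, lift = lift)
}

# {Mystery} => {Thriller}: 4 films with both, 4 Mystery films, 27 Thriller films
mystery_thriller <- rule_metrics(4, 4, 27, n)

# {Children,Comedy} => {Animation}: 4 films with both, 5 LHS films, 10 Animation films
children_comedy_animation <- rule_metrics(4, 5, 10, n)

print(round(mystery_thriller, 6))
print(round(children_comedy_animation, 6))
```

Under these assumed counts the recomputed values agree with the printed rows for rules 3 and 28, which suggests that the high lift of the Thriller rules reflects perfect confidence against a base rate of roughly 18 percent.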
The lower-left quadrant contains rules with low support and low confidence, which can be considered unstable rules: they occur infrequently, and the consequent follows the antecedent only part of the time. When their lift values are high, however, their association strength is second only to that of the strong associations in the upper-left quadrant, falling short in confidence rather than in lift. Such rules can be regarded as potential rules that may evolve into strong association rules.
It is worth noting that the dataset contains a substantial number of movies from the twentieth century. Consequently, rules such as {Action, Drama, Thriller} ⇒ {War}, which are commonly observed in contemporary war-related films, did not constitute strong associations at that time. Nevertheless, this pattern has gradually emerged as a stronger association in more recent contexts. Therefore, this region exhibits a certain predictive potential and may be viewed as a reservoir of genre combinations that could characterize future film production. As strong associations become saturated, such emerging combinations may represent unexplored opportunities.
rules_left_down <- subset(
  rules.df,
  support < 0.05 & confidence < 0.8 & lift >= 5  # low support, lower confidence, high lift
)
rules_left_down_sorted <- rules_left_down[
  order(rules_left_down$lift, decreasing = TRUE),  # strongest lift first
]
rules_left_down_sorted
## rules support confidence coverage
## 57 {Drama,Romance,Thriller} => {Mystery} 0.01324503 0.6666667 0.01986755
## 14 {Romance,Thriller} => {Mystery} 0.01324503 0.5000000 0.02649007
## 61 {Action,Drama,Thriller} => {War} 0.01324503 0.5000000 0.02649007
## 34 {Action,Fantasy} => {Animation} 0.01324503 0.6666667 0.01986755
## 64 {Action,Adventure,Fantasy} => {Animation} 0.01324503 0.6666667 0.01986755
## 26 {Adventure,Children} => {Fantasy} 0.01324503 0.5000000 0.02649007
## 35 {Action,Animation} => {Fantasy} 0.01324503 0.5000000 0.02649007
## 29 {Animation,Comedy} => {Children} 0.02649007 0.5714286 0.04635762
## 37 {Comedy,Fantasy} => {Animation} 0.01324503 0.5000000 0.02649007
## 6 {Animation} => {Children} 0.03311258 0.5000000 0.06622517
## 44 {Action,Animation} => {Sci-Fi} 0.01324503 0.5000000 0.02649007
## 7 {Fantasy} => {Adventure} 0.03973510 0.6666667 0.05960265
## 31 {Animation,Fantasy} => {Adventure} 0.01324503 0.6666667 0.01986755
## 40 {Drama,Fantasy} => {Adventure} 0.01324503 0.6666667 0.01986755
## 49 {Action,Sci-Fi} => {Adventure} 0.01986755 0.6000000 0.03311258
## lift count
## 57 25.166667 2
## 14 18.875000 2
## 61 10.785714 2
## 34 10.066667 2
## 64 10.066667 2
## 26 8.388889 2
## 35 8.388889 2
## 29 7.844156 4
## 37 7.550000 2
## 6 6.863636 5
## 44 6.863636 2
## 7 6.291667 6
## 31 6.291667 2
## 40 6.291667 2
## 49 5.662500 3
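The instability of these low-count rules can be made concrete with a small what-if calculation. The sketch below takes rule 61, {Action,Drama,Thriller} => {War}, and asks how its confidence would change if one more film matching the antecedent entered the data. It assumes 4 antecedent films, a figure back-calculated from the printed count (2) and confidence (0.5) and not stated in the text. The variable names are illustrative only.

```r
# Rule 61 from the table above: {Action,Drama,Thriller} => {War}
# Assumption: 4 films match the antecedent (count 2 / confidence 0.5).
n_lhs     <- 4  # films containing Action, Drama and Thriller
n_lhs_rhs <- 2  # of those, films that are also tagged War

confidence_now     <- n_lhs_rhs / n_lhs              # current confidence
confidence_if_hit  <- (n_lhs_rhs + 1) / (n_lhs + 1)  # one more film that is also War
confidence_if_miss <- n_lhs_rhs / (n_lhs + 1)        # one more film that is not War

print(c(now = confidence_now,
        one_more_hit = confidence_if_hit,
        one_more_miss = confidence_if_miss))
```

A single additional film moves the confidence by a tenth in either direction, which is why rules in this quadrant are better read as candidates for interpretation than as settled regularities.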
Based on this figure, we further examine the common antecedent conditions in movie genre associations, focusing on which left-hand side (LHS) itemsets exhibit greater carrying capacity by pointing to a wider range of consequent genres (RHS), and which LHS itemsets are more precise in directing toward specific RHS outcomes.
First, as the length of the LHS increases, its frequency decreases, leading to fewer occurrences in the dataset. Second, when the LHS includes genres that are substantially different in nature—for example, Romance and Thriller—the resulting rules tend to be low in frequency but highly directional, thus forming strong association rules. Third, when core hub genres such as Drama and Comedy appear in the LHS, the frequency of the rules increases substantially; however, their directional specificity decreases, as these hub genres are capable of supporting a wide variety of subordinate genres.
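One simple way to quantify the carrying capacity described above is to count how many distinct consequents each antecedent points to. The sketch below does this in base R for a handful of rule labels copied from the two tables above; in the full analysis the same steps could be applied to the `rules` column of `rules.df`. The name `carrying` and the measure itself are assumptions made for illustration, not part of the original analysis.

```r
# A few rule labels copied from the tables above (not the full rule set).
rule_labels <- c(
  "{Adventure,Animation} => {Fantasy}",
  "{Adventure,Animation} => {Action}",
  "{Action,Fantasy} => {Adventure}",
  "{Action,Fantasy} => {Animation}",
  "{Action,Animation} => {Fantasy}",
  "{Action,Animation} => {Sci-Fi}",
  "{Mystery} => {Thriller}",
  "{Film-Noir} => {Thriller}"
)

# Split each label into its antecedent (LHS) and consequent (RHS).
parts <- strsplit(as.character(rule_labels), " => ", fixed = TRUE)
lhs <- vapply(parts, `[`, character(1), 1)
rhs <- vapply(parts, `[`, character(1), 2)

# Hypothetical measure: number of distinct consequents per antecedent.
carrying <- vapply(split(rhs, lhs), function(x) length(unique(x)), integer(1))

# Antecedents pointing to more consequents are listed first.
print(sort(carrying, decreasing = TRUE))
```

In this small sample, the two-genre antecedents built from Action, Adventure, Animation, and Fantasy each point to two consequents, while Mystery and Film-Noir each point only to Thriller, which matches the contrast between carrying and directional antecedents.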
plot(rules.genre, method="grouped")
Based on the scatter plot and network visualization of the association rules, it can be observed that different movie genres play clearly distinct roles within the rule system. Genres such as Comedy, Drama, and Action function as strong hub nodes, characterized by high support but relatively low lift values. These genres exhibit strong carrying capacity and are able to combine with a wide range of other genres; as a result, the corresponding rules tend to have dispersed directions and rarely point to a single, well-defined consequent genre.
In contrast, genres such as Thriller and Mystery appear less frequently in the dataset, but under specific genre combinations they often form rules with substantially higher lift values, indicating stronger structural associations. These genres are therefore more likely to function as consequent or convergent nodes within the rule system.
From a structural perspective, this pattern reflects an important distinction between movie genre data and traditional market basket data. Some genres possess a high degree of generality, while others serve more specific narrative or stylistic functions. This asymmetry in carrying capacity and structural roles leads to the coexistence of high-frequency but low-lift rules and low-frequency but high-lift rules in movie genre data.
For quantitative analysis in the humanities, such patterns are particularly meaningful: a rule that is not dominant in frequency can still reveal a structural relationship that is highly valuable for interpretation. Conventional shopping data, by contrast, rarely contain a single product that applies across nearly all everyday contexts in the way a hub genre does. Genres, like the units of literary or linguistic data, differ inherently in function and status and cannot be treated as equal or interchangeable. Such data therefore naturally give rise to both high-frequency/low-lift and low-frequency/high-lift associations.
plot(rules.genre, method="graph")  # default control settings; "type" is not a control parameter of this method
plot(rules.genre, method="paracoord", control=list(reorder=TRUE))
### In summary

The association rule analysis based on movie genre data indicates that genre combinations are not generated randomly but instead exhibit clear structural characteristics. High-frequency genres such as Drama and Comedy function as hub genres within the genre system, possessing strong carrying capacity and forming stable co-occurrence patterns with a wide range of other genres. In contrast, relatively low-frequency genres such as Thriller and Mystery demonstrate higher directional specificity under certain conditions, often appearing as consequent genres and forming structural rules with higher lift values.
This asymmetric distribution across the dimensions of support, confidence, and lift reveals a complex structure in which “core–supplementary” and “carrying–directional” roles coexist within the movie genre system. For cultural data such as movie genres, frequency alone is not the sole indicator of analytical value; low-frequency but structurally strong associations also carry significant interpretive importance. The findings of this study demonstrate that association rule mining is not only applicable to market-basket-style data, but also provides a valuable quantitative perspective for research in the humanities, where data often exhibit semantic heterogeneity and hierarchical structures.
From the perspective of linguistics, particularly for researchers with a background in semantic analysis and topic modeling, this methodological insight is highly transferable. Natural language data—especially in isolating languages such as Chinese, where grammatical inflection is minimal and meaning is largely conveyed through lexical co-occurrence—are inherently weakly structured. In such contexts, strong co-occurrence rules offer an especially effective analytical framework. Rather than relying solely on syntactic cues, association-based methods allow semantic relationships to emerge directly from patterns of usage.
Building on this principle, we argue that in natural language processing, intra-document word frequency and inter-document co-occurrence frequency can be naturally mapped onto the concepts of support and confidence, while repeated and disproportionately strong word-pair co-occurrences can be evaluated through lift to identify salient semantic associations. This framework enables the identification of strongly associated lexical units, recurrent semantic pairings, or coherent thematic structures, even in corpora where explicit annotation or large-scale supervision is unavailable.
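As a rough illustration of this mapping, the sketch below computes support, confidence, and lift for word pairs in base R. It rests on one possible reading of the paragraph above: each document is treated as a transaction containing its distinct words, so all three measures are document-level co-occurrence rates. The toy corpus and the function names `support_of` and `word_rule` are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical toy corpus: each document reduced to its set of distinct words.
docs <- list(
  c("river", "bank", "water"),
  c("bank", "loan", "money"),
  c("river", "water", "fish"),
  c("money", "loan", "interest"),
  c("river", "bank", "water", "fish")
)

# Support of a word set: share of documents that contain every word in it.
support_of <- function(words) {
  mean(vapply(docs, function(d) all(words %in% d), logical(1)))
}

# Support, confidence and lift for the rule {lhs} => {rhs}.
word_rule <- function(lhs, rhs) {
  s_both <- support_of(c(lhs, rhs))
  conf   <- s_both / support_of(lhs)
  c(support = s_both, confidence = conf, lift = conf / support_of(rhs))
}

river_water <- word_rule("river", "water")
bank_water  <- word_rule("bank", "water")

print(round(rbind(river_water, bank_water), 3))
```

In this toy corpus "river" always co-occurs with "water", while the ambiguous "bank" does so in only two of its three documents, so the first rule receives the higher confidence and lift. On a real corpus the same calculation could be delegated to an association rule library once documents are encoded as transactions.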
Such an approach is particularly valuable for the analysis of low-resource or under-studied languages, where data sparsity and limited tooling often constrain more complex modeling techniques. While recent advances in large language models may have surpassed traditional association-based methods in many applied NLP tasks, these models do not diminish the methodological relevance of association rule mining. On the contrary, for fields such as sociolinguistics, which have historically relied heavily on qualitative interpretation, association-based modeling provides a transparent, interpretable, and empirically grounded pathway toward scientific generalization. In this sense, association rule mining serves not merely as a technical tool, but as a bridge between qualitative insight and quantitative rigor in language-centered research.