This project uses association rule mining to uncover patterns and relationships within a dataset of English-language movies, focusing on the co-occurrence of budget, rating, and genre. The insights from this analysis could be valuable for movie producers, offering evidence to guide decisions during production and to improve the chances of making movies that resonate with audiences. The data come from IMDb.
# defined cut_points and labels for Budget
budget_cut_points <- c(0, 2, 4, 6, 10, 15, Inf)
budget_labels <- c("Very Low", "Low", "Normal", "High", "Very High", "Highest")
# Use the cut function to categorize Budget values
budget_categories <- cut(movies_data$Budget, breaks = budget_cut_points, labels = budget_labels, include.lowest = TRUE)
# Add Budget_Category to movies_data
movies_data$Budget_Category <- as.character(budget_categories)
# Define the cut points and labels for the "Rate_Category" column
rate_cut_points <- c(0.1, 2, 4, 6, 7, 8, 9, 10)
rate_labels <- c("Very Low", "Low", "Normal", "Good", "Good Plus", "Very Good", "Perfect")
# Use the cut function to categorize "Rate" values for the entire column
movies_data$Rate_Category <- cut(movies_data$Rate, breaks = rate_cut_points, labels = rate_labels, include.lowest = TRUE)
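# Define the cut points and labels for the years
# Note: cut() uses right-closed intervals by default, so a boundary year such as 1995 falls in the bin labelled "1991-1994"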
year_cut_points <- c(1990, 1995, 2000, 2005, 2010, Inf)
year_labels <- c("1991-1994", "1995-1999", "2000-2004", "2005-2009", "2010-2017")
# Use the cut function to categorize years for the entire column
movies_data$Year_Category <- cut(movies_data$Year, breaks = year_cut_points, labels = year_labels, include.lowest = TRUE)
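The binning above depends on how cut() treats values that sit exactly on a break. The sketch below uses a handful of made-up years (not the movie data) to compare the default right-closed intervals with left-closed ones obtained through `right = FALSE`. The left-closed break vector starting at 1991 is an assumption of this sketch; switching the analysis to it would change the counts reported later.

```r
# Hypothetical years chosen to sit on and around the bin boundaries.
years <- c(1991, 1994, 1995, 1999, 2000, 2010, 2017)
labels <- c("1991-1994", "1995-1999", "2000-2004", "2005-2009", "2010-2017")

# Default behaviour: right-closed intervals (1990,1995], (1995,2000], ...
right_closed <- cut(years,
                    breaks = c(1990, 1995, 2000, 2005, 2010, Inf),
                    labels = labels, include.lowest = TRUE)

# Alternative: left-closed intervals [1991,1995), [1995,2000), ...
left_closed <- cut(years,
                   breaks = c(1991, 1995, 2000, 2005, 2010, Inf),
                   labels = labels, right = FALSE)

# Side-by-side comparison: only the boundary years 1995, 2000 and 2010 differ.
data.frame(year = years, right_closed, left_closed)
```

With the left-closed version each label matches the years it actually contains.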
# Load the packages, then convert the new columns to factors for association rule mining
library(arules)
library(dplyr)
movies_data$Budget_Category <- factor(movies_data$Budget_Category, levels = budget_labels)
movies_data$Rate_Category <- factor(movies_data$Rate_Category, levels = rate_labels)
movies_data$Year_Category <- factor(movies_data$Year_Category, levels = year_labels)
# Define genre_levels
genre_levels <- c("Action", "Drama", "Comedy", "Science Fiction", "Romance", "Family", "Horror", "Animation", "Thriller", "Mystery", "Fantasy", "Crime", "Adventure", "Biography", "History", "Sport", "War", "Music", "Documentary", "Western", "Musical", "Sci-Fi")
# Convert Genre to a factor
movies_data$Genre <- factor(movies_data$Genre, levels = genre_levels)
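Fixing the factor levels by hand has two side effects that matter for the tables that follow. The sketch below shows them on made-up genre strings; `genre_raw` and `genre_levels_demo` are hypothetical names used only here.

```r
# Hypothetical genre strings; "Noir" is deliberately absent from the level set.
genre_raw <- c("Drama", "Comedy", "Noir", "Drama")
genre_levels_demo <- c("Drama", "Comedy", "Sport")

genre_factor <- factor(genre_raw, levels = genre_levels_demo)

# Values outside the level set silently become NA ...
sum(is.na(genre_factor))

# ... and levels with no observations still appear in tables with a zero count.
table(genre_factor)

# droplevels() removes unused levels if the empty entries are not wanted.
table(droplevels(genre_factor))
```

This is why a genre that is listed in the levels but never occurs in the data still gets its own (all-zero) slice in a cross table.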
# Show the structure and the first rows of the data
str(movies_data)
## 'data.frame': 5737 obs. of 8 variables:
## $ Budget : num 3 0.02 0.025 0.03 0.05 ...
## $ Year : int 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
## $ Name : chr "Madonna: Truth or Dare" "Dingo" "Poison" "High Strung" ...
## $ Rate : num 6.3 6 6.4 5.4 4.8 6.8 6 4.8 4.2 6.4 ...
## $ Genre : Factor w/ 22 levels "Action","Drama",..: 19 2 2 3 3 2 7 12 1 10 ...
## $ Budget_Category: Factor w/ 6 levels "Very Low","Low",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ Rate_Category : Factor w/ 7 levels "Very Low","Low",..: 4 3 4 3 3 4 3 3 3 4 ...
## $ Year_Category : Factor w/ 5 levels "1991-1994","1995-1999",..: 1 1 1 1 1 1 1 1 1 1 ...
head(movies_data, 10)
## Budget Year Name Rate Genre
## 1 3.0000000 1991 Madonna: Truth or Dare 6.3 Documentary
## 2 0.0200000 1991 Dingo 6.0 Drama
## 3 0.0250000 1991 Poison 6.4 Drama
## 4 0.0300000 1991 High Strung 5.4 Comedy
## 5 0.0500000 1991 Johnny Suede 4.8 Comedy
## 6 0.0800000 1991 Daughters of the Dust 6.8 Drama
## 7 0.0800000 1991 Puppet Master III Toulon's Revenge 6.0 Horror
## 8 0.1000000 1991 Delusion 4.8 Crime
## 9 0.1182273 1991 Year of the Gun 4.2 Action
## 10 0.1227401 1991 The Rapture 6.4 Mystery
## Budget_Category Rate_Category Year_Category
## 1 Low Good 1991-1994
## 2 Very Low Normal 1991-1994
## 3 Very Low Good 1991-1994
## 4 Very Low Normal 1991-1994
## 5 Very Low Normal 1991-1994
## 6 Very Low Good 1991-1994
## 7 Very Low Normal 1991-1994
## 8 Very Low Normal 1991-1994
## 9 Very Low Normal 1991-1994
## 10 Very Low Good 1991-1994
# Build a cross table showing how movies are distributed across genre, budget and rate
cross_table <- xtabs(~ Budget_Category + Rate_Category + Genre, data = movies_data)
print(cross_table)
## , , Genre = Action
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 50 240 84 20 0 0
## Low 1 10 111 74 18 0 0
## Normal 0 3 78 34 7 0 0
## High 0 2 54 63 11 0 0
## Very High 0 0 25 25 9 1 0
## Highest 0 0 12 23 15 1 0
##
## , , Genre = Drama
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 2 27 313 365 141 10 0
## Low 1 3 89 145 52 6 1
## Normal 0 1 32 41 20 0 0
## High 0 0 10 33 14 1 0
## Very High 0 0 4 10 3 0 0
## Highest 0 0 1 1 2 1 0
##
## , , Genre = Comedy
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 2 26 354 244 52 4 1
## Low 0 10 155 109 17 1 1
## Normal 0 0 66 27 3 1 0
## High 0 2 43 26 1 0 0
## Very High 0 0 5 0 0 0 0
## Highest 0 0 1 0 0 0 0
##
## , , Genre = Science Fiction
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 1 6 30 12 4 0 0
## Low 0 0 5 7 3 0 0
## Normal 0 0 0 5 1 0 0
## High 0 0 11 3 0 0 0
## Very High 0 0 4 6 3 0 0
## Highest 0 0 3 1 2 0 0
##
## , , Genre = Romance
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 1 1 22 21 13 0 0
## Low 0 2 10 16 9 0 1
## Normal 0 0 4 5 1 0 0
## High 0 0 5 2 1 0 0
## Very High 0 0 0 1 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Family
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 1 17 2 0 0 0
## Low 0 0 8 6 2 0 0
## Normal 0 0 2 4 1 0 0
## High 0 0 2 4 0 0 0
## Very High 0 0 2 3 1 0 0
## Highest 0 0 0 5 1 0 0
##
## , , Genre = Horror
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 1 62 228 45 7 0 0
## Low 3 17 46 23 0 0 0
## Normal 0 0 3 7 1 0 0
## High 0 0 7 1 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 1 0 0 0 0
##
## , , Genre = Animation
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 1 0 12 14 2 0 0
## Low 0 0 14 8 5 0 0
## Normal 0 1 6 4 4 0 0
## High 0 1 11 21 6 0 0
## Very High 0 0 5 13 4 0 0
## Highest 0 0 2 3 4 0 0
##
## , , Genre = Thriller
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 23 92 40 8 1 0
## Low 0 2 28 19 9 0 0
## Normal 0 0 8 11 3 0 0
## High 0 0 10 5 4 0 0
## Very High 0 0 2 4 0 0 0
## Highest 0 0 1 0 0 0 0
##
## , , Genre = Mystery
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 1 27 10 2 1 0
## Low 0 3 3 7 4 1 0
## Normal 0 0 0 3 0 0 0
## High 0 0 3 2 2 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Fantasy
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 2 4 21 15 9 0 0
## Low 0 1 15 10 4 0 0
## Normal 0 0 10 5 0 1 0
## High 0 1 10 6 0 0 0
## Very High 0 0 6 3 3 0 0
## Highest 0 0 4 1 1 0 0
##
## , , Genre = Crime
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 4 66 52 19 2 0
## Low 0 2 20 26 8 2 0
## Normal 0 0 7 21 6 0 0
## High 0 0 6 9 3 0 0
## Very High 0 0 0 1 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Adventure
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 6 35 21 9 0 0
## Low 0 0 40 18 3 0 0
## Normal 0 1 21 15 5 0 0
## High 0 0 24 24 11 1 0
## Very High 0 0 21 28 11 0 0
## Highest 0 0 8 21 6 1 0
##
## , , Genre = Biography
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 0 0 0 0 0
## Low 0 0 0 0 0 0 0
## Normal 0 0 0 0 0 0 0
## High 0 0 0 0 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = History
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 4 9 6 1 0
## Low 0 0 3 3 0 0 0
## Normal 0 0 1 0 1 0 0
## High 0 0 0 2 0 0 0
## Very High 0 0 0 1 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Sport
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 0 0 0 0 0
## Low 0 0 0 0 0 0 0
## Normal 0 0 0 0 0 0 0
## High 0 0 0 0 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = War
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 3 4 2 0 0
## Low 0 0 2 3 1 0 0
## Normal 0 0 0 1 1 0 0
## High 0 0 2 2 2 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 1 0 0 0 0
##
## , , Genre = Music
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 4 9 3 0 0
## Low 0 0 3 8 1 0 0
## Normal 0 0 1 1 0 0 0
## High 0 0 0 1 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Documentary
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 1 3 25 36 31 2 1
## Low 0 1 15 10 13 4 1
## Normal 0 0 0 0 0 0 0
## High 0 0 0 0 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Western
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 1 8 2 1 0 0
## Low 0 1 0 2 0 0 0
## Normal 0 0 1 1 0 0 0
## High 0 0 0 1 1 0 0
## Very High 0 0 1 0 1 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Musical
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 0 0 0 0 0
## Low 0 0 0 0 0 0 0
## Normal 0 0 0 0 0 0 0
## High 0 0 0 0 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
##
## , , Genre = Sci-Fi
##
## Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
## Very Low 0 0 0 0 0 0 0
## Low 0 0 0 0 0 0 0
## Normal 0 0 0 0 0 0 0
## High 0 0 0 0 0 0 0
## Very High 0 0 0 0 0 0 0
## Highest 0 0 0 0 0 0 0
From this table we can see that budget constraints play a significant role in movie production, and most genres are concentrated in the lower budget categories. Some genres, such as Adventure and Science Fiction, may require higher budgets because they rely on special effects, in contrast to genres like Drama and Comedy. Market trends and audience preferences could also be inferred by examining how movies are distributed across the rating categories. Each genre table also supports more specific observations, for example:
Low-budget Action movies are most common in the “Normal” rating category, and the number of Action movies generally falls as the budget increases.
Dramas are often made with low budgets, particularly in the “Good” and “Good Plus” rating categories. This genre is spread widely across budget categories and ratings.
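# Show the first 20 rows of the dataset after conversion to transactions
# (movies_transactions is not created in the code shown; it is presumably built with something like as(movies_data, "transactions"))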
inspect(movies_transactions[1:20])
## items transactionID
## [1] {Budget=[3,38],
## Year=[1.99e+03,2e+03),
## Name=Madonna: Truth or Dare,
## Rate=[5.6,6.4),
## Genre=Documentary,
## Budget_Category=Low,
## Rate_Category=Good,
## Year_Category=1991-1994} 1
## [2] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Dingo,
## Rate=[5.6,6.4),
## Genre=Drama,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 2
## [3] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Poison,
## Rate=[6.4,10],
## Genre=Drama,
## Budget_Category=Very Low,
## Rate_Category=Good,
## Year_Category=1991-1994} 3
## [4] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=High Strung,
## Rate=[0,5.6),
## Genre=Comedy,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 4
## [5] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Johnny Suede,
## Rate=[0,5.6),
## Genre=Comedy,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 5
## [6] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Daughters of the Dust,
## Rate=[6.4,10],
## Genre=Drama,
## Budget_Category=Very Low,
## Rate_Category=Good,
## Year_Category=1991-1994} 6
## [7] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Puppet Master III Toulon's Revenge,
## Rate=[5.6,6.4),
## Genre=Horror,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 7
## [8] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Delusion,
## Rate=[0,5.6),
## Genre=Crime,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 8
## [9] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Year of the Gun,
## Rate=[0,5.6),
## Genre=Action,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 9
## [10] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=The Rapture,
## Rate=[6.4,10],
## Genre=Mystery,
## Budget_Category=Very Low,
## Rate_Category=Good,
## Year_Category=1991-1994} 10
## [11] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Shakes the Clown,
## Rate=[0,5.6),
## Genre=Action,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 11
## [12] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=The Pit and the Pendulum,
## Rate=[5.6,6.4),
## Genre=Horror,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 12
## [13] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=My Own Private Idaho,
## Rate=[6.4,10],
## Genre=Drama,
## Budget_Category=Very Low,
## Rate_Category=Good Plus,
## Year_Category=1991-1994} 13
## [14] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Scenes from a Mall,
## Rate=[0,5.6),
## Genre=Comedy,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 14
## [15] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Night on Earth,
## Rate=[6.4,10],
## Genre=Comedy,
## Budget_Category=Very Low,
## Rate_Category=Good Plus,
## Year_Category=1991-1994} 15
## [16] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Closet Land,
## Rate=[6.4,10],
## Genre=Drama,
## Budget_Category=Very Low,
## Rate_Category=Good,
## Year_Category=1991-1994} 16
## [17] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Meet the Applegates,
## Rate=[0,5.6),
## Genre=Comedy,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 17
## [18] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Beastmaster 2: Through the Portal of Time,
## Rate=[0,5.6),
## Genre=Action,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 18
## [19] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=Cool as Ice,
## Rate=[0,5.6),
## Genre=Drama,
## Budget_Category=Very Low,
## Rate_Category=Normal,
## Year_Category=1991-1994} 19
## [20] {Budget=[0.01,0.8),
## Year=[1.99e+03,2e+03),
## Name=The People Under the Stairs,
## Rate=[6.4,10],
## Genre=Comedy,
## Budget_Category=Very Low,
## Rate_Category=Good,
## Year_Category=1991-1994} 20
head(itemFrequency(movies_transactions, type="absolute"))
## Budget=[0.01,0.8) Budget=[0.8,3) Budget=[3,38]
## 1811 1778 2148
## Year=[1.99e+03,2e+03) Year=[2e+03,2.01e+03) Year=[2.01e+03,2.02e+03]
## 1872 1640 2225
# Keep only the categorical columns that will be passed to the apriori method
subset_transactions <- movies_data[, c("Genre", "Budget_Category", "Rate_Category", "Year_Category")]
head(subset_transactions)
## Genre Budget_Category Rate_Category Year_Category
## 1 Documentary Low Good 1991-1994
## 2 Drama Very Low Normal 1991-1994
## 3 Drama Very Low Good 1991-1994
## 4 Comedy Very Low Normal 1991-1994
## 5 Comedy Very Low Normal 1991-1994
## 6 Drama Very Low Good 1991-1994
str(subset_transactions)
## 'data.frame': 5737 obs. of 4 variables:
## $ Genre : Factor w/ 22 levels "Action","Drama",..: 19 2 2 3 3 2 7 12 1 10 ...
## $ Budget_Category: Factor w/ 6 levels "Very Low","Low",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ Rate_Category : Factor w/ 7 levels "Very Low","Low",..: 4 3 4 3 3 4 3 3 3 4 ...
## $ Year_Category : Factor w/ 5 levels "1991-1994","1995-1999",..: 1 1 1 1 1 1 1 1 1 1 ...
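When a data frame of factors is mined, each cell becomes an item labelled `Column=level`, which is the form seen in the rule output (for example `Genre=Drama`). The following is a conceptual sketch in base R of that labelling on a made-up two-row data frame. It is not the arules implementation, and `row_to_items` is a hypothetical helper.

```r
# Hypothetical two-row data frame with the same column names as subset_transactions.
demo <- data.frame(
  Genre = factor(c("Drama", "Comedy")),
  Budget_Category = factor(c("Very Low", "Low")),
  Rate_Category = factor(c("Good", "Normal")),
  Year_Category = factor(c("1991-1994", "1991-1994"))
)

# Build one item label per cell in the form "Column=level" for a given row.
row_to_items <- function(row_index, data) {
  values <- vapply(data, function(col) as.character(col[row_index]), character(1))
  paste0(names(data), "=", values)
}

# One character vector of items per row, i.e. one "transaction" per movie.
items_per_row <- lapply(seq_len(nrow(demo)), row_to_items, data = demo)
items_per_row[[1]]
```

Each movie therefore contributes exactly four items, one per column, which is why no rule can contain two levels of the same variable.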
The following plots illustrate the relationships in the cross table.
#install.packages("vcd")
library(vcd)
## Loading required package: grid
cross_table <- xtabs(~ Budget_Category + Rate_Category + Genre, data = movies_data)
#Create a mosaic plot for a specific genre (e.g., 'Action')
mosaicplot(cross_table[, , "Action"], main = "Budget vs. Rate for Action Genre")
From this plot we can see that Action movies are concentrated in the ‘Normal’ and ‘Good’ rate categories at every budget level. High-budget productions do not guarantee the highest ratings, so producers must consider other factors to improve the likelihood of success. Only two Action movies reach ‘Very Good’ and none reach ‘Perfect’, which suggests that the highest ratings are rare in this genre, possibly because of its focus on entertainment. This presents an opportunity for filmmakers to explore how Action films could break into the highest rate categories.
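A mosaic plot shows areas, so it can help to look at the same slice as row proportions. The sketch below re-enters the Action counts from the cross table by hand so that it runs on its own. With the real objects, `prop.table(cross_table[, , "Action"], margin = 1)` should give the same shares.

```r
budget_levels <- c("Very Low", "Low", "Normal", "High", "Very High", "Highest")
rate_levels <- c("Very Low", "Low", "Normal", "Good", "Good Plus", "Very Good", "Perfect")

# Counts copied from the "Genre = Action" slice of the cross table.
action_counts <- matrix(
  c(0, 50, 240, 84, 20, 0, 0,
    1, 10, 111, 74, 18, 0, 0,
    0,  3,  78, 34,  7, 0, 0,
    0,  2,  54, 63, 11, 0, 0,
    0,  0,  25, 25,  9, 1, 0,
    0,  0,  12, 23, 15, 1, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(Budget_Category = budget_levels, Rate_Category = rate_levels)
)

# Share of each rating category within a budget level (each row sums to 1).
row_shares <- prop.table(action_counts, margin = 1)
round(row_shares, 2)

# Share of movies rated "Good" or better within each budget level.
good_or_better <- rowSums(row_shares[, c("Good", "Good Plus", "Very Good", "Perfect")])
round(good_or_better, 2)
```

On these counts the share rated ‘Good’ or better is about 0.26 for ‘Very Low’ budgets and about 0.76 for ‘Highest’ budgets, so budget level is worth examining alongside the plot.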
#Create a mosaic plot for a specific genre (e.g., 'Comedy')
mosaicplot(cross_table[, , "Comedy"], main = "Budget vs. Rate for Comedy Genre")
Comedy movies can achieve good ratings with low budgets. As the budget increases, the ratings don’t seem to improve much, indicating that other factors like script quality and cast chemistry play a crucial role in the genre’s success. High budgets don’t guarantee very good or perfect ratings. This suggests that budget is not the sole determinant of a Comedy movie’s success.
# Run apriori algorithm with specific rhs Genre
rules <- apriori(subset_transactions,
                 parameter = list(supp = 0.05, conf = 0.05),
                 appearance = list(default = "lhs", rhs = "Genre=Drama"),
                 control = list(verbose = FALSE))
# Inspect the unique rules
inspect(rules)
## lhs rhs support confidence coverage lift count
## [1] {Year_Category=2010-2017} => {Genre=Drama} 0.07530068 0.2266527 0.3322294 0.9703779 432
## [2] {Rate_Category=Good} => {Genre=Drama} 0.10371274 0.2936821 0.3531462 1.2573540 595
## [3] {Rate_Category=Normal} => {Genre=Drama} 0.07826390 0.1707874 0.4582534 0.7311994 449
## [4] {Budget_Category=Very Low} => {Genre=Drama} 0.15077567 0.2751272 0.5480216 1.1779141 865
## [5] {Budget_Category=Very Low,
## Year_Category=2010-2017} => {Genre=Drama} 0.05699843 0.2904085 0.1962698 1.2433386 327
## [6] {Budget_Category=Very Low,
## Rate_Category=Good} => {Genre=Drama} 0.06362210 0.3679435 0.1729127 1.5752926 365
## [7] {Budget_Category=Very Low,
## Rate_Category=Normal} => {Genre=Drama} 0.05455813 0.2070106 0.2635524 0.8862834 313
The Apriori algorithm returns association rules relating movie attributes to the Drama genre. Lift is the key measure here: values above 1 mean the attribute makes Drama more likely than its overall share of about 23%, and values below 1 mean the opposite. The 2010-2017 period has a lift of 0.97, so it is essentially unrelated to Drama, and ‘Normal’ rated movies are less likely to be Dramas (lift 0.73). Movies rated ‘Good’ are more likely to be Dramas (lift 1.26), as are ‘Very Low’ budget movies (lift 1.18). The combination of a ‘Very Low’ budget and a ‘Good’ rating gives the strongest rule, with a confidence of 0.37 and a lift of 1.58.
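To make the quality measures concrete, the sketch below recomputes them by hand for rule [6], {Budget_Category=Very Low, Rate_Category=Good} => {Genre=Drama}. Only the rule count (365) and the number of transactions (5737) are printed directly. The other two counts are inferred from the printed coverage and from confidence divided by lift, so treat them as assumptions of this sketch.

```r
n_total <- 5737  # number of transactions (from the mining info)
n_both  <- 365   # movies matching LHS and RHS together (the rule's count)
n_lhs   <- 992   # inferred: coverage 0.1729127 * 5737
n_rhs   <- 1340  # inferred: (confidence / lift) * 5737, i.e. all Drama movies

support    <- n_both / n_total                 # P(LHS and RHS)
coverage   <- n_lhs / n_total                  # P(LHS)
confidence <- n_both / n_lhs                   # P(RHS | LHS)
lift       <- confidence / (n_rhs / n_total)   # P(RHS | LHS) / P(RHS)

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

The values agree with the row printed by inspect(rules), which confirms that lift compares the rule's confidence with the overall share of Drama movies.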
# Summarize the first set of association rules
summary(rules)
## set of 7 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 4 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.429 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05456 Min. :0.1708 Min. :0.1729 Min. :0.7312
## 1st Qu.:0.06031 1st Qu.:0.2168 1st Qu.:0.2299 1st Qu.:0.9283
## Median :0.07530 Median :0.2751 Median :0.3322 Median :1.1779
## Mean :0.08332 Mean :0.2617 Mean :0.3321 Mean :1.1203
## 3rd Qu.:0.09099 3rd Qu.:0.2920 3rd Qu.:0.4057 3rd Qu.:1.2503
## Max. :0.15078 Max. :0.3679 Max. :0.5480 Max. :1.5753
## count
## Min. :313
## 1st Qu.:346
## Median :432
## Mean :478
## 3rd Qu.:522
## Max. :865
##
## mining info:
## data ntransactions support confidence
## subset_transactions 5737 0.05 0.05
## call
## apriori(data = subset_transactions, parameter = list(supp = 0.05, conf = 0.05), appearance = list(default = "lhs", rhs = "Genre=Drama"), control = list(verbose = FALSE))
# Plot the rules as a graph (plotting rules requires arulesViz)
library(arulesViz)
plot(rules, method = "graph")
# The opposite direction: run apriori with Genre=Drama fixed as the left-hand side (lhs)
rules1 <- apriori(subset_transactions,
                  parameter = list(supp = 0.05, conf = 0.05),
                  appearance = list(default = "rhs", lhs = "Genre=Drama"),
                  control = list(verbose = FALSE))
inspect(rules1)
## lhs rhs support confidence coverage
## [1] {Genre=Drama} => {Budget_Category=Low} 0.05246645 0.2246269 0.2335716
## [2] {Genre=Drama} => {Year_Category=2005-2009} 0.06153042 0.2634328 0.2335716
## [3] {Genre=Drama} => {Year_Category=2010-2017} 0.07530068 0.3223881 0.2335716
## [4] {Genre=Drama} => {Rate_Category=Good} 0.10371274 0.4440299 0.2335716
## [5] {Genre=Drama} => {Rate_Category=Normal} 0.07826390 0.3350746 0.2335716
## [6] {Genre=Drama} => {Budget_Category=Very Low} 0.15077567 0.6455224 0.2335716
## lift count
## [1] 0.9837285 301
## [2] 1.0102368 353
## [3] 0.9703779 432
## [4] 1.2573540 595
## [5] 0.7311994 449
## [6] 1.1779141 865
The rules in the first set (Drama on the right-hand side) identify attributes that, when present, change the likelihood of a movie being a Drama. For example, the rule {Year_Category=2010-2017} => {Genre=Drama} can be read as: “Among movies from 2010-2017, about 23% are Dramas” (confidence 0.23).
The rules in the second set (Drama on the left-hand side) identify attributes that are likely to be present when the movie is a Drama. For example, the rule {Genre=Drama} => {Budget_Category=Very Low} can be read as: “If a movie is a Drama, it has a Very Low budget about 65% of the time” (confidence 0.65).
The first rule set helps estimate the probability that a movie belongs to the Drama genre given its other features, which is useful for predictive and classification tasks. The second rule set, on the other hand, describes the common traits of Drama movies, which is useful for understanding the overall profile of Drama movies in the dataset.
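The two rule sets contain the same pair of items in both directions, which makes it easy to see that confidence depends on direction while lift does not. The sketch below checks this for Drama and ‘Very Low’ budget. The marginal counts are inferred from the printed coverage values, not printed directly.

```r
n_total    <- 5737  # number of transactions
n_drama    <- 1340  # inferred: coverage 0.2335716 * 5737
n_very_low <- 3144  # inferred: coverage 0.5480216 * 5737
n_both     <- 865   # count printed for both rules

conf_very_low_to_drama <- n_both / n_very_low  # P(Drama | Very Low)
conf_drama_to_very_low <- n_both / n_drama     # P(Very Low | Drama)

# Lift divides the joint share by the product of the two marginal shares,
# so swapping LHS and RHS leaves it unchanged.
lift_either_direction <- (n_both / n_total) /
  ((n_drama / n_total) * (n_very_low / n_total))

round(c(conf_very_low_to_drama = conf_very_low_to_drama,
        conf_drama_to_very_low = conf_drama_to_very_low,
        lift_either_direction = lift_either_direction), 4)
```

The confidences differ (about 0.28 versus 0.65) because the two conditions have very different sizes, while the lift is about 1.18 in both directions.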
#Run apriori algorithm with specific rhs Genre
rules2 <- apriori(subset_transactions,
                  parameter = list(supp = 0.05, conf = 0.05),
                  appearance = list(default = "lhs", rhs = "Genre=Action"),
                  control = list(verbose = FALSE))
inspect(rules2)
## lhs rhs support confidence
## [1] {Year_Category=2010-2017} => {Genre=Action} 0.05874150 0.1768101
## [2] {Rate_Category=Good} => {Genre=Action} 0.05281506 0.1495558
## [3] {Rate_Category=Normal} => {Genre=Action} 0.09063971 0.1977938
## [4] {Budget_Category=Very Low} => {Genre=Action} 0.06902562 0.1259542
## coverage lift count
## [1] 0.3322294 1.0425071 337
## [2] 0.3531462 0.8818104 303
## [3] 0.4582534 1.1662315 520
## [4] 0.5480216 0.7426508 396
These Apriori results show how certain attributes relate to the ‘Action’ genre. Action movies were slightly more common in 2010-2017 (lift 1.04), and ‘Normal’ rated movies are more likely than average to be Action movies (lift 1.17). In contrast, ‘Good’ rated movies (lift 0.88) and ‘Very Low’ budget movies (lift 0.74) are less likely to be Action movies than the dataset as a whole.
#Run apriori algorithm with specific rhs budget
rules3 <- apriori(subset_transactions,
                  parameter = list(supp = 0.07, conf = 0.09),
                  appearance = list(default = "lhs", rhs = "Budget_Category=Very Low"),
                  control = list(verbose = FALSE))
inspect(rules3)
## lhs rhs support confidence coverage lift count
## [1] {Year_Category=2000-2004} => {Budget_Category=Very Low} 0.08593341 0.4767892 0.1802336 0.8700189 493
## [2] {Genre=Comedy} => {Budget_Category=Very Low} 0.11957469 0.5929127 0.2016733 1.0819148 686
## [3] {Genre=Drama} => {Budget_Category=Very Low} 0.15077567 0.6455224 0.2335716 1.1779141 865
## [4] {Year_Category=2005-2009} => {Budget_Category=Very Low} 0.15478473 0.5935829 0.2607635 1.0831377 888
## [5] {Year_Category=2010-2017} => {Budget_Category=Very Low} 0.19626983 0.5907660 0.3322294 1.0779976 1126
## [6] {Rate_Category=Good} => {Budget_Category=Very Low} 0.17291267 0.4896347 0.3531462 0.8934588 992
## [7] {Rate_Category=Normal} => {Budget_Category=Very Low} 0.26355238 0.5751236 0.4582534 1.0494543 1512
## [8] {Rate_Category=Normal,
## Year_Category=2005-2009} => {Budget_Category=Very Low} 0.07442914 0.6161616 0.1207948 1.1243382 427
## [9] {Rate_Category=Normal,
## Year_Category=2010-2017} => {Budget_Category=Very Low} 0.10336413 0.6552486 0.1577480 1.1956620 593
library(arulesViz)
# Plot the rules
plot(rules3, method = "paracoord", measure = "lift")
The Apriori results show which genres, year categories and rate
categories are associated with the ‘Very Low’ budget category. In
summary: Drama and Comedy movies, and movies from 2005 onwards, are
somewhat more likely than average to be produced with very low budgets
(lift between 1.08 and 1.18). ‘Normal’ rated movies are also mildly
associated with very low budgets (lift 1.05), most clearly in 2010-2017
(lift 1.20), while ‘Good’ rated movies and movies from 2000-2004 are
slightly less likely to have very low budgets (lift below 1).
# Perform association rule mining
rulesall <- apriori(subset_transactions,
                    parameter = list(supp = 0.08, conf = 0.40),
                    control = list(verbose = FALSE))
inspect(rulesall)
## lhs rhs support confidence coverage lift count
## [1] {Genre=Action} => {Rate_Category=Normal} 0.09063971 0.5344296 0.1696008 1.1662315 520
## [2] {Year_Category=2000-2004} => {Rate_Category=Normal} 0.08262158 0.4584139 0.1802336 1.0003502 474
## [3] {Year_Category=2000-2004} => {Budget_Category=Very Low} 0.08593341 0.4767892 0.1802336 0.8700189 493
## [4] {Genre=Comedy} => {Rate_Category=Normal} 0.10876765 0.5393258 0.2016733 1.1769161 624
## [5] {Genre=Comedy} => {Budget_Category=Very Low} 0.11957469 0.5929127 0.2016733 1.0819148 686
## [6] {Budget_Category=Low} => {Rate_Category=Normal} 0.09935506 0.4351145 0.2283423 0.9495062 570
## [7] {Genre=Drama} => {Rate_Category=Good} 0.10371274 0.4440299 0.2335716 1.2573540 595
## [8] {Genre=Drama} => {Budget_Category=Very Low} 0.15077567 0.6455224 0.2335716 1.1779141 865
## [9] {Year_Category=2005-2009} => {Rate_Category=Normal} 0.12079484 0.4632353 0.2607635 1.0108714 693
## [10] {Year_Category=2005-2009} => {Budget_Category=Very Low} 0.15478473 0.5935829 0.2607635 1.0831377 888
## [11] {Year_Category=2010-2017} => {Rate_Category=Normal} 0.15774795 0.4748164 0.3322294 1.0361436 905
## [12] {Year_Category=2010-2017} => {Budget_Category=Very Low} 0.19626983 0.5907660 0.3322294 1.0779976 1126
## [13] {Rate_Category=Good} => {Budget_Category=Very Low} 0.17291267 0.4896347 0.3531462 0.8934588 992
## [14] {Rate_Category=Normal} => {Budget_Category=Very Low} 0.26355238 0.5751236 0.4582534 1.0494543 1512
## [15] {Budget_Category=Very Low} => {Rate_Category=Normal} 0.26355238 0.4809160 0.5480216 1.0494543 1512
## [16] {Rate_Category=Normal,
## Year_Category=2010-2017} => {Budget_Category=Very Low} 0.10336413 0.6552486 0.1577480 1.1956620 593
## [17] {Budget_Category=Very Low,
## Year_Category=2010-2017} => {Rate_Category=Normal} 0.10336413 0.5266430 0.1962698 1.1492396 593
plot(rulesall, measure = c("support", "confidence"), main = "Support vs. Confidence")
plot(rulesall, measure = "lift", method = "matrix", main = "Lift by Genre")
## Itemsets in Antecedent (LHS)
## [1] "{Genre=Drama}"
## [2] "{Rate_Category=Normal,Year_Category=2010-2017}"
## [3] "{Genre=Action}"
## [4] "{Budget_Category=Very Low,Year_Category=2010-2017}"
## [5] "{Genre=Comedy}"
## [6] "{Year_Category=2010-2017}"
## [7] "{Rate_Category=Normal}"
## [8] "{Budget_Category=Very Low}"
## [9] "{Year_Category=2005-2009}"
## [10] "{Budget_Category=Low}"
## [11] "{Year_Category=2000-2004}"
## [12] "{Rate_Category=Good}"
## Itemsets in Consequent (RHS)
## [1] "{Budget_Category=Very Low}" "{Rate_Category=Normal}"
## [3] "{Rate_Category=Good}"
plot(rulesall, method = "graph")
plot(rulesall, method = "paracoord", measure = "lift")
## analysis and conclusion
The findings reveal connections between genre, year category, budget and rating:
Action and Comedy movies tend to be associated with the ‘Normal’ rating category. Movies from the 2000s appear in rules for both ‘Normal’ ratings and ‘Very Low’ budgets, but their lifts are close to or below 1, so these links are weak. Dramas are more likely than average to receive ‘Good’ ratings despite mostly having ‘Very Low’ budgets. ‘Normal’ ratings are connected to ‘Very Low’ budgets, especially for movies released in 2010-2017, which indicates that this period saw many very-low-budget films with moderate ratings.
In summary, genre and release year are associated with a movie’s budget and rating. Action and Comedy films generally receive ‘Normal’ ratings, while Dramas often achieve ‘Good’ ratings despite limited budgets. The period 2010-2017 stands out for its production of very-low-budget movies with ‘Normal’ ratings. These associations can inform film production and marketing strategies within the industry.
# Check if the association rules are maximal (not subsumed by any other rule)
is.maximal(rulesall)
## {Genre=Action,Rate_Category=Normal}
## TRUE
## {Rate_Category=Normal,Year_Category=2000-2004}
## TRUE
## {Budget_Category=Very Low,Year_Category=2000-2004}
## TRUE
## {Genre=Comedy,Rate_Category=Normal}
## TRUE
## {Genre=Comedy,Budget_Category=Very Low}
## TRUE
## {Budget_Category=Low,Rate_Category=Normal}
## TRUE
## {Genre=Drama,Rate_Category=Good}
## TRUE
## {Genre=Drama,Budget_Category=Very Low}
## TRUE
## {Rate_Category=Normal,Year_Category=2005-2009}
## TRUE
## {Budget_Category=Very Low,Year_Category=2005-2009}
## TRUE
## {Rate_Category=Normal,Year_Category=2010-2017}
## FALSE
## {Budget_Category=Very Low,Year_Category=2010-2017}
## FALSE
## {Budget_Category=Very Low,Rate_Category=Good}
## TRUE
## {Budget_Category=Very Low,Rate_Category=Normal}
## FALSE
## {Budget_Category=Very Low,Rate_Category=Normal}
## FALSE
## {Budget_Category=Very Low,Rate_Category=Normal,Year_Category=2010-2017}
## TRUE
## {Budget_Category=Very Low,Rate_Category=Normal,Year_Category=2010-2017}
## TRUE
As the output shows, most of the rules are maximal, as indicated by the TRUE values: their itemsets are not contained in any larger itemset in the rule set. The FALSE entries are the two-item sets that are contained in the three-item set {Budget_Category=Very Low, Rate_Category=Normal, Year_Category=2010-2017}.
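The idea of a maximal itemset can be sketched in a few lines of base R. The example below takes four of the itemsets listed in the output and marks each one as maximal if no other itemset in the list strictly contains it. `is_maximal_demo` is a hypothetical name, and this is an illustration of the concept, not the arules is.maximal() implementation.

```r
# Itemsets behind four of the rules in rulesall (LHS and RHS items combined).
itemsets <- list(
  c("Genre=Drama", "Rate_Category=Good"),
  c("Rate_Category=Normal", "Year_Category=2010-2017"),
  c("Budget_Category=Very Low", "Year_Category=2010-2017"),
  c("Budget_Category=Very Low", "Rate_Category=Normal", "Year_Category=2010-2017")
)

# An itemset is maximal here if no other itemset in the list strictly contains it.
is_maximal_demo <- vapply(seq_along(itemsets), function(i) {
  contained <- vapply(seq_along(itemsets), function(j) {
    i != j &&
      length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]])
  }, logical(1))
  !any(contained)
}, logical(1))

is_maximal_demo
```

The two 2010-2017 pairs come out as not maximal because both are contained in the three-item set, which matches the FALSE entries reported for them.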
#install.packages('arulesViz')
library(arulesViz)
# Keep only the rules with confidence above 0.5
high_conf_rules <- subset(rulesall, confidence > 0.5)
high_conf_rules
## set of 9 rules
inspect(high_conf_rules)
## lhs rhs support confidence coverage lift count
## [1] {Genre=Action} => {Rate_Category=Normal} 0.09063971 0.5344296 0.1696008 1.166231 520
## [2] {Genre=Comedy} => {Rate_Category=Normal} 0.10876765 0.5393258 0.2016733 1.176916 624
## [3] {Genre=Comedy} => {Budget_Category=Very Low} 0.11957469 0.5929127 0.2016733 1.081915 686
## [4] {Genre=Drama} => {Budget_Category=Very Low} 0.15077567 0.6455224 0.2335716 1.177914 865
## [5] {Year_Category=2005-2009} => {Budget_Category=Very Low} 0.15478473 0.5935829 0.2607635 1.083138 888
## [6] {Year_Category=2010-2017} => {Budget_Category=Very Low} 0.19626983 0.5907660 0.3322294 1.077998 1126
## [7] {Rate_Category=Normal} => {Budget_Category=Very Low} 0.26355238 0.5751236 0.4582534 1.049454 1512
## [8] {Rate_Category=Normal,
## Year_Category=2010-2017} => {Budget_Category=Very Low} 0.10336413 0.6552486 0.1577480 1.195662 593
## [9] {Budget_Category=Very Low,
## Year_Category=2010-2017} => {Rate_Category=Normal} 0.10336413 0.5266430 0.1962698 1.149240 593
# Visualizing the subset of high conf.
plot(high_conf_rules)
plot(high_conf_rules, method = "paracoord", control = list(reorder = TRUE))
The association rules presented in this study offer concrete insights
for movie producers. Our analysis shows that movies categorized as
Comedy or Drama are associated with a Very Low budget (confidence 0.59
and 0.65, lift 1.08 and 1.18). Movies in the Normal rate category are
also associated with a Very Low budget, although only weakly (lift
1.05). Producers can use these patterns as one input when deciding on
budget allocation and positioning for their movies. Because the lifts
are close to 1 and the rules describe co-occurrence rather than cause
and effect, they are best treated as a starting point for further
analysis, not as a guarantee of commercial success.