Introduction

This project uses association rule mining to uncover patterns and relationships in a dataset of English movies, focusing on the co-occurrence of budget, rating, and genre. The resulting insights could help movie producers make better-informed decisions during production and improve their chances of creating successful movies that resonate with audiences. The data come from IMDb.

  • Data Loading and Preprocessing:
    • The movie data is read from a CSV file named “ART.csv”.
    • The packages needed for association rule mining (arules, arulesViz, and arulesCBA) are installed and loaded.
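
The loading step described above is not shown in this report. A minimal sketch, assuming ART.csv sits in the working directory and holds the Budget, Year, Name, Rate, and Genre columns that appear later in str(movies_data), might look like this. The csv_path variable and the file.exists() guard are illustrative additions, not part of the original script.

```r
# Packages named in the text; install once if they are missing:
# install.packages(c("arules", "arulesViz", "arulesCBA"))
pkgs <- c("arules", "arulesViz", "arulesCBA")

# Assumed location of the data file; adjust to wherever the CSV is stored
csv_path <- "ART.csv"

if (file.exists(csv_path)) {
  # Attach only the packages that are actually installed
  for (p in pkgs) {
    if (requireNamespace(p, quietly = TRUE)) library(p, character.only = TRUE)
  }
  # Keep text columns as character; they are converted to factors later
  movies_data <- read.csv(csv_path, stringsAsFactors = FALSE)
  str(movies_data)
} else {
  message("ART.csv not found in ", getwd())
}
```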

Data Cleaning

  • Exploratory Data Analysis and data cleaning
    • Cut points are used to assign labels to the numeric variables (columns).
    • The columns of interest (‘Genre’, ‘Budget’, ‘Year’, and ‘Rate’) are converted to the factor type to enable categorical analysis.
    • The entire dataset is converted into transactions for association rule mining.
# Define cut points and labels for Budget
budget_cut_points <- c(0, 2, 4, 6, 10, 15, Inf)
budget_labels <- c("Very Low", "Low", "Normal", "High", "Very High", "Highest")

# Use the cut function to categorize Budget values
budget_categories <- cut(movies_data$Budget, breaks = budget_cut_points, labels = budget_labels, include.lowest = TRUE)

# Add Budget_Category to movies_data
movies_data$Budget_Category <- as.character(budget_categories)
# Define the cut points and labels for the "Rate_Category" column
rate_cut_points <- c(0.1, 2, 4, 6, 7, 8, 9, 10)
rate_labels <- c("Very Low", "Low", "Normal", "Good", "Good Plus", "Very Good", "Perfect")
# Use the cut function to categorize "Rate" values for the entire column
movies_data$Rate_Category <- cut(movies_data$Rate, breaks = rate_cut_points, labels = rate_labels, include.lowest = TRUE)
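# Define the cut points and labels for the years
# Note: cut() closes intervals on the right by default, so the first bin is [1990, 1995]
# and a 1995 release is labelled "1991-1994"; the same one-year shift applies at 2000, 2005, and 2010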
year_cut_points <- c(1990, 1995, 2000, 2005, 2010, Inf)
year_labels <- c("1991-1994", "1995-1999", "2000-2004", "2005-2009", "2010-2017")

# Use the cut function to categorize years for the entire column
movies_data$Year_Category <- cut(movies_data$Year, breaks = year_cut_points, labels = year_labels, include.lowest = TRUE)
# Convert the new columns to factors so they can be used in association rule mining
library(arules)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
movies_data$Budget_Category <- factor(movies_data$Budget_Category, levels = budget_labels)
movies_data$Rate_Category <- factor(movies_data$Rate_Category, levels = rate_labels)
movies_data$Year_Category <- factor(movies_data$Year_Category, levels = year_labels)
# Define genre_levels
genre_levels <- c("Action", "Drama", "Comedy", "Science Fiction", "Romance", "Family", "Horror", "Animation", "Thriller", "Mystery", "Fantasy", "Crime", "Adventure", "Biography", "History", "Sport", "War", "Music", "Documentary", "Western", "Musical", "Sci-Fi")

# Convert Genre to a factor
movies_data$Genre <- factor(movies_data$Genre, levels = genre_levels)
# Show the structure and the first rows of the data
str(movies_data)
## 'data.frame':    5737 obs. of  8 variables:
##  $ Budget         : num  3 0.02 0.025 0.03 0.05 ...
##  $ Year           : int  1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
##  $ Name           : chr  "Madonna: Truth or Dare" "Dingo" "Poison" "High Strung" ...
##  $ Rate           : num  6.3 6 6.4 5.4 4.8 6.8 6 4.8 4.2 6.4 ...
##  $ Genre          : Factor w/ 22 levels "Action","Drama",..: 19 2 2 3 3 2 7 12 1 10 ...
##  $ Budget_Category: Factor w/ 6 levels "Very Low","Low",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Rate_Category  : Factor w/ 7 levels "Very Low","Low",..: 4 3 4 3 3 4 3 3 3 4 ...
##  $ Year_Category  : Factor w/ 5 levels "1991-1994","1995-1999",..: 1 1 1 1 1 1 1 1 1 1 ...
head(movies_data, 10)
##       Budget Year                               Name Rate       Genre
## 1  3.0000000 1991             Madonna: Truth or Dare  6.3 Documentary
## 2  0.0200000 1991                              Dingo  6.0       Drama
## 3  0.0250000 1991                             Poison  6.4       Drama
## 4  0.0300000 1991                        High Strung  5.4      Comedy
## 5  0.0500000 1991                       Johnny Suede  4.8      Comedy
## 6  0.0800000 1991              Daughters of the Dust  6.8       Drama
## 7  0.0800000 1991 Puppet Master III Toulon's Revenge  6.0      Horror
## 8  0.1000000 1991                           Delusion  4.8       Crime
## 9  0.1182273 1991                    Year of the Gun  4.2      Action
## 10 0.1227401 1991                        The Rapture  6.4     Mystery
##    Budget_Category Rate_Category Year_Category
## 1              Low          Good     1991-1994
## 2         Very Low        Normal     1991-1994
## 3         Very Low          Good     1991-1994
## 4         Very Low        Normal     1991-1994
## 5         Very Low        Normal     1991-1994
## 6         Very Low          Good     1991-1994
## 7         Very Low        Normal     1991-1994
## 8         Very Low        Normal     1991-1994
## 9         Very Low        Normal     1991-1994
## 10        Very Low          Good     1991-1994
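
The binning above can be checked on a few hand-picked values. The sketch below reuses the report's break points and labels; the test values (test_budgets, test_years) and the left-closed alternative (year_alt) are illustrative assumptions, not part of the original analysis.

```r
# Break points and labels copied from the report
budget_cut_points <- c(0, 2, 4, 6, 10, 15, Inf)
budget_labels <- c("Very Low", "Low", "Normal", "High", "Very High", "Highest")
year_cut_points <- c(1990, 1995, 2000, 2005, 2010, Inf)
year_labels <- c("1991-1994", "1995-1999", "2000-2004", "2005-2009", "2010-2017")

# Hypothetical test values placed on and around the bin edges
test_budgets <- c(0.02, 2, 3, 20)
test_years <- c(1994, 1995, 1996, 2010)

budget_check <- cut(test_budgets, breaks = budget_cut_points,
                    labels = budget_labels, include.lowest = TRUE)
year_check <- cut(test_years, breaks = year_cut_points,
                  labels = year_labels, include.lowest = TRUE)

# Intervals are right-closed by default, so an edge value joins the lower bin
print(data.frame(value = test_budgets, bin = budget_check))
print(data.frame(value = test_years, bin = year_check))

# Alternative (not what the report ran): left-closed bins starting at 1991
year_alt <- cut(test_years, breaks = c(1991, 1995, 2000, 2005, 2010, Inf),
                labels = year_labels, right = FALSE, include.lowest = TRUE)
print(data.frame(value = test_years, bin = year_alt))
```

With the report's break points, 1995 and 2010 are labelled "1991-1994" and "2005-2009". The left-closed alternative lines the bins up with the label text, but adopting it would change the category counts shown in this report.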

Some Analysis Before Applying Association Rules

# Build a cross table showing how movies are distributed across budget, rating, and genre
cross_table <- xtabs(~ Budget_Category + Rate_Category + Genre, data = movies_data)
print(cross_table)
## , , Genre = Action
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0  50    240   84        20         0       0
##       Low              1  10    111   74        18         0       0
##       Normal           0   3     78   34         7         0       0
##       High             0   2     54   63        11         0       0
##       Very High        0   0     25   25         9         1       0
##       Highest          0   0     12   23        15         1       0
## 
## , , Genre = Drama
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         2  27    313  365       141        10       0
##       Low              1   3     89  145        52         6       1
##       Normal           0   1     32   41        20         0       0
##       High             0   0     10   33        14         1       0
##       Very High        0   0      4   10         3         0       0
##       Highest          0   0      1    1         2         1       0
## 
## , , Genre = Comedy
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         2  26    354  244        52         4       1
##       Low              0  10    155  109        17         1       1
##       Normal           0   0     66   27         3         1       0
##       High             0   2     43   26         1         0       0
##       Very High        0   0      5    0         0         0       0
##       Highest          0   0      1    0         0         0       0
## 
## , , Genre = Science Fiction
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         1   6     30   12         4         0       0
##       Low              0   0      5    7         3         0       0
##       Normal           0   0      0    5         1         0       0
##       High             0   0     11    3         0         0       0
##       Very High        0   0      4    6         3         0       0
##       Highest          0   0      3    1         2         0       0
## 
## , , Genre = Romance
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         1   1     22   21        13         0       0
##       Low              0   2     10   16         9         0       1
##       Normal           0   0      4    5         1         0       0
##       High             0   0      5    2         1         0       0
##       Very High        0   0      0    1         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Family
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   1     17    2         0         0       0
##       Low              0   0      8    6         2         0       0
##       Normal           0   0      2    4         1         0       0
##       High             0   0      2    4         0         0       0
##       Very High        0   0      2    3         1         0       0
##       Highest          0   0      0    5         1         0       0
## 
## , , Genre = Horror
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         1  62    228   45         7         0       0
##       Low              3  17     46   23         0         0       0
##       Normal           0   0      3    7         1         0       0
##       High             0   0      7    1         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      1    0         0         0       0
## 
## , , Genre = Animation
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         1   0     12   14         2         0       0
##       Low              0   0     14    8         5         0       0
##       Normal           0   1      6    4         4         0       0
##       High             0   1     11   21         6         0       0
##       Very High        0   0      5   13         4         0       0
##       Highest          0   0      2    3         4         0       0
## 
## , , Genre = Thriller
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0  23     92   40         8         1       0
##       Low              0   2     28   19         9         0       0
##       Normal           0   0      8   11         3         0       0
##       High             0   0     10    5         4         0       0
##       Very High        0   0      2    4         0         0       0
##       Highest          0   0      1    0         0         0       0
## 
## , , Genre = Mystery
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   1     27   10         2         1       0
##       Low              0   3      3    7         4         1       0
##       Normal           0   0      0    3         0         0       0
##       High             0   0      3    2         2         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Fantasy
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         2   4     21   15         9         0       0
##       Low              0   1     15   10         4         0       0
##       Normal           0   0     10    5         0         1       0
##       High             0   1     10    6         0         0       0
##       Very High        0   0      6    3         3         0       0
##       Highest          0   0      4    1         1         0       0
## 
## , , Genre = Crime
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   4     66   52        19         2       0
##       Low              0   2     20   26         8         2       0
##       Normal           0   0      7   21         6         0       0
##       High             0   0      6    9         3         0       0
##       Very High        0   0      0    1         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Adventure
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   6     35   21         9         0       0
##       Low              0   0     40   18         3         0       0
##       Normal           0   1     21   15         5         0       0
##       High             0   0     24   24        11         1       0
##       Very High        0   0     21   28        11         0       0
##       Highest          0   0      8   21         6         1       0
## 
## , , Genre = Biography
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      0    0         0         0       0
##       Low              0   0      0    0         0         0       0
##       Normal           0   0      0    0         0         0       0
##       High             0   0      0    0         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = History
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      4    9         6         1       0
##       Low              0   0      3    3         0         0       0
##       Normal           0   0      1    0         1         0       0
##       High             0   0      0    2         0         0       0
##       Very High        0   0      0    1         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Sport
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      0    0         0         0       0
##       Low              0   0      0    0         0         0       0
##       Normal           0   0      0    0         0         0       0
##       High             0   0      0    0         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = War
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      3    4         2         0       0
##       Low              0   0      2    3         1         0       0
##       Normal           0   0      0    1         1         0       0
##       High             0   0      2    2         2         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      1    0         0         0       0
## 
## , , Genre = Music
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      4    9         3         0       0
##       Low              0   0      3    8         1         0       0
##       Normal           0   0      1    1         0         0       0
##       High             0   0      0    1         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Documentary
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         1   3     25   36        31         2       1
##       Low              0   1     15   10        13         4       1
##       Normal           0   0      0    0         0         0       0
##       High             0   0      0    0         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Western
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   1      8    2         1         0       0
##       Low              0   1      0    2         0         0       0
##       Normal           0   0      1    1         0         0       0
##       High             0   0      0    1         1         0       0
##       Very High        0   0      1    0         1         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Musical
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      0    0         0         0       0
##       Low              0   0      0    0         0         0       0
##       Normal           0   0      0    0         0         0       0
##       High             0   0      0    0         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0
## 
## , , Genre = Sci-Fi
## 
##                Rate_Category
## Budget_Category Very Low Low Normal Good Good Plus Very Good Perfect
##       Very Low         0   0      0    0         0         0       0
##       Low              0   0      0    0         0         0       0
##       Normal           0   0      0    0         0         0       0
##       High             0   0      0    0         0         0       0
##       Very High        0   0      0    0         0         0       0
##       Highest          0   0      0    0         0         0       0

This table suggests that budget constraints play a significant role in movie production: most genres are concentrated in the lower budget categories. Some genres, such as Adventure and Science Fiction, may require higher budgets because they rely on special effects, in contrast to genres like Drama and Comedy. Market trends and audience preferences can also be inferred from how movies are distributed across the rating categories. Some observations from the individual genre tables:

Low-budget Action movies are most common in the “Normal” rating category. As the budget increases, the number of movies produced generally decreases.
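
To put numbers on this observation, the Action counts can be turned into within-budget shares. The sketch below retypes the Action slice of the printed cross table by hand so that it runs on its own; action_counts and action_share are illustrative names.

```r
budget_labels <- c("Very Low", "Low", "Normal", "High", "Very High", "Highest")
rate_labels <- c("Very Low", "Low", "Normal", "Good", "Good Plus", "Very Good", "Perfect")

# Counts copied from the ", , Genre = Action" block of the cross table
action_counts <- matrix(
  c(0, 50, 240, 84, 20, 0, 0,
    1, 10, 111, 74, 18, 0, 0,
    0,  3,  78, 34,  7, 0, 0,
    0,  2,  54, 63, 11, 0, 0,
    0,  0,  25, 25,  9, 1, 0,
    0,  0,  12, 23, 15, 1, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(Budget_Category = budget_labels, Rate_Category = rate_labels)
)

# Number of Action movies in each budget category
print(rowSums(action_counts))

# Share of each rating category within a budget category (each row sums to 1)
action_share <- prop.table(action_counts, margin = 1)
print(round(action_share, 2))
```

About 61% of the Very Low budget Action movies (240 of 394) are rated “Normal”, against about 24% (12 of 51) in the Highest budget category. On the full data the same shares could be obtained with prop.table(cross_table[, , "Action"], margin = 1).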

Dramas are often made with low budgets, particularly in the “Good” and “Good Plus” rating categories. The genre is spread widely across budget and rating categories.
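
The report does not show the call that creates the movies_transactions object inspected next. One common way to build such an object, sketched here on a toy data frame, is coercion with as(..., "transactions") from arules. The toy_movies data frame is hypothetical (three rows copied from the head() output), and the assumption that the original object was built the same way is not confirmed by the source. The interval items such as Budget=[0.01,0.8) suggest that the numeric columns were discretized during conversion, which recent arules versions appear to do by default.

```r
library(arules)

# Toy stand-in for the categorical columns of movies_data
toy_movies <- data.frame(
  Genre = factor(c("Documentary", "Drama", "Comedy")),
  Budget_Category = factor(c("Low", "Very Low", "Very Low")),
  Rate_Category = factor(c("Good", "Normal", "Normal")),
  Year_Category = factor(c("1991-1994", "1991-1994", "1991-1994"))
)

# Each row becomes one transaction; each column=value pair becomes one item
toy_transactions <- as(toy_movies, "transactions")
inspect(toy_transactions)
```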

# Show the first 20 rows of the dataset after conversion to transactions
inspect(movies_transactions[1:20]) 
##      items                                             transactionID
## [1]  {Budget=[3,38],                                                
##       Year=[1.99e+03,2e+03),                                        
##       Name=Madonna: Truth or Dare,                                  
##       Rate=[5.6,6.4),                                               
##       Genre=Documentary,                                            
##       Budget_Category=Low,                                          
##       Rate_Category=Good,                                           
##       Year_Category=1991-1994}                                    1 
## [2]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Dingo,                                                   
##       Rate=[5.6,6.4),                                               
##       Genre=Drama,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    2 
## [3]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Poison,                                                  
##       Rate=[6.4,10],                                                
##       Genre=Drama,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good,                                           
##       Year_Category=1991-1994}                                    3 
## [4]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=High Strung,                                             
##       Rate=[0,5.6),                                                 
##       Genre=Comedy,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    4 
## [5]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Johnny Suede,                                            
##       Rate=[0,5.6),                                                 
##       Genre=Comedy,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    5 
## [6]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Daughters of the Dust,                                   
##       Rate=[6.4,10],                                                
##       Genre=Drama,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good,                                           
##       Year_Category=1991-1994}                                    6 
## [7]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Puppet Master III Toulon's Revenge,                      
##       Rate=[5.6,6.4),                                               
##       Genre=Horror,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    7 
## [8]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Delusion,                                                
##       Rate=[0,5.6),                                                 
##       Genre=Crime,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    8 
## [9]  {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Year of the Gun,                                         
##       Rate=[0,5.6),                                                 
##       Genre=Action,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    9 
## [10] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=The Rapture,                                             
##       Rate=[6.4,10],                                                
##       Genre=Mystery,                                                
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good,                                           
##       Year_Category=1991-1994}                                    10
## [11] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Shakes the Clown,                                        
##       Rate=[0,5.6),                                                 
##       Genre=Action,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    11
## [12] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=The Pit and the Pendulum,                                
##       Rate=[5.6,6.4),                                               
##       Genre=Horror,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    12
## [13] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=My Own Private Idaho,                                    
##       Rate=[6.4,10],                                                
##       Genre=Drama,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good Plus,                                      
##       Year_Category=1991-1994}                                    13
## [14] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Scenes from a Mall,                                      
##       Rate=[0,5.6),                                                 
##       Genre=Comedy,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    14
## [15] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Night on Earth,                                          
##       Rate=[6.4,10],                                                
##       Genre=Comedy,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good Plus,                                      
##       Year_Category=1991-1994}                                    15
## [16] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Closet Land,                                             
##       Rate=[6.4,10],                                                
##       Genre=Drama,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good,                                           
##       Year_Category=1991-1994}                                    16
## [17] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Meet the Applegates,                                     
##       Rate=[0,5.6),                                                 
##       Genre=Comedy,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    17
## [18] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Beastmaster 2: Through the Portal of Time,               
##       Rate=[0,5.6),                                                 
##       Genre=Action,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    18
## [19] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=Cool as Ice,                                             
##       Rate=[0,5.6),                                                 
##       Genre=Drama,                                                  
##       Budget_Category=Very Low,                                     
##       Rate_Category=Normal,                                         
##       Year_Category=1991-1994}                                    19
## [20] {Budget=[0.01,0.8),                                            
##       Year=[1.99e+03,2e+03),                                        
##       Name=The People Under the Stairs,                             
##       Rate=[6.4,10],                                                
##       Genre=Comedy,                                                 
##       Budget_Category=Very Low,                                     
##       Rate_Category=Good,                                           
##       Year_Category=1991-1994}                                    20
head(itemFrequency(movies_transactions, type="absolute"))
##        Budget=[0.01,0.8)           Budget=[0.8,3)            Budget=[3,38] 
##                     1811                     1778                     2148 
##    Year=[1.99e+03,2e+03)    Year=[2e+03,2.01e+03) Year=[2.01e+03,2.02e+03] 
##                     1872                     1640                     2225
# Subset the data to the columns used by the apriori method
subset_transactions <- movies_data[, c("Genre", "Budget_Category", "Rate_Category", "Year_Category")]
head(subset_transactions)
##         Genre Budget_Category Rate_Category Year_Category
## 1 Documentary             Low          Good     1991-1994
## 2       Drama        Very Low        Normal     1991-1994
## 3       Drama        Very Low          Good     1991-1994
## 4      Comedy        Very Low        Normal     1991-1994
## 5      Comedy        Very Low        Normal     1991-1994
## 6       Drama        Very Low          Good     1991-1994
str(subset_transactions)
## 'data.frame':    5737 obs. of  4 variables:
##  $ Genre          : Factor w/ 22 levels "Action","Drama",..: 19 2 2 3 3 2 7 12 1 10 ...
##  $ Budget_Category: Factor w/ 6 levels "Very Low","Low",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Rate_Category  : Factor w/ 7 levels "Very Low","Low",..: 4 3 4 3 3 4 3 3 3 4 ...
##  $ Year_Category  : Factor w/ 5 levels "1991-1994","1995-1999",..: 1 1 1 1 1 1 1 1 1 1 ...

Some plots to illustrate the relationships in the cross table:

#install.packages("vcd")

library(vcd)
## Loading required package: grid
cross_table <- xtabs(~ Budget_Category + Rate_Category + Genre, data = movies_data)
#Create a mosaic plot for a specific genre (e.g., 'Action')
mosaicplot(cross_table[, , "Action"], main = "Budget vs. Rate for Action Genre")

This plot suggests that Action movies fall mostly in the ‘Normal’ and ‘Good’ rate categories, regardless of budget. High-budget productions do not always earn the highest ratings, so producers must consider other factors to improve the likelihood of success. No Action movie falls in the ‘Perfect’ rate category, which suggests that the highest critical acclaim is rare for this genre, possibly because of its focus on entertainment. This presents an opportunity for filmmakers to explore how Action films could break into the highest rate category.

#Create a mosaic plot for a specific genre (e.g., 'Comedy')
mosaicplot(cross_table[, , "Comedy"], main = "Budget vs. Rate for Comedy Genre")

Comedy movies can achieve good ratings with low budgets. As the budget increases, the ratings do not seem to improve much, which suggests that other factors, such as script quality and cast chemistry, may play a crucial role in the genre’s success. High budgets do not guarantee ‘Very Good’ or ‘Perfect’ ratings, so budget is not the sole determinant of a Comedy movie’s success.
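The mosaic plots are read by eye, so it can help to check the same comparison numerically. The sketch below computes the share of each rate category within each budget category using base R. The `toy` data frame is made up purely for illustration and is not taken from the movie dataset.

```r
# Hypothetical toy data: these values are invented for illustration only
toy <- data.frame(
  Budget_Category = factor(c("Very Low", "Very Low", "Very Low", "Low", "Low", "High"),
                           levels = c("Very Low", "Low", "High")),
  Rate_Category   = factor(c("Normal", "Good", "Normal", "Normal", "Good", "Normal"),
                           levels = c("Normal", "Good")),
  Genre           = "Comedy"
)

# Same kind of cross-tabulation as cross_table, restricted to one genre
toy_table <- xtabs(~ Budget_Category + Rate_Category, data = toy)

# Share of each rate category within each budget category (each row sums to 1)
row_share <- prop.table(toy_table, margin = 1)
round(row_share, 2)
```

On the real data, the same call would presumably be `prop.table(cross_table[, , "Comedy"], margin = 1)`. Budget categories with no movies in the chosen genre would give `NaN` rows, because the row total is zero.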

Apriori algorithm

# Run apriori algorithm with specific rhs Genre
rules <- apriori(subset_transactions, 
                 parameter = list(supp = 0.05, conf = 0.05), 
                 appearance = list(default = "lhs", rhs = "Genre=Drama"), 
                 control = list(verbose = FALSE))
# Inspect the unique rules
inspect(rules)
##     lhs                            rhs              support confidence  coverage      lift count
## [1] {Year_Category=2010-2017}   => {Genre=Drama} 0.07530068  0.2266527 0.3322294 0.9703779   432
## [2] {Rate_Category=Good}        => {Genre=Drama} 0.10371274  0.2936821 0.3531462 1.2573540   595
## [3] {Rate_Category=Normal}      => {Genre=Drama} 0.07826390  0.1707874 0.4582534 0.7311994   449
## [4] {Budget_Category=Very Low}  => {Genre=Drama} 0.15077567  0.2751272 0.5480216 1.1779141   865
## [5] {Budget_Category=Very Low,                                                                  
##      Year_Category=2010-2017}   => {Genre=Drama} 0.05699843  0.2904085 0.1962698 1.2433386   327
## [6] {Budget_Category=Very Low,                                                                  
##      Rate_Category=Good}        => {Genre=Drama} 0.06362210  0.3679435 0.1729127 1.5752926   365
## [7] {Budget_Category=Very Low,                                                                  
##      Rate_Category=Normal}      => {Genre=Drama} 0.05455813  0.2070106 0.2635524 0.8862834   313

The Apriori algorithm returns association rules relating movie attributes to the Drama genre. A lift above 1 means that the attribute combination contains more Dramas than the dataset as a whole (where about 23% of movies are Dramas), and a lift below 1 means fewer. Movies rated ‘Good’ (lift 1.26) and movies with a ‘Very Low’ budget (lift 1.18) are more likely to be Dramas, and the combination of the two has the highest confidence (0.37) and lift (1.58). The 2010-2017 period on its own shows essentially no association with Drama (lift 0.97), and ‘Normal’ rated movies are less likely than average to be Dramas (lift 0.73).
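To make the quality measures concrete, the following sketch recomputes support, confidence, and lift for rule [6] by hand. The counts are derived from the outputs in this report: 365 is the rule's count, 992 is the count of movies with a ‘Very Low’ budget and a ‘Good’ rating (coverage 0.1729127 times 5737), and 1340 is the number of Dramas (0.2335716 times 5737). The variable names are chosen here for illustration.

```r
n          <- 5737  # total number of transactions
count_both <- 365   # movies matching lhs and rhs (count of rule [6])
count_lhs  <- 992   # movies matching the lhs: Very Low budget and Good rating
count_rhs  <- 1340  # movies matching the rhs: Genre=Drama

support    <- count_both / n                 # share of all movies matching the whole rule
confidence <- count_both / count_lhs         # share of lhs movies that are Dramas
lift       <- confidence / (count_rhs / n)   # confidence relative to the overall Drama share

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The results agree with the values reported by `inspect(rules)` for rule [6] (support 0.0636, confidence 0.3679, lift 1.5753).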

# Summary of the first set of association rules
summary(rules)
## set of 7 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 4 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.429   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage           lift       
##  Min.   :0.05456   Min.   :0.1708   Min.   :0.1729   Min.   :0.7312  
##  1st Qu.:0.06031   1st Qu.:0.2168   1st Qu.:0.2299   1st Qu.:0.9283  
##  Median :0.07530   Median :0.2751   Median :0.3322   Median :1.1779  
##  Mean   :0.08332   Mean   :0.2617   Mean   :0.3321   Mean   :1.1203  
##  3rd Qu.:0.09099   3rd Qu.:0.2920   3rd Qu.:0.4057   3rd Qu.:1.2503  
##  Max.   :0.15078   Max.   :0.3679   Max.   :0.5480   Max.   :1.5753  
##      count    
##  Min.   :313  
##  1st Qu.:346  
##  Median :432  
##  Mean   :478  
##  3rd Qu.:522  
##  Max.   :865  
## 
## mining info:
##                 data ntransactions support confidence
##  subset_transactions          5737    0.05       0.05
##                                                                                                                                                                       call
##  apriori(data = subset_transactions, parameter = list(supp = 0.05, conf = 0.05), appearance = list(default = "lhs", rhs = "Genre=Drama"), control = list(verbose = FALSE))
# Plot rules as a graph (the unsupported `type` control parameter is dropped)
plot(rules, method = "graph")

# The reverse: run the apriori algorithm with Genre=Drama fixed as the lhs item
rules1 <- apriori(subset_transactions, 
                 parameter = list(supp = 0.05, conf = 0.05), 
                 appearance = list(default = "rhs", lhs = "Genre=Drama"), 
                 control = list(verbose = FALSE))
inspect(rules1)
##     lhs              rhs                        support    confidence coverage 
## [1] {Genre=Drama} => {Budget_Category=Low}      0.05246645 0.2246269  0.2335716
## [2] {Genre=Drama} => {Year_Category=2005-2009}  0.06153042 0.2634328  0.2335716
## [3] {Genre=Drama} => {Year_Category=2010-2017}  0.07530068 0.3223881  0.2335716
## [4] {Genre=Drama} => {Rate_Category=Good}       0.10371274 0.4440299  0.2335716
## [5] {Genre=Drama} => {Rate_Category=Normal}     0.07826390 0.3350746  0.2335716
## [6] {Genre=Drama} => {Budget_Category=Very Low} 0.15077567 0.6455224  0.2335716
##     lift      count
## [1] 0.9837285 301  
## [2] 1.0102368 353  
## [3] 0.9703779 432  
## [4] 1.2573540 595  
## [5] 0.7311994 449  
## [6] 1.1779141 865

The rules in the first set (rules) identify attributes that, when present, change the likelihood of the movie being a Drama. For example, the rule Year_Category=2010-2017 => Genre=Drama can be interpreted as: “If a movie is from the year category 2010-2017, there is a probability of about 0.23 (the rule’s confidence) that it is a Drama.”

The second set (rules1) identifies attributes that are likely to be present when the movie is a Drama. For example, the rule Genre=Drama => Budget_Category=Very Low can be interpreted as: “If a movie is a Drama, there is a probability of about 0.65 that it has a very low budget.”

The first set of rules helps estimate the probability that a movie belongs to the Drama genre based on its other features, which is useful for predictive and classification tasks. The second set, on the other hand, sheds light on the common traits of Drama movies, which helps in understanding the overall profile of Drama movies in the dataset.
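The two directions share the same support, count, and lift, and differ only in confidence. The sketch below shows this for the pair Genre=Drama and Budget_Category=Very Low, using counts derived from the outputs above (1340 Dramas and 3144 ‘Very Low’ budget movies, each obtained as coverage times 5737). The variable names are illustrative.

```r
n          <- 5737  # total number of transactions
n_drama    <- 1340  # derived: coverage of {Genre=Drama} (0.2335716) times n
n_very_low <- 3144  # derived: coverage of {Budget_Category=Very Low} (0.5480216) times n
n_both     <- 865   # count reported for both rules

# Confidence depends on which side is the antecedent
conf_to_drama   <- n_both / n_very_low  # {Budget_Category=Very Low} => {Genre=Drama}
conf_from_drama <- n_both / n_drama     # {Genre=Drama} => {Budget_Category=Very Low}

# Lift is symmetric: it is the same for both directions
lift_either <- (n_both / n) / ((n_drama / n) * (n_very_low / n))

round(c(conf_to_drama = conf_to_drama,
        conf_from_drama = conf_from_drama,
        lift = lift_either), 4)
```

This is why a Drama is quite likely to have a very low budget (about 0.65), while a very low budget movie is much less likely to be a Drama (about 0.28): very low budget movies are simply far more numerous than Dramas.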

#Run apriori algorithm with specific rhs Genre
rules2 <- apriori(subset_transactions, 
                 parameter = list(supp = 0.05, conf = 0.05), 
                 appearance = list(default = "lhs", rhs = "Genre=Action"), 
                 control = list(verbose = FALSE))
inspect(rules2)
##     lhs                           rhs            support    confidence
## [1] {Year_Category=2010-2017}  => {Genre=Action} 0.05874150 0.1768101 
## [2] {Rate_Category=Good}       => {Genre=Action} 0.05281506 0.1495558 
## [3] {Rate_Category=Normal}     => {Genre=Action} 0.09063971 0.1977938 
## [4] {Budget_Category=Very Low} => {Genre=Action} 0.06902562 0.1259542 
##     coverage  lift      count
## [1] 0.3322294 1.0425071 337  
## [2] 0.3531462 0.8818104 303  
## [3] 0.4582534 1.1662315 520  
## [4] 0.5480216 0.7426508 396

These Apriori results show the association between certain attributes and the ‘Action’ genre. Action movies are slightly more common than average among movies from 2010-2017 (lift 1.04) and among ‘Normal’ rated movies (lift 1.17). However, ‘Good’ rated movies (lift 0.88) and ‘Very Low’ budget movies (lift 0.74) are less likely to be ‘Action’ movies compared to the overall dataset.
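Whether an attribute makes the Action genre more or less likely than average depends on whether its lift is above or below 1. As a small sketch, the lift values printed by `inspect(rules2)` can be copied into a data frame and labelled accordingly. The data frame and the column name `vs_baseline` are introduced here only for illustration.

```r
# Lift values copied from the inspect(rules2) output above
rules2_tbl <- data.frame(
  lhs  = c("Year_Category=2010-2017", "Rate_Category=Good",
           "Rate_Category=Normal", "Budget_Category=Very Low"),
  lift = c(1.0425071, 0.8818104, 1.1662315, 0.7426508)
)

# Lift > 1: Action is over-represented; lift < 1: under-represented
rules2_tbl$vs_baseline <- ifelse(rules2_tbl$lift > 1,
                                 "more Action than average",
                                 "less Action than average")

# Strongest positive association first
rules2_tbl[order(rules2_tbl$lift, decreasing = TRUE), ]
```

On the rules object itself, a similar ordering should be available with `inspect(sort(rules2, by = "lift"))`.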

#Run apriori algorithm with specific rhs budget
rules3 <- apriori(subset_transactions, 
                 parameter = list(supp = 0.07, conf = 0.09), 
                 appearance = list(default = "lhs", rhs = "Budget_Category=Very Low"), 
                 control = list(verbose = FALSE))
inspect(rules3)
##     lhs                          rhs                           support confidence  coverage      lift count
## [1] {Year_Category=2000-2004} => {Budget_Category=Very Low} 0.08593341  0.4767892 0.1802336 0.8700189   493
## [2] {Genre=Comedy}            => {Budget_Category=Very Low} 0.11957469  0.5929127 0.2016733 1.0819148   686
## [3] {Genre=Drama}             => {Budget_Category=Very Low} 0.15077567  0.6455224 0.2335716 1.1779141   865
## [4] {Year_Category=2005-2009} => {Budget_Category=Very Low} 0.15478473  0.5935829 0.2607635 1.0831377   888
## [5] {Year_Category=2010-2017} => {Budget_Category=Very Low} 0.19626983  0.5907660 0.3322294 1.0779976  1126
## [6] {Rate_Category=Good}      => {Budget_Category=Very Low} 0.17291267  0.4896347 0.3531462 0.8934588   992
## [7] {Rate_Category=Normal}    => {Budget_Category=Very Low} 0.26355238  0.5751236 0.4582534 1.0494543  1512
## [8] {Rate_Category=Normal,                                                                                 
##      Year_Category=2005-2009} => {Budget_Category=Very Low} 0.07442914  0.6161616 0.1207948 1.1243382   427
## [9] {Rate_Category=Normal,                                                                                 
##      Year_Category=2010-2017} => {Budget_Category=Very Low} 0.10336413  0.6552486 0.1577480 1.1956620   593
library(arulesViz)

# Plot the rules
plot(rules3, method = "paracoord", measure = "lift")

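The Apriori results indicate that certain genres and year categories are associated with the ‘Very Low’ budget category, although the lifts are modest (between about 1.05 and 1.20). Drama (lift 1.18) and Comedy (lift 1.08) movies, as well as movies from 2005-2017, tend to be produced with very low budgets. ‘Normal’ rated movies are also associated with very low budgets, most clearly those released in 2010-2017 (lift 1.20). In contrast, movies from 2000-2004 and ‘Good’ rated movies have lifts below 1, so they are less likely than average to have a very low budget.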

# Perform association rule mining
rulesall <- apriori(subset_transactions, 
                     parameter = list(supp = 0.08, conf = 0.40), 
                     control = list(verbose = FALSE))
inspect(rulesall)
##      lhs                            rhs                           support confidence  coverage      lift count
## [1]  {Genre=Action}              => {Rate_Category=Normal}     0.09063971  0.5344296 0.1696008 1.1662315   520
## [2]  {Year_Category=2000-2004}   => {Rate_Category=Normal}     0.08262158  0.4584139 0.1802336 1.0003502   474
## [3]  {Year_Category=2000-2004}   => {Budget_Category=Very Low} 0.08593341  0.4767892 0.1802336 0.8700189   493
## [4]  {Genre=Comedy}              => {Rate_Category=Normal}     0.10876765  0.5393258 0.2016733 1.1769161   624
## [5]  {Genre=Comedy}              => {Budget_Category=Very Low} 0.11957469  0.5929127 0.2016733 1.0819148   686
## [6]  {Budget_Category=Low}       => {Rate_Category=Normal}     0.09935506  0.4351145 0.2283423 0.9495062   570
## [7]  {Genre=Drama}               => {Rate_Category=Good}       0.10371274  0.4440299 0.2335716 1.2573540   595
## [8]  {Genre=Drama}               => {Budget_Category=Very Low} 0.15077567  0.6455224 0.2335716 1.1779141   865
## [9]  {Year_Category=2005-2009}   => {Rate_Category=Normal}     0.12079484  0.4632353 0.2607635 1.0108714   693
## [10] {Year_Category=2005-2009}   => {Budget_Category=Very Low} 0.15478473  0.5935829 0.2607635 1.0831377   888
## [11] {Year_Category=2010-2017}   => {Rate_Category=Normal}     0.15774795  0.4748164 0.3322294 1.0361436   905
## [12] {Year_Category=2010-2017}   => {Budget_Category=Very Low} 0.19626983  0.5907660 0.3322294 1.0779976  1126
## [13] {Rate_Category=Good}        => {Budget_Category=Very Low} 0.17291267  0.4896347 0.3531462 0.8934588   992
## [14] {Rate_Category=Normal}      => {Budget_Category=Very Low} 0.26355238  0.5751236 0.4582534 1.0494543  1512
## [15] {Budget_Category=Very Low}  => {Rate_Category=Normal}     0.26355238  0.4809160 0.5480216 1.0494543  1512
## [16] {Rate_Category=Normal,                                                                                   
##       Year_Category=2010-2017}   => {Budget_Category=Very Low} 0.10336413  0.6552486 0.1577480 1.1956620   593
## [17] {Budget_Category=Very Low,                                                                               
##       Year_Category=2010-2017}   => {Rate_Category=Normal}     0.10336413  0.5266430 0.1962698 1.1492396   593
plot(rulesall, measure = c("support", "confidence"), main = "Support vs. Confidence")

# Matrix plot of lift (the unsupported `max.levels` control parameter is dropped)
plot(rulesall, measure = "lift", method = "matrix", main = "Lift by Genre")
## Itemsets in Antecedent (LHS)
##  [1] "{Genre=Drama}"                                     
##  [2] "{Rate_Category=Normal,Year_Category=2010-2017}"    
##  [3] "{Genre=Action}"                                    
##  [4] "{Budget_Category=Very Low,Year_Category=2010-2017}"
##  [5] "{Genre=Comedy}"                                    
##  [6] "{Year_Category=2010-2017}"                         
##  [7] "{Rate_Category=Normal}"                            
##  [8] "{Budget_Category=Very Low}"                        
##  [9] "{Year_Category=2005-2009}"                         
## [10] "{Budget_Category=Low}"                             
## [11] "{Year_Category=2000-2004}"                         
## [12] "{Rate_Category=Good}"                              
## Itemsets in Consequent (RHS)
## [1] "{Budget_Category=Very Low}" "{Rate_Category=Normal}"    
## [3] "{Rate_Category=Good}"

# Graph of all rules (the unsupported `type` control parameter is dropped)
plot(rulesall, method = "graph")

plot(rulesall, method = "paracoord", measure = "lift")

Analysis and conclusion

The findings reveal connections between genre, year category, budget, and rating in movies:

Action and Comedy genres tend to be associated with the ‘Normal’ rate category (lifts of about 1.17 and 1.18). Movies from the 2000s appear in rules with both ‘Normal’ ratings and ‘Very Low’ budgets, but the lifts are close to 1, so these links are weak. Dramas are more likely than average to receive ‘Good’ ratings (lift 1.26) despite mostly having very low budgets (confidence 0.65). ‘Normal’ ratings are also connected to very low budgets, especially for movies released in 2010-2017 (lift 1.20). This indicates that many low-budget films with moderate ratings were produced during this period.

In summary, genre and release year are associated with a movie’s budget and rating. Action and Comedy films generally receive ‘Normal’ ratings, while Dramas often achieve ‘Good’ ratings despite working with limited budgets. The period 2010-2017 stands out for its production of very low budget movies that received ‘Normal’ ratings. These associations can serve as insights for film production and marketing strategies within the industry.

# Check if the association rules are maximal (not subsumed by any other rule)
is.maximal(rulesall)
##                                     {Genre=Action,Rate_Category=Normal} 
##                                                                    TRUE 
##                          {Rate_Category=Normal,Year_Category=2000-2004} 
##                                                                    TRUE 
##                      {Budget_Category=Very Low,Year_Category=2000-2004} 
##                                                                    TRUE 
##                                     {Genre=Comedy,Rate_Category=Normal} 
##                                                                    TRUE 
##                                 {Genre=Comedy,Budget_Category=Very Low} 
##                                                                    TRUE 
##                              {Budget_Category=Low,Rate_Category=Normal} 
##                                                                    TRUE 
##                                        {Genre=Drama,Rate_Category=Good} 
##                                                                    TRUE 
##                                  {Genre=Drama,Budget_Category=Very Low} 
##                                                                    TRUE 
##                          {Rate_Category=Normal,Year_Category=2005-2009} 
##                                                                    TRUE 
##                      {Budget_Category=Very Low,Year_Category=2005-2009} 
##                                                                    TRUE 
##                          {Rate_Category=Normal,Year_Category=2010-2017} 
##                                                                   FALSE 
##                      {Budget_Category=Very Low,Year_Category=2010-2017} 
##                                                                   FALSE 
##                           {Budget_Category=Very Low,Rate_Category=Good} 
##                                                                    TRUE 
##                         {Budget_Category=Very Low,Rate_Category=Normal} 
##                                                                   FALSE 
##                         {Budget_Category=Very Low,Rate_Category=Normal} 
##                                                                   FALSE 
## {Budget_Category=Very Low,Rate_Category=Normal,Year_Category=2010-2017} 
##                                                                    TRUE 
## {Budget_Category=Very Low,Rate_Category=Normal,Year_Category=2010-2017} 
##                                                                    TRUE

In the output above, most of the rules (13 of 17) are identified as maximal, as indicated by the TRUE values. A rule is maximal when its itemset (lhs and rhs combined) is not a proper subset of the itemset of any other rule in the set. The four FALSE entries are two-item itemsets contained in the three-item itemset {Budget_Category=Very Low, Rate_Category=Normal, Year_Category=2010-2017}.
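The idea behind the maximality check can be sketched in base R. The function below is a simplified illustration written for this report, not the implementation used by `is.maximal()` in arules: it marks an itemset as maximal when no other itemset in the list strictly contains it. The four itemsets are taken from the output above.

```r
# Itemsets (lhs and rhs combined) for four of the rules in the output above
itemsets <- list(
  c("Genre=Drama", "Rate_Category=Good"),
  c("Rate_Category=Normal", "Year_Category=2010-2017"),
  c("Budget_Category=Very Low", "Rate_Category=Normal"),
  c("Budget_Category=Very Low", "Rate_Category=Normal", "Year_Category=2010-2017")
)

# Hypothetical helper: TRUE when no other itemset is a proper superset
is_maximal_sketch <- function(sets) {
  vapply(seq_along(sets), function(i) {
    has_superset <- vapply(seq_along(sets), function(j) {
      i != j &&
        length(sets[[j]]) > length(sets[[i]]) &&
        all(sets[[i]] %in% sets[[j]])
    }, logical(1))
    !any(has_superset)
  }, logical(1))
}

is_maximal_sketch(itemsets)
```

The second and third itemsets are contained in the fourth, so they are not maximal, which matches the FALSE entries reported for them by `is.maximal(rulesall)`.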

#install.packages('arulesViz')
library(arulesViz)
# Keep only the high-confidence rules (confidence > 0.5)
high_conf_rules <- subset(rulesall, confidence > 0.5)
high_conf_rules
## set of 9 rules
inspect(high_conf_rules)
##     lhs                            rhs                           support confidence  coverage     lift count
## [1] {Genre=Action}              => {Rate_Category=Normal}     0.09063971  0.5344296 0.1696008 1.166231   520
## [2] {Genre=Comedy}              => {Rate_Category=Normal}     0.10876765  0.5393258 0.2016733 1.176916   624
## [3] {Genre=Comedy}              => {Budget_Category=Very Low} 0.11957469  0.5929127 0.2016733 1.081915   686
## [4] {Genre=Drama}               => {Budget_Category=Very Low} 0.15077567  0.6455224 0.2335716 1.177914   865
## [5] {Year_Category=2005-2009}   => {Budget_Category=Very Low} 0.15478473  0.5935829 0.2607635 1.083138   888
## [6] {Year_Category=2010-2017}   => {Budget_Category=Very Low} 0.19626983  0.5907660 0.3322294 1.077998  1126
## [7] {Rate_Category=Normal}      => {Budget_Category=Very Low} 0.26355238  0.5751236 0.4582534 1.049454  1512
## [8] {Rate_Category=Normal,                                                                                  
##      Year_Category=2010-2017}   => {Budget_Category=Very Low} 0.10336413  0.6552486 0.1577480 1.195662   593
## [9] {Budget_Category=Very Low,                                                                              
##      Year_Category=2010-2017}   => {Rate_Category=Normal}     0.10336413  0.5266430 0.1962698 1.149240   593

High-confidence subset

# Visualizing the subset of high conf.
plot(high_conf_rules)

plot(high_conf_rules, method = "paracoord", control = list(reorder = TRUE))

The association rules presented in this study offer concrete insights for movie producers. Movies categorized as Comedy or Drama are associated with a Very Low budget (confidence 0.59 and 0.65, lift 1.08 and 1.18), and movies in the Normal rate category are also associated with a Very Low budget (confidence 0.58, lift 1.05). Because these lifts are only slightly above 1, the patterns describe tendencies in the dataset rather than guarantees. Producers can still use them as one input when deciding on budget allocation and positioning of their movies, alongside other factors that affect the commercial success of a movie project.