In this project, we combine descriptive and predictive analytics to gather insights about the film industry. The focus of our analysis will be on linear regression models, which will be used to forecast values of the dependent variable revenue based on input values of independent variables.
Understand the movie market
Determine factors impacting box-office financial performance
Develop recommendations to increase revenue
Before we start working with data, we need to set up our
environment.
Our pre-processing and clean-up of the data is a 13-step process.
It consists of:
# Step 1: Setting the working directory where our data files are stored.
# The working directory is the location on the computer where R will look for files.
setwd("~/School/Year 4/Winter 2024-25/1. MKTG 3P98/RStudio Files/A3") # This is where our datasets are stored.
# Step 2: Loading Required Libraries
# Libraries in R are like toolkits that provide additional functions for data handling, visualization, and analysis.
# We need the following libraries for this project:
library(tidyverse) # This package includes multiple useful functions for data manipulation and visualization.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # A part of tidyverse, mainly used for filtering, summarizing, and transforming data.
library(data.table) # Helps in handling large datasets efficiently.
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(stringr) # Provides tools to manipulate text (e.g., trimming unnecessary spaces).
library(openxlsx) # Allows us to read and write Excel files.
## Warning: package 'openxlsx' was built under R version 4.4.3
library(ggplot2) # Used for creating graphs and visualizations.
library(performance)# Used for checking regression assumptions
## Warning: package 'performance' was built under R version 4.4.3
# Step 3: Loading and Inspecting Datasets
# This project includes two datasets:
# 1. Financial Data: Contains information about movie budgets and revenue.
# 2. Audience Data: Includes information such as IMDb ratings, critic scores, and Facebook likes.
# We will load both datasets before merging them into one.
financials <- read.xlsx("Movie Dataset_Financials.xlsx", sheet = "Sheet1") # Load financial dataset
audience <- read.xlsx("Movie Dataset_General Audience.xlsx", sheet = "Sheet 1") # Load audience dataset (with a space in the sheet name)
# Let's check the structure of both datasets to understand what kind of data we have.
glimpse(financials) # This function shows column names, data types, and sample values.
## Rows: 651
## Columns: 5
## $ original_title <chr> "The Departed ", "Exodus: Gods and Kings ", "Spec…
## $ `budget.(Millions)` <dbl> 1.00, 2.00, 110.00, 2.90, 190.00, 135.00, 1.00, 2…
## $ `revenue.(Millions)` <dbl> 0.009318, 0.045431, 295.238201, 0.296557, 1506.24…
## $ language <chr> "English", "English", "English", "English", "Fren…
## $ country <chr> "USA", "UK", "UK", "USA", "France", "USA", "USA",…
glimpse(audience) # Helps us understand what information is available in the audience dataset.
## Rows: 651
## Columns: 13
## $ original_title <chr> "The Departed ", "Exodus: Gods and Kings ", "Spectre "…
## $ type <chr> "Feature Film", "Feature Film", "Feature Film", "Featu…
## $ genre <chr> "Drama", "Drama", "Comedy", "Drama", "Horror", "Docume…
## $ runtime <dbl> 118, 131, 84, 97, 90, 78, 107, 130, 88, 100, 127, 110,…
## $ mpaa_rating <chr> "Unrated", "PG-13", "R", "PG", "R", "Unrated", "R", "R…
## $ imdb_rating <dbl> 2.1, 3.3, 7.6, 2.5, 7.2, 7.8, 2.1, 2.1, 7.5, 4.1, 6.8,…
## $ imdb_num_votes <dbl> 9904, 1010, 22381, 54363, 35096, 333, 9904, 122980, 88…
## $ critics_rating <chr> "Fresh", "Fresh", "Certified Fresh", "Rotten", "Certif…
## $ critics_score <dbl> 10, 43, 91, 27, 81, 91, 30, 33, 90, 56, 89, 30, 30, 45…
## $ audience_rating <chr> "Upright", "Upright", "Upright", "Upright", "Upright",…
## $ audience_score <dbl> 21, 34, 91, 23, 77, 86, 21, 31, 89, 45, 75, 21, 39, 53…
## $ best_pic_nom <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ Facebook_Likes <dbl> 48, 490, 11700, 230, 172221, 211873, 171, 355, 292000,…
# Step 4: Standardizing Column Names
# Before merging the datasets, we must ensure that column names are consistent.
# If the column names are different in the two datasets, merging will not work correctly.
financials <- financials |>
rename(
title = "original_title", # Changing "original_title" to "title" for consistency.
budget = "budget.(Millions)", # Standardizing the budget column.
revenue = "revenue.(Millions)" # Standardizing the revenue column.
)
audience <- audience |>
rename(title = "original_title") # Making sure "title" is the same in both datasets.
# Step 5: Cleaning Text Data (Trimming Spaces and Removing Duplicates)
# Sometimes, data might contain extra spaces before or after text values, which can cause issues when merging.
# We will use functions to remove these unnecessary spaces and also eliminate duplicate movie titles.
financials <- financials |>
mutate(
title = str_trim(title), # Removes spaces at the beginning and end of movie titles.
title = str_squish(title) # Removes extra spaces between words.
) |>
filter(!duplicated(title)) # Removes duplicate movie entries.
audience <- audience |>
mutate(
title = str_trim(title),
title = str_squish(title)
) |>
filter(!duplicated(title))
# Step 6: Merging the Two Datasets
# Now that the text data is cleaned, we can combine both datasets into one.
# The function `left_join()` is used to merge datasets based on a common column.
# We use "title" as the common column since it represents the movie name.
df <- left_join(audience, financials, by = "title")
# Step 7: Checking for Missing Values
# Missing values can create problems in analysis. We check for missing values in key columns.
colSums(is.na(df)) # This function counts how many missing values exist in each column.
## title type genre runtime mpaa_rating
## 0 0 0 1 0
## imdb_rating imdb_num_votes critics_rating critics_score audience_rating
## 0 0 0 0 0
## audience_score best_pic_nom Facebook_Likes budget revenue
## 0 0 0 0 0
## language country
## 1 1
# If a movie is missing key financial data (budget or revenue) or rating data (IMDb rating or critic score),
# it cannot be used for meaningful analysis. We remove such rows.
df <- df |>
filter(!is.na(budget), !is.na(revenue), !is.na(critics_score), !is.na(imdb_rating))
# Step 8: Converting Budget & Revenue to Numeric Data Type
# Sometimes, numerical values are stored as text (characters) instead of actual numbers.
# We need to make sure budget and revenue are in numeric format so that we can perform calculations.
df <- df |>
mutate(
budget = as.numeric(budget),
revenue = as.numeric(revenue)
)
# Step 9: Handling Outliers (Removing Extreme Revenue Values)
# Some movies have extremely high revenue compared to others, which may distort our analysis.
# We will remove the top 1% of revenue values to prevent misleading results.
df <- df |>
filter(budget > 0, revenue > 0, revenue < quantile(df$revenue, 0.99))
# Step 10: Creating New Variables
# In addition to the existing columns, we create new variables that can help in analysis.
# 1. Profit Calculation
# Profit is calculated as the difference between revenue and budget.
df <- df |> mutate(profit = revenue - budget)
# 2. Adjusting Critics Score Scale
# IMDb ratings are based on a 10-point scale, while critic scores use a 100-point scale.
# To make them comparable, we convert the critic score to a 10-point scale.
df <- df |> mutate(critics_score_scaled = critics_score / 10)
# 3. Calculating Average Rating
# To combine IMDb ratings and critic scores into a single metric, we take the average.
df <- df |> mutate(avg_rating = (critics_score_scaled + imdb_rating) / 2)
# 4. Calculating Return on Investment (ROI)
# ROI tells us how profitable a movie was compared to its budget.
# The formula is: (Profit / Budget) * 100
df <- df |> mutate(ROI = (profit / budget) * 100)
# Step 11: Grouping Movies into Broad Genre Categories
# The dataset has multiple genres, so we will simplify them into broader categories.
df <- df |>
mutate(
genre_category = case_when(
str_detect(genre, "Action|Adventure") ~ "Action/Adventure",
str_detect(genre, "Comedy") ~ "Comedy",
str_detect(genre, "Drama|Romance") ~ "Drama",
str_detect(genre, "Horror|Thriller") ~ "Horror/Thriller",
str_detect(genre, "Sci-Fi|Fantasy") ~ "Sci-Fi/Fantasy",
TRUE ~ "Other"
)
)
# Step 12: Saving the Cleaned Dataset
# The cleaned data will be saved as a CSV file so that it can be used for further analysis.
write.csv(df, "Cleaned_Movie_Data.csv", row.names = FALSE)
# Step 13: Final Check & Confirmation Message
# Finally, we check the structure of the cleaned dataset and print a message confirming that the process is complete.
glimpse(df) # Shows the final dataset structure.
## Rows: 637
## Columns: 22
## $ title <chr> "The Departed", "Exodus: Gods and Kings", "Spectr…
## $ type <chr> "Feature Film", "Feature Film", "Feature Film", "…
## $ genre <chr> "Drama", "Drama", "Comedy", "Drama", "Documentary…
## $ runtime <dbl> 118, 131, 84, 97, 78, 107, 130, 100, 127, 110, 11…
## $ mpaa_rating <chr> "Unrated", "PG-13", "R", "PG", "Unrated", "R", "R…
## $ imdb_rating <dbl> 2.1, 3.3, 7.6, 2.5, 7.8, 2.1, 2.1, 4.1, 6.8, 2.1,…
## $ imdb_num_votes <dbl> 9904, 1010, 22381, 54363, 333, 9904, 122980, 739,…
## $ critics_rating <chr> "Fresh", "Fresh", "Certified Fresh", "Rotten", "F…
## $ critics_score <dbl> 10, 43, 91, 27, 91, 30, 33, 56, 89, 30, 30, 45, 3…
## $ audience_rating <chr> "Upright", "Upright", "Upright", "Upright", "Upri…
## $ audience_score <dbl> 21, 34, 91, 23, 86, 21, 31, 45, 75, 21, 39, 53, 2…
## $ best_pic_nom <chr> "no", "no", "no", "no", "no", "no", "no", "no", "…
## $ Facebook_Likes <dbl> 48, 490, 11700, 230, 211873, 171, 355, 3611, 1244…
## $ budget <dbl> 1.00, 2.00, 110.00, 2.90, 135.00, 1.00, 2.00, 2.0…
## $ revenue <dbl> 0.009318, 0.045431, 295.238201, 0.296557, 532.950…
## $ language <chr> "English", "English", "English", "English", "Engl…
## $ country <chr> "USA", "UK", "UK", "USA", "USA", "USA", "USA", "U…
## $ profit <dbl> -0.990682, -1.954569, 185.238201, -2.603443, 397.…
## $ critics_score_scaled <dbl> 1.0, 4.3, 9.1, 2.7, 9.1, 3.0, 3.3, 5.6, 8.9, 3.0,…
## $ avg_rating <dbl> 1.55, 3.80, 8.35, 2.60, 8.45, 2.55, 2.70, 4.85, 7…
## $ ROI <dbl> -99.06820, -97.72845, 168.39836, -89.77390, 294.7…
## $ genre_category <chr> "Drama", "Drama", "Comedy", "Drama", "Other", "Dr…
cat("Data preparation is complete. The cleaned dataset has been saved as 'Cleaned_Movie_Data.csv'\n")
## Data preparation is complete. The cleaned dataset has been saved as 'Cleaned_Movie_Data.csv'
To understand which factors impact box office financial performance and find ways to increase revenue, we first want to understand the relationship between key variables and revenue
# Summary statistics for key variables
summary(df[, c("revenue", "imdb_rating", "critics_score", "audience_score", "budget", "Facebook_Likes")])
## revenue imdb_rating critics_score audience_score
## Min. : 0.0093 Min. : 1.200 Min. : 5.00 Min. :13.00
## 1st Qu.: 20.6453 1st Qu.: 4.300 1st Qu.: 37.00 1st Qu.:38.00
## Median : 61.4590 Median : 6.100 Median : 54.00 Median :55.00
## Mean :124.7731 Mean : 5.808 Mean : 54.72 Mean :56.38
## 3rd Qu.:150.1661 3rd Qu.: 7.000 3rd Qu.: 72.00 3rd Qu.:76.00
## Max. :886.6868 Max. :55.000 Max. :100.00 Max. :96.00
## budget Facebook_Likes
## Min. : 0.318 Min. : 2
## 1st Qu.: 10.000 1st Qu.: 6946
## Median : 25.000 Median : 25942
## Mean : 39.962 Mean : 61230
## 3rd Qu.: 52.000 3rd Qu.: 81115
## Max. :250.000 Max. :555609
# Correlation between revenue and other variables
cor(df[, c("revenue", "imdb_rating", "critics_score", "audience_score", "budget", "Facebook_Likes")], use = "complete.obs")
## revenue imdb_rating critics_score audience_score budget
## revenue 1.0000000 0.2486711 0.4030977 0.4749107 0.6812840
## imdb_rating 0.2486711 1.0000000 0.3783054 0.4797029 0.1973535
## critics_score 0.4030977 0.3783054 1.0000000 0.7703565 0.2994714
## audience_score 0.4749107 0.4797029 0.7703565 1.0000000 0.3758337
## budget 0.6812840 0.1973535 0.2994714 0.3758337 1.0000000
## Facebook_Likes 0.7076950 0.1407699 0.2403084 0.3050564 0.4645446
## Facebook_Likes
## revenue 0.7076950
## imdb_rating 0.1407699
## critics_score 0.2403084
## audience_score 0.3050564
## budget 0.4645446
## Facebook_Likes 1.0000000
##Through this, we can see which variables are most strongly associated with revenue and any other relationships that can assist in marketing strategies, budget, and overall performance.
##Facebook Likes, budget, and audience scores have the strongest correlation with revenue, which means these are the key drivers we can focus on. Other relationships that have a strong positive correlation include critics scores vs. audience scores, and budget vs. Facebook Likes
Now we will create general data visualizations to get insights regarding the key relationships we have defined above
#Revenue vs. Facebook Likes
ggplot(df, aes(x = Facebook_Likes, y = revenue)) +
geom_point(alpha = 0.6, color = "blue") +
labs(title = "Revenue vs. Facebook Likes", x = "Facebook Likes", y = "Revenue")
#Revenue vs. Budget
ggplot(df, aes(x = budget, y = revenue)) +
geom_point(alpha = 0.6, color = "green") +
labs(title = "Revenue vs. Budget", x = "Budget", y = "Revenue")
#Revenue vs. Audience scores
ggplot(df, aes(x = audience_score, y = revenue)) +
geom_point(alpha = 0.6, color = "purple") +
labs(title = "Revenue vs. Audience Score", x = "Audience Score", y = "Revenue")
#Critics scores vs. Audience scores
ggplot(df, aes(x = audience_score, y = critics_score)) +
geom_point(alpha = 0.6, color = "orange") +
labs(title = "Critics scores vs. Audience Score", x = "Audience Score", y = "Critics Scores")
#Budget vs. Facebook Likes
ggplot(df, aes(x = Facebook_Likes, y = budget)) +
geom_point(alpha = 0.6, color = "purple") +
labs(title = "Budget vs. Facebook Likes", x = "Facebook Likes", y = "Budget")
Now we will create histograms to compare the distribution of revenue across the different groups of our key drivers.
#1. Revenue by Facebook Likes
df %>%
mutate(Facebook_Likes_Group = cut(Facebook_Likes, breaks = c(0, 1000, 10000, 100000, Inf))) %>%
## Box Plot: Revenue by Facebook Likes Group
ggplot(aes(x = Facebook_Likes_Group, y = revenue, fill = Facebook_Likes_Group)) +
geom_boxplot() +
labs(title = "Revenue by Facebook Likes Group", x = "Facebook Likes Group", y = "Revenue")
#2. Revenue by Budget
df %>%
mutate(Budget_Group = cut(budget, breaks = c(0, 50, 100, 150, 200, 250, Inf))) %>%
## Box plot: Revenue by Budget Group
ggplot(aes(x = Budget_Group, y = revenue, fill = Budget_Group)) +
geom_boxplot() +
labs(title = "Revenue by Budget Group", x = "Budget Group", y = "Revenue")
#3. Revenue by Audience Scores
df %>%
mutate(Audience_Score_Group = cut(audience_score, breaks = c(0, 50, 70, 90, 100))) %>%
##Box Plot: Revenue by Audience Score Group
ggplot(aes(x = Audience_Score_Group, y = revenue, fill = Audience_Score_Group)) +
geom_boxplot() +
labs(title = "Revenue by Audience Score Group", x = "Audience Score Group", y = "Revenue")
Additionally, we want to see if best picture nominations have an impact on the revenue and profits which would result in different strategies.
# Calculate average revenue by genre and Best Picture nomination
genre_revenue <- df %>%
group_by(genre_category, best_pic_nom) %>%
summarise(avg_revenue = mean(revenue, na.rm = TRUE))
## `summarise()` has grouped output by 'genre_category'. You can override using
## the `.groups` argument.
# Bar chart: Average Revenue by Genre and Best Picture Nomination
ggplot(genre_revenue, aes(x = genre_category, y = avg_revenue, fill = best_pic_nom)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Genre", y = "Average Revenue", fill = "Best Picture Nomination",
title = "Average Revenue by Genre and Best Picture Nomination") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We have a general idea of our revenue drivers, next we want to answer the specific questions asked by the producers
#1. Do some genres generate more revenue?
## Average revenue by genre
df %>%
group_by(genre_category) %>%
summarise(avg_revenue = mean(revenue, na.rm = TRUE)) %>%
arrange(desc(avg_revenue))
## # A tibble: 6 × 2
## genre_category avg_revenue
## <chr> <dbl>
## 1 Action/Adventure 129.
## 2 Horror/Thriller 129.
## 3 Drama 128.
## 4 Other 126.
## 5 Comedy 113.
## 6 Sci-Fi/Fantasy 67.2
## Box plot: Revenue by Genre
ggplot(df, aes(x = genre_category, y = revenue, fill = genre_category)) +
geom_boxplot() +
labs(title = "Revenue by Genre", x = "Genre", y = "Revenue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#2. Do critics have a typical preference for some particular genres?
## Average critics score by genre
df %>%
group_by(genre_category) %>%
summarise(avg_critics_score = mean(critics_score, na.rm = TRUE)) %>%
## Bar chart: Average critics score by Genre
ggplot(aes(x = genre_category, y = avg_critics_score, fill = genre_category)) +
geom_bar(stat = "identity") +
labs(title = "Average Critics Score by Genre", x = "Genre", y = "Average Critics Score") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#3. Is a critic score for a movie more important than the audience score.
##First we get the correlation of the critic scores with the revenue
cor(df$critics_score, df$revenue, use = "complete.obs")
## [1] 0.4030977
##Critics score has a moderately positive correlation with revenue - 0.4030977
##Correlation between the audience score with the revenue
cor(df$audience_score, df$revenue, use = "complete.obs")
## [1] 0.4749107
##Audience scores has a slightly stronger positive correlation with revenue - 0.4749107
We want to leverage data from past movie releases to forecast the performance of future releases
Specifically, we are interested in seeing how different independent variables can predict the dependent variable of revenue. These analyses will enable the studio to anticipate how its films will perform as well as to proactively influence films towards commercial success
We will use multiple linear regression to build a predictive model
We begin by identifying categorical variables in the dataset and formatting them so that they can be included in the regression analysis
##These are the non-numeric variables which are classified into discrete categories / groups.
##We format the variables as factors except in the case of best_pic_nom (where we simply code "yes" as 1 and "no" as 0)
str(df)
## 'data.frame': 637 obs. of 22 variables:
## $ title : chr "The Departed" "Exodus: Gods and Kings" "Spectre" "The X Files" ...
## $ type : chr "Feature Film" "Feature Film" "Feature Film" "Feature Film" ...
## $ genre : chr "Drama" "Drama" "Comedy" "Drama" ...
## $ runtime : num 118 131 84 97 78 107 130 100 127 110 ...
## $ mpaa_rating : chr "Unrated" "PG-13" "R" "PG" ...
## $ imdb_rating : num 2.1 3.3 7.6 2.5 7.8 2.1 2.1 4.1 6.8 2.1 ...
## $ imdb_num_votes : num 9904 1010 22381 54363 333 ...
## $ critics_rating : chr "Fresh" "Fresh" "Certified Fresh" "Rotten" ...
## $ critics_score : num 10 43 91 27 91 30 33 56 89 30 ...
## $ audience_rating : chr "Upright" "Upright" "Upright" "Upright" ...
## $ audience_score : num 21 34 91 23 86 21 31 45 75 21 ...
## $ best_pic_nom : chr "no" "no" "no" "no" ...
## $ Facebook_Likes : num 48 490 11700 230 211873 ...
## $ budget : num 1 2 110 2.9 135 1 2 2 245 3.15 ...
## $ revenue : num 9.32e-03 4.54e-02 2.95e+02 2.97e-01 5.33e+02 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "UK" "UK" "USA" ...
## $ profit : num -0.991 -1.955 185.238 -2.603 397.951 ...
## $ critics_score_scaled: num 1 4.3 9.1 2.7 9.1 3 3.3 5.6 8.9 3 ...
## $ avg_rating : num 1.55 3.8 8.35 2.6 8.45 2.55 2.7 4.85 7.85 2.55 ...
## $ ROI : num -99.1 -97.7 168.4 -89.8 294.8 ...
## $ genre_category : chr "Drama" "Drama" "Comedy" "Drama" ...
df <- df %>%
mutate(
type = as.factor(type),
genre_category = as.factor(genre_category),
mpaa_rating = as.factor(mpaa_rating),
critics_rating = as.factor(critics_rating),
audience_rating = as.factor(audience_rating),
language = as.factor(language),
country = as.factor(country),
best_pic_nom = ifelse(best_pic_nom == "yes", 1, 0)
)
str(df)
## 'data.frame': 637 obs. of 22 variables:
## $ title : chr "The Departed" "Exodus: Gods and Kings" "Spectre" "The X Files" ...
## $ type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 1 2 2 2 2 2 ...
## $ genre : chr "Drama" "Drama" "Comedy" "Drama" ...
## $ runtime : num 118 131 84 97 78 107 130 100 127 110 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 6 4 5 3 6 5 5 5 3 3 ...
## $ imdb_rating : num 2.1 3.3 7.6 2.5 7.8 2.1 2.1 4.1 6.8 2.1 ...
## $ imdb_num_votes : num 9904 1010 22381 54363 333 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ critics_score : num 10 43 91 27 91 30 33 56 89 30 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 2 2 2 2 2 2 ...
## $ audience_score : num 21 34 91 23 86 21 31 45 75 21 ...
## $ best_pic_nom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Facebook_Likes : num 48 490 11700 230 211873 ...
## $ budget : num 1 2 110 2.9 135 1 2 2 245 3.15 ...
## $ revenue : num 9.32e-03 4.54e-02 2.95e+02 2.97e-01 5.33e+02 ...
## $ language : Factor w/ 5 levels "English","French",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ country : Factor w/ 13 levels "Australia","Belgium",..: 13 12 12 13 13 13 13 13 13 13 ...
## $ profit : num -0.991 -1.955 185.238 -2.603 397.951 ...
## $ critics_score_scaled: num 1 4.3 9.1 2.7 9.1 3 3.3 5.6 8.9 3 ...
## $ avg_rating : num 1.55 3.8 8.35 2.6 8.45 2.55 2.7 4.85 7.85 2.55 ...
## $ ROI : num -99.1 -97.7 168.4 -89.8 294.8 ...
## $ genre_category : Factor w/ 6 levels "Action/Adventure",..: 3 3 2 3 5 3 3 3 1 3 ...
#With the categorical variables formatted as factors, they can be incorporated into the model as dummy variables
We can now choose which variables to include in our linear regression model
There are many variables in our dataset. Existing research regarding the film industry and our own judgement provide a starting point for variables to include in our model.
Amongst film characteristics discussed in the text “Predicting Movie Success at the Box Office” - see Note 1 at end of code - budget, genre, awards, critics’ reviews and audience opinions are represented within our dataset.
We will be conducting in-depth genre analyses after our initial regression model is created, so we will exclude genre from our model at this stage
We will use best_pic_nom to examine the impact of awards
We will use critics_score to examine the impact of critics’ reviews
Given the strong correlation observed between critics_score and audience_score, we will be sure to exclude audience scores from our model to avoid multicollinearity.
Instead, we will use Facebook likes to gauge audience opinions. We believe Facebook Likes may be a valuable characteristic, because it could capture early attitudes towards an unreleased film. I.e., before the public has had an opportunity to view and/or review the film
#Therefore, we test "budget, "best_pic_nom", "critics_score", and "Facebook_Likes" as independent variables in our model with revenue as a dependent variable
##We applied a model intercept of 0, as a film with no budget and no viewer response will likely not earn revenue
model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data=df)
We will also apply best rpactices and check that fundamental assumptions of linear regressions hold for this model
#1 - linear relationship exists in data #2 - constant variance #3 - normality #4 - No heteroscedasticity #5 - No multicollinearity
plot(model)
#In the first graph, dots are mostly scattered randomly around the line (Assumption 1) and there are no systemic patterns (Assumption 2)
#In the second graph, the dots mostly lie along the line (Assumption 3). There is however some positive skew present (some dots veer above the line)
#In the third graph, dots are generally along the line (Assumption 1) and the line is relatively flat (Assumption 4)
performance::check_collinearity(model) #See Note 2 at end of code
## Model has no intercept. VIFs may not be sensible.
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## budget 2.46 [2.19, 2.79] 1.57 0.41 [0.36, 0.46]
## best_pic_nom 1.04 [1.01, 1.29] 1.02 0.96 [0.78, 0.99]
## critics_score 2.18 [1.95, 2.47] 1.48 0.46 [0.41, 0.51]
## Facebook_Likes 1.91 [1.71, 2.15] 1.38 0.52 [0.47, 0.58]
#VIF values are well below 10 (Assumption 5)
#All assumption requirements are met for our model
We can now create an output for our model and interpret the results
#We use the summary function to print the regression results
summary(model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236.71 -54.01 -27.41 17.24 787.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## budget 1.513e+00 1.008e-01 15.010 < 2e-16 ***
## best_pic_nom -3.555e+01 2.090e+01 -1.701 0.089436 .
## critics_score 3.612e-01 9.442e-02 3.825 0.000144 ***
## Facebook_Likes 8.874e-04 4.845e-05 18.315 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95.9 on 633 degrees of freedom
## Multiple R-squared: 0.7878, Adjusted R-squared: 0.7865
## F-statistic: 587.6 on 4 and 633 DF, p-value: < 2.2e-16
We interpret the model results by checking the F-statistic significance, the R^2 value, the t-test significances, and the regression coefficients
1. The F-statistic has a p-value < 0.05. Therefore, We are 95% confident that there is a significant linear relationship amongst the independent variables and the revenue
2. The R^2 is 0.7878. Therefore, 78.78% of the variation in revenue can be explained by the independent variables
3. The t-test significances for budget, critics_score and Facebook_Likes are < 0.05. Therefore we are 95% confident that these variables have a significant impact on revenue. best_pic_nom, however, has a t-test value > 0.05, so there is insufficient evidence to suggest that Best Picture nominations have a strong impact on revenue
Caution should be taken when predicting revenues based on Best Picture Nominations
4. Looking at the magnitudes of the regression coefficients (the estimate column), best_pic_nom has the largest impact on revenue. There is a negative relationship between these variables, as (all else equal) the presence of a Best Picture Nomination reduces the expected revenue. Following best_pic_nom, the next largest impacts on revenue come from budget, critics_score, and Facebook_Likes respectively. There is a positive relationship between these other variables and revenue, with increases in these values contributing to an increase in expected revenue.
We will now use our model to predict revenue for the company’s recently-screened movie premiere and for the related campaign See Note 3 at end of code
##The premiere garnered a critics score of 55 and 1,250 Facebook likes. We will create a dataset with these values to input them into our model, holding all other variables constant at 0
df_premiere <- data.frame(budget = 0, best_pic_nom = 0, critics_score = 55, audience_score = 0, Facebook_Likes = 1250)
premiere_prediction <- predict(model, df_premiere)
print(premiere_prediction)
## 1
## 20.97317
#Using the movie premiere data, we estimate that this film will earn revenues of about $21 million.
#We can also use the model to predict the effects of the advertising campaign
##The campaign is expected to require a budget of $20 million and increase facebook likes by 10,000% (*100)
df_campaign <- data.frame(budget = 20, best_pic_nom = 0, critics_score = 55, audience_score = 0, Facebook_Likes = 1250 * 100)
campaign_prediction <- predict(model, df_campaign)
print(campaign_prediction) #Revenue
## 1
## 161.0358
print(campaign_prediction - premiere_prediction) #Increase in revenue compared to non-campaign (df_premiere) prediction
## 1
## 140.0626
print((campaign_prediction-premiere_prediction)/premiere_prediction*100) #Percentage change in revenue (campaign vs non-campaign)
## 1
## 667.8179
#Using the campaign data as well as the existing premiere data, we estimate that this film will earn revenues of about $161 million.
#This would represent an increase of about $140 million, or a 668% increase
We will also consider what the profit and ROI will look like for this film
print(campaign_prediction - df_campaign$budget) #Profit
## 1
## 141.0358
print((campaign_prediction/df_campaign$budget)*100) #ROI
## 1
## 805.1789
##This film is expected to yield a profit of $141 million and an ROI of 805%
a<-ggplot(df, aes(x = budget, y = profit)) +
geom_point(alpha = 0.6, color = "lightblue") +
labs(title = "Budget vs. Profit (mean profit and predictions)", x = "Budget", y = "Profit")
a + geom_hline(yintercept=mean(df$profit), linetype="dashed", color = "blue") +
geom_hline(yintercept=premiere_prediction, linetype = "twodash", color="purple")+
geom_hline(yintercept=campaign_prediction) +
geom_point(aes(20,campaign_prediction), color="red")
## Warning in geom_point(aes(20, campaign_prediction), color = "red"): All aesthetics have length 1, but the data has 637 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
mean(df$profit)
## [1] 84.81117
mean(df$ROI)
## [1] 663.8348
#The mean profit value within the movie dataset (df) is $84 million and the mean ROI is 663%
Overall, the campaign is expected to result in a significant increase in revenue compared to the non-campaign prediction ($140 million more, a 668% increase). It is expected to yield a profit of $141 million and an ROI of 805%. These values are higher than the average industry profits and ROIs of $84 million and 663%.
Therefore we believe this campaign would be a beneficial investment for the company to pursue.
#Predictive Analysis - Genre Breakdown To conduct analysis on each genre, we will filter the data and create separate datasets for each genre to run our regression models. Based on this, we will then:
Identify top 2 revenue-driving factors in each genre
Create comparative visualizations
#Step 1: Filter the data in df by the genre categories
genre_action_adventure <- df %>% filter(genre_category == "Action/Adventure")
genre_comedy <- df %>% filter(genre_category == "Comedy")
genre_drama <- df %>% filter(genre_category == "Drama")
genre_horror_thriller <- df %>% filter(genre_category == "Horror/Thriller")
genre_scifi_fantasy <- df %>% filter(genre_category == "Sci-Fi/Fantasy")
genre_other <- df %>% filter(genre_category == "Other")
#Step 2: Run our model for each genre dataset
Action/Adventure:
##Genre: Action/Adventure
action_ad_model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data = genre_action_adventure)
summary(action_ad_model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = genre_action_adventure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -159.14 -58.74 -27.56 -0.37 423.57
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## budget 1.7434896 0.2825726 6.170 5.73e-08 ***
## best_pic_nom NA NA NA NA
## critics_score 0.5847091 0.3106641 1.882 0.064514 .
## Facebook_Likes 0.0006181 0.0001624 3.806 0.000326 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.69 on 62 degrees of freedom
## Multiple R-squared: 0.7895, Adjusted R-squared: 0.7793
## F-statistic: 77.49 on 3 and 62 DF, p-value: < 2.2e-16
#We will now interpret the model results -
#The coefficients (estimate) column shows us the estimated impact on revenue (in millions) and if the p-value < 0.05, the factor is statistically significant.
##1. Budget: Coefficient = 1.74, p-value = 5.73e-08 (< 0.05)
##2. Critics score: Coefficient = 0.58, p-value = 0.064514 (> 0.05) ~ not significant at the 95% confidence level
##3. Facebook Likes: Coefficient = 0.0006181, p-value = 0.000326 (<0.05)
#Based on these results, the top 2 revenue driving factors are Budget and Facebook likes as they have a significant impact on revenue.
##Note: best_pic_nom does not provide results as there is little variation in the data to be considered reliable.
Comedy
##Genre: Comedy
comedy_model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data = genre_comedy)
summary(comedy_model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = genre_comedy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.08 -49.26 -21.48 18.07 266.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## budget 1.997e+00 2.563e-01 7.793 1.92e-11 ***
## best_pic_nom -3.338e+01 6.877e+01 -0.485 0.6287
## critics_score 4.845e-01 2.271e-01 2.134 0.0359 *
## Facebook_Likes 5.776e-04 1.077e-04 5.363 7.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.47 on 81 degrees of freedom
## Multiple R-squared: 0.8544, Adjusted R-squared: 0.8472
## F-statistic: 118.8 on 4 and 81 DF, p-value: < 2.2e-16
#We will now interpret the model results -
#The coefficients (estimate) column shows us the estimated impact on revenue (in millions) and if the p-value < 0.05, the factor is statistically significant.
##1. Budget: Coefficient = 1.997, p-value = 1.92e-11 (< 0.05)
##2. Best Picture Nomination: Coefficient = -3.338e+01, p-value = 0.6287 (>0.05) ~ not significant at the 95% confidence level
##3. Critics score: Coefficient = 0.4845, p-value = 0.0359 (< 0.05)
##4. Facebook Likes: Coefficient = 0.0005776, p-value = 7.58e-07 (<0.05)
#Based on these results, the top 2 revenue driving factors are Facebook Likes and Budget as they have a significant impact on revenue.
##Note: While critics score has a positive coefficient and a statistically significant p-value, its impact is smaller compared to the other factors.
Drama
##Genre: Drama
drama_model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data = genre_drama)
summary(drama_model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = genre_drama)
##
## Residuals:
## Min 1Q Median 3Q Max
## -220.73 -57.83 -30.97 20.20 783.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## budget 1.488e+00 1.660e-01 8.964 < 2e-16 ***
## best_pic_nom -4.517e+01 2.758e+01 -1.638 0.10254
## critics_score 4.115e-01 1.530e-01 2.689 0.00757 **
## Facebook_Likes 9.330e-04 8.176e-05 11.412 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.2 on 294 degrees of freedom
## Multiple R-squared: 0.7494, Adjusted R-squared: 0.746
## F-statistic: 219.8 on 4 and 294 DF, p-value: < 2.2e-16
#We will now interpret the model results -
#The coefficients (estimate) column shows us the estimated impact on revenue (in millions) and if the p-value < 0.05, the factor is statistically significant.
##1. Budget: Coefficient = 1.488, p-value = < 2e-16 (< 0.05)
##2. Best Picture Nomination: Coefficient = -4.517e+01, p-value = 0.10254 (>0.05) ~ not significant at the 95% confidence level
##3. Critics score: Coefficient = 0.4115, p-value = 0.00757 (< 0.05)
##4. Facebook Likes: Coefficient = 0.0009330, p-value = < 2e-16 (<0.05)
#Based on these results, the top 2 revenue driving factors are Budget and Facebook likes as they have a significant impact on revenue.
##Note: While critics score has a positive coefficient and a statistically significant p-value, its impact is smaller compared to the other factors.
Horror/Thriller
##Genre: Drama
drama_model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data = genre_drama)
summary(drama_model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = genre_drama)
##
## Residuals:
## Min 1Q Median 3Q Max
## -220.73 -57.83 -30.97 20.20 783.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## budget 1.488e+00 1.660e-01 8.964 < 2e-16 ***
## best_pic_nom -4.517e+01 2.758e+01 -1.638 0.10254
## critics_score 4.115e-01 1.530e-01 2.689 0.00757 **
## Facebook_Likes 9.330e-04 8.176e-05 11.412 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.2 on 294 degrees of freedom
## Multiple R-squared: 0.7494, Adjusted R-squared: 0.746
## F-statistic: 219.8 on 4 and 294 DF, p-value: < 2.2e-16
#We will now interpret the model results -
#The coefficients (estimate) column shows us the estimated impact on revenue (in millions) and if the p-value < 0.05, the factor is statistically significant.
##1. Budget: Coefficient = 1.488, p-value = < 2e-16 (< 0.05)
##2. Best Picture Nomination: Coefficient = -4.517e+01, p-value = 0.10254 (>0.05) ~ not significant at the 95% confidence level
##3. Critics score: Coefficient = 0.4115, p-value = 0.00757 (< 0.05)
##4. Facebook Likes: Coefficient = 0.0009330, p-value = < 2e-16 (<0.05)
#Based on these results, the top 2 revenue driving factors are Budget and Facebook likes as they have a significant impact on revenue.
##Note: While critics score has a positive coefficient and a statistically significant p-value, its impact is smaller compared to the other factors.
Genre: Sci-Fi/Fantasy
##Genre: Sci-Fi/Fantasy
scifi_fantasy_model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data = genre_scifi_fantasy)
summary(scifi_fantasy_model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = genre_scifi_fantasy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.06 -42.05 4.93 14.13 73.78
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## budget 0.8221934 0.6040700 1.361 0.222
## best_pic_nom NA NA NA NA
## critics_score 1.0493438 0.6421199 1.634 0.153
## Facebook_Likes -0.0005409 0.0012307 -0.440 0.676
##
## Residual standard error: 64.05 on 6 degrees of freedom
## Multiple R-squared: 0.733, Adjusted R-squared: 0.5995
## F-statistic: 5.491 on 3 and 6 DF, p-value: 0.03721
#We will now interpret the model results –
#The coefficients (estimate) column shows us the estimated impact on revenue (in millions) and if the p-value < 0.05, the factor is statistically significant.
##1. Budget: Coefficient = 0.822, p-value = 0.222 (> 0.05) ~ not statistically significant
##2. Best Picture Nomination: Coefficient = NA ~ not defined due to singularity
##3. Critics score: Coefficient = 1.049, p-value = 0.153 (> 0.05) ~ not statistically significant
##4. Facebook Likes: Coefficient = -0.000541, p-value = 0.676 (> 0.05) ~ not statistically significant
#Based on these results, none of the factors tested show statistically significant impact on revenue for the Sci-Fi/Fantasy genre at the 95% confidence level.
##This may be due to a smaller sample size or high variability within this genre category. While the coefficient for critics score is positive, and relatively large compared to other genres, the p-value indicates that the relationship is not statistically strong enough to draw conclusions.
Other:
##Genre: Other
other_model <- lm(revenue ~ budget + best_pic_nom + critics_score + Facebook_Likes + 0, data = genre_other)
summary(other_model)
##
## Call:
## lm(formula = revenue ~ budget + best_pic_nom + critics_score +
## Facebook_Likes + 0, data = genre_other)
##
## Residuals:
## Min 1Q Median 3Q Max
## -201.06 -50.61 -25.05 15.52 256.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## budget 1.233e+00 1.675e-01 7.362 1.02e-11 ***
## best_pic_nom -2.161e+01 4.221e+01 -0.512 0.609
## critics_score 2.423e-01 1.551e-01 1.562 0.120
## Facebook_Likes 9.918e-04 7.426e-05 13.356 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.04 on 154 degrees of freedom
## Multiple R-squared: 0.8461, Adjusted R-squared: 0.8421
## F-statistic: 211.7 on 4 and 154 DF, p-value: < 2.2e-16
#We will now interpret the model results –
#The coefficients (estimate) column shows us the estimated impact on revenue (in millions) and if the p-value < 0.05, the factor is statistically significant.
##1. Budget: Coefficient = 1.233, p-value = 1.02e-11 (< 0.05)
##2. Best Picture Nomination: Coefficient = -21.61, p-value = 0.609 (> 0.05) ~ not significant at the 95% confidence level
##3. Critics score: Coefficient = 0.2423, p-value = 0.120 (> 0.05) ~ not significant at the 95% confidence level
##4. Facebook Likes: Coefficient = 0.0009918, p-value = < 2e-16 (< 0.05)
#Based on these results, the top 2 revenue-driving factors are Budget and Facebook Likes as they have a significant impact on revenue.
##Note: Best Picture Nomination and Critics Score are not statistically significant, suggesting they are less relevant for predicting revenue in this genre compared to the financial and social media-related variables.
Data Visualizations
#After identifying the top two revenue-driving factors in each genre, we will create a data frame for the top 2 factors and their coefficients.
top_factors <- data.frame(
genre = c("Action/Adventure", "Action/Adventure",
"Comedy", "Comedy",
"Drama", "Drama",
"Horror/Thriller", "Horror/Thriller",
"Other", "Other"),
factor = c("Budget", "Facebook_Likes",
"Budget", "Facebook_Likes",
"Budget", "Facebook_Likes",
"Budget", "Facebook_Likes",
"Budget", "Facebook_Likes"),
coefficient = c(1.74, 0.0006181,
1.997, 0.0005776,
1.488, 0.0009330,
2.7336502, 0.0008371,
1.233, 0.0009918)
)
# Bar chart: Top 2 Revenue-Driving Factors by Genre
ggplot(top_factors, aes(x = genre, y = coefficient, fill = factor)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Genre", y = "Coefficient (Impact on Revenue)", fill = "Factor",
title = "Top 2 Revenue-Driving Factors by Genre") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
##While this shows us the overall impact of both the factors on revenue, the Facebook Likes coefficients are much smaller than the Budget ones, which makes it difficult to visualize their impact effectively on the same scale.
##We already know budget is the most significant revenue driver across all genres, but we want to know what impact of social media (Facebook Likes) has on the revenue, with attention to the genres.
#Due to this, we have created separate bar plots for the Budget and Facebook Likes coefficients.
# Bar chart: Budget Coefficients by Genre
ggplot(top_factors %>% filter(factor == "Budget"), aes(x = genre, y = coefficient, fill = genre)) +
geom_bar(stat = "identity") +
labs(x = "Genre", y = "Budget Coefficient", fill = "Genre",
title = "Budget Coefficients by Genre") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bar chart: Facebook Likes Coefficients by Genre
ggplot(top_factors %>% filter(factor == "Facebook_Likes"), aes(x = genre, y = coefficient, fill = genre)) +
geom_bar(stat = "identity") +
labs(x = "Genre", y = "Facebook Likes Coefficient", fill = "Genre",
title = "Facebook Likes Coefficients by Genre") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
1: Gunter, B. (2018). Predicting Movie Success at the Box Office. Springer International Publishing. https://doi.org/10.1007/978-3-319-71803-3
2: [R on easystats] (2019). Check your (Mixed) Model for Multicollinearity with ‘performance’. R-bloggers. https://www.r-bloggers.com/2019/08/check-your-mixed-model-for-multicollinearity-with-performance-2/
3: Sanderson, S. [Steven P. Sanderson II, MPH] (2023). “How to Predict a Single Value Using a Regression Model in R”. R-bloggers. https://www.r-bloggers.com/2023/11/how-to-predict-a-single-value-using-a-regression-model-in-r/
4: Impact of awards on film marketing and distribution | Film Industry Class Notes | Fiveable. (n.d.). Fiveable. https://library.fiveable.me/film-industry/unit-11/impact-awards-film-marketing-distribution/study-guide/XVhcBVRAPpyeDi2W