Data 607 Final Project: IMDB Popular Horror Movie Data

Introduction

For this project I will examine factors that influence a horror movie’s financial success by analyzing a Kaggle dataset of the 10000 most popular films on IMDB and retrieving additional movie feature and performance information from OMDB API.

The Kaggle dataset: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

The Kaggle dataset contains detailed IMDb movie data, such as genre, release year, runtime, IMDB ratings, and box office revenue in the millions. I will then enhance this dataset with information using the OMDb API, which provides the original box office revenue (not in the millions), the movie’s credited writers, and full release date.

The combined dataset will enable me to evaluate patterns in movie outcomes and understand what traits are associated with a strong box office performance. The combined dataset will contain the following columns:

Rank: A movie’s popularity ranking on IMDB
Title: The movie’s title
Genre: The genres a movie falls under
Description: A movie’s plot in a few short sentences
Director: The full name of the director of the movie
Actors: The actors in the movie
Year: The year the movie was released
Runtime: The movie’s length in minutes
Rating: A movie’s IMDB rating
Votes: Number of Votes a movie received on IMDB
Box_Office_Millions: A movie’s box office revenue in millions
Metascore: The movie’s metascore
Rated: The movie’s rating
Released: The full date the movie was released
Writer: The writers credited for the movie
Awards: The awards and nominations a movie has received
BoxOffice: A movie’s box office revenue

Since our dataset contains only the most popular IMDB movies we will be analyzing only the most popular horror movies on IMDB.

We will address the following questions:

Which production, release, and audience factors are associated with higher box office performance for popular horror movies?
Which numeric features of a horror movie — runtime, metascores, IMDB rating, or votes — can reliably predict its Box Office performance?

Importing Our Dataset

Let’s begin by importing our Kaggle datset from our Github repository.

url <- "https://raw.githubusercontent.com/WendyR20/DATA-607-Final-Project/refs/heads/main/imdb_movie_dataset.csv"
movies_df <-read_csv(url)

## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Title, Genre, Description, Director, Actors
## dbl (7): Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), M...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First we must filter out all movies that are not horror movies.

horror_df <- movies_df %>%
  filter(
    str_detect(Genre, "Horror")
  )

#make nice in table
glimpse(horror_df)

## Rows: 119
## Columns: 12
## $ Rank                 <dbl> 3, 23, 28, 35, 43, 57, 62, 98, 111, 117, 119, 133…
## $ Title                <chr> "Split", "Hounds of Love", "Dead Awake", "Residen…
## $ Genre                <chr> "Horror,Thriller", "Crime,Drama,Horror", "Horror,…
## $ Description          <chr> "Three girls are kidnapped by a man with a diagno…
## $ Director             <chr> "M. Night Shyamalan", "Ben Young", "Phillip Guzma…
## $ Actors               <chr> "James McAvoy, Anya Taylor-Joy, Haley Lu Richards…
## $ Year                 <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ `Runtime (Minutes)`  <dbl> 117, 108, 99, 107, 73, 88, 86, 90, 89, 118, 104, …
## $ Rating               <dbl> 7.3, 6.7, 4.7, 5.6, 2.7, 7.2, 6.8, 5.8, 6.3, 6.2,…
## $ Votes                <dbl> 157606, 1115, 523, 46165, 496, 121103, 35870, 924…
## $ `Revenue (Millions)` <dbl> 138.12, NA, 0.01, 26.84, NA, 89.21, NA, 0.15, 10.…
## $ Metascore            <dbl> 62, 72, NA, 49, NA, 71, 65, 62, 44, 51, 76, 59, N…

Before we call the api, we must save our api key, OMDB API provides a free api with a daily limit of 1,000 calls.

#saving my API Key

omdb_key <- Sys.getenv("OMDB_KEY")

Okay, now we are ready to call our API. Since our API has a limit on the number of daily calls we can make I highly suggest running a test on a smaller subset of the data.

Acquring the Additional Data from the OMDB API

Now let’s call our API.

get_omdb_data_full <- function(title, year) {
 Sys.sleep(0.25)
  
  url <- paste0(
    "http://www.omdbapi.com/?t=",
    URLencode(title),
    "&y=", year,
    "&apikey=", omdb_key
  )
  
  response <- try(GET(url), silent = TRUE)
  
  
  content <- fromJSON(rawToChar(response$content))
  
  #Movie not found
  
  if (content$Response == "False") {
    return(tibble(
      Title = title,
      Year = year,
      API_Failure = TRUE
    ))
  }
  
 #Success
  
  content %>%
    as_tibble() %>%
    mutate(
      API_Failure = FALSE,
      QueryTitle = title,
      QueryYear = year
    )
}

omdb_horror <- list() 

for (i in seq_len(nrow(horror_df))) {
  title <- horror_df$Title[i]
  year  <- horror_df$Year[i]
  
  result <- get_omdb_data_full(title, year)
  omdb_horror[[i]] <- result
}

Now we must combine all results into a single data frame. To do this we need to fix all tibbles in the list before binding.

Challenge 1 This was a challenge I came across when I first tried to bind my rows together, the data for Year returned was of different data types, so to I ahd to convert them each to the same data type.

omdb_horror <- lapply(omdb_horror, function(x) {
  x$Year <- as.character(x$Year)
  x
})

omdb_horror <- bind_rows(omdb_horror)

Cleaning and Merging

I had to remove duplicates from my dataset based on distinct titles then distinct IMDB ids.

horror_distinct <- omdb_horror %>%
  distinct(Title, imdbID, .keep_all = TRUE)

Then, and only then can we move on and select only those columns we’re interested in for the analysis that are not already in the original dataset

Now we are able to merge our original dataset with our dataset obtained from the OMDB API.

hmovies_merge <- horror_df %>%
  left_join(
    horror_distinct,
    by = c("Title" = "Title", "Year" = "Year")
  )

Since the OMDB API has a daily limit of 1,000 calls, I saved my data once cleaned and merged as a csv and uploaded it to Github, I will include the code I used to do so below. But for the sake of this final project I will not import my cleaned and merged data in this final run-through and instead use the already cleaned and stored data above.

write_csv(x = hmovies_merge, 
          file = "omdb_merged_data.csv")

url2 <- "https://raw.githubusercontent.com/WendyR20/DATA-607-Final-Project/refs/heads/main/omdb_merged_data.cs#v"
hmovies_df2 <- read.csv(url2)

hmovies_df2  <- hmovies_merge

For ease of use we should rename some of the columns, and since I intend to use the box office revenue not alredy in millions, we need to extract the numbers from that column.

Exploring the Data

Distributions of Quantitative Variables

Let us first take a look at the distribution of the box office revenue for the popular horror movies in our dataset.

ggplot(hmovies_df2, aes(x = Box_Office_Millions)) +
  geom_histogram(bins=100,fill = "steelblue", color ="white")+
  labs(
    title = "Distribution of Popular Horror Box Office Revenue",
    x = "Box Office Revenue (in Millions)",
    y = "Count"
  )+
  theme_gray()

ggplot(hmovies_df2, aes(x = Metascore)) +
  geom_histogram(bins=50,fill = "blue", color ="white")+
  labs(
    title = "Distribution of Popular Horror Metasores",
    x = "Metascores",
    y = "Count"
  )+
  theme_gray()

ggplot(hmovies_df2, aes(x = Rating)) +
  geom_histogram(bins=50,fill = "yellow4")+
  labs(
    title = "Distribution of Popular Horror IMDB Ratings",
    x = "IMDB Rating",
    y = "Count"
  )+
  theme_minimal()

ggplot(hmovies_df2, aes(x = Runtime)) +
  geom_histogram(bins=50,fill = "purple3")+
  labs(
    title = "Distribution of Popular Horror Runtimes",
    x = "Runtime (in Minutes)",
    y = "Count"
  )+
  theme_minimal()

ggplot(hmovies_df2, aes(x = Votes)) +
  geom_histogram(bins=50, fill = "chocolate")+
  labs(
    title = "Distribution of Popular Horror IMDB User Votes",
    x = "IMDB Votes",
    y = "Count"
  )+
  theme_minimal()

We can see that our distribution is heavily right-skewed, which makes sense as we are looking at revenue in the millions and thus a movie can have made under a million and thus show up as near zero on our plot. What is interesting is that a large number of of the most popular horror movies on IMDB made less than a million dollars in revenue. We should also note the spread of the distribution, the most popular horror movies on IMDB ranged in box office revenue from under a million to over 250 million, though we must note that we have large outliers.

Box Office vs Qunatitative Variables Scatter Plots

Let us take a look at scatter plots between our numerical variables and box office revenue. We might be able to gauge if any of our numerical variables have linear relationships with box office revenue and thus could be helpful in predicting box office revenue.

#Box Office vs Runtime
splot1 <- ggplot(hmovies_df2, aes(x = Runtime, y = Box_Office_Millions )) +
  geom_point(color = "red")+
  #geom_jitter() +
  geom_smooth(method = "lm")+
  labs(
    title = "Popular Horror Movies: Runtime vs Box Office Revenue",
    x = "Runtime (in Minutes)",
    y = "Box Office Revenue (in Millions) "
  )+
  theme_minimal()

#Box Office vs IMDB Ratings

splot2 <- ggplot(hmovies_df2, aes(x = Rating, y = Box_Office_Millions )) +
  geom_point(color = "orange")+
  # geom_jitter() +
  geom_smooth(method = "lm")+
  labs(
    title = "Popular Horror IMDB Ratings vs Box Office Revenue",
    x = "IMDB Ratings",
    y = "Box Office Revenue (in Millions) "
  )+
  theme_minimal()

#Box Office vs Metascores

splot3 <- ggplot(hmovies_df2, aes(x = Metascore, y = Box_Office_Millions )) +
  geom_point(color = "darkgreen")+
 # geom_jitter() +
  geom_smooth(method = "lm")+
  labs(
    title = "Popular Horror MetaScores vs Box Office Revenue",
    x = "MetaScores",
    y = "Box Office Revenue (in Millions) "
  )+
  theme_minimal()

#Box Office vs Votes

splot4 <- ggplot(hmovies_df2, aes(x = Votes, y = Box_Office_Millions )) +
  geom_point(color = "orchid")+
  geom_smooth(method = "lm", se = TRUE)+
  scale_x_continuous(labels = scales::comma)+
  labs(
    title = "Popular Horror Movies: Number of IMDB User Votes \n vs Box Office Revenue",
    x = "Number of IMDB User Votes",
    y = "Box Office Revenue (in Millions)"
  )+
  theme_gray()



#Display

plots1 <- splot1 + splot2 + patchwork::plot_layout(ncol = 2)
plots1

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

plots2 <- splot3 + splot4 + patchwork::plot_layout(ncol = 2)

plots2

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

We can see from our plots that neither runtime nor IMDB ratings seem to have a linear relationship with box office revenue. However, when we examine the relationship between metascores and box office, it does seem possible a linear relationship exists between them. The same is true for the relationship between IMDB ratings and box office revenue. We will take a close look at these relationships later and use linear regression models to fully examine whether any of the numerical variable in our dataset would make for good predictors for box office revenue.

#Votes vs Ratings
 ggplot(hmovies_df2, aes(x = Votes, y = Rating )) +
  geom_point(color = "darkblue")+
  geom_smooth(method = "lm", se = FALSE)+
  scale_x_continuous(labels = scales::comma)+
  labs(
    title = "Popular Horror IMDB Ratings \n vs Number of IMDB User Votes ",
    x = "Number of IMDB User Votes",
    y = "IMDB Ratings Rating "
  )+
  theme_gray()

## `geom_smooth()` using formula = 'y ~ x'

Box Plot of Movie Ratings

#converting Released column from character to date

hmovies_df2$Released <- dmy(hmovies_df2$Released)

class(hmovies_df2$Released)

## [1] "Date"

ggplot(hmovies_df2, aes(x=Rated, y=Box_Office_Millions, fill = Rated)) + 
  geom_boxplot(na.rm = TRUE,
               outlier.colour="red", 
               outlier.shape = 3 
  ) +
  labs( 
    title = "Box Office Revenue by Movie Ratings",
    x = "Rating",
    y = "Box Office Revenue (in Millions)")

Box Office Over Time

Let us take a look at box office revenue over the time for the horror movies in our dataset.

#arranging the Year column in chronological order
hmovies_df2 <- hmovies_df2 %>% arrange(Year)
class(hmovies_df2$Year)

## [1] "numeric"

ggplot(data = hmovies_df2, aes(x = Released, y = Box_Office_Millions)) +
  geom_line(color="cyan4") +
  geom_point(color="springgreen3") + 
  scale_x_date(breaks=date_breaks("12 month"))+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Popular Horror Movie Box Office Revenue Over Time",
       x = "Date",
       y = "Box Office Revenue (in Millions)")

There is a rise and fall in box office revenue that seems to repeat regularly every year in our dataset. And though we have a two large outliers, the pattern in the rise and fall of our data indicates that there is a seasonal pattern.

Let us investigate horror movie box office revenue by month of the year.

#creating a new column to hold the month when movies were released

hmovies_df2 <- hmovies_df2 %>%
  mutate(Month_Released = lubridate::month(Released, label = TRUE))

ggplot(
  data = hmovies_df2 %>% filter(!is.na(Month_Released)),
  aes(x= Month_Released, y=Box_Office_Millions, fill=Month_Released)) + 
  geom_col() + 
  labs(
    title = "Popular Horror Movie Box Office Revenue by Month",
    x = "Month",
    y ="Box Office Revenue (in Millions)"
  )+
  theme(legend.position="none")

## Warning: Removed 22 rows containing missing values or values outside the scale range
## (`geom_col()`).

Here we can see that the highest earning month for the movies in our dataset are January, June, July and October. With the next two highest earning month being September and December near October, this may explain the seasonal pattern we are saw in our time-series plot.

Box Office by Year

Total Box Office By Year

Now, let us take a look at highest total box office revenue by year.

hmovies_df2 %>%
  group_by(Year) %>%
  summarise(total_revenue = sum(Box_Office_Millions, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(Year), y = total_revenue, fill = factor(Year))) +
  geom_col() +
  labs(
    title = "Popular Horror Movie Total Box Office Revenue by Year",
    x = "Year",
    y = "Total Revenue (Millions)"
  ) +
  theme(legend.position="none")

It seems that overall there is an upward trend from year-to-year in box office revenue when we look at total box office revenue.

Average Box Office By Year

Now, let us take a look at highest average box office revenue by year.

#group by avg

hmovies_df2 %>%
  group_by(Year) %>%
  summarise(avg_revenue = mean(Box_Office_Millions, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(Year), y = avg_revenue, fill = factor(Year))) +
  geom_col() +
  labs(
    title = "Popular Horror Movie Average Box Office Revenue by Year",
    x = "Year",
    y = "Average Revenue (Millions)"
  ) +
  theme(legend.position="none")

Extreme Outliers

extreme_ub <- quantile(
  hmovies_df2$BoxOffice,
  0.95, na.rm = TRUE
)

print(extreme_ub)

##      95% 
## 99191574

outliers_df <- hmovies_df2 %>%
  filter(!is.na(BoxOffice)) %>%
  filter(BoxOffice >= quantile(BoxOffice, 0.95)) %>%
  select(
    Title,
    Director,
    Released,
    Country,
    BoxOffice
  ) %>%
  arrange(desc(BoxOffice)) %>%
  mutate(
    BoxOffice = dollar(BoxOffice)
  )

kable(outliers_df,
      caption = "Most Extreme Box Office Outliers (Top 5%)",
      align = "l")

Most Extreme Box Office Outliers (Top 5%)
Title	Director	Released	Country	BoxOffice
I Am Legend	Francis Lawrence	2007-12-14	United States, United Kingdom	$256,393,010
World War Z	Marc Forster	2013-06-21	United States, United Kingdom	$202,807,711
Split	M. Night Shyamalan	2017-01-20	United States, Japan	$138,291,365
The Conjuring	James Wan	2013-07-19	United States	$137,446,368
The Conjuring 2	James Wan	2016-06-10	Canada, United States, United Kingdom	$102,516,140

Taking a look at average box office revenue from year-to-year and the extreme box office outliers clarifies our findings from our earlier plot of total box office revenue year-by-year.

Although total horror box office revenue increased in some years, this growth was influenced by a small number of extremely high-grossing films. The average box office revenue per movie did not increase proportionally, indicating that the majority of films in our dataset did not have higher earnings. This indicates that overall market growth was driven either by a few blockbuster outliers or by the number of films released, rather than by improvements in individual movie performance.

Average Box Office by Country

Let’s take a look at the average box office by country and the number of horror movies in our dataset produced by these countries. We will include only countries with sufficient data, so those that have made 2 or more.

country_df <- hmovies_df2 %>%
  separate_rows(Country, sep = ",") %>%   
  mutate(Country = trimws(Country))



 country_df %>%
  filter(!is.na(Country)) %>%
  group_by(Country) %>%
  summarize(
    Average_Revenue = round(mean(BoxOffice, na.rm = TRUE), 0),
    Count = n()
  ) %>%
  filter(Count >= 2, Average_Revenue > 0) %>%     
  arrange(desc(Average_Revenue)) %>%
    mutate(
    Average_Revenue = comma(Average_Revenue)
  ) %>%
  kable("html", col.names = c("Country", "Average Revenue", "Number of Horror Movies Produced"),
        caption = "Average Revenue from Countires Producing the Most Popular Horror Movies") %>%
  kable_styling(full_width = TRUE, position = "center") %>%
  row_spec(0, background = "red", color = "white", bold = TRUE)

Average Revenue from Countires Producing the Most Popular Horror Movies
Country	Average Revenue	Number of Horror Movies Produced
Japan	53,119,769	4
United Kingdom	49,776,185	21
Australia	47,513,865	5
United States	39,154,212	90
Spain	33,425,033	3
Mexico	31,090,320	2
Canada	30,210,553	30
New Zealand	28,854,540	2
France	25,529,613	11
Germany	22,596,180	10
Sweden	6,269,528	2
Chile	3,614,314	2
Netherlands	1,970,272	2
Denmark	868,623	2
Italy	849,520	2

We can see that the countries with the highest average box office revenue are Japan, the U.K., Australia, the United States and Mexico. However only the U.S, and the U.K, are top produders of horror movies when we look at the number of movies produced.

Country Collaborations

I am interested in taking a look at the most popular collaborations between countires to produce the horror movies in our dataset.

pairs <- country_df %>%
  group_by(Title) %>%
  summarise(pairs = if(n() >= 2) list(combn(Country, 2, simplify = FALSE)) else list(NULL)
  ) %>% 
  pull(pairs) %>% 
  unlist(recursive = FALSE)

edges <- do.call(rbind, pairs)

g <- graph_from_data_frame(edges)

ggraph(g, layout = "fr") +
  geom_edge_link(alpha = 0.3) +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

It’s no surprise that the United States has the largest number of collaborations as we saw above that the United States are largest producer of horror movies in our dataset and that they collaborate most often with the United Kingdom, Canada, Australia, France and Germany all who are also top producers of horror movies in out dataset.

Average Box Office by Director, Writers and Actors

Popular Horror Movie Directors with the Average Highest Box Office Revenue

Let us investigate which directors in our dataset have the highest average box office revenue.

hmovies_df2 %>%
  group_by(Director) %>%
  summarise(avg_box1 = mean(Box_Office_Millions, na.rm = TRUE),
            count = n()) %>%
  filter(count >= 2) %>% 
ggplot(aes(x = reorder(Director, avg_box1), y = avg_box1, fill = avg_box1)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(option = "C")+
  labs(
    title = "Directors With Highest Average \n Box Office Revenue",
    x = "Director",
    y = "Avg Box Office (Millions)"
  ) +
  theme_minimal() + 
  theme(legend.position="none")

We see that M. Night Shyamalan and James Wan are far and above responsible for directing the horror movies with the highest average box offoice revenue in our dataset.

Popular Horror Movie Writers with the Highest Average Box Office Revenue

Let us investigate which writers in our dataset have the highest average box office revenue.

writers_df <- hmovies_df2 %>%
  separate_rows(Writer, sep = ",") %>%   
  mutate(Writer = trimws(Writer)) %>%
  filter(!is.na(Writer))

writers_df %>%
  group_by(Writer) %>%
  summarise(avg_box = mean(Box_Office_Millions, na.rm = TRUE),
            count = n()) %>%
  filter(count >= 2) %>%  # writers with sufficient data
  ggplot(aes(x = reorder(Writer, avg_box), y = avg_box, fill = avg_box)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient2(
    low = "lightblue", 
    mid = "cadetblue", 
    high = "purple", 
    midpoint = .02
  )+
  labs(
    title = "Writers With Highest Average \n Box Office Revenue",
    x = "Writer",
    y = "Avg Box Office (Millions)"
  ) +
  theme_minimal() + 
  theme(legend.position="none")

Popular Horror Movie Actors with the Highest Average Box Office Revenue

Let us investigate which actors in our dataset have the highest average box office revenue.

actors_df <- hmovies_df2 %>%
  separate_rows(Actors, sep = ",") %>%   
  mutate(Actors = trimws(Actors)) %>%
  filter(!is.na(Actors))

actors_df %>%
  group_by(Actors) %>%
  summarise(avg_box = mean(Box_Office_Millions, na.rm = TRUE),
            count = n()) %>%
  filter(count >= 2) %>%  # actors with enough data
  filter(avg_box > 0) %>%
  ggplot(aes(x = reorder(Actors, avg_box), y = avg_box, fill = avg_box)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient2(
    low = "orchid3", 
    mid = "darksalmon", 
    high = "deeppink2", 
    midpoint = .02
  )+
  labs(
    title = "Actors With Highest Average \n Popular Horror Movie Box Office Revenue",
    x = "Actor",
    y = "Avg Box Office (Millions)"
  ) +
  theme_minimal() +
  theme(legend.position="none")

A small number of directors, actors and writers are associated with consistent higher average box office revenue. For example: James Wan is the director associated with the second highest average box office revenue for horror movie in our dataset, and Vera Farminga and Patrick Wilson are the stars of the very popular Conjuring series directed by him, and Chad and Corey Hayes are the writers behind the series.

Popular Horror Movies By Subgenre

Let us investigate which subgenres in our dataset have the highest average box office revenue.

subgenre_df <- hmovies_df2 %>%
  separate_rows(Genre, sep = ",") %>%   
  mutate(Actors = trimws(Genre)) %>%
  filter(!is.na(Genre)) %>%
  filter(Genre != "Horror")

subgenre_df %>%
  group_by(Genre) %>%
  summarise(avg_box = mean(Box_Office_Millions, na.rm = TRUE),
            count = n()) %>%
  filter(count >= 2) %>%  # actors with enough data
  filter(avg_box > 0) %>%
  ggplot(aes(x = reorder(Genre, -avg_box), y = avg_box, fill = Genre)) +
  geom_col() +
  labs(
    title = "Subgenres With Highest Average \n Popular Horror Movie Box Office Revenue",
    x = "Subgenre",
    y = "Avg Box Office (Millions)"
  ) +
  theme_minimal() + 
  theme(legend.position="none")

We can see that far and above the horror movies in our dataset with the highest box office revenue that also fell under another genre were action movies

Average Box Office by Month Released

heatmap_subg <- subgenre_df %>%
  filter(!is.na(Box_Office_Millions) & !is.na(Month_Released)) %>%
  group_by(Genre, Month_Released) %>%
  summarise(avg_box = mean(Box_Office_Millions, na.rm = TRUE)) %>%
  ungroup()

## `summarise()` has grouped output by 'Genre'. You can override using the
## `.groups` argument.

ggplot(heatmap_subg, aes(x = Month_Released, y = Genre, fill = avg_box)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "C", labels = comma) +
  labs(
    title = "Average Box Office by Subgenre and Month",
    x = "Month of Release",
    y = "Subgenre",
    fill = "Avg Box Office (Millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Adventure and Action Horror movies have the highest earnings June; in Sci-fi movies have the highest earning in December and Mystery andThriller movies see the highest earning in July

Productions and Release Insights

Which production, release, and audience factors are associated with higher box office performance for popular horror movies?

Box office performance among popular (on IMDB) horror movies is associated with a combination of production strategies, release timing, and content characteristics rather than any single factor. Movies produced in countries with established film industries like Japan, the United Kingdom and the United States, tend to achieve on average, higher revenues, as do films associated with a small group of repeatedly successful directors and actors. Release timing also plays an important role, with revenues peaking during seasonally appropriate months such as June, October and December. Content-wise, certain subgenres (e.g., adventure, sci-fi , mystery) have higher average box office than others and are especially profitable when released during certain seasons. Overall, higher box office performance is associated with strategic production choices, seasonal release timing, and broad audience appeal.

Numerical Variables vs Box Office Correalation

First we must find out if any of our numerical variables are colinear

cor_matrix <- cor(hmovies_df2[, numeric_columns], use = "complete.obs")

ggcorrplot(cor_matrix, lab = TRUE,
           title = "Correlation Heatmap of Numeric Predictors")

IMDB Rating has a high positive correlation with all other numerical factors in our dataset. As IMDB Rating increases so do all the rest.

cor_with_box <- cor(hmovies_df2$Box_Office_Millions,
                    hmovies_df2[, numeric_columns],
                    use = "complete.obs")


ggcorrplot(cor_with_box, lab = TRUE,
           title = "Correlations Among Box Office and Numeric Predictors")

There is a positive correlation between Box Office and IMDB Votes, Ratins and Metascore. The correlation is low between Box Office and Runtime.

Creating linear models for our numerical predictors

Linear Regression Model for Box Office and IMDB Votes

model <- lm(BoxOffice ~ Votes, data = hmovies_df2)

plot(model,1)

plot(model,2)

plot(model,3)

plot(model,5)

Attempting to build a regression model with votes as the predictor fails though there is a high postive correlation between them because the underlying relationship is reversed: box office revenue influences the number of votes a movie receives, and not the other way around. This violates the assumptions of a predictive model.

This insight is still useful just not for our purpose. We have come to see that movies that earn higher revenue tend to attract more votes, which reflects audience engagement and popularity, we were looking at popular horror movie on IMDB after all. However, votes cannot be used to reliably predict revenue.

Linear Regression Model for Box Office and IMDB Ratings

model2 <- lm(BoxOffice ~ Rating, data = hmovies_df2)

plot(model2,1)

plot(model2,2)

plot(model2,3)

plot(model2,5)

Similarly, attempting to build a regression model with IMDB ratings as the predictor fails even though there is a high positive correlation between them because the underlying relationship is reversed: box office revenue influences the number of votes a movie receives, and not the other way around. This violates the assumptions of a predictive model.

Linear Regression Model for Box Office and Metascores

model3 <- lm(BoxOffice ~ Metascore, data = hmovies_df2)

plot(model3,1)

plot(model3,2)

plot(model3,3)

plot(model3,5)

Attempting to build a regression model with metascores as the predictor fails though they a not insignificant, poistive correlation because the underlying relationship is reversed: box office revenue influences the number of metascore ratings a movie receives, and not the other way around. This violates the assumptions of a predictive model.

Linear Regression Model for Box Office and Runtime

model4 <- lm(BoxOffice ~ Runtime, data = hmovies_df2)

plot(model4,1)

plot(model4,2)

plot(model4,3)

plot(model4,5)

stargazer(model4, type = "html", title = "Linear Regression Results: Runtime Model",
          align = TRUE, dep.var.labels = "Box Office Revenue",
          covariate.labels = "Runtime")

**Linear Regression Results: Runtime Model**

	Dependent variable:

	Box Office Revenue

Runtime	177,485.300
	(302,424.400)

Constant	17,575,043.000
	(31,480,240.000)


Observations	86
R²	0.004
Adjusted R²	-0.008
Residual Std. Error	44,580,262.000 (df = 84)
F Statistic	0.344 (df = 1; 84)

Note:	p<0.1; p<0.05; p<0.01

While the diagnostic linear model for runtime vs box office look better than the rest, our residuals do not look normal, and we must remember that these two variable have an extremely low negative correalation.

The linear model for runtime and box office produces some good diagnostic plots, even if all the diagnostic plots were perfect — residuals approximately normal, constant variance, and there were no influential outliers. The correlation between runtime and box office is extremely low, near zero, which indicates that runtime explains almost none of the variability in revenue. While the model is statistically valid, it is not useful for prediction.

Numerical Factors Insights

Which numeric features of a horror movie — runtime, metascores, IMDB rating, or votes — can reliably predict its Box Office performance?

Attempting to build a regression model with any of our numerical predictors—IMDB votes, IMDB ratings, metascores or runtime—as the predictor fails.

This insight is still useful just not for our purpose. We have come to see that movies that earn higher revenue tend to attract more votes, which reflects audience engagement and popularity.

We have also seen that due to the low correlation between runtime and box office revenue, our model provides very little insight into the variation of revenue.

We must focus more of our attention on production and release features to understand what appeals most to horror movies audiences and what combination of features will produce more financially successful horror movies.

Some limitations of working with this data is that it is quite a small number of horror movies when compared to how many are listed on IMDB.