Boxoffice Data Analysis


This article provides a comprehensive analysis of box office data.

Beni F, cryptoanalytica13#2474 (DataAnalyticade@gmail.com)
2023-03-03

1. Box office:

Box office figures report the financial performance of a movie, typically measured as the total revenue generated from ticket sales during its theatrical run.

1.1. Aim of the box office analysis:

The aim of box office analysis is to understand the factors that contribute to a movie’s financial success. This involves exploring how a movie’s budget, release date, genre, runtime, ratings, and other variables relate to its box office performance.

2. Methodology:

This study uses a public box office dataset that is available on Kaggle. The first rows of the dataset are shown below:

# load the packages used below
library(dplyr)       # %>%, arrange()
library(kableExtra)  # kbl(), kable_classic()
# ensure the results are repeatable
set.seed(7)
# load the data
df <- read.csv('blockbusters.csv')
# show the first rows, most recent release year first
df %>% arrange(desc(year)) %>% head() %>%
  kbl(caption = "Head of data sorted by year in descending order") %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")
Table: Head of data sorted by year in descending order

| Main_Genre | Genre_2 | Genre_3 | imdb_rating | length | rank_in_year | rating | studio | title | worldwide_gross | year |
|---|---|---|---|---|---|---|---|---|---|---|
| Action | Adventure | Drama | 7.4 | 135 | 1 | PG-13 | Walt Disney Pictures | Black Panther | $700,059,566 | 2018 |
| Action | Adventure | Sci-Fi | 8.5 | 156 | 2 | PG-13 | Walt Disney Pictures | Avengers: Infinity War | $678,815,482 | 2018 |
| Animation | Action | Adventure | 7.8 | 118 | 3 | PG | Pixar | Incredibles 2 | $608,581,744 | 2018 |
| Action | Adventure | Drama | 6.2 | 129 | 4 | PG-13 | Universal Pictures | Jurassic World: Fallen Kingdom | $416,769,345 | 2018 |
| Action | Comedy | | 7.8 | 119 | 5 | R | 20th Century Fox | Deadpool 2 | $318,491,426 | 2018 |
| Action | Adventure | Drama | 7.9 | 147 | 6 | PG-13 | Paramount Pictures | Mission: Impossible - Fallout | $220,159,104 | 2018 |
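
The table above stores revenue as text in a worldwide_gross column (for example $700,059,566), while the code in the following sections works with a numeric worldwide_sales column. The conversion step is not shown in the source. A minimal sketch of one way it could be done is given here, assuming that the values contain only a dollar sign, digits, and thousands separators; the helper name parse_gross is hypothetical.

```r
# Hypothetical helper (not from the original analysis):
# strip "$" and "," from a currency string and convert it to a number.
parse_gross <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Two values copied from the table above
gross <- c("$700,059,566", "$678,815,482")
sales <- parse_gross(gross)

# Applied to the full data frame, this could look like:
# df$worldwide_sales <- parse_gross(df$worldwide_gross)
```

Values that contain anything other than these characters would come back as NA with a warning, which makes unexpected formats easy to spot.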

3. Results:

3.1. Correlation between worldwide sales and other numeric features

A correlation matrix of the numeric columns is plotted to look for patterns among them.

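library(dplyr)       # select()
library(ggcorrplot)  # ggcorrplot()
# keep the numeric columns and compute their pairwise Pearson correlations
df_num <- select(df, c("year", "imdb_rating", "length", "worldwide_sales"))
df_cormatrix <- cor(df_num)

# lower triangle, ordered by hierarchical clustering, with coefficient labels
ggcorrplot(df_cormatrix, type = "lower", hc.order = TRUE, lab = TRUE)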
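
To make the numbers behind the plot easier to read, the correlations with worldwide sales can also be listed directly. The sketch below uses base R only and, as an assumption for illustration, takes the five rows of the top-five table from the Conclusions section as input; the variable names top5_num, cor_mat, and sales_cor are hypothetical. Five hand-picked blockbusters are not a representative sample, so the printed values say nothing about the full dataset.

```r
# Five rows copied from the top-five table in the Conclusions section.
# Illustration only: this is not a representative sample.
top5_num <- data.frame(
  year            = c(2009, 1997, 2012, 2011, 2013),
  imdb_rating     = c(7.9, 7.7, 8.2, 8.1, 7.7),
  length          = c(162, 194, 143, 130, 102),
  worldwide_sales = c(2749064328, 1843201268, 1518594910, 1341511219, 1274219009)
)

# Pairwise Pearson correlations; "pairwise.complete.obs" skips missing values
cor_mat <- cor(top5_num, use = "pairwise.complete.obs")

# Correlation of every other column with worldwide_sales, strongest first
sales_cor <- cor_mat[, "worldwide_sales"]
sales_cor <- sales_cor[names(sales_cor) != "worldwide_sales"]
sales_cor <- sales_cor[order(abs(sales_cor), decreasing = TRUE)]
print(round(sales_cor, 2))
```

Running the same steps on df_num instead of top5_num would give the coefficients that the correlation plot displays.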

3.2. Relationship between worldwide sales and individual features

3.2.1. Worldwide sales and IMDB rating:

# load the plotly and RColorBrewer packages
library(plotly)
library(RColorBrewer)

# create a color palette from Set1; brewer.pal() provides at most 9 Set1 colors,
# so interpolate in case there are more genres than that
n_colors <- length(unique(df$Main_Genre))
colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_colors)

# fit a linear regression; na.exclude keeps one fitted value per row of df
fit <- lm(worldwide_sales ~ imdb_rating, data = df, na.action = na.exclude)

# create a scatter plot with plotly and overlay the regression line
p <- plot_ly(df, x = ~imdb_rating, y = ~worldwide_sales, color = ~Main_Genre,
             colors = colors, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Scatter plot of worldwide sales and IMDB rating',
         xaxis = list(title = 'IMDB rating'),
         yaxis = list(title = 'Worldwide sales [$]')) %>%
  add_lines(x = ~imdb_rating, y = ~fitted(fit), inherit = FALSE,
            line = list(dash = 'dash', color = 'blue'), name = "regression line")

# display the plot
p
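
The dashed line in the plot is an ordinary least-squares fit of worldwide sales on IMDB rating. The sketch below shows how the slope and the R-squared of such a fit could be inspected in base R. As an assumption for illustration it uses the six 2018 rows from the data preview in the Methodology section, with the revenue strings already converted to numbers; the names sample_2018, slope_by_hand, and line_points are hypothetical. Six films are far too few to draw conclusions from.

```r
# Six 2018 rows copied from the data preview in the Methodology section.
# Illustration only: this is not a representative sample.
sample_2018 <- data.frame(
  imdb_rating     = c(7.4, 8.5, 7.8, 6.2, 7.8, 7.9),
  worldwide_sales = c(700059566, 678815482, 608581744,
                      416769345, 318491426, 220159104)
)

# Ordinary least squares: worldwide_sales = intercept + slope * imdb_rating
fit <- lm(worldwide_sales ~ imdb_rating, data = sample_2018,
          na.action = na.exclude)

# For a single predictor, the slope equals cov(x, y) / var(x)
slope_by_hand <- with(sample_2018,
                      cov(imdb_rating, worldwide_sales) / var(imdb_rating))

# Fitted values are the points the dashed line passes through
line_points <- fitted(fit)

# Slope and R-squared summarise the direction and strength of the linear trend
print(coef(fit))
print(summary(fit)$r.squared)
```

An R-squared close to zero would mean that the rating explains little of the variation in sales, even if the line has a visible slope.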

3.2.2. Worldwide sales and year of release:

# load the plotly and RColorBrewer packages
library(plotly)
library(RColorBrewer)

# create a color palette from Set1; brewer.pal() provides at most 9 Set1 colors,
# so interpolate in case there are more genres than that
n_colors <- length(unique(df$Main_Genre))
colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_colors)

# fit a linear regression; na.exclude keeps one fitted value per row of df
fit <- lm(worldwide_sales ~ year, data = df, na.action = na.exclude)

# create a scatter plot with plotly and overlay the regression line
p <- plot_ly(df, x = ~year, y = ~worldwide_sales, color = ~Main_Genre,
             colors = colors, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Scatter plot of worldwide sales and year of the movie release',
         xaxis = list(title = 'Year'),
         yaxis = list(title = 'Worldwide sales [$]')) %>%
  add_lines(x = ~year, y = ~fitted(fit), inherit = FALSE,
            line = list(dash = 'dash', color = 'blue'), name = "regression line")

# display the plot
p

3.2.3. Worldwide sales and length of the movie:

4. Conclusions:

# load the plotting packages
library(ggplot2)
library(plotly)

# subset the data to the 5 movies with the highest worldwide sales
top5 <- df[order(df$worldwide_sales, decreasing = TRUE), ][1:5, ]

# create a ggplot scatter plot with text labels (title, year, studio)
p1 <- ggplot(top5, aes(x = imdb_rating, y = worldwide_sales, color = Main_Genre)) +
  geom_point() +
  geom_text(aes(label = paste0(title, ", ", year, ", ", studio)),
            hjust = -0.1, vjust = -0.5, size = 3, color = "black", show.legend = FALSE) +
  ggtitle("Worldwide Sales vs IMDB Rating by Main Genre") +
  xlab("IMDB rating") + ylab("Worldwide sales") +
  xlim(5, 10)

# convert the plot to plotly
p <- ggplotly(p1)
p
# show the top 5 movies as a table, highest worldwide sales first
top5 %>% arrange(desc(worldwide_sales)) %>%
  kbl(caption = "Top five movies sorted by worldwide sales in descending order") %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")
Table 1: Top five movies sorted by worldwide sales in descending order

| Main_Genre | Genre_2 | Genre_3 | imdb_rating | length | rank_in_year | rating | studio | title | worldwide_sales | year |
|---|---|---|---|---|---|---|---|---|---|---|
| Fantasy | Adventure | Action | 7.9 | 162 | 1 | PG-13 | 20th Century Fox | Avatar | 2749064328 | 2009 |
| Romance | Drama | | 7.7 | 194 | 1 | PG-13 | Paramount Pictures | Titanic | 1843201268 | 1997 |
| Sci-Fi | Adventure | Action | 8.2 | 143 | 1 | PG-13 | Walt Disney Pictures | The Avengers | 1518594910 | 2012 |
| Thriller | Fantasy | Adventure | 8.1 | 130 | 1 | PG-13 | Warner Bros | Harry Potter and the Deathly Hallows - Part 2 | 1341511219 | 2011 |
| Comedy | Animation | Adventure | 7.7 | 102 | 1 | PG | Walt Disney Pictures | Frozen | 1274219009 | 2013 |