This article provides a comprehensive analysis of box office data.
Box office figures report the financial performance of movies, typically measured by the total revenue generated from ticket sales during a movie's theatrical run.
The aim of box office analysis is to understand the factors that contribute to a movie’s financial success. This involves exploring the relationship between a movie’s box office performance and its budget, release date, genre, runtime, ratings, and other variables.
This study uses a public box office dataset that is available on Kaggle. Some rows of the dataset are shown below:
| Main_Genre | Genre_2 | Genre_3 | imdb_rating | length | rank_in_year | rating | studio | title | worldwide_gross | year |
|---|---|---|---|---|---|---|---|---|---|---|
| Action | Adventure | Drama | 7.4 | 135 | 1 | PG-13 | Walt Disney Pictures | Black Panther | $700,059,566 | 2018 |
| Action | Adventure | Sci-Fi | 8.5 | 156 | 2 | PG-13 | Walt Disney Pictures | Avengers: Infinity War | $678,815,482 | 2018 |
| Animation | Action | Adventure | 7.8 | 118 | 3 | PG | Pixar | Incredibles 2 | $608,581,744 | 2018 |
| Action | Adventure | Drama | 6.2 | 129 | 4 | PG-13 | Universal Pictures | Jurassic World: Fallen Kingdom | $416,769,345 | 2018 |
| Action | Comedy | | 7.8 | 119 | 5 | R | 20th Century Fox | Deadpool 2 | $318,491,426 | 2018 |
| Action | Adventure | Drama | 7.9 | 147 | 6 | PG-13 | Paramount Pictures | Mission: Impossible - Fallout | $220,159,104 | 2018 |
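The sample above stores revenue as formatted strings in a `worldwide_gross` column, while the code later in this article reads a numeric `worldwide_sales` column. The conversion step is not shown here, so the following is only a minimal sketch of one way it could be done, assuming the column was renamed and stripped of `$` and `,`. The `raw` data frame is a small stand-in built from three rows of the table above.

```r
# Stand-in for the raw data: three rows copied from the sample table above
raw <- data.frame(
  title = c("Black Panther", "Avengers: Infinity War", "Incredibles 2"),
  worldwide_gross = c("$700,059,566", "$678,815,482", "$608,581,744"),
  stringsAsFactors = FALSE
)

# Assumption: worldwide_sales is worldwide_gross with "$" and "," removed
raw$worldwide_sales <- as.numeric(gsub("[$,]", "", raw$worldwide_gross))

# Inspect the parsed numeric column
str(raw$worldwide_sales)
```

With the full dataset, the same `gsub()` call would be applied to the whole column before computing correlations or fitting regressions.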
A correlation matrix is plotted for the numeric columns of the dataset to look for patterns among them.
The following figure shows that worldwide sales has a correlation of 0.68 with release year, meaning that more recent movies generally had higher sales. Part of this could be due to inflation and the declining value of money, but the overall trend is still observed.
Moreover, worldwide sales is correlated with the length of the movie and with its IMDB rating, with correlation coefficients of 0.34 and 0.21, respectively.
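# load dplyr for select() and ggcorrplot for the correlation plot
library(dplyr)
library(ggcorrplot)
df_num <- select(df, c("year", "imdb_rating", "length", "worldwide_sales"))
df_cormatrix <- data.frame(cor(df_num))
ggcorrplot(df_cormatrix, type = "lower", hc.order = TRUE, lab = TRUE, insig = "blank")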
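To put these coefficients in perspective, squaring a Pearson correlation gives the share of variance in one variable that a simple one-variable linear fit on the other accounts for. A quick sketch using the three correlation values quoted in the text:

```r
# Correlation of worldwide sales with each variable, as quoted in the text
r <- c(year = 0.68, length = 0.34, imdb_rating = 0.21)

# Squared correlation = share of variance explained by a one-variable linear fit
r_squared <- round(r^2, 2)
print(r_squared)
```

By this measure, release year accounts for roughly 46% of the variance in worldwide sales, length for about 12%, and IMDB rating for about 4%.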
# load the plotly and RColorBrewer packages
library(plotly)
library(RColorBrewer)
# build one color per genre; Set1 has only 9 colors, so interpolate if there are more genres
n_colors <- length(unique(df$Main_Genre))
colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_colors)
# create a scatter plot of worldwide sales against IMDB rating, colored by genre
p <- plot_ly(df, x = ~imdb_rating, y = ~worldwide_sales, color = ~Main_Genre,
             colors = colors, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Scatter plot of worldwide sales and IMDB rating',
         xaxis = list(title = 'IMDB rating'),
         yaxis = list(title = 'Worldwide sales [$]')) %>%
  # overlay the least-squares fit; I("blue") sets a literal color instead of a mapped one
  add_lines(x = ~imdb_rating, y = ~fitted(lm(worldwide_sales ~ imdb_rating, data = df)),
            line = list(dash = 'dash'), color = I("blue"), name = "regression line")
# display the plot
p
The following plot shows worldwide sales versus the year of release. The general pattern is that sales increase for more recent movies. However, there is a sudden drop in worldwide sales after 2015.
The highest sales occurred approximately between 2005 and 2015.
# load the plotly and RColorBrewer packages
library(plotly)
library(RColorBrewer)
# build one color per genre; Set1 has only 9 colors, so interpolate if there are more genres
n_colors <- length(unique(df$Main_Genre))
colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_colors)
# create a scatter plot of worldwide sales against release year, colored by genre
p <- plot_ly(df, x = ~year, y = ~worldwide_sales, color = ~Main_Genre,
             colors = colors, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Scatter plot of worldwide sales and year of the movie release',
         xaxis = list(title = 'year'),
         yaxis = list(title = 'Worldwide sales [$]')) %>%
  # overlay the least-squares fit; I("blue") sets a literal color instead of a mapped one
  add_lines(x = ~year, y = ~fitted(lm(worldwide_sales ~ year, data = df)),
            line = list(dash = 'dash'), color = I("blue"), name = "regression line")
# display the plot
p
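Because a scatter plot of individual titles can hide the typical level for each year, one way to check an apparent drop is to summarize sales per release year. The sketch below does this on a made-up toy data frame (the values are illustrative only and are not taken from the dataset); with the real data, `toy` would be replaced by `df`.

```r
# Toy data: two hypothetical movies per year (illustrative values only)
toy <- data.frame(
  year = c(2014, 2014, 2015, 2015, 2016, 2016),
  worldwide_sales = c(500, 700, 600, 900, 300, 400)
)

# Median worldwide sales for each release year
yearly <- aggregate(worldwide_sales ~ year, data = toy, FUN = median)
print(yearly)
```

The median is used here rather than the mean so that a single blockbuster does not dominate the yearly summary.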
The length of a movie has a positive relationship with its worldwide sales. This is illustrated in the following plot, where sales increase with the length of the movie.
A regression analysis shows the positive relationship between these two variables: lengthy movies tend to have higher sales.
# load the plotly and RColorBrewer packages
library(plotly)
library(RColorBrewer)
# build one color per genre; Set1 has only 9 colors, so interpolate if there are more genres
n_colors <- length(unique(df$Main_Genre))
colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_colors)
# create a scatter plot of worldwide sales against movie length, colored by genre
p <- plot_ly(df, x = ~length, y = ~worldwide_sales, color = ~Main_Genre,
             colors = colors, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Scatter plot of worldwide sales and length of the movie',
         xaxis = list(title = 'length [min]'),
         yaxis = list(title = 'Worldwide sales [$]')) %>%
  # overlay the least-squares fit; I("blue") sets a literal color instead of a mapped one
  add_lines(x = ~length, y = ~fitted(lm(worldwide_sales ~ length, data = df)),
            line = list(dash = 'dash'), color = I("blue"), name = "regression line")
# display the plot
p
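The dashed line in these plots comes from `lm()`. As a small illustration of what such a fit reports, the sketch below runs the same kind of one-variable regression on toy numbers (made up for illustration, not taken from the dataset) and extracts the slope and R-squared.

```r
# Toy data: hypothetical lengths in minutes and sales (illustrative values only)
toy <- data.frame(
  length = c(90, 100, 110, 120, 130),
  worldwide_sales = c(100, 140, 150, 210, 220)
)

# Fit worldwide_sales as a linear function of length
fit <- lm(worldwide_sales ~ length, data = toy)

# Slope: estimated change in sales per additional minute of runtime
slope <- coef(fit)[["length"]]
# R-squared: share of the variance in sales captured by the fit
r2 <- summary(fit)$r.squared

cat("slope:", slope, " R-squared:", round(r2, 3), "\n")
```

A positive slope corresponds to the upward-sloping regression line in the plot; R-squared indicates how much of the spread around that line remains unexplained.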
The worldwide sales of a movie is related to its length, its IMDB rating, and its year of release.
Higher values of all three features are associated with higher worldwide sales.
The top 5 movies, in order, were Avatar, Titanic, The Avengers, Harry Potter and the Deathly Hallows - Part 2, and Frozen.
Two of these five movies are from Walt Disney Pictures, which shows the success of this studio.
# load ggplot2 for the static plot and plotly for the interactive version
library(ggplot2)
library(plotly)
# subset the data to the top 5 movies with the highest worldwide sales
top5 <- df[order(df$worldwide_sales, decreasing = TRUE), ][1:5, ]
# create a ggplot scatter plot with text labels (title, year, studio)
p1 <- ggplot(top5, aes(x = imdb_rating, y = worldwide_sales, color = Main_Genre)) +
  geom_point() +
  geom_text(aes(label = paste(title, year, studio, sep = ", ")),
            hjust = -0.1, vjust = -0.5, size = 3, color = "black", show.legend = FALSE) +
  ggtitle("Worldwide Sales vs IMDB Rating by Main Genre") +
  xlab("IMDB rating") + ylab("Worldwide sales") +
  xlim(5, 10)
# convert the plot to plotly
p <- ggplotly(p1)
p
| Main_Genre | Genre_2 | Genre_3 | imdb_rating | length | rank_in_year | rating | studio | title | worldwide_sales | year |
|---|---|---|---|---|---|---|---|---|---|---|
| Fantasy | Adventure | Action | 7.9 | 162 | 1 | PG-13 | 20th Century Fox | Avatar | 2749064328 | 2009 |
| Romance | Drama | | 7.7 | 194 | 1 | PG-13 | Paramount Pictures | Titanic | 1843201268 | 1997 |
| Sci-Fi | Adventure | Action | 8.2 | 143 | 1 | PG-13 | Walt Disney Pictures | The Avengers | 1518594910 | 2012 |
| Thriller | Fantasy | Adventure | 8.1 | 130 | 1 | PG-13 | Warner Bros | Harry Potter and the Deathly Hallows - Part 2 | 1341511219 | 2011 |
| Comedy | Animation | Adventure | 7.7 | 102 | 1 | PG | Walt Disney Pictures | Frozen | 1274219009 | 2013 |
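The studio count mentioned above can be reproduced directly from the five rows of this table. A minimal sketch, with a `top5` data frame rebuilt by hand from those rows (only the columns needed here):

```r
# Top 5 movies, copied from the table above
top5 <- data.frame(
  title = c("Avatar", "Titanic", "The Avengers",
            "Harry Potter and the Deathly Hallows - Part 2", "Frozen"),
  studio = c("20th Century Fox", "Paramount Pictures", "Walt Disney Pictures",
             "Warner Bros", "Walt Disney Pictures"),
  worldwide_sales = c(2749064328, 1843201268, 1518594910, 1341511219, 1274219009),
  stringsAsFactors = FALSE
)

# Number of top-5 titles per studio, most frequent first
studio_counts <- sort(table(top5$studio), decreasing = TRUE)
print(studio_counts)

# Each studio's share of the combined top-5 sales
share <- tapply(top5$worldwide_sales, top5$studio, sum) / sum(top5$worldwide_sales)
print(round(share, 2))
```

Counting titles and summing sales give two different views of studio success: a studio can have the most entries in the list without holding the largest share of revenue.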