For this assignment, I analyze data extracted from boxofficemojo.com, the-numbers.com and filmratings.com. The compiled data set covers the #1 movie of each year between 1980 and 2017, along with additional information about each movie, including runtime (minutes), budget (millions), MPAA rating, and total gross revenue. I focus on the relationships between the variables and on what they tell us about how the movie industry has changed over the decades.
Before doing any analysis, I expect that production budget, total gross revenue, and ticket price will all increase over time. I also expect these variables to have very strong and significant positive correlations with each other and with the year variable.
I also predict there will be a strong and significant positive relationship between the runtime and production budget of a movie.
First, I downloaded a cleaned-up version of the yearly box office data from Box Office Mojo (through Canvas). After viewing the data as a .csv file in Excel, I found that some variables had missing values and others were unclear. I deleted the “Total Screens” and “Average Cost” variables, along with both “Change” variables, because they were unnecessary for this analysis. In the .csv file, I added three new variables: MPAA rating, runtime, and budget (information obtained from https://www.filmratings.com/ and https://www.the-numbers.com/). The variables in the final data set are year, movie, MPAA, runtime, budget, total_gross, numb_movies, ticket_price, and tickets_sold.
After cleaning and obtaining all other data (MPAA rating, runtime, and budget), the .csv file was saved and read into RStudio:
Movies <- read.csv("YBOplus.csv")
To view the first 6 cases (out of 38), use the head function:
head(Movies)
## year movie MPAA runtime budget total_gross
## 1 2017 Star Wars: The Last Jedi PG-13 152 262 11071900
## 2 2016 Rogue One R 133 200 11377700
## 3 2015 Star Wars: The Force Awakens PG-13 135 306 11129400
## 4 2014 American Sniper R 134 58 10361300
## 5 2013 Catching Fire PG-13 146 130 10924600
## 6 2012 The Avengers PG-13 143 225 10837600
## numb_movies ticket_price tickets_sold
## 1 740 8.97 1234
## 2 736 8.65 1315
## 3 705 8.43 1320
## 4 707 8.17 1268
## 5 688 8.13 1344
## 6 669 7.96 1362
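Since some of the original variables had missing data, a quick check of column types and missing values is a reasonable step after reading the file. The sketch below is illustrative only: it rebuilds the six rows printed by head(Movies) as a stand-in for the full YBOplus.csv file, and the names missing_per_column and n_complete are my own.

```r
# Stand-in for read.csv("YBOplus.csv"): only the six rows shown by head(Movies)
Movies <- data.frame(
  year         = 2017:2012,
  movie        = c("Star Wars: The Last Jedi", "Rogue One",
                   "Star Wars: The Force Awakens", "American Sniper",
                   "Catching Fire", "The Avengers"),
  MPAA         = c("PG-13", "R", "PG-13", "R", "PG-13", "PG-13"),
  runtime      = c(152, 133, 135, 134, 146, 143),
  budget       = c(262, 200, 306, 58, 130, 225),
  total_gross  = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600),
  numb_movies  = c(740, 736, 705, 707, 688, 669),
  ticket_price = c(8.97, 8.65, 8.43, 8.17, 8.13, 7.96),
  tickets_sold = c(1234, 1315, 1320, 1268, 1344, 1362)
)

str(Movies)                                   # column names and types

# Number of missing values in each column
missing_per_column <- colSums(is.na(Movies))
print(missing_per_column)

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(Movies))
print(n_complete)
```

With the real file, the same three calls would be run on the data frame returned by read.csv.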
x = year
y1 = total_gross
y2 = ticket_price
y3 = budget
y4 = numb_movies
plot(total_gross ~ year,
data=Movies,
pch=16,
xlab = "Year",
ylab = "Total Gross Revenue")
model1 <- lm(total_gross~year,Movies,model = TRUE)
summary(model1)
##
## Call:
## lm(formula = total_gross ~ year, data = Movies, model = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -835780 -275006 20974 303550 1069522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -494215478 14808125 -33.38 <2e-16 ***
## year 250900 7410 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 500900 on 36 degrees of freedom
## Multiple R-squared: 0.9696, Adjusted R-squared: 0.9687
## F-statistic: 1147 on 1 and 36 DF, p-value: < 2.2e-16
The graph above shows total gross revenue increasing over the years. The relationship is very strong (adjusted R^2 = 0.9687) and statistically significant (p < 2.2e-16, below α = 0.001). The equation for the line is Y = -494215478 + 250900t + e, where t is the year, so total gross revenue rises by about 250,900 per year in the units of the data set.
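As a worked example of how the fitted line is used, the sketch below plugs the year 2017 into the equation and compares the result with the observed value from head(Movies). It uses the coefficients as printed in the summary, which are rounded, so the result is approximate; the variable names are my own.

```r
# Coefficients copied from summary(model1); the printout rounds them
b0 <- -494215478   # intercept
b1 <- 250900       # change in total_gross per year

year_2017     <- 2017
observed_2017 <- 11071900            # total_gross for 2017 from head(Movies)

# Fitted value from the regression line, and the corresponding residual
fitted_2017   <- b0 + b1 * year_2017
residual_2017 <- observed_2017 - fitted_2017

cat("fitted:  ", fitted_2017, "\n")
cat("residual:", residual_2017, "\n")
```

The residual of about -778,000 falls inside the residual range reported in the summary (-835780 to 1069522), which is what one would expect for a single observation.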
plot(ticket_price ~ year,
data=Movies,
pch=16,
xlab = "Year",
ylab = "Average Ticket Price")
model2 <- lm(ticket_price~year,Movies,model = TRUE)
summary(model2)
##
## Call:
## lm(formula = ticket_price ~ year, data = Movies, model = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6990 -0.2339 0.1105 0.2710 0.4929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.290e+02 1.038e+01 -31.68 <2e-16 ***
## year 1.673e-01 5.196e-03 32.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3512 on 36 degrees of freedom
## Multiple R-squared: 0.9665, Adjusted R-squared: 0.9655
## F-statistic: 1037 on 1 and 36 DF, p-value: < 2.2e-16
The graph above shows the average ticket price increasing over the years. The relationship is very strong (adjusted R^2 = 0.9655) and statistically significant (p < 2.2e-16, below α = 0.001). The equation for the line is Y = -329.0 + 0.1673t + e, where t is the year, so the average ticket price rises by about 0.17 per year.
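The slope, adjusted R^2, and p-value quoted above can also be read directly from the model object instead of being copied from the printout. The sketch below assumes only the six rows shown by head(Movies), so its numbers will differ from the 38-row fit reported above; the names fit, slope, adj_r2, and p_value are my own.

```r
# Stand-in data: years and ticket prices from the six rows of head(Movies)
Movies <- data.frame(
  year         = 2017:2012,
  ticket_price = c(8.97, 8.65, 8.43, 8.17, 8.13, 7.96)
)

fit <- lm(ticket_price ~ year, data = Movies)
s   <- summary(fit)

slope   <- unname(coef(fit)["year"])        # estimated change per year
adj_r2  <- s$adj.r.squared                  # adjusted R-squared
p_value <- coef(s)["year", "Pr(>|t|)"]      # p-value for the slope

cat("slope:", slope, " adjusted R^2:", adj_r2, " p-value:", p_value, "\n")
```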
plot(budget ~ year,
data=Movies,
pch=16,
xlab = "Year",
ylab = "Production Budget")
model3 <- lm(budget~year,Movies,model = TRUE)
summary(model3)
##
## Call:
## lm(formula = budget ~ year, data = Movies, model = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.290 -29.638 -1.375 23.555 103.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.242e+04 1.501e+03 -8.275 7.55e-10 ***
## year 6.268e+00 7.511e-01 8.346 6.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.77 on 36 degrees of freedom
## Multiple R-squared: 0.6593, Adjusted R-squared: 0.6498
## F-statistic: 69.65 on 1 and 36 DF, p-value: 6.157e-10
The graph above shows the production budget increasing over the years. The relationship is strong (adjusted R^2 = 0.6498), though weaker than for total gross revenue and ticket price, and it is statistically significant (p = 6.157e-10, below α = 0.001). The equation for the line is Y = -12420 + 6.268t + e, where t is the year, so the budget of the #1 movie rises by about 6.27 million per year.
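The summary reports both a multiple and an adjusted R-squared. The adjusted value follows from the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of cases and p the number of predictors. The sketch below reproduces it from the printed values for this model (n = 38, p = 1); it works from the rounded R^2 in the printout, so it is a check rather than an exact recomputation.

```r
r2 <- 0.6593   # Multiple R-squared from summary(model3)
n  <- 38       # number of years in the data set
p  <- 1        # one predictor (year)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The result agrees with the adjusted R-squared of 0.6498 shown in the output.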
plot(runtime ~ budget,
data=Movies,
pch=16,
ylab = "Runtime (min)",
xlab = "Production Budget")
model4 <- lm(runtime~budget,Movies,model = TRUE)
summary(model4)
##
## Call:
## lm(formula = runtime ~ budget, data = Movies, model = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.217 -17.367 -0.766 9.431 66.891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.51749 6.41301 19.105 <2e-16 ***
## budget 0.12332 0.04724 2.611 0.0131 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.65 on 36 degrees of freedom
## Multiple R-squared: 0.1592, Adjusted R-squared: 0.1358
## F-statistic: 6.815 on 1 and 36 DF, p-value: 0.01309
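The graph above shows runtime increasing as production budget increases. However, the statistics show that this is a weak relationship (adjusted R^2 = 0.1358), although it is significant at the 0.05 level (p = 0.01309 < α = 0.05). The equation for the line is Y = 122.51749 + 0.12332x + e, where x is the production budget in millions, so each additional million in budget is associated with about 0.12 more minutes of runtime.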
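For a regression with a single predictor, the correlation coefficient is the square root of the multiple R-squared, with the sign of the slope. The sketch below applies this to the printed values for runtime and budget to show why the relationship counts as weak under the 0.5 cutoff used in this report; it relies on rounded values from the printout.

```r
r2    <- 0.1592    # Multiple R-squared from summary(model4)
slope <- 0.12332   # estimated slope for budget

# Pearson correlation implied by a one-predictor regression
r <- sign(slope) * sqrt(r2)
print(round(r, 3))

is_strong <- abs(r) > 0.5   # cutoff used in this report
print(is_strong)
```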
library(corrgram); library(corrplot)  # packages that provide corrgram() and corrplot()
corrplot(corrgram(Movies), method = "number")
This is a correlation matrix, which displays the correlation of each variable with every other variable. In this case, all the relationships are positive. Each number gives the strength of the relationship; here, a value above 0.5 is treated as strong and a value below 0.5 as weak. Most of the relationships with the variable “year” are very strong (>0.5), whereas all the relationships with “runtime” are weak (<0.5).
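The numbers in a matrix like this can also be computed with base R's cor(), which uses the Pearson correlation by default. The sketch below is an illustration on the six rows shown by head(Movies), not the full data set, so the values will not match the plot; numeric_cols, cor_matrix, and strong_with_year are my own names.

```r
# Stand-in data: the six rows printed by head(Movies)
Movies <- data.frame(
  year         = 2017:2012,
  MPAA         = c("PG-13", "R", "PG-13", "R", "PG-13", "PG-13"),
  runtime      = c(152, 133, 135, 134, 146, 143),
  budget       = c(262, 200, 306, 58, 130, 225),
  total_gross  = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600),
  ticket_price = c(8.97, 8.65, 8.43, 8.17, 8.13, 7.96)
)

# Correlations are only defined for numeric columns, so drop the rest
numeric_cols <- Movies[sapply(Movies, is.numeric)]
cor_matrix   <- cor(numeric_cols)
print(round(cor_matrix, 2))

# Which variables are strongly related to year under the 0.5 cutoff?
strong_with_year <- abs(cor_matrix["year", ]) > 0.5
print(strong_with_year)
```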
library(ggplot2)   # plotting
library(dplyr)     # %>%, group_by(), top_n()
library(ggrepel)   # geom_label_repel()
library(scales)    # comma label formatter

ggplot(Movies) +
  geom_point(aes(x = as.factor(MPAA), y = total_gross)) +
  # horizontal line at the highest total gross within each rating
  geom_hline(yintercept = Movies %>%
               group_by(MPAA) %>%
               top_n(1, total_gross) %>%
               .$total_gross) +
  # label the three highest total gross values within each rating
  geom_label_repel(data = . %>%
                     group_by(MPAA) %>%
                     top_n(3, total_gross),
                   aes(x = MPAA, y = total_gross, label = movie, color = MPAA)) +
  scale_y_continuous(limits = range(Movies$total_gross), labels = comma) +
  labs(x = 'MPAA Rating', y = 'Total Gross',
       title = 'Most Frequent/Highest Earners') +
  theme(legend.position = 'none')
This plot adds information about MPAA ratings. It shows that the most frequent and the highest-earning #1 movies are rated PG-13, including Star Wars: The Last Jedi, Star Wars: The Force Awakens, and Catching Fire. Some R-rated movies are high earners as well, but R-rated movies are high earners less often than PG-13 movies.
This could be the case because PG-13 movies are geared towards a larger audience, and are more inclusive for families.
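The "most frequent" and "highest earner" claims can be checked with a count and a maximum per rating. The sketch below does this on the six rows shown by head(Movies) only, so it illustrates the method rather than the full 38-year result; rating_counts and top_by_rating are my own names.

```r
# Stand-in data: rating and total gross from the six rows of head(Movies)
Movies <- data.frame(
  MPAA        = c("PG-13", "R", "PG-13", "R", "PG-13", "PG-13"),
  total_gross = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600)
)

# How many #1 movies carry each rating
rating_counts <- table(Movies$MPAA)
print(rating_counts)

# Highest total gross observed within each rating
top_by_rating <- aggregate(total_gross ~ MPAA, data = Movies, FUN = max)
print(top_by_rating)
```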
Before doing any analysis, I expected that there would be an increase in production budget, total gross revenue, and ticket price. I also expected that these variables would have very strong and significant positive correlations with each other and with the year variable. Statistical testing confirmed strong and significant relationships between each of these variables and year.
I had also predicted that there would be a strong and significant positive relationship between the runtime and production budget of a movie. This was only partly supported: the relationship is positive and significant at the 0.05 level (p = 0.01309), but it is weak (adjusted R^2 = 0.1358).
Overall, all variables relating to money (budget, ticket price, total gross revenue) increased over time. Part of this might be explained by inflation in the economy. The analysis also showed that the most frequent #1 movies, and the highest earners, are rated PG-13.