Introduction

For this assignment, I will be analyzing data extracted from boxofficemojo.com, the-numbers.com and filmratings.com. The compiled data set focuses on the #1 movie of each year between 1980 and 2017, along with additional information about each movie, including runtime (minutes), budget (millions), MPAA rating, and total gross revenue. I will focus on the relationships between the variables and note what they tell us about how the movie industry has changed over the decades.

Before doing any analysis, I expect production budget, total gross revenue, and ticket price to increase over time. Furthermore, I expect these variables to have a very strong and significant positive correlation with each other and with the year variable (that is, as the year increases, each of these variables increases).
I also predict a strong and significant positive relationship between the runtime and production budget of a movie.

Data

First, I downloaded a cleaned-up version of the yearly box office data from Box Office Mojo (through Canvas). When I first viewed the data as a .csv file in Excel, I noticed that some variables had missing values and others were unclear. I therefore deleted the “Total Screens” and “Average Cost” variables, along with both “Change” variables, because they were incomplete, unclear, or unnecessary for this analysis. In the .csv file, I added three new variables (information obtained from https://www.filmratings.com/ and https://www.the-numbers.com/). The variables analyzed are year, movie, MPAA, runtime, budget, total_gross, numb_movies, ticket_price, and tickets_sold.

After cleaning and obtaining all other data (MPAA rating, runtime, and budget), the .csv file was saved and read into RStudio:

Movies <- read.csv("YBOplus.csv")

To view the first 6 cases (out of 38), use the head function:

head(Movies)
##   year                        movie  MPAA runtime budget total_gross
## 1 2017     Star Wars: The Last Jedi PG-13     152    262    11071900
## 2 2016                    Rogue One     R     133    200    11377700
## 3 2015 Star Wars: The Force Awakens PG-13     135    306    11129400
## 4 2014              American Sniper     R     134     58    10361300
## 5 2013                Catching Fire PG-13     146    130    10924600
## 6 2012                 The Avengers PG-13     143    225    10837600
##   numb_movies ticket_price tickets_sold
## 1         740         8.97         1234
## 2         736         8.65         1315
## 3         705         8.43         1320
## 4         707         8.17         1268
## 5         688         8.13         1344
## 6         669         7.96         1362
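The missing-data check described earlier was done by eye in Excel; the same check can be sketched in R. The example below rebuilds only the six rows printed by head(Movies) (the full file has 38), and the name `first_six` is an illustrative assumption, not part of the original analysis.

```r
# Rebuild the six rows shown by head(Movies); the full data set has 38 rows.
first_six <- data.frame(
  year        = c(2017, 2016, 2015, 2014, 2013, 2012),
  MPAA        = c("PG-13", "R", "PG-13", "R", "PG-13", "PG-13"),
  runtime     = c(152, 133, 135, 134, 146, 143),
  budget      = c(262, 200, 306, 58, 130, 225),
  total_gross = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600)
)

# Column types, to confirm that numbers were not read in as text
str(first_six)

# Number of missing values in each column
missing_per_column <- colSums(is.na(first_six))
print(missing_per_column)
```

Running the same two calls on the full Movies data frame would flag any variable that still has gaps after cleaning.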

Results

Correlations to year

x = year

y1 = total_gross

y2 = ticket_price

y3 = budget

y4 = numb_movies

total_gross~year

plot(total_gross ~ year,
     data=Movies,
     pch=16,
     xlab = "Year",
     ylab = "Total Gross Revenue")

model1 <- lm(total_gross~year,Movies,model = TRUE)
summary(model1)
## 
## Call:
## lm(formula = total_gross ~ year, data = Movies, model = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -835780 -275006   20974  303550 1069522 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -494215478   14808125  -33.38   <2e-16 ***
## year            250900       7410   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 500900 on 36 degrees of freedom
## Multiple R-squared:  0.9696, Adjusted R-squared:  0.9687 
## F-statistic:  1147 on 1 and 36 DF,  p-value: < 2.2e-16

The plot above shows total gross revenue rising over the years. The relationship is very strong (adjusted R^2 = 0.9687) and statistically significant (p < 2.2e-16, well below α = 0.001). The fitted line is Y = -494215478 + 250900t + e, where t is the year, so total gross rises by about 250,900 (in the units of total_gross) per year.
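The statistics quoted from a summary() printout can also be pulled out of the fitted model directly, which avoids copying errors. The sketch below does this on the six rows shown by head(Movies) only, so its values will differ from the 38-row results reported here; `first_six`, `fit`, and the other names are illustrative.

```r
# Six rows from head(Movies); the reported model uses all 38 rows.
first_six <- data.frame(
  year        = c(2017, 2016, 2015, 2014, 2013, 2012),
  total_gross = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600)
)

fit <- lm(total_gross ~ year, data = first_six)
fit_summary <- summary(fit)

# Intercept and slope of the fitted line
print(coef(fit))

# Adjusted R-squared, as opposed to the multiple R-squared
adj_r2 <- fit_summary$adj.r.squared

# p-value for the slope, taken from the coefficient table
slope_p <- coef(fit_summary)["year", "Pr(>|t|)"]

# p-value of the overall F-test; with one predictor it equals the slope p-value
f <- fit_summary$fstatistic
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(adj_r2 = adj_r2, slope_p = slope_p, f_p = f_p))
```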

ticket_price~year

plot(ticket_price ~ year,
     data=Movies,
     pch=16,
     xlab = "Year",
     ylab = "Average Ticket Price")

model2 <- lm(ticket_price~year,Movies,model = TRUE)
summary(model2)
## 
## Call:
## lm(formula = ticket_price ~ year, data = Movies, model = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6990 -0.2339  0.1105  0.2710  0.4929 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.290e+02  1.038e+01  -31.68   <2e-16 ***
## year         1.673e-01  5.196e-03   32.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3512 on 36 degrees of freedom
## Multiple R-squared:  0.9665, Adjusted R-squared:  0.9655 
## F-statistic:  1037 on 1 and 36 DF,  p-value: < 2.2e-16

The plot above shows the average ticket price rising over the years. The relationship is very strong (adjusted R^2 = 0.9655) and statistically significant (p < 2.2e-16, well below α = 0.001). The fitted line is Y = -3.290e+02 + 1.673e-01t + e, where t is the year, so the average ticket price rises by about 17 cents per year.
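Because the year values are close to 2000, the intercept of a fit like this describes year 0 and has no practical meaning, and a small rounding error in the printed slope is multiplied by about 2000 when the equation is used for prediction. One common option, shown here as a sketch and not as part of the original analysis, is to center the year before fitting. The example uses the six rows shown by head(Movies) and 2012 as an arbitrary reference year; all names are illustrative.

```r
# Six rows from head(Movies); the reported model uses all 38 rows.
first_six <- data.frame(
  year         = c(2017, 2016, 2015, 2014, 2013, 2012),
  ticket_price = c(8.97, 8.65, 8.43, 8.17, 8.13, 7.96)
)

# Original parameterization: the intercept is the fitted price at year 0
fit_raw <- lm(ticket_price ~ year, data = first_six)

# Centered parameterization: the intercept is the fitted price in 2012
first_six$years_since_2012 <- first_six$year - 2012
fit_centered <- lm(ticket_price ~ years_since_2012, data = first_six)

# The slope (change per year) is the same in both fits
slope_raw      <- coef(fit_raw)[["year"]]
slope_centered <- coef(fit_centered)[["years_since_2012"]]

# The centered intercept matches the raw model's prediction for 2012
pred_2012 <- unname(predict(fit_raw, newdata = data.frame(year = 2012)))
intercept_centered <- coef(fit_centered)[["(Intercept)"]]

print(c(slope_raw = slope_raw, slope_centered = slope_centered,
        pred_2012 = pred_2012, intercept_centered = intercept_centered))
```

Centering changes only how the intercept is read; R-squared, the slope, and its p-value stay the same.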

budget~year

plot(budget ~ year,
     data=Movies,
     pch=16,
     xlab = "Year",
     ylab = "Production Budget")

model3 <- lm(budget~year,Movies,model = TRUE)
summary(model3)
## 
## Call:
## lm(formula = budget ~ year, data = Movies, model = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.290  -29.638   -1.375   23.555  103.271 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.242e+04  1.501e+03  -8.275 7.55e-10 ***
## year         6.268e+00  7.511e-01   8.346 6.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.77 on 36 degrees of freedom
## Multiple R-squared:  0.6593, Adjusted R-squared:  0.6498 
## F-statistic: 69.65 on 1 and 36 DF,  p-value: 6.157e-10

The plot above shows the production budget rising over the years. The relationship is fairly strong (adjusted R^2 = 0.6498) and statistically significant (p = 6.157e-10, well below α = 0.001). The fitted line is Y = -1.242e+04 + 6.268e+00t + e, where t is the year, so the budget of the #1 movie rises by about 6.3 million per year on average.

runtime~budget

plot(runtime ~ budget,
     data=Movies,
     pch=16,
     ylab = "Runtime (min)",
     xlab = "Production Budget")

model4 <- lm(runtime~budget,Movies,model = TRUE)
summary(model4)
## 
## Call:
## lm(formula = runtime ~ budget, data = Movies, model = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.217 -17.367  -0.766   9.431  66.891 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 122.51749    6.41301  19.105   <2e-16 ***
## budget        0.12332    0.04724   2.611   0.0131 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.65 on 36 degrees of freedom
## Multiple R-squared:  0.1592, Adjusted R-squared:  0.1358 
## F-statistic: 6.815 on 1 and 36 DF,  p-value: 0.01309

The plot above shows runtime increasing with production budget. However, the relationship is weak (adjusted R^2 = 0.1358), although it is statistically significant at α = 0.05 (p = 0.01309). The fitted line is Y = 122.51749 + 0.12332b + e, where b is the production budget in millions, so each additional million of budget is associated with about 0.12 more minutes of runtime.
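For a regression with a single predictor, the multiple R-squared is the square of the Pearson correlation, so the multiple R-squared of 0.1592 in the output above corresponds to a correlation of about 0.40 between runtime and budget. The sketch below checks that identity on the six rows shown by head(Movies) only; the names are illustrative and the values will not match the 38-row results.

```r
# Six rows from head(Movies); the reported model uses all 38 rows.
first_six <- data.frame(
  runtime = c(152, 133, 135, 134, 146, 143),
  budget  = c(262, 200, 306, 58, 130, 225)
)

# Pearson correlation between budget and runtime, with its significance test
r_test <- cor.test(first_six$budget, first_six$runtime)
r <- unname(r_test$estimate)

# Multiple R-squared from the matching simple regression
r2 <- summary(lm(runtime ~ budget, data = first_six))$r.squared

# r squared and the regression R-squared should agree
print(c(r = r, r_squared = r^2, lm_r_squared = r2))
```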

Correlation Matrix

library(corrgram); library(corrplot)  # packages that provide corrgram() and corrplot()
corrplot(corrgram(Movies), method = "number")

This is a correlation matrix, which displays the correlation of each variable with every other variable. In this case, all the relationships are positive. The number shows the strength of the relationship (here, above 0.5 is treated as strong and below 0.5 as weak). Most of the correlations with “year” are strong (>0.5), whereas all the correlations with “runtime” are weak (<0.5).
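The same kind of matrix can be computed with base R's cor(), which makes it easy to look up a single coefficient by name. This is a sketch on the six rows shown by head(Movies), not the corrgram call used for the plot, and `first_six` is an illustrative name.

```r
# Six rows from head(Movies); the reported matrix uses all 38 rows.
first_six <- data.frame(
  year         = c(2017, 2016, 2015, 2014, 2013, 2012),
  MPAA         = c("PG-13", "R", "PG-13", "R", "PG-13", "PG-13"),
  runtime      = c(152, 133, 135, 134, 146, 143),
  budget       = c(262, 200, 306, 58, 130, 225),
  total_gross  = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600),
  numb_movies  = c(740, 736, 705, 707, 688, 669),
  ticket_price = c(8.97, 8.65, 8.43, 8.17, 8.13, 7.96),
  tickets_sold = c(1234, 1315, 1320, 1268, 1344, 1362)
)

# Keep only numeric columns; cor() cannot handle the MPAA text column
num_cols <- sapply(first_six, is.numeric)
corr_matrix <- cor(first_six[, num_cols])

# Full matrix, rounded for reading
print(round(corr_matrix, 2))

# A single coefficient, looked up by variable name
print(corr_matrix["year", "ticket_price"])
```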

Most Frequent/Highest Earners

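library(dplyr); library(ggplot2); library(ggrepel); library(scales)  # %>%, plotting, repelled labels, comma()
ggplot(Movies) + 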
  geom_point(aes(x=MPAA %>% as.factor(), y=total_gross)) +
  geom_hline(yintercept=Movies %>% 
               group_by(MPAA) %>% 
               arrange(total_gross %>% 
                         desc()
               ) %>% 
               top_n(1, total_gross) %>% 
               .$total_gross
  ) +
  geom_label_repel(data=. %>% 
                     group_by(MPAA) %>% 
                     top_n(3, total_gross),
                   aes(x=MPAA, y=total_gross, label=movie, color=MPAA)
  ) +
  scale_y_continuous(limits=c(Movies$total_gross %>% 
                                min(), 
                              Movies$total_gross %>% 
                                max()
  ),
  labels=comma
  ) + 
  labs(x='MPAA Rating',y='Total Gross',
       title='Most Frequent/Highest Earners'
  ) +
  theme(legend.position='none')

The ggplot above gives additional detail. It shows that the most frequent and the highest-earning movies are rated PG-13, including Star Wars: The Last Jedi, Star Wars: The Force Awakens, and Catching Fire. Some R-rated movies are high earners as well, but R-rated movies are high earners less often than PG-13 movies.

This could be the case because PG-13 movies are geared towards a larger audience, and are more inclusive for families.
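The frequency and top-value claims above can be checked with a quick tabulation alongside the plot. The sketch below uses only the six rows shown by head(Movies), with ratings as recorded in the data, so the counts are illustrative; the object names are assumptions.

```r
# Six rows from head(Movies); the plot uses all 38 rows.
first_six <- data.frame(
  movie = c("Star Wars: The Last Jedi", "Rogue One",
            "Star Wars: The Force Awakens", "American Sniper",
            "Catching Fire", "The Avengers"),
  MPAA  = c("PG-13", "R", "PG-13", "R", "PG-13", "PG-13"),
  total_gross = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600)
)

# How often each rating appears among the #1 movies
rating_counts <- table(first_six$MPAA)
print(rating_counts)

# Largest total_gross value within each rating
top_gross_by_rating <- tapply(first_six$total_gross, first_six$MPAA, max)
print(top_gross_by_rating)
```

Applied to the full Movies data frame, the table gives the counts behind "most frequent" and the tapply() result gives the values drawn as horizontal lines in the plot.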

Conclusion

Before doing any analysis, I expected that there would be an increase in production budget, total gross revenue, and ticket price. I also expected that these variables would have a very strong and significant positive correlation with each other and with the year variable (that is, each would increase over time). Statistical testing revealed a strong and significant correlation between these variables.
I had also predicted that there would be a strong and significant positive relationship between the runtime and production budget of a movie. This was only partly supported: the relationship was significant at α = 0.05 but weak (adjusted R^2 = 0.1358).

Overall, all variables relating to money (budget, ticket price, total gross revenue) increased over time. Part of this increase might be explained by inflation. The analysis also showed that the highest earners and most frequent #1 movies are rated PG-13.
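One way to see how much of the revenue growth comes from higher prices and how much from attendance is to compare total_gross with ticket_price and tickets_sold. In the six rows shown by head(Movies), dividing total_gross by ticket_price reproduces tickets_sold once scaled by 1,000. This suggests that total_gross is recorded in thousands and tickets_sold in millions, but that is an inference from the numbers, not something the source states. A minimal sketch, with illustrative names:

```r
# Six rows from head(Movies); units are inferred, not documented.
first_six <- data.frame(
  year         = c(2017, 2016, 2015, 2014, 2013, 2012),
  total_gross  = c(11071900, 11377700, 11129400, 10361300, 10924600, 10837600),
  ticket_price = c(8.97, 8.65, 8.43, 8.17, 8.13, 7.96),
  tickets_sold = c(1234, 1315, 1320, 1268, 1344, 1362)
)

# Tickets implied by revenue and price, rescaled to the units of tickets_sold
implied_tickets <- first_six$total_gross / first_six$ticket_price / 1000
print(round(implied_tickets, 1))

# Percent change from 2012 to 2017 in revenue, price, and attendance
newest <- first_six[first_six$year == 2017, ]
oldest <- first_six[first_six$year == 2012, ]
cols <- c("total_gross", "ticket_price", "tickets_sold")
pct_change <- 100 * (unlist(newest[cols]) / unlist(oldest[cols]) - 1)
print(round(pct_change, 1))
```

In these six rows the price rises while tickets sold fall, so a rise in total gross does not by itself show that attendance grew.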