As I finish my college career as an analytics major, I look forward to completing this final project. The topic I have chosen is movie data, specifically blockbuster movies. I have always been a pretty big movie fan, but some movies seem to make incredible amounts of money despite being average, in my opinion. I tend to prefer more traditional movies to some of the ones that come out now, and I do not think that a movie made up entirely of CGI and action is of the highest quality. Yet these movies seem to be the ones that make the most money. I want to look into what factors actually lead to a movie making money. Is there a relationship between a movie's budget and its gross sales? Does an MPAA rating that keeps part of the population from seeing a movie affect its gross? I am going to explore these and other lines of inquiry in an attempt to learn more about how and why movies make money. I think this will be an interesting and stimulating exercise.
Before I introduce the data itself, I want to highlight the key packages I will be using to analyze it. The key package is the tidyverse, which bundles many packages that are critical to this analysis. We are also including packages that will enable us to scrape data off of websites. Finally, we will add some visualization tools to help us create data tables and adjust the scales on our graphs.
The data that I will be working with comes from Kaggle. The dataset shows the top 10 domestic grossing movies of each year from 1977 to 2019. There are a great many interesting variables that will be critical to the data analysis, and it is important to understand the data before working with it. I downloaded the data to my OneDrive account as a CSV file, read the CSV file into R, and created a data frame from it. The data dictionary is listed below:
Data Dictionary:
Release_year – the year that the film was released
Rank_in_year – rank based on worldwide film gross
Imdb_rating – rating given to the film by IMDb (Internet Movie Database), average composite score
Mpaa_rating – rating given to the film by the Motion Picture Association of America
Film_title – name of film
Film_budget – budget allocated to film in USD
Length_in_min – film length in minutes
Domestic_distributor – domestic studio distributor
Worldwide_gross – worldwide sales of film in USD
Domestic_gross – domestic sales of film in USD
Genre_1 – first genre tagged on film as recorded on IMDb
Genre_2 – second genre tagged on film as recorded on IMDb
Genre_3 – third genre tagged on film as recorded on IMDb
This data has a good number of variables that will be helpful in our analysis. Getting this dataset from a website like Kaggle was very helpful because there was little to no cleaning to take care of before working with the data. The summary below, showing total domestic gross and average film length by release year, gives a better general sense of the data.
## # A tibble: 43 x 3
## release_year Total_Domestic_Gross Average_Length
## <dbl> <dbl> <dbl>
## 1 2019 4491159482 128.
## 2 2015 4100654301 129.
## 3 2018 3868388338 132.
## 4 2016 3655061193 116.
## 5 2017 3473700750 129.
## 6 2012 3197961125 130.
## 7 2013 3139525267 122.
## 8 2009 3025877904 131.
## 9 2010 2859374511 114.
## 10 2007 2656474335 122.
## # … with 33 more rows
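A yearly summary like the one above could be produced along these lines. This is only a minimal sketch: it assumes the data frame is called `Blockbusters` (the name used in the regression calls later in this report) with lowercase columns `release_year`, `domestic_gross`, and `length_in_min`, and the four rows of made-up values stand in for the real CSV.

```r
library(dplyr)

# Made-up values standing in for the Kaggle CSV (not real box office figures).
Blockbusters <- data.frame(
  release_year   = c(2019, 2019, 2018, 2018),
  domestic_gross = c(300, 200, 250, 100),
  length_in_min  = c(120, 130, 110, 100)
)

# One row per release year: total domestic gross and mean running time,
# sorted so the highest-grossing year comes first.
yearly_summary <- Blockbusters %>%
  group_by(release_year) %>%
  summarise(
    Total_Domestic_Gross = sum(domestic_gross),
    Average_Length       = mean(length_in_min)
  ) %>%
  arrange(desc(Total_Domestic_Gross))

print(yearly_summary)
```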
I am now going to pursue some direct analysis. I am going to bring forward concrete questions, provide visualizations or other forms of analysis, and critically analyze the results.
This visual shows that there is definitely a positive correlation between film budget and gross domestic sales. This relationship makes sense because the greater the budget, the more things filmmakers can try or have access to. Filmmakers could use their budget to get exclusive technology or locations that make their movies attractive to viewers. I included a straight regression line, which shows the relationship. Yet a smoothed curve shows that the strongest part of the relationship appears once the film budget climbs over 250 million. I also chose to include the release year to show how inflation has an effect on this as well. We can see that the disparity between budgets has grown substantially as the years have progressed; the spread in film budgets seems to have grown the most since about 2000. I think one key takeaway from this graph is that large film budgets tend to go along with strong sales. Yet a movie's budget is not the sole factor in determining how much money a movie makes.
This relationship is not a very dramatic one. Again I use both a straight regression line and a smoothed line. Gross sales go up only slightly as the IMDb rating increases, and the smoothed line is very flat for most of the rating scale. I would say that the only part of the rating scale that really affects gross sales is between 8 and 9. The IMDb rating is created from the ratings of IMDb members, and anyone can become an IMDb member and vote on these movies. Therefore, I feel it makes sense that the top end of the scale is where the increase occurs. When a great many people come to a consensus on the quality of a movie, it makes sense that the film would generate a great deal of sales. The large number of votes behind a high IMDb rating also suggests that many people have been generating conversation and opinions about the film.
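A plot of this kind, with a straight regression line and a smoothed line over the same points, might be built roughly as follows. This is a sketch under assumptions: the data are simulated, and the colours, labels, and the choice of loess for the smoother are my own guesses rather than the settings used for the original figure.

```r
library(ggplot2)

# Simulated data, for illustration only.
set.seed(1)
toy <- data.frame(
  film_budget  = runif(40, 1e7, 3e8),
  release_year = sample(1977:2019, 40, replace = TRUE)
)
toy$domestic_gross <- 5e7 + 1.2 * toy$film_budget + rnorm(40, sd = 4e7)

budget_plot <- ggplot(toy, aes(x = film_budget, y = domestic_gross)) +
  geom_point(aes(colour = release_year)) +                    # shade points by release year
  geom_smooth(method = "lm", se = FALSE, colour = "black") +   # straight regression line
  geom_smooth(method = "loess", se = FALSE, colour = "red") +  # smoothed curve
  scale_x_continuous(labels = scales::dollar) +                # dollar-formatted axes
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Film budget (USD)", y = "Domestic gross (USD)", colour = "Release year")
```

Printing `budget_plot` draws the figure; swapping `film_budget` for an IMDb rating column would give the same two-line comparison for ratings.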
## # A tibble: 25 x 5
## domestic_distrib… AVG_Domestic_Gr… AVG_Worldwide_G… Average_Length AVG_Rating
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Icon Productions 370274604 611486736 127 7.1
## 2 Lionsgate 348772653. 760570828. 133. 6.66
## 3 Walt Disney 310488900. 786154828. 116. 7.36
## 4 Summit Entertain… 292814173. 706841555 124. 4.87
## 5 IFC Films 241438208 368744044 95 6.5
## 6 DreamWorks 237096358 598534028. 110. 7.14
## 7 New Line Cinema 214840884. 550156429. 135. 7.82
## 8 Sony Pictures 213227450. 608578765. 125. 6.83
## 9 Warner Bros. 199257710. 501463332. 130. 7.15
## 10 Twentieth Centur… 194699363. 473871835. 117. 7.17
## # … with 15 more rows
## # A tibble: 25 x 5
## domestic_distrib… AVG_Worldwide_G… AVG_Domestic_Gr… Average_Length AVG_Rating
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The H Collective 870325439 2721100 123 6
## 2 Walt Disney 786154828. 310488900. 116. 7.36
## 3 Lionsgate 760570828. 348772653. 133. 6.66
## 4 Summit Entertain… 706841555 292814173. 124. 4.87
## 5 Icon Productions 611486736 370274604 127 7.1
## 6 Sony Pictures 608578765. 213227450. 125. 6.83
## 7 DreamWorks 598534028. 237096358 110. 7.14
## 8 New Line Cinema 550156429. 214840884. 135. 7.82
## 9 Warner Bros. 501463332. 199257710. 130. 7.15
## 10 Twentieth Centur… 473871835. 194699363. 117. 7.17
## # … with 15 more rows
These tables show us the big players in sales both at home and abroad. Many of the same film distributors do well both domestically and worldwide, which suggests that the American market does a good job of representing the major distributors across the globe. There were some interesting discrepancies. Almost all of the distributors appear in the top 10 both domestically and worldwide, but The H Collective appears only on the worldwide list, where it averages about 870 million worldwide against only about 2.7 million domestically, and IFC Films appears only on the domestic list. I think this shows that while the American market is a decent indicator of worldwide success, films can also succeed mainly outside the United States. This could also be due to a small sample size, with only a few highly successful international movies behind the average. Overall, the tables above suggest that domestic success for film distributors typically goes along with success in worldwide gross sales as well.
I also included genre to see what kind of relationship it has with MPAA rating and domestic sales.
The results in the graph show us that PG-13 movies are the ones that movie distributors should aim to create and distribute. They have historically done far better in domestic sales than any other rating. We can see some interesting trends with the primary genres as well. Distributors seem to make the most on PG-13 action movies, so they should seek to make these consistently. A PG-13 rating allows most of the population to go and see the movie. These movies can carry subtle messages with adult content, but what matters is that they can still include families with children. The other trend is that G and PG rated movies are the only ones that capture any real amount of the animation sales among top movies. The main takeaway from this visual is that film distributors need to remember how great sales have been historically for PG-13 movies, and PG-13 action movies specifically.
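One way to check the small-sample explanation is to carry a film count alongside the averages. The sketch below uses made-up numbers and hypothetical studio names; the column names follow the tables above, and `n_films` is a name I introduce for illustration.

```r
library(dplyr)

# Made-up values: Studio B has a single film with a very large worldwide gross.
toy <- data.frame(
  domestic_distributor = c("Studio A", "Studio A", "Studio A", "Studio B"),
  domestic_gross       = c(100, 200, 300, 5),
  worldwide_gross      = c(300, 500, 700, 900)
)

by_distributor <- toy %>%
  group_by(domestic_distributor) %>%
  summarise(
    n_films             = n(),                    # how many films sit behind each average
    AVG_Domestic_Gross  = mean(domestic_gross),
    AVG_Worldwide_Gross = mean(worldwide_gross)
  ) %>%
  arrange(desc(AVG_Worldwide_Gross))

print(by_distributor)
```

A distributor that tops the worldwide list with `n_films` of 1 or 2 is ranked on very little evidence, which is worth keeping in mind when reading the averages.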
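A chart of domestic sales by MPAA rating, split by primary genre, could be sketched like this. The stacked-bar layout is an assumption on my part, since the original figure is not reproduced here, and the values are made up.

```r
library(dplyr)
library(ggplot2)

# Made-up values for illustration.
toy <- data.frame(
  mpaa_rating    = c("G", "PG", "PG-13", "PG-13", "R"),
  genre_1        = c("Animation", "Animation", "Action", "Adventure", "Action"),
  domestic_gross = c(100, 150, 400, 250, 120)
)

# Total domestic gross for every rating and primary-genre combination.
rating_genre <- toy %>%
  group_by(mpaa_rating, genre_1) %>%
  summarise(total_domestic_gross = sum(domestic_gross), .groups = "drop")

# One bar per rating, stacked by primary genre.
rating_plot <- ggplot(rating_genre,
                      aes(x = mpaa_rating, y = total_domestic_gross, fill = genre_1)) +
  geom_col() +
  labs(x = "MPAA rating", y = "Total domestic gross", fill = "Primary genre")
```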
## # A tibble: 12 x 3
## genre_1 avg_domestic_gross avg_length
## <chr> <dbl> <dbl>
## 1 Family 301272901. 120.
## 2 Animation 256292633. 95.0
## 3 Action 233239842. 128.
## 4 Adventure 206503425. 132.
## 5 Musical 159978870 110
## 6 Drama 153289721. 129.
## 7 Crime 128114455. 128.
## 8 Comedy 125467402. 107.
## 9 Biography 102806396. 144.
## 10 Mystery 86303188 127
## 11 Sci-Fi 79567667 114
## 12 Horror 74301192. 113.
This is a great table because it shows that family-centered movies do best in America on average. Many of the previous visualizations showed how successful action movies are, and family and animated movies are not as common in the top rankings of film sales. Yet we can see from this table that when family or animated movies do make it among the top movies, they do very well: blockbuster family and animated movies have the highest average domestic gross. Action movies still do very well given how much of the market they seem to take up. It is easy to understand why so many animated films try to come out and be the next big thing every year. Another interesting facet is that both tend to be shorter than action movies, animated films especially (about 95 minutes on average versus 128). The sales figures make further sense because whole families go to see family and animated movies, which increases ticket sales greatly. The last key trend is that family and animation sit back to back in the table, just like action and adventure. The primary genre is usually followed by a second genre that is very close to it, which would explain why the two groups of similar genres are so close on these variables.
The second dataset that I am going to introduce has a decent amount of overlap with the first. The one key difference is that it records the lifetime grosses of the movies. I will be scraping this data directly from a table on the internet. This data should give me a great deal of flexibility to learn more about the reasons and ways that films generate sales.
We need to take this unstructured data from the internet, bring it into R, and then put it in a format that will allow us to perform analysis. First we need to retrieve the page from its link. After reading the header nodes to ensure that we have the right data, we store the page content in an object. Once that object is named and holds what we need, we can use readHTMLTable to extract the table we want. Once we have saved the result as another object, we can create a data frame and place all the necessary information inside it.
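The steps above can be sketched with the XML package, which provides `readHTMLTable`. To keep the example self-contained, it parses a small inline HTML string instead of downloading the real page; the page content, film titles, and object names are all hypothetical stand-ins.

```r
library(XML)

# Inline HTML standing in for the downloaded page (hypothetical content).
page_html <- '
<html><body>
<h1>All Time Worldwide Box Office</h1>
<table>
  <tr><th>Rank</th><th>Year</th><th>Movie</th><th>Worldwide Box Office</th></tr>
  <tr><td>1</td><td>2001</td><td>Example Film A</td><td>$1,234,567</td></tr>
  <tr><td>2</td><td>1999</td><td>Example Film B</td><td>$987,654</td></tr>
</table>
</body></html>'

# Parse the page content into a document object.
doc <- htmlParse(page_html, asText = TRUE)

# Read the header nodes to confirm this is the right page.
headers <- xpathSApply(doc, "//h1", xmlValue)

# Extract the first table on the page as a data frame of character columns.
lifetime <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

print(headers)
print(lifetime)
```

With a live page, the content would first be fetched over HTTP and then passed to `htmlParse` in the same way.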
## [1] "All Time Worldwide Box Office " "Quick Links"
## [3] "Search" "Most Anticipated Movies"
## [5] "Trending Movies" "Trending People"
## Rank Year Movie WorldwideBox Office
## 1 1 2009 Avatar $2,845,899,541
## 2 2 2019 Avengers: Endgame $2,797,800,564
## 3 3 1997 Titanic $2,207,986,545
## 4 4 2015 Star Wars Ep. VII: The Force Awakens $2,065,478,084
## 5 5 2018 Avengers: Infinity War $2,044,540,523
## 6 6 2015 Jurassic World $1,669,979,967
## DomesticBox Office InternationalBox Office
## 1 $760,507,625 $2,085,391,916
## 2 $858,373,000 $1,939,427,564
## 3 $659,363,944 $1,548,622,601
## 4 $936,662,225 $1,128,815,859
## 5 $678,815,482 $1,365,725,041
## 6 $652,306,625 $1,017,673,342
Before we can jump right into analysis, we need to clean this up a little so that it is ready to work with. We mainly need to convert the dollar amounts and years from character variables into numeric variables.
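A minimal sketch of that conversion in base R is shown below. The two rows reuse values from the scraped output above; the simplified column name `WorldwideBoxOffice` and the helper `to_dollars` are my own hypothetical choices.

```r
# Two rows copied from the scraped output, still stored as character strings.
lifetime <- data.frame(
  Year               = c("2009", "2019"),
  Movie              = c("Avatar", "Avengers: Endgame"),
  WorldwideBoxOffice = c("$2,845,899,541", "$2,797,800,564"),
  stringsAsFactors   = FALSE
)

# Hypothetical helper: strip dollar signs and commas, then convert to numeric.
to_dollars <- function(x) as.numeric(gsub("[$,]", "", x))

lifetime$WorldwideBoxOffice <- to_dollars(lifetime$WorldwideBoxOffice)
lifetime$Year               <- as.integer(lifetime$Year)

str(lifetime)
```

The same helper can be applied to the domestic and international columns.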
Are the classics still holding their own for sales or are the new movies squeezing them out to make their own fortunes larger?
Using strictly the new data, we can see the slightest trend of lifetime worldwide sales increasing with year of release. I think that this graph shows signs of hope for older movies. The smoothed curve shows that the mid 2000s and mid 2010s depart from the overall linear trend. It makes sense that lifetime gross is higher for release years closer to today, since exposure and access to films have increased greatly over the last 20 years. One interesting trend is the sharp uptick in lifetime gross from the mid 2010s to today, which could be attributed to increased access through multiple streaming services. Overall, this visual shows only a very small correlation between lifetime gross and year of release, so the top movies from previous years should not be too concerned about losing out in lifetime sales.
I am merging the data together on Movie and film_title to work with the data together in an attempt to compare the two while getting more meaningful analysis.
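A merge like this can be sketched with a dplyr join that matches `film_title` in one table to `Movie` in the other. The rows and the `lifetime_worldwide` column below are illustrative only, and an inner join is my assumption about the kind of merge used.

```r
library(dplyr)

# Illustrative rows; titles present in only one source are hypothetical.
Blockbusters <- data.frame(
  film_title   = c("Avatar", "Titanic", "Example Only In Kaggle"),
  rank_in_year = c(1, 1, 7),
  stringsAsFactors = FALSE
)
lifetime <- data.frame(
  Movie              = c("Avatar", "Titanic", "Example Only On Web"),
  lifetime_worldwide = c(2845899541, 2207986545, 1e9),
  stringsAsFactors   = FALSE
)

# Keep only films that appear in both sources.
combined <- inner_join(Blockbusters, lifetime, by = c("film_title" = "Movie"))

# Films from the first source that found no match in the second.
unmatched <- anti_join(Blockbusters, lifetime, by = c("film_title" = "Movie"))

print(combined)
print(unmatched)
```

Because the join requires exact matches, a title spelled differently in the two sources is silently dropped, so it is worth inspecting `unmatched` before analyzing the combined data.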
This visual allows us to see the data combined. We can see that the ranking within each year has a decent correlation with lifetime worldwide sales. The fitted straight line is much steeper than what we have seen before. The smoothed line provides some interesting information. It looks like the strongest part of the relationship comes when the rank in year improves from around 5 to near 3. This is interesting because I think it shows just how important being one of the top movies in your release year can be for lifetime financial success. Another interesting part of the smoothed line is how the lifetime worldwide sales seem to level out for the top 1 or 2 movies in every year. The main takeaway from this visual is that the better your movie performs in its release year, the better it will do in worldwide lifetime sales.
I think this visual shows us that there is a very small correlation between the lifetime worldwide sales of a film and the IMDb rating. One thing we need to recall is that the IMDb rating is created by IMDb members. Their rating means that they at least saw the movie and contributed to sales, since people have to go and see the movie to be able to rate it. Also, people are more inclined to see a movie if it has a high rating, but some people may also want to see one with a low rating to form an opinion of their own. I think that film distributors should not really be concerned about the IMDb rating because it will not have a great effect on lifetime sales.
We are going back to track performance within the year itself and our original data.
I will be running multiple regressions to investigate this. We are going to start with film budget and Motion Picture Association of America rating. We suspect that film budget will have an effect, and I am curious to see what the rating gives us. We know that there is multicollinearity involving domestic gross sales and rank in year, so neither is used as a predictor here: domestic sales are a component of worldwide sales and so have an obvious effect on them, and the rank is generated based on worldwide sales.
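As a rough sketch of how a model of this form is fit, the example below runs `glm` on simulated data. With the default gaussian family this is an ordinary linear regression, and because `mpaa_rating` is a factor, R estimates one coefficient per rating level relative to the first level (G), which is why no separate G row appears in the coefficient table. The data frame `toy` and its values are simulated assumptions, not the Kaggle data.

```r
# Simulated data, for illustration only.
set.seed(42)
n <- 200
toy <- data.frame(
  film_budget = runif(n, 1e7, 3e8),
  mpaa_rating = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)
toy$worldwide_gross <- 1e8 + 4 * toy$film_budget + rnorm(n, sd = 2e8)

# Gaussian glm: worldwide gross explained by budget plus rating dummies.
fit1 <- glm(worldwide_gross ~ film_budget + mpaa_rating, data = toy)

# Coefficient table; rating rows are differences from the baseline level "G".
coef_table <- summary(fit1)$coefficients
print(round(coef_table, 4))
```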
##
## Call:
## glm(formula = worldwide_gross ~ film_budget + mpaa_rating, data = Blockbusters)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -846854464 -133242417 -48831621 101260371 1635269968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.809e+07 5.550e+07 1.767 0.0779 .
## film_budget 3.905e+00 1.851e-01 21.097 <2e-16 ***
## mpaa_ratingPG 5.595e+07 5.697e+07 0.982 0.3265
## mpaa_ratingPG-13 8.546e+07 5.576e+07 1.533 0.1261
## mpaa_ratingR -9.076e+05 5.920e+07 -0.015 0.9878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.275035e+16)
##
## Null deviance: 6.7344e+19 on 428 degrees of freedom
## Residual deviance: 2.6606e+19 on 424 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 17817
##
## Number of Fisher Scoring iterations: 2
##
## Regression Results
## ==================================================
## Dependent variable:
## --------------------------------
## worldwide_gross
## --------------------------------------------------
## film_budget 3.905*** (0.185)
## mpaa_ratingPG 55,953,586.000 (56,966,019.000)
## mpaa_ratingPG-13 85,464,100.000 (55,755,773.000)
## mpaa_ratingR -907,558.300 (59,200,818.000)
## Constant 98,090,216.000* (55,499,557.000)
## --------------------------------------------------
## Observations 429
## Log Likelihood -8,903.628
## Akaike Inf. Crit. 17,817.260
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This regression gives us some good information about how important the film budget is. The film budget coefficient is highly significant, and its estimate of about 3.9 means that each additional budget dollar is associated with roughly 3.9 more dollars of worldwide gross, holding the rating fixed. None of the MPAA rating levels is significant at the 5% level, although PG-13 comes closest (p = 0.126). We are going to see if we can improve the model with another variable. I am interested to see whether adding film budget squared will help us. Is there a curve in film budget that we need to account for?
##
## Call:
## glm(formula = worldwide_gross ~ film_budget + I(film_budget *
## film_budget) + mpaa_rating, data = Blockbusters)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -886416124 -139966343 -53087347 105103615 1616626088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.231e+08 5.940e+07 2.072 0.0389 *
## film_budget 3.325e+00 5.266e-01 6.314 6.9e-10 ***
## I(film_budget * film_budget) 2.395e-09 2.035e-09 1.177 0.2399
## mpaa_ratingPG 4.811e+07 5.733e+07 0.839 0.4019
## mpaa_ratingPG-13 8.213e+07 5.580e+07 1.472 0.1418
## mpaa_ratingR -9.648e+06 5.964e+07 -0.162 0.8716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.269338e+16)
##
## Null deviance: 6.7344e+19 on 428 degrees of freedom
## Residual deviance: 2.6519e+19 on 423 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 17818
##
## Number of Fisher Scoring iterations: 2
We can see in the summary that our intercept coefficient now has a significant p-value (0.0389). However, the squared budget term is not significant (p = 0.24) and the AIC barely moved, from 17817 to 17818, so the model did not clearly improve. I want to take out MPAA rating because none of its levels is significant; rated R in particular has a p-value of 0.87. Hopefully, this will give us a simpler model.
Film Budget and Film Budget Squared
##
## Call:
## glm(formula = worldwide_gross ~ film_budget + I(film_budget *
## film_budget), data = Blockbusters)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -879619010 -132042448 -54723324 106230190 1625987650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.517e+08 2.421e+07 6.269 8.93e-10 ***
## film_budget 3.606e+00 5.045e-01 7.148 3.85e-12 ***
## I(film_budget * film_budget) 1.993e-09 2.027e-09 0.983 0.326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.408696e+16)
##
## Null deviance: 6.7482e+19 on 429 degrees of freedom
## Residual deviance: 2.7365e+19 on 427 degrees of freedom
## AIC: 17866
##
## Number of Fisher Scoring iterations: 2
We can see that the p-value on the constant dropped again, although that by itself says little about how well the model fits. The squared budget term is still not significant (p = 0.326), and the AIC of 17866 is not lower than before. This model uses all 430 observations rather than 429, so the AIC values are not strictly comparable. The film budget coefficient remains highly significant, so I am happy that this result shows the clear relationship between worldwide gross and film budget.
This exercise was quite an interesting practice of many of the concepts I have pursued in my analytics major. I think that the original question I posed about what makes films financially successful has a fairly clear answer: the films that have the biggest budgets typically make the most money. As a more traditional movie fan, I just hope that honest and creative cinema is not falling by the wayside compared to big budget films. I hope that the movie industry bounces back strongly from this pandemic.
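Since AIC is only comparable between models fit to the same rows, one option is to drop incomplete rows first and then fit every candidate model to that common subset. The sketch below does this on simulated data with one missing rating; the data frame `toy` and the model names `m1` to `m3` are hypothetical.

```r
# Simulated data with one missing MPAA rating, for illustration only.
set.seed(7)
n <- 150
toy <- data.frame(
  film_budget = runif(n, 1e7, 3e8),
  mpaa_rating = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)
toy$worldwide_gross <- 1e8 + 4 * toy$film_budget + rnorm(n, sd = 2e8)
toy$mpaa_rating[1] <- NA

# Keep only complete rows so every model sees exactly the same observations.
complete <- toy[complete.cases(toy), ]

m1 <- glm(worldwide_gross ~ film_budget + mpaa_rating, data = complete)
m2 <- glm(worldwide_gross ~ film_budget + I(film_budget^2) + mpaa_rating, data = complete)
m3 <- glm(worldwide_gross ~ film_budget + I(film_budget^2), data = complete)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(m1, m2, m3)
print(aic_table)
```

Differences of only a point or two in AIC, like the 17817 versus 17818 seen earlier, are usually too small to prefer one model over another.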