Sorry that I wasn’t able to get the rough draft of the final project as far along as I would have wanted, I had a lot come up this week. The overall direction of my project is here – the investigation of the research questions etc. The project definitely still needs more fleshing out in terms of providing supporting writing and overall direction, and I’ll be sure to take care of that in the coming days and maybe ask for another review if that is possible. All feedback is still greatly appreciated!
The dataset is mostly clean, however a few columns could be represented in a better format. Rating, genre, and country should all be factors as they are categorical options rather than a continuous spectrum.
movies <-mutate(movies_raw, rating =factor(rating),genre =factor(genre),rating =factor(rating))# head(movies, 10)
Next, the column “released” currently has data which is a combination of a date and a country, like June 13, 1980 (United States). In order to tidy the data, we should split this column into a release_date column and a release_country column.
movies <-separate(movies, released, into =c("release_date", "release_country"), sep =" \\(|\\)") %>%mutate(release_country =factor(release_country), release_date =mdy(release_date))
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [5729,
5731].
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = mdy(release_date)`.
Caused by warning:
! 10 failed to parse.
# head(movies, 10)
We note that there are two instances where released is empty, and then 10 instances where mdy fails to parse the date because only the year or nothing is given:
To account for this, we can simply make the release date be equal to the release year starting at January 1st which is given to us as the column “year”.
Since budgets and grosses are not inflation adjusted, it is difficult to compare their trends over time. So, we should adjust these numbers for inflation, base lining them for what their value would have been in 2020.
library(quantmod)
Loading required package: xts
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
######################### Warning from 'xts' package ##########################
# #
# The dplyr lag() function breaks how base R's lag() function is supposed to #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
# source() into this session won't work correctly. #
# #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
# dplyr from breaking base R's lag() function. #
# #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
# #
###############################################################################
Attaching package: 'xts'
The following objects are masked from 'package:dplyr':
first, last
Loading required package: TTR
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
This dataset is a list of 7,688 movies released from 1980 to 2020. For each year, the most popular 200 (with some exceptions) movies from that year are recorded. For each movie, there are 16 attributes recorded:
name: the title of the movie (string)
rating: the rating of the movie (R, PG, etc.) (factor)
genre: the main genre of the movie (factor)
year: year of release (double)
release_date: the date the movie was first released (date)
release_country: the country the movie was first released in (factor)
score: IMDb user rating (double)
votes: number of user votes (double)
director: the director (string)
writer: writer of the movie (string)
star: main actor/actress (string)
country: the country of origin (factor)
budget: the inflation adjusted budget of a movie. Some movies don’t have this, so it appears as 0 (double)
gross: the inflation adjusted box office revenue of the movie in the US (double)
company: the production company (string)
runtime: duration of the movie (double)
General Descriptive Analysis
First, we can calculate the distribution of ratings:
ggplot(rating_prop, aes(x =reorder(rating, -count), y = count)) +geom_bar(stat ="identity", fill ="blue") +labs(title ="Frequency of Movies by Rating",x ="Rating",y ="Count")
We see that the most common ratings are “R”, “PG-13”, and “PG”. Additionally, we see that there are 77 movies where the rating was not recorded.
year_counts <- movies %>%group_by(year) %>%summarise(count =n())ggplot(year_counts, aes(x = year, y = count)) +geom_bar(stat ="identity", fill ="blue") +labs(title ="Frequency of Movies by Year",x ="Year",y ="Count")
We can see that the earliest years and 2020 do not have a full 200 movies recorded, likely due to a lack of availability of data to scrape.
mean(movies$votes, na.rm =TRUE)
[1] 88108.5
quantile(movies$votes, na.rm =TRUE)
0% 25% 50% 75% 100%
7 9100 33000 93000 2400000
sd(movies$votes, na.rm =TRUE)
[1] 163323.8
The mean number of votes cast for a movie was 88 thousand, which is significantly higher than the median votes cast of 33 thousand. The fewest votes cast was 7, while the maximum votes cast was 2,400,000. The standard deviation was quite considerable at 163,323, indicating that the data is not well concentrated.
We see that the average budget of a movie is $49,356,607 while the median budget is significantly lower at $20,500,000. The smallest movie budget was $5053 while the largest was $380,210,877.
We see that the average gross of a movie is $106,885,865 while the median gross is significantly lower at $33,228,566. The smallest movie gross was $501 while the largest was $3,565,566,383.
mean(movies$runtime, na.rm =TRUE)
[1] 107.2616
quantile(movies$runtime, na.rm =TRUE)
0% 25% 50% 75% 100%
55 95 104 116 366
sd(movies$runtime, na.rm =TRUE)
[1] 18.58125
The average length of a movie was 107.2616 minutes, while the median was lower at 104 minutes. The shortest movie was 55 minutes, and the longest was 366.
# A tibble: 60 × 3
country count percentage
<chr> <int> <dbl>
1 United States 5475 71.4
2 United Kingdom 816 10.6
3 France 279 3.64
4 Canada 190 2.48
5 Germany 117 1.53
6 Australia 92 1.2
7 Japan 81 1.06
8 India 62 0.81
9 Italy 61 0.8
10 Spain 47 0.61
# ℹ 50 more rows
top_countries =head(country_counts, 15)ggplot(top_countries, aes(x =reorder(country, -count), y = count)) +geom_bar(stat ="identity", fill ="blue") +labs(title ="Frequency of Movies by Primary Genre",x ="Genre",y ="Count") +theme(axis.text.x =element_text(angle =90, hjust =1))
We see that while there are a total of 60 different countries of origin captured by the dataset, by far the majority are from a few different countries. The US alone accounts for 71% of the movies, and in fact just the top 5 countries account for 90% of the total.
Question Driven Analysis
1. The Economics of Movie Making Over Time
Research Question: How have the economics of movie making changed over time? Are online streaming services like Netflix detracting from the appeal of seeing a movie at theaters, and thereby reducing the box office returns of movies?
Let’s explore how budget has changed over time. First, we can try simply creating a scatterplot of budget vs time.
ggplot(movies, aes(x = release_date, y = budget)) +geom_point() +labs(title ="Scatterplot of Release Date vs Budget",x ="Release Date",y ="Budget")
While the data does seem to generally trend up, the data is not cleanly captured. Instead, let’s bucket the movies by year and graph the median budget for each year:
movies2 <- movies[!is.na(movies$budget), ]med_budgets <- movies2 %>%group_by(year) %>%summarise(median_budget =median(budget))ggplot(med_budgets, aes(x = year, y = median_budget)) +geom_point() +labs(title ="Scatterplot of Release Year vs Median Budget",x ="Release Year",y ="Median Budget")
What we actually see is that when adjusting for inflation, the median budget of the top 200 movies (where budget is recorded) steadily rose from 1980 to its peak in 1999, after which it began to decrease again. Let’s overlay the trend in median gross on top of this data to see if we notice anything:
movies2 <- movies[!is.na(movies$gross), ]movies2 <- movies2[!is.na(movies2$budget), ]movies2 <-filter(movies2, year !=2020)med_budg_gross <- movies2 %>%group_by(year) %>%summarise(median_budget =median(budget), median_gross =median(gross))ggplot(med_budg_gross, aes(x = year)) +geom_line(aes(y = median_budget, color ="Budget"), size =2) +geom_line(aes(y = median_gross, color ="Gross"), size =2) +labs(title ="Scatter Plot of Movie Budget and Gross vs Time",x ="Release Date",y ="Value",color ="Legend") +scale_color_manual(values =c("Budget"="blue", "Gross"="red"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Now we can see that generally movies are becoming much more profitable, as the inflation adjusted gross of top movies has become significantly greater over time. Specifically, we see that gross has steadily rose since 1990 at a rapid rate, while budget has remained relatively steady and in fact decreased since 1990.
In conclusion, it appears that the movie industry is not struggling due to the advent of more streaming services. Rather, both in absolute value and value relative to the budget of the movie, US box office returns are seeing all time highs at present, meaning that as of 2020 movies were more profitable than ever in theaters.
2. Influence of Various Factors on Movie Score
Research Question: What factors tend to make a movie well received? Are there associations with budget, profit, or movie length?
First, let’s analyze the impact of budget on movie score.
clean_movies <- movies %>%filter(!is.na(gross) &!is.na(budget) &!is.na(score))ggplot(clean_movies, aes(x = budget, y = score)) +geom_point(alpha =0.5) +geom_smooth(method ="lm", se =FALSE, color ="blue") +labs(title ="Scatter Plot of Movie Score vs Budget",x ="Budget",y ="Score")
`geom_smooth()` using formula = 'y ~ x'
Fitting a linear to the data, we see a slight general upward trend correlating increasing budget with increasing movie score. While there are significantly more low budget movies that are flops than high budget, there are also simply more low budget movies in general. We also see that the highest scoring movies don’t necessarily require a high budget.
Next, let’s shift our focus instead to the correlation of the movie score with profit, as we may profit to be a better indicator of how well received a movie was.
clean_movies$profit <- clean_movies$gross - clean_movies$budgetclean_movies2 <- clean_movies %>%filter(profit <=2e+09) # remove outliers for better fitggplot(clean_movies2, aes(x = profit, y = score)) +geom_point(alpha =0.5) +geom_smooth(method ="lm", se =FALSE, color ="blue") +labs(title ="Scatter Plot of Movie Score vs Profit",x ="Profit",y ="Score")
`geom_smooth()` using formula = 'y ~ x'
Now, we see that the upward trend is significantly more apparent. Almost all movies that were true flops, receiving scores under 4, also had a small gross under 250 million dollars. Furthermore, the highest scoring movies almost always had scores greater than 6.
To further our analysis, we can increase the dimensional of our graph by coloring movies by genre. Looking at the overall distribution of genres, we see that action, comedy, and drama are by far the most common.
genre_prop <- movies %>%group_by(genre) %>%summarise(count =n()) %>%mutate(percentage =round(count /sum(count) *100, 2))ggplot(genre_prop, aes(x =reorder(genre, -count), y = count)) +geom_bar(stat ="identity", fill ="blue") +labs(title ="Frequency of Movies by Primary Genre",x ="Genre",y ="Count") +theme(axis.text.x =element_text(angle =45, hjust =1))
To increase the efficacy of our analysis, let’s reduce the dimensions to only label the big 3 genres, and classify all the others under the umbrella of “other”.
clean_movies2 <- clean_movies %>%mutate(genre =recode(genre, "Comedy"="Comedy", "Action"="Action", "Drama"="Drama", .default ="Other"))genre_prop <- clean_movies2 %>%group_by(genre) %>%summarise(count =n())ggplot(genre_prop, aes(x =reorder(genre, -count), y = count)) +geom_bar(stat ="identity", fill ="blue") +labs(title ="Frequency of Movies by Primary Genre",x ="Genre",y ="Count") +theme(axis.text.x =element_text(angle =45, hjust =1))
Now, we can separate our scatterplot of profit vs score based on the genre
ggplot(clean_movies2, aes(x = profit, y = score)) +geom_point(alpha =0.25) +labs(title ="Scatter Plot of Movie Score vs Gross",x ="Profit",y ="Score") +facet_wrap(~genre, scales ="free_y")
We see that action and other genres have a much greater spread to higher profits (big hits) than comedy and drama, which are more well concentrated below $100,000,000. This may be due to action having a higher mass appeal than comedies or drama, so the ceiling of profit for an action movie is much higher. We can also simply breakdown the scores of movies by genre using the interquantile range:
We observe that genre does not have a strongn association with score — movies regardless of genre tend to average around a 6.5, with this varying by just +/-1 from 25% to 75%.
We could also test to see if the length of a movie impacts its score. Let’s create a graph of runtime vs score to see how strong a correlation exists:
clean_movies <- movies %>%filter(!is.na(score) &!is.na(runtime))ggplot(movies, aes(x = runtime, y = score)) +geom_point(alpha =0.25) +labs(title ="Scatter Plot of Runtime vs Score",x ="Runtime",y ="Score")
We observe that there is some positive association that as runtime increases the odds a movie truly bombs decreases. In particular, movie scores become more tightly concentrated at a higher score as runtime increases from 60 to 150 minutes, with the frequency of movies bombing greatly decreasing.
In general, we have observed that as movie profit and movie runtime increases, movie score also generally increases. Movies which make more money and have a greater runtime are substantially less likely to have been flops than movies with low profit or short runtime. Genre did not have a large impact on movie score.
Bibliography
Baumer, Benjamin S., et al. Modern Data Science with R. CRC Press, Taylor & Francis Group, 2021.
Grijalva, Daniel. “Movie Industry.” Kaggle, 23 July 2021, www.kaggle.com/datasets/danielgrijalvas/movies/data.
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Wickham, Hadley, et al. R for Data Science: Import, Tidy, Transform, Visualie, and Model Data. O’Reilly, 2023.