Assignment 6

Ethical Web Scraping - Box Office Mojo

For this assignment, I’m looking at Box Office Mojo, a website run by IMDb that tracks and analyzes box office data for movies.

Scraping Box Office Mojo

To collect my data, I wrote a function that scrapes the Box Office Mojo pages for each year from 2020 through 2024. The website stores its data in HTML tables, so I was able to import them into my global environment in RStudio. Once I did so, I had columns for Rank, Release (movie title), Gross, Theaters, Total Gross, Release Date, Distributor, and Estimated.
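A minimal sketch of what a yearly scraping function could look like is below. The URL pattern, the "table" selector, the pause length, and the helper name scrape_year are all assumptions for illustration, not the original script. The download itself is left commented out because it needs network access and the rvest package.

```r
# Sketch only: the URL pattern and the "table" selector are assumptions.
base_url <- "https://www.boxofficemojo.com/year/"
years <- 2020:2024
urls <- paste0(base_url, years, "/")

# Hypothetical helper: download one yearly page and return its first table.
scrape_year <- function(url, pause = 2) {
  Sys.sleep(pause)                            # wait between requests to be polite
  page <- rvest::read_html(url)               # requires the rvest package
  tbl <- rvest::html_element(page, "table")   # first <table> on the page
  rvest::html_table(tbl)                      # one row per movie
}

# Not run here, because it needs network access:
# yearly_tables <- lapply(urls, scrape_year)
print(urls)
```

Pausing between requests keeps the load on the website low, which fits the ethical scraping goal of the assignment.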

From there, my variables needed a lot of cleanup. All the column names with more than one word had spaces in them, so I replaced each space with a ‘.’ to make the columns easier to reference in my code. I also needed to remove the commas and dollar signs from Total Gross and Gross and change their data type from chr to num. Additionally, I had to remove the commas from Theaters and change its data type from chr to num as well.
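The cleanup steps above can be sketched in base R as follows. The rows are made up to mimic the raw scraped table, and the helper name to_number is hypothetical.

```r
# Made-up rows that mimic the raw scraped table (all values are illustrative).
raw <- data.frame(
  "Rank" = c("1", "2"),
  "Release" = c("Movie A", "Movie B"),
  "Gross" = c("$1,234,567", "$89,000"),
  "Theaters" = c("4,012", "350"),
  "Total Gross" = c("$2,345,678", "$120,500"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Replace spaces in column names with "." so the columns work with `$`.
names(raw) <- gsub(" ", ".", names(raw), fixed = TRUE)

# Hypothetical helper: strip "$" and ",", then convert chr to num.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))
raw$Gross <- to_number(raw$Gross)
raw$Total.Gross <- to_number(raw$Total.Gross)
raw$Theaters <- to_number(raw$Theaters)

str(raw)
```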

Because a lot of my questions for this topic had to do with years, I created a new vector: Year. To do so, I took the base URL for the yearly list pages and appended each year in the sequence 2020-2024, which my scraping function cycles through when run. For each page, I took the H1 element, which gives “Domestic Box Office For” followed by the year, and deleted all the text aside from the year. Now that I had my year, I was able to take an existing vector (Release.Date) and turn it into the full release date by combining it with my Year vector using lubridate.
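Here is a rough sketch of the year and date steps, run on illustrative strings rather than the live page. The report used lubridate for the date; this version swaps in base R so it runs without extra packages. The "Mon D" format of Release.Date is an assumption.

```r
# Illustrative inputs; the "Mon D" release date format is an assumption.
h1_text <- "Domestic Box Office For 2023"
release_date <- c("Jan 6", "Jul 21", "Dec 22")

# Keep only the digits of the heading to get the year.
year <- as.integer(gsub("[^0-9]", "", h1_text))

# Split "Jul 21" into month and day, then build an ISO date string.
parts <- strsplit(release_date, " ", fixed = TRUE)
month_num <- match(vapply(parts, `[`, character(1), 1), month.abb)
day_num <- as.integer(vapply(parts, `[`, character(1), 2))
full_date <- as.Date(sprintf("%04d-%02d-%02d", year, month_num, day_num))

# With lubridate installed, lubridate::mdy(paste(release_date, year))
# should give the same dates.
print(full_date)
```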

I saved my data to a CSV file so that rerunning the analysis doesn’t repeatedly send requests to the website.
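One way to set this up is to write the file only if it doesn't exist yet and read from it afterwards. This is a sketch under assumptions: the file name is hypothetical, the data frame is a made-up stand-in for the cleaned scrape result, and a temporary directory is used so the example is safe to run.

```r
# Hypothetical file name; tempdir() keeps the sketch from touching real files.
csv_path <- file.path(tempdir(), "box_office_2020_2024.csv")

# Stand-in for the cleaned scrape result (values are made up).
movies <- data.frame(
  Release = c("Movie A", "Movie B"),
  Year = c(2020L, 2021L),
  Total.Gross = c(2345678, 120500)
)

# Only write the file when it doesn't exist yet; later runs just read it.
if (!file.exists(csv_path)) {
  write.csv(movies, csv_path, row.names = FALSE)
}
cached <- read.csv(csv_path)
print(cached)
```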

Analysis of Data

Here, we’re looking at the average total gross by year. Initially, I wanted to see how COVID-19 impacted the movie industry, especially since this dataframe tracks the number of theaters each movie was released in and the gross revenue it generated. Within the five-year range, 2020 had the lowest average gross, followed closely by 2021. This result honestly surprised me at first, since I was expecting 2020 to be higher and 2021 to drop. However, lockdowns and quarantine began nationwide in March 2020, so most of that year’s releases were affected, which is consistent with 2020 being the lowest year. It doesn’t surprise me that the average jumps back up in 2022/2023 and then drops back down in 2024; people were probably taking every opportunity to get out of the house, and moviegoing has since stabilized. It might be beneficial, next time we run this analysis, to go back to 2019 to see whether moviegoing increased overall relative to before COVID.
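The yearly averages could be computed along these lines. The numbers are made up (in millions) purely to show the mechanics, and the column names follow the cleaned names described earlier.

```r
# Made-up values, in millions, just to illustrate the calculation.
movies <- data.frame(
  Year = c(2020, 2020, 2021, 2021, 2022, 2022),
  Total.Gross = c(10, 30, 20, 40, 100, 60)
)

# Mean total gross for each year.
avg_by_year <- aggregate(Total.Gross ~ Year, data = movies, FUN = mean)
print(avg_by_year)
```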

With this graph, I wanted to compare how the distributions differed between 2021 and 2022. In 2021, the range is smaller than in 2022, and the number of outliers falling outside the range appears to grow as we get into 2022. This could be due to the lack of theaters showing movies through the end of 2021, and the gradual move back into society as 2022 progressed.
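As a sketch of how the spread and outliers for the two years could be checked numerically, the example below uses made-up values and the standard 1.5 x IQR boxplot rule. Whether the original graph was a boxplot is an assumption.

```r
# Made-up values; 2022 includes one extreme value on purpose.
movies <- data.frame(
  Year = c(rep(2021, 5), rep(2022, 6)),
  Total.Gross = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 100)
)
both <- movies[movies$Year %in% c(2021, 2022), ]

# Interquartile range per year as a simple measure of spread.
spread <- tapply(both$Total.Gross, both$Year, IQR)

# Number of points a boxplot would flag as outliers in each year.
outlier_counts <- tapply(
  both$Total.Gross, both$Year,
  function(x) length(grDevices::boxplot.stats(x)$out)
)
print(spread)
print(outlier_counts)

# Uncomment to draw the side-by-side boxplots:
# boxplot(Total.Gross ~ Year, data = both)
```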

2023 was a big year for movies, and I wanted to see how the breakdown by month looked and how seasonality affected movie gross by release date. The graph shows an obvious seasonality across release months. Starting low in January, total gross increases until it peaks in July and then drops suddenly; the cycle repeats from August through December, when it peaks again before dropping in January. This seasonality appears to be increasing, and could be attributed to award seasons or other factors; further research could be worthwhile here!
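A monthly breakdown like this one could be built as sketched below. The values are made up, and summing total gross per release month (rather than averaging) is an assumption about how the graph was made.

```r
# Made-up values, in millions; one 2022 row is included to show the filter.
movies <- data.frame(
  Release.Date = as.Date(c("2022-07-08", "2023-01-06", "2023-01-20",
                           "2023-07-21", "2023-07-28", "2023-12-22")),
  Total.Gross = c(50, 5, 7, 600, 300, 120)
)

# Keep only 2023 releases and pull out the release month as a number.
y2023 <- movies[format(movies$Release.Date, "%Y") == "2023", ]
y2023$Month <- as.integer(format(y2023$Release.Date, "%m"))

# Total gross per release month.
gross_by_month <- aggregate(Total.Gross ~ Month, data = y2023, FUN = sum)
print(gross_by_month)
```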

With this graph, I wanted to see if there was a correlation between the number of theaters a movie was released in and the total gross it generated. Before running it, I believed the two would be correlated: the higher the number of theaters, the higher the total gross. It would make sense that nationally released movies would make more than local or smaller, indie films. After running the analysis, it seemed like my assumption was right. Films released in a higher number of theaters saw a general increase in total gross.
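To put a number on the relationship, a correlation coefficient could be computed as in this sketch. The data are made up, and reporting Spearman alongside Pearson is a suggestion rather than what the original analysis did; it can help because gross revenue is often heavily skewed.

```r
# Made-up values: theater counts and total gross in millions.
movies <- data.frame(
  Theaters = c(100, 500, 1500, 3000, 4200),
  Total.Gross = c(0.2, 1.1, 9, 45, 160)
)

# Pearson measures linear association; Spearman measures monotone association.
r_pearson <- cor(movies$Theaters, movies$Total.Gross)
rho_spearman <- cor(movies$Theaters, movies$Total.Gross, method = "spearman")
print(c(pearson = r_pearson, spearman = rho_spearman))
```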

Initially, when I ran this graph, I included every movie on the list and its distributor, which brought up an overwhelming number of distributors. So, to narrow that down, I wanted to see only the top-performing distributors on the list. It brought up a list of distributors I know, with one or two that I didn’t. This leads me to believe that while a small portion of distributors make an immense amount of revenue, there are still a lot of distributors that appear on the list and contribute, whether they’re smaller, indie, or just don’t crack the top ten.
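Narrowing the list to the top distributors could look something like this. The distributor names and values are placeholders, and ranking by summed total gross is an assumption about how "top performing" was defined.

```r
# Placeholder distributors and made-up values, in millions.
movies <- data.frame(
  Distributor = c("Studio A", "Studio B", "Studio A",
                  "Studio C", "Studio B", "Studio D"),
  Total.Gross = c(100, 40, 50, 5, 30, 1)
)

# Sum total gross per distributor, sort from highest to lowest, keep the top ten.
by_dist <- aggregate(Total.Gross ~ Distributor, data = movies, FUN = sum)
by_dist <- by_dist[order(by_dist$Total.Gross, decreasing = TRUE), ]
top_distributors <- head(by_dist, 10)
print(top_distributors)
```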