Proposal: We want to use publicly available movie data from websites such as IMDB, Box Office Mojo and Rotten Tomatoes to build a model that predicts movie revenue.

Data: For the prototype, we scraped data from the IMDB website for over 27,000 movies released in the last 10 years, using lxml in Python. We then filtered the data down to US-based, English-language movies, leaving a total of 8,984 movies for our analysis. We extracted 22 features, including title, release date, genres, cast, director, user ratings, meta score, plot keywords, runtime, budget, opening weekend revenue and gross revenue.
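The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the record layout and the field names "country" and "language" are assumptions, and the sample rows are made up.

```python
# Hypothetical scraped records; the field names are assumptions for illustration.
movies = [
    {"title": "A", "country": "USA", "language": "English"},
    {"title": "B", "country": "France", "language": "French"},
    {"title": "C", "country": "USA", "language": "Spanish"},
    {"title": "D", "country": "UK", "language": "English"},
]


def is_us_english(movie):
    """Return True only for movies listed as US-based and English-language."""
    return movie.get("country") == "USA" and movie.get("language") == "English"


# Keep only the rows that pass both conditions.
us_english_movies = [m for m in movies if is_us_english(m)]
```

Using `dict.get` means a record with a missing field is simply dropped rather than raising an error, which is one reasonable choice for scraped data with gaps.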

The code for scraping, cleaning and analyzing this data can be found here.

Analysis: We found several interesting patterns and important factors that can affect revenue. We discuss two of them here: the release date and the genre of the movie.

Plot I

We looked at several aspects of how the release date affects revenue by plotting the average budget, opening weekend revenue and gross revenue over the years, and over different months and days of the year. Plot I shows some important patterns in the data. When the gross and opening weekend revenues of all movies over the 10 years are averaged by day of the week, Thursday releases have lower revenues than Wednesday and Tuesday releases. Saturday also has the lowest gross revenue relative to budget, and so is probably the worst day to release a movie.

By month, September and the summer months of June and July have the highest revenues and budgets, followed by March. Movies released in November, December and May contribute the least to revenue.
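The weekday and month averages discussed above could be computed along these lines. This is only a sketch using the standard library: the field names "release_date", "budget" and "gross" are assumptions, the helper `average_by` is hypothetical, and the sample figures are invented.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical rows; the amounts are invented for illustration only.
movies = [
    {"release_date": date(2015, 6, 3), "budget": 100.0, "gross": 300.0},
    {"release_date": date(2015, 6, 10), "budget": 50.0, "gross": 100.0},
    {"release_date": date(2015, 9, 5), "budget": 40.0, "gross": 20.0},
]


def average_by(movies, key_func):
    """Group movies by key_func(release_date) and average budget and gross."""
    groups = defaultdict(list)
    for movie in movies:
        groups[key_func(movie["release_date"])].append(movie)

    summary = {}
    for key, rows in groups.items():
        avg_budget = mean(r["budget"] for r in rows)
        avg_gross = mean(r["gross"] for r in rows)
        summary[key] = {
            "budget": avg_budget,
            "gross": avg_gross,
            # Average gross per unit of average budget; higher is better.
            "gross_to_budget": avg_gross / avg_budget,
        }
    return summary


# Group by weekday name and by month number (1 = January).
by_weekday = average_by(movies, lambda d: WEEKDAYS[d.weekday()])
by_month = average_by(movies, lambda d: d.month)
```

Passing the grouping rule in as a function keeps the weekday and month views on the same code path, so the two summaries stay directly comparable.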

Plot II

Second, we looked at how film genres affect revenue. Movies in the ‘Adventure’, ‘Animation’ and ‘Sci-Fi’ genres have the highest average revenues, followed by ‘Fantasy’, ‘Family’ and ‘Action’, which require comparable budgets but are not as profitable. Intriguingly, ‘Short’ films have low budgets yet are the most profitable relative to their cost.
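A per-genre comparison of this kind might be computed as in the sketch below. It assumes each movie carries a list of genres and is counted once under every genre it belongs to; that counting rule, the field names and the helper `genre_summary` are assumptions, and the sample numbers are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; a movie may belong to several genres.
movies = [
    {"genres": ["Adventure", "Sci-Fi"], "budget": 200.0, "gross": 600.0},
    {"genres": ["Adventure"], "budget": 100.0, "gross": 200.0},
    {"genres": ["Short"], "budget": 1.0, "gross": 5.0},
]


def genre_summary(movies):
    """Average budget and gross per genre, plus gross relative to budget."""
    by_genre = defaultdict(list)
    for movie in movies:
        for genre in movie["genres"]:
            # A multi-genre movie contributes to each of its genres.
            by_genre[genre].append(movie)

    summary = {}
    for genre, rows in by_genre.items():
        avg_budget = mean(r["budget"] for r in rows)
        avg_gross = mean(r["gross"] for r in rows)
        summary[genre] = {
            "avg_budget": avg_budget,
            "avg_gross": avg_gross,
            "gross_to_budget": avg_gross / avg_budget,
        }
    return summary


summary = genre_summary(movies)
# Genre with the highest gross relative to budget in this toy sample.
most_profitable = max(summary, key=lambda g: summary[g]["gross_to_budget"])
```

Ranking by average gross and ranking by gross relative to budget can give different orderings, which is the distinction the paragraph above draws between high-revenue genres and low-budget ‘Short’ films.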