The analysis primarily focuses on two datasets which are movies_metadata.csv and ratings.csv. We let average rating be the createria to evaluate the quality of movies. For the purpose of building model to assess the movie, we decide to leave out several variables based on their variable types and practical significance. These variables include budget, revenue, runtime, timing factors (day of the week and year), production countries, cast and crew. And We use python to clean the raw data, leaving only relevant observations. We also created indicator variables based on characteristics of each factor.
| Dependent Var | Coefficiennts |
|---|---|
| budget | -0.027 |
| popularity | 0.040 |
| revenue | 0.003 |
| runtime | 0.024 |
| dayMon | 0.361 |
| dayTue | 0.499 |
| dayWed | 0.345 |
| dayThu | 0.211 |
| daySat | 0.341 |
| daySun | 0.463 |
| year | -0.023 |
| country# | 0.138 |
| cast_size | 0.008 |
| crew_size | 0.006 |
| En Langrage | 0.338 |
| MadeinusUS | -0.874 |
| not winter | 0.202 |
| goodactor | 0.469 |
| gooddir | 0.478 |
For production countries and spoken languages, we count the number of records on each row and replace the original columns with new numeric columns.
Create a new column called ?oglang? where ?En? represents English movies and ?Not En? represents non-English movies. With the same logic, ?Madeinus? is created to identify whether the movie is produced in USA or not.
Considering the seasonality aspects, we want to look at the month by quarters.Later on we find that quarter is not significant. So we divide year into two categories: winter and not winter.
We define score as the same as average_ratings. In order to build a logitic regression model, we need to convert score into a binary variable. By checking the distribution of score, we decide to consider a score that is greater or equal to 3.58 (which is top 25% of score distribution) as ?good score?. Therefore, we created a new binary variable called ?goodScore? where ?1=good score?, ?0=bad score(<3.58)?. For example, as for good actor we have Brad Pitt, Matt Damon etc.
Count occurrence of actors in all movies of the dataset, if occurrences >= 30, we consider this actor as a ?good actor (popular actor). Then we create an indicator variable called ?Withgoodactor? to identify those movies with good actors, in which “1”" represents ?movies with good actor?.
And we also count the number of movies for each director in the dataset. If the number of movies is greater or equal to 5, then we define this director as a ?good director?. Then we create an indicator variable called ?Withgooddirct? to identify those movies with good directors, where “1”" represents ?movies with good director?.
People tend to beautify memories, and this also works for movies. As time goes on (year), it generally become harder for a movie to be considered good. On one hand, people?s expectations towards movies are getting higher. In other words, production companies need to detect the trend and tailor the preferences of people. On the other hand, survivorship bias may also exist. Most movies produced in 80s and 90s are well-known and classic, including The Shawshank Redemption, Forrest Gump and Titanic, and they never fade away with time elapsing. However, this does not mean that there were no junk movies in the past. The movie database is built in the recent years, namely, people might ignored those bad old movies, and only gave high scores to those classic movies.
Seasons indeed affect people?s views about movies, especially when it comes to winter. One possible reason might be that there are lot of holidays during the winter. As a result, people tend to be in good mood in those days and are more willing to give high grades for those movies. Nevertheless, self-selection bias might be an issue since those industry giants, who has larger influence in the market, have more chance to release movies in peak seasons.
Intuitively, revenue and budget shall have vital positive impact on the quality of movies. However, in the model we build, this is not the case when we hold all other variables constant. Investing a huge amount of money may not guarantee a good score.
Star actors and famous directors to some extent guarantee the quality of movies. First, they are more experienced and have better over performance compared to. Second, they typically would not accept bad movie scripts. Those popular actors also have a strong fanbase who will gave high scores to them regardless of the quality of the movies.
Crew size, cast size and production country size have positive impacts on scores because generally the a large and conprehensive team would make a movie more diversified. However, larger size does not indicate a movie is good.
Runtime is also positively related to scores. With more runtime, a movie can better shape the characteristics of roles, enrich the plots and intensify the feelings of audience.
## Appendix: