Movies analysis

Abstract

This analysis illustrates some important and curious facts about an art form that has fascinated people for more than 100 years. The film industry has evolved a great deal since the Lumière brothers made their first short films. Today, thanks to the internet, we are surrounded by movies every day, so we thought it would be interesting to look more closely at the data behind them. Can producers predict what audiences will prefer in the future? Do the most popular films share any pattern that explains their success? In this study we address questions like these by exploring movie data in depth, analysing and predicting some film attributes.

Exploratory Data Analysis

In our analysis we use the TMDB Movie Metadata dataset published on Kaggle. The data was extracted from The Movie Database (TMDB) API in 2017 and covers almost 5000 films.

TMDB Movie Metadata

This dataset contains 4803 rows with a total of 20 columns.

Name                  Data type  Description
Budget                Integer    Movie budget in dollars
Genres                String     A comma-separated list of genres used to classify the film
Homepage              Url        Official website of the film
Id                    Integer    Identification number of the film created by TMDB
Keywords              String     A comma-separated list of keywords used to classify the film
Original_language     String     Original language of the film
Original_title        String     Original title of the film
Overview              String     Short description of the film
Popularity            Decimal    Popularity on the TMDB website
Production_companies  String     A comma-separated list of production companies of the film
Production_countries  String     A comma-separated list of production countries of the film
Release_date          Date       Release date in YYYY-MM-DD format
Revenue               Integer    Movie revenue in dollars
Runtime               Integer    Duration of the film in minutes
Spoken_languages      String     A comma-separated list of languages spoken in the original film
Status                String     Indicates whether the movie has been released. Values are “Released” or “Rumored”
Tagline               String     Short text that clarifies the film or builds excitement about it
Title                 String     Title of the film
Vote_average          Decimal    Average user rating of the movie (0-10)
Vote_count            Integer    Number of votes

summary(tmdb_movies)

The most relevant features for this study are Budget, Genres, Revenue, and vote_average. We will use other features as well, but they are less important.

We clean the data before using it: we keep only movies with all fields completed (no NA values), and we discard movies whose budget or revenue equals 0, since we assume these zeros are errors. In addition, each film has multiple genres, but we consider only the first genre of each movie.
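The cleaning steps described above could be sketched as follows. This is a minimal illustration rather than the exact code used in the study: the toy data frame and its values are invented, and it assumes that genres is stored as a comma-separated string, as listed in the table above. Only the names tmdb_movies and movies2 come from this report.

```r
library(dplyr)

# Toy stand-in for the TMDB table (invented values; the real data has 20 columns)
tmdb_movies <- data.frame(
  title   = c("A", "B", "C", "D"),
  budget  = c(100, 0, 50, 80),
  revenue = c(300, 20, NA, 0),
  genres  = c("Action, Adventure", "Drama", "Comedy, Romance", "Thriller"),
  stringsAsFactors = FALSE
)

movies2 <- tmdb_movies %>%
  na.omit() %>%                                     # drop rows with any NA
  filter(budget > 0, revenue > 0) %>%               # zeros are treated as errors
  mutate(genres = trimws(sub(",.*$", "", genres)))  # keep only the first genre

print(movies2)
```

In this toy example only movie "A" survives, because the other rows have a zero budget, a missing revenue or a zero revenue.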

Revenue and budget

First we plot revenue against budget, colouring each point by vote average, to see whether a relationship is visible.

There may be a relationship between the revenue of a film and the budget invested, but neither feature appears to be related to the vote average in this graph, so let's plot them in separate graphs.

The picture is clearer now: there may be a relationship between the vote average and the revenue, but there does not seem to be one between the vote average and the budget.

Next we compare these features with the genre of the film, to see whether, for example, action movies earn more money than movies of other genres.

Genre

Let's see whether there is a relationship between genre and revenue, so we can check whether some genres make more money than others.

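# Violin plot of the revenue distribution for each genre
gen_vot <- ggplot(movies2, aes(x=genres, y=revenue)) +
  geom_violin()
# Rotate the genre labels so they do not overlap
gen_vot + theme(axis.text.x = element_text(angle = 90, hjust = 1))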

We can see that Action, Drama and Science-Fiction make a lot more money than the other genres. To make a better comparison, we clean the data a bit more and keep only the most popular genres (by revenue), which are:

  1. Action
  2. Adventure
  3. Comedy
  4. Drama
  5. Family
  6. Fantasy
  7. Romance
  8. Science-fiction
  9. Thriller

We are going to compare with the vote average too, so we remove all the movies with few votes.

summary(movies2$vote_count)

There is a big difference between the most voted and the least, so we are not going to use the ones in the first quartile.

# Many films have few votes (half of the movies have fewer than 471 votes),
# so we keep only films with more than 178 votes (the 1st quartile)
movies_reduced_votes <- movies2 %>%
  filter(vote_count > 178) %>%
  filter(genres %in% c("Action", "Adventure", "Animation", "Comedy", "Drama",
                       "Family", "Fantasy", "Romance", "Thriller", "Science"))
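The threshold of 178 votes is the first quartile reported by summary(). As a minimal sketch, the same cut-off could be computed with quantile() instead of being typed by hand. The vote counts below are invented values used only for illustration, and the names q1 and kept are hypothetical.

```r
# Invented vote counts standing in for movies2$vote_count
vote_count <- c(10, 50, 120, 200, 480, 900, 1500, 4000)

# First quartile (25th percentile) of the vote counts
q1 <- unname(quantile(vote_count, probs = 0.25))

# Keep only the entries strictly above the first quartile
kept <- vote_count[vote_count > q1]

print(q1)
print(kept)
```

Computing the threshold this way keeps the filter consistent if the cleaning steps upstream change the number of movies.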

Once we have a cleaner dataset, we want to compare the genre with revenue and score. To do that, we convert the vote average into a factor by rounding each value down to an integer (the scale runs from 0 to 10).

# Round the vote average down to an integer and convert it to a factor for faceting
movies_reduced_votes$vote_average <- as.factor(floor(movies_reduced_votes$vote_average))

We make a plot now comparing these three variables, to see if there is a pattern.

# the filters removed only about 400 movies, which were not widely seen films
gen_vot <- ggplot(movies_reduced_votes, aes(x=genres, y=revenue)) + 
  geom_violin()

gen_vot <- gen_vot+facet_wrap(~vote_average)
gen_vot + theme(axis.text.x = element_text(angle = 90, hjust = 1))

We see that revenue is similar across genres, but when the score is around seven, action films earn much more money than the rest. To see whether action movies follow a different pattern, we repeat the previous graphs using only action movies.

Action movies

We need to clean the data first:

# revenue vs budget compared to vote score, only for action movies
action_movies <- movies_reduced_votes %>%
  filter(genres == "Action")

Lastly, we draw the same graphs as before.

Movies with a vote_average of 8 earn neither very much nor very little money. We can also see that the movies that earn a lot of money have an average budget.

The first graph is the same as the one considering all genres, so it is not useful for the study. However, in action films the revenue may indeed be related to the vote_average: as the vote average rises, more films reach a higher revenue. The exception is movies rated above 8. These movies do not have a big revenue, but they do not have large budgets either, so they are likely profitable.

Money vs Time

We now plot some graphs to see whether there is a tendency to invest or earn more money over time. We consider all the movies.

It does seem that more money is invested each year. For a clearer view, we plot only the median value per year instead of every film.

Apart from the exception in the 1920s, the budget clearly increases each year. For revenue the trend is less obvious, and a time series analysis may be needed to be sure, but it also appears to be increasing.
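The yearly medians used for the clearer view could be obtained as in the following sketch. The data frame and its values are invented, and the names movies and yearly_median are hypothetical; only the column names follow the ones used in this report.

```r
library(dplyr)

# Invented data; column names follow the ones used in this report
movies <- data.frame(
  year    = c(2000, 2000, 2000, 2001, 2001),
  budget  = c(10, 20, 90, 30, 50),
  revenue = c(15, 60, 300, 40, 120)
)

# One row per year with the median budget and the median revenue
yearly_median <- movies %>%
  group_by(year) %>%
  summarise(median_budget  = median(budget),
            median_revenue = median(revenue))

print(yearly_median)
```

The median is less sensitive than the mean to a few very expensive productions, which is why it gives a cleaner trend line when plotted against the year.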

Questions of interest

  • Which attributes make a film better (by better we mean money earned, popularity and ratings)?
  • Does the duration of a film influence its average user rating?
  • Are some film genres more popular than others? Has this changed over the years?
  • Do old films have better user scores than recent ones?
  • Can a film earn more than 1 billion dollars if its genre is not one of the top 3 most popular genres of that year?
  • Does the release year influence the popularity and revenue of a movie?

Methods

Strength of relationships

First we will clean the data

Question 1) First we see which attributes make a film earn more money.

Step 1.1) Select the columns that we will use. In this case we relate the money earned (revenue) to the variables budget, genres, popularity, production companies, release date, vote average and spoken languages.

movies_related_revenue <- tmdb_movies_clean %>% select(revenue, budget, genres, popularity, production_companies, release_date, vote_average, spoken_languages)

Step 1.2) Compare the variable revenue with other variables.

movies_r <- movies_related_revenue %>% select(revenue, budget)
cor(movies_r$revenue, movies_r$budget)

ggplot(movies_related_revenue, aes(x=revenue, y=budget))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(x="Revenue", y="Budget", title="Relation between Revenue and Budget")

regression_model = lm(formula = revenue ~ budget + genres + popularity + production_companies + release_date + spoken_languages, data=movies_related_revenue)

summary(regression_model)

Question 3) Now we see which attributes lead to a film being rated better by users.

movies_r <- movies_related_revenue %>% select(revenue, budget, vote_average)
cor(movies_r$revenue, movies_r$vote_average)
cor(movies_r$budget, movies_r$vote_average)

ggplot(movies_related_revenue, aes(x=revenue, y=vote_average))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(x="Revenue", y="Vote average", title="Relation between Revenue and Vote average")

regression_model = lm(formula = revenue ~ vote_average + budget + genres + production_companies + release_date + spoken_languages, data=movies_related_revenue)

summary(regression_model)
ggplot(movies_related_revenue, aes(x=budget, y=vote_average))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(x="Budget", y="Vote average", title="Relation between Budget and Vote average")

regression_model = lm(formula = budget ~ vote_average + revenue + genres + production_companies + release_date + spoken_languages, data=movies_related_revenue)

summary(regression_model)

Question 4) Next we see whether the duration of a film influences its average user rating.

# Numerical value of the correlation
movies_r <- tmdb_movies_clean %>% select(vote_average, runtime)
cor(movies_r$vote_average, movies_r$runtime)

# Graphical representation
ggplot(tmdb_movies_clean, aes(x=vote_average, y=runtime)) +
geom_point(shape=1) +
geom_smooth(method=lm)

Question 6) Next we see whether old films have better user scores than recent ones.

# Compare the vote_average of films released in 1980 or before with those released in 2015 or after.
# The year is converted to a number so that the comparisons below are numeric, not string-based.
tmdb_movies_clean$year <- as.numeric(format(as.Date(tmdb_movies_clean$release_date, format="%Y-%m-%d"),"%Y"))

movies_r <- tmdb_movies_clean %>% 
            select(vote_average, vote_count, year)

movies_old <- movies_r %>% 
              filter(year<=1980)
nrow(movies_old) # number of rows
mean(movies_old$vote_average) # average vote
sum(movies_old$vote_count) # number of votes

movies_now <- movies_r %>% 
              filter(year>=2015)
nrow(movies_now) # number of rows
mean(movies_now$vote_average) # average vote
sum(movies_now$vote_count) # number of votes

Question 8) Finally we see whether the release year influences the popularity and revenue of a movie.

tmdb_movies_clean$year <- as.numeric(format(as.Date(tmdb_movies_clean$release_date, format="%Y-%m-%d"),"%Y"))

# Numerical value of the correlation
movies_r <- tmdb_movies_clean %>% select(year, popularity, revenue)
cor(movies_r$year, movies_r$popularity)
cor(movies_r$year, movies_r$revenue)

Prediction

For the pre-processing stage, we made some transformations:

  • Remove irrelevant variables, such as the name (the variable id already identifies each entry), the keywords and other variables whose format transformation exceeds the scope of this project.

  • Convert the name and description attributes, which could otherwise be interpreted as categorical variables, into new fields containing the length of the name/description, in order to find out whether this length is relevant for the model.

  • Delete rows with missing values in the target variables, such as revenue, vote or popularity, so that the model can be trained correctly.

The final data frame looks like this:

   budget   ori_lan popularity release_date    revenue   runtime  status   vote_av. vote_count name_l tagline_l
1 237000000   en    150.43758   2009-12-10   2787965087    162   Released    7.2      11800      6       27
2 300000000   en    139.08262   2007-05-19   961000000     169   Released    6.9       4500      40      46
3 245000000   en    107.37679   2015-10-26   880674609     148   Released    6.3       4466      7       21
4 250000000   en    112.31295   2012-07-16   1084939099    165   Released    7.6       9106      21      15
5 260000000   en     43.92699   2012-03-07   284139100     132   Released    6.1       2124      11      36
6 258000000   en    115.69981   2007-05-01   890871626     139   Released    5.9       3576      12      18
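The length features name_l and tagline_l shown above could be derived with nchar(), as in this minimal sketch. The two titles and taglines are illustrative inputs chosen for the example, and the column names title and tagline are assumptions about the source columns.

```r
# Two illustrative movies (inputs chosen for the example)
movies <- data.frame(
  title   = c("Avatar", "Spectre"),
  tagline = c("Enter the World of Pandora.", "A Plan No One Escapes"),
  stringsAsFactors = FALSE
)

# Replace the free text with its length in characters
movies$name_l    <- nchar(movies$title)
movies$tagline_l <- nchar(movies$tagline)

print(movies[, c("name_l", "tagline_l")])
```

Counting characters turns a free-text field into a single numeric predictor, which a regression model can use directly.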

We used only two types of machine learning algorithm:

  • Linear regression for the regression-related questions. This algorithm shows how well a variable (in this project, the runtime and the release date) can predict the score value.
  • KNN to classify the data into clusters by revenue and examine the properties of the clusters.
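The linear regression evaluation could look like the following sketch. It is an illustration under stated assumptions, not the exact procedure of the study: the data are synthetic, the 80/20 train/test split is an assumption, and the error metrics are computed by hand in base R instead of calling regr.eval. The names actuals_preds, actuals and predicteds mirror the output shown in the Results section.

```r
set.seed(42)

# Synthetic data: the score depends weakly on the runtime, plus noise
n <- 200
runtime <- round(runif(n, 80, 180))
vote_average <- 4 + 0.02 * runtime + rnorm(n, sd = 0.7)
movies <- data.frame(runtime, vote_average)

# Assumed 80/20 train/test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- movies[train_idx, ]
test  <- movies[-train_idx, ]

# Fit on the training set and predict on the held-out set
model <- lm(vote_average ~ runtime, data = train)
actuals_preds <- data.frame(actuals    = test$vote_average,
                            predicteds = predict(model, newdata = test))

# Correlation between observed and predicted scores
print(cor(actuals_preds))

# Error metrics computed by hand
err <- actuals_preds$actuals - actuals_preds$predicteds
metrics <- c(mae  = mean(abs(err)),
             mse  = mean(err^2),
             rmse = sqrt(mean(err^2)),
             mape = mean(abs(err / actuals_preds$actuals)))
print(metrics)
```

Evaluating on held-out rows gives a fairer picture of predictive power than measuring the error on the same rows used to fit the model.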

Results

Strength of relationships

Question 1)

The variable with the most influence on revenue is budget. The variable whose correlation coefficient is closest to 1 or -1 is the most strongly related, and budget has the highest value, 0.7053993.

Budget    Genres    Popularity  Release_date  Spoken_languages
0.705399  0.362128  -0.428725   0.229568      0.037188

Question 2)

The variable that most influences the popularity of a film is revenue. The variable whose correlation coefficient is closest to 1 or -1 is the most strongly related, and revenue has the highest value, 0.602245.

Revenue   Genres    Popularity  Release_date  Spoken_languages
0.602245  0.294614  -0.371995   0.487992      -0.152934

Question 3)

The variable that most influences the average user rating is popularity. The variable whose correlation coefficient is closest to 1 or -1 is the most strongly related, and popularity has the highest value, 0.3185934.

Budget       Revenue   Genres    Popularity  Release_date  Spoken_languages
-0.03120827  0.187839  0.028153  0.3185934   0.0051984     -0.0018729

Question 4)

We get a correlation of 0.3786415 between the duration of a film and its average user rating. This indicates only a weak to moderate positive relationship between these two factors. We can see it in a graphical representation here:

ggplot(tmdb_movies_clean, aes(x=vote_average, y=runtime)) +
geom_point(shape=1) +
geom_smooth(method=lm)

Question 5)

We split the years into 3 groups: before 1980, between 1980 and 2014, and from 2015 onward.

  • Top 3 genres before 1980:
  1. Drama
  2. Action and Comedy (equal number of films)
  • Top 3 genres between 1980 and 2014:
  1. Drama
  2. Comedy
  3. Action
  • Top 3 genres from 2015 onward:
  1. Action
  2. Drama
  3. Comedy
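A ranking like the one above could be computed as in this sketch. The data are invented, the names movies, top_genres, period and n_films are hypothetical, and the period boundaries (before 1980, 1980 to 2014, 2015 and later) are an assumption about how the years were split.

```r
library(dplyr)

# Invented data: release year and first genre of each film
movies <- data.frame(
  year   = c(1975, 1978, 1979, 1990, 2001, 2010, 2016, 2017),
  genres = c("Drama", "Drama", "Action", "Comedy",
             "Drama", "Drama", "Action", "Action"),
  stringsAsFactors = FALSE
)

top_genres <- movies %>%
  mutate(period = case_when(
    year < 1980  ~ "before 1980",
    year <= 2014 ~ "1980-2014",
    TRUE         ~ "2015 and later"
  )) %>%
  count(period, genres, name = "n_films") %>%  # films per period and genre
  group_by(period) %>%
  slice_max(n_films, n = 3) %>%                # up to 3 genres, ties are kept
  ungroup()

print(top_genres)
```

Because slice_max() keeps ties by default, a period can return more than three genres when several share the same number of films, as happens with Action and Comedy before 1980 in the results above.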

Question 6)

The results show that old films generally have better user scores than new ones. Note that both groups contain roughly the same number of films, yet old films received about 61% fewer user votes than the new ones (124385 versus 321932).

                        Number of rows  Average of votes  Number of votes
Old movies (<=1980)     219             6.93105           124385
Recent movies (>=2015)  193             6.20725           321932

Question 7)

The results show that it is difficult for a film to earn more than 1 billion dollars if its genre is not one of the most popular genres of that year.

Films earning more than $1B  Same genre as the popular ones that year  Different genre from the popular ones that year
21                           18                                        3

Question 8)

Here we can see that the release year has little influence on the popularity and revenue of a movie, as both correlation values are close to 0.

Popularity Revenue
0.1613506 0.1474426

Prediction

We compared how well the score variable is predicted by:

  • The runtime. We obtained the following output:

             actuals predicteds
actuals    1.0000000  0.4146759
predicteds 0.4146759  1.0000000

> regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
      mae       mse      rmse      mape 
0.6087162 0.6002294 0.7747448 0.1022778 

Given the correlation of about 0.41 between the actual and predicted values, we can conclude that the two move in a similar direction. The RMSE of about 0.77 points shows that the average error is low, which supports the view that runtime and score are related.

  • The release date. We obtained the following output:

             actuals predicteds
actuals    1.0000000  0.1650971
predicteds 0.1650971  1.0000000

> regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
      mae       mse      rmse      mape 
0.6569323 0.7043607 0.8392620 0.1111245 

In this case, the correlation between the observed and predicted data is too low, so we cannot conclude that the two variables are correlated.

Discussion and Future work

The implications of this research relate directly to the film industry and to film critics. The success of a film (votes and views) is related to its revenue and, more weakly, to its duration, while the release date showed little influence. Knowing this, producers can study these factors to try to earn more money, get better reviews or improve the popularity of a movie.

Future work could be divided into two main areas:

On the one hand, the idea is to explore other possible relationships between attributes, for example between the title length and the average vote. We could then answer new questions, such as whether the title of a film influences its positive votes or its total views. Furthermore, we could look for semantic similarities between the words in the titles of the most popular films.

On the other hand, a more technical part of the future work could be the exploration of new algorithms in order to predict with more accuracy: not only other machine learning techniques but also deep learning, more advanced statistics, etc.