This analysis illustrates some important and curious facts about something that has fascinated people for more than 100 years. Since the Lumière brothers created the first short film, the industry has evolved a great deal. Nowadays, thanks to the internet, we are surrounded by movies every day, so we thought it would be interesting to dig into the data behind them. Can producers predict what audiences will prefer in the future? Do the most popular films share a pattern that explains their appeal? In this study we answer questions like these by exploring movie data in depth, analysing and predicting some film attributes.
In our analysis we use the TMDB Movie Metadata dataset published on Kaggle. The data was extracted from The Movie Database API in 2017 and covers almost 5000 films.
This dataset contains 4803 rows with a total of 20 columns.
| Name | Data type | Description |
|---|---|---|
| Budget | Integer | Movie budget in dollars |
| Genres | String | A comma-separated list of genres used to classify the film |
| Homepage | Url | Official website of the film |
| Id | Integer | Identification number of the film created by TMDB |
| Keywords | String | A comma-separated list of keywords used to classify the film |
| Original_language | String | Original language of the film |
| Original_title | String | Original title of the film |
| Overview | String | Short description of the film |
| Popularity | Decimal | Popularity on the TMDB website |
| Production_companies | String | A comma-separated list of production companies in the film |
| Production_countries | String | A comma-separated list of production countries in the film |
| Release_date | Date | Release date in YYYY-MM-DD format |
| Revenue | Integer | Movie revenue in dollars |
| Runtime | Integer | The duration of the film in minutes |
| Spoken_languages | String | A comma-separated list of languages spoken in the original film |
| Status | String | Indicates whether the movie has been released. Values are “Released” or “Rumored” |
| Tagline | String | Short text to clarify or make you excited about the film |
| Title | String | The title of the film |
| Vote_average | Decimal | Average user rating for the movie (0-10) |
| Vote_count | Integer | Number of votes |
summary(tmdb_movies)
The most relevant features for this study are Budget, Genres, Revenue and Vote_average. We will use other features as well, but they are less important.
We clean the data a little: we only use movies with all fields completed (no NA values), and we also discard movies whose budget or revenue equals 0, since we assume this is an error. In addition, each film has multiple genres, but we only consider the first genre of each movie.
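The cleaning rules above can be sketched in base R as follows. This is a minimal illustration, not the code used in the study: `toy_movies` and its values are made up, and the sketch assumes that genres are stored as a comma-separated string, as the column table describes.

```r
# Toy stand-in for tmdb_movies; the values are made up for illustration.
toy_movies <- data.frame(
  title   = c("A", "B", "C", "D"),
  budget  = c(1000000, 0, 5000000, NA),
  revenue = c(3000000, 2000000, 0, 4000000),
  genres  = c("Action, Adventure", "Drama", "Comedy, Romance", "Thriller"),
  stringsAsFactors = FALSE
)

# 1) Drop rows with any missing value.
cleaned <- na.omit(toy_movies)

# 2) Drop rows whose budget or revenue is 0 (treated as unreported).
cleaned <- cleaned[cleaned$budget > 0 & cleaned$revenue > 0, ]

# 3) Keep only the first genre of each film.
cleaned$genres <- trimws(sub(",.*$", "", cleaned$genres))

print(cleaned)
```

Only film "A" survives all three steps in this toy example, with "Action" as its single genre.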
First we plot revenue against budget, with colour encoding the vote average, to see whether a relationship is visible.
There may be a relationship between a film's revenue and the budget invested, but neither feature appears to be related to the vote average in this graph, so let's plot them in separate graphs.
It is clearer now: there may be a relationship between the vote average and the revenue, but there does not seem to be one with the budget.
Next we compare these features with the genre of the film, to see whether action movies earn more money than movies of other genres.
Let's see whether there is a relation between genre and revenue, so we can tell whether one genre makes more money than another.
# violin plot of revenue by genre
gen_vot <- ggplot(movies2, aes(x=genres, y=revenue)) +
  geom_violin()
gen_vot + theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that Action, Drama and Science-Fiction make a lot more money than the other genres. To make a better comparison, we clean the data a bit more and keep only the most popular genres (by revenue):
- 1) Action
- 2) Adventure
- 3) Comedy
- 4) Drama
- 5) Family
- 6) Fantasy
- 7) Romance
- 8) Science-fiction
- 9) Thriller

We will also compare against the vote average, so we drop all movies with few votes.
summary(movies2$vote_count)
There is a big difference between the most voted and the least voted films, so we exclude those in the first quartile.
# Many films have few votes (half of the movies have fewer than 471 votes),
# so we only keep films with more than 178 votes (the 1st quartile)
movies_reduced_votes <- movies2 %>% # data frame to start with
  filter(vote_count > 178) %>%
  # "Science" presumably stands for Science Fiction after keeping the first genre word
  filter(genres %in% c("Action", "Adventure", "Animation", "Comedy", "Drama", "Family", "Fantasy", "Romance", "Thriller", "Science"))
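The cut-off of 178 votes is the first quartile reported by `summary()`. As a hedged sketch, the threshold can also be computed instead of hard-coded. The vote counts below are made up and only stand in for `movies2$vote_count`.

```r
# Made-up vote counts standing in for movies2$vote_count.
vote_count <- c(10, 50, 120, 300, 800, 2500, 9000)

# First quartile: 25% of the films have this many votes or fewer.
q1 <- quantile(vote_count, probs = 0.25, names = FALSE)

# Keep only the films above the first quartile.
kept <- vote_count[vote_count > q1]

print(q1)
print(kept)
```

Computing the threshold this way keeps the filter consistent if the cleaning steps upstream change the number of films.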
Once we have a cleaner dataset, we want to compare genre with revenue and score. To do that, we convert the vote average into a factor by rounding it down to an integer (the score is on a 10-point scale).
# convert the vote average to an integer factor so it can be used for faceting
movies_reduced_votes$vote_average <- floor(movies_reduced_votes$vote_average)
movies_reduced_votes$vote_average <- as.factor(movies_reduced_votes$vote_average)
We make a plot now comparing these three variables, to see if there is a pattern.
# we only lost about 400 movies, and those were little-known films, which is fine
gen_vot <- ggplot(movies_reduced_votes, aes(x=genres, y=revenue)) +
geom_violin()
gen_vot <- gen_vot+facet_wrap(~vote_average)
gen_vot + theme(axis.text.x = element_text(angle = 90, hjust = 1))
We see that revenues are similar across genres, but when the score is around seven, action films earn much more money than the rest. To see whether action movies follow a different pattern, we repeat the previous graphs using only action movies.
We need to clean the data first:
# revenue vs budget compared to vote score, only for action movies
action_movies <- movies_reduced_votes %>%
filter(genres == "Action")
And lastly we plot the same graphs as before.
Movies with a vote_average of 8 earn neither very much nor very little money. We can also see that the films that earn a lot of money have an average budget.
The first graph is the same as the one considering all genres, so it is not useful for the study. However, in action films the revenue may indeed be related to the vote_average: as the vote average rises, more films have a higher revenue. The exception is movies scoring above 8. These movies do not have a large revenue, but they do not have large budgets either, so they are likely profitable.
We now plot some graphs to see whether there is a tendency to invest or earn more money over time. We consider all the movies.
It does seem that more money is invested each year. For a clearer view, we plot only the median value per year rather than every film.
Apart from the exception in the 1920s, the budget clearly increases each year. For revenue the trend is less obvious, and a time series analysis may be needed to be sure, but it appears to be increasing too.
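The per-year median mentioned above can be computed with a simple aggregation. The following is a minimal sketch on made-up data. The data frame `films` is hypothetical, and only the column names `year` and `budget` mirror those used in this report.

```r
# Made-up films; `year` and `budget` mirror the columns used in the text.
films <- data.frame(
  year   = c(2000, 2000, 2000, 2001, 2001, 2002),
  budget = c(10, 20, 90, 30, 50, 60) * 1e6
)

# Median budget per release year (the median is robust to a few blockbusters).
median_by_year <- aggregate(budget ~ year, data = films, FUN = median)

print(median_by_year)
```

The resulting data frame has one row per year and could then be drawn as a line, for example with `ggplot(median_by_year, aes(x = year, y = budget)) + geom_line()`.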
First we clean the data.
Step 1.1) Select the columns that we will use. In this case we relate the money earned (revenue) to the variables budget, genres, popularity, production companies, release date, vote average and spoken languages.
movies_related_revenue <- tmdb_movies_clean %>% select(revenue, budget, genres, popularity, production_companies, release_date, vote_average, spoken_languages)
Step 1.2) Compare the variable revenue with other variables.
movies_r <- movies_related_revenue %>% select(revenue, budget)
cor(movies_r$revenue, movies_r$budget)
ggplot(movies_related_revenue, aes(x=revenue, y=budget))+
geom_point()+
geom_smooth(method=lm)+
labs(x="Revenue", y="Budget", title="Relation between Revenue and Budget")
regression_model = lm(formula = revenue ~ budget + genres + popularity + production_companies + release_date + spoken_languages, data=movies_related_revenue)
summary(regression_model)
movies_r <- movies_related_revenue %>% select(revenue, popularity)
cor(movies_r$revenue, movies_r$popularity)
ggplot(movies_related_revenue, aes(x=revenue, y=popularity))+
geom_point()+
geom_smooth(method=lm)+
labs(x="Revenue", y="Popularity", title="Relation between Revenue and Popularity")
regression_model = lm(formula = revenue ~ popularity + budget + genres + production_companies + release_date + spoken_languages, data=movies_related_revenue)
summary(regression_model)
movies_r <- movies_related_revenue %>% select(revenue, budget, vote_average)
cor(movies_r$revenue, movies_r$vote_average)
cor(movies_r$budget, movies_r$vote_average)
ggplot(movies_related_revenue, aes(x=revenue, y=vote_average))+
geom_point()+
geom_smooth(method=lm)+
labs(x="Revenue", y="Vote average", title="Relation between Revenue and Vote average")
regression_model = lm(formula = revenue ~ vote_average + budget + genres + production_companies + release_date + spoken_languages, data=movies_related_revenue)
summary(regression_model)
ggplot(movies_related_revenue, aes(x=budget, y=vote_average))+
geom_point()+
geom_smooth(method=lm)+
labs(x="Budget", y="Vote average", title="Relation between Budget and Vote average")
regression_model = lm(formula = budget ~ vote_average + revenue + genres + production_companies + release_date + spoken_languages, data=movies_related_revenue)
summary(regression_model)
# Numerical value of the correlation
movies_r <- tmdb_movies_clean %>% select(vote_average, runtime)
cor(movies_r$vote_average, movies_r$runtime)
# Graphical representation
ggplot(tmdb_movies_clean, aes(x=vote_average, y=runtime)) +
geom_point(shape=1) +
geom_smooth(method=lm)
# Top3 genres during years
# Select the top genre for each year
popular_genre <- tmdb_movies_clean
popular_genre$year <- format(as.Date(popular_genre$release_date, format="%Y-%m-%d"),"%Y") # add year column
# get first genre
genre <- popular_genre$genres
genre <- as.data.frame(genre)
genre <- separate(genre, col = genre, into = c("1","2","3","4","5","6"))
popular_genre$genres <- genre$`5`
popular_genre <- na.omit(popular_genre) # we have removed an extra film with no genre
popular_genre <- popular_genre %>%
select(genres, year)
# Subdivisions years
old_genre <- popular_genre %>%
filter(year<=1980)
count(old_genre, "genres")
# Top3:
# 1) Drama
# 2) Action and Comedy (equal number of films)
neutral_genre <- popular_genre %>%
filter(year>1980 & year<2015) # use the vectorised & so every row is tested
count(neutral_genre, "genres")
# Top3:
# 1) Drama
# 2) Comedy
# 3) Action
new_genre <- popular_genre %>%
filter(year>=2015)
count(new_genre, "genres")
# Top3:
# 1) Action
# 2) Drama
# 3) Comedy
# Compare the vote_average of films released in 1980 or earlier with those released in 2015 or later.
tmdb_movies_clean$year <- format(as.Date(tmdb_movies_clean$release_date, format="%Y-%m-%d"),"%Y")
movies_r <- tmdb_movies_clean %>%
select(vote_average, vote_count, year)
movies_old <- movies_r %>%
filter(year<=1980)
nrow(movies_old) # number of rows
mean(movies_old$vote_average) # average vote
sum(movies_old$vote_count) # number of votes
movies_now <- movies_r %>%
filter(year>=2015)
nrow(movies_now) # number of rows
mean(movies_now$vote_average) # average vote
sum(movies_now$vote_count) # number of votes
# Filter movies with revenue of more than 1,000,000,000 dollars
movies_more_revenue <- tmdb_movies_clean %>%
filter(revenue>1000000000)
# Select years from dates of the movies with more revenue
movies_more_revenue$year <- format(as.Date(movies_more_revenue$release_date, format="%Y-%m-%d"),"%Y")
# get first genre
genre <- movies_more_revenue$genres
genre <- as.data.frame(genre)
genre <- separate(genre, col = genre, into = c("1","2","3","4","5","6"))
movies_more_revenue$genres <- genre$`5`
movies_more_revenue <- na.omit(movies_more_revenue) # we have removed an extra film with no genre
movies_more_revenue <- movies_more_revenue %>% select(revenue, year, genres)
movies_more_revenue <- movies_more_revenue[order(movies_more_revenue$year),]
names(movies_more_revenue)[3] <- "movie_genre"
# Select the top genre for each year
year_top_genre <- tmdb_movies_clean
year_top_genre$year <- format(as.Date(year_top_genre$release_date, format="%Y-%m-%d"),"%Y") # add year column
# get first genre
genre <- year_top_genre$genres
genre <- as.data.frame(genre)
genre <- separate(genre, col = genre, into = c("1","2","3","4","5","6"))
year_top_genre$genres <- genre$`5`
year_top_genre <- na.omit(year_top_genre) # we have removed an extra film with no genre
# get genre and year
year_top_genre <- year_top_genre %>% select(genres, year)
year_genre <- year_top_genre %>% select(year, genres) %>%
group_by(year)
# Compare the genre of each film with the top genre of the release year
join(movies_more_revenue, year_genre, by ="year")
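The comments above speak of selecting the top genre for each year before comparing. One possible reading of that step is sketched below in base R, using `merge` rather than the `join` call above. All data and the names `films`, `blockbusters`, `top_genre` and `compared` are made up for illustration, so this is not necessarily how the reported figures were obtained.

```r
# Made-up films with a single (first) genre and a release year.
films <- data.frame(
  year   = c("2010", "2010", "2010", "2011", "2011"),
  genres = c("Action", "Action", "Drama", "Comedy", "Comedy"),
  stringsAsFactors = FALSE
)

# Most frequent genre per year (ties resolve to the first in alphabetical order).
top <- tapply(films$genres, films$year, function(g) names(which.max(table(g))))
top_genre <- data.frame(
  year      = names(top),
  top_genre = as.vector(top),
  stringsAsFactors = FALSE
)

# Made-up high-revenue films to compare against the yearly top genre.
blockbusters <- data.frame(
  year        = c("2010", "2011"),
  movie_genre = c("Action", "Drama"),
  stringsAsFactors = FALSE
)

# One row per blockbuster, flagged by whether it matches the year's top genre.
compared <- merge(blockbusters, top_genre, by = "year")
compared$same_genre <- compared$movie_genre == compared$top_genre

print(compared)
print(table(compared$same_genre))
```

Reducing each year to a single top genre before joining gives exactly one comparison row per high-revenue film, which makes the counts of matching and non-matching genres easy to read off.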
tmdb_movies_clean$year <- as.numeric(format(as.Date(tmdb_movies_clean$release_date, format="%Y-%m-%d"),"%Y"))
# Numerical value of the correlation
movies_r <- tmdb_movies_clean %>% select(year, popularity, revenue)
cor(movies_r$year, movies_r$popularity)
cor(movies_r$year, movies_r$revenue)
For the pre-processing stage, we have made some transformations:
- Remove irrelevant variables, such as name (the variable id identifies each entry), keywords and some variables whose format transformation exceeds the scope of this project.
- Convert the name and description attributes, which could be interpreted as categorical variables, into new fields containing the length of the name/description, in order to find out whether this length is relevant for the model.
- Delete missing values for target variables, such as revenue, vote or popularity, in order to train the model correctly.

The final data frame looks like this:
budget ori_lan popularity release_date revenue runtime status vote_av. vote_count name_l tagline_l
1 237000000 en 150.43758 2009-12-10 2787965087 162 Released 7.2 11800 6 27
2 300000000 en 139.08262 2007-05-19 961000000 169 Released 6.9 4500 40 46
3 245000000 en 107.37679 2015-10-26 880674609 148 Released 6.3 4466 7 21
4 250000000 en 112.31295 2012-07-16 1084939099 165 Released 7.6 9106 21 15
5 260000000 en 43.92699 2012-03-07 284139100 132 Released 6.1 2124 11 36
6 258000000 en 115.69981 2007-05-01 890871626 139 Released 5.9 3576 12 18
For the machine learning algorithms, we used just two types:
- Linear regression for the regression-related questions. This algorithm shows how well a variable (in this project, the runtime and the release date) can predict the score value.
- KNN algorithm to classify the data into clusters by revenue and examine the properties of the clusters.
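The linear regression step can be sketched as follows. This is a minimal illustration under stated assumptions: the data is simulated, the 80/20 split is our choice, and the dependence of the score on runtime is invented for the demo. Only the names `actuals_preds`, `actuals` and `predicteds` follow the output shown in the results.

```r
set.seed(42)  # reproducible toy data

# Made-up data: score loosely increases with runtime (an assumption for the demo).
n <- 200
toy <- data.frame(runtime = round(runif(n, 80, 180)))
toy$vote_average <- 4 + 0.02 * toy$runtime + rnorm(n, sd = 0.7)

# Hold out 20% of the rows for testing.
test_idx <- sample(n, size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit on the training rows, predict the held-out rows.
fit <- lm(vote_average ~ runtime, data = train)
actuals_preds <- data.frame(
  actuals    = test$vote_average,
  predicteds = predict(fit, newdata = test)
)

# Correlation between observed and predicted scores.
print(cor(actuals_preds))
```

Evaluating on rows the model has not seen gives a fairer picture of predictive power than the fit on the training data alone.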
The variable with the most influence on revenue is budget. The variable whose correlation coefficient is closest to 1 or -1 is the most strongly related; budget is the highest, with 0.7053993.
| Budget | Genres | Popularity | Release_date | Spoken_languages |
|---|---|---|---|---|
| 0.705399 | 0.362128 | -0.428725 | 0.229568 | 0.037188 |
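Picking the variable whose coefficient is closest to 1 or -1 can be automated. The sketch below uses made-up numeric columns in a hypothetical data frame `toy`. Non-numeric variables such as genres would first need a numeric encoding, which is not shown here.

```r
# Made-up numeric columns standing in for the cleaned movie data.
toy <- data.frame(
  revenue    = c(10, 40, 35, 80, 120, 150),
  budget     = c(5, 20, 15, 35, 50, 70),
  popularity = c(8, 3, 9, 4, 10, 6),
  runtime    = c(95, 100, 140, 90, 120, 110)
)

# Pearson correlation of every other column with revenue.
predictors <- setdiff(names(toy), "revenue")
cors <- sapply(toy[predictors], function(x) cor(toy$revenue, x))

# The strongest association is the one with the largest absolute value.
strongest <- names(which.max(abs(cors)))

print(round(cors, 3))
print(strongest)
```

Taking the absolute value matters because a strong negative correlation is as informative as a strong positive one.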
The variable that most influences the popularity of a film is revenue. The variable whose correlation coefficient is closest to 1 or -1 is the most strongly related; revenue is the highest, with 0.602245.
| Revenue | Genres | Popularity | Release_date | Spoken_languages |
|---|---|---|---|---|
| 0.602245 | 0.294614 | -0.371995 | 0.487992 | -0.152934 |
The variable that most influences the average user score is popularity. The variable whose correlation coefficient is closest to 1 or -1 is the most strongly related; popularity is the highest, with 0.3185934.
| Budget | Revenue | Genres | Popularity | Release_date | Spoken_languages |
|---|---|---|---|---|---|
| -0.03120827 | 0.187839 | 0.028153 | 0.3185934 | 0.0051984 | -0.0018729 |
We get a correlation of 0.3786415 between the duration of the film and the average user rating. This indicates only a weak positive relation between these two factors. We can see it in a graphical representation here:
ggplot(tmdb_movies_clean, aes(x=vote_average, y=runtime)) +
geom_point(shape=1) +
geom_smooth(method=lm)
We split the years into 3 groups: up to 1980, between 1981 and 2014, and from 2015 onwards.
The results show that old films generally have a better user score than new ones. Note that both groups contain roughly the same number of films; nevertheless, old films received only about 39% as many user votes as the new ones (124385 versus 321932).
| | Number of rows | Average vote | Number of votes |
|---|---|---|---|
| Old movies (<=1980) | 219 | 6.93105 | 124385 |
| Recent movies (>=2015) | 193 | 6.20725 | 321932 |
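The totals in the table can be turned into a ratio and into per-film averages. This is a small worked calculation using only the figures from the table; the variable names are ours.

```r
# Figures copied from the table above.
old    <- c(films = 219, votes = 124385)
recent <- c(films = 193, votes = 321932)

# Share of votes the old films have relative to the recent ones.
vote_ratio <- old[["votes"]] / recent[["votes"]]

# Average number of votes per film in each group.
votes_per_film <- c(
  old    = old[["votes"]] / old[["films"]],
  recent = recent[["votes"]] / recent[["films"]]
)

print(round(vote_ratio, 3))
print(round(votes_per_film))
```

The per-film averages make the difference in voting activity visible even though the two groups contain a similar number of films.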
The results show that it is difficult for a film to earn more than 1 billion dollars if its genre is not one of the most popular genres of that year.
| Number films more than $1B | Same genre as the popular in that year | Different genre than the popular in that year |
|---|---|---|
| 21 | 18 | 3 |
Here we can see that the release year has little influence on the popularity and revenue of a movie, as both correlation values are close to 0.
| Popularity | Revenue |
|---|---|
| 0.1613506 | 0.1474426 |
We had to compare how well the variable score is predicted by:
actuals predicteds
actuals 1.0000000 0.4146759
predicteds 0.4146759 1.0000000
> regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
mae mse rmse mape
0.6087162 0.6002294 0.7747448 0.1022778
Given the correlation (0.41) and the RMSE (about 0.77 points on the rating scale), we can conclude that the observed and predicted values move in the same direction. The error is fairly low, so the variables are correlated, although only moderately.
actuals predicteds
actuals 1.0000000 0.1650971
predicteds 0.1650971 1.0000000
> regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
mae mse rmse mape
0.6569323 0.7043607 0.8392620 0.1111245
In this case, the correlation between the observed and predicted data is too low, so we can’t conclude that both variables are correlated.
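The four error metrics printed by `regr.eval` above follow standard definitions and can be reproduced by hand. The sketch below uses made-up observed and predicted scores, so the numbers it prints are not those of the study.

```r
# Made-up observed and predicted scores for illustration.
actuals    <- c(6.0, 7.0, 5.0, 8.0)
predicteds <- c(6.5, 6.5, 5.5, 7.0)

errors <- actuals - predicteds

mae  <- mean(abs(errors))            # mean absolute error
mse  <- mean(errors^2)               # mean squared error
rmse <- sqrt(mse)                    # root mean squared error
mape <- mean(abs(errors / actuals))  # mean absolute percentage error (as a fraction)

print(c(mae = mae, mse = mse, rmse = rmse, mape = mape))
```

Because the RMSE is the square root of the MSE, the two reported columns can be checked against each other: for the first model, the square root of 0.6002294 is approximately 0.7747.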
The implications of this research relate directly to the film industry and to film critics. The success (votes and views) of a film is influenced by revenue, release date or the duration of the film. Knowing this, producers can study all these factors to try to earn more money, get better reviews or improve the popularity of the movie.
Future work could be divided into two main areas:
On the one hand, we could explore other possible relations between attributes, for example between title length and average vote. This could answer new questions, such as whether the title of a film influences its positive votes or total views. Furthermore, we could look for semantic similarities between the words in the titles of the most popular films.
On the other hand, a more technical line of future work would be to explore new algorithms in order to predict with more accuracy: not only other machine learning techniques but also deep learning, more advanced statistics, etc.