Problem Statement Since there is no universal measure of how good a movie is, one can rely on parameters such as critic ratings, profit earned, and viewers' likes and ratings.
Implementation Since the data includes each movie's release year, budget, gross earnings, director name and so on, I will use these fields to analyze the success of movies and directors over the years.
Summary Several trends were observed between a movie's budget, its profit, and its Return on Investment. The analysis also identifies the most successful directors and the genres most favored by directors and producers.
library(knitr) # for knitting R code to HTML files
library(magrittr) # for chaining commands with the pipe operator, %>%
library(tidyverse) # for tibbles, graphs, and data transformation and manipulation tasks
library(lubridate) # for working with date-times and time-spans
library(stringr) # for string handling functions
library(ggrepel) # to repel overlapping text labels away from each other.
library(DT) # for HTML Display of data
url <- "https://raw.githubusercontent.com/akashgupta4891/datasharing/master/movie_metadata.csv"
imdb_data <- read.csv(url)
imdb <- as_tibble(imdb_data)
The data set has been taken from Kaggle.
Original Purpose Given that thousands of movies are produced each year, this dataset was collected to find a better way to judge the quality of movies without relying on critics or one’s own instincts.
Initial Data Description The data was originally scraped for 5000+ movies from the IMDB website using a Python library called “scrapy” 3 months ago.
The original dataset has 28 variables for 5043 movies and 4906 posters (998MB), spanning across 100 years in 66 countries.
There are 2399 unique director names, and thousands of actors/actresses.
Original Variables
Data Peculiarities
Total missing values are 2059
Number of duplicate values for Movie Title: 126
The data has special character (Â) at the end of every movie_title.
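The two counts listed above could be computed along the following lines. This is a minimal sketch on a small made-up data frame: the `toy` object and its values are invented for illustration and are not taken from the dataset, and the report's own duplicate count may have been derived differently. On the real data the same calls would be applied to `imdb`.

```r
# Toy stand-in for the IMDB table; all values are invented for illustration.
toy <- data.frame(
  movie_title = c("A", "B", "B", "C"),
  gross       = c(100, NA, 250, NA),
  budget      = c(50, 80, NA, 40),
  stringsAsFactors = FALSE
)

# Total number of missing cells across all columns
total_missing <- sum(is.na(toy))

# Number of rows whose title already appeared earlier in the table
duplicate_titles <- sum(duplicated(toy$movie_title))

print(total_missing)
print(duplicate_titles)
```

`duplicated()` flags only the second and later occurrences of a value, so a title that appears twice contributes one duplicate.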
1. Missing Values To find the missing values:
# count the missing values in each column
all_na <- sapply(imdb, function(y) sum(is.na(y)))
all_na <- data.frame(all_na)
kable(all_na)
| Variable | all_na |
|---|---|
| color | 0 |
| director_name | 0 |
| num_critic_for_reviews | 50 |
| duration | 15 |
| director_facebook_likes | 104 |
| actor_3_facebook_likes | 23 |
| actor_2_name | 0 |
| actor_1_facebook_likes | 7 |
| gross | 884 |
| genres | 0 |
| actor_1_name | 0 |
| movie_title | 0 |
| num_voted_users | 0 |
| cast_total_facebook_likes | 0 |
| actor_3_name | 0 |
| facenumber_in_poster | 13 |
| plot_keywords | 0 |
| movie_imdb_link | 0 |
| num_user_for_reviews | 21 |
| language | 0 |
| country | 0 |
| content_rating | 0 |
| budget | 492 |
| title_year | 108 |
| actor_2_facebook_likes | 13 |
| imdb_score | 0 |
| aspect_ratio | 329 |
| movie_facebook_likes | 0 |
Since gross has 884 missing values and budget has 492, rows with a missing value in either column are removed.
imdb <- imdb[!is.na(imdb$gross), ]
imdb <- imdb[!is.na(imdb$budget), ]
dim(imdb)
## [1] 3891 28
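The two filtering steps above can also be written as a single step. The sketch below uses base R's `complete.cases()` on a made-up data frame (`toy` and its values are illustrative only); it is an alternative formulation, not the code the report ran.

```r
# Toy data with missing values in either column; values are invented.
toy <- data.frame(
  gross  = c(100, NA, 250, NA),
  budget = c(50, 80, NA, 40)
)

# TRUE only for rows where both gross and budget are present
keep <- complete.cases(toy[, c("gross", "budget")])
toy_clean <- toy[keep, ]

dim(toy_clean)
```

With the tidyverse loaded, `tidyr::drop_na(toy, gross, budget)` expresses the same idea.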
2. Removing Special Characters
# remove the special character (Â) at the end of each movie title
imdb$movie_title <- gsub("Â", "", as.character(imdb$movie_title))
3. Adding Two Columns
# add two columns: profit and percentage return on investment
# (the result is only printed here, not stored; later chunks recompute both columns)
imdb %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)
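As a worked example of the two formulas above, the helper below computes percentage ROI from gross and budget. The function name `roi_percent` and the input numbers are hypothetical and exist only for illustration.

```r
# Percentage return on investment: profit as a share of budget, times 100.
# roi_percent is a hypothetical helper, not part of the original analysis.
roi_percent <- function(gross, budget) {
  profit <- gross - budget
  (profit / budget) * 100
}

roi_percent(gross = 150, budget = 100)  # a film that earns 1.5x its budget
roi_percent(gross = 60,  budget = 100)  # a film that loses 40% of its budget
```

A film that grosses less than its budget gets a negative ROI, and a very small budget inflates the ratio, which is why the ROI chart later in the report filters out tiny budgets.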
4. Removing white spaces
# remove whitespace on the right side and store the result back in the column
imdb$movie_title <- str_trim(imdb$movie_title, side = "right")
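A stray Â at the end of a string is often what a UTF-8 non-breaking space (bytes C2 A0) looks like when the text is decoded as Latin-1. Assuming that is what happened here (the report does not confirm it), the special character and the trailing whitespace could be removed in one pass. The sample titles below are made up to mimic the problem.

```r
library(stringr)

# Made-up titles ending in the debris described above:
# "\u00C2" is the stray character, "\u00A0" is a non-breaking space.
titles <- c("Avatar\u00C2\u00A0", "Spectre\u00C2 ")

# Drop the stray character and any non-breaking space, then trim the right side
clean <- str_trim(str_replace_all(titles, "[\u00C2\u00A0]", ""), side = "right")

print(clean)
```

Note that this pattern would also strip a legitimate Â from a title, so it is only appropriate when that character is known to be debris.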
imdb %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit / budget) * 100)%>%
select(movie_title, title_year, gross, budget, director_name, profit, return_on_investment_perc, imdb_score, content_rating)%>%
datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))
dim(imdb)
## [1] 3891 28
range(imdb$title_year, na.rm = TRUE)
## [1] 1920 2016
summary(imdb$budget)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.180e+02 1.000e+07 2.400e+07 4.521e+07 5.000e+07 1.222e+10
summary(imdb$gross)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 162 6837000 27980000 51050000 65360000 760500000
| S.No. | Variable Name | Description | Data Type |
|---|---|---|---|
| 01 | movie_title | Title of the Movie | Factor |
| 02 | title_year | The year in which the movie is released (1920:2016) | int |
| 03 | gross | Gross earnings of the movie in Dollars | int |
| 04 | budget | Budget of the movie in Dollars | num |
| 05 | director_name | Name of the Director of the Movie | Factor |
| 06 | profit | Profit earned by the movie in Dollars (gross - budget) | num |
| 07 | return_on_investment_perc | Percentage Return on Investment | num |
| 08 | imdb_score | Score of the movie on IMDB | num |
| 09 | content_rating | Content Rating of the movie like G: General Audiences, PG: Parental Guidance etc | Factor |
| 10 | genres | Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’ | Factor |
#Top 20 most successful directors
imdb %>%
group_by(director_name) %>%
mutate(profit = gross - budget)%>%
select(director_name, budget, gross, profit) %>%
na.omit() %>%
summarise(films = n(), budget = sum(as.numeric(budget)), gross = sum(as.numeric(gross)), profit = sum(as.numeric(profit))) %>%
mutate(avg_per_film = profit/films) %>%
arrange(desc(avg_per_film)) %>%
top_n(20, avg_per_film) %>%
ggplot( aes(x = films, y = avg_per_film/1000000)) +
geom_point(size = 1, color = "blue") +
geom_text_repel(aes(label = director_name), size = 3, color = "blue") +
xlab("Number of Films") + ylab("Avg Profit $millions") +
ggtitle("Most Successful Directors")
Insights
These are the top 20 most successful directors based on the average profit earned by the movies they directed.
There’s a clear downward trend between average profit and number of films. This could be because the more movies a director makes, the more hits and misses they accumulate, which pulls the average down. It could also be because such filmmakers direct a more diverse range of movies, including both smaller and high budget ones.
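One way to check how much of this trend is driven by directors with very few films is to require a minimum number of films before ranking. The sketch below does this on invented data; the threshold of 2 films, the `toy` tibble and the column name `avg_profit` are assumptions made for illustration.

```r
library(dplyr)

# Invented example: director B has a single, very profitable film.
toy <- tibble(
  director_name = c("A", "A", "A", "B", "C", "C"),
  gross         = c(10, 20, 30, 100, 40, 20),
  budget        = c(5, 5, 5, 10, 10, 10)
)

by_director <- toy %>%
  mutate(profit = gross - budget) %>%
  group_by(director_name) %>%
  summarise(films = n(), avg_profit = mean(profit), .groups = "drop") %>%
  filter(films >= 2) %>%          # ignore one-film directors
  arrange(desc(avg_profit))

print(by_director)
```

Without the filter, director B would rank first on the strength of one film, which is the averaging effect described above.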
#Top 20 movies based on its Profit
imdb %>%
filter(title_year %in% c(2000:2016)) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=profit/1000000)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Profit $million") +
ggtitle("20 Most Profitable Movies")
Insights
These are the top 20 movies based on the Profit earned (Gross Earnings - Budget). Since the original data covers movies released over a span of 100 years (1916 to 2016; 1920 to 2016 after the cleaning above) and the values for earnings and budget are not adjusted for inflation or market value, the analysis is restricted to movies released from 2000 to 2016.
Among these 20 movies, the higher budget ones tend to earn more profit.
The trend is almost linear, with profit increasing as budget increases. Because the plot only includes the most profitable movies, it says nothing about high budget movies that lost money.
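The "almost linear" reading of the plot could be backed by a number, for example a Pearson correlation and a simple linear fit. This is a sketch on invented values (in $ millions); `toy` is not a sample of the dataset, and on the real data the same calls would run on the 20 filtered movies.

```r
# Invented budgets and grosses, in $ millions, for illustration only.
toy <- data.frame(
  budget = c(10, 50, 100, 150, 200),
  gross  = c(30, 120, 260, 330, 480)
)
toy$profit <- toy$gross - toy$budget

# Pearson correlation between budget and profit
r <- cor(toy$budget, toy$profit)

# Least-squares line: expected profit as a linear function of budget
fit <- lm(profit ~ budget, data = toy)

print(r)
print(coef(fit))
```

A correlation close to 1 with a positive slope would support the linear reading; a value near 0 would not.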
# Top 20 movies based on Return on Investment
imdb %>%
  filter(budget > 100000) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100) %>%
  # rank by ROI rather than by absolute profit, to match the stated goal
  arrange(desc(return_on_investment_perc)) %>%
  top_n(20, return_on_investment_perc) %>%
  ggplot(aes(x = budget / 1000000, y = return_on_investment_perc)) +
  geom_point(size = 2) +
  geom_smooth(size = 1) +
  geom_text_repel(aes(label = movie_title), size = 3) +
  xlab("Budget $million") +
  ylab("Percent Return on Investment") +
  ggtitle("Top 20 Movies by Return on Investment")
Insights
These are the top 20 movies by Percentage Return on Investment, (profit / budget) * 100.
Since absolute profit does not give a clear picture of a movie's monetary success across nearly a century of releases, plotting the Return on Investment (ROI) against Budget gives a fairer comparison.
As hypothesized, the ROI is high for low budget films and decreases as the budget of the movie increases.
# Commercial Success vs Critical Acclaim
imdb %>%
mutate(profit=gross- budget) %>%
top_n(10,profit) %>%
ggplot(aes(x=imdb_score, y=gross/10^6, size=profit/10^6, colour= content_rating)) +
geom_point() +
geom_hline(aes(yintercept = 600)) +
geom_vline(aes(xintercept = 7.75)) +
geom_text_repel(aes(label = movie_title), size =4) +
xlab("Imdb score") +
ylab("Gross money earned in million dollars") +
ggtitle("Commercial success Vs Critical acclaim") +
annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross")
Insights
This plot compares the commercial success of a movie (gross earnings and profit earned) with its IMDB Score, for the 10 most profitable movies.
As expected, there is not much correlation, since most critically acclaimed movies do not do especially well commercially.
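# frequency table of the genres field, with explicit column names genre and Freq
genre <- as.data.frame(table(genre = imdb$genres), stringsAsFactors = FALSE)
genre <- genre[order(genre$Freq, decreasing = TRUE), ]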
# Top 20 genres with the most movies
ggplot(genre[1:20,], aes(x=reorder(factor(genre), Freq), y=Freq, alpha=Freq)) +
geom_bar(stat = "identity", fill="blue") +
geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
xlab("Genre") +
ylab("Number of Movies") +
ggtitle("Top 20 genres with the most movies") +
coord_flip()
Insights
This plot shows the 20 most frequent values of the genres field.
Clearly, ‘Drama’, ‘Romance’ and ‘Comedy’ are the genres most favored by directors and producers.
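If the genres field holds several genres per movie joined by a `|` character (an assumption about the data layout, not something shown in this report), each value counted above is a combination rather than a single genre. The sketch below splits such combinations so that every individual genre is counted once per movie. The `toy` tibble is invented for illustration.

```r
library(dplyr)
library(tidyr)

# Invented rows mimicking pipe-delimited genre combinations.
toy <- tibble(
  movie_title = c("M1", "M2", "M3"),
  genres      = c("Action|Adventure", "Drama", "Comedy|Drama|Romance")
)

genre_counts <- toy %>%
  separate_rows(genres, sep = "\\|") %>%   # one row per movie and genre
  count(genres, name = "Freq", sort = TRUE)

print(genre_counts)
```

Counting individual genres this way can rank them differently from counting whole combinations, so the two views are worth comparing before drawing conclusions about genre popularity.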
Summarizing the problem statement
Analyses on the monetary success and the likeability of movies were done to gather insights about the success of several movies.
Summarizing the implementation
The data was scraped and cleaned for the analysis. It was then reviewed graphically, using scatter plots, line charts and bar charts, to identify general trends in the movie industry.
Summarizing the Analyses
There’s an approximately linear trend between the profit earned and the budget among the most profitable movies released from 2000 to 2016.
There’s an inverse relation between the Return on Investment and the Budget of a movie.
There is not much correlation between the profit earned and the IMDB rating of a movie, indicating that most critically acclaimed movies do not do especially well commercially.
‘Drama’, ‘Romance’ and ‘Comedy’ are the genres most favored by directors and producers.