Synopsis

Problem Statement: Since there is no universal way to judge how good a movie is, one has to rely on measurable parameters such as critic ratings, profit earned, and viewer likes and ratings.

Implementation: The data includes each movie's release year, budget, gross earnings, director name and so on, so I work with these variables to analyze the success of movies and directors over the years.

Summary: Several trends emerge between a movie's budget, the profit it earned, and its Return on Investment. The analysis also identifies the most successful directors and the most popular genres among directors and producers.

Packages Required

library(knitr)          # for knitting R code to HTML files
library(magrittr)       # for chaining commands with the pipe operator, %>%
library(tidyverse)      # for tibbles, graphs, and data transformation and manipulation tasks
library(lubridate)      # for working with date-times and time spans
library(stringr)        # for string handling functions
library(ggrepel)        # to repel overlapping text labels away from each other
library(DT)             # for HTML display of data

Import Data

url <- "https://raw.githubusercontent.com/akashgupta4891/datasharing/master/movie_metadata.csv"
imdb_data <- read.csv(url)
imdb <- as_tibble(imdb_data)

Data Preparation

Source

The data set was taken from Kaggle.

More about the Data

Original Purpose: Given that thousands of movies are produced each year, this dataset was collected to find a better way to assess the greatness of movies without relying on critics or one’s own instincts.

Initial Data Description

  • The data was originally scraped for 5000+ movies from the IMDB website using a Python library called “scrapy” 3 months ago.

  • The original dataset has 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries.

  • There are 2399 unique director names, and thousands of actors/actresses.

Original Variables

  • movie_title : Title of the Movie
  • duration: Duration in minutes
  • director_name : Name of the Director of the Movie.
  • director_facebook_likes : Number of likes of the Director on his Facebook Page.
  • color: Film colorization. ‘Black and White’ or ‘Color’
  • genres: Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’
  • actor_1_name: Primary actor starring in the movie
  • actor_1_facebook_likes : Number of likes of the Actor_1 on his/her Facebook Page.
  • actor_2_name: Other actor starring in the movie
  • actor_2_facebook_likes : Number of likes of the Actor_2 on his/her Facebook Page.
  • actor_3_name: Other actor starring in the movie
  • actor_3_facebook_likes : Number of likes of the Actor_3 on his/her Facebook Page.
  • num_critic_for_reviews : Number of critic reviews on IMDB
  • num_voted_users: Number of people who voted for the movie
  • cast_total_facebook_likes: Total number of facebook likes of the entire cast of the movie.
  • language : English, Arabic, Chinese, French, German, Danish, Italian, Japanese etc
  • country: Country where the movie is produced.
  • gross: Gross earnings of the movie in Dollars
  • budget: Budget of the movie in Dollars
  • title_year: The year in which the movie is released (1916:2016)
  • imdb_score: Score of the movie on IMDB
  • movie_facebook_likes: Number of Facebook likes on the movie's page.

Data Peculiarities

  • There are 2059 missing values in total.

  • Number of duplicate values for Movie Title: 126

  • The data has a special character (Â) at the end of every movie_title.
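
One way to check the first two figures is with base R. The sketch below runs on a small invented data frame (`toy` and its values are hypothetical, not taken from the dataset); the same two calls applied to `imdb` right after import would give dataset-wide counts, although the source does not say exactly how the figures above were computed.

```r
# Hypothetical toy data standing in for the imported `imdb` tibble
toy <- data.frame(
  movie_title = c("A", "B", "A", "C"),
  gross       = c(100, NA, 100, 50),
  budget      = c(NA, 20, 80, NA),
  stringsAsFactors = FALSE
)

# Total number of missing cells across all columns
total_na <- sum(is.na(toy))

# Number of rows whose title repeats an earlier title
n_dup_titles <- sum(duplicated(toy$movie_title))

total_na
n_dup_titles
```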

Data Cleaning

1. Missing Values: To count the missing values in each column:

# number of NA entries per column
all_na <- sapply(imdb, function(y) sum(is.na(y)))
all_na <- data.frame(all_na)
kable(all_na)
all_na
color 0
director_name 0
num_critic_for_reviews 50
duration 15
director_facebook_likes 104
actor_3_facebook_likes 23
actor_2_name 0
actor_1_facebook_likes 7
gross 884
genres 0
actor_1_name 0
movie_title 0
num_voted_users 0
cast_total_facebook_likes 0
actor_3_name 0
facenumber_in_poster 13
plot_keywords 0
movie_imdb_link 0
num_user_for_reviews 21
language 0
country 0
content_rating 0
budget 492
title_year 108
actor_2_facebook_likes 13
imdb_score 0
aspect_ratio 329
movie_facebook_likes 0

Since gross has 884 missing values and budget has 492, rows with a missing value in either column are deleted.

imdb <- imdb[!is.na(imdb$gross), ]
imdb <- imdb[!is.na(imdb$budget), ]
dim(imdb)
## [1] 3891   28
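
The two filters above can also be written as a single step. A minimal sketch, assuming base R's `complete.cases()` restricted to the two columns of interest; `toy` is a made-up stand-in for `imdb`.

```r
# Hypothetical stand-in for imdb with gaps in gross and budget
toy <- data.frame(
  gross  = c(100, NA, 300, 50),
  budget = c(NA, 20, 80, 10)
)

# Keep only rows where both gross and budget are present
kept <- toy[complete.cases(toy[, c("gross", "budget")]), ]

nrow(kept)
```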

2. Removing Special Characters

# removing the special character (Â) at the end of each movie title
imdb$movie_title <- gsub("Â", "", as.character(imdb$movie_title))
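
To see what this substitution does, here is a small sketch on invented titles (the vector `titles` is hypothetical). It uses the escape `\u00c2` for the stray character so the example does not depend on the encoding of the script file, and it also trims trailing spaces with base R's `trimws()`.

```r
# Hypothetical titles carrying the stray character and trailing spaces
titles <- c("Avatar\u00c2 ", "Spectre\u00c2  ", "Titanic")

# Drop the stray character wherever it occurs
titles <- gsub("\u00c2", "", titles, fixed = TRUE)

# Remove whitespace on the right side only
titles <- trimws(titles, which = "right")

titles
```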

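3. Adding Two Columns

# adding two columns: profit and percentage return on investment.
# The result is only printed, not assigned, so imdb keeps its 28 columns;
# the later chunks recompute these columns where they need them.
imdb %>% 
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)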
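
As a worked example of the two formulas, profit = gross - budget and ROI % = (profit / budget) * 100, the sketch below uses two invented movies (names and dollar figures are hypothetical). It is written in base R so that it runs without extra packages.

```r
# Two hypothetical movies: a small production and a large one
toy <- data.frame(
  movie_title = c("Small film", "Big film"),
  budget      = c(10e6, 200e6),
  gross       = c(25e6, 300e6)
)

# Profit in dollars and return on investment as a percentage of budget
toy$profit <- toy$gross - toy$budget
toy$return_on_investment_perc <- toy$profit / toy$budget * 100

toy
```

Here the smaller film earns less profit but a higher percentage return, which is why the two measures can rank movies differently.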

4. Removing White Spaces

# remove whitespace on the right side and store the trimmed titles
imdb$movie_title <- str_trim(imdb$movie_title, side = "right")

Cleaned Dataset

imdb %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100) %>%
  select(movie_title, title_year, gross, budget, director_name, profit,
         return_on_investment_perc, imdb_score, content_rating) %>%
  datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

Data Summary

dim(imdb)
## [1] 3891   28
range(imdb$title_year, na.rm = TRUE)
## [1] 1920 2016
summary(imdb$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 2.180e+02 1.000e+07 2.400e+07 4.521e+07 5.000e+07 1.222e+10
summary(imdb$gross)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       162   6837000  27980000  51050000  65360000 760500000

S.No. | Variable Name | Description | Data Type
----- | ------------- | ----------- | ---------
01 | movie_title | Title of the Movie | Factor
02 | title_year | The year in which the movie is released (1920:2016) | int
03 | gross | Gross earnings of the movie in Dollars | int
04 | budget | Budget of the movie in Dollars | num
05 | director_name | Name of the Director of the Movie | Factor
06 | profit | Profit earned by the movie in Dollars | num
07 | return_on_investment_perc | Percentage Return on Investment | num
08 | imdb_score | Score of the movie on IMDB | num
09 | content_rating | Content Rating of the movie like G: General Audiences, PG: Parental Guidance etc | Factor
10 | genres | Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’ | Factor

Exploratory Data Analyses

Analysis 1

# Top 20 most successful directors
imdb %>%
  group_by(director_name) %>%
  mutate(profit = gross - budget) %>%
  select(director_name, budget, gross, profit) %>%
  na.omit() %>%
  summarise(films = n(),
            budget = sum(as.numeric(budget)),
            gross = sum(as.numeric(gross)),
            profit = sum(as.numeric(profit))) %>%
  mutate(avg_per_film = profit / films) %>%
  arrange(desc(avg_per_film)) %>%
  top_n(20, avg_per_film) %>%
  ggplot(aes(x = films, y = avg_per_film / 1000000)) +
  geom_point(size = 1, color = "blue") +
  geom_text_repel(aes(label = director_name), size = 3, color = "blue") +
  xlab("Number of Films") +
  ylab("Avg Profit $millions") +
  ggtitle("Most Successful Directors")

Insights

These are the top 20 most successful directors based on the average profit earned by the movies they directed.

Average profit per film clearly trends downward as the number of films increases. One possible reason is that a director who makes more movies has more hits and misses, which pulls the average down. It could also be that such filmmakers cover a more diverse range of movies, including both smaller and high-budget productions.
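
The averaging argument can be illustrated with invented numbers (the directors and profits below are hypothetical): a director with a single hit keeps that hit as the average, while a director with the same hit plus two weaker films ends up with a lower average.

```r
# Hypothetical profits per film, in $ millions
toy <- data.frame(
  director_name = c("Director A", "Director B", "Director B", "Director B"),
  profit        = c(500, 500, 100, -50)
)

# Number of films and average profit per film for each director
films        <- table(toy$director_name)
avg_per_film <- tapply(toy$profit, toy$director_name, mean)

films
avg_per_film
```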

Analysis 2

# Top 20 movies by profit, restricted to releases from 2000 to 2016
imdb %>% 
  filter(title_year %in% c(2000:2016)) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(profit)) %>% 
  top_n(20, profit) %>%
  ggplot(aes(x=budget/1000000, y=profit/1000000)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Profit $million") + 
  ggtitle("20 Most Profitable Movies")

Insights

These are the top 20 movies based on the profit earned (gross earnings - budget). The data covers movies released over a span of 100 years (1916 to 2016), and the earnings and budget figures are not adjusted for inflation or market value, so this analysis is restricted to movies released from 2000 to 2016.

Among these 20 movies, the plot suggests that higher-budget movies tend to earn more profit.

The trend is almost linear, with profit increasing as budget increases.
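
How close to linear the relation is can be quantified with a correlation coefficient and a fitted line. A minimal sketch on invented budget and profit figures in $ million (the data frame `toy` is hypothetical); on the real data the same calls would be applied to the 20 selected movies.

```r
# Hypothetical budget and profit values, in $ millions
toy <- data.frame(
  budget = c(50, 100, 150, 200, 250),
  profit = c(120, 210, 290, 420, 480)
)

# Pearson correlation: values near 1 indicate a nearly linear increasing trend
r <- cor(toy$budget, toy$profit)

# Least-squares line; the budget coefficient is the extra profit per extra $ million of budget
fit <- lm(profit ~ budget, data = toy)

r
coef(fit)
```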

Analysis 3

# Return on Investment of the 20 most profitable movies (budget above $100,000)
imdb %>% 
  filter(budget >100000) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(profit)) %>% 
  top_n(20, profit) %>%
  ggplot(aes(x=budget/1000000, y=return_on_investment_perc)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Percent Return on Investment") + 
  ggtitle("20 Most Profitable Movies based on its Return on Investment")

Insights

These are the 20 most profitable movies with a budget above $100,000, plotted by their percentage Return on Investment, (profit/budget)*100.

Since profit alone does not give a clear picture of a movie's monetary success across the years (1916 to 2016), looking at the Return on Investment (ROI) relative to the budget should give a fairer comparison.

As hypothesized, the ROI is high for Low Budget Films and decreases as the budget of the movie increases.

Analysis 4

# Commercial Success vs Critical Acclaim
imdb %>%
  mutate(profit = gross - budget) %>%
  top_n(10, profit) %>%
  ggplot(aes(x = imdb_score, y = gross / 10^6, size = profit / 10^6, colour = content_rating)) + 
  geom_point() + 
  geom_hline(aes(yintercept = 600)) + 
  geom_vline(aes(xintercept = 7.75)) + 
  geom_text_repel(aes(label = movie_title), size =4) +
  xlab("Imdb score") + 
  ylab("Gross money earned in million dollars") + 
  ggtitle("Commercial success Vs Critical acclaim") +
  annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross")

Insights

This analysis compares the commercial success achieved by a movie (gross earnings and profit earned) with its IMDB score, for the 10 most profitable movies.

As expected, there is not much correlation between the two, since most critically acclaimed movies do not do very well commercially.

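Analysis 5

# Count how many movies carry each value of the genres column.
# Naming the table dimension keeps the column called `genre` for the plot below.
genre <- as.data.frame(table(genre = imdb$genres))
genre <- genre[order(genre$Freq, decreasing = TRUE), ]

# Top 20 genres with the most movies
ggplot(genre[1:20, ], aes(x = reorder(factor(genre), Freq), y = Freq, alpha = Freq)) + 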
  geom_bar(stat = "identity", fill="blue") + 
  geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
  xlab("Genre") + 
  ylab("Number of Movies") + 
  ggtitle("Top 20 genres with the most movies") + 
  coord_flip()

Insights

This plot shows the 20 most frequent genres among the movies.

Clearly, ‘Drama’, ‘Romance’ and ‘Comedy’ are the most popular genres among directors and producers.
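
Note that this analysis counts each distinct value of the `genres` column as its own category. Assuming that column stores several labels in one pipe-separated string (for example "Comedy|Drama|Romance"), individual genres could be counted by splitting the strings first. A sketch on invented rows (the data frame `toy` is hypothetical):

```r
# Hypothetical rows with pipe-separated genre strings (assumed format)
toy <- data.frame(
  movie_title = c("Film 1", "Film 2", "Film 3"),
  genres      = c("Comedy|Drama|Romance", "Drama", "Action|Comedy"),
  stringsAsFactors = FALSE
)

# Split each string on the literal "|" and pool all individual labels
single_genres <- unlist(strsplit(toy$genres, "|", fixed = TRUE))

# Count movies per individual genre, most frequent first
genre_counts <- sort(table(single_genres), decreasing = TRUE)

genre_counts
```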

Summary

Summarizing the problem statement

The monetary success and the likeability of movies were analyzed to gather insights into what makes movies successful.

Summarizing the implementation

The data was scraped and manipulated accordingly for the analysis. The data was then reviewed graphically, using scatter plots, line charts and bar charts, to determine the general trends in the movie industry.

Summarizing the Analyses

  • There’s a roughly linear trend between the profit earned and the budget of the movie.

  • There’s an inverse relation between the Return on Investment and the budget of a movie.

  • There is not much correlation between the profit earned and the IMDB rating of the movie, indicating that most critically acclaimed movies do not do very well commercially.

  • ‘Drama’, ‘Romance’ and ‘Comedy’ are the most popular genres among directors and producers.