IMDB Movies Analysis

Synopsis

Problem Statement: Since there is no universal way to judge how good a movie is, one can rely on parameters such as critic ratings, profit earned, and viewer likes and ratings.

Implementation: Since the data includes each movie’s release year, budget, gross earnings, director name and so on, I work with these aspects to analyze the success of movies and directors over the years.

Summary: Several trends were found between a movie’s budget, its profit, and its Return on Investment. The analysis also identifies the most successful directors and the most popular genres among directors and producers.

Packages Required

library(knitr)          # for knitting R code to HTML files
library(magrittr)       # for chaining commands with the pipe operator, %>%
library(tidyverse)      # for tibbles, graphs, data transformation and manipulation tasks
library(lubridate)      # for working with date-times and time-spans
library(stringr)        # for string handling functions
library(ggrepel)        # to repel overlapping text labels away from each other
library(DT)             # for HTML display of data

Import Data

url <- "https://raw.githubusercontent.com/akashgupta4891/datasharing/master/movie_metadata.csv"
imdb_data <- read.csv(url)
imdb <- as_tibble(imdb_data)

Data Preparation

Source

The data set was taken from Kaggle.

More about the Data

Original Purpose: Given that thousands of movies are produced each year, this dataset was collected to find a better way to judge the greatness of movies without relying on critics or one’s own instincts.

Initial Data Description:

The data was originally scraped for 5000+ movies from the IMDB website using a Python library called “scrapy” 3 months ago.
The original dataset has 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries.
There are 2399 unique director names, and thousands of actors/actresses.

Original Variables

movie_title: Title of the movie
duration: Duration in minutes
director_name: Name of the director of the movie
director_facebook_likes: Number of likes on the director’s Facebook page
color: Film colorization, ‘Black and White’ or ‘Color’
genres: Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’
actor_1_name: Primary actor starring in the movie
actor_1_facebook_likes: Number of likes on actor_1’s Facebook page
actor_2_name: Other actor starring in the movie
actor_2_facebook_likes: Number of likes on actor_2’s Facebook page
actor_3_name: Other actor starring in the movie
actor_3_facebook_likes: Number of likes on actor_3’s Facebook page
num_critic_for_reviews: Number of critic reviews on IMDB
num_voted_users: Number of people who voted for the movie
cast_total_facebook_likes: Total number of Facebook likes of the entire cast of the movie
language: English, Arabic, Chinese, French, German, Danish, Italian, Japanese etc.
country: Country where the movie was produced
gross: Gross earnings of the movie in dollars
budget: Budget of the movie in dollars
title_year: The year in which the movie was released (1916:2016)
imdb_score: Score of the movie on IMDB
movie_facebook_likes: Number of Facebook likes on the movie’s page

Data Peculiarities

Total missing values: 2059
Duplicate values in movie_title: 126
Every movie_title ends with a special character (Â).
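The two counts above can be reproduced with a couple of base R calls. The sketch below uses a small made-up data frame (`toy` and its values are illustrative, not taken from the dataset), and it assumes that "duplicate values" means titles that repeat an earlier row, which the report does not spell out.

```r
# Toy data standing in for the real table (all values are made up)
toy <- data.frame(
  movie_title = c("A", "B", "B", "C"),
  gross       = c(100, NA, 250, NA),
  budget      = c(50, 80, NA, 40),
  stringsAsFactors = FALSE
)

# Total number of missing cells across all columns
total_na <- sum(is.na(toy))

# Number of rows whose title repeats the title of an earlier row
dup_titles <- sum(duplicated(toy$movie_title))

print(total_na)
print(dup_titles)
```

On the real data the same two expressions would be applied to `imdb` and `imdb$movie_title`.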

Data Cleaning

1. Missing Values

To count the missing values in each column:

# number of NA entries per column
all_na <- sapply(imdb, function(y) sum(is.na(y)))
all_na <- data.frame(all_na)
kable(all_na)

	all_na
color	0
director_name	0
num_critic_for_reviews	50
duration	15
director_facebook_likes	104
actor_3_facebook_likes	23
actor_2_name	0
actor_1_facebook_likes	7
gross	884
genres	0
actor_1_name	0
movie_title	0
num_voted_users	0
cast_total_facebook_likes	0
actor_3_name	0
facenumber_in_poster	13
plot_keywords	0
movie_imdb_link	0
num_user_for_reviews	21
language	0
country	0
content_rating	0
budget	492
title_year	108
actor_2_facebook_likes	13
imdb_score	0
aspect_ratio	329
movie_facebook_likes	0

Since gross has 884 missing values and budget has 492, rows with a missing value for gross or budget are deleted.

imdb <- imdb[!is.na(imdb$gross), ]
imdb <- imdb[!is.na(imdb$budget), ]
dim(imdb)

## [1] 3891   28
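The same row filter can be written in one step with `complete.cases()`. This is a minimal sketch on made-up data (`toy` is hypothetical), shown only to make the keep/drop rule explicit: a row survives when both gross and budget are present.

```r
# Toy data: rows "B" and "C" each miss one of the two money columns
toy <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  gross       = c(100, NA, 250, 300),
  budget      = c(50, 80, NA, 40),
  stringsAsFactors = FALSE
)

# Keep only rows where both gross and budget are present
kept <- toy[complete.cases(toy[, c("gross", "budget")]), ]

print(dim(kept))
```

Restricting `complete.cases()` to the two columns matters: applied to the whole table it would also drop rows with gaps in unrelated columns such as aspect_ratio.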

2. Removing Special Characters

# remove the special character (Â) at the end of each movie title
imdb$movie_title <- gsub("Â", "", as.character(imdb$movie_title))

3. Adding Two Columns

# add two columns: profit and percentage return on investment
# (the result is printed here, not stored, so later chunks recompute both columns)
imdb %>% 
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100)
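To make the two formulas concrete, here is a small worked example in base R. The three movies and their figures are invented for illustration; only the formulas (profit = gross - budget, ROI = profit / budget * 100) come from the analysis.

```r
# Made-up figures in dollars, chosen so the arithmetic is easy to follow
toy <- data.frame(
  movie_title = c("Low budget hit", "Blockbuster", "Flop"),
  budget      = c(1e6, 200e6, 50e6),
  gross       = c(11e6, 500e6, 20e6),
  stringsAsFactors = FALSE
)

# Profit is what is left of the gross after the budget is paid back
toy$profit <- toy$gross - toy$budget

# ROI expresses that profit as a percentage of the budget
toy$return_on_investment_perc <- (toy$profit / toy$budget) * 100

print(toy)
```

The low budget hit earns far less profit than the blockbuster (10 million versus 300 million) yet has the much higher ROI (1000% versus 150%), which is why the two measures can rank movies very differently.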

4. Removing white spaces

# remove whitespace on the right side of each title and store the result
imdb$movie_title <- str_trim(imdb$movie_title, side = "right")
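Steps 2 and 4 can also be folded into one small helper so the cleaned titles are always assigned back. This is a sketch under assumptions: `clean_title` is a hypothetical name, the stray character is written as the escape `"\u00c2"` (Â), and the sample titles are made up to mimic the pattern described above.

```r
# Hypothetical helper: strip the stray "Â" and any trailing whitespace
clean_title <- function(x) {
  x <- as.character(x)                      # factors become plain strings
  x <- gsub("\u00c2", "", x, fixed = TRUE)  # drop every "Â"
  trimws(x, which = "right")                # drop trailing whitespace
}

# Made-up titles ending in "Â" plus a space
raw_titles <- c("Avatar\u00c2 ", "Spectre\u00c2 ")
cleaned <- clean_title(raw_titles)

print(cleaned)
```

On the real data the call would be `imdb$movie_title <- clean_title(imdb$movie_title)`.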

Cleaned Dataset

imdb %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100) %>%
  select(movie_title, title_year, gross, budget, director_name, profit,
         return_on_investment_perc, imdb_score, content_rating) %>%
  datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

Data Summary

dim(imdb)

## [1] 3891   28

range(imdb$title_year, na.rm = TRUE)

## [1] 1920 2016

summary(imdb$budget)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 2.180e+02 1.000e+07 2.400e+07 4.521e+07 5.000e+07 1.222e+10

summary(imdb$gross)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       162   6837000  27980000  51050000  65360000 760500000

S.No.	Variable Name	Description	Data Type
01	movie_title	Title of the Movie	chr
02	title_year	The year in which the movie is released (1920:2016)	int
03	gross	Gross earnings of the movie in Dollars	int
04	budget	Budget of the movie in Dollars	num
05	director_name	Name of the Director of the Movie	Factor
06	profit	Profit earned by the movie in Dollars	num
07	return_on_investment_perc	Percentage Return on Investment	num
08	imdb_score	Score of the movie on IMDB	num
09	content_rating	Content Rating of the movie like G: General Audiences, PG: Parental Guidance etc	Factor
10	genres	Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’	Factor

Exploratory Data Analyses

Analysis 1

# Top 20 most successful directors, ranked by average profit per film
imdb %>%
  group_by(director_name) %>%
  mutate(profit = gross - budget) %>%
  select(director_name, budget, gross, profit) %>%
  na.omit() %>%
  summarise(films = n(),
            budget = sum(as.numeric(budget)),
            gross = sum(as.numeric(gross)),
            profit = sum(as.numeric(profit))) %>%
  mutate(avg_per_film = profit / films) %>%
  arrange(desc(avg_per_film)) %>%
  top_n(20, avg_per_film) %>%
  ggplot(aes(x = films, y = avg_per_film / 1000000)) +
  geom_point(size = 1, color = "blue") +
  geom_text_repel(aes(label = director_name), size = 3, color = "blue") +
  xlab("Number of Films") + ylab("Avg Profit $millions") +
  ggtitle("Most Successful Directors")

Insights

These are the top 20 most successful directors based on the average profit earned by the movies they directed.

There is a clear downward trend between average profit and number of films. One explanation is that a director who makes more movies has both more hits and more misses, so the average goes down. Another is that such directors have a more diverse range of movies, including both smaller and higher budget films.
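The effect described above is easy to see on a tiny example. The sketch below uses invented directors and figures and plain base R (`tapply`) rather than the dplyr pipeline, purely to show how a single big hit can outrank a longer filmography on average profit per film.

```r
# Made-up filmographies, figures in dollars
toy <- data.frame(
  director_name = c("Dir A", "Dir A", "Dir B", "Dir C", "Dir C", "Dir C"),
  budget        = c(10, 20, 50, 5, 5, 10) * 1e6,
  gross         = c(30, 40, 200, 4, 10, 16) * 1e6,
  stringsAsFactors = FALSE
)
toy$profit <- toy$gross - toy$budget

# Per-director film count and total profit
films <- tapply(toy$profit, toy$director_name, length)
total <- tapply(toy$profit, toy$director_name, sum)

by_director <- data.frame(
  director_name = names(films),
  films         = as.vector(films),
  avg_per_film  = as.vector(total / films),
  stringsAsFactors = FALSE
)

# Highest average profit per film first
by_director <- by_director[order(-by_director$avg_per_film), ]
print(by_director)
```

Here the one-film director comes out on top. If that is not the intended notion of success, one possible refinement (not part of the original analysis) is to keep only directors above some minimum number of films before ranking.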

Analysis 2

# Top 20 movies by profit, released from 2000 to 2016
imdb %>% 
  filter(title_year %in% 2000:2016) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(profit)) %>% 
  top_n(20, profit) %>%
  ggplot(aes(x=budget/1000000, y=profit/1000000)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Profit $million") + 
  ggtitle("20 Most Profitable Movies")

Insights

These are the top 20 movies based on the profit earned (gross earnings - budget). The data covers movies released over a span of 100 years, from 1916 to 2016, and the values for earnings and budget are not adjusted for inflation or market value. The analysis is therefore restricted to movies released from 2000 to 2016.

Among these movies, the plot suggests that higher budget movies tend to earn more profit.

The trend is almost linear, with profit increasing as the budget increases.
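A claim like "almost linear" can be backed by a simple least-squares fit. The sketch below is an illustration on made-up numbers (budget and profit in $ millions), not a result from the dataset; it shows how the slope and R-squared of `lm(profit ~ budget)` would quantify the trend.

```r
# Made-up budgets and profits in $ millions
toy <- data.frame(
  budget = c(100, 150, 200, 250),
  profit = c(210, 290, 410, 490)
)

# Ordinary least-squares fit of profit on budget
fit <- lm(profit ~ budget, data = toy)

slope     <- unname(coef(fit)["budget"])  # extra profit per extra budget dollar
r_squared <- summary(fit)$r.squared       # share of variance explained by the line

print(slope)
print(r_squared)
```

On the real data the same fit would be run on the 20 plotted movies; an R-squared close to 1 would support the linear reading, a low one would not.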

Analysis 3

# Top 20 movies by Return on Investment (budget above $100,000)
imdb %>% 
  filter(budget > 100000) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(return_on_investment_perc)) %>% 
  top_n(20, return_on_investment_perc) %>%
  ggplot(aes(x=budget/1000000, y=return_on_investment_perc)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Percent Return on Investment") + 
  ggtitle("20 Most Profitable Movies based on its Return on Investment")

Insights

These are the top 20 movies based on percentage Return on Investment, (profit/budget)*100.

Since raw profit does not give a clear picture of a movie’s monetary success across the years (1916 to 2016), comparing the Return on Investment (ROI) against the budget gives a fairer view.

As hypothesized, the ROI is high for low budget films and decreases as the budget of the movie increases.

Analysis 4

# Commercial Success vs Critical Acclaim
imdb %>%
  mutate(profit = gross - budget) %>%
  top_n(10, profit) %>%
  ggplot(aes(x = imdb_score, y = gross / 10^6, size = profit / 10^6, colour = content_rating)) + 
  geom_point() + 
  geom_hline(aes(yintercept = 600)) + 
  geom_vline(aes(xintercept = 7.75)) + 
  geom_text_repel(aes(label = movie_title), size =4) +
  xlab("Imdb score") + 
  ylab("Gross money earned in million dollars") + 
  ggtitle("Commercial success Vs Critical acclaim") +
  annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross")

Insights

This analysis compares the commercial success of a movie (gross earnings and profit) with its IMDB score, for the 10 most profitable movies.

As expected, there is not much correlation, since most critically acclaimed movies do not do very well commercially.
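The "not much correlation" reading can be checked numerically with a rank correlation. The sketch below uses invented scores and gross figures (in $ millions) and Spearman's coefficient, chosen here as an assumption because it does not require a linear relationship; it is not a computation from the report.

```r
# Made-up IMDB scores and gross earnings in $ millions
toy <- data.frame(
  imdb_score = c(6.1, 7.2, 7.9, 8.4, 8.8),
  gross      = c(300, 120, 650, 90, 400)
)

# Spearman rank correlation: +1 means the rankings agree, 0 means no monotone link
rho <- cor(toy$imdb_score, toy$gross, method = "spearman")

print(rho)
```

A value near zero, as in this toy case, would match the insight; running the same call on all movies rather than only the plotted ones would be the more direct test.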

Analysis 5

# frequency of each value in the genres column, most frequent first
genre <- imdb['genres']
genre <- data.frame(table(genre))
genre <- genre[order(genre$Freq, decreasing = TRUE), ]

# Top 20 genres with the most movies
ggplot(genre[1:20,], aes(x=reorder(factor(genre), Freq), y=Freq, alpha=Freq)) + 
  geom_bar(stat = "identity", fill="blue") + 
  geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
  xlab("Genre") + 
  ylab("Number of Movies") + 
  ggtitle("Top 20 genres with the most movies") + 
  coord_flip()

Insights

This plot shows the 20 genres with the most movies.

Clearly, ‘Drama’, ‘Romance’ and ‘Comedy’ are the most popular genres among directors and producers.
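The count above treats each value of the genres column as one category. If a value can hold several labels joined by "|" (an assumption about the column's format that this report does not state), individual genres could be counted by splitting first. A minimal base R sketch on made-up values:

```r
# Made-up genre strings, assuming "|" separates the individual labels
toy_genres <- c("Comedy|Drama|Romance", "Drama", "Action|Adventure", "Comedy|Drama")

# Split every string on the literal "|" and flatten into one vector of labels
individual <- unlist(strsplit(toy_genres, "|", fixed = TRUE))

# Count each label, most frequent first
genre_counts <- sort(table(individual), decreasing = TRUE)

print(genre_counts)
```

`fixed = TRUE` is needed because "|" is otherwise read as the regular-expression "or" operator.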

Summary

Summarizing the problem statement

The monetary success and the likeability of movies were analyzed to gather insights into what makes movies successful.

Summarizing the implementation

The data was scraped and then cleaned and transformed for the analysis. It was then reviewed graphically, using scatter plots, line charts and bar charts, to identify general trends in the movie industry.

Summarizing the Analyses

There is a linear trend between the profit earned and the budget of a movie.
There is an inverse relation between the Return on Investment and the budget of a movie.
There is not much correlation between the profit earned and the IMDB rating of a movie, indicating that most critically acclaimed movies do not do very well commercially.
‘Drama’, ‘Romance’ and ‘Comedy’ are the most popular genres among directors and producers.

IMDB Movies Analysis

Jasmine Sachdeva

2016-12-10
