Movie Exploratory Analysis (Using IMDB datasets)

Synopsis

  1. Problem Statement: The objective of this project is to explore the Kaggle Movies dataset and answers some interesting questions like which countries produce most movies, profitability analysis, kind of movies are most produced, most produced genres etc.

  1. Solution Overview: Post the data cleaning and exploration of the data I was able to answer some of the above questions. A variety of packages were used for this analysis, which are explained in the later sections. Each analysis has been explained along with its codes.

  1. Insights: Some very interesting insights were obtained from this analysis:

    • Drama, Comedy and Thriller are the top genres.
    • US has the most thriving movie industry, and movies are being produced since the early-mid nineties. Japan, Italy, Germany and France being only other countries which produced significant number of movies before 1980s.
    • India has significantly higher average movie budget (~100 M$) than other countries.
    • New Zealand tops the average movie revenue (~90 M$), and is significantly higher than the second highest Taiwan.
    • English is the language in which most movies are made.
    • Average movie IMDB Score is 6.4 and 90% of movies have a score between 8.1 and 4.3. IMDB scores follow a bell shaped distribution.
    • Movies with higher IMDB score tend to have higher revenue.
    • Significant number of movies go into loss.

Assumption: Analysis is performed based on data available in the dataset. The results shown may not be True.

Packages Required

  1. Following packages were used:

    • Tibble: Used to store data as a tibble, and makes it much easier to handle and manipulate data (package required - tidyverse)
    • DT: Used to display the data on the screen in a scrollable format (package required - DT)
    • Knitr: Used to display an aligned table on the screen
    • TM: Used for text mining on the “Genre” columns in the data (package required - tm)
    • Dplyr: Used for data manipulation
    • Ggplot2: Used to plot charts
    • Wordcloud: Used to chart wordcloud in the genre text analysis (package required - wordcloud)
    • Fitdistrplus: Used for statistical analysis (distribution fitting) (package required - fitdistrplus)
    • Plotly: Used to plot interactive charts
    • Plyr: Used for data transformation (install.packages(“plyr”, dependencies=T) and then require(plyr))
library(tibble)
library(DT)
library(knitr)
library(tm)
library(ggplot2)
library(wordcloud)
library(dplyr)
library(fitdistrplus)
library(plyr)
#library(plotly)

Data Preparation

This sections contains all the procedures followed in getting the data analysis ready. Each step is explained and the codes are given.

Data Import

IMDB 5000 Movie Dataset from Kaggle

This dataset contains 28 columns and about 5,043 movies spanning across 100 years in 66 countries. There are 2399 unique director names, and thousands of actors/actresses.

Data Import Code:

library(tibble)
movie <- as_data_frame(read.csv("movie_metadata.csv", stringsAsFactors = FALSE))
class(movie)
## [1] "tbl_df"     "tbl"        "data.frame"
colnames(movie)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"
dim(movie)
## [1] 5043   28

Data Cleaning

The first part of data cleaning involves removal of spurious characters (Â) from a the movie title, genre and plot keyword columns. This could have come up during data scrapping from the net.

Then we remove the duplicates in the data, using then “movie_title” column. Duplicate data will skew our analysis hence needs to be removed.

Third part of data cleaning is the currency columns. The “Budget” and “Gross” (revenue) columns for a few countries were not converted to USD while compiling the data. Hence I have done this manually. This cleaning will help us compare these financial columns across movies and countries. All the currency rates were taken from XE.com as of September 29th 2021.

Lastly I created a profit_flag column, which is 1 if the movie is profitable (Revenue > Budget) and 0 otherwise.

movie$movie_title <- (sapply(movie$movie_title,gsub,pattern="\\Â",replacement=""))
movie$genres_2 <- (sapply(movie$genres,gsub,pattern="\\|",replacement=" "))
movie$plot_keywords_2 <- (sapply(movie$plot_keywords,gsub,pattern="\\|",replacement=" "))

movie = movie[!duplicated(movie$movie_title),]

movie <- transform(movie, budget = ifelse(country == "South Korea", budget/1186.31, budget))
movie <- transform(movie, budget = ifelse(country == "Japan", budget/111.91, budget))
movie <- transform(movie, budget = ifelse(country == "Turkey", budget/8.86, budget))
movie <- transform(movie, budget = ifelse(country == "Hungary", budget/309.93, budget))
movie <- transform(movie, budget = ifelse(country == "Thailand", budget/33.93, budget))

movie <- transform(movie, gross = ifelse(country == "South Korea", gross/1186.31, gross))
movie <- transform(movie, gross = ifelse(country == "Japan", gross/111.91, gross))
movie <- transform(movie, gross = ifelse(country == "Turkey", gross/8.86, gross))
movie <- transform(movie, gross = ifelse(country == "Hungary", gross/309.93, gross))
movie <- transform(movie, gross = ifelse(country == "Thailand", gross/33.93, gross))

movie$profit_flag <- as.factor(ifelse((movie$gross > movie$budget),1,0))

After the data cleaning, I once again check the number of rows and columns, as shown in the code below. The results show 4,917 unique movies.


dim(movie)
## [1] 4917   31

Missing Values

There are 3,700 rows which do not have any missing value.For Character values missing values are blanks, while numeric variables have missing values as NAs. But I have decided not to remove any rows with missing data yet.

print(paste(sum(complete.cases(movie)),"Complete cases!"))
## [1] "3700 Complete cases!"

Data Preview

In the data, each row is a movie and the columns are the information regarding the movie.

Preview (top 50 rows) of the cleaned dataset:

library(DT)
datatable(head(movie,50))

Data Description

Below is the table containing the the variable names, data types and a short description.

Variable.type <- lapply(movie,class)
Variable.desc <- c(
  "Specifies if it was color/black & white movie",
  "Name of movie director",
  "Number of critics who reviewed",
  "Duration of the movie (minutes)",
  "Number of likes on director's FB page",
  "Number of likes on 3rd actor's FB page",
  "Name of second actor",
  "Number of likes on 1st actor's FB page",
  "Gross earning by the movie (\\$)",
  "Genres of the movie",
  "Name of the first actor",
  "Title of the movie",
  "Number of users voted on IMDB",
  "Total facebook likes for all cast members",
  "Name of the third actor",
  "Number of the actor who featured in the movie poster",
  "Keywords describing the movie plot",
  "IMDB link of the movie",
  "Number of users who gave a review",
  "Language of the movie",
  "Country the movie was produced in",
  "Content rating of the movie",
  "Budget of the movie (\\$)",
  "Year the movie released in",
  "Number of facebook likes for actor 2",
  "IMDB score for the movie (out of 10)",
  "Aspect ratio the movie was made in",
  "Number of facebook likes",
  "Cleaned genre column",
  "Cleaned Plot keyword column",
  "Flag indicating profitability of the movie (1-profit, 0-loss)")

Variable.name1 <- colnames(movie)

data.desc <- as_data_frame(cbind(Variable.name1,Variable.type,Variable.desc))

colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")

library(knitr)
kable(data.desc)
Variable Name Data Type Variable Description
color character Specifies if it was color/black & white movie
director_name character Name of movie director
num_critic_for_reviews integer Number of critics who reviewed
duration integer Duration of the movie (minutes)
director_facebook_likes integer Number of likes on director’s FB page
actor_3_facebook_likes integer Number of likes on 3rd actor’s FB page
actor_2_name character Name of second actor
actor_1_facebook_likes integer Number of likes on 1st actor’s FB page
gross numeric Gross earning by the movie ($)
genres character Genres of the movie
actor_1_name character Name of the first actor
movie_title character Title of the movie
num_voted_users integer Number of users voted on IMDB
cast_total_facebook_likes integer Total facebook likes for all cast members
actor_3_name character Name of the third actor
facenumber_in_poster integer Number of the actor who featured in the movie poster
plot_keywords character Keywords describing the movie plot
movie_imdb_link character IMDB link of the movie
num_user_for_reviews integer Number of users who gave a review
language character Language of the movie
country character Country the movie was produced in
content_rating character Content rating of the movie
budget numeric Budget of the movie ($)
title_year integer Year the movie released in
actor_2_facebook_likes integer Number of facebook likes for actor 2
imdb_score numeric IMDB score for the movie (out of 10)
aspect_ratio numeric Aspect ratio the movie was made in
movie_facebook_likes integer Number of facebook likes
genres_2 character Cleaned genre column
plot_keywords_2 character Cleaned Plot keyword column
profit_flag factor Flag indicating profitability of the movie (1-profit, 0-loss)

Exploratory Data Analysis

In this section, I have tried to use a very diverse set of packages, functions and graphical methods to explore the movies dataset.

Genre Analysis

Each movie in my dataset had more than one genre, hence some cleaning and genre separation was required. Some pre-built functions in the “TM” package were very useful. In short, in the Analysis of Genre, the following was done: 1. Cleaning of the Genre Variable 2. Converting Genre variable to Corpus 3. Frequency Analysis: See which are the most used genres in a movie 4. Genre Associations Analysis: See which genres are closely associated and used together in a movie

Drama, Comedy and Thriller are the top movie genres, as showm in the below Word-Cloud and Bar-Chart

library(tm)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(plyr)
genre <- Corpus(VectorSource(movie$genres_2))
genre_dtm <- DocumentTermMatrix(genre)
genre_freq <- colSums(as.matrix(genre_dtm))
freq <- sort(colSums(as.matrix(genre_dtm)), decreasing=TRUE) 
genre_wf <- data.frame(word=names(genre_freq), freq=genre_freq)

ggplot(genre_wf, aes(x=reorder(word,-freq), y=freq))+ 
  geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ggtitle("Movie Genre frequency graph")+
  xlab("Genre")+
  ylab("Frequency")

set.seed(10)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(genre_wf$word,genre_wf$freq,random.order=FALSE,
          rot.per=.15, colors=pal2,scale=c(4,.9),
          title="WordCloud: Movie Genres")

***

Genre Association Analysis: Here I have analyzed which genres occur together in a movie. Since most of the movies in our database have multiple Genres, it would be interesting to understand how do these genres overlap.

A custom function is created, which finds the association for each genre and then binds all the individual results together to show them on a single graph.

word_assoc <- function(word)
{
  assoc_1 <- as.data.frame(findAssocs(genre_dtm,c(word),corlimit = 0.1))
  assoc_1$words <- rownames(assoc_1)
  colnames(assoc_1)[1] <- c("score")
  assoc_1$key <- c(word)
  rownames(assoc_1) <- NULL
  return(assoc_1)
}

drama_assoc <- word_assoc("drama")
comedy_assoc <- word_assoc("comedy")
thriller_assoc <- word_assoc("thriller")
action_assoc <- word_assoc("action")
romance_assoc <- word_assoc("romance")
adventure_assoc <- word_assoc("adventure")
crime_assoc <- word_assoc("crime")
assoc <- rbind(drama_assoc,comedy_assoc,thriller_assoc,action_assoc,
               romance_assoc,adventure_assoc,crime_assoc)

assoc$n = as.numeric(factor(assoc$key))
assoc = ddply(assoc,.(key,words),transform, x=paste(c(rep(' ',n-1), words), collapse=''))
assoc$x = factor(assoc$x, levels=assoc[order(assoc$score), 'x'])

ggplot(assoc,aes(x=x,y=score))+
  geom_bar(stat="identity")+
  facet_grid(~key,scales = 'free',space="free_x")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Association Score")+
  xlab("Genre")+
  ggtitle("Genre Associations")

The above graph shows the association strength, which can be thought of as a correlation to understand easily. The score is between 0 and 1. Lower score means they two genres in question were not used together while a higher score shows strong association.


Country

This part of the analysis deals with exploring the country variable. The first part shows how many movies were produced in which countries across the years. I have tried to show that through a heat map. Dark blue indicates more number of movies produces and vice-versa.

Most of the countries started producing movies in the early 2000s, except a handful which had prevalant movie production going on since the mid 1900s.

library(plyr)
detach("package:plyr", unload=TRUE)
library(dplyr) 
country_count <- movie %>%
  subset(country != "") %>%
  subset(country != "New Line") %>%
  na.omit() %>%
  group_by(country,title_year) %>%
  summarise(count=n())

colnames(country_count)[3] <- "Movie_Count"
ggplot(country_count,aes(title_year,country))+
  geom_tile(aes(fill=Movie_Count),colour="white")+
  scale_fill_gradient(low="light blue",high = "dark blue")+
  xlab("Year of movie release")+
  ylab("Country")+
  ggtitle("Heat Map: Country vs Movie Release Year")+
  guides(fill=FALSE)


The second part deals with exploring how does the budget varies by each country.

country_summary <- movie %>%
  subset(country != "") %>%
  subset(country != "New Line") %>%
  group_by(country) %>%
  summarise(count=n(),
            avg_score=mean(imdb_score,na.rm="true"),
            avg_budget = mean(budget,na.rm="true"),
            avg_gross = mean(gross,na.rm="true"))
country_with_multiple_movies <- subset(country_summary,count>1)[1]
ggplot(country_summary[complete.cases(country_summary), ],
       aes(x=reorder(country,-avg_budget),avg_budget/1000000))+
  geom_bar(stat = "identity")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Budget (Million $)")+
  xlab("")+
  ggtitle("Average Movie Budget by Country")

budget_rank <- country_summary[complete.cases(country_summary), c(1,4)]
budget_rank <- budget_rank[order(-budget_rank$avg_budget),]
budget_rank$rank <- seq.int(nrow(budget_rank))
movie_temp <- merge(x=movie,y=budget_rank, by = "country",all.x = TRUE)

ggplot(subset(movie_temp,country %in% country_with_multiple_movies$country),
       aes(x=reorder(country,rank),y=budget/1000000))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Budget (Million $)")+
  xlab("")+
  ggtitle("Movie Budget variation by Country")+
  ylim(0,100)

rm(movie_temp)

The third part in country analysis deals with exploring the gross revenue for the movie across the countries (Similar to the above analysis). Here the top 5 are significantly higher (more than 50 M USD) than the rest while the 6th highest is approximately 35 USD.

ggplot(country_summary[complete.cases(country_summary), ],
       aes(x=reorder(country,-avg_gross),avg_gross/1000000))+
  geom_bar(stat = "identity")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Revenue (Million $)")+
  xlab("")+
  ggtitle("Average Movie Revenue by Country")

revenue_rank <- country_summary[complete.cases(country_summary), c(1,5)]
revenue_rank <- revenue_rank[order(-revenue_rank$avg_gross),]
revenue_rank$rank <- seq.int(nrow(revenue_rank))
movie_temp <- merge(x=movie,y=revenue_rank, by = "country",all.x = TRUE)

ggplot(subset(movie_temp,country %in% country_with_multiple_movies$country),
       aes(x=reorder(country,rank),y=gross/1000000))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Revenue (Million $)")+
  xlab("")+
  ggtitle("Movie Revenue variation by Country")+
  ylim(0,50)

rm(movie_temp)

In the fifth and the final part has the analysis, I have seen which country released which languages of films. English is obviosuly the language overlapping most countries, but the below heat map helps us get a better picture. Dark blue indicates higher number of movie released.

country_language <- movie %>%
  subset(country != "") %>%
  subset(language != "") %>%
  group_by(country,language) %>%
  summarise(count=n())

colnames(country_language)[3] <- "Movie_Count"
ggplot(country_language,aes(language,country))+
  geom_tile(aes(fill=log(Movie_Count)),colour="white")+
  scale_fill_gradient(low="light blue",high = "dark blue")+
  xlab("Language")+
  ylab("Country")+
  ggtitle("Heat Map: Country vs Language")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  guides(fill=FALSE)


We can see below that USA produces movies in most number of languages (14), followed by UK.

  movie %>%
  group_by(country) %>%
  summarise(num = n_distinct(language)) %>%
  arrange(-num) %>%
  subset(num > 3) %>%
  ggplot(aes(y=num,x=reorder(country,-num)))+
    geom_bar(stat = "identity")+
    xlab("")+ylab("Number of Languages")+
    ggtitle("Top countries by number of languages of films produced")


IMDB Score Analysis

In here, I have tried to see which kind of movies are more successful in terms of the IMDB ratings.

We first start by looking at the basic central tendency (mean) and the variation in movie score. For this purpose I have plotted a histogram which also has the 5th and 95th percentile mark for the IMDB score.

ggplot(movie,aes(imdb_score))+
  geom_histogram(bins=80)+
  geom_vline(xintercept = mean(movie$imdb_score,na.rm = TRUE),colour="steel blue")+
  geom_vline(xintercept = quantile(movie$imdb_score, prob = c(0.05)),colour="red",linetype = "longdash")+
  geom_vline(xintercept = quantile(movie$imdb_score, prob = c(0.95)),colour="red",linetype = "longdash")+
  annotate("text", label = "5th Percentile (4.3)",x = 4.2, y = 100, size = 4, colour = "red",angle=90)+
  annotate("text", label = "95th Percentile (8.1)",x = 8.2, y = 100, size = 4, colour = "red",angle=90)+
  annotate("text", label = "Mean (6.4)",x = 6.3, y = 100, size = 4, colour = "light blue",angle=90)+
  ylab("Count of Movies")+
  xlab("IMDB Score")+
  ggtitle("Histogram: IMDB Score")


Lets check the relationship between IMDB score, revenue and budget of the movie. Here Profit flag 0 represents loss and 1 represents gain. From the graph we can make out that IMDB score doesn’t depend on Budget or Gross amount.

movie %>% 
  na.omit() %>% 
  ggplot(aes(x = imdb_score, y = budget/1000000, color = profit_flag)) + 
  geom_jitter() + 
  ylab("Budget im Million $") + 
  xlab("IMDB Score") + 
  ggtitle("IMDB Score vs Budget") 

movie %>% 
  na.omit() %>% 
  ggplot(aes(x = imdb_score, y = gross/1000000, color = profit_flag)) + 
  geom_jitter() + 
  ylab("Budget im Million $") + 
  xlab("IMDB Score") + 
  ggtitle("IMDB Score vs Gross") 


Now lets see which countries do well in terms of the movie IMDB scores. Size of the bubble indicates the number of movies produced in the country while color shows profitability:

imdb_country <- movie %>%
  group_by(country) %>%
  summarise(num_movie = n(),
            avg_score = mean(imdb_score,na.rm=TRUE),
            profit = (sum(gross,na.rm=TRUE)-sum(budget,na.rm=TRUE))/1000000) %>%
  subset(country != "") %>%
  subset(num_movie > 5) %>%
  arrange(-num_movie)

imdb_country$profit_flag <- as.factor(ifelse(imdb_country$profit > 0 , 1,0))
imdb_country$profit_2 <- imdb_country$profit + 2009

imdb_country %>% 
  na.omit() %>% 
  ggplot(aes(x = profit_2, y = avg_score, color = profit_flag, size = num_movie)) + 
  geom_point() + 
  geom_text(label = imdb_country$country, nudge_x = 0.25, nudge_y = 0.25, check_overlap = T) +
  ylab("Average IMDB Score") + 
  xlab("Profitability") + 
  ggtitle("IMDB Score vs Profitability")

library(knitr)
kable(imdb_country)
country num_movie avg_score profit profit_flag profit_2
USA 3711 6.365212 51748.898187 1 53757.898187
UK 434 6.800230 1075.670352 1 3084.670352
France 154 6.678571 -2008.942677 0 0.057323
Canada 124 6.154032 13.196801 1 2022.196801
Germany 94 6.315957 -586.972999 0 1422.027001
Australia 53 6.522641 88.181074 1 2097.181074
India 34 6.532353 -1810.682479 0 198.317521
Spain 33 6.824242 -926.300528 0 1082.699472
China 28 6.610714 -1131.594853 0 877.405147
Italy 23 6.873913 -223.452921 0 1785.547079
Japan 22 6.895454 -57.559708 0 1951.440292
Hong Kong 17 6.741176 -192.928261 0 1816.071739
Mexico 17 6.776471 27.497480 1 2036.497480
South Korea 13 6.200000 -3.685296 0 2005.314704
Ireland 12 6.783333 -28.853700 0 1980.146300
New Zealand 12 7.291667 216.207274 1 2225.207274
Denmark 11 7.172727 -112.533778 0 1896.466222
Russia 11 6.081818 -157.740193 0 1851.259807
Brazil 8 7.275000 -15.637130 0 1993.362870
Norway 8 6.737500 -165.295451 0 1843.704549
South Africa 8 6.437500 34.740241 1 2043.740241
Sweden 6 7.516667 -50.019565 0 1958.980435

Summary

The above exersice helped us understand the movies dataset better. Now we have answers to some interesting questions. The following is the summary of the analysis:

  • Drama, Comedy and Thriller are the top genres. As seen in the bar chart and wordcloud
  • US has the most thriving movie industry, and movies are being produced since the early-mid nineties. Japan, Italy, Germany and France being only other countries which produced significant number of movies before 1980s.
  • India has significantly higher average movie budget (~100 M$) than other countries.
  • New Zealand tops the average movie revenue (~90 M$), and is significantly higher than the second highest Taiwan.
  • English is the language in which most movies are made. And USA produces movies in 14 languages, most by any country
  • Average movie IMDB Score is 6.4 and 90% of movies have a score between 8.1 and 4.3. IMDB scores follow a bell shaped distribution. So any movie having a score of more than 8.1 would be one of the top 5% movies in the world.
  • Movies with higher IMDB score tend to have higher revenue. Here IMDB score and Revenue might be correlated. People prefer to watch famous and good movies.
  • Significant number of movies go into loss.