IMDB Movie data

Movie Exploratory Analysis (Using IMDB datasets)

Synopsis

Problem Statement: The objective of this project is to explore the Kaggle Movies dataset and answers some interesting questions like which countries produce most movies, profitability analysis, kind of movies are most produced, most produced genres etc.

Solution Overview: Post the data cleaning and exploration of the data I was able to answer some of the above questions. A variety of packages were used for this analysis, which are explained in the later sections. Each analysis has been explained along with its codes.

Insights: Some very interesting insights were obtained from this analysis:
- Drama, Comedy and Thriller are the top genres.
- US has the most thriving movie industry, and movies are being produced since the early-mid nineties. Japan, Italy, Germany and France being only other countries which produced significant number of movies before 1980s.
- India has significantly higher average movie budget (~100 M$) than other countries.
- New Zealand tops the average movie revenue (~90 M$), and is significantly higher than the second highest Taiwan.
- English is the language in which most movies are made.
- Average movie IMDB Score is 6.4 and 90% of movies have a score between 8.1 and 4.3. IMDB scores follow a bell shaped distribution.
- Movies with higher IMDB score tend to have higher revenue.
- Significant number of movies go into loss.

Assumption: Analysis is performed based on data available in the dataset. The results shown may not be True.

Packages Required

Following packages were used:
- Tibble: Used to store data as a tibble, and makes it much easier to handle and manipulate data (package required - tidyverse)
- DT: Used to display the data on the screen in a scrollable format (package required - DT)
- Knitr: Used to display an aligned table on the screen
- TM: Used for text mining on the “Genre” columns in the data (package required - tm)
- Dplyr: Used for data manipulation
- Ggplot2: Used to plot charts
- Wordcloud: Used to chart wordcloud in the genre text analysis (package required - wordcloud)
- Fitdistrplus: Used for statistical analysis (distribution fitting) (package required - fitdistrplus)
- Plotly: Used to plot interactive charts
- Plyr: Used for data transformation (install.packages(“plyr”, dependencies=T) and then require(plyr))

library(tibble)
library(DT)
library(knitr)
library(tm)
library(ggplot2)
library(wordcloud)
library(dplyr)
library(fitdistrplus)
library(plyr)
#library(plotly)

Data Preparation

This sections contains all the procedures followed in getting the data analysis ready. Each step is explained and the codes are given.

Data Import

IMDB 5000 Movie Dataset from Kaggle

This dataset contains 28 columns and about 5,043 movies spanning across 100 years in 66 countries. There are 2399 unique director names, and thousands of actors/actresses.

Data Import Code:

library(tibble)
movie <- as_data_frame(read.csv("movie_metadata.csv", stringsAsFactors = FALSE))
class(movie)

## [1] "tbl_df"     "tbl"        "data.frame"

colnames(movie)

##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

dim(movie)

## [1] 5043   28

Data Cleaning

The first part of data cleaning involves removal of spurious characters (Â) from a the movie title, genre and plot keyword columns. This could have come up during data scrapping from the net.

Then we remove the duplicates in the data, using then “movie_title” column. Duplicate data will skew our analysis hence needs to be removed.

Third part of data cleaning is the currency columns. The “Budget” and “Gross” (revenue) columns for a few countries were not converted to USD while compiling the data. Hence I have done this manually. This cleaning will help us compare these financial columns across movies and countries. All the currency rates were taken from XE.com as of September 29th 2021.

Lastly I created a profit_flag column, which is 1 if the movie is profitable (Revenue > Budget) and 0 otherwise.

movie$movie_title <- (sapply(movie$movie_title,gsub,pattern="\\Â",replacement=""))
movie$genres_2 <- (sapply(movie$genres,gsub,pattern="\\|",replacement=" "))
movie$plot_keywords_2 <- (sapply(movie$plot_keywords,gsub,pattern="\\|",replacement=" "))

movie = movie[!duplicated(movie$movie_title),]

movie <- transform(movie, budget = ifelse(country == "South Korea", budget/1186.31, budget))
movie <- transform(movie, budget = ifelse(country == "Japan", budget/111.91, budget))
movie <- transform(movie, budget = ifelse(country == "Turkey", budget/8.86, budget))
movie <- transform(movie, budget = ifelse(country == "Hungary", budget/309.93, budget))
movie <- transform(movie, budget = ifelse(country == "Thailand", budget/33.93, budget))

movie <- transform(movie, gross = ifelse(country == "South Korea", gross/1186.31, gross))
movie <- transform(movie, gross = ifelse(country == "Japan", gross/111.91, gross))
movie <- transform(movie, gross = ifelse(country == "Turkey", gross/8.86, gross))
movie <- transform(movie, gross = ifelse(country == "Hungary", gross/309.93, gross))
movie <- transform(movie, gross = ifelse(country == "Thailand", gross/33.93, gross))

movie$profit_flag <- as.factor(ifelse((movie$gross > movie$budget),1,0))

After the data cleaning, I once again check the number of rows and columns, as shown in the code below. The results show 4,917 unique movies.

dim(movie)

## [1] 4917   31

Missing Values

There are 3,700 rows which do not have any missing value.For Character values missing values are blanks, while numeric variables have missing values as NAs. But I have decided not to remove any rows with missing data yet.

print(paste(sum(complete.cases(movie)),"Complete cases!"))

## [1] "3700 Complete cases!"

Data Preview

In the data, each row is a movie and the columns are the information regarding the movie.

Preview (top 50 rows) of the cleaned dataset:

library(DT)
datatable(head(movie,50))

Data Description

Below is the table containing the the variable names, data types and a short description.

Variable.type <- lapply(movie,class)
Variable.desc <- c(
  "Specifies if it was color/black & white movie",
  "Name of movie director",
  "Number of critics who reviewed",
  "Duration of the movie (minutes)",
  "Number of likes on director's FB page",
  "Number of likes on 3rd actor's FB page",
  "Name of second actor",
  "Number of likes on 1st actor's FB page",
  "Gross earning by the movie (\\$)",
  "Genres of the movie",
  "Name of the first actor",
  "Title of the movie",
  "Number of users voted on IMDB",
  "Total facebook likes for all cast members",
  "Name of the third actor",
  "Number of the actor who featured in the movie poster",
  "Keywords describing the movie plot",
  "IMDB link of the movie",
  "Number of users who gave a review",
  "Language of the movie",
  "Country the movie was produced in",
  "Content rating of the movie",
  "Budget of the movie (\\$)",
  "Year the movie released in",
  "Number of facebook likes for actor 2",
  "IMDB score for the movie (out of 10)",
  "Aspect ratio the movie was made in",
  "Number of facebook likes",
  "Cleaned genre column",
  "Cleaned Plot keyword column",
  "Flag indicating profitability of the movie (1-profit, 0-loss)")

Variable.name1 <- colnames(movie)

data.desc <- as_data_frame(cbind(Variable.name1,Variable.type,Variable.desc))

colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")

library(knitr)
kable(data.desc)

Variable Name	Data Type	Variable Description
color	character	Specifies if it was color/black & white movie
director_name	character	Name of movie director
num_critic_for_reviews	integer	Number of critics who reviewed
duration	integer	Duration of the movie (minutes)
director_facebook_likes	integer	Number of likes on director’s FB page
actor_3_facebook_likes	integer	Number of likes on 3rd actor’s FB page
actor_2_name	character	Name of second actor
actor_1_facebook_likes	integer	Number of likes on 1st actor’s FB page
gross	numeric	Gross earning by the movie ($)
genres	character	Genres of the movie
actor_1_name	character	Name of the first actor
movie_title	character	Title of the movie
num_voted_users	integer	Number of users voted on IMDB
cast_total_facebook_likes	integer	Total facebook likes for all cast members
actor_3_name	character	Name of the third actor
facenumber_in_poster	integer	Number of the actor who featured in the movie poster
plot_keywords	character	Keywords describing the movie plot
movie_imdb_link	character	IMDB link of the movie
num_user_for_reviews	integer	Number of users who gave a review
language	character	Language of the movie
country	character	Country the movie was produced in
content_rating	character	Content rating of the movie
budget	numeric	Budget of the movie ($)
title_year	integer	Year the movie released in
actor_2_facebook_likes	integer	Number of facebook likes for actor 2
imdb_score	numeric	IMDB score for the movie (out of 10)
aspect_ratio	numeric	Aspect ratio the movie was made in
movie_facebook_likes	integer	Number of facebook likes
genres_2	character	Cleaned genre column
plot_keywords_2	character	Cleaned Plot keyword column
profit_flag	factor	Flag indicating profitability of the movie (1-profit, 0-loss)

Exploratory Data Analysis

In this section, I have tried to use a very diverse set of packages, functions and graphical methods to explore the movies dataset.

Genre Analysis

Each movie in my dataset had more than one genre, hence some cleaning and genre separation was required. Some pre-built functions in the “TM” package were very useful. In short, in the Analysis of Genre, the following was done: 1. Cleaning of the Genre Variable 2. Converting Genre variable to Corpus 3. Frequency Analysis: See which are the most used genres in a movie 4. Genre Associations Analysis: See which genres are closely associated and used together in a movie

Drama, Comedy and Thriller are the top movie genres, as showm in the below Word-Cloud and Bar-Chart

library(tm)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(plyr)
genre <- Corpus(VectorSource(movie$genres_2))
genre_dtm <- DocumentTermMatrix(genre)
genre_freq <- colSums(as.matrix(genre_dtm))
freq <- sort(colSums(as.matrix(genre_dtm)), decreasing=TRUE) 
genre_wf <- data.frame(word=names(genre_freq), freq=genre_freq)

ggplot(genre_wf, aes(x=reorder(word,-freq), y=freq))+ 
  geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ggtitle("Movie Genre frequency graph")+
  xlab("Genre")+
  ylab("Frequency")

set.seed(10)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(genre_wf$word,genre_wf$freq,random.order=FALSE,
          rot.per=.15, colors=pal2,scale=c(4,.9),
          title="WordCloud: Movie Genres")

***

Genre Association Analysis: Here I have analyzed which genres occur together in a movie. Since most of the movies in our database have multiple Genres, it would be interesting to understand how do these genres overlap.

A custom function is created, which finds the association for each genre and then binds all the individual results together to show them on a single graph.

word_assoc <- function(word)
{
  assoc_1 <- as.data.frame(findAssocs(genre_dtm,c(word),corlimit = 0.1))
  assoc_1$words <- rownames(assoc_1)
  colnames(assoc_1)[1] <- c("score")
  assoc_1$key <- c(word)
  rownames(assoc_1) <- NULL
  return(assoc_1)
}

drama_assoc <- word_assoc("drama")
comedy_assoc <- word_assoc("comedy")
thriller_assoc <- word_assoc("thriller")
action_assoc <- word_assoc("action")
romance_assoc <- word_assoc("romance")
adventure_assoc <- word_assoc("adventure")
crime_assoc <- word_assoc("crime")
assoc <- rbind(drama_assoc,comedy_assoc,thriller_assoc,action_assoc,
               romance_assoc,adventure_assoc,crime_assoc)

assoc$n = as.numeric(factor(assoc$key))
assoc = ddply(assoc,.(key,words),transform, x=paste(c(rep(' ',n-1), words), collapse=''))
assoc$x = factor(assoc$x, levels=assoc[order(assoc$score), 'x'])

ggplot(assoc,aes(x=x,y=score))+
  geom_bar(stat="identity")+
  facet_grid(~key,scales = 'free',space="free_x")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Association Score")+
  xlab("Genre")+
  ggtitle("Genre Associations")

The above graph shows the association strength, which can be thought of as a correlation to understand easily. The score is between 0 and 1. Lower score means they two genres in question were not used together while a higher score shows strong association.

Country

This part of the analysis deals with exploring the country variable. The first part shows how many movies were produced in which countries across the years. I have tried to show that through a heat map. Dark blue indicates more number of movies produces and vice-versa.

Most of the countries started producing movies in the early 2000s, except a handful which had prevalant movie production going on since the mid 1900s.

library(plyr)
detach("package:plyr", unload=TRUE)
library(dplyr) 
country_count <- movie %>%
  subset(country != "") %>%
  subset(country != "New Line") %>%
  na.omit() %>%
  group_by(country,title_year) %>%
  summarise(count=n())

colnames(country_count)[3] <- "Movie_Count"
ggplot(country_count,aes(title_year,country))+
  geom_tile(aes(fill=Movie_Count),colour="white")+
  scale_fill_gradient(low="light blue",high = "dark blue")+
  xlab("Year of movie release")+
  ylab("Country")+
  ggtitle("Heat Map: Country vs Movie Release Year")+
  guides(fill=FALSE)

The second part deals with exploring how does the budget varies by each country.

country_summary <- movie %>%
  subset(country != "") %>%
  subset(country != "New Line") %>%
  group_by(country) %>%
  summarise(count=n(),
            avg_score=mean(imdb_score,na.rm="true"),
            avg_budget = mean(budget,na.rm="true"),
            avg_gross = mean(gross,na.rm="true"))
country_with_multiple_movies <- subset(country_summary,count>1)[1]
ggplot(country_summary[complete.cases(country_summary), ],
       aes(x=reorder(country,-avg_budget),avg_budget/1000000))+
  geom_bar(stat = "identity")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Budget (Million $)")+
  xlab("")+
  ggtitle("Average Movie Budget by Country")

budget_rank <- country_summary[complete.cases(country_summary), c(1,4)]
budget_rank <- budget_rank[order(-budget_rank$avg_budget),]
budget_rank$rank <- seq.int(nrow(budget_rank))
movie_temp <- merge(x=movie,y=budget_rank, by = "country",all.x = TRUE)

ggplot(subset(movie_temp,country %in% country_with_multiple_movies$country),
       aes(x=reorder(country,rank),y=budget/1000000))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Budget (Million $)")+
  xlab("")+
  ggtitle("Movie Budget variation by Country")+
  ylim(0,100)

rm(movie_temp)

The third part in country analysis deals with exploring the gross revenue for the movie across the countries (Similar to the above analysis). Here the top 5 are significantly higher (more than 50 M USD) than the rest while the 6th highest is approximately 35 USD.

ggplot(country_summary[complete.cases(country_summary), ],
       aes(x=reorder(country,-avg_gross),avg_gross/1000000))+
  geom_bar(stat = "identity")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Revenue (Million $)")+
  xlab("")+
  ggtitle("Average Movie Revenue by Country")

revenue_rank <- country_summary[complete.cases(country_summary), c(1,5)]
revenue_rank <- revenue_rank[order(-revenue_rank$avg_gross),]
revenue_rank$rank <- seq.int(nrow(revenue_rank))
movie_temp <- merge(x=movie,y=revenue_rank, by = "country",all.x = TRUE)

ggplot(subset(movie_temp,country %in% country_with_multiple_movies$country),
       aes(x=reorder(country,rank),y=gross/1000000))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ylab("Average Movie Revenue (Million $)")+
  xlab("")+
  ggtitle("Movie Revenue variation by Country")+
  ylim(0,50)

rm(movie_temp)

In the fifth and the final part has the analysis, I have seen which country released which languages of films. English is obviosuly the language overlapping most countries, but the below heat map helps us get a better picture. Dark blue indicates higher number of movie released.

country_language <- movie %>%
  subset(country != "") %>%
  subset(language != "") %>%
  group_by(country,language) %>%
  summarise(count=n())

colnames(country_language)[3] <- "Movie_Count"
ggplot(country_language,aes(language,country))+
  geom_tile(aes(fill=log(Movie_Count)),colour="white")+
  scale_fill_gradient(low="light blue",high = "dark blue")+
  xlab("Language")+
  ylab("Country")+
  ggtitle("Heat Map: Country vs Language")+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  guides(fill=FALSE)

We can see below that USA produces movies in most number of languages (14), followed by UK.

  movie %>%
  group_by(country) %>%
  summarise(num = n_distinct(language)) %>%
  arrange(-num) %>%
  subset(num > 3) %>%
  ggplot(aes(y=num,x=reorder(country,-num)))+
    geom_bar(stat = "identity")+
    xlab("")+ylab("Number of Languages")+
    ggtitle("Top countries by number of languages of films produced")

IMDB Score Analysis

In here, I have tried to see which kind of movies are more successful in terms of the IMDB ratings.

We first start by looking at the basic central tendency (mean) and the variation in movie score. For this purpose I have plotted a histogram which also has the 5th and 95th percentile mark for the IMDB score.

ggplot(movie,aes(imdb_score))+
  geom_histogram(bins=80)+
  geom_vline(xintercept = mean(movie$imdb_score,na.rm = TRUE),colour="steel blue")+
  geom_vline(xintercept = quantile(movie$imdb_score, prob = c(0.05)),colour="red",linetype = "longdash")+
  geom_vline(xintercept = quantile(movie$imdb_score, prob = c(0.95)),colour="red",linetype = "longdash")+
  annotate("text", label = "5th Percentile (4.3)",x = 4.2, y = 100, size = 4, colour = "red",angle=90)+
  annotate("text", label = "95th Percentile (8.1)",x = 8.2, y = 100, size = 4, colour = "red",angle=90)+
  annotate("text", label = "Mean (6.4)",x = 6.3, y = 100, size = 4, colour = "light blue",angle=90)+
  ylab("Count of Movies")+
  xlab("IMDB Score")+
  ggtitle("Histogram: IMDB Score")

Lets check the relationship between IMDB score, revenue and budget of the movie. Here Profit flag 0 represents loss and 1 represents gain. From the graph we can make out that IMDB score doesn’t depend on Budget or Gross amount.

movie %>% 
  na.omit() %>% 
  ggplot(aes(x = imdb_score, y = budget/1000000, color = profit_flag)) + 
  geom_jitter() + 
  ylab("Budget im Million $") + 
  xlab("IMDB Score") + 
  ggtitle("IMDB Score vs Budget")

movie %>% 
  na.omit() %>% 
  ggplot(aes(x = imdb_score, y = gross/1000000, color = profit_flag)) + 
  geom_jitter() + 
  ylab("Budget im Million $") + 
  xlab("IMDB Score") + 
  ggtitle("IMDB Score vs Gross")

Now lets see which countries do well in terms of the movie IMDB scores. Size of the bubble indicates the number of movies produced in the country while color shows profitability:

imdb_country <- movie %>%
  group_by(country) %>%
  summarise(num_movie = n(),
            avg_score = mean(imdb_score,na.rm=TRUE),
            profit = (sum(gross,na.rm=TRUE)-sum(budget,na.rm=TRUE))/1000000) %>%
  subset(country != "") %>%
  subset(num_movie > 5) %>%
  arrange(-num_movie)

imdb_country$profit_flag <- as.factor(ifelse(imdb_country$profit > 0 , 1,0))
imdb_country$profit_2 <- imdb_country$profit + 2009

imdb_country %>% 
  na.omit() %>% 
  ggplot(aes(x = profit_2, y = avg_score, color = profit_flag, size = num_movie)) + 
  geom_point() + 
  geom_text(label = imdb_country$country, nudge_x = 0.25, nudge_y = 0.25, check_overlap = T) +
  ylab("Average IMDB Score") + 
  xlab("Profitability") + 
  ggtitle("IMDB Score vs Profitability")

library(knitr)
kable(imdb_country)

country	num_movie	avg_score	profit	profit_flag	profit_2
USA	3711	6.365212	51748.898187	1	53757.898187
UK	434	6.800230	1075.670352	1	3084.670352
France	154	6.678571	-2008.942677	0	0.057323
Canada	124	6.154032	13.196801	1	2022.196801
Germany	94	6.315957	-586.972999	0	1422.027001
Australia	53	6.522641	88.181074	1	2097.181074
India	34	6.532353	-1810.682479	0	198.317521
Spain	33	6.824242	-926.300528	0	1082.699472
China	28	6.610714	-1131.594853	0	877.405147
Italy	23	6.873913	-223.452921	0	1785.547079
Japan	22	6.895454	-57.559708	0	1951.440292
Hong Kong	17	6.741176	-192.928261	0	1816.071739
Mexico	17	6.776471	27.497480	1	2036.497480
South Korea	13	6.200000	-3.685296	0	2005.314704
Ireland	12	6.783333	-28.853700	0	1980.146300
New Zealand	12	7.291667	216.207274	1	2225.207274
Denmark	11	7.172727	-112.533778	0	1896.466222
Russia	11	6.081818	-157.740193	0	1851.259807
Brazil	8	7.275000	-15.637130	0	1993.362870
Norway	8	6.737500	-165.295451	0	1843.704549
South Africa	8	6.437500	34.740241	1	2043.740241
Sweden	6	7.516667	-50.019565	0	1958.980435

Summary

The above exersice helped us understand the movies dataset better. Now we have answers to some interesting questions. The following is the summary of the analysis:

Drama, Comedy and Thriller are the top genres. As seen in the bar chart and wordcloud
US has the most thriving movie industry, and movies are being produced since the early-mid nineties. Japan, Italy, Germany and France being only other countries which produced significant number of movies before 1980s.
India has significantly higher average movie budget (~100 M$) than other countries.
New Zealand tops the average movie revenue (~90 M$), and is significantly higher than the second highest Taiwan.
English is the language in which most movies are made. And USA produces movies in 14 languages, most by any country
Average movie IMDB Score is 6.4 and 90% of movies have a score between 8.1 and 4.3. IMDB scores follow a bell shaped distribution. So any movie having a score of more than 8.1 would be one of the top 5% movies in the world.
Movies with higher IMDB score tend to have higher revenue. Here IMDB score and Revenue might be correlated. People prefer to watch famous and good movies.
Significant number of movies go into loss.