IMDB Data Storytelling

Analysing IMDb Movies - What factors lead to a successful movie?

Author: Dobariyapatel Anish Kishor

1. Motivation behind topic

As a huge movie enthusiast, this project provides an avenue to combine my interests in data visualisation and movies to analyse and gain useful data insights.

In recent years, there has been a rush towards creating more digital content to capture user attention. Streaming services like Netflix, Disney+ and AppleTV are creating their own content that could rival existing film production companies. With more digital content and content producers, competition between film companies for viewership is heating up. Hence it would be interesting to identify factors that result in a successful movie.

1.1 Data Source

I will be using IMDb data from 2 sources:
1) IMDb’s official dataset (IMDb, n.d.). It has fewer variables but more movies than the 2nd source.
2) Dataworld IMDB 5000 movie dataset. It contains 28 variables for 5043 movies, spanning 100 years and 66 countries. There are 2399 unique director names, and thousands of actors/actresses.

This visualisation story is exploratory in nature to find some meaningful insights.

1.2 Major Data and Design Challenges

Major Data Challenges: This project comprises 2 data sources. The first is the official dataset that IMDb makes available for users to analyse. It is a huge dataset containing over 100,000 movies from the 1880s to 2021. Because my laptop is unable to handle this much data, I restricted it to movies from the 1990s onwards, narrowing it down to a manageable scale.

Another challenge is that the first data source, titled ‘imdbofficial’, has very few variables, which allows only limited analysis. I therefore added a second data source, titled ‘movies’, which contains around 5k movies. This dataset has more useful information, such as budget and revenue, which makes for more insightful visualisations.

Major Design Challenges: The aim is to create 8 unique chart elements to let the data tell its own story in a visually appealing and engaging manner.

There is sufficient data to do a range of analyses. The key challenge was to avoid repeating design elements, since it was vital to showcase the range that R possesses as an analytics and visualisation tool. To achieve this, I drew a sketch beforehand to plan out the chart elements used.

Below is the sketch for the visualisations I will be aiming to create.

1.3 Sketch of Proposed Data Visualisation Design

Proposed Sketch

1.4 Load necessary packages:

library(tidyverse)
library(stringr)
library(ggplot2)
library(ggthemes)
library(tm)
library(dplyr)
library(ggiraph)
library(plotly)
library(wordcloud)
library(corrgram)
library(corrplot)
library(ggpubr)

2 Data Visualisation Step by Step

2.1 Import data

#Data Source 1
imdbofficial <- read_csv("final_imdb_dataset.csv")
#Data Source 2 
movies <- read_csv("imdbdataset.csv")

2.2 Data Cleaning

Removing duplicate rows and renaming genre variables.

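#Data Source 1: rename "Sci-Fi" to "SciFi"
imdbofficial$Final_Genre[imdbofficial$Final_Genre == "Sci-Fi"] <- "SciFi"
#Data Source 2: count exact duplicate rows
sum(duplicated(movies))
## [1] 45
#Drop the exact duplicate rows
movies <- movies[!duplicated(movies),]
#Drop the last character of each title (a trailing blank space in the raw data)
movies$movie_title <- substr(movies$movie_title,1,nchar(movies$movie_title)-1)
#Replace the "|" separator between genres with a space (gsub is vectorised)
movies$genres_2 <- gsub("\\|"," ",movies$genres)
#Keep only the first row for each movie title
movies <- movies[!duplicated(movies$movie_title),]
#Flag movies whose gross exceeds their budget (1 = profitable, 0 = not)
movies$profit_flag <- as.factor(ifelse((movies$gross > movies$budget),1,0))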

We see there are 45 duplicated rows that we have now removed.
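
As a minimal sketch of what the two de-duplication steps above do, the toy data frame below (hypothetical values, not taken from either dataset) shows the difference between dropping fully identical rows and dropping rows that merely share a title.

```r
# Toy data: hypothetical values for illustration only
toy <- data.frame(
  movie_title = c("A", "A", "B", "B", "C"),
  imdb_score  = c(7.0, 7.0, 6.5, 6.8, 8.1),
  stringsAsFactors = FALSE
)

# Number of rows that are exact copies of an earlier row
n_exact_dupes <- sum(duplicated(toy))

# Step 1: drop exact duplicate rows (the second "A" row goes)
toy_step1 <- toy[!duplicated(toy), ]

# Step 2: keep only the first row for each title (the second "B" row goes)
toy_step2 <- toy_step1[!duplicated(toy_step1$movie_title), ]

print(n_exact_dupes)
print(toy_step2)
```

Note that the second step keeps whichever row happens to come first for a title, so two different movies sharing a title would be collapsed into one.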

2.3 Data Tidying

We see that the values in the movie_title variable are not consistent (e.g. some have blank spaces at the end). These trailing characters were removed in the cleaning step in 2.2.
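
The sketch below compares two ways of handling trailing blanks, under the assumption that only some titles carry one. The titles other than "Avatar" are hypothetical placeholders. Dropping the last character unconditionally with substr also shortens titles that had no trailing blank, whereas trimws only removes whitespace.

```r
# Toy titles: two with a trailing space, one without (hypothetical placeholders)
titles <- c("Avatar ", "Title Two ", "Title Three")

# Unconditional approach: always drop the last character
dropped <- substr(titles, 1, nchar(titles) - 1)

# Conditional approach: strip trailing whitespace only where it exists
trimmed <- trimws(titles, which = "right")

print(dropped)
print(trimmed)
```

If every title in the raw file ends with the same stray character, both approaches give the same result; the difference only matters when the data is mixed.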

#Glimpse at both data sources
glimpse(imdbofficial)
## Rows: 20,253
## Columns: 9
## $ Movie             <chr> "Halfaouine: Boy of the Terraces", "Demonia", "Baby…
## $ Year              <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1994, 199…
## $ Duration          <dbl> 98, 88, 88, 116, 108, 108, 102, 168, 101, 56, 107, …
## $ Genres            <chr> "Comedy,Drama", "Horror,Mystery", "Horror", "Action…
## $ Rating            <dbl> 6.7, 4.5, 6.0, 5.4, 7.3, 7.6, 3.6, 7.2, 5.6, 5.9, 7…
## $ Num_of_Votes      <dbl> 1198, 1207, 1511, 3318, 3583, 1218, 1389, 1761, 100…
## $ Five_Year         <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 199…
## $ Final_Genre       <chr> "Drama", "Mystery", "Horror", "Action", "Comedy", "…
## $ Types_of_Duration <chr> "Just Nice", "Very Short", "Very Short", "Long", "L…
glimpse(movies)
## Rows: 4,916
## Columns: 30
## $ color                     <chr> "Color", "Color", "Color", "Color", NA, "Co…
## $ director_name             <chr> "James Cameron", "Gore Verbinski", "Sam Men…
## $ num_critic_for_reviews    <dbl> 723, 302, 602, 813, NA, 462, 392, 324, 635,…
## $ duration                  <dbl> 178, 169, 148, 164, NA, 132, 156, 100, 141,…
## $ director_facebook_likes   <dbl> 0, 563, 0, 22000, 131, 475, 0, 15, 0, 282, …
## $ actor_3_facebook_likes    <dbl> 855, 1000, 161, 23000, NA, 530, 4000, 284, …
## $ actor_2_name              <chr> "Joel David Moore", "Orlando Bloom", "Rory …
## $ actor_1_facebook_likes    <dbl> 1000, 40000, 11000, 27000, 131, 640, 24000,…
## $ gross                     <dbl> 760505847, 309404152, 200074175, 448130642,…
## $ genres                    <chr> "Action|Adventure|Fantasy|Sci-Fi", "Action|…
## $ actor_1_name              <chr> "CCH Pounder", "Johnny Depp", "Christoph Wa…
## $ movie_title               <chr> "Avatar", "Pirates of the Caribbean: At Wor…
## $ num_voted_users           <dbl> 886204, 471220, 275868, 1144337, 8, 212204,…
## $ cast_total_facebook_likes <dbl> 4834, 48350, 11700, 106759, 143, 1873, 4605…
## $ actor_3_name              <chr> "Wes Studi", "Jack Davenport", "Stephanie S…
## $ facenumber_in_poster      <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 4, 3, 0, 0, 1, 2, 1…
## $ plot_keywords             <chr> "avatar|future|marine|native|paraplegic", "…
## $ movie_imdb_link           <chr> "http://www.imdb.com/title/tt0499549/?ref_=…
## $ num_user_for_reviews      <dbl> 3054, 1238, 994, 2701, NA, 738, 1902, 387, …
## $ language                  <chr> "English", "English", "English", "English",…
## $ country                   <chr> "USA", "USA", "UK", "USA", NA, "USA", "USA"…
## $ content_rating            <chr> "PG-13", "PG-13", "PG-13", "PG-13", NA, "PG…
## $ budget                    <dbl> 237000000, 300000000, 245000000, 250000000,…
## $ title_year                <dbl> 2009, 2007, 2015, 2012, NA, 2012, 2007, 201…
## $ actor_2_facebook_likes    <dbl> 936, 5000, 393, 23000, 12, 632, 11000, 553,…
## $ imdb_score                <dbl> 7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7.8, 7.5…
## $ aspect_ratio              <dbl> 1.78, 2.35, 2.35, 2.35, NA, 2.35, 2.35, 1.8…
## $ movie_facebook_likes      <dbl> 33000, 0, 85000, 164000, 0, 24000, 0, 29000…
## $ genres_2                  <chr> "Action Adventure Fantasy Sci-Fi", "Action …
## $ profit_flag               <fct> 1, 1, 0, 1, NA, 0, 1, 0, 1, 1, 1, 0, 0, 1, …

3. Data Visualisation

3.1 Genre Analysis

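#Build a corpus from the space-separated genre strings
genre <- Corpus(VectorSource(movies$genres_2))
#Document-term matrix: one row per movie, one column per genre term
genre_dtm <- DocumentTermMatrix(genre)
#Total count of each genre term across all movies
genre_freq <- colSums(as.matrix(genre_dtm))
#Same counts, sorted from most to least frequent
freq <- sort(genre_freq, decreasing=TRUE) 
genre_wf <- data.frame(word=names(genre_freq), freq=genre_freq)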
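
The same kind of genre frequency table can also be sketched in base R by splitting the original genres strings on the "|" separator. This is an alternative under stated assumptions, not the method used above: the input vector is toy data, and genre_wf_alt is a hypothetical name. One possible reason to prefer it is that text-mining tools apply their own tokenisation and normalisation defaults, so labels may not come out exactly as written.

```r
# Toy genre strings in the same "|"-separated format (illustrative only)
genres <- c("Action|Adventure|Sci-Fi", "Comedy|Drama", "Drama")

# Split each string on the literal "|" character
genre_list <- strsplit(genres, "|", fixed = TRUE)

# Count how often each genre label occurs, most frequent first
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)

# Same two-column layout as genre_wf (hypothetical name genre_wf_alt)
genre_wf_alt <- data.frame(word = names(genre_counts),
                           freq = as.integer(genre_counts),
                           stringsAsFactors = FALSE)
print(genre_wf_alt)
```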

3.1.1 Wordcloud of Genre count

df_by_genre <- imdbofficial %>% group_by(Final_Genre) %>% 
  summarise(count = n()) %>% arrange(desc(count))
set.seed(123)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(df_by_genre$Final_Genre,df_by_genre$count,random.order=TRUE,
          rot.per=.15, colors=pal2,scale=c(4,.9))
#wordcloud() has no title argument, so the title is added with base graphics
title(main="Word Cloud of Movie Genre Counts")

# Using imdb official datasource containing 20k+ movies
dim(imdbofficial)
## [1] 20253     9
plot_ly(df_by_genre, type = 'pie',
        labels = ~Final_Genre, values = ~count) %>% 
  layout(xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         yaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         title = "Distribution of Movies by Genre", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

3.2 Monetary Analysis

3.2.1 Top 10 most profitable movies

movies$profit <- movies$gross - movies$budget
movies %>% drop_na(movie_title,profit)%>%
  arrange(desc(profit)) %>% 
  head(10) %>%  
  ggplot(aes(reorder(movie_title,profit),profit,fill=movie_title))+
  geom_bar(stat="identity")+
  coord_cartesian(ylim = c(300000000,550000000))+ 
  theme_linedraw() +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  theme(legend.position = "none") +
  scale_y_continuous(labels=scales::comma)+
  labs(x="Movie Titles",
       y="Total Profit in USD",
       title="Top 10 most profitable movies")

3.2.2 Top 10 least profitable movies

movies$loss <- movies$budget - movies$gross

movies %>% drop_na(movie_title,loss)%>%
  arrange(desc(loss)) %>% 
  head(10) %>%  
  ggplot(aes(reorder(movie_title,loss),loss,fill=movie_title))+
  geom_bar(stat="identity")+ 
  theme_linedraw() +
  theme(axis.text.x = element_text(angle=10),plot.title=element_text(color="Black",face="bold"),legend.position="none")+
  scale_y_continuous(labels=scales::comma)+
  labs(x="Movie Titles",
       y="Total Loss in USD",
       title="Top 10 least profitable movies")

3.2.3 Relation between IMDB Score, Revenue & Budget

options(scipen = 999)
movies %>% 
plot_ly(x = ~imdb_score, y = ~budget, z = ~gross, 
        color = ~profit_flag,size = I(3),
        hoverinfo = 'text',
          text = ~paste('Movie: ', movie_title,
                        '</br></br> Gross: ', gross,
                        '</br> Budget: ', budget,
                        '</br> IMDB Score: ', imdb_score)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'IMDB Score'),
                      yaxis = list(title = 'Budget'),
                      zaxis = list(title = 'Revenue')),
         title = "IMDB Score vs Revenue vs Budget",
         showlegend = FALSE)

3.3 Ratings Analysis

3.3.1 Scatterplot of Ratings by Year

Based on the scatter plot, there is a general downward trend in average ratings over the years.

year_rating <- imdbofficial %>%
  select(Year,Rating,Movie) %>%
  distinct() %>%
  group_by(Year) %>%
  summarise(avg_rating = mean(Rating)) %>%
  ggplot(aes(x=Year,y=avg_rating)) +
  geom_point(aes(color = avg_rating)) +
  # se is a logical switch; the confidence level is set with level
  geom_smooth(method ="loess", se = TRUE, level = 0.95) +
  # apply the complete theme first so the text size below is not overridden
  theme_linedraw() +
  theme(text = element_text(size=20)) +
  labs (x = "Year of Movie",
        y = "Average Ratings",
        title ="Average ratings of movies produced each year from 1990-2019",
        caption = "Source: IMDB")

ggplotly(year_rating) 

3.3.2 Barplot of Ratings by Duration bins

From the bar chart of the 5 movie duration bins, movies with longer durations tend to have higher average ratings. Very Short and Short movies have similar average ratings, while the average rating increases from Just Nice to Long to Extremely Long movies.

Bins for Movie duration (minutes):

  • Very Short: < 90
  • Short: 91 - 97
  • Just Nice: 98 - 105
  • Long: 106 - 119
  • Extremely Long: 119 - 467
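
A minimal sketch of how bins like these could be built with cut() and then summarised. The durations and ratings below are hypothetical toy values, and the handling of the boundary values (90 goes to Very Short, 119 goes to Long) is an assumption, since the listed ranges leave those two cases open.

```r
# Hypothetical durations (minutes) and ratings, for illustration only
durations <- c(85, 90, 95, 100, 110, 119, 150)
ratings   <- c(5.0, 6.0, 6.0, 6.5, 7.0, 7.0, 8.0)

# Right-closed intervals: (-Inf,90], (90,97], (97,105], (105,119], (119,Inf]
bins <- cut(durations,
            breaks = c(-Inf, 90, 97, 105, 119, Inf),
            labels = c("Very Short", "Short", "Just Nice",
                       "Long", "Extremely Long"),
            right = TRUE)

# Average rating per duration bin (the quantity a bar chart would show)
avg_by_bin <- tapply(ratings, bins, mean)
print(avg_by_bin)
```

On the real data, the existing Types_of_Duration column in imdbofficial already holds these labels, so the binning step would only be needed if the cut points were changed.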

3.3.3 Boxplot of Ratings of Genre

3.3.4 Ratings vs Genre of Movie

There are two versions of this boxplot. The first uses ggplotly, and the second is built more manually by defining each point with the stat_summary function. This highlights the versatility of R as a data visualisation platform.

Version 2, created more manually using stat_summary:

imdbofficial %>% 
  ggplot(aes(x = reorder(Final_Genre,Rating), y = Rating)) +
  geom_boxplot(aes(color = Final_Genre),
               notch = TRUE) +
  # geom_hline(yintercept = 7.25, color = "deepskyblue4", size = 1.5)  +
  # Red line: mean rating across all movies
  geom_hline(aes(yintercept = mean(Rating)), color = "red", size = 0.5) +
  # Bootstrapped 95% confidence interval of the mean per genre
  # (mean_cl_boot needs the Hmisc package to be installed)
  stat_summary(fun.data = "mean_cl_boot",
               geom = "errorbar",
               color = "purple",
               fun.args = list(conf.int = 0.95),
               size = 0.5) + 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange",
               color = "purple",
               fun.args = list(conf.int = 0.95),
               size = 0.1) + 
  geom_jitter(aes(color = Final_Genre),
              alpha = 0.05) + 
  labs(title = "Boxplot Rating Analysis of Genres from `imdbofficial` dataset",
       subtitle = "Red horizontal line represents mean rating across all genres",
       caption = "Source: IMDB",
       x = "Genre Names",
       y = "Rating Scores") +
  theme_linedraw() +
  theme(axis.text.x = element_text(angle = 20),
        legend.position = "none",
        axis.title = element_text())

3.4 Correlation Analysis

Here we look at the correlations among key numerical variables in the movies dataset.

corrgram_data <- movies %>% 
  dplyr::select(duration, num_critic_for_reviews, gross,  num_voted_users, num_user_for_reviews, budget, title_year, imdb_score, movie_facebook_likes)

#Correlation matrix; pairwise complete observations are used because several columns contain NA
cor_matrix <- cor(corrgram_data, use = "pairwise.complete.obs")
corrgram(corrgram_data,legend=T,
         upper.panel = panel.shade,
         lower.panel = panel.cor)
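
Several of the numeric columns contain missing values (the glimpse output shows NA in gross, budget and others), which affects how cor() behaves. The sketch below uses a toy data frame with hypothetical numbers to show the difference between the default behaviour and use = "pairwise.complete.obs".

```r
# Toy data with missing values (hypothetical numbers)
toy <- data.frame(
  budget     = c(10, 20, NA, 40, 50),
  gross      = c(12, 25, 33, NA, 61),
  imdb_score = c(6.1, 6.8, 7.0, 7.4, 7.9)
)

# Default: any pair of columns involving an NA yields NA
cor_default <- cor(toy)

# Pairwise: each coefficient uses the rows where both columns are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(cor_default["budget", "gross"])
print(cor_pairwise["budget", "gross"])
```

With pairwise deletion, each coefficient may be based on a different subset of rows, which is worth keeping in mind when comparing cells of the matrix.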

4. Description and Insights

There were a number of really great insights that the above visualisations revealed. I will be highlighting the two main ones as requested, along with a smaller number of cool findings as well!

Key Finding 1: Of all the movie genres, Documentary was surprisingly the one with the highest ratings on IMDb. One would have expected a genre such as comedy or action to be the most successful.

However, upon further reflection, it makes sense that Documentary would have the highest rating. It is a relatively neutral genre: people are less likely to strongly love or hate it than they are with drama, for example. A documentary mostly showcases facts and thus doesn't garner a strongly negative reaction in most cases.

Key Finding 2: The longer the duration, the higher the movie rating. Using the 5 movie length bins from the Types_of_Duration variable, we see that the longest movies have the highest average ratings. Personally, I would have presumed that the middle-length movies would garner the most positive responses, as longer movies may not remain as interesting throughout.

Additionally, there were some other smaller interesting findings:

Cool Fact 1: Drama, Comedy and Romance together make up around 50% of all movies (based on around 20,000 movies since the 1990s).

Cool Fact 2: As seen from the 3D chart in 3.2.3, the movie Lady Vengeance was a huge outlier in terms of the amount of money spent on making it versus the meagre revenue it brought in. Surprisingly, it still received a high IMDb rating of 7.7 when the average is 6.33.

Overall, as a movie and visualisation enthusiast, this data revealed great insights about IMDb movies and with greater analysis, this could potentially be useful for directors/production companies to look at as well.