Source Code

For my project, I will be working with IMDB data on roughly 5,000 movies. The dataset was taken from Kaggle.

Since there is no universal way to judge how good a movie is, one could rely on the critics, on the profit it made, or on how many people liked it.

The data includes each movie's year of release, budget, gross earnings, Facebook likes, and so on. I will use these variables to analyse the greatness of a movie.

Codebook

  • movie_title: Title of the movie
  • duration: Duration in minutes
  • director_name: Name of the director of the movie
  • director_facebook_likes: Number of likes on the director's Facebook page
  • color: Film colorization, ‘Black and White’ or ‘Color’
  • genres: Film categorization, such as ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’
  • actor_1_name: Primary actor starring in the movie
  • actor_1_facebook_likes: Number of likes on actor 1's Facebook page
  • actor_2_name: Other actor starring in the movie
  • actor_2_facebook_likes: Number of likes on actor 2's Facebook page
  • actor_3_name: Other actor starring in the movie
  • actor_3_facebook_likes: Number of likes on actor 3's Facebook page
  • num_critic_for_reviews: Number of critic reviews on IMDB
  • num_voted_users: Number of people who voted for the movie
  • cast_total_facebook_likes: Total number of Facebook likes of the entire cast of the movie
  • language: English, Arabic, Chinese, French, German, Danish, Italian, Japanese, etc.
  • country: Country where the movie was produced
  • gross: Gross earnings of the movie in dollars
  • budget: Budget of the movie in dollars
  • title_year: Year in which the movie was released (1916 to 2016)
  • imdb_score: Score of the movie on IMDB
  • movie_facebook_likes: Number of Facebook likes on the movie's page

Packages Required

For this project, I will use the following packages:

library(ggplot2)        # for creating graphs
library(dplyr)          # for data transformation and manipulation
library(knitr)          # for knitting R code to HTML files
library(tidyr)          # for tidying the data set
library(magrittr)       # for chaining commands with the pipe operator, %>%
library(tidyverse)      # loads the core tidyverse packages (ggplot2, dplyr, tidyr, tibble and others)
library(tibble)         # for tibbles, a modern re-imagining of data frames
library(colorspace)     # for color palettes
library(scales)         # for formatting axis and legend labels
library(lubridate)      # for working with dates
library(stringr)        # for string manipulation
library(ggrepel)        # for non-overlapping text labels in ggplot2

Import Data

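# Download the zipped CSV to a temporary file
temp <- tempfile()
download.file("http://kaggle2.blob.core.windows.net/datasets/138/287/movie_metadata.csv.zip",temp)
# Read the CSV directly from inside the zip archive, then delete the temporary file
imdb_data <- read.csv(unz(temp, "movie_metadata.csv"))
unlink(temp)
# Convert to a tibble for more compact printing
imdb <- as_tibble(imdb_data)
imdb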
## # A tibble: 5,043 × 28
##     color     director_name num_critic_for_reviews duration
##    <fctr>            <fctr>                  <int>    <int>
## 1   Color     James Cameron                    723      178
## 2   Color    Gore Verbinski                    302      169
## 3   Color        Sam Mendes                    602      148
## 4   Color Christopher Nolan                    813      164
## 5               Doug Walker                     NA       NA
## 6   Color    Andrew Stanton                    462      132
## 7   Color         Sam Raimi                    392      156
## 8   Color      Nathan Greno                    324      100
## 9   Color       Joss Whedon                    635      141
## 10  Color       David Yates                    375      153
## # ... with 5,033 more rows, and 24 more variables:
## #   director_facebook_likes <int>, actor_3_facebook_likes <int>,
## #   actor_2_name <fctr>, actor_1_facebook_likes <int>, gross <int>,
## #   genres <fctr>, actor_1_name <fctr>, movie_title <fctr>,
## #   num_voted_users <int>, cast_total_facebook_likes <int>,
## #   actor_3_name <fctr>, facenumber_in_poster <int>, plot_keywords <fctr>,
## #   movie_imdb_link <fctr>, num_user_for_reviews <int>, language <fctr>,
## #   country <fctr>, content_rating <fctr>, budget <dbl>, title_year <int>,
## #   actor_2_facebook_likes <int>, imdb_score <dbl>, aspect_ratio <dbl>,
## #   movie_facebook_likes <int>
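
Row 5 of the printed tibble has an empty color value rather than NA. By default read.csv() treats only the string "NA" as missing in text columns, so blank text fields arrive as empty strings. The sketch below uses a made-up three-row CSV (the file contents and the toy_default / toy_na names are illustrative, not part of the Kaggle data) to show how the na.strings argument could mark blanks as missing at import time. It also sets stringsAsFactors = FALSE, which is a choice of this sketch and differs from the factor columns shown above.

```r
# Toy CSV standing in for movie_metadata.csv (illustrative values only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("color,director_name,duration",
             "Color,James Cameron,178",
             ",Doug Walker,",
             "Color,Sam Mendes,148"),
           csv_path)

# Default import: the blank text field becomes "", not NA
toy_default <- read.csv(csv_path, stringsAsFactors = FALSE)

# Treat empty strings as missing while reading
toy_na <- read.csv(csv_path, stringsAsFactors = FALSE,
                   na.strings = c("", "NA"))
unlink(csv_path)

sum(is.na(toy_default$color))  # blank color is not counted as missing
sum(is.na(toy_na$color))       # blank color is now counted as missing
```

Whether blanks should count as missing depends on the analysis; with the default import, complete.cases() keeps rows whose text fields are empty.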

Data Description

dim(imdb)
## [1] 5043   28
str(imdb)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
##  $ movie_title              : Factor w/ 4917 levels "#Horror ","[Rec] 2 ",..: 398 2731 3279 3707 3332 1961 3289 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 1 2714 1970 2163 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
summary(imdb)
##               color               director_name  num_critic_for_reviews
##                  :  19                   : 104   Min.   :  1.0         
##   Black and White: 209   Steven Spielberg:  26   1st Qu.: 50.0         
##  Color           :4815   Woody Allen     :  22   Median :110.0         
##                          Clint Eastwood  :  20   Mean   :140.2         
##                          Martin Scorsese :  20   3rd Qu.:195.0         
##                          Ridley Scott    :  17   Max.   :813.0         
##                          (Other)         :4834   NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Charlize Theron:  15   1st Qu.:   614         1st Qu.:  5340988  
##  Brad Pitt      :  14   Median :   988         Median : 25517500  
##                 :  13   Mean   :  6560         Mean   : 48468408  
##  James Franco   :  11   3rd Qu.: 11000         3rd Qu.: 62309438  
##  Meryl Streep   :  11   Max.   :640000         Max.   :760505847  
##  (Other)        :4959   NA's   :7              NA's   :884        
##                   genres                actor_1_name 
##  Drama               : 236   Robert De Niro   :  49  
##  Comedy              : 209   Johnny Depp      :  41  
##  Comedy|Drama        : 191   Nicolas Cage     :  33  
##  Comedy|Drama|Romance: 187   J.K. Simmons     :  31  
##  Comedy|Romance      : 158   Bruce Willis     :  30  
##  Drama|Romance       : 152   Denzel Washington:  30  
##  (Other)             :3910   (Other)          :4829  
##                      movie_title   num_voted_users  
##  Ben-Hur                  :   3   Min.   :      5  
##  Halloween                :   3   1st Qu.:   8594  
##  Home                     :   3   Median :  34359  
##  King Kong                :   3   Mean   :  83668  
##  Pan                      :   3   3rd Qu.:  96309  
##  The Fast and the Furious :   3   Max.   :1689764  
##  (Other)                   :5025                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0                          :  23   Min.   : 0.000      
##  1st Qu.:  1411            Ben Mendelsohn:   8   1st Qu.: 0.000      
##  Median :  3090            John Heard    :   8   Median : 1.000      
##  Mean   :  9699            Steve Coogan  :   8   Mean   : 1.371      
##  3rd Qu.: 13756            Anne Hathaway :   7   3rd Qu.: 2.000      
##  Max.   :656730            Jon Gries     :   7   Max.   :43.000      
##                            (Other)       :4982   NA's   :13          
##                                                                            plot_keywords 
##                                                                                   : 153  
##  based on novel                                                                   :   4  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  (Other)                                                                          :4874  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :5025  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4704   USA      :3807   R        :2118  
##  1st Qu.:  65.0       French  :  73   UK       : 448   PG-13    :1461  
##  Median : 156.0       Spanish :  40   France   : 154   PG       : 701  
##  Mean   : 272.8       Hindi   :  28   Canada   : 126            : 303  
##  3rd Qu.: 326.0       Mandarin:  26   Germany  :  97   Not Rated: 116  
##  Max.   :5060.0       German  :  19   Australia:  55   G        : 112  
##  NA's   :21           (Other) : 153   (Other)  : 356   (Other)  : 232  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   281         1st Qu.:5.800  
##  Median :2.000e+07   Median :2005   Median :   595         Median :6.600  
##  Mean   :3.975e+07   Mean   :2002   Mean   :  1652         Mean   :6.442  
##  3rd Qu.:4.500e+07   3rd Qu.:2011   3rd Qu.:   918         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :492         NA's   :108    NA's   :13                            
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   166      
##  Mean   : 2.22   Mean   :  7526      
##  3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :16.00   Max.   :349000      
##  NA's   :329
range(imdb$title_year, na.rm = TRUE)
## [1] 1916 2016
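
str() reports 28 variables, while the codebook describes 22. A small sketch for listing the columns that the codebook does not cover; the two vectors are typed in by hand from the output and the codebook above, and with the data loaded names(imdb) could replace data_cols.

```r
# Column names as printed by str(imdb)
data_cols <- c("color", "director_name", "num_critic_for_reviews", "duration",
               "director_facebook_likes", "actor_3_facebook_likes",
               "actor_2_name", "actor_1_facebook_likes", "gross", "genres",
               "actor_1_name", "movie_title", "num_voted_users",
               "cast_total_facebook_likes", "actor_3_name",
               "facenumber_in_poster", "plot_keywords", "movie_imdb_link",
               "num_user_for_reviews", "language", "country", "content_rating",
               "budget", "title_year", "actor_2_facebook_likes", "imdb_score",
               "aspect_ratio", "movie_facebook_likes")

# Variables described in the codebook
codebook_cols <- c("movie_title", "duration", "director_name",
                   "director_facebook_likes", "color", "genres",
                   "actor_1_name", "actor_1_facebook_likes",
                   "actor_2_name", "actor_2_facebook_likes",
                   "actor_3_name", "actor_3_facebook_likes",
                   "num_critic_for_reviews", "num_voted_users",
                   "cast_total_facebook_likes", "language", "country",
                   "gross", "budget", "title_year", "imdb_score",
                   "movie_facebook_likes")

# Columns present in the data but not described in the codebook
undocumented <- setdiff(data_cols, codebook_cols)
undocumented

# Codebook entries with no matching column in the data
unmatched <- setdiff(codebook_cols, data_cols)
unmatched
```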

Data Cleaning

# Remove rows that have at least one NA value
imdb <- imdb[complete.cases(imdb), ]
dim(imdb)
## [1] 3801   28
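
complete.cases() reduces the data from 5043 to 3801 rows. Before dropping rows it can help to see which columns account for the missing values (in the summary() output, gross has 884 NAs and budget has 492). A small sketch on a made-up data frame; toy and its values are illustrative only.

```r
toy <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  gross       = c(100, NA, 300, NA),
  budget      = c(50, 80, NA, NA),
  imdb_score  = c(7.1, 6.5, 8.0, 5.9),
  stringsAsFactors = FALSE
)

# Missing values per column, largest first
na_per_column <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_per_column

# Number of rows that survive complete.cases() on all columns
sum(complete.cases(toy))

# Alternative: require completeness only in the columns an analysis uses
toy_subset <- toy[complete.cases(toy[, c("gross", "imdb_score")]), ]
nrow(toy_subset)
```

Restricting the check to the columns actually used keeps more rows, at the cost of a different subset per analysis.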
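# Remove duplicate movie titles
imdb <- imdb[!duplicated(imdb$movie_title), ]
dim(imdb)
## [1] 3700   28

# Convert the release year from int to Date. This is done after de-duplication
# so that the vector lines up with the rows of imdb. With only "%Y" supplied,
# as.Date() fills in the current month and day, so only the year is meaningful.
movie_year <- as.Date(as.character(imdb$title_year), "%Y")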
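# Remove the special character (Â) at the end of each movie title and store the
# result back in the data (the column becomes character instead of factor)
imdb$movie_title <- str_replace_all(as.character(imdb$movie_title), pattern = "Â", replacement = "")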
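
A trailing "Â" is typically what a UTF-8 non-breaking space looks like when the file is read with a Latin-1 encoding: the character shows up as "Â" followed by a non-breaking space. That this is the cause here is an assumption, though the trailing space in levels such as "#Horror " in the str() output is consistent with it. A minimal sketch on a made-up title that removes both characters, using base R only:

```r
# Hypothetical title carrying the suspected debris: "Â" (U+00C2) followed
# by a non-breaking space (U+00A0)
title_raw <- "Avatar\u00c2\u00a0"

# Drop both characters, then trim any ordinary leading/trailing whitespace
title_clean <- trimws(gsub("\u00c2|\u00a0", "", title_raw))

title_clean
nchar(title_clean)
```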

# Add two columns: profit and percentage return on investment
imdb <- imdb %>% 
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)
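
To make the two derived columns concrete, here is a worked example on three made-up movies (the titles and figures are illustrative, not taken from the dataset). It shows how the film with the largest profit need not be the one with the largest percentage return.

```r
library(dplyr)

toy <- data.frame(
  movie_title = c("Low budget hit", "Blockbuster", "Flop"),
  budget      = c(1e6, 2e8, 5e7),
  gross       = c(2e7, 6e8, 1e7),
  stringsAsFactors = FALSE
)

toy_roi <- toy %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)

toy_roi

# Movie with the largest profit vs. movie with the largest return
toy_roi$movie_title[which.max(toy_roi$profit)]
toy_roi$movie_title[which.max(toy_roi$return_on_investment_perc)]
```

A negative return_on_investment_perc, as for the third row, means the movie earned less than it cost.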

Planned Analysis

I plan to analyse the following areas:

  1. Top 20 most profitable directors, ranked by the average profit per film across each director's movies over the years.

  2. Top 20 movies by profit (gross earnings - budget). The data spans 100 years, from 1916 to 2016, and the earnings and budget values are not adjusted for inflation or market value, so I plan to restrict this analysis to the last 10 years.

  3. Top 20 movies by percentage return on investment ((profit/budget)*100). Since the profit earned by a movie doesn't give a clear picture of its monetary success across the years (1916 to 2016), I plan to analyse how the return on investment varies with the budget. My hypothesis is that the ROI is high for low-budget films and decreases as the cost of the movie increases.

  4. Commercial success of the movie (gross earnings and profit) versus its IMDB score. I do not expect to see much correlation, since most critically acclaimed movies don't do very well commercially.
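
The four planned analyses could be sketched with dplyr roughly as follows. This is only an outline on six made-up movies: the column names follow the codebook, but the values, the toy data frame and the result names (top_directors, recent_top, top_roi) are illustrative. Reading "the last 10 years" as the ten most recent release years in the data is an assumption of the sketch.

```r
library(dplyr)

# Made-up data; column names follow the codebook
toy <- data.frame(
  movie_title   = c("M1", "M2", "M3", "M4", "M5", "M6"),
  director_name = c("Dir A", "Dir A", "Dir B", "Dir B", "Dir C", "Dir C"),
  title_year    = c(1995, 2010, 2008, 2014, 2012, 2016),
  budget        = c(1e7, 5e7, 2e6, 1e8, 3e7, 8e7),
  gross         = c(4e7, 9e7, 3e7, 1.5e8, 2e7, 2e8),
  imdb_score    = c(7.8, 6.4, 8.1, 6.9, 5.5, 7.2),
  stringsAsFactors = FALSE
) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)

# 1. Directors ranked by average profit per film
top_directors <- toy %>%
  group_by(director_name) %>%
  summarise(avg_profit = mean(profit), n_films = n()) %>%
  arrange(desc(avg_profit)) %>%
  head(20)

# 2. Most profitable movies among the ten most recent release years
recent_top <- toy %>%
  filter(title_year >= max(title_year) - 9) %>%
  arrange(desc(profit)) %>%
  head(20)

# 3. Movies ranked by percentage return on investment
top_roi <- toy %>%
  arrange(desc(return_on_investment_perc)) %>%
  head(20)

# 4. Correlation between commercial success and IMDB score
cor(toy$gross, toy$imdb_score)
cor(toy$profit, toy$imdb_score)
```

With only a handful of rows the correlations mean little; on the full data a scatter plot of imdb_score against gross or profit would show the relationship more clearly than a single coefficient.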