For my project I will work with IMDB data on about 5,000 movies. The dataset was taken from Kaggle.
Since there is no universal way to judge how good a movie is, one could rely on the critics, on the profit it made, or on how many people liked it.
The data includes each movie's year of release, budget, gross earnings, Facebook likes and so on. I will use these aspects to analyse the greatness of a movie.
Codebook
For this project, I will use the following packages:
library(ggplot2) # for creating graphs
library(dplyr) # for data transformation and manipulation
library(knitr) # for knitting R code to HTML files
library(tidyr) # for tidying the data set
library(magrittr) # for chaining commands with the pipe operator, %>%
library(tidyverse) # umbrella package; also attaches ggplot2, dplyr, tidyr, tibble and stringr
library(tibble) # for tibbles, a modern re-imagining of data frames
library(colorspace) # for colour palettes
library(scales) # for formatting axis and legend labels
library(lubridate) # for working with dates
library(stringr) # for string manipulation
library(ggrepel) # for non-overlapping text labels in plots
temp <- tempfile()
download.file("http://kaggle2.blob.core.windows.net/datasets/138/287/movie_metadata.csv.zip",temp)
imdb_data <- read.csv(unz(temp, "movie_metadata.csv"))
unlink(temp)
imdb <- as_tibble(imdb_data)
imdb
## # A tibble: 5,043 × 28
## color director_name num_critic_for_reviews duration
## <fctr> <fctr> <int> <int>
## 1 Color James Cameron 723 178
## 2 Color Gore Verbinski 302 169
## 3 Color Sam Mendes 602 148
## 4 Color Christopher Nolan 813 164
## 5 Doug Walker NA NA
## 6 Color Andrew Stanton 462 132
## 7 Color Sam Raimi 392 156
## 8 Color Nathan Greno 324 100
## 9 Color Joss Whedon 635 141
## 10 Color David Yates 375 153
## # ... with 5,033 more rows, and 24 more variables:
## # director_facebook_likes <int>, actor_3_facebook_likes <int>,
## # actor_2_name <fctr>, actor_1_facebook_likes <int>, gross <int>,
## # genres <fctr>, actor_1_name <fctr>, movie_title <fctr>,
## # num_voted_users <int>, cast_total_facebook_likes <int>,
## # actor_3_name <fctr>, facenumber_in_poster <int>, plot_keywords <fctr>,
## # movie_imdb_link <fctr>, num_user_for_reviews <int>, language <fctr>,
## # country <fctr>, content_rating <fctr>, budget <dbl>, title_year <int>,
## # actor_2_facebook_likes <int>, imdb_score <dbl>, aspect_ratio <dbl>,
## # movie_facebook_likes <int>
dim(imdb)
## [1] 5043 28
str(imdb)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5043 obs. of 28 variables:
## $ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
## $ movie_title : Factor w/ 4917 levels "#Horror ","[Rec] 2 ",..: 398 2731 3279 3707 3332 1961 3289 3459 399 1631 ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 1 2714 1970 2163 3018 2941 ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
summary(imdb)
## color director_name num_critic_for_reviews
## : 19 : 104 Min. : 1.0
## Black and White: 209 Steven Spielberg: 26 1st Qu.: 50.0
## Color :4815 Woody Allen : 22 Median :110.0
## Clint Eastwood : 20 Mean :140.2
## Martin Scorsese : 20 3rd Qu.:195.0
## Ridley Scott : 17 Max. :813.0
## (Other) :4834 NA's :50
## duration director_facebook_likes actor_3_facebook_likes
## Min. : 7.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 93.0 1st Qu.: 7.0 1st Qu.: 133.0
## Median :103.0 Median : 49.0 Median : 371.5
## Mean :107.2 Mean : 686.5 Mean : 645.0
## 3rd Qu.:118.0 3rd Qu.: 194.5 3rd Qu.: 636.0
## Max. :511.0 Max. :23000.0 Max. :23000.0
## NA's :15 NA's :104 NA's :23
## actor_2_name actor_1_facebook_likes gross
## Morgan Freeman : 20 Min. : 0 Min. : 162
## Charlize Theron: 15 1st Qu.: 614 1st Qu.: 5340988
## Brad Pitt : 14 Median : 988 Median : 25517500
## : 13 Mean : 6560 Mean : 48468408
## James Franco : 11 3rd Qu.: 11000 3rd Qu.: 62309438
## Meryl Streep : 11 Max. :640000 Max. :760505847
## (Other) :4959 NA's :7 NA's :884
## genres actor_1_name
## Drama : 236 Robert De Niro : 49
## Comedy : 209 Johnny Depp : 41
## Comedy|Drama : 191 Nicolas Cage : 33
## Comedy|Drama|Romance: 187 J.K. Simmons : 31
## Comedy|Romance : 158 Bruce Willis : 30
## Drama|Romance : 152 Denzel Washington: 30
## (Other) :3910 (Other) :4829
## movie_title num_voted_users
## Ben-Hur : 3 Min. : 5
## Halloween : 3 1st Qu.: 8594
## Home : 3 Median : 34359
## King Kong : 3 Mean : 83668
## Pan : 3 3rd Qu.: 96309
## The Fast and the Furious : 3 Max. :1689764
## (Other) :5025
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## Min. : 0 : 23 Min. : 0.000
## 1st Qu.: 1411 Ben Mendelsohn: 8 1st Qu.: 0.000
## Median : 3090 John Heard : 8 Median : 1.000
## Mean : 9699 Steve Coogan : 8 Mean : 1.371
## 3rd Qu.: 13756 Anne Hathaway : 7 3rd Qu.: 2.000
## Max. :656730 Jon Gries : 7 Max. :43.000
## (Other) :4982 NA's :13
## plot_keywords
## : 153
## based on novel : 4
## 1940s|child hero|fantasy world|orphan|reference to peter pan : 3
## alien friendship|alien invasion|australia|flying car|mother daughter relationship: 3
## animal name in title|ape abducts a woman|gorilla|island|king kong : 3
## assistant|experiment|frankenstein|medical student|scientist : 3
## (Other) :4874
## movie_imdb_link
## http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1: 3
## (Other) :5025
## num_user_for_reviews language country content_rating
## Min. : 1.0 English :4704 USA :3807 R :2118
## 1st Qu.: 65.0 French : 73 UK : 448 PG-13 :1461
## Median : 156.0 Spanish : 40 France : 154 PG : 701
## Mean : 272.8 Hindi : 28 Canada : 126 : 303
## 3rd Qu.: 326.0 Mandarin: 26 Germany : 97 Not Rated: 116
## Max. :5060.0 German : 19 Australia: 55 G : 112
## NA's :21 (Other) : 153 (Other) : 356 (Other) : 232
## budget title_year actor_2_facebook_likes imdb_score
## Min. :2.180e+02 Min. :1916 Min. : 0 Min. :1.600
## 1st Qu.:6.000e+06 1st Qu.:1999 1st Qu.: 281 1st Qu.:5.800
## Median :2.000e+07 Median :2005 Median : 595 Median :6.600
## Mean :3.975e+07 Mean :2002 Mean : 1652 Mean :6.442
## 3rd Qu.:4.500e+07 3rd Qu.:2011 3rd Qu.: 918 3rd Qu.:7.200
## Max. :1.222e+10 Max. :2016 Max. :137000 Max. :9.500
## NA's :492 NA's :108 NA's :13
## aspect_ratio movie_facebook_likes
## Min. : 1.18 Min. : 0
## 1st Qu.: 1.85 1st Qu.: 0
## Median : 2.35 Median : 166
## Mean : 2.22 Mean : 7526
## 3rd Qu.: 2.35 3rd Qu.: 3000
## Max. :16.00 Max. :349000
## NA's :329
range(imdb$title_year, na.rm = TRUE)
## [1] 1916 2016
# Remove instances which have at least one NA variable
imdb <- imdb[complete.cases(imdb), ]
dim(imdb)
## [1] 3801 28
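One thing worth checking at this step: `complete.cases()` only looks for `NA`, while the summary above also shows blank-string levels (for example, 19 movies with an empty `color`). The sketch below uses a made-up three-row data frame, not the real data, to show the difference. Recoding blanks to `NA` is one possible option, offered here as an assumption rather than as part of the original plan.

```r
# Toy data frame with hypothetical values: one blank string and one NA
toy <- data.frame(
  color = c("Color", "", "Color"),
  gross = c(100L, 200L, NA),
  stringsAsFactors = FALSE
)

# complete.cases() flags only rows that contain NA,
# so the row with the blank string is kept
kept <- toy[complete.cases(toy), ]
nrow(kept)

# Recoding blanks to NA first makes complete.cases() drop that row too
toy$color[toy$color == ""] <- NA
kept_strict <- toy[complete.cases(toy), ]
nrow(kept_strict)
```

Whether blank entries should count as missing depends on the analysis; for the profit-based questions only the numeric columns matter.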
# Convert year from int to Date
# (with only "%Y" given, as.Date() fills in the current month and day)
movie_year <- as.Date(as.character(imdb$title_year), "%Y")
# Remove duplicate movie titles
imdb <- imdb[!duplicated(imdb$movie_title),]
dim(imdb)
## [1] 3700 28
# Remove the special character (Â) at the end of each movie title
# (the result must be assigned back, otherwise imdb is left unchanged)
imdb$movie_title <- str_replace_all(imdb$movie_title, pattern = "Â", replacement = "")
# Add two columns: profit and percentage return on investment
# (assign the result back so the new columns are kept)
imdb <- imdb %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)
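As a quick sanity check of the two formulas, here is a minimal sketch on a made-up two-row tibble. The titles and figures are hypothetical and chosen only to make the arithmetic easy to follow; the column names match those used above.

```r
library(dplyr)
library(tibble)

# Hypothetical movies, not taken from the IMDB data
toy <- tibble(
  movie_title = c("Low budget film", "Big budget film"),
  budget      = c(1e6, 2e8),
  gross       = c(5e6, 3e8)
)

# Same derived columns as in the cleaning step
toy <- toy %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)

toy
```

In this toy example the big-budget film earns the larger profit (1e8 against 4e6) but the smaller return (50% against 400%), which is why the two measures are treated separately.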
I plan to analyse the following areas.
Top 20 most profitable directors, ranked by the average profit per film across each director's movies over the years.
Top 20 movies by profit (gross earnings - budget). Since the data spans 100 years, from 1916 to 2016, and the earnings and budget figures are not adjusted for inflation or market value, I plan to restrict this analysis to the last 10 years.
Top 20 movies by percentage return on investment ((profit/budget)*100). Since profit alone does not give a clear picture of a movie's monetary success across such a long period (1916 to 2016), I plan to analyse the return on investment against the budget. My hypothesis is that ROI will be high for low-budget films and will decrease as the cost of the movie increases.
Commercial success of a movie (gross earnings and profit) vs its IMDB score. I do not expect to see much correlation, since most critically acclaimed movies do not do well commercially.
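The planned analyses could be sketched with dplyr roughly as follows. This is only an outline under stated assumptions: `movies` is a small made-up stand-in for the cleaned `imdb` tibble, "the last 10 years" is taken to mean `title_year >= 2007`, and the final analysis may well look different.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the cleaned imdb tibble (values are made up)
movies <- tibble(
  director_name = c("A", "A", "B", "C"),
  movie_title   = c("m1", "m2", "m3", "m4"),
  title_year    = c(2008L, 2012L, 2014L, 1990L),
  budget        = c(10, 20, 5, 50),
  gross         = c(30, 20, 30, 40),
  imdb_score    = c(7.0, 6.0, 8.0, 5.5)
) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)

# 1. Directors ranked by average profit per film
top_directors <- movies %>%
  group_by(director_name) %>%
  summarise(avg_profit = mean(profit)) %>%
  arrange(desc(avg_profit)) %>%
  head(20)

# 2. Movies ranked by profit, restricted to recent years
#    (assumption: "last 10 years" means 2007 onwards)
top_profit <- movies %>%
  filter(title_year >= 2007) %>%
  arrange(desc(profit)) %>%
  head(20)

# 3. Movies ranked by percentage return on investment
top_roi <- movies %>%
  arrange(desc(return_on_investment_perc)) %>%
  head(20)

# 4. Correlation between profit and IMDB score
score_profit_cor <- cor(movies$profit, movies$imdb_score)
```

Using `head(20)` after `arrange(desc(...))` returns all rows when fewer than 20 exist, so the same code works on both the toy tibble and the full data set.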