IMDB Dataset

I will be analyzing a data set from that includes 5000+ films from IMDB. The ultimate question of interest will be what factors, if any, influence movie ratings.

str(movie_data)

## 'data.frame':    3801 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 109 2030 1652 1228 554 2394 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 462 392 324 635 375 673 ...
##  $ duration                 : int  178 169 148 164 132 156 100 141 153 183 ...
##  $ director_facebook_likes  : int  0 563 0 22000 475 0 15 0 282 0 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 530 4000 284 19000 10000 2000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2549 1228 801 2440 653 1704 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 126 120 308 126 447 126 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 443 787 223 338 35 741 ...
##  $ movie_title              : Factor w/ 4917 levels "#HorrorÂ ","[Rec] 2Â ",..: 398 2731 3279 3707 1961 3289 3459 399 1631 461 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 1873 46055 2036 92000 58753 24450 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 2714 1970 2163 3018 2941 58 ...
##  $ facenumber_in_poster     : int  0 0 1 0 1 0 1 4 3 0 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 651 4745 29 1142 2005 1564 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 2476 2526 2458 4546 2551 4690 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 738 1902 387 1117 973 3018 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 65 65 65 65 63 65 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 10 9 10 9 10 ...
##  $ budget                   : num  237000000 300000000 245000000 250000000 263700000 ...
##  $ title_year               : int  2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 632 11000 553 21000 11000 4000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1242] 5 56 85 99 100 178 200 205 207 243 ...
##   .. ..- attr(*, "names")= chr [1:1242] "5" "56" "85" "99" ...

head(movie_data)

##   color     director_name num_critic_for_reviews duration
## 1 Color     James Cameron                    723      178
## 2 Color    Gore Verbinski                    302      169
## 3 Color        Sam Mendes                    602      148
## 4 Color Christopher Nolan                    813      164
## 6 Color    Andrew Stanton                    462      132
## 7 Color         Sam Raimi                    392      156
##   director_facebook_likes actor_3_facebook_likes     actor_2_name
## 1                       0                    855 Joel David Moore
## 2                     563                   1000    Orlando Bloom
## 3                       0                    161     Rory Kinnear
## 4                   22000                  23000   Christian Bale
## 6                     475                    530  Samantha Morton
## 7                       0                   4000     James Franco
##   actor_1_facebook_likes     gross                          genres
## 1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2                  40000 309404152        Action|Adventure|Fantasy
## 3                  11000 200074175       Action|Adventure|Thriller
## 4                  27000 448130642                 Action|Thriller
## 6                    640  73058679         Action|Adventure|Sci-Fi
## 7                  24000 336530303        Action|Adventure|Romance
##      actor_1_name                                movie_title
## 1     CCH Pounder                                   AvatarÂ 
## 2     Johnny Depp Pirates of the Caribbean: At World's EndÂ 
## 3 Christoph Waltz                                  SpectreÂ 
## 4       Tom Hardy                    The Dark Knight RisesÂ 
## 6    Daryl Sabara                              John CarterÂ 
## 7    J.K. Simmons                             Spider-Man 3Â 
##   num_voted_users cast_total_facebook_likes         actor_3_name
## 1          886204                      4834            Wes Studi
## 2          471220                     48350       Jack Davenport
## 3          275868                     11700     Stephanie Sigman
## 4         1144337                    106759 Joseph Gordon-Levitt
## 6          212204                      1873         Polly Walker
## 7          383056                     46055        Kirsten Dunst
##   facenumber_in_poster
## 1                    0
## 2                    0
## 3                    1
## 4                    0
## 6                    1
## 7                    0
##                                                      plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 2     goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 6               alien|american civil war|male nipple|mars|princess
## 7                        sandman|spider man|symbiote|venom|villain
##                                        movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
## 7 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1
##   num_user_for_reviews language country content_rating    budget
## 1                 3054  English     USA          PG-13 237000000
## 2                 1238  English     USA          PG-13 300000000
## 3                  994  English      UK          PG-13 245000000
## 4                 2701  English     USA          PG-13 250000000
## 6                  738  English     USA          PG-13 263700000
## 7                 1902  English     USA          PG-13 258000000
##   title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1       2009                    936        7.9         1.78
## 2       2007                   5000        7.1         2.35
## 3       2015                    393        6.8         2.35
## 4       2012                  23000        8.5         2.35
## 6       2012                    632        6.6         2.35
## 7       2007                  11000        6.2         2.35
##   movie_facebook_likes
## 1                33000
## 2                    0
## 3                85000
## 4               164000
## 6                24000
## 7                    0

ggplot(data = melt(movie_data), mapping = aes(x = value)) + 
  geom_histogram(bins = 10) + facet_wrap(~variable, scales = 'free_x')

This data set consists of 5043 observations of 28 variables. There appear to be many variables of interest. Many of the variables are factors.

Looking at the histograms, many of the variables in the data set show a left skew. I presume this is so because many of the variables involve Facebook likes which tend to be distributed only to the most famous actors, films, and directors. Also, many films usually do not have a huge gross profit or budget as well (for the gross/budget variables)

ggplot(movie_data, aes(x= imdb_score)) + geom_bar()

Histogram of IMDB scores. It looks roughly normally distributed with most films between roughly 6 and 7.5 rating.

corrgram_data <- movie_data %>% 
  dplyr::select(., duration, num_critic_for_reviews, gross,  num_voted_users, num_user_for_reviews, budget, title_year, imdb_score, movie_facebook_likes)
  

corrgram(corrgram_data, upper.panel = panel.pie, order = TRUE)

Correlation matrix for key variables. Surprisingly, IMDB scores do not seem correlate well with any of the variables except with the number of users voting. The strongest positive correlation is number of user voters for a particular film. Number of user reviews, duration, and critic reviews is also correlated with a better rating. Budget appears to have very little correlation with IMDB rating which is also surprising.

ggplot(movie_data, aes(x = gross, y = imdb_score))+ geom_point(alpha = 1/10) + geom_smooth()

Scatter plot with gross profit on the x axis and IMDB score on the y. As shown, there is a very slight positive correlation between gross profit and score. Many films do not have much gross profit despite being rated well!

avg_rating_director <- movie_data %>%
  group_by(., director_name) %>%
  mutate(.,no_rows = length(director_name)) %>%
  select(director_name, imdb_score, no_rows) %>% 
  filter(no_rows >= 3) %>% 
  summarise_each(funs(mean(.,na.rm=TRUE))) %>% 
  arrange(desc(imdb_score))

avg_rating_director <- as.data.frame(avg_rating_director)

head(avg_rating_director, n = 50)

##              director_name imdb_score no_rows
## 1             Sergio Leone   8.433333       3
## 2        Christopher Nolan   8.425000       8
## 3              Pete Docter   8.233333       3
## 4           Hayao Miyazaki   8.225000       4
## 5        Quentin Tarantino   8.200000       8
## 6             Milos Forman   8.133333       3
## 7               David Lean   8.000000       4
## 8           Frank Darabont   7.975000       4
## 9         Denis Villeneuve   7.966667       3
## 10             Joss Whedon   7.925000       4
## 11           James Cameron   7.914286       7
## 12 Alejandro G. IÃ±Ã¡rritu   7.840000       5
## 13         Alfonso CuarÃ³n   7.800000       4
## 14            Dean DeBlois   7.766667       3
## 15              Mel Gibson   7.766667       3
## 16           David Fincher   7.750000      10
## 17           Michael Moore   7.750000       4
## 18          Andrew Stanton   7.733333       3
## 19             Paul Haggis   7.733333       3
## 20              Tom Hooper   7.733333       3
## 21              Peter Weir   7.725000       4
## 22         Martin Scorsese   7.675000      16
## 23       Thomas Vinterberg   7.666667       3
## 24    Francis Ford Coppola   7.655556       9
## 25           Peter Jackson   7.654545      11
## 26          Matthew Vaughn   7.650000       4
## 27          Chan-wook Park   7.633333       3
## 28     Alejandro AmenÃ¡bar   7.633333       3
## 29              Mel Brooks   7.633333       3
## 30            Wes Anderson   7.628571       7
## 31            Edgar Wright   7.600000       4
## 32      Fernando Meirelles   7.600000       3
## 33             Yimou Zhang   7.600000       5
## 34         Paul Greengrass   7.585714       7
## 35               Brad Bird   7.580000       5
## 36             Spike Jonze   7.575000       4
## 37            Duncan Jones   7.566667       3
## 38        Steven Spielberg   7.544000      25
## 39             David Lynch   7.533333       3
## 40    Paul Thomas Anderson   7.516667       6
## 41        John G. Avildsen   7.500000       3
## 42          Lars von Trier   7.500000       3
## 43              Sam Mendes   7.500000       8
## 44        Darren Aronofsky   7.483333       6
## 45       Jean-Marc VallÃ©e   7.466667       3
## 46             J.J. Abrams   7.450000       4
## 47          Stephen Daldry   7.450000       4
## 48               Niki Caro   7.433333       3
## 49    Richard Attenborough   7.433333       3
## 50         Alexander Payne   7.420000       5

List of highest mean IMDB score by director with 3 or more films directed. No surprises here for the most part if you are familiar with film. Many of the names listed are some of the most well-known in the industry.

ggplot(movie_data, aes(x =factor(title_year), y = imdb_score)) + geom_boxplot() + scale_x_discrete(breaks = pretty(movie_data$title_year,n=20))

Box plots with IMDB ratings by year. There are alot more higher rated films towards the beginning of the 20th century. An explanation for this is because many older films are classics and classics, by nature, are usually the best films of their era. Bad old films are unlikely to have any ratings.

movie_data_by_year <- movie_data %>% 
  group_by(title_year) %>% 
  summarise_each(funs(median), imdb_score) %>%
  filter(title_year > 1920) %>%
  arrange(desc(imdb_score))

colnames(movie_data_by_year)[colnames(movie_data_by_year) == 'imdb_score'] <- 'median_imdb_score'

ggplot(movie_data_by_year, aes( x = title_year, y = median_imdb_score)) + geom_point() + geom_smooth() + ylim(0,10)

Median IMDB scores by year. Classics seem to score around a 7.5 and recent films around 6. Presumably, if bad older films were more frequently rated there would likely be no differences over time.

movie_data_by_year_bestyears <- movie_data_by_year %>%
                          select(title_year, median_imdb_score)

head(movie_data_by_year_bestyears, n = 10)

## # A tibble: 10 × 2
##    title_year median_imdb_score
##         <int>             <dbl>
## 1        1966              8.90
## 2        1936              8.60
## 3        1960              8.50
## 4        1954              8.45
## 5        1927              8.30
## 6        1959              8.30
## 7        1975              8.30
## 8        1957              8.20
## 9        1939              8.15
## 10       1968              7.90

movie_data_by_year_bestyears_2000 <- movie_data_by_year %>%
                          select(title_year, median_imdb_score) %>%
                          filter(title_year >= 2000)

head(movie_data_by_year_bestyears_2000, n = 10)

## # A tibble: 10 × 2
##    title_year median_imdb_score
##         <int>             <dbl>
## 1        2007              6.80
## 2        2013              6.65
## 3        2004              6.60
## 4        2005              6.60
## 5        2006              6.60
## 6        2015              6.60
## 7        2016              6.60
## 8        2002              6.50
## 9        2008              6.50
## 10       2009              6.50

Highest rated years for all years and for the aughts by median IMDB score. It looks like 1966 was a particularly good year for film followed by 1936. In the 2000’s, it looks like 2007 was the best year but there is not much variation at all between the years.

ggplot(movie_data, aes(x = num_voted_users, y = imdb_score)) + geom_point() + geom_smooth()

Plot of number of users voted versus imdb_score. It looks like a positive curvilinear relationship. I will try and transform the x-axis to hopefully something more linear.

ggplot(movie_data, aes(x = sqrt(num_voted_users), y = imdb_score)) + geom_point() + geom_smooth()

ggplot(movie_data, aes(x = sqrt(num_voted_users), y = imdb_score)) + geom_point() + geom_smooth() + geom_smooth(method= "lm", color = "red")

Transformed the x axis using the square root function to get a more linear relationship. The smoothing function shows a more linear function. In the next graph, in red, I added the linear of best fit which, except at the extremes, matches the smoothing function pretty well.

linear_sqrtnumusers <- lm(imdb_score ~ sqrt(num_voted_users), data = movie_data)

summary(linear_sqrtnumusers)

## 
## Call:
## lm(formula = imdb_score ~ sqrt(num_voted_users), data = movie_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8779 -0.5240  0.0764  0.6288  2.3881 
## 
## Coefficients:
##                         Estimate Std. Error t value            Pr(>|t|)
## (Intercept)           5.66523194 0.02652735  213.56 <0.0000000000000002
## sqrt(num_voted_users) 0.00298050 0.00008199   36.35 <0.0000000000000002
##                          
## (Intercept)           ***
## sqrt(num_voted_users) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9111 on 3799 degrees of freedom
## Multiple R-squared:  0.2581, Adjusted R-squared:  0.2579 
## F-statistic:  1321 on 1 and 3799 DF,  p-value: < 0.00000000000000022

layout(matrix(c(1,2,3,4),2,2))
plot(linear_sqrtnumusers)

Looking at the regression outputs and diagnostics, I feel pretty confident that a linear model is the best model for this relationship. However, R^2 is only 0.26. Given how complicated an IMDB score can be I feel satisfied with that result but wonder if another variable may be added for a higher R^2.

imdb_test_data <- read.csv("imdb_test_data.csv")

head(imdb_test_data)

##   num_voted_users imdb_score sqrt_num_users
## 1           26573        7.2       163.0123
## 2           14950        7.1       122.2702
## 3           10865        7.6       104.2353
## 4          225042        6.7       474.3859
## 5          204511        8.1       452.2289
## 6           29345        7.9       171.3038

new <- data.frame(num_voted_users = imdb_test_data$num_voted_users)

prediction_values <- predict(linear_sqrtnumusers, newdata = new, interval="prediction")

residuals <- prediction_values[,1] - imdb_test_data$imdb_score

test_data <- data.frame(imdb_test_data$imdb_score, prediction_values, abs(residuals))

mean(test_data$abs.residuals.)

## [1] 0.9019898

ggplot(test_data, aes(x = abs.residuals.)) + geom_histogram(binwidth = 0.2)

I took 25 films from IMDB for 2016 and compared the actual scores versus the scores predicted by the model. On average, the predictions were off by about 0.9. Looking at the histogram the vast majority were within a 1.5 error range. The one observation that did not fit in a prediction interval was a a film rated 8.4. This is not surprising since the model is less predictive when it comes to highly rated films.