It's Movie Time!

Susan Li

April 7, 2017

As moviegoers, we rely heavily on how many gold stars a movie gets before deciding whether to watch it. I have to admit that we sometimes miss good movies because some critics' reviews are controversial, and at other times we regret watching a movie because it was not what we expected.

While browsing Kaggle datasets, I came across an IMDB movie dataset that contains 5043 movies and 28 variables. Looking at the variables, I think I might be able to find something interesting.

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
##  $ actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : chr  "English" "English" "English" "English" ...
##  $ country                  : chr  "USA" "USA" "UK" "USA" ...
##  $ content_rating           : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

## [1] 5043   28

##     color           director_name      num_critic_for_reviews
##  Length:5043        Length:5043        Min.   :  1.0         
##  Class :character   Class :character   1st Qu.: 50.0         
##  Mode  :character   Mode  :character   Median :110.0         
##                                        Mean   :140.2         
##                                        3rd Qu.:195.0         
##                                        Max.   :813.0         
##                                        NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##  actor_2_name       actor_1_facebook_likes     gross          
##  Length:5043        Min.   :     0         Min.   :      162  
##  Class :character   1st Qu.:   614         1st Qu.:  5340988  
##  Mode  :character   Median :   988         Median : 25517500  
##                     Mean   :  6560         Mean   : 48468408  
##                     3rd Qu.: 11000         3rd Qu.: 62309438  
##                     Max.   :640000         Max.   :760505847  
##                     NA's   :7              NA's   :884        
##     genres          actor_1_name       movie_title       
##  Length:5043        Length:5043        Length:5043       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  num_voted_users   cast_total_facebook_likes actor_3_name      
##  Min.   :      5   Min.   :     0            Length:5043       
##  1st Qu.:   8594   1st Qu.:  1411            Class :character  
##  Median :  34359   Median :  3090            Mode  :character  
##  Mean   :  83668   Mean   :  9699                              
##  3rd Qu.:  96309   3rd Qu.: 13756                              
##  Max.   :1689764   Max.   :656730                              
##                                                                
##  facenumber_in_poster plot_keywords      movie_imdb_link   
##  Min.   : 0.000       Length:5043        Length:5043       
##  1st Qu.: 0.000       Class :character   Class :character  
##  Median : 1.000       Mode  :character   Mode  :character  
##  Mean   : 1.371                                            
##  3rd Qu.: 2.000                                            
##  Max.   :43.000                                            
##  NA's   :13                                                
##  num_user_for_reviews   language           country         
##  Min.   :   1.0       Length:5043        Length:5043       
##  1st Qu.:  65.0       Class :character   Class :character  
##  Median : 156.0       Mode  :character   Mode  :character  
##  Mean   : 272.8                                            
##  3rd Qu.: 326.0                                            
##  Max.   :5060.0                                            
##  NA's   :21                                                
##  content_rating         budget            title_year  
##  Length:5043        Min.   :2.180e+02   Min.   :1916  
##  Class :character   1st Qu.:6.000e+06   1st Qu.:1999  
##  Mode  :character   Median :2.000e+07   Median :2005  
##                     Mean   :3.975e+07   Mean   :2002  
##                     3rd Qu.:4.500e+07   3rd Qu.:2011  
##                     Max.   :1.222e+10   Max.   :2016  
##                     NA's   :492         NA's   :108   
##  actor_2_facebook_likes   imdb_score     aspect_ratio  
##  Min.   :     0         Min.   :1.600   Min.   : 1.18  
##  1st Qu.:   281         1st Qu.:5.800   1st Qu.: 1.85  
##  Median :   595         Median :6.600   Median : 2.35  
##  Mean   :  1652         Mean   :6.442   Mean   : 2.22  
##  3rd Qu.:   918         3rd Qu.:7.200   3rd Qu.: 2.35  
##  Max.   :137000         Max.   :9.500   Max.   :16.00  
##  NA's   :13                             NA's   :329    
##  movie_facebook_likes
##  Min.   :     0      
##  1st Qu.:     0      
##  Median :   166      
##  Mean   :  7526      
##  3rd Qu.:  3000      
##  Max.   :349000      
##

Always start from the distribution of the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    50.0   110.0   140.2   195.0   813.0      50

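The distribution of the number of critic reviews is right skewed: the mean (140.2) is well above the median (110). Among these 5043 movies, the minimum number of reviews was 1 and the maximum was 813, and 50 movies have no value recorded. About three quarters of the movies received fewer than 200 reviews (the third quartile is 195).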
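A quick way to see the skew is to compare the mean with the median. The sketch below uses only the first ten `num_critic_for_reviews` values from the structure listing at the top, not the full dataset, and the variable names are my own.

```r
# First ten values of num_critic_for_reviews from the str() listing.
critic_reviews <- c(723, 302, 602, 813, NA, 462, 392, 324, 635, 375)

# summary() reports the quartiles and counts the missing values.
summary(critic_reviews)

# A mean above the median hints at a right (positive) skew.
# na.rm = TRUE is needed because mean() and median() return NA otherwise.
mean_reviews   <- mean(critic_reviews, na.rm = TRUE)
median_reviews <- median(critic_reviews, na.rm = TRUE)
mean_reviews > median_reviews
```

On the full column the same comparison is 140.2 against 110, as the summary shows.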

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.800   6.600   6.442   7.200   9.500

The score distribution is left skewed (the mean, 6.442, sits below the median, 6.600), with a minimum score of 1.60 and a maximum score of 9.50.

Most of the movies in the dataset were produced after 2000.

However, the movies with the highest scores were produced in the 1950s, and a significant number of low-scoring movies have come out in recent years.

Which countries produced the most movies and which countries have the highest scores?

The USA produced the most movies.

But that does not mean its movies are all of good quality. Kyrgyzstan, Libya and United Arab Emirates might have the highest average scores.
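Counting movies and averaging scores per country can be sketched in base R as follows. The data frame holds only the first four rows of `country` and `imdb_score` from the structure listing, and `movies_head`, `movie_counts` and `avg_score` are names I made up for the illustration.

```r
# First four rows of country and imdb_score from the str() listing.
movies_head <- data.frame(
  country    = c("USA", "USA", "UK", "USA"),
  imdb_score = c(7.9, 7.1, 6.8, 8.5)
)

# Number of movies per country, largest first.
movie_counts <- sort(table(movies_head$country), decreasing = TRUE)

# Average score per country; keep the count next to it, because an
# average over one or two movies is not very reliable.
avg_score <- aggregate(imdb_score ~ country, data = movies_head, FUN = mean)
avg_score$n_movies <- as.integer(movie_counts[as.character(avg_score$country)])
avg_score[order(-avg_score$imdb_score), ]
```

Keeping `n_movies` beside the average makes it easy to tell a high average built on many movies from one built on a handful.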

How about directors?

Multiple Linear Regression - Variable Selection

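Time for some serious work: I intend to predict IMDB scores from the other variables using a multiple linear regression model. Because regression can't deal with missing values, I will eliminate them first. The summary below still has 5043 rows and the column means are unchanged, which is consistent with filling each missing value with its column mean rather than dropping rows.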
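One way to remove missing values without losing rows is to replace each NA with the mean of its column. This is a minimal sketch under that assumption; the helper `impute_mean` is hypothetical, and the toy column is the first ten `duration` values from the structure listing.

```r
# First ten values of duration from the str() listing.
toy <- data.frame(duration = c(178, 169, 148, 164, NA, 132, 156, 100, 141, 153))

# Hypothetical helper: replace each NA with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply to every numeric column; non-numeric columns are left untouched.
numeric_cols <- vapply(toy, is.numeric, logical(1))
toy[numeric_cols] <- lapply(toy[numeric_cols], impute_mean)

anyNA(toy$duration)   # FALSE
```

Dropping incomplete rows with `na.omit()` is the other common option, at the cost of fewer observations.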

##     color           director_name      num_critic_for_reviews
##  Length:5043        Length:5043        Min.   :  1.0         
##  Class :character   Class :character   1st Qu.: 50.0         
##  Mode  :character   Mode  :character   Median :111.0         
##                                        Mean   :140.2         
##                                        3rd Qu.:194.0         
##                                        Max.   :813.0         
##                                                              
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  134.5       
##  Median :103.0   Median :   52.0         Median :  374.0       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  218.0         3rd Qu.:  638.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##                                                                
##  actor_2_name       actor_1_facebook_likes     gross          
##  Length:5043        Min.   :     0.0       Min.   :      162  
##  Class :character   1st Qu.:   615.5       1st Qu.:  8460992  
##  Mode  :character   Median :   989.0       Median : 37432299  
##                     Mean   :  6560.0       Mean   : 48468408  
##                     3rd Qu.: 11000.0       3rd Qu.: 51357066  
##                     Max.   :640000.0       Max.   :760505847  
##                                                               
##     genres          actor_1_name       movie_title       
##  Length:5043        Length:5043        Length:5043       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  num_voted_users   cast_total_facebook_likes actor_3_name      
##  Min.   :      5   Min.   :     0            Length:5043       
##  1st Qu.:   8594   1st Qu.:  1411            Class :character  
##  Median :  34359   Median :  3090            Mode  :character  
##  Mean   :  83668   Mean   :  9699                              
##  3rd Qu.:  96309   3rd Qu.: 13756                              
##  Max.   :1689764   Max.   :656730                              
##                                                                
##  facenumber_in_poster plot_keywords      movie_imdb_link   
##  Min.   : 0.000       Length:5043        Length:5043       
##  1st Qu.: 0.000       Class :character   Class :character  
##  Median : 1.000       Mode  :character   Mode  :character  
##  Mean   : 1.371                                            
##  3rd Qu.: 2.000                                            
##  Max.   :43.000                                            
##                                                            
##  num_user_for_reviews   language           country         
##  Min.   :   1.0       Length:5043        Length:5043       
##  1st Qu.:  65.0       Class :character   Class :character  
##  Median : 156.0       Mode  :character   Mode  :character  
##  Mean   : 272.8                                            
##  3rd Qu.: 326.0                                            
##  Max.   :5060.0                                            
##  NA's   :21                                                
##  content_rating         budget            title_year  
##  Length:5043        Min.   :2.180e+02   Min.   :1916  
##  Class :character   1st Qu.:7.000e+06   1st Qu.:1999  
##  Mode  :character   Median :2.300e+07   Median :2005  
##                     Mean   :3.975e+07   Mean   :2003  
##                     3rd Qu.:4.000e+07   3rd Qu.:2011  
##                     Max.   :1.222e+10   Max.   :2016  
##                                                       
##  actor_2_facebook_likes   imdb_score     aspect_ratio  
##  Min.   :     0         Min.   :1.600   Min.   : 1.18  
##  1st Qu.:   281         1st Qu.:5.800   1st Qu.: 1.85  
##  Median :   596         Median :6.600   Median : 2.22  
##  Mean   :  1652         Mean   :6.442   Mean   : 2.22  
##  3rd Qu.:   919         3rd Qu.:7.200   3rd Qu.: 2.35  
##  Max.   :137000         Max.   :9.500   Max.   :16.00  
##                                                        
##  movie_facebook_likes
##  Min.   :     0      
##  1st Qu.:     0      
##  Median :   166      
##  Mean   :  7526      
##  3rd Qu.:  3000      
##  Max.   :349000      
##

Now I have got rid of the NAs in the numeric columns I need (num_user_for_reviews still shows 21, but I do not use it below). I picked the following variables as potential candidates for the IMDB score predictors.

num_critic_for_reviews
duration
director_facebook_likes
actor_1_facebook_likes
gross
cast_total_facebook_likes
facenumber_in_poster
budget
movie_facebook_likes

Select a subset of numeric variables for regression modelling.
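Selecting the candidate columns could look like the sketch below. The toy data frame holds the first three rows from the structure listing plus one character column, to show that it is left out; `candidate_vars`, `movies_head` and `model_data` are illustrative names.

```r
# Candidate predictors listed above; the response is added when subsetting.
candidate_vars <- c("num_critic_for_reviews", "duration", "director_facebook_likes",
                    "actor_1_facebook_likes", "gross", "cast_total_facebook_likes",
                    "facenumber_in_poster", "budget", "movie_facebook_likes")

# Toy stand-in for the full data frame: first three rows from the str() listing.
movies_head <- data.frame(
  color                     = c("Color", "Color", "Color"),
  num_critic_for_reviews    = c(723, 302, 602),
  duration                  = c(178, 169, 148),
  director_facebook_likes   = c(0, 563, 0),
  actor_1_facebook_likes    = c(1000, 40000, 11000),
  gross                     = c(760505847, 309404152, 200074175),
  cast_total_facebook_likes = c(4834, 48350, 11700),
  facenumber_in_poster      = c(0, 0, 1),
  budget                    = c(2.37e8, 3.00e8, 2.45e8),
  movie_facebook_likes      = c(33000, 0, 85000),
  imdb_score                = c(7.9, 7.1, 6.8)
)

# Keep only the response and the candidate predictors.
model_data <- movies_head[, c("imdb_score", candidate_vars)]
str(model_data)
```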

Construct the model

Split data into training and testing.
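A random split can be sketched as follows. The 80/20 ratio is my assumption: it gives 4034 training rows, which matches the regression output below (4024 residual degrees of freedom plus 10 coefficients). The name `train_sample` comes from that output, while `test_sample` and the seed are made up.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Stand-in for the cleaned data: only the row count matters here.
n_rows <- 5043
model_data <- data.frame(row_id = seq_len(n_rows))

# Draw 80% of the row indices at random for training; the rest is the test set.
train_idx    <- sample(n_rows, size = floor(0.8 * n_rows))
train_sample <- model_data[train_idx, , drop = FALSE]
test_sample  <- model_data[-train_idx, , drop = FALSE]

c(train = nrow(train_sample), test = nrow(test_sample))
```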

Fit the model

I am trying out a stepwise selection of variables by backward elimination, so I start with all candidate variables and eliminate one at a time.
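Backward elimination can also be automated. The sketch below drops the predictor with the largest p-value and refits until everything left is below a threshold. It runs on synthetic data, the helper `backward_eliminate` and the 0.05 cut-off are my own choices, and it is not necessarily the exact procedure used for the models that follow.

```r
set.seed(1)  # synthetic data, for illustration only

# Made-up predictors: x1 and x2 drive the response, noise1 does not.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise1 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

# Hypothetical helper: refit after dropping the weakest term until every
# remaining predictor has a p-value at or below alpha.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    coefs <- summary(fit)$coefficients
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    pvals <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
    if (length(pvals) == 0 || max(pvals) <= alpha) return(fit)
    worst <- names(pvals)[which.max(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
}

full_fit  <- lm(y ~ x1 + x2 + noise1, data = toy)
final_fit <- backward_eliminate(full_fit)
formula(final_fit)
```

Base R also ships `step()`, which performs stepwise selection using AIC instead of p-values.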

## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     director_facebook_likes + actor_1_facebook_likes + gross + 
##     cast_total_facebook_likes + facenumber_in_poster + budget + 
##     movie_facebook_likes, data = train_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0881 -0.5841  0.0853  0.7019  3.2965 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.321e+00  7.345e-02  72.446  < 2e-16 ***
## num_critic_for_reviews     1.794e-03  1.974e-04   9.088  < 2e-16 ***
## duration                   8.065e-03  6.762e-04  11.926  < 2e-16 ***
## director_facebook_likes    3.923e-05  5.982e-06   6.559 6.10e-11 ***
## actor_1_facebook_likes     1.385e-05  3.742e-06   3.701 0.000218 ***
## gross                      3.871e-10  2.990e-10   1.295 0.195427    
## cast_total_facebook_likes -1.235e-05  3.167e-06  -3.899 9.82e-05 ***
## facenumber_in_poster      -3.396e-02  8.355e-03  -4.065 4.90e-05 ***
## budget                    -4.776e-11  7.588e-11  -0.629 0.529159    
## movie_facebook_likes       4.644e-06  1.202e-06   3.865 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 4024 degrees of freedom
## Multiple R-squared:  0.1434, Adjusted R-squared:  0.1414 
## F-statistic: 74.82 on 9 and 4024 DF,  p-value: < 2.2e-16

I am going to eliminate the variables that add little value, gross and budget (their p-values are 0.195 and 0.529), one at a time, first gross and then budget, and fit the model again after each removal.

## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     budget + director_facebook_likes + actor_1_facebook_likes + 
##     cast_total_facebook_likes + facenumber_in_poster + movie_facebook_likes, 
##     data = train_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0798 -0.5854  0.0797  0.7018  3.3097 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.321e+00  7.346e-02  72.442  < 2e-16 ***
## num_critic_for_reviews     1.852e-03  1.923e-04   9.629  < 2e-16 ***
## duration                   8.132e-03  6.743e-04  12.060  < 2e-16 ***
## budget                    -4.373e-11  7.583e-11  -0.577 0.564168    
## director_facebook_likes    3.956e-05  5.977e-06   6.618 4.12e-11 ***
## actor_1_facebook_likes     1.301e-05  3.686e-06   3.530 0.000421 ***
## cast_total_facebook_likes -1.153e-05  3.104e-06  -3.715 0.000206 ***
## facenumber_in_poster      -3.431e-02  8.352e-03  -4.108 4.07e-05 ***
## movie_facebook_likes       4.762e-06  1.198e-06   3.974 7.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 4025 degrees of freedom
## Multiple R-squared:  0.143,  Adjusted R-squared:  0.1413 
## F-statistic: 83.95 on 8 and 4025 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     director_facebook_likes + actor_1_facebook_likes + cast_total_facebook_likes + 
##     facenumber_in_poster + movie_facebook_likes, data = train_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0802 -0.5839  0.0793  0.7020  3.3078 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.322e+00  7.344e-02  72.473  < 2e-16 ***
## num_critic_for_reviews     1.842e-03  1.915e-04   9.617  < 2e-16 ***
## duration                   8.119e-03  6.738e-04  12.049  < 2e-16 ***
## director_facebook_likes    3.957e-05  5.976e-06   6.622 4.02e-11 ***
## actor_1_facebook_likes     1.304e-05  3.685e-06   3.539 0.000407 ***
## cast_total_facebook_likes -1.156e-05  3.103e-06  -3.724 0.000199 ***
## facenumber_in_poster      -3.423e-02  8.350e-03  -4.099 4.23e-05 ***
## movie_facebook_likes       4.782e-06  1.198e-06   3.993 6.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 4026 degrees of freedom
## Multiple R-squared:  0.1429, Adjusted R-squared:  0.1414 
## F-statistic: 95.91 on 7 and 4026 DF,  p-value: < 2.2e-16

From the fitted model, I find that the model is significant, since the p-value of the F-test is very small (< 2.2e-16). The "cast_total_facebook_likes" and "facenumber_in_poster" variables have negative coefficients. The model has a multiple R-squared of 0.143, meaning that around 14.3% of the variability in IMDB scores can be explained by this model.
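The adjusted R-squared in the summary can be recovered from the multiple R-squared and the degrees of freedom. This is a small arithmetic check using only figures printed above; the variable names are mine.

```r
# Figures read off the final model summary.
r_squared    <- 0.1429
residual_df  <- 4026   # n - p - 1
n_predictors <- 7
n_obs        <- residual_df + n_predictors + 1   # 4034 training rows

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / residual_df
round(adj_r_squared, 4)   # 0.1414, matching the summary
```

With seven predictors and over four thousand observations the penalty is tiny, which is why the two figures are so close.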

Let me make a few plots of the model I arrived at.

Looking at the IMDB scores of all movies in the dataset, the fit is not perfectly linear; it has a small degree of nonlinearity.

This chart shows how the residuals compare against their theoretical values under the model. I can see I have a bit of a problem here, because some of the observations do not sit neatly on the line.

This chart shows the distribution of residuals around the linear model in relation to the IMDB scores of all movies in my data. The higher the score, the fewer the movies, and most movies are in the lower or median score range.

This chart identifies all extreme values, but I don't see any extreme value that has a huge impact on my model.
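The four descriptions above read like the standard diagnostic panels that base R draws for an `lm` fit (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage); that mapping is my assumption. A minimal sketch on synthetic data, with Cook's distance added as a numeric check on influential points:

```r
set.seed(2)  # synthetic data, for illustration only
toy <- data.frame(x = rnorm(100))
toy$y <- 3 + 0.5 * toy$x + rnorm(100)
fit <- lm(y ~ x, data = toy)

# The four standard lm diagnostic panels in a 2 x 2 grid.
pdf(NULL)                       # null device; drop this line to see the plots
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
invisible(dev.off())

# Cook's distance puts a number on how much each observation moves the fit.
cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)
```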

At this point, I think this model is as good as I can get. Let's evaluate it.

The theoretical model performance is defined here as R-squared.

## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     director_facebook_likes + actor_1_facebook_likes + cast_total_facebook_likes + 
##     facenumber_in_poster + movie_facebook_likes, data = train_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0802 -0.5839  0.0793  0.7020  3.3078 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.322e+00  7.344e-02  72.473  < 2e-16 ***
## num_critic_for_reviews     1.842e-03  1.915e-04   9.617  < 2e-16 ***
## duration                   8.119e-03  6.738e-04  12.049  < 2e-16 ***
## director_facebook_likes    3.957e-05  5.976e-06   6.622 4.02e-11 ***
## actor_1_facebook_likes     1.304e-05  3.685e-06   3.539 0.000407 ***
## cast_total_facebook_likes -1.156e-05  3.103e-06  -3.724 0.000199 ***
## facenumber_in_poster      -3.423e-02  8.350e-03  -4.099 4.23e-05 ***
## movie_facebook_likes       4.782e-06  1.198e-06   3.993 6.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 4026 degrees of freedom
## Multiple R-squared:  0.1429, Adjusted R-squared:  0.1414 
## F-statistic: 95.91 on 7 and 4026 DF,  p-value: < 2.2e-16

Check how good the model is on the training set.

## [1] 0.1444 1.0000 1.0000

The correlation between predicted score and actual score for the training set is 14.44%, which is very close to the theoretical R-squared for the model; this is the good news. However, on average, on the observations the model has already seen, my estimates are off by about 1 point.

Check how good the model is on the test set.

## [1] 0.1521 1.0000 1.0000

This result is not bad: the results from the test set are not far from the results of the training set.
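The three numbers printed for each set are not labelled, so the sketch below is only one plausible way to score a fitted model on training and test data: squared correlation, mean absolute error and root mean squared error. The helper `evaluate_fit` is hypothetical and the data are synthetic.

```r
set.seed(3)  # synthetic data, for illustration only
toy <- data.frame(x = rnorm(500))
toy$y <- 6 + 0.4 * toy$x + rnorm(500)

# 80/20 split, then fit on the training part only.
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
fit   <- lm(y ~ x, data = train)

# Hypothetical helper: squared correlation, mean absolute error and RMSE
# between actual and predicted values.
evaluate_fit <- function(model, data, response) {
  predicted <- predict(model, newdata = data)
  actual    <- data[[response]]
  c(r_squared = cor(actual, predicted)^2,
    mae       = mean(abs(actual - predicted)),
    rmse      = sqrt(mean((actual - predicted)^2)))
}

round(evaluate_fit(fit, train, "y"), 4)
round(evaluate_fit(fit, test, "y"), 4)
```

On the training set the squared correlation equals the model's R-squared, so a test-set value in the same range suggests the model is not overfitting.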

Conclusion

The most important factor affecting the movie score is the duration (it has the largest t value, 12.05): the longer the movie, the higher the score will be.
The number of critic reviews is important: the more reviews a movie receives, the higher the score will be.
The number of faces in the poster has a negative effect on the movie score: the more faces in a movie poster, the lower the score will be.

The End

I hope movies will still feel the same now that I have learned how to analyze movie data. Enjoy the film!