Purpose: Using regression analysis, we want to answer three questions: 1) Among the 27 variables given, which are critical in predicting the IMDB rating of a movie? 2) Is there any correlation between genre and IMDB rating, number of faces in the poster and IMDB rating, director name and IMDB rating, and duration and IMDB rating? 3) Can we predict the IMDB score using our model?

m<- read.csv('movie_metadata.csv')

Step 1: Data Collection

This data set was found on Kaggle. The author scraped 5000+ movies from the IMDB website using a Python library called “scrapy” and obtained all 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. Below are the 28 variables: “movie_title” “color” “num_critic_for_reviews” “movie_facebook_likes” “duration” “director_name” “director_facebook_likes” “actor_3_name” “actor_3_facebook_likes” “actor_2_name” “actor_2_facebook_likes” “actor_1_name” “actor_1_facebook_likes” “gross” “genres” “num_voted_users” “cast_total_facebook_likes” “facenumber_in_poster” “plot_keywords” “movie_imdb_link” “num_user_for_reviews” “language” “country” “content_rating” “budget” “title_year” “imdb_score” “aspect_ratio”

This dataset is a proof of concept. It can be used for experimental and learning purposes. For comprehensive movie analysis and accurate movie rating prediction, 28 attributes from 5000 movies might not be enough. A decent dataset could contain hundreds of attributes from 50K or more movies and would require a great deal of feature engineering.

Step 2: Data cleaning and exploration

Assign the first word of genres as the genre of each movie (genres had already been split into words in Excel):

# remove columns X-X.8
which(colnames(m)=='genres')
[1] 10
which(colnames(m)=='X.8')
[1] 19
m<-m[,-c(11:19)]
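
The Excel step could also be done directly in R. The sketch below assumes the raw `genres` column stores pipe-separated labels (as `plot_keywords` does in this data set); the vector `genres_raw` holds made-up illustration values, not rows taken from the file.

```r
# Hypothetical example values; the real column would be m$genres
genres_raw <- c("Action|Adventure|Fantasy", "Comedy", "Drama|Romance")

# Drop everything from the first "|" onward, keeping only the first genre
first_genre <- sub("\\|.*$", "", genres_raw)

first_genre
```

Doing the split in R keeps the whole cleaning step reproducible and avoids the extra helper columns that then have to be removed by index.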

Only keep movie data for the USA, because the “budget” variable was not converted to US dollars for all movies, which might cause problems in later analysis. If we wanted to convert all budgets into US dollars, we would also have to take inflation into consideration, which would make the problem more complicated. Therefore, for practice purposes, we decided to study only USA movies.

movie.usa<-m[which(m[,'country']=='USA'),]

Double check:

movie.usa$country
   [1] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
  [22] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
  [43] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
  [64] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
  [85] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [106] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [127] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [148] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [169] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [190] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [211] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [232] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [253] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [274] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [295] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [316] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [337] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [358] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [379] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [400] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [421] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [442] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [463] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [484] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [505] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [526] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [547] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [568] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [589] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [610] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [631] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [652] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [673] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [694] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [715] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [736] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [757] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [778] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [799] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [820] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [841] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [862] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [883] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [904] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [925] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [946] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [967] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
 [988] USA USA USA USA USA USA USA USA USA USA USA USA USA
 [ reached getOption("max.print") -- omitted 2807 entries ]
66 Levels:  Afghanistan Argentina Aruba Australia Bahamas Belgium Brazil ... West Germany
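
Printing the whole column works, but a more compact check is possible. The following is a minimal sketch on a toy data frame (`toy` and its values are invented for illustration); the same calls should apply to `movie.usa`.

```r
# Toy data standing in for m; column names mirror the real ones
toy <- data.frame(country = factor(c("USA", "UK", "USA", "France")),
                  budget  = c(1e6, 2e6, 3e6, 4e6))

# Same filter as used for movie.usa
toy.usa <- toy[which(toy$country == "USA"), ]

# TRUE only if every remaining row is a USA movie
all_usa <- all(toy.usa$country == "USA")

# Counts per country after dropping the unused factor levels
country_counts <- table(droplevels(toy.usa$country))

all_usa
country_counts
```

Note that subsetting keeps all original factor levels (hence the "66 Levels" line in the output), which is why `droplevels()` is applied before tabulating.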

Remove ‘language’: after keeping only USA movies, 3779 of the 3807 remaining movies are in English, so the variable is not meaningful for our prediction.

summary(movie.usa$language)
           Aboriginal     Arabic    Aramaic    Bosnian  Cantonese    Chinese      Czech 
        10          0          0          1          1          1          0          0 
    Danish       Dari      Dutch   Dzongkha    English   Filipino     French     German 
         0          1          0          0       3779          1          0          0 
     Greek     Hebrew      Hindi  Hungarian  Icelandic Indonesian    Italian   Japanese 
         0          1          1          0          0          0          0          1 
   Kannada     Kazakh     Korean   Mandarin       Maya  Mongolian       None  Norwegian 
         0          0          0          0          1          0          1          0 
   Panjabi    Persian     Polish Portuguese   Romanian    Russian  Slovenian    Spanish 
         0          0          0          0          0          0          0          7 
   Swahili    Swedish      Tamil     Telugu       Thai       Urdu Vietnamese       Zulu 
         0          0          0          0          0          0          1          0 
movie.usa<-movie.usa[, -which(names(movie.usa)=='language')]

Remove the ‘movie_imdb_link’ column, since it is not useful for our analysis, and store the rest of the data as ‘mm’.

movie.df= data.frame(movie.usa)
mm<-movie.df[, -which(names(movie.df)=='movie_imdb_link')] 
str(mm)
'data.frame':   3807 obs. of  26 variables:
 $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ director_name            : Factor w/ 2399 levels "","\xcc\xe4mile Gaudreault",..: 926 799 379 106 2030 1652 1225 2394 284 799 ...
 $ num_critic_for_reviews   : int  723 302 813 462 392 324 635 673 434 313 ...
 $ duration                 : int  178 169 164 132 156 100 141 183 169 151 ...
 $ director_facebook_likes  : int  0 563 22000 475 0 15 0 0 0 563 ...
 $ actor_3_facebook_likes   : int  855 1000 23000 530 4000 284 19000 2000 903 1000 ...
 $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 534 2549 1228 801 2440 1704 1911 2218 ...
 $ actor_1_facebook_likes   : int  1000 40000 27000 640 24000 799 26000 15000 18000 40000 ...
 $ gross                    : int  760505847 309404152 448130642 73058679 336530303 200807262 458991599 330249062 200069408 423032628 ...
 $ genres                   : Factor w/ 21 levels "Action","Adventure",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ actor_1_name             : Factor w/ 2098 levels "","\xcc\xd2lafur Darri \xcc\xd2lafsson",..: 303 982 1968 441 786 221 337 740 1104 982 ...
 $ movie_title              : Factor w/ 4917 levels "[Rec] 2\xe5\xca",..: 397 2731 3707 1960 3289 3459 398 460 3416 2732 ...
 $ num_voted_users          : int  886204 471220 1144337 212204 383056 294810 462669 371639 240396 522040 ...
 $ cast_total_facebook_likes: int  4834 48350 106759 1873 46055 2036 92000 24450 29991 48486 ...
 $ actor_3_name             : Factor w/ 3522 levels "","\xcc\xd2scar Jaenada",..: 3442 1393 1769 2714 1969 2162 3018 57 1134 1393 ...
 $ facenumber_in_poster     : int  0 0 0 1 0 1 4 0 0 2 ...
 $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 3484 651 4745 29 1142 1564 3312 2188 ...
 $ num_user_for_reviews     : int  3054 1238 2701 738 1902 387 1117 3018 2367 1832 ...
 $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 65 65 65 65 65 65 65 65 ...
 $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 9 10 10 10 10 ...
 $ budget                   : num  2.37e+08 3.00e+08 2.50e+08 2.64e+08 2.58e+08 ...
 $ title_year               : int  2009 2007 2012 2012 2007 2010 2015 2016 2006 2006 ...
 $ actor_2_facebook_likes   : int  936 5000 23000 632 11000 553 21000 4000 10000 5000 ...
 $ imdb_score               : num  7.9 7.1 8.5 6.6 6.2 7.8 7.5 6.9 6.1 7.3 ...
 $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 2.35 ...
 $ movie_facebook_likes     : int  33000 0 164000 24000 0 29000 118000 197000 0 5000 ...

Check for missing values:

library(Amelia)
missmap(mm, main = "Missing values vs observed")

sapply(mm,function(x) sum(is.na(x))) # number of missing values for each variable 
                    color             director_name    num_critic_for_reviews 
                        0                         0                        39 
                 duration   director_facebook_likes    actor_3_facebook_likes 
                        6                        74                        13 
             actor_2_name    actor_1_facebook_likes                     gross 
                        0                         4                       572 
                   genres              actor_1_name               movie_title 
                        0                         0                         0 
          num_voted_users cast_total_facebook_likes              actor_3_name 
                        0                         0                         0 
     facenumber_in_poster             plot_keywords      num_user_for_reviews 
                       12                         0                        13 
                  country            content_rating                    budget 
                        0                         0                       298 
               title_year    actor_2_facebook_likes                imdb_score 
                       74                         7                         0 
             aspect_ratio      movie_facebook_likes 
                      222                         0 

We noticed that there are many missing values for gross (572), budget (298) and aspect ratio (222).
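
Raw counts are easier to judge as a share of all rows; for example, 572 missing gross values out of 3807 rows is about 15%. A small sketch of that calculation, using an invented toy data frame rather than `mm`:

```r
# Toy data with deliberately placed missing values
toy <- data.frame(gross    = c(10, NA, NA, 40),
                  budget   = c(NA, 2, 3, 4),
                  duration = c(90, 100, 110, 120))

# Fraction of missing values per column, largest first
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

# Expressed as percentages
round(100 * na_share, 1)
```

Replacing `toy` with `mm` would give the per-variable missing percentages for the real data.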

Omit missing values:

movie<-na.omit(mm)
sapply(movie,function(x) sum(is.na(x))) # double check for missing values
                    color             director_name    num_critic_for_reviews 
                        0                         0                         0 
                 duration   director_facebook_likes    actor_3_facebook_likes 
                        0                         0                         0 
             actor_2_name    actor_1_facebook_likes                     gross 
                        0                         0                         0 
                   genres              actor_1_name               movie_title 
                        0                         0                         0 
          num_voted_users cast_total_facebook_likes              actor_3_name 
                        0                         0                         0 
     facenumber_in_poster             plot_keywords      num_user_for_reviews 
                        0                         0                         0 
                  country            content_rating                    budget 
                        0                         0                         0 
               title_year    actor_2_facebook_likes                imdb_score 
                        0                         0                         0 
             aspect_ratio      movie_facebook_likes 
                        0                         0 
library(psych)
library(car)

library(RColorBrewer) 
library(corrplot)
library(ggplot2)

Explore title_year predictor:

range(movie$title_year) # check movie title year
[1] 1920 2016
sum(with(movie,title_year=='2009')) # 145
[1] 145
sum(with(movie,title_year=='2014')) # 121
[1] 121

Visualization of title Year vs. Score:

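# The original call, library(scatter), failed because no package of that name is installed;
# ggplot2 (loaded above) is used here instead to draw the scatter plot.
ggplot(movie, aes(x = title_year, y = imdb_score)) +
        geom_point(alpha = 0.3) +
        ggtitle("IMDB Score vs. Title Year")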

There are many outliers for title year. The majority of data points are from around 2000 and later, which makes sense because fewer movies were made in the early years. Another interesting observation is that movies from early years tend to have higher scores.
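
One way to put numbers on the impression that earlier movies score higher is to average the score per decade. This is only a sketch: the data frame `toy` and all of its values are invented, and the `decade` column is a helper introduced here.

```r
# Invented example rows; the real analysis would use movie$title_year and movie$imdb_score
toy <- data.frame(title_year = c(1925, 1954, 1957, 1994, 1999, 2008, 2014),
                  imdb_score = c(8.1, 8.3, 7.9, 7.0, 6.8, 6.5, 6.9))

# Round each year down to the start of its decade
toy$decade <- 10 * (toy$title_year %/% 10)

# Mean score and number of movies per decade
decade_mean <- tapply(toy$imdb_score, toy$decade, mean)
decade_n    <- table(toy$decade)

decade_mean
decade_n
```

Reporting the counts next to the means matters here, because a decade with only a handful of surviving titles can easily show an inflated average.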

Visualization of IMDB Score:

max(movie$imdb_score) # 9.3
[1] 9.3
ggplot(movie, aes(x = imdb_score)) +
        geom_histogram(aes(fill = ..count..), binwidth =0.5) +
        scale_x_continuous(name = "IMDB Score",
                           breaks = seq(0,10),
                           limits=c(1, 10)) +
        ggtitle("Histogram of Movie IMDB Score") +
        scale_fill_gradient("Count", low = "blue", high = "red")

sum(with(movie,imdb_score>=8))
[1] 148
# 148 movies with IMDB score greater or equal to 8.

The distribution of IMDB scores looks approximately normal. The highest score is 9.3 on a 10-point scale. We can consider movies with a score of 8 or higher to be great movies from many perspectives.
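
A histogram gives only a visual impression of normality. As a hedged sketch, a Q-Q plot and a Shapiro-Wilk test could complement it; the vector `scores` below is simulated (its mean and standard deviation are arbitrary assumptions) and merely stands in for `movie$imdb_score`.

```r
set.seed(1)

# Simulated stand-in for movie$imdb_score; mean and sd are made up
scores <- rnorm(500, mean = 6.4, sd = 1)

# Points close to the reference line suggest an approximately normal shape
qqnorm(scores)
qqline(scores)

# Shapiro-Wilk test: a small p-value indicates departure from normality
sw <- shapiro.test(scores)
sw$p.value
```

With a few thousand observations, such a test tends to flag even mild skewness, so the Q-Q plot is usually the more informative of the two.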

Exploring correlation:

pairs.panels(movie[c('director_name','duration','facenumber_in_poster','imdb_score','genres')])

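From the plot, only duration has a notable correlation with IMDB score (r = 0.38). The number of faces in the poster has a weak negative correlation with IMDB score (r = -0.07). Genre has little correlation with the score and, interestingly, director name has none. Note that pairs.panels() converts factors such as director name and genres to their integer codes, so correlations involving them should be read with caution.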

pairs.panels(movie[c('color','actor_1_name','title_year','imdb_score','aspect_ratio','gross')])

Color and title year have a highly positive correlation. Color also has smaller positive correlations with aspect ratio and gross, and title year is positively correlated with aspect ratio. Actor 1 name has a very small positive correlation with gross, suggesting that who plays the lead has little impact on gross. IMDB score has a very small positive correlation with actor 1 name, which suggests that the lead actor alone does not make a movie score higher. Interestingly, IMDB score has a negative correlation with title year (r = -0.14), which means older movies tend to have higher scores; this agrees with our observation from the scatter plot. IMDB score has a small positive correlation with aspect ratio (r = 0.04) and a moderate positive correlation with gross (r = 0.27).
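
Correlations that involve name-like factors (director, actor) deserve a caveat: a Pearson correlation computed on a factor's integer codes depends on the alphabetical order of the labels. The sketch below uses invented directors and scores to show the effect and offers the R-squared of a one-way ANOVA fit as one possible alternative measure.

```r
# Invented data: two high-scoring directors and one low-scoring director
toy <- data.frame(
  director = factor(c("Abel", "Abel", "Baker", "Baker", "Cruz", "Cruz")),
  score    = c(8.0, 8.2, 5.0, 5.2, 8.1, 7.9)
)

# Pearson correlation on the integer codes (1, 1, 2, 2, 3, 3)
r_codes <- cor(as.numeric(toy$director), toy$score)

# Share of score variance explained by director (one-way ANOVA via lm)
fit <- lm(score ~ director, data = toy)
r2  <- summary(fit)$r.squared

r_codes
r2
```

Here the code-based correlation is close to zero even though director explains almost all of the variation in score, so "no correlation" with a name variable should not be read as "no effect".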

Corrplot for all numerical variables:

nums<- sapply(movie,is.numeric) # select numeric columns
movie.num<- movie[,nums]
corrplot(cor(movie.num),method='ellipse') 

Note: corrplot() expects a correlation matrix rather than a data frame, so we use cor() to compute one first.

From the correlation plot, we can tell that: The number of faces in the poster has only weak correlations (|r| ≤ 0.10) with all other variables, several of them negative. Cast total Facebook likes and actor 1 Facebook likes have a very strong positive correlation (r = 0.95). Budget and gross are strongly correlated (r = 0.64), which is not surprising. Interestingly, IMDB score is positively correlated with the number of critic reviews (r = 0.36), meaning movies reviewed by more critics tend to score higher. Duration (r = 0.38) and number of voted users (r = 0.51) also have clear positive correlations with IMDB score.
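
Reading the strongest pairs off a plot is error-prone, so they can also be ranked programmatically. This is a minimal sketch using the built-in `mtcars` data as a stand-in; the object names `cm`, `idx` and `pairs_df` are introduced here for illustration.

```r
# Correlation matrix of a few numeric columns from a built-in data set
cm <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])

# Row/column positions of the upper triangle (each pair once, no diagonal)
idx <- which(upper.tri(cm), arr.ind = TRUE)

# One row per variable pair with its correlation
pairs_df <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                       var2 = colnames(cm)[idx[, "col"]],
                       r    = cm[idx])

# Strongest relationships (by absolute value) first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
pairs_df
```

Applying the same steps to `cor(movie.num)` would list the pairs in order of strength, which makes claims such as "strong" or "weak" easier to check.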

Find the pairs of correlations

corr.test(movie.num,y=NULL,use='pairwise',method='pearson',adjust='holm',alpha=0.05) # x must be numeric
Call:corr.test(x = movie.num, y = NULL, use = "pairwise", method = "pearson", 
    adjust = "holm", alpha = 0.05)
Correlation matrix 
                          num_critic_for_reviews duration director_facebook_likes
num_critic_for_reviews                      1.00     0.26                    0.19
duration                                    0.26     1.00                    0.21
director_facebook_likes                     0.19     0.21                    1.00
actor_3_facebook_likes                      0.28     0.14                    0.12
actor_1_facebook_likes                      0.17     0.09                    0.09
gross                                       0.48     0.28                    0.14
num_voted_users                             0.60     0.37                    0.32
cast_total_facebook_likes                   0.25     0.13                    0.12
facenumber_in_poster                       -0.03     0.01                   -0.05
num_user_for_reviews                        0.57     0.36                    0.24
budget                                      0.49     0.30                    0.09
title_year                                  0.42    -0.11                   -0.06
actor_2_facebook_likes                      0.28     0.15                    0.12
imdb_score                                  0.36     0.38                    0.22
aspect_ratio                                0.18     0.16                    0.05
movie_facebook_likes                        0.71     0.25                    0.17
                          actor_3_facebook_likes actor_1_facebook_likes gross
num_critic_for_reviews                      0.28                   0.17  0.48
duration                                    0.14                   0.09  0.28
director_facebook_likes                     0.12                   0.09  0.14
actor_3_facebook_likes                      1.00                   0.25  0.30
actor_1_facebook_likes                      0.25                   1.00  0.13
gross                                       0.30                   0.13  1.00
num_voted_users                             0.28                   0.17  0.64
cast_total_facebook_likes                   0.48                   0.95  0.22
facenumber_in_poster                        0.10                   0.05 -0.04
num_user_for_reviews                        0.22                   0.12  0.55
budget                                      0.27                   0.15  0.64
title_year                                  0.13                   0.09  0.06
actor_2_facebook_likes                      0.55                   0.38  0.25
imdb_score                                  0.09                   0.12  0.27
aspect_ratio                                0.05                   0.05  0.07
movie_facebook_likes                        0.31                   0.12  0.38
                          num_voted_users cast_total_facebook_likes facenumber_in_poster
num_critic_for_reviews               0.60                      0.25                -0.03
duration                             0.37                      0.13                 0.01
director_facebook_likes              0.32                      0.12                -0.05
actor_3_facebook_likes               0.28                      0.48                 0.10
actor_1_facebook_likes               0.17                      0.95                 0.05
gross                                0.64                      0.22                -0.04
num_voted_users                      1.00                      0.25                -0.04
cast_total_facebook_likes            0.25                      1.00                 0.07
facenumber_in_poster                -0.04                      0.07                 1.00
num_user_for_reviews                 0.78                      0.18                -0.09
budget                               0.40                      0.23                -0.03
title_year                           0.03                      0.13                 0.08
actor_2_facebook_likes               0.25                      0.63                 0.07
imdb_score                           0.51                      0.14                -0.07
aspect_ratio                         0.09                      0.07                 0.01
movie_facebook_likes                 0.52                      0.21                 0.01
                          num_user_for_reviews budget title_year actor_2_facebook_likes
num_critic_for_reviews                    0.57   0.49       0.42                   0.28
duration                                  0.36   0.30      -0.11                   0.15
director_facebook_likes                   0.24   0.09      -0.06                   0.12
actor_3_facebook_likes                    0.22   0.27       0.13                   0.55
actor_1_facebook_likes                    0.12   0.15       0.09                   0.38
gross                                     0.55   0.64       0.06                   0.25
num_voted_users                           0.78   0.40       0.03                   0.25
cast_total_facebook_likes                 0.18   0.23       0.13                   0.63
facenumber_in_poster                     -0.09  -0.03       0.08                   0.07
num_user_for_reviews                      1.00   0.40       0.03                   0.20
budget                                    0.40   1.00       0.25                   0.25
title_year                                0.03   0.25       1.00                   0.13
actor_2_facebook_likes                    0.20   0.25       0.13                   1.00
imdb_score                                0.35   0.07      -0.14                   0.13
aspect_ratio                              0.10   0.18       0.22                   0.07
movie_facebook_likes                      0.39   0.33       0.31                   0.25
                          imdb_score aspect_ratio movie_facebook_likes
num_critic_for_reviews          0.36         0.18                 0.71
duration                        0.38         0.16                 0.25
director_facebook_likes         0.22         0.05                 0.17
actor_3_facebook_likes          0.09         0.05                 0.31
actor_1_facebook_likes          0.12         0.05                 0.12
gross                           0.27         0.07                 0.38
num_voted_users                 0.51         0.09                 0.52
cast_total_facebook_likes       0.14         0.07                 0.21
facenumber_in_poster           -0.07         0.01                 0.01
num_user_for_reviews            0.35         0.10                 0.39
budget                          0.07         0.18                 0.33
title_year                     -0.14         0.22                 0.31
actor_2_facebook_likes          0.13         0.07                 0.25
imdb_score                      1.00         0.04                 0.29
aspect_ratio                    0.04         1.00                 0.11
movie_facebook_likes            0.29         0.11                 1.00
Sample Size 
[1] 3005
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
                          num_critic_for_reviews duration director_facebook_likes
num_critic_for_reviews                      0.00     0.00                    0.00
duration                                    0.00     0.00                    0.00
director_facebook_likes                     0.00     0.00                    0.00
actor_3_facebook_likes                      0.00     0.00                    0.00
actor_1_facebook_likes                      0.00     0.00                    0.00
gross                                       0.00     0.00                    0.00
num_voted_users                             0.00     0.00                    0.00
cast_total_facebook_likes                   0.00     0.00                    0.00
facenumber_in_poster                        0.09     0.66                    0.00
num_user_for_reviews                        0.00     0.00                    0.00
budget                                      0.00     0.00                    0.00
title_year                                  0.00     0.00                    0.00
actor_2_facebook_likes                      0.00     0.00                    0.00
imdb_score                                  0.00     0.00                    0.00
aspect_ratio                                0.00     0.00                    0.01
movie_facebook_likes                        0.00     0.00                    0.00
                          actor_3_facebook_likes actor_1_facebook_likes gross
num_critic_for_reviews                      0.00                   0.00  0.00
duration                                    0.00                   0.00  0.00
director_facebook_likes                     0.00                   0.00  0.00
actor_3_facebook_likes                      0.00                   0.00  0.00
actor_1_facebook_likes                      0.00                   0.00  0.00
gross                                       0.00                   0.00  0.00
num_voted_users                             0.00                   0.00  0.00
cast_total_facebook_likes                   0.00                   0.00  0.00
facenumber_in_poster                        0.00                   0.01  0.05
num_user_for_reviews                        0.00                   0.00  0.00
budget                                      0.00                   0.00  0.00
title_year                                  0.00                   0.00  0.00
actor_2_facebook_likes                      0.00                   0.00  0.00
imdb_score                                  0.00                   0.00  0.00
aspect_ratio                                0.01                   0.00  0.00
movie_facebook_likes                        0.00                   0.00  0.00
                          num_voted_users cast_total_facebook_likes facenumber_in_poster
num_critic_for_reviews               0.00                         0                 0.65
duration                             0.00                         0                 1.00
director_facebook_likes              0.00                         0                 0.06
actor_3_facebook_likes               0.00                         0                 0.00
actor_1_facebook_likes               0.00                         0                 0.07
gross                                0.00                         0                 0.37
num_voted_users                      0.00                         0                 0.17
cast_total_facebook_likes            0.00                         0                 0.00
facenumber_in_poster                 0.02                         0                 0.00
num_user_for_reviews                 0.00                         0                 0.00
budget                               0.00                         0                 0.14
title_year                           0.10                         0                 0.00
actor_2_facebook_likes               0.00                         0                 0.00
imdb_score                           0.00                         0                 0.00
aspect_ratio                         0.00                         0                 0.55
movie_facebook_likes                 0.00                         0                 0.50
                          num_user_for_reviews budget title_year actor_2_facebook_likes
num_critic_for_reviews                    0.00   0.00       0.00                   0.00
duration                                  0.00   0.00       0.00                   0.00
director_facebook_likes                   0.00   0.00       0.04                   0.00
actor_3_facebook_likes                    0.00   0.00       0.00                   0.00
actor_1_facebook_likes                    0.00   0.00       0.00                   0.00
gross                                     0.00   0.00       0.04                   0.00
num_voted_users                           0.00   0.00       0.65                   0.00
cast_total_facebook_likes                 0.00   0.00       0.00                   0.00
facenumber_in_poster                      0.00   0.65       0.00                   0.01
num_user_for_reviews                      0.00   0.00       0.65                   0.00
budget                                    0.00   0.00       0.00                   0.00
title_year                                0.12   0.00       0.00                   0.00
actor_2_facebook_likes                    0.00   0.00       0.00                   0.00
imdb_score                                0.00   0.00       0.00                   0.00
aspect_ratio                              0.00   0.00       0.00                   0.00
movie_facebook_likes                      0.00   0.00       0.00                   0.00
                          imdb_score aspect_ratio movie_facebook_likes
num_critic_for_reviews          0.00         0.00                    0
duration                        0.00         0.00                    0
director_facebook_likes         0.00         0.10                    0
actor_3_facebook_likes          0.00         0.07                    0
actor_1_facebook_likes          0.00         0.05                    0
gross                           0.00         0.00                    0
num_voted_users                 0.00         0.00                    0
cast_total_facebook_likes       0.00         0.00                    0
facenumber_in_poster            0.00         1.00                    1
num_user_for_reviews            0.00         0.00                    0
budget                          0.00         0.00                    0
title_year                      0.00         0.00                    0
actor_2_facebook_likes          0.00         0.00                    0
imdb_score                      0.00         0.34                    0
aspect_ratio                    0.04         0.00                    0
movie_facebook_likes            0.00         0.00                    0

 To see confidence intervals of the correlations, print with the short=FALSE option
# Boxplots for significant categorical predictors
Boxplot(movie$imdb_score,movie$color)

Black-and-white movies seem to have a higher median rating and slightly higher scores overall. Color movies have many outliers.

Boxplot for genre:

library(ggplot2)
fill <- "Blue"
line <- "Red"
ggplot(movie, aes(x = genres, y =imdb_score)) +
        geom_boxplot(fill = fill, colour = line) +
        scale_y_continuous(name = "IMDB Score",
                           breaks = seq(0, 11, 0.5),
                           limits=c(0, 11)) +
        scale_x_discrete(name = "Genres") +
        ggtitle("Boxplot of IMDB Score and Genres")

From the boxplot of genres, “Documentary” has the highest median score, and Thriller movies have the lowest median. However, the Thriller result rests on only 1 observation in our data set.

summary(movie$genres)
     Action   Adventure   Animation   Biography      Comedy       Crime Documentary 
        751         291          36         137         853         204          25 
      Drama      Family     Fantasy   Film-Noir   Game-Show     History      Horror 
        506           3          31           0           0           0         138 
      Music     Musical     Mystery     Romance      Sci-Fi    Thriller     Western 
          0           2          16           2           7           1           2 
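
Because several genres contain only a handful of movies, it can help to read each median next to its group size. The following is a minimal sketch on a made-up toy data frame; the `toy` values and the `min_n` cutoff are assumptions for illustration, not taken from the movie data. It drops empty factor levels with droplevels(), tabulates count and median per genre, and flags small groups.

```r
# Toy data (hypothetical values, not from movie_metadata.csv)
toy <- data.frame(
  genres = factor(c("Action", "Action", "Action", "Drama", "Drama", "Thriller"),
                  levels = c("Action", "Drama", "Film-Noir", "Thriller")),
  imdb_score = c(6.1, 6.5, 5.9, 7.2, 6.8, 4.8)
)

# Remove levels with zero observations (here: Film-Noir)
toy$genres <- droplevels(toy$genres)

# Count and median score per genre, in factor-level order
genre_summary <- data.frame(
  n      = as.vector(table(toy$genres)),
  median = as.vector(tapply(toy$imdb_score, toy$genres, median)),
  row.names = levels(toy$genres)
)

# Flag genres whose median rests on very few movies (cutoff is an assumption)
min_n <- 2
genre_summary$small_group <- genre_summary$n < min_n
print(genre_summary)
```

Applied to the real data, the same pattern would show at a glance that the Thriller median comes from a single movie.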

Boxplot for “title_year”:

library(ggplot2)
fill <- "Blue"
line <- "Red"
ggplot(movie, aes(x = as.factor(title_year), y =imdb_score)) +
        geom_boxplot(fill = fill, colour = line) +
        scale_y_continuous(name = "IMDB Score",
                           breaks = seq(1.5, 10, 0.5),
                           limits=c(1.5, 10)) +
        scale_x_discrete(name = "title_year") +
        ggtitle("Boxplot of IMDB Score and Title Year")

The median IMDB score appears to differ from year to year, so let’s try treating title_year as categorical.
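
Treating a year variable as categorical amounts to wrapping it in factor(), which replaces the single slope with one coefficient per year. Below is a small sketch on simulated toy data; the data frame `toy`, its values, and the object names are assumptions for illustration only. It compares the two codings and tests whether the extra coefficients are worthwhile.

```r
set.seed(1)  # toy data only; values are simulated, not from the movie data
toy <- data.frame(title_year = rep(2000:2004, each = 20))
toy$imdb_score <- 6 + rnorm(nrow(toy), sd = 0.5)

fit_numeric <- lm(imdb_score ~ title_year, data = toy)          # one slope
fit_factor  <- lm(imdb_score ~ factor(title_year), data = toy)  # one level per year

length(coef(fit_numeric))  # intercept + slope
length(coef(fit_factor))   # intercept + (number of years - 1) dummy variables

# The numeric model is nested in the factor model, so an F-test can compare them
anova(fit_numeric, fit_factor)
```

With many distinct years, the factor coding uses many degrees of freedom, which is the trade-off to weigh against the flexibility it adds.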

# Scatterplot matrix for the numerical variables with significant correlations
scatterplotMatrix(~ imdb_score + num_voted_users + num_critic_for_reviews +
                    num_user_for_reviews + duration + facenumber_in_poster +
                    gross + movie_facebook_likes + director_facebook_likes +
                    cast_total_facebook_likes + budget,
                  data = movie)

Step 3: Fitting regression model

movie.sig<-movie[,c('imdb_score','num_voted_users','num_critic_for_reviews','num_user_for_reviews','duration','facenumber_in_poster','gross','movie_facebook_likes','director_facebook_likes','cast_total_facebook_likes','budget','title_year','genres')]

Use the step function to select variables by the AIC criterion:

null=lm(movie.sig$imdb_score~1) # set null model
summary(null)

Call:
lm(formula = movie.sig$imdb_score ~ 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7873 -0.5873  0.1127  0.7127  2.9127 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.3873     0.0192   332.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.053 on 3004 degrees of freedom
  1. The full model is a linear additive model:
full1=lm(movie.sig$imdb_score~movie.sig$num_voted_users+movie.sig$num_critic_for_reviews+movie.sig$num_user_for_reviews+movie.sig$duration+movie.sig$facenumber_in_poster+movie.sig$gross+movie.sig$movie_facebook_likes+movie.sig$director_facebook_likes+movie.sig$cast_total_facebook_likes+movie.sig$budget+movie.sig$title_year+factor(movie.sig$genres))
summary(full1)

Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users + 
    movie.sig$num_critic_for_reviews + movie.sig$num_user_for_reviews + 
    movie.sig$duration + movie.sig$facenumber_in_poster + movie.sig$gross + 
    movie.sig$movie_facebook_likes + movie.sig$director_facebook_likes + 
    movie.sig$cast_total_facebook_likes + movie.sig$budget + 
    movie.sig$title_year + factor(movie.sig$genres))

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9157 -0.3693  0.0835  0.4993  2.0350 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          5.413e+01  3.604e+00  15.019  < 2e-16 ***
movie.sig$num_voted_users            3.158e-06  1.757e-07  17.969  < 2e-16 ***
movie.sig$num_critic_for_reviews     3.333e-03  2.119e-04  15.727  < 2e-16 ***
movie.sig$num_user_for_reviews      -4.887e-04  5.976e-05  -8.177 4.26e-16 ***
movie.sig$duration                   8.491e-03  7.848e-04  10.820  < 2e-16 ***
movie.sig$facenumber_in_poster      -1.750e-02  6.947e-03  -2.519  0.01182 *  
movie.sig$gross                      2.247e-10  3.096e-10   0.726  0.46808    
movie.sig$movie_facebook_likes      -4.007e-06  9.702e-07  -4.131 3.72e-05 ***
movie.sig$director_facebook_likes    2.832e-07  4.562e-06   0.062  0.95051    
movie.sig$cast_total_facebook_likes  1.110e-06  7.323e-07   1.516  0.12975    
movie.sig$budget                    -4.486e-09  5.125e-10  -8.753  < 2e-16 ***
movie.sig$title_year                -2.467e-02  1.797e-03 -13.727  < 2e-16 ***
factor(movie.sig$genres)Adventure    3.458e-01  5.448e-02   6.347 2.53e-10 ***
factor(movie.sig$genres)Animation    6.621e-01  1.345e-01   4.924 8.93e-07 ***
factor(movie.sig$genres)Biography    6.557e-01  7.661e-02   8.558  < 2e-16 ***
factor(movie.sig$genres)Comedy       1.532e-01  4.361e-02   3.513  0.00045 ***
factor(movie.sig$genres)Crime        4.551e-01  6.464e-02   7.040 2.37e-12 ***
factor(movie.sig$genres)Documentary  9.270e-01  1.608e-01   5.765 8.98e-09 ***
factor(movie.sig$genres)Drama        5.326e-01  4.904e-02  10.861  < 2e-16 ***
factor(movie.sig$genres)Family       2.201e-01  4.521e-01   0.487  0.62639    
factor(movie.sig$genres)Fantasy     -1.629e-01  1.448e-01  -1.125  0.26068    
factor(movie.sig$genres)Horror      -3.858e-01  7.777e-02  -4.961 7.41e-07 ***
factor(movie.sig$genres)Musical     -4.133e-01  5.573e-01  -0.742  0.45839    
factor(movie.sig$genres)Mystery      1.968e-01  1.979e-01   0.995  0.32005    
factor(movie.sig$genres)Romance      5.466e-01  5.506e-01   0.993  0.32095    
factor(movie.sig$genres)Sci-Fi       2.551e-01  2.960e-01   0.862  0.38870    
factor(movie.sig$genres)Thriller    -4.301e-01  7.786e-01  -0.552  0.58077    
factor(movie.sig$genres)Western     -1.037e-01  5.521e-01  -0.188  0.85101    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7768 on 2977 degrees of freedom
Multiple R-squared:  0.4604,    Adjusted R-squared:  0.4555 
F-statistic: 94.07 on 27 and 2977 DF,  p-value: < 2.2e-16
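
One practical consequence of writing every term as movie.sig$...: predict() cannot substitute new values for such terms, which matters for the goal of predicting IMDB scores. Here is a minimal sketch of the difference, using R's built-in mtcars data as a stand-in; the variables mpg and wt and all object names are illustrative assumptions, not part of this analysis.

```r
# Built-in stand-in data; mpg and wt play the roles of response and predictor
fit_dollar <- lm(mtcars$mpg ~ mtcars$wt)    # formula style used above
fit_data   <- lm(mpg ~ wt, data = mtcars)   # same model, specified via data =

# Both fits give the same coefficient estimates
all.equal(unname(coef(fit_dollar)), unname(coef(fit_data)))

# Only the data = version can predict for a new observation
new_car <- data.frame(wt = 3)
pred_data   <- predict(fit_data, newdata = new_car)
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_car))

length(pred_data)    # one prediction for the one new row
length(pred_dollar)  # newdata is not used; one value per row of mtcars
```

Refitting the selected model with a plain formula and data = movie.sig would leave the estimates unchanged while making predict() usable on new movies.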
step(null,scope = list(lower=null,upper=full1),direction = 'forward')
Start:  AIC=309.81
movie.sig$imdb_score ~ 1

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$num_voted_users            1    871.90 2457.2 -600.74
+ movie.sig$duration                   1    491.13 2838.0 -167.82
+ movie.sig$num_critic_for_reviews     1    428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews       1    407.62 2921.5  -80.68
+ factor(movie.sig$genres)            16    331.02 2998.1   27.10
+ movie.sig$movie_facebook_likes       1    282.82 3046.3   45.02
+ movie.sig$gross                      1    242.62 3086.5   84.42
+ movie.sig$director_facebook_likes    1    166.17 3163.0  157.95
+ movie.sig$title_year                 1     69.27 3259.9  248.63
+ movie.sig$cast_total_facebook_likes  1     64.28 3264.8  253.22
+ movie.sig$budget                     1     16.26 3312.9  297.09
+ movie.sig$facenumber_in_poster       1     15.14 3314.0  298.11
<none>                                             3329.1  309.81

Step:  AIC=-600.74
movie.sig$imdb_score ~ movie.sig$num_voted_users

                                      Df Sum of Sq    RSS     AIC
+ factor(movie.sig$genres)            16   311.531 2145.7 -976.12
+ movie.sig$duration                   1   147.786 2309.4 -785.13
+ movie.sig$title_year                 1    84.649 2372.6 -704.08
+ movie.sig$budget                     1    73.211 2384.0 -689.63
+ movie.sig$num_user_for_reviews       1    21.297 2435.9 -624.90
+ movie.sig$gross                      1    16.929 2440.3 -619.51
+ movie.sig$num_critic_for_reviews     1    14.632 2442.6 -616.69
+ movie.sig$director_facebook_likes    1    13.657 2443.6 -615.49
+ movie.sig$facenumber_in_poster       1     6.789 2450.4 -607.05
+ movie.sig$movie_facebook_likes       1     2.627 2454.6 -601.95
<none>                                             2457.2 -600.74
+ movie.sig$cast_total_facebook_likes  1     0.524 2456.7 -599.38

Step:  AIC=-976.12
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres)

                                      Df Sum of Sq    RSS      AIC
+ movie.sig$title_year                 1    79.623 2066.1 -1087.75
+ movie.sig$duration                   1    74.584 2071.1 -1080.44
+ movie.sig$budget                     1    28.689 2117.0 -1014.57
+ movie.sig$num_critic_for_reviews     1    23.116 2122.6 -1006.67
+ movie.sig$num_user_for_reviews       1    12.251 2133.4  -991.33
+ movie.sig$director_facebook_likes    1     3.707 2142.0  -979.32
+ movie.sig$facenumber_in_poster       1     3.274 2142.4  -978.71
+ movie.sig$movie_facebook_likes       1     1.686 2144.0  -976.49
<none>                                             2145.7  -976.12
+ movie.sig$gross                      1     1.391 2144.3  -976.07
+ movie.sig$cast_total_facebook_likes  1     0.362 2145.3  -974.63

Step:  AIC=-1087.75
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$num_critic_for_reviews     1   125.091 1941.0 -1273.4
+ movie.sig$duration                   1    55.857 2010.2 -1168.1
+ movie.sig$movie_facebook_likes       1    21.746 2044.3 -1117.5
+ movie.sig$num_user_for_reviews       1    11.741 2054.3 -1102.9
+ movie.sig$budget                     1     9.196 2056.9 -1099.2
+ movie.sig$cast_total_facebook_likes  1     2.923 2063.2 -1090.0
+ movie.sig$director_facebook_likes    1     1.740 2064.3 -1088.3
<none>                                             2066.1 -1087.8
+ movie.sig$facenumber_in_poster       1     1.084 2065.0 -1087.3
+ movie.sig$gross                      1     0.638 2065.4 -1086.7

Step:  AIC=-1273.43
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$budget                     1    36.627 1904.4 -1328.7
+ movie.sig$num_user_for_reviews       1    35.326 1905.7 -1326.6
+ movie.sig$duration                   1    34.873 1906.1 -1325.9
+ movie.sig$gross                      1     7.359 1933.6 -1282.8
+ movie.sig$movie_facebook_likes       1     1.397 1939.6 -1273.6
<none>                                             1941.0 -1273.4
+ movie.sig$facenumber_in_poster       1     0.926 1940.1 -1272.9
+ movie.sig$director_facebook_likes    1     0.644 1940.3 -1272.4
+ movie.sig$cast_total_facebook_likes  1     0.572 1940.4 -1272.3

Step:  AIC=-1328.68
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$duration                   1    58.373 1846.0 -1420.2
+ movie.sig$num_user_for_reviews       1    27.052 1877.3 -1369.7
+ movie.sig$movie_facebook_likes       1     2.576 1901.8 -1330.8
+ movie.sig$cast_total_facebook_likes  1     2.005 1902.3 -1329.8
<none>                                             1904.4 -1328.7
+ movie.sig$facenumber_in_poster       1     1.071 1903.3 -1328.4
+ movie.sig$director_facebook_likes    1     0.557 1903.8 -1327.6
+ movie.sig$gross                      1     0.074 1904.3 -1326.8

Step:  AIC=-1420.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$num_user_for_reviews       1    33.825 1812.2 -1473.8
+ movie.sig$movie_facebook_likes       1     4.702 1841.3 -1425.9
+ movie.sig$facenumber_in_poster       1     2.488 1843.5 -1422.3
+ movie.sig$cast_total_facebook_likes  1     1.601 1844.4 -1420.8
<none>                                             1846.0 -1420.2
+ movie.sig$gross                      1     0.196 1845.8 -1418.5
+ movie.sig$director_facebook_likes    1     0.043 1845.9 -1418.3

Step:  AIC=-1473.81
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$movie_facebook_likes       1   10.4792 1801.7 -1489.2
+ movie.sig$facenumber_in_poster       1    3.7911 1808.4 -1478.1
<none>                                             1812.2 -1473.8
+ movie.sig$cast_total_facebook_likes  1    0.9926 1811.2 -1473.5
+ movie.sig$gross                      1    0.3569 1811.8 -1472.4
+ movie.sig$director_facebook_likes    1    0.0128 1812.2 -1471.8

Step:  AIC=-1489.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$movie_facebook_likes

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$facenumber_in_poster       1    3.5218 1798.2 -1493.1
<none>                                             1801.7 -1489.2
+ movie.sig$cast_total_facebook_likes  1    1.0918 1800.6 -1489.0
+ movie.sig$gross                      1    0.3413 1801.3 -1487.8
+ movie.sig$director_facebook_likes    1    0.0167 1801.7 -1487.3

Step:  AIC=-1493.11
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$cast_total_facebook_likes  1   1.41883 1796.7 -1493.5
<none>                                             1798.2 -1493.1
+ movie.sig$gross                      1   0.33944 1797.8 -1491.7
+ movie.sig$director_facebook_likes    1   0.00320 1798.2 -1491.1

Step:  AIC=-1493.48
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster + 
    movie.sig$cast_total_facebook_likes

                                    Df Sum of Sq    RSS     AIC
<none>                                           1796.7 -1493.5
+ movie.sig$gross                    1   0.31546 1796.4 -1492.0
+ movie.sig$director_facebook_likes  1   0.00000 1796.7 -1491.5

Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users + 
    factor(movie.sig$genres) + movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster + 
    movie.sig$cast_total_facebook_likes)

Coefficients:
                        (Intercept)            movie.sig$num_voted_users  
                          5.446e+01                            3.203e-06  
  factor(movie.sig$genres)Adventure    factor(movie.sig$genres)Animation  
                          3.495e-01                            6.687e-01  
  factor(movie.sig$genres)Biography       factor(movie.sig$genres)Comedy  
                          6.564e-01                            1.558e-01  
      factor(movie.sig$genres)Crime  factor(movie.sig$genres)Documentary  
                          4.522e-01                            9.302e-01  
      factor(movie.sig$genres)Drama       factor(movie.sig$genres)Family  
                          5.326e-01                            2.466e-01  
    factor(movie.sig$genres)Fantasy       factor(movie.sig$genres)Horror  
                         -1.616e-01                           -3.839e-01  
    factor(movie.sig$genres)Musical      factor(movie.sig$genres)Mystery  
                         -4.044e-01                            1.950e-01  
    factor(movie.sig$genres)Romance       factor(movie.sig$genres)Sci-Fi  
                          5.455e-01                            2.483e-01  
   factor(movie.sig$genres)Thriller      factor(movie.sig$genres)Western  
                         -4.271e-01                           -9.845e-02  
               movie.sig$title_year     movie.sig$num_critic_for_reviews  
                         -2.483e-02                            3.339e-03  
                   movie.sig$budget                   movie.sig$duration  
                         -4.311e-09                            8.481e-03  
     movie.sig$num_user_for_reviews       movie.sig$movie_facebook_likes  
                         -4.876e-04                           -4.010e-06  
     movie.sig$facenumber_in_poster  movie.sig$cast_total_facebook_likes  
                         -1.753e-02                            1.121e-06  
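
The AIC values printed by step() can be reproduced from the RSS column. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. A short sketch using the rounded RSS values from the first two steps above follows; the helper name `step_aic` is hypothetical, and n = 3005 is inferred from the 3004 residual degrees of freedom of the null model.

```r
# AIC as reported by step() for lm fits: n * log(RSS / n) + 2 * k,
# where k counts the estimated coefficients (including the intercept)
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 3005  # 3004 residual degrees of freedom in the null model + 1

aic_null  <- step_aic(rss = 3329.1, n = n, k = 1)  # intercept only
aic_votes <- step_aic(rss = 2457.2, n = n, k = 2)  # + num_voted_users

# Compare with the printed values 309.81 and -600.74; small differences
# are expected because the RSS column is rounded to one decimal place
round(c(null = aic_null, votes = aic_votes), 2)
```

This also explains why a term with many degrees of freedom, such as the 16-level genre factor, needs a larger drop in RSS to be selected: each extra coefficient adds 2 to the criterion.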
  2. The full model is a polynomial regression model with interaction terms:
full2=lm(movie.sig$imdb_score~poly(movie.sig$num_voted_users,2)+poly(movie.sig$num_critic_for_reviews,2)+poly(movie.sig$num_user_for_reviews,2)+poly(movie.sig$duration,2)+movie.sig$facenumber_in_poster+poly(movie.sig$gross,2)+poly(movie.sig$movie_facebook_likes,2)+movie.sig$director_facebook_likes+movie.sig$cast_total_facebook_likes+movie.sig$budget+movie.sig$title_year+movie.sig$genres+movie.sig$facenumber_in_poster*movie.sig$num_critic_for_reviews+movie.sig$num_user_for_reviews*movie.sig$num_voted_users+movie.sig$num_voted_users*movie.sig$gross+movie.sig$gross*movie.sig$budget)
summary(full2)

Call:
lm(formula = movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 
    2) + poly(movie.sig$num_critic_for_reviews, 2) + poly(movie.sig$num_user_for_reviews, 
    2) + poly(movie.sig$duration, 2) + movie.sig$facenumber_in_poster + 
    poly(movie.sig$gross, 2) + poly(movie.sig$movie_facebook_likes, 
    2) + movie.sig$director_facebook_likes + movie.sig$cast_total_facebook_likes + 
    movie.sig$budget + movie.sig$title_year + movie.sig$genres + 
    movie.sig$facenumber_in_poster * movie.sig$num_critic_for_reviews + 
    movie.sig$num_user_for_reviews * movie.sig$num_voted_users + 
    movie.sig$num_voted_users * movie.sig$gross + movie.sig$gross * 
    movie.sig$budget)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3608 -0.3549  0.0642  0.4619  2.1792 

Coefficients: (4 not defined because of singularities)
                                                                  Estimate Std. Error
(Intercept)                                                      5.948e+01  3.617e+00
poly(movie.sig$num_voted_users, 2)1                              2.305e+01  3.426e+00
poly(movie.sig$num_voted_users, 2)2                             -1.873e+01  2.200e+00
poly(movie.sig$num_critic_for_reviews, 2)1                       1.393e+01  1.661e+00
poly(movie.sig$num_critic_for_reviews, 2)2                      -9.490e+00  1.004e+00
poly(movie.sig$num_user_for_reviews, 2)1                        -1.760e+01  2.325e+00
poly(movie.sig$num_user_for_reviews, 2)2                         4.166e+00  1.593e+00
poly(movie.sig$duration, 2)1                                     1.087e+01  9.246e-01
poly(movie.sig$duration, 2)2                                    -3.883e+00  7.809e-01
movie.sig$facenumber_in_poster                                  -2.093e-02  1.106e-02
poly(movie.sig$gross, 2)1                                       -1.454e+01  2.418e+00
poly(movie.sig$gross, 2)2                                       -5.285e+00  1.483e+00
poly(movie.sig$movie_facebook_likes, 2)1                         2.580e+00  1.322e+00
poly(movie.sig$movie_facebook_likes, 2)2                         2.283e-01  8.238e-01
movie.sig$director_facebook_likes                                4.608e-06  4.409e-06
movie.sig$cast_total_facebook_likes                              2.533e-07  7.013e-07
movie.sig$budget                                                -7.852e-09  7.213e-10
movie.sig$title_year                                            -2.656e-02  1.809e-03
movie.sig$genresAdventure                                        3.727e-01  5.268e-02
movie.sig$genresAnimation                                        7.564e-01  1.298e-01
movie.sig$genresBiography                                        6.264e-01  7.351e-02
movie.sig$genresComedy                                           1.576e-01  4.205e-02
movie.sig$genresCrime                                            4.558e-01  6.236e-02
movie.sig$genresDocumentary                                      9.738e-01  1.542e-01
movie.sig$genresDrama                                            5.230e-01  4.726e-02
movie.sig$genresFamily                                           5.958e-01  4.362e-01
movie.sig$genresFantasy                                         -1.891e-01  1.387e-01
movie.sig$genresHorror                                          -3.533e-01  7.597e-02
movie.sig$genresMusical                                         -4.744e-01  5.328e-01
movie.sig$genresMystery                                          1.947e-01  1.891e-01
movie.sig$genresRomance                                          6.094e-01  5.254e-01
movie.sig$genresSci-Fi                                           1.471e-01  2.827e-01
movie.sig$genresThriller                                        -3.085e-01  7.433e-01
movie.sig$genresWestern                                         -4.204e-02  5.272e-01
movie.sig$num_critic_for_reviews                                        NA         NA
movie.sig$num_user_for_reviews                                          NA         NA
movie.sig$num_voted_users                                               NA         NA
movie.sig$gross                                                         NA         NA
movie.sig$facenumber_in_poster:movie.sig$num_critic_for_reviews -1.291e-06  4.260e-05
movie.sig$num_user_for_reviews:movie.sig$num_voted_users         7.966e-10  2.817e-10
movie.sig$num_voted_users:movie.sig$gross                        1.498e-15  1.105e-15
movie.sig$budget:movie.sig$gross                                 2.946e-17  4.104e-18
                                                                t value Pr(>|t|)    
(Intercept)                                                      16.446  < 2e-16 ***
poly(movie.sig$num_voted_users, 2)1                               6.727 2.07e-11 ***
poly(movie.sig$num_voted_users, 2)2                              -8.514  < 2e-16 ***
poly(movie.sig$num_critic_for_reviews, 2)1                        8.388  < 2e-16 ***
poly(movie.sig$num_critic_for_reviews, 2)2                       -9.452  < 2e-16 ***
poly(movie.sig$num_user_for_reviews, 2)1                         -7.568 5.01e-14 ***
poly(movie.sig$num_user_for_reviews, 2)2                          2.615 0.008973 ** 
poly(movie.sig$duration, 2)1                                     11.755  < 2e-16 ***
poly(movie.sig$duration, 2)2                                     -4.973 6.98e-07 ***
movie.sig$facenumber_in_poster                                   -1.892 0.058589 .  
poly(movie.sig$gross, 2)1                                        -6.012 2.05e-09 ***
poly(movie.sig$gross, 2)2                                        -3.565 0.000370 ***
poly(movie.sig$movie_facebook_likes, 2)1                          1.952 0.051079 .  
poly(movie.sig$movie_facebook_likes, 2)2                          0.277 0.781673    
movie.sig$director_facebook_likes                                 1.045 0.296005    
movie.sig$cast_total_facebook_likes                               0.361 0.717999    
movie.sig$budget                                                -10.886  < 2e-16 ***
movie.sig$title_year                                            -14.680  < 2e-16 ***
movie.sig$genresAdventure                                         7.075 1.86e-12 ***
movie.sig$genresAnimation                                         5.828 6.23e-09 ***
movie.sig$genresBiography                                         8.522  < 2e-16 ***
movie.sig$genresComedy                                            3.747 0.000183 ***
movie.sig$genresCrime                                             7.309 3.45e-13 ***
movie.sig$genresDocumentary                                       6.317 3.06e-10 ***
movie.sig$genresDrama                                            11.067  < 2e-16 ***
movie.sig$genresFamily                                            1.366 0.172120    
movie.sig$genresFantasy                                          -1.364 0.172721    
movie.sig$genresHorror                                           -4.650 3.46e-06 ***
movie.sig$genresMusical                                          -0.890 0.373334    
movie.sig$genresMystery                                           1.029 0.303477    
movie.sig$genresRomance                                           1.160 0.246252    
movie.sig$genresSci-Fi                                            0.520 0.602769    
movie.sig$genresThriller                                         -0.415 0.678129    
movie.sig$genresWestern                                          -0.080 0.936448    
movie.sig$num_critic_for_reviews                                     NA       NA    
movie.sig$num_user_for_reviews                                       NA       NA    
movie.sig$num_voted_users                                            NA       NA    
movie.sig$gross                                                      NA       NA    
movie.sig$facenumber_in_poster:movie.sig$num_critic_for_reviews  -0.030 0.975820    
movie.sig$num_user_for_reviews:movie.sig$num_voted_users          2.828 0.004714 ** 
movie.sig$num_voted_users:movie.sig$gross                         1.355 0.175492    
movie.sig$budget:movie.sig$gross                                  7.180 8.80e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.741 on 2967 degrees of freedom
Multiple R-squared:  0.5107,    Adjusted R-squared:  0.5046 
F-statistic: 83.69 on 37 and 2967 DF,  p-value: < 2.2e-16
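
The four NA rows in this summary ("4 not defined because of singularities") follow from the formula: a * b expands to a + b + a:b, so variables that already enter through poly(., 2) are added a second time as plain linear terms, and those are exact linear combinations of the intercept and the first polynomial column. Below is a minimal sketch of the effect and of one way to avoid it with the : operator, using built-in mtcars data as a stand-in; the variables hp and wt and the object names are illustrative assumptions.

```r
# Built-in stand-in data; hp and wt are arbitrary numeric predictors

# `hp * wt` expands to hp + wt + hp:wt, so hp and wt enter a second time
fit_star  <- lm(mpg ~ poly(hp, 2) + poly(wt, 2) + hp * wt, data = mtcars)

# `hp:wt` adds only the product term
fit_colon <- lm(mpg ~ poly(hp, 2) + poly(wt, 2) + hp:wt, data = mtcars)

sum(is.na(coef(fit_star)))   # redundant linear terms are reported as NA
sum(is.na(coef(fit_colon)))  # no aliased coefficients

# The fitted values are the same; only the redundant columns differ
all.equal(fitted(fit_star), fitted(fit_colon))
```

The NA rows therefore do not change the fit itself; they only indicate redundant terms in the formula.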
step(null,scope=list(lower=null,upper=full2),direction='forward')
Start:  AIC=309.81
movie.sig$imdb_score ~ 1

                                            Df Sum of Sq    RSS     AIC
+ poly(movie.sig$num_voted_users, 2)         2    976.96 2352.2 -730.05
+ movie.sig$num_voted_users                  1    871.90 2457.2 -600.74
+ poly(movie.sig$duration, 2)                2    536.11 2793.0 -213.83
+ poly(movie.sig$num_user_for_reviews, 2)    2    483.99 2845.1 -158.27
+ poly(movie.sig$num_critic_for_reviews, 2)  2    436.49 2892.6 -108.52
+ movie.sig$num_critic_for_reviews           1    428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews             1    407.62 2921.5  -80.68
+ poly(movie.sig$movie_facebook_likes, 2)    2    317.80 3011.3   12.32
+ movie.sig$genres                          16    331.02 2998.1   27.10
+ poly(movie.sig$gross, 2)                   2    251.27 3077.9   77.99
+ movie.sig$gross                            1    242.62 3086.5   84.42
+ movie.sig$director_facebook_likes          1    166.17 3163.0  157.95
+ movie.sig$title_year                       1     69.27 3259.9  248.63
+ movie.sig$cast_total_facebook_likes        1     64.28 3264.8  253.22
+ movie.sig$budget                           1     16.26 3312.9  297.09
+ movie.sig$facenumber_in_poster             1     15.14 3314.0  298.11
<none>                                                   3329.1  309.81

Step:  AIC=-730.05
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2)

                                            Df Sum of Sq    RSS      AIC
+ movie.sig$genres                          16    337.58 2014.6 -1163.60
+ poly(movie.sig$duration, 2)                2    137.87 2214.3  -907.55
+ movie.sig$budget                           1    133.09 2219.1  -903.07
+ movie.sig$title_year                       1    101.46 2250.7  -860.55
+ poly(movie.sig$gross, 2)                   2     58.78 2293.4  -802.09
+ movie.sig$gross                            1     54.53 2297.6  -798.53
+ poly(movie.sig$num_user_for_reviews, 2)    2     29.12 2323.1  -763.48
+ movie.sig$num_user_for_reviews             1     25.39 2326.8  -760.66
+ movie.sig$director_facebook_likes          1     17.94 2334.2  -751.05
+ movie.sig$facenumber_in_poster             1      6.62 2345.5  -736.52
+ poly(movie.sig$num_critic_for_reviews, 2)  2      5.36 2346.8  -732.90
<none>                                                   2352.2  -730.05
+ movie.sig$num_critic_for_reviews           1      0.18 2352.0  -728.28
+ movie.sig$cast_total_facebook_likes        1      0.15 2352.0  -728.23
+ poly(movie.sig$movie_facebook_likes, 2)    2      1.29 2350.9  -727.70

Step:  AIC=-1163.6
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres

                                            Df Sum of Sq    RSS     AIC
+ movie.sig$title_year                       1    97.775 1916.8 -1311.1
+ movie.sig$budget                           1    65.238 1949.3 -1260.5
+ poly(movie.sig$duration, 2)                2    65.750 1948.8 -1259.3
+ movie.sig$gross                            1    19.722 1994.9 -1191.2
+ poly(movie.sig$gross, 2)                   2    20.698 1993.9 -1190.6
+ poly(movie.sig$num_user_for_reviews, 2)    2    20.024 1994.6 -1189.6
+ movie.sig$num_user_for_reviews             1    14.834 1999.8 -1183.8
+ poly(movie.sig$num_critic_for_reviews, 2)  2     9.375 2005.2 -1173.6
+ movie.sig$director_facebook_likes          1     6.114 2008.5 -1170.7
+ movie.sig$facenumber_in_poster             1     3.792 2010.8 -1167.3
<none>                                                   2014.6 -1163.6
+ movie.sig$cast_total_facebook_likes        1     0.355 2014.2 -1162.1
+ movie.sig$num_critic_for_reviews           1     0.042 2014.5 -1161.7
+ poly(movie.sig$movie_facebook_likes, 2)    2     0.813 2013.8 -1160.8

Step:  AIC=-1311.1
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year

                                            Df Sum of Sq    RSS     AIC
+ poly(movie.sig$num_critic_for_reviews, 2)  2    73.976 1842.8 -1425.4
+ poly(movie.sig$duration, 2)                2    49.885 1866.9 -1386.3
+ movie.sig$num_critic_for_reviews           1    43.723 1873.1 -1378.4
+ movie.sig$budget                           1    32.246 1884.6 -1360.1
+ poly(movie.sig$num_user_for_reviews, 2)    2    21.755 1895.0 -1341.4
+ poly(movie.sig$gross, 2)                   2    19.623 1897.2 -1338.0
+ movie.sig$gross                            1    17.879 1898.9 -1337.3
+ poly(movie.sig$movie_facebook_likes, 2)    2    18.788 1898.0 -1336.7
+ movie.sig$num_user_for_reviews             1    14.396 1902.4 -1331.8
+ movie.sig$director_facebook_likes          1     3.373 1913.4 -1314.4
<none>                                                   1916.8 -1311.1
+ movie.sig$facenumber_in_poster             1     1.216 1915.6 -1311.0
+ movie.sig$cast_total_facebook_likes        1     0.300 1916.5 -1309.6

Step:  AIC=-1425.37
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2)

                                          Df Sum of Sq    RSS     AIC
+ poly(movie.sig$num_user_for_reviews, 2)  2    54.189 1788.6 -1511.1
+ movie.sig$budget                         1    46.017 1796.8 -1499.4
+ poly(movie.sig$duration, 2)              2    38.533 1804.3 -1484.9
+ movie.sig$num_user_for_reviews           1    33.751 1809.1 -1478.9
+ poly(movie.sig$gross, 2)                 2    20.602 1822.2 -1455.2
+ movie.sig$gross                          1    16.630 1826.2 -1450.6
+ poly(movie.sig$movie_facebook_likes, 2)  2     8.227 1834.6 -1434.8
+ movie.sig$director_facebook_likes        1     2.296 1840.5 -1427.1
<none>                                                 1842.8 -1425.4
+ movie.sig$facenumber_in_poster           1     0.831 1842.0 -1424.7
+ movie.sig$cast_total_facebook_likes      1     0.104 1842.7 -1423.5

Step:  AIC=-1511.06
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2) + poly(movie.sig$num_user_for_reviews, 2)

                                          Df Sum of Sq    RSS     AIC
+ poly(movie.sig$duration, 2)              2    51.219 1737.4 -1594.4
+ movie.sig$budget                         1    34.907 1753.7 -1568.3
+ poly(movie.sig$gross, 2)                 2    20.882 1767.8 -1542.3
+ movie.sig$gross                          1    16.727 1771.9 -1537.3
+ poly(movie.sig$movie_facebook_likes, 2)  2     3.910 1784.7 -1513.6
+ movie.sig$director_facebook_likes        1     2.540 1786.1 -1513.3
+ movie.sig$facenumber_in_poster           1     1.970 1786.7 -1512.4
<none>                                                 1788.6 -1511.1
+ movie.sig$cast_total_facebook_likes      1     0.022 1788.6 -1509.1

Step:  AIC=-1594.36
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration, 
    2)

                                          Df Sum of Sq    RSS     AIC
+ movie.sig$budget                         1    62.211 1675.2 -1701.9
+ poly(movie.sig$gross, 2)                 2    30.406 1707.0 -1643.4
+ movie.sig$gross                          1    23.936 1713.5 -1634.0
+ movie.sig$facenumber_in_poster           1     4.139 1733.3 -1599.5
<none>                                                 1737.4 -1594.4
+ movie.sig$director_facebook_likes        1     0.946 1736.5 -1594.0
+ poly(movie.sig$movie_facebook_likes, 2)  2     1.928 1735.5 -1593.7
+ movie.sig$cast_total_facebook_likes      1     0.064 1737.4 -1592.5

Step:  AIC=-1701.94
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration, 
    2) + movie.sig$budget

                                          Df Sum of Sq    RSS     AIC
+ movie.sig$facenumber_in_poster           1    5.0599 1670.2 -1709.0
+ poly(movie.sig$gross, 2)                 2    4.5359 1670.7 -1706.1
+ movie.sig$gross                          1    1.8995 1673.3 -1703.3
<none>                                                 1675.2 -1701.9
+ movie.sig$director_facebook_likes        1    0.6239 1674.6 -1701.1
+ movie.sig$cast_total_facebook_likes      1    0.2471 1675.0 -1700.4
+ poly(movie.sig$movie_facebook_likes, 2)  2    0.8695 1674.3 -1699.5

Step:  AIC=-1709.03
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration, 
    2) + movie.sig$budget + movie.sig$facenumber_in_poster

                                          Df Sum of Sq    RSS     AIC
+ poly(movie.sig$gross, 2)                 2    4.6247 1665.5 -1713.4
+ movie.sig$gross                          1    1.9720 1668.2 -1710.6
<none>                                                 1670.2 -1709.0
+ movie.sig$director_facebook_likes        1    0.4874 1669.7 -1707.9
+ movie.sig$cast_total_facebook_likes      1    0.4443 1669.7 -1707.8
+ poly(movie.sig$movie_facebook_likes, 2)  2    0.8414 1669.3 -1706.5

Step:  AIC=-1713.36
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres + 
    movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration, 
    2) + movie.sig$budget + movie.sig$facenumber_in_poster + 
    poly(movie.sig$gross, 2)

                                          Df Sum of Sq    RSS     AIC
<none>                                                 1665.5 -1713.4
+ movie.sig$director_facebook_likes        1   0.49076 1665.0 -1712.2
+ movie.sig$cast_total_facebook_likes      1   0.41310 1665.1 -1712.1
+ poly(movie.sig$movie_facebook_likes, 2)  2   1.10614 1664.4 -1711.4

Call:
lm(formula = movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 
    2) + movie.sig$genres + movie.sig$title_year + poly(movie.sig$num_critic_for_reviews, 
    2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration, 
    2) + movie.sig$budget + movie.sig$facenumber_in_poster + 
    poly(movie.sig$gross, 2))

Coefficients:
                               (Intercept)         poly(movie.sig$num_voted_users, 2)1  
                                 5.851e+01                                   3.249e+01  
       poly(movie.sig$num_voted_users, 2)2                   movie.sig$genresAdventure  
                                -1.320e+01                                   3.770e-01  
                 movie.sig$genresAnimation                   movie.sig$genresBiography  
                                 7.306e-01                                   6.559e-01  
                    movie.sig$genresComedy                       movie.sig$genresCrime  
                                 1.875e-01                                   4.845e-01  
               movie.sig$genresDocumentary                       movie.sig$genresDrama  
                                 1.037e+00                                   5.524e-01  
                    movie.sig$genresFamily                     movie.sig$genresFantasy  
                                 2.093e-01                                  -1.231e-01  
                    movie.sig$genresHorror                     movie.sig$genresMusical  
                                -2.986e-01                                  -4.597e-01  
                   movie.sig$genresMystery                     movie.sig$genresRomance  
                                 2.304e-01                                   6.151e-01  
                    movie.sig$genresSci-Fi                    movie.sig$genresThriller  
                                 1.706e-01                                  -2.631e-01  
                   movie.sig$genresWestern                        movie.sig$title_year  
                                 5.056e-02                                  -2.605e-02  
poly(movie.sig$num_critic_for_reviews, 2)1  poly(movie.sig$num_critic_for_reviews, 2)2  
                                 1.634e+01                                  -6.906e+00  
  poly(movie.sig$num_user_for_reviews, 2)1    poly(movie.sig$num_user_for_reviews, 2)2  
                                -1.209e+01                                   7.641e+00  
              poly(movie.sig$duration, 2)1                poly(movie.sig$duration, 2)2  
                                 1.072e+01                                  -3.800e+00  
                          movie.sig$budget              movie.sig$facenumber_in_poster  
                                -4.048e-09                                  -2.026e-02  
                 poly(movie.sig$gross, 2)1                   poly(movie.sig$gross, 2)2  
                                -2.770e+00                                   1.851e+00  
  1. full3: additive model with interaction
full3=
lm(movie.sig$imdb_score ~movie.sig$num_voted_users+movie.sig$num_critic_for_reviews+movie.sig$num_user_for_reviews+movie.sig$duration+movie.sig$facenumber_in_poster+movie.sig$gross+movie.sig$movie_facebook_likes+movie.sig$director_facebook_likes+movie.sig$cast_total_facebook_likes+movie.sig$budget+movie.sig$title_year+factor(movie.sig$genres)+movie.sig$duration*movie.sig$num_voted_users+movie.sig$num_voted_users*movie.sig$num_user_for_reviews+movie.sig$gross*movie.sig$budget,data=movie.sig)
summary(full3)

Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users + 
    movie.sig$num_critic_for_reviews + movie.sig$num_user_for_reviews + 
    movie.sig$duration + movie.sig$facenumber_in_poster + movie.sig$gross + 
    movie.sig$movie_facebook_likes + movie.sig$director_facebook_likes + 
    movie.sig$cast_total_facebook_likes + movie.sig$budget + 
    movie.sig$title_year + factor(movie.sig$genres) + movie.sig$duration * 
    movie.sig$num_voted_users + movie.sig$num_voted_users * movie.sig$num_user_for_reviews + 
    movie.sig$gross * movie.sig$budget, data = movie.sig)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0519 -0.3700  0.0863  0.4828  2.0996 

Coefficients:
                                                           Estimate Std. Error t value
(Intercept)                                               4.748e+01  3.592e+00  13.218
movie.sig$num_voted_users                                 7.890e-06  4.790e-07  16.472
movie.sig$num_critic_for_reviews                          2.427e-03  2.275e-04  10.669
movie.sig$num_user_for_reviews                           -3.039e-04  6.998e-05  -4.343
movie.sig$duration                                        1.277e-02  9.200e-04  13.882
movie.sig$facenumber_in_poster                           -1.858e-02  6.806e-03  -2.730
movie.sig$gross                                          -1.469e-09  4.191e-10  -3.505
movie.sig$movie_facebook_likes                           -2.370e-06  9.659e-07  -2.454
movie.sig$director_facebook_likes                         3.969e-06  4.482e-06   0.885
movie.sig$cast_total_facebook_likes                       7.641e-07  7.181e-07   1.064
movie.sig$budget                                         -5.900e-09  5.917e-10  -9.971
movie.sig$title_year                                     -2.154e-02  1.790e-03 -12.032
factor(movie.sig$genres)Adventure                         3.308e-01  5.338e-02   6.196
factor(movie.sig$genres)Animation                         7.426e-01  1.319e-01   5.629
factor(movie.sig$genres)Biography                         6.551e-01  7.512e-02   8.720
factor(movie.sig$genres)Comedy                            1.515e-01  4.284e-02   3.537
factor(movie.sig$genres)Crime                             4.496e-01  6.353e-02   7.077
factor(movie.sig$genres)Documentary                       8.960e-01  1.579e-01   5.676
factor(movie.sig$genres)Drama                             4.965e-01  4.835e-02  10.269
factor(movie.sig$genres)Family                            3.329e-01  4.432e-01   0.751
factor(movie.sig$genres)Fantasy                          -1.544e-01  1.419e-01  -1.089
factor(movie.sig$genres)Horror                           -3.577e-01  7.638e-02  -4.683
factor(movie.sig$genres)Musical                          -2.616e-01  5.459e-01  -0.479
factor(movie.sig$genres)Mystery                           1.263e-01  1.939e-01   0.652
factor(movie.sig$genres)Romance                           5.476e-01  5.392e-01   1.016
factor(movie.sig$genres)Sci-Fi                            1.673e-01  2.900e-01   0.577
factor(movie.sig$genres)Thriller                         -4.858e-01  7.627e-01  -0.637
factor(movie.sig$genres)Western                          -1.277e-01  5.408e-01  -0.236
movie.sig$num_voted_users:movie.sig$duration             -3.052e-08  3.447e-09  -8.852
movie.sig$num_voted_users:movie.sig$num_user_for_reviews -3.752e-10  9.851e-11  -3.809
movie.sig$gross:movie.sig$budget                          1.411e-17  2.887e-18   4.886
                                                         Pr(>|t|)    
(Intercept)                                               < 2e-16 ***
movie.sig$num_voted_users                                 < 2e-16 ***
movie.sig$num_critic_for_reviews                          < 2e-16 ***
movie.sig$num_user_for_reviews                           1.46e-05 ***
movie.sig$duration                                        < 2e-16 ***
movie.sig$facenumber_in_poster                           0.006371 ** 
movie.sig$gross                                          0.000463 ***
movie.sig$movie_facebook_likes                           0.014175 *  
movie.sig$director_facebook_likes                        0.376035    
movie.sig$cast_total_facebook_likes                      0.287447    
movie.sig$budget                                          < 2e-16 ***
movie.sig$title_year                                      < 2e-16 ***
factor(movie.sig$genres)Adventure                        6.60e-10 ***
factor(movie.sig$genres)Animation                        1.98e-08 ***
factor(movie.sig$genres)Biography                         < 2e-16 ***
factor(movie.sig$genres)Comedy                           0.000411 ***
factor(movie.sig$genres)Crime                            1.83e-12 ***
factor(movie.sig$genres)Documentary                      1.51e-08 ***
factor(movie.sig$genres)Drama                             < 2e-16 ***
factor(movie.sig$genres)Family                           0.452648    
factor(movie.sig$genres)Fantasy                          0.276414    
factor(movie.sig$genres)Horror                           2.95e-06 ***
factor(movie.sig$genres)Musical                          0.631791    
factor(movie.sig$genres)Mystery                          0.514773    
factor(movie.sig$genres)Romance                          0.309947    
factor(movie.sig$genres)Sci-Fi                           0.563982    
factor(movie.sig$genres)Thriller                         0.524230    
factor(movie.sig$genres)Western                          0.813336    
movie.sig$num_voted_users:movie.sig$duration              < 2e-16 ***
movie.sig$num_voted_users:movie.sig$num_user_for_reviews 0.000143 ***
movie.sig$gross:movie.sig$budget                         1.08e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7607 on 2974 degrees of freedom
Multiple R-squared:  0.483, Adjusted R-squared:  0.4778 
F-statistic: 92.63 on 30 and 2974 DF,  p-value: < 2.2e-16
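
In an R formula, `a * b` is shorthand for `a + b + a:b`. The three product terms in full3 therefore repeat main effects that are already listed and only contribute the `:` interaction columns, which is why each main effect appears once in the summary. The sketch below checks that equivalence on simulated data (the data frame `toy` and its columns are made up for illustration). It then uses the two duration-related estimates printed above to show how the duration slope changes with the number of votes.

```r
# Toy data; names and values are illustrative, not from movie_metadata.csv.
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$y <- 1 + 2 * toy$a - toy$b + 0.5 * toy$a * toy$b + rnorm(50, sd = 0.1)

# `a * b` expands to `a + b + a:b`; both fits have the same coefficients.
fit_star  <- lm(y ~ a * b, data = toy)
fit_colon <- lm(y ~ a + b + a:b, data = toy)
same_fit  <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))

# With an interaction, the slope of one predictor depends on the other.
# Estimates copied from the full3 summary above:
b_duration    <- 1.277e-02   # movie.sig$duration
b_interaction <- -3.052e-08  # movie.sig$num_voted_users:movie.sig$duration
duration_slope <- function(votes) b_duration + b_interaction * votes

duration_slope(c(1e4, 1e5, 5e5))  # slope per extra minute at each vote count
```

Under these estimates the duration slope shrinks as the vote count grows and crosses zero at roughly 418,000 votes (1.277e-02 / 3.052e-08), so the duration coefficient alone should not be read as "longer movies score higher" for every movie.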
step(null,scope=list(lower=null,upper=full3),direction='forward')
Start:  AIC=309.81
movie.sig$imdb_score ~ 1

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$num_voted_users            1    871.90 2457.2 -600.74
+ movie.sig$duration                   1    491.13 2838.0 -167.82
+ movie.sig$num_critic_for_reviews     1    428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews       1    407.62 2921.5  -80.68
+ factor(movie.sig$genres)            16    331.02 2998.1   27.10
+ movie.sig$movie_facebook_likes       1    282.82 3046.3   45.02
+ movie.sig$gross                      1    242.62 3086.5   84.42
+ movie.sig$director_facebook_likes    1    166.17 3163.0  157.95
+ movie.sig$title_year                 1     69.27 3259.9  248.63
+ movie.sig$cast_total_facebook_likes  1     64.28 3264.8  253.22
+ movie.sig$budget                     1     16.26 3312.9  297.09
+ movie.sig$facenumber_in_poster       1     15.14 3314.0  298.11
<none>                                             3329.1  309.81

Step:  AIC=-600.74
movie.sig$imdb_score ~ movie.sig$num_voted_users

                                      Df Sum of Sq    RSS     AIC
+ factor(movie.sig$genres)            16   311.531 2145.7 -976.12
+ movie.sig$duration                   1   147.786 2309.4 -785.13
+ movie.sig$title_year                 1    84.649 2372.6 -704.08
+ movie.sig$budget                     1    73.211 2384.0 -689.63
+ movie.sig$num_user_for_reviews       1    21.297 2435.9 -624.90
+ movie.sig$gross                      1    16.929 2440.3 -619.51
+ movie.sig$num_critic_for_reviews     1    14.632 2442.6 -616.69
+ movie.sig$director_facebook_likes    1    13.657 2443.6 -615.49
+ movie.sig$facenumber_in_poster       1     6.789 2450.4 -607.05
+ movie.sig$movie_facebook_likes       1     2.627 2454.6 -601.95
<none>                                             2457.2 -600.74
+ movie.sig$cast_total_facebook_likes  1     0.524 2456.7 -599.38

Step:  AIC=-976.12
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres)

                                      Df Sum of Sq    RSS      AIC
+ movie.sig$title_year                 1    79.623 2066.1 -1087.75
+ movie.sig$duration                   1    74.584 2071.1 -1080.44
+ movie.sig$budget                     1    28.689 2117.0 -1014.57
+ movie.sig$num_critic_for_reviews     1    23.116 2122.6 -1006.67
+ movie.sig$num_user_for_reviews       1    12.251 2133.4  -991.33
+ movie.sig$director_facebook_likes    1     3.707 2142.0  -979.32
+ movie.sig$facenumber_in_poster       1     3.274 2142.4  -978.71
+ movie.sig$movie_facebook_likes       1     1.686 2144.0  -976.49
<none>                                             2145.7  -976.12
+ movie.sig$gross                      1     1.391 2144.3  -976.07
+ movie.sig$cast_total_facebook_likes  1     0.362 2145.3  -974.63

Step:  AIC=-1087.75
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$num_critic_for_reviews     1   125.091 1941.0 -1273.4
+ movie.sig$duration                   1    55.857 2010.2 -1168.1
+ movie.sig$movie_facebook_likes       1    21.746 2044.3 -1117.5
+ movie.sig$num_user_for_reviews       1    11.741 2054.3 -1102.9
+ movie.sig$budget                     1     9.196 2056.9 -1099.2
+ movie.sig$cast_total_facebook_likes  1     2.923 2063.2 -1090.0
+ movie.sig$director_facebook_likes    1     1.740 2064.3 -1088.3
<none>                                             2066.1 -1087.8
+ movie.sig$facenumber_in_poster       1     1.084 2065.0 -1087.3
+ movie.sig$gross                      1     0.638 2065.4 -1086.7

Step:  AIC=-1273.43
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$budget                     1    36.627 1904.4 -1328.7
+ movie.sig$num_user_for_reviews       1    35.326 1905.7 -1326.6
+ movie.sig$duration                   1    34.873 1906.1 -1325.9
+ movie.sig$gross                      1     7.359 1933.6 -1282.8
+ movie.sig$movie_facebook_likes       1     1.397 1939.6 -1273.6
<none>                                             1941.0 -1273.4
+ movie.sig$facenumber_in_poster       1     0.926 1940.1 -1272.9
+ movie.sig$director_facebook_likes    1     0.644 1940.3 -1272.4
+ movie.sig$cast_total_facebook_likes  1     0.572 1940.4 -1272.3

Step:  AIC=-1328.68
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$duration                   1    58.373 1846.0 -1420.2
+ movie.sig$num_user_for_reviews       1    27.052 1877.3 -1369.7
+ movie.sig$movie_facebook_likes       1     2.576 1901.8 -1330.8
+ movie.sig$cast_total_facebook_likes  1     2.005 1902.3 -1329.8
<none>                                             1904.4 -1328.7
+ movie.sig$facenumber_in_poster       1     1.071 1903.3 -1328.4
+ movie.sig$director_facebook_likes    1     0.557 1903.8 -1327.6
+ movie.sig$gross                      1     0.074 1904.3 -1326.8

Step:  AIC=-1420.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration

                                               Df Sum of Sq    RSS     AIC
+ movie.sig$num_voted_users:movie.sig$duration  1    70.848 1775.1 -1535.8
+ movie.sig$num_user_for_reviews                1    33.825 1812.2 -1473.8
+ movie.sig$movie_facebook_likes                1     4.702 1841.3 -1425.9
+ movie.sig$facenumber_in_poster                1     2.488 1843.5 -1422.3
+ movie.sig$cast_total_facebook_likes           1     1.601 1844.4 -1420.8
<none>                                                      1846.0 -1420.2
+ movie.sig$gross                               1     0.196 1845.8 -1418.5
+ movie.sig$director_facebook_likes             1     0.043 1845.9 -1418.3

Step:  AIC=-1535.83
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_voted_users:movie.sig$duration

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$num_user_for_reviews       1   26.4426 1748.7 -1578.9
+ movie.sig$facenumber_in_poster       1    2.9576 1772.2 -1538.8
+ movie.sig$cast_total_facebook_likes  1    1.1823 1774.0 -1535.8
<none>                                             1775.1 -1535.8
+ movie.sig$movie_facebook_likes       1    0.9446 1774.2 -1535.4
+ movie.sig$director_facebook_likes    1    0.3854 1774.8 -1534.5
+ movie.sig$gross                      1    0.0191 1775.1 -1533.9

Step:  AIC=-1578.93
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$num_voted_users:movie.sig$duration

                                                           Df Sum of Sq    RSS     AIC
+ movie.sig$num_voted_users:movie.sig$num_user_for_reviews  1    5.4845 1743.2 -1586.4
+ movie.sig$facenumber_in_poster                            1    4.1664 1744.5 -1584.1
+ movie.sig$movie_facebook_likes                            1    3.9301 1744.8 -1583.7
<none>                                                                  1748.7 -1578.9
+ movie.sig$cast_total_facebook_likes                       1    0.7354 1748.0 -1578.2
+ movie.sig$director_facebook_likes                         1    0.2660 1748.4 -1577.4
+ movie.sig$gross                                           1    0.0008 1748.7 -1576.9

Step:  AIC=-1586.37
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$facenumber_in_poster       1    4.0181 1739.2 -1591.3
+ movie.sig$movie_facebook_likes       1    3.2754 1739.9 -1590.0
<none>                                             1743.2 -1586.4
+ movie.sig$cast_total_facebook_likes  1    0.6359 1742.6 -1585.5
+ movie.sig$director_facebook_likes    1    0.3798 1742.8 -1585.0
+ movie.sig$gross                      1    0.0475 1743.2 -1584.5

Step:  AIC=-1591.31
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$facenumber_in_poster + movie.sig$num_voted_users:movie.sig$duration + 
    movie.sig$num_voted_users:movie.sig$num_user_for_reviews

                                      Df Sum of Sq    RSS     AIC
+ movie.sig$movie_facebook_likes       1   3.11243 1736.1 -1594.7
<none>                                             1739.2 -1591.3
+ movie.sig$cast_total_facebook_likes  1   0.90996 1738.3 -1590.9
+ movie.sig$director_facebook_likes    1   0.29041 1738.9 -1589.8
+ movie.sig$gross                      1   0.04757 1739.1 -1589.4

Step:  AIC=-1594.69
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) + 
    movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$facenumber_in_poster + movie.sig$movie_facebook_likes + 
    movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews

                                      Df Sum of Sq    RSS     AIC
<none>                                             1736.1 -1594.7
+ movie.sig$cast_total_facebook_likes  1   0.97305 1735.1 -1594.4
+ movie.sig$director_facebook_likes    1   0.27990 1735.8 -1593.2
+ movie.sig$gross                      1   0.03634 1736.0 -1592.8

Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users + 
    factor(movie.sig$genres) + movie.sig$title_year + movie.sig$num_critic_for_reviews + 
    movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews + 
    movie.sig$facenumber_in_poster + movie.sig$movie_facebook_likes + 
    movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews)

Coefficients:
                                             (Intercept)  
                                               4.817e+01  
                               movie.sig$num_voted_users  
                                               7.152e-06  
                       factor(movie.sig$genres)Adventure  
                                               3.300e-01  
                       factor(movie.sig$genres)Animation  
                                               7.097e-01  
                       factor(movie.sig$genres)Biography  
                                               6.794e-01  
                          factor(movie.sig$genres)Comedy  
                                               1.675e-01  
                           factor(movie.sig$genres)Crime  
                                               4.784e-01  
                     factor(movie.sig$genres)Documentary  
                                               9.449e-01  
                           factor(movie.sig$genres)Drama  
                                               5.252e-01  
                          factor(movie.sig$genres)Family  
                                               2.260e-01  
                         factor(movie.sig$genres)Fantasy  
                                              -1.422e-01  
                          factor(movie.sig$genres)Horror  
                                              -3.440e-01  
                         factor(movie.sig$genres)Musical  
                                              -3.165e-01  
                         factor(movie.sig$genres)Mystery  
                                               1.499e-01  
                         factor(movie.sig$genres)Romance  
                                               5.682e-01  
                          factor(movie.sig$genres)Sci-Fi  
                                               1.953e-01  
                        factor(movie.sig$genres)Thriller  
                                              -4.097e-01  
                         factor(movie.sig$genres)Western  
                                              -4.521e-02  
                                    movie.sig$title_year  
                                              -2.189e-02  
                        movie.sig$num_critic_for_reviews  
                                               2.566e-03  
                                        movie.sig$budget  
                                              -4.370e-09  
                                      movie.sig$duration  
                                               1.206e-02  
                          movie.sig$num_user_for_reviews  
                                              -3.210e-04  
                          movie.sig$facenumber_in_poster  
                                              -1.750e-02  
                          movie.sig$movie_facebook_likes  
                                              -2.239e-06  
            movie.sig$num_voted_users:movie.sig$duration  
                                              -2.661e-08  
movie.sig$num_voted_users:movie.sig$num_user_for_reviews  
                                              -2.729e-10  
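
The AIC column in the step() traces is not the value returned by AIC(). For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below reproduces two of the printed values from their RSS, assuming n = 3005 rows (2974 residual degrees of freedom plus 31 coefficients in the full3 summary). The helper name `step_aic` is invented for this example, and small differences from the printout come from the rounded RSS.

```r
# AIC as reported by step() for lm fits: n * log(RSS / n) + 2 * edf.
# The helper name step_aic is made up for this illustration.
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 2974 + 31  # residual df + number of coefficients in summary(full3)

# Intercept-only model: RSS = 3329.1, one coefficient (printed AIC = 309.81).
aic_null <- step_aic(3329.1, n, edf = 1)

# imdb_score ~ num_voted_users: RSS = 2457.2, two coefficients
# (printed AIC = -600.74).
aic_votes <- step_aic(2457.2, n, edf = 2)

round(c(null = aic_null, votes = aic_votes), 1)
```

Because the penalty is 2 per coefficient, adding the 16-df genre factor has to reduce n * log(RSS / n) by more than 32 to lower the AIC, which it does comfortably in the second step of the trace.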

For ease of interpretation, I will start with full3 (the additive model with interaction terms). After checking the residuals, we will decide whether to add higher-order terms.
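
One way to make that residual check concrete is sketched here on simulated data, since the fitted objects are not reproduced at this point. It fits a straight-line model, plots residuals against fitted values, and compares the fit with a nested model that adds a squared term. The data frame `sim` and its columns are invented for the example; on the movie data the same calls would presumably take the full3 fit and a poly() variant of it.

```r
# Simulated data with a curved relationship (illustrative only).
set.seed(42)
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- 3 + 0.8 * sim$x - 0.06 * sim$x^2 + rnorm(200, sd = 0.3)

fit_linear <- lm(y ~ x, data = sim)
fit_quad   <- lm(y ~ poly(x, 2), data = sim)

# A curved band in this plot suggests a missing higher-order term.
plot(fitted(fit_linear), resid(fit_linear),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# F-test for the nested models: a small p-value favours the quadratic term.
cmp <- anova(fit_linear, fit_quad)
cmp
```

If the residual plot shows no systematic pattern and the F-test is not significant, the simpler additive model is easier to interpret and there is little reason to add polynomial terms.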

Split data into Test and Train:
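
Because sample() draws a fresh permutation on every run, the indices printed below will differ each time the document is knitted unless a seed is set. A small sketch of a seeded 90/10 split on a stand-in data frame follows; the seed value and the names `toy`, `train_idx`, `train`, and `test` are assumptions for illustration.

```r
# Stand-in data; in the analysis this would be movie.sig.
toy <- data.frame(id = 1:100, score = seq(1, 10, length.out = 100))

set.seed(123)  # any fixed seed makes the split repeatable
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.9 * nrow(toy)))

train <- toy[train_idx, ]   # 90% of rows, used for fitting
test  <- toy[-train_idx, ]  # remaining 10%, held out for evaluation

c(train = nrow(train), test = nrow(test))
```

Fitting on `train` and scoring on `test` only gives an honest error estimate if the two sets share no rows, which negative indexing with the same index vector guarantees.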

indx = sample(1:nrow(movie.sig), as.integer(0.9*nrow(movie.sig)))
indx # randomly sampled row indices covering 90% of the data (training set)
   [1]  328 2365 1899 1526 2487 1132 2560 2291 1959 1879 2607 1062 2952   31  122 2966
  [17] 2875 2626 1514 2507 1946 2115  892 1601 2812 1881 2375 1458  967 1217  879 1680
  [33] 2921  701  319  194  203  249 2505 2656 1636 1190 2472 2886  497 1206 1174  505
  [49]  831 2725 1719 2440  873 2917 2403 1087  350 1291 2735 2915  318  297  607 1393
  [65]  952 2619  199 1049 2874   25 1187  898 1602  147 2737 2864 2747 2990 1185 1781
  [81] 1674  440 1876  234 2178  241 2499  804   48  209 2622 1722  748 1965   86    8
  [97] 2491  948 1883  360 2579 1413 1797   20  946  912  584 1827  924 2526  766 2580
 [113] 1743 1990  522  582  618  604 1000  729  317  768 1417 2992 1325 2039 1758 1453
 [129] 2450 1061 2758 1993 1258 2344 2568 2693  294 2723 1103   84 1778 1004  678  455
 [145] 1170 2455 2469  258  930 1841 1964  254  435 2773 2166  176  240  470  853 2268
 [161] 2168 2865 1748 2236  940 1472  213 2625 2994  685  809 2130 1789  150  842 1909
 [177]  268 1211  285 2969 2189 2401 2504 2608 2174 1370 2196 2826 2099 2005 1379 1944
 [193] 1053 2470 1167 2167  138 1402 1505  366 1010 1672 1829 2980 1999 1810 1252  495
 [209] 2783 1633 1255 1209 2462 1994 1351 1531 2252 2691  389 1104 2294  969 1311 2762
 [225] 1886 1120  833  746 1834  539  202 1429   27 1084  925   30 2018  555 2611 2675
 [241] 1647  379   59  646 1686 1678 2909 1536 2391  596 2531 1233 2637 2989 2017 1992
 [257] 1796  926 2687 1266 1403 1298   36 1715 2439 1464 1986  267 2212 1307 2861   55
 [273] 2471 1782 2223  711 2995 2241 2829 1884 1902 1091 2094 2349  954 1246  849  810
 [289] 1801 2106 1776 2427 2374 2044 2181  715 1439 2049 1105 2710 2715 1470 2547  677
 [305]   76 1609  352 1471 1556 1376 1530 2243 1041 2881  619 2125 2779 2227 2052  485
 [321] 2645  724  975 2809 1991  861 2390 1895  712  270  794 2767 1613 2654 2346 2593
 [337] 1295 1204  561  373 2566 2021 1923  710  271  365 2064 2433  489  316 3000 2887
 [353]  124 1588 1017 1357 1323 2860 2782 2819 2267 1048 2143 1524 1467 1929 2927 1421
 [369] 1922   63  269 2993 1142  919 2284 1893 2899 1444 2785 2838   92 2974 2456  815
 [385] 1695 1794 1074 1939  695 2067   68 1126 1656 1476 2669  666  452 1557 1539 1503
 [401] 2030 2062 2128  545  121 1051 2162 2409 1338  987 1490 2592  159 1727  537 2797
 [417]  125 2287 1289 3001 2360 1576 1669 1416  910 2420 2013 1267 1130 1775 1256  115
 [433] 2932  682  656 1865 1177 2913 1593 1655 1806 2288 1153 2325 1019 2718 2134  549
 [449] 2458 1183 1795 2488 1235  951 2912  580 2999  472 1967  672 2131 2750 1790  367
 [465] 2820  843 1683  564 1708  399  583 2643  190 1693  191 1035 2556  281  602 2733
 [481]  897 1840 2053 2653 2185 2641 1579  821 1425 1548 1317  110 1889  743  610 2468
 [497] 2960 1515 2743  227  215 1996    6 1389 1314 1945  642 2448  995  362 2808 1703
 [513]  212 2079 2399 2309 1507 2756 2609 2657  651  736  896  886 1851 2311 2717  334
 [529]  541  135  312 2274 1887 2668 1540 1916  556  790 1500 2385  179 1360  994 1477
 [545]  638 1755  882 1907  683 2345  671 2704 2851 1658  658  500 1650   46 2343 2396
 [561] 1168 2090  304  116  492 1450 1733 2703 2496  625  554 1193 1336 2853 1552 1501
 [577] 1364 2043 1068 2929  417  457  460 2776 1426  598 2054 1861 1640 1392 1352  855
 [593] 2277 1623 1275 2786 2338 2024 2132 1597 1137 2614  697  525 2885 1598 1786  540
 [609] 1312  107 2576 1078  752   44 2144  219 1575 2711 2949 1071  716  398 2602 2769
 [625]  506 2768 2003 2201 2240 1877 2025 1158 1762  689 2452 2480  309 1918 2976 1115
 [641] 1660 1201 2386 1799 1520  253  260 2184   51 2520 2798 1172 2765  181  918 2961
 [657]   50  345 2000 2008 1618 1629 1792 2194  192 1535 2383 2897 1573 2362 2046 1730
 [673]  917 2245 1016 1616  246 2869 1521 2219 2126   61 2156   18  632  723 1725  657
 [689]  659 2822  605 1710 2313 1757 2676 1547 2222 1147  385 1284 1589 2515 1220  931
 [705]  434    9 2477 2958 2276 2621 2228  587 1248 1868  400  231 2280 2541 1221 2918
 [721]  693  765 1229 2679 2872  225  448 1653 2536 2663 2739  419  573  617  781  255
 [737] 1954 2978 1545 2089  332 2542 2561  730  502  742 2518  418 2603 2936  113  449
 [753] 1182 2855  261 1707 2551 2509 2226  200  601 1262  961 1777 1863  830 2598 1624
 [769] 1433 1741 1337 1363 1949 1191 2569 2091 1729  676 1155   79 2681 1551 1056 2028
 [785]   16 2800   96 2317 2719 1761 1857 2601 2364 2397 1390 1671 1013  336 2465 1133
 [801] 1745  349 2821  411 2297   29 2546 2371  262 2239 1662 1125 2372 1178 2259 1044
 [817]   98 1420  520 2435  507 2746  146 1767  968 2459  355  509 1824 2211 1341 2229
 [833] 2944  340   39 2050 1936 2117 2377  899 2424 2460 1077  932 2781  134 1689  151
 [849] 1862 1184 2882 2320 2896 1811  145  763 2700 2879 2996 1427 2586  895  117  424
 [865] 1119 2721 1819 1812  283 1159 1765  469 2880  890 1897  384 1989   21  688 2192
 [881] 1607 2426 1409 2033  153 2748  998 2827 2632 1615  120 2937 2665 1304 2572 1974
 [897] 2613 2856    4  660 1082  284  756 2354 1400  333 1987 1293 1386 1697 1151 2893
 [913] 2041 2173 2707  104   13 1382  359  563 2649  870 2454 2056   66 1313  326  423
 [929]  536   70 2889   83  526 1716 2975 1924  644 2190 2544 2326 2646  623 1694 2973
 [945] 1641 1109 1513   75 1247   80  860 1706 2578  201  916 1627 2523 1666 2636 1826
 [961] 1511 1179 1222 2777 1259  129 2087 2575  962 1214 2304 1198 2849 1412 1230  628
 [977] 2135 1700  740 2237  858  323 1237 1499  771 1542  817 1930  959 2792 1978  351
 [993] 1718 2982  922 2563  293 1369 1648 2183
 [ reached getOption("max.print") -- omitted 1704 entries ]
movie_train = movie.sig[indx,]
movie_test = movie.sig[-indx,]
# lm.fit1: linear model with interaction terms, dropping the insignificant predictors.
# Insignificant terms (from summary(full3)): 'director_facebook_likes', 'movie_facebook_likes' and 'cast_total_facebook_likes'.
# Note: this does not depend on which step function we chose for full3.
lm.fit1<-lm(movie_train$imdb_score~movie_train$num_voted_users+movie_train$num_critic_for_reviews+movie_train$num_user_for_reviews+movie_train$duration+movie_train$facenumber_in_poster+movie_train$gross+movie_train$budget+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
summary(lm.fit1)

Call:
lm(formula = movie_train$imdb_score ~ movie_train$num_voted_users + 
    movie_train$num_critic_for_reviews + movie_train$num_user_for_reviews + 
    movie_train$duration + movie_train$facenumber_in_poster + 
    movie_train$gross + movie_train$budget + movie_train$title_year + 
    factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users + 
    movie_train$num_voted_users * movie_train$num_user_for_reviews + 
    movie_train$gross * movie_train$budget)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1194 -0.3781  0.0931  0.4858  2.0804 

Coefficients:
                                                               Estimate Std. Error
(Intercept)                                                   4.665e+01  3.787e+00
movie_train$num_voted_users                                   8.335e-06  5.213e-07
movie_train$num_critic_for_reviews                            2.187e-03  2.038e-04
movie_train$num_user_for_reviews                             -2.890e-04  7.239e-05
movie_train$duration                                          1.342e-02  9.816e-04
movie_train$facenumber_in_poster                             -1.901e-02  7.330e-03
movie_train$gross                                            -1.518e-09  4.415e-10
movie_train$budget                                           -6.024e-09  6.279e-10
movie_train$title_year                                       -2.115e-02  1.887e-03
factor(movie_train$genres)Adventure                           3.115e-01  5.622e-02
factor(movie_train$genres)Animation                           6.982e-01  1.419e-01
factor(movie_train$genres)Biography                           6.623e-01  7.860e-02
factor(movie_train$genres)Comedy                              1.560e-01  4.511e-02
factor(movie_train$genres)Crime                               4.634e-01  6.614e-02
factor(movie_train$genres)Documentary                         1.286e+00  1.680e-01
factor(movie_train$genres)Drama                               4.970e-01  5.072e-02
factor(movie_train$genres)Family                              3.297e-01  4.429e-01
factor(movie_train$genres)Fantasy                            -1.832e-01  1.466e-01
factor(movie_train$genres)Horror                             -3.607e-01  8.103e-02
factor(movie_train$genres)Musical                             3.017e-01  7.640e-01
factor(movie_train$genres)Mystery                             1.892e-01  2.067e-01
factor(movie_train$genres)Romance                             8.373e-01  7.621e-01
factor(movie_train$genres)Sci-Fi                             -7.964e-03  3.428e-01
factor(movie_train$genres)Thriller                           -4.997e-01  7.626e-01
factor(movie_train$genres)Western                             9.650e-01  7.619e-01
movie_train$num_voted_users:movie_train$duration             -3.375e-08  3.799e-09
movie_train$num_voted_users:movie_train$num_user_for_reviews -4.043e-10  1.018e-10
movie_train$gross:movie_train$budget                          1.496e-17  3.068e-18
                                                             t value Pr(>|t|)    
(Intercept)                                                   12.320  < 2e-16 ***
movie_train$num_voted_users                                   15.990  < 2e-16 ***
movie_train$num_critic_for_reviews                            10.730  < 2e-16 ***
movie_train$num_user_for_reviews                              -3.993 6.71e-05 ***
movie_train$duration                                          13.667  < 2e-16 ***
movie_train$facenumber_in_poster                              -2.594 0.009550 ** 
movie_train$gross                                             -3.437 0.000597 ***
movie_train$budget                                            -9.594  < 2e-16 ***
movie_train$title_year                                       -11.208  < 2e-16 ***
factor(movie_train$genres)Adventure                            5.541 3.30e-08 ***
factor(movie_train$genres)Animation                            4.918 9.25e-07 ***
factor(movie_train$genres)Biography                            8.425  < 2e-16 ***
factor(movie_train$genres)Comedy                               3.457 0.000554 ***
factor(movie_train$genres)Crime                                7.007 3.07e-12 ***
factor(movie_train$genres)Documentary                          7.654 2.71e-14 ***
factor(movie_train$genres)Drama                                9.797  < 2e-16 ***
factor(movie_train$genres)Family                               0.744 0.456735    
factor(movie_train$genres)Fantasy                             -1.249 0.211622    
factor(movie_train$genres)Horror                              -4.452 8.87e-06 ***
factor(movie_train$genres)Musical                              0.395 0.692918    
factor(movie_train$genres)Mystery                              0.915 0.360080    
factor(movie_train$genres)Romance                              1.099 0.272027    
factor(movie_train$genres)Sci-Fi                              -0.023 0.981469    
factor(movie_train$genres)Thriller                            -0.655 0.512335    
factor(movie_train$genres)Western                              1.267 0.205440    
movie_train$num_voted_users:movie_train$duration              -8.884  < 2e-16 ***
movie_train$num_voted_users:movie_train$num_user_for_reviews  -3.973 7.29e-05 ***
movie_train$gross:movie_train$budget                           4.875 1.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7604 on 2676 degrees of freedom
Multiple R-squared:  0.4854,    Adjusted R-squared:  0.4802 
F-statistic: 93.49 on 27 and 2676 DF,  p-value: < 2.2e-16

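The p-value of the overall F-test is very small (< 2.2e-16), so the model as a whole is significant. All numeric predictors and interaction terms are significant, with face number in poster the least significant of them (p = 0.0096). Several genre levels (Family, Fantasy, Musical, Mystery, Romance, Sci-Fi, Thriller and Western) do not differ significantly from the baseline genre. Adjusted R^2 is 0.4802, which means about 48% of the variability in IMDB score can be explained by this model.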
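One practical caveat about the call above: the formula names every column as `movie_train$...`, so the fitted object is tied to the training vectors, and `predict(lm.fit1, newdata = movie_test)` would not pick up the test rows. Below is a minimal sketch of the more usual `data =` style. It runs on a small simulated data frame; the values and the names `toy`, `toy_train` and `toy_test` are made up for illustration and are not the movie data.

```r
# Simulated stand-in for the movie data (hypothetical values).
set.seed(1)
toy <- data.frame(
  duration        = runif(200, 80, 180),
  num_voted_users = runif(200, 1e3, 1e6)
)
toy$imdb_score <- 4 + 0.015 * toy$duration + 1e-6 * toy$num_voted_users +
  rnorm(200, sd = 0.5)

# Random train/test split, in the same spirit as movie_train / movie_test.
idx       <- sample(nrow(toy), 150)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Bare column names plus data = keep the formula independent of one data frame.
fit <- lm(imdb_score ~ duration + num_voted_users, data = toy_train)

# predict() can now look the same column names up in the held-out rows.
pred <- predict(fit, newdata = toy_test)
length(pred) == nrow(toy_test)
```

Written this way, the coefficient names in `summary()` also become shorter (`duration` instead of `movie_train$duration`), which makes the tables easier to read.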

Run a partial F-test (lack-of-fit test) to check whether removing the three predictors is justified:

#lm.full: full linear model with interaction terms on train dataset.
lm.full<-lm(movie_train$imdb_score~movie_train$num_voted_users+movie_train$num_critic_for_reviews+movie_train$num_user_for_reviews+movie_train$duration+movie_train$facenumber_in_poster+movie_train$gross+movie_train$movie_facebook_likes+movie_train$director_facebook_likes+movie_train$cast_total_facebook_likes+movie_train$budget+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
anova(lm.full,lm.fit1) # H0: the reduced model is adequate (lack of fit = 0)
Analysis of Variance Table

Model 1: movie_train$imdb_score ~ movie_train$num_voted_users + movie_train$num_critic_for_reviews + 
    movie_train$num_user_for_reviews + movie_train$duration + 
    movie_train$facenumber_in_poster + movie_train$gross + movie_train$movie_facebook_likes + 
    movie_train$director_facebook_likes + movie_train$cast_total_facebook_likes + 
    movie_train$budget + movie_train$title_year + factor(movie_train$genres) + 
    movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users * 
    movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Model 2: movie_train$imdb_score ~ movie_train$num_voted_users + movie_train$num_critic_for_reviews + 
    movie_train$num_user_for_reviews + movie_train$duration + 
    movie_train$facenumber_in_poster + movie_train$gross + movie_train$budget + 
    movie_train$title_year + factor(movie_train$genres) + movie_train$duration * 
    movie_train$num_voted_users + movie_train$num_voted_users * 
    movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   2673 1545.1                           
2   2676 1547.5 -3   -2.3691 1.3662 0.2513

The p-value of the partial F-test is 0.2513, so we fail to reject H0: dropping 'director_facebook_likes', 'movie_facebook_likes' and 'cast_total_facebook_likes' does not significantly worsen the fit, and the simpler model lm.fit1 is preferred.
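The F statistic in the anova table can be reproduced by hand, which makes it clearer what the test measures: the increase in residual sum of squares per dropped coefficient, relative to the residual variance of the full model. The sketch below only re-uses the numbers printed in the table above; the variable names are ours.

```r
# Values copied from the anova() table above.
rss_full <- 1545.1   # residual sum of squares of lm.full
df_full  <- 2673     # residual degrees of freedom of lm.full
sum_sq   <- 2.3691   # increase in RSS when the three terms are dropped
q        <- 3        # number of dropped coefficients

# Partial F statistic: extra RSS per dropped term over the full-model variance.
f_stat <- (sum_sq / q) / (rss_full / df_full)

# Upper-tail probability of the F(q, df_full) distribution.
p_val <- pf(f_stat, q, df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_val), 4)
```

Up to rounding of the printed inputs, this agrees with the F and Pr(>F) columns of the table.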

Diagnostics:

plot(lm.fit1)
not plotting observations with leverage one:
  420, 1144, 1702, 1734


# The residuals vs fitted plot suggests higher-order terms may be needed. The normal Q-Q plot does not look good either.
library(car)
residualPlots(lm.fit1)

                                   Test stat Pr(>|t|)
movie_train$num_voted_users           -8.015    0.000
movie_train$num_critic_for_reviews    -7.784    0.000
movie_train$num_user_for_reviews       3.994    0.000
movie_train$duration                  -4.664    0.000
movie_train$facenumber_in_poster       0.400    0.689
movie_train$gross                     -3.958    0.000
movie_train$budget                     4.818    0.000
movie_train$title_year                -3.936    0.000
factor(movie_train$genres)                NA       NA
Tukey test                           -14.747    0.000

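Almost all of the residual vs. predictor plots show a general trend of curvature (the only numeric predictor without a significant curvature test is face number in poster, p = 0.689), which indicates the current model does not fit well. Higher-order terms should be included.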
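For reference, the "Test stat" column printed by `residualPlots()` is a curvature test: for each numeric predictor, a squared term is added to the model and its t-test is reported. A minimal base-R sketch of that idea on simulated data (the data frame `sim` and its true relationship are invented for illustration):

```r
# Simulated data with a truly curved relationship (hypothetical).
set.seed(42)
sim   <- data.frame(x = runif(300, 0, 10))
sim$y <- 1 + 2 * sim$x - 0.3 * sim$x^2 + rnorm(300, sd = 1)

# Straight-line fit, and the same fit with a squared term added.
fit_linear <- lm(y ~ x, data = sim)
fit_quad   <- lm(y ~ x + I(x^2), data = sim)

# The t-test on the squared term plays the role of the curvature test.
curv <- summary(fit_quad)$coefficients["I(x^2)", c("t value", "Pr(>|t|)")]
curv
```

A large absolute t value with a small p-value, as for most predictors in the table above, says that a straight line in that predictor leaves systematic curvature in the residuals.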

Fit a model with higher-order terms:

# lm.fit2: based on lm.fit1, adding second-order terms for all numeric variables except 'facenumber_in_poster' and 'title_year'.
lm.fit2<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+poly(movie_train$num_user_for_reviews,2)+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+poly(movie_train$gross,2)+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
summary(lm.fit2)

Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 
    2) + poly(movie_train$num_critic_for_reviews, 2) + poly(movie_train$num_user_for_reviews, 
    2) + poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    poly(movie_train$gross, 2) + poly(movie_train$budget, 2) + 
    movie_train$title_year + factor(movie_train$genres) + movie_train$duration * 
    movie_train$num_voted_users + movie_train$num_voted_users * 
    movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9109 -0.3530  0.0637  0.4618  2.2051 

Coefficients: (5 not defined because of singularities)
                                                               Estimate Std. Error
(Intercept)                                                   5.222e+01  3.793e+00
poly(movie_train$num_voted_users, 2)1                         4.696e+01  5.073e+00
poly(movie_train$num_voted_users, 2)2                        -1.515e+01  2.182e+00
poly(movie_train$num_critic_for_reviews, 2)1                  1.360e+01  1.313e+00
poly(movie_train$num_critic_for_reviews, 2)2                 -7.180e+00  8.464e-01
poly(movie_train$num_user_for_reviews, 2)1                   -1.604e+01  2.217e+00
poly(movie_train$num_user_for_reviews, 2)2                    4.249e+00  1.606e+00
poly(movie_train$duration, 2)1                                1.447e+01  1.084e+00
poly(movie_train$duration, 2)2                               -3.403e+00  7.733e-01
movie_train$facenumber_in_poster                             -2.293e-02  7.101e-03
poly(movie_train$gross, 2)1                                  -6.128e+00  2.127e+00
poly(movie_train$gross, 2)2                                  -1.912e+00  1.217e+00
poly(movie_train$budget, 2)1                                 -1.378e+01  1.950e+00
poly(movie_train$budget, 2)2                                  5.600e+00  1.084e+00
movie_train$title_year                                       -2.290e-02  1.898e-03
factor(movie_train$genres)Adventure                           3.535e-01  5.499e-02
factor(movie_train$genres)Animation                           7.302e-01  1.384e-01
factor(movie_train$genres)Biography                           6.244e-01  7.622e-02
factor(movie_train$genres)Comedy                              1.419e-01  4.396e-02
factor(movie_train$genres)Crime                               4.508e-01  6.411e-02
factor(movie_train$genres)Documentary                         1.301e+00  1.632e-01
factor(movie_train$genres)Drama                               4.838e-01  4.936e-02
factor(movie_train$genres)Family                              3.629e-01  4.341e-01
factor(movie_train$genres)Fantasy                            -2.164e-01  1.422e-01
factor(movie_train$genres)Horror                             -3.858e-01  8.014e-02
factor(movie_train$genres)Musical                            -2.386e-03  7.394e-01
factor(movie_train$genres)Mystery                             1.625e-01  2.001e-01
factor(movie_train$genres)Romance                             8.988e-01  7.370e-01
factor(movie_train$genres)Sci-Fi                             -5.665e-02  3.316e-01
factor(movie_train$genres)Thriller                           -4.149e-01  7.376e-01
factor(movie_train$genres)Western                             9.080e-01  7.366e-01
movie_train$duration                                                 NA         NA
movie_train$num_voted_users                                          NA         NA
movie_train$num_user_for_reviews                                     NA         NA
movie_train$gross                                                    NA         NA
movie_train$budget                                                   NA         NA
movie_train$duration:movie_train$num_voted_users             -2.160e-08  3.828e-09
movie_train$num_voted_users:movie_train$num_user_for_reviews  7.634e-10  2.962e-10
movie_train$gross:movie_train$budget                          1.241e-17  5.615e-18
                                                             t value Pr(>|t|)    
(Intercept)                                                   13.768  < 2e-16 ***
poly(movie_train$num_voted_users, 2)1                          9.256  < 2e-16 ***
poly(movie_train$num_voted_users, 2)2                         -6.942 4.83e-12 ***
poly(movie_train$num_critic_for_reviews, 2)1                  10.362  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2                  -8.483  < 2e-16 ***
poly(movie_train$num_user_for_reviews, 2)1                    -7.233 6.14e-13 ***
poly(movie_train$num_user_for_reviews, 2)2                     2.645  0.00821 ** 
poly(movie_train$duration, 2)1                                13.349  < 2e-16 ***
poly(movie_train$duration, 2)2                                -4.401 1.12e-05 ***
movie_train$facenumber_in_poster                              -3.229  0.00126 ** 
poly(movie_train$gross, 2)1                                   -2.881  0.00399 ** 
poly(movie_train$gross, 2)2                                   -1.571  0.11626    
poly(movie_train$budget, 2)1                                  -7.070 1.97e-12 ***
poly(movie_train$budget, 2)2                                   5.168 2.54e-07 ***
movie_train$title_year                                       -12.062  < 2e-16 ***
factor(movie_train$genres)Adventure                            6.430 1.51e-10 ***
factor(movie_train$genres)Animation                            5.278 1.41e-07 ***
factor(movie_train$genres)Biography                            8.193 3.92e-16 ***
factor(movie_train$genres)Comedy                               3.229  0.00126 ** 
factor(movie_train$genres)Crime                                7.031 2.59e-12 ***
factor(movie_train$genres)Documentary                          7.976 2.22e-15 ***
factor(movie_train$genres)Drama                                9.800  < 2e-16 ***
factor(movie_train$genres)Family                               0.836  0.40327    
factor(movie_train$genres)Fantasy                             -1.522  0.12823    
factor(movie_train$genres)Horror                              -4.814 1.56e-06 ***
factor(movie_train$genres)Musical                             -0.003  0.99743    
factor(movie_train$genres)Mystery                              0.812  0.41675    
factor(movie_train$genres)Romance                              1.220  0.22276    
factor(movie_train$genres)Sci-Fi                              -0.171  0.86437    
factor(movie_train$genres)Thriller                            -0.563  0.57382    
factor(movie_train$genres)Western                              1.233  0.21784    
movie_train$duration                                              NA       NA    
movie_train$num_voted_users                                       NA       NA    
movie_train$num_user_for_reviews                                  NA       NA    
movie_train$gross                                                 NA       NA    
movie_train$budget                                                NA       NA    
movie_train$duration:movie_train$num_voted_users              -5.643 1.85e-08 ***
movie_train$num_voted_users:movie_train$num_user_for_reviews   2.578  0.01000 *  
movie_train$gross:movie_train$budget                           2.210  0.02721 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7349 on 2670 degrees of freedom
Multiple R-squared:  0.5204,    Adjusted R-squared:  0.5145 
F-statistic:  87.8 on 33 and 2670 DF,  p-value: < 2.2e-16

The second-order term for 'gross' is not significant (p = 0.116) and can be dropped. The second-order term for 'num_user_for_reviews' is significant (p = 0.008) but is one of the weaker terms, so we try dropping it as well. The interaction between 'gross' and 'budget' is only marginally significant (p = 0.027) and could also be dropped.
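The note "5 not defined because of singularities" and the NA rows in the summary above come from the formula itself rather than from the data. `a * b` expands to `a + b + a:b`, and the plain linear terms `a` and `b` are already contained in `poly(a, 2)` and `poly(b, 2)`, so R cannot estimate them separately. A small sketch on simulated data (the data frame `d` and its columns are hypothetical) showing that `a:b` gives the same fit without the NA rows:

```r
# Simulated data with two numeric predictors (hypothetical).
set.seed(7)
d   <- data.frame(a = rnorm(100), b = rnorm(100))
d$y <- 1 + d$a + 0.5 * d$a^2 + d$b + 0.3 * d$a * d$b + rnorm(100, sd = 0.2)

# a * b adds a, b and a:b; a and b duplicate the linear parts of poly().
fit_star  <- lm(y ~ poly(a, 2) + poly(b, 2) + a * b, data = d)

# a:b adds only the product term, so nothing is duplicated.
fit_colon <- lm(y ~ poly(a, 2) + poly(b, 2) + a:b, data = d)

names(coef(fit_star))[is.na(coef(fit_star))]     # the aliased terms
anyNA(coef(fit_colon))                           # no aliased terms here
all.equal(fitted(fit_star), fitted(fit_colon))   # same fitted values
```

Since both formulas span the same columns, the fitted values and R^2 are unchanged; using `:` only removes the redundant rows from the coefficient table.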

# lm.fit3: based on lm.fit2, dropping the second-order terms for 'num_user_for_reviews' and 'gross' and the gross*budget interaction.
lm.fit3<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit3)

Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 
    2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
    poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year + 
    factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users + 
    movie_train$num_voted_users * movie_train$num_user_for_reviews)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9364 -0.3471  0.0698  0.4640  2.1996 

Coefficients: (2 not defined because of singularities)
                                                               Estimate Std. Error
(Intercept)                                                   5.065e+01  3.751e+00
poly(movie_train$num_voted_users, 2)1                         4.063e+01  4.606e+00
poly(movie_train$num_voted_users, 2)2                        -1.754e+01  2.060e+00
poly(movie_train$num_critic_for_reviews, 2)1                  1.293e+01  1.297e+00
poly(movie_train$num_critic_for_reviews, 2)2                 -6.712e+00  8.256e-01
movie_train$num_user_for_reviews                             -9.081e-04  9.331e-05
poly(movie_train$duration, 2)1                                1.423e+01  1.081e+00
poly(movie_train$duration, 2)2                               -3.319e+00  7.730e-01
movie_train$facenumber_in_poster                             -2.186e-02  7.105e-03
movie_train$gross                                            -5.207e-10  3.186e-10
poly(movie_train$budget, 2)1                                 -1.064e+01  1.170e+00
poly(movie_train$budget, 2)2                                  7.172e+00  8.080e-01
movie_train$title_year                                       -2.195e-02  1.875e-03
factor(movie_train$genres)Adventure                           3.626e-01  5.497e-02
factor(movie_train$genres)Animation                           7.416e-01  1.384e-01
factor(movie_train$genres)Biography                           6.371e-01  7.624e-02
factor(movie_train$genres)Comedy                              1.464e-01  4.395e-02
factor(movie_train$genres)Crime                               4.605e-01  6.406e-02
factor(movie_train$genres)Documentary                         1.306e+00  1.634e-01
factor(movie_train$genres)Drama                               4.869e-01  4.938e-02
factor(movie_train$genres)Family                              1.928e-01  4.282e-01
factor(movie_train$genres)Fantasy                            -2.218e-01  1.422e-01
factor(movie_train$genres)Horror                             -3.893e-01  8.007e-02
factor(movie_train$genres)Musical                            -8.817e-02  7.393e-01
factor(movie_train$genres)Mystery                             1.555e-01  2.002e-01
factor(movie_train$genres)Romance                             8.997e-01  7.382e-01
factor(movie_train$genres)Sci-Fi                             -7.984e-02  3.319e-01
factor(movie_train$genres)Thriller                           -4.206e-01  7.387e-01
factor(movie_train$genres)Western                             9.181e-01  7.378e-01
movie_train$duration                                                 NA         NA
movie_train$num_voted_users                                          NA         NA
movie_train$duration:movie_train$num_voted_users             -2.127e-08  3.785e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users  1.371e-09  2.084e-10
                                                             t value Pr(>|t|)    
(Intercept)                                                   13.502  < 2e-16 ***
poly(movie_train$num_voted_users, 2)1                          8.821  < 2e-16 ***
poly(movie_train$num_voted_users, 2)2                         -8.513  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1                   9.972  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2                  -8.130 6.50e-16 ***
movie_train$num_user_for_reviews                              -9.732  < 2e-16 ***
poly(movie_train$duration, 2)1                                13.165  < 2e-16 ***
poly(movie_train$duration, 2)2                                -4.294 1.81e-05 ***
movie_train$facenumber_in_poster                              -3.077 0.002114 ** 
movie_train$gross                                             -1.634 0.102366    
poly(movie_train$budget, 2)1                                  -9.090  < 2e-16 ***
poly(movie_train$budget, 2)2                                   8.877  < 2e-16 ***
movie_train$title_year                                       -11.707  < 2e-16 ***
factor(movie_train$genres)Adventure                            6.596 5.09e-11 ***
factor(movie_train$genres)Animation                            5.357 9.17e-08 ***
factor(movie_train$genres)Biography                            8.356  < 2e-16 ***
factor(movie_train$genres)Comedy                               3.330 0.000879 ***
factor(movie_train$genres)Crime                                7.189 8.44e-13 ***
factor(movie_train$genres)Documentary                          7.995 1.92e-15 ***
factor(movie_train$genres)Drama                                9.861  < 2e-16 ***
factor(movie_train$genres)Family                               0.450 0.652584    
factor(movie_train$genres)Fantasy                             -1.559 0.118999    
factor(movie_train$genres)Horror                              -4.862 1.23e-06 ***
factor(movie_train$genres)Musical                             -0.119 0.905077    
factor(movie_train$genres)Mystery                              0.777 0.437387    
factor(movie_train$genres)Romance                              1.219 0.223045    
factor(movie_train$genres)Sci-Fi                              -0.241 0.809940    
factor(movie_train$genres)Thriller                            -0.569 0.569165    
factor(movie_train$genres)Western                              1.244 0.213456    
movie_train$duration                                              NA       NA    
movie_train$num_voted_users                                       NA       NA    
movie_train$duration:movie_train$num_voted_users              -5.621 2.10e-08 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users   6.581 5.61e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7361 on 2673 degrees of freedom
Multiple R-squared:  0.5183,    Adjusted R-squared:  0.5129 
F-statistic: 95.88 on 30 and 2673 DF,  p-value: < 2.2e-16
anova(lm.fit2,lm.fit3) 
Analysis of Variance Table

Model 1: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) + 
    poly(movie_train$num_critic_for_reviews, 2) + poly(movie_train$num_user_for_reviews, 
    2) + poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    poly(movie_train$gross, 2) + poly(movie_train$budget, 2) + 
    movie_train$title_year + factor(movie_train$genres) + movie_train$duration * 
    movie_train$num_voted_users + movie_train$num_voted_users * 
    movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Model 2: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) + 
    poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
    poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year + 
    factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users + 
    movie_train$num_voted_users * movie_train$num_user_for_reviews
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1   2670 1442.2                                
2   2673 1448.5 -3   -6.2905 3.8821 0.008792 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

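The p-value of the partial F-test is 0.0088, so at the 0.05 level we reject the hypothesis that the three dropped terms are jointly zero: strictly speaking, lm.fit2 fits significantly better than lm.fit3. The practical loss is small, however. Adjusted R^2 for lm.fit3 is 0.5129 (versus 0.5145 for lm.fit2), so about 51% of the variation is still explained, and we continue with the simpler lm.fit3.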
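The adjusted R^2 values printed by `summary()` for the two models can be checked from the multiple R^2, the number of training rows and the residual degrees of freedom. The helper `adj_r2` below is our own small function, not part of the original analysis; the inputs are copied from the two summaries above.

```r
# Adjusted R^2 from multiple R^2, sample size n and residual degrees of freedom.
adj_r2 <- function(r2, n, df_resid) 1 - (1 - r2) * (n - 1) / df_resid

# Training rows: residual df + model df + 1, e.g. 2673 + 30 + 1 for lm.fit3.
n <- 2704

adj_fit2 <- adj_r2(0.5204, n, 2670)   # lm.fit2
adj_fit3 <- adj_r2(0.5183, n, 2673)   # lm.fit3

round(c(lm.fit2 = adj_fit2, lm.fit3 = adj_fit3), 4)
```

The penalty for the three extra coefficients in lm.fit2 is small at this sample size, which is why the two adjusted values differ by less than 0.002.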

Diagnostics for lm.fit3:

plot(lm.fit3)
not plotting observations with leverage one:
  420, 1144, 1734


NaNs produced

library(car)
residualPlots(lm.fit3)

                                            Test stat Pr(>|t|)
poly(movie_train$num_voted_users, 2)               NA       NA
poly(movie_train$num_critic_for_reviews, 2)        NA       NA
movie_train$num_user_for_reviews                2.599    0.009
poly(movie_train$duration, 2)                      NA       NA
movie_train$facenumber_in_poster                0.519    0.604
movie_train$gross                              -0.112    0.911
poly(movie_train$budget, 2)                        NA       NA
movie_train$title_year                         -5.952    0.000
factor(movie_train$genres)                         NA       NA
movie_train$duration                            1.084    0.279
movie_train$num_voted_users                     0.031    0.975
Tukey test                                    -12.886    0.000

The plots look much better than before. The residuals vs. predictors are roughly straight lines, except for title year (p < 0.001) and, to a lesser extent, num_user_for_reviews (p = 0.009). So let's try adding a second-order term for title year.

# lm.fit4: based on lm.fit3, adding a second-order term for title year.
lm.fit4<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+poly(movie_train$title_year,2)+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit4)

Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 
    2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
    poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    movie_train$gross + poly(movie_train$budget, 2) + poly(movie_train$title_year, 
    2) + factor(movie_train$genres) + movie_train$duration * 
    movie_train$num_voted_users + movie_train$num_voted_users * 
    movie_train$num_user_for_reviews)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9397 -0.3545  0.0567  0.4631  2.1996 

Coefficients: (2 not defined because of singularities)
                                                               Estimate Std. Error
(Intercept)                                                   6.692e+00  6.411e-02
poly(movie_train$num_voted_users, 2)1                         3.734e+01  4.610e+00
poly(movie_train$num_voted_users, 2)2                        -1.777e+01  2.047e+00
poly(movie_train$num_critic_for_reviews, 2)1                  1.700e+01  1.459e+00
poly(movie_train$num_critic_for_reviews, 2)2                 -7.554e+00  8.325e-01
movie_train$num_user_for_reviews                             -1.026e-03  9.482e-05
poly(movie_train$duration, 2)1                                1.399e+01  1.075e+00
poly(movie_train$duration, 2)2                               -3.164e+00  7.685e-01
movie_train$facenumber_in_poster                             -1.768e-02  7.095e-03
movie_train$gross                                            -4.659e-10  3.167e-10
poly(movie_train$budget, 2)1                                 -1.071e+01  1.163e+00
poly(movie_train$budget, 2)2                                  7.614e+00  8.063e-01
poly(movie_train$title_year, 2)1                             -1.304e+01  1.001e+00
poly(movie_train$title_year, 2)2                             -5.074e+00  8.525e-01
factor(movie_train$genres)Adventure                           3.701e-01  5.463e-02
factor(movie_train$genres)Animation                           7.997e-01  1.379e-01
factor(movie_train$genres)Biography                           6.375e-01  7.575e-02
factor(movie_train$genres)Comedy                              1.515e-01  4.367e-02
factor(movie_train$genres)Crime                               4.648e-01  6.365e-02
factor(movie_train$genres)Documentary                         1.368e+00  1.627e-01
factor(movie_train$genres)Drama                               4.963e-01  4.909e-02
factor(movie_train$genres)Family                              1.581e-01  4.255e-01
factor(movie_train$genres)Fantasy                            -2.713e-01  1.416e-01
factor(movie_train$genres)Horror                             -4.080e-01  7.962e-02
factor(movie_train$genres)Musical                            -1.438e-01  7.347e-01
factor(movie_train$genres)Mystery                             1.683e-01  1.990e-01
factor(movie_train$genres)Romance                             9.931e-01  7.336e-01
factor(movie_train$genres)Sci-Fi                             -9.474e-02  3.298e-01
factor(movie_train$genres)Thriller                           -2.269e-01  7.348e-01
factor(movie_train$genres)Western                             8.717e-01  7.331e-01
movie_train$duration                                                 NA         NA
movie_train$num_voted_users                                          NA         NA
movie_train$duration:movie_train$num_voted_users             -2.002e-08  3.767e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users  1.495e-09  2.081e-10
                                                             t value Pr(>|t|)    
(Intercept)                                                  104.375  < 2e-16 ***
poly(movie_train$num_voted_users, 2)1                          8.101 8.21e-16 ***
poly(movie_train$num_voted_users, 2)2                         -8.680  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1                  11.655  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2                  -9.075  < 2e-16 ***
movie_train$num_user_for_reviews                             -10.824  < 2e-16 ***
poly(movie_train$duration, 2)1                                13.019  < 2e-16 ***
poly(movie_train$duration, 2)2                                -4.117 3.95e-05 ***
movie_train$facenumber_in_poster                              -2.492 0.012780 *  
movie_train$gross                                             -1.471 0.141436    
poly(movie_train$budget, 2)1                                  -9.212  < 2e-16 ***
poly(movie_train$budget, 2)2                                   9.443  < 2e-16 ***
poly(movie_train$title_year, 2)1                             -13.028  < 2e-16 ***
poly(movie_train$title_year, 2)2                              -5.952 3.00e-09 ***
factor(movie_train$genres)Adventure                            6.774 1.53e-11 ***
factor(movie_train$genres)Animation                            5.800 7.43e-09 ***
factor(movie_train$genres)Biography                            8.416  < 2e-16 ***
factor(movie_train$genres)Comedy                               3.469 0.000531 ***
factor(movie_train$genres)Crime                                7.302 3.73e-13 ***
factor(movie_train$genres)Documentary                          8.411  < 2e-16 ***
factor(movie_train$genres)Drama                               10.109  < 2e-16 ***
factor(movie_train$genres)Family                               0.372 0.710217    
factor(movie_train$genres)Fantasy                             -1.916 0.055431 .  
factor(movie_train$genres)Horror                              -5.125 3.19e-07 ***
factor(movie_train$genres)Musical                             -0.196 0.844810    
factor(movie_train$genres)Mystery                              0.846 0.397677    
factor(movie_train$genres)Romance                              1.354 0.175968    
factor(movie_train$genres)Sci-Fi                              -0.287 0.773958    
factor(movie_train$genres)Thriller                            -0.309 0.757503    
factor(movie_train$genres)Western                              1.189 0.234512    
movie_train$duration                                              NA       NA    
movie_train$num_voted_users                                       NA       NA    
movie_train$duration:movie_train$num_voted_users              -5.316 1.15e-07 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users   7.187 8.57e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7314 on 2672 degrees of freedom
Multiple R-squared:  0.5246,    Adjusted R-squared:  0.5191 
F-statistic: 95.13 on 31 and 2672 DF,  p-value: < 2.2e-16
anova(lm.fit4,lm.fit3)
Analysis of Variance Table

Model 1: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) + 
    poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
    poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    movie_train$gross + poly(movie_train$budget, 2) + poly(movie_train$title_year, 
    2) + factor(movie_train$genres) + movie_train$duration * 
    movie_train$num_voted_users + movie_train$num_voted_users * 
    movie_train$num_user_for_reviews
Model 2: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) + 
    poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
    poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year + 
    factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users + 
    movie_train$num_voted_users * movie_train$num_user_for_reviews
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   2672 1429.5                                  
2   2673 1448.5 -1   -18.952 35.425 2.997e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

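The p-value is very small, so we reject the null hypothesis that the two models fit equally well: adding the second-order term for title year significantly reduces the residual sum of squares (from 1448.5 to 1429.5). The gain is small in absolute terms, and the diagnostics below continue with lm.fit3.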
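
As a rough check of how this F test is computed, the sketch below first redoes the arithmetic from the ANOVA table above and then compares a hand-computed F statistic with anova() on simulated data. The simulated data frame d and the names fit_small and fit_big are made up for illustration and are not part of the movie analysis.

```r
# Arithmetic from the table above: Sum of Sq = 18.952 on 1 df,
# RSS of the larger model = 1429.5 on 2672 residual df
f_table <- (18.952 / 1) / (1429.5 / 2672)
p_table <- pf(f_table, 1, 2672, lower.tail = FALSE)

# Same test on simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
d <- data.frame(x = x, y = 2 + 0.5 * x - 0.08 * x^2 + rnorm(n, sd = 0.5))

fit_small <- lm(y ~ x, data = d)           # reduced model (null hypothesis)
fit_big   <- lm(y ~ poly(x, 2), data = d)  # adds the second-order term

rss_small <- sum(resid(fit_small)^2)
rss_big   <- sum(resid(fit_big)^2)
df_diff   <- df.residual(fit_small) - df.residual(fit_big)

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
f_hand <- ((rss_small - rss_big) / df_diff) / (rss_big / df.residual(fit_big))
p_hand <- pf(f_hand, df_diff, df.residual(fit_big), lower.tail = FALSE)

tab <- anova(fit_small, fit_big)
```

A small p-value here is evidence against the smaller model, that is, evidence that the extra term does improve the fit.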

Marginal Model plot:

marginalModelPlots(lm.fit3)
Splines and/or polynomials replaced by a fitted linear combination

The plots of the response versus the individual predictors display the conditional distribution of the response given each predictor, ignoring the other predictors. In our plots, the smooth of the data and the smooth of the fitted values nearly overlap for each predictor, which suggests that the model reproduces the marginal relationships well.
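
To make the idea of "overlapping" curves concrete, here is a minimal base R sketch of the comparison a marginal model plot draws: one smooth through the observed response and one through the fitted values, both against the same predictor. It uses simulated data and a hypothetical helper marginal_gap(); it is not how car draws its plots, only an approximation of the idea using lowess().

```r
set.seed(2)
n <- 300
x <- runif(n, -2, 2)
y <- 1 + 2 * x + x^2 + rnorm(n, sd = 0.5)  # true relationship is quadratic

fit_lin  <- lm(y ~ x)            # misses the curvature
fit_quad <- lm(y ~ x + I(x^2))   # matches the true form

# Largest vertical gap between the smooth of the data and
# the smooth of the fitted values (hypothetical helper)
marginal_gap <- function(fit, x, y) {
  s_data  <- lowess(x, y)            # smooth of response vs predictor
  s_model <- lowess(x, fitted(fit))  # smooth of fitted values vs predictor
  max(abs(s_data$y - s_model$y))
}

gap_lin  <- marginal_gap(fit_lin, x, y)
gap_quad <- marginal_gap(fit_quad, x, y)
```

When the model is adequate the two smooths lie close together, so gap_quad should be much smaller than gap_lin.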

Check for residual outliers:

library(car)
qqPlot(lm.fit3$residuals,id.n = 10)
 249 2278 1372 2065  469 2252  707 2639  718 1203 
   1    2    3    4    5    6    7    8    9   10 

outlierTest(lm.fit3) # H0: residual is not an outlier
      rstudent unadjusted p-value Bonferonni p
249  -5.383817         7.9263e-08   0.00021409
2278 -5.318779         1.1312e-07   0.00030552
1372 -4.913107         9.5056e-07   0.00256750
2065 -4.812054         1.5770e-06   0.00425940
469  -4.757227         2.0670e-06   0.00558310
2252 -4.679599         3.0176e-06   0.00815060
707  -4.557735         5.4033e-06   0.01459400
2639 -4.524189         6.3276e-06   0.01709100
718  -4.461174         8.4883e-06   0.02292700
1203 -4.426985         9.9395e-06   0.02684700

All 10 residuals have Bonferroni-adjusted p-values below 0.05, so the test flags them as outliers and they are candidates for removal.
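
The Bonferroni column is the unadjusted p-value multiplied by the number of residuals tested. The sketch below redoes that arithmetic with base R's p.adjust(). The value 2701 for the number of tests is an assumption: it is inferred from the ratio of the two columns in the output above and is not reported by the test itself.

```r
# Values copied from the outlierTest output above
p_unadj <- c(7.9263e-08, 1.1312e-07, 9.5056e-07, 1.5770e-06, 2.0670e-06,
             3.0176e-06, 5.4033e-06, 6.3276e-06, 8.4883e-06, 9.9395e-06)
p_bonf_reported <- c(0.00021409, 0.00030552, 0.00256750, 0.00425940, 0.00558310,
                     0.00815060, 0.01459400, 0.01709100, 0.02292700, 0.02684700)

# Assumed number of tested residuals (inferred from the ratio of the columns)
n_tests <- 2701

# Bonferroni adjustment: min(1, p * n_tests)
p_bonf <- p.adjust(p_unadj, method = "bonferroni", n = n_tests)

# Largest relative difference between recomputed and reported values
max_rel_diff <- max(abs(p_bonf - p_bonf_reported) / p_bonf_reported)
```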

Before we drop them, let’s run some diagnostics to double-check which observations to drop.

library(car)
influencePlot(lm.fit3, id.n=10)

From the influence plot, we decided to drop observations: 2572,1423,860,1520,509,682,1017,848,361,237
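
An influence plot combines three quantities per observation: leverage (hat value), studentized residual, and Cook's distance. As a sketch of how the same quantities can be pulled out and screened numerically, the example below uses only base R on simulated data with one planted bad point. The cut-offs are common rules of thumb and are assumptions, not the criteria used to pick the observations above.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

# Plant one high-leverage, badly fitting point (hypothetical)
x[n] <- 6
y[n] <- -5

fit <- lm(y ~ x)

infl <- data.frame(
  hat      = hatvalues(fit),       # leverage
  rstudent = rstudent(fit),        # studentized residual
  cook     = cooks.distance(fit)   # overall influence on the fit
)

# Rule-of-thumb cut-offs (assumptions, not hard limits)
p <- length(coef(fit))
flagged <- which(infl$hat > 2 * p / n |
                 abs(infl$rstudent) > 2 |
                 infl$cook > 4 / n)

worst <- which.max(infl$cook)  # index of the most influential observation
```

Looking at the flagged rows in the original data before deleting them helps separate data errors from unusual but valid movies.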

# lm.fit5: model based on lm.fit3 removing 10 outliers.
movie_train<-movie_train[-c(2572,1423,860,1520,509,682,1017,848,361,237),]
lm.fit5<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit5)

Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 
    2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
    poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + 
    movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year + 
    factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users + 
    movie_train$num_voted_users * movie_train$num_user_for_reviews)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9377 -0.3441  0.0703  0.4639  2.1885 

Coefficients: (2 not defined because of singularities)
                                                               Estimate Std. Error
(Intercept)                                                   5.037e+01  3.754e+00
poly(movie_train$num_voted_users, 2)1                         4.032e+01  4.606e+00
poly(movie_train$num_voted_users, 2)2                        -1.754e+01  2.059e+00
poly(movie_train$num_critic_for_reviews, 2)1                  1.278e+01  1.296e+00
poly(movie_train$num_critic_for_reviews, 2)2                 -6.747e+00  8.258e-01
movie_train$num_user_for_reviews                             -9.099e-04  9.338e-05
poly(movie_train$duration, 2)1                                1.410e+01  1.080e+00
poly(movie_train$duration, 2)2                               -3.261e+00  7.731e-01
movie_train$facenumber_in_poster                             -2.187e-02  7.115e-03
movie_train$gross                                            -4.939e-10  3.197e-10
poly(movie_train$budget, 2)1                                 -1.069e+01  1.170e+00
poly(movie_train$budget, 2)2                                  7.188e+00  8.082e-01
movie_train$title_year                                       -2.181e-02  1.876e-03
factor(movie_train$genres)Adventure                           3.522e-01  5.521e-02
factor(movie_train$genres)Animation                           7.323e-01  1.385e-01
factor(movie_train$genres)Biography                           6.313e-01  7.649e-02
factor(movie_train$genres)Comedy                              1.386e-01  4.410e-02
factor(movie_train$genres)Crime                               4.545e-01  6.414e-02
factor(movie_train$genres)Documentary                         1.297e+00  1.634e-01
factor(movie_train$genres)Drama                               4.814e-01  4.951e-02
factor(movie_train$genres)Family                              1.832e-01  4.283e-01
factor(movie_train$genres)Fantasy                            -2.278e-01  1.423e-01
factor(movie_train$genres)Horror                             -3.967e-01  8.017e-02
factor(movie_train$genres)Musical                            -9.687e-02  7.394e-01
factor(movie_train$genres)Mystery                             1.504e-01  2.003e-01
factor(movie_train$genres)Romance                             8.888e-01  7.383e-01
factor(movie_train$genres)Sci-Fi                             -8.399e-02  3.320e-01
factor(movie_train$genres)Thriller                           -4.309e-01  7.388e-01
factor(movie_train$genres)Western                             9.130e-01  7.378e-01
movie_train$duration                                                 NA         NA
movie_train$num_voted_users                                          NA         NA
movie_train$duration:movie_train$num_voted_users             -2.105e-08  3.787e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users  1.374e-09  2.084e-10
                                                             t value Pr(>|t|)    
(Intercept)                                                   13.418  < 2e-16 ***
poly(movie_train$num_voted_users, 2)1                          8.754  < 2e-16 ***
poly(movie_train$num_voted_users, 2)2                         -8.522  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1                   9.862  < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2                  -8.171 4.68e-16 ***
movie_train$num_user_for_reviews                              -9.744  < 2e-16 ***
poly(movie_train$duration, 2)1                                13.061  < 2e-16 ***
poly(movie_train$duration, 2)2                                -4.219 2.54e-05 ***
movie_train$facenumber_in_poster                              -3.074  0.00213 ** 
movie_train$gross                                             -1.545  0.12246    
poly(movie_train$budget, 2)1                                  -9.133  < 2e-16 ***
poly(movie_train$budget, 2)2                                   8.895  < 2e-16 ***
movie_train$title_year                                       -11.624  < 2e-16 ***
factor(movie_train$genres)Adventure                            6.380 2.08e-10 ***
factor(movie_train$genres)Animation                            5.288 1.34e-07 ***
factor(movie_train$genres)Biography                            8.254 2.38e-16 ***
factor(movie_train$genres)Comedy                               3.143  0.00169 ** 
factor(movie_train$genres)Crime                                7.087 1.75e-12 ***
factor(movie_train$genres)Documentary                          7.935 3.06e-15 ***
factor(movie_train$genres)Drama                                9.723  < 2e-16 ***
factor(movie_train$genres)Family                               0.428  0.66892    
factor(movie_train$genres)Fantasy                             -1.601  0.10950    
factor(movie_train$genres)Horror                              -4.948 7.98e-07 ***
factor(movie_train$genres)Musical                             -0.131  0.89578    
factor(movie_train$genres)Mystery                              0.751  0.45270    
factor(movie_train$genres)Romance                              1.204  0.22873    
factor(movie_train$genres)Sci-Fi                              -0.253  0.80029    
factor(movie_train$genres)Thriller                            -0.583  0.55979    
factor(movie_train$genres)Western                              1.237  0.21608    
movie_train$duration                                              NA       NA    
movie_train$num_voted_users                                       NA       NA    
movie_train$duration:movie_train$num_voted_users              -5.557 3.01e-08 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users   6.593 5.16e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7362 on 2663 degrees of freedom
Multiple R-squared:  0.5169,    Adjusted R-squared:  0.5114 
F-statistic: 94.97 on 30 and 2663 DF,  p-value: < 2.2e-16
compareCoefs(lm.fit3, lm.fit5)

Call:
1: lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) + 
  poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
  poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + movie_train$gross 
  + poly(movie_train$budget, 2) + movie_train$title_year + factor(movie_train$genres) + 
  movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users * 
  movie_train$num_user_for_reviews)
2: lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) + 
  poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews + 
  poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + movie_train$gross 
  + poly(movie_train$budget, 2) + movie_train$title_year + factor(movie_train$genres) + 
  movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users * 
  movie_train$num_user_for_reviews)
                                                                Est. 1      SE 1
(Intercept)                                                   5.06e+01  3.75e+00
poly(movie_train$num_voted_users, 2)1                         4.06e+01  4.61e+00
poly(movie_train$num_voted_users, 2)2                        -1.75e+01  2.06e+00
poly(movie_train$num_critic_for_reviews, 2)1                  1.29e+01  1.30e+00
poly(movie_train$num_critic_for_reviews, 2)2                 -6.71e+00  8.26e-01
movie_train$num_user_for_reviews                             -9.08e-04  9.33e-05
poly(movie_train$duration, 2)1                                1.42e+01  1.08e+00
poly(movie_train$duration, 2)2                               -3.32e+00  7.73e-01
movie_train$facenumber_in_poster                             -2.19e-02  7.11e-03
movie_train$gross                                            -5.21e-10  3.19e-10
poly(movie_train$budget, 2)1                                 -1.06e+01  1.17e+00
poly(movie_train$budget, 2)2                                  7.17e+00  8.08e-01
movie_train$title_year                                       -2.19e-02  1.87e-03
factor(movie_train$genres)Adventure                           3.63e-01  5.50e-02
factor(movie_train$genres)Animation                           7.42e-01  1.38e-01
factor(movie_train$genres)Biography                           6.37e-01  7.62e-02
factor(movie_train$genres)Comedy                              1.46e-01  4.39e-02
factor(movie_train$genres)Crime                               4.61e-01  6.41e-02
factor(movie_train$genres)Documentary                         1.31e+00  1.63e-01
factor(movie_train$genres)Drama                               4.87e-01  4.94e-02
factor(movie_train$genres)Family                              1.93e-01  4.28e-01
factor(movie_train$genres)Fantasy                            -2.22e-01  1.42e-01
factor(movie_train$genres)Horror                             -3.89e-01  8.01e-02
factor(movie_train$genres)Musical                            -8.82e-02  7.39e-01
factor(movie_train$genres)Mystery                             1.56e-01  2.00e-01
factor(movie_train$genres)Romance                             9.00e-01  7.38e-01
factor(movie_train$genres)Sci-Fi                             -7.98e-02  3.32e-01
factor(movie_train$genres)Thriller                           -4.21e-01  7.39e-01
factor(movie_train$genres)Western                             9.18e-01  7.38e-01
movie_train$duration                                                            
movie_train$num_voted_users                                                     
movie_train$duration:movie_train$num_voted_users             -2.13e-08  3.78e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users  1.37e-09  2.08e-10
                                                                Est. 2      SE 2
(Intercept)                                                   5.04e+01  3.75e+00
poly(movie_train$num_voted_users, 2)1                         4.03e+01  4.61e+00
poly(movie_train$num_voted_users, 2)2                        -1.75e+01  2.06e+00
poly(movie_train$num_critic_for_reviews, 2)1                  1.28e+01  1.30e+00
poly(movie_train$num_critic_for_reviews, 2)2                 -6.75e+00  8.26e-01
movie_train$num_user_for_reviews                             -9.10e-04  9.34e-05
poly(movie_train$duration, 2)1                                1.41e+01  1.08e+00
poly(movie_train$duration, 2)2                               -3.26e+00  7.73e-01
movie_train$facenumber_in_poster                             -2.19e-02  7.12e-03
movie_train$gross                                            -4.94e-10  3.20e-10
poly(movie_train$budget, 2)1                                 -1.07e+01  1.17e+00
poly(movie_train$budget, 2)2                                  7.19e+00  8.08e-01
movie_train$title_year                                       -2.18e-02  1.88e-03
factor(movie_train$genres)Adventure                           3.52e-01  5.52e-02
factor(movie_train$genres)Animation                           7.32e-01  1.38e-01
factor(movie_train$genres)Biography                           6.31e-01  7.65e-02
factor(movie_train$genres)Comedy                              1.39e-01  4.41e-02
factor(movie_train$genres)Crime                               4.55e-01  6.41e-02
factor(movie_train$genres)Documentary                         1.30e+00  1.63e-01
factor(movie_train$genres)Drama                               4.81e-01  4.95e-02
factor(movie_train$genres)Family                              1.83e-01  4.28e-01
factor(movie_train$genres)Fantasy                            -2.28e-01  1.42e-01
factor(movie_train$genres)Horror                             -3.97e-01  8.02e-02
factor(movie_train$genres)Musical                            -9.69e-02  7.39e-01
factor(movie_train$genres)Mystery                             1.50e-01  2.00e-01
factor(movie_train$genres)Romance                             8.89e-01  7.38e-01
factor(movie_train$genres)Sci-Fi                             -8.40e-02  3.32e-01
factor(movie_train$genres)Thriller                           -4.31e-01  7.39e-01
factor(movie_train$genres)Western                             9.13e-01  7.38e-01
movie_train$duration                                                            
movie_train$num_voted_users                                                     
movie_train$duration:movie_train$num_voted_users             -2.10e-08  3.79e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users  1.37e-09  2.08e-10

Removing the 10 observations changed the coefficient estimates and standard errors only slightly.

Diagnostics for lm.fit5:

library(car)
residualPlots(lm.fit5)

                                            Test stat Pr(>|t|)
poly(movie_train$num_voted_users, 2)               NA       NA
poly(movie_train$num_critic_for_reviews, 2)        NA       NA
movie_train$num_user_for_reviews                2.623    0.009
poly(movie_train$duration, 2)                      NA       NA
movie_train$facenumber_in_poster                0.531    0.595
movie_train$gross                              -0.110    0.913
poly(movie_train$budget, 2)                        NA       NA
movie_train$title_year                         -5.901    0.000
factor(movie_train$genres)                         NA       NA
movie_train$duration                            0.529    0.597
movie_train$num_voted_users                     0.699    0.484
Tukey test                                    -12.781    0.000

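The curvature tests are non-significant for most predictors, but they are significant for num_user_for_reviews (p = 0.009) and title_year (p < 0.001), and the Tukey test (p < 0.001) confirms that the residuals vs. fitted values plot shows some curvature.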

plot(lm.fit5)
not plotting observations with leverage one:
  418

NaNs produced

Now, let’s check the model assumptions for both lm.fit3 and lm.fit5:

# normality
shapiro.test(lm.fit3$residuals)

    Shapiro-Wilk normality test

data:  lm.fit3$residuals
W = 0.9464, p-value < 2.2e-16
shapiro.test(lm.fit5$residuals)

    Shapiro-Wilk normality test

data:  lm.fit5$residuals
W = 0.9461, p-value < 2.2e-16

Both models fail the normality test (p < 2.2e-16). I think this is due to the many outliers in the data set: the residuals have a long left tail (a minimum of about -3.9 versus a maximum of about 2.2).

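# equal variance: H0: the error variance is constant
ncvTest(lm.fit3)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 165.6577    Df = 1     p = 6.571054e-38 
ncvTest(lm.fit5)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 165.6577    Df = 1     p = 6.571054e-38 

Both p-values are far below 0.05, so we reject the null hypothesis of constant variance: both models fail the equal variance assumption. The two outputs are identical, probably because lm.fit3 was written with movie_train$ terms, so any refit inside the test would use the movie_train that already has the 10 rows removed.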
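
For readers who want to see what a non-constant variance score test measures, here is a minimal base R sketch of a Breusch-Pagan / Cook-Weisberg style statistic computed by hand on simulated data. We assume this mirrors the statistic reported above, but it is written from the textbook definition rather than taken from the car source, and the simulated variables are hypothetical.

```r
set.seed(4)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)  # error sd grows with x (heteroscedastic)

fit <- lm(y ~ x)

# Squared residuals scaled by the ML estimate of the error variance
e2 <- resid(fit)^2
u  <- e2 / mean(e2)

# Auxiliary regression of the scaled squared residuals on the fitted values
aux    <- lm(u ~ fitted(fit))
reg_ss <- sum((fitted(aux) - mean(u))^2)  # regression sum of squares

# Score statistic: half the regression sum of squares, chi-square with 1 df
chisq <- reg_ss / 2
p_val <- pchisq(chisq, df = 1, lower.tail = FALSE)
```

The null hypothesis is constant variance, so a small p_val (as in this simulation) is evidence that the variance changes with the fitted values.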

This part is just to explore some more interesting facts. Plots of the data with fitted regression lines:

library(ggplot2)
ggplot(data=movie_train,aes(x=duration,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()

library(ggplot2)
ggplot(data=movie_train,aes(x=num_voted_users,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()

library(ggplot2)
ggplot(data=movie_train,aes(x=facenumber_in_poster,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()

library(ggplot2)
ggplot(data=movie_train,aes(x=gross,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()

library(ggplot2)
ggplot(data=movie_train,aes(x=budget,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()

Step 4: Making predictions on the test dataset

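Rewriting model lm.fit5 in another notation. Note: if the model is written as lm(train$score ~ train$x1 + train$x2 + ...), predict() ignores the new data and returns as many values as there are rows in the training set. Using bare column names together with a data argument avoids this.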
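
The difference between the two notations can be seen in a small sketch. The data frames train and test and their columns x and y are made up for illustration.

```r
set.seed(5)
train <- data.frame(x = rnorm(20))
train$y <- 1 + 2 * train$x + rnorm(20, sd = 0.1)
test <- data.frame(x = c(-1, 0, 1))

# Notation 1: variables hard-wired to the train data frame
fit_dollar <- lm(train$y ~ train$x)

# Notation 2: bare column names plus a data argument
fit_data <- lm(y ~ x, data = train)

# With notation 1, predict() cannot find "train$x" in newdata, falls back to
# the training vector, and warns that newdata has the wrong number of rows
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = test))

# With notation 2, predict() uses the x column of newdata
pred_data <- predict(fit_data, newdata = test)

n_dollar <- length(pred_dollar)  # same as nrow(train)
n_data   <- length(pred_data)    # same as nrow(test)
```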

# lm.fit6 = lm.fit5, written with bare column names and a data argument
lm.fit6<-lm(imdb_score~poly(num_voted_users,2)+poly(num_critic_for_reviews,2)+num_user_for_reviews+poly(duration,2)+facenumber_in_poster+gross+poly(budget,2)+title_year+genres+duration*num_voted_users+num_voted_users*num_user_for_reviews,data=data.frame(movie_train))
summary(lm.fit6)

Call:
lm(formula = imdb_score ~ poly(num_voted_users, 2) + poly(num_critic_for_reviews, 
    2) + num_user_for_reviews + poly(duration, 2) + facenumber_in_poster + 
    gross + poly(budget, 2) + title_year + genres + duration * 
    num_voted_users + num_voted_users * num_user_for_reviews, 
    data = data.frame(movie_train))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9377 -0.3441  0.0703  0.4639  2.1885 

Coefficients: (2 not defined because of singularities)
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           5.037e+01  3.754e+00  13.418  < 2e-16 ***
poly(num_voted_users, 2)1             4.032e+01  4.606e+00   8.754  < 2e-16 ***
poly(num_voted_users, 2)2            -1.754e+01  2.059e+00  -8.522  < 2e-16 ***
poly(num_critic_for_reviews, 2)1      1.278e+01  1.296e+00   9.862  < 2e-16 ***
poly(num_critic_for_reviews, 2)2     -6.747e+00  8.258e-01  -8.171 4.68e-16 ***
num_user_for_reviews                 -9.099e-04  9.338e-05  -9.744  < 2e-16 ***
poly(duration, 2)1                    1.410e+01  1.080e+00  13.061  < 2e-16 ***
poly(duration, 2)2                   -3.261e+00  7.731e-01  -4.219 2.54e-05 ***
facenumber_in_poster                 -2.187e-02  7.115e-03  -3.074  0.00213 ** 
gross                                -4.939e-10  3.197e-10  -1.545  0.12246    
poly(budget, 2)1                     -1.069e+01  1.170e+00  -9.133  < 2e-16 ***
poly(budget, 2)2                      7.188e+00  8.082e-01   8.895  < 2e-16 ***
title_year                           -2.181e-02  1.876e-03 -11.624  < 2e-16 ***
genresAdventure                       3.522e-01  5.521e-02   6.380 2.08e-10 ***
genresAnimation                       7.323e-01  1.385e-01   5.288 1.34e-07 ***
genresBiography                       6.313e-01  7.649e-02   8.254 2.38e-16 ***
genresComedy                          1.386e-01  4.410e-02   3.143  0.00169 ** 
genresCrime                           4.545e-01  6.414e-02   7.087 1.75e-12 ***
genresDocumentary                     1.297e+00  1.634e-01   7.935 3.06e-15 ***
genresDrama                           4.814e-01  4.951e-02   9.723  < 2e-16 ***
genresFamily                          1.832e-01  4.283e-01   0.428  0.66892    
genresFantasy                        -2.278e-01  1.423e-01  -1.601  0.10950    
genresHorror                         -3.967e-01  8.017e-02  -4.948 7.98e-07 ***
genresMusical                        -9.687e-02  7.394e-01  -0.131  0.89578    
genresMystery                         1.504e-01  2.003e-01   0.751  0.45270    
genresRomance                         8.888e-01  7.383e-01   1.204  0.22873    
genresSci-Fi                         -8.399e-02  3.320e-01  -0.253  0.80029    
genresThriller                       -4.309e-01  7.388e-01  -0.583  0.55979    
genresWestern                         9.130e-01  7.378e-01   1.237  0.21608    
duration                                     NA         NA      NA       NA    
num_voted_users                              NA         NA      NA       NA    
duration:num_voted_users             -2.105e-08  3.787e-09  -5.557 3.01e-08 ***
num_user_for_reviews:num_voted_users  1.374e-09  2.084e-10   6.593 5.16e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7362 on 2663 degrees of freedom
Multiple R-squared:  0.5169,    Adjusted R-squared:  0.5114 
F-statistic: 94.97 on 30 and 2663 DF,  p-value: < 2.2e-16
pr<-predict.lm(lm.fit6,newdata = data.frame(movie_test),interval = 'confidence')
prediction from a rank-deficient fit may be misleading
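
The rank-deficiency warning and the two NA coefficients have the same source: a*b expands to a + b + a:b, and the main effects a and b are already spanned by the intercept and the first-order poly() columns. A minimal sketch on simulated data (the data frame d and its columns a, b and y are hypothetical) shows that writing the interaction as a:b gives the same fitted values without the undefined coefficients.

```r
set.seed(6)
n <- 200
d <- data.frame(a = runif(n, 60, 200), b = runif(n, 0, 1000))
d$y <- 5 + 0.01 * d$a + 0.002 * d$b - 1e-5 * d$a * d$b + rnorm(n, sd = 0.3)

# a * b adds the main effects a and b again, on top of poly(a, 2) and poly(b, 2)
fit_star <- lm(y ~ poly(a, 2) + poly(b, 2) + a * b, data = d)

# a:b adds only the interaction term
fit_colon <- lm(y ~ poly(a, 2) + poly(b, 2) + a:b, data = d)

# Coefficients that could not be estimated in each fit
na_star  <- names(coef(fit_star))[is.na(coef(fit_star))]
na_colon <- sum(is.na(coef(fit_colon)))

# Both formulas span the same column space, so the fitted values agree
same_fit <- isTRUE(all.equal(unname(fitted(fit_star)), unname(fitted(fit_colon))))
```

Applied to the model above, this would mean writing duration:num_voted_users and num_voted_users:num_user_for_reviews instead of the * forms. This is a suggestion only and has not been rerun on the movie data.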
pr
          fit      lwr      upr
4    7.532866 7.070521 7.995212
18   7.147224 6.727831 7.566617
36   7.105444 6.927123 7.283766
41   6.832298 6.717652 6.946944
52   6.745073 6.570269 6.919877
54   7.099408 6.940154 7.258662
62   6.477071 6.167720 6.786422
72   5.518444 5.382021 5.654867
75   5.956378 5.799223 6.113534
97   7.921294 7.522945 8.319643
102  8.028773 7.850098 8.207447
109  5.720808 5.591183 5.850433
149  6.066992 5.934728 6.199256
158  5.706457 5.603469 5.809444
164  5.495290 5.380543 5.610037
165  5.507511 5.400149 5.614873
184  7.536942 7.424599 7.649286
190  6.526406 6.270885 6.781926
192  5.789304 5.679154 5.899454
194  6.144094 6.020476 6.267712
220  6.797991 6.697396 6.898585
223  6.231502 6.126179 6.336825
233  6.003580 5.896874 6.110285
241  7.218573 6.828965 7.608180
262  6.438942 6.350170 6.527715
292  6.545400 6.443179 6.647620
297  7.364689 6.869019 7.860359
304  5.848346 5.702158 5.994533
305  6.052453 5.943964 6.160941
313  5.685453 5.609391 5.761515
324  5.446335 5.322331 5.570339
340  7.835805 7.306835 8.364774
346  7.224485 7.067632 7.381338
362  8.705321 8.489233 8.921409
379  6.086725 5.979906 6.193544
395  6.065458 5.950318 6.180597
399  7.058227 6.961177 7.155278
404  6.836200 6.570506 7.101895
406  6.466229 6.384291 6.548168
410  6.355030 6.251712 6.458347
427  5.728945 5.640956 5.816934
435  8.007525 7.850646 8.164404
440  7.896934 7.588273 8.205595
453  8.184593 7.740853 8.628334
463  6.652393 6.519075 6.785711
475  5.672285 5.583483 5.761086
481  5.796184 5.527854 6.064514
494  5.736717 5.647324 5.826110
499  5.555406 5.468873 5.641940
502  5.664352 5.576523 5.752180
519  5.337230 5.218956 5.455505
525  6.882058 6.753216 7.010900
534  5.436274 5.353219 5.519329
536  6.246131 6.149101 6.343161
551  6.382540 6.120369 6.644711
553  6.006155 5.914869 6.097442
566  6.122604 5.993869 6.251339
570  6.655939 6.544799 6.767079
574  8.392476 8.086851 8.698100
576  6.855142 6.731293 6.978992
577  6.551108 6.443272 6.658943
599  5.818894 5.726167 5.911622
616  5.467505 5.391440 5.543570
617  5.868150 5.781085 5.955214
634  7.979128 7.688574 8.269682
642  6.747629 6.659634 6.835624
673  5.599985 5.496359 5.703612
680  6.667716 6.578642 6.756790
702  5.402351 5.321319 5.483383
706  5.851077 5.767683 5.934471
710  7.981290 7.793918 8.168662
723  6.508575 6.425190 6.591960
730  5.550647 5.428539 5.672754
741  5.945696 5.863457 6.027935
767  5.886350 5.816881 5.955818
789  7.094412 6.950716 7.238107
801  6.362756 6.267244 6.458269
805  6.099380 6.011990 6.186770
807  5.800161 5.723184 5.877138
816  5.983439 5.869589 6.097288
820  5.874646 5.772700 5.976592
842  5.893985 5.801702 5.986267
844  5.506660 5.416189 5.597131
849  6.983865 6.889091 7.078639
851  6.184149 6.105169 6.263130
872  6.035374 5.946974 6.123775
884  6.832841 6.324449 7.341233
897  6.086534 5.989880 6.183189
929  7.677788 6.228189 9.127388
933  7.209955 7.108111 7.311799
934  6.374305 6.218010 6.530600
941  7.423842 7.314980 7.532705
952  5.732756 5.661278 5.804234
1012 6.676683 6.557351 6.796016
1020 6.137487 6.014276 6.260697
1035 5.531226 5.448663 5.613789
1039 7.472331 7.302621 7.642040
1049 6.865405 6.782361 6.948450
1066 5.755046 5.672244 5.837849
1074 5.683657 5.587662 5.779652
1085 8.168594 8.006078 8.331110
1094 6.149181 6.070803 6.227560
1103 5.493684 5.428647 5.558721
1112 5.718656 5.616365 5.820947
1119 5.924898 5.836427 6.013369
1120 5.453503 5.344941 5.562065
1140 6.202411 6.126805 6.278017
1147 6.208350 6.124223 6.292477
1151 5.440569 5.358429 5.522709
1158 5.958038 5.885288 6.030788
1180 6.427328 6.339484 6.515172
1208 6.315590 6.239908 6.391272
1211 5.489064 5.394820 5.583307
1229 6.146876 6.077407 6.216345
1237 7.014990 6.906757 7.123223
1241 6.414861 6.027006 6.802715
1255 7.286743 7.167531 7.405954
1271 6.240700 6.161911 6.319489
1273 5.919283 5.846767 5.991799
1302 6.178886 6.036833 6.320939
1318 5.992964 5.901389 6.084539
1343 5.766825 5.693613 5.840038
1353 6.136637 6.055359 6.217914
1365 6.594790 6.144944 7.044635
1371 5.904377 5.754657 6.054097
1372 7.170810 7.072482 7.269137
1375 7.589885 7.415879 7.763891
1390 5.769504 5.698503 5.840504
1392 6.430170 6.348217 6.512123
1397 5.447974 5.377246 5.518702
1428 5.671113 5.589048 5.753178
1431 6.834620 6.738670 6.930569
1517 5.506728 5.427427 5.586028
1519 6.494431 6.400522 6.588340
1534 6.759141 6.676904 6.841377
1544 6.296068 6.218856 6.373279
1546 7.206950 7.085699 7.328201
1572 8.367261 7.718739 9.015783
1581 6.211045 6.135565 6.286525
1584 6.187331 6.074343 6.300319
1591 7.044950 6.941787 7.148113
1593 7.377296 7.237080 7.517512
1608 6.665452 6.592289 6.738614
1616 7.745531 7.541166 7.949896
1623 5.908825 5.845733 5.971917
1640 6.223100 6.158861 6.287339
1681 6.719219 6.626756 6.811681
1695 6.818825 6.698576 6.939075
1699 6.599794 6.501478 6.698109
1700 7.118055 6.960602 7.275507
1702 6.028337 5.945515 6.111159
1708 6.210212 6.133075 6.287348
1736 7.763616 7.611202 7.916030
1758 6.138560 5.980030 6.297089
1760 6.695719 6.589163 6.802275
1772 6.241149 6.178469 6.303829
1807 6.583871 6.470679 6.697062
1814 7.529013 7.342745 7.715281
1860 6.077985 5.959315 6.196655
1882 5.312082 5.217582 5.406582
1884 5.834634 5.744253 5.925014
1901 6.950317 6.845423 7.055212
1914 5.513860 5.402933 5.624787
1929 5.637303 5.564406 5.710200
1944 6.072144 6.013880 6.130407
1965 5.706108 5.632095 5.780121
1982 6.402153 6.326121 6.478185
1999 5.453124 5.361328 5.544921
2017 6.867382 6.780186 6.954577
2022 6.244585 6.089236 6.399933
2027 6.483438 6.395801 6.571075
2055 6.280783 6.139570 6.421996
2077 6.750369 6.659304 6.841433
2086 6.159448 6.080703 6.238193
2093 6.739150 6.552225 6.926075
2095 6.440827 6.366940 6.514713
2105 6.264204 6.157375 6.371033
2107 6.203387 6.092911 6.313862
2110 6.219974 6.135260 6.304689
2138 8.304860 8.124766 8.484955
2145 6.291060 6.215270 6.366849
2151 5.984690 5.882486 6.086893
2183 5.399695 5.296881 5.502510
2197 5.895995 5.737591 6.054398
2198 6.025398 5.938178 6.112619
2276 5.973059 5.903862 6.042255
2277 6.284302 6.134191 6.434413
2281 6.362727 6.290687 6.434767
2321 5.561182 5.477373 5.644990
2361 6.689518 6.545187 6.833848
2362 5.460404 5.324915 5.595893
2366 5.635655 5.538109 5.733201
2369 5.974270 5.918622 6.029917
2395 7.113046 6.972869 7.253224
2397 6.482995 6.388247 6.577743
2411 6.075614 6.005121 6.146108
2414 6.196374 6.115577 6.277172
2415 7.531062 7.311777 7.750347
2417 6.717303 6.603161 6.831444
2428 6.531399 6.415550 6.647247
2429 6.261654 6.172809 6.350500
2493 6.774596 6.595760 6.953431
2499 6.254368 6.189888 6.318848
2503 6.056060 5.914753 6.197367
2505 6.683703 6.533775 6.833630
2524 7.856346 7.745312 7.967379
2583 6.759891 6.662738 6.857044
2602 7.458668 7.298422 7.618915
2616 6.356146 6.260177 6.452116
2624 7.015640 6.922894 7.108386
2632 5.921147 5.818744 6.023550
2648 6.935343 6.814869 7.055818
2654 7.887763 7.579260 8.196266
2671 5.515390 5.378335 5.652445
2674 5.402407 5.320331 5.484483
2693 6.309477 6.195899 6.423055
2700 5.963442 5.880255 6.046630
2726 6.220098 6.136364 6.303833
2748 5.562933 5.495248 5.630619
2780 6.213294 6.098667 6.327921
2799 5.952863 5.850307 6.055419
2835 7.332987 7.020839 7.645135
2836 8.420098 8.136064 8.704132
2848 5.658914 5.559708 5.758120
2862 5.818235 5.702994 5.933475
2882 5.595872 5.522690 5.669053
2898 6.371319 6.298806 6.443833
2919 7.272168 7.143461 7.400875
2952 5.829849 5.697105 5.962593
2972 6.048737 5.926076 6.171398
2977 6.059970 5.407260 6.712680
2981 5.809145 5.731586 5.886703
2985 5.921258 5.850417 5.992098
3027 8.183790 7.982753 8.384826
3053 5.869260 5.805358 5.933163
3082 5.928392 5.860408 5.996377
3098 6.039541 5.900421 6.178661
3101 7.012963 6.922414 7.103511
3103 5.983910 5.873567 6.094253
3114 5.896691 5.752565 6.040817
3123 6.070311 5.797010 6.343611
3133 6.086494 5.973024 6.199965
3145 6.736984 6.640887 6.833082
3151 7.680189 7.571290 7.789087
3171 6.382080 6.293795 6.470364
3182 5.872762 5.796528 5.948997
3203 6.152732 6.069245 6.236219
3222 6.504499 6.359255 6.649743
3229 5.888606 5.792008 5.985203
3261 5.940752 5.876702 6.004803
3316 5.822529 5.725388 5.919670
3334 6.965722 6.856920 7.074525
3516 6.558064 6.417785 6.698342
3548 6.006390 5.936779 6.076001
3571 6.192664 6.113661 6.271667
3607 5.879491 5.805408 5.953574
3609 5.814770 5.674671 5.954869
3626 5.744500 5.654358 5.834642
3648 6.368112 6.249977 6.486248
3715 8.561899 8.383530 8.740268
3727 6.362990 6.283908 6.442072
3740 5.529139 5.393811 5.664468
3747 6.335849 6.229983 6.441716
3748 6.180180 6.081197 6.279163
3749 5.561311 5.423620 5.699002
3850 8.917223 8.680791 9.153654
3880 5.848465 5.702279 5.994650
3893 6.812039 6.725563 6.898515
3894 6.464500 6.364511 6.564489
3907 8.118958 7.933319 8.304598
3924 6.895080 6.586115 7.204044
3939 6.938407 6.840405 7.036408
4026 7.227623 7.051745 7.403501
4028 6.195250 6.114868 6.275631
4066 5.805236 5.657442 5.953031
4189 6.689414 6.604642 6.774187
4193 6.488028 6.410324 6.565732
4208 5.974875 5.880145 6.069606
4220 5.963858 5.854352 6.073364
4239 8.460741 8.234820 8.686663
4334 5.933228 5.855543 6.010913
4384 5.970913 5.906503 6.035323
4403 6.424157 6.343352 6.504963
4406 6.406195 6.312556 6.499834
4436 5.907443 5.810733 6.004153
4487 6.370477 6.296922 6.444031
4533 6.447268 6.163229 6.731306
4535 5.564566 5.421238 5.707895
4537 6.109283 5.952685 6.265880
4577 5.899824 5.824399 5.975249
4584 5.428643 5.345386 5.511900
4654 5.857927 5.776168 5.939685
4785 6.058881 5.982829 6.134932
4789 6.256824 6.180542 6.333105
4813 7.044035 5.581302 8.506769
4831 5.821646 5.736390 5.906902
4841 6.213957 6.115824 6.312089
4874 5.340048 4.685289 5.994807
4894 6.937580 6.731609 7.143551
5005 6.041962 4.593917 7.490007
5043 6.841339 6.531956 7.150723

Check accuracy with the mean absolute error (MAE): how far, on average, a prediction is from the true value.

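# Mean absolute error between actual and predicted values
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
# abs() makes the result the same whichever argument comes first
MAE(movie_test$imdb_score, pr)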
[1] 0.5114398
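
To put an MAE of about 0.51 in context, it can help to compare it against a naive baseline that always predicts the mean score. The sketch below uses made-up numbers (the vectors `actual` and `predicted` are hypothetical, not taken from the movie data) and also computes the RMSE, which penalizes large errors more heavily than the MAE does.

```r
# Hypothetical actual and predicted scores, for illustration only
actual    <- c(6.5, 7.2, 5.8, 8.1, 6.0)
predicted <- c(6.3, 6.9, 6.1, 7.6, 6.4)

# Mean absolute error: average size of the prediction errors
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Root mean squared error: like MAE, but large errors weigh more
RMSE <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Naive baseline: always predict the mean score
# (with real data, the mean of the training set would be used here)
baseline <- rep(mean(actual), length(actual))

mae_model    <- MAE(actual, predicted)
mae_baseline <- MAE(actual, baseline)
rmse_model   <- RMSE(actual, predicted)

print(c(model = mae_model, baseline = mae_baseline, rmse = rmse_model))
```

If the model's MAE is clearly below the baseline's, the predictors carry useful information about the rating. An RMSE well above the MAE would point to a few large misses.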

Check the relative importance of each predictor for the IMDB score.

# standardized regression coefficients
library(QuantPsyc)
Loading required package: boot

Attaching package: ‘boot’

The following object is masked from ‘package:car’:

    logit

The following object is masked from ‘package:psych’:

    logit

Loading required package: MASS

Attaching package: ‘QuantPsyc’

The following object is masked from ‘package:base’:

    norm
lm.beta(lm.fit6)
Calling var(x) on a factor x is deprecated and will become an error.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
longer object length is not a multiple of shorter object length
           poly(num_voted_users, 2)1            poly(num_voted_users, 2)2 
                        7.375099e-01                        -3.209507e-01 
    poly(num_critic_for_reviews, 2)1     poly(num_critic_for_reviews, 2)2 
                        4.888110e+03                        -1.234322e-01 
                num_user_for_reviews                   poly(duration, 2)1 
                       -1.802006e-03                         9.709896e+08 
                  poly(duration, 2)2                 facenumber_in_poster 
                       -5.966200e-02                        -2.047902e-01 
                               gross                     poly(budget, 2)1 
                       -1.408564e-09                        -2.188525e+02 
                    poly(budget, 2)2                           title_year 
                        1.060494e+06                        -3.989610e-04 
                     genresAdventure                      genresAnimation 
                        6.443330e-03                         2.800931e+02 
                     genresBiography                         genresComedy 
                        1.154918e-02                         2.745636e-01 
                         genresCrime                    genresDocumentary 
                        3.130022e+07                         2.372774e-02 
                         genresDrama                         genresFamily 
                        4.507383e+00                         5.223540e-01 
                       genresFantasy                         genresHorror 
                       -4.662931e+00                        -5.852019e+04 
                       genresMusical                        genresMystery 
                       -1.772132e-03                         2.751607e-03 
                       genresRomance                         genresSci-Fi 
                        3.399551e+02                        -1.536461e-03 
                      genresThriller                        genresWestern 
                       -8.534308e-01                         6.286873e+07 
            duration:num_voted_users num_user_for_reviews:num_voted_users 
                       -3.850005e-10                         1.286813e-08 
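
The warnings printed by `lm.beta` suggest that it may not cope well with a model containing factors, `poly()` terms, or interactions, so the values above should be read with caution. One common alternative is to z-score the response and the numeric predictors and refit, so that each slope is expressed in standard-deviation units. The following is a minimal sketch on simulated data. The data frame `d` and its columns are hypothetical stand-ins, not the author's `lm.fit6` or the movie data.

```r
set.seed(1)

# Simulated data (hypothetical): two numeric predictors and one factor
n <- 200
d <- data.frame(
  duration = rnorm(n, mean = 110, sd = 20),
  votes    = rnorm(n, mean = 1e5, sd = 5e4),
  genre    = factor(sample(c("Action", "Comedy", "Drama"), n, replace = TRUE))
)
d$score <- 6 + 0.02 * (d$duration - 110) + 4e-6 * (d$votes - 1e5) +
  0.3 * (d$genre == "Drama") + rnorm(n, sd = 0.5)

# z-score the response and the numeric predictors; leave the factor as is
num_cols <- c("score", "duration", "votes")
d_std <- d
d_std[num_cols] <- lapply(d[num_cols], function(x) as.numeric(scale(x)))

# Slopes of the refit model are standardized coefficients for numeric terms
fit_std  <- lm(score ~ duration + votes + genre, data = d_std)
beta_std <- coef(fit_std)[c("duration", "votes")]
print(round(beta_std, 3))

# Cross-check against the formula b * sd(x) / sd(y) applied to the raw fit
fit_raw    <- lm(score ~ duration + votes + genre, data = d)
beta_check <- coef(fit_raw)[c("duration", "votes")] *
  sapply(d[c("duration", "votes")], sd) / sd(d$score)
print(all.equal(beta_std, beta_check))
```

In this sketch the genre coefficients of `fit_std` are differences from the reference level in standard deviations of the score, so they are not directly comparable to the numeric slopes.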

Conclusion: Judging by the standardized coefficients, the most important factor affecting a movie's rating is duration: longer movies tend to receive higher ratings. num_critic_for_reviews is also an important predictor. Budget matters as well, although budget and rating are not strongly correlated. The number of faces in the movie poster has a non-negligible effect on the rating.

