Purpose: Using regression analysis, we want to answer three questions: 1) Among the 27 variables given besides the score itself, which are critical in explaining the IMDB rating of a movie? 2) Is there any correlation between genre and IMDB rating, face number in poster and IMDB rating, director name and IMDB rating, and duration and IMDB rating? 3) Can we predict the IMDB score using our model?
m<- read.csv('movie_metadata.csv')
This data set comes from Kaggle. The author scraped 5000+ movies from the IMDB website using a Python library called “scrapy” and obtained all 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. Below are the 28 variables: “movie_title” “color” “num_critic_for_reviews” “movie_facebook_likes” “duration” “director_name” “director_facebook_likes” “actor_3_name” “actor_3_facebook_likes” “actor_2_name” “actor_2_facebook_likes” “actor_1_name” “actor_1_facebook_likes” “gross” “genres” “num_voted_users” “cast_total_facebook_likes” “facenumber_in_poster” “plot_keywords” “movie_imdb_link” “num_user_for_reviews” “language” “country” “content_rating” “budget” “title_year” “imdb_score” “aspect_ratio”
This dataset is a proof of concept. It can be used for experimental and learning purposes. For comprehensive movie analysis and accurate movie rating prediction, 28 attributes from 5000 movies might not be enough. A decent dataset could contain hundreds of attributes from 50K or more movies and would require a great deal of feature engineering.
Assign the first word of genres as the genre of each movie (the genres column was split into separate words in Excel beforehand, which left the extra columns X to X.8 that are removed here):
# remove columns X-X.8
which(colnames(m)=='genres')
[1] 10
which(colnames(m)=='X.8')
[1] 19
m<-m[,-c(11:19)]
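The same first-genre extraction could also be done in R instead of Excel. The following is only a minimal sketch, assuming the raw genres strings are pipe-delimited (for example "Action|Adventure"); the helper name first_genre and the toy vector are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: keep only the text before the first "|" in each string.
first_genre <- function(genres) {
  genres <- as.character(genres)   # convert factors to plain strings first
  sub("\\|.*$", "", genres)        # drop everything from the first "|" onward
}

# Toy input in the pipe-delimited style assumed above
g <- c("Action|Adventure|Fantasy|Sci-Fi", "Comedy", "Drama|Romance")
first_genre(g)
```

Doing the split in code keeps the step reproducible and avoids the extra X to X.8 columns that have to be deleted afterwards.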
We only keep movie data for the USA, because the “budget” variable was not all converted to US dollars, which might cause problems in later analysis. If we wanted to convert all budgets into US dollars, we would also have to take inflation into consideration, which would make the problem more complicated. Therefore, for practice purposes, we decided to study only movies from the USA.
movie.usa<-m[which(m[,'country']=='USA'),]
Double check:
movie.usa$country
[1] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[23] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[45] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[67] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[89] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[111] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[133] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[155] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[177] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[199] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[221] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[243] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[265] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[287] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[309] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[331] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[353] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[375] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[397] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[419] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[441] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[463] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[485] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[507] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[529] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[551] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[573] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[595] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[617] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[639] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[661] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[683] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[705] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[727] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[749] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[771] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[793] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[815] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[837] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[859] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[881] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[903] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[925] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[947] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[969] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[991] USA USA USA USA USA USA USA USA USA USA
[ reached getOption("max.print") -- omitted 2807 entries ]
66 Levels: Afghanistan Argentina Aruba Australia Bahamas Belgium Brazil Bulgaria ... West Germany
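Printing the whole column works, but the check can be made much shorter. Below is a small sketch on a toy data frame (the object names toy and toy.usa and their values are made up for illustration) showing three compact ways to confirm that only one country remains after filtering.

```r
# Toy data frame standing in for the movie data (values are illustrative only)
toy <- data.frame(country = factor(c("USA", "UK", "USA", "France")))
toy.usa <- toy[which(toy$country == "USA"), , drop = FALSE]

unique(as.character(toy.usa$country))   # distinct values that remain
table(droplevels(toy.usa$country))      # counts, with unused factor levels dropped
all(toy.usa$country == "USA")           # a single TRUE/FALSE answer
```

Note that the factor still carries all of its original levels after subsetting, which is why the printout above reports 66 levels; droplevels() removes the unused ones.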
Remove ‘language’: after keeping only USA movies, 3779 of the 3807 remaining movies are in English, and no other value (including blank) appears more than 10 times, so the variable is not meaningful for our prediction.
summary(movie.usa$language)
Aboriginal Arabic Aramaic Bosnian Cantonese Chinese Czech
10 0 0 1 1 1 0 0
Danish Dari Dutch Dzongkha English Filipino French German
0 1 0 0 3779 1 0 0
Greek Hebrew Hindi Hungarian Icelandic Indonesian Italian Japanese
0 1 1 0 0 0 0 1
Kannada Kazakh Korean Mandarin Maya Mongolian None Norwegian
0 0 0 0 1 0 1 0
Panjabi Persian Polish Portuguese Romanian Russian Slovenian Spanish
0 0 0 0 0 0 0 7
Swahili Swedish Tamil Telugu Thai Urdu Vietnamese Zulu
0 0 0 0 0 0 1 0
movie.usa<-movie.usa[, -which(names(movie.usa)=='language')]
Remove the ‘movie_imdb_link’ column since it’s not useful for our analysis, and store the rest of the data as ‘mm’.
movie.df= data.frame(movie.usa)
mm<-movie.df[, -which(names(movie.df)=='movie_imdb_link')]
str(mm)
'data.frame': 3807 obs. of 26 variables:
$ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
$ director_name : Factor w/ 2399 levels "","\xcc\xe4mile Gaudreault",..: 926 799 379 106 2030 1652 1225 2394 284 799 ...
$ num_critic_for_reviews : int 723 302 813 462 392 324 635 673 434 313 ...
$ duration : int 178 169 164 132 156 100 141 183 169 151 ...
$ director_facebook_likes : int 0 563 22000 475 0 15 0 0 0 563 ...
$ actor_3_facebook_likes : int 855 1000 23000 530 4000 284 19000 2000 903 1000 ...
$ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 534 2549 1228 801 2440 1704 1911 2218 ...
$ actor_1_facebook_likes : int 1000 40000 27000 640 24000 799 26000 15000 18000 40000 ...
$ gross : int 760505847 309404152 448130642 73058679 336530303 200807262 458991599 330249062 200069408 423032628 ...
$ genres : Factor w/ 21 levels "Action","Adventure",..: 1 1 1 1 1 2 1 1 1 1 ...
$ actor_1_name : Factor w/ 2098 levels "","\xcc\xd2lafur Darri \xcc\xd2lafsson",..: 303 982 1968 441 786 221 337 740 1104 982 ...
$ movie_title : Factor w/ 4917 levels "[Rec] 2\xe5\xca",..: 397 2731 3707 1960 3289 3459 398 460 3416 2732 ...
$ num_voted_users : int 886204 471220 1144337 212204 383056 294810 462669 371639 240396 522040 ...
$ cast_total_facebook_likes: int 4834 48350 106759 1873 46055 2036 92000 24450 29991 48486 ...
$ actor_3_name : Factor w/ 3522 levels "","\xcc\xd2scar Jaenada",..: 3442 1393 1769 2714 1969 2162 3018 57 1134 1393 ...
$ facenumber_in_poster : int 0 0 0 1 0 1 4 0 0 2 ...
$ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 3484 651 4745 29 1142 1564 3312 2188 ...
$ num_user_for_reviews : int 3054 1238 2701 738 1902 387 1117 3018 2367 1832 ...
$ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 65 65 65 65 65 65 65 65 ...
$ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 9 10 10 10 10 ...
$ budget : num 2.37e+08 3.00e+08 2.50e+08 2.64e+08 2.58e+08 ...
$ title_year : int 2009 2007 2012 2012 2007 2010 2015 2016 2006 2006 ...
$ actor_2_facebook_likes : int 936 5000 23000 632 11000 553 21000 4000 10000 5000 ...
$ imdb_score : num 7.9 7.1 8.5 6.6 6.2 7.8 7.5 6.9 6.1 7.3 ...
$ aspect_ratio : num 1.78 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 2.35 ...
$ movie_facebook_likes : int 33000 0 164000 24000 0 29000 118000 197000 0 5000 ...
Check for missing values:
library(Amelia)
Loading required package: Rcpp
package ‘Rcpp’ was built under R version 3.3.2
##
## Amelia II: Multiple Imputation
## (Version 1.7.4, built: 2015-12-05)
## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(mm, main = "Missing values vs observed")
sapply(mm,function(x) sum(is.na(x))) # number of missing values for each variable
color director_name num_critic_for_reviews
0 0 39
duration director_facebook_likes actor_3_facebook_likes
6 74 13
actor_2_name actor_1_facebook_likes gross
0 4 572
genres actor_1_name movie_title
0 0 0
num_voted_users cast_total_facebook_likes actor_3_name
0 0 0
facenumber_in_poster plot_keywords num_user_for_reviews
12 0 13
country content_rating budget
0 0 298
title_year actor_2_facebook_likes imdb_score
74 7 0
aspect_ratio movie_facebook_likes
222 0
We noticed that there are many missing values for gross (572), budget (298) and aspect ratio (222).
Omit missing values:
movie<-na.omit(mm)
sapply(movie,function(x) sum(is.na(x))) # double check for missing values
color director_name num_critic_for_reviews
0 0 0
duration director_facebook_likes actor_3_facebook_likes
0 0 0
actor_2_name actor_1_facebook_likes gross
0 0 0
genres actor_1_name movie_title
0 0 0
num_voted_users cast_total_facebook_likes actor_3_name
0 0 0
facenumber_in_poster plot_keywords num_user_for_reviews
0 0 0
country content_rating budget
0 0 0
title_year actor_2_facebook_likes imdb_score
0 0 0
aspect_ratio movie_facebook_likes
0 0
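Since na.omit() drops every row that has at least one missing value, it is worth checking how many rows are lost. In our case the data go from 3807 observations before this step to 3005 afterwards (the sample size reported in the correlation test further down). The sketch below shows the bookkeeping on a toy data frame; the object toy and its values are invented for illustration.

```r
# Toy data frame with a few missing values (illustrative only)
toy <- data.frame(gross  = c(100, NA, 300, 400),
                  budget = c(50, 60, NA, 80),
                  score  = c(7.1, 6.5, 8.0, 5.9))

n_before <- nrow(toy)
toy.complete <- na.omit(toy)            # keep rows with no NA in any column
n_after <- nrow(toy.complete)

c(before = n_before, after = n_after, dropped = n_before - n_after)
sum(complete.cases(toy))                # same count as n_after
colMeans(is.na(toy))                    # share of missing values per column
```

If one or two columns account for most of the missing values, dropping those columns or imputing them could be considered as an alternative to losing the rows.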
library(psych)
package ‘psych’ was built under R version 3.3.2
Attaching package: ‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
The following object is masked from ‘package:car’:
logit
library(car)
library(RColorBrewer)
library(corrplot)
library(ggplot2)
Explore title_year predictor:
range(movie$title_year) # check movie title year
[1] 1920 2016
sum(with(movie,title_year=='2009')) # 145
[1] 145
sum(with(movie,title_year=='2014')) # 121
[1] 121
Visualization of title Year vs. Score:
scatterplot(x=movie$title_year,y=movie$imdb_score)
There are many outliers for title year. The majority of data points are from around the year 2000 and later, which makes sense, since fewer movies were made in the early years. Also, an interesting observation is that movies from the early years tend to have higher scores.
Visualization of IMDB Score:
max(movie$imdb_score) # 9.3
[1] 9.3
ggplot(movie, aes(x = imdb_score)) +
geom_histogram(aes(fill = ..count..), binwidth =0.5) +
scale_x_continuous(name = "IMDB Score",
breaks = seq(0,10),
limits=c(1, 10)) +
ggtitle("Histogram of Movie IMDB Score") +
scale_fill_gradient("Count", low = "blue", high = "red")
sum(with(movie,imdb_score>=8))
[1] 148
# 148 movies with IMDB score greater or equal to 8.
The distribution of IMDB scores looks roughly normal. The highest score is 9.3 on a scale of 10. We can consider a movie with a score greater than or equal to 8 to be a great movie from many perspectives.
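One way to put a number on the shape seen in the histogram is the sample skewness, which is 0 for a perfectly symmetric sample and negative when the left tail is longer. This is only a sketch: the helper skewness below is a hypothetical, moment-based implementation written for illustration, and the scores vector is toy data rather than our movie data.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy scores with a few low values pulling the left tail out
scores <- c(2.1, 4.5, 5.8, 6.2, 6.5, 6.9, 7.1, 7.4, 7.8, 8.3)
skewness(scores)      # negative: longer left tail
skewness(c(1, 2, 3))  # symmetric sample
```

Applied to movie$imdb_score, a value close to 0 would support the visual impression of approximate normality.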
Exploring correlation :
pairs.panels(movie[c('director_name','duration','facenumber_in_poster','imdb_score','genres')])
From the plot, only duration shows a notable correlation with IMDB score. Face number in posters has a negative correlation with IMDB score, and genre has little correlation with the score. Interestingly, director name shows almost no correlation with IMDB score.
pairs.panels(movie[c('color','actor_1_name','title_year','imdb_score','aspect_ratio','gross')])
Color and title year are highly positively correlated. Color has smaller positive correlations with aspect ratio and gross. Actor 1 name has a very small positive correlation with gross, suggesting that who plays in the movie has little impact on gross. Title year, aspect ratio and color are positively correlated with one another. IMDB score has a very small positive correlation with actor 1 name, which suggests that the identity of actor 1 does not by itself give the movie a higher score. Interestingly, IMDB score has a negative correlation with title year, which means that older movies seem to have higher scores; this agrees with our observation from the scatter plot. IMDB score has a small positive correlation with aspect ratio and a somewhat larger positive correlation with gross (0.27 in the correlation matrix below).
Correlation plot for all numerical variables:
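A word of caution on the correlations that involve name variables: assuming pairs.panels correlates factor columns through their integer level codes (which is our understanding of how it handles non-numeric columns), the coefficient depends on the ordering of the levels, which is alphabetical by default and has no substantive meaning. The toy example below, with made-up names and scores, shows that simply reordering the levels changes the correlation.

```r
# Toy factor of names and toy scores (illustrative values only)
names_f <- factor(c("Zed", "Amy", "Bob", "Amy", "Zed"))
score   <- c(8.0, 6.0, 7.0, 6.2, 8.1)

as.integer(names_f)                  # codes follow alphabetical level order
r_alpha <- cor(as.integer(names_f), score)

# Same data, but with the level order reversed
reordered <- factor(names_f, levels = c("Zed", "Bob", "Amy"))
r_rev <- cor(as.integer(reordered), score)

c(alphabetical = r_alpha, reversed = r_rev)   # same data, opposite sign
```

For this reason, boxplots or a model with the factor as a predictor are usually a safer way to judge whether a categorical variable is related to the score.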
nums<- sapply(movie,is.numeric) # select numeric columns
movie.num<- movie[,nums]
corrplot(cor(movie.num),method='ellipse')
Note: corrplot() expects a correlation matrix rather than a data frame, so we pass it the result of cor().
From the correlation plot, we can tell that: Face number in poster has only weak correlations with the other variables, several of them negative. Cast total Facebook likes and actor 1 Facebook likes have a very strong positive correlation. Budget and gross have a strong correlation, which is not surprising. Interestingly, IMDB score has a positive correlation with the number of critic reviews, which means that movies reviewed by more critics tend to have higher scores. Duration and the number of voted users also have positive correlations with IMDB score.
Find the pairs of correlations
corr.test(movie.num,y=NULL,use='pairwise',method='pearson',adjust='holm',alpha=0.05) # x must be numeric
Call:corr.test(x = movie.num, y = NULL, use = "pairwise", method = "pearson",
adjust = "holm", alpha = 0.05)
Correlation matrix
num_critic_for_reviews duration director_facebook_likes
num_critic_for_reviews 1.00 0.26 0.19
duration 0.26 1.00 0.21
director_facebook_likes 0.19 0.21 1.00
actor_3_facebook_likes 0.28 0.14 0.12
actor_1_facebook_likes 0.17 0.09 0.09
gross 0.48 0.28 0.14
num_voted_users 0.60 0.37 0.32
cast_total_facebook_likes 0.25 0.13 0.12
facenumber_in_poster -0.03 0.01 -0.05
num_user_for_reviews 0.57 0.36 0.24
budget 0.49 0.30 0.09
title_year 0.42 -0.11 -0.06
actor_2_facebook_likes 0.28 0.15 0.12
imdb_score 0.36 0.38 0.22
aspect_ratio 0.18 0.16 0.05
movie_facebook_likes 0.71 0.25 0.17
actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users
num_critic_for_reviews 0.28 0.17 0.48 0.60
duration 0.14 0.09 0.28 0.37
director_facebook_likes 0.12 0.09 0.14 0.32
actor_3_facebook_likes 1.00 0.25 0.30 0.28
actor_1_facebook_likes 0.25 1.00 0.13 0.17
gross 0.30 0.13 1.00 0.64
num_voted_users 0.28 0.17 0.64 1.00
cast_total_facebook_likes 0.48 0.95 0.22 0.25
facenumber_in_poster 0.10 0.05 -0.04 -0.04
num_user_for_reviews 0.22 0.12 0.55 0.78
budget 0.27 0.15 0.64 0.40
title_year 0.13 0.09 0.06 0.03
actor_2_facebook_likes 0.55 0.38 0.25 0.25
imdb_score 0.09 0.12 0.27 0.51
aspect_ratio 0.05 0.05 0.07 0.09
movie_facebook_likes 0.31 0.12 0.38 0.52
cast_total_facebook_likes facenumber_in_poster num_user_for_reviews
num_critic_for_reviews 0.25 -0.03 0.57
duration 0.13 0.01 0.36
director_facebook_likes 0.12 -0.05 0.24
actor_3_facebook_likes 0.48 0.10 0.22
actor_1_facebook_likes 0.95 0.05 0.12
gross 0.22 -0.04 0.55
num_voted_users 0.25 -0.04 0.78
cast_total_facebook_likes 1.00 0.07 0.18
facenumber_in_poster 0.07 1.00 -0.09
num_user_for_reviews 0.18 -0.09 1.00
budget 0.23 -0.03 0.40
title_year 0.13 0.08 0.03
actor_2_facebook_likes 0.63 0.07 0.20
imdb_score 0.14 -0.07 0.35
aspect_ratio 0.07 0.01 0.10
movie_facebook_likes 0.21 0.01 0.39
budget title_year actor_2_facebook_likes imdb_score aspect_ratio
num_critic_for_reviews 0.49 0.42 0.28 0.36 0.18
duration 0.30 -0.11 0.15 0.38 0.16
director_facebook_likes 0.09 -0.06 0.12 0.22 0.05
actor_3_facebook_likes 0.27 0.13 0.55 0.09 0.05
actor_1_facebook_likes 0.15 0.09 0.38 0.12 0.05
gross 0.64 0.06 0.25 0.27 0.07
num_voted_users 0.40 0.03 0.25 0.51 0.09
cast_total_facebook_likes 0.23 0.13 0.63 0.14 0.07
facenumber_in_poster -0.03 0.08 0.07 -0.07 0.01
num_user_for_reviews 0.40 0.03 0.20 0.35 0.10
budget 1.00 0.25 0.25 0.07 0.18
title_year 0.25 1.00 0.13 -0.14 0.22
actor_2_facebook_likes 0.25 0.13 1.00 0.13 0.07
imdb_score 0.07 -0.14 0.13 1.00 0.04
aspect_ratio 0.18 0.22 0.07 0.04 1.00
movie_facebook_likes 0.33 0.31 0.25 0.29 0.11
movie_facebook_likes
num_critic_for_reviews 0.71
duration 0.25
director_facebook_likes 0.17
actor_3_facebook_likes 0.31
actor_1_facebook_likes 0.12
gross 0.38
num_voted_users 0.52
cast_total_facebook_likes 0.21
facenumber_in_poster 0.01
num_user_for_reviews 0.39
budget 0.33
title_year 0.31
actor_2_facebook_likes 0.25
imdb_score 0.29
aspect_ratio 0.11
movie_facebook_likes 1.00
Sample Size
[1] 3005
Probability values (Entries above the diagonal are adjusted for multiple tests.)
num_critic_for_reviews duration director_facebook_likes
num_critic_for_reviews 0.00 0.00 0.00
duration 0.00 0.00 0.00
director_facebook_likes 0.00 0.00 0.00
actor_3_facebook_likes 0.00 0.00 0.00
actor_1_facebook_likes 0.00 0.00 0.00
gross 0.00 0.00 0.00
num_voted_users 0.00 0.00 0.00
cast_total_facebook_likes 0.00 0.00 0.00
facenumber_in_poster 0.09 0.66 0.00
num_user_for_reviews 0.00 0.00 0.00
budget 0.00 0.00 0.00
title_year 0.00 0.00 0.00
actor_2_facebook_likes 0.00 0.00 0.00
imdb_score 0.00 0.00 0.00
aspect_ratio 0.00 0.00 0.01
movie_facebook_likes 0.00 0.00 0.00
actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users
num_critic_for_reviews 0.00 0.00 0.00 0.00
duration 0.00 0.00 0.00 0.00
director_facebook_likes 0.00 0.00 0.00 0.00
actor_3_facebook_likes 0.00 0.00 0.00 0.00
actor_1_facebook_likes 0.00 0.00 0.00 0.00
gross 0.00 0.00 0.00 0.00
num_voted_users 0.00 0.00 0.00 0.00
cast_total_facebook_likes 0.00 0.00 0.00 0.00
facenumber_in_poster 0.00 0.01 0.05 0.02
num_user_for_reviews 0.00 0.00 0.00 0.00
budget 0.00 0.00 0.00 0.00
title_year 0.00 0.00 0.00 0.10
actor_2_facebook_likes 0.00 0.00 0.00 0.00
imdb_score 0.00 0.00 0.00 0.00
aspect_ratio 0.01 0.00 0.00 0.00
movie_facebook_likes 0.00 0.00 0.00 0.00
cast_total_facebook_likes facenumber_in_poster num_user_for_reviews
num_critic_for_reviews 0 0.65 0.00
duration 0 1.00 0.00
director_facebook_likes 0 0.06 0.00
actor_3_facebook_likes 0 0.00 0.00
actor_1_facebook_likes 0 0.07 0.00
gross 0 0.37 0.00
num_voted_users 0 0.17 0.00
cast_total_facebook_likes 0 0.00 0.00
facenumber_in_poster 0 0.00 0.00
num_user_for_reviews 0 0.00 0.00
budget 0 0.14 0.00
title_year 0 0.00 0.12
actor_2_facebook_likes 0 0.00 0.00
imdb_score 0 0.00 0.00
aspect_ratio 0 0.55 0.00
movie_facebook_likes 0 0.50 0.00
budget title_year actor_2_facebook_likes imdb_score aspect_ratio
num_critic_for_reviews 0.00 0.00 0.00 0.00 0.00
duration 0.00 0.00 0.00 0.00 0.00
director_facebook_likes 0.00 0.04 0.00 0.00 0.10
actor_3_facebook_likes 0.00 0.00 0.00 0.00 0.07
actor_1_facebook_likes 0.00 0.00 0.00 0.00 0.05
gross 0.00 0.04 0.00 0.00 0.00
num_voted_users 0.00 0.65 0.00 0.00 0.00
cast_total_facebook_likes 0.00 0.00 0.00 0.00 0.00
facenumber_in_poster 0.65 0.00 0.01 0.00 1.00
num_user_for_reviews 0.00 0.65 0.00 0.00 0.00
budget 0.00 0.00 0.00 0.00 0.00
title_year 0.00 0.00 0.00 0.00 0.00
actor_2_facebook_likes 0.00 0.00 0.00 0.00 0.00
imdb_score 0.00 0.00 0.00 0.00 0.34
aspect_ratio 0.00 0.00 0.00 0.04 0.00
movie_facebook_likes 0.00 0.00 0.00 0.00 0.00
movie_facebook_likes
num_critic_for_reviews 0
duration 0
director_facebook_likes 0
actor_3_facebook_likes 0
actor_1_facebook_likes 0
gross 0
num_voted_users 0
cast_total_facebook_likes 0
facenumber_in_poster 1
num_user_for_reviews 0
budget 0
title_year 0
actor_2_facebook_likes 0
imdb_score 0
aspect_ratio 0
movie_facebook_likes 0
To see confidence intervals of the correlations, print with the short=FALSE option
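The full matrix is hard to scan by eye, so it can help to list the variable pairs sorted by the strength of their correlation. The sketch below uses the built-in mtcars data as a stand-in for movie.num; the object names cm, idx and pairs.df are ours and purely illustrative.

```r
# Built-in mtcars stands in for movie.num here
cm <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])

# Positions of the cells above the diagonal, so each pair appears once
idx <- which(upper.tri(cm), arr.ind = TRUE)

pairs.df <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                       var2 = colnames(cm)[idx[, "col"]],
                       r    = cm[idx],
                       stringsAsFactors = FALSE)

# Strongest relationships first, regardless of sign
pairs.df <- pairs.df[order(-abs(pairs.df$r)), ]
head(pairs.df, 3)
```

The same steps applied to cor(movie.num) would put pairs such as cast_total_facebook_likes and actor_1_facebook_likes (0.95 above) at the top of the list.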
# Boxplots for significant categorical predictors
Boxplot(movie$imdb_score,movie$color)
[1] "2110" "1763" "2467" "2216" "2391" "2541" "270" "1708" "2477" "423" "1530" "2444"
Black-and-white movies seem to have a higher median rating and slightly higher scores overall. Color movies have many outliers.
Boxplot for genre:
fill <- "Blue"
line <- "Red"
ggplot(movie, aes(x = genres, y =imdb_score)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "IMDB Score",
breaks = seq(0, 11, 0.5),
limits=c(0, 11)) +
scale_x_discrete(name = "Genres") +
ggtitle("Boxplot of IMDB Score and Genres")
From the boxplot of genres, “Documentary” has the highest median score and “Thriller” has the lowest median. The latter is not a reliable result, however, because there is only 1 observation for thriller movies in our data set.
summary(movie$genres)
Action Adventure Animation Biography Comedy Crime Documentary
751 291 36 137 853 204 25
Drama Family Fantasy Film-Noir Game-Show History Horror
506 3 31 0 0 0 138
Music Musical Mystery Romance Sci-Fi Thriller Western
0 2 16 2 7 1 2
library(ggplot2)
fill <- "Blue"
line <- "Red"
ggplot(movie, aes(x = as.factor(title_year), y =imdb_score)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "IMDB Score",
breaks = seq(1.5, 10, 0.5),
limits=c(1.5, 10)) +
scale_x_discrete(name = "title_year") +
ggtitle("Boxplot of IMDB Score and Title Year")
The median IMDB score seems to differ from year to year, so let’s try to treat title_year as categorical.
# Scatter plot matrix for correlation significant numerical variables
scatterplotMatrix(~movie$imdb_score+movie$num_voted_users+movie$num_critic_for_reviews+movie$num_user_for_reviews+movie$duration+movie$facenumber_in_poster+movie$gross+movie$movie_facebook_likes+movie$director_facebook_likes+movie$cast_total_facebook_likes+movie$budget)
movie.sig<-movie[,c('imdb_score','num_voted_users','num_critic_for_reviews','num_user_for_reviews','duration','facenumber_in_poster','gross','movie_facebook_likes','director_facebook_likes','cast_total_facebook_likes','budget','title_year','genres')]
Step function to select a model by the AIC criterion:
null=lm(movie.sig$imdb_score~1) # set null model
summary(null)
Call:
lm(formula = movie.sig$imdb_score ~ 1)
Residuals:
Min 1Q Median 3Q Max
-4.7873 -0.5873 0.1127 0.7127 2.9127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3873 0.0192 332.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.053 on 3004 degrees of freedom
full1=lm(movie.sig$imdb_score~movie.sig$num_voted_users+movie.sig$num_critic_for_reviews+movie.sig$num_user_for_reviews+movie.sig$duration+movie.sig$facenumber_in_poster+movie.sig$gross+movie.sig$movie_facebook_likes+movie.sig$director_facebook_likes+movie.sig$cast_total_facebook_likes+movie.sig$budget+movie.sig$title_year+factor(movie.sig$genres))
summary(full1)
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
movie.sig$num_critic_for_reviews + movie.sig$num_user_for_reviews +
movie.sig$duration + movie.sig$facenumber_in_poster + movie.sig$gross +
movie.sig$movie_facebook_likes + movie.sig$director_facebook_likes +
movie.sig$cast_total_facebook_likes + movie.sig$budget +
movie.sig$title_year + factor(movie.sig$genres))
Residuals:
Min 1Q Median 3Q Max
-4.9157 -0.3693 0.0835 0.4993 2.0350
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.413e+01 3.604e+00 15.019 < 2e-16 ***
movie.sig$num_voted_users 3.158e-06 1.757e-07 17.969 < 2e-16 ***
movie.sig$num_critic_for_reviews 3.333e-03 2.119e-04 15.727 < 2e-16 ***
movie.sig$num_user_for_reviews -4.887e-04 5.976e-05 -8.177 4.26e-16 ***
movie.sig$duration 8.491e-03 7.848e-04 10.820 < 2e-16 ***
movie.sig$facenumber_in_poster -1.750e-02 6.947e-03 -2.519 0.01182 *
movie.sig$gross 2.247e-10 3.096e-10 0.726 0.46808
movie.sig$movie_facebook_likes -4.007e-06 9.702e-07 -4.131 3.72e-05 ***
movie.sig$director_facebook_likes 2.832e-07 4.562e-06 0.062 0.95051
movie.sig$cast_total_facebook_likes 1.110e-06 7.323e-07 1.516 0.12975
movie.sig$budget -4.486e-09 5.125e-10 -8.753 < 2e-16 ***
movie.sig$title_year -2.467e-02 1.797e-03 -13.727 < 2e-16 ***
factor(movie.sig$genres)Adventure 3.458e-01 5.448e-02 6.347 2.53e-10 ***
factor(movie.sig$genres)Animation 6.621e-01 1.345e-01 4.924 8.93e-07 ***
factor(movie.sig$genres)Biography 6.557e-01 7.661e-02 8.558 < 2e-16 ***
factor(movie.sig$genres)Comedy 1.532e-01 4.361e-02 3.513 0.00045 ***
factor(movie.sig$genres)Crime 4.551e-01 6.464e-02 7.040 2.37e-12 ***
factor(movie.sig$genres)Documentary 9.270e-01 1.608e-01 5.765 8.98e-09 ***
factor(movie.sig$genres)Drama 5.326e-01 4.904e-02 10.861 < 2e-16 ***
factor(movie.sig$genres)Family 2.201e-01 4.521e-01 0.487 0.62639
factor(movie.sig$genres)Fantasy -1.629e-01 1.448e-01 -1.125 0.26068
factor(movie.sig$genres)Horror -3.858e-01 7.777e-02 -4.961 7.41e-07 ***
factor(movie.sig$genres)Musical -4.133e-01 5.573e-01 -0.742 0.45839
factor(movie.sig$genres)Mystery 1.968e-01 1.979e-01 0.995 0.32005
factor(movie.sig$genres)Romance 5.466e-01 5.506e-01 0.993 0.32095
factor(movie.sig$genres)Sci-Fi 2.551e-01 2.960e-01 0.862 0.38870
factor(movie.sig$genres)Thriller -4.301e-01 7.786e-01 -0.552 0.58077
factor(movie.sig$genres)Western -1.037e-01 5.521e-01 -0.188 0.85101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7768 on 2977 degrees of freedom
Multiple R-squared: 0.4604, Adjusted R-squared: 0.4555
F-statistic: 94.07 on 27 and 2977 DF, p-value: < 2.2e-16
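For linear models, the AIC that step() prints is the value returned by extractAIC(), namely n * log(RSS / n) + 2 * p, where n is the number of observations, RSS the residual sum of squares and p the number of estimated coefficients. For our null model this gives 3005 * log(3329.1 / 3005) + 2 * 1, which is about 309.8 and agrees with the starting AIC in the stepwise search that follows. The sketch below reproduces this arithmetic and a forward search on the built-in mtcars data as a stand-in for movie.sig; the choice of columns is arbitrary. It also passes the data through the data argument with bare column names, which is an alternative to the movie.sig$ prefixes and is needed if predict() is later to be used on new data.

```r
# Built-in mtcars stands in for movie.sig here
null <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)

# AIC as reported by step(): n * log(RSS / n) + 2 * p
n   <- nrow(mtcars)
rss <- sum(residuals(null)^2)
p   <- length(coef(null))
aic_by_hand <- n * log(rss / n) + 2 * p

aic_by_hand
extractAIC(null)[2]   # the value step() shows as "Start: AIC="

# Forward selection from the null model up to the full model
sel <- step(null,
            scope = list(lower = formula(null), upper = formula(full)),
            direction = "forward", trace = 0)
formula(sel)
```

Because this AIC omits constants that the generic AIC() function includes, the two should not be compared with each other; only differences between models fitted to the same data are meaningful.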
step(null,scope = list(lower=null,upper=full1),direction = 'forward')
Start: AIC=309.81
movie.sig$imdb_score ~ 1
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users 1 871.90 2457.2 -600.74
+ movie.sig$duration 1 491.13 2838.0 -167.82
+ movie.sig$num_critic_for_reviews 1 428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews 1 407.62 2921.5 -80.68
+ factor(movie.sig$genres) 16 331.02 2998.1 27.10
+ movie.sig$movie_facebook_likes 1 282.82 3046.3 45.02
+ movie.sig$gross 1 242.62 3086.5 84.42
+ movie.sig$director_facebook_likes 1 166.17 3163.0 157.95
+ movie.sig$title_year 1 69.27 3259.9 248.63
+ movie.sig$cast_total_facebook_likes 1 64.28 3264.8 253.22
+ movie.sig$budget 1 16.26 3312.9 297.09
+ movie.sig$facenumber_in_poster 1 15.14 3314.0 298.11
<none> 3329.1 309.81
Step: AIC=-600.74
movie.sig$imdb_score ~ movie.sig$num_voted_users
Df Sum of Sq RSS AIC
+ factor(movie.sig$genres) 16 311.531 2145.7 -976.12
+ movie.sig$duration 1 147.786 2309.4 -785.13
+ movie.sig$title_year 1 84.649 2372.6 -704.08
+ movie.sig$budget 1 73.211 2384.0 -689.63
+ movie.sig$num_user_for_reviews 1 21.297 2435.9 -624.90
+ movie.sig$gross 1 16.929 2440.3 -619.51
+ movie.sig$num_critic_for_reviews 1 14.632 2442.6 -616.69
+ movie.sig$director_facebook_likes 1 13.657 2443.6 -615.49
+ movie.sig$facenumber_in_poster 1 6.789 2450.4 -607.05
+ movie.sig$movie_facebook_likes 1 2.627 2454.6 -601.95
<none> 2457.2 -600.74
+ movie.sig$cast_total_facebook_likes 1 0.524 2456.7 -599.38
Step: AIC=-976.12
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres)
Df Sum of Sq RSS AIC
+ movie.sig$title_year 1 79.623 2066.1 -1087.75
+ movie.sig$duration 1 74.584 2071.1 -1080.44
+ movie.sig$budget 1 28.689 2117.0 -1014.57
+ movie.sig$num_critic_for_reviews 1 23.116 2122.6 -1006.67
+ movie.sig$num_user_for_reviews 1 12.251 2133.4 -991.33
+ movie.sig$director_facebook_likes 1 3.707 2142.0 -979.32
+ movie.sig$facenumber_in_poster 1 3.274 2142.4 -978.71
+ movie.sig$movie_facebook_likes 1 1.686 2144.0 -976.49
<none> 2145.7 -976.12
+ movie.sig$gross 1 1.391 2144.3 -976.07
+ movie.sig$cast_total_facebook_likes 1 0.362 2145.3 -974.63
Step: AIC=-1087.75
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year
Df Sum of Sq RSS AIC
+ movie.sig$num_critic_for_reviews 1 125.091 1941.0 -1273.4
+ movie.sig$duration 1 55.857 2010.2 -1168.1
+ movie.sig$movie_facebook_likes 1 21.746 2044.3 -1117.5
+ movie.sig$num_user_for_reviews 1 11.741 2054.3 -1102.9
+ movie.sig$budget 1 9.196 2056.9 -1099.2
+ movie.sig$cast_total_facebook_likes 1 2.923 2063.2 -1090.0
+ movie.sig$director_facebook_likes 1 1.740 2064.3 -1088.3
<none> 2066.1 -1087.8
+ movie.sig$facenumber_in_poster 1 1.084 2065.0 -1087.3
+ movie.sig$gross 1 0.638 2065.4 -1086.7
Step: AIC=-1273.43
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$budget 1 36.627 1904.4 -1328.7
+ movie.sig$num_user_for_reviews 1 35.326 1905.7 -1326.6
+ movie.sig$duration 1 34.873 1906.1 -1325.9
+ movie.sig$gross 1 7.359 1933.6 -1282.8
+ movie.sig$movie_facebook_likes 1 1.397 1939.6 -1273.6
<none> 1941.0 -1273.4
+ movie.sig$facenumber_in_poster 1 0.926 1940.1 -1272.9
+ movie.sig$director_facebook_likes 1 0.644 1940.3 -1272.4
+ movie.sig$cast_total_facebook_likes 1 0.572 1940.4 -1272.3
Step: AIC=-1328.68
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget
Df Sum of Sq RSS AIC
+ movie.sig$duration 1 58.373 1846.0 -1420.2
+ movie.sig$num_user_for_reviews 1 27.052 1877.3 -1369.7
+ movie.sig$movie_facebook_likes 1 2.576 1901.8 -1330.8
+ movie.sig$cast_total_facebook_likes 1 2.005 1902.3 -1329.8
<none> 1904.4 -1328.7
+ movie.sig$facenumber_in_poster 1 1.071 1903.3 -1328.4
+ movie.sig$director_facebook_likes 1 0.557 1903.8 -1327.6
+ movie.sig$gross 1 0.074 1904.3 -1326.8
Step: AIC=-1420.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_user_for_reviews 1 33.825 1812.2 -1473.8
+ movie.sig$movie_facebook_likes 1 4.702 1841.3 -1425.9
+ movie.sig$facenumber_in_poster 1 2.488 1843.5 -1422.3
+ movie.sig$cast_total_facebook_likes 1 1.601 1844.4 -1420.8
<none> 1846.0 -1420.2
+ movie.sig$gross 1 0.196 1845.8 -1418.5
+ movie.sig$director_facebook_likes 1 0.043 1845.9 -1418.3
Step: AIC=-1473.81
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$movie_facebook_likes 1 10.4792 1801.7 -1489.2
+ movie.sig$facenumber_in_poster 1 3.7911 1808.4 -1478.1
<none> 1812.2 -1473.8
+ movie.sig$cast_total_facebook_likes 1 0.9926 1811.2 -1473.5
+ movie.sig$gross 1 0.3569 1811.8 -1472.4
+ movie.sig$director_facebook_likes 1 0.0128 1812.2 -1471.8
Step: AIC=-1489.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes
Df Sum of Sq RSS AIC
+ movie.sig$facenumber_in_poster 1 3.5218 1798.2 -1493.1
<none> 1801.7 -1489.2
+ movie.sig$cast_total_facebook_likes 1 1.0918 1800.6 -1489.0
+ movie.sig$gross 1 0.3413 1801.3 -1487.8
+ movie.sig$director_facebook_likes 1 0.0167 1801.7 -1487.3
Step: AIC=-1493.11
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster
Df Sum of Sq RSS AIC
+ movie.sig$cast_total_facebook_likes 1 1.41883 1796.7 -1493.5
<none> 1798.2 -1493.1
+ movie.sig$gross 1 0.33944 1797.8 -1491.7
+ movie.sig$director_facebook_likes 1 0.00320 1798.2 -1491.1
Step: AIC=-1493.48
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster +
movie.sig$cast_total_facebook_likes
Df Sum of Sq RSS AIC
<none> 1796.7 -1493.5
+ movie.sig$gross 1 0.31546 1796.4 -1492.0
+ movie.sig$director_facebook_likes 1 0.00000 1796.7 -1491.5
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
factor(movie.sig$genres) + movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster +
movie.sig$cast_total_facebook_likes)
Coefficients:
(Intercept) movie.sig$num_voted_users
5.446e+01 3.203e-06
factor(movie.sig$genres)Adventure factor(movie.sig$genres)Animation
3.495e-01 6.687e-01
factor(movie.sig$genres)Biography factor(movie.sig$genres)Comedy
6.564e-01 1.558e-01
factor(movie.sig$genres)Crime factor(movie.sig$genres)Documentary
4.522e-01 9.302e-01
factor(movie.sig$genres)Drama factor(movie.sig$genres)Family
5.326e-01 2.466e-01
factor(movie.sig$genres)Fantasy factor(movie.sig$genres)Horror
-1.616e-01 -3.839e-01
factor(movie.sig$genres)Musical factor(movie.sig$genres)Mystery
-4.044e-01 1.950e-01
factor(movie.sig$genres)Romance factor(movie.sig$genres)Sci-Fi
5.455e-01 2.483e-01
factor(movie.sig$genres)Thriller factor(movie.sig$genres)Western
-4.271e-01 -9.845e-02
movie.sig$title_year movie.sig$num_critic_for_reviews
-2.483e-02 3.339e-03
movie.sig$budget movie.sig$duration
-4.311e-09 8.481e-03
movie.sig$num_user_for_reviews movie.sig$movie_facebook_likes
-4.876e-04 -4.010e-06
movie.sig$facenumber_in_poster movie.sig$cast_total_facebook_likes
-1.753e-02 1.121e-06
full2 = lm(movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) +
             poly(movie.sig$num_critic_for_reviews, 2) +
             poly(movie.sig$num_user_for_reviews, 2) +
             poly(movie.sig$duration, 2) +
             movie.sig$facenumber_in_poster +
             poly(movie.sig$gross, 2) +
             poly(movie.sig$movie_facebook_likes, 2) +
             movie.sig$director_facebook_likes +
             movie.sig$cast_total_facebook_likes +
             movie.sig$budget +
             movie.sig$title_year +
             movie.sig$genres +
             movie.sig$facenumber_in_poster * movie.sig$num_critic_for_reviews +
             movie.sig$num_user_for_reviews * movie.sig$num_voted_users +
             movie.sig$num_voted_users * movie.sig$gross +
             movie.sig$gross * movie.sig$budget)
summary(full2)
Call:
lm(formula = movie.sig$imdb_score ~ poly(movie.sig$num_voted_users,
2) + poly(movie.sig$num_critic_for_reviews, 2) + poly(movie.sig$num_user_for_reviews,
2) + poly(movie.sig$duration, 2) + movie.sig$facenumber_in_poster +
poly(movie.sig$gross, 2) + poly(movie.sig$movie_facebook_likes,
2) + movie.sig$director_facebook_likes + movie.sig$cast_total_facebook_likes +
movie.sig$budget + movie.sig$title_year + movie.sig$genres +
movie.sig$facenumber_in_poster * movie.sig$num_critic_for_reviews +
movie.sig$num_user_for_reviews * movie.sig$num_voted_users +
movie.sig$num_voted_users * movie.sig$gross + movie.sig$gross *
movie.sig$budget)
Residuals:
Min 1Q Median 3Q Max
-5.3608 -0.3549 0.0642 0.4619 2.1792
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 5.948e+01 3.617e+00 16.446
poly(movie.sig$num_voted_users, 2)1 2.305e+01 3.426e+00 6.727
poly(movie.sig$num_voted_users, 2)2 -1.873e+01 2.200e+00 -8.514
poly(movie.sig$num_critic_for_reviews, 2)1 1.393e+01 1.661e+00 8.388
poly(movie.sig$num_critic_for_reviews, 2)2 -9.490e+00 1.004e+00 -9.452
poly(movie.sig$num_user_for_reviews, 2)1 -1.760e+01 2.325e+00 -7.568
poly(movie.sig$num_user_for_reviews, 2)2 4.166e+00 1.593e+00 2.615
poly(movie.sig$duration, 2)1 1.087e+01 9.246e-01 11.755
poly(movie.sig$duration, 2)2 -3.883e+00 7.809e-01 -4.973
movie.sig$facenumber_in_poster -2.093e-02 1.106e-02 -1.892
poly(movie.sig$gross, 2)1 -1.454e+01 2.418e+00 -6.012
poly(movie.sig$gross, 2)2 -5.285e+00 1.483e+00 -3.565
poly(movie.sig$movie_facebook_likes, 2)1 2.580e+00 1.322e+00 1.952
poly(movie.sig$movie_facebook_likes, 2)2 2.283e-01 8.238e-01 0.277
movie.sig$director_facebook_likes 4.608e-06 4.409e-06 1.045
movie.sig$cast_total_facebook_likes 2.533e-07 7.013e-07 0.361
movie.sig$budget -7.852e-09 7.213e-10 -10.886
movie.sig$title_year -2.656e-02 1.809e-03 -14.680
movie.sig$genresAdventure 3.727e-01 5.268e-02 7.075
movie.sig$genresAnimation 7.564e-01 1.298e-01 5.828
movie.sig$genresBiography 6.264e-01 7.351e-02 8.522
movie.sig$genresComedy 1.576e-01 4.205e-02 3.747
movie.sig$genresCrime 4.558e-01 6.236e-02 7.309
movie.sig$genresDocumentary 9.738e-01 1.542e-01 6.317
movie.sig$genresDrama 5.230e-01 4.726e-02 11.067
movie.sig$genresFamily 5.958e-01 4.362e-01 1.366
movie.sig$genresFantasy -1.891e-01 1.387e-01 -1.364
movie.sig$genresHorror -3.533e-01 7.597e-02 -4.650
movie.sig$genresMusical -4.744e-01 5.328e-01 -0.890
movie.sig$genresMystery 1.947e-01 1.891e-01 1.029
movie.sig$genresRomance 6.094e-01 5.254e-01 1.160
movie.sig$genresSci-Fi 1.471e-01 2.827e-01 0.520
movie.sig$genresThriller -3.085e-01 7.433e-01 -0.415
movie.sig$genresWestern -4.204e-02 5.272e-01 -0.080
movie.sig$num_critic_for_reviews NA NA NA
movie.sig$num_user_for_reviews NA NA NA
movie.sig$num_voted_users NA NA NA
movie.sig$gross NA NA NA
movie.sig$facenumber_in_poster:movie.sig$num_critic_for_reviews -1.291e-06 4.260e-05 -0.030
movie.sig$num_user_for_reviews:movie.sig$num_voted_users 7.966e-10 2.817e-10 2.828
movie.sig$num_voted_users:movie.sig$gross 1.498e-15 1.105e-15 1.355
movie.sig$budget:movie.sig$gross 2.946e-17 4.104e-18 7.180
Pr(>|t|)
(Intercept) < 2e-16 ***
poly(movie.sig$num_voted_users, 2)1 2.07e-11 ***
poly(movie.sig$num_voted_users, 2)2 < 2e-16 ***
poly(movie.sig$num_critic_for_reviews, 2)1 < 2e-16 ***
poly(movie.sig$num_critic_for_reviews, 2)2 < 2e-16 ***
poly(movie.sig$num_user_for_reviews, 2)1 5.01e-14 ***
poly(movie.sig$num_user_for_reviews, 2)2 0.008973 **
poly(movie.sig$duration, 2)1 < 2e-16 ***
poly(movie.sig$duration, 2)2 6.98e-07 ***
movie.sig$facenumber_in_poster 0.058589 .
poly(movie.sig$gross, 2)1 2.05e-09 ***
poly(movie.sig$gross, 2)2 0.000370 ***
poly(movie.sig$movie_facebook_likes, 2)1 0.051079 .
poly(movie.sig$movie_facebook_likes, 2)2 0.781673
movie.sig$director_facebook_likes 0.296005
movie.sig$cast_total_facebook_likes 0.717999
movie.sig$budget < 2e-16 ***
movie.sig$title_year < 2e-16 ***
movie.sig$genresAdventure 1.86e-12 ***
movie.sig$genresAnimation 6.23e-09 ***
movie.sig$genresBiography < 2e-16 ***
movie.sig$genresComedy 0.000183 ***
movie.sig$genresCrime 3.45e-13 ***
movie.sig$genresDocumentary 3.06e-10 ***
movie.sig$genresDrama < 2e-16 ***
movie.sig$genresFamily 0.172120
movie.sig$genresFantasy 0.172721
movie.sig$genresHorror 3.46e-06 ***
movie.sig$genresMusical 0.373334
movie.sig$genresMystery 0.303477
movie.sig$genresRomance 0.246252
movie.sig$genresSci-Fi 0.602769
movie.sig$genresThriller 0.678129
movie.sig$genresWestern 0.936448
movie.sig$num_critic_for_reviews NA
movie.sig$num_user_for_reviews NA
movie.sig$num_voted_users NA
movie.sig$gross NA
movie.sig$facenumber_in_poster:movie.sig$num_critic_for_reviews 0.975820
movie.sig$num_user_for_reviews:movie.sig$num_voted_users 0.004714 **
movie.sig$num_voted_users:movie.sig$gross 0.175492
movie.sig$budget:movie.sig$gross 8.80e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.741 on 2967 degrees of freedom
Multiple R-squared: 0.5107, Adjusted R-squared: 0.5046
F-statistic: 83.69 on 37 and 2967 DF, p-value: < 2.2e-16
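The note "4 not defined because of singularities" and the four NA rows in the summary above are consistent with how `*` expands in an R formula: each `a * b` term also adds the raw main effects `a` and `b`, and a raw column is a linear combination of the intercept and the first-degree `poly()` column already in the model. The following is a minimal sketch on simulated data (the names `x`, `z`, `y`, `fit_star` and `fit_colon` are made up for illustration and are not part of the analysis) that reproduces the effect and shows one possible way around it, using `:` to add only the product term.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + x - 0.5 * x^2 + 0.3 * x * z + rnorm(n)

# x * z expands to x + z + x:z; the raw x is already spanned by
# the intercept and poly(x, 2), so lm() reports it as NA (aliased).
fit_star <- lm(y ~ poly(x, 2) + x * z)

# ':' adds only the interaction column, so no duplicate main effect.
fit_colon <- lm(y ~ poly(x, 2) + z + x:z)

aliased_star <- names(which(is.na(coef(fit_star))))
aliased_colon <- names(which(is.na(coef(fit_colon))))
print(aliased_star)
print(aliased_colon)
```

The aliased terms do not change the fitted values; they only mean that the raw main effects carry no extra information once the polynomial terms are present.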
step(null,scope=list(lower=null,upper=full2),direction='forward')
Start: AIC=309.81
movie.sig$imdb_score ~ 1
Df Sum of Sq RSS AIC
+ poly(movie.sig$num_voted_users, 2) 2 976.96 2352.2 -730.05
+ movie.sig$num_voted_users 1 871.90 2457.2 -600.74
+ poly(movie.sig$duration, 2) 2 536.11 2793.0 -213.83
+ poly(movie.sig$num_user_for_reviews, 2) 2 483.99 2845.1 -158.27
+ poly(movie.sig$num_critic_for_reviews, 2) 2 436.49 2892.6 -108.52
+ movie.sig$num_critic_for_reviews 1 428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews 1 407.62 2921.5 -80.68
+ poly(movie.sig$movie_facebook_likes, 2) 2 317.80 3011.3 12.32
+ movie.sig$genres 16 331.02 2998.1 27.10
+ poly(movie.sig$gross, 2) 2 251.27 3077.9 77.99
+ movie.sig$gross 1 242.62 3086.5 84.42
+ movie.sig$director_facebook_likes 1 166.17 3163.0 157.95
+ movie.sig$title_year 1 69.27 3259.9 248.63
+ movie.sig$cast_total_facebook_likes 1 64.28 3264.8 253.22
+ movie.sig$budget 1 16.26 3312.9 297.09
+ movie.sig$facenumber_in_poster 1 15.14 3314.0 298.11
<none> 3329.1 309.81
Step: AIC=-730.05
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2)
Df Sum of Sq RSS AIC
+ movie.sig$genres 16 337.58 2014.6 -1163.60
+ poly(movie.sig$duration, 2) 2 137.87 2214.3 -907.55
+ movie.sig$budget 1 133.09 2219.1 -903.07
+ movie.sig$title_year 1 101.46 2250.7 -860.55
+ poly(movie.sig$gross, 2) 2 58.78 2293.4 -802.09
+ movie.sig$gross 1 54.53 2297.6 -798.53
+ poly(movie.sig$num_user_for_reviews, 2) 2 29.12 2323.1 -763.48
+ movie.sig$num_user_for_reviews 1 25.39 2326.8 -760.66
+ movie.sig$director_facebook_likes 1 17.94 2334.2 -751.05
+ movie.sig$facenumber_in_poster 1 6.62 2345.5 -736.52
+ poly(movie.sig$num_critic_for_reviews, 2) 2 5.36 2346.8 -732.90
<none> 2352.2 -730.05
+ movie.sig$num_critic_for_reviews 1 0.18 2352.0 -728.28
+ movie.sig$cast_total_facebook_likes 1 0.15 2352.0 -728.23
+ poly(movie.sig$movie_facebook_likes, 2) 2 1.29 2350.9 -727.70
Step: AIC=-1163.6
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres
Df Sum of Sq RSS AIC
+ movie.sig$title_year 1 97.775 1916.8 -1311.1
+ movie.sig$budget 1 65.238 1949.3 -1260.5
+ poly(movie.sig$duration, 2) 2 65.750 1948.8 -1259.3
+ movie.sig$gross 1 19.722 1994.9 -1191.2
+ poly(movie.sig$gross, 2) 2 20.698 1993.9 -1190.6
+ poly(movie.sig$num_user_for_reviews, 2) 2 20.024 1994.6 -1189.6
+ movie.sig$num_user_for_reviews 1 14.834 1999.8 -1183.8
+ poly(movie.sig$num_critic_for_reviews, 2) 2 9.375 2005.2 -1173.6
+ movie.sig$director_facebook_likes 1 6.114 2008.5 -1170.7
+ movie.sig$facenumber_in_poster 1 3.792 2010.8 -1167.3
<none> 2014.6 -1163.6
+ movie.sig$cast_total_facebook_likes 1 0.355 2014.2 -1162.1
+ movie.sig$num_critic_for_reviews 1 0.042 2014.5 -1161.7
+ poly(movie.sig$movie_facebook_likes, 2) 2 0.813 2013.8 -1160.8
Step: AIC=-1311.1
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year
Df Sum of Sq RSS AIC
+ poly(movie.sig$num_critic_for_reviews, 2) 2 73.976 1842.8 -1425.4
+ poly(movie.sig$duration, 2) 2 49.885 1866.9 -1386.3
+ movie.sig$num_critic_for_reviews 1 43.723 1873.1 -1378.4
+ movie.sig$budget 1 32.246 1884.6 -1360.1
+ poly(movie.sig$num_user_for_reviews, 2) 2 21.755 1895.0 -1341.4
+ poly(movie.sig$gross, 2) 2 19.623 1897.2 -1338.0
+ movie.sig$gross 1 17.879 1898.9 -1337.3
+ poly(movie.sig$movie_facebook_likes, 2) 2 18.788 1898.0 -1336.7
+ movie.sig$num_user_for_reviews 1 14.396 1902.4 -1331.8
+ movie.sig$director_facebook_likes 1 3.373 1913.4 -1314.4
<none> 1916.8 -1311.1
+ movie.sig$facenumber_in_poster 1 1.216 1915.6 -1311.0
+ movie.sig$cast_total_facebook_likes 1 0.300 1916.5 -1309.6
Step: AIC=-1425.37
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2)
Df Sum of Sq RSS AIC
+ poly(movie.sig$num_user_for_reviews, 2) 2 54.189 1788.6 -1511.1
+ movie.sig$budget 1 46.017 1796.8 -1499.4
+ poly(movie.sig$duration, 2) 2 38.533 1804.3 -1484.9
+ movie.sig$num_user_for_reviews 1 33.751 1809.1 -1478.9
+ poly(movie.sig$gross, 2) 2 20.602 1822.2 -1455.2
+ movie.sig$gross 1 16.630 1826.2 -1450.6
+ poly(movie.sig$movie_facebook_likes, 2) 2 8.227 1834.6 -1434.8
+ movie.sig$director_facebook_likes 1 2.296 1840.5 -1427.1
<none> 1842.8 -1425.4
+ movie.sig$facenumber_in_poster 1 0.831 1842.0 -1424.7
+ movie.sig$cast_total_facebook_likes 1 0.104 1842.7 -1423.5
Step: AIC=-1511.06
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2)
Df Sum of Sq RSS AIC
+ poly(movie.sig$duration, 2) 2 51.219 1737.4 -1594.4
+ movie.sig$budget 1 34.907 1753.7 -1568.3
+ poly(movie.sig$gross, 2) 2 20.882 1767.8 -1542.3
+ movie.sig$gross 1 16.727 1771.9 -1537.3
+ poly(movie.sig$movie_facebook_likes, 2) 2 3.910 1784.7 -1513.6
+ movie.sig$director_facebook_likes 1 2.540 1786.1 -1513.3
+ movie.sig$facenumber_in_poster 1 1.970 1786.7 -1512.4
<none> 1788.6 -1511.1
+ movie.sig$cast_total_facebook_likes 1 0.022 1788.6 -1509.1
Step: AIC=-1594.36
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2)
Df Sum of Sq RSS AIC
+ movie.sig$budget 1 62.211 1675.2 -1701.9
+ poly(movie.sig$gross, 2) 2 30.406 1707.0 -1643.4
+ movie.sig$gross 1 23.936 1713.5 -1634.0
+ movie.sig$facenumber_in_poster 1 4.139 1733.3 -1599.5
<none> 1737.4 -1594.4
+ movie.sig$director_facebook_likes 1 0.946 1736.5 -1594.0
+ poly(movie.sig$movie_facebook_likes, 2) 2 1.928 1735.5 -1593.7
+ movie.sig$cast_total_facebook_likes 1 0.064 1737.4 -1592.5
Step: AIC=-1701.94
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget
Df Sum of Sq RSS AIC
+ movie.sig$facenumber_in_poster 1 5.0599 1670.2 -1709.0
+ poly(movie.sig$gross, 2) 2 4.5359 1670.7 -1706.1
+ movie.sig$gross 1 1.8995 1673.3 -1703.3
<none> 1675.2 -1701.9
+ movie.sig$director_facebook_likes 1 0.6239 1674.6 -1701.1
+ movie.sig$cast_total_facebook_likes 1 0.2471 1675.0 -1700.4
+ poly(movie.sig$movie_facebook_likes, 2) 2 0.8695 1674.3 -1699.5
Step: AIC=-1709.03
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget + movie.sig$facenumber_in_poster
Df Sum of Sq RSS AIC
+ poly(movie.sig$gross, 2) 2 4.6247 1665.5 -1713.4
+ movie.sig$gross 1 1.9720 1668.2 -1710.6
<none> 1670.2 -1709.0
+ movie.sig$director_facebook_likes 1 0.4874 1669.7 -1707.9
+ movie.sig$cast_total_facebook_likes 1 0.4443 1669.7 -1707.8
+ poly(movie.sig$movie_facebook_likes, 2) 2 0.8414 1669.3 -1706.5
Step: AIC=-1713.36
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget + movie.sig$facenumber_in_poster +
poly(movie.sig$gross, 2)
Df Sum of Sq RSS AIC
<none> 1665.5 -1713.4
+ movie.sig$director_facebook_likes 1 0.49076 1665.0 -1712.2
+ movie.sig$cast_total_facebook_likes 1 0.41310 1665.1 -1712.1
+ poly(movie.sig$movie_facebook_likes, 2) 2 1.10614 1664.4 -1711.4
Call:
lm(formula = movie.sig$imdb_score ~ poly(movie.sig$num_voted_users,
2) + movie.sig$genres + movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget + movie.sig$facenumber_in_poster +
poly(movie.sig$gross, 2))
Coefficients:
(Intercept) poly(movie.sig$num_voted_users, 2)1
5.851e+01 3.249e+01
poly(movie.sig$num_voted_users, 2)2 movie.sig$genresAdventure
-1.320e+01 3.770e-01
movie.sig$genresAnimation movie.sig$genresBiography
7.306e-01 6.559e-01
movie.sig$genresComedy movie.sig$genresCrime
1.875e-01 4.845e-01
movie.sig$genresDocumentary movie.sig$genresDrama
1.037e+00 5.524e-01
movie.sig$genresFamily movie.sig$genresFantasy
2.093e-01 -1.231e-01
movie.sig$genresHorror movie.sig$genresMusical
-2.986e-01 -4.597e-01
movie.sig$genresMystery movie.sig$genresRomance
2.304e-01 6.151e-01
movie.sig$genresSci-Fi movie.sig$genresThriller
1.706e-01 -2.631e-01
movie.sig$genresWestern movie.sig$title_year
5.056e-02 -2.605e-02
poly(movie.sig$num_critic_for_reviews, 2)1 poly(movie.sig$num_critic_for_reviews, 2)2
1.634e+01 -6.906e+00
poly(movie.sig$num_user_for_reviews, 2)1 poly(movie.sig$num_user_for_reviews, 2)2
-1.209e+01 7.641e+00
poly(movie.sig$duration, 2)1 poly(movie.sig$duration, 2)2
1.072e+01 -3.800e+00
movie.sig$budget movie.sig$facenumber_in_poster
-4.048e-09 -2.026e-02
poly(movie.sig$gross, 2)1 poly(movie.sig$gross, 2)2
-2.770e+00 1.851e+00
full3 = lm(movie.sig$imdb_score ~ movie.sig$num_voted_users +
             movie.sig$num_critic_for_reviews +
             movie.sig$num_user_for_reviews +
             movie.sig$duration +
             movie.sig$facenumber_in_poster +
             movie.sig$gross +
             movie.sig$movie_facebook_likes +
             movie.sig$director_facebook_likes +
             movie.sig$cast_total_facebook_likes +
             movie.sig$budget +
             movie.sig$title_year +
             factor(movie.sig$genres) +
             movie.sig$duration * movie.sig$num_voted_users +
             movie.sig$num_voted_users * movie.sig$num_user_for_reviews +
             movie.sig$gross * movie.sig$budget,
           data = movie.sig)
summary(full3)
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
movie.sig$num_critic_for_reviews + movie.sig$num_user_for_reviews +
movie.sig$duration + movie.sig$facenumber_in_poster + movie.sig$gross +
movie.sig$movie_facebook_likes + movie.sig$director_facebook_likes +
movie.sig$cast_total_facebook_likes + movie.sig$budget +
movie.sig$title_year + factor(movie.sig$genres) + movie.sig$duration *
movie.sig$num_voted_users + movie.sig$num_voted_users * movie.sig$num_user_for_reviews +
movie.sig$gross * movie.sig$budget, data = movie.sig)
Residuals:
Min 1Q Median 3Q Max
-5.0519 -0.3700 0.0863 0.4828 2.0996
Coefficients:
Estimate Std. Error t value
(Intercept) 4.748e+01 3.592e+00 13.218
movie.sig$num_voted_users 7.890e-06 4.790e-07 16.472
movie.sig$num_critic_for_reviews 2.427e-03 2.275e-04 10.669
movie.sig$num_user_for_reviews -3.039e-04 6.998e-05 -4.343
movie.sig$duration 1.277e-02 9.200e-04 13.882
movie.sig$facenumber_in_poster -1.858e-02 6.806e-03 -2.730
movie.sig$gross -1.469e-09 4.191e-10 -3.505
movie.sig$movie_facebook_likes -2.370e-06 9.659e-07 -2.454
movie.sig$director_facebook_likes 3.969e-06 4.482e-06 0.885
movie.sig$cast_total_facebook_likes 7.641e-07 7.181e-07 1.064
movie.sig$budget -5.900e-09 5.917e-10 -9.971
movie.sig$title_year -2.154e-02 1.790e-03 -12.032
factor(movie.sig$genres)Adventure 3.308e-01 5.338e-02 6.196
factor(movie.sig$genres)Animation 7.426e-01 1.319e-01 5.629
factor(movie.sig$genres)Biography 6.551e-01 7.512e-02 8.720
factor(movie.sig$genres)Comedy 1.515e-01 4.284e-02 3.537
factor(movie.sig$genres)Crime 4.496e-01 6.353e-02 7.077
factor(movie.sig$genres)Documentary 8.960e-01 1.579e-01 5.676
factor(movie.sig$genres)Drama 4.965e-01 4.835e-02 10.269
factor(movie.sig$genres)Family 3.329e-01 4.432e-01 0.751
factor(movie.sig$genres)Fantasy -1.544e-01 1.419e-01 -1.089
factor(movie.sig$genres)Horror -3.577e-01 7.638e-02 -4.683
factor(movie.sig$genres)Musical -2.616e-01 5.459e-01 -0.479
factor(movie.sig$genres)Mystery 1.263e-01 1.939e-01 0.652
factor(movie.sig$genres)Romance 5.476e-01 5.392e-01 1.016
factor(movie.sig$genres)Sci-Fi 1.673e-01 2.900e-01 0.577
factor(movie.sig$genres)Thriller -4.858e-01 7.627e-01 -0.637
factor(movie.sig$genres)Western -1.277e-01 5.408e-01 -0.236
movie.sig$num_voted_users:movie.sig$duration -3.052e-08 3.447e-09 -8.852
movie.sig$num_voted_users:movie.sig$num_user_for_reviews -3.752e-10 9.851e-11 -3.809
movie.sig$gross:movie.sig$budget 1.411e-17 2.887e-18 4.886
Pr(>|t|)
(Intercept) < 2e-16 ***
movie.sig$num_voted_users < 2e-16 ***
movie.sig$num_critic_for_reviews < 2e-16 ***
movie.sig$num_user_for_reviews 1.46e-05 ***
movie.sig$duration < 2e-16 ***
movie.sig$facenumber_in_poster 0.006371 **
movie.sig$gross 0.000463 ***
movie.sig$movie_facebook_likes 0.014175 *
movie.sig$director_facebook_likes 0.376035
movie.sig$cast_total_facebook_likes 0.287447
movie.sig$budget < 2e-16 ***
movie.sig$title_year < 2e-16 ***
factor(movie.sig$genres)Adventure 6.60e-10 ***
factor(movie.sig$genres)Animation 1.98e-08 ***
factor(movie.sig$genres)Biography < 2e-16 ***
factor(movie.sig$genres)Comedy 0.000411 ***
factor(movie.sig$genres)Crime 1.83e-12 ***
factor(movie.sig$genres)Documentary 1.51e-08 ***
factor(movie.sig$genres)Drama < 2e-16 ***
factor(movie.sig$genres)Family 0.452648
factor(movie.sig$genres)Fantasy 0.276414
factor(movie.sig$genres)Horror 2.95e-06 ***
factor(movie.sig$genres)Musical 0.631791
factor(movie.sig$genres)Mystery 0.514773
factor(movie.sig$genres)Romance 0.309947
factor(movie.sig$genres)Sci-Fi 0.563982
factor(movie.sig$genres)Thriller 0.524230
factor(movie.sig$genres)Western 0.813336
movie.sig$num_voted_users:movie.sig$duration < 2e-16 ***
movie.sig$num_voted_users:movie.sig$num_user_for_reviews 0.000143 ***
movie.sig$gross:movie.sig$budget 1.08e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7607 on 2974 degrees of freedom
Multiple R-squared: 0.483, Adjusted R-squared: 0.4778
F-statistic: 92.63 on 30 and 2974 DF, p-value: < 2.2e-16
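One practical point about the `lm()` calls above: because every term is written as `movie.sig$...`, the fitted model is tied to that exact data frame, and `predict()` will generally not substitute rows passed through `newdata` for those terms. The sketch below, on a made-up data frame, shows the bare-column form with `data =`, which keeps prediction on held-out rows working. The names `d`, `score`, `votes`, `train`, `test` and `fit` are illustrative assumptions, not objects from this analysis.

```r
set.seed(42)
# Simulated stand-in for the movie data (hypothetical columns)
d <- data.frame(
  score    = rnorm(100, mean = 6.5, sd = 1),
  votes    = rexp(100, rate = 1e-5),
  duration = rnorm(100, mean = 110, sd = 20)
)

# 90% of rows for training, the rest held out
idx   <- sample(seq_len(nrow(d)), size = floor(0.9 * nrow(d)))
train <- d[idx, ]
test  <- d[-idx, ]

# Bare column names plus data = let predict() look the
# variables up in whatever data frame is supplied later.
fit  <- lm(score ~ votes + duration, data = train)
pred <- predict(fit, newdata = test)

length(pred)  # one prediction per held-out row
```

A side benefit of this form is that coefficient names in `summary()` and `step()` output lose the `movie.sig$` prefix, which makes the tables easier to read.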
step(null,scope=list(lower=null,upper=full3),direction='forward')
Start: AIC=309.81
movie.sig$imdb_score ~ 1
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users 1 871.90 2457.2 -600.74
+ movie.sig$duration 1 491.13 2838.0 -167.82
+ movie.sig$num_critic_for_reviews 1 428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews 1 407.62 2921.5 -80.68
+ factor(movie.sig$genres) 16 331.02 2998.1 27.10
+ movie.sig$movie_facebook_likes 1 282.82 3046.3 45.02
+ movie.sig$gross 1 242.62 3086.5 84.42
+ movie.sig$director_facebook_likes 1 166.17 3163.0 157.95
+ movie.sig$title_year 1 69.27 3259.9 248.63
+ movie.sig$cast_total_facebook_likes 1 64.28 3264.8 253.22
+ movie.sig$budget 1 16.26 3312.9 297.09
+ movie.sig$facenumber_in_poster 1 15.14 3314.0 298.11
<none> 3329.1 309.81
Step: AIC=-600.74
movie.sig$imdb_score ~ movie.sig$num_voted_users
Df Sum of Sq RSS AIC
+ factor(movie.sig$genres) 16 311.531 2145.7 -976.12
+ movie.sig$duration 1 147.786 2309.4 -785.13
+ movie.sig$title_year 1 84.649 2372.6 -704.08
+ movie.sig$budget 1 73.211 2384.0 -689.63
+ movie.sig$num_user_for_reviews 1 21.297 2435.9 -624.90
+ movie.sig$gross 1 16.929 2440.3 -619.51
+ movie.sig$num_critic_for_reviews 1 14.632 2442.6 -616.69
+ movie.sig$director_facebook_likes 1 13.657 2443.6 -615.49
+ movie.sig$facenumber_in_poster 1 6.789 2450.4 -607.05
+ movie.sig$movie_facebook_likes 1 2.627 2454.6 -601.95
<none> 2457.2 -600.74
+ movie.sig$cast_total_facebook_likes 1 0.524 2456.7 -599.38
Step: AIC=-976.12
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres)
Df Sum of Sq RSS AIC
+ movie.sig$title_year 1 79.623 2066.1 -1087.75
+ movie.sig$duration 1 74.584 2071.1 -1080.44
+ movie.sig$budget 1 28.689 2117.0 -1014.57
+ movie.sig$num_critic_for_reviews 1 23.116 2122.6 -1006.67
+ movie.sig$num_user_for_reviews 1 12.251 2133.4 -991.33
+ movie.sig$director_facebook_likes 1 3.707 2142.0 -979.32
+ movie.sig$facenumber_in_poster 1 3.274 2142.4 -978.71
+ movie.sig$movie_facebook_likes 1 1.686 2144.0 -976.49
<none> 2145.7 -976.12
+ movie.sig$gross 1 1.391 2144.3 -976.07
+ movie.sig$cast_total_facebook_likes 1 0.362 2145.3 -974.63
Step: AIC=-1087.75
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year
Df Sum of Sq RSS AIC
+ movie.sig$num_critic_for_reviews 1 125.091 1941.0 -1273.4
+ movie.sig$duration 1 55.857 2010.2 -1168.1
+ movie.sig$movie_facebook_likes 1 21.746 2044.3 -1117.5
+ movie.sig$num_user_for_reviews 1 11.741 2054.3 -1102.9
+ movie.sig$budget 1 9.196 2056.9 -1099.2
+ movie.sig$cast_total_facebook_likes 1 2.923 2063.2 -1090.0
+ movie.sig$director_facebook_likes 1 1.740 2064.3 -1088.3
<none> 2066.1 -1087.8
+ movie.sig$facenumber_in_poster 1 1.084 2065.0 -1087.3
+ movie.sig$gross 1 0.638 2065.4 -1086.7
Step: AIC=-1273.43
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$budget 1 36.627 1904.4 -1328.7
+ movie.sig$num_user_for_reviews 1 35.326 1905.7 -1326.6
+ movie.sig$duration 1 34.873 1906.1 -1325.9
+ movie.sig$gross 1 7.359 1933.6 -1282.8
+ movie.sig$movie_facebook_likes 1 1.397 1939.6 -1273.6
<none> 1941.0 -1273.4
+ movie.sig$facenumber_in_poster 1 0.926 1940.1 -1272.9
+ movie.sig$director_facebook_likes 1 0.644 1940.3 -1272.4
+ movie.sig$cast_total_facebook_likes 1 0.572 1940.4 -1272.3
Step: AIC=-1328.68
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget
Df Sum of Sq RSS AIC
+ movie.sig$duration 1 58.373 1846.0 -1420.2
+ movie.sig$num_user_for_reviews 1 27.052 1877.3 -1369.7
+ movie.sig$movie_facebook_likes 1 2.576 1901.8 -1330.8
+ movie.sig$cast_total_facebook_likes 1 2.005 1902.3 -1329.8
<none> 1904.4 -1328.7
+ movie.sig$facenumber_in_poster 1 1.071 1903.3 -1328.4
+ movie.sig$director_facebook_likes 1 0.557 1903.8 -1327.6
+ movie.sig$gross 1 0.074 1904.3 -1326.8
Step: AIC=-1420.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users:movie.sig$duration 1 70.848 1775.1 -1535.8
+ movie.sig$num_user_for_reviews 1 33.825 1812.2 -1473.8
+ movie.sig$movie_facebook_likes 1 4.702 1841.3 -1425.9
+ movie.sig$facenumber_in_poster 1 2.488 1843.5 -1422.3
+ movie.sig$cast_total_facebook_likes 1 1.601 1844.4 -1420.8
<none> 1846.0 -1420.2
+ movie.sig$gross 1 0.196 1845.8 -1418.5
+ movie.sig$director_facebook_likes 1 0.043 1845.9 -1418.3
Step: AIC=-1535.83
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_voted_users:movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_user_for_reviews 1 26.4426 1748.7 -1578.9
+ movie.sig$facenumber_in_poster 1 2.9576 1772.2 -1538.8
+ movie.sig$cast_total_facebook_likes 1 1.1823 1774.0 -1535.8
<none> 1775.1 -1535.8
+ movie.sig$movie_facebook_likes 1 0.9446 1774.2 -1535.4
+ movie.sig$director_facebook_likes 1 0.3854 1774.8 -1534.5
+ movie.sig$gross 1 0.0191 1775.1 -1533.9
Step: AIC=-1578.93
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$num_voted_users:movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users:movie.sig$num_user_for_reviews 1 5.4845 1743.2 -1586.4
+ movie.sig$facenumber_in_poster 1 4.1664 1744.5 -1584.1
+ movie.sig$movie_facebook_likes 1 3.9301 1744.8 -1583.7
<none> 1748.7 -1578.9
+ movie.sig$cast_total_facebook_likes 1 0.7354 1748.0 -1578.2
+ movie.sig$director_facebook_likes 1 0.2660 1748.4 -1577.4
+ movie.sig$gross 1 0.0008 1748.7 -1576.9
Step: AIC=-1586.37
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$facenumber_in_poster 1 4.0181 1739.2 -1591.3
+ movie.sig$movie_facebook_likes 1 3.2754 1739.9 -1590.0
<none> 1743.2 -1586.4
+ movie.sig$cast_total_facebook_likes 1 0.6359 1742.6 -1585.5
+ movie.sig$director_facebook_likes 1 0.3798 1742.8 -1585.0
+ movie.sig$gross 1 0.0475 1743.2 -1584.5
Step: AIC=-1591.31
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$facenumber_in_poster + movie.sig$num_voted_users:movie.sig$duration +
movie.sig$num_voted_users:movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$movie_facebook_likes 1 3.11243 1736.1 -1594.7
<none> 1739.2 -1591.3
+ movie.sig$cast_total_facebook_likes 1 0.90996 1738.3 -1590.9
+ movie.sig$director_facebook_likes 1 0.29041 1738.9 -1589.8
+ movie.sig$gross 1 0.04757 1739.1 -1589.4
Step: AIC=-1594.69
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$facenumber_in_poster + movie.sig$movie_facebook_likes +
movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
<none> 1736.1 -1594.7
+ movie.sig$cast_total_facebook_likes 1 0.97305 1735.1 -1594.4
+ movie.sig$director_facebook_likes 1 0.27990 1735.8 -1593.2
+ movie.sig$gross 1 0.03634 1736.0 -1592.8
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
factor(movie.sig$genres) + movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$facenumber_in_poster + movie.sig$movie_facebook_likes +
movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews)
Coefficients:
(Intercept)
4.817e+01
movie.sig$num_voted_users
7.152e-06
factor(movie.sig$genres)Adventure
3.300e-01
factor(movie.sig$genres)Animation
7.097e-01
factor(movie.sig$genres)Biography
6.794e-01
factor(movie.sig$genres)Comedy
1.675e-01
factor(movie.sig$genres)Crime
4.784e-01
factor(movie.sig$genres)Documentary
9.449e-01
factor(movie.sig$genres)Drama
5.252e-01
factor(movie.sig$genres)Family
2.260e-01
factor(movie.sig$genres)Fantasy
-1.422e-01
factor(movie.sig$genres)Horror
-3.440e-01
factor(movie.sig$genres)Musical
-3.165e-01
factor(movie.sig$genres)Mystery
1.499e-01
factor(movie.sig$genres)Romance
5.682e-01
factor(movie.sig$genres)Sci-Fi
1.953e-01
factor(movie.sig$genres)Thriller
-4.097e-01
factor(movie.sig$genres)Western
-4.521e-02
movie.sig$title_year
-2.189e-02
movie.sig$num_critic_for_reviews
2.566e-03
movie.sig$budget
-4.370e-09
movie.sig$duration
1.206e-02
movie.sig$num_user_for_reviews
-3.210e-04
movie.sig$facenumber_in_poster
-1.750e-02
movie.sig$movie_facebook_likes
-2.239e-06
movie.sig$num_voted_users:movie.sig$duration
-2.661e-08
movie.sig$num_voted_users:movie.sig$num_user_for_reviews
-2.729e-10
For ease of interpretation, we start with full3 (the additive model with interaction terms). After checking the residuals, we will decide whether higher-order terms should be added.
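A minimal sketch of the kind of residual check described here, run on simulated data rather than on the movie model (the names `d`, `fit` and `fit2` and the simulated relationship are assumptions for illustration only). It fits a purely linear model to data with some curvature, then tests whether a quadratic term reduces the residual sum of squares.

```r
set.seed(7)
# Simulated data with mild curvature in x
d <- data.frame(x = runif(300, min = 0, max = 10))
d$y <- 2 + 0.5 * d$x - 0.04 * d$x^2 + rnorm(300, sd = 0.5)

# Start with the simpler, linear-only model
fit <- lm(y ~ x, data = d)
res <- resid(fit)

# Leftover curvature shows up as correlation between the
# residuals and the squared predictor.
# (plot(fitted(fit), res) would show the same pattern visually.)
curv <- cor(res, d$x^2)

# Nested-model F test: does adding the quadratic term help?
fit2 <- update(fit, . ~ . + I(x^2))
cmp  <- anova(fit, fit2)
print(cmp)
```

If the F test is not significant and the residual plot shows no pattern, the simpler model is usually the easier one to interpret and report.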
Split the data into training and test sets:
indx = sample(1:nrow(movie.sig), as.integer(0.9*nrow(movie.sig)))
indx # randomly sampled row indices: 90% of the rows, used as the training set
[1] 1459 687 622 680 1895 1987 720 2522 1606 2667 2526 1304 1473 170 1039 2805 344
[18] 2297 466 2457 2183 2781 1441 803 2206 1406 956 2236 2130 2220 879 1190 2757 324
[35] 2321 2602 1743 2312 1840 1980 1930 473 1014 1510 2384 1195 520 2442 177 8 475
[52] 478 1058 2654 2966 258 1132 1455 1867 1385 1947 1207 1550 526 1670 2487 2825 346
[69] 2953 373 1082 187 2090 716 1729 756 2434 1458 1931 1942 566 1241 1488 2907 1837
[86] 511 2764 866 701 2879 2732 2137 1408 2878 2808 1126 663 2722 2096 154 2156 683
[103] 1909 2952 2333 1033 2897 1365 2012 2122 203 839 1129 1658 970 2671 2370 2664 1652
[120] 2898 1872 2231 2596 2438 2655 638 2738 2749 2293 2557 456 654 1167 1245 1416 1559
[137] 1929 1047 2858 2358 754 2705 943 724 439 325 1189 2138 125 2046 890 1893 2766
[154] 2177 1656 805 32 118 166 2098 528 405 129 2067 2748 1740 845 283 1998 316
[171] 1125 580 2322 1140 1388 2449 1228 2892 745 2784 2511 568 145 567 294 2540 2224
[188] 2880 577 2842 1653 49 2502 801 1092 1276 2594 2209 718 2372 1732 261 14 1067
[205] 864 1394 2943 2223 1174 448 2288 1130 2478 2396 2876 2977 2736 1297 2834 2488 278
[222] 1041 614 1420 1215 1200 1075 1182 2944 1617 2520 2399 1321 1166 2796 1518 1763 2920
[239] 1730 337 2750 2300 2049 550 1376 1769 2426 82 2350 953 1798 1592 352 1091 1512
[256] 2648 2486 2239 2633 2988 1700 1948 1019 470 454 2428 376 75 2865 2378 1862 2051
[273] 2266 476 780 1484 656 2308 894 617 398 913 2184 2946 2999 1615 2002 789 436
[290] 1211 64 2192 2140 922 250 1979 1920 2505 2393 495 2121 1149 2809 1985 74 194
[307] 2277 1721 686 1119 161 1806 2506 148 2128 886 1111 856 2037 912 134 882 1274
[324] 1582 952 1504 1235 2491 68 2429 2937 1866 2219 2367 380 2200 366 1432 2727 1822
[341] 2916 1679 1471 1257 164 2803 73 869 1439 521 2160 1733 2843 779 700 2101 684
[358] 2674 901 2886 2261 807 988 2762 1278 106 232 1252 1203 576 2310 1961 2807 2411
[375] 1050 1158 655 109 180 812 2084 1568 1774 2395 875 1939 1834 2066 791 1004 2436
[392] 107 2271 1699 2303 2249 447 71 2592 849 2338 1009 1312 794 1976 165 2402 2143
[409] 340 2039 2117 1879 1090 898 1279 2702 3000 2070 2289 169 1030 1693 282 1904 1894
[426] 823 2696 2621 1614 1752 605 1311 2846 2230 2600 2719 648 2650 2362 2643 2877 666
[443] 112 1650 1786 2334 2790 2631 2383 2024 2448 1248 1431 360 2100 1736 2485 2454 1480
[460] 1906 2814 793 1669 1833 1026 2139 1065 618 2881 5 2088 1694 480 2829 1450 243
[477] 2683 1773 2695 1803 897 2382 111 989 1294 2662 1148 1269 623 105 1397 357 1444
[494] 1181 1509 1505 1221 911 263 190 2769 1263 2893 2961 1513 1854 916 332 3001 133
[511] 2328 1982 1242 1169 1162 564 861 1638 2060 1779 2119 1059 149 1819 2458 546 2133
[528] 2777 1478 596 1845 2198 650 2213 1555 2347 1392 1856 2295 1573 948 1794 670 2237
[545] 2001 1981 840 982 2968 752 2565 2590 2132 160 738 298 2175 2578 847 1426 2154
[562] 1151 222 1310 1830 2099 2246 2682 1565 2135 2496 764 1369 972 459 290 992 1630
[579] 1358 865 1110 2863 1564 287 624 2913 978 13 319 240 2938 1688 512 63 2519
[596] 2811 1675 2637 2948 1001 602 2305 1002 1071 1061 2058 438 2555 87 95 1583 608
[613] 994 534 2896 429 556 917 1628 174 1303 171 558 1433 2989 296 1179 2343 70
[630] 2055 1453 765 1475 522 1449 2553 2385 2844 2681 2006 1481 2348 626 2222 2279 22
[647] 2780 2167 45 2452 2389 214 1413 1624 1500 2831 414 652 966 518 1899 2765 225
[664] 1745 2391 1283 510 1016 1536 707 1623 1844 2653 11 2172 1407 2450 1777 2207 1120
[681] 1270 770 1187 2409 1896 998 2755 559 2582 404 1463 1299 492 539 1135 1003 2118
[698] 1134 1145 843 2315 2353 268 2290 1411 1277 1508 1994 1968 2415 2883 2480 1863 2958
[715] 2240 1106 712 1422 1996 2441 1386 1051 2641 2123 2437 2004 2972 39 2813 2280 585
[732] 2847 1919 1057 2984 89 2884 2950 1831 1771 2447 1418 2693 2659 2285 678 2606 751
[749] 1762 1916 2574 1199 104 1680 2775 334 2284 1813 2327 1851 1391 1622 1847 772 2795
[766] 1801 1191 1755 2020 1558 674 2112 1335 204 137 2013 2235 1527 467 728 1836 2895
[783] 792 80 1626 1155 2109 693 1309 382 163 906 2439 2691 1805 1264 1428 2142 692
[800] 1477 60 775 2991 393 1170 2394 1870 1829 984 1494 1250 1322 743 1775 2497 2355
[817] 2741 1659 2922 2188 10 2782 273 842 2023 369 777 867 1457 547 1052 2234 1186
[834] 2194 1519 1711 859 2421 1913 1910 1604 1206 36 2361 402 312 2116 57 1908 1240
[851] 810 2745 2369 2752 1479 372 1613 44 1112 86 1883 2268 1681 2332 482 1541 1561
[868] 788 152 941 349 395 237 1371 1006 1557 443 390 1495 2927 2083 959 2646 1231
[885] 2152 2461 961 3 1724 1156 2743 2832 2489 272 2687 536 834 1799 2164 1725 651
[902] 672 2204 588 2767 1672 1881 990 140 1402 2559 1983 681 2162 2500 433 387 335
[919] 1959 2178 837 2901 2583 841 315 2826 314 1074 1702 2311 757 1902 1122 257 2597
[936] 1222 570 2957 2357 1668 2306 2566 183 175 1064 491 1005 1340 1054 1757 833 1609
[953] 130 938 2414 1667 2558 2349 2131 560 2015 2607 1826 442 926 2304 2345 783 50
[970] 18 1472 474 2619 2509 1717 158 1103 2773 2982 2424 2605 1213 735 1238 950 2267
[987] 432 54 2660 2874 1522 410 2157 2074 2010 1701 2900 1419 1034 2314
[ reached getOption("max.print") -- omitted 1704 entries ]
movie_train = movie.sig[indx,]
movie_test = movie.sig[-indx,]
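Because sample() draws a different split on every run, coefficients and p-values can drift between runs. A minimal sketch of a reproducible 90/10 split, assuming a fixed seed; the seed value 42 and the toy data frame `toy` are illustrative and not part of the original analysis:

```r
# Toy stand-in for movie.sig (hypothetical data, for illustration only)
toy <- data.frame(id = 1:50, score = seq(5, 9.9, length.out = 50))

# Fixing the seed makes sample() return the same rows on every run
set.seed(42)                        # arbitrary seed, an assumption
idx_a <- sample(seq_len(nrow(toy)), as.integer(0.9 * nrow(toy)))

set.seed(42)                        # same seed again, same draw
idx_b <- sample(seq_len(nrow(toy)), as.integer(0.9 * nrow(toy)))

toy_train <- toy[idx_a, ]           # 90% of rows
toy_test  <- toy[-idx_a, ]          # remaining 10%
```

Calling set.seed() once before the sample() line in the real analysis would be enough; it is called twice here only to show that the draw repeats.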
# lm.fit1: linear model with interaction terms, dropping insignificant predictors.
# Insignificant terms from summary(full3): 'director facebook likes', 'movie fb likes' and 'cast total fb likes'.
# Note: this choice has nothing to do with the step function we chose for full3.
lm.fit1<-lm(movie_train$imdb_score~movie_train$num_voted_users+movie_train$num_critic_for_reviews+movie_train$num_user_for_reviews+movie_train$duration+movie_train$facenumber_in_poster+movie_train$gross+movie_train$budget+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
summary(lm.fit1)
Call:
lm(formula = movie_train$imdb_score ~ movie_train$num_voted_users +
movie_train$num_critic_for_reviews + movie_train$num_user_for_reviews +
movie_train$duration + movie_train$facenumber_in_poster +
movie_train$gross + movie_train$budget + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews +
movie_train$gross * movie_train$budget)
Residuals:
Min 1Q Median 3Q Max
-5.1959 -0.3517 0.0879 0.4768 2.0449
Coefficients:
Estimate Std. Error t value
(Intercept) 4.580e+01 3.742e+00 12.241
movie_train$num_voted_users 8.024e-06 5.020e-07 15.984
movie_train$num_critic_for_reviews 2.074e-03 2.033e-04 10.205
movie_train$num_user_for_reviews -2.258e-04 7.116e-05 -3.173
movie_train$duration 1.257e-02 9.638e-04 13.040
movie_train$facenumber_in_poster -1.522e-02 7.082e-03 -2.150
movie_train$gross -1.469e-09 4.415e-10 -3.327
movie_train$budget -6.183e-09 6.179e-10 -10.006
movie_train$title_year -2.067e-02 1.864e-03 -11.086
factor(movie_train$genres)Adventure 3.175e-01 5.634e-02 5.636
factor(movie_train$genres)Animation 7.754e-01 1.370e-01 5.659
factor(movie_train$genres)Biography 6.939e-01 7.815e-02 8.880
factor(movie_train$genres)Comedy 1.399e-01 4.485e-02 3.120
factor(movie_train$genres)Crime 4.577e-01 6.648e-02 6.885
factor(movie_train$genres)Documentary 8.612e-01 1.571e-01 5.482
factor(movie_train$genres)Drama 4.855e-01 5.013e-02 9.684
factor(movie_train$genres)Family 3.124e-01 4.397e-01 0.711
factor(movie_train$genres)Fantasy -1.592e-01 1.457e-01 -1.093
factor(movie_train$genres)Horror -3.674e-01 8.077e-02 -4.548
factor(movie_train$genres)Musical 2.891e-01 7.584e-01 0.381
factor(movie_train$genres)Mystery 1.888e-01 2.123e-01 0.889
factor(movie_train$genres)Romance 5.237e-01 5.351e-01 0.979
factor(movie_train$genres)Sci-Fi 3.399e-01 3.401e-01 0.999
factor(movie_train$genres)Thriller -5.425e-01 7.569e-01 -0.717
factor(movie_train$genres)Western -1.221e-01 5.355e-01 -0.228
movie_train$num_voted_users:movie_train$duration -3.153e-08 3.582e-09 -8.802
movie_train$num_voted_users:movie_train$num_user_for_reviews -4.288e-10 1.021e-10 -4.200
movie_train$gross:movie_train$budget 1.442e-17 2.967e-18 4.862
Pr(>|t|)
(Intercept) < 2e-16 ***
movie_train$num_voted_users < 2e-16 ***
movie_train$num_critic_for_reviews < 2e-16 ***
movie_train$num_user_for_reviews 0.001525 **
movie_train$duration < 2e-16 ***
movie_train$facenumber_in_poster 0.031655 *
movie_train$gross 0.000889 ***
movie_train$budget < 2e-16 ***
movie_train$title_year < 2e-16 ***
factor(movie_train$genres)Adventure 1.92e-08 ***
factor(movie_train$genres)Animation 1.68e-08 ***
factor(movie_train$genres)Biography < 2e-16 ***
factor(movie_train$genres)Comedy 0.001826 **
factor(movie_train$genres)Crime 7.17e-12 ***
factor(movie_train$genres)Documentary 4.59e-08 ***
factor(movie_train$genres)Drama < 2e-16 ***
factor(movie_train$genres)Family 0.477401
factor(movie_train$genres)Fantasy 0.274661
factor(movie_train$genres)Horror 5.65e-06 ***
factor(movie_train$genres)Musical 0.703066
factor(movie_train$genres)Mystery 0.373894
factor(movie_train$genres)Romance 0.327889
factor(movie_train$genres)Sci-Fi 0.317717
factor(movie_train$genres)Thriller 0.473572
factor(movie_train$genres)Western 0.819678
movie_train$num_voted_users:movie_train$duration < 2e-16 ***
movie_train$num_voted_users:movie_train$num_user_for_reviews 2.76e-05 ***
movie_train$gross:movie_train$budget 1.23e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7548 on 2676 degrees of freedom
Multiple R-squared: 0.4806, Adjusted R-squared: 0.4753
F-statistic: 91.7 on 27 and 2676 DF, p-value: < 2.2e-16
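The p-value of the overall F-test is very small (< 2.2e-16). All numeric predictors and interaction terms are significant at the 0.05 level, with face number in poster the least significant of them (p = 0.032). Several genre levels (Family, Fantasy, Musical, Mystery, Romance, Sci-Fi, Thriller, Western) are not individually significant relative to the baseline genre. Adjusted R^2 is 0.4753, which means about 47.5% of the variability can be explained by this model.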
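A model whose formula is written with `movie_train$...` cannot be reused on movie_test, because predict() looks variables up by the names in the formula. A minimal sketch of the alternative, using bare column names with `data =`; the data frame `toy` is simulated and its columns only mimic the real ones:

```r
set.seed(1)
# Hypothetical stand-in for the movie data; column names mirror the real ones
toy <- data.frame(duration = runif(200, 80, 180),
                  num_voted_users = runif(200, 1e3, 1e6))
toy$imdb_score <- 5 + 0.01 * toy$duration + 1e-6 * toy$num_voted_users +
  rnorm(200, sd = 0.5)

idx <- sample(seq_len(nrow(toy)), 180)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Bare column names plus data = let predict() find the variables in newdata
fit <- lm(imdb_score ~ duration * num_voted_users, data = toy_train)

pred <- predict(fit, newdata = toy_test)            # one prediction per test row
rmse <- sqrt(mean((toy_test$imdb_score - pred)^2))  # out-of-sample error
```

One side effect to keep in mind: with `data =`, diagnostic labels become the data frame's row names instead of positions, so dropping flagged observations would need to match on row names.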
Run a partial F-test (lack-of-fit test) to check whether the reduced model is still adequate after removing these predictors:
#lm.full: full linear model with interaction terms on train dataset.
lm.full<-lm(movie_train$imdb_score~movie_train$num_voted_users+movie_train$num_critic_for_reviews+movie_train$num_user_for_reviews+movie_train$duration+movie_train$facenumber_in_poster+movie_train$gross+movie_train$movie_facebook_likes+movie_train$director_facebook_likes+movie_train$cast_total_facebook_likes+movie_train$budget+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
anova(lm.full,lm.fit1) # H0: the reduced model fits (lack of fit = 0)
Analysis of Variance Table
Model 1: movie_train$imdb_score ~ movie_train$num_voted_users + movie_train$num_critic_for_reviews +
movie_train$num_user_for_reviews + movie_train$duration +
movie_train$facenumber_in_poster + movie_train$gross + movie_train$movie_facebook_likes +
movie_train$director_facebook_likes + movie_train$cast_total_facebook_likes +
movie_train$budget + movie_train$title_year + factor(movie_train$genres) +
movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Model 2: movie_train$imdb_score ~ movie_train$num_voted_users + movie_train$num_critic_for_reviews +
movie_train$num_user_for_reviews + movie_train$duration +
movie_train$facenumber_in_poster + movie_train$gross + movie_train$budget +
movie_train$title_year + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2673 1521.0
2 2676 1524.5 -3 -3.4834 2.0406 0.1061
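The p-value of the partial F-test is 0.1061, so we fail to reject the null hypothesis that the coefficients of ‘director facebook likes’, ‘movie fb likes’ and ‘cast total fb likes’ are all zero. Dropping them does not significantly worsen the fit, so the reduced model lm.fit1 is adequate.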
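The F statistic in the table can be reproduced by hand from the residual sums of squares. A short check that uses only numbers printed in the ANOVA table above (the variable names are illustrative):

```r
rss_full   <- 1521.0   # RSS of lm.full (Model 1)
df_full    <- 2673     # residual degrees of freedom of lm.full
ss_dropped <- 3.4834   # increase in RSS after dropping the three terms
q          <- 3        # number of dropped coefficients

# Partial F: extra sum of squares per dropped term over the full-model MSE
f_stat <- (ss_dropped / q) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (q, df_full) df
p_val <- pf(f_stat, q, df_full, lower.tail = FALSE)
```

Up to rounding of the printed RSS, `f_stat` and `p_val` should agree with the F and Pr(>F) columns of the table.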
Diagnostics:
plot(lm.fit1)
not plotting observations with leverage one:
1943
not plotting observations with leverage one:
1943
NaNs produced
NaNs produced
# Residuals vs fitted suggests higher-order terms may be needed. The normal Q-Q plot does not look good.
library(car)
package ‘car’ was built under R version 3.3.2
residualPlots(lm.fit1)
Test stat Pr(>|t|)
movie_train$num_voted_users -8.392 0.000
movie_train$num_critic_for_reviews -8.103 0.000
movie_train$num_user_for_reviews 3.422 0.001
movie_train$duration -4.590 0.000
movie_train$facenumber_in_poster 0.222 0.824
movie_train$gross -4.093 0.000
movie_train$budget 5.060 0.000
movie_train$title_year -3.571 0.000
factor(movie_train$genres) NA NA
Tukey test -14.631 0.000
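Most of the residual vs predictor plots show a clear curvature (the only exception is face number in poster, p = 0.824), and the Tukey test is significant as well. This indicates that the current model does not fit well, so higher-order terms should be included.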
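According to the car documentation, the curvature test that residualPlots reports for a numeric predictor is the t-test of that predictor's squared term when it is added to the model. A base-R sketch of the same idea on simulated data (the variables `x`, `z` and `y` are hypothetical):

```r
set.seed(2)
x <- runif(300, 0, 10)
z <- rnorm(300)
# The true relationship is curved in x
y <- 1 + 0.5 * x - 0.08 * x^2 + 0.3 * z + rnorm(300, sd = 0.5)

fit_linear <- lm(y ~ x + z)           # model without the squared term

# Add the squared term for x and read off its t-test
fit_curved <- update(fit_linear, . ~ . + I(x^2))
curv <- summary(fit_curved)$coefficients["I(x^2)", c("t value", "Pr(>|t|)")]
curv
```

A small p-value here plays the same role as the small Pr(>|t|) values in the table above: the linear term alone misses curvature in that predictor.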
Fit model with higher-order terms:
# lm.fit2: model based on lm.fit1, adding second-order terms for all numeric variables except 'face number in poster' and 'title_year'.
lm.fit2<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+poly(movie_train$num_user_for_reviews,2)+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+poly(movie_train$gross,2)+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
summary(lm.fit2)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + poly(movie_train$num_user_for_reviews,
2) + poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
poly(movie_train$gross, 2) + poly(movie_train$budget, 2) +
movie_train$title_year + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget)
Residuals:
Min 1Q Median 3Q Max
-5.2968 -0.3438 0.0658 0.4506 2.1797
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 5.135e+01 3.743e+00 13.718
poly(movie_train$num_voted_users, 2)1 4.155e+01 4.974e+00 8.353
poly(movie_train$num_voted_users, 2)2 -1.676e+01 2.294e+00 -7.305
poly(movie_train$num_critic_for_reviews, 2)1 1.286e+01 1.322e+00 9.726
poly(movie_train$num_critic_for_reviews, 2)2 -7.240e+00 8.612e-01 -8.406
poly(movie_train$num_user_for_reviews, 2)1 -1.726e+01 2.285e+00 -7.555
poly(movie_train$num_user_for_reviews, 2)2 2.394e+00 1.545e+00 1.550
poly(movie_train$duration, 2)1 1.390e+01 1.090e+00 12.761
poly(movie_train$duration, 2)2 -3.545e+00 7.736e-01 -4.583
movie_train$facenumber_in_poster -1.772e-02 6.853e-03 -2.586
poly(movie_train$gross, 2)1 -7.344e+00 2.222e+00 -3.305
poly(movie_train$gross, 2)2 -2.567e+00 1.249e+00 -2.054
poly(movie_train$budget, 2)1 -1.507e+01 1.982e+00 -7.604
poly(movie_train$budget, 2)2 5.687e+00 1.123e+00 5.064
movie_train$title_year -2.249e-02 1.873e-03 -12.007
factor(movie_train$genres)Adventure 3.607e-01 5.507e-02 6.551
factor(movie_train$genres)Animation 8.269e-01 1.336e-01 6.189
factor(movie_train$genres)Biography 6.398e-01 7.582e-02 8.439
factor(movie_train$genres)Comedy 1.206e-01 4.368e-02 2.760
factor(movie_train$genres)Crime 4.265e-01 6.442e-02 6.620
factor(movie_train$genres)Documentary 8.692e-01 1.525e-01 5.698
factor(movie_train$genres)Drama 4.738e-01 4.877e-02 9.715
factor(movie_train$genres)Family 3.901e-01 4.307e-01 0.906
factor(movie_train$genres)Fantasy -1.906e-01 1.411e-01 -1.350
factor(movie_train$genres)Horror -3.979e-01 7.997e-02 -4.976
factor(movie_train$genres)Musical -5.804e-04 7.333e-01 -0.001
factor(movie_train$genres)Mystery 1.861e-01 2.052e-01 0.907
factor(movie_train$genres)Romance 5.716e-01 5.168e-01 1.106
factor(movie_train$genres)Sci-Fi 2.334e-01 3.287e-01 0.710
factor(movie_train$genres)Thriller -4.679e-01 7.312e-01 -0.640
factor(movie_train$genres)Western -8.372e-02 5.174e-01 -0.162
movie_train$duration NA NA NA
movie_train$num_voted_users NA NA NA
movie_train$num_user_for_reviews NA NA NA
movie_train$gross NA NA NA
movie_train$budget NA NA NA
movie_train$duration:movie_train$num_voted_users -1.902e-08 3.604e-09 -5.278
movie_train$num_voted_users:movie_train$num_user_for_reviews 1.098e-09 3.055e-10 3.595
movie_train$gross:movie_train$budget 1.447e-17 5.637e-18 2.566
Pr(>|t|)
(Intercept) < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 < 2e-16 ***
poly(movie_train$num_voted_users, 2)2 3.64e-13 ***
poly(movie_train$num_critic_for_reviews, 2)1 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 < 2e-16 ***
poly(movie_train$num_user_for_reviews, 2)1 5.71e-14 ***
poly(movie_train$num_user_for_reviews, 2)2 0.121257
poly(movie_train$duration, 2)1 < 2e-16 ***
poly(movie_train$duration, 2)2 4.80e-06 ***
movie_train$facenumber_in_poster 0.009763 **
poly(movie_train$gross, 2)1 0.000962 ***
poly(movie_train$gross, 2)2 0.040030 *
poly(movie_train$budget, 2)1 3.96e-14 ***
poly(movie_train$budget, 2)2 4.39e-07 ***
movie_train$title_year < 2e-16 ***
factor(movie_train$genres)Adventure 6.85e-11 ***
factor(movie_train$genres)Animation 6.99e-10 ***
factor(movie_train$genres)Biography < 2e-16 ***
factor(movie_train$genres)Comedy 0.005817 **
factor(movie_train$genres)Crime 4.32e-11 ***
factor(movie_train$genres)Documentary 1.34e-08 ***
factor(movie_train$genres)Drama < 2e-16 ***
factor(movie_train$genres)Family 0.365151
factor(movie_train$genres)Fantasy 0.177040
factor(movie_train$genres)Horror 6.92e-07 ***
factor(movie_train$genres)Musical 0.999369
factor(movie_train$genres)Mystery 0.364726
factor(movie_train$genres)Romance 0.268800
factor(movie_train$genres)Sci-Fi 0.477651
factor(movie_train$genres)Thriller 0.522324
factor(movie_train$genres)Western 0.871459
movie_train$duration NA
movie_train$num_voted_users NA
movie_train$num_user_for_reviews NA
movie_train$gross NA
movie_train$budget NA
movie_train$duration:movie_train$num_voted_users 1.41e-07 ***
movie_train$num_voted_users:movie_train$num_user_for_reviews 0.000331 ***
movie_train$gross:movie_train$budget 0.010334 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7286 on 2670 degrees of freedom
Multiple R-squared: 0.5171, Adjusted R-squared: 0.5111
F-statistic: 86.64 on 33 and 2670 DF, p-value: < 2.2e-16
The second-order term for ‘num user for reviews’ is not significant (p = 0.121) and can be dropped. The second-order term for ‘gross’ is only marginally significant (p = 0.040) and can also be dropped. The interaction between ‘gross’ and ‘budget’ is only weakly significant (p = 0.010) and could be dropped as well.
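The "not defined because of singularities" rows in the summary arise because `a * b` expands to `a + b + a:b`, and the linear columns for variables that already appear inside poly(..., 2) are redundant. A small sketch on simulated data (the data frame `d` is hypothetical) showing that writing the interaction with `:` avoids the NA rows without changing the fit:

```r
set.seed(3)
n <- 200
d <- data.frame(x = runif(n, 1, 10), z = runif(n, 1, 10))
d$y <- 2 + 0.4 * d$x - 0.03 * d$x^2 + 0.1 * d$z + 0.02 * d$x * d$z +
  rnorm(n, sd = 0.3)

# x * z expands to x + z + x:z; the linear x column is already spanned by poly(x, 2)
fit_star <- lm(y ~ poly(x, 2) + x * z, data = d)

# Writing the interaction with ":" adds only the product term
fit_colon <- lm(y ~ poly(x, 2) + z + x:z, data = d)

na_star  <- names(coef(fit_star))[is.na(coef(fit_star))]    # aliased terms
na_colon <- names(coef(fit_colon))[is.na(coef(fit_colon))]  # none expected
```

Both formulas span the same column space, so the fitted values agree; only the redundant coefficient disappears.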
# lm.fit3: based on lm.fit2, dropping the second-order terms for 'number of users for review' and 'gross', and the budget*gross interaction
lm.fit3<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit3)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews)
Residuals:
Min 1Q Median 3Q Max
-5.3148 -0.3437 0.0642 0.4562 2.1678
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 5.026e+01 3.705e+00 13.566
poly(movie_train$num_voted_users, 2)1 3.777e+01 4.488e+00 8.417
poly(movie_train$num_voted_users, 2)2 -1.839e+01 2.155e+00 -8.536
poly(movie_train$num_critic_for_reviews, 2)1 1.239e+01 1.304e+00 9.500
poly(movie_train$num_critic_for_reviews, 2)2 -6.883e+00 8.398e-01 -8.196
movie_train$num_user_for_reviews -8.981e-04 9.504e-05 -9.450
poly(movie_train$duration, 2)1 1.370e+01 1.087e+00 12.606
poly(movie_train$duration, 2)2 -3.474e+00 7.732e-01 -4.493
movie_train$facenumber_in_poster -1.710e-02 6.854e-03 -2.495
movie_train$gross -6.501e-10 3.199e-10 -2.032
poly(movie_train$budget, 2)1 -1.111e+01 1.172e+00 -9.483
poly(movie_train$budget, 2)2 7.544e+00 8.093e-01 9.322
movie_train$title_year -2.176e-02 1.851e-03 -11.756
factor(movie_train$genres)Adventure 3.666e-01 5.500e-02 6.664
factor(movie_train$genres)Animation 8.306e-01 1.336e-01 6.218
factor(movie_train$genres)Biography 6.505e-01 7.581e-02 8.581
factor(movie_train$genres)Comedy 1.246e-01 4.363e-02 2.855
factor(movie_train$genres)Crime 4.374e-01 6.435e-02 6.797
factor(movie_train$genres)Documentary 8.714e-01 1.526e-01 5.709
factor(movie_train$genres)Drama 4.776e-01 4.877e-02 9.792
factor(movie_train$genres)Family 1.873e-01 4.243e-01 0.441
factor(movie_train$genres)Fantasy -1.894e-01 1.411e-01 -1.343
factor(movie_train$genres)Horror -3.967e-01 7.985e-02 -4.969
factor(movie_train$genres)Musical -9.698e-02 7.326e-01 -0.132
factor(movie_train$genres)Mystery 2.004e-01 2.054e-01 0.976
factor(movie_train$genres)Romance 5.801e-01 5.173e-01 1.121
factor(movie_train$genres)Sci-Fi 2.208e-01 3.288e-01 0.672
factor(movie_train$genres)Thriller -4.749e-01 7.320e-01 -0.649
factor(movie_train$genres)Western -8.251e-02 5.179e-01 -0.159
movie_train$duration NA NA NA
movie_train$num_voted_users NA NA NA
movie_train$duration:movie_train$num_voted_users -1.881e-08 3.566e-09 -5.276
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.460e-09 2.193e-10 6.657
Pr(>|t|)
(Intercept) < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 < 2e-16 ***
poly(movie_train$num_voted_users, 2)2 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 3.82e-16 ***
movie_train$num_user_for_reviews < 2e-16 ***
poly(movie_train$duration, 2)1 < 2e-16 ***
poly(movie_train$duration, 2)2 7.34e-06 ***
movie_train$facenumber_in_poster 0.01265 *
movie_train$gross 0.04222 *
poly(movie_train$budget, 2)1 < 2e-16 ***
poly(movie_train$budget, 2)2 < 2e-16 ***
movie_train$title_year < 2e-16 ***
factor(movie_train$genres)Adventure 3.22e-11 ***
factor(movie_train$genres)Animation 5.82e-10 ***
factor(movie_train$genres)Biography < 2e-16 ***
factor(movie_train$genres)Comedy 0.00434 **
factor(movie_train$genres)Crime 1.31e-11 ***
factor(movie_train$genres)Documentary 1.26e-08 ***
factor(movie_train$genres)Drama < 2e-16 ***
factor(movie_train$genres)Family 0.65895
factor(movie_train$genres)Fantasy 0.17950
factor(movie_train$genres)Horror 7.17e-07 ***
factor(movie_train$genres)Musical 0.89470
factor(movie_train$genres)Mystery 0.32923
factor(movie_train$genres)Romance 0.26223
factor(movie_train$genres)Sci-Fi 0.50193
factor(movie_train$genres)Thriller 0.51656
factor(movie_train$genres)Western 0.87343
movie_train$duration NA
movie_train$num_voted_users NA
movie_train$duration:movie_train$num_voted_users 1.43e-07 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users 3.37e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7294 on 2673 degrees of freedom
Multiple R-squared: 0.5155, Adjusted R-squared: 0.5101
F-statistic: 94.8 on 30 and 2673 DF, p-value: < 2.2e-16
anova(lm.fit2,lm.fit3)
Analysis of Variance Table
Model 1: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + poly(movie_train$num_user_for_reviews,
2) + poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
poly(movie_train$gross, 2) + poly(movie_train$budget, 2) +
movie_train$title_year + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Model 2: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2670 1417.3
2 2673 1422.0 -3 -4.6802 2.9388 0.03202 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
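The p-value for the partial F-test is 0.032. At the 0.05 level this rejects the null hypothesis that the three dropped terms are jointly zero, so lm.fit2 fits slightly better. However, the adjusted R^2 falls only from 0.5111 to 0.5101, so we continue with the simpler lm.fit3, which explains about 51% of the variation.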
Diagnostics for lm.fit3:
plot(lm.fit3)
not plotting observations with leverage one:
696, 2019
not plotting observations with leverage one:
696, 2019
library(car)
package ‘car’ was built under R version 3.3.2
residualPlots(lm.fit3)
Test stat Pr(>|t|)
poly(movie_train$num_voted_users, 2) NA NA
poly(movie_train$num_critic_for_reviews, 2) NA NA
movie_train$num_user_for_reviews 1.452 0.147
poly(movie_train$duration, 2) NA NA
movie_train$facenumber_in_poster 0.545 0.586
movie_train$gross -0.351 0.725
poly(movie_train$budget, 2) NA NA
movie_train$title_year -5.751 0.000
factor(movie_train$genres) NA NA
movie_train$duration 0.309 0.757
movie_train$num_voted_users -0.816 0.415
Tukey test -12.501 0.000
The plots look much better than before. All the residual vs predictor plots are roughly straight lines except for title year, although the Tukey test is still significant. So, let’s try adding a second-order term for title year.
# lm.fit4: based on lm.fit3, adding a second-order term for title year.
lm.fit4<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+poly(movie_train$title_year,2)+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit4)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + poly(movie_train$title_year,
2) + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews)
Residuals:
Min 1Q Median 3Q Max
-5.2522 -0.3390 0.0530 0.4512 2.1654
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 6.671e+00 6.245e-02 106.824
poly(movie_train$num_voted_users, 2)1 3.453e+01 4.496e+00 7.679
poly(movie_train$num_voted_users, 2)2 -1.862e+01 2.142e+00 -8.693
poly(movie_train$num_critic_for_reviews, 2)1 1.630e+01 1.464e+00 11.135
poly(movie_train$num_critic_for_reviews, 2)2 -7.726e+00 8.476e-01 -9.115
movie_train$num_user_for_reviews -1.012e-03 9.652e-05 -10.483
poly(movie_train$duration, 2)1 1.350e+01 1.081e+00 12.485
poly(movie_train$duration, 2)2 -3.353e+00 7.689e-01 -4.361
movie_train$facenumber_in_poster -1.385e-02 6.836e-03 -2.026
movie_train$gross -5.902e-10 3.182e-10 -1.855
poly(movie_train$budget, 2)1 -1.121e+01 1.165e+00 -9.623
poly(movie_train$budget, 2)2 7.985e+00 8.081e-01 9.881
poly(movie_train$title_year, 2)1 -1.287e+01 9.894e-01 -13.003
poly(movie_train$title_year, 2)2 -4.846e+00 8.428e-01 -5.751
factor(movie_train$genres)Adventure 3.733e-01 5.469e-02 6.826
factor(movie_train$genres)Animation 8.806e-01 1.331e-01 6.618
factor(movie_train$genres)Biography 6.519e-01 7.536e-02 8.651
factor(movie_train$genres)Comedy 1.284e-01 4.338e-02 2.960
factor(movie_train$genres)Crime 4.424e-01 6.398e-02 6.916
factor(movie_train$genres)Documentary 9.256e-01 1.520e-01 6.089
factor(movie_train$genres)Drama 4.851e-01 4.850e-02 10.004
factor(movie_train$genres)Family 1.515e-01 4.218e-01 0.359
factor(movie_train$genres)Fantasy -2.432e-01 1.406e-01 -1.731
factor(movie_train$genres)Horror -4.202e-01 7.948e-02 -5.287
factor(movie_train$genres)Musical -1.489e-01 7.283e-01 -0.204
factor(movie_train$genres)Mystery 2.139e-01 2.042e-01 1.048
factor(movie_train$genres)Romance 6.070e-01 5.143e-01 1.180
factor(movie_train$genres)Sci-Fi 2.309e-01 3.268e-01 0.707
factor(movie_train$genres)Thriller -2.927e-01 7.283e-01 -0.402
factor(movie_train$genres)Western -6.708e-02 5.148e-01 -0.130
movie_train$duration NA NA NA
movie_train$num_voted_users NA NA NA
movie_train$duration:movie_train$num_voted_users -1.767e-08 3.550e-09 -4.978
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.584e-09 2.191e-10 7.229
Pr(>|t|)
(Intercept) < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 2.23e-14 ***
poly(movie_train$num_voted_users, 2)2 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 < 2e-16 ***
movie_train$num_user_for_reviews < 2e-16 ***
poly(movie_train$duration, 2)1 < 2e-16 ***
poly(movie_train$duration, 2)2 1.34e-05 ***
movie_train$facenumber_in_poster 0.0428 *
movie_train$gross 0.0637 .
poly(movie_train$budget, 2)1 < 2e-16 ***
poly(movie_train$budget, 2)2 < 2e-16 ***
poly(movie_train$title_year, 2)1 < 2e-16 ***
poly(movie_train$title_year, 2)2 9.90e-09 ***
factor(movie_train$genres)Adventure 1.08e-11 ***
factor(movie_train$genres)Animation 4.39e-11 ***
factor(movie_train$genres)Biography < 2e-16 ***
factor(movie_train$genres)Comedy 0.0031 **
factor(movie_train$genres)Crime 5.81e-12 ***
factor(movie_train$genres)Documentary 1.30e-09 ***
factor(movie_train$genres)Drama < 2e-16 ***
factor(movie_train$genres)Family 0.7194
factor(movie_train$genres)Fantasy 0.0836 .
factor(movie_train$genres)Horror 1.34e-07 ***
factor(movie_train$genres)Musical 0.8380
factor(movie_train$genres)Mystery 0.2948
factor(movie_train$genres)Romance 0.2379
factor(movie_train$genres)Sci-Fi 0.4798
factor(movie_train$genres)Thriller 0.6878
factor(movie_train$genres)Western 0.8963
movie_train$duration NA
movie_train$num_voted_users NA
movie_train$duration:movie_train$num_voted_users 6.85e-07 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users 6.30e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.725 on 2672 degrees of freedom
Multiple R-squared: 0.5214, Adjusted R-squared: 0.5159
F-statistic: 93.91 on 31 and 2672 DF, p-value: < 2.2e-16
anova(lm.fit4,lm.fit3)
Analysis of Variance Table
Model 1: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + poly(movie_train$title_year,
2) + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews
Model 2: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2672 1404.6
2 2673 1422.0 -1 -17.384 33.07 9.9e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
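The p-value is very small (9.9e-09), so we reject the null hypothesis that the second-order coefficient for title year is zero: the quadratic term does improve the fit significantly, although adjusted R^2 rises only from 0.5101 to 0.5159. The following steps nevertheless continue with lm.fit3.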
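When two nested models differ by a single coefficient, the partial F statistic equals the square of that coefficient's t statistic. A quick check with the values printed above (variable names are illustrative):

```r
t_quad  <- -5.751   # t value of poly(title_year, 2)2 in summary(lm.fit4)
f_anova <- 33.07    # F from anova(lm.fit4, lm.fit3)

# For one added coefficient, the partial F equals the squared t statistic
t_sq <- t_quad^2
abs(t_sq - f_anova)   # small, the gap is rounding only
```

This is why the p-value in the ANOVA table matches the p-value of the quadratic title year term in summary(lm.fit4).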
Marginal Model plot:
marginalModelPlots(lm.fit3)
Splines and/or polynomials replaced by a fitted linear combination
residualPlots(lm.fit3)
The marginal model plots show the response against each individual predictor, ignoring the other predictors, and compare a smooth of the observed data with a smooth of the model's fitted values. In our plots the two curves nearly overlap for every predictor, so the model reproduces the marginal relationships well.
Check for residual outliers:
library(car)
qqPlot(lm.fit3$residuals,id.n = 10)
702 1013 113 2411 1129 1165 526 1256 2591 617
1 2 3 4 5 6 7 8 9 10
outlierTest(lm.fit3) # H0: residual is not an outlier
rstudent unadjusted p-value Bonferonni p
1017 -7.518666 7.5090e-14 2.0289e-10
665 -5.454427 5.3633e-08 1.4492e-04
237 -5.370977 8.5054e-08 2.2982e-04
438 -5.336489 1.0271e-07 2.7754e-04
1638 -4.976677 6.8794e-07 1.8588e-03
1490 -4.821383 1.5056e-06 4.0680e-03
1885 -4.751121 2.1299e-06 5.7551e-03
2004 -4.621856 3.9838e-06 1.0764e-02
230 -4.600509 4.4111e-06 1.1919e-02
1688 -4.463399 8.4012e-06 2.2700e-02
All 10 residuals have Bonferroni-adjusted p-values below 0.05, so they are candidates for removal.
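In the table above, the Bonferroni p is the unadjusted p multiplied by the number of residuals tested (about 2702 here, e.g. 7.5090e-14 × 2702 ≈ 2.03e-10). A minimal base-R sketch of the same calculation on simulated data, with one planted outlier; the data and the position of the outlier are assumptions made for illustration only:

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[50] <- y[50] + 10            # plant one large outlier at position 50

fit <- lm(y ~ x)
rs  <- rstudent(fit)           # studentized (deleted) residuals

# Two-sided t-test for each residual, then Bonferroni adjustment.
p_unadj <- 2 * pt(abs(rs), df = df.residual(fit) - 1, lower.tail = FALSE)
p_bonf  <- pmin(1, p_unadj * sum(!is.na(rs)))

flagged <- which(p_bonf < 0.05)
print(flagged)
```

A significant test only says that a residual is unusually large under the model; whether to drop the observation is still a judgment call.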
Before dropping anything, let’s run some diagnostics to double-check which observations to drop.
library(car)
influencePlot(lm.fit3, id.n=10)
From the influence plot, we decided to drop observations 2572, 1423, 860, 1520, 509, 682, 1017, 848, 361 and 237.
# lm.fit5: model based on lm.fit3 removing 10 outliers.
movie_train<-movie_train[-c(2572,1423,860,1520,509,682,1017,848,361,237),]
lm.fit5<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit5)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews)
Residuals:
Min 1Q Median 3Q Max
-4.0199 -0.3417 0.0631 0.4482 2.1921
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 4.939e+01 3.662e+00 13.489
poly(movie_train$num_voted_users, 2)1 3.607e+01 4.258e+00 8.470
poly(movie_train$num_voted_users, 2)2 -1.726e+01 1.912e+00 -9.028
poly(movie_train$num_critic_for_reviews, 2)1 1.179e+01 1.292e+00 9.124
poly(movie_train$num_critic_for_reviews, 2)2 -6.890e+00 8.185e-01 -8.418
movie_train$num_user_for_reviews -8.997e-04 9.718e-05 -9.258
poly(movie_train$duration, 2)1 1.310e+01 1.087e+00 12.055
poly(movie_train$duration, 2)2 -3.713e+00 7.669e-01 -4.841
movie_train$facenumber_in_poster -1.770e-02 6.739e-03 -2.627
movie_train$gross -6.864e-10 3.180e-10 -2.158
poly(movie_train$budget, 2)1 -1.122e+01 1.151e+00 -9.749
poly(movie_train$budget, 2)2 7.581e+00 7.955e-01 9.530
movie_train$title_year -2.135e-02 1.829e-03 -11.669
factor(movie_train$genres)Adventure 3.775e-01 5.422e-02 6.963
factor(movie_train$genres)Animation 8.417e-01 1.314e-01 6.405
factor(movie_train$genres)Biography 6.552e-01 7.454e-02 8.791
factor(movie_train$genres)Comedy 1.319e-01 4.292e-02 3.074
factor(movie_train$genres)Crime 4.393e-01 6.342e-02 6.927
factor(movie_train$genres)Documentary 1.102e+00 1.530e-01 7.203
factor(movie_train$genres)Drama 4.833e-01 4.797e-02 10.075
factor(movie_train$genres)Family 2.445e-01 5.087e-01 0.481
factor(movie_train$genres)Fantasy -1.847e-01 1.387e-01 -1.332
factor(movie_train$genres)Horror -3.908e-01 7.853e-02 -4.976
factor(movie_train$genres)Musical -1.044e-01 7.202e-01 -0.145
factor(movie_train$genres)Mystery 2.057e-01 2.019e-01 1.019
factor(movie_train$genres)Sci-Fi 2.135e-01 3.232e-01 0.660
factor(movie_train$genres)Thriller -4.767e-01 7.196e-01 -0.662
movie_train$duration NA NA NA
movie_train$num_voted_users NA NA NA
movie_train$duration:movie_train$num_voted_users -1.610e-08 3.652e-09 -4.408
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.481e-09 2.349e-10 6.306
Pr(>|t|)
(Intercept) < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 < 2e-16 ***
poly(movie_train$num_voted_users, 2)2 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 < 2e-16 ***
movie_train$num_user_for_reviews < 2e-16 ***
poly(movie_train$duration, 2)1 < 2e-16 ***
poly(movie_train$duration, 2)2 1.36e-06 ***
movie_train$facenumber_in_poster 0.00867 **
movie_train$gross 0.03099 *
poly(movie_train$budget, 2)1 < 2e-16 ***
poly(movie_train$budget, 2)2 < 2e-16 ***
movie_train$title_year < 2e-16 ***
factor(movie_train$genres)Adventure 4.19e-12 ***
factor(movie_train$genres)Animation 1.77e-10 ***
factor(movie_train$genres)Biography < 2e-16 ***
factor(movie_train$genres)Comedy 0.00213 **
factor(movie_train$genres)Crime 5.37e-12 ***
factor(movie_train$genres)Documentary 7.61e-13 ***
factor(movie_train$genres)Drama < 2e-16 ***
factor(movie_train$genres)Family 0.63080
factor(movie_train$genres)Fantasy 0.18300
factor(movie_train$genres)Horror 6.89e-07 ***
factor(movie_train$genres)Musical 0.88477
factor(movie_train$genres)Mystery 0.30847
factor(movie_train$genres)Sci-Fi 0.50900
factor(movie_train$genres)Thriller 0.50773
movie_train$duration NA
movie_train$num_voted_users NA
movie_train$duration:movie_train$num_voted_users 1.09e-05 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users 3.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.717 on 2665 degrees of freedom
Multiple R-squared: 0.5223, Adjusted R-squared: 0.5173
F-statistic: 104.1 on 28 and 2665 DF, p-value: < 2.2e-16
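The two NA rows (movie_train$duration and movie_train$num_voted_users) arise because a * b in an R formula expands to a + b + a:b, and both main effects are already spanned by the corresponding poly(, 2) terms. A minimal sketch on simulated data (x, z and y are hypothetical) showing the aliasing, and that writing the interaction with : avoids it:

```r
set.seed(7)
n <- 200
x <- rnorm(n, 100, 20)
z <- rnorm(n, 50, 10)
y <- 5 + 0.02 * x + 0.01 * z + 0.0005 * x * z + rnorm(n)

# x * z adds the main effects x and z again; each is a linear
# combination of the intercept and the first poly() column.
fit_star <- lm(y ~ poly(x, 2) + poly(z, 2) + x * z)

# x:z adds only the product term.
fit_colon <- lm(y ~ poly(x, 2) + poly(z, 2) + x:z)

aliased_star  <- names(coef(fit_star))[is.na(coef(fit_star))]
aliased_colon <- names(coef(fit_colon))[is.na(coef(fit_colon))]
print(aliased_star)
```

Both fits span the same column space and give the same fitted values, so the NA rows are redundant rather than harmful, but they are what triggers the rank-deficient warning from predict().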
compareCoefs(lm.fit3, lm.fit5)
Call:
1: lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + movie_train$gross +
poly(movie_train$budget, 2) + movie_train$title_year + factor(movie_train$genres) +
movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews)
2: lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + movie_train$gross +
poly(movie_train$budget, 2) + movie_train$title_year + factor(movie_train$genres) +
movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews)
Est. 1 SE 1 Est. 2
(Intercept) 5.03e+01 3.70e+00 4.94e+01
poly(movie_train$num_voted_users, 2)1 3.78e+01 4.49e+00 3.61e+01
poly(movie_train$num_voted_users, 2)2 -1.84e+01 2.15e+00 -1.73e+01
poly(movie_train$num_critic_for_reviews, 2)1 1.24e+01 1.30e+00 1.18e+01
poly(movie_train$num_critic_for_reviews, 2)2 -6.88e+00 8.40e-01 -6.89e+00
movie_train$num_user_for_reviews -8.98e-04 9.50e-05 -9.00e-04
poly(movie_train$duration, 2)1 1.37e+01 1.09e+00 1.31e+01
poly(movie_train$duration, 2)2 -3.47e+00 7.73e-01 -3.71e+00
movie_train$facenumber_in_poster -1.71e-02 6.85e-03 -1.77e-02
movie_train$gross -6.50e-10 3.20e-10 -6.86e-10
poly(movie_train$budget, 2)1 -1.11e+01 1.17e+00 -1.12e+01
poly(movie_train$budget, 2)2 7.54e+00 8.09e-01 7.58e+00
movie_train$title_year -2.18e-02 1.85e-03 -2.13e-02
factor(movie_train$genres)Adventure 3.67e-01 5.50e-02 3.78e-01
factor(movie_train$genres)Animation 8.31e-01 1.34e-01 8.42e-01
factor(movie_train$genres)Biography 6.50e-01 7.58e-02 6.55e-01
factor(movie_train$genres)Comedy 1.25e-01 4.36e-02 1.32e-01
factor(movie_train$genres)Crime 4.37e-01 6.44e-02 4.39e-01
factor(movie_train$genres)Documentary 8.71e-01 1.53e-01 1.10e+00
factor(movie_train$genres)Drama 4.78e-01 4.88e-02 4.83e-01
factor(movie_train$genres)Family 1.87e-01 4.24e-01 2.45e-01
factor(movie_train$genres)Fantasy -1.89e-01 1.41e-01 -1.85e-01
factor(movie_train$genres)Horror -3.97e-01 7.99e-02 -3.91e-01
factor(movie_train$genres)Musical -9.70e-02 7.33e-01 -1.04e-01
factor(movie_train$genres)Mystery 2.00e-01 2.05e-01 2.06e-01
factor(movie_train$genres)Romance 5.80e-01 5.17e-01
factor(movie_train$genres)Sci-Fi 2.21e-01 3.29e-01 2.13e-01
factor(movie_train$genres)Thriller -4.75e-01 7.32e-01 -4.77e-01
factor(movie_train$genres)Western -8.25e-02 5.18e-01
movie_train$duration
movie_train$num_voted_users
movie_train$duration:movie_train$num_voted_users -1.88e-08 3.57e-09 -1.61e-08
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.46e-09 2.19e-10 1.48e-09
SE 2
(Intercept) 3.66e+00
poly(movie_train$num_voted_users, 2)1 4.26e+00
poly(movie_train$num_voted_users, 2)2 1.91e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.29e+00
poly(movie_train$num_critic_for_reviews, 2)2 8.18e-01
movie_train$num_user_for_reviews 9.72e-05
poly(movie_train$duration, 2)1 1.09e+00
poly(movie_train$duration, 2)2 7.67e-01
movie_train$facenumber_in_poster 6.74e-03
movie_train$gross 3.18e-10
poly(movie_train$budget, 2)1 1.15e+00
poly(movie_train$budget, 2)2 7.95e-01
movie_train$title_year 1.83e-03
factor(movie_train$genres)Adventure 5.42e-02
factor(movie_train$genres)Animation 1.31e-01
factor(movie_train$genres)Biography 7.45e-02
factor(movie_train$genres)Comedy 4.29e-02
factor(movie_train$genres)Crime 6.34e-02
factor(movie_train$genres)Documentary 1.53e-01
factor(movie_train$genres)Drama 4.80e-02
factor(movie_train$genres)Family 5.09e-01
factor(movie_train$genres)Fantasy 1.39e-01
factor(movie_train$genres)Horror 7.85e-02
factor(movie_train$genres)Musical 7.20e-01
factor(movie_train$genres)Mystery 2.02e-01
factor(movie_train$genres)Romance
factor(movie_train$genres)Sci-Fi 3.23e-01
factor(movie_train$genres)Thriller 7.20e-01
factor(movie_train$genres)Western
movie_train$duration
movie_train$num_voted_users
movie_train$duration:movie_train$num_voted_users 3.65e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users 2.35e-10
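Removing the outliers changed the estimates very little. The main differences are the Documentary coefficient (0.871 to 1.10) and the Romance and Western levels, which no longer appear in lm.fit5, presumably because the dropped rows left no movies in those genres.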
Diagnostics for lm.fit5:
library(car)
residualPlots(lm.fit5)
Test stat Pr(>|t|)
poly(movie_train$num_voted_users, 2) NA NA
poly(movie_train$num_critic_for_reviews, 2) NA NA
movie_train$num_user_for_reviews 1.401 0.161
poly(movie_train$duration, 2) NA NA
movie_train$facenumber_in_poster 0.607 0.544
movie_train$gross -0.212 0.832
poly(movie_train$budget, 2) NA NA
movie_train$title_year -5.570 0.000
factor(movie_train$genres) NA NA
movie_train$duration -0.528 0.598
movie_train$num_voted_users -0.942 0.346
Tukey test -12.189 0.000
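Most panels look fine, but the residuals vs. fitted values plot shows some curvature (Tukey test statistic -12.189, p ≈ 0), and title_year also shows significant curvature (test statistic -5.570).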
plot(lm.fit5)
not plotting observations with leverage one:
69, 1934
Now, let’s check the model assumptions for both lm.fit3 and lm.fit5:
# normality
shapiro.test(lm.fit3$residuals)
Shapiro-Wilk normality test
data: lm.fit3$residuals
W = 0.93923, p-value < 2.2e-16
shapiro.test(lm.fit5$residuals)
Shapiro-Wilk normality test
data: lm.fit5$residuals
W = 0.94636, p-value < 2.2e-16
Both models failed the normality assumption. I think this is due to the many outliers in the data set.
# equal variance: H0: the error variance is constant
ncvTest(lm.fit3)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 145.0484 Df = 1 p = 2.095931e-33
ncvTest(lm.fit5)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 145.0484 Df = 1 p = 2.095931e-33
Both p-values are far below 0.05, so we reject the null hypothesis of constant variance: both models fail the equal variance assumption.
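ncvTest() tests the null hypothesis of constant error variance, so a small p-value is evidence of heteroscedasticity. A minimal base-R sketch of this kind of score test (Breusch-Pagan / Cook-Weisberg, with the variance modelled as a function of the fitted values) on simulated data; the helper name score_test_ncv is hypothetical, and this illustrates the idea rather than reproducing car's code:

```r
score_test_ncv <- function(fit) {
  e2  <- residuals(fit)^2
  u   <- e2 / mean(e2)                       # squared residuals scaled by the ML variance
  aux <- lm(u ~ fitted(fit))                 # auxiliary regression on the fitted values
  reg_ss <- sum((fitted(aux) - mean(u))^2)   # regression sum of squares
  chisq  <- reg_ss / 2
  c(chisq = chisq, df = 1,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

set.seed(3)
n <- 500
x <- runif(n, 1, 10)
y_het <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)  # error spread grows with x

res_het <- score_test_ncv(lm(y_het ~ x))
print(res_het)
```

When the test rejects, common remedies include transforming the response or using heteroscedasticity-robust standard errors.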
The following plots are just for further exploration: the data with fitted regression lines, by genre.
library(ggplot2)
ggplot(data=movie_train,aes(x=duration,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=num_voted_users,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=facenumber_in_poster,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=gross,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=budget,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
Rewrite model lm.fit5 in another notation. Note: if the model is written as lm(train$score ~ train$x1 + train$x2 + ...), predict() ignores newdata and returns one value per row of the training set, so the formula should use bare column names together with the data argument.
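The note on $ notation can be checked with a small sketch on simulated data (train, test, x1, x2 and score here are hypothetical stand-ins, not the movie data):

```r
set.seed(11)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$score <- 6 + 0.5 * train$x1 - 0.3 * train$x2 + rnorm(100, sd = 0.2)
test <- data.frame(x1 = rnorm(20), x2 = rnorm(20))

# Formula hard-wired to the training vectors: newdata cannot be used.
fit_dollar <- lm(train$score ~ train$x1 + train$x2)

# Formula with bare column names plus a data argument.
fit_data <- lm(score ~ x1 + x2, data = train)

# predict() warns that newdata has 20 rows but the variables found have 100.
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = test))
pred_data   <- predict(fit_data, newdata = test)

print(c(dollar = length(pred_dollar), data = length(pred_data)))
```

Only the second form produces one prediction per row of the test set.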
# lm.fit6 = lm.fit5, written with bare column names and a data argument
lm.fit6<-lm(imdb_score~poly(num_voted_users,2)+poly(num_critic_for_reviews,2)+num_user_for_reviews+poly(duration,2)+facenumber_in_poster+gross+poly(budget,2)+title_year+genres+duration*num_voted_users+num_voted_users*num_user_for_reviews,data=data.frame(movie_train))
summary(lm.fit6)
Call:
lm(formula = imdb_score ~ poly(num_voted_users, 2) + poly(num_critic_for_reviews,
2) + num_user_for_reviews + poly(duration, 2) + facenumber_in_poster +
gross + poly(budget, 2) + title_year + genres + duration *
num_voted_users + num_voted_users * num_user_for_reviews,
data = data.frame(movie_train))
Residuals:
Min 1Q Median 3Q Max
-4.0199 -0.3417 0.0631 0.4482 2.1921
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.939e+01 3.662e+00 13.489 < 2e-16 ***
poly(num_voted_users, 2)1 3.607e+01 4.258e+00 8.470 < 2e-16 ***
poly(num_voted_users, 2)2 -1.726e+01 1.912e+00 -9.028 < 2e-16 ***
poly(num_critic_for_reviews, 2)1 1.179e+01 1.292e+00 9.124 < 2e-16 ***
poly(num_critic_for_reviews, 2)2 -6.890e+00 8.185e-01 -8.418 < 2e-16 ***
num_user_for_reviews -8.997e-04 9.718e-05 -9.258 < 2e-16 ***
poly(duration, 2)1 1.310e+01 1.087e+00 12.055 < 2e-16 ***
poly(duration, 2)2 -3.713e+00 7.669e-01 -4.841 1.36e-06 ***
facenumber_in_poster -1.770e-02 6.739e-03 -2.627 0.00867 **
gross -6.864e-10 3.180e-10 -2.158 0.03099 *
poly(budget, 2)1 -1.122e+01 1.151e+00 -9.749 < 2e-16 ***
poly(budget, 2)2 7.581e+00 7.955e-01 9.530 < 2e-16 ***
title_year -2.135e-02 1.829e-03 -11.669 < 2e-16 ***
genresAdventure 3.775e-01 5.422e-02 6.963 4.19e-12 ***
genresAnimation 8.417e-01 1.314e-01 6.405 1.77e-10 ***
genresBiography 6.552e-01 7.454e-02 8.791 < 2e-16 ***
genresComedy 1.319e-01 4.292e-02 3.074 0.00213 **
genresCrime 4.393e-01 6.342e-02 6.927 5.37e-12 ***
genresDocumentary 1.102e+00 1.530e-01 7.203 7.61e-13 ***
genresDrama 4.833e-01 4.797e-02 10.075 < 2e-16 ***
genresFamily 2.445e-01 5.087e-01 0.481 0.63080
genresFantasy -1.847e-01 1.387e-01 -1.332 0.18300
genresHorror -3.908e-01 7.853e-02 -4.976 6.89e-07 ***
genresMusical -1.044e-01 7.202e-01 -0.145 0.88477
genresMystery 2.057e-01 2.019e-01 1.019 0.30847
genresSci-Fi 2.135e-01 3.232e-01 0.660 0.50900
genresThriller -4.767e-01 7.196e-01 -0.662 0.50773
duration NA NA NA NA
num_voted_users NA NA NA NA
duration:num_voted_users -1.610e-08 3.652e-09 -4.408 1.09e-05 ***
num_user_for_reviews:num_voted_users 1.481e-09 2.349e-10 6.306 3.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.717 on 2665 degrees of freedom
Multiple R-squared: 0.5223, Adjusted R-squared: 0.5173
F-statistic: 104.1 on 28 and 2665 DF, p-value: < 2.2e-16
pr<-predict.lm(lm.fit6,newdata = data.frame(movie_test),interval = 'confidence')
prediction from a rank-deficient fit may be misleading
pr
fit lwr upr
6 7.283794 6.961082 7.606506
58 5.601212 5.455020 5.747404
75 5.921525 5.769292 6.073758
91 7.076609 6.945594 7.207625
96 7.300318 7.055997 7.544639
110 6.182307 6.065289 6.299324
126 7.308791 7.139406 7.478176
133 5.039100 4.921396 5.156804
140 6.138358 6.020223 6.256493
147 5.545048 5.420151 5.669946
148 7.438595 7.195529 7.681661
187 7.713198 7.516667 7.909729
206 8.013502 7.817922 8.209082
228 6.721862 6.307709 7.136016
229 5.212996 5.075929 5.350063
234 5.604565 5.490636 5.718494
293 6.235727 6.155310 6.316144
295 6.462250 6.385914 6.538586
300 6.630399 6.537012 6.723785
307 5.977920 5.900503 6.055336
332 5.838179 5.725647 5.950712
341 7.880073 7.551832 8.208314
342 5.393556 5.302269 5.484843
347 6.197072 6.090796 6.303348
361 5.824857 5.736939 5.912774
364 7.034808 6.949784 7.119832
369 5.453146 5.361252 5.545041
374 6.595132 6.373851 6.816412
390 5.527691 5.419353 5.636030
399 7.052506 6.957343 7.147669
401 7.585817 7.431986 7.739649
410 6.377673 6.275528 6.479819
413 5.611617 5.500098 5.723136
416 5.764839 5.693482 5.836196
422 5.591627 5.518604 5.664651
427 5.693757 5.607101 5.780412
433 6.431746 6.327910 6.535582
448 5.918329 5.771181 6.065476
482 5.492688 5.404294 5.581083
495 5.266105 5.152707 5.379503
500 6.156436 6.020799 6.292072
510 8.159068 7.815930 8.502207
549 6.533162 6.447668 6.618657
557 5.412561 5.327596 5.497527
573 5.355342 5.253819 5.456865
580 6.219821 6.016168 6.423475
591 5.850221 5.770026 5.930415
593 6.045939 5.977434 6.114445
640 7.409309 7.264856 7.553763
655 9.332587 8.723534 9.941640
656 7.436584 7.309997 7.563171
673 5.577126 5.475291 5.678962
675 6.257584 6.179447 6.335721
676 6.301542 6.231706 6.371379
696 5.892425 5.773144 6.011705
707 7.552570 7.319117 7.786023
715 5.504569 5.369260 5.639877
746 6.109287 5.467713 6.750860
748 5.423282 5.347300 5.499264
754 6.131325 6.055557 6.207094
821 5.830406 5.719803 5.941010
826 5.639553 5.552655 5.726452
828 7.560593 7.444525 7.676661
837 6.987393 6.290385 7.684401
856 8.765150 8.520057 9.010242
867 7.174240 7.077694 7.270785
876 5.929131 5.856383 6.001879
889 6.504289 6.423404 6.585173
893 6.541391 6.474521 6.608261
894 7.279606 7.186910 7.372303
905 5.698541 5.569067 5.828014
916 6.186823 6.036703 6.336944
930 6.738695 6.622053 6.855337
937 7.162173 6.751125 7.573220
972 5.903300 5.802902 6.003698
978 6.044203 5.971660 6.116746
988 6.294255 6.178753 6.409758
990 5.149862 5.073185 5.226540
992 5.907266 5.832287 5.982246
999 6.290384 6.175418 6.405350
1009 5.888397 5.798136 5.978657
1066 5.742642 5.660825 5.824460
1074 5.672596 5.578611 5.766580
1076 5.821598 5.711399 5.931797
1099 6.212266 6.115181 6.309351
1112 5.739309 5.638777 5.839840
1118 5.305699 5.222783 5.388615
1122 6.093028 5.996862 6.189193
1131 5.702184 5.635930 5.768439
1133 6.506325 6.413742 6.598907
1139 7.050003 6.942645 7.157360
1174 6.892636 6.784939 7.000333
1180 6.429248 6.342962 6.515534
1182 8.839734 8.591493 9.087976
1185 7.115881 7.010670 7.221091
1192 5.301158 5.204124 5.398191
1193 6.922383 6.788407 7.056359
1198 8.528362 8.327797 8.728927
1201 5.720273 5.569488 5.871058
1217 6.250213 6.184586 6.315839
1241 6.480293 6.088190 6.872396
1245 5.917872 5.858279 5.977466
1246 6.797700 6.653906 6.941495
1247 6.691126 6.572223 6.810029
1270 6.819103 6.697303 6.940904
1273 5.902787 5.831914 5.973661
1287 6.137365 6.041581 6.233149
1321 6.253898 6.190694 6.317102
1336 6.231014 6.094288 6.367740
1341 6.554074 6.477910 6.630238
1364 5.891956 5.786519 5.997393
1373 5.951321 5.851278 6.051364
1388 6.117664 6.039334 6.195995
1397 5.446647 5.377593 5.515701
1400 6.378499 6.258656 6.498342
1401 7.315684 7.184701 7.446667
1408 7.290262 7.141059 7.439464
1410 7.133591 6.992800 7.274381
1411 6.558030 6.460226 6.655834
1418 6.177232 6.079913 6.274552
1422 5.717356 5.579012 5.855700
1439 7.121375 6.848753 7.393997
1446 5.650848 5.547147 5.754549
1548 5.520363 5.385977 5.654748
1549 7.085216 6.952624 7.217809
1558 5.900668 5.821554 5.979782
1582 6.711916 6.597672 6.826161
1599 6.211153 6.145247 6.277060
1611 6.479858 6.382857 6.576859
1646 7.338276 7.246799 7.429753
1651 6.026793 5.947140 6.106446
1655 5.757383 5.678811 5.835956
1671 5.487285 5.345969 5.628602
1672 5.590823 5.518364 5.663283
1682 5.695418 5.632832 5.758005
1694 6.024141 5.942383 6.105898
1699 6.581304 6.485454 6.677153
1700 7.120848 6.967389 7.274306
1713 5.859345 5.790517 5.928174
1714 7.003628 6.915034 7.092222
1745 5.822302 5.751955 5.892650
1749 7.673570 7.500163 7.846977
1758 6.158614 6.007205 6.310022
1770 6.746408 6.662181 6.830635
1797 7.597270 7.495751 7.698789
1807 6.568264 6.456260 6.680269
1823 6.091849 5.980337 6.203362
1846 5.798714 5.662622 5.934806
1850 6.211956 6.151219 6.272693
1852 6.285271 6.182976 6.387566
1853 6.862732 6.782164 6.943300
1854 5.468466 5.387634 5.549298
1862 6.041243 5.977489 6.104997
1873 6.334997 6.245959 6.424035
1904 9.255506 9.021379 9.489633
1924 6.197991 6.140431 6.255552
1927 6.395225 6.320564 6.469885
1946 6.704440 6.614532 6.794349
1965 5.714761 5.642714 5.786809
1969 5.388521 5.249491 5.527552
1985 5.715677 5.576753 5.854601
1999 5.466626 5.375041 5.558210
2009 5.592615 5.474914 5.710315
2028 6.745599 6.657779 6.833419
2068 6.163883 6.080445 6.247321
2077 6.731823 6.643679 6.819968
2091 6.088708 6.010355 6.167061
2093 6.707359 6.523812 6.890906
2102 5.879199 5.822701 5.935698
2129 5.964289 5.857058 6.071519
2149 5.635891 5.561805 5.709978
2153 8.794187 8.589088 8.999285
2175 8.583610 8.365017 8.802203
2176 6.037542 5.959823 6.115260
2193 5.852414 5.670277 6.034552
2194 7.592880 7.459677 7.726084
2198 6.014088 5.928804 6.099371
2202 6.763022 6.636603 6.889440
2204 5.507692 5.410111 5.605274
2214 5.595364 5.505218 5.685510
2241 5.921043 5.842751 5.999335
2265 5.845115 5.790224 5.900005
2295 6.401580 6.297395 6.505766
2305 5.243665 5.162959 5.324371
2312 7.397805 7.135635 7.659976
2323 6.640331 6.546219 6.734444
2342 5.931703 5.813450 6.049957
2373 6.120492 5.982827 6.258156
2378 6.794626 6.667515 6.921738
2402 5.697400 5.626317 5.768483
2420 5.438677 5.343548 5.533807
2431 6.031952 5.966231 6.097672
2512 5.529612 5.458150 5.601073
2513 5.578756 5.460516 5.696996
2516 5.790487 5.598594 5.982379
2536 7.075620 6.991078 7.160162
2542 6.811703 6.724004 6.899402
2551 5.279088 5.193439 5.364738
2572 6.043564 5.960296 6.126833
2603 6.090629 5.977271 6.203988
2633 6.440978 6.343831 6.538125
2652 7.989268 7.847937 8.130599
2654 7.894060 7.598932 8.189187
2670 5.819352 5.711579 5.927125
2688 5.648052 5.564605 5.731500
2690 7.109840 7.023055 7.196625
2693 6.312941 6.200445 6.425436
2715 6.046727 5.983661 6.109794
2716 5.645893 5.581399 5.710386
2726 6.205706 6.124288 6.287123
2732 6.844379 6.725492 6.963266
2757 6.308467 6.217634 6.399301
2774 6.269018 6.187851 6.350185
2798 5.667449 5.541309 5.793589
2801 5.541879 5.465789 5.617970
2817 7.068054 6.901567 7.234540
2819 5.942133 5.867150 6.017115
2851 5.822851 5.717472 5.928229
2852 5.719791 5.600572 5.839010
2917 7.818827 7.633040 8.004613
2919 7.317026 7.189647 7.444406
2926 6.927297 6.291808 7.562786
2929 6.650686 6.533494 6.767879
2932 6.869399 6.757915 6.980882
2961 5.764994 5.687460 5.842527
2979 5.480510 5.399615 5.561405
2982 6.334079 6.260312 6.407845
3027 8.079233 7.879171 8.279295
3035 6.401612 6.331081 6.472142
3045 6.119467 5.997652 6.241281
3051 5.637182 5.569202 5.705163
3056 7.131470 6.996185 7.266755
3057 6.765685 6.662927 6.868443
3092 6.398534 6.255107 6.541960
3103 6.026813 5.920733 6.132893
3134 5.334536 5.201535 5.467536
3172 5.932191 5.865909 5.998472
3173 6.870753 6.750227 6.991279
3177 5.723944 5.626716 5.821172
3196 5.723279 5.624227 5.822331
3200 5.370359 5.230384 5.510335
3245 5.401082 5.309903 5.492261
3268 7.403859 7.196112 7.611605
3295 5.501005 5.369876 5.632135
3333 7.122691 7.027480 7.217902
3340 6.055678 5.981494 6.129863
3365 6.060397 5.974922 6.145872
3366 7.485740 7.335148 7.636332
3370 5.397073 5.261441 5.532704
3382 5.965782 5.901493 6.030071
3390 7.544392 7.431831 7.656952
3399 6.951968 6.685069 7.218867
3414 5.993159 5.928012 6.058306
3531 6.498444 6.357787 6.639101
3592 5.898927 5.820691 5.977162
3598 5.899420 5.782248 6.016592
3628 6.535066 6.457807 6.612325
3665 5.421177 5.309760 5.532594
3675 6.559451 6.477981 6.640921
3713 5.949436 5.679015 6.219857
3717 9.345175 8.874453 9.815897
3722 5.891308 5.813311 5.969306
3724 6.185142 5.929349 6.440936
3766 5.889138 5.822060 5.956216
3789 6.178332 6.059866 6.296798
3849 6.838529 6.726385 6.950673
3850 8.985470 8.743098 9.227843
3866 6.569541 6.491553 6.647528
3894 6.472803 6.376085 6.569521
3903 6.223353 6.084295 6.362410
3919 6.299085 6.218012 6.380159
3986 7.174568 7.065273 7.283863
4004 6.110685 5.865103 6.356266
4009 5.714965 5.641467 5.788463
4026 7.293088 7.123176 7.462999
4068 6.233893 6.120107 6.347679
4073 6.094358 6.022999 6.165717
4092 6.404772 6.286919 6.522625
4158 9.310683 9.052868 9.568497
4169 6.956346 6.831286 7.081406
4362 6.622683 6.362205 6.883161
4403 6.426709 6.347768 6.505649
4424 5.628411 5.492258 5.764564
4436 5.946022 5.852596 6.039448
4447 7.014620 6.924030 7.105210
4465 6.226858 6.139830 6.313886
4502 5.623731 5.542839 5.704623
4537 6.110264 5.957699 6.262829
4546 5.428826 5.287890 5.569761
4656 5.945255 5.853189 6.037320
4697 7.098388 7.009629 7.187146
4755 6.303008 6.212814 6.393203
4813 7.058448 5.634333 8.482564
4829 5.874389 5.793694 5.955084
4853 6.844704 6.752892 6.936516
4859 5.753937 5.676739 5.831134
4864 5.837533 5.732279 5.942788
4892 6.340879 6.260317 6.421441
4985 6.040234 5.967985 6.112482
4988 6.059968 5.982885 6.137051
4998 6.286891 6.200964 6.372817
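The intervals above are confidence intervals for the mean score at each movie's predictor values. For the score of an individual movie, interval = 'prediction' gives a wider and more appropriate range. A minimal sketch on simulated data (the column names are borrowed from the data set, but the values and the single-predictor model are made up) comparing the two and computing a test RMSE:

```r
set.seed(5)
n <- 300
dat <- data.frame(duration = rnorm(n, 110, 20))
dat$imdb_score <- 3 + 0.03 * dat$duration + rnorm(n, sd = 0.8)

# Hold out 60 rows as a test set.
train_idx <- sample(n, 240)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit <- lm(imdb_score ~ duration, data = train)
conf_int <- predict(fit, newdata = test, interval = "confidence")
pred_int <- predict(fit, newdata = test, interval = "prediction")

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]

# Root mean squared error of the point predictions on the held-out rows.
rmse <- sqrt(mean((test$imdb_score - pred_int[, "fit"])^2))
print(c(mean_conf_width = mean(conf_width),
        mean_pred_width = mean(pred_width),
        rmse = rmse))
```

Comparing the fitted values with the observed imdb_score in the test set, for example through the RMSE, is what tells us how well the model predicts.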
Check the relative importance of the predictors for the IMDB score using standardized coefficients:
library(QuantPsyc)
lm.beta(lm.fit6)
Calling var(x) on a factor x is deprecated and will become an error.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
longer object length is not a multiple of shorter object length
poly(num_voted_users, 2)1 poly(num_voted_users, 2)2
6.735011e-01 -3.223345e-01
poly(num_critic_for_reviews, 2)1 poly(num_critic_for_reviews, 2)2
4.532313e+03 -1.286433e-01
num_user_for_reviews poly(duration, 2)1
-1.860903e-03 9.193268e+08
poly(duration, 2)2 facenumber_in_poster
-6.932039e-02 -1.693137e-01
gross poly(budget, 2)1
-1.984955e-09 -2.365565e+02
poly(budget, 2)2 title_year
1.109206e+06 -3.986154e-04
genresAdventure genresAnimation
7.048767e-03 3.236437e+02
genresBiography genresComedy
1.223394e-02 2.728816e-01
genresCrime genresDocumentary
3.082938e+07 2.057189e-02
genresDrama genresFamily
4.622652e+00 7.070758e-01
genresFantasy genresHorror
-3.895118e+00 -5.717494e+04
genresMusical genresMystery
-1.949084e-03 3.840150e-03
genresSci-Fi genresThriller
8.208130e+01 -8.900566e-03
duration:num_voted_users num_user_for_reviews:num_voted_users
-3.329438e-08 1.039401e-01
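The two warnings suggest that lm.beta() did not handle this model as intended. As far as we can tell, it scales each coefficient by the standard deviation of a column of the model frame, which presumably breaks down when a term is a factor (genres) or a multi-column poly() term; several values above (for example 9.19e+08) are implausibly large for standardized coefficients. A sketch of an alternative on simulated data is to standardize the numeric variables before fitting. All variable names and values below are hypothetical.

```r
set.seed(9)
n <- 400
d <- data.frame(duration = rnorm(n, 110, 20),
                votes    = rexp(n, 1 / 50000))
d$score <- 4 + 0.02 * d$duration + 0.00001 * d$votes + rnorm(n, sd = 0.7)

fit_raw <- lm(score ~ duration + votes, data = d)

# Refit on z-scores: each coefficient now reads as
# "standard deviations of score per standard deviation of the predictor".
d_std   <- as.data.frame(scale(d))
fit_std <- lm(score ~ duration + votes, data = d_std)
beta_std <- coef(fit_std)[-1]

# Equivalent formula: b * sd(x) / sd(y)
beta_formula <- coef(fit_raw)[-1] *
  sapply(d[c("duration", "votes")], sd) / sd(d$score)

print(round(beta_std, 3))
```

Standardized coefficients obtained this way are on a common scale and can be compared across numeric predictors; factor levels and polynomial terms need separate treatment.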
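Conclusion: According to the lm.beta output, duration is the most important factor affecting the movie rating. In lm.fit6 its linear term is positive and its quadratic term is negative, so longer movies tend to be rated higher, with a diminishing effect. num_critic_for_reviews is also an important predictor. Budget is important as well, although there is no strong correlation between budget and movie rating. The number of faces in the movie poster has a small but non-negligible negative effect on the rating (coefficient -0.0177, p = 0.00867). Note that lm.beta issued warnings for this model (a factor predictor and mismatched vector lengths), so the standardized values above, several of which are implausibly large, should be treated with caution.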