Purpose: Using regression analysis, we want to know: 1) Among the 27 variables given, which are critical in determining the IMDB rating of a movie? 2) Is there any correlation between genre and IMDB rating, number of faces in the poster and IMDB rating, director name and IMDB rating, and duration and IMDB rating? 3) Can we predict the IMDB score using our model?
m<- read.csv('movie_metadata.csv')
cannot open file 'movie_metadata.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection
This data set was found on Kaggle. The author scraped 5000+ movies from the IMDB website using a Python library called “scrapy” and obtained all 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. Below are the 28 variables: “movie_title” “color” “num_critic_for_reviews” “movie_facebook_likes” “duration” “director_name” “director_facebook_likes” “actor_3_name” “actor_3_facebook_likes” “actor_2_name” “actor_2_facebook_likes” “actor_1_name” “actor_1_facebook_likes” “gross” “genres” “num_voted_users” “cast_total_facebook_likes” “facenumber_in_poster” “plot_keywords” “movie_imdb_link” “num_user_for_reviews” “language” “country” “content_rating” “budget” “title_year” “imdb_score” “aspect_ratio”
This dataset is a proof of concept. It can be used for experimental and learning purposes. For comprehensive movie analysis and accurate movie rating prediction, 28 attributes from 5000 movies might not be enough. A decent dataset could contain hundreds of attributes from 50K or more movies and would require a lot of feature engineering.
Assign the first word of genres as the genre of each movie (the genres column was split into words in Excel):
# remove columns X-X.8
which(colnames(m)=='genres')
[1] 10
which(colnames(m)=='X.8')
[1] 19
m<-m[,-c(11:19)]
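The Excel step could also be done directly in R. The following is a minimal sketch, assuming the raw genres column holds pipe-separated values such as "Action|Adventure|Fantasy". The toy vector genres_raw and the name first_genre are illustrative and not part of the original analysis.

```r
# Toy stand-in for the raw genres column (assumed to be pipe-separated)
genres_raw <- c("Action|Adventure|Fantasy", "Comedy", "Drama|Romance")

# Drop everything from the first "|" onward, keeping only the first genre
first_genre <- sub("\\|.*$", "", genres_raw)

print(first_genre)
```

Doing the split in code keeps the whole preparation reproducible and avoids the helper columns that had to be removed above.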
We only keep movie data for the USA, because the “budget” variable was not always converted to US dollars, which might cause problems in later analysis. If we wanted to convert all budgets into US dollars, we would also have to take inflation into consideration, which would make the problem more complicated. Therefore, for practice purposes, we decided to study only movies from the USA.
movie.usa<-m[which(m[,'country']=='USA'),]
Double check:
movie.usa$country
[1] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[22] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[43] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[64] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[85] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[106] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[127] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[148] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[169] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[190] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[211] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[232] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[253] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[274] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[295] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[316] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[337] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[358] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[379] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[400] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[421] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[442] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[463] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[484] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[505] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[526] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[547] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[568] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[589] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[610] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[631] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[652] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[673] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[694] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[715] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[736] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[757] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[778] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[799] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[820] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[841] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[862] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[883] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[904] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[925] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[946] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[967] USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
[988] USA USA USA USA USA USA USA USA USA USA USA USA USA
[ reached getOption("max.print") -- omitted 2807 entries ]
66 Levels: Afghanistan Argentina Aruba Australia Bahamas Belgium Brazil ... West Germany
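Printing the whole column works as a check, but a single logical value is easier to read. The sketch below uses a small made-up data frame (toy, toy_usa) rather than the real data. It also shows that the unused factor levels seen in the output above can be discarded with droplevels().

```r
# Toy stand-in for the movie data (illustrative values only)
toy <- data.frame(country = factor(c("USA", "UK", "USA", "France")),
                  score = c(7.1, 6.5, 8.0, 5.9))
toy_usa <- toy[which(toy$country == "USA"), ]

# One TRUE/FALSE answer instead of printing every entry
only_usa <- all(toy_usa$country == "USA")

# droplevels() removes factor levels that no longer occur in the data
remaining_levels <- levels(droplevels(toy_usa$country))

print(only_usa)
print(remaining_levels)
```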
Remove ‘language’: after removing all countries except the USA, almost all remaining movies are in English (3779 of 3807), with only a handful in other languages, so the variable is not meaningful for our prediction.
summary(movie.usa$language)
Aboriginal Arabic Aramaic Bosnian Cantonese Chinese Czech
10 0 0 1 1 1 0 0
Danish Dari Dutch Dzongkha English Filipino French German
0 1 0 0 3779 1 0 0
Greek Hebrew Hindi Hungarian Icelandic Indonesian Italian Japanese
0 1 1 0 0 0 0 1
Kannada Kazakh Korean Mandarin Maya Mongolian None Norwegian
0 0 0 0 1 0 1 0
Panjabi Persian Polish Portuguese Romanian Russian Slovenian Spanish
0 0 0 0 0 0 0 7
Swahili Swedish Tamil Telugu Thai Urdu Vietnamese Zulu
0 0 0 0 0 0 1 0
movie.usa<-movie.usa[, -which(names(movie.usa)=='language')]
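One caveat with the -which() idiom: if the column name is misspelled or already removed, which() returns an empty vector and the result has no columns at all. A minimal sketch of a more defensive alternative, using a made-up data frame (toy) and a hypothetical drop list (drop_cols):

```r
# Toy data frame (illustrative only)
toy <- data.frame(a = 1:3, language = c("en", "en", "es"), b = 4:6)

# Keep every column whose name is not in the drop list;
# "not_a_column" is deliberately absent to show that this is harmless
drop_cols <- c("language", "not_a_column")
toy_kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]

# For comparison: -which() selects zero columns when nothing matches
toy_bad <- toy[, -which(names(toy) == "not_a_column"), drop = FALSE]

print(names(toy_kept))
print(ncol(toy_bad))
```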
Remove the ‘movie_imdb_link’ column since it is not useful for our analysis, and store the rest of the data as ‘mm’.
movie.df= data.frame(movie.usa)
mm<-movie.df[, -which(names(movie.df)=='movie_imdb_link')]
str(mm)
'data.frame': 3807 obs. of 26 variables:
$ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
$ director_name : Factor w/ 2399 levels "","\xcc\xe4mile Gaudreault",..: 926 799 379 106 2030 1652 1225 2394 284 799 ...
$ num_critic_for_reviews : int 723 302 813 462 392 324 635 673 434 313 ...
$ duration : int 178 169 164 132 156 100 141 183 169 151 ...
$ director_facebook_likes : int 0 563 22000 475 0 15 0 0 0 563 ...
$ actor_3_facebook_likes : int 855 1000 23000 530 4000 284 19000 2000 903 1000 ...
$ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 534 2549 1228 801 2440 1704 1911 2218 ...
$ actor_1_facebook_likes : int 1000 40000 27000 640 24000 799 26000 15000 18000 40000 ...
$ gross : int 760505847 309404152 448130642 73058679 336530303 200807262 458991599 330249062 200069408 423032628 ...
$ genres : Factor w/ 21 levels "Action","Adventure",..: 1 1 1 1 1 2 1 1 1 1 ...
$ actor_1_name : Factor w/ 2098 levels "","\xcc\xd2lafur Darri \xcc\xd2lafsson",..: 303 982 1968 441 786 221 337 740 1104 982 ...
$ movie_title : Factor w/ 4917 levels "[Rec] 2\xe5\xca",..: 397 2731 3707 1960 3289 3459 398 460 3416 2732 ...
$ num_voted_users : int 886204 471220 1144337 212204 383056 294810 462669 371639 240396 522040 ...
$ cast_total_facebook_likes: int 4834 48350 106759 1873 46055 2036 92000 24450 29991 48486 ...
$ actor_3_name : Factor w/ 3522 levels "","\xcc\xd2scar Jaenada",..: 3442 1393 1769 2714 1969 2162 3018 57 1134 1393 ...
$ facenumber_in_poster : int 0 0 0 1 0 1 4 0 0 2 ...
$ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 3484 651 4745 29 1142 1564 3312 2188 ...
$ num_user_for_reviews : int 3054 1238 2701 738 1902 387 1117 3018 2367 1832 ...
$ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 65 65 65 65 65 65 65 65 ...
$ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 9 10 10 10 10 ...
$ budget : num 2.37e+08 3.00e+08 2.50e+08 2.64e+08 2.58e+08 ...
$ title_year : int 2009 2007 2012 2012 2007 2010 2015 2016 2006 2006 ...
$ actor_2_facebook_likes : int 936 5000 23000 632 11000 553 21000 4000 10000 5000 ...
$ imdb_score : num 7.9 7.1 8.5 6.6 6.2 7.8 7.5 6.9 6.1 7.3 ...
$ aspect_ratio : num 1.78 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 2.35 ...
$ movie_facebook_likes : int 33000 0 164000 24000 0 29000 118000 197000 0 5000 ...
Check for missing values:
library(Amelia)
Loading required package: Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.7.4, built: 2015-12-05)
## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(mm, main = "Missing values vs observed")
sapply(mm,function(x) sum(is.na(x))) # number of missing values for each variable
color director_name num_critic_for_reviews
0 0 39
duration director_facebook_likes actor_3_facebook_likes
6 74 13
actor_2_name actor_1_facebook_likes gross
0 4 572
genres actor_1_name movie_title
0 0 0
num_voted_users cast_total_facebook_likes actor_3_name
0 0 0
facenumber_in_poster plot_keywords num_user_for_reviews
12 0 13
country content_rating budget
0 0 298
title_year actor_2_facebook_likes imdb_score
74 7 0
aspect_ratio movie_facebook_likes
222 0
We noticed that there are many missing values for gross (572), budget (298) and aspect ratio (222).
Omit missing values:
movie<-na.omit(mm)
sapply(movie,function(x) sum(is.na(x))) # double check for missing values
color director_name num_critic_for_reviews
0 0 0
duration director_facebook_likes actor_3_facebook_likes
0 0 0
actor_2_name actor_1_facebook_likes gross
0 0 0
genres actor_1_name movie_title
0 0 0
num_voted_users cast_total_facebook_likes actor_3_name
0 0 0
facenumber_in_poster plot_keywords num_user_for_reviews
0 0 0
country content_rating budget
0 0 0
title_year actor_2_facebook_likes imdb_score
0 0 0
aspect_ratio movie_facebook_likes
0 0
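na.omit() performs listwise deletion: a row is dropped as soon as any one of its columns is missing. In this analysis that reduces the data from 3807 rows to the 3005 used in the models further down. The sketch below shows the mechanics on a made-up data frame (toy), so the numbers in it are illustrative only.

```r
# Toy data with missing values scattered over two columns
toy <- data.frame(gross  = c(100, NA, 300, NA),
                  budget = c(50, 60, NA, 80),
                  score  = c(7, 6, 8, 5))

# Missing values per column (same information as the sapply() call)
na_per_column <- colSums(is.na(toy))

# Listwise deletion: keep only rows with no missing value at all
toy_complete <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_complete)

print(na_per_column)
print(rows_dropped)
```

Comparing the row counts before and after makes the cost of listwise deletion explicit, which helps when deciding whether a sparsely populated column is worth keeping.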
library(psych)
library(car)
Attaching package: ‘car’
The following object is masked from ‘package:psych’:
logit
library(RColorBrewer)
library(corrplot)
library(ggplot2)
Attaching package: ‘ggplot2’
The following objects are masked from ‘package:psych’:
%+%, alpha
Explore title_year predictor:
range(movie$title_year) # check movie title year
[1] 1920 2016
sum(with(movie,title_year=='2009')) # 145
[1] 145
sum(with(movie,title_year=='2014')) # 121
[1] 121
Visualization of title Year vs. Score:
library(scatter)
Error in library(scatter) : there is no package called ‘scatter’
There are many outliers for title year. The majority of data points are from around 2000 and later, which makes sense because there are fewer movies from the early years. An interesting observation is that movies from earlier years tend to have higher scores.
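The observation that older movies score higher can also be checked numerically by grouping movies into decades. This is a minimal sketch on made-up values. The data frame toy and the derived decade column are assumptions for illustration, not results from the real data.

```r
# Toy data (illustrative values only)
toy <- data.frame(title_year = c(1939, 1954, 1994, 2001, 2009, 2014),
                  imdb_score = c(8.2, 8.0, 7.4, 6.5, 6.9, 6.1))

# Bucket each year into its decade
toy$decade <- floor(toy$title_year / 10) * 10

# Number of movies and mean score per decade
n_by_decade <- table(toy$decade)
mean_by_decade <- tapply(toy$imdb_score, toy$decade, mean)

print(n_by_decade)
print(mean_by_decade)
```

Looking at the counts next to the means matters here: decades with very few movies can show a high average simply because only well-known films from that period are in the data.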
Visualization of IMDB Score:
max(movie$imdb_score) # 9.3
[1] 9.3
ggplot(movie, aes(x = imdb_score)) +
geom_histogram(aes(fill = ..count..), binwidth =0.5) +
scale_x_continuous(name = "IMDB Score",
breaks = seq(0,10),
limits=c(1, 10)) +
ggtitle("Histogram of Movie IMDB Score") +
scale_fill_gradient("Count", low = "blue", high = "red")
sum(with(movie,imdb_score>=8))
[1] 148
# 148 movies with an IMDB score greater than or equal to 8.
The distribution of IMDB scores looks roughly normal. The highest score is 9.3 on a 10-point scale. We can consider movies with a score of 8 or higher to be great movies from many perspectives.
Exploring correlation:
pairs.panels(movie[c('director_name','duration','facenumber_in_poster','imdb_score','genres')])
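From the plot, duration and IMDB score are the only pair with a sizeable correlation. The number of faces in the poster has a negative correlation with IMDB score. Genre has little correlation with score. Interestingly, director name shows no correlation with IMDB score. Note that director_name and genres are factors, so the coefficients shown for them are computed on their level codes and depend on the arbitrary ordering of the levels; they should be read with caution.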
pairs.panels(movie[c('color','actor_1_name','title_year','imdb_score','aspect_ratio','gross')])
Color and title year have a highly positive correlation. Color has smaller positive correlations with aspect ratio and gross. Actor 1 name has a very small positive correlation with gross, suggesting that who plays in the movie has little impact on the gross. Title year is positively correlated with both aspect ratio and color. IMDB score has a very small positive correlation with actor 1 name, which suggests that the lead actor does not by itself give a movie a higher score. Interestingly, IMDB score has a negative correlation with title year, which means that older movies seem to have higher scores. This result agrees with our observation from the scatter plot. IMDB score and aspect ratio have a small positive correlation. IMDB score has a positive correlation with gross (0.27 in the correlation matrix further down).
Correlation plot for all numerical variables:
nums<- sapply(movie,is.numeric) # select numeric columns
movie.num<- movie[,nums]
corrplot(cor(movie.num),method='ellipse')
Note: corrplot expects a correlation matrix rather than a data frame, so we first compute the matrix with cor().
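The two steps involved, selecting the numeric columns and turning them into a correlation matrix, can be seen on a small made-up data frame. The names toy, is_num and toy_cor are illustrative. Only base R is used, so the sketch stops before the plotting call.

```r
# Toy data with one non-numeric column (illustrative values only)
toy <- data.frame(title = c("a", "b", "c", "d"),
                  duration = c(90, 100, 120, 150),
                  score = c(5.5, 6.2, 6.8, 8.5),
                  stringsAsFactors = TRUE)

# Logical vector marking the numeric columns
is_num <- sapply(toy, is.numeric)

# cor() returns a square matrix, the input type a correlation plot needs
toy_cor <- cor(toy[, is_num])

print(is_num)
print(round(toy_cor, 2))
```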
From the correlation plot, we can tell that: the number of faces in the poster is only weakly correlated with every other variable (no coefficient exceeds 0.10 in absolute value in the correlation matrix) and slightly negatively correlated with IMDB score. Cast total Facebook likes and actor 1 Facebook likes have a very strong positive correlation (0.95). Budget and gross are strongly correlated (0.64), which is not surprising. Interestingly, IMDB score is positively correlated with the number of critic reviews (0.36): the more critic reviews a movie has, the higher its score tends to be. Duration (0.38) and number of voted users (0.51) also correlate positively with IMDB score.
Find the pairs of correlations
corr.test(movie.num,y=NULL,use='pairwise',method='pearson',adjust='holm',alpha=0.05) # x must be numeric
Call:corr.test(x = movie.num, y = NULL, use = "pairwise", method = "pearson",
adjust = "holm", alpha = 0.05)
Correlation matrix
num_critic_for_reviews duration director_facebook_likes
num_critic_for_reviews 1.00 0.26 0.19
duration 0.26 1.00 0.21
director_facebook_likes 0.19 0.21 1.00
actor_3_facebook_likes 0.28 0.14 0.12
actor_1_facebook_likes 0.17 0.09 0.09
gross 0.48 0.28 0.14
num_voted_users 0.60 0.37 0.32
cast_total_facebook_likes 0.25 0.13 0.12
facenumber_in_poster -0.03 0.01 -0.05
num_user_for_reviews 0.57 0.36 0.24
budget 0.49 0.30 0.09
title_year 0.42 -0.11 -0.06
actor_2_facebook_likes 0.28 0.15 0.12
imdb_score 0.36 0.38 0.22
aspect_ratio 0.18 0.16 0.05
movie_facebook_likes 0.71 0.25 0.17
actor_3_facebook_likes actor_1_facebook_likes gross
num_critic_for_reviews 0.28 0.17 0.48
duration 0.14 0.09 0.28
director_facebook_likes 0.12 0.09 0.14
actor_3_facebook_likes 1.00 0.25 0.30
actor_1_facebook_likes 0.25 1.00 0.13
gross 0.30 0.13 1.00
num_voted_users 0.28 0.17 0.64
cast_total_facebook_likes 0.48 0.95 0.22
facenumber_in_poster 0.10 0.05 -0.04
num_user_for_reviews 0.22 0.12 0.55
budget 0.27 0.15 0.64
title_year 0.13 0.09 0.06
actor_2_facebook_likes 0.55 0.38 0.25
imdb_score 0.09 0.12 0.27
aspect_ratio 0.05 0.05 0.07
movie_facebook_likes 0.31 0.12 0.38
num_voted_users cast_total_facebook_likes facenumber_in_poster
num_critic_for_reviews 0.60 0.25 -0.03
duration 0.37 0.13 0.01
director_facebook_likes 0.32 0.12 -0.05
actor_3_facebook_likes 0.28 0.48 0.10
actor_1_facebook_likes 0.17 0.95 0.05
gross 0.64 0.22 -0.04
num_voted_users 1.00 0.25 -0.04
cast_total_facebook_likes 0.25 1.00 0.07
facenumber_in_poster -0.04 0.07 1.00
num_user_for_reviews 0.78 0.18 -0.09
budget 0.40 0.23 -0.03
title_year 0.03 0.13 0.08
actor_2_facebook_likes 0.25 0.63 0.07
imdb_score 0.51 0.14 -0.07
aspect_ratio 0.09 0.07 0.01
movie_facebook_likes 0.52 0.21 0.01
num_user_for_reviews budget title_year actor_2_facebook_likes
num_critic_for_reviews 0.57 0.49 0.42 0.28
duration 0.36 0.30 -0.11 0.15
director_facebook_likes 0.24 0.09 -0.06 0.12
actor_3_facebook_likes 0.22 0.27 0.13 0.55
actor_1_facebook_likes 0.12 0.15 0.09 0.38
gross 0.55 0.64 0.06 0.25
num_voted_users 0.78 0.40 0.03 0.25
cast_total_facebook_likes 0.18 0.23 0.13 0.63
facenumber_in_poster -0.09 -0.03 0.08 0.07
num_user_for_reviews 1.00 0.40 0.03 0.20
budget 0.40 1.00 0.25 0.25
title_year 0.03 0.25 1.00 0.13
actor_2_facebook_likes 0.20 0.25 0.13 1.00
imdb_score 0.35 0.07 -0.14 0.13
aspect_ratio 0.10 0.18 0.22 0.07
movie_facebook_likes 0.39 0.33 0.31 0.25
imdb_score aspect_ratio movie_facebook_likes
num_critic_for_reviews 0.36 0.18 0.71
duration 0.38 0.16 0.25
director_facebook_likes 0.22 0.05 0.17
actor_3_facebook_likes 0.09 0.05 0.31
actor_1_facebook_likes 0.12 0.05 0.12
gross 0.27 0.07 0.38
num_voted_users 0.51 0.09 0.52
cast_total_facebook_likes 0.14 0.07 0.21
facenumber_in_poster -0.07 0.01 0.01
num_user_for_reviews 0.35 0.10 0.39
budget 0.07 0.18 0.33
title_year -0.14 0.22 0.31
actor_2_facebook_likes 0.13 0.07 0.25
imdb_score 1.00 0.04 0.29
aspect_ratio 0.04 1.00 0.11
movie_facebook_likes 0.29 0.11 1.00
Sample Size
[1] 3005
Probability values (Entries above the diagonal are adjusted for multiple tests.)
num_critic_for_reviews duration director_facebook_likes
num_critic_for_reviews 0.00 0.00 0.00
duration 0.00 0.00 0.00
director_facebook_likes 0.00 0.00 0.00
actor_3_facebook_likes 0.00 0.00 0.00
actor_1_facebook_likes 0.00 0.00 0.00
gross 0.00 0.00 0.00
num_voted_users 0.00 0.00 0.00
cast_total_facebook_likes 0.00 0.00 0.00
facenumber_in_poster 0.09 0.66 0.00
num_user_for_reviews 0.00 0.00 0.00
budget 0.00 0.00 0.00
title_year 0.00 0.00 0.00
actor_2_facebook_likes 0.00 0.00 0.00
imdb_score 0.00 0.00 0.00
aspect_ratio 0.00 0.00 0.01
movie_facebook_likes 0.00 0.00 0.00
actor_3_facebook_likes actor_1_facebook_likes gross
num_critic_for_reviews 0.00 0.00 0.00
duration 0.00 0.00 0.00
director_facebook_likes 0.00 0.00 0.00
actor_3_facebook_likes 0.00 0.00 0.00
actor_1_facebook_likes 0.00 0.00 0.00
gross 0.00 0.00 0.00
num_voted_users 0.00 0.00 0.00
cast_total_facebook_likes 0.00 0.00 0.00
facenumber_in_poster 0.00 0.01 0.05
num_user_for_reviews 0.00 0.00 0.00
budget 0.00 0.00 0.00
title_year 0.00 0.00 0.00
actor_2_facebook_likes 0.00 0.00 0.00
imdb_score 0.00 0.00 0.00
aspect_ratio 0.01 0.00 0.00
movie_facebook_likes 0.00 0.00 0.00
num_voted_users cast_total_facebook_likes facenumber_in_poster
num_critic_for_reviews 0.00 0 0.65
duration 0.00 0 1.00
director_facebook_likes 0.00 0 0.06
actor_3_facebook_likes 0.00 0 0.00
actor_1_facebook_likes 0.00 0 0.07
gross 0.00 0 0.37
num_voted_users 0.00 0 0.17
cast_total_facebook_likes 0.00 0 0.00
facenumber_in_poster 0.02 0 0.00
num_user_for_reviews 0.00 0 0.00
budget 0.00 0 0.14
title_year 0.10 0 0.00
actor_2_facebook_likes 0.00 0 0.00
imdb_score 0.00 0 0.00
aspect_ratio 0.00 0 0.55
movie_facebook_likes 0.00 0 0.50
num_user_for_reviews budget title_year actor_2_facebook_likes
num_critic_for_reviews 0.00 0.00 0.00 0.00
duration 0.00 0.00 0.00 0.00
director_facebook_likes 0.00 0.00 0.04 0.00
actor_3_facebook_likes 0.00 0.00 0.00 0.00
actor_1_facebook_likes 0.00 0.00 0.00 0.00
gross 0.00 0.00 0.04 0.00
num_voted_users 0.00 0.00 0.65 0.00
cast_total_facebook_likes 0.00 0.00 0.00 0.00
facenumber_in_poster 0.00 0.65 0.00 0.01
num_user_for_reviews 0.00 0.00 0.65 0.00
budget 0.00 0.00 0.00 0.00
title_year 0.12 0.00 0.00 0.00
actor_2_facebook_likes 0.00 0.00 0.00 0.00
imdb_score 0.00 0.00 0.00 0.00
aspect_ratio 0.00 0.00 0.00 0.00
movie_facebook_likes 0.00 0.00 0.00 0.00
imdb_score aspect_ratio movie_facebook_likes
num_critic_for_reviews 0.00 0.00 0
duration 0.00 0.00 0
director_facebook_likes 0.00 0.10 0
actor_3_facebook_likes 0.00 0.07 0
actor_1_facebook_likes 0.00 0.05 0
gross 0.00 0.00 0
num_voted_users 0.00 0.00 0
cast_total_facebook_likes 0.00 0.00 0
facenumber_in_poster 0.00 1.00 1
num_user_for_reviews 0.00 0.00 0
budget 0.00 0.00 0
title_year 0.00 0.00 0
actor_2_facebook_likes 0.00 0.00 0
imdb_score 0.00 0.34 0
aspect_ratio 0.04 0.00 0
movie_facebook_likes 0.00 0.00 0
To see confidence intervals of the correlations, print with the short=FALSE option
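The probability table above reports Holm-adjusted values above the diagonal. As a minimal sketch of what that adjustment does, base R's p.adjust() can be applied to a few made-up p-values. The vector p_raw is hypothetical and not taken from the output above.

```r
# Hypothetical raw p-values from four pairwise correlation tests
p_raw <- c(0.001, 0.012, 0.030, 0.200)

# Holm: the smallest p-value is multiplied by the number of tests,
# the next smallest by one less, and so on, keeping the results non-decreasing
p_holm <- p.adjust(p_raw, method = "holm")

print(p_holm)
```

With 16 numeric variables there are 120 pairwise tests, so a raw p-value has to be quite small to remain below 0.05 after the adjustment.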
# Boxplots for significant categorical predictors
Boxplot(movie$imdb_score,movie$color)
Black and white movies seem to have a higher median rating and slightly higher scores overall. Color movies have many outliers.
Boxplot for genre:
fill <- "Blue"
line <- "Red"
ggplot(movie, aes(x = genres, y =imdb_score)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "IMDB Score",
breaks = seq(0, 11, 0.5),
limits=c(0, 11)) +
scale_x_discrete(name = "Genres") +
ggtitle("Boxplot of IMDB Score and Genres")
From the boxplot of genres, “Documentary” has the highest median score, and “Thriller” has the lowest median. However, the latter is partly because there is only 1 observation for thriller movies in our data set.
summary(movie$genres)
Action Adventure Animation Biography Comedy Crime Documentary
751 291 36 137 853 204 25
Drama Family Fantasy Film-Noir Game-Show History Horror
506 3 31 0 0 0 138
Music Musical Mystery Romance Sci-Fi Thriller Western
0 2 16 2 7 1 2
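Because several genres have only one or two movies, it helps to look at group medians and group sizes together. Below is a minimal sketch on made-up data. The data frame toy, the table genre_summary and the threshold min_n are illustrative assumptions, not part of the original analysis.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  genres = factor(c("Action", "Action", "Action", "Drama", "Drama", "Thriller")),
  imdb_score = c(6.0, 6.5, 7.0, 7.2, 7.8, 4.8)
)

# Count and median score per genre, in the order of the factor levels
genre_summary <- data.frame(
  n = as.vector(table(toy$genres)),
  median = as.vector(tapply(toy$imdb_score, toy$genres, median)),
  row.names = levels(toy$genres)
)

# Keep only genres with enough observations to compare medians
min_n <- 2   # illustrative threshold
reliable <- genre_summary[genre_summary$n >= min_n, ]

print(genre_summary)
print(reliable)
```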
library(ggplot2)
fill <- "Blue"
line <- "Red"
ggplot(movie, aes(x = as.factor(title_year), y =imdb_score)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "IMDB Score",
breaks = seq(1.5, 10, 0.5),
limits=c(1.5, 10)) +
scale_x_discrete(name = "title_year") +
ggtitle("Boxplot of IMDB Score and Title Year")
The median IMDB score appears to differ from year to year, so let’s try treating title_year as categorical.
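Treating a year as categorical changes what a linear model estimates: instead of a single slope, every year beyond the first gets its own coefficient. The sketch below illustrates this on simulated data. The data frame toy and both fits are assumptions for illustration only.

```r
# Simulated data: five years with six movies each (illustrative only)
set.seed(1)
toy <- data.frame(title_year = rep(2000:2004, each = 6))
toy$imdb_score <- 6 + rnorm(nrow(toy))

# Numeric year: intercept plus one slope
fit_numeric <- lm(imdb_score ~ title_year, data = toy)

# Categorical year: intercept plus one coefficient per additional year
fit_factor <- lm(imdb_score ~ factor(title_year), data = toy)

print(length(coef(fit_numeric)))
print(length(coef(fit_factor)))
```

With title years ranging from 1920 to 2016, the categorical version would add a large number of coefficients, some of them estimated from very few movies. The full model further down enters title_year as a single numeric term.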
# Scatter plot matrix for the numerical variables with significant correlations
scatterplotMatrix(~imdb_score+num_voted_users+num_critic_for_reviews+num_user_for_reviews+duration+facenumber_in_poster+gross+movie_facebook_likes+director_facebook_likes+cast_total_facebook_likes+budget, data=movie)
movie.sig<-movie[,c('imdb_score','num_voted_users','num_critic_for_reviews','num_user_for_reviews','duration','facenumber_in_poster','gross','movie_facebook_likes','director_facebook_likes','cast_total_facebook_likes','budget','title_year','genres')]
Use the step function to run forward selection based on the AIC criterion:
null=lm(movie.sig$imdb_score~1) # set null model
summary(null)
Call:
lm(formula = movie.sig$imdb_score ~ 1)
Residuals:
Min 1Q Median 3Q Max
-4.7873 -0.5873 0.1127 0.7127 2.9127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3873 0.0192 332.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.053 on 3004 degrees of freedom
full1=lm(movie.sig$imdb_score~movie.sig$num_voted_users+movie.sig$num_critic_for_reviews+movie.sig$num_user_for_reviews+movie.sig$duration+movie.sig$facenumber_in_poster+movie.sig$gross+movie.sig$movie_facebook_likes+movie.sig$director_facebook_likes+movie.sig$cast_total_facebook_likes+movie.sig$budget+movie.sig$title_year+factor(movie.sig$genres))
summary(full1)
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
movie.sig$num_critic_for_reviews + movie.sig$num_user_for_reviews +
movie.sig$duration + movie.sig$facenumber_in_poster + movie.sig$gross +
movie.sig$movie_facebook_likes + movie.sig$director_facebook_likes +
movie.sig$cast_total_facebook_likes + movie.sig$budget +
movie.sig$title_year + factor(movie.sig$genres))
Residuals:
Min 1Q Median 3Q Max
-4.9157 -0.3693 0.0835 0.4993 2.0350
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.413e+01 3.604e+00 15.019 < 2e-16 ***
movie.sig$num_voted_users 3.158e-06 1.757e-07 17.969 < 2e-16 ***
movie.sig$num_critic_for_reviews 3.333e-03 2.119e-04 15.727 < 2e-16 ***
movie.sig$num_user_for_reviews -4.887e-04 5.976e-05 -8.177 4.26e-16 ***
movie.sig$duration 8.491e-03 7.848e-04 10.820 < 2e-16 ***
movie.sig$facenumber_in_poster -1.750e-02 6.947e-03 -2.519 0.01182 *
movie.sig$gross 2.247e-10 3.096e-10 0.726 0.46808
movie.sig$movie_facebook_likes -4.007e-06 9.702e-07 -4.131 3.72e-05 ***
movie.sig$director_facebook_likes 2.832e-07 4.562e-06 0.062 0.95051
movie.sig$cast_total_facebook_likes 1.110e-06 7.323e-07 1.516 0.12975
movie.sig$budget -4.486e-09 5.125e-10 -8.753 < 2e-16 ***
movie.sig$title_year -2.467e-02 1.797e-03 -13.727 < 2e-16 ***
factor(movie.sig$genres)Adventure 3.458e-01 5.448e-02 6.347 2.53e-10 ***
factor(movie.sig$genres)Animation 6.621e-01 1.345e-01 4.924 8.93e-07 ***
factor(movie.sig$genres)Biography 6.557e-01 7.661e-02 8.558 < 2e-16 ***
factor(movie.sig$genres)Comedy 1.532e-01 4.361e-02 3.513 0.00045 ***
factor(movie.sig$genres)Crime 4.551e-01 6.464e-02 7.040 2.37e-12 ***
factor(movie.sig$genres)Documentary 9.270e-01 1.608e-01 5.765 8.98e-09 ***
factor(movie.sig$genres)Drama 5.326e-01 4.904e-02 10.861 < 2e-16 ***
factor(movie.sig$genres)Family 2.201e-01 4.521e-01 0.487 0.62639
factor(movie.sig$genres)Fantasy -1.629e-01 1.448e-01 -1.125 0.26068
factor(movie.sig$genres)Horror -3.858e-01 7.777e-02 -4.961 7.41e-07 ***
factor(movie.sig$genres)Musical -4.133e-01 5.573e-01 -0.742 0.45839
factor(movie.sig$genres)Mystery 1.968e-01 1.979e-01 0.995 0.32005
factor(movie.sig$genres)Romance 5.466e-01 5.506e-01 0.993 0.32095
factor(movie.sig$genres)Sci-Fi 2.551e-01 2.960e-01 0.862 0.38870
factor(movie.sig$genres)Thriller -4.301e-01 7.786e-01 -0.552 0.58077
factor(movie.sig$genres)Western -1.037e-01 5.521e-01 -0.188 0.85101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7768 on 2977 degrees of freedom
Multiple R-squared: 0.4604, Adjusted R-squared: 0.4555
F-statistic: 94.07 on 27 and 2977 DF, p-value: < 2.2e-16
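The forward selection that follows compares models by AIC. For a linear model, the value that step() reports is n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. For the intercept-only model with n = 3005 and RSS of about 3329.1, this gives roughly 309.8, in line with the starting AIC in the output. A minimal sketch on simulated data checks the formula against extractAIC(). The variables x, y and fit are made up for illustration.

```r
# Simulated data for a small linear model (illustrative only)
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# AIC as used by step(): n * log(RSS / n) + 2 * (number of coefficients)
rss <- sum(residuals(fit)^2)
edf <- length(coef(fit))
aic_manual <- n * log(rss / n) + 2 * edf

# extractAIC() returns c(edf, AIC); step() ranks models by the second element
aic_step <- extractAIC(fit)[2]

print(c(manual = aic_manual, extractAIC = aic_step))
```

This differs from the value returned by AIC() by a constant that depends only on n, so the ranking of models fitted to the same data is unaffected.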
step(null,scope = list(lower=null,upper=full1),direction = 'forward')
Start: AIC=309.81
movie.sig$imdb_score ~ 1
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users 1 871.90 2457.2 -600.74
+ movie.sig$duration 1 491.13 2838.0 -167.82
+ movie.sig$num_critic_for_reviews 1 428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews 1 407.62 2921.5 -80.68
+ factor(movie.sig$genres) 16 331.02 2998.1 27.10
+ movie.sig$movie_facebook_likes 1 282.82 3046.3 45.02
+ movie.sig$gross 1 242.62 3086.5 84.42
+ movie.sig$director_facebook_likes 1 166.17 3163.0 157.95
+ movie.sig$title_year 1 69.27 3259.9 248.63
+ movie.sig$cast_total_facebook_likes 1 64.28 3264.8 253.22
+ movie.sig$budget 1 16.26 3312.9 297.09
+ movie.sig$facenumber_in_poster 1 15.14 3314.0 298.11
<none> 3329.1 309.81
Step: AIC=-600.74
movie.sig$imdb_score ~ movie.sig$num_voted_users
Df Sum of Sq RSS AIC
+ factor(movie.sig$genres) 16 311.531 2145.7 -976.12
+ movie.sig$duration 1 147.786 2309.4 -785.13
+ movie.sig$title_year 1 84.649 2372.6 -704.08
+ movie.sig$budget 1 73.211 2384.0 -689.63
+ movie.sig$num_user_for_reviews 1 21.297 2435.9 -624.90
+ movie.sig$gross 1 16.929 2440.3 -619.51
+ movie.sig$num_critic_for_reviews 1 14.632 2442.6 -616.69
+ movie.sig$director_facebook_likes 1 13.657 2443.6 -615.49
+ movie.sig$facenumber_in_poster 1 6.789 2450.4 -607.05
+ movie.sig$movie_facebook_likes 1 2.627 2454.6 -601.95
<none> 2457.2 -600.74
+ movie.sig$cast_total_facebook_likes 1 0.524 2456.7 -599.38
Step: AIC=-976.12
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres)
Df Sum of Sq RSS AIC
+ movie.sig$title_year 1 79.623 2066.1 -1087.75
+ movie.sig$duration 1 74.584 2071.1 -1080.44
+ movie.sig$budget 1 28.689 2117.0 -1014.57
+ movie.sig$num_critic_for_reviews 1 23.116 2122.6 -1006.67
+ movie.sig$num_user_for_reviews 1 12.251 2133.4 -991.33
+ movie.sig$director_facebook_likes 1 3.707 2142.0 -979.32
+ movie.sig$facenumber_in_poster 1 3.274 2142.4 -978.71
+ movie.sig$movie_facebook_likes 1 1.686 2144.0 -976.49
<none> 2145.7 -976.12
+ movie.sig$gross 1 1.391 2144.3 -976.07
+ movie.sig$cast_total_facebook_likes 1 0.362 2145.3 -974.63
Step: AIC=-1087.75
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year
Df Sum of Sq RSS AIC
+ movie.sig$num_critic_for_reviews 1 125.091 1941.0 -1273.4
+ movie.sig$duration 1 55.857 2010.2 -1168.1
+ movie.sig$movie_facebook_likes 1 21.746 2044.3 -1117.5
+ movie.sig$num_user_for_reviews 1 11.741 2054.3 -1102.9
+ movie.sig$budget 1 9.196 2056.9 -1099.2
+ movie.sig$cast_total_facebook_likes 1 2.923 2063.2 -1090.0
+ movie.sig$director_facebook_likes 1 1.740 2064.3 -1088.3
<none> 2066.1 -1087.8
+ movie.sig$facenumber_in_poster 1 1.084 2065.0 -1087.3
+ movie.sig$gross 1 0.638 2065.4 -1086.7
Step: AIC=-1273.43
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$budget 1 36.627 1904.4 -1328.7
+ movie.sig$num_user_for_reviews 1 35.326 1905.7 -1326.6
+ movie.sig$duration 1 34.873 1906.1 -1325.9
+ movie.sig$gross 1 7.359 1933.6 -1282.8
+ movie.sig$movie_facebook_likes 1 1.397 1939.6 -1273.6
<none> 1941.0 -1273.4
+ movie.sig$facenumber_in_poster 1 0.926 1940.1 -1272.9
+ movie.sig$director_facebook_likes 1 0.644 1940.3 -1272.4
+ movie.sig$cast_total_facebook_likes 1 0.572 1940.4 -1272.3
Step: AIC=-1328.68
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget
Df Sum of Sq RSS AIC
+ movie.sig$duration 1 58.373 1846.0 -1420.2
+ movie.sig$num_user_for_reviews 1 27.052 1877.3 -1369.7
+ movie.sig$movie_facebook_likes 1 2.576 1901.8 -1330.8
+ movie.sig$cast_total_facebook_likes 1 2.005 1902.3 -1329.8
<none> 1904.4 -1328.7
+ movie.sig$facenumber_in_poster 1 1.071 1903.3 -1328.4
+ movie.sig$director_facebook_likes 1 0.557 1903.8 -1327.6
+ movie.sig$gross 1 0.074 1904.3 -1326.8
Step: AIC=-1420.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_user_for_reviews 1 33.825 1812.2 -1473.8
+ movie.sig$movie_facebook_likes 1 4.702 1841.3 -1425.9
+ movie.sig$facenumber_in_poster 1 2.488 1843.5 -1422.3
+ movie.sig$cast_total_facebook_likes 1 1.601 1844.4 -1420.8
<none> 1846.0 -1420.2
+ movie.sig$gross 1 0.196 1845.8 -1418.5
+ movie.sig$director_facebook_likes 1 0.043 1845.9 -1418.3
Step: AIC=-1473.81
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$movie_facebook_likes 1 10.4792 1801.7 -1489.2
+ movie.sig$facenumber_in_poster 1 3.7911 1808.4 -1478.1
<none> 1812.2 -1473.8
+ movie.sig$cast_total_facebook_likes 1 0.9926 1811.2 -1473.5
+ movie.sig$gross 1 0.3569 1811.8 -1472.4
+ movie.sig$director_facebook_likes 1 0.0128 1812.2 -1471.8
Step: AIC=-1489.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes
Df Sum of Sq RSS AIC
+ movie.sig$facenumber_in_poster 1 3.5218 1798.2 -1493.1
<none> 1801.7 -1489.2
+ movie.sig$cast_total_facebook_likes 1 1.0918 1800.6 -1489.0
+ movie.sig$gross 1 0.3413 1801.3 -1487.8
+ movie.sig$director_facebook_likes 1 0.0167 1801.7 -1487.3
Step: AIC=-1493.11
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster
Df Sum of Sq RSS AIC
+ movie.sig$cast_total_facebook_likes 1 1.41883 1796.7 -1493.5
<none> 1798.2 -1493.1
+ movie.sig$gross 1 0.33944 1797.8 -1491.7
+ movie.sig$director_facebook_likes 1 0.00320 1798.2 -1491.1
Step: AIC=-1493.48
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster +
movie.sig$cast_total_facebook_likes
Df Sum of Sq RSS AIC
<none> 1796.7 -1493.5
+ movie.sig$gross 1 0.31546 1796.4 -1492.0
+ movie.sig$director_facebook_likes 1 0.00000 1796.7 -1491.5
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
factor(movie.sig$genres) + movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$movie_facebook_likes + movie.sig$facenumber_in_poster +
movie.sig$cast_total_facebook_likes)
Coefficients:
(Intercept) movie.sig$num_voted_users
5.446e+01 3.203e-06
factor(movie.sig$genres)Adventure factor(movie.sig$genres)Animation
3.495e-01 6.687e-01
factor(movie.sig$genres)Biography factor(movie.sig$genres)Comedy
6.564e-01 1.558e-01
factor(movie.sig$genres)Crime factor(movie.sig$genres)Documentary
4.522e-01 9.302e-01
factor(movie.sig$genres)Drama factor(movie.sig$genres)Family
5.326e-01 2.466e-01
factor(movie.sig$genres)Fantasy factor(movie.sig$genres)Horror
-1.616e-01 -3.839e-01
factor(movie.sig$genres)Musical factor(movie.sig$genres)Mystery
-4.044e-01 1.950e-01
factor(movie.sig$genres)Romance factor(movie.sig$genres)Sci-Fi
5.455e-01 2.483e-01
factor(movie.sig$genres)Thriller factor(movie.sig$genres)Western
-4.271e-01 -9.845e-02
movie.sig$title_year movie.sig$num_critic_for_reviews
-2.483e-02 3.339e-03
movie.sig$budget movie.sig$duration
-4.311e-09 8.481e-03
movie.sig$num_user_for_reviews movie.sig$movie_facebook_likes
-4.876e-04 -4.010e-06
movie.sig$facenumber_in_poster movie.sig$cast_total_facebook_likes
-1.753e-02 1.121e-06
full2 <- lm(movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) +
              poly(movie.sig$num_critic_for_reviews, 2) +
              poly(movie.sig$num_user_for_reviews, 2) +
              poly(movie.sig$duration, 2) +
              movie.sig$facenumber_in_poster +
              poly(movie.sig$gross, 2) +
              poly(movie.sig$movie_facebook_likes, 2) +
              movie.sig$director_facebook_likes +
              movie.sig$cast_total_facebook_likes +
              movie.sig$budget + movie.sig$title_year + movie.sig$genres +
              movie.sig$facenumber_in_poster * movie.sig$num_critic_for_reviews +
              movie.sig$num_user_for_reviews * movie.sig$num_voted_users +
              movie.sig$num_voted_users * movie.sig$gross +
              movie.sig$gross * movie.sig$budget)
summary(full2)
Call:
lm(formula = movie.sig$imdb_score ~ poly(movie.sig$num_voted_users,
2) + poly(movie.sig$num_critic_for_reviews, 2) + poly(movie.sig$num_user_for_reviews,
2) + poly(movie.sig$duration, 2) + movie.sig$facenumber_in_poster +
poly(movie.sig$gross, 2) + poly(movie.sig$movie_facebook_likes,
2) + movie.sig$director_facebook_likes + movie.sig$cast_total_facebook_likes +
movie.sig$budget + movie.sig$title_year + movie.sig$genres +
movie.sig$facenumber_in_poster * movie.sig$num_critic_for_reviews +
movie.sig$num_user_for_reviews * movie.sig$num_voted_users +
movie.sig$num_voted_users * movie.sig$gross + movie.sig$gross *
movie.sig$budget)
Residuals:
Min 1Q Median 3Q Max
-5.3608 -0.3549 0.0642 0.4619 2.1792
Coefficients: (4 not defined because of singularities)
Estimate Std. Error
(Intercept) 5.948e+01 3.617e+00
poly(movie.sig$num_voted_users, 2)1 2.305e+01 3.426e+00
poly(movie.sig$num_voted_users, 2)2 -1.873e+01 2.200e+00
poly(movie.sig$num_critic_for_reviews, 2)1 1.393e+01 1.661e+00
poly(movie.sig$num_critic_for_reviews, 2)2 -9.490e+00 1.004e+00
poly(movie.sig$num_user_for_reviews, 2)1 -1.760e+01 2.325e+00
poly(movie.sig$num_user_for_reviews, 2)2 4.166e+00 1.593e+00
poly(movie.sig$duration, 2)1 1.087e+01 9.246e-01
poly(movie.sig$duration, 2)2 -3.883e+00 7.809e-01
movie.sig$facenumber_in_poster -2.093e-02 1.106e-02
poly(movie.sig$gross, 2)1 -1.454e+01 2.418e+00
poly(movie.sig$gross, 2)2 -5.285e+00 1.483e+00
poly(movie.sig$movie_facebook_likes, 2)1 2.580e+00 1.322e+00
poly(movie.sig$movie_facebook_likes, 2)2 2.283e-01 8.238e-01
movie.sig$director_facebook_likes 4.608e-06 4.409e-06
movie.sig$cast_total_facebook_likes 2.533e-07 7.013e-07
movie.sig$budget -7.852e-09 7.213e-10
movie.sig$title_year -2.656e-02 1.809e-03
movie.sig$genresAdventure 3.727e-01 5.268e-02
movie.sig$genresAnimation 7.564e-01 1.298e-01
movie.sig$genresBiography 6.264e-01 7.351e-02
movie.sig$genresComedy 1.576e-01 4.205e-02
movie.sig$genresCrime 4.558e-01 6.236e-02
movie.sig$genresDocumentary 9.738e-01 1.542e-01
movie.sig$genresDrama 5.230e-01 4.726e-02
movie.sig$genresFamily 5.958e-01 4.362e-01
movie.sig$genresFantasy -1.891e-01 1.387e-01
movie.sig$genresHorror -3.533e-01 7.597e-02
movie.sig$genresMusical -4.744e-01 5.328e-01
movie.sig$genresMystery 1.947e-01 1.891e-01
movie.sig$genresRomance 6.094e-01 5.254e-01
movie.sig$genresSci-Fi 1.471e-01 2.827e-01
movie.sig$genresThriller -3.085e-01 7.433e-01
movie.sig$genresWestern -4.204e-02 5.272e-01
movie.sig$num_critic_for_reviews NA NA
movie.sig$num_user_for_reviews NA NA
movie.sig$num_voted_users NA NA
movie.sig$gross NA NA
movie.sig$facenumber_in_poster:movie.sig$num_critic_for_reviews -1.291e-06 4.260e-05
movie.sig$num_user_for_reviews:movie.sig$num_voted_users 7.966e-10 2.817e-10
movie.sig$num_voted_users:movie.sig$gross 1.498e-15 1.105e-15
movie.sig$budget:movie.sig$gross 2.946e-17 4.104e-18
t value Pr(>|t|)
(Intercept) 16.446 < 2e-16 ***
poly(movie.sig$num_voted_users, 2)1 6.727 2.07e-11 ***
poly(movie.sig$num_voted_users, 2)2 -8.514 < 2e-16 ***
poly(movie.sig$num_critic_for_reviews, 2)1 8.388 < 2e-16 ***
poly(movie.sig$num_critic_for_reviews, 2)2 -9.452 < 2e-16 ***
poly(movie.sig$num_user_for_reviews, 2)1 -7.568 5.01e-14 ***
poly(movie.sig$num_user_for_reviews, 2)2 2.615 0.008973 **
poly(movie.sig$duration, 2)1 11.755 < 2e-16 ***
poly(movie.sig$duration, 2)2 -4.973 6.98e-07 ***
movie.sig$facenumber_in_poster -1.892 0.058589 .
poly(movie.sig$gross, 2)1 -6.012 2.05e-09 ***
poly(movie.sig$gross, 2)2 -3.565 0.000370 ***
poly(movie.sig$movie_facebook_likes, 2)1 1.952 0.051079 .
poly(movie.sig$movie_facebook_likes, 2)2 0.277 0.781673
movie.sig$director_facebook_likes 1.045 0.296005
movie.sig$cast_total_facebook_likes 0.361 0.717999
movie.sig$budget -10.886 < 2e-16 ***
movie.sig$title_year -14.680 < 2e-16 ***
movie.sig$genresAdventure 7.075 1.86e-12 ***
movie.sig$genresAnimation 5.828 6.23e-09 ***
movie.sig$genresBiography 8.522 < 2e-16 ***
movie.sig$genresComedy 3.747 0.000183 ***
movie.sig$genresCrime 7.309 3.45e-13 ***
movie.sig$genresDocumentary 6.317 3.06e-10 ***
movie.sig$genresDrama 11.067 < 2e-16 ***
movie.sig$genresFamily 1.366 0.172120
movie.sig$genresFantasy -1.364 0.172721
movie.sig$genresHorror -4.650 3.46e-06 ***
movie.sig$genresMusical -0.890 0.373334
movie.sig$genresMystery 1.029 0.303477
movie.sig$genresRomance 1.160 0.246252
movie.sig$genresSci-Fi 0.520 0.602769
movie.sig$genresThriller -0.415 0.678129
movie.sig$genresWestern -0.080 0.936448
movie.sig$num_critic_for_reviews NA NA
movie.sig$num_user_for_reviews NA NA
movie.sig$num_voted_users NA NA
movie.sig$gross NA NA
movie.sig$facenumber_in_poster:movie.sig$num_critic_for_reviews -0.030 0.975820
movie.sig$num_user_for_reviews:movie.sig$num_voted_users 2.828 0.004714 **
movie.sig$num_voted_users:movie.sig$gross 1.355 0.175492
movie.sig$budget:movie.sig$gross 7.180 8.80e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.741 on 2967 degrees of freedom
Multiple R-squared: 0.5107, Adjusted R-squared: 0.5046
F-statistic: 83.69 on 37 and 2967 DF, p-value: < 2.2e-16
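The note "4 not defined because of singularities" in the summary above is consistent with how R expands `a * b` into `a + b + a:b`: the plain main effect is added next to a `poly()` term for the same variable, and the two are linearly dependent. The sketch below reproduces that behaviour on simulated data. The names `d`, `x`, `z` and `y` are hypothetical and are not columns of `movie.sig`; using `:` instead of `*` is one possible way to avoid the aliased columns, not necessarily the choice the analysis should make.

```r
# Simulated data; x, z and y are hypothetical names, not columns of movie.sig.
set.seed(1)
d <- data.frame(x = rnorm(200), z = rnorm(200))
d$y <- 1 + 2 * d$x - d$x^2 + 0.5 * d$x * d$z + rnorm(200)

# x * z expands to x + z + x:z, so the main effect x is added on top of
# poly(x, 2), which already spans the linear term in x.
fit_star <- lm(y ~ poly(x, 2) + z + x * z, data = d)

# x:z adds only the interaction column, so nothing is duplicated.
fit_colon <- lm(y ~ poly(x, 2) + z + x:z, data = d)

# Count coefficients that lm() could not estimate (reported as NA).
n_aliased_star  <- sum(is.na(coef(fit_star)))
n_aliased_colon <- sum(is.na(coef(fit_colon)))
```

In this sketch the `*` version leaves the coefficient for `x` as `NA`, while the `:` version estimates every term.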
step(null,scope=list(lower=null,upper=full2),direction='forward')
Start: AIC=309.81
movie.sig$imdb_score ~ 1
Df Sum of Sq RSS AIC
+ poly(movie.sig$num_voted_users, 2) 2 976.96 2352.2 -730.05
+ movie.sig$num_voted_users 1 871.90 2457.2 -600.74
+ poly(movie.sig$duration, 2) 2 536.11 2793.0 -213.83
+ poly(movie.sig$num_user_for_reviews, 2) 2 483.99 2845.1 -158.27
+ poly(movie.sig$num_critic_for_reviews, 2) 2 436.49 2892.6 -108.52
+ movie.sig$num_critic_for_reviews 1 428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews 1 407.62 2921.5 -80.68
+ poly(movie.sig$movie_facebook_likes, 2) 2 317.80 3011.3 12.32
+ movie.sig$genres 16 331.02 2998.1 27.10
+ poly(movie.sig$gross, 2) 2 251.27 3077.9 77.99
+ movie.sig$gross 1 242.62 3086.5 84.42
+ movie.sig$director_facebook_likes 1 166.17 3163.0 157.95
+ movie.sig$title_year 1 69.27 3259.9 248.63
+ movie.sig$cast_total_facebook_likes 1 64.28 3264.8 253.22
+ movie.sig$budget 1 16.26 3312.9 297.09
+ movie.sig$facenumber_in_poster 1 15.14 3314.0 298.11
<none> 3329.1 309.81
Step: AIC=-730.05
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2)
Df Sum of Sq RSS AIC
+ movie.sig$genres 16 337.58 2014.6 -1163.60
+ poly(movie.sig$duration, 2) 2 137.87 2214.3 -907.55
+ movie.sig$budget 1 133.09 2219.1 -903.07
+ movie.sig$title_year 1 101.46 2250.7 -860.55
+ poly(movie.sig$gross, 2) 2 58.78 2293.4 -802.09
+ movie.sig$gross 1 54.53 2297.6 -798.53
+ poly(movie.sig$num_user_for_reviews, 2) 2 29.12 2323.1 -763.48
+ movie.sig$num_user_for_reviews 1 25.39 2326.8 -760.66
+ movie.sig$director_facebook_likes 1 17.94 2334.2 -751.05
+ movie.sig$facenumber_in_poster 1 6.62 2345.5 -736.52
+ poly(movie.sig$num_critic_for_reviews, 2) 2 5.36 2346.8 -732.90
<none> 2352.2 -730.05
+ movie.sig$num_critic_for_reviews 1 0.18 2352.0 -728.28
+ movie.sig$cast_total_facebook_likes 1 0.15 2352.0 -728.23
+ poly(movie.sig$movie_facebook_likes, 2) 2 1.29 2350.9 -727.70
Step: AIC=-1163.6
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres
Df Sum of Sq RSS AIC
+ movie.sig$title_year 1 97.775 1916.8 -1311.1
+ movie.sig$budget 1 65.238 1949.3 -1260.5
+ poly(movie.sig$duration, 2) 2 65.750 1948.8 -1259.3
+ movie.sig$gross 1 19.722 1994.9 -1191.2
+ poly(movie.sig$gross, 2) 2 20.698 1993.9 -1190.6
+ poly(movie.sig$num_user_for_reviews, 2) 2 20.024 1994.6 -1189.6
+ movie.sig$num_user_for_reviews 1 14.834 1999.8 -1183.8
+ poly(movie.sig$num_critic_for_reviews, 2) 2 9.375 2005.2 -1173.6
+ movie.sig$director_facebook_likes 1 6.114 2008.5 -1170.7
+ movie.sig$facenumber_in_poster 1 3.792 2010.8 -1167.3
<none> 2014.6 -1163.6
+ movie.sig$cast_total_facebook_likes 1 0.355 2014.2 -1162.1
+ movie.sig$num_critic_for_reviews 1 0.042 2014.5 -1161.7
+ poly(movie.sig$movie_facebook_likes, 2) 2 0.813 2013.8 -1160.8
Step: AIC=-1311.1
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year
Df Sum of Sq RSS AIC
+ poly(movie.sig$num_critic_for_reviews, 2) 2 73.976 1842.8 -1425.4
+ poly(movie.sig$duration, 2) 2 49.885 1866.9 -1386.3
+ movie.sig$num_critic_for_reviews 1 43.723 1873.1 -1378.4
+ movie.sig$budget 1 32.246 1884.6 -1360.1
+ poly(movie.sig$num_user_for_reviews, 2) 2 21.755 1895.0 -1341.4
+ poly(movie.sig$gross, 2) 2 19.623 1897.2 -1338.0
+ movie.sig$gross 1 17.879 1898.9 -1337.3
+ poly(movie.sig$movie_facebook_likes, 2) 2 18.788 1898.0 -1336.7
+ movie.sig$num_user_for_reviews 1 14.396 1902.4 -1331.8
+ movie.sig$director_facebook_likes 1 3.373 1913.4 -1314.4
<none> 1916.8 -1311.1
+ movie.sig$facenumber_in_poster 1 1.216 1915.6 -1311.0
+ movie.sig$cast_total_facebook_likes 1 0.300 1916.5 -1309.6
Step: AIC=-1425.37
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2)
Df Sum of Sq RSS AIC
+ poly(movie.sig$num_user_for_reviews, 2) 2 54.189 1788.6 -1511.1
+ movie.sig$budget 1 46.017 1796.8 -1499.4
+ poly(movie.sig$duration, 2) 2 38.533 1804.3 -1484.9
+ movie.sig$num_user_for_reviews 1 33.751 1809.1 -1478.9
+ poly(movie.sig$gross, 2) 2 20.602 1822.2 -1455.2
+ movie.sig$gross 1 16.630 1826.2 -1450.6
+ poly(movie.sig$movie_facebook_likes, 2) 2 8.227 1834.6 -1434.8
+ movie.sig$director_facebook_likes 1 2.296 1840.5 -1427.1
<none> 1842.8 -1425.4
+ movie.sig$facenumber_in_poster 1 0.831 1842.0 -1424.7
+ movie.sig$cast_total_facebook_likes 1 0.104 1842.7 -1423.5
Step: AIC=-1511.06
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2)
Df Sum of Sq RSS AIC
+ poly(movie.sig$duration, 2) 2 51.219 1737.4 -1594.4
+ movie.sig$budget 1 34.907 1753.7 -1568.3
+ poly(movie.sig$gross, 2) 2 20.882 1767.8 -1542.3
+ movie.sig$gross 1 16.727 1771.9 -1537.3
+ poly(movie.sig$movie_facebook_likes, 2) 2 3.910 1784.7 -1513.6
+ movie.sig$director_facebook_likes 1 2.540 1786.1 -1513.3
+ movie.sig$facenumber_in_poster 1 1.970 1786.7 -1512.4
<none> 1788.6 -1511.1
+ movie.sig$cast_total_facebook_likes 1 0.022 1788.6 -1509.1
Step: AIC=-1594.36
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2)
Df Sum of Sq RSS AIC
+ movie.sig$budget 1 62.211 1675.2 -1701.9
+ poly(movie.sig$gross, 2) 2 30.406 1707.0 -1643.4
+ movie.sig$gross 1 23.936 1713.5 -1634.0
+ movie.sig$facenumber_in_poster 1 4.139 1733.3 -1599.5
<none> 1737.4 -1594.4
+ movie.sig$director_facebook_likes 1 0.946 1736.5 -1594.0
+ poly(movie.sig$movie_facebook_likes, 2) 2 1.928 1735.5 -1593.7
+ movie.sig$cast_total_facebook_likes 1 0.064 1737.4 -1592.5
Step: AIC=-1701.94
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget
Df Sum of Sq RSS AIC
+ movie.sig$facenumber_in_poster 1 5.0599 1670.2 -1709.0
+ poly(movie.sig$gross, 2) 2 4.5359 1670.7 -1706.1
+ movie.sig$gross 1 1.8995 1673.3 -1703.3
<none> 1675.2 -1701.9
+ movie.sig$director_facebook_likes 1 0.6239 1674.6 -1701.1
+ movie.sig$cast_total_facebook_likes 1 0.2471 1675.0 -1700.4
+ poly(movie.sig$movie_facebook_likes, 2) 2 0.8695 1674.3 -1699.5
Step: AIC=-1709.03
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget + movie.sig$facenumber_in_poster
Df Sum of Sq RSS AIC
+ poly(movie.sig$gross, 2) 2 4.6247 1665.5 -1713.4
+ movie.sig$gross 1 1.9720 1668.2 -1710.6
<none> 1670.2 -1709.0
+ movie.sig$director_facebook_likes 1 0.4874 1669.7 -1707.9
+ movie.sig$cast_total_facebook_likes 1 0.4443 1669.7 -1707.8
+ poly(movie.sig$movie_facebook_likes, 2) 2 0.8414 1669.3 -1706.5
Step: AIC=-1713.36
movie.sig$imdb_score ~ poly(movie.sig$num_voted_users, 2) + movie.sig$genres +
movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget + movie.sig$facenumber_in_poster +
poly(movie.sig$gross, 2)
Df Sum of Sq RSS AIC
<none> 1665.5 -1713.4
+ movie.sig$director_facebook_likes 1 0.49076 1665.0 -1712.2
+ movie.sig$cast_total_facebook_likes 1 0.41310 1665.1 -1712.1
+ poly(movie.sig$movie_facebook_likes, 2) 2 1.10614 1664.4 -1711.4
Call:
lm(formula = movie.sig$imdb_score ~ poly(movie.sig$num_voted_users,
2) + movie.sig$genres + movie.sig$title_year + poly(movie.sig$num_critic_for_reviews,
2) + poly(movie.sig$num_user_for_reviews, 2) + poly(movie.sig$duration,
2) + movie.sig$budget + movie.sig$facenumber_in_poster +
poly(movie.sig$gross, 2))
Coefficients:
(Intercept) poly(movie.sig$num_voted_users, 2)1
5.851e+01 3.249e+01
poly(movie.sig$num_voted_users, 2)2 movie.sig$genresAdventure
-1.320e+01 3.770e-01
movie.sig$genresAnimation movie.sig$genresBiography
7.306e-01 6.559e-01
movie.sig$genresComedy movie.sig$genresCrime
1.875e-01 4.845e-01
movie.sig$genresDocumentary movie.sig$genresDrama
1.037e+00 5.524e-01
movie.sig$genresFamily movie.sig$genresFantasy
2.093e-01 -1.231e-01
movie.sig$genresHorror movie.sig$genresMusical
-2.986e-01 -4.597e-01
movie.sig$genresMystery movie.sig$genresRomance
2.304e-01 6.151e-01
movie.sig$genresSci-Fi movie.sig$genresThriller
1.706e-01 -2.631e-01
movie.sig$genresWestern movie.sig$title_year
5.056e-02 -2.605e-02
poly(movie.sig$num_critic_for_reviews, 2)1 poly(movie.sig$num_critic_for_reviews, 2)2
1.634e+01 -6.906e+00
poly(movie.sig$num_user_for_reviews, 2)1 poly(movie.sig$num_user_for_reviews, 2)2
-1.209e+01 7.641e+00
poly(movie.sig$duration, 2)1 poly(movie.sig$duration, 2)2
1.072e+01 -3.800e+00
movie.sig$budget movie.sig$facenumber_in_poster
-4.048e-09 -2.026e-02
poly(movie.sig$gross, 2)1 poly(movie.sig$gross, 2)2
-2.770e+00 1.851e+00
full3 <- lm(movie.sig$imdb_score ~ movie.sig$num_voted_users +
              movie.sig$num_critic_for_reviews +
              movie.sig$num_user_for_reviews +
              movie.sig$duration +
              movie.sig$facenumber_in_poster +
              movie.sig$gross +
              movie.sig$movie_facebook_likes +
              movie.sig$director_facebook_likes +
              movie.sig$cast_total_facebook_likes +
              movie.sig$budget + movie.sig$title_year +
              factor(movie.sig$genres) +
              movie.sig$duration * movie.sig$num_voted_users +
              movie.sig$num_voted_users * movie.sig$num_user_for_reviews +
              movie.sig$gross * movie.sig$budget,
            data = movie.sig)
summary(full3)
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
movie.sig$num_critic_for_reviews + movie.sig$num_user_for_reviews +
movie.sig$duration + movie.sig$facenumber_in_poster + movie.sig$gross +
movie.sig$movie_facebook_likes + movie.sig$director_facebook_likes +
movie.sig$cast_total_facebook_likes + movie.sig$budget +
movie.sig$title_year + factor(movie.sig$genres) + movie.sig$duration *
movie.sig$num_voted_users + movie.sig$num_voted_users * movie.sig$num_user_for_reviews +
movie.sig$gross * movie.sig$budget, data = movie.sig)
Residuals:
Min 1Q Median 3Q Max
-5.0519 -0.3700 0.0863 0.4828 2.0996
Coefficients:
Estimate Std. Error t value
(Intercept) 4.748e+01 3.592e+00 13.218
movie.sig$num_voted_users 7.890e-06 4.790e-07 16.472
movie.sig$num_critic_for_reviews 2.427e-03 2.275e-04 10.669
movie.sig$num_user_for_reviews -3.039e-04 6.998e-05 -4.343
movie.sig$duration 1.277e-02 9.200e-04 13.882
movie.sig$facenumber_in_poster -1.858e-02 6.806e-03 -2.730
movie.sig$gross -1.469e-09 4.191e-10 -3.505
movie.sig$movie_facebook_likes -2.370e-06 9.659e-07 -2.454
movie.sig$director_facebook_likes 3.969e-06 4.482e-06 0.885
movie.sig$cast_total_facebook_likes 7.641e-07 7.181e-07 1.064
movie.sig$budget -5.900e-09 5.917e-10 -9.971
movie.sig$title_year -2.154e-02 1.790e-03 -12.032
factor(movie.sig$genres)Adventure 3.308e-01 5.338e-02 6.196
factor(movie.sig$genres)Animation 7.426e-01 1.319e-01 5.629
factor(movie.sig$genres)Biography 6.551e-01 7.512e-02 8.720
factor(movie.sig$genres)Comedy 1.515e-01 4.284e-02 3.537
factor(movie.sig$genres)Crime 4.496e-01 6.353e-02 7.077
factor(movie.sig$genres)Documentary 8.960e-01 1.579e-01 5.676
factor(movie.sig$genres)Drama 4.965e-01 4.835e-02 10.269
factor(movie.sig$genres)Family 3.329e-01 4.432e-01 0.751
factor(movie.sig$genres)Fantasy -1.544e-01 1.419e-01 -1.089
factor(movie.sig$genres)Horror -3.577e-01 7.638e-02 -4.683
factor(movie.sig$genres)Musical -2.616e-01 5.459e-01 -0.479
factor(movie.sig$genres)Mystery 1.263e-01 1.939e-01 0.652
factor(movie.sig$genres)Romance 5.476e-01 5.392e-01 1.016
factor(movie.sig$genres)Sci-Fi 1.673e-01 2.900e-01 0.577
factor(movie.sig$genres)Thriller -4.858e-01 7.627e-01 -0.637
factor(movie.sig$genres)Western -1.277e-01 5.408e-01 -0.236
movie.sig$num_voted_users:movie.sig$duration -3.052e-08 3.447e-09 -8.852
movie.sig$num_voted_users:movie.sig$num_user_for_reviews -3.752e-10 9.851e-11 -3.809
movie.sig$gross:movie.sig$budget 1.411e-17 2.887e-18 4.886
Pr(>|t|)
(Intercept) < 2e-16 ***
movie.sig$num_voted_users < 2e-16 ***
movie.sig$num_critic_for_reviews < 2e-16 ***
movie.sig$num_user_for_reviews 1.46e-05 ***
movie.sig$duration < 2e-16 ***
movie.sig$facenumber_in_poster 0.006371 **
movie.sig$gross 0.000463 ***
movie.sig$movie_facebook_likes 0.014175 *
movie.sig$director_facebook_likes 0.376035
movie.sig$cast_total_facebook_likes 0.287447
movie.sig$budget < 2e-16 ***
movie.sig$title_year < 2e-16 ***
factor(movie.sig$genres)Adventure 6.60e-10 ***
factor(movie.sig$genres)Animation 1.98e-08 ***
factor(movie.sig$genres)Biography < 2e-16 ***
factor(movie.sig$genres)Comedy 0.000411 ***
factor(movie.sig$genres)Crime 1.83e-12 ***
factor(movie.sig$genres)Documentary 1.51e-08 ***
factor(movie.sig$genres)Drama < 2e-16 ***
factor(movie.sig$genres)Family 0.452648
factor(movie.sig$genres)Fantasy 0.276414
factor(movie.sig$genres)Horror 2.95e-06 ***
factor(movie.sig$genres)Musical 0.631791
factor(movie.sig$genres)Mystery 0.514773
factor(movie.sig$genres)Romance 0.309947
factor(movie.sig$genres)Sci-Fi 0.563982
factor(movie.sig$genres)Thriller 0.524230
factor(movie.sig$genres)Western 0.813336
movie.sig$num_voted_users:movie.sig$duration < 2e-16 ***
movie.sig$num_voted_users:movie.sig$num_user_for_reviews 0.000143 ***
movie.sig$gross:movie.sig$budget 1.08e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7607 on 2974 degrees of freedom
Multiple R-squared: 0.483, Adjusted R-squared: 0.4778
F-statistic: 92.63 on 30 and 2974 DF, p-value: < 2.2e-16
step(null,scope=list(lower=null,upper=full3),direction='forward')
Start: AIC=309.81
movie.sig$imdb_score ~ 1
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users 1 871.90 2457.2 -600.74
+ movie.sig$duration 1 491.13 2838.0 -167.82
+ movie.sig$num_critic_for_reviews 1 428.38 2900.8 -102.10
+ movie.sig$num_user_for_reviews 1 407.62 2921.5 -80.68
+ factor(movie.sig$genres) 16 331.02 2998.1 27.10
+ movie.sig$movie_facebook_likes 1 282.82 3046.3 45.02
+ movie.sig$gross 1 242.62 3086.5 84.42
+ movie.sig$director_facebook_likes 1 166.17 3163.0 157.95
+ movie.sig$title_year 1 69.27 3259.9 248.63
+ movie.sig$cast_total_facebook_likes 1 64.28 3264.8 253.22
+ movie.sig$budget 1 16.26 3312.9 297.09
+ movie.sig$facenumber_in_poster 1 15.14 3314.0 298.11
<none> 3329.1 309.81
Step: AIC=-600.74
movie.sig$imdb_score ~ movie.sig$num_voted_users
Df Sum of Sq RSS AIC
+ factor(movie.sig$genres) 16 311.531 2145.7 -976.12
+ movie.sig$duration 1 147.786 2309.4 -785.13
+ movie.sig$title_year 1 84.649 2372.6 -704.08
+ movie.sig$budget 1 73.211 2384.0 -689.63
+ movie.sig$num_user_for_reviews 1 21.297 2435.9 -624.90
+ movie.sig$gross 1 16.929 2440.3 -619.51
+ movie.sig$num_critic_for_reviews 1 14.632 2442.6 -616.69
+ movie.sig$director_facebook_likes 1 13.657 2443.6 -615.49
+ movie.sig$facenumber_in_poster 1 6.789 2450.4 -607.05
+ movie.sig$movie_facebook_likes 1 2.627 2454.6 -601.95
<none> 2457.2 -600.74
+ movie.sig$cast_total_facebook_likes 1 0.524 2456.7 -599.38
Step: AIC=-976.12
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres)
Df Sum of Sq RSS AIC
+ movie.sig$title_year 1 79.623 2066.1 -1087.75
+ movie.sig$duration 1 74.584 2071.1 -1080.44
+ movie.sig$budget 1 28.689 2117.0 -1014.57
+ movie.sig$num_critic_for_reviews 1 23.116 2122.6 -1006.67
+ movie.sig$num_user_for_reviews 1 12.251 2133.4 -991.33
+ movie.sig$director_facebook_likes 1 3.707 2142.0 -979.32
+ movie.sig$facenumber_in_poster 1 3.274 2142.4 -978.71
+ movie.sig$movie_facebook_likes 1 1.686 2144.0 -976.49
<none> 2145.7 -976.12
+ movie.sig$gross 1 1.391 2144.3 -976.07
+ movie.sig$cast_total_facebook_likes 1 0.362 2145.3 -974.63
Step: AIC=-1087.75
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year
Df Sum of Sq RSS AIC
+ movie.sig$num_critic_for_reviews 1 125.091 1941.0 -1273.4
+ movie.sig$duration 1 55.857 2010.2 -1168.1
+ movie.sig$movie_facebook_likes 1 21.746 2044.3 -1117.5
+ movie.sig$num_user_for_reviews 1 11.741 2054.3 -1102.9
+ movie.sig$budget 1 9.196 2056.9 -1099.2
+ movie.sig$cast_total_facebook_likes 1 2.923 2063.2 -1090.0
+ movie.sig$director_facebook_likes 1 1.740 2064.3 -1088.3
<none> 2066.1 -1087.8
+ movie.sig$facenumber_in_poster 1 1.084 2065.0 -1087.3
+ movie.sig$gross 1 0.638 2065.4 -1086.7
Step: AIC=-1273.43
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$budget 1 36.627 1904.4 -1328.7
+ movie.sig$num_user_for_reviews 1 35.326 1905.7 -1326.6
+ movie.sig$duration 1 34.873 1906.1 -1325.9
+ movie.sig$gross 1 7.359 1933.6 -1282.8
+ movie.sig$movie_facebook_likes 1 1.397 1939.6 -1273.6
<none> 1941.0 -1273.4
+ movie.sig$facenumber_in_poster 1 0.926 1940.1 -1272.9
+ movie.sig$director_facebook_likes 1 0.644 1940.3 -1272.4
+ movie.sig$cast_total_facebook_likes 1 0.572 1940.4 -1272.3
Step: AIC=-1328.68
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget
Df Sum of Sq RSS AIC
+ movie.sig$duration 1 58.373 1846.0 -1420.2
+ movie.sig$num_user_for_reviews 1 27.052 1877.3 -1369.7
+ movie.sig$movie_facebook_likes 1 2.576 1901.8 -1330.8
+ movie.sig$cast_total_facebook_likes 1 2.005 1902.3 -1329.8
<none> 1904.4 -1328.7
+ movie.sig$facenumber_in_poster 1 1.071 1903.3 -1328.4
+ movie.sig$director_facebook_likes 1 0.557 1903.8 -1327.6
+ movie.sig$gross 1 0.074 1904.3 -1326.8
Step: AIC=-1420.23
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users:movie.sig$duration 1 70.848 1775.1 -1535.8
+ movie.sig$num_user_for_reviews 1 33.825 1812.2 -1473.8
+ movie.sig$movie_facebook_likes 1 4.702 1841.3 -1425.9
+ movie.sig$facenumber_in_poster 1 2.488 1843.5 -1422.3
+ movie.sig$cast_total_facebook_likes 1 1.601 1844.4 -1420.8
<none> 1846.0 -1420.2
+ movie.sig$gross 1 0.196 1845.8 -1418.5
+ movie.sig$director_facebook_likes 1 0.043 1845.9 -1418.3
Step: AIC=-1535.83
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_voted_users:movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_user_for_reviews 1 26.4426 1748.7 -1578.9
+ movie.sig$facenumber_in_poster 1 2.9576 1772.2 -1538.8
+ movie.sig$cast_total_facebook_likes 1 1.1823 1774.0 -1535.8
<none> 1775.1 -1535.8
+ movie.sig$movie_facebook_likes 1 0.9446 1774.2 -1535.4
+ movie.sig$director_facebook_likes 1 0.3854 1774.8 -1534.5
+ movie.sig$gross 1 0.0191 1775.1 -1533.9
Step: AIC=-1578.93
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$num_voted_users:movie.sig$duration
Df Sum of Sq RSS AIC
+ movie.sig$num_voted_users:movie.sig$num_user_for_reviews 1 5.4845 1743.2 -1586.4
+ movie.sig$facenumber_in_poster 1 4.1664 1744.5 -1584.1
+ movie.sig$movie_facebook_likes 1 3.9301 1744.8 -1583.7
<none> 1748.7 -1578.9
+ movie.sig$cast_total_facebook_likes 1 0.7354 1748.0 -1578.2
+ movie.sig$director_facebook_likes 1 0.2660 1748.4 -1577.4
+ movie.sig$gross 1 0.0008 1748.7 -1576.9
Step: AIC=-1586.37
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$facenumber_in_poster 1 4.0181 1739.2 -1591.3
+ movie.sig$movie_facebook_likes 1 3.2754 1739.9 -1590.0
<none> 1743.2 -1586.4
+ movie.sig$cast_total_facebook_likes 1 0.6359 1742.6 -1585.5
+ movie.sig$director_facebook_likes 1 0.3798 1742.8 -1585.0
+ movie.sig$gross 1 0.0475 1743.2 -1584.5
Step: AIC=-1591.31
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$facenumber_in_poster + movie.sig$num_voted_users:movie.sig$duration +
movie.sig$num_voted_users:movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
+ movie.sig$movie_facebook_likes 1 3.11243 1736.1 -1594.7
<none> 1739.2 -1591.3
+ movie.sig$cast_total_facebook_likes 1 0.90996 1738.3 -1590.9
+ movie.sig$director_facebook_likes 1 0.29041 1738.9 -1589.8
+ movie.sig$gross 1 0.04757 1739.1 -1589.4
Step: AIC=-1594.69
movie.sig$imdb_score ~ movie.sig$num_voted_users + factor(movie.sig$genres) +
movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$facenumber_in_poster + movie.sig$movie_facebook_likes +
movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews
Df Sum of Sq RSS AIC
<none> 1736.1 -1594.7
+ movie.sig$cast_total_facebook_likes 1 0.97305 1735.1 -1594.4
+ movie.sig$director_facebook_likes 1 0.27990 1735.8 -1593.2
+ movie.sig$gross 1 0.03634 1736.0 -1592.8
Call:
lm(formula = movie.sig$imdb_score ~ movie.sig$num_voted_users +
factor(movie.sig$genres) + movie.sig$title_year + movie.sig$num_critic_for_reviews +
movie.sig$budget + movie.sig$duration + movie.sig$num_user_for_reviews +
movie.sig$facenumber_in_poster + movie.sig$movie_facebook_likes +
movie.sig$num_voted_users:movie.sig$duration + movie.sig$num_voted_users:movie.sig$num_user_for_reviews)
Coefficients:
(Intercept)
4.817e+01
movie.sig$num_voted_users
7.152e-06
factor(movie.sig$genres)Adventure
3.300e-01
factor(movie.sig$genres)Animation
7.097e-01
factor(movie.sig$genres)Biography
6.794e-01
factor(movie.sig$genres)Comedy
1.675e-01
factor(movie.sig$genres)Crime
4.784e-01
factor(movie.sig$genres)Documentary
9.449e-01
factor(movie.sig$genres)Drama
5.252e-01
factor(movie.sig$genres)Family
2.260e-01
factor(movie.sig$genres)Fantasy
-1.422e-01
factor(movie.sig$genres)Horror
-3.440e-01
factor(movie.sig$genres)Musical
-3.165e-01
factor(movie.sig$genres)Mystery
1.499e-01
factor(movie.sig$genres)Romance
5.682e-01
factor(movie.sig$genres)Sci-Fi
1.953e-01
factor(movie.sig$genres)Thriller
-4.097e-01
factor(movie.sig$genres)Western
-4.521e-02
movie.sig$title_year
-2.189e-02
movie.sig$num_critic_for_reviews
2.566e-03
movie.sig$budget
-4.370e-09
movie.sig$duration
1.206e-02
movie.sig$num_user_for_reviews
-3.210e-04
movie.sig$facenumber_in_poster
-1.750e-02
movie.sig$movie_facebook_likes
-2.239e-06
movie.sig$num_voted_users:movie.sig$duration
-2.661e-08
movie.sig$num_voted_users:movie.sig$num_user_for_reviews
-2.729e-10
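The forward selection traced above can be written more compactly by passing `data =` to `lm()` and using bare column names, which also shortens the term labels in the `step()` output. The following is a minimal sketch on the built-in `mtcars` data, which stands in for `movie.sig` here; the variables `mpg`, `wt`, `hp`, `qsec` and `cyl` belong to `mtcars`, and `trace = 0` only suppresses the per-step tables.

```r
# mtcars ships with R; it stands in for movie.sig in this sketch.
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)

# Forward selection: start from the intercept-only model and add, one at a
# time, the term that lowers AIC the most, stopping when no term helps.
selected <- step(null_fit,
                 scope = list(lower = formula(null_fit),
                              upper = formula(full_fit)),
                 direction = "forward",
                 trace = 0)

# Terms kept by the search, and a check that AIC did not get worse.
selected_terms <- attr(terms(selected), "term.labels")
aic_improved <- AIC(selected) <= AIC(null_fit)
```

Printing `formula(selected)` shows the chosen model without the repeated data-frame prefix on every term.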
For ease of interpretation, I will start with full3 (the additive model with interaction terms). After checking the residuals, we will decide whether to add higher-order terms.
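One way to carry out the residual check described here is sketched below. It uses the built-in `mtcars` data in place of `movie.sig`, so the variables `mpg`, `wt` and `hp` and the helper `plot_residuals()` are illustrative assumptions rather than part of this analysis. Curvature in the residuals-versus-fitted plot, or a significant nested-model F-test, would suggest that a higher-order term is worth adding.

```r
# Additive model with one interaction, fitted to the built-in mtcars data.
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
res <- resid(fit)

# Hypothetical helper: draws the two usual diagnostic plots for an lm fit.
plot_residuals <- function(model) {
  # Residuals vs fitted values: a curved band hints at a missing
  # higher-order term; a funnel shape hints at non-constant variance.
  plot(fitted(model), resid(model),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  # Normal Q-Q plot: points far off the line indicate non-normal residuals.
  qqnorm(resid(model))
  qqline(resid(model))
  invisible(NULL)
}

# Nested-model F-test: does a quadratic term in wt reduce the RSS enough?
fit_quad <- lm(mpg ~ wt + hp + wt:hp + I(wt^2), data = mtcars)
cmp <- anova(fit, fit_quad)
```

Calling `plot_residuals(fit)` draws the plots, and printing `cmp` shows the F-test for the added quadratic term.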
Split data into Test and Train:
indx = sample(1:nrow(movie.sig), as.integer(0.9*nrow(movie.sig)))
indx # random sample of 90% of the row indices
[1] 328 2365 1899 1526 2487 1132 2560 2291 1959 1879 2607 1062 2952 31 122 2966
[17] 2875 2626 1514 2507 1946 2115 892 1601 2812 1881 2375 1458 967 1217 879 1680
[33] 2921 701 319 194 203 249 2505 2656 1636 1190 2472 2886 497 1206 1174 505
[49] 831 2725 1719 2440 873 2917 2403 1087 350 1291 2735 2915 318 297 607 1393
[65] 952 2619 199 1049 2874 25 1187 898 1602 147 2737 2864 2747 2990 1185 1781
[81] 1674 440 1876 234 2178 241 2499 804 48 209 2622 1722 748 1965 86 8
[97] 2491 948 1883 360 2579 1413 1797 20 946 912 584 1827 924 2526 766 2580
[113] 1743 1990 522 582 618 604 1000 729 317 768 1417 2992 1325 2039 1758 1453
[129] 2450 1061 2758 1993 1258 2344 2568 2693 294 2723 1103 84 1778 1004 678 455
[145] 1170 2455 2469 258 930 1841 1964 254 435 2773 2166 176 240 470 853 2268
[161] 2168 2865 1748 2236 940 1472 213 2625 2994 685 809 2130 1789 150 842 1909
[177] 268 1211 285 2969 2189 2401 2504 2608 2174 1370 2196 2826 2099 2005 1379 1944
[193] 1053 2470 1167 2167 138 1402 1505 366 1010 1672 1829 2980 1999 1810 1252 495
[209] 2783 1633 1255 1209 2462 1994 1351 1531 2252 2691 389 1104 2294 969 1311 2762
[225] 1886 1120 833 746 1834 539 202 1429 27 1084 925 30 2018 555 2611 2675
[241] 1647 379 59 646 1686 1678 2909 1536 2391 596 2531 1233 2637 2989 2017 1992
[257] 1796 926 2687 1266 1403 1298 36 1715 2439 1464 1986 267 2212 1307 2861 55
[273] 2471 1782 2223 711 2995 2241 2829 1884 1902 1091 2094 2349 954 1246 849 810
[289] 1801 2106 1776 2427 2374 2044 2181 715 1439 2049 1105 2710 2715 1470 2547 677
[305] 76 1609 352 1471 1556 1376 1530 2243 1041 2881 619 2125 2779 2227 2052 485
[321] 2645 724 975 2809 1991 861 2390 1895 712 270 794 2767 1613 2654 2346 2593
[337] 1295 1204 561 373 2566 2021 1923 710 271 365 2064 2433 489 316 3000 2887
[353] 124 1588 1017 1357 1323 2860 2782 2819 2267 1048 2143 1524 1467 1929 2927 1421
[369] 1922 63 269 2993 1142 919 2284 1893 2899 1444 2785 2838 92 2974 2456 815
[385] 1695 1794 1074 1939 695 2067 68 1126 1656 1476 2669 666 452 1557 1539 1503
[401] 2030 2062 2128 545 121 1051 2162 2409 1338 987 1490 2592 159 1727 537 2797
[417] 125 2287 1289 3001 2360 1576 1669 1416 910 2420 2013 1267 1130 1775 1256 115
[433] 2932 682 656 1865 1177 2913 1593 1655 1806 2288 1153 2325 1019 2718 2134 549
[449] 2458 1183 1795 2488 1235 951 2912 580 2999 472 1967 672 2131 2750 1790 367
[465] 2820 843 1683 564 1708 399 583 2643 190 1693 191 1035 2556 281 602 2733
[481] 897 1840 2053 2653 2185 2641 1579 821 1425 1548 1317 110 1889 743 610 2468
[497] 2960 1515 2743 227 215 1996 6 1389 1314 1945 642 2448 995 362 2808 1703
[513] 212 2079 2399 2309 1507 2756 2609 2657 651 736 896 886 1851 2311 2717 334
[529] 541 135 312 2274 1887 2668 1540 1916 556 790 1500 2385 179 1360 994 1477
[545] 638 1755 882 1907 683 2345 671 2704 2851 1658 658 500 1650 46 2343 2396
[561] 1168 2090 304 116 492 1450 1733 2703 2496 625 554 1193 1336 2853 1552 1501
[577] 1364 2043 1068 2929 417 457 460 2776 1426 598 2054 1861 1640 1392 1352 855
[593] 2277 1623 1275 2786 2338 2024 2132 1597 1137 2614 697 525 2885 1598 1786 540
[609] 1312 107 2576 1078 752 44 2144 219 1575 2711 2949 1071 716 398 2602 2769
[625] 506 2768 2003 2201 2240 1877 2025 1158 1762 689 2452 2480 309 1918 2976 1115
[641] 1660 1201 2386 1799 1520 253 260 2184 51 2520 2798 1172 2765 181 918 2961
[657] 50 345 2000 2008 1618 1629 1792 2194 192 1535 2383 2897 1573 2362 2046 1730
[673] 917 2245 1016 1616 246 2869 1521 2219 2126 61 2156 18 632 723 1725 657
[689] 659 2822 605 1710 2313 1757 2676 1547 2222 1147 385 1284 1589 2515 1220 931
[705] 434 9 2477 2958 2276 2621 2228 587 1248 1868 400 231 2280 2541 1221 2918
[721] 693 765 1229 2679 2872 225 448 1653 2536 2663 2739 419 573 617 781 255
[737] 1954 2978 1545 2089 332 2542 2561 730 502 742 2518 418 2603 2936 113 449
[753] 1182 2855 261 1707 2551 2509 2226 200 601 1262 961 1777 1863 830 2598 1624
[769] 1433 1741 1337 1363 1949 1191 2569 2091 1729 676 1155 79 2681 1551 1056 2028
[785] 16 2800 96 2317 2719 1761 1857 2601 2364 2397 1390 1671 1013 336 2465 1133
[801] 1745 349 2821 411 2297 29 2546 2371 262 2239 1662 1125 2372 1178 2259 1044
[817] 98 1420 520 2435 507 2746 146 1767 968 2459 355 509 1824 2211 1341 2229
[833] 2944 340 39 2050 1936 2117 2377 899 2424 2460 1077 932 2781 134 1689 151
[849] 1862 1184 2882 2320 2896 1811 145 763 2700 2879 2996 1427 2586 895 117 424
[865] 1119 2721 1819 1812 283 1159 1765 469 2880 890 1897 384 1989 21 688 2192
[881] 1607 2426 1409 2033 153 2748 998 2827 2632 1615 120 2937 2665 1304 2572 1974
[897] 2613 2856 4 660 1082 284 756 2354 1400 333 1987 1293 1386 1697 1151 2893
[913] 2041 2173 2707 104 13 1382 359 563 2649 870 2454 2056 66 1313 326 423
[929] 536 70 2889 83 526 1716 2975 1924 644 2190 2544 2326 2646 623 1694 2973
[945] 1641 1109 1513 75 1247 80 860 1706 2578 201 916 1627 2523 1666 2636 1826
[961] 1511 1179 1222 2777 1259 129 2087 2575 962 1214 2304 1198 2849 1412 1230 628
[977] 2135 1700 740 2237 858 323 1237 1499 771 1542 817 1930 959 2792 1978 351
[993] 1718 2982 922 2563 293 1369 1648 2183
[ reached getOption("max.print") -- omitted 1704 entries ]
movie_train = movie.sig[indx,]
movie_test = movie.sig[-indx,]
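Because `sample()` is called without a seed, the split (and every number in the summaries that follow) changes from run to run. Below is a minimal sketch of a reproducible split, using the built-in `mtcars` data as a stand-in for `movie.sig` and an arbitrary seed value (both are assumptions for illustration). Passing `data =` to `lm()` instead of writing `movie_train$...` in the formula also lets `predict()` score the held-out rows:

```r
# Stand-in data: mtcars replaces movie.sig so the sketch runs on its own.
dat <- mtcars

set.seed(123)  # arbitrary seed (assumption); fixes the random split
indx <- sample(seq_len(nrow(dat)), as.integer(0.9 * nrow(dat)))
train <- dat[indx, ]
test  <- dat[-indx, ]

# Column names in the formula plus data = train, so predict() accepts newdata.
fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared prediction error on the held-out rows.
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

The same pattern would apply to the movie data by swapping in `movie.sig` and the predictors used in the models of this section.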
# lm.fit1: linear model with interaction terms, dropping the insignificant predictors.
# Insignificant terms from summary(full3): 'director_facebook_likes', 'movie_facebook_likes' and 'cast_total_facebook_likes'.
# Note: this choice is independent of the step function used for full3.
lm.fit1<-lm(movie_train$imdb_score~movie_train$num_voted_users+movie_train$num_critic_for_reviews+movie_train$num_user_for_reviews+movie_train$duration+movie_train$facenumber_in_poster+movie_train$gross+movie_train$budget+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
summary(lm.fit1)
Call:
lm(formula = movie_train$imdb_score ~ movie_train$num_voted_users +
movie_train$num_critic_for_reviews + movie_train$num_user_for_reviews +
movie_train$duration + movie_train$facenumber_in_poster +
movie_train$gross + movie_train$budget + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews +
movie_train$gross * movie_train$budget)
Residuals:
Min 1Q Median 3Q Max
-4.1194 -0.3781 0.0931 0.4858 2.0804
Coefficients:
Estimate Std. Error
(Intercept) 4.665e+01 3.787e+00
movie_train$num_voted_users 8.335e-06 5.213e-07
movie_train$num_critic_for_reviews 2.187e-03 2.038e-04
movie_train$num_user_for_reviews -2.890e-04 7.239e-05
movie_train$duration 1.342e-02 9.816e-04
movie_train$facenumber_in_poster -1.901e-02 7.330e-03
movie_train$gross -1.518e-09 4.415e-10
movie_train$budget -6.024e-09 6.279e-10
movie_train$title_year -2.115e-02 1.887e-03
factor(movie_train$genres)Adventure 3.115e-01 5.622e-02
factor(movie_train$genres)Animation 6.982e-01 1.419e-01
factor(movie_train$genres)Biography 6.623e-01 7.860e-02
factor(movie_train$genres)Comedy 1.560e-01 4.511e-02
factor(movie_train$genres)Crime 4.634e-01 6.614e-02
factor(movie_train$genres)Documentary 1.286e+00 1.680e-01
factor(movie_train$genres)Drama 4.970e-01 5.072e-02
factor(movie_train$genres)Family 3.297e-01 4.429e-01
factor(movie_train$genres)Fantasy -1.832e-01 1.466e-01
factor(movie_train$genres)Horror -3.607e-01 8.103e-02
factor(movie_train$genres)Musical 3.017e-01 7.640e-01
factor(movie_train$genres)Mystery 1.892e-01 2.067e-01
factor(movie_train$genres)Romance 8.373e-01 7.621e-01
factor(movie_train$genres)Sci-Fi -7.964e-03 3.428e-01
factor(movie_train$genres)Thriller -4.997e-01 7.626e-01
factor(movie_train$genres)Western 9.650e-01 7.619e-01
movie_train$num_voted_users:movie_train$duration -3.375e-08 3.799e-09
movie_train$num_voted_users:movie_train$num_user_for_reviews -4.043e-10 1.018e-10
movie_train$gross:movie_train$budget 1.496e-17 3.068e-18
t value Pr(>|t|)
(Intercept) 12.320 < 2e-16 ***
movie_train$num_voted_users 15.990 < 2e-16 ***
movie_train$num_critic_for_reviews 10.730 < 2e-16 ***
movie_train$num_user_for_reviews -3.993 6.71e-05 ***
movie_train$duration 13.667 < 2e-16 ***
movie_train$facenumber_in_poster -2.594 0.009550 **
movie_train$gross -3.437 0.000597 ***
movie_train$budget -9.594 < 2e-16 ***
movie_train$title_year -11.208 < 2e-16 ***
factor(movie_train$genres)Adventure 5.541 3.30e-08 ***
factor(movie_train$genres)Animation 4.918 9.25e-07 ***
factor(movie_train$genres)Biography 8.425 < 2e-16 ***
factor(movie_train$genres)Comedy 3.457 0.000554 ***
factor(movie_train$genres)Crime 7.007 3.07e-12 ***
factor(movie_train$genres)Documentary 7.654 2.71e-14 ***
factor(movie_train$genres)Drama 9.797 < 2e-16 ***
factor(movie_train$genres)Family 0.744 0.456735
factor(movie_train$genres)Fantasy -1.249 0.211622
factor(movie_train$genres)Horror -4.452 8.87e-06 ***
factor(movie_train$genres)Musical 0.395 0.692918
factor(movie_train$genres)Mystery 0.915 0.360080
factor(movie_train$genres)Romance 1.099 0.272027
factor(movie_train$genres)Sci-Fi -0.023 0.981469
factor(movie_train$genres)Thriller -0.655 0.512335
factor(movie_train$genres)Western 1.267 0.205440
movie_train$num_voted_users:movie_train$duration -8.884 < 2e-16 ***
movie_train$num_voted_users:movie_train$num_user_for_reviews -3.973 7.29e-05 ***
movie_train$gross:movie_train$budget 4.875 1.15e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7604 on 2676 degrees of freedom
Multiple R-squared: 0.4854, Adjusted R-squared: 0.4802
F-statistic: 93.49 on 27 and 2676 DF, p-value: < 2.2e-16
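The p-value of the overall F-test is very small. All numeric predictors and interaction terms are significant, with face number in poster the least significant of them (p = 0.0096); several genre levels (Family, Fantasy, Musical, Mystery, Romance, Sci-Fi, Thriller, Western) are not individually significant. The adjusted R^2 is 0.4802, which means about 48% of the variability in IMDB score is explained by this model.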
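The adjusted R^2 reported by `summary()` can be reproduced from the multiple R^2 and the degrees of freedom. A short check using the figures printed above for lm.fit1 (the sample size is recovered as residual df + model df + 1):

```r
r2     <- 0.4854   # Multiple R-squared from the summary above
df_res <- 2676     # residual degrees of freedom
df_mod <- 27       # model degrees of freedom (from the F-statistic line)
n      <- df_res + df_mod + 1   # number of training rows

# Adjusted R^2 penalizes R^2 for the number of estimated coefficients.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 4)
```

The result agrees with the `Adjusted R-squared` value in the summary output for lm.fit1.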
Run a partial F-test to check whether removing these predictors hurts the fit:
#lm.full: full linear model with interaction terms on train dataset.
lm.full<-lm(movie_train$imdb_score~movie_train$num_voted_users+movie_train$num_critic_for_reviews+movie_train$num_user_for_reviews+movie_train$duration+movie_train$facenumber_in_poster+movie_train$gross+movie_train$movie_facebook_likes+movie_train$director_facebook_likes+movie_train$cast_total_facebook_likes+movie_train$budget+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
anova(lm.full,lm.fit1) # H0: the reduced model is adequate (the dropped coefficients are all 0)
Analysis of Variance Table
Model 1: movie_train$imdb_score ~ movie_train$num_voted_users + movie_train$num_critic_for_reviews +
movie_train$num_user_for_reviews + movie_train$duration +
movie_train$facenumber_in_poster + movie_train$gross + movie_train$movie_facebook_likes +
movie_train$director_facebook_likes + movie_train$cast_total_facebook_likes +
movie_train$budget + movie_train$title_year + factor(movie_train$genres) +
movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Model 2: movie_train$imdb_score ~ movie_train$num_voted_users + movie_train$num_critic_for_reviews +
movie_train$num_user_for_reviews + movie_train$duration +
movie_train$facenumber_in_poster + movie_train$gross + movie_train$budget +
movie_train$title_year + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2673 1545.1
2 2676 1547.5 -3 -2.3691 1.3662 0.2513
The p-value of the partial F-test is 0.2513, so we fail to reject H0: dropping ‘director_facebook_likes’, ‘movie_facebook_likes’ and ‘cast_total_facebook_likes’ does not significantly worsen the fit, and the reduced model lm.fit1 is adequate.
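The F statistic and p-value in the ANOVA table can be recomputed by hand from the residual sums of squares. A small sketch using the rounded figures printed above (so the result matches only up to rounding):

```r
rss_full    <- 1545.1   # RSS of lm.full (Model 1)
df_full     <- 2673     # residual degrees of freedom of lm.full
sum_sq_diff <- 2.3691   # increase in RSS after dropping the three terms
q           <- 3        # number of dropped terms

# Partial F: extra RSS per dropped term, relative to the full model's error variance.
f_stat  <- (sum_sq_diff / q) / (rss_full / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

A large p-value here means the three dropped terms explain little beyond what the reduced model already captures.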
Diagnostics:
plot(lm.fit1)
not plotting observations with leverage one:
420, 1144, 1702, 1734
# Residuals vs. fitted suggests a missing higher-order term; the normal Q-Q plot also looks poor.
library(car)
residualPlots(lm.fit1)
Test stat Pr(>|t|)
movie_train$num_voted_users -8.015 0.000
movie_train$num_critic_for_reviews -7.784 0.000
movie_train$num_user_for_reviews 3.994 0.000
movie_train$duration -4.664 0.000
movie_train$facenumber_in_poster 0.400 0.689
movie_train$gross -3.958 0.000
movie_train$budget 4.818 0.000
movie_train$title_year -3.936 0.000
factor(movie_train$genres) NA NA
Tukey test -14.747 0.000
Most of the residual vs. predictor plots show a general trend of curvature (only face number in poster does not, p = 0.689), which indicates that the current model does not fit well. Higher-order terms should be included.
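The `Test stat` column of `residualPlots()` is, for each numeric predictor, a curvature test: the squared predictor is added to the model and its t-statistic is reported. A base-R sketch of the same idea on simulated data (the variable names and the data-generating model are made up for illustration):

```r
set.seed(1)  # arbitrary seed for the simulated data
n <- 200
x <- runif(n, 0, 10)
z <- rnorm(n)
# True relationship is quadratic in x and linear in z.
y <- 1 + 0.5 * x - 0.2 * x^2 + z + rnorm(n)

fit_linear <- lm(y ~ x + z)                       # misses the curvature in x
fit_check  <- update(fit_linear, . ~ . + I(x^2))  # add the squared term

# t-statistic and p-value of the added squared term: the curvature test for x.
curv <- summary(fit_check)$coefficients["I(x^2)", c("t value", "Pr(>|t|)")]
curv
```

A small p-value for the squared term is the numerical counterpart of a bent smoother in the residual plot.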
Fit a model with higher-order terms:
# lm.fit2: lm.fit1 plus second-order terms for all numeric variables except 'facenumber_in_poster' and 'title_year'.
lm.fit2<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+poly(movie_train$num_user_for_reviews,2)+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+poly(movie_train$gross,2)+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews+movie_train$gross*movie_train$budget)
summary(lm.fit2)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + poly(movie_train$num_user_for_reviews,
2) + poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
poly(movie_train$gross, 2) + poly(movie_train$budget, 2) +
movie_train$title_year + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget)
Residuals:
Min 1Q Median 3Q Max
-3.9109 -0.3530 0.0637 0.4618 2.2051
Coefficients: (5 not defined because of singularities)
Estimate Std. Error
(Intercept) 5.222e+01 3.793e+00
poly(movie_train$num_voted_users, 2)1 4.696e+01 5.073e+00
poly(movie_train$num_voted_users, 2)2 -1.515e+01 2.182e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.360e+01 1.313e+00
poly(movie_train$num_critic_for_reviews, 2)2 -7.180e+00 8.464e-01
poly(movie_train$num_user_for_reviews, 2)1 -1.604e+01 2.217e+00
poly(movie_train$num_user_for_reviews, 2)2 4.249e+00 1.606e+00
poly(movie_train$duration, 2)1 1.447e+01 1.084e+00
poly(movie_train$duration, 2)2 -3.403e+00 7.733e-01
movie_train$facenumber_in_poster -2.293e-02 7.101e-03
poly(movie_train$gross, 2)1 -6.128e+00 2.127e+00
poly(movie_train$gross, 2)2 -1.912e+00 1.217e+00
poly(movie_train$budget, 2)1 -1.378e+01 1.950e+00
poly(movie_train$budget, 2)2 5.600e+00 1.084e+00
movie_train$title_year -2.290e-02 1.898e-03
factor(movie_train$genres)Adventure 3.535e-01 5.499e-02
factor(movie_train$genres)Animation 7.302e-01 1.384e-01
factor(movie_train$genres)Biography 6.244e-01 7.622e-02
factor(movie_train$genres)Comedy 1.419e-01 4.396e-02
factor(movie_train$genres)Crime 4.508e-01 6.411e-02
factor(movie_train$genres)Documentary 1.301e+00 1.632e-01
factor(movie_train$genres)Drama 4.838e-01 4.936e-02
factor(movie_train$genres)Family 3.629e-01 4.341e-01
factor(movie_train$genres)Fantasy -2.164e-01 1.422e-01
factor(movie_train$genres)Horror -3.858e-01 8.014e-02
factor(movie_train$genres)Musical -2.386e-03 7.394e-01
factor(movie_train$genres)Mystery 1.625e-01 2.001e-01
factor(movie_train$genres)Romance 8.988e-01 7.370e-01
factor(movie_train$genres)Sci-Fi -5.665e-02 3.316e-01
factor(movie_train$genres)Thriller -4.149e-01 7.376e-01
factor(movie_train$genres)Western 9.080e-01 7.366e-01
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$num_user_for_reviews NA NA
movie_train$gross NA NA
movie_train$budget NA NA
movie_train$duration:movie_train$num_voted_users -2.160e-08 3.828e-09
movie_train$num_voted_users:movie_train$num_user_for_reviews 7.634e-10 2.962e-10
movie_train$gross:movie_train$budget 1.241e-17 5.615e-18
t value Pr(>|t|)
(Intercept) 13.768 < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 9.256 < 2e-16 ***
poly(movie_train$num_voted_users, 2)2 -6.942 4.83e-12 ***
poly(movie_train$num_critic_for_reviews, 2)1 10.362 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 -8.483 < 2e-16 ***
poly(movie_train$num_user_for_reviews, 2)1 -7.233 6.14e-13 ***
poly(movie_train$num_user_for_reviews, 2)2 2.645 0.00821 **
poly(movie_train$duration, 2)1 13.349 < 2e-16 ***
poly(movie_train$duration, 2)2 -4.401 1.12e-05 ***
movie_train$facenumber_in_poster -3.229 0.00126 **
poly(movie_train$gross, 2)1 -2.881 0.00399 **
poly(movie_train$gross, 2)2 -1.571 0.11626
poly(movie_train$budget, 2)1 -7.070 1.97e-12 ***
poly(movie_train$budget, 2)2 5.168 2.54e-07 ***
movie_train$title_year -12.062 < 2e-16 ***
factor(movie_train$genres)Adventure 6.430 1.51e-10 ***
factor(movie_train$genres)Animation 5.278 1.41e-07 ***
factor(movie_train$genres)Biography 8.193 3.92e-16 ***
factor(movie_train$genres)Comedy 3.229 0.00126 **
factor(movie_train$genres)Crime 7.031 2.59e-12 ***
factor(movie_train$genres)Documentary 7.976 2.22e-15 ***
factor(movie_train$genres)Drama 9.800 < 2e-16 ***
factor(movie_train$genres)Family 0.836 0.40327
factor(movie_train$genres)Fantasy -1.522 0.12823
factor(movie_train$genres)Horror -4.814 1.56e-06 ***
factor(movie_train$genres)Musical -0.003 0.99743
factor(movie_train$genres)Mystery 0.812 0.41675
factor(movie_train$genres)Romance 1.220 0.22276
factor(movie_train$genres)Sci-Fi -0.171 0.86437
factor(movie_train$genres)Thriller -0.563 0.57382
factor(movie_train$genres)Western 1.233 0.21784
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$num_user_for_reviews NA NA
movie_train$gross NA NA
movie_train$budget NA NA
movie_train$duration:movie_train$num_voted_users -5.643 1.85e-08 ***
movie_train$num_voted_users:movie_train$num_user_for_reviews 2.578 0.01000 *
movie_train$gross:movie_train$budget 2.210 0.02721 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7349 on 2670 degrees of freedom
Multiple R-squared: 0.5204, Adjusted R-squared: 0.5145
F-statistic: 87.8 on 33 and 2670 DF, p-value: < 2.2e-16
In the output above, the second-order term for ‘gross’ is not significant (p = 0.116) and can be dropped. The second-order term for ‘num_user_for_reviews’ (p = 0.008) and the interaction between ‘gross’ and ‘budget’ (p = 0.027) are significant but comparatively weak, so we also try dropping them.
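The five `NA` rows in the lm.fit2 summary ("not defined because of singularities") come from the `*` operator: `a*b` expands to `a + b + a:b`, and the main effects `a` and `b` are already spanned by the `poly(a, 2)` and `poly(b, 2)` columns. A sketch on simulated data (names are illustrative) showing that writing the interaction with `:` gives the same fit without the aliased rows:

```r
set.seed(2)  # arbitrary seed for the simulated data
n <- 100
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + x + z + 0.5 * x^2 + 0.3 * x * z + rnorm(n)

# '*' adds main effects x and z, which poly() already covers -> NA coefficients.
fit_star  <- lm(y ~ poly(x, 2) + poly(z, 2) + x * z)
# ':' adds only the product term.
fit_colon <- lm(y ~ poly(x, 2) + poly(z, 2) + x:z)

sum(is.na(coef(fit_star)))                       # number of aliased coefficients
anyNA(coef(fit_colon))                           # any aliased coefficients left?
all.equal(fitted(fit_star), fitted(fit_colon))   # same fitted values
```

The aliased rows do not change the fit; they only clutter the summary and the degrees-of-freedom bookkeeping.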
# lm.fit3: based on lm.fit2 dropping second order term for 'number of users for review', 'gross' and budget*gross
lm.fit3<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit3)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews)
Residuals:
Min 1Q Median 3Q Max
-3.9364 -0.3471 0.0698 0.4640 2.1996
Coefficients: (2 not defined because of singularities)
Estimate Std. Error
(Intercept) 5.065e+01 3.751e+00
poly(movie_train$num_voted_users, 2)1 4.063e+01 4.606e+00
poly(movie_train$num_voted_users, 2)2 -1.754e+01 2.060e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.293e+01 1.297e+00
poly(movie_train$num_critic_for_reviews, 2)2 -6.712e+00 8.256e-01
movie_train$num_user_for_reviews -9.081e-04 9.331e-05
poly(movie_train$duration, 2)1 1.423e+01 1.081e+00
poly(movie_train$duration, 2)2 -3.319e+00 7.730e-01
movie_train$facenumber_in_poster -2.186e-02 7.105e-03
movie_train$gross -5.207e-10 3.186e-10
poly(movie_train$budget, 2)1 -1.064e+01 1.170e+00
poly(movie_train$budget, 2)2 7.172e+00 8.080e-01
movie_train$title_year -2.195e-02 1.875e-03
factor(movie_train$genres)Adventure 3.626e-01 5.497e-02
factor(movie_train$genres)Animation 7.416e-01 1.384e-01
factor(movie_train$genres)Biography 6.371e-01 7.624e-02
factor(movie_train$genres)Comedy 1.464e-01 4.395e-02
factor(movie_train$genres)Crime 4.605e-01 6.406e-02
factor(movie_train$genres)Documentary 1.306e+00 1.634e-01
factor(movie_train$genres)Drama 4.869e-01 4.938e-02
factor(movie_train$genres)Family 1.928e-01 4.282e-01
factor(movie_train$genres)Fantasy -2.218e-01 1.422e-01
factor(movie_train$genres)Horror -3.893e-01 8.007e-02
factor(movie_train$genres)Musical -8.817e-02 7.393e-01
factor(movie_train$genres)Mystery 1.555e-01 2.002e-01
factor(movie_train$genres)Romance 8.997e-01 7.382e-01
factor(movie_train$genres)Sci-Fi -7.984e-02 3.319e-01
factor(movie_train$genres)Thriller -4.206e-01 7.387e-01
factor(movie_train$genres)Western 9.181e-01 7.378e-01
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$duration:movie_train$num_voted_users -2.127e-08 3.785e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.371e-09 2.084e-10
t value Pr(>|t|)
(Intercept) 13.502 < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 8.821 < 2e-16 ***
poly(movie_train$num_voted_users, 2)2 -8.513 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1 9.972 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 -8.130 6.50e-16 ***
movie_train$num_user_for_reviews -9.732 < 2e-16 ***
poly(movie_train$duration, 2)1 13.165 < 2e-16 ***
poly(movie_train$duration, 2)2 -4.294 1.81e-05 ***
movie_train$facenumber_in_poster -3.077 0.002114 **
movie_train$gross -1.634 0.102366
poly(movie_train$budget, 2)1 -9.090 < 2e-16 ***
poly(movie_train$budget, 2)2 8.877 < 2e-16 ***
movie_train$title_year -11.707 < 2e-16 ***
factor(movie_train$genres)Adventure 6.596 5.09e-11 ***
factor(movie_train$genres)Animation 5.357 9.17e-08 ***
factor(movie_train$genres)Biography 8.356 < 2e-16 ***
factor(movie_train$genres)Comedy 3.330 0.000879 ***
factor(movie_train$genres)Crime 7.189 8.44e-13 ***
factor(movie_train$genres)Documentary 7.995 1.92e-15 ***
factor(movie_train$genres)Drama 9.861 < 2e-16 ***
factor(movie_train$genres)Family 0.450 0.652584
factor(movie_train$genres)Fantasy -1.559 0.118999
factor(movie_train$genres)Horror -4.862 1.23e-06 ***
factor(movie_train$genres)Musical -0.119 0.905077
factor(movie_train$genres)Mystery 0.777 0.437387
factor(movie_train$genres)Romance 1.219 0.223045
factor(movie_train$genres)Sci-Fi -0.241 0.809940
factor(movie_train$genres)Thriller -0.569 0.569165
factor(movie_train$genres)Western 1.244 0.213456
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$duration:movie_train$num_voted_users -5.621 2.10e-08 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users 6.581 5.61e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7361 on 2673 degrees of freedom
Multiple R-squared: 0.5183, Adjusted R-squared: 0.5129
F-statistic: 95.88 on 30 and 2673 DF, p-value: < 2.2e-16
anova(lm.fit2,lm.fit3)
Analysis of Variance Table
Model 1: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + poly(movie_train$num_user_for_reviews,
2) + poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
poly(movie_train$gross, 2) + poly(movie_train$budget, 2) +
movie_train$title_year + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews + movie_train$gross * movie_train$budget
Model 2: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2670 1442.2
2 2673 1448.5 -3 -6.2905 3.8821 0.008792 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
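The p-value of the partial F-test in the output above is 0.0088, so at the 5% level the three dropped terms are jointly significant and lm.fit2 fits slightly better. The cost of dropping them is small, though: the adjusted R^2 of lm.fit3 is 0.5129 (versus 0.5145 for lm.fit2), so about 51% of the variation is explained by the simpler model, which we carry forward.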
Diagnostics for lm.fit3:
plot(lm.fit3)
not plotting observations with leverage one:
420, 1144, 1734
NaNs produced
library(car)
residualPlots(lm.fit3)
Test stat Pr(>|t|)
poly(movie_train$num_voted_users, 2) NA NA
poly(movie_train$num_critic_for_reviews, 2) NA NA
movie_train$num_user_for_reviews 2.599 0.009
poly(movie_train$duration, 2) NA NA
movie_train$facenumber_in_poster 0.519 0.604
movie_train$gross -0.112 0.911
poly(movie_train$budget, 2) NA NA
movie_train$title_year -5.952 0.000
factor(movie_train$genres) NA NA
movie_train$duration 1.084 0.279
movie_train$num_voted_users 0.031 0.975
Tukey test -12.886 0.000
The plots look much better than those for lm.fit2. The residual vs. predictor plots are roughly flat except for title year (and, more weakly, num_user_for_reviews, p = 0.009). So let's try adding a second-order term for title year.
# lm.fit4: based on lm.fit3, adding a second-order term for title year.
lm.fit4<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+poly(movie_train$title_year,2)+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit4)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + poly(movie_train$title_year,
2) + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews)
Residuals:
Min 1Q Median 3Q Max
-3.9397 -0.3545 0.0567 0.4631 2.1996
Coefficients: (2 not defined because of singularities)
Estimate Std. Error
(Intercept) 6.692e+00 6.411e-02
poly(movie_train$num_voted_users, 2)1 3.734e+01 4.610e+00
poly(movie_train$num_voted_users, 2)2 -1.777e+01 2.047e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.700e+01 1.459e+00
poly(movie_train$num_critic_for_reviews, 2)2 -7.554e+00 8.325e-01
movie_train$num_user_for_reviews -1.026e-03 9.482e-05
poly(movie_train$duration, 2)1 1.399e+01 1.075e+00
poly(movie_train$duration, 2)2 -3.164e+00 7.685e-01
movie_train$facenumber_in_poster -1.768e-02 7.095e-03
movie_train$gross -4.659e-10 3.167e-10
poly(movie_train$budget, 2)1 -1.071e+01 1.163e+00
poly(movie_train$budget, 2)2 7.614e+00 8.063e-01
poly(movie_train$title_year, 2)1 -1.304e+01 1.001e+00
poly(movie_train$title_year, 2)2 -5.074e+00 8.525e-01
factor(movie_train$genres)Adventure 3.701e-01 5.463e-02
factor(movie_train$genres)Animation 7.997e-01 1.379e-01
factor(movie_train$genres)Biography 6.375e-01 7.575e-02
factor(movie_train$genres)Comedy 1.515e-01 4.367e-02
factor(movie_train$genres)Crime 4.648e-01 6.365e-02
factor(movie_train$genres)Documentary 1.368e+00 1.627e-01
factor(movie_train$genres)Drama 4.963e-01 4.909e-02
factor(movie_train$genres)Family 1.581e-01 4.255e-01
factor(movie_train$genres)Fantasy -2.713e-01 1.416e-01
factor(movie_train$genres)Horror -4.080e-01 7.962e-02
factor(movie_train$genres)Musical -1.438e-01 7.347e-01
factor(movie_train$genres)Mystery 1.683e-01 1.990e-01
factor(movie_train$genres)Romance 9.931e-01 7.336e-01
factor(movie_train$genres)Sci-Fi -9.474e-02 3.298e-01
factor(movie_train$genres)Thriller -2.269e-01 7.348e-01
factor(movie_train$genres)Western 8.717e-01 7.331e-01
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$duration:movie_train$num_voted_users -2.002e-08 3.767e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.495e-09 2.081e-10
t value Pr(>|t|)
(Intercept) 104.375 < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 8.101 8.21e-16 ***
poly(movie_train$num_voted_users, 2)2 -8.680 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1 11.655 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 -9.075 < 2e-16 ***
movie_train$num_user_for_reviews -10.824 < 2e-16 ***
poly(movie_train$duration, 2)1 13.019 < 2e-16 ***
poly(movie_train$duration, 2)2 -4.117 3.95e-05 ***
movie_train$facenumber_in_poster -2.492 0.012780 *
movie_train$gross -1.471 0.141436
poly(movie_train$budget, 2)1 -9.212 < 2e-16 ***
poly(movie_train$budget, 2)2 9.443 < 2e-16 ***
poly(movie_train$title_year, 2)1 -13.028 < 2e-16 ***
poly(movie_train$title_year, 2)2 -5.952 3.00e-09 ***
factor(movie_train$genres)Adventure 6.774 1.53e-11 ***
factor(movie_train$genres)Animation 5.800 7.43e-09 ***
factor(movie_train$genres)Biography 8.416 < 2e-16 ***
factor(movie_train$genres)Comedy 3.469 0.000531 ***
factor(movie_train$genres)Crime 7.302 3.73e-13 ***
factor(movie_train$genres)Documentary 8.411 < 2e-16 ***
factor(movie_train$genres)Drama 10.109 < 2e-16 ***
factor(movie_train$genres)Family 0.372 0.710217
factor(movie_train$genres)Fantasy -1.916 0.055431 .
factor(movie_train$genres)Horror -5.125 3.19e-07 ***
factor(movie_train$genres)Musical -0.196 0.844810
factor(movie_train$genres)Mystery 0.846 0.397677
factor(movie_train$genres)Romance 1.354 0.175968
factor(movie_train$genres)Sci-Fi -0.287 0.773958
factor(movie_train$genres)Thriller -0.309 0.757503
factor(movie_train$genres)Western 1.189 0.234512
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$duration:movie_train$num_voted_users -5.316 1.15e-07 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users 7.187 8.57e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7314 on 2672 degrees of freedom
Multiple R-squared: 0.5246, Adjusted R-squared: 0.5191
F-statistic: 95.13 on 31 and 2672 DF, p-value: < 2.2e-16
anova(lm.fit4,lm.fit3)
Analysis of Variance Table
Model 1: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + poly(movie_train$title_year,
2) + factor(movie_train$genres) + movie_train$duration *
movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews
Model 2: movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2672 1429.5
2 2673 1448.5 -1 -18.952 35.425 2.997e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
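The p-value is very small (3.0e-09), so we reject the null hypothesis that the reduced model lm.fit3 is adequate: the second-order term for title year significantly improves the fit, although the gain in adjusted R^2 is small (0.5191 versus 0.5129). The remaining diagnostics are nevertheless run on lm.fit3.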
Marginal Model plot:
marginalModelPlots(lm.fit3)
Splines and/or polynomials replaced by a fitted linear combination
These plots show the response against each individual predictor, i.e. the conditional distribution of the response given that predictor, ignoring the other predictors. In our plots the model-based curve and the smooth of the data largely overlap for each predictor, which indicates that the model reproduces the marginal relationships well.
Check for residual outliers:
library(car)
qqPlot(lm.fit3$residuals,id.n = 10)
249 2278 1372 2065 469 2252 707 2639 718 1203
1 2 3 4 5 6 7 8 9 10
outlierTest(lm.fit3) # H0: residual is not an outlier
rstudent unadjusted p-value Bonferonni p
249 -5.383817 7.9263e-08 0.00021409
2278 -5.318779 1.1312e-07 0.00030552
1372 -4.913107 9.5056e-07 0.00256750
2065 -4.812054 1.5770e-06 0.00425940
469 -4.757227 2.0670e-06 0.00558310
2252 -4.679599 3.0176e-06 0.00815060
707 -4.557735 5.4033e-06 0.01459400
2639 -4.524189 6.3276e-06 0.01709100
718 -4.461174 8.4883e-06 0.02292700
1203 -4.426985 9.9395e-06 0.02684700
All 10 residuals have significant Bonferroni-adjusted p-values (all below 0.05), so they are flagged as outliers and are candidates for removal.
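The Bonferroni column in the table above is the unadjusted p-value multiplied by the number of residuals tested (for example, 7.9263e-08 times roughly 2,700 gives about 0.000214). A minimal base-R sketch of that calculation, on the built-in mtcars data with an illustrative model, and assuming this is essentially what outlierTest computes:

```r
# Studentized residuals and Bonferroni-adjusted p-values in base R
# (illustrative model on built-in data, not the movie model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

t_i <- rstudent(fit)                    # externally studentized residuals
n   <- length(t_i)
df  <- df.residual(fit) - 1             # one df is lost by deleting case i

p_unadj <- 2 * pt(-abs(t_i), df = df)   # two-sided p-value per observation
p_bonf  <- pmin(1, n * p_unadj)         # Bonferroni: multiply by the number of tests

# Observations ordered from most to least extreme
outlier_table <- data.frame(rstudent = t_i, p_unadj, p_bonf)
outlier_table <- outlier_table[order(outlier_table$p_unadj), ]
head(outlier_table, 3)
```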
Before dropping anything, let’s run some diagnostics to double-check which observations to drop.
library(car)
influencePlot(lm.fit3, id.n=10)
From the influence plot, we decided to drop observations 2572, 1423, 860, 1520, 509, 682, 1017, 848, 361, and 237.
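An influence plot combines three quantities per observation: the studentized residual, the hat value (leverage), and Cook's distance. As a rough sketch, the same quantities can be tabulated in base R; the model and data below are illustrative (built-in mtcars), and the choice of the top three observations is arbitrary.

```r
# Tabulate the quantities shown in an influence plot (illustrative data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

infl <- data.frame(
  rstudent = rstudent(fit),        # studentized residuals
  hat      = hatvalues(fit),       # leverage
  cooks_d  = cooks.distance(fit)   # overall influence on the fit
)

# The three most influential observations by Cook's distance
top <- head(infl[order(-infl$cooks_d), ], 3)
print(top)

# Refit without them and compare coefficients side by side
keep     <- !(rownames(mtcars) %in% rownames(top))
fit_drop <- lm(mpg ~ wt + hp, data = mtcars[keep, ])
round(cbind(all = coef(fit), dropped = coef(fit_drop)), 4)
```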
# lm.fit5: model based on lm.fit3 removing 10 outliers.
movie_train<-movie_train[-c(2572,1423,860,1520,509,682,1017,848,361,237),]
lm.fit5<-lm(movie_train$imdb_score~poly(movie_train$num_voted_users,2)+poly(movie_train$num_critic_for_reviews,2)+movie_train$num_user_for_reviews+poly(movie_train$duration,2)+movie_train$facenumber_in_poster+movie_train$gross+poly(movie_train$budget,2)+movie_train$title_year+factor(movie_train$genres)+movie_train$duration*movie_train$num_voted_users+movie_train$num_voted_users*movie_train$num_user_for_reviews)
summary(lm.fit5)
Call:
lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users,
2) + poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster +
movie_train$gross + poly(movie_train$budget, 2) + movie_train$title_year +
factor(movie_train$genres) + movie_train$duration * movie_train$num_voted_users +
movie_train$num_voted_users * movie_train$num_user_for_reviews)
Residuals:
Min 1Q Median 3Q Max
-3.9377 -0.3441 0.0703 0.4639 2.1885
Coefficients: (2 not defined because of singularities)
Estimate Std. Error
(Intercept) 5.037e+01 3.754e+00
poly(movie_train$num_voted_users, 2)1 4.032e+01 4.606e+00
poly(movie_train$num_voted_users, 2)2 -1.754e+01 2.059e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.278e+01 1.296e+00
poly(movie_train$num_critic_for_reviews, 2)2 -6.747e+00 8.258e-01
movie_train$num_user_for_reviews -9.099e-04 9.338e-05
poly(movie_train$duration, 2)1 1.410e+01 1.080e+00
poly(movie_train$duration, 2)2 -3.261e+00 7.731e-01
movie_train$facenumber_in_poster -2.187e-02 7.115e-03
movie_train$gross -4.939e-10 3.197e-10
poly(movie_train$budget, 2)1 -1.069e+01 1.170e+00
poly(movie_train$budget, 2)2 7.188e+00 8.082e-01
movie_train$title_year -2.181e-02 1.876e-03
factor(movie_train$genres)Adventure 3.522e-01 5.521e-02
factor(movie_train$genres)Animation 7.323e-01 1.385e-01
factor(movie_train$genres)Biography 6.313e-01 7.649e-02
factor(movie_train$genres)Comedy 1.386e-01 4.410e-02
factor(movie_train$genres)Crime 4.545e-01 6.414e-02
factor(movie_train$genres)Documentary 1.297e+00 1.634e-01
factor(movie_train$genres)Drama 4.814e-01 4.951e-02
factor(movie_train$genres)Family 1.832e-01 4.283e-01
factor(movie_train$genres)Fantasy -2.278e-01 1.423e-01
factor(movie_train$genres)Horror -3.967e-01 8.017e-02
factor(movie_train$genres)Musical -9.687e-02 7.394e-01
factor(movie_train$genres)Mystery 1.504e-01 2.003e-01
factor(movie_train$genres)Romance 8.888e-01 7.383e-01
factor(movie_train$genres)Sci-Fi -8.399e-02 3.320e-01
factor(movie_train$genres)Thriller -4.309e-01 7.388e-01
factor(movie_train$genres)Western 9.130e-01 7.378e-01
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$duration:movie_train$num_voted_users -2.105e-08 3.787e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.374e-09 2.084e-10
t value Pr(>|t|)
(Intercept) 13.418 < 2e-16 ***
poly(movie_train$num_voted_users, 2)1 8.754 < 2e-16 ***
poly(movie_train$num_voted_users, 2)2 -8.522 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)1 9.862 < 2e-16 ***
poly(movie_train$num_critic_for_reviews, 2)2 -8.171 4.68e-16 ***
movie_train$num_user_for_reviews -9.744 < 2e-16 ***
poly(movie_train$duration, 2)1 13.061 < 2e-16 ***
poly(movie_train$duration, 2)2 -4.219 2.54e-05 ***
movie_train$facenumber_in_poster -3.074 0.00213 **
movie_train$gross -1.545 0.12246
poly(movie_train$budget, 2)1 -9.133 < 2e-16 ***
poly(movie_train$budget, 2)2 8.895 < 2e-16 ***
movie_train$title_year -11.624 < 2e-16 ***
factor(movie_train$genres)Adventure 6.380 2.08e-10 ***
factor(movie_train$genres)Animation 5.288 1.34e-07 ***
factor(movie_train$genres)Biography 8.254 2.38e-16 ***
factor(movie_train$genres)Comedy 3.143 0.00169 **
factor(movie_train$genres)Crime 7.087 1.75e-12 ***
factor(movie_train$genres)Documentary 7.935 3.06e-15 ***
factor(movie_train$genres)Drama 9.723 < 2e-16 ***
factor(movie_train$genres)Family 0.428 0.66892
factor(movie_train$genres)Fantasy -1.601 0.10950
factor(movie_train$genres)Horror -4.948 7.98e-07 ***
factor(movie_train$genres)Musical -0.131 0.89578
factor(movie_train$genres)Mystery 0.751 0.45270
factor(movie_train$genres)Romance 1.204 0.22873
factor(movie_train$genres)Sci-Fi -0.253 0.80029
factor(movie_train$genres)Thriller -0.583 0.55979
factor(movie_train$genres)Western 1.237 0.21608
movie_train$duration NA NA
movie_train$num_voted_users NA NA
movie_train$duration:movie_train$num_voted_users -5.557 3.01e-08 ***
movie_train$num_user_for_reviews:movie_train$num_voted_users 6.593 5.16e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7362 on 2663 degrees of freedom
Multiple R-squared: 0.5169, Adjusted R-squared: 0.5114
F-statistic: 94.97 on 30 and 2663 DF, p-value: < 2.2e-16
compareCoefs(lm.fit3, lm.fit5)
Call:
1: lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + movie_train$gross
+ poly(movie_train$budget, 2) + movie_train$title_year + factor(movie_train$genres) +
movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews)
2: lm(formula = movie_train$imdb_score ~ poly(movie_train$num_voted_users, 2) +
poly(movie_train$num_critic_for_reviews, 2) + movie_train$num_user_for_reviews +
poly(movie_train$duration, 2) + movie_train$facenumber_in_poster + movie_train$gross
+ poly(movie_train$budget, 2) + movie_train$title_year + factor(movie_train$genres) +
movie_train$duration * movie_train$num_voted_users + movie_train$num_voted_users *
movie_train$num_user_for_reviews)
Est. 1 SE 1
(Intercept) 5.06e+01 3.75e+00
poly(movie_train$num_voted_users, 2)1 4.06e+01 4.61e+00
poly(movie_train$num_voted_users, 2)2 -1.75e+01 2.06e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.29e+01 1.30e+00
poly(movie_train$num_critic_for_reviews, 2)2 -6.71e+00 8.26e-01
movie_train$num_user_for_reviews -9.08e-04 9.33e-05
poly(movie_train$duration, 2)1 1.42e+01 1.08e+00
poly(movie_train$duration, 2)2 -3.32e+00 7.73e-01
movie_train$facenumber_in_poster -2.19e-02 7.11e-03
movie_train$gross -5.21e-10 3.19e-10
poly(movie_train$budget, 2)1 -1.06e+01 1.17e+00
poly(movie_train$budget, 2)2 7.17e+00 8.08e-01
movie_train$title_year -2.19e-02 1.87e-03
factor(movie_train$genres)Adventure 3.63e-01 5.50e-02
factor(movie_train$genres)Animation 7.42e-01 1.38e-01
factor(movie_train$genres)Biography 6.37e-01 7.62e-02
factor(movie_train$genres)Comedy 1.46e-01 4.39e-02
factor(movie_train$genres)Crime 4.61e-01 6.41e-02
factor(movie_train$genres)Documentary 1.31e+00 1.63e-01
factor(movie_train$genres)Drama 4.87e-01 4.94e-02
factor(movie_train$genres)Family 1.93e-01 4.28e-01
factor(movie_train$genres)Fantasy -2.22e-01 1.42e-01
factor(movie_train$genres)Horror -3.89e-01 8.01e-02
factor(movie_train$genres)Musical -8.82e-02 7.39e-01
factor(movie_train$genres)Mystery 1.56e-01 2.00e-01
factor(movie_train$genres)Romance 9.00e-01 7.38e-01
factor(movie_train$genres)Sci-Fi -7.98e-02 3.32e-01
factor(movie_train$genres)Thriller -4.21e-01 7.39e-01
factor(movie_train$genres)Western 9.18e-01 7.38e-01
movie_train$duration
movie_train$num_voted_users
movie_train$duration:movie_train$num_voted_users -2.13e-08 3.78e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.37e-09 2.08e-10
Est. 2 SE 2
(Intercept) 5.04e+01 3.75e+00
poly(movie_train$num_voted_users, 2)1 4.03e+01 4.61e+00
poly(movie_train$num_voted_users, 2)2 -1.75e+01 2.06e+00
poly(movie_train$num_critic_for_reviews, 2)1 1.28e+01 1.30e+00
poly(movie_train$num_critic_for_reviews, 2)2 -6.75e+00 8.26e-01
movie_train$num_user_for_reviews -9.10e-04 9.34e-05
poly(movie_train$duration, 2)1 1.41e+01 1.08e+00
poly(movie_train$duration, 2)2 -3.26e+00 7.73e-01
movie_train$facenumber_in_poster -2.19e-02 7.12e-03
movie_train$gross -4.94e-10 3.20e-10
poly(movie_train$budget, 2)1 -1.07e+01 1.17e+00
poly(movie_train$budget, 2)2 7.19e+00 8.08e-01
movie_train$title_year -2.18e-02 1.88e-03
factor(movie_train$genres)Adventure 3.52e-01 5.52e-02
factor(movie_train$genres)Animation 7.32e-01 1.38e-01
factor(movie_train$genres)Biography 6.31e-01 7.65e-02
factor(movie_train$genres)Comedy 1.39e-01 4.41e-02
factor(movie_train$genres)Crime 4.55e-01 6.41e-02
factor(movie_train$genres)Documentary 1.30e+00 1.63e-01
factor(movie_train$genres)Drama 4.81e-01 4.95e-02
factor(movie_train$genres)Family 1.83e-01 4.28e-01
factor(movie_train$genres)Fantasy -2.28e-01 1.42e-01
factor(movie_train$genres)Horror -3.97e-01 8.02e-02
factor(movie_train$genres)Musical -9.69e-02 7.39e-01
factor(movie_train$genres)Mystery 1.50e-01 2.00e-01
factor(movie_train$genres)Romance 8.89e-01 7.38e-01
factor(movie_train$genres)Sci-Fi -8.40e-02 3.32e-01
factor(movie_train$genres)Thriller -4.31e-01 7.39e-01
factor(movie_train$genres)Western 9.13e-01 7.38e-01
movie_train$duration
movie_train$num_voted_users
movie_train$duration:movie_train$num_voted_users -2.10e-08 3.79e-09
movie_train$num_user_for_reviews:movie_train$num_voted_users 1.37e-09 2.08e-10
Removing the 10 observations changed the coefficient estimates and standard errors only slightly.
Diagnostics for lm.fit5:
library(car)
residualPlots(lm.fit5)
Test stat Pr(>|t|)
poly(movie_train$num_voted_users, 2) NA NA
poly(movie_train$num_critic_for_reviews, 2) NA NA
movie_train$num_user_for_reviews 2.623 0.009
poly(movie_train$duration, 2) NA NA
movie_train$facenumber_in_poster 0.531 0.595
movie_train$gross -0.110 0.913
poly(movie_train$budget, 2) NA NA
movie_train$title_year -5.901 0.000
factor(movie_train$genres) NA NA
movie_train$duration 0.529 0.597
movie_train$num_voted_users 0.699 0.484
Tukey test -12.781 0.000
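Most predictors look fine, but the lack-of-fit tests for num_user_for_reviews (p = 0.009) and title_year (p < 0.001) are significant, and the plot of residuals versus fitted values shows some curvature (Tukey test p < 0.001).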
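The idea behind the Tukey test row can be sketched in base R: add the squared fitted values to the model as an extra regressor and check whether their coefficient differs from zero. The data below are simulated for illustration, and this is only an approximation of the idea, not a reproduction of what residualPlots reports.

```r
# Curvature check in the spirit of Tukey's test (simulated, illustrative data)
set.seed(1)
x <- runif(200, 0, 3)
y <- 1 + x^2 + rnorm(200, sd = 0.3)   # the true relationship is curved
fit <- lm(y ~ x)                      # deliberately misspecified straight line

# Add the squared fitted values as an extra regressor
fit_sq <- fitted(fit)^2
aug    <- lm(y ~ x + fit_sq)

# A significant coefficient on fit_sq signals curvature left in the residuals
curv <- summary(aug)$coefficients["fit_sq", ]
print(curv)
```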
plot(lm.fit5)
not plotting observations with leverage one:
418
NaNs produced
Now, let’s check the model assumptions for both lm.fit3 and lm.fit5:
# normality
shapiro.test(lm.fit3$residuals)
Shapiro-Wilk normality test
data: lm.fit3$residuals
W = 0.9464, p-value < 2.2e-16
shapiro.test(lm.fit5$residuals)
Shapiro-Wilk normality test
data: lm.fit5$residuals
W = 0.9461, p-value < 2.2e-16
Both models fail the normality assumption (p < 2.2e-16 in both tests). I think this is due to the many outliers in the data set.
# equal variance: H0: the error variance is constant
ncvTest(lm.fit3)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 165.6577 Df = 1 p = 6.571054e-38
ncvTest(lm.fit5)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 165.6577 Df = 1 p = 6.571054e-38
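Both models fail the equal variance assumption: the p-value is essentially zero, so we reject the null hypothesis of constant variance. The two outputs are identical, probably because both models were specified with movie_train$... terms and ncvTest re-evaluates them against the current (already reduced) movie_train.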
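To make the direction of this test explicit, here is a minimal base-R sketch of a Breusch-Pagan-style score test with the variance modelled on the fitted values, which is the idea behind the output above. The data are simulated so that the error spread grows with x; all names are illustrative.

```r
# Score test for non-constant variance (simulated, illustrative data)
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)   # error sd grows with x
fit <- lm(y ~ x)

e2  <- residuals(fit)^2
u   <- e2 / mean(e2)            # squared residuals scaled by the ML variance estimate
aux <- lm(u ~ fitted(fit))      # auxiliary regression on the fitted values

# Test statistic: regression sum of squares of the auxiliary model, divided by 2
chisq   <- sum((fitted(aux) - mean(u))^2) / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# H0 is CONSTANT variance, so a small p-value means heteroscedasticity
c(chisq = chisq, p = p_value)
```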
This section is just to explore some more interesting facts. Plots of the data with fitted regression lines, by genre:
library(ggplot2)
ggplot(data=movie_train,aes(x=duration,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=num_voted_users,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=facenumber_in_poster,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=gross,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
library(ggplot2)
ggplot(data=movie_train,aes(x=budget,y=imdb_score,colour=factor(genres)))+stat_smooth(method=lm,fullrange = FALSE)+geom_point()
Rewriting model lm.fit5 in another notation. Note: if the model is written as lm(train$score ~ train$x1 + train$x2 + ...), predict() ignores newdata and returns as many values as there are rows in the training data set.
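The note about predict() can be checked on a small example. The sketch below uses the built-in mtcars data with a hypothetical train/test split; the object names are illustrative only.

```r
# Why the formula notation matters for predict() (illustrative split of mtcars)
train <- mtcars[1:25, ]
test  <- mtcars[26:32, ]

# (a) Formula hard-wired to the vectors train$mpg and train$wt
fit_dollar <- lm(train$mpg ~ train$wt)
# newdata has no variable called train$wt, so predict() falls back on the
# training vectors and only warns about the row mismatch
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = test))

# (b) Bare column names plus a data argument
fit_data  <- lm(mpg ~ wt, data = train)
pred_data <- predict(fit_data, newdata = test)

length(pred_dollar)  # 25: one value per training row
length(pred_data)    # 7: one value per test row
```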
# lm.fit6 = lm.fit5, written with bare column names and a data argument
lm.fit6<-lm(imdb_score~poly(num_voted_users,2)+poly(num_critic_for_reviews,2)+num_user_for_reviews+poly(duration,2)+facenumber_in_poster+gross+poly(budget,2)+title_year+genres+duration*num_voted_users+num_voted_users*num_user_for_reviews,data=data.frame(movie_train))
summary(lm.fit6)
Call:
lm(formula = imdb_score ~ poly(num_voted_users, 2) + poly(num_critic_for_reviews,
2) + num_user_for_reviews + poly(duration, 2) + facenumber_in_poster +
gross + poly(budget, 2) + title_year + genres + duration *
num_voted_users + num_voted_users * num_user_for_reviews,
data = data.frame(movie_train))
Residuals:
Min 1Q Median 3Q Max
-3.9377 -0.3441 0.0703 0.4639 2.1885
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.037e+01 3.754e+00 13.418 < 2e-16 ***
poly(num_voted_users, 2)1 4.032e+01 4.606e+00 8.754 < 2e-16 ***
poly(num_voted_users, 2)2 -1.754e+01 2.059e+00 -8.522 < 2e-16 ***
poly(num_critic_for_reviews, 2)1 1.278e+01 1.296e+00 9.862 < 2e-16 ***
poly(num_critic_for_reviews, 2)2 -6.747e+00 8.258e-01 -8.171 4.68e-16 ***
num_user_for_reviews -9.099e-04 9.338e-05 -9.744 < 2e-16 ***
poly(duration, 2)1 1.410e+01 1.080e+00 13.061 < 2e-16 ***
poly(duration, 2)2 -3.261e+00 7.731e-01 -4.219 2.54e-05 ***
facenumber_in_poster -2.187e-02 7.115e-03 -3.074 0.00213 **
gross -4.939e-10 3.197e-10 -1.545 0.12246
poly(budget, 2)1 -1.069e+01 1.170e+00 -9.133 < 2e-16 ***
poly(budget, 2)2 7.188e+00 8.082e-01 8.895 < 2e-16 ***
title_year -2.181e-02 1.876e-03 -11.624 < 2e-16 ***
genresAdventure 3.522e-01 5.521e-02 6.380 2.08e-10 ***
genresAnimation 7.323e-01 1.385e-01 5.288 1.34e-07 ***
genresBiography 6.313e-01 7.649e-02 8.254 2.38e-16 ***
genresComedy 1.386e-01 4.410e-02 3.143 0.00169 **
genresCrime 4.545e-01 6.414e-02 7.087 1.75e-12 ***
genresDocumentary 1.297e+00 1.634e-01 7.935 3.06e-15 ***
genresDrama 4.814e-01 4.951e-02 9.723 < 2e-16 ***
genresFamily 1.832e-01 4.283e-01 0.428 0.66892
genresFantasy -2.278e-01 1.423e-01 -1.601 0.10950
genresHorror -3.967e-01 8.017e-02 -4.948 7.98e-07 ***
genresMusical -9.687e-02 7.394e-01 -0.131 0.89578
genresMystery 1.504e-01 2.003e-01 0.751 0.45270
genresRomance 8.888e-01 7.383e-01 1.204 0.22873
genresSci-Fi -8.399e-02 3.320e-01 -0.253 0.80029
genresThriller -4.309e-01 7.388e-01 -0.583 0.55979
genresWestern 9.130e-01 7.378e-01 1.237 0.21608
duration NA NA NA NA
num_voted_users NA NA NA NA
duration:num_voted_users -2.105e-08 3.787e-09 -5.557 3.01e-08 ***
num_user_for_reviews:num_voted_users 1.374e-09 2.084e-10 6.593 5.16e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7362 on 2663 degrees of freedom
Multiple R-squared: 0.5169, Adjusted R-squared: 0.5114
F-statistic: 94.97 on 30 and 2663 DF, p-value: < 2.2e-16
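The two coefficients reported as "not defined because of singularities" are the plain duration and num_voted_users terms. A likely explanation is that a * b expands to a + b + a:b, and the main effects a and b are already spanned by the intercept and the poly() terms. The sketch below reproduces this on the built-in mtcars data (illustrative variables) and shows that using a:b alone gives the same fitted values without aliased terms.

```r
# Aliased main effects when poly(a, 2) and a * b appear together (illustrative data)
fit_star <- lm(mpg ~ poly(wt, 2) + poly(hp, 2) + wt * hp, data = mtcars)
coef(fit_star)[c("wt", "hp")]   # both NA: linear combinations of intercept and poly() terms

# Same model with only the interaction term added
fit_colon <- lm(mpg ~ poly(wt, 2) + poly(hp, 2) + wt:hp, data = mtcars)
anyNA(coef(fit_colon))          # no aliased coefficients

# Both parameterizations span the same column space, so the fits agree
all.equal(unname(fitted(fit_star)), unname(fitted(fit_colon)))
```

Writing the interactions with : would probably also avoid the rank-deficiency warning that predict() raises for this model.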
pr<-predict.lm(lm.fit6,newdata = data.frame(movie_test),interval = 'confidence')
prediction from a rank-deficient fit may be misleading
pr
fit lwr upr
4 7.532866 7.070521 7.995212
18 7.147224 6.727831 7.566617
36 7.105444 6.927123 7.283766
41 6.832298 6.717652 6.946944
52 6.745073 6.570269 6.919877
54 7.099408 6.940154 7.258662
62 6.477071 6.167720 6.786422
72 5.518444 5.382021 5.654867
75 5.956378 5.799223 6.113534
97 7.921294 7.522945 8.319643
102 8.028773 7.850098 8.207447
109 5.720808 5.591183 5.850433
149 6.066992 5.934728 6.199256
158 5.706457 5.603469 5.809444
164 5.495290 5.380543 5.610037
165 5.507511 5.400149 5.614873
184 7.536942 7.424599 7.649286
190 6.526406 6.270885 6.781926
192 5.789304 5.679154 5.899454
194 6.144094 6.020476 6.267712
220 6.797991 6.697396 6.898585
223 6.231502 6.126179 6.336825
233 6.003580 5.896874 6.110285
241 7.218573 6.828965 7.608180
262 6.438942 6.350170 6.527715
292 6.545400 6.443179 6.647620
297 7.364689 6.869019 7.860359
304 5.848346 5.702158 5.994533
305 6.052453 5.943964 6.160941
313 5.685453 5.609391 5.761515
324 5.446335 5.322331 5.570339
340 7.835805 7.306835 8.364774
346 7.224485 7.067632 7.381338
362 8.705321 8.489233 8.921409
379 6.086725 5.979906 6.193544
395 6.065458 5.950318 6.180597
399 7.058227 6.961177 7.155278
404 6.836200 6.570506 7.101895
406 6.466229 6.384291 6.548168
410 6.355030 6.251712 6.458347
427 5.728945 5.640956 5.816934
435 8.007525 7.850646 8.164404
440 7.896934 7.588273 8.205595
453 8.184593 7.740853 8.628334
463 6.652393 6.519075 6.785711
475 5.672285 5.583483 5.761086
481 5.796184 5.527854 6.064514
494 5.736717 5.647324 5.826110
499 5.555406 5.468873 5.641940
502 5.664352 5.576523 5.752180
519 5.337230 5.218956 5.455505
525 6.882058 6.753216 7.010900
534 5.436274 5.353219 5.519329
536 6.246131 6.149101 6.343161
551 6.382540 6.120369 6.644711
553 6.006155 5.914869 6.097442
566 6.122604 5.993869 6.251339
570 6.655939 6.544799 6.767079
574 8.392476 8.086851 8.698100
576 6.855142 6.731293 6.978992
577 6.551108 6.443272 6.658943
599 5.818894 5.726167 5.911622
616 5.467505 5.391440 5.543570
617 5.868150 5.781085 5.955214
634 7.979128 7.688574 8.269682
642 6.747629 6.659634 6.835624
673 5.599985 5.496359 5.703612
680 6.667716 6.578642 6.756790
702 5.402351 5.321319 5.483383
706 5.851077 5.767683 5.934471
710 7.981290 7.793918 8.168662
723 6.508575 6.425190 6.591960
730 5.550647 5.428539 5.672754
741 5.945696 5.863457 6.027935
767 5.886350 5.816881 5.955818
789 7.094412 6.950716 7.238107
801 6.362756 6.267244 6.458269
805 6.099380 6.011990 6.186770
807 5.800161 5.723184 5.877138
816 5.983439 5.869589 6.097288
820 5.874646 5.772700 5.976592
842 5.893985 5.801702 5.986267
844 5.506660 5.416189 5.597131
849 6.983865 6.889091 7.078639
851 6.184149 6.105169 6.263130
872 6.035374 5.946974 6.123775
884 6.832841 6.324449 7.341233
897 6.086534 5.989880 6.183189
929 7.677788 6.228189 9.127388
933 7.209955 7.108111 7.311799
934 6.374305 6.218010 6.530600
941 7.423842 7.314980 7.532705
952 5.732756 5.661278 5.804234
1012 6.676683 6.557351 6.796016
1020 6.137487 6.014276 6.260697
1035 5.531226 5.448663 5.613789
1039 7.472331 7.302621 7.642040
1049 6.865405 6.782361 6.948450
1066 5.755046 5.672244 5.837849
1074 5.683657 5.587662 5.779652
1085 8.168594 8.006078 8.331110
1094 6.149181 6.070803 6.227560
1103 5.493684 5.428647 5.558721
1112 5.718656 5.616365 5.820947
1119 5.924898 5.836427 6.013369
1120 5.453503 5.344941 5.562065
1140 6.202411 6.126805 6.278017
1147 6.208350 6.124223 6.292477
1151 5.440569 5.358429 5.522709
1158 5.958038 5.885288 6.030788
1180 6.427328 6.339484 6.515172
1208 6.315590 6.239908 6.391272
1211 5.489064 5.394820 5.583307
1229 6.146876 6.077407 6.216345
1237 7.014990 6.906757 7.123223
1241 6.414861 6.027006 6.802715
1255 7.286743 7.167531 7.405954
1271 6.240700 6.161911 6.319489
1273 5.919283 5.846767 5.991799
1302 6.178886 6.036833 6.320939
1318 5.992964 5.901389 6.084539
1343 5.766825 5.693613 5.840038
1353 6.136637 6.055359 6.217914
1365 6.594790 6.144944 7.044635
1371 5.904377 5.754657 6.054097
1372 7.170810 7.072482 7.269137
1375 7.589885 7.415879 7.763891
1390 5.769504 5.698503 5.840504
1392 6.430170 6.348217 6.512123
1397 5.447974 5.377246 5.518702
1428 5.671113 5.589048 5.753178
1431 6.834620 6.738670 6.930569
1517 5.506728 5.427427 5.586028
1519 6.494431 6.400522 6.588340
1534 6.759141 6.676904 6.841377
1544 6.296068 6.218856 6.373279
1546 7.206950 7.085699 7.328201
1572 8.367261 7.718739 9.015783
1581 6.211045 6.135565 6.286525
1584 6.187331 6.074343 6.300319
1591 7.044950 6.941787 7.148113
1593 7.377296 7.237080 7.517512
1608 6.665452 6.592289 6.738614
1616 7.745531 7.541166 7.949896
1623 5.908825 5.845733 5.971917
1640 6.223100 6.158861 6.287339
1681 6.719219 6.626756 6.811681
1695 6.818825 6.698576 6.939075
1699 6.599794 6.501478 6.698109
1700 7.118055 6.960602 7.275507
1702 6.028337 5.945515 6.111159
1708 6.210212 6.133075 6.287348
1736 7.763616 7.611202 7.916030
1758 6.138560 5.980030 6.297089
1760 6.695719 6.589163 6.802275
1772 6.241149 6.178469 6.303829
1807 6.583871 6.470679 6.697062
1814 7.529013 7.342745 7.715281
1860 6.077985 5.959315 6.196655
1882 5.312082 5.217582 5.406582
1884 5.834634 5.744253 5.925014
1901 6.950317 6.845423 7.055212
1914 5.513860 5.402933 5.624787
1929 5.637303 5.564406 5.710200
1944 6.072144 6.013880 6.130407
1965 5.706108 5.632095 5.780121
1982 6.402153 6.326121 6.478185
1999 5.453124 5.361328 5.544921
2017 6.867382 6.780186 6.954577
2022 6.244585 6.089236 6.399933
2027 6.483438 6.395801 6.571075
2055 6.280783 6.139570 6.421996
2077 6.750369 6.659304 6.841433
2086 6.159448 6.080703 6.238193
2093 6.739150 6.552225 6.926075
2095 6.440827 6.366940 6.514713
2105 6.264204 6.157375 6.371033
2107 6.203387 6.092911 6.313862
2110 6.219974 6.135260 6.304689
2138 8.304860 8.124766 8.484955
2145 6.291060 6.215270 6.366849
2151 5.984690 5.882486 6.086893
2183 5.399695 5.296881 5.502510
2197 5.895995 5.737591 6.054398
2198 6.025398 5.938178 6.112619
2276 5.973059 5.903862 6.042255
2277 6.284302 6.134191 6.434413
2281 6.362727 6.290687 6.434767
2321 5.561182 5.477373 5.644990
2361 6.689518 6.545187 6.833848
2362 5.460404 5.324915 5.595893
2366 5.635655 5.538109 5.733201
2369 5.974270 5.918622 6.029917
2395 7.113046 6.972869 7.253224
2397 6.482995 6.388247 6.577743
2411 6.075614 6.005121 6.146108
2414 6.196374 6.115577 6.277172
2415 7.531062 7.311777 7.750347
2417 6.717303 6.603161 6.831444
2428 6.531399 6.415550 6.647247
2429 6.261654 6.172809 6.350500
2493 6.774596 6.595760 6.953431
2499 6.254368 6.189888 6.318848
2503 6.056060 5.914753 6.197367
2505 6.683703 6.533775 6.833630
2524 7.856346 7.745312 7.967379
2583 6.759891 6.662738 6.857044
2602 7.458668 7.298422 7.618915
2616 6.356146 6.260177 6.452116
2624 7.015640 6.922894 7.108386
2632 5.921147 5.818744 6.023550
2648 6.935343 6.814869 7.055818
2654 7.887763 7.579260 8.196266
2671 5.515390 5.378335 5.652445
2674 5.402407 5.320331 5.484483
2693 6.309477 6.195899 6.423055
2700 5.963442 5.880255 6.046630
2726 6.220098 6.136364 6.303833
2748 5.562933 5.495248 5.630619
2780 6.213294 6.098667 6.327921
2799 5.952863 5.850307 6.055419
2835 7.332987 7.020839 7.645135
2836 8.420098 8.136064 8.704132
2848 5.658914 5.559708 5.758120
2862 5.818235 5.702994 5.933475
2882 5.595872 5.522690 5.669053
2898 6.371319 6.298806 6.443833
2919 7.272168 7.143461 7.400875
2952 5.829849 5.697105 5.962593
2972 6.048737 5.926076 6.171398
2977 6.059970 5.407260 6.712680
2981 5.809145 5.731586 5.886703
2985 5.921258 5.850417 5.992098
3027 8.183790 7.982753 8.384826
3053 5.869260 5.805358 5.933163
3082 5.928392 5.860408 5.996377
3098 6.039541 5.900421 6.178661
3101 7.012963 6.922414 7.103511
3103 5.983910 5.873567 6.094253
3114 5.896691 5.752565 6.040817
3123 6.070311 5.797010 6.343611
3133 6.086494 5.973024 6.199965
3145 6.736984 6.640887 6.833082
3151 7.680189 7.571290 7.789087
3171 6.382080 6.293795 6.470364
3182 5.872762 5.796528 5.948997
3203 6.152732 6.069245 6.236219
3222 6.504499 6.359255 6.649743
3229 5.888606 5.792008 5.985203
3261 5.940752 5.876702 6.004803
3316 5.822529 5.725388 5.919670
3334 6.965722 6.856920 7.074525
3516 6.558064 6.417785 6.698342
3548 6.006390 5.936779 6.076001
3571 6.192664 6.113661 6.271667
3607 5.879491 5.805408 5.953574
3609 5.814770 5.674671 5.954869
3626 5.744500 5.654358 5.834642
3648 6.368112 6.249977 6.486248
3715 8.561899 8.383530 8.740268
3727 6.362990 6.283908 6.442072
3740 5.529139 5.393811 5.664468
3747 6.335849 6.229983 6.441716
3748 6.180180 6.081197 6.279163
3749 5.561311 5.423620 5.699002
3850 8.917223 8.680791 9.153654
3880 5.848465 5.702279 5.994650
3893 6.812039 6.725563 6.898515
3894 6.464500 6.364511 6.564489
3907 8.118958 7.933319 8.304598
3924 6.895080 6.586115 7.204044
3939 6.938407 6.840405 7.036408
4026 7.227623 7.051745 7.403501
4028 6.195250 6.114868 6.275631
4066 5.805236 5.657442 5.953031
4189 6.689414 6.604642 6.774187
4193 6.488028 6.410324 6.565732
4208 5.974875 5.880145 6.069606
4220 5.963858 5.854352 6.073364
4239 8.460741 8.234820 8.686663
4334 5.933228 5.855543 6.010913
4384 5.970913 5.906503 6.035323
4403 6.424157 6.343352 6.504963
4406 6.406195 6.312556 6.499834
4436 5.907443 5.810733 6.004153
4487 6.370477 6.296922 6.444031
4533 6.447268 6.163229 6.731306
4535 5.564566 5.421238 5.707895
4537 6.109283 5.952685 6.265880
4577 5.899824 5.824399 5.975249
4584 5.428643 5.345386 5.511900
4654 5.857927 5.776168 5.939685
4785 6.058881 5.982829 6.134932
4789 6.256824 6.180542 6.333105
4813 7.044035 5.581302 8.506769
4831 5.821646 5.736390 5.906902
4841 6.213957 6.115824 6.312089
4874 5.340048 4.685289 5.994807
4894 6.937580 6.731609 7.143551
5005 6.041962 4.593917 7.490007
5043 6.841339 6.531956 7.150723
Check accuracy with the mean absolute error (MAE): how far, on average, a prediction is from the true value.
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
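# Caution: pr is a matrix with columns fit, lwr and upr, so this call averages the
# absolute differences over all three columns; pr[, "fit"] would score the point predictions only.
MAE(pr, movie_test$imdb_score)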
[1] 0.5114398
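One caveat about the call above: predict() with interval = 'confidence' returns a matrix with columns fit, lwr and upr, and subtracting a vector from that matrix recycles the vector across all three columns. A toy sketch with made-up numbers (not the movie data) shows the difference between scoring the whole matrix and scoring the fit column only.

```r
# MAE on a prediction matrix versus on the point predictions (made-up numbers)
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

actual <- c(6.0, 7.0, 8.0)
pr_toy <- cbind(fit = c(6.5, 7.0, 7.5),
                lwr = c(6.0, 6.5, 7.0),
                upr = c(7.0, 7.5, 8.0))

mae_all_columns <- MAE(actual, pr_toy)            # averages over fit, lwr and upr
mae_fit_only    <- MAE(actual, pr_toy[, "fit"])   # point predictions only

c(all_columns = mae_all_columns, fit_only = mae_fit_only)
```

For the movie model, the corresponding call would be MAE(movie_test$imdb_score, pr[, "fit"]); its value is not reported here because it was not computed in the original run.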
Checking the relative importance of the predictors for the IMDB score.
# standardized regression coefficients
library(QuantPsyc)
Loading required package: boot
Attaching package: ‘boot’
The following object is masked from ‘package:car’:
logit
The following object is masked from ‘package:psych’:
logit
Loading required package: MASS
Attaching package: ‘QuantPsyc’
The following object is masked from ‘package:base’:
norm
lm.beta(lm.fit6)
Calling var(x) on a factor x is deprecated and will become an error.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
longer object length is not a multiple of shorter object length
poly(num_voted_users, 2)1 poly(num_voted_users, 2)2
7.375099e-01 -3.209507e-01
poly(num_critic_for_reviews, 2)1 poly(num_critic_for_reviews, 2)2
4.888110e+03 -1.234322e-01
num_user_for_reviews poly(duration, 2)1
-1.802006e-03 9.709896e+08
poly(duration, 2)2 facenumber_in_poster
-5.966200e-02 -2.047902e-01
gross poly(budget, 2)1
-1.408564e-09 -2.188525e+02
poly(budget, 2)2 title_year
1.060494e+06 -3.989610e-04
genresAdventure genresAnimation
6.443330e-03 2.800931e+02
genresBiography genresComedy
1.154918e-02 2.745636e-01
genresCrime genresDocumentary
3.130022e+07 2.372774e-02
genresDrama genresFamily
4.507383e+00 5.223540e-01
genresFantasy genresHorror
-4.662931e+00 -5.852019e+04
genresMusical genresMystery
-1.772132e-03 2.751607e-03
genresRomance genresSci-Fi
3.399551e+02 -1.536461e-03
genresThriller genresWestern
-8.534308e-01 6.286873e+07
duration:num_voted_users num_user_for_reviews:num_voted_users
-3.850005e-10 1.286813e-08
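The warnings above suggest that lm.beta does not cope well with factor and poly() terms, which would explain values such as 9.7e+08. For a model with numeric predictors only, standardized coefficients can be computed by hand as b * sd(x) / sd(y), or equivalently by refitting on scaled variables. The sketch below uses the built-in mtcars data and an illustrative model; it is not a drop-in replacement for the movie model.

```r
# Standardized coefficients for a numeric-only model (illustrative data)
fit_raw <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# (a) By hand: slope times sd(predictor) / sd(response)
b           <- coef(fit_raw)[-1]
sx          <- sapply(mtcars[names(b)], sd)
beta_manual <- b * sx / sd(mtcars$mpg)

# (b) Equivalent: refit after centring and scaling every variable
scaled      <- as.data.frame(scale(mtcars[c("mpg", names(b))]))
fit_scaled  <- lm(mpg ~ wt + hp + qsec, data = scaled)
beta_scaled <- coef(fit_scaled)[-1]

round(cbind(manual = beta_manual, scaled = beta_scaled), 4)
```

Standardized coefficients from a well-specified numeric model fall on a comparable scale, which makes them easier to rank than the values printed above.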
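Conclusion: Judging by the standardized coefficients above, the most important factor affecting a movie's rating is its duration: longer movies tend to receive higher ratings. num_critic_for_reviews is also an important predictor. Budget is important, although there is no strong correlation between budget and movie rating. The number of faces in the movie poster has a non-negligible effect on the rating. These rankings should be read with caution, because lm.beta raised warnings for this model (it contains factor and poly() terms), so the standardized values may be unreliable.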