knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
Load packages
library(ggplot2)
library(dplyr)
library(statsr)
library(leaps)
library(grid)
library(gridExtra)
library(car)
The movies data set was downloaded from the Coursera assignment page into the R project. The
observations are a random sample of movies with audience and critics reviews.
The data set contains 651 randomly sampled movies produced and released before 2016, with 32
available variables, and comes from Rotten Tomatoes and IMDB.
Because the data are observational, only an observational study is possible and no causal
conclusions will be drawn. The results generalize to movies produced and released before 2016.
Some of the variables are not relevant for identifying what makes a movie popular.
Before starting the analysis, one observation with a missing value was omitted and the required
variables were selected. A backward elimination (stepwise) method was used for selecting variables.
The research question is: how is movie popularity related to the different ratings and scores,
the awards won by the director and the actors/actresses, and the year of release?
To answer this question I will assess how strongly each of these elements is associated with
popularity.
The genre variable divides the data set into 10 categories plus an extra category called "other".
For the sake of exploratory analysis I merged some of the smallest categories into the "other"
category. The majority of the movies are dramas; the figures below show this information.
Before starting the analysis I removed the single observation with a missing value. A scatterplot
coloured by genre visualizes the genre categories in the data set, and a subset (df) of 15
variables was extracted from the movies data frame.
Multiple linear regression (MLR) is used to assess the most important predictors of movie
popularity. MLR allows the user to account for multiple explanatory variables and therefore to
create a model that predicts the specific outcome under study.
The simple linear regression model is based on a straight line with the formula
Y = a + bX
(where a is the intercept and b is the gradient). For multiple regression we simply add a new
b value (regression coefficient) for each additional explanatory variable:
Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn
Multiple R, which is the equivalent of Pearson's r, shows the strength of the relationship
between the outcome variable and the values predicted by the model as a whole.
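As a minimal sketch of this idea (using simulated data rather than the movies data; all names below are hypothetical), a multiple regression can be fitted with lm() and Multiple R recovered either as the correlation between the fitted values and the outcome or as the square root of R-squared:
set.seed(1)                                   # hypothetical illustration, not the movies data
n  <- 100
x1 <- rnorm(n)                                # first explanatory variable
x2 <- rnorm(n)                                # second explanatory variable
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)      # outcome: Y = a + b1*X1 + b2*X2 + error
fit <- lm(y ~ x1 + x2)                        # multiple regression with two predictors
coef(fit)                                     # estimated a, b1 and b2
cor(fitted(fit), y)                           # Multiple R
sqrt(summary(fit)$r.squared)                  # same value via R-squared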
In multiple regression a model is constructed with more than one explanatory variable.
We want a model that accounts for as much of the variance in the outcome variable as possible
while elegantly including only the relevant variables.
The best way to select which variables to include in a model is to refer to previous research.
There are a few new issues to think about, and it is worth reiterating our assumptions when
using multiple explanatory variables, such as:
Linear relationship: The model is a roughly linear one.
Homoscedasticity: homoscedasticity means that the variance of the residuals should be the same
at each level of the explanatory variable/s.
Independent errors: This means that residuals should be uncorrelated.
There are some more important things to consider. When using simple regression, the
assumptions are the most important issues to tackle, but there are also other potential
problems you should look out for:
Outliers/influential cases: As with simple linear regression, it is important to look out
for cases which may have a disproportionate influence over your regression model.
Variance in all explanatory variables.
Explanatory variables may be continuous, ordinal or nominal but each must have at least a
small range of values even if there are only two categorical possibilities.
Multicollinearity: Multicollinearity exists when two or more of the explanatory variables are
highly correlated. This is a problem as it can be hard to disentangle which of them best explains
any shared variance with the outcome. It also suggests that the two variables may actually
represent the same underlying factor.
Normally distributed residuals: The residuals should be normally distributed.
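As a quick sketch of how several of these checks can be run at once (illustrated here on the built-in mtcars data, not the movies data), the default plot() method for a fitted lm object produces the four standard diagnostic plots:
fit0 <- lm(mpg ~ wt + hp, data = mtcars)  # example model on a built-in data set
par(mfrow = c(2, 2))                      # arrange the four diagnostic plots in a grid
plot(fit0)                                # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))                      # reset the plotting layout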
I will analyse the movies data set to find an answer to the question of which variables are
most significant in accounting for a movie's popularity. I shall also look into the issues and
list the shortcomings of the method used in the study.
load("C:/R Data Files/movies.RData")
attach(movies)
ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
geom_point(aes(colour=factor(genre))) +
xlab("Year of Release (theater)") +
ylab("Audience Score")
df <- movies[ -c(1, 6:12, 14, 25:32) ] # Extract the 15 variables used in the analysis
df <- na.omit(df)
##MissingData <- function(x){sum(is.na(x))/length(x)*100}
##apply(df, 2, MissingData)#
names(df) #Variables to be used for analysis
## [1] "title_type" "genre" "runtime"
## [4] "mpaa_rating" "imdb_rating" "critics_rating"
## [7] "critics_score" "audience_rating" "audience_score"
## [10] "best_pic_nom" "best_pic_win" "best_actor_win"
## [13] "best_actress_win" "best_dir_win" "top200_box"
Backward Variable Selection Algorithm (regsubsets)
regsubsets.out <- regsubsets(audience_score ~., data=df, nbest=1,
nvmax=NULL, method="backward")
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")
print(regsubsets.out)
## Subset selection object
## Call: regsubsets.formula(audience_score ~ ., data = df, nbest = 1,
## nvmax = NULL, method = "backward")
## 29 Variables (and intercept)
## Forced in Forced out
## title_typeFeature Film FALSE FALSE
## title_typeTV Movie FALSE FALSE
## genreAnimation FALSE FALSE
## genreArt House & International FALSE FALSE
## genreComedy FALSE FALSE
## genreDocumentary FALSE FALSE
## genreDrama FALSE FALSE
## genreHorror FALSE FALSE
## genreMusical & Performing Arts FALSE FALSE
## genreMystery & Suspense FALSE FALSE
## genreOther FALSE FALSE
## genreScience Fiction & Fantasy FALSE FALSE
## runtime FALSE FALSE
## mpaa_ratingNC-17 FALSE FALSE
## mpaa_ratingPG FALSE FALSE
## mpaa_ratingPG-13 FALSE FALSE
## mpaa_ratingR FALSE FALSE
## mpaa_ratingUnrated FALSE FALSE
## imdb_rating FALSE FALSE
## critics_ratingFresh FALSE FALSE
## critics_ratingRotten FALSE FALSE
## critics_score FALSE FALSE
## audience_ratingUpright FALSE FALSE
## best_pic_nomyes FALSE FALSE
## best_pic_winyes FALSE FALSE
## best_actor_winyes FALSE FALSE
## best_actress_winyes FALSE FALSE
## best_dir_winyes FALSE FALSE
## top200_boxyes FALSE FALSE
## 1 subsets of each size up to 29
## Selection Algorithm: backward
Stepwise Elimination Selection Algorithm (step, starting from the full model)
load("C:/R Data Files/movies.RData")
attach(movies)
## The following objects are masked from movies (pos = 3):
##
## actor1, actor2, actor3, actor4, actor5, audience_rating,
## audience_score, best_actor_win, best_actress_win,
## best_dir_win, best_pic_nom, best_pic_win, critics_rating,
## critics_score, director, dvd_rel_day, dvd_rel_month,
## dvd_rel_year, genre, imdb_num_votes, imdb_rating, imdb_url,
## mpaa_rating, rt_url, runtime, studio, thtr_rel_day,
## thtr_rel_month, thtr_rel_year, title, title_type, top200_box
model <- lm(audience_score ~ ., data = df)
#summary(model)
final_model <- step(model, direction = "both", trace = FALSE)
summary(final_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2299 -4.3817 0.5075 4.3523 24.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.75008 2.66718 -3.281 0.00109 **
## genreAnimation 3.20421 2.45639 1.304 0.19256
## genreArt House & International -2.66418 2.02406 -1.316 0.18856
## genreComedy 1.35890 1.13110 1.201 0.23005
## genreDocumentary 0.16974 1.39624 0.122 0.90328
## genreDrama -0.79769 0.96859 -0.824 0.41050
## genreHorror -2.09686 1.67267 -1.254 0.21045
## genreMusical & Performing Arts 2.60624 2.18644 1.192 0.23371
## genreMystery & Suspense -3.13253 1.24911 -2.508 0.01240 *
## genreOther -0.02899 1.92785 -0.015 0.98801
## genreScience Fiction & Fantasy -0.08638 2.43485 -0.035 0.97171
## runtime -0.02547 0.01567 -1.626 0.10445
## imdb_rating 9.47342 0.45974 20.606 < 2e-16 ***
## critics_score 0.02177 0.01502 1.450 0.14759
## audience_ratingUpright 20.05406 0.77665 25.821 < 2e-16 ***
## best_pic_nomyes 3.52360 1.59063 2.215 0.02710 *
## best_actress_winyes -1.29501 0.89570 -1.446 0.14873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.823 on 633 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8863
## F-statistic: 317.2 on 16 and 633 DF, p-value: < 2.2e-16
Figure 1. Summary of the final model relating audience score to genre, runtime, IMDB rating, critics score, audience rating, best picture nomination and best actress award.
1. Determining the range and concentration of the values of the X variables, and whether
there are any outliers, with pairs plots.
pairs(~ critics_score + genre + audience_score+ best_pic_win+best_actor_win
+best_actress_win+best_dir_win, data=df)
2. Residuals and leverage. We can compute and plot the leverage of each point using the
following commands:
results <- lm(critics_score ~ genre + audience_score, data = df)
lev <- hat(model.matrix(results))  # leverage (hat) values for each observation
plot(lev, col = rainbow(7))
3. Residual plots.
We can use residuals to study whether:
1.The regression function is nonlinear.
2.The error terms have nonconstant variance.
3.The error terms are not independent.
4.There are outliers.
5.The error terms are not normally distributed.
par(mfrow=c(1,3))
plot(df$audience_score, results$res,col=rainbow(7))
plot(df$genre, results$res,col=rainbow(7))
plot(results$fitted, results$res,col=rainbow(7))
To make the same residual plots using the studentized residuals type:
r <- rstudent(results)
plot(df$genre, r, col = rainbow(7))
plot(results$fitted, r, col = rainbow(7))
4. A normal probability plot of the residuals can be used to check the normality assumption.
Here each residual is plotted against its expected value under normality. To make normal
probability plots, as well as a histogram, type:
qqnorm(results$res,col=rainbow(7))
qqline(results$res,col=rainbow(7))
hist(results$res,col=rainbow(7))
layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
barplot(summary(df$genre,maxsum=6),
main="a) Genre of movie ")
barplot(summary(df$best_pic_win),
main="b) Whether best picture Oscar (no, yes)")
barplot(summary(df$best_actor_win),
main="c) Whether or not actor ever won an Oscar (no, yes)")
barplot(summary(df$best_actress_win),
main="d) Whether or not actresses ever won an Oscar (no, yes)")
barplot(summary(df$best_dir_win),
main="e) Whether or not the director ever won an Oscar (no, yes)")
Figure 2. Barplots of movie genre, best picture award, best actor award, best actress award, and best director award.
For numerical variables we used histograms and boxplots. The average runtime of a movie is around
105 minutes and its distribution is slightly right skewed, with some outliers around 250 minutes.
The critics scores are nearly uniform but slightly left skewed. The scores given by the audience
are also nearly uniform with a left-skewed distribution.
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow = TRUE))
boxplot(df$runtime,xlab="Runtime",
main="a) Boxplot of df runtime")
hist(df$runtime, xlab="Runtime",
main="b) Histogram of df runtime")
boxplot(df$critics_score,xlab="Critics score",
main="c) Boxplot of critics score")
hist(df$critics_score, xlab="Critics score",
main="d) Histogram of critics score")
boxplot(df$audience_score,xlab="Audience score",
main="e) Boxplot of audience score")
hist(df$audience_score, xlab="Audience score",
main="f) Histogram of audience score")
Figure 3. Boxplots and histograms of runtime, critics score and audience score.
Decisions about which explanatory variables are added to the model are made by the computer
based entirely on statistical criteria. The forward method starts from scratch: the computer
searches the specified list of possible explanatory variables for the one with the strongest
correlation with the outcome and enters it first. It continues to add variables in order of
how much additional (unique) variance they explain, and stops only when no remaining variable
explains additional (unique) variance not already accounted for by the variables already entered.
The backward method does the opposite: it begins with all of the specified potential
explanatory variables included in the model and then removes those which are not making
a significant contribution. The stepwise option is similar but uses both forward and
backward criteria for deciding when to add or remove an explanatory variable. The forward
selection strategy based on highest R^2 starts with no variables in the model and adds
variables until there is no further improvement in R^2. The second method, backward
elimination, begins with the largest model and eliminates variables until there is no further
improvement in the model. The former uses the highest R^2 as the selection criterion, while the
latter removes variables whose p-values exceed 0.05 (a short sketch contrasting the two
directions follows below).
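The sketch below contrasts the two directions using step(), assuming df (the subset built above) is in memory; note that step() actually uses AIC as its selection criterion rather than R^2 or p-values.
upper_formula <- reformulate(setdiff(names(df), "audience_score"),
                             response = "audience_score")   # all candidate predictors
null_model <- lm(audience_score ~ 1, data = df)              # intercept-only starting point
full_model <- lm(upper_formula, data = df)                   # model with every predictor
forward_fit  <- step(null_model, scope = upper_formula,
                     direction = "forward", trace = FALSE)   # add terms one at a time
backward_fit <- step(full_model,
                     direction = "backward", trace = FALSE)  # drop terms one at a time
formula(forward_fit)                                         # terms chosen going forward
formula(backward_fit)                                        # terms chosen going backward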
In this project I will use backward elimination, starting from a model that includes all of the
variables described in the previous section. Backward elimination will help us determine whether
better results can be obtained with a smaller set of attributes. Its advantage is that it allows
us to start with all the variables and delete one variable at a time until there is no further
improvement in the model.
First, let's fit an initial model with all of the selected variables. The stepwise procedure will
then drop the variables that do not make a significant contribution.
model <- lm(audience_score~.,data=df)
#summary(model)
I will make use of the step function to construct the next model, which will be conclusive
for variable selection.
final_model<-step(model, direction = "both", trace=FALSE )
print(final_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Coefficients:
## (Intercept) genreAnimation
## -8.75008 3.20421
## genreArt House & International genreComedy
## -2.66418 1.35890
## genreDocumentary genreDrama
## 0.16974 -0.79769
## genreHorror genreMusical & Performing Arts
## -2.09686 2.60624
## genreMystery & Suspense genreOther
## -3.13253 -0.02899
## genreScience Fiction & Fantasy runtime
## -0.08638 -0.02547
## imdb_rating critics_score
## 9.47342 0.02177
## audience_ratingUpright best_pic_nomyes
## 20.05406 3.52360
## best_actress_winyes
## -1.29501
The final model retains seven explanatory variables (genre, runtime, imdb_rating, critics_score,
audience_rating, best_pic_nom and best_actress_win) and has an adjusted R-squared of 0.8863,
essentially the same as the full model. I use this model because it captures about 89% of the
variability in audience score with fewer terms. The imdb_rating and audience_rating variables
are the most significant predictors.
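As a quick check of this claim (assuming model and final_model from above are still in memory), the adjusted R-squared values of the full and the reduced model can be compared directly:
summary(model)$adj.r.squared        # full model with all selected variables
summary(final_model)$adj.r.squared  # reduced model chosen by step()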
plot(final_model$residuals~df$runtime,xlab="Runtime",ylab="Residuals",
main="a) Residuals vs. Runtime",col=rainbow(7))
plot(final_model$residuals~df$critics_score,xlab="Critics score",ylab="Residuals",
main="b) Residuals vs critics score",col=rainbow(7))
Figure 4. Residuals plots for Runtime and critics score. The residuals seem scattered
around 0 for both Runtime and Critics score (Figure 4 a and b).
hist(final_model$residuals,ylab="Residuals",
main="a) Histogram of smalll_model residuals")
qqnorm(final_model$residuals,
main="b) Normal probability of residuals")
qqline(final_model$residuals)
Figure 5. Histogram and Q-Q plot of residuals.
Figure 5 shows that the residuals have little skewness, but there are some deviations from
normality in the tails of the Q-Q plot (5b).
plot(final_model$residuals~final_model$fitted,
main="Residuals vs. fitted")
abline(0,0)
Figure 6. Residuals plot
Homoscedasticity: we can check that the residuals do not vary systematically with the predicted
values by plotting the residuals against the values predicted by the regression model. Here
the residuals are generally homoscedastic, although some degree of heteroscedasticity exists
at the right end of Figure 6, which makes the model less accurate in predicting higher values.
I wanted to predict the audience score for a new movie that was not used to fit the model.
For the movie "Teenage Mutant Ninja Turtles: Out of the Shadows (2016)" I extracted the
runtime, critics score and audience score from IMDB and Rotten Tomatoes; the genre was set
to "Action & Adventure". The real and the predicted audience scores were 55 and 69.57
respectively, with a 95% confidence interval between 62.58 and 76.56. In other words, we are
95% confident that the mean audience score of movies with these characteristics lies between
62.58 and 76.56.
new_movie <- data.frame(title_type="Feature Film",
genre="Action & Adventure",
runtime=112,
mpaa_rating="PG-13",
imdb_rating=6.1,
critics_rating="Certified Fresh",
critics_score=38,
audience_rating="Upright",
audience_score=55,
best_pic_nom="yes",
best_pic_win="yes",
best_actor_win="yes",
best_actress_win="no",
best_dir_win="yes",
top200_box="yes")
prediction_Ninja <- predict(model, newdata=new_movie, interval="confidence")
prediction_Ninja
## fit lwr upr
## 1 69.57239 62.58277 76.562
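Note that interval = "confidence" gives an interval for the mean audience score of movies with these characteristics. For a single new movie, a prediction interval, which is wider because it also includes the residual variation, may be more appropriate; a minimal sketch, assuming new_movie and model are still in memory:
predict(model, newdata = new_movie,
        interval = "prediction", level = 0.95)  # wider interval for one new observation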
Further Diagnostic functions
I have used some additional diagnostics for evaluation of the above analysis. These are
given below.
coefficients(final_model) # model coefficients
## (Intercept) genreAnimation
## -8.75007740 3.20421210
## genreArt House & International genreComedy
## -2.66417705 1.35889547
## genreDocumentary genreDrama
## 0.16973937 -0.79768827
## genreHorror genreMusical & Performing Arts
## -2.09685534 2.60624302
## genreMystery & Suspense genreOther
## -3.13253261 -0.02898621
## genreScience Fiction & Fantasy runtime
## -0.08638193 -0.02547179
## imdb_rating critics_score
## 9.47341810 0.02177056
## audience_ratingUpright best_pic_nomyes
## 20.05406062 3.52359708
## best_actress_winyes
## -1.29501450
confint(final_model, level=0.95) # CIs for model parameters
## 2.5 % 97.5 %
## (Intercept) -13.987662766 -3.512492029
## genreAnimation -1.619455999 8.027880201
## genreArt House & International -6.638854191 1.310500097
## genreComedy -0.862260310 3.580051249
## genreDocumentary -2.572078471 2.911557213
## genreDrama -2.699725465 1.104348925
## genreHorror -5.381508923 1.187798234
## genreMusical & Performing Arts -1.687301338 6.899787370
## genreMystery & Suspense -5.585429521 -0.679635705
## genreOther -3.814748466 3.756776036
## genreScience Fiction & Fantasy -4.867744212 4.694980355
## runtime -0.056234674 0.005291097
## imdb_rating 8.570626504 10.376209689
## critics_score -0.007715782 0.051256893
## audience_ratingUpright 18.528928915 21.579192328
## best_pic_nomyes 0.400051430 6.647142723
## best_actress_winyes -3.053923603 0.463894600
fit <- fitted(final_model) # predicted values
summary(fit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.414 45.290 70.500 62.350 78.580 93.040
tres <- residuals(final_model) # residuals
summary(tres)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -21.2300 -4.3820 0.5075 0.0000 4.3520 24.1500
anova(final_model) # anova table
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 51633 5163 110.9111 < 2.2e-16 ***
## runtime 1 6236 6236 133.9459 < 2.2e-16 ***
## imdb_rating 1 145918 145918 3134.4033 < 2.2e-16 ***
## critics_score 1 885 885 19.0178 1.511e-05 ***
## audience_rating 1 31298 31298 672.3087 < 2.2e-16 ***
## best_pic_nom 1 191 191 4.0976 0.04336 *
## best_actress_win 1 97 97 2.0904 0.14873
## Residuals 633 29468 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
covar <- vcov(final_model) # covariance matrix for model parameters
summary(covar)
## (Intercept) genreAnimation genreArt House & International
## Min. :-1.42530 Min. :-1.425296 Min. :-0.560849
## 1st Qu.:-0.89643 1st Qu.:-0.001702 1st Qu.: 0.000637
## Median :-0.40278 Median : 0.727011 Median : 0.700980
## Mean : 0.04481 Mean : 0.652814 Mean : 0.595612
## 3rd Qu.: 0.01316 3rd Qu.: 0.749107 3rd Qu.: 0.755176
## Max. : 7.11383 Max. : 6.033875 Max. : 4.096804
## genreComedy genreDocumentary genreDrama
## Min. :-1.0465266 Min. :-0.3759 Min. :-0.51092
## 1st Qu.:-0.0006876 1st Qu.:-0.0028 1st Qu.:-0.00763
## Median : 0.7163751 Median : 0.7281 Median : 0.71852
## Mean : 0.3898457 Mean : 0.5113 Mean : 0.41763
## 3rd Qu.: 0.7220625 3rd Qu.: 0.8139 3rd Qu.: 0.76528
## Max. : 1.2793798 Max. : 1.9495 Max. : 0.93817
## genreHorror genreMusical & Performing Arts
## Min. :-0.993640 Min. :-0.199098
## 1st Qu.: 0.002677 1st Qu.:-0.001355
## Median : 0.700980 Median : 0.690401
## Mean : 0.489095 Mean : 0.680041
## 3rd Qu.: 0.736532 3rd Qu.: 0.785155
## Max. : 2.797825 Max. : 4.780500
## genreMystery & Suspense genreOther
## Min. :-0.4027783 Min. :-0.648327
## 1st Qu.:-0.0005663 1st Qu.:-0.002233
## Median : 0.7163751 Median : 0.722063
## Mean : 0.4571015 Mean : 0.565469
## 3rd Qu.: 0.7589054 3rd Qu.: 0.758905
## Max. : 1.5602710 Max. : 3.716620
## genreScience Fiction & Fantasy runtime imdb_rating
## Min. :-1.034839 Min. :-1.804e-02 Min. :-0.896426
## 1st Qu.: 0.000327 1st Qu.:-1.355e-03 1st Qu.:-0.044359
## Median : 0.718849 Median : 1.143e-05 Median :-0.003888
## Mean : 0.676500 Mean :-8.854e-04 Mean :-0.055338
## 3rd Qu.: 0.732801 3rd Qu.: 1.103e-03 3rd Qu.: 0.017312
## Max. : 5.928499 Max. : 3.911e-03 Max. : 0.211357
## critics_score audience_ratingUpright best_pic_nomyes
## Min. :-0.0038883 Min. :-0.16364 Min. :-0.20995
## 1st Qu.:-0.0022330 1st Qu.:-0.04908 1st Qu.:-0.04533
## Median :-0.0014056 Median :-0.01836 Median :-0.00763
## Mean :-0.0005441 Mean : 0.05427 Mean : 0.18050
## 3rd Qu.:-0.0001545 3rd Qu.: 0.02257 3rd Qu.: 0.06148
## Max. : 0.0131628 Max. : 0.73329 Max. : 2.53010
## best_actress_winyes
## Min. :-0.2099536
## 1st Qu.:-0.0938484
## Median :-0.0177532
## Mean : 0.0136220
## 3rd Qu.:-0.0001545
## Max. : 0.8022842
infl <- influence(final_model) # regression diagnostics
summary(infl)
## Length Class Mode
## hat 650 -none- numeric
## coefficients 11050 -none- numeric
## sigma 650 -none- numeric
## wt.res 650 -none- numeric
plot(residuals(final_model),col=rainbow(7))
plot(residuals(model),col=rainbow(7))
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
Outliers: assessing outliers.
car::outlierTest(final_model)# Shows outliers
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 123 3.622165 0.00031559 0.20514
The Bonferroni p-value for the most extreme observation is shown above; next, a Q-Q plot of the
studentized residuals and leverage plots.
require(car)
car::qqPlot(final_model, main="QQ Plot") #qq plot for studentized resid
leveragePlots(final_model) # leverage plots
Influential observations: added-variable plots. This function constructs added-variable
(also called partial-regression) plots for linear and generalized linear models.
require(car)
car::avPlots(final_model)
Cook's D plot: identify D values > 4/(n-k-1).
cutoff <- 4/((nrow(df)-length(final_model$coefficients)-2))
plot(final_model, which=4, cook.levels=cutoff)
Influence Plot
influencePlot(final_model,id.method="identify", main="Influence Plot",
sub="Circle size is proportial to Cook's Distance")
Normality of residuals: Q-Q plot of the studentized residuals.
qqPlot(final_model, main="QQ Plot")
Distribution of studentized residuals
library(MASS)
sresid <- studres(final_model)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit,col=rainbow(7))
Non-constant error variance: evaluate homoscedasticity with the non-constant variance score test.
ncvTest(final_model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 55.1253 Df = 1 p = 1.130859e-13
Plot of studentized residuals vs. fitted values (spread-level plot):
spreadLevelPlot(final_model,col=rainbow(5))
##
## Suggested power transformation: 1.428811
Multicollinearity: evaluate collinearity. For the target multiple regression model, because all
VIF values are less than 10, there is no evidence of multicollinearity.
require(fmsb)
res <- lm(critics_score ~ genre+audience_score, data=df)
summary(res)
##
## Call:
## lm(formula = critics_score ~ genre + audience_score, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.410 -13.557 2.582 14.441 50.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.16813 3.34770 -2.141 0.032636 *
## genreAnimation 0.96657 7.01397 0.138 0.890437
## genreArt House & International 0.98163 5.81881 0.169 0.866086
## genreComedy 0.64434 3.22918 0.200 0.841908
## genreDocumentary 18.67671 3.88761 4.804 1.94e-06 ***
## genreDrama 10.34083 2.73506 3.781 0.000171 ***
## genreHorror 9.71695 4.78999 2.029 0.042915 *
## genreMusical & Performing Arts 11.39751 6.28868 1.812 0.070397 .
## genreMystery & Suspense 11.54559 3.54243 3.259 0.001176 **
## genreOther 11.78540 5.52343 2.134 0.033247 *
## genreScience Fiction & Fantasy 11.18576 7.00537 1.597 0.110818
## audience_score 0.90358 0.04256 21.230 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.69 on 638 degrees of freedom
## Multiple R-squared: 0.5277, Adjusted R-squared: 0.5195
## F-statistic: 64.8 on 11 and 638 DF, p-value: < 2.2e-16
Multicollinearity is nearly always a problem in multiple regression models:
1) Even small degrees of multicollinearity can cause serious problems for an analysis
if you are interested in the effects of individual predictors.
2) Small samples are particularly vulnerable to multicollinearity problems because
multicollinearity reduces your effective sample size for the effects of individual
predictors
3) There are no 'easy' solutions (e.g., dropping predictors is generally a bad idea)
Checking multicollinearity of the independent variables.
car::vif(final_model)
## GVIF Df GVIF^(1/(2*Df))
## genre 1.646620 10 1.025250
## runtime 1.293612 1 1.137371
## imdb_rating 3.467809 1 1.862205
## critics_score 2.537211 1 1.592862
## audience_rating 2.055665 1 1.433759
## best_pic_nom 1.155186 1 1.074796
## best_actress_win 1.103369 1 1.050414
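To illustrate what these values mean (a sketch, assuming df is in memory): the VIF of a numeric predictor equals 1/(1 - R^2) from regressing that predictor on the other explanatory variables in the model. For runtime, this should roughly reproduce the GVIF reported above:
aux <- lm(runtime ~ genre + imdb_rating + critics_score +
            audience_rating + best_pic_nom + best_actress_win,
          data = df)                       # regress runtime on the other predictors
1 / (1 - summary(aux)$r.squared)           # manual VIF for runtime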
Nonlinearity: evaluate nonlinearity with component + residual (partial residual) plots.
crPlots(final_model)
Evaluate nonlinearity with CERES plots:
require(car)
ceresPlots(final_model,col=rainbow(2))
## Warning in ceresPlots(final_model, col = rainbow(2)): Factors skipped in
## drawing CERES plots.
Additional diagnostic help: the gvlma() function in the gvlma package performs a global
validation of linear model assumptions as well as separate evaluations of skewness, kurtosis,
and heteroscedasticity. Global test of model assumptions:
library(gvlma)
gvmodel <- gvlma(final_model)
summary(gvmodel)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2299 -4.3817 0.5075 4.3523 24.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.75008 2.66718 -3.281 0.00109 **
## genreAnimation 3.20421 2.45639 1.304 0.19256
## genreArt House & International -2.66418 2.02406 -1.316 0.18856
## genreComedy 1.35890 1.13110 1.201 0.23005
## genreDocumentary 0.16974 1.39624 0.122 0.90328
## genreDrama -0.79769 0.96859 -0.824 0.41050
## genreHorror -2.09686 1.67267 -1.254 0.21045
## genreMusical & Performing Arts 2.60624 2.18644 1.192 0.23371
## genreMystery & Suspense -3.13253 1.24911 -2.508 0.01240 *
## genreOther -0.02899 1.92785 -0.015 0.98801
## genreScience Fiction & Fantasy -0.08638 2.43485 -0.035 0.97171
## runtime -0.02547 0.01567 -1.626 0.10445
## imdb_rating 9.47342 0.45974 20.606 < 2e-16 ***
## critics_score 0.02177 0.01502 1.450 0.14759
## audience_ratingUpright 20.05406 0.77665 25.821 < 2e-16 ***
## best_pic_nomyes 3.52360 1.59063 2.215 0.02710 *
## best_actress_winyes -1.29501 0.89570 -1.446 0.14873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.823 on 633 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8863
## F-statistic: 317.2 on 16 and 633 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = final_model)
##
## Value p-value Decision
## Global Stat 18.7023 0.0008992 Assumptions NOT satisfied!
## Skewness 0.2177 0.6408296 Assumptions acceptable.
## Kurtosis 6.5022 0.0107744 Assumptions NOT satisfied!
## Link Function 11.7270 0.0006160 Assumptions NOT satisfied!
## Heteroscedasticity 0.2555 0.6132416 Assumptions acceptable.
anova(final_model,test="Chisq")
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 51633 5163 110.9111 < 2.2e-16 ***
## runtime 1 6236 6236 133.9459 < 2.2e-16 ***
## imdb_rating 1 145918 145918 3134.4033 < 2.2e-16 ***
## critics_score 1 885 885 19.0178 1.511e-05 ***
## audience_rating 1 31298 31298 672.3087 < 2.2e-16 ***
## best_pic_nom 1 191 191 4.0976 0.04336 *
## best_actress_win 1 97 97 2.0904 0.14873
## Residuals 633 29468 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vartest <- anova(final_model,test="Chisq")
plot(vartest)
Determining the popularity of a movie is not a simple task. The hidden attributes of
a movie seem to have some degree of correlation with its popularity. From the regression
analysis it appears that variables such as critics score, audience score, IMDB rating
and runtime have significant correlations. The movie genre, runtime and critics score can
be used to predict the popularity of a movie. However, given the shortcomings of the current
subsetting, projecting popularity with ratings external to IMDB and Rotten Tomatoes could be
more useful. This issue can be addressed in future studies.