The IMDb movies data set was taken from Hadley Wickham's data sets. It contains 5,215 movies released between 1903 and 2005, with the movie information provided by IMDb. The variables are listed below. We use the IMDb rating as the dependent variable, and the number of IMDb user votes and the movie budget as independent variables, to check whether they have an effect on people’s ratings. A movie with a high rating (up to 10) is treated as a “good” movie, as recognized by IMDb users. In contrast, a movie with a low rating is treated as a “bad” one.
Title: Film Title
Year: The year the film was released
Length: The length of the movie
Budgets: The Movie Budget
Rating: IMDb Ratings
Votes: Number of IMDb users who voted for the movie
rm(list=ls())
Movie<-read.csv("~/Desktop/Applied_Regression/IMDb_Movie.csv");
Movie.use<-subset(Movie,select = c(Rating,Budgets, Votes));
head(Movie.use,n=14L);
## Rating Budgets Votes
## 1 7.2 450000 281
## 2 1.6 19000 7996
## 3 4.8 23000000 799
## 4 3.7 5000000 271
## 5 6.7 16000000 19095
## 6 5.6 1100000 181
## 7 3.3 140000 19
## 8 7.8 200000 299
## 9 5.8 200000 7
## 10 4.7 85000000 1987
## 11 7.1 6000000 605
## 12 8.7 340000 29278
## 13 2.8 150000 89
## 14 6.4 37000000 7859
tail(Movie.use,n=14L);
## Rating Budgets Votes
## 5202 8.1 0 24
## 5203 7.0 5000000 4820
## 5204 6.5 3240816 335
## 5205 6.3 500000 61
## 5206 7.8 8000000 844
## 5207 5.9 800000 26
## 5208 4.4 176357 49
## 5209 4.6 100000 13
## 5210 7.7 6000000 633
## 5211 6.1 28000000 18277
## 5212 7.5 1300000 168
## 5213 8.0 1000000 10
## 5214 5.5 85000000 18514
## 5215 3.9 87000000 1584
attach(Movie.use)
summary(Movie.use)
## Rating Budgets Votes
## Min. : 1.000 Min. : 0 Min. : 5
## 1st Qu.: 5.200 1st Qu.: 250000 1st Qu.: 67
## Median : 6.300 Median : 3000000 Median : 612
## Mean : 6.141 Mean : 13412513 Mean : 4974
## 3rd Qu.: 7.200 3rd Qu.: 15000000 3rd Qu.: 4642
## Max. :10.000 Max. :200000000 Max. :157608
plot(Movie.use)
str(Movie.use)
## 'data.frame': 5215 obs. of 3 variables:
## $ Rating : num 7.2 1.6 4.8 3.7 6.7 5.6 3.3 7.8 5.8 4.7 ...
## $ Budgets: int 450000 19000 23000000 5000000 16000000 1100000 140000 200000 200000 85000000 ...
## $ Votes : int 281 7996 799 271 19095 181 19 299 7 1987 ...
\(H_{0,\text{Budgets}}\): The budget of a movie has no effect on the quality of the movie, as measured by its IMDb rating. \(H_{1,\text{Budgets}}\): The budget of a movie has an effect on the quality of the movie.
\(H_{0,\text{Votes}}\): The number of votes by IMDb users has no effect on the quality of the movie, as measured by its IMDb rating. \(H_{1,\text{Votes}}\): The number of votes by IMDb users has an effect on the quality of the movie.
First, to find out which independent variables (IVs) have a significant effect on the IMDb users’ ratings, I include all three continuous IVs in the model: Length, Budgets and Votes.
attach(Movie)
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
Movie.useAll<- subset(Movie, select = c(Rating,Length, Budgets, Votes))
attach(Movie.useAll)
## The following objects are masked from Movie:
##
## Budgets, Length, Rating, Votes
##
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
cor(Movie.useAll)
## Rating Length Budgets Votes
## Rating 1.00000000 0.02836237 -0.01422905 0.2646416
## Length 0.02836237 1.00000000 0.33818503 0.3187666
## Budgets -0.01422905 0.33818503 1.00000000 0.4412935
## Votes 0.26464163 0.31876661 0.44129350 1.0000000
Based on the correlation matrix, Votes has the largest correlation with Rating (0.26), followed by Length (0.03) and Budgets (-0.01). The variables are therefore entered into the model in that order.
Model.step1<-lm(Movie.useAll$Rating ~ Movie.useAll$Votes)
summary(Model.step1)
##
## Call:
## lm(formula = Movie.useAll$Rating ~ Movie.useAll$Votes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9661 -0.9058 0.1282 1.0210 4.0342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.966e+00 2.247e-02 265.48 <2e-16 ***
## Movie.useAll$Votes 3.520e-05 1.777e-06 19.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.492 on 5213 degrees of freedom
## Multiple R-squared: 0.07004, Adjusted R-squared: 0.06986
## F-statistic: 392.6 on 1 and 5213 DF, p-value: < 2.2e-16
Model.step2<-lm(Movie.useAll$Rating ~ Movie.useAll$Votes+Movie.useAll$Length)
summary(Model.step2)
##
## Call:
## lm(formula = Movie.useAll$Rating ~ Movie.useAll$Votes + Movie.useAll$Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0501 -0.9206 0.1446 1.0236 3.9120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.227e+00 6.304e-02 98.768 < 2e-16 ***
## Movie.useAll$Votes 3.784e-05 1.871e-06 20.226 < 2e-16 ***
## Movie.useAll$Length -2.857e-03 6.447e-04 -4.431 9.57e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.489 on 5212 degrees of freedom
## Multiple R-squared: 0.07353, Adjusted R-squared: 0.07317
## F-statistic: 206.8 on 2 and 5212 DF, p-value: < 2.2e-16
Model.step3<-lm(Movie.useAll$Rating ~ Movie.useAll$Votes+Movie.useAll$Length+Movie.useAll$Budgets)
summary(Model.step3)
##
## Call:
## lm(formula = Movie.useAll$Rating ~ Movie.useAll$Votes + Movie.useAll$Length +
## Movie.useAll$Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0970 -0.8610 0.1752 0.9740 3.8623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.176e+00 6.261e-02 98.645 <2e-16 ***
## Movie.useAll$Votes 4.554e-05 1.997e-06 22.800 <2e-16 ***
## Movie.useAll$Length -1.287e-03 6.563e-04 -1.961 0.05 *
## Movie.useAll$Budgets -1.032e-08 1.002e-09 -10.303 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 5211 degrees of freedom
## Multiple R-squared: 0.09202, Adjusted R-squared: 0.0915
## F-statistic: 176 on 3 and 5211 DF, p-value: < 2.2e-16
R-squared changes little when Length is added to the model (from 0.0700 to 0.0735), yet the coefficient of Length is still significant. With more than 5,000 observations, even a very small effect can be statistically significant, so we need to check whether the size of the data set is appropriate. Using G*Power with an alpha error probability of 0.05, a power of 0.95 and an effect size of 0.1005, we find that the total sample size needed is 156, which is much smaller than the data set provided. We therefore randomly pick 156 movies from the data set.
# form an index array
index <- 1:5215
samplesize <- 156
set.seed(99)
# randomly pick row indices from the original set (without replacement)
Movie.sampleindex <- sample(index, samplesize)
# construct a new set containing the sampled rows in one subsetting step
Movie.sample <- Movie.useAll[Movie.sampleindex, ]
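The sample-size figure quoted from G*Power can be cross-checked in base R. The following is a minimal sketch, not the G*Power procedure itself: it computes the power of the overall F test from the noncentral F distribution with noncentrality parameter f² · N, and searches for the smallest N that reaches the target power. The helper names `power_f2` and `required_n` are hypothetical, and the number of predictors (`n_pred = 2`) is an assumption, since the text does not state which value was entered.

```r
# Power of the overall F test in multiple regression for a given total
# sample size n, number of predictors n_pred and Cohen's f^2.
power_f2 <- function(n, n_pred, f2, alpha = 0.05) {
  df1 <- n_pred
  df2 <- n - n_pred - 1
  crit <- qf(1 - alpha, df1, df2)                     # critical F under H0
  pf(crit, df1, df2, ncp = f2 * n, lower.tail = FALSE) # power under H1
}

# Smallest total sample size whose power reaches the target.
required_n <- function(n_pred, f2, alpha = 0.05, power = 0.95) {
  n <- n_pred + 2  # smallest n with a positive residual df
  while (power_f2(n, n_pred, f2, alpha) < power) n <- n + 1
  n
}

# n_pred = 2 is an assumption, not a value given in the report.
n_needed <- required_n(n_pred = 2, f2 = 0.1005)
n_needed
```

Changing `n_pred` to 3 shows how sensitive the required sample size is to the number of predictors assumed in the power analysis.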
Next, we repeat the stepwise procedure on the sample to check the model again.
attach(Movie.sample)
## The following objects are masked from Movie.useAll:
##
## Budgets, Length, Rating, Votes
##
## The following objects are masked from Movie:
##
## Budgets, Length, Rating, Votes
##
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
Model.samplestep1<-lm(Movie.sample$Rating ~ Movie.sample$Votes)
summary(Model.samplestep1)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4054 -0.8304 0.1431 0.9708 3.6944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.805e+00 1.179e-01 49.249 < 2e-16 ***
## Movie.sample$Votes 4.364e-05 9.439e-06 4.623 7.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.364 on 154 degrees of freedom
## Multiple R-squared: 0.1219, Adjusted R-squared: 0.1162
## F-statistic: 21.37 on 1 and 154 DF, p-value: 7.949e-06
Model.samplestep2<-lm(Movie.sample$Rating ~ Movie.sample$Votes+Movie.sample$Length)
summary(Model.samplestep2)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4328 -0.8378 0.1923 0.9592 3.5602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.962e+00 3.465e-01 17.207 < 2e-16 ***
## Movie.sample$Votes 4.543e-05 1.017e-05 4.465 1.54e-05 ***
## Movie.sample$Length -1.722e-03 3.583e-03 -0.481 0.631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.367 on 153 degrees of freedom
## Multiple R-squared: 0.1232, Adjusted R-squared: 0.1117
## F-statistic: 10.75 on 2 and 153 DF, p-value: 4.286e-05
Model.samplestep3<-lm(Movie.sample$Rating ~ Movie.sample$Votes+Movie.sample$Length+Movie.sample$Budgets)
summary(Model.samplestep3)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Length +
## Movie.sample$Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4951 -0.7111 0.1209 0.9263 3.5930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.909e+00 3.451e-01 17.122 < 2e-16 ***
## Movie.sample$Votes 5.212e-05 1.075e-05 4.848 3.05e-06 ***
## Movie.sample$Length -1.901e-04 3.655e-03 -0.052 0.9586
## Movie.sample$Budgets -1.027e-08 5.662e-09 -1.815 0.0716 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.357 on 152 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1248
## F-statistic: 8.371 on 3 and 152 DF, p-value: 3.469e-05
The coefficient of Length is no longer significant once the sample size matches the power analysis (p = 0.63 in step 2 and p = 0.96 in step 3). Therefore, we drop Length and use the other two variables, Votes and Budgets, as independent variables.
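The decision to drop a variable can also be framed as a partial F test between nested models. The sketch below runs on simulated data, because the CSV file is not available here; the data frame `sim` and the model names `full` and `reduced` are hypothetical, and the simulated coefficients are only loosely modelled on the estimates in this report.

```r
set.seed(1)
n <- 156
# Simulated stand-in for Movie.sample (an assumption, not the real data).
sim <- data.frame(
  Votes   = rexp(n, rate = 1 / 5000),
  Length  = rnorm(n, mean = 100, sd = 20),
  Budgets = rexp(n, rate = 1 / 1e7)
)
# Rating depends on Votes and Budgets only; Length has no true effect here.
sim$Rating <- 6 + 5e-05 * sim$Votes - 1e-08 * sim$Budgets + rnorm(n, sd = 1.35)

full    <- lm(Rating ~ Votes + Budgets + Length, data = sim)
reduced <- lm(Rating ~ Votes + Budgets, data = sim)

# Partial F test: does adding Length reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(reduced, full)
cmp
```

When a single term is dropped, the F statistic equals the square of that term's t value in the full model, so the test agrees with the coefficient table. Passing `data = sim` to `lm()` also avoids relying on `attach()`.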
attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 3):
##
## Budgets, Length, Rating, Votes
##
## The following objects are masked from Movie.useAll:
##
## Budgets, Length, Rating, Votes
##
## The following objects are masked from Movie:
##
## Budgets, Length, Rating, Votes
##
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
Model<-lm(Rating ~ Votes+Budgets)
summary(Model)
##
## Call:
## lm(formula = Rating ~ Votes + Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4926 -0.7138 0.1259 0.9253 3.6072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.892e+00 1.258e-01 46.852 < 2e-16 ***
## Votes 5.198e-05 1.036e-05 5.019 1.43e-06 ***
## Budgets -1.034e-08 5.491e-09 -1.884 0.0615 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1306
## F-statistic: 12.64 on 2 and 153 DF, p-value: 8.329e-06
library(scatterplot3d)
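# x and y follow the order of the terms in Model (Votes, then Budgets) so that plane3d() draws the fitted plane on the matching axes
Model3D<-scatterplot3d(Votes, Budgets, Rating,pch=21,main = "Movie Ratings vs. Movie Budget and IMDb User Votes",xlab = "Votes",ylab = "Budgets",zlab = "Rating",axis = TRUE)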
Model3D$plane3d(Model,col="red")
Model.res<-resid(Model)
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.res, main="Model Residual Histogram",xlab = "Residuals")
qqnorm(Model.res)
Judging from the figures above, the residuals are not exactly normally distributed. The histogram shows that the residual distribution is left-skewed.
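A numerical check can complement the visual impression from the histogram and Q-Q plot. The sketch below computes the sample skewness and runs the Shapiro-Wilk test from base R on a simulated left-skewed sample; `res_sim` is a hypothetical stand-in for `Model.res`, so the numbers it produces say nothing about the actual residuals.

```r
set.seed(2)
# Hypothetical stand-in for Model.res: a left-skewed sample centred near zero.
res_sim <- -(rexp(156) - 1)

# Sample skewness; negative values indicate a longer left tail.
skew <- mean((res_sim - mean(res_sim))^3) / sd(res_sim)^3

# Shapiro-Wilk test; H0 is that the sample comes from a normal distribution.
sw <- shapiro.test(res_sim)
c(skewness = skew, W = unname(sw$statistic), p.value = sw$p.value)
```

With the objects in this report, the same two steps would be applied to `Model.res` directly.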
plot(Budgets, Model.res,xlab ="Budgets",ylab = "Residual")
plot(Votes, Model.res,xlab="Votes",ylab = "Residual")
It is hard to tell from the graphs whether the variance of the residuals is constant, since most of the residuals fall in a small interval. So we need White's test to check whether the residuals are heteroskedastic.
Corrcheck<-cbind(Votes,Budgets, Model.res)
cor(Corrcheck)
## Votes Budgets Model.res
## Votes 1.000000e+00 4.275957e-01 -4.068226e-17
## Budgets 4.275957e-01 1.000000e+00 2.460192e-17
## Model.res -4.068226e-17 2.460192e-17 1.000000e+00
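The residuals are uncorrelated with the two independent variables: the correlations are on the order of 1e-17, i.e. numerically zero. This is expected, because least-squares residuals from a model with an intercept are orthogonal to the regressors by construction. Also, the two independent variables are not strongly collinear: their correlation is 0.43, which is below 0.5.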
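The collinearity check can be expressed as a variance inflation factor (VIF). With only two predictors, the R² from regressing one predictor on the other equals their squared correlation, so the VIF follows directly from the value in the matrix above. The variable names `r_vb` and `vif` are introduced here for illustration.

```r
# Correlation between Votes and Budgets in the 156-movie sample,
# taken from the correlation matrix above.
r_vb <- 0.4275957

# With two predictors, both share the same variance inflation factor.
vif <- 1 / (1 - r_vb^2)
round(vif, 2)  # 1.22
```

A VIF this close to 1 means the standard errors of the two coefficients are only mildly inflated by the correlation between the predictors.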
acf(Model.res,main="Autocorrelation of Residual")
According to the figure above, the residuals show no autocorrelation.
# 5. Interpret
summary(Model)
##
## Call:
## lm(formula = Rating ~ Votes + Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4926 -0.7138 0.1259 0.9253 3.6072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.892e+00 1.258e-01 46.852 < 2e-16 ***
## Votes 5.198e-05 1.036e-05 5.019 1.43e-06 ***
## Budgets -1.034e-08 5.491e-09 -1.884 0.0615 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1306
## F-statistic: 12.64 on 2 and 153 DF, p-value: 8.329e-06
library(het.test)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
library(vars)
modeldata<-data.frame(Model.res,fitted(Model))
modeltest<-VAR(modeldata,p=1)
whites.htest(modeltest)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 10.9706
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.5314
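The call above applies `whites.htest` to a VAR fitted on the residuals and fitted values. For a cross-sectional regression, White's test is more commonly run as an auxiliary regression: the squared residuals are regressed on the predictors, their squares and their cross product, and n · R² is compared with a chi-squared distribution. The following is a minimal sketch of that form on simulated data; `sim`, `fit` and `aux` are hypothetical names, and the predictors are simulated on a rescaled range (thousands of votes, millions of dollars) to keep the squared terms well conditioned.

```r
set.seed(3)
n <- 156
# Simulated stand-in for the sample (an assumption, not the real data).
sim <- data.frame(Votes = rexp(n, rate = 1 / 5), Budgets = rexp(n, rate = 1 / 13))
sim$Rating <- 6 + 0.05 * sim$Votes - 0.01 * sim$Budgets + rnorm(n, sd = 1.35)

fit <- lm(Rating ~ Votes + Budgets, data = sim)
sim$e2 <- resid(fit)^2

# Auxiliary regression of squared residuals on levels, squares and cross term.
aux <- lm(e2 ~ Votes + Budgets + I(Votes^2) + I(Budgets^2) + I(Votes * Budgets),
          data = sim)

# LM statistic n * R^2, chi-squared with one df per auxiliary regressor.
lm_stat <- n * summary(aux)$r.squared
df_aux  <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = df_aux, lower.tail = FALSE)
c(statistic = lm_stat, df = df_aux, p.value = p_value)
```

As in the output above, a large p-value means the null hypothesis of homoskedasticity is not rejected.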