The IMDB-Movies sets was selected from Hadley Wickham datasets, containing 5,215 movies wich are ranged over 1903 ~ 2005. The variable names are listed below. Also the movie information is provided by IMDb. The dataset is retrived from URL: https://github.com/hadley/data-movies.
Title: Film Title
Year: The year that the Film publicated
Length: The length of the movie
Budgets: The Movie Budget
Rating: IMDb Ratings
Vote: Number of IMDb users who vote for the movies
#data read in
rm(list=ls())
Movie.data<-read.csv("~/Desktop/Applied_Regression/IMDb_Movie.csv");
head(Movie.data,n=14L);
## Title Year Length Budgets Rating Votes
## 1 'G' Men 1935 85 450000 7.2 281
## 2 'Manos' the Hands of Fate 1966 74 19000 1.6 7996
## 3 'Til There Was You 1997 113 23000000 4.8 799
## 4 .com for Murder 2002 96 5000000 3.7 271
## 5 10 Things I Hate About You 1999 97 16000000 6.7 19095
## 6 100 Mile Rule 2002 98 1100000 5.6 181
## 7 100 Proof 1997 94 140000 3.3 19
## 8 101 1989 117 200000 7.8 299
## 9 101-vy kilometer 2001 103 200000 5.8 7
## 10 102 Dalmatians 2000 100 85000000 4.7 1987
## 11 0.468055556 2003 95 6000000 7.1 605
## 12 12 Angry Men 1957 96 340000 8.7 29278
## 13 12 to the Moon 1960 74 150000 2.8 89
## 14 13 Going On 30 2004 98 37000000 6.4 7859
tail(Movie.data,n=14L);
## Title Year Length Budgets Rating Votes
## 5202 Zero 2000 20 0 8.1 24
## 5203 Zero Effect 1998 116 5000000 7.0 4820
## 5204 Ziegfeld Follies 1946 110 3240816 6.5 335
## 5205 Zimmer Feri 1998 97 500000 6.3 61
## 5206 Zivot je cudo 2004 155 8000000 7.8 844
## 5207 Zohar 1993 116 800000 5.9 26
## 5208 Zombies of the Stratosphere 1952 167 176357 4.4 49
## 5209 Zoo Radio 1990 90 100000 4.6 13
## 5210 Zookeeper, The 2001 108 6000000 7.7 633
## 5211 Zoolander 2001 89 28000000 6.1 18277
## 5212 Zvezda 2002 97 1300000 7.5 168
## 5213 Zzyzx 2005 90 1000000 8.0 10
## 5214 xXx 2002 132 85000000 5.5 18514
## 5215 xXx: State of the Union 2005 101 87000000 3.9 1584
attach(Movie.data)
summary(Movie.data)
## Title Year Length
## Star Is Born, A : 3 Min. :1903 Min. : 1.00
## Alamo, The : 2 1st Qu.:1975 1st Qu.: 86.00
## Alfie : 2 Median :1996 Median : 97.00
## Assault on Precinct 13: 2 Mean :1985 Mean : 95.99
## Barbarians, The : 2 3rd Qu.:2001 3rd Qu.:111.00
## Ben-Hur : 2 Max. :2005 Max. :390.00
## (Other) :5202
## Budgets Rating Votes
## Min. : 0 Min. : 1.000 Min. : 5
## 1st Qu.: 250000 1st Qu.: 5.200 1st Qu.: 67
## Median : 3000000 Median : 6.300 Median : 612
## Mean : 13412513 Mean : 6.141 Mean : 4974
## 3rd Qu.: 15000000 3rd Qu.: 7.200 3rd Qu.: 4642
## Max. :200000000 Max. :10.000 Max. :157608
##
str(Movie.data)
## 'data.frame': 5215 obs. of 6 variables:
## $ Title : Factor w/ 5137 levels ".com for Murder",..: 2 3 4 1 8 9 10 11 12 13 ...
## $ Year : int 1935 1966 1997 2002 1999 2002 1997 1989 2001 2000 ...
## $ Length : int 85 74 113 96 97 98 94 117 103 100 ...
## $ Budgets: int 450000 19000 23000000 5000000 16000000 1100000 140000 200000 200000 85000000 ...
## $ Rating : num 7.2 1.6 4.8 3.7 6.7 5.6 3.3 7.8 5.8 4.7 ...
## $ Votes : int 281 7996 799 271 19095 181 19 299 7 1987 ...
In the dataset, Length, Year, Budgets, Vote and Rating are continous variable.Hence,I will use Rating as dependent variable, Vote and Budegets as independent variables to check whether they have effect on Rating.High ratings of movie (grade:10) can be treated as a “good” movie which is reconized by IMDb users. In constrast, low ratings of movie will be seen as “bad” one. The Units for the variables we used are: Rating (no units); Vote( no units); Budgets(USD)
#create a subset for the DVs and IVs
Movie.use <- subset(Movie.data, select = c(Rating, Budgets, Votes))
\(H_0_Budgets\):The Budgets of the movie have no effect on the quality of the movie. \(H_1_Budgets\): The Budgets of the movie have effect on the quality of the movie.
\(H_0_Votes\): The number of the votes by IMDb users have no effect on the quality of the movie. \(H_1_Votes\): The number of the votes by IMDb users have effect on the quality of the movie.
For the model, I will use step-wise method to see if the variables I selected as independent variables can explained the dependent variables. Firstly, I will start with the correlation between each of the variables.
attach(Movie.use)
## The following objects are masked from Movie.data:
##
## Budgets, Rating, Votes
cor(Movie.use)
## Rating Budgets Votes
## Rating 1.00000000 -0.01422905 0.2646416
## Budgets -0.01422905 1.00000000 0.4412935
## Votes 0.26464163 0.44129350 1.0000000
Based on the correlation matrix, I will start with Votes, since it has the highest correlation with Ratings
Model.step1<-lm(Movie.use$Rating ~ Movie.use$Votes)
summary(Model.step1)
##
## Call:
## lm(formula = Movie.use$Rating ~ Movie.use$Votes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9661 -0.9058 0.1282 1.0210 4.0342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.966e+00 2.247e-02 265.48 <2e-16 ***
## Movie.use$Votes 3.520e-05 1.777e-06 19.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.492 on 5213 degrees of freedom
## Multiple R-squared: 0.07004, Adjusted R-squared: 0.06986
## F-statistic: 392.6 on 1 and 5213 DF, p-value: < 2.2e-16
Model.step2<-lm(Movie.use$Rating ~ Movie.use$Votes+Movie.use$Budgets)
summary(Model.step2)
##
## Call:
## lm(formula = Movie.use$Rating ~ Movie.use$Votes + Movie.use$Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0633 -0.8637 0.1651 0.9636 3.9371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.063e+00 2.389e-02 253.81 <2e-16 ***
## Movie.use$Votes 4.475e-05 1.957e-06 22.86 <2e-16 ***
## Movie.use$Budgets -1.078e-08 9.748e-10 -11.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 5212 degrees of freedom
## Multiple R-squared: 0.09135, Adjusted R-squared: 0.091
## F-statistic: 262 on 2 and 5212 DF, p-value: < 2.2e-16
The R-square does not changed much when adding Budgets variable into the model. But the coefficients for the two variables are still significant. Therefore, we need to check whether the size of dataset is appropriate. We use G* Power with alpha error prob 0.05, power 0.95 and effect size 0.1005, we find the total sample size we needed is 156, which much smaller than the dataset provided. Then, we randomly pick 156 samples from the dataset.
#form an index array
samplesize<-156
set.seed(99)
samplerow<- nrow(Movie.use)
#random pick index from the oringinal set
model.index <- sample(samplerow, samplesize, replace = FALSE)
#construct a new set containing the samples
Movie.sample<- Movie.use[model.index,]
Next, I use stepwise to check the model again.
attach(Movie.sample)
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.data:
##
## Budgets, Rating, Votes
Model.samplestep1<-lm(Movie.sample$Rating ~ Movie.sample$Votes)
summary(Model.samplestep1)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4054 -0.8304 0.1431 0.9708 3.6944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.805e+00 1.179e-01 49.249 < 2e-16 ***
## Movie.sample$Votes 4.364e-05 9.439e-06 4.623 7.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.364 on 154 degrees of freedom
## Multiple R-squared: 0.1219, Adjusted R-squared: 0.1162
## F-statistic: 21.37 on 1 and 154 DF, p-value: 7.949e-06
Model.samplestep2<-lm(Movie.sample$Rating ~ Movie.sample$Votes+Movie.sample$Budgets)
summary(Model.samplestep2)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4926 -0.7138 0.1259 0.9253 3.6072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.892e+00 1.258e-01 46.852 < 2e-16 ***
## Movie.sample$Votes 5.198e-05 1.036e-05 5.019 1.43e-06 ***
## Movie.sample$Budgets -1.034e-08 5.491e-09 -1.884 0.0615 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1306
## F-statistic: 12.64 on 2 and 153 DF, p-value: 8.329e-06
Model<- Model.samplestep2
After minimizing the size of the dataset, the R square increased from 0.07 to 0.1418. The coefficient of Budgets is no more significant when test in 95% significant level.
attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 3):
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.data:
##
## Budgets, Rating, Votes
summary(Model)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4926 -0.7138 0.1259 0.9253 3.6072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.892e+00 1.258e-01 46.852 < 2e-16 ***
## Movie.sample$Votes 5.198e-05 1.036e-05 5.019 1.43e-06 ***
## Movie.sample$Budgets -1.034e-08 5.491e-09 -1.884 0.0615 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1306
## F-statistic: 12.64 on 2 and 153 DF, p-value: 8.329e-06
par(mfrow = c(1,1))
Model.res<-rstandard(Model)
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
par(mfrow = c(2,2))
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.res, main="2.Model Residual Histogram",xlab = "Fitted value of model")
boxplot(Model.res,xlab = "3.Standardized residual",main = "residual distribution")
qqnorm(Model.res, main = "4.QQplot residuals vs normal distribution")
qqline(Model.res)
graph 1: The variance is large, when the fitted value equal roughly to 6. Also the variance of the residual is not constant along with the fitted values. graph 2: THe histogram of the residual is not exactly normal distribution. Also the distributon of residual is apparantly towards to left tails. graph 3: The box plot also suggests left skewness and there is no outlier in the graph. graph 4:The QQ plot confirm that the distribution of the residuals is not normal distribution.
In the summary of the model, I can find that coefficient of Budgets to the model turns to negtive. This says that the high cost of the movie does not lead to high rating by IMDb users.
The R^2 is 0.1418, which means there still much of Ratings is not explained by the model.
Also, the coefficient of Votes is still significant in 99.9% test level, but another cefficient (Budgets) is rejected, if I use same test level. In other words, we may confirm that IMDb users rating movie do not quitely depends on the investment of the movie.
The significance of Vote coefficient can be eplained that good movies will attract more IMDb users to vote or comment.
summary(Model)
##
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4926 -0.7138 0.1259 0.9253 3.6072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.892e+00 1.258e-01 46.852 < 2e-16 ***
## Movie.sample$Votes 5.198e-05 1.036e-05 5.019 1.43e-06 ***
## Movie.sample$Budgets -1.034e-08 5.491e-09 -1.884 0.0615 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1306
## F-statistic: 12.64 on 2 and 153 DF, p-value: 8.329e-06
8 Assumptions:
The distribution of residuals is normal.
The variance of the residuals for every set of values for every set of values for the predictor is equal.
3.The error term is additive.
At every value of the outcome the expected value of the residual is zero.
The expected correlation between residuals, for any two cases is zero.
All predictors are uncorrelated with the error term.
No predictors are a perfect linear function of other predictors.
The mean of the error term is 0.
par(mfrow = c(3,1))
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.res, main="2.Model Residual Histogram",xlab = "Fitted value of model")
qqnorm(Model.res, main = "3.QQplot residuals vs normal distribution")
qqline(Model.res)
It apparantly to see, for every fitted value, the residual value is not always zero. Moreover, the distribution of residuals is not normal.
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
It’s hard to tell from graph whether the variance of the residual is constant, since most of the residual values fall in a small interval, which is 6. Because the Budgets and Votes are much larger than the Rating variable, small change of Budgets Votes won’t affect the value of Rating a lot. This suggested that a non-linear model may be more fit for the dataset. (try to solve the fitted problem)
#Take logarithm of the Votes and Budgets
attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 3):
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.sample (pos = 4):
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.data:
##
## Budgets, Rating, Votes
Model.fix<-lm(Rating~log(Votes)+log(Budgets))
Model.fix_res<-rstandard(Model.fix)
summary(Model.fix)
##
## Call:
## lm(formula = Rating ~ log(Votes) + log(Budgets))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6879 -0.8006 0.1577 0.9311 3.8647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.65157 0.57996 11.469 < 2e-16 ***
## log(Votes) 0.28012 0.06610 4.238 3.89e-05 ***
## log(Budgets) -0.16669 0.05793 -2.877 0.00459 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.38 on 153 degrees of freedom
## Multiple R-squared: 0.1066, Adjusted R-squared: 0.09491
## F-statistic: 9.127 on 2 and 153 DF, p-value: 0.00018
#Plot the fix model
par(mfrow = c(2,2))
plot(fitted(Model.fix),Model.fix_res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values_Fix Model vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.fix_res, main="2.Fix Model Residual Histogram",xlab = "Fitted value of model")
boxplot(Model.fix_res,xlab = "3.Standardized residual",main = "residual distribution")
qqnorm(Model.fix_res, main = "4.QQplot residuals vs normal distribution")
qqline(Model.fix_res)
After using logarithm test, model’s reisduals turns out to be Homoskedasticity. Also the distribution of residual is more likely norm distribution. The histogram of the residual is still left skew.
I use White test to find whether residuals in two models are both homoscedastic.
library(het.test)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
library(vars)
#linear model
Model.test<-data.frame(Model.res,fitted(Model))
Model.test<-VAR(Model.test,p=1)
whites.htest(Model.test)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 9.6853
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.6436
#logarithm model
Model.fix_test<-data.frame(Model.fix_res,fitted(Model.fix))
Model.fix_test<-VAR(Model.fix_test,p=1)
whites.htest(Model.fix_test)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 8.7799
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.7216
Both models have a homoscedastic residual variance. Since the R^2 of logarithm model is lower than linear one, I still keep the oringinal model unchanged.
Corrcheck<-cbind(Votes,Budgets, Model.res)
cor(Corrcheck)
## Votes Budgets Model.res
## Votes 1.00000000 0.427595669 -0.028494772
## Budgets 0.42759567 1.000000000 -0.007867789
## Model.res -0.02849477 -0.007867789 1.000000000
Both Independent variables has low correlations with Model residuals, and there is no perfect collinearity between two independent variables.
Votes and Budgets are probalistic cause to the IMDb users rating but not deterministic. Also Votes Budgets are not completely non-collinear, since there is stll 0.44 correlation between them. For example, IMDb users will try to vote or comment the movie, if movie promition team release the budgets before publish.
The sample sizes is much larger than the we needed. So we need to shrink the size we need. The effective size is 0.1005347. Using G*Power, with alpha level of 0.05 and power level of 0.95, I get a sample size of 156.
Model.All<-lm(Rating ~ Budgets+Votes, data = Movie.use)
r2<-summary(Model.All)$r.square
f2<- r2/(1-r2)
f2
## [1] 0.1005347
Because two independent variables has 0.44 correlation between each other. I want to check if this correlation is significant.
attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 11):
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.sample (pos = 12):
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.sample (pos = 13):
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.use:
##
## Budgets, Rating, Votes
##
## The following objects are masked from Movie.data:
##
## Budgets, Rating, Votes
Model.IVs<- lm(Votes ~ Budgets)
summary(Model.IVs)
##
## Call:
## lm(formula = Votes ~ Budgets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20898 -2497 -1918 -1230 82469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.927e+03 9.662e+02 1.994 0.0479 *
## Budgets 2.267e-04 3.863e-05 5.870 2.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10530 on 154 degrees of freedom
## Multiple R-squared: 0.1828, Adjusted R-squared: 0.1775
## F-statistic: 34.46 on 1 and 154 DF, p-value: 2.585e-08
The coefficient of two Independent variables is significantly from zero. Hence, there is collinearity between Budgets and Votes.
The biggest measurement error is the evalueation of the budgets.The periods of the moive published ranged from 1903 to 2005. The budgets of movie at early time cannot simply compared with the budgets in 2005, since the current inflation.