Ge Chen

Apr.20th, 2015

RPI

Final


1.Data

(1).Data Selection

The IMDB-Movies sets was selected from Hadley Wickham datasets, containing 5,215 movies wich are ranged over 1903 ~ 2005. The variable names are listed below. Also the movie information is provided by IMDb. The dataset is retrived from URL: https://github.com/hadley/data-movies.

(2).Data Description

Title: Film Title

Year: The year that the Film publicated

Length: The length of the movie

Budgets: The Movie Budget

Rating: IMDb Ratings

Vote: Number of IMDb users who vote for the movies

(3).Organization for Data

#data read in 
rm(list=ls())
Movie.data<-read.csv("~/Desktop/Applied_Regression/IMDb_Movie.csv");
head(Movie.data,n=14L);
##                         Title Year Length  Budgets Rating Votes
## 1                     'G' Men 1935     85   450000    7.2   281
## 2   'Manos' the Hands of Fate 1966     74    19000    1.6  7996
## 3          'Til There Was You 1997    113 23000000    4.8   799
## 4             .com for Murder 2002     96  5000000    3.7   271
## 5  10 Things I Hate About You 1999     97 16000000    6.7 19095
## 6               100 Mile Rule 2002     98  1100000    5.6   181
## 7                   100 Proof 1997     94   140000    3.3    19
## 8                         101 1989    117   200000    7.8   299
## 9            101-vy kilometer 2001    103   200000    5.8     7
## 10             102 Dalmatians 2000    100 85000000    4.7  1987
## 11                0.468055556 2003     95  6000000    7.1   605
## 12               12 Angry Men 1957     96   340000    8.7 29278
## 13             12 to the Moon 1960     74   150000    2.8    89
## 14             13 Going On 30 2004     98 37000000    6.4  7859
tail(Movie.data,n=14L);
##                            Title Year Length  Budgets Rating Votes
## 5202                        Zero 2000     20        0    8.1    24
## 5203                 Zero Effect 1998    116  5000000    7.0  4820
## 5204            Ziegfeld Follies 1946    110  3240816    6.5   335
## 5205                 Zimmer Feri 1998     97   500000    6.3    61
## 5206               Zivot je cudo 2004    155  8000000    7.8   844
## 5207                       Zohar 1993    116   800000    5.9    26
## 5208 Zombies of the Stratosphere 1952    167   176357    4.4    49
## 5209                   Zoo Radio 1990     90   100000    4.6    13
## 5210              Zookeeper, The 2001    108  6000000    7.7   633
## 5211                   Zoolander 2001     89 28000000    6.1 18277
## 5212                      Zvezda 2002     97  1300000    7.5   168
## 5213                       Zzyzx 2005     90  1000000    8.0    10
## 5214                         xXx 2002    132 85000000    5.5 18514
## 5215     xXx: State of the Union 2005    101 87000000    3.9  1584
attach(Movie.data)
summary(Movie.data)
##                     Title           Year          Length      
##  Star Is Born, A       :   3   Min.   :1903   Min.   :  1.00  
##  Alamo, The            :   2   1st Qu.:1975   1st Qu.: 86.00  
##  Alfie                 :   2   Median :1996   Median : 97.00  
##  Assault on Precinct 13:   2   Mean   :1985   Mean   : 95.99  
##  Barbarians, The       :   2   3rd Qu.:2001   3rd Qu.:111.00  
##  Ben-Hur               :   2   Max.   :2005   Max.   :390.00  
##  (Other)               :5202                                  
##     Budgets              Rating           Votes       
##  Min.   :        0   Min.   : 1.000   Min.   :     5  
##  1st Qu.:   250000   1st Qu.: 5.200   1st Qu.:    67  
##  Median :  3000000   Median : 6.300   Median :   612  
##  Mean   : 13412513   Mean   : 6.141   Mean   :  4974  
##  3rd Qu.: 15000000   3rd Qu.: 7.200   3rd Qu.:  4642  
##  Max.   :200000000   Max.   :10.000   Max.   :157608  
## 
str(Movie.data)
## 'data.frame':    5215 obs. of  6 variables:
##  $ Title  : Factor w/ 5137 levels ".com for Murder",..: 2 3 4 1 8 9 10 11 12 13 ...
##  $ Year   : int  1935 1966 1997 2002 1999 2002 1997 1989 2001 2000 ...
##  $ Length : int  85 74 113 96 97 98 94 117 103 100 ...
##  $ Budgets: int  450000 19000 23000000 5000000 16000000 1100000 140000 200000 200000 85000000 ...
##  $ Rating : num  7.2 1.6 4.8 3.7 6.7 5.6 3.3 7.8 5.8 4.7 ...
##  $ Votes  : int  281 7996 799 271 19095 181 19 299 7 1987 ...

In the dataset, Length, Year, Budgets, Vote and Rating are continous variable.Hence,I will use Rating as dependent variable, Vote and Budegets as independent variables to check whether they have effect on Rating.High ratings of movie (grade:10) can be treated as a “good” movie which is reconized by IMDb users. In constrast, low ratings of movie will be seen as “bad” one. The Units for the variables we used are: Rating (no units); Vote( no units); Budgets(USD)

#create a subset for the DVs and IVs
Movie.use <- subset(Movie.data, select = c(Rating, Budgets, Votes))

2.Hypothesis

\(H_0_Budgets\):The Budgets of the movie have no effect on the quality of the movie. \(H_1_Budgets\): The Budgets of the movie have effect on the quality of the movie.

\(H_0_Votes\): The number of the votes by IMDb users have no effect on the quality of the movie. \(H_1_Votes\): The number of the votes by IMDb users have effect on the quality of the movie.

3 Model

1.Determining the Independent Variables

For the model, I will use step-wise method to see if the variables I selected as independent variables can explained the dependent variables. Firstly, I will start with the correlation between each of the variables.

Correlation

attach(Movie.use)
## The following objects are masked from Movie.data:
## 
##     Budgets, Rating, Votes
cor(Movie.use)
##              Rating     Budgets     Votes
## Rating   1.00000000 -0.01422905 0.2646416
## Budgets -0.01422905  1.00000000 0.4412935
## Votes    0.26464163  0.44129350 1.0000000

Step-Wise Method

Based on the correlation matrix, I will start with Votes, since it has the highest correlation with Ratings

(1).Ratings Vs. Votes

Model.step1<-lm(Movie.use$Rating ~ Movie.use$Votes)
summary(Model.step1)
## 
## Call:
## lm(formula = Movie.use$Rating ~ Movie.use$Votes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9661 -0.9058  0.1282  1.0210  4.0342 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.966e+00  2.247e-02  265.48   <2e-16 ***
## Movie.use$Votes 3.520e-05  1.777e-06   19.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.492 on 5213 degrees of freedom
## Multiple R-squared:  0.07004,    Adjusted R-squared:  0.06986 
## F-statistic: 392.6 on 1 and 5213 DF,  p-value: < 2.2e-16

(2).Ratings Vs. Vote+Budgets

Model.step2<-lm(Movie.use$Rating ~ Movie.use$Votes+Movie.use$Budgets)
summary(Model.step2)
## 
## Call:
## lm(formula = Movie.use$Rating ~ Movie.use$Votes + Movie.use$Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0633 -0.8637  0.1651  0.9636  3.9371 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.063e+00  2.389e-02  253.81   <2e-16 ***
## Movie.use$Votes    4.475e-05  1.957e-06   22.86   <2e-16 ***
## Movie.use$Budgets -1.078e-08  9.748e-10  -11.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.475 on 5212 degrees of freedom
## Multiple R-squared:  0.09135,    Adjusted R-squared:  0.091 
## F-statistic:   262 on 2 and 5212 DF,  p-value: < 2.2e-16

The R-square does not changed much when adding Budgets variable into the model. But the coefficients for the two variables are still significant. Therefore, we need to check whether the size of dataset is appropriate. We use G* Power with alpha error prob 0.05, power 0.95 and effect size 0.1005, we find the total sample size we needed is 156, which much smaller than the dataset provided. Then, we randomly pick 156 samples from the dataset.

#form an index array
samplesize<-156
set.seed(99)
samplerow<- nrow(Movie.use)
#random pick index from the oringinal set
model.index <- sample(samplerow, samplesize, replace = FALSE)
#construct a new set containing the samples
Movie.sample<- Movie.use[model.index,]

Next, I use stepwise to check the model again.

(1).Ratings Vs. Votes

attach(Movie.sample)
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.data:
## 
##     Budgets, Rating, Votes
Model.samplestep1<-lm(Movie.sample$Rating ~ Movie.sample$Votes)
summary(Model.samplestep1)
## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4054 -0.8304  0.1431  0.9708  3.6944 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.805e+00  1.179e-01  49.249  < 2e-16 ***
## Movie.sample$Votes 4.364e-05  9.439e-06   4.623 7.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.364 on 154 degrees of freedom
## Multiple R-squared:  0.1219, Adjusted R-squared:  0.1162 
## F-statistic: 21.37 on 1 and 154 DF,  p-value: 7.949e-06

(2). Ratings Vs. Votes+Budgets

Model.samplestep2<-lm(Movie.sample$Rating ~ Movie.sample$Votes+Movie.sample$Budgets)
summary(Model.samplestep2)
## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4926 -0.7138  0.1259  0.9253  3.6072 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.892e+00  1.258e-01  46.852  < 2e-16 ***
## Movie.sample$Votes    5.198e-05  1.036e-05   5.019 1.43e-06 ***
## Movie.sample$Budgets -1.034e-08  5.491e-09  -1.884   0.0615 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1306 
## F-statistic: 12.64 on 2 and 153 DF,  p-value: 8.329e-06
Model<- Model.samplestep2

After minimizing the size of the dataset, the R square increased from 0.07 to 0.1418. The coefficient of Budgets is no more significant when test in 95% significant level.

Model Summary

attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 3):
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.data:
## 
##     Budgets, Rating, Votes
summary(Model)
## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4926 -0.7138  0.1259  0.9253  3.6072 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.892e+00  1.258e-01  46.852  < 2e-16 ***
## Movie.sample$Votes    5.198e-05  1.036e-05   5.019 1.43e-06 ***
## Movie.sample$Budgets -1.034e-08  5.491e-09  -1.884   0.0615 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1306 
## F-statistic: 12.64 on 2 and 153 DF,  p-value: 8.329e-06

4.Plot

Residual Plot

par(mfrow = c(1,1))
Model.res<-rstandard(Model)
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")

Diagnal Plot

par(mfrow = c(2,2))
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.res, main="2.Model Residual Histogram",xlab = "Fitted value of model")
boxplot(Model.res,xlab = "3.Standardized residual",main = "residual distribution")
qqnorm(Model.res, main = "4.QQplot residuals vs normal distribution")
qqline(Model.res)   

graph 1: The variance is large, when the fitted value equal roughly to 6. Also the variance of the residual is not constant along with the fitted values. graph 2: THe histogram of the residual is not exactly normal distribution. Also the distributon of residual is apparantly towards to left tails. graph 3: The box plot also suggests left skewness and there is no outlier in the graph. graph 4:The QQ plot confirm that the distribution of the residuals is not normal distribution.

5 Interpret

Summary Model

In the summary of the model, I can find that coefficient of Budgets to the model turns to negtive. This says that the high cost of the movie does not lead to high rating by IMDb users.

The R^2 is 0.1418, which means there still much of Ratings is not explained by the model.

Also, the coefficient of Votes is still significant in 99.9% test level, but another cefficient (Budgets) is rejected, if I use same test level. In other words, we may confirm that IMDb users rating movie do not quitely depends on the investment of the movie.

The significance of Vote coefficient can be eplained that good movies will attract more IMDb users to vote or comment.

summary(Model)
## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4926 -0.7138  0.1259  0.9253  3.6072 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.892e+00  1.258e-01  46.852  < 2e-16 ***
## Movie.sample$Votes    5.198e-05  1.036e-05   5.019 1.43e-06 ***
## Movie.sample$Budgets -1.034e-08  5.491e-09  -1.884   0.0615 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1306 
## F-statistic: 12.64 on 2 and 153 DF,  p-value: 8.329e-06

Assumption

8 Assumptions:

  1. The distribution of residuals is normal.

  2. The variance of the residuals for every set of values for every set of values for the predictor is equal.

3.The error term is additive.

  1. At every value of the outcome the expected value of the residual is zero.

  2. The expected correlation between residuals, for any two cases is zero.

  3. All predictors are uncorrelated with the error term.

  4. No predictors are a perfect linear function of other predictors.

  5. The mean of the error term is 0.

Assumption test1:Normality and zero mean of the Residual (assumption 1, 4, 8)

par(mfrow = c(3,1))
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.res, main="2.Model Residual Histogram",xlab = "Fitted value of model")
qqnorm(Model.res, main = "3.QQplot residuals vs normal distribution")
qqline(Model.res)   

It apparantly to see, for every fitted value, the residual value is not always zero. Moreover, the distribution of residuals is not normal.

Assumption test2:Variance of the Residual(assumption 2, 3)

plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")

It’s hard to tell from graph whether the variance of the residual is constant, since most of the residual values fall in a small interval, which is 6. Because the Budgets and Votes are much larger than the Rating variable, small change of Budgets Votes won’t affect the value of Rating a lot. This suggested that a non-linear model may be more fit for the dataset. (try to solve the fitted problem)

#Take logarithm of the Votes and Budgets
attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 3):
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.sample (pos = 4):
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.data:
## 
##     Budgets, Rating, Votes
Model.fix<-lm(Rating~log(Votes)+log(Budgets))
Model.fix_res<-rstandard(Model.fix)
summary(Model.fix)
## 
## Call:
## lm(formula = Rating ~ log(Votes) + log(Budgets))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6879 -0.8006  0.1577  0.9311  3.8647 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.65157    0.57996  11.469  < 2e-16 ***
## log(Votes)    0.28012    0.06610   4.238 3.89e-05 ***
## log(Budgets) -0.16669    0.05793  -2.877  0.00459 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.38 on 153 degrees of freedom
## Multiple R-squared:  0.1066, Adjusted R-squared:  0.09491 
## F-statistic: 9.127 on 2 and 153 DF,  p-value: 0.00018
#Plot the fix model
par(mfrow = c(2,2))
plot(fitted(Model.fix),Model.fix_res,pch=21, cex=1, bg='blue',main="1. Plot of Fitted Values_Fix Model vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(Model.fix_res, main="2.Fix Model Residual Histogram",xlab = "Fitted value of model")
boxplot(Model.fix_res,xlab = "3.Standardized residual",main = "residual distribution")
qqnorm(Model.fix_res, main = "4.QQplot residuals vs normal distribution")
qqline(Model.fix_res)   

After using logarithm test, model’s reisduals turns out to be Homoskedasticity. Also the distribution of residual is more likely norm distribution. The histogram of the residual is still left skew.

I use White test to find whether residuals in two models are both homoscedastic.

White Test

library(het.test)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
library(vars)
#linear model
Model.test<-data.frame(Model.res,fitted(Model))
Model.test<-VAR(Model.test,p=1)
whites.htest(Model.test)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  9.6853 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.6436
#logarithm model
Model.fix_test<-data.frame(Model.fix_res,fitted(Model.fix))
Model.fix_test<-VAR(Model.fix_test,p=1)
whites.htest(Model.fix_test)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  8.7799 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.7216

Both models have a homoscedastic residual variance. Since the R^2 of logarithm model is lower than linear one, I still keep the oringinal model unchanged.

Assumption test3:Correlation with Residual (assumption,6, 7)

Corrcheck<-cbind(Votes,Budgets, Model.res)
cor(Corrcheck)
##                 Votes      Budgets    Model.res
## Votes      1.00000000  0.427595669 -0.028494772
## Budgets    0.42759567  1.000000000 -0.007867789
## Model.res -0.02849477 -0.007867789  1.000000000

Both Independent variables has low correlations with Model residuals, and there is no perfect collinearity between two independent variables.

Issues

1.Causality

Votes and Budgets are probalistic cause to the IMDb users rating but not deterministic. Also Votes Budgets are not completely non-collinear, since there is stll 0.44 correlation between them. For example, IMDb users will try to vote or comment the movie, if movie promition team release the budgets before publish.

2.Sample Sizes

The sample sizes is much larger than the we needed. So we need to shrink the size we need. The effective size is 0.1005347. Using G*Power, with alpha level of 0.05 and power level of 0.95, I get a sample size of 156.

Model.All<-lm(Rating ~ Budgets+Votes, data = Movie.use)
r2<-summary(Model.All)$r.square
f2<- r2/(1-r2)
f2
## [1] 0.1005347

3. Collinearity

Because two independent variables has 0.44 correlation between each other. I want to check if this correlation is significant.

attach(Movie.sample)
## The following objects are masked from Movie.sample (pos = 11):
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.sample (pos = 12):
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.sample (pos = 13):
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes
## 
## The following objects are masked from Movie.data:
## 
##     Budgets, Rating, Votes
Model.IVs<- lm(Votes ~ Budgets)
summary(Model.IVs)
## 
## Call:
## lm(formula = Votes ~ Budgets)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20898  -2497  -1918  -1230  82469 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.927e+03  9.662e+02   1.994   0.0479 *  
## Budgets     2.267e-04  3.863e-05   5.870 2.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10530 on 154 degrees of freedom
## Multiple R-squared:  0.1828, Adjusted R-squared:  0.1775 
## F-statistic: 34.46 on 1 and 154 DF,  p-value: 2.585e-08

The coefficient of two Independent variables is significantly from zero. Hence, there is collinearity between Budgets and Votes.

4. Measurement Error

The biggest measurement error is the evalueation of the budgets.The periods of the moive published ranged from 1903 to 2005. The budgets of movie at early time cannot simply compared with the budgets in 2005, since the current inflation.