Ge Chen

Apr.13th, 2015

RPI

FIRST DRAFT

1.Data

(1).Data Selection

The IMDb movies data set was taken from Hadley Wickham's data sets. It contains 5,215 movies released between 1903 and 2005, and the movie information is provided by IMDb. The variables are listed below. We use the IMDb rating as the dependent variable, and the number of IMDb user votes and the movie budget as independent variables, to check whether they have an effect on users’ ratings. A movie with a high rating (the top grade is 10) is treated as a “good” movie, as recognized by IMDb users. In contrast, a movie with a low rating is treated as a “bad” one.

(2).Data Description

Title: Film Title

Year: The year the film was released

Length: The length of the movie

Budgets: The Movie Budget

Rating: IMDb Ratings

Votes: Number of IMDb users who voted for the movie

(3).Organization of Data

rm(list=ls())
Movie<-read.csv("~/Desktop/Applied_Regression/IMDb_Movie.csv");
Movie.use<-subset(Movie,select = c(Rating,Budgets, Votes));
head(Movie.use,n=14L);

##    Rating  Budgets Votes
## 1     7.2   450000   281
## 2     1.6    19000  7996
## 3     4.8 23000000   799
## 4     3.7  5000000   271
## 5     6.7 16000000 19095
## 6     5.6  1100000   181
## 7     3.3   140000    19
## 8     7.8   200000   299
## 9     5.8   200000     7
## 10    4.7 85000000  1987
## 11    7.1  6000000   605
## 12    8.7   340000 29278
## 13    2.8   150000    89
## 14    6.4 37000000  7859

tail(Movie.use,n=14L);

##      Rating  Budgets Votes
## 5202    8.1        0    24
## 5203    7.0  5000000  4820
## 5204    6.5  3240816   335
## 5205    6.3   500000    61
## 5206    7.8  8000000   844
## 5207    5.9   800000    26
## 5208    4.4   176357    49
## 5209    4.6   100000    13
## 5210    7.7  6000000   633
## 5211    6.1 28000000 18277
## 5212    7.5  1300000   168
## 5213    8.0  1000000    10
## 5214    5.5 85000000 18514
## 5215    3.9 87000000  1584

attach(Movie.use)
summary(Movie.use)

##      Rating          Budgets              Votes       
##  Min.   : 1.000   Min.   :        0   Min.   :     5  
##  1st Qu.: 5.200   1st Qu.:   250000   1st Qu.:    67  
##  Median : 6.300   Median :  3000000   Median :   612  
##  Mean   : 6.141   Mean   : 13412513   Mean   :  4974  
##  3rd Qu.: 7.200   3rd Qu.: 15000000   3rd Qu.:  4642  
##  Max.   :10.000   Max.   :200000000   Max.   :157608

plot(Movie.use)

str(Movie.use)

## 'data.frame':    5215 obs. of  3 variables:
##  $ Rating : num  7.2 1.6 4.8 3.7 6.7 5.6 3.3 7.8 5.8 4.7 ...
##  $ Budgets: int  450000 19000 23000000 5000000 16000000 1100000 140000 200000 200000 85000000 ...
##  $ Votes  : int  281 7996 799 271 19095 181 19 299 7 1987 ...

2.Hypothesis

\(H_{0,\text{Budgets}}\): The budget of the movie has no effect on the quality of the movie. \(H_{1,\text{Budgets}}\): The budget of the movie has an effect on the quality of the movie.

\(H_{0,\text{Votes}}\): The number of votes by IMDb users has no effect on the quality of the movie. \(H_{1,\text{Votes}}\): The number of votes by IMDb users has an effect on the quality of the movie.
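
In regression terms, and assuming the two-predictor linear model that is fitted later in this draft, the hypotheses can be read as tests on the slope coefficients. The symbols \(\beta_0, \beta_1, \beta_2\) and \(\varepsilon_i\) are notation introduced here for illustration only, with quality measured by the IMDb rating:

\[
\text{Rating}_i = \beta_0 + \beta_1\,\text{Votes}_i + \beta_2\,\text{Budgets}_i + \varepsilon_i
\]

\[
H_{0,\text{Votes}}: \beta_1 = 0 \quad \text{vs.} \quad H_{1,\text{Votes}}: \beta_1 \neq 0,
\qquad
H_{0,\text{Budgets}}: \beta_2 = 0 \quad \text{vs.} \quad H_{1,\text{Budgets}}: \beta_2 \neq 0
\]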

3.Model

1.Determining the Independent Variables

First, to determine which IVs have a significant effect on the IMDb users’ ratings, I include all three continuous IVs in the model: Length, Budgets and Votes.

attach(Movie)

## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes

Movie.useAll<- subset(Movie, select = c(Rating,Length, Budgets, Votes))

Correlation

attach(Movie.useAll)

## The following objects are masked from Movie:
## 
##     Budgets, Length, Rating, Votes
## 
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes

cor(Movie.useAll)

##              Rating     Length     Budgets     Votes
## Rating   1.00000000 0.02836237 -0.01422905 0.2646416
## Length   0.02836237 1.00000000  0.33818503 0.3187666
## Budgets -0.01422905 0.33818503  1.00000000 0.4412935
## Votes    0.26464163 0.31876661  0.44129350 1.0000000

Step-Wise Method

Based on the correlation matrix, Votes has the largest correlation with Rating (0.26), followed by Length (0.03) and Budgets (-0.01). The variables are therefore added to the model in that order.
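
The forward steps described above can also be compared formally with a partial F-test. The following is only a minimal sketch on simulated stand-in data: the data frame `demo`, its values, and the object names `fit1`, `fit2` and `step_test` are assumptions made for illustration and are not part of the movie data set.

```r
# Simulated stand-in data; column names mirror the draft, the values are made up.
set.seed(1)
n <- 200
demo <- data.frame(Votes  = rexp(n, rate = 1 / 5000),
                   Length = rnorm(n, mean = 100, sd = 20))
demo$Rating <- 6 + 4e-05 * demo$Votes + rnorm(n, sd = 1.4)

# Step 1: start with the predictor most correlated with Rating.
fit1 <- lm(Rating ~ Votes, data = demo)

# Step 2: add the next candidate with update() instead of retyping the formula.
fit2 <- update(fit1, . ~ . + Length)

# Partial F-test: does adding Length reduce the residual sum of squares significantly?
step_test <- anova(fit1, fit2)
print(step_test)
```

Passing `data =` to `lm()` keeps the coefficient names short (`Votes` rather than `demo$Votes`) and avoids relying on `attach()`.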

(1).Ratings Vs. Votes

Model.step1<-lm(Movie.useAll$Rating ~ Movie.useAll$Votes)
summary(Model.step1)

## 
## Call:
## lm(formula = Movie.useAll$Rating ~ Movie.useAll$Votes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9661 -0.9058  0.1282  1.0210  4.0342 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.966e+00  2.247e-02  265.48   <2e-16 ***
## Movie.useAll$Votes 3.520e-05  1.777e-06   19.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.492 on 5213 degrees of freedom
## Multiple R-squared:  0.07004,    Adjusted R-squared:  0.06986 
## F-statistic: 392.6 on 1 and 5213 DF,  p-value: < 2.2e-16

(2).Ratings Vs. Votes+Length

Model.step2<-lm(Movie.useAll$Rating ~ Movie.useAll$Votes+Movie.useAll$Length)
summary(Model.step2)

## 
## Call:
## lm(formula = Movie.useAll$Rating ~ Movie.useAll$Votes + Movie.useAll$Length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0501 -0.9206  0.1446  1.0236  3.9120 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.227e+00  6.304e-02  98.768  < 2e-16 ***
## Movie.useAll$Votes   3.784e-05  1.871e-06  20.226  < 2e-16 ***
## Movie.useAll$Length -2.857e-03  6.447e-04  -4.431 9.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.489 on 5212 degrees of freedom
## Multiple R-squared:  0.07353,    Adjusted R-squared:  0.07317 
## F-statistic: 206.8 on 2 and 5212 DF,  p-value: < 2.2e-16

(3). Ratings Vs. Votes+Length+Budgets

Model.step3<-lm(Movie.useAll$Rating ~ Movie.useAll$Votes+Movie.useAll$Length+Movie.useAll$Budgets)
summary(Model.step3)

## 
## Call:
## lm(formula = Movie.useAll$Rating ~ Movie.useAll$Votes + Movie.useAll$Length + 
##     Movie.useAll$Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0970 -0.8610  0.1752  0.9740  3.8623 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.176e+00  6.261e-02  98.645   <2e-16 ***
## Movie.useAll$Votes    4.554e-05  1.997e-06  22.800   <2e-16 ***
## Movie.useAll$Length  -1.287e-03  6.563e-04  -1.961     0.05 *  
## Movie.useAll$Budgets -1.032e-08  1.002e-09 -10.303   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.475 on 5211 degrees of freedom
## Multiple R-squared:  0.09202,    Adjusted R-squared:  0.0915 
## F-statistic:   176 on 3 and 5211 DF,  p-value: < 2.2e-16

The R-squared does not change much when Length is added to the model (from 0.0700 to 0.0735), but the coefficient of Length is still significant. With 5,215 observations, even very small effects can be statistically significant, so we need to check whether the size of the data set is appropriate. Using G*Power with an alpha error probability of 0.05, a power of 0.95 and an effect size of 0.1005, we find that the required total sample size is 156, which is much smaller than the data set provided. We therefore randomly pick 156 samples from the movie data set.

#form an index array
index = 1:5215
samplesize<-156
set.seed(99)
#randomly pick indices from the original set
Movie.sampleindex<-sample(index,samplesize)
#construct a new set containing the samples (row indexing replaces the rbind loop)
Movie.sample<- Movie.useAll[Movie.sampleindex,]

Next, we repeat the stepwise procedure on the sample.

(1).Ratings Vs. Votes

attach(Movie.sample)

## The following objects are masked from Movie.useAll:
## 
##     Budgets, Length, Rating, Votes
## 
## The following objects are masked from Movie:
## 
##     Budgets, Length, Rating, Votes
## 
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes

Model.samplestep1<-lm(Movie.sample$Rating ~ Movie.sample$Votes)
summary(Model.samplestep1)

## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4054 -0.8304  0.1431  0.9708  3.6944 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.805e+00  1.179e-01  49.249  < 2e-16 ***
## Movie.sample$Votes 4.364e-05  9.439e-06   4.623 7.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.364 on 154 degrees of freedom
## Multiple R-squared:  0.1219, Adjusted R-squared:  0.1162 
## F-statistic: 21.37 on 1 and 154 DF,  p-value: 7.949e-06

(2).Ratings Vs. Votes+Length

Model.samplestep2<-lm(Movie.sample$Rating ~ Movie.sample$Votes+Movie.sample$Length)
summary(Model.samplestep2)

## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4328 -0.8378  0.1923  0.9592  3.5602 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.962e+00  3.465e-01  17.207  < 2e-16 ***
## Movie.sample$Votes   4.543e-05  1.017e-05   4.465 1.54e-05 ***
## Movie.sample$Length -1.722e-03  3.583e-03  -0.481    0.631    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.367 on 153 degrees of freedom
## Multiple R-squared:  0.1232, Adjusted R-squared:  0.1117 
## F-statistic: 10.75 on 2 and 153 DF,  p-value: 4.286e-05

(3). Ratings Vs. Votes+Length+Budgets

Model.samplestep3<-lm(Movie.sample$Rating ~ Movie.sample$Votes+Movie.sample$Length+Movie.sample$Budgets)
summary(Model.samplestep3)

## 
## Call:
## lm(formula = Movie.sample$Rating ~ Movie.sample$Votes + Movie.sample$Length + 
##     Movie.sample$Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4951 -0.7111  0.1209  0.9263  3.5930 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.909e+00  3.451e-01  17.122  < 2e-16 ***
## Movie.sample$Votes    5.212e-05  1.075e-05   4.848 3.05e-06 ***
## Movie.sample$Length  -1.901e-04  3.655e-03  -0.052   0.9586    
## Movie.sample$Budgets -1.027e-08  5.662e-09  -1.815   0.0716 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.357 on 152 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1248 
## F-statistic: 8.371 on 3 and 152 DF,  p-value: 3.469e-05

The coefficient of Length is no longer significant once a sample of appropriate size is used (p = 0.9586). Therefore, we drop Length and use the other two variables, Votes and Budgets, as independent variables.

Model Summary

attach(Movie.sample)

## The following objects are masked from Movie.sample (pos = 3):
## 
##     Budgets, Length, Rating, Votes
## 
## The following objects are masked from Movie.useAll:
## 
##     Budgets, Length, Rating, Votes
## 
## The following objects are masked from Movie:
## 
##     Budgets, Length, Rating, Votes
## 
## The following objects are masked from Movie.use:
## 
##     Budgets, Rating, Votes

Model<-lm(Rating ~ Votes+Budgets)
summary(Model)

## 
## Call:
## lm(formula = Rating ~ Votes + Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4926 -0.7138  0.1259  0.9253  3.6072 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.892e+00  1.258e-01  46.852  < 2e-16 ***
## Votes        5.198e-05  1.036e-05   5.019 1.43e-06 ***
## Budgets     -1.034e-08  5.491e-09  -1.884   0.0615 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1306 
## F-statistic: 12.64 on 2 and 153 DF,  p-value: 8.329e-06

4.Plot

3D plot

library(scatterplot3d)
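#x and y follow the order of the predictors in Model so that plane3d() draws the fitted plane correctly
Model3D<-scatterplot3d(Votes, Budgets, Rating,pch=21,main = "Movie Ratings Vs. Movie budget and IMDb users vote",xlab = "Votes",ylab = "Budgets",zlab = "Rating",axis = TRUE)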
Model3D$plane3d(Model,col="red")

Residual Plot

Assumption test1:Normality and zero mean of the Residual

Model.res<-resid(Model)
plot(fitted(Model),Model.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")

hist(Model.res, main="Model Residual Histogram",xlab = "Residuals")

qqnorm(Model.res)

The distribution of the residuals is not exactly normal according to the figures above. The histogram of the residuals is left-skewed.
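
The visual impression of skewness and non-normality can be backed up with numbers. The sketch below uses simulated, deliberately left-skewed values as a stand-in for the residuals; `res_demo`, `skew` and `sw` are hypothetical names, and the Shapiro-Wilk test is one possible choice rather than the check used in this draft.

```r
# Simulated left-skewed "residuals"; stand-in values, not the draft's Model.res.
set.seed(42)
res_demo <- -rexp(156)                  # negated exponential draws have a long left tail
res_demo <- res_demo - mean(res_demo)   # centre at zero, like OLS residuals

# Sample skewness (third standardized moment); negative values indicate left skew.
skew <- mean(res_demo^3) / mean(res_demo^2)^1.5

# Shapiro-Wilk test: H0 is that the values come from a normal distribution.
sw <- shapiro.test(res_demo)
cat("skewness:", round(skew, 3), " Shapiro-Wilk p-value:", signif(sw$p.value, 3), "\n")
```

On the real model, the same two calls would be applied to `Model.res` in place of `res_demo`.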

Assumption test2:Variance of the Residual

plot(Budgets, Model.res,xlab ="Budgets",ylab = "Residual")

plot(Votes, Model.res,xlab="Votes",ylab = "Residual")

It is hard to tell from the graphs whether the variance of the residuals is constant, since most of the points fall in a small interval. We therefore need White's test to check whether the residuals are heteroskedastic.

Assumption test3:Correlation with Residual

Corrcheck<-cbind(Votes,Budgets, Model.res)
cor(Corrcheck)

##                   Votes      Budgets     Model.res
## Votes      1.000000e+00 4.275957e-01 -4.068226e-17
## Budgets    4.275957e-01 1.000000e+00  2.460192e-17
## Model.res -4.068226e-17 2.460192e-17  1.000000e+00

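The residuals are uncorrelated with the two independent variables; this is expected, because OLS residuals are orthogonal to the regressors by construction when the model includes an intercept. In addition, the two independent variables are not strongly collinear (their correlation, 0.43, is below 0.5).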
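
The collinearity statement can be expressed as a variance inflation factor (VIF). This is a small sketch that only reuses the Votes-Budgets correlation printed above; the names `r_votes_budgets` and `vif` are introduced here for illustration.

```r
# Correlation between Votes and Budgets, copied from the matrix above.
r_votes_budgets <- 0.4275957

# With exactly two predictors, the R^2 of one regressed on the other equals r^2,
# so both predictors share the same variance inflation factor.
vif <- 1 / (1 - r_votes_budgets^2)
print(round(vif, 3))
```

The result is roughly 1.22, which is far below the rule-of-thumb thresholds of 5 or 10 that are often used to flag problematic collinearity.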

Assumption test4:Autocorrelation of Residual

acf(Model.res,main="Autocorrelation of Residual")

According to the figure above, the residuals show no autocorrelation.
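
A formal test can complement the ACF plot. The sketch below applies a Ljung-Box test to simulated independent values that stand in for the residuals; `res_demo` and `lb` are hypothetical names, and the lag of 10 is an arbitrary illustrative choice. Since the rows are a random cross-sectional sample, their order carries no time meaning, so this check is mostly a formality here.

```r
# Simulated independent "residuals"; stand-in values, not the draft's Model.res.
set.seed(7)
res_demo <- rnorm(156)

# Ljung-Box test: H0 is no autocorrelation up to the chosen lag.
# lag = 10 is an arbitrary illustrative choice.
lb <- Box.test(res_demo, lag = 10, type = "Ljung-Box")
print(lb)
```

A large p-value means the hypothesis of no autocorrelation is not rejected.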

5.Interpret

Summary Model

summary(Model)

## 
## Call:
## lm(formula = Rating ~ Votes + Budgets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4926 -0.7138  0.1259  0.9253  3.6072 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.892e+00  1.258e-01  46.852  < 2e-16 ***
## Votes        5.198e-05  1.036e-05   5.019 1.43e-06 ***
## Budgets     -1.034e-08  5.491e-09  -1.884   0.0615 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 153 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1306 
## F-statistic: 12.64 on 2 and 153 DF,  p-value: 8.329e-06

White Test

library(het.test)

## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest

library(vars)
modeldata<-data.frame(Model.res,fitted(Model))
modeltest<-VAR(modeldata,p=1)
whites.htest(modeltest)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  10.9706 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.5314
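
With a p-value of 0.5314, the output above does not reject homoskedasticity at the 5% level. As a cross-check, White's test can also be computed directly from the regression through an auxiliary regression of the squared residuals on the predictors, their squares and their cross-product. The following is only a sketch of that textbook form on simulated stand-in data, not the procedure used above; `demo`, `fit`, `aux_data`, `aux` and the `white_*` names are assumptions made for illustration.

```r
# Simulated homoskedastic data; column names mirror the draft, the values are made up.
set.seed(123)
n <- 156
demo <- data.frame(Votes   = rexp(n, rate = 1 / 5000),
                   Budgets = rexp(n, rate = 1 / 1e7))
demo$Rating <- 6 + 5e-05 * demo$Votes - 1e-08 * demo$Budgets + rnorm(n, sd = 1.35)

fit <- lm(Rating ~ Votes + Budgets, data = demo)

# Rescale the predictors so squares and cross-products stay numerically well-behaved.
# An affine rescaling does not change the R^2 of the full quadratic auxiliary regression.
aux_data <- data.frame(e2 = resid(fit)^2,
                       v  = as.numeric(scale(demo$Votes)),
                       b  = as.numeric(scale(demo$Budgets)))

# Auxiliary regression of squared residuals on levels, squares and the cross-product.
aux <- lm(e2 ~ v + b + I(v^2) + I(b^2) + v:b, data = aux_data)

# The LM statistic n * R^2 is asymptotically chi-squared,
# with df equal to the number of auxiliary regressors.
white_stat <- n * summary(aux)$r.squared
white_df   <- length(coef(aux)) - 1
white_p    <- pchisq(white_stat, df = white_df, lower.tail = FALSE)
cat("statistic:", round(white_stat, 3), " df:", white_df, " p-value:", round(white_p, 4), "\n")
```

On the real model, `fit` would be replaced by `Model` and `demo` by `Movie.sample`; a small p-value would indicate heteroskedasticity.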