knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
Load packages
library(ggplot2)
library(dplyr)
library(statsr)
library(leaps)
library(grid)
library(gridExtra)
library(car)
The movies data set was downloaded from the Coursera assignment page into the R project. The
observations are a random sample of movies with audience and critics reviews.
The data set contains 651 randomly sampled movies produced and released before 2016, with 32
available variables, and comes from Rotten Tomatoes and IMDB.
Because the data are observational, only an observational study is possible and no causal
conclusions will be drawn. The results generalize to movies produced and released before 2016.
Some of the variables are not relevant for identifying what makes a movie popular.
Before starting the analysis, one observation with a missing value was omitted and the required
variables were selected. A backward elimination (stepwise) method was used for selecting variables.
The research question is: how is movie popularity related to the different ratings and scores,
the awards won by the director and the actors/actresses, and the year of release?
To answer this question I will assess how strongly each of these elements is associated with
popularity.
The genre variable divides the data set into 10 categories plus an extra category called "other".
For the sake of exploratory analysis I merged some of the smallest categories into the "other"
category. The majority of the movies are dramas; the figures below show this information.
Before starting the analysis I removed the single observation with a missing value. A scatterplot
coloured by genre visualizes the genre categories in the data set, and a subset (df) of 15
variables was extracted from the movies data frame.
Multiple linear regression (MLR) is used to assess the most important predictors of movie
popularity. MLR allows the user to account for multiple explanatory variables and therefore to
create a model that predicts the specific outcome under study.
The simple linear regression model is based on a straight line with the formula
Y = a + bX
(where a is the intercept and b is the gradient). For multiple regression we simply add a new
b value (regression coefficient) for each additional explanatory variable:
Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn
Multiple R, which is the equivalent of Pearson's r, shows the strength of the relationship
between the outcome variable and the values predicted by the model as a whole.
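As a minimal sketch of this idea (using simulated data rather than the movies data; all names below are hypothetical), a multiple regression can be fitted with lm() and Multiple R recovered either as the correlation between the fitted values and the outcome or as the square root of R-squared:
set.seed(1)                                   # hypothetical illustration, not the movies data
n  <- 100
x1 <- rnorm(n)                                # first explanatory variable
x2 <- rnorm(n)                                # second explanatory variable
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)      # outcome: Y = a + b1*X1 + b2*X2 + error
fit <- lm(y ~ x1 + x2)                        # multiple regression with two predictors
coef(fit)                                     # estimated a, b1 and b2
cor(fitted(fit), y)                           # Multiple R
sqrt(summary(fit)$r.squared)                  # same value via R-squared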
In multiple regression a model is constructed with more than one explanatory variable.
We want a model that accounts for as much of the variance in the outcome variable as possible
while elegantly including only the relevant variables.
The best way to select which variables to include in a model is to refer to previous research.
There are a few new issues to think about, and it is worth reiterating our assumptions when
using multiple explanatory variables, such as:
Linear relationship: The model is a roughly linear one.
Homoscedasticity: homoscedasticity means that the variance of the residuals should be the same
at each level of the explanatory variable/s.
Independent errors: This means that residuals should be uncorrelated.
There are some more important things to consider. When using simple regression, the
assumptions are the most important issues to tackle, but there are also other potential
problems you should look out for:
Outliers/influential cases: As with simple linear regression, it is important to look out
for cases which may have a disproportionate influence over your regression model.
Variance in all explanatory variables.
Explanatory variables may be continuous, ordinal or nominal but each must have at least a
small range of values even if there are only two categorical possibilities.
Multicollinearity: Multicollinearity exists when two or more of the explanatory variables are
highly correlated. This is a problem as it can be hard to disentangle which of them best explains
any shared variance with the outcome. It also suggests that the two variables may actually
represent the same underlying factor.
Normally distributed residuals: The residuals should be normally distributed.
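As a quick sketch of how several of these checks can be run at once (illustrated here on the built-in mtcars data, not the movies data), the default plot() method for a fitted lm object produces the four standard diagnostic plots:
fit0 <- lm(mpg ~ wt + hp, data = mtcars)  # example model on a built-in data set
par(mfrow = c(2, 2))                      # arrange the four diagnostic plots in a grid
plot(fit0)                                # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))                      # reset the plotting layout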
I will analyse the movies data set to find an answer to the question of which variables are
most significant in accounting for a movie's popularity. I shall also look into the issues and
list the shortcomings of the method used in the study.
load("C:/R Data Files/movies.RData")
attach(movies)
ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
geom_point(aes(colour=factor(genre))) +
xlab("Year of Release (theater)") +
ylab("Audience Score")
df <- movies[ -c(1, 6:12, 14, 25:32) ] # Extract the 15 variables used in the analysis
df <- na.omit(df)
##MissingData <- function(x){sum(is.na(x))/length(x)*100}
##apply(df, 2, MissingData)#
names(df) #Variables to be used for analysis
## [1] "title_type" "genre" "runtime"
## [4] "mpaa_rating" "imdb_rating" "critics_rating"
## [7] "critics_score" "audience_rating" "audience_score"
## [10] "best_pic_nom" "best_pic_win" "best_actor_win"
## [13] "best_actress_win" "best_dir_win" "top200_box"
Backward Variable Selection Algorithm (regsubsets)
regsubsets.out <- regsubsets(audience_score ~., data=df, nbest=1,
nvmax=NULL, method="backward")
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")
print(regsubsets.out)
## Subset selection object
## Call: regsubsets.formula(audience_score ~ ., data = df, nbest = 1,
## nvmax = NULL, method = "backward")
## 29 Variables (and intercept)
## Forced in Forced out
## title_typeFeature Film FALSE FALSE
## title_typeTV Movie FALSE FALSE
## genreAnimation FALSE FALSE
## genreArt House & International FALSE FALSE
## genreComedy FALSE FALSE
## genreDocumentary FALSE FALSE
## genreDrama FALSE FALSE
## genreHorror FALSE FALSE
## genreMusical & Performing Arts FALSE FALSE
## genreMystery & Suspense FALSE FALSE
## genreOther FALSE FALSE
## genreScience Fiction & Fantasy FALSE FALSE
## runtime FALSE FALSE
## mpaa_ratingNC-17 FALSE FALSE
## mpaa_ratingPG FALSE FALSE
## mpaa_ratingPG-13 FALSE FALSE
## mpaa_ratingR FALSE FALSE
## mpaa_ratingUnrated FALSE FALSE
## imdb_rating FALSE FALSE
## critics_ratingFresh FALSE FALSE
## critics_ratingRotten FALSE FALSE
## critics_score FALSE FALSE
## audience_ratingUpright FALSE FALSE
## best_pic_nomyes FALSE FALSE
## best_pic_winyes FALSE FALSE
## best_actor_winyes FALSE FALSE
## best_actress_winyes FALSE FALSE
## best_dir_winyes FALSE FALSE
## top200_boxyes FALSE FALSE
## 1 subsets of each size up to 29
## Selection Algorithm: backward
Stepwise Elimination Selection Algorithm (step, starting from the full model)
load("C:/R Data Files/movies.RData")
attach(movies)
## The following objects are masked from movies (pos = 3):
##
## actor1, actor2, actor3, actor4, actor5, audience_rating,
## audience_score, best_actor_win, best_actress_win,
## best_dir_win, best_pic_nom, best_pic_win, critics_rating,
## critics_score, director, dvd_rel_day, dvd_rel_month,
## dvd_rel_year, genre, imdb_num_votes, imdb_rating, imdb_url,
## mpaa_rating, rt_url, runtime, studio, thtr_rel_day,
## thtr_rel_month, thtr_rel_year, title, title_type, top200_box
model <- lm(audience_score ~ ., data = df)
#summary(model)
final_model <- step(model, direction = "both", trace = FALSE)
summary(final_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2299 -4.3817 0.5075 4.3523 24.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.75008 2.66718 -3.281 0.00109 **
## genreAnimation 3.20421 2.45639 1.304 0.19256
## genreArt House & International -2.66418 2.02406 -1.316 0.18856
## genreComedy 1.35890 1.13110 1.201 0.23005
## genreDocumentary 0.16974 1.39624 0.122 0.90328
## genreDrama -0.79769 0.96859 -0.824 0.41050
## genreHorror -2.09686 1.67267 -1.254 0.21045
## genreMusical & Performing Arts 2.60624 2.18644 1.192 0.23371
## genreMystery & Suspense -3.13253 1.24911 -2.508 0.01240 *
## genreOther -0.02899 1.92785 -0.015 0.98801
## genreScience Fiction & Fantasy -0.08638 2.43485 -0.035 0.97171
## runtime -0.02547 0.01567 -1.626 0.10445
## imdb_rating 9.47342 0.45974 20.606 < 2e-16 ***
## critics_score 0.02177 0.01502 1.450 0.14759
## audience_ratingUpright 20.05406 0.77665 25.821 < 2e-16 ***
## best_pic_nomyes 3.52360 1.59063 2.215 0.02710 *
## best_actress_winyes -1.29501 0.89570 -1.446 0.14873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.823 on 633 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8863
## F-statistic: 317.2 on 16 and 633 DF, p-value: < 2.2e-16
Figure 1. Summary of the final model relating audience score to genre, runtime, IMDB rating, critics score, audience rating, best picture nomination and best actress award.
1. Determining the range and concentration of the values of the X variables, and whether
there are any outliers, with pairs plots.
pairs(~ critics_score + genre + audience_score+ best_pic_win+best_actor_win
+best_actress_win+best_dir_win, data=df)
2. Residuals and leverage. We can compute and plot the leverage of each point using the
following commands:
results <- lm(critics_score ~ genre + audience_score, data = df)
lev <- hat(model.matrix(results))  # leverage (hat) values for each observation
plot(lev, col = rainbow(7))
3. Residual plots.
We can use residuals to study whether:
1.The regression function is nonlinear.
2.The error terms have nonconstant variance.
3.The error terms are not independent.
4.There are outliers.
5.The error terms are not normally distributed.
par(mfrow=c(1,3))
plot(df$audience_score, results$res,col=rainbow(7))
plot(df$genre, results$res,col=rainbow(7))
plot(results$fitted, results$res,col=rainbow(7))
To make the same residual plots using the studentized residuals type:
r <- rstudent(results)
plot(df$genre, r, col = rainbow(7))
plot(results$fitted, r, col = rainbow(7))
4. A normal probability plot of the residuals can be used to check the normality assumption.
Here each residual is plotted against its expected value under normality. To make normal
probability plots, as well as a histogram, type:
qqnorm(results$res,col=rainbow(7))
qqline(results$res,col=rainbow(7))
hist(results$res,col=rainbow(7))
layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
barplot(summary(df$genre,maxsum=6),
main="a) Genre of movie ")
barplot(summary(df$best_pic_win),
main="b) Whether best picture Oscar (no, yes)")
barplot(summary(df$best_actor_win),
main="c) Whether or not actor ever won an Oscar (no, yes)")
barplot(summary(df$best_actress_win),
main="d) Whether or not actresses ever won an Oscar (no, yes)")
barplot(summary(df$best_dir_win),
main="e) Whether or not the director ever won an Oscar (no, yes)")
Figure 2. Barplots of movie genre, best picture award, best actor award, best actress award, and best director award.
For numerical variables we used histograms and boxplots. The average runtime of a movie is around
105 minutes and its distribution is slightly right skewed, with some outliers around 250 minutes.
The critics scores are nearly uniform but slightly left skewed. The scores given by the audience
are also nearly uniform with a left-skewed distribution.
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow = TRUE))
boxplot(df$runtime,xlab="Runtime",
main="a) Boxplot of df runtime")
hist(df$runtime, xlab="Runtime",
main="b) Histogram of df runtime")
boxplot(df$critics_score,xlab="Critics score",
main="c) Boxplot of critics score")
hist(df$critics_score, xlab="Critics score",
main="d) Histogram of critics score")
boxplot(df$audience_score,xlab="Audience score",
main="e) Boxplot of audience score")
hist(df$audience_score, xlab="Audience score",
main="f) Histogram of audience score")
Figure 3. Boxplots and histograms of runtime, critics score and audience score.
Decisions about which explanatory variables are added to the model are made by the computer
based entirely on statistical criteria. The forward method starts from scratch: the computer
searches the specified list of possible explanatory variables for the one with the strongest
correlation with the outcome and enters it first. It continues to add variables in order of
how much additional (unique) variance they explain, and stops only when no remaining variable
explains additional (unique) variance not already accounted for by the variables already entered.
The backward method does the opposite: it begins with all of the specified potential
explanatory variables included in the model and then removes those which are not making
a significant contribution. The stepwise option is similar but uses both forward and
backward criteria for deciding when to add or remove an explanatory variable. The forward
selection strategy based on highest R^2 starts with no variables in the model and adds
variables until there is no further improvement in R^2. The second method, backward
elimination, begins with the largest model and eliminates variables until there is no further
improvement in the model. The former uses the highest R^2 as the selection criterion, while the
latter removes variables whose p-values exceed 0.05 (a short sketch contrasting the two
directions follows below).
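The sketch below contrasts the two directions using step(), assuming df (the subset built above) is in memory; note that step() actually uses AIC as its selection criterion rather than R^2 or p-values.
upper_formula <- reformulate(setdiff(names(df), "audience_score"),
                             response = "audience_score")   # all candidate predictors
null_model <- lm(audience_score ~ 1, data = df)              # intercept-only starting point
full_model <- lm(upper_formula, data = df)                   # model with every predictor
forward_fit  <- step(null_model, scope = upper_formula,
                     direction = "forward", trace = FALSE)   # add terms one at a time
backward_fit <- step(full_model,
                     direction = "backward", trace = FALSE)  # drop terms one at a time
formula(forward_fit)                                         # terms chosen going forward
formula(backward_fit)                                        # terms chosen going backward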
In this project I will use backward elimination, starting from a model that includes all of the
variables described in the previous section. Backward elimination will help us determine whether
better results can be obtained with a smaller set of attributes. Its advantage is that it allows
us to start with all the variables and delete one variable at a time until there is no further
improvement in the model.
First, let's fit an initial model with all of the selected variables. The stepwise procedure will
then drop the variables that do not make a significant contribution.
model <- lm(audience_score~.,data=df)
#summary(model)
I will make use of the step function to construct the next model, which will be conclusive
for variable selection.
final_model<-step(model, direction = "both", trace=FALSE )
print(final_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Coefficients:
## (Intercept) genreAnimation
## -8.75008 3.20421
## genreArt House & International genreComedy
## -2.66418 1.35890
## genreDocumentary genreDrama
## 0.16974 -0.79769
## genreHorror genreMusical & Performing Arts
## -2.09686 2.60624
## genreMystery & Suspense genreOther
## -3.13253 -0.02899
## genreScience Fiction & Fantasy runtime
## -0.08638 -0.02547
## imdb_rating critics_score
## 9.47342 0.02177
## audience_ratingUpright best_pic_nomyes
## 20.05406 3.52360
## best_actress_winyes
## -1.29501
The final model retains seven explanatory variables (genre, runtime, imdb_rating, critics_score,
audience_rating, best_pic_nom and best_actress_win) and has an adjusted R-squared of 0.8863,
essentially the same as the full model. I use this model because it captures about 89% of the
variability in audience score with fewer terms. The imdb_rating and audience_rating variables
are the most significant predictors.
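As a quick check of this claim (assuming model and final_model from above are still in memory), the adjusted R-squared values of the full and the reduced model can be compared directly:
summary(model)$adj.r.squared        # full model with all selected variables
summary(final_model)$adj.r.squared  # reduced model chosen by step()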
plot(final_model$residuals~df$runtime,xlab="Runtime",ylab="Residuals",
main="a) Residuals vs. Runtime",col=rainbow(7))
plot(final_model$residuals~df$critics_score,xlab="Critics score",ylab="Residuals",
main="b) Residuals vs critics score",col=rainbow(7))
Figure 4. Residuals plots for Runtime and critics score. The residuals seem scattered
around 0 for both Runtime and Critics score (Figure 4 a and b).
hist(final_model$residuals,ylab="Residuals",
main="a) Histogram of smalll_model residuals")
qqnorm(final_model$residuals,
main="b) Normal probability of residuals")
qqline(final_model$residuals)
Figure 5. Histogram and Q-Q plot of residuals.
Figure 5 shows that the residuals have little skewness, but there are some deviations from
normality in the tails of the Q-Q plot (5b).
plot(final_model$residuals~final_model$fitted,
main="Residuals vs. fitted")
abline(0,0)
Figure 6. Residuals plot
Homoscedasticity: we can check that the residuals do not vary systematically with the predicted
values by plotting the residuals against the values predicted by the regression model. Here
the residuals are generally homoscedastic, although some degree of heteroscedasticity exists
at the right end of Figure 6, which makes the model less accurate in predicting higher values.
I wanted to predict the audience score for a new movie that was not used to fit the model.
For the movie "Teenage Mutant Ninja Turtles: Out of the Shadows (2016)" I extracted the
runtime, critics score and audience score from IMDB and Rotten Tomatoes; the genre was set
to "Action & Adventure". The real and the predicted audience scores were 55 and 69.57
respectively, with a 95% confidence interval between 62.58 and 76.56. In other words, we are
95% confident that the mean audience score of movies with these characteristics lies between
62.58 and 76.56.
new_movie <- data.frame(title_type="Feature Film",
genre="Action & Adventure",
runtime=112,
mpaa_rating="PG-13",
imdb_rating=6.1,
critics_rating="Certified Fresh",
critics_score=38,
audience_rating="Upright",
audience_score=55,
best_pic_nom="yes",
best_pic_win="yes",
best_actor_win="yes",
best_actress_win="no",
best_dir_win="yes",
top200_box="yes")
prediction_Ninja <- predict(model, newdata=new_movie, interval="confidence")
prediction_Ninja
## fit lwr upr
## 1 69.57239 62.58277 76.562
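Note that interval = "confidence" gives an interval for the mean audience score of movies with these characteristics. For a single new movie, a prediction interval, which is wider because it also includes the residual variation, may be more appropriate; a minimal sketch, assuming new_movie and model are still in memory:
predict(model, newdata = new_movie,
        interval = "prediction", level = 0.95)  # wider interval for one new observation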
Further Diagnostic functions
I have used some additional diagnostics for evaluation of the above analysis. These are
given below.
coefficients(final_model) # model coefficients
## (Intercept) genreAnimation
## -8.75007740 3.20421210
## genreArt House & International genreComedy
## -2.66417705 1.35889547
## genreDocumentary genreDrama
## 0.16973937 -0.79768827
## genreHorror genreMusical & Performing Arts
## -2.09685534 2.60624302
## genreMystery & Suspense genreOther
## -3.13253261 -0.02898621
## genreScience Fiction & Fantasy runtime
## -0.08638193 -0.02547179
## imdb_rating critics_score
## 9.47341810 0.02177056
## audience_ratingUpright best_pic_nomyes
## 20.05406062 3.52359708
## best_actress_winyes
## -1.29501450
confint(final_model, level=0.95) # CIs for model parameters
## 2.5 % 97.5 %
## (Intercept) -13.987662766 -3.512492029
## genreAnimation -1.619455999 8.027880201
## genreArt House & International -6.638854191 1.310500097
## genreComedy -0.862260310 3.580051249
## genreDocumentary -2.572078471 2.911557213
## genreDrama -2.699725465 1.104348925
## genreHorror -5.381508923 1.187798234
## genreMusical & Performing Arts -1.687301338 6.899787370
## genreMystery & Suspense -5.585429521 -0.679635705
## genreOther -3.814748466 3.756776036
## genreScience Fiction & Fantasy -4.867744212 4.694980355
## runtime -0.056234674 0.005291097
## imdb_rating 8.570626504 10.376209689
## critics_score -0.007715782 0.051256893
## audience_ratingUpright 18.528928915 21.579192328
## best_pic_nomyes 0.400051430 6.647142723
## best_actress_winyes -3.053923603 0.463894600
fit <- fitted(final_model) # predicted values
summary(fit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.414 45.290 70.500 62.350 78.580 93.040
tres <- residuals(final_model) # residuals
summary(tres)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -21.2300 -4.3820 0.5075 0.0000 4.3520 24.1500
anova(final_model) # anova table
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 51633 5163 110.9111 < 2.2e-16 ***
## runtime 1 6236 6236 133.9459 < 2.2e-16 ***
## imdb_rating 1 145918 145918 3134.4033 < 2.2e-16 ***
## critics_score 1 885 885 19.0178 1.511e-05 ***
## audience_rating 1 31298 31298 672.3087 < 2.2e-16 ***
## best_pic_nom 1 191 191 4.0976 0.04336 *
## best_actress_win 1 97 97 2.0904 0.14873
## Residuals 633 29468 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
covar <- vcov(final_model) # covariance matrix for model parameters
summary(covar)
## (Intercept) genreAnimation genreArt House & International
## Min. :-1.42530 Min. :-1.425296 Min. :-0.560849
## 1st Qu.:-0.89643 1st Qu.:-0.001702 1st Qu.: 0.000637
## Median :-0.40278 Median : 0.727011 Median : 0.700980
## Mean : 0.04481 Mean : 0.652814 Mean : 0.595612
## 3rd Qu.: 0.01316 3rd Qu.: 0.749107 3rd Qu.: 0.755176
## Max. : 7.11383 Max. : 6.033875 Max. : 4.096804
## genreComedy genreDocumentary genreDrama
## Min. :-1.0465266 Min. :-0.3759 Min. :-0.51092
## 1st Qu.:-0.0006876 1st Qu.:-0.0028 1st Qu.:-0.00763
## Median : 0.7163751 Median : 0.7281 Median : 0.71852
## Mean : 0.3898457 Mean : 0.5113 Mean : 0.41763
## 3rd Qu.: 0.7220625 3rd Qu.: 0.8139 3rd Qu.: 0.76528
## Max. : 1.2793798 Max. : 1.9495 Max. : 0.93817
## genreHorror genreMusical & Performing Arts
## Min. :-0.993640 Min. :-0.199098
## 1st Qu.: 0.002677 1st Qu.:-0.001355
## Median : 0.700980 Median : 0.690401
## Mean : 0.489095 Mean : 0.680041
## 3rd Qu.: 0.736532 3rd Qu.: 0.785155
## Max. : 2.797825 Max. : 4.780500
## genreMystery & Suspense genreOther
## Min. :-0.4027783 Min. :-0.648327
## 1st Qu.:-0.0005663 1st Qu.:-0.002233
## Median : 0.7163751 Median : 0.722063
## Mean : 0.4571015 Mean : 0.565469
## 3rd Qu.: 0.7589054 3rd Qu.: 0.758905
## Max. : 1.5602710 Max. : 3.716620
## genreScience Fiction & Fantasy runtime imdb_rating
## Min. :-1.034839 Min. :-1.804e-02 Min. :-0.896426
## 1st Qu.: 0.000327 1st Qu.:-1.355e-03 1st Qu.:-0.044359
## Median : 0.718849 Median : 1.143e-05 Median :-0.003888
## Mean : 0.676500 Mean :-8.854e-04 Mean :-0.055338
## 3rd Qu.: 0.732801 3rd Qu.: 1.103e-03 3rd Qu.: 0.017312
## Max. : 5.928499 Max. : 3.911e-03 Max. : 0.211357
## critics_score audience_ratingUpright best_pic_nomyes
## Min. :-0.0038883 Min. :-0.16364 Min. :-0.20995
## 1st Qu.:-0.0022330 1st Qu.:-0.04908 1st Qu.:-0.04533
## Median :-0.0014056 Median :-0.01836 Median :-0.00763
## Mean :-0.0005441 Mean : 0.05427 Mean : 0.18050
## 3rd Qu.:-0.0001545 3rd Qu.: 0.02257 3rd Qu.: 0.06148
## Max. : 0.0131628 Max. : 0.73329 Max. : 2.53010
## best_actress_winyes
## Min. :-0.2099536
## 1st Qu.:-0.0938484
## Median :-0.0177532
## Mean : 0.0136220
## 3rd Qu.:-0.0001545
## Max. : 0.8022842
infl <- influence(final_model) # regression diagnostics
summary(infl)
## Length Class Mode
## hat 650 -none- numeric
## coefficients 11050 -none- numeric
## sigma 650 -none- numeric
## wt.res 650 -none- numeric
plot(residuals(final_model),col=rainbow(7))
plot(residuals(model),col=rainbow(7))
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
Outliers: assessing outliers.
car::outlierTest(final_model)# Shows outliers
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 123 3.622165 0.00031559 0.20514
The Bonferroni p-value for the most extreme observation is shown above; next, a Q-Q plot of the
studentized residuals and leverage plots.
require(car)
car::qqPlot(final_model, main="QQ Plot") #qq plot for studentized resid
leveragePlots(final_model) # leverage plots
Influential observations: added-variable plots. This function constructs added-variable
(also called partial-regression) plots for linear and generalized linear models.
require(car)
car::avPlots(final_model)
Cook's D plot: identify D values > 4/(n-k-1).
cutoff <- 4/((nrow(df)-length(final_model$coefficients)-2))
plot(final_model, which=4, cook.levels=cutoff)
Influence Plot
influencePlot(final_model,id.method="identify", main="Influence Plot",
sub="Circle size is proportial to Cook's Distance")
Normality of residuals: Q-Q plot of the studentized residuals.
qqPlot(final_model, main="QQ Plot")
Distribution of studentized residuals
library(MASS)
sresid <- studres(final_model)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit,col=rainbow(7))
Non-constant error variance: evaluate homoscedasticity with the non-constant variance score test.
ncvTest(final_model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 55.1253 Df = 1 p = 1.130859e-13
Plot of studentized residuals vs. fitted values (spread-level plot):
spreadLevelPlot(final_model,col=rainbow(5))
##
## Suggested power transformation: 1.428811
Multicollinearity: evaluate collinearity. For the target multiple regression model, because all
VIF values are less than 10, there is no evidence of multicollinearity.
require(fmsb)
res <- lm(critics_score ~ genre+audience_score, data=df)
summary(res)
##
## Call:
## lm(formula = critics_score ~ genre + audience_score, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.410 -13.557 2.582 14.441 50.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.16813 3.34770 -2.141 0.032636 *
## genreAnimation 0.96657 7.01397 0.138 0.890437
## genreArt House & International 0.98163 5.81881 0.169 0.866086
## genreComedy 0.64434 3.22918 0.200 0.841908
## genreDocumentary 18.67671 3.88761 4.804 1.94e-06 ***
## genreDrama 10.34083 2.73506 3.781 0.000171 ***
## genreHorror 9.71695 4.78999 2.029 0.042915 *
## genreMusical & Performing Arts 11.39751 6.28868 1.812 0.070397 .
## genreMystery & Suspense 11.54559 3.54243 3.259 0.001176 **
## genreOther 11.78540 5.52343 2.134 0.033247 *
## genreScience Fiction & Fantasy 11.18576 7.00537 1.597 0.110818
## audience_score 0.90358 0.04256 21.230 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.69 on 638 degrees of freedom
## Multiple R-squared: 0.5277, Adjusted R-squared: 0.5195
## F-statistic: 64.8 on 11 and 638 DF, p-value: < 2.2e-16
Multicollinearity is nearly always a problem in multiple regression models:
1) Even small degrees of multicollinearity can cause serious problems for an analysis
if you are interested in the effects of individual predictors.
2) Small samples are particularly vulnerable to multicollinearity problems because
multicollinearity reduces your effective sample size for the effects of individual
predictors
3) There are no 'easy' solutions (e.g., dropping predictors is generally a bad idea)
Checking multicollinearity of the independent variables.
car::vif(final_model)
## GVIF Df GVIF^(1/(2*Df))
## genre 1.646620 10 1.025250
## runtime 1.293612 1 1.137371
## imdb_rating 3.467809 1 1.862205
## critics_score 2.537211 1 1.592862
## audience_rating 2.055665 1 1.433759
## best_pic_nom 1.155186 1 1.074796
## best_actress_win 1.103369 1 1.050414
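To illustrate what these values mean (a sketch, assuming df is in memory): the VIF of a numeric predictor equals 1/(1 - R^2) from regressing that predictor on the other explanatory variables in the model. For runtime, this should roughly reproduce the GVIF reported above:
aux <- lm(runtime ~ genre + imdb_rating + critics_score +
            audience_rating + best_pic_nom + best_actress_win,
          data = df)                       # regress runtime on the other predictors
1 / (1 - summary(aux)$r.squared)           # manual VIF for runtime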
Nonlinearity: evaluate nonlinearity with component + residual (partial residual) plots.
crPlots(final_model)
Evaluate nonlinearity with CERES plots:
require(car)
ceresPlots(final_model,col=rainbow(2))
## Warning in ceresPlots(final_model, col = rainbow(2)): Factors skipped in
## drawing CERES plots.
Additional diagnostic help: the gvlma() function in the gvlma package performs a global
validation of linear model assumptions as well as separate evaluations of skewness, kurtosis,
and heteroscedasticity. Global test of model assumptions:
library(gvlma)
gvmodel <- gvlma(final_model)
summary(gvmodel)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2299 -4.3817 0.5075 4.3523 24.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.75008 2.66718 -3.281 0.00109 **
## genreAnimation 3.20421 2.45639 1.304 0.19256
## genreArt House & International -2.66418 2.02406 -1.316 0.18856
## genreComedy 1.35890 1.13110 1.201 0.23005
## genreDocumentary 0.16974 1.39624 0.122 0.90328
## genreDrama -0.79769 0.96859 -0.824 0.41050
## genreHorror -2.09686 1.67267 -1.254 0.21045
## genreMusical & Performing Arts 2.60624 2.18644 1.192 0.23371
## genreMystery & Suspense -3.13253 1.24911 -2.508 0.01240 *
## genreOther -0.02899 1.92785 -0.015 0.98801
## genreScience Fiction & Fantasy -0.08638 2.43485 -0.035 0.97171
## runtime -0.02547 0.01567 -1.626 0.10445
## imdb_rating 9.47342 0.45974 20.606 < 2e-16 ***
## critics_score 0.02177 0.01502 1.450 0.14759
## audience_ratingUpright 20.05406 0.77665 25.821 < 2e-16 ***
## best_pic_nomyes 3.52360 1.59063 2.215 0.02710 *
## best_actress_winyes -1.29501 0.89570 -1.446 0.14873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.823 on 633 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8863
## F-statistic: 317.2 on 16 and 633 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = final_model)
##
## Value p-value Decision
## Global Stat 18.7023 0.0008992 Assumptions NOT satisfied!
## Skewness 0.2177 0.6408296 Assumptions acceptable.
## Kurtosis 6.5022 0.0107744 Assumptions NOT satisfied!
## Link Function 11.7270 0.0006160 Assumptions NOT satisfied!
## Heteroscedasticity 0.2555 0.6132416 Assumptions acceptable.
anova(final_model,test="Chisq")
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 51633 5163 110.9111 < 2.2e-16 ***
## runtime 1 6236 6236 133.9459 < 2.2e-16 ***
## imdb_rating 1 145918 145918 3134.4033 < 2.2e-16 ***
## critics_score 1 885 885 19.0178 1.511e-05 ***
## audience_rating 1 31298 31298 672.3087 < 2.2e-16 ***
## best_pic_nom 1 191 191 4.0976 0.04336 *
## best_actress_win 1 97 97 2.0904 0.14873
## Residuals 633 29468 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vartest <- anova(final_model,test="Chisq")
plot(vartest)
Determining the popularity of a movie is not a simple task. The hidden attributes of
a movie seem to have some degree of correlation with its popularity. From the regression
analysis it appears that variables such as critics score, audience score, IMDB rating
and runtime have significant correlations. The movie genre, runtime and critics score can
be used to predict the popularity of a movie. However, given the shortcomings of the current
subsetting, projecting popularity with ratings external to IMDB and Rotten Tomatoes could be
more useful. This issue can be addressed in future studies.