Synopsis

This analysis performs exploratory data anlaysis and linear regression, modeling, and prediction with a data set of 651 randomly sampled movies. The movie data used in the analysis was sourced from IMDB and Rotten Tomatoes APIs.

Setup

We will first prepare the workspace environment by setting global options.

Set Global Options

#Install Knitr pckage if necessary and load Knitr library
list.of.packages <- c("knitr")
new.packages <-
list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings (suppressMessages (library (knitr)))
knitr::opts_chunk$set(
fig.width = 8,
fig.height = 4,
fig.path = 'figures/DataAnalysisProject_',
echo = TRUE,
warning = FALSE,
message = FALSE
)
#Clear variables
rm (list = ls (all = TRUE))
#Get and set working directory
setwd (getwd ())

Load Packages

Install and load required libraries if neccessary.

#Check installed status of requried packages, and install if necessary
list.of.packages <-
  c("statsr", "dplyr", "ggplot2", "kableExtra", "olsrr", "lmtest", "car")
new.packages <-
  list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
  install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings (suppressMessages (library (statsr)))
suppressWarnings (suppressMessages (library (dplyr)))
suppressWarnings (suppressMessages (library (ggplot2)))
suppressWarnings (suppressMessages (library (kableExtra)))
suppressWarnings (suppressMessages (library (olsrr)))
suppressWarnings (suppressMessages (library (lmtest)))
suppressWarnings (suppressMessages (library (car)))

Load data

Load the data set.

load ( url ("https://d18ky98rnyall9.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1516924800&Signature=WUuGqHLm1BK2nWJPFSb~ONIiNylTRh4amRdLaX0bWK8T1daAC~XVI~Kw4rHHXsWMMp2pa2HLTbEUhk4sW3TA1MNvtocSihOxm33Dc~oEJuQ7alE0Wmm4PLyTGi~XUvNejiyoVpugdfuDphurmxsEBtF2xo59jrRRjjY7XjfJIQs_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"))

Part 1: Data

According to the project’s code book, the observations in the “Movies” data set were randomly sampled from IMDB and Rotten Tomatoes APIs. For the sake of the analysis, we will assume the sampling method used was simple random sampling.

Since simple random sampling was used, we know that each population observation has an equal chance of being selected. Thus, we can infer generalizability about the “Movies” data set. Note that we cannot infer causality becase random assignment of the observations was not used.


Part 2: Research question

We will research the association and predictive value between the response variable “Audience Score on Rotten Tomatoes,” and multiple explanatory variables to build a multiple linear regression model. Variable selection criteria will be covered in Part 4: Modeling. Note the explanatory variables used in the model were selected based on their contextual relevance, and we won’t be excluding variables upfront other than those that have obviously no value.

The “Audience score on Rotten Tomatoes” variable as a response is of interest because Rotten Tomatoes is more of a “popularity” score generator based on likes and dislikes, and as percentage score, this metric is easily digested. So, as a very simplified and condensed metric, uncovering any meaningful relationships will be interesting.


Part 3: Exploratory data analysis

For our Exploratory Data Analysis, let’s first get familiar with our response variable, “Audience Score on Rotten Tomatoes,” by determining its distribution and summary statistics.

Using a histogram, we will survey the normality.

ggplot(data = movies, aes(movies$audience_score)) +
  geom_histogram(
    breaks = seq(0, 100, by = 10),
    col = "white",
    fill = "blue",
    alpha = 1
  ) +
  scale_x_continuous(
    expand = c(0, 0),
    limits = c(0, 100),
    breaks = seq(0, 100, by = 10)
  ) +
  scale_y_continuous(
    expand = c(0, 0),
    limits = c(0, 150),
    breaks = seq(0, 150, by = 25)
  ) +
  labs(title = "Histogram for Audience Score on Rotten Tomatoes") +
  labs(x = "Audience Score", y = "Count")

There data is slightly skewed to the left, suggesting Rotten Tomatoes reviews tend to be more positive. Let’s calculate summary statistics.

#Compute summary stats
AudienceScoreSummary <- movies %>%
select (audience_score) %>%
filter(audience_score != "NA") %>%
summarise (
Total = n (),
MinAudienceScore = min(audience_score, na.rm = TRUE),
MaxAudienceScore = max(audience_score, na.rm = TRUE),
AverageAudienceScore = mean(audience_score, na.rm = TRUE),
MedianAudienceScore = median(audience_score, na.rm = TRUE),
AudienceScoreIQR = IQR(audience_score, na.rm = TRUE)
)

#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
AudienceScoreSummary %>%
kable("html") %>%
kable_styling()
Total MinAudienceScore MaxAudienceScore AverageAudienceScore MedianAudienceScore AudienceScoreIQR
651 11 97 62.36252 65 34

We see the median score is higher than the mean score, confirming the left skewness. Now, let’s compare the distribution of the rating on IMDB to see any potential bias. We will again use a histogram to survey the normality.

ggplot(data = movies, aes(movies$imdb_rating)) +
  geom_histogram(
    breaks = seq(0, 10, by = 1),
    col = "white",
    fill = "blue",
    alpha = 1
  ) +
  scale_x_continuous(
    expand = c(0, 0),
    limits = c(0, 10),
    breaks = seq(0, 10, by = 1)
  ) +
  scale_y_continuous(
    expand = c(0, 0),
    limits = c(0, 250),
    breaks = seq(0, 250, by = 25)
  ) +
  labs(title = "Histogram for IMDB Rating") +
  labs(x = "IMDB Rating", y = "Count")

There data is also slightly skewed to the left, suggesting IMDB ratings tend to be more positive. Let’s calculate summary statistics.

#Compute summary stats
IMDBRatingSummary <- movies %>%
  select (imdb_rating) %>%
  filter(imdb_rating != "NA") %>%
  summarise (
  Total = n (),
  MinIMDBRating = min(imdb_rating, na.rm = TRUE),
  MaxIMDBRating = max(imdb_rating, na.rm = TRUE),
  AverageIMDBRating = mean(imdb_rating, na.rm = TRUE),
  MedianIMDBRating = median(imdb_rating, na.rm = TRUE),
  IMDBRatingIQR = IQR(imdb_rating, na.rm = TRUE)
  )

#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
IMDBRatingSummary %>%
  kable("html") %>%
  kable_styling()
Total MinIMDBRating MaxIMDBRating AverageIMDBRating MedianIMDBRating IMDBRatingIQR
651 1.9 9 6.493088 6.6 1.4

Again, we see the median score is higher than the mean score, confirming the left skewness of the IMDB Ratings. At this point, we want to call out potential bias in the sample and its potential effects on generalizability.

However, this bias may disappear in a multi-variate view. Let’s breakdown our Rotten Tomatoes Audience Score by MPAA rating.

ggplot(data = subset(movies, !is.na(mpaa_rating) &
                       !is.na(audience_score)),
                       aes(x = mpaa_rating, y = audience_score)) +
                       geom_boxplot(fill = "#56B4E9") +
                       labs(title = "Rotten Tomatoes Audience Score by MPAA Rating", x = "MPAA Rating", y = "Audience Score")

We can see that “Unrated” movies have overwhelmingly positive reviews with little variation, and PG-13 movies have the lowest median score with much variability. To get a more detailed view of skewness, let’s look at a quantile-quantile plot.

qplot(
  sample = audience_score,
  data = subset(movies,!is.na(mpaa_rating) &
                  !is.na(audience_score)),
  color = mpaa_rating,
  shape = mpaa_rating
)

Left-skewness is definitive except for the PG-13 rating. Let’s see what the summary statistics say.

#Compute summary stats
AudienceScoreSummaryMPAA <- movies %>%
  select (mpaa_rating, audience_score) %>%
  filter(audience_score != "NA") %>%
  group_by (mpaa_rating) %>%
  summarise (
    Total = n (),
    MinAudienceScore = min(audience_score, na.rm = TRUE),
    MaxAudienceScore = max(audience_score, na.rm = TRUE),
    AverageAudienceScore = mean(audience_score, na.rm = TRUE),
    MedianAudienceScore = median(audience_score, na.rm = TRUE),
    AudienceScoreIQR = IQR(audience_score, na.rm = TRUE)
  ) %>%
  arrange (desc(AverageAudienceScore))

#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
AudienceScoreSummaryMPAA %>%
  kable("html") %>%
  kable_styling()
mpaa_rating Total MinAudienceScore MaxAudienceScore AverageAudienceScore MedianAudienceScore AudienceScoreIQR
Unrated 50 19 96 78.48000 81.5 14.00
G 19 18 92 68.47368 74.0 25.50
NC-17 2 56 71 63.50000 63.5 7.50
R 329 14 97 62.04255 64.0 35.00
PG 118 13 93 61.83051 65.0 32.75
PG-13 133 11 94 56.67669 56.0 26.00

We can conclude that left-skenewness is pervasive for almost all of the MPAA ratings except the normally distributed PG-13 rating.

As an additional angle on multi-variate analysis of skewness, we will look at the movie genre and repeat our set of visuals and summary stats.

ggplot(data = subset(movies,!is.na(genre) &
                       !is.na(audience_score)),
                       aes(x = genre, y = audience_score)) +
                       geom_boxplot(fill = "#56B4E9") +
                       labs(title = "Rotten Tomatoes Audience Score by Genre", x = "Genre", y = "Audience Score") +
                       theme(axis.text.x = element_text(angle = 90, hjust = 1))

Genre variability is incredibly diverse. We can see that a movie’s rating is definitely affected by its genre, where documentaries fare much better than action movies. This revelation begs the question of significance between genres, which we may wan to address in additional analysis. Let’s look at the quantile-quantile plot.

p <- qplot(
  sample = audience_score,
  data = subset(movies, !is.na(genre) &
  !is.na(audience_score)),
  color = genre,
  shape = genre
  )
  p + scale_shape_manual(values = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))

The “Documentary” and “Musical & Performing Arts” genres are notable examples of left-skew. Now for the summary stats.

#Compute summary stats
AudienceScoreSummaryGenre <- movies %>%
  select (genre, audience_score) %>%
  filter(audience_score != "NA") %>%
  group_by (genre) %>%
  summarise (
    Total = n (),
    MinAudienceScore = min(audience_score, na.rm = TRUE),
    MaxAudienceScore = max(audience_score, na.rm = TRUE),
    AverageAudienceScore = mean(audience_score, na.rm = TRUE),
    MedianAudienceScore = median(audience_score, na.rm = TRUE),
    AudienceScoreIQR = IQR(audience_score, na.rm = TRUE)
  ) %>%
  arrange (desc(AverageAudienceScore))

#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
AudienceScoreSummaryGenre %>%
  kable("html") %>%
  kable_styling()
genre Total MinAudienceScore MaxAudienceScore AverageAudienceScore MedianAudienceScore AudienceScoreIQR
Documentary 52 57 96 82.75000 86.0 12.75
Musical & Performing Arts 12 55 95 80.16667 80.5 13.75
Other 16 21 91 66.68750 73.5 29.50
Drama 305 13 95 65.34754 70.0 28.00
Art House & International 14 29 86 64.00000 65.5 29.00
Animation 9 18 88 62.44444 65.0 11.00
Mystery & Suspense 59 15 97 55.94915 54.0 30.00
Action & Adventure 65 11 94 53.78462 52.0 28.00
Comedy 87 19 93 52.50575 50.0 30.50
Science Fiction & Fantasy 9 17 85 50.88889 47.0 53.00
Horror 23 24 84 45.82609 43.0 17.50

The “Drama” genre had the highest number of observations, and also had the 4th largest mean Audience score. This is helping to shift our global average.


Part 4: Modeling

As mentioned in Part 2, the “Audience Score on Rotten Tomatoes” (audience_score) is our dependent variable. The below explanatory variables will be excluded from modeling because in terms of dimensional richness, they offer no meaningful information gain or they add model noise because they are very high-cardinality (unique). For example, the “Link to Rotten Tomatoes page for the movie” variable would have virtually no model value due to its tangential nature and its high-cardinality - in other words, a URL has no domain context from a decision making perspective.

The below explanatory variables will be included in our modeling as they offer the most information gain potential and dimensional richness.

To build our mulitple linear regression model, we will use stepwise forward selection via the “ols_step_forward” function from the “olsrr” R package. The function will build the model from our set of explanatory variables by entering predictors based on p-values in a stepwise manner. The first variable to be added to the model is most the significant, and more variables are included until none of remaining variables are “significant” when added to the model.

Compared to stepwise backward selection, stepwise forward selection was used for its low resource overhead and thus ease of reproducibility. It is worth noting the forward selection key disadvantage, such that each addition of a new variable may undermine the significance of variables already in the model.

Let’s run the model.

#create the list of predictor coefficients
MultiRegression1 <- lm(
  audience_score ~ title_type +
  genre +
  runtime +
  mpaa_rating +
  thtr_rel_month +
  dvd_rel_month +
  imdb_rating +
  critics_rating +
  critics_score +
  best_pic_nom +
  best_pic_win +
  best_actor_win +
  best_actress_win +
  best_dir_win +
  top200_box,
  data = movies
  )

#generate the multi-regression model  
ols_step_forward (MultiRegression1, details = TRUE)
## Variable Selection Procedure
##  Dependent Variable: audience_score 
## 
##  Forward Selection: Step 1 
## 
##  Variable imdb_rating Entered 
## 
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.865       RMSE                10.160 
## R-Squared               0.748       Coef. Var           16.291 
## Adj. R-Squared          0.748       MSE                103.219 
## Pred R-Squared          0.745       MAE                  7.742 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                   ANOVA                                   
## -------------------------------------------------------------------------
##                   Sum of                                                 
##                  Squares         DF    Mean Square       F          Sig. 
## -------------------------------------------------------------------------
## Regression    198831.499          1     198831.499    1926.312    0.0000 
## Residual       66988.946        649        103.219                       
## Total         265820.445        650                                      
## -------------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
## --------------------------------------------------------------------------------------------
## (Intercept)    -42.328         2.418                 -17.503    0.000    -47.077    -37.580 
## imdb_rating     16.123         0.367        0.865     43.890    0.000     15.402     16.845 
## --------------------------------------------------------------------------------------------
## 
## 
## Forward Selection: Step 2 
## 
##  Variable critics_score Entered 
## 
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.867       RMSE                10.079 
## R-Squared               0.752       Coef. Var           16.162 
## Adj. R-Squared          0.752       MSE                101.581 
## Pred R-Squared          0.749       MAE                  7.702 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                  ANOVA                                   
## ------------------------------------------------------------------------
##                   Sum of                                                
##                  Squares         DF    Mean Square       F         Sig. 
## ------------------------------------------------------------------------
## Regression    199995.983          2      99997.991    984.417    0.0000 
## Residual       65824.463        648        101.581                      
## Total         265820.445        650                                     
## ------------------------------------------------------------------------
## 
##                                      Parameter Estimates                                       
## ----------------------------------------------------------------------------------------------
##         model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------
##   (Intercept)    -37.032         2.864                 -12.930    0.000    -42.656    -31.408 
##   imdb_rating     14.658         0.566        0.786     25.901    0.000     13.546     15.769 
## critics_score      0.073         0.022        0.103      3.386    0.001      0.031      0.116 
## ----------------------------------------------------------------------------------------------
## 
## 
## Forward Selection: Step 3 
## 
##  Variable genre Entered 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.877       RMSE                9.817 
## R-Squared               0.769       Coef. Var          15.742 
## Adj. R-Squared          0.764       MSE                96.378 
## Pred R-Squared          0.759       MAE                 7.415 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                  ANOVA                                   
## ------------------------------------------------------------------------
##                   Sum of                                                
##                  Squares         DF    Mean Square       F         Sig. 
## ------------------------------------------------------------------------
## Regression    204331.585         12      17027.632    176.676    0.0000 
## Residual       61488.860        638         96.378                      
## Total         265820.445        650                                     
## ------------------------------------------------------------------------
## 
##                                               Parameter Estimates                                               
## ---------------------------------------------------------------------------------------------------------------
##                          model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
## ---------------------------------------------------------------------------------------------------------------
##                    (Intercept)    -37.153         3.109                 -11.950    0.000    -43.258    -31.048 
##                    imdb_rating     14.766         0.573        0.792     25.777    0.000     13.642     15.891 
##                  critics_score      0.067         0.021        0.094      3.122    0.002      0.025      0.109 
##                 genreAnimation      9.117         3.498        0.053      2.606    0.009      2.247     15.987 
## genreArt House & International      0.030         2.905        0.000      0.010    0.992     -5.674      5.734 
##                    genreComedy      2.092         1.614        0.035      1.296    0.195     -1.078      5.261 
##               genreDocumentary      1.194         1.968        0.016      0.607    0.544     -2.671      5.059 
##                     genreDrama     -0.203         1.380       -0.005     -0.147    0.883     -2.913      2.507 
##                    genreHorror     -5.028         2.387       -0.046     -2.106    0.036     -9.716     -0.340 
## genreMusical & Performing Arts      4.398         3.138        0.029      1.401    0.162     -1.765     10.561 
##        genreMystery & Suspense     -6.253         1.779       -0.089     -3.515    0.000     -9.746     -2.759 
##                     genreOther      1.582         2.763        0.012      0.573    0.567     -3.843      7.007 
## genreScience Fiction & Fantasy     -0.291         3.503       -0.002     -0.083    0.934     -7.170      6.588 
## ---------------------------------------------------------------------------------------------------------------
## Forward Selection Method                                                         
## 
## Candidate Terms:                                                                 
## 
## 1 . title_type                                                                   
## 2 . genre                                                                        
## 3 . runtime                                                                      
## 4 . mpaa_rating                                                                  
## 5 . thtr_rel_month                                                               
## 6 . dvd_rel_month                                                                
## 7 . imdb_rating                                                                  
## 8 . critics_rating                                                               
## 9 . critics_score                                                                
## 10 . best_pic_nom                                                                
## 11 . best_pic_win                                                                
## 12 . best_actor_win                                                              
## 13 . best_actress_win                                                            
## 14 . best_dir_win                                                                
## 15 . top200_box                                                                  
## 
## --------------------------------------------------------------------------------
##                                Selection Summary                                 
## --------------------------------------------------------------------------------
##         Variable                       Adj.                                         
## Step       Entered       R-Square    R-Square     C(p)         AIC        RMSE      
## --------------------------------------------------------------------------------
##    1    imdb_rating        0.7480      0.7476    49.1726    4870.0444    10.1597    
##    2    critics_score      0.7524      0.7516    39.0709    4860.6284    10.0787    
##    3    genre              0.7687      0.7643    -3.9862    4836.2722     9.8172    
## --------------------------------------------------------------------------------

Our model has selected the variables of IMDB Rating, Genre, and the Critics score on Rotten Tomatoes. Let’s peform diagnostics to evaluate how well the model fits the data. We will use the standard diagnostic plot set of:

First, we need to generate a new model based on the stepwise forward selection.

#generate the new model
MultiRegression2 <- lm(
  audience_score ~ 
    imdb_rating +
    critics_score +
    genre,
  data = movies
)

Next, we create our plots.

#plot diagnostics with the generated model
plot (MultiRegression2)

Now we can step through each plot to understand the diagnostic assessment.

To verify our residuals vs. fitted values’ heteroscedasticity, we will use a Breusch-Pagan Test and an NCV Test.

bptest (MultiRegression2)
## 
##  studentized Breusch-Pagan test
## 
## data:  MultiRegression2
## BP = 108.51, df = 12, p-value < 2.2e-16
ncvTest (MultiRegression2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 164.1416    Df = 1     p = 1.4087e-37

Since both tests have a p-value less than a significance level of 0.05, we can conclude heteroscedasticity exists.

To conclude our modeling section, we will provide the interpretation of the model coefficients. Here’s a summary of the model.

summary (MultiRegression2)
## 
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + genre, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.708  -6.446   0.614   5.479  50.111 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -37.15275    3.10900 -11.950  < 2e-16 ***
## imdb_rating                     14.76644    0.57286  25.777  < 2e-16 ***
## critics_score                    0.06687    0.02142   3.122 0.001879 ** 
## genreAnimation                   9.11698    3.49847   2.606 0.009375 ** 
## genreArt House & International   0.03008    2.90452   0.010 0.991740    
## genreComedy                      2.09167    1.61412   1.296 0.195493    
## genreDocumentary                 1.19414    1.96833   0.607 0.544281    
## genreDrama                      -0.20316    1.38018  -0.147 0.883022    
## genreHorror                     -5.02795    2.38740  -2.106 0.035591 *  
## genreMusical & Performing Arts   4.39791    3.13828   1.401 0.161588    
## genreMystery & Suspense         -6.25279    1.77914  -3.515 0.000472 ***
## genreOther                       1.58228    2.76266   0.573 0.567024    
## genreScience Fiction & Fantasy  -0.29079    3.50319  -0.083 0.933872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.817 on 638 degrees of freedom
## Multiple R-squared:  0.7687, Adjusted R-squared:  0.7643 
## F-statistic: 176.7 on 12 and 638 DF,  p-value: < 2.2e-16

Let’s review our coefficients:


Part 5: Prediction

At this point, we are ready to test the predictive ability of our multi-regession model. We will use the 2016 comedy film “Bad Santa 2” to predict the Rotten Tomatoes Audience Score.

Using the URLs https://www.rottentomatoes.com/m/bad_santa_2 and http://www.imdb.com/title/tt1798603/?ref_=ttls_li_tt, we see that our Rotten Tomatoes Critics Score is 23%, and the IMDB rating is 5.6.

Using R’s “predict” function, let’s run our model.

#Create a data frame for the variables we will pass to the model
NewMovie <- data.frame(imdb_rating = 5.6, critics_score = 23, genre = "Comedy" )
#Predict the Audience score with the provided variables
predict (MultiRegression2, NewMovie, interval="predict", level = 0.95)
##        fit      lwr      upr
## 1 49.16892 29.76966 68.56818

With the provided variables, the predicted Audience Score is 49.16892 which is generous compared to the actual Audience Score of 34.

The 95% prediction interval of the Audience Score for an IMDB Rating of 5.6, a Critic’s Score of 23, and a genre of “Comedy” is between 29.76966 % and 68.56818 %. If we were to collect additional sample movie data sets, there is a 95% probability that our provided variables will generate a predicted value within the prediction interval.

Our prediction interval width is allowing for a high degree of accuracy, but it is not very precise because it spans almost 40 percentage points. If we are concerned with precision, we may want to select a narrower prediction interval or increase our sample size.


Part 6: Conclusion

When looking at any ratings methodology, our desire is to see ratings that do not have inherent bias in so far that we don’t have too many critical or complimentary reviews. Our exploratory data analysis indicates that our given sample may actually have a complimentary bias. Therefore, we need to be careful with any inference or modeling applications.

Despite a resulting simple model, additional mulitple regression techniques, such as backward selection, should also be entertained to check model convergence and help refine prediction. We would also want to investigate the sampling method and experiment with larger sample sizes to increase our prediction precision. It is assumed that the IMDB and Rotten Tomatoe APIs would not prohibit us from obtaining an adequately sized data set.