Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

We investigate one question using the “movies” dataset, which combines information from Rotten Tomatoes and IMDB for 651 randomly sampled movies produced and released before 2016. Because the movies were randomly sampled, the results can be generalized to the population of movies the sample was drawn from, across all genres. This is an observational study with no random assignment, so we cannot draw causal conclusions. No volunteers were involved, which rules out voluntary response bias, and non-response bias does not apply either. However, there may be convenience bias, since the movies included are American productions.

Part 2: Research question

I am undertaking this project because my boss asked me to explore which attributes make a movie popular and to present her with something new about movies; I am also personally curious about the subject. I will therefore investigate the following question: “Which factors contribute to making a movie popular?”

Part 3: Exploratory data analysis

First, we exclude all rows with NAs in the variables used in the modelling process (see Part 4).

#excluding NAs in the modelling variables
c_movies <- movies %>%
  filter(!is.na(imdb_rating), !is.na(title_type), !is.na(genre), !is.na(runtime), !is.na(mpaa_rating), !is.na(imdb_num_votes), !is.na(critics_rating), !is.na(critics_score), !is.na(best_pic_nom), !is.na(best_pic_win), !is.na(best_actor_win), !is.na(best_actress_win), !is.na(best_dir_win), !is.na(top200_box))
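
The long chain of `!is.na()` conditions can also be written more compactly. The following is a minimal base R sketch on a made-up toy data frame (the data frame `toy` and its values are assumptions for illustration, not part of the movies data). It keeps only the rows that are complete in a chosen set of modelling columns.

```r
# Toy data frame standing in for `movies` (values are made up for illustration).
toy <- data.frame(
  imdb_rating = c(7.1, NA, 6.3, 5.8),
  runtime     = c(100, 95, NA, 120),
  studio      = c("A", "B", "C", NA)   # not a modelling variable
)

# Only the modelling variables should decide whether a row is dropped.
model_vars <- c("imdb_rating", "runtime")

# complete.cases() is TRUE for rows with no NA in the given columns.
toy_clean <- toy[complete.cases(toy[, model_vars]), ]

nrow(toy_clean)   # rows 1 and 4 survive; the NA in `studio` is ignored
```

Listing the modelling variables once in a character vector also makes it easier to keep the filter and the later model formulas in sync.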

For reasons explained in “Part 4: Modeling” the variables included in the Exploratory data analysis (EDA) are “imdb_rating”, “genre”, “runtime”, “imdb_num_votes”, “mpaa_rating” and “critics_score”.

#statistics of imdb_rating (ir)
c_movies %>%
  summarise(mean_ir = mean(imdb_rating), med_ir = median(imdb_rating))

## # A tibble: 1 x 2
##   mean_ir med_ir
##     <dbl>  <dbl>
## 1    6.49    6.6

#statistics of genre
c_movies <- c_movies %>% 
  mutate(numgenre = as.numeric(genre))  
c_movies %>%
  summarise(mean_ng = mean(numgenre), med_ng = median(numgenre))

## # A tibble: 1 x 2
##   mean_ng med_ng
##     <dbl>  <dbl>
## 1    5.55      6

#statistics of runtime
c_movies %>%
  summarise(mean_runtime = mean(runtime), med_runtime = median(runtime), sd_runtime = sd(runtime))

## # A tibble: 1 x 3
##   mean_runtime med_runtime sd_runtime
##          <dbl>       <dbl>      <dbl>
## 1         106.         103       19.4

#statistics of imdb_num_votes (inv)
c_movies %>%
  summarise(mean_inv = mean(imdb_num_votes), med_inv = median(imdb_num_votes), inv = sd(imdb_num_votes))

## # A tibble: 1 x 3
##   mean_inv med_inv     inv
##      <dbl>   <dbl>   <dbl>
## 1   57620.  15204. 112189.

#statistics of mpaa_rating (nmr: numeric factor codes)
c_movies <- c_movies %>% 
  mutate(nmr = as.numeric(mpaa_rating))
c_movies %>%
  summarise( mean_nmr = mean(nmr), med_nmr = median(nmr))

## # A tibble: 1 x 2
##   mean_nmr med_nmr
##      <dbl>   <dbl>
## 1     4.38       5

#statistics of critics_score (cs)
c_movies %>%
  summarise(mean_cs = mean(critics_score), med_cs = median(critics_score), sd_cs=sd(critics_score))

## # A tibble: 1 x 3
##   mean_cs med_cs sd_cs
##     <dbl>  <dbl> <dbl>
## 1    57.7     61  28.4

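The summary statistics show that the mean and the median are fairly close to one another for “imdb_rating”, “runtime” and “critics_score”. For “imdb_num_votes” the mean (57,620) is far above the median (15,204), which indicates a strongly right-skewed distribution. Note that the values for “genre” and “mpaa_rating” are computed from the integer codes of the factor levels; since these codes are arbitrary, their mean and median are not meaningful measures of center.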
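
For categorical variables such as “genre” and “mpaa_rating”, frequency counts are usually more informative than the mean of the integer level codes. The sketch below uses a small made-up factor (the vector `genre` here is a hypothetical stand-in, not the real column) to show how the codes depend on level order and how counts and proportions can be tabulated instead.

```r
# Hypothetical genre labels, for illustration only.
genre <- factor(c("Drama", "Comedy", "Drama", "Horror", "Drama", "Comedy"))

# The integer codes follow the (alphabetical) level order, so their mean
# has no substantive interpretation.
codes <- as.numeric(genre)

# Frequency table, most common level first.
counts <- sort(table(genre), decreasing = TRUE)

# Share of each level.
props <- round(prop.table(counts), 2)

counts
props
```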

#sectioning variable "genre" by factor
ggplot(c_movies, aes(x = factor(genre), y = imdb_rating)) +
  geom_boxplot() + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10)) + scale_y_continuous(breaks=seq(0, 10, 0.25))

#sectioning variable "runtime" by step of sd (rounding up)
ggplot(c_movies, aes(x = runtime, y = imdb_rating)) +
  geom_boxplot(aes(group = cut_width(runtime, 20))) + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10))  + scale_y_continuous(breaks=seq(0, 10, 0.25)) + scale_x_continuous(breaks=seq(0, 300, 60))

#sectioning variable "imdb_num_votes" by step of sd (rounding up)
ggplot(c_movies, aes(x = imdb_num_votes, y = imdb_rating)) +
  geom_boxplot(aes(group = cut_width(imdb_num_votes, 112189))) + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10)) + scale_y_continuous(breaks=seq(0, 10, 0.25))

#sectioning variable "critics_score" by step of sd (rounding up)
ggplot(c_movies, aes(x = critics_score, y = imdb_rating)) +
  geom_boxplot(aes(group = cut_width(critics_score, 29))) + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10)) + scale_y_continuous(breaks=seq(0, 10, 0.25))

In addition, all of the boxplots above show some outliers.

#correlation of numeric variables
c_movies_1 <- c_movies %>%
  select("runtime", "imdb_num_votes", "critics_score")
ggpairs(c_movies_1, columns = 1:3)

The variables “runtime”, “imdb_num_votes” and “critics_score” show no strong pairwise correlation, so using them together in one model should not cause collinearity problems.

# correlation of variables

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    txt <- paste(prefix, txt, sep="")
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(~  genre + runtime + imdb_num_votes + critics_score , data=c_movies, lower.panel=panel.smooth, upper.panel=panel.cor, pch=20, main=" Scatterplot Matrix 1")

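Scatterplot Matrix 1 likewise shows no strong pairwise correlation among “genre”, “runtime”, “imdb_num_votes” and “critics_score”, so using them together in one model should not cause collinearity problems. Since “genre” enters the matrix as integer factor codes, its correlations are only a rough indication.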
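
Pairwise correlations can miss collinearity that involves several predictors at once. One common complement is the variance inflation factor (VIF), computed as 1 / (1 - R^2) from regressing each predictor on the others. The following is a minimal sketch of that idea on the built-in `mtcars` data, used here only as a stand-in for the movies data; the choice of the columns `wt`, `hp` and `qsec` is an assumption for illustration.

```r
# Numeric predictors from a built-in dataset (stand-in for the movie variables).
predictors <- mtcars[, c("wt", "hp", "qsec")]

# Pairwise correlation matrix.
round(cor(predictors), 2)

# VIF for each predictor: regress it on all the other predictors and
# turn the resulting R^2 into 1 / (1 - R^2).
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; values well above roughly 5 to 10 are often treated as a warning sign, although that threshold is only a rule of thumb.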

Part 4: Modeling

The first step in building a model that shows whether a movie is going to become popular is to choose which variables of the dataset are useful for the statistical analysis. The variables “title”, “studio”, “imdb_url” and “rt_url” are excluded because they give no valuable information about what could make a movie popular. The variables “director”, “actor1”, “actor2”, “actor3”, “actor4” and “actor5” are excluded because their information is already captured by other variables (such as “best_actor_win”). The variables “thtr_rel_year”, “thtr_rel_month”, “thtr_rel_day”, “dvd_rel_year”, “dvd_rel_month” and “dvd_rel_day” also add no valuable information about a movie's popularity, so they are excluded as well. We are therefore left with the variables “imdb_rating”, “title_type”, “genre”, “runtime”, “mpaa_rating”, “imdb_num_votes”, “critics_rating”, “critics_score”, “best_pic_nom”, “best_pic_win”, “best_actor_win”, “best_actress_win”, “best_dir_win” and “top200_box”. Of these, “imdb_rating” is the dependent variable. Among the remaining variables we will determine which ones to use in order to reach a parsimonious model.

The model selection technique used is forward selection based on adjusted R^2. We choose it because we want a parsimonious model that is reliable for prediction, which is the next step. The outcome of this process is the model given below; the steps that lead to it follow.
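
Forward selection by adjusted R^2 can also be automated instead of fitting every candidate model by hand. The sketch below is one possible implementation, not the procedure used in this report: the function name `forward_adj_r2`, its arguments and the `min_gain` threshold are assumptions, and it runs on the built-in `mtcars` data as a stand-in. It does not handle the manual exclusion of collinear candidates, which would have to be done by pruning the candidate list beforehand.

```r
# Greedy forward selection: at each step add the candidate that gives the
# highest adjusted R^2, and stop when the gain is not larger than `min_gain`.
forward_adj_r2 <- function(data, response, candidates, min_gain = 0) {
  selected <- character(0)
  best_adj <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Adjusted R^2 of the current model plus each remaining candidate.
    adj <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (max(adj) - best_adj <= min_gain) break
    best_adj <- max(adj)
    selected <- c(selected, names(which.max(adj)))
  }
  list(selected = selected, adj_r2 = best_adj)
}

# Example on a built-in dataset (stand-in for c_movies).
result <- forward_adj_r2(mtcars, "mpg", c("wt", "hp", "qsec", "drat", "cyl"))
result$selected
result$adj_r2
```

Setting `min_gain` to a small positive value stops the search when an extra variable improves adjusted R^2 only marginally.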

#final model, forward selection - adjusted R^2
m_pmov_44 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime, data = c_movies)
summary(m_pmov_44)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + imdb_num_votes + genre + 
##     runtime, data = c_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9094 -0.3305  0.0380  0.3873  1.8059 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.381e+00  1.673e-01  26.185  < 2e-16 ***
## critics_score                   2.374e-02  1.018e-03  23.315  < 2e-16 ***
## imdb_num_votes                  1.938e-06  2.483e-07   7.804 2.48e-14 ***
## genreAnimation                 -1.634e-01  2.272e-01  -0.719  0.47231    
## genreArt House & International  5.454e-01  1.883e-01   2.896  0.00391 ** 
## genreComedy                    -1.161e-01  1.046e-01  -1.110  0.26758    
## genreDocumentary                7.861e-01  1.304e-01   6.029 2.80e-09 ***
## genreDrama                      2.118e-01  9.017e-02   2.349  0.01915 *  
## genreHorror                    -1.172e-01  1.551e-01  -0.756  0.45006    
## genreMusical & Performing Arts  5.548e-01  2.043e-01   2.716  0.00680 ** 
## genreMystery & Suspense         1.578e-01  1.152e-01   1.369  0.17137    
## genreOther                     -1.039e-02  1.787e-01  -0.058  0.95369    
## genreScience Fiction & Fantasy -4.185e-01  2.260e-01  -1.852  0.06446 .  
## runtime                         4.350e-03  1.451e-03   2.997  0.00283 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6347 on 636 degrees of freedom
## Multiple R-squared:  0.6646, Adjusted R-squared:  0.6577 
## F-statistic: 96.92 on 13 and 636 DF,  p-value: < 2.2e-16

To avoid filling many pages with the summary of every candidate model, I will only print the adjusted R^2 of each model at every step of the process. In the first step, the model with “critics_score” has the highest adjusted R^2, so this variable is picked. We then move on to add more variables to the model.

#step 1 picking: m_pmov_13


m_pmov_7 <- lm(imdb_rating ~ title_type , data = c_movies)


m_pmov_8 <- lm(imdb_rating ~  genre , data = c_movies)


m_pmov_9 <- lm(imdb_rating ~  runtime , data = c_movies)


m_pmov_10 <- lm(imdb_rating ~  mpaa_rating  , data = c_movies)


m_pmov_11 <- lm(imdb_rating ~  imdb_num_votes , data = c_movies)


m_pmov_12 <- lm(imdb_rating ~  critics_rating , data = c_movies)


m_pmov_13 <- lm(imdb_rating ~  critics_score , data = c_movies)


m_pmov_14 <- lm(imdb_rating ~  best_pic_nom , data = c_movies)


m_pmov_15 <- lm(imdb_rating ~  best_pic_win , data = c_movies)


m_pmov_16 <- lm(imdb_rating ~  best_actor_win , data = c_movies)


m_pmov_17 <- lm(imdb_rating ~  best_actress_win, data = c_movies)


m_pmov_18 <- lm(imdb_rating ~  best_dir_win , data = c_movies)


m_pmov_19 <- lm(imdb_rating ~  top200_box, data = c_movies)


summary_r_squared <- data.frame(model = c("-title_type", "-genre", "-runtime", "-mpaa_rating", "-imdb_num_votes", "-critics_rating", "-critics_score", "-best_pic_nom", "-best_pic_win", "-best_actor_win", "-best_actress_win", "-best_dir_win", "-top200_box"), 
                                adj.r.squared = c(summary(m_pmov_7)$adj.r.squared, summary(m_pmov_8)$adj.r.squared, summary(m_pmov_9)$adj.r.squared, summary(m_pmov_10)$adj.r.squared, summary(m_pmov_11)$adj.r.squared, summary(m_pmov_12)$adj.r.squared, summary(m_pmov_13)$adj.r.squared, summary(m_pmov_14)$adj.r.squared, summary(m_pmov_15)$adj.r.squared, summary(m_pmov_16)$adj.r.squared, summary(m_pmov_17)$adj.r.squared, summary(m_pmov_18)$adj.r.squared, summary(m_pmov_19)$adj.r.squared))

summary_r_squared %>%
  arrange(desc(adj.r.squared))

##                model adj.r.squared
## 1     -critics_score   0.584252802
## 2    -critics_rating   0.402167619
## 3             -genre   0.215008617
## 4    -imdb_num_votes   0.108959405
## 5        -title_type   0.105509052
## 6           -runtime   0.070520783
## 7       -mpaa_rating   0.064514680
## 8      -best_pic_nom   0.045742834
## 9      -best_pic_win   0.016863116
## 10     -best_dir_win   0.016742153
## 11       -top200_box   0.006922747
## 12 -best_actress_win   0.003640350
## 13   -best_actor_win   0.002714714

In the second step, adding “imdb_num_votes” gives the highest adjusted R^2 (0.6145, up from 0.5843), so it is added to the model. We then move on to add more variables to the model.

#step 2 (previous adjusted R^2=0.5843) picking: m_pmov_24

m_pmov_20 <- lm(imdb_rating ~  critics_score +title_type, data = c_movies)

m_pmov_21 <- lm(imdb_rating ~  critics_score +genre, data = c_movies)

m_pmov_22 <- lm(imdb_rating ~  critics_score +runtime, data = c_movies)

m_pmov_23 <- lm(imdb_rating ~  critics_score +mpaa_rating, data = c_movies)

m_pmov_24 <- lm(imdb_rating ~  critics_score +imdb_num_votes, data = c_movies)

m_pmov_25 <- lm(imdb_rating ~  critics_score +critics_rating, data = c_movies)

m_pmov_26 <- lm(imdb_rating ~  critics_score +best_pic_nom, data = c_movies)

m_pmov_27 <- lm(imdb_rating ~  critics_score +best_pic_win, data = c_movies)

m_pmov_28 <- lm(imdb_rating ~  critics_score +best_actor_win, data = c_movies)

m_pmov_29 <- lm(imdb_rating ~  critics_score +best_actress_win, data = c_movies)

m_pmov_30 <- lm(imdb_rating ~  critics_score +best_dir_win, data = c_movies)

m_pmov_31 <- lm(imdb_rating ~  critics_score +top200_box, data = c_movies)


summary_r_squared <- data.frame(model = c("-title_type", "-genre", "-runtime", "-mpaa_rating", "-imdb_num_votes", "-critics_rating", "-best_pic_nom", "-best_pic_win", "-best_actor_win", "-best_actress_win", "-best_dir_win", "-top200_box"), 
                                adj.r.squared = c(summary(m_pmov_20)$adj.r.squared, summary(m_pmov_21)$adj.r.squared, summary(m_pmov_22)$adj.r.squared, summary(m_pmov_23)$adj.r.squared, summary(m_pmov_24)$adj.r.squared, summary(m_pmov_25)$adj.r.squared, summary(m_pmov_26)$adj.r.squared, summary(m_pmov_27)$adj.r.squared, summary(m_pmov_28)$adj.r.squared, summary(m_pmov_29)$adj.r.squared, summary(m_pmov_30)$adj.r.squared, summary(m_pmov_31)$adj.r.squared))

summary_r_squared %>%
  arrange(desc(adj.r.squared))

##                model adj.r.squared
## 1    -imdb_num_votes     0.6144942
## 2             -genre     0.6088802
## 3           -runtime     0.6028214
## 4    -critics_rating     0.5942752
## 5        -title_type     0.5909191
## 6      -best_pic_nom     0.5882977
## 7      -best_pic_win     0.5853254
## 8       -mpaa_rating     0.5852451
## 9      -best_dir_win     0.5846406
## 10   -best_actor_win     0.5845023
## 11 -best_actress_win     0.5842432
## 12       -top200_box     0.5839351

In the third step, adding “genre” gives the highest adjusted R^2 (0.6534, up from 0.6145), so it is added to the model. We then move on to add more variables to the model.

#step 3 (previous adjusted R^2 = 0.6145) picking: m_pmov_33

m_pmov_32 <- lm(imdb_rating ~  critics_score +imdb_num_votes +title_type, data = c_movies)

m_pmov_33 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre, data = c_movies)

m_pmov_34 <- lm(imdb_rating ~  critics_score +imdb_num_votes +runtime, data = c_movies)

m_pmov_35 <- lm(imdb_rating ~  critics_score +imdb_num_votes +mpaa_rating, data = c_movies)

m_pmov_36 <- lm(imdb_rating ~  critics_score +imdb_num_votes +critics_rating, data = c_movies)

m_pmov_37 <- lm(imdb_rating ~  critics_score +imdb_num_votes +best_pic_nom, data = c_movies)

m_pmov_38 <- lm(imdb_rating ~  critics_score +imdb_num_votes +best_pic_win, data = c_movies)

m_pmov_39 <- lm(imdb_rating ~  critics_score +imdb_num_votes +best_actor_win, data = c_movies)

m_pmov_40 <- lm(imdb_rating ~  critics_score +imdb_num_votes +best_actress_win, data = c_movies)

m_pmov_41 <- lm(imdb_rating ~  critics_score +imdb_num_votes +best_dir_win, data = c_movies)

m_pmov_42 <- lm(imdb_rating ~  critics_score +imdb_num_votes +top200_box, data = c_movies)

summary_r_squared <- data.frame(model = c("-title_type", "-genre", "-runtime", "-mpaa_rating", "-critics_rating", "-best_pic_nom", "-best_pic_win", "-best_actor_win", "-best_actress_win", "-best_dir_win", "-top200_box"), 
                                adj.r.squared = c(summary(m_pmov_32)$adj.r.squared, summary(m_pmov_33)$adj.r.squared, summary(m_pmov_34)$adj.r.squared, summary(m_pmov_35)$adj.r.squared, summary(m_pmov_36)$adj.r.squared, summary(m_pmov_37)$adj.r.squared, summary(m_pmov_38)$adj.r.squared, summary(m_pmov_39)$adj.r.squared, summary(m_pmov_40)$adj.r.squared, summary(m_pmov_41)$adj.r.squared, summary(m_pmov_42)$adj.r.squared))

summary_r_squared %>%
  arrange(desc(adj.r.squared))

##                model adj.r.squared
## 1             -genre     0.6534061
## 2        -title_type     0.6293135
## 3    -critics_rating     0.6249116
## 4           -runtime     0.6213892
## 5       -mpaa_rating     0.6182556
## 6        -top200_box     0.6149312
## 7    -best_actor_win     0.6141784
## 8      -best_pic_nom     0.6141784
## 9      -best_pic_win     0.6140433
## 10 -best_actress_win     0.6139291
## 11     -best_dir_win     0.6139050

In the fourth step, “critics_rating” gives the highest adjusted R^2 (0.6621), so it would normally be added to the model. However, it is collinear with “critics_score”, which is already in the model (see Scatterplot Matrix 2 below), so we take the variable with the next highest adjusted R^2 instead, which is “runtime” (0.6577). We then check whether any further variable improves the model.

#step 4 (previous adjusted R^2 = 0.6534) picking: m_pmov_44 (m_pmov_46:declined due to collinearity)

m_pmov_43 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +title_type, data = c_movies)

m_pmov_44 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime, data = c_movies)

m_pmov_45 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +mpaa_rating, data = c_movies)

m_pmov_46 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +critics_rating, data = c_movies)

m_pmov_47 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +best_pic_nom, data = c_movies)

m_pmov_48 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +best_pic_win, data = c_movies)

m_pmov_49 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +best_actor_win, data = c_movies)

m_pmov_50 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +best_actress_win, data = c_movies)

m_pmov_51 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +best_dir_win, data = c_movies)

m_pmov_52 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +top200_box, data = c_movies)

summary_r_squared <- data.frame(model = c("-title_type", "-runtime", "-mpaa_rating", "-critics_rating", "-best_pic_nom", "-best_pic_win", "-best_actor_win", "-best_actress_win", "-best_dir_win", "-top200_box"), 
                                adj.r.squared = c(summary(m_pmov_43)$adj.r.squared, summary(m_pmov_44)$adj.r.squared, summary(m_pmov_45)$adj.r.squared, summary(m_pmov_46)$adj.r.squared, summary(m_pmov_47)$adj.r.squared, summary(m_pmov_48)$adj.r.squared, summary(m_pmov_49)$adj.r.squared, summary(m_pmov_50)$adj.r.squared, summary(m_pmov_51)$adj.r.squared, summary(m_pmov_52)$adj.r.squared))

summary_r_squared %>%
  arrange(desc(adj.r.squared))

##                model adj.r.squared
## 1    -critics_rating     0.6620667
## 2           -runtime     0.6576960
## 3        -title_type     0.6535431
## 4        -top200_box     0.6532531
## 5      -best_pic_nom     0.6532279
## 6    -best_actor_win     0.6531465
## 7      -best_dir_win     0.6530777
## 8      -best_pic_win     0.6530590
## 9  -best_actress_win     0.6528854
## 10      -mpaa_rating     0.6524638

The variables “critics_rating” and “critics_score” are collinear (correlated), so adding both to the model would add little value. Since “critics_score” entered in the first step and “critics_rating” only came up in the fourth, we keep “critics_score”, eliminate “critics_rating” from the fourth step, and pick the best of the remaining candidates.

# correlation of all variables (reuses panel.cor as defined for Scatterplot Matrix 1)

pairs(~  genre + runtime + imdb_num_votes + critics_rating + critics_score , data=c_movies, lower.panel=panel.smooth, upper.panel=panel.cor, pch=20, main=" Scatterplot Matrix 2")

In the fifth and final step, no remaining variable yields a meaningful improvement: the best candidate, “title_type”, raises the adjusted R^2 only from 0.657696 to 0.657724. We therefore stop and keep the model from the previous step as our parsimonious model.

#step 5 (previous adjusted R^2 = 0.6577) picking: none

m_pmov_53 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +title_type, data = c_movies)

m_pmov_55 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +mpaa_rating, data = c_movies)

m_pmov_56 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +best_pic_nom, data = c_movies)

m_pmov_57 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +best_pic_win, data = c_movies)

m_pmov_58 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +best_actor_win, data = c_movies)

m_pmov_59 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +best_actress_win, data = c_movies)

m_pmov_60 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +best_dir_win, data = c_movies)

m_pmov_61 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime +top200_box, data = c_movies)


summary_r_squared <- data.frame(model = c("-title_type", "-mpaa_rating", "-best_pic_nom", "-best_pic_win", "-best_actor_win", "-best_actress_win", "-best_dir_win", "-top200_box"), 
                                adj.r.squared = c(summary(m_pmov_53)$adj.r.squared, summary(m_pmov_55)$adj.r.squared, summary(m_pmov_56)$adj.r.squared, summary(m_pmov_57)$adj.r.squared, summary(m_pmov_58)$adj.r.squared, summary(m_pmov_59)$adj.r.squared, summary(m_pmov_60)$adj.r.squared,  summary(m_pmov_61)$adj.r.squared))

summary_r_squared %>%
  arrange(desc(adj.r.squared))

##               model adj.r.squared
## 1       -title_type     0.6577241
## 2      -mpaa_rating     0.6577114
## 3       -top200_box     0.6576791
## 4     -best_pic_win     0.6576007
## 5     -best_pic_nom     0.6572393
## 6 -best_actress_win     0.6571878
## 7     -best_dir_win     0.6571618
## 8   -best_actor_win     0.6571596

Next, we proceed with model diagnostics.

#m_pmov_44 Model diagnostics
m_pmov_44 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime, data = c_movies)
plot(m_pmov_44$residuals ~ c_movies$critics_score)
abline(h=0, col=2)

plot(m_pmov_44$residuals ~ c_movies$imdb_num_votes)
abline(h=0, col=2)

plot(m_pmov_44$residuals ~ c_movies$runtime)
abline(h=0, col=2)

hist(m_pmov_44$residuals)

qqnorm(m_pmov_44$residuals)
qqline(m_pmov_44$residuals)

plot(m_pmov_44$residuals ~ m_pmov_44$fitted)

plot(abs(m_pmov_44$residuals) ~ m_pmov_44$fitted)

plot(m_pmov_44$residuals)
abline(h=0, col=2)

In the model diagnostics we observe the following:
1) Linear relationships between the numerical predictors (“critics_score”, “imdb_num_votes” and “runtime”) and “imdb_rating”: the residuals appear randomly scattered around 0.
2) Nearly normal residuals with mean 0: there is some skew, but apart from the tails there are no large deviations from the normal distribution.
3) Constant variability of residuals: there are no fan or triangle shapes.
4) Independent residuals: there are no increasing or decreasing patterns.
All in all, the conditions appear to be reasonably met for our model.

Part 5: Prediction

At this point, we want to use the model created earlier, m_pmov_44, to predict the popularity (measured by “imdb_rating”) of a new movie: Independence Day: Resurgence (2016). This movie has a “critics_score” of 30, 146,385 “imdb_num_votes”, “Action & Adventure” as its “genre” and a “runtime” of 120 minutes. The data used to make this prediction come from Rotten Tomatoes and IMDB. We will also construct a prediction interval around this prediction, which provides a measure of its uncertainty.

#final model, forward selection - adjusted R^2
m_pmov_44 <- lm(imdb_rating ~  critics_score +imdb_num_votes +genre +runtime, data = c_movies)
summary(m_pmov_44)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + imdb_num_votes + genre + 
##     runtime, data = c_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9094 -0.3305  0.0380  0.3873  1.8059 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.381e+00  1.673e-01  26.185  < 2e-16 ***
## critics_score                   2.374e-02  1.018e-03  23.315  < 2e-16 ***
## imdb_num_votes                  1.938e-06  2.483e-07   7.804 2.48e-14 ***
## genreAnimation                 -1.634e-01  2.272e-01  -0.719  0.47231    
## genreArt House & International  5.454e-01  1.883e-01   2.896  0.00391 ** 
## genreComedy                    -1.161e-01  1.046e-01  -1.110  0.26758    
## genreDocumentary                7.861e-01  1.304e-01   6.029 2.80e-09 ***
## genreDrama                      2.118e-01  9.017e-02   2.349  0.01915 *  
## genreHorror                    -1.172e-01  1.551e-01  -0.756  0.45006    
## genreMusical & Performing Arts  5.548e-01  2.043e-01   2.716  0.00680 ** 
## genreMystery & Suspense         1.578e-01  1.152e-01   1.369  0.17137    
## genreOther                     -1.039e-02  1.787e-01  -0.058  0.95369    
## genreScience Fiction & Fantasy -4.185e-01  2.260e-01  -1.852  0.06446 .  
## runtime                         4.350e-03  1.451e-03   2.997  0.00283 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6347 on 636 degrees of freedom
## Multiple R-squared:  0.6646, Adjusted R-squared:  0.6577 
## F-statistic: 96.92 on 13 and 636 DF,  p-value: < 2.2e-16

#use Independence day: Resurgence to predict 
ind_r <- data.frame(critics_score = 30, imdb_num_votes = 146385, genre = "Action & Adventure", runtime = 120)
predict(m_pmov_44, ind_r)

##        1 
## 5.899329

predict(m_pmov_44, ind_r, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 5.899329 4.642075 7.156584

The model predicts an IMDB rating of roughly 5.9 for this movie. The prediction interval tells us that, with 95% confidence, a movie with a “critics_score” of 30, 146,385 “imdb_num_votes”, “Action & Adventure” as its “genre” and a “runtime” of 120 minutes is expected to have an IMDB rating between 4.64 and 7.16.
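
To make clear where the interval bounds come from, the sketch below reproduces a 95% prediction interval by hand as fit ± t * sqrt(s^2 + se.fit^2), where s is the residual standard error, and compares it with the output of `predict()`. It uses the built-in `mtcars` data and a hypothetical new observation `new_car` as stand-ins, since the point is the calculation rather than the movie model itself.

```r
# Small linear model on a built-in dataset (stand-in for m_pmov_44).
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3.0, hp = 150)   # hypothetical new observation

# Point prediction and the standard error of the fitted mean.
p <- predict(fit, new_car, se.fit = TRUE)

# Residual standard error and the t critical value for a 95% interval.
s <- summary(fit)$sigma
t_crit <- qt(0.975, df = fit$df.residual)

# A prediction interval adds the residual variance to the variance of the fit.
half_width <- t_crit * sqrt(s^2 + p$se.fit^2)
manual <- c(fit = unname(p$fit),
            lwr = unname(p$fit - half_width),
            upr = unname(p$fit + half_width))

# Same interval from predict() directly.
builtin <- predict(fit, new_car, interval = "prediction", level = 0.95)

all.equal(manual, builtin[1, ])
```

The residual variance term s^2 is what makes a prediction interval wider than a confidence interval for the mean response.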

All in all, I believe that the factors most associated with the popularity of a movie are: 1) the genre of the movie, 2) the critics score on Rotten Tomatoes, 3) the number of votes on IMDB, and 4) the runtime of the movie.

Part 6: Conclusion

As a final remark, I believe I have arrived at a parsimonious model whose predictions are reasonably close to reality. The predicted imdb_rating for “Independence Day: Resurgence” is 5.90, while the actual rating of the movie is 5.2, which lies within the prediction interval. The selection of variables might have been less complicated if some candidate variables had not produced the same adjusted R^2, a problem that could probably be mitigated by adding more data. Finally, I believe that adding variables with information about the actors’ and directors’ ratings would make for a better model.