Modeling and prediction for movies

Setup

This is an analysis of the movie dataset that contains information from Rotten Tomatos and IMDB for a random sample of movies that are needed to develop a multiple linear regression model that will explain what makes movies popular.The dataset keeps track of all reviews for each films and aggregates the results and Internet Movie Database IMDB, an online database of information related to film, television programs and video games.

Load packages

library(ggplot2)
library(dplyr)
library(caret)
library(statsr)

## Warning: package 'statsr' was built under R version 4.0.2

library(gridExtra)
library(GGally)
library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.0.2

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.0.2

Load data

Loading the data from the directory.

load("movies.Rdata")

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016.

Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. It is up to you to decide which variables are meaningful and which should be omitted. For example information in the the actor1 through actor5 variables was used to determine whether the movie casts an actor or actress who won a best actor or actress Oscar.

You might also choose to omit certain observations or restructure some of the variables to make them suitable for answering your research questions.

When you are fitting a model you should also be careful about collinearity, as some of these variables may be dependent on each other.

Part 2: Research question

Is movie popularity, as measured by IMDb rating, associated with other film characteristics like: having actors,type of movie,genre, runtime, imdb rating, votes, actresses, directors who won an oscar, critic reviews, studios that produced the movie, and year that the movie was released in theaters.

Part 3: Exploratory data analysis

An actor, actress, or director who has won an oscar award is a great motivation to see a movie. An oscar award is like a guarantee of excellence. Now we need to get the variables to the model.

movies_new <- movies %>% select(title, title_type, genre, runtime, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_win, best_actor_win, best_actress_win, best_dir_win)

str(movies_new)

## tibble [651 x 14] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...

summary(movies_new)

##     title                  title_type                 genre        runtime     
##  Length:651         Documentary : 55   Drama             :305   Min.   : 39.0  
##  Class :character   Feature Film:591   Comedy            : 87   1st Qu.: 92.0  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65   Median :103.0  
##                                        Mystery & Suspense: 59   Mean   :105.8  
##                                        Documentary       : 52   3rd Qu.:115.8  
##                                        Horror            : 23   Max.   :267.0  
##                                        (Other)           : 60   NA's   :1      
##   imdb_rating    imdb_num_votes           critics_rating critics_score   
##  Min.   :1.900   Min.   :   180   Certified Fresh:135    Min.   :  1.00  
##  1st Qu.:5.900   1st Qu.:  4546   Fresh          :209    1st Qu.: 33.00  
##  Median :6.600   Median : 15116   Rotten         :307    Median : 61.00  
##  Mean   :6.493   Mean   : 57533                          Mean   : 57.69  
##  3rd Qu.:7.300   3rd Qu.: 58301                          3rd Qu.: 83.00  
##  Max.   :9.000   Max.   :893008                          Max.   :100.00  
##                                                                          
##  audience_rating audience_score  best_pic_win best_actor_win best_actress_win
##  Spilled:275     Min.   :11.00   no :644      no :558        no :579         
##  Upright:376     1st Qu.:46.00   yes:  7      yes: 93        yes: 72         
##                  Median :65.00                                               
##                  Mean   :62.36                                               
##                  3rd Qu.:80.00                                               
##                  Max.   :97.00                                               
##                                                                              
##  best_dir_win
##  no :608     
##  yes: 43     
##              
##              
##              
##              
##

Now is necessary to clean the data from NA’s

movies_new <- na.omit(movies_new)

Before modeling is needed to separate the train and test data.

index <- createDataPartition(movies_new$imdb_rating, p= 0.975, list = FALSE)
training <- movies_new[index,]

## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

testing <- movies_new[-index,]
dim(training)

## [1] 636  14

dim(testing)

## [1] 14 14

Now testing some variables for the Exploratory analysis

d1 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = critics_rating)) + geom_point()
d2 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = audience_rating)) + geom_point()
grid.arrange(d1, d2, nrow = 1, ncol = 2)

Histograms to see the sknesswess

The dependent variable:

head(training)

## # A tibble: 6 x 14
##   title title_type genre runtime imdb_rating imdb_num_votes critics_rating
##   <chr> <fct>      <fct>   <dbl>       <dbl>          <int> <fct>         
## 1 Fill~ Feature F~ Drama      80         5.5            899 Rotten        
## 2 The ~ Feature F~ Drama     101         7.3          12285 Certified Fre~
## 3 Wait~ Feature F~ Come~      84         7.6          22381 Certified Fre~
## 4 The ~ Feature F~ Drama     139         7.2          35096 Certified Fre~
## 5 Male~ Feature F~ Horr~      90         5.1           2386 Rotten        
## 6 Old ~ Documenta~ Docu~      78         7.8            333 Fresh         
## # ... with 7 more variables: critics_score <dbl>, audience_rating <fct>,
## #   audience_score <dbl>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>

training%>% ggplot(aes(audience_score, fill= title_type)) + geom_histogram(bins = 30)

As it sees the great distribution from the data came from Feature Films

summary(training$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.39   80.00   97.00

The median score is 65, and the maximum score was 97.

head(training)

## # A tibble: 6 x 14
##   title title_type genre runtime imdb_rating imdb_num_votes critics_rating
##   <chr> <fct>      <fct>   <dbl>       <dbl>          <int> <fct>         
## 1 Fill~ Feature F~ Drama      80         5.5            899 Rotten        
## 2 The ~ Feature F~ Drama     101         7.3          12285 Certified Fre~
## 3 Wait~ Feature F~ Come~      84         7.6          22381 Certified Fre~
## 4 The ~ Feature F~ Drama     139         7.2          35096 Certified Fre~
## 5 Male~ Feature F~ Horr~      90         5.1           2386 Rotten        
## 6 Old ~ Documenta~ Docu~      78         7.8            333 Fresh         
## # ... with 7 more variables: critics_score <dbl>, audience_rating <fct>,
## #   audience_score <dbl>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>

g1 <- training %>% ggplot(aes(runtime, fill = title_type)) + 
  geom_histogram() + ggtitle('Run Time')

g2 <- training %>% ggplot(aes(imdb_rating, fill = genre)) +
  geom_histogram()  + ggtitle('IMDB rating')

g3 <- training %>% ggplot(aes(log10(imdb_num_votes), fill= critics_rating)) +
  geom_histogram()  + ggtitle('log(IMDB number of votes)')

g4 <- training %>% ggplot(aes(critics_score, fill= audience_rating)) +
  geom_histogram() + ggtitle('Critics Score')

g5 <- training %>% ggplot(aes(audience_score, fill= best_pic_win)) +
  geom_histogram() + ggtitle('Critics Score')


grid.arrange(g1, g2, g3, g4,g5, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now for the categorical variables:

head(training)

## # A tibble: 6 x 14
##   title title_type genre runtime imdb_rating imdb_num_votes critics_rating
##   <chr> <fct>      <fct>   <dbl>       <dbl>          <int> <fct>         
## 1 Fill~ Feature F~ Drama      80         5.5            899 Rotten        
## 2 The ~ Feature F~ Drama     101         7.3          12285 Certified Fre~
## 3 Wait~ Feature F~ Come~      84         7.6          22381 Certified Fre~
## 4 The ~ Feature F~ Drama     139         7.2          35096 Certified Fre~
## 5 Male~ Feature F~ Horr~      90         5.1           2386 Rotten        
## 6 Old ~ Documenta~ Docu~      78         7.8            333 Fresh         
## # ... with 7 more variables: critics_score <dbl>, audience_rating <fct>,
## #   audience_score <dbl>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>

g1 <- training %>% ggplot(aes(genre, fill = title_type)) + 
  geom_bar() + ggtitle('Genre')

g2 <- training %>% ggplot(aes(title_type, fill = genre)) +
  geom_bar()  + ggtitle('Title Type')

g3 <- training %>% ggplot(aes(critics_rating, fill= critics_rating)) +
  geom_bar()  + ggtitle('Critics rating')

g4 <- training %>% ggplot(aes(audience_rating, fill= best_actor_win)) +
  geom_bar() + ggtitle('Audience Rating')

g5 <- training %>% ggplot(aes(audience_rating, fill= best_pic_win)) +
  geom_bar() + ggtitle('Audience Rating')

grid.arrange(g1,g2,g3,g4,g5, ncol=2)

Now is turn to see the correlation between variables, specailly numericals.

g1 <- training %>% ggplot(aes(runtime, audience_score,color= title_type)) + 
  geom_point() + ggtitle('Audience Score vs Run Time') + geom_smooth()

g2 <- training %>% ggplot(aes(imdb_rating, audience_score,color = genre)) +
  geom_point()  + ggtitle('Audience Score vs IMDB rating') +geom_smooth()

g3 <- training %>% ggplot(aes(log10(imdb_num_votes),audience_score, color= critics_rating)) +
  geom_point()  + ggtitle('Audience Score vs log(IMDB number of votes)') + geom_smooth()

g4 <- training %>% ggplot(aes(critics_score,audience_score, color= audience_rating)) +
  geom_point() + ggtitle('Audience Score vs Critics Score') +geom_smooth()

grid.arrange(g1,g2,g3,g4, ncol=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 87.86

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 17.14

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 124.1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5

## Warning in sqrt(sum.squares/one.delta): Se han producido NaNs

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small. fewer
## data values than degrees of freedom.

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 87.86

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 17.14

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 124.1

## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced

## Warning in max(ids, na.rm = TRUE): ningun argumento finito para max; retornando
## -Inf

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

and the correlations are:

library(corrplot)
variables <- c('runtime', 'imdb_rating', 'imdb_num_votes', 'critics_score', "audience_score")
selected_train <- training[variables]
head(selected_train)

## # A tibble: 6 x 5
##   runtime imdb_rating imdb_num_votes critics_score audience_score
##     <dbl>       <dbl>          <int>         <dbl>          <dbl>
## 1      80         5.5            899            45             73
## 2     101         7.3          12285            96             81
## 3      84         7.6          22381            91             91
## 4     139         7.2          35096            80             76
## 5      90         5.1           2386            33             27
## 6      78         7.8            333            91             86

correlation_matrix <- cor(selected_train)
correlation_matrix

##                  runtime imdb_rating imdb_num_votes critics_score
## runtime        1.0000000   0.2676279      0.3452433     0.1737387
## imdb_rating    0.2676279   1.0000000      0.3271261     0.7677127
## imdb_num_votes 0.3452433   0.3271261      1.0000000     0.2131793
## critics_score  0.1737387   0.7677127      0.2131793     1.0000000
## audience_score 0.1822614   0.8646578      0.2873827     0.7098398
##                audience_score
## runtime             0.1822614
## imdb_rating         0.8646578
## imdb_num_votes      0.2873827
## critics_score       0.7098398
## audience_score      1.0000000

corrplot(correlation_matrix, main="\n\nCorrelation Plot of numerical variables", method="number")

For the categorical variables we need to see the variations on median and other factors.

p1 <- training %>% ggplot(aes(color= audience_rating))+ geom_boxplot(aes(audience_score,audience_rating))

p2 <- training %>% ggplot(aes(color= critics_rating))+ geom_boxplot(aes(audience_score,critics_rating))

p3 <- training %>% ggplot(aes(color= title_type))+ geom_boxplot(aes(audience_score,title_type))
p4 <- training %>% ggplot(aes(color= genre))+ geom_boxplot(aes(audience_score,genre))

grid.arrange(p1,p2,p3, ncol=2)

p4

Part 4: Modeling

I start with the full model

full_model <- lm(audience_score~imdb_rating+title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=training)
summary(full_model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + title_type + genre + 
##     runtime + imdb_num_votes + critics_rating + audience_rating + 
##     best_pic_win + best_actor_win + best_actress_win + best_dir_win, 
##     data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4540  -4.5599   0.6738   4.3682  24.6415 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.011e+01  4.093e+00  -2.470   0.0138 *  
## imdb_rating                     9.543e+00  4.314e-01  22.118   <2e-16 ***
## title_typeFeature Film          2.291e+00  2.571e+00   0.891   0.3733    
## title_typeTV Movie              8.071e-01  4.056e+00   0.199   0.8423    
## genreAnimation                  3.153e+00  2.485e+00   1.269   0.2050    
## genreArt House & International -2.653e+00  2.071e+00  -1.281   0.2005    
## genreComedy                     1.591e+00  1.154e+00   1.378   0.1686    
## genreDocumentary                2.437e+00  2.754e+00   0.885   0.3765    
## genreDrama                     -7.618e-01  1.003e+00  -0.759   0.4479    
## genreHorror                    -1.716e+00  1.754e+00  -0.978   0.3282    
## genreMusical & Performing Arts  3.425e+00  2.479e+00   1.381   0.1677    
## genreMystery & Suspense        -3.181e+00  1.282e+00  -2.481   0.0134 *  
## genreOther                      1.163e-01  2.011e+00   0.058   0.9539    
## genreScience Fiction & Fantasy -2.961e-01  2.468e+00  -0.120   0.9046    
## runtime                        -2.244e-02  1.677e-02  -1.338   0.1813    
## imdb_num_votes                  2.746e-06  3.222e-06   0.852   0.3944    
## critics_ratingFresh            -1.664e-02  8.586e-01  -0.019   0.9845    
## critics_ratingRotten           -1.315e+00  9.478e-01  -1.387   0.1660    
## audience_ratingUpright          2.010e+01  8.020e-01  25.063   <2e-16 ***
## best_pic_winyes                 3.585e-01  2.944e+00   0.122   0.9031    
## best_actor_winyes               4.909e-01  8.237e-01   0.596   0.5514    
## best_actress_winyes            -1.008e+00  9.338e-01  -1.079   0.2810    
## best_dir_winyes                 1.672e-02  1.206e+00   0.014   0.9889    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.889 on 613 degrees of freedom
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8843 
## F-statistic: 221.7 on 22 and 613 DF,  p-value: < 2.2e-16

For the backward method we obtain next model:

finalmodel <- lm(audience_score~imdb_rating+genre+runtime+imdb_num_votes+audience_rating, data=training)
summary(finalmodel)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + genre + runtime + 
##     imdb_num_votes + audience_rating, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5443  -4.5285   0.7232   4.3756  24.9183 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.028e+01  2.612e+00  -3.935 9.25e-05 ***
## imdb_rating                     9.823e+00  3.967e-01  24.762  < 2e-16 ***
## genreAnimation                  3.228e+00  2.470e+00   1.307   0.1917    
## genreArt House & International -2.685e+00  2.060e+00  -1.303   0.1929    
## genreComedy                     1.511e+00  1.143e+00   1.322   0.1866    
## genreDocumentary                5.545e-01  1.457e+00   0.381   0.7036    
## genreDrama                     -6.794e-01  9.855e-01  -0.689   0.4908    
## genreHorror                    -1.571e+00  1.744e+00  -0.901   0.3680    
## genreMusical & Performing Arts  2.734e+00  2.314e+00   1.182   0.2378    
## genreMystery & Suspense        -3.058e+00  1.263e+00  -2.420   0.0158 *  
## genreOther                      2.331e-01  1.984e+00   0.118   0.9065    
## genreScience Fiction & Fantasy  9.225e-02  2.449e+00   0.038   0.9700    
## runtime                        -2.448e-02  1.596e-02  -1.534   0.1256    
## imdb_num_votes                  3.052e-06  2.920e-06   1.045   0.2963    
## audience_ratingUpright          2.025e+01  7.863e-01  25.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.876 on 621 degrees of freedom
## Multiple R-squared:  0.8873, Adjusted R-squared:  0.8848 
## F-statistic: 349.2 on 14 and 621 DF,  p-value: < 2.2e-16

Interpretation of the model

Intercept(-9.8) is the estimated audience score for a movie with imdb_rating, audience_rating and genre at zero. It does not provide any meaningful interpretation here.
imdb_rating coefficient(9.8): All else hold constant, for every one unit increase in imdb_rating, the model predicts a 9.8 increase in audience_score on average.
audience_ratingUpright coefficient(20.3246): All else hold constant, the model predicts rating Upright movie is 20.3 higher in audience score on average than rating Spilled movie.
genreMystery & Suspense (3.01): The model predicts that Animation films get an audience score that is 3.01 higher than Action & Adventure(reference category) films on average after controlling for imdb_rating and audience rating.
runtime (-0.03): All else hold constant, the model predicts audience score movie is -0.03 lower per each one incremente in runtime.
There are total 11 genre categories in the dataset, the audience score can higher or lower than Action & Adventure films depends on what genre is selected.
R-Squared(0.8876): 88.76% of the variablity in audience score can be explained by the model.

Model diagnostics

ggplot(finalmodel, aes(x = .fitted, y = .resid, color= genre)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

We can observe that there is clear a linear relationship between imdb rating and audience score(by genre). The linearity condition is met by our model.

Constant variance of residuals condition met, No fan shape in residuals plot.

ggplot(data = finalmodel, aes(x = .resid, fill= genre)) +
  geom_histogram(color= "white") +
  xlab("Residuals")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

and the normality

b1 <- ggplot(data = finalmodel, aes(sample = .resid, color= genre)) +
  stat_qq() + ggtitle("By Genre")

b2 <- ggplot(data = finalmodel, aes(sample = .resid)) +
  stat_qq() + ggtitle("Total Model Error Normality")

grid.arrange(b2,b1)

plot(finalmodel)

Part 5: Prediction

We are going to use our final model to predict the audience score for the movie in the test set. First we create a new dataframe for this movie.

predict(finalmodel,testing)

##        1        2        3        4        5        6        7        8 
## 53.23105 77.11966 44.57628 88.91113 82.98794 48.81621 74.78927 40.93251 
##        9       10       11       12       13       14 
## 84.76071 76.65694 74.01250 43.01909 45.96933 37.31517

final_predict <- predict(finalmodel,testing, interval = "prediction", level= 0.95)
final_predict <- as.data.frame(final_predict)
final_predict

##         fit      lwr       upr
## 1  53.23105 39.06447  67.39763
## 2  77.11966 63.57628  90.66305
## 3  44.57628 31.02482  58.12773
## 4  88.91113 74.91748 102.90477
## 5  82.98794 69.34884  96.62704
## 6  48.81621 35.25092  62.38150
## 7  74.78927 61.22679  88.35175
## 8  40.93251 27.30097  54.56406
## 9  84.76071 71.08354  98.43788
## 10 76.65694 62.68351  90.63036
## 11 74.01250 60.14881  87.87620
## 12 43.01909 29.39473  56.64345
## 13 45.96933 32.13903  59.79964
## 14 37.31517 23.75545  50.87490

The model predicts movie Aliens in the test set will have an audience score at approximate 90.

plot(final_predict)

See the errors:

final_data <- data.frame(testing$audience_score, final_predict$fit)
final_data

##    testing.audience_score final_predict.fit
## 1                      55          53.23105
## 2                      85          77.11966
## 3                      47          44.57628
## 4                      90          88.91113
## 5                      84          82.98794
## 6                      49          48.81621
## 7                      64          74.78927
## 8                      32          40.93251
## 9                      73          84.76071
## 10                     77          76.65694
## 11                     70          74.01250
## 12                     38          43.01909
## 13                     44          45.96933
## 14                     35          37.31517

final_data <- final_data %>% mutate( errores = (testing.audience_score - final_predict.fit))
final_data

##    testing.audience_score final_predict.fit     errores
## 1                      55          53.23105   1.7689486
## 2                      85          77.11966   7.8803363
## 3                      47          44.57628   2.4237230
## 4                      90          88.91113   1.0888743
## 5                      84          82.98794   1.0120586
## 6                      49          48.81621   0.1837871
## 7                      64          74.78927 -10.7892681
## 8                      32          40.93251  -8.9325134
## 9                      73          84.76071 -11.7607080
## 10                     77          76.65694   0.3430650
## 11                     70          74.01250  -4.0125038
## 12                     38          43.01909  -5.0190899
## 13                     44          45.96933  -1.9693345
## 14                     35          37.31517  -2.3151737

plot(final_data$testing.audience_score,final_data$final_predict.fit)

summary(finalmodel)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + genre + runtime + 
##     imdb_num_votes + audience_rating, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5443  -4.5285   0.7232   4.3756  24.9183 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.028e+01  2.612e+00  -3.935 9.25e-05 ***
## imdb_rating                     9.823e+00  3.967e-01  24.762  < 2e-16 ***
## genreAnimation                  3.228e+00  2.470e+00   1.307   0.1917    
## genreArt House & International -2.685e+00  2.060e+00  -1.303   0.1929    
## genreComedy                     1.511e+00  1.143e+00   1.322   0.1866    
## genreDocumentary                5.545e-01  1.457e+00   0.381   0.7036    
## genreDrama                     -6.794e-01  9.855e-01  -0.689   0.4908    
## genreHorror                    -1.571e+00  1.744e+00  -0.901   0.3680    
## genreMusical & Performing Arts  2.734e+00  2.314e+00   1.182   0.2378    
## genreMystery & Suspense        -3.058e+00  1.263e+00  -2.420   0.0158 *  
## genreOther                      2.331e-01  1.984e+00   0.118   0.9065    
## genreScience Fiction & Fantasy  9.225e-02  2.449e+00   0.038   0.9700    
## runtime                        -2.448e-02  1.596e-02  -1.534   0.1256    
## imdb_num_votes                  3.052e-06  2.920e-06   1.045   0.2963    
## audience_ratingUpright          2.025e+01  7.863e-01  25.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.876 on 621 degrees of freedom
## Multiple R-squared:  0.8873, Adjusted R-squared:  0.8848 
## F-statistic: 349.2 on 14 and 621 DF,  p-value: < 2.2e-16

Part 6: Conclusion

The model demonstrates that it is possible to predict a movie’s popularity, as measured by audience score with only four predictors - imdb score,runtime, audience rating and genre. Movie industries can use the similar methods when producing movies that are more likely to be liked by the target audience.

However, the potential shortcoming is that our model’s predictive power is limited because the sample data is not representative. Therefore, a larger number of observations to capture more variability in the population data in our testing data set is required to have a better measure of the model’s accuracy. And finally the model predicts accurately, the max error is -10 and was twice when we appered in the test data, the median or the error is zero.