This is an analysis of the movie dataset that contains information from Rotten Tomatos and IMDB for a random sample of movies that are needed to develop a multiple linear regression model that will explain what makes movies popular.The dataset keeps track of all reviews for each films and aggregates the results and Internet Movie Database IMDB, an online database of information related to film, television programs and video games.
## Warning: package 'statsr' was built under R version 4.0.2
## Warning: package 'ggthemes' was built under R version 4.0.2
## Warning: package 'corrplot' was built under R version 4.0.2
The data set is comprised of 651 randomly sampled movies produced and released before 2016.
Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. It is up to you to decide which variables are meaningful and which should be omitted. For example information in the the actor1 through actor5 variables was used to determine whether the movie casts an actor or actress who won a best actor or actress Oscar.
You might also choose to omit certain observations or restructure some of the variables to make them suitable for answering your research questions.
When you are fitting a model you should also be careful about collinearity, as some of these variables may be dependent on each other.
Is movie popularity, as measured by IMDb rating, associated with other film characteristics like: having actors,type of movie,genre, runtime, imdb rating, votes, actresses, directors who won an oscar, critic reviews, studios that produced the movie, and year that the movie was released in theaters.
An actor, actress, or director who has won an oscar award is a great motivation to see a movie. An oscar award is like a guarantee of excellence. Now we need to get the variables to the model.
movies_new <- movies %>% select(title, title_type, genre, runtime, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_win, best_actor_win, best_actress_win, best_dir_win)## tibble [651 x 14] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## imdb_rating imdb_num_votes critics_rating critics_score
## Min. :1.900 Min. : 180 Certified Fresh:135 Min. : 1.00
## 1st Qu.:5.900 1st Qu.: 4546 Fresh :209 1st Qu.: 33.00
## Median :6.600 Median : 15116 Rotten :307 Median : 61.00
## Mean :6.493 Mean : 57533 Mean : 57.69
## 3rd Qu.:7.300 3rd Qu.: 58301 3rd Qu.: 83.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## audience_rating audience_score best_pic_win best_actor_win best_actress_win
## Spilled:275 Min. :11.00 no :644 no :558 no :579
## Upright:376 1st Qu.:46.00 yes: 7 yes: 93 yes: 72
## Median :65.00
## Mean :62.36
## 3rd Qu.:80.00
## Max. :97.00
##
## best_dir_win
## no :608
## yes: 43
##
##
##
##
##
Now is necessary to clean the data from NA’s
Before modeling is needed to separate the train and test data.
index <- createDataPartition(movies_new$imdb_rating, p= 0.975, list = FALSE)
training <- movies_new[index,]## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## [1] 636 14
## [1] 14 14
Now testing some variables for the Exploratory analysis
d1 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = critics_rating)) + geom_point()
d2 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = audience_rating)) + geom_point()
grid.arrange(d1, d2, nrow = 1, ncol = 2)The dependent variable:
## # A tibble: 6 x 14
## title title_type genre runtime imdb_rating imdb_num_votes critics_rating
## <chr> <fct> <fct> <dbl> <dbl> <int> <fct>
## 1 Fill~ Feature F~ Drama 80 5.5 899 Rotten
## 2 The ~ Feature F~ Drama 101 7.3 12285 Certified Fre~
## 3 Wait~ Feature F~ Come~ 84 7.6 22381 Certified Fre~
## 4 The ~ Feature F~ Drama 139 7.2 35096 Certified Fre~
## 5 Male~ Feature F~ Horr~ 90 5.1 2386 Rotten
## 6 Old ~ Documenta~ Docu~ 78 7.8 333 Fresh
## # ... with 7 more variables: critics_score <dbl>, audience_rating <fct>,
## # audience_score <dbl>, best_pic_win <fct>, best_actor_win <fct>,
## # best_actress_win <fct>, best_dir_win <fct>
As it sees the great distribution from the data came from Feature Films
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.39 80.00 97.00
The median score is 65, and the maximum score was 97.
## # A tibble: 6 x 14
## title title_type genre runtime imdb_rating imdb_num_votes critics_rating
## <chr> <fct> <fct> <dbl> <dbl> <int> <fct>
## 1 Fill~ Feature F~ Drama 80 5.5 899 Rotten
## 2 The ~ Feature F~ Drama 101 7.3 12285 Certified Fre~
## 3 Wait~ Feature F~ Come~ 84 7.6 22381 Certified Fre~
## 4 The ~ Feature F~ Drama 139 7.2 35096 Certified Fre~
## 5 Male~ Feature F~ Horr~ 90 5.1 2386 Rotten
## 6 Old ~ Documenta~ Docu~ 78 7.8 333 Fresh
## # ... with 7 more variables: critics_score <dbl>, audience_rating <fct>,
## # audience_score <dbl>, best_pic_win <fct>, best_actor_win <fct>,
## # best_actress_win <fct>, best_dir_win <fct>
g1 <- training %>% ggplot(aes(runtime, fill = title_type)) +
geom_histogram() + ggtitle('Run Time')
g2 <- training %>% ggplot(aes(imdb_rating, fill = genre)) +
geom_histogram() + ggtitle('IMDB rating')
g3 <- training %>% ggplot(aes(log10(imdb_num_votes), fill= critics_rating)) +
geom_histogram() + ggtitle('log(IMDB number of votes)')
g4 <- training %>% ggplot(aes(critics_score, fill= audience_rating)) +
geom_histogram() + ggtitle('Critics Score')
g5 <- training %>% ggplot(aes(audience_score, fill= best_pic_win)) +
geom_histogram() + ggtitle('Critics Score')
grid.arrange(g1, g2, g3, g4,g5, ncol=2)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now for the categorical variables:
## # A tibble: 6 x 14
## title title_type genre runtime imdb_rating imdb_num_votes critics_rating
## <chr> <fct> <fct> <dbl> <dbl> <int> <fct>
## 1 Fill~ Feature F~ Drama 80 5.5 899 Rotten
## 2 The ~ Feature F~ Drama 101 7.3 12285 Certified Fre~
## 3 Wait~ Feature F~ Come~ 84 7.6 22381 Certified Fre~
## 4 The ~ Feature F~ Drama 139 7.2 35096 Certified Fre~
## 5 Male~ Feature F~ Horr~ 90 5.1 2386 Rotten
## 6 Old ~ Documenta~ Docu~ 78 7.8 333 Fresh
## # ... with 7 more variables: critics_score <dbl>, audience_rating <fct>,
## # audience_score <dbl>, best_pic_win <fct>, best_actor_win <fct>,
## # best_actress_win <fct>, best_dir_win <fct>
g1 <- training %>% ggplot(aes(genre, fill = title_type)) +
geom_bar() + ggtitle('Genre')
g2 <- training %>% ggplot(aes(title_type, fill = genre)) +
geom_bar() + ggtitle('Title Type')
g3 <- training %>% ggplot(aes(critics_rating, fill= critics_rating)) +
geom_bar() + ggtitle('Critics rating')
g4 <- training %>% ggplot(aes(audience_rating, fill= best_actor_win)) +
geom_bar() + ggtitle('Audience Rating')
g5 <- training %>% ggplot(aes(audience_rating, fill= best_pic_win)) +
geom_bar() + ggtitle('Audience Rating')
grid.arrange(g1,g2,g3,g4,g5, ncol=2)Now is turn to see the correlation between variables, specailly numericals.
g1 <- training %>% ggplot(aes(runtime, audience_score,color= title_type)) +
geom_point() + ggtitle('Audience Score vs Run Time') + geom_smooth()
g2 <- training %>% ggplot(aes(imdb_rating, audience_score,color = genre)) +
geom_point() + ggtitle('Audience Score vs IMDB rating') +geom_smooth()
g3 <- training %>% ggplot(aes(log10(imdb_num_votes),audience_score, color= critics_rating)) +
geom_point() + ggtitle('Audience Score vs log(IMDB number of votes)') + geom_smooth()
g4 <- training %>% ggplot(aes(critics_score,audience_score, color= audience_rating)) +
geom_point() + ggtitle('Audience Score vs Critics Score') +geom_smooth()
grid.arrange(g1,g2,g3,g4, ncol=2)## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 87.86
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 17.14
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 124.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5
## Warning in sqrt(sum.squares/one.delta): Se han producido NaNs
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small. fewer
## data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 87.86
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 17.14
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 124.1
## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced
## Warning in max(ids, na.rm = TRUE): ningun argumento finito para max; retornando
## -Inf
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
and the correlations are:
library(corrplot)
variables <- c('runtime', 'imdb_rating', 'imdb_num_votes', 'critics_score', "audience_score")
selected_train <- training[variables]
head(selected_train)## # A tibble: 6 x 5
## runtime imdb_rating imdb_num_votes critics_score audience_score
## <dbl> <dbl> <int> <dbl> <dbl>
## 1 80 5.5 899 45 73
## 2 101 7.3 12285 96 81
## 3 84 7.6 22381 91 91
## 4 139 7.2 35096 80 76
## 5 90 5.1 2386 33 27
## 6 78 7.8 333 91 86
## runtime imdb_rating imdb_num_votes critics_score
## runtime 1.0000000 0.2676279 0.3452433 0.1737387
## imdb_rating 0.2676279 1.0000000 0.3271261 0.7677127
## imdb_num_votes 0.3452433 0.3271261 1.0000000 0.2131793
## critics_score 0.1737387 0.7677127 0.2131793 1.0000000
## audience_score 0.1822614 0.8646578 0.2873827 0.7098398
## audience_score
## runtime 0.1822614
## imdb_rating 0.8646578
## imdb_num_votes 0.2873827
## critics_score 0.7098398
## audience_score 1.0000000
For the categorical variables we need to see the variations on median and other factors.
p1 <- training %>% ggplot(aes(color= audience_rating))+ geom_boxplot(aes(audience_score,audience_rating))
p2 <- training %>% ggplot(aes(color= critics_rating))+ geom_boxplot(aes(audience_score,critics_rating))
p3 <- training %>% ggplot(aes(color= title_type))+ geom_boxplot(aes(audience_score,title_type))
p4 <- training %>% ggplot(aes(color= genre))+ geom_boxplot(aes(audience_score,genre))
grid.arrange(p1,p2,p3, ncol=2)I start with the full model
full_model <- lm(audience_score~imdb_rating+title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=training)
summary(full_model)##
## Call:
## lm(formula = audience_score ~ imdb_rating + title_type + genre +
## runtime + imdb_num_votes + critics_rating + audience_rating +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win,
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4540 -4.5599 0.6738 4.3682 24.6415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.011e+01 4.093e+00 -2.470 0.0138 *
## imdb_rating 9.543e+00 4.314e-01 22.118 <2e-16 ***
## title_typeFeature Film 2.291e+00 2.571e+00 0.891 0.3733
## title_typeTV Movie 8.071e-01 4.056e+00 0.199 0.8423
## genreAnimation 3.153e+00 2.485e+00 1.269 0.2050
## genreArt House & International -2.653e+00 2.071e+00 -1.281 0.2005
## genreComedy 1.591e+00 1.154e+00 1.378 0.1686
## genreDocumentary 2.437e+00 2.754e+00 0.885 0.3765
## genreDrama -7.618e-01 1.003e+00 -0.759 0.4479
## genreHorror -1.716e+00 1.754e+00 -0.978 0.3282
## genreMusical & Performing Arts 3.425e+00 2.479e+00 1.381 0.1677
## genreMystery & Suspense -3.181e+00 1.282e+00 -2.481 0.0134 *
## genreOther 1.163e-01 2.011e+00 0.058 0.9539
## genreScience Fiction & Fantasy -2.961e-01 2.468e+00 -0.120 0.9046
## runtime -2.244e-02 1.677e-02 -1.338 0.1813
## imdb_num_votes 2.746e-06 3.222e-06 0.852 0.3944
## critics_ratingFresh -1.664e-02 8.586e-01 -0.019 0.9845
## critics_ratingRotten -1.315e+00 9.478e-01 -1.387 0.1660
## audience_ratingUpright 2.010e+01 8.020e-01 25.063 <2e-16 ***
## best_pic_winyes 3.585e-01 2.944e+00 0.122 0.9031
## best_actor_winyes 4.909e-01 8.237e-01 0.596 0.5514
## best_actress_winyes -1.008e+00 9.338e-01 -1.079 0.2810
## best_dir_winyes 1.672e-02 1.206e+00 0.014 0.9889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.889 on 613 degrees of freedom
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8843
## F-statistic: 221.7 on 22 and 613 DF, p-value: < 2.2e-16
For the backward method we obtain next model:
finalmodel <- lm(audience_score~imdb_rating+genre+runtime+imdb_num_votes+audience_rating, data=training)
summary(finalmodel)##
## Call:
## lm(formula = audience_score ~ imdb_rating + genre + runtime +
## imdb_num_votes + audience_rating, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5443 -4.5285 0.7232 4.3756 24.9183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.028e+01 2.612e+00 -3.935 9.25e-05 ***
## imdb_rating 9.823e+00 3.967e-01 24.762 < 2e-16 ***
## genreAnimation 3.228e+00 2.470e+00 1.307 0.1917
## genreArt House & International -2.685e+00 2.060e+00 -1.303 0.1929
## genreComedy 1.511e+00 1.143e+00 1.322 0.1866
## genreDocumentary 5.545e-01 1.457e+00 0.381 0.7036
## genreDrama -6.794e-01 9.855e-01 -0.689 0.4908
## genreHorror -1.571e+00 1.744e+00 -0.901 0.3680
## genreMusical & Performing Arts 2.734e+00 2.314e+00 1.182 0.2378
## genreMystery & Suspense -3.058e+00 1.263e+00 -2.420 0.0158 *
## genreOther 2.331e-01 1.984e+00 0.118 0.9065
## genreScience Fiction & Fantasy 9.225e-02 2.449e+00 0.038 0.9700
## runtime -2.448e-02 1.596e-02 -1.534 0.1256
## imdb_num_votes 3.052e-06 2.920e-06 1.045 0.2963
## audience_ratingUpright 2.025e+01 7.863e-01 25.758 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.876 on 621 degrees of freedom
## Multiple R-squared: 0.8873, Adjusted R-squared: 0.8848
## F-statistic: 349.2 on 14 and 621 DF, p-value: < 2.2e-16
Intercept(-9.8) is the estimated audience score for a movie with imdb_rating, audience_rating and genre at zero. It does not provide any meaningful interpretation here.
imdb_rating coefficient(9.8): All else hold constant, for every one unit increase in imdb_rating, the model predicts a 9.8 increase in audience_score on average.
audience_ratingUpright coefficient(20.3246): All else hold constant, the model predicts rating Upright movie is 20.3 higher in audience score on average than rating Spilled movie.
genreMystery & Suspense (3.01): The model predicts that Animation films get an audience score that is 3.01 higher than Action & Adventure(reference category) films on average after controlling for imdb_rating and audience rating.
runtime (-0.03): All else hold constant, the model predicts audience score movie is -0.03 lower per each one incremente in runtime.
There are total 11 genre categories in the dataset, the audience score can higher or lower than Action & Adventure films depends on what genre is selected.
R-Squared(0.8876): 88.76% of the variablity in audience score can be explained by the model.
ggplot(finalmodel, aes(x = .fitted, y = .resid, color= genre)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") We can observe that there is clear a linear relationship between imdb rating and audience score(by genre). The linearity condition is met by our model.
Constant variance of residuals condition met, No fan shape in residuals plot.
ggplot(data = finalmodel, aes(x = .resid, fill= genre)) +
geom_histogram(color= "white") +
xlab("Residuals")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
and the normality
b1 <- ggplot(data = finalmodel, aes(sample = .resid, color= genre)) +
stat_qq() + ggtitle("By Genre")
b2 <- ggplot(data = finalmodel, aes(sample = .resid)) +
stat_qq() + ggtitle("Total Model Error Normality")
grid.arrange(b2,b1)We are going to use our final model to predict the audience score for the movie in the test set. First we create a new dataframe for this movie.
## 1 2 3 4 5 6 7 8
## 53.23105 77.11966 44.57628 88.91113 82.98794 48.81621 74.78927 40.93251
## 9 10 11 12 13 14
## 84.76071 76.65694 74.01250 43.01909 45.96933 37.31517
final_predict <- predict(finalmodel,testing, interval = "prediction", level= 0.95)
final_predict <- as.data.frame(final_predict)
final_predict## fit lwr upr
## 1 53.23105 39.06447 67.39763
## 2 77.11966 63.57628 90.66305
## 3 44.57628 31.02482 58.12773
## 4 88.91113 74.91748 102.90477
## 5 82.98794 69.34884 96.62704
## 6 48.81621 35.25092 62.38150
## 7 74.78927 61.22679 88.35175
## 8 40.93251 27.30097 54.56406
## 9 84.76071 71.08354 98.43788
## 10 76.65694 62.68351 90.63036
## 11 74.01250 60.14881 87.87620
## 12 43.01909 29.39473 56.64345
## 13 45.96933 32.13903 59.79964
## 14 37.31517 23.75545 50.87490
The model predicts movie Aliens in the test set will have an audience score at approximate 90.
See the errors:
## testing.audience_score final_predict.fit
## 1 55 53.23105
## 2 85 77.11966
## 3 47 44.57628
## 4 90 88.91113
## 5 84 82.98794
## 6 49 48.81621
## 7 64 74.78927
## 8 32 40.93251
## 9 73 84.76071
## 10 77 76.65694
## 11 70 74.01250
## 12 38 43.01909
## 13 44 45.96933
## 14 35 37.31517
final_data <- final_data %>% mutate( errores = (testing.audience_score - final_predict.fit))
final_data## testing.audience_score final_predict.fit errores
## 1 55 53.23105 1.7689486
## 2 85 77.11966 7.8803363
## 3 47 44.57628 2.4237230
## 4 90 88.91113 1.0888743
## 5 84 82.98794 1.0120586
## 6 49 48.81621 0.1837871
## 7 64 74.78927 -10.7892681
## 8 32 40.93251 -8.9325134
## 9 73 84.76071 -11.7607080
## 10 77 76.65694 0.3430650
## 11 70 74.01250 -4.0125038
## 12 38 43.01909 -5.0190899
## 13 44 45.96933 -1.9693345
## 14 35 37.31517 -2.3151737
##
## Call:
## lm(formula = audience_score ~ imdb_rating + genre + runtime +
## imdb_num_votes + audience_rating, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5443 -4.5285 0.7232 4.3756 24.9183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.028e+01 2.612e+00 -3.935 9.25e-05 ***
## imdb_rating 9.823e+00 3.967e-01 24.762 < 2e-16 ***
## genreAnimation 3.228e+00 2.470e+00 1.307 0.1917
## genreArt House & International -2.685e+00 2.060e+00 -1.303 0.1929
## genreComedy 1.511e+00 1.143e+00 1.322 0.1866
## genreDocumentary 5.545e-01 1.457e+00 0.381 0.7036
## genreDrama -6.794e-01 9.855e-01 -0.689 0.4908
## genreHorror -1.571e+00 1.744e+00 -0.901 0.3680
## genreMusical & Performing Arts 2.734e+00 2.314e+00 1.182 0.2378
## genreMystery & Suspense -3.058e+00 1.263e+00 -2.420 0.0158 *
## genreOther 2.331e-01 1.984e+00 0.118 0.9065
## genreScience Fiction & Fantasy 9.225e-02 2.449e+00 0.038 0.9700
## runtime -2.448e-02 1.596e-02 -1.534 0.1256
## imdb_num_votes 3.052e-06 2.920e-06 1.045 0.2963
## audience_ratingUpright 2.025e+01 7.863e-01 25.758 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.876 on 621 degrees of freedom
## Multiple R-squared: 0.8873, Adjusted R-squared: 0.8848
## F-statistic: 349.2 on 14 and 621 DF, p-value: < 2.2e-16
The model demonstrates that it is possible to predict a movie’s popularity, as measured by audience score with only four predictors - imdb score,runtime, audience rating and genre. Movie industries can use the similar methods when producing movies that are more likely to be liked by the target audience.
However, the potential shortcoming is that our model’s predictive power is limited because the sample data is not representative. Therefore, a larger number of observations to capture more variability in the population data in our testing data set is required to have a better measure of the model’s accuracy. And finally the model predicts accurately, the max error is -10 and was twice when we appered in the test data, the median or the error is zero.