load("movies.Rdata")
movies <- na.omit(movies) # Removing the NA values from the dataset
head(movies)
## # A tibble: 6 x 32
## title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month
## <chr> <fct> <fct> <dbl> <fct> <fct> <dbl> <dbl>
## 1 Fill… Feature F… Drama 80 R Indom… 2013 4
## 2 The … Feature F… Drama 101 PG-13 Warne… 2001 3
## 3 Wait… Feature F… Come… 84 R Sony … 1996 8
## 4 The … Feature F… Drama 139 PG Colum… 1993 10
## 5 Male… Feature F… Horr… 90 R Ancho… 2004 9
## 6 Lady… Feature F… Drama 142 PG-13 Param… 1986 1
## # … with 24 more variables: thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <int>, critics_rating <fct>, critics_score <dbl>,
## # audience_rating <fct>, audience_score <dbl>, best_pic_nom <fct>,
## # best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## # best_dir_win <fct>, top200_box <fct>, director <chr>, actor1 <chr>,
## # actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## # rt_url <chr>
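A quick dimension check (not shown in the original output) confirms how many complete cases remain after na.omit(); the factor counts in the summary tables later in this report imply 619 rows:

dim(movies) # expect 619 x 32 after dropping NAs (619 inferred from the summary counts below)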
The dataset contains information on movies from the well-known websites Rotten Tomatoes and IMDB. It consists of 651 randomly sampled movies produced and released before 2016, described by 32 variables. Because this is an observational study with no random assignment, no causal inference can be drawn; the results can, however, be generalized to movies produced and released before 2016.
I have always wondered whether casting an established actor/actress in a movie, or having it directed by one, really influences the overall performance or rating of a movie. Do audiences agree with critics, or are the critics' ratings biased? Well, why sit and wonder; let us just go and find out.
The goal of this project is to predict the overall performance and rating of a particular movie and to find out whether factors such as cast, crew and critics have any influence.
I have selected the few variables from the dataset that best suit the purpose of this study and that are expected to be independent of each other. Variables such as DVD release date, number of IMDB votes, and best picture nomination/win were therefore not included in the model. Variables with large domains, such as studio name and actor/director names, were excluded as well, and some variables, such as the links to the IMDB and Rotten Tomatoes pages for each movie, are simply irrelevant to measuring a movie's popularity. Theater release year was included on the assumption that movies released in certain periods may be more popular than others.
1. genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
2. runtime: Runtime of movie (in minutes)
3. mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
4. thtr_rel_year: Year the movie was released in theaters
5. imdb_rating: Rating on IMDB
6. critics_score: Critics score on Rotten Tomatoes
7. audience_score: Audience score on Rotten Tomatoes
8. best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes)
9. best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes)
10. best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes)
11. top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)

### Summarising the required data
summary(movies %>% select(genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating,
                          critics_score, audience_score, best_actor_win,
                          best_actress_win, best_dir_win, top200_box))
## genre runtime mpaa_rating thtr_rel_year
## Drama :298 Min. : 65.0 G : 16 Min. :1972
## Comedy : 86 1st Qu.: 93.0 NC-17 : 1 1st Qu.:1991
## Action & Adventure: 62 Median :103.0 PG :111 Median :2000
## Mystery & Suspense: 56 Mean :106.5 PG-13 :131 Mean :1998
## Documentary : 40 3rd Qu.:116.0 R :319 3rd Qu.:2007
## Horror : 22 Max. :267.0 Unrated: 41 Max. :2014
## (Other) : 55
## imdb_rating critics_score audience_score best_actor_win
## Min. :1.900 Min. : 1.00 Min. :11.00 no :528
## 1st Qu.:5.900 1st Qu.: 33.00 1st Qu.:46.00 yes: 91
## Median :6.600 Median : 61.00 Median :65.00
## Mean :6.486 Mean : 57.43 Mean :62.21
## 3rd Qu.:7.300 3rd Qu.: 82.50 3rd Qu.:80.00
## Max. :9.000 Max. :100.00 Max. :97.00
##
## best_actress_win best_dir_win top200_box
## no :548 no :576 no :604
## yes: 71 yes: 43 yes: 15
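The genre counts shown above can also be pulled out directly with dplyr (a small sketch, not part of the original analysis):

movies %>% count(genre, sort = TRUE)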
g <- ggplot(data = movies, aes(x = genre, y = imdb_rating, fill = genre))
g + theme_bw() +
  geom_violin(draw_quantiles = 0.5) +
  labs(title="Violin Plot",
       x="Genre",
       y="IMDB Rating")
This violin plot shows the relationship between genre and IMDB rating. The median rating for documentaries and dramas is higher than for the other genres. The shape of the distributions (extremely skinny at each end and wide in the middle) indicates that the ratings of horror movies and documentaries are highly concentrated around the median.
cor(movies$critics_score, movies$audience_score)
## [1] 0.7015256
g <- ggplot(data = movies, aes(x = critics_score, y = audience_score))
g + geom_jitter(alpha = 0.3) + geom_smooth() + facet_grid(critics_rating ~ .)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
With a correlation coefficient as high as 0.7, it is clear that critics and audiences agree with each other on a movie's rating most of the time.
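Whether that correlation is statistically distinguishable from zero can be checked with a significance test (a sketch; this step is not in the original analysis):

cor.test(movies$critics_score, movies$audience_score)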
p1 <- ggplot(data=movies, aes(x=imdb_rating)) +
geom_histogram(binwidth=0.5, fill="purple") +
xlab("IMDB Scores")
p2 <- ggplot(data=movies, aes(x=critics_score)) +
geom_histogram(binwidth=5, fill="pink") +
xlab("Critics Scores")
p3 <- ggplot(data=movies, aes(x=audience_score)) +
geom_histogram(binwidth=5, fill="magenta") +
xlab("Audience Scores")
grid.arrange(p1, p2, p3, nrow=1,
             top="Distribution of Rating Scores")
In contrast to the two Rotten Tomatoes scores, the IMDB scores show a nice, mostly normal distribution, centered around a mean of about 6.5 (6.486 in the summary above) with a slight left skew.
The target response variable for the prediction model is a movie rating score, but with three to choose from, which one should be used? Two of the ratings come from the Rotten Tomatoes website: one is an average of reviews by movie critics and the other is an average of reviews from the public (i.e., the audience). The third rating is an average of reviews on the IMDB website (no distinction is made between critic and audience reviews). Given its distribution and the fact that it has the highest pairwise correlation with the other two scores, the IMDB rating (imdb_rating) is the chosen response variable.
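The pairwise correlations behind that choice can be computed as follows (a sketch; the original chunk is not shown):

movies %>%
  select(imdb_rating, critics_score, audience_score) %>%
  cor() %>%
  round(3)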
We will be using a backward elimination method, i.e., we will start with all the variables (the full model) and remove variables one at a time to arrive at a parsimonious model. The initial variables are:
genre, runtime, mpaa_rating, best_actor_win, best_actress_win, best_dir_win, critics_score, audience_score
The initial model is summarised below.
model <- lm(imdb_rating ~ genre + runtime + mpaa_rating + best_actor_win +
              best_actress_win + best_dir_win + critics_score + audience_score,
            data = movies)
summary(model)
##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + best_actor_win +
## best_actress_win + best_dir_win + critics_score + audience_score,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.37794 -0.18657 0.04211 0.27789 1.19448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2971065 0.1816755 18.148 < 2e-16 ***
## genreAnimation -0.5263118 0.1917443 -2.745 0.00624 **
## genreArt House & International 0.2459594 0.1528671 1.609 0.10815
## genreComedy -0.1559576 0.0797020 -1.957 0.05084 .
## genreDocumentary 0.2808051 0.1129641 2.486 0.01320 *
## genreDrama 0.0294810 0.0696210 0.423 0.67212
## genreHorror 0.0744675 0.1199214 0.621 0.53486
## genreMusical & Performing Arts 0.0329651 0.1519943 0.217 0.82837
## genreMystery & Suspense 0.2135603 0.0901403 2.369 0.01814 *
## genreOther -0.0033619 0.1373917 -0.024 0.98049
## genreScience Fiction & Fantasy -0.0869348 0.1768009 -0.492 0.62310
## runtime 0.0050418 0.0011441 4.407 1.25e-05 ***
## mpaa_ratingNC-17 -0.0165491 0.4886483 -0.034 0.97299
## mpaa_ratingPG -0.1318487 0.1386581 -0.951 0.34204
## mpaa_ratingPG-13 -0.0742621 0.1415873 -0.524 0.60013
## mpaa_ratingR -0.0427815 0.1372020 -0.312 0.75529
## mpaa_ratingUnrated -0.1809026 0.1601397 -1.130 0.25908
## best_actor_winyes 0.0256733 0.0562272 0.457 0.64812
## best_actress_winyes 0.0686706 0.0615902 1.115 0.26532
## best_dir_winyes 0.0564405 0.0777405 0.726 0.46812
## critics_score 0.0104516 0.0009839 10.623 < 2e-16 ***
## audience_score 0.0334177 0.0013661 24.463 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4686 on 597 degrees of freedom
## Multiple R-squared: 0.8163, Adjusted R-squared: 0.8098
## F-statistic: 126.3 on 21 and 597 DF, p-value: < 2.2e-16
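As an illustration of one backward elimination pass (this exact code is not in the original; the vector name candidates is mine), each non-significant variable can be dropped in turn and the resulting adjusted R-squared compared:

# Drop each candidate in turn and record the resulting adjusted R-squared
candidates <- c("mpaa_rating", "best_actor_win", "best_actress_win", "best_dir_win")
sapply(candidates, function(v) {
  reduced <- update(model, as.formula(paste(". ~ . -", v)))
  summary(reduced)$adj.r.squared
})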
Eliminating the non-significant variables leaves the reduced model below (the original chunk is not shown; the call is reconstructed from the Call line in the summary output that follows):
red_model <- lm(imdb_rating ~ genre + runtime + critics_score + audience_score,
                data = movies)
summary(red_model)
##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + critics_score +
## audience_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34317 -0.19822 0.04111 0.26913 1.16952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.182655 0.130365 24.413 < 2e-16 ***
## genreAnimation -0.490449 0.177001 -2.771 0.00576 **
## genreArt House & International 0.219695 0.148553 1.479 0.13969
## genreComedy -0.148901 0.078375 -1.900 0.05793 .
## genreDocumentary 0.216866 0.101665 2.133 0.03331 *
## genreDrama 0.047788 0.067161 0.712 0.47702
## genreHorror 0.087379 0.117252 0.745 0.45642
## genreMusical & Performing Arts 0.011288 0.150541 0.075 0.94025
## genreMystery & Suspense 0.251075 0.087187 2.880 0.00412 **
## genreOther -0.019440 0.136217 -0.143 0.88657
## genreScience Fiction & Fantasy -0.074232 0.176285 -0.421 0.67384
## runtime 0.005435 0.001060 5.127 3.97e-07 ***
## critics_score 0.010438 0.000962 10.850 < 2e-16 ***
## audience_score 0.033525 0.001360 24.641 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4681 on 605 degrees of freedom
## Multiple R-squared: 0.8143, Adjusted R-squared: 0.8103
## F-statistic: 204 on 13 and 605 DF, p-value: < 2.2e-16
The reduced model keeps only 4 of the original predictors, yet it has a slightly larger adjusted R-squared of 0.8103. In other words, using only a fraction of the variables, this model captures almost the same variability (81%) as the full model. The critics score and audience score are the most significant predictors.
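The comparison can be made explicit (a one-line sketch, not in the original):

c(full = summary(model)$adj.r.squared, reduced = summary(red_model)$adj.r.squared)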
plot(red_model$residuals ~ movies$runtime, xlab="Runtime", ylab="Residuals",
     main="a) Residuals vs. Runtime")
plot(red_model$residuals ~ movies$critics_score, xlab="Critics score", ylab="Residuals",
     main="b) Residuals vs. Critics score")
The residuals show a little skewness in plot a), while in plot b) the deviations appear only in the tails.
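A histogram and normal Q-Q plot of the residuals round out the standard diagnostics (a sketch; these plots are discussed but their chunk is not shown):

hist(red_model$residuals, main = "Histogram of residuals", xlab = "Residuals")
qqnorm(red_model$residuals)
qqline(red_model$residuals)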
The residuals seem to be mostly homoscedastic, though there are a few discrepancies. * * *
To test the model on a movie released in 2016 (outside the dataset), we consider “Doctor Strange”. Plugging in the required variable values, the predicted imdb_rating is 7.55, while the actual IMDB rating is 7.5, so the model predicted the rating for this movie accurately. The lower and upper bounds of the 95% prediction interval are 6.57 and 8.52 respectively.
new_movie <- data.frame(genre = "Science Fiction & Fantasy", runtime = 115, mpaa_rating = "PG-13",best_dir_win = "No" ,critics_score = 89, audience_score = 86)
predict(red_model,new_movie)## 1
## 7.545571
## fit lwr upr
## 1 7.545571 6.568234 8.522908
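Note that this is a prediction interval for an individual movie; a confidence interval for the mean rating of all such movies would be narrower (sketch):

predict(red_model, new_movie, interval = "confidence")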
For the movie “The Lord of the Rings: The Fellowship of the Ring”, the predicted imdb_rating is 8.28, whereas the rating on the website is 8.8, a discrepancy of about 0.5. The lower and upper bounds of the 95% prediction interval are 7.34 and 9.23 respectively, so the actual rating falls within the interval.
new_movie <- data.frame(genre = "Action & Adventure", runtime = 178, mpaa_rating = "PG-13", best_dir_win = "Yes", critics_score = 91, audience_score = 95)
predict(red_model,new_movie)## 1
## 8.28481
## fit lwr upr
## 1 8.28481 7.343928 9.225692
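As a sanity check, this fit can be reproduced by hand from the reduced model's coefficient table: “Action & Adventure” is the reference genre level, so only the intercept and the three numeric terms contribute.

# intercept + runtime + critics_score + audience_score terms (baseline genre)
3.182655 + 178 * 0.005435 + 91 * 0.010438 + 95 * 0.033525
## [1] 8.284818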
Although the best actor, best actress and best director variables turned out not to be statistically significant, variables such as genre, runtime, audience score and critics score clearly were. According to the model output, critics and audience scores play a significant role in predicting a movie's overall performance rating, while star casting adds little. There may be other contributing factors, but given the limitations of the dataset this model is our best fit.