library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.2
library(statsr)
Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies
. Delete this note when before you submit your work.
load("movies.Rdata")
Dataset is made of information about movies in Rotten Tomatoes and IMDB. We have got 651 randomly sampled movies produced and released before 2016, with 32 variables. We cannot infere any causality here as this is observational. We are able to generalize results to all movies. Well, as there are lot of irrelevant variables (like title or link) for our analysis (as I am going to use score) and there are few NAs in dataset, I am not going to work with them, so I do not consider it to be important to deal with that.
As I am fan of B-movies and I do think people tend to copy authorities (ie. critics) I would like to know if there is correlation between movie score and if audience is following the critics voice. I do suppose I can predict movie score with this. I do also wonder if runtime is what affect score of movie.
First we will look on what kind of data we do have. First and last six observations will give us good image of data.
head(movies)
## # A tibble: 6 × 32
## title title_type genre runtime mpaa_rating
## <chr> <fctr> <fctr> <dbl> <fctr>
## 1 Filly Brown Feature Film Drama 80 R
## 2 The Dish Feature Film Drama 101 PG-13
## 3 Waiting for Guffman Feature Film Comedy 84 R
## 4 The Age of Innocence Feature Film Drama 139 PG
## 5 Malevolence Feature Film Horror 90 R
## 6 Old Partner Documentary Documentary 78 Unrated
## # ... with 27 more variables: studio <fctr>, thtr_rel_year <dbl>,
## # thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <int>, critics_rating <fctr>, critics_score <dbl>,
## # audience_rating <fctr>, audience_score <dbl>, best_pic_nom <fctr>,
## # best_pic_win <fctr>, best_actor_win <fctr>, best_actress_win <fctr>,
## # best_dir_win <fctr>, top200_box <fctr>, director <chr>, actor1 <chr>,
## # actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## # imdb_url <chr>, rt_url <chr>
tail(movies)
## # A tibble: 6 × 32
## title title_type genre runtime
## <chr> <fctr> <fctr> <dbl>
## 1 Cocoon: The Return Feature Film Drama 116
## 2 Death Defying Acts Feature Film Drama 97
## 3 Half Baked Feature Film Comedy 82
## 4 Dance of the Dead Feature Film Action & Adventure 87
## 5 Around the World in 80 Days Feature Film Action & Adventure 120
## 6 LOL Feature Film Comedy 97
## # ... with 28 more variables: mpaa_rating <fctr>, studio <fctr>,
## # thtr_rel_year <dbl>, thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## # dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## # imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fctr>,
## # critics_score <dbl>, audience_rating <fctr>, audience_score <dbl>,
## # best_pic_nom <fctr>, best_pic_win <fctr>, best_actor_win <fctr>,
## # best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## # director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## # actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
dim(movies)
## [1] 651 32
We can see we have 651 entries with lot of variables - title, type of movie, genre, mpaa_rating, studio, release zear, month, day, etc. 32 variables and 651 observations…
summary(movies)
## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Well there are lot of feature film (591), which is huge number compared to 55 documentary or even 5 TV movies! Also we have lot of drama (305) compared to 87 of comedy…this can make analysis harder as these number can overweight others. These are not big numbers so our analysis will be not as accurate as we could hope for. Fortunately there are not a lot of NAs in data so we have not to worry about them.
p_genrerun <- ggplot(movies, aes(x=factor(genre), y=runtime)) +
geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_genrerun + ggtitle("Genre to runtime") + geom_hline(yintercept =median(movies$runtime, na.rm = TRUE), col = "royalblue",lwd = 1)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
We can see all genres are oscilating around median runtime value of 103 minutes. There are lot of outliers in data, mostly documentary genre.
p_genreimdb <- ggplot(movies, aes(x=factor(genre), y=imdb_rating)) +
geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_genreimdb + ggtitle("Genre to IMDB rating") + geom_hline(yintercept =median(movies$imdb_rating, na.rm = TRUE), col = "royalblue",lwd = 1)
We can see sci-fi has biggest variance between 1Q a 3Q, median of sci-fi genre is lower then median of all genres. Documentary performs best as Musical and performing arts movies. There are outliers too for lot of genres. From median of genre we can see that people tend to give higher (around 7) rating on IMDB.
plot(movies$imdb_rating,movies$runtime, main="IMDB rating to runtime", xlab = "IMDB rating", ylab="Runtime in minutes")
abline(lm(movies$runtime~movies$imdb_rating),col = "royalblue",lwd = 1)
From simple linear regression used to IMDB rating and runtime of movie we can see that we can expect better IMDB rating if runtime is longer.
ggplot(data = movies, aes(x = critics_score, y = audience_score)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE) + ggtitle("Critics score to audience score - Rotten")
We see there is correlation between critics score of Rotten Tomato and audience score. What about IMDB?
ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE) + ggtitle("Critics score to audience score - IMDB")
Very strong correlation…
ggplot(data = movies, aes(x = critics_score, y = imdb_rating)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE) + ggtitle("IMDB vs. Rotten")
We definitely have correlation between critics and IMDB rating. I will use these values in model.
regres <- lm(movies$audience_score ~ movies$critics_score + movies$imdb_rating)
summary(regres)
##
## Call:
## lm(formula = movies$audience_score ~ movies$critics_score + movies$imdb_rating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.668 -6.758 0.723 5.513 52.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.03195 2.86401 -12.930 < 2e-16 ***
## movies$critics_score 0.07318 0.02161 3.386 0.000753 ***
## movies$imdb_rating 14.65760 0.56590 25.901 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.08 on 648 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.7516
## F-statistic: 984.4 on 2 and 648 DF, p-value: < 2.2e-16
Not surprisingly we have got strong relationship between these values. I wonder if they are correlated.
reg <- lm( movies$critics_score ~ movies$imdb_rating)
summary(reg)
##
## Call:
## lm(formula = movies$critics_score ~ movies$imdb_rating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.855 -12.803 3.116 13.694 43.196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.3792 4.3572 -16.61 <2e-16 ***
## movies$imdb_rating 20.0317 0.6619 30.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.31 on 649 degrees of freedom
## Multiple R-squared: 0.5853, Adjusted R-squared: 0.5846
## F-statistic: 915.9 on 1 and 649 DF, p-value: < 2.2e-16
These values are correlated, which means I do not need to add more of them as this will not be helpful. I will use only critics score, as IMDB score and audience_score are strongly correlated (both are audience score).So critics score is going to by my explanatory variable.
model <- lm(audience_score ~ critics_score, data=movies) #funny thing, if you will not use "data=" you will not be able to get predict() to work properly
par(mfrow=c(2,2)) #combine plots to 2x2 table
hist(model$residuals, main="residuals")
qqnorm(model$residuals)
qqline(model$residuals)
plot(model$residuals ~ model$fitted)
summary(model)
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.043 -9.571 0.504 10.422 43.544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.43551 1.27561 26.21 <2e-16 ***
## critics_score 0.50144 0.01984 25.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared: 0.496, Adjusted R-squared: 0.4952
## F-statistic: 638.7 on 1 and 649 DF, p-value: < 2.2e-16
We can see thet residuals are normaly distributed around zero. Few outliers do not infere with data, therefore we do not care for them. Also there is linear relationships between x and y, we have got nearly normal residuals and we see from constant variability of residuals the independence of residuals.
From summary table is clear what we did know - there is strong relationship between critics score and audience score. And we can see that critic score is very significant preedictor for audience score (as 1 unit increasing in critics score meand 0.5 increase for audience score).
As I am fan of B-movies I am constantly looking for low score on IMDB (bellow 5 at least, usualy bellow 3). Lowest IMDB score has Disaster Movie (1.9 on IMDB, 1 critics score, 19 audience score). So I am going to predict this movie score with last “model” variable…
film <- data.frame(title ="Disaster Movie", critics_score = 1)
predict(model, film, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 33.93695 5.616396 62.2575
We can see that we have got prediction of 33.94 audience score with 95% confidence, that our score will be between 5.62 and 62.26 score. Well, actualy, audience score of this movie was 19…
film2 <- data.frame(title ="Hellraiser - Bloodline", critics_score = 25)
predict(model, film2, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 45.97145 17.70844 74.23446
I tried it with another movie - Hellraiser, with average 25 critics score. Model predicted that this movie will get audience score of ~46 and that with 95% confidence will be between 17.7 and 74.23 score. As this movie got 47 audience score we see it was pretty accurate.
I was trying to determine if there is any association between audience score and others scoring systems and how we can use it for predictions. As there is strong positive relationship between critics and audience score, critics are best predictors of audience scores.