Task background

Congratulations on getting a job as a data scientist at Paramount Pictures! Please see the Data Analysis Project for your assignment. Below you will find the files that you will need.

Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset is provided below, and it includes information from Rotten Tomatoes and IMDB for a random sample of movies.She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants you team to figure it all out.

As part of this project you will complete exploratory data analysis (EDA), modeling, and prediction.

Getting started

load('movies.Rdata')
library(statsr)
library(ggplot2)
library(dplyr)
library(GGally)

Part1: Data

The dataset has 651 observations that come from Rotten Tomatoes and IMDB. It includes 32 variables (e.g., ‘title’,‘title type’,‘genre’).

The data is randomly samples and its sample size is rational because it lies between the minimum requirement size and the maximum of that (smaller than 10% of that population).

Part2: Data manipulation

In this part, we create new categorial variables using dplyr package based on ‘title_type’,‘genre’,‘mpaa-rating’,‘thtr_rel_month’. We will get

# New categorical variables added
movies <- movies %>% mutate(feature_film = factor(ifelse(title_type == "Feature Film", "yes", "no")),
                            drama = factor(ifelse(genre == "Drama", "yes", "no")), 
                            mpaa_rating_R = factor(ifelse(mpaa_rating == "R", "yes", "no"))) 
# New time-based variables added
movies <- movies %>% mutate(oscar_season = factor(ifelse(thtr_rel_month > 9,"yes", "no")),
                            summer_season = factor(ifelse(thtr_rel_month > 4 & thtr_rel_month < 9,"yes", "no")))
# Only the required variables are selected for
movies <- movies %>% select(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season,imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)

Part3: Exploratory data analysis

Here performs EDA of the relationship between variables created above and audience_score. Since five variables created, five tables and five diagrams are set and plotted.

First, we’d have a look at the value description of selected variables

str(movies)
## tibble [651 × 17] (S3: tbl_df/tbl/data.frame)
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ feature_film    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
##  $ drama           : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating_R   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ oscar_season    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ summer_season   : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Next, drawing the diagrams

ggplot(data=movies, aes(x = feature_film, y =audience_score)) +
  geom_boxplot()+ggtitle('feature_film Vs. audience_score')

ggplot(data=movies, aes(x = drama, y =audience_score)) +
  geom_boxplot()+ggtitle('drama Vs. audience_score')

ggplot(data=movies, aes(x = mpaa_rating_R, y =audience_score)) +
  geom_boxplot()+ggtitle('mpaa_rating_R Vs. audience_score')

ggplot(data=movies, aes(x = oscar_season, y =audience_score)) +
  geom_boxplot()+ggtitle('oscar_season Vs. audience_score')

ggplot(data=movies, aes(x = summer_season, y =audience_score)) +
  geom_boxplot()+ggtitle('summer_season Vs. audience_score')

From above box plots, it is clear that ‘feature_film’ has a strong linear linkage to intended variable, and ‘drama’ shows a certain relationship while other vairables ‘mpaa_rating_R’,‘oscar_season’, and ‘summer_season’ are weakenly lined to ‘aduience score’.

Besides, in addition to above created variables, other categorial variables ‘best picture nom’,‘best actor win’,‘best actress win’,‘best director win’, ‘box office top 200’, their relationships with response variable should be checked.

plot(movies$audience_score ~ movies$best_pic_nom , col = "yellow" , main = "best picture nomination")

plot(movies$audience_score ~ movies$best_actor_win , col = "orange" , main = "best actor win ")

plot(movies$audience_score ~ movies$best_actress_win , col = "steelblue" , main = "best actress win ")

plot(movies$audience_score~ movies$best_dir_win , col = "purple" , main = "best director win ")

plot(movies$audience_score ~ movies$top200_box , col = "green" , main = "Box office top 200")

According to box plotting result, little evidence support any strong relationship between ‘audience_score’ and ‘best_actor_win’,‘best_actress_win’,‘best_dir_win’. Among five box plots, ‘best picture nomination’ shows the strongest linkage to y variable, and ‘top200_box’ presents a postive relationship. (‘top200_box’,‘best picture nomination’ will be taken into modelling)

Next step, research focuses on contious variables’ relationship with audience score (‘critics_score’,‘imbd_rating’,‘imbd_num_votes’).

ggplot(data=movies,aes(x=critics_score,y=audience_score))+geom_point()+
        geom_smooth(method=lm,se=F)+ggtitle("relationship between critic score Vs audience score")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data=movies, aes(x=imdb_rating, y=audience_score)) + geom_point() + geom_smooth(method=lm,se=F)+ggtitle("relationship between IMDB Vs Audience score")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data=movies, aes(x=imdb_num_votes, y=audience_score)) + geom_point() + geom_smooth(method=lm,se=F)+ggtitle("relationship between IMDB Vs Audience score")
## `geom_smooth()` using formula 'y ~ x'

From above plotting, we can clearly judge the linear relationship between audience score and other three variables. Clearly, the linear relationship between ‘audience_score’ and ‘imdb_num_votes’ is very weak because the scatters present radial pattern.

Next, below plots the relationship between score and runtime.

ggplot(data = movies , aes(x = runtime , y = audience_score, color = runtime)  ) + geom_col()
## Warning: Removed 1 rows containing missing values (position_stack).

Below is linear fitting checkin.

my_fn <- function(data, mapping, ...){
    p <- ggplot(data = data, mapping = mapping) + 
        geom_point() + 
        geom_smooth(method=loess, fill="red", color="red", ...) +
        geom_smooth(method=lm, fill="blue", color="blue", ...)
    p
}
movies %>% select(runtime, thtr_rel_year, imdb_rating, critics_score, audience_score) %>% ggpairs(lower = list(continuous = my_fn))
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Removed 1 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Removed 1 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Removed 1 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

From above loss (red) curve and regression line (blue), these variables are almost linear and it is not necessary to conduct any transformation.

Part4: Modelling

Regfull<-lm(audience_score~feature_film+drama+best_pic_nom
              +top200_box+critics_score+imdb_rating+runtime+thtr_rel_year,data=movies)
Regfull<-summary(Regfull)
Regfull
## 
## Call:
## lm(formula = audience_score ~ feature_film + drama + best_pic_nom + 
##     top200_box + critics_score + imdb_rating + runtime + thtr_rel_year, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.811  -6.549   0.306   5.687  52.390 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     82.28788   73.71220   1.116  0.26469    
## feature_filmyes -2.15987    1.58944  -1.359  0.17466    
## dramayes         0.78616    0.86014   0.914  0.36107    
## best_pic_nomyes  4.14927    2.30051   1.804  0.07176 .  
## top200_boxyes    2.23748    2.66859   0.838  0.40209    
## critics_score    0.05839    0.02209   2.643  0.00841 ** 
## imdb_rating     14.84961    0.58570  25.354  < 2e-16 ***
## runtime         -0.06541    0.02242  -2.917  0.00366 ** 
## thtr_rel_year   -0.05575    0.03667  -1.520  0.12898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 641 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7578, Adjusted R-squared:  0.7548 
## F-statistic: 250.7 on 8 and 641 DF,  p-value: < 2.2e-16
Finalmodel<-lm(audience_score~feature_film+best_pic_nom
              +critics_score+imdb_rating+runtime+thtr_rel_year,data=movies)
summary(Finalmodel)
## 
## Call:
## lm(formula = audience_score ~ feature_film + best_pic_nom + critics_score + 
##     imdb_rating + runtime + thtr_rel_year, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.382  -6.451   0.302   5.773  52.278 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     81.30090   73.54752   1.105  0.26939    
## feature_filmyes -1.65667    1.51131  -1.096  0.27341    
## best_pic_nomyes  4.22527    2.29809   1.839  0.06643 .  
## critics_score    0.06145    0.02191   2.804  0.00520 ** 
## imdb_rating     14.88891    0.58419  25.486  < 2e-16 ***
## runtime         -0.06070    0.02205  -2.753  0.00607 ** 
## thtr_rel_year   -0.05574    0.03660  -1.523  0.12828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 643 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7572, Adjusted R-squared:  0.755 
## F-statistic: 334.3 on 6 and 643 DF,  p-value: < 2.2e-16

Multicollinearity

pairs(data=movies,~audience_score+feature_film+best_pic_nom+critics_score+imdb_rating+runtime)

Linearity

par(mfrow=c(1,2),mar=c(3,2,3,2))
hist(Finalmodel$residuals,main="histogram of residuals",col="skyblue")
qqnorm(Finalmodel$residuals,pch=20,main="Normal Probability Plot of Residuals")
qqline(Finalmodel$residuals)

par(mfrow=c(1,2))

plot(Finalmodel$residuals~Finalmodel$fitted,main='Residuals vs. Predicted (fitted) ')
plot(abs(Finalmodel$residuals)~Finalmodel$fitted,main='Absolute Residuals vs. Predicted')

audience_score+feature_film+best_pic_nom+critics_score+imdb_rating+runtim

Part5: Prediction

imdb_rating<-8.3
feature_film<-'yes'
runtime<- 176
critics_score<- 84
best_pic_nom <-'yes'
thtr_rel_year<- 2022
Batman<-data.frame(imdb_rating,feature_film,runtime,critics_score,best_pic_nom,thtr_rel_year)
predict(Finalmodel,Batman)
##        1 
## 89.22295
predict(Finalmodel,Batman,interval="confidence")
##        fit      lwr      upr
## 1 89.22295 84.22315 94.22274

Part5: Prediction

By applying EDA, the modelling result is reliable based on checking assumptions required in Bayesian linear regression. The final result of predicting audience score, taking the Batman released in 2022 as an example, proving a reliable reference for predicting the popularity of a film.