Emiliano La Rocca

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.4.0

Load data

load("movies.Rdata")

Part 1: Data

The movies data set was downloaded from the Coursera assignment page into an R project. It contains 651 randomly sampled movies produced and released before 2016, with information compiled from audience and critics reviews on Rotten Tomatoes and IMDB. There are 32 variables. Because the data are observational, only associations can be studied and no causal analysis will be done. The results can be generalized to movies produced and released before 2016. Some of the variables are of little relevance for identifying a movie's popularity. One observation with a missing runtime is omitted from the models that use runtime.

Generalizability

The data is gathered online, from whoever wants to leave ratings or reviews for movies. Although the respondents are self-selecting (as opposed to randomly selected), there is no reason to think they are skewed toward any one demographic, except for the fact they all have access to a computer and the internet. Thus, the opinions (ratings and scores) are generalizable to internet users who watch movies.

Part 2: Research question

The research question I set out to answer is: What variables in the data can be used to predict overall popularity (as an average of the Rotten Tomatoes scores and IMDB rating) for a given movie and how well do they do?

For example, we want to determine how the type of movie, runtime, critics score, rating and number of votes relate to the popularity of a movie. The purpose of this study is to identify, through statistical modeling, interesting patterns in the data with respect to popularity and the features that matter more than others.

Part 3: Exploratory data analysis

The movies in the dataset are divided into 10 genres plus an extra category named “Other”. For the exploratory analysis we merged some of the smallest genres into “Other”. Figure 2 a) shows that the largest group, 305 of the 651 movies (Figure 1), is drama. Most movies did not win a best picture award, as depicted in Figure 2 b); the same holds for the actor, actress and director award variables (Figure 2 c, d and e).

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Figure 1. Summary statistics of genre, runtime, critics score, audience score, best picture award, best actor award, best actress award and best director award.

layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
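# maxsum=7 keeps the six largest genres plus "(Other)", matching the seven categories named in the title
barplot(summary(movies$genre,maxsum=7),main="a) Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)")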
barplot(summary(movies$best_pic_win),main="b) Whether or not the movie won a best picture Oscar (no, yes)")
barplot(summary(movies$best_actor_win),main="c) Whether or not one of the main actors in the movie ever won an Oscar (no, yes)")
barplot(summary(movies$best_actress_win),main="d) Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - \n note that this is not necessarily whether the actresses won an Oscar for their role in the given movie")
barplot(summary(movies$best_dir_win),main="e) Whether or not the director of the movie ever won an Oscar (no, yes) - \n note that this is not necessarily whether the director won an Oscar for the given movie")

Figure 2. Barplots for genre of movies, best picture award, main actors award, main actresses award and director award.

For the numerical variables we used boxplots and histograms (Figure 3). The average runtime of a movie is around 105 minutes and its distribution is slightly right skewed (Figure 3 a and b), with some outlier movies running around 250 minutes. The critics scores are nearly uniformly distributed but slightly left skewed (Figure 3 c and d). The audience scores behave similarly to the critics scores, with a nearly uniform, left-skewed distribution (Figure 3 e and f).

layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow = TRUE))
boxplot(movies$runtime,xlab="Runtime",main="a) Boxplot of movies runtime")
hist(movies$runtime, xlab="Runtime",main="b) Histogram of movies runtime")
boxplot(movies$critics_score,xlab="Critics score",main="c) Boxplot of critics score")
hist(movies$critics_score, xlab="Critics score",main="d) Histogram of critics score")
boxplot(movies$audience_score,xlab="Audience score",main="e) Boxplot of audience score")
hist(movies$audience_score, xlab="Audience score",main="f) Histogram of audience score")

Figure 3. Boxplots and histograms of movies runtime, critics score and audience score.

layout(matrix(c(1,2), 1, 2, byrow = TRUE))
boxplot(movies$imdb_rating,xlab="imdb_rating",main="a) Boxplot of imdb_rating")
hist(movies$imdb_rating, xlab="imdb_rating",main="b) Histogram of imdb_rating")

Figure 4. Boxplot and histogram of imdb_rating, which is roughly bell-shaped with a slight left skew.
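The skew directions read off the plots can be cross-checked against the means and medians already printed by summary(movies). The sketch below only re-uses those printed numbers (it does not load the data), and the mean-versus-median comparison is a rule of thumb, not a formal skewness test.

```r
# Means and medians copied from the summary(movies) output above.
stats <- data.frame(
  variable = c("runtime", "critics_score", "audience_score", "imdb_rating"),
  mean     = c(105.8, 57.69, 62.36, 6.493),
  median   = c(103.0, 61.00, 65.00, 6.600)
)

# Rule of thumb: mean > median suggests right skew, mean < median left skew.
stats$skew <- ifelse(stats$mean > stats$median, "right", "left")
stats
```

Only runtime has its mean above its median, which agrees with the right skew seen in its histogram.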

p1 <- ggplot(data = movies, aes(x = imdb_rating)) + geom_histogram(colour = "black", fill = "skyblue", binwidth = .3)
p2 <- ggplot(data = movies, aes(x = imdb_num_votes)) + geom_histogram(colour = "black", fill = "salmon", binwidth = 40000, alpha = 0.5)
grid.arrange(p1, p2, nrow = 1, ncol = 2)

quantile(movies$imdb_rating, c(0, 0.25, 0.5, 0.75, 0.9, 1))

##   0%  25%  50%  75%  90% 100% 
##  1.9  5.9  6.6  7.3  7.8  9.0

quantile(movies$imdb_num_votes, c(0, 0.25, 0.5, 0.75, 0.9, 1))

##       0%      25%      50%      75%      90%     100% 
##    180.0   4545.5  15116.0  58300.5 151934.0 893008.0

Figure 5. imdb_rating has the closest affinity to the shape of a normal distribution. imdb_num_votes is heavily right skewed, with 90% of movies having 151,934 votes or fewer.

p3 <- ggplot(data = movies, aes(x = critics_score)) + geom_histogram(colour = "black", fill = "cyan", binwidth = 5, alpha = 0.5)
p4 <- ggplot(data = movies, aes(x = audience_score)) + geom_histogram(colour = "black", fill = "yellow", binwidth = 5, alpha = 0.7)
grid.arrange(p3, p4,  nrow = 1, ncol = 2)

quantile(movies$audience_score, c(0, 0.25, 0.5, 0.75, 0.9, 1))

##   0%  25%  50%  75%  90% 100% 
##   11   46   65   80   87   97

quantile(movies$critics_score, c(0, 0.25, 0.5, 0.75, 0.9, 1))

##   0%  25%  50%  75%  90% 100% 
##    1   33   61   83   93  100

Figure 6. The distributions of critics_score and audience_score appear similar, except that the audience scores taper off more at both ends.

After describing the main features of the dataset movies, in relation to the research question I needed to quantify the notion of “popularity.” I decided to combine the following variables imdb_rating, critics_score, and audience_score into one number called “popularity.” I decided on a simple average, giving each of the three variables equal weight in the popularity score. Since the IMDB score was from 1 through 10 whereas the Rotten Tomatoes scores were out of 100, I multiplied the IMDB score by 10 before adding it in. The result is a number from 0-100 that indicates a movie’s overall popularity as given by the three measures.

# index of popularity 
movies <- movies %>% mutate(popularity = (imdb_rating*10+critics_score+audience_score)/3)
summary(movies$popularity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.67   48.33   62.67   61.66   77.67   94.67

# relationship between popularity and genre
ggplot(movies, aes(x = genre, y = popularity))+
  geom_boxplot(color = '#000000') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#relationship between popularity and best_dir_win
ggplot(data = movies, aes(x = best_dir_win, y = popularity))+
  geom_boxplot()

#distributions of movies by studio
#looks like studio names are not consistent, would need cleanup to be used for analysis
df_studios <- movies %>% group_by(studio) %>% summarise(count = n())
df_studios <- df_studios %>% arrange(desc(count))
head(df_studios)

## # A tibble: 6 × 2
##                             studio count
##                             <fctr> <int>
## 1               Paramount Pictures    37
## 2            Warner Bros. Pictures    30
## 3 Sony Pictures Home Entertainment    27
## 4               Universal Pictures    23
## 5                Warner Home Video    19
## 6                 20th Century Fox    18

#relationship between popularity and studio
ggplot(data = movies, aes(x = studio, y = popularity))+
  geom_boxplot()

Part 4: Modeling

We want to study, through a multiple regression model, how the response variable Y depends on 4 predictors.
Variables used to construct our model:
runtime: the duration of a film in minutes;
imdb_rating (the response Y): a score on a scale of 1 to 10 given by users of IMDb. Each IMDb user has one vote per title and can change it at any time. The vote totals are converted into a weighted mean rating that is displayed beside each title on the website;
imdb_num_votes: the number of IMDb users who voted for a particular film. It is an open-ended count beginning at 0, and its maximum depends on how many users of the website voted for that particular film;
critics_score and audience_score: both come from the Rotten Tomatoes website and are on a scale of 1 to 100. The audience score is the percentage of users who have rated the film positively.
The first part of the study analyzes the relationship between the response variable Y and each of the predictors (X1, X2, X3, X4) separately, keeping in mind that the relation between the response and a predictor changes depending on whether the predictor is taken alone or in the global model with all the other predictors.
For each of these simple models, the relationship between Y and the predictor is also examined through the analysis of variance.
The next step is to fit a first multiple regression model with all of the previous predictors, again followed by the analysis of variance.
To build the final model we consider eliminating one variable at a time, for example the one whose test has the highest p-value. To choose between the complete and the reduced model I adopted the principle of agreement among three “stepwise” selection algorithms: backward, forward, and both.

#Variable assignment
Y<-movies$imdb_rating
X1<-movies$runtime
X2<-movies$critics_score
X3<-movies$audience_score
X4<-movies$imdb_num_votes

#scatter plot of Y vs X1
plot(X1, Y, main= 'scatterplot imdb_rating vs runtime', lwd=2, xlab='runtime', ylab='imdb_rating')

# Simple linear regression model
result<- lm(Y~X1)
plot(lm(Y~X1))

summary(lm(Y~X1))

## 
## Call:
## lm(formula = Y ~ X1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3099 -0.5791  0.0714  0.7572  2.2500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.907873   0.227163  21.605  < 2e-16 ***
## X1          0.014965   0.002111   7.088 3.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.046 on 648 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07195,    Adjusted R-squared:  0.07052 
## F-statistic: 50.24 on 1 and 648 DF,  p-value: 3.564e-12

anova (lm(Y~X1))

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X1          1  54.96  54.959   50.24 3.564e-12 ***
## Residuals 648 708.86   1.094                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#scatter plot of Y vs X2
plot(X2, Y, main= 'scatterplot imdb_rating vs critics_score', lwd=2, xlab='critics_score', ylab='imdb_rating')

# Simple linear regression model
result<- lm(Y~X2)
plot(lm(Y~X2))

summary(lm(Y~X2))

## 
## Call:
## lm(formula = Y ~ X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.93679 -0.39499  0.04512  0.43875  2.47556 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.8075715  0.0620690   77.45   <2e-16 ***
## X2          0.0292177  0.0009654   30.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6991 on 649 degrees of freedom
## Multiple R-squared:  0.5853, Adjusted R-squared:  0.5846 
## F-statistic: 915.9 on 1 and 649 DF,  p-value: < 2.2e-16

anova (lm(Y~X2))

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X2          1 447.64  447.64  915.91 < 2.2e-16 ***
## Residuals 649 317.19    0.49                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#scatter plot of Y vs X3
plot(X3, Y, main= 'scatterplot imdb_rating vs audience_score', lwd=2, xlab='audience_score', ylab='imdb_rating')

# Simple linear regression model
result<- lm(Y~X3)
plot(lm(Y~X3))

summary(lm(Y~X3))

## 
## Call:
## lm(formula = Y ~ X3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2082 -0.1866  0.0712  0.3093  1.1516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.599992   0.069291   51.95   <2e-16 ***
## X3          0.046392   0.001057   43.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.545 on 649 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7476 
## F-statistic:  1926 on 1 and 649 DF,  p-value: < 2.2e-16

anova (lm(Y~X3))

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X3          1 572.09  572.09  1926.3 < 2.2e-16 ***
## Residuals 649 192.75    0.30                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#scatter plot of Y vs X4
plot(X4, Y, main= 'scatterplot imdb_rating vs imdb_num_votes', lwd=2, xlab='imdb_num_votes', ylab='imdb_rating')

# Simple linear regression model
result<- lm(Y~X4)
plot(lm(Y~X4))

summary(lm(Y~X4))

## 
## Call:
## lm(formula = Y ~ X4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6433 -0.5205  0.0798  0.7004  2.1837 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.309e+00  4.513e-02 139.789   <2e-16 ***
## X4          3.204e-06  3.583e-07   8.941   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 649 degrees of freedom
## Multiple R-squared:  0.1097, Adjusted R-squared:  0.1083 
## F-statistic: 79.94 on 1 and 649 DF,  p-value: < 2.2e-16

anova (lm(Y~X4))

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X4          1  83.87  83.874  79.937 < 2.2e-16 ***
## Residuals 649 680.97   1.049                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments on the linear models

The previous plots and fits show that the response Y is positively associated with each of the predictors X1, X2, X3 and X4. The linear relationship is strong for X3 (audience_score, R-squared 0.748) and X2 (critics_score, R-squared 0.585), and weak for X4 (imdb_num_votes, R-squared 0.110) and X1 (runtime, R-squared 0.072). The analysis of variance gives p-values near zero for all four predictors.
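The strength of each one-predictor relationship can be read directly from the four ANOVA tables, since R-squared is the model sum of squares divided by the total sum of squares. The sketch below re-uses the printed sums of squares (rounded as shown in the output), so the results agree with the reported R-squared values only up to rounding.

```r
# Sums of squares copied from the four one-predictor ANOVA tables above.
ss <- data.frame(
  predictor = c("runtime", "critics_score", "audience_score", "imdb_num_votes"),
  ss_model  = c(54.96, 447.64, 572.09, 83.87),
  ss_resid  = c(708.86, 317.19, 192.75, 680.97)
)

# R-squared = explained variation / total variation.
ss$r_squared <- ss$ss_model / (ss$ss_model + ss$ss_resid)

# In simple regression the absolute correlation is the square root of R-squared.
ss$abs_corr <- sqrt(ss$r_squared)

# Strongest relationship first.
ss[order(-ss$r_squared), ]
```

Ordered this way, audience_score and critics_score clearly dominate, while runtime and imdb_num_votes explain only about 7% and 11% of the variation in imdb_rating on their own.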

Multiple regression model

#multiple regression model
result<-lm(Y~X1+X2+X3+X4)
result

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##   3.230e+00    4.644e-03    1.149e-02    3.327e-02    5.795e-07

summary(result)

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.43133 -0.19202  0.03291  0.26440  1.17105 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.230e+00  1.176e-01  27.470  < 2e-16 ***
## X1          4.644e-03  1.030e-03   4.509 7.74e-06 ***
## X2          1.149e-02  9.272e-04  12.388  < 2e-16 ***
## X3          3.327e-02  1.328e-03  25.063  < 2e-16 ***
## X4          5.795e-07  1.831e-07   3.165  0.00162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4755 on 645 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8091, Adjusted R-squared:  0.8079 
## F-statistic: 683.3 on 4 and 645 DF,  p-value: < 2.2e-16

The estimated multiple regression model is then:
Y = 3.230 + 0.004644 X1 + 0.01149 X2 + 0.03327 X3 + 5.795e-07 X4
Note that the t tests on all four coefficients have p-values below 0.01, so every predictor is statistically significant in this model. The largest p-values are those of B4 (0.00162) and B1 (7.74e-06), associated with the predictors X4 and X1. Two predictors should not be eliminated at the same time, because each t test is done on a single coefficient, assuming that the other predictors are included in the model. Predictors can therefore be deleted only one at a time, starting for example with the one whose p-value is highest.
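The t values and p-values in the coefficient table can be reproduced from the printed estimates and standard errors: each t value is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom (645). This is a sketch from rounded printed values, so the last digits may differ slightly from the summary output.

```r
# Estimates and standard errors copied from summary(result) above.
est <- c(X1 = 4.644e-03, X2 = 1.149e-02, X3 = 3.327e-02, X4 = 5.795e-07)
se  <- c(X1 = 1.030e-03, X2 = 9.272e-04, X3 = 1.328e-03, X4 = 1.831e-07)
df_resid <- 645

# t statistic for H0: coefficient = 0, given the other predictors in the model.
t_value <- est / se

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)
signif(p_value, 3)
```

All four p-values fall below 0.01; X4 has the largest one, which is why it is the first candidate when predictors are considered for removal one at a time.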

#confidence interval
confint(result)

##                    2.5 %       97.5 %
## (Intercept) 2.999055e+00 3.460828e+00
## X1          2.621306e-03 6.665835e-03
## X2          9.665977e-03 1.330746e-02
## X3          3.066717e-02 3.588103e-02
## X4          2.199952e-07 9.389851e-07
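The intervals returned by confint() are the estimates plus or minus a t quantile times the standard error. As a rough check, the sketch below rebuilds the intervals for X1 and X4 from the rounded values printed in the coefficient table, so small differences in the last digits are expected.

```r
# Estimates and standard errors copied from summary(result) above.
est <- c(X1 = 4.644e-03, X4 = 5.795e-07)
se  <- c(X1 = 1.030e-03, X4 = 1.831e-07)

# 97.5% quantile of the t distribution with 645 residual degrees of freedom.
t_crit <- qt(0.975, df = 645)

# 95% confidence interval: estimate +/- t_crit * standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

Neither interval contains zero, which is the interval counterpart of the significant t tests.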

anova(result)

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## X1          1  54.96   54.96  243.084 < 2.2e-16 ***
## X2          1 406.42  406.42 1797.607 < 2.2e-16 ***
## X3          1 154.34  154.34  682.663 < 2.2e-16 ***
## X4          1   2.27    2.27   10.019  0.001622 ** 
## Residuals 645 145.83    0.23                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
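The ANOVA table for the multiple regression lists sequential sums of squares: each row is the extra variation explained when that predictor is added after the ones above it. Adding the rows up recovers the R-squared and adjusted R-squared reported by summary(result). The sketch uses the rounded values printed in the table.

```r
# Sums of squares and degrees of freedom copied from anova(result) above.
ss <- c(X1 = 54.96, X2 = 406.42, X3 = 154.34, X4 = 2.27, Residuals = 145.83)
df <- c(X1 = 1, X2 = 1, X3 = 1, X4 = 1, Residuals = 645)

ss_total <- sum(ss)
df_total <- sum(df)

# R-squared: share of total variation not left in the residuals.
r2 <- 1 - ss[["Residuals"]] / ss_total

# Adjusted R-squared: same idea with mean squares, penalising extra predictors.
adj_r2 <- 1 - (ss[["Residuals"]] / df[["Residuals"]]) / (ss_total / df_total)

round(c(r_squared = r2, adj_r_squared = adj_r2), 4)
```

Because the sums of squares are sequential, the small value for X4 (2.27) measures only what the number of votes adds once runtime and both Rotten Tomatoes scores are already in the model.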

Comparison with stepwise selection algorithms.

Condition: “all three algorithms must agree on the variable to be deleted.”

#stepwise method: backward
step(result, direction='backward')

## Start:  AIC=-961.45
## Y ~ X1 + X2 + X3 + X4
## 
##        Df Sum of Sq    RSS     AIC
## <none>              145.83 -961.45
## - X4    1     2.265 148.09 -953.43
## - X1    1     4.597 150.43 -943.28
## - X2    1    34.698 180.53 -824.71
## - X3    1   142.026 287.86 -521.44

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##   3.230e+00    4.644e-03    1.149e-02    3.327e-02    5.795e-07
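The AIC column in the step() table can be reproduced by hand. For linear models step() uses n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients; the lowest value wins. The sketch below plugs in the rounded RSS values from the table, with n = 650 because one movie has a missing runtime.

```r
n <- 650  # 651 movies minus the one with a missing runtime

# Residual sums of squares copied from the backward step() table above.
rss <- c(full = 145.83, drop_X4 = 148.09, drop_X1 = 150.43)

# Number of estimated coefficients (intercept included) in each model.
k <- c(full = 5, drop_X4 = 4, drop_X1 = 4)

# AIC as computed by step() for linear models.
aic <- n * log(rss / n) + 2 * k
round(aic, 2)

# The model with the smallest AIC is the one step() keeps.
names(which.min(aic))
```

The full model has the lowest AIC, so the backward procedure stops without removing anything; dropping X4 is simply the least costly of the possible removals.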

#stepwise method: forward
step(result, direction='forward')

## Start:  AIC=-961.45
## Y ~ X1 + X2 + X3 + X4

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##   3.230e+00    4.644e-03    1.149e-02    3.327e-02    5.795e-07
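The forward run prints no table because step() with direction='forward' can only add terms named in a scope argument; when scope is missing, the starting model is used as the upper limit, so nothing can be added to the full model. A minimal sketch of forward selection that starts from the intercept-only model is given below. It runs on simulated data, and the names sim, x1, x2 and x3 are hypothetical and are not part of the movies analysis.

```r
set.seed(1)

# Simulated data (hypothetical): y depends on x1 and x2, x3 is pure noise.
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 1 * sim$x2 + rnorm(n)

# Start from the intercept-only model and let step() add terms up to the
# full set of candidates given in scope.
null_model <- lm(y ~ 1, data = sim)
fwd <- step(null_model,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "forward",
            trace = 0)

# Terms retained by the forward search.
selected <- attr(terms(fwd), "term.labels")
selected
```

Applied to this report, the same pattern would start from lm(Y ~ 1) with the upper scope ~ X1 + X2 + X3 + X4; that run is not part of the original output and is only suggested here.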

#stepwise method: both
step(result, direction='both')

## Start:  AIC=-961.45
## Y ~ X1 + X2 + X3 + X4
## 
##        Df Sum of Sq    RSS     AIC
## <none>              145.83 -961.45
## - X4    1     2.265 148.09 -953.43
## - X1    1     4.597 150.43 -943.28
## - X2    1    34.698 180.53 -824.71
## - X3    1   142.026 287.86 -521.44

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##   3.230e+00    4.644e-03    1.149e-02    3.327e-02    5.795e-07

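All three procedures return the full model: in the AIC tables, removing any predictor raises the AIC above that of the current model (-961.45). The removal that costs least is that of X4 (AIC -953.43), which is also the predictor with the highest p-value in the t tests and in the analysis of variance. X4 is therefore the only candidate for deletion, and the reduced model without it is examined next.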

#reduced model
result.rid<-lm(Y~X1+X2+X3)
summary(result.rid)

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.45175 -0.20677  0.02588  0.28036  1.20442 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.1080482  0.1118680  27.783  < 2e-16 ***
## X1          0.0056647  0.0009848   5.752 1.36e-08 ***
## X2          0.0114496  0.0009336  12.264  < 2e-16 ***
## X3          0.0340659  0.0013129  25.947  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4788 on 646 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8061, Adjusted R-squared:  0.8052 
## F-statistic: 895.3 on 3 and 646 DF,  p-value: < 2.2e-16

#anova table
result.aov.rid<- anova(result.rid)
result.aov.rid

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X1          1  54.96   54.96  239.74 < 2.2e-16 ***
## X2          1 406.42  406.42 1772.86 < 2.2e-16 ***
## X3          1 154.34  154.34  673.26 < 2.2e-16 ***
## Residuals 646 148.09    0.23                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#scatterplot of residuals
res<-result.rid$residuals
Ycapp<-result.rid$fitted.values
plot(Ycapp, res, main='Residuals vs estimated values', lwd=2,
     xlab = 'Y estimated values', ylab ='Residuals')
abline(h=0, lwd=2)

#Normal probability plot of residuals
qqnorm(res, datax=TRUE, main='NPP Residuals of model', lwd=2)
qqline(res, datax=TRUE)  # datax must match qqnorm so the line is drawn on the same axes

The residual plot shows a point cloud without any particular pattern, which supports the adequacy of the reduced model. The Normal Probability Plot shows no evidence of violation of the assumption of normality for the residuals.
The new model has the equation:
Y = 3.10805 + 0.0056647 X1 + 0.0114496 X2 + 0.0340659 X3
The t tests on the coefficients all have low p-values, so for each parameter we can reject the null hypothesis that it is equal to zero.

Part 5: Prediction

We wanted to predict the IMDB rating for 5 new movies that were not used to fit the model.
In this small random sample of five movies the constructed model predicts the real IMDB rating well in most cases.
The largest discrepancy is seen for film 4, where critics and audiences have discordant opinions (critics=35 and audience=75).

R = real IMDB rating, P = predicted IMDB rating
Moana (2016): R = 7.9, P = 7.88
Hidden Figures (2017): R = 7.9, P = 8.08
Split (2017): R = 7.5, P = 7.37
A Dog’s Purpose (2017): R = 4.4, P = 6.74
La La Land (2016): R = 8.5, P = 7.79

 new_movie <- data.frame(title_type="Feature Film",
                     genre="Animation",
                     X1=103,
                     X2=95,
                     X3=91)

prediction_Moana <- predict(result.rid, new_movie)
prediction_Moana

##        1 
## 7.879221

#confidence intervals for the mean value of Y
result.conf<- predict (result.rid, new_movie, interval='confidence')
result.conf

##        fit     lwr      upr
## 1 7.879221 7.81122 7.947222

#prediction interval for an individual new value of Y
result.pred<-predict(result.rid, new_movie, interval='prediction')
result.pred

##        fit      lwr      upr
## 1 7.879221 6.936574 8.821868
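The prediction interval is much wider than the confidence interval because it adds the residual variation of a single movie to the uncertainty of the fitted mean. The sketch below rebuilds the prediction interval's half-width from the printed confidence interval and the residual standard error of the reduced model (0.4788 on 646 degrees of freedom), using the standard formula t * sqrt(se_fit^2 + sigma^2). It works from rounded printed values only.

```r
# Intervals copied from the predict() output above (Moana).
conf_int <- c(lwr = 7.81122,  upr = 7.947222)   # interval = 'confidence'
pred_int <- c(lwr = 6.936574, upr = 8.821868)   # interval = 'prediction'

sigma  <- 0.4788              # residual standard error of result.rid
t_crit <- qt(0.975, df = 646) # residual degrees of freedom of result.rid

# Standard error of the fitted mean, recovered from the confidence interval.
se_fit <- (conf_int[["upr"]] - conf_int[["lwr"]]) / 2 / t_crit

# Half-width of the prediction interval: mean uncertainty plus residual noise.
pi_half <- t_crit * sqrt(se_fit^2 + sigma^2)

c(reconstructed = pi_half,
  reported      = (pred_int[["upr"]] - pred_int[["lwr"]]) / 2)
```

The residual standard error dominates the result, so even a very precisely estimated mean leaves an individual prediction uncertain by roughly plus or minus 0.94 rating points.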

new_movie2 <- data.frame(title_type="Feature Film",
                     genre="Drama",
                     X1=127,
                     X2=92,
                     X3=94)

prediction_HiddenFigures <- predict(result.rid, new_movie2)
prediction_HiddenFigures

##        1 
## 8.083022

new_movie3 <- data.frame(title_type="Feature Film",
                     genre="Horror",
                     X1=116,
                     X2=74,
                     X3=81)

prediction_Split <- predict(result.rid, new_movie3)
prediction_Split

##        1 
## 7.371761

new_movie4 <- data.frame(title_type="Feature Film",
                     genre="Comedy",
                     X1=120,
                     X2=35,
                     X3=75)

prediction_Adogspurpose <- predict(result.rid, new_movie4)
prediction_Adogspurpose

##        1 
## 6.743489

new_movie5 <- data.frame(title_type="Feature Film",
                     genre="Comedy",
                     X1=128,
                     X2=93,
                     X3=85)

prediction_lalaland <- predict(result.rid, new_movie5)
prediction_lalaland

##        1 
## 7.793544
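The five predictions can be reproduced in one step by applying the reduced-model equation to the inputs used above and comparing the results with the real ratings listed at the start of this part. The sketch hard-codes the printed coefficients and inputs instead of calling predict(), so it runs without the fitted model; the column names are chosen here for readability only.

```r
# Coefficients copied from summary(result.rid).
coefs <- c(intercept = 3.1080482, runtime = 0.0056647,
           critics = 0.0114496, audience = 0.0340659)

# Inputs (X1, X2, X3) and real IMDB ratings as given in this part.
new_movies <- data.frame(
  title    = c("Moana", "Hidden Figures", "Split", "A Dog's Purpose", "La La Land"),
  runtime  = c(103, 127, 116, 120, 128),
  critics  = c(95, 92, 74, 35, 93),
  audience = c(91, 94, 81, 75, 85),
  actual   = c(7.9, 7.9, 7.5, 4.4, 8.5)
)

# Reduced-model equation applied row by row.
new_movies$predicted <- coefs[["intercept"]] +
  coefs[["runtime"]]  * new_movies$runtime +
  coefs[["critics"]]  * new_movies$critics +
  coefs[["audience"]] * new_movies$audience

# Prediction error: positive means the model overestimates the rating.
new_movies$error <- new_movies$predicted - new_movies$actual
new_movies[, c("title", "actual", "predicted", "error")]
```

Three of the five errors are within about 0.2 rating points; the model overestimates film 4 by more than 2 points and underestimates La La Land by about 0.7.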

Part 6: Conclusion

Determining the popularity of a movie is not a simple task. The intrinsic characteristics of a movie seem to have some degree of correlation with its popularity. However, external attributes, like the critics score, also seem to be correlated with its popularity.
Our model has shown that, with only three predictors, we can predict the popularity of movies with a certain amount of accuracy, using imdb_rating as a measure of popularity. However, we have to remember that our predictors audience_score and critics_score are subjective measures that are easily prone to bias. A limitation of our model is that it does not include interesting categorical variables such as box office performance, but it can still predict the IMDB rating.