Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

## Warning: package 'GGally' was built under R version 4.4.3

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.4.3

library(broom)

## Warning: package 'broom' was built under R version 4.4.3

Load data

load("movies.Rdata")

Part 1: Data

The dataset used in this project consists of 651 randomly sampled movies produced and released before 2016. The data were collected from publicly available sources, primarily the Rotten Tomatoes and IMDb APIs. These APIs provide comprehensive information about each movie, including variables related to cast, genre, critic and audience scores, and other attributes.

Because the data were collected through a random sampling method, we can reasonably generalize our findings to the broader population of movies released prior to 2016. However, since this is an observational dataset, we should be cautious in making any causal claims based on our analysis.

Some variables, such as actor1 to actor5, are informative but not directly suitable for statistical modeling without significant transformation. As part of the data preparation process, we will make decisions about which variables are meaningful for modeling, and potentially restructure or exclude certain variables to best answer our research question.

Additionally, attention will be given to issues such as multicollinearity and missing data, as these can affect the validity of our model and interpretations.
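
One common screen for multicollinearity among numeric predictors is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below computes it in base R. The helper name `vif_numeric` and the toy data frame are illustrative assumptions, not part of the original analysis, and factor predictors such as genre would need a generalized VIF instead.

```r
# Hypothetical helper: VIF for each numeric column of a data frame of predictors.
vif_numeric <- function(predictors) {
  stopifnot(is.data.frame(predictors), ncol(predictors) >= 2)
  sapply(names(predictors), function(target) {
    others <- setdiff(names(predictors), target)
    # Regress the target predictor on all remaining predictors
    fit <- lm(reformulate(others, response = target), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Toy data (made up): x2 is built from x1, so both should show inflated VIFs
set.seed(1)
toy <- data.frame(x1 = rnorm(200))
toy$x2 <- toy$x1 + rnorm(200, sd = 0.3)
toy$x3 <- rnorm(200)
vif_values <- vif_numeric(toy)
print(round(vif_values, 2))
```

On the movies data, a call such as `vif_numeric(movies[, c("imdb_rating", "critics_score", "runtime")])` would be the analogous check once missing values are handled.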

Part 2: Research question

What factors significantly influence the audience scores of movies?

Part 3: Exploratory data analysis

str(movies)

## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

summary(movies)

##     title                  title_type                 genre        runtime     
##  Length:651         Documentary : 55   Drama             :305   Min.   : 39.0  
##  Class :character   Feature Film:591   Comedy            : 87   1st Qu.: 92.0  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65   Median :103.0  
##                                        Mystery & Suspense: 59   Mean   :105.8  
##                                        Documentary       : 52   3rd Qu.:115.8  
##                                        Horror            : 23   Max.   :267.0  
##                                        (Other)           : 60   NA's   :1      
##   mpaa_rating                               studio    thtr_rel_year 
##  G      : 19   Paramount Pictures              : 37   Min.   :1970  
##  NC-17  :  2   Warner Bros. Pictures           : 30   1st Qu.:1990  
##  PG     :118   Sony Pictures Home Entertainment: 27   Median :2000  
##  PG-13  :133   Universal Pictures              : 23   Mean   :1998  
##  R      :329   Warner Home Video               : 19   3rd Qu.:2007  
##  Unrated: 50   (Other)                         :507   Max.   :2014  
##                NA's                            :  8                 
##  thtr_rel_month   thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.00   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.00   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.74   Mean   :14.42   Mean   :2004   Mean   : 6.333  
##  3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.00   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                  NA's   :8      NA's   :8       
##   dvd_rel_day     imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 1.00   Min.   :1.900   Min.   :   180   Certified Fresh:135   
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546   Fresh          :209   
##  Median :15.00   Median :6.600   Median : 15116   Rotten         :307   
##  Mean   :15.01   Mean   :6.493   Mean   : 57533                         
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301                         
##  Max.   :31.00   Max.   :9.000   Max.   :893008                         
##  NA's   :8                                                              
##  critics_score    audience_rating audience_score  best_pic_nom best_pic_win
##  Min.   :  1.00   Spilled:275     Min.   :11.00   no :629      no :644     
##  1st Qu.: 33.00   Upright:376     1st Qu.:46.00   yes: 22      yes:  7     
##  Median : 61.00                   Median :65.00                            
##  Mean   : 57.69                   Mean   :62.36                            
##  3rd Qu.: 83.00                   3rd Qu.:80.00                            
##  Max.   :100.00                   Max.   :97.00                            
##                                                                            
##  best_actor_win best_actress_win best_dir_win top200_box   director        
##  no :558        no :579          no :608      no :636    Length:651        
##  yes: 93        yes: 72          yes: 43      yes: 15    Class :character  
##                                                          Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##     actor1             actor2             actor3             actor4         
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor5            imdb_url            rt_url         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##

# Count missing values per column
colSums(is.na(movies))

##            title       title_type            genre          runtime 
##                0                0                0                1 
##      mpaa_rating           studio    thtr_rel_year   thtr_rel_month 
##                0                8                0                0 
##     thtr_rel_day     dvd_rel_year    dvd_rel_month      dvd_rel_day 
##                0                8                8                8 
##      imdb_rating   imdb_num_votes   critics_rating    critics_score 
##                0                0                0                0 
##  audience_rating   audience_score     best_pic_nom     best_pic_win 
##                0                0                0                0 
##   best_actor_win best_actress_win     best_dir_win       top200_box 
##                0                0                0                0 
##         director           actor1           actor2           actor3 
##                2                2                7                9 
##           actor4           actor5         imdb_url           rt_url 
##               13               15                0                0

# Impute missing runtime with the median
movies$runtime[is.na(movies$runtime)] <- median(movies$runtime, na.rm = TRUE)

# Impute missing dvd_rel_year with the median
movies$dvd_rel_year[is.na(movies$dvd_rel_year)] <- median(movies$dvd_rel_year, na.rm = TRUE)

# Impute missing dvd_rel_month with the median
movies$dvd_rel_month[is.na(movies$dvd_rel_month)] <- median(movies$dvd_rel_month, na.rm = TRUE)

# Impute missing dvd_rel_day with the median of the observed days
movies$dvd_rel_day[is.na(movies$dvd_rel_day)] <- median(movies$dvd_rel_day, na.rm = TRUE)
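
Since every imputation step follows the same pattern, the steps could be wrapped in one small helper, which keeps the median call identical for each column. This is only a sketch: the function name `impute_median` and the toy values are assumptions for illustration.

```r
# Hypothetical helper: replace NAs in the named numeric columns
# with that column's median of the observed values.
impute_median <- function(data, columns) {
  for (col in columns) {
    missing <- is.na(data[[col]])
    data[[col]][missing] <- median(data[[col]], na.rm = TRUE)
  }
  data
}

# Toy example (made-up values)
toy <- data.frame(runtime = c(90, NA, 110, 100),
                  dvd_rel_day = c(1, 15, NA, 31))
toy_imputed <- impute_median(toy, c("runtime", "dvd_rel_day"))
print(toy_imputed)
```

For the report's data the equivalent call would be `impute_median(movies, c("runtime", "dvd_rel_year", "dvd_rel_month", "dvd_rel_day"))`.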

# Plot the distribution of the audience_score variable using a histogram
ggplot(movies, aes(x = audience_score)) +
  geom_histogram(binwidth = 10, fill = "palegreen3", color = "black") +  # histogram with palegreen fill and black border
  labs(title = "Distribution of Audience Score",
       x = "Audience Score",
       y = "Count") +
  theme_minimal()

Interpretation: The histogram shows the distribution of audience scores. Most films score relatively high, with the most common scores around 75 to 87.5. The distribution is slightly left-skewed: a tail of low-scoring films pulls the mean (62.36) below the median (65), so more than half of the films score above the mean. Overall, audience scores lean positive.
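
The skew can also be checked numerically rather than read off the plot. The sketch below compares mean and median and computes a moment-based skewness coefficient. The function name `sample_skewness` and the toy scores are assumptions for illustration; on the real data one would pass `movies$audience_score`.

```r
# Moment-based sample skewness; negative values indicate a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy left-skewed scores (made up): most values are high, a few are very low
scores <- c(20, 35, 60, 70, 75, 78, 80, 82, 85, 88)
skew_summary <- c(mean = mean(scores),
                  median = median(scores),
                  skewness = sample_skewness(scores))
print(skew_summary)
```

A mean below the median together with a negative skewness coefficient is the numeric signature of the left skew described above.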

# Calculate correlations between audience_score and numeric variables
cor(movies$audience_score, movies$imdb_rating, use = "complete.obs")

## [1] 0.8648652

cor(movies$audience_score, movies$critics_score, use = "complete.obs")

## [1] 0.7042762

ggpairs(movies[, c("audience_score", "imdb_rating", "critics_score", "runtime", "imdb_num_votes")],
        upper = list(continuous = wrap("cor", size = 4)),        # correlations in the upper panels
        lower = list(continuous = wrap("points", size=1, alpha=0.5)),  # scatter plots in the lower panels
        diag = list(continuous = wrap("densityDiag", alpha=0.5)),      # density plots on the diagonal
        title = "Pair Plot of Movie Scores and Attributes"
) + 
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Interpretation: Pair Plot of Movie Scores and Attributes

Distribution of Scores/Attributes (Diagonal Plots):
- audience_score: Most films have high audience scores (similar to previous plot).
- imdb_rating: IMDb ratings generally range between 6 and 8.
- critics_score: Critics scores are quite spread out but tend to be high for many films.
- runtime: Most films have standard durations (~90–120 minutes), with some very long outliers.
- imdb_num_votes: Most films receive relatively few votes, though some popular ones have many.
Relationships Between Scores/Attributes (Scatterplots and Correlations):
- Audience Score & IMDb Rating (correlation 0.865): Strongest positive relationship; films with high audience scores also tend to have high IMDb ratings.
- Audience Score & Critics Score (correlation 0.704): Also a strong positive relationship, though more variable than with IMDb rating.
- IMDb Rating & Critics Score (correlation 0.765): High IMDb ratings generally align with high critics scores.
- Runtime & Number of Votes (correlation 0.344): Longer films tend to receive somewhat more votes, but the relationship is only weak to moderate.
- Other Relationships: Correlations between scores/ratings and runtime or votes tend to be weaker but still statistically significant.
Summary:
Audience scores, IMDb ratings, and critics scores are strongly and positively correlated. Films that do well on one metric tend to do well on the others. Runtime and number of votes are only weakly associated with these scores.
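
The significance of an individual correlation can be examined with base R's `cor.test()`, which reports the estimate, a confidence interval, and a p-value. The sketch below uses made-up stand-in vectors (`rating`, `score`) rather than the movies data, so the numbers it prints are not results from this analysis.

```r
set.seed(42)
# Made-up stand-ins for two score variables with a built-in positive relationship
rating <- rnorm(100, mean = 6.5, sd = 1)
score <- 10 * rating + rnorm(100, sd = 8)

result <- cor.test(score, rating)   # Pearson correlation by default
print(result$estimate)   # sample correlation
print(result$conf.int)   # 95% confidence interval
print(result$p.value)    # test of H0: true correlation is zero
```

The analogous call on the report's data would be `cor.test(movies$audience_score, movies$imdb_rating)`.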

ggplot(movies, aes(x = imdb_rating, y = audience_score, color = mpaa_rating)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ genre) +
  theme_minimal() +
  labs(
    title = "Audience Score vs IMDb Rating by Genre and MPAA Rating",
    x = "IMDb Rating",
    y = "Audience Score",
    color = "MPAA Rating"
  )

Interpretation: Audience Score vs IMDb Rating by Genre

Overall Trend:
There’s a clear positive relationship — films with higher IMDb ratings tend to also receive higher audience scores across most genres.
Strong Correlation:
Genres like Action & Adventure, Documentary, Mystery & Suspense, and Sci-Fi & Fantasy show tightly clustered points, indicating strong alignment between audience and IMDb ratings.
More Variation:
Genres such as Comedy, Drama, and Horror display a similar trend but with more spread, suggesting varied audience reactions despite similar IMDb scores.
Notable Patterns:
Animation films often score high on both metrics, while Art House & International films show a positive trend with more variability.
Conclusion:
IMDb ratings and audience scores generally align across genres, though the strength of that relationship varies.

# Calculate correlation matrix
cor_mat <- cor(movies[, c("audience_score", "imdb_rating", "critics_score", "runtime", "imdb_num_votes")], use = "complete.obs")

# Convert matrix to long format
cor_mat_melt <- melt(cor_mat)

# Plot heatmap
ggplot(cor_mat_melt, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap")

Interpretation: Correlation Heatmap

The heatmap shows the strength of correlation between movie variables (audience_score, imdb_rating, critics_score, runtime, imdb_num_votes). Color intensity indicates correlation strength:
- Dark red: Strong positive correlation (close to +1) — as one variable increases, the other tends to increase too.
- Light red/orange: Moderate positive correlation.
- White: Very weak or no correlation (near zero).
- (Blue, if present): Negative correlation (one variable increases, the other decreases). This plot mainly shows positive correlations.
Key observations:
1. Strongest correlations (darkest red, off the diagonal):
  - Between audience_score and imdb_rating
  - Between imdb_rating and critics_score
  - Between audience_score and critics_score
    These indicate these three scores are highly positively related.
2. Moderate correlations (red/orange):
  - Between runtime and imdb_num_votes
  - Between imdb_num_votes and the three scores (audience_score, imdb_rating, critics_score)
3. Weaker correlations (lighter red):
  - Between runtime and the scores (audience_score, imdb_rating, critics_score), indicating that duration is only weakly associated with scores.
Summary:
The heatmap visually confirms strong positive relationships between audience score, IMDb rating, and critics score. Vote count also correlates positively with these scores, while runtime shows weaker associations.

# Create boxplots to compare audience_score based on categorical variable best_actor_win
ggplot(movies, aes(x = factor(best_actor_win), y = audience_score, fill = factor(best_actor_win))) +
  geom_boxplot() +
  scale_fill_manual(values = c("pink", "lightgreen")) +
  labs(title = "Audience Score by Best Actor Win",
       x = "Best Actor Win",
       y = "Audience Score") +
  theme_minimal() +
  theme(legend.position = "none")

# Create boxplots to compare audience_score based on categorical variable best_actress_win
ggplot(movies, aes(x = factor(best_actress_win), y = audience_score, fill = factor(best_actress_win))) +
  geom_boxplot() +
  scale_fill_manual(values = c("pink", "lightgreen")) +
  labs(title = "Audience Score by Best Actress Win",
       x = "Best Actress Win",
       y = "Audience Score") +
  theme_minimal() +
  theme(legend.position = "none")

# Create boxplots for audience_score across different genres
ggplot(movies, aes(x = genre, y = audience_score)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # rotate x-axis labels for readability
  labs(title = "Audience Score by Genre", x = "Genre", y = "Audience Score")

Interpretation: Audience Score by Genre

The plot shows the distribution of audience scores across different movie genres using box plots:
- The line inside the box represents the median audience score
- The box shows the interquartile range (IQR), where 50% of scores lie
- The vertical lines (whiskers) show the spread outside the IQR
- Black dots represent outliers (scores far from the rest)
Key observations:
1. Genres with the highest median audience scores:
  - Documentary (around 85–88), consistently high
  - Musical & Performing Arts (around 80), fairly consistent
  - Animation and Drama (around 68–70)
2. Genres with lower median audience scores:
  - Horror (around 40–43)
  - Science Fiction & Fantasy (around 45–48) with very wide variation
  - Comedy and Action & Adventure (around 50–52)
3. Score variation:
  - Science Fiction & Fantasy shows the largest range, from very low to very high scores
  - Documentary, Musical, and Animation have more consistent scores (shorter boxes)
  - Action, Comedy, and Other genres show moderate to wide variation
4. Outliers:
  - Some genres have movies with scores very different from the majority, e.g., low-scoring animations or high-scoring horror films
Summary:
Documentary and Musical genres tend to be consistently well-liked by audiences. Horror tends to have lower scores. Sci-Fi/Fantasy has very diverse audience reception.

Part 4: Modeling

We’ll fit a multiple linear regression model to predict the audience_score using several predictors.

# Fit a linear regression model
model <- lm(audience_score ~ imdb_rating + critics_score + runtime + genre + mpaa_rating, data = movies)
summary(model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + runtime + 
##     genre + mpaa_rating, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.016  -6.379   0.511   5.472  50.451 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -34.59073    4.21562  -8.205 1.29e-15 ***
## imdb_rating                     15.02730    0.58902  25.512  < 2e-16 ***
## critics_score                    0.06320    0.02178   2.902  0.00384 ** 
## runtime                         -0.04209    0.02229  -1.888  0.05948 .  
## genreAnimation                   8.49679    3.83345   2.216  0.02701 *  
## genreArt House & International  -0.22976    2.97398  -0.077  0.93844    
## genreComedy                      1.93987    1.63586   1.186  0.23613    
## genreDocumentary                 0.25499    2.25255   0.113  0.90991    
## genreDrama                       0.12498    1.41760   0.088  0.92977    
## genreHorror                     -5.38764    2.45102  -2.198  0.02830 *  
## genreMusical & Performing Arts   4.47095    3.16247   1.414  0.15793    
## genreMystery & Suspense         -5.86328    1.82390  -3.215  0.00137 ** 
## genreOther                       1.59775    2.77788   0.575  0.56538    
## genreScience Fiction & Fantasy  -0.36507    3.50848  -0.104  0.91716    
## mpaa_ratingNC-17                -3.75396    7.44514  -0.504  0.61429    
## mpaa_ratingPG                    1.12300    2.71223   0.414  0.67898    
## mpaa_ratingPG-13                -0.08796    2.79094  -0.032  0.97487    
## mpaa_ratingR                     0.18145    2.68803   0.068  0.94620    
## mpaa_ratingUnrated               0.99141    3.07018   0.323  0.74686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.822 on 632 degrees of freedom
## Multiple R-squared:  0.7706, Adjusted R-squared:  0.7641 
## F-statistic:   118 on 18 and 632 DF,  p-value: < 2.2e-16

# Plot residuals to check assumptions visually
par(mfrow = c(2, 2))  # 2x2 plot layout for diagnostic plots
plot(model)

# Predict Audience Score using the model (optional)
movies$predicted_audience_score <- predict(model, newdata = movies)

# Plot actual vs predicted Audience Score
ggplot(movies, aes(x = predicted_audience_score, y = audience_score)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Audience Score",
       x = "Predicted Audience Score",
       y = "Actual Audience Score") +
  theme_minimal()

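# Refit without mpaa_rating, none of whose levels was significant above.
# Note: this overwrites `model`, so the coefficient plot and the predictions below use the reduced model.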
model <- lm(audience_score ~ imdb_rating + critics_score + runtime + genre, data = movies)

# Extract model summary with CI and significance
model_coef <- tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%  # Remove intercept
  mutate(significant = ifelse(p.value < 0.05, "Significant", "Not Significant"))

# Plot with horizontal error bars and color by significance
ggplot(model_coef, aes(x = estimate, y = reorder(term, estimate), color = significant)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "black")) +
  labs(
    title = "Linear Regression Coefficients Predicting Audience Score (95% CI)",
    x = "Estimate",
    y = "Predictor",
    color = "Significance"
  ) +
  theme_minimal()

model2 <- lm(imdb_rating ~ audience_score + critics_score + runtime + genre, data = movies)
summary(model2)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + runtime + 
##     genre, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34404 -0.20172  0.03557  0.27061  1.17348 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.167317   0.125054  25.328  < 2e-16 ***
## audience_score                  0.033883   0.001320  25.672  < 2e-16 ***
## critics_score                   0.010407   0.000937  11.106  < 2e-16 ***
## runtime                         0.005298   0.001017   5.207  2.6e-07 ***
## genreAnimation                 -0.367862   0.166787  -2.206  0.02777 *  
## genreArt House & International  0.199889   0.137566   1.453  0.14670    
## genreComedy                    -0.140961   0.076620  -1.840  0.06627 .  
## genreDocumentary                0.266504   0.093978   2.836  0.00472 ** 
## genreDrama                      0.057439   0.065519   0.877  0.38099    
## genreHorror                     0.095299   0.114098   0.835  0.40390    
## genreMusical & Performing Arts  0.015918   0.149086   0.107  0.91501    
## genreMystery & Suspense         0.261303   0.084593   3.089  0.00210 ** 
## genreOther                     -0.059825   0.131085  -0.456  0.64827    
## genreScience Fiction & Fantasy -0.191440   0.165917  -1.154  0.24900    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4654 on 637 degrees of freedom
## Multiple R-squared:  0.8196, Adjusted R-squared:  0.8159 
## F-statistic: 222.6 on 13 and 637 DF,  p-value: < 2.2e-16

# Tidy the IMDb-rating model (model2, fitted above) for plotting
model2_coef <- tidy(model2, conf.int = TRUE)
model2_coef$significant <- ifelse(model2_coef$p.value < 0.05, "Significant", "Not Significant")

ggplot(model2_coef[-1, ], aes(x = estimate, y = term, color = significant)) +  # Remove intercept
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "black")) +
  labs(title = "Linear Regression Coefficients predicting IMDb Rating (95% CI)",
       x = "Estimate",
       y = "Predictor",
       color = "Significance") +
  theme_minimal()

Part 5: Prediction

# Prepare new data 
new_movies <- data.frame(
  title = c("Zootopia", "Deadpool", "La La Land", "Before the Flood", "The Witch"),
  genre = factor(c("Animation", 
                   "Action & Adventure", 
                   "Musical & Performing Arts", 
                   "Documentary", 
                   "Horror"), 
                 levels = levels(movies$genre)),
  runtime = c(108, 108, 128, 96, 92),
  mpaa_rating = factor(c("PG", "R", "PG-13", "PG", "R"), 
                       levels = levels(movies$mpaa_rating)),
  imdb_rating = c(8.0, 8.0, 8.0, 8.2, 6.9),
  critics_score = c(98, 85, 91, 75, 90),
  critics_source = c("Rotten Tomatoes", 
                     "Metacritic", 
                     "IMDb Metascore", 
                     "Geo National", 
                     "Rotten Tomatoes"),
  audience_score = c(92, 96, 81, 85, 55)  
)

print(new_movies)

##              title                     genre runtime mpaa_rating imdb_rating
## 1         Zootopia                 Animation     108          PG         8.0
## 2         Deadpool        Action & Adventure     108           R         8.0
## 3       La La Land Musical & Performing Arts     128       PG-13         8.0
## 4 Before the Flood               Documentary      96          PG         8.2
## 5        The Witch                    Horror      92           R         6.9
##   critics_score  critics_source audience_score
## 1            98 Rotten Tomatoes             92
## 2            85      Metacritic             96
## 3            91  IMDb Metascore             81
## 4            75    Geo National             85
## 5            90 Rotten Tomatoes             55

# Prepare data for prediction by selecting the predictor columns.
# The refitted model does not include mpaa_rating, so predict() simply ignores that column.
new_movies_predict <- new_movies[, c("genre", "runtime", "mpaa_rating", "imdb_rating", "critics_score")]
# We exclude columns like 'title' and 'critics_source' because they are not predictors in the model

# Use the model to predict audience scores for new movies
# Also calculate 95% prediction intervals to quantify uncertainty around predictions
predictions <- predict(model, newdata = new_movies_predict, interval = "prediction", level = 0.95)

# Add predicted scores and intervals back to the data frame
new_movies$predicted_audience_score <- predictions[, "fit"]  # predicted values
new_movies$pred_lower <- predictions[, "lwr"]               # lower bound of 95% prediction interval
new_movies$pred_upper <- predictions[, "upr"]               # upper bound of 95% prediction interval

# Show movie title, actual score, predicted score, and prediction interval
print(new_movies[, c("title", 
                     "audience_score",    
                     "predicted_audience_score", 
                     "pred_lower", 
                     "pred_upper")])

##              title audience_score predicted_audience_score pred_lower
## 1         Zootopia             92                 96.18022   75.82879
## 2         Deadpool             96                 86.89502   67.44380
## 3       La La Land             81                 90.99426   70.96088
## 4 Before the Flood             85                 90.32078   70.87461
## 5        The Witch             55                 65.92491   46.22797
##   pred_upper
## 1  116.53166
## 2  106.34625
## 3  111.02764
## 4  109.76696
## 5   85.62185

Based on the model predictions, Zootopia is expected to have an audience score of about 96.18, with a 95% prediction interval from 75.83 to 116.53. Deadpool is predicted to score around 86.90, La La Land around 90.99, and Before the Flood around 90.32, all with intervals of similar width (roughly ±20 points). Among the five movies, The Witch has the lowest predicted score at 65.92, with an interval from 46.23 to 85.62. All five actual audience scores (92, 96, 81, 85, and 55) fall inside their intervals. Four of the five upper bounds exceed 100, the maximum possible score, because the linear model does not know that the response is bounded; in practice those bounds should be read as 100. The wide intervals show that each point prediction carries considerable uncertainty.
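
Because audience scores are bounded between 0 and 100, interval limits such as 116.53 are not attainable. One simple remedy, sketched below, is to clip the output of `predict()` to the valid range. The helper name `clamp_interval` is an assumption for illustration; the two rows are copied from the prediction output above.

```r
# Hypothetical helper: clip fit/lwr/upr values from
# predict(..., interval = "prediction") to the range the response can take.
clamp_interval <- function(pred, lower = 0, upper = 100) {
  pmin(pmax(pred, lower), upper)
}

# Values copied from the prediction output for Zootopia and The Witch
pred <- rbind(
  Zootopia    = c(fit = 96.18022, lwr = 75.82879, upr = 116.53166),
  `The Witch` = c(fit = 65.92491, lwr = 46.22797, upr = 85.62185)
)
clamped <- clamp_interval(pred)
print(clamped)
```

Clipping only tidies the reported limits; it does not fix the underlying mismatch between a bounded response and a linear model with normal errors.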

Part 6: Conclusion

Based on the linear regression analysis, IMDb rating is the strongest predictor of audience score: each additional IMDb point is associated with about 15 more audience-score points, holding the other predictors fixed. Critics score is also significant, but its coefficient is small (about 0.06 points per critics-score point). Runtime shows a small negative association that is only marginally significant (p ≈ 0.06). Relative to the baseline genre, Action & Adventure, Animation is associated with significantly higher audience scores (about +8.5 points), while Horror and Mystery & Suspense are associated with lower scores (about -5.4 and -5.9 points). No MPAA rating level is significant after accounting for the other variables.
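
Whether a multi-level factor such as MPAA rating matters as a whole is better judged by a partial F-test on nested models than by the p-values of its individual dummy variables. The sketch below shows the mechanics with `anova()` on made-up data (the variables `x`, `g`, and `y` are assumptions), so its output is not a result from the movies analysis.

```r
set.seed(7)
n <- 300
# Made-up data: y depends on x but not on the three-level factor g
toy <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- 2 + 3 * toy$x + rnorm(n)

reduced <- lm(y ~ x, data = toy)
full    <- lm(y ~ x + g, data = toy)

# Partial F-test: does adding g reduce the residual sum of squares
# by more than chance alone would explain?
comparison <- anova(reduced, full)
print(comparison)
```

In this report the same comparison would apply `anova()` to the two models fitted in Part 4, the one without `mpaa_rating` and the one with it.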

The model performs fairly well, with an adjusted R² of 0.764, meaning the predictors explain about 76% of the variation in audience scores. However, the diagnostic plots suggest heteroskedasticity and some influential outliers (the largest residual is about 50 points). Heteroskedasticity makes the reported standard errors and p-values unreliable, and influential points can shift the coefficient estimates themselves. As a next step, transforming the response or using robust regression methods could improve the model's validity.
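
The visual impression of heteroskedasticity can be backed by a formal test. The sketch below implements Koenker's studentized Breusch-Pagan test in base R: regress the squared residuals on the model's predictors and compare n·R² with a chi-squared distribution. The helper name `bp_test` and the simulated data are assumptions for illustration; the lmtest package provides a `bptest()` function for the same purpose.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test.
# Under homoskedasticity, n * R^2 from the auxiliary regression is
# approximately chi-squared with (number of predictors) degrees of freedom.
bp_test <- function(fit) {
  design <- model.matrix(fit)          # includes the intercept column
  sq_resid <- residuals(fit)^2
  aux <- lm.fit(design, sq_resid)      # auxiliary regression of e^2 on X
  r_squared <- 1 - sum(aux$residuals^2) /
    sum((sq_resid - mean(sq_resid))^2)
  df <- ncol(design) - 1               # exclude the intercept
  statistic <- length(sq_resid) * r_squared
  c(statistic = statistic, df = df,
    p.value = pchisq(statistic, df, lower.tail = FALSE))
}

# Made-up data whose error spread grows with x, so the test should reject
set.seed(3)
x <- runif(400, 1, 10)
y <- 1 + 2 * x + rnorm(400, sd = x)
result <- bp_test(lm(y ~ x))
print(result)
```

A small p-value indicates that the residual variance changes with the predictors, in which case heteroskedasticity-robust standard errors or a transformed response would be reasonable follow-ups.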

Overall, this model offers valuable insights into the factors that shape audience ratings, especially regarding IMDb scores and film genres.